<a href="https://colab.research.google.com/github/nabroo101/deep-learning-challenge/blob/main/Charity_deep_learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Preprocessing

---

# **Charity Deep Learning Project**

In this project, we'll be exploring deep learning techniques to understand and make predictions related to charity data.

## **Step 1: Importing Dependencies**

Before we dive into the actual data analysis, let's import all the necessary libraries and modules:

```python
# Data manipulation and splitting
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Deep learning library
import tensorflow as tf
```

## **Step 2: Loading the Data**

We'll be using a dataset named `charity_data.csv` which contains relevant information about different charities. Let's load this data and take a quick look at the first few rows:

```python
# Import the dataset
application_df = pd.read_csv("https://static.bc-edx.com/data/dl-1-2/m21/lms/starter/charity_data.csv")

# Display the first few rows of the dataset
application_df.head()
```

---




# Alphabet Soup's Dataset Information

From Alphabet Soup’s business team, we have obtained a comprehensive CSV dataset. This dataset encompasses data from over 34,000 organizations that have been beneficiaries of funding from Alphabet Soup throughout the years.

## Columns Description:

The dataset provides a range of columns, capturing metadata about each organization:

- **EIN and NAME**: Identification columns.
  
- **APPLICATION_TYPE**: Describes the Alphabet Soup application type.

- **AFFILIATION**: Represents the affiliated sector of the industry.

- **CLASSIFICATION**: Refers to the government organization classification.

- **USE_CASE**: Specifies the use case for which the funding is granted.

- **ORGANIZATION**: Type of the organization.

- **STATUS**: Indicates whether the organization is currently active or not.

- **INCOME_AMT**: Categorizes organizations based on their income.

- **SPECIAL_CONSIDERATIONS**: Notes if there are any special considerations for the application.

- **ASK_AMT**: The amount of funding the organization requested.

- **IS_SUCCESSFUL**: Indicates whether the funding was used effectively. This is the target variable the model predicts.




In [1]:
# Import our dependencies
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import pandas as pd
import tensorflow as tf
import seaborn as sns

# Import and read charity_data.csv
application_df = pd.read_csv("https://static.bc-edx.com/data/dl-1-2/m21/lms/starter/charity_data.csv")

# Print column types and non-null counts, then preview the first rows
print(application_df.info())
application_df.head()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34299 entries, 0 to 34298
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   EIN                     34299 non-null  int64 
 1   NAME                    34299 non-null  object
 2   APPLICATION_TYPE        34299 non-null  object
 3   AFFILIATION             34299 non-null  object
 4   CLASSIFICATION          34299 non-null  object
 5   USE_CASE                34299 non-null  object
 6   ORGANIZATION            34299 non-null  object
 7   STATUS                  34299 non-null  int64 
 8   INCOME_AMT              34299 non-null  object
 9   SPECIAL_CONSIDERATIONS  34299 non-null  object
 10  ASK_AMT                 34299 non-null  int64 
 11  IS_SUCCESSFUL           34299 non-null  int64 
dtypes: int64(4), object(8)
memory usage: 3.1+ MB
None


| | EIN | NAME | APPLICATION_TYPE | AFFILIATION | CLASSIFICATION | USE_CASE | ORGANIZATION | STATUS | INCOME_AMT | SPECIAL_CONSIDERATIONS | ASK_AMT | IS_SUCCESSFUL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10520599 | BLUE KNIGHTS MOTORCYCLE CLUB | T10 | Independent | C1000 | ProductDev | Association | 1 | 0 | N | 5000 | 1 |
| 1 | 10531628 | AMERICAN CHESAPEAKE CLUB CHARITABLE TR | T3 | Independent | C2000 | Preservation | Co-operative | 1 | 1-9999 | N | 108590 | 1 |
| 2 | 10547893 | ST CLOUD PROFESSIONAL FIREFIGHTERS | T5 | CompanySponsored | C3000 | ProductDev | Association | 1 | 0 | N | 5000 | 0 |
| 3 | 10553066 | SOUTHSIDE ATHLETIC ASSOCIATION | T3 | CompanySponsored | C2000 | Preservation | Trust | 1 | 10000-24999 | N | 6692 | 1 |
| 4 | 10556103 | GENETIC RESEARCH INSTITUTE OF THE DESERT | T3 | Independent | C1000 | Heathcare | Trust | 1 | 100000-499999 | N | 142590 | 1 |




## Data Cleaning

### Dropping Non-Beneficial Columns

For our analysis, the `EIN` column is not beneficial: it only identifies the organization and carries no information the model can learn from, so we drop it. The `NAME` column is kept for now. Names that appear many times in the dataset are retained as categories, and the remaining names are binned into "Other" later in the notebook.

```python
application_df.drop(columns=["EIN"], inplace=True)
```

**Note**: Pass `inplace=True` to modify `application_df` directly. Without it, `drop` leaves `application_df` unchanged and returns a new DataFrame with the column removed.
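
The difference between the two calling styles can be sketched on a small made-up DataFrame. The rows below are illustrative and are not taken from `charity_data.csv`; only the column names mirror the dataset.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the dataset.
toy_df = pd.DataFrame({
    "EIN": [1, 2, 3],
    "NAME": ["A", "B", "C"],
    "ASK_AMT": [5000, 6000, 7000],
})

# Without inplace=True, drop() returns a new DataFrame and leaves toy_df unchanged.
dropped_copy = toy_df.drop(columns=["EIN"])
print(list(toy_df.columns))        # the original still contains EIN
print(list(dropped_copy.columns))  # the returned copy does not

# With inplace=True, toy_df itself is modified and drop() returns None.
result = toy_df.drop(columns=["EIN"], inplace=True)
print(result, list(toy_df.columns))
```

Reassigning the result (`toy_df = toy_df.drop(columns=["EIN"])`) has the same effect as `inplace=True`, and some readers may find it easier to follow.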



In [2]:
dff = application_df["CLASSIFICATION"].value_counts()
dff

C1000    17326
C2000     6074
C1200     4837
C3000     1918
C2100     1883
         ...  
C4120        1
C8210        1
C2561        1
C4500        1
C2150        1
Name: CLASSIFICATION, Length: 71, dtype: int64

In [3]:
# Drop the non-beneficial ID column 'EIN' ('NAME' is kept and binned later).
application_df.drop(columns=["EIN"], inplace=True)

# Check the data
application_df.head()

# Check the target: look for a large imbalance between the two classes
application_df["IS_SUCCESSFUL"].value_counts()

1    18261
0    16038
Name: IS_SUCCESSFUL, dtype: int64


## Unique Value Analysis

To understand the diversity of our dataset and to ascertain potential candidates for one-hot encoding (among other preprocessing steps), we first determine the number of unique values in each column:

```python
application_df.nunique()
```

By examining the unique count, we can get insights into the categorical nature of our data and decide on the next preprocessing steps.
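
One way to turn the unique counts into a shortlist of columns to bin is sketched below. The rows and the `max_categories` threshold are assumptions made up for this example, not values from the notebook.

```python
import pandas as pd

# Hypothetical rows; column names follow the dataset, values are made up.
toy_df = pd.DataFrame({
    "APPLICATION_TYPE": ["T3", "T4", "T3", "T5", "T6", "T3"],
    "ORGANIZATION": ["Trust", "Trust", "Association", "Trust", "Trust", "Association"],
    "ASK_AMT": [5000, 5000, 6692, 5000, 108590, 5000],
})

# Count unique values in the non-numeric (categorical) columns only.
unique_counts = toy_df.select_dtypes(exclude="number").nunique()

# Flag columns whose cardinality exceeds an assumed threshold as binning candidates.
max_categories = 3  # arbitrary threshold chosen for this toy example
binning_candidates = unique_counts[unique_counts > max_categories].index.tolist()

print(unique_counts)
print(binning_candidates)
```

Numeric columns such as `ASK_AMT` are excluded here because a high unique count is expected for a continuous amount and does not signal a need for binning.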


In [4]:
application_df.dtypes

NAME                      object
APPLICATION_TYPE          object
AFFILIATION               object
CLASSIFICATION            object
USE_CASE                  object
ORGANIZATION              object
STATUS                     int64
INCOME_AMT                object
SPECIAL_CONSIDERATIONS    object
ASK_AMT                    int64
IS_SUCCESSFUL              int64
dtype: object

In [5]:
# Determine the number of unique values in each column.
application_df_count = application_df.nunique()
application_df_count

NAME                      19568
APPLICATION_TYPE             17
AFFILIATION                   6
CLASSIFICATION               71
USE_CASE                      5
ORGANIZATION                  4
STATUS                        2
INCOME_AMT                    9
SPECIAL_CONSIDERATIONS        2
ASK_AMT                    8747
IS_SUCCESSFUL                 2
dtype: int64

In [6]:
# Look at APPLICATION_TYPE value counts for binning
application_type_count =  application_df["APPLICATION_TYPE"].value_counts()
application_type_count

T3     27037
T4      1542
T6      1216
T5      1173
T19     1065
T8       737
T7       725
T10      528
T9       156
T13       66
T12       27
T2        16
T25        3
T14        3
T29        2
T15        2
T17        1
Name: APPLICATION_TYPE, dtype: int64



1. **Choosing a Cutoff**:
    ```python
    application_types_to_replace = application_type_count[8:].index
    ```
   Here, the code selects a cutoff by position. The `[8:]` slice skips the first eight entries and takes every application type from position 8 onward. Because `value_counts()` sorts in descending order by count, the first eight entries are the most frequent types, so the code is effectively saying, "Keep the eight most frequent application types and replace all the others."

2. **Replacement**:
    ```python
    for app in application_types_to_replace:
        application_df['APPLICATION_TYPE'] = application_df['APPLICATION_TYPE'].replace(app,"Other")
    ```
    This loop replaces all the application types in the `application_types_to_replace` list with the label "Other" in the DataFrame `application_df`.

3. **Checking**:
    ```python
    application_df['APPLICATION_TYPE'].value_counts()
    ```
    This line checks and displays the updated counts for each `APPLICATION_TYPE`, verifying that the replacement was successful.



---

### Explanation

---

## **Minimizing Application Types**

In our dataset, there are multiple `APPLICATION_TYPE`s, some of which occur very infrequently. Having many infrequent categories can sometimes introduce noise into our machine learning models, making them less effective. By reducing the number of infrequent categories, we aim to create a more generalizable model.

### **Procedure**:

1. **Determine Cutoff**:
   
   We choose a cutoff to determine which application types are considered "infrequent." For our current approach, we keep the eight most frequent application types and select every type that comes after them in the sorted counts. The selected types will be replaced with a more general label: "Other."

    ```python
    application_types_to_replace = application_type_count[8:].index
    ```

2. **Replacement**:
   
   Next, we iterate over the identified application types and replace them in our dataset.

    ```python
    for app in application_types_to_replace:
        application_df['APPLICATION_TYPE'] = application_df['APPLICATION_TYPE'].replace(app,"Other")
    ```

3. **Validation**:
   
   Finally, to ensure our changes have taken effect, we check the updated value counts for `APPLICATION_TYPE`.

    ```python
    application_df['APPLICATION_TYPE'].value_counts()
    ```

### **Rationale**:

Minimizing the number of categories in our dataset can:

- **Enhance Model Generalization**: By reducing sparse categories, we reduce the chance of overfitting to specific categories that don't have enough representation.
  
- **Streamline Encoding**: Later, when we encode these categories for modeling, fewer categories can lead to fewer columns, making our dataset more manageable and the model faster.

- **Improve Interpretability**: Models become more interpretable with fewer categories as the focus shifts to more significant, broad categories rather than nuanced, infrequent ones.
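
The cutoff can be expressed either by position (the `[8:]` slice) or by a minimum count. The sketch below rebuilds the `APPLICATION_TYPE` counts from the `value_counts()` output shown earlier and checks that, for this data, a positional cutoff of 8 and a count threshold of 500 select the same types. The threshold of 500 is one possible choice, not the only valid one.

```python
import pandas as pd

# Counts copied from the APPLICATION_TYPE value_counts() output shown earlier.
application_type_count = pd.Series({
    "T3": 27037, "T4": 1542, "T6": 1216, "T5": 1173, "T19": 1065,
    "T8": 737, "T7": 725, "T10": 528, "T9": 156, "T13": 66, "T12": 27,
    "T2": 16, "T25": 3, "T14": 3, "T29": 2, "T15": 2, "T17": 1,
})

# Positional cutoff: keep the first 8 entries, replace the rest.
by_position = set(application_type_count.iloc[8:].index)

# Count cutoff: replace every type that appears fewer than 500 times.
by_threshold = set(application_type_count[application_type_count < 500].index)

print(by_position == by_threshold)
# Number of rows that end up under the "Other" label.
print(int(application_type_count[list(by_threshold)].sum()))
```

A count threshold does not depend on the series being sorted, so it tends to be the safer of the two when the counts come from somewhere other than `value_counts()`.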



In [7]:
# Choose a cutoff value and create a list of application types to be replaced
# use the variable name `application_types_to_replace`
application_types_to_replace = list(application_type_count[application_type_count < 500].index)

# Replace in dataframe
for app in application_types_to_replace:
    application_df['APPLICATION_TYPE'] = application_df['APPLICATION_TYPE'].replace(app,"Other")

# Check to make sure binning was successful
application_df['APPLICATION_TYPE'].value_counts()


T3       27037
T4        1542
T6        1216
T5        1173
T19       1065
T8         737
T7         725
T10        528
Other      276
Name: APPLICATION_TYPE, dtype: int64

In [8]:
# Look at CLASSIFICATION value counts for binning
classification_count = application_df["CLASSIFICATION"].value_counts()


In [9]:
# You may find it helpful to look at CLASSIFICATION value counts >1



## **Minimizing Classifications**

To reduce complexity in the dataset and possibly improve the model's performance, classifications that occur infrequently are grouped together under a single "Other" category.

### **Procedure**:

1. **Determine Classifications to Replace**:
    - Choose classifications that occur fewer than 100 times.
    ```python
    classification_to_replace = classification_count[classification_count.values < 100].index
    ```

2. **Replace in DataFrame**:
    - Replace the identified classifications in the `CLASSIFICATION` column with "Other."
    ```python
    for cls in classification_to_replace:
        application_df['CLASSIFICATION'] = application_df['CLASSIFICATION'].replace(cls,"Other")
    ```

3. **Validate the Replacement**:
    - Verify the changes by checking the updated value counts.
    ```python
    application_df['CLASSIFICATION'].value_counts()
    ```

### **Benefits**:

- **Model Simplicity**: By reducing sparse categories, the model might be less prone to overfitting.
- **Interpretability**: Grouping infrequent classifications into a broader category simplifies interpretation.
- **Computational Efficiency**: Fewer unique values might lead to a more efficient encoding process later in the data preprocessing pipeline.
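
The effect on encoding can be seen on a toy column. The classification codes and the cutoff of 2 below are made up for illustration. The sketch also uses a vectorized `where`/`isin` replacement as an alternative to the loop, which is a substitution on our part rather than the notebook's method.

```python
import pandas as pd

# Hypothetical classification codes: two common values and three that appear once.
toy = pd.Series(
    ["C1000"] * 5 + ["C2000"] * 4 + ["C4120", "C8210", "C2561"],
    name="CLASSIFICATION",
)

counts = toy.value_counts()
rare = counts[counts < 2].index  # cutoff of 2 is arbitrary for this toy data

# Vectorized replacement: keep values that are not rare, otherwise use "Other".
binned = toy.where(~toy.isin(rare), "Other")

# Compare how many dummy columns each version would produce.
columns_before = pd.get_dummies(toy).shape[1]
columns_after = pd.get_dummies(binned).shape[1]
print(columns_before, columns_after)
```

Each rare code that is folded into "Other" removes one column from the encoded feature matrix.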


In [10]:
# Choose a cutoff value and create a list of classifications to be replaced
classification_to_replace = classification_count[classification_count < 100].index

# Replace in dataframe
for cls in classification_to_replace:
    application_df['CLASSIFICATION'] = application_df['CLASSIFICATION'].replace(cls,"Other")

# Check to make sure binning was successful
application_df['CLASSIFICATION'].value_counts()

C1000    17326
C2000     6074
C1200     4837
C3000     1918
C2100     1883
C7000      777
Other      669
C1700      287
C4000      194
C5000      116
C1270      114
C2700      104
Name: CLASSIFICATION, dtype: int64

In [11]:
names_count = application_df["NAME"].value_counts()
names_count


PARENT BOOSTER USA INC                                                  1260
TOPS CLUB INC                                                            765
UNITED STATES BOWLING CONGRESS INC                                       700
WASHINGTON STATE UNIVERSITY                                              492
AMATEUR ATHLETIC UNION OF THE UNITED STATES INC                          408
                                                                        ... 
ST LOUIS SLAM WOMENS FOOTBALL                                              1
AIESEC ALUMNI IBEROAMERICA CORP                                            1
WEALLBLEEDRED ORG INC                                                      1
AMERICAN SOCIETY FOR STANDARDS IN MEDIUMSHIP & PSYCHICAL INVESTIGATI       1
WATERHOUSE CHARITABLE TR                                                   1
Name: NAME, Length: 19568, dtype: int64

In [12]:
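# Bin organization names that appear fewer than 250 times into "Other"
names_count = application_df["NAME"].value_counts()
names_to_replace = names_count[names_count < 250].index

# Replace all rare names in one vectorized pass instead of looping over every name
application_df["NAME"] = application_df["NAME"].where(~application_df["NAME"].isin(names_to_replace), "Other")

application_df["NAME"].value_counts()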


Other                                                28539
PARENT BOOSTER USA INC                                1260
TOPS CLUB INC                                          765
UNITED STATES BOWLING CONGRESS INC                     700
WASHINGTON STATE UNIVERSITY                            492
AMATEUR ATHLETIC UNION OF THE UNITED STATES INC        408
PTA TEXAS CONGRESS                                     368
SOROPTIMIST INTERNATIONAL OF THE AMERICAS INC          331
ALPHA PHI SIGMA                                        313
TOASTMASTERS INTERNATIONAL                             293
MOST WORSHIPFUL STRINGER FREE AND ACCEPTED MASONS      287
LITTLE LEAGUE BASEBALL INC                             277
INTERNATIONAL ASSOCIATION OF LIONS CLUBS               266
Name: NAME, dtype: int64

# **get_dummies**
### Code:

```python
# Convert categorical data to numeric with `pd.get_dummies`
application_df = pd.get_dummies(application_df)
```

### Explanation:

- **What It Does**: The `pd.get_dummies()` function converts categorical variables into a "dummy" or "indicator" matrix. For each unique value in the categorical column, it creates a new binary column, where the value is 1 if the original column's value matches that unique value and 0 otherwise.
- **Why It's Used**: Many machine learning algorithms require numerical input, so categorical variables must be transformed into a numerical format. By creating these "dummy" binary columns, the information in the categorical variables is preserved in a way that can be utilized by algorithms.

### Example:

Suppose you have a column named "color" with values "red," "blue," and "green." After applying `pd.get_dummies()`, you'll have three new columns: "color_red," "color_blue," and "color_green." Each row will have a 1 in the column corresponding to its color and 0s in the other two.
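
The color example can be run directly. The sketch below passes `dtype=int` so the output is 0/1. This is a choice made for the example, since recent pandas versions return `True`/`False` by default.

```python
import pandas as pd

# Toy data matching the "color" example in the text.
colors = pd.DataFrame({"color": ["red", "blue", "green", "red"]})

# dtype=int forces 0/1 output; recent pandas versions default to True/False.
dummies = pd.get_dummies(colors, dtype=int)
print(dummies)
```

The new columns are named `<original column>_<value>` and are ordered alphabetically by value, and every row has exactly one indicator set.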


## **Converting Categorical Data to Numeric Form**

### **Objective**:

Transform categorical variables into numerical format, allowing them to be processed by machine learning algorithms.

### **Procedure**:

1. **Use the `pd.get_dummies` Method**:
    - Apply the `pd.get_dummies()` function to the entire DataFrame to convert all categorical columns into dummy variables.
    ```python
    application_df = pd.get_dummies(application_df)
    ```

### **Result**:

- The categorical variables in the dataset have been replaced with binary dummy variables.
- Each unique value in the original categorical columns has been converted into a separate column, with binary values indicating the presence or absence of that category in each observation.

### **Benefits**:

- **Compatibility with Algorithms**: This transformation ensures that the data can be used with algorithms that require numerical input.
- **Preservation of Information**: The categorical information is preserved in a numerical format that retains the distinctions between different categories.


In [13]:
# Convert categorical data to numeric with `pd.get_dummies`
application_df = pd.get_dummies(application_df)

In [14]:
# Split our preprocessed data into our features and target arrays
y = application_df["IS_SUCCESSFUL"].values
X = application_df.drop(columns="IS_SUCCESSFUL").values

# Split the preprocessed data into a training and testing dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=78)

In [15]:
# Create a StandardScaler instance
scaler = StandardScaler()

# Fit the StandardScaler
X_scaler = scaler.fit(X_train)

# Scale the data
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)

X_train_scaled.shape[1]


62

## Compile, Train and Evaluate the Model


## Defining the Deep Neural Network Model

We're defining a deep neural network with three layers: two hidden layers and an output layer. The structure of the network is as follows:

### Input Dimension

The number of input features is defined as the length of the first record in our training set (`X_train[0]`), and it is stored in the variable `number_input_features`.

### Hidden Layers

- **First Hidden Layer:** This layer contains `5 * number_input_features` neurons (310 for the 62 input features) and uses the ReLU (Rectified Linear Unit) activation function.
- **Second Hidden Layer:** This layer contains `1 * number_input_features` neurons and also uses the ReLU activation function.

### Output Layer

The output layer consists of a single neuron, as this is a binary classification task. It uses the sigmoid activation function to output a probability that the input belongs to the positive class.

### Model Summary

After defining the architecture, we call the `nn.summary()` method to display a summary of the model's structure.


This architecture is flexible and can be adjusted by changing the number of hidden layers, the number of neurons in each layer, or the activation functions used.

In [21]:
# Define the model - deep neural net
number_input_features = len(X_train[0])
hidden_nodes_layer1 =  5 * number_input_features
hidden_nodes_layer2 = 1 * number_input_features

nn = tf.keras.models.Sequential()

# First hidden layer
nn.add(
    tf.keras.layers.Dense(
        units=hidden_nodes_layer1,
        input_dim=number_input_features,
        activation="relu")
)

# Second hidden layer
nn.add(
    tf.keras.layers.Dense(
        units=hidden_nodes_layer2,
        activation="relu"))

# Output layer
nn.add(
    tf.keras.layers.Dense(
        units=1,
        activation="sigmoid"))

# Check the structure of the model
nn.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_3 (Dense)             (None, 310)               19530     
                                                                 
 dense_4 (Dense)             (None, 62)                19282     
                                                                 
 dense_5 (Dense)             (None, 1)                 63        
                                                                 
Total params: 38,875
Trainable params: 38,875
Non-trainable params: 0
_________________________________________________________________
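
The parameter counts in the summary above can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The helper name `dense_param_count` is ours, introduced only for this check.

```python
def dense_param_count(n_inputs: int, n_units: int) -> int:
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# Layer widths taken from the summary above: 62 inputs -> 310 -> 62 -> 1.
layer_sizes = [62, 310, 62, 1]
params_per_layer = [
    dense_param_count(n_in, n_out)
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
print(params_per_layer, sum(params_per_layer))
```

The three values match the `Param #` column, and their sum matches the reported total of 38,875.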


## Compiling the Model

### Overview

The compilation step is where the learning process is configured before training the model. It includes specifying the loss function, the optimizer, and the evaluation metrics.

### Components

#### Loss Function
- **binary_crossentropy**: Suitable for binary classification problems, this loss function computes the cross-entropy between the true labels and the predicted probabilities.

#### Optimizer
- **adam**: An adaptive learning-rate optimizer that is a common default choice in deep learning.

#### Metrics
- **accuracy**: Measures the proportion of correctly classified instances.
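
To make the loss and the metric concrete, here is a plain-Python sketch of both calculations on four made-up predictions. It illustrates the formulas and is not the Keras implementation. The function names, the clipping constant `eps`, and the 0.5 decision threshold are assumptions for this example.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # guard against log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions whose thresholded class matches the label."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

# Made-up labels and predicted probabilities, for illustration only.
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]

print(round(binary_crossentropy(y_true, y_prob), 4))
print(accuracy(y_true, y_prob))
```

The loss penalizes confident mistakes more heavily than hesitant ones, while accuracy only counts whether each prediction falls on the correct side of the threshold.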

### Code

```python
# Compile the model
nn.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
```

### Summary

This compilation sets the stage for training the neural network. By specifying the loss function, optimizer, and metrics, the model is prepared to learn from the training data and evaluate its performance on both training and testing datasets.

In [22]:
# Compile the model
nn.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

## Training the Model

### Overview

Once the model has been defined and compiled, the next step is to train it on the data. Training involves feeding the input data into the model, letting it make predictions, and then updating the model's weights based on the error of its predictions. This process is repeated over the whole training set a specified number of times; each full pass is called an epoch.

### Components

#### Training Data
- **X_train_scaled**: The scaled input features for training the model.
- **y_train**: The true labels corresponding to the input features.

#### Epochs
- **epochs**: This argument specifies the number of times the learning algorithm will work through the entire training dataset. In this case, 100 epochs are chosen.

### Code

```python
# Train the model
fit_model = nn.fit(X_train_scaled, y_train, epochs=100)
```

### Summary

This code snippet is where the actual training of the neural network takes place. Calling the `fit` method on the model (`nn`) with the training data makes 100 passes over that data, updating the weights after each batch. The call returns a `History` object, stored here in `fit_model`, whose `history` attribute records the loss and accuracy for each epoch. Predictions on unseen data are made with the trained model itself (`nn.predict`), not with `fit_model`.

Note: The number of epochs is a hyperparameter that can be tuned. Choosing the right number of epochs is vital as too few may result in underfitting, while too many may lead to overfitting.
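
One common way to choose the number of epochs is to stop once a validation loss stops improving. The sketch below shows that patience rule in plain Python. It assumes a validation loss is tracked per epoch, which the `fit` call in this notebook does not do, and the loss values are invented for illustration. Keras offers a callback, `tf.keras.callbacks.EarlyStopping`, built around a similar rule.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training would stop.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)  # rule never triggered: all epochs were run

# Hypothetical validation losses, invented to illustrate the rule.
val_losses = [0.60, 0.55, 0.53, 0.54, 0.55, 0.56, 0.57]
print(early_stopping_epoch(val_losses, patience=3))
```

A larger `patience` tolerates more noise in the validation loss at the cost of extra training time.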

In [23]:
# Train the model
fit_model = nn.fit(X_train_scaled,y_train,epochs=100)

Epoch 1/100
Epoch 2/100
...
Epoch 78/100
...

## Evaluating the Model

### Overview

After training the neural network model, it is important to evaluate how well the model performs on unseen data. This is usually done by running the model on a test dataset and comparing the predictions to the actual values.

### Components

#### Test Data
- **X_test_scaled**: The scaled input features for the test data.
- **y_test**: The true labels corresponding to the test data.

#### Evaluation Method
- `nn.evaluate()`: This method takes the test data as input and computes the loss and any additional metrics specified when the model was compiled (in this case, accuracy). 

### Code

```python
# Evaluate the model using the test data
model_loss, model_accuracy = nn.evaluate(X_test_scaled, y_test, verbose=2)
print(f"Loss: {model_loss}, Accuracy: {model_accuracy}")
```

### Summary

The code snippet above is used to evaluate the trained model on the test dataset. The `evaluate` method returns the loss and accuracy, which are printed to the console.

- **Loss**: This value represents how well the predictions of the model align with the actual values. A lower loss indicates better alignment.
- **Accuracy**: This value is the fraction of correct predictions out of all predictions, reported on a scale from 0 to 1. It is particularly useful for classification problems.

By evaluating the model on a dataset that it has not seen during training, we can get a better sense of how the model might perform on entirely new data. This helps in assessing the model's generalization capability.

In [19]:
# Evaluate the model using the test data
model_loss, model_accuracy = nn.evaluate(X_test_scaled,y_test,verbose=2)
print(f"Loss: {model_loss}, Accuracy: {model_accuracy}")

268/268 - 0s - loss: 0.5187 - accuracy: 0.7455 - 492ms/epoch - 2ms/step
Loss: 0.5186641216278076, Accuracy: 0.7455393671989441
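
The `268/268` in the log above is the number of evaluation batches. It can be reproduced from the dataset size, assuming the defaults that apply when nothing is passed explicitly: a 25% test split in `train_test_split` and a batch size of 32 in `evaluate`.

```python
import math

n_samples = 34299        # rows reported by application_df.info()
test_fraction = 0.25     # train_test_split default when test_size is not given
batch_size = 32          # Keras default batch size for evaluate()

# The test set size is rounded up; the training set receives the remainder.
n_test = math.ceil(n_samples * test_fraction)
n_train = n_samples - n_test

# The last batch may be partially filled, so round up again.
eval_steps = math.ceil(n_test / batch_size)
print(n_test, n_train, eval_steps)
```

The reported accuracy is therefore measured on roughly a quarter of the organizations, none of which were used to fit the scaler or the network.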


In [20]:
# Export our model to HDF5 file
# Save the model to disk
nn.save("AlphabetSoupCharity_Optimization.h5")
print("Model saved to disk.")
