# Step 3: Optimise the model
Using your knowledge of TensorFlow, optimise your model to achieve a target predictive accuracy higher than 75%.
Use any or all of the following methods to optimise your model:
- Adjust the input data to ensure that no variables or outliers are causing confusion in the model, such as:
    - Dropping more or fewer columns.
    - Creating more bins for rare occurrences in columns.
    - Increasing or decreasing the number of values for each bin.
- Add more neurons to a hidden layer.
- Add more hidden layers.
- Use different activation functions for the hidden layers.
- Add or reduce the number of epochs to the training regimen.

In [163]:
# Import dependencies
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.layers import Dense

# Step 3A: Preprocess the Data
Preprocess the dataset as you did in Step 1. Be sure to adjust for any modifications that came out of optimising the model.


In [164]:
# Load data into a Pandas DataFrame
charity_df = pd.read_csv("Resources/charity_data.csv")
charity_df.head(1)

Unnamed: 0,EIN,NAME,APPLICATION_TYPE,AFFILIATION,CLASSIFICATION,USE_CASE,ORGANIZATION,STATUS,INCOME_AMT,SPECIAL_CONSIDERATIONS,ASK_AMT,IS_SUCCESSFUL
0,10520599,BLUE KNIGHTS MOTORCYCLE CLUB,T10,Independent,C1000,ProductDev,Association,1,0,N,5000,1


Added outlier reduction before binning at attempt 3

In [165]:
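# Restrict the IQR calculation to numeric columns (text columns have no quantiles)
numeric_df = charity_df.select_dtypes(include="number")

# Calculate the IQR for each numeric feature
Q1 = numeric_df.quantile(0.25)
Q3 = numeric_df.quantile(0.75)
IQR = Q3 - Q1

# Define the outlier threshold as 1.5 times the IQR above Q3 or below Q1
outlier_threshold = 1.5 * IQR

# Flag rows where any numeric feature falls outside the thresholds
outliers = ((numeric_df < (Q1 - outlier_threshold)) | (numeric_df > (Q3 + outlier_threshold))).any(axis=1)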

# Remove the outliers from the data
charity_df = charity_df[~outliers]
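
The IQR rule used in this cell can be seen on a small scale in the sketch below. The toy DataFrame and its column names are hypothetical and are not taken from `charity_data.csv`; only the 1.5 × IQR rule itself comes from the cell above.

```python
import pandas as pd

# Hypothetical toy data: one obviously extreme value in "ask_amt".
toy_df = pd.DataFrame({
    "ask_amt": [5000, 5000, 5200, 4800, 5100, 1_000_000],
    "label": ["a", "b", "c", "d", "e", "f"],
})

# Only numeric columns have meaningful quantiles.
numeric = toy_df.select_dtypes(include="number")
q1 = numeric.quantile(0.25)
q3 = numeric.quantile(0.75)
fence = 1.5 * (q3 - q1)

# A row is an outlier if any numeric value lies outside [Q1 - fence, Q3 + fence].
is_outlier = ((numeric < (q1 - fence)) | (numeric > (q3 + fence))).any(axis=1)
filtered_df = toy_df[~is_outlier]
print(filtered_df)
```

Here only the row with `ask_amt` of 1,000,000 is dropped; the value 4800 stays because it lies inside the lower fence.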



In [166]:
# Determine the number of unique values in each column.
charity_df.nunique()

EIN                       26088
NAME                      13247
APPLICATION_TYPE             12
AFFILIATION                   6
CLASSIFICATION               65
USE_CASE                      5
ORGANIZATION                  4
STATUS                        1
INCOME_AMT                    9
SPECIAL_CONSIDERATIONS        2
ASK_AMT                     656
IS_SUCCESSFUL                 2
dtype: int64

In [167]:
# Look at APPLICATION_TYPE value counts for binning
bin1 = charity_df['APPLICATION_TYPE'].value_counts()
bin1

T3     20081
T4      1331
T19      999
T5       894
T6       882
T8       669
T7       633
T10      508
T13       54
T9        18
T12       13
T2         6
Name: APPLICATION_TYPE, dtype: int64

In [168]:
# Choose a cutoff value and create a list of application types to be replaced
# use the variable name `application_types_to_replace`
application_types_to_replace = list(bin1[bin1<100].index)

# Replace in dataframe
for app in application_types_to_replace:
    charity_df['APPLICATION_TYPE'] = charity_df['APPLICATION_TYPE'].replace(app,"Other")

# Check to make sure binning was successful
charity_df['APPLICATION_TYPE'].value_counts()

T3       20081
T4        1331
T19        999
T5         894
T6         882
T8         669
T7         633
T10        508
Other       91
Name: APPLICATION_TYPE, dtype: int64
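
The same cutoff-based binning is repeated for `APPLICATION_TYPE`, `CLASSIFICATION` and `NAME`, so it could be wrapped in one helper. The sketch below is one possible way to do that; `bin_rare_values` is a hypothetical name, not something defined in this notebook, and the toy Series is made up for illustration.

```python
import pandas as pd

def bin_rare_values(series: pd.Series, min_count: int, other_label: str = "Other") -> pd.Series:
    """Replace every category that occurs fewer than min_count times with other_label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; substitute other_label everywhere else.
    return series.where(~series.isin(rare), other_label)

# Hypothetical toy column: "T9" and "T2" each occur once.
toy = pd.Series(["T3"] * 5 + ["T4"] * 3 + ["T9", "T2"])
binned = bin_rare_values(toy, min_count=2)
print(binned.value_counts())
```

With such a helper, the three binning cells would differ only in the column name and the cutoff (100, 10 and 40 in this notebook).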

In [169]:
# Look at CLASSIFICATION value counts for binning
bin2 = charity_df['CLASSIFICATION'].value_counts()
bin2

C1000    12554
C2000     4692
C1200     3997
C2100     1622
C3000     1537
         ...  
C1732        1
C1728        1
C4120        1
C1245        1
C2150        1
Name: CLASSIFICATION, Length: 65, dtype: int64

In [170]:
# You may find it helpful to look at CLASSIFICATION value counts >1
bin2[bin2>1]

C1000    12554
C2000     4692
C1200     3997
C2100     1622
C3000     1537
C7000      546
C1700      223
C4000      129
C5000      106
C1270       89
C2700       75
C7100       62
C2800       61
C1280       46
C1300       42
C1230       35
C2300       28
C1240       28
C1400       27
C7200       20
C6000       14
C8000       13
C1250       13
C7120       11
C1278       10
C1237        9
C8200        9
C1238        9
C1235        9
C1500        7
C1720        6
C1257        5
C7210        5
C2400        4
C1600        4
C4100        4
C1260        3
C1800        3
C1267        2
C1246        2
C1256        2
C0           2
Name: CLASSIFICATION, dtype: int64

In [171]:
# Choose a cutoff value and create a list of classifications to be replaced
# use the variable name `classifications_to_replace`
classifications_to_replace = list(bin2[bin2<10].index)

# Replace in dataframe
for cls in classifications_to_replace:
    charity_df['CLASSIFICATION'] = charity_df['CLASSIFICATION'].replace(cls,"Other")
    
# Check to make sure binning was successful
charity_df['CLASSIFICATION'].value_counts()

C1000    12554
C2000     4692
C1200     3997
C2100     1622
C3000     1537
C7000      546
C1700      223
C4000      129
Other      108
C5000      106
C1270       89
C2700       75
C7100       62
C2800       61
C1280       46
C1300       42
C1230       35
C2300       28
C1240       28
C1400       27
C7200       20
C6000       14
C1250       13
C8000       13
C7120       11
C1278       10
Name: CLASSIFICATION, dtype: int64

In [172]:
# Look at NAME value counts for binning
bin3 = charity_df['NAME'].value_counts()
bin3

PARENT BOOSTER USA INC                                    1130
TOPS CLUB INC                                              765
UNITED STATES BOWLING CONGRESS INC                         618
WASHINGTON STATE UNIVERSITY                                487
AMATEUR ATHLETIC UNION OF THE UNITED STATES INC            385
                                                          ... 
SOUTHERN ARIZONA LUTHERAN CAMPING ASSOCIATION                1
WIS TEQ NEEMIT                                               1
YOUNG ARTISTS SYMPHONY ORCHESTRA                             1
CATHOLIC CEMETERY AND CHARITABLE IRRV TR                     1
AMERICAN FEDERATION OF GOVERNMENT EMPLOYEES LOCAL 2886       1
Name: NAME, Length: 13247, dtype: int64

In [173]:
# Choose a cutoff value and create a list of names to be replaced
# use the variable name `names_to_replace`
names_to_replace = list(bin3[bin3<40].index)

# Replace in dataframe
for name in names_to_replace:
    charity_df['NAME'] = charity_df['NAME'].replace(name,"Other")
    
# Check to make sure binning was successful
charity_df['NAME'].value_counts()

Other                                                                    16923
PARENT BOOSTER USA INC                                                    1130
TOPS CLUB INC                                                              765
UNITED STATES BOWLING CONGRESS INC                                         618
WASHINGTON STATE UNIVERSITY                                                487
                                                                         ...  
VETERANS OF FOREIGN WARS OF THE U S AUXILIARY DEPARTMENT OF LOUISIANA       41
AMERICAN YOUTH FOOTBALL INC                                                 41
MUSIC TEACHERS NATIONAL ASSOCIATION INC                                     41
MODERN QUILT GUILD INC                                                      40
VETERANS OF FOREIGN WARS OF THE UNITED STATES AUX DEPT OF COLORADO          40
Name: NAME, Length: 61, dtype: int64

In [174]:
# Convert categorical data to numeric with `pd.get_dummies`
charity_df = pd.get_dummies(charity_df,dtype=float)
charity_df.head()

Unnamed: 0,EIN,STATUS,ASK_AMT,IS_SUCCESSFUL,NAME_AIR FORCE ASSOCIATION,NAME_ALABAMA FEDERATION OF WOMENS CLUBS,NAME_ALPHA PHI SIGMA,NAME_AMATEUR ATHLETIC UNION OF THE UNITED STATES,NAME_AMATEUR ATHLETIC UNION OF THE UNITED STATES INC,NAME_AMERICAN ASSOCIATION OF UNIVERSITY WOMEN,...,INCOME_AMT_1-9999,INCOME_AMT_10000-24999,INCOME_AMT_100000-499999,INCOME_AMT_10M-50M,INCOME_AMT_1M-5M,INCOME_AMT_25000-99999,INCOME_AMT_50M+,INCOME_AMT_5M-10M,SPECIAL_CONSIDERATIONS_N,SPECIAL_CONSIDERATIONS_Y
0,10520599,1,5000,1,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,10547893,1,5000,0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,10553066,1,6692,1,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
5,10556855,1,5000,1,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
9,10571689,1,5000,0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
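
As a small illustration of what `pd.get_dummies` does to a mixed DataFrame, the sketch below encodes a two-row toy frame. The rows are made up; only the column names echo the dataset.

```python
import pandas as pd

# Hypothetical two-row frame with one numeric and one categorical column.
toy_df = pd.DataFrame({
    "ASK_AMT": [5000, 6692],
    "SPECIAL_CONSIDERATIONS": ["N", "Y"],
})

# Numeric columns pass through unchanged; each category becomes its own 0.0/1.0 column.
encoded = pd.get_dummies(toy_df, dtype=float)
print(encoded.columns.tolist())
print(encoded)
```

Each distinct category adds one column, which is why binning rare values before this step keeps the feature count manageable.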


In [175]:
# Drop unnecessary columns
charity_df = charity_df.drop(columns=["EIN"])

# Split our preprocessed data into our features and target arrays
X = charity_df.drop(columns=["IS_SUCCESSFUL"])
y = charity_df["IS_SUCCESSFUL"]

# Split the preprocessed data into a training and testing dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Scale the data using the StandardScaler
scaler = StandardScaler()
X_scaler = scaler.fit_transform(X_train)
X_test_scaler = scaler.transform(X_test)

# Convert y_train and y_test to categorical values
y_train_categorical = to_categorical(y_train)
y_test_categorical = to_categorical(y_test)
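
For a binary target, `to_categorical` turns each label into a two-element one-hot row, which is what the two-unit output layers below expect. A NumPy-only sketch of the same idea, using made-up labels:

```python
import numpy as np

# Hypothetical binary labels, in the same 0/1 encoding as IS_SUCCESSFUL.
labels = np.array([1, 0, 1])
num_classes = 2

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(num_classes, dtype="float32")[labels]
print(one_hot)
```

Label 0 becomes `[1, 0]` and label 1 becomes `[0, 1]`.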

In [176]:
# Print charity_df shape
charity_df.shape

(26088, 125)

# Step 3B: Optimise Model

______________________
## Script 1:

In [177]:
number_input_features = len( X_scaler[0])

# neurons in the input layer
input_layer_nodes = 80

# Define the hidden layers
hidden_nodes_layer_1 = 80

# neurons in the output layer
output_layer_nodes = 2

# Design the neural network model
number_input_features = len(X_train.columns)

nn1 = tf.keras.models.Sequential()

# Add the input layer
nn1.add(Dense(units=input_layer_nodes, input_dim=number_input_features, activation="relu"))

# Add the first hidden layer
nn1.add(Dense(units=hidden_nodes_layer_1, activation="relu"))

# Add the output layer
nn1.add(Dense(units=output_layer_nodes, activation="sigmoid"))

# Check the structure of the model
nn1.summary()

Model: "sequential_44"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_144 (Dense)           (None, 80)                10000     
                                                                 
 dense_145 (Dense)           (None, 80)                6480      
                                                                 
 dense_146 (Dense)           (None, 2)                 162       
                                                                 
Total params: 16,642
Trainable params: 16,642
Non-trainable params: 0
_________________________________________________________________
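
The `Param #` column in this summary can be checked by hand: a `Dense` layer has one weight per input-unit pair plus one bias per unit. The sketch below assumes 124 input features, inferred from the `(26088, 125)` shape printed earlier minus the `IS_SUCCESSFUL` target column; `dense_param_counts` is a hypothetical helper.

```python
def dense_param_counts(n_inputs, layer_units):
    """Return the trainable parameter count of each Dense layer in a stack."""
    counts = []
    fan_in = n_inputs
    for units in layer_units:
        # weights (fan_in * units) plus biases (units)
        counts.append(fan_in * units + units)
        fan_in = units
    return counts

# Assumption: 124 input features feeding layers of 80, 80 and 2 units.
counts = dense_param_counts(124, [80, 80, 2])
print(counts, sum(counts))
```

The result matches the summary: 10,000 + 6,480 + 162 = 16,642 parameters.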


In [178]:
# Compile the model
nn1.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

# Train the model with 5 epochs
model_1 = nn1.fit(X_scaler, y_train_categorical, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [179]:
# Evaluate the model using the test data
test_loss, test_accuracy = nn1.evaluate(X_test_scaler, y_test_categorical, verbose=2)

# Print the test loss and accuracy
print(f"Test Loss: {test_loss}")
print(f"Test Accuracy: {test_accuracy}")

204/204 - 0s - loss: 0.4440 - accuracy: 0.7856 - 311ms/epoch - 2ms/step
Test Loss: 0.44402551651000977
Test Accuracy: 0.7856485843658447


In [180]:
# Attempt 1:

'''
268/268 - 0s - loss: 0.5492 - accuracy: 0.7313 - 377ms/epoch - 1ms/step
Test Loss: 0.5492093563079834
Test Accuracy: 0.7313119769096375
'''

# Attempt 2: Added outlier reduction before binning

'''
204/204 - 0s - loss: 0.8306 - accuracy: 0.7459 - 375ms/epoch - 2ms/step
Test Loss: 0.8305676579475403
Test Accuracy: 0.7459368109703064
'''

# Attempt 3: binning values in the "NAME" column

'''
204/204 - 0s - loss: 0.5579 - accuracy: 0.7453 - 394ms/epoch - 2ms/step
Test Loss: 0.5578669309616089
Test Accuracy: 0.7453235387802124
'''

# Attempt 4: nodes 80, 80, 2 + FIXED PREPROCESSING -> # Convert categorical data to numeric with `pd.get_dummies`

'''
204/204 - 0s - loss: 0.4481 - accuracy: 0.7850 - 356ms/epoch - 2ms/step
Test Loss: 0.44812464714050293
Test Accuracy: 0.785035252571106
'''


______________________
## Script 2:
Added Dropout layers to prevent overfitting

In [181]:
number_input_features = len( X_scaler[0])

# neurons in the input layer
input_layer_nodes = 90

# Define the hidden layers
hidden_nodes_layer_1 = 90

# neurons in the output layer
output_layer_nodes = 2

# Design the neural network model
number_input_features = len(X_train.columns)

nn2 = tf.keras.models.Sequential()

# Add the input layer
nn2.add(Dense(units=input_layer_nodes, input_dim=number_input_features, activation="tanh"))

# Add a Dropout layer to prevent overfitting
nn2.add(tf.keras.layers.Dropout(0.3))

# Add the first hidden layer
nn2.add(Dense(units=hidden_nodes_layer_1, activation="sigmoid"))

# Add another Dropout layer
nn2.add(tf.keras.layers.Dropout(0.3))

# Add the output layer 
nn2.add(Dense(units=output_layer_nodes, activation="softmax"))

# Check the structure of the model
nn2.summary()

Model: "sequential_45"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_147 (Dense)           (None, 90)                11250     
                                                                 
 dropout_10 (Dropout)        (None, 90)                0         
                                                                 
 dense_148 (Dense)           (None, 90)                8190      
                                                                 
 dropout_11 (Dropout)        (None, 90)                0         
                                                                 
 dense_149 (Dense)           (None, 2)                 182       
                                                                 
Total params: 19,622
Trainable params: 19,622
Non-trainable params: 0
_________________________________________________________________
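
To make the effect of `Dropout(0.3)` concrete, the NumPy sketch below mimics what a dropout layer does during training: it zeroes a random fraction of activations and rescales the survivors so the expected value is unchanged. This illustrates the technique only and is not the Keras implementation; `dropout_forward` and the all-ones input are hypothetical.

```python
import numpy as np

def dropout_forward(activations, rate, rng):
    """Inverted dropout: zero roughly `rate` of the values, scale the rest by 1 / (1 - rate)."""
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
x = np.ones((1000, 90))  # stand-in for the output of a 90-unit layer
dropped = dropout_forward(x, rate=0.3, rng=rng)

print((dropped == 0).mean())  # fraction of zeroed activations, close to 0.3
print(dropped.mean())         # mean stays close to 1.0 thanks to the rescaling
```

At evaluation time dropout is inactive, so the layers add no parameters and do not change predictions on the test set directly.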


In [191]:
# Compile the model
nn2.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model with 5 epochs
model_2 = nn2.fit(X_scaler, y_train_categorical, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [192]:
# Evaluate the model using the test data
test_loss, test_accuracy = nn2.evaluate(X_test_scaler, y_test_categorical, verbose=2)

# Print the test loss and accuracy
print(f"Test Loss: {test_loss}")
print(f"Test Accuracy: {test_accuracy}")

204/204 - 1s - loss: 0.4423 - accuracy: 0.7861 - 503ms/epoch - 2ms/step
Test Loss: 0.44234219193458557
Test Accuracy: 0.7861085534095764


In [184]:
# Attempt 1: 

'''
204/204 - 0s - loss: 0.5416 - accuracy: 0.7423 - 321ms/epoch - 2ms/step
Test Loss: 0.5415589213371277
Test Accuracy: 0.7422569990158081
'''

# Attempt 2: FIXED PREPROCESSING -> # Convert categorical data to numeric with `pd.get_dummies`

'''
204/204 - 0s - loss: 0.4471 - accuracy: 0.7778 - 368ms/epoch - 2ms/step
Test Loss: 0.44707298278808594
Test Accuracy: 0.7778288722038269
'''


______________________
## Script 3:

In [185]:
from tensorflow.keras.regularizers import l2

In [186]:
number_input_features = len(X_scaler[0])

# neurons in the input layer
input_layer_nodes = 100

# Define the hidden layers
hidden_nodes_layer_1 = 80
hidden_nodes_layer_2 = 80

# neurons in the output layer
output_layer_nodes = 2

# Design the neural network model
number_input_features = len(X_train.columns)

nn3 = tf.keras.models.Sequential()

# Add the input layer with "relu" activation function relu -> tanh -> sigmoid
nn3.add(Dense(units=input_layer_nodes, input_dim=number_input_features, activation="relu", kernel_regularizer=l2(0.02)))

# Add the first hidden layer with "relu" activation function relu -> tanh -> sigmoid
nn3.add(Dense(units=hidden_nodes_layer_1, activation="relu", kernel_regularizer=l2(0.02)))

# Add the second hidden layer with "sigmoid" activation function relu -> tanh -> sigmoid
nn3.add(Dense(units=hidden_nodes_layer_2, activation="sigmoid", kernel_regularizer=l2(0.01)))

# Add the output layer with "sigmoid" activation function # softmax -> sigmoid -> tanh -> sigmoid
nn3.add(Dense(units=output_layer_nodes, activation="sigmoid", kernel_regularizer=l2(0.02)))

# Check the structure of the model
nn3.summary()


Model: "sequential_46"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_150 (Dense)           (None, 100)               12500     
                                                                 
 dense_151 (Dense)           (None, 80)                8080      
                                                                 
 dense_152 (Dense)           (None, 80)                6480      
                                                                 
 dense_153 (Dense)           (None, 2)                 162       
                                                                 
Total params: 27,222
Trainable params: 27,222
Non-trainable params: 0
_________________________________________________________________
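
The `kernel_regularizer=l2(...)` arguments add a penalty to the training loss equal to the factor times the sum of squared weights of that layer. A NumPy sketch of that penalty, with a made-up weight matrix and a hypothetical `l2_penalty` helper:

```python
import numpy as np

def l2_penalty(weights, factor):
    """L2 weight penalty: factor * sum of squared weights."""
    return factor * np.sum(np.square(weights))

# Hypothetical 2x2 kernel, just to show the arithmetic.
w = np.array([[0.5, -1.0], [2.0, 0.0]])
penalty = l2_penalty(w, 0.02)
print(penalty)  # 0.02 * (0.25 + 1.0 + 4.0)
```

Because these penalties are added on top of the cross-entropy term, the loss reported for this model is not directly comparable with the losses of the unregularised models, even when accuracy is similar.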


In [187]:
# Compile the model
nn3.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model with 10 epochs
model_3 = nn3.fit(X_scaler, y_train_categorical, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [188]:
# Evaluate the model using the test data
test_loss, test_accuracy = nn3.evaluate(X_test_scaler, y_test_categorical, verbose=2)

# Print the test loss and accuracy
print(f"Test Loss: {test_loss}")
print(f"Test Accuracy: {test_accuracy}")

204/204 - 0s - loss: 0.5956 - accuracy: 0.7751 - 353ms/epoch - 2ms/step
Test Loss: 0.5956299304962158
Test Accuracy: 0.775068998336792


In [189]:
# Attempt 1: nodes 80, 80, 80 , 2 

'''
204/204 - 0s - loss: 0.5749 - accuracy: 0.7380 - 355ms/epoch - 2ms/step
Test Loss: 0.5749160051345825
Test Accuracy: 0.7379637956619263
'''

# Attempt 2: nodes 100, 80, 80 , 2 

'''
204/204 - 0s - loss: 0.5749 - accuracy: 0.7380 - 299ms/epoch - 1ms/step
Test Loss: 0.5749160051345825
Test Accuracy: 0.7379637956619263
'''

# Attempt 3: nodes 100, 80, 80 , 2 + FIXED PREPROCESSING -> # Convert categorical data to numeric with `pd.get_dummies`

'''
204/204 - 0s - loss: 0.5943 - accuracy: 0.7715 - 354ms/epoch - 2ms/step
Test Loss: 0.5942786931991577
Test Accuracy: 0.771542489528656
'''


In [193]:
# Save the model to an HDF5 file
nn2.save("AlphabetSoupCharity_Optimisation.h5")
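
Saving `nn2` is consistent with the final test accuracies printed above, where it scores marginally highest. The sketch below copies those three printed values into a dictionary and picks the best; the dictionary and variable names are hypothetical.

```python
# Final test accuracies as printed by the three evaluation cells above.
final_test_accuracy = {
    "nn1": 0.7856485843658447,
    "nn2": 0.7861085534095764,
    "nn3": 0.775068998336792,
}

best_name = max(final_test_accuracy, key=final_test_accuracy.get)
meets_target = {name: acc > 0.75 for name, acc in final_test_accuracy.items()}

print(best_name)
print(meets_target)
```

All three models clear the 75% target, and the gap between `nn1` and `nn2` is small enough that it may not be meaningful without repeated runs.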

# Step 4: Write a Report on the Neural Network Model
For this part of the assignment, you’ll write a report on the performance of the deep learning model you created for AlphabetSoup.
The report should contain the following:


Overview of the analysis: Explain the purpose of this analysis.

Results: Using bulleted lists and images to support your answers, address the following questions.

## Data Preprocessing
- What variable(s) are the target(s) for your model?
- What variable(s) are the features for your model?
- What variable(s) should be removed from the input data because they are neither targets nor features?

## Compiling, Training, and Evaluating the Model
- How many neurons, layers, and activation functions did you select for your neural network model, and why?
- Were you able to achieve the target model performance?
- What steps did you take in your attempts to increase model performance?

Summary: Summarise the overall results of the deep learning model. Include a recommendation for how a different model could solve this classification problem, and then explain your recommendation.