# Venture Funding with Deep Learning

For this project, I am working with a CSV file containing more than 34,000 organizations that have received funding from Alphabet Soup over the years. Using machine learning and neural networks, I've decided to use the features in the provided dataset to create a binary classifier model that predicts whether an applicant will become a successful business. The CSV file contains a variety of information about these businesses, including whether or not they ultimately became successful.

## Project Steps:

* I will prepare the data for use on a neural network model.

* I will compile and evaluate a binary classification model using a neural network.

* I will optimize the neural network model.

### Preparing the Data for Use on a Neural Network Model 

Using Pandas and scikit-learn’s `StandardScaler()`, I will preprocess the dataset so that it can be used to compile and evaluate the neural network model later.

### Compiling and Evaluating a Binary Classification Model Using a Neural Network

Using TensorFlow, I will design a binary classification deep neural network model. This model should use the dataset’s features to predict whether an Alphabet Soup-funded startup will be successful.


### Optimizing the Neural Network Model

Using TensorFlow and Keras, I will optimize the model to improve the model's accuracy.

In [None]:
# Imports
import pandas as pd
from pathlib import Path
import tensorflow as tf
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler,OneHotEncoder
from google.colab import files
uploaded = files.upload()

Saving applicants_data.csv to applicants_data.csv


---

## Preparing the data to be used on a neural network model

In [None]:
# Reading the uploaded applicants_data.csv file into a Pandas DataFrame
applicant_data_df = pd.read_csv(Path("applicants_data.csv"))

# Reviewing the DataFrame
applicant_data_df


Unnamed: 0,EIN,NAME,APPLICATION_TYPE,AFFILIATION,CLASSIFICATION,USE_CASE,ORGANIZATION,STATUS,INCOME_AMT,SPECIAL_CONSIDERATIONS,ASK_AMT,IS_SUCCESSFUL
0,10520599,BLUE KNIGHTS MOTORCYCLE CLUB,T10,Independent,C1000,ProductDev,Association,1,0,N,5000,1
1,10531628,AMERICAN CHESAPEAKE CLUB CHARITABLE TR,T3,Independent,C2000,Preservation,Co-operative,1,1-9999,N,108590,1
2,10547893,ST CLOUD PROFESSIONAL FIREFIGHTERS,T5,CompanySponsored,C3000,ProductDev,Association,1,0,N,5000,0
3,10553066,SOUTHSIDE ATHLETIC ASSOCIATION,T3,CompanySponsored,C2000,Preservation,Trust,1,10000-24999,N,6692,1
4,10556103,GENETIC RESEARCH INSTITUTE OF THE DESERT,T3,Independent,C1000,Heathcare,Trust,1,100000-499999,N,142590,1
...,...,...,...,...,...,...,...,...,...,...,...,...
34294,996009318,THE LIONS CLUB OF HONOLULU KAMEHAMEHA,T4,Independent,C1000,ProductDev,Association,1,0,N,5000,0
34295,996010315,INTERNATIONAL ASSOCIATION OF LIONS CLUBS,T4,CompanySponsored,C3000,ProductDev,Association,1,0,N,5000,0
34296,996012607,PTA HAWAII CONGRESS,T3,CompanySponsored,C2000,Preservation,Association,1,0,N,5000,0
34297,996015768,AMERICAN FEDERATION OF GOVERNMENT EMPLOYEES LO...,T5,Independent,C3000,ProductDev,Association,1,0,N,5000,1


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Reviewing the data types associated with the columns
applicant_data_df.dtypes


EIN                        int64
NAME                      object
APPLICATION_TYPE          object
AFFILIATION               object
CLASSIFICATION            object
USE_CASE                  object
ORGANIZATION              object
STATUS                     int64
INCOME_AMT                object
SPECIAL_CONSIDERATIONS    object
ASK_AMT                    int64
IS_SUCCESSFUL              int64
dtype: object

### Dropping the “EIN” (Employer Identification Number) and “NAME” columns from the DataFrame, because they are not relevant to the binary classification model.

In [None]:
# Dropping the 'EIN' and 'NAME' columns from the DataFrame
applicant_data_df = applicant_data_df.drop(columns=["EIN","NAME"])

# Reviewing the DataFrame


In [None]:
applicant_data_df

Unnamed: 0,APPLICATION_TYPE,AFFILIATION,CLASSIFICATION,USE_CASE,ORGANIZATION,STATUS,INCOME_AMT,SPECIAL_CONSIDERATIONS,ASK_AMT,IS_SUCCESSFUL
0,T10,Independent,C1000,ProductDev,Association,1,0,N,5000,1
1,T3,Independent,C2000,Preservation,Co-operative,1,1-9999,N,108590,1
2,T5,CompanySponsored,C3000,ProductDev,Association,1,0,N,5000,0
3,T3,CompanySponsored,C2000,Preservation,Trust,1,10000-24999,N,6692,1
4,T3,Independent,C1000,Heathcare,Trust,1,100000-499999,N,142590,1
...,...,...,...,...,...,...,...,...,...,...
34294,T4,Independent,C1000,ProductDev,Association,1,0,N,5000,0
34295,T4,CompanySponsored,C3000,ProductDev,Association,1,0,N,5000,0
34296,T3,CompanySponsored,C2000,Preservation,Association,1,0,N,5000,0
34297,T5,Independent,C3000,ProductDev,Association,1,0,N,5000,1


### Encoding the dataset’s categorical variables using `OneHotEncoder`, and then placing the encoded variables into a new DataFrame.

In [None]:
# Creating a list of categorical variables 
categorical_variables = list(applicant_data_df.dtypes[applicant_data_df.dtypes == "object"].index)

# Displaying the categorical variables list
categorical_variables


['APPLICATION_TYPE',
 'AFFILIATION',
 'CLASSIFICATION',
 'USE_CASE',
 'ORGANIZATION',
 'INCOME_AMT',
 'SPECIAL_CONSIDERATIONS']

In [None]:
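# Creating a OneHotEncoder instance that returns a dense array
# (scikit-learn 1.2+ names this argument sparse_output; older releases used sparse)
enc = OneHotEncoder(sparse_output=False)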


In [None]:
# Encoding the categorical variables using OneHotEncoder
encoded_data = enc.fit_transform(applicant_data_df[categorical_variables])


In [None]:
# Creating a DataFrame with the encoded variables
encoded_df = pd.DataFrame(
    encoded_data,
    columns = enc.get_feature_names_out(categorical_variables)
)

# Reviewing the DataFrame
encoded_df




Unnamed: 0,APPLICATION_TYPE_T10,APPLICATION_TYPE_T12,APPLICATION_TYPE_T13,APPLICATION_TYPE_T14,APPLICATION_TYPE_T15,APPLICATION_TYPE_T17,APPLICATION_TYPE_T19,APPLICATION_TYPE_T2,APPLICATION_TYPE_T25,APPLICATION_TYPE_T29,...,INCOME_AMT_1-9999,INCOME_AMT_10000-24999,INCOME_AMT_100000-499999,INCOME_AMT_10M-50M,INCOME_AMT_1M-5M,INCOME_AMT_25000-99999,INCOME_AMT_50M+,INCOME_AMT_5M-10M,SPECIAL_CONSIDERATIONS_N,SPECIAL_CONSIDERATIONS_Y
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34294,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
34295,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
34296,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
34297,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
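
To make the encoding step concrete, here is a minimal plain-Python sketch of what one-hot encoding produces. It is not scikit-learn's implementation; the helper name `one_hot_encode` and the toy rows are assumptions for illustration, using category values that appear in the dataset preview.

```python
def one_hot_encode(rows, columns):
    """One-hot encode the given columns of a list of dict rows.

    Returns (feature_names, encoded_rows), where each encoded row is a list
    of 0.0/1.0 floats aligned with feature_names.
    """
    # Collect the sorted set of categories seen in each column
    categories = {col: sorted({row[col] for row in rows}) for col in columns}

    # Name each output column "<column>_<category>", as in the output above
    feature_names = [f"{col}_{cat}" for col in columns for cat in categories[col]]

    # Emit 1.0 where the row's value matches the category, 0.0 elsewhere
    encoded_rows = [
        [1.0 if row[col] == cat else 0.0 for col in columns for cat in categories[col]]
        for row in rows
    ]
    return feature_names, encoded_rows


# Toy rows (hypothetical subset) using values seen in the dataset preview
toy_rows = [
    {"ORGANIZATION": "Association", "SPECIAL_CONSIDERATIONS": "N"},
    {"ORGANIZATION": "Co-operative", "SPECIAL_CONSIDERATIONS": "N"},
    {"ORGANIZATION": "Trust", "SPECIAL_CONSIDERATIONS": "N"},
]
feature_names, encoded_rows = one_hot_encode(
    toy_rows, ["ORGANIZATION", "SPECIAL_CONSIDERATIONS"]
)
```

Each categorical column expands into one indicator column per category, which is why seven categorical columns become more than a hundred encoded features.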


### Adding the original DataFrame’s numerical variables to the DataFrame containing the encoded variables.


In [None]:
# Adding the numerical variables from the original DataFrame to the one-hot encoding DataFrame
encoded_df = pd.concat([applicant_data_df[["STATUS", "ASK_AMT", "IS_SUCCESSFUL"]],encoded_df],axis=1)

# Reviewing the Dataframe
encoded_df


Unnamed: 0,STATUS,ASK_AMT,IS_SUCCESSFUL,STATUS.1,ASK_AMT.1,IS_SUCCESSFUL.1,APPLICATION_TYPE_T10,APPLICATION_TYPE_T12,APPLICATION_TYPE_T13,APPLICATION_TYPE_T14,...,INCOME_AMT_1-9999,INCOME_AMT_10000-24999,INCOME_AMT_100000-499999,INCOME_AMT_10M-50M,INCOME_AMT_1M-5M,INCOME_AMT_25000-99999,INCOME_AMT_50M+,INCOME_AMT_5M-10M,SPECIAL_CONSIDERATIONS_N,SPECIAL_CONSIDERATIONS_Y
0,1,5000,1,1,5000,1,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,1,108590,1,1,108590,1,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,1,5000,0,1,5000,0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,1,6692,1,1,6692,1,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,1,142590,1,1,142590,1,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34294,1,5000,0,1,5000,0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
34295,1,5000,0,1,5000,0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
34296,1,5000,0,1,5000,0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
34297,1,5000,1,1,5000,1,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
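
The `STATUS.1`, `ASK_AMT.1`, and `IS_SUCCESSFUL.1` columns in the output above suggest that the concatenation cell may have been run more than once, which leaves repeated column names in `encoded_df`. The sketch below shows one way to make the step safe to re-run. The helper name `concat_once` and the toy frames are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd


def concat_once(numeric_df, encoded_df):
    """Concatenate column-wise, keeping only the first copy of any repeated column name."""
    combined = pd.concat([numeric_df, encoded_df], axis=1)
    # columns.duplicated() marks the second and later occurrences of a name
    return combined.loc[:, ~combined.columns.duplicated()]


# Toy frames (hypothetical) standing in for the numeric and encoded columns
numeric = pd.DataFrame(
    {"STATUS": [1, 1], "ASK_AMT": [5000, 108590], "IS_SUCCESSFUL": [1, 1]}
)
encoded = pd.DataFrame({"SPECIAL_CONSIDERATIONS_N": [1.0, 1.0]})

once = concat_once(numeric, encoded)
twice = concat_once(numeric, once)  # simulates running the cell a second time
```

With unique column names, selecting `IS_SUCCESSFUL` returns a one-dimensional Series, which matches the single output neuron that the model expects as its target.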


In [None]:
# Defining the target set y using the IS_SUCCESSFUL column
y = encoded_df["IS_SUCCESSFUL"]

# Displaying a sample of y
y


Unnamed: 0,IS_SUCCESSFUL,IS_SUCCESSFUL.1
0,1,1
1,1,1
2,0,0
3,1,1
4,1,1
...,...,...
34294,0,0
34295,0,0
34296,0,0
34297,1,1


In [None]:
# Defining features set X by selecting all columns but IS_SUCCESSFUL
X = encoded_df.drop(columns=["IS_SUCCESSFUL"])

# Reviewing the features DataFrame
X


Unnamed: 0,STATUS,ASK_AMT,STATUS.1,ASK_AMT.1,APPLICATION_TYPE_T10,APPLICATION_TYPE_T12,APPLICATION_TYPE_T13,APPLICATION_TYPE_T14,APPLICATION_TYPE_T15,APPLICATION_TYPE_T17,...,INCOME_AMT_1-9999,INCOME_AMT_10000-24999,INCOME_AMT_100000-499999,INCOME_AMT_10M-50M,INCOME_AMT_1M-5M,INCOME_AMT_25000-99999,INCOME_AMT_50M+,INCOME_AMT_5M-10M,SPECIAL_CONSIDERATIONS_N,SPECIAL_CONSIDERATIONS_Y
0,1,5000,1,5000,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,1,108590,1,108590,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,1,5000,1,5000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,1,6692,1,6692,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,1,142590,1,142590,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34294,1,5000,1,5000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
34295,1,5000,1,5000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
34296,1,5000,1,5000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
34297,1,5000,1,5000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


### Splitting the features and target sets into training and testing datasets.


In [None]:
# Splitting the preprocessed data into a training and testing dataset
# Assigning the function a random_state equal to 1
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)


### Using scikit-learn's `StandardScaler` to scale the features data.

In [None]:
# Creating a StandardScaler instance
scaler = StandardScaler()

# Fitting the scaler to the features training dataset
X_scaler = scaler.fit(X_train)

# Scaling the training and testing features using the fitted scaler
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
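
As a rough illustration of what the scaling step does, the plain-Python sketch below standardizes one column by hand: the mean and standard deviation are computed from the training values only and then reused for the test values. The helper names are assumptions, and the numbers are the `ASK_AMT` values from the first five rows shown earlier, split arbitrarily into "train" and "test" for the example.

```python
from statistics import fmean, pstdev


def fit_standardizer(column):
    """Return the mean and population standard deviation of a training column."""
    mean = fmean(column)
    std = pstdev(column)
    # Guard against constant columns so we never divide by zero
    return mean, (std if std > 0 else 1.0)


def standardize(column, mean, std):
    """Apply z = (x - mean) / std using statistics learned from the training data."""
    return [(value - mean) / std for value in column]


# Arbitrary illustrative split of the ASK_AMT values from the preview rows
train_ask = [5000, 108590, 5000, 6692]
test_ask = [142590]

mean, std = fit_standardizer(train_ask)
train_scaled = standardize(train_ask, mean, std)
test_scaled = standardize(test_ask, mean, std)
```

Fitting on the training split alone keeps information about the test data from leaking into the preprocessing step.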


---

## Compiling and Evaluating a Binary Classification Model Using a Neural Network

### Creating a deep neural network by assigning the number of input features, the number of layers, and the number of neurons on each layer using TensorFlow’s Keras.



In [None]:
# Defining the number of inputs (features) to the model
number_input_features = len(X_train.iloc[0])
# Reviewing the number of features
number_input_features


118

In [None]:
# Defining the number of neurons in the output layer
number_output_neurons = 1

In [None]:
# Defining the number of hidden nodes for the first hidden layer
hidden_nodes_layer1 =  (number_input_features + number_output_neurons) // 2 

# Reviewing the number of hidden nodes in the first layer
hidden_nodes_layer1


59

In [None]:
# Defining the number of hidden nodes for the second hidden layer
hidden_nodes_layer2 =  (hidden_nodes_layer1 + number_output_neurons) // 2 

# Reviewing the number of hidden nodes in the second layer
hidden_nodes_layer2


30
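
The sizing rule used in the two cells above (average the previous layer's width with the output width, rounding down) can be written as one small helper. This is a sketch only; the function name `halving_layer_sizes` is an assumption.

```python
def halving_layer_sizes(n_inputs, n_outputs, n_hidden_layers):
    """Size each hidden layer as (previous width + output width) // 2."""
    sizes = []
    previous = n_inputs
    for _ in range(n_hidden_layers):
        previous = (previous + n_outputs) // 2
        sizes.append(previous)
    return sizes


# 118 input features, 1 output neuron, 2 hidden layers
layer_sizes = halving_layer_sizes(118, 1, 2)
print(layer_sizes)  # [59, 30]
```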

In [None]:
# Creating the Sequential model instance
nn = Sequential()


In [None]:
# Adding the first hidden layer
nn.add(Dense(units=hidden_nodes_layer1, input_dim=number_input_features, activation="relu"))


In [None]:
# Adding the second hidden layer
nn.add(Dense(units=hidden_nodes_layer2, activation="relu"))


In [None]:
# Adding the output layer to the model specifying the number of output neurons and activation function
nn.add(Dense(units=number_output_neurons, activation="sigmoid"))


In [None]:
# Displaying the Sequential model summary
nn.summary()


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 59)                7021      
                                                                 
 dense_1 (Dense)             (None, 30)                1800      
                                                                 
 dense_2 (Dense)             (None, 1)                 31        
                                                                 
Total params: 8,852
Trainable params: 8,852
Non-trainable params: 0
_________________________________________________________________
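
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The sketch below reproduces the numbers for this architecture; the helper name `dense_param_counts` is an assumption.

```python
def dense_param_counts(n_inputs, layer_units):
    """Return the trainable parameter count of each dense layer in a stack."""
    counts = []
    fan_in = n_inputs
    for units in layer_units:
        # weights (fan_in * units) plus biases (units)
        counts.append(fan_in * units + units)
        fan_in = units
    return counts


# 118 inputs feeding layers of 59, 30, and 1 units
counts = dense_param_counts(118, [59, 30, 1])
total = sum(counts)
print(counts, total)  # [7021, 1800, 31] 8852
```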


### Compiling and fitting the model using the `binary_crossentropy` loss function, the `adam` optimizer, and the `accuracy` evaluation metric.


In [None]:
# Compiling the Sequential model
# A single sigmoid output neuron pairs with the binary_crossentropy loss
nn.compile(
    loss="binary_crossentropy",
    optimizer="adam",
    metrics=["accuracy"]
)
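
For reference, the `binary_crossentropy` loss and the `accuracy` metric named in the heading above can be sketched in plain Python for a single sigmoid output. This is an illustrative approximation, not Keras's implementation; the function names, the clipping constant `eps`, and the example probabilities are assumptions.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the label."""
    hits = sum(1 for y, p in zip(y_true, y_prob) if (p > threshold) == bool(y))
    return hits / len(y_true)


# Labels are IS_SUCCESSFUL for the first three preview rows;
# the probabilities are made-up model outputs for illustration only.
labels = [1, 1, 0]
probabilities = [0.9, 0.4, 0.2]

loss = binary_cross_entropy(labels, probabilities)
accuracy = binary_accuracy(labels, probabilities)
```

Both functions expect one probability per sample and one 0/1 label per sample, which is the shape a single sigmoid output neuron and a single target column provide.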


In [None]:
nn

<keras.engine.sequential.Sequential at 0x7f2486cd2510>

In [None]:
tf.keras.utils.plot_model 

<function keras.utils.vis_utils.plot_model>

In [None]:
# Fitting the model using 50 epochs and the training data
fit_model = nn.fit(
    X_train_scaled,
    y_train,
    epochs=50,
)


Epoch 1/50


InvalidArgumentError: ignored

### Evaluating the model using the test data to determine the model’s loss and accuracy.


In [None]:
# Evaluating the model loss and accuracy metrics using the evaluate method and the test data
model_loss, model_accuracy = nn.evaluate(X_test_scaled,y_test,verbose=2)

# Displaying the model loss and accuracy results
print(f"Loss: {model_loss}, Accuracy: {model_accuracy}")

  return dispatch_target(*args, **kwargs)


InvalidArgumentError: ignored

### Saving and exporting the model to an HDF5 file 


In [None]:
# Setting the model's file path
file_path = Path("Resources/AlphabetSoup.h5")

# Exporting the model to an HDF5 file
nn.save(file_path)


---

## Optimizing the neural network model


### Defining two alternative deep neural network models (resulting in the original plus two optimization attempts).



#### Alternative Model 1

In [None]:
# Defining the number of inputs (features) to the model
number_input_features = len(X_train.iloc[0])

# Reviewing the number of features
number_input_features

118

In [None]:
# Defining the number of neurons in the output layer
number_output_neurons_A1 = 1

In [None]:
# Defining the number of hidden nodes for the first hidden layer
hidden_nodes_layer1_A1 = (number_input_features + number_output_neurons_A1) // 2

# Reviewing the number of hidden nodes in the first layer
hidden_nodes_layer1_A1

59

In [None]:
# Creating the Sequential model instance
nn_A1 = Sequential()

In [None]:
# First hidden layer
nn_A1.add(
    tf.keras.layers.Dense(
        units=hidden_nodes_layer1_A1,
        input_dim=number_input_features,
        activation="relu"
    )
)




In [None]:
# Output layer
nn_A1.add(
    tf.keras.layers.Dense(
        units=number_output_neurons_A1,
        activation="sigmoid"
    )
)


In [None]:
# Checking the structure of the model
nn_A1.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_7 (Dense)             (None, 59)                7021      
                                                                 
 dense_8 (Dense)             (None, 1)                 60        
                                                                 
 dense_9 (Dense)             (None, 1)                 2         
                                                                 
Total params: 7,083
Trainable params: 7,083
Non-trainable params: 0
_________________________________________________________________


In [None]:
# Compiling the Sequential model
nn_A1.compile(
    loss="binary_crossentropy",
    optimizer="adam",
    metrics=["accuracy"]
)

In [None]:
# Fitting the model using 50 epochs and the training data
fit_model_A1 = nn_A1.fit(
    X_train_scaled,
    y_train,
    epochs=50,
)


NameError: ignored

#### Alternative Model 2

In [None]:
# Defining the number of inputs (features) to the model
number_input_features = len(X_train.iloc[0])

# Reviewing the number of features
number_input_features

118

In [None]:
# Defining the number of neurons in the output layer
number_output_neurons_A2 = 1

In [None]:
# Defining the number of hidden nodes for the first hidden layer
# (about two-thirds of the input features, rounded to a whole number of units)
hidden_nodes_layer1_A2 = round(number_input_features * 2 / 3)

# Reviewing the number of hidden nodes in the first layer
hidden_nodes_layer1_A2

79

In [None]:
# Creating the Sequential model instance
nn_A2 = Sequential()

In [None]:
# First hidden layer
nn_A2.add(
    tf.keras.layers.Dense(
        units=hidden_nodes_layer1_A2,
        input_dim=number_input_features,
        activation="sigmoid"
    )
)





In [None]:
# Output layer
nn_A2.add(
    tf.keras.layers.Dense(
        units=number_output_neurons_A2,
        activation="sigmoid"
    )
)

In [None]:
# Checking the structure of the model
nn_A2.summary()

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_10 (Dense)            (None, 79)                9401      
                                                                 
 dense_11 (Dense)            (None, 1)                 80        
                                                                 
 dense_12 (Dense)            (None, 1)                 2         
                                                                 
Total params: 9,483
Trainable params: 9,483
Non-trainable params: 0
_________________________________________________________________


In [None]:
# Compiling the model
nn_A2.compile(
    loss="binary_crossentropy",
    optimizer="adam",
    metrics=["accuracy"]
)


In [None]:
# Fitting the model
fit_model_A2 = nn_A2.fit(
    X_train_scaled,
    y_train,
    epochs=50
)

Epoch 1/50


ValueError: ignored

### Displaying the accuracy scores achieved by each model, and comparing the results.

In [None]:
print("Original Model Results")

# Evaluating the model loss and accuracy metrics using the evaluate method and the test data
model_loss, model_accuracy = nn.evaluate(X_test_scaled,y_test,verbose=2)

# Displaying the model loss and accuracy results
print(f"Loss: {model_loss}, Accuracy: {model_accuracy}")

Original Model Results


InvalidArgumentError: ignored

In [None]:
print("Alternative Model 1 Results")

# Evaluating the model loss and accuracy metrics using the evaluate method and the test data
model_loss, model_accuracy = nn_A1.evaluate(X_test_scaled, y_test, verbose=2)

# Displaying the model loss and accuracy results
print(f"Loss: {model_loss}, Accuracy: {model_accuracy}")

In [None]:
print("Alternative Model 2 Results")

# Evaluating the model loss and accuracy metrics using the evaluate method and the test data
model_loss, model_accuracy = nn_A2.evaluate(X_test_scaled, y_test, verbose=2)

# Displaying the model loss and accuracy results
print(f"Loss: {model_loss}, Accuracy: {model_accuracy}")
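
The three evaluation cells above repeat the same pattern, so the comparison could also be collected in one place. The sketch below assumes each model was compiled with a single `accuracy` metric, so that `evaluate` returns a loss and an accuracy; the helper name `evaluate_models` is hypothetical.

```python
def evaluate_models(models, X_test, y_test):
    """Evaluate each named model and return (name, loss, accuracy) tuples,
    sorted from highest to lowest accuracy."""
    results = []
    for name, model in models.items():
        # Assumes evaluate() returns [loss, accuracy] for a single-metric model
        loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
        results.append((name, loss, accuracy))
    return sorted(results, key=lambda row: row[2], reverse=True)
```

In this notebook it could be called as `evaluate_models({"Original": nn, "Alternative 1": nn_A1, "Alternative 2": nn_A2}, X_test_scaled, y_test)` once all three models have been trained.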

### Saving each of my alternative models as an HDF5 file.


In [None]:
# Setting the file path for the first alternative model
file_path = Path("Resources/AlphabetSoup_A1.h5")

# Exporting the model to an HDF5 file
nn_A1.save(file_path)


In [None]:
# Setting the file path for the second alternative model
file_path = Path("Resources/AlphabetSoup_A2.h5")

# Exporting the model to an HDF5 file
nn_A2.save(file_path)