# Data Modeling

Due to the size of the dataset, we will break it into manageable chunks for modeling.

Our group decided to focus the analysis on Florida and Maryland. We chose these states because their senators were chairmen of the committee overseeing the PPP.

From there, we will fit a RandomForestRegressor to model Cost Per Job (CostPerJob) as the dependent variable.

In [None]:
import pandas as pd
import numpy as np

# Read back in Dataset

ppp = pd.read_csv('PPP_DATASET.csv')

In [None]:
# sklearn libraries

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Next we can split the dataset into specific states.
# .copy() gives each state its own DataFrame, so the in-place drops below
# do not raise SettingWithCopyWarning.

ppp_FL = ppp[ppp['ProjectState_FL'] == 1].copy() #<---------- Grab only Florida
ppp_MD = ppp[ppp['ProjectState_MD'] == 1].copy() #<---------- Grab only Maryland

# Since we no longer need the project state columns, we can drop them all using a regex

ppp_FL.drop(list(ppp_FL.filter(regex = 'ProjectState')), axis = 1, inplace = True)
ppp_MD.drop(list(ppp_MD.filter(regex = 'ProjectState')), axis = 1, inplace = True)
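The split-then-drop pattern can be tried on a toy frame first. The sketch below uses made-up rows and only a handful of columns, so it is an illustration of the technique rather than the real PPP data. It takes an explicit copy of the filtered rows before dropping every column whose name matches the regex.

```python
import pandas as pd

# Toy frame with hypothetical values (not the real PPP data).
toy = pd.DataFrame({
    "ProjectState_FL": [1, 0, 1],
    "ProjectState_MD": [0, 1, 0],
    "JobsReported": [5, 12, 3],
    "CostPerJob": [4000.0, 2500.0, 6100.0],
})

# Boolean filtering returns a new object; .copy() makes that explicit so
# later edits do not trigger SettingWithCopyWarning.
toy_FL = toy[toy["ProjectState_FL"] == 1].copy()

# filter(regex=...) selects every column whose name matches the pattern.
state_cols = list(toy_FL.filter(regex="ProjectState").columns)
toy_FL = toy_FL.drop(columns=state_cols)

print(state_cols)
print(toy_FL.columns.tolist())
```

Only the two remaining Florida rows and the non-state columns survive, which is the shape the models further down expect.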

In addition to the variables above, there are others that are not pertinent to our model and can be dropped. The cell below drops them from the Florida data.

In [None]:
ppp_FL.drop(['Phase', 'BusinessAgeDescription_Change of Ownership',
                      'BusinessAgeDescription_Startup, Loan Funds will Open Business', 'Race_Eskimo & Aleut', 
                      'Race_Multi Group', 'Race_Puerto Rican', 'BusinessType_501(c) – Non Profit except 3,4,6,', 
                      'BusinessType_501(c)19 – Non Profit Veterans',
                      'BusinessType_501(c)3 – Non Profit', 'BusinessType_501(c)6 – Non Profit Membership', 
                      'BusinessType_Cooperative', 'BusinessType_Employee Stock Ownership Plan(ESOP)',
                      'BusinessType_Housing Co-op', 'BusinessType_Independent Contractors', 
                      'BusinessType_Joint Venture', 'BusinessType_Limited Liability Partnership', 
                      'BusinessType_Non-Profit Childcare Center', 'BusinessType_Non-Profit Organization',
                      'BusinessType_Partnership', 'BusinessType_Professional Association',
                      'BusinessType_Qualified Joint-Venture (spouses)', 'BusinessType_Rollover as Business Start-Ups (ROB', 
                      'BusinessType_Self-Employed Individuals', 'BusinessType_Single Member LLC', 
                      'BusinessType_Sole Proprietorship', 'BusinessType_Tenant in Common', 'BusinessType_Tribal Concerns',
                      'BusinessType_Trust', 'CurrentApprovalAmount', 'ForgivenessAmount'], 
            axis = 1, inplace = True)

## Running the models

As stated above, there are two models we can build that are important for our presentation: one for Florida and one for Maryland.

Below we set up the initial models for both states. We then dive deeper into the Florida model, considering which features are valuable for improving the model's accuracy and reducing concerns about over-fitting.
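One common way to gauge over-fitting is to compare R^2 on the training rows with R^2 on held-out rows. The sketch below does this on synthetic data. The data, the noise level, and the `random_state` values are all assumptions made for illustration, and nothing here reflects results from the PPP dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): one informative
# feature, four noise features, and a noisy target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 5))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=2.0, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=0
)

rf_demo = RandomForestRegressor(n_estimators=100, random_state=0)
rf_demo.fit(X_train, y_train)

train_r2 = rf_demo.score(X_train, y_train)  # R^2 on rows the forest has seen
test_r2 = rf_demo.score(X_test, y_test)     # R^2 on held-out rows

print(f"train R^2: {train_r2:.3f}")
print(f"test R^2:  {test_r2:.3f}")
```

A large gap between the two numbers suggests the forest is memorizing noise in the training rows. A small gap suggests it generalizes about as well as it fits.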

### Florida Model

In [None]:
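X = ppp_FL.drop('CostPerJob', axis = 1)
y = ppp_FL['CostPerJob']

# Hold out 10% of the rows for testing
XFL_train, XFL_test, yFL_train, yFL_test = train_test_split(X, y, test_size=.1)
yFL_test = np.array(yFL_test)

rf_FL = RandomForestRegressor()
rf_FL.fit(XFL_train, yFL_train)

# R^2 on the training rows, then on the held-out rows;
# a much lower test score would point to over-fitting
print(rf_FL.score(XFL_train, yFL_train))
print(rf_FL.score(XFL_test, yFL_test))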

### Maryland Model

In [None]:
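X = ppp_MD.drop('CostPerJob', axis = 1)
y = ppp_MD['CostPerJob']

# Hold out 10% of the rows for testing
XMD_train, XMD_test, yMD_train, yMD_test = train_test_split(X, y, test_size=.1)

rf_MD = RandomForestRegressor()
rf_MD.fit(XMD_train, yMD_train)

# R^2 on the training rows, then on the held-out rows
print(rf_MD.score(XMD_train, yMD_train))
print(rf_MD.score(XMD_test, yMD_test))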

### Interpreting our Random Forest Models

To interpret our models we will use the treeinterpreter package, following this tutorial: https://coderzcolumn.com/tutorials/machine-learning/treeinterpreter-interpreting-tree-based-models-prediction-of-individual-sample. More information about treeinterpreter can be found here: https://github.com/andosa/treeinterpreter.

In [None]:
! pip install treeinterpreter

from treeinterpreter import treeinterpreter as ti

preds, bias, contributions = ti.predict(rf_FL, XFL_test)

In [None]:
print("Bias For Sample 0                        : %.2f"%bias[0])
print("Contributions For Sample 0               : %s"%contributions[0])
print("Prediction Based on Bias & Contributions : %.2f"%(bias[0] + contributions[0].sum()))
print("Actual Target Value                      : %.2f"%yFL_test[0])
print("Target Value As Per Treeinterpreter      : %.2f"%preds[0][0])

In [None]:
import random

# randint is inclusive on both ends, so valid positions are 0 .. len - 1
random_sample = random.randint(0, len(XFL_test) - 1)
print("Selected Sample     : %d"%random_sample)
print("Actual Target Value : %.2f"%yFL_test[random_sample])
print("Predicted Value     : %.2f"%preds[random_sample][0])

def create_contrbutions_df(contributions, random_sample, feature_names):
    # One-column table: the base value, each feature's contribution, then their sum
    contribs = contributions[random_sample].tolist()
    contribs.insert(0, bias[random_sample])
    contribs = np.array(contribs)
    contrib_df = pd.DataFrame(data=contribs, index=["Base"] + list(feature_names), columns=["Contributions"])
    prediction = contrib_df.Contributions.sum()
    contrib_df.loc["Prediction"] = prediction
    return contrib_df

# Use the model's feature columns; ppp_FL.columns would also include the target, CostPerJob
contrib_df = create_contrbutions_df(contributions, random_sample, list(XFL_test.columns))
contrib_df

In [None]:
import plotly.graph_objects as go

def create_waterfall_chart(contrib_df, prediction):
    fig = go.Figure(go.Waterfall(
        name = "Prediction", #orientation = "h", 
        measure = ["relative"] * (len(contrib_df)-1) + ["total"],
        x = contrib_df.index,
        y = contrib_df.Contributions,
        connector = {"mode":"between", "line":{"width":4, "color":"rgb(0, 0, 0)", "dash":"solid"}}
    ))

    fig.update_layout(title = "Prediction : %s"%prediction)

    return fig

create_waterfall_chart(contrib_df, contrib_df.loc["Prediction", "Contributions"])

Based on what we see here, we can simplify the model by reducing the number of variables while still getting meaningful contributions. Using the contributions table above, we will remove the features that are not contributing to the model.
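Because a contributions table describes a single sample, one possible way to make the selection less dependent on that sample is to average the absolute contributions across all test rows. The sketch below shows the idea with a made-up contributions array. The feature names and the threshold are hypothetical, and this is not necessarily how the list of dropped features was chosen.

```python
import numpy as np
import pandas as pd

# Made-up contributions for 3 samples x 4 features; the feature names
# are hypothetical stand-ins, not columns from the PPP data.
feature_names = ["feat_a", "feat_b", "feat_c", "feat_d"]
toy_contributions = np.array([
    [120.0, 0.0, -30.0, 1.0],
    [-200.0, 0.0, 45.0, 0.0],
    [80.0, 0.0, -15.0, -2.0],
])

# Average the absolute contribution per feature so positive and negative
# effects do not cancel out.
mean_abs = pd.Series(np.abs(toy_contributions).mean(axis=0), index=feature_names)
mean_abs = mean_abs.sort_values(ascending=False)

# Features below an arbitrary, illustrative threshold are drop candidates.
threshold = 5.0
drop_candidates = sorted(mean_abs[mean_abs < threshold].index)

print(mean_abs)
print(drop_candidates)
```

With real output, the same computation would be applied to the contributions array returned by `ti.predict`, using the model's feature columns as the index.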

In [None]:
ppp_FL.drop(['InitialApprovalAmount', 'RENT_PROCEED', 'REFINANCE_EIDL_PROCEED',
                     'HEALTH_CARE_PROCEED', 'DEBT_INTEREST_PROCEED', 'FranchiseYN', 'ForgivenYN', 
                     'Party_R', 'BusinessAgeDescription_Existing or more than 2 years old',
                     'Race_American Indian or Alaska Native', 'Race_Asian', 'Race_Native Hawaiian or Other Pacific Islander',
                     'Race_Unanswered', 'Race_White', 'Ethnicity_Hispanic or Latino', 'Ethnicity_Not Hispanic or Latino', 
                     'Ethnicity_Unknown/NotStated', 'BusinessType_Subchapter S Corporation', 'Gender_Female Owned',
                     'Veteran_Veteran', 'RuralYN_R', 'RuralYN_U'], axis = 1, inplace=True)

Now that we have trimmed the set down to the more valuable features, we can re-run the model.

In [None]:
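X = ppp_FL.drop('CostPerJob', axis = 1)
y = ppp_FL['CostPerJob']

# Hold out 10% of the rows for testing
XFL_train, XFL_test, yFL_train, yFL_test = train_test_split(X, y, test_size=.1)
yFL_test = np.array(yFL_test)

rf_FL = RandomForestRegressor()
rf_FL.fit(XFL_train, yFL_train)

# R^2 on the training rows, then on the held-out rows
print(rf_FL.score(XFL_train, yFL_train))
print(rf_FL.score(XFL_test, yFL_test))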

In [None]:
# Let's consider the feature importance for the Florida model:

importanceFL = rf_FL.feature_importances_

# Print each feature's name next to its importance score
for name, v in zip(XFL_train.columns, importanceFL):
    print('Feature: %s, Score: %.5f' % (name, v))

In [None]:
from matplotlib import pyplot

pyplot.bar([x for x in range(len(importanceFL))], importanceFL)
pyplot.show()

In [None]:
importancesFL = pd.DataFrame({'feature':XFL_train.columns, 'importance': np.round(rf_FL.feature_importances_,3)})
importancesFL = importancesFL.sort_values('importance',ascending=False).set_index('feature')

importancesFL.head(25)

In [None]:
importancesFL.plot.bar()

The following passage is quoted directly from a tutorial on treeinterpreter (https://coderzcolumn.com/tutorials/machine-learning/treeinterpreter-interpreting-tree-based-models-prediction-of-individual-sample) and is included to help with interpretation:

"The treeinterpreter is based on a concept that when making a particular prediction decision tree or random forest follows a particular path to come to that prediction. Each node in the decision tree represent some feature and makes decisions based on the feature value in the sample. The treeinterpreter divides prediction region space into regions the same as the number of leaves present in that tree. At each internal node in a tree, the prediction value will be the average of all possible predictions in data from the path going through that node. We'll have the average value for the root node as well this way which will be the average of all predictions. This way we'll have some prediction value at each node in the tree. The treeinterpreter uses these values to find out the contributions of each feature in prediction by finding out the difference in prediction by a particular node and the node in the path before it. It follows the same process for the random forest where there is more than one tree and the final prediction is taken based on an average of all trees predictions....The treeinterpreter takes as input tree-based model and samples and returns the base value for each sample, contributions of each feature into a prediction of each sample, and predictions for each sample."

"The treeinterpreter has a single method named predict() which takes as input model instance and dataset for which we need explanations. It returns three arrays as output.

- The first array is predictions for a number of samples passed to the method.
- The second array is bias or base value for each sample of data to which individual feature contribution will be added to generate a final prediction.
- The third array is of size (#samples x #no_of_features) as it has the contribution of each feature for each sample which gets added to base/bias value to generate predictions."
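The path-based idea in the quote can be illustrated without any library. The sketch below walks a tiny hand-built regression tree and credits each change in node mean to the feature that was split on. The tree, its values, and the feature names `loan_amount` and `jobs` are all hypothetical, and this is a simplified illustration of the concept rather than treeinterpreter's actual implementation.

```python
# A tiny hand-built regression tree (hypothetical values, not fitted to
# any data). Every node stores the mean target of the samples that reach
# it; internal nodes also store a split.
toy_tree = {
    "value": 50.0, "feature": "loan_amount", "threshold": 100.0,
    "left": {"value": 30.0},
    "right": {
        "value": 70.0, "feature": "jobs", "threshold": 10.0,
        "left": {"value": 90.0},
        "right": {"value": 60.0},
    },
}

def explain(tree, sample):
    """Return (prediction, bias, contributions) for one sample."""
    base = tree["value"]           # root mean is the base value
    contribs = {}
    node = tree
    while "feature" in node:       # walk down until a leaf is reached
        feature = node["feature"]
        go_left = sample[feature] <= node["threshold"]
        child = node["left"] if go_left else node["right"]
        # Credit the change in node mean to the feature that was split on.
        contribs[feature] = contribs.get(feature, 0.0) + child["value"] - node["value"]
        node = child
    return node["value"], base, contribs

toy_pred, toy_bias, toy_contribs = explain(toy_tree, {"loan_amount": 250.0, "jobs": 4})
print(toy_pred, toy_bias, toy_contribs)
```

The base value plus the sum of the contributions equals the leaf prediction, which is the same identity the notebook checks when it prints bias plus contributions for sample 0.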

In [None]:
preds, bias, contributions = ti.predict(rf_FL, XFL_test)

In [None]:
print("Bias For Sample 0                        : %.2f"%bias[0])
print("Contributions For Sample 0               : %s"%contributions[0])
print("Prediction Based on Bias & Contributions : %.2f"%(bias[0] + contributions[0].sum()))
print("Actual Target Value                      : %.2f"%yFL_test[0])
print("Target Value As Per Treeinterpreter      : %.2f"%preds[0][0])

In [None]:
# randint is inclusive on both ends, so valid positions are 0 .. len - 1
random_sample = random.randint(0, len(XFL_test) - 1)
print("Selected Sample     : %d"%random_sample)
print("Actual Target Value : %.2f"%yFL_test[random_sample])
print("Predicted Value     : %.2f"%preds[random_sample][0])

# Reuse create_contrbutions_df from the earlier cell, passing the model's
# feature columns (ppp_FL.columns would also include the target, CostPerJob)
contrib_df = create_contrbutions_df(contributions, random_sample, list(XFL_test.columns))
contrib_df

In [None]:
create_waterfall_chart(contrib_df, contrib_df.loc["Prediction", "Contributions"])