<h1 style="color:red;">Credit rating assignment</h1>
<p></p>
In this assignment, we'll work our way through a simple ML exercise. Machine learning is an iterative process that starts with feature engineering (making the features ready for ML), works its way through various models and hyperparameter tuning exercises, and ends when we find a model that seems to work well for us.

<h3 style="color:green;">The problem: Rating creditworthiness of loan applicants</h3>

When banks issue loans to individuals, they have two goals that conflict with each other:
<ol>
    <li>Give as many loans as possible (fees, interest, all add to revenue)</li>
    <li>Try not to give loans to individuals who won't pay them back (lose money on the loan, collection costs, etc.)</li>
</ol>
    
<li>A typical machine learning program in this space tries to find a suitable tradeoff between finding many good loans and not calling a bad loan good</li>

<li>In this assignment, we'll try to build a "good" model that finds a good tradeoff between these two objectives</li>

<li>In machine learning terms, the proportion of times we get our guess right (i.e., the number of bad loans we call bad plus the number of good loans we call good, divided by the total number of cases) is called <span style="color:blue">accuracy</span></li>

<li>The proportion of actual good loans that we identify as good loans is known as <span style="color:blue">recall</span></li>

<li>The probability that if a loan is called good it actually is good is called <span style="color:blue">precision</span></li>

<li>The precision-recall tradeoff is summarized by a score called the <span style="color:blue">f1 score</span>, the harmonic mean of precision and recall</li>

<li>An important part of running an ML model is trying to figure out "which metric is right for you"</li>
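The four metrics defined above can all be computed from the four cells of a confusion matrix. The sketch below is a minimal illustration in plain Python; the function name and the counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def classification_metrics(tn, fp, fn, tp):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    # Guard against division by zero when nothing is predicted or labelled positive
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, for illustration only (not taken from the loan data)
example_metrics = classification_metrics(tn=80, fp=10, fn=4, tp=6)
print(example_metrics)
```

Note how accuracy (0.86) looks healthy here even though precision (0.375) and recall (0.6) are modest; this is the pattern to watch for with imbalanced data.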


    
    
<ol>
    <li>We'll try the SGD classifier, tune hyperparameters using grid search, and examine the results</li>
    <li>Then, set up the data for a random forest classifier, run a grid search, and examine the results</li>
    <li>Finally, run a couple of gradient boosting models</li>
    <li>Draw precision-recall curves and ROC curves for the classifiers and compare the results</li>
    <li>Note that grid search is a compute-intensive activity. I've simplified the search to a few options but even those can take a long while (less than 15 minutes on my laptop but could be a couple of hours if you have an older machine)</li>
</ol>

<h3 style="color:green;">The models</h3>
<p></p>
<li><b>Model 1 SGD Classifier</b>: Vanilla version with max_iter set to 1000</li>
<li><b>Model 2 SGD Classifier round 2</b>: SGD Classifier with positive cases assigned a higher weight. One issue with our data is that positive cases are vastly outnumbered by negative cases (in other words, a model that says all cases are negative will have a pretty good accuracy). By overweighting positive cases, we push the model to identify more of the actual positives</li>
<li><b>Model 3 SGD Classifier round 3</b>: Best SGD Classifier model after grid search</li>
<li><b>Model 4 Random Forest Classifier round 1</b>: Random Forest Classifier with base parameters (see below)</li>
<li><b>Model 5 Random Forest Classifier round 2</b>: Best model from grid search</li>
<li><b>Model 6 Gradient Booster Classifier</b></li>
<li><b>Model 7 Gradient Booster Classifier (2nd model)</b></li>

For each model, collect model metrics in the following dataframe results_df. After each model run, replace the 0.0 with the appropriate metric value


In [1]:
import pandas as pd
import numpy as np
results_df = pd.DataFrame(np.zeros(shape=(7,6)))
results_df.index=[1,2,3,4,5,6,7]
results_df.columns = ["accuracy","precision","recall","f1_score","AUC","AP"]
results_df.index.rename("Model",inplace=True)
results_df

Model,accuracy,precision,recall,f1_score,AUC,AP
1,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0


<h3 style="color:green;">The data</h3>
<p></p>
<li>A curated extract from the popular Lending Club loan data. The data is in the file loan_data_small.csv</li>
<li>The dataset contains information about loan applications: very basic information about the applicant and the status of the loan</li>
<li>The goal of the ML exercise is to build a model that uses information about the loan to predict whether a loan is a "good" one (i.e., it will be paid back) or a "bad" one (the money is unrecoverable)</li>
<li>Note that we're only using a fraction of the data. If you're interested, I can share the curated extract on a larger fraction which gives better results (but can crash your machine!)</li>

<h1 style="color:red;font-size:xx-large">Data preparation and feature engineering</h1>


<h3 style="color:green;">Build a binary target</h3>

<li>For the purposes of this analysis, drop rows that contain any NaN values</li>
<li><b>Target</b>: For the classifier, classify any loans that have a loan_status value of "Charged Off","Default", or "Does not meet the credit policy. Status:Charged Off" as a bad loan and give these loans a target value of 1 (we're predicting bad loans)</li>
<li><b>Input features</b>: create the input feature dataframe (i.e., drop any columns that are not an independent variable). The input variables we're interested in are "int_rate", "grade", "home_ownership", "annual_inc", "loan_amnt", and "purpose"</li>
<p></p>
<li>The data should look like:</li>
<pre>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 565167 entries, 0 to 565166
Data columns (total 7 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   Unnamed: 0.1    565167 non-null  int64  
 1   int_rate        565167 non-null  float64
 2   grade           565167 non-null  object 
 3   home_ownership  565167 non-null  object 
 4   annual_inc      565167 non-null  float64
 5   loan_amnt       565167 non-null  int64  
 6   purpose         565167 non-null  object 
dtypes: float64(2), int64(2), object(3)
memory usage: 30.2+ MB
Out[108]:
0         False
1          True
2         False
3         False
4          True
          ...  
565162    False
565163    False
565164    False
565165     True
565166    False
Name: loan_status, Length: 565167, dtype: bool

</pre>
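The boolean-mask approach to the target can be sketched on a tiny frame. This is a minimal illustration, assuming pandas is available; the frame <code>toy_loans</code> and its rows are made up, and only the three "bad" status strings come from the assignment text.

```python
import pandas as pd

# Hypothetical mini-frame standing in for the loan data
toy_loans = pd.DataFrame(
    {"loan_status": ["Fully Paid", "Charged Off", "Current", "Default"]}
)

# The three statuses the assignment treats as a bad loan
bad_statuses = [
    "Charged Off",
    "Default",
    "Does not meet the credit policy. Status:Charged Off",
]

# isin builds the boolean mask in one step: True marks a bad loan
toy_target = toy_loans["loan_status"].isin(bad_statuses)
print(toy_target.tolist())
```

The mask can be left as booleans (sklearn treats True as the positive class) or converted with <code>astype(int)</code> if 0/1 values are preferred.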

In [3]:
#read the file
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv("../class-datasets/loan_data_small.csv")

#Drop rows with NaN values
df = df.dropna()



#Prepare the y (target) variable
#The target variable should be 1 if loan_status is "Charged Off","Default", or "Does not meet the credit policy. Status:Charged Off"
#And 0 otherwise
#(Hint: Create a boolean mask series)

y = (df.loan_status == "Charged Off") | (df.loan_status == "Default") | (df.loan_status == "Does not meet the credit policy. Status:Charged Off")

#remove unwanted input features "Unnamed: 0" and "loan_status"
df = df.drop(['Unnamed: 0', 'loan_status'], axis=1)

#Examine the df and the target
df.info()

y

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 565167 entries, 0 to 565166
Data columns (total 7 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   Unnamed: 0.1    565167 non-null  int64  
 1   int_rate        565167 non-null  float64
 2   grade           565167 non-null  object 
 3   home_ownership  565167 non-null  object 
 4   annual_inc      565167 non-null  float64
 5   loan_amnt       565167 non-null  int64  
 6   purpose         565167 non-null  object 
dtypes: float64(2), int64(2), object(3)
memory usage: 30.2+ MB


0         False
1          True
2         False
3         False
4          True
          ...  
565162    False
565163    False
565164    False
565165     True
565166    False
Name: loan_status, Length: 565167, dtype: bool

In [4]:
df.head()

Unnamed: 0,Unnamed: 0.1,int_rate,grade,home_ownership,annual_inc,loan_amnt,purpose
0,1131156,15.61,D,OWN,20000.0,9200,debt_consolidation
1,1526956,12.62,C,OWN,35000.0,6000,car
2,150283,14.47,C,RENT,75000.0,11000,other
3,1480461,11.99,B,RENT,45000.0,8500,other
4,2188054,21.45,D,MORTGAGE,55000.0,20000,debt_consolidation


<h3 style="color:green;">Label Encoding</h3>
<li>Since our first models are regression-based (the SGD classifier with a log loss is a logistic regression), all values need to be numerical. More generally, ML models deal with numerical data</li>
<li>But, <span style="color:blue">grade</span>, <span style="color:blue">purpose</span>, and <span style="color:blue">home_ownership</span> are not</li>
<li>sklearn's <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html">LabelEncoder</a> assigns numerical values to categorical data</li>
<li>LabelEncoder replaces each categorical string value with an integer - 0, 1, 2, ...</li>
<li>After label encoding, df.info() should return:</li>
<pre>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 565167 entries, 0 to 565166
Data columns (total 7 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   Unnamed: 0.1    565167 non-null  int64  
 1   int_rate        565167 non-null  float64
 2   grade           565167 non-null  int64  
 3   home_ownership  565167 non-null  int64  
 4   annual_inc      565167 non-null  float64
 5   loan_amnt       565167 non-null  int64  
 6   purpose         565167 non-null  int64  
dtypes: float64(2), int64(5)
memory usage: 30.2 MB
</pre>
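To see what label encoding does before applying it to the full dataframe, here is a minimal sketch on a hypothetical list of home-ownership values. LabelEncoder sorts the distinct values and numbers them in that order, so the integers carry no meaning beyond identifying the category.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical values, for illustration only
toy_ownership = ["OWN", "RENT", "MORTGAGE", "RENT"]

encoder = LabelEncoder()
toy_codes = encoder.fit_transform(toy_ownership)

print(list(encoder.classes_))  # categories in sorted order
print(toy_codes.tolist())      # position of each value in that sorted order
```

Because "MORTGAGE" sorts first it becomes 0, "OWN" becomes 1 and "RENT" becomes 2; a model that treated these as quantities would wrongly read RENT as "more" than OWN.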

In [5]:
#replace grade, purpose, and home_ownership by label encoded versions


from sklearn.preprocessing import LabelEncoder
df.grade = LabelEncoder().fit_transform(df.grade)
df.purpose = LabelEncoder().fit_transform(df.purpose)
df.home_ownership = LabelEncoder().fit_transform(df.home_ownership)



<h3 style="color:green;">One-hot encoding</h3>

<p></p>
<li>Regression treats the values of a feature as ordered quantities (a value of 2 is "more" than a value of 1)</li>
<li>But, this is not necessarily so for the label encoded categorical values</li>
<li>The way to deal with this in regression is to create dummy variables, one for each category, that take the value 1 if the category is present in the row and 0 otherwise</li>
<li>In ML, a procedure known as <a href="https://en.wikipedia.org/wiki/One-hot">one-hot encoding</a> is used to do this conversion</li>
<li>One-hot encoding converts a single column of categorical (integer) data with k categories into k columns of 0 or 1 values. Dropping one of them as an implicit baseline, as we do here, leaves k-1 columns</li>
<li>for example, the array with three possible categories [1,2,3,2,1] will be converted into the matrix:</li>

$$\begin{bmatrix} 0 & 0 \\ 1 & 0 \\ 0 & 1 \\ 1 & 0 \\ 0 & 0 \end{bmatrix}$$

<li>1's are replaced by (0, 0); 2's by (1, 0); and 3's by (0, 1). Note that category 1 is implicitly coded</li>
<li><b>Documentation</b>: <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html">https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html</a></li>
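The three-category example above can be reproduced with OneHotEncoder. This is a small sketch, assuming a scikit-learn version that supports the <code>drop</code> parameter; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# The example array [1, 2, 3, 2, 1] as a single-column 2D input
categories = np.array([[1], [2], [3], [2], [1]])

# drop="first" removes the column for the first category (1),
# so k = 3 categories produce k - 1 = 2 columns
dummy_encoder = OneHotEncoder(drop="first")

# The encoder returns a sparse matrix by default; toarray() makes it dense
dummies = dummy_encoder.fit_transform(categories).toarray()
print(dummies)
```

The rows come out as (0, 0), (1, 0), (0, 1), (1, 0), (0, 0), matching the matrix shown above.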

<h3 style="color:green;">Scaling</h3>

<p></p>
<li>Non-categorical independent variables need to be scaled so that they are on a comparable scale (SGD-based models in particular are sensitive to differences in feature scale)</li>
<li>We will standardize them so that the mean is 0 and the standard deviation is 1 using sklearn's StandardScaler feature transformer</li>
<li><a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html">https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html</a></li>

<li>All feature transformations can be encapsulated in a ColumnTransformer, built with sklearn's <a href="https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html">make_column_transformer</a> function</li>
<li>Use <span style="color:blue">make_column_transformer</span> to encapsulate both the one-hot coding as well as standard scaling. Note that the one-hot encoded columns are not scaled!</li>
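What standard scaling does to a column can be worked out by hand. The sketch below uses plain Python on the five int_rate values shown in df.head() above; it uses the population standard deviation, which is the convention StandardScaler follows.

```python
from statistics import mean, pstdev

# The five int_rate values from df.head()
int_rates = [15.61, 12.62, 14.47, 11.99, 21.45]

mu = mean(int_rates)
sigma = pstdev(int_rates)  # population standard deviation (divides by n)

# z-score: subtract the mean, divide by the standard deviation
scaled = [(x - mu) / sigma for x in int_rates]

print([round(z, 3) for z in scaled])
print(round(mean(scaled), 6), round(pstdev(scaled), 6))
```

After scaling, the column has mean 0 and standard deviation 1, so int_rate (tens) and annual_inc (tens of thousands) contribute on a comparable scale.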

In [6]:

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer, make_column_transformer

#Make a column transformer object that scales (using StandardScaler) the two non-categorical columns
# and one hot encodes (using OneHotEncoder) the three categorical columns
# Using make_column_transformer 
preprocess = make_column_transformer(
    (StandardScaler(), ['int_rate', 'annual_inc']),
    (OneHotEncoder(categories="auto", drop="first"), ['grade', 'home_ownership', 'purpose']),
)

#Generate the independent variable df
X = preprocess.fit_transform(df)
X.shape
#Should return (565167, 26)

(565167, 26)

<h3 style="color:green;">Train/Test split</h3>

<li><a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html">https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html</a></li>
<li>split the data into 70% training and 30% testing</li>
<li>make sure the x and y datasets are aligned</li>
<li>use random_state=42 to get the same split as in my code </li>
<li>x and y training data shapes: (395616, 26) (395616,)</li>
<li>x and y testing data shapes: (169551, 26) (169551,)</li>

In [7]:
from sklearn.model_selection import train_test_split
#Get x_train, x_test, y_train, y_test
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=42)

#And check the shape
print(x_train.shape,y_train.shape)
print(x_test.shape,y_test.shape)

"""
Should return:
(395616, 26) (395616,)
(169551, 26) (169551,)
"""

(395616, 26) (395616,)
(169551, 26) (169551,)



<h1 style="color:green">The models</h1>
<li>For each model, do the following</li>
<ol>
    <li>Fit a classifier to the training data</li>
    <li>calculate the metrics</li>
    <ul>
        <li>training accuracy</li>
        <li>testing accuracy</li>
        <li>precision on test dataset</li>
        <li>recall on test dataset</li>
        <li>f1 score on test dataset</li>
        <li>area under the curve on test dataset</li>
        <li>average precision on the test dataset</li>
    </ul>
    <li>Write up a brief (pointwise) interpretation of the results</li>
</ol>
<li>Chart the various metrics</li>


<h1 style="color:red;font-size:xx-large">Build Model 1</h1>


<h3 style="color:green;">Build the model on the training data set</h3>

<li>set random_state to 42 (if you want to get the same results that I got) and max_iter to 1000</li>
<li>set the loss function to "log_loss" ("log" if using sklearn 1.0.x or on colab)</li>

In [8]:
from sklearn.linear_model import SGDClassifier
model_1 = SGDClassifier(loss='log_loss', max_iter=1000, random_state=42)
model_1.fit(x_train,y_train) #change if you used different variable names

print(model_1.score(x_train,y_train))
print(model_1.score(x_test,y_test))
"""
You should get:
0.8846634109843889
0.8843828700508991
"""

0.8845243872846397
0.8842000342079964




<h3 style="color:green;">Model 1 metrics</h3>
<li>Report the following on the <b>test</b> data:</li>
<ul>
<li>the confusion matrix</li>
<li>the accuracy, precision, recall, f1-score, AUC, and AP </li>
</ul>


In [9]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score,recall_score,precision_score
from sklearn.metrics import average_precision_score,roc_auc_score

def get_scores(model, p=True):
    """Print test-set metrics when p is True; otherwise return them as a list."""
    predictions = model.predict(x_test)
    cfm = confusion_matrix(y_test, predictions)
    accuracy_training = model.score(x_train, y_train)
    accuracy_testing = model.score(x_test, y_test)
    precision = precision_score(y_test, predictions)
    recall = recall_score(y_test, predictions)
    f1 = f1_score(y_test, predictions)
    # AUC uses the predicted probability of the positive (bad loan) class
    auc = roc_auc_score(y_test, model.predict_proba(x_test)[:, 1])
    # Note: AP is computed from the hard 0/1 predictions here, which is what the
    # reported values reflect. Passing predicted probabilities instead would give
    # the ranking-based average precision.
    ap = average_precision_score(y_test, predictions)
    if p:
        print("Confusion Matrix: \n", cfm)
        print("Training accuracy: ", accuracy_training)
        print("Testing  accuracy: ", accuracy_testing)
        print("Precision: ", precision)
        print("Recall: ", recall)
        print("F1-Score: ", f1)
        print("AUC: ", auc)
        print("Average Precision: ", ap)
        return None
    return [accuracy_testing, precision, recall, f1, auc, ap]

get_scores(model_1)
"""

You should see:

Confusion Matrix: 
 [[149948      1]
 [ 19602      0]]
Training accuracy:  0.8846634109843889
Testing  accuracy:  0.8843828700508991
Precision:  0.0
Recall:  0.0
F1-Score:  0.0
AUC:  0.692962177388246
Average Precision:  0.11561123201868465
"""

Confusion Matrix: 
 [[149914     35]
 [ 19599      3]]
Training accuracy:  0.8845243872846397
Testing  accuracy:  0.8842000342079964
Precision:  0.07894736842105263
Recall:  0.00015304560759106213
F1-Score:  0.00030549898167006107
AUC:  0.6919801407783887
Average Precision:  0.1156056207754037



<h3 style="color:green;">Interpret the results</h3>
<li>In a few bullet points, write your interpretation of the results. Why are we seeing what we are seeing? Is it useful? Why is the AUC not 0.5?</li>

<h4>Interpretation</h4>
<li>The model predicts that almost every loan is safe: only 38 of the 169,551 test loans are flagged as bad. Consequently, recall is almost 0 and precision is very low.</li>
<li>Accuracy is still high because only about 12% of the loans are bad, so predicting that every loan is safe already gives roughly 88% accuracy.</li>
<li>AUC is above 0.5 because it is computed from the predicted probabilities rather than from the 0/1 predictions. The model ranks bad loans above good loans more often than chance would, even though almost none of its probabilities cross the 0.5 decision threshold.</li>
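A small example makes it concrete how a model can flag no positives at all and still have an AUC well above 0.5. The sketch below computes AUC as the share of (positive, negative) pairs in which the positive case gets the higher score; the labels, scores, and function name are hypothetical.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical data: every score is below the 0.5 decision threshold
toy_labels = [0, 0, 0, 0, 1, 1]
toy_scores = [0.05, 0.10, 0.35, 0.15, 0.30, 0.40]

hard_predictions = [int(s >= 0.5) for s in toy_scores]
toy_auc = pairwise_auc(toy_labels, toy_scores)

print(hard_predictions)  # no case is predicted positive, so recall is 0
print(toy_auc)           # 7 of the 8 pairs are ranked correctly: 0.875
```

The 0/1 predictions are useless here, yet the scores still order most positives above most negatives, which is exactly what AUC measures.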



<h3 style="color:green;">Update results_df</h3>


In [26]:

results_df.loc[1] = get_scores(model_1, p=False)



results_df

Model,accuracy,precision,recall,f1_score,AUC,AP
1,0.8842,0.078947,0.000153,0.000305,0.69198,0.115606
2,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0


<h1 style="color:red;font-size:xx-large">Build Model 2</h1>



<li>Many of sklearn's classifiers can be given a <span style="color:blue">class_weight</span> parameter</li>
<li>weights can be given explicitly (a dictionary mapping each class to a weight) or implicitly (class_weight="balanced")</li>
<li>note that by increasing the weight of the positive cases, our model is more likely to find true positives</li>
<li>and by decreasing the weight of the positive cases, our model is more likely to find true negatives</li>
<li>In Model 2, increase the weight of positives by a factor of 9 to roughly balance the positives and negatives</li>
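To see where a factor like 9 comes from, it helps to compute the weights that would exactly balance two classes. The sketch below follows the formula scikit-learn documents for class_weight="balanced", n_samples / (n_classes * class count); the function name and the class counts are hypothetical, picked so that positives make up 10% of the data.

```python
def balanced_class_weights(class_counts):
    """Weights that give every class the same total weight in the loss."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# Hypothetical counts: 900 negatives and 100 positives
toy_counts = {0: 900, 1: 100}
weights = balanced_class_weights(toy_counts)

# How much more one positive counts than one negative
relative_weight = weights[1] / weights[0]

print(weights)
print(relative_weight)
```

With a 9-to-1 class ratio the positive class ends up weighted 9 times as heavily as the negative class, which is the same relative weighting as passing class_weight={1: 9}.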

<h3 style="color:green">Build model 2 and report metrics</h3>

In [11]:
model_2 = SGDClassifier(loss='log_loss', max_iter=1000, random_state=42, class_weight={1:9})
model_2.fit(x_train,y_train) #change if you used different variable names

get_scores(model_2)


Confusion Matrix: 
 [[79991 69958]
 [ 5146 14456]]
Training accuracy:  0.5571412683005743
Testing  accuracy:  0.5570418340204423
Precision:  0.17125121425355982
Recall:  0.7374757677787981
F1-Score:  0.2779572373481003
AUC:  0.6938190900016709
Average Precision:  0.1566443706365479


<h3 style="color:green;">Interpret the results</h3>


<h4>Interpretation</h4>
<li>Above, we see that the test accuracy has gone down to about 56%. This makes sense: with positives weighted 9 times as heavily, the model can no longer minimize its loss by predicting that nearly every loan is safe, so it flags many more loans as bad and picks up many false positives (69,958).</li>
<li>However, recall is considerably higher in the new model. The model correctly classifies about 74% of the bad loans.</li>
<li>The AUC score is about the same as for model 1.</li>

<h3 style="color:green;">Update results_df</h3>

In [27]:

results_df.loc[2] = get_scores(model_2, p=False)
results_df

Model,accuracy,precision,recall,f1_score,AUC,AP
1,0.8842,0.078947,0.000153,0.000305,0.69198,0.115606
2,0.557042,0.171251,0.737476,0.277957,0.693819,0.156644
3,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0


<h1 style="color:red;font-size:xx-large">Build Model 3</h1>

<h3 style="color:green;">Tune hyperparameters using grid search</h3>
<li><span style="color:blue">parameters</span> versus <span style="color:blue">hyperparameters</span></li>
<ul>
    <li><span style="color:blue">parameters</span>: the parameters that are necessary for the model to make predictions. For example, the coefficients of the linear equation estimated by the SGD classifier are parameters of the model. Parameters are estimated by the algorithm and from the data</li>
    <li><span style="color:blue">hyperparameters</span>: parameters that are external to the model and cannot be estimated from the data. For example, in an SGD classifier, settings like the loss function, the regularization parameter, stopping rules, etc. are hyperparameters</li>
    </ul>
<li>In ML, hyperparameters are often set intuitively and then <span style="color:red">tuned</span> using a grid search</li>
<li>In a grid search, various combinations of hyperparameters are tried and <span style="color:blue">k-fold cross validation</span> is used to test the efficacy of the parameter combination</li>
<li>the best combination is then selected as a candidate model</li>

<h3 style="color:green;">The <span style="color:blue">scoring</span> parameter</h3>
<li>since our data is imbalanced, we should look for the model with the best f1 score (precision/recall tradeoff)</li>
<li>set the scoring parameter for GridSearchCV so that it maximizes the f1 score</li>
<li>Though we should be using a much wider range of parameters, I've reduced them so that it runs fairly quickly</li>
<li>This takes about 30 seconds on my machine. Could take longer on your machine</li>

In [16]:
%%time
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import SGDClassifier
#Set up the hyperparameter options in param_grid
param_grid = {
    'loss': ['log_loss'],
    'penalty': ['elasticnet'],
    'alpha': [.001,.01,.1, .5],
    'l1_ratio': [0.01, 1],
    'class_weight': [{1:3},{1:5},'balanced',{1:9},{1:15}]
}


model_3_gs = GridSearchCV(SGDClassifier(),param_grid,cv=3,scoring='f1',n_jobs=-1)
model_3_gs.fit(x_train, y_train)

CPU times: user 3.58 s, sys: 1.18 s, total: 4.76 s
Wall time: 1min 11s
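To get a feel for why grid search is expensive, the number of model fits can be counted directly from the grid. The sketch below is plain Python and copies the grid values and the cv setting from the cell above; nothing is fitted.

```python
from itertools import product

# Same values as the param_grid used in the grid search above
param_grid = {
    'loss': ['log_loss'],
    'penalty': ['elasticnet'],
    'alpha': [.001, .01, .1, .5],
    'l1_ratio': [0.01, 1],
    'class_weight': [{1: 3}, {1: 5}, 'balanced', {1: 9}, {1: 15}],
}
cv_folds = 3

# Every combination of one value per hyperparameter
combinations = list(product(*param_grid.values()))
n_fits = len(combinations) * cv_folds

print(len(combinations))  # 1 * 1 * 4 * 2 * 5 = 40 combinations
print(n_fits)             # 40 combinations * 3 folds = 120 fits
```

Each extra value added to any one list multiplies the work, and with the default refit behaviour GridSearchCV adds one more fit of the best combination on the full training set.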



<h3 style="color:green;">Get the best model parameters</h3>


In [17]:
model_3_gs.best_score_, model_3_gs.best_params_


(0.28544402381805095,
 {'alpha': 0.001,
  'class_weight': 'balanced',
  'l1_ratio': 0.01,
  'loss': 'log_loss',
  'penalty': 'elasticnet'})

<h3 style="color:green;">Run the best model and report metrics</h3>
<li>Run the classifier using the best parameters</li>






In [29]:
model_3 = SGDClassifier(alpha=0.001,
                        class_weight='balanced',
                        l1_ratio=0.01,
                        loss='log_loss',
                        penalty='elasticnet')
model_3.fit(x_train,y_train) #change if you used different variable names

get_scores(model_3)


Confusion Matrix: 
 [[93882 56067]
 [ 6970 12632]]
Training accuracy:  0.6289760778128286
Testing  accuracy:  0.6282121603529321
Precision:  0.18387458332726822
Recall:  0.6444240383634323
F1-Score:  0.2861122750591726
AUC:  0.6863429607286761
Average Precision:  0.15960177654118377


<h3 style="color:green;">Interpret the results</h3>


<h4>Interpretation</h4>
<li>Compared with model 2, this model improves accuracy, precision, and f1_score (which, of course, is the scoring parameter used in the grid search).</li>
<li>However, recall is reduced from about 74% to 64%.</li>
<li>The AUC score is marginally lower.</li>


<h3 style="color:green;">Update results_df</h3>

In [30]:


results_df.loc[3] = get_scores(model_3, p=False)


results_df 

Model,accuracy,precision,recall,f1_score,AUC,AP
1,0.8842,0.078947,0.000153,0.000305,0.69198,0.115606
2,0.557042,0.171251,0.737476,0.277957,0.693819,0.156644
3,0.628212,0.183875,0.644424,0.286112,0.686343,0.159602
4,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0


<h1 style="color:red;font-size:xx-large">Build Model 4</h1>

<h3 style="color:green;">Random Forest Classifier</h3>
<li>We need to improve recall and precision so perhaps a non-linear classifier will help</li>

<h3 style="color:green;">Build, fit, and report metrics</h3>

<li>Run this with the following parameters (these are our base parameters)</li>
<li>random_state=42,n_estimators=30,max_depth=6,min_samples_leaf=2000,min_samples_split=4000,class_weight={1:5}</li>


In [31]:
from sklearn.ensemble import RandomForestClassifier
model_4 = RandomForestClassifier(random_state=42,n_estimators=30,max_depth=6,min_samples_leaf=2000,min_samples_split=4000,class_weight={1:5})
model_4.fit(x_train,y_train)


In [32]:
get_scores(model_4)

Confusion Matrix: 
 [[132351  17598]
 [ 13506   6096]]
Training accuracy:  0.81824294265146
Testing  accuracy:  0.816550772333988
Precision:  0.2572803241326918
Recall:  0.31098867462503826
F1-Score:  0.28159645232815966
AUC:  0.6928528398019439
Average Precision:  0.1596687152105522


<h3 style="color:green;">Interpreting model 4 results</h3>
<p></p>

<h4>Interpretation</h4>
<li>This model improves precision to about 26% while reducing recall to 31%.</li>
<li>However, the F1 score is slightly worse than model 3's.</li>
<li>Accuracy improves as well, to about 82%.</li>
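Why does the F1 score dip even though precision went up? F1 is the harmonic mean of precision and recall, so it is pulled toward the smaller of the two. The sketch below recomputes F1 for models 3 and 4 from the precision and recall values printed above; the helper name is illustrative.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for models 3 and 4
f1_model_3 = f1_from(0.18387458332726822, 0.6444240383634323)
f1_model_4 = f1_from(0.2572803241326918, 0.31098867462503826)

print(round(f1_model_3, 4))  # 0.2861
print(round(f1_model_4, 4))  # 0.2816
```

Model 4 gains about 7 points of precision but gives up more than 30 points of recall, and the harmonic mean ends up slightly lower as a result.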

<h3 style="color:green;">Update results_df</h3>

In [37]:
results_df.loc[4] = get_scores(model_4, p=False)




results_df

Model,accuracy,precision,recall,f1_score,AUC,AP
1,0.8842,0.078947,0.000153,0.000305,0.69198,0.115606
2,0.557042,0.171251,0.737476,0.277957,0.693819,0.156644
3,0.628212,0.183875,0.644424,0.286112,0.686343,0.159602
4,0.816551,0.25728,0.310989,0.281596,0.692853,0.159669
5,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0


<h1 style="color:red;font-size:xx-large">Build Model 5</h1>

<h3 style="color:green;">Random Forest Grid Search</h3>
<p></p>


<li>Run the grid search to find the best model</li>
<li>Note that this will take a while, perhaps even a couple of hours (25 minutes on my laptop). Let it run. Get some coffee or whatever beverage you like. Then come back in a while to check out the results!</li>
<li>If you want to speed it up, reduce n_estimators (n_estimators is the number of trees generated and is the single most expensive part of the grid search)</li>


In [43]:
%%time
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

parameters = {
    'n_estimators': (800,),  # the number of trees
    'min_samples_split': (100, 200),
    'class_weight': [{1: 6}],
    'min_samples_leaf': (10, 20),
}
gs_clf = GridSearchCV(RandomForestClassifier(random_state=42), parameters, cv=5, n_jobs=-1,
                      scoring='f1')
gs_clf.fit(x_train, np.ravel(y_train))


CPU times: user 48min 48s, sys: 736 ms, total: 48min 49s
Wall time: 3h 53min 43s


<h3 style="color:green;">Get the best model parameters</h3>


In [46]:
gs_clf.best_score_, gs_clf.best_params_

(0.331162244496855,
 {'class_weight': {1: 6},
  'min_samples_leaf': 10,
  'min_samples_split': 100,
  'n_estimators': 800})

<h3 style="color:green;">Run the best model and get metrics</h3>


In [47]:

model_5 = RandomForestClassifier(random_state=42,n_estimators=800,min_samples_leaf=10,min_samples_split=100,class_weight={1:6})
model_5.fit(x_train,y_train)


In [48]:
get_scores(model_5)

Confusion Matrix: 
 [[119522  30427]
 [  9796   9806]]
Training accuracy:  0.7819248968696918
Testing  accuracy:  0.7627675448685056
Precision:  0.24373027117043222
Recall:  0.5002550760126517
F1-Score:  0.32776802874571737
AUC:  0.7309289440285697
Average Precision:  0.17970343168821004


<h3 style="color:green;">Interpreting model 5 results</h3>

<p>
    </p>
<h4>Interpretation</h4>
<li>Model 5 has the highest f1 score and AUC score so far. </li>
<li>Compared to model 4, model 5 increases recall while only seeing a minimal decline in precision.</li>
<li>Accuracy, however, is lower (about 76% versus 82%).</li>

<h3 style="color:green;">Update results df</h3>


In [49]:

results_df.loc[5] = get_scores(model_5, p=False)



results_df

Model,accuracy,precision,recall,f1_score,AUC,AP
1,0.8842,0.078947,0.000153,0.000305,0.69198,0.115606
2,0.557042,0.171251,0.737476,0.277957,0.693819,0.156644
3,0.628212,0.183875,0.644424,0.286112,0.686343,0.159602
4,0.816551,0.25728,0.310989,0.281596,0.692853,0.159669
5,0.762768,0.24373,0.500255,0.327768,0.730929,0.179703
6,0.802107,0.268801,0.413733,0.325879,0.748559,0.178991
7,0.755147,0.244037,0.532905,0.33477,0.74823,0.18405


<h1 style="color:red;font-size:xx-large">Build Model 6</h1>

<li>Gradient Boosting Classifier</li>
<li>Grid search on GBC can take several days so let's just skip to the best models (I ran a 2-day reduced version)!</li>
<li>Sklearn's GradientBoostingClassifier has no class_weight parameter, so we pass a sample weight vector to fit to correct for imbalances in the data</li>


In [34]:
from sklearn.ensemble import GradientBoostingClassifier

#sample_weight is a vector that indicates the weight of each 
#case in the training sample
#If you're interested, try values from 1 to 10 instead of 4
sample_weight = np.array([4 if i == 1 else 1 for i in y_train])


model_6 = GradientBoostingClassifier(min_samples_split=100,
                                     max_depth=8,
                                     min_samples_leaf=100,
                                     n_estimators=400,
                                     subsample=0.6)


model_6.fit(x_train,y_train,sample_weight=sample_weight)



In [35]:
#Calculate and print metrics

get_scores(model_6)

Confusion Matrix: 
 [[127888  22061]
 [ 11492   8110]]
Training accuracy:  0.8167288481760091
Testing  accuracy:  0.8021067407446727
Precision:  0.2688011666832389
Recall:  0.4137332925211713
F1-Score:  0.3258794928977558
AUC:  0.748559037672033
Average Precision:  0.17899100806855378


<h3 style="color:green;">Interpreting model 6 results</h3>

<p>
    </p>
<h4>Interpretation</h4>
<li>Model 6 has the highest AUC score of any of the models so far.</li>
<li>Model 6 also has the highest precision so far, but its recall is noticeably lower than model 5's.</li>

<h3 style="color:green;">Update results df</h3>


In [9]:
results_df.loc[6] = get_scores(model_6, p=False)

results_df

Unnamed: 0,accuracy,precision,recall,f1_score,AUC,AP
1,0.8842,0.078947,0.000153,0.000305,0.69198,0.115606
2,0.557042,0.171251,0.737476,0.277957,0.693819,0.156644
3,0.628212,0.183875,0.644424,0.286112,0.686343,0.159602
4,0.816551,0.25728,0.310989,0.281596,0.692853,0.159669
5,0.762768,0.24373,0.500255,0.327768,0.730929,0.179703
6,0.802107,0.268801,0.413733,0.325879,0.748559,0.178991
7,0.755147,0.244037,0.532905,0.33477,0.74823,0.18405


<h1 style="color:red;font-size:xx-large">Build Model 7</h1>

<li>Same parameters but up the sample weight to 5</li>

In [39]:
from sklearn.ensemble import GradientBoostingClassifier

#sample_weight is a vector that indicates the weight of each 
#case in the training sample
#If you're interested, try values from 1 to 10 instead of 5
sample_weight = np.array([5 if i == 1 else 1 for i in y_train])


model_7 = GradientBoostingClassifier(min_samples_split=100,
                                     max_depth=8,
                                     min_samples_leaf=100,
                                     n_estimators=400,
                                     subsample=0.6)
model_7.fit(x_train,y_train,sample_weight=sample_weight)


In [40]:
#Calculate and print metrics
get_scores(model_7)


Confusion Matrix: 
 [[117590  32359]
 [  9156  10446]]
Training accuracy:  0.7715815336083475
Testing  accuracy:  0.7551474187707533
Precision:  0.2440369115757505
Recall:  0.5329048056320783
F1-Score:  0.33477013796529237
AUC:  0.7482298957328246
Average Precision:  0.1840498938212104


<h3 style="color:green;">Interpreting model 7 results</h3>

<p>
    </p>
<h4>Interpretation</h4>
<li>Model 7 has the highest F1 score (0.335) and AP score (0.184) of all the models.</li>
<li>Compared to model 6, model 7 has higher recall (0.533 vs. 0.414) with only slightly lower precision (0.244 vs. 0.269).</li>
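One way to put a number on this trade-off is to compare the two confusion matrices directly and ask what the extra class-1 predictions made by model 7 consist of. This is a rough sketch using the printed matrices, again assuming the scikit-learn layout [[TN, FP], [FN, TP]]; the names `cm_6` and `cm_7` are introduced here for illustration only.

```python
# Test-set confusion matrices copied from the outputs above.
# Assumed layout (scikit-learn convention): [[TN, FP], [FN, TP]].
cm_6 = {"tn": 127888, "fp": 22061, "fn": 11492, "tp": 8110}
cm_7 = {"tn": 117590, "fp": 32359, "fn": 9156, "tp": 10446}

# Additional correct and incorrect class-1 predictions made by model 7.
extra_tp = cm_7["tp"] - cm_6["tp"]
extra_fp = cm_7["fp"] - cm_6["fp"]

# Precision of only the additional class-1 predictions.
marginal_precision = extra_tp / (extra_tp + extra_fp)

print(f"extra true positives:  {extra_tp}")
print(f"extra false positives: {extra_fp}")
print(f"precision of the extra predictions: {marginal_precision:.3f}")
print(f"false positives per extra true positive: {extra_fp / extra_tp:.1f}")
```

The extra predictions are less precise than model 6 as a whole, which is why overall precision slips while recall rises. Whether that exchange is worthwhile depends on the relative cost of the two kinds of error.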


<h3 style="color:green;">Update results df</h3>


In [None]:
results_df.loc[7] = get_scores(model_7, p=False)

In [40]:
results_df

Unnamed: 0,accuracy,precision,recall,f1_score,AUC,AP,model
1,0.8842,0.078947,0.000153,0.000305,0.69198,0.115606,1
2,0.557042,0.171251,0.737476,0.277957,0.693819,0.156644,2
3,0.628212,0.183875,0.644424,0.286112,0.686343,0.159602,3
4,0.816551,0.25728,0.310989,0.281596,0.692853,0.159669,4
5,0.762768,0.24373,0.500255,0.327768,0.730929,0.179703,5
6,0.802107,0.268801,0.413733,0.325879,0.748559,0.178991,6
7,0.755147,0.244037,0.532905,0.33477,0.74823,0.18405,7


<h3 style="color:red;font-size:xx-large">Model comparison</h3>
<li>Draw a graph that shows the changes to accuracy, precision, recall, and f1 score</li>
<li>The x-axis contains the seven models you have created</li>
<li>Use bokeh for the charts</li>

In [6]:
from bokeh.io import output_notebook, show
from bokeh.plotting import figure
from bokeh.layouts import gridplot
from bokeh.models import ColumnDataSource, LabelSet, HoverTool
output_notebook()

In [14]:
results_df['model'] = [1,2,3,4,5,6,7]

In [15]:
results_df

Unnamed: 0,accuracy,precision,recall,f1_score,AUC,AP,model
1,0.8842,0.078947,0.000153,0.000305,0.69198,0.115606,1
2,0.557042,0.171251,0.737476,0.277957,0.693819,0.156644,2
3,0.628212,0.183875,0.644424,0.286112,0.686343,0.159602,3
4,0.816551,0.25728,0.310989,0.281596,0.692853,0.159669,4
5,0.762768,0.24373,0.500255,0.327768,0.730929,0.179703,5
6,0.802107,0.268801,0.413733,0.325879,0.748559,0.178991,6
7,0.755147,0.244037,0.532905,0.33477,0.74823,0.18405,7


In [38]:
#CHART
cdata = ColumnDataSource(data=results_df)


def metric_bar_chart(column, title, y_label):
    """Bar chart of one results_df column across the seven models."""
    # height/width are the current argument names;
    # plot_height/plot_width were removed in Bokeh 3.0
    fig = figure(height=300, width=500, x_range=(0.5, 7.5), y_range=(0, 1),
                 title=title,
                 x_axis_label='Model',
                 y_axis_label=y_label,
                 tooltips=[(column, "@" + column)])
    fig.vbar(x='model', top=column, source=cdata, width=0.6,
             color="red", line_color='black')
    fig.xgrid.grid_line_color = None
    return fig


p = metric_bar_chart('accuracy', 'Accuracy Comparison', 'Accuracy Score')
p2 = metric_bar_chart('precision', 'Precision Comparison', 'Precision Score')
p3 = metric_bar_chart('recall', 'Recall Comparison', 'Recall Score')
p4 = metric_bar_chart('f1_score', 'F1 Score Comparison', 'F1 Score')

# 2x2 grid: accuracy and precision on top, recall and F1 below
grid = gridplot([[p, p2], [p3, p4]], sizing_mode="scale_both", merge_tools=True)
show(grid)



<h3 style="color:green;">Interpret the chart</h3>
<li>What can you say about the changes in precision and recall?</li>

In general, we see a trade-off between precision and recall. The exception is Model 1, which is very low on both (precision 7.9%, recall 0.02%). Model 2 has the highest recall (73.7%) but the second-lowest precision (17.1%). Conversely, Model 6 has the highest precision (26.9%) but the third-lowest recall (41.4%).

<h3 style="color:green;">Chart AUC and AP</h3>


In [39]:
#CHART
# Reuses cdata from the previous chart cell

tooltips_3 = [
    ("auc", "@AUC")
]

tooltips_4 = [
    ("ap", "@AP")
]
# height/width are the current argument names;
# plot_height/plot_width were removed in Bokeh 3.0
p3 = figure(height = 300, width = 500, x_range=(0.5,7.5), y_range=(0, 1),
           title = 'AUC Score Comparison',
          x_axis_label = 'Model', 
           y_axis_label = 'AUC Score',tooltips=tooltips_3)
p3.vbar(x='model', top='AUC', source=cdata, width=0.6, color = "red", line_color='black')
p3.xgrid.grid_line_color = None

p4 = figure(height = 300, width = 500, x_range=(0.5,7.5), y_range=(0, 1),
           title = 'AP Score Comparison',
          x_axis_label = 'Model', 
           y_axis_label = 'AP Score',tooltips=tooltips_4)
p4.vbar(x='model', top='AP', source=cdata, width=0.6, color = "red", line_color='black')
p4.xgrid.grid_line_color = None

grid = gridplot([[p3,p4]],sizing_mode="scale_both",merge_tools=True)
show(grid)


<h3 style="color:green;">Interpret the AUC/AP chart</h3>
<li>The AUC on the first 4 models is pretty much the same. What does that mean?</li>

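This means that the area under the ROC curve is about the same for the first four models (0.686 to 0.694). AUC measures how well a model ranks positive cases above negative ones across all thresholds, so these four models separate the two classes about equally well. They differ mainly in where they sit on the trade-off between false positives and true positives, not in their overall ranking ability.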
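The point that AUC does not depend on a threshold, while precision and recall do, can be illustrated with a small example. The labels and scores below are made up for illustration, and `pairwise_auc` and `precision_recall` are hypothetical helpers rather than part of the assignment code. The AUC here is computed as the fraction of (positive, negative) pairs in which the positive case gets the higher score, which is equivalent to the area under the ROC curve.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.
    Ties count as half a correct pair."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def precision_recall(labels, scores, threshold):
    """Precision and recall when scores >= threshold are predicted as class 1."""
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(labels, scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data (hypothetical): 3 positives among 10 cases.
labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.40, 0.60, 0.70, 0.90]

# One AUC for the whole ranking, no threshold involved.
print(f"AUC: {pairwise_auc(labels, scores):.3f}")

# Precision and recall change as the threshold moves.
for threshold in (0.5, 0.3):
    p, r = precision_recall(labels, scores, threshold)
    print(f"threshold {threshold}: precision {p:.3f}, recall {r:.3f}")
```

Lowering the threshold raises recall and lowers precision here, yet the ranking, and therefore the AUC, is untouched. Two models with similar AUC can show very different precision and recall for the same reason.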

<li>The average precision improves steadily but almost entirely by getting better at recall than at precision. What does that mean?</li>

This means that the better models gain mostly in recall rather than in precision. In other words, they correctly identify more of the actual positive cases while the share of false positives among the predicted positives stays roughly the same, so precision holds steady instead of falling.

<li>Finally, what can you do to get better results? </li>

I could get better results by trying different models and hyperparameters: for example, by running a grid search over the settings of the existing models (including the sample weight) and by trying other model types. I could also decide which metrics matter most for this problem and select the model that optimizes them.
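As one illustration of how the choice of metric changes the choice of model, the sketch below scores the seven models with the F-beta score, a generalization of F1 in which beta > 1 weights recall more heavily and beta < 1 weights precision more heavily. The precision and recall values are copied from the results table. Using F-beta is an assumption made here for illustration, not something the assignment prescribes, and `f_beta` and `best_model` are hypothetical helpers.

```python
# (precision, recall) per model, copied from results_df above.
results = {
    1: (0.078947, 0.000153),
    2: (0.171251, 0.737476),
    3: (0.183875, 0.644424),
    4: (0.257280, 0.310989),
    5: (0.243730, 0.500255),
    6: (0.268801, 0.413733),
    7: (0.244037, 0.532905),
}


def f_beta(precision, recall, beta):
    """F-beta score; beta = 1 gives the usual F1 score."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def best_model(beta):
    """Model number with the highest F-beta score for the given beta."""
    return max(results, key=lambda m: f_beta(*results[m], beta))


for beta in (0.5, 1, 2):
    m = best_model(beta)
    score = f_beta(*results[m], beta)
    print(f"beta={beta}: best is model {m} (F-beta = {score:.3f})")
```

With these numbers, a precision-heavy score (beta = 0.5) favors model 6, the balanced F1 favors model 7, and a recall-heavy score (beta = 2) favors model 2, so settling on the metric comes before picking a winner.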
    