# Random Forest

# Load Libraries

In [30]:
import pandas as pd 
import numpy as np
import pandas_profiling
import seaborn as sns
import sklearn as sk
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score, precision_score, roc_auc_score
from sklearn.model_selection import GridSearchCV
import pickle

# Load Sample

We load the sample created in our notebook called "Sample".

In [31]:
X_ada = pd.read_csv('../data/X_ada.csv', engine = 'python')

In [3]:
y_ada = pd.read_csv('../data/y_ada.csv', engine = 'python')

In [4]:
X_test = pd.read_csv('../data/X_test.csv', engine = 'python')

In [5]:
y_test = pd.read_csv('../data/y_test.csv', engine = 'python')

# Model

We use scikit-learn's RandomForestClassifier with its default hyperparameters and a random state of 123.

In [6]:
clf_rf = RandomForestClassifier(random_state=123)
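# y_ada is read as a one-column DataFrame; flatten it to the 1-D array scikit-learn expects
clf_rf.fit(X_ada, y_ada.values.ravel())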

RandomForestClassifier(random_state=123)
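
The fitting step can be tried on synthetic data. This is a minimal sketch, not the notebook's actual pipeline: `make_classification` stands in for the loan sample, and the names `X_demo`, `y_demo` and `clf_demo` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for X_ada / y_ada (hypothetical data, not the loan sample).
X_demo, y_flat = make_classification(n_samples=300, n_features=8, random_state=0)

# Mimic a target read from a one-column CSV: shape (n_samples, 1).
y_demo = y_flat.reshape(-1, 1)

# Default hyperparameters with a fixed seed, as in the cell above.
clf_demo = RandomForestClassifier(random_state=123)

# Flatten the column vector to 1-D before fitting.
clf_demo.fit(X_demo, y_demo.ravel())

preds = clf_demo.predict(X_demo)
print(len(clf_demo.estimators_), preds.shape)
```

The number of fitted trees equals `n_estimators`, which is left at its default here. Any other hyperparameter (for example `n_estimators` or `max_depth`) would have to be passed explicitly to the constructor.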

In [8]:
y_rf = clf_rf.predict(X_test)

With the predictions from the model we just built, we calculate the following indicators:

 - Confusion Matrix
 - Accuracy score
 - Recall score
 - Precision
 - ROC AUC score
 - F1 score


### Confusion Matrix

In [9]:
confusion_matrix(y_test, y_rf)

array([[18035,  1414],
       [ 1800, 60632]], dtype=int64)

According to the confusion matrix, we can interpret the results as follows:

 - 18,035 True Negatives.
 - 1,414 False Positives.
 - 1,800 False Negatives.
 - 60,632 True Positives.

So our model predicted 1,414 loans as Charged Off that were in reality Fully Paid, and 1,800 loans as Fully Paid that were in reality Charged Off.

Out of a total of 81,881 test observations, the model misclassified 3,214 (1,414 + 1,800), which is about 3.9% of the total.
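
The totals can be checked directly from the matrix. The sketch below copies the four counts from the output above and relies on scikit-learn's layout (rows are true labels, columns are predicted labels); the variable names are illustrative.

```python
# Counts copied from the confusion_matrix output: rows = true, columns = predicted.
cm = [[18035, 1414],
      [1800, 60632]]

# For a binary problem the layout is [[TN, FP], [FN, TP]].
(tn, fp), (fn, tp) = cm

total = tn + fp + fn + tp      # all test observations
errors = fp + fn               # misclassified observations
error_rate = errors / total

print(total, errors, round(error_rate, 4))  # 81881 3214 0.0393
```

In the notebook itself, the same unpacking can be written as `tn, fp, fn, tp = confusion_matrix(y_test, y_rf).ravel()`.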

### Accuracy Score

In [29]:
accuracy_score(y_test, y_rf)

0.9607479146566359

With this model we obtain an accuracy of 96.07%, which means that out of 100 observations we predict about 96 correctly. The issue with this score arises when the classes are imbalanced: a high accuracy can deceive us into believing that a bad model is a good one. To be certain, we also compute the balanced accuracy, which averages the recall obtained on each class.

In [13]:
balanced_accuracy_score(y_test, y_rf, sample_weight=None, adjusted=False)

0.9492328323687662

We can see that the score drops to about 94.9%, which still indicates a strong model.
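
To see where the two numbers come from, both scores can be recomputed by hand from the confusion matrix counts. This is a small sketch using only those four counts; the variable names are illustrative.

```python
# Counts taken from the confusion matrix reported earlier.
tn, fp, fn, tp = 18035, 1414, 1800, 60632

# Accuracy: share of all observations classified correctly.
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Balanced accuracy: mean of the recall on each class.
recall_pos = tp / (tp + fn)   # recall on class 1 (true positive rate)
recall_neg = tn / (tn + fp)   # recall on class 0 (true negative rate)
balanced_accuracy = (recall_pos + recall_neg) / 2

print(round(accuracy, 4), round(balanced_accuracy, 4))
```

Because class 1 is roughly three times as frequent as class 0 in the test set, plain accuracy is dominated by the majority class, while balanced accuracy weights both classes equally.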

### Recall Score

In [16]:
recall_score(y_test, y_rf)

0.9711686314710405

Recall is the ratio true positives / (true positives + false negatives). It tells us what share of the actual positives the model identifies; the best value is 1 and the worst is 0. In our case we obtain 0.971, an outstanding result.

### Precision

In [19]:
precision_score(y_test, y_rf)

0.9772104567578893

Precision is the ratio true positives / (true positives + false positives). Intuitively, it is the ability of the classifier not to label as positive a sample that is negative; the best value is 1 and the worst is 0.
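
Recall and precision can also be derived from the confusion matrix counts, along with the F1 score, which is listed among the indicators but not computed in this notebook. The sketch below uses the standard definitions; the F1 value it produces is derived here, not reported by the original run.

```python
# Counts taken from the confusion matrix reported earlier.
tn, fp, fn, tp = 18035, 1414, 1800, 60632

recall = tp / (tp + fn)        # share of actual positives that were found
precision = tp / (tp + fp)     # share of predicted positives that were correct

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(round(recall, 4), round(precision, 4), round(f1, 4))
```

In the notebook, the equivalent call would be `f1_score(y_test, y_rf)` after importing `f1_score` from `sklearn.metrics`.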

### ROC AUC

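The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true positive rate is also known as sensitivity, recall or probability of detection. Note that the score below is computed from the hard class predictions in y_rf rather than from predicted probabilities, so the curve has a single threshold and the result coincides with the balanced accuracy.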

In [23]:
roc_auc_score(y_test, y_rf)

0.9492328323687661
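
The value above matches the balanced accuracy because the AUC was computed on 0/1 predictions. The following pure-Python sketch illustrates this on toy data: `pairwise_auc` is a hypothetical helper that computes AUC as the probability that a random positive is ranked above a random negative, and the labels and probabilities are made up for illustration.

```python
def pairwise_auc(y_true, scores):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy labels and hypothetical predicted probabilities for class 1.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
proba = [0.1, 0.4, 0.35, 0.8, 0.45, 0.9, 0.3, 0.6]
hard = [1 if p >= 0.5 else 0 for p in proba]   # thresholded predictions

auc_from_proba = pairwise_auc(y_true, proba)
auc_from_hard = pairwise_auc(y_true, hard)

# Balanced accuracy of the hard predictions.
n_pos = sum(y_true)
n_neg = len(y_true) - n_pos
tpr = sum(1 for y, h in zip(y_true, hard) if y == 1 and h == 1) / n_pos
tnr = sum(1 for y, h in zip(y_true, hard) if y == 0 and h == 0) / n_neg
balanced = (tpr + tnr) / 2

print(auc_from_proba, auc_from_hard, balanced)  # 0.75 0.625 0.625
```

To obtain a threshold-independent AUC in the notebook, one option is to pass the class-1 probabilities instead, for example `roc_auc_score(y_test, clf_rf.predict_proba(X_test)[:, 1])`.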

### Conclusion

In [27]:
print("The results of our Random Forest Model")


print("accuracy score", accuracy_score(y_test, y_rf))
print("balanced accuracy score", balanced_accuracy_score(y_test, y_rf))
print("recall score", recall_score(y_test, y_rf))
print("precision score", precision_score(y_test, y_rf))
print("roc auc score", roc_auc_score(y_test, y_rf))

The results of our Random Forest Model
accuracy score 0.9607479146566359
balanced accuracy score 0.9492328323687662
recall score 0.9711686314710405
precision score 0.9772104567578893
roc auc score 0.9492328323687661


In [28]:
# Save the fitted model; the context manager closes the file after writing
with open("clf_rf", "wb") as f:
    pickle.dump(clf_rf, f)
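
For completeness, here is a minimal sketch of how a pickled object can be written and loaded back. A plain dictionary (`model_stub`, hypothetical) stands in for the fitted `clf_rf`, and a temporary directory is used so the example does not touch the notebook's files.

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be the fitted clf_rf.
model_stub = {"name": "clf_rf", "random_state": 123}

# Write to a temporary location instead of the working directory.
path = os.path.join(tempfile.mkdtemp(), "clf_rf")
with open(path, "wb") as f:
    pickle.dump(model_stub, f)

# Load the object back; with a real model, restored.predict(...) would then be available.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model_stub)  # True
```

A pickled scikit-learn model should be loaded with the same scikit-learn version that saved it, and pickle files should only be loaded from trusted sources.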