## NAIVE BAYES MODEL

__Group: Iñigo Martiarena y Carlos Rodríguez-Viña__

__Lending Club Loan Status Analysis__

### Library

In [21]:
import xgboost as xgb
import pandas as pd 
import numpy as np
import pandas_profiling
import seaborn as sns
import sklearn as sk
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
from sklearn.metrics import plot_confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score,classification_report, recall_score, balanced_accuracy_score,precision_score
import pickle

### Load Dataset

We load the sample we have created in our notebook called "Sample"

In [26]:
X_ada = pd.read_csv('../data/X_ada.csv', engine = 'python')
y_ada = pd.read_csv('../data/y_ada.csv', engine = 'python')
X_test = pd.read_csv('../data/X_test.csv', engine = 'python')
y_test = pd.read_csv('../data/y_test.csv', engine = 'python')

### Standarization Model


Due to having 58 variables in our dataset, we need to proceed to standarize our data, to ensure uniformity to certain practices within the industry.

In [32]:
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_ada= scaler.fit_transform(X_ada)
X_test= scaler.fit_transform(X_test)

# Model

It is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.

For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, all of these properties independently contribute to the probability that this fruit is an apple and that is why it is known as ‘Naive’.

Naive Bayes model is easy to build and particularly useful for very large data sets. Along with simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods.

In [33]:
gnb = GaussianNB()

In [34]:
gnb.fit(X_ada, y_ada)

  return f(**kwargs)


GaussianNB()

We proceed with the prediction base on the model we just built, and to calculate the following indicators:

 - Confusion Matrix 
 - Accuracy score. 
 - Recall Score.
 - Precision.
 - Roc Auc score.
 - F1 score
 

In [35]:
y_gnb = gnb.predict(X_test)

### Confussion Matrix

In [36]:
confusion_matrix(y_test, y_gnb)

array([[19449,     0],
       [62432,     0]], dtype=int64)

According to our confussion Matrix we can interpretate the following:

 - 19.449 True Negatives.
 - 0 False Positives.
 - 62.432 Flase Negatives.
 - 0 True Postives.
 
So with our model we have predicted that 62.432 loans where Charged Off when in reality they were fully paid and 62.432 where Fully Paid and in reality where Charged off

Of a total sample of 81.890 observations our model has predicted wrong 62.441, which is a 37% of the total

### Accuracy Score

In [37]:
accuracy_score(y_test, y_gnb)

0.23752763156287784

With this model we were able to obtain a 96,77% of accuracy, which means if we have 100 observation we are able to predict altmost 97% right. The issue with this score is when our model is imbalanced, meaning this score can deceive us into believing that a bad model is a good model. So to be certain we are going to use the balanced_accuracy.

In [38]:
balanced_accuracy_score(y_test, y_gnb, sample_weight=None, adjusted=False)

0.5

We can see that our score has drop down to 95%, but still is a pretty great model.

### Recall Score

In [39]:
recall_score(y_test, y_gnb)

0.0

The ratio is number of true positives/(true positives + false negatives), it informs us about the quantity that our model can predict being 1 the best value and 0 the worst values, in our case we have obtain an outstanding result

### Precision


We measure the quality of our model, the formula is TruePositive/(TruePositives+FalsePositives)

In [40]:
precision_score(y_test, y_gnb)

  _warn_prf(average, modifier, msg_start, len(result))


0.0

The precision is intuitively the ability of the classifier not to label as positive a sample that is negative being best value 1 and worst value 0.

### ROC AUC Score

The ROC is created by the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true-positive rate is also known as sensitivity, recall or probability of detection in machine learning.

In [41]:
roc_auc_score(y_test, y_gnb)

0.5

### Conclusion

In [42]:
print("The results of our Naive Model")


print("accuracy score", accuracy_score(y_test, y_gnb))
print("balanced accuracy score", balanced_accuracy_score(y_test, y_gnb))
print("recall score", recall_score(y_test, y_gnb))
print("precision score", precision_score(y_test, y_gnb))
print("roc auc score", roc_auc_score(y_test, y_gnb))

The results of our Naive Model
accuracy score 0.23752763156287784
balanced accuracy score 0.5
recall score 0.0
precision score 0.0
roc auc score 0.5


  _warn_prf(average, modifier, msg_start, len(result))


In [22]:
pickle.dump(gnb, open("gnb", "wb"))