# Logistic Classification

__Group: Iñigo Martiarena y Carlos Rodríguez-Viña__

__Lending Club Loan Status Analysis__

# Library

In [15]:
import pandas as pd 
import numpy as np
import pandas_profiling
import seaborn as sns
import sklearn as sk
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score, precision_score, roc_auc_score
from sklearn.model_selection import GridSearchCV
import pickle

### Load Daset

We load the sample we have created in our notebook called "Sample"

In [3]:
X_ada = pd.read_csv('../data/X_ada.csv', engine = 'python')
y_ada = pd.read_csv('../data/y_ada.csv', engine = 'python')
X_test = pd.read_csv('../data/X_test.csv', engine = 'python')
y_test = pd.read_csv('../data/y_test.csv', engine = 'python')

### Standarization Model

Due to having 58 variables in our dataset, we need to proceed to standarize our data, to ensure uniformity to certain practices within the industry.

In [4]:
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_ada = scaler.fit_transform(X_ada)
X_test= scaler.fit_transform(X_test)

# Model

The logistic function, also called the sigmoid function was developed by statisticians to describe properties of population growth in ecology, rising quickly and maxing out at the carrying capacity of the environment. It’s an S-shaped curve that can take any real-valued number and map it into a value between 0 and 1, but never exactly at those limits.

 - __1 / (1 + e^-value)__

Where e is the base of the natural logarithms (Euler’s number or the EXP() function in your spreadsheet) and value is the actual numerical value that you want to transform. Below is a plot of the numbers between -5 and 5 transformed into the range 0 and 1 using the logistic function.

In [5]:
classifier = LogisticRegression(random_state = 123)

In [6]:
classifier.fit(X_ada, y_ada)

  return f(**kwargs)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression(random_state=123)

We proceed with the prediction base on the model we just built, and to calculate the following indicators:

Confusion Matrix
Accuracy score.
Recall Score.
Precision.
Roc Auc score.
F1 score

In [7]:
y_class = classifier.predict(X_test) 

### Confussion Matrix

In [8]:
confusion_matrix(y_test, y_class)

array([[19430,    19],
       [19978, 42454]], dtype=int64)

According to our confussion Matrix we can interpretate the following:

18.298 True Negatives.
19 False Positives.
19.978 Flase Negatives.
42.454 True Postives.
So with our model we have predicted that 19 loans where Charged Off when in reality they were fully paid and 19.978 where Fully Paid and in reality where Charged off

Of a total sample of 81.890 observations our model has predicted wrong 19.997, which is a 24% of the total

### Accuracy Score

In [21]:
accuracy_score(y_test, y_class)

0.7557797291190874

According to our confussion Matrix we can interpretate the following:

18.298 True Negatives.
1.151 False Positives.
1.489 Flase Negatives.
60.943 True Postives.
So with our model we have predicted that 1.151 loans where Charged Off when in reality they were fully paid and 1.489 where Fully Paid and in reality where Charged off

Of a total sample of 81.890 observations our model has predicted wrong 2.649, which is a 3% of the total

In [13]:
balanced_accuracy_score(y_test, y_class, sample_weight=None, adjusted=False)

0.8395134651011587

### Recall Score

In [16]:
recall_score(y_test, y_class)

0.6800038441824705

The ratio is number of true positives/(true positives + false negatives), it informs us about the quantity that our model can predict being 1 the best value and 0 the worst values, in our case we have obtain an outstanding result

### Precision

We measure the quality of our model, the formula is TruePositive/(TruePositives+FalsePositives)

In [17]:
precision_score(y_test, y_class)

0.9995526569820827

The precision is intuitively the ability of the classifier not to label as positive a sample that is negative being best value 1 and worst value 0.

### ROC AUC Score

The ROC is created by the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true-positive rate is also known as sensitivity, recall or probability of detection in machine learning.

In [18]:
roc_auc_score(y_test, y_class)

0.8395134651011587

### Conclusion

In [19]:
print("The results of our Random Forest Model")


print("accuracy score", accuracy_score(y_test, y_class))
print("balanced accuracy score", balanced_accuracy_score(y_test, y_class))
print("recall score", recall_score(y_test, y_class))
print("precision score", precision_score(y_test, y_class))
print("roc auc score", roc_auc_score(y_test, y_class))

The results of our Random Forest Model
accuracy score 0.7557797291190874
balanced accuracy score 0.8395134651011587
recall score 0.6800038441824705
precision score 0.9995526569820827
roc auc score 0.8395134651011587


In [20]:
pickle.dump(classifier, open("classifier", "wb"))