# Machine Learning Model Generation - Binary Classification

### 1. [Tutorial links](#tutorials)
### 2. [Load CSV File](#csv_file)
### 3. [Split the dataset. _Technique: Train-Test split_](#train_test)
### 4. [Generate Binary Classification Models using Train-Test split](#binomial_train_test)

> #### 4.1. [Logistic Regression Model using Train-Test split](#lr_train_test)
#### 4.2. [Random Forest Model using Train-Test split](#rf_train_test)
#### 4.3. [k-Nearest Neighbor Model using Train-Test split](#knn_train_test)
#### 4.4. [Support Vector Machine (SVM) Model using Train-Test split](#svm_train_test)
#### 4.5. [XGBoost Classifier Model using Train-Test split](#xgb_train_test)

### 5. [Analyze the model results](#analyze_results)

> #### 5.1. [Print Accuracy and AUC of all models](#print_AUC)

### 6. [Split the dataset. _Technique: K-Fold Cross Validation_](#k_fold)
### 7. [Generate Binary Classification Models using K-Fold split](#binomial_kfold)
> #### 7.1. [Logistic Regression Model using K-Fold split](#lr_kfold)
#### 7.2. [Random Forest Model using K-Fold split](#rf_kfold)
#### 7.3. [k-Nearest Neighbor Model using K-Fold split](#knn_kfold)
#### 7.4. [Support Vector Machine (SVM) Model using K-Fold split](#svm_kfold)
#### 7.5. [XGBoost Classifier Model using K-Fold split](#xgb_kfold)


## <a id='tutorials'>1. Tutorial links</a>

### Load libraries

In [None]:
# Import pandas library
import pandas as pd

# Import numpy library
import numpy as np

# Import LogisticRegression library
from sklearn.linear_model import LogisticRegression

# Import RandomForestClassifier library
from sklearn.ensemble import RandomForestClassifier

# Import KNeighborsClassifier library
from sklearn.neighbors import KNeighborsClassifier

# Import Support Vector Machine (SVM) library
from sklearn import svm

# Import XGBClassifier library
from xgboost import XGBClassifier

# Import Train-Test split library
from sklearn.model_selection import train_test_split

# Import KFold split library
from sklearn.model_selection import KFold

# Import accuracy score computing library
from sklearn.metrics import accuracy_score

# Import metrics library
from sklearn import metrics

# Import matplotlib library
import matplotlib.pyplot as plt
%matplotlib inline

# Import warnings
import warnings
warnings.filterwarnings('ignore')

## <a id='csv_file'>2. Load CSV file saved locally after Feature Engineering</a>

In [None]:
# Load Loan Data file that is saved after Feature Engineering from local disk
loan_data = pd.read_csv("LoanData_final.csv")

# Print the shape
print (loan_data.shape)

# Print few rows to visualize the data
loan_data.head()

## <a id='train_test'>3. Split the dataset. _Technique: Train-Test split_</a>

In [None]:
# Set the Train and Test split ratio to 80:20
SPLIT_RATIO = 0.2

# Split the dataset
X_train, X_test, Y_train, Y_test = train_test_split(loan_data.drop('Loan_Status',  axis = 1), 
                                                    loan_data['Loan_Status'], 
                                                    test_size=SPLIT_RATIO, 
                                                    random_state = 21345)

# Print the shape of the Train set
print("Train dataset: ", X_train.shape, Y_train.shape)

# Print the shape of the Test set
print("Test dataset: ", X_test.shape, Y_test.shape)
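
To make the 80:20 split concrete, here is a minimal pure-Python sketch of a shuffled hold-out split. The helper name `sketch_train_test_split` and the row count of 103 are made up for illustration; it assumes the test fraction is rounded up to a whole row, and it is not scikit-learn's implementation.

```python
import math
import random


def sketch_train_test_split(n_rows, test_size, seed):
    """Return (train_idx, test_idx) for a shuffled hold-out split (illustrative helper)."""
    # Number of test rows: the fraction is rounded up to a whole row (assumption)
    n_test = math.ceil(test_size * n_rows)

    # Shuffle row positions reproducibly, similar in spirit to fixing random_state
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)

    # The first n_test shuffled positions form the test set, the rest the train set
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = sketch_train_test_split(n_rows=103, test_size=0.2, seed=21345)
print("Train rows:", len(train_idx), "Test rows:", len(test_idx))
```

Every row lands in exactly one of the two sets, which is what keeps the test set unseen during training.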

## <a id='binomial_train_test'>4. Generate Binary Classification Models using Train-Test split</a>

### <a id='lr_train_test'>4.1 Logistic Regression Model using Train-Test split</a>

In [None]:
# Generate a Logistic Regression object
# solver='liblinear' is set explicitly; older scikit-learn releases warn when the solver is left at its default
lr_model = LogisticRegression(solver='liblinear', random_state=2011)

# Train a Logistic Regression model with Train dataset
lr_model.fit(X_train, Y_train)

# Predict the outcome
y_hat_lr = lr_model.predict(X_test)

# Compute the accuracy score
accuracy_lr = accuracy_score(Y_test, y_hat_lr)

# Print first 8 rows to visualize the prediction.
print ("First few predicted Loan Status values: ", y_hat_lr[:8], "\n")

# Print the accuracy score
print ("Accuracy of Logistic Regression model: ", accuracy_lr)

# Print confusion matrix of actual and predicted values using Crosstab function
pd.crosstab(Y_test, y_hat_lr, rownames=['Actual'], colnames=['Predicted'], margins=True)
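
The four counts in a confusion matrix are enough to derive accuracy, precision, and recall by hand. Below is a minimal sketch; the `actual` and `predicted` lists are made-up labels for illustration, not values from the loan dataset, and `confusion_counts` is a hypothetical helper.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            # Predicted positive: correct if the actual label is positive too
            counts["tp" if a == positive else "fp"] += 1
        else:
            # Predicted negative: a miss if the actual label was positive
            counts["fn" if a == positive else "tn"] += 1
    return counts


# Made-up labels for illustration only
actual = [1, 0, 1, 1, 0, 1, 0, 1]
predicted = [1, 1, 1, 0, 0, 1, 0, 1]

c = confusion_counts(actual, predicted)
accuracy = (c["tp"] + c["tn"]) / len(actual)
precision = c["tp"] / (c["tp"] + c["fp"])
recall = c["tp"] / (c["tp"] + c["fn"])

print(c)
print("Accuracy:", accuracy, "Precision:", precision, "Recall:", recall)
```

Accuracy alone can hide an imbalance between the two error types, which is why precision and recall are worth reading alongside it.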

#### 4.1.1 Print Logistic Regression Model parameters

In [None]:
# Print the Logistic Regression Model
lr_model

#### 4.1.2 To output the Binary Probability using Logistic Regression model ...

As discussed in an earlier class, a classifier is trained on the Train dataset and can then predict the probability of each outcome, not just the class label.

In the example above, `lr_model.predict(X_test)` returned the predicted class label. To get the predicted probability of each class instead, use `predict_proba`:

In [None]:
# Predict the binary probability outcome
y_hat_lr_proba = lr_model.predict_proba(X_test)

# Print first 8 rows to visualize the prediction.
y_hat_lr_proba[:8]
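
The relationship between the two outputs can be sketched in plain Python: a logistic model turns a linear score into a probability with the sigmoid function, and the class label follows from comparing that probability with 0.5. The weights, intercept, and feature values below are hypothetical numbers chosen for illustration, not coefficients of `lr_model`.

```python
import math


def sigmoid(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba_one(features, weights, intercept):
    """Return [P(class 0), P(class 1)] for a single row."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    p1 = sigmoid(z)
    return [1.0 - p1, p1]


def predict_one(features, weights, intercept):
    """Return the class label by thresholding P(class 1) at 0.5."""
    return 1 if predict_proba_one(features, weights, intercept)[1] > 0.5 else 0


# Hypothetical model parameters and rows, for illustration only
weights, intercept = [0.8, -0.5], -0.2
row_a, row_b = [1.0, 0.5], [0.0, 2.0]

proba_a = predict_proba_one(row_a, weights, intercept)
proba_b = predict_proba_one(row_b, weights, intercept)
print(proba_a, predict_one(row_a, weights, intercept))
print(proba_b, predict_one(row_b, weights, intercept))
```

The two columns always sum to 1, so the second column (probability of the positive class) carries all the information needed for the AUC computation.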

#### 4.1.3 To draw Area Under the Curve (AUC) of Logistic Regression model ...

In [None]:
# Compute area under the curve (AUC)
auc_lr = metrics.roc_auc_score(Y_test, y_hat_lr_proba[:, 1])

# Compute False Positive Rate, True Positive Rate, and Thresholds using metrics library
fpr, tpr, threshold = metrics.roc_curve(Y_test, y_hat_lr_proba[:, 1])

# Set the plotting area/ size
plt.figure(figsize = (8, 8))

# Plot the AUC curve
plt.plot(fpr, tpr, label='AUC =' + str(auc_lr))
plt.legend(loc=4)

# Plot the base line (diagonal dotted line)
plt.plot([0, 1], [0, 1], 'k-.', lw=1)
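
One way to read the AUC value: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it by comparing every positive/negative pair; `pairwise_auc` is a hypothetical helper and the labels and scores are made up for illustration.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities of the positive class
y_demo = [0, 0, 1, 1]
score_demo = [0.1, 0.4, 0.35, 0.8]

auc_demo = pairwise_auc(y_demo, score_demo)
print("AUC:", auc_demo)
```

An AUC of 0.5 corresponds to the diagonal base line in the plot (random ranking), and 1.0 means every positive is ranked above every negative.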

#### 4.1.4 Scatterplot the actual and predicted classifications.

In [None]:
# Take the predicted probability of the positive class (second column)
y_hat_lr_pos = y_hat_lr_proba[:, 1]

# Set the plotting area/ size
plt.figure(figsize = (12, 6))

# Scatter plot the Actual values against the Predicted probabilities
plt.scatter(y_hat_lr_pos, Y_test)
# Mark the 0.5 decision threshold
plt.axvline(x=0.5, c='darkorange')
plt.ylabel('Actual')
plt.xlabel('Predicted probability')
plt.title("Scatter plot of 'Actual' and 'Predicted' values - Logistic Regression")

### <a id='rf_train_test'>4.2 Random Forest Model using Train-Test split</a>

In [None]:
# Generate a Random Forest Classifier object
rf_model = RandomForestClassifier(n_estimators=100, min_samples_leaf=1, random_state=2202)

# n_estimators - number of trees in the forest
# n_jobs - number of CPU cores to use
# oob_score - whether to use out-of-bag samples to estimate the generalization score
# max_depth - maximum depth of each tree in the forest
# min_samples_split - Minimum number of samples required to split an internal node
# min_samples_leaf  - Minimum number of samples required to be at a leaf node
# max_features - the number of features to consider when looking for best split

# Train a Random Forest model with Train dataset
rf_model.fit(X_train, Y_train)

# Predict the outcome
y_hat_rf = rf_model.predict(X_test)

# Compute the accuracy score
accuracy_rf = accuracy_score(Y_test, y_hat_rf)

# Print first 8 rows to visualize the prediction.
print ("First few predicted Loan Status values: ", y_hat_rf[:8], "\n")

# Print the accuracy score
print ("Accuracy of Random Forest model: ", accuracy_rf)

# Print confusion matrix of actual and predicted values using Crosstab function
pd.crosstab(Y_test, y_hat_rf, rownames=['Actual'], colnames=['Predicted'], margins=True)
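
By default, each tree in a random forest is trained on a bootstrap sample: rows drawn with replacement from the Train dataset. The rows a tree never sees are its "out-of-bag" rows, which is what the `oob_score` option relies on. The sketch below only illustrates the sampling idea in plain Python; `bootstrap_sample` and the row count of 1000 are assumptions for illustration, not scikit-learn internals.

```python
import random


def bootstrap_sample(n_rows, rng):
    """Draw n_rows row positions with replacement."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


rng = random.Random(2202)
n_rows = 1000

# Rows that were drawn at least once for this (hypothetical) tree
in_bag = set(bootstrap_sample(n_rows, rng))

# Rows never drawn are out-of-bag and can serve as a built-in validation set
oob = [i for i in range(n_rows) if i not in in_bag]
oob_fraction = len(oob) / n_rows

print("Unique in-bag rows:", len(in_bag))
print("Out-of-bag fraction:", oob_fraction)
```

For large samples the out-of-bag fraction approaches 1/e, roughly 37%, so each tree is evaluated on a sizeable set of rows it was not trained on.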

#### 4.2.1 Print Random Forest Model parameters

In [None]:
# Print the Random Forest Model
rf_model

#### 4.2.2 To output the Binary Probability using Random Forest model ...

In [None]:
# Predict the binary probability outcome
y_hat_rf_proba = rf_model.predict_proba(X_test)

# Print first 8 rows to visualize the prediction.
y_hat_rf_proba[:8]

#### 4.2.3 To draw Area Under the Curve (AUC) of Random Forest model ...

In [None]:
# Compute area under the curve (AUC)
auc_rf = metrics.roc_auc_score(Y_test, y_hat_rf_proba[:, 1])

# Compute False Positive Rate, True Positive Rate, and Thresholds using metrics library
fpr, tpr, threshold = metrics.roc_curve(Y_test, y_hat_rf_proba[:, 1])

# Set the plotting area/ size
plt.figure(figsize = (8, 8))

# Plot the AUC curve
plt.plot(fpr, tpr, label='AUC =' + str(auc_rf))
plt.legend(loc=4)

# Plot the base line (diagonal dotted line)
plt.plot([0, 1], [0, 1], 'k-.', lw=1)

### <a id='knn_train_test'>4.3 k-Nearest Neighbor Model using Train-Test split</a>

In [None]:
# Generate a k-Nearest Neighbor object
knn_model = KNeighborsClassifier(n_neighbors = 9)

# Train a k-Nearest Neighbor model with Train dataset
knn_model.fit(X_train, Y_train)

# Predict the outcome
y_hat_knn = knn_model.predict(X_test)

# Compute the accuracy score
accuracy_knn = accuracy_score(Y_test, y_hat_knn)

# Print first 8 rows to visualize the prediction.
print ("First few predicted Loan Status values: ", y_hat_knn[:8], "\n")

# Print the accuracy score
print ("Accuracy of k-Nearest Neighbor model: ", accuracy_knn)

# Print confusion matrix of actual and predicted values using Crosstab function
pd.crosstab(Y_test, y_hat_knn, rownames=['Actual'], colnames=['Predicted'], margins=True)
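
The idea behind k-Nearest Neighbor can be sketched in a few lines: find the `k` training points closest to the query point and take a majority vote over their labels. The helper `knn_predict` and the toy points below are made up for illustration; this is a brute-force sketch, not how `KNeighborsClassifier` is implemented.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    """Predict a label by majority vote among the k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Vote among the labels of the k closest points
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy data: one cluster near (0, 0) labelled 0, one near (5, 5) labelled 1
toy_points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
toy_labels = [0, 0, 0, 1, 1, 1]

pred_near_one = knn_predict(toy_points, toy_labels, (4.5, 5.0), k=3)
pred_near_zero = knn_predict(toy_points, toy_labels, (0.5, 0.2), k=3)
print(pred_near_one, pred_near_zero)
```

Because the vote depends entirely on distances, features measured on large scales can dominate the result, so scaling the features before fitting is often worth trying.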

#### 4.3.1 Print k-Nearest Neighbor Model parameters

In [None]:
# Print the k-Nearest Neighbor Model
knn_model

#### 4.3.2 To output the Binary Probability using k-Nearest Neighbor model ...

In [None]:
# Predict the binary probability outcome
y_hat_knn_proba =  << your code goes here >>

# Print first 8 rows to visualize the prediction.
y_hat_knn_proba[:8]

#### 4.3.3 To draw Area Under the Curve (AUC) of k-Nearest Neighbor model ...

In [None]:
<< HOME WORK: your code goes here >>

### <a id='svm_train_test'>4.4 Support Vector Machine (SVM)  Model using Train-Test split</a>

In [None]:
# Generate a Support Vector Machine (SVM) object
# NOTE: gamma and degree are ignored when kernel='linear'
svm_model = svm.SVC(gamma=0.05, degree=5, kernel='linear')

# Train a Support Vector Machine model with Train dataset
svm_model.fit(X_train, Y_train)

# Predict the outcome
y_hat_svm = svm_model.predict(X_test)

# Compute the accuracy score
accuracy_svm = accuracy_score(Y_test, y_hat_svm)

# Print first 8 rows to visualize the prediction.
print ("First few predicted Loan Status values: ", y_hat_svm[:8], "\n")

# Print the accuracy score
print ("Accuracy of Support Vector Machine model: ", accuracy_svm)

# Print confusion matrix of actual and predicted values using Crosstab function
pd.crosstab(Y_test, y_hat_svm, rownames=['Actual'], colnames=['Predicted'], margins=True)

#### 4.4.1 Print Support Vector Machine (SVM) Model parameters

In [None]:
# Print the Support Vector Machine Model
svm_model

#### 4.4.2 To output the Binary Probability using Support Vector Machine (SVM) model ...

In [None]:
# Generate a Support Vector Machine (SVM) object to determine probability of binary outcome
svm_model_proba = svm.SVC(gamma=0.05, degree=5, kernel='linear', probability=True)

# Train a Support Vector Machine model with Train dataset
svm_model_proba.fit(X_train, Y_train)

# Predict the probability outcome
y_hat_svm_proba = svm_model_proba.predict_proba(X_test)

# Print first 8 rows to visualize the prediction.
y_hat_svm_proba[:8]

#### 4.4.3 To draw Area Under the Curve (AUC) of Support Vector Machine (SVM) model ...

In [None]:
# Compute area under the curve (AUC)
auc_svm = metrics.roc_auc_score(Y_test, y_hat_svm_proba[:, 1])

# Compute False Positive Rate, True Positive Rate, and Thresholds using metrics library
fpr, tpr, threshold = metrics.roc_curve(Y_test, y_hat_svm_proba[:, 1])

# Set the plotting area/ size
plt.figure(figsize = (8, 8))

# Plot the AUC curve
plt.plot(fpr, tpr, label='AUC =' + str(auc_svm))
plt.legend(loc=4)

# Plot the base line (diagonal dotted line)
plt.plot([0, 1], [0, 1], 'k-.', lw=1)

### <a id='xgb_train_test'>4.5 XGBoost Classifier Model using Train-Test split</a>

In [None]:
# Generate a XGBoost object
xgb_model = XGBClassifier(learning_rate =0.01, 
                      subsample=0.75, 
                      colsample_bytree=0.72, 
                      min_child_weight=8,
                      max_depth=5)

# Train a XGBoost model with Train dataset
xgb_model.fit(X_train, Y_train)

# Predict the outcome
y_hat_xgb = xgb_model.predict(X_test)

# Compute the accuracy score
accuracy_xgb = accuracy_score(Y_test, y_hat_xgb)

# Print first 8 rows to visualize the prediction.
print ("First few predicted Loan Status values: ", y_hat_xgb[:8], "\n")

# Print the accuracy score
print ("Accuracy of XGBoost model: ", accuracy_xgb)

# Print confusion matrix of actual and predicted values using Crosstab function
pd.crosstab(Y_test, y_hat_xgb, rownames=['Actual'], colnames=['Predicted'], margins=True)

#### 4.5.1 Print XGBoost Classifier Model parameters

In [None]:
# Print the XGBoost Model
xgb_model

#### 4.5.2 To output the Binary Probability using XGBoost Classifier model ...

In [None]:
# Predict the binary probability outcome
y_hat_xgb_proba = << your code goes here>> 

# Print first 8 rows to visualize the prediction.
y_hat_xgb_proba[:8]

#### 4.5.3 To draw Area Under the Curve (AUC) of XGBoost model ...

In [None]:
<< HOME WORK: your code goes here >>

## <a id='analyze_results'>5. Analyze the model results</a>

### <a id='print_AUC'>5.1 Print Accuracy and AUC of all models</a>

In [None]:
auc_xgb = 0.7253211629479379  # Computed beforehand
auc_knn = 0.5086206896551724  # Computed beforehand

# Create a dataframe with Accuracy and AUC
acc_auc_df = pd.DataFrame(
                    {'Metrics': ['Accuracy', 'AUC'],
                    'Logistic Regression': [accuracy_lr, auc_lr],
                    'Random Forest': [accuracy_rf, auc_rf],
                    'k-NN': [accuracy_knn, auc_knn],
                    'SVM': [accuracy_svm, auc_svm],
                    'XGBoost': [accuracy_xgb, auc_xgb]}
)

acc_auc_df

## <a id='k_fold'>6. Split the dataset. _Technique: K-Fold Cross Validation_</a>

In [None]:
# Set the number of splits
NO_SPLITS = 5

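# Create KFold object with number of splits
# shuffle=True is needed for random_state to take effect; recent scikit-learn
# releases raise an error when random_state is set without it
kf = KFold(n_splits=NO_SPLITS, shuffle=True, random_state=111)

# Create temp datasets to store the X and Y parts of the dataset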
loan_data_X = loan_data.drop('Loan_Status',  axis = 1)
loan_data_Y = loan_data['Loan_Status']

# Loop through the folds and print their shapes.
# NOTE: This loop is for illustration only and is not required in production.
for train_index, test_index in kf.split(loan_data_X):
    
    # Split train and test datasets using fold index
    X_train, X_test = loan_data_X.iloc[train_index], loan_data_X.iloc[test_index]
    # Use .iloc because kf.split returns row positions, not index labels
    Y_train, Y_test = loan_data_Y.iloc[train_index], loan_data_Y.iloc[test_index]
    
    # Print the shape of the Train set
    print("Train dataset: ", X_train.shape, Y_train.shape)

    # Print the shape of the Test set
    print("Test dataset: ", X_test.shape, Y_test.shape, "\n")
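
To see what a K-Fold splitter produces, here is a minimal pure-Python sketch that generates the train and test row positions for each fold without shuffling. The helper `kfold_indices` and the sizes (10 rows, 3 folds) are made up for illustration; when the rows do not divide evenly, the first folds get one extra row each.

```python
def kfold_indices(n_rows, n_splits):
    """Yield (train_idx, test_idx) for consecutive, unshuffled folds."""
    # The first n_rows % n_splits folds get one extra row
    fold_sizes = [
        n_rows // n_splits + (1 if i < n_rows % n_splits else 0)
        for i in range(n_splits)
    ]
    start = 0
    for size in fold_sizes:
        # The current block of rows is the test fold
        test_idx = list(range(start, start + size))
        # Everything before and after the block is the train set
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, test_idx
        start += size


folds = list(kfold_indices(n_rows=10, n_splits=3))
for train_idx, test_idx in folds:
    print("Train:", train_idx, "Test:", test_idx)
```

Each row appears in exactly one test fold, so every observation is used for evaluation once and for training in the remaining folds.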

## <a id='binomial_kfold'>7. Generate Binary Classification Models using K-Fold split</a>

### <a id='lr_kfold'>7.1 Logistic Regression Model using K-Fold split</a>

In [None]:
# Generate a Logistic Regression object
lr_KFold_model = LogisticRegression(solver='liblinear')

# Define a variable to store the sum of accuracy of each fold
accuracy_score_sum = 0

# Split the data using KFold object, run the model iteratively, and compute accuracy
for train_index, test_index in kf.split(loan_data_X):

    # Split train and test datasets using fold index
    X_train, X_test = loan_data_X.iloc[train_index], loan_data_X.iloc[test_index]
    Y_train, Y_test = loan_data_Y.iloc[train_index], loan_data_Y.iloc[test_index]

    # Train a Logistic Regression model with Train dataset
    lr_KFold_model.fit(X_train, Y_train)

    # Compute the accuracy score of this fold
    # score() does two things. First, with 'X_test', it predicts the 'y-hat'. 
    # Second, with 'y-hat' and 'Y_test' it computes accuracy score
    # (named fold_accuracy so it does not shadow the imported accuracy_score function)
    fold_accuracy = lr_KFold_model.score(X_test, Y_test)
    
    # Print accuracy score
    print(fold_accuracy)
    
    # Add accuracy of each iteration
    accuracy_score_sum += fold_accuracy

# Compute the mean accuracy score of K-Folds and print it
mean_accuracy = accuracy_score_sum / NO_SPLITS
print ("Final accuracy of KFold CV using Logistic Regression: ", mean_accuracy)
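
The manual loop can also be expressed with scikit-learn's `cross_val_score`, which fits the model on each fold and returns one score per fold. The sketch below runs on synthetic data from `make_classification` as a stand-in for the loan dataset; the names `X_demo`, `y_demo`, `kf_demo`, and `fold_scores` are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the loan data (assumption: numeric features, binary target)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# shuffle=True so that random_state takes effect
kf_demo = KFold(n_splits=5, shuffle=True, random_state=111)

# One accuracy score per fold; accuracy is the default score for classifiers
fold_scores = cross_val_score(
    LogisticRegression(solver='liblinear'), X_demo, y_demo, cv=kf_demo
)

print("Per-fold accuracy:", fold_scores)
print("Mean accuracy:", fold_scores.mean(), "Std:", fold_scores.std())
```

Reporting the standard deviation next to the mean gives a rough sense of how much the accuracy varies from fold to fold.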

### <a id='rf_kfold'>7.2 Random Forest Model using K-Fold split</a>

In [None]:
# Generate a Random Forest object
rf_KFold_model = RandomForestClassifier(n_estimators=100, min_samples_leaf=1, random_state=2202)

# Define a variable to store the sum of accuracy of each fold
accuracy_score_sum = 0

# Create temp datasets to store the X and Y part of the dataset
loan_data_X = loan_data.drop('Loan_Status',  axis = 1)
loan_data_Y = loan_data['Loan_Status']

# Split the data using KFold object, run the model iteratively, and compute accuracy
for train_index, test_index in kf.split(loan_data_X):

    # Split train and test datasets using fold index
    X_train, X_test = loan_data_X.iloc[train_index], loan_data_X.iloc[test_index]
    Y_train, Y_test = loan_data_Y.iloc[train_index], loan_data_Y.iloc[test_index]

    # Train a Random Forest model with Train dataset
    rf_KFold_model.fit(X_train, Y_train)

    # Compute the accuracy score of this fold
    # (named fold_accuracy so it does not shadow the imported accuracy_score function)
    fold_accuracy = rf_KFold_model.score(X_test, Y_test)
    
    # Print accuracy score
    print(fold_accuracy)
    
    # Add accuracy of each iteration
    accuracy_score_sum += fold_accuracy

# Compute the mean accuracy score of K-Folds and print it
mean_accuracy = accuracy_score_sum / NO_SPLITS
print ("Final accuracy of KFold CV using Random Forest model: ", mean_accuracy)

### <a id='knn_kfold'>7.3 k-Nearest Neighbor Model using K-Fold split</a>

In [None]:
<< Your code goes here.. >>

### <a id='svm_kfold'>7.4 Support Vector Machine (SVM) Model using K-Fold split</a>

In [None]:
# Generate a Support Vector Machine (SVM) object
svm_KFold_model = svm.SVC(gamma='scale')

# Define a variable to store the sum of accuracy of each fold
accuracy_score_sum = 0

# Create temp datasets to store the X and Y part of the dataset
loan_data_X = loan_data.drop('Loan_Status',  axis = 1)
loan_data_Y = loan_data['Loan_Status']

# Split the data using KFold object, run the model iteratively, and compute accuracy
for train_index, test_index in kf.split(loan_data_X):

    # Split train and test datasets using fold index
    X_train, X_test = loan_data_X.iloc[train_index], loan_data_X.iloc[test_index]
    Y_train, Y_test = loan_data_Y.iloc[train_index], loan_data_Y.iloc[test_index]

    # Train a Support Vector Machine model with Train dataset
    svm_KFold_model.fit(X_train, Y_train)

    # Compute the accuracy score of this fold
    # (named fold_accuracy so it does not shadow the imported accuracy_score function)
    fold_accuracy = svm_KFold_model.score(X_test, Y_test)
    
    # Print accuracy score
    print(fold_accuracy)
    
    # Add accuracy of each iteration
    accuracy_score_sum += fold_accuracy

# Compute the mean accuracy score of K-Folds and print it
mean_accuracy = accuracy_score_sum / NO_SPLITS
print ("Final accuracy of KFold CV using Support Vector Machine model: ", mean_accuracy)

### <a id='xgb_kfold'>7.5 XGBoost Classifier Model using K-Fold split</a>

In [None]:
<< Your code goes here.. >>