# **Introduction**

In the [paper from UCLA](https://escholarship.org/uc/item/3rs9b3d6) on XGBoost and loan predictions, the authors try several data balancing methods to see which one produces the best results. The base XGBoost algorithm (without a data balancing technique) had the best accuracy. The class weights technique, although less accurate than the base algorithm, had a much better F-measure and a much better recall. Higher recall matters here because we want fewer false negatives, i.e. we do not want a loan that will default to be approved. The paper also used the SMOTE and ADASYN balancing methods, but they did not compare as well, so we will disregard them.

We will compare our XGBoost algorithm, which uses K-fold cross validation as the data balancing technique, against the base and class weights results from the paper. There are different variations of K-fold cross validation; in this project we implement both standard K-fold and stratified K-fold cross validation with XGBoost and observe how the two fare against the UCLA paper's XGBoost models (base and class weights). We first implement XGBoost with the parameters specified by the paper, and then a standard version of XGBoost that sets fewer parameters, and see how both do. Each of the two is evaluated with both K-fold and stratified K-fold cross validation.
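To make the accuracy versus recall trade-off concrete, here is a minimal sketch on made-up labels (not the loan data). It assumes that 1 marks a default and 0 a repaid loan; the two prediction vectors are hypothetical.

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Toy labels, made up for illustration: 1 = loan defaults, 0 = loan is repaid.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# A "model" that approves everyone never flags a default.
y_pred_all_zero = [0] * 10

# A model that catches both defaults at the cost of two false alarms.
y_pred_cautious = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

for name, y_pred in [("approve everyone", y_pred_all_zero),
                     ("cautious", y_pred_cautious)]:
    acc = accuracy_score(y_true, y_pred)
    # zero_division=0 avoids a warning when no positives are predicted.
    rec = recall_score(y_true, y_pred, zero_division=0)
    f1 = f1_score(y_true, y_pred, zero_division=0)
    print(f"{name}: accuracy={acc:.2f}, recall={rec:.2f}, f1={f1:.2f}")
```

Both toy models reach the same accuracy, yet only the second one catches the defaults, which is why recall and F-measure are tracked alongside accuracy.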

In [None]:
from PIL import Image
Image.open("images/UCLAarticle_performance.png")

As can be seen above, XGBoost with class weights and XGBoost with no data balancing technique are the strongest models, so we will assess our algorithm against those two.

In [None]:
import pandas as pd

## Load Data

Below, we view the CSV files as dataframes and can see the features that will help our XGBoost model predict. The data being loaded was modified slightly. The original data consisted of the training data, the test data without the target feature, and the target feature for the test data. I combined the test data with its target feature to keep the data better organized. The original data from Kaggle is in the Original_CSV_Data_From_Kaggle folder, while the Edited_CSV_Files folder holds the data after the modification described above.

In [None]:
test_data = pd.read_csv('Edited_CSV_Files/TestData.csv')
train_data = pd.read_csv('Edited_CSV_Files/TrainingData.csv')
train_data.head()

In [None]:
test_data.head()

To determine how many rows and columns each dataframe has, we can read its shape attribute. The first value in the tuple is the number of rows and the second value is the number of columns.

In [None]:
train_data.shape

In [None]:
test_data.shape

## Combining data into one huge dataset 

Our data is divided into a train set and a test set. To prepare the data for K-fold cross validation, we must combine the two, because K-fold cross validation splits the data into k groups itself and, on each iteration, holds one group out as the test set and trains on the rest. We then loop our XGBoost algorithm over all the groups and record its metrics on each one. This helps show that our model predicts well, since on every iteration it is tested against "unknown" data, that is, data it was not trained on in that iteration, which supports the reported accuracy.

In [None]:
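train_test_data = [train_data, test_data]

# ignore_index=True renumbers the rows 0..n-1 instead of repeating each frame's own index
all_data = pd.concat(train_test_data, ignore_index=True)

# we do not need the ID column since pandas already numbers the rows for us
all_data = all_data.drop(columns=['ID'])

all_data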

In [None]:
all_data.shape

This table has 280,000 rows. This makes sense because the train set had 252,000 rows and the test set had 28,000 rows. Together, that equals 280,000.
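The same row-count check can be reproduced on two tiny frames. This is only a sketch with made-up values; `toy_train` and `toy_test` are hypothetical stand-ins for the real dataframes. It also shows how `pd.concat` treats the row index with and without `ignore_index=True`.

```python
import pandas as pd

# Tiny stand-ins for the train and test frames (values are made up).
toy_train = pd.DataFrame({"ID": [1, 2, 3], "Risk_Flag": [0, 1, 0]})
toy_test = pd.DataFrame({"ID": [1, 2], "Risk_Flag": [0, 0]})

# By default concat keeps each frame's own index, so labels 0 and 1 repeat.
stacked = pd.concat([toy_train, toy_test])
print(stacked.index.tolist())

# ignore_index=True renumbers the rows from 0 to n-1.
renumbered = pd.concat([toy_train, toy_test], ignore_index=True)
print(renumbered.index.tolist())

# The combined row count is the sum of the parts.
assert len(renumbered) == len(toy_train) + len(toy_test)
```

Positional indexing with `.iloc` works either way, but a unique index avoids surprises with label-based operations.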

## Check for missing values

In [None]:
all_data.isnull().sum() #if all columns sum to zero, it means there are no missing values/nulls

# **Data Preprocessing** 

## Scaling Data 

Scaling is an important step in the data science process. We scale our numeric data so that a column with large values is not given more weight than a column with small values purely because of its magnitude, which may not reflect its real importance (an age column, for example, would be affected by this). We will use the min-max scaler formula, which rescales each numeric column to the range 0 to 1. To do so, we separate the categorical columns, scale the numeric columns, and then put the dataframe back together. Here is the min-max formula we will be using below:

In [None]:
Image.open("images/Min_Max_Scaler_Formula.png")
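As a small worked example of the min-max formula, the sketch below scales two made-up columns and checks the result against scikit-learn's `MinMaxScaler`. The frame `toy` and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up values, used only to illustrate the formula.
toy = pd.DataFrame({"Age": [20, 35, 50, 80],
                    "Income": [1000, 4000, 7000, 10000]})

# x_scaled = (x - min) / (max - min), applied column by column.
toy_scaled = (toy - toy.min()) / (toy.max() - toy.min())
print(toy_scaled)

# scikit-learn's MinMaxScaler applies the same formula by default.
sk_scaled = MinMaxScaler().fit_transform(toy)
print(np.allclose(toy_scaled.to_numpy(), sk_scaled))
```

After scaling, both columns run from 0 to 1 even though their original ranges differ by orders of magnitude.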

In [None]:
column_labels = ["Profession", "CITY", "STATE","Married/Single", "House_Ownership", "Car_Ownership"]
data_numeric = all_data.drop(columns = column_labels) # to have numeric data separate from categorical in order to perform scaling
data_numeric

In [None]:
data_categorical = all_data[column_labels] # to have the categorical data separate from numeric
data_categorical

In [None]:
data_numeric_scaled = (data_numeric-data_numeric.min())/(data_numeric.max()-data_numeric.min())
data_numeric_scaled

In [None]:
all_data_scaled = pd.concat([data_categorical, data_numeric_scaled], axis=1)
all_data_scaled

Now, the categorical data and the scaled numeric data are back in the same dataframe.

## One Hot Encoding

Some of the columns in the dataframe, such as Profession, CITY and House_Ownership, are not numeric or binary. To build a working ML model, we must convert this categorical data into numeric or binary data. There are two popular encoding methods: label encoding and one hot encoding. Label encoding is ideal for ordinal data, but the columns we are converting are nominal (their categories have no natural order), so one hot encoding is the better fit. One hot encoding turns each category value into its own new column holding a binary value of 1 or 0, so that all the categorical data is represented as binary data. Below we implement one hot encoding using the pandas get_dummies() function.
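Before applying it to the full dataframe, here is a minimal sketch of what `get_dummies()` does to a single categorical column. The frame `toy` and its category values are made up for illustration.

```python
import pandas as pd

# Made-up rows: one categorical column and one already-scaled numeric column.
toy = pd.DataFrame({
    "House_Ownership": ["rented", "owned", "rented", "norent_noown"],
    "Age": [0.1, 0.5, 0.9, 0.3],
})

# Each distinct category becomes its own 0/1 column; numeric columns are kept.
encoded = pd.get_dummies(toy, columns=["House_Ownership"])
print(encoded.columns.tolist())
print(encoded)
```

Every row has exactly one of the new indicator columns set, so no artificial ordering between categories is introduced.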

In [None]:

column_labels = ["Profession", "CITY", "STATE","Married/Single", "House_Ownership", "Car_Ownership"]
data_categorical = all_data_scaled.copy()

data_one_hot_encoded = pd.get_dummies(data_categorical, columns = column_labels)


The code below replaces any square bracket or less-than sign in a column label with an underscore. This prevents errors later when we run our data through the algorithm.

In [None]:
import re

# XGBoost rejects feature names containing '[', ']' or '<', so replace each with '_'
regex = re.compile(r"\[|\]|<")
data_one_hot_encoded.columns = [regex.sub("_", str(col)) for col in data_one_hot_encoded.columns]
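The effect of the substitution is easiest to see on a few column names. The names in `toy_columns` are hypothetical, and the character class below is just a compact way of matching the same three characters.

```python
import re

# Matches any single '[', ']' or '<'.
bad_chars = re.compile(r"[\[\]<]")

# Hypothetical column names of the kind one hot encoding can produce.
toy_columns = ["CITY_Delhi", "CITY_Mysore[[6]]", "STATE_Uttar_Pradesh[5]", "x<y"]

cleaned = [bad_chars.sub("_", name) for name in toy_columns]
print(cleaned)
```

Names without any of the three characters pass through unchanged.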

Let us view the new columns after the change from categorical data to binary data. 

In [None]:
data_one_hot_encoded.columns.tolist()

As can be seen above, there are a lot more columns. This is necessary because the model needs numeric inputs rather than raw category strings. There are 458 columns now as opposed to 11 before to account for the categorical data.

# **Building Our Model Like the XGBoost model in the UCLA paper**

Before building the model, let's split our target feature from our other features.

In [None]:
X = data_one_hot_encoded.drop(columns = ['Risk_Flag'])
X

In [None]:
Y = pd.DataFrame(data_one_hot_encoded['Risk_Flag'])

Here, we are implementing our algorithm. Our algorithm will have the same parameters as the XGBoost algorithm in the paper.

In [None]:
from xgboost import XGBClassifier
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score


#Parameters filled for the XGBClassifier are the same as in the UCLA paper
xgb_model = XGBClassifier(n_estimators=300, max_depth = 3, learning_rate = 0.05, use_label_encoder=False, eval_metric='mlogloss', scale_pos_weight=7.1)


accuracy_list = []
recall_list = []
precision_list = []
f1_score_list = []
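The `scale_pos_weight` argument is how XGBoost applies class weights. The XGBoost documentation suggests the ratio of negative to positive samples as a typical value. The sketch below computes that ratio on a made-up label vector; whether the 7.1 used above was derived this way is an assumption, not something stated here.

```python
import numpy as np

# Made-up label vector: 1 = default (positive class), 0 = no default.
toy_labels = np.array([0] * 71 + [1] * 10)

n_negative = int((toy_labels == 0).sum())
n_positive = int((toy_labels == 1).sum())

# Heuristic from the XGBoost documentation: negatives divided by positives.
suggested_scale_pos_weight = n_negative / n_positive
print(suggested_scale_pos_weight)
```

On the real data the same two counts could be taken from the Risk_Flag column.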

## K Fold Cross Validation for XGBoost from UCLA Paper

Cross-validation is a statistical method for evaluating the skill of machine learning models using resampling. K-fold cross validation has a single parameter, k, the number of groups (folds) that the data sample is split into. On each iteration one fold is held out as the test set, the model is trained on the remaining k-1 folds, and the evaluation metrics on the held-out fold are stored. The model is then discarded, and a new one is trained and evaluated with the next fold held out, and so on until every fold has been used as the test set once. The difference between KFold and Stratified KFold, which we also implement below, is that KFold samples the data at random to form the folds, whereas Stratified KFold still samples at random but keeps the proportion of target values in each fold the same as in the full dataset.

Below we implement K-fold cross validation. We run it three times, with k=3, k=5 and k=10. We chose these values because they are the most commonly used and studied splits.
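To see the splitting mechanics in isolation, this sketch runs `KFold` on six made-up samples and prints which row positions land in the train and test sets of each fold. `toy_X` is a hypothetical array, not the loan data.

```python
import numpy as np
from sklearn.model_selection import KFold

toy_X = np.arange(12).reshape(6, 2)   # 6 samples, 2 features
kf = KFold(n_splits=3, shuffle=True, random_state=1)

seen_in_test = []
for fold, (train_idx, test_idx) in enumerate(kf.split(toy_X), start=1):
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
    seen_in_test.extend(test_idx.tolist())

# Every sample lands in exactly one test fold.
assert sorted(seen_in_test) == list(range(6))
```

The indices returned by `split` are positions, which is why the notebook selects rows with `.iloc`.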

In [None]:
from sklearn.model_selection import KFold
import numpy as np
kfcv3 = KFold(n_splits=3, random_state = 1, shuffle = True)
kfcv5 = KFold(n_splits=5, random_state = 1, shuffle = True)
kfcv10 = KFold(n_splits=10,  random_state = 1, shuffle = True)

### 3-Fold Cross Validation

In [None]:
%%time
for train, test in kfcv3.split(X, Y):
    x_train_one_fold, x_test_one_fold = X.iloc[train], X.iloc[test]
    y_train_one_fold, y_test_one_fold = Y.iloc[train], Y.iloc[test]
    xgb_model.fit(x_train_one_fold, y_train_one_fold)
    preds = xgb_model.predict(x_test_one_fold)
    precision_list.append(precision_score(y_test_one_fold, preds, average='macro', pos_label=1))
    recall_list.append(recall_score(y_test_one_fold, preds, average='macro', pos_label=1))
    f1_score_list.append(f1_score(y_test_one_fold, preds, average='macro', pos_label=1))
    accuracy_list.append(accuracy_score(y_test_one_fold, preds))
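One point worth keeping in mind when reading the scores: with `average='macro'`, scikit-learn averages the per-class scores, and `pos_label` is ignored. The sketch below, on made-up labels, shows how macro recall can differ from the recall of the positive class alone. Which of the two the paper reports is not established here, so treat any direct comparison with care.

```python
from sklearn.metrics import recall_score

# Made-up labels: two defaults, one of which the model misses.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

# Recall of each class on its own.
recall_positive = recall_score(y_true, y_pred, average="binary", pos_label=1)
recall_negative = recall_score(y_true, y_pred, average="binary", pos_label=0)

# Macro recall is the unweighted mean of the two per-class recalls.
recall_macro = recall_score(y_true, y_pred, average="macro")

print(recall_positive, recall_negative, recall_macro)
```

Here the positive-class recall is 0.5 while the macro recall is 0.75, because the perfectly classified majority class pulls the average up.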

In [None]:
print('Accuracy_Scores ...... Precision_Scores ...... Recall_Scores ...... F1_Scores')

for i in range(3):
    print(round(accuracy_list[i],5), '              ' , round(precision_list[i], 5), '              ', round(recall_list[i],5), '              ', round(f1_score_list[i], 5))

In [None]:
avg_accuracy_kfcv3 = sum(accuracy_list) / len(accuracy_list)
avg_precision_kfcv3 = sum(precision_list) / len(precision_list)
avg_recall_kfcv3 = sum(recall_list) / len(recall_list)
avg_f1_score_kfcv3 = sum(f1_score_list) / len(f1_score_list)

In [None]:
accuracy_list.clear()
precision_list.clear()
recall_list.clear()
f1_score_list.clear()
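The fit, predict, score and average steps above are repeated for every value of k. As one possible condensation, the sketch below wraps them in a helper built on scikit-learn's `cross_validate`. The helper name `average_cv_scores` is hypothetical, and a `LogisticRegression` on synthetic data stands in for `xgb_model`, `X` and `Y` so that the example runs on its own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate

SCORING = ["accuracy", "precision_macro", "recall_macro", "f1_macro"]

def average_cv_scores(model, X, y, splitter):
    """Return the mean of each metric in SCORING across the splitter's folds."""
    # cross_validate clones the model for each fold, so no state carries over.
    results = cross_validate(model, X, y, cv=splitter, scoring=SCORING)
    return {name: float(np.mean(results[f"test_{name}"])) for name in SCORING}

# Synthetic, imbalanced stand-in data; the notebook would pass X and Y instead.
toy_X, toy_y = make_classification(n_samples=300, n_features=5,
                                   weights=[0.85, 0.15], random_state=1)
toy_model = LogisticRegression(max_iter=400)  # stand-in for xgb_model

for k in (3, 5, 10):
    splitter = KFold(n_splits=k, shuffle=True, random_state=1)
    print(k, average_cv_scores(toy_model, toy_X, toy_y, splitter))
```

Passing a `StratifiedKFold` splitter instead of `KFold` would reuse the same helper for the stratified runs.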

### 5-Fold Cross Validation

In [None]:
%%time
for train, test in kfcv5.split(X, Y):
    x_train_one_fold, x_test_one_fold = X.iloc[train], X.iloc[test]
    y_train_one_fold, y_test_one_fold = Y.iloc[train], Y.iloc[test]
    xgb_model.fit(x_train_one_fold, y_train_one_fold)
    preds = xgb_model.predict(x_test_one_fold)
    precision_list.append(precision_score(y_test_one_fold, preds, average='macro', pos_label=1))
    recall_list.append(recall_score(y_test_one_fold, preds, average='macro', pos_label=1))
    f1_score_list.append(f1_score(y_test_one_fold, preds, average='macro', pos_label=1))
    accuracy_list.append(accuracy_score(y_test_one_fold, preds))

In [None]:
print('Accuracy_Scores ...... Precision_Scores ...... Recall_Scores ...... F1_Scores')

for i in range(5):
    print(round(accuracy_list[i],5), '              ' , round(precision_list[i], 5), '              ', round(recall_list[i],5), '              ', round(f1_score_list[i], 5))

In [None]:
avg_accuracy_kfcv5 = sum(accuracy_list) / len(accuracy_list)
avg_precision_kfcv5 = sum(precision_list) / len(precision_list)
avg_recall_kfcv5 = sum(recall_list) / len(recall_list)
avg_f1_score_kfcv5 = sum(f1_score_list) / len(f1_score_list)

In [None]:
accuracy_list.clear()
precision_list.clear()
recall_list.clear()
f1_score_list.clear()

### 10-Fold Cross Validation

In [None]:
%%time
for train, test in kfcv10.split(X, Y):
    x_train_one_fold, x_test_one_fold = X.iloc[train], X.iloc[test]
    y_train_one_fold, y_test_one_fold = Y.iloc[train], Y.iloc[test]
    xgb_model.fit(x_train_one_fold, y_train_one_fold)
    preds = xgb_model.predict(x_test_one_fold)
    precision_list.append(precision_score(y_test_one_fold, preds, average='macro', pos_label=1))
    recall_list.append(recall_score(y_test_one_fold, preds, average='macro', pos_label=1))
    f1_score_list.append(f1_score(y_test_one_fold, preds, average='macro', pos_label=1))
    accuracy_list.append(accuracy_score(y_test_one_fold, preds))

In [None]:
print('Accuracy_Scores ...... Precision_Scores ...... Recall_Scores ...... F1_Scores')

for i in range(10):
    print(round(accuracy_list[i],5), '              ' , round(precision_list[i], 5), '              ', round(recall_list[i],5), '              ', round(f1_score_list[i], 5))

In [None]:
avg_accuracy_kfcv10 = sum(accuracy_list) / len(accuracy_list)
avg_precision_kfcv10 = sum(precision_list) / len(precision_list)
avg_recall_kfcv10 = sum(recall_list) / len(recall_list)
avg_f1_score_kfcv10 = sum(f1_score_list) / len(f1_score_list)

In [None]:
accuracy_list.clear()
precision_list.clear()
recall_list.clear()
f1_score_list.clear()

### Visualization of Results for XGBoost from UCLA paper using K-Fold 

In [None]:
import matplotlib.pyplot as plt
data = [[avg_accuracy_kfcv3, avg_accuracy_kfcv5, avg_accuracy_kfcv10],
       [avg_precision_kfcv3, avg_precision_kfcv5, avg_precision_kfcv10],
       [avg_recall_kfcv3, avg_recall_kfcv5, avg_recall_kfcv10],
       [avg_f1_score_kfcv3, avg_f1_score_kfcv5, avg_f1_score_kfcv10]]
x = np.arange(3)
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])

ax.bar(x + 0.0, data[0], color = 'b', width = 0.2)
ax.bar(x + 0.2, data[1], color = 'g', width = 0.2)
ax.bar(x + 0.4, data[2], color = 'r', width = 0.2)
ax.bar(x + 0.6, data[3], color = 'y', width = 0.2)
ax.set_xticks(x + 0.3)
ax.set_title('XGBoost from UCLA Paper with K-Fold Cross Validation')
ax.set_xticklabels(['K = 3', 'K = 5', 'K = 10'])
ax.legend(labels = ['Accuracy', 'Precision', 'Recall', 'F1 Score'], bbox_to_anchor=(1.05, 1), loc='upper left')

## Stratified K-Fold Cross Validation for XGBoost from UCLA paper

Stratified K-Fold, as explained above, is a cross validation technique that splits the data into k folds while keeping the proportion of target values in every fold the same as in the main dataframe. Below is the implementation using k=3, k=5 and k=10.
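The difference in fold composition can be checked directly. This sketch uses a made-up label vector with 20% positives and compares the share of positives in each test fold under `KFold` and `StratifiedKFold`; `positive_share_per_fold` is a hypothetical helper.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

toy_y = np.array([0] * 16 + [1] * 4)   # 20% positives
toy_X = np.zeros((len(toy_y), 1))      # features are irrelevant for splitting

def positive_share_per_fold(splitter):
    """Fraction of positive labels in each test fold."""
    return [float(toy_y[test_idx].mean())
            for _, test_idx in splitter.split(toy_X, toy_y)]

plain = positive_share_per_fold(KFold(n_splits=4, shuffle=True, random_state=1))
stratified = positive_share_per_fold(
    StratifiedKFold(n_splits=4, shuffle=True, random_state=1))

print("KFold:          ", plain)
print("StratifiedKFold:", stratified)
```

With stratification every test fold holds one positive out of five samples, whereas the plain split is free to place the four positives unevenly.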

In [None]:
from sklearn.model_selection import StratifiedKFold
skfcv3 = StratifiedKFold(n_splits=3,random_state=1, shuffle=True)
skfcv5 = StratifiedKFold(n_splits=5,random_state=1, shuffle=True)
skfcv10 = StratifiedKFold(n_splits=10,random_state=1, shuffle=True)

### Stratified 3-Fold Cross Validation

In [None]:
%%time
for train, test in skfcv3.split(X, Y):
    x_train_one_fold, x_test_one_fold = X.iloc[train], X.iloc[test]
    y_train_one_fold, y_test_one_fold = Y.iloc[train], Y.iloc[test]
    xgb_model.fit(x_train_one_fold, y_train_one_fold)
    preds = xgb_model.predict(x_test_one_fold)
    precision_list.append(precision_score(y_test_one_fold, preds, average='macro', pos_label=1))
    recall_list.append(recall_score(y_test_one_fold, preds, average='macro', pos_label=1))
    f1_score_list.append(f1_score(y_test_one_fold, preds, average='macro', pos_label=1))
    accuracy_list.append(accuracy_score(y_test_one_fold, preds))

In [None]:
print('Accuracy_Scores ...... Precision_Scores ...... Recall_Scores ...... F1_Scores')

for i in range(3):
    print(round(accuracy_list[i],5), '              ' , round(precision_list[i], 5), '              ', round(recall_list[i],5), '              ', round(f1_score_list[i], 5))

In [None]:
avg_accuracy_skfcv3 = sum(accuracy_list) / len(accuracy_list)
avg_precision_skfcv3 = sum(precision_list) / len(precision_list)
avg_recall_skfcv3 = sum(recall_list) / len(recall_list)
avg_f1_score_skfcv3 = sum(f1_score_list) / len(f1_score_list)

In [None]:
accuracy_list.clear()
precision_list.clear()
recall_list.clear()
f1_score_list.clear()

### Stratified 5-Fold Cross Validation

In [None]:
%%time
for train, test in skfcv5.split(X, Y):
    x_train_one_fold, x_test_one_fold = X.iloc[train], X.iloc[test]
    y_train_one_fold, y_test_one_fold = Y.iloc[train], Y.iloc[test]
    xgb_model.fit(x_train_one_fold, y_train_one_fold)
    preds = xgb_model.predict(x_test_one_fold)
    precision_list.append(precision_score(y_test_one_fold, preds, average='macro', pos_label=1))
    recall_list.append(recall_score(y_test_one_fold, preds, average='macro', pos_label=1))
    f1_score_list.append(f1_score(y_test_one_fold, preds, average='macro', pos_label=1))
    accuracy_list.append(accuracy_score(y_test_one_fold, preds))

In [None]:
print('Accuracy_Scores ...... Precision_Scores ...... Recall_Scores ...... F1_Scores')

for i in range(5):
    print(round(accuracy_list[i],5), '              ' , round(precision_list[i], 5), '              ', round(recall_list[i],5), '              ', round(f1_score_list[i], 5))

In [None]:
avg_accuracy_skfcv5 = sum(accuracy_list) / len(accuracy_list)
avg_precision_skfcv5 = sum(precision_list) / len(precision_list)
avg_recall_skfcv5 = sum(recall_list) / len(recall_list)
avg_f1_score_skfcv5 = sum(f1_score_list) / len(f1_score_list)

In [None]:
accuracy_list.clear()
precision_list.clear()
recall_list.clear()
f1_score_list.clear()

### Stratified 10-Fold Cross Validation

In [None]:
%%time
for train, test in skfcv10.split(X, Y):
    x_train_one_fold, x_test_one_fold = X.iloc[train], X.iloc[test]
    y_train_one_fold, y_test_one_fold = Y.iloc[train], Y.iloc[test]
    xgb_model.fit(x_train_one_fold, y_train_one_fold)
    preds = xgb_model.predict(x_test_one_fold)
    precision_list.append(precision_score(y_test_one_fold, preds, average='macro', pos_label=1))
    recall_list.append(recall_score(y_test_one_fold, preds, average='macro', pos_label=1))
    f1_score_list.append(f1_score(y_test_one_fold, preds, average='macro', pos_label=1))
    accuracy_list.append(accuracy_score(y_test_one_fold, preds))

In [None]:
print('Accuracy_Scores ...... Precision_Scores ...... Recall_Scores ...... F1_Scores')

for i in range(10):
    print(round(accuracy_list[i],5), '              ' , round(precision_list[i], 5), '              ', round(recall_list[i],5), '              ', round(f1_score_list[i], 5))

In [None]:
avg_accuracy_skfcv10 = sum(accuracy_list) / len(accuracy_list)
avg_precision_skfcv10 = sum(precision_list) / len(precision_list)
avg_recall_skfcv10 = sum(recall_list) / len(recall_list)
avg_f1_score_skfcv10 = sum(f1_score_list) / len(f1_score_list)

In [None]:
accuracy_list.clear()
precision_list.clear()
recall_list.clear()
f1_score_list.clear()

### Visualization of Results for XGBoost from UCLA paper using Stratified K-Fold 

In [None]:
import matplotlib.pyplot as plt
data = [[avg_accuracy_skfcv3, avg_accuracy_skfcv5, avg_accuracy_skfcv10],
       [avg_precision_skfcv3, avg_precision_skfcv5, avg_precision_skfcv10],
       [avg_recall_skfcv3, avg_recall_skfcv5, avg_recall_skfcv10],
       [avg_f1_score_skfcv3, avg_f1_score_skfcv5, avg_f1_score_skfcv10]]
x = np.arange(3)
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])

ax.bar(x + 0.0, data[0], color = 'b', width = 0.2)
ax.bar(x + 0.2, data[1], color = 'g', width = 0.2)
ax.bar(x + 0.4, data[2], color = 'r', width = 0.2)
ax.bar(x + 0.6, data[3], color = 'y', width = 0.2)
ax.set_xticks(x + 0.3)
ax.set_title('XGBoost from UCLA Paper with Stratified K-Fold Cross Validation')
ax.set_xticklabels(['K = 3', 'K = 5', 'K = 10'])
ax.legend(labels = ['Accuracy', 'Precision', 'Recall', 'F1 Score'], bbox_to_anchor=(1.05, 1), loc='upper left')

# **Building the Standard XGBoost model**

Here, we are implementing the standard XGBoost algorithm, which sets fewer parameters than the one from the paper.

In [None]:
from xgboost import XGBClassifier
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score, classification_report

xgb_model = XGBClassifier(use_label_encoder=False, eval_metric='mlogloss', scale_pos_weight=7.1)


accuracy_list = []
recall_list = []
precision_list = []
f1_score_list = []

## K-Fold Cross Validation for Standard XGBoost

In [None]:
from sklearn.model_selection import KFold
import numpy as np
kfcv3 = KFold(n_splits=3, random_state = 1, shuffle = True)
kfcv5 = KFold(n_splits=5, random_state = 1, shuffle = True)
kfcv10 = KFold(n_splits=10,  random_state = 1, shuffle = True)

### 3-Fold Cross Validation

In [None]:
%%time
for train, test in kfcv3.split(X, Y):
    x_train_one_fold, x_test_one_fold = X.iloc[train], X.iloc[test]
    y_train_one_fold, y_test_one_fold = Y.iloc[train], Y.iloc[test]
    xgb_model.fit(x_train_one_fold, y_train_one_fold)
    preds = xgb_model.predict(x_test_one_fold)
    precision_list.append(precision_score(y_test_one_fold, preds, average='macro', pos_label=1))
    recall_list.append(recall_score(y_test_one_fold, preds, average='macro', pos_label=1))
    f1_score_list.append(f1_score(y_test_one_fold, preds, average='macro', pos_label=1))
    accuracy_list.append(accuracy_score(y_test_one_fold, preds))

In [None]:
print('Accuracy_Scores ...... Precision_Scores ...... Recall_Scores ...... F1_Scores')

for i in range(3):
    print(round(accuracy_list[i],5), '              ' , round(precision_list[i], 5), '              ', round(recall_list[i],5), '              ', round(f1_score_list[i], 5))

In [None]:
avg_accuracy_kfcv3 = sum(accuracy_list) / len(accuracy_list)
avg_precision_kfcv3 = sum(precision_list) / len(precision_list)
avg_recall_kfcv3 = sum(recall_list) / len(recall_list)
avg_f1_score_kfcv3 = sum(f1_score_list) / len(f1_score_list)

In [None]:
accuracy_list.clear()
precision_list.clear()
recall_list.clear()
f1_score_list.clear()

### 5-Fold Cross Validation

In [None]:
%%time
for train, test in kfcv5.split(X, Y):
    x_train_one_fold, x_test_one_fold = X.iloc[train], X.iloc[test]
    y_train_one_fold, y_test_one_fold = Y.iloc[train], Y.iloc[test]
    xgb_model.fit(x_train_one_fold, y_train_one_fold)
    preds = xgb_model.predict(x_test_one_fold)
    precision_list.append(precision_score(y_test_one_fold, preds, average='macro', pos_label=1))
    recall_list.append(recall_score(y_test_one_fold, preds, average='macro', pos_label=1))
    f1_score_list.append(f1_score(y_test_one_fold, preds, average='macro', pos_label=1))
    accuracy_list.append(accuracy_score(y_test_one_fold, preds))

In [None]:
print('Accuracy_Scores ...... Precision_Scores ...... Recall_Scores ...... F1_Scores')

for i in range(5):
    print(round(accuracy_list[i],5), '              ' , round(precision_list[i], 5), '              ', round(recall_list[i],5), '              ', round(f1_score_list[i], 5))

In [None]:
avg_accuracy_kfcv5 = sum(accuracy_list) / len(accuracy_list)
avg_precision_kfcv5 = sum(precision_list) / len(precision_list)
avg_recall_kfcv5 = sum(recall_list) / len(recall_list)
avg_f1_score_kfcv5 = sum(f1_score_list) / len(f1_score_list)

In [None]:
accuracy_list.clear()
precision_list.clear()
recall_list.clear()
f1_score_list.clear()

### 10-Fold Cross Validation

In [None]:
%%time
for train, test in kfcv10.split(X, Y):
    x_train_one_fold, x_test_one_fold = X.iloc[train], X.iloc[test]
    y_train_one_fold, y_test_one_fold = Y.iloc[train], Y.iloc[test]
    xgb_model.fit(x_train_one_fold, y_train_one_fold)
    preds = xgb_model.predict(x_test_one_fold)
    precision_list.append(precision_score(y_test_one_fold, preds, average='macro', pos_label=1))
    recall_list.append(recall_score(y_test_one_fold, preds, average='macro', pos_label=1))
    f1_score_list.append(f1_score(y_test_one_fold, preds, average='macro', pos_label=1))
    accuracy_list.append(accuracy_score(y_test_one_fold, preds))

In [None]:
print('Accuracy_Scores ...... Precision_Scores ...... Recall_Scores ...... F1_Scores')

for i in range(10):
    print(round(accuracy_list[i],5), '              ' , round(precision_list[i], 5), '              ', round(recall_list[i],5), '              ', round(f1_score_list[i], 5))

In [None]:
avg_accuracy_kfcv10 = sum(accuracy_list) / len(accuracy_list)
avg_precision_kfcv10 = sum(precision_list) / len(precision_list)
avg_recall_kfcv10 = sum(recall_list) / len(recall_list)
avg_f1_score_kfcv10 = sum(f1_score_list) / len(f1_score_list)

In [None]:
accuracy_list.clear()
precision_list.clear()
recall_list.clear()
f1_score_list.clear()

### Visualization of Results for Standard XGBoost using K-Fold Cross Validation

In [None]:
import matplotlib.pyplot as plt
data = [[avg_accuracy_kfcv3, avg_accuracy_kfcv5, avg_accuracy_kfcv10],
       [avg_precision_kfcv3, avg_precision_kfcv5, avg_precision_kfcv10],
       [avg_recall_kfcv3, avg_recall_kfcv5, avg_recall_kfcv10],
       [avg_f1_score_kfcv3, avg_f1_score_kfcv5, avg_f1_score_kfcv10]]
x = np.arange(3)
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])

ax.bar(x + 0.0, data[0], color = 'b', width = 0.2)
ax.bar(x + 0.2, data[1], color = 'g', width = 0.2)
ax.bar(x + 0.4, data[2], color = 'r', width = 0.2)
ax.bar(x + 0.6, data[3], color = 'y', width = 0.2)
ax.set_xticks(x + 0.3)
ax.set_title('Standard XGBoost with K-Fold Cross Validation')
ax.set_xticklabels(['K = 3', 'K = 5', 'K = 10'])
ax.legend(labels = ['Accuracy', 'Precision', 'Recall', 'F1 Score'], bbox_to_anchor=(1.05, 1), loc='upper left')

## Stratified K-Fold Cross Validation for Standard XGBoost

In [None]:
from sklearn.model_selection import StratifiedKFold
skfcv3 = StratifiedKFold(n_splits=3,random_state=1, shuffle=True)
skfcv5 = StratifiedKFold(n_splits=5,random_state=1, shuffle=True)
skfcv10 = StratifiedKFold(n_splits=10,random_state=1, shuffle=True)

### Stratified 3-Fold Cross Validation

In [None]:
%%time
for train, test in skfcv3.split(X, Y):
    x_train_one_fold, x_test_one_fold = X.iloc[train], X.iloc[test]
    y_train_one_fold, y_test_one_fold = Y.iloc[train], Y.iloc[test]
    xgb_model.fit(x_train_one_fold, y_train_one_fold)
    preds = xgb_model.predict(x_test_one_fold)
    precision_list.append(precision_score(y_test_one_fold, preds, average='macro', pos_label=1))
    recall_list.append(recall_score(y_test_one_fold, preds, average='macro', pos_label=1))
    f1_score_list.append(f1_score(y_test_one_fold, preds, average='macro', pos_label=1))
    accuracy_list.append(accuracy_score(y_test_one_fold, preds))

In [None]:
print('Accuracy_Scores ...... Precision_Scores ...... Recall_Scores ...... F1_Scores')

for i in range(3):
    print(round(accuracy_list[i],5), '              ' , round(precision_list[i], 5), '              ', round(recall_list[i],5), '              ', round(f1_score_list[i], 5))

In [None]:
avg_accuracy_skfcv3 = sum(accuracy_list) / len(accuracy_list)
avg_precision_skfcv3 = sum(precision_list) / len(precision_list)
avg_recall_skfcv3 = sum(recall_list) / len(recall_list)
avg_f1_score_skfcv3 = sum(f1_score_list) / len(f1_score_list)

In [None]:
accuracy_list.clear()
precision_list.clear()
recall_list.clear()
f1_score_list.clear()

### Stratified 5-Fold Cross Validation

In [None]:
%%time
for train, test in skfcv5.split(X, Y):
    x_train_one_fold, x_test_one_fold = X.iloc[train], X.iloc[test]
    y_train_one_fold, y_test_one_fold = Y.iloc[train], Y.iloc[test]
    xgb_model.fit(x_train_one_fold, y_train_one_fold)
    preds = xgb_model.predict(x_test_one_fold)
    precision_list.append(precision_score(y_test_one_fold, preds, average='macro', pos_label=1))
    recall_list.append(recall_score(y_test_one_fold, preds, average='macro', pos_label=1))
    f1_score_list.append(f1_score(y_test_one_fold, preds, average='macro', pos_label=1))
    accuracy_list.append(accuracy_score(y_test_one_fold, preds))

In [None]:
print('Accuracy_Scores ...... Precision_Scores ...... Recall_Scores ...... F1_Scores')

for i in range(5):
    print(round(accuracy_list[i],5), '              ' , round(precision_list[i], 5), '              ', round(recall_list[i],5), '              ', round(f1_score_list[i], 5))

In [None]:
avg_accuracy_skfcv5 = sum(accuracy_list) / len(accuracy_list)
avg_precision_skfcv5 = sum(precision_list) / len(precision_list)
avg_recall_skfcv5 = sum(recall_list) / len(recall_list)
avg_f1_score_skfcv5 = sum(f1_score_list) / len(f1_score_list)

In [None]:
accuracy_list.clear()
precision_list.clear()
recall_list.clear()
f1_score_list.clear()

### Stratified 10-Fold Cross Validation

In [None]:
%%time
for train, test in skfcv10.split(X, Y):
    x_train_one_fold, x_test_one_fold = X.iloc[train], X.iloc[test]
    y_train_one_fold, y_test_one_fold = Y.iloc[train], Y.iloc[test]
    xgb_model.fit(x_train_one_fold, y_train_one_fold)
    preds = xgb_model.predict(x_test_one_fold)
    precision_list.append(precision_score(y_test_one_fold, preds, average='macro', pos_label=1))
    recall_list.append(recall_score(y_test_one_fold, preds, average='macro', pos_label=1))
    f1_score_list.append(f1_score(y_test_one_fold, preds, average='macro', pos_label=1))
    accuracy_list.append(accuracy_score(y_test_one_fold, preds))

In [None]:
print('Accuracy_Scores ...... Precision_Scores ...... Recall_Scores ...... F1_Scores')

for i in range(10):
    print(round(accuracy_list[i],5), '              ' , round(precision_list[i], 5), '              ', round(recall_list[i],5), '              ', round(f1_score_list[i], 5))

In [None]:
avg_accuracy_skfcv10 = sum(accuracy_list) / len(accuracy_list)
avg_precision_skfcv10 = sum(precision_list) / len(precision_list)
avg_recall_skfcv10 = sum(recall_list) / len(recall_list)
avg_f1_score_skfcv10 = sum(f1_score_list) / len(f1_score_list)

In [None]:
accuracy_list.clear()
precision_list.clear()
recall_list.clear()
f1_score_list.clear()

### Visualization of Results for Standard XGBoost using Stratified K-Fold Cross Validation

In [None]:
import matplotlib.pyplot as plt
data = [[avg_accuracy_skfcv3, avg_accuracy_skfcv5, avg_accuracy_skfcv10],
       [avg_precision_skfcv3, avg_precision_skfcv5, avg_precision_skfcv10],
       [avg_recall_skfcv3, avg_recall_skfcv5, avg_recall_skfcv10],
       [avg_f1_score_skfcv3, avg_f1_score_skfcv5, avg_f1_score_skfcv10]]
x = np.arange(3)
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])

ax.bar(x + 0.0, data[0], color = 'b', width = 0.2)
ax.bar(x + 0.2, data[1], color = 'g', width = 0.2)
ax.bar(x + 0.4, data[2], color = 'r', width = 0.2)
ax.bar(x + 0.6, data[3], color = 'y', width = 0.2)
ax.set_xticks(x + 0.3)
ax.set_title('Standard XGBoost with Stratified K-Fold Cross Validation')
ax.set_xticklabels(['K = 3', 'K = 5', 'K = 10'])
ax.legend(labels = ['Accuracy', 'Precision', 'Recall', 'F1 Score'], bbox_to_anchor=(1.05, 1), loc='upper left')

# **Building the Logistic Regression Model**

In [None]:
import sklearn as sk
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score, classification_report

lg_model = LogisticRegression(class_weight = "balanced", max_iter = 400)

accuracy_list = []
recall_list = []
precision_list = []
f1_score_list = []
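For reference, `class_weight="balanced"` weights each class by `n_samples / (n_classes * np.bincount(y))`, as described in the scikit-learn documentation. The sketch below reproduces that on a made-up label vector and compares it with `compute_class_weight`; `toy_y` is an assumption for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels: 8 negatives and 2 positives.
toy_y = np.array([0] * 8 + [1] * 2)
classes = np.unique(toy_y)

# Manual version of the "balanced" heuristic.
manual = len(toy_y) / (len(classes) * np.bincount(toy_y))

# The same weights as computed by scikit-learn.
from_sklearn = compute_class_weight(class_weight="balanced", classes=classes, y=toy_y)

print(manual)        # weight 0.625 for class 0, 2.5 for class 1
print(from_sklearn)
```

The rarer class receives the larger weight, which plays a role similar to `scale_pos_weight` in the XGBoost models.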

## K-Fold Cross Validation for Logistic Regression

In [None]:
from sklearn.model_selection import KFold
import numpy as np
kfcv3 = KFold(n_splits=3, random_state = 1, shuffle = True)
kfcv5 = KFold(n_splits=5, random_state = 1, shuffle = True)
kfcv10 = KFold(n_splits=10,  random_state = 1, shuffle = True)

### 3-Fold Cross Validation

In [None]:
for train, test in kfcv3.split(X, Y):
    x_train_one_fold, x_test_one_fold = X.iloc[train], X.iloc[test]
    y_train_one_fold, y_test_one_fold = Y.iloc[train], Y.iloc[test]
    lg_model.fit(x_train_one_fold, y_train_one_fold.values.ravel())
    preds = lg_model.predict(x_test_one_fold)
    precision_list.append(precision_score(y_test_one_fold, preds, average='macro', pos_label=1))
    recall_list.append(recall_score(y_test_one_fold, preds, average='macro', pos_label=1))
    f1_score_list.append(f1_score(y_test_one_fold, preds, average='macro', pos_label=1))
    accuracy_list.append(accuracy_score(y_test_one_fold, preds))
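
The per-fold loop above can also be expressed with scikit-learn's `cross_validate`, which fits a fresh clone of the estimator on each fold and returns one score per fold. This is only a sketch of an alternative, not the approach used in this notebook; `X_demo` and `y_demo` are synthetic data generated here as an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in for X and Y (assumed roughly 80/20 class split).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=1
)

model = LogisticRegression(class_weight="balanced", max_iter=400)
cv = KFold(n_splits=3, shuffle=True, random_state=1)
scoring = ["accuracy", "precision_macro", "recall_macro", "f1_macro"]

# Returns a dict with one array of per-fold scores per metric.
results = cross_validate(model, X_demo, y_demo, cv=cv, scoring=scoring)

# Average each metric over the folds.
averages = {name: results[f"test_{name}"].mean() for name in scoring}
print(averages)
```

Because the scores come back in a fresh dictionary for every call, there are no shared lists to clear between experiments.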

In [None]:
print('Accuracy_Scores ...... Precision_Scores ...... Recall_Scores ...... F1_Scores')

for i in range(3):
    print(round(accuracy_list[i],5), '              ' , round(precision_list[i], 5), '              ', round(recall_list[i],5), '              ', round(f1_score_list[i], 5))
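
The precision, recall and F1 values printed above use `average='macro'`, the unweighted mean of the per-class scores; with that setting `pos_label` has no effect on binary labels. The following sketch, on hand-made labels chosen for illustration, contrasts macro precision with the positive-class-only (`average='binary'`) value.

```python
import numpy as np
from sklearn.metrics import precision_score

# Hand-made labels for illustration only.
y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 0])

# Precision for class 0 and class 1 separately.
per_class = precision_score(y_true, y_pred, average=None)

# Macro: plain mean of the per-class precisions.
macro = precision_score(y_true, y_pred, average="macro")

# Binary: precision of the positive class (label 1) only.
binary = precision_score(y_true, y_pred, average="binary", pos_label=1)

print(per_class, macro, binary)
```

When comparing against figures reported elsewhere, it is worth checking which averaging they use, since macro and positive-class scores can differ noticeably on imbalanced data.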

In [None]:
avg_accuracy_kfcv3 = sum(accuracy_list) / len(accuracy_list)
avg_precision_kfcv3 = sum(precision_list) / len(precision_list)
avg_recall_kfcv3 = sum(recall_list) / len(recall_list)
avg_f1_score_kfcv3 = sum(f1_score_list) / len(f1_score_list)

In [None]:
accuracy_list.clear()
precision_list.clear()
recall_list.clear()
f1_score_list.clear()

### 5-Fold Cross Validation

In [None]:
for train, test in kfcv5.split(X, Y):
    x_train_one_fold, x_test_one_fold = X.iloc[train], X.iloc[test]
    y_train_one_fold, y_test_one_fold = Y.iloc[train], Y.iloc[test]
    lg_model.fit(x_train_one_fold, y_train_one_fold.values.ravel())
    preds = lg_model.predict(x_test_one_fold)
    precision_list.append(precision_score(y_test_one_fold, preds, average='macro', pos_label=1))
    recall_list.append(recall_score(y_test_one_fold, preds, average='macro', pos_label=1))
    f1_score_list.append(f1_score(y_test_one_fold, preds, average='macro', pos_label=1))
    accuracy_list.append(accuracy_score(y_test_one_fold, preds))

In [None]:
print('Accuracy_Scores ...... Precision_Scores ...... Recall_Scores ...... F1_Scores')

for i in range(5):
    print(round(accuracy_list[i],5), '              ' , round(precision_list[i], 5), '              ', round(recall_list[i],5), '              ', round(f1_score_list[i], 5))

In [None]:
avg_accuracy_kfcv5 = sum(accuracy_list) / len(accuracy_list)
avg_precision_kfcv5 = sum(precision_list) / len(precision_list)
avg_recall_kfcv5 = sum(recall_list) / len(recall_list)
avg_f1_score_kfcv5 = sum(f1_score_list) / len(f1_score_list)

In [None]:
accuracy_list.clear()
precision_list.clear()
recall_list.clear()
f1_score_list.clear()

### 10-Fold Cross Validation

In [None]:
for train, test in kfcv10.split(X, Y):
    x_train_one_fold, x_test_one_fold = X.iloc[train], X.iloc[test]
    y_train_one_fold, y_test_one_fold = Y.iloc[train], Y.iloc[test]
    lg_model.fit(x_train_one_fold, y_train_one_fold.values.ravel())
    preds = lg_model.predict(x_test_one_fold)
    precision_list.append(precision_score(y_test_one_fold, preds, average='macro', pos_label=1))
    recall_list.append(recall_score(y_test_one_fold, preds, average='macro', pos_label=1))
    f1_score_list.append(f1_score(y_test_one_fold, preds, average='macro', pos_label=1))
    accuracy_list.append(accuracy_score(y_test_one_fold, preds))

In [None]:
print('Accuracy_Scores ...... Precision_Scores ...... Recall_Scores ...... F1_Scores')

for i in range(10):
    print(round(accuracy_list[i],5), '              ' , round(precision_list[i], 5), '              ', round(recall_list[i],5), '              ', round(f1_score_list[i], 5))

In [None]:
avg_accuracy_kfcv10 = sum(accuracy_list) / len(accuracy_list)
avg_precision_kfcv10 = sum(precision_list) / len(precision_list)
avg_recall_kfcv10 = sum(recall_list) / len(recall_list)
avg_f1_score_kfcv10 = sum(f1_score_list) / len(f1_score_list)

In [None]:
accuracy_list.clear()
precision_list.clear()
recall_list.clear()
f1_score_list.clear()

### Visualization of Results for Logistic Regression using K-Fold Cross Validation

In [None]:
import matplotlib.pyplot as plt
data = [[avg_accuracy_kfcv3, avg_accuracy_kfcv5, avg_accuracy_kfcv10],
       [avg_precision_kfcv3, avg_precision_kfcv5, avg_precision_kfcv10],
       [avg_recall_kfcv3, avg_recall_kfcv5, avg_recall_kfcv10],
       [avg_f1_score_kfcv3, avg_f1_score_kfcv5, avg_f1_score_kfcv10]]
x = np.arange(3)
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])

ax.bar(x + 0.0, data[0], color = 'b', width = 0.2)
ax.bar(x + 0.2, data[1], color = 'g', width = 0.2)
ax.bar(x + 0.4, data[2], color = 'r', width = 0.2)
ax.bar(x + 0.6, data[3], color = 'y', width = 0.2)
ax.set_xticks(x + 0.3)
ax.set_title('Logistic Regression with K-Fold Cross Validation')
ax.set_xticklabels(['K = 3', 'K = 5', 'K = 10'])
ax.legend(labels = ['Accuracy', 'Precision', 'Recall', 'F1 Score'], bbox_to_anchor=(1.05, 1), loc='upper left')
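
Since the same grouped bar chart is drawn for every model, the plotting code could be wrapped in a small helper. The function name `plot_cv_metrics` and the placeholder values below are hypothetical and are not results from this notebook.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_cv_metrics(metric_rows, title, fold_labels=("K = 3", "K = 5", "K = 10")):
    """Draw one group of bars per K, with one bar per metric."""
    names = ["Accuracy", "Precision", "Recall", "F1 Score"]
    colors = ["b", "g", "r", "y"]
    x = np.arange(len(fold_labels))
    width = 0.2
    fig, ax = plt.subplots()
    for i, (row, name, color) in enumerate(zip(metric_rows, names, colors)):
        ax.bar(x + i * width, row, width=width, color=color, label=name)
    # Centre the tick under each group of four bars.
    ax.set_xticks(x + 1.5 * width)
    ax.set_xticklabels(fold_labels)
    ax.set_title(title)
    ax.legend(bbox_to_anchor=(1.05, 1), loc="upper left")
    return ax

# Placeholder values only, to show the expected input shape (4 metrics x 3 values of K).
placeholder = [[0.5, 0.5, 0.5]] * 4
ax = plot_cv_metrics(placeholder, "Logistic Regression with K-Fold Cross Validation")
```

Passing `label=` to each `ax.bar` call ties the legend entries to the bars directly, rather than relying on the order of a separate `labels` list.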

## Stratified K-Fold Cross Validation for Logistic Regression

In [None]:
from sklearn.model_selection import StratifiedKFold
skfcv3 = StratifiedKFold(n_splits=3,random_state=1, shuffle=True)
skfcv5 = StratifiedKFold(n_splits=5,random_state=1, shuffle=True)
skfcv10 = StratifiedKFold(n_splits=10,random_state=1, shuffle=True)
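
The difference from plain K-Fold is that `StratifiedKFold` keeps the class proportions of `Y` roughly equal in every fold. A minimal sketch on toy labels (27 negatives and 3 positives, an assumed ratio for illustration) counts the positives that land in each test fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels; features are irrelevant for the split itself.
y_toy = np.array([0] * 27 + [1] * 3)
X_toy = np.zeros((len(y_toy), 1))

def positives_per_fold(splitter):
    """Count positive labels in each test fold produced by the splitter."""
    return [int(y_toy[test].sum()) for _, test in splitter.split(X_toy, y_toy)]

plain = positives_per_fold(KFold(n_splits=3, shuffle=True, random_state=1))
stratified = positives_per_fold(StratifiedKFold(n_splits=3, shuffle=True, random_state=1))

print("KFold:", plain)
print("StratifiedKFold:", stratified)
```

With stratification each fold receives one positive sample, whereas plain K-Fold may leave a fold with none, which makes recall for that fold undefined or misleading.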

### Stratified 3-Fold Cross Validation

In [None]:
for train, test in skfcv3.split(X, Y):
    x_train_one_fold, x_test_one_fold = X.iloc[train], X.iloc[test]
    y_train_one_fold, y_test_one_fold = Y.iloc[train], Y.iloc[test]
    lg_model.fit(x_train_one_fold, y_train_one_fold.values.ravel())
    preds = lg_model.predict(x_test_one_fold)
    precision_list.append(precision_score(y_test_one_fold, preds, average='macro', pos_label=1))
    recall_list.append(recall_score(y_test_one_fold, preds, average='macro', pos_label=1))
    f1_score_list.append(f1_score(y_test_one_fold, preds, average='macro', pos_label=1))
    accuracy_list.append(accuracy_score(y_test_one_fold, preds))

In [None]:
print('Accuracy_Scores ...... Precision_Scores ...... Recall_Scores ...... F1_Scores')

for i in range(3):
    print(round(accuracy_list[i],5), '              ' , round(precision_list[i], 5), '              ', round(recall_list[i],5), '              ', round(f1_score_list[i], 5))

In [None]:
avg_accuracy_skfcv3 = sum(accuracy_list) / len(accuracy_list)
avg_precision_skfcv3 = sum(precision_list) / len(precision_list)
avg_recall_skfcv3 = sum(recall_list) / len(recall_list)
avg_f1_score_skfcv3 = sum(f1_score_list) / len(f1_score_list)

In [None]:
accuracy_list.clear()
precision_list.clear()
recall_list.clear()
f1_score_list.clear()

### Stratified 5-Fold Cross Validation

In [None]:
for train, test in skfcv5.split(X, Y):
    x_train_one_fold, x_test_one_fold = X.iloc[train], X.iloc[test]
    y_train_one_fold, y_test_one_fold = Y.iloc[train], Y.iloc[test]
    lg_model.fit(x_train_one_fold, y_train_one_fold.values.ravel())
    preds = lg_model.predict(x_test_one_fold)
    precision_list.append(precision_score(y_test_one_fold, preds, average='macro', pos_label=1))
    recall_list.append(recall_score(y_test_one_fold, preds, average='macro', pos_label=1))
    f1_score_list.append(f1_score(y_test_one_fold, preds, average='macro', pos_label=1))
    accuracy_list.append(accuracy_score(y_test_one_fold, preds))

In [None]:
print('Accuracy_Scores ...... Precision_Scores ...... Recall_Scores ...... F1_Scores')

for i in range(5):
    print(round(accuracy_list[i],5), '              ' , round(precision_list[i], 5), '              ', round(recall_list[i],5), '              ', round(f1_score_list[i], 5))

In [None]:
avg_accuracy_skfcv5 = sum(accuracy_list) / len(accuracy_list)
avg_precision_skfcv5 = sum(precision_list) / len(precision_list)
avg_recall_skfcv5 = sum(recall_list) / len(recall_list)
avg_f1_score_skfcv5 = sum(f1_score_list) / len(f1_score_list)

In [None]:
accuracy_list.clear()
precision_list.clear()
recall_list.clear()
f1_score_list.clear()

### Stratified 10-Fold Cross Validation

In [None]:
for train, test in skfcv10.split(X, Y):
    x_train_one_fold, x_test_one_fold = X.iloc[train], X.iloc[test]
    y_train_one_fold, y_test_one_fold = Y.iloc[train], Y.iloc[test]
    lg_model.fit(x_train_one_fold, y_train_one_fold.values.ravel())
    preds = lg_model.predict(x_test_one_fold)
    precision_list.append(precision_score(y_test_one_fold, preds, average='macro', pos_label=1))
    recall_list.append(recall_score(y_test_one_fold, preds, average='macro', pos_label=1))
    f1_score_list.append(f1_score(y_test_one_fold, preds, average='macro', pos_label=1))
    accuracy_list.append(accuracy_score(y_test_one_fold, preds))

In [None]:
print('Accuracy_Scores ...... Precision_Scores ...... Recall_Scores ...... F1_Scores')

for i in range(10):
    print(round(accuracy_list[i],5), '              ' , round(precision_list[i], 5), '              ', round(recall_list[i],5), '              ', round(f1_score_list[i], 5))

In [None]:
avg_accuracy_skfcv10 = sum(accuracy_list) / len(accuracy_list)
avg_precision_skfcv10 = sum(precision_list) / len(precision_list)
avg_recall_skfcv10 = sum(recall_list) / len(recall_list)
avg_f1_score_skfcv10 = sum(f1_score_list) / len(f1_score_list)

### Visualization of Results for Logistic Regression using Stratified K-Fold Cross Validation

In [None]:
import matplotlib.pyplot as plt
data = [[avg_accuracy_skfcv3, avg_accuracy_skfcv5, avg_accuracy_skfcv10],
       [avg_precision_skfcv3, avg_precision_skfcv5, avg_precision_skfcv10],
       [avg_recall_skfcv3, avg_recall_skfcv5, avg_recall_skfcv10],
       [avg_f1_score_skfcv3, avg_f1_score_skfcv5, avg_f1_score_skfcv10]]
x = np.arange(3)
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])

ax.bar(x + 0.0, data[0], color = 'b', width = 0.2)
ax.bar(x + 0.2, data[1], color = 'g', width = 0.2)
ax.bar(x + 0.4, data[2], color = 'r', width = 0.2)
ax.bar(x + 0.6, data[3], color = 'y', width = 0.2)
ax.set_xticks(x + 0.3)
ax.set_title('Logistic Regression with Stratified K-Fold Cross Validation')
ax.set_xticklabels(['K = 3', 'K = 5', 'K = 10'])
ax.legend(labels = ['Accuracy', 'Precision', 'Recall', 'F1 Score'], bbox_to_anchor=(1.05, 1), loc='upper left')

# **Building the Random Forest Classifier**

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score, classification_report

# Default hyperparameters; no random_state is set, so scores can vary slightly between runs.
rfc_model = RandomForestClassifier()

accuracy_list = []
recall_list = []
precision_list = []
f1_score_list = []
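
Because a random forest draws bootstrap samples and random feature subsets, two fits on the same data can differ unless a seed is fixed. The sketch below, on synthetic data (`X_demo`, `y_demo` and the choice of 20 trees are assumptions for illustration), shows that passing the same `random_state` gives identical predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data, not the loan dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=1)

# Two forests built with the same seed.
rf_a = RandomForestClassifier(n_estimators=20, random_state=1).fit(X_demo, y_demo)
rf_b = RandomForestClassifier(n_estimators=20, random_state=1).fit(X_demo, y_demo)

# Identical seeds should yield identical class probabilities.
same = np.array_equal(rf_a.predict_proba(X_demo), rf_b.predict_proba(X_demo))
print(same)
```

Setting a seed is optional, but it makes the K = 3, 5 and 10 comparisons repeatable from one notebook run to the next.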

## K-Fold Cross Validation for Random Forest Classifier

In [None]:
from sklearn.model_selection import KFold
import numpy as np
kfcv3 = KFold(n_splits=3, random_state = 1, shuffle = True)
kfcv5 = KFold(n_splits=5, random_state = 1, shuffle = True)
kfcv10 = KFold(n_splits=10,  random_state = 1, shuffle = True)

### 3-Fold Cross Validation

In [None]:
%%time
for train, test in kfcv3.split(X, Y):
    x_train_one_fold, x_test_one_fold = X.iloc[train], X.iloc[test]
    y_train_one_fold, y_test_one_fold = Y.iloc[train], Y.iloc[test]
    rfc_model.fit(x_train_one_fold, np.ravel(y_train_one_fold))
    preds = rfc_model.predict(x_test_one_fold)
    precision_list.append(precision_score(y_test_one_fold, preds, average='macro', pos_label=1))
    recall_list.append(recall_score(y_test_one_fold, preds, average='macro', pos_label=1))
    f1_score_list.append(f1_score(y_test_one_fold, preds, average='macro', pos_label=1))
    accuracy_list.append(accuracy_score(y_test_one_fold, preds))

In [None]:
print('Accuracy_Scores ...... Precision_Scores ...... Recall_Scores ...... F1_Scores')

for i in range(3):
    print(round(accuracy_list[i],5), '              ' , round(precision_list[i], 5), '              ', round(recall_list[i],5), '              ', round(f1_score_list[i], 5))

In [None]:
avg_accuracy_kfcv3 = sum(accuracy_list) / len(accuracy_list)
avg_precision_kfcv3 = sum(precision_list) / len(precision_list)
avg_recall_kfcv3 = sum(recall_list) / len(recall_list)
avg_f1_score_kfcv3 = sum(f1_score_list) / len(f1_score_list)

In [None]:
accuracy_list.clear()
precision_list.clear()
recall_list.clear()
f1_score_list.clear()

### 5-Fold Cross Validation

In [None]:
%%time
for train, test in kfcv5.split(X, Y):
    x_train_one_fold, x_test_one_fold = X.iloc[train], X.iloc[test]
    y_train_one_fold, y_test_one_fold = Y.iloc[train], Y.iloc[test]
    rfc_model.fit(x_train_one_fold, np.ravel(y_train_one_fold))
    preds = rfc_model.predict(x_test_one_fold)
    precision_list.append(precision_score(y_test_one_fold, preds, average='macro', pos_label=1))
    recall_list.append(recall_score(y_test_one_fold, preds, average='macro', pos_label=1))
    f1_score_list.append(f1_score(y_test_one_fold, preds, average='macro', pos_label=1))
    accuracy_list.append(accuracy_score(y_test_one_fold, preds))

In [None]:
print('Accuracy_Scores ...... Precision_Scores ...... Recall_Scores ...... F1_Scores')

for i in range(5):
    print(round(accuracy_list[i],5), '              ' , round(precision_list[i], 5), '              ', round(recall_list[i],5), '              ', round(f1_score_list[i], 5))

In [None]:
avg_accuracy_kfcv5 = sum(accuracy_list) / len(accuracy_list)
avg_precision_kfcv5 = sum(precision_list) / len(precision_list)
avg_recall_kfcv5 = sum(recall_list) / len(recall_list)
avg_f1_score_kfcv5 = sum(f1_score_list) / len(f1_score_list)

In [None]:
accuracy_list.clear()
precision_list.clear()
recall_list.clear()
f1_score_list.clear()

### 10-Fold Cross Validation

In [None]:
%%time
for train, test in kfcv10.split(X, Y):
    x_train_one_fold, x_test_one_fold = X.iloc[train], X.iloc[test]
    y_train_one_fold, y_test_one_fold = Y.iloc[train], Y.iloc[test]
    rfc_model.fit(x_train_one_fold, np.ravel(y_train_one_fold))
    preds = rfc_model.predict(x_test_one_fold)
    precision_list.append(precision_score(y_test_one_fold, preds, average='macro', pos_label=1))
    recall_list.append(recall_score(y_test_one_fold, preds, average='macro', pos_label=1))
    f1_score_list.append(f1_score(y_test_one_fold, preds, average='macro', pos_label=1))
    accuracy_list.append(accuracy_score(y_test_one_fold, preds))

In [None]:
print('Accuracy_Scores ...... Precision_Scores ...... Recall_Scores ...... F1_Scores')

for i in range(10):
    print(round(accuracy_list[i],5), '              ' , round(precision_list[i], 5), '              ', round(recall_list[i],5), '              ', round(f1_score_list[i], 5))

In [None]:
avg_accuracy_kfcv10 = sum(accuracy_list) / len(accuracy_list)
avg_precision_kfcv10 = sum(precision_list) / len(precision_list)
avg_recall_kfcv10 = sum(recall_list) / len(recall_list)
avg_f1_score_kfcv10 = sum(f1_score_list) / len(f1_score_list)

In [None]:
accuracy_list.clear()
precision_list.clear()
recall_list.clear()
f1_score_list.clear()

### Visualization of Results for Random Forest Classifier using K-Fold Cross Validation

In [None]:
import matplotlib.pyplot as plt
data = [[avg_accuracy_kfcv3, avg_accuracy_kfcv5, avg_accuracy_kfcv10],
       [avg_precision_kfcv3, avg_precision_kfcv5, avg_precision_kfcv10],
       [avg_recall_kfcv3, avg_recall_kfcv5, avg_recall_kfcv10],
       [avg_f1_score_kfcv3, avg_f1_score_kfcv5, avg_f1_score_kfcv10]]
x = np.arange(3)
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])

ax.bar(x + 0.0, data[0], color = 'b', width = 0.2)
ax.bar(x + 0.2, data[1], color = 'g', width = 0.2)
ax.bar(x + 0.4, data[2], color = 'r', width = 0.2)
ax.bar(x + 0.6, data[3], color = 'y', width = 0.2)
ax.set_xticks(x + 0.3)
ax.set_title('Random Forest Classifier with K-Fold Cross Validation')
ax.set_xticklabels(['K = 3', 'K = 5', 'K = 10'])
ax.legend(labels = ['Accuracy', 'Precision', 'Recall', 'F1 Score'], bbox_to_anchor=(1.05, 1), loc='upper left')

## Stratified K-Fold Cross Validation for Random Forest Classifier

In [None]:
from sklearn.model_selection import StratifiedKFold
skfcv3 = StratifiedKFold(n_splits=3,random_state=1, shuffle=True)
skfcv5 = StratifiedKFold(n_splits=5,random_state=1, shuffle=True)
skfcv10 = StratifiedKFold(n_splits=10,random_state=1, shuffle=True)

### Stratified 3-Fold Cross Validation

In [None]:
%%time
for train, test in skfcv3.split(X, Y):
    x_train_one_fold, x_test_one_fold = X.iloc[train], X.iloc[test]
    y_train_one_fold, y_test_one_fold = Y.iloc[train], Y.iloc[test]
    rfc_model.fit(x_train_one_fold, np.ravel(y_train_one_fold))
    preds = rfc_model.predict(x_test_one_fold)
    precision_list.append(precision_score(y_test_one_fold, preds, average='macro', pos_label=1))
    recall_list.append(recall_score(y_test_one_fold, preds, average='macro', pos_label=1))
    f1_score_list.append(f1_score(y_test_one_fold, preds, average='macro', pos_label=1))
    accuracy_list.append(accuracy_score(y_test_one_fold, preds))

In [None]:
print('Accuracy_Scores ...... Precision_Scores ...... Recall_Scores ...... F1_Scores')

for i in range(3):
    print(round(accuracy_list[i],5), '              ' , round(precision_list[i], 5), '              ', round(recall_list[i],5), '              ', round(f1_score_list[i], 5))

In [None]:
avg_accuracy_skfcv3 = sum(accuracy_list) / len(accuracy_list)
avg_precision_skfcv3 = sum(precision_list) / len(precision_list)
avg_recall_skfcv3 = sum(recall_list) / len(recall_list)
avg_f1_score_skfcv3 = sum(f1_score_list) / len(f1_score_list)

In [None]:
accuracy_list.clear()
precision_list.clear()
recall_list.clear()
f1_score_list.clear()

### Stratified 5-Fold Cross Validation

In [None]:
%%time
for train, test in skfcv5.split(X, Y):
    x_train_one_fold, x_test_one_fold = X.iloc[train], X.iloc[test]
    y_train_one_fold, y_test_one_fold = Y.iloc[train], Y.iloc[test]
    rfc_model.fit(x_train_one_fold, np.ravel(y_train_one_fold))
    preds = rfc_model.predict(x_test_one_fold)
    precision_list.append(precision_score(y_test_one_fold, preds, average='macro', pos_label=1))
    recall_list.append(recall_score(y_test_one_fold, preds, average='macro', pos_label=1))
    f1_score_list.append(f1_score(y_test_one_fold, preds, average='macro', pos_label=1))
    accuracy_list.append(accuracy_score(y_test_one_fold, preds))

In [None]:
print('Accuracy_Scores ...... Precision_Scores ...... Recall_Scores ...... F1_Scores')

for i in range(5):
    print(round(accuracy_list[i],5), '              ' , round(precision_list[i], 5), '              ', round(recall_list[i],5), '              ', round(f1_score_list[i], 5))

In [None]:
avg_accuracy_skfcv5 = sum(accuracy_list) / len(accuracy_list)
avg_precision_skfcv5 = sum(precision_list) / len(precision_list)
avg_recall_skfcv5 = sum(recall_list) / len(recall_list)
avg_f1_score_skfcv5 = sum(f1_score_list) / len(f1_score_list)

In [None]:
accuracy_list.clear()
precision_list.clear()
recall_list.clear()
f1_score_list.clear()

### Stratified 10-Fold Cross Validation

In [None]:
%%time
for train, test in skfcv10.split(X, Y):
    x_train_one_fold, x_test_one_fold = X.iloc[train], X.iloc[test]
    y_train_one_fold, y_test_one_fold = Y.iloc[train], Y.iloc[test]
    rfc_model.fit(x_train_one_fold, np.ravel(y_train_one_fold))
    preds = rfc_model.predict(x_test_one_fold)
    precision_list.append(precision_score(y_test_one_fold, preds, average='macro', pos_label=1))
    recall_list.append(recall_score(y_test_one_fold, preds, average='macro', pos_label=1))
    f1_score_list.append(f1_score(y_test_one_fold, preds, average='macro', pos_label=1))
    accuracy_list.append(accuracy_score(y_test_one_fold, preds))

In [None]:
print('Accuracy_Scores ...... Precision_Scores ...... Recall_Scores ...... F1_Scores')

for i in range(10):
    print(round(accuracy_list[i],5), '              ' , round(precision_list[i], 5), '              ', round(recall_list[i],5), '              ', round(f1_score_list[i], 5))

In [None]:
avg_accuracy_skfcv10 = sum(accuracy_list) / len(accuracy_list)
avg_precision_skfcv10 = sum(precision_list) / len(precision_list)
avg_recall_skfcv10 = sum(recall_list) / len(recall_list)
avg_f1_score_skfcv10 = sum(f1_score_list) / len(f1_score_list)

In [None]:
accuracy_list.clear()
precision_list.clear()
recall_list.clear()
f1_score_list.clear()

### Visualization of Results for Random Forest Classifier using Stratified K-Fold Cross Validation

In [None]:
import matplotlib.pyplot as plt
data = [[avg_accuracy_skfcv3, avg_accuracy_skfcv5, avg_accuracy_skfcv10],
       [avg_precision_skfcv3, avg_precision_skfcv5, avg_precision_skfcv10],
       [avg_recall_skfcv3, avg_recall_skfcv5, avg_recall_skfcv10],
       [avg_f1_score_skfcv3, avg_f1_score_skfcv5, avg_f1_score_skfcv10]]
x = np.arange(3)
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])

ax.bar(x + 0.0, data[0], color = 'b', width = 0.2)
ax.bar(x + 0.2, data[1], color = 'g', width = 0.2)
ax.bar(x + 0.4, data[2], color = 'r', width = 0.2)
ax.bar(x + 0.6, data[3], color = 'y', width = 0.2)
ax.set_xticks(x + 0.3)
ax.set_title('Random Forest Classifier with Stratified K-Fold Cross Validation')
ax.set_xticklabels(['K = 3', 'K = 5', 'K = 10'])
ax.legend(labels = ['Accuracy', 'Precision', 'Recall', 'F1 Score'], bbox_to_anchor=(1.05, 1), loc='upper left')