### Predictive Modeling - Probability of Default

- This code represents a typical model pipeline.
- The pipeline steps are:
    - Import the necessary libraries
    - Load the data from a web URL
    - Split the data into train and test datasets
    - Create a Random Forest classifier
    - Train the model on the train dataset
    - Use the model to predict on the test dataset
    - Compute model performance metrics

In [None]:
#Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import os

#Load the dataset
url = 'https://github.com/Safa1615/Dataset--loan/blob/main/bank-loan.csv?raw=true'
data = pd.read_csv(url, nrows=700)

# Save to Excel
data.to_excel('dataset.xlsx', index=False)
current_directory = os.getcwd()
file_path = os.path.join(current_directory, 'dataset.xlsx')
print(f"The file is saved at: {file_path}")

#Split the data into features (independent variables) and the target variable (default or not)
X = data.drop('default', axis=1)
y = data['default']

#Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#Initialize a classification model (in this case, a Random Forest classifier)
classifier = RandomForestClassifier(n_estimators=100, random_state=42)

#Train the classifier on the training data
classifier.fit(X_train, y_train)

#Make prediction on the test data
y_pred = classifier.predict(X_test)

#Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
confusion = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

#Print the results
print(f"Accuracy: {accuracy:.2f}")
print("Confusion Matrix:")
print(confusion)
print("Classification Report:")
print(classification_rep)

The provided code is a basic implementation of a Random Forest classifier for predicting loan default. Here's a breakdown:

### Data Loading:

- The dataset is loaded from a GitHub repository using <em>pd.read_csv()</em>.

### Data Splitting:

- The data is split into features (X) and the target variable (y), which is whether a loan defaults or not.
- The dataset is then split into training and testing sets using <em>train_test_split()</em>.
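
One option worth considering when the target classes are imbalanced, as default data often is, is a stratified split. The sketch below uses made-up toy labels (not the loan dataset) to show how the <em>stratify</em> argument of <em>train_test_split()</em> keeps the class ratio the same in both splits. The variable names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy, imbalanced target: 80 non-defaults (0) and 20 defaults (1).
# These values are invented for illustration; they are not the loan dataset.
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)  # a single dummy feature

# stratify=y_toy asks train_test_split to preserve the 80/20 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("Default rate in train:", y_tr.mean())
print("Default rate in test: ", y_te.mean())
```

Without <em>stratify</em>, a small test set can end up with noticeably more or fewer defaults than the training set, which makes per-class metrics harder to compare between runs.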

### Model Initialization and Training:

- A Random Forest classifier is initialized with 100 trees (<em>n_estimators=100</em>) for ensemble learning.
- The classifier is trained on the training data using <em>fit()</em>.

### Prediction:

- Predictions are made on the test data using <em>predict()</em>.

### Model Evaluation:

- Accuracy, confusion matrix, and classification report are computed using <em>accuracy_score()</em>, <em>confusion_matrix()</em>, and <em>classification_report()</em>.
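
To make the relationship between these metrics concrete, here is a minimal sketch that derives accuracy, precision, recall, and F1 for the positive class directly from the four cells of a confusion matrix. The labels are invented toy values, not results from the loan model.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Toy labels, invented for illustration (1 = default, 0 = no default is an assumption).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy_manual = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted defaults, how many were real
recall = tp / (tp + fn)      # of the real defaults, how many were caught
f1_manual = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy_manual:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1_manual:.2f}")

# The hand-computed values agree with scikit-learn's helpers.
print(accuracy_score(y_true, y_hat), f1_score(y_true, y_hat))
```

In this toy case accuracy looks reasonable while recall on the positive class is only 0.5, which is why the per-class rows of the classification report matter for default prediction.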

### Results Printing:

- The results, including accuracy, confusion matrix, and classification report, are printed.

In [None]:
print(data)

# Credit Risk Prediction with XGBoost

## Objective:

- Build an XGBoost classifier to predict credit default based on a given dataset.

In [None]:
#Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from xgboost import XGBClassifier as XGBoostClassifier
import os
import matplotlib.pyplot as plt  #added to allow printing of plots
import seaborn as sns

#Load the dataset
url = 'https://github.com/Safa1615/Dataset--loan/blob/main/bank-loan.csv?raw=true'
data = pd.read_csv(url, nrows=700)

# Save to Excel
data.to_excel('dataset.xlsx', index=False)
current_directory = os.getcwd()
file_path = os.path.join(current_directory, 'dataset.xlsx')
print(f"The file is saved at: {file_path}")

# Print the characteristics of the dataset
print(data.info())
print(data.describe())

# Plot a histogram of every numeric column to see its distribution
data.hist(figsize=(10, 10))
plt.show()

# Create a box plot to visualize the distribution of 'income'
plt.figure(figsize=(8, 6))
sns.boxplot(y='income', data=data)
plt.title("Box Plot of Income")
plt.ylabel("Income")
plt.show()

#Split the data into features (independent variables) and the target variable (default or not)
X = data.drop('default', axis=1)  #X holds the features: every column except the target, which the model has to predict.
y = data['default']               #y holds the target labels that the model is trained to predict.

#Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  #Set aside 20% of the records for use as a test set

#Initialize a classification model (in this case, an XGBoost classifier)
classifier = XGBoostClassifier(n_estimators=100,      #Number of boosting rounds (trees) to build
                               learning_rate=0.1,     #Shrinkage applied to each tree's contribution before it is added to the running prediction
                               max_depth=6,           #Maximum depth of each tree; limiting it helps reduce overfitting
                               eval_metric="logloss", #Evaluation metric XGBoost uses for validation data during training
                               random_state=42)       #Seed value so that each run produces the same results

#Train the classifier on the training data
classifier.fit(X_train, y_train)

#Make prediction on the test data
y_pred = classifier.predict(X_test)

#Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
confusion = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

#Print the results
print(f"Accuracy: {accuracy:.2f}")
print("Confusion Matrix:")
print(confusion)
print("Classification Report:")
print(classification_rep)

The output of <em>data.info()</em> shows no missing values in any column, so no imputation is needed.

However, the accuracy and F1 score are the same as for the Random Forest model, so XGBoost alone gives no improvement.

Looking at the histograms, several variables are right-skewed: most values cluster at the low end with a long tail to the right, which suggests creating new features with a logarithmic transformation. We also see an outlier in the income column, which could be distorting the model. The next cell drops that outlier and plots the log-transformed income.
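
The three checks described above (missing values, skew, and outliers) can also be done numerically rather than by eye. The following is a rough sketch on synthetic data: the log-normal "income" values and the 1.5 × IQR fence are assumptions chosen for illustration, not properties of the loan dataset.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed "income" values (assumed log-normal for illustration).
rng = np.random.default_rng(42)
toy = pd.DataFrame({"income": rng.lognormal(mean=3.5, sigma=0.7, size=1000)})

# 1. Missing values: count NaNs per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# 2. Skewness: positive values indicate a long right tail.
skew_raw = toy["income"].skew()
skew_log = np.log1p(toy["income"]).skew()
print(f"skew before log1p: {skew_raw:.2f}, after: {skew_log:.2f}")

# 3. Outliers: flag values above the conventional upper box-plot fence.
q1, q3 = toy["income"].quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
n_outliers = int((toy["income"] > upper_fence).sum())
print(f"values above the upper fence ({upper_fence:.1f}): {n_outliers}")
```

A skewness that moves closer to zero after <em>np.log1p()</em> is the numeric counterpart of the histogram looking more symmetric.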

In [None]:
data = data[data['income'] != 446].copy()           # drop the row with the max income (446, from the describe table above); .copy() avoids a SettingWithCopyWarning below

data['income_log'] = np.log1p(data['income'])       # log(1 + income), which compresses the long right tail
data.hist("income_log", figsize=(10, 10))
plt.show()

Now we rerun the XGBoost classifier to see whether the results improve, that is, whether we get a higher accuracy or F1 score.

In [None]:
#Split the data into features (independent variables) and the target variable (default or not)
X = data.drop('default', axis=1)  #X holds the features: every column except the target, which the model has to predict.
X = X.drop('income', axis=1)      #drop income because it is replaced by the new feature, 'income_log', added above.
y = data['default']               #y holds the target labels that the model is trained to predict.

print(X.info())
#Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  #Set aside 20% of the records for use as a test set

#Initialize a classification model (in this case, an XGBoost classifier)
classifier = XGBoostClassifier(n_estimators=100,      #Number of boosting rounds (trees) to build
                               learning_rate=0.1,     #Shrinkage applied to each tree's contribution before it is added to the running prediction
                               max_depth=6,           #Maximum depth of each tree; limiting it helps reduce overfitting
                               eval_metric="logloss", #Evaluation metric XGBoost uses for validation data during training
                               random_state=42)       #Seed value so that each run produces the same results

#Train the classifier on the training data
classifier.fit(X_train, y_train)

#Make prediction on the test data
y_pred = classifier.predict(X_test)

#Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
confusion = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

#Print the results
print(f"Accuracy: {accuracy:.2f}")
print("Confusion Matrix:")
print(confusion)
print("Classification Report:")
print(classification_rep)

Using income_log instead of income, and dropping the outlier, improved accuracy from 0.78 to 0.81 and the F1 score from 0.86 to 0.88 for class 0 and from 0.47 to 0.56 for class 1.
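
When comparing runs like this, it can help to pull the per-class numbers out of the report programmatically instead of reading them off the printed text. A minimal sketch, assuming toy labels rather than the actual model output: <em>classification_report()</em> accepts <em>output_dict=True</em> and returns a nested dictionary keyed by class label.

```python
from sklearn.metrics import classification_report

# Toy labels, invented for illustration; not the loan model's predictions.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# output_dict=True returns the same numbers as the printed report, as a dict.
report = classification_report(y_true, y_hat, output_dict=True)

f1_class_0 = report["0"]["f1-score"]
f1_class_1 = report["1"]["f1-score"]
print(f"accuracy={report['accuracy']:.2f} "
      f"F1(class 0)={f1_class_0:.2f} F1(class 1)={f1_class_1:.2f}")
```

Storing these values for each experiment makes it straightforward to tabulate before-and-after comparisons.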

In [None]:
#Import necessary libraries
from sklearn.model_selection import GridSearchCV

#Split the data into features (independent variables) and the target variable (default or not)
X = data.drop('default', axis=1)  #X holds the features: every column except the target, which the model has to predict.
X = X.drop('income', axis=1)      #drop income because it is replaced by the new feature, 'income_log', added above.
y = data['default']               #y holds the target labels that the model is trained to predict.

print(X.info())

#Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  #Set aside 20% of the records for use as a test set

# Define the parameter grid for the grid search
param_grid = {
    'n_estimators': [50, 100, 150, 200],  # Number of trees
    'learning_rate': [0.01, 0.1, 0.2],  # Learning rate
    'max_depth': [3, 4, 5, 6]  # Maximum depth of each tree
}

# Initialize the XGBoost classifier
xgb_classifier = XGBoostClassifier(random_state=42)

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=xgb_classifier,
                           param_grid=param_grid,
                           scoring='f1',
                           cv=5,  # 5-fold cross-validation
                           verbose=1)  # Print progress

# Fit the grid search to the training data
grid_search.fit(X_train, y_train)

# Get the best parameters and best score
print("Best parameters:", grid_search.best_params_)
print("Best score:", grid_search.best_score_)

# Evaluate the model with the best parameters on the test set
best_classifier = grid_search.best_estimator_
y_pred = best_classifier.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
confusion = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

#Print the results
print(f"Accuracy: {accuracy:.2f}")
print("Confusion Matrix:")
print(confusion)
print("Classification Report:")
print(classification_rep)

The grid search did not improve accuracy, but it did improve the F1 score, which is the metric the search optimized. The gain is small, however, so we conclude that this is about the best model we can get with these techniques.
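
One way to judge whether a grid-search gain is meaningful is to look at the spread of the cross-validation scores, not just the best one. The sketch below is an illustration under stated assumptions: it uses synthetic data from <em>make_classification()</em> and swaps in a scikit-learn decision tree for XGBoost so that it runs without extra dependencies. The grid values are arbitrary.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced binary data (an assumption; not the loan dataset).
X_syn, y_syn = make_classification(
    n_samples=400, n_features=8, weights=[0.75, 0.25], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

# A decision tree stands in for the XGBoost classifier here.
search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [2, 3, 4, 5]},
    scoring="f1",
    cv=5,
)
search.fit(X_tr, y_tr)

# cv_results_ holds the mean and standard deviation of the score per setting.
results = pd.DataFrame(search.cv_results_)[
    ["param_max_depth", "mean_test_score", "std_test_score"]
]
print(results.sort_values("mean_test_score", ascending=False))

# Compare the best cross-validated F1 with F1 on the held-out test set.
test_f1 = f1_score(y_te, search.predict(X_te))
print("Best CV F1:", round(search.best_score_, 3), "| test F1:", round(test_f1, 3))
```

If the difference between the top settings is smaller than their <em>std_test_score</em>, the improvement is within fold-to-fold noise.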

In conclusion, XGBoost on its own did not improve over Random Forest, and the grid search added only a small gain. Feature engineering (dropping the income outlier and log-transforming income) was what boosted performance.