# DSN2099 - Project Exhibition II
## Mahesh Kakde

The dataset I am using is hosted on Kaggle - https://www.kaggle.com/datasets/maheshkakde165/credit-score-data

### Credit Score Class Classification

### Goal
Given a person’s credit-related information, build a machine learning model that can classify the credit score. There are three classes - Standard, Good and Poor, that we have to predict.

I will encode the classes as below:
1 - Poor,
2 - Standard,
3 - Good

In [None]:
# Importing necessary libraries and modules
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Ignore Warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
# load the dataset
df = pd.read_csv("credit_score_data.csv")

In [None]:
df.shape

There are 2500 records over 22 columns.

In [None]:
df.info()

## 1. Statistical descriptions and Visualizations

There are four categorical columns including the target:
* Credit_Score - Target
* Payment_of_Min_Amount - Not Ordinal
* Credit_Mix - Ordinal
* Payment_Behaviour - Ordinal

We will use ordinal encoding for the ordinal features and one-hot encoding for the non-ordinal feature to turn them into numerical features.

In [None]:
num_features = ['Delay_from_due_date', 'Num_of_Delayed_Payment', 'Num_Credit_Inquiries',
       'Credit_Utilization_Ratio', 'Credit_History_Age',
        'Amount_invested_monthly', 'Monthly_Balance',
       'Age','Annual_Income', 'Num_Bank_Accounts', 'Num_Credit_Card',
       'Interest_Rate', 'Num_of_Loan', 'Monthly_Inhand_Salary',
       'Changed_Credit_Limit', 'Outstanding_Debt', 'Total_EMI_per_month']
cat_features = ['Payment_of_Min_Amount', 'Credit_Score', 'Credit_Mix', 'Payment_Behaviour']

Encode the target classes as:
1 - Poor,
2 - Standard,
3 - Good

In [None]:
mapping = {'Poor': 1, 'Standard': 2, 'Good': 3}
df['Credit_Score'] = df['Credit_Score'].replace(mapping)

Transforming the Ordinal Categorical Features to Numerical

In [None]:
# Ordinal Encoding
from sklearn.preprocessing import OrdinalEncoder

# Payment Behavior
categories_PB = ["Low_spent_Small_value_payments", "Low_spent_Medium_value_payments", "Low_spent_Large_value_payments",
                    "High_spent_Small_value_payments", "High_spent_Medium_value_payments", "High_spent_Large_value_payments"]

encoder1 = OrdinalEncoder(categories=[categories_PB])
df["Payment_Behaviour"] = encoder1.fit_transform(df[["Payment_Behaviour"]])

# Credit Mix
categories_CM = ["Bad", "Standard", "Good"]

encoder1 = OrdinalEncoder(categories=[categories_CM])
df["Credit_Mix"] = encoder1.fit_transform(df[["Credit_Mix"]])
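
To make the ordinal mapping concrete, here is a minimal sketch on a toy column (the three-row frame and the `toy_*` names are made up for illustration). With an explicit `categories` list, `OrdinalEncoder` assigns each value its position in that list, so "Bad" becomes 0, "Standard" becomes 1, and "Good" becomes 2.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data, invented for illustration only
toy = pd.DataFrame({"Credit_Mix": ["Good", "Bad", "Standard"]})

# The order of this list defines the integer assigned to each category
toy_encoder = OrdinalEncoder(categories=[["Bad", "Standard", "Good"]])
toy_codes = toy_encoder.fit_transform(toy[["Credit_Mix"]]).ravel().tolist()
print(toy_codes)
```

A value that is missing from the `categories` list raises an error during fitting by default, which is a useful guard against misspelled category names.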

Now, transforming the non-ordinal feature 'Payment_of_Min_Amount' to Numerical using One-Hot Encoding.

In [None]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()

encoded_data = encoder.fit_transform(df[["Payment_of_Min_Amount"]])
encoded_df = pd.DataFrame(encoded_data.toarray(), columns=encoder.get_feature_names_out(['Payment_of_Min_Amount']))
df = pd.concat([df.drop(["Payment_of_Min_Amount"], axis=1), encoded_df], axis=1)

In [None]:
df.describe()

In [None]:
df.info()

All the features are now numerical and ready to be fed to machine learning models.

In [None]:
df.head()

Now, the dataset has only numerical features. Let's visualize the data using Histograms

In [None]:
# Visualization - Histogram
df.hist(figsize=(20,15))
plt.show()

Let's use Box Plots to get a picture of outliers.

In [None]:
# Visualization - Boxplots

for column in df.columns:
    plt.figure(figsize=(6, 4))
    sns.boxplot(x=df[column])
    plt.title('Boxplot of {}'.format(column))
    plt.show()

#### Observations:

The following patterns were observed in the histogram visualization of the dataset:
-- Well Balanced (Close to Normal)
* Num_of_Delayed_Payment
* Credit_Utilization_Ratio
* Credit_History_Age
* Age
* Num_Credit_Card

-- Left Skewed Distributions
* Delay_from_due_date
* Num_Credit_Inquiries
* Amount_invested_monthly
* Monthly_Balance
* Annual_Income
* Interest_Rate
* Num_of_Loan
* Monthly_Inhand_Salary
* Changed_Credit_Limit
* Outstanding_Debt
* Total_EMI_per_month

-- Right Skewed Distributions
* Num_Bank_Accounts

-- Based on the Boxplots, many Outliers are detected in the following features:
* Total_EMI_per_month
* Outstanding_Debt
* Changed_Credit_Limit
* Monthly_Inhand_Salary
* Annual_Income
* Monthly_Balance
* Amount_invested_monthly
* Num_Credit_Inquiries
* Delay_from_due_date

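We will neither transform the data nor remove the outliers, as they carry information that is important for predicting the credit score.
Skewed distributions and outliers can still bring down the performance of scale-sensitive models. To limit this problem we will perform feature scaling using the Standard Scaler in the sklearn library. Note that standardization puts the features on a comparable scale, but it does not remove the skew or the outliers themselves.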
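
The visual reading of the histograms and boxplots can be cross-checked with numbers. The sketch below is only an illustration on made-up values: `iqr_outlier_count` is a hypothetical helper that applies the usual boxplot whisker rule, and `Series.skew()` gives the direction of the skew (a positive value means a long right tail, a negative value a long left tail).

```python
import pandas as pd

def iqr_outlier_count(series: pd.Series) -> int:
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (the boxplot whisker rule)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    outside = (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)
    return int(outside.sum())

# Toy values, invented for illustration: one large value creates a long right tail
toy_income = pd.Series([1, 2, 3, 4, 5, 100])

toy_skew = toy_income.skew()        # > 0 means right-skewed, < 0 means left-skewed
toy_outliers = iqr_outlier_count(toy_income)
print(toy_skew, toy_outliers)
```

On the real data the same check could be run as `df[num_features].skew()` and `df[num_features].apply(iqr_outlier_count)`.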

#### Special Treatments
We have used One-Hot Encoding and Ordinal Encoding on the appropriate features. Later, once we have split the dataset into training, validation, and testing sets, we will apply the Standard Scaler from the sklearn library.

## 2. Correlation and Scatter Plots

In [None]:
# Calculating PCC
corr_matrix = df.corr()

In [None]:
corr_matrix["Credit_Score"].sort_values(ascending=True)

In [None]:
# Heatmap for visualization
plt.figure(figsize=(15, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap')
plt.show()

For Scatter plots, we don't need to plot every pair of features, as there would be too many plots. We will plot the pairs of features whose correlation with each other is above 0.6 in absolute value, and every feature against the target.

In [None]:
# Get all the features other than the target that have more than 60% PCC
features = df.columns
non_target_features = [feature for feature in features if feature != 'Credit_Score']

pairs_above_threshold_non_target = []
for i in range(len(non_target_features)):
    for j in range(i+1, len(non_target_features)):  # i+1 to only consider upper triangle
        if abs(corr_matrix.loc[non_target_features[i], non_target_features[j]]) > 0.6:
            pairs_above_threshold_non_target.append((non_target_features[i], non_target_features[j]))

pairs_above_threshold_non_target
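
The same upper-triangle search can also be written without nested loops. This is just an alternative sketch, not the method used above: `correlated_pairs` is a hypothetical helper and the toy frame is made up for illustration.

```python
import numpy as np
import pandas as pd

def correlated_pairs(corr: pd.DataFrame, threshold: float = 0.6, exclude=()):
    """Return feature pairs whose absolute correlation exceeds the threshold."""
    cols = [c for c in corr.columns if c not in exclude]
    sub = corr.loc[cols, cols].abs().to_numpy()
    # k=1 keeps only the strict upper triangle, so each pair appears once
    upper = np.triu(np.ones(sub.shape, dtype=bool), k=1)
    rows, columns = np.where(upper & (sub > threshold))
    return [(cols[i], cols[j]) for i, j in zip(rows, columns)]

# Toy data, invented for illustration: b is a multiple of a, c is only weakly related
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 1, 4, 2, 3],
    "target": [1, 2, 3, 1, 2],
})
toy_pairs = correlated_pairs(toy.corr(), threshold=0.6, exclude=("target",))
print(toy_pairs)
```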

In [None]:
for (feature_x, feature_y) in pairs_above_threshold_non_target:
    sns.scatterplot(x=df[feature_x], y=df[feature_y])
    plt.title(f'Scatter Plot of {feature_x} vs {feature_y}')
    plt.xlabel(feature_x)
    plt.ylabel(feature_y)
    plt.show()

Now, we will plot scatter plots of all features with the target - Credit_Score.

In [None]:
non_target_features = [feature for feature in df.columns if feature != 'Credit_Score']

for feature in non_target_features:
    sns.scatterplot(x=df[feature], y=df['Credit_Score'])
    plt.title(f'Scatter Plot of {feature} vs Credit_Score')
    plt.xlabel(feature)
    plt.ylabel('Credit_Score')
    plt.show()

Observations:
* No feature seems to have very high correlation (more than 0.8) with the target.
* Among the negatively correlated features 'Num_Credit_Inquiries', 'Interest_Rate', 'Outstanding_Debt', 'Delay_from_due_date', 'Num_of_Loan', 'Num_Credit_Card' are strongly correlated with the target.
* Among the positively correlated features, 'Credit_History_Age' has the highest correlation with the target.

## 3. Dataset Split into Training, Validation, and Test

We will split the dataset into 20% Test, 20% Validation, and 60% Training by employing the Stratified Sampling in sklearn library.

We will first split the dataset into 60% Training and 40% Test+Validate.
Then, we will split the 40% Test+Validate into 20% Test and 20% Validate.
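
The arithmetic of the two-step split can be sketched on toy labels. Note that this sketch swaps in `train_test_split` with `stratify=` for brevity, whereas the notebook itself uses `StratifiedShuffleSplit`; the labels and variable names below are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels, invented for illustration: 100 samples in three classes
y_toy = np.array([1] * 30 + [2] * 50 + [3] * 20)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# Step 1: 60% training, 40% held out
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X_toy, y_toy, test_size=0.4, stratify=y_toy, random_state=42)

# Step 2: split the 40% in half -> 20% validation and 20% test of the original data
X_va, X_te, y_va, y_te = train_test_split(
    X_hold, y_hold, test_size=0.5, stratify=y_hold, random_state=42)

print(len(y_tr), len(y_va), len(y_te))
```

Taking half of the 40% hold-out gives 0.5 × 0.4 = 0.2, which is why `test_size=0.5` in the second step yields 20% of the original data for each of the validation and test sets.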

In [None]:
df[['Credit_Score']].value_counts()

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit


strat_split = StratifiedShuffleSplit(n_splits=1, test_size=0.4, random_state=42)

# Generate the indices and split the data into training and test + validation sets
for train_index, test_valid_index in strat_split.split(df, df['Credit_Score']):
    strat_train_set = df.iloc[train_index]
    strat_test_valid_set = df.iloc[test_valid_index]

# Instantiate another StratifiedShuffleSplit object to split the test and validation sets
strat_split_test_valid = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=42)

# Generate the indices and split the test + validation sets into test and validation sets
for test_index, valid_index in strat_split_test_valid.split(strat_test_valid_set, strat_test_valid_set['Credit_Score']):
    strat_test_set = strat_test_valid_set.iloc[test_index]
    strat_valid_set = strat_test_valid_set.iloc[valid_index]

#### Verification of the split
Let's start with verifying the shapes of the sets.

In [None]:
strat_train_set.shape, strat_valid_set.shape, strat_test_set.shape

The shapes of the training, validation and testing sets are correct.

Let's visualize using stacked bar graphs.

In [None]:
# List of all the features
features = [
    'Delay_from_due_date',
    'Num_of_Delayed_Payment',
    'Num_Credit_Inquiries',
    'Credit_Utilization_Ratio',
    'Credit_History_Age',
    'Amount_invested_monthly',
    'Monthly_Balance',
    'Credit_Score',
    'Credit_Mix',
    'Payment_Behaviour',
    'Age',
    'Annual_Income',
    'Num_Bank_Accounts',
    'Num_Credit_Card',
    'Interest_Rate',
    'Num_of_Loan',
    'Monthly_Inhand_Salary',
    'Changed_Credit_Limit',
    'Outstanding_Debt',
    'Total_EMI_per_month',
    'Payment_of_Min_Amount_NM',
    'Payment_of_Min_Amount_No',
    'Payment_of_Min_Amount_Yes'
]

In [None]:
# Function to plot the stacked bar chart of the training, validation, and testing sets for visual inspection
def plot_category_proportions(df, train_set, valid_set, test_set, features):
    # DataFrame.plot below creates its own figure for every feature
    for feature in features:
        # Calculate the value counts for each dataset
        full_counts = df[feature].value_counts(normalize=True).sort_index()
        train_counts = train_set[feature].value_counts(normalize=True).sort_index()
        valid_counts = valid_set[feature].value_counts(normalize=True).sort_index()
        test_counts = test_set[feature].value_counts(normalize=True).sort_index()
        
        # Correctly union all indices
        all_indices = full_counts.index.union(train_counts.index).union(valid_counts.index).union(test_counts.index)
        
        train_counts = train_counts.reindex(all_indices, fill_value=0)
        valid_counts = valid_counts.reindex(all_indices, fill_value=0)
        test_counts = test_counts.reindex(all_indices, fill_value=0)
        
        proportions = pd.DataFrame({
            'Training': train_counts,
            'Validation': valid_counts,
            'Testing': test_counts
        })
        
        # Plot
        proportions.plot(kind='bar', stacked=True, figsize=(10, 6))
        plt.title(f'Proportional Distribution of {feature} Across Sets')
        plt.ylabel('Proportion of Total')
        plt.xlabel(feature)
        plt.show()

In [None]:
plot_category_proportions(df, strat_train_set, strat_valid_set, strat_test_set, features)

The plots above show that the data is split correctly.
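
The stratification can also be checked numerically by comparing the class proportions of the target across the sets. The sketch below is a hedged illustration on toy targets; `class_proportions` is a hypothetical helper, and on the real data it could be called with `df['Credit_Score']` and the target column of each split.

```python
import pandas as pd

def class_proportions(**named_targets: pd.Series) -> pd.DataFrame:
    """Return one column of class proportions per named target series."""
    return pd.DataFrame({
        name: y.value_counts(normalize=True).sort_index()
        for name, y in named_targets.items()
    }).fillna(0.0)

# Toy targets, invented for illustration
toy_full = pd.Series([1, 1, 2, 2, 2, 2, 3, 3])
toy_train = pd.Series([1, 2, 2, 3])
toy_valid = pd.Series([1, 2, 2, 3])

toy_props = class_proportions(full=toy_full, train=toy_train, valid=toy_valid)
max_gap = toy_props.sub(toy_props["full"], axis=0).abs().to_numpy().max()
print(toy_props)
print("largest deviation from the full dataset:", max_gap)
```

A largest deviation close to zero indicates that each set preserves the class balance of the full dataset.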

Now, we will get the X and y dataframes for each of the training, testing and validation sets.

In [None]:
# Training
X_train = strat_train_set.drop('Credit_Score', axis=1)
y_train = strat_train_set["Credit_Score"]

# Validation
X_val = strat_valid_set.drop('Credit_Score', axis=1)
y_val = strat_valid_set["Credit_Score"]

# Testing
X_test = strat_test_set.drop('Credit_Score', axis=1)
y_test = strat_test_set["Credit_Score"]

Now, we can proceed with the model training.

## 4. Training Models

In [None]:
# Feature Scaling using Standard Scaler
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

X_train = pd.DataFrame(scaler.fit_transform(X_train))
X_val = pd.DataFrame(scaler.transform(X_val))
X_test = pd.DataFrame(scaler.transform(X_test))
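
Wrapping the scaled arrays in `pd.DataFrame(...)` without arguments drops the column names and the index. If the names are needed later (for example for feature importance), they can be kept as in this minimal sketch; the toy frames and `toy_*` names are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy splits, invented for illustration
toy_train = pd.DataFrame({"Age": [20.0, 30.0, 40.0], "Num_of_Loan": [0.0, 2.0, 4.0]})
toy_val = pd.DataFrame({"Age": [30.0], "Num_of_Loan": [6.0]})

toy_scaler = StandardScaler()

# Fit on the training split only, then reuse the same statistics elsewhere
toy_train_scaled = pd.DataFrame(
    toy_scaler.fit_transform(toy_train), columns=toy_train.columns, index=toy_train.index)
toy_val_scaled = pd.DataFrame(
    toy_scaler.transform(toy_val), columns=toy_val.columns, index=toy_val.index)

print(toy_val_scaled)
```

Fitting the scaler only on the training split, as the notebook does, keeps information from the validation and test sets out of the preprocessing step.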

In [None]:
# Function to calculate and print scores
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def compute_print_metrics(y_original, y_pred):
    # Computing the metrics
    acc = accuracy_score(y_original, y_pred)
    pre = precision_score(y_original, y_pred, average='macro')
    rec = recall_score(y_original, y_pred, average="macro")
    f1 = f1_score(y_original, y_pred, average="macro")

    # Print the metrics
    print("Accuracy: ", acc)
    print("Precision: ", pre)
    print("Recall: ", rec)
    print("F1 Score: ", f1)
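
The `average='macro'` option computes the metric separately for each class and then takes the unweighted mean, so every class counts equally regardless of its size. The sketch below reproduces macro precision by hand on made-up labels and compares it with sklearn; `macro_precision_by_hand` is a hypothetical helper written only for this illustration.

```python
from sklearn.metrics import precision_score

def macro_precision_by_hand(y_true, y_pred):
    """Average the per-class precision, giving every class equal weight."""
    per_class = []
    for label in sorted(set(y_true) | set(y_pred)):
        # True labels of the samples that were predicted as `label`
        predicted = [t for t, p in zip(y_true, y_pred) if p == label]
        # Precision of a class that is never predicted is taken as 0 here
        per_class.append(predicted.count(label) / len(predicted) if predicted else 0.0)
    return sum(per_class) / len(per_class)

# Toy labels, invented for illustration
toy_true = [1, 1, 2, 2, 3, 3]
toy_pred = [1, 2, 2, 2, 3, 1]

by_hand = macro_precision_by_hand(toy_true, toy_pred)
by_sklearn = precision_score(toy_true, toy_pred, average="macro")
print(by_hand, by_sklearn)
```

Here the per-class precisions are 1/2, 2/3, and 1, so the macro average is 13/18.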

### 4A Multinomial Logistic Regression

Let's train a base model first

In [None]:
# Multinomial Logistic Regression (softmax regression)

# Import statements
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Example for training a logistic regression model
log_reg_initial = LogisticRegression(multi_class='multinomial',
                             solver='newton-cg', C=100,
                             max_iter=1)

# Fit on training data
log_reg_initial.fit(X_train, y_train)

# Predict on the training, validation, and test data
y_train_pred_log = log_reg_initial.predict(X_train)
y_val_pred_log = log_reg_initial.predict(X_val)
y_test_pred_log = log_reg_initial.predict(X_test)

In [None]:
print("\nTraining Set Metrics")
compute_print_metrics(y_train, y_train_pred_log)

print("\nValidation Set Metrics")
compute_print_metrics(y_val, y_val_pred_log)

print("\nTest Set Metrics")
compute_print_metrics(y_test, y_test_pred_log)

Get best hyperparameters using Grid Search.

In [None]:
# Using Grid Search to find the best parameters
from sklearn.model_selection import GridSearchCV

log_reg =  LogisticRegression(multi_class='multinomial')

lr_param_grid = {
    'C': [10, 15, 20],
    'solver': ['newton-cg', 'sag', 'saga'],
    'max_iter': [90, 100, 150]
}


grid_search = GridSearchCV(log_reg, param_grid=lr_param_grid, scoring='accuracy', verbose=1, cv=5)


# Fit the grid search to the data
grid_search.fit(X_train, y_train)
grid_search.best_params_

In [None]:
log_reg = grid_search.best_estimator_

# Print the best parameters found
print("Best parameters found:", grid_search.best_params_)

# Printing the metrics
y_train_pred_log_reg = log_reg.predict(X_train)
y_val_pred_log_reg = log_reg.predict(X_val)
y_test_pred_log_reg = log_reg.predict(X_test)

print("\n\nTraining Set Metrics")
compute_print_metrics(y_train, y_train_pred_log_reg)

print("\n\nValidation Set Metrics")
compute_print_metrics(y_val, y_val_pred_log_reg)

print("\nTest Set Metrics")
compute_print_metrics(y_test, y_test_pred_log_reg)

Performance of the Logistic Regression Classifier:
-----------------------------------
- **Base Model:**

1. Training Set:
*   Accuracy:  58.16%
*   Precision:  64.70%
*   Recall:  70.04%
*   F1 Score:  58.47%

2. Validation Set:
*   Accuracy:  57.50%
*   Precision:  63.93%
*   Recall:  68.54%
*   F1 Score:  57.82%

3. Test Set:
*   Accuracy:  50.00%
*   Precision:  55.92%
*   Recall:  62.30%
*   F1 Score:  50.17%
-----------------------------------

- **Best Estimator**


1. Training Set:
*   Accuracy:  76.33%
*   Precision:  76.06%
*   Recall:  75.21%
*   F1 Score: 75.60%

2. Validation Set:
*   Accuracy:  71.00%
*   Precision:  69.88%
*   Recall:  69.43%
*   F1 Score:  69.51%

3. Test Set:
*   Accuracy: 62.00%
*   Precision: 60.54%
*   Recall:  63.02%
*   F1 Score: 61.38%
-----------------------------------
**Findings:**
1. Best Hyperparameters: After performing the grid search, the best hyperparameters for logistic regression were found to be C=10, solver='saga', and max_iter=90.

2. The accuracy of the tuned model on the training set was much better (76.33%) than that of the base model (58.16%). The tuned model also clearly outperformed the base model on the validation and test data, with balanced precision and recall. The F1 score of the tuned model was also much better on all three sets.

**Hyperparameters Discussion:**

1. Regularization Strength (C):
Lower values of C increase the regularization strength, leading to a simpler model with smaller coefficients. This made the model perform worse on the training, validation, and testing sets. Higher values of C had the opposite effect: the model started to overfit the training data, performing only slightly better, and still poorly, on the validation and testing sets.

2. Solver: We also experimented with the solver. The accuracy, precision, recall, and F1 score did not vary much across solvers, but on this particular dataset 'newton-cg' gave the best performance in these experiments.

3. max_iter: Trying different values of the max_iter hyperparameter, we found that the model's performance metrics did not increase significantly beyond a certain value, while training took longer. Lower values of this hyperparameter, however, resulted in underfitting of the model.

4. Based on the number of hyperparameters available and their range of values, there were many possible combinations, but Grid Search helped in narrowing down the best values for each.
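
The effect of a hyperparameter such as C can be inspected directly from the `cv_results_` attribute of a fitted `GridSearchCV`. The sketch below runs on synthetic data from `make_classification` (not on the credit dataset), and the grid values and `toy_*` names are assumptions chosen only to keep the example small.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic three-class data, used only to make the sketch runnable
X_toy, y_toy = make_classification(n_samples=300, n_features=8, n_informative=5,
                                   n_classes=3, random_state=0)

toy_search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 1, 100]},
    scoring="accuracy", cv=3, return_train_score=True)
toy_search.fit(X_toy, y_toy)

# One row per value of C, with the mean training and cross-validation accuracy
toy_results = pd.DataFrame(toy_search.cv_results_)[
    ["param_C", "mean_train_score", "mean_test_score"]]
print(toy_results)
```

A growing gap between `mean_train_score` and `mean_test_score` as C increases would be a sign of overfitting, which is the pattern described in the discussion above.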

### 4B Support Vector Machines

Hyperparameters to explore: C, kernel, degree of the polynomial kernel, and gamma.

SVM Classifiers perform better with scaled features. The features are already scaled using the Standard Scaler.

Train a base classifier first to get a reference against which to compare the optimized model with the best hyperparameters.

In [None]:
from sklearn.svm import SVC

# Base model with the RBF kernel
# Note: coef0 only affects the 'poly' and 'sigmoid' kernels, so it has no effect here
poly_kernel_svm_clf = SVC(kernel="rbf", coef0=2, C=25, probability=True, gamma=4)

poly_kernel_svm_clf.fit(X_train, y_train)

In [None]:
print("\n\nTraining Set Metrics")
compute_print_metrics(y_train, poly_kernel_svm_clf.predict(X_train))

print("\n\nValidation Set Metrics")
compute_print_metrics(y_val, poly_kernel_svm_clf.predict(X_val))

print("\n\nTest Set Metrics")
compute_print_metrics(y_test, poly_kernel_svm_clf.predict(X_test))

The model is clearly overfitting the training set: it scores perfectly there but much lower on the validation and test sets.

Let's use Grid Search to find the best hyperparameters (including the kernel) for the SVM Classifier and reduce the overfitting.

In [None]:
# Define the model
svm_clf = SVC(probability=True)

# Define the hyperparameters grid to explore
svc_param_grid = {
    'kernel': ["poly", "rbf", "sigmoid", ],
    'C': [0.005, 0.05, 0.5, 0.1, 1, 1.5, 2, 4, 6],
    'gamma': ['scale', 'auto'] + [ 0.001, 0.01, 0.1, 1, 10, 100]
}

svc_grid_search = GridSearchCV(svm_clf, param_grid=svc_param_grid,cv=10)
svc_grid_search.fit(X_train, y_train)

In [None]:
# Tuned Model
svc_clf = svc_grid_search.best_estimator_

# Print the best parameters found
print("Best parameters found:", svc_grid_search.best_params_)

# Printing the metrics for the tuned model
y_train_pred_svc = svc_clf.predict(X_train)
y_val_pred_svc = svc_clf.predict(X_val)
y_test_pred_svc = svc_clf.predict(X_test)

print("\nTraining Set Metrics")
compute_print_metrics(y_train, y_train_pred_svc)

print("\nValidation Set Metrics")
compute_print_metrics(y_val, y_val_pred_svc)

print("\nTest Set Metrics")
compute_print_metrics(y_test, y_test_pred_svc)

Performance of the SVM Classifier:
-----------------------------------
- **Base Model:**

1. Training Set:
*   Accuracy:  100%
*   Precision:  100%
*   Recall:  100%
*   F1 Score:  100%

2. Validation Set:
*   Accuracy:  54.00%
*   Precision:  49.35%
*   Recall:  38.30%
*   F1 Score:  34.46%

3. Test Set:
*   Accuracy:  56.00%
*   Precision:  62.48%
*   Recall:  39.90%
*   F1 Score:  37.33%

The model clearly overfits the training data. We will leave it as is, because the model used in the ensemble will be the tuned model, not this one.

-----------------------------------
- **Best Estimator**

1. Training Set:
*   Accuracy:  83.50%
*   Precision:  83.18%
*   Recall:  82.80%
*   F1 Score: 83.02%

2. Validation Set:
*   Accuracy:  76.00%
*   Precision:  75.49%
*   Recall:  74.68%
*   F1 Score:  74.98%

3. Test Set:
*   Accuracy: 73.50%
*   Precision: 72.87%
*   Recall:  72.86%
*   F1 Score: 72.38%
-----------------------------------

**Findings:**

1. Best Hyperparameters: After performing the grid search, the best hyperparameters for the SVM Classifier are C=0.1, gamma=0.1, and kernel='poly'.

2. As expected, the accuracy, recall, precision, and F1 score increased on the validation and testing sets after tuning the model, while the training scores dropped from a perfect 100%, which indicates less overfitting.

**Hyperparameters Discussion:**

1. C: Higher values of C resulted in the overfitting of the model and lower values of C resulted in underfitting.

2. gamma: Not all kernels have this parameter. We tried it with 'rbf', 'poly', and 'sigmoid'. The results were similar. Higher value resulted in overfitting and the lower value resulted in underfitting.

3. kernel: Different kernels performed very differently. We experimented with the linear, rbf, poly, and sigmoid kernels, out of which the poly kernel gave the best performance. That was also the kernel chosen by grid search.
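
The grid used above does not search over the degree of the polynomial kernel. One possible way to include it is to pass a list of dictionaries, so that `degree` is only combined with the "poly" kernel. The grid below is a hypothetical example with made-up values, not the grid used in this notebook; `ParameterGrid` is used only to count the resulting candidates.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid: `degree` is only searched together with the "poly" kernel,
# because the other kernels ignore it
svc_param_grid_with_degree = [
    {"kernel": ["poly"], "degree": [2, 3, 4], "C": [0.1, 1], "gamma": ["scale", 0.1]},
    {"kernel": ["rbf", "sigmoid"], "C": [0.1, 1], "gamma": ["scale", 0.1]},
]

n_candidates = len(ParameterGrid(svc_param_grid_with_degree))
print(n_candidates)  # 3*2*2 + 2*2*2 = 20
```

The same list can be passed as `param_grid` to `GridSearchCV`, which avoids refitting identical rbf and sigmoid models once per value of `degree`.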



### 4C Random Forest Classifier

We also analyze feature importance. Hyperparameters to explore: the number of trees, the maximum depth, the minimum number of samples required to split an internal node, and the minimum number of samples required to be at a leaf node.

Train a base model to compare the optimized best model to.

In [None]:
# Base Model
from sklearn.ensemble import RandomForestClassifier

rf_clf_initial = RandomForestClassifier(n_estimators=10,
                                max_depth=3,
                                min_samples_split=10,
                                min_samples_leaf=15)

# Scaling is not required for a Random Forest Classifier
rf_clf_initial.fit(X_train, y_train)

In [None]:
# Printing the metrics
y_train_pred_rf_initial = rf_clf_initial.predict(X_train)
y_val_pred_rf_initial = rf_clf_initial.predict(X_val)
y_test_pred_rf_initial = rf_clf_initial.predict(X_test)

print("\nTraining Set Metrics")
compute_print_metrics(y_train, y_train_pred_rf_initial)

print("\nValidation Set Metrics")
compute_print_metrics(y_val, y_val_pred_rf_initial)

print("\nTest Set Metrics")
compute_print_metrics(y_test, y_test_pred_rf_initial)

In [None]:
# Feature Importance
feature_names = [col for col in df.columns if col != 'Credit_Score']
importances = rf_clf_initial.feature_importances_

# Create a DataFrame to hold the feature names and their importance scores
features_importances_df = pd.DataFrame(zip(feature_names, importances), columns=['Feature', 'Importance'])

# Sort the DataFrame by importance in descending order
features_importances_df = features_importances_df.sort_values(by='Importance', ascending=False)

# Plotting
plt.figure(figsize=(10, 8))
plt.barh(features_importances_df['Feature'], features_importances_df['Importance'])
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Feature Importances')
plt.gca().invert_yaxis()  # Invert y-axis to have the most important feature on top
plt.show()

### Feature Importance from the base model.
* Credit Mix is the most important feature of all to predict the Credit Score of a person, followed by Interest Rate, Credit History Age, Number of delayed Payments.
* The least important features include Amount invested monthly, Number of Credit Inquiries, Credit Utilization Ratio, and Payment of Minimum Amount (NM), which is the least important.

In [None]:
# Grid Search to find the best parameters

# Define the model
rf_clf = RandomForestClassifier(random_state=10)

# Define the hyperparameters grid to explore
rf_param_grid = {"n_estimators": [8, 10, 11],
                 "max_depth": [5, 10, 12, 18],
                 "min_samples_split": [8, 9, 10, 12],
                 "min_samples_leaf": [3, 4, 6]
}

rf_grid_search = GridSearchCV(rf_clf, param_grid=rf_param_grid,cv=10)
rf_grid_search.fit(X_train, y_train)

In [None]:
rf_clf = rf_grid_search.best_estimator_

# Feature Importance
feature_names = [col for col in df.columns if col != 'Credit_Score']
importances = rf_clf.feature_importances_

# Create a DataFrame to hold the feature names and their importance scores
features_importances_df = pd.DataFrame(zip(feature_names, importances), columns=['Feature', 'Importance'])

# Sort the DataFrame by importance in descending order
features_importances_df = features_importances_df.sort_values(by='Importance', ascending=False)

# Plotting
plt.figure(figsize=(10, 8))
plt.barh(features_importances_df['Feature'], features_importances_df['Importance'])
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Feature Importances')
plt.gca().invert_yaxis()  # Invert y-axis to have the most important feature on top
plt.show()



### Feature Importance from the best estimator model.
* Credit Mix is the most important feature of all to predict the Credit Score of a person, followed by Interest Rate, Number of Delayed Payments, and Annual Income.
* The least important features include Monthly Balance, Payment of Minimum Amount (Yes), Payment of Minimum Amount (No), and lastly Payment of Minimum Amount (NM).
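
The importances reported by a Random Forest in sklearn are impurity-based and relative: they sum to 1, and each value is only meaningful when paired with the right column. The sketch below shows the pairing and ranking on synthetic data with made-up column names (`feature_0` to `feature_4`); it is an illustration, not a re-run of the analysis above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with made-up column names, used only to make the sketch runnable
X_arr, y_toy = make_classification(n_samples=200, n_features=5, n_informative=3,
                                   n_redundant=0, n_classes=3, random_state=0)
X_toy = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(5)])

toy_rf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_toy, y_toy)

# Pair each importance with the column it was computed for, then rank
toy_importances = pd.Series(toy_rf.feature_importances_, index=X_toy.columns)
toy_ranked = toy_importances.sort_values(ascending=False)
print(toy_ranked)
```

In this notebook the scaled feature frames no longer carry column names, so the names are taken from `df.columns`; this works because the column order is unchanged by the split and the scaling.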

In [None]:
# Print the best parameters found
print("Best parameters found:", rf_grid_search.best_params_)

# Printing the metrics for the tuned model
y_train_pred_rf = rf_clf.predict(X_train)
y_val_pred_rf = rf_clf.predict(X_val)
y_test_pred_rf = rf_clf.predict(X_test)

print("\nTraining Set Metrics")
compute_print_metrics(y_train, y_train_pred_rf)

print("\nValidation Set Metrics")
compute_print_metrics(y_val, y_val_pred_rf)

print("\nTest Set Metrics")
compute_print_metrics(y_test, y_test_pred_rf)

Performance of the Random Forest Classifier:
-----------------------------------
- **Base Model:**

1. Training Set:
*   Accuracy:  75.67%
*   Precision:  74.37%
*   Recall:  77.13%
*   F1 Score:  75.51%

2. Validation Set:
*   Accuracy:  65.00%
*   Precision:  63.46%
*   Recall:  66.59%
*   F1 Score:  64.55%

3. Test Set:
*   Accuracy:  65.00%
*   Precision:  63.61%
*   Recall:  68.31%
*   F1 Score:  64.88%
-----------------------------------
- **Best Estimator**

1. Training Set:
*   Accuracy:  87.00%
*   Precision:  86.14%
*   Recall:  87.22%
*   F1 Score:  86.65%

2. Validation Set:
*   Accuracy:  75.50%
*   Precision:  74.57%
*   Recall:  75.01%
*   F1 Score:  74.68%

3. Test Set:
*   Accuracy: 73.50%
*   Precision: 72.37%
*   Recall:  74.68%
*   F1 Score: 73.01%
-----------------------------------

The tuned model performed much better than the base model. The performance metrics on the validation and testing sets are much higher for the tuned model.

**Findings:**
1. Best Hyperparameters: After performing the grid search, the best hyperparameters for the Random Forest Classifier were max_depth=18, min_samples_leaf=3, min_samples_split=12, n_estimators=11.

**Hyperparameters Discussion:**
1. max_depth: As expected, higher values of this hyperparameter resulted in overfitting and lower values resulted in underfitting.
2. min_samples_leaf: Higher values made the model perform better as the classifier was able to generalize better. It prevented the model from creating leaves that are too specific, and the model was able to perform better on the validation and training sets.
3. min_samples_split: Higher values of this hyperparameter put more restrictions on splitting the nodes, resulting in less dense trees. The model was able to generalize better and performed well on the testing and validation sets.
4. n_estimators: It specifies the number of trees in the ensemble. We kept increasing its value and saw that more trees gave better results, but beyond a certain point increasing the value no longer changed the model's performance metrics.





## 5. Combining the Classifiers into an Ensemble

The goal is to combine the classifiers into an ensemble that outperforms each individual classifier on the validation set, to evaluate the best ensemble on the test set, and to discuss the findings.

We will combine the Logistic Regression, Support Vector Machine Classifier, and Random Forest Classifier into ensembles, starting with a Soft Voting Classifier. Soft voting often does better than hard voting because it gives more weight to confident predictions, although this is not guaranteed.
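
The difference between the two voting schemes can be seen on a single sample. In the sketch below the class probabilities are made up for illustration (they do not come from the trained models): hard voting counts one vote per classifier, while soft voting averages the predicted probabilities before choosing a class.

```python
import numpy as np

# Made-up class probabilities from three classifiers for a single sample
# (columns: Poor, Standard, Good)
toy_probas = np.array([
    [0.90, 0.10, 0.00],   # classifier A is very confident in "Poor"
    [0.40, 0.60, 0.00],   # classifier B leans towards "Standard"
    [0.45, 0.55, 0.00],   # classifier C leans towards "Standard"
])

# Hard voting: each classifier casts one vote for its most likely class
votes = toy_probas.argmax(axis=1)
hard_choice = int(np.bincount(votes, minlength=3).argmax())

# Soft voting: average the probabilities, then pick the most likely class
soft_choice = int(toy_probas.mean(axis=0).argmax())

print(hard_choice, soft_choice)
```

Here hard voting picks column 1 ("Standard") with two votes out of three, while soft voting picks column 0 ("Poor") because the one confident classifier outweighs the two hesitant ones. This is also why soft voting requires `probability=True` for the SVC.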

In [None]:
# function to compare the validation set metrics for an ensemble

def compare_ensemble_val(ensemble, X_val, y_val):
    print("--------------------------------------------------------------------")
    print("Validation Set Metrics")
    print("--------------------------------------------------------------------")
    y_val_pred_ensemble = ensemble.predict(X_val)
    print(f"\nEnsemble: {ensemble}")
    compute_print_metrics(y_val, y_val_pred_ensemble)


    for clf in (log_reg, svc_clf, rf_clf):
        y_pred_est = clf.predict(X_val)
        print(f"\nEstimator: {clf}")
        compute_print_metrics(y_val, y_pred_est)

def compare_ensemble_val_and_test(ensemble, X_val, y_val, X_test, y_test):
    print("--------------------------------------------------------------------")
    print("Validation Set Metrics")
    print("--------------------------------------------------------------------")
    y_val_pred_ensemble = ensemble.predict(X_val)
    print(f"\nEnsemble: {ensemble}")
    compute_print_metrics(y_val, y_val_pred_ensemble)


    for clf in (log_reg, svc_clf, rf_clf):
        y_pred_est = clf.predict(X_val)
        print(f"\nEstimator: {clf}")
        compute_print_metrics(y_val, y_pred_est)
    print("--------------------------------------------------------------------")
    print("Test Set Metrics")
    print("--------------------------------------------------------------------")
    y_test_pred_ensemble = ensemble.predict(X_test)
    print(f"\nEnsemble: {ensemble}")
    compute_print_metrics(y_test, y_test_pred_ensemble)


    for clf in (log_reg, svc_clf, rf_clf):
        y_pred_est = clf.predict(X_test)
        print(f"\nEstimator: {clf}")
        compute_print_metrics(y_test, y_pred_est)

### Soft-Voting Classifier

In [None]:
from sklearn.ensemble import VotingClassifier

soft_voting_clf = VotingClassifier(
    estimators = [("lr", log_reg), ("svc", svc_clf), ("rf", rf_clf)],
    voting="soft"
)

soft_voting_clf.fit(X_train, y_train)

In [None]:
print("Soft Voting Classifier Performance Metrics")
print("Training Set Metrics")
compute_print_metrics(y_train, soft_voting_clf.predict(X_train))

print("\n\nValidation Set Metrics")
compute_print_metrics(y_val, soft_voting_clf.predict(X_val))

#print("\n\nTest Set Metrics")
#compute_print_metrics(y_test, soft_voting_clf.predict(X_test))

In [None]:
compare_ensemble_val(soft_voting_clf, X_val, y_val)

The Soft-Voting Classifier did not outperform the individual classifiers.
Here is the summary of its performance on the training and validation sets:

-- Training Set Metrics
* Accuracy:  83.67%
* Precision:  83.19%
* Recall:  83.31%
* F1 Score:  83.25%


-- Validation Set Metrics
* Accuracy:  74.00%
* Precision:  73.41%
* Recall:  72.63%
* F1 Score:  72.91%

### Hard Voting Classifier

In [None]:
from sklearn.ensemble import VotingClassifier

hard_voting_clf = VotingClassifier(
    estimators = [("lr", log_reg), ("svc", svc_clf), ("rf", rf_clf)],
    voting="hard"
)

hard_voting_clf.fit(X_train, y_train)

In [None]:
print("Hard Voting Classifier Performance Metrics")
print("Training Set Metrics")
compute_print_metrics(y_train, hard_voting_clf.predict(X_train))

print("\n\nValidation Set Metrics")
compute_print_metrics(y_val, hard_voting_clf.predict(X_val))

#print("\n\nTest Set Metrics")
#compute_print_metrics(y_test, hard_voting_clf.predict(X_test))

In [None]:
compare_ensemble_val(hard_voting_clf, X_val, y_val)

Hard-Voting Classifier also did not outperform the individual classifiers.
Here is the summary of its performance on the training and validation sets:

-- Training Set Metrics
* Accuracy:  83.67%
* Precision:  83.31%
* Recall:  83.09%
* F1 Score:  83.20%


-- Validation Set Metrics
* Accuracy:  74.50%
* Precision:  74.04%
* Recall:  72.95%
* F1 Score:  73.34%

### Stacking

In [None]:
from sklearn.ensemble import StackingClassifier

stack = StackingClassifier(estimators = [("lr", log_reg), ("svc", svc_clf), ("rf", rf_clf)],
                           final_estimator=LogisticRegression(C=0.01, multi_class='multinomial', solver='newton-cg'))

stack.fit(X_train, y_train)

In [None]:
print("Stacking Classifier Performance Metrics")
print("Training Set Metrics")
compute_print_metrics(y_train, stack.predict(X_train))

print("\n\nValidation Set Metrics")
compute_print_metrics(y_val, stack.predict(X_val))


In [None]:
compare_ensemble_val(stack, X_val, y_val)

Stacking, like the other ensembles we tried, did not outperform the individual classifiers.
Here is the summary of the stacking ensemble:

-- Training Set Metrics
* Accuracy:  80.50%
* Precision:  87.71%
* Recall:  74.09%
* F1 Score:  78.30%


-- Validation Set Metrics
* Accuracy:  70.50%
* Precision:  76.11%
* Recall:  63.05%
* F1 Score:  66.47%

### Comparing test set performance of the ensemble with the individual classifiers

The Soft-Voting and Hard-Voting Classifiers performed almost identically on the validation set (74.00% vs 74.50% accuracy), and both were ahead of Stacking (70.50%). We will use the Soft-Voting Classifier to compare the performance of an ensemble versus the individual classifiers.

In [None]:
compare_ensemble_val_and_test(soft_voting_clf, X_val, y_val, X_test, y_test)

On the test set, the soft-voting classifier did perform better than the Logistic Regression, but it could not out-perform the Support Vector Machine Classifier and Random Forest Classifier.