<a href="https://colab.research.google.com/github/mmkeyes140/ai_learning/blob/main/comparing_sampling_methods.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Activity: Comparing Imbalanced Classifiers

In this activity, you’ll fit various balanced and imbalanced models to small business loan data. You’ll then compare the results by using the metrics that you’ve learned.


In [None]:
# Import the required modules
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix
from sklearn.metrics import balanced_accuracy_score
from imblearn.metrics import classification_report_imbalanced

import warnings
warnings.filterwarnings("ignore")

## Step 1: Read in the CSV file from the `Resources` folder into a Pandas DataFrame.

In [None]:
# Read the sba-loans.csv file from the hosted URL into a Pandas DataFrame
loans_df = pd.read_csv('https://static.bc-edx.com/mbc/ai/m5/datasets/sba-loans.csv')

# Review the DataFrame
loans_df.head()


## Step 2: Create a Series named `y` that contains the data from the "Default" column of the original DataFrame. Note that this Series will contain the labels. Create a new DataFrame named `X` that contains the remaining columns from the original DataFrame. Note that this DataFrame will contain the features.

In [None]:
# Split the data into X (features) and y (labels)

# The y variable should focus on the Default column
y = loans_df['Default']


# The X variable should include all features except the Default column
X = loans_df.copy()
X = X.drop(columns = "Default")
X.head()

   Year  Month  Amount  Term    Zip  CreateJob  NoEmp  RealEstate  RevLineCr  UrbanRural
0  2001     11   32812    36  92801          0      1           0          1           0
1  2001      4   30000    56  90505          0      1           0          1           0
2  2001      4   30000    36  92103          0     10           0          1           0
3  2003     10   50000    36  92108          0      6           0          1           0
4  2006      7  343000   240  91345          3     65           1          0           2


## Step 3: Split the features and labels into training and testing sets, and scale the X data with `StandardScaler`.

In [None]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [None]:
# Scale the data
scaler = StandardScaler()
X_scaler = scaler.fit(X_train)
X_train_scaled = pd.DataFrame(X_scaler.transform(X_train))
X_test_scaled = pd.DataFrame(X_scaler.transform(X_test))
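
The scaler above is fit on the training split only, so the testing data never influences the mean and standard deviation used for scaling. The following is a minimal pure-Python sketch of that idea. The loan amounts and the helper names `fit_scaler` and `transform` are made up for illustration and are not part of scikit-learn or the dataset.

```python
from statistics import mean, pstdev

def fit_scaler(values):
    """Return the mean and population standard deviation of the training values."""
    return mean(values), pstdev(values)

def transform(values, mu, sigma):
    """Standardize values with statistics learned from the training split."""
    return [(v - mu) / sigma for v in values]

# Hypothetical loan amounts, for illustration only
train_amounts = [30000, 50000, 40000, 60000]
test_amounts = [45000, 90000]

# Learn the statistics from the training data only
mu, sigma = fit_scaler(train_amounts)

# Apply the same statistics to both splits
train_scaled = transform(train_amounts, mu, sigma)
test_scaled = transform(test_amounts, mu, sigma)

print(train_scaled)
print(test_scaled)
```

The scaled training values have a mean of 0 and a standard deviation of 1, while the scaled testing values generally do not, because they are measured against the training statistics.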


## Step 4: Check the magnitude of the imbalance in the dataset by viewing the count of each distinct value (`value_counts`) in the labels.

In [None]:
# Count the distinct values in the training labels
y_train.value_counts()


0    1063
1      96
Name: Default, dtype: int64

In [None]:
# Count the distinct values in the testing labels
y_test.value_counts()


0    348
1     39
Name: Default, dtype: int64
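
To put the counts above on a common scale, it can help to express the minority class as a share of each split. The sketch below copies the counts from the two outputs; the helper name `minority_share` is hypothetical.

```python
def minority_share(counts):
    """Return (minority_label, share of all rows) for a {label: count} mapping."""
    total = sum(counts.values())
    label = min(counts, key=counts.get)
    return label, counts[label] / total

# Counts copied from the value_counts() outputs above
train_counts = {0: 1063, 1: 96}
test_counts = {0: 348, 1: 39}

for name, counts in [("train", train_counts), ("test", test_counts)]:
    label, share = minority_share(counts)
    print(f"{name}: class {label} is {share:.1%} of the rows")
# train: class 1 is 8.3% of the rows
# test: class 1 is 10.1% of the rows
```

If the gap between the two shares matters for your comparison, `train_test_split` accepts a `stratify=y` argument that keeps the class proportions roughly equal across the splits.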

## Step 5: Fit two versions of a random forest model to the data: the first, a regular `RandomForest` classifier, and the second, a `BalancedRandomForest` classifier.

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

# Create a random forest classifier
rf_model = RandomForestClassifier(n_estimators=128, random_state=78)

# Fitting the model
rf_model = rf_model.fit(X_train_scaled, y_train)

# Making predictions using the testing data
rf_predictions = rf_model.predict(X_test_scaled)

acc_score = accuracy_score(y_test, rf_predictions)
print(f"Accuracy Score: {acc_score}")


Accuracy Score: 0.9509043927648578
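
Plain accuracy can look strong on imbalanced data even when a model misses the minority class. As a reference point, the sketch below computes the accuracy of a trivial rule that always predicts class 0, using the testing label counts from Step 4.

```python
# Test-set class counts, from the y_test.value_counts() output in Step 4
n_non_default = 348
n_default = 39

# A rule that always predicts class 0 gets every non-default loan right
baseline_accuracy = n_non_default / (n_non_default + n_default)

# ...but it never flags a single defaulted loan
baseline_recall_default = 0 / n_default

print(f"Always-predict-0 accuracy: {baseline_accuracy:.4f}")
print(f"Always-predict-0 recall on defaults: {baseline_recall_default:.4f}")
# Always-predict-0 accuracy: 0.8992
# Always-predict-0 recall on defaults: 0.0000
```

An accuracy of about 0.95 is therefore only a modest improvement over this baseline, which is why the later steps rely on recall, precision, and balanced accuracy.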


In [None]:
# Import BalancedRandomForestClassifier from imblearn
from imblearn.ensemble import BalancedRandomForestClassifier

# Instantiate a BalancedRandomForestClassifier instance
brf_model = BalancedRandomForestClassifier()

# Fit the model to the training data
brf_model.fit(X_train_scaled, y_train)

In [None]:
# Predict labels for testing features
brf_predictions = brf_model.predict(X_test_scaled)
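
As a rough intuition, a balanced random forest trains each tree on a sample in which the majority class has been randomly undersampled to match the minority class. The sketch below illustrates only that sampling idea with toy labels. It is a simplification and not imblearn's actual implementation, and the helper name `balanced_sample` is hypothetical.

```python
import random
from collections import Counter

def balanced_sample(labels, rng):
    """Return row indices with every class undersampled to the minority class size."""
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    # Size of the smallest class determines how many rows each class keeps
    n_min = min(len(rows) for rows in by_class.values())

    chosen = []
    for rows in by_class.values():
        chosen.extend(rng.sample(rows, n_min))  # sample without replacement
    return chosen

# Toy labels with the same class counts as y_train
labels = [0] * 1063 + [1] * 96
rng = random.Random(1)

idx = balanced_sample(labels, rng)
print(Counter(labels[i] for i in idx))
```

Each tree would see a different balanced sample, so across many trees most of the majority-class rows still contribute to the forest.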

## Step 6: Resample and fit the training data by one additional method for imbalanced data, such as `RandomOverSampler`, undersampling, or a synthetic technique. Re-estimate with `RandomForest`.

In [None]:
# Import SMOTE from imblearn
from imblearn.over_sampling import SMOTE

# Instantiate the SMOTE model instance
smote_sampler = SMOTE(random_state=1, sampling_strategy='auto')

# Fit the SMOTE model to the training data
X_resampled, y_resampled = smote_sampler.fit_resample(X_train_scaled, y_train)

# Fit the RandomForestClassifier on the resampled data
smote_model = RandomForestClassifier()
smote_model.fit(X_resampled, y_resampled)

# Generate predictions based on the resampled data model
smote_predictions = smote_model.predict(X_test_scaled)
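
SMOTE creates synthetic minority rows by interpolating between a minority sample and one of its nearest minority-class neighbors. The sketch below shows only the interpolation step for a single pair of points. The two rows and the helper name `synthesize` are made up for illustration, and the neighbor search that SMOTE performs is omitted.

```python
import random

def synthesize(sample, neighbor, rng):
    """Create one synthetic point on the line segment between two minority samples."""
    gap = rng.random()  # uniform in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbor)]

rng = random.Random(1)

# Two hypothetical minority-class rows (already scaled), for illustration only
sample = [0.2, -1.0]
neighbor = [0.6, 0.0]

new_point = synthesize(sample, neighbor, rng)
print(new_point)
```

Because every synthetic row lies between two real minority rows, SMOTE adds new feature combinations instead of duplicating existing rows, which is what `RandomOverSampler` does.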

## Step 7: Print the confusion matrices, balanced accuracy scores, and classification reports for the three different models.

In [None]:
# Print the confusion matrix for RandomForest on the original data
confusion_matrix(y_test, rf_predictions)

array([[341,   7],
       [ 12,  27]])

In [None]:
# Print the confusion matrix for balanced random forest data
confusion_matrix(y_test, brf_predictions)

array([[312,  36],
       [  5,  34]])

In [None]:
# Print the confusion matrix for your additional model on the resampled data
confusion_matrix(y_test, smote_predictions)

array([[335,  13],
       [ 10,  29]])
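
Each matrix above is laid out as `[[TN, FP], [FN, TP]]`, with true labels in rows and predicted labels in columns. The sketch below reads precision and recall for the default class (1) directly from the three matrices; the helper name `default_precision_recall` is hypothetical.

```python
# Confusion matrices copied from the outputs above: [[TN, FP], [FN, TP]]
matrices = {
    "random forest": [[341, 7], [12, 27]],
    "balanced random forest": [[312, 36], [5, 34]],
    "SMOTE + random forest": [[335, 13], [10, 29]],
}

def default_precision_recall(cm):
    """Return (precision, recall) for class 1 from a binary confusion matrix."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)  # share of flagged loans that really defaulted
    recall = tp / (tp + fn)     # share of defaulted loans that were flagged
    return precision, recall

for name, cm in matrices.items():
    precision, recall = default_precision_recall(cm)
    print(f"{name}: precision={precision:.2f}, recall={recall:.2f}")
# random forest: precision=0.79, recall=0.69
# balanced random forest: precision=0.49, recall=0.87
# SMOTE + random forest: precision=0.69, recall=0.74
```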

In [None]:
# Print the balanced accuracy score for the original data
rf_accuracy = balanced_accuracy_score(y_test, rf_predictions)
print(f"Balanced accuracy score of random forest on the original data: {rf_accuracy}")

Balanced accuracy score of random forest on the original data: 0.8360963748894783


In [None]:
# Print the balanced accuracy score for the balanced random forest model
brf_accuracy = balanced_accuracy_score(y_test, brf_predictions)
print(f"Balanced accuracy score of balanced random forest on the original data: {brf_accuracy}")

Balanced accuracy score of balanced random forest on the original data: 0.8841732979664014


In [None]:
# Print the balanced accuracy score for your additional model with resampled data
smoteModel_accuracy = balanced_accuracy_score(y_test, smote_predictions)
print(f"Balanced accuracy score of random forest on the SMOTE-resampled data: {smoteModel_accuracy:.4f}")

Balanced accuracy score of random forest on the SMOTE-resampled data: 0.8531
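
Balanced accuracy is the mean of the per-class recalls, so it can be checked by hand against the confusion matrices from earlier in this step. The sketch below does that in plain Python; the helper name `balanced_accuracy` is hypothetical and stands in for `balanced_accuracy_score` in the binary case.

```python
def balanced_accuracy(cm):
    """Mean of per-class recall for a binary confusion matrix [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    recall_0 = tn / (tn + fp)
    recall_1 = tp / (tp + fn)
    return (recall_0 + recall_1) / 2

# Confusion matrices copied from the outputs earlier in this step
rf_cm = [[341, 7], [12, 27]]
brf_cm = [[312, 36], [5, 34]]
smote_cm = [[335, 13], [10, 29]]

print(f"random forest:          {balanced_accuracy(rf_cm):.4f}")     # 0.8361
print(f"balanced random forest: {balanced_accuracy(brf_cm):.4f}")    # 0.8842
print(f"SMOTE + random forest:  {balanced_accuracy(smote_cm):.4f}")  # 0.8531
```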


In [None]:
# Print the classification report for the original data
print("Classification Report for original random forest data")
print("_____________________________________________________")
print(classification_report_imbalanced(y_test, rf_predictions))

Classification Report for original random forest data
_____________________________________________________
                   pre       rec       spe        f1       geo       iba       sup

          0       0.97      0.98      0.69      0.97      0.82      0.70       348
          1       0.79      0.69      0.98      0.74      0.82      0.66        39

avg / total       0.95      0.95      0.72      0.95      0.82      0.69       387
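
The `spe` and `geo` columns in the imbalanced report are specificity and the geometric mean of recall and specificity. The sketch below recomputes the class 1 row from the standard random forest's confusion matrix as a sanity check. The `iba` column is left out here.

```python
from math import sqrt

# Confusion matrix of the standard random forest: [[TN, FP], [FN, TP]]
(tn, fp), (fn, tp) = [[341, 7], [12, 27]]

recall_default = tp / (tp + fn)       # share of defaulted loans that were flagged
specificity_default = tn / (tn + fp)  # share of non-defaults correctly left unflagged
geo_mean = sqrt(recall_default * specificity_default)

print(f"rec={recall_default:.2f} spe={specificity_default:.2f} geo={geo_mean:.2f}")
# rec=0.69 spe=0.98 geo=0.82
```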



In [None]:
# Print the classification report for the balanced random forest data
print("Classification Report for balanced random forest data")
print("_____________________________________________________")
print(classification_report(y_test, brf_predictions))

Classification Report for balanced random forest data
_____________________________________________________
              precision    recall  f1-score   support

           0       0.98      0.90      0.94       348
           1       0.49      0.87      0.62        39

    accuracy                           0.89       387
   macro avg       0.73      0.88      0.78       387
weighted avg       0.93      0.89      0.91       387



In [None]:
# Print the classification report for your additional model with resampled data
print("Classification Report for SMOTE/random forest data")
print("__________________________________________________")
print(classification_report(y_test, smote_predictions))

Classification Report for SMOTE/random forest data
__________________________________________________
              precision    recall  f1-score   support

           0       0.97      0.96      0.97       348
           1       0.69      0.74      0.72        39

    accuracy                           0.94       387
   macro avg       0.83      0.85      0.84       387
weighted avg       0.94      0.94      0.94       387



## Step 8: Evaluate the effectiveness of `RandomForest`, `BalancedRandomForest`, and your one additional imbalanced classifier for predicting the minority class.

### Answer the following question:

**Question:** Does the model generated using one of the imbalanced methods more accurately flag all the loans that eventually defaulted?
    
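**Answer:** Yes. The `BalancedRandomForestClassifier` flagged 34 of the 39 defaulted loans in the testing set (recall of 0.87), compared with 27 (0.69) for the standard random forest and 29 (0.74) for the random forest trained on SMOTE-resampled data. It also had the highest balanced accuracy (about 0.88). The trade-off is precision: the balanced model raised 36 false alarms, so its precision for defaults fell to 0.49, versus 0.79 for the standard model and 0.69 for the SMOTE model.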