# Activity: Comparing Imbalanced Classifiers

In this activity, you’ll fit various balanced and imbalanced models to small business loan data. You’ll then compare the results by using the metrics that you’ve learned.


In [1]:
# Import the required modules
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix
from sklearn.metrics import balanced_accuracy_score
from imblearn.metrics import classification_report_imbalanced

import warnings
warnings.filterwarnings("ignore")


## Step 1: Read in the CSV file from the `Resources` folder into a Pandas DataFrame.

In [2]:
# Read the sba_loans.csv file from the Resources folder into a Pandas DataFrame
loans_df = pd.read_csv('https://static.bc-edx.com/mbc/ai/m5/datasets/sba-loans.csv')

# Review the DataFrame
loans_df.head()

Unnamed: 0,Year,Month,Amount,Term,Zip,CreateJob,NoEmp,RealEstate,RevLineCr,UrbanRural,Default
0,2001,11,32812,36,92801,0,1,0,1,0,0
1,2001,4,30000,56,90505,0,1,0,1,0,0
2,2001,4,30000,36,92103,0,10,0,1,0,0
3,2003,10,50000,36,92108,0,6,0,1,0,0
4,2006,7,343000,240,91345,3,65,1,0,2,0


## Step 2: Create a Series named `y` that contains the data from the "Default" column of the original DataFrame. Note that this Series will contain the labels. Create a new DataFrame named `X` that contains the remaining columns from the original DataFrame. Note that this DataFrame will contain the features.

In [3]:
# Split the data into X (features) and y (lables)

# The y variable should focus on the Default column
y = loans_df['Default']

# The X variable should include all features except the Default column
X = loans_df.drop(columns='Default')


In [4]:
X

Unnamed: 0,Year,Month,Amount,Term,Zip,CreateJob,NoEmp,RealEstate,RevLineCr,UrbanRural
0,2001,11,32812,36,92801,0,1,0,1,0
1,2001,4,30000,56,90505,0,1,0,1,0
2,2001,4,30000,36,92103,0,10,0,1,0
3,2003,10,50000,36,92108,0,6,0,1,0
4,2006,7,343000,240,91345,3,65,1,0,2
...,...,...,...,...,...,...,...,...,...,...
1541,2006,6,150000,60,92346,0,5,0,0,2
1542,1997,4,99000,300,92021,0,4,1,0,0
1543,1997,2,50000,84,93012,0,2,0,0,0
1544,1997,1,251150,120,91352,0,3,0,0,0


In [5]:

y

0       0
1       0
2       0
3       0
4       0
       ..
1541    0
1542    0
1543    0
1544    0
1545    0
Name: Default, Length: 1546, dtype: int64

### Step 3: Split the features and labels into training and testing sets, and `StandardScaler` your X data.

In [6]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [7]:
# Scale the data

#first fit to the scaler
scaler = StandardScaler()
X_scaler = scaler.fit(X_train)

#scale that data
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)

In [8]:
X_test_scaled

array([[-0.0361011 , -1.65154289, -0.7850577 , ..., -0.74545603,
        -0.66044182,  0.48588509],
       [-0.51428596,  0.10696718,  0.06854569, ...,  1.34146074,
        -0.66044182,  0.48588509],
       [-0.0361011 , -0.47920284,  0.00292526, ...,  1.34146074,
        -0.66044182,  0.48588509],
       ...,
       [-0.75337839, -1.35845788, -0.65811983, ..., -0.74545603,
        -0.66044182,  0.48588509],
       [-0.75337839,  0.10696718,  0.74518908, ...,  1.34146074,
        -0.66044182,  0.48588509],
       [-0.27519353, -0.77228786, -0.4273726 , ...,  1.34146074,
        -0.66044182,  0.48588509]])

## Step 4: Check the magnitude of imbalance in the data set by viewing  the number of distinct values  (`value_counts`) for the labels.

In [9]:
# Count the distinct values in the orignal labels data
y.value_counts()

0    1411
1     135
Name: Default, dtype: int64

## Step 5: Fit two versions of a random forest model to the data: the first, a regular `RandomForest` classifier, and the second, a `BalancedRandomForest` classifier.

In [11]:
from sklearn.ensemble import RandomForestClassifier

# Create a random forest classifier
rf_model = RandomForestClassifier()

# Fitting the model
rf_model = rf_model.fit(X_train_scaled, y_train)

# Making predictions using the testing data
rf_predictions = rf_model.predict(X_test_scaled)


In [12]:
# Import BalancedRandomForestClassifier from imblearn
from imblearn.ensemble import BalancedRandomForestClassifier

# Instantiate a BalancedRandomForestClassifier instance
brf = BalancedRandomForestClassifier()

# Fit the model to the training data
brf.fit(X_train_scaled, y_train)

In [13]:
# Predict labels for testing features
brf_predictions = brf.predict(X_test_scaled)

## Step 6: Resample and fit the training data by one additional method for imbalanced data, such as `RandomOverSampler`, undersampling, or a synthetic technique. Re-esimate by `RandomForest`.

In [18]:
# Import SMOTE from imblearn
from imblearn.over_sampling import SMOTE
# Instantiate the SMOTE model instance
smote_sampler = SMOTE(random_state=1, sampling_strategy='auto')
# Fit the SMOTE model to the training data
X_resampled, y_resampled = smote_sampler.fit_resample(X_train, y_train)
# Fit the RandomForestClassifier on the resampled data
smote_model = RandomForestClassifier()
smote_model.fit(X_resampled, y_resampled)
# Generate predictions based on the resampled data model
sm_predictions = smote_model.predict(X_test_scaled)

## Step 7: Print the confusion matrixes, accuracy scores, and classification reports for the three different models.

In [16]:
# Print the confusion matrix for RandomForest on the original data
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, rf_predictions)

array([[340,   8],
       [ 13,  26]])

In [17]:
# Print the confusion matrix for balanced random forest data
confusion_matrix(y_test, brf_predictions)

array([[308,  40],
       [  6,  33]])

In [19]:
# Print the confusion matrix for your additional model on the resampled data
confusion_matrix(y_test, sm_predictions)

array([[236, 112],
       [ 19,  20]])

In [20]:
# Print the accuracy score for the original data
from sklearn.metrics import accuracy_score
accuracy_score(y_test, rf_predictions)

0.9457364341085271

In [21]:
# Print the accuracy score for the balanced random forest data
accuracy_score(y_test, brf_predictions)

0.8811369509043928

In [22]:
# Print the accuracy score for your additional model with resampled data
accuracy_score(y_test, sm_predictions)

0.661498708010336

In [25]:
# Print the classification report for the original data
from sklearn.metrics import classification_report
print(classification_report(y_test, rf_predictions))

              precision    recall  f1-score   support

           0       0.96      0.98      0.97       348
           1       0.76      0.67      0.71        39

    accuracy                           0.95       387
   macro avg       0.86      0.82      0.84       387
weighted avg       0.94      0.95      0.94       387



In [29]:
# Print the classification report for the balanced random forest data
print(classification_report(y_test, brf_predictions))

              precision    recall  f1-score   support

           0       0.98      0.89      0.93       348
           1       0.45      0.85      0.59        39

    accuracy                           0.88       387
   macro avg       0.72      0.87      0.76       387
weighted avg       0.93      0.88      0.90       387



In [30]:
# Print the classification report for your additional model with resampled data
print(classification_report(y_test, sm_predictions))

              precision    recall  f1-score   support

           0       0.93      0.68      0.78       348
           1       0.15      0.51      0.23        39

    accuracy                           0.66       387
   macro avg       0.54      0.60      0.51       387
weighted avg       0.85      0.66      0.73       387



## Step 8: Evaluate the effectiveness of `RandomForest`, `BalancedRandomForest`, and your one additional imbalanced classifier for predicting the minority class. 

### Answer the following question: Does the model generated using one of the imbalanced methods more accurately flag all the loans that eventually defaulted?

**Question:** Does the model generated using one of the imbalanced methods more accurately flag all the loans that eventually defaulted?
    
**Answer:**