# Effectiveness of Random Oversampling

In this activity, you’ll fit logistic regression models to both imbalanced data and resampled data. You’ll then compare the results by using the metrics that you’ve learned.

## Instructions

1. Read in the CSV file from the `Resources` folder into a Pandas DataFrame.  

2. Create a Series named `y` that contains the data from the "Default" column of the original DataFrame. Note that this Series will contain the labels. Create a new DataFrame named `X` that contains the remaining columns from the original DataFrame. Note that this DataFrame will contain the features.

3. Split the features and labels into training and testing sets.

4. Check the magnitude of the imbalance in the dataset by viewing the count of each distinct value (`value_counts`) for the labels.

5. Resample the training data by using `RandomOverSampler`.

6. Check the number of distinct values (`value_counts`) for the resampled labels.

7. Fit two logistic regression models: one for the resampled data and another for the original data.

8. Using the two logistic regression models (original and resampled), predict the labels for the testing features.

9. Print the confusion matrices, balanced accuracy scores, and classification reports for the models fit on the original and resampled data.

10. Evaluate the effectiveness of random oversampling for predicting the minority class. Answer the following question: Does the model accurately flag all the loans that eventually defaulted?


## References

The following links point to the scikit-learn classes and functions used in this activity:

[LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

[train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

[confusion_matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)

[balanced_accuracy_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.balanced_accuracy_score.html)

The following links point to the imbalanced-learn classes and functions used in this activity:

[RandomOverSampler](https://imbalanced-learn.org/stable/generated/imblearn.over_sampling.RandomOverSampler.html)

[classification_report_imbalanced](https://imbalanced-learn.org/stable/generated/imblearn.metrics.classification_report_imbalanced.html)

In [1]:
# Import the required modules
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path

from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import balanced_accuracy_score
from imblearn.metrics import classification_report_imbalanced

## Step 1: Read in the CSV file from the `Resources` folder into a Pandas DataFrame. 

In [2]:
# Read the sba_loans.csv file from the Resources folder into a Pandas DataFrame
loans_df = pd.read_csv(
    Path('../Resources/sba_loans.csv')
)

# Review the DataFrame
loans_df.head()

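| Unnamed: 0 | Year | Month | Amount | Term | Zip | CreateJob | NoEmp | RealEstate | RevLineCr | UrbanRural | Default |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2001 | 11 | 32812 | 36 | 92801 | 0 | 1 | 0 | 1 | 0 | 0 |
| 1 | 2001 | 4 | 30000 | 56 | 90505 | 0 | 1 | 0 | 1 | 0 | 0 |
| 2 | 2001 | 4 | 30000 | 36 | 92103 | 0 | 10 | 0 | 1 | 0 | 0 |
| 3 | 2003 | 10 | 50000 | 36 | 92108 | 0 | 6 | 0 | 1 | 0 | 0 |
| 4 | 2006 | 7 | 343000 | 240 | 91345 | 3 | 65 | 1 | 0 | 2 | 0 |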


## Step 2: Create a Series named `y` that contains the data from the "Default" column of the original DataFrame. Note that this Series will contain the labels. Create a new DataFrame named `X` that contains the remaining columns from the original DataFrame. Note that this DataFrame will contain the features.

In [3]:
# Split the data into X (features) and y (labels)

# The y variable should focus on the Default column
y = loans_df['Default']

# The X variable should include all features except the Default column
X = loans_df.drop(columns=['Default'])


## Step 3: Split the features and labels into training and testing sets.

In [4]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

## Step 4: Check the magnitude of the imbalance in the dataset by viewing the count of each distinct value (`value_counts`) for the labels.

In [5]:
# Count each distinct value in the original training labels
y_train.value_counts()

0    1063
1      96
Name: Default, dtype: int64
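
The counts above can be turned into a minority share and an imbalance ratio to make the size of the imbalance explicit. The following is a minimal sketch that hard-codes the two counts from the output above rather than reading the DataFrame; the variable names are illustrative. In the notebook itself, `y_train.value_counts(normalize=True)` should give the same proportions directly.

```python
# Class counts copied from the y_train.value_counts() output above
counts = {0: 1063, 1: 96}

total = sum(counts.values())
minority_share = counts[1] / total        # fraction of training loans that defaulted
imbalance_ratio = counts[0] / counts[1]   # non-defaulting loans per defaulting loan

print(f"Minority share: {minority_share:.1%}")
print(f"Imbalance ratio: {imbalance_ratio:.1f} to 1")
```

With these counts, defaults make up roughly 8% of the training labels, or about one default for every eleven loans that did not default.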

## Step 5: Resample the training data by using `RandomOverSampler`.

In [6]:
# Resample the data using RandomOverSampler

# Instantiate the RandomOverSampler
# Set the random_state parameter to 1
random_oversampler = RandomOverSampler(random_state=1)

# Fit the sampler to the original training data and resample it
X_resampled, y_resampled = random_oversampler.fit_resample(X_train, y_train)

## Step 6: Check the number of distinct values (`value_counts`) for the resampled labels.

In [7]:
# Count the distinct values in the resampled labels data
y_resampled.value_counts()

1    1063
0    1063
Name: Default, dtype: int64
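
To illustrate why both classes now have the same count, the idea behind random oversampling can be sketched with the standard library alone: rows from the smaller class are drawn at random, with replacement, until every class matches the largest one. This is a simplified illustration on toy data, not the imbalanced-learn implementation; the function name `random_oversample` and the toy features are assumptions made for the example.

```python
import random
from collections import Counter


def random_oversample(features, labels, seed=1):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    new_features = list(features)
    new_labels = list(labels)
    for label, count in counts.items():
        # Rows that belong to this class in the original data
        pool = [row for row, lab in zip(features, labels) if lab == label]
        # Draw the shortfall with replacement, so the same row can repeat
        extra = rng.choices(pool, k=target - count)
        new_features.extend(extra)
        new_labels.extend([label] * len(extra))
    return new_features, new_labels


# Toy data (hypothetical): 10 repaid loans (0) and 2 defaults (1)
features = [[amount] for amount in range(12)]
labels = [0] * 10 + [1] * 2

resampled_features, resampled_labels = random_oversample(features, labels)
print(Counter(resampled_labels))
```

No new information is created: every added minority row is an exact copy of an existing one, which is why only the training data is resampled and the testing data is left untouched.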

## Step 7: Fit two logistic regression models: one for the resampled data and another for the original data.

In [8]:
# Declare a logistic regression model
# Set the random_state parameter to 1
model = LogisticRegression(random_state=1)

In [9]:
# Fit a logistic regression for the original data.
lr_orginal_model = model.fit(X_train, y_train)
lr_orginal_model

LogisticRegression(random_state=1)

In [10]:
# Declare a logistic regression model
# Set the random_state parameter to 1
model = LogisticRegression(random_state=1)

In [11]:
# Fit a logistic regression for the resampled data
lr_resampled_model = model.fit(X_resampled, y_resampled)
lr_resampled_model

LogisticRegression(random_state=1)

## Step 8: Using the two logistic regression models (original and resampled), predict the labels for the testing features.

In [12]:
# Predict labels for testing features using the original logistic regression model
y_original_pred = lr_orginal_model.predict(X_test)

In [13]:
# Predict the labels for the testing features using the resampled logistic regression model
y_resampled_pred = lr_resampled_model.predict(X_test)

## Step 9: Print the confusion matrices, balanced accuracy scores, and classification reports for the models fit on the original and resampled data.

In [14]:
# Print the confusion matrix for the original data
confusion_matrix(y_test, y_original_pred)


array([[343,   5],
       [ 33,   6]])

In [15]:
# Print the confusion matrix for the resampled data
confusion_matrix(y_test, y_resampled_pred)


array([[280,  68],
       [  7,  32]])
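
For a binary problem with labels 0 and 1, scikit-learn lays out the confusion matrix with actual classes as rows and predicted classes as columns, so the four cells read `[[TN, FP], [FN, TP]]`. The sketch below copies the two matrices above into plain lists and names each cell; the helper `describe` and its dictionary keys are illustrative only.

```python
def describe(matrix):
    """Name the cells of a 2x2 confusion matrix (rows = actual, columns = predicted)."""
    (tn, fp), (fn, tp) = matrix
    return {"defaults_caught": tp, "defaults_missed": fn, "false_alarms": fp}


# Matrices copied from the two outputs above
original = [[343, 5], [33, 6]]
resampled = [[280, 68], [7, 32]]

print("original: ", describe(original))
print("resampled:", describe(resampled))
```

Read this way, the model fit on the original data misses 33 of the 39 defaults in the testing set, while the model fit on the resampled data misses 7 but raises 68 false alarms instead of 5.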

In [16]:
# Print the balanced accuracy score for the model fit on the original data
baso = balanced_accuracy_score(y_test, y_original_pred)

print(baso)

0.5697391688770999


In [17]:
# Print the balanced accuracy score for the model fit on the resampled data
basr = balanced_accuracy_score(y_test, y_resampled_pred)
print(basr)

0.8125552608311228
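
Balanced accuracy is the average of the recall obtained on each class, which is why it can differ sharply from plain accuracy on imbalanced data. As a check on the two scores above, the following sketch recomputes both metrics from the confusion matrices in this step; the matrices are copied by hand and the function names are illustrative.

```python
def plain_accuracy(matrix):
    """Share of all predictions that are correct."""
    correct = sum(row[i] for i, row in enumerate(matrix))
    total = sum(sum(row) for row in matrix)
    return correct / total


def balanced_accuracy(matrix):
    """Average of per-class recall (rows = actual, columns = predicted)."""
    recalls = [row[i] / sum(row) for i, row in enumerate(matrix)]
    return sum(recalls) / len(recalls)


original = [[343, 5], [33, 6]]
resampled = [[280, 68], [7, 32]]

for name, matrix in [("original", original), ("resampled", resampled)]:
    print(name, round(plain_accuracy(matrix), 4), round(balanced_accuracy(matrix), 4))
```

The original model scores about 0.90 on plain accuracy but only about 0.57 on balanced accuracy, because its high recall on the majority class hides its low recall on defaults.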


In [18]:
# Print the classification report for the original data
print(classification_report_imbalanced(y_test, y_original_pred))

                   pre       rec       spe        f1       geo       iba       sup

          0       0.91      0.99      0.15      0.95      0.39      0.16       348
          1       0.55      0.15      0.99      0.24      0.39      0.14        39

avg / total       0.88      0.90      0.24      0.88      0.39      0.16       387



In [19]:
# Print the classification report for the resampled data
print(classification_report_imbalanced(y_test, y_resampled_pred))

                   pre       rec       spe        f1       geo       iba       sup

          0       0.98      0.80      0.82      0.88      0.81      0.66       348
          1       0.32      0.82      0.80      0.46      0.81      0.66        39

avg / total       0.91      0.81      0.82      0.84      0.81      0.66       387
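
The `pre`, `rec`, `spe`, `f1`, and `geo` columns for class 1 can be reproduced from the confusion matrices, which helps when interpreting the reports. This is a hedged sketch that uses the standard definitions (with `geo` taken as the geometric mean of recall and specificity); it leaves out the `iba` column, and the helper name `minority_metrics` is an assumption made for the example.

```python
import math


def minority_metrics(matrix):
    """Metrics for class 1 from a 2x2 confusion matrix (rows = actual, columns = predicted)."""
    (tn, fp), (fn, tp) = matrix
    precision = tp / (tp + fp)      # flagged loans that really defaulted
    recall = tp / (tp + fn)         # real defaults that were flagged
    specificity = tn / (tn + fp)    # non-defaults correctly left unflagged
    f1 = 2 * precision * recall / (precision + recall)
    geo = math.sqrt(recall * specificity)
    return {"pre": precision, "rec": recall, "spe": specificity, "f1": f1, "geo": geo}


original = [[343, 5], [33, 6]]
resampled = [[280, 68], [7, 32]]

for name, matrix in [("original", original), ("resampled", resampled)]:
    rounded = {key: round(value, 2) for key, value in minority_metrics(matrix).items()}
    print(name, rounded)
```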



## Step 10: Evaluate the effectiveness of random oversampling for predicting the minority class. Answer the following question.

**Question:** Does the model generated using the resampled data more accurately flag all the loans that eventually defaulted?
    
**Answer:** The results for the minority class are mixed when comparing the classification reports for the model fit on the original data and the model fit on the resampled data.

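First, the balanced accuracy score is much higher for the model fit on the resampled data (0.81 vs. 0.57). Because balanced accuracy averages the recall of the two classes, this means that the resampled model identifies defaulting and non-defaulting loans at a similar rate, whereas the original model identifies almost all of the non-defaulting loans but few of the defaulting ones.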

The precision for the minority class is higher for the original model (0.55) than for the resampled model (0.32), meaning that a loan flagged as a default by the original model is more likely to be an actual default. The resampled model raises many more false alarms (68 false positives vs. 5).

In terms of recall, however, the resampled model does much better on the minority class (0.82 vs. 0.15), meaning that it correctly classifies a much higher percentage of the borrowers who actually defaulted (32 of 39 vs. 6 of 39).

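All in all, the model fit on the resampled data was much better at detecting borrowers who are likely to default than the model fit on the original, imbalanced dataset. It still does not flag every loan that eventually defaulted (it misses 7 of the 39 in the testing set), and its gain in recall comes at the cost of lower precision.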