# Effectiveness of Random Oversampling

In this activity, you’ll fit logistic regression models to both imbalanced data and resampled data. You’ll then compare the results by using the metrics that you’ve learned.

## Instructions

1. Read in the CSV file from the `Resources` folder into a Pandas DataFrame.  

2. Create a Series named `y` that contains the data from the "Default" column of the original DataFrame. Note that this Series will contain the labels. Create a new DataFrame named `X` that contains the remaining columns from the original DataFrame. Note that this DataFrame will contain the features.

3. Split the features and labels into training and testing sets.

4. Check the magnitude of the imbalance in the data set by counting how many times each label value occurs (`value_counts`).

5. Resample the training data by using `RandomOverSampler`.

6. Check the number of distinct values (`value_counts`) for the resampled labels.

7. Fit two logistic regression models: one for the resampled data and another for the original data.

8. Using the two logistic regression models, predict the values for the original and resampled sets.

9. Print the confusion matrices, balanced accuracy scores, and classification reports for the original and resampled datasets.

10. Evaluate the effectiveness of random oversampling for predicting the minority class. Answer the following question: Does the model accurately flag all the loans that eventually defaulted?


## References

The following links point to the scikit-learn functions and classes used in this activity:

[LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

[train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

[confusion_matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)

[balanced_accuracy_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.balanced_accuracy_score.html)

The following links point to the imbalanced-learn functions and classes used in this activity:

[RandomOverSampler](https://imbalanced-learn.org/stable/generated/imblearn.over_sampling.RandomOverSampler.html)

[classification_report_imbalanced](https://imbalanced-learn.org/stable/generated/imblearn.metrics.classification_report_imbalanced.html)

In [27]:
# Import the required modules
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path

from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import balanced_accuracy_score
from imblearn.metrics import classification_report_imbalanced

## Step 1: Read in the CSV file from the `Resources` folder into a Pandas DataFrame. 

In [3]:
# Read the sba_loans.csv file from the Resources folder into a Pandas DataFrame
loans_df = pd.read_csv(Path("../Resources/sba_loans.csv"))

# Review the DataFrame
loans_df.head()


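| Unnamed: 0 | Year | Month | Amount | Term | Zip | CreateJob | NoEmp | RealEstate | RevLineCr | UrbanRural | Default |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2001 | 11 | 32812 | 36 | 92801 | 0 | 1 | 0 | 1 | 0 | 0 |
| 1 | 2001 | 4 | 30000 | 56 | 90505 | 0 | 1 | 0 | 1 | 0 | 0 |
| 2 | 2001 | 4 | 30000 | 36 | 92103 | 0 | 10 | 0 | 1 | 0 | 0 |
| 3 | 2003 | 10 | 50000 | 36 | 92108 | 0 | 6 | 0 | 1 | 0 | 0 |
| 4 | 2006 | 7 | 343000 | 240 | 91345 | 3 | 65 | 1 | 0 | 2 | 0 |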


## Step 2: Create a Series named `y` that contains the data from the "Default" column of the original DataFrame. Note that this Series will contain the labels. Create a new DataFrame named `X` that contains the remaining columns from the original DataFrame. Note that this DataFrame will contain the features.

In [7]:
# Split the data into X (features) and y (labels)

# The y variable should focus on the Default column
y = loans_df["Default"]

# The X variable should include all features except the Default column
X = loans_df.drop(columns=["Default"])


## Step 3: Split the features and labels into training and testing sets.

In [11]:
# Split the features and labels into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y)

## Step 4: Check the magnitude of the imbalance in the data set by counting how many times each label value occurs (`value_counts`).

In [12]:
# Count how many times each value occurs in the original training labels
y_train.value_counts()


0    1051
1     108
Name: Default, dtype: int64
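
To put a number on the imbalance, the counts above can be converted into proportions and a majority-to-minority ratio. The following is a minimal sketch that hard-codes the two counts from the output above, so it runs without the data file. In the notebook itself, `y_train.value_counts(normalize=True)` should give the same proportions directly.

```python
# Class counts copied from the y_train.value_counts() output above
counts = {0: 1051, 1: 108}

total = sum(counts.values())

# Share of the training labels that belongs to each class
proportions = {label: n / total for label, n in counts.items()}

# How many non-default loans there are for every defaulted loan
imbalance_ratio = counts[0] / counts[1]

for label, share in proportions.items():
    print(f"class {label}: {share:.1%}")
print(f"majority:minority ratio = {imbalance_ratio:.1f}:1")
```

Roughly nine out of ten training loans did not default, which is why a model can score well on plain accuracy while missing most defaults.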

## Step 5: Resample the training data by using `RandomOverSampler`.

In [14]:
# Resample the data using RandomOverSampler

# Instantiate the RandomOverSampler
# Set the random_state parameter to 1 so that the resampling is reproducible
random_oversampler = RandomOverSampler(random_state=1)

# Fit the sampler on the original training data and resample it
X_resampled, y_resampled = random_oversampler.fit_resample(X_train, y_train)

## Step 6: Check the number of distinct values (`value_counts`) for the resampled labels.

In [15]:
# Count the distinct values in the resampled labels data
y_resampled.value_counts()


1    1051
0    1051
Name: Default, dtype: int64
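
The balanced counts above come from duplicating minority-class rows: random oversampling draws existing minority rows with replacement until every class is as large as the largest one. The sketch below illustrates that idea with the standard library only. The helper `random_oversample` and the toy data are hypothetical and are not how imbalanced-learn implements `RandomOverSampler` internally.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=1):
    """Duplicate randomly chosen rows of the smaller classes until every
    class is as large as the largest one. Illustrative helper only."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    out_rows, out_labels = list(rows), list(labels)
    for label, n in counts.items():
        # Positions of the rows that belong to this class
        idx = [i for i, lab in enumerate(labels) if lab == label]
        # Draw the missing rows with replacement (nothing is drawn
        # for the majority class, because target - n is 0)
        for i in rng.choices(idx, k=target - n):
            out_rows.append(rows[i])
            out_labels.append(label)
    return out_rows, out_labels


# Toy data: 6 majority rows (label 0) and 2 minority rows (label 1)
rows = [[10], [11], [12], [13], [14], [15], [90], [91]]
labels = [0, 0, 0, 0, 0, 0, 1, 1]

new_rows, new_labels = random_oversample(rows, labels)
print(Counter(new_labels))
```

No new information is created: every added row is an exact copy of an existing minority row. As in Step 5, only the training data is resampled, so the testing set keeps its original class distribution.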

## Step 7: Fit two logistic regression models: one for the resampled data and another for the original data.

In [16]:
# Declare a logistic regression model
# Set the random_state parameter to 1
model = LogisticRegression(random_state=1)

In [17]:
# Fit a logistic regression for the original data.
lr_orginal_model = model.fit(X_train, y_train)


In [18]:
# Declare a logistic regression model
# Set the random_state parameter to 1
model = LogisticRegression(random_state=1)

In [19]:
# Fit a logistic regression for the resampled data
lr_resampled_model = model.fit(X_resampled, y_resampled)

## Step 8: Using the two logistic regression models, predict the values for the original and resampled sets.

In [20]:
# Predict labels for testing features using the original logistic regression model
y_original_pred = lr_orginal_model.predict(X_test)

In [24]:
# Predict the labels for the testing features using the resampled logistic regression model
y_resampled_pred = lr_resampled_model.predict(X_test)

## Step 9: Print the confusion matrices, balanced accuracy scores, and classification reports for the original and resampled datasets.

In [25]:
# Print the confusion matrix for the original data

print(confusion_matrix(y_test, y_original_pred))

[[353   7]
 [ 23   4]]


In [26]:
# Print the confusion matrix for the resampled data

print(confusion_matrix(y_test, y_resampled_pred))

[[293  67]
 [  1  26]]


In [29]:
# Display the balanced accuracy score for the original data
balanced_accuracy_score(y_test, y_original_pred)

0.5643518518518518

In [30]:
# Display the balanced accuracy score for the resampled data
balanced_accuracy_score(y_test, y_resampled_pred)


0.888425925925926
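
Balanced accuracy is the mean of the per-class recall values, so both scores above can be checked by hand against the confusion matrices printed earlier. The sketch below does that arithmetic with a hypothetical helper, `balanced_accuracy`, and assumes scikit-learn's binary layout of `[[TN, FP], [FN, TP]]` (rows are true labels, columns are predicted labels).

```python
def balanced_accuracy(cm):
    """Mean of the per-class recall for a 2x2 confusion matrix
    laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    recall_0 = tn / (tn + fp)  # share of non-defaults predicted correctly
    recall_1 = tp / (tp + fn)  # share of defaults predicted correctly
    return (recall_0 + recall_1) / 2


# Confusion matrices copied from the outputs above
original_cm = [[353, 7], [23, 4]]
resampled_cm = [[293, 67], [1, 26]]

print(round(balanced_accuracy(original_cm), 4))
print(round(balanced_accuracy(resampled_cm), 4))
```

Because each class contributes equally to the average, the 27 defaulted loans weigh as much as the 360 non-defaulted ones, which makes this metric more informative than plain accuracy on imbalanced data.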

In [31]:
# Print the classification report for the original data
print(classification_report_imbalanced(y_test, y_original_pred))


                   pre       rec       spe        f1       geo       iba       sup

          0       0.94      0.98      0.15      0.96      0.38      0.16       360
          1       0.36      0.15      0.98      0.21      0.38      0.13        27

avg / total       0.90      0.92      0.21      0.91      0.38      0.16       387



In [32]:
# Print the classification report for the resampled data
print(classification_report_imbalanced(y_test, y_resampled_pred))


                   pre       rec       spe        f1       geo       iba       sup

          0       1.00      0.81      0.96      0.90      0.89      0.77       360
          1       0.28      0.96      0.81      0.43      0.89      0.80        27

avg / total       0.95      0.82      0.95      0.86      0.89      0.77       387
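
The class-1 rows of the two reports above can be reproduced from the confusion matrices, which helps when interpreting the `pre`, `rec`, `spe`, `f1`, and `geo` columns. The following is a small sketch with a hypothetical helper, `minority_class_metrics`, again assuming the `[[TN, FP], [FN, TP]]` layout. The `iba` column is omitted here.

```python
import math


def minority_class_metrics(cm):
    """Metrics for class 1 from a 2x2 confusion matrix laid out as
    [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)    # flagged loans that really defaulted
    recall = tp / (tp + fn)       # defaulted loans that were flagged
    specificity = tn / (tn + fp)  # non-defaults that were left unflagged
    f1 = 2 * precision * recall / (precision + recall)
    geo = math.sqrt(recall * specificity)  # geometric mean of rec and spe
    return {"pre": precision, "rec": recall, "spe": specificity,
            "f1": f1, "geo": geo}


# Confusion matrices copied from the Step 9 outputs
reports = {
    "original": minority_class_metrics([[353, 7], [23, 4]]),
    "resampled": minority_class_metrics([[293, 67], [1, 26]]),
}
for name, metrics in reports.items():
    print(name, {key: round(value, 2) for key, value in metrics.items()})
```

Rounded to two decimals, these values match the class-1 rows of the reports: recall rises sharply after resampling, while precision falls.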



## Step 10: Evaluate the effectiveness of random oversampling for predicting the minority class. Answer the following question.

**Question:** Does the model generated using the resampled data more accurately flag all the loans that eventually defaulted?
    
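**Answer:** Yes, although it still does not flag every default. The model trained on the resampled data correctly flags 26 of the 27 defaulted loans in the testing set (recall of 0.96), compared with 4 of 27 (recall of 0.15) for the model trained on the original data, and the balanced accuracy rises from about 0.56 to about 0.89. The trade-off is more false alarms: 67 non-defaulted loans are flagged instead of 7, so precision for the default class falls from 0.36 to 0.28.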