# <font color="blue">Lesson 7 Performance Metrics & Hyperparameters</font>

## SMOTE: Synthetic Minority Oversampling Technique
For this lab, we'll use the bank_data.csv file that we've worked with in previous labs. 

Notes on SMOTE:  
- It can be slow when the data is large  
- It works on both binary and multiclass classification data

In [None]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score
from imblearn.over_sampling import SMOTE

In [None]:
bank_df = pd.read_csv("https://library.startlearninglabs.uw.edu/DATASCI420/Datasets/Bank%20Data.csv")

## Exploratory Analysis
Use the pandas_profiling library to perform a high-level exploratory analysis of your dataset. Explain your observations for each variable in a summary below the profile. 

In [None]:
# Search for overall trends in the dataset
import pandas_profiling
pandas_profiling.ProfileReport(bank_df)

### Exploratory Analysis Summary
This should describe your observations/questions/issues found for each variable. Outline your plan and reasoning for dealing with each observation. 

## Imbalance
Let's check for imbalance in our dataset by looking at the value counts of a few categorical features. 

In [None]:
bank_df.married.value_counts()

In [None]:
bank_df.region.value_counts()

In [None]:
bank_df.sex.value_counts()

In [None]:
bank_df.mortgage.value_counts()
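
The same check is usually most informative on the target column, since that is the class SMOTE will rebalance. Below is a minimal sketch on a made-up toy frame; the `pep` column name and its values are assumptions standing in for the real data, and `toy_df` and `imbalance_ratio` are illustrative names only.

```python
import pandas as pd

# Toy stand-in for bank_df; the values are made up for illustration.
toy_df = pd.DataFrame({"pep": ["YES", "NO", "NO", "NO", "YES", "NO", "NO", "NO"]})

# Absolute counts per class
counts = toy_df["pep"].value_counts()

# Share of each class; values far from an even split signal imbalance
shares = toy_df["pep"].value_counts(normalize=True)

# Ratio of the largest class to the smallest class (1.0 means perfectly balanced)
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(shares)
print(f"imbalance ratio: {imbalance_ratio:.2f}")
```

Running the same three lines on the real target column gives a quick number to quote in the summary below.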

### Summarize Observations on Imbalance
Summarize what you found. 

## One Hot Encode Categorical Data
We need to encode the categorical data before we can train and test models. 

In [None]:
# list all column names to help identify the categorical variables
bank_df.columns

In [None]:
# store object column names in a list
obj_cols = bank_df.select_dtypes(include=["object"]).columns
obj_cols

In [None]:
# encode your dataframe
bank_df = pd.get_dummies(bank_df, columns=obj_cols, drop_first=True)
bank_df.head()
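
To see what the encoding step does, here is a small sketch on a toy frame. The column names and values are assumptions chosen to resemble the bank data, not taken from it. With `drop_first=True`, the first category of each column (in sorted order) is dropped, so a two-level column such as `married` becomes a single `married_YES` indicator.

```python
import pandas as pd

# Toy frame with one numeric and two categorical columns (made-up values)
toy_df = pd.DataFrame({
    "age": [25, 40, 33],
    "region": ["TOWN", "RURAL", "INNER_CITY"],
    "married": ["YES", "NO", "YES"],
})

# Columns to encode, listed explicitly for this small example
cat_cols = ["region", "married"]

# One indicator column per category, minus the first category of each column
encoded = pd.get_dummies(toy_df, columns=cat_cols, drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```

This is also why the target in the next section is named `pep_YES`: it is the indicator that remains after the first `pep` category is dropped.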

## Split data into training and test sets
We'll keep it simple and use PEP as a target again. 

In [None]:
# Separate the features from the targets
targets = bank_df['pep_YES']
features = bank_df.loc[:, bank_df.columns != "pep_YES"]

# x is for features, y is for targets
x_train, x_val, y_train, y_val = train_test_split(features, targets,
                                                test_size = .2,
                                                random_state=42)

# x_val is the x_test data 
# y_val is the y_test data

### Oversample on Training Data
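Oversample only the training set. If you also oversample the test set, or oversample before splitting, synthetic observations end up in the evaluation data and the scores will look better than the model really is. 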

SMOTE creates synthetic observations of the minority class (bad candidates for PEP) by:

1. Finding the k-nearest-neighbors for minority class observations (finding similar observations)  
2. Randomly choosing one of the k-nearest-neighbors and using it to create a similar, but randomly tweaked, new observation.
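
The two steps above can be sketched in plain NumPy. This is a simplified illustration, not the imbalanced-learn implementation: the function name `smote_sketch`, its parameters, and the toy points are all hypothetical. Each synthetic row is placed at a random position on the line segment between a minority observation and one of its nearest minority neighbors.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, seed=0):
    """Generate n_new synthetic rows from the minority-class rows X_min.

    Assumes k is smaller than the number of rows in X_min.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances between minority observations
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # Exclude each point from its own neighbor list
    np.fill_diagonal(dists, np.inf)

    # Step 1: indices of the k nearest minority neighbors for every row
    neighbors = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))        # pick a minority observation
        nb = neighbors[base, rng.integers(k)]  # Step 2: pick one of its neighbors
        gap = rng.random()                     # random position between the two
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic

# Four made-up minority observations with two features each
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
new_rows = smote_sketch(X_min, n_new=5)
print(new_rows)
```

Because every new row is an interpolation between two existing minority rows, the synthetic points stay inside the region the minority class already occupies.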

In [None]:
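# replace training_features and training_target with the correct names
# sampling_strategy=1.0 resamples the minority class up to the size of the majority class (binary targets only)
sm = SMOTE(random_state=12, sampling_strategy=1.0)
x_train_res, y_train_res = sm.fit_resample(training_features, training_target)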

## Train a Random Forest Model
Now that we've oversampled our dataset, let's see if it improves our random forest model. 

In [None]:
# Create and fit the model
clf_rf = RandomForestClassifier(n_estimators=25, random_state=12)
clf_rf.fit(x_train_res, y_train_res)

In [None]:
# Display scores: accuracy (clf_rf.score) and recall
print('Validation Results')
print('Accuracy:', clf_rf.score(x_val, y_val))
print('Recall:', recall_score(y_val, clf_rf.predict(x_val)))

# Replace test_features and test_target with the correct names
print('\nTest Results')
print('Accuracy:', clf_rf.score(test_features, test_target))
print('Recall:', recall_score(test_target, clf_rf.predict(test_features)))

The validation results should closely match the unseen test data results.
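
The two numbers printed for each dataset are accuracy (what a scikit-learn classifier's `score` method returns) and recall for the positive class. As a rough illustration of how they differ, the sketch below computes both by hand on made-up labels; `y_true` and `y_pred` here are hypothetical arrays, not output from the model above.

```python
import numpy as np

# Made-up true labels and predictions (1 = positive class)
y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 0, 0, 1, 1, 0])

# Accuracy: share of all predictions that are correct
accuracy = (y_true == y_pred).mean()

# Recall: share of actual positives that the model found
true_pos = ((y_true == 1) & (y_pred == 1)).sum()
false_neg = ((y_true == 1) & (y_pred == 0)).sum()
recall = true_pos / (true_pos + false_neg)

print(f"accuracy: {accuracy:.3f}")
print(f"recall:   {recall:.3f}")
```

On imbalanced data, accuracy can stay high while recall on the minority class is poor, which is why the lab reports both.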