### Using SMOTE to handle imbalanced classes

In your terminal you'll need to install the following:

`pip install -U imbalanced-learn`


Here are some articles about these techniques: 

https://medium.com/coinmonks/smote-and-adasyn-handling-imbalanced-data-set-34f5223e167

https://www.kaggle.com/residentmario/oversampling-with-smote-and-adasyn

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE

import warnings
warnings.filterwarnings('ignore')


In [36]:
df = pd.read_csv('../Data/modified_titanic.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,Survived,Age,Fare,Embarked,IsReverend,FamilyCount,IsMale,Title_Capt,Title_Col,...,Title_Mme,Title_Mr,Title_Mrs,Title_Ms,Title_Rev,Title_Sir,Title_the Countess,Pclass_1,Pclass_2,Pclass_3
0,0,0,22.0,7.25,S,0,1,1,0,0,...,0,1,0,0,0,0,0,0,0,1
1,2,1,26.0,7.925,S,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,3,1,35.0,53.1,S,0,1,0,0,0,...,0,0,1,0,0,0,0,1,0,0
3,4,0,35.0,8.05,S,0,0,1,0,0,...,0,1,0,0,0,0,0,0,0,1
4,6,0,54.0,51.8625,S,0,0,1,0,0,...,0,1,0,0,0,0,0,1,0,0


In [7]:
df = df[['Survived', 'Age', 'Fare', 'IsMale', 'Embarked']]

In [8]:
df.head()

Unnamed: 0,Survived,Age,Fare,IsMale,Embarked
0,0,22.0,7.25,1,S
1,1,26.0,7.925,0,S
2,1,35.0,53.1,0,S
3,0,35.0,8.05,1,S
4,0,54.0,51.8625,1,S


In [37]:
X = df.drop(columns = 'Embarked')
y = df['Embarked']

In [10]:
y.value_counts()

S    644
C    168
Q     10
Name: Embarked, dtype: int64

#### Train Test Split & Scale our data (SMOTE picks neighbors by distance, so the features need comparable scales)

In [11]:
#Let's split into our training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 42, stratify =y)

In [12]:
y_train.value_counts()  #definitely imbalanced target variable...

S    579
C    151
Q      9
Name: Embarked, dtype: int64

In [154]:
## scale our data...

In [13]:
ss = StandardScaler()

Xs_train = ss.fit_transform(X_train)
Xs_test = ss.transform(X_test)
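
To see why scaling matters for a distance-based method like SMOTE, here is a minimal sketch using only the five rows shown in `df.head()` above (columns `Survived`, `Age`, `Fare`, `IsMale`). The helpers `standardize` and `nearest_neighbor` are hypothetical names written for this illustration; `standardize` mimics `StandardScaler` by dividing by the population standard deviation.

```python
import math

# The five rows from df.head(): (Survived, Age, Fare, IsMale)
rows = [
    (0, 22.0, 7.25, 1),
    (1, 26.0, 7.925, 0),
    (1, 35.0, 53.1, 0),
    (0, 35.0, 8.05, 1),
    (0, 54.0, 51.8625, 1),
]

def standardize(data):
    """Column-wise (x - mean) / std, using the population std like StandardScaler."""
    cols = list(zip(*data))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c))
            for c, m in zip(cols, means)]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds))
            for row in data]

def nearest_neighbor(data, i):
    """Index of the row closest to row i by Euclidean distance."""
    others = [j for j in range(len(data)) if j != i]
    return min(others, key=lambda j: math.dist(data[i], data[j]))

nn_raw = nearest_neighbor(rows, 0)                  # neighbor of row 0, raw units
nn_scaled = nearest_neighbor(standardize(rows), 0)  # neighbor of row 0, scaled
print(nn_raw, nn_scaled)
```

On the raw values the age and fare gaps swamp the 0/1 columns, so row 0's nearest neighbor is row 1 even though that passenger differs in both `Survived` and `IsMale`. After scaling, the nearest neighbor becomes row 3, which matches on both binary columns.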

#### Let's see how a Logistic Regression does with the imbalanced data:

In [14]:
lr = LogisticRegression()

In [15]:
lr.fit(Xs_train, y_train)

LogisticRegression()

In [16]:
lr.score(Xs_train, y_train), lr.score(Xs_test, y_test)

(0.7834912043301759, 0.7831325301204819)

In [17]:
#Not a bad accuracy score, but let's look at the classification report:

def metrics(y_test, y_predict):
    print('Accuracy score %s ' % accuracy_score(y_test, y_predict), '\n')
    print('----------------------------------------------------------------')
    print(pd.DataFrame(confusion_matrix(y_test, y_predict), 
                            index=['Actually_C', 'Actual_Q', 'Actual_S'], 
                            columns=['Predicted_C', 'Predicted_Q', 'Predicted_S']), '\n')
    print('-----------------------------------------------------------------')
    print(classification_report(y_test, y_predict))
    print('-----------------------------------------------------------------')

In [18]:
lr_preds = lr.predict(Xs_test)

In [19]:
metrics(y_test, lr_preds)

Accuracy score 0.7831325301204819  

----------------------------------------------------------------
            Predicted_C  Predicted_Q  Predicted_S
Actually_C            0            0           17
Actual_Q              0            0            1
Actual_S              0            0           65 

-----------------------------------------------------------------
              precision    recall  f1-score   support

           C       0.00      0.00      0.00        17
           Q       0.00      0.00      0.00         1
           S       0.78      1.00      0.88        65

    accuracy                           0.78        83
   macro avg       0.26      0.33      0.29        83
weighted avg       0.61      0.78      0.69        83

-----------------------------------------------------------------


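To correct for this imbalance we're going to use something called **SMOTE - Synthetic Minority Over-sampling Technique**. SMOTE creates new minority-class rows by interpolating between existing minority-class rows and their nearest neighbors.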

A little video on SMOTE and imbalanced classes:
    https://www.youtube.com/watch?v=U3X98xZ4_no
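
The core idea behind SMOTE can be sketched in a few lines of plain Python. This is a simplified illustration, not the imbalanced-learn implementation: the function name `smote_like_samples` and the toy points are made up for this example.

```python
import math
import random

def smote_like_samples(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbors of the base point (excluding itself)
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # how far to move along the segment base -> neighbor
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

# Toy, made-up minority-class points (two features each)
minority = [(0.0, 0.0), (1.0, 0.5), (0.5, 1.0), (1.5, 1.5)]
new_points = smote_like_samples(minority, n_new=6, k=2)
for point in new_points:
    print(point)
```

Every synthetic point sits on a line segment between two real minority points, which is why the features should be numeric and on comparable scales before resampling.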

#### Let's create some synthetic data with SMOTE

In [20]:
##Now we can create synthetic data for our training set

sm = SMOTE()

# fit_sample was removed from recent imbalanced-learn releases; fit_resample replaces it
Xsm_train, ysm_train = sm.fit_resample(Xs_train, y_train)

In [22]:
y_train.value_counts()

S    579
C    151
Q      9
Name: Embarked, dtype: int64

In [23]:
579*3

1737

In [163]:
Xsm_train.shape

(1737, 4)

In [164]:
type(ysm_train)

numpy.ndarray

In [24]:
print(pd.Series(ysm_train).value_counts())

Q    579
S    579
C    579
Name: Embarked, dtype: int64


#### Let's see how it does with the SMOTE data

In [25]:
#let's fit another logistic regression, this time on the SMOTE data, to see how this did...

lr2 = LogisticRegression()

In [26]:
lr2.fit(Xsm_train, ysm_train)

LogisticRegression()

In [27]:
lr2.score(Xsm_train, ysm_train), lr2.score(Xs_test, y_test)

(0.5831894070236039, 0.6265060240963856)

In [28]:
preds2 = lr2.predict(Xs_test)

In [29]:
metrics(y_test, preds2)

Accuracy score 0.6265060240963856  

----------------------------------------------------------------
            Predicted_C  Predicted_Q  Predicted_S
Actually_C           12            2            3
Actual_Q              0            1            0
Actual_S              9           17           39 

-----------------------------------------------------------------
              precision    recall  f1-score   support

           C       0.57      0.71      0.63        17
           Q       0.05      1.00      0.10         1
           S       0.93      0.60      0.73        65

    accuracy                           0.63        83
   macro avg       0.52      0.77      0.49        83
weighted avg       0.84      0.63      0.70        83

-----------------------------------------------------------------


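### Significant improvements in minority-class recall!

Accuracy fell from 0.78 to 0.63, but macro-average recall rose from 0.33 to 0.77. The model now finds 12 of the 17 `C` passengers and the single `Q` passenger, at the cost of very low precision on `Q` (0.05).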
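
To make the trade-off concrete, the sketch below recomputes accuracy, recall and precision directly from the two confusion matrices printed above. The helper names `per_class` and `accuracy` are hypothetical, written just for this comparison.

```python
labels = ['C', 'Q', 'S']

# Rows are actual classes, columns are predicted classes (copied from the outputs above)
baseline = [[0, 0, 17], [0, 0, 1], [0, 0, 65]]
smote = [[12, 2, 3], [0, 1, 0], [9, 17, 39]]

def per_class(cm):
    """Return (recall, precision) dicts keyed by class label."""
    recall, precision = {}, {}
    for i, label in enumerate(labels):
        actual = sum(cm[i])                     # row total
        predicted = sum(row[i] for row in cm)   # column total
        recall[label] = cm[i][i] / actual
        precision[label] = cm[i][i] / predicted if predicted else 0.0
    return recall, precision

def accuracy(cm):
    """Share of all test rows that sit on the diagonal."""
    return sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))

for name, cm in [('baseline', baseline), ('smote', smote)]:
    recall, precision = per_class(cm)
    print(name, 'accuracy:', round(accuracy(cm), 2))
    for label in labels:
        print('  ', label, 'recall:', round(recall[label], 2),
              'precision:', round(precision[label], 2))
```

The baseline gets its accuracy entirely from predicting `S` for everyone. With SMOTE, recall for `C` and `Q` goes up, but 19 of the 20 `Q` predictions are wrong, which is where the 0.05 precision comes from.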

### We can also tune over SMOTE and our model using a pipeline and GridSearchCV

SMOTE Hyperparameters you can change:

- ***sampling_strategy***: Default is 'auto', which for over-samplers like SMOTE is the same as 'not majority' (samples are added to every class except the majority class).
    - Float: Only for binary classification - specifies the desired ratio of minority to majority class samples after resampling
    - 'minority': Resample only the minority class (ONLY the single smallest class)
    - 'not minority': Resample all classes except the minority class
    - 'not majority': Resample all classes except the majority class
    - 'all': Resample all classes
    - Dict: the keys correspond to the targeted classes. The values correspond to the desired number of samples for each targeted class.


- ***k_neighbors***: similar to KNN - specifies the number of nearest neighbors used to construct synthetic samples
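
Here is a rough sketch of what the string and dict options for `sampling_strategy` mean for our training counts (S: 579, C: 151, Q: 9). The helper `target_counts` is hypothetical and only mirrors the documented behavior for over-sampling; it does not call imbalanced-learn.

```python
def target_counts(counts, strategy):
    """Class counts after over-sampling under a given sampling_strategy (sketch)."""
    majority = max(counts, key=counts.get)
    minority = min(counts, key=counts.get)
    n_max = counts[majority]

    if isinstance(strategy, dict):
        # Dict values are the desired final number of samples per targeted class
        return {**counts, **strategy}
    if strategy in ('auto', 'not majority'):
        targeted = [c for c in counts if c != majority]
    elif strategy == 'minority':
        targeted = [minority]
    elif strategy == 'not minority':
        targeted = [c for c in counts if c != minority]
    elif strategy == 'all':
        targeted = list(counts)
    else:
        raise ValueError(f'unknown strategy: {strategy!r}')
    # Targeted classes are raised to the size of the majority class
    return {c: (n_max if c in targeted else n) for c, n in counts.items()}

counts = {'S': 579, 'C': 151, 'Q': 9}
for strategy in ['auto', 'minority', 'not minority', {'Q': 100}]:
    print(strategy, '->', target_counts(counts, strategy))
```

Note that 'minority' leaves `C` at 151 and 'not minority' leaves `Q` at 9, so on a three-class problem only 'auto' (or 'not majority') balances every class.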


#### SMOTE Docs: https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTE.html

In [30]:
from imblearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

#### We need to use the imblearn Pipeline because scikit-learn's own Pipeline does not accept samplers such as SMOTE. The imblearn Pipeline resamples only while fitting, so the testing data (and each validation fold in the grid search) is never resampled.

In [31]:
pipe = Pipeline([
        ('scale', StandardScaler()),
        ('sampling', SMOTE()),
        ('lr', LogisticRegression())
    ])

In [32]:
pipe_params = {
    'sampling__sampling_strategy': ['auto', 'minority', 'not minority'],
    'sampling__k_neighbors': [2, 3, 5],
    'lr__C': [0.1, 0.3, 0.5, 1]
}
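
As a quick sanity check on how much work this grid implies, the sketch below enumerates every parameter combination with `itertools.product`. The fold count of 5 is an assumption based on the `GridSearchCV` default in recent scikit-learn versions, since no `cv` is passed in this notebook.

```python
from itertools import product

pipe_params = {
    'sampling__sampling_strategy': ['auto', 'minority', 'not minority'],
    'sampling__k_neighbors': [2, 3, 5],
    'lr__C': [0.1, 0.3, 0.5, 1],
}

# One dict per candidate, pairing each parameter name with one of its values
candidates = [dict(zip(pipe_params, combo))
              for combo in product(*pipe_params.values())]

n_folds = 5  # assumption: GridSearchCV's default cv
total_fits = len(candidates) * n_folds

print(len(candidates), 'candidates,', total_fits, 'cross-validation fits')
print(candidates[0])
```

Each of those fits runs the scaler, SMOTE and the logistic regression on the training folds only, and the search then refits the best candidate once on the full training set.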

In [33]:
grid = GridSearchCV(pipe, pipe_params, scoring = 'recall_macro')

#set scoring to recall_macro so the gridsearch doesn't optimize for accuracy

In [34]:
grid.fit(X_train, y_train)

GridSearchCV(estimator=Pipeline(steps=[('scale', StandardScaler()),
                                       ('sampling', SMOTE()),
                                       ('lr', LogisticRegression())]),
             param_grid={'lr__C': [0.1, 0.3, 0.5, 1],
                         'sampling__k_neighbors': [2, 3, 5],
                         'sampling__sampling_strategy': ['auto', 'minority',
                                                         'not minority']},
             scoring='recall_macro')

In [194]:
grid.score(X_test, y_test)

0.7583710407239819

In [195]:
grid.best_params_

{'lr__C': 0.5,
 'sampling__k_neighbors': 5,
 'sampling__sampling_strategy': 'auto'}

In [196]:
grid_preds = grid.predict(X_test)

In [197]:
metrics(y_test, grid_preds)

Accuracy score 0.6024096385542169  

----------------------------------------------------------------
            Predicted_C  Predicted_Q  Predicted_S
Actually_C           12            2            3
Actual_Q              0            1            0
Actual_S             10           18           37 

-----------------------------------------------------------------
              precision    recall  f1-score   support

           C       0.55      0.71      0.62        17
           Q       0.05      1.00      0.09         1
           S       0.93      0.57      0.70        65

    accuracy                           0.60        83
   macro avg       0.51      0.76      0.47        83
weighted avg       0.84      0.60      0.68        83

-----------------------------------------------------------------
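
Because the grid was built with `scoring = 'recall_macro'`, the value returned by `grid.score(X_test, y_test)` is the unweighted mean of the per-class recalls rather than accuracy. A small check, using the confusion matrix printed above (variable names are just for this illustration):

```python
# Rows are actual C, Q, S; columns are predicted C, Q, S (from the output above)
grid_cm = [[12, 2, 3], [0, 1, 0], [10, 18, 37]]

# Recall per class = correct predictions on the diagonal / row total
recalls = [row[i] / sum(row) for i, row in enumerate(grid_cm)]
macro_recall = sum(recalls) / len(recalls)

print(macro_recall)
```

This reproduces the 0.758 from `grid.score`, while the accuracy for the same predictions is only 0.60.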
