### Using SMOTE to handle imbalanced classes

In your terminal you'll need to install the following:

`pip install -U imbalanced-learn`


Here are some articles about these techniques: 

https://medium.com/coinmonks/smote-and-adasyn-handling-imbalanced-data-set-34f5223e167

https://www.kaggle.com/residentmario/oversampling-with-smote-and-adasyn

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE

import warnings
warnings.filterwarnings('ignore')


In [15]:
df = pd.read_csv('../Data/modified_titanic.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,Survived,Age,Fare,Embarked,IsReverend,FamilyCount,IsMale,Title_Capt,Title_Col,...,Title_Mme,Title_Mr,Title_Mrs,Title_Ms,Title_Rev,Title_Sir,Title_the Countess,Pclass_1,Pclass_2,Pclass_3
0,0,0,22.0,7.25,S,0,1,1,0,0,...,0,1,0,0,0,0,0,0,0,1
1,2,1,26.0,7.925,S,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,3,1,35.0,53.1,S,0,1,0,0,0,...,0,0,1,0,0,0,0,1,0,0
3,4,0,35.0,8.05,S,0,0,1,0,0,...,0,1,0,0,0,0,0,0,0,1
4,6,0,54.0,51.8625,S,0,0,1,0,0,...,0,1,0,0,0,0,0,1,0,0


In [16]:
df = df[['Survived', 'Age', 'Fare', 'IsMale', 'Embarked']]

In [17]:
X = df.drop('Embarked', axis = 1)
y = df['Embarked']

In [20]:
y.value_counts(normalize=True)*100

S    78.345499
C    20.437956
Q     1.216545
Name: Embarked, dtype: float64

In [None]:
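# When should I use SMOTE?
# When you suspect the classes are imbalanced and the fitted model's
# accuracy score looks VERY GOOD, you should suspect something is wrong:
# accuracy alone can hide a model that only predicts the majority class.
# Look at sensitivity (recall) and specificity for each class.
# Use them as a sanity check,
# and make sure the class you actually care about is predicted well.

# Sometimes the minority class is labelled 1,
# sometimes the minority class is labelled 0.
#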
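
One way to act on these notes is to compare plain accuracy with a majority-class baseline and with per-class recall. The sketch below is illustrative only: the labels are a made-up toy sample (hypothetical, not the Titanic data), and `DummyClassifier` stands in for a model that always predicts the most common class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Toy labels (hypothetical): an 80 / 15 / 5 split across three classes
y_toy = np.array(['S'] * 80 + ['C'] * 15 + ['Q'] * 5)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for this baseline

# A "model" that always predicts the most frequent class
baseline = DummyClassifier(strategy='most_frequent').fit(X_toy, y_toy)
baseline_preds = baseline.predict(X_toy)

acc = accuracy_score(y_toy, baseline_preds)               # looks decent
bal_acc = balanced_accuracy_score(y_toy, baseline_preds)  # mean of per-class recall
per_class_recall = recall_score(y_toy, baseline_preds,
                                average=None, labels=['C', 'Q', 'S'])

print('accuracy:', acc)
print('balanced accuracy:', bal_acc)
print('recall for C, Q, S:', per_class_recall)
```

If a fitted model's accuracy is no better than this baseline, and its recall on the minority classes is near zero, the accuracy score is probably just reflecting the class imbalance.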

#### Train Test Split & Scale our data (we'll need it scaled for SMOTE)

In [23]:
#Let's split into our training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 42, stratify =y)

In [24]:
y_train.value_counts()  #definitely imbalanced target variable...

S    579
C    151
Q      9
Name: Embarked, dtype: int64

In [30]:
## scale our data...

ss = StandardScaler()


Xs_train = ss.fit_transform(X_train)
Xs_test = ss.transform(X_test)

In [33]:
Xs_train.shape

(739, 4)

In [34]:
Xs_test.shape

(83, 4)

#### Let's see how a Logistic Regression does with the imbalanced data:

In [31]:
lr = LogisticRegression()

In [36]:
lr.fit(Xs_train,y_train)

LogisticRegression()

In [37]:
lr.score(Xs_train, y_train), lr.score(Xs_test, y_test)

(0.7834912043301759, 0.7831325301204819)

In [None]:
## train and test scores are almost identical, and both match the share of class S (~78%) -- too good, be suspicious

In [38]:
#Not a bad accuracy score, but let's look at the classification report:

def metrics(y_test, y_predict):
    print('Accuracy score %s ' % accuracy_score(y_test, y_predict), '\n')
    print('----------------------------------------------------------------')
    print(pd.DataFrame(confusion_matrix(y_test, y_predict), 
                            index=['Actually_C', 'Actual_Q', 'Actual_S'], 
                            columns=['Predicted_C', 'Predicted_Q', 'Predicted_S']), '\n')
    print('-----------------------------------------------------------------')
    print(classification_report(y_test, y_predict))
    print('-----------------------------------------------------------------')
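
The helper above hard-codes the three class names. As a hedged alternative, the sketch below derives the row and column labels from the data, so the same idea works for other targets. The function name `labeled_confusion_matrix` and the toy labels are hypothetical and not part of the original notebook.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

def labeled_confusion_matrix(y_true, y_pred):
    """Return a confusion matrix as a DataFrame with labels taken from the data."""
    # Use every class that appears in either the true or predicted labels
    labels = sorted(set(y_true) | set(y_pred))
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    return pd.DataFrame(cm,
                        index=[f'Actual_{label}' for label in labels],
                        columns=[f'Predicted_{label}' for label in labels])

# Toy example (hypothetical labels): a model that only ever predicts 'S'
toy_true = ['C', 'S', 'S', 'Q']
toy_pred = ['S', 'S', 'S', 'S']
toy_cm = labeled_confusion_matrix(toy_true, toy_pred)
print(toy_cm)
```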

In [39]:
y_predict = lr.predict(Xs_test)

In [40]:
metrics(y_test,y_predict)

Accuracy score 0.7831325301204819  

----------------------------------------------------------------
            Predicted_C  Predicted_Q  Predicted_S
Actually_C            0            0           17
Actual_Q              0            0            1
Actual_S              0            0           65 

-----------------------------------------------------------------
              precision    recall  f1-score   support

           C       0.00      0.00      0.00        17
           Q       0.00      0.00      0.00         1
           S       0.78      1.00      0.88        65

    accuracy                           0.78        83
   macro avg       0.26      0.33      0.29        83
weighted avg       0.61      0.78      0.69        83

-----------------------------------------------------------------


To correct for this imbalance we're going to use something called **SMOTE - Synthetic Minority OverSampling Technique**

A little video on SMOTE and imbalanced classes:
    https://www.youtube.com/watch?v=U3X98xZ4_no
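
To give a rough feel for what SMOTE does, here is a minimal NumPy sketch of the core idea: pick a minority-class point, pick one of its nearest minority-class neighbors, and create a new point somewhere on the line segment between them. This is a simplified illustration, not the imbalanced-learn implementation; the function name `smote_like_sample` and the toy points are hypothetical.

```python
import numpy as np

def smote_like_sample(X_minority, k=2, seed=None):
    """Create one synthetic point by interpolating between a minority
    sample and one of its k nearest minority-class neighbors."""
    rng = np.random.default_rng(seed)
    # Pick a random minority sample as the starting point
    x = X_minority[rng.integers(len(X_minority))]
    # Euclidean distances from x to every minority sample
    dists = np.linalg.norm(X_minority - x, axis=1)
    # Skip position 0 of the sorted order (the point itself, assuming no duplicates)
    neighbor_idx = np.argsort(dists)[1:k + 1]
    neighbor = X_minority[rng.choice(neighbor_idx)]
    # Move a random fraction of the way from x toward the neighbor
    gap = rng.random()
    return x + gap * (neighbor - x)

# Toy minority-class points (hypothetical)
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic_point = smote_like_sample(X_min, k=2, seed=0)
print(synthetic_point)
```

Because the distances depend on feature scale, this also hints at why the data is scaled before resampling.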

#### Let's create some synthetic data with SMOTE

In [71]:
##Now we can create synthetic data for our training set

sm = SMOTE()

Xsm_train, ysm_train = sm.fit_resample(Xs_train, y_train)

In [72]:
Xsm_train

array([[-0.79230884,  3.12816869,  0.01103799,  0.73238141],
       [-0.79230884, -1.17510755,  0.24330832,  0.73238141],
       [-0.79230884, -0.79762717, -0.64626023,  0.73238141],
       ...,
       [ 1.26213409,  0.01011421, -0.26953298, -1.36540876],
       [ 1.26213409, -0.1714265 , -0.49780206, -1.36540876],
       [ 1.14615173,  0.01011421, -0.48865244, -1.24697926]])

In [44]:
Xsm_train.shape

(1737, 4)

In [45]:
Xs_train.shape 

# SMOTE oversamples the minority classes,
# adding synthetic rows for each of them
# until all the classes are equally represented (3 classes x 579 rows = 1737)

(739, 4)

In [42]:
type(ysm_train)

pandas.core.series.Series

In [43]:
print(pd.Series(ysm_train).value_counts())

C    579
S    579
Q    579
Name: Embarked, dtype: int64


#### Let's see how it does with the SMOTE data

In [47]:
#let's fit another logistic regression on the resampled data to see how this did...

lr2 = LogisticRegression()

In [48]:
lr2.fit(Xsm_train,ysm_train)

LogisticRegression()

In [49]:
# Why not SMOTE the test data?
# Don't test on synthetic data - test on real data, so the scores reflect what the model will actually see

In [54]:
pred2 = lr2.predict(Xs_test)

In [55]:
metrics(y_test, pred2)

Accuracy score 0.6024096385542169  

----------------------------------------------------------------
            Predicted_C  Predicted_Q  Predicted_S
Actually_C           11            2            4
Actual_Q              0            1            0
Actual_S             10           17           38 

-----------------------------------------------------------------
              precision    recall  f1-score   support

           C       0.52      0.65      0.58        17
           Q       0.05      1.00      0.10         1
           S       0.90      0.58      0.71        65

    accuracy                           0.60        83
   macro avg       0.49      0.74      0.46        83
weighted avg       0.82      0.60      0.68        83

-----------------------------------------------------------------


In [None]:
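# the model is now over-predicting the minority classes (17 actual S rows are predicted as Q),
# so the accuracy score is lower (0.60 vs 0.78), but recall for C and Q is no longer zero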


### We can tune over SMOTE and our model using a pipeline and GridSearchCV

SMOTE Hyperparameters you can change:

- ***sampling_strategy***: Default is 'auto', which for over-samplers such as SMOTE is the same as 'not majority' (samples are only added to the classes other than the majority class).
    - Float: Only for binary classification - can specify the desired ratio of minority to majority class samples after resampling
    - 'minority': Will only resample the minority class (ONLY one)
    - 'not minority': Resample all classes except the minority class
    - 'not majority': Resample all classes except the majority class
    - Dict: the keys correspond to the targeted classes. The values correspond to the desired number of samples for each targeted class.


- ***k_neighbors***: similar to KNN - specifies the number of nearest neighbors used to construct synthetic samples (default is 5)


#### SMOTE Docs: https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTE.html

In [56]:
from imblearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

#### We need to use the imblearn Pipeline because we don't want to resample the testing data, just the training data - this pipeline will take care of that for you. 

In [82]:
pipe = Pipeline([
        ('scale', StandardScaler()),
        ('sampling', SMOTE()),
        ('logreg', LogisticRegression())
    ])

In [83]:
pipe_params = {
    'sampling__sampling_strategy': ['minority', 'not minority', 'auto'],
    'sampling__k_neighbors': [3, 5],
    'logreg__C': [0.1, 0.5, 1]
}

In [84]:
grid = GridSearchCV(pipe, pipe_params)

In [85]:
grid.fit(X_train, y_train)

GridSearchCV(estimator=Pipeline(steps=[('scale', StandardScaler()),
                                       ('sampling', SMOTE()),
                                       ('logreg', LogisticRegression())]),
             param_grid={'logreg__C': [0.1, 0.5, 1],
                         'sampling__k_neighbors': [3, 5],
                         'sampling__sampling_strategy': ['minority',
                                                         'not minority',
                                                         'auto']})

In [86]:
grid.score(X_test, y_test)

0.7108433734939759

In [87]:
grid.best_params_

{'logreg__C': 1,
 'sampling__k_neighbors': 3,
 'sampling__sampling_strategy': 'not minority'}

In [88]:
grid_preds = grid.predict(X_test)

In [89]:
metrics(y_test, grid_preds)

Accuracy score 0.7108433734939759  

----------------------------------------------------------------
            Predicted_C  Predicted_Q  Predicted_S
Actually_C           11            0            6
Actual_Q              0            0            1
Actual_S             17            0           48 

-----------------------------------------------------------------
              precision    recall  f1-score   support

           C       0.39      0.65      0.49        17
           Q       0.00      0.00      0.00         1
           S       0.87      0.74      0.80        65

    accuracy                           0.71        83
   macro avg       0.42      0.46      0.43        83
weighted avg       0.76      0.71      0.73        83

-----------------------------------------------------------------


In [90]:
# SMOTE is a technique.
# In and of themselves, techniques have no use:
# always provide the context in which the technique makes sense to use.
