# Original Code

**The following code applies SMOTE to the train observations of the dataset_mock_midterm dataset to achieve a 40/60 split. To keep the code simple, all categorical variables are removed and missing values are filled with zero:**

In [1]:
# Load Libraries
%matplotlib inline
import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTE
import io
from google.colab import files


# Read file
uploaded = files.upload()
dat = pd.read_csv(io.BytesIO(uploaded['dataset_mock_midterm.csv']), sep = ",")
dat = dat.drop(['date', 'severity', 'origin', 'tip_grd', 'tip_adm'], axis = 1).fillna(0) # Remove non-numerical variables and fill missing values for this notebook


# Apply SMOTE
sm = SMOTE(sampling_strategy = 0.66,
      random_state = 0,
      k_neighbors = 5)
X_res, y_res = sm.fit_resample(dat[dat['dataset'] == 'train'].drop(['dataset', 'exitus'], axis = 1), dat[dat['dataset'] == 'train']['exitus'])


# Add dataset and exitus columns
X_res['exitus'] = y_res
X_res['dataset'] = 'train'

Saving dataset_mock_midterm.csv to dataset_mock_midterm.csv


Let's see the ratios of X_res, our new train set.

In [12]:
100*X_res.exitus.value_counts()/X_res.shape[0]

0    60.242514
1    39.757486
Name: exitus, dtype: float64

We have achieved our desired 40/60 ratio on the train set.
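As a rough check on why 0.66 lands near 40/60: for a binary target, a float sampling_strategy is the minority count divided by the majority count after resampling, so the minority share should be about s / (1 + s). The sketch below uses a hypothetical helper, minority_share, that is not part of imblearn; the real output can differ slightly in the last decimals because SMOTE generates a whole number of samples.

```python
def minority_share(sampling_strategy):
    """Approximate minority percentage when minority/majority == sampling_strategy."""
    return 100 * sampling_strategy / (1 + sampling_strategy)


# 0.66 is a rounded version of 40/60, so the share is slightly below 40%.
print(round(minority_share(0.66), 2))
# Using the exact fraction 40/60 gives a 40% share.
print(round(minority_share(40 / 60), 2))
```

Passing 40/60 instead of the rounded 0.66 would therefore bring the train ratio closer to an exact 40/60.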



**NOTE:** Be careful here: this is the ratio on the train set only. If we concatenate this new train set with our original validation and test sets, like this:

In [6]:
# Create new dataset after SMOTE.
dat = pd.concat([X_res, dat[dat['dataset'] == 'val'], dat[dat['dataset'] == 'test']])

And check the overall ratio,

In [8]:
100*dat.exitus.value_counts()/dat.shape[0]

0    67.859146
1    32.140854
Name: exitus, dtype: float64

We see that the global ratio is more imbalanced, 32/68, because the validation and test sets were not resampled, but **we only care about our desired ratio for the train set.**
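To see the ratio of each split separately rather than the global one, a groupby on the dataset column is one option. The sketch below uses a small made-up frame (toy) with the same column names as the notebook, not the real data; it assumes exitus is coded as 0/1, so its mean is the share of class 1.

```python
import pandas as pd

# Made-up data for illustration only: a 40/60 train split and a 20/80 val split.
toy = pd.DataFrame({
    "dataset": ["train"] * 5 + ["val"] * 5,
    "exitus": [0, 0, 0, 1, 1, 0, 0, 0, 0, 1],
})

# Percentage of exitus == 1 within each split.
per_split = 100 * toy.groupby("dataset")["exitus"].mean()
print(per_split)

# Percentage of exitus == 1 over all rows, which mixes the splits together.
overall = 100 * toy["exitus"].mean()
print(overall)
```

On the real data, the same groupby applied to dat after the concatenation should show the resampled ratio for train and the original ratios for val and test.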

# Exercise

**Which value of sampling_strategy do you have to set to achieve a 20/80 ratio on the train set?**

To find the value of sampling_strategy we need, divide the desired minority ratio by the desired majority ratio by running the following code.

In [3]:
desired_minority_ratio = 20
desired_majority_ratio = 80
sampling_strategy = desired_minority_ratio/desired_majority_ratio
sampling_strategy

0.25

Therefore, in this case, we need to set sampling_strategy to 0.25. By changing the desired minority and majority ratios in the code above, you can obtain the corresponding sampling_strategy value for any desired ratio.
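The same calculation can be wrapped in a small function so that it is reusable. This is a minimal sketch: sampling_strategy_for is a hypothetical helper, not an imblearn function, and it assumes a binary target where the minority class should end up at no more than 50%.

```python
def sampling_strategy_for(minority_pct):
    """Return minority/majority for a desired minority percentage (binary target)."""
    if not 0 < minority_pct <= 50:
        raise ValueError("minority_pct must be in the interval (0, 50]")
    majority_pct = 100 - minority_pct
    return minority_pct / majority_pct


print(sampling_strategy_for(20))  # 20/80
print(sampling_strategy_for(40))  # 40/60
```

Keep in mind that SMOTE only adds minority samples, so the requested value has to be higher than the minority/majority ratio already present in the train set.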

Let's verify that this actually works.

In [4]:
# Read file
dat = pd.read_csv(io.BytesIO(uploaded['dataset_mock_midterm.csv']), sep = ",")
dat = dat.drop(['date', 'severity', 'origin', 'tip_grd', 'tip_adm'], axis = 1).fillna(0) # Remove non-numerical variables and fill missing values for this notebook


# Apply SMOTE
sm = SMOTE(sampling_strategy = sampling_strategy, ## MODIFIED!!!
      random_state = 0,
      k_neighbors = 5)
X_res, y_res = sm.fit_resample(dat[dat['dataset'] == 'train'].drop(['dataset', 'exitus'], axis = 1), dat[dat['dataset'] == 'train']['exitus'])


# Add dataset and exitus columns
X_res['exitus'] = y_res
X_res['dataset'] = 'train'

In [5]:
100*X_res.exitus.value_counts()/X_res.shape[0]

0    80.000727
1    19.999273
Name: exitus, dtype: float64

It works!