# SMOTE Tutorial

This tutorial shows how to improve fairness between genders by upsampling the under-represented group with the SMOTE algorithm.

In [1]:
import os.path
import sys

import pandas as pd
import matplotlib.pyplot as plt

from IPython.display import Markdown, display

## 1. What is SMOTE

### k-Nearest Neighbors Method

<img src="tutorial_images/The-schematic-of-NRSBoundary-SMOTE-algorithm.png" style="max-height: 400px; display: inline-block;">
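
SMOTE creates synthetic minority samples by picking a minority point, choosing one of its k nearest minority neighbors, and placing a new point somewhere on the line segment between the two. The following is a minimal pure-Python sketch of that idea, assuming 2-D points stored as tuples. The names `smote_sketch`, `k`, and `seed` are hypothetical and do not come from this tutorial or from any library.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority point as the base.
        base = rng.choice(minority)
        # Find its k nearest minority neighbors (excluding the base itself).
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        # Place the new point at a random position on the segment base -> neighbor.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Hypothetical minority-class points, for illustration only.
minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0), (3.0, 3.0)]
new_points = smote_sketch(minority, n_new=3)
```

Because every synthetic point is a convex combination of two existing minority points, it always falls inside the region the minority class already occupies. The sketch needs at least two minority points and `math.dist` requires Python 3.8 or later.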

### Original Distribution (Left) vs. Upsampled Distribution (Right)

<img src="tutorial_images/Scatter-Plot-of-Imbalanced-Binary-Classification-Problem.png" style="max-height: 300px; display: inline-block;">
<img src="tutorial_images/Scatter-Plot-of-Imbalanced-Binary-Classification-Problem-Transformed-by-SMOTE.png" style="max-height: 300px; display: inline-block;">

## 2. Read Sample File

In [2]:
filepath = os.path.abspath('./sample.csv')
dirname = os.path.dirname(filepath)
filename = os.path.basename(filepath)

data = pd.read_csv(filepath, header = 0)

data

| | gender | feature1 |
|---|---|---|
| 0 | 0 | 3 |
| 1 | 1 | 5 |
| 2 | 0 | 6 |
| 3 | 1 | 5 |
| 4 | 0 | 5 |
| 5 | 0 | 1 |
| 6 | 1 | 2 |
| 7 | 1 | 3 |
| 8 | 1 | 4 |
| 9 | 1 | 5 |


### Mean of `feature1` by `gender`

In [3]:
data.groupby(['gender']).agg(['mean'])['feature1']

| gender | mean |
|---|---|
| 0 | 3.333333 |
| 1 | 3.75 |


## 3. Mitigation using SMOTE

In [4]:
# Count rows per gender (Series indexed by gender value).
gender_count = data.groupby('gender').size()

# The more frequent gender is the majority; on a tie, gender 0 is treated as the majority.
major_gender = int(gender_count[0] < gender_count[1])
minor_gender = 1 - major_gender
major_n = int(gender_count[major_gender])
minor_n = int(gender_count[minor_gender])
gender_bias = major_n / (major_n + minor_n)  # share of rows held by the majority gender
display(Markdown(f'* original gender_bias: {gender_bias}'))

# Path for an upsampled copy of the input file. It is built once, outside the loop,
# so the '_upsampled' suffix is not appended again on every iteration.
# This cell does not write the file.
stem, ext = os.path.splitext(filename)
upsampled_filepath = os.path.join(dirname, f'{stem}_upsampled{ext}')

# Try thresholds 0.50, 0.52, ..., 0.58 for the acceptable majority share.
for critical in (c / 100 for c in range(50, 60, 2)):
    if gender_bias > critical:
        repeat_n = round(critical * major_n - (1 - critical) * minor_n)
        # Duplicate randomly chosen minority rows (sampling with replacement).
        # Note: this is random oversampling of existing rows; it does not
        # interpolate new synthetic rows the way SMOTE does.
        upsample = data[data['gender'] == minor_gender].sample(n=repeat_n, replace=True)
        upsampled_data = pd.concat([data, upsample], ignore_index=True)

        # Recompute the per-gender means and the majority share after upsampling.
        mean = upsampled_data.groupby('gender')['feature1'].mean()
        up_gc = upsampled_data.groupby('gender').size()
        major_up_n = int(up_gc[major_gender])
        minor_up_n = int(up_gc[minor_gender])
        up_gender_bias = major_up_n / (major_up_n + minor_up_n)

        display(Markdown(f'* critical: {critical}, repeat_n: {repeat_n}, gender_bias: {up_gender_bias}<br/>' +
                         f'major feature1 mean: {mean[major_gender]}<br/>' +
                         f'minor feature1 mean: {mean[minor_gender]}'))
    else:
        display(Markdown(f'* critical: {critical}, No need to sample'))
        break

* original gender_bias: 0.5714285714285714

* critical: 0.5, repeat_n: 1, gender_bias: 0.5333333333333333<br/>major feature1 mean: 3.75<br/>minor feature1 mean: 3.7142857142857144

* critical: 0.52, repeat_n: 1, gender_bias: 0.5333333333333333<br/>major feature1 mean: 3.75<br/>minor feature1 mean: 3.7142857142857144

* critical: 0.54, repeat_n: 2, gender_bias: 0.5<br/>major feature1 mean: 3.75<br/>minor feature1 mean: 3.75

* critical: 0.56, repeat_n: 2, gender_bias: 0.5<br/>major feature1 mean: 3.75<br/>minor feature1 mean: 3.375

* critical: 0.58, No need to sample
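
In the output above, the upsampled `gender_bias` does not always fall to the `critical` threshold (for example, 0.533 at `critical` 0.5), because `repeat_n` comes from the tutorial's own rounding formula. As an alternative, the smallest number of added minority rows `r` that brings the majority share down to a target can be solved directly from `major_n / (major_n + minor_n + r) <= target`. The sketch below assumes that goal. The function name `repeats_for_target` is hypothetical, and the counts 8 and 6 are example values chosen to be consistent with the ratios printed above.

```python
import math


def repeats_for_target(major_n, minor_n, target):
    """Smallest number of minority rows to add so the majority share is <= target."""
    # Rearranging major_n / (major_n + minor_n + r) <= target gives
    # r >= major_n / target - (major_n + minor_n).
    needed = major_n / target - (major_n + minor_n)
    # No rows are needed when the share is already at or below the target.
    return max(0, math.ceil(needed))


# Example counts (assumed): 8 majority rows, 6 minority rows, target share 0.5.
r = repeats_for_target(8, 6, 0.5)
share_after = 8 / (8 + 6 + r)
```

With these example counts, two added rows bring the majority share to exactly 0.5, which matches the `gender_bias` the output reports once `repeat_n` reaches 2.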

<strong>Reference:</strong>

* dnl8145@gmail.com
* https://www.researchgate.net/figure/The-schematic-of-NRSBoundary-SMOTE-algorithm_fig1_287601878
* https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/