<a href="https://colab.research.google.com/github/mobarakol/tutorial_notebooks/blob/main/SMOTE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Synthetic Minority Oversampling Technique (SMOTE)<br>
The SMOTE paper shows that combining over-sampling of the minority class with under-sampling of the majority class can achieve better classifier performance (in ROC space) than varying the loss ratios in Ripper or the class priors in Naive Bayes. The over-sampling step creates synthetic minority-class examples instead of duplicating existing ones. The idea has two parts: <br>
1. Inspiration: earlier work created extra training data by performing certain operations on real data. In that work, operations like rotation and skew were natural ways to perturb the training data. <br>

2. SMOTE's approach: generate synthetic examples in feature space. Take the difference between the feature vector (sample) under consideration and its nearest neighbor, multiply this difference by a random number between 0 and 1, and add it to the feature vector under consideration. <br>
Src: https://arxiv.org/pdf/1106.1813.pdf
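
The interpolation step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the `imblearn` implementation: the function names `smote_point` and `smote_sketch`, the brute-force neighbor search, and the choice of picking one of `k` nearest neighbors at random are assumptions made for the example.

```python
import numpy as np


def smote_point(x, neighbor, rng):
    """Return one synthetic sample on the line segment between x and neighbor."""
    gap = rng.random()  # random number in [0, 1)
    return x + gap * (neighbor - x)


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic samples from minority-class rows X_min.

    Hypothetical helper for illustration; needs at least two minority rows.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Brute-force pairwise Euclidean distances between minority samples.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a sample is never its own neighbor

    k = min(k, len(X_min) - 1)
    neighbors = np.argsort(dist, axis=1)[:, :k]  # indices of the k nearest rows

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        j = rng.integers(len(X_min))          # sample under consideration
        nb = neighbors[j, rng.integers(k)]    # one of its k nearest neighbors
        synthetic[i] = smote_point(X_min[j], X_min[nb], rng)
    return synthetic


# Toy minority class: the four corners of the unit square.
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sketch(X_demo, n_new=10, k=2, seed=0)
```

Because every synthetic point is a convex combination of two real minority samples, it always lies between them in feature space, which is why the toy points above stay inside the unit square.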

In [4]:
!pip -q install hepml


Dataset already exists at '../data/susy.csv.gz' and is not downloaded again.
CPU times: user 52.7 s, sys: 1.46 s, total: 54.1 s
Wall time: 55.6 s



In [18]:
# data wrangling
from pathlib import Path

import numpy as np
import pandas as pd
from hepml.core import display_large, download_dataset

# resampling and train/validation split
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split


download_dataset("susy.csv.gz")
DATA = Path("../data")
columns = [
    "signal",
    "lepton 1 pT",
    "lepton 1 eta",
    "lepton 1 phi",
    "lepton 2 pT",
    "lepton 2 eta",
    "lepton 2 phi",
    "missing energy magnitude",
    "missing energy phi",
    "MET_rel",
    "axial MET",
    "M_R",
    "M_TR_2",
    "R",
    "MT2",
    "S_R",
    "M_Delta_R",
    "dPhi_r_b",
    "cos(theta_r1)",
]

# this can take a while to load ...
%time susy = pd.read_csv(DATA / "susy.csv.gz", compression="gzip", names=columns)

low_features = [
    "signal",
    "lepton 1 pT",
    "lepton 1 eta",
    "lepton 1 phi",
    "lepton 2 pT",
    "lepton 2 eta",
    "lepton 2 phi",
    "missing energy magnitude",
    "missing energy phi",
]

high_features = [
    "signal",
    "MET_rel",
    "axial MET",
    "M_R",
    "M_TR_2",
    "R",
    "MT2",
    "S_R",
    "M_Delta_R",
    "dPhi_r_b",
    "cos(theta_r1)",
]
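# keep the first 4.5M rows for training and draw a reproducible 100k-row sample
susy_train = susy.iloc[:4500000].copy()
susy_sample = susy_train.sample(n=100000, random_state=42)

# feature subsets (each still includes the "signal" label column)
susy_low = susy_sample[low_features].copy()
susy_high = susy_sample[high_features].copy()
susy_all = susy_sample.copy()

# separate features from the label, then hold out 20% for validation
X = susy_all.drop("signal", axis=1)
y = susy_all["signal"]
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)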

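# oversample the training split only; the validation split is left untouched
sm = SMOTE(random_state=42)
train_rows = len(X_train)
X_train = X_train.to_numpy()  # plain 2-D array of shape (train_rows, n_features)

X_train, y_train = sm.fit_resample(X_train, y_train)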
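
A quick sanity check after resampling is to compare the class counts before and after. The sketch below uses only the standard library; `class_balance` is a hypothetical helper, and the toy label lists stand in for `y_train` before and after `sm.fit_resample`.

```python
from collections import Counter


def class_balance(labels):
    """Map each class label to (count, fraction of all samples)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: (n, n / total) for cls, n in counts.items()}


# Toy labels standing in for y_train before and after oversampling.
before = class_balance([0, 0, 0, 0, 1])
after = class_balance([0, 0, 0, 0, 1, 1, 1, 1])
print("before:", before)
print("after: ", after)
```

In the notebook, the same helper could be called on `y_train` right before and right after the `fit_resample` line. With SMOTE's default settings the minority class is expected to be brought up to the size of the majority class.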