#### Demonstration code for balancing imbalanced data

We undersample the majority classes with the cluster centroids method: each majority class is replaced by the centroids of a K-means clustering.

In [2]:
from collections import Counter
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=3,
                           n_clusters_per_class=1,
                           weights=[0.01, 0.05, 0.94],
                           class_sep=0.8, random_state=0)
print("Original class counts:", sorted(Counter(y).items()))

from imblearn.under_sampling import ClusterCentroids
cc = ClusterCentroids(random_state=0)
X_resampled, y_resampled = cc.fit_resample(X, y)
print("Class counts after resampling:", sorted(Counter(y_resampled).items()))

Original class counts: [(0, 64), (1, 262), (2, 4674)]
Class counts after resampling: [(0, 64), (1, 64), (2, 64)]
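
The idea behind cluster-centroid undersampling can be sketched without any library: run K-means on one majority class with k equal to the target size, then keep the k centroids in place of the original points. The block below is a minimal pure-Python illustration under simplifying assumptions (deterministic initialisation from the first k points, a fixed number of Lloyd iterations). The names `kmeans_centroids`, `majority`, and `reduced` are hypothetical, and this is not how imbalanced-learn implements `ClusterCentroids`.

```python
import random

def kmeans_centroids(points, k, n_iter=20):
    """Return k centroids of `points` (a list of equal-length tuples) via Lloyd's algorithm."""
    # Simplifying assumption: initialise with the first k points (deterministic).
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(n_iter):
        # Assignment step: group every point with its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for j, cluster in enumerate(clusters):
            if cluster:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
    return centroids

# Toy data: 200 majority points, reduced to 5 representative centroids.
rng = random.Random(0)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
reduced = kmeans_centroids(majority, k=5)
print(len(majority), "->", len(reduced))
```

Note that the centroids are generally new points, not members of the original data set, so the result is a synthetic summary of the majority class rather than a subset of it.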


We create a subset by random sampling; this is not recommended, but it is certainly fast.

In [9]:
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=0)
X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=3,
                           n_clusters_per_class=1,
                           weights=[0.01, 0.05, 0.94],
                           class_sep=0.8, random_state=0)


X_resampled, y_resampled = rus.fit_resample(X, y)
print("Class counts after resampling:", sorted(Counter(y_resampled).items()))

# sampling without replacement: count the resampled rows
import numpy as np
print("Rows after sampling without replacement:", np.vstack([tuple(row) for row in X_resampled]).shape)

# sampling with replacement: count the unique rows, since a row can be drawn more than once
rus = RandomUnderSampler(random_state=0, replacement=True)
X_resampled, y_resampled = rus.fit_resample(X, y)
print("Unique rows after sampling with replacement:", np.vstack(np.unique([tuple(row) for row in X_resampled], axis=0)).shape)

# on pandas objects
from sklearn.datasets import fetch_openml
df_adult, y_adult = fetch_openml(
    'adult', version=2, as_frame=True, return_X_y=True)
df_adult.head()
df_resampled, y_resampled = rus.fit_resample(df_adult, y_adult)
df_resampled.head()

Class counts after resampling: [(0, 64), (1, 64), (2, 64)]
Rows after sampling without replacement: (192, 2)
Unique rows after sampling with replacement: (181, 2)


| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29.0 | Private | 201101.0 | HS-grad | 9.0 | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 0.0 | 0.0 | 50.0 | United-States |
| 1 | 23.0 | Private | 188950.0 | Assoc-voc | 11.0 | Never-married | Sales | Own-child | White | Male | 0.0 | 0.0 | 40.0 | United-States |
| 2 | 24.0 | Private | 282604.0 | Some-college | 10.0 | Married-civ-spouse | Protective-serv | Other-relative | White | Male | 0.0 | 0.0 | 24.0 | United-States |
| 3 | 29.0 | Private | 174419.0 | HS-grad | 9.0 | Never-married | Other-service | Unmarried | White | Female | 0.0 | 0.0 | 30.0 | United-States |
| 4 | 20.0 | Private | 236592.0 | 12th | 8.0 | Never-married | Prof-specialty | Not-in-family | White | Female | 0.0 | 0.0 | 35.0 | Italy |


In [7]:
df_adult.shape

(48842, 14)

In [None]:
df_resampled.head(2) 

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29.0 | Private | 201101.0 | HS-grad | 9.0 | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 0.0 | 0.0 | 50.0 | United-States |
| 1 | 23.0 | Private | 188950.0 | Assoc-voc | 11.0 | Never-married | Sales | Own-child | White | Male | 0.0 | 0.0 | 40.0 | United-States |
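
Random undersampling as used above is simple enough to write by hand: for each class, draw as many samples as the rarest class contains. The following pure-Python sketch is only an illustration of that idea, not imbalanced-learn's code; the function name `random_undersample` and the toy data are hypothetical. It also shows why sampling with replacement can return fewer unique rows, since the same row may be drawn more than once.

```python
import random
from collections import Counter, defaultdict

def random_undersample(X, y, replacement=False, seed=0):
    """Shrink every class to the size of the rarest class by random selection."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    target = min(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label in sorted(by_class):
        rows = by_class[label]
        if replacement:
            chosen = rng.choices(rows, k=target)  # a row may be drawn repeatedly
        else:
            chosen = rng.sample(rows, target)     # each row is drawn at most once
        X_out.extend(chosen)
        y_out.extend([label] * target)
    return X_out, y_out

# Toy data: 10 samples of class 0 and 90 of class 1, all rows distinct.
X_toy = [(i, i % 7) for i in range(100)]
y_toy = [0] * 10 + [1] * 90

X_s, y_s = random_undersample(X_toy, y_toy)
print(sorted(Counter(y_s).items()), "unique rows:", len(set(X_s)))

X_r, y_r = random_undersample(X_toy, y_toy, replacement=True)
print(sorted(Counter(y_r).items()), "unique rows:", len(set(X_r)))
```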


NearMiss selects majority samples according to their distance to the nearest neighbours from the minority class. It comes in three versions.

In [None]:
from imblearn.under_sampling import NearMiss
nr = NearMiss(version=1)  # version can be 1, 2, or 3
X_miss, y_miss = nr.fit_resample(X, y)
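
To make the NearMiss-1 heuristic concrete: it keeps the majority samples whose mean distance to their k nearest minority samples is smallest. The sketch below illustrates this in pure Python on hand-made points. The function name `nearmiss1`, the toy data, and the choice k=3 are assumptions for illustration, and the code is not imbalanced-learn's implementation.

```python
import math

def nearmiss1(majority, minority, n_keep, k=3):
    """Keep the n_keep majority points with the smallest mean distance
    to their k nearest minority points (the NearMiss-1 heuristic)."""
    def mean_dist_to_k_nearest(p):
        dists = sorted(math.dist(p, q) for q in minority)
        nearest = dists[:k]
        return sum(nearest) / len(nearest)
    return sorted(majority, key=mean_dist_to_k_nearest)[:n_keep]

# Toy data: the minority class sits near the origin.
minority = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
majority = [(5.0, 5.0), (0.5, 0.5), (9.0, 9.0), (1.0, 1.0), (3.0, 3.0)]

# Balance the classes by keeping as many majority points as there are minority points.
kept = nearmiss1(majority, minority, n_keep=len(minority))
print(kept)
```

The three majority points closest to the minority cluster survive, which is why NearMiss-1 tends to keep samples near the class boundary.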


### Oversampling

In [11]:
# Random oversampling: duplicate existing samples of the minority classes
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=3,
                           n_clusters_per_class=1,
                           weights=[0.01, 0.05, 0.94],
                           class_sep=0.8, random_state=0)
ros = RandomOverSampler(random_state=0)
X_resampled, y_resampled = ros.fit_resample(X, y)
print(sorted(Counter(y_resampled).items()))


[(0, 4674), (1, 4674), (2, 4674)]
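
The mechanism behind random oversampling can be sketched in a few lines: every class is topped up with randomly chosen duplicates of its own rows until it matches the largest class. This is a pure-Python illustration only; `random_oversample` and the toy data are hypothetical names, and the code does not reproduce imbalanced-learn's internals.

```python
import random
from collections import Counter, defaultdict

def random_oversample(X, y, seed=0):
    """Grow every class to the size of the largest class by duplicating random rows."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label in sorted(by_class):
        rows = by_class[label]
        # Keep all original rows and top up with random duplicates.
        extra = rng.choices(rows, k=target - len(rows))
        X_out.extend(rows + extra)
        y_out.extend([label] * target)
    return X_out, y_out

# Toy data with class sizes 5, 20, and 75.
X_toy = [(i, i * 2) for i in range(100)]
y_toy = [0] * 5 + [1] * 20 + [2] * 75

X_over, y_over = random_oversample(X_toy, y_toy)
print(sorted(Counter(y_over).items()))
print("new distinct rows created:", len(set(X_over) - set(X_toy)))
```

Because only duplicates are added, no new information enters the data set; a classifier simply sees the minority rows more often.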


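SMOTE and ADASYN: instead of duplicating samples, both generate new synthetic minority samples by interpolating between existing ones.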

In [None]:
from sklearn.svm import LinearSVC
from imblearn.over_sampling import SMOTE, ADASYN
X_resampled, y_resampled = SMOTE().fit_resample(X, y)
print(sorted(Counter(y_resampled).items()))

clf_smote = LinearSVC().fit(X_resampled, y_resampled)
X_resampled, y_resampled = ADASYN().fit_resample(X, y)
print(sorted(Counter(y_resampled).items()))

clf_adasyn = LinearSVC().fit(X_resampled, y_resampled)
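
The core step of SMOTE can be illustrated by hand: pick a minority sample, pick one of its k nearest minority neighbours, and place a new point at a random position on the segment between them, x_new = x + λ (x_nn − x) with λ in [0, 1). The sketch below shows a single such step in pure Python. The name `smote_sample`, the toy points, and k=2 are assumptions for illustration, and this is not the library's implementation. ADASYN uses the same interpolation but generates more points around minority samples that have many majority-class neighbours.

```python
import math
import random

def smote_sample(minority, k=2, seed=0):
    """Create one synthetic point between a random minority point
    and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    base = rng.choice(minority)
    # The k nearest minority neighbours of the chosen base point.
    neighbours = sorted(
        (q for q in minority if q != base),
        key=lambda q: math.dist(base, q),
    )[:k]
    neighbour = rng.choice(neighbours)
    lam = rng.random()  # interpolation weight in [0, 1)
    synthetic = tuple(b + lam * (n - b) for b, n in zip(base, neighbour))
    return base, neighbour, synthetic

# Toy minority class with four distinct points.
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.2), (3.0, 3.0)]
base, neighbour, synthetic = smote_sample(minority)
print("base:", base, "neighbour:", neighbour, "synthetic:", synthetic)
```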