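### When the target classes are heavily imbalanced
* Accuracy can stay high while recall for the minority class drops
* This situation is known as imbalanced data

### Solutions
* over-sampling : increase the number of minority-class samples (most commonly used)
* under-sampling : remove some of the majority-class samples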
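
The two strategies listed above can be sketched on a toy label list before turning to `imbalanced-learn`. This is only an illustration in plain Python: the label counts (20 vs. 4) and all variable names are assumptions made for the example, not part of the notebook's data.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is reproducible

# Toy labels (hypothetical): 20 majority samples (1) and 4 minority samples (0)
labels = [1] * 20 + [0] * 4
indices = list(range(len(labels)))

minority = [i for i in indices if labels[i] == 0]
majority = [i for i in indices if labels[i] == 1]

# Over-sampling: draw minority indices with replacement until both classes match
extra = random.choices(minority, k=len(majority) - len(minority))
over_idx = majority + minority + extra

# Under-sampling: keep only as many majority indices as there are minority ones
under_idx = random.sample(majority, k=len(minority)) + minority

over_counts = Counter(labels[i] for i in over_idx)
under_counts = Counter(labels[i] for i in under_idx)
print("over-sampled :", dict(over_counts))
print("under-sampled:", dict(under_counts))
```

Over-sampling keeps every original row but repeats minority rows, while under-sampling balances the classes by discarding majority rows, which is one reason it tends to be less attractive when data is scarce.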

In [1]:
!pip install imbalanced-learn

Collecting imbalanced-learn
  Downloading imbalanced_learn-0.8.0-py3-none-any.whl (206 kB)
Collecting scikit-learn>=0.24
  Downloading scikit_learn-0.24.1-cp38-cp38-win_amd64.whl (6.9 MB)
Installing collected packages: scikit-learn, imbalanced-learn
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 0.23.2
    Uninstalling scikit-learn-0.23.2:
      Successfully uninstalled scikit-learn-0.23.2
Successfully installed imbalanced-learn-0.8.0 scikit-learn-0.24.1


In [1]:
# Over-sampling
# Random, ADASYN, SMOTE
from imblearn.over_sampling import RandomOverSampler, ADASYN, SMOTE

In [2]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
df = pd.DataFrame(data=cancer['data'], columns=cancer['feature_names'])
df['target'] = cancer['target']
df.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0


In [3]:
df['target'].value_counts()

1    357
0    212
Name: target, dtype: int64

In [4]:
df1 = df.loc[df['target']==1]
df0 = df.loc[df['target']==0][:10]
df_new = pd.concat([df1,df0])
df_new['target'].value_counts()

1    357
0     10
Name: target, dtype: int64

In [5]:
from sklearn.model_selection import train_test_split
x_train, x_valid, y_train, y_valid = train_test_split(df_new.drop(columns='target'), df_new['target'],
                                                     random_state=32,
                                                     stratify=df_new['target'])

In [6]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

model = DecisionTreeClassifier(random_state=14)
model.fit(x_train, y_train)
pred = model.predict(x_valid)
accuracy_score(y_valid, pred)

0.9782608695652174

In [7]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_valid, pred)

array([[ 1,  2],
       [ 0, 89]], dtype=int64)

In [8]:
y_train.value_counts()

1    268
0      7
Name: target, dtype: int64

In [9]:
y_valid.value_counts()

1    89
0     3
Name: target, dtype: int64

In [10]:
from sklearn.metrics import classification_report
print(classification_report(y_valid, pred))

              precision    recall  f1-score   support

           0       1.00      0.33      0.50         3
           1       0.98      1.00      0.99        89

    accuracy                           0.98        92
   macro avg       0.99      0.67      0.74        92
weighted avg       0.98      0.98      0.97        92
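
The gap between accuracy and recall in the report above can be recomputed by hand from the confusion matrix printed earlier (rows are true classes, columns are predicted classes, as in scikit-learn's convention). A small sketch, with the matrix values copied from that output:

```python
# Confusion matrix from the notebook output: rows = true class, columns = predicted class
cm = [[1, 2],
      [0, 89]]

total = sum(sum(row) for row in cm)

# Accuracy: correctly classified samples over all samples
accuracy = (cm[0][0] + cm[1][1]) / total

# Recall per class: correct predictions for a class over all true members of that class
recall_0 = cm[0][0] / sum(cm[0])
recall_1 = cm[1][1] / sum(cm[1])

print(f"accuracy = {accuracy:.2f}")
print(f"recall (class 0) = {recall_0:.2f}")
print(f"recall (class 1) = {recall_1:.2f}")
```

Only 3 of the 92 validation samples belong to class 0, so missing 2 of them barely moves accuracy while cutting the recall for that class to one third.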



In [11]:
# RandomOverSampler
# Adds data by repeatedly duplicating samples of the minority target class
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler()
data, target = ros.fit_resample(df_new.drop(columns='target'), df_new['target'])
data

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,13.540,14.36,87.46,566.3,0.09779,0.08129,0.06664,0.047810,0.1885,0.05766,...,15.110,19.26,99.70,711.2,0.14400,0.17730,0.23900,0.12880,0.2977,0.07259
1,13.080,15.71,85.63,520.0,0.10750,0.12700,0.04568,0.031100,0.1967,0.06811,...,14.500,20.49,96.09,630.5,0.13120,0.27760,0.18900,0.07283,0.3184,0.08183
2,9.504,12.44,60.34,273.9,0.10240,0.06492,0.02956,0.020760,0.1815,0.06905,...,10.230,15.66,65.13,314.9,0.13240,0.11480,0.08867,0.06227,0.2450,0.07773
3,13.030,18.42,82.61,523.8,0.08983,0.03766,0.02562,0.029230,0.1467,0.05863,...,13.300,22.81,84.46,545.9,0.09701,0.04619,0.04833,0.05013,0.1987,0.06169
4,8.196,16.84,51.71,201.9,0.08600,0.05943,0.01588,0.005917,0.1769,0.06503,...,8.964,21.96,57.26,242.2,0.12970,0.13570,0.06880,0.02564,0.3105,0.07409
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
709,18.250,19.98,119.60,1040.0,0.09463,0.10900,0.11270,0.074000,0.1794,0.05742,...,22.880,27.66,153.20,1606.0,0.14420,0.25760,0.37840,0.19320,0.3063,0.08368
710,12.460,24.04,83.97,475.9,0.11860,0.23960,0.22730,0.085430,0.2030,0.08243,...,15.090,40.68,97.65,711.4,0.18530,1.05800,1.10500,0.22100,0.4366,0.20750
711,13.710,20.83,90.20,577.9,0.11890,0.16450,0.09366,0.059850,0.2196,0.07451,...,17.060,28.14,110.60,897.0,0.16540,0.36820,0.26780,0.15560,0.3196,0.11510
712,12.460,24.04,83.97,475.9,0.11860,0.23960,0.22730,0.085430,0.2030,0.08243,...,15.090,40.68,97.65,711.4,0.18530,1.05800,1.10500,0.22100,0.4366,0.20750


In [12]:
df_ros = pd.concat([data, target], axis=1)
df_ros.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 714 entries, 0 to 713
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              714 non-null    float64
 1   mean texture             714 non-null    float64
 2   mean perimeter           714 non-null    float64
 3   mean area                714 non-null    float64
 4   mean smoothness          714 non-null    float64
 5   mean compactness         714 non-null    float64
 6   mean concavity           714 non-null    float64
 7   mean concave points      714 non-null    float64
 8   mean symmetry            714 non-null    float64
 9   mean fractal dimension   714 non-null    float64
 10  radius error             714 non-null    float64
 11  texture error            714 non-null    float64
 12  perimeter error          714 non-null    float64
 13  area error               714 non-null    float64
 14  smoothness error         7

In [13]:
df_new.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 367 entries, 19 to 9
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              367 non-null    float64
 1   mean texture             367 non-null    float64
 2   mean perimeter           367 non-null    float64
 3   mean area                367 non-null    float64
 4   mean smoothness          367 non-null    float64
 5   mean compactness         367 non-null    float64
 6   mean concavity           367 non-null    float64
 7   mean concave points      367 non-null    float64
 8   mean symmetry            367 non-null    float64
 9   mean fractal dimension   367 non-null    float64
 10  radius error             367 non-null    float64
 11  texture error            367 non-null    float64
 12  perimeter error          367 non-null    float64
 13  area error               367 non-null    float64
 14  smoothness error         36

In [15]:
df_ros['target'].value_counts()

1    357
0    357
Name: target, dtype: int64

In [16]:
df_new['target'].value_counts()

1    357
0     10
Name: target, dtype: int64

In [17]:
x_train, x_valid, y_train, y_valid = train_test_split(df_ros.drop(columns='target'), df_ros['target'],
                                                     random_state=32,
                                                     stratify=df_ros['target'])

In [18]:
model = DecisionTreeClassifier(random_state=14)
model.fit(x_train, y_train)
pred = model.predict(x_valid)
accuracy_score(y_valid, pred)

1.0

In [19]:
confusion_matrix(y_valid, pred)

array([[89,  0],
       [ 0, 90]], dtype=int64)

In [20]:
print(classification_report(y_valid, pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        89
           1       1.00      1.00      1.00        90

    accuracy                           1.00       179
   macro avg       1.00      1.00      1.00       179
weighted avg       1.00      1.00      1.00       179
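
A perfect score here should be read with some caution. The oversampling above was applied before `train_test_split`, so copies of the same 10 minority rows very likely appear in both the training and the validation set. One common alternative is to split first and oversample only the training part. The sketch below illustrates that ordering with plain Python lists; the row ids, the `split` helper, and the 25% validation fraction are assumptions for illustration, not the notebook's actual pipeline.

```python
import random

random.seed(32)

# Hypothetical row ids: 0-9 are minority rows (class 0), 10-366 are majority rows (class 1)
minority_ids = list(range(10))
majority_ids = list(range(10, 367))

def split(ids, valid_fraction=0.25):
    """Shuffle a copy of ids and split it into (train, valid) lists."""
    ids = ids[:]
    random.shuffle(ids)
    n_valid = max(1, round(len(ids) * valid_fraction))
    return ids[n_valid:], ids[:n_valid]

# 1) Split each class separately (a simple stand-in for a stratified split)
min_train, min_valid = split(minority_ids)
maj_train, maj_valid = split(majority_ids)

# 2) Oversample only the training minority rows, drawing with replacement
extra = random.choices(min_train, k=len(maj_train) - len(min_train))
train_ids = maj_train + min_train + extra
valid_ids = maj_valid + min_valid

# 3) Check that no validation row was also used for training
overlap = set(train_ids) & set(valid_ids)
print("train size:", len(train_ids))
print("valid size:", len(valid_ids))
print("rows shared between train and valid:", len(overlap))
```

With this ordering the validation set keeps its original, imbalanced class ratio, so the reported recall reflects rows the model has not seen in any duplicated form.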



In [22]:
# SMOTE (Synthetic Minority Over-sampling Technique)
# Creates new minority-class samples near existing ones, on the line between a sample and one of its neighbors
smote = SMOTE(k_neighbors=5)
data, target = smote.fit_resample(df_new.drop(columns='target'), df_new['target'])
df_smote = pd.concat([data, target], axis=1)
df_smote.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 714 entries, 0 to 713
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              714 non-null    float64
 1   mean texture             714 non-null    float64
 2   mean perimeter           714 non-null    float64
 3   mean area                714 non-null    float64
 4   mean smoothness          714 non-null    float64
 5   mean compactness         714 non-null    float64
 6   mean concavity           714 non-null    float64
 7   mean concave points      714 non-null    float64
 8   mean symmetry            714 non-null    float64
 9   mean fractal dimension   714 non-null    float64
 10  radius error             714 non-null    float64
 11  texture error            714 non-null    float64
 12  perimeter error          714 non-null    float64
 13  area error               714 non-null    float64
 14  smoothness error         7

In [23]:
df_smote['target'].value_counts()

1    357
0    357
Name: target, dtype: int64
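
The core idea behind SMOTE's synthetic rows can be sketched as a linear interpolation between a minority sample and one of its nearest minority neighbors. The snippet below is a simplified illustration, not the `imblearn` implementation: the neighbor search is skipped, and the two points and the `interpolate` helper are hypothetical.

```python
import random

random.seed(0)

def interpolate(sample, neighbor):
    """Return a synthetic point on the line segment between sample and neighbor."""
    lam = random.random()  # interpolation weight in [0, 1)
    return [s + lam * (n - s) for s, n in zip(sample, neighbor)]

# Two hypothetical minority-class points with two features each
x_i = [1.0, 2.0]
x_nn = [3.0, 6.0]  # assumed to be one of x_i's nearest minority neighbors

# Generate a few synthetic minority samples between the two points
synthetic = [interpolate(x_i, x_nn) for _ in range(5)]
for point in synthetic:
    print(point)
```

Unlike random over-sampling, which repeats existing rows exactly, each synthetic point here is a new feature vector that lies between two real minority samples.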

In [24]:
x_train, x_valid, y_train, y_valid = train_test_split(df_smote.drop(columns='target'), df_smote['target'],
                                                     random_state=32,
                                                     stratify=df_smote['target'])
model = DecisionTreeClassifier(random_state=14)
model.fit(x_train, y_train)
pred = model.predict(x_valid)
accuracy_score(y_valid, pred)

0.994413407821229

In [25]:
confusion_matrix(y_valid, pred)

array([[88,  1],
       [ 0, 90]], dtype=int64)