# Binary Classification
## Tabular data
- competition: [Porto Seguro’s Safe Driver Prediction](https://www.kaggle.com/c/porto-seguro-safe-driver-prediction)
- Kernel
  - Base: [Data Preparation & Exploration](https://www.kaggle.com/bertcarremans/data-preparation-exploration)
  - Model: [XGBoost CV (LB .284)](https://www.kaggle.com/aharless/xgboost-cv-lb-284)

- **colab ver.**

# Data Preparation & Exploration

1. Visual Inspection
2. Defining the metadata
3. Descriptive statistics
4. Imbalanced classes
5. Data quality checks
6. EDA
7. Feature engineering
8. Feature selection
9. Feature scaling

## Loading Packages

In [288]:
# Data handling
import pandas as pd  # DataFrames
import numpy as np  # linear algebra
import matplotlib.pyplot as plt  # plotting
import seaborn as sns  # plotting
import warnings  # warning control
warnings.filterwarnings('ignore')  # suppress warning output
%matplotlib inline

pd.set_option('display.max_columns',100)  # display at most 100 columns

# Preprocessing
from sklearn.impute import SimpleImputer  # missing-value imputation
from sklearn.preprocessing import PolynomialFeatures  # polynomial and interaction terms
from sklearn.preprocessing import StandardScaler  # standardization scaler

# Feature Selection
from sklearn.feature_selection import VarianceThreshold  # drops low-variance features
from sklearn.feature_selection import SelectFromModel  # selects features by model importance

# Utilities
from sklearn.utils import shuffle  # shuffle (and subsample) indices

# Modeling
from sklearn.ensemble import RandomForestClassifier  # random forest

> `UsageError` with `%matplotlib inline`
Putting a comment on the same line as `%matplotlib inline` raises *UsageError: unrecognized arguments*.
- e.g. `%matplotlib inline #draw plots inline`
  - UsageError: unrecognized arguments: #draw plots inline
- **Fix**: remove the comment
- [Reference](https://stackoverflow.com/questions/27761707/cannot-plot-inline-with-ipython-notebook/28686533#28686533)


> `sklearn.preprocessing.Imputer` cannot be imported on Colab
- e.g. `cannot import name 'Imputer'`
- Cause: scikit-learn version issue (0.22.3 -> 0.21.3); `Imputer` was deprecated in 0.20 and removed in 0.22
- [Reference](https://github.com/mindsdb/lightwood/issues/75)

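> How the Imputer module changed across sklearn versions
- ver. 0.21.3: `sklearn.preprocessing.Imputer`
- ver. 0.22.2 through 0.24.0: `sklearn.impute.SimpleImputer`
  1. `SimpleImputer(missing_values,strategy=[mean,median])`: univariate imputation (each column is filled using only that column's own values)
  2. `IterativeImputer`: multivariate imputation (column i is treated as the output and its missing entries are replaced with model estimates)
    - similar to R's `missForest` (imputes missing values with a random forest)
  3. `MissingIndicator(missing_values)`: binary indicator of missingness (which columns contain missing values / whether each entry is missing)
  4. `KNNImputer`: imputation based on k-nearest neighbors
  - [Guide to choosing an imputer](https://scikit-learn.org/0.22/modules/impute.html#impute)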
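
As a rough illustration of the first and third options, the sketch below applies `SimpleImputer` and `MissingIndicator` to a made-up column in which -1 marks a missing value. The toy array is an assumption for illustration only. `IterativeImputer` is left out because it is still experimental and additionally needs `from sklearn.experimental import enable_iterative_imputer`.

```python
import numpy as np
from sklearn.impute import SimpleImputer, MissingIndicator

# Toy column (hypothetical data) where -1 marks a missing value.
X = np.array([[1.0], [-1.0], [3.0], [2.0]])

# Univariate mean imputation: the mean is computed from the non-missing entries only.
mean_imp = SimpleImputer(missing_values=-1, strategy='mean')
X_filled = mean_imp.fit_transform(X)

# Boolean mask that flags which entries were missing.
indicator = MissingIndicator(missing_values=-1)
mask = indicator.fit_transform(X)

print(X_filled.ravel())
print(mask.ravel())
```

The mask can be kept as an extra feature when the fact that a value is missing may itself be informative.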

## Loading Data

In [289]:
data_dir = 'drive/MyDrive/colab/kaggle/study/data/porto/'  # renamed from `dir` to avoid shadowing the built-in

In [None]:
train = pd.read_csv(data_dir+'train.csv')
test = pd.read_csv(data_dir+'test.csv')

# Visual Inspection

In [None]:
train.head()

In [None]:
train.tail()

In [None]:
train.shape

In [None]:
# shape after dropping duplicate rows
train.drop_duplicates().shape # same as train.shape, so there are no duplicates

In [None]:
test.shape # duplicates are not removed from test -> the submission format needs every row

In [None]:
train.info() # all variables are numeric

# Defining the metadata
: data that describes the characteristics of each variable
- meta information
  - role: input, target, ID
  - level: nominal, interval, ordinal, binary
  - keep (whether to keep or drop the variable): True, False
  - dtype: int, float, str

In [None]:
data = []

for f in train.columns:
  # role - based on the variable name
  if f == 'target':
    role = 'target'
  elif f == 'id':
    role = 'id'
  else:
    role = 'input'

  # level - based on the variable name and dtype
  if 'bin' in f or f == 'target':
    level = 'binary'
  elif 'cat' in f or f == 'id':
    level = 'nominal'
  elif pd.api.types.is_float_dtype(train[f]):
    level = 'interval'
  elif pd.api.types.is_integer_dtype(train[f]):  # also matches int32 on platforms where int is not int64
    level = 'ordinal'

  # keep - everything except id
  keep = True # default
  if f == 'id':
    keep = False

  # data type
  dtype = train[f].dtype

  # build one metadata row (dict) per variable
  f_dict = {
      'varname':f,
      'role':role,
      'level':level,
      'keep':keep,
      'dtype':dtype
  }
  data.append(f_dict)

In [None]:
meta = pd.DataFrame(data, columns=['varname','role','level','keep','dtype'])
meta.set_index('varname', inplace=True) # use varname as the index so variables can be looked up by name

meta

In [None]:
# e.g. list the nominal variables that are kept
meta[(meta.level == 'nominal') & (meta.keep)].index

In [None]:
# number of variables per role and level
pd.DataFrame({
    'count':meta.groupby(['role','level'])['role'].size()
}).reset_index()

# Descriptive statistics
: summary statistics for all variables except the nominal ones

1. interval
2. ordinal
3. binary

## Interval var.

In [None]:
v = meta[(meta.level=='interval') & (meta.keep)].index #조건에 맞는 변수명
train[v].describe()

- **reg var.**
  - *ps_reg_03* has missing values (-1 denotes a missing value)
  - min-max ranges differ across variables -> scaling is needed

- **car var.**
  - *ps_car_12* and *ps_car_14* have missing values
  - min-max ranges differ across variables -> scaling is needed

- **calc var.**
  - no missing values
  - min-max ranges are the same

## Ordinal var.

In [None]:
v = meta[(meta.level=='ordinal') & (meta.keep)].index
train[v].describe()

- *ps_car_11* is the only variable with missing values
- min-max ranges differ across variables -> a scaler is needed

## Binary var.

In [None]:
v = meta[(meta.level=='binary') & (meta.keep)].index
train[v].describe()

- For most binary variables, `0` is by far the more frequent value
- There are no missing values

# Imbalanced classes
- Problem: the 0/1 ratio of the `target` variable is imbalanced; 0 is far more frequent
  - Solutions:
    1. oversampling `target=1`
    2. undersampling `target=0`
  - Because the training set is large enough, this is handled by **undersampling `target=0`**

In [None]:
desired_apriori = 0.10

# indices of the target=0 and target=1 rows
idx_0 = train[train.target==0].index
idx_1 = train[train.target==1].index

# number of rows per target value
nb_0 = len(idx_0)
nb_1 = len(idx_1)
print(f'target 0:{nb_0}({round(nb_0/train.shape[0]*100,2)}%)  1:{nb_1}({round(nb_1/train.shape[0]*100,2)}%)')

# undersampling
undersampling_rate = ((1-desired_apriori)*nb_1) / (desired_apriori*nb_0)
undersampled_nb_0 = int(undersampling_rate*nb_0)
print(f'Rate to undersample records with target=0:{round(undersampling_rate*100,2)}%')
print(f'Number of records with target=0 after undersampling:{undersampled_nb_0}')

# undersampled index (target=0)
undersampled_idx = shuffle(idx_0, random_state=37, n_samples=undersampled_nb_0) # draw undersampled_nb_0 indices from idx_0

# combine the index lists
idx_list = list(undersampled_idx) + list(idx_1)

# undersampled train set
train = train.loc[idx_list].reset_index(drop=True)
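
The undersampling rate above follows from requiring that `target=1` make up a share `p` of the resampled data: `nb_1 / (nb_1 + r*nb_0) = p` gives `r = (1-p)*nb_1 / (p*nb_0)`. The sketch below checks that arithmetic on made-up class counts; the counts and the helper name `undersampling_rate_for` are assumptions for illustration, not values from the dataset.

```python
def undersampling_rate_for(nb_0, nb_1, desired_prior):
    """Fraction of target=0 rows to keep so that target=1 makes up `desired_prior`."""
    return ((1 - desired_prior) * nb_1) / (desired_prior * nb_0)

# Hypothetical class counts, chosen only to keep the arithmetic easy to follow.
toy_nb_0, toy_nb_1 = 9_600, 400
rate = undersampling_rate_for(toy_nb_0, toy_nb_1, desired_prior=0.10)
kept_0 = int(rate * toy_nb_0)

# Share of target=1 after undersampling; it should be close to the desired prior.
new_prior = toy_nb_1 / (toy_nb_1 + kept_0)
print(rate, kept_0, new_prior)
```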

# Data quality checks
1. missing values
2. cardinality

## Check missing values
- Missing values are coded as -1

In [None]:
vars_with_missing = [] # names of variables that contain missing values

for f in train.columns:
  level = meta.loc[f,'level']
  missings = (train[f]==-1).sum() # number of missing values (-1)
  if missings>0: # the variable has missing values
    vars_with_missing.append(f)
    missings_perc = missings/train.shape[0]

    print(f'{f}({level}) has {missings} records({round(missings_perc*100,2)}%)')

print(f'\n{len(vars_with_missing)} variables with missing values')

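- *ps_car_03_cat* (68.39%) and *ps_car_05_cat* (44.26%) have a high share of missing values, so they are **dropped**
- *ps_car_11* (ordinal) has only 1 missing value, so it is filled with the **mode**
- The remaining variables have a low share of missing values and are continuous, so they are filled with the **mean** (categorical variables are excluded)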

### drop columns

In [None]:
# drop columns
vars_to_drop = ['ps_car_03_cat','ps_car_05_cat']
train.drop(vars_to_drop, inplace=True, axis=1)
meta.loc[(vars_to_drop),'keep'] = False

In [None]:
test.drop(vars_to_drop, inplace=True, axis=1)

### Imputing the mode

In [None]:
# Imputing the mode
mode_imp = SimpleImputer(missing_values=-1, strategy='most_frequent')
train['ps_car_11'] = mode_imp.fit_transform(train[['ps_car_11']]).ravel()

In [None]:
# test dset: reuse the mode learned from train instead of refitting on test
test['ps_car_11'] = mode_imp.transform(test[['ps_car_11']]).ravel()

### Imputing the mean

In [None]:
# Imputing the mean (one imputer fitted on all three columns so it can be reused on test)
mean_cols = ['ps_reg_03','ps_car_12','ps_car_14']
mean_imp = SimpleImputer(missing_values=-1, strategy='mean')
train[mean_cols] = mean_imp.fit_transform(train[mean_cols])

In [None]:
# test dset: reuse the means learned from train instead of refitting on test
test[mean_cols] = mean_imp.transform(test[mean_cols])

## Check the cardinality
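- **cardinality**: the number of distinct values in a set [Reference](http://www.dbguide.net/knowledge.db?cmd=view&boardConfigUid=30&boardUid=142699&boardStep=1)
  - e.g. gender has only 2 groups -> a low-cardinality attribute
  - e.g. a resident registration number has a huge number of distinct values -> a high-cardinality attribute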
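
A minimal sketch of the two examples above, using a made-up frame (the column names `gender` and `member_id` are hypothetical): `DataFrame.nunique()` reports the cardinality of every column.

```python
import pandas as pd

# Hypothetical toy data: `gender` has low cardinality, `member_id` has high cardinality.
df = pd.DataFrame({
    'gender': ['F', 'M', 'F', 'M', 'F'],
    'member_id': ['a01', 'a02', 'a03', 'a04', 'a05'],
})

# Number of distinct values per column.
cardinality = df.nunique()
print(cardinality)
```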


In [None]:
v = meta[(meta.level=='nominal') & (meta.keep)].index

for f in v:
  dist_values = train[f].value_counts().shape[0] # number of distinct values
  print(f'Variable {f} has {dist_values} distinct values')

- *ps_car_11_cat* has many distinct values

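**Honestly, I was not sure why this part is needed; presumably it is because dummy-encoding a high-cardinality variable such as *ps_car_11_cat* would add one column per distinct value, whereas target encoding replaces it with a single numeric column.**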

In [None]:
# Script by https://www.kaggle.com/ogrellier
# Code: https://www.kaggle.com/ogrellier/python-target-encoding-for-categorical-features

def add_noise(series, noise_level):
  return series * (1 + noise_level*np.random.randn(len(series)))

In [None]:
# Smoothing is computed like in the following paper by Daniele Micci-Barreca
# https://kaggle2.blob.core.windows.net/forum-message-attachments/225952/7441/high%20cardinality%20categoricals.pdf

def target_encode(trn_series=None,# training categorical feature
                  tst_series=None,# test categorical feature
                  target=None,# target data
                  min_samples_leaf=1,# minimum samples to take category average into account
                  smoothing=1,# smoothing effect to balance categorical average vs prior
                  noise_level=0):
  assert len(trn_series)==len(target)
  assert trn_series.name==tst_series.name
  temp = pd.concat([trn_series, target],axis=1)

  # target mean
  averages = temp.groupby(by=trn_series.name)[target.name].agg(['mean','count'])

  # smoothing
  smoothing = 1 / (1+np.exp(-(averages['count']-min_samples_leaf)/smoothing))

  # average to all target data
  prior = target.mean()

  # the bigger the count the less full_avg is taken into account
  averages[target.name] = prior*(1-smoothing) + averages['mean']*smoothing
  averages.drop(['mean','count'], axis=1, inplace=True)

  # averages to trn and tst series
  ft_trn_series = pd.merge(trn_series.to_frame(trn_series.name),
                           averages.reset_index().rename(columns={'index':target.name, target.name:'average'}),
                           on=trn_series.name,
                           how='left')['average'].rename(trn_series.name+'_mean').fillna(prior)
  
  # pd.merge does not keep the original index, so restore it
  ft_trn_series.index=trn_series.index
  ft_tst_series = pd.merge(tst_series.to_frame(tst_series.name),
                           averages.reset_index().rename(columns={'index':target.name, target.name:'average'}),
                           on=tst_series.name,
                           how='left')['average'].rename(trn_series.name+'_mean').fillna(prior)
  
  ft_tst_series.index = tst_series.index
  return add_noise(ft_trn_series, noise_level), add_noise(ft_tst_series, noise_level)
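
To make the smoothing step in `target_encode` more concrete, the sketch below evaluates the same sigmoid weight for a few category sizes, using the `min_samples_leaf=100` and `smoothing=10` settings this notebook passes in. The helper names, the prior of 0.10 and the category mean of 0.30 are assumptions chosen for illustration.

```python
import math

def smoothing_weight(count, min_samples_leaf=100, smoothing=10):
    """Weight given to the category mean; the prior receives 1 - weight."""
    return 1 / (1 + math.exp(-(count - min_samples_leaf) / smoothing))

def blended_mean(cat_mean, count, prior, **kwargs):
    """Blend of the category mean and the prior, as in target_encode."""
    w = smoothing_weight(count, **kwargs)
    return prior * (1 - w) + cat_mean * w

toy_prior = 0.10  # hypothetical overall target rate
for count in (10, 100, 500):
    print(count,
          round(smoothing_weight(count), 4),
          round(blended_mean(0.30, count, toy_prior), 4))
```

Rare categories are pulled almost entirely toward the prior, a category with exactly `min_samples_leaf` rows gets an even blend, and frequent categories keep their own mean.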
  

### encoding categorical vars.
: *ps_car_11_cat* has many distinct values, so it is target-encoded into *ps_car_11_cat_te*

In [None]:
train_encoded, test_encoded = target_encode(train['ps_car_11_cat'],test['ps_car_11_cat'],
                                            target=train.target,
                                            min_samples_leaf=100,
                                            smoothing=10,
                                            noise_level=0.01)

In [None]:
# Encoding train set
train['ps_car_11_cat_te'] = train_encoded
train.drop('ps_car_11_cat',axis=1, inplace=True)
meta.loc['ps_car_11_cat','keep'] = False

In [None]:
# Encoding test set
test['ps_car_11_cat_te'] = test_encoded
test.drop('ps_car_11_cat',axis=1, inplace=True)

# EDA

## Categorical variables

In [None]:
v = meta[(meta.level=='nominal') & (meta.keep)].index

for f in v:
  fig, ax = plt.subplots(figsize=(20,10))

  # share of target=1 for each label of the categorical variable
  cat_perc = train[[f,'target']].groupby([f],as_index=False).mean()
  cat_perc.sort_values(by='target',ascending=False, inplace=True)

  # Bar plot
  sns.barplot(ax=ax, x=f, y='target', data=cat_perc, order=cat_perc[f])
  plt.ylabel('% target',fontsize=18)
  plt.xlabel(f,fontsize=18)
  plt.tick_params(axis='both',which='major',labelsize=18)
  plt.show()

- For the six variables *ps_car_09_cat*, *ps_car_07_cat*, *ps_car_01_cat*, *ps_ind_05_cat*, *ps_ind_04_cat* and *ps_ind_02_cat*, the missing-value category has the highest share of target=1 -> the missing values cannot simply be removed

## Interval variables

In [None]:
def corr_heatmap(v):
  correlations = train[v].corr()
  cmap = sns.diverging_palette(220,10,as_cmap=True)
  fig,ax = plt.subplots(figsize=(10,10))
  sns.heatmap(correlations, cmap=cmap, vmax=1.0, center=0, fmt='.2f',
              square=True,linewidths=.5,annot=True,cbar_kws={'shrink':.75})
  plt.show();

In [None]:
v = meta[(meta.level=='interval') & (meta.keep)].index
corr_heatmap(v)

- Some interval variables are strongly correlated with each other (possible multicollinearity)
  - (**0.7**) *ps_reg_02* and *ps_reg_03*
  - (**0.67**) *ps_car_12* and *ps_car_13*
  - (**0.58**) *ps_car_12* and *ps_car_14*
  - (**0.53**) *ps_car_13* and *ps_car_15*
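
Pairs like the ones listed above can also be pulled out of the correlation matrix programmatically rather than read off the heatmap. The following is a minimal sketch on synthetic data; the helper `strong_pairs`, the 0.5 cutoff and the toy columns are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def strong_pairs(df, threshold=0.5):
    """Return (var_a, var_b, corr) for pairs whose |correlation| exceeds `threshold`."""
    corr = df.corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is skipped.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    pairs = pairs[pairs.abs() > threshold]
    return [(a, b, round(c, 2)) for (a, b), c in pairs.items()]

# Hypothetical toy frame: `b` is a noisy copy of `a`, `c` is unrelated noise.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    'a': a,
    'b': a + 0.1 * rng.normal(size=500),
    'c': rng.normal(size=500),
})
result = strong_pairs(toy)
print(result)
```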

### Multicollinearity

In [None]:
s = train.sample(frac=0.1)

#### *ps_reg_02* and *ps_reg_03*

In [None]:
sns.lmplot(x='ps_reg_02', y='ps_reg_03',data=s,
           hue='target',palette='Set1',scatter_kws={'alpha':.3})
plt.show()

- The regression lines for target=1 and target=0 almost coincide

#### *ps_car_12* and *ps_car_13*

In [None]:
sns.lmplot(x='ps_car_12', y='ps_car_13',data=s,
           hue='target',palette='Set1',scatter_kws={'alpha':.3})
plt.show()

#### *ps_car_12* and *ps_car_14*

In [None]:
sns.lmplot(x='ps_car_12', y='ps_car_14',data=s,
           hue='target',palette='Set1',scatter_kws={'alpha':.3})
plt.show()

#### *ps_car_13* and *ps_car_15*

In [None]:
sns.lmplot(x='ps_car_15', y='ps_car_13',data=s,
           hue='target',palette='Set1',scatter_kws={'alpha':.3})
plt.show()

- Some variables are strongly correlated, but there are only a few of them, so this kernel does not apply PCA

## Ordinal variables

In [None]:
v = meta[(meta.level=='ordinal') & (meta.keep)].index
corr_heatmap(v)

- No pair of ordinal variables is strongly correlated, so there is no sign of multicollinearity among them

# Feature engineering

## dummy variables
: dummification

In [None]:
print(f'Before dummification: {train.shape[1]} vars.')

v = meta[(meta.level=='nominal') & (meta.keep)].index # nominal variables to turn into dummies
train = pd.get_dummies(train, columns=v, drop_first=True)
print(f'After dummification: {train.shape[1]} vars.')

In [None]:
# test dset
print(f'Before dummification: {test.shape[1]} vars.')

v = meta[(meta.level=='nominal') & (meta.keep)].index # nominal variables to turn into dummies
test = pd.get_dummies(test, columns=v, drop_first=True)
print(f'After dummification: {test.shape[1]} vars.')
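
Because train and test are dummified separately, their columns only match when both contain the same categories. The sketch below shows one way to guard against a mismatch by reindexing test to the train columns. The toy frames are assumptions for illustration, and whether this guard is needed for this dataset has not been checked here.

```python
import pandas as pd

# Hypothetical toy frames: the test set lacks category 2 of `cat`.
train_toy = pd.DataFrame({'cat': [0, 1, 2, 1], 'x': [0.1, 0.2, 0.3, 0.4]})
test_toy = pd.DataFrame({'cat': [0, 1, 1], 'x': [0.5, 0.6, 0.7]})

train_d = pd.get_dummies(train_toy, columns=['cat'], drop_first=True)
test_d = pd.get_dummies(test_toy, columns=['cat'], drop_first=True)

# Align test to the train columns; dummy columns absent from test are filled with 0.
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)
print(list(train_d.columns))
print(list(test_d.columns))
```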

## interaction variables
: interactions between the interval variables

In [None]:
print(f'Before interactions: {train.shape[1]} vars.')

v = meta[(meta.level=='interval') & (meta.keep)].index
poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)

interactions = pd.DataFrame(data=poly.fit_transform(train[v]),
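                            columns=poly.get_feature_names(v)) # on scikit-learn >= 1.0 use poly.get_feature_names_out(v)
interactions.drop(v,axis=1, inplace=True) # drop the original columns, keeping only the new degree-2 terms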
train = pd.concat([train, interactions], axis=1)
print(f'After interactions: {train.shape[1]} vars.')

In [None]:
# test dset

print(f'Before interactions: {test.shape[1]} vars.')

v = meta[(meta.level=='interval') & (meta.keep)].index
poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)

interactions = pd.DataFrame(data=poly.fit_transform(test[v]),
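                            columns=poly.get_feature_names(v)) # on scikit-learn >= 1.0 use poly.get_feature_names_out(v)
interactions.drop(v,axis=1, inplace=True) # drop the original columns, keeping only the new degree-2 terms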
test = pd.concat([test, interactions], axis=1)
print(f'After interactions: {test.shape[1]} vars.')
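
With `degree=2`, `interaction_only=False` and `include_bias=False`, the new columns are the squares and pairwise products of the inputs, that is n(n+1)/2 extra columns for n interval variables. The pure-Python sketch below enumerates those terms for three variables; the helper names are hypothetical, and the naming only mirrors scikit-learn's convention rather than calling it.

```python
from itertools import combinations_with_replacement

def degree2_terms(names):
    """Names of the degree-2 terms (squares and pairwise products) for the given inputs."""
    return [f'{a}^2' if a == b else f'{a} {b}'
            for a, b in combinations_with_replacement(names, 2)]

def degree2_values(row):
    """Values of those terms for one row, in the same order as degree2_terms."""
    return [a * b for a, b in combinations_with_replacement(row, 2)]

names = ['ps_reg_01', 'ps_reg_02', 'ps_reg_03']
terms = degree2_terms(names)
values = degree2_values([1.0, 2.0, 3.0])
print(terms)
print(values)
```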

# Feature selection

## VarianceThreshold
: remove low or zero variance features

In [None]:
selector = VarianceThreshold(threshold=.01) # flag features whose variance does not exceed 0.01
X_vt = train.drop(['id','target'],axis=1) # fit on everything except id and target
selector.fit(X_vt)

v = X_vt.columns[~selector.get_support()] # names of the low-variance features
print(f'{len(v)} variables have too low variance')

- In total **28** variables fall below the threshold -> next, try a model-based selector that removes more variables
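
For a 0/1 dummy column the variance is p(1-p), where p is the share of ones, so a threshold of 0.01 mostly catches dummies that are almost always 0 (or almost always 1). The NumPy sketch below illustrates this on made-up columns; `low_variance_mask` is a hypothetical helper that mimics the rule, not the scikit-learn class itself.

```python
import numpy as np

def low_variance_mask(X, threshold=0.01):
    """True for columns whose population variance does not exceed `threshold`."""
    return X.var(axis=0) <= threshold

# Hypothetical dummy columns whose share of ones is 0.5%, 5% and 50%.
n = 1000
X = np.zeros((n, 3))
X[:5, 0] = 1     # p = 0.005 -> variance 0.005 * 0.995
X[:50, 1] = 1    # p = 0.05  -> variance 0.05 * 0.95
X[:500, 2] = 1   # p = 0.5   -> variance 0.5 * 0.5

variances = X.var(axis=0)
flagged = low_variance_mask(X)
print(variances, flagged)
```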

## SelectFromModel
: select variables using a RandomForest model

In [None]:
X_train = train.drop(['id','target'],axis=1)
y_train = train['target']

feat_labels = X_train.columns # feature names only

rf = RandomForestClassifier(n_estimators=1000, random_state=0, n_jobs=-1) # random forest model
rf.fit(X_train, y_train) # train the model

importances = rf.feature_importances_ # feature importances from the fitted model

# print features importances
indices = np.argsort(rf.feature_importances_)[::-1]
#for f in range(X_train.shape[1]):
#  print("%2d) %-*s %f" % (f+1, 30, feat_labels[indices[f]], importances[indices[f]]))

In [None]:
# select from RandomForest
print(f'Before selection:{X_train.shape[1]}')

sfm = SelectFromModel(rf, threshold='median', prefit=True) # keep features whose importance is at or above the median (about the top 50%)
n_features = sfm.transform(X_train).shape[1] # number of selected features
print(f'After selection:{n_features}')

In [None]:
selected_vars = list(feat_labels[sfm.get_support()]) # list of selected variable names
train = train[selected_vars + ['target']] # train set without id

In [None]:
# test dset
test = test[selected_vars] # test set without id
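
The `threshold='median'` rule keeps the features whose importance is at least the median importance. A minimal NumPy sketch of that rule, with made-up importances and feature names (both are assumptions for illustration):

```python
import numpy as np

# Hypothetical importances for six features.
names = np.array(['f0', 'f1', 'f2', 'f3', 'f4', 'f5'])
toy_importances = np.array([0.30, 0.05, 0.20, 0.10, 0.25, 0.10])

# Features at or above the median importance are kept.
threshold = np.median(toy_importances)
keep = toy_importances >= threshold
selected = names[keep].tolist()
print(threshold, selected)
```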

# Feature scaling
: Standardization

In [None]:
scaler = StandardScaler()
scaler.fit_transform(train.drop(['target'], axis=1))

In [None]:
# test dset: apply the mean and std learned from train instead of refitting on test
scaler.transform(test)
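
Standardization rescales each column as z = (x - mean) / std, and the mean and std should come from the training data so that train and test share one scale. The sketch below shows this with plain NumPy on made-up values; the arrays are assumptions for illustration.

```python
import numpy as np

# Hypothetical single-column train and test data.
train_x = np.array([[1.0], [2.0], [3.0], [4.0]])
test_x = np.array([[2.5], [5.0]])

# Statistics are learned from train only.
mean = train_x.mean(axis=0)
std = train_x.std(axis=0)  # population std (ddof=0), as StandardScaler uses

train_z = (train_x - mean) / std
test_z = (test_x - mean) / std  # test is scaled with the train statistics
print(train_z.ravel(), test_z.ravel())
```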

# XGBoost CV(LB .284)

## Loading Packages

In [None]:
# model
from xgboost import XGBClassifier

# model selection
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold

# utilities (JIT compilation, garbage collection, timing)
from numba import jit
import gc
import time

In [None]:
max_rounds = 400
optimize_rounds = False
learning_rate = 0.07
early_stopping_rounds = 50 # stop when the validation score has not improved for 50 rounds

# gini


In [None]:
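# Gini coefficient
# from CPMP's kernel https://www.kaggle.com/cpmpml/extremely-fast-gini-computation

@jit(nopython=True)
def _gini_sorted(y_sorted):
  # y_sorted: labels ordered by ascending predicted probability
  ntrue, gini, delta = 0.0, 0.0, 0.0
  n = len(y_sorted)

  for i in range(n-1,-1, -1):
    y_i = y_sorted[i]
    ntrue += y_i
    gini += y_i *delta
    delta += 1-y_i

  return 1-2*gini / (ntrue*(n-ntrue))

def eval_gini(y_true, y_prob):
  # convert to NumPy outside the compiled function so pandas Series are accepted
  y_true = np.asarray(y_true)
  y_prob = np.asarray(y_prob)
  return _gini_sorted(y_true[np.argsort(y_prob)])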
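
The normalized Gini computed above equals 2*AUC - 1. The pure-Python sketch below checks this on a tiny made-up example by comparing a pairwise AUC with a port of the same loop; both helper names and the toy labels and scores are assumptions for illustration.

```python
def auc_pairwise(y_true, y_prob):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def gini_loop(y_true, y_prob):
    """Pure-Python port of the loop-based Gini: sort by score, then scan from the top."""
    order = sorted(range(len(y_prob)), key=lambda i: y_prob[i])
    y_sorted = [y_true[i] for i in order]
    ntrue = gini = delta = 0
    for y_i in reversed(y_sorted):
        ntrue += y_i
        gini += y_i * delta
        delta += 1 - y_i
    return 1 - 2 * gini / (ntrue * (len(y_sorted) - ntrue))

# Hypothetical labels and scores without ties.
toy_y = [0, 0, 1, 0, 1, 1]
toy_p = [0.1, 0.4, 0.35, 0.2, 0.8, 0.7]

gini_a = 2 * auc_pairwise(toy_y, toy_p) - 1
gini_b = gini_loop(toy_y, toy_p)
print(gini_a, gini_b)
```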

In [None]:

# https://www.kaggle.com/ogrellier/xgb-classifier-upsampling-lb-0-283

def gini_xgb(preds, dtrain):
  labels = dtrain.get_label()
  gini_score = -eval_gini(labels, preds) # negated because XGBoost minimizes custom metrics by default
  return [('gini',gini_score)]

- Data handling follows **Data Preparation & Exploration**

In [None]:
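y = train['target']

y_valid_pred = 0.0*y # out-of-fold predictions (float so probabilities can be stored)
y_test_pred = 0 # running sum of test predictions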

# K-fold
: set `k=5`

In [None]:
train_df = train.drop(columns=['id','target'], errors='ignore') # id was already dropped during feature selection

In [None]:
# K-fold setup

k = 5
kf = KFold(n_splits=k, random_state=1, shuffle=True)
np.random.seed(0)

In [None]:
# build the model
model = XGBClassifier(
    n_estimators=max_rounds,
    max_depth=4,
    objective='binary:logistic',
    learning_rate=learning_rate,
    subsample=0.8,
    min_child_weight=6,
    colsample_bytree=0.8,
    scale_pos_weight=1.6,
    gamma=10,
    reg_alpha=8,
    reg_lambda=1.3
)

In [None]:
# run K-fold

for i,(train_index,test_index) in enumerate(kf.split(train_df)):
  y_train, y_valid = y.iloc[train_index].copy(), y.iloc[test_index]
  X_train, X_valid = train_df.iloc[train_index,:].copy(), train_df.iloc[test_index,:].copy()
  X_test = test.copy()
  print('\nFold ',i)

  # the encoding step is omitted here

  # fit on this fold
  if optimize_rounds:
    # note: recent XGBoost releases expect eval_metric and early_stopping_rounds
    # in the XGBClassifier constructor rather than in fit()
    eval_set=[(X_valid,y_valid)]
    fit_model = model.fit(X_train, y_train,
                          eval_set=eval_set,
                          eval_metric=gini_xgb,
                          early_stopping_rounds=early_stopping_rounds,
                          verbose=False)
    print(f'  Best N trees = {model.best_ntree_limit}')
    print(f'  Best gini = {model.best_score}')
  else:
    fit_model = model.fit(X_train, y_train)

  # validation predictions
  pred = fit_model.predict_proba(X_valid)[:,1]
  print(f'  Gini = {eval_gini(y_valid, pred)}')
  y_valid_pred.iloc[test_index] = pred

  # test dset prediction
  y_test_pred += fit_model.predict_proba(X_test)[:,1]
  del X_test, X_train, X_valid, y_train # free memory

y_test_pred /= k # average the test predictions over the folds
print(f'Gini for full training set:{eval_gini(y,y_valid_pred)}') # final Gini on the out-of-fold predictions
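
The loop above follows the usual out-of-fold pattern: each training row is predicted exactly once, by the model that did not see it, while the test predictions are averaged over the k models. The NumPy sketch below reproduces only that bookkeeping, with a stand-in "model" that predicts the training-fold mean; the data and the stand-in model are assumptions for illustration, not the XGBoost setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_folds = 10, 5
toy_target = rng.integers(0, 2, size=n_rows).astype(float)

# Shuffle the row positions once and cut them into k validation folds.
folds = np.array_split(rng.permutation(n_rows), n_folds)

oof_pred = np.zeros(n_rows)  # one out-of-fold prediction per training row
test_pred = 0.0              # running sum of predictions for a single test row

for valid_idx in folds:
    train_idx = np.setdiff1d(np.arange(n_rows), valid_idx)
    # Stand-in "model": predict the mean target of the training part of the fold.
    fold_mean = toy_target[train_idx].mean()
    oof_pred[valid_idx] = fold_mean
    test_pred += fold_mean

test_pred /= n_folds  # average over the k fold models
print(oof_pred, test_pred)
```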