# Clustering
FIRA Big Data Platform Course <Data Mining> - 2017.08.18 (Fri) 09:00-13:00

### 1. Data
- 1-1. `german_predictions.csv`
- 1-2. Preprocessing : dummy variables
- 1-3. Train-Test Split : `sklearn.model_selection.train_test_split`

### 2. Linear Regression
- 2-1. `sklearn.linear_model.LinearRegression`
- 2-2. Evaluation Measure 'RMS' : `sklearn.metrics.mean_squared_error`

### 3. k-nn
- 3-1. `sklearn.neighbors.KNeighborsRegressor`
- 3-2. Evaluation Measure 'RMS' : `sklearn.metrics.mean_squared_error`

### 4. SVM
- 4-1. `sklearn.svm.LinearSVR`
- 4-2. Evaluation Measure 'RMS' : `sklearn.metrics.mean_squared_error`

### 5. Cross-Validation Test
- 5-1. Motivation
- 5-2. K-Fold Cross Validation : `sklearn.model_selection.cross_val_score`

### 6. Exercise : Credit Amount Prediction Model

### 1. Data
---

#### 1-1.  `german_predictions.csv`
---
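`german_predictions.csv` contains data on 1,000 applicants who applied for credit at a German bank. Excluding the observation number (OBS#), it has 28 variables, and each applicant's credit amount (AMOUNT) is the variable we want to predict.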

In [4]:
import pandas as pd

In [5]:
# Description of each variable : CodeList
pd.read_csv('CodeList_predictions.csv').iloc[:, :4]

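| | Variable | Description | Variable type | Code description |
|---|---|---|---|---|
| 0 | OBS# | Observation number | - | Serial number |
| 1 | CHK_ACCT | Checking account status | Categorical | 0: less than 0 DM, 1: 0 to less than 200 DM, 2: 200 DM or more, 3... |
| 2 | DURATION | Credit duration in months | Numerical | |
| 3 | HISTORY | Credit history | Categorical | 0: no credit history, 1: repaid on time, 2: repayment not yet due, 3: overdue... |
| 4 | NEW_CAR | Purpose of credit | Binary | New car; 0: no, 1: yes |
| 5 | USED_CAR | Purpose of credit | Binary | Used car; 0: no, 1: yes |
| 6 | FURNITURE | Purpose of credit | Binary | Furniture/kitchen appliances; 0: no, 1: yes |
| 7 | RADIO/TV | Purpose of credit | Binary | Radio/TV; 0: no, 1: yes |
| 8 | EDUCATION | Purpose of credit | Binary | Education; 0: no, 1: yes |
| 9 | RETAINING | Purpose of credit | Binary | Retraining; 0: no, 1: yes |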


In [6]:
# df
df = pd.read_csv('german_predictions.csv').set_index('OBS#')

#### 1-2. Preprocessing : dummy variables

In [7]:
# dummy variables
dummy_cols = ['CHK_ACCT', 'HISTORY', 'SAV_ACCT', 'EMPLOYMENT', 'PRESENT_RESIDENT', 'JOB']
preprocessed_df = pd.get_dummies(df, columns=dummy_cols)

In [8]:
# split X, y
X, y = preprocessed_df.drop('AMOUNT', axis=1), preprocessed_df.AMOUNT
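
The dummy-variable step can be hard to picture on 28 columns, so here is a minimal sketch on a three-row toy frame. The values and the name `toy_df` are made up for illustration and are not taken from `german_predictions.csv`; only the `pd.get_dummies` call and the X/y split mirror the cells above.

```python
import pandas as pd

# Hypothetical toy data: JOB is a categorical code, AMOUNT is the target.
toy_df = pd.DataFrame({
    "DURATION": [6, 48, 12],
    "JOB": [2, 2, 1],
    "AMOUNT": [1169, 5951, 2096],
})

# Each distinct JOB code becomes its own indicator column (JOB_1, JOB_2).
toy_encoded = pd.get_dummies(toy_df, columns=["JOB"])

# Separate the features from the target, as in the cell above.
toy_X, toy_y = toy_encoded.drop("AMOUNT", axis=1), toy_encoded["AMOUNT"]
print(list(toy_X.columns))
```

The original `JOB` column disappears and is replaced by one indicator column per code, which is why the preprocessed frame has more columns than the raw one.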

#### 1-3. Train-Test Split : `sklearn.model_selection.train_test_split`
---
For supervised learning, split the given data into a train set and a test set.

In [9]:
# import packages
from sklearn.model_selection import train_test_split

In [10]:
# train_test_split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [11]:
# random_state: fix the seed so that the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=33)
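
As a small check of what `random_state` buys us, the sketch below splits a synthetic 10-row array twice with the same seed and confirms the two splits match. The arrays `X_demo` and `y_demo` are stand-ins invented for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (10 rows), not the course data set.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Same random_state twice: the two splits are identical.
split_a = train_test_split(X_demo, y_demo, test_size=0.3, random_state=33)
split_b = train_test_split(X_demo, y_demo, test_size=0.3, random_state=33)
same_test_rows = np.array_equal(split_a[3], split_b[3])

# The return order is X_train, X_test, y_train, y_test.
n_train, n_test = len(split_a[0]), len(split_a[1])
print(same_test_rows, n_train, n_test)
```

Without `random_state`, each call draws a fresh shuffle, so the RMS values reported later would change from run to run.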

### 2. Linear Regression
---

#### 2-1. `sklearn.linear_model.LinearRegression`
---

In [12]:
# import packages
from sklearn.linear_model import LinearRegression

In [13]:
# Parameters
# fit_intercept = True  # whether to include an intercept (constant) term
# normalize = False  # whether to normalize the features before fitting the model

In [14]:
# lr_model
lr_model = LinearRegression()

In [15]:
# fit to model
lr_model.fit(X=X_train, y=y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [16]:
# get predictions
y_pred_lr = lr_model.predict(X=X_test)

In [17]:
# result_y_test_df 
result_y_test_df = pd.DataFrame(y_test)
result_y_test_df['Linear Regression'] = y_pred_lr

#### 2-2. Evaluation Measure 'RMS' : `sklearn.metrics.mean_squared_error`
---

In [18]:
from sklearn.metrics import mean_squared_error
import math

In [19]:
error_lr = mean_squared_error(y_pred=y_pred_lr, y_true=y_test)
rms_lr = math.sqrt(error_lr)
rms_lr

1747.6825239811367
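
The 'RMS' measure used here is the square root of the mean squared error, $\sqrt{\frac{1}{n}\sum_i (y_i - \hat{y}_i)^2}$. The sketch below computes it by hand and via `mean_squared_error` on three made-up values, to show that the two agree.

```python
import math
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up values for illustration only.
y_true_demo = np.array([100.0, 200.0, 300.0])
y_pred_demo = np.array([110.0, 190.0, 330.0])

# By hand: square the residuals, average them, take the square root.
rms_by_hand = math.sqrt(np.mean((y_true_demo - y_pred_demo) ** 2))

# Same quantity via scikit-learn's MSE, as in the cell above.
rms_sklearn = math.sqrt(mean_squared_error(y_true=y_true_demo, y_pred=y_pred_demo))
print(rms_by_hand, rms_sklearn)
```

Because the errors are squared before averaging, the single residual of 30 dominates the result, and the RMS is expressed in the same unit as the target (here, the credit amount).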

### 3. k-nn
---

#### 3-1. `sklearn.neighbors.KNeighborsRegressor`
---

In [20]:
from sklearn.neighbors import KNeighborsRegressor

In [21]:
# set parameters
n_neighbors = 4
# weights = 'uniform'  # weight all neighbors equally; with 'distance', weights are inversely proportional to distance

In [22]:
# knn_model
knn_model = KNeighborsRegressor(n_neighbors=n_neighbors)

In [23]:
# fit to model
knn_model.fit(X=X_train, y=y_train)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=1, n_neighbors=4, p=2,
          weights='uniform')
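
To make the `weights` option concrete, here is a tiny one-dimensional sketch with made-up points. With two neighbors at distances 0.25 and 0.75, `'uniform'` takes the plain average of their targets, while `'distance'` weights them by 1/distance, pulling the prediction toward the closer point.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy 1-D training set (hypothetical values).
X_toy = np.array([[0.0], [1.0], [3.0]])
y_toy = np.array([0.0, 10.0, 30.0])
query = np.array([[0.25]])  # nearest neighbors are x=0 (d=0.25) and x=1 (d=0.75)

# Uniform: (0 + 10) / 2
uniform_pred = KNeighborsRegressor(n_neighbors=2, weights="uniform").fit(X_toy, y_toy).predict(query)[0]

# Distance: (4 * 0 + (4/3) * 10) / (4 + 4/3), with weights 1/0.25 and 1/0.75
distance_pred = KNeighborsRegressor(n_neighbors=2, weights="distance").fit(X_toy, y_toy).predict(query)[0]
print(uniform_pred, distance_pred)
```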

In [24]:
# get predictions
y_pred_knn = knn_model.predict(X=X_test)

In [25]:
# result_y_test_df
result_y_test_df['knn'] = y_pred_knn

#### 3-2. Evaluation Measure 'RMS' : `sklearn.metrics.mean_squared_error`
---

In [26]:
# import packages
from sklearn.metrics import mean_squared_error
import math

In [27]:
# calculate error
error_knn = mean_squared_error(y_pred=y_pred_knn, y_true=y_test)
rms_knn = math.sqrt(error_knn)
rms_knn

2139.1106893403153

### 4. SVM
---

#### 4-1. `sklearn.svm.LinearSVR`
---

In [28]:
# import packages
from sklearn.svm import LinearSVR

In [29]:
# set parameters
# C=1.0  # soft-margin penalty: a larger C tolerates fewer margin violations

In [30]:
svm_model = LinearSVR(C=10)

In [31]:
# fit to model
svm_model.fit(X=X_train, y=y_train)

LinearSVR(C=10, dual=True, epsilon=0.0, fit_intercept=True,
     intercept_scaling=1.0, loss='epsilon_insensitive', max_iter=1000,
     random_state=None, tol=0.0001, verbose=0)
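
The repr above shows `loss='epsilon_insensitive'` with `epsilon=0.0`. As a rough sketch of what that loss measures, the helper below (a hypothetical function written for this note, not part of scikit-learn) computes the per-sample value max(0, |y - ŷ| - ε); in the standard formulation, `C` multiplies the sum of these terms in the objective.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.0):
    """Per-sample loss max(0, |y - y_hat| - epsilon)."""
    residual = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    return np.maximum(0.0, residual - epsilon)

# Made-up targets and predictions.
losses_eps0 = epsilon_insensitive_loss([10, 20, 30], [12, 20, 25], epsilon=0.0)
losses_eps3 = epsilon_insensitive_loss([10, 20, 30], [12, 20, 25], epsilon=3.0)
print(losses_eps0, losses_eps3)
```

With `epsilon=0.0` every absolute error counts; with a positive epsilon, errors smaller than epsilon are ignored entirely.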

In [32]:
# get predictions
y_pred_svm = svm_model.predict(X=X_test)

In [33]:
# result_y_test_df
result_y_test_df['SVM'] = y_pred_svm 

#### 4-2. Evaluation Measure 'RMS' : `sklearn.metrics.mean_squared_error`
---

In [34]:
# import packages
from sklearn.metrics import mean_squared_error
import math

In [35]:
# calculate error
error_svm = mean_squared_error(y_pred=y_pred_svm, y_true=y_test)
rms_svm = math.sqrt(error_svm)
rms_svm

1847.0647133190648

### 5. Cross Validation Test
---
#### 5-1. Motivation
---

In [36]:
# let's try a different train/test split
X_new_train, X_new_test, y_new_train, y_new_test = train_test_split(X, y, test_size=0.3, random_state=5)

In [37]:
# get predictions on the new test set using the already-trained model
y_pred_new = knn_model.predict(X=X_new_test)

In [38]:
# rms
error_new = mean_squared_error(y_pred=y_pred_new, y_true=y_new_test)
rms_new = math.sqrt(error_new)
rms_new

1598.221750362988
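
One plausible reason this RMS (1598.22) is lower than the earlier k-nn RMS (2139.11) is that the new test set contains rows the model already saw during training. The sketch below reproduces the two splits on row ids only (a stand-in for the 1000 OBS# values) and measures that overlap; it is an illustration under that assumption, not a rerun of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

row_ids = np.arange(1000)  # stand-in for the 1000 OBS# values

# The split the model was trained on, and the re-drawn split.
train_ids_old, test_ids_old = train_test_split(row_ids, test_size=0.3, random_state=33)
train_ids_new, test_ids_new = train_test_split(row_ids, test_size=0.3, random_state=5)

# Rows of the new test set that were part of the original training set.
seen_in_training = np.intersect1d(train_ids_old, test_ids_new)
share_seen = len(seen_in_training) / len(test_ids_new)
print(f"{share_seen:.0%} of the new test rows were used for training")
```

A single split therefore gives a score that depends heavily on which rows happen to land in the test set, which is what K-fold cross validation is meant to address.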

#### 5-2. K-Fold Cross Validation : `sklearn.model_selection.cross_val_score`
---

In [39]:
# import packages
from sklearn.model_selection import cross_val_score



In [40]:
# rms_scorer(estimator, X, y)
# Defined before the parameters below, which reference it.
def rms_scorer(estimator, X, y):
    y_pred = estimator.predict(X=X)
    error = mean_squared_error(y_pred=y_pred, y_true=y)
    return math.sqrt(error)  # RMS error: lower is better

In [41]:
# set parameters
estimator = lr_model  # the fitted model
cv = 10  # split the data into k groups and repeat model training/testing k times,
         # using a different group as the test set in each round
scorer = rms_scorer  # score function used as the evaluation measure: scorer(estimator, X, y)

In [42]:
# cross_val_score
cv_result = cross_val_score(estimator=estimator, X=X, y=y, scoring=rms_scorer, cv=cv)

In [43]:
# cv_test_result_df
cv_test_result_df = pd.DataFrame(cv_result, columns=['LinearRegression'])
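
To show what `cross_val_score` does internally, the sketch below runs a manual K-fold loop on synthetic data and compares it with the library call. The data, the 5-fold setting, and the name `rms_scorer_demo` are assumptions made for this example; the scorer has the same `scorer(estimator, X, y)` signature as above.

```python
import math
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (not the course data set).
rng = np.random.default_rng(0)
X_cv = rng.normal(size=(100, 3))
y_cv = X_cv @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

def rms_scorer_demo(estimator, X, y):
    return math.sqrt(mean_squared_error(y_true=y, y_pred=estimator.predict(X)))

# Manual loop: each fold is held out once while the other four train the model.
kfold = KFold(n_splits=5)
manual_scores = []
for train_idx, test_idx in kfold.split(X_cv):
    fold_model = LinearRegression().fit(X_cv[train_idx], y_cv[train_idx])
    manual_scores.append(rms_scorer_demo(fold_model, X_cv[test_idx], y_cv[test_idx]))

# Library call with the same folds and scorer.
library_scores = cross_val_score(LinearRegression(), X_cv, y_cv, scoring=rms_scorer_demo, cv=kfold)
print(np.allclose(manual_scores, library_scores))
```

Note that the model is refit from scratch on every fold, so each score is computed on rows that fold's model never saw.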

In [44]:
# run cv test for more models & add cols to cv_test_result_df

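### 6. Exercise : Credit Amount Prediction Model
---
Using the preprocessed data, find the model that best predicts the credit amount.
- You may apply further preprocessing, but describe the steps you performed.
- Compare at least five models. (* Note: the same model with different parameters counts as a different model *)
- You must run a cross validation test (cv of 5 or more) and save the results in a df.
- Pick the best model and explain why.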

##### (Optional) Preprocessing
---
(To be covered later) There are too many variables, so their number needs to be reduced. To pick out the variables suited to each model, we use `sklearn.feature_selection.RFE`.

In [45]:
from sklearn.feature_selection import RFE

In [48]:
X

OBS#,DURATION,NEW_CAR,USED_CAR,FURNITURE,RADIO/TV,EDUCATION,RETAINING,INSTALL_RATE,MALE_DIV,MALE_SINGLE,...,EMPLOYMENT_3,EMPLOYMENT_4,PRESENT_RESIDENT_1,PRESENT_RESIDENT_2,PRESENT_RESIDENT_3,PRESENT_RESIDENT_4,JOB_0,JOB_1,JOB_2,JOB_3
0,6,0,0,0,1,0,0,4,0,1,...,0,1,0,0,0,1,0,0,1,0
1,48,0,0,0,1,0,0,2,0,0,...,0,0,0,1,0,0,0,0,1,0
2,12,0,0,0,0,1,0,2,0,1,...,1,0,0,0,1,0,0,1,0,0
3,42,0,0,1,0,0,0,2,0,1,...,1,0,0,0,0,1,0,0,1,0
4,24,1,0,0,0,0,0,3,0,1,...,0,0,0,0,0,1,0,0,1,0
5,36,0,0,0,0,1,0,2,0,1,...,0,0,0,0,0,1,0,1,0,0
6,24,0,0,1,0,0,0,3,0,1,...,0,1,0,0,0,1,0,0,1,0
7,36,0,1,0,0,0,0,2,0,1,...,0,0,0,1,0,0,0,0,0,1
8,12,0,0,0,1,0,0,2,1,0,...,1,0,0,0,0,1,0,1,0,0
9,30,1,0,0,0,0,0,4,0,0,...,0,0,0,1,0,0,0,0,0,1


In [46]:
#rfe_svm_model
rfe_svm_model = RFE(estimator=svm_model, n_features_to_select=20)
rfe_svm_model.fit(X, y)
svm_X = X[X.columns[rfe_svm_model.support_]]

In [49]:
svm_X

OBS#,DURATION,USED_CAR,FURNITURE,RADIO/TV,INSTALL_RATE,MALE_SINGLE,MALE_MAR_WID,CO-APPLICANT,PROP_UNKN_NONE,NUM_CREDITS,NUM_DEPENDENTS,TELEPHONE,CHK_ACCT_1,CHK_ACCT_3,SAV_ACCT_0,SAV_ACCT_4,EMPLOYMENT_0,EMPLOYMENT_1,PRESENT_RESIDENT_2,JOB_3
0,6,0,0,1,4,1,0,0,0,2,1,1,0,0,0,1,0,0,0,0
1,48,0,0,1,2,0,0,0,0,1,1,0,1,0,1,0,0,0,1,0
2,12,0,0,0,2,1,0,0,0,1,2,0,0,1,1,0,0,0,0,0
3,42,0,1,0,2,1,0,0,0,1,2,0,0,0,1,0,0,0,0,0
4,24,0,0,0,3,1,0,0,1,2,2,0,0,0,1,0,0,0,0,0
5,36,0,0,0,2,1,0,0,1,1,2,1,0,1,0,1,0,0,0,0
6,24,0,1,0,3,1,0,0,0,1,1,0,0,1,0,0,0,0,0,0
7,36,1,0,0,2,1,0,0,0,1,1,1,1,0,1,0,0,0,1,1
8,12,0,0,1,2,0,0,0,0,1,1,0,0,1,0,0,0,0,0,0
9,30,0,0,0,4,0,1,0,0,2,1,0,1,0,1,0,1,0,1,1


In [47]:
#rfe_lr_model
rfe_lr_model = RFE(estimator=lr_model, n_features_to_select=20)
rfe_lr_model.fit(X, y)
lr_X = X[X.columns[rfe_lr_model.support_]]
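
As a rough illustration of how RFE picks columns, the sketch below builds synthetic data in which only three of six features drive the target and checks that RFE keeps exactly those. The data and the choice of three features are assumptions for this example; `support_` is the same boolean mask used to build `svm_X` and `lr_X` above.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_rfe = rng.normal(size=(200, 6))
# Only columns 0, 1 and 4 influence the target; the other columns are irrelevant.
y_rfe = 3.0 * X_rfe[:, 0] - 2.0 * X_rfe[:, 1] + 5.0 * X_rfe[:, 4] + rng.normal(scale=0.01, size=200)

# RFE repeatedly fits the estimator and drops the feature with the smallest coefficient magnitude.
selector = RFE(estimator=LinearRegression(), n_features_to_select=3)
selector.fit(X_rfe, y_rfe)

selected_columns = np.flatnonzero(selector.support_).tolist()
print(selected_columns, selector.ranking_)
```

Because the ranking relies on coefficient magnitudes, features on very different scales can be ranked misleadingly, which is worth keeping in mind for columns such as DURATION next to 0/1 indicators.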

##### Models
---
The models to compare are as follows.
- Linear Regression
- knn : n_neighbors = 2 to 7, weights = ['uniform', 'distance']
- svm : C = [1, 3, 5, 10]

In total, 1 + 6*2 + 4 = 17 models were selected.

In [50]:
# Linear Regression
lr = [LinearRegression()]

# knn
knn_models = list()
for weights in ['uniform', 'distance']:
    for n in range(2, 8):
        knn_model = KNeighborsRegressor(n_neighbors=n, weights=weights)
        knn_models.append(knn_model)

# SVM
svm_cs = [1, 3, 5, 10]
svm_models = [LinearSVR(C=C) for C in svm_cs]

# models
models = lr + knn_models + svm_models

##### Cross-Validation Test
---

In [51]:
from tqdm import tqdm
result_df = pd.DataFrame()
svm_X_result_df = pd.DataFrame()
lr_X_result_df = pd.DataFrame()

for model in tqdm(models):
    result_df[model] = cross_val_score(estimator=model, X=X, y=y, cv=cv, scoring=rms_scorer)
    svm_X_result_df[model] = cross_val_score(estimator=model, X=svm_X, y=y, cv=cv, scoring=rms_scorer)
    lr_X_result_df[model] = cross_val_score(estimator=model, X=lr_X, y=y, cv=cv, scoring=rms_scorer)

100%|██████████| 17/17 [00:01<00:00, 10.91it/s]


In [52]:
lr_X_result_df

Unnamed: 0,"LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)","KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',  metric_params=None, n_jobs=1, n_neighbors=2, p=2,  weights='uniform')","KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',  metric_params=None, n_jobs=1, n_neighbors=3, p=2,  weights='uniform')","KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',  metric_params=None, n_jobs=1, n_neighbors=4, p=2,  weights='uniform')","KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',  metric_params=None, n_jobs=1, n_neighbors=5, p=2,  weights='uniform')","KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',  metric_params=None, n_jobs=1, n_neighbors=6, p=2,  weights='uniform')","KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',  metric_params=None, n_jobs=1, n_neighbors=7, p=2,  weights='uniform')","KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',  metric_params=None, n_jobs=1, n_neighbors=2, p=2,  weights='distance')","KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',  metric_params=None, n_jobs=1, n_neighbors=3, p=2,  weights='distance')","KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',  metric_params=None, n_jobs=1, n_neighbors=4, p=2,  weights='distance')","KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',  metric_params=None, n_jobs=1, n_neighbors=5, p=2,  weights='distance')","KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',  metric_params=None, n_jobs=1, n_neighbors=6, p=2,  weights='distance')","KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',  metric_params=None, n_jobs=1, n_neighbors=7, p=2,  weights='distance')","LinearSVR(C=1, dual=True, epsilon=0.0, fit_intercept=True,  intercept_scaling=1.0, loss='epsilon_insensitive', max_iter=1000,  random_state=None, tol=0.0001, verbose=0)","LinearSVR(C=3, dual=True, epsilon=0.0, fit_intercept=True,  intercept_scaling=1.0, loss='epsilon_insensitive', max_iter=1000,  random_state=None, tol=0.0001, verbose=0)","LinearSVR(C=5, dual=True, epsilon=0.0, fit_intercept=True,  intercept_scaling=1.0, loss='epsilon_insensitive', max_iter=1000,  random_state=None, tol=0.0001, verbose=0)","LinearSVR(C=10, dual=True, epsilon=0.0, fit_intercept=True,  intercept_scaling=1.0, loss='epsilon_insensitive', max_iter=1000,  random_state=None, tol=0.0001, verbose=0)"
0,2540.743328,2688.268044,2536.967131,2483.49334,2513.893762,2552.824278,2645.060476,2978.923996,2902.849155,2878.106765,2895.334929,2931.792585,3005.603021,3781.592056,3545.78819,3439.606931,3250.469944
1,2287.739999,2610.161835,2485.770836,2533.138078,2517.78107,2414.163636,2385.654742,2611.886192,2506.087789,2515.061444,2531.532526,2425.285334,2397.761131,2962.5363,2786.179182,2696.854868,2546.301215
2,2326.062973,2814.065979,2732.290044,2693.429116,2594.787135,2589.793015,2518.21927,2795.161217,2717.442478,2714.686202,2645.453664,2653.002171,2620.669069,3585.293772,3368.460401,3245.952743,3033.665115
3,2698.68503,3488.273963,3146.727214,3050.785694,2988.617176,2993.737664,2912.043409,3590.168344,3261.946514,3200.108445,3166.300611,3185.184266,3114.606376,3614.370134,3428.257131,3329.839442,3139.956519
4,1811.743922,2377.983473,2106.108221,1896.509323,1874.450574,1787.203433,1831.475146,2468.029745,2240.359984,2066.884726,2080.682841,2079.145309,2093.276721,2471.77294,2284.816314,2173.440626,2010.290341
5,1756.039703,2031.590326,1928.798805,1988.324461,2027.147806,1887.20758,1905.960366,2115.515516,2049.396848,2052.771897,2068.189537,2005.492575,2011.072166,2586.525843,2389.122738,2278.306236,2105.619977
6,2208.598924,2621.937473,2426.505445,2357.053855,2207.714859,2257.81267,2192.922711,2794.148572,2669.325026,2672.296054,2618.429024,2668.9722,2645.382884,3171.898097,2961.659467,2866.116996,2697.079606
7,2275.797483,3079.813777,3038.12882,2740.085742,2585.820963,2564.001925,2588.974548,3067.731752,3038.882264,2811.57952,2660.943705,2651.879185,2685.248946,3228.284982,3033.404041,2923.030034,2739.631612
8,2645.69342,3025.278443,2989.797992,2867.220794,2874.244897,2792.669443,2728.717011,2999.275435,3022.24377,2979.919545,2978.650942,2956.71564,2902.1879,3832.195154,3614.499236,3508.507063,3319.489813
9,2584.898722,3043.745954,2933.521131,2838.003564,2892.270095,2867.80192,2909.693619,3084.825844,3060.32197,2990.188579,3044.940587,3034.007784,3062.611082,3704.821052,3504.825109,3398.530841,3214.895176


##### Best Model
---

In [53]:
print(result_df.mean().idxmin(), result_df.mean().min())
print(svm_X_result_df.mean().idxmin(), svm_X_result_df.mean().min())
print(lr_X_result_df.mean().idxmin(), lr_X_result_df.mean().min())

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False) 1848.97308812
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False) 1822.6343169
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False) 2313.6003504
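
The selection rule in the cell above is "lowest mean RMS across folds". The sketch below applies the same `mean().idxmin()` pattern to a made-up two-model table and also looks at the spread across folds, which the mean alone hides. The numbers and model names are invented for illustration and are not results from this notebook.

```python
import pandas as pd

# Made-up per-fold RMS values for two hypothetical models (not real results).
toy_cv_df = pd.DataFrame({
    "model_a": [10.0, 12.0, 11.0],
    "model_b": [9.0, 15.0, 14.0],
})

mean_rms = toy_cv_df.mean()     # average RMS per model across folds
std_rms = toy_cv_df.std()       # spread of RMS across folds
best_model = mean_rms.idxmin()  # model with the lowest mean RMS
print(best_model, mean_rms[best_model], std_rms[best_model])
```

Here `model_b` wins one fold but has both a higher mean and a larger spread, so reporting the standard deviation alongside the mean can make the comparison easier to defend.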


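If a bank wanted to adopt a credit amount prediction model, which model would you recommend?
- Is presenting the best-performing model always the best choice?
- Consider the strengths and weaknesses of each model.
- If you had to explain the models to the bank, how would you explain them?
- What do you think should be considered first when introducing a model into real-world operations?
--
- If practitioners find the best-performing model hard to understand, it may not be accepted. (At least for now) a model needs to be explainable to humans to some degree.
- LR and k-nn are easy to explain, whereas SVM requires comparatively more mathematical background. In terms of performance, the ordering is SVM > k-nn > LR in most cases, although in this exercise LinearRegression had the lowest mean RMS on all three feature sets. k-nn is a nonparametric method and requires no separate training process.
- I would explain each model in terms of how well it ultimately performs and how it can account for its results.
- The first thing to consider when adopting a model in practice is how people will supplement and take responsibility for the work the model performs. However good a model is, it cannot reach 100% performance, so handling the problems that arise from using it becomes important. The model that allows those problems to be handled well is the one that will be chosen.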
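
As one way to back up the point that LR is easy to explain, the sketch below fits a linear regression on synthetic data and reads its coefficients as "amount added per unit of each variable". The column names are borrowed from the data set, but the values and the assumed relationship (100 per month of duration, 500 for a used-car purpose) are made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "DURATION": rng.integers(6, 49, size=200),
    "USED_CAR": rng.integers(0, 2, size=200),
})
# Hypothetical relationship used to generate the target.
toy_amount = 100.0 * toy["DURATION"] + 500.0 * toy["USED_CAR"] + rng.normal(scale=1.0, size=200)

explainer = LinearRegression().fit(toy, toy_amount)

# One coefficient per column: the change in predicted amount per unit change of that column.
coefficients = pd.Series(explainer.coef_, index=toy.columns)
print(coefficients.round(1))
```

A table like this can be shown to a practitioner directly, which is harder to do for k-nn or SVM predictions.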