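## Machine Learning Algorithms - Basics

    - Final preprocessing on the cleaned data (LabelEncoder)
    - Evaluation on validation data (compare training and validation scores)
    - Understanding the results
    - Understanding model parameters
    - Understanding and visualizing the important features
    - [+1] Feature engineering (data)
    - [+1] Hyperparameter optimization (model)
    - Submitting results to Kaggle
    
    Required data (output of the week 1 script; it can be downloaded via the Dropbox link if needed.)
    - train.csv
    - train_clean.csv
    - test_clean.csv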

In [27]:
import pandas as pd
import numpy as np
import time
import operator
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import log_loss, f1_score, accuracy_score

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

## Final data preprocessing
    - Check column types (convert object columns to integers)
    - Remove target labels whose frequency is too low (<10)

In [28]:
# load data
trn = pd.read_csv('../input/train_clean.csv')
target = pd.read_csv('../input/train.csv', usecols=['target'])
tst = pd.read_csv('../input/test_clean.csv')
test_id = tst['ncodpers']
tst.drop(['ncodpers'], axis=1, inplace=True)
trn.drop(['ncodpers'], axis=1, inplace=True)
print(trn.shape, target.shape, tst.shape)

(45619, 22) (45619, 1) (929615, 22)


In [29]:
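# info() prints its report itself and returns None, so call it on each frame instead of wrapping it in print()
trn.info(); target.info(); tst.info()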

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45619 entries, 0 to 45618
Data columns (total 22 columns):
fecha_dato               45619 non-null object
ind_empleado             45619 non-null object
pais_residencia          45619 non-null object
sexo                     45619 non-null object
age                      45619 non-null int64
fecha_alta               45619 non-null object
ind_nuevo                45619 non-null int64
antiguedad               45619 non-null int64
indrel                   45619 non-null int64
ult_fec_cli_1t           45619 non-null object
indrel_1mes              45619 non-null int64
tiprel_1mes              45619 non-null object
indresi                  45619 non-null object
indext                   45619 non-null object
conyuemp                 45619 non-null object
canal_entrada            45619 non-null object
indfall                  45619 non-null object
cod_prov                 45619 non-null int64
nomprov                  45619 non-null object
ind_

In [30]:
# check columns
trn.columns == tst.columns

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True], dtype=bool)

In [31]:
# print columns with object type
for col in trn.columns:
    if trn[col].dtype == 'object':
        print(col)

fecha_dato
ind_empleado
pais_residencia
sexo
fecha_alta
ult_fec_cli_1t
tiprel_1mes
indresi
indext
conyuemp
canal_entrada
indfall
nomprov
segmento


In [32]:
# convert object type into int
for col in trn.columns:
    if trn[col].dtype == 'object':
        lb = LabelEncoder()
        lb.fit(pd.concat([trn[col],tst[col]]))
        trn[col] = lb.transform(trn[col])
        tst[col] = lb.transform(tst[col])
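
The encoder above is fitted on train and test together because `LabelEncoder.transform` raises an error for labels it did not see during `fit`. A minimal sketch of that behavior, using two made-up toy columns (not the Santander data):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy columns: 'C' appears only in the test split
trn_col = pd.Series(['A', 'B', 'A'])
tst_col = pd.Series(['B', 'C'])

# Fitting on the train split alone fails on the unseen label 'C'
try:
    LabelEncoder().fit(trn_col).transform(tst_col)
    unseen_ok = True
except ValueError:
    unseen_ok = False

# Fitting on the concatenation gives one shared integer mapping
lb = LabelEncoder().fit(pd.concat([trn_col, tst_col]))
trn_enc = lb.transform(trn_col)
tst_enc = lb.transform(tst_col)
print(unseen_ok, list(lb.classes_), trn_enc, tst_enc)
```

Note that the resulting integers are arbitrary codes in sorted label order; tree models cope with this, while linear and distance-based models may read an ordering into them that is not there.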

In [33]:
# check column types
for col in trn.columns:
    print(col, trn[col].dtype, tst[col].dtype)

fecha_dato int64 int64
ind_empleado int64 int64
pais_residencia int64 int64
sexo int64 int64
age int64 int64
fecha_alta int64 int64
ind_nuevo int64 int64
antiguedad int64 int64
indrel int64 int64
ult_fec_cli_1t int64 int64
indrel_1mes int64 int64
tiprel_1mes int64 int64
indresi int64 int64
indext int64 int64
conyuemp int64 int64
canal_entrada int64 int64
indfall int64 int64
cod_prov int64 int64
nomprov int64 int64
ind_actividad_cliente int64 int64
renta float64 float64
segmento int64 int64


In [34]:
# check unique target count
for t in np.unique(target):
    print(t, sum(target['target']==t))

2 9452
3 9
4 1934
5 55
6 349
7 222
8 154
9 503
10 33
11 1085
12 1219
13 246
14 4
15 21
16 8
17 2942
18 4733
19 159
20 3
21 5151
22 8218
23 9119


In [35]:
# trim data, removing rows with low-frequency targets (fewer than 10 samples)
rem_targets = [2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 17, 18, 19, 21, 22, 23]  # 18 classes
keep = target['target'].isin(rem_targets)
trn = trn[keep]
# pass a 1-D Series (not a one-column DataFrame) to avoid the column_or_1d warning
target = LabelEncoder().fit_transform(target.loc[keep, 'target'])

for t in np.unique(target):
    print(t, sum(target==t))

0 9452
1 1934
2 55
3 349
4 222
5 154
6 503
7 33
8 1085
9 1219
10 246
11 21
12 2942
13 4733
14 159
15 5151
16 8218
17 9119
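
The list `rem_targets` above is written out by hand. As a sketch of how the same "<10" rule could be derived from the counts instead, here is a toy example; `toy_target` and `min_count` are illustrative names, not part of the original notebook:

```python
import pandas as pd

# Hypothetical toy target: class 3 appears only twice
toy_target = pd.Series([2] * 12 + [3] * 2 + [4] * 15, name='target')

min_count = 10  # same threshold as the "<10" rule above
counts = toy_target.value_counts()
kept_classes = sorted(counts[counts >= min_count].index)

# boolean mask that can filter both the features and the target
mask = toy_target.isin(kept_classes)
filtered = toy_target[mask]
print(kept_classes, len(filtered))
```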


## Defining the evaluation functions

In [36]:
def evaluate(x, y, model):
    trn_scores = dict(); vld_scores = dict()
    sss = StratifiedShuffleSplit(n_splits=3, test_size=0.1, random_state=777)
    for t_ind, v_ind in sss.split(x,y):
        # split data
        x_trn, x_vld = x.iloc[t_ind], x.iloc[v_ind]
        y_trn, y_vld = y[t_ind], y[v_ind]

        # fit model
        model.fit(x_trn, y_trn)
        
        # eval _ trn
        preds = model.predict(x_trn)
        acc_scores = trn_scores.get('accuracy', [])
        acc_scores.append(accuracy_score(y_trn, preds))
        trn_scores['accuracy'] = acc_scores

        f1_scores = trn_scores.get('f1 score', [])
        f1_scores.append(f1_score(y_trn, preds, average='weighted'))
        trn_scores['f1 score'] = f1_scores
        
        preds = model.predict_proba(x_trn)

        log_scores = trn_scores.get('log loss', [])
        log_scores.append(log_loss(y_trn, preds))
        trn_scores['log loss'] = log_scores

        # eval _ vld
        preds = model.predict(x_vld)
        acc_scores = vld_scores.get('accuracy', [])
        acc_scores.append(accuracy_score(y_vld, preds))
        vld_scores['accuracy'] = acc_scores

        f1_scores = vld_scores.get('f1 score', [])
        f1_scores.append(f1_score(y_vld, preds, average='weighted'))
        vld_scores['f1 score'] = f1_scores
        
        preds = model.predict_proba(x_vld)

        log_scores = vld_scores.get('log loss', [])
        log_scores.append(log_loss(y_vld, preds))
        vld_scores['log loss'] = log_scores
    return trn_scores, vld_scores

def print_scores(trn_scores, vld_scores):
    prefix = '        '
    cols = ['accuracy', 'f1 score','log loss']
    print('='*50)
    print('TRAIN EVAL')
    for col in cols:
        print('-'*50)
        print('# {}'.format(col))
        print('# {} Mean : {}'.format(prefix, np.mean(trn_scores[col])))
        print('# {} Raw  : {}'.format(prefix, trn_scores[col]))

    print('='*50)
    print('VALID EVAL')
    for col in cols:
        print('-'*50)
        print('# {}'.format(col))
        print('# {} Mean : {}'.format(prefix, np.mean(vld_scores[col])))
        print('# {} Raw  : {}'.format(prefix, vld_scores[col]))

def print_time(end, start):
    print('='*50)
    elapsed = end - start
    print('{} secs'.format(round(elapsed)))
    
def fit_and_eval(trn, target, model):
    # NOTE: relies on the global `st` (start time) set at the top of each calling cell
    trn_scores, vld_scores = evaluate(trn,target,model)
    print_scores(trn_scores, vld_scores)
    print_time(time.time(), st)
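
`evaluate` uses `StratifiedShuffleSplit` so that every validation fold keeps roughly the same class proportions as the full data, which matters here because some classes have only a few dozen rows. A small sketch on made-up imbalanced labels (`x_toy` and `y_toy` are placeholders):

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical imbalanced labels: 90 samples of class 0, 10 of class 1
y_toy = np.array([0] * 90 + [1] * 10)
x_toy = np.zeros((100, 1))  # features are irrelevant for the split itself

sss = StratifiedShuffleSplit(n_splits=3, test_size=0.1, random_state=777)
vld_counts = []
for t_ind, v_ind in sss.split(x_toy, y_toy):
    # count how many samples of each class land in the validation fold
    vld_counts.append(np.bincount(y_toy[v_ind], minlength=2).tolist())
print(vld_counts)
```

Each validation fold should hold 10 samples split 9 to 1, mirroring the 90/10 ratio of the full label vector.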

## Model training and evaluation
    - Model types
        - Decision Tree : tree-based model
        - Logistic Regression : linear model
        - Naive Bayes : Bayesian model
        - K-Nearest Neighbors : k-nearest neighbors model

    - Evaluation metrics on the training data
        - Accuracy
        - F1 Score
        - Log Loss
    - Evaluation metrics on the validation data (same as above)
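
To make the three metrics concrete, the sketch below computes them on three made-up samples and checks log loss against its definition (the mean negative log-probability assigned to the true class). The labels and probabilities are illustrative only:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, log_loss

y_true = [0, 1, 1]                       # hypothetical labels
proba = np.array([[0.8, 0.2],            # hypothetical predicted probabilities
                  [0.4, 0.6],
                  [0.7, 0.3]])           # confident but wrong for the last sample
y_pred = proba.argmax(axis=1)            # hard predictions: [0, 1, 0]

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, average='weighted')

# log loss: mean negative log-probability given to the true class
manual_ll = -np.mean(np.log(proba[np.arange(3), y_true]))
ll = log_loss(y_true, proba)
print(acc, f1, ll, manual_ll)
```

Accuracy and F1 only look at the hard predictions, while log loss also penalizes how confident a wrong probability was. That is why `evaluate` calls both `predict` and `predict_proba`.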

In [65]:
st = time.time()
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(max_depth=5,random_state=777)
fit_and_eval(trn, target, model)
# 2 sec


TRAIN EVAL
--------------------------------------------------
# accuracy
#          Mean : 0.2988586978595508
#          Raw  : [0.29759961008894847, 0.29884245156573658, 0.30013403192396737]
--------------------------------------------------
# f1 score
#          Mean : 0.23402301259625738
#          Raw  : [0.23266070803540992, 0.23386366820510057, 0.2355446615482617]
--------------------------------------------------
# log loss
#          Mean : 1.9279162902206635
#          Raw  : [1.9290720801895347, 1.9282930112300749, 1.9263837792423815]
VALID EVAL
--------------------------------------------------
# accuracy
#          Mean : 0.29839181286549704
#          Raw  : [0.29890350877192984, 0.29671052631578948, 0.29956140350877192]
--------------------------------------------------
# f1 score
#          Mean : 0.23359897250630945
#          Raw  : [0.23436285017678818, 0.23276042675386413, 0.23367364058827608]
--------------------------------------------------
# log loss
#          M

In [38]:
st = time.time()
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(n_jobs=-1, random_state=777)
fit_and_eval(trn, target, model)
# 58 sec


TRAIN EVAL
--------------------------------------------------
# accuracy
#          Mean : 0.26795824702489746
#          Raw  : [0.26694285366150849, 0.26721091750944315, 0.26972096990374073]
--------------------------------------------------
# f1 score
#          Mean : 0.19090613644340346
#          Raw  : [0.18789881249870152, 0.1905993726288028, 0.19422022420270599]
--------------------------------------------------
# log loss
#          Mean : 2.0235229153662657
#          Raw  : [2.0237777997135376, 2.0243527818881941, 2.0224381644970659]
VALID EVAL
--------------------------------------------------
# accuracy
#          Mean : 0.26374269005847956
#          Raw  : [0.26403508771929823, 0.26491228070175438, 0.26228070175438595]
--------------------------------------------------
# f1 score
#          Mean : 0.186729260938859
#          Raw  : [0.18391912012372935, 0.18744320021963495, 0.18882546247321272]
--------------------------------------------------
# log loss
#          Me

In [39]:
st = time.time()
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
fit_and_eval(trn, target, model)
# 2 sec


TRAIN EVAL
--------------------------------------------------
# accuracy
#          Mean : 0.20831810243288248
#          Raw  : [0.20899232362617279, 0.21004020957719019, 0.2059217740952845]
--------------------------------------------------
# f1 score
#          Mean : 0.16298665004818635
#          Raw  : [0.16192551527559029, 0.17112682182514397, 0.15590761304382481]
--------------------------------------------------
# log loss
#          Mean : 2.429744359232718
#          Raw  : [2.4326534875420545, 2.4190425768489856, 2.4375370133071148]
VALID EVAL
--------------------------------------------------
# accuracy
#          Mean : 0.20292397660818715
#          Raw  : [0.21030701754385964, 0.20219298245614034, 0.1962719298245614]
--------------------------------------------------
# f1 score
#          Mean : 0.15859571575030967
#          Raw  : [0.16340829060730833, 0.16189284069864562, 0.15048601594497507]
--------------------------------------------------
# log loss
#          Me

In [40]:
st = time.time()
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_jobs=-1)
fit_and_eval(trn, target, model)
# 2 sec


TRAIN EVAL
--------------------------------------------------
# accuracy
#          Mean : 0.41453230981682304
#          Raw  : [0.41371999512611185, 0.41481661995857194, 0.41506031436578533]
--------------------------------------------------
# f1 score
#          Mean : 0.39576676268003524
#          Raw  : [0.39502009040889918, 0.39601347575816598, 0.39626672187304063]
--------------------------------------------------
# log loss
#          Mean : 1.2295623790258723
#          Raw  : [1.2288679789508321, 1.2305374409657581, 1.2292817171610269]
VALID EVAL
--------------------------------------------------
# accuracy
#          Mean : 0.18223684210526314
#          Raw  : [0.18552631578947368, 0.17697368421052631, 0.18421052631578946]
--------------------------------------------------
# f1 score
#          Mean : 0.15890790309705174
#          Raw  : [0.16173921385084236, 0.15302222179503969, 0.16196227364527324]
--------------------------------------------------
# log loss
#         
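
The outline asks for a comparison of training and validation scores. One way to do that is to put the mean accuracies printed above side by side. The numbers in this sketch are copied (rounded) from those outputs; the table layout itself is just an illustration:

```python
import pandas as pd

# Mean accuracies copied (rounded) from the printed outputs above
scores = pd.DataFrame(
    {'train_acc': [0.29886, 0.26796, 0.20832, 0.41453],
     'valid_acc': [0.29839, 0.26374, 0.20292, 0.18224]},
    index=['Decision Tree', 'Logistic Regression',
           'Naive Bayes', 'K-Nearest Neighbors'])

# a large positive gap means the model fits the training data
# much better than data it has not seen
scores['gap'] = scores['train_acc'] - scores['valid_acc']
print(scores.sort_values('gap', ascending=False))
```

K-Nearest Neighbors has the highest training accuracy but the lowest validation accuracy, which points to overfitting. The other three models score almost the same on both splits.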

## Visualizing model parameters and important features (1)
    - For the Decision Tree Classifier only
    - The code below runs without errors once fit has been called on a Decision Tree model

In [43]:
# Utility

def observe_model_dt(model):
    print('='*50)
    print(model)
    
    print('='*50)
    print('# Feature Importance')
    print(model.feature_importances_)
    
    print('-'*50)
    print('# Mapped to Column Name')
    prefix = '    '
    feature_importance = dict()
    for i, f_imp in enumerate(model.feature_importances_):
        print('{} {} \t {}'.format(prefix, round(f_imp,5), trn.columns[i]))
        feature_importance[trn.columns[i]] = f_imp

    print('-'*50)
    print('# Sorted Feature Importance')
    feature_importance_sorted = sorted(feature_importance.items(), key=operator.itemgetter(1), reverse=True)
    for item in feature_importance_sorted:
        print('{} {} \t {}'.format(prefix, round(item[1],5), item[0]))
    
    return feature_importance_sorted

def plot_fimp(fimp):
    x = []; y = []
    for item in fimp:
        x.append(item[0])
        y.append(item[1])

    f, ax = plt.subplots(figsize=(20, 15))
    # pass x and y as keywords; recent seaborn versions no longer accept them positionally
    sns.barplot(x=x, y=y, alpha=0.5)
    ax.set_title('Feature Importance for Model : Decision Tree')
    ax.set(xlabel='Column Name', ylabel='Feature Importance')
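
The mapping and sorting done in `observe_model_dt` can also be written with a pandas Series. The sketch below uses synthetic data in which only one column carries signal, so the expected ranking is known in advance; `X_toy`, `y_toy` and the column names are made up:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(777)
# Hypothetical data: only 'signal' determines the label, 'noise' is random
X_toy = pd.DataFrame({'signal': rng.rand(200), 'noise': rng.rand(200)})
y_toy = (X_toy['signal'] > 0.5).astype(int)

tree = DecisionTreeClassifier(max_depth=5, random_state=777).fit(X_toy, y_toy)

# label each importance with its column name and sort, largest first
fimp_toy = pd.Series(tree.feature_importances_, index=X_toy.columns)
fimp_toy = fimp_toy.sort_values(ascending=False)
print(fimp_toy)
```

The importances of a fitted tree sum to 1, so each value can be read as that feature's share of the total impurity reduction.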

In [44]:
# Inspect the model (refit a decision tree first; the `model` left over from In [40] is a KNeighborsClassifier, which has no feature_importances_)
model = DecisionTreeClassifier(max_depth=5, random_state=777).fit(trn, target)
fimp = observe_model_dt(model)

In [45]:
# Plot the important features
plot_fimp(fimp)

## Visualizing model parameters and important features (2)
    - For LogisticRegression only
    - The code below runs without errors once fit has been called on a LogisticRegression model

In [46]:
# Utility

def observe_model_lr(model):
    target_num = 0
    print('='*50)
    print(model)
    
    print('='*50)
    print('# Coefficients for target_num == {}'.format(target_num))
    print(model.coef_[target_num])
    
    print('-'*50)
    print('# Mapped to Column Name')
    prefix = '    '
    coefs = dict()
    for i, coef in enumerate(model.coef_[target_num]):
        print('{} {} \t {}'.format(prefix, round(coef,5), trn.columns[i]))
        coefs[trn.columns[i]] = np.absolute(coef)

    print('-'*50)
    print('# Sorted Feature Importance')
    coefs_sorted = sorted(coefs.items(), key=operator.itemgetter(1), reverse=True)
    for item in coefs_sorted:
        print('{} {} \t {}'.format(prefix, round(item[1],5), item[0]))
    
    return coefs_sorted

def plot_coef(coef):
    x = []; y = []
    for item in coef:
        x.append(item[0])
        y.append(item[1])

    f, ax = plt.subplots(figsize=(20, 15))
    # pass x and y as keywords; recent seaborn versions no longer accept them positionally
    sns.barplot(x=x, y=y, alpha=0.5)
    ax.set_title('Feature Importance for Model : Logistic Regression')
    ax.set(xlabel='Column Name', ylabel='Feature Importance')
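
`observe_model_lr` reads only `model.coef_[0]`, the coefficients for the first class. For a multi-class problem `coef_` has one row per class, and coefficient magnitudes are easiest to compare across features when the features share a scale. A small sketch, using the Iris dataset purely as a 3-class stand-in (scaling is an assumption here, not something the notebook does):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Iris is used only as a small 3-class stand-in dataset
X_toy, y_toy = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X_toy)  # put features on one scale

clf = LogisticRegression(max_iter=1000, random_state=777).fit(X_scaled, y_toy)

# one row of coefficients per class, one column per feature
n_classes, n_features = clf.coef_.shape
print(n_classes, n_features)
```

With 18 target classes in this notebook, changing `target_num` inside `observe_model_lr` would show the coefficients for a different product.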

In [47]:
# Inspect the model (refit logistic regression first; the `model` left over from earlier cells has no coef_)
model = LogisticRegression(n_jobs=-1, random_state=777).fit(trn, target)
coef = observe_model_lr(model)

In [48]:
# Plot the important features
plot_coef(coef)

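## Feature engineering (data optimization) [+1]
    - Build the best feature set yourself by adding new features or dropping existing ones
    - Note: any transformation applied to the training data must also be applied to the test data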

In [49]:
# Input  : trn, target, tst
# Output : new trn, new tst, same target

In [66]:
# Example
trn['age_log'] = np.log(trn['age']+1)
tst['age_log'] = np.log(tst['age']+1)

trn['antiguedad_log'] = np.log(trn['antiguedad']+1)
tst['antiguedad_log'] = np.log(tst['antiguedad']+1)

trn['renta_log'] = np.log(trn['renta']+1)
tst['renta_log'] = np.log(tst['renta']+1)

trn['age_by10'] = (trn['age']/10).astype(int)
tst['age_by10'] = (tst['age']/10).astype(int)
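
Because every transformation has to be repeated for the test data, it can help to keep the feature logic in one function and call it on both frames. A sketch under that assumption; `add_features` and the toy frames are hypothetical, not part of the original notebook:

```python
import numpy as np
import pandas as pd

def add_features(df):
    """Return a copy of df with the example features added (hypothetical helper)."""
    out = df.copy()
    for col in ['age', 'antiguedad', 'renta']:
        out[col + '_log'] = np.log1p(out[col])   # same as np.log(x + 1)
    out['age_by10'] = out['age'] // 10           # decade bucket, e.g. 37 -> 3
    return out

# Hypothetical toy frames standing in for trn and tst
trn_toy = pd.DataFrame({'age': [28, 37], 'antiguedad': [6, 35],
                        'renta': [50000.0, 80000.0]})
tst_toy = pd.DataFrame({'age': [40], 'antiguedad': [12], 'renta': [120000.0]})

trn_new, tst_new = add_features(trn_toy), add_features(tst_toy)
print(list(trn_new.columns) == list(tst_new.columns))
```

For non-negative ages, `// 10` gives the same bucket as `(age/10).astype(int)` in the cell above.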

In [69]:
trn[['age','age_by10']].head()

   age  age_by10
0   28         2
1   28         2
2   37         3
3   37         3
4   40         4


In [None]:
trn_v2 = trn.drop(['age'], axis=1)
tst_v2 = tst.drop(['age'], axis=1)  # apply the same drop to the test data

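## Hyperparameter optimization (model optimization) [+1]
    - Define the parameters of your model yourself and search for the best setting
    - Reference: the parameters of each model are documented on the scikit-learn website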

In [50]:
# Input  : none
# Output : model instance

In [54]:
model = DecisionTreeClassifier(max_depth=10, random_state=777)
# note: DecisionTreeClassifier does not accept n_jobs; passing it raises "TypeError: __init__() got an unexpected keyword argument 'n_jobs'"
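
Rather than trying parameter values one cell at a time, the search can be automated. The sketch below uses scikit-learn's `GridSearchCV` with the same split settings as `evaluate`. The synthetic data and the parameter grid are illustrative assumptions, not tuned values for this competition:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; in the notebook this would be trn and target
X_toy, y_toy = make_classification(n_samples=300, n_features=8,
                                   n_informative=4, random_state=777)

param_grid = {'max_depth': [3, 5, 10], 'min_samples_leaf': [1, 10]}  # illustrative grid
cv = StratifiedShuffleSplit(n_splits=3, test_size=0.1, random_state=777)

search = GridSearchCV(DecisionTreeClassifier(random_state=777),
                      param_grid, scoring='neg_log_loss', cv=cv)
search.fit(X_toy, y_toy)

best_model = search.best_estimator_   # refit on all of X_toy with the best setting
print(search.best_params_, -search.best_score_)
```

`scoring='neg_log_loss'` makes the search minimize validation log loss; the sign is flipped because scikit-learn always maximizes the score.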

## Submitting results to Kaggle
    - Scored with the MAP@7 metric (https://www.kaggle.com/c/santander-product-recommendation/details/evaluation)
    - The top 7 products must be recommended for each user
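
To get a feel for MAP@7 before submitting, the metric can be approximated locally. The sketch below follows the commonly used average-precision-at-k formulation. It is an assumption that this matches Kaggle's scorer in every edge case, and `apk` / `mapk` are helper names introduced here:

```python
def apk(actual, predicted, k=7):
    """Average precision at k for one user (common formulation, assumed here)."""
    if not actual:
        return 0.0
    score, hits = 0.0, 0
    for i, p in enumerate(predicted[:k]):
        # count a hit only the first time a relevant product appears
        if p in actual and p not in predicted[:i]:
            hits += 1
            score += hits / (i + 1)
    return score / min(len(actual), k)

def mapk(actual_lists, predicted_lists, k=7):
    """Mean of apk over all users."""
    pairs = list(zip(actual_lists, predicted_lists))
    return sum(apk(a, p, k) for a, p in pairs) / len(pairs)

# Hypothetical users: the first gets its product at rank 1, the second at rank 2
actual = [['ind_cco_fin_ult1'], ['ind_recibo_ult1']]
predicted = [['ind_cco_fin_ult1', 'ind_recibo_ult1'],
             ['ind_cco_fin_ult1', 'ind_recibo_ult1']]
score = mapk(actual, predicted)
print(score)  # (1.0 + 0.5) / 2 = 0.75
```

The rank of the correct product matters: placing it second instead of first halves that user's score, which is why the predictions are sorted by probability below.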

In [51]:
from datetime import datetime
import os

print('='*50)
print('# Test shape : {}'.format(tst.shape))

model = LogisticRegression(n_jobs=-1, random_state=777)
model.fit(trn,target)

preds = model.predict_proba(tst)
preds = np.fliplr(np.argsort(preds, axis=1))

# Test shape : (929615, 22)
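
The last line of the cell above turns class probabilities into a ranking. A toy sketch of what `np.fliplr(np.argsort(...))` does, with made-up probabilities:

```python
import numpy as np

# Hypothetical probabilities for 2 users over 4 classes
proba = np.array([[0.1, 0.6, 0.05, 0.25],
                  [0.4, 0.1, 0.3, 0.2]])

# argsort sorts ascending, so flipping left-right puts the most probable class first
ranked = np.fliplr(np.argsort(proba, axis=1))
top2 = ranked[:, :2]
print(top2)
```

Each row of `ranked` lists encoded class indices from most to least probable; the submission code then keeps the first 7 per user.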


In [52]:
cols = ['ind_ahor_fin_ult1', 'ind_aval_fin_ult1', 'ind_cco_fin_ult1',
        'ind_cder_fin_ult1', 'ind_cno_fin_ult1',  'ind_ctju_fin_ult1',
        'ind_ctma_fin_ult1', 'ind_ctop_fin_ult1', 'ind_ctpp_fin_ult1',
        'ind_deco_fin_ult1', 'ind_deme_fin_ult1', 'ind_dela_fin_ult1',
        'ind_ecue_fin_ult1', 'ind_fond_fin_ult1', 'ind_hip_fin_ult1',
        'ind_plan_fin_ult1', 'ind_pres_fin_ult1', 'ind_reca_fin_ult1',
        'ind_tjcr_fin_ult1', 'ind_valo_fin_ult1', 'ind_viv_fin_ult1',
        'ind_nomina_ult1',   'ind_nom_pens_ult1', 'ind_recibo_ult1']
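# keep only the products that remain as targets, in index order (the order LabelEncoder assigned to the encoded classes)
target_cols = [col for i, col in enumerate(cols) if i in rem_targets]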
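
`target_cols` works as a lookup from encoded class to product name because `LabelEncoder` numbers classes in sorted order of the original values. A miniature sketch with made-up product names (`toy_cols` and `toy_rem_targets` are placeholders):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical miniature version of cols / rem_targets
toy_cols = ['prod_a', 'prod_b', 'prod_c', 'prod_d']
toy_rem_targets = [1, 3]                 # original product indices that were kept

le = LabelEncoder().fit(np.array([3, 1, 3, 1, 1]))
# classes_ is sorted, so encoded label k maps to the k-th smallest original index
toy_target_cols = [toy_cols[i] for i in le.classes_]
same_order = list(le.classes_) == toy_rem_targets
print(same_order, toy_target_cols)
```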

In [53]:
final_preds = []
for pred in preds:
    # keep the 7 most probable products for this user
    top_products = [target_cols[product] for product in pred[:7]]
    final_preds.append(' '.join(top_products))

out_df = pd.DataFrame({'ncodpers':test_id, 'added_products':final_preds})
file_name = datetime.now().strftime("result_%Y%m%d%H%M%S") + '.csv'
out_df.to_csv(os.path.join('../output',file_name), index=False)

Submit the output file at https://www.kaggle.com/c/santander-product-recommendation/submissions/attach