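# Capstone Exercise Model Answer: Classification, DAY 2
- Build a model that predicts the success or failure of Kaggle's Kickstarter projects
    - https://www.kaggle.com/kemical/kickstarter-projects?select=ks-projects-201801.csv
- DAY 2 covers the following
    - Model validation
    - Preprocessing
    - Regularization and hyperparameter search
    - Using an SVM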

In [18]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, cross_validate, KFold, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import Pipeline

In [2]:
df = pd.read_csv('../data/df_classification.csv', index_col='ID')
df.head()

ID,period,log_usd_goal,n_words,main_category_Comics,main_category_Crafts,main_category_Dance,main_category_Design,main_category_Fashion,main_category_Film & Video,main_category_Food,...,currency_GBP,currency_HKD,currency_JPY,currency_MXN,currency_NOK,currency_NZD,currency_SEK,currency_SGD,currency_USD,state_successful
1000002330,58,3.185811,6,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
1000003930,59,4.477121,8,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,1,0
1000004038,44,4.653213,3,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,1,0
1000007540,29,3.69897,7,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
1000014025,34,4.69897,3,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,1,1


## Model validation
- Validate the model with the holdout method
- Use the logistic regression implemented on Day 1

In [3]:
X = df.drop(columns='state_successful')
y = df['state_successful']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)

In [4]:
# Note: scikit-learn 1.1 renamed loss='log' to loss='log_loss', and 'log' was removed in 1.3
lr_clf = SGDClassifier(loss='log', max_iter=10000, fit_intercept=True, random_state=1234, tol=1e-3)
lr_clf.fit(X_train, y_train)

SGDClassifier(loss='log', max_iter=10000, random_state=1234)

In [5]:
y_pred = lr_clf.predict(X_test)

In [6]:
acc = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f_1 = f1_score(y_test, y_pred)

print(f'Accuracy: {acc:.3}')
print(f'Precision: {precision:.3}')
print(f'Recall: {recall:.3}')
print(f'F1: {f_1:.3}')

Accuracy: 0.653
Precision: 0.597
Recall: 0.438
F1: 0.505
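
To make the four scores above concrete, the sketch below recomputes them by hand from a confusion matrix and compares them with the scikit-learn functions. The labels `y_true` and `y_hat` are toy values chosen for illustration and are not taken from the Kickstarter data.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Toy labels for illustration only (not project data)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

# For binary labels {0, 1}, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy_manual = (tp + tn) / (tp + tn + fp + fn)  # share of correct predictions
precision_manual = tp / (tp + fp)                  # predicted successes that were real
recall_manual = tp / (tp + fn)                     # real successes that were found
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print(f'Accuracy: {accuracy_manual:.3}, sklearn: {accuracy_score(y_true, y_hat):.3}')
print(f'Precision: {precision_manual:.3}, sklearn: {precision_score(y_true, y_hat):.3}')
print(f'Recall: {recall_manual:.3}, sklearn: {recall_score(y_true, y_hat):.3}')
print(f'F1: {f1_manual:.3}, sklearn: {f1_score(y_true, y_hat):.3}')
```

A model can reach a reasonable accuracy while missing many positives, which is why recall and F1 are reported alongside accuracy throughout this notebook.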


- Validate the model with cross-validation on the training split
- Use the logistic regression implemented on Day 1

In [7]:
lr_clf_cv = SGDClassifier(loss='log', max_iter=10000, fit_intercept=True, random_state=1234, tol=1e-3)
kf = KFold(n_splits=5, shuffle=True, random_state=1234)
cv_results = cross_validate(lr_clf_cv, X_train, y_train, cv=kf, return_estimator=True,
                            scoring=('accuracy', 'precision', 'recall', 'f1'))

In [8]:
cv_results.keys()

dict_keys(['fit_time', 'score_time', 'estimator', 'test_accuracy', 'test_precision', 'test_recall', 'test_f1'])

In [9]:
acc = cv_results['test_accuracy'].mean()
precision = cv_results['test_precision'].mean()
recall = cv_results['test_recall'].mean()
f_1 = cv_results['test_f1'].mean()

print(f'Accuracy: {acc:.3}')
print(f'Precision: {precision:.3}')
print(f'Recall: {recall:.3}')
print(f'F1: {f_1:.3}')

Accuracy: 0.637
Precision: 0.65
Recall: 0.234
F1: 0.326
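
The means above hide how much the scores vary from fold to fold. The sketch below reports the per-fold values and their standard deviation next to the mean. It is a minimal illustration under assumptions: the data come from `make_classification` rather than the project file, and `LogisticRegression` is swapped in for `SGDClassifier` so the example does not depend on the scikit-learn version.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in data (assumption: not the Kickstarter features)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=0)
res = cross_validate(LogisticRegression(max_iter=1000), X_demo, y_demo,
                     cv=kf_demo, scoring=('accuracy', 'recall'))

# One score per fold; the spread shows how stable the estimate is
for name in ('test_accuracy', 'test_recall'):
    scores = res[name]
    print(f'{name}: mean={scores.mean():.3f}, std={scores.std():.3f}, '
          f'folds={np.round(scores, 3)}')
```

A large standard deviation relative to the gap between two models suggests that the difference between them may not be meaningful.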


## Preprocessing
- Standardize the continuous variables

In [10]:
num_cols = ['log_usd_goal', 'period']
# train_test_split returns slices of df; copy them so the assignment
# below does not raise SettingWithCopyWarning
X_train, X_test = X_train.copy(), X_test.copy()
std = StandardScaler()
X_train[num_cols] = std.fit_transform(X_train[num_cols])  # fit on the training split only
X_test[num_cols] = std.transform(X_test[num_cols])        # reuse the training statistics


In [11]:
lr_clf = SGDClassifier(loss='log', max_iter=10000, fit_intercept=True, random_state=1234, tol=1e-3)
lr_clf.fit(X_train, y_train)
y_pred = lr_clf.predict(X_test)

acc = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f_1 = f1_score(y_test, y_pred)

print(f'Accuracy: {acc:.3}')
print(f'Precision: {precision:.3}')
print(f'Recall: {recall:.3}')
print(f'F1: {f_1:.3}')

Accuracy: 0.647
Precision: 0.565
Recall: 0.545
F1: 0.555


- Accuracy and precision dropped slightly (0.653 to 0.647 and 0.597 to 0.565)
- Recall and F1 improved (0.438 to 0.545 and 0.505 to 0.555)
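
The standardization above is fitted once on the whole training split. If the scaler should instead be refit inside each cross-validation fold, one option is to wrap it in a `ColumnTransformer` and place that in a `Pipeline`. The sketch below shows the idea on a small synthetic frame. The column `flag`, the generated target, and the use of `LogisticRegression` in place of `SGDClassifier` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300
# Synthetic stand-in for the real features
demo = pd.DataFrame({
    'log_usd_goal': rng.normal(4.0, 1.0, n),
    'period': rng.integers(1, 90, n),
    'flag': rng.integers(0, 2, n),  # hypothetical dummy variable
})
# Hypothetical target: lower goals succeed more often
target = (demo['log_usd_goal'] + rng.normal(0.0, 1.0, n) < 4.0).astype(int)

# Scale only the continuous columns; pass dummy variables through unchanged
pre = ColumnTransformer(
    [('num', StandardScaler(), ['log_usd_goal', 'period'])],
    remainder='passthrough',
)
pipe = Pipeline([('pre', pre), ('clf', LogisticRegression(max_iter=1000))])

# The scaler is refit on the training part of every fold
scores_pipe = cross_val_score(pipe, demo, target, cv=5, scoring='accuracy')
print(scores_pipe.mean())
```

With this layout the validation part of each fold never contributes to the scaling statistics, at the cost of refitting the scaler once per fold.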

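## Regularization and hyperparameter search
- Fit a logistic regression with polynomial features up to degree 2, combined with one of the following regularizers. The regularizer type and its strength are chosen by grid search
    - L2 regularization
    - L1 regularization
- The following classes are used
    - Pipeline: chains several transformers and an estimator so they can be used as one.
    - GridSearchCV: runs the grid search. When it is combined with Pipeline, each parameter is addressed as the step name and the parameter name joined by `__`
- Note that this takes roughly 30 minutes to run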
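
Because the full search is slow, it can help to try the `__` naming convention on a small problem first. The sketch below is an illustration under assumptions: the data come from `make_classification`, the grid is reduced, and `SGDClassifier` is left at its default loss (hinge, a linear SVM rather than logistic regression) so the example runs regardless of the scikit-learn version.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Small synthetic problem (assumption: not the project data)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

pipe = Pipeline([
    ('poly', PolynomialFeatures()),
    ('clf', SGDClassifier(random_state=0)),
])

# '<step name>__<parameter name>' addresses a parameter of one pipeline step
param_grid = {
    'poly__degree': [1, 2],
    'clf__alpha': [1e-2, 1e-4],
    'clf__penalty': ['l1', 'l2'],
}

search = GridSearchCV(pipe, param_grid=param_grid, cv=3, scoring='accuracy')
search.fit(X_demo, y_demo)

print(search.best_params_)  # winning combination
print(search.best_score_)   # its mean cross-validated accuracy
```

After fitting, `search.cv_results_` holds the scores of every candidate, which is useful for checking whether the best value sits at the edge of the grid.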

In [15]:
degree = 2
poly = PolynomialFeatures(degree)

parameters = {'clf__alpha': [1e-1, 1e-2, 1e-3, 1e-4, 1e-5], 'clf__penalty': ['l1', 'l2']}

clf_pl = Pipeline([("poly", poly), ("clf", SGDClassifier(loss='log', max_iter=10000, fit_intercept=True, random_state=1234, tol=1e-3))])

grid = GridSearchCV(clf_pl, param_grid=parameters, 
                         cv=kf, 
                         scoring='accuracy', 
                         verbose=3) 

grid.fit(X_train, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV] clf__alpha=0.1, clf__penalty=l1 .................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] ..... clf__alpha=0.1, clf__penalty=l1, score=0.620, total=   5.0s
[CV] clf__alpha=0.1, clf__penalty=l1 .................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    5.0s remaining:    0.0s


[CV] ..... clf__alpha=0.1, clf__penalty=l1, score=0.623, total=   4.8s
[CV] clf__alpha=0.1, clf__penalty=l1 .................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    9.8s remaining:    0.0s


[CV] ..... clf__alpha=0.1, clf__penalty=l1, score=0.617, total=   5.2s
[CV] clf__alpha=0.1, clf__penalty=l1 .................................
[CV] ..... clf__alpha=0.1, clf__penalty=l1, score=0.618, total=   4.8s
[CV] clf__alpha=0.1, clf__penalty=l1 .................................
[CV] ..... clf__alpha=0.1, clf__penalty=l1, score=0.619, total=   5.1s
[CV] clf__alpha=0.1, clf__penalty=l2 .................................
[CV] ..... clf__alpha=0.1, clf__penalty=l2, score=0.649, total=   3.7s
[CV] clf__alpha=0.1, clf__penalty=l2 .................................
[CV] ..... clf__alpha=0.1, clf__penalty=l2, score=0.650, total=   3.8s
[CV] clf__alpha=0.1, clf__penalty=l2 .................................
[CV] ..... clf__alpha=0.1, clf__penalty=l2, score=0.650, total=   3.7s
[CV] clf__alpha=0.1, clf__penalty=l2 .................................
[CV] ..... clf__alpha=0.1, clf__penalty=l2, score=0.649, total=   3.5s
[CV] clf__alpha=0.1, clf__penalty=l2 .................................
[CV] .

[Parallel(n_jobs=1)]: Done  50 out of  50 | elapsed: 23.1min finished


GridSearchCV(cv=KFold(n_splits=5, random_state=1234, shuffle=True),
             estimator=Pipeline(steps=[('poly', PolynomialFeatures()),
                                       ('clf',
                                        SGDClassifier(loss='log',
                                                      max_iter=10000,
                                                      random_state=1234))]),
             param_grid={'clf__alpha': [0.1, 0.01, 0.001, 0.0001, 1e-05],
                         'clf__penalty': ['l1', 'l2']},
             scoring='accuracy', verbose=3)

In [17]:
y_pred = grid.predict(X_test)

acc = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f_1 = f1_score(y_test, y_pred)

print(f'Accuracy: {acc:.3}')
print(f'Precision: {precision:.3}')
print(f'Recall: {recall:.3}')
print(f'F1: {f_1:.3}')

Accuracy: 0.652
Precision: 0.621
Recall: 0.355
F1: 0.452


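Accuracy is about the same as for the plain logistic regression (0.652 versus 0.647 after standardization), while recall and F1 are lower (0.355 versus 0.545 and 0.452 versus 0.555). The polynomial features and regularization do not appear to improve performance much.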

## Using an SVM
SVM is a model suited to small datasets. To apply it to this task, the data must be subsampled appropriately, otherwise the computation time explodes.

In [26]:
n_sample = 10000  # sample size
y_sampled = y_train.sample(n_sample)
X_sampled = X_train.loc[y_sampled.index, :]
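
The subsample above is drawn uniformly at random and without a fixed seed, so the class ratio and the results can change between runs. One alternative, shown here as a sketch on synthetic data, is to use `train_test_split` with `stratify` and `random_state` to draw a reproducible subsample that preserves the class ratio. The names `X_big`, `y_big`, and the columns `a`, `b`, `c` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical stand-ins for X_train and y_train
X_big = pd.DataFrame(rng.normal(size=(5000, 3)), columns=['a', 'b', 'c'])
y_big = pd.Series((rng.random(5000) < 0.4).astype(int))

# An integer train_size keeps exactly that many rows;
# stratify keeps the share of each class close to the original
X_small, _, y_small, _ = train_test_split(
    X_big, y_big, train_size=1000, stratify=y_big, random_state=0
)

print(y_big.mean(), y_small.mean())  # positive-class share before and after
```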

In [27]:
parameters = {'kernel': ['linear', 'rbf'], 'C': [1e-5, 1e-4, 1e-3]}  # edit here
model = SVC()
svc = GridSearchCV(model, parameters, cv=kf, verbose=3)
svc.fit(X_sampled, y_sampled)

Fitting 5 folds for each of 6 candidates, totalling 30 fits
[CV] C=1e-05, kernel=linear ..........................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] .............. C=1e-05, kernel=linear, score=0.578, total=   2.8s
[CV] C=1e-05, kernel=linear ..........................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    2.8s remaining:    0.0s


[CV] .............. C=1e-05, kernel=linear, score=0.595, total=   2.6s
[CV] C=1e-05, kernel=linear ..........................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    5.3s remaining:    0.0s


[CV] .............. C=1e-05, kernel=linear, score=0.609, total=   2.5s
[CV] C=1e-05, kernel=linear ..........................................
[CV] .............. C=1e-05, kernel=linear, score=0.602, total=   2.5s
[CV] C=1e-05, kernel=linear ..........................................
[CV] .............. C=1e-05, kernel=linear, score=0.586, total=   2.4s
[CV] C=1e-05, kernel=rbf .............................................
[CV] ................. C=1e-05, kernel=rbf, score=0.578, total=   3.1s
[CV] C=1e-05, kernel=rbf .............................................
[CV] ................. C=1e-05, kernel=rbf, score=0.595, total=   3.2s
[CV] C=1e-05, kernel=rbf .............................................
[CV] ................. C=1e-05, kernel=rbf, score=0.609, total=   3.3s
[CV] C=1e-05, kernel=rbf .............................................
[CV] ................. C=1e-05, kernel=rbf, score=0.602, total=   3.2s
[CV] C=1e-05, kernel=rbf .............................................
[CV] .

[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:  1.5min finished


GridSearchCV(cv=KFold(n_splits=5, random_state=1234, shuffle=True),
             estimator=SVC(),
             param_grid={'C': [1e-05, 0.0001, 0.001],
                         'kernel': ['linear', 'rbf']},
             verbose=3)

In [29]:
y_pred = svc.predict(X_test)

acc = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f_1 = f1_score(y_test, y_pred)

print(f'Accuracy: {acc:.3}')
print(f'Precision: {precision:.3}')
print(f'Recall: {recall:.3}')
print(f'F1: {f_1:.3}')

Accuracy: 0.606
Precision: 0.682
Recall: 0.047
F1: 0.0879


Compared with logistic regression, recall is much lower (0.047 versus 0.545 for the standardized logistic regression).
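
One possible reason, not verified on the project data, is that the `C` grid only contains very small values. In the log, the linear and rbf kernels give identical fold scores at `C=1e-05`, which is what would be expected if the model predicted the majority class for almost every sample. The sketch below illustrates this effect on synthetic, mildly imbalanced data by comparing recall over a wider range of `C`. The data set and the grid are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data with about 40% positives (assumption: not the project data)
X_demo, y_demo = make_classification(n_samples=600, n_features=10,
                                     weights=[0.6, 0.4], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          stratify=y_demo, random_state=0)

# Fit the scaler on the training part only
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

recalls = {}
for C in [1e-5, 1e-3, 1e-1, 1.0, 10.0]:
    svm = SVC(kernel='rbf', C=C).fit(X_tr, y_tr)
    pred = svm.predict(X_te)
    recalls[C] = recall_score(y_te, pred)
    # A predicted positive rate near 0 means the model outputs almost only class 0
    print(f'C={C:g}: recall={recalls[C]:.3f}, predicted positive rate={pred.mean():.3f}')
```

If the same pattern appears on the Kickstarter subsample, extending the grid toward larger values of `C` would be a natural next experiment.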