# Introduction to scikit-learn

Reference URL: https://tutorials.chainer.org/ja/09_Introduction_to_Scikit-learn.html

## Basic steps
1. Prepare the dataset
1. Choose a model
1. Choose an objective function
1. Choose an optimization method
1. Train the model

## 1 Multiple regression analysis with scikit-learn

### Step 1: Preparing the dataset

In [2]:
# Note: load_boston was removed in scikit-learn 1.2; this cell requires an older version.
from sklearn.datasets import load_boston

dataset = load_boston()

In [6]:
x = dataset.data
t = dataset.target

In [7]:
x.shape

(506, 13)

In [8]:
t.shape

(506,)

### Splitting the dataset

In [11]:
from sklearn.model_selection import train_test_split

x_train, x_test, t_train, t_test = train_test_split(x, t, test_size = 0.3, random_state = 0)

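# test_size = 0.3: use 30% of the data for testing and the remaining 70% for training
# random_state = 0: seed for the random shuffle applied before splitting; fixing it makes the split reproducible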
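
The effect of `random_state` can be checked with a small sketch. The toy arrays below are assumptions made for illustration, not the data used in this notebook. Calling `train_test_split` twice with the same `random_state` should return the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 samples with 2 features each
x = np.arange(20).reshape(10, 2)
t = np.arange(10)

# The same random_state gives the same shuffle, hence the same split
_, x_test_a, _, t_test_a = train_test_split(x, t, test_size=0.3, random_state=0)
_, x_test_b, _, t_test_b = train_test_split(x, t, test_size=0.3, random_state=0)

print(np.array_equal(t_test_a, t_test_b))
print(len(t_test_a), len(t) - len(t_test_a))  # test size, train size
```

With 10 samples and `test_size=0.3`, 3 samples go to the test set and 7 to the training set.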

### Choosing the model, objective function, and optimization method

In [15]:
from sklearn.linear_model import LinearRegression

reg_model = LinearRegression()
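
`LinearRegression` fits ordinary least squares, so the model, the objective (residual sum of squares), and the solver all come bundled in this one class. As a rough check, the sketch below compares its fitted parameters with a least-squares solution computed directly with NumPy. The synthetic data and the names `x_aug` and `w` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical): a linear target plus a little noise
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))
t = x @ np.array([2.0, -1.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=100)

reg_model = LinearRegression().fit(x, t)

# Least squares by hand: append a column of ones so the intercept is fitted too
x_aug = np.hstack([x, np.ones((len(x), 1))])
w, *_ = np.linalg.lstsq(x_aug, t, rcond=None)

print(np.allclose(reg_model.coef_, w[:-1]))
print(np.isclose(reg_model.intercept_, w[-1]))
```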

### Training the model

In [17]:
reg_model.fit(x_train, t_train)

LinearRegression()

In [18]:
reg_model.coef_

array([-1.21310401e-01,  4.44664254e-02,  1.13416945e-02,  2.51124642e+00,
       -1.62312529e+01,  3.85906801e+00, -9.98516565e-03, -1.50026956e+00,
        2.42143466e-01, -1.10716124e-02, -1.01775264e+00,  6.81446545e-03,
       -4.86738066e-01])

In [19]:
reg_model.intercept_

37.937107741833074

In [20]:
reg_model.score(x_train, t_train)

0.7645451026942549
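
For regressors, `score` returns the coefficient of determination R², defined as 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the targets from their mean. The sketch below recomputes it by hand on synthetic data (an assumption, not the dataset used above) and compares it with `score`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Synthetic regression data (hypothetical stand-in)
x, t = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)
reg_model = LinearRegression().fit(x, t)

y = reg_model.predict(x)
ss_res = np.sum((t - y) ** 2)         # residual sum of squares
ss_tot = np.sum((t - t.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

print(np.isclose(r2_manual, reg_model.score(x, t)))
```

A value of 1 means a perfect fit, and 0 corresponds to always predicting the mean of the targets.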

### Trying inference

In [22]:
reg_model.predict(x_test[:1])

array([24.9357079])

In [24]:
t_test[0]

22.6
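
For a linear model, `predict` amounts to a dot product with `coef_` plus `intercept_`. The following sketch, again on synthetic data chosen only for illustration, reproduces a prediction by hand. Note that `x[:1]` is used rather than `x[0]` because `predict` expects a 2-D array of shape `(n_samples, n_features)`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Synthetic regression data (hypothetical stand-in)
x, t = make_regression(n_samples=50, n_features=3, noise=1.0, random_state=0)
reg_model = LinearRegression().fit(x, t)

# x[:1] keeps the 2-D shape (1, n_features) that predict expects
y_sklearn = reg_model.predict(x[:1])
y_manual = x[:1] @ reg_model.coef_ + reg_model.intercept_

print(y_sklearn.shape)
print(np.allclose(y_sklearn, y_manual))
```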

### Evaluation on the test data

In [25]:
reg_model.score(x_test, t_test)

0.6733825506400177

## Improving each step

### Improving dataset preparation: preprocessing

In [45]:
# Try standardization
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

In [27]:
scaler.fit(x_train)

StandardScaler()

In [28]:
scaler.mean_

array([3.35828432e+00, 1.18093220e+01, 1.10787571e+01, 6.49717514e-02,
       5.56098305e-01, 6.30842655e+00, 6.89940678e+01, 3.76245876e+00,
       9.35310734e+00, 4.01782486e+02, 1.84734463e+01, 3.60601186e+02,
       1.24406497e+01])

In [29]:
scaler.var_

array([6.95792305e+01, 5.57886665e+02, 4.87753572e+01, 6.07504229e-02,
       1.33257561e-02, 4.91423928e-01, 7.83932705e+02, 4.26314655e+00,
       7.49911344e+01, 2.90195600e+04, 4.93579208e+00, 7.31040807e+03,
       4.99634123e+01])

In [30]:
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)

In [31]:
reg_model = LinearRegression()

reg_model.fit(x_train_scaled, t_train)

LinearRegression()

In [32]:
reg_model.score(x_train_scaled, t_train)

0.7645451026942549

In [33]:
reg_model.score(x_test_scaled, t_test)

0.6733825506400195
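
The scores above are essentially identical to those obtained without scaling. This is expected for plain least squares: standardization is a per-feature shift and rescale, which the fitted coefficients and intercept can absorb. The sketch below illustrates both points on synthetic data (an assumption for illustration): `transform` matches `(x - mean_) / sqrt(var_)`, and R² is unchanged.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical), with features put on very different scales
x, t = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)
x = x * np.array([1.0, 10.0, 100.0, 1000.0]) + 50.0

scaler = StandardScaler().fit(x)
x_scaled = scaler.transform(x)

# transform subtracts the fitted mean and divides by the fitted standard deviation
x_manual = (x - scaler.mean_) / np.sqrt(scaler.var_)
print(np.allclose(x_scaled, x_manual))

# Ordinary least squares absorbs the shift and rescale into its parameters
r2_raw = LinearRegression().fit(x, t).score(x, t)
r2_scaled = LinearRegression().fit(x_scaled, t).score(x_scaled, t)
print(np.isclose(r2_raw, r2_scaled))
```

Standardization tends to matter more for models that are sensitive to feature scale, such as regularized or distance-based ones.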

In [46]:
# Try a power transform
from sklearn.preprocessing import PowerTransformer

scaler = PowerTransformer()
scaler.fit(x_train)

x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)

reg_model = LinearRegression()
reg_model.fit(x_train_scaled, t_train)


LinearRegression()

In [43]:
reg_model.score(x_train_scaled, t_train)

0.7859862563286062

In [44]:
reg_model.score(x_test_scaled, t_test)

0.7002856551689581
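
Unlike standardization, a power transform is nonlinear, so it can change what a linear model is able to fit. By default `PowerTransformer` applies the Yeo-Johnson transform and then standardizes the result. A minimal sketch on a single right-skewed synthetic feature (an assumption for illustration; `skewness` is a hypothetical helper defined here) shows the skew shrinking after the transform.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed feature (hypothetical)
rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=(500, 1))

pt = PowerTransformer()  # defaults: method="yeo-johnson", standardize=True
x_trans = pt.fit_transform(x)

def skewness(a):
    """Sample skewness: third central moment divided by the cubed standard deviation."""
    a = a.ravel()
    return np.mean((a - a.mean()) ** 3) / a.std() ** 3

print(pt.lambdas_)  # fitted power parameter, one per feature
print(skewness(x), skewness(x_trans))
```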

### Building a pipeline