# Preprocessing

## Methods for splitting data into training and test sets
#### hold-out

Randomly split the data into training data and test data.
A 7:3 or 5:5 split is common.

For tuning hyperparameters such as the learning rate, the data is sometimes split into one more dataset, called **validation data**.

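Tuning against the validation data rather than the test data keeps the tuning from overfitting to the test data, so the test data can still be used for a fair final evaluation.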
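
A three-way split can be sketched by calling `train_test_split` twice. This is a minimal sketch, not a prescribed recipe: the toy arrays and the 60/20/20 ratio are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 samples with 2 features each
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# Step 1: hold out 20% of all samples as the test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 2: split the remaining 80 samples 75:25 into training and validation
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
```

Hyperparameters would then be compared on `X_val`, and `X_test` would be touched only once, for the final evaluation.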

#### LOOCV(Leave One Out Cross Validation)
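
In LOOCV, each sample is used once as a single-element test set while all the remaining samples form the training set. The sketch below uses scikit-learn's `LeaveOneOut` on a small array; the data is hypothetical and only the index splits are shown.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

# Toy data (hypothetical): 5 samples with 2 features each
X = np.arange(10).reshape(5, 2)

loo = LeaveOneOut()
splits = list(loo.split(X))

# One split per sample: 4 training indices and 1 test index each time
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```
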
#### k-fold CV
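
In k-fold CV, the data is divided into k folds, and each fold serves as the test set once while the other k-1 folds are used for training. The sketch below uses scikit-learn's `KFold`; the toy data and the choice of `n_splits=5` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data (hypothetical): 10 samples with 2 features each
X = np.arange(20).reshape(10, 2)

# Shuffle before splitting so the folds do not depend on row order
kf = KFold(n_splits=5, shuffle=True, random_state=0)
folds = list(kf.split(X))

for i, (train_idx, test_idx) in enumerate(folds):
    print(f"fold {i}: train={train_idx}, test={test_idx}")
```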

## Feature scaling
#### Standardization

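Transform the values so that the mean is 0 and the variance is 1 (the transformed value is called the **z-score**). Standardization puts features on a common scale, which makes them comparable.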

**To standardize, subtract the mean $\bar{x}$ from each value and divide by the standard deviation $s$.**

$$
z = \frac{x - \bar{x}}{s}
$$
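
As a worked example of the formula, the sketch below standardizes a small array with NumPy. The values are made up for illustration, and the population standard deviation (`ddof=0`) is used, which is also what scikit-learn's `StandardScaler` computes.

```python
import numpy as np

# Hypothetical sample values
x = np.array([50.0, 60.0, 70.0, 80.0, 90.0])

mean = x.mean()   # arithmetic mean of the sample
std = x.std()     # population standard deviation (ddof=0)

# z-score: subtract the mean, then divide by the standard deviation
z = (x - mean) / std

print(z)
print(z.mean(), z.std())  # approximately 0 and 1
```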

#### Normalization

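Rescales the values to the range 0 to 1, so that the minimum becomes 0 and the maximum becomes 1. Because the result depends directly on the minimum and maximum, it is sensitive to outliers.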

$$
\frac{x-x_{min}}{x_{max}-x_{min}}
$$
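
The sketch below applies this formula with NumPy and shows the sensitivity to outliers. The helper name `min_max_scale` and the sample values are hypothetical, and the helper assumes the input is not constant (otherwise the denominator is 0).

```python
import numpy as np

def min_max_scale(x):
    """Rescale values to [0, 1]; assumes x.max() != x.min()."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

# Evenly spread values map evenly onto [0, 1]
values = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
scaled = min_max_scale(values)
print(scaled)

# A single outlier squeezes all the other values close to 0
with_outlier = np.array([10.0, 20.0, 30.0, 40.0, 1000.0])
scaled_outlier = min_max_scale(with_outlier)
print(scaled_outlier)
```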


In [3]:
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("https://raw.githubusercontent.com/kirenz/datasets/master/Hitters.csv")
df.describe()
```

| | AtBat | Hits | HmRun | Runs | RBI | Walks | Years | CAtBat | CHits | CHmRun | CRuns | CRBI | CWalks | PutOuts | Assists | Errors | Salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 322.0 | 322.0 | 322.0 | 322.0 | 322.0 | 322.0 | 322.0 | 322.0 | 322.0 | 322.0 | 322.0 | 322.0 | 322.0 | 322.0 | 322.0 | 322.0 | 263.0 |
| mean | 380.928571 | 101.024845 | 10.770186 | 50.909938 | 48.02795 | 38.742236 | 7.444099 | 2648.68323 | 717.571429 | 69.490683 | 358.795031 | 330.118012 | 260.23913 | 288.937888 | 106.913043 | 8.040373 | 535.925882 |
| std | 153.404981 | 46.454741 | 8.709037 | 26.024095 | 26.166895 | 21.639327 | 4.926087 | 2324.20587 | 654.472627 | 86.266061 | 334.105886 | 333.219617 | 267.058085 | 280.704614 | 136.854876 | 6.368359 | 451.118681 |
| min | 16.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 19.0 | 4.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 67.5 |
| 25% | 255.25 | 64.0 | 4.0 | 30.25 | 28.0 | 22.0 | 4.0 | 816.75 | 209.0 | 14.0 | 100.25 | 88.75 | 67.25 | 109.25 | 7.0 | 3.0 | 190.0 |
| 50% | 379.5 | 96.0 | 8.0 | 48.0 | 44.0 | 35.0 | 6.0 | 1928.0 | 508.0 | 37.5 | 247.0 | 220.5 | 170.5 | 212.0 | 39.5 | 6.0 | 425.0 |
| 75% | 512.0 | 137.0 | 16.0 | 69.0 | 64.75 | 53.0 | 11.0 | 3924.25 | 1059.25 | 90.0 | 526.25 | 426.25 | 339.25 | 325.0 | 166.0 | 11.0 | 750.0 |
| max | 687.0 | 238.0 | 40.0 | 130.0 | 121.0 | 105.0 | 24.0 | 14053.0 | 4256.0 | 548.0 | 2165.0 | 1659.0 | 1566.0 | 1378.0 | 492.0 | 32.0 | 2460.0 |


In [2]:
```python
# Handle missing values: drop every row that contains a NaN
df.dropna(inplace=True)
```

In [4]:
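```python
# Prepare the data: Salary is the target, every other column is a feature
y_col = "Salary"
X = df.loc[:, df.columns != y_col]
y = df[y_col]

# Collect only the numeric columns, since those are the ones to standardize
numeric_cols = X.select_dtypes(include=np.number).columns.to_list()

print(X.dtypes)
# Create dummy variables (convert categorical variables to 0/1 values)
X = pd.get_dummies(X, drop_first=True)

# hold-out
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Cast the int64 columns to float so the scaled values can be written back;
# astype also returns copies, so the assignments below do not touch X
float_cols = {col: float for col in numeric_cols}
X_train = X_train.astype(float_cols)
X_test = X_test.astype(float_cols)

# Standardization
scaler = StandardScaler()
X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
# Note: the test data is transformed with the scaler that was fit on the training data
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])
X
```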

```text
AtBat         int64
Hits          int64
HmRun         int64
Runs          int64
RBI           int64
Walks         int64
Years         int64
CAtBat        int64
CHits         int64
CHmRun        int64
CRuns         int64
CRBI          int64
CWalks        int64
League       object
Division     object
PutOuts       int64
Assists       int64
Errors        int64
NewLeague    object
dtype: object
```


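| | AtBat | Hits | HmRun | Runs | RBI | Walks | Years | CAtBat | CHits | CHmRun | CRuns | CRBI | CWalks | PutOuts | Assists | Errors | League_N | Division_W | NewLeague_N |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 293 | 66 | 1 | 30 | 29 | 14 | 1 | 293 | 66 | 1 | 30 | 29 | 14 | 446 | 33 | 20 | 0 | 0 | 0 |
| 1 | 315 | 81 | 7 | 24 | 38 | 39 | 14 | 3449 | 835 | 69 | 321 | 414 | 375 | 632 | 43 | 10 | 1 | 1 | 1 |
| 2 | 479 | 130 | 18 | 66 | 72 | 76 | 3 | 1624 | 457 | 63 | 224 | 266 | 263 | 880 | 82 | 14 | 0 | 1 | 0 |
| 3 | 496 | 141 | 20 | 65 | 78 | 37 | 11 | 5628 | 1575 | 225 | 828 | 838 | 354 | 200 | 11 | 3 | 1 | 0 | 1 |
| 4 | 321 | 87 | 10 | 39 | 42 | 30 | 2 | 396 | 101 | 12 | 48 | 46 | 33 | 805 | 40 | 4 | 1 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 317 | 497 | 127 | 7 | 65 | 48 | 37 | 5 | 2703 | 806 | 32 | 379 | 311 | 138 | 325 | 9 | 3 | 1 | 0 | 1 |
| 318 | 492 | 136 | 5 | 76 | 50 | 94 | 12 | 5511 | 1511 | 39 | 897 | 451 | 875 | 313 | 381 | 20 | 0 | 0 | 0 |
| 319 | 475 | 126 | 3 | 61 | 43 | 52 | 6 | 1700 | 433 | 7 | 217 | 93 | 146 | 37 | 113 | 7 | 0 | 1 | 0 |
| 320 | 573 | 144 | 9 | 85 | 60 | 78 | 8 | 3198 | 857 | 97 | 470 | 420 | 332 | 1314 | 131 | 12 | 0 | 0 | 0 |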
