## Preprocessing: One-Hot Encoding and Missing-Value Handling

To learn one-hot encoding and missing-value handling, we use a dataset of loan screening results.

In [5]:
# import sample data: Loan screening data for classification 
import pandas as pd

df = pd.read_csv('./data/av_loan_u6lujuX_CVtuZ9i.csv', header=0)
X = df.iloc[:, :-1]          # every column except the last is a feature (X)
X = X.drop('Loan_ID',axis=1) # the first column is an ID, so drop it from the features
y = df.iloc[:, [-1]]         # the last column is the target

# check the shape
print('X shape: (%i,%i)' %X.shape)
print('y shape: (%i,%i)' %y.shape)

# Map samples rejected in loan screening ('N') to 1 (the positive class)
class_mapping = {'N':1, 'Y':0}
y_new = y.copy()
y_new.loc[:,'Loan_Status'] = y_new['Loan_Status'].map(class_mapping)
print('--------------------')
print(y_new.groupby(['Loan_Status']).size())
X.join(y_new).head()

X shape: (614,11)
y shape: (614,1)
--------------------
Loan_Status
0    422
1    192
dtype: int64


Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,0
1,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,1
2,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,0
3,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,0
4,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,0


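In the table above, for example, the first row of LoanAmount is missing. <b>Replacing such a missing value with the mean of the LoanAmount column is called missing-value imputation, and converting categorical variables such as Gender and Married into 0/1 binary variables is called one-hot encoding</b>.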

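To keep the preprocessing steps consistent, this course proceeds as follows: <b>(1) one-hot encode first, which resolves missing values in categorical variables by turning them into flag variables, then (2) replace the remaining missing values in continuous variables with the column mean</b>. Let's start with one-hot encoding. Set the option dummy_na=True so that the fact that a value was missing is itself kept as information.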

In [6]:
ohe_columns = ['Dependents',
               'Gender',
               'Married',
               'Education',
               'Self_Employed',
               'Property_Area']
X_new = pd.get_dummies(X,
                       dummy_na=True,
                       columns=ohe_columns)
X_new.head()

Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Dependents_0,Dependents_1,Dependents_2,Dependents_3+,Dependents_nan,...,Education_Graduate,Education_Not Graduate,Education_nan,Self_Employed_No,Self_Employed_Yes,Self_Employed_nan,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban,Property_Area_nan
0,5849,0.0,,360.0,1.0,1,0,0,0,0,...,1,0,0,1,0,0,0,0,1,0
1,4583,1508.0,128.0,360.0,1.0,0,1,0,0,0,...,1,0,0,1,0,0,1,0,0,0
2,3000,0.0,66.0,360.0,1.0,1,0,0,0,0,...,1,0,0,0,1,0,0,0,1,0
3,2583,2358.0,120.0,360.0,1.0,1,0,0,0,0,...,0,1,0,1,0,0,0,0,1,0
4,6000,0.0,141.0,360.0,1.0,1,0,0,0,0,...,1,0,0,1,0,0,0,0,1,0


In [7]:
### TEST
ohe_columns = ['Dependents',
               'Gender',
               'Married',
               'Education',
               'Self_Employed',
               'Property_Area']
X_new = pd.get_dummies(X,
                       dummy_na=True)
X_new.head()

Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Gender_Female,Gender_Male,Gender_nan,Married_No,Married_Yes,...,Education_Graduate,Education_Not Graduate,Education_nan,Self_Employed_No,Self_Employed_Yes,Self_Employed_nan,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban,Property_Area_nan
0,5849,0.0,,360.0,1.0,0,1,0,1,0,...,1,0,0,1,0,0,0,0,1,0
1,4583,1508.0,128.0,360.0,1.0,0,1,0,0,1,...,1,0,0,1,0,0,1,0,0,0
2,3000,0.0,66.0,360.0,1.0,0,1,0,0,1,...,1,0,0,0,1,0,0,0,1,0
3,2583,2358.0,120.0,360.0,1.0,0,1,0,0,1,...,0,1,0,1,0,0,0,0,1,0
4,6000,0.0,141.0,360.0,1.0,0,1,0,1,0,...,1,0,0,1,0,0,0,0,1,0
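
To make the effect of dummy_na=True easier to see, here is a minimal sketch on a hypothetical three-row frame (the `toy` data is invented for illustration and is not part of the loan dataset). The cast to int is only there so the output looks the same across pandas versions, some of which return boolean dummy columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one categorical column with a missing value
toy = pd.DataFrame({"Gender": ["Male", np.nan, "Female"],
                    "Income": [5849, 4583, 3000]})

# dummy_na=True adds a Gender_nan column that flags rows where Gender was missing
encoded = pd.get_dummies(toy, columns=["Gender"], dummy_na=True)

# Cast the dummy columns to int so they print as 0/1 in every pandas version
dummy_cols = [c for c in encoded.columns if c.startswith("Gender_")]
encoded[dummy_cols] = encoded[dummy_cols].astype(int)

print(encoded)
```

The second row, whose Gender was missing, ends up with Gender_nan set to 1 and both Gender_Female and Gender_Male set to 0.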


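This completes the numeric encoding of the categorical variables and the handling of their missing values. <b>Next, we replace missing values in the continuous variables with the mean. This can be done with sklearn's Imputer class</b>. So that we can verify the result afterwards, first check the summary statistics of LoanAmount and confirm that its mean is 146.412162.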

In [8]:
X_new.describe()

Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Gender_Female,Gender_Male,Gender_nan,Married_No,Married_Yes,...,Education_Graduate,Education_Not Graduate,Education_nan,Self_Employed_No,Self_Employed_Yes,Self_Employed_nan,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban,Property_Area_nan
count,614.0,614.0,592.0,600.0,564.0,614.0,614.0,614.0,614.0,614.0,...,614.0,614.0,614.0,614.0,614.0,614.0,614.0,614.0,614.0,614.0
mean,5403.459283,1621.245798,146.412162,342.0,0.842199,0.18241,0.796417,0.021173,0.346906,0.648208,...,0.781759,0.218241,0.0,0.814332,0.13355,0.052117,0.291531,0.379479,0.32899,0.0
std,6109.041673,2926.248369,85.587325,65.12041,0.364878,0.386497,0.402991,0.144077,0.476373,0.477919,...,0.413389,0.413389,0.0,0.389155,0.340446,0.222445,0.454838,0.485653,0.470229,0.0
min,150.0,0.0,9.0,12.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2877.5,0.0,100.0,360.0,1.0,0.0,1.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,3812.5,1188.5,128.0,360.0,1.0,0.0,1.0,0.0,0.0,1.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,5795.0,2297.25,168.0,360.0,1.0,0.0,1.0,0.0,1.0,1.0,...,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0
max,81000.0,41667.0,700.0,480.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0


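Now let's run mean imputation on the continuous variables. Import Imputer from the preprocessing module. Applying the transform method of the Imputer class replaces the missing LoanAmount values (the first row, for example) with the mean (146.412162) instead of NaN.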

In [9]:
from sklearn.preprocessing import Imputer

# Instantiate the Imputer class
# Replace missing values (NaN) with the mean, computed column-wise (axis=0)
imp = Imputer(missing_values='NaN',
              strategy='mean',
              axis=0)

# Learn the mean of each feature
imp.fit(X_new)

# Apply the fitted Imputer to replace the missing values in X_new
X_new_columns = X_new.columns.values
X_new = pd.DataFrame(imp.transform(X_new),
                     columns=X_new_columns)

# Show the result
X_new.head()

Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Gender_Female,Gender_Male,Gender_nan,Married_No,Married_Yes,...,Education_Graduate,Education_Not Graduate,Education_nan,Self_Employed_No,Self_Employed_Yes,Self_Employed_nan,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban,Property_Area_nan
0,5849.0,0.0,146.412162,360.0,1.0,0.0,1.0,0.0,1.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
1,4583.0,1508.0,128.0,360.0,1.0,0.0,1.0,0.0,0.0,1.0,...,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
2,3000.0,0.0,66.0,360.0,1.0,0.0,1.0,0.0,0.0,1.0,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
3,2583.0,2358.0,120.0,360.0,1.0,0.0,1.0,0.0,0.0,1.0,...,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
4,6000.0,0.0,141.0,360.0,1.0,0.0,1.0,0.0,1.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0


In [10]:
### If columns=X_new_columns is not passed to pd.DataFrame, the column headers are lost
from sklearn.preprocessing import Imputer

# Instantiate the Imputer class
# Replace missing values (NaN) with the mean, computed column-wise (axis=0)
imp = Imputer(missing_values='NaN',
              strategy='mean',
              axis=0)

# Learn the mean of each feature
imp.fit(X_new)

# Apply the fitted Imputer to replace the missing values in X_new
#X_new_columns = X_new.columns.values
X_new = pd.DataFrame(imp.transform(X_new))

# Show the result
X_new.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,16,17,18,19,20,21,22,23,24,25
0,5849.0,0.0,146.412162,360.0,1.0,0.0,1.0,0.0,1.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
1,4583.0,1508.0,128.0,360.0,1.0,0.0,1.0,0.0,0.0,1.0,...,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
2,3000.0,0.0,66.0,360.0,1.0,0.0,1.0,0.0,0.0,1.0,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
3,2583.0,2358.0,120.0,360.0,1.0,0.0,1.0,0.0,0.0,1.0,...,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
4,6000.0,0.0,141.0,360.0,1.0,0.0,1.0,0.0,1.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
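
The cells above rely on `sklearn.preprocessing.Imputer`, which was removed in scikit-learn 0.22. On newer versions, the same mean imputation can be sketched with `sklearn.impute.SimpleImputer`, which always works column-wise and therefore has no `axis` argument. The `toy` frame below is hypothetical data invented for illustration; it is not the loan dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value in each column
toy = pd.DataFrame({"LoanAmount": [np.nan, 128.0, 66.0, 120.0],
                    "Loan_Amount_Term": [360.0, 360.0, np.nan, 360.0]})

# Replace NaN with the mean of each column
imp = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit_transform returns a NumPy array, so pass the column names back explicitly
filled = pd.DataFrame(imp.fit_transform(toy), columns=toy.columns)

print(filled)
```

As in the cells above, the column names have to be passed to pd.DataFrame explicitly, because the array returned by the imputer carries no header.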


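This completes one-hot encoding and missing-value imputation. If the number of feature dimensions is still small at this point, you can feed the data into an algorithm as it is. In practice, however, the dimensionality often runs into the hundreds or thousands, and in that case you apply the dimensionality reduction described below.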

## Preprocessing: Dimensionality Reduction (RFE & PCA)

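The number of features has now grown from the original 11 to 26.<br><b>Here we use RFE (recursive feature elimination) to narrow them down to the 10 variables judged most useful for prediction.</b>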

In [11]:
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

# Use RandomForestClassifier as the estimator that scores feature importance
# Keep 10 features in the end
# Remove 5% of the features at each step
selector = RFE(estimator=RandomForestClassifier(random_state=0),
               n_features_to_select=10,
               step=0.05)
selector.fit(X_new, y.values.ravel())  # .values replaces the deprecated .as_matrix()



RFE(estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=0, verbose=0, warm_start=False),
  n_features_to_select=10, step=0.05, verbose=0)

Fitting RFE determines which of the 26 variables to keep.<br><b>You can check the retained variables through the "support_" attribute.</b><br>True marks the position of a selected variable.

In [12]:
print(selector.support_)

[ True  True  True  True  True False  True False  True False False False
  True False False False False False False  True False False False  True
 False False]
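
A boolean mask is hard to read on its own, so it is often mapped back to column names. The following is a minimal sketch on synthetic data from `make_classification`; the column names `f0` to `f7` and the parameter values are assumptions made for illustration. It also prints `ranking_`, where selected features have rank 1 and larger ranks were eliminated earlier.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic data standing in for the encoded loan features (illustrative only)
X_arr, y_arr = make_classification(n_samples=200, n_features=8,
                                   n_informative=3, random_state=0)
X_toy = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

selector = RFE(estimator=RandomForestClassifier(n_estimators=20, random_state=0),
               n_features_to_select=3,
               step=1)
selector.fit(X_toy, y_arr)

# support_ is a boolean mask, so it can index the column labels directly
selected = X_toy.columns[selector.support_]
print(list(selected))

# ranking_ is 1 for selected features; higher values were dropped earlier
print(dict(zip(X_toy.columns, selector.ranking_)))
```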


Now that fit has decided which variables to select, let's actually filter the data.<br>As with Imputer, the data is converted with transform.

In [13]:
# Reduce 26 dimensions to 10
X_new_selected = selector.transform(X_new)
X_new_selected = pd.DataFrame(X_new_selected,
                              columns=X_new_columns[selector.support_])
print('---------------------------------------')
print('X shape after RFE:', X_new_selected.shape)
print('---------------------------------------')
print(X_new_selected.dtypes)
X_new_selected.head()

---------------------------------------
X shape after RFE: (614, 10)
---------------------------------------
ApplicantIncome            float64
CoapplicantIncome          float64
LoanAmount                 float64
Loan_Amount_Term           float64
Credit_History             float64
Gender_Male                float64
Married_No                 float64
Dependents_1               float64
Self_Employed_No           float64
Property_Area_Semiurban    float64
dtype: object


Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Gender_Male,Married_No,Dependents_1,Self_Employed_No,Property_Area_Semiurban
0,5849.0,0.0,146.412162,360.0,1.0,1.0,1.0,0.0,1.0,0.0
1,4583.0,1508.0,128.0,360.0,1.0,1.0,0.0,1.0,1.0,0.0
2,3000.0,0.0,66.0,360.0,1.0,1.0,0.0,0.0,0.0,0.0
3,2583.0,2358.0,120.0,360.0,1.0,1.0,0.0,0.0,1.0,0.0
4,6000.0,0.0,141.0,360.0,1.0,1.0,1.0,0.0,1.0,0.0


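A variant of RFE, the RFECV class, is also available; it uses cross-validation to determine the number of features to select automatically. It is more computationally expensive, but it is useful when the dataset is small and you want fewer hyperparameters to tune.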

In [14]:
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier

selector = RFECV(estimator=RandomForestClassifier(random_state=0),step=0.05)
X_new_selected = selector.fit_transform(X_new, y.values.ravel())  # .values replaces the deprecated .as_matrix()

print(X_new_selected.shape)



(614, 7)
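
The number of features that RFECV settled on can also be read from the fitted selector instead of from the shape of the output. A minimal sketch on synthetic data follows; the data, `cv=3`, and `scoring="f1"` are illustrative assumptions, not the settings used in the cell above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Synthetic data standing in for the encoded loan features (illustrative only)
X_arr, y_arr = make_classification(n_samples=150, n_features=8,
                                   n_informative=3, random_state=0)

# cv and scoring are set explicitly here; both values are assumptions for this sketch
selector = RFECV(estimator=RandomForestClassifier(n_estimators=10, random_state=0),
                 step=1,
                 cv=3,
                 scoring="f1")
X_sel = selector.fit_transform(X_arr, y_arr)

# n_features_ holds the number of features chosen by cross-validation
print(selector.n_features_)
print(X_sel.shape)
```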


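This completes feature selection with RFE. In predictive modeling, the RFE-selected feature matrix X is then run through cross-validation to select the best model. Finally, <b>let's look at dimensionality reduction with PCA.</b> The simplest implementation is to embed PCA in a pipeline. The following applies PCA to X_new (the 26-dimensional features before RFE).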

In [15]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score

# Embedding PCA in the pipeline applies the dimensionality reduction automatically
clf = Pipeline([('scl',StandardScaler()),
                ('reduct',PCA(n_components=10,random_state=1)),
                ('clf', GradientBoostingClassifier(random_state=1))])

# PCA is applied automatically during fitting
clf.fit(X_new, y_new.values.ravel())  # .values replaces the deprecated .as_matrix()
print('Normally done')

Normally done
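
Once a pipeline like this has been fitted, the PCA step can be inspected through `named_steps`. The sketch below uses synthetic data and 5 components, both chosen only for illustration, and prints how much variance each retained component explains.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for X_new and y_new (illustrative only)
X_arr, y_arr = make_classification(n_samples=200, n_features=12, random_state=1)

pipe = Pipeline([('scl', StandardScaler()),
                 ('reduct', PCA(n_components=5, random_state=1)),
                 ('clf', GradientBoostingClassifier(random_state=1))])
pipe.fit(X_arr, y_arr)

# Fitted steps are accessible by the names given in the pipeline definition
pca = pipe.named_steps['reduct']
ratio = pca.explained_variance_ratio_
print(ratio.round(3))   # share of variance explained by each component
print(ratio.sum())      # total share kept after reduction

# Reproduce what the classifier actually sees: scale, then project
X_reduced = pca.transform(pipe.named_steps['scl'].transform(X_arr))
print(X_reduced.shape)
```

If the total explained variance is low, n_components is a natural candidate to tune during cross-validation.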




Because the learner clf is in fact a pipeline, <b>saving it as a trained model stores all three fitted components together: the fitted scl, the fitted reduct (PCA), and the fitted clf (the model)</b>.
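
One common way to save such a pipeline is joblib; the notebook does not say which tool it uses, so treat the following as a sketch under that assumption. The synthetic data, the temporary directory, and the file name `pipeline.joblib` are all hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for X_new and y_new (illustrative only)
X_arr, y_arr = make_classification(n_samples=200, n_features=12, random_state=1)

clf = Pipeline([('scl', StandardScaler()),
                ('reduct', PCA(n_components=5, random_state=1)),
                ('clf', GradientBoostingClassifier(random_state=1))])
clf.fit(X_arr, y_arr)

# Save the whole fitted pipeline to one file, then load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'pipeline.joblib')  # hypothetical file name
    joblib.dump(clf, path)
    restored = joblib.load(path)

# The restored object contains all three fitted steps and predicts identically
print(list(restored.named_steps))
same = np.array_equal(clf.predict(X_arr), restored.predict(X_arr))
print(same)
```

Because scaling, PCA, and the classifier travel together, new data only needs the same one-hot encoding and imputation before being passed to the restored pipeline's predict.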