# Automatic Feature Selection

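Feature selection keeps only the informative features in order to simplify the model and improve generalization. There are three main approaches:  
1: Univariate statistics  
2: Model-based feature selection  
3: Iterative feature selection  
  
Try them and use whichever suits the data.  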
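
As a rough sketch of how the three approaches map onto scikit-learn, the following uses `SelectPercentile`, `SelectFromModel`, and `RFE`. It runs on scikit-learn's bundled wine dataset (`load_wine`), which is an assumption standing in for `train.tsv`, and the parameter values (`percentile=50`, `n_features_to_select=6`, 100 trees) are illustrative only.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel, SelectPercentile, f_classif

# Assumption: the bundled wine dataset stands in for train.tsv.
X, y = load_wine(return_X_y=True)

selectors = {
    # 1: score each feature on its own against the target
    "univariate": SelectPercentile(f_classif, percentile=50),
    # 2: keep features whose importance in a fitted model clears a threshold
    "model-based": SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=0)),
    # 3: repeatedly refit and drop the least important feature
    "iterative": RFE(
        RandomForestClassifier(n_estimators=100, random_state=0),
        n_features_to_select=6),
}

selected_shapes = {}
for name, selector in selectors.items():
    X_selected = selector.fit_transform(X, y)
    selected_shapes[name] = X_selected.shape
    print(name, X_selected.shape)
```

All three selectors share the same `fit` / `transform` / `get_support` interface, so swapping one for another only changes the constructor line.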

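Applied here: model-based feature selection.  
A supervised learning model is used to judge the importance of each feature, and only the important ones are kept.  
All features are considered at the same time,  
so the model can capture interactions between variables.  
  
(Univariate statistics, by contrast, compute whether there is a statistically significant relationship between each individual feature and the target.)  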
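
The difference can be illustrated on synthetic data (not the wine data). In this sketch the target depends only on the interaction of two columns, so a per-feature test such as `f_classif` has little to work with, while a random forest, which sees all columns together, is expected to rank those two columns above the noise columns. The data-generating setup is an assumption chosen purely for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import f_classif

rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(1000, 5))
# The target depends only on the interaction of columns 0 and 1 (XOR-like);
# columns 2-4 are pure noise.
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# Univariate view: one F statistic per feature, each computed in isolation.
f_scores, p_values = f_classif(X, y)
print("univariate F scores:", np.round(f_scores, 2))

# Model-based view: importances from a model fitted on all features at once.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = forest.feature_importances_
print("forest importances: ", np.round(importances, 3))
```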

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()

In [3]:
# Load the data
wine_data = pd.read_csv("train.tsv",delimiter='\t')
wine_test_data = pd.read_csv("test.tsv",delimiter='\t')

In [4]:
# Process only the outliers, in the same way

In [5]:
# Compute the 25th and 75th percentiles of each column
wine_data_drop_id = wine_data.drop(["id","Y"],axis=1)

q25 = wine_data_drop_id.quantile(0.25)
q75 = wine_data_drop_id.quantile(0.75)

iqr = q75 - q25
# Lower bound: Q1 - 1.5 * IQR
lower_bound = q25 - (iqr * 1.5)

# Upper bound: Q3 + 1.5 * IQR
upper_bound = q75 + (iqr * 1.5)

# Mean of each column (not used below)
wine_mean = wine_data_drop_id.mean()
print(lower_bound)
print(upper_bound)

Alcohol                          10.585
Malic acid                       -0.560
Ash                               1.685
Alcalinity of ash                 9.000
Magnesium                        60.500
Total phenols                     0.425
Flavanoids                       -1.210
Nonflavanoid phenols             -0.055
Proanthocyanins                  -0.095
Color intensity                  -0.825
Hue                               0.320
OD280/OD315 of diluted wines      0.285
Proline                        -105.000
dtype: float64
Alcohol                           15.345
Malic acid                         5.120
Ash                                3.165
Alcalinity of ash                 29.800
Magnesium                        136.500
Total phenols                      4.225
Flavanoids                         5.350
Nonflavanoid phenols               0.785
Proanthocyanins                    3.225
Color intensity                   10.095
Hue                                1.600
OD280/OD315 of

In [6]:
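for col in wine_data_drop_id:
    # Values at or below the lower bound (Q1 - 1.5 * IQR) are clipped to the lower bound
    wine_data_drop_id.loc[wine_data_drop_id[col]<=lower_bound[col],col] = lower_bound[col]    
    # Values at or above the upper bound (Q3 + 1.5 * IQR) are clipped to the upper bound
    wine_data_drop_id.loc[wine_data_drop_id[col]>=upper_bound[col],col] = upper_bound[col]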
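
The loop above clips each column to its IQR-based bounds. A minimal sketch of the same idea with `DataFrame.clip` follows. The helper name `iqr_bounds` and the toy frames are hypothetical and not part of the notebook.

```python
import pandas as pd

def iqr_bounds(df, k=1.5):
    """Return per-column (lower, upper) bounds: Q1 - k*IQR and Q3 + k*IQR."""
    q25 = df.quantile(0.25)
    q75 = df.quantile(0.75)
    iqr = q75 - q25
    return q25 - k * iqr, q75 + k * iqr

# Hypothetical toy data with one obvious outlier per column.
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0, 100.0],
                    "b": [-50.0, 10.0, 11.0, 12.0, 13.0]})
lower, upper = iqr_bounds(toy)

# axis=1 aligns the bound Series with the column labels.
clipped = toy.clip(lower=lower, upper=upper, axis=1)
print(clipped)

# The same bounds can be reused on other rows with the same columns.
other = pd.DataFrame({"a": [-10.0, 5.0], "b": [20.0, 11.0]})
other_clipped = other.clip(lower=lower, upper=upper, axis=1)
print(other_clipped)
```

Keeping the bounds separate from the clipping step would make it possible to compute them once on the training data and apply them unchanged to `test.tsv`, if the same preprocessing is wanted on both.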
    
    

Run model-based selection.

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
# Model-based feature selection
from sklearn.feature_selection import SelectFromModel

In [34]:
X = wine_data_drop_id
Y = wine_data["Y"]

In [35]:
# Test:train ratio is 3:7
X_train,X_test,y_train,y_test = train_test_split(X,Y,test_size=0.3,random_state=1)

In [36]:
select = SelectFromModel(
    RandomForestClassifier(
        n_estimators=1000,max_depth=3,random_state=0,min_samples_leaf=4))

In [37]:
select.fit(X_train,y_train)

SelectFromModel(estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=3, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=4, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=1000, n_jobs=None,
            oob_score=False, random_state=0, verbose=0, warm_start=False),
        max_features=None, norm_order=1, prefit=False, threshold=None)

In [38]:
# Training data restricted to the selected features
X_train_l1 = select.transform(X_train)

In [39]:
print("X_train.shape: {}".format(X_train.shape))
print("X_train_l1.shape: {}".format(X_train_l1.shape))

X_train.shape: (62, 13)
X_train_l1.shape: (62, 6)
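
Why 6 of the 13 features survive: with `threshold=None` and a random forest as the estimator, `SelectFromModel` uses the mean feature importance as its threshold and keeps the features at or above it. The sketch below checks this on scikit-learn's bundled wine dataset (an assumption standing in for `train.tsv`, with 100 trees to keep it fast), so the selected set need not match the one in this notebook.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Assumption: the bundled wine dataset stands in for train.tsv.
X, y = load_wine(return_X_y=True)

select = SelectFromModel(
    RandomForestClassifier(
        n_estimators=100, max_depth=3, random_state=0, min_samples_leaf=4))
select.fit(X, y)

# Importances of the forest that SelectFromModel fitted internally.
importances = select.estimator_.feature_importances_
print("threshold:      ", select.threshold_)
print("mean importance:", importances.mean())

# Rebuild the selection mask by hand and compare it with get_support().
manual_mask = importances >= select.threshold_
print(np.array_equal(manual_mask, select.get_support()))
```

Passing an explicit `threshold` (for example `"median"`) or `max_features` to `SelectFromModel` changes how many features are kept.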


In [40]:
for col, selected in zip(wine_data_drop_id.columns, select.get_support()):
    print(col,":",selected)


Alcohol : True
Malic acid : False
Ash : False
Alcalinity of ash : False
Magnesium : False
Total phenols : False
Flavanoids : True
Nonflavanoid phenols : False
Proanthocyanins : False
Color intensity : True
Hue : True
OD280/OD315 of diluted wines : True
Proline : True


◆ Features selected (True)  
Alcohol  
Flavanoids  
Color intensity  
Hue  
OD280/OD315 of diluted wines  
Proline   

Try training with these 6 columns.

In [41]:
X_train.columns

Index(['Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 'Magnesium',
       'Total phenols', 'Flavanoids', 'Nonflavanoid phenols',
       'Proanthocyanins', 'Color intensity', 'Hue',
       'OD280/OD315 of diluted wines', 'Proline'],
      dtype='object')


In [42]:
X_train_l1 = X_train.drop(["Malic acid","Ash","Alcalinity of ash","Magnesium","Total phenols","Nonflavanoid phenols","Proanthocyanins"],axis=1)


In [45]:
X_test_l1 = X_test.drop(["Malic acid","Ash","Alcalinity of ash","Magnesium","Total phenols","Nonflavanoid phenols","Proanthocyanins"],axis=1)
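
The two `drop` calls above list the rejected columns by hand. A sketch that derives the kept columns from `get_support()` instead is shown below, so the train and test frames cannot drift apart. It uses scikit-learn's bundled wine dataset as a stand-in for `train.tsv` (an assumption), and the names `selected_columns`, `X_train_sel`, and `X_test_sel` are hypothetical.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

# Assumption: the bundled wine dataset (as a DataFrame) stands in for train.tsv.
data = load_wine(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=1)

select = SelectFromModel(
    RandomForestClassifier(
        n_estimators=100, max_depth=3, random_state=0, min_samples_leaf=4))
select.fit(X_train, y_train)

# Boolean mask -> column labels; the same labels index both frames.
selected_columns = X_train.columns[select.get_support()]
X_train_sel = X_train[selected_columns]
X_test_sel = X_test[selected_columns]
print(list(selected_columns))
```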

In [68]:
# Build the model (this fits on X_train with all 13 features; X_train_l1 / X_test_l1 are not used here)
clf = RandomForestClassifier(n_estimators=1000,max_depth=10,random_state=1,min_samples_split=2)

clf.fit(X_train,y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=10, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=1000, n_jobs=None,
            oob_score=False, random_state=1, verbose=0, warm_start=False)

In [69]:
#score
score = clf.score(X_test,y_test)
print("testData score:",score)

#train_data
train_score = clf.score(X_train,y_train)
print("trainData score:",train_score)

testData score: 0.9629629629629629
trainData score: 1.0
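
With only 27 held-out rows, a test score of 0.963 corresponds to 26 of 27 correct, so a single split is a coarse estimate. One way to compare "all features" against "selected features" more stably is cross-validation with the selector inside a pipeline, so that selection is refit on each training fold. The sketch below assumes scikit-learn's bundled wine dataset in place of `train.tsv` and uses 100 trees to keep it fast; its scores are not the notebook's scores.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Assumption: the bundled wine dataset stands in for train.tsv.
X, y = load_wine(return_X_y=True)

# Baseline: random forest on all features.
all_features = RandomForestClassifier(n_estimators=100, random_state=1)

# Selection + classifier in one pipeline, refit together on every fold.
selected_features = make_pipeline(
    SelectFromModel(
        RandomForestClassifier(
            n_estimators=100, max_depth=3, random_state=0, min_samples_leaf=4)),
    RandomForestClassifier(n_estimators=100, random_state=1),
)

scores_all = cross_val_score(all_features, X, y, cv=5)
scores_sel = cross_val_score(selected_features, X, y, cv=5)
print("all features:     ", scores_all.mean())
print("selected features:", scores_sel.mean())
```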


In [70]:
# Features of the submission test set (the IQR clipping above is not applied to it)
y_test_data = wine_test_data.drop(["id"],axis=1)

#predict
result = clf.predict(y_test_data)

In [71]:
# Format for submission
np_id = wine_test_data['id'].values
dd=pd.DataFrame({"id":np_id, "ans":result})
dd.to_csv("result_16.csv",header=False,index=False)
dd

   id  ans
0   2    1
1   4    3
2   5    2
3   7    2
4   8    1
5  10    1
6  16    2
7  18    2
8  19    1
9  22    2


result11
n_estimators=1000,max_depth=3,random_state=0,min_samples_leaf=4

98%

result13
n_estimators=5000,max_depth=5,random_state=2,min_samples_leaf=5

97%

result15
n_estimators=1000,max_depth=10,random_state=1,min_samples_split=2,min_samples_leaf=1

100%
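
The three settings noted above could also be compared side by side under cross-validation rather than one run at a time. This is only a sketch: it assumes scikit-learn's bundled wine dataset in place of `train.tsv`, and it reduces `n_estimators` to 100 for every configuration to keep the run short, so its numbers are not comparable with the percentages recorded above.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Assumption: the bundled wine dataset stands in for train.tsv.
X, y = load_wine(return_X_y=True)

# Settings taken from the notes above, except n_estimators, which is
# fixed at 100 here (an assumption, to keep the sketch fast).
configs = {
    "result11": dict(max_depth=3, random_state=0, min_samples_leaf=4),
    "result13": dict(max_depth=5, random_state=2, min_samples_leaf=5),
    "result15": dict(max_depth=10, random_state=1,
                     min_samples_split=2, min_samples_leaf=1),
}

cv_means = {}
for name, params in configs.items():
    clf = RandomForestClassifier(n_estimators=100, **params)
    cv_means[name] = cross_val_score(clf, X, y, cv=5).mean()
    print(name, round(cv_means[name], 3))
```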