# [Problem 1] Reviewing the competition

Read the competition's Overview page and check the following points about "Home Credit Default Risk".


- What is learned, and what is predicted
- What kind of file is created and submitted to Kaggle
- Which metric is used to evaluate submissions

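**What is learned, and what is predicted?**

From each applicant's credit information, previous applications, and similar data, predict whether each record will end in default. ⇒ In other words, predict the client's ability to repay.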

**What kind of file is created and submitted to Kaggle?**

⇒ Create a new file, submission.csv, with two columns, SK_ID_CURR and TARGET, and submit it to Kaggle.
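As a minimal sketch of that file layout, the snippet below builds a two-column table and renders it as CSV text. The IDs and probabilities are made-up placeholders, not real data or model output.

```python
import numpy as np
import pandas as pd

# Hypothetical IDs and probabilities standing in for real test data and model output.
ids = np.array([100001, 100005, 100013])
probs = np.array([0.12, 0.57, 0.33])

submission = pd.DataFrame({"SK_ID_CURR": ids, "TARGET": probs})

# Basic sanity checks before writing the file.
assert list(submission.columns) == ["SK_ID_CURR", "TARGET"]
assert submission["SK_ID_CURR"].is_unique
assert submission["TARGET"].between(0, 1).all()

# With no path argument, to_csv returns the CSV content as a string.
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Passing a file name such as `'submission.csv'` to `to_csv` writes the same content to disk.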

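**Which metric is used to evaluate submissions?**

⇒ The area under the ROC curve (ROC AUC) between the predicted probability and the observed target.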
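The competition metric is the area under the ROC curve. The sketch below uses made-up labels and scores to show how `roc_auc_score` relates to a pairwise ranking: the AUC equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score (ties, absent here, would count as one half).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels (1 = default) and predicted scores; the values are made up for illustration.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

auc = roc_auc_score(y_true, y_score)

# Same quantity computed by comparing every positive score with every negative score.
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairwise = np.mean(pos[:, None] > neg[None, :])

print(auc, pairwise)  # 0.75 0.75
```

Three of the four positive/negative pairs are ordered correctly, hence 0.75. The metric is computed on predicted probabilities, so the hard 0/1 labels from `predict` are not needed.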

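## [Problem 2] Training and validation
Build and run an end-to-end workflow: briefly analyze the data, preprocess it, train a model, and validate it.


For validation, use the evaluation metric used in this competition. The training method is not specified.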

In [3]:
# Import libraries

# data analysis and wrangling
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler # standardization (transform to mean 0, variance 1)
from sklearn.model_selection import train_test_split # data splitting

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# machine learning
from sklearn.neighbors import KNeighborsClassifier # k-nearest neighbors
from sklearn.linear_model import LogisticRegression # logistic regression
from sklearn.svm import SVC # support vector classifier
from sklearn.tree import DecisionTreeClassifier # decision tree
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier # random forest, gradient boosting

# model evaluation & tuning
from sklearn.metrics import accuracy_score # accuracy
from sklearn.metrics import precision_score # precision
from sklearn.metrics import recall_score # recall
from sklearn.metrics import f1_score # F1 score
from sklearn.metrics import confusion_matrix # confusion matrix
from sklearn.metrics import classification_report # per-class precision, recall, and F1 summary
from sklearn.model_selection import cross_val_score # cross-validation

In [4]:
df_train = pd.read_csv('application_train.csv') 
df_train

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,0,,,,,,
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
307506,456251,0,Cash loans,M,N,N,0,157500.0,254700.0,27558.0,...,0,0,0,0,,,,,,
307507,456252,0,Cash loans,F,N,Y,0,72000.0,269550.0,12001.5,...,0,0,0,0,,,,,,
307508,456253,0,Cash loans,F,N,Y,0,153000.0,677664.0,29979.0,...,0,0,0,0,1.0,0.0,0.0,1.0,0.0,1.0
307509,456254,1,Cash loans,F,N,Y,0,171000.0,370107.0,20205.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


In [5]:
# Load application_test.csv

df_test = pd.read_csv('application_test.csv') 
df_test

Unnamed: 0,SK_ID_CURR,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100001,Cash loans,F,N,Y,0,135000.0,568800.0,20560.5,450000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
1,100005,Cash loans,M,N,Y,0,99000.0,222768.0,17370.0,180000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0
2,100013,Cash loans,M,Y,Y,0,202500.0,663264.0,69777.0,630000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,1.0,4.0
3,100028,Cash loans,F,N,Y,2,315000.0,1575000.0,49018.5,1575000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0
4,100038,Cash loans,M,Y,N,1,180000.0,625500.0,32067.0,625500.0,...,0,0,0,0,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48739,456221,Cash loans,F,N,Y,0,121500.0,412560.0,17473.5,270000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
48740,456222,Cash loans,F,N,N,2,157500.0,622413.0,31909.5,495000.0,...,0,0,0,0,,,,,,
48741,456223,Cash loans,F,Y,Y,1,202500.0,315000.0,33205.5,315000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,3.0,1.0
48742,456224,Cash loans,M,N,N,0,225000.0,450000.0,25128.0,450000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,2.0


In [6]:
# Function to summarize missing values

def missing_values_summary(df):
    mis_val = df.isnull().sum()  # missing-value count per column
    mis_val_percent = 100 * mis_val / len(df)  # missing-value percentage per column
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'mis_val_count', 1: 'mis_val_percent'})
    # Keep only the columns that have missing values, sorted by percentage (descending)
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns['mis_val_percent'] != 0
    ].sort_values('mis_val_percent', ascending=False).round(1)
    print("Number of columns: " + str(df.shape[1]) + "\n"
          + "Columns with missing values: " + str(mis_val_table_ren_columns.shape[0]))
    return mis_val_table_ren_columns

In [7]:
# Check missing values in the train data

application_train_mv = missing_values_summary(df_train)
application_train_mv.head(30)

Number of columns: 122
Columns with missing values: 67


Unnamed: 0,mis_val_count,mis_val_percent
COMMONAREA_MEDI,214865,69.9
COMMONAREA_AVG,214865,69.9
COMMONAREA_MODE,214865,69.9
NONLIVINGAPARTMENTS_MEDI,213514,69.4
NONLIVINGAPARTMENTS_MODE,213514,69.4
NONLIVINGAPARTMENTS_AVG,213514,69.4
FONDKAPREMONT_MODE,210295,68.4
LIVINGAPARTMENTS_MODE,210199,68.4
LIVINGAPARTMENTS_MEDI,210199,68.4
LIVINGAPARTMENTS_AVG,210199,68.4


In [8]:
# List the train columns that have 1,500 or more missing values, then drop them

df_drop = application_train_mv[application_train_mv['mis_val_count'] >= 1500]
dropping_columns_list = list(df_drop.index)
df_train_dropped = df_train.drop(dropping_columns_list, axis=1)
df_train_dropped

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_12,FLAG_DOCUMENT_13,FLAG_DOCUMENT_14,FLAG_DOCUMENT_15,FLAG_DOCUMENT_16,FLAG_DOCUMENT_17,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0,0,0,0,0,0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0,0,0,0,0,0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0,0,0,0,0,0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,0,0,0,0,0,0,0
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
307506,456251,0,Cash loans,M,N,N,0,157500.0,254700.0,27558.0,...,0,0,0,0,0,0,0,0,0,0
307507,456252,0,Cash loans,F,N,Y,0,72000.0,269550.0,12001.5,...,0,0,0,0,0,0,0,0,0,0
307508,456253,0,Cash loans,F,N,Y,0,153000.0,677664.0,29979.0,...,0,0,0,0,0,0,0,0,0,0
307509,456254,1,Cash loans,F,N,Y,0,171000.0,370107.0,20205.0,...,0,0,0,0,0,0,0,0,0,0


In [9]:
# Check missing values in the test data

application_test_mv = missing_values_summary(df_test)
application_test_mv

Number of columns: 121
Columns with missing values: 64


Unnamed: 0,mis_val_count,mis_val_percent
COMMONAREA_MODE,33495,68.7
COMMONAREA_MEDI,33495,68.7
COMMONAREA_AVG,33495,68.7
NONLIVINGAPARTMENTS_MEDI,33347,68.4
NONLIVINGAPARTMENTS_AVG,33347,68.4
...,...,...
OBS_60_CNT_SOCIAL_CIRCLE,29,0.1
DEF_30_CNT_SOCIAL_CIRCLE,29,0.1
OBS_30_CNT_SOCIAL_CIRCLE,29,0.1
AMT_ANNUITY,24,0.0


In [10]:
# Drop the same columns from the test data, using the list of train columns with 1,500 or more missing values

df_test_dropped = df_test.drop(dropping_columns_list, axis=1)
df_test_dropped

Unnamed: 0,SK_ID_CURR,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,...,FLAG_DOCUMENT_12,FLAG_DOCUMENT_13,FLAG_DOCUMENT_14,FLAG_DOCUMENT_15,FLAG_DOCUMENT_16,FLAG_DOCUMENT_17,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21
0,100001,Cash loans,F,N,Y,0,135000.0,568800.0,20560.5,450000.0,...,0,0,0,0,0,0,0,0,0,0
1,100005,Cash loans,M,N,Y,0,99000.0,222768.0,17370.0,180000.0,...,0,0,0,0,0,0,0,0,0,0
2,100013,Cash loans,M,Y,Y,0,202500.0,663264.0,69777.0,630000.0,...,0,0,0,0,0,0,0,0,0,0
3,100028,Cash loans,F,N,Y,2,315000.0,1575000.0,49018.5,1575000.0,...,0,0,0,0,0,0,0,0,0,0
4,100038,Cash loans,M,Y,N,1,180000.0,625500.0,32067.0,625500.0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48739,456221,Cash loans,F,N,Y,0,121500.0,412560.0,17473.5,270000.0,...,0,0,0,0,0,0,0,0,0,0
48740,456222,Cash loans,F,N,N,2,157500.0,622413.0,31909.5,495000.0,...,0,0,0,0,0,0,0,0,0,0
48741,456223,Cash loans,F,Y,Y,1,202500.0,315000.0,33205.5,315000.0,...,0,0,0,0,0,0,0,0,0,0
48742,456224,Cash loans,M,N,N,0,225000.0,450000.0,25128.0,450000.0,...,0,0,0,0,0,0,0,0,0,0
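The drop step above uses an absolute count (1,500 rows) as the cut-off. A variation, sketched below on toy frames, expresses the cut-off as a fraction of rows instead. The column names and the 50% threshold are assumptions chosen for illustration, not values from this notebook.

```python
import numpy as np
import pandas as pd

# Toy frames; the column names are made up.
train = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "complete": [0.5, 0.1, 0.9, 0.3],
})
test = pd.DataFrame({
    "id": [5, 6],
    "mostly_missing": [2.0, np.nan],
    "complete": [0.2, 0.4],
})

threshold = 0.5  # assumed cut-off: drop columns with more than 50% missing in train

# isna().mean() gives the fraction of missing rows per column.
missing_frac = train.isna().mean()
to_drop = missing_frac[missing_frac > threshold].index.tolist()

# Decide on the train data only, then apply the same list to both sets.
train_dropped = train.drop(columns=to_drop)
test_dropped = test.drop(columns=to_drop)
print(to_drop)
```

Deciding the list on the train data alone, as the notebook does, keeps the two frames aligned column for column.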


In [11]:
# df_dropped['CODE_GENDER']

In [12]:
# Encode gender in the train data as an integer flag
from sklearn.preprocessing import LabelEncoder

# Create a LabelEncoder instance
le = LabelEncoder()
# Fit the encoder on the labels
le = le.fit(df_train_dropped['CODE_GENDER'])
# Convert the labels to integers and store them in a new column
df_train_dropped['FLAG_CODE_GENDER'] = le.transform(df_train_dropped['CODE_GENDER'])
df_train_dropped

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_13,FLAG_DOCUMENT_14,FLAG_DOCUMENT_15,FLAG_DOCUMENT_16,FLAG_DOCUMENT_17,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,FLAG_CODE_GENDER
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0,0,0,0,0,1
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0,0,0,0,0,0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0,0,0,0,0,1
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,0,0,0,0,0,0,0
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
307506,456251,0,Cash loans,M,N,N,0,157500.0,254700.0,27558.0,...,0,0,0,0,0,0,0,0,0,1
307507,456252,0,Cash loans,F,N,Y,0,72000.0,269550.0,12001.5,...,0,0,0,0,0,0,0,0,0,0
307508,456253,0,Cash loans,F,N,Y,0,153000.0,677664.0,29979.0,...,0,0,0,0,0,0,0,0,0,0
307509,456254,1,Cash loans,F,N,Y,0,171000.0,370107.0,20205.0,...,0,0,0,0,0,0,0,0,0,0


In [13]:
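# Encode gender in the test data as an integer flag
# NOTE: the encoder is refit on the test data here, so the integer codes match the
# train data only if both sets contain the same set of categories.

le = le.fit(df_test_dropped['CODE_GENDER'])
# Convert the labels to integers and store them in a new column
df_test_dropped['FLAG_CODE_GENDER'] = le.transform(df_test_dropped['CODE_GENDER'])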
df_test_dropped

Unnamed: 0,SK_ID_CURR,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,...,FLAG_DOCUMENT_13,FLAG_DOCUMENT_14,FLAG_DOCUMENT_15,FLAG_DOCUMENT_16,FLAG_DOCUMENT_17,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,FLAG_CODE_GENDER
0,100001,Cash loans,F,N,Y,0,135000.0,568800.0,20560.5,450000.0,...,0,0,0,0,0,0,0,0,0,0
1,100005,Cash loans,M,N,Y,0,99000.0,222768.0,17370.0,180000.0,...,0,0,0,0,0,0,0,0,0,1
2,100013,Cash loans,M,Y,Y,0,202500.0,663264.0,69777.0,630000.0,...,0,0,0,0,0,0,0,0,0,1
3,100028,Cash loans,F,N,Y,2,315000.0,1575000.0,49018.5,1575000.0,...,0,0,0,0,0,0,0,0,0,0
4,100038,Cash loans,M,Y,N,1,180000.0,625500.0,32067.0,625500.0,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48739,456221,Cash loans,F,N,Y,0,121500.0,412560.0,17473.5,270000.0,...,0,0,0,0,0,0,0,0,0,0
48740,456222,Cash loans,F,N,N,2,157500.0,622413.0,31909.5,495000.0,...,0,0,0,0,0,0,0,0,0,0
48741,456223,Cash loans,F,Y,Y,1,202500.0,315000.0,33205.5,315000.0,...,0,0,0,0,0,0,0,0,0,0
48742,456224,Cash loans,M,N,N,0,225000.0,450000.0,25128.0,450000.0,...,0,0,0,0,0,0,0,0,0,1


In [14]:
# Encode NAME_INCOME_TYPE in the train data as integers

le = le.fit(df_train_dropped['NAME_INCOME_TYPE'])
# Convert the labels to integers and overwrite the original column
df_train_dropped['NAME_INCOME_TYPE'] = le.transform(df_train_dropped['NAME_INCOME_TYPE'])
df_train_dropped

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_13,FLAG_DOCUMENT_14,FLAG_DOCUMENT_15,FLAG_DOCUMENT_16,FLAG_DOCUMENT_17,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,FLAG_CODE_GENDER
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0,0,0,0,0,1
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0,0,0,0,0,0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0,0,0,0,0,1
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,0,0,0,0,0,0,0
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
307506,456251,0,Cash loans,M,N,N,0,157500.0,254700.0,27558.0,...,0,0,0,0,0,0,0,0,0,1
307507,456252,0,Cash loans,F,N,Y,0,72000.0,269550.0,12001.5,...,0,0,0,0,0,0,0,0,0,0
307508,456253,0,Cash loans,F,N,Y,0,153000.0,677664.0,29979.0,...,0,0,0,0,0,0,0,0,0,0
307509,456254,1,Cash loans,F,N,Y,0,171000.0,370107.0,20205.0,...,0,0,0,0,0,0,0,0,0,0


In [15]:
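# Encode NAME_INCOME_TYPE in the test data as integers
# NOTE: the encoder is refit on the test data here. If the test data contains a different
# set of categories than the train data, the same label can receive a different integer
# in each set.

le = le.fit(df_test_dropped['NAME_INCOME_TYPE'])
# Convert the labels to integers and overwrite the original column
df_test_dropped['NAME_INCOME_TYPE'] = le.transform(df_test_dropped['NAME_INCOME_TYPE'])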
df_test_dropped

Unnamed: 0,SK_ID_CURR,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,...,FLAG_DOCUMENT_13,FLAG_DOCUMENT_14,FLAG_DOCUMENT_15,FLAG_DOCUMENT_16,FLAG_DOCUMENT_17,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,FLAG_CODE_GENDER
0,100001,Cash loans,F,N,Y,0,135000.0,568800.0,20560.5,450000.0,...,0,0,0,0,0,0,0,0,0,0
1,100005,Cash loans,M,N,Y,0,99000.0,222768.0,17370.0,180000.0,...,0,0,0,0,0,0,0,0,0,1
2,100013,Cash loans,M,Y,Y,0,202500.0,663264.0,69777.0,630000.0,...,0,0,0,0,0,0,0,0,0,1
3,100028,Cash loans,F,N,Y,2,315000.0,1575000.0,49018.5,1575000.0,...,0,0,0,0,0,0,0,0,0,0
4,100038,Cash loans,M,Y,N,1,180000.0,625500.0,32067.0,625500.0,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48739,456221,Cash loans,F,N,Y,0,121500.0,412560.0,17473.5,270000.0,...,0,0,0,0,0,0,0,0,0,0
48740,456222,Cash loans,F,N,N,2,157500.0,622413.0,31909.5,495000.0,...,0,0,0,0,0,0,0,0,0,0
48741,456223,Cash loans,F,Y,Y,1,202500.0,315000.0,33205.5,315000.0,...,0,0,0,0,0,0,0,0,0,0
48742,456224,Cash loans,M,N,N,0,225000.0,450000.0,25128.0,450000.0,...,0,0,0,0,0,0,0,0,0,1
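The cells above fit a `LabelEncoder` separately on the train and test columns. The sketch below, on toy data, shows why that can misalign the integer codes and how fitting once on the train data avoids it. The category strings are illustrative placeholders, and this is an alternative approach rather than what the notebook ran.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy columns: the training data has one category that the test data lacks.
train = pd.Series(["Working", "Maternity leave", "Pensioner", "Working"])
test = pd.Series(["Working", "Pensioner"])

# Refitting on each set separately: classes are sorted per set, so codes can differ.
sep_train = LabelEncoder().fit(train).transform(train)
sep_test = LabelEncoder().fit(test).transform(test)

# Fitting once on the training data and reusing the encoder keeps the codes aligned.
le = LabelEncoder().fit(train)
shared_train = le.transform(train)
shared_test = le.transform(test)

print(sep_train, sep_test)        # "Working" is 2 in train but 1 in test
print(shared_train, shared_test)  # "Working" is 2 in both
```

Note that `transform` raises an error for labels that were not seen during `fit`, so a category that appears only in the test data would need separate handling.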


In [16]:
# Select the first feature set and the target (NAME_INCOME_TYPE, FLAG_CODE_GENDER, and TARGET)

df_train01 = df_train_dropped[["NAME_INCOME_TYPE", "FLAG_CODE_GENDER", "TARGET"]]
df_train01

Unnamed: 0,NAME_INCOME_TYPE,FLAG_CODE_GENDER,TARGET
0,7,1,1
1,4,0,0
2,7,1,0
3,7,0,0
4,7,1,0
...,...,...,...
307506,7,1,0
307507,3,0,0
307508,7,0,0
307509,1,0,1


In [17]:
# Convert to ndarray for processing with scikit-learn

X = df_train01.drop('TARGET', axis=1).values
y = df_train01.loc[:, 'TARGET'].values
display(X)
display(y)

array([[7, 1],
       [4, 0],
       [7, 1],
       ...,
       [7, 0],
       [1, 0],
       [1, 0]])

array([1, 0, 0, ..., 0, 1, 0])

In [18]:
# Split into 75% training data and 25% validation data

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, test_size=0.25, random_state=0)
display(X_train.shape)
display(X_test.shape)
display(y_train.shape)
display(y_test.shape)

(230633, 2)

(76878, 2)

(230633,)

(76878,)
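The models below use `class_weight='balanced'`, which suggests the target classes are imbalanced. Under that assumption, one optional refinement is a stratified split, which keeps the positive rate the same in both parts. The sketch uses synthetic data with an assumed 8% positive rate; it is not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))  # synthetic features
y = np.zeros(1000, dtype=int)
y[:80] = 1  # assumed 8% positives, for illustration only

# stratify=y preserves the class proportions in the training and validation parts.
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
print(y_tr.mean(), y_va.mean())
```

Without `stratify`, the positive rate in a random 25% sample varies from split to split, which adds noise to the validation AUC.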

In [19]:
# Logistic regression

lr = LogisticRegression(random_state=0, class_weight='balanced') # create the instance
lr.fit(X_train, y_train) # train
y_pred_lr = lr.predict_proba(X_test)[:, 1] # predicted probability of class 1
y_pred_lr.shape

(76878,)

In [20]:
y_pred_lr

array([0.49612706, 0.50086412, 0.50086412, ..., 0.43536637, 0.49612706,
       0.40558729])

In [21]:
# Evaluate with ROC AUC
from sklearn.metrics import roc_auc_score

roc_auc_score(y_test, y_pred_lr)

0.5708259087640042

In [22]:
# Select the first feature set from Kaggle's application_test data

df_test01 = df_test_dropped[["NAME_INCOME_TYPE", "FLAG_CODE_GENDER"]]
df_test01

Unnamed: 0,NAME_INCOME_TYPE,FLAG_CODE_GENDER
0,6,0
1,6,1
2,6,1
3,6,0
4,6,1
...,...,...
48739,6,0
48740,1,0
48741,1,0
48742,1,1


In [23]:
# Convert Kaggle's application_test data to ndarray

X_test_k = df_test01.values
X_test_k

array([[6, 0],
       [6, 1],
       [6, 1],
       ...,
       [1, 0],
       [1, 1],
       [6, 0]])

In [24]:
# Logistic regression

lr = LogisticRegression(random_state=0, class_weight='balanced') # create the instance
lr.fit(X_train, y_train) # train
y_pred_lr = lr.predict_proba(X_test_k)[:, 1] # predicted probability of class 1
y_pred_lr.shape

(48744,)

In [25]:
df_test['SK_ID_CURR']

0        100001
1        100005
2        100013
3        100028
4        100038
          ...  
48739    456221
48740    456222
48741    456223
48742    456224
48743    456250
Name: SK_ID_CURR, Length: 48744, dtype: int64

In [26]:
out_df = pd.DataFrame({'SK_ID_CURR': df_test['SK_ID_CURR'], 'TARGET': y_pred_lr})
out_df.to_csv('submission.csv', index=False)

**First Kaggle submission: the score was 0.58681.**

### [Problem 4] Feature engineering

**Try 1: Standardize the features**

In [27]:
# Standardization
scaler = StandardScaler()

scaler.fit(X_train)

X_train_transformed = scaler.transform(X_train)
X_test_transformed = scaler.transform(X_test)

In [28]:
# Train and predict with logistic regression, as before

lr = LogisticRegression(random_state=0, class_weight='balanced') # create the instance
lr.fit(X_train_transformed, y_train) # train
y_pred_lr2 = lr.predict_proba(X_test_transformed)[:, 1] # predicted probability of class 1
y_pred_lr2.shape

(76878,)

In [29]:
y_pred_lr2

array([0.49612344, 0.50087002, 0.50087002, ..., 0.43536466, 0.49612344,
       0.40558653])

In [30]:
# Evaluate with ROC AUC
from sklearn.metrics import roc_auc_score

roc_auc_score(y_test, y_pred_lr2)

0.5708259087640042

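**Result of standardization: the validation ROC AUC was identical (0.5708). The predicted probabilities shown changed by less than 0.00001, and ROC AUC depends only on how the predictions are ranked.**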
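ROC AUC is computed from the ordering of the predictions, not their exact values, which may be why small shifts in the predicted probabilities leave the score untouched. The sketch below illustrates this on synthetic scores: applying a strictly increasing transform changes every value but not the AUC.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
# Synthetic scores loosely correlated with the labels.
scores = 0.3 * y + rng.normal(size=200)

# A strictly increasing transform (a sigmoid of a rescaled score) changes the values
# but keeps their order.
transformed = 1.0 / (1.0 + np.exp(-(2.0 * scores + 1.0)))

auc_raw = roc_auc_score(y, scores)
auc_transformed = roc_auc_score(y, transformed)
print(auc_raw, auc_transformed)
```

A preprocessing step only moves the AUC if it changes which samples the model ranks above which others.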

**Try 2: Change the features to 'REGION_RATING_CLIENT_W_CITY' and 'FLAG_CODE_GENDER'**

In [31]:
# Select the features and the target (REGION_RATING_CLIENT_W_CITY, FLAG_CODE_GENDER, and TARGET)

df_train02 = df_train_dropped[["REGION_RATING_CLIENT_W_CITY", "FLAG_CODE_GENDER", "TARGET"]]
df_train02

Unnamed: 0,REGION_RATING_CLIENT_W_CITY,FLAG_CODE_GENDER,TARGET
0,2,1,1
1,1,0,0
2,2,1,0
3,2,0,0
4,2,1,0
...,...,...,...
307506,1,1,0
307507,2,0,0
307508,3,0,0
307509,2,0,1


In [32]:
# Convert to ndarray for processing with scikit-learn

X = df_train02.drop('TARGET', axis=1).values
y = df_train02.loc[:, 'TARGET'].values
display(X)
display(y)

array([[2, 1],
       [1, 0],
       [2, 1],
       ...,
       [3, 0],
       [2, 0],
       [1, 0]])

array([1, 0, 0, ..., 0, 1, 0])

In [33]:
# Split into 75% training data and 25% validation data

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, test_size=0.25, random_state=0)
display(X_train.shape)
display(X_test.shape)
display(y_train.shape)
display(y_test.shape)

(230633, 2)

(76878, 2)

(230633,)

(76878,)

In [34]:
# Logistic regression

lr = LogisticRegression(random_state=0, class_weight='balanced') # create the instance
lr.fit(X_train, y_train) # train
y_pred_lr3 = lr.predict_proba(X_test)[:, 1] # predicted probability of class 1
y_pred_lr3.shape

(76878,)

In [35]:
y_pred_lr3

array([0.56527265, 0.44155963, 0.55514541, ..., 0.45171885, 0.56527265,
       0.45171885])

In [36]:
# Evaluate with ROC AUC
from sklearn.metrics import roc_auc_score

roc_auc_score(y_test, y_pred_lr3)

0.5781264644600103

In [37]:
# Select the second feature set from Kaggle's application_test data

df_test02 = df_test_dropped[["REGION_RATING_CLIENT_W_CITY", "FLAG_CODE_GENDER"]]
df_test02

Unnamed: 0,REGION_RATING_CLIENT_W_CITY,FLAG_CODE_GENDER
0,2,0
1,2,1
2,2,1
3,2,0
4,2,1
...,...,...
48739,3,0
48740,2,0
48741,2,0
48742,2,1


In [38]:
# Convert Kaggle's application_test data to ndarray

X_test_k2 = df_test02.values
X_test_k2

array([[2, 0],
       [2, 1],
       [2, 1],
       ...,
       [2, 0],
       [2, 1],
       [2, 0]])

In [39]:
# Logistic regression

lr = LogisticRegression(random_state=0, class_weight='balanced') # create the instance
lr.fit(X_train, y_train) # train
y_pred_lr2 = lr.predict_proba(X_test_k2)[:, 1] # predicted probability of class 1
y_pred_lr2.shape

(48744,)

In [40]:
out_df = pd.DataFrame({'SK_ID_CURR': df_test['SK_ID_CURR'], 'TARGET': y_pred_lr2})
out_df.to_csv('submission02.csv', index=False)

**Second Kaggle submission: the score was 0.58857, almost unchanged from the first (0.58681).**

**Try 3: Use 'NAME_INCOME_TYPE', 'REGION_RATING_CLIENT_W_CITY', and 'FLAG_CODE_GENDER' as the features.**

In [41]:
# Select the features and the target (NAME_INCOME_TYPE, REGION_RATING_CLIENT_W_CITY, FLAG_CODE_GENDER, and TARGET)

df_train03 = df_train_dropped[["NAME_INCOME_TYPE", "REGION_RATING_CLIENT_W_CITY", "FLAG_CODE_GENDER", "TARGET"]]
df_train03

Unnamed: 0,NAME_INCOME_TYPE,REGION_RATING_CLIENT_W_CITY,FLAG_CODE_GENDER,TARGET
0,7,2,1,1
1,4,1,0,0
2,7,2,1,0
3,7,2,0,0
4,7,2,1,0
...,...,...,...,...
307506,7,1,1,0
307507,3,2,0,0
307508,7,3,0,0
307509,1,2,0,1


In [42]:
# Convert to ndarray for processing with scikit-learn

X = df_train03.drop('TARGET', axis=1).values
y = df_train03.loc[:, 'TARGET'].values
display(X)
display(y)

array([[7, 2, 1],
       [4, 1, 0],
       [7, 2, 1],
       ...,
       [7, 3, 0],
       [1, 2, 0],
       [1, 1, 0]])

array([1, 0, 0, ..., 0, 1, 0])

In [43]:
# Split into 75% training data and 25% validation data

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, test_size=0.25, random_state=0)
display(X_train.shape)
display(X_test.shape)
display(y_train.shape)
display(y_test.shape)

(230633, 3)

(76878, 3)

(230633,)

(76878,)

In [44]:
# Logistic regression

lr = LogisticRegression(random_state=0, class_weight='balanced') # create the instance
lr.fit(X_train, y_train) # train
y_pred_lr4 = lr.predict_proba(X_test)[:, 1] # predicted probability of class 1
y_pred_lr4.shape

(76878,)

In [45]:
y_pred_lr4

array([0.58770644, 0.39766106, 0.50308122, ..., 0.43026579, 0.58770644,
       0.40501224])

In [46]:
# Evaluate with ROC AUC

roc_auc_score(y_test, y_pred_lr4)

0.5933148587071063

**Analysis: using three features ('NAME_INCOME_TYPE', 'REGION_RATING_CLIENT_W_CITY', 'FLAG_CODE_GENDER') raised the validation ROC AUC slightly, to 0.593 (from 0.571 in Try 1 and 0.578 in Try 2).**

In [47]:
# Select the third feature set from Kaggle's application_test data

df_test03 = df_test_dropped[["NAME_INCOME_TYPE","REGION_RATING_CLIENT_W_CITY", "FLAG_CODE_GENDER"]]
df_test03

Unnamed: 0,NAME_INCOME_TYPE,REGION_RATING_CLIENT_W_CITY,FLAG_CODE_GENDER
0,6,2,0
1,6,2,1
2,6,2,1
3,6,2,0
4,6,2,1
...,...,...,...
48739,6,3,0
48740,1,2,0
48741,1,2,0
48742,1,2,1


In [48]:
# Convert Kaggle's application_test data to ndarray

X_test_k3 = df_test03.values
X_test_k3

array([[6, 2, 0],
       [6, 2, 1],
       [6, 2, 1],
       ...,
       [1, 2, 0],
       [1, 2, 1],
       [6, 2, 0]])

In [49]:
# Logistic regression

lr = LogisticRegression(random_state=0, class_weight='balanced') # create the instance
lr.fit(X_train, y_train) # train
y_pred_lr3 = lr.predict_proba(X_test_k3)[:, 1] # predicted probability of class 1
y_pred_lr3.shape

(48744,)

In [50]:
out_df = pd.DataFrame({'SK_ID_CURR': df_test['SK_ID_CURR'], 'TARGET': y_pred_lr3})
out_df.to_csv('submission03.csv', index=False)

**Third Kaggle submission: the score was 0.60537, a slight increase over the first (0.58681) and the second (0.58857). Adding features had a small positive effect.**
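Each try repeats the same steps: select columns, split 75/25, fit a logistic regression, and score with ROC AUC. As a sketch, those steps could be wrapped in one function so that feature sets are compared with a single call. The helper name `validation_auc` and the synthetic columns `a` and `b` are hypothetical and only serve the demonstration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def validation_auc(df, features, target="TARGET", seed=0):
    """Hypothetical helper: hold-out ROC AUC of a logistic regression on `features`."""
    X = df[features].values
    y = df[target].values
    X_tr, X_va, y_tr, y_va = train_test_split(
        X, y, train_size=0.75, test_size=0.25, random_state=seed
    )
    model = LogisticRegression(random_state=seed, class_weight="balanced")
    model.fit(X_tr, y_tr)
    return roc_auc_score(y_va, model.predict_proba(X_va)[:, 1])


# Synthetic stand-in for the training table: "a" drives the target, "b" is pure noise.
rng = np.random.default_rng(0)
n = 2000
demo = pd.DataFrame({"a": rng.normal(size=n), "b": rng.normal(size=n)})
demo["TARGET"] = (demo["a"] + 0.5 * rng.normal(size=n) > 1.0).astype(int)

for feats in (["b"], ["a"], ["a", "b"]):
    print(feats, round(validation_auc(demo, feats), 3))
```

With the notebook's data, the call would take `df_train_dropped` and a list of column names in place of the synthetic frame.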

**Try 4: Add one more feature and use 'NAME_HOUSING_TYPE', 'NAME_INCOME_TYPE', 'REGION_RATING_CLIENT_W_CITY', and 'FLAG_CODE_GENDER'.**

In [51]:
# Encode NAME_HOUSING_TYPE in the train data as integers

le = le.fit(df_train_dropped['NAME_HOUSING_TYPE'])
# Convert the labels to integers and overwrite the original column
df_train_dropped['NAME_HOUSING_TYPE'] = le.transform(df_train_dropped['NAME_HOUSING_TYPE'])
df_train_dropped

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_13,FLAG_DOCUMENT_14,FLAG_DOCUMENT_15,FLAG_DOCUMENT_16,FLAG_DOCUMENT_17,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,FLAG_CODE_GENDER
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0,0,0,0,0,1
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0,0,0,0,0,0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0,0,0,0,0,1
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,0,0,0,0,0,0,0
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
307506,456251,0,Cash loans,M,N,N,0,157500.0,254700.0,27558.0,...,0,0,0,0,0,0,0,0,0,1
307507,456252,0,Cash loans,F,N,Y,0,72000.0,269550.0,12001.5,...,0,0,0,0,0,0,0,0,0,0
307508,456253,0,Cash loans,F,N,Y,0,153000.0,677664.0,29979.0,...,0,0,0,0,0,0,0,0,0,0
307509,456254,1,Cash loans,F,N,Y,0,171000.0,370107.0,20205.0,...,0,0,0,0,0,0,0,0,0,0


In [52]:
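# Encode NAME_HOUSING_TYPE in the test data as integers
# NOTE: the encoder is refit on the test data here, so the codes match the train data
# only if both sets contain the same set of categories.

le = le.fit(df_test_dropped['NAME_HOUSING_TYPE'])
# Convert the labels to integers and overwrite the original column
df_test_dropped['NAME_HOUSING_TYPE'] = le.transform(df_test_dropped['NAME_HOUSING_TYPE'])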
df_test_dropped

Unnamed: 0,SK_ID_CURR,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,...,FLAG_DOCUMENT_13,FLAG_DOCUMENT_14,FLAG_DOCUMENT_15,FLAG_DOCUMENT_16,FLAG_DOCUMENT_17,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,FLAG_CODE_GENDER
0,100001,Cash loans,F,N,Y,0,135000.0,568800.0,20560.5,450000.0,...,0,0,0,0,0,0,0,0,0,0
1,100005,Cash loans,M,N,Y,0,99000.0,222768.0,17370.0,180000.0,...,0,0,0,0,0,0,0,0,0,1
2,100013,Cash loans,M,Y,Y,0,202500.0,663264.0,69777.0,630000.0,...,0,0,0,0,0,0,0,0,0,1
3,100028,Cash loans,F,N,Y,2,315000.0,1575000.0,49018.5,1575000.0,...,0,0,0,0,0,0,0,0,0,0
4,100038,Cash loans,M,Y,N,1,180000.0,625500.0,32067.0,625500.0,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48739,456221,Cash loans,F,N,Y,0,121500.0,412560.0,17473.5,270000.0,...,0,0,0,0,0,0,0,0,0,0
48740,456222,Cash loans,F,N,N,2,157500.0,622413.0,31909.5,495000.0,...,0,0,0,0,0,0,0,0,0,0
48741,456223,Cash loans,F,Y,Y,1,202500.0,315000.0,33205.5,315000.0,...,0,0,0,0,0,0,0,0,0,0
48742,456224,Cash loans,M,N,N,0,225000.0,450000.0,25128.0,450000.0,...,0,0,0,0,0,0,0,0,0,1


In [53]:
# Select the features and the target (NAME_HOUSING_TYPE, NAME_INCOME_TYPE, REGION_RATING_CLIENT_W_CITY, FLAG_CODE_GENDER, and TARGET)

df_train04 = df_train_dropped[["NAME_HOUSING_TYPE","NAME_INCOME_TYPE", "REGION_RATING_CLIENT_W_CITY", "FLAG_CODE_GENDER", "TARGET"]]
df_train04

Unnamed: 0,NAME_HOUSING_TYPE,NAME_INCOME_TYPE,REGION_RATING_CLIENT_W_CITY,FLAG_CODE_GENDER,TARGET
0,1,7,2,1,1
1,1,4,1,0,0
2,1,7,2,1,0
3,1,7,2,0,0
4,1,7,2,1,0
...,...,...,...,...,...
307506,5,7,1,1,0
307507,1,3,2,0,0
307508,1,7,3,0,0
307509,1,1,2,0,1


In [54]:
# Convert to ndarray for processing with scikit-learn

X = df_train04.drop('TARGET', axis=1).values
y = df_train04.loc[:, 'TARGET'].values
display(X)
display(y)

array([[1, 7, 2, 1],
       [1, 4, 1, 0],
       [1, 7, 2, 1],
       ...,
       [1, 7, 3, 0],
       [1, 1, 2, 0],
       [1, 1, 1, 0]])

array([1, 0, 0, ..., 0, 1, 0])

In [55]:
# Split into 75% training data and 25% validation data

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, test_size=0.25, random_state=0)
display(X_train.shape)
display(X_test.shape)
display(y_train.shape)
display(y_test.shape)

(230633, 4)

(76878, 4)

(230633,)

(76878,)

In [56]:
# Logistic regression

lr = LogisticRegression(random_state=0, class_weight='balanced') # create the instance
lr.fit(X_train, y_train) # train
y_pred_lr5 = lr.predict_proba(X_test)[:, 1] # predicted probability of class 1
y_pred_lr5

array([0.57849299, 0.38973777, 0.4937332 , ..., 0.42316106, 0.57849299,
       0.3985915 ])

In [57]:
# Evaluate with ROC AUC

roc_auc_score(y_test, y_pred_lr5)

0.5947142752538882

**Analysis: increasing to four features raised the validation ROC AUC only marginally, from 0.5933 to 0.5947.**

In [58]:
# Select the fourth feature set from Kaggle's application_test data

df_test04 = df_test_dropped[["NAME_HOUSING_TYPE", "NAME_INCOME_TYPE","REGION_RATING_CLIENT_W_CITY", "FLAG_CODE_GENDER"]]
df_test04

Unnamed: 0,NAME_HOUSING_TYPE,NAME_INCOME_TYPE,REGION_RATING_CLIENT_W_CITY,FLAG_CODE_GENDER
0,1,6,2,0
1,1,6,2,1
2,1,6,2,1
3,1,6,2,0
4,1,6,2,1
...,...,...,...,...
48739,1,6,3,0
48740,1,1,2,0
48741,1,1,2,0
48742,1,1,2,1


In [59]:
# Convert Kaggle's application_test data to ndarray

X_test_k4 = df_test04.values
X_test_k4

array([[1, 6, 2, 0],
       [1, 6, 2, 1],
       [1, 6, 2, 1],
       ...,
       [1, 1, 2, 0],
       [1, 1, 2, 1],
       [1, 6, 2, 0]])

In [60]:
# Logistic regression

lr = LogisticRegression(random_state=0, class_weight='balanced') # create the instance
lr.fit(X_train, y_train) # train
y_pred_lr4 = lr.predict_proba(X_test_k4)[:, 1] # predicted probability of class 1
y_pred_lr4.shape

(48744,)

In [61]:
out_df = pd.DataFrame({'SK_ID_CURR': df_test['SK_ID_CURR'], 'TARGET': y_pred_lr4})
out_df.to_csv('submission04.csv', index=False)

**Fourth Kaggle submission: the score was 0.60923, almost unchanged from the third (0.60537). ⇒ The added feature NAME_HOUSING_TYPE had little effect.**

**Try 5: Add more features, following the Week 3 analysis results.**

In [64]:
# Select the features and the target

df_train05 = df_train_dropped[["CNT_FAM_MEMBERS", "REG_CITY_NOT_WORK_CITY", "REG_CITY_NOT_LIVE_CITY", "AMT_CREDIT", "DEF_30_CNT_SOCIAL_CIRCLE", "NAME_HOUSING_TYPE","NAME_INCOME_TYPE", "REGION_RATING_CLIENT_W_CITY", "FLAG_CODE_GENDER", "TARGET"]]
df_train05

Unnamed: 0,CNT_FAM_MEMBERS,REG_CITY_NOT_WORK_CITY,REG_CITY_NOT_LIVE_CITY,AMT_CREDIT,DEF_30_CNT_SOCIAL_CIRCLE,NAME_HOUSING_TYPE,NAME_INCOME_TYPE,REGION_RATING_CLIENT_W_CITY,FLAG_CODE_GENDER,TARGET
0,1.0,0,0,406597.5,2.0,1,7,2,1,1
1,2.0,0,0,1293502.5,0.0,1,4,1,0,0
2,1.0,0,0,135000.0,0.0,1,7,2,1,0
3,2.0,0,0,312682.5,0.0,1,7,2,0,0
4,1.0,1,0,513000.0,0.0,1,7,2,1,0
...,...,...,...,...,...,...,...,...,...,...
307506,1.0,0,0,254700.0,0.0,5,7,1,1,0
307507,1.0,0,0,269550.0,0.0,1,3,2,0,0
307508,1.0,1,0,677664.0,0.0,1,7,3,0,0
307509,2.0,1,1,370107.0,0.0,1,1,2,0,1


In [65]:
# Fill missing values with 0; assigning the result back (rather than using inplace=True
# on a column subset) avoids pandas' SettingWithCopyWarning
df_train05 = df_train05.fillna(0)


In [66]:
# Convert to ndarray for processing with scikit-learn

X = df_train05.drop('TARGET', axis=1).values
y = df_train05.loc[:, 'TARGET'].values
display(X)
display(y)

array([[1., 0., 0., ..., 7., 2., 1.],
       [2., 0., 0., ..., 4., 1., 0.],
       [1., 0., 0., ..., 7., 2., 1.],
       ...,
       [1., 1., 0., ..., 7., 3., 0.],
       [2., 1., 1., ..., 1., 2., 0.],
       [2., 1., 0., ..., 1., 1., 0.]])

array([1, 0, 0, ..., 0, 1, 0])

In [67]:
# Split into 75% training data and 25% validation data

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, test_size=0.25, random_state=0)
display(X_train.shape)
display(X_test.shape)
display(y_train.shape)
display(y_test.shape)

(230633, 9)

(76878, 9)

(230633,)

(76878,)

In [68]:
# Logistic regression

lr = LogisticRegression(random_state=0, class_weight='balanced') # create the instance
lr.fit(X_train, y_train) # train
y_pred_lr6 = lr.predict_proba(X_test)[:, 1] # predicted probability of class 1
y_pred_lr6

array([0.48994866, 0.48457526, 0.47831838, ..., 0.48839525, 0.48142508,
       0.47290754])

In [69]:
# Evaluate with ROC AUC

roc_auc_score(y_test, y_pred_lr6)

0.5164231749173771

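**Analysis: adding more features lowered the validation ROC AUC instead, from 0.5947 to 0.5164. One possible cause, not verified here, is that AMT_CREDIT is on a far larger scale than the other features and the model was trained without standardization.**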
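When features sit on very different scales, one thing to try is to chain imputation, standardization, and the classifier in a single scikit-learn pipeline. The sketch below runs on synthetic data with one loan-amount-like column and one 0/1 flag. Whether this would help with the features used in this try is an untested assumption, and median imputation is an illustrative choice rather than the notebook's fill-with-zero.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000
# Synthetic features on very different scales (made up for illustration).
amount = rng.uniform(5e4, 2e6, size=n)
flag = rng.integers(0, 2, size=n).astype(float)
y = (flag + rng.normal(scale=0.5, size=n) > 0.5).astype(int)

X = np.column_stack([amount, flag])
X[rng.random(n) < 0.05, 0] = np.nan  # inject some missing values

# The scaler and imputer are fit only on the data passed to fit(), then reused by predict.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(class_weight="balanced"),
)
model.fit(X, y)
proba = model.predict_proba(X)[:, 1]
auc = roc_auc_score(y, proba)
print(round(auc, 3))
```

Because the pipeline stores the fitted imputer and scaler, calling `predict_proba` on new data applies exactly the same preprocessing that was learned during `fit`.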

**Try 6: Switch the method to random forest.**

In [73]:
# Select the features and the target (NAME_HOUSING_TYPE, NAME_INCOME_TYPE, REGION_RATING_CLIENT_W_CITY, FLAG_CODE_GENDER, and TARGET)

df_train04 = df_train_dropped[["NAME_HOUSING_TYPE","NAME_INCOME_TYPE", "REGION_RATING_CLIENT_W_CITY", "FLAG_CODE_GENDER", "TARGET"]]
df_train04

Unnamed: 0,NAME_HOUSING_TYPE,NAME_INCOME_TYPE,REGION_RATING_CLIENT_W_CITY,FLAG_CODE_GENDER,TARGET
0,1,7,2,1,1
1,1,4,1,0,0
2,1,7,2,1,0
3,1,7,2,0,0
4,1,7,2,1,0
...,...,...,...,...,...
307506,5,7,1,1,0
307507,1,3,2,0,0
307508,1,7,3,0,0
307509,1,1,2,0,1


In [74]:
# Convert to ndarray for processing with scikit-learn

X = df_train04.drop('TARGET', axis=1).values
y = df_train04.loc[:, 'TARGET'].values
display(X)
display(y)

array([[1, 7, 2, 1],
       [1, 4, 1, 0],
       [1, 7, 2, 1],
       ...,
       [1, 7, 3, 0],
       [1, 1, 2, 0],
       [1, 1, 1, 0]])

array([1, 0, 0, ..., 0, 1, 0])

In [75]:
# Split into 75% training data and 25% validation data

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, test_size=0.25, random_state=0)
display(X_train.shape)
display(X_test.shape)
display(y_train.shape)
display(y_test.shape)

(230633, 4)

(76878, 4)

(230633,)

(76878,)

In [76]:
# Random forest
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
y_pred_rfc = rfc.predict_proba(X_test)[:, 1]
y_pred_rfc

array([0.1172892 , 0.05103601, 0.09013345, ..., 0.04715589, 0.1172892 ,
       0.06795673])

In [77]:
# Evaluate with ROC AUC

roc_auc_score(y_test, y_pred_rfc)

0.6029238443372646

**Analysis: switching to random forest raised the validation ROC AUC slightly, from 0.5947 to 0.6029.**

In [78]:
# Predict Kaggle's application_test data with random forest

rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
y_pred_rfc2 = rfc.predict_proba(X_test_k4)[:, 1]

In [79]:
out_df = pd.DataFrame({'SK_ID_CURR': df_test['SK_ID_CURR'], 'TARGET': y_pred_rfc2})
out_df.to_csv('submission05.csv', index=False)

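**Kaggle submission: the score was 0.58070, lower than the previous 0.60923. ⇒ The validation ROC AUC had risen slightly, but the model did not do as well on the test data. More study is needed, including other approaches to feature engineering.**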
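A single 75/25 split gives only one validation score, so a small gain can be noise. One way to get a steadier estimate is k-fold cross-validation with the competition metric, sketched below on synthetic data. The features, target rule, fold count, and forest size are assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
n = 600
# Synthetic integer-coded features and an imbalanced binary target (made up).
X = rng.integers(0, 5, size=(n, 4))
y = (X[:, 0] + rng.normal(size=n) > 4.0).astype(int)

# Stratified folds keep the positive rate similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0)

# scoring="roc_auc" evaluates each fold with the competition metric.
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(scores.mean(), scores.std())
```

Comparing the mean and spread across folds makes it easier to judge whether a change such as switching models is larger than the split-to-split variation. Fixing `random_state` on the forest also makes the runs repeatable.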