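## Improvements in the 3rd attempt
- Preprocessing
  - Try building a family-size feature (number of relatives aboard) from SibSp and Parch
  - Cabin is currently encoded as 0 when NaN and 1 otherwise; the first letter might be usable to split cabins into areas
- Model
  - Keep the model itself fixed to RandomForest
  - Tune the model's hyperparameters
- Data analysis
  - Survival rates probably differ between women and children on one side and men on the other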
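
The hypothesis in the last bullet can be checked with a group-wise mean of `Survived`. The following is a minimal sketch on a tiny hand-made DataFrame; the rows in `toy_df` are illustrative assumptions, not values from `train.csv`, and the age threshold of 16 mirrors the one used in this notebook's preprocessing.

```python
import pandas as pd

# Toy rows for illustration only (not taken from train.csv).
toy_df = pd.DataFrame(
    {
        "Sex": ["female", "male", "male", "female", "male", "male"],
        "Age": [30.0, 8.0, 40.0, 22.0, 35.0, 50.0],
        "Survived": [1, 1, 0, 1, 0, 1],
    }
)

# 1 for women and for children under 16, 0 for everyone else.
# A missing Age compares as False, so such rows depend on Sex alone.
toy_df["FemaleOrChild"] = (
    (toy_df["Sex"] == "female") | (toy_df["Age"] < 16)
).astype(int)

# The mean of the 0/1 Survived column per group is the survival rate.
survival_rate = toy_df.groupby("FemaleOrChild")["Survived"].mean()
print(survival_rate)
```

On the real data, the same two lines would be applied to `train_df` instead of `toy_df`.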


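## Improvements in the 4th attempt (this one)
- Normalization or standardization might change the results
- Model ensembling
- Build family groups from the name information and analyze them
- Ticket values come in many different formats, so look into what each format means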
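
The name and ticket ideas could start from two small string helpers. This is a rough sketch, not something the notebook does yet: `extract_surname` and `ticket_prefix` are hypothetical helper names, and the sketch assumes names follow the `Surname, Title. Given names` pattern and that a ticket is an optional text prefix followed by a number.

```python
import re


def extract_surname(name: str) -> str:
    """Return the text before the first comma of a passenger name."""
    return name.split(",", 1)[0].strip()


def ticket_prefix(ticket: str) -> str:
    """Return the normalized text prefix of a ticket, or 'NONE' if it is purely numeric."""
    parts = ticket.strip().split()
    # Drop the trailing number, if any; whatever precedes it is the prefix.
    if parts and parts[-1].isdigit():
        parts = parts[:-1]
    prefix = "".join(parts)
    # Remove punctuation so that variants such as 'A/5' and 'A.5.' share one label.
    prefix = re.sub(r"[^A-Za-z0-9]", "", prefix).upper()
    return prefix or "NONE"


print(extract_surname("Braund, Mr. Owen Harris"))
print(ticket_prefix("A/5 21171"))
print(ticket_prefix("113803"))
```

To use these in the notebook, `df["Name"].apply(extract_surname)` and `df["Ticket"].apply(ticket_prefix)` would have to run before the `Name` and `Ticket` columns are dropped.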

In [106]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [107]:
train_df = pd.read_csv("data/train.csv")
test_df = pd.read_csv("data/test.csv")
ex_df = pd.read_csv("data/gender_submission.csv")

In [108]:
# Concatenate train_df and test_df
test_df["Survived"] = np.nan
df = pd.concat([train_df, test_df], ignore_index=True)

In [109]:
# Preprocessing
# Cabin: "Missing" when NaN, otherwise the first letter of the cabin code
df["Cabin"] = df["Cabin"].apply(lambda x: x[0] if pd.notnull(x) else "Missing")


# Fill missing Embarked values with "C"
df["Embarked"] = df["Embarked"].fillna("C")

# Fill missing Fare values with the mean Fare of the same Pclass, Sex, Parch, SibSp group
df["Fare"] = df["Fare"].fillna(
    df.groupby(["Pclass", "Sex", "Parch", "SibSp"])["Fare"].transform("mean")
)


# ------------------------------
# Estimate Age from Pclass, Sex, Parch, SibSp with a random forest
# ------------------------------
from sklearn.ensemble import RandomForestRegressor


age_df = df[["Age", "Pclass", "Sex", "Parch", "SibSp"]]
age_df = pd.get_dummies(age_df, columns=["Pclass", "Sex"])

# Split into rows with known Age (training) and unknown Age (to be predicted)
known_age = age_df[age_df.Age.notnull()].values
unknown_age = age_df[age_df.Age.isnull()].values

# Split the training rows into features X and target y (Age is column 0)
X_train = known_age[:, 1:]
y_train = known_age[:, 0]

# Fit a random forest regressor to estimate Age
rfr = RandomForestRegressor(random_state=0, n_estimators=100, n_jobs=-1)
rfr.fit(X_train, y_train)

# Fill in the missing Age values with the predictions
predictedAges = rfr.predict(unknown_age[:, 1:])
df.loc[(df.Age.isnull()), "Age"] = predictedAges

# # Age distributions of survivors and non-survivors (training rows only)
# facet = sns.FacetGrid(df[df["Survived"].notnull()], hue="Survived", aspect=2)
# facet.map(sns.kdeplot, "Age", shade=True)
# facet.set(xlim=(0, df.loc[0:, "Age"].max()))
# facet.add_legend()
# plt.show()
# ------------------------------------------------------------------

# Drop unneeded columns
df.drop(["Name", "Ticket"], axis=1, inplace=True)

# 1 for females or passengers under 16, otherwise 0
df["FemaleOrChild"] = df.apply(
    lambda row: (
        1
        if ((pd.notnull(row["Age"]) and row["Age"] < 16) or row["Sex"] == "female")
        else 0
    ),
    axis=1,
)

# One-hot encoding
df = pd.get_dummies(df, columns=["Pclass", "Embarked", "Cabin"])

# Compute family size (siblings/spouses plus parents/children aboard)
df["Family"] = df["SibSp"] + df["Parch"]

# Drop unneeded columns
df.drop(["PassengerId", "Sex", "SibSp", "Parch"], axis=1, inplace=True)

df.isnull().sum()
# End of preprocessing

Survived         418
Age                0
Fare               0
FemaleOrChild      0
Pclass_1           0
Pclass_2           0
Pclass_3           0
Embarked_C         0
Embarked_Q         0
Embarked_S         0
Cabin_A            0
Cabin_B            0
Cabin_C            0
Cabin_D            0
Cabin_E            0
Cabin_F            0
Cabin_G            0
Cabin_Missing      0
Cabin_T            0
Family             0
dtype: int64

In [None]:
from sklearn.model_selection import train_test_split

# Split the labeled rows into training and validation sets
known_data = df[df.Survived.notnull()]
X_known = known_data.drop("Survived", axis=1).values
y_known = known_data["Survived"].values
X_train, X_valid, y_train, y_valid = train_test_split(X_known, y_known, random_state=0)

In [114]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hyperparameter tuning
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

rf = RandomForestClassifier(random_state=0)
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1,
    verbose=1,
)
grid_search.fit(X_train, y_train)
print("Best parameters:", grid_search.best_params_)
print("Best cross-validation accuracy:", grid_search.best_score_)

# Refit a RandomForest with the best parameters found
model = RandomForestClassifier(random_state=0, **grid_search.best_params_)
model.fit(X_train, y_train)

score = model.score(X_valid, y_valid)
print("Validation Accuracy: {:.4f}".format(score))

Fitting 5 folds for each of 216 candidates, totalling 1080 fits
Best parameters: {'bootstrap': True, 'max_depth': 10, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 50}
Best cross-validation accuracy: 0.8293569745258669
Validation Accuracy: 0.8341
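
The standardization and ensembling ideas listed for this attempt could be combined roughly as follows. This is a sketch under stated assumptions, not the notebook's method: it uses synthetic data from `make_classification` as a stand-in for the Titanic features, and `LogisticRegression` is an arbitrary choice of second model. The forest reuses the best parameters reported by the grid search. Scaling is applied only to the logistic regression, since tree-based models are not affected by feature scaling.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption); in the notebook, X_train / X_valid
# and y_train / y_valid from train_test_split would be used instead.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, random_state=0)

# Logistic regression is scale-sensitive, so it is wrapped with a StandardScaler.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Random forest with the parameters found by the grid search.
forest = RandomForestClassifier(
    n_estimators=50, max_depth=10, min_samples_leaf=2, random_state=0
)

# Soft voting averages the predicted class probabilities of both models.
ensemble = VotingClassifier(
    estimators=[("rf", forest), ("lr", scaled_lr)],
    voting="soft",
)
ensemble.fit(X_tr, y_tr)
ensemble_score = ensemble.score(X_va, y_va)
print("Ensemble validation accuracy: {:.4f}".format(ensemble_score))
```

Whether this beats the single random forest on the Titanic data would need to be checked against the same validation split.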


In [115]:
# Create the submission file
unknown_data = df[df.Survived.isnull()]
X_unknown = unknown_data.drop("Survived", axis=1).values
# Copy the slice so that adding a column does not trigger SettingWithCopyWarning
submit = test_df[["PassengerId"]].copy()
submit["Survived"] = model.predict(X_unknown).astype(int)
submit.to_csv("submit4.csv", index=False)
