[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/matsunagadaiki151/Kaggle_Titanic_Tutorial/blob/main/Answer/Titanic_pycaret_answer.ipynb)

## Preparation

1. The Kaggle Titanic dataset is required, so download it and upload it to Google Drive. \
It can be downloaded from https://www.kaggle.com/c/titanic/data (a Kaggle account is required).

Place the following files under `drive/MyDrive/Kaggle`.
- gender_submission.csv
- train.csv
- test.csv

2. Mount the drive. \
Do this from the folder icon on the left.
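
As an alternative to clicking the folder icon, the drive can also be mounted from code, and the three files checked before going further. The following is a minimal sketch; the helper names `find_missing_files` and `mount_drive_if_colab` are hypothetical and only for illustration, and the folder path assumes the layout described above.

```python
from pathlib import Path

# File names taken from the list above
REQUIRED_FILES = ["gender_submission.csv", "train.csv", "test.csv"]


def find_missing_files(folder, names=REQUIRED_FILES):
    """Return the names that are not present in folder (hypothetical helper)."""
    base = Path(folder)
    return [name for name in names if not (base / name).is_file()]


def mount_drive_if_colab(mount_point="/content/drive"):
    """Mount Google Drive when running in Colab; return False elsewhere."""
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        return False
    drive.mount(mount_point)
    return True


if __name__ == "__main__":
    mount_drive_if_colab()
    missing = find_missing_files("/content/drive/MyDrive/Kaggle")
    print("Missing files:", missing or "none")
```

If the printed list is not empty, upload the listed files before running the cells below.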

## Code

### Loading the data

In [1]:
# Install the required packages
!pip install matplotlib==3.3.3
!pip install category_encoders
!pip install pycaret



In [2]:
# Move to the folder that contains the dataset
%cd /content/drive/MyDrive/Kaggle

/content/drive/MyDrive/Kaggle


In [3]:
# Create the submit folder
!mkdir submit

mkdir: cannot create directory ‘submit’: File exists


In [4]:
# Import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import lightgbm as lgb
import category_encoders as ce
from pycaret.classification import *



In [4]:
issubmit = True  # Whether to submit to Kaggle

In [5]:
# Load the training data
train = pd.read_csv('train.csv')
print(train.shape)  # Check the shape
train.head()  # Show the first 5 rows

(891, 12)


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


In [7]:
# Load the test data
test = pd.read_csv('test.csv')
print(test.shape)  # Check the shape
test.head(8)  # Show the first 8 rows

(418, 11)


| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | | S |
| 5 | 897 | 3 | Svensson, Mr. Johan Cervin | male | 14.0 | 0 | 0 | 7538 | 9.225 | | S |
| 6 | 898 | 3 | Connolly, Miss. Kate | female | 30.0 | 0 | 0 | 330972 | 7.6292 | | Q |
| 7 | 899 | 2 | Caldwell, Mr. Albert Francis | male | 26.0 | 1 | 1 | 248738 | 29.0 | | S |


In [9]:
# Count the missing values in the training data
print(train.isnull().sum())
print('-'*40)
# Count the missing values in the test data
print(test.isnull().sum())

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
----------------------------------------
PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64


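This time we use AutoML, which conveniently handles data processing, model selection, training, and prediction for you. \
Here we use PyCaret, which is easy to get started with.

## Preprocessing

In PyCaret, `setup()` takes care of missing-value handling and encoding for you.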

In [11]:
# Run setup()
clf = setup(train, target='Survived')

| | Description | Value |
| --- | --- | --- |
| 0 | session_id | 4626 |
| 1 | Target | Survived |
| 2 | Target Type | Binary |
| 3 | Label Encoded | |
| 4 | Original Data | (891, 12) |
| 5 | Missing Values | True |
| 6 | Numeric Features | 3 |
| 7 | Categorical Features | 8 |
| 8 | Ordinal Features | False |
| 9 | High Cardinality Features | False |
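
To get a feel for the kind of work `setup()` automates, here is a manual sketch of missing-value imputation and encoding in pandas. The rows are made up for illustration (only the column names come from the Titanic data), and the strategies chosen here (mean, most frequent value, one-hot encoding) are assumptions that may differ from what PyCaret actually applies.

```python
import pandas as pd

# Tiny illustrative frame (hypothetical rows, not taken from train.csv)
df = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0],
    "Fare": [7.25, 71.2833, None, 53.1],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", None, "S"],
})

numeric_cols = ["Age", "Fare"]
categorical_cols = ["Sex", "Embarked"]

# Fill numeric gaps with the column mean
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# Fill categorical gaps with the most frequent value
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

# One-hot encode the categorical columns
encoded = pd.get_dummies(df, columns=categorical_cols)
print(encoded.columns.tolist())
```

Doing this by hand for every column is exactly the step that the single `setup()` call replaces.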


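## Model building

After running `setup()`, simply calling `compare_models()` computes the performance of a variety of models. \
With no arguments it already runs 10-fold cross-validation. (To change the number of folds, change the `fold` argument; the cell below uses `fold=5`.)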

In [18]:
# Train and compare a variety of models
compare_models(fold=5)

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| et | Extra Trees Classifier | 0.8267 | 0.8713 | 0.6947 | 0.8304 | 0.7537 | 0.6215 | 0.63 | 0.674 |
| ridge | Ridge Classifier | 0.8203 | 0.0 | 0.72 | 0.7945 | 0.7526 | 0.6121 | 0.6166 | 0.056 |
| rf | Random Forest Classifier | 0.8203 | 0.867 | 0.6734 | 0.8305 | 0.7391 | 0.6044 | 0.616 | 0.674 |
| gbc | Gradient Boosting Classifier | 0.8187 | 0.8682 | 0.6647 | 0.8314 | 0.7339 | 0.5993 | 0.612 | 0.442 |
| lightgbm | Light Gradient Boosting Machine | 0.8154 | 0.8514 | 0.7116 | 0.7861 | 0.7438 | 0.6004 | 0.605 | 0.106 |
| lr | Logistic Regression | 0.8106 | 0.8626 | 0.72 | 0.7713 | 0.7426 | 0.5932 | 0.5962 | 0.318 |
| dt | Decision Tree Classifier | 0.8106 | 0.786 | 0.6859 | 0.7923 | 0.731 | 0.5865 | 0.5939 | 0.066 |
| ada | Ada Boost Classifier | 0.8074 | 0.8442 | 0.6695 | 0.7914 | 0.7248 | 0.5783 | 0.5836 | 0.23 |
| knn | K Neighbors Classifier | 0.6982 | 0.7106 | 0.5337 | 0.623 | 0.5711 | 0.3411 | 0.346 | 0.19 |
| lda | Linear Discriminant Analysis | 0.6049 | 0.598 | 0.547 | 0.504 | 0.5175 | 0.1896 | 0.1907 | 0.206 |


ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                     criterion='gini', max_depth=None, max_features='auto',
                     max_leaf_nodes=None, max_samples=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
                     oob_score=False, random_state=4626, verbose=0,
                     warm_start=False)
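
The cross-validation behind these scores can be pictured with a small sketch. This is plain, unshuffled k-fold splitting written from scratch; the function name `k_fold_indices` and the sample counts are hypothetical, and PyCaret's own splitting may differ in detail (for example by shuffling or stratifying).

```python
def k_fold_indices(n_samples, n_folds):
    """Split range(n_samples) into n_folds (train, validation) index pairs."""
    # The first (n_samples % n_folds) folds get one extra sample
    base, extra = divmod(n_samples, n_folds)
    folds, start = [], 0
    for i in range(n_folds):
        size = base + (1 if i < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train_idx, val_idx))
        start += size
    return folds


# Illustrative sizes only: 23 samples, 5 folds
for fold_no, (train_idx, val_idx) in enumerate(k_fold_indices(23, 5)):
    print(f"fold {fold_no}: train={len(train_idx)} validation={len(val_idx)}")
```

Each sample lands in the validation part of exactly one fold, so every model is scored on data it was not trained on.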

In my environment the Extra Trees Classifier had the highest accuracy, so we train with it.

In [19]:
et = create_model('et')

| Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.7143 | 0.8226 | 0.5417 | 0.65 | 0.5909 | 0.3742 | 0.3778 |
| 1 | 0.873 | 0.922 | 0.75 | 0.9 | 0.8182 | 0.7219 | 0.7289 |
| 2 | 0.7937 | 0.8275 | 0.5833 | 0.8235 | 0.6829 | 0.5365 | 0.554 |
| 3 | 0.8387 | 0.8049 | 0.6087 | 0.9333 | 0.7368 | 0.6279 | 0.6577 |
| 4 | 0.871 | 0.9153 | 0.8696 | 0.8 | 0.8333 | 0.7284 | 0.7301 |
| 5 | 0.7903 | 0.8679 | 0.6522 | 0.75 | 0.6977 | 0.5384 | 0.5415 |
| 6 | 0.8548 | 0.8974 | 0.7391 | 0.85 | 0.7907 | 0.6804 | 0.6843 |
| 7 | 0.7742 | 0.8081 | 0.6667 | 0.7273 | 0.6957 | 0.5167 | 0.5179 |
| 8 | 0.8387 | 0.8745 | 0.6667 | 0.8889 | 0.7619 | 0.6437 | 0.6589 |
| 9 | 0.9032 | 0.9413 | 0.7917 | 0.95 | 0.8636 | 0.7896 | 0.7975 |
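
The output above has one row per fold, so a single summary score has to be computed from them. A minimal sketch using only the standard library, with the accuracy values copied from the rows above:

```python
from statistics import mean, stdev

# Per-fold accuracy copied from the create_model('et') output above
fold_accuracy = [0.7143, 0.873, 0.7937, 0.8387, 0.871,
                 0.7903, 0.8548, 0.7742, 0.8387, 0.9032]

mean_accuracy = mean(fold_accuracy)
spread = stdev(fold_accuracy)  # sample standard deviation across folds

print(f"mean accuracy: {mean_accuracy:.4f}")
print(f"std deviation: {spread:.4f}")
```

The spread across folds is worth a glance: with a dataset this small, individual folds range from about 0.71 to 0.90, so small differences between models in the comparison table should not be over-interpreted.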


## Prediction

Prediction also takes a single line: `predict_model(model, data=test_data)`.
It returns a DataFrame that includes the predictions, so extract the prediction results from it.

In [22]:
test_preds_df = predict_model(et, data=test)
test_preds_df.head()

In [25]:
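# In PyCaret 2.x the predicted class is stored in the 'Label' column
test_preds = test_preds_df['Label'].values
print(len(test_preds))  # Number of predictions; should match the 418 test rows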

418


In [26]:
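# Submission
import os  # needed for os.makedirs below

if issubmit:
    os.makedirs('submit/', exist_ok=True)
    # Use the sample submission as a template and overwrite its predictions
    submit = pd.read_csv('gender_submission.csv')
    submit['Survived'] = test_preds
    submit.to_csv('submit/my_submit9.csv', index=False)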
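
Before uploading, it can help to sanity-check the file. The sketch below assumes the usual Titanic submission format (a `PassengerId,Survived` header, one row per test passenger, and 0/1 predictions); `check_submission` is a hypothetical helper, not part of PyCaret or Kaggle tooling.

```python
import csv


def check_submission(path, expected_rows=418):
    """Return a list of problems found in a Titanic submission file."""
    problems = []
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        # The header must be exactly these two columns, in this order
        if reader.fieldnames != ["PassengerId", "Survived"]:
            problems.append(f"unexpected header: {reader.fieldnames}")
            return problems
        rows = list(reader)
    # One prediction per row of test.csv (418 rows in this notebook)
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    # Predictions must be written as the integers 0 or 1
    if any(row["Survived"] not in ("0", "1") for row in rows):
        problems.append("Survived must be 0 or 1")
    return problems
```

Calling `check_submission('submit/my_submit9.csv')` after the cell above should return an empty list if the file is well formed.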

## Next steps

- Try hyperparameter tuning.
- Try ensembling.
- Try visualizing the model details.
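
As a starting point for these three items, here is a hedged sketch using PyCaret's `tune_model()`, `ensemble_model()`, and `plot_model()`. The wrapper `try_next_steps` is a hypothetical helper; it assumes `setup()` has already been run in the session and that the model comes from `create_model()`. All calls use default arguments, which are not necessarily good choices for this dataset.

```python
def try_next_steps(model):
    """Run tuning, ensembling and plotting on a model from create_model()."""
    # Imported lazily so this cell can be defined before PyCaret is installed
    from pycaret.classification import tune_model, ensemble_model, plot_model

    tuned = tune_model(model)          # hyperparameter search with cross-validation
    bagged = ensemble_model(model)     # bagging ensemble built from the same estimator
    plot_model(tuned, plot='feature')  # feature-importance plot
    return tuned, bagged
```

For example, `tuned_et, bagged_et = try_next_steps(et)` would reuse the Extra Trees model trained earlier; compare the resulting cross-validation scores with the earlier ones before deciding which model to submit.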

## References
- PyCaret official documentation: https://pycaret.org/setup/
- Japanese article explaining a PyCaret implementation: https://qiita.com/s_fukuzawa/items/5dd40a008dac76595eea