# More Luxurious and More Durable than the Titanic https://reurl.cc/5lmYez

- Note: all the data used so far can be downloaded from [this link](https://reurl.cc/QdMz09); the short URL is: https://reurl.cc/QdMz09

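The Titanic problem circulating online today comes from a [competition](https://www.kaggle.com/c/titanic) that Kaggle held ten years ago. Working with its dataset calls for the full range of feature-engineering techniques and gives plenty of practice with tools such as Pandas, Seaborn and Matplotlib. These exercises also sharpen one's eye for data analysis and yield what people call insight.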

An academic journal, the [Journal of Statistics Education](https://amstat.tandfonline.com/toc/ujse20/current), published the article [The “Unusual Episode” and a Second Statistics Course](https://amstat.tandfonline.com/doi/full/10.1080/10691898.1997.11910524#.XxsLtvgzblw) in volume 5, issue 1 (1997). It points out that although the sinking of the Titanic was a tragedy, its dataset has far-reaching value for statistics education.

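For anyone who wants to get to the bottom of the Titanic data, the website [Encyclopedia Titanica](https://www.encyclopedia-titanica.org/) offers a wealth of reference material. In fact, the different versions of the Titanic teaching datasets circulating online were all compiled from the raw data this site provides.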

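According to the teaching material collected on the [website](https://biostat.app.vumc.org/wiki) of the Department of Biostatistics at Vanderbilt University, an instructor named Robert Dawson compiled a file called titanic3.xls from Encyclopedia Titanica. titanic3.xls differs from the Kaggle version, especially in which columns have missing values and in what proportion, which makes it a greater challenge for practising data analysis techniques. See [this document](http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3info.txt) to learn how the data was collected and constructed.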

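Since version 0.20, scikit-learn has provided a new function, [fetch_openml](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_openml.html), which downloads datasets from the [OpenML](https://openml.org/) website.
OpenML also hosts the Titanic data, so it is one more download source. How to use the function is explained later.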

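Addendum: the Vanderbilt University website has been redesigned and moved to a new address; the hyperlinks here point to the new one. Just in case, I downloaded titanic3.csv and keep a copy in my data collection.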

`RMS Titanic` was a British passenger liner that sank in the North Atlantic Ocean in the early hours of `15 April 1912`, after colliding with an iceberg during her maiden voyage from Southampton to New York City. There were an estimated 2,224 passengers and crew aboard, and more than 1,500 died, making it one of the deadliest commercial peacetime maritime disasters in modern history. RMS Titanic was the largest ship afloat at the time she entered service and was the second of three Olympic-class ocean liners operated by the White Star Line.

# Load required definitions and libraries

In [None]:
try:
    from google.colab import drive, files
    in_colab = True
except ModuleNotFoundError:
    in_colab = False

if in_colab:
    home_dir = ''
    drive.mount('/content/drive')
    groot_dir = '/content/drive/My Drive/adventures/'
else:
    from pathlib import Path
    home_dir = str(Path.home())
    groot_dir = home_dir + '/Google Drive/adventures/'

import matplotlib as mpl
mpl.rc('axes', labelsize=12)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)
mpl.rc('font', size=12)

from datetime import datetime
from dateutil.relativedelta import relativedelta
import matplotlib.pyplot as plt
import sklearn
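# compare numeric (major, minor) parts; plain string comparison misorders versions such as "0.9" and "0.21"
assert tuple(int(p) for p in sklearn.__version__.split('.')[:2]) >= (0, 21)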
import seaborn as sns
import pandas as pd
import numpy as np
import math
import os
import sys
import gdown
import requests
# Ignore useless warnings (see SciPy issue #5998)
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")
from pandas.plotting import register_matplotlib_converters

figure_dir = groot_dir + 'figure/titanic/'
data_dir = groot_dir + 'titanic/'

gfigure = lambda name: figure_dir + name + '.png'
output_fig = lambda name: plt.savefig( gfigure(name), dpi = 300)

local_time = lambda x, offset: x + relativedelta(hours= offset)
def local_now(hours = 8):
    return datetime.now() + relativedelta(hours = hours if in_colab else 0)

def print_now():
    return print(local_now())

def print_local_now():
    return print('Local Time:', local_now())

def DropboxLink(did, fname):
    return 'https://dl.dropboxusercontent.com/s/%s/%s' % \
    (did, fname)

def fetch_gdrive_file(fid, local_save):
    remote_url = 'https://drive.google.com/uc?id=' + fid
    gdown.download(remote_url, local_save, quiet = False)

def fetch_file_via_requests(url, save_in_dir):
    local_filename = url.split('/')[-1]
    # NOTE the stream=True parameter below
    output_fpath = save_in_dir + local_filename
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(output_fpath, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192): 
                if chunk: # filter out keep-alive new chunks
                    f.write(chunk)
                    # f.flush()
    return output_fpath
        
def start_plot(figsize=(10, 8), style = 'whitegrid', dpi = 100):
    fig = plt.figure(figsize=figsize, dpi=dpi)
    gs = fig.add_gridspec(1,1)
    plt.tight_layout()
    with sns.axes_style(style):
        ax = fig.add_subplot(gs[0,0])
    return ax

def start_plot_hires(figsize=(10, 8), style = 'whitegrid',
        dpi = 100):
    fig = plt.figure(figsize=figsize, dpi=dpi)
    gs = fig.add_gridspec(1,1)
    plt.tight_layout()
    with sns.axes_style(style):
        ax = fig.add_subplot(gs[0,0])
    return ax

TITANIC_TRAIN = '1PrxmUKRQWSlYgtMU13l1E0ob4hVJI20O'
TITANIC_TEST = '1iiU-W5rdRnbhZDt92rmQeOY6H9KsV-1X'

print('\nThis module is aimed to explore titanic dataset and beyond...')

print('\nRunning on %s' % sys.platform)
print('Python Version', sys.version)
print('Data storage root points to ==>', groot_dir)
print('Titanic data will be stored at ==>', data_dir)
print('\nLibraries and dependencies imported')
print_local_now()


This module is aimed to explore titanic dataset and beyond...

Running on darwin
Python Version 3.8.5 (v3.8.5:580fbb018f, Jul 20 2020, 12:11:27) 
[Clang 6.0 (clang-600.0.57)]
Data storage root points to ==> /Users/roger/Google Drive/adventures/
Titanic data will be stored at ==> /Users/roger/Google Drive/adventures/titanic/

Libraries and dependencies imported
Local Time: 2020-08-02 15:39:04.396191


# Fetch files (download the datasets)

- [Titanic on Kaggle](https://www.kaggle.com/c/titanic/data)

In [None]:
fetch_file_via_requests(
    DropboxLink('4j4npddumn17e9p', 'train.csv'), data_dir )

fetch_file_via_requests(
    DropboxLink('dpx3t2z46tckq3o', 'test.csv'), data_dir )

fetch_file_via_requests( 
'https://biostat.app.vumc.org/wiki/pub/Main/DataSets/titanic3.csv',
data_dir)

print('data collected from remote site')
print_now()

# scikit-learn fetch_openml

- [OpenML](https://openml.org/)
- [fetch_openml](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_openml.html#sklearn.datasets.fetch_openml)

In [None]:
from sklearn.datasets import fetch_openml

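# compare numeric (major, minor) parts rather than raw strings
assert tuple(int(p) for p in sklearn.__version__.split('.')[:2]) >= (0, 22)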

titanic = fetch_openml('titanic', version=1, as_frame=True)

In [None]:
titanic_data = titanic['data']
titanic_data.columns

Index(['pclass', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket', 'fare',
       'cabin', 'embarked', 'boat', 'body', 'home.dest'],
      dtype='object')

In [None]:
type(titanic_data)

pandas.core.frame.DataFrame

# Load data into memory

In [None]:
t3_csv = os.path.join(data_dir, 'titanic3.csv')
t3 = pd.read_csv(t3_csv)
t3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   pclass     1309 non-null   int64  
 1   survived   1309 non-null   int64  
 2   name       1309 non-null   object 
 3   sex        1309 non-null   object 
 4   age        1046 non-null   float64
 5   sibsp      1309 non-null   int64  
 6   parch      1309 non-null   int64  
 7   ticket     1309 non-null   object 
 8   fare       1308 non-null   float64
 9   cabin      295 non-null    object 
 10  embarked   1307 non-null   object 
 11  boat       486 non-null    object 
 12  body       121 non-null    float64
 13  home.dest  745 non-null    object 
dtypes: float64(3), int64(4), object(7)
memory usage: 143.3+ KB


In [None]:
train_csv = os.path.join(data_dir, 'train.csv')
test_csv = os.path.join(data_dir, 'test.csv')

In [None]:
tdf = pd.read_csv(train_csv)
tdf.shape, tdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


((891, 12), None)

In [None]:
tdf = tdf.drop(['PassengerId'], axis = 1)
print('Titanic dataset ready for exploring')
print_now()

Titanic dataset ready for exploring
2020-08-04 13:06:36.888185


# Characteristics of the dataset

- [titanic3 column descriptions](https://www.rdocumentation.org/packages/PASWR/versions/1.1/topics/titanic3)
- [Kaggle Version](https://www.kaggle.com/c/titanic/data)

## Features (column descriptions):
* `Survived`: Survival, 1 = Yes, 0 = No (the target we want to predict)
* `PassengerId`: Unique ID of a passenger
* `Pclass`: Ticket class, 1 = 1st, 2 = 2nd, 3 = 3rd (1st is the highest class)
* `Sex`: Sex (female or male)
* `Age`: Age in years
* `SibSp`: # of siblings / spouses aboard the Titanic
* `Parch`: # of parents / children aboard the Titanic. The family relations counted here are mother, father, daughter, son, stepdaughter and stepson (some children travelled only with a nanny, therefore parch=0 for them)
* `Ticket`: Ticket number
* `Fare`: Passenger fare (the cost written on the passenger's ticket)
* `Cabin`: Cabin number
* `Embarked`: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

## How many missing values are there

- [PyPI missingno](https://pypi.org/project/missingno/)
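
Before reaching for a visual tool such as missingno, a plain pandas count already answers the question in this heading. The sketch below is only an illustration: the `toy` frame and the helper name `missing_summary` are hypothetical and stand in for the real Titanic DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data
toy = pd.DataFrame({
    'Age':   [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare':  [7.25, 71.28, 7.92, 8.05],
})

def missing_summary(df):
    """Return the count and percentage of missing values per column."""
    n_missing = df.isnull().sum()
    percent = 100 * n_missing / len(df)
    summary = pd.DataFrame({'missing': n_missing, 'percent': percent})
    # columns with the most missing values come first
    return summary.sort_values('missing', ascending=False)

summary = missing_summary(toy)
print(summary)
```

Calling the same helper on `tdf` or `t3` should reproduce the gaps that `info()` reports for columns such as `Age` and `Cabin`.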

## Overall distribution of the data

- [seaborn set](https://seaborn.pydata.org/generated/seaborn.set.html)
- [kdeplot](https://seaborn.pydata.org/generated/seaborn.kdeplot.html?highlight=kdeplot#seaborn.kdeplot)
- [distplot](https://seaborn.pydata.org/generated/seaborn.distplot.html?highlight=distplot#seaborn.distplot)
- [Pandas.DataFrame.hist](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.hist.html)
- [Shapiro-Wilk Test](https://en.wikipedia.org/wiki/Shapiro%E2%80%93Wilk_test)

## Correlation
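
As a rough illustration of what a correlation check could look like, the sketch below computes the Pearson correlation matrix with `DataFrame.corr()` and ranks the features by their correlation with the target. The `toy` values are made up for the example and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical miniature of the numeric Titanic columns
toy = pd.DataFrame({
    'Pclass':   [1, 2, 3, 1, 3],
    'Fare':     [80.0, 20.0, 8.0, 60.0, 7.0],
    'Survived': [1, 1, 0, 1, 0],
})

# Pearson correlation between every pair of numeric columns
corr = toy.corr()

# correlation of each feature with the target, weakest to strongest
target_corr = corr['Survived'].drop('Survived').sort_values()
print(target_corr)
```

On the real data the matrix is usually easier to read as a heatmap, for example with `sns.heatmap(corr, annot=True)`.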

# A basic survey with pandas-profiling


To install Pandas Profiling, enter the following commands in a terminal (or the CMD command prompt on Windows):

```
pip install -U pandas-profiling[notebook]
jupyter nbextension enable --py widgetsnbextension
```

References:
- [Installation](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/pages/installation.html)
- [Getting Started](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/pages/getting_started.html)

In [None]:
%%time
from pandas_profiling import ProfileReport
profile = ProfileReport(tdf, title='Titanic Analysis')
print('report generated')
print_now()

report generated
2020-08-04 15:12:29.747282
CPU times: user 1.29 s, sys: 438 ms, total: 1.73 s
Wall time: 1.88 s


# Installing virtualenv

```
pip3 install virtualenv
```

Reference links:

- [virtualenv homepage](https://virtualenv.pypa.io/)
- [PyPI virtualenv](https://pypi.org/project/virtualenv/)
- [Virtual Environments and Packages (Python Documentation)](https://docs.python.org/zh-tw/3/tutorial/venv.html)
- [Python: Virtualenv virtual environment installation](https://medium.com/python4u/python-virtualenv%E8%99%9B%E6%93%AC%E7%92%B0%E5%A2%83%E5%AE%89%E8%A3%9D-9d6be2d45db9)
------

To set up a new test environment, enter the following commands in a terminal:


```
virtualenv new_env
cd new_env
source bin/activate
```

Leaving the test environment takes a single command:

```
deactivate
```

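To delete the whole environment, make sure you have deactivated it first, then simply delete its directory.

### virtualenv command examples

In a command-line environment, run the following commands to set up a virtual environment and run the file analysis (lines starting with # are explanations and do not need to be executed):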

```
pip3 install virtualenv

#
# go back to the home directory
#
cd
pwd

#
# create the virtual working environment ptest
#
virtualenv ptest
cd ptest
source bin/activate

#
# install pandas-profiling
#
pip3 install pandas-profiling

#
# assume train.csv has already been downloaded to the ~/Downloads directory
#
bin/pandas_profiling ~/Downloads/train.csv ~/Downloads/test2020.html
```

# Homemade analysis tools

## bar_count

In [None]:
colors = ['teal', 'brown', 'darkorange', 'lavender',  'sienna', 'azure', 'purple',
           'navy', 'lightblue', 'pink']

counts = lambda df, feature: df[feature].value_counts()

var_analysis = lambda df, feature, y: \
    df[[feature, y]].groupby(feature, as_index=False).mean().sort_values(by=y, ascending = False)

def bar_count(df, feature, figsize = (10, 10), 
        dpi = 100, ax = None):
    if ax is None:
        fig, ax = plt.subplots(figsize = figsize, dpi = dpi)
    else:
        fig = ax.figure   
    # k = df[feature].value_counts()
    k = df.groupby([feature]).size()
    q = 100*(k/len(df))

    ax.bar([str(e) for e in k.index] if type(k.index[0]) is np.int64 else k.index, 
           k, color = colors, alpha = 0.7)
    
    for j, (i, p) in enumerate(zip(k.index, ax.patches)):
        # ypos = max(k[i] - 75, 25)
        ypos = max(p.get_height() - 60, 65)
        ax.text(j, ypos , '%d' % k[i], fontsize = 12, ha = 'center')
        ax.text(j, ypos - 48, '%.1f%%' % q[i], fontsize = 11, 
            ha = 'center')
     
    ax.set_ylabel('Counts (#)')
    ax.set_xlabel(feature)
    return ax

# sns.set(style='darkgrid')
# bar_count(tdf, 'Parch', figsize = (5,3))


## bar_ratio

In [None]:
def bar_ratio(df, feature, target = 'Survived', figsize = (6, 6), 
        dpi = 100, ax = None):
    
    if ax is None:
        fig, ax = plt.subplots(figsize = figsize, dpi = dpi)
    else:
        fig = ax.figure   

    # mean of the target per category, expressed as a percentage
    a = df.groupby(feature)[target].mean() * 100
    a = a.reset_index()
    sns.barplot(x = feature, y = target, 
        data = a, ax = ax, palette='coolwarm') 
    for j, (i, p) in enumerate(zip(a.index, ax.patches)):
        ypos = max(p.get_height() - 30, 20)
        ax.text(j, ypos , '%.1f%%' % a.loc[i, target], 
            fontsize = 12, 
            ha = 'center')
     
    ax.set_ylabel('Survival Rate (%)')
    ax.set_xlabel(feature)
    return ax

# sns.set(style='ticks')
# bar_ratio(tdf, 'Pclass', figsize = (6,3))

## var_corr

In [None]:
def var_corr(df, feature, target):
    fig ,ax = plt.subplots(1, 3, figsize = (15, 5), dpi = 100)
    
    bar_count(df, feature, ax = ax[0])
#     sns.countplot(feature, data = df, palette = 'Blues', 
#                 ax = ax[0])
    
    sns.countplot( x=  feature, hue = target , data = df, 
        palette = 'Set1', ax = ax[1])
    
    sns.countplot( x=  target, hue = feature , data = df, 
        palette = 'Set2', ax = ax[2])
    plt.tight_layout()

def var_corr2(df, feature, target):
    fig ,ax = plt.subplots(1, 3, figsize = (15, 5), dpi = 100)
    
    bar_count(df, feature, ax = ax[0])

    bar_ratio(df, feature, target, ax = ax[1])
    
    sns.countplot( x=  target, hue = feature , data = df, 
        palette = 'Set2', ax = ax[2])
    plt.tight_layout()

# Column characteristics

## Count

## Distribution, Histogram

## Seaborn Boxplot

- [Start Here:Titanic Project From Beginner to Expert](https://www.kaggle.com/muhammetcakmak/start-here-titanic-project-from-beginner-to-expert/notebook)

## jointplot

# Feature Engineering

## Which columns are reasonable to add?

In [None]:
tdf.Name

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

regular expression
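
One possible way to pull the title out of a name with a regular expression is sketched below. It assumes every name follows the `Surname, Title. Given names` pattern seen in the output above; the sample names are copied from that output, and the pattern itself is an illustration rather than the notebook's own code.

```python
import pandas as pd

# A few names copied from the Name column shown above
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Montvila, Rev. Juozas',
])

# capture the text between the comma and the first period
titles = names.str.extract(r',\s*([^.]+)\.', expand=False).str.strip()
print(titles.tolist())
```

The same result can be obtained without a regular expression by splitting on `','` and `'.'`, which is the approach the complete pipeline later in this notebook takes.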

### Names and titles

In [None]:
# normalize the titles
normalized_titles = {
    "Capt":       "Officer",
    "Col":        "Officer",
    "Major":      "Officer",
    "Jonkheer":   "Royalty",
    "Don":        "Royalty",
    "Sir" :       "Royalty",
    "Dr":         "Officer",
    "Rev":        "Officer",
    "the Countess":"Royalty",
    "Dona":       "Royalty",
    "Mme":        "Mrs",
    "Mlle":       "Miss",
    "Ms":         "Mrs",
    "Mr" :        "Mr",
    "Mrs" :       "Mrs",
    "Miss" :      "Miss",
    "Master" :    "Master",
    "Lady" :      "Royalty"
}
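# extract the raw title between the comma and the first period of each name
title_slice = tdf.Name.apply(lambda name: name.split(',')[1].split('.')[0].strip())
# map each raw title to its normalized group
tdf['Title'] = title_slice.map(normalized_titles)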

### Number of travelling companions

* `SibSp`: # of siblings / spouses aboard the Titanic
* `Parch`: # of parents / children aboard the Titanic, where the family relations counted are mother, father, daughter, son, stepdaughter and stepson (some children travelled only with a nanny, therefore parch=0 for them)
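
The two columns can be combined into a single companion count, as sketched below on made-up rows. The `relatives` name matches the column built later in the complete pipeline; the `is_alone` flag is a hypothetical extra added only for illustration.

```python
import pandas as pd

# Hypothetical rows with the two family-size columns
toy = pd.DataFrame({'SibSp': [1, 0, 3, 0], 'Parch': [0, 0, 2, 1]})

# total number of family members travelling with the passenger
toy['relatives'] = toy['SibSp'] + toy['Parch']

# hypothetical flag: 1 when the passenger travels without family
toy['is_alone'] = (toy['relatives'] == 0).astype(int)
print(toy)
```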


## How to handle missing values

- Deleting records
- Dropping variables
- Replacing with zero/ last known values/ mean/ median/ mode/ specific constant
- Interpolation
- Predicting the value with specific algorithm
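
Three of the strategies listed above can be sketched in a few lines of pandas. The `age` values are invented for the example; on the real data the same calls would apply to `tdf['Age']`.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two gaps
age = pd.Series([22.0, np.nan, 26.0, 35.0, np.nan, 54.0])

# 1. delete the records that have no value
dropped = age.dropna()

# 2. replace the gaps with a summary statistic (here the median)
filled_median = age.fillna(age.median())

# 3. linear interpolation between the neighbouring known values
interpolated = age.interpolate()

print(len(dropped), filled_median.tolist(), interpolated.tolist())
```

Interpolation only makes sense when the row order carries meaning, which is not obviously the case for passenger records, so the median fill is usually the safer default here.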


### The simplest approach

### A more elaborate version

In [None]:
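# draw random integer ages within one standard deviation of the mean
mean = tdf2['Age'].mean()
std = tdf2['Age'].std()
is_null = tdf2["Age"].isnull().sum()
rand_age = np.random.randint(int(mean - std), int(mean + std), size = is_null)
# fill only the missing entries, leaving the observed ages untouched
age_slice = tdf2["Age"].copy()
age_slice[np.isnan(age_slice)] = rand_age
tdf2['Age'] = age_slice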

In [None]:
tdf2.Age.mean(), tdf2.Age.std()

(29.54901234567901, 13.575467551733238)

## Categorical Encoding

### label encoding
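
Label encoding replaces each category with an integer code. A minimal sketch using the pandas `category` dtype follows; the sample values are hypothetical, and scikit-learn's `LabelEncoder` would be an alternative for the same job.

```python
import pandas as pd

# Hypothetical sample of the Embarked column
embarked = pd.Series(['S', 'C', 'S', 'Q', 'C', 'S'])

# categories are sorted alphabetically, then numbered from 0
categories = embarked.astype('category')
codes = categories.cat.codes

# keep the code-to-label mapping so the encoding can be reversed
mapping = dict(enumerate(categories.cat.categories))
print(codes.tolist(), mapping)
```

The integer codes imply an order (C < Q < S) that the ports do not really have, which is one reason the pipelines in this notebook use one-hot encoding for `Embarked` instead.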

### frequency encoding, target, count
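
The three encodings named in this heading all map a category to a number derived from the data. The sketch below shows one simple way to compute each with pandas on a made-up frame; the new column names are hypothetical.

```python
import pandas as pd

# Hypothetical sample with one categorical feature and the target
toy = pd.DataFrame({
    'Embarked': ['S', 'C', 'S', 'Q', 'C', 'S'],
    'Survived': [0, 1, 1, 0, 1, 0],
})

# count encoding: how many rows share the category
count_map = toy['Embarked'].value_counts()
toy['Embarked_count'] = toy['Embarked'].map(count_map)

# frequency encoding: the same count as a share of all rows
toy['Embarked_freq'] = toy['Embarked'].map(count_map / len(toy))

# target encoding: mean of the target within each category
target_map = toy.groupby('Embarked')['Survived'].mean()
toy['Embarked_target'] = toy['Embarked'].map(target_map)
print(toy)
```

Target encoding computed this way uses the label of every row, so in practice the mapping should be fitted on the training folds only to avoid leaking the target into validation data.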

### Python Class: Category Encoder

- [Category Encoder](http://contrib.scikit-learn.org/category_encoders/)

# one-stop shopping

- [Kaggle Titanic](https://www.kaggle.com/c/titanic/data)

In [None]:
train_csv = os.path.join(data_dir, 'train.csv')
test_csv = os.path.join(data_dir, 'test.csv')

train_df = pd.read_csv(train_csv)
test_df = pd.read_csv(test_csv)


## Simple Version

In [None]:
%%time

train_csv = os.path.join(data_dir, 'train.csv')
test_csv = os.path.join(data_dir, 'test.csv')

train_df = pd.read_csv(train_csv)
test_df = pd.read_csv(test_csv)
split_point = len(train_df)

tdf = pd.concat([train_df, test_df], axis = 0).reset_index(drop = True)

#
# convert male/female to 0/1
#
tdf.Sex = tdf.Sex.map({"male": 0, "female":1})

#
# Dealing with 'missing values'
#
most_embarked = tdf.Embarked.mode()[0]
tdf.Embarked = tdf.Embarked.fillna(most_embarked)
tdf['Age'] = tdf['Age'].fillna(tdf['Age'].mean())
tdf['Fare'] = tdf.Fare.fillna(tdf.Fare.median())

#
# one hot encoding
#
tdf_simple = pd.get_dummies(tdf, prefix_sep='_',
    columns = ['Pclass', 'Embarked'])

#
# drop irrelevant features
#
tdf_simple.drop(['PassengerId', 'Ticket', 'Cabin', 'Name'], axis = 1, 
    inplace = True)

# tdf_simple.to_csv(os.path.join(data_dir, 'simple_train.csv'), index=False)

CPU times: user 25.1 ms, sys: 6.98 ms, total: 32.1 ms
Wall time: 36.4 ms


In [None]:
display(tdf_simple.head())
tdf_simple.shape

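|   | Survived | Sex | Age | SibSp | Parch | Fare | Pclass_1 | Pclass_2 | Pclass_3 | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0 | 22.0 | 1 | 0 | 7.25 | 0 | 0 | 1 | 0 | 0 | 1 |
| 1 | 1.0 | 1 | 38.0 | 1 | 0 | 71.2833 | 1 | 0 | 0 | 1 | 0 | 0 |
| 2 | 1.0 | 1 | 26.0 | 0 | 0 | 7.925 | 0 | 0 | 1 | 0 | 0 | 1 |
| 3 | 1.0 | 1 | 35.0 | 1 | 0 | 53.1 | 1 | 0 | 0 | 0 | 0 | 1 |
| 4 | 0.0 | 0 | 35.0 | 0 | 0 | 8.05 | 0 | 0 | 1 | 0 | 0 | 1 |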


(1309, 12)

In [None]:
# copy the slices so the in-place drop below does not act on a view of tdf_simple
simple_train = tdf_simple[:split_point].copy()
simple_test = tdf_simple[split_point:].copy()
simple_train = simple_train.astype({'Survived':int})
simple_test.drop(['Survived'], axis = 1, inplace = True)
simple_train.to_csv(os.path.join(data_dir, 'simple_train.csv'), index=False)
simple_test.to_csv(os.path.join(data_dir, 'simple_test.csv'), index=False)


## Complete version


1. Convert the Sex column to 0 / 1
1. Add a Title column
1. Add a relatives column
1. Handle missing values
    - Embarked
    - Age
    - Fare

1. One Hot Encoding
1. Standardize the numeric columns
1. Drop 'PassengerId', 'Ticket', 'Cabin', 'Name'
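
To make the standardization step concrete, the sketch below applies the z-score formula by hand to a few fare values. It uses the population standard deviation (`ddof=0`), which is the convention scikit-learn's `StandardScaler` follows; the `fare` values are illustrative and the variable names are hypothetical.

```python
import pandas as pd

# A few illustrative fare values
fare = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05])

# z-score: subtract the mean, divide by the population standard deviation
z = (fare - fare.mean()) / fare.std(ddof=0)
print(z.round(3).tolist())
```

After the transformation the column has mean 0 and unit variance, which keeps features on very different scales, such as `Fare` and `SibSp`, comparable for models like SVC and kNN.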

### here we go

In [None]:
%%time

from sklearn.preprocessing import MinMaxScaler, StandardScaler

train_csv = os.path.join(data_dir, 'train.csv')
test_csv = os.path.join(data_dir, 'test.csv')

train_df = pd.read_csv(train_csv)
test_df = pd.read_csv(test_csv)

tdf = pd.concat([train_df, test_df], axis = 0).reset_index(drop = True)
split_point = len(train_df)

#
# convert Sex to integer-coded 
#
tdf.Sex = tdf.Sex.map({"male": 0, "female":1})

# create a new feature to extract title names from 
# the Name column
tdf['Title'] = tdf.Name.apply(lambda name: name.split(',')[1].split('.')[0].strip())

#
# normalize the titles
#
normalized_titles = {
    "Capt":       "Officer",
    "Col":        "Officer",
    "Major":      "Officer",
    "Jonkheer":   "Royalty",
    "Don":        "Royalty",
    "Sir" :       "Royalty",
    "Dr":         "Officer",
    "Rev":        "Officer",
    "the Countess":"Royalty",
    "Dona":       "Royalty",
    "Mme":        "Mrs",
    "Mlle":       "Miss",
    "Ms":         "Mrs",
    "Mr" :        "Mr",
    "Mrs" :       "Mrs",
    "Miss" :      "Miss",
    "Master" :    "Master",
    "Lady" :      "Royalty"
}
# map the normalized titles to the current titles 
tdf.Title = tdf.Title.map(normalized_titles)

tdf['relatives'] = tdf['SibSp'] + tdf['Parch']

#
# Dealing with 'missing values'
#
most_embarked = tdf.Embarked.mode()[0]
tdf.Embarked = tdf.Embarked.fillna('C')

# returned value of pandas.groupby is a GroupBy object
# that contains information about the groups
grouped = tdf.groupby(['Sex', 'Pclass', 'Title'])  

# fill missing Age with the median of each (Sex, Pclass, Title) group;
# transform keeps the result aligned with the original index
tdf.Age = grouped.Age.transform(lambda x: x.fillna(x.median()))

grouped = tdf.groupby('Pclass')
tdf['Fare'] = grouped.Fare.transform(lambda x: x.fillna(x.median()))

#
# drop irrelevant features
#
tdf.drop(['PassengerId', 'Ticket', 'Cabin', 'Name'], axis = 1, 
    inplace = True)

cat_features = tdf.select_dtypes(include=['object']).columns.tolist()
num_features = \
    tdf.select_dtypes(include= np.number).columns.tolist()
num_features.remove('Sex')
num_features.remove('Survived')
# print(num_features)
se = StandardScaler()
tdf[num_features] = se.fit_transform(tdf[num_features])

tdf_ohe = pd.get_dummies(tdf, prefix_sep='_',
    columns = cat_features)

print('DataFrame is ready for future')
print_now()

DataFrame is ready for future
2020-08-06 10:49:54.116104
CPU times: user 66.9 ms, sys: 70.8 ms, total: 138 ms
Wall time: 216 ms


### Split train, test

In [None]:
# copy the slices so the in-place drop below does not act on a view of tdf_ohe
my_train = tdf_ohe[:split_point].copy()
my_test = tdf_ohe[split_point:].copy()
my_train = my_train.astype({'Survived':int})
my_test.drop(['Survived'], axis = 1, inplace = True)
display(my_train.head())


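|   | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | relatives | Embarked_C | Embarked_Q | Embarked_S | Title_Master | Title_Miss | Title_Mr | Title_Mrs | Title_Officer | Title_Royalty |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.841916 | 0 | -0.541471 | 0.481288 | -0.445 | -0.503176 | 0.073352 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 1 | -1.546098 | 1 | 0.648868 | 0.481288 | -0.445 | 0.734809 | 0.073352 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 1 | 0.841916 | 1 | -0.243886 | -0.479087 | -0.445 | -0.490126 | -0.558346 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 1 | -1.546098 | 1 | 0.42568 | 0.481288 | -0.445 | 0.383263 | 0.073352 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 0 | 0.841916 | 0 | 0.42568 | -0.479087 | -0.445 | -0.487709 | -0.558346 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |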


In [None]:
my_train.to_csv( os.path.join(data_dir, 'my_train.csv'), index = False)
my_test.to_csv( os.path.join(data_dir, 'my_test.csv'), index = False)
print_now()

2020-08-06 10:50:40.775981


# SVC

In [None]:
%%time
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold, StratifiedKFold

df = pd.read_csv(os.path.join(data_dir, 'my_train.csv'))
df2 = pd.read_csv(os.path.join(data_dir, 'simple_train.csv'))

y = df.Survived
X = df.drop(['Survived'], axis = 1)

y2 = df2.Survived
X2 = df2.drop(['Survived'], axis = 1)

model = SVC()
model.fit(X.values, y)
print(model.score(X, y))

model = SVC()
model.fit(X2.values, y2)
print(model.score(X2, y2))

# Let's see how well it works

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC, SVC
from sklearn.linear_model import SGDClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from lightgbm import LGBMClassifier
from sklearn.tree import DecisionTreeClassifier


models = [
    SVC(),
    RandomForestClassifier(),
    DecisionTreeClassifier(),
    KNeighborsClassifier(),
    AdaBoostClassifier(),
    GradientBoostingClassifier(),
    LGBMClassifier(),
    LogisticRegression(max_iter=1000)
]

titles = ['SVC', 'RF', 'DT', 'kNN', 
    'AdaBoost','GB', 'LightGBM', 'LR']

In [None]:
%%time

from sklearn.model_selection import KFold, cross_val_score, StratifiedKFold
import matplotlib.pyplot as plt

# evaluate each model in turn
# results = []
# names = []
# scores = []
seed = 7
scoring = 'accuracy'

df = pd.read_csv(os.path.join(data_dir, 'my_train.csv'))
df2 = pd.read_csv(os.path.join(data_dir, 'simple_train.csv'))

y_train = df.Survived
X_train = df.drop(['Survived'], axis = 1)

y_simple = df2.Survived
X_simple = df2.drop(['Survived'], axis = 1)

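# random_state only takes effect when shuffle=True (newer scikit-learn raises an error otherwise)
kfold = KFold(n_splits=5, shuffle=True, random_state=seed)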
model_comp = pd.DataFrame()
model_simple = pd.DataFrame()

for i, model in enumerate(models):
    # kfold = StratifiedKFold(n_splits=5, random_state=0, shuffle=False)
    scores = cross_val_score(model, X_train, y_train, 
        cv=kfold, scoring=scoring)

    model_comp[titles[i]] = scores
    msg = "complex: %s: %f (%f)" % (titles[i], scores.mean(), scores.std())
    print(msg)
    scores = cross_val_score(model, X_simple, y_simple, 
        cv=kfold, scoring=scoring)

    model_simple[titles[i]] = scores
    msg = "simple: %s: %f (%f)" % (titles[i], scores.mean(), scores.std())
    print(msg)


In [None]:
fig, ax = plt.subplots(1,2, figsize = (18, 7), sharey=True, dpi = 120)
sns.boxplot(data = model_comp, ax = ax[0])
ax[0].set_title('Complex')
sns.boxplot(data = model_simple, ax = ax[1])
ax[1].set_title('Simple')
# output_fig('model comparison')

In [None]:
fig,ax = plt.subplots(figsize=(7,5), dpi = 120)
sns.boxplot(data = model_comp)

In [None]:
fig,ax = plt.subplots(figsize=(7,5), dpi = 100)
sns.boxplot(data = model_simple)

# ClassificationError

- Precision= $\frac{TP}{TP+FP}$

- Recall = $\frac{TP}{TP+FN}$

- F1 = $2 \cdot \frac {P \times R}{P+R}$
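
The three formulas above translate directly into code. The sketch below evaluates them from raw confusion-matrix counts; the function name and the example counts are hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# hypothetical counts: 30 true positives, 10 false positives, 20 false negatives
p, r, f1 = precision_recall_f1(tp=30, fp=10, fn=20)
print(p, r, round(f1, 4))
```

For a fitted model the same numbers are available from `sklearn.metrics.precision_score`, `recall_score` and `f1_score`.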

In [None]:
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.model_selection import KFold, StratifiedKFold
from lightgbm import LGBMClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import plot_roc_curve

In [None]:
df = pd.read_csv(os.path.join(data_dir, 'my_train.csv'))
X = df.drop(['Survived'], axis = 1)
y = df.Survived

model = KNeighborsClassifier()

fig,ax = plt.subplots(figsize=(6, 6), dpi=100)
model.fit(X, y)
plot_roc_curve(model, X, y, ax=ax)


In [None]:
from sklearn.metrics import confusion_matrix, plot_confusion_matrix

fig,ax = plt.subplots(figsize=(5,4), dpi = 120)
model = GaussianNB()
model.fit(X, y)
plot_confusion_matrix(model, X, y, ax=ax, cmap = 'Blues')

In [None]:
model = GaussianNB()
model.fit(X, y)
y_pred = model.predict(X)
cm = confusion_matrix(y, y_pred).T
fig,ax = plt.subplots(figsize=(5,4), dpi = 120)
sns.heatmap(cm, cmap='Blues', annot=True, fmt='d', ax=ax)
ax.set_xlabel('True label')
ax.set_ylabel('Predicted label')

In [None]:
from yellowbrick.classifier import ClassPredictionError

model = GaussianNB()
vs = ClassPredictionError(model, 
    classes=['died', 'survived'], color = 'RdBu')
vs.fit(X, y)
vs.score(X, y)
vs.poof() 


- [Discrimination Threshold](https://www.scikit-yb.org/en/latest/api/classifier/threshold.html): This visualizer only works for binary classification.
- [Precision-Recall Analysis](https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html#:~:text=The%20precision%2Drecall%20curve%20shows,a%20low%20false%20negative%20rate.): the point (1, 1) means Precision = Recall = 1, that is, perfect prediction. The more the PR curve bulges toward the upper-right corner, the better the model performs; the flatter it is, the worse.


- $AP=\sum_{n}(R_n - R_{n-1})\cdot P_n$
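
Average precision weights each precision value by the increase in recall at that rank. The sketch below computes it by hand with NumPy; it assumes binary labels and no tied scores, and the function name and sample data are hypothetical.

```python
import numpy as np

def average_precision(y_true, scores):
    """Sum of precision values weighted by each increase in recall."""
    y_true = np.asarray(y_true)
    order = np.argsort(scores)[::-1]           # rank samples by descending score
    hits = y_true[order]
    tp = np.cumsum(hits)                       # true positives within the top n
    precision = tp / np.arange(1, len(hits) + 1)
    recall = tp / hits.sum()
    recall_prev = np.concatenate([[0.0], recall[:-1]])
    return float(np.sum((recall - recall_prev) * precision))

# hypothetical labels and classifier scores
ap = average_precision([1, 0, 1, 1, 0], [0.9, 0.8, 0.7, 0.6, 0.5])
print(round(ap, 4))
```

Only the ranks where a positive sample appears contribute, because recall does not change anywhere else.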

In [None]:
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import plot_precision_recall_curve
from sklearn.naive_bayes import GaussianNB


ax = start_plot(figsize=(4, 3), dpi = 180)
model = GaussianNB()
model.fit(X, y)
plot_precision_recall_curve(model, X, y,ax=ax)

In [None]:
from yellowbrick.classifier import DiscriminationThreshold
from yellowbrick.classifier import PrecisionRecallCurve
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
vs = PrecisionRecallCurve(model, size = (600, 400))
vs.fit(X, y)
vs.score(X, y)
vs.poof() 
# plt.show()

In [None]:
from yellowbrick.classifier import DiscriminationThreshold
from yellowbrick.classifier import PrecisionRecallCurve
from sklearn.naive_bayes import GaussianNB

model = LogisticRegression(max_iter=1000)
vs = DiscriminationThreshold(model, size = (800, 600))
vs.fit(X, y)
vs.score(X, y)
vs.poof(outpath = figure_dir + 'discri.png', dpi = 300) 


# End of File