# Pandas, NumPy, Scikit-learn Exercise 1
## Kaggle Titanic: Machine Learning from Disaster

What you will learn:
 * Processing raw data with pandas
 * Training a random forest classifier to make predictions

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandasql import sqldf

In [2]:
train = pd.read_csv('data/titanic/train.csv')

### Processing the data
* Handling missing values

In [3]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


dataframe.info() shows each column's data type and its number of non-null entries.

In [4]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


From the output above, most columns have 891 entries; only Age, Cabin, and Embarked have fewer than 891. This means those columns contain NaN (Not a Number) or otherwise missing values.
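
The missing entries can also be counted directly instead of being read off the info() output. The following is a minimal sketch on a small made-up frame (the values are illustrative, not the real Titanic rows), assuming only standard pandas calls:

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for the Titanic data (values are illustrative only)
df = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [None, 'C85', None, None],
    'Fare': [7.25, 71.2833, 7.925, 8.05],
})

# isnull() marks each missing cell as True; sum() counts them per column
missing_per_column = df.isnull().sum()
print(missing_per_column)
```

Applied to the real data, the same call would be `train.isnull().sum()`.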

Name, Ticket, and Cabin are unlikely to help much with prediction, so we can simply drop them.

In [5]:
# axis=1 drops columns
# axis=0 would drop rows instead
train = train.drop(['Name', 'Ticket', 'Cabin'], axis=1)

In [6]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Fare           891 non-null float64
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(2)
memory usage: 62.7+ KB


In [7]:
# dropna defaults to how='any': a row is dropped if it contains any NA value
train.dropna(how='any', inplace=True)
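
To make the effect of the how parameter concrete, here is a small sketch on made-up data comparing how='any' with how='all'. The frame and its values are assumptions for illustration only:

```python
import numpy as np
import pandas as pd

# Made-up frame: row 0 is complete, row 1 misses Age, row 2 misses everything
df = pd.DataFrame({
    'Age': [22.0, np.nan, np.nan],
    'Fare': [7.25, 71.2833, np.nan],
})

dropped_any = df.dropna(how='any')  # keeps only rows with no missing values
dropped_all = df.dropna(how='all')  # drops only rows where every value is missing

print(len(dropped_any), len(dropped_all))
```

With how='any', a single missing Age is enough to discard a passenger, which is why the row count falls from 891 to 712 in this notebook.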

Sex and Embarked are of object type, so we need to convert these two columns to numeric columns.

In [8]:
train['Sex'].unique()

array(['male', 'female'], dtype=object)

In [9]:
train['Embarked'].unique()

array(['S', 'C', 'Q'], dtype=object)

In [10]:
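# Create a new column sex_num, mapping male to 0 and female to 1
train['sex_num'] = train['Sex'].map({'male': 0, 'female': 1}).astype(int)
# Create embarked_num the same way; note that 'C' and 'Q' both map to 1 here
train['embarked_num'] = train['Embarked'].map({'S': 0, 'C': 1, 'Q': 1}).astype(int)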
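
Two details of Series.map are worth seeing in isolation: each key can be given its own code, and any value missing from the dict becomes NaN. The sketch below uses a hypothetical mapping `port_codes` that gives 'Q' its own code 2, which differs from the mapping used in this notebook, plus a made-up value 'X':

```python
import pandas as pd

# 'X' is a made-up code that is not in the mapping
ports = pd.Series(['S', 'C', 'Q', 'X'])

# Hypothetical mapping that gives each port its own integer code
port_codes = {'S': 0, 'C': 1, 'Q': 2}
encoded = ports.map(port_codes)

# Values missing from the dict become NaN, so check before calling astype(int)
print(encoded.isnull().sum())
```

In the notebook this check is not needed because rows with a missing Embarked were already removed by dropna, but it may be useful if the mapping is applied earlier.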

In [11]:
# We now only need sex_num and embarked_num, so Sex and Embarked can be dropped
train = train.drop(['Sex', 'Embarked'], axis=1)

In [12]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 712 entries, 0 to 890
Data columns (total 9 columns):
PassengerId     712 non-null int64
Survived        712 non-null int64
Pclass          712 non-null int64
Age             712 non-null float64
SibSp           712 non-null int64
Parch           712 non-null int64
Fare            712 non-null float64
sex_num         712 non-null int64
embarked_num    712 non-null int64
dtypes: float64(2), int64(7)
memory usage: 55.6 KB


In [13]:
cols = train.columns.tolist()
print(cols)

['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'sex_num', 'embarked_num']


Later we will convert train into a 2D array. Survived is the column we need to predict, so we first move it to the last position.

In [14]:
cols[2:]

['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'sex_num', 'embarked_num']

In [15]:
[cols[0]] + cols[2:] + [cols[1]]

['PassengerId',
 'Pclass',
 'Age',
 'SibSp',
 'Parch',
 'Fare',
 'sex_num',
 'embarked_num',
 'Survived']

In [16]:
train = train[[cols[0]] + cols[2:] + [cols[1]]]
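
The index-based reordering above depends on Survived sitting at position 1. A name-based variant, sketched here on a tiny made-up frame, might be less fragile if the column order ever changes; the variable names `target` and `ordered` are illustrative:

```python
import pandas as pd

# Tiny made-up frame with the target column in the middle
df = pd.DataFrame({'PassengerId': [1, 2], 'Survived': [0, 1], 'Pclass': [3, 1]})

# Keep every column except the target, then append the target at the end
target = 'Survived'
ordered = [c for c in df.columns if c != target] + [target]
df = df[ordered]

print(df.columns.tolist())
```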

In [17]:
train.head()

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare,sex_num,embarked_num,Survived
0,1,3,22.0,1,0,7.25,0,0,0
1,2,1,38.0,1,0,71.2833,1,1,1
2,3,3,26.0,0,0,7.925,1,0,1
3,4,1,35.0,1,0,53.1,1,0,1
4,5,3,35.0,0,0,8.05,0,0,0


In [18]:
train_data = train.values

We choose a model that is popular and generally performs well; its details will be covered later.

n_estimators = 100 means the forest uses 100 trees.

In [19]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators = 100)
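
To see what n_estimators controls, the sketch below fits a forest on a tiny synthetic dataset and counts the fitted trees. The data is made up, and random_state=0 is an added assumption (not used in this notebook) so that repeated runs build the same forest:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Tiny synthetic dataset, made up for illustration
rng = np.random.RandomState(0)
X = rng.rand(40, 3)
y = (X[:, 0] > 0.5).astype(int)

# random_state is an added assumption so that repeated runs give the same forest
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# estimators_ holds the fitted decision trees, one per n_estimators
print(len(forest.estimators_))
```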

Split train into two parts: train_sample and test_sample.

90% of train_data is used for training and 10% for testing.


In [21]:
train_data.shape

(712, 9)

In [23]:
train_data

array([[   1.,    3.,   22., ...,    0.,    0.,    0.],
       [   2.,    1.,   38., ...,    1.,    1.,    1.],
       [   3.,    3.,   26., ...,    1.,    0.,    1.],
       ..., 
       [ 888.,    1.,   19., ...,    1.,    0.,    1.],
       [ 890.,    1.,   26., ...,    0.,    1.,    1.],
       [ 891.,    3.,   32., ...,    0.,    1.,    0.]])

In [47]:
split = int(train_data.shape[0] * 0.9)

In [45]:
train_sample = train_data[:split, :]
test_sample = train_data[split:, :]
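
The slicing above splits the rows in their original order, so the test set is simply the last 10% of passengers. A shuffled split could be sketched with scikit-learn's train_test_split; the stand-in array `data` and random_state=0 are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in array with the same shape as train_data (712 rows, 9 columns)
data = np.arange(712 * 9, dtype=float).reshape(712, 9)

# test_size=0.1 holds out 10% of the rows; shuffling is on by default.
# random_state=0 is an arbitrary choice to make the split repeatable.
train_part, test_part = train_test_split(data, test_size=0.1, random_state=0)

print(train_part.shape, test_part.shape)
```

Whether shuffling changes the accuracy reported here has not been checked; the sketch only shows the mechanics.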

### 訓練模型

model.fit(X, y)
* X: the feature columns (every column except Survived)
* y: the Survived column
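
Because Survived is the last column of the array, X and y can be separated with plain NumPy slicing. A minimal sketch on two made-up rows:

```python
import numpy as np

# Two made-up rows laid out as [feature_1, feature_2, label]
rows = np.array([
    [3.0, 22.0, 0.0],
    [1.0, 38.0, 1.0],
])

X = rows[:, 0:-1]  # every column except the last
y = rows[:, -1]    # only the last column

print(X.shape, y.shape)
```

Note that with this slicing every non-label column ends up in X, which in this notebook includes PassengerId.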

In [54]:
train_sample[0:2]

array([[  1.    ,   3.    ,  22.    ,   1.    ,   0.    ,   7.25  ,
          0.    ,   0.    ,   0.    ],
       [  2.    ,   1.    ,  38.    ,   1.    ,   0.    ,  71.2833,
          1.    ,   1.    ,   1.    ]])

In [60]:
model = model.fit(train_sample[:, 0:-1], train_sample[:, -1])

In [61]:
result = model.predict(test_sample[:, 0:-1])

### Accuracy

Accuracy is measured by using the trained model to predict data it has not seen (unseen data).

In [68]:
# Count how many predictions match the true Survived labels
acc = 0
for i in range(len(result)):
    if result[i] == test_sample[:, -1][i]:
        acc += 1
print('Accuracy: %.2f %%' % (acc / float(len(result)) * 100.0))

Accuracy: 83.33 %


Our current model reaches an accuracy of 83.33% on the held-out 10% of the data.
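
The counting loop can also be written as a vectorized comparison, or delegated to scikit-learn's accuracy_score. The sketch below uses made-up predictions and labels (not the notebook's results) to show that both give the same number:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up predictions and true labels, for illustration only
predicted = np.array([1, 0, 1, 1, 0, 0])
actual = np.array([1, 0, 0, 1, 0, 1])

# Fraction of positions where prediction and label agree
manual_accuracy = np.mean(predicted == actual)
library_accuracy = accuracy_score(actual, predicted)

print('Accuracy: %.2f %%' % (manual_accuracy * 100.0))
```

For the notebook's own variables, the equivalent would be `np.mean(result == test_sample[:, -1])`.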