# Pandas, NumPy, Scikit-learn Exercise 2
## Kaggle Titanic: Machine Learning from Disaster

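What this exercise covers:
 * Filling NaN or missing values with statistical methods
 * Data type conversion: dummy variables
 * Tuning model parameters: GridSearchCV
 * Making a submission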

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandasql import sqldf

In [2]:
train = pd.read_csv('data/titanic/train.csv')

In [3]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


In [4]:
train.drop(['Name', 'Cabin', 'Ticket'], axis=1, inplace=True)

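There are many ways to fill NaN values; the simple ones are:
------

For numeric columns, you can use
1. mean, the average value
2. median, the middle value
3. mode, the most frequent value

For string columns, you can use
1. mode, the most frequent string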
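
The options listed above can be tried on a tiny made-up frame. This is only a sketch: the `toy` DataFrame and its `age` and `port` columns are invented for illustration and are not the Titanic data. It relies on pandas' own `Series.mode()`, which skips NaN by default.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from train.csv.
toy = pd.DataFrame({
    'age': [22.0, np.nan, 30.0, 22.0, 40.0],
    'port': ['S', 'C', None, 'S', 'S'],
})

# Candidate fill values for the numeric column (NaN is ignored by all three).
age_mean = toy['age'].mean()        # 28.5
age_median = toy['age'].median()    # 26.0
age_mode = toy['age'].mode()[0]     # 22.0

# For a string column only the mode makes sense.
port_mode = toy['port'].mode()[0]   # 'S'

# Build a filled copy without modifying the original frame.
filled = toy.assign(
    age=toy['age'].fillna(age_median),
    port=toy['port'].fillna(port_mode),
)
print(filled)
```

Which statistic to use depends on the column: the median is less sensitive to outliers than the mean, while the mode is the only one of the three defined for strings.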


### Return values of scipy.stats.mode
mode : ndarray
    Array of modal values.
count : ndarray
    Array of counts for each mode.

In [5]:
from scipy.stats import mode
mean_age = train['Age'].mean()
train['Age'].fillna(mean_age, inplace=True)

mode_embarked = mode(train['Embarked'])[0][0]
train['Embarked'].fillna(mode_embarked, inplace=True)



In [6]:
print(train['Embarked'].unique())
print(train['Sex'].unique())

['S' 'C' 'Q']
['male' 'female']


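Embarked is a categorical field. If we encoded it as 0, 1, 2, the model would read that as an ordering S < C < Q. We do not want that, so we build a separate feature for each category.

* Embarked_S: {0, 1}
* Embarked_C: {0, 1}
* Embarked_Q: {0, 1}

pandas provides this conversion through pd.get_dummies().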

In [7]:
pd.get_dummies(train['Embarked'], prefix='Embarked').head()

   Embarked_C  Embarked_Q  Embarked_S
0           0           0           1
1           1           0           0
2           0           0           1
3           0           0           1
4           0           0           1
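
To see the two encodings side by side, here is a small sketch on a made-up Series (the `ports` values are illustrative, not rows from the dataset). The `.astype(int)` call is an addition for this sketch: recent pandas versions return boolean columns from `pd.get_dummies`, and casting gives the 0/1 values shown above.

```python
import pandas as pd

# Hypothetical sample of embarkation ports.
ports = pd.Series(['S', 'C', 'Q', 'S'], name='Embarked')

# Integer codes: compact, but they imply an order (S < C < Q) that does not exist.
ordinal = ports.map({'S': 0, 'C': 1, 'Q': 2})

# One column per category: exactly one of them is 1 in each row.
one_hot = pd.get_dummies(ports, prefix='Embarked').astype(int)

print(ordinal.tolist())
print(one_hot)
```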


In [8]:
dummy_embarked = pd.get_dummies(train['Embarked'], prefix='Embarked')

In [9]:
train = pd.concat([train, dummy_embarked], axis=1)

In [10]:
train['sex_num'] = train['Sex'].map({'male': 0, 'female': 1}).astype(int)

In [11]:
train.drop(['Embarked', 'Sex'], axis=1, inplace=True)

In [12]:
cols = train.columns.tolist()

In [13]:
print(cols)

['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked_C', 'Embarked_Q', 'Embarked_S', 'sex_num']


In [14]:
train = train[[cols[0]] + cols[2:] + [cols[1]]]

In [15]:
train_data = train.values

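Splitting the data into train_sample / test_sample
----
The real test data comes without labels, so we cannot score predictions on it. Instead we hold out part of train_data and use it to check the predictions.

train_sample: used to tune the model parameters

test_sample: used to measure how well the model performs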
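
The same hold-out idea can be sketched on a toy array. The fixed seed, the `data` array, and the names `train_part` / `test_part` are assumptions made for this illustration; seeding just makes the split repeatable between runs.

```python
import numpy as np

# Hypothetical data: 20 rows, 2 columns.
data = np.arange(40).reshape(20, 2)

# Shuffle the row indices with a seeded generator so the split is reproducible.
rng = np.random.RandomState(0)
idx = rng.permutation(data.shape[0])

# Keep 90% of the rows for tuning and 10% for the final check.
split_n = int(data.shape[0] * 0.9)
train_part = data[idx[:split_n]]
test_part = data[idx[split_n:]]

print(train_part.shape, test_part.shape)
```

Every row lands in exactly one of the two parts, so their sizes add up to the original row count.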



In [16]:
import random
idx = np.array(range(train_data.shape[0]))
random.shuffle(idx)
split_n = int(train_data.shape[0] * 0.9)

In [17]:
train_sample = train_data[idx[:split_n]]
test_sample = train_data[idx[split_n:]]

In [18]:
print(train_sample.shape[0] + test_sample.shape[0])
print(train_data.shape[0])

891
891


In [19]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

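### Tuning model parameters

Random Forest has many built-in parameters that can be tuned, and GridSearchCV gives us a way to find the best ones using the data we already have.

A Random Forest is made up of many trees:
1. The number of features considered when looking for the best split in each tree is controlled by max_features. In the scikit-learn version used here the default is 'auto', which means max_features=sqrt(n_features).
2. The depth of each tree is controlled by max_depth. The default is None, in which case nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.


We want to use grid_search to find the most reasonable max_features and max_depth; the grid below also tries two values of n_estimators.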
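
As a rough illustration of what the search does, here is a minimal sketch on synthetic data. The dataset from `make_classification`, the names `X_toy`, `y_toy` and `search`, and the small grid are all assumptions chosen to keep the run fast; they are not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical synthetic classification problem.
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

# Every combination in param_grid is scored with 3-fold cross-validation.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid={'max_features': ['sqrt', 'log2'], 'max_depth': [3, None]},
    cv=3,
)
search.fit(X_toy, y_toy)

# 2 values of max_features times 2 values of max_depth gives 4 candidates.
n_candidates = len(search.cv_results_['params'])
print(n_candidates)
print(search.best_params_)
print(search.best_score_)
```

The cost grows with the product of the grid sizes times the number of folds, which is why the grid is usually kept small.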


In [20]:
grid_search = GridSearchCV(RandomForestClassifier(), 
                           cv=5,
                           n_jobs=10,
                           param_grid={'n_estimators': [10, 300],
                                       'max_features': ['sqrt', 'log2', None],
                                       'max_depth': [5, 10, None]})

In [21]:
grid_search.fit(train_sample[:, :-1], train_sample[:, -1])

GridSearchCV(cv=5, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=10,
       param_grid={'n_estimators': [10, 300], 'max_features': ['sqrt', 'log2', None], 'max_depth': [5, 10, None]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [22]:
print('Best Score: %f' % grid_search.best_score_)
print('Best params: %s' % grid_search.best_params_)

Best Score: 0.828964
Best params: {'max_features': 'sqrt', 'n_estimators': 10, 'max_depth': 10}


In [23]:
m = RandomForestClassifier(**{'max_features': 'sqrt', 'n_estimators': 10, 'max_depth': 10})    
m = m.fit(train_sample[:, :-1], train_sample[:, -1])

In [24]:
r = m.predict(test_sample[:, :-1])

In [25]:
X = train_sample[:, :-1]
y = train_sample[:, -1]

In [32]:
from sklearn.model_selection import KFold
kf = KFold(n_splits=10)
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    m = RandomForestClassifier(**{'max_features': 'log2', 'min_samples_split': 10, 'n_estimators': 10, 'max_depth': 10})    
    m = m.fit(X_train, y_train)
    r = m.predict(X_test)
    print(np.sum(r==y_test) / float(len(y_test)))

0.777777777778
0.8
0.75
0.8
0.8
0.85
0.8375
0.8625
0.8875
0.85


In [35]:
m = RandomForestClassifier(**{'max_features': 'log2', 'min_samples_split': 10, 'n_estimators': 10, 'max_depth': 10})    
m = m.fit(X, y)
r = m.predict(test_sample[:, :-1])
print(np.sum(r==test_sample[:, -1]) / float(test_sample.shape[0]))

0.755555555556


In [29]:
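def preprocessing(df):
    # Repeat the steps applied to the training set (the drops modify df in place).
    df.drop(['Cabin', 'Ticket', 'Name'], axis=1, inplace=True)
    # Reuse mean_age, which was computed from the training data.
    df['Age'].fillna(mean_age, inplace=True)
    # Fare has no gaps in train.csv, so any gaps here are filled with this frame's own mean.
    df['Fare'].fillna(df['Fare'].mean(), inplace=True)
    # One-hot encode Embarked and map Sex to 0/1, as was done for the training set.
    df = pd.concat([df, pd.get_dummies(df['Embarked'], prefix='Embarked')], axis=1)
    df['sex_num'] = df['Sex'].map({'male': 0, 'female': 1}).astype(int)
    df.drop(['Embarked', 'Sex'], axis=1, inplace=True)
    return df.values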

In [36]:
test = pd.read_csv('data/titanic/test.csv')
test_data = preprocessing(test)
result = m.predict(test_data)
out = pd.DataFrame(data=list(zip(test_data[:, 0].astype(int), result.astype(int))), columns=['PassengerId', 'Survived'])
out.to_csv('titanic2_2_out.csv', index=False)

In [37]:
!open titanic2_2_out.csv

Submit the result! Yay!

https://www.kaggle.com/c/titanic/submissions/attach