# Kaggle 

A platform for data science competitions.

## 1. Preparation

### Install the kaggle command-line tool
`pip install kaggle`

On your Kaggle profile page, create an API token and save it as `~/.kaggle/kaggle.json`.

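### Pick a getting-started competition
Under Competitions, select the Getting Started category and choose Titanic. Join the competition and accept its rules.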

### Download the data

`kaggle competitions download -c titanic`

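With the CLI version used here, the files are downloaded automatically to `~/.kaggle/competitions/`. Newer versions of the CLI save to the current directory unless a target folder is given with `-p`.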

### Load and preview the data

In [1]:
import pandas as pd
ROOT_PATH = '~/.kaggle/competitions/titanic/'
train = pd.read_csv(ROOT_PATH + 'train.csv')
test = pd.read_csv(ROOT_PATH + 'test.csv')
train.head(2)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C


In [2]:
train.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [3]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB



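## 2. Handling missing data
* Find missing values with `df.isnull()`
* Fill numeric columns with the median
* Fill categorical columns with the most frequent value
* Fill with random values drawn around the mean, within the spread of a normal distribution, so that the imputation does not distort the data
* Fill manually from other fields, based on experience
* In principle, missing values can be predicted from the other fields by treating the incomplete column as a training target, but this is more costly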
```python
# Show the rows that contain at least one NA value
df[df.isnull().any(axis=1)]
# Fill numeric NA values with the median and categorical NA values
# with the most common category (statistics come from the training set)
df = df.fillna({
    'Age': train['Age'].median(),
    'Embarked': train['Embarked'].value_counts().index[0]
})
```
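
The random-fill strategy from the list above has no code in these notes. A minimal sketch follows, assuming uniform draws from the interval mean ± one standard deviation; the helper name `fill_random_around_mean` and the `seed` parameter are hypothetical, and other distributions would be just as defensible.

```python
import numpy as np
import pandas as pd


def fill_random_around_mean(series, seed=0):
    """Fill NA values with random draws from [mean - std, mean + std]."""
    rng = np.random.default_rng(seed)
    mean, std = series.mean(), series.std()  # both ignore NA values
    filled = series.copy()
    mask = filled.isnull()
    # One random value per missing entry; existing values are untouched
    filled[mask] = rng.uniform(mean - std, mean + std, size=mask.sum())
    return filled


# Toy data standing in for train['Age']
ages = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan, 54.0])
filled_ages = fill_random_around_mean(ages)
print(filled_ages)
```

Compared with a constant median fill, this keeps some of the spread of the column, at the cost of making the result depend on the random seed.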

In [4]:
PassengerIds = test['PassengerId']
# Cabin has too many missing values; PassengerId does not affect survival.
train.drop(['Cabin', 'PassengerId', 'Ticket', 'Name'], axis=1, inplace=True)
test.drop(['Cabin', 'PassengerId', 'Ticket', 'Name'], axis=1, inplace=True)

In [5]:

train[train.isnull().any(axis=1)].head(2)

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
5,0,3,male,,0,0,8.4583,Q
17,1,2,male,,0,0,13.0,S


In [6]:
train['Age'].median()

28.0

In [7]:
train.fillna({'Age': train['Age'].median()}, inplace=True)
test.fillna({'Age': train['Age'].median()}, inplace=True)
train[train.isnull().any(axis=1)].head(2)

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
61,1,1,female,38.0,0,0,80.0,
829,1,1,female,62.0,0,0,80.0,


In [8]:
train['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [9]:
train.fillna({'Embarked': train['Embarked'].value_counts().index[0]}, inplace=True)
test.fillna({'Embarked': train['Embarked'].value_counts().index[0]}, inplace=True)
train[train.isnull().any(axis=1)].head(2)

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked


In [10]:
test[test.isnull().any(axis=1)].head(2)

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
152,3,male,60.5,0,0,,S


In [11]:
test = test.fillna({'Fare': test['Fare'].median()})
test[test.isnull().any(axis=1)].head(2)

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked


## 3. Handling categorical data
1. `df.dtypes`: preview the data type of every column
2. `df.select_dtypes(include=['object']).head()`: preview the contents of the object columns
3. `df = pd.get_dummies(df, columns=['Sex', 'Title'])`: one-hot encode categorical columns
4. [replace](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.replace.html): `df.replace({'A': {'one': 1, 'two': 2}, 'B': {'four': 4}}, inplace=True)` maps categories to numbers
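
The cells below use `replace`; for comparison, here is a small sketch of the `get_dummies` route on toy data (the three rows are made up for illustration and are not taken from the competition files).

```python
import pandas as pd

# Toy frame with two categorical columns and one numeric column
df = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'Q'],
    'Fare': [7.25, 71.28, 8.05],
})

# Each category becomes its own indicator column, e.g. Sex_male, Embarked_C
encoded = pd.get_dummies(df, columns=['Sex', 'Embarked'])
print(encoded.columns.tolist())
```

One-hot encoding avoids implying an order between categories, which an integer mapping such as `{'C': 0, 'Q': 1, 'S': 2}` does.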

In [12]:
train.dtypes

Survived      int64
Pclass        int64
Sex          object
Age         float64
SibSp         int64
Parch         int64
Fare        float64
Embarked     object
dtype: object

In [13]:
train.select_dtypes(include=['object']).head()

Unnamed: 0,Sex,Embarked
0,male,S
1,female,C
2,female,S
3,female,S
4,male,S


In [14]:
train[['Sex', 'Survived']].groupby(['Sex'], as_index=False).mean()

Unnamed: 0,Sex,Survived
0,female,0.742038
1,male,0.188908


In [15]:
train[['Embarked', 'Sex', 'Survived']].groupby(['Embarked', 'Sex'], as_index=False).mean()

Unnamed: 0,Embarked,Sex,Survived
0,C,female,0.876712
1,C,male,0.305263
2,Q,female,0.75
3,Q,male,0.073171
4,S,female,0.692683
5,S,male,0.174603


In [16]:
replace_map = {'Sex': {'female': 0, 'male': 1}, 'Embarked': {'C':0, 'Q':1, 'S': 2}}
train.replace(replace_map, inplace=True)
test.replace(replace_map, inplace=True)
train.head(2)

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,1,22.0,1,0,7.25,2
1,1,1,0,38.0,1,0,71.2833,0


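## 4. Handling numeric data

Continuous values can be grouped into bins:

* `pd.cut(df['Age'], 10)` splits the values into 10 equal-width bins
* `pd.qcut(df['Age'], 4)` splits the values into 4 quantile-based bins of roughly equal size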
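
The difference between `pd.cut` and `pd.qcut` shows up clearly on skewed data. The sketch below uses made-up fare values with one outlier and `labels=False` so that each call returns plain bin indices.

```python
import pandas as pd

# Toy, skewed values: one very large fare
fares = pd.Series([0, 5, 7, 8, 10, 15, 30, 500])

# Equal-width bins: the outlier stretches the range, so most rows share bin 0
equal_width = pd.cut(fares, 4, labels=False)
# Quantile bins: each bin receives the same number of rows
equal_freq = pd.qcut(fares, 4, labels=False)

print(equal_width.tolist())  # [0, 0, 0, 0, 0, 0, 0, 3]
print(equal_freq.tolist())   # [0, 0, 1, 1, 2, 2, 3, 3]
```

This is probably why the cells below use `pd.qcut` for `Fare`, whose maximum (512.3292) is far above its median.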

In [17]:
train['CateAge'] = pd.cut(train['Age'], 6)
train[['CateAge', 'Survived']].groupby(['CateAge'], as_index=False).mean()
#train[['CateAge', 'Survived']].groupby(['CateAge'], as_index=False).count()


Unnamed: 0,CateAge,Survived
0,"(0.34, 13.683]",0.591549
1,"(13.683, 26.947]",0.354839
2,"(26.947, 40.21]",0.372038
3,"(40.21, 53.473]",0.39
4,"(53.473, 66.737]",0.348837
5,"(66.737, 80.0]",0.142857


In [18]:
train['CateFare'] = pd.qcut(train['Fare'],4)
train[['CateFare', 'Survived']].groupby(['CateFare'], as_index=False).mean()


Unnamed: 0,CateFare,Survived
0,"(-0.001, 7.91]",0.197309
1,"(7.91, 14.454]",0.303571
2,"(14.454, 31.0]",0.454955
3,"(31.0, 512.329]",0.581081


### Data cleanup
Drop the temporarily added columns, then encode `Fare` and `Age` as bin codes.

In [19]:
train.drop(['CateAge', 'CateFare'], axis=1, inplace=True)
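def clean_df(df):
    # Encode Fare using the quartile edges found with pd.qcut above.
    # The order matters: every code (0-3) is below 7.91, so the later
    # comparisons never touch rows that were already encoded.
    df.loc[df['Fare'] <= 7.91, 'Fare'] = 0
    df.loc[df['Fare'] > 31.0, 'Fare'] = 3
    df.loc[df['Fare'] > 14.454, 'Fare'] = 2
    df.loc[df['Fare'] > 7.91, 'Fare'] = 1
    df['Fare'] = df['Fare'].astype(int)
    # Encode Age with thresholds chosen by experience; the codes (0-4)
    # are all below 8, so the same ordering trick applies.
    df.loc[df['Age'] < 8, 'Age'] = 0    # baby
    df.loc[df['Age'] > 50, 'Age'] = 4   # old
    df.loc[df['Age'] > 30, 'Age'] = 3   # middle
    df.loc[df['Age'] > 18, 'Age'] = 2   # young
    df.loc[df['Age'] >= 8, 'Age'] = 1   # child
    df['Age'] = df['Age'].astype(int)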
clean_df(train)
clean_df(test)
train.head(2)

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,1,2,1,0,0,2
1,1,1,0,3,1,0,3,0


In [20]:
test.head(2)

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,3,1,3,0,0,0,1
1,3,0,3,1,0,0,2
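
The chained `df.loc` assignments in `clean_df` depend on their order. As an alternative sketch (not the approach used in this notebook), the same codes can be produced without overwriting values in place, using `pd.cut` for `Fare` and `np.select` for `Age`. The helper name `bin_features` and the sample rows are hypothetical.

```python
import numpy as np
import pandas as pd

# Same edges as in clean_df; intervals are closed on the right
FARE_EDGES = [-np.inf, 7.91, 14.454, 31.0, np.inf]


def bin_features(df):
    """Return a copy of df with Fare and Age replaced by bin codes."""
    out = df.copy()
    out['Fare'] = pd.cut(df['Fare'], bins=FARE_EDGES, labels=False).astype(int)
    age = df['Age'].to_numpy()
    # The first matching condition wins: <8 baby, <=18 child,
    # <=30 young, <=50 middle, otherwise old
    out['Age'] = np.select(
        [age < 8, age <= 18, age <= 30, age <= 50],
        [0, 1, 2, 3],
        default=4,
    )
    return out


sample = pd.DataFrame({
    'Age': [0.42, 8.0, 22.0, 38.0, 62.0],
    'Fare': [7.25, 8.05, 14.454, 31.0, 71.2833],
})
binned = bin_features(sample)
print(binned)
```

Note that this sketch assumes missing values were already filled, as in the cells above; a NaN age would otherwise fall through to the default code 4.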


## Model selection

In [21]:
# Convert the target and the features to NumPy arrays
y = train['Survived'].values
X = train.drop(['Survived'], axis=1).values
print(type(X), type(y))

<class 'numpy.ndarray'> <class 'numpy.ndarray'>


In [22]:
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import ensemble
from xgboost import XGBClassifier

classifiers = [
    SVC(probability=True),
    KNeighborsClassifier(3),
    XGBClassifier(),
    DecisionTreeClassifier(),
    ensemble.GradientBoostingClassifier(),
    ensemble.RandomForestClassifier(),
    ensemble.AdaBoostClassifier()
]

In [23]:
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings('ignore')
N_SPLITS = 10
acc_dict = {}
sss = StratifiedShuffleSplit(n_splits=N_SPLITS, test_size=0.2)
for train_index, test_index in sss.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    for clf in classifiers:
        # Name of the classifier
        name = clf.__class__.__name__
        # Train on the training part of this split
        clf.fit(X_train, y_train)
        # Predict the held-out part of this split
        predictions = clf.predict(X_test)
        # Accuracy on the held-out part
        acc = accuracy_score(y_test, predictions)
        # Accumulate the mean accuracy over all splits
        acc_dict[name] = acc_dict.get(name, 0.0) + acc / N_SPLITS
# Print the mean accuracy of each classifier
for name in acc_dict:
    print(name, acc_dict[name])

SVC 0.8312849162011174
KNeighborsClassifier 0.8100558659217877
XGBClassifier 0.8268156424581006
DecisionTreeClassifier 0.8050279329608938
GradientBoostingClassifier 0.8184357541899443
RandomForestClassifier 0.8111731843575418
AdaBoostClassifier 0.8117318435754189
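
The manual loop above can also be written with scikit-learn's `cross_val_score`, which runs the split, fit and scoring steps for each model. The sketch below runs on synthetic data from `make_classification` rather than the Titanic features, so its scores are not comparable with the numbers above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the X, y arrays built earlier
X_demo, y_demo = make_classification(n_samples=200, n_features=7, random_state=0)

# Fixing random_state makes every model see the same 10 splits
cv = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
models = [KNeighborsClassifier(3), DecisionTreeClassifier(random_state=0)]

mean_accuracy = {}
for model in models:
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring='accuracy')
    mean_accuracy[type(model).__name__] = scores.mean()

for name, acc in mean_accuracy.items():
    print(name, round(acc, 3))
```

Setting `random_state` on the splitter also makes the comparison reproducible between runs, which the loop above does not guarantee.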


## Export the results and submit

In [24]:
#clf = ensemble.RandomForestClassifier() 
#clf = XGBClassifier()
#clf = SVC()
clf = DecisionTreeClassifier()
clf.fit(X, y)
results = clf.predict(test.values)
submission = pd.DataFrame({
    'PassengerId': PassengerIds,
    'Survived': results
})
# Export to CSV
submission.to_csv(ROOT_PATH + 'submission.csv', index=False)
print('Exported')

Exported




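## Model selection (reference)
This article explains feature engineering and model selection:
https://www.kaggle.com/sinakhorami/titanic-best-working-classifier

Classifiers ordered by accuracy, highest first: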
* SVC
* KNeighborsClassifier
* GradientBoostingClassifier
* QuadraticDiscriminantAnalysis
* DecisionTreeClassifier
* AdaBoostClassifier
* RandomForestClassifier
* LogisticRegression
* LinearDiscriminantAnalysis
* GaussianNB

## Visualization
https://www.kaggle.com/arthurtok/introduction-to-ensembling-stacking-in-python

seaborn is used for data visualization.

### Ensemble models
* RandomForestClassifier
* AdaBoostClassifier
* GradientBoostingClassifier
* ExtraTreesClassifier
* xgboost
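
One simple way to combine several of the models listed above is majority voting. The sketch below uses scikit-learn's `VotingClassifier` on synthetic data; it is an illustration only, since the linked article uses stacking, which is a different and more involved technique.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)

# Synthetic stand-in data, split into a training and a held-out part
X_demo, y_demo = make_classification(n_samples=200, n_features=7, random_state=0)
X_fit, X_hold = X_demo[:150], X_demo[150:]
y_fit, y_hold = y_demo[:150], y_demo[150:]

# Each model predicts a class; the most frequent class wins ('hard' voting)
voter = VotingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=50, random_state=0)),
        ('gb', GradientBoostingClassifier(random_state=0)),
        ('ada', AdaBoostClassifier(random_state=0)),
    ],
    voting='hard',
)
voter.fit(X_fit, y_fit)
hold_predictions = voter.predict(X_hold)
print(voter.score(X_hold, y_hold))
```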

### Feature importances
* After `rf.fit(x_train, y_train)`, the importances are available in the `rf.feature_importances_` attribute
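
A short sketch of reading feature importances from a fitted random forest. The data are synthetic and the generic names in `feature_names` are placeholders, so the ranking says nothing about the Titanic features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder feature names for the synthetic columns
feature_names = ['feature_%d' % i for i in range(7)]
X_demo, y_demo = make_classification(
    n_samples=200, n_features=len(feature_names), random_state=0
)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_demo, y_demo)

# feature_importances_ is only populated after fit(); the values sum to 1
importances = pd.Series(rf.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```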




## DL vs SVM vs RF
https://www.kdnuggets.com/2016/04/deep-learning-vs-svm-random-forest.html

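#### Random forests vs SVMs
* A random forest is a worry-free option. There are no real hyperparameters to tune, and more trees are usually better. It is robust. Its cost grows with the number of samples and the number of trees.
* SVM: there are hyperparameters to tune, namely the choice of a suitable kernel, the regularization penalties and the slack variable. Multi-class problems require training several SVMs. Its cost grows with the number of samples and, for multi-class problems, linearly with the number of classes.


#### Deep Learning vs SVMs
* SVMs do well on small samples and have low complexity.
* DL needs large samples and has high complexity. Setting up a DL network takes more experience. DL works very well on complex problems such as image classification, natural language processing and speech recognition, and requires less attention to feature engineering.

#### Advice:
* Define a performance metric for the model
* Decide what performance, hardware and time are required
* Start with a simple model
* If the requirements are not met, try a more complex model.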

## What about xgboost?


In [25]:
# Create NumPy arrays of the train, test and target (Survived) data to feed into our models
#y_train = train['Survived'].values
#train = train.drop(['Survived'], axis=1)
#x_train = train.values # Creates an array of the train data
#x_test = test.values # Creates an array of the test data