# Kaggle 

A platform for data science competitions.

## Installing the kaggle command-line tool
`pip install kaggle`

Create an API token on your Kaggle profile page and save it as `~/.kaggle/kaggle.json`.
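
Before the first download it can help to confirm that the token file is where the CLI expects it. The sketch below is a hypothetical helper (`check_kaggle_token` is not part of the kaggle package). It assumes the token is a JSON object with `username` and `key` fields, and on POSIX systems it also flags files that other users can access.

```python
import json
import os
import stat
from pathlib import Path


def check_kaggle_token(path="~/.kaggle/kaggle.json"):
    """Return a list of problems found with the token file (empty if none).

    Hypothetical helper for illustration; not part of the kaggle package.
    """
    token_path = Path(path).expanduser()
    if not token_path.is_file():
        return [f"{token_path} does not exist"]

    try:
        token = json.loads(token_path.read_text())
    except json.JSONDecodeError:
        return [f"{token_path} is not valid JSON"]
    if not isinstance(token, dict):
        return [f"{token_path} does not contain a JSON object"]

    problems = []
    # Assumed token layout: {"username": "...", "key": "..."}
    for field in ("username", "key"):
        if not token.get(field):
            problems.append(f"missing field: {field}")

    # On POSIX systems, flag tokens that group or other users can access.
    if os.name == "posix":
        mode = stat.S_IMODE(token_path.stat().st_mode)
        if mode & 0o077:
            problems.append("file is accessible by other users; consider chmod 600")
    return problems
```

Calling `check_kaggle_token()` with no arguments inspects the default location and returns an empty list when nothing looks wrong.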

## Choosing a starter competition
Under Competitions, filter by Getting Started and choose Titanic. Join the competition and accept its rules.

## Downloading the data

`kaggle competitions download -c titanic`

With the CLI version used here, the files are downloaded automatically to `~/.kaggle/competitions/`.

## Loading the data
Load the CSV files and preview the first rows.

In [1]:
import pandas as pd

train = pd.read_csv('~/.kaggle/competitions/titanic/train.csv')
test = pd.read_csv('~/.kaggle/competitions/titanic/test.csv')
train.head(2)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C



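## Handling missing data
* Find missing values with `df.isnull()`
* Fill numeric columns with the median
* Fill categorical columns with the most frequent value
* Fill missing values by hand from other fields, using domain experience
* In principle, missing values can be predicted from the other fields, which amounts to training a model with the missing field as the target; this is more expensive

```python
# find rows that contain NA values
df[df.isnull().any(axis=1)]
# fill numeric NA values with the median
# fill categorical NA values with the most common category
df = df.fillna({
    'Age': train['Age'].median(),
    # idxmax() gives the most frequent label, not its count
    'Embarked': train['Embarked'].value_counts().idxmax()
})
```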
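
Filling gaps from other fields, as in the fourth bullet, can be approximated with a group-wise median. The sketch below uses a small made-up frame rather than the Titanic data, and assumes that passengers sharing `Pclass` and `Sex` have similar ages. Both the grouping columns and the fallback to the overall median are illustrative choices.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; the values are made up.
df = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Sex': ['female', 'female', 'female', 'male', 'male', 'male'],
    'Age': [38.0, 40.0, np.nan, 20.0, np.nan, 24.0],
})

# Median age within each (Pclass, Sex) group, aligned to the original rows.
group_median = df.groupby(['Pclass', 'Sex'])['Age'].transform('median')

# Use the group median where Age is missing; fall back to the overall
# median in case an entire group has no known ages.
overall_median = df['Age'].median()
df['Age'] = df['Age'].fillna(group_median).fillna(overall_median)
print(df)
```

Compared with a single global median, this keeps the imputed ages closer to those of similar passengers, at the cost of relying on the chosen grouping.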

In [2]:
# Cabin is mostly empty, and PassengerId has no effect on survival.
train = train.drop(['Cabin', 'PassengerId'], axis=1)
test = test.drop(['Cabin', 'PassengerId'], axis=1)

In [3]:

train[train.isnull().any(axis=1)].head(2)

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
5,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,Q
17,1,2,"Williams, Mr. Charles Eugene",male,,0,0,244373,13.0,S


In [4]:
train['Age'].median()

28.0

In [5]:
train = train.fillna({'Age': train['Age'].median()})
test = test.fillna({'Age': train['Age'].median()})
train[train.isnull().any(axis=1)].head(2)

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
61,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,
829,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,


In [6]:
train['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [7]:
# idxmax() gives the most frequent label ('S'), not its count
train = train.fillna({'Embarked': train['Embarked'].value_counts().idxmax()})
test = test.fillna({'Embarked': train['Embarked'].value_counts().idxmax()})
train[train.isnull().any(axis=1)].head(2)

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked


In [8]:
test[test.isnull().any(axis=1)].head(2)

Unnamed: 0,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
152,3,"Storey, Mr. Thomas",male,60.5,0,0,3701,,S


In [9]:
test = test.fillna({'Fare': test['Fare'].median()})
test[test.isnull().any(axis=1)].head(2)

Unnamed: 0,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked


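## Handling categorical data
1. `df.dtypes` previews the data type of every column
2. `df.select_dtypes(include=['object']).head()` previews the contents of the object columns
3. `df = pd.get_dummies(df, columns=['Sex', 'Title'])` one-hot encodes the listed columns (`Title` is not a raw column; it would have to be derived first, e.g. from `Name`)
4. [replace](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.replace.html) maps values explicitly: `df = df.replace({'A': {'one': 1, 'two': 2}, 'B': {'four': 4}})`. Do not combine the assignment with `inplace=True`, which makes `replace` return `None`.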
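
The two encodings can be tried on a toy frame as sketched below. The data and the integer codes are made up for illustration. `Series.map` is used here in place of `DataFrame.replace` for the explicit mapping; unlike `replace`, it turns unmapped values into NaN.

```python
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'Q'],
    'Fare': [7.25, 71.28, 8.05],
})

# Map each Embarked label to an integer code (the codes are arbitrary).
df['Embarked'] = df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})

# One-hot encode Sex: one indicator column per category.
df = pd.get_dummies(df, columns=['Sex'])
print(df.columns.tolist())
```

One-hot encoding avoids implying an order between categories, whereas integer codes are more compact but introduce an ordering that some models will treat as meaningful.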

In [10]:
train.dtypes

Survived      int64
Pclass        int64
Name         object
Sex          object
Age         float64
SibSp         int64
Parch         int64
Ticket       object
Fare        float64
Embarked     object
dtype: object

In [11]:
train.select_dtypes(include=['object']).head()


Unnamed: 0,Name,Sex,Ticket,Embarked
0,"Braund, Mr. Owen Harris",male,A/5 21171,S
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,PC 17599,C
2,"Heikkinen, Miss. Laina",female,STON/O2. 3101282,S
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,113803,S
4,"Allen, Mr. William Henry",male,373450,S




## Model selection
This kernel explains feature engineering and model selection:
https://www.kaggle.com/sinakhorami/titanic-best-working-classifier

Classifiers by accuracy, from highest to lowest:
* SVC
* KNeighborsClassifier
* GradientBoostingClassifier
* QuadraticDiscriminantAnalysis
* DecisionTreeClassifier
* AdaBoostClassifier
* RandomForestClassifier
* LogisticRegression
* LinearDiscriminantAnalysis
* GaussianNB
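
A ranking like this depends on the features and the evaluation split, so it is worth recomputing on one's own data. The sketch below is one way to do that with cross-validation. The synthetic data from `make_classification` is a stand-in for the engineered Titanic features, and only a few of the listed classifiers are included to keep it short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the engineered Titanic features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

classifiers = {
    'SVC': SVC(),
    'KNeighborsClassifier': KNeighborsClassifier(),
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'GaussianNB': GaussianNB(),
}

# Mean 5-fold cross-validated accuracy per classifier.
scores = {
    name: cross_val_score(clf, X, y, cv=5, scoring='accuracy').mean()
    for name, clf in classifiers.items()
}

# Print from highest to lowest accuracy.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f'{name}: {score:.3f}')
```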

## Visualization
https://www.kaggle.com/arthurtok/introduction-to-ensembling-stacking-in-python

Data visualization with seaborn.
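
A minimal seaborn sketch, assuming a correlation heatmap is the plot of interest. The frame is toy data, not the Titanic set, and the `Agg` backend is selected only so that the code runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen; not needed inside a notebook
import pandas as pd
import seaborn as sns

# Toy data for illustration only.
df = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 1, 0],
    'Pclass': [3, 1, 2, 3, 1, 3],
    'Fare': [7.25, 71.28, 13.0, 8.05, 80.0, 8.46],
})

# Heatmap of pairwise Pearson correlations between the numeric columns.
ax = sns.heatmap(df.corr(), annot=True, vmin=-1, vmax=1, cmap='coolwarm')
ax.set_title('Feature correlations')
```

In a script, `ax.figure.savefig(...)` writes the plot to a file; in a notebook the figure is displayed inline.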

### Ensemble models
* RandomForestClassifier
* AdaBoostClassifier
* GradientBoostingClassifier
* ExtraTreesClassifier
* xgboost
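
One way to combine these models is stacking, sketched below with scikit-learn's `StackingClassifier`. This is an illustration under assumptions, not the kernel's code. The data is synthetic, and `LogisticRegression` is swapped in as the second-level model instead of xgboost to avoid an extra dependency.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered Titanic features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# First-level models; a small n_estimators keeps the sketch fast.
base_models = [
    ('rf', RandomForestClassifier(n_estimators=50, random_state=0)),
    ('ada', AdaBoostClassifier(n_estimators=50, random_state=0)),
    ('gb', GradientBoostingClassifier(n_estimators=50, random_state=0)),
    ('et', ExtraTreesClassifier(n_estimators=50, random_state=0)),
]

# The second-level model is trained on out-of-fold predictions of the
# first-level models (cv=5), so it does not learn from predictions made
# on rows the base models were fitted on.
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_train, y_train)
print(f'held-out accuracy: {stack.score(X_test, y_test):.3f}')
```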

### Feature importances
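* `rf.feature_importances(x_train, y_train)`, as called in the kernel above, where `rf` appears to be a wrapper object; on a plain scikit-learn estimator the importances are the fitted attribute `feature_importances_`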
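
With plain scikit-learn, importances can be read as sketched below. The data is synthetic and the feature names are hypothetical placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in; the feature names are hypothetical.
feature_names = ['f0', 'f1', 'f2', 'f3', 'f4']
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=0
)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)

# feature_importances_ is an attribute of the fitted model; the values
# are normalized so that they sum to 1.
ranked = sorted(
    zip(feature_names, rf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f'{name}: {importance:.3f}')
```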




## DL vs SVM vs RF
https://www.kdnuggets.com/2016/04/deep-learning-vs-svm-random-forest.html

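#### Random forests vs SVMs
* Random forests are a low-effort choice. There are few hyperparameters that really need tuning, and more trees usually help. They are robust. Complexity grows with the number of samples and the number of trees.
* SVMs have hyperparameters to get right: a suitable kernel, the regularization penalties, and the slack variable. Multiclass problems require training multiple SVMs. Complexity grows with the number of samples and linearly with the number of classes.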


#### Deep Learning vs SVMs
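* SVMs do well on small samples and have low complexity.
* DL needs large samples and has high complexity. Setting up a DL network takes more experience. DL works very well on complex problems such as image classification, natural language processing, and speech recognition, and it requires less attention to feature engineering.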

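#### Recommendations:
* Define a performance metric for the model
* Decide what performance, hardware, and time are required
* Start with a simple model
* If it does not meet the requirements, try a more complex model.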
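
These recommendations can be sketched as a loop that stops at the first model meeting the target. The accuracy threshold, the candidate list, and its simple-to-complex ordering are all assumptions made for illustration, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

TARGET_ACCURACY = 0.85  # hypothetical requirement; set per project

# Candidates in an assumed order from simpler to more complex.
candidates = [
    ('LogisticRegression', LogisticRegression(max_iter=1000)),
    ('RandomForestClassifier', RandomForestClassifier(n_estimators=100, random_state=0)),
    ('SVC', SVC()),
]

chosen = None
for name, model in candidates:
    # The performance metric here is mean 5-fold cross-validated accuracy.
    score = cross_val_score(model, X, y, cv=5, scoring='accuracy').mean()
    print(f'{name}: {score:.3f}')
    if score >= TARGET_ACCURACY:
        chosen = name
        break  # stop at the first model that meets the target

print(f'chosen: {chosen}')
```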

## What about xgboost?


In [None]:
# Explore the data

import pandas as pd
pd.read_csv

In [None]:
# Create NumPy arrays of the train, test and target (Survived) dataframes to feed into our models
y_train = train['Survived'].to_numpy()
train = train.drop(['Survived'], axis=1)
x_train = train.values  # Creates an array of the train data
x_test = test.values  # Creates an array of the test data