# Background


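After spending some time studying the Titanic dataset, here is what I have learned:

- Cabin is of limited help. For example, I tried an `IF_Cabin` feature (0 if the passenger has no cabin recorded, 1 otherwise).
- Building a Deck feature did not help. From the letter found in the Cabin column we can derive a deck feature indicating which deck the passenger was on (A-G, T, or U for unknown), but it is very noisy and did not improve the score.
- Embarked did not help. I don't know why so many people even include it in their models. In my experiments it had no effect on survival chances.
- On encoding categorical features: converting a categorical feature into dummy (one-hot) features (for example, turning `Pclass` into `Pclass_1`, `Pclass_2`, and `Pclass_3`, each taking the value 0 or 1) may improve the performance of some algorithms. The advantage is higher accuracy in some cases. The drawbacks are that you lose the relationship between the classes (the algorithm treats them as independent, unordered categories, when in fact they are ordered: Pclass = 1 is "better" than Pclass = 3), and that the added dimensions are not always beneficial because of the curse of dimensionality. In my case, converting `Pclass` into three features did not help.
- Binning Fare and Age helps somewhat.
- Normalization and standardization. Standardization helped improve the score. Feature scaling helps many ML algorithms (KNN, for example), and it did raise their scores here. `StandardScaler` works best when features are roughly normally distributed; sometimes `MinMaxScaler` may be better, although standardization is less affected by outliers.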
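
The Cabin, Deck, and dummy-encoding ideas from the list above can be sketched on a few made-up rows. This is only an illustration: the toy values are invented, and the column name `Deck` and the variable `pclass_dummies` are assumptions rather than names used later in this kernel.

```python
import pandas as pd

# Toy rows for illustration only; these values are not taken from the dataset.
toy = pd.DataFrame({
    "Pclass": [1, 3, 2, 3],
    "Cabin": ["C85", None, "E46", None],
})

# Binary flag: 1 if a cabin is recorded, 0 otherwise.
toy["IF_Cabin"] = toy["Cabin"].notna().astype(int)

# Deck letter: first character of the cabin string, "U" when unknown.
toy["Deck"] = toy["Cabin"].str[0].fillna("U")

# Dummy (one-hot) encoding of Pclass: one 0/1 column per class.
# The ordering 1 < 2 < 3 is no longer visible to the model after this step.
pclass_dummies = pd.get_dummies(toy["Pclass"], prefix="Pclass", dtype=int)

print(toy)
print(pclass_dummies)
```

Each row of `pclass_dummies` contains exactly one 1, which is what makes the three columns independent categories instead of one ordered feature.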



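# About the Kaggle score
I spent a lot of time persistently trying to improve the score, even by small amounts, from 0.76 to 0.83 (so far). My goal should be to build the most robust and generalizable model with a reliable score, not necessarily the one with the best score.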

## Importing modules

In [8]:
# NumPy
import numpy as np

# Dataframe operations
import pandas as pd

# Data visualization
import seaborn as sns
import matplotlib.pyplot as plt

# Scalers
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.utils import shuffle

# Models
from sklearn.linear_model import LogisticRegression #logistic regression
from sklearn.linear_model import Perceptron
from sklearn import svm #support vector Machine
from sklearn.ensemble import RandomForestClassifier #Random Forest
from sklearn.neighbors import KNeighborsClassifier #KNN
from sklearn.naive_bayes import GaussianNB #Naive bayes
from sklearn.tree import DecisionTreeClassifier #Decision Tree
from sklearn.model_selection import train_test_split #training and testing data split
from sklearn import metrics #accuracy measure
from sklearn.metrics import confusion_matrix #for confusion matrix
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

# Cross-validation
from sklearn.model_selection import KFold #for K-fold cross validation
from sklearn.model_selection import cross_val_score #score evaluation
from sklearn.model_selection import cross_val_predict #prediction
from sklearn.model_selection import cross_validate

# GridSearchCV
from sklearn.model_selection import GridSearchCV

#Common Model Algorithms
from sklearn import svm, tree, linear_model, neighbors, naive_bayes, ensemble, discriminant_analysis, gaussian_process

#Common Model Helpers
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn import feature_selection
from sklearn import model_selection
from sklearn import metrics

#Visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns
from pandas.plotting import scatter_matrix  # pandas.tools.plotting was removed in later pandas versions

## Loading the dataset

In [9]:
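# All feature engineering is done on data_df (train + test combined)
train_df = pd.read_csv("./data/train.csv")
test_df = pd.read_csv("./data/test.csv")
# DataFrame.append was removed in pandas 2.0; pd.concat keeps the original row
# labels, which the index-aligned assignments back to train_df/test_df rely on
data_df = pd.concat([train_df, test_df]) # train + test.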

In [10]:
print(data_df.shape,train_df.shape,test_df.shape)

(1309, 12) (891, 12) (418, 11)


# Feature engineering

 - **Age**
 
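The title keyword is extracted from Name, and missing Age values are filled with the median age of each `groupby('Title')` group, so that ages are imputed more accurately. The median is used because the age distribution is not always normal, so it is usually a better choice than the mean.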

In [11]:
# Extracting Title: the word immediately followed by a period in each name
data_df['Title'] = data_df['Name'].str.extract(r'([A-Za-z]+)\.', expand=False)

# Replacing rare titles with more common ones
mapping = {'Mlle': 'Miss', 'Major': 'Mr', 'Col': 'Mr', 'Sir': 'Mr', 'Don': 'Mr', 'Mme': 'Miss',
          'Jonkheer': 'Mr', 'Lady': 'Mrs', 'Capt': 'Mr', 'Countess': 'Mrs', 'Ms': 'Miss', 'Dona': 'Mrs'}
data_df.replace({'Title': mapping}, inplace=True)

# Imputing missing ages with the median age of passengers sharing the same title
titles = ['Dr', 'Master', 'Miss', 'Mr', 'Mrs', 'Rev']
median_age_by_title = data_df.groupby('Title')['Age'].median()
for title in titles:
    age_to_impute = median_age_by_title[title]  # look up by label, not by position
    data_df.loc[(data_df['Age'].isnull()) & (data_df['Title'] == title), 'Age'] = age_to_impute
    
# Substituting Age values in TRAIN_DF and TEST_DF:
train_df['Age'] = data_df['Age'][:891]
test_df['Age'] = data_df['Age'][891:]

# Dropping Title feature
data_df.drop('Title', axis = 1, inplace = True)
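
As a side note, the same per-title median imputation can be written without an explicit loop. The following is a minimal sketch on made-up rows, assuming only the `Title` and `Age` column names from the cell above; the variable `title_median` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "Title": ["Mr", "Mr", "Mr", "Miss", "Miss", "Miss"],
    "Age": [30.0, 40.0, np.nan, 20.0, np.nan, 24.0],
})

# Median age per title, broadcast back to every row of that title.
title_median = toy.groupby("Title")["Age"].transform("median")

# Fill only the missing ages; known ages are left untouched.
toy["Age"] = toy["Age"].fillna(title_median)

print(toy)
```

Here the missing "Mr" age becomes 35.0 and the missing "Miss" age becomes 22.0, the medians of the known ages in each group.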

 - **Adding Family_Size**
 
Family_Size =  Parch + SibSp.

In [12]:
data_df['Family_Size'] = data_df['Parch'] + data_df['SibSp']

# Substituting Family_Size values in TRAIN_DF and TEST_DF:
train_df['Family_Size'] = data_df['Family_Size'][:891]
test_df['Family_Size'] = data_df['Family_Size'][891:]

 - **Adding Family_Survival**
 
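 This feature is borrowed from [S.Xu's kernel](https://www.kaggle.com/shunjiangxu/blood-is-thicker-than-water-friendship-forever), which computes survival at the family level. Passengers are treated as one family when they share the same Last_Name and Fare (via `groupby`). For each passenger, the feature is 1 if any other member of the family is known to have survived, 0 if the other members' outcomes are known and none survived, and 0.5 when nothing is known.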

In [13]:
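data_df['Last_Name'] = data_df['Name'].apply(lambda x: str.split(x, ",")[0])
# Assign the result back instead of calling fillna(inplace=True) on a column
# selection, which newer pandas versions may not propagate to data_df
data_df['Fare'] = data_df['Fare'].fillna(data_df['Fare'].mean())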

DEFAULT_SURVIVAL_VALUE = 0.5
data_df['Family_Survival'] = DEFAULT_SURVIVAL_VALUE

for grp, grp_df in data_df[['Survived','Name', 'Last_Name', 'Fare', 'Ticket', 'PassengerId',
                           'SibSp', 'Parch', 'Age', 'Cabin']].groupby(['Last_Name', 'Fare']): # grp is the (Last_Name, Fare) key, grp_df holds the matching rows
    
    if (len(grp_df) != 1):
        # A Family group is found.
        for ind, row in grp_df.iterrows():  # iterrows() return index,series
            smax = grp_df.drop(ind)['Survived'].max()
            smin = grp_df.drop(ind)['Survived'].min()
            passID = row['PassengerId']
            if (smax == 1.0):
                data_df.loc[data_df['PassengerId'] == passID, 'Family_Survival'] = 1
            elif (smin==0.0):
                data_df.loc[data_df['PassengerId'] == passID, 'Family_Survival'] = 0

print("Number of passengers with family survival information:", 
      data_df.loc[data_df['Family_Survival']!=0.5].shape[0])

Number of passengers with family survival information: 420


In [16]:
for _, grp_df in data_df.groupby('Ticket'): # group passengers sharing a ticket; the key itself is not needed
    if (len(grp_df) != 1):
        for ind, row in grp_df.iterrows():
            # Only update passengers not already set to 1 by the family pass above
            if (row['Family_Survival'] == 0) or (row['Family_Survival'] == 0.5):
                smax = grp_df.drop(ind)['Survived'].max()
                smin = grp_df.drop(ind)['Survived'].min()
                passID = row['PassengerId']
                if (smax == 1.0):
                    data_df.loc[data_df['PassengerId'] == passID, 'Family_Survival'] = 1
                elif (smin==0.0):
                    data_df.loc[data_df['PassengerId'] == passID, 'Family_Survival'] = 0
                        
print("Number of passenger with family/group survival information: " 
      +str(data_df[data_df['Family_Survival']!=0.5].shape[0]))

# # Family_Survival in TRAIN_DF and TEST_DF:
train_df['Family_Survival'] = data_df['Family_Survival'][:891]
test_df['Family_Survival'] = data_df['Family_Survival'][891:]

Number of passenger with family/group survival information: 546
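
The leave-one-out rule used in the two cells above can be condensed into a small helper. This is a sketch under stated assumptions, not the original kernel's code: the function name `group_survival` is hypothetical, the rows are made up, and the frame is assumed to have a unique index (unlike `data_df`, whose row labels repeat between train and test).

```python
import numpy as np
import pandas as pd

def group_survival(df, keys, default=0.5):
    """Leave-one-out survival signal within the groups defined by `keys`."""
    result = pd.Series(default, index=df.index, dtype=float)
    for _, grp in df.groupby(keys):
        if len(grp) < 2:
            continue  # a passenger alone in a group carries no signal
        for idx in grp.index:
            others = grp["Survived"].drop(idx)  # everyone except this passenger
            if others.max() == 1.0:
                result.loc[idx] = 1.0  # at least one other member survived
            elif others.min() == 0.0:
                result.loc[idx] = 0.0  # outcomes known, none survived
    return result

# Toy rows for illustration only; NaN marks test-set passengers with unknown outcome.
toy = pd.DataFrame({
    "Last_Name": ["Smith", "Smith", "Smith", "Jones", "Jones", "Brown"],
    "Fare": [10.0, 10.0, 10.0, 20.0, 20.0, 30.0],
    "Survived": [1.0, 0.0, np.nan, 0.0, np.nan, 1.0],
})
toy["Family_Survival"] = group_survival(toy, ["Last_Name", "Fare"])
print(toy)
```

Note that a passenger's own outcome is never used for their own feature value, which is what keeps the label from leaking directly into the feature.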


 - **Making FARE BINS**
 
The bins are ordinal: `FareBin_Code` = 3 really does mean a higher fare than `FareBin_Code` = 1, so the integer codes keep a meaningful order.

In [167]:
# Fare was already filled with the mean in the Family_Survival cell, so this is only a safeguard
data_df['Fare'] = data_df['Fare'].fillna(data_df['Fare'].median())

# Making Bins
data_df['FareBin'] = pd.qcut(data_df['Fare'], 5)

label = LabelEncoder()
data_df['FareBin_Code'] = label.fit_transform(data_df['FareBin'])

train_df['FareBin_Code'] = data_df['FareBin_Code'][:891]
test_df['FareBin_Code'] = data_df['FareBin_Code'][891:]

# axis must be passed by keyword in pandas 2.0 and later
train_df.drop(['Fare'], axis=1, inplace=True)
test_df.drop(['Fare'], axis=1, inplace=True)
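
To see why quantile bins give ordered codes, here is a minimal sketch on made-up fares. It uses `pd.qcut` with `labels=False`, which returns the integer bin index directly; that is a swapped-in shortcut, not the `LabelEncoder` route used in the cell above.

```python
import pandas as pd

# Made-up fares for illustration only, already sorted for readability.
fares = pd.Series([5.0, 7.5, 8.0, 13.0, 26.0, 30.0, 52.0, 80.0, 150.0, 512.0])

# Five quantile bins with (roughly) equal counts; 0 is the cheapest bin.
fare_codes = pd.qcut(fares, 5, labels=False)

print(fare_codes.tolist())
```

With ten distinct values and five bins, each bin receives two fares, and a larger code always corresponds to a higher fare range.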

 - **Making AGE BINS**
 


In [168]:
data_df['AgeBin'] = pd.qcut(data_df['Age'], 4)

label = LabelEncoder()
data_df['AgeBin_Code'] = label.fit_transform(data_df['AgeBin'])

train_df['AgeBin_Code'] = data_df['AgeBin_Code'][:891]
test_df['AgeBin_Code'] = data_df['AgeBin_Code'][891:]

# axis must be passed by keyword in pandas 2.0 and later
train_df.drop(['Age'], axis=1, inplace=True)
test_df.drop(['Age'], axis=1, inplace=True)

 - **Mapping Sex and cleaning data (attention: dropping unused columns)**

In [169]:
# Encode Sex as 0 (male) / 1 (female); assign back instead of replacing in place
train_df['Sex'] = train_df['Sex'].map({'male': 0, 'female': 1})
test_df['Sex'] = test_df['Sex'].map({'male': 0, 'female': 1})

train_df.drop(['Name', 'PassengerId', 'SibSp', 'Parch', 'Ticket', 'Cabin',
               'Embarked'], axis = 1, inplace = True)
test_df.drop(['Name','PassengerId', 'SibSp', 'Parch', 'Ticket', 'Cabin',
             'Embarked'], axis = 1, inplace = True)

So now our datasets look like this:

In [179]:
train_df.head()

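| | Survived | Pclass | Sex | Family_Size | Family_Survival | FareBin_Code | AgeBin_Code |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 0 | 1 | 0.5 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 | 0.5 | 4 | 3 |
| 2 | 1 | 3 | 1 | 0 | 0.5 | 1 | 1 |
| 3 | 1 | 1 | 1 | 1 | 0.0 | 4 | 2 |
| 4 | 0 | 3 | 0 | 0 | 0.5 | 1 | 2 |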


# Training

 - **Creating X and y**

In [171]:
X = train_df.drop('Survived', axis=1)
y = train_df['Survived']
X_test = test_df.copy()

 - **Scaling features**

In [172]:
std_scaler = StandardScaler()
X = std_scaler.fit_transform(X)
X_test = std_scaler.transform(X_test)
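
For comparison, here is a minimal sketch of how `StandardScaler` and `MinMaxScaler` treat a single column containing one large outlier. The values are made up, and the variable names `x`, `x_std`, and `x_minmax` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One feature column with a single large outlier (illustrative values only).
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

x_std = StandardScaler().fit_transform(x)    # zero mean, unit variance
x_minmax = MinMaxScaler().fit_transform(x)   # rescaled to the range [0, 1]

print(x_std.ravel())
print(x_minmax.ravel())
```

With min-max scaling the outlier pins the upper end of the range, so the four ordinary values are squeezed into roughly the bottom 3% of [0, 1]. This matters for distance-based models such as KNN, where the spacing between ordinary points drives the neighbor search.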



 - **Grid Search CV**
 
We use KNN here.

In [173]:
n_neighbors = [6,7,8,9,10,11,12,14,16,18,20,22]
algorithm = ['auto']
weights = ['uniform', 'distance']
leaf_size = list(range(1,50,5))
hyperparams = {'algorithm': algorithm, 'weights': weights, 'leaf_size': leaf_size, 
               'n_neighbors': n_neighbors}
gd=GridSearchCV(estimator = KNeighborsClassifier(), param_grid = hyperparams, verbose=True, 
                cv=10, scoring = "roc_auc")
gd.fit(X, y)
print(gd.best_score_)
print(gd.best_estimator_)

Fitting 10 folds for each of 240 candidates, totalling 2400 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


0.879492358564
KNeighborsClassifier(algorithm='auto', leaf_size=26, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=18, p=2,
           weights='uniform')


[Parallel(n_jobs=1)]: Done 2400 out of 2400 | elapsed:   30.9s finished




This is the result I got; yours may differ:

> KNeighborsClassifier(algorithm='auto', leaf_size=26, metric='minkowski', metric_params=None, n_jobs=1, n_neighbors=18, p=2, weights='uniform')

This gave a ROC_AUC score of 0.879492358564 (not an accuracy score!). Many of my models reached a ROC_AUC of about 0.93-0.94, but most of them scored lower on the test set!
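
Because ROC AUC and accuracy measure different things, it can help to report both for the same estimator. The sketch below does that with `cross_val_score` on synthetic data from `make_classification`, used here as a stand-in for the real `X` and `y`; the names `X_demo`, `y_demo`, and `knn_demo` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled feature matrix and labels.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

knn_demo = KNeighborsClassifier(n_neighbors=18)

# Same model, same folds, two different metrics.
auc_scores = cross_val_score(knn_demo, X_demo, y_demo, cv=10, scoring="roc_auc")
acc_scores = cross_val_score(knn_demo, X_demo, y_demo, cv=10, scoring="accuracy")

print("ROC AUC :", auc_scores.mean())
print("accuracy:", acc_scores.mean())
```

ROC AUC scores the ranking of predicted probabilities, while the Kaggle leaderboard for this competition scores hard 0/1 predictions, so a grid search tuned on one does not necessarily pick the best model for the other.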

 - **Using a model found by grid searching**

In [174]:
gd.best_estimator_.fit(X, y)
# y_pred = gd.best_estimator_.predict(X_test)

KNeighborsClassifier(algorithm='auto', leaf_size=26, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=18, p=2,
           weights='uniform')

When I submitted the result, the model specified above yielded a public score of 0.82775.

- **Using another K**

In [175]:
knn = KNeighborsClassifier(algorithm='auto', leaf_size=26, metric='minkowski', 
                           metric_params=None, n_jobs=1, n_neighbors=6, p=2, 
                           weights='uniform')
# knn.fit(X, y)
# y_pred = knn.predict(X_test)

With this one, I got 0.83253!

- **vote**

A little more progress: using an ensemble of KNN and XGBoost!

In [176]:
vote_est = [
    
    #Nearest Neighbor: http://scikit-learn.org/stable/modules/neighbors.html
    ('knn', gd.best_estimator_),
    
    #xgboost: http://xgboost.readthedocs.io/en/latest/model.html
    ('xgb', XGBClassifier()),
    

]
cv_split = model_selection.ShuffleSplit(n_splits = 10, test_size = .3, train_size = .6, random_state = 0 ) 

#Hard Vote or majority rules
vote_hard = ensemble.VotingClassifier(estimators = vote_est , voting = 'hard')
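# return_train_score=True is needed for the 'train_score' key printed below
vote_hard_cv = model_selection.cross_validate(vote_hard, X, y, cv = cv_split, return_train_score = True)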
vote_hard.fit(X, y)

print("Hard Voting Training w/bin score mean: {:.2f}". format(vote_hard_cv['train_score'].mean()*100)) 
print("Hard Voting Test w/bin score mean: {:.2f}". format(vote_hard_cv['test_score'].mean()*100))
print("Hard Voting Test w/bin score 3*std: +/- {:.2f}". format(vote_hard_cv['test_score'].std()*100*3))
print('-'*10)

y_pred = vote_hard.predict(X_test)

Hard Voting Training w/bin score mean: 85.24
Hard Voting Test w/bin score mean: 83.96
Hard Voting Test w/bin score 3*std: +/- 5.57
----------
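
As a possible variation, hard and soft voting can be compared side by side. The sketch below is an assumption-laden illustration, not the submitted model: it runs on synthetic data, and `GradientBoostingClassifier` stands in for `XGBClassifier` so that only scikit-learn is required.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.model_selection import ShuffleSplit, cross_validate
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled feature matrix and labels.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

estimators = [
    ("knn", KNeighborsClassifier(n_neighbors=18)),
    # Stand-in for XGBClassifier so the sketch needs only scikit-learn.
    ("gb", GradientBoostingClassifier(random_state=0)),
]
cv_split = ShuffleSplit(n_splits=5, test_size=0.3, train_size=0.6, random_state=0)

results = {}
for voting in ("hard", "soft"):
    clf = VotingClassifier(estimators=estimators, voting=voting)
    cv = cross_validate(clf, X_demo, y_demo, cv=cv_split, return_train_score=True)
    results[voting] = cv["test_score"].mean()  # mean accuracy over the splits

print(results)
```

With only two estimators, hard voting can end in a tie, which scikit-learn resolves by the ascending sort order of the class labels. Soft voting averages predicted probabilities instead, so it may be worth trying when the ensemble has an even number of members.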




- **Making submission**

In [177]:
temp = pd.DataFrame(pd.read_csv("./data/test.csv")['PassengerId'])  # same path as in the loading cell
temp['Survived'] = y_pred
temp.to_csv("../submission.csv", index = False)

## Results

When I submitted these predictions I got 0.83732, which is in the top 3%. Maybe that is enough for me.

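# Conclusion
A few final points.
- Of course, KNN is not the only option. I also tried SVMs, AdaBoost, GradientBoosting, RandomForests, and many other models, all of which are omitted here. But KNN gave reliable results with all the features I engineered, so I prefer it for its simplicity. You can try other estimators on this dataset; in particular, try XGBoost with tuned hyperparameters. You may well get a higher score with the same features described in this kernel.
- If you engineer the family groups further, I think you will get better results. See [Finding the "real" families on the Titanic](https://www.kaggle.com/erikbruin/finding-the-real-families-on-the-titanic) to learn more about families.
- In general, grouping passengers is a good way to improve the score. Try searching for groups.
- You can also use some interesting features, such as [Slogl in Oscar's Kernel](https://www.kaggle.com/pliptor/divide-and-converse-0-82296). Give it a try.


However, be prepared to invest a lot of time for only a small gain. It seems that truly better models would reach 86-88% or higher, but that is hard to achieve.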