I am just starting on Kaggle. Instead of a very detailed analysis of the features, my objective is to produce a quick and dirty model with minimal code to understand how the system works. If the model scores in the average range, I will consider it a success.

The first objective is to clean up the data: fill in, or drop entirely, the missing values for Age, Fare, Cabin and Embarked. PassengerId, Ticket and Name are unlikely to matter, so they are obvious columns to drop. I am not comfortable filling the missing Age values with something arbitrary like the training-set median, so I drop the Age column too. Fare should be highly correlated with class, so I fill a missing Fare with the mean fare for the corresponding class (the training set has no missing Fare, so this only affects the test set). Cabin is replaced with a binary flag: 1 if the cabin is missing, 0 if it is recorded. Missing Embarked values are replaced with the majority port (S), given that only 2 out of 891 are missing. Finally, the Sex and Embarked strings are replaced with integers (male/female as 0/1, and C/Q/S as 0/1/2).
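
The cleaning steps above can be sketched on a tiny made-up frame. This is only an illustration under stated assumptions: the `toy` rows and the name `class_mean_fare` are invented for the example and are not taken from the Titanic files, and `groupby(...).transform('mean')` is just one way to get the per-class mean.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration (not rows from train.csv or test.csv).
toy = pd.DataFrame({
    "Pclass":   [1, 1, 3, 3, 3],
    "Fare":     [80.0, 60.0, 8.0, np.nan, 10.0],
    "Cabin":    ["C85", np.nan, np.nan, np.nan, "G6"],
    "Embarked": ["C", "S", np.nan, "Q", "S"],
})

# Per-row mean Fare of the passenger's class (NaN values are skipped by mean).
class_mean_fare = toy.groupby("Pclass")["Fare"].transform("mean")
toy["Fare"] = toy["Fare"].fillna(class_mean_fare)

# Binary Cabin flag: 1 where the cabin is missing, 0 where it is recorded.
toy["Cabin"] = toy["Cabin"].isnull().astype(int)

# Fill missing Embarked with the majority port 'S', then encode as integers.
toy["Embarked"] = toy["Embarked"].fillna("S").map({"C": 0, "Q": 1, "S": 2}).astype(int)

print(toy)
```

In this toy frame the missing third-class fare is filled with the mean of the two known third-class fares, and no missing values remain afterwards.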

I did this just as an exercise to familiarize myself with Kaggle, so I was quite surprised that the score was 0.799, which placed me in the top 20%.


In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

train = pd.read_csv("../input/train.csv")
test = pd.read_csv("../input/test.csv")
print(train.shape, train.columns.values)
print(train.isnull().sum())
train.head(3)


(891, 12) ['PassengerId' 'Survived' 'Pclass' 'Name' 'Sex' 'Age' 'SibSp' 'Parch'
 'Ticket' 'Fare' 'Cabin' 'Embarked']
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64


,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


In [2]:
# Select the features and separate the target.
# .copy() gives independent frames, so the assignments below do not write
# into slices of train/test (this avoids pandas' SettingWithCopyWarning).
features = ['Pclass', 'Sex', 'SibSp', 'Parch', 'Fare', 'Cabin', 'Embarked']
X_train = train[features].copy()
X_test = test[features].copy()
y_train = train['Survived']

# Do all the replacements and transformations described in the introduction.
for Z in [X_train, X_test]:
    # Fill a missing Fare with the mean Fare of the passenger's class
    # (the means are computed within each frame).
    Z['Fare'] = Z['Fare'].fillna(Z.groupby('Pclass')['Fare'].transform('mean'))
    # Fill missing Embarked with the majority port 'S', then encode as integers.
    Z['Embarked'] = Z['Embarked'].fillna('S').map({'C': 0, 'Q': 1, 'S': 2}).astype(int)
    Z['Sex'] = Z['Sex'].map({'male': 0, 'female': 1}).astype(int)
    # Cabin flag: 1 if the cabin is missing, 0 if it is recorded.
    Z['Cabin'] = Z['Cabin'].isnull().astype(int)

# Double-check that the dataset looks reasonable.
X_train.describe()


,Pclass,Sex,SibSp,Parch,Fare,Cabin,Embarked
count,891.0,891.0,891.0,891.0,891.0,891.0,891.0
mean,2.308642,0.352413,0.523008,0.381594,32.204208,0.771044,1.536476
std,0.836071,0.47799,1.102743,0.806057,49.693429,0.420397,0.791503
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2.0,0.0,0.0,0.0,7.9104,1.0,1.0
50%,3.0,0.0,0.0,0.0,14.4542,1.0,2.0
75%,3.0,1.0,1.0,0.0,31.0,1.0,2.0
max,3.0,1.0,8.0,6.0,512.3292,1.0,2.0


In [3]:
# Use Gradient Boosting with default parameters to train the model and submit the prediction

from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier().fit(X_train, y_train)
print('Accuracy on the training set: {:.2f}'.format(clf.score(X_train, y_train)))

prediction = pd.DataFrame({"PassengerId": test["PassengerId"], "Survived": clf.predict(X_test)})
prediction.to_csv('my_submission.csv', index=False)


Accuracy on the training set: 0.87
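
The 0.87 above is measured on the same rows the model was fit on, so it is expected to be more optimistic than the 0.799 leaderboard score. A minimal sketch of how cross-validation could give a less optimistic estimate before submitting is shown below. It is an assumption-laden illustration, not part of the original notebook: `X_demo` and `y_demo` are synthetic stand-ins generated with `make_classification`, and the numbers it prints say nothing about the Titanic data. With the real data one would pass `X_train` and `y_train` instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in with the same shape as the cleaned training set
# (891 rows, 7 features). Hypothetical data, used only to keep this runnable.
X_demo, y_demo = make_classification(
    n_samples=891, n_features=7, n_informative=4, random_state=0
)

# Accuracy on the rows the model was trained on (optimistic).
clf_demo = GradientBoostingClassifier(random_state=0).fit(X_demo, y_demo)
train_acc = clf_demo.score(X_demo, y_demo)

# 5-fold cross-validation: each fold is scored by a model that never saw it.
cv_scores = cross_val_score(
    GradientBoostingClassifier(random_state=0),
    X_demo, y_demo, cv=5, scoring="accuracy",
)

print('Accuracy on the training set: {:.2f}'.format(train_acc))
print('Mean 5-fold CV accuracy: {:.2f}'.format(cv_scores.mean()))
```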
