In [1]:
import pandas as pd

In [2]:
train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")

In [3]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


We have the following 10 features for each passenger:

* *Pclass* is the Ticket-class: first(1), second(2), and third(3) class tickets. This is an ordinal integer feature.

* *Name* is the name of the passenger. Although the name itself reveals little about survival, the titles attached to names can indicate a certain group. For example, Mrs indicates a married woman. This is a categorical text string feature.

* *Sex* is the gender of passenger. Either female or male. This is a categorical text string feature.

* *Age* is the passenger's age in years. It is stored as a float (the minimum in the training data is 0.42), and the column contains NaN values.

* *SibSp* is the number of siblings and spouses aboard the Titanic. Siblings include brothers, sisters, stepbrothers, and stepsisters, while spouses include husbands and wives (mistresses and fiancés were ignored). This is an ordinal integer feature.

* *Parch* is the number of parents and children aboard the Titanic. This is also an ordinal integer feature.

* *Ticket* is a character string that gives the ticket number.

* *Fare* is a float feature showing how much money the passenger paid for their trip.

* *Cabin* is the cabin number of each passenger. This column contains NaN values. This is another string feature.

* *Embarked* is the port of embarkation as a categorical character feature.

In summary, we have 2 floating-point features (Fare, Age), 3 ordinal integer features (Pclass, SibSp, Parch), 2 categorical features (Sex, Embarked), and 3 text string features (Ticket, Cabin, Name).
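The column types and the titles mentioned under *Name* can both be inspected directly. The sketch below is illustrative only: it uses a three-row sample copied from the `train.head()` output, and the `Title` column and the regular expression are assumptions, not part of the original analysis.

```python
import pandas as pd

# Three rows copied from the train.head() output above (illustration only).
sample = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris",
             "Heikkinen, Miss. Laina",
             "Futrelle, Mrs. Jacques Heath (Lily May Peel)"],
    "Age": [22.0, 26.0, 35.0],
    "Pclass": [3, 3, 1],
})

# dtypes shows how pandas stores each column.
print(sample.dtypes)

# Hypothetical helper column: the title sits between the comma and the first period.
sample["Title"] = sample["Name"].str.extract(r",\s*([^.]+)\.", expand=False)
print(sample["Title"].tolist())
```

On the full data, `train.dtypes` gives the same kind of overview for all columns.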

In [4]:
train.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


## Missing values

Take a look at missing values in training data:

In [5]:
print(train.isnull().sum())

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64


It can be seen that 177 values were missing in Age, 687 values were missing in Cabin, and 2 values were missing in Embarked.
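To judge how serious each gap is, the counts can be turned into percentages of the 891 training rows. This is a small worked calculation using the numbers printed above; on the real frame, `train.isnull().mean()` should give the same fractions.

```python
# Missing-value counts and row count copied from the outputs above.
n_rows = 891
missing = {"Age": 177, "Cabin": 687, "Embarked": 2}

# Share of rows with a missing value, in percent.
missing_pct = {col: round(100 * n / n_rows, 1) for col, n in missing.items()}
print(missing_pct)
```

Roughly a fifth of the Age values and more than three quarters of the Cabin values are missing, while Embarked is almost complete.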

Let's fill in the empty values with the available information. Firstly, let's take a look at missing values of Embarked:

In [6]:
df = train[train["Embarked"].isnull()]
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
61,62,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,B28,
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,


Then we find similar passengers:

In [7]:
train[(train["Sex"]=="female") & (train["SibSp"]==0) & (train["Parch"]==0) & (train["Fare"]>70.0) & (train["Fare"]<90.0)]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
61,62,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,B28,
218,219,1,1,"Bazzani, Miss. Albina",female,32.0,0,0,11813,76.2917,D15,C
256,257,1,1,"Thorne, Mrs. Gertrude Maybelle",female,,0,0,PC 17585,79.2,,C
257,258,1,1,"Cherry, Miss. Gladys",female,30.0,0,0,110152,86.5,B77,S
290,291,1,1,"Barber, Miss. Ellen ""Nellie""",female,26.0,0,0,19877,78.85,,S
310,311,1,1,"Hays, Miss. Margaret Bechstein",female,24.0,0,0,11767,83.1583,C54,C
504,505,1,1,"Maioni, Miss. Roberta",female,16.0,0,0,110152,86.5,B79,S
627,628,1,1,"Longley, Miss. Gretchen Fiske",female,21.0,0,0,13502,77.9583,D9,S
759,760,1,1,"Rothes, the Countess. of (Lucy Noel Martha Dye...",female,33.0,0,0,110152,86.5,B77,S
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,


Among the similar passengers with a known port, five embarked from S and three from C. We fill in the more common value, S:
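The choice can be made less ad hoc by counting the ports among the similar passengers. The sketch below re-types the Embarked column of the ten rows shown above; the variable names are illustrative.

```python
import pandas as pd

# Embarked values of the ten rows in the table above (None = missing).
similar_embarked = pd.Series(
    [None, "C", "C", "S", "S", "C", "S", "S", "S", None], name="Embarked"
)

# value_counts ignores missing values by default.
counts = similar_embarked.value_counts()
most_common = similar_embarked.mode()[0]
print(counts)
print(most_common)
```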

In [8]:
train.loc[61,"Embarked"] = "S"
train.loc[829,"Embarked"] = "S"

Confirm that there are no null values left in the Embarked column:

In [9]:
print(train.isnull().sum())

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         0
dtype: int64


In [10]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


There is a good explanation of [feature extraction and derivation](https://www.kaggle.com/headsortails/pytanic) on Kaggle. To keep things as simple as possible, we select only the following features: Pclass, Sex, Age, SibSp, Parch, Fare, Embarked.

## Preparing the data

Since not all classifiers can handle string input, it is good practice to convert such columns to numbers (integers or floats). For instance, the categorical string feature Sex can be represented by 1 (male) and 0 (female).

In [11]:
print(train.Pclass.unique())
print(train.Sex.unique())
print(train.SibSp.unique())
print(train.Parch.unique())
print(train.Embarked.unique())

[3 1 2]
['male' 'female']
[1 0 3 4 2 5 8]
[0 1 2 5 3 4 6]
['S' 'C' 'Q']


In [12]:
# Map each category to an integer code (alphabetical order)
train["Sex"] = train["Sex"].map({"female": 0, "male": 1}).astype("int")

train["Embarked"] = train["Embarked"].map({"C": 0, "Q": 1, "S": 2}).astype("int")

In [13]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",1,22.0,1,0,A/5 21171,7.25,,2
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,38.0,1,0,PC 17599,71.2833,C85,0
2,3,1,3,"Heikkinen, Miss. Laina",0,26.0,0,0,STON/O2. 3101282,7.925,,2
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,35.0,1,0,113803,53.1,C123,2
4,5,0,3,"Allen, Mr. William Henry",1,35.0,0,0,373450,8.05,,2
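Integer codes for Embarked imply an order (C < Q < S) that the ports do not really have. One common alternative, not used in this notebook, is one-hot encoding. A minimal sketch on a toy frame, with illustrative names:

```python
import pandas as pd

# Toy frame with the three ports seen in the training data.
demo = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One indicator column per port; dtype=int gives 0/1 instead of booleans.
one_hot = pd.get_dummies(demo, columns=["Embarked"], dtype=int)
print(one_hot)
```

Each row then has exactly one 1 among `Embarked_C`, `Embarked_Q`, and `Embarked_S`, so a linear model can learn a separate effect for each port.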


## Modeling

### Splitting the train sample into two sub-samples: training and testing

To check for overfitting, we split the training data further into two sets: one to fit the model and one held out to evaluate it.

In [14]:
from sklearn.model_selection import train_test_split

In [15]:
training, testing = train_test_split(train, test_size=0.2, random_state=0)
print("Total sample size = %i; training sample size = %i, testing sample size = %i"\
     %(train.shape[0],training.shape[0],testing.shape[0]))

Total sample size = 891; training sample size = 712, testing sample size = 179
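The split sizes follow from the 20% test fraction. The arithmetic below assumes that the test-set size is rounded up to the next integer, which is consistent with the sizes printed above.

```python
import math

n_samples = 891
test_size = 0.2

# 0.2 * 891 = 178.2, rounded up for the test set; the rest is used for training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```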


### Look at model features

In [16]:
cols = ["Pclass","Sex","Age","SibSp","Parch","Fare","Embarked"]
tcols = ["Survived"] + cols

In [17]:
df = training[tcols].dropna()

In [18]:
X = df.loc[:,cols]

In [19]:
y = df["Survived"].tolist()

In [20]:
from sklearn.linear_model import LogisticRegression

In [21]:
clf_log = LogisticRegression()
clf_log = clf_log.fit(X,y)
score_log = clf_log.score(X,y)
print(score_log)

0.784588441331


In [22]:
coef = clf_log.coef_[0]

In [23]:
pd.DataFrame(list(zip(cols,coef)), columns=["Feature","Coefficient"])

Unnamed: 0,Feature,Coefficient
0,Pclass,-0.825371
1,Sex,-2.233182
2,Age,-0.032287
3,SibSp,-0.312647
4,Parch,-0.021213
5,Fare,0.004752
6,Embarked,0.01084


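The negative coefficients mean that a higher value of the feature lowers the predicted log-odds of survival. With Sex encoded as 1 for male, the coefficient of -2.23 says that men were much less likely to survive than women; likewise, a higher Pclass number (a lower ticket class) and a higher age both reduce the predicted chance of survival.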
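A logistic-regression coefficient is the change in log-odds per unit increase of the feature, so exponentiating it gives an odds ratio that is often easier to read. The sketch below uses the coefficient values copied from the table above; the variable names are illustrative.

```python
import math

# Coefficients copied from the table above.
coefs = {"Pclass": -0.825371, "Sex": -2.233182, "Age": -0.032287,
         "SibSp": -0.312647, "Parch": -0.021213, "Fare": 0.004752,
         "Embarked": 0.01084}

# exp(coefficient) = factor by which the odds change per unit increase.
odds_ratios = {name: math.exp(c) for name, c in coefs.items()}
for name, ratio in odds_ratios.items():
    print(f"{name:9s} {ratio:.3f}")
```

A ratio below 1 corresponds to a negative coefficient. For Sex the ratio is about 0.11, meaning the model puts the odds of survival for men at roughly a tenth of those for women with all other features held fixed.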

### Model validation

In [24]:
from sklearn.model_selection import cross_val_score
import numpy as np

In [25]:
scores = cross_val_score(clf_log, X, y, cv=5)
print(scores)
print("Mean score = %.3f, Std deviation = %.3f"%(np.mean(scores),np.std(scores)))

[ 0.74782609  0.8245614   0.78070175  0.78947368  0.78947368]
Mean score = 0.786, Std deviation = 0.024
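Note that `np.std` divides by n by default (population standard deviation). With only five folds, the sample standard deviation (dividing by n - 1) is noticeably larger. The sketch below recomputes both from the printed fold scores using only the standard library.

```python
import statistics

# Fold scores copied from the cross_val_score output above.
scores = [0.74782609, 0.8245614, 0.78070175, 0.78947368, 0.78947368]

mean_score = statistics.mean(scores)
pop_std = statistics.pstdev(scores)    # divides by n, like np.std's default
sample_std = statistics.stdev(scores)  # divides by n - 1
print(f"mean={mean_score:.3f} pop_std={pop_std:.3f} sample_std={sample_std:.3f}")
```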


Final validation with the testing data set:

In [26]:
df_test = testing[tcols].dropna()
X_test = df_test.loc[:,cols]
y_test = df_test.loc[:,['Survived']]
y_test = [val[0] for val in y_test.values]
score_log_test = clf_log.score(X_test,y_test)  # clf_log has been trained by training data
print(score_log_test)

0.79020979021


## Preparing prediction for submission

Finally, we use our model to predict the test data, and write to a submission file according to the submission rules (418 rows; only include the columns PassengerId and Survived).

In [27]:
test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


Firstly, prepare the data for prediction.

In [28]:
# Use the same integer codes as for the training data
test["Sex"] = test["Sex"].map({"female": 0, "male": 1}).astype("int")

test["Embarked"] = test["Embarked"].map({"C": 0, "Q": 1, "S": 2}).astype("int")

In [29]:
test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",1,34.5,0,0,330911,7.8292,,1
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",0,47.0,1,0,363272,7.0,,2
2,894,2,"Myles, Mr. Thomas Francis",1,62.0,0,0,240276,9.6875,,1
3,895,3,"Wirz, Mr. Albert",1,27.0,0,0,315154,8.6625,,2
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",0,22.0,1,1,3101298,12.2875,,2


In [30]:
df_2 = test.loc[:,cols]

Next, we need to deal with missing values in the test data.

In [31]:
print(df_2.isnull().sum())

Pclass       0
Sex          0
Age         86
SibSp        0
Parch        0
Fare         1
Embarked     0
dtype: int64


As we can see, there are 86 missing values in Age and 1 in Fare. Since both are among our selected features, we need to fill in the blanks. Here we forward-fill: each missing value is replaced by the last valid value above it.

In [32]:
df_2 = df_2.ffill()  # forward-fill missing values from the previous row

In [33]:
print(df_2.isnull().sum())

Pclass      0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    0
dtype: int64
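Forward filling depends on the row order of the file, so the filled ages are essentially arbitrary. A common alternative, shown here only as a sketch on a toy series, is to fill with the median; in practice the median would ideally be computed from the training data.

```python
import pandas as pd

# Toy Age column with two gaps (values are illustrative).
ages = pd.Series([34.5, None, 62.0, None, 22.0], name="Age")

# Forward fill: copy the last valid value downwards.
forward_filled = ages.ffill()

# Median fill: every gap gets the same, order-independent value.
median_filled = ages.fillna(ages.median())

print(forward_filled.tolist())
print(median_filled.tolist())
```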


Now the test data is ready for prediction:

In [34]:
surv_pred = clf_log.predict(df_2)

In [35]:
surv_pred

array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1,
       1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0,
       1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,
       1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1,
       0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0,
       1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1,
       0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0,
       0, 0, ...])

In [36]:
submit = pd.DataFrame({"PassengerId": test["PassengerId"],
                       "Survived": surv_pred})

In [37]:
submit

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,0
2,894,0
3,895,0
4,896,1
5,897,0
6,898,1
7,899,0
8,900,1
9,901,0


In [38]:
submit.shape

(418, 2)

In [39]:
submit.to_csv("out.csv",index=False)
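Before uploading, it can be worth reading the file back and checking it against the submission rules quoted earlier (columns PassengerId and Survived). The sketch below does this for a three-row stand-in written to an in-memory buffer; `submit_demo` and `check` are illustrative names, and on the real file one would read `out.csv` and also check for 418 rows.

```python
import io
import pandas as pd

# Stand-in for the real submission frame (first three predictions above).
submit_demo = pd.DataFrame({"PassengerId": [892, 893, 894],
                            "Survived": [0, 0, 1]})

# Write to an in-memory buffer instead of a file, then read it back.
buffer = io.StringIO()
submit_demo.to_csv(buffer, index=False)
buffer.seek(0)
check = pd.read_csv(buffer)

# Column names must match exactly, and predictions must be 0 or 1.
columns_ok = list(check.columns) == ["PassengerId", "Survived"]
values_ok = set(check["Survived"]) <= {0, 1}
print(columns_ok, values_ok, check.shape)
```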