In [1]:
import pandas as pd

In [10]:
train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")

In [7]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


We have the following 10 features for each passenger:

* *Pclass* is the ticket class: first (1), second (2), or third (3). This is an ordinal integer feature.

* *Name* is the name of the passenger. Although the name itself reveals little about survival, the title attached to it can indicate a certain group. For example, Mrs indicates a married woman. This is a text string feature.

* *Sex* is the gender of the passenger, either female or male. This is a categorical text string feature.

* *Age* is the age of the passenger in years. This column contains NaN values, and some ages are fractional (the minimum is 0.42), so pandas stores it as a floating-point feature.

* *SibSp* is the number of siblings and spouses aboard the Titanic. Siblings include brothers, sisters, stepbrothers, and stepsisters, while spouses include husbands and wives (mistresses and fiancés were ignored). This is an ordinal integer feature.

* *Parch* is the number of parents and children aboard the Titanic. This is also an ordinal integer feature.

* *Ticket* is a character string that gives the ticket number.

* *Fare* is a float feature showing how much money the passenger paid for their trip.

* *Cabin* is the cabin number of each passenger. This column contains NaN values. This is another string feature.

* *Embarked* is the port of embarkation as a categorical character feature.
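The note on titles in *Name* can be illustrated with a short sketch. It runs on a toy frame built from names shown in `train.head()`, and it assumes that the title is the text between the comma and the first period. The `Title` column name is hypothetical and is not used elsewhere in this notebook.

```python
import pandas as pd

# Toy frame with names copied from train.head(); in the notebook, use `train` instead.
sample = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
        "Allen, Mr. William Henry",
    ]
})

# Assumption: the title sits between the comma and the first period.
sample["Title"] = (
    sample["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
)

print(sample["Title"].tolist())
```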

In summary, we have 2 floating-point features (Fare, Age), 3 ordinal integer features (Pclass, SibSp, Parch), 2 categorical features (Sex, Embarked), and 3 text string features (Ticket, Cabin, Name).
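One way to check such a summary against the data is to split the columns by dtype. The sketch below uses a toy frame built from the first three rows of `train.head()`. The variable names `numeric_cols` and `text_cols` are only illustrative.

```python
import pandas as pd

# Toy frame from the first three rows of train.head(); in the notebook, use `train` instead.
sample = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
    "Ticket": ["A/5 21171", "PC 17599", "STON/O2. 3101282"],
    "Fare": [7.25, 71.2833, 7.925],
    "Cabin": [None, "C85", None],
    "Embarked": ["S", "C", "S"],
})

# Numeric columns (integers and floats) versus everything else (strings).
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
text_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(text_cols)
```

Running `train.dtypes` in the notebook gives the same information column by column.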

In [8]:
train.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


## Missing values

Take a look at the missing values in the training data:

In [9]:
print(train.isnull().sum())

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64


Age is missing 177 values, Cabin is missing 687, and Embarked is missing 2.
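To put these counts in proportion, they can be expressed as a percentage of the 891 rows reported by `train.describe()`. This is a minimal sketch that hard-codes the counts from the output above. In the notebook, `train.isnull().mean() * 100` would compute the same figures directly.

```python
import pandas as pd

# Missing-value counts copied from the isnull().sum() output; 891 rows per describe().
n_rows = 891
missing = pd.Series({"Age": 177, "Cabin": 687, "Embarked": 2})

# Share of missing values per column, in percent.
missing_pct = (missing / n_rows * 100).round(1)
print(missing_pct)
```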

Let's fill in the empty values using the available information. First, let's look at the rows where Embarked is missing:

In [12]:
df = train[train["Embarked"].isnull()]
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
61,62,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,B28,
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,


Then we find similar passengers:

In [22]:
train[(train["Sex"]=="female") & (train["SibSp"]==0) & (train["Parch"]==0) & (train["Fare"]>70.0) & (train["Fare"]<90.0)]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
61,62,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,B28,
218,219,1,1,"Bazzani, Miss. Albina",female,32.0,0,0,11813,76.2917,D15,C
256,257,1,1,"Thorne, Mrs. Gertrude Maybelle",female,,0,0,PC 17585,79.2,,C
257,258,1,1,"Cherry, Miss. Gladys",female,30.0,0,0,110152,86.5,B77,S
290,291,1,1,"Barber, Miss. Ellen ""Nellie""",female,26.0,0,0,19877,78.85,,S
310,311,1,1,"Hays, Miss. Margaret Bechstein",female,24.0,0,0,11767,83.1583,C54,C
504,505,1,1,"Maioni, Miss. Roberta",female,16.0,0,0,110152,86.5,B79,S
627,628,1,1,"Longley, Miss. Gretchen Fiske",female,21.0,0,0,13502,77.9583,D9,S
759,760,1,1,"Rothes, the Countess. of (Lucy Noel Martha Dye...",female,33.0,0,0,110152,86.5,B77,S
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,


The similar passengers embarked from either C or S, and S is the more common of the two (5 passengers versus 3), so let's use S here:
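The choice between C and S can be backed by a simple count. The sketch below rebuilds the Embarked column of the similar passengers listed above as a toy frame, then fills the gaps with the most frequent port. Using the mode is an assumption about how to break the tie between candidates, not the only reasonable rule.

```python
import pandas as pd

# PassengerId and Embarked of the similar passengers from the table above;
# passengers 62 and 830 are the two with a missing port.
similar = pd.DataFrame({
    "PassengerId": [62, 219, 257, 258, 291, 311, 505, 628, 760, 830],
    "Embarked": [None, "C", "C", "S", "S", "C", "S", "S", "S", None],
})

# value_counts() and mode() both ignore missing values by default.
counts = similar["Embarked"].value_counts()
most_common = similar["Embarked"].mode()[0]

# Fill the missing ports with the most frequent one.
similar["Embarked"] = similar["Embarked"].fillna(most_common)
print(counts)
print(most_common)
```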

In [38]:
# Fill the two missing Embarked values by row label
train.loc[61, "Embarked"] = "S"
train.loc[829, "Embarked"] = "S"

Confirm that there are no null values left in the Embarked column:

In [41]:
print(train.isnull().sum())

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         0
dtype: int64


In [42]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


There is a good explanation of [feature extraction and derivation](https://www.kaggle.com/headsortails/pytanic) on Kaggle. To keep things as simple as possible, though, let's select only the following features: Sex, Age, SibSp, Parch, Fare, and Embarked.
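Selecting these columns could look like the sketch below, shown on a toy frame built from the first three rows of `train.head()`. The names `features`, `X`, and `y` are illustrative assumptions.

```python
import pandas as pd

# Toy frame from the first three rows of train.head(); in the notebook, use `train` instead.
sample = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
    "Fare": [7.25, 71.2833, 7.925],
    "Embarked": ["S", "C", "S"],
})

# Keep only the chosen features; Survived is the prediction target.
features = ["Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
X = sample[features]
y = sample["Survived"]

print(X.shape)
```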

## Preparing the data

Since not all classifiers can handle string input, it is good practice to convert the string columns to integers.
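As a minimal sketch of such a conversion, the string columns can be mapped to integer codes with a dictionary. The specific codes below are arbitrary assumptions. Q is included because it is the third embarkation port in the Titanic data, although it does not appear in the rows shown above.

```python
import pandas as pd

# Toy frame from the first three rows of train.head(); in the notebook, use `train` instead.
sample = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
})

# Hypothetical integer codes; any consistent assignment would work.
sex_codes = {"male": 0, "female": 1}
port_codes = {"S": 0, "C": 1, "Q": 2}

# map() returns NaN for unmapped or missing values, so astype(int) assumes
# that the missing Embarked entries have already been filled.
sample["Sex"] = sample["Sex"].map(sex_codes).astype(int)
sample["Embarked"] = sample["Embarked"].map(port_codes).astype(int)

print(sample)
```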