# Kaggle Machine Learning Competition: Predicting Titanic Survivors

Kaggle is an online platform for data science competitions (now owned by Google).

They publish datasets and ask readers to submit algorithms that produce a desired result. The desired results depend on the problem, but you never get to see the test dataset that they use for scoring.

This is one of their "introductory" data science competitions. It's particularly interesting because it raises some very important ethical questions.

If you'd like to know more about the competition, visit the original [competition site](https://www.kaggle.com/c/titanic-gettingStarted).

## Dataset

The data has been loaded into this ipython notebook for you. Please use that for training/validation.

## Competition time!

We will run a little internal competition and the winner will get to talk through their result. Feel free to work in teams, or by yourself.

## Evaluation

For each passenger in the test set, you must predict whether or not they survived the sinking (0 for deceased, 1 for survived). Your score is the percentage of passengers you correctly predict (i.e. accuracy).
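As a rough sketch of how this score works, accuracy is simply the fraction of predictions that match the true labels. The two arrays below are made-up placeholders for illustration, not competition data.

```python
import numpy as np

# Hypothetical labels and predictions, for illustration only
# (0 = deceased, 1 = survived).
y_true = np.array([1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 1, 0])

# Accuracy is the share of passengers whose outcome was predicted correctly.
accuracy = (y_true == y_pred).mean()
print(f"Accuracy: {accuracy:.0%}")
```

Here four of the five predictions match, so the score is 80%.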

We will compare scores at the end.

Good luck!

## Data Set

<pre>
VARIABLE DESCRIPTIONS:
survival        Survival
                (0 = No; 1 = Yes)
pclass          Passenger Class
                (1 = 1st; 2 = 2nd; 3 = 3rd)
name            Name
sex             Sex
age             Age
sibsp           Number of Siblings/Spouses Aboard
parch           Number of Parents/Children Aboard
ticket          Ticket Number
fare            Passenger Fare
cabin           Cabin
embarked        Port of Embarkation
                (C = Cherbourg; Q = Queenstown; S = Southampton)

SPECIAL NOTES:
Pclass is a proxy for socio-economic status (SES)
 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

Age is in Years; Fractional if Age less than One (1)
 If the Age is Estimated, it is in the form xx.5

With respect to the family relation variables (i.e. sibsp and parch)
some relations were ignored.  The following are the definitions used
for sibsp and parch.

Sibling:  Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
Spouse:   Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
Parent:   Mother or Father of Passenger Aboard Titanic
Child:    Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

Other family relatives excluded from this study include cousins,
nephews/nieces, aunts/uncles, and in-laws.  Some children travelled
only with a nanny, therefore parch=0 for them.  As well, some
travelled with very close friends or neighbors in a village, however,
the definitions do not support such relations.
</pre>

## Setup Imports and Variables

In [56]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set the global default size of matplotlib figures
plt.rc('figure', figsize=(10, 5))

# Size of matplotlib figures that contain subplots
figsize_with_subplots = (10, 10)

# Size of matplotlib histogram bins
bin_size = 10

## Explore the Data

Read the data:

In [57]:
X_train = pd.read_csv('data/titanic_train.csv')
X_train.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 268 | 1 | 3 | Persson, Mr. Ernst Ulrik | male | 25.0 | 1 | 0 | 347083 | 7.775 | NaN | S |
| 1 | 578 | 1 | 1 | Silvey, Mrs. William Baird (Alice Munger) | female | 39.0 | 1 | 0 | 13507 | 55.9 | E44 | S |
| 2 | 134 | 1 | 2 | Weisz, Mrs. Leopold (Mathilde Francoise Pede) | female | 29.0 | 1 | 0 | 228414 | 26.0 | NaN | S |
| 3 | 725 | 1 | 1 | Chambers, Mr. Norman Campbell | male | 27.0 | 1 | 0 | 113806 | 53.1 | E8 | S |
| 4 | 58 | 0 | 3 | Novel, Mr. Mansouer | male | 28.5 | 0 | 0 | 2697 | 7.2292 | NaN | C |


In [58]:
X_train.tail()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 704 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 705 | 203 | 0 | 3 | Johanson, Mr. Jakob Alfred | male | 34.0 | 0 | 0 | 3101264 | 6.4958 | NaN | S |
| 706 | 327 | 0 | 3 | Nysveen, Mr. Johan Hansen | male | 61.0 | 0 | 0 | 345364 | 6.2375 | NaN | S |
| 707 | 843 | 1 | 1 | Serepeca, Miss. Augusta | female | 30.0 | 0 | 0 | 113798 | 31.0 | NaN | C |
| 708 | 560 | 1 | 3 | de Messemaeker, Mrs. Guillaume Joseph (Emma) | female | 36.0 | 1 | 0 | 345572 | 17.4 | NaN | S |


View the data types of each column:

In [59]:
X_train.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

Object types (usually strings) are a problem for most algorithms (some tree-based implementations can handle categorical values directly), so you'll usually have to convert these into usable numeric values.
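A minimal sketch of two common conversions follows, using a few `Sex` and `Embarked` values copied from the `head()` output above. The `sample` frame, the `Sex_num` column name, and the 0/1 coding are illustrative assumptions; in the notebook you would apply the same calls to `X_train`.

```python
import pandas as pd

# A few values copied from the head() output above.
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male"],
    "Embarked": ["S", "S", "S", "S", "C"],
})

# Two-valued category: map each label to an integer
# (which label gets 0 and which gets 1 is an arbitrary choice).
sample["Sex_num"] = sample["Sex"].map({"male": 0, "female": 1})

# Multi-valued category: one indicator column per port (one-hot encoding).
encoded = pd.get_dummies(sample, columns=["Embarked"], prefix="Embarked")
print(encoded)
```

One-hot encoding avoids implying an order between ports, which a single integer code (C=0, Q=1, S=2) would do.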

Get some basic information on the DataFrame:

In [60]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 709 entries, 0 to 708
Data columns (total 12 columns):
PassengerId    709 non-null int64
Survived       709 non-null int64
Pclass         709 non-null int64
Name           709 non-null object
Sex            709 non-null object
Age            566 non-null float64
SibSp          709 non-null int64
Parch          709 non-null int64
Ticket         709 non-null object
Fare           709 non-null float64
Cabin          151 non-null object
Embarked       707 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 66.5+ KB


We can see that Age, Cabin and Embarked have missing (null) values. These will have to be imputed or removed.
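To quantify the gaps per column, something like the following sketch can help. It uses five rows copied from the `head()` output above in a stand-in frame called `sample` (a name chosen here for illustration); on the real data you would call the same methods on `X_train`.

```python
import numpy as np
import pandas as pd

# Values copied from the head() output above; NaN marks a missing cabin.
sample = pd.DataFrame({
    "Age": [25.0, 39.0, 29.0, 27.0, 28.5],
    "Cabin": [np.nan, "E44", np.nan, "E8", np.nan],
    "Embarked": ["S", "S", "S", "S", "C"],
})

# Number of missing values per column, and the same as a fraction of rows.
missing = sample.isnull().sum()
missing_share = sample.isnull().mean()
print(pd.DataFrame({"missing": missing, "share": missing_share}))
```

On the full training set, the `info()` output above implies 143 missing `Age` values (709 - 566), 558 missing `Cabin` values (709 - 151) and 2 missing `Embarked` values (709 - 707).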

Generate various descriptive statistics on the DataFrame:

In [61]:
X_train.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 709.0 | 709.0 | 709.0 | 566.0 | 709.0 | 709.0 | 709.0 |
| mean | 448.06347 | 0.35402 | 2.330042 | 29.771201 | 0.51763 | 0.368124 | 30.678466 |
| std | 256.783104 | 0.478553 | 0.830875 | 14.648229 | 1.097654 | 0.803347 | 49.01246 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 230.0 | 0.0 | 2.0 | 20.0 | 0.0 | 0.0 | 7.8958 |
| 50% | 451.0 | 0.0 | 3.0 | 29.0 | 0.0 | 0.0 | 13.8625 |
| 75% | 664.0 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 30.5 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


Now that we have a general idea of what the dataset looks like, I would recommend that you use the following steps as a guideline. But of course, it's entirely up to you as to which approach you take.

## Feature Analysis

Go through each feature and create some visual aids to help with analysis. I would start with histograms. For continuous features with many distinct values (e.g. age), you may want to use a kernel density estimate rather than a histogram.

Then remember the goal: predicting whether a passenger survived. For each feature, create a bar chart that splits every value of the feature into its survived and not-survived counts.
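The two kinds of plot described above could be sketched roughly as follows. The `sample` frame holds the ten rows shown in the `head()` and `tail()` outputs, and the choice of `Age` and `Sex` as example features is only an illustration; swap in `X_train` and whichever feature you are studying.

```python
import matplotlib.pyplot as plt
import pandas as pd

# The ten rows shown in the head() and tail() outputs above.
sample = pd.DataFrame({
    "Survived": [1, 1, 1, 1, 0, 0, 0, 0, 1, 1],
    "Sex": ["male", "female", "female", "male", "male",
            "male", "male", "male", "female", "female"],
    "Age": [25.0, 39.0, 29.0, 27.0, 28.5, 22.0, 34.0, 61.0, 30.0, 36.0],
})

fig, (ax_age, ax_sex) = plt.subplots(1, 2, figsize=(10, 5))

# Distribution of a single feature. Using kind="kde" instead of "hist"
# draws a kernel density estimate (this needs SciPy installed).
sample["Age"].plot(kind="hist", bins=10, ax=ax_age, title="Age")

# Count survivors and non-survivors for each value of the feature,
# then stack them so each bar shows both components.
counts = pd.crosstab(sample["Sex"], sample["Survived"])
counts.plot(kind="bar", stacked=True, ax=ax_sex, title="Survival by Sex")

fig.tight_layout()
```

In a notebook the figure renders inline after the cell runs. The `counts` table is worth printing too, since the raw numbers are often easier to compare than the bars.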

## Data Cleaning

Once you have a thorough understanding of the data, it is time to clean it up. You will have to choose what to do for each feature: drop rows/columns entirely? Impute values? Encode a value for the null?

If you're finding this hard, it might be easier to just start with a few features that are complete, or near complete.
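As one possible starting point, the sketch below applies each of the three options to a different column. The rows are made up for illustration (they are not taken from the dataset), the `HasCabin` column name is an assumption, and which strategy suits which feature is a judgement call you should revisit.

```python
import numpy as np
import pandas as pd

# Made-up rows with gaps, for illustration only.
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0],
    "Cabin": [np.nan, "E44", np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "S"],
})

cleaned = sample.copy()

# Impute: replace missing ages with the median of the observed ages.
cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].median())

# Encode the null: keep "was a cabin recorded?" as a 0/1 feature,
# then drop the sparse original column.
cleaned["HasCabin"] = cleaned["Cabin"].notnull().astype(int)
cleaned = cleaned.drop(columns=["Cabin"])

# Drop rows: when only a handful of values are missing,
# removing those rows costs little.
cleaned = cleaned.dropna(subset=["Embarked"])
print(cleaned)
```

Whatever you choose, remember that the same transformations must be applied to any data you later predict on.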

## Modelling

This is the fun part. Start picking models that you can train your data upon. I'd recommend starting with something easy and as you gain confidence start considering more complex models.

Bear in mind that you will spend a lot of time going backwards and forwards to re-clean the data and tune the algorithm, so don't try to compare too many models. Pick a model and understand its weaknesses before you move on.
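A simple first model might look like the sketch below. It assumes scikit-learn is available (the notebook does not prescribe a library) and uses logistic regression purely as an example of "something easy". The ten rows come from the `head()` and `tail()` outputs above, so the resulting score is meaningless; the point is the fit/validate pattern, which you would run on the full `X_train`.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# The ten rows shown in the head() and tail() outputs above.
sample = pd.DataFrame({
    "Survived": [1, 1, 1, 1, 0, 0, 0, 0, 1, 1],
    "Pclass": [3, 1, 2, 1, 3, 3, 3, 3, 1, 3],
    "Sex": ["male", "female", "female", "male", "male",
            "male", "male", "male", "female", "female"],
    "Fare": [7.775, 55.9, 26.0, 53.1, 7.2292, 7.25, 6.4958, 6.2375, 31.0, 17.4],
})

# Start with a few complete features; encode Sex as a 0/1 column.
X = sample[["Pclass", "Fare"]].assign(
    IsFemale=(sample["Sex"] == "female").astype(int)
)
y = sample["Survived"]

# Hold back part of the data so the score reflects unseen passengers.
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_fit, y_fit)

# score() returns accuracy, the same metric used for the competition.
val_accuracy = model.score(X_val, y_val)
print(f"Validation accuracy: {val_accuracy:.2f}")
```

Keeping a held-out validation split makes it easier to tell whether a cleaning or tuning change actually helped.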

You may even consider taking an entirely statistical approach at this point. A Bayesian interpretation of the data could yield some very interesting results (although this is probably more difficult at this stage; this isn't a statistics course :-) ).

## Evaluation

When you have finished, we will compare everyone's models. I will provide some code that you can all run.

To set expectations:
    
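- 50% is just a random guess. Always predicting "deceased" already scores about 65% on the training data (only about 35% of those passengers survived), so treat that as your baseline.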
- 75-85% is pretty good. You should be around this range.
- 100% is world-beating, but possible. If you get to 100% you've probably done something wrong. :-D
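To put these expectations in context, the arithmetic behind two naive baselines can be sketched from the `describe()` output above. The survival rate is the training-set mean of `Survived`; the class balance of the scoring set may differ, so treat the figures as approximate.

```python
# Mean of the Survived column, taken from the describe() output above.
survival_rate = 0.35402

# A fair coin flip is right half the time, whatever the class balance.
coin_flip_accuracy = 0.5 * survival_rate + 0.5 * (1 - survival_rate)

# Always predicting the more common outcome (deceased) does better.
majority_accuracy = max(survival_rate, 1 - survival_rate)

print(f"Coin flip:      {coin_flip_accuracy:.1%}")
print(f"Majority class: {majority_accuracy:.1%}")
```

A model is only adding value once it beats the majority-class figure, which is roughly 65% here.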