# Kaggle's Titanic Data Science Starter Competition

Exploring the data set and putting together solution(s) for Kaggle's starter competition [Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic).

## Data pre-processing

We need to load the training data set and decide on pre-processing steps, including filling in missing data and normalizing values.

### Loading the data

In [1]:
import pandas as pd

training_data = pd.read_csv('train.csv')

training_data.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | NaN | S |


### Correcting missing values and scaling

Looks like a few columns have missing data.

In [3]:
# len() counts rows; DataFrame.size would return rows * columns instead
print("Total rows: {}".format(len(training_data)))

print("Variables with missing rows:")
training_data.isnull().sum()

Total rows: 891
Variables with missing rows:


PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
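Raw counts are easier to judge as a share of the rows. Here is a minimal sketch of that conversion, using a small made-up frame in place of `train.csv` (the `toy` values are invented for illustration only):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data; values are invented for illustration.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

# isnull() gives a boolean frame; the column-wise mean is the fraction missing.
missing_fraction = toy.isnull().mean().sort_values(ascending=False)
print(missing_fraction)
```

The same expression applied to `training_data` should give the per-column share of missing values for the real data.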

Looking at this and the overall shape of the data, here's my initial plan:

- Pclass: keep as is (the ordinal value should work, even though it's inverted: a higher number is a lower class cabin)
- Name: omit (could try some fancy stuff like inferring ethnicity, but skip for now)
- Sex: code to 0 / 1
- Age: replace missing with median 
- SibSp: keep as is
- Parch: keep as is
- Ticket: omit (doesn't seem like low hanging fruit, could look more closely for pattern later)
- Fare: keep as is (fare could be a finer-grained proxy for socioeconomic status, sense of entitlement / power in getting on a boat)
- Cabin: omit (about 77% are missing values, 687 of 891; could look more closely later. One idea would be to break this out into a few boolean variables for major section, A->E, in case proximity to lifeboats or something like that ends up being predictive)
- Embarked: omit (could one-hot encode it, but can't see how this would affect survivorship, let's be lazy to start)
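The two ideas deferred above, boolean section flags for Cabin and one-hot encoding for Embarked, could look roughly like the sketch below. It assumes the first letter of the cabin code identifies the section, which the data has not been checked against, and it uses a few made-up rows rather than the real file.

```python
import pandas as pd

# Toy rows; values are invented for illustration, not taken from train.csv.
toy = pd.DataFrame({
    "Cabin": ["C85", None, "E46", "C123"],
    "Embarked": ["C", "S", "S", "Q"],
})

# Assumption: the first letter of the cabin code is the section (deck).
# Missing cabins stay missing.
deck = toy["Cabin"].str[0]

# One boolean column per deck / port; rows with a missing cabin get all False.
deck_flags = pd.get_dummies(deck, prefix="Deck", dtype=bool)
port_flags = pd.get_dummies(toy["Embarked"], prefix="Embarked", dtype=bool)

features = pd.concat([deck_flags, port_flags], axis=1)
print(features)
```

Leaving missing cabins as all-False flags avoids inventing a deck for passengers without one.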

Let's sketch out a helper function to construct a preprocessor so that we can reuse it later on the test dataset before evaluating the model.

In [None]:
def make_preprocesser(training_data):
    """
    Constructs a preprocessing function ready to apply to new dataframes.
    
    Crucially, the values interpolated from the training data set are
    remembered so they can be applied to test datasets (e.g. the median age
    that is used to fill in missing values for 'Age' is fixed based on the
    median age within the training data set).
    
    Params:
        training_data: pandas.DataFrame containing the training data
    """
    
    def preprocess(df, scale=True):
        """
        Preprocesses a dataframe so it is ready for use with a model (either for training or prediction).
        
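        Params:
            df: pandas.DataFrame to preprocess
            scale: whether to apply feature scaling. E.g. with random forests feature scaling isn't necessary.
        """
        # Placeholder: no preprocessing steps are implemented yet, so df is returned unchanged.
        return df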
    
    return preprocess
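One way the stub might be filled in, following the plan listed earlier, is sketched below. The name `make_preprocessor_sketch`, the choice of standardization (subtract the training mean, divide by the training standard deviation) for `scale=True`, and the toy frames are all assumptions made for illustration, not the notebook's final implementation.

```python
import pandas as pd


def make_preprocessor_sketch(training_data):
    """Hypothetical filled-in variant of the stub above."""
    features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]

    # Statistics are computed once, from the training data only.
    age_median = training_data["Age"].median()

    def encode(df):
        out = df[features].copy()
        out["Sex"] = (out["Sex"] == "female").astype(int)  # male -> 0, female -> 1
        out["Age"] = out["Age"].fillna(age_median)
        return out.astype(float)

    encoded_train = encode(training_data)
    means = encoded_train.mean()
    stds = encoded_train.std().replace(0, 1.0)  # avoid dividing by zero

    def preprocess(df, scale=True):
        out = encode(df)
        if scale:
            # Assumed scaling: standardize using the training statistics.
            out = (out - means) / stds
        return out

    return preprocess


# Toy frames; values are invented for illustration.
train = pd.DataFrame({
    "Pclass": [3, 1, 3, 1],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, float("nan"), 54.0],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 0, 0, 1],
    "Fare": [7.25, 71.28, 7.93, 51.86],
})
test = pd.DataFrame({
    "Pclass": [2], "Sex": ["female"], "Age": [float("nan")],
    "SibSp": [0], "Parch": [0], "Fare": [13.0],
})

preprocess = make_preprocessor_sketch(train)
print(preprocess(test, scale=False))
```

Because the median and the scaling statistics are captured in the closure, the missing age in `test` is filled with the training median rather than a value computed from the test rows.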