# Posten hackathon starter kit
This kernel is meant as a starting point for the Posten hackathon on `17.12.18`.

# Titanic data set preparations

Start by importing the libraries we will use and read the data sets from the disk.

- The train data set is the part of the data for which we know whether each passenger survived
- The task is to predict which passengers in the test data set survived

It's easy to solve this using Google, but a lot more fun to solve it using machine learning.

This notebook contains a demo of a simple machine learning model and some visualizations that you can use to predict a result. Feel free to fork the notebook and improve it by making the model consider more attributes of the passengers.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # data visualization
import seaborn as sns # data visualization
from matplotlib import cm

train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')

# Investigating the data sets

The data has been loaded as pandas dataframes; see the full [API documentation](http://pandas.pydata.org/pandas-docs/stable/) for everything they can do.

Here's a short demo of what pandas can do.

In [None]:
train.info() # basic information about the data set, such as how many values are non-null and how much memory the data occupies

We can see there are 891 rows (passengers) in the data set, and we have 12 attributes to work with. Some are numeric and others are strings (`object`). Not all the attributes are complete, e.g. `Age`, where we only have 714 non-null values.
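
To see the gaps at a glance, it can help to count the missing values per column. The sketch below uses a tiny made-up frame (`toy`, with invented values) rather than the real `train` data, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for `train`; the values are illustrative only
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.28, 7.92, 8.05],
})

# Number of missing values per column, largest first
missing = toy.isnull().sum().sort_values(ascending=False)
print(missing)
```

Running the same two calls on `train` should list the incomplete attributes first.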

# Checking out correlations of the numeric variables

In [None]:
plt.figure(figsize=(18, 12)) # make the plot 18 by 12 inches
sns.heatmap(train.select_dtypes(include='number').corr(), cmap=cm.coolwarm) # plot the correlations of the numeric columns only

Studying the correlation matrix is a good way to find out which variables can help us predict whether a passenger survived. Keep in mind that a strong negative correlation (the dark blue squares) is still a strong correlation. Values close to 0 are the ones we don't care too much about.

`Sex` is stored as a string, so it only shows up in the matrix once it has been encoded as a number. When it is, the Survived column correlates strongly with it, meaning we can tell a lot about whether a person survived by looking at their gender. We'll use this to our advantage later.
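
Because `Sex` holds strings, it has to be turned into a number before it can take part in a correlation. The sketch below shows one way to do that on made-up data; the `is_female` column name and the toy values are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Made-up passengers, not rows from the real data set
toy = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 1, 0, 0, 1],
    'Sex': ['male', 'female', 'female', 'male', 'female', 'male', 'female', 'male'],
})

# Encode the string column as 0/1 so it can be correlated
toy['is_female'] = (toy.Sex == 'female').astype(int)

# Pearson correlation between the encoded column and the target
sex_corr = toy['Survived'].corr(toy['is_female'])
print(sex_corr)
```

Adding a column like this to `train` before calling `corr()` makes gender appear in the heatmap alongside the other numeric attributes.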

# Feature extraction

Start by concatenating the train and test sets to make sure we apply the same transformations to both.

In [None]:
df = pd.concat([train, test], ignore_index=True) # build a fresh index so row labels stay unique

Extract the title from the `Name` column and group the titles into categories.

In [None]:
df['title'] = df.Name.str.split(',').str[1].str.split('.').str[0].str.strip()
plt.figure(figsize = (15,10))
plt.yscale('log')
sns.countplot(x = 'title', data = df, hue = 'Survived')
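
The chained string operations above are easier to follow on a few sample names. This sketch applies the same chain to three names written in the data set's `Surname, Title. Given names` format.

```python
import pandas as pd

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
])

# Take the part after the comma, then the part before the first period,
# and strip the surrounding whitespace
titles = names.str.split(',').str[1].str.split('.').str[0].str.strip()
print(titles.tolist())  # ['Mr', 'Miss', 'the Countess']
```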

In [None]:
df.groupby('title')['Age'].median()
titlemap = {
    'Don': 'Noble',
    'Dona': 'Noble',
    'Mme': 'Mrs',
    'Mlle': 'Miss',
    'Sir': 'Noble',
    'Jonkheer': 'Noble',
    'Lady': 'Noble',
    'the Countess': 'Noble',
    'Major': 'Military',
    'Capt': 'Military',
    'Col': 'Military',
    'Dr': 'Military',
    'Ms': 'Miss'
}
df['titlecat'] = df.title.apply(lambda title: titlemap.get(title, title))
plt.figure(figsize = (15,10))
df.groupby('titlecat')['Survived'].mean().sort_values(ascending = False).plot.bar()

In [None]:
titlecats = df.groupby('titlecat')['Survived'].mean().sort_values(ascending = False).index.tolist()
df['titleno'] = df.titlecat.apply(lambda t: titlecats.index(t))
df.head()

Use the title category to fill in missing ages.

In [None]:
df['titlecatage'] = df.groupby('titlecat')['Age'].transform(lambda group: group.median())
df['Age'] = df.Age.fillna(df.titlecatage)
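
Here is the same fill-by-group idea on a small made-up frame, so the effect is easy to check by hand. It is a sketch with invented values; `toy` and `group_median` are illustrative names.

```python
import numpy as np
import pandas as pd

# Made-up ages; two of them are missing
toy = pd.DataFrame({
    'titlecat': ['Mr', 'Mr', 'Mr', 'Miss', 'Miss', 'Master'],
    'Age': [30.0, 40.0, np.nan, 20.0, np.nan, 4.0],
})

# Median age of each title category, repeated for every row in that category
group_median = toy.groupby('titlecat')['Age'].transform('median')

# Only the missing ages are replaced; known ages are left alone
toy['Age'] = toy.Age.fillna(group_median)
print(toy)
```

The missing `Mr` gets 35 (the median of 30 and 40) and the missing `Miss` gets 20.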

In [None]:
# One fare is missing; that passenger travelled in third class, so fill in the third-class median fare
df.loc[df.Fare.isnull(), 'Fare'] = df[df.Pclass == 3].Fare.median()

In [None]:
# Find the passenger group size; people who shared a ticket traveled together
df['groupsize'] = df.groupby('Ticket')['Ticket'].transform(len)
df['alone'] = (df.groupsize == 1) & (df.SibSp == 0) & (df.Parch == 0)
sns.countplot(x = 'alone', data = df, hue = 'Survived')
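
A small made-up example makes the group-size logic easier to verify. The ticket numbers below are invented, and the sketch uses `transform('size')`, which should give the same counts as `transform(len)`.

```python
import pandas as pd

# Invented tickets: two people on A1, one on B2, three on C3
toy = pd.DataFrame({
    'Ticket': ['A1', 'A1', 'B2', 'C3', 'C3', 'C3'],
    'SibSp': [1, 1, 0, 0, 0, 0],
    'Parch': [0, 0, 0, 0, 0, 0],
})

# Number of passengers sharing each ticket, repeated on every row
toy['groupsize'] = toy.groupby('Ticket')['Ticket'].transform('size')

# Alone means: own ticket, no siblings/spouse, no parents/children
toy['alone'] = (toy.groupsize == 1) & (toy.SibSp == 0) & (toy.Parch == 0)
print(toy)
```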

In [None]:
# Group passengers into age buckets
df['agebucket'] = pd.cut(df.Age, bins = [df.Age.min(), 15, 23, 40, df.Age.max()], include_lowest=True).cat.codes
sns.countplot(x='agebucket', hue='Survived', data=df)
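
To see where the bucket edges fall, here is `pd.cut` with the same bin boundaries applied to a handful of made-up ages. Each bin includes its right edge, and `include_lowest=True` keeps the youngest passenger in the first bucket.

```python
import pandas as pd

# Made-up ages chosen to sit on and around the bin edges
ages = pd.Series([0.5, 15, 16, 23, 24, 40, 41, 80])

buckets = pd.cut(ages, bins=[ages.min(), 15, 23, 40, ages.max()], include_lowest=True)

# Integer code of the bucket each age falls into
codes = buckets.cat.codes
print(codes.tolist())  # [0, 0, 1, 1, 2, 2, 3, 3]
```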

In [None]:
# Two British passengers have no value for "embarked", assume Southampton
df.loc[df.Embarked.isnull(), 'Embarked'] = 'S'

In [None]:
# Make numeric category for embarked
embs = df.groupby('Embarked')['Survived'].mean().sort_values(ascending = False).index.tolist()
df['embarkedno'] = df.Embarked.apply(lambda t: embs.index(t))
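
The numbering used here (and for titles and decks) ranks each category by its survival rate, so 0 is the category with the highest rate. A sketch on made-up data, where `order` is an illustrative name:

```python
import pandas as pd

# Invented data: everyone from C survived, half from Q, nobody from S
toy = pd.DataFrame({
    'Embarked': ['S', 'S', 'C', 'C', 'Q', 'Q'],
    'Survived': [0, 0, 1, 1, 1, 0],
})

# Categories sorted from highest to lowest mean survival
order = toy.groupby('Embarked')['Survived'].mean().sort_values(ascending=False).index.tolist()

# Replace each category by its position in that ranking
toy['embarkedno'] = toy.Embarked.apply(lambda e: order.index(e))
print(order, toy.embarkedno.tolist())
```

Rows from the test set have no `Survived` value; `mean()` skips them, so only the training rows decide the ranking.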

In [None]:
# Find passenger deck from cabin number (use T for unknown, there was only one T in the data set)
df['deck'] = df.Cabin.str[0].fillna('T')
decks = df.groupby('deck')['Survived'].mean().sort_values(ascending = False).index.tolist()
df['deckno'] = df.deck.apply(lambda t: decks.index(t))
df['gooddeck'] = df.deckno <= 4
sns.countplot(x='deckno', data=df, hue='Survived')

In [None]:
# Are there any parents/children?
df['hadparch'] = df.Parch != 0

In [None]:
df['fareperperson'] = df.Fare / df.groupsize

In [None]:
df['ismarried'] = (df.SibSp > 0) & (df.titlecat.isin({'Mr', 'Mrs'}))
df.groupby(['Sex', 'ismarried'])['Survived'].mean().plot.bar()

# Making a simple model

It definitely looks like gender is very predictive of whether someone survived, so we should use it in a model.

We will use scikit-learn to demonstrate; see the [supervised learning](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning) section of the documentation for detailed information about how this works.
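
As a point of reference, here is a minimal sketch of the scikit-learn fit/predict pattern using gender as the only feature. The data is made up and the one-level decision tree is just an assumption chosen to keep the example small; it is not the model this notebook ends up using.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up passengers, not rows from the real data set
toy = pd.DataFrame({
    'Sex': ['male', 'male', 'male', 'male', 'female', 'female', 'female', 'female'],
    'Survived': [0, 0, 0, 1, 1, 1, 1, 0],
})

# Features must be numeric, so encode Sex as 0 (male) / 1 (female)
X = (toy[['Sex']] == 'female').astype(int)
y = toy.Survived

# A tree with a single split can only learn one rule about gender
model = DecisionTreeClassifier(max_depth=1, random_state=0)
model.fit(X, y)

predictions = model.predict(X)
accuracy = (predictions == y).mean()
print(predictions, accuracy)
```

A baseline like this gives a score to beat: any extra feature is only worth keeping if it improves on the gender-only result.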

In [None]:
import sklearn.tree
import sklearn.linear_model
import sklearn.ensemble

In [None]:
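import sklearn.compose
import sklearn.feature_selection # needed for SelectKBest and chi2 below
import sklearn.model_selection # needed for GridSearchCV and cross_val_score below
import sklearn.pipeline
import sklearn.preprocessing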

In [None]:
sns.countplot(x='Parch', data=df, hue='Survived')

In [None]:
plt.figure(figsize=(15,8))
df.groupby('Age')['Survived'].mean().plot.bar()

In [None]:
transformer = sklearn.compose.ColumnTransformer([
    ('cat', sklearn.preprocessing.OneHotEncoder(categories='auto'), ['deck', 'ismarried', 'Embarked', 'agebucket', 'gooddeck', 'hadparch', 'Sex', 'Pclass', 'title', 'alone']),
    ('num', sklearn.preprocessing.MinMaxScaler(), ['Age', 'fareperperson'])
])
transformer.fit(df)

model = sklearn.pipeline.Pipeline([
    ('k_best', sklearn.feature_selection.SelectKBest(sklearn.feature_selection.chi2)),
    ('classifier', sklearn.ensemble.RandomForestClassifier())
])

parameters = {
    'classifier__n_estimators': [15,20,25,30],
    'classifier__max_depth': [5, 6, 7, 8],
    'k_best__k': [26,28,30,32,34,36,38]
}

search = sklearn.model_selection.GridSearchCV(model, parameters, cv=5)

train = df[df.Survived.notnull()].sample(frac=1).copy()
X_train = transformer.transform(train)
y_train = train.Survived

search.fit(X_train, y_train)
tree = search.best_estimator_ # despite the name, this is the whole pipeline: feature selection plus random forest

print(search.best_params_)

import sklearn.model_selection

sklearn.model_selection.cross_val_score(tree, X_train, y_train, cv=5)

In [None]:
tree.fit(X_train, y_train)

# Submit to Kaggle

Let's submit our prediction to Kaggle. We'll do that by predicting on the test data set, for which we don't know the answer, and writing the result to a CSV file.

To do the prediction, we need to make the exact same changes to the `test` data set as we've done to the `train` data set, or our model won't be able to make sense of it.
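
One way to guarantee identical treatment is to fit a transformer once and reuse it for both sets. The sketch below shows the idea on two made-up frames; `toy_train` and `toy_test` are illustrative names, and `handle_unknown='ignore'` is an assumption added so that categories unseen during fitting don't raise an error.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Made-up frames standing in for the train and test sets
toy_train = pd.DataFrame({'Sex': ['male', 'female', 'female'], 'Age': [20.0, 30.0, 40.0]})
toy_test = pd.DataFrame({'Sex': ['female', 'male'], 'Age': [25.0, 35.0]})

transformer = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['Sex']),
    ('num', MinMaxScaler(), ['Age']),
])

# Learn the categories and the scaling from one frame...
X_train = transformer.fit_transform(toy_train)
# ...and apply exactly the same encoding to the other
X_test = transformer.transform(toy_test)

print(X_train.shape, X_test.shape)
```

Both outputs have the same number of columns in the same order, which is what the model needs to make sense of the test data.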

In [None]:
# Fit the model to the training data set
tree.fit(X_train, y_train)
test = df[df.Survived.isnull()].copy()
test['Survived'] = tree.predict(transformer.transform(test)).astype(int)

test[['PassengerId', 'Survived']].head()

Now that we've written the `Survived` attribute for the test set, it's time to submit to Kaggle. We can do that by writing a CSV file to the current working directory, containing _only_ the `PassengerId` and `Survived` columns:

In [None]:
test[['PassengerId', 'Survived']].to_csv('predictions.csv', header=True, index=False)

You can't actually see the file yet; you'll have to commit the notebook and run it first. When you've done that, you can see the output and submit it to the competition. However, there's a limit to how often you can do this, so please use the train data set and cross-validation to check whether your model has improved.
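
A quick local check could look like the sketch below. It runs on synthetic data from `make_classification` rather than the Titanic set, and the hyperparameters are arbitrary assumptions, so only the pattern carries over.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the prepared training data
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

model = RandomForestClassifier(n_estimators=20, max_depth=5, random_state=0)

# Accuracy on each of 5 held-out folds
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())
```

Comparing the mean score before and after a change, while keeping an eye on the spread between folds, gives a rough idea of whether the change helped without spending a submission.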

# Where to next?

I've already submitted this, so I know it'll give a decent score, but not a great one. There are a lot of things we can do to improve on it, but that's what I'm hoping you will do today. Here are some ideas:

- See if another classifier will do better. The random forest used here is only one option among many.
- The model only uses the attributes prepared above. See if you can find other attributes, or combinations of them, that improve it.
- There is a lot of data in the strings. We've already pulled the title out of the `Name` column; see if you can extract anything else useful from them.

Can you get to 80%?  It's certainly possible, but it'll take a lot of work and some smart problem solving.

**Good luck, and have fun**