# Exploring the Titanic dataset

In [None]:
import sklearn

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
titanic_df = pd.read_csv('datasets/titanic_train.csv')
titanic_df.head(10)

### Dataset description

1.  Survived - Survival (0 = No; 1 = Yes). Not included in the test.csv file.
2.  Pclass - Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
3.  Name - Name
4.  Sex - Sex
5.  Age - Age
6.  SibSp - Number of Siblings/Spouses Aboard
7.  Parch - Number of Parents/Children Aboard
8.  Ticket - Ticket Number
9.  Fare - Passenger Fare
10. Cabin - Cabin
11. Embarked - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

In [None]:
titanic_df.shape

We have 891 records and 12 columns.

Some of these columns contain no useful information for our analysis, so we will drop them.

In [None]:
# Pass the names via the columns keyword; recent pandas versions reject a positional axis argument
titanic_df.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin'], inplace=True)
titanic_df.head()

When working with large datasets it is quite common to have records with missing data, so we will check for any records that have missing fields.

In [None]:
titanic_df[titanic_df.isnull().any(axis=1)].count()

As we can see, there are quite a few records with missing data. Different techniques can be used to estimate (impute) the missing values.
For this example we will simply omit any records with missing data.
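For comparison, here is a minimal sketch of one such technique: filling gaps with the column median instead of dropping rows. The `toy_df` DataFrame and its values are made up for illustration and are not taken from the Titanic file.

```python
import numpy as np
import pandas as pd

# Made-up toy data standing in for the real dataset (assumption)
toy_df = pd.DataFrame({
    'Age': [22.0, np.nan, 30.0, 40.0],
    'Fare': [7.25, 71.28, 8.05, np.nan],
})

# Count the missing values in each column
missing_per_column = toy_df.isnull().sum()

# Replace each gap with the median of its own column
imputed_df = toy_df.fillna(toy_df.median())
```

Median imputation keeps every record, at the cost of flattening the variation in the filled column.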

In [None]:
titanic_df = titanic_df.dropna()

In [None]:
titanic_df.shape

In [None]:
titanic_df[titanic_df.isnull().any(axis=1)].count()

We can use the `describe` method to get some statistical information about our dataset.

In [None]:
titanic_df.describe()

Notice that the mean of the 'Survived' column is 0.404494, which means only about 40% of the passengers in our data survived. (This is higher than the 31% survival rate for all passengers.)
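The mean can be read as a survival rate because the column only holds 0s and 1s, so the mean equals the share of 1s. A small sketch with made-up values (the `survived` series is an assumption for illustration):

```python
import pandas as pd

# Made-up 0/1 outcomes: 2 survivors out of 5 passengers
survived = pd.Series([1, 0, 0, 1, 0])

# For a 0/1 column the mean is the fraction of ones
survival_rate = survived.mean()
share_of_ones = (survived == 1).sum() / len(survived)
```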

In [None]:
fig, ax = plt.subplots(figsize=(12, 8))

plt.scatter(titanic_df['Age'], titanic_df['Survived'])

plt.xlabel('Age')
plt.ylabel('Survived')

This shows us that there is little to no correlation between a passenger's age and whether the passenger survived.

In [None]:
fig, ax = plt.subplots(figsize=(12, 8))

plt.scatter(titanic_df['Fare'], titanic_df['Survived'])

plt.xlabel('Fare')
plt.ylabel('Survived')

We can see that there are a few outliers: passengers who paid much higher fares and survived.

As we are looking at this in terms of binary classification, a scatterplot is of very little use.
Let's try looking at the data in a crosstab table.

In [None]:
pd.crosstab(titanic_df['Sex'], titanic_df['Survived'])

We can see from this that there was a much higher survival rate for women than for men.
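Raw counts can be hard to compare when the groups differ in size. As a sketch, `pd.crosstab` accepts `normalize='index'` to turn each row into proportions. The `toy_df` data below is made up for illustration.

```python
import pandas as pd

# Made-up toy data: 3 women (2 survived) and 4 men (1 survived)
toy_df = pd.DataFrame({
    'Sex': ['female', 'female', 'female', 'male', 'male', 'male', 'male'],
    'Survived': [1, 1, 0, 0, 0, 1, 0],
})

# normalize='index' divides each row by its row total, so each row
# shows the share of that group that did (1) or did not (0) survive
rates = pd.crosstab(toy_df['Sex'], toy_df['Survived'], normalize='index')
```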

In [None]:
pd.crosstab(titanic_df['Pclass'], titanic_df['Survived'])

We can use one of pandas' built-in functions to view correlations between columns.

In [None]:
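# 'Sex' and 'Embarked' still hold strings here, so restrict the correlation to numeric columns
titanic_data_corr = titanic_df.select_dtypes(include='number').corr()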

titanic_data_corr

We can see from this that 'Pclass' is negatively correlated with survival: the higher the class number (that is, the lower the class), the lower the chance of survival.

'Fare' is positively correlated with survival: the higher the fare, the higher the chance of survival.
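To make the sign convention concrete, here is a small sketch on made-up data in which first-class, high-fare passengers survive and third-class, low-fare passengers do not. The values are illustrative assumptions, not real records.

```python
import pandas as pd

# Made-up toy data (assumption): survivors are in the better classes
toy_df = pd.DataFrame({
    'Pclass':   [1, 1, 2, 3, 3, 3],
    'Fare':     [80.0, 60.0, 20.0, 8.0, 7.0, 9.0],
    'Survived': [1, 1, 1, 0, 0, 0],
})

# Pearson correlation of each column with 'Survived'
corr_with_survived = toy_df.corr()['Survived']

pclass_corr = corr_with_survived['Pclass']  # negative: larger class number, fewer survivors
fare_corr = corr_with_survived['Fare']      # positive: larger fare, more survivors
```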

We can use the Seaborn library to show this correlation matrix as a heatmap.

In [None]:
fig, ax = plt.subplots(figsize=(12,10))

sns.heatmap(titanic_data_corr, annot=True)

## Processing the data


Our dataset contains categorical or discrete values. We need to convert these to numerical values before they can be used by machine learning models.

Scikit-learn provides a very useful module called `preprocessing` to help us process our data.

First we'll convert categorical values to integer codes. We can use the `LabelEncoder` class for this. Integer codes imply an ordering, so they suit ordinal data, where the order matters, e.g. 'small', 'medium' and 'large'. Note that `LabelEncoder` assigns codes in sorted order of the labels, not by their meaning.

However, we can still use it when the data is binary. In this case the 'Sex' column only contains two values, 'female' and 'male'.

In our case this will give us a value of '0' for female and a value of '1' for male.


In [None]:
from sklearn import preprocessing

label_encoding = preprocessing.LabelEncoder()
titanic_df['Sex'] = label_encoding.fit_transform(titanic_df['Sex'].astype(str))

titanic_df.head()

In [None]:
label_encoding.classes_
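As a sketch of how to read `classes_`: the position of each label in that array is the integer code it was given, and `inverse_transform` maps codes back to labels. The `encoder` and `mapping` names below are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(['male', 'female', 'female', 'male'])

# classes_ holds the labels in sorted order; a label's index is its code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}

# inverse_transform recovers the original labels from the codes
decoded = encoder.inverse_transform(codes)
```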

Categories with no intrinsic ordering can be converted to numeric values using one-hot encoding.

In [None]:
titanic_df = pd.get_dummies(titanic_df, columns=['Embarked'])

titanic_df.head()

Each value in the 'Embarked' column now has its own column. This is known as its one-hot representation.

We now have our data in a form that is ready for training an ML model. We are going to shuffle the dataset and save it as a CSV file.
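A minimal sketch of the same transformation on made-up data shows the shape of the result: one indicator column per category, with exactly one of them set in each row. Passing `dtype=int` is an optional choice made here so the indicators print as 0 and 1.

```python
import pandas as pd

# Made-up toy data with the three port codes
toy_df = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})

# One new column per distinct value, named <column>_<value>
encoded = pd.get_dummies(toy_df, columns=['Embarked'], dtype=int)

# Each row has a single 1, marking the port for that row
ones_per_row = encoded.sum(axis=1)
```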

In [None]:
titanic_df = titanic_df.sample(frac=1).reset_index(drop=True)

titanic_df.head()
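Calling `sample(frac=1)` without a seed gives a different order on every run. If a repeatable shuffle is wanted, one option is to pass `random_state`. This is a sketch on made-up data, and the seed value 42 is arbitrary.

```python
import pandas as pd

# Made-up toy data
toy_df = pd.DataFrame({'value': range(10)})

# frac=1 samples every row exactly once, which is a full shuffle;
# random_state fixes the seed so the order is repeatable
shuffled_a = toy_df.sample(frac=1, random_state=42).reset_index(drop=True)
shuffled_b = toy_df.sample(frac=1, random_state=42).reset_index(drop=True)
```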

In [None]:
titanic_df.to_csv('datasets/titanic_processed.csv', index=False)

In [None]:
!ls datasets