# Cleaning the Titanic dataset

As before, we start by importing some libraries.

We will use pandas again to handle our data.

We will also import `matplotlib` and `seaborn`, which we will use to create some visualisations of our data.

In [None]:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
titanic_df = pd.read_csv('datasets/titanic.csv')

In [None]:
titanic_df.shape

### Dataset description

Here we have a description of the column headers of the CSV data.

1. PassengerId - Unique passenger ID
2. Survived - Survival (0 = No; 1 = Yes)
3. Pclass - Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
4. Name - Name
5. Sex - Sex
6. Age - Age
7. Sibsp - Number of Siblings/Spouses Aboard
8. Parch - Number of Parents/Children Aboard
9. Ticket - Ticket Number
10. Fare - Passenger Fare
11. Cabin - Cabin
12. Embarked - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)


We will drop some of the columns that have no relevance to our model.

In [None]:
# Pass axis by keyword: recent pandas versions no longer accept it positionally.
titanic_df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis='columns', inplace=True)
titanic_df.head(10)


In [None]:
titanic_df.shape

When working with large datasets, it is quite common to have records with missing data, so we will check for any records that have missing fields.

We can use the `.isnull()` method from the pandas library to check if a value is null.

The following line will count how many null values there are in each column of our dataframe.

In [None]:
# isnull() marks each missing cell as True; sum() then counts them per column.
titanic_df.isnull().sum()
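Counting missing values per column is different from counting the rows that contain at least one missing value. Here is a minimal sketch of both counts on a small made-up frame. `toy_df` and its values are illustrative assumptions, not the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the Titanic frame (values are made up).
toy_df = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0, np.nan],
    'Fare': [7.25, 71.28, np.nan, 8.05],
    'Sex': ['male', 'female', 'female', 'male'],
})

# Number of missing values in each column.
nulls_per_column = toy_df.isnull().sum()

# Number of rows that have at least one missing value.
rows_with_nulls = toy_df.isnull().any(axis=1).sum()

print(nulls_per_column)
print(rows_with_nulls)
```

The per-column count shows which features are incomplete. The per-row count shows how many records would be lost if incomplete rows were dropped.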

As we can see, there are quite a few records with missing data. It is possible to use different techniques to estimate what the missing data could be.
For this example we will just omit any records with missing data.

In [None]:
titanic_df = titanic_df.dropna()

In [None]:
titanic_df.shape
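Dropping rows is the simplest option, but it discards data. As a point of comparison, here is a minimal sketch of one common estimation technique: filling a numeric column with its median and a categorical column with its most frequent value. The frame `toy_df` and its values are made up for illustration, and this notebook does not use this approach.

```python
import numpy as np
import pandas as pd

# Toy data with gaps (values are made up, not taken from the Titanic CSV).
toy_df = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0, 54.0],
    'Embarked': ['S', 'C', None, 'S'],
})

imputed_df = toy_df.copy()

# Numeric column: fill gaps with the median of the observed values.
imputed_df['Age'] = imputed_df['Age'].fillna(imputed_df['Age'].median())

# Categorical column: fill gaps with the most frequent observed value.
most_common_port = imputed_df['Embarked'].mode()[0]
imputed_df['Embarked'] = imputed_df['Embarked'].fillna(most_common_port)

print(imputed_df)
```

Simple fills like these keep every row, at the cost of making the filled columns look less varied than they really are.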

Let's just do a sanity check to see if there are any null values left in our dataframe.

In [None]:
# Every column should now report zero missing values.
titanic_df.isnull().sum()

We can use the `describe` method to get some statistical information about our dataset.

In [None]:
titanic_df.describe()

Notice that the mean of the 'Survived' column is 0.404494, which means only about 40% of the passengers in our data survived. (This is higher than the roughly 31% survival rate for all passengers.)
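The mean of a 0/1 column equals the share of ones, which is why the mean can be read as a survival rate. A minimal sketch with a made-up series follows. The values are illustrative only.

```python
import pandas as pd

# Five made-up passengers: 1 = survived, 0 = did not survive.
survived = pd.Series([0, 1, 0, 0, 1], name='Survived')

# Mean of a 0/1 column = (number of ones) / (number of rows).
survival_rate = survived.mean()

# The same information as explicit proportions for each value.
proportions = survived.value_counts(normalize=True)

print(survival_rate)
print(proportions)
```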


Let's plot some of our data on a scatter plot to see if anything interesting shows up.

In [None]:
fig, ax = plt.subplots(figsize=(12, 8))

plt.scatter(titanic_df['Age'], titanic_df['Survived'])

plt.xlabel('Age')
plt.ylabel('Survived')

This shows us that there is little to no visible correlation between a passenger's age and whether the passenger survived.

In [None]:
fig, ax = plt.subplots(figsize=(12, 8))

plt.scatter(titanic_df['Fare'], titanic_df['Survived'])

plt.xlabel('Fare')
plt.ylabel('Survived')

We can see that there are a few outliers: passengers who paid much higher fares and survived.

Try to create a scatterplot for a different feature.

In [None]:
# Write your code here

As we are looking at this in terms of binary classification, a scatterplot is of very little use: every point falls on one of just two horizontal lines (0 or 1).
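One way to get more out of a numeric feature with a binary target is to group the feature into bins and compare the survival rate in each bin. The sketch below assumes made-up data, and the bin edges and labels are arbitrary choices for illustration.

```python
import pandas as pd

# Made-up ages and outcomes (not taken from the Titanic CSV).
toy_df = pd.DataFrame({
    'Age': [4, 9, 25, 31, 38, 47, 62, 70],
    'Survived': [1, 1, 0, 1, 0, 0, 0, 1],
})

# Hypothetical bin edges; pd.cut uses right-inclusive intervals by default,
# so these are (0, 18], (18, 40] and (40, 80].
age_bins = [0, 18, 40, 80]
age_labels = ['child', 'adult', 'older']
toy_df['AgeGroup'] = pd.cut(toy_df['Age'], bins=age_bins, labels=age_labels)

# Mean of the 0/1 target within each bin = survival rate for that bin.
survival_by_age = toy_df.groupby('AgeGroup', observed=True)['Survived'].mean()

print(survival_by_age)
```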

Let's try looking at the data in a cross-tabulation (a contingency table, laid out much like a confusion matrix).

In [None]:
pd.crosstab(titanic_df['Sex'], titanic_df['Survived'])

We can see from this that there was a much higher survival rate for women than for men.

In [None]:
pd.crosstab(titanic_df['Pclass'], titanic_df['Survived'])
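Raw counts can be hard to compare when the groups differ in size. Here is a minimal sketch of turning a crosstab into per-group rates with the `normalize='index'` option of `pd.crosstab`, using a made-up frame. `toy_df` and its values are assumptions for illustration.

```python
import pandas as pd

# Made-up passengers (not taken from the Titanic CSV).
toy_df = pd.DataFrame({
    'Sex': ['female', 'female', 'female', 'male', 'male', 'male', 'male'],
    'Survived': [1, 1, 0, 0, 0, 1, 0],
})

# normalize='index' divides each row by its row total,
# so every row shows the share of non-survivors (0) and survivors (1).
survival_rates = pd.crosstab(toy_df['Sex'], toy_df['Survived'], normalize='index')

print(survival_rates)
```

Each row sums to 1, so the column for `1` can be read directly as the survival rate of that group.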

We can use one of pandas' built-in methods to view correlations between columns.

In [None]:
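# 'Sex' and 'Embarked' hold strings, so restrict the calculation to numeric columns.
# (numeric_only is available in pandas 1.5 and later; pandas 2.x raises an error without it.)
titanic_data_corr = titanic_df.corr(numeric_only=True)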

titanic_data_corr

We can see from this that class is negatively correlated with survival. `Pclass` uses 1 for first class and 3 for third, so the lower the class (the higher the number), the lower the chance of survival.

Fare is positively correlated with survival: the higher the fare, the higher the chance of survival.
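To read off the sign and strength of each feature's relationship with the target, one option is to pull out a single column of the correlation matrix and sort it. The sketch below uses made-up values, and it selects numeric columns with `select_dtypes` so that string columns are left out of the calculation.

```python
import pandas as pd

# Made-up passengers (not taken from the Titanic CSV).
toy_df = pd.DataFrame({
    'Survived': [1, 1, 1, 0, 0, 0],
    'Pclass': [1, 1, 2, 3, 3, 2],
    'Fare': [80.0, 60.0, 30.0, 8.0, 7.5, 13.0],
    'Sex': ['female', 'female', 'male', 'male', 'male', 'female'],
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
numeric_corr = toy_df.select_dtypes(include='number').corr()

# Correlation of every other numeric feature with the target, most negative first.
corr_with_survived = numeric_corr['Survived'].drop('Survived').sort_values()

print(corr_with_survived)
```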

We can use the Seaborn library to show this correlation matrix as a heatmap.

In [None]:
fig, ax = plt.subplots(figsize=(12,10))

sns.heatmap(titanic_data_corr, annot=True)

Now that we have cleaned up our data, we can move on to preprocessing. This is where we make sure the data is in a form that the model can understand, so that it can be used for training.

We'll save our cleaned data to disk for the next step.

In [None]:
titanic_df.to_csv('datasets/titanic_cleaned.csv', index=False)