# Processing the Titanic dataset


Our data set contains categorical or discrete values, which we need to convert to numerical values before most machine learning models can use them.

Scikit-learn provides a very useful module called `preprocessing` to help us process our data.

We start by importing the libraries we will use, then load the cleaned dataset we created in the previous notebook.

In [None]:
import sklearn

import pandas as pd

In [None]:
titanic_df = pd.read_csv('datasets/titanic_cleaned.csv')
titanic_df.head(10)

First we'll convert categorical values to integer codes using the `LabelEncoder` class. Integer codes imply an order, so this kind of encoding is best suited to ordinal data, where the order matters, e.g., 'small', 'medium' and 'large'. Note that `LabelEncoder` assigns codes in sorted order of the values rather than in an order we specify, and scikit-learn intends it mainly for encoding target labels.

However, we can still use it when the data is binary: in this case the 'Sex' column contains only two values, female and male.

In our case this gives a value of '0' for female and a value of '1' for male.


In [None]:
from sklearn import preprocessing

label_encoding = preprocessing.LabelEncoder()
titanic_df['Sex'] = label_encoding.fit_transform(titanic_df['Sex'].astype(str))

titanic_df.head()
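
As a minimal sketch of the point about ordering, the toy example below (the `sizes` data and variable names are made up for illustration and are not part of the Titanic dataset) compares `LabelEncoder`, which sorts the labels, with `OrdinalEncoder`, which accepts an explicit category order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical toy data, not part of the Titanic dataset
sizes = pd.DataFrame({'size': ['small', 'large', 'medium', 'small']})

# LabelEncoder sorts the labels, so the codes follow alphabetical order:
# large -> 0, medium -> 1, small -> 2
alphabetical_codes = LabelEncoder().fit_transform(sizes['size'])

# OrdinalEncoder takes an explicit order (one list of categories per column)
size_encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
ordered_codes = size_encoder.fit_transform(sizes[['size']]).ravel()

print(alphabetical_codes)
print(ordered_codes)
```

For a binary column such as 'Sex' the difference does not matter, which is why the label encoder is adequate here.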

We can see which integer the label encoder assigned to each value by looking at its `.classes_` attribute.

This returns an array of the original values; the index of each value in the array is the integer now used in our dataframe.

In [None]:
label_encoding.classes_
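
To make the relationship between `.classes_` and the encoded integers explicit, here is a small sketch on made-up data (the `sex` series and the names `code_for_label` and `decoded` are illustrative assumptions). It builds a label-to-code dictionary and converts the codes back with `inverse_transform`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column standing in for titanic_df['Sex']
sex = pd.Series(['male', 'female', 'female', 'male'])

encoder = LabelEncoder()
codes = encoder.fit_transform(sex)

# The position of a label in classes_ is the integer it was encoded as
code_for_label = {str(label): code for code, label in enumerate(encoder.classes_)}
print(code_for_label)

# inverse_transform maps the integer codes back to the original labels
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```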

Categories with no intrinsic ordering can be converted to numeric values using a technique called one-hot encoding.

Pandas has a built-in function called `get_dummies()` that converts a column to its one-hot encoded form.

In [None]:
titanic_df = pd.get_dummies(titanic_df, columns=['Embarked'])

titanic_df.head()
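
The effect is easier to see on a tiny example. The sketch below uses a made-up `ports` dataframe (an assumption for illustration, not the real data) and passes `dtype=int` so the indicator columns contain 0 and 1 regardless of the pandas version.

```python
import pandas as pd

# Hypothetical toy data with three distinct categories
ports = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})

# One indicator column is created per distinct value, named <column>_<value>
one_hot = pd.get_dummies(ports, columns=['Embarked'], dtype=int)

print(one_hot.columns.tolist())
print(one_hot)
```

Each row has exactly one indicator set to 1, which is where the name "one-hot" comes from.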

Each value in the 'Embarked' column now has its own column. This is known as its one-hot representation. (The 'Embarked' column isn't necessary for our model, but I included it to demonstrate the concept of one-hot encoding.)

Our data is now in a form that is ready for training an ML model. We are going to shuffle the data set and save it as a CSV file.

In [None]:
titanic_df = titanic_df.sample(frac=1).reset_index(drop=True)

titanic_df.head()
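
The shuffle above produces a different order on every run. If a repeatable order is wanted, `sample` accepts a `random_state` seed; the sketch below shows this on a made-up dataframe (the data and the seed value 42 are arbitrary choices for illustration).

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({'id': range(5)})

# frac=1 samples every row once, i.e. a full shuffle;
# the same random_state gives the same order each time
shuffled_a = df.sample(frac=1, random_state=42).reset_index(drop=True)
shuffled_b = df.sample(frac=1, random_state=42).reset_index(drop=True)

print(shuffled_a.equals(shuffled_b))
```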

In [None]:
titanic_df.to_csv('datasets/titanic_processed.csv', index=False)