In [44]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

In [45]:
# Loading a csv file with pandas
train_data = pd.read_csv('data/train.csv')

In [46]:
train_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


# We usually have three types of variables: continuous, ordinal and categorical

Continuous variables are numeric features that can take any numerical value. In this dataset, Age is an example of a continuous feature. These features are a good starting point because we can use them almost directly in a machine learning model.

Categorical variables take one of two or more categories. In this dataset, Sex is a categorical variable with just two categories: male and female. Categorical variables are often strings and have to be transformed in some way, because a machine learning model works with numbers, not with strings.

Ordinal variables are similar to categorical variables, but their categories have an order. For example, small, medium and large are categories with a natural order. In this dataset, Pclass is an ordinal variable: there are just three classes, but the classes are ordered.
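The three variable types can also be made explicit in pandas. The snippet below is a minimal sketch on a toy frame built from the first rows shown above. The name `sample` and the ordering 3 < 2 < 1 for Pclass (third class treated as the lowest) are assumptions made for illustration.

```python
import pandas as pd

# Toy frame with one column of each variable type
sample = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],            # continuous
    "Sex": ["male", "female", "female"],  # categorical
    "Pclass": [3, 1, 3],                  # ordinal
})

# Categorical: an unordered category type
sample["Sex"] = sample["Sex"].astype("category")

# Ordinal: an ordered category type.
# Assumption: third class is the lowest, first class the highest.
pclass_type = pd.CategoricalDtype(categories=[3, 2, 1], ordered=True)
sample["Pclass"] = sample["Pclass"].astype(pclass_type)

print(sample.dtypes)
print("Sex ordered?   ", sample["Sex"].cat.ordered)
print("Pclass ordered?", sample["Pclass"].cat.ordered)
```

Marking a column as an ordered category does not make it usable by a model yet, but it documents the intent and allows comparisons such as `sample["Pclass"] > 3`.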

# The desired state of the data before we can apply machine learning


Before we can train a machine learning model, we have to transform the dataset into a format that the model can use.

As mentioned earlier, a machine learning model expects numbers, so there should be no string columns.

Q1: How are we going to transform the string columns into numerical columns?

We want to extract as much information from our dataset as possible.

Q2: How can we extract more information from the dataset than the raw columns already provide?


Eventually we want each column to be a feature, with the target variable as the last column.

E.g.:

feat_1, feat_2, feat_3, 'Survived'
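To see how far a dataset is from this desired state, we can list the columns that still need work. The following is a rough sketch on a toy frame. The names `raw`, `non_numeric` and `with_missing` are made up for this example, and the values are taken from rows shown in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame: one string column and one column with a missing value
raw = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan],
    "Sex": ["male", "female", "male"],
    "Fare": [7.25, 71.2833, 8.4583],
})

# Columns that are not numeric and therefore need to be transformed
non_numeric = raw.select_dtypes(exclude="number").columns.tolist()

# Columns that contain at least one missing value
with_missing = raw.columns[raw.isna().any()].tolist()

print("Needs encoding:", non_numeric)
print("Needs filling: ", with_missing)
```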


# Part 1: a good start, numerical features

In [50]:
numerical_cols = ['Age', 'Fare']

In [51]:
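# .copy() gives an independent DataFrame, so later assignments do not act on a slice of train_data
num_features = train_data[numerical_cols].copy()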

In [52]:
num_features.head(10)

Unnamed: 0,Age,Fare
0,22.0,7.25
1,38.0,71.2833
2,26.0,7.925
3,35.0,53.1
4,35.0,8.05
5,,8.4583
6,54.0,51.8625
7,2.0,21.075
8,27.0,11.1333
9,14.0,30.0708


# How to deal with NaN values (missing values)?

There are many ways to deal with NaN values. Here we will simply fill them with the mean of the column. We can already think of smarter ways, such as using the mean age of passengers of the same sex.

In [55]:
num_features['Age'] = num_features['Age'].fillna(num_features['Age'].mean())

In [56]:
num_features.head(10)

Unnamed: 0,Age,Fare
0,22.0,7.25
1,38.0,71.2833
2,26.0,7.925
3,35.0,53.1
4,35.0,8.05
5,29.699118,8.4583
6,54.0,51.8625
7,2.0,21.075
8,27.0,11.1333
9,14.0,30.0708
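As an alternative to the overall mean, the group-wise idea mentioned earlier (mean age per sex) can be sketched with `groupby` and `transform`. This is only an illustration: the frame `toy` and its values are made up, and the column name `Age_filled` is an assumption.

```python
import numpy as np
import pandas as pd

# Illustrative values only, not taken from the real dataset
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Age": [22.0, 38.0, 26.0, np.nan, 35.0, np.nan],
})

# Mean age per sex, repeated for every row of that group
# (missing ages are skipped when the mean is computed)
group_mean_age = toy.groupby("Sex")["Age"].transform("mean")

# Fill each missing age with the mean of the passenger's own group
toy["Age_filled"] = toy["Age"].fillna(group_mean_age)

print(toy)
```

Because `transform` returns a Series aligned with the original index, it can be passed straight to `fillna`.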


# Part 2: categorical features

Again, there is more than one way to convert categorical features to a useful format. I will show you one of the most common ones.

In [57]:
sex_data = train_data[['Sex']]

In [58]:
sex_data.head()

Unnamed: 0,Sex
0,male
1,female
2,female
3,female
4,male


In [59]:
sex_data = pd.get_dummies(sex_data)

In [60]:
sex_data.head()

Unnamed: 0,Sex_female,Sex_male
0,0,1
1,1,0
2,1,0
3,1,0
4,0,1


This method is called one-hot encoding. It creates one column per category and fills it with 1 if the record belongs to that category and 0 otherwise. The advantage is that a machine learning model can handle this type of data, unlike a column containing the strings "male" and "female".
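Two optional arguments of `pd.get_dummies` are worth knowing. The sketch below uses a toy frame, and the names `sex`, `one_hot` and `compact` are assumptions. Newer pandas versions return boolean columns by default, so `dtype=int` is passed to keep 0/1 values like in the output above.

```python
import pandas as pd

# The Sex values of the first five passengers shown above
sex = pd.DataFrame({"Sex": ["male", "female", "female", "female", "male"]})

# dtype=int forces 0/1 integers instead of True/False
one_hot = pd.get_dummies(sex, dtype=int)

# With only two categories, one column carries all the information.
# drop_first=True drops the first category (here Sex_female).
compact = pd.get_dummies(sex, dtype=int, drop_first=True)

print(one_hot)
print(compact)
```

Dropping one column avoids storing redundant information, which matters mostly for linear models. Keeping both columns, as this notebook does, is also fine.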

In [62]:
# Merge the numerical features and the one-hot encoded Sex feature (both share the same index)

df_features = num_features.join(sex_data, how='outer')

In [65]:
df_features['Survived'] = train_data['Survived']

In [67]:
df_features.head()

Unnamed: 0,Age,Fare,Sex_female,Sex_male,Survived
0,22.0,7.25,0,1,0
1,38.0,71.2833,1,0,1
2,26.0,7.925,1,0,1
3,35.0,53.1,1,0,1
4,35.0,8.05,0,1,0


Great, we have our first dataset that can actually be used in a machine learning model!
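Most machine learning libraries expect the features and the target as two separate objects. A minimal sketch of that split is shown here, using the five rows printed above. The names `features`, `X` and `y` are a common convention and an assumption here, not something defined earlier in this notebook.

```python
import pandas as pd

# The five rows shown above, rebuilt by hand so the example is self-contained
features = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Sex_female": [0, 1, 1, 1, 0],
    "Sex_male": [1, 0, 0, 0, 1],
    "Survived": [0, 1, 1, 1, 0],
})

# X holds the feature columns, y holds the target column
X = features.drop(columns="Survived")
y = features["Survived"]

print(X.shape, y.shape)
```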

# Your turn!

Can we extract more features from our dataset?

Some questions:

- Can we extract more numerical and categorical features right from the dataset?
- How are we going to deal with the ordinal feature Pclass?
- How are we going to create features from Name and Ticket?
- Can we create even more information, for example by creating a "family size" feature from SibSp and Parch?
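As a possible starting point for the last two questions, here is a small sketch on three of the passengers shown at the top. It is one option among many. The column names `FamilySize` and `Title`, the `+ 1` for the passenger themselves, and the regular expression are all assumptions made for this example.

```python
import pandas as pd

# Three passengers from the head of the dataset
people = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    ],
    "SibSp": [1, 0, 1],
    "Parch": [0, 0, 0],
})

# Family size: siblings/spouses + parents/children + the passenger (assumption)
people["FamilySize"] = people["SibSp"] + people["Parch"] + 1

# Title: the text between the comma and the first period in the name
people["Title"] = people["Name"].str.extract(r",\s*([^.]+)\.", expand=False)

print(people[["Name", "FamilySize", "Title"]])
```

The extracted `Title` is a string again, so it would still need to be encoded, for example with one-hot encoding, before it can be used in a model.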