# Data PreProcessing with Titanic: Machine Learning from Disaster as an Example
> #### 1. Preparing data
> #### 2. Importing the required Libraries
> #### 3. Importing the Data Set
> #### 4. Handling the Missing Data
> #### 5. Encoding Categorical Data
> #### 6. Splitting the dataset into test set and training set
> #### 7. Feature Scaling

This notebook was inspired by [100-Days-Of-ML-Code](https://github.com/Avik-Jain/100-Days-Of-ML-Code) by [Avik-Jain](https://github.com/Avik-Jain).

<div>
    <a href="https://github.com/Avik-Jain/100-Days-Of-ML-Code/blob/master/Info-graphs/Day%201.jpg">
        <img align=left src="https://raw.githubusercontent.com/Avik-Jain/100-Days-Of-ML-Code/master/Info-graphs/Day%201.jpg" width = 50%>
    </a>
</div>

## 1. Preparing data
First, download the [Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic/data) data from [Kaggle](https://www.kaggle.com).<br>
I put my copy of the data [here](https://github.com/marsguo18/100-Days-Of-ML/tree/master/Data/Titanic_data).<br>
Next, recall how machine learning (ML) works: we use **train_data** to train an ML model, and then feed **test_data** into that model to predict the results. Technically, both **train_data** and **test_data** must be preprocessed in the same way, but this notebook only preprocesses **train_data**.<br>
**PS:** If you want to see how the following code works, remember to click the Run button.
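
As a minimal sketch of how a preprocessing step could be carried over from **train_data** to **test_data**, the fill value below is computed on the training set only and then reused for the test set. The tiny DataFrames and the helper name `fill_age` are hypothetical stand-ins, not the Kaggle files or part of this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for train.csv and test.csv
train_demo = pd.DataFrame({"Age": [22.0, 38.0, np.nan, 35.0]})
test_demo = pd.DataFrame({"Age": [np.nan, 47.0]})

# Learn the fill value from the training data only
age_median = train_demo["Age"].median()


def fill_age(df, value):
    """Hypothetical helper: return a copy of df with missing Age replaced by value."""
    out = df.copy()
    out["Age"] = out["Age"].fillna(value)
    return out


train_filled = fill_age(train_demo, age_median)
test_filled = fill_age(test_demo, age_median)
print(test_filled)
```

Reusing the training statistic keeps both sets on the same footing and avoids letting information from the test set influence the preprocessing.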

## 2. Importing the required Libraries

In [1]:
import numpy as np
import pandas as pd

## 3. Importing the Data Set

In [2]:
path_train = '../Data/Titanic_data/train.csv'
train = pd.read_csv(path_train)

## 4. Handling the Missing Data
### (1) Take a first glance at the data.

In [3]:
train.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


### (2) Look at the data in more detail, such as the number of values for each feature and the number of abnormal values.

In [4]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


In [5]:
train.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


### (3) Survey missing data

In [6]:
train.isnull().head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,False,False,False,False,False,False,False,False,False,False,True,False
1,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,True,False
3,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,True,False
5,False,False,False,False,False,True,False,False,False,False,True,False
6,False,False,False,False,False,False,False,False,False,False,False,False
7,False,False,False,False,False,False,False,False,False,False,True,False
8,False,False,False,False,False,False,False,False,False,False,True,False
9,False,False,False,False,False,False,False,False,False,False,True,False


In [7]:
sum_null = train.isnull().sum()
missing_data = pd.concat([sum_null], axis=1, keys=['sum_null'])
missing_data

Unnamed: 0,sum_null
PassengerId,0
Survived,0
Pclass,0
Name,0
Sex,0
Age,177
SibSp,0
Parch,0
Ticket,0
Fare,0
Cabin,687
Embarked,2


### (4) Deal with the missing data
As we can see, we have to deal with Age (177), Cabin (687), and Embarked (2). Generally, we can replace missing numerical data with the mean or median of the entire column.
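
To illustrate why the choice between mean and median can matter, here is a small sketch on made-up numbers (not the Titanic data). A single large value pulls the mean upward, while the median stays close to the bulk of the values.

```python
import numpy as np
import pandas as pd

# Made-up values with one outlier and one missing entry
ages = pd.Series([20.0, 22.0, 24.0, 80.0, np.nan])

mean_filled = ages.fillna(ages.mean())      # mean of the known values is 36.5
median_filled = ages.fillna(ages.median())  # median of the known values is 23.0

print(mean_filled.iloc[-1], median_filled.iloc[-1])
```
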
#### [1] Age

In [8]:
train['Age'] = train['Age'].fillna(train['Age'].median())
train['Age'].isnull().sum()

0

#### [2] Cabin
**However**, most sklearn tools only work with **numerical values**. As we can see below, Cabin, one of the features of the training data, holds object (string) values rather than numbers, so its missing data can't be filled with a mean or median.

In [9]:
train['Cabin'][train['Cabin'].notnull()].head()

1      C85
3     C123
6      E46
10      G6
11    C103
Name: Cabin, dtype: object

In [10]:
train['Cabin'].describe()

count     204
unique    147
top        G6
freq        4
Name: Cabin, dtype: object

The first letter of the cabin indicates the deck, which tells us the probable location of the passenger on the Titanic, so we should keep this information. I assume that passengers without a cabin have a missing value in place of the cabin number. Therefore, I replace the missing data with a placeholder cabin type 'X'.

In [11]:
# Keep only the deck letter; missing cabins become 'X'
train['Cabin'] = train['Cabin'].str[0].fillna('X')
train['Cabin'].head()

0    X
1    C
2    X
3    C
4    X
Name: Cabin, dtype: object

#### [3] Embarked

In [12]:
train['Embarked'].isnull().sum()

2

In [13]:
train['Embarked'].describe()

count     889
unique      3
top         S
freq      644
Name: Embarked, dtype: object

As we can see above, the most frequent value of Embarked is "S", so we fill the two missing entries with "S".

In [14]:
train["Embarked"] = train["Embarked"].fillna("S")
train['Embarked'].isnull().sum()

0
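
Instead of hard-coding "S", the most frequent value can also be looked up programmatically. A minimal sketch on a made-up Series (the name `embarked_demo` is hypothetical and the values are not the real column):

```python
import numpy as np
import pandas as pd

embarked_demo = pd.Series(["S", "C", "S", np.nan, "Q", "S", np.nan])

# mode() returns a Series because ties are possible; take the first entry
most_frequent = embarked_demo.mode()[0]
embarked_filled = embarked_demo.fillna(most_frequent)

print(most_frequent)
print(embarked_filled.isnull().sum())
```
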

## 5. Encoding Categorical Data

In [15]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,X,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,X,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,X,S


Features are often given as categorical rather than continuous values. As we can see above, the values of "Sex", "Cabin" and "Embarked" are categories rather than numbers, which most models cannot handle directly. That is why we need to encode categorical data.

As the following output shows, "Sex" has two categories, "male" and "female". We can turn these categorical values into integer codes with the **LabelEncoder** class from **sklearn.preprocessing**.

In [16]:
train['Sex'].value_counts()

male      577
female    314
Name: Sex, dtype: int64

In [17]:
from sklearn.preprocessing import LabelEncoder
labelencoder_train = LabelEncoder()
train['Sex'] = labelencoder_train.fit_transform(train['Sex'])
train['Sex'].value_counts()

1    577
0    314
Name: Sex, dtype: int64

As we can see, <br>
male --> 1 and female --> 0
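
The mapping does not have to be read off the value counts. As a small sketch on a made-up list of labels, a fitted `LabelEncoder` exposes its sorted categories in `classes_`, and `inverse_transform` converts codes back to labels:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(["male", "female", "female", "male"])

# classes_ is sorted, and each label's position is its integer code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform converts integer codes back to the original labels
print(list(encoder.inverse_transform([0, 1])))
```
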

In the same way as for "Sex", we can encode the categorical data of "Cabin".

In [18]:
train['Cabin'].value_counts()

X    687
C     59
B     47
D     33
E     32
A     15
F     13
G      4
T      1
Name: Cabin, dtype: int64

In [19]:
labelencoder_train = LabelEncoder()
train['Cabin'] = labelencoder_train.fit_transform(train['Cabin'])
train['Cabin'].value_counts()

8    687
2     59
1     47
3     33
4     32
0     15
5     13
6      4
7      1
Name: Cabin, dtype: int64

As we can see, <br>
X --> 8, C --> 2, B --> 1, D --> 3, E --> 4, A --> 0, F --> 5, G --> 6 and T --> 7.

In [20]:
train['Embarked'].value_counts()

S    646
C    168
Q     77
Name: Embarked, dtype: int64

In [21]:
from sklearn.preprocessing import LabelEncoder
labelencoder_train = LabelEncoder()
train['Embarked'] = labelencoder_train.fit_transform(train['Embarked'])
train['Embarked'].value_counts()

2    646
0    168
1     77
Name: Embarked, dtype: int64

As we can see, <br>
S --> 2, C --> 0 and Q --> 1.
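
Label encoding assigns integers in alphabetical order, so the codes suggest an ordering (C < Q < S) that has no real meaning. One common alternative, which this notebook does not use, is one-hot encoding. The sketch below uses `pd.get_dummies` on a made-up frame; the name `ports` is hypothetical.

```python
import pandas as pd

ports = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One indicator column per category instead of a single integer column
one_hot = pd.get_dummies(ports, columns=["Embarked"], dtype=int)
print(one_hot)
```

Each row then has exactly one indicator set to 1, at the cost of one extra column per category.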

In [22]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",1,22.0,1,0,A/5 21171,7.25,8,2
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,38.0,1,0,PC 17599,71.2833,2,0
2,3,1,3,"Heikkinen, Miss. Laina",0,26.0,0,0,STON/O2. 3101282,7.925,8,2
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,35.0,1,0,113803,53.1,2,2
4,5,0,3,"Allen, Mr. William Henry",1,35.0,0,0,373450,8.05,8,2


## 6. Splitting the dataset into test set and training set
Because the dataset from Kaggle has already been split into test and train sets, we won't use the Titanic data for this step. The following is a general way to split a dataset.
```python
from sklearn.model_selection import train_test_split

# X (features) and y (target) are placeholders for your own dataset.
# test_size=0.2 holds out 20% of the rows as an example; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
```
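
For a version that runs end to end, here is a self-contained sketch on made-up data that splits first and scales afterwards. The array contents, `test_size=0.25` and `random_state=0` are illustrative assumptions. Note that the scaler is fitted on the training rows only and then reused for the test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up data: 8 samples, 2 features, binary target
X = np.arange(16, dtype=float).reshape(8, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit the scaler on the training rows only, then reuse it for the test rows
sc_X = StandardScaler()
X_train_scaled = sc_X.fit_transform(X_train)
X_test_scaled = sc_X.transform(X_test)

print(X_train.shape, X_test.shape)
```

Calling `transform` rather than `fit_transform` on the test rows keeps the test set scaled with the statistics learned from the training set.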

## 7. Feature Scaling

**Feature Scaling** is a method used to standardize the range of independent variables or features of data. For example, standardization rescales a feature that ranges from 0 to 100 so that it has zero mean and unit variance, which puts all features on a comparable scale and often makes the model easier to fit.
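
As a small worked check on made-up numbers, `StandardScaler` computes z = (x - mean) / std for each column, using the population standard deviation. The sketch below compares a manual calculation with the scaler's output.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

values = np.array([[0.0], [50.0], [100.0]])

# Manual standardization: subtract the mean, divide by the population std (ddof=0)
manual = (values - values.mean(axis=0)) / values.std(axis=0)

scaled = StandardScaler().fit_transform(values)
print(np.allclose(manual, scaled))
```

The scaled column has mean 0 and standard deviation 1; its values are not confined to a fixed interval.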

In [23]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",1,22.0,1,0,A/5 21171,7.25,8,2
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,38.0,1,0,PC 17599,71.2833,2,0
2,3,1,3,"Heikkinen, Miss. Laina",0,26.0,0,0,STON/O2. 3101282,7.925,8,2
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,35.0,1,0,113803,53.1,2,2
4,5,0,3,"Allen, Mr. William Henry",1,35.0,0,0,373450,8.05,8,2


As we can see above, "Pclass", "Age", "SibSp", "Parch", "Fare", "Cabin" and "Embarked" are on very different scales. So we are going to use the **StandardScaler** class from **sklearn.preprocessing** to bring them onto a common scale.

In [24]:
from sklearn.preprocessing import StandardScaler
# Columns to standardize; StandardScaler scales each column independently
columns_to_scale = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Cabin', 'Embarked']
scaler = StandardScaler()
# fit_transform learns each column's mean and standard deviation, then applies them
train[columns_to_scale] = scaler.fit_transform(train[columns_to_scale])
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,0.827377,"Braund, Mr. Owen Harris",1,-0.565736,0.432793,-0.473674,A/5 21171,-0.502445,0.522067,0.585954
1,2,1,-1.566107,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,0.663861,0.432793,-0.473674,PC 17599,0.786845,-1.917594,-1.942303
2,3,1,0.827377,"Heikkinen, Miss. Laina",0,-0.258337,-0.474545,-0.473674,STON/O2. 3101282,-0.488854,0.522067,0.585954
3,4,1,-1.566107,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,0.433312,0.432793,-0.473674,113803,0.42073,-1.917594,0.585954
4,5,0,0.827377,"Allen, Mr. William Henry",1,0.433312,-0.474545,-0.473674,373450,-0.486337,0.522067,0.585954
