<a href="https://colab.research.google.com/github/mustafabozkaya/Data_Science_Bootcamp/blob/master/Week1/data_pre_processing_using_titanic_dataset_4ad8d5446bda4f609a09c24ddccc23f8.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction


On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

In the Hollywood blockbuster that was modelled on this tragedy, it seemed to be the case that upper-class people, women and children were more likely to survive than others. But did these properties (socio-economic status, sex and age) really influence one's survival chances? 

Based on data of a subset of 891 passengers on the Titanic, I will make a model that can be used to predict survival of other Titanic passengers. 

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e. name, age, gender, socio-economic class, etc.).

### Outline

- Preprocessing/cleaning of the provided data


# Dataset Download:

train: https://docs.google.com/spreadsheets/d/1hFOPnxVT9fyT4TFlwuGGbDLfclY43P48UV24PNfAW2M/edit?usp=sharing


In [None]:
# Only for Google Colab users
from google.colab import files
uploaded = files.upload()


Saving train - train_titanic dataset.csv to train - train_titanic dataset.csv


### Preprocessing
First, let's load the training data to see what we're dealing with. We will import the file to a pandas DataFrame:

## Loading Libraries

In [None]:
import pandas as pd
import numpy as np

## Loading Data

In [None]:
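# Load the same file twice: train_data is used for the dropna() walkthrough,
# train_data1 for the dummy-variable and imputation steps.
train_data = pd.read_csv('train - train_titanic dataset.csv')
train_data1 = pd.read_csv('train - train_titanic dataset.csv')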

In [None]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


Now, let's take a look at the first few rows of the DataFrame:

In [None]:
train_data.head(3)

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |


# A bit about the dataset

The 'Pclass' column contains a number indicating the class of the passenger's ticket: 1 for first class, 2 for second class and 3 for third class. It can serve as a proxy for the passenger's socio-economic status ('upper', 'middle', 'low').

The 'SibSp' column contains the number of siblings and spouses of the passenger who were also aboard the Titanic, and the 'Parch' column contains the number of parents and children of the passenger who were also aboard.

The 'Ticket' column contains the ticket numbers of passengers (which are not likely to have any predictive power regarding survival). 'Cabin' contains the cabin number of the passenger, if they had a cabin, and lastly 'Embarked' indicates the port of embarkation of the passenger: **C**herbourg, **Q**ueenstown or **S**outhampton. The remaining columns are self-explanatory.

Let's check some more info on the DataFrame: 

In [None]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


The DataFrame contains 891 entries in total, with 12 columns. Of those 12 columns, 9 have non-null values for every entry and 3 do not: 'Age' has 714 non-null entries (177 missing), 'Cabin' has only 204 (687 missing; of course, not everyone had a cabin), and 'Embarked' has 889 (2 missing).

Columns with the object dtype are non-numeric, so we have to find a way to encode them as numerical values.
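
The missing-value counts read off the `info()` summary can also be computed directly. The following is a minimal sketch on a small, invented stand-in frame (the `sample` data is hypothetical, not the real Titanic file); on the real data you would call the same methods on `train_data`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the Titanic frame (values invented for illustration).
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

# isna() marks missing cells as True; summing the booleans counts them per column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Keep only the names of columns that actually have gaps.
columns_with_gaps = missing_per_column[missing_per_column > 0].index.tolist()
print(columns_with_gaps)
```

This gives the per-column gap counts without subtracting the non-null counts from the row total by hand.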

# **Dropping Columns which are not useful**

Let's drop some of the columns that may not contribute much to our machine learning model, such as Name, Ticket and Cabin.

In [None]:
cols = ['Name', 'Ticket', 'Cabin']
train_data = train_data.drop(cols, axis=1)
train_data1 = train_data1.drop(cols, axis=1)

In [None]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Sex          891 non-null    object 
 4   Age          714 non-null    float64
 5   SibSp        891 non-null    int64  
 6   Parch        891 non-null    int64  
 7   Fare         891 non-null    float64
 8   Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(2)
memory usage: 62.8+ KB


# Dropping rows having missing values
Next, if we want, we can drop all rows in the data that have missing values (NaN). You can do it like this:

In [None]:
train_data = train_data.dropna()

In [None]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 712 entries, 0 to 890
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  712 non-null    int64  
 1   Survived     712 non-null    int64  
 2   Pclass       712 non-null    int64  
 3   Sex          712 non-null    object 
 4   Age          712 non-null    float64
 5   SibSp        712 non-null    int64  
 6   Parch        712 non-null    int64  
 7   Fare         712 non-null    float64
 8   Embarked     712 non-null    object 
dtypes: float64(2), int64(5), object(2)
memory usage: 55.6+ KB
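
Before committing to `dropna()`, it can help to measure how many rows it discards. This is a minimal sketch on an invented five-row frame (the `sample` data is hypothetical); the same comparison applies to `train_data`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in frame (values invented for illustration).
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1],
    "Age": [22.0, np.nan, 26.0, np.nan, 35.0],
    "Embarked": ["S", "C", None, "S", "S"],
})

# Compare the row count before and after dropping every row with any NaN.
rows_before = len(sample)
rows_after = len(sample.dropna())
lost_fraction = (rows_before - rows_after) / rows_before
print(f"dropna() removes {rows_before - rows_after} of {rows_before} rows ({lost_fraction:.0%})")

# Restricting dropna() to one column only discards rows missing that column.
rows_after_subset = len(sample.dropna(subset=["Embarked"]))
print(rows_after_subset)
```

The `subset` argument is one way to drop rows only for columns with very few gaps while leaving the others to be filled in.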


# Problem with dropping rows having missing values



After dropping rows with missing values, the dataset shrinks from 891 rows to 712, which means we are throwing away 179 rows (about 20% of the data). Machine learning models need data for training to perform well, so we would rather preserve the data and make use of as much of it as we can. We will see how later.

# Creating Dummy Variables
Now we convert the Pclass, Sex and Embarked columns into dummy (indicator) columns with pandas, and drop the original columns after the conversion.

In [None]:
dummies = []

cols = ['Pclass', 'Sex', 'Embarked']
for col in cols:
    dummies.append(pd.get_dummies(train_data1[col]))
titanic_dummies = pd.concat(dummies, axis=1)    

# And finally we concatenate to the original dataframe column wise


In [None]:
train_data1 = pd.concat((train_data1,titanic_dummies), axis=1)

Now that we have converted the Pclass, Sex and Embarked values into columns, we drop the original, now redundant columns from the DataFrame:

In [None]:
train_data1 = train_data1.drop(['Pclass', 'Sex', 'Embarked'], axis=1)

In [None]:
train_data1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Age          714 non-null    float64
 3   SibSp        891 non-null    int64  
 4   Parch        891 non-null    int64  
 5   Fare         891 non-null    float64
 6   1            891 non-null    uint8  
 7   2            891 non-null    uint8  
 8   3            891 non-null    uint8  
 9   female       891 non-null    uint8  
 10  male         891 non-null    uint8  
 11  C            891 non-null    uint8  
 12  Q            891 non-null    uint8  
 13  S            891 non-null    uint8  
dtypes: float64(2), int64(4), uint8(8)
memory usage: 48.9 KB


In [None]:
train_data1.head(3)  # overview of the data after creating the dummy variables

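|   | PassengerId | Survived | Age | SibSp | Parch | Fare | 1 | 2 | 3 | female | male | C | Q | S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 22.0 | 1 | 0 | 7.25 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 1 | 2 | 1 | 38.0 | 1 | 0 | 71.2833 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 2 | 3 | 1 | 26.0 | 0 | 0 | 7.925 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |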
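
The loop, concatenate and drop steps can also be collapsed into a single `pd.get_dummies` call. This is a sketch of an alternative, not the approach used in this notebook: with the `columns` argument pandas prefixes each dummy with its source column name, so the resulting columns are named `Pclass_1`, `Sex_male` and so on rather than `1` and `male`. The `sample` frame is hypothetical.

```python
import pandas as pd

# Hypothetical stand-in frame (values invented for illustration).
sample = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
    "Fare": [7.25, 71.2833, 7.925],
})

# Encode the listed columns, drop the originals, and prefix each dummy
# with the name of the column it came from. dtype=int yields 0/1 values.
encoded = pd.get_dummies(sample, columns=["Pclass", "Sex", "Embarked"], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

The prefixed names make it easier to tell later which original column a dummy came from.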


# Taking Care of Missing Data
All is good, except Age, which has lots of missing values. We could fill them with the median age, or interpolate them. Pandas has an interpolate() function that replaces the missing NaNs with interpolated values (by default, linearly between the neighbouring non-missing values).

In [None]:
train_data1['Age'] = train_data1['Age'].interpolate()

Now let's look at the columns again. Notice that Age now has 891 non-null entries: the gaps have been filled with interpolated values.

In [None]:
train_data1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Age          891 non-null    float64
 3   SibSp        891 non-null    int64  
 4   Parch        891 non-null    int64  
 5   Fare         891 non-null    float64
 6   1            891 non-null    uint8  
 7   2            891 non-null    uint8  
 8   3            891 non-null    uint8  
 9   female       891 non-null    uint8  
 10  male         891 non-null    uint8  
 11  C            891 non-null    uint8  
 12  Q            891 non-null    uint8  
 13  S            891 non-null    uint8  
dtypes: float64(2), int64(4), uint8(8)
memory usage: 48.9 KB
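
The two options mentioned for Age, the median and `interpolate()`, fill the gaps differently. The sketch below compares them on a short, invented series (`ages` is hypothetical). Default interpolation depends on the neighbouring rows, and so on row order, whereas the median fill uses one value for every gap.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two gaps (values invented for illustration).
ages = pd.Series([22.0, np.nan, 26.0, 40.0, np.nan, 30.0])

# Linear interpolation: each gap becomes the midpoint of its two neighbours here.
interpolated = ages.interpolate()

# Median imputation: every gap gets the median of the observed values.
median_filled = ages.fillna(ages.median())

print(pd.DataFrame({"original": ages, "interpolated": interpolated, "median": median_filled}))
```

Which choice suits the Titanic data is a judgment call. Passenger order carries no obvious meaning for age, so a median fill may be the easier option to justify.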


# Converting the dataframe to numpy

Now that we have converted all the data to numeric, it's time to prepare the data for machine learning models.

This is where scikit-learn and NumPy come into play:

- X = input set; it starts with all 14 columns and has 13 attributes once 'Survived' is removed
- y = output (lowercase y), in this case 'Survived'

Now we convert our DataFrame from pandas to NumPy and assign the input and output:

In [None]:
X = train_data1.values
y = train_data1['Survived'].values

X still has the Survived values in it, which should not be there. So we delete that column from the NumPy array; it sits at index 1, i.e. it is the second column.

In [None]:
X = np.delete(X, 1, axis=1)
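
Deleting by position works only as long as 'Survived' stays at index 1. A sketch of an alternative is to separate the target by name while the data is still a DataFrame, then convert. The `sample` frame below is hypothetical, and the final check confirms that both routes give the same array.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in frame (values invented for illustration).
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Age": [22.0, 38.0, 26.0],
})

# Select the target by name, then drop it by name from the features.
y = sample["Survived"].to_numpy()
X = sample.drop(columns=["Survived"]).to_numpy()

# Same result as converting first and deleting the column at index 1.
X_by_position = np.delete(sample.to_numpy(), 1, axis=1)
print(X.shape, y.shape, np.array_equal(X, X_by_position))
```

Dropping by name keeps working if the column order changes, for example after adding or removing features.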

# Dividing data set into training set and test set
Now that X and y are ready, let's split the dataset into a 70% training set and a 30% test set using scikit-learn's model_selection module.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
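
A quick sanity check after splitting is to look at the shapes of the four arrays. The sketch below uses a small synthetic `X` and `y` (hypothetical data, 10 rows) with the same `test_size` and `random_state` as above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 10 samples with 2 features each, and a binary target.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# 30% of 10 samples go to the test set, the remaining 70% to the training set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```

Fixing `random_state` makes the split reproducible, so repeated runs of the notebook produce the same training and test rows.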