# Data Processing : Titanic Data 

<img style="float: left; width: 400px;" src="image_titanic_ship.png">

##  Overview

The RMS Titanic was a British passenger liner. It sank in the North Atlantic Ocean on 
15 April 1912 after striking an iceberg during her maiden voyage from Southampton, UK, 
to New York City. Of the estimated 2,224 passengers and crew aboard, more than 1,500 died, 
making it one of the worst passenger ship disasters in history. 

1. https://en.wikipedia.org/wiki/Titanic
2. https://en.wikipedia.org/wiki/Titanic#/media/File:RMS_Titanic_3.jpg
3. https://titanicfacts.net/titanic-survivors/

The publicly available Titanic dataset contains survival information for 1,309 passengers. We 
will investigate the dataset using Python libraries including NumPy, 
SciPy, Pandas, Matplotlib, and Seaborn.

##  Dataset Information

<img style="float: left; padding-bottom: 50px; " src="image_titanic_data.png" width="1000" height="100">


#### Pclass: A proxy for socio-economic status (SES)

1 = Upper class

2 = Middle class

3 = Lower class

#### Survived : Indicator of whether or not a passenger survived

0 : No : did not survive

1 : Yes : survived

#### Name : Name of the passenger. Format : last name, first name


#### Sex : Gender of passengers

Category : female, male


#### Age : Age is in years. 

Age is fractional if less than 1.

#### SibSp : the count of siblings and spouses, from 0 to 8

The dataset defines family relations in the following way:

Sibling = brother, sister, stepbrother, stepsister

Spouse = husband, wife (mistresses and fiancés were ignored)

#### Parch : the count of parents and children, from 0 to 9

The dataset defines family relations in the following way:

Parent = mother, father

Child = daughter, son, stepdaughter, stepson

Some children travelled only with a nanny, therefore Parch = 0 for them.


#### Ticket : Ticket  number

#### Fare : Ticket price in British pounds

#### Cabin : Cabin number 

#### Embarked : The port where the passenger boarded the ship.

There are three possible values for Embarked:

Southampton (S): about 70% of the people boarded from Southampton

Cherbourg   (C): about 20% boarded from Cherbourg

Queenstown  (Q): the rest boarded from Queenstown


#### Boat : Lifeboat (if survived)

#### Body : Body number (if did not survive and body was recovered)

#### Home.dest : Home/Destination of the passengers

## Data Processing Steps

Part 1: Read the final clean dataset

Part 2: Split data into train and test datasets

Part 3: Process Categorical Data : Convert into Numerical Data

----------------

##  Load Useful  Python Modules

numpy,  pandas, re, scipy

In [2]:
import numpy as np
import pandas as pd
import re
import scipy

sklearn modules for data processing

In [3]:
from sklearn import preprocessing 
from sklearn.impute import SimpleImputer, KNNImputer

-----------

## Part 1: Read Data

In [4]:
dataPath = "/Users/nururrahman/Desktop/StartUp/DataScienceInitiative/Bootcamp/"
dataFile = "data_titanic_clean.csv"

In [5]:
# Read the clean data set from the current working directory.
# If the file lives elsewhere, prepend the folder instead:
# df = pd.read_csv(dataPath + dataFile)
df = pd.read_csv(dataFile)

In [None]:
df.head(3)
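
After loading, it can help to confirm the shape, the column types, and the number of missing values before going further. The sketch below is only an illustration: it reads a three-row excerpt (a subset of columns copied from outputs shown later in this notebook) from an in-memory string instead of the real file. The names `csv_text` and `sample_df` are introduced here and are not part of the notebook.

```python
import io
import pandas as pd

# Three rows and a subset of columns copied from outputs shown later in this notebook.
csv_text = """Pclass,Survived,Name,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
3,1,"Dean, Mrs. Bertram (Eva Georgetta Light)",female,33.0,1,2,20.575,UNKNOWN,S
1,1,"Salomon, Mr. Abraham L",male,28.0,0,0,26.0,UNKNOWN,S
1,0,"Andrews, Mr. Thomas Jr",male,39.0,0,0,0.0,A36,S
"""

# Hypothetical stand-in for df = pd.read_csv(dataFile)
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)         # (number of rows, number of columns)
print(sample_df.dtypes)        # text columns are the ones to encode in Part 3
print(sample_df.isna().sum())  # missing values per column
```

The same three calls can be run on `df` once the real file has been read.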

-------------------

## Part 2 : How to Split Data into Training and Test Data sets

#### Training data set : A portion of the original data, sampled without replacement

The variable 'portion', also known as the fraction, ranges from 0 to 1.

We choose this fraction to be 0.75 for the training set. Typical values are between 0.70 and 0.90.

A higher fraction means more data is used for model training and less is left for testing.

#### Test data set : The remaining portion of the original data that is not in the training data set

In [6]:
portion  = 0.75
train_df = df.sample(frac=portion, replace=False)

In [7]:
# Check train data set
train_df.head(3)

Unnamed: 0,Pclass,Survived,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Boat,HomeDest,LastName,Title,NameLength,FamilySize
765,3,1,"Dean, Mrs. Bertram (Eva Georgetta Light)",female,33.0,1,2,CA_2315,20.575,UNKNOWN,S,10,Devon_EnglandWichita_KS,Dean,Mrs.,40,4
256,1,1,"Salomon, Mr. Abraham L",male,28.0,0,0,111163,26.0,UNKNOWN,S,1,NewYork_NY,Salomon,Mr.,22,1
288,1,1,"Swift, Mrs. Frederick Joel (Margaret Welles Ba...",female,48.0,0,0,17466,25.9292,D17,S,8,Brooklyn_NY,Swift,Mrs.,51,1


Get the row indices of the training dataset

In [8]:
train_index = train_df.index.tolist()

Test set : Select the rows of the original data that are not included in the training set

In [9]:
test_df = df[ ~df.index.isin(train_index) ]
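
A quick way to check that the two sets really partition the data is to confirm that their row indices do not overlap and that together they cover every row. The sketch below does this on a small made-up dataframe (`toy_df` and the other `toy_` names are hypothetical, not the Titanic data). It also shows `DataFrame.drop` as an equivalent way to build the test set.

```python
import numpy as np
import pandas as pd

# Hypothetical 20-row dataframe standing in for df
toy_df = pd.DataFrame({"x": np.arange(20), "y": np.arange(20) % 2})

# Same two steps as above: sample the training rows, keep the rest for testing
toy_train = toy_df.sample(frac=0.75, replace=False, random_state=0)
toy_test = toy_df[~toy_df.index.isin(toy_train.index)]

# Equivalent alternative: drop the training rows by index label
toy_test_alt = toy_df.drop(toy_train.index)

overlap = set(toy_train.index) & set(toy_test.index)
print(len(toy_train), len(toy_test), len(overlap))  # 15 5 0
print(toy_test.equals(toy_test_alt))                # True
```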

In [10]:
# Check test data set 
test_df.head(3)

Unnamed: 0,Pclass,Survived,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Boat,HomeDest,LastName,Title,NameLength,FamilySize
6,1,1,"Andrews, Miss. Kornelia Theodosia",female,63.0,1,0,13502,77.9583,D7,S,10,Hudson_NY,Andrews,Miss.,33,2
7,1,0,"Andrews, Mr. Thomas Jr",male,39.0,0,0,112050,0.0,A36,S,UNKNOWN,Belfast_NI,Andrews,Mr.,22,1
8,1,1,"Appleton, Mrs. Edward Dale (Charlotte Lamson)",female,53.0,2,0,11769,51.4792,C101,S,D,Bayside_Queens_NY,Appleton,Mrs.,45,3


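After sampling, the row indices of the training and test sets are no longer consecutive, because each row keeps its label from the original dataframe.

Reset the indices so that both dataframes are numbered from 0.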

In [11]:
train_df = train_df.reset_index(drop=True)
test_df  = test_df.reset_index(drop=True)

In [12]:
# Check final row indices 
# train_df.head(3)
# test_df.head(3)

Collect Training and Test datasets as a tuple  

In [13]:
train_test_original = (train_df, test_df)
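
For comparison, scikit-learn offers `train_test_split`, which performs the sampling and the complement selection in one call. This is an alternative to the pandas approach used in this notebook, not the method the notebook follows. The sketch uses a made-up dataframe (`toy_df`, `sk_train`, and `sk_test` are hypothetical names).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 20-row dataframe standing in for df
toy_df = pd.DataFrame({"x": np.arange(20), "y": np.arange(20) % 2})

# test_size=0.25 corresponds to portion = 0.75 for the training set
sk_train, sk_test = train_test_split(toy_df, test_size=0.25, random_state=0)

# The returned dataframes keep the original row labels, so reset them as before
sk_train = sk_train.reset_index(drop=True)
sk_test = sk_test.reset_index(drop=True)

print(sk_train.shape, sk_test.shape)  # (15, 2) (5, 2)
```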

#### Advanced Topic

Sometimes you will see something  like this

In [14]:
train_df = df.sample(frac=0.75, replace=False, random_state=42)

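What does 'random_state=42' mean?

'random_state' seeds the random number generator used by sample(), so the same seed always selects the same rows. It can be any non-negative integer; 42 is only a popular convention.

Let us choose a random integer from 100 to 499 (the upper bound of np.random.randint is exclusive) and print it, so the split can be reproduced later.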

In [15]:
# rand_seed = 199
rand_seed = np.random.randint(100,500)

In [16]:
print(rand_seed)

361


In [17]:
portion  = 0.75
train_df = df.sample(frac=portion, replace=False, random_state=rand_seed)
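
The effect of the seed is easy to see on a small example. The sketch below samples a made-up dataframe (`toy_df` is hypothetical) twice with the same seed and once with a different seed, then compares which rows were selected.

```python
import numpy as np
import pandas as pd

# Hypothetical 100-row dataframe standing in for df
toy_df = pd.DataFrame({"x": np.arange(100)})

a = toy_df.sample(frac=0.75, replace=False, random_state=42)
b = toy_df.sample(frac=0.75, replace=False, random_state=42)
c = toy_df.sample(frac=0.75, replace=False, random_state=7)

# Same seed: identical rows in identical order
print(a.index.equals(b.index))

# Different seed: the selection is expected to differ
print(a.index.equals(c.index))
```

Without 'random_state', every run of the cell draws a different training set, which makes results hard to compare between runs.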

In [18]:
# # Check train data set
# train_df.head(3)

# train_index = train_df.index.tolist()
# test_df = df[ ~df.index.isin(train_index) ]

# # Check test data set 
# test_df.head(3)

-----------------------------------------

## Part 3 : Process  Categorical Data for ML Modeling

### 3.1: Categorical Data Conversion : Method 1 
#### OrdinalEncoder : Convert Object Type to Integer Type 

In [19]:
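def convertObject_OrdinalEncoder(data):
    """
    Ordinal encoding: convert categorical features of object type to integer type.

    Each object column is encoded independently with sklearn's LabelEncoder,
    which maps the sorted unique values of the column to 0, 1, 2, ...
    """
    
    # Positions of the columns whose dtype is object (text)
    catIndex = np.where(data.dtypes == object)[0]
    
    columns = []
    
    for ind in catIndex:
        # Fit a fresh encoder per column so that every column gets its own mapping
        label_encoder = preprocessing.LabelEncoder()
        feature = label_encoder.fit_transform(data.iloc[:,ind])
        columns.append(feature)
        
    # Numeric columns are kept as they are; encoded columns are appended after them
    d1 = data.drop( data.columns[catIndex], axis=1)
    # Reuse the original row index so that concat aligns the rows correctly
    d2 = pd.DataFrame( np.column_stack( columns ), columns=data.columns[catIndex], index=data.index )
    dd = pd.concat([d1,d2], axis=1)
    
    return dd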

Pass the dataframe to the function and collect the result

In [20]:
res = convertObject_OrdinalEncoder(df)
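
Although this section is titled OrdinalEncoder, the function above relies on `LabelEncoder` applied column by column. scikit-learn also provides a dedicated `OrdinalEncoder` that encodes several feature columns in one call. The sketch below shows that alternative on a made-up dataframe (`toy_df`, `cat_cols`, and `encoded` are hypothetical names) and is not the notebook's own method.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical dataframe with two text columns and one numeric column
toy_df = pd.DataFrame({
    "Sex": ["female", "male", "male", "female"],
    "Embarked": ["S", "C", "S", "Q"],
    "Age": [33.0, 28.0, 39.0, 53.0],
})

cat_cols = ["Sex", "Embarked"]

# Categories are sorted per column and mapped to 0, 1, 2, ...
encoder = OrdinalEncoder()
codes = encoder.fit_transform(toy_df[cat_cols])

# Same layout as the function above: numeric columns first, encoded columns after
encoded_cats = pd.DataFrame(codes, columns=cat_cols, index=toy_df.index).astype(int)
encoded = pd.concat([toy_df.drop(columns=cat_cols), encoded_cats], axis=1)

print(encoded)
print(encoder.categories_)  # the learned category order for each column
```

Keeping the fitted `encoder` around makes it possible to map the integers back to labels later with `encoder.inverse_transform`.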

Check output 

In [21]:
# res
# res.shape 

In [23]:
train_df    = res.sample(frac=0.75, replace=False, random_state=rand_seed)
train_index = train_df.index.tolist()

In [24]:
test_df = res[ ~res.index.isin(train_index) ]

In [25]:
train_df = train_df.reset_index(drop=True)
test_df  = test_df.reset_index(drop=True)

In [26]:
train_test_ordinal = (train_df, test_df)
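
In the cells above, the categories are encoded on the full dataframe before the split, so every category in the test set is already known. If the encoder were instead fit on the training set only, the test set could contain categories the encoder has never seen. The sketch below shows one way to handle that with `OrdinalEncoder`, assuming scikit-learn 0.24 or newer; the `toy_` dataframes and the choice of -1 for unknown values are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical training and test sets; "Q" never appears in the training data
toy_train = pd.DataFrame({"Embarked": ["S", "C", "S"]})
toy_test = pd.DataFrame({"Embarked": ["C", "Q"]})

# Unseen categories are mapped to -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

train_codes = encoder.fit_transform(toy_train[["Embarked"]])  # learn the mapping on train only
test_codes = encoder.transform(toy_test[["Embarked"]])        # reuse the same mapping on test

print(train_codes.ravel())  # C -> 0, S -> 1
print(test_codes.ravel())   # C -> 0, Q -> -1 (unseen)
```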

In [27]:
train_df.head(3)

Unnamed: 0,Pclass,Survived,Age,SibSp,Parch,Fare,NameLength,FamilySize,Name,Sex,Ticket,Cabin,Embarked,Boat,HomeDest,LastName,Title
0,3,0,21.0,0,0,8.05,24,1,194,1,728,186,2,27,340,124,7
1,3,1,5.0,0,0,12.475,29,1,370,0,593,186,2,4,236,236,6
2,1,1,18.0,1,0,227.525,49,2,69,0,832,97,0,13,236,37,8
