# Data Processing : Titanic Data 

<img style="float: left; width: 400px;" src="image_titanic_ship.png">

##  Overview

The RMS Titanic was a British passenger liner. It sank in the North Atlantic Ocean on 
15 April 1912 after striking an iceberg during her maiden voyage from Southampton, UK, 
to New York City. Of the estimated 2,224 passengers and crew aboard, more than 1,500 died, 
making it one of the worst passenger ship disasters in history. 

1. https://en.wikipedia.org/wiki/Titanic
2. https://en.wikipedia.org/wiki/Titanic#/media/File:RMS_Titanic_3.jpg
3. https://titanicfacts.net/titanic-survivors/

The publicly available Titanic dataset contains survival information about 1309 passengers. We 
will investigate the dataset using Python libraries including NumPy, 
SciPy, Pandas, Matplotlib, and Seaborn.

##  Dataset Information

<img style="float: left; padding-bottom: 50px; " src="image_titanic_data.png" width="1000" height="100">


#### Pclass: A proxy for socio-economic status (SES)

1 = Upper class

2 = Middle class

3 = Lower class

#### Survived : Indicator of whether or not a passenger survived

0 : No : did not survive

1 : Yes : survived

#### Name : Name of the passenger. Format : last name, title. first name (e.g. "Allen, Miss. Elisabeth Walton")


#### Sex : Gender of passengers

Category : female, male


#### Age : Age is in years. 

Age is fractional if less than 1.

#### SibSp : Number of siblings and spouses aboard, from 0 to 8

The dataset defines family relations in the following way:

Sibling = brother, sister, stepbrother, stepsister

Spouse = husband, wife (mistresses and fiancés were ignored)

#### Parch : Number of parents and children aboard, from 0 to 9

The dataset defines family relations in the following way:

Parent = mother, father

Child = daughter, son, stepdaughter, stepson

Some children travelled only with a nanny, therefore parch=0 for them
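The 'FamilySize' column that appears in the cleaned data previews of this notebook looks like a combination of these two counts. A minimal sketch, assuming FamilySize = SibSp + Parch + 1 (the extra 1 being the passenger); this matches the sample rows shown in the notebook but is an assumption, not something the dataset documents. The toy frame is made up for illustration.

```python
import pandas as pd

# Hypothetical rows mirroring SibSp / Parch values seen in the data previews
toy = pd.DataFrame({"SibSp": [0, 1, 1], "Parch": [0, 2, 0]})

# Assumed definition: siblings/spouses + parents/children + the passenger
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1
print(toy)
```

A passenger travelling alone therefore gets a family size of 1 under this assumption.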


#### Ticket : Ticket  number

#### Fare : Ticket price in British pounds

#### Cabin : Cabin number 

#### Embarked : The port where the passenger boarded the ship. 

There are three possible values for Embarked:

Southampton (S): about 70% of the people boarded from Southampton

Cherbourg   (C): about 20% boarded from Cherbourg

Queenstown  (Q): the rest boarded from Queenstown
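Shares like these can be computed with 'value_counts'. A small sketch on a made-up column (the ten values below are hypothetical and chosen only to give a 70/20/10 split); on the real data the same call would be applied to the Embarked column of the dataframe.

```python
import pandas as pd

# Hypothetical miniature of the Embarked column: 7 x S, 2 x C, 1 x Q
embarked = pd.Series(list("SSSSSSSCCQ"), name="Embarked")

# normalize=True returns fractions of the total instead of raw counts
shares = embarked.value_counts(normalize=True)
print(shares)
```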


#### Boat : Lifeboat (if survived)

#### Body : Body number (if did not survive and body was recovered)

#### Home.dest : Home/Destination of the passengers

## Data Processing Steps

Part 1: Read the final clean dataset

Part 2: Split data into train and test datasets

Part 3: Process Categorical Data : Convert into Numerical Data

----------------

##  Load Useful  Python Modules

numpy,  pandas, re, scipy

In [15]:
import numpy as np
import pandas as pd
import re
import scipy

sklearn modules for data processing

In [16]:
from sklearn import preprocessing 
from sklearn.impute import SimpleImputer, KNNImputer

-----------

## Part 1: Read Data

In [17]:
dataPath = "/Users/nururrahman/Desktop/StartUp/DataScienceInitiative/Bootcamp/"
dataFile = "data_titanic_clean.csv"

In [18]:
#df = pd.read_csv( dataPath + dataFile) 
df = pd.read_csv('data_titanic_clean.csv')

In [19]:
df.head(3)

Unnamed: 0,Pclass,Survived,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Boat,HomeDest,LastName,Title,NameLength,FamilySize
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2,StLouis_MO,Smith,Miss.,29,1
1,1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.55,C22_C26,S,11,Montreal_PQ__Chesterville_ON,Klasen,Master.,30,4
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22_C26,S,UNKNOWN,Montreal_PQ__Chesterville_ON,Spector,Miss.,28,4


-------------------

## Part 2 : How to Split Data into Training and Test Data sets

#### Training data set : Choose a portion of the original data without replacement  

The variable 'portion', also known as the fraction, ranges from 0 to 1.

We choose this fraction to be 0.75 for the training set. Typical values are between 0.70 and 0.90.

A higher fraction means more data is used for model training and less is left for testing.

#### Test data set : The remaining portion of the original data that is not in the training data set 

In [20]:
portion  = 0.75
train_df = df.sample(frac=portion, replace=False)

In [21]:
# Check train data set
train_df.head(3)

Unnamed: 0,Pclass,Survived,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Boat,HomeDest,LastName,Title,NameLength,FamilySize
138,1,0,"Graham, Mr. George Edward",male,38.0,0,1,PC_17582,153.4625,C91,S,UNKNOWN,Winnipeg_MB,Kennedy,Mr.,25,2
390,2,0,"Deacon, Mr. Percy William",male,17.0,0,0,SOC_14879,73.5,UNKNOWN,S,UNKNOWN,UNKNOWN,Dooley,Mr.,25,1
1086,3,0,"Olsson, Miss. Elina",female,31.0,0,0,350407,7.8542,UNKNOWN,S,UNKNOWN,UNKNOWN,Dika,Miss.,19,1


Get the row indices of the training dataset

In [22]:
train_index = train_df.index.tolist()

Test set : Choose the rows of the original data that are not included in the training set 

In [28]:
#~df.index.isin(train_index)

In [29]:
test_df = df[ ~df.index.isin(train_index) ]
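A quick sanity check for a split like this is that the two sets do not overlap and together cover the original rows. The sketch below uses a made-up frame ('toy' is hypothetical) and also shows 'drop' as an equivalent way to select the remaining rows.

```python
import pandas as pd

toy = pd.DataFrame({"x": range(20)})

train = toy.sample(frac=0.75, replace=False, random_state=0)
test_isin = toy[~toy.index.isin(train.index)]   # same approach as in the notebook
test_drop = toy.drop(train.index)               # equivalent alternative

print(test_isin.equals(test_drop))               # both selections give the same rows
print(len(train), len(test_isin))                # sizes of the two sets
print(set(train.index) & set(test_isin.index))   # overlap, expected to be empty
```

Note that 'isin' accepts the index directly, so converting it to a list first is optional.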

In [30]:
# Check test data set 
test_df.head(3)

Unnamed: 0,Pclass,Survived,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Boat,HomeDest,LastName,Title,NameLength,FamilySize
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2,StLouis_MO,Smith,Miss.,29,1
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22_C26,S,UNKNOWN,Montreal_PQ__Chesterville_ON,Davidson,Mr.,36,4
5,1,1,"Anderson, Mr. Harry",male,48.0,0,0,19952,26.55,E12,S,3,NewYork_NY,Omont,Mr.,19,1


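After the split, the row indices of the training and test sets are no longer consecutive, because they still carry the labels of the original dataframe.

Reset the indices so that each dataframe is numbered from 0.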

In [31]:
train_df = train_df.reset_index(drop=True)
test_df  = test_df.reset_index(drop=True)
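The 'drop=True' argument matters here. A short sketch on a made-up frame (the index labels are borrowed from the training preview, the values are hypothetical) showing what happens with and without it:

```python
import pandas as pd

toy = pd.DataFrame({"x": [10, 20, 30]}, index=[138, 390, 1086])

kept = toy.reset_index()                # old labels move into a new 'index' column
dropped = toy.reset_index(drop=True)    # old labels are discarded

print(kept.columns.tolist())
print(dropped.index.tolist())
```

Without 'drop=True' the old labels would end up as an extra feature column, which is usually not wanted before modeling.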

In [32]:
# Check final row indices 
# train_df.head(3)
# test_df.head(3)

Collect Training and Test datasets as a tuple  

In [34]:
train_test_original = (train_df, test_df)

In [37]:
type(train_test_original[1])

pandas.core.frame.DataFrame

In [35]:
train_test_original

(     Pclass  Survived                                            Name     Sex  \
 0         1         0                       Graham, Mr. George Edward    male   
 1         2         0                       Deacon, Mr. Percy William    male   
 2         3         0                             Olsson, Miss. Elina  female   
 3         3         0                             Zakarian, Mr. Ortin    male   
 4         2         0                  Sobey, Mr. Samuel James Hayden    male   
 ..      ...       ...                                             ...     ...   
 977       2         1                       Mellors, Mr. William John    male   
 978       1         1  Chambers, Mrs. Norman Campbell (Bertha Griggs)  female   
 979       1         0                      Ryerson, Mr. Arthur Larned    male   
 980       1         0                          Astor, Col. John Jacob    male   
 981       1         1                   Endres, Miss. Caroline Louise  female   
 
       Age  Si

#### Advanced Topic

Sometimes you will see 'sample' called with an extra argument, 'random_state'. The help text lists the available arguments:

In [39]:
help(df.sample)

Help on method sample in module pandas.core.generic:

sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None) -> 'FrameOrSeries' method of pandas.core.frame.DataFrame instance
    Return a random sample of items from an axis of object.
    
    You can use `random_state` for reproducibility.
    
    Parameters
    ----------
    n : int, optional
        Number of items from axis to return. Cannot be used with `frac`.
        Default = 1 if `frac` = None.
    frac : float, optional
        Fraction of axis items to return. Cannot be used with `n`.
    replace : bool, default False
        Allow or disallow sampling of the same row more than once.
    weights : str or ndarray-like, optional
        Default 'None' results in equal probability weighting.
        If passed a Series, will align with target object on index. Index
        values in weights not found in sampled object will be ignored and
        index values in sampled object not in weights will b

In [40]:
train_df = df.sample(frac=0.75, replace=False, random_state=42)

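What does 'random_state=42' mean?

'random_state' seeds the random number generator, so the same rows are sampled every time the cell is run. It can be any non-negative integer (0 included).

Let us choose a random integer from 100 to 499 ('np.random.randint' excludes the upper bound). Print and record the value, otherwise the split cannot be reproduced later.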

In [41]:
# rand_seed = 199
rand_seed = np.random.randint(100,500)

In [42]:
print(rand_seed)

116


In [43]:
portion  = 0.75
train_df = df.sample(frac=portion, replace=False, random_state=rand_seed)

In [None]:
# # Check train data set
# train_df.head(3)

# train_index = train_df.index.tolist()
# test_df = df[ ~df.index.isin(train_index) ]

# # Check test data set 
# test_df.head(3)
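The effect of 'random_state' is easy to verify. A minimal sketch on a made-up frame ('toy' is hypothetical): sampling twice with the same seed returns the same rows in the same order, while a different seed will in general return a different selection.

```python
import pandas as pd

toy = pd.DataFrame({"x": range(100)})

a = toy.sample(frac=0.75, random_state=42)
b = toy.sample(frac=0.75, random_state=42)
c = toy.sample(frac=0.75, random_state=7)

print(a.index.equals(b.index))  # same seed: identical rows and order
print(a.index.equals(c.index))  # different seed: a different draw
```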

-----------------------------------------

## Part 3 : Process  Categorical Data for ML Modeling

### 3.1: Categorical Data Conversion : Method 1 
#### Ordinal Encoding : Convert Object Type to Integer Type (implemented with LabelEncoder, one column at a time)

In [45]:
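def convertObject_OrdinalEncoder(data):
    """
    Convert categorical features of object type to integer codes.

    Each object column is encoded independently with LabelEncoder,
    which assigns integers to the sorted unique values of that column.
    """
    # Positions of the object-typed (categorical) columns
    catIndex = np.where(data.dtypes == object)[0]
    
    columns = []
    
    for ind in catIndex:
        # A fresh encoder per column, so codes start at 0 for every feature
        label_encoder = preprocessing.LabelEncoder()
        feature = label_encoder.fit_transform(data.iloc[:,ind])
        columns.append(feature)
        
    # Non-object columns first, then the encoded columns under their original names
    d1 = data.drop( data.columns[catIndex], axis=1)
    # Reuse the original index so that concat aligns rows even for a non-default index
    d2 = pd.DataFrame( np.column_stack( columns ), columns=data.columns[catIndex], index=data.index )
    dd = pd.concat([d1,d2], axis=1)
    
    return dd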

Pass the dataframe to the function and get the result

In [47]:
res = convertObject_OrdinalEncoder(df)
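For comparison, scikit-learn also provides 'preprocessing.OrdinalEncoder', which encodes several feature columns in one call and keeps the learned categories for later inspection. A minimal sketch on a made-up frame ('toy' and its values are hypothetical, and the categorical columns are listed by hand):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
    "Age": [22.0, 38.0, 26.0],
})
cat_cols = ["Sex", "Embarked"]

# Fit on the categorical columns only; categories are sorted per column
encoder = OrdinalEncoder()
codes = encoder.fit_transform(toy[cat_cols]).astype(int)

# Rebuild the frame: numeric columns plus the integer-coded columns
encoded = toy.drop(columns=cat_cols).join(
    pd.DataFrame(codes, columns=cat_cols, index=toy.index)
)

print(encoded)
print(encoder.categories_)
```

Because the categories are sorted, 'female' maps to 0 and 'male' to 1, the same coding that appears in the Sex column of the encoded output in this notebook.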

Check output 

In [85]:
res
#res.shape 

Unnamed: 0,Pclass,Survived,Age,SibSp,Parch,Fare,NameLength,FamilySize,Name,Sex,Ticket,Cabin,Embarked,Boat,HomeDest,LastName,Title
0,1,1,29.00,0,0,211.3375,29,1,21,0,187,43,2,11,308,747,6
1,1,1,0.92,1,2,151.5500,30,4,23,1,49,79,2,2,230,413,5
2,1,0,2.00,1,2,151.5500,28,4,24,0,49,79,2,27,230,753,6
3,1,0,30.00,1,2,151.5500,36,4,25,1,49,79,2,27,230,194,7
4,1,0,25.00,1,2,151.5500,47,4,26,0,49,79,2,27,230,357,8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,3,0,14.50,1,0,14.4542,20,2,1291,0,259,186,0,27,340,19,6
1305,3,0,28.00,1,0,14.4542,21,2,1292,0,259,186,0,27,340,635,6
1306,3,0,26.50,0,0,7.2250,25,1,1293,1,250,186,0,27,340,91,7
1307,3,0,27.00,0,0,7.2250,19,1,1294,1,264,186,0,27,340,692,7


In [50]:
train_df    = res.sample(frac=0.75, replace=False, random_state=rand_seed)
train_index = train_df.index.tolist()

In [51]:
test_df = res[ ~res.index.isin(train_index) ]

In [52]:
train_df = train_df.reset_index(drop=True)
test_df  = test_df.reset_index(drop=True)

In [53]:
train_test_ordinal = (train_df, test_df)
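The cells above encode the full dataset first and split afterwards, so the encoder has already seen the test rows. A common alternative is to fit the encoder on the training rows only and map categories that appear only in the test set to a sentinel value. A sketch under the assumption that scikit-learn 0.24 or later is installed (needed for 'handle_unknown="use_encoded_value"'); the toy data is hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({"Embarked": ["S", "C", "S"]})
test = pd.DataFrame({"Embarked": ["C", "Q"]})   # 'Q' never appears in train

# Unknown categories at transform time are coded as -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train[["Embarked"]])

test_codes = encoder.transform(test[["Embarked"]]).ravel().astype(int)
print(test_codes.tolist())
```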

In [54]:
train_df.head(3)

Unnamed: 0,Pclass,Survived,Age,SibSp,Parch,Fare,NameLength,FamilySize,Name,Sex,Ticket,Cabin,Embarked,Boat,HomeDest,LastName,Title
0,3,0,28.0,0,0,7.8958,20,1,761,1,538,186,0,27,340,3,7
1,3,1,28.0,0,0,7.8792,33,1,808,0,391,186,1,10,340,484,6
2,2,0,30.0,0,0,13.0,27,1,445,1,218,186,2,27,340,127,7


In [55]:
test_df.head()

Unnamed: 0,Pclass,Survived,Age,SibSp,Parch,Fare,NameLength,FamilySize,Name,Sex,Ticket,Cabin,Embarked,Boat,HomeDest,LastName,Title
0,1,1,29.0,0,0,211.3375,29,1,21,0,187,43,2,11,308,747,6
1,1,0,25.0,1,2,151.55,47,4,26,0,49,79,2,27,230,357,8
2,1,1,53.0,2,0,51.4792,45,3,50,0,76,61,2,26,21,826,8
3,1,0,47.0,1,0,227.525,22,2,68,1,832,97,0,27,236,406,0
4,1,0,28.0,0,0,25.925,19,1,97,1,789,186,2,27,236,751,7


In [61]:
df4 = pd.DataFrame({"Sex":['M','M','F','F'], "Ocupation":["Student", "Doctor","Professional","Student"]})
df4

Unnamed: 0,Sex,Ocupation
0,M,Student
1,M,Doctor
2,F,Professional
3,F,Student


In [72]:
df5 = pd.get_dummies(df4)
df5

Unnamed: 0,Sex_F,Sex_M,Ocupation_Doctor,Ocupation_Professional,Ocupation_Student
0,0,1,0,0,1
1,0,1,1,0,0
2,1,0,0,1,0
3,1,0,0,0,1
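Two 'get_dummies' options are worth knowing. Recent pandas versions (2.0 and later) return boolean columns by default, so 'dtype=int' is needed to get 0/1 values like the output above. And 'drop_first=True' drops one level per feature, which removes columns that are fully determined by the others. A sketch on the same toy data, rebuilt here so the block runs by itself:

```python
import pandas as pd

df4 = pd.DataFrame({"Sex": ["M", "M", "F", "F"],
                    "Ocupation": ["Student", "Doctor", "Professional", "Student"]})

# One 0/1 column per category level
full = pd.get_dummies(df4, dtype=int)

# Same encoding, but the first level of each feature is dropped
reduced = pd.get_dummies(df4, dtype=int, drop_first=True)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

Whether to drop a level depends on the model; linear models usually benefit, tree-based models usually do not need it.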


In [73]:
df5.rename(columns = {"Sex_F": "F", "Sex_M": "M", "Ocupation_Doctor":"D", "Ocupation_Professional":"P", "Ocupation_Student":"S"})

Unnamed: 0,F,M,D,P,S
0,0,1,0,0,1
1,0,1,1,0,0
2,1,0,0,1,0
3,1,0,0,0,1


In [84]:
pd.get_dummies(df4, prefix=' ', prefix_sep='_')

Unnamed: 0,_F,_M,_Doctor,_Professional,_Student
0,0,1,0,0,1
1,0,1,1,0,0
2,1,0,0,1,0
3,1,0,0,0,1
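A prefix of ' ' (a single space) puts that space at the start of every column name, which is easy to miss in a printed table. Two alternatives, sketched on the same toy data (rebuilt so the block is self-contained): an empty prefix and separator give bare level names, and a dictionary gives each feature its own prefix. The prefixes 'sex' and 'job' are arbitrary choices for illustration.

```python
import pandas as pd

df4 = pd.DataFrame({"Sex": ["M", "M", "F", "F"],
                    "Ocupation": ["Student", "Doctor", "Professional", "Student"]})

# Empty prefix and separator: column names are just the category levels
bare = pd.get_dummies(df4, prefix="", prefix_sep="", dtype=int)

# Per-column prefixes supplied as a dictionary
named = pd.get_dummies(df4, prefix={"Sex": "sex", "Ocupation": "job"}, dtype=int)

print(bare.columns.tolist())
print(named.columns.tolist())
```

Bare level names are only safe when no two features share a level name; otherwise the resulting columns collide.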


In [86]:
train_df.head()

Unnamed: 0,Pclass,Survived,Age,SibSp,Parch,Fare,NameLength,FamilySize,Name,Sex,Ticket,Cabin,Embarked,Boat,HomeDest,LastName,Title
0,3,0,28.0,0,0,7.8958,20,1,761,1,538,186,0,27,340,3,7
1,3,1,28.0,0,0,7.8792,33,1,808,0,391,186,1,10,340,484,6
2,2,0,30.0,0,0,13.0,27,1,445,1,218,186,2,27,340,127,7
3,3,1,28.0,0,0,7.7333,32,1,830,0,576,186,1,10,340,594,6
4,3,0,25.0,0,0,7.7417,21,1,426,1,620,186,1,27,236,601,7
