# Data Processing : Titanic Data 

<img style="float: left; width: 400px;" src="image_titanic_ship.png">

##  Overview

The RMS Titanic was a British passenger liner. It sank in the North Atlantic Ocean on 
15 April 1912 after striking an iceberg during her maiden voyage from Southampton, UK, 
to New York City. Of the estimated 2,224 passengers and crew aboard, more than 1,500 died, 
making it one of the worst passenger ship disasters in history. 

1. https://en.wikipedia.org/wiki/Titanic
2. https://en.wikipedia.org/wiki/Titanic#/media/File:RMS_Titanic_3.jpg

The publicly available Titanic dataset contains survival information about 1,309 passengers. We 
will investigate the dataset using Python libraries including NumPy, 
SciPy, pandas, Matplotlib, and Seaborn.

##  Dataset Information

<img style="float: left; padding-bottom: 50px; " src="image_titanic_data.png" width="1000" height="100">


#### Pclass: A proxy for socio-economic status (SES)

1 = Upper class

2 = Middle class

3 = Lower class

#### Survived : Indicator of whether or not a passenger survived

0 : No : did not survive

1 : Yes : survived

#### Name : Name of the passenger. Format : last name, title, first name


#### Sex : Gender of passengers

Category : female, male


#### Age : Age is in years. 

Age is fractional if less than 1.

#### SibSp : Number of siblings and spouses aboard, from 0 to 8

The dataset defines family relations in the following way:

Sibling = brother, sister, stepbrother, stepsister

Spouse = husband, wife (mistresses and fiancés were ignored)

#### Parch : Number of parents and children aboard, from 0 to 9

The dataset defines family relations in the following way:

Parent = mother, father

Child = daughter, son, stepdaughter, stepson

Some children travelled only with a nanny, so Parch = 0 for them.


#### Ticket : Ticket  number

#### Fare : Ticket price in British pounds

#### Cabin : Cabin number 

#### Embarked : The port where the passenger boarded the ship.

There are three possible values for Embarked:

Southampton (S): about 70% of the people boarded from Southampton

Cherbourg   (C): about 20% boarded from Cherbourg

Queenstown  (Q): the rest boarded from Queenstown
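The boarding shares listed above can be checked with `value_counts(normalize=True)`. The sketch below uses a small made-up Series rather than the real column, so the printed shares are illustrative only; in the notebook the same call would be applied to `df["Embarked"]` once the data is loaded.

```python
import pandas as pd

# Hypothetical toy column; the real notebook would use df["Embarked"] instead.
embarked = pd.Series(["S", "S", "S", "S", "S", "S", "S", "C", "C", "Q"])

# normalize=True returns fractions that sum to 1 instead of raw counts.
shares = embarked.value_counts(normalize=True)
print(shares)
```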


#### Boat : Lifeboat (if survived)

#### Body : Body number (if did not survive and body was recovered)

#### Home.dest : Home/Destination of the passengers

## Categorical Data Processing Steps

Part 1: Read Final Clean Data

Part 2: Process Categorical Data and Convert into Numerical

Part 3: Create train, test split

----------------

##  Load Python Modules

numpy,  pandas, re, scipy

In [1]:
import numpy as np
import pandas as pd
import re
import scipy

In [2]:
from IPython import display

sklearn modules

In [3]:
from sklearn import preprocessing 
from sklearn.impute import SimpleImputer, KNNImputer

-----------

## Part 1: Read Data

In [4]:
dataFile = "data_titanic_final.csv"
df = pd.read_csv( dataFile )

In [5]:
df.head(2)

Unnamed: 0,Pclass,Survived,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Boat,HomeDest,LastName,Title,NameLength,FamilySize
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2,StLouis_MO,Allen,Miss.,29,1
1,1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.55,C22_C26,S,11,Montreal_PQ__Chesterville_ON,Allison,Master.,30,4


In [6]:
# df.Cabin.unique()

In [7]:
df.Embarked.unique()

array(['S', 'C', 'UNKNOWN', 'Q'], dtype=object)
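The output shows that missing ports are stored as the string `UNKNOWN` rather than as NaN, so `isna()` would not detect them. A minimal sketch for counting the placeholder per column is given below; the toy frame and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame that mimics the 'UNKNOWN' placeholder convention.
toy = pd.DataFrame({
    "Cabin":    ["B5", "UNKNOWN", "UNKNOWN", "C22_C26"],
    "Embarked": ["S", "C", "UNKNOWN", "Q"],
})

# Compare every cell with the placeholder, then sum the matches per column.
unknown_counts = (toy == "UNKNOWN").sum()
print(unknown_counts)
```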

-------------------

## Part 2 : Process  Categorical Data for  ML  Modeling

Draw a random integer seed from 100 to 499 (the upper bound of `np.random.randint` is exclusive)

In [8]:
theSeed = np.random.randint(100,500)
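Because `theSeed` is itself drawn at random, each fresh run of the notebook uses a different seed. Within a single run, however, passing the same seed to `sample` selects the same rows every time. The sketch below illustrates this on a made-up frame with an arbitrary fixed seed (123 is an assumption, not a value from the notebook).

```python
import pandas as pd

# Hypothetical toy frame with 20 rows.
toy = pd.DataFrame({"value": range(20)})

# The same random_state always selects the same rows in the same order.
first = toy.sample(frac=0.5, replace=False, random_state=123)
second = toy.sample(frac=0.5, replace=False, random_state=123)

same_rows = first.index.equals(second.index)
print(same_rows)
```

Replacing the random draw with a fixed integer would make the split repeatable across runs as well.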

### 2.1 : Data in Original Format : Do not Convert Object Type

Training set : Sample a fraction of the original rows without replacement  

`portion` is the fraction of rows to sample and must lie between 0 and 1

In [9]:
portion  = 0.75
train_df = df.sample(frac=portion, replace=False, random_state=theSeed)

In [10]:
# Check train set
train_df.head(3)

Unnamed: 0,Pclass,Survived,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Boat,HomeDest,LastName,Title,NameLength,FamilySize
872,3,1,"Howard, Miss. May Elizabeth",female,28.0,0,0,A_2_39186,8.05,UNKNOWN,S,C,UNKNOWN,Howard,Miss.,27,1
967,3,0,"Lindblom, Miss. Augusta Charlotta",female,45.0,0,0,347073,7.75,UNKNOWN,S,UNKNOWN,UNKNOWN,Lindblom,Miss.,33,1
511,2,0,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,UNKNOWN,Q,UNKNOWN,Cambridge_MA,Myles,Mr.,25,1


Get the row indices of the training dataset

In [11]:
train_index = train_df.index.tolist()

Test set : All rows of the original data that are not included in the training set 

In [12]:
test_df = df[ ~df.index.isin(train_index) ]
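The complement-based split can be sanity-checked by confirming that the two parts share no rows and together cover the whole frame. The following is a sketch on a made-up frame, not on the Titanic data.

```python
import pandas as pd

# Hypothetical toy frame with 8 rows.
toy = pd.DataFrame({"value": range(8)})

train = toy.sample(frac=0.75, replace=False, random_state=0)
# Rows whose index label is not in the training sample form the test set.
test = toy[~toy.index.isin(train.index)]

# The two parts are disjoint and together cover the whole frame.
overlap = set(train.index) & set(test.index)
print(len(train), len(test), len(overlap))
```

When the index labels are unique, `toy.drop(train.index)` selects the same test rows.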

In [13]:
# Check test set 
test_df.head(3)

Unnamed: 0,Pclass,Survived,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Boat,HomeDest,LastName,Title,NameLength,FamilySize
1,1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.55,C22_C26,S,11,Montreal_PQ__Chesterville_ON,Allison,Master.,30,4
13,1,1,"Barber, Miss. Ellen ""Nellie""",female,26.0,0,0,19877,78.85,UNKNOWN,S,6,UNKNOWN,Barber,Miss.,28,1
16,1,0,"Baxter, Mr. Quigg Edmond",male,24.0,0,1,PC_17558,247.5208,B58_B60,C,UNKNOWN,Montreal_PQ,Baxter,Mr.,24,2


After sampling, the row indices of the training and test sets are no longer contiguous

Reset the row indices so that each dataframe is numbered from 0 

In [14]:
train_df = train_df.reset_index(drop=True)
test_df  = test_df.reset_index(drop=True)
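As a small illustration of what `reset_index(drop=True)` does, the sketch below renumbers a made-up frame whose index labels are out of order. With `drop=True` the old labels are discarded rather than kept as an extra column.

```python
import pandas as pd

# Hypothetical toy frame with shuffled index labels, as left behind by sampling.
toy = pd.DataFrame({"value": [10, 20, 30, 40]}, index=[7, 2, 9, 4])

# drop=True discards the old labels instead of adding them as a new column.
renumbered = toy.reset_index(drop=True)
print(renumbered.index.tolist())
print(renumbered.columns.tolist())
```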

In [15]:
# Check final row indices 
# train_df.head(3)
# test_df.head(3)

Collect Training and Test dataset as a tuple  

In [16]:
train_test_original = (train_df, test_df)
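A tuple stored this way can later be unpacked and separated into features and target. The sketch below assumes `Survived` is the target, as in the rest of the notebook; the frames and the names `train_part`, `test_part`, `X_train`, and `y_train` are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the real training and test frames.
train_df = pd.DataFrame({"Survived": [1, 0, 1], "Age": [29.0, 45.0, 0.92]})
test_df = pd.DataFrame({"Survived": [0], "Age": [62.0]})
train_test = (train_df, test_df)

# Tuple unpacking restores the two frames in the order they were stored.
train_part, test_part = train_test

# Separate the target column from the remaining features.
X_train = train_part.drop(columns=["Survived"])
y_train = train_part["Survived"]
print(X_train.shape, y_train.shape)
```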

-----------------------------------------

### 2.2 : Categorical Data Conversion 1 
#### OrdinalEncoder : Convert Object Type to Integer Type 

In [17]:
def convertObject_OrdinalEncoder(feature, df):
    """
        Ordinal encoding: Convert Categorical Features of Object Type to Integer Type
        Each object column is encoded independently with LabelEncoder
        Cabin & Embarked have Problematic Levels, so spaces are removed from them first
    """
    # Work on a copy so that the caller's dataframe is left unchanged
    df = df.copy()

    # Remove embedded spaces from the two problematic columns
    df[ feature[0] ] = df.loc[:,feature[0]].apply(lambda value: str(value).replace(" ",""))
    df[ feature[1] ] = df.loc[:,feature[1]].apply(lambda value: str(value).replace(" ",""))
    
    # Positions of all object-typed (categorical) columns
    catIndex = np.where(df.dtypes == object)[0]
    
    # Encode each categorical column as integer codes
    columns = []
    for ind in catIndex:
        label_encoder = preprocessing.LabelEncoder()
        encoded = label_encoder.fit_transform(df.iloc[:,ind])
        columns.append(encoded)
        
    # Numeric columns first, followed by the encoded categorical columns
    d1 = df.drop( df.columns[catIndex], axis=1)
    d2 = pd.DataFrame( np.column_stack( columns ), columns=df.columns[catIndex], index=df.index )
    dd = pd.concat([d1,d2], axis=1)
    
    return dd

In [18]:
feature = ["Cabin", "Embarked"]
dat = convertObject_OrdinalEncoder(feature, df)

In [19]:
train_df    = dat.sample(frac=0.75, replace=False, random_state=None)
train_index = train_df.index.tolist()
test_df     = dat[ ~dat.index.isin(train_index) ]

In [20]:
train_df = train_df.reset_index(drop=True)
test_df  = test_df.reset_index(drop=True)

In [21]:
train_test_ordinal = (train_df, test_df)

In [22]:
train_df.head(3)

Unnamed: 0,Pclass,Survived,Age,SibSp,Parch,Fare,NameLength,FamilySize,Name,Sex,Ticket,Cabin,Embarked,Boat,HomeDest,LastName,Title
0,3,0,10.0,0,2,24.15,25,3,1206,0,425,186,2,27,340,811,6
1,2,0,18.0,0,0,73.5,25,1,306,1,862,186,2,27,216,195,7
2,3,0,30.0,0,0,8.05,29,1,1125,1,707,186,2,27,340,752,7
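The function above encodes every object column on the full dataset before splitting. As an alternative sketch, scikit-learn's `OrdinalEncoder` can be fit on the training rows only and then applied to the test rows, with categories unseen during fitting mapped to a sentinel value. The toy frames are made up, and the choice of -1 as the sentinel is an assumption for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data; column names mirror the notebook, values are made up.
train = pd.DataFrame({"Sex": ["male", "female", "male"],
                      "Embarked": ["S", "C", "S"]})
test = pd.DataFrame({"Sex": ["female", "male"],
                     "Embarked": ["Q", "S"]})  # 'Q' never appears in train

# Categories unseen during fit are encoded as -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = encoder.fit_transform(train)
test_codes = encoder.transform(test)
print(test_codes)
```

Fitting on the training rows alone keeps the integer codes from depending on categories that occur only in the test set.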


### 2.3 : Categorical  Data  Conversion 2

#### OneHotEncoder : Convert Object Type to 0/1 Indicator Columns 

In [23]:
def convertObject_OneHotEncoder(feature, df):
    """
        OneHotEncoder : Convert Categorical Features of Object Type to 0/1 Indicator Columns
        Integer and float columns are imputed and passed through unchanged
        Cabin & Embarked have Problematic Levels, so spaces are removed from them first
    """
    # Work on a copy so that the caller's dataframe is left unchanged
    df = df.copy()

    df[ feature[0] ] = df.loc[:,feature[0]].apply(lambda value: str(value).replace(" ",""))
    df[ feature[1] ] = df.loc[:,feature[1]].apply(lambda value: str(value).replace(" ",""))
    
    # Split the columns by dtype: object, integer, float
    catIndex = np.where(df.dtypes == object)[0]
    df_cat   = df.iloc[:, catIndex]
    #print(f"Total catIndex : {len(catIndex)}")

    intIndex = np.where(df.dtypes == int)[0]
    df_int   = df.iloc[:, intIndex]
    #print(f"Total intIndex : {len(intIndex)}")

    floatIndex = np.where(df.dtypes == float)[0]
    df_float   = df.iloc[:, floatIndex]
    #print(f"Total floatIndex : {len(floatIndex)}")
    
    # Integer columns: fill missing values with the most frequent value
    imputer = SimpleImputer(strategy='most_frequent')
    imputed_int = imputer.fit_transform(df_int.values)
    df_int = pd.DataFrame(imputed_int, columns=df_int.columns, index=df.index)

    # Float columns: fill missing values with the column median
    imputer = SimpleImputer(strategy='median')
    imputed_float = imputer.fit_transform(df_float.values)
    df_float = pd.DataFrame(imputed_float, columns=df_float.columns, index=df.index)

    # Object columns: one indicator column per distinct level
    enc = preprocessing.OneHotEncoder(categories='auto', handle_unknown='ignore')
    trns = enc.fit_transform( df_cat.values )
    
    colName = enc.get_feature_names_out( df_cat.columns.tolist() )
    df_cat_trns = pd.DataFrame( trns.toarray(), columns=colName, index=df.index )
    
    # Indicator columns first, then the integer and float columns
    dd = pd.concat([df_cat_trns, df_int, df_float], axis=1)
    return dd
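The two imputation strategies used in the function can be illustrated on small arrays. The values below are made up; the sketch only shows how `median` and `most_frequent` fill a single gap.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy columns, each with one missing entry.
ages = np.array([[22.0], [np.nan], [30.0], [40.0]])
classes = np.array([[3.0], [1.0], [3.0], [np.nan]])

# The median of the observed ages (22, 30, 40) fills the gap.
filled_ages = SimpleImputer(strategy="median").fit_transform(ages)

# The most common observed class (3) fills the gap.
filled_classes = SimpleImputer(strategy="most_frequent").fit_transform(classes)

print(filled_ages.ravel(), filled_classes.ravel())
```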

In [24]:
X = df.drop(columns=['Survived'])
Y = df[['Survived']]

In [25]:
feature = ["Cabin", "Embarked"]
X = convertObject_OneHotEncoder(feature, X)
dat = pd.concat([Y, X], axis=1)

In [26]:
train_df = dat.sample(frac=0.75, replace=False, random_state=None)
train_index = train_df.index.tolist()
test_df = dat[ ~dat.index.isin(train_index) ]

In [27]:
train_df = train_df.reset_index(drop=True)
test_df  = test_df.reset_index(drop=True)

In [28]:
train_test_onehot = (train_df, test_df)

In [29]:
train_df.head(3)

Unnamed: 0,Survived,"Name_Abbing, Mr. Anthony","Name_Abbott, Master. Eugene Joseph","Name_Abbott, Mr. Rossmore Edward","Name_Abbott, Mrs. Stanton (Rosa Hunt)","Name_Abelseth, Miss. Karen Marie","Name_Abelseth, Mr. Olaus Jorgensen","Name_Abelson, Mr. Samuel","Name_Abelson, Mrs. Samuel (Hannah Wizosky)","Name_Abrahamsson, Mr. Abraham August Johannes",...,Title_Mrs.,Title_Rev.,Title_Sir.,Pclass,SibSp,Parch,NameLength,FamilySize,Age,Fare
0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,3,0,0,20,1,24.0,7.25
1,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,2,0,0,29,1,29.0,13.8583
2,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,3,1,0,36,2,36.5,17.4
