# Data Analysis : Titanic Data 

<img style="float: left; width: 400px;" src="image_titanic_ship.png">

##  Overview

The RMS Titanic was a British passenger liner. It sank in the North Atlantic Ocean on 
15 April 1912 after striking an iceberg during her maiden voyage from Southampton, UK, 
to New York City. Of the estimated 2,224 passengers and crew aboard, more than 1,500 died, 
making it one of the worst passenger ship disasters in history. 

1. https://en.wikipedia.org/wiki/Titanic
2. https://en.wikipedia.org/wiki/Titanic#/media/File:RMS_Titanic_3.jpg
3. https://titanicfacts.net/titanic-survivors/

The publicly available Titanic dataset contains survival information about 1309 passengers. We 
will investigate the dataset using Python libraries including NumPy, 
SciPy, pandas, Matplotlib, and Seaborn.

##  Dataset Information

<img style="float: left; padding-bottom: 50px; " src="image_titanic_data.png" width="1000" height="100">


#### Pclass: A proxy for socio-economic status (SES)

1 = Upper class

2 = Middle class

3 = Lower class

#### Survived : Indicator of whether or not a passenger survived

0 : No : Did not survive

1 : Yes : Survived

#### Name : Name of the passenger. Format : last name, title. first name


#### Sex : Gender of passengers

Category : female, male


#### Age : Age is in years. 

Age is fractional if less than 1.

#### SibSp : Count of siblings and spouses, from 0 to 8

The dataset defines family relations in the following way:

Sibling = brother, sister, stepbrother, stepsister

Spouse = husband, wife (mistresses and fiancés were ignored)

#### Parch : Count of parents and children, from 0 to 9

The dataset defines family relations in the following way:

Parent = mother, father

Child = daughter, son, stepdaughter, stepson

Some children travelled only with a nanny, so Parch = 0 for them.


#### Ticket : Ticket  number

#### Fare : Ticket price in British pounds

#### Cabin : Cabin number 

#### Embarked : The port where the passenger boarded the ship

There are three possible values for Embarked:

Southampton (S): about 70% of the people boarded from Southampton

Cherbourg   (C): about 20% boarded from Cherbourg

Queenstown  (Q): the rest boarded from Queenstown


#### Boat : Lifeboat (if survived)

#### Body : Body number (if did not survive and body was recovered)

#### Home.dest : Home/Destination of the passengers

## Data Analysis  Steps

Part 1: Read Raw Data

Part 2: Explore Data

Part 3: Process Data

Part 4: Engineer New Data

Part 5: Get Final Clean Data

----------------

##  Load Useful  Python Modules

numpy,  pandas, re, scipy

In [49]:
import numpy as np
import pandas as pd
import re
import scipy

In [50]:
from IPython import display

In [51]:
import matplotlib
from matplotlib import style
from matplotlib import pyplot as plt
%matplotlib inline

sklearn modules

In [52]:
from sklearn import preprocessing 
from sklearn.impute import SimpleImputer, KNNImputer

-----------

## Part 1: Read Data

In [53]:
dataPath = "/Users/nururrahman/Desktop/StartUp/DataScienceInitiative/Bootcamp/"
dataFile = "data_titanic_raw.csv"

In [54]:
#df = pd.read_csv( dataPath + dataFile)
df = pd.read_csv('data_titanic_raw.csv')

-------------------

## Part 2 : Explore Data

#### 2.01 : Have a Quick Look at  the Data

In [55]:
df.head(3)

Unnamed: 0,Pclass,Survived,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Boat,Body,Home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


#### 2.02 : Check Data Properties

In [56]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Pclass     1309 non-null   int64  
 1   Survived   1309 non-null   int64  
 2   Name       1309 non-null   object 
 3   Sex        1309 non-null   object 
 4   Age        1046 non-null   float64
 5   SibSp      1309 non-null   int64  
 6   Parch      1309 non-null   int64  
 7   Ticket     1309 non-null   object 
 8   Fare       1308 non-null   float64
 9   Cabin      295 non-null    object 
 10  Embarked   1307 non-null   object 
 11  Boat       486 non-null    object 
 12  Body       121 non-null    float64
 13  Home.dest  745 non-null    object 
dtypes: float64(3), int64(4), object(7)
memory usage: 143.3+ KB


#### 2.03 : Find the shape of the dataset

The shape of the dataset is (number of rows, number of columns).

In [57]:
df.shape

(1309, 14)

#### 2.04 : Print the name of the columns

In [58]:
df.columns

Index(['Pclass', 'Survived', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket',
       'Fare', 'Cabin', 'Embarked', 'Boat', 'Body', 'Home.dest'],
      dtype='object')

#### 2.05 : Extract any column of the dataset  

In [59]:
df['Pclass']
df['Pclass'].to_list()      # Convert a series to a Python list 
df['Pclass'].to_numpy()     # Convert a series to a Numpy array  

array([1, 1, 1, ..., 3, 3, 3], dtype=int64)

In [60]:
df['Survived']

0       1
1       1
2       0
3       0
4       0
       ..
1304    0
1305    0
1306    0
1307    0
1308    0
Name: Survived, Length: 1309, dtype: int64

#### 2.06 : Which class has the highest number of passengers?   

In [61]:
# df.groupby(['Pclass'], as_index=True).size()

df.groupby(['Pclass'], as_index=False).size()

Unnamed: 0,Pclass,size
0,1,323
1,2,277
2,3,709


In [62]:
#df['Pclass'].value_counts()

In [64]:
df['Sex']

0       female
1         male
2       female
3         male
4       female
         ...  
1304    female
1305    female
1306      male
1307      male
1308      male
Name: Sex, Length: 1309, dtype: object

##### Which class has the highest number of passengers?

In [65]:
df.groupby(['Pclass']).size()

Pclass
1    323
2    277
3    709
dtype: int64

#### 2.07 : How many passengers survived? How many did not survive? 

In [66]:
#a = df.groupby(['Survived'], as_index=True).size()
#type(a)
df.groupby(['Survived'], as_index=False).size()

Unnamed: 0,Survived,size
0,0,809
1,1,500


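In [67]:
counts = df.groupby(['Survived'], as_index=False).size()
counts['percent'] = 100 * counts['size'] / len(df)   # scale only the counts, not the Survived labels
counts

Unnamed: 0,Survived,size,percent
0,0,809,61.802903
1,1,500,38.197097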
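
The same percentages can be obtained by letting pandas normalize the counts. The sketch below is only an illustration: it rebuilds a stand-in for the `Survived` column from the two group counts reported above (809 and 500), and the variable names `survived` and `percent` are assumptions, not part of the notebook.

```python
import pandas as pd

# Stand-in for df['Survived'], rebuilt from the counts shown above
survived = pd.Series([0] * 809 + [1] * 500, name="Survived")

# normalize=True returns fractions of the total instead of raw counts
percent = survived.value_counts(normalize=True).mul(100).round(2)
print(percent)
```

On the real dataframe the equivalent call would be `df['Survived'].value_counts(normalize=True)`.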


#### 2.08 : How many passengers from each class  survived?

In [68]:
df.groupby(['Pclass','Survived'], as_index=False).size()

Unnamed: 0,Pclass,Survived,size
0,1,0,123
1,1,1,200
2,2,0,158
3,2,1,119
4,3,0,528
5,3,1,181
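
Raw counts are hard to compare because the classes differ in size. As a hedged sketch, the counts from the table above can be pivoted into a survival rate per class; the frame `counts` is rebuilt by hand from that table, and the names `table` and `rate` are assumptions.

```python
import pandas as pd

# Counts copied from the Pclass/Survived table above
counts = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3],
    "Survived": [0, 1, 0, 1, 0, 1],
    "size":     [123, 200, 158, 119, 528, 181],
})

# One row per class, one column per outcome (0 = died, 1 = survived)
table = counts.pivot(index="Pclass", columns="Survived", values="size")

# Share of survivors within each class, in percent
rate = (100 * table[1] / table.sum(axis=1)).round(1)
print(rate)
```

Under these counts, first class has the highest survival rate and third class the lowest.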


#### 2.09 : Show a few passengers' names

In [70]:
df['Name']

0                         Allen, Miss. Elisabeth Walton
1                        Allison, Master. Hudson Trevor
2                          Allison, Miss. Helen Loraine
3                  Allison, Mr. Hudson Joshua Creighton
4       Allison, Mrs. Hudson J C (Bessie Waldo Daniels)
                             ...                       
1304                               Zabour, Miss. Hileni
1305                              Zabour, Miss. Thamine
1306                          Zakarian, Mr. Mapriededer
1307                                Zakarian, Mr. Ortin
1308                                 Zimmerman, Mr. Leo
Name: Name, Length: 1309, dtype: object

#### 2.10 : How many females and males are recorded in the dataset?

In [71]:
df.groupby(['Sex'], as_index=False).size()

Unnamed: 0,Sex,size
0,female,466
1,male,843


#### 2.11 : Count the females and males who survived and who did not survive

In [72]:
df.groupby(['Sex', 'Survived'], as_index=False).size()

Unnamed: 0,Sex,Survived,size
0,female,0,127
1,female,1,339
2,male,0,682
3,male,1,161


#### 2.12 : Explore Age of the passengers

Find the youngest passenger : The minimum value of Age

In [73]:
age = df.Age
min_age = age.min()

print( 'age of the youngest passenger :',  min_age)

# Using numpy
np.min(age)

age of the youngest passenger : 0.17


0.17

Find the oldest passenger : The maximum value of Age

In [74]:
age = df.Age
max_age = age.max()
print( 'age of the oldest passenger :', max_age )

# Using numpy
# np.max(age)

age of the oldest passenger : 80.0


In [75]:
type(df.Age)

pandas.core.series.Series

Mean and median age of the passengers

In [76]:
print('mean age of the passengers   :', df.Age.mean() )
print('median age of the passengers :', df.Age.median() )

mean age of the passengers   : 29.881137667304014
median age of the passengers : 28.0


In [77]:
( df.Age.quantile( [0.25, 0.50, 0.75, 1.0] ) )

0.25    21.0
0.50    28.0
0.75    39.0
1.00    80.0
Name: Age, dtype: float64

#### 2.13 : Find the most expensive ticket in the dataset.

Who purchased the most expensive ticket?

Which embarking port sold the most expensive ticket? 

In [78]:
df['Fare'].isnull().sum()

1

In [79]:
fare = df['Fare'].to_numpy()
fare

array([211.3375, 151.55  , 151.55  , ...,   7.225 ,   7.225 ,   7.875 ])

In [80]:
# max_fare = np.max(fare)
max_fare = np.nanmax(fare)
print(max_fare)

512.3292


In [81]:
#help(np.nanmax)

In [82]:
ind = np.where( fare==max_fare )
print(ind)

(array([ 49,  50, 183, 302], dtype=int64),)


In [83]:
type(ind)

tuple

In [84]:
# Find rows with the most expensive tickets
#df.iloc[ind, :]

In [85]:
ind = np.where( fare==max_fare )[0]
print(ind)

[ 49  50 183 302]


Find the rows with the most expensive tickets

In [86]:
df.iloc[ind, :]

Unnamed: 0,Pclass,Survived,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Boat,Body,Home.dest
49,1,1,"Cardeza, Mr. Thomas Drake Martinez",male,36.0,0,1,PC 17755,512.3292,B51 B53 B55,C,3,,"Austria-Hungary / Germantown, Philadelphia, PA"
50,1,1,"Cardeza, Mrs. James Warburton Martinez (Charlo...",female,58.0,0,1,PC 17755,512.3292,B51 B53 B55,C,3,,"Germantown, Philadelphia, PA"
183,1,1,"Lesurer, Mr. Gustave J",male,35.0,0,0,PC 17755,512.3292,B101,C,3,,
302,1,1,"Ward, Miss. Anna",female,35.0,0,0,PC 17755,512.3292,,C,3,,


Find names of the most expensive ticket owners

In [87]:
df.iloc[ind, :]['Name']

49                    Cardeza, Mr. Thomas Drake Martinez
50     Cardeza, Mrs. James Warburton Martinez (Charlo...
183                               Lesurer, Mr. Gustave J
302                                     Ward, Miss. Anna
Name: Name, dtype: object

Find the names of the embarking ports that sold the most expensive ticket 

In [88]:
df.iloc[ind, :]['Embarked']

49     C
50     C
183    C
302    C
Name: Embarked, dtype: object
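
The NumPy route above needs `np.nanmax` and `np.where` because of the missing fare. A pandas-only alternative is a boolean mask, sketched below on a small stand-in frame. Three rows are taken from the outputs above; the fourth row, with a missing fare, is hypothetical and only there to show that `Series.max()` skips NaN by default.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "Name": ["Allen, Miss. Elisabeth Walton",
             "Cardeza, Mr. Thomas Drake Martinez",
             "Ward, Miss. Anna",
             "Placeholder, Mr. Example"],   # hypothetical row with a missing fare
    "Fare": [211.3375, 512.3292, 512.3292, np.nan],
    "Embarked": ["S", "C", "C", "S"],
})

# Series.max() ignores NaN, so no nanmax is needed
is_max = sample["Fare"] == sample["Fare"].max()

# Select the matching rows and only the columns of interest
top = sample.loc[is_max, ["Name", "Embarked"]]
print(top)
```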

#### 2.14 : Find the least expensive ticket in the dataset.

In [89]:
min_fare = np.nanmin(fare)
print(min_fare)

0.0


In [90]:
ind = np.where( fare==min_fare )
print(ind)

(array([   7,   70,  125,  150,  170,  223,  234,  363,  384,  410,  473,
        528,  581,  896,  898,  963, 1254], dtype=int64),)


In [91]:
ind = np.where( fare==min_fare )[0]
print(ind)

[   7   70  125  150  170  223  234  363  384  410  473  528  581  896
  898  963 1254]


In [92]:
df.groupby(['Sex']).agg({'Fare':['mean', 'median', 'max'] })

Sex,Fare mean,Fare median,Fare max
female,46.198097,23.0,512.3292
male,26.154601,11.8875,512.3292


#### 2.15 : Explore the Embarked column of the dataset 

How many people boarded from Cherbourg (C)?

How many people boarded from Queenstown (Q)?

How many people boarded from Southampton (S)?

Method 1

First, find absolute values 

Next, find the percentage

In [93]:
g1 = df.groupby(['Embarked'], as_index=False).size()
g1

Unnamed: 0,Embarked,size
0,C,270
1,Q,123
2,S,914


In [94]:
g1['Percent'] = 100 * g1['size'] / df.shape[0]
g1 

Unnamed: 0,Embarked,size,Percent
0,C,270,20.626432
1,Q,123,9.396486
2,S,914,69.824293


In [95]:
df.shape

(1309, 14)

In [96]:
100*g1['size'].to_numpy()/1309

array([20.62643239,  9.39648587, 69.82429335])

Method 2

Combine all steps together

In [97]:
g1 = df.groupby(['Embarked'], as_index=True).agg({'Embarked':'count'})
g2 = g1.rename(columns={'Embarked':'Count'})
g2['Percent'] = g2.apply( lambda col : np.round(100*col/df.shape[0],2), axis=0 )
g2.index.name=''
g2

Unnamed: 0,Count,Percent
,,
C,270.0,20.63
Q,123.0,9.4
S,914.0,69.82
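
The three percentages above add up to slightly less than 100 because two passengers have no recorded port (see the non-null count of 'Embarked' in 2.02). A minimal sketch of how to make that visible, using a stand-in series rebuilt from the counts above; the names `embarked` and `percent` are assumptions.

```python
import pandas as pd

# Stand-in for df['Embarked']: 270 C, 123 Q, 914 S and 2 missing values
embarked = pd.Series(["C"] * 270 + ["Q"] * 123 + ["S"] * 914 + [None] * 2,
                     name="Embarked")

# dropna=False keeps the missing values as their own category
percent = embarked.value_counts(normalize=True, dropna=False).mul(100)
print(percent.round(2))
print("total :", round(percent.sum(), 2))
```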


#### 2.16 : From which port did the majority of the Pclass==3 passengers embark?

In [98]:
df.groupby(['Embarked','Pclass'], as_index=False).size()

Unnamed: 0,Embarked,Pclass,size
0,C,1,141
1,C,2,28
2,C,3,101
3,Q,1,3
4,Q,2,7
5,Q,3,113
6,S,1,177
7,S,2,242
8,S,3,495
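
The table above answers the question by inspection. As a sketch, the answer can also be computed: the frame `by_port` below is copied by hand from that table, and the names `third`, `top_port`, and `share` are assumptions.

```python
import pandas as pd

# Counts copied from the Embarked/Pclass table above
by_port = pd.DataFrame({
    "Embarked": ["C", "C", "C", "Q", "Q", "Q", "S", "S", "S"],
    "Pclass":   [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "size":     [141, 28, 101, 3, 7, 113, 177, 242, 495],
})

# Keep only third class and index the counts by port
third = by_port[by_port["Pclass"] == 3].set_index("Embarked")["size"]

top_port = third.idxmax()                    # port with the most third-class passengers
share = (100 * third / third.sum()).round(1) # share of each port among them
print(top_port)
print(share)
```

Under these counts, Southampton (S) accounts for the majority of the third-class passengers with a known port.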


#### 2.17 : How many passengers traveled alone?

SibSp : Sibling-Spouse

Parch : Parent-Child 

Passengers with 'SibSp'==0 and 'Parch'==0 travelled alone.

In [99]:
tmp = df.groupby(['SibSp', 'Parch'], as_index=False).size()
tmp

Unnamed: 0,SibSp,Parch,size
0,0,0,790
1,0,1,52
2,0,2,43
3,0,3,2
4,0,4,2
5,0,5,2
6,1,0,183
7,1,1,90
8,1,2,29
9,1,3,5


The first row of the dataframe has 'SibSp'==0 and 'Parch'==0

In [100]:
total = tmp['size'][0]
print('number of people who traveled alone :', total)

number of people who traveled alone : 790
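
Reading the first row of the grouped table works only while (0, 0) happens to be the first group. A boolean mask does not depend on row order. The sketch below uses a five-row stand-in frame (the SibSp and Parch values of the first five passengers); the names `sample`, `alone`, and `n_alone` are assumptions.

```python
import pandas as pd

# SibSp and Parch of the first five passengers in the dataset
sample = pd.DataFrame({"SibSp": [0, 1, 1, 1, 1], "Parch": [0, 2, 2, 2, 2]})

# True where a passenger has neither siblings/spouses nor parents/children
alone = (sample["SibSp"] == 0) & (sample["Parch"] == 0)

n_alone = int(alone.sum())   # True counts as 1
print("number of people who traveled alone :", n_alone)
```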


In [101]:
# How many passengers had 'Parch'==7?
# df.groupby(['Parch'], as_index=False).size()

In [102]:
# There are no passengers with 'Parch'==7

----------------------

## Part  3 : Process Data

#### 3.1 : Check null values in each column of the dataframe

In [103]:
s = df.isnull().sum()
print('for a given column, the number of rows with null values:\n')
print(s)

for a given column, the number of rows with null values:

Pclass          0
Survived        0
Name            0
Sex             0
Age           263
SibSp           0
Parch           0
Ticket          0
Fare            1
Cabin        1014
Embarked        2
Boat          823
Body         1188
Home.dest     564
dtype: int64


Find the percent of null values

In [104]:
print('shape of the dataframe : ', df.shape)
print('number of rows    :', df.shape[0])
print('number of columns :', df.shape[1])

shape of the dataframe :  (1309, 14)
number of rows    : 1309
number of columns : 14


To get the percentage of null values, divide the null count of each column by the total number of rows of the dataframe and multiply by 100.

In [105]:
# Get the result as a series 
s = 100 * s / df.shape[0]
print(s)

Pclass        0.000000
Survived      0.000000
Name          0.000000
Sex           0.000000
Age          20.091673
SibSp         0.000000
Parch         0.000000
Ticket        0.000000
Fare          0.076394
Cabin        77.463713
Embarked      0.152788
Boat         62.872422
Body         90.756303
Home.dest    43.086325
dtype: float64
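
Because the mean of a boolean column is the fraction of True values, the percentage can also be computed in one step. A minimal sketch on a small frame with made-up missing values; `sample` and `missing_pct` are illustrative names.

```python
import numpy as np
import pandas as pd

# Small frame with made-up gaps, for illustration only
sample = pd.DataFrame({
    "Age":   [29.0, np.nan, 2.0, np.nan],
    "Cabin": ["B5", None, None, None],
    "Sex":   ["female", "male", "female", "male"],
})

# isnull() gives True/False per cell; mean() turns that into a fraction per column
missing_pct = sample.isnull().mean().mul(100)

# Keep only the columns that have gaps, largest share first
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```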


In [106]:
# Get the result  as a dataframe
# s = 100 * s / df.shape[0]
# pd.DataFrame( s[s>0.0], columns=['Missing Fraction'] ).reset_index(drop=False)

The column 'Body' has too many null values (about 91% of its rows).

We can drop this column.

In [107]:
df2 = df.drop(columns=["Body"])

Dropping a column can be done in many ways. Explore it later at home.

In [108]:
# df2 = df.drop("Body", axis=1)
# df2 = df.drop(columns=["Body"], inplace=False)
# df.drop(columns=["Body"], inplace=True)

Check the shape of the new dataset

In [109]:
df2.shape

(1309, 13)

In [110]:
df2.columns

Index(['Pclass', 'Survived', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket',
       'Fare', 'Cabin', 'Embarked', 'Boat', 'Home.dest'],
      dtype='object')

#### 3.2 : Impute null values

There are various ways to impute null values. See the scikit-learn page on data imputation:

https://scikit-learn.org/stable/modules/classes.html#module-sklearn.impute

Univariate imputation   : This type of algorithm imputes values in the i-th feature dimension using only the non-missing values in that feature dimension. Example : impute.SimpleImputer.

Multivariate imputation : This type of algorithm uses the entire set of available feature dimensions to estimate the missing values. Example : impute.IterativeImputer.
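
To make the difference concrete, here is a small sketch on a made-up array. `SimpleImputer` is the univariate example; for the multivariate side the sketch swaps in `KNNImputer`, which the notebook already imports, instead of `IterativeImputer`. The array `X` and the choice `n_neighbors=2` are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Made-up data: the second column is missing in the third row
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, np.nan],
              [4.0, 40.0]])

# Univariate: uses only the second column (median of 10, 20, 40)
simple = SimpleImputer(strategy="median").fit_transform(X)

# Multivariate: uses the first column to find the two closest rows,
# then averages their second-column values
knn = KNNImputer(n_neighbors=2).fit_transform(X)

print("median imputation :", simple[2, 1])
print("KNN imputation    :", knn[2, 1])
```

The two strategies fill the same gap with different values because the KNN version takes the neighbouring rows into account.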

#### In this tutorial we will follow univariate imputation with the following choices:

#### Fill null values in categorical columns with : 'UNKNOWN'

#### Fill null values in numerical columns with : the median value of the column

In [111]:
# The number of null values in each column
s1 = df2.isnull().sum() 
print(s1)

Pclass          0
Survived        0
Name            0
Sex             0
Age           263
SibSp           0
Parch           0
Ticket          0
Fare            1
Cabin        1014
Embarked        2
Boat          823
Home.dest     564
dtype: int64


In [112]:
# print( s1.index )
# print(s1.values )

In [113]:
# Filter s1 to show only those columns that have null values
s2 = s1[ s1.values>0 ] 
print(s2)

Age           263
Fare            1
Cabin        1014
Embarked        2
Boat          823
Home.dest     564
dtype: int64


In [114]:
print( s2.index )
print(s2.values )

Index(['Age', 'Fare', 'Cabin', 'Embarked', 'Boat', 'Home.dest'], dtype='object')
[ 263    1 1014    2  823  564]


Convert Pandas series to Python list 

In [115]:
indexS = s2.index.tolist()
valueS = s2.values.tolist()
print( 'name of the columns :', indexS )
print('number of null values in each column:', valueS)

name of the columns : ['Age', 'Fare', 'Cabin', 'Embarked', 'Boat', 'Home.dest']
number of null values in each column: [263, 1, 1014, 2, 823, 564]


#### Perform imputation. Use a for loop to go over each column

If the column type is 'object', fill null values with 'UNKNOWN' using the pandas fillna() method.

If the column type is 'float', fill null values with the column median using the sklearn SimpleImputer class.

In [116]:
# A quick example of data imputation

data = pd.DataFrame(
    [["a", "x"],
     [np.nan, "y"],
     ["a", "y"],
     ["b", "y"],
     ["a",np.nan]], 
    dtype="category",
    columns=['col1','col2']
)

print( data )
print(' ')

imp = SimpleImputer(strategy="most_frequent")
imp_data = imp.fit_transform(data)

imp_data = pd.DataFrame(imp_data, columns=['col1','col2'])
print(imp_data)

  col1 col2
0    a    x
1  NaN    y
2    a    y
3    b    y
4    a  NaN
 
  col1 col2
0    a    x
1    a    y
2    a    y
3    b    y
4    a    y


In [117]:
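for col in indexS:
    if df2[col].dtype == 'object':
        # Assign the result back; fillna(inplace=True) on a column selection may not update df2
        df2[col] = df2[col].fillna('UNKNOWN')
        
    elif df2[col].dtype == 'float':           
        imputer = SimpleImputer(strategy='median')
        imputed = imputer.fit_transform( df2[[col]] )
        df2[col] = imputed.ravel()   # flatten the (n, 1) array to 1-D
    else:
        pass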
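
The same two rules can also be written without a loop, using pandas only. This is a sketch on a three-row stand-in frame, not a replacement for the cell above; `sample`, `num_cols`, and `cat_cols` are illustrative names.

```python
import numpy as np
import pandas as pd

# Stand-in frame with one numerical and one categorical column
sample = pd.DataFrame({
    "Age":   [29.0, np.nan, 2.0],
    "Cabin": ["B5", None, "C22 C26"],
})

num_cols = sample.select_dtypes(include="number").columns
cat_cols = sample.select_dtypes(exclude="number").columns

# fillna with a Series fills each column with the value under its own label
sample[num_cols] = sample[num_cols].fillna(sample[num_cols].median())
sample[cat_cols] = sample[cat_cols].fillna("UNKNOWN")
print(sample)
```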

Check whether there are still null values in each column of the dataframe

In [118]:
s = df2.isnull().sum() 
print(s)

Pclass       0
Survived     0
Name         0
Sex          0
Age          0
SibSp        0
Parch        0
Ticket       0
Fare         0
Cabin        0
Embarked     0
Boat         0
Home.dest    0
dtype: int64


In [119]:
df2

Unnamed: 0,Pclass,Survived,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Boat,Home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.00,0,0,24160,211.3375,B5,S,2,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.5500,C22 C26,S,11,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.00,1,2,113781,151.5500,C22 C26,S,UNKNOWN,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.00,1,2,113781,151.5500,C22 C26,S,UNKNOWN,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.00,1,2,113781,151.5500,C22 C26,S,UNKNOWN,"Montreal, PQ / Chesterville, ON"
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,3,0,"Zabour, Miss. Hileni",female,14.50,1,0,2665,14.4542,UNKNOWN,C,UNKNOWN,UNKNOWN
1305,3,0,"Zabour, Miss. Thamine",female,28.00,1,0,2665,14.4542,UNKNOWN,C,UNKNOWN,UNKNOWN
1306,3,0,"Zakarian, Mr. Mapriededer",male,26.50,0,0,2656,7.2250,UNKNOWN,C,UNKNOWN,UNKNOWN
1307,3,0,"Zakarian, Mr. Ortin",male,27.00,0,0,2670,7.2250,UNKNOWN,C,UNKNOWN,UNKNOWN


#### 3.3 : Clean data of some of the categorical columns

The column values are strings.

The strings might contain dots (.), commas (,), or other characters.

Check the dataframe once again

In [120]:
df2.head(3)

Unnamed: 0,Pclass,Survived,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Boat,Home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.55,C22 C26,S,11,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,UNKNOWN,"Montreal, PQ / Chesterville, ON"


3.3.1 Clean the values of 'Ticket'

Check the 'Ticket' column

In [121]:
# df['Ticket']

df2[['Ticket']].groupby('Ticket').size().sort_values()

Ticket
345769           1
349245           1
349246           1
349247           1
349248           1
                ..
PC 17608         7
S.O.C. 14879     7
CA 2144          8
1601             8
CA. 2343        11
Length: 929, dtype: int64



The 'Ticket' column has values whose parts are separated by whitespace, dots, and/or slashes.

We normalize these as part of data cleaning: whitespace and slashes become underscores, and dots are removed.

In [122]:
df2['Ticket'] = df2.apply(lambda row: str(row.Ticket).replace(" ", "_").replace(".","").replace("/","_"), axis=1)
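
The row-wise `apply` above can also be expressed with the vectorized `.str` accessor. A minimal sketch on four ticket values taken from the outputs above; the names `tickets` and `clean` are assumptions.

```python
import pandas as pd

# Ticket values taken from the outputs above
tickets = pd.Series(["24160", "PC 17608", "S.O.C. 14879", "CA. 2343"], name="Ticket")

# regex=False treats "." as a literal dot rather than "any character"
clean = (tickets.astype(str)
         .str.replace(" ", "_", regex=False)
         .str.replace(".", "", regex=False)
         .str.replace("/", "_", regex=False))
print(clean.tolist())
```

The same chain would apply unchanged to the 'Cabin' and 'Boat' columns.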

3.3.2 Clean the values of 'Cabin'

Check the 'Cabin' column

In [123]:
# df['Cabin']

df2[['Cabin']].groupby('Cabin').size().sort_values()

Cabin
A10                   1
D38                   1
D34                   1
D22                   1
D11                   1
                   ... 
F4                    4
B57 B59 B63 B66       5
G6                    5
C23 C25 C27           6
UNKNOWN            1014
Length: 187, dtype: int64

The 'Cabin' column has values whose parts are separated by whitespace.

We apply the same replacements as for 'Ticket' as part of data cleaning.

In [124]:
df2['Cabin']  = df2.apply(lambda row: str(row.Cabin).replace(" ", "_").replace(".","").replace("/","_"), axis=1)

3.3.3 Clean the values of 'Boat' 

Check the 'Boat' column

In [125]:
# df2['Boat']

df2[['Boat']].groupby('Boat').size().sort_values()

Boat
13 15 B      1
15 16        1
8 10         1
5 9          1
C D          2
13 15        2
5 7          2
1            5
B            9
A           11
2           13
12          19
6           20
D           20
8           23
7           23
16          23
11          25
9           25
3           26
5           27
10          29
4           31
14          33
15          37
C           38
13          39
UNKNOWN    823
dtype: int64

In [126]:
df2['Boat']   = df2.apply(lambda row: str(row.Boat).replace(" ", "_").replace(".","").replace("/","_"), axis=1)

3.3.4 Clean the values of 'Home.dest' 

Check the 'Home.dest' column

In [127]:
# df['Home.dest']
df2[['Home.dest']].groupby('Home.dest').size().sort_values()

Home.dest
?Havana, Cuba                     1
Liverpool, England / Belfast      1
London  Vancouver, BC             1
London / Birmingham               1
London / Chicago, IL              1
                               ... 
Paris, France                     9
Montreal, PQ                     10
London                           14
New York, NY                     64
UNKNOWN                         564
Length: 370, dtype: int64

The feature 'Home.dest' can be processed further to split the home and destination of each passenger in the dataset.

Use df['Home.dest'].unique() to list its distinct values.
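
One possible way to do such a split is sketched below. It assumes that " / " separates the home from the destination, which holds for the values shown above but is not guaranteed for every row; the column names "Home" and "Destination" are hypothetical.

```python
import pandas as pd

# Two values taken from the outputs above
home_dest = pd.Series(["St Louis, MO", "Montreal, PQ / Chesterville, ON"],
                      name="Home.dest")

# Split at the first " / " only; rows without a separator get a missing destination
parts = home_dest.str.split(" / ", n=1, expand=True)
parts.columns = ["Home", "Destination"]
print(parts)
```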

3.3.5 Fix the name of the column from 'Home.dest' to 'HomeDest'

In [128]:
df2 = df2.rename(columns={'Home.dest' : 'HomeDest'}, inplace=False)

First, copy the dataframe to keep a snapshot of the current values

Next, check the data in the HomeDest column

Third, clean the data values

Finally, write the cleaned column back to the dataframe df2

In [129]:
data = df2.copy()

In [130]:
data['HomeDest']

0                          St Louis, MO
1       Montreal, PQ / Chesterville, ON
2       Montreal, PQ / Chesterville, ON
3       Montreal, PQ / Chesterville, ON
4       Montreal, PQ / Chesterville, ON
                     ...               
1304                            UNKNOWN
1305                            UNKNOWN
1306                            UNKNOWN
1307                            UNKNOWN
1308                            UNKNOWN
Name: HomeDest, Length: 1309, dtype: object

Clean row values of the column 'HomeDest'

In [131]:
homedest = [str(x).replace(" ","") for x in df2["HomeDest"].tolist()]
homedest = [str(x) if 'and/or' not in x else str(x).replace('and/or','/') for x in homedest]
homedest = [str(x) if 'Guernsey/Montclair' not in x else str(x).replace('Guernsey/Montclair,NJ','Montclair,NJ') for x in homedest]
homedest = [str(x).replace(",","_").replace("/","__") for x in homedest]

In [132]:
df2['HomeDest'] = np.array(homedest)

---------

## Part 4 : Feature Engineering

Create three new features from the column 'Name':
1. LastName
2. NameLength
3. Title

Create a new feature by combining 'Parch' and 'SibSp':
1. FamilySize

### 4.1 Create  Features from 'Name'  

Inspect passenger names

In [133]:
df2['Name'].tolist()[0:10]

['Allen, Miss. Elisabeth Walton',
 'Allison, Master. Hudson Trevor',
 'Allison, Miss. Helen Loraine',
 'Allison, Mr. Hudson Joshua Creighton',
 'Allison, Mrs. Hudson J C (Bessie Waldo Daniels)',
 'Anderson, Mr. Harry',
 'Andrews, Miss. Kornelia Theodosia',
 'Andrews, Mr. Thomas Jr',
 'Appleton, Mrs. Edward Dale (Charlotte Lamson)',
 'Artagaveytia, Mr. Ramon']

In [134]:
thestr =  'Allison, Mrs. Hudson J C (Bessie Waldo Daniels)'
thestr.split(',')

['Allison', ' Mrs. Hudson J C (Bessie Waldo Daniels)']

#### 4.1.1 : Create LastName

In [135]:
LastName = [ x.split(",")[0] for x in df2.Name.tolist() ]
LastName = [ x.strip() for x in LastName ]

# LastName = [ (x.split(",")[0].strip()) for x in df.Name.tolist() ]
# print(LastName)

Some last names consist of multiple words separated by whitespace.

Replace the whitespace with underscores.

In [136]:
LastName = [str(x).replace(" ", "_")  for x in LastName]

#### 4.1.2 : Create Title

In [137]:
Title = [ x.split(",")[1] for x in df2.Name.tolist() ]
Title = [ x.strip().split(" ")[0] for x in Title ]

# Title = [ (x.split(",")[1].strip()).split(" ")[0] for x in df2.Name.tolist() ]
# print(Title)

#### Take a closer look at the feature 'Title'

Find the number of unique titles

In [138]:
unique_title = np.unique( np.array(Title) )
print('number of unique titles :', len(unique_title) )
print( unique_title )

number of unique titles : 18
['Capt.' 'Col.' 'Don.' 'Dona.' 'Dr.' 'Jonkheer.' 'Lady.' 'Major.'
 'Master.' 'Miss.' 'Mlle.' 'Mme.' 'Mr.' 'Mrs.' 'Ms.' 'Rev.' 'Sir.' 'the']


Change the word 'the' to a more appropriate title

In [139]:
ind = np.where("the" == np.array(Title))  # np.where returns a tuple of ndarrays
print( "row index where 'the' appears in the Title :",  ind )

# Get the proper index because 'np.where' returns a tuple of ndarrays
ind = ind[0][0]
print('index value :', ind)

print("check the name at the indexed position :", df2.Name.tolist()[ind] ) 

row index where 'the' appears in the Title : (array([245], dtype=int64),)
index value : 245
check the name at the indexed position : Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)


The actual name is : Lucy Noel Martha Dyer-Edwards, the Countess. of Rothes

In [140]:
Title[ ind ] = "Countess."

Combine "Capt.","Col.","Major." into one category

In [141]:
indList = []
for title in ["Capt.","Col.","Major."]:
    ind = np.where(title == np.array(Title))
    indList.extend( ind[0] )
    #print(indList )
for ind in indList: 
    Title[ ind ] = "Army."

Combine "Ms." and "Mlle." into one category

In [142]:
indList = []
for title in ["Ms.", "Mlle."]:
    ind = np.where(title == np.array(Title))
    indList.extend( ind[0] )
    #print(indList )
for ind in indList: 
    Title[ ind ] = "Miss."

Combine "Mme." and "Dona." into one category

In [143]:
indList = []
for title in ["Mme.","Dona."]:
    ind = np.where(title == np.array(Title))
    indList.extend( ind[0] )
    #print(indList )
for ind in indList: 
    Title[ ind ] = "Mrs."

Convert "Don." to "Mr." 

In [144]:
indList = []
for title in ["Don."]:
    ind = np.where(title == np.array(Title))
    indList.extend( ind[0] )
    #print(indList )
for ind in indList: 
    Title[ ind ] = "Mr."

The data processing in the cells above can be combined into one cell using a Python function

In [145]:
# def convert_title(Title, original, converted):
#     indList = []
    
#     for title in original:
#         ind = np.where(title == np.array(Title))
#         indList.extend( ind[0] )
        
#     for ind in indList: 
#         Title[ ind ] = converted
#     return Title

# Title = convert_title(Title, ["Capt.","Col.","Major."], "Army.")

# Title = convert_title(Title, ["Ms.", "Mlle."], "Miss.")

# Title = convert_title(Title, ["Mme.","Dona."], "Mrs.")

# Title = convert_title(Title, ["Don."], "Mr.")

# unique_title = np.unique( np.array(Title) )
# print('number of unique titles :', len(unique_title) )
# print( unique_title )

Check the number of unique titles

In [146]:
unique_title = np.unique( np.array(Title) )
print('number of unique titles :', len(unique_title) )
print( unique_title )

number of unique titles : 11
['Army.' 'Countess.' 'Dr.' 'Jonkheer.' 'Lady.' 'Master.' 'Miss.' 'Mr.'
 'Mrs.' 'Rev.' 'Sir.']
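
The conversions above can also be collected in a single dictionary and applied in one pass. The sketch below is an alternative, not the notebook's own method: it starts from the 18 raw titles listed earlier, and the names `raw_titles`, `title_map`, and `titles` are assumptions.

```python
import pandas as pd

# The 18 raw titles found in the 'Name' column
raw_titles = pd.Series(["Capt.", "Col.", "Don.", "Dona.", "Dr.", "Jonkheer.",
                        "Lady.", "Major.", "Master.", "Miss.", "Mlle.", "Mme.",
                        "Mr.", "Mrs.", "Ms.", "Rev.", "Sir.", "the"])

# Every conversion from the cells above, in one place
title_map = {
    "Capt.": "Army.", "Col.": "Army.", "Major.": "Army.",
    "Ms.": "Miss.", "Mlle.": "Miss.",
    "Mme.": "Mrs.", "Dona.": "Mrs.",
    "Don.": "Mr.",
    "the": "Countess.",
}

# Titles without an entry in the dictionary are kept unchanged
titles = raw_titles.map(lambda t: title_map.get(t, t))
print("number of unique titles :", titles.nunique())
print(sorted(titles.unique()))
```

Keeping the rules in one dictionary makes it easier to review or extend them later.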


#### 4.1.3 : Create NameLength

In [147]:
NameLength = [ len(x) for x in df2.Name.tolist() ]

# print(NameLength)

Add three new features to the dataframe

In [148]:
df2['LastName']   = np.array(LastName)
df2['Title']      = np.array(Title)
df2['NameLength'] = np.array(NameLength)

### 4.2 : Create Feature from 'Parch' and 'SibSp'

Check 'Parch' and 'SibSp' values

In [149]:
df2[['Parch','SibSp']].head()

Unnamed: 0,Parch,SibSp
0,0,0
1,2,1
2,2,1
3,2,1
4,2,1


'FamilySize' is created by adding SibSp, Parch, and 1. SibSp is the count of siblings and spouses, 
and Parch is the count of parents and children. These columns are added to find the 
total size of each family. The 1 in the addition represents the passenger themselves.

In [150]:
df2['FamilySize'] = df2[['Parch','SibSp']].apply( lambda row : np.sum(row)+1, axis=1 )
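
The row-wise `apply` can be replaced by plain column arithmetic, which gives the same result and is usually faster on large frames. A minimal sketch on the first five rows shown above; `sample` is an illustrative name.

```python
import pandas as pd

# Parch and SibSp of the first five passengers
sample = pd.DataFrame({"Parch": [0, 2, 2, 2, 2], "SibSp": [0, 1, 1, 1, 1]})

# Column-wise addition; the + 1 counts the passenger themselves
sample["FamilySize"] = sample["Parch"] + sample["SibSp"] + 1
print(sample)
```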

In [151]:
df2.head(3)

Unnamed: 0,Pclass,Survived,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Boat,HomeDest,LastName,Title,NameLength,FamilySize
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2,StLouis_MO,Allen,Miss.,29,1
1,1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.55,C22_C26,S,11,Montreal_PQ__Chesterville_ON,Allison,Master.,30,4
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22_C26,S,UNKNOWN,Montreal_PQ__Chesterville_ON,Allison,Miss.,28,4


Final check  for null values

In [152]:
df2.isnull().sum()

Pclass        0
Survived      0
Name          0
Sex           0
Age           0
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin         0
Embarked      0
Boat          0
HomeDest      0
LastName      0
Title         0
NameLength    0
FamilySize    0
dtype: int64

---------

## Part 5 : Get the Final Dataframe

Shuffle final data randomly without replacement 

In [153]:
n  = df2.shape[0] 
df = df2.sample(n, replace=False).reset_index(drop=True)
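
A shuffle without a seed gives a different row order on every run. If a repeatable order is wanted, `sample` accepts a `random_state`. The sketch below uses a five-row stand-in frame; the seed value 42 is an arbitrary assumption.

```python
import pandas as pd

# Stand-in frame, for illustration only
sample = pd.DataFrame({"Pclass": [1, 1, 1, 3, 3], "Survived": [1, 1, 0, 0, 0]})

# frac=1 draws every row exactly once; random_state fixes the order
shuffled = sample.sample(frac=1, random_state=42).reset_index(drop=True)
print(shuffled)
```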

In [155]:
dataFile = 'data_titanic_clean.csv'
# Save the shuffled dataframe (df), not the unshuffled df2
df.to_csv( dataFile, index=False)