<a href="https://colab.research.google.com/github/insight4healthlab/course-GS-HLTH-6270/blob/main/notebooks/classification_Titanic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building a Classifier with the Titanic Data Set

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Image, display
%matplotlib inline

sns.set(style="whitegrid", font_scale=1.75)

# Titanic dataset:

On April 15, 1912, the Titanic, the largest passenger liner of its time, sank after colliding with an iceberg during her maiden voyage. The sinking killed 1502 of the 2224 passengers and crew. This tragedy shocked the international community and led to better safety regulations for ships. One reason for the heavy loss of life was that there were not enough lifeboats for the passengers and crew. Although luck played a part in surviving the sinking, some groups of people were more likely to survive than others.

The titanic.csv file contains data for 887 of the real Titanic passengers. Each row represents one person. The columns describe different attributes of the person, including whether they survived (S), their age (A), their passenger class (C), their sex (G), and the fare they paid (X).

In [None]:
#Load the dataset using Pandas
# train and test
data_path = 'course-GS-HLTH-6270/datasets/titanic'
train = pd.read_csv('/content/gdrive/My Drive/'+data_path+'/train.csv')
test = pd.read_csv('/content/gdrive/My Drive/'+data_path+'/test.csv')

print(train.shape)
print(test.shape)

(891, 12)
(418, 11)


In [None]:
train.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [None]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [None]:
# create new features
# for Name
train['Name_len']=train.Name.str.len()

In [None]:
train['Ticket_First']=train.Ticket.str[0]

In [None]:
train['FamilyCount']=train.SibSp+train.Parch

In [None]:
train['Cabin_First']=train.Cabin.str[0]

In [None]:
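# Regular expression to extract the title (e.g. "Mr.", "Mrs.") from Name.
# A raw string avoids the invalid "\," escape sequence in a normal string literal.
train['title'] = train.Name.str.extract(r', ([A-Z][^ ]*\.)', expand=False)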
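To see what this pattern captures, here is a small self-contained check on a few sample names written in the dataset's "Surname, Title. Given names" format. The `names` Series and the `title_pattern` variable are illustrative names, not part of the notebook. The pattern matches a comma and a space followed by a capitalized token that ends in a period, so a name whose title does not begin with a capital letter right after the comma produces NaN.

```python
import pandas as pd

# Sample names in the "Surname, Title. Given names" format (illustrative)
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Same pattern as in the cell above, written as a raw string
title_pattern = r', ([A-Z][^ ]*\.)'

# expand=False returns a Series because the pattern has a single capture group
titles = names.str.extract(title_pattern, expand=False)
print(titles.tolist())
```

Names that the pattern does not match are left as NaN in `title`, so those rows are removed by the later `dropna()` call.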

In [None]:
train.title.value_counts().reset_index()

| | index | title |
|---|---|---|
| 0 | Mr. | 517 |
| 1 | Miss. | 182 |
| 2 | Mrs. | 125 |
| 3 | Master. | 40 |
| 4 | Dr. | 7 |
| 5 | Rev. | 6 |
| 6 | Major. | 2 |
| 7 | Mlle. | 2 |
| 8 | Col. | 2 |
| 9 | Don. | 1 |


**Missing Value treatment**

Missing values in a dataset can cause errors in some machine learning algorithms, so rows that contain them should either be removed or have the missing values imputed.
Imputing means replacing a missing value with a substitute value.

There are many options we could consider when replacing a missing value, for example:

- A constant value that has meaning within the domain, such as 0, distinct from all other values.
- A value from another randomly selected record.
- The mean, median, or mode value for the column.
- A value estimated by another predictive model.
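The first three options can be sketched with plain pandas. This is a minimal illustration on a toy `age` Series (made-up values, not taken from the Titanic file); the sentinel value `-1` and the random seed are arbitrary assumptions. Model-based imputation is left out to keep the sketch short.

```python
import numpy as np
import pandas as pd

# Toy column with two missing values (illustrative data)
age = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan, 54.0], name="Age")

# 1. Constant value that is distinct from every real value
age_constant = age.fillna(-1)

# 2. Value taken from another randomly selected, non-missing record
rng = np.random.default_rng(seed=0)
observed = age.dropna().to_numpy()
age_random = age.copy()
age_random[age.isna()] = rng.choice(observed, size=int(age.isna().sum()))

# 3. Summary statistic of the observed values
age_mean = age.fillna(age.mean())      # mean of 22, 38, 35, 54
age_median = age.fillna(age.median())  # median of 22, 35, 38, 54

print(age_mean.tolist())
print(age_median.tolist())
```

The median is usually less sensitive to outliers than the mean, which can matter for skewed columns such as fares.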


In [None]:
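# impute any missing Fare values with the mean Fare value
# (assign the result back; inplace=True on a selected column may not update the DataFrame)
train['Fare'] = train['Fare'].fillna(train['Fare'].mean())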

In [None]:
# impute the missing Age values with the mean Age value
# (assign the result back; inplace=True on a selected column may not update the DataFrame)
train['Age'] = train['Age'].fillna(train['Age'].mean())

Sometimes it is more reasonable to drop the column entirely.

In [None]:
# About 77% of the Cabin values are missing,
# so we drop the column before training a machine learning algorithm
train.Cabin.isnull().mean()

0.7710437710437711
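The same check can be run for every column at once, and columns above a chosen cutoff can then be dropped together. The sketch below uses a small made-up DataFrame; the 0.5 `threshold` is an arbitrary assumption for illustration, not a rule from this notebook.

```python
import numpy as np
import pandas as pd

# Small made-up frame: "Cabin" is mostly missing, "Age" only partly
df = pd.DataFrame({
    "Age":   [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare":  [7.25, 71.28, 7.93, 53.10],
})

# Fraction of missing values per column
missing_share = df.isnull().mean()

# Drop every column whose missing fraction exceeds the (assumed) threshold
threshold = 0.5
to_drop = missing_share[missing_share > threshold].index
df_reduced = df.drop(columns=to_drop)

print(missing_share)
print(df_reduced.columns.tolist())
```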

In [None]:
trainML = train[['Survived', 'Pclass', 'Sex', 'Age', 'Parch',
       'Fare', 'Embarked', 'Name_len', 'Ticket_First', 'FamilyCount',
       'title']]

In [None]:
trainML

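| | Survived | Pclass | Sex | Age | Parch | Fare | Embarked | Name_len | Ticket_First | FamilyCount | title |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.000000 | 0 | 7.2500 | S | 23 | A | 1 | Mr. |
| 1 | 1 | 1 | female | 38.000000 | 0 | 71.2833 | C | 51 | P | 1 | Mrs. |
| 2 | 1 | 3 | female | 26.000000 | 0 | 7.9250 | S | 22 | S | 0 | Miss. |
| 3 | 1 | 1 | female | 35.000000 | 0 | 53.1000 | S | 44 | 1 | 1 | Mrs. |
| 4 | 0 | 3 | male | 35.000000 | 0 | 8.0500 | S | 24 | 3 | 0 | Mr. |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 0 | 2 | male | 27.000000 | 0 | 13.0000 | S | 21 | 2 | 0 | Rev. |
| 887 | 1 | 1 | female | 19.000000 | 0 | 30.0000 | S | 28 | 1 | 0 | Miss. |
| 888 | 0 | 3 | female | 29.699118 | 2 | 23.4500 | S | 40 | W | 3 | Miss. |
| 889 | 1 | 1 | male | 26.000000 | 0 | 30.0000 | C | 21 | 1 | 0 | Mr. |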


In [None]:
# drop rows of missing values
trainML = trainML.dropna()

In [None]:
# check whether the DataFrame still has any missing values
trainML.isnull().sum()

Survived        0
Pclass          0
Sex             0
Age             0
Parch           0
Fare            0
Embarked        0
Name_len        0
Ticket_First    0
FamilyCount     0
title           0
dtype: int64

**Encoding categorical variables**

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(trainML['Sex'].unique())
le.classes_

array(['female', 'male'], dtype=object)
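`LabelEncoder` assigns integer codes in the sorted order of the class labels, which is why `female` maps to 0 and `male` to 1. When one encoder object is refit for each column, only the last column's mapping remains available. A possible variation, sketched below on a made-up `demo` frame, keeps one encoder per column in a dictionary (the names `demo`, `encoders`, and `encoded` are illustrative) so each mapping can be inverted later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up example frame with two categorical columns
demo = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})

encoders = {}          # one fitted encoder per column
encoded = demo.copy()  # work on a copy so the original stays unchanged
for col in ["Sex", "Embarked"]:
    enc = LabelEncoder()
    encoded[col] = enc.fit_transform(demo[col])
    encoders[col] = enc

# Each mapping can still be inverted, because every column kept its encoder
restored = encoders["Embarked"].inverse_transform(encoded["Embarked"])

print(encoded)
print(list(restored))
```

Note that the integer codes carry an order (C < Q < S) that has no real meaning for a nominal feature. A linear model such as logistic regression treats the codes as magnitudes, so one-hot encoding (for example with `pd.get_dummies`) is a common alternative for such columns.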

In [None]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
trainML = trainML.assign(Sex=le.transform(trainML['Sex']))


In [None]:
le.fit(trainML['title'].unique())
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
trainML = trainML.assign(title=le.transform(trainML['title']))


In [None]:
le.fit(trainML['Ticket_First'].unique())
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
trainML = trainML.assign(Ticket_First=le.transform(trainML['Ticket_First']))


In [None]:
le.fit(trainML['Embarked'].unique())
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
trainML = trainML.assign(Embarked=le.transform(trainML['Embarked']))


# Building a Logistic Regression Classifier


In [None]:
#create input-data and target-data
x_data = trainML[['Pclass', 'Sex', 'Age', 'Parch','Fare', 'Embarked', 'Name_len', 'Ticket_First', 'FamilyCount','title']]
y_data = trainML['Survived']

In [None]:
#split the data into train and test
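The split cell is left empty in the notebook. One common way to do it is scikit-learn's `train_test_split`. The sketch below runs on synthetic stand-ins (`x_demo`, `y_demo`) rather than the real `x_data` and `y_data`; the 80/20 ratio, the `random_state`, and the stratification are assumptions, not choices made in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for x_data and y_data (illustrative only)
rng = np.random.default_rng(0)
x_demo = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=100),
    "Age": rng.uniform(1, 80, size=100),
})
y_demo = pd.Series(rng.integers(0, 2, size=100), name="Survived")

# Hold out 20% of the rows; stratify keeps the class ratio similar in both parts
x_train, x_test, y_train, y_test = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(x_train.shape, x_test.shape)
```

With the notebook's own variables, the same call would take `x_data` and `y_data` in place of the synthetic inputs.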

In [None]:
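# Normalize the input features only (not the Survived target),
# and keep the result in a new variable so trainML is not overwritten
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_data_scaled = scaler.fit_transform(x_data)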
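When the data is split into training and test parts, a common practice is to fit the scaler on the training part only and reuse its mean and standard deviation for the test part, so that no information from the test rows leaks into preprocessing. A minimal sketch with tiny made-up arrays (`x_train_demo` and `x_test_demo` are illustrative names):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny made-up feature matrices: 3 training rows, 1 test row, 2 features
x_train_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
x_test_demo = np.array([[2.0, 40.0]])

scaler_demo = StandardScaler()
# Learn mean and standard deviation from the training rows, then scale them
x_train_scaled = scaler_demo.fit_transform(x_train_demo)
# Reuse the training statistics for the test rows (no refitting)
x_test_scaled = scaler_demo.transform(x_test_demo)

print(x_train_scaled.mean(axis=0), x_train_scaled.std(axis=0))
print(x_test_scaled)
```

After scaling, each training column has mean 0 and standard deviation 1, while test values are expressed relative to the training distribution.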

In [None]:
# fit the Logistic Regression Classifier
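The fitting cell is also left empty. The sketch below shows one way the remaining steps could fit together: split, scale, fit a `LogisticRegression`, and score it on held-out rows. It uses synthetic data from `make_classification` as a stand-in for `x_data` and `y_data`; the pipeline, `max_iter=1000`, and the 80/20 split are assumptions, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for x_data / y_data
x_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

x_train, x_test, y_train, y_test = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on the training rows only, then the classifier
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(x_train, y_train)

# Evaluate on rows the model has not seen
y_pred = clf.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Held-out accuracy: {accuracy:.3f}")
```

Wrapping the scaler and the classifier in one pipeline keeps the preprocessing tied to the model, so the same scaling is applied automatically whenever `predict` is called.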
