# Project: Titanic - Machine Learning Project

# Table of Contents
 - <a href='#Understanding the data structure of and its attributes'>Understanding the data structure and its attributes</a>
 - <a href='#Data Wrangling'>Data Wrangling</a>
 - <a href='#Data splitting and model training'>Data splitting and model training</a>

## Objective: Predict which passengers survived the shipwreck and which did not.

# Description:
The project is based on the Titanic shipwreck. The `training set` will be used to build a machine learning model. The model will be based on features such as passengers' gender and class, and also on the creation of additional feature(s).

The `test set` will be used to see how well the model performs on data it has never seen before. For the test set, the ground truth values are not provided for each passenger.

`gender_submission.csv` is an example submission that generalises that only female passengers survived the shipwreck.
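
The gender-only rule behind that example submission can be sketched in a few lines. This is a minimal illustration, not the code that produced the file: the three rows are a small stand-in for `test.csv`, and the name `baseline` is an assumption made for this example.

```python
import pandas as pd

# Small stand-in for test.csv; only the two columns the rule needs.
sample = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Sex": ["male", "female", "male"],
})

# Gender-only baseline: predict 1 (survived) for women and 0 for men.
baseline = pd.DataFrame({
    "PassengerId": sample["PassengerId"],
    "Survived": (sample["Sex"] == "female").astype(int),
})
print(baseline)
```

Any trained model should at least match the accuracy of this rule to be worth submitting.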

# Data and Data Description
 - Survival: Whether the passenger survived the shipwreck (0 = did not survive, 1 = survived).
 - Ticket class: A proxy for socio-economic status (SES).
      - 1st: Upper class.
      - 2nd: Middle class.
      - 3rd: Lower class.
 - Age: Age in years.
 - SibSp: Number of siblings / spouses aboard the Titanic.
 - Parch: Number of parents / children aboard the Titanic. The dataset defines these family relationships as:
     - Parent: Mother, father.
     - Child: Daughter, son, stepdaughter, stepson. Some children travelled only with a nanny, therefore Parch = 0 for them.
 - Ticket: Ticket number.
 - Fare: Passenger fare.
 - Cabin: Cabin number.
 - Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).
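
As one possible example of an additional feature, the two family counts can be combined into a single family size. This is a hedged sketch only: `FamilySize` and `IsAlone` are hypothetical column names that this notebook does not create, and the values are a small illustrative sample.

```python
import pandas as pd

# Illustrative SibSp / Parch values for five passengers.
passengers = pd.DataFrame({"SibSp": [1, 1, 0, 1, 0], "Parch": [0, 0, 0, 0, 0]})

# Family size counts the passenger plus siblings/spouses and parents/children.
passengers["FamilySize"] = passengers["SibSp"] + passengers["Parch"] + 1

# Flag passengers travelling without any listed relative.
passengers["IsAlone"] = (passengers["FamilySize"] == 1).astype(int)
print(passengers)
```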

# Importing necessary libraries

In [2]:
# Libraries for data manipulation and data wrangling
import pandas as pd
import numpy as np

# Libraries for model training
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Libraries for splitting data sets into train and test splits
from sklearn.model_selection import train_test_split

# Importing the train and test data sets

In [4]:
# Loading train_df and test_df; combine lets later cells apply the same step to both frames.
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")
combine = [train_df, test_df]

<a id='Understanding the data structure of and its attributes'></a>
# Understanding the data structure and its attributes
This phase of the project involves viewing the data and understanding its structure, its features, and the data types of all the observed features.

In [5]:
# the train_df and all its features at a glance.
train_df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [6]:
# the test_df and all its features at a glance.
test_df.head()

| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |


In [7]:
# A brief overview of the features and their data types for train_df and test_df.
train_df.info()
print("-"*10)
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
----------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1

<a id='Data Wrangling'></a>
# Data Wrangling
This phase of the project involves identifying the missing values of the features within the dataframes. Missing `Age` values are filled with random draws from a normal distribution built from the column's mean and standard deviation, missing `Embarked` values are filled with the most frequent port, and the missing `Fare` value is filled with 0. Redundant features such as `PassengerId`, `Name`, `Ticket` and `Cabin` are then removed.

In [8]:
# Viewing the missing values in the train_df
train_df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [9]:
# Viewing the missing values in the test_df
test_df.isnull().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

In [10]:
# The same check on train_df with isna(), which is an alias of isnull().
train_df.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [11]:
# The same check on test_df with isna(), which is an alias of isnull().
test_df.isna().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

In [92]:
# Filling missing Age values in train_df with random draws from a normal
# distribution that has the column's mean and standard deviation.
np.random.seed(42)
train_mean = train_df["Age"].mean()
train_std = train_df["Age"].std()
num_na = train_df["Age"].isna().sum()
random_vals = train_mean + train_std * np.random.randn(num_na)
train_df.loc[train_df["Age"].isna(), "Age"] = random_vals
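
One caveat with this kind of fill is that a normal draw can fall below zero when the standard deviation is a sizeable fraction of the mean. The sketch below shows one way to guard against that by clipping the draws to the observed age range. It is an illustration under stated assumptions, not the cell used in this notebook: `impute_age_normal` is a hypothetical helper, it uses `np.random.default_rng` instead of the legacy `np.random.seed` / `randn` pair, and the sample ages are made up.

```python
import numpy as np
import pandas as pd

def impute_age_normal(age: pd.Series, seed: int = 42) -> pd.Series:
    """Fill NaNs with normal draws based on the observed ages, clipped to the observed range."""
    rng = np.random.default_rng(seed)
    observed = age.dropna()
    missing = age.isna()
    draws = rng.normal(observed.mean(), observed.std(), size=missing.sum())
    # Clip so no imputed age is negative or older than the oldest observed passenger.
    draws = np.clip(draws, observed.min(), observed.max())
    filled = age.copy()
    filled.loc[missing] = draws
    return filled

# Made-up ages with two missing entries.
ages = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan, 2.0])
filled = impute_age_normal(ages)
print(filled)
```

Clipping slightly distorts the tails of the imputed distribution, so it is a trade-off rather than a strictly better choice.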

In [94]:
# Ages are kept as whole years here, so the column is converted from float to integer (astype truncates the fractional part).
train_df["Age"] = train_df["Age"].astype(np.int64)

# Viewing a slice of train_df to check that the changes were applied and no unwanted changes were made.
train_df.iloc[335:343]

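| | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|
| 335 | 0 | 3 | 0 | 33 | 0 | 0 | 7 | 0 |
| 336 | 0 | 1 | 0 | 29 | 1 | 0 | 66 | 0 |
| 337 | 1 | 1 | 1 | 41 | 0 | 0 | 134 | 1 |
| 338 | 1 | 3 | 0 | 45 | 0 | 0 | 8 | 0 |
| 339 | 0 | 1 | 0 | 45 | 0 | 0 | 35 | 0 |
| 340 | 1 | 2 | 0 | 2 | 1 | 1 | 26 | 0 |
| 341 | 1 | 1 | 1 | 24 | 3 | 2 | 263 | 0 |
| 342 | 0 | 2 | 0 | 28 | 0 | 0 | 13 | 0 |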


In [15]:
# A more comprehensive check of whether the missing Age values in train_df were filled.
train_df.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [16]:
# Filling missing Age values in test_df with random draws from a normal
# distribution that has the column's mean and standard deviation.
test_mean = test_df["Age"].mean()
test_std = test_df["Age"].std()
num_na = test_df["Age"].isna().sum()
random_vals = test_mean + test_std * np.random.randn(num_na)
test_df.loc[test_df["Age"].isna(), "Age"] = random_vals

In [17]:
# A more comprehensive check of whether the missing Age values in test_df were filled.
test_df.isna().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

In [96]:
# Ages are kept as whole years here, so the test_df column is also converted from float to integer.
test_df["Age"] = test_df["Age"].astype(np.int64)
train_df["Age"].head()

0    22
1    38
2    26
3    35
4    35
Name: Age, dtype: int64

In [99]:
# Cross-checking the data types of the train and test data sets after filling missing values and changing data types.
train_df.info()
print(10*"-")
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Survived  891 non-null    int64
 1   Pclass    891 non-null    int64
 2   Sex       891 non-null    int32
 3   Age       891 non-null    int64
 4   SibSp     891 non-null    int64
 5   Parch     891 non-null    int64
 6   Fare      891 non-null    int32
 7   Embarked  891 non-null    int32
dtypes: int32(3), int64(5)
memory usage: 45.4 KB
----------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  418 non-null    int64
 1   Pclass       418 non-null    int64
 2   Sex          418 non-null    int32
 3   Age          418 non-null    int64
 4   SibSp        418 non-null    int64
 5   Parch        418 non-null    int64
 6   Fare         418 non-null    int32
 7

In [20]:
# Identifying categorical data and mapping it to corresponding numerical values.
for dataset in combine:
    dataset["Sex"] = dataset["Sex"].map({"male": 0, "female": 1}).astype(int)

In [21]:
# Cross checking whether the categorical changes were implemented.
test_df.head()

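| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | 0 | 34 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | 1 | 47 | 1 | 0 | 363272 | 7.0 | NaN | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | 0 | 62 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | 0 | 27 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | 1 | 22 | 1 | 1 | 3101298 | 12.2875 | NaN | S |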


In [22]:
# Cross checking whether the categorical changes were implemented.
train_df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 0 | 22 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 1 | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 1 | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 1 | 35 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 0 | 35 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [23]:
# Most frequent port of embarkation in train_df, used to fill missing Embarked values.
port = train_df.Embarked.dropna().mode()[0]
port

'S'

In [25]:
# Fill missing values of embarked.
for dataset in combine:
    dataset["Embarked"] = dataset["Embarked"].fillna(port)

In [26]:
# Identifying categorical data and mapping it to corresponding numerical values.
for dataset in combine:
    dataset["Embarked"] = dataset["Embarked"].map({"S": 0, "C": 1, "Q": 2}).astype(int)
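
Integer codes such as S = 0, C = 1, Q = 2 impose an order on the ports, which a linear model like logistic regression will treat as meaningful. A common alternative is one-hot encoding. The sketch below contrasts the two on a made-up `Embarked` column; it is an illustration only, and the models in this notebook are trained on the integer codes.

```python
import pandas as pd

# Made-up sample of embarkation ports.
embarked = pd.Series(["S", "C", "S", "S", "Q"], name="Embarked")

# Integer codes, as in the mapping above: implies S < C < Q to a linear model.
codes = embarked.map({"S": 0, "C": 1, "Q": 2})

# One-hot alternative: one 0/1 column per port, with no implied order.
one_hot = pd.get_dummies(embarked, prefix="Embarked", dtype=int)
print(one_hot)
```

Tree-based models such as a random forest are generally less sensitive to this choice than linear models.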

In [27]:
# Cross checking whether the categorical changes were implemented.
train_df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 0 | 22 | 1 | 0 | A/5 21171 | 7.25 | NaN | 0 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 1 | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | 1 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 1 | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | 0 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 1 | 35 | 1 | 0 | 113803 | 53.1 | C123 | 0 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 0 | 35 | 0 | 0 | 373450 | 8.05 | NaN | 0 |


In [29]:
# Filling missing Fare values with 0 and converting the column from float to integer.
for dataset in combine:
    dataset["Fare"] = dataset["Fare"].fillna(0)
    dataset["Fare"] = dataset["Fare"].astype(int)
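
Filling a missing fare with 0 treats that passenger as having travelled for free. A gentler option is to fill with a typical fare such as the training median. The sketch below assumes small illustrative fare columns and the hypothetical names `train_fare`, `test_fare` and `fare_median`; it is not the fill used in this notebook.

```python
import numpy as np
import pandas as pd

# Illustrative fares: a few training values and a test column with one gap.
train_fare = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05])
test_fare = pd.Series([7.8292, np.nan, 9.6875])

# Use the training median so the test set does not determine its own fill value.
fare_median = train_fare.median()
test_fare_filled = test_fare.fillna(fare_median)
print(test_fare_filled)
```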

In [30]:
# Removal of unnecessary features. PassengerId is kept in test_df for now and dropped just before prediction.
train_df = train_df.drop(["PassengerId", "Name", "Ticket", "Cabin"], axis=1)
test_df = test_df.drop(["Name", "Ticket", "Cabin"], axis=1)

In [31]:
train_df.head()

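| | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 0 | 22 | 1 | 0 | 7 | 0 |
| 1 | 1 | 1 | 1 | 38 | 1 | 0 | 71 | 1 |
| 2 | 1 | 3 | 1 | 26 | 0 | 0 | 7 | 0 |
| 3 | 1 | 1 | 1 | 35 | 1 | 0 | 53 | 0 |
| 4 | 0 | 3 | 0 | 35 | 0 | 0 | 8 | 0 |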


In [32]:
test_df.head()

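| | PassengerId | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | 0 | 34 | 0 | 0 | 7 | 2 |
| 1 | 893 | 3 | 1 | 47 | 1 | 0 | 7 | 0 |
| 2 | 894 | 2 | 0 | 62 | 0 | 0 | 9 | 2 |
| 3 | 895 | 3 | 0 | 27 | 0 | 0 | 8 | 0 |
| 4 | 896 | 3 | 1 | 22 | 1 | 1 | 12 | 0 |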


<a id='Data splitting and model training'></a>
# Data splitting and model training
This involves splitting the labelled data into a train split and a held-out test split, fitting the models on the train split, and evaluating them on the held-out split.

In [47]:
# Separating the target from the features.
X_train = train_df.drop("Survived", axis=1)
y_train = train_df["Survived"]
# x_test (lowercase) holds the features of the unlabelled test_df; PassengerId is not a model feature.
x_test = test_df.drop("PassengerId", axis=1)

In [48]:
# Splitting X_train and y_train into a training split and a held-out test split (10% of the rows).
np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.1)
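
The split above relies on the global NumPy seed for reproducibility. `train_test_split` also accepts `random_state` and `stratify` arguments, which pin the shuffle to the call itself and keep the class balance equal across both splits. The sketch below shows this on made-up arrays; passing `random_state=42` to the cell above would give a different split from the one recorded here, so this is offered as an alternative rather than a drop-in change.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 2 features, 60 negatives and 40 positives.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 60 + [1] * 40)

# random_state fixes the shuffle; stratify=y preserves the 60/40 class ratio.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.1, random_state=42, stratify=y
)
print(len(X_tr), len(X_val), y_val.mean())
```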

In [49]:
# Assignment of logistic regression to a variable.
lml_model = LogisticRegression(max_iter=1000)

# Assignment of random forest classifier to a variable.
clf_model = RandomForestClassifier()

In [50]:
# Fitting the model to the training data set
lml_model = lml_model.fit(X_train, y_train)
clf_model = clf_model.fit(X_train, y_train)

In [51]:
# Logistic regression accuracy on the training split
lml_model.score(X_train, y_train)

0.7940074906367042

In [52]:
# Random forest accuracy on the training split
clf_model.score(X_train, y_train)

0.9675405742821473

In [53]:
# Logistic regression accuracy on the held-out test split
lml_model.score(X_test, y_test)

0.8333333333333334

In [54]:
# Random forest accuracy on the held-out test split
clf_model.score(X_test, y_test)

0.8666666666666667
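
With a 10% hold-out of 891 rows, each score above rests on only 90 passengers, so a single misclassified row moves accuracy by about one percentage point. The random forest also scores far higher on its training split than on the hold-out, which suggests overfitting. Cross-validation gives a steadier estimate. The sketch below uses `cross_val_score` on synthetic data from `make_classification` as a stand-in for the Titanic features; the numbers it prints are not Titanic results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the Titanic features: 891 rows, 7 columns.
X, y = make_classification(
    n_samples=891, n_features=7, n_informative=4, random_state=42
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=42),
}

cv_scores = {}
for name, model in models.items():
    # 5-fold cross-validated accuracy: each fold is held out exactly once.
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```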

In [58]:
# Titanic shipwreck survival predictions using logistic regression
lml_predict = lml_model.predict(x_test)
lml_predict[:5]

array([0, 0, 0, 0, 1], dtype=int64)

In [59]:
# Titanic shipwreck survival predictions using random forest classifier
clf_predict = clf_model.predict(x_test)
clf_predict[:5]

array([0, 0, 0, 1, 0], dtype=int64)

In [70]:
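# Writing the logistic regression predictions to a submission file, reusing the PassengerId column of gender_submission.csv.
lml_submission = pd.read_csv("gender_submission.csv")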
lml_submission["Survived"] = lml_predict
lml_submission.to_csv("titanic_lml_prediction.csv", index=False)

In [71]:
# Writing the random forest predictions to a submission file in the same way.
clf_submission = pd.read_csv("gender_submission.csv")
clf_submission["Survived"] = clf_predict
clf_submission.to_csv("titanic_clf_prediction.csv", index=False)
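
The submission frames above borrow their `PassengerId` column from `gender_submission.csv`, which works as long as that file lists passengers in the same order as `test.csv`. A submission can also be built directly from the ids and predictions, as sketched below. `make_submission` is a hypothetical helper written for this example, and the ids and predictions are a short illustrative sample.

```python
import numpy as np
import pandas as pd

def make_submission(passenger_ids, predictions) -> pd.DataFrame:
    """Pair each PassengerId with its predicted Survived value."""
    if len(passenger_ids) != len(predictions):
        raise ValueError("passenger_ids and predictions must have the same length")
    return pd.DataFrame({
        "PassengerId": list(passenger_ids),
        "Survived": np.asarray(predictions).astype(int),
    })

# Illustrative ids and predictions for five passengers.
submission = make_submission([892, 893, 894, 895, 896], [0, 0, 0, 0, 1])
print(submission)

# Writing to disk would then be:
# submission.to_csv("titanic_lml_prediction.csv", index=False)
```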