# DSE200x Final Project by Liyang (Leon) Guan

## Dataset: The Titanic Survival Dataset


The original dataset was collected from: <a href="https://www.kaggle.com/c/titanic/data">Titanic Dataset</a>

## 0. Libraries & Data Ingestion 

<p>This section imports the relevant Python libraries and the Titanic dataset itself. Because the original testing dataset provided by Kaggle is not labelled (the survival column is missing), I will use only the training dataset and split it into training and testing data.</p>

In [1]:
# Python libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [3]:
# import data

df = pd.read_csv("./titanic/train.csv")
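
The split of the labelled data into training and testing parts can be sketched with pandas alone. This is a minimal illustration, not the project's actual split: the 80/20 ratio, the seed, and the names `demo`, `train_part` and `test_part` are assumptions, and the small `demo` frame stands in for `df`.

```python
import pandas as pd

# Hypothetical stand-in for the labelled Kaggle training data (df above).
demo = pd.DataFrame({
    "PassengerId": range(1, 11),
    "Survived": [0, 1, 1, 1, 0, 0, 0, 0, 1, 1],
})

# Assumption: an 80/20 split with a fixed seed for reproducibility.
train_part = demo.sample(frac=0.8, random_state=42)
# Everything not sampled into the training part becomes the testing part.
test_part = demo.drop(train_part.index)

print(len(train_part), len(test_part))
```

Sampling by row and dropping the sampled index guarantees that no passenger appears in both parts.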

## 1. Initial Data Exploration 

<p>In this step, an initial exploration of the dataset is conducted to gain some first insights into the data and to help determine the research questions for this analysis.</p>

In [4]:
# data frame features

df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

<p>Details regarding each feature can be found at <a href="https://www.kaggle.com/c/titanic/data">https://www.kaggle.com/c/titanic/data</a></p>

In [5]:
# peek at the first 10 rows

df.head(10)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 | NaN | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |


In [6]:
# dataframe original shape 

df.shape

(891, 12)

In [11]:
# dataframe types

df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [7]:
# the number of null values for each column

df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
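
To put these counts in proportion, the share of rows with a missing value can be computed per column. The sketch below only reuses the numbers printed above (891 rows; 177, 687 and 2 nulls); the names `n_rows`, `null_counts` and `null_share` are illustrative.

```python
import pandas as pd

# Null counts and row count copied from the outputs above.
n_rows = 891
null_counts = pd.Series({"Age": 177, "Cabin": 687, "Embarked": 2})

# Share of rows with a missing value, per column.
null_share = (null_counts / n_rows).round(3)
print(null_share)
```

On the real data frame, `df.isnull().mean()` gives the same shares directly. Roughly 77% of `Cabin` values and about 20% of `Age` values are missing, while `Embarked` is nearly complete.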

In [13]:
# the number of passengers who survived vs. did not survive

len(df[df["Survived"] == 1]), len(df[df["Survived"] == 0])

(342, 549)
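
These two counts also give a reference point for any accuracy figure. A short worked calculation, using only the counts above (the variable names are illustrative):

```python
survived, not_survived = 342, 549   # counts from the output above
total = survived + not_survived     # 891 passengers in the training file

# Overall survival rate in the data.
survival_rate = survived / total
# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = max(survived, not_survived) / total

print(f"survival rate: {survival_rate:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

About 38.4% of passengers in this file survived, so a model that always predicts "did not survive" is already right about 61.6% of the time; a useful classifier has to beat that figure.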

## 2. Research Questions

<p>This analysis centers on the following research questions:</p>
<ul>
    <li>1. Which 2-3 features in this Titanic dataset are most likely to determine the survival outcome? </li>
    <li>2. Can the survival outcome be predicted with decent accuracy (above 0.8) from most of the features provided in this Titanic dataset? <b>[Normal Classification]</b></li>
    <li>3. Can the probability of survival be predicted for each passenger from most of the features provided in this Titanic dataset? <b>[Bayes Classification]</b> </li>
</ul>

<p><b>Motivation:</b></p>
<p>Such an analysis could benefit several parties. First, finding the most predictive features and predicting the probability of survival would allow insurance companies to introduce more suitable classes of insurance for different types of passengers. Second, these metrics (survival probability in a catastrophic event, key survival features, etc.) could be given to passengers before they purchase tickets, so that they have an idea of the potential risks in an accident. Furthermore, a successful investigation of this dataset might also be useful for predicting survival rates in other events such as earthquakes or tornadoes, given extra domain knowledge and further modifications. Finally, I am interested in these problems myself, because my initial intuition (before any research) tells me that the features provided in the dataset would not be good indicators of survival. It is both exciting and intriguing to see how my intuition fails me and to learn some new insights into these tragic events.</p>

## 3. Data Cleaning 

This section involves the following steps: 
<ul>
<li>Removing null & missing values in the original Titanic dataset. </li>
<li>Removing columns that are hard to encode. </li>
<li>Determining whether we have a sufficient amount of data to proceed. </li>
<li>Encoding some of the string columns if necessary. </li>
</ul>
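
The steps above can be sketched as a single function. This is one possible cleaning pass under stated assumptions, not the project's final procedure: treating `Name`, `Ticket` and `Cabin` as the hard-to-encode columns, dropping (rather than imputing) rows with missing values, and the helper name `clean_titanic` are all illustrative choices. The `sample` frame reuses four rows from the preview in Section 1.

```python
import pandas as pd

def clean_titanic(frame: pd.DataFrame) -> pd.DataFrame:
    """Illustrative cleaning pass; the column choices are assumptions."""
    # Assumption: free-text or mostly-empty columns count as hard to encode.
    out = frame.drop(columns=["Name", "Ticket", "Cabin"])
    # Drop rows that still contain missing values (e.g. in Age or Embarked).
    out = out.dropna()
    # Encode the remaining string columns: Sex as 0/1, Embarked as indicators.
    out = out.assign(Sex=out["Sex"].map({"male": 0, "female": 1}))
    out = pd.get_dummies(out, columns=["Embarked"])
    return out

# Four rows taken from the df.head(10) preview.
sample = pd.DataFrame({
    "Survived": [0, 1, 0, 1],
    "Pclass": [3, 3, 3, 2],
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina",
             "Moran, Mr. James", "Nasser, Mrs. Nicholas (Adele Achem)"],
    "Sex": ["male", "female", "male", "female"],
    "Age": [22.0, 26.0, None, 14.0],
    "Ticket": ["A/5 21171", "STON/O2. 3101282", "330877", "237736"],
    "Cabin": [None, None, None, None],
    "Embarked": ["S", "S", "Q", "C"],
})

cleaned = clean_titanic(sample)
# Check how much data survives the cleaning before proceeding.
print(f"{len(cleaned)} of {len(sample)} rows kept")
print(list(cleaned.columns))
```

Dropping `Cabin` before calling `dropna()` matters: with 687 of 891 cabin values missing, dropping rows first would discard most of the dataset.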

## 4. Research Methods

A different research method is used for each of the proposed research questions.

<ul>
    <li> For research question 1: Principal Component Analysis will be performed to find the 2-3 most relevant features regarding survival. </li>
    <li> For research question 2: since it is a binary classification, Decision Trees and Random Forests will be considered with different parameters.</li>
    <li> For research question 3: since it is a Bayes classification, Support Vector Machines will be used to produce the survival probability with different parameters. </li>
</ul>

## 5. Core Data Analysis 

<p>In this section, data explorations (including machine learning techniques) are conducted and tested to answer the research questions proposed in Section 2.</p>

<p>Data visualizations (i.e. plots) will be included throughout the research process.</p>

### 5.1 RQ1: Top Features of Determining Survival Outcome

<p>For this research question, Principal Component Analysis is the main focus of the investigation.</p>
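
A minimal sketch of what a PCA-based ranking could look like, written with NumPy's SVD rather than any particular library class. The data here is synthetic (random numbers with the Titanic column names attached purely as labels), and ranking features by their loading on the first component is an assumption about how the results would be read. Note that PCA is unsupervised: the loadings describe directions of greatest variance and do not use the `Survived` label.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for cleaned numeric features (not real Titanic data).
feature_names = ["Pclass", "Age", "SibSp", "Parch", "Fare"]
X = rng.normal(size=(200, len(feature_names)))
# Make two columns strongly correlated so the first component has structure.
X[:, 4] = -2.0 * X[:, 0] + 0.5 * rng.normal(size=200)

# Standardize each column, then take the SVD of the standardized matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
_, singular_values, components = np.linalg.svd(Z, full_matrices=False)
# Each squared singular value is proportional to a component's variance.
explained_ratio = singular_values**2 / np.sum(singular_values**2)

# Rank features by the absolute size of their loading on the first component.
order = np.argsort(-np.abs(components[0]))
for i in order[:3]:
    print(feature_names[i], round(float(components[0, i]), 3))
print("explained variance ratio:", explained_ratio.round(3))
```

Standardizing first keeps a large-valued column such as `Fare` from dominating the components purely because of its units.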

<p>Decisions & Justifications for RQ1:</p>

<ul>
    <li></li>
    <li></li>
</ul>

### 5.2 RQ2: Survival Outcome Classification

<p>For this research question, Decision Trees and Random Forests are the machine learning techniques I will be using. </p>
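
A hedged sketch of how the two classifiers could be compared with scikit-learn. The data is generated by `make_classification` as a stand-in for the cleaned Titanic features, and the hyperparameter values (`max_depth=4`, `n_estimators=100`) are placeholders to be tuned, not results from this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data standing in for the cleaned features.
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Assumption: these hyperparameter values are placeholders to be tuned.
models = {
    "decision tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "random forest": RandomForestClassifier(
        n_estimators=100, max_depth=4, random_state=0
    ),
}

# Fit each model on the training part and score it on the held-out part.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: test accuracy = {scores[name]:.3f}")
```

Scoring both models on the same held-out split makes their accuracies directly comparable with the 0.8 target in research question 2.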

<p>Decisions & Justifications for RQ2:</p>

<ul>
    <li></li>
    <li></li>
</ul>

### 5.3 RQ3: Survival Probability Prediction

<p>For this research question, Support Vector Machines will be considered.</p>

<p>Decisions & Justifications for RQ3:</p>
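
An SVM outputs a signed distance to its decision boundary rather than a probability, so a calibration step is needed to obtain survival probabilities. The sketch below shows one common option, Platt (sigmoid) scaling via scikit-learn's `CalibratedClassifierCV`; this is an assumed approach, not necessarily the one this project will adopt. The data is again synthetic, and the kernel and `C` value are placeholders.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary-classification data standing in for the cleaned features.
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Scale features first: SVMs are sensitive to feature magnitudes.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
# Sigmoid (Platt) calibration maps SVM scores to probabilities via 5-fold CV.
model = CalibratedClassifierCV(svm, method="sigmoid", cv=5)
model.fit(X_train, y_train)

# Column 1 holds the estimated probability of the positive class.
proba = model.predict_proba(X_test)[:, 1]
print(proba[:5].round(3))
```

With the real data, the positive class would be `Survived == 1`, so `proba` would hold one estimated survival probability per test passenger.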

<ul>
    <li></li>
    <li></li>
</ul>

## 6. Findings & Conclusions

<p><b>Findings:</b></p>


<p><b>Conclusions:</b></p>


## 7. Limitations

<p></p>

## 8. Recommended Future Work 

<p></p>
