# Titanic Data Analysis

Authors: $\lambda$ Justin Ventura & Blaine Mason $\lambda$

Date: Tuesday, December 1st, 2020.

## - Description -

In this notebook, we will use several machine learning techniques to build a model that predicts whether a given individual on the Titanic survived, given their 'features'.  We will show the effectiveness and learning curve of each algorithm, and analyze why each algorithm does or does not perform as intended.

### Below we will need to import these libraries:

In [18]:
""" Basic Data-Science / Machine Learning Libraries: """
import numpy as np  # Typical numpy as np.
import pandas as pd  # Dataframes library.
import numpy.linalg as la  # Linear Alg np.
import matplotlib.pyplot as plt  # Plotting.
from scipy import stats  # For simple stats.
from timeit import default_timer as timer # start = timer(), end = timer()

""" K-Nearest Neighbors Model Imports """
from KNN_Model import knn_vector, kNN_Model
kvect, kmodel = knn_vector, kNN_Model

""" Import the Titanic Dataset """
titanic = pd.read_csv('titanic_data.csv')  # Reads data nicely.
print('(Row, Col) of Titanic Data =', titanic.shape)
titanic.head(5) # Print the first 5 entries.

(Row, Col) of Titanic Data = (1309, 14)


| | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | ? | St Louis, MO |
| 1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | 11 | ? | Montreal, PQ / Chesterville, ON |
| 2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | ? | ? | Montreal, PQ / Chesterville, ON |
| 3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | ? | 135 | Montreal, PQ / Chesterville, ON |
| 4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | ? | ? | Montreal, PQ / Chesterville, ON |


## Explanation of the data being analyzed.

The data at hand is real passenger data from the Titanic, which sank on April 14-15, 1912.  Each row is a specific person who was aboard at the time of the ship's departure.  The columns represent the 'features' of each individual, such as their name, sex, age, fare paid, ticket number, and a few others.  The most important column is 'survived', which we will designate as the 'class' or 'label' of every row (that is, of every member of the 'population').
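As a minimal sketch of what designating 'survived' as the label looks like in pandas, the snippet below rebuilds a small frame from the five rows printed earlier (only a subset of the columns) and separates the label column from the remaining features.  The names `sample`, `labels`, and `features` are illustrative and are not used elsewhere in this notebook; on the full dataset the same two lines would presumably be applied to `titanic` instead.

```python
import pandas as pd

# Toy frame copied from the five rows shown in the head() output.
# Only a few of the 14 columns are included to keep the sketch short.
sample = pd.DataFrame({
    "pclass":   [1, 1, 1, 1, 1],
    "survived": [1, 1, 0, 0, 0],
    "sex":      ["female", "male", "female", "male", "female"],
    "age":      [29.0, 0.9167, 2.0, 30.0, 25.0],
    "fare":     [211.3375, 151.55, 151.55, 151.55, 151.55],
})

# The label (class) is the 'survived' column; everything else is a feature.
labels = sample["survived"]
features = sample.drop(columns="survived")

print("Feature columns:", list(features.columns))
print("Labels:", labels.tolist())
```

Keeping the label apart from the features makes it harder to accidentally feed the answer to a model as one of its inputs.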

### Designating a Label.

The purpose of a 'label' or 'class' is to divide the data into specific sub-populations whose members share some measurable property.  In this case, the most obvious and interesting feature to use as the label of each person is whether or not they survived the tragic sinking of the Titanic in 1912.  This clearly gives two groups: survivors (numeric label: 1) and non-survivors (0).  Given such labels, we can then examine the features and ask whether particular values of a single feature, a pair of features, or any combination of features are associated with a person's likelihood of survival.
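One hedged way to make this concrete is to count the two label groups and then compare the survival rate across the values of a single feature.  The sketch below uses a toy frame built from the five rows printed earlier, so the numbers it produces say nothing about the full dataset; `sample`, `label_counts`, and `survival_by_sex` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Toy frame copied from the five rows shown in the head() output.
sample = pd.DataFrame({
    "survived": [1, 1, 0, 0, 0],
    "sex":      ["female", "male", "female", "male", "female"],
    "age":      [29.0, 0.9167, 2.0, 30.0, 25.0],
})

# Size of each sub-population: survivors (1) and non-survivors (0).
label_counts = sample["survived"].value_counts()

# Because the label is 0/1, its mean within a group is that group's
# survival rate.
survival_by_sex = sample.groupby("sex")["survived"].mean()

print(label_counts)
print(survival_by_sex)
```

The same `groupby` pattern extends to pairs of features by passing a list of column names, for example `["pclass", "sex"]` on a frame that contains both columns.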

### Why?

There are many questions to be asked, and (almost) just as many answers to be returned.  A question one may have is: 'Why does this matter?  The Titanic has already sunk!'  This is true; however, some findings may interest those who are concerned about their own likelihood of surviving a shipwreck, should they ever face such unfortunate circumstances.  It is also interesting to see which people tended to survive, and possibly come to 'conclusions' as to why they may have been more or less likely to survive.  We use the word 'conclusion' loosely, as the #1 rule of any data analysis is: * CORRELATION DOES NOT EQUAL CAUSATION. *