# Welcome
This is a Python notebook that I used to explore the Titanic dataset. I came into this exploration with a few assumptions about the dataset that I wanted to test before trying to model the results for predicting survivability. So instead of exhaustively exploring all the variables from a statistical standpoint, I wanted to check my prior heuristics and develop a general understanding of the data using some visualizations.

The few topics that this notebook will cover are:
- Were women and children ushered off first?
- Fare and Class both demonstrate wealth; which kind of wealth really mattered?
- Did people in the same PassengerClass pay the same fare?
- What does the cabin variable mean?

In [27]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

import matplotlib.pyplot as plt
%matplotlib inline
plt.tight_layout()
# Any results you write to the current directory are saved as output.

To load the data, first add it through the data panel on the side of the Kaggle interface, and make sure that you can see it listed there.    
To find the path, click on the dataset and its location should be listed. For this project, there are three files located in the ../input folder.     
In the same interface you can upload a dataset or add other data sources that are available on Kaggle.

In [2]:
df = pd.read_csv('../input/train.csv')

In [3]:
df.shape

In [4]:
df.count()

In [5]:
df.describe()

From these three general calls (shape, count, and describe) I learn about the variables that I am working with. The training set has 891 people and 12 variables. Not all of the columns are complete: Age, Cabin, and Embarked have missing values, and more gaps could exist in the data in other forms. From the describe method, I learn that 7 of the variables are numerical. PassengerId should really be a string, Survived is binary, Pclass is a category, Age is continuous but mostly recorded as whole years (with exceptions such as the minimum of 0.42), SibSp and Parch are discrete counts, and Fare is what I assume to be the price paid by the individual.
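The gaps that `count()` hints at can be listed directly. Below is a minimal sketch, assuming a small made-up frame (`toy`) in place of the real `df`, and a hypothetical helper name `missing_summary` that is not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data; these rows are invented for illustration.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
})

def missing_summary(frame):
    """Return the missing-value count and share for every column that has gaps."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "share": counts / len(frame),
    })
    # Keep only incomplete columns, worst first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

print(missing_summary(toy))
```

Calling `missing_summary(df)` on the real training frame should give the same kind of table for Age, Cabin, and Embarked.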

- SibSp is described as the number of siblings/spouses aboard.
- Parch is described as the number of parents/children aboard.

The remaining variables, which are non-numerical, are Name, Sex, Ticket, Cabin, and Embarked. Name is unique, and Ticket is described as the ticket number.
The Ticket variable is poorly described, but I will try to infer a definition through exploration, and likewise for Cabin. Embarked is one of the three locations where the passenger boarded the Titanic.

Now that I have the variables laid out, I will explain what I want to explore first.

# What I would explore:
I have sparse knowledge about the Titanic. I know that it sank after hitting an iceberg, and some sources say that people did not evacuate quickly enough because the damage was not considered severe until it became visible. I also know that women and children were supposed to be ushered off first, and that the Titanic is a popular movie reference as well as a milestone in a data scientist's journey, just like the handwriting classifier is for computer vision and deep learning.

A few assumptions I would like to test, in addition to finding out which factors affected survivability:
- Were women and children ushered off first?
- Fare and Class both demonstrate wealth; which kind of wealth really mattered?
- Did people in the same PassengerClass pay the same fare?
- What does the cabin variable mean?




## Cabin Variable
- What does the cabin variable mean?
### Code

In [6]:
df.Cabin.value_counts().head()

### Results
The Cabin variable shows the location of the individual. It needs to be related to a map of the ship to make sense of it. I suspect that a machine learning model might be able to find out whether there is a multidimensional pattern.

A quick Google search for "when did the titanic hit the iceberg" yielded this result.
```
April 14
However, just before midnight on April 14, the ship hit an iceberg, and five of the Titanic's compartments were ruptured along its starboard side. At about 2:20 a.m. on the morning of April 15, the massive vessel sank into the North Atlantic.
```
That statement was supported across other sources, some of them more specific. The general idea that I take from it is that the collision happened late at night, when most people would be in their rooms or in a gathering area. This supports the idea that cabin location would matter for those who were in their rooms.
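One rough way to turn Cabin into a location feature is to keep only its leading letter. This is a minimal sketch on made-up rows, under the assumption (not verified in this notebook) that the leading letter identifies the deck; the `Deck` column name is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy rows invented for illustration; the real column is df.Cabin.
toy = pd.DataFrame({
    "Cabin": ["C85", np.nan, "C123", "E46", np.nan, "B57 B59 B63"],
    "Survived": [1, 0, 1, 0, 0, 1],
})

# Assumption: the leading letter of the cabin string identifies the deck.
# Missing cabins are kept as their own group rather than dropped.
toy["Deck"] = toy["Cabin"].str[0].fillna("Unknown")

# Survival rate and group size per deck letter.
deck_survival = toy.groupby("Deck")["Survived"].agg(["mean", "count"])
print(deck_survival)
```

Keeping the missing cabins as an "Unknown" group matters here because most passengers have no recorded cabin, so dropping them would discard most of the data.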

## Women and Children Survivability
There is no explicit age range for a person to be considered a child, but for an initial search I will use the age of 18. 
The following part will first explore women's survivability, then children's survivability. 
### Code


In [7]:
# This code shows what expressions were used to categorize the Sexes
# In addition, it shows in the training sample, there are more men than women. 
df.Sex.value_counts()

In [28]:
fig = plt.figure(figsize=(20,10))

plt.subplot(231)
plt.title('Gender of Passengers')
df.Sex.value_counts(normalize=True).plot(kind = 'bar')

plt.subplot(232)
plt.title('Number of Survivors')
df.Survived.value_counts(normalize=True).plot(kind = 'bar')

plt.subplot(234)
plt.title('Female Survivors')
# value_counts() sorts by frequency, so the hand-written labels follow that order:
# most women survived, most men did not, and most of the deceased were male.
plt.bar(['Survived', 'Deceased'],df.Survived[df.Sex == 'female'].value_counts(normalize=True), color = ['c', 'm'])

plt.subplot(235)
plt.title('Male Survivors')
plt.bar(['Deceased', 'Survived'],df.Survived[df.Sex == 'male'].value_counts(normalize=True), color = ['r', 'b'])

plt.subplot(236)
plt.title('Deceased of both Genders')
plt.bar(['Male', 'Female'], df.Sex[df.Survived == 0].value_counts(normalize=True), color = ['r', 'm'])


In [9]:
fig = plt.figure(figsize=(20,10))

plt.subplot2grid((2,3), (0,0))
plt.title('Age WRT Survival')
plt.scatter(df.Survived, df.Age, alpha=0.1)

plt.subplot2grid((2,3), (0,1))
plt.title('Age WRT Survival (Age <= 18)')
plt.scatter(df.Survived[df.Age<=18], df.Age[df.Age <=18])

In [10]:
print(
    df.Age[df.Survived==1].describe(),
    '\n',
    df.Age[df.Survived==0].describe()
)

### Results
From this training set, we see that there are 577 males and 314 females. From the first plot we see that a little more than 60% of the training set passengers are male. Similarly, a little more than 60% of passengers did not survive. If we look at each gender separately, the survival rate for females is a little more than 70%, while for males it is a little less than 20%. This is reinforced in the third chart in the second row, where you can see that of those deceased, more than 85% were male.

What can be taken away from this is that if we judge solely on gender, we could build a fairly accurate classifier for survival. In addition, it is clear that "women first" was upheld during the sinking of the Titanic.
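The gender-only classifier mentioned above can be scored in a couple of lines. This is a minimal sketch on made-up rows; the function name `gender_baseline_accuracy` is hypothetical, and the same call on the real `df` would give the training accuracy of the rule.

```python
import pandas as pd

# Toy rows invented for illustration; the real columns are df.Sex and df.Survived.
toy = pd.DataFrame({
    "Sex": ["female", "male", "male", "female", "male", "female"],
    "Survived": [1, 0, 0, 1, 1, 0],
})

def gender_baseline_accuracy(frame):
    """Accuracy of predicting 'survived' for every woman and 'died' for every man."""
    predicted = (frame["Sex"] == "female").astype(int)
    return (predicted == frame["Survived"]).mean()

accuracy = gender_baseline_accuracy(toy)
print(accuracy)
```

A rule like this is useful mostly as a baseline: any later model should beat it to justify its extra complexity.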

For the age aspect, the result was inconclusive. I first tried visualizing the data, but the plots showed no clear pattern. I then looked at the summary statistics: of the passengers with a recorded age in the training set, 290 survived and 424 did not, and the mean ages were very similar, about 28 and 30 respectively. The ranges were also very similar, with the median being 28 for both groups. Separating the results by gender, as in the next cell, does not support the idea that children were saved first either. Whether age played a role in survivability remains inconclusive from the exploration conducted so far.
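Comparing survival rates directly, rather than age distributions, is another way to look at the children question. This is a minimal sketch on made-up rows, assuming the same cutoff of 18 used in the scatter plot; the `AgeGroup` column and `CHILD_MAX_AGE` constant are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy rows invented for illustration.
toy = pd.DataFrame({
    "Sex": ["female", "male", "male", "female", "male", "female", "male"],
    "Age": [4.0, 10.0, 35.0, 40.0, np.nan, 17.0, 50.0],
    "Survived": [1, 1, 0, 1, 0, 0, 0],
})

CHILD_MAX_AGE = 18  # same cutoff as the scatter plot above

# Passengers without a recorded age cannot be assigned to either group.
known_age = toy.dropna(subset=["Age"]).copy()
known_age["AgeGroup"] = np.where(known_age["Age"] <= CHILD_MAX_AGE, "child", "adult")

# Survival rate and group size for each (sex, age group) pair.
rates = known_age.groupby(["Sex", "AgeGroup"])["Survived"].agg(["mean", "count"])
print(rates)
```

The `count` column is worth reading alongside the rate, since some of these groups can be small.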

In [11]:
print('Age Distribution for Female',
      '\n',
    df.Age[(df.Survived==1) & (df.Sex=='female')].describe(),
    '\n',
    df.Age[(df.Survived==0) & (df.Sex=='female')].describe()
)
print('Age Distribution for Male',
      '\n',
    df.Age[(df.Survived==1) & (df.Sex=='male')].describe(),
    '\n',
    df.Age[(df.Survived==0) & (df.Sex=='male')].describe()
)

## Which wealth mattered?
- Which of Fare, Class, or SibSp + Parch mattered most for surviving? All demonstrate wealth; which wealth really mattered?

My reasoning is that the variables are all somewhat related to wealth. A higher fare might correlate with more wealth, a higher class would also denote more wealth, and having more family members aboard would reflect the general wealth of the family.
### Code

In [12]:
df.Pclass.value_counts()

In [13]:
# This result for how much each person paid is too long, instead I should create a graph.  
df.Fare.value_counts().head()

In [30]:
# I want to do some more complex visualizations. Seaborn will help with this
import seaborn as sns

fig = plt.figure(figsize=(20, 10) )

plt.subplot2grid((3,3), (0, 0), colspan=3)
df.Fare.hist(bins=70)
plt.xlabel('Price')
plt.ylabel('Purchased')
plt.title('Price of Fares Purchased')


print('\n \n')

plt.subplot2grid((3,3), (1, 0), colspan=2, rowspan=2)
# plt.scatter(df.Pclass, df.Fare)
# x and y are passed by keyword; recent seaborn versions no longer accept them positionally
sns.stripplot(x=df.Pclass, y=df.Fare, jitter=True, edgecolor='none', alpha=0.5)
# http://dataviztalk.blogspot.com/2016/02/how-to-add-jitter-to-plot-using-pythons.html
plt.title('Price of Fares split by Class')
sns.boxplot(x=df.Pclass, y=df.Fare)

### Mid Result
The two graphs above show the price of the fares purchased by the passengers. There is a trend showing that the higher your class, the more you paid for the room. That seems self-explanatory. To relate these two variables to survivability, we will look at a few charts below.

In [36]:
fig = plt.figure(figsize=(20,10))
for x in [1, 2, 3]:
    plt.subplot(230+ x)
    names = ['Upper Class', 'Middle Class', 'Lower Class']
    plt.title('Survivability of the ' + names[x-1])
    df.Survived[df.Pclass==x].value_counts(normalize=True).plot(kind='bar')

###  2nd Mid Result
First, be aware that 1 denotes survived. The survival rates from upper class to lower class are roughly 65%, 45%, and 25%, respectively.

This shows that being rich may be a factor in survivability. It could also be correlated with location on the boat: the lower class may have been located further from the lifeboats or from an exit to safety.
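The per-class rates read off the bar charts can also be put in a single table. This is a minimal sketch on made-up rows, using `pd.crosstab` with row normalization; the column labels are added by hand and are an assumption about how you want the 0/1 outcomes named.

```python
import pandas as pd

# Toy rows invented for illustration; the real columns are df.Pclass and df.Survived.
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Rows are classes, columns are outcomes; normalize="index" makes each row sum to 1.
table = pd.crosstab(toy["Pclass"], toy["Survived"], normalize="index")
# crosstab orders the outcome columns as 0 then 1.
table.columns = ["Deceased", "Survived"]
print(table)
```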

In [71]:
plt.subplot()
plt.hist(df.Fare[df.Survived==1], color = 'orange', bins=15, alpha=0.5, label='Survived')
plt.hist(df.Fare[df.Survived==0], color = 'blue', bins=15, alpha=0.5, label='Deceased')
plt.title('Fares and Survivability')
plt.legend()
plt.ylabel('Count')
plt.xlabel('Fare Price')

In [67]:
print(df.Fare.corr(df.Survived))
print(df.Pclass.corr(df.Survived))

### Result
The two seem closely related, but from the code, the correlation between class and survival is stronger in magnitude than that of fare paid. The class correlation is negative only because a lower Pclass number denotes a higher class. Most people who paid a low fare did not make it, and the separation between survivors and the deceased is most prominent for fares in the 100 unit range and higher.
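`Series.corr` computes the Pearson correlation by default, which can be pulled around by a few very large fares. A rank-based (Spearman) correlation is one possible cross-check. This is a minimal sketch on made-up rows; the helper name `pearson_and_rank_corr` is hypothetical, and Spearman is computed here as Pearson on the ranks.

```python
import pandas as pd

# Toy rows invented for illustration.
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    "Fare": [80.0, 250.0, 26.0, 13.0, 7.9, 8.05, 7.25, 7.75],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

def pearson_and_rank_corr(frame, column, target="Survived"):
    """Return (Pearson r, Spearman rho), with rho computed as Pearson on ranks."""
    pearson = frame[column].corr(frame[target])
    spearman = frame[column].rank().corr(frame[target].rank())
    return pearson, spearman

for name in ["Fare", "Pclass"]:
    r, rho = pearson_and_rank_corr(toy, name)
    print(f"{name}: pearson={r:.3f}, spearman={rho:.3f}")
```

When comparing the two variables, it is the absolute value that indicates strength; the sign only reflects how the variable is coded.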


# Takeaway
- Were women and children ushered off first?
It seems like yes, women were ushered off first, but I cannot be certain whether children were ushered off first. There is no clear evidence that age played a role in survivability. 
- Fare and Class both demonstrate wealth; which kind of wealth really mattered?
From the exploration, both are related to survivability, with the richer being more likely to survive. The Class correlation seemed stronger than the Fare correlation. This could be because when people were ushered off the boat, who you were mattered more than how much you had paid. 
- Did people in the same PassengerClass pay the same fare?
No, people in the same class did not pay the same fare, but across all classes, many paid the minimum amount. This could be further explored by looking at different classes within the same fare bracket to see how that affected survivability. 
- What does the cabin variable mean?
From looking at a sample of the values and doing a search online, the cabin variable represents the location where the passenger stayed. A flaw in using this variable is that many of its values are missing. We may have to look at how the passengers were related to one another to be able to fill those gaps appropriately. 
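The fare-bracket comparison suggested above could start from something like the following. This is a minimal sketch on made-up rows; the bracket edges and labels are hypothetical choices for the toy data, not values derived from the real fares.

```python
import pandas as pd

# Toy rows invented for illustration.
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 2],
    "Fare": [30.0, 120.0, 26.0, 13.0, 7.9, 8.05, 25.0, 10.5],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Hypothetical bracket edges chosen for the toy data.
edges = [0, 10, 50, 600]
labels = ["0-10", "10-50", "50+"]
toy["FareBracket"] = pd.cut(toy["Fare"], bins=edges, labels=labels, include_lowest=True)

# Survival rate for each (bracket, class) cell; combinations with no passengers show as NaN.
rates = (
    toy.groupby(["FareBracket", "Pclass"], observed=True)["Survived"]
    .mean()
    .unstack()
)
print(rates)
```

Reading across a row compares classes that paid similar fares, which is the comparison that separates the effect of class from the effect of price.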

# Next Titanic Notebook
The next notebook exploring this dataset will cover the following:
- The survivability of different classes that paid fares in the same bracket. 
- Variable exploration, if I were to use my current knowledge and statistical knowledge, what would I use to build a model?
- A basic model that uses Heuristics to predict survivability as baselines.
- A model that beats the heuristic models. 