# Analysis on Titanic dataset
## Quick Introduction to the dataset
This dataset contains demographic and passenger information for 891 of the 2,224 passengers and crew on board the Titanic. According to [the Kaggle website](https://www.kaggle.com/c/titanic/data), where the data was originally obtained, the variables are defined as follows:
```
VARIABLE DESCRIPTIONS:
survival        Survival
                (0 = No; 1 = Yes)
pclass          Passenger Class
                (1 = 1st; 2 = 2nd; 3 = 3rd)
name            Name
sex             Sex
age             Age
sibsp           Number of Siblings/Spouses Aboard
parch           Number of Parents/Children Aboard
ticket          Ticket Number
fare            Passenger Fare
cabin           Cabin
embarked        Port of Embarkation
                (C = Cherbourg; Q = Queenstown; S = Southampton)
```              








## Data Wrangling


### Data Acquisition
Since the dataset is provided as a CSV file, let's load it using pandas.
Note that I placed the dataset CSV file in a subdirectory of this notebook, `./dataset/titanic`, and the filename is `titanic-data.csv`.

In [62]:
import numpy as np
import pandas as pd

# read csv file
rawData = pd.read_csv('./dataset/titanic/titanic-data.csv') # a relative directory
rawData.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |
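
Before cleaning, it can help to confirm the shape of the loaded frame and to see which columns have missing values. The sketch below is an illustration only: it builds a tiny stand-in frame from the five rows shown above (the names `csv_text` and `sample` are assumptions, not part of the notebook) so that it runs without the dataset file. On the real data, the same two calls would be applied to `rawData`.

```python
import io
import pandas as pd

# Hypothetical stand-in for rawData: the five rows printed by rawData.head()
csv_text = u'''PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
'''
sample = pd.read_csv(io.StringIO(csv_text))

# (rows, columns) of the loaded frame
print(sample.shape)

# number of missing values per column; only Cabin has gaps in this sample
missing_per_column = sample.isnull().sum()
print(missing_per_column)
```

Empty CSV fields are read as `NaN`, which is why the blank `Cabin` entries are counted as missing here.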


### Data Cleaning

First of all, according to [the Kaggle site](https://www.kaggle.com/c/titanic/data), an age is written in the form xx.5 if it is estimated. Knowing whether an age was estimated is valuable for further analysis, but encoding that flag inside the age value itself could skew any statistics computed on the ages. So let's count the estimated ages and record the flag in a separate column.


In [63]:
# Taking each age modulo 1 leaves only its fractional part, so xx.5 entries give exactly 0.5.
# The comparison below is therefore True when an entry has the xx.5 form and False otherwise.
def checkIfAgeWasEstimated(ages):
    mod = ages % 1
    return pd.DataFrame({'IsAgeEstimated': (mod == 0.5), 'Age': ages})

# verify the function with a small test set
testset = pd.Series(np.array([11.5, 11.0, 35, 35.0, 35.5]))
checkIfAgeWasEstimated(testset)

| | Age | IsAgeEstimated |
|---|---|---|
| 0 | 11.5 | True |
| 1 | 11.0 | False |
| 2 | 35.0 | False |
| 3 | 35.0 | False |
| 4 | 35.5 | True |
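
The test set above contains no missing values and no ages below one. The following sketch, using hypothetical values rather than rows from the dataset, shows how the same modulo check behaves in those cases, assuming missing ages are stored as `NaN`.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including a missing value and a fractional age below 1
ages = pd.Series([22.0, 28.5, np.nan, 0.42, 14.5])

# NaN % 1 is NaN, and NaN == 0.5 is False, so missing ages are never flagged
is_estimated = (ages % 1) == 0.5

# True counts as 1 and False as 0, so the sum is the number of flagged ages
estimated_count = int(is_estimated.sum())
print(estimated_count)
```

Only values whose fractional part is exactly 0.5 are flagged, so a missing age or a fractional age such as 0.42 is left unflagged.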


In [64]:
IsAgeEstimated = checkIfAgeWasEstimated(rawData['Age'])
IsAgeEstimated.sum()  # sum() treats True as 1 and False as 0.

Age               21205.17
IsAgeEstimated       18.00
dtype: float64

So 18 ages in total are estimated. Let's look at them.

In [65]:
# keep only the rows whose age was flagged as estimated
ageEstimatedOnly = IsAgeEstimated.loc[IsAgeEstimated['IsAgeEstimated']]
print(ageEstimatedOnly)

print('')
print(ageEstimatedOnly.describe())

      Age IsAgeEstimated
57   28.5           True
111  14.5           True
116  70.5           True
122  32.5           True
123  32.5           True
148  36.5           True
152  55.5           True
153  40.5           True
203  45.5           True
227  20.5           True
296  23.5           True
331  45.5           True
525  40.5           True
676  24.5           True
735  28.5           True
767  30.5           True
814  30.5           True
843  34.5           True

             Age
count  18.000000
mean   35.277778
std    13.224556
min    14.500000
25%    28.500000
50%    32.500000
75%    40.500000
max    70.500000


Let's add this flag to the data as a boolean column named `IsAgeEstimated`.

In [66]:
rawData['IsAgeEstimated'] = IsAgeEstimated['IsAgeEstimated']
rawData.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | IsAgeEstimated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S | False |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | False |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S | False |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S | False |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S | False |


Now we need to round down the `Age` column so that it no longer contains values of the form xx.5.

In [None]:
rawData['Age'] = np.floor(rawData['Age'])  # round down to whole years; missing ages stay NaN
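
As a minimal sketch of this rounding step, assuming "round down" means taking the floor of each value, the example below applies `np.floor` to a small hypothetical series (the values are illustrative, not taken from the dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical ages: two estimated (xx.5), one exact, one missing
ages = pd.Series([28.5, 14.5, 35.0, np.nan])

# np.floor rounds each value down to the nearest whole number and keeps NaN as NaN
rounded = np.floor(ages)
print(rounded.tolist())
```

Because flooring discards the fractional part, the `IsAgeEstimated` flag has to be stored before this step; afterwards the xx.5 marker can no longer be recovered from `Age`. Flooring would also turn any age below one into 0.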

## Data Exploration



## Conclusions

