## Introduction

**Project description**: lorem ipsum: 

High-level steps (table of contents, internal links) go here: 

The original dataset and its description are available [on Kaggle](https://www.kaggle.com/c/titanic/data).

*Project owner*: ***Gustaf Olofsson***

## Data dictionary
**PassengerId**: Unique passenger identifier (integer)  
**Survived**: Whether the passenger survived (0 = no, 1 = yes)  
**Pclass**: Class of travel (1, 2 or 3)  
**Name**: Name of the passenger  
**Sex**: Gender (male or female)  
**Age**: Age of the passenger  
**SibSp**: Number of siblings/spouses aboard  
**Parch**: Number of parents/children aboard  
**Ticket**: Ticket number  
**Fare**: Passenger fare  
**Cabin**: Cabin number  
**Embarked**: The port at which the passenger embarked. C = Cherbourg, S = Southampton, Q = Queenstown  

## Load Data from CSVs

In [1]:
import pandas as pd
import numpy as np
import os

# The CSV is expected in the current working directory;
# os.path.join keeps the path portable across operating systems
dir_path = os.getcwd()
raw_data = pd.read_csv(os.path.join(dir_path, "titanic_data.csv"))

#test that data loaded correctly
raw_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
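
Beyond eyeballing the first rows, a quick per-column summary can confirm that the types match the data dictionary (for example, that *PassengerId* holds integers) and show where values are missing. The following is a minimal sketch; the helper name `summarise_columns` and the toy rows are my own illustrative assumptions and are not taken from `titanic_data.csv`.

```python
import pandas as pd

def summarise_columns(df):
    """Return one row per column with its dtype and number of missing values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isnull().sum(),
    })

# Tiny made-up frame, for illustration only
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Age": [22.0, None, 26.0],
    "Embarked": ["S", "C", None],
})

summary = summarise_columns(sample)
print(summary)
```

On the real data this would be called as `summarise_columns(raw_data)`.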


## Checking for Unexpected Values

In [2]:
def print_unique(df, attribute):
    # Print the distinct values found in the given column
    print("Unique values of '{}': {}".format(attribute, df[attribute].unique()))

### "Survived" should have value of 0 or 1
print_unique(raw_data, "Survived")

### "Sex" should have value of male or female
print_unique(raw_data, "Sex")

### "Pclass" should have value of 1, 2 or 3
print_unique(raw_data, "Pclass")

### "Embarked" should have value of 'C', 'S' or 'Q'
print_unique(raw_data, "Embarked")

### "SibSp" should have integer values only
print_unique(raw_data, "SibSp")

### "Parch" should have integer values only
print_unique(raw_data, "Parch")

print("Total number of records: {}".format(len(raw_data.index) ) )

Unique values of 'Survived': [0 1]
Unique values of 'Sex': ['male' 'female']
Unique values of 'Pclass': [3 1 2]
Unique values of 'Embarked': ['S' 'C' 'Q' nan]
Unique values of 'SibSp': [1 0 3 4 2 5 8]
Unique values of 'Parch': [0 1 2 5 3 4 6]
Total number of records: 891


Expected values found for the *Survived*, *Sex*, *Pclass* categories.
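
The manual comparison above could also be automated. This is a minimal sketch under my own assumptions: the names `EXPECTED_VALUES` and `find_unexpected` are hypothetical, the allowed values come from the comments in the cell above, and the toy frame is made up. Missing values are dropped before the comparison, so they have to be checked separately.

```python
import pandas as pd

# Allowed values per column (assumed from the checks described above)
EXPECTED_VALUES = {
    "Survived": {0, 1},
    "Sex": {"male", "female"},
    "Pclass": {1, 2, 3},
    "Embarked": {"C", "S", "Q"},
}

def find_unexpected(df, expected=EXPECTED_VALUES):
    """Map each checked column to the set of values outside its allowed set."""
    problems = {}
    for column, allowed in expected.items():
        # tolist() converts NumPy scalars to plain Python values
        observed = set(df[column].dropna().unique().tolist())
        extra = observed - allowed
        if extra:
            problems[column] = extra
    return problems

# Made-up rows for illustration; Pclass 4 is deliberately invalid
toy = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Sex": ["male", "female", "female"],
    "Pclass": [3, 1, 4],
    "Embarked": ["S", None, "Q"],
})

problems = find_unexpected(toy)
print(problems)
```

An empty dictionary means every non-missing value was in its allowed set.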

  
It may seem odd that one passenger has 8 siblings aboard while no parent has more than 6 children aboard. However, this dataset contains only a subset of the passengers (891 out of 1317). The Titanic dataset was split into a training and a test dataset for one of Kaggle's ML challenges, so the parent(s) of the children with 8 siblings are probably in the test dataset. 
 
  
The *Embarked* category contains empty values, so these records are investigated more closely. 

In [3]:
raw_data.loc[raw_data['Embarked'].isnull()]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
61,62,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,B28,
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,


We can see 2 entries without a port of embarkation. However, both share the same 'Ticket' and 'Cabin' values. A quick Google search led me to the [Encyclopedia Titanica](https://www.encyclopedia-titanica.org), which contains information on the Titanic passengers. The following excerpt explains the situation:  

>Miss Rose Amélie Icard, 38, was born in Vaucluse,  [...]
She boarded the Titanic at **Southampton** as **maid to Mrs George Nelson Stone. She travelled on Mrs Stone's ticket (#113572)**.

Since both passengers travelled on the same ticket, we can set the port of embarkation to Southampton for both records. 

In [4]:
cleaned_df = raw_data.copy()
cleaned_df.loc[cleaned_df['Embarked'].isnull(), 'Embarked'] = 'S'

# confirm that update worked as intended
print_unique(cleaned_df, "Embarked")
cleaned_df.loc[cleaned_df['Cabin'] == 'B28']

Unique values of 'Embarked': ['S' 'C' 'Q']


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
61,62,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,B28,S
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,S
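
The shared ticket was the clue in this case. A more general way to look for such clues is to list every passenger who shares a ticket with a record that is missing a value. The sketch below is only an illustration: `ticket_mates` is a hypothetical helper and the toy rows are made up. It would not have resolved the two records above on its own, since both of them lacked a port.

```python
import pandas as pd

def ticket_mates(df, column="Embarked"):
    """Return all rows sharing a ticket with a row where `column` is missing."""
    tickets = df.loc[df[column].isnull(), "Ticket"].unique()
    return df[df["Ticket"].isin(tickets)]

# Made-up rows for illustration only
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Ticket": ["A1", "A1", "B2", "C3"],
    "Embarked": [None, "S", "C", None],
})

mates = ticket_mates(toy)
print(mates)
```

Here passenger 1 shares ticket A1 with passenger 2, whose port is known, while passenger 4 has no ticket mate and would need an external source.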


## Initial investigation

Initially I want to explore how the following factors differ between the survivors and non-survivors: 
* Ticket class  
* Gender
* Age
* Port of embarkation

To do this, I will compare the split between values for the whole dataset, survivors, and non-survivors.  
First, I create a new column, 'age_group', to make the age comparison meaningful. Grouping also normalises the entries, since some ages are recorded in whole years whereas others include fractions of a year. Blank ages are assigned the value 'unknown'.  
*Note: we could potentially also use the aforementioned [Encyclopedia Titanica](https://www.encyclopedia-titanica.org) to look up passenger ages*

In [5]:
# Create age_group category and apply this function to the dataframe
# Missing ages (NaN) fail every comparison below and fall through to "unknown"

def add_age_group(passenger):
    if 0 < passenger.Age < 10:
        return "0-10"
    elif 10 <= passenger.Age < 20:
        return "10-19"
    elif 20 <= passenger.Age < 30:
        return "20-29"
    elif 30 <= passenger.Age < 40:
        return "30-39"
    elif 40 <= passenger.Age < 50:
        return "40-49"
    elif 50 <= passenger.Age < 60:
        return "50-59"
    elif 60 <= passenger.Age:
        return "60+"
    else:
        return "unknown"

cleaned_df['age_group'] = cleaned_df.apply(add_age_group, axis=1)
# test that category applied correctly
cleaned_df.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,age_group
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,20-29
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,30-39
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,20-29
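
As an alternative to the row-wise function, the same grouping could be done in one vectorised step with `pd.cut`. This is a minimal sketch, not the method used in this notebook: `AGE_BINS`, `AGE_LABELS` and `age_groups` are names I made up, the labels are copied from the function above, and the ages are toy values. One small difference is that an age of exactly 0 would land in the first group here instead of 'unknown'.

```python
import numpy as np
import pandas as pd

AGE_BINS = [0, 10, 20, 30, 40, 50, 60, np.inf]
AGE_LABELS = ["0-10", "10-19", "20-29", "30-39", "40-49", "50-59", "60+"]

def age_groups(ages):
    """Bin a Series of ages into left-closed intervals such as [10, 20)."""
    groups = pd.cut(ages, bins=AGE_BINS, labels=AGE_LABELS, right=False)
    # Missing ages come back as NaN; give them an explicit label instead
    return groups.cat.add_categories("unknown").fillna("unknown")

# Made-up ages for illustration only
toy_ages = pd.Series([0.5, 10.0, 22.0, 38.0, 62.0, np.nan])
result = age_groups(toy_ages).tolist()
print(result)
```

On the real data this would be `cleaned_df['age_group'] = age_groups(cleaned_df['Age'])`.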


In [6]:
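# Calculate the percentage of survivors and non-survivors within each value of the given attribute
def survival_pct(attribute):
    # Count passengers per (attribute value, Survived) pair
    grouped_df = cleaned_df.groupby([attribute, 'Survived']).agg({'PassengerId': 'count'})
    # Divide each count by the total for its attribute value (first index level)
    group_totals = grouped_df.groupby(level=0).transform('sum')
    return 100 * grouped_df / group_totals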

# Function is run for the above mentioned categories

In [7]:
survival_pct('Pclass')

Pclass,Survived,PassengerId
1,0,37.037037
1,1,62.962963
2,0,52.717391
2,1,47.282609
3,0,75.763747
3,1,24.236253


In [8]:
survival_pct('Sex')

Sex,Survived,PassengerId
female,0,25.796178
female,1,74.203822
male,0,81.109185
male,1,18.890815


In [9]:
survival_pct('Embarked')

Embarked,Survived,PassengerId
C,0,44.642857
C,1,55.357143
Q,0,61.038961
Q,1,38.961039
S,0,66.099071
S,1,33.900929


In [10]:
survival_pct('age_group')

age_group,Survived,PassengerId
0-10,0,38.709677
0-10,1,61.290323
10-19,0,59.803922
10-19,1,40.196078
20-29,0,65.0
20-29,1,35.0
30-39,0,56.287425
30-39,1,43.712575
40-49,0,61.797753
40-49,1,38.202247
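
The percentage tables above could also be produced with `pd.crosstab`, which puts the survival outcomes side by side as columns. This is a hedged sketch rather than the approach used in this notebook: `survival_pct_crosstab` is a hypothetical helper and the rows are made up.

```python
import pandas as pd

def survival_pct_crosstab(df, attribute):
    """Row-wise percentages of Survived (0/1) for each value of `attribute`."""
    return pd.crosstab(df[attribute], df["Survived"], normalize="index") * 100

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "female", "male", "male"],
    "Survived": [1, 1, 1, 0, 0, 1],
})

table = survival_pct_crosstab(toy, "Sex")
print(table)
```

With `normalize="index"` each row sums to 100, so the survival rates of the groups can be compared directly.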


***Observations to be added here, along with follow-up questions to test correlation***

## Refining the question
