In [None]:
import random

import numpy as np
import pandas as pd

import seaborn
import matplotlib.pyplot as plt

%matplotlib inline

# Titanic Dataset

## dataset info

It contains demographic and other information for 891 of the passengers and crew on board the Titanic.

## question to be answered

    What factors made people more likely to survive?

## load data

In [None]:
titanic_data = pd.read_csv('titanic-data.csv')

## look at few rows of the data

In [None]:
titanic_data.head()

## variable descriptions in the dataset

[Kaggle Titanic Data](https://www.kaggle.com/c/titanic/data)

`
PassengerId     Passenger Unique ID
Survived        Survival
                (0 = No; 1 = Yes)
Pclass          Passenger Class
                (1 = Upper Class; 2 = Middle Class; 3 = Lower Class)
Name            Name
Sex             Sex
Age             Age
SibSp           Number of Siblings/Spouses Aboard
Parch           Number of Parents/Children Aboard
Ticket          Ticket Number
Fare            Passenger Fare
Cabin           Cabin
Embarked        Port of Embarkation
                (C = Cherbourg; Q = Queenstown; S = Southampton)
`

**SPECIAL NOTES:**
***
    
    Pclass is a proxy for socio-economic status (SES)
    - 1st ~ Upper
    - 2nd ~ Middle
    - 3rd ~ Lower

    Age is in Years; Fractional if Age less than One (1)
    If the Age is Estimated, it is in the form xx.5

    With respect to the family relation variables (i.e. sibsp and parch)
    some relations were ignored.  The following are the definitions used
    for sibsp and parch.

    Sibling:  Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
    Spouse:   Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
    Parent:   Mother or Father of Passenger Aboard Titanic
    Child:    Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

    Other family relatives excluded from this study include cousins,
    nephews/nieces, aunts/uncles, and in-laws.  Some children travelled
    only with a nanny, therefore parch=0 for them.  As well, some
    travelled with very close friends or neighbors in a village, however,
    the definitions do not support such relations.

## exploratory look at the dataset

- 891 rows, one per person on board
- 12 features, including age, sex and survival status

In [None]:
titanic_data.shape

In [None]:
titanic_data.columns

# Data Wrangling and Cleaning

### checking missing values

missing values:
    
    Age      : 177 values
    Cabin    : 687 values
    Embarked : 2 values

In [None]:
titanic_data.isnull().sum()
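
A complementary view is the share of missing values per column rather than the raw count. The sketch below runs on a small made-up frame, not the real data; `toy` and its values are assumptions for illustration only. `isnull().mean()` gives the fraction of nulls because `True` counts as 1.

```python
import numpy as np
import pandas as pd

# Made-up frame standing in for titanic_data (values are illustrative only)
toy = pd.DataFrame({
    'Age':      [22.0, np.nan, 35.0, np.nan],
    'Embarked': ['S', 'C', None, 'S'],
    'Fare':     [7.25, 71.28, 8.05, 53.10],
})

# Fraction of nulls per column, expressed as a percentage
missing_percent = toy.isnull().mean() * 100
print(missing_percent)
```

On the real data the same expression would be `titanic_data.isnull().mean() * 100`.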

### fixing missing values

values to be fixed:
- Age
- Embarked

**Note:**
Cabin values are missing too, but they are not relevant for our analysis, so we can simply ignore them

#### fixing age

    We can fix the missing ages by replacing them with the overall mean age, but
    to make the values more appropriate for the analysis,
    let's take the mean of ages grouped by sex and passenger class

##### mean age grouped by Sex and Passenger Class

    taking the mean grouped by Sex and Passenger Class gives more appropriate results than
    the overall mean
    
    the overall mean is around 29.7, whereas grouping by
    Sex and Passenger Class shows that the mean age differs between groups
    
    we will use this grouped mean to fill in the missing ages

In [None]:
overall_mean_age = titanic_data['Age'].mean()
overall_mean_age

In [None]:
mean_ages = titanic_data.groupby(['Sex','Pclass'])['Age'].mean()
mean_ages

In [None]:
mean_ages['male', 2]

##### mapping missing ages to mean ages calculated

In [None]:
def fix_missing_ages(row):
    '''
    checks if the age is null and replace with the mean age
    grouped by Sex and Passenger Class from the dataset
    '''
    if pd.isnull(row['Age']):
        return mean_ages[row['Sex'],row['Pclass']]
    else:
        return row['Age']

In [None]:
titanic_data['Age'] = titanic_data.apply(fix_missing_ages, axis=1)
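
The row-wise `apply` above can also be written without a Python-level loop. The following is a minimal sketch on a made-up frame (the name `toy` and its values are assumptions, not the real data): `groupby(...).transform('mean')` returns the group mean aligned to every row, and `fillna` then only touches the missing ages.

```python
import numpy as np
import pandas as pd

# Made-up frame standing in for titanic_data (values are illustrative only)
toy = pd.DataFrame({
    'Sex':    ['male', 'male', 'female', 'female', 'male'],
    'Pclass': [1, 1, 3, 3, 1],
    'Age':    [40.0, np.nan, 20.0, np.nan, 50.0],
})

# Mean age of each (Sex, Pclass) group, broadcast back to every row
group_mean = toy.groupby(['Sex', 'Pclass'])['Age'].transform('mean')

# Replace only the missing ages with the mean of their group
toy['Age'] = toy['Age'].fillna(group_mean)
print(toy)
```

Both approaches should give the same imputed values; the vectorized form is usually faster on larger frames.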

    ages fixed, 0 null values

In [None]:
titanic_data['Age'].isnull().sum()

#### fixing Embarked missing values

items in Embarked:
- Southampton : 644 people
- Cherbourg   : 168 people
- Queenstown  : 77 people

In [None]:
titanic_data['Embarked'].value_counts()

    about 72% of passengers with a known port embarked at Southampton

In [None]:
100 * float(titanic_data[titanic_data['Embarked'] == 'S']['Embarked'].count()) / titanic_data['Embarked'].count()
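
The same percentage can be read off for every port at once with `value_counts(normalize=True)`. This is a sketch on a made-up series (`embarked` and its values are assumptions); like `count()`, `value_counts` ignores missing values by default.

```python
import pandas as pd

# Made-up port codes, including one missing value (illustrative only)
embarked = pd.Series(['S', 'S', 'C', 'S', 'Q', None])

# Share of each port among the non-missing values, as a percentage
share = embarked.value_counts(normalize=True) * 100
print(share)
```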

the 2 missing values

In [None]:
titanic_data[titanic_data['Embarked'].isnull()]

    missing values in Embarked can be replaced by the port Southampton, 
    since most of the people boarded at this port

In [None]:
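# assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas
titanic_data['Embarked'] = titanic_data['Embarked'].fillna('S')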
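
Instead of hard-coding `'S'`, the most frequent value can be looked up with `mode()`. The sketch below uses a made-up series (`ports` and its values are assumptions); `mode()` returns a Series because ties are possible, so the first entry is taken.

```python
import pandas as pd

# Made-up port codes with two missing values (illustrative only)
ports = pd.Series(['S', 'C', 'S', None, 'Q', 'S', None])

# Most frequent non-missing value; [0] picks the first in case of ties
most_common_port = ports.mode()[0]

# Fill the gaps with that value and assign the result back
ports = ports.fillna(most_common_port)
print(ports.value_counts())
```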

    missing embarked values fixed, 0 null values

In [None]:
titanic_data['Embarked'].isnull().sum()

### preparing final dataset for analysis

#### dropping irrelevant columns from dataset

    columns : 'PassengerId', 'Name', 'Ticket', 'Cabin' are not relevant for our analysis
    we can safely discard them from the dataset

In [None]:
titanic_data = titanic_data.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)

In [None]:
titanic_data.head()

#### formatting values in a more readable form

applying the following mappings makes the data easier to read:
- Survived
    - 0 : False
    - 1 : True
- Pclass 
    - 1 : Upper Class
    - 2 : Middle Class
    - 3 : Lower Class
- Embarked
    - S : Southampton
    - C : Cherbourg
    - Q : Queenstown
    
**Note:** 
Sex, Age, SibSp, Parch & Fare are already in the best format

In [None]:
survived_formatted = {0: False, 1: True}
titanic_data['Survived'] = titanic_data['Survived'].map(survived_formatted)

pclass_formatted = {1: 'Upper Class', 2: 'Middle Class', 3: 'Lower Class'}
titanic_data['Pclass'] = titanic_data['Pclass'].map(pclass_formatted)

embarked_formatted = {'S': 'Southampton', 'C': 'Cherbourg','Q':'Queenstown'}
titanic_data['Embarked'] = titanic_data['Embarked'].map(embarked_formatted)
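
One property of `Series.map` with a dictionary is worth keeping in mind: any value that is not a key of the dictionary becomes `NaN`. The sketch below shows this on a made-up series (`codes` and the extra code `4` are assumptions for illustration).

```python
import pandas as pd

pclass_formatted = {1: 'Upper Class', 2: 'Middle Class', 3: 'Lower Class'}

# Made-up class codes; 4 is deliberately not a key of the dictionary
codes = pd.Series([1, 3, 2, 4])

mapped = codes.map(pclass_formatted)

# Values without a matching key come back as NaN
unmapped_count = mapped.isnull().sum()
print(mapped)
print(unmapped_count)
```

For the same reason, running the mapping cell a second time on already-mapped columns would turn every value into `NaN`, so a quick `isnull().sum()` after mapping is a cheap sanity check.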

In [None]:
titanic_data.head()

#### adding a new 'FamilySize' column

    Family Size can be calculated by adding the number of siblings/spouses (SibSp) and parents/children (Parch) aboard

In [None]:
titanic_data['FamilySize'] = titanic_data['SibSp'] + titanic_data['Parch']

#### adding a new 'AgeGroup' column

    to make the analysis easier to read, instead of using raw ages
    we can add a new column of age groups, which bins the ages 
    into intervals of 10 years

In [None]:
age_labels = ['0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79']
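# right=False makes each bin [low, high); an age of exactly 80 would fall outside the last bin and become NaN
titanic_data['AgeGroup'] = pd.cut(titanic_data.Age, range(0, 81, 10), right=False, labels=age_labels)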
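
To see how `pd.cut` treats the bin edges with `right=False`, here is a small sketch on made-up ages (`sample_ages` is an assumption, not taken from the dataset). Each bin includes its lower edge and excludes its upper edge, so 10.0 lands in '10-19' and a value equal to the last edge gets no label.

```python
import pandas as pd

age_labels = ['0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79']

# Made-up ages chosen to sit on or near bin edges (illustrative only)
sample_ages = pd.Series([0.42, 9.0, 10.0, 29.9, 79.0, 80.0])

# Bins are [0, 10), [10, 20), ..., [70, 80)
groups = pd.cut(sample_ages, list(range(0, 81, 10)), right=False, labels=age_labels)
print(groups)
```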

#### final dataset

In [None]:
titanic_data.head()

# Analysis on the dataset

### survival stats

In [None]:
def percent(value, total):
    return (100 * float(value) / total)

In [None]:
survivors_data = titanic_data[titanic_data.Survived == True]
non_survivors_data = titanic_data[titanic_data.Survived == False]

In [None]:
def survival_stats(features_and_values):
    data = []
    index = []
    columns = ['total', 'survived', 'not survived', '% survived', '% not survived']
    
    for feature in features_and_values:
        for value in features_and_values[feature]:
            total = len(titanic_data[titanic_data[feature] == value])
            survived = len(survivors_data[survivors_data[feature] == value])
            not_survived = len(non_survivors_data[non_survivors_data[feature] == value])
            # reuse the counts computed above instead of filtering the data again
            percent_survived = percent(survived, total)
            percent_not_survived = percent(not_survived, total)
            
            data.append([total, survived, not_survived, percent_survived, percent_not_survived])
            index.append(value)
            
    return pd.DataFrame(data, index, columns)
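
For a single feature, the same table can be sketched with `groupby` and named aggregation instead of explicit loops. This is an alternative under stated assumptions, not a replacement for the function above: `toy` and its values are made up, and `Survived` is assumed to be boolean, as it is after the mapping step.

```python
import pandas as pd

# Made-up frame standing in for titanic_data (values are illustrative only)
toy = pd.DataFrame({
    'Sex':      ['male', 'male', 'female', 'female', 'female'],
    'Survived': [False, True, True, True, False],
})

# 'size' counts rows per group; 'sum' of booleans counts the survivors
stats = toy.groupby('Sex')['Survived'].agg(total='size', survived='sum')
stats['not survived'] = stats['total'] - stats['survived']
stats['% survived'] = 100 * stats['survived'] / stats['total']
stats['% not survived'] = 100 - stats['% survived']
print(stats)
```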

In [None]:
features_and_values = {
    'Sex' : ['male', 'female'],
    'Pclass' : ['Upper Class', 'Middle Class', 'Lower Class'],
    'AgeGroup' : ['0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79']
}

In [None]:
survivor_stats = survival_stats(features_and_values)
survivor_stats

### effect of survival rate by different parameters

#### helper function

In [None]:
feature = "Sex"

In [None]:
values = features_and_values[feature]
labels = features_and_values[feature]

In [None]:
values

In [None]:
survivors_count = [survivor_stats['survived'].loc[i] for i in values]
non_survivors_count = [survivor_stats['not survived'].loc[i] for i in values]
survivors_percent = [survivor_stats['% survived'].loc[i] for i in values]
non_survivors_percent = [survivor_stats['% not survived'].loc[i] for i in values]

In [None]:
survivors_count

In [None]:
def plot_survival(feature):
    """
        takes a feature from the titanic data, e.g. Pclass or Sex,
        and plots the count and percentage of survivors and non-survivors
        for every category of that feature
        e.g. for Sex, the possible values are male and female
    """
    values = features_and_values[feature]
    labels = features_and_values[feature]
    
    survivors_count = [survivor_stats['survived'].loc[i] for i in values]
    non_survivors_count = [survivor_stats['not survived'].loc[i] for i in values]
    survivors_percent = [survivor_stats['% survived'].loc[i] for i in values]
    non_survivors_percent = [survivor_stats['% not survived'].loc[i] for i in values]
    
    f, (plot1, plot2) = plt.subplots(1, 2, figsize=(10,5))
    
    plot1.bar(range(len(survivors_count)), survivors_count, label = 'Survivors', alpha = 0.5, color = 'g')
    plot1.bar(range(len(non_survivors_count)), non_survivors_count, bottom = survivors_count, 
              label = 'Non Survivors', alpha=0.5, color='r')

    plt.sca(plot1)

    # one tick per bar; bars are centre-aligned by default in matplotlib >= 2.0
    plt.xticks(list(range(len(labels))), labels)
    plot1.set_ylabel("Count")
    plot1.set_xlabel("")
    plot1.set_title("Count of survivors by " + feature, fontsize = 14)
    plt.legend(loc = "best")


    plot2.bar(range(len(survivors_percent)), survivors_percent, alpha = 0.5, color = 'g')
    plot2.bar(range(len(non_survivors_percent)), non_survivors_percent, 
              bottom = survivors_percent, alpha = 0.5, color = 'r')

    plt.sca(plot2)

    # one tick per bar; bars are centre-aligned by default in matplotlib >= 2.0
    plt.xticks(list(range(len(labels))), labels)
    plot2.set_ylabel("Percentage")
    plot2.set_xlabel("")
    plot2.set_title("% of survivors by " + feature, fontsize = 14)

### Is the survival rate affected by sex?

In [None]:
plot_survival("Sex")

**Conclusion:** females had a greater rate of survival.

### Is the survival rate affected by passenger class?

In [None]:
plot_survival("Pclass")

    the plots above show that lower class passengers were the largest group 
    but had the lowest survival rate,
    while upper class passengers had the highest survival rate

**Conclusion:** Passenger class does have an impact on chances of survival.

### survival rate based on age groups

In [None]:
titanic_data.groupby(['AgeGroup']).size()

    the largest share of passengers belongs to the age group "20-29"

In [None]:
titanic_data.groupby(['AgeGroup']).size().plot(kind = 'bar')

plt.title("Distribution of Age Groups", fontsize = 14)
plt.ylabel('Count')
plt.xlabel('Age Group')

In [None]:
plot_survival("AgeGroup")

    we cannot conclude much from the age group about survival rate

### combined effects of Age Group, Sex and Passenger class on survival rate

#### helper function

In [None]:
def plot_suvival_group(features):
    
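    # reindex keeps one value per x label, in label order, even when a group is empty
    values = [titanic_data[titanic_data[features[1]] == value]
              .groupby(features[0], observed=False).Survived.mean()
              .reindex(features_and_values[features[0]]).values
              for value in features_and_values[features[1]]]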
    
    x_labels = features_and_values[features[0]]
    legend_labels = features_and_values[features[1]]
    colors = random.sample(['b', 'g', 'r'], len(legend_labels))

    plot = plt.subplot()

    positions = []
    position_move = 0
    for i in features_and_values[features[1]]:
        positions.append(np.array(range(len(x_labels))) + position_move)
        position_move += 0.25
    
    for position, value, label, color in zip(positions, values, legend_labels, colors):
        plot.bar(position, value, width = 0.25, label = label, alpha = 0.5, color = color)

    plt.xticks((np.array(range(len(x_labels))) + 0.4), x_labels)

    plot.set_ylabel("Proportion")
    plot.set_xlabel(features[0])
    plot.set_title("Survival rate by " + features[0] + " & " + features[1], fontsize = 14)
    plt.legend(loc = 'best')

#### Age Group and Sex

In [None]:
plot_suvival_group(['AgeGroup', 'Sex'])

#### Age Group and Passenger Class

In [None]:
plot_suvival_group(['AgeGroup', 'Pclass'])

# Conclusion

- females had a higher chance of survival
- children and old people, irrespective of their sex, had a higher chance of survival