# P2 - Data Analyst

## Introduction
The goal here is to explore the data and see what conclusions it supports. At this stage no question can be formulated without first getting to know the data.

## Importing
Import the libraries and the CSV file, then verify that the CSV was imported correctly.

In [1]:
%matplotlib inline

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from bokeh.plotting import figure
from bokeh.charts import Histogram
from bokeh.charts import Donut
from bokeh.charts import Bar
from bokeh.io import output_notebook, show
from bokeh.layouts import row
import random
Set = set  # the Python 2-only `sets` module is gone in Python 3; the built-in set does the same job

output_notebook()

df = pd.read_csv('titanic_data.csv')



In [2]:
df.describe()



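| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
| --- | --- | --- | --- | --- | --- | --- | --- |
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | NaN | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | NaN | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | NaN | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |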


In [3]:
df

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |


## Data Cleaning
The 'Age' column has some NaN values that we must deal with.

In [4]:
df.isnull().sum()


PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Out of a total of 891 passengers, we are missing the Age of 177, the Cabin of 687 and the port of embarkation (Embarked) of 2.

Cabin is missing for most passengers, so we can consider it nearly useless and remove the column.

For Embarked and Age, I will remove the passengers with missing values from the analysis. I am not happy about doing this, since more than 150 passengers are removed, but we still keep more than 700 passengers to analyse.

In [5]:
df_clean = df.copy()

del df_clean['Cabin']

df_clean = df_clean.dropna()

df_clean.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64
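
The same cleaning can also be written by naming the columns explicitly, which makes it clearer which missing values cause a row to be dropped. The following is a minimal sketch on a small made-up frame (`toy` and its values are hypothetical, not rows from `titanic_data.csv`), using `DataFrame.drop` and `DataFrame.dropna` with `subset`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df; values are invented for illustration
toy = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4],
    'Age': [22.0, np.nan, 35.0, 54.0],
    'Cabin': [np.nan, 'C85', np.nan, 'E46'],
    'Embarked': ['S', 'C', np.nan, 'S'],
})

# Drop the mostly-empty column first, then drop rows missing Age or Embarked
toy_clean = toy.drop(columns=['Cabin']).dropna(subset=['Age', 'Embarked'])

print(toy_clean)
```

Only the rows with both Age and Embarked present remain, and the original `toy` frame is left untouched.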

The PassengerId and Ticket columns will not give any relevant statistical result, so they will also be removed.

In [6]:
del df_clean['PassengerId']
del df_clean['Ticket']

df_clean.describe()

| | Survived | Pclass | Age | SibSp | Parch | Fare |
| --- | --- | --- | --- | --- | --- | --- |
| count | 712.0 | 712.0 | 712.0 | 712.0 | 712.0 | 712.0 |
| mean | 0.404494 | 2.240169 | 29.642093 | 0.514045 | 0.432584 | 34.567251 |
| std | 0.491139 | 0.836854 | 14.492933 | 0.930692 | 0.854181 | 52.938648 |
| min | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 1.0 | 20.0 | 0.0 | 0.0 | 8.05 |
| 50% | 0.0 | 2.0 | 28.0 | 0.0 | 0.0 | 15.64585 |
| 75% | 1.0 | 3.0 | 38.0 | 1.0 | 1.0 | 33.0 |
| max | 1.0 | 3.0 | 80.0 | 5.0 | 6.0 | 512.3292 |


## Exploring phase
Now that the data is clean, it is time to play with it. Let's start with the overall numbers: how many passengers survived, and how many males and females are in the cleaned data.

In [7]:
print("The total number of passengers was %d; of these, only %d survived."
      %(df_clean['Survived'].count(), df_clean[df_clean['Survived'] == 1]['Survived'].count()))
print("There are %d males and %d females in the cleaned data."
      %(df_clean[df_clean['Sex'] == 'male']['Survived'].count(),
        df_clean[df_clean['Sex'] == 'female']['Survived'].count()))


The total number of passengers was 712; of these, only 288 survived.
There are 453 males and 259 females in the cleaned data.


Let's check the Age: we will verify whether there is a major difference in age between those who survived and those who did not.

In [8]:
hist = Histogram(df_clean, values='Age', title="Histogram of passenger age", plot_width=400, plot_height=400)
hist2 = Histogram(df_clean, values='Age', color='Survived', legend='top_right',
                  title="Histogram of passenger age by survival", plot_width=400, plot_height=400)

show(row(hist,hist2))

It surprised me that most of the young children (under 5) survived, as did one passenger over 75. Beyond that, there is not much to work with here.
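
To put numbers on what the histogram suggests, one option is to bin the ages into bands and compute the survival rate per band. The following is a minimal sketch on made-up data: `toy`, the bin edges and the band labels are illustrative choices, not part of the original notebook.

```python
import pandas as pd

# Hypothetical ages and outcomes standing in for df_clean
toy = pd.DataFrame({
    'Age': [0.8, 3, 4, 17, 25, 31, 40, 58, 62, 80],
    'Survived': [1, 1, 0, 0, 1, 0, 0, 1, 0, 1],
})

# Right-inclusive bands: (0, 5], (5, 18], (18, 40], (40, 60], (60, 100]
bins = [0, 5, 18, 40, 60, 100]
labels = ['0-5', '5-18', '18-40', '40-60', '60+']
toy['AgeBand'] = pd.cut(toy['Age'], bins=bins, labels=labels)

# Passenger count and survival rate (mean of the 0/1 column) per band
summary = toy.groupby('AgeBand', observed=False)['Survived'].agg(['count', 'mean'])
print(summary)
```

On the real data, replacing `toy` with `df_clean` would give the count and survival rate for each age band.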

Let's check if there is a difference for 'Sex'.

In [9]:
df_female = df_clean[df_clean['Sex'] == 'female']
df_male = df_clean[df_clean['Sex'] == 'male']

female_survived = df_female[df_female['Survived'] == 1].loc[:,'Survived'].count()
female_not_survived = df_female[df_female['Survived'] == 0].loc[:,'Survived'].count()
male_survived = df_male[df_male['Survived'] == 1].loc[:,'Survived'].count()
male_not_survived = df_male[df_male['Survived'] == 0].loc[:,'Survived'].count()

df_simple = pd.DataFrame({'Sex': ['female', 'female', 'male', 'male'], 
             'Survived': ['Survived', 'Not survived', 'Survived', 'Not survived'], 
             'counting': [female_survived, female_not_survived, male_survived, male_not_survived]})

bar = Bar(df_simple, values='counting', label='Survived', stack='Sex', legend='top_center',
                  title="Number of passengers by survival and sex", plot_width=400, plot_height=400)
hist = Histogram(df_clean, values='Age', color='Sex', legend='top_right',
                  title="Histogram of passenger age by sex", plot_width=400, plot_height=400)

show(row(bar,hist))

Now this really surprised me. The number of females who survived is a lot bigger than the number of males.

The age distribution of males and females does not differ much. Since survival by sex looks more interesting, let's explore it further.

To improve our understanding, let's look at one more chart of Sex and Survival.

In [10]:
d = Donut(df_simple, label=['Sex','Survived'], values='counting',
                  title="Number of passengers by sex and survival")

show(d)

Again we can clearly see that a much larger share of females survived than of males.
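
The `bokeh.charts` module used in the cells above is no longer part of recent Bokeh releases, so those cells may not run on a current install. As one possible alternative, here is a sketch of a similar stacked bar chart built with `pandas.crosstab` and matplotlib, which the notebook already imports. The `toy` frame is made up for illustration; on the real data `df_clean` would take its place.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; not needed with %matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data standing in for df_clean
toy = pd.DataFrame({
    'Sex': ['female', 'female', 'female', 'male', 'male', 'male', 'male'],
    'Survived': [1, 1, 0, 0, 0, 1, 0],
})

# Rows: sex, columns: survived flag (0 then 1), cells: passenger counts
counts = pd.crosstab(toy['Sex'], toy['Survived'])
counts.columns = ['Not survived', 'Survived']

# One bar per sex, stacked by outcome
ax = counts.plot(kind='bar', stacked=True, figsize=(4, 4))
ax.set_ylabel('Number of passengers')
ax.set_title('Passengers by sex and survival')
plt.tight_layout()
```

The `counts` table holds the same four numbers that `df_simple` collects by hand.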

In [11]:
def survival_rate(sex):
    df_one_sex = df_clean[df_clean['Sex'] == sex]
    count_survivals = df_one_sex[df_one_sex['Survived'] == 1].loc[:,'Survived'].count()
    count_one_sex = df_one_sex['Survived'].count()
    return float(count_survivals)/count_one_sex

survival_rate_male = survival_rate('male')
survival_rate_female = survival_rate('female')

print("The survival rate for males was %.2f and for females was %.2f." %(survival_rate_male, survival_rate_female))

The survival rate for males was 0.21 and for females was 0.75.
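
The per-sex rate can also be read off directly with a group-by, since the mean of a 0/1 column is the share of ones. A minimal sketch on made-up data (`toy` is hypothetical and only stands in for `df_clean`):

```python
import pandas as pd

# Hypothetical data: 2 of 3 females and 1 of 4 males survive
toy = pd.DataFrame({
    'Sex': ['female', 'female', 'female', 'male', 'male', 'male', 'male'],
    'Survived': [1, 1, 0, 0, 0, 1, 0],
})

# Mean of the 0/1 Survived column within each sex = survival rate
rates = toy.groupby('Sex')['Survived'].mean()
print(rates)
```

With the notebook's data, `df_clean.groupby('Sex')['Survived'].mean()` should give the same two rates as `survival_rate('male')` and `survival_rate('female')`.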


I am surprised again; I didn't expect such a large difference in the rates. I will try to show statistically that the survival rate for females on the Titanic was higher than for males.

## Statistical analysis
Now that we know the data, we will try to draw some conclusions. The question is: did females have a higher chance of survival than males on the Titanic?

We will try to show, with statistical significance, that the average survival rate for females on the Titanic was higher than the survival rate for males.

We will draw 1000 random samples of 20 subjects for each sex, show the histogram of the survival rates of the samples, and check whether the average survival rate for females is significantly higher than for males.

We will use a significance level of 0.05 and a one-tailed t-test.

Null hypothesis (H0):
average survival rate of females <= average survival rate of males

Alternative hypothesis (H1):
average survival rate of females > average survival rate of males

In [12]:
SIZE_SAMPLE = 20
TOTAL_SIZE = 1000

def sample(sex, size_sample):
    # Draw size_sample distinct row labels of the given sex (sampling without replacement)
    df_one_sex = df_clean[df_clean['Sex'] == sex]
    return random.sample(list(df_one_sex.index), size_sample)

def survival_rate_of_sample(indexes):
    # Share of survivors among the sampled rows
    survivals = [df_clean.loc[index, 'Survived'] for index in indexes]
    return float(sum(survivals)) / len(survivals)

def survival_rate_list(sex, total_size, size_sample):
    # Survival rates of total_size independent samples of size_sample passengers each
    return [survival_rate_of_sample(sample(sex, size_sample))
            for _ in range(total_size)]
    

survival_rate_list_male = survival_rate_list('male',TOTAL_SIZE,SIZE_SAMPLE)
survival_rate_list_female = survival_rate_list('female',TOTAL_SIZE,SIZE_SAMPLE)

Note: two of the 1000 samples could contain exactly the same 20 passengers and so contribute the same survival rate twice. Since the chance is extremely small, we ignore it.
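
For reproducible runs, the sampling could also be done with a seeded NumPy generator. The sketch below is an alternative under assumptions, not the method used above: `sample_rates`, its `seed` parameter and the `outcomes` array are names introduced here for illustration.

```python
import numpy as np

def sample_rates(outcomes, n_samples, sample_size, seed=0):
    """Survival rates of n_samples random samples, each drawn without replacement."""
    rng = np.random.default_rng(seed)  # fixed seed so reruns give the same samples
    outcomes = np.asarray(outcomes)
    return np.array([
        rng.choice(outcomes, size=sample_size, replace=False).mean()
        for _ in range(n_samples)
    ])

# Hypothetical 0/1 outcomes: 30 survivors among 100 passengers
outcomes = np.array([1] * 30 + [0] * 70)
rates = sample_rates(outcomes, n_samples=50, sample_size=20)
print(rates.mean())
```

With the notebook's data, `outcomes` would be `df_clean[df_clean['Sex'] == sex]['Survived'].values` for each sex.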

In [13]:
# Build the frame in one step; DataFrame.append was removed in pandas 2.0
data = pd.DataFrame({
    'Sex': ['male'] * len(survival_rate_list_male) + ['female'] * len(survival_rate_list_female),
    'survival_rate': survival_rate_list_male + survival_rate_list_female})

In [14]:
hist = Histogram(data, values='survival_rate', color='Sex', title="Histogram of survival rate by sex",
                 plot_width=400, plot_height=400, legend='top_center')

show(hist)

We want to check the difference, so we will analyse the female survival rate minus the male survival rate.

In [15]:
survival_rate_difference = np.array(survival_rate_list_female) - np.array(survival_rate_list_male)

hist = Histogram(pd.DataFrame({'Survival rate difference': survival_rate_difference}),
                 title="Survival rate difference Histogram", plot_width=400, plot_height=400)

show(hist)

In [16]:
print('The mean is %.2f and the standard deviation is %.2f.' %(survival_rate_difference.mean(),survival_rate_difference.std()))

The mean is 0.54 and the standard deviation is 0.13.


In [17]:
t_statistic = survival_rate_difference.mean()/(survival_rate_difference.std()/SIZE_SAMPLE**0.5)

print('The t-statistic is %.2f.' %t_statistic)

The t-statistic is 18.61.
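
Written out, the statistic computed in the cell above is the mean of the differences divided by its estimated standard error. Here $\bar{d}$ is the mean difference, $s_d$ its standard deviation and $n$ is `SIZE_SAMPLE` (these symbols are introduced here for illustration). Plugging in the rounded values printed earlier gives roughly the same number; the small gap to 18.61 comes from rounding the mean and the standard deviation to two decimals.

$$
t = \frac{\bar{d}}{s_d / \sqrt{n}} \approx \frac{0.54}{0.13 / \sqrt{20}} \approx 18.6
$$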


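The degrees of freedom are 19 (SIZE_SAMPLE - 1).

The critical t-value is 1.729 for a one-tailed test at the 0.05 level. (Remember we are only checking whether the female rate is higher, so only one direction matters.)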
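
As a cross-check that does not depend on resampling, a two-proportion z-test can be run on the full counts. This is a different technique from the t-test above and is shown only as a sketch. The function name `two_proportion_z` is introduced here, and the survivor counts (195 of 259 females, 93 of 453 males) are assumed values chosen to match the totals and rounded rates printed earlier; on the real data they would come from `female_survived` and `male_survived`.

```python
import math

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """One-sided z-test of H0: p_a <= p_b against H1: p_a > p_b."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled proportion under the null hypothesis of equal rates
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Upper-tail probability of the standard normal distribution
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value

# Assumed counts (see the note above): females first, then males
z, p_value = two_proportion_z(195, 259, 93, 453)
print("z = %.2f, one-sided p-value = %.3g" % (z, p_value))
```

If the p-value is below 0.05, this check points the same way as the t-test: the null hypothesis is rejected.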

## Conclusion
Since the t-statistic is much higher than the critical t-value, we reject the null hypothesis and conclude that the average survival rate for females is significantly higher than the average survival rate for males.

So, answering the question from the beginning of the statistical analysis: did females have a higher chance of survival than males on the Titanic?

The answer is: yes. At the 0.05 significance level (95% confidence), we can say that females had a higher chance of survival than males.

This could be used to predict the chance of survival on the Titanic, but there may be lurking variables that also influence it. To verify how other features behave, more studies should be done.

Despite the simple conclusion, we know that the Titanic had about 1300 passengers, while our cleaned data set contains only 712. Depending on what the missing data looks like, our analysis may need to be updated or discarded entirely.

For example, if the missing records belong to unidentified passengers and the majority of them are women, that could totally change the analysis.