# DAT 19: Homework 1 Assignment

## Instructions

In this assignment, we will explore the passenger list of the Titanic, as provided in a well-known [Kaggle](https://www.kaggle.com) competition. For this assignment, we are concerned only with initial exploration. Please answer the specific questions listed below.

The dataset is a list of passengers. The second column of the dataset is the label for each person indicating whether that person survived (1) or did not survive (0). The data is described in more detail below.

There is no need to download the data from Kaggle. We have provided the titanic.csv dataset for you in the same directory as this assignment.

Please do all your analysis to answer the questions below in this Jupyter notebook. Show your work. Questions 10, 13, and 14 ask for your thoughts as a few sentences (regular prose, not code :-).

**Please submit your completed notebook by 11:59PM on Thursday, December 17.**

## About the Data

```
VARIABLE DESCRIPTIONS:
survival        Survival
                (0 = No; 1 = Yes)
pclass          Passenger Class
                (1 = 1st; 2 = 2nd; 3 = 3rd)
name            Name
sex             Sex
age             Age
sibsp           Number of Siblings/Spouses Aboard
parch           Number of Parents/Children Aboard
ticket          Ticket Number
fare            Passenger Fare
cabin           Cabin
embarked        Port of Embarkation
                (C = Cherbourg; Q = Queenstown; S = Southampton)

SPECIAL NOTES:
Pclass is a proxy for socio-economic status (SES)
 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

Age is in Years; Fractional if Age less than One (1)
 If the Age is Estimated, it is in the form xx.5

With respect to the family relation variables (i.e. sibsp and parch)
some relations were ignored.  The following are the definitions used
for sibsp and parch.

Sibling:  Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
Spouse:   Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
Parent:   Mother or Father of Passenger Aboard Titanic
Child:    Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

Other family relatives excluded from this study include cousins,
nephews/nieces, aunts/uncles, and in-laws.  Some children travelled
only with a nanny, therefore parch=0 for them.  As well, some
travelled with very close friends or neighbors in a village, however,
the definitions do not support such relations.
```

### Load Data

Start by loading the Titanic data into a NumPy array. The code below does this to get you started.

In [57]:
# We'll use the csv module from the Python standard library to read each row of our input data file
#   into a list that we then convert into a numpy array.

import csv as csv
import numpy as np

file_object = csv.reader(open('titanic.csv', 'rb'))
header = file_object.next() #First, let's strip off the header
data = []
for row in file_object:
    data.append(row)

data = np.array(data) #Turn our Python list into a numpy array

# Let's see what our data looks like. Notice what type the data appears to be inside the numpy array.
print type(data)
print data.shape

print data[:10,:]
print '\n' + '...'
print data[-1,:]

<type 'numpy.ndarray'>
(891, 12)
[['1' '0' '3' 'Braund, Mr. Owen Harris' 'male' '22' '1' '0' 'A/5 21171'
  '7.25' '' 'S']
 ['2' '1' '1' 'Cumings, Mrs. John Bradley (Florence Briggs Thayer)'
  'female' '38' '1' '0' 'PC 17599' '71.2833' 'C85' 'C']
 ['3' '1' '3' 'Heikkinen, Miss. Laina' 'female' '26' '0' '0'
  'STON/O2. 3101282' '7.925' '' 'S']
 ['4' '1' '1' 'Futrelle, Mrs. Jacques Heath (Lily May Peel)' 'female' '35'
  '1' '0' '113803' '53.1' 'C123' 'S']
 ['5' '0' '3' 'Allen, Mr. William Henry' 'male' '35' '0' '0' '373450'
  '8.05' '' 'S']
 ['6' '0' '3' 'Moran, Mr. James' 'male' '' '0' '0' '330877' '8.4583' '' 'Q']
 ['7' '0' '1' 'McCarthy, Mr. Timothy J' 'male' '54' '0' '0' '17463'
  '51.8625' 'E46' 'S']
 ['8' '0' '3' 'Palsson, Master. Gosta Leonard' 'male' '2' '3' '1' '349909'
  '21.075' '' 'S']
 ['9' '1' '3' 'Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)' 'female'
  '27' '0' '2' '347742' '11.1333' '' 'S']
 ['10' '1' '2' 'Nasser, Mrs. Nicholas (Adele Achem)' 'female' '14' '1' '0'
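
Every value in the array above is a string because `csv.reader` performs no type conversion, and a missing age shows up as an empty string (see passenger 6). The sketch below shows one way to convert such a column. It is written for Python 3, it uses three short rows in place of `titanic.csv`, and the helper name `parse_age` is hypothetical:

```python
import csv
import io

# A tiny stand-in for titanic.csv with only four of the columns.
sample_csv = io.StringIO(
    'Survived,Pclass,Sex,Age\n'
    '0,3,male,22\n'
    '1,1,female,38\n'
    '0,3,male,\n'          # blank age, like passenger 6
)

reader = csv.reader(sample_csv)
header = next(reader)      # strip off the header row
rows = list(reader)


def parse_age(text):
    """Hypothetical helper: a blank age becomes None, anything else a float."""
    return float(text) if text.strip() else None


age_col = header.index('Age')
ages = [parse_age(row[age_col]) for row in rows]

print(rows[0])   # every field is still a string
print(ages)      # numeric ages, with None marking the missing value
```

Keeping the missing value distinct from any real number (here `None`) makes it harder to average it in by accident later.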
  

## Homework Questions

**1) How many passengers are in our passenger list? For this assignment, we’ll assume our dataset represents the full passenger list for the Titanic.**

In [58]:
# Your code here for num_passengers. Uncomment the line below to format your answer.

num_passengers = len(data)

print 'There are ' + str(num_passengers) + ' passengers in our list.'

There are 891 passengers in our list.


**2) What is the overall survival rate (as a percentage of total passengers in our list)?**

In [59]:
#HINT: You may want to import another module here
import pandas as pd

#Your code here to calculate num_survived and survival_rate. 
#Again, we've provided some code you can use to format your answer.

df = pd.DataFrame(data, columns = header)
# Every column is still a string, so convert Survived before summing.
# Converting to float also keeps the division below from truncating.
num_survived = df.Survived.astype('float').sum()
survival_rate = num_survived / num_passengers


print 'Of ' + str(num_passengers) + ' passengers, ' + str(num_survived) + ' survived.'
print 'The survival rate is ' + str(survival_rate) + ' or '+ str("{0:.0f}%".format(survival_rate * 100))

Of 891 passengers, 342.0 survived.
The survival rate is 0.383838383838 or 38%


**3) How many male passengers were onboard?**

In [60]:
# Your code here to calculate num_males

bysex = df.groupby(['Sex'])
males = bysex.get_group('male')
num_males = len(males)

print 'There were ' + str(num_males) + ' male passengers onboard.'

There were 577 male passengers onboard.


**4) How many female passengers were onboard?**

In [61]:
# Your code here to calculate num_females

females = bysex.get_group('female')
num_females = len(females)

print 'There were ' + str(num_females) + ' female passengers onboard.'

There were 314 female passengers onboard.


**5) What is the overall survival rate of male passengers?**

In [62]:
# Your code here to calculate male_survival_rate

num_males_survived = males.Survived.astype("float").sum()
male_survival_rate = num_males_survived / num_males

print 'The survival rate of male passengers was ' + str("{0:.2f}%".format(male_survival_rate * 100))

The survival rate of male passengers was 18.89%


**6) What is the overall survival rate of female passengers?**

In [63]:
# Your code here to calculate female_survival_rate

num_females_survived = females.Survived.astype("float").sum()
female_survival_rate = num_females_survived / num_females  # divide by the number of females, not males


print 'The survival rate of female passengers was ' + str("{0:.2f}%".format(female_survival_rate * 100))

The survival rate of female passengers was 74.20%
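
One way to sanity-check rates computed by subgroup is to confirm that the implied survivor counts add back up to the overall total. The sketch below is written for Python 3 and uses only figures printed for questions 2 through 5 (891 passengers, 342 survivors, 577 males, 314 females, 18.89% male survival). The variable names are illustrative:

```python
num_passengers = 891
num_survived = 342
num_males = 577
num_females = 314
male_survival_rate = 0.1889   # as printed in question 5

# The two groups must cover the whole passenger list.
assert num_males + num_females == num_passengers

# Survivor counts are whole people, so round the implied male count.
males_survived = int(round(male_survival_rate * num_males))
females_survived = num_survived - males_survived
implied_female_rate = females_survived / float(num_females)

print(males_survived, females_survived)
print("{0:.2f}%".format(implied_female_rate * 100))
```

If a reported subgroup rate disagrees with the implied one, a likely culprit is a mismatched denominator.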


**7) What is the average age of all passengers onboard?**

In [78]:
# Your code here to calculate avg_age
# HINT: This may be trickier than it first seems. Look at the age values in the raw data file.

data = df
# Missing ages are empty strings in the raw file. Convert the column to numbers,
# turning blanks into NaN, so that mean() averages only the known ages.
data['Age'] = pd.to_numeric(data['Age'], errors='coerce')
avg_age = data['Age'].mean()

print 'The average age of all passengers onboard was ' + str("{0:.2f}".format(avg_age))



The average age of all passengers onboard was 29.70
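
The hint refers to passengers whose age is blank in the raw file. How those blanks are treated changes the average: skipping them gives the mean of the known ages, while replacing them with 0 pulls the mean down. A small illustration, written for Python 3, with a short illustrative list of age strings rather than the full column:

```python
raw_ages = ['22', '38', '', '26', '', '35']   # '' marks a missing age

# Option 1: drop the blanks and average what is known.
known = [float(a) for a in raw_ages if a != '']

# Option 2: replace blanks with 0 (this biases the mean downward).
zero_filled = [float(a) if a != '' else 0.0 for a in raw_ages]

mean_known = sum(known) / len(known)
mean_zero_filled = sum(zero_filled) / len(zero_filled)

print(mean_known)         # sum of known ages divided by 4
print(mean_zero_filled)   # the same sum divided by all 6 entries
```

Both versions sum the same numbers; only the denominator differs, which is why zero-filling understates the average whenever values are missing.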


**8) What is the average age of passengers who survived?**

In [87]:
# Your code here to calculate avg_age_survived

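# Convert ages to numbers. Blank ages (and any 0 placeholders) become NaN so mean() skips them.
age = pd.to_numeric(data.Age, errors='coerce').replace(0, np.nan)

# Survived is still stored as the strings '0' and '1'.
avg_age_survived = age[data.Survived == '1'].mean()


print 'The average age of passengers who survived was ' + str("{0:.2f}".format(avg_age_survived))

The average age of passengers who survived was 28.34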


**9) What is the average age of passengers who did not survive?**

In [88]:
# Your code here to calculate avg_age_not_survive
# Convert ages to numbers. Blank ages (and any 0 placeholders) become NaN so mean() skips them.
age = pd.to_numeric(data.Age, errors='coerce').replace(0, np.nan)

# Survived is still stored as the strings '0' and '1'.
avg_age_not_survive = age[data.Survived == '0'].mean()


print 'The average age of passengers who did not survive was ' + str("{0:.2f}".format(avg_age_not_survive))

The average age of passengers who did not survive was 30.63


**10) At this (early) point in our analysis, what might you infer about any patterns you are seeing in who survived / did not survive?**

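Female passengers appear more likely to have survived than male passengers. The average age of survivors is close to that of non-survivors, so age by itself does not look like a strong factor at this stage.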

**11) How many passengers are in each of the three classes of service (i.e., First, Second, Third)?**

In [89]:
# Your code here
byclass = data.groupby('Pclass')
# Pclass values are still strings, so the group keys are '1', '2' and '3'.
num_firstclass = len(byclass.get_group('1'))
num_secondclass = len(byclass.get_group('2'))
num_thirdclass = len(byclass.get_group('3'))

# You do not need to format the output nicely like we did for the questions above.
print "There are " + str(num_firstclass) + " people in First Class."
print "There are " + str(num_secondclass) + " people in Second Class."
print "There are " + str(num_thirdclass) + " people in Third Class."


There are 216 people in First Class.
There are 184 people in Second Class.
There are 491 people in Third Class.


**12) What is the survival rate for passengers in each of the three classes of service?**

In [91]:
# Your code here
# HINT: Averaging of the 1's and 0's in the "Survived" column will give a survival rate.
# We want the survival rate for each passenger class.
# You do not need to format the output nicely.

firstclass = byclass.get_group('1')
secondclass = byclass.get_group('2')
thirdclass = byclass.get_group('3')

rate_firstclass_survived = firstclass.Survived.astype("float").mean()
rate_secondclass_survived = secondclass.Survived.astype("float").mean()
rate_thirdclass_survived = thirdclass.Survived.astype("float").mean()

print "Survival rate of First Class: " + str(rate_firstclass_survived)
print "Survival rate of Second Class: " + str(rate_secondclass_survived)
print "Survival rate of Third Class: " + str(rate_thirdclass_survived)


Survival rate of First Class: 0.62962962963
Survival rate of Second Class: 0.472826086957
Survival rate of Third Class: 0.242362525458
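
The hint works because the mean of a column of 0s and 1s is the fraction of 1s. Below is a minimal pure-Python sketch of the per-class calculation, written for Python 3. It uses the (Survived, Pclass) pairs of the ten passengers printed in the Load Data output rather than the full file, so its rates differ from the full-dataset figures above:

```python
from collections import defaultdict

# (survived, pclass) for passengers 1 through 10 in the printed sample
sample = [(0, 3), (1, 1), (1, 3), (1, 1), (0, 3),
          (0, 3), (0, 1), (0, 3), (1, 3), (1, 2)]

# Collect the 0/1 outcomes of each class in its own list.
outcomes_by_class = defaultdict(list)
for survived, pclass in sample:
    outcomes_by_class[pclass].append(survived)

# The mean of each list is that class's survival rate.
rate_by_class = {
    pclass: sum(outcomes) / float(len(outcomes))
    for pclass, outcomes in outcomes_by_class.items()
}

for pclass in sorted(rate_by_class):
    print(pclass, rate_by_class[pclass])
```

With only ten rows the per-class rates are noisy (Second Class has a single passenger here), which is a reminder to look at group sizes alongside group rates.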


**13) How does what we learned in 11) and 12) influence our early conclusions from 10)?**

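Survival rate falls steadily with class of service: about 63% in First Class, 47% in Second Class, and 24% in Third Class. Passenger class therefore looks like an important factor alongside sex. Third Class also holds more than half of the passengers (491 of 891), so its low survival rate weighs heavily on the overall rate.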

**14) If we were to build a predictive model, which features in the data do you think we should include in the model and which can we leave out? Why?**


- **Sex:** Yes. As we saw, female passengers were more likely to survive.
- **Age:** Maybe. Some ages are missing (blank in the raw file), so the column needs cleaning before it can be used.
- **Pclass:** Yes. Survival rate differs noticeably between classes.
- **Name:** Maybe. This needs more exploratory work, but passengers of certain ethnicities may have had different chances of survival, and NLP on names could help estimate ethnicity.
- **Cabin:** Maybe. The location of a cabin may have affected survival (needs more exploratory work).
- **Number of family members aboard (SibSp, Parch):** Maybe. Having relatives aboard probably changed how a passenger behaved, which may have affected their likelihood of survival (needs more exploratory work).
- **Ticket number:** No. Ticket number may be related to Pclass, cabin location, fare, or port of embarkation, but it seems doubtful that it correlates with survival on its own.
- **Fare:** No. Fare is probably related to class of service, cabin location, or when the ticket was bought, so it is unlikely to add much information beyond those.
- **Embarked:** No. It seems unlikely that where someone boarded the Titanic affected whether they got off.
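
The "family members aboard" idea above could be explored by combining `sibsp` and `parch` into a single count. A hedged sketch, written for Python 3: `family_size` and `travelling_alone` are hypothetical feature names, and the three (sibsp, parch) pairs are taken from passengers 1, 8, and 9 in the Load Data output:

```python
# sibsp and parch values copied from the printed sample rows.
passengers = [
    {'name': 'Braund, Mr. Owen Harris', 'sibsp': 1, 'parch': 0},
    {'name': 'Palsson, Master. Gosta Leonard', 'sibsp': 3, 'parch': 1},
    {'name': 'Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)', 'sibsp': 0, 'parch': 2},
]

for p in passengers:
    # Hypothetical derived features, not columns of the original dataset.
    p['family_size'] = p['sibsp'] + p['parch']
    p['travelling_alone'] = p['family_size'] == 0

for p in passengers:
    print(p['name'], p['family_size'], p['travelling_alone'])
```

Per the notes in About the Data, some relations (cousins, in-laws, nannies, close friends) are not counted, so a value of 0 does not guarantee that a passenger travelled without companions.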
