# DAT 19: Homework 1 Assignment

## Instructions

In this assignment, we will explore the passenger list of the Titanic, as provided in a well-known [Kaggle](https://www.kaggle.com) competition. For this assignment, we are concerned only with initial exploration. Please answer the specific questions listed below.

The dataset is a list of passengers. The second column of the dataset is the label for each person indicating whether that person survived (1) or did not survive (0). The data is described in more detail below.

There is no need to download the data from Kaggle. We have provided the titanic.csv dataset for you in the same directory as this assignment.

Please do all your analysis to answer the questions below in this Jupyter notebook. Show your work. Questions 10, 13, and 14 ask for your thoughts as a few sentences (regular prose, not code :-).

**Please submit your completed notebook by 11:59PM on Thursday, December 17.**

## About the Data

```
VARIABLE DESCRIPTIONS:
survival        Survival
                (0 = No; 1 = Yes)
pclass          Passenger Class
                (1 = 1st; 2 = 2nd; 3 = 3rd)
name            Name
sex             Sex
age             Age
sibsp           Number of Siblings/Spouses Aboard
parch           Number of Parents/Children Aboard
ticket          Ticket Number
fare            Passenger Fare
cabin           Cabin
embarked        Port of Embarkation
                (C = Cherbourg; Q = Queenstown; S = Southampton)

SPECIAL NOTES:
Pclass is a proxy for socio-economic status (SES)
 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

Age is in Years; Fractional if Age less than One (1)
 If the Age is Estimated, it is in the form xx.5

With respect to the family relation variables (i.e. sibsp and parch)
some relations were ignored.  The following are the definitions used
for sibsp and parch.

Sibling:  Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
Spouse:   Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
Parent:   Mother or Father of Passenger Aboard Titanic
Child:    Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

Other family relatives excluded from this study include cousins,
nephews/nieces, aunts/uncles, and in-laws.  Some children travelled
only with a nanny, therefore parch=0 for them.  As well, some
travelled with very close friends or neighbors in a village, however,
the definitions do not support such relations.
```

### Load Data

Start by loading the Titanic data into a numpy array. The code to do this is included below to get you started.

In [12]:
# We'll use the csv module from the Python standard library to read each row of our input data file
#   into a list that we then convert into a numpy array.

import csv
import numpy as np

data = []
# Open in binary mode, as the Python 2 csv module expects; the with block closes the file for us.
with open('titanic.csv', 'rb') as csv_file:
    for row in csv.reader(csv_file):
        data.append(row)
initial_columns = data[0]  # The first row holds the column names
data = np.array(data[1:])  # Turn the remaining rows into a numpy array

# Let's see what our data looks like. Notice what type the data appears to be inside the numpy array.
print type(data)
print data.shape
print data[0]
print data[:10,:]
print '\n' + '...'
print data[-1,:]
print initial_columns

<type 'numpy.ndarray'>
(891, 12)
['1' '0' '3' 'Braund, Mr. Owen Harris' 'male' '22' '1' '0' 'A/5 21171'
 '7.25' '' 'S']
[['1' '0' '3' 'Braund, Mr. Owen Harris' 'male' '22' '1' '0' 'A/5 21171'
  '7.25' '' 'S']
 ['2' '1' '1' 'Cumings, Mrs. John Bradley (Florence Briggs Thayer)'
  'female' '38' '1' '0' 'PC 17599' '71.2833' 'C85' 'C']
 ['3' '1' '3' 'Heikkinen, Miss. Laina' 'female' '26' '0' '0'
  'STON/O2. 3101282' '7.925' '' 'S']
 ['4' '1' '1' 'Futrelle, Mrs. Jacques Heath (Lily May Peel)' 'female' '35'
  '1' '0' '113803' '53.1' 'C123' 'S']
 ['5' '0' '3' 'Allen, Mr. William Henry' 'male' '35' '0' '0' '373450'
  '8.05' '' 'S']
 ['6' '0' '3' 'Moran, Mr. James' 'male' '' '0' '0' '330877' '8.4583' '' 'Q']
 ['7' '0' '1' 'McCarthy, Mr. Timothy J' 'male' '54' '0' '0' '17463'
  '51.8625' 'E46' 'S']
 ['8' '0' '3' 'Palsson, Master. Gosta Leonard' 'male' '2' '3' '1' '349909'
  '21.075' '' 'S']
 ['9' '1' '3' 'Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)' 'female'
  '27' '0' '2' '347742' '11.1333
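
The output above shows that every value, including the numeric columns, arrives as a string, and that a missing age is an empty string. As a minimal sketch of what converting those fields by hand could look like, the block below parses three rows copied from the output. The header reuses the column names assigned later in question 2, and the `lines` list and the `parse_row` helper are assumptions made for illustration, not part of the assignment. It is written to run under either Python 2 or Python 3.

```python
import csv

# Three rows copied from the printed output above. The header line is an
# assumption: it reuses the column names assigned in question 2.
lines = [
    'id,survived,class,name,sex,age,sibSp,parch,ticket,fare,cabin,embarked',
    '1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S',
    '2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C',
    '6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q',
]


def parse_row(row):
    """Hypothetical helper: convert the string fields we need into numbers.

    An empty age string becomes None so that it cannot be mistaken for 0.
    """
    return {
        'survived': int(row['survived']),
        'class': int(row['class']),
        'sex': row['sex'],
        'age': float(row['age']) if row['age'] else None,
        'fare': float(row['fare']),
    }


# csv.DictReader accepts any iterable of lines, so no file is needed here.
passengers_typed = [parse_row(row) for row in csv.DictReader(lines)]
for passenger in passengers_typed:
    print(passenger)
```

Keeping a missing age as `None` rather than `0` matters for question 7, where a zero would silently pull the average down.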

## Homework Questions

**1) How many passengers are in our passenger list? For this assignment, we'll assume our dataset represents the full passenger list for the Titanic.**

In [17]:
# Your code here for num_passengers. Uncomment the line below to format your answer.
passengers, attributes = data.shape

print 'There are ' + str(passengers) + ' passengers in our list.'

There are 891 passengers in our list.


**2) What is the overall survival rate (as a percentage of total passengers in our list)?**

In [140]:
#HINT: You may want to import another module here
import pandas as pd

#Your code here to calculate num_survived and survival_rate. 
#Again, we've provided some code you can use to format your answer.
titanic = pd.DataFrame(data=data)

titanic.columns = ['id', 'survived', 'class', 'name', 'sex', 'age', 'sibSp', 'parch', 'ticket', 'fare', 'cabin', 'embarked']
titanic
num_survived = titanic['survived'][titanic['survived'] == '1'].count()
survival_rate = float(num_survived)/passengers

print 'Of ' + str(passengers) + ' passengers, ' + str(num_survived) + ' survived.'
print 'The survival rate is ' + str(survival_rate) + ' or '+ str("{0:.0f}%".format(survival_rate * 100))

Of 891 passengers, 342 survived.
The survival rate is 0.383838383838 or 38%


**3) How many male passengers were onboard?**

In [111]:
# Your code here to calculate num_males
num_males = titanic['sex'][titanic['sex'] == 'male'].count()


print 'There were ' + str(num_males) + ' male passengers onboard.'

There were 577 male passengers onboard.


**4) How many female passengers were onboard?**

In [112]:
# Your code here to calculate num_females
num_females = titanic['sex'][titanic['sex'] == 'female'].count()


print 'There were ' + str(num_females) + ' female passengers onboard.'

There were 314 female passengers onboard.


**5) What is the overall survival rate of male passengers?**

In [113]:
# Your code here to calculate male_survival_rate
num_males_survived = titanic['sex'][(titanic['sex'] == 'male') & (titanic['survived'] == '1')].count()

male_survival_rate = float(num_males_survived)/num_males

print 'The survival rate of male passengers was ' + str("{0:.2f}%".format(male_survival_rate * 100))

The survival rate of male passengers was 18.89%


**6) What is the overall survival rate of female passengers?**

In [117]:
# Your code here to calculate female_survival_rate
num_females_survived = titanic['sex'][(titanic['sex'] == 'female') & (titanic['survived'] == '1')].count()

female_survival_rate = float(num_females_survived)/num_females

print 'The survival rate of female passengers was ' + str("{0:.2f}%".format(female_survival_rate * 100))

The survival rate of female passengers was 74.20%


**7) What is the average age of all passengers onboard?**

In [183]:
# Your code here to calculate avg_age
# HINT: This may be trickier than it first seems. Look at the age values in the raw data file.
# Ages are stored as strings and missing ages are empty strings, so coerce them to numbers (NaN if missing).
passengers_ages = pd.to_numeric(titanic['age'], errors='coerce')
average_known_age = passengers_ages.mean()  # mean() skips NaN values
# Fill the missing ages with the mean of the known ages.
cleaned_ages = passengers_ages.fillna(average_known_age)
average_age = cleaned_ages.mean()

print 'The average age of all passengers onboard was ' + str("{0:.2f}".format(average_age))

The average age of all passengers onboard was 29.70
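
Filling the missing ages with the mean of the known ages leaves the overall mean unchanged, so the figure above is effectively the average of the known ages. A small sketch of that arithmetic, using the ages of the first eight passengers from the printed data (one of them missing); the variable names are illustrative only, and the code runs under either Python 2 or Python 3:

```python
# Ages of the first eight passengers in the printed output; '' marks a missing age.
raw_ages = ['22', '38', '26', '35', '35', '', '54', '2']

# Mean of the known ages only.
known_ages = [float(a) for a in raw_ages if a != '']
mean_known = sum(known_ages) / len(known_ages)

# Mean imputation: replace each missing age with the mean of the known ages.
filled_ages = [float(a) if a != '' else mean_known for a in raw_ages]
mean_filled = sum(filled_ages) / len(filled_ages)

print(len(raw_ages) - len(known_ages))       # number of missing ages
print(abs(mean_filled - mean_known) < 1e-9)  # the two means agree
```

Mean imputation does still change the column in other ways (it narrows the spread of ages), which matters for later modelling but not for the average itself.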




**8) What is the average age of passengers who survived?**

In [187]:
# Your code here to calculate avg_age_survived
# Coerce the survivors' ages to numbers; missing ages (empty strings) become NaN.
passengers_ages_survived = pd.to_numeric(titanic['age'][titanic.survived == '1'], errors='coerce')
average_known_age_survived = passengers_ages_survived.mean()
cleaned_ages_survived = passengers_ages_survived.fillna(average_known_age_survived)
average_age_survived = cleaned_ages_survived.mean()

print 'The average age of passengers who survived was ' + str("{0:.2f}".format(average_age_survived))

The average age of passengers who survived was 28.34




**9) What is the average age of passengers who did not survive?**

In [189]:
# Your code here to calculate avg_age_not_survive
# Coerce the ages of passengers who died to numbers; missing ages (empty strings) become NaN.
passengers_ages_died = pd.to_numeric(titanic['age'][titanic.survived == '0'], errors='coerce')
average_known_age_died = passengers_ages_died.mean()
cleaned_ages_died = passengers_ages_died.fillna(average_known_age_died)
average_age_died = cleaned_ages_died.mean()

print 'The average age of passengers who did not survive was ' + str("{0:.2f}".format(average_age_died))

The average age of passengers who did not survive was 30.63




**10) At this (early) point in our analysis, what might you infer about any patterns you are seeing in who survived / did not survive?**

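Female passengers were much more likely to survive than male passengers: 74.20% of women survived, compared with 18.89% of men. Age appears to matter far less, since survivors were only slightly younger on average (28.34 years) than those who did not survive (30.63 years).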

**11) How many passengers are in each of the three classes of service (e.g. First, Second, Third?)**

In [191]:
# Your code here
# You do not need to format the output nicely like we did for the questions above.
num_first = titanic['class'][titanic['class'] == '1'].count()
num_second = titanic['class'][titanic['class'] == '2'].count()
num_third = titanic['class'][titanic['class'] == '3'].count()
print 'In first class, there were {0} people'.format(num_first)
print 'In second class, there were {0} people'.format(num_second)
print 'In third class, there were {0} people'.format(num_third)

In first class, there were 216 people
In second class, there were 184 people
In third class, there were 491 people


**12) What is the survival rate for passengers in each of the three classes of service?**

In [212]:
# Your code here
# HINT: Averaging of the 1's and 0's in the "Survived" column will give a survival rate.
# We want the survival rate for each passenger class.
# You do not need to format the output nicely.

# The labels are strings, so convert them to numbers before averaging.
num_first_survival_rate = pd.to_numeric(titanic['survived'][titanic['class'] == '1'], errors='coerce').mean()
num_second_survival_rate = pd.to_numeric(titanic['survived'][titanic['class'] == '2'], errors='coerce').mean()
num_third_survival_rate = pd.to_numeric(titanic['survived'][titanic['class'] == '3'], errors='coerce').mean()
print 'In first class, the survival rate was {0:.2f}%'.format(num_first_survival_rate*100)
print 'In second class, the survival rate was {0:.2f}%'.format(num_second_survival_rate*100)
print 'In third class, the survival rate was {0:.2f}%'.format(num_third_survival_rate*100)


In first class, the survival rate was 62.96%
In second class, the survival rate was 47.28%
In third class, the survival rate was 24.24%
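
The hint in question 12, that averaging the 1's and 0's gives a survival rate, extends naturally to groups. The sketch below shows one way to do that with `groupby`, using only the first eight passengers from the printed data, so the resulting rates illustrate the technique and say nothing about the full list. The `sample` DataFrame and the variable names are assumptions for illustration; the code runs under either Python 2 or Python 3.

```python
import pandas as pd

# First eight passengers from the printed output (survived, class, sex only).
sample = pd.DataFrame({
    'survived': ['0', '1', '1', '1', '0', '0', '0', '0'],
    'class':    ['3', '1', '3', '1', '3', '3', '1', '3'],
    'sex':      ['male', 'female', 'female', 'female',
                 'male', 'male', 'male', 'male'],
})

# The labels are strings, so convert them to integers before averaging.
sample['survived'] = sample['survived'].astype(int)

# The mean of a 0/1 column within each group is that group's survival rate.
rate_by_class = sample.groupby('class')['survived'].mean()
rate_by_sex = sample.groupby('sex')['survived'].mean()

print(rate_by_class)
print(rate_by_sex)
```

Grouping this way avoids writing one filter per class or per sex, and the same call works unchanged if a category is added.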




**13) How does what we learned in 11) and 12) influence our early conclusions from 10) ?**

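Passenger class is also a good predictor of survival: the higher the class number, the lower the survival rate (62.96% in first class, 47.28% in second class, and 24.24% in third class). This adds a second strong pattern alongside the difference between the sexes noted in question 10.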

**14) If we were to build a predictive model, which features in the data do you think we should include in the model and which can we leave out? Why?**

I think we should include sex, passenger class, and parch. We can leave out embarked, since the port where a passenger boarded should not matter once everyone was aboard the ship as it sank. Fare can also be left out: it correlates so highly with passenger class that it adds little information beyond what class already provides.
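
The claim that fare tracks passenger class can be checked rather than assumed. As a rough sketch, the block below computes a Pearson correlation between class and fare for the first eight passengers in the printed data. Eight rows are far too few to settle the question for the full list, and the `pearson` helper is a hypothetical name written here for illustration. The code runs under either Python 2 or Python 3.

```python
import math

# Class and fare of the first eight passengers in the printed output above.
classes = [3, 1, 3, 1, 3, 3, 1, 3]
fares = [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583, 51.8625, 21.075]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


r = pearson(classes, fares)
print(round(r, 2))
```

On these eight rows the coefficient is strongly negative (a higher class number goes with a lower fare). Running the same check on all rows would show whether fare is truly redundant with class.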