# DAT 19: Homework 1 Assignment

## Instructions

In this assignment, we will explore the passenger list of the Titanic, as provided in a well-known [Kaggle](https://www.kaggle.com) competition. For this assignment, we are concerned only with initial exploration. Please answer the specific questions listed below.

The dataset is a list of passengers. The second column of the dataset is the label for each person indicating whether that person survived (1) or did not survive (0). The data is described in more detail below.

There is no need to download the data from Kaggle. We have provided the `titanic.csv` dataset for you in the same directory as this assignment.

Please do all your analysis to answer the questions below in this Jupyter notebook. Show your work. Questions 10, 13, and 14 ask for your thoughts as a few sentences (regular prose, not code :-).

**Please submit your completed notebook by 11:59PM on Thursday, December 17.**

## About the Data

```
VARIABLE DESCRIPTIONS:
survival        Survival
                (0 = No; 1 = Yes)
pclass          Passenger Class
                (1 = 1st; 2 = 2nd; 3 = 3rd)
name            Name
sex             Sex
age             Age
sibsp           Number of Siblings/Spouses Aboard
parch           Number of Parents/Children Aboard
ticket          Ticket Number
fare            Passenger Fare
cabin           Cabin
embarked        Port of Embarkation
                (C = Cherbourg; Q = Queenstown; S = Southampton)

SPECIAL NOTES:
Pclass is a proxy for socio-economic status (SES)
 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

Age is in Years; Fractional if Age less than One (1)
 If the Age is Estimated, it is in the form xx.5

With respect to the family relation variables (i.e. sibsp and parch)
some relations were ignored.  The following are the definitions used
for sibsp and parch.

Sibling:  Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
Spouse:   Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
Parent:   Mother or Father of Passenger Aboard Titanic
Child:    Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

Other family relatives excluded from this study include cousins,
nephews/nieces, aunts/uncles, and in-laws.  Some children travelled
only with a nanny, therefore parch=0 for them.  As well, some
travelled with very close friends or neighbors in a village, however,
the definitions do not support such relations.
```

### Load Data

Start by loading the Titanic data into a numpy array. The code to do this is included below to get you started.

In [19]:
# We'll use the csv module from the Python standard library to read each row of our input data file
#   into a list that we then convert into a numpy array.

import csv
import numpy as np

titanic = None
columns = []
with open('titanic.csv') as data:
    reader = list(csv.reader(data, delimiter=','))
    columns = reader[0]
    without_header = reader[1:]

    titanic = np.array(without_header)  # Turn our Python list into a numpy array

    # Let's see what our data looks like. Notice what type the data appears to be inside the numpy array.
    print(type(titanic))
    print(titanic.shape)

    print(titanic[:10, :])
    print('\n' + '...')
    print(titanic[-1, :])

<class 'numpy.ndarray'>
(891, 12)
[['1' '0' '3' 'Braund, Mr. Owen Harris' 'male' '22' '1' '0' 'A/5 21171'
  '7.25' '' 'S']
 ['2' '1' '1' 'Cumings, Mrs. John Bradley (Florence Briggs Thayer)'
  'female' '38' '1' '0' 'PC 17599' '71.2833' 'C85' 'C']
 ['3' '1' '3' 'Heikkinen, Miss. Laina' 'female' '26' '0' '0'
  'STON/O2. 3101282' '7.925' '' 'S']
 ['4' '1' '1' 'Futrelle, Mrs. Jacques Heath (Lily May Peel)' 'female' '35'
  '1' '0' '113803' '53.1' 'C123' 'S']
 ['5' '0' '3' 'Allen, Mr. William Henry' 'male' '35' '0' '0' '373450'
  '8.05' '' 'S']
 ['6' '0' '3' 'Moran, Mr. James' 'male' '' '0' '0' '330877' '8.4583' '' 'Q']
 ['7' '0' '1' 'McCarthy, Mr. Timothy J' 'male' '54' '0' '0' '17463'
  '51.8625' 'E46' 'S']
 ['8' '0' '3' 'Palsson, Master. Gosta Leonard' 'male' '2' '3' '1' '349909'
  '21.075' '' 'S']
 ['9' '1' '3' 'Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)' 'female'
  '27' '0' '2' '347742' '11.1333' '' 'S']
 ['10' '1' '2' 'Nasser, Mrs. Nicholas (Adele Achem)' 'female' '14' '1' '0'
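Every field in the printed array is quoted, which suggests that the CSV reader returns strings and numpy stores them as a Unicode string dtype. The sketch below illustrates what that implies on a tiny in-memory CSV. The three data rows reuse values from the output above; the shortened header and the names `sample_csv`, `sample`, `fare_values`, and `known_ages` are assumptions introduced here for illustration only.

```python
import csv
import io

import numpy as np

# Illustrative in-memory CSV: values taken from rows 1, 5, and 6 above,
# with a shortened, hypothetical header.
sample_csv = io.StringIO(
    'id,survived,class,sex,age,fare\n'
    '1,0,3,male,22,7.25\n'
    '5,0,3,male,35,8.05\n'
    '6,0,3,male,,8.4583\n'
)

rows = list(csv.reader(sample_csv))
header, body = rows[0], rows[1:]
sample = np.array(body)

print(sample.dtype.kind)  # 'U': every cell is a Unicode string

# Numeric columns must be converted explicitly before doing arithmetic.
fare_values = sample[:, header.index('fare')].astype(float)
print(fare_values.mean())

# Missing values arrive as empty strings, which cannot be parsed as floats,
# so filter them out (or replace them) before converting.
age_column = sample[:, header.index('age')]
known_ages = age_column[age_column != ''].astype(float)
print(known_ages)
```

Comparisons such as `titanic[:, 1] == '1'` therefore need a string on the right-hand side until the column has been converted.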
 

## Homework Questions

**1) How many passengers are in our passenger list? For this assignment, we’ll assume our dataset represents the full passenger list for the Titanic.**

In [38]:
# Each row of the array is one passenger, so the row count is the passenger count.
# Index the shape directly: unpacking it into `rows, columns` would overwrite the
# `columns` list of header names created when the data was loaded.
num_passengers = titanic.shape[0]
print('There are {num_passengers} passengers in our titanic dataset'.format(num_passengers=num_passengers))

There are 891 passengers in our titanic dataset


**2) What is the overall survival rate (as a percentage of total passengers in our list)?**

In [47]:
#HINT: You may want to import another module here
import pandas as pd
titanic = pd.DataFrame(data=titanic)
titanic.columns = ['id', 'survived', 'class', 'name', 'sex', 'age', 'sibSp', 'parch', 'ticket', 'fare', 'cabin', 'embarked']
# The values are still strings, so compare against '1'; summing the boolean mask counts the True entries.
num_survived = (titanic['survived'] == '1').sum()
survival_rate = num_survived / num_passengers

print('Of {num_passengers} passengers, {num_survived} survived.'.format(num_passengers=num_passengers, num_survived=num_survived)) 
print('The survival rate was {survival_rate:.2f} or {percent_survival_rate:.2%}'.format(survival_rate=survival_rate, percent_survival_rate=survival_rate))


Of 891 passengers, 342 survived.
The survival rate was 0.38 or 38.38%


**3) How many male passengers were onboard?**

In [58]:
num_males = len(titanic[titanic['sex'] == 'male'])

print('There were {num_males} male passengers onboard the Titanic'.format(num_males=num_males))



There were 577 male passengers onboard the Titanic


**4) How many female passengers were onboard?**

In [57]:
num_females = len(titanic[titanic['sex'] == 'female'])

print('There were {num_females} female passengers onboard the Titanic'.format(num_females=num_females))

There were 314 female passengers onboard the Titanic


**5) What is the overall survival rate of male passengers?**

In [66]:
male_survival_rate = len(titanic[(titanic['sex'] == 'male') & (titanic['survived'] == '1')])/num_males

print('The survival rate of male passengers was {male_survival_rate:.2%}'.format(male_survival_rate=male_survival_rate))

The survival rate of male passengers was 18.89%


**6) What is the overall survival rate of female passengers?**

In [67]:
female_survival_rate = len(titanic[(titanic['sex'] == 'female') & (titanic['survived'] == '1')])/num_females

print('The survival rate of female passengers was {female_survival_rate:.2%}'.format(female_survival_rate=female_survival_rate))

The survival rate of female passengers was 74.20%


**7) What is the average age of all passengers onboard?**

In [121]:
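# Missing ages are stored as empty strings (not NaN), so fillna() would not catch them.
# Keep only the non-empty values and convert them to floats.
cleaned_ages = [float(x) for x in titanic['age'] if x]
# Divide by the number of passengers with a recorded age, not by num_passengers;
# otherwise every missing age is silently counted as 0 and drags the mean down.
average_age = sum(cleaned_ages) / len(cleaned_ages)

print('The average age of all passengers onboard was {average_age:.2f}'.format(average_age=average_age))

The average age of all passengers onboard was 29.70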
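Because missing ages appear as empty strings in this dataset, the choice of denominator matters: a mean of the recorded ages should divide by the number of passengers whose age is known. One possible approach, sketched here on toy data rather than the real file, is to let pandas mark the gaps as NaN and skip them. The `toy` frame and its values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the 'age' column: strings, with '' marking a missing age.
toy = pd.DataFrame({'age': ['22', '38', '', '35', '']})

# errors='coerce' turns anything that cannot be parsed (including '') into NaN.
toy['age'] = pd.to_numeric(toy['age'], errors='coerce')

known = toy['age'].notna().sum()  # number of rows with a recorded age
mean_age = toy['age'].mean()      # NaN values are skipped by default

print('{known} known ages, mean {mean_age:.2f}'.format(known=known, mean_age=mean_age))
```

Whether to drop missing ages or impute them is a modelling decision; this sketch only shows the drop-and-average option.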


**8) What is the average age of passengers who survived?**

In [None]:
# Your code here to calculate avg_age_survived


#print('The average age of passengers who survived was {0:.2f}'.format(avg_age_survived))

**9) What is the average age of passengers who did not survive?**

In [None]:
# Your code here to calculate avg_age_not_survive


#print('The average age of passengers who did not survive was {0:.2f}'.format(avg_age_not_survive))
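Questions 8 and 9 both come down to filtering rows by the survival label before averaging. A minimal sketch on made-up data, assuming the column names `survived` and `age` and string-valued cells as in the array loaded earlier, might look like this:

```python
import pandas as pd

# Toy data; the values are invented and stored as strings, like the real array.
toy = pd.DataFrame({
    'survived': ['1', '0', '1', '0', '0'],
    'age':      ['30', '40', '', '20', '60'],
})
toy['age'] = pd.to_numeric(toy['age'], errors='coerce')  # '' becomes NaN

# A boolean mask selects each group; mean() ignores the NaN ages.
avg_age_survived = toy.loc[toy['survived'] == '1', 'age'].mean()
avg_age_not_survive = toy.loc[toy['survived'] == '0', 'age'].mean()

print('{0:.2f} {1:.2f}'.format(avg_age_survived, avg_age_not_survive))
```

The same masks can be reused for any other per-group statistic, such as a median or a count.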

**10) At this (early) point in our analysis, what might you infer about any patterns you are seeing in who survived / did not survive?**

_Your answer here as markdown_

**11) How many passengers are in each of the three classes of service (First, Second, and Third)?**

In [None]:
# Your code here
# You do not need to format the output nicely like we did for the questions above.
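Counting rows per category is a frequency table. One way to sketch it, using a toy column named `class` (an assumed name, with invented values), is pandas' `Series.value_counts`:

```python
import pandas as pd

# Toy passenger classes, stored as strings like the values in the loaded array.
toy = pd.DataFrame({'class': ['3', '1', '3', '2', '3', '1']})

# value_counts() tallies each distinct value; sort_index() orders the result by class.
class_counts = toy['class'].value_counts().sort_index()
print(class_counts)
```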



**12) What is the survival rate for passengers in each of the three classes of service?**

In [None]:
# Your code here
# HINT: Averaging the 1s and 0s in the "Survived" column gives a survival rate.
# (If the values are still strings, convert them to integers first.)
# We want the survival rate for each passenger class.
# You do not need to format the output nicely.
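The hint works because the mean of a 0/1 column equals the fraction of ones. A small sketch of that idea applied per group, on invented data with assumed column names `class` and `survived`:

```python
import pandas as pd

# Toy data; labels are strings, as they are after loading the CSV with the csv module.
toy = pd.DataFrame({
    'class':    ['1', '1', '2', '3', '3', '3'],
    'survived': ['1', '1', '0', '1', '0', '0'],
})

# Convert the labels to integers so they can be averaged.
toy['survived'] = toy['survived'].astype(int)

# The mean of 0/1 labels within each class is that class's survival rate.
rate_by_class = toy.groupby('class')['survived'].mean()
print(rate_by_class)
```

Comparing these rates with the class sizes from question 11 helps judge how much weight each group's rate deserves.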



**13) How does what we learned in 11) and 12) influence our early conclusions from 10)?**

_Your answer here as markdown_

**14) If we were to build a predictive model, which features in the data do you think we should include in the model and which can we leave out? Why?**

_Your answer here as markdown_