In [None]:
#allows us to see plots in the notebook
%matplotlib inline

#we will need to make use of several popular python libraries
import matplotlib   #plotting
import numpy as np  #fast (large) array math
import pandas as pd #statistics
import matplotlib.pyplot as plt
import seaborn as sns #makes plotting much prettier

# Titanic
This [Jupyter](http://jupyter.org/) notebook analyzes a Titanic dataset from [Kaggle](https://www.kaggle.com/c/titanic). If you have never used a notebook like this before,
start by clicking on `Help` -> `User Interface Tour`. After that, execute the Python
cells one at a time. Check out the [seaborn API](https://stanford.edu/~mwaskom/software/seaborn/api.html) for some of the more attractive visualizations.

In [None]:
#step one is to read the data from the CSV file, using pandas
titanic = pd.read_csv('titanic_data.csv')
type(titanic)

In [None]:
#A DataFrame is a python object that models a table
titanic

In [None]:
#it consists of a bunch of columns, or Series objects
print(type(titanic.Name))
titanic.Name
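
To make the DataFrame/Series relationship concrete, here is a minimal sketch on a tiny hand-made table. The `toy` frame and its values are made up for illustration and are not part of the Titanic data. It shows that attribute access (`toy.Name`) and bracket access (`toy["Name"]`) return the same Series.

```python
import pandas as pd

# Hypothetical three-row table; the values are made up for illustration.
toy = pd.DataFrame({
    "Name": ["Passenger A", "Passenger B", "Passenger C"],
    "Age": [22.0, 38.0, None],
})

by_attribute = toy.Name    # attribute access, as used in this notebook
by_bracket = toy["Name"]   # bracket access, works for any column label

print(type(by_bracket))                 # each column is a pandas Series
print(by_attribute.equals(by_bracket))  # both spellings give the same column
print(toy.dtypes)                       # every column carries its own dtype
```

Bracket access is the safer habit when a column name contains spaces or collides with a DataFrame method name.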

In [None]:
#we can use the plotting built into pandas to get an attractive histogram
titanic.Age.plot.hist(bins=6, title="Distribution of Age on the Titanic (years)")

In [None]:
#or we can use seaborn's lovely "distplot"
#(distplot is deprecated as of seaborn 0.11; newer versions offer histplot and displot)
sns.distplot(titanic.Age.dropna(), bins=10, axlabel="Distribution of Age on the Titanic (years)")

In [None]:
#the describe() method on Series and DataFrames
#returns summary statistics
titanic.Age.describe()

The distribution of age on the Titanic is roughly symmetric, with $\overline{x}\approx29.7$ years and a standard deviation of approximately 14.5 years. There appears to be a small second mode representing young children on the Titanic.
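
One way to put numbers behind a "roughly symmetric" reading is to compare the mean with the median and look at the sample skewness. The sketch below is only an illustration: `symmetry_check` is a hypothetical helper, and `toy_ages` holds made-up values rather than the Titanic ages.

```python
import pandas as pd

def symmetry_check(ages: pd.Series) -> dict:
    """Return a few numbers that help judge whether a distribution is symmetric."""
    ages = ages.dropna()  # ignore missing values, as describe() does
    return {
        "mean": ages.mean(),
        "median": ages.median(),
        "std": ages.std(),    # sample standard deviation (ddof=1), as in describe()
        "skew": ages.skew(),  # near 0 suggests symmetry; > 0 a longer right tail
    }

# Hypothetical ages, made up for illustration only (not the Titanic data).
toy_ages = pd.Series([2, 4, 22, 24, 28, 30, 35, 40, None, 58])
result = symmetry_check(toy_ages)
print(result)
```

With the real data loaded, the same helper could be called as `symmetry_check(titanic.Age)`. A mean close to the median and a skewness near zero would support the description above.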

In [None]:
#You may be curious about the dropna() method we called
#on the series object. That was necessary because there
#are some null values for the age variable.
titanic.Age

In [None]:
len(titanic.Age), len(titanic.Age.dropna())
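
The difference between those two lengths is the number of missing ages. As a small sketch on made-up values (the `ages` Series below is hypothetical), here are a few equivalent ways to count nulls, plus a reminder that pandas reductions such as `mean()` skip `NaN` by default.

```python
import numpy as np
import pandas as pd

# Hypothetical values, made up for illustration; two of the five are missing.
ages = pd.Series([22.0, np.nan, 38.0, np.nan, 4.0])

total_rows = len(ages)          # counts every row, including NaN
non_null = ages.count()         # counts only the non-null rows
n_missing = ages.isna().sum()   # counts only the null rows

mean_default = ages.mean()          # skipna=True by default, so NaN is ignored
mean_dropna = ages.dropna().mean()  # dropping NaN first gives the same answer

print(total_rows, non_null, n_missing)
print(mean_default, mean_dropna)
```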

In [None]:
#Seaborn can also make some nice boxplots for us
sns.boxplot(x='Pclass', y='Age', data=titanic)

In [None]:
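#split the ages by survival; where() keeps the original length
#and puts NaN in every row that fails the condition
age_of_survived = titanic.Age.where(titanic.Survived == 1)
age_of_deceased = titanic.where(titanic.Survived == 0).Age #same idea, applied to the whole DataFrame first
sns.boxplot(data=[age_of_survived, age_of_deceased])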
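
The `where` calls above keep every row and blank out the ones that fail the condition, which is different from filtering with a boolean mask. Here is a minimal sketch of the difference on a made-up table (the `toy` frame is hypothetical).

```python
import pandas as pd

# Hypothetical four-row table; the values are made up for illustration.
toy = pd.DataFrame({
    "Survived": [1, 0, 1, 0],
    "Age": [22.0, 38.0, 4.0, 61.0],
})

# where() keeps all four rows and writes NaN where the condition is False
kept_shape = toy.Age.where(toy.Survived == 1)

# boolean indexing keeps only the rows where the condition is True
filtered = toy.loc[toy.Survived == 1, "Age"]

print(kept_shape)
print(filtered)
```

Both results hold the same non-null values, so either can feed a boxplot. Another option, assuming the same column names as in this notebook, is to skip the split and let seaborn group the rows: `sns.boxplot(x='Survived', y='Age', data=titanic)`.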

In [None]:
#count passengers by sex, split by survival (0 = died, 1 = survived)
sns.countplot(x='Sex', hue='Survived', data=titanic)

In [None]:
#the same counts as the previous cell, grouped by survival and split by sex
sns.countplot(x='Survived', hue='Sex', data=titanic)
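
The bar heights in the two count plots are the cells of a two-way frequency table. As a sketch on made-up rows (the `toy` frame is hypothetical), `pd.crosstab` produces that table directly, and `normalize="index"` turns each row into proportions.

```python
import pandas as pd

# Hypothetical five-row table; the values are made up for illustration.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female"],
    "Survived": [0, 1, 1, 1, 0],
})

# raw counts: one row per sex, one column per survival outcome
counts = pd.crosstab(toy.Sex, toy.Survived)

# row-wise proportions: each row sums to 1
rates = pd.crosstab(toy.Sex, toy.Survived, normalize="index")

print(counts)
print(rates)
```

With the real data, `pd.crosstab(titanic.Sex, titanic.Survived)` would give the numbers behind the plots above.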

In [None]:
#help() prints the docstring for any function
help(sns.distplot)