April 21, 2016 - Women in Data Science Meetup - "Data Science from Scratch" Workshop #2

# Exploring the Titanic Data set with Pandas:

This is an exploration of the Titanic data set, which is available at https://www.kaggle.com/c/titanic/data. This pandas notebook was forked from https://github.com/TarekDib03/titanic-EDA/blob/master/Titanic%20-%20Project.ipynb and modified for this workshop.

### Import Libraries

In [None]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import matplotlib as mpl
import matplotlib.pyplot as plt
#import seaborn as sns 
%matplotlib inline

# Set the default matplotlib figure size
plt.rcParams['figure.figsize'] = (10.0, 8.0)

### Install Watermark - tool to help with reproducibility:

Always use this tool to document what versions of packages were used and the machine that was used.  Make sure to include the packages that were imported in the previous section.  <a href='http://sebastianraschka.com/'>Sebastian Raschka</a> is the author of it - his site is a great resource for IPython/Python and other Machine Learning topics!  Thank you Sebastian! 

In [None]:
# %install_ext is no longer available in newer IPython releases; watermark can be installed from PyPI instead
!pip install watermark

In [None]:
%load_ext watermark

In [None]:
%watermark -n -t -z -u -m -w -v -p matplotlib,numpy,conda

If any of these libraries are not available, install them using "conda install ____" at the command prompt.

## Reading Data Set using Pandas

Make sure your train.csv file is in the same directory as this IPython Notebook.

In [None]:
titanic_df = pd.read_csv('train.csv')

## <font color='blue'>You've read in the training data, which you can think of as being organized kind of like a spreadsheet.  In Pandas, this data structure is called a Data Frame.</font>

## <font color='blue'>Q:  What are some of the first things you would want to know about your Data Frame?</font> 

## <font color='blue'>Talk with your group and post your group's answers to the Slack channel.</font>

## How to Interact with a Data Frame (df)

In [None]:
#We can refer to a column of data as:
names1 = titanic_df.Name

In [None]:
#or as below:
names2 = titanic_df['Name']

In [None]:
#Both give you the same thing.
print(names1.head())
print(names2.head())
print(titanic_df['Name'].head())

In [None]:
#Refer to a row by position with .iloc (.ix has been removed from pandas):
titanic_df.iloc[0] #for first row

In [None]:
#Use python slicing techniques from last time
#for first 6 rows of column 4 (the end of an .iloc slice is excluded):
titanic_df.iloc[:6,4]

In [None]:
#Also, you can call and manipulate a df at the same time:
kids = titanic_df[titanic_df['Age'] < 20]

#You can make this more complex:
alivekids = titanic_df[(titanic_df.Age < 20) & (titanic_df.Survived == 1)]
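
To make the selection step above more concrete, here is a minimal sketch of what happens under the hood. The `toy` frame and its values are made up for illustration; only the column names match the Titanic data.

```python
import pandas as pd

# Toy data standing in for a few Titanic rows (values are made up).
toy = pd.DataFrame({
    'Age': [4.0, 35.0, 17.0, None],
    'Survived': [1, 0, 0, 1],
})

# A comparison on a column returns a boolean Series, one entry per row.
# Rows with a missing Age compare as False.
is_kid = toy['Age'] < 20

# Combine masks with & (and), | (or), ~ (not); wrap each condition in parentheses.
kid_survivors = toy[is_kid & (toy['Survived'] == 1)]

print(is_kid.tolist())
print(len(kid_survivors))
```

Passing the boolean Series inside the square brackets keeps only the rows where it is True.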

## <font color='blue'>Try this yourself.  Make one or two new variables by selecting on two different features.</font>

# Making plots and graphs

## <font color='blue'>Let's make some graphs with these variables using matplotlib.</font>

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(kids.Age,kids.Fare,marker='o',color='b')
ax.scatter(alivekids.Age, alivekids.Fare,marker='x',color='r')

## <font color='blue'>We can group our data to look at specific trends, and use the Pandas plotting method.</font>

In [None]:
groupkids = kids.groupby('Age')

In [None]:
#Here we want the total number of kids at each age to make a bar graph.
#count() gives the number of rows per age; sum() would add up the ages instead.
groupkids['Age'].count().plot(kind='bar')

In [None]:
#Or a horizontal bar graph.
groupkids['Age'].count().plot(kind='barh')

In [None]:
#instead of writing as above, we could save ourselves a step and write:
kids.groupby('Age')['Age'].count().plot(kind='barh')
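
When counting group members it is easy to mix up `count()` and `sum()`. The sketch below uses a made-up `toy_kids` frame (not the workshop data) to show how the two differ on the same grouping.

```python
import pandas as pd

# Made-up ages: three 2-year-olds and two 5-year-olds.
toy_kids = pd.DataFrame({'Age': [2, 2, 2, 5, 5]})

by_age = toy_kids.groupby('Age')['Age']

counts = by_age.count()  # number of rows in each age group
sums = by_age.sum()      # ages added together within each group

print(counts.tolist())
print(sums.tolist())
```

For a "how many kids at each age" bar graph, the counts are the quantity to plot.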

## <font color='blue'>Make a graph with one of the variables that you defined above.</font>

## <font color='blue'>All of the plotting that we did above used matplotlib, including the Pandas plotting method.  However, seaborn is another plotting package that's prettier and has extra built-in functions!</font>

## To use seaborn, go back up to the import statements and uncomment the "import seaborn" line.  You should also add seaborn to your list of watermarks.

In [None]:
# Number of passengers in each class
titanic_df.groupby('Pclass')['Pclass'].count()

In [None]:
# Instead of a group by, use seaborn to plot the count of passengers in each class
# (factorplot was renamed to catplot in seaborn 0.9, and x must be passed by keyword)
fg = sns.catplot(x='Pclass', data=titanic_df, kind='count', aspect=1.5)
fg.set_xlabels('Class')

## <font color='blue'>Count the number of passengers of each sex now.</font>

## <font color='blue'>Instead of a group by, use seaborn to plot the number of males and females.</font>

## <font color='blue'>Describe the results you found in the slack channel.</font>

In [None]:
# Number of men and women in each of the passenger class.
titanic_df.groupby(['Sex', 'Pclass'])['Sex'].count()

In [None]:
# Again use seaborn to group by Sex and class
g = sns.catplot(x='Pclass', data=titanic_df, hue='Sex', kind='count', aspect=1.75)
g.set_xlabels('Class')

As shown in the figure above, there are more than twice as many males as females in class 3. However, in classes 1
and 2, the ratio of males to females is much closer to 1.
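
A ratio like the one described above can also be computed directly rather than read off the plot. This is a sketch on a made-up `toy` frame; with the real data you would pass `titanic_df` columns instead.

```python
import pandas as pd

# Made-up passengers; only the column names match the Titanic data.
toy = pd.DataFrame({
    'Pclass': [1, 1, 2, 2, 3, 3, 3],
    'Sex': ['male', 'female', 'male', 'female', 'male', 'male', 'female'],
})

# Rows are classes, columns are sexes, cells are passenger counts.
counts = pd.crosstab(toy['Pclass'], toy['Sex'])

# Male-to-female ratio in each class.
ratio = counts['male'] / counts['female']
print(ratio)
```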

# Pivot tables

## <font color='blue'>Write a groupby statement to find the number of men and women in each passenger class who survived.</font>

In [None]:
# We can represent this same information another way using pivot_table.

# Number of passengers who survived in each class, grouped by sex. margins=True also adds the totals for each row and column.
titanic_df.pivot_table('Survived', 'Sex', 'Pclass', aggfunc=np.sum, margins=True)
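
To see how `pivot_table` relates to the groupby statements used earlier, the sketch below builds the same table both ways on a made-up `toy` frame. The keyword names (`values`, `index`, `columns`) are spelled out here to make the three positional arguments above easier to read.

```python
import pandas as pd

# Made-up passengers; only the column names match the Titanic data.
toy = pd.DataFrame({
    'Sex': ['female', 'female', 'male', 'male', 'male'],
    'Pclass': [1, 3, 1, 3, 3],
    'Survived': [1, 1, 0, 1, 0],
})

# Survivors per sex (rows) and class (columns).
pivoted = toy.pivot_table(values='Survived', index='Sex', columns='Pclass', aggfunc='sum')

# The same numbers from a groupby, with Pclass moved into the columns.
grouped = toy.groupby(['Sex', 'Pclass'])['Survived'].sum().unstack()

print(pivoted)
print(grouped)
```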

## <font color='blue'>Create a variable for the passengers who didn't survive.</font>

## <font color='blue'>Create a factor plot of passengers who survived vs. those who didn't.</font>

## <font color='blue'>Calculate the number of passengers who didn't survive.</font>

## <font color='blue'>Change the statement below to include your variable from above to calculate the total number of passengers who didn't survive.  You will need to change 2 things.</font>

In [None]:
# Number of passengers who did not survive in each class grouped by sex.
not_survived.pivot_table('Survived', 'Sex', 'Pclass', aggfunc=len, margins=False)

# Crosstab and unstacking

In [None]:
# Passengers who survived and who didn't survive grouped by class and sex
table = pd.crosstab(index=[titanic_df.Survived,titanic_df.Pclass], columns=[titanic_df.Sex,titanic_df.Embarked])

In [None]:
table

In [None]:
table.unstack()

In [None]:
table.columns, table.index

In [None]:
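# Change name of columns
# (set_levels returns a new index, so assign it back; the inplace option was removed in newer pandas)
table.columns = table.columns.set_levels(['Female', 'Male'], level=0)
table.columns = table.columns.set_levels(['Cherbourg','Queenstown','Southampton'], level=1)
table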
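
The renaming step can be tried out on a small example first. This sketch assumes a made-up `toy` frame in which all three ports appear; `set_levels` returns a new index, which is assigned back to `table.columns`.

```python
import pandas as pd

# Made-up passengers; only the column names match the Titanic data.
toy = pd.DataFrame({
    'Survived': [0, 1, 1, 0],
    'Sex': ['male', 'female', 'female', 'male'],
    'Embarked': ['S', 'C', 'S', 'Q'],
})

table = pd.crosstab(index=toy['Survived'], columns=[toy['Sex'], toy['Embarked']])

# Level 0 holds the sexes, level 1 holds the ports (sorted as C, Q, S).
table.columns = table.columns.set_levels(['Female', 'Male'], level=0)
table.columns = table.columns.set_levels(['Cherbourg', 'Queenstown', 'Southampton'], level=1)

print(list(table.columns))
```

The new names must be given in the same order as the existing level values, so it is worth checking `table.columns` before renaming.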

# Basic statistics

In [None]:
print('Average and median age of passengers are %0.f and %0.f years old, respectively.'%(titanic_df.Age.mean(), 
                                                                          titanic_df.Age.median()))

In [None]:
titanic_df.Age.describe()

In [None]:
# Drop the records in which the passenger's age is missing
age = titanic_df['Age'].dropna()

In [None]:
# Distribution of age, with an overlay of a density plot
age_dist = sns.distplot(age)
age_dist.set_title("Distribution of Passengers' Ages")

In [None]:
# Another way to plot a histogram of ages is shown below
titanic_df['Age'].hist(bins=50)

# Modifying your data to learn more

In [None]:
# Create a function to define those who are children (less than 16)
def male_female_child(passenger):
    age, sex = passenger
    
    if age < 16:
        return 'child'
    else:
        return sex

In [None]:
titanic_df['person'] = titanic_df[['Age', 'Sex']].apply(male_female_child, axis=1)
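
As an alternative sketch to the row-by-row `apply` above, the same labels can be produced for the whole column at once with `np.where`. The `toy` frame is made up for illustration; as with the function above, a passenger with a missing age falls back to their sex.

```python
import numpy as np
import pandas as pd

# Made-up passengers, including one with a missing age.
toy = pd.DataFrame({
    'Age': [8.0, 40.0, np.nan, 15.0],
    'Sex': ['male', 'female', 'male', 'female'],
})

# Where Age < 16 use 'child', otherwise keep the value from Sex.
toy['person'] = np.where(toy['Age'] < 16, 'child', toy['Sex'])

print(toy['person'].tolist())
```

On a data set of this size either approach is fine; the vectorized version mainly pays off on much larger frames.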

In [None]:
# Let's have a look at the first 10 rows of the data frame
titanic_df[:10]

In [None]:
# Let's do a count plot of passengers split into sex, children and class
sns.catplot(x='Pclass', data=titanic_df, kind='count', hue='person', order=[1,2,3], 
            hue_order=['child','female','male'], aspect=2)

In [None]:
# Count number of men, women and children
titanic_df['person'].value_counts()

In [None]:
# Do the same as above, but split the passengers into either survived or not
# (the size argument was renamed to height in newer seaborn)
sns.catplot(x='Pclass', data=titanic_df, kind='count', hue='person', col='Survived', order=[1,2,3], 
            hue_order=['child','female','male'], aspect=1.25, height=5)

There are many more children in third class than there are in first and second class. However, one might expect that
there would be more children in 1st and 2nd class than there are in 3rd class.

Also, we can see that women and children really did seem to make it off first.

### kde plot, Distribution of Passengers' Ages

#### Grouped by Gender

In [None]:
fig = sns.FacetGrid(titanic_df, hue='Sex', aspect=4)
fig.map(sns.kdeplot, 'Age', shade=True)
oldest = titanic_df['Age'].max()
fig.set(xlim=(0,oldest))
fig.set(title='Distribution of Age Grouped by Gender')
fig.add_legend()

In [None]:
fig = sns.FacetGrid(titanic_df, hue='person', aspect=4)
fig.map(sns.kdeplot, 'Age', shade=True)
oldest = titanic_df['Age'].max()
fig.set(xlim=(0,oldest))
fig.add_legend()

#### Grouped by Class

In [None]:
fig = sns.FacetGrid(titanic_df, hue='Pclass', aspect=4)
fig.map(sns.kdeplot, 'Age', shade=True)
oldest = titanic_df['Age'].max()
fig.set(xlim=(0,oldest))
fig.set(title='Distribution of Age Grouped by Class')
fig.add_legend()

From the plot above, the ages in class 1 look roughly normally distributed. However, the distributions for classes 2
and 3 are skewed, with most passengers in their 20s and 30s.

#### What cabins did the Passengers stay in?

In [None]:
deck = titanic_df['Cabin'].dropna()
deck.head()

In [None]:
# Grab the first letter of each cabin code (the deck)
d = [c[0] for c in deck]

In [None]:
d[0:10]

In [None]:
from collections import Counter
Counter(d)
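
The same deck counts can be obtained without leaving pandas. This is a sketch on a made-up `cabins` Series; with the workshop data you would start from `titanic_df['Cabin']`.

```python
import pandas as pd

# Made-up cabin codes, including a missing one.
cabins = pd.Series(['C85', None, 'E46', 'C123', 'G6'])

# .str[0] takes the first character of every string in the Series.
decks = cabins.dropna().str[0]

# value_counts() plays the role of Counter here.
deck_counts = decks.value_counts()

print(decks.tolist())
print(deck_counts)
```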

In [None]:
# Now let's plot the cabin counts. First turn the d list into a data frame. Then name the column Cabin 
cabin_df = DataFrame(d)
cabin_df.columns=['Cabin']
sns.catplot(x='Cabin', data=cabin_df, kind='count', order=['A','B','C','D','E','F','G','T'], aspect=2, 
            palette='winter_d')

In [None]:
# Drop the 'T' cabin
cabin_df = cabin_df[cabin_df['Cabin'] != 'T']

In [None]:
# Then replot the cabin counts as above
sns.catplot(x='Cabin', data=cabin_df, kind='count', order=['A','B','C','D','E','F','G'], aspect=2, 
            palette='Greens_d')

In [None]:
# Below is a link to the list of matplotlib colormaps
url = 'http://matplotlib.org/api/pyplot_summary.html?highlight=colormaps#matplotlib.pyplot.colormaps'
import webbrowser
webbrowser.open(url)

#### Where did the passengers come from, i.e. at which port did they board the ship?

In [None]:
sns.catplot(x='Embarked', data=titanic_df, kind='count', hue='Pclass', hue_order=[1,2,3], aspect=2,
            order = ['C','Q','S'])

From the figure above, one may conclude that almost all of the passengers who boarded at Queenstown were in third 
class. On the other hand, many who boarded at Cherbourg were in first class. The largest share of passengers 
boarded at Southampton, of whom 353 were in third class, 164 in second class and 
127 in first class. To understand why, for example, most passengers who boarded at Queenstown were in third class, one may need to look at the economic situation in these towns at the time.

In [None]:
titanic_df.Embarked.value_counts()

In [None]:
# For tabulated values, use the pandas crosstab function instead of a seaborn plot
port = pd.crosstab(index=[titanic_df.Pclass], columns=[titanic_df.Embarked])
# Use a flat list of names; a nested list would create a MultiIndex
port.columns = ['Cherbourg','Queenstown','Southampton']

In [None]:
port

In [None]:
port.index

In [None]:
port.columns

In [None]:
port.index=['First','Second','Third']

In [None]:
port
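
Raw counts make it hard to compare ports of very different sizes. As a sketch, `pd.crosstab` can also report each class as a share of the passengers at its port via `normalize='columns'`. The `toy` frame below is made up for illustration.

```python
import pandas as pd

# Made-up passengers; only the column names match the Titanic data.
toy = pd.DataFrame({
    'Pclass':   [1, 3, 3, 3, 1, 2, 3, 3],
    'Embarked': ['C', 'C', 'Q', 'Q', 'S', 'S', 'S', 'S'],
})

# Each column (port) sums to 1, so cells are the share of that port's passengers.
share = pd.crosstab(toy['Pclass'], toy['Embarked'], normalize='columns')

print(share)
```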

#### Who was alone and who was with parents or siblings?

In [None]:
titanic_df[['SibSp','Parch']].head()

In [None]:
# Alone dataframe, i.e. the passenger has no siblings/spouses (SibSp) or parents/children (Parch) aboard
alone_df = titanic_df[(titanic_df['SibSp'] == 0) & (titanic_df['Parch']==0)].copy()
# Add Alone column (.copy() above avoids pandas' SettingWithCopyWarning)
alone_df['Alone'] = 'Alone'
# Not alone data frame, i.e. the passenger has at least one family member aboard.
not_alone_df = titanic_df[(titanic_df['SibSp'] != 0) | (titanic_df['Parch']!=0)].copy()
not_alone_df['Alone'] = 'With family'

# Merge the above dataframes
comb = [alone_df, not_alone_df]
# Merge and sort by index
titanic_df = pd.concat(comb).sort_index()

In [None]:
[len(alone_df), len(not_alone_df)]

In [None]:
# Show the first five records of the alone data frame
alone_df.head()

In [None]:
# Show the first five rows of the not alone data frame
not_alone_df.head()

In [None]:
titanic_df.head()

In [None]:
""" Another way to perform the above
family_size = titanic_df.SibSp + titanic_df.Parch
titanic_df['Alone'] = np.where(family_size > 0, 'With family', 'Alone')"""

In [None]:
fg=sns.catplot(x='Alone', data=titanic_df, kind='count', hue='Pclass', col='person', hue_order=[1,2,3],
               palette='Blues')
fg.set_xlabels('Status')

From the figure above, it is clear that most children traveled with family in third class. For men, most traveled alone in third class. On the other hand, the number of female passengers who traveled either with family or alone among the second and third class is comparable. However, more women traveled with family than alone in first class. 

### Factors Affecting Survival

In [None]:
# Now let's look at the factors that helped someone survive the sinking. We start this analysis by adding a new
# column to the titanic data frame. Use the map method to map the Survived column to the new column,
# with 0 -> 'no' and 1 -> 'yes'.
titanic_df['Survivor'] = titanic_df.Survived.map({0:'no', 1:'yes'})

In [None]:
titanic_df.head()

#### Class Factor

In [None]:
# Survived vs. class, grouped by person (child, female, male)
# kind='point' reproduces the old factorplot default
sns.catplot(x='Pclass', y='Survived', hue='person', data=titanic_df, kind='point', order=[1,2,3], 
            hue_order = ['child','female','male'])

From the figure above, being male or traveling in third class reduces one's chance of survival.

In [None]:
sns.catplot(x='Survivor', data=titanic_df, hue='Pclass', kind='count', palette='Pastel2', hue_order=[1,2,3],
            col='person')

#### Age Factor

In [None]:
# Linear plot of age vs. survived
sns.lmplot(x='Age', y='Survived', data=titanic_df)

There seems to be a general, roughly linear downward trend between age and the Survived field. The plot shows that the older the passenger, the lower the chance of survival.

In [None]:
# Survived vs. Age grouped by Sex
sns.lmplot(x='Age', y='Survived', data=titanic_df, hue='Sex')

Older women have a higher rate of survival than older men, as shown in the figure above. Also, older women have a higher
rate of survival than younger women, the opposite of the trend for the male passengers.

In [None]:
# Survived vs. Age grouped by class
sns.lmplot(x='Age', y='Survived', hue='Pclass', data=titanic_df, palette='winter', hue_order=[1,2,3])

In all three classes, the chance to survive reduced as the passengers got older.

In [None]:
# Create generation bins
generations = [10,20,40,60,80]
sns.lmplot(x='Age',y='Survived',hue='Pclass',data=titanic_df,x_bins=generations, hue_order=[1,2,3])
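
A related tabular view, which is not the same computation as the binned plot above, is to cut the ages into bands and take the mean of Survived in each band. The `toy` frame and the `edges` list are assumptions made for this sketch.

```python
import pandas as pd

# Made-up passengers; only the column names match the Titanic data.
toy = pd.DataFrame({
    'Age': [5, 9, 25, 35, 50, 70],
    'Survived': [1, 1, 0, 1, 0, 0],
})

# Band edges chosen for illustration; each band includes its right edge.
edges = [0, 10, 20, 40, 60, 80]
toy['AgeBand'] = pd.cut(toy['Age'], bins=edges)

# Mean of a 0/1 column is the survival rate; observed=True skips empty bands.
rate = toy.groupby('AgeBand', observed=True)['Survived'].mean()

print(rate)
```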

#### Deck Factor

In [None]:
titanic_df.columns

In [None]:
# .copy() lets us add a Deck column later without a SettingWithCopyWarning
titanic_DF = titanic_df.dropna(subset=['Cabin']).copy()

In [None]:
d[0:10]

In [None]:
len(titanic_DF), len(d)

In [None]:
titanic_DF['Deck'] = d

In [None]:
titanic_DF = titanic_DF[titanic_DF.Deck != 'T']

In [None]:
titanic_DF.head()

In [None]:
sns.catplot(x='Deck', y='Survived', data=titanic_DF, kind='point', order=['A','B','C','D','E','F','G'])

There does not seem to be any relation between deck and the survival rate as shown in the above figure!

#### Family Status Factor

In [None]:
sns.catplot(x='Alone', y='Survived', data=titanic_df, kind='point', palette='winter') #hue='person', 
            #hue_order=['child', 'female', 'male'])

The survival rate appears to be noticeably lower for those who were alone. However, let's check whether
gender or age plays a role. From the figure below, one may conclude that the survival rates for women and children
are much higher than that of men, as was concluded previously and as anticipated. However, within each group the
survival rate does not differ significantly between those who were with family and those who were alone. If anything, the survival 
rate for women and children increases for those who were alone. For men, the survival rate diminishes slightly 
for those who were alone versus those who were with family.

In [None]:
sns.catplot(x='Alone', y='Survived', data=titanic_df, kind='point', palette='winter', hue='person', 
            hue_order=['child', 'female', 'male'])

In [None]:
# Let's split it by class now!
sns.catplot(x='Alone', y='Survived', data=titanic_df, kind='point', palette='summer', hue='person', 
            hue_order=['child', 'female', 'male'], col='Pclass', col_order=[1,2,3])
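
The survival rates shown in plots like these can also be tabulated, together with the number of passengers behind each rate. This is a sketch on a made-up `toy` frame that reuses the `Alone` and `person` column names from above.

```python
import pandas as pd

# Made-up passengers; only the column names match the notebook.
toy = pd.DataFrame({
    'Alone': ['Alone', 'Alone', 'With family', 'With family', 'Alone', 'With family'],
    'person': ['male', 'male', 'male', 'female', 'female', 'child'],
    'Survived': [0, 1, 0, 1, 1, 1],
})

# mean = survival rate in the group, count = how many passengers it is based on.
rates = toy.groupby(['person', 'Alone'])['Survived'].agg(['mean', 'count'])

print(rates)
```

Looking at the count column alongside the rate helps judge whether a difference between groups rests on enough passengers to be meaningful.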

### Predictive Modeling

In [None]:
import sklearn