## Intersectional Survival Analysis ##

*Lab adapted from an exercise by [Meredith Broussard](https://merbroussard.github.io/) in [Artificial Unintelligence](https://mitpress.mit.edu/books/artificial-unintelligence)*

In [None]:
# Libraries we need

import pandas as pd # for dataframes
import numpy as np  # for math
from sklearn import tree, preprocessing # for our model

In [None]:
# Data we need--
# We'll be using the training data and, if time, the testing data (pre-split) 
# associated with the Titanic disaster. It's in CSV format

train_file = "titanic_data/train.csv"
train_df = pd.read_csv(train_file)

test_file = "titanic_data/test.csv"
test_df = pd.read_csv(test_file)

In [None]:
# Let's take a look at the training data

train_df.head()

Some of the headers are self-explanatory, but some are not. 

For instance, what's `Parch`? 

Thanks to this dataset's [data dictionary](https://www.kaggle.com/c/titanic/data), we're able to determine the following:

In [None]:
# the cell below just left-aligns the table

In [None]:
%%html
<style>
table {float:left}
</style>

**Data Dictionary**

| Variable | Definition                                  | Key                                            |
| :------- | :------------------------------------------ | :--------------------------------------------- |
| survival | Survival                                    | 0 = No, 1 = Yes                                |
| pclass   | Ticket class                                | 1 = 1st, 2 = 2nd, 3 = 3rd                      |
| sex      | Sex                                         |                                                |
| Age      | Age in years                                |                                                |
| sibsp    | # of siblings / spouses aboard the Titanic  |                                                |
| parch    | # of parents / children aboard the Titanic  |                                                |
| ticket   | Ticket number                               |                                                |
| fare     | Passenger fare                              |                                                |
| cabin    | Cabin number                                |                                                |
| embarked | Port of Embarkation                         | C = Cherbourg, Q = Queenstown, S = Southampton |

**Additional Variable Notes**

`pclass`: A proxy for socio-economic status (SES)
1st = Upper; 2nd = Middle; 3rd = Lower

`age`: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5

`sibsp`: The dataset defines family relations in this way:

Sibling = brother, sister, stepbrother, stepsister; Spouse = husband, wife (mistresses and fiancés were ignored)

`parch`: The dataset defines family relations in this way:

Parent = mother, father; Child = daughter, son, stepdaughter, stepson; Some children travelled only with a nanny, therefore parch=0 for them.

With this in mind, let's take a look at the first line of data and see if we can interpret it.

In [None]:
train_df.loc[0]

**What does each of these values mean?**

In [None]:
# check out the test data

test_df.head()

**How is this different?**

Let's dig into our data a bit more by using the convenient `describe` function:

Note that it excludes non-numeric datatypes.
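
If you want to see what `describe` leaves out by default, here is a minimal sketch on a made-up dataframe (`toy_df` and its values are hypothetical, not the Titanic data). Passing `include="all"` asks pandas to summarize the text columns as well:

```python
import pandas as pd

# Hypothetical toy data, just to compare the two summaries
toy_df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Sex": ["male", "female", "female", "male"],
})

# Default behavior: only the numeric column ("Age") is summarized
numeric_summary = toy_df.describe()

# include="all" adds the text column, with rows such as "unique" and "top"
full_summary = toy_df.describe(include="all")

print(numeric_summary)
print(full_summary)
```

In the full summary, the statistics that don't apply to a column (like the mean of "Sex") simply show up as `NaN`.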

In [None]:
train_df.describe()

**What does this tell us?**

**What does it mean that the count of the "Age" column is lower than the rest?**

**What does the mean of the "Survived" column tell us?**

The `value_counts()` function can tell us a bit more:

In [None]:
train_df["Survived"].value_counts()

**How many passengers died and how many survived?** 

Let's see those same numbers normalized:

In [None]:
train_df["Survived"].value_counts(normalize=True) # note the capitalization of "True" -- R uses TRUE instead

**If we were going to predict only on this metric, would we predict that a random person would survive or would die?**

But there were likely other factors at play on the ship.

For example, the principle of "women and children first" was likely employed to some degree, since it had been invoked in maritime disasters since at least the 1850s.

So let's see if our intuition is correct and more women than men survived.

First, let's see the gender breakdown on the ship:


In [None]:
train_df["Sex"].value_counts()

**How many men and how many women were on the ship?**

Now we're going to use something called [boolean indexing](https://www.geeksforgeeks.org/boolean-indexing-in-pandas/) in order to select only the men on the ship, and then see how many of them survived and how many died:

NOTE: If you want to learn more about boolean indexing and/or pandas (how Python handles dataframes), you can consult [these](https://github.com/laurenfklein/emory-qtm340/blob/master/notebooks/class13-pandas-complete.ipynb) [notebooks](https://github.com/laurenfklein/emory-qtm340/blob/master/notebooks/class14-pandas-in-action-complete.ipynb) from my Fall text as data course.
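
To make the idea of boolean indexing concrete before we use it on the real data, here is a small sketch on a made-up dataframe (`toy_df` and its values are hypothetical). The comparison produces a Series of `True`/`False` values, and that Series is then used to pick out the matching rows:

```python
import pandas as pd

# Hypothetical toy data, not the Titanic file
toy_df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female"],
    "Survived": [0, 1, 1, 1, 0],
})

# Step 1: the comparison gives one True/False value per row
is_female = toy_df["Sex"] == "female"

# Step 2: use that mask to keep only the rows where it is True
female_survived = toy_df.loc[is_female, "Survived"]

print(is_female.tolist())
print(female_survived.value_counts())
```

The cells below do both steps in a single line, but the logic is the same.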


In [None]:
train_df["Survived"][train_df["Sex"] == 'male'].value_counts()

**Of the men, how many died and how many survived?**

Let's just normalize that so we get a percentage:

In [None]:
train_df["Survived"][train_df["Sex"] == 'male'].value_counts(normalize=True)

**What percentage of men survived?**

In [None]:
# now let's look at the women:
train_df["Survived"][train_df["Sex"] == 'female'].value_counts(normalize=True)

**What does this tell us?**

Note that the order of 1 and 0 is inverted from the above, since `value_counts` sorts by count by default and lists the most frequent value first.
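
If the changing order is confusing, one option is to sort the result by label instead of by count. Here is a minimal sketch on a made-up Series (`survived` is hypothetical data), using `sort_index()` so that 0 always comes before 1:

```python
import pandas as pd

# Hypothetical outcomes: four 1s and two 0s
survived = pd.Series([1, 1, 0, 1, 0, 1])

# Default: most frequent value listed first
by_count = survived.value_counts(normalize=True)

# Sorted by label: 0 first, then 1, regardless of which is more common
by_label = survived.value_counts(normalize=True).sort_index()

print(by_count)
print(by_label)
```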


**If we were going to manually fill out the "Survived" column in the test data with a prediction, what would we predict?**

**Now it's your turn. Check to see if pclass-- what the data dictionary describes as "a proxy for socio-economic status"-- turns out to matter for a person's probability of survival.**

In [None]:
# first, see what the values are
train_df["Pclass"].value_counts()

In [None]:
# now, modifying the code for gender (above), see how many of each SES survived, 
# and which SES has the best survival rate


**What percentage of "upper", "middle", and "lower" class people survived?**

**What would an intersectional approach to this data tell us to do next?**

There are many approaches to this type of question. Today, we're going to use a *decision tree* in order to build our model. 

A decision tree is a popular type of predictive modeling algorithm. More specifically, it's a *supervised* machine learning model that is used to predict a target (in our case, whether a person survives or not) by learning *decision* rules from features (in our case, specific columns in our dataset). One of the most helpful properties of decision trees, for our purposes, is that they can tell us which features are most important for making the prediction.  

Here's a helpful illustration of decision trees from Lorraine Li, over at *Towards Data Science*:

![decision tree diagram](https://miro.medium.com/max/2000/1*WerHJ14JQAd3j8ASaVjAhw.jpeg)

Based on the features that we specify from our training data, the decision tree model will ask (or "learn" in ML-speak) a series of questions to infer the class labels of the samples. As we can see, decision trees are attractive models if we care about interpretability.

Although the diagram illustrates the concept of a decision tree based on categorical targets (classification), the same concept applies if our targets are real numbers (regression).

If you want to learn more about decision trees, I'd recommend reading the rest of Li's article, "[Classification and Regression Analysis with Decision Trees](https://towardsdatascience.com/https-medium-com-lorrli-classification-and-regression-analysis-with-decision-trees-c43cdbc58054)"
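
To see what "learning a series of questions" looks like in practice, here is a minimal sketch on a tiny made-up dataset (the feature names `is_raining` and `is_weekend` and all the values are hypothetical). It uses sklearn's `export_text` helper to print the rules the tree ends up with:

```python
from sklearn import tree

# Hypothetical toy data: each row is [is_raining, is_weekend]
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
# The label (stay home or not) depends only on the first feature
y = [0, 0, 1, 1]

toy_tree = tree.DecisionTreeClassifier(random_state=0)
toy_tree.fit(X, y)

# Print the learned rules as indented text
rules = tree.export_text(toy_tree, feature_names=["is_raining", "is_weekend"])
print(rules)

# Ask the fitted tree about two new rows
predictions = toy_tree.predict([[1, 0], [0, 1]])
print(predictions)
```

Because the label here depends on a single feature, the tree only needs one question. Real data, like ours, usually produces a much deeper tree.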

It's quite easy to implement a decision tree using Python's `sklearn` library. But before we see how easy, we need to fix one thing in our data....

Remember that issue with the "Age" data? Some values were missing, right?

That will break our algorithm. So we have two choices:

Exclude the people with missing ages, or make something up. 

So that you see how to do it, we're going to make something up. But we can talk about the implications of this later. 
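
Here is a quick sketch of what the two choices look like side by side, on a made-up column (`toy_df` and its ages are hypothetical). Dropping removes rows, while filling keeps every row but invents a value for the gaps:

```python
import pandas as pd

# Hypothetical toy data with two missing ages
toy_df = pd.DataFrame({"Age": [20.0, None, 30.0, 40.0, None]})

# Choice 1: exclude the rows with a missing age
dropped = toy_df.dropna(subset=["Age"])

# Choice 2: fill each missing age with the median of the known ages
median_age = toy_df["Age"].median()
filled = toy_df.assign(Age=toy_df["Age"].fillna(median_age))

print(len(dropped), len(filled))
print(filled)
```

Note that the filled version now has several people who share exactly the same age, which is one of the implications worth discussing.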

In [None]:
# assign any missing age the median age

train_df["Age"] = train_df["Age"].fillna(train_df["Age"].median())

In [None]:
# now we're ready to run our decision tree

# first, create the target; remember, this is the prediction we're after

target = train_df["Survived"].values

In [None]:
# we also need to turn our "Sex" variable into a binary --
# more on the problematic nature of the gender binary in a couple of classes!!!

# define the transformation we're about to apply
binary_sex = preprocessing.LabelEncoder()

# fit and transform the data as defined above
train_df["Sex"] = binary_sex.fit_transform(train_df["Sex"])

In [None]:
# pull out the features we want
features = train_df[["Pclass", "Sex", "Age", "Fare"]].values

# instantiate and fit the decision tree
my_dtree = tree.DecisionTreeClassifier()
my_dtree.fit(features, target)

# look at the importance of the included features

my_dtree.feature_importances_

Here's [an article that talks more about the math of calculating feature_importances](https://towardsdatascience.com/the-mathematics-of-decision-trees-random-forest-and-feature-importance-in-scikit-learn-and-spark-f2861df67e3). 

For now, just know that the higher the value, the more important the feature. Also, the feature_importances are listed in the order that the features are entered into the model. So in our case, "Fare" is most important, closely followed by "Sex."
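
Since the importances come back as a bare array, it can help to attach the feature names to them. Here is a minimal sketch on made-up data (the feature names `signal` and `noise` and the values are hypothetical), pairing each name with its importance in a pandas Series and sorting it:

```python
import pandas as pd
from sklearn import tree

# Hypothetical toy data: the label follows "signal" and ignores "noise"
feature_names = ["signal", "noise"]
X = [[0, 5], [0, 7], [1, 5], [1, 7]]
y = [0, 0, 1, 1]

toy_tree = tree.DecisionTreeClassifier(random_state=0).fit(X, y)

# Label each importance with its feature name, most important first
importances = pd.Series(
    toy_tree.feature_importances_, index=feature_names
).sort_values(ascending=False)

print(importances)
```

The same pattern should work for our model, as long as the list of names is in the same order as the columns passed to `fit`.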

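Another metric we can take a look at is the mean accuracy, which tells you the fraction of predictions that are correct. Sklearn has this built in as the `score` method. Keep in mind that here we are scoring the model on the same data it was trained on, so the number will tend to be optimistic.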
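
To see what `score` is doing, here is a small sketch on made-up data (the arrays are hypothetical, and `max_depth=1` is set only so that the toy tree makes a mistake). It computes the accuracy by hand and compares it with the built-in value:

```python
import numpy as np
from sklearn import tree

# Hypothetical toy data with one label that a single split cannot get right
X = np.array([[0], [1], [2], [3], [4], [5]])
y = np.array([0, 0, 1, 0, 1, 1])

# max_depth=1 limits the tree to a single question
toy_tree = tree.DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)

# Accuracy by hand: the share of predictions that match the true labels
manual_accuracy = np.mean(toy_tree.predict(X) == y)

# Accuracy from sklearn
builtin_accuracy = toy_tree.score(X, y)

print(manual_accuracy, builtin_accuracy)
```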

In [None]:
my_dtree.score(features, target)

**What is the accuracy of our model? Are you satisfied?**

**If time: apply our model to our test data--remember that batch of data that didn't have the "Survived" column.**
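
One thing to keep in mind when preparing new data: the numbers have to mean the same thing they meant during training. Here is a minimal sketch with made-up values (`train_sex` and `test_sex` are hypothetical lists), where the encoder is fitted once and then only `transform` is called on the new data:

```python
from sklearn import preprocessing

# Hypothetical toy values, not the Titanic columns
train_sex = ["male", "female", "female", "male"]
test_sex = ["female", "male", "female"]

encoder = preprocessing.LabelEncoder()

# Learn the mapping from the training values and apply it
train_encoded = encoder.fit_transform(train_sex)

# Reuse the learned mapping on new values, without refitting
test_encoded = encoder.transform(test_sex)

print(list(encoder.classes_))
print(train_encoded, test_encoded)
```

`LabelEncoder` assigns numbers in sorted order of the labels, so with these two categories the mapping would come out the same either way, but reusing the fitted encoder is the safer habit.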

In [None]:
# fill in missing data

test_df["Fare"] = test_df["Fare"].fillna(test_df["Fare"].median())
test_df["Age"] = test_df["Age"].fillna(test_df["Age"].median())

# convert the "Sex" column to numbers again; we reuse the encoder fitted on
# the training data, so we call transform (not fit_transform) here
test_df["Sex"] = binary_sex.transform(test_df["Sex"])

In [None]:
# pull out the features we want
test_features = test_df[["Pclass", "Sex", "Age", "Fare"]].values

# make prediction
my_prediction = my_dtree.predict(test_features)

my_prediction

In [None]:
# cross reference w/ passenger id

PassengerId = np.array(test_df["PassengerId"]).astype(int)
my_solution = pd.DataFrame(my_prediction, index=PassengerId, columns=["Survived"])

my_solution