<a href="https://colab.research.google.com/github/kwaldenphd/building-a-ml-model/blob/main/klein_titanic_broussard_exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building Your Own Machine Learning Model in Python: Option #2

<a href="http://creativecommons.org/licenses/by-nc/4.0/" rel="license"><img style="border-width: 0;" src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png" alt="Creative Commons License" /></a>
This tutorial is licensed under a <a href="http://creativecommons.org/licenses/by-nc/4.0/" rel="license">Creative Commons Attribution-NonCommercial 4.0 International License</a>.

# Overview

Read pages 94-114 from Meredith Broussard's 2018 book [*Artificial Unintelligence: How Computers Misunderstand the World*](https://onesearch.library.nd.edu/permalink/f/1phik6l/ndu_aleph004791189) (MIT Press). 

In this section of Chapter 7 "Machine Learning: The DL on the ML," Broussard outlines a machine learning workflow using data about Titanic passengers.

Follow the steps outlined in the chapter excerpt to build a machine learning classifier.

Chapter excerpt:
- [Access via Google Drive](https://drive.google.com/file/d/1sXJPcvk84SDB3QXCNWiKwL7AQ-gKcgh7/view?usp=sharing) (ND users only)
- [Link to electronic access through Hesburgh Libraries](https://onesearch.library.nd.edu/permalink/f/1phik6l/ndu_aleph004791189)

Jupyter notebook:
- [Link to Jupyter Notebook](https://colab.research.google.com/drive/1dJLBUyDvQZ7qGzsUxiVbegxZkDiDRMr4?usp=sharing)
  * SOURCE: Adapted from [Lauren F. Klein](https://lklein.com/)'s  [implementation of Broussard's exercise](https://github.com/laurenfklein/feminist-data-science/blob/master/notebooks/lab2-survival/lab2-survival-inclass.ipynb), developed for the Spring 2020 Emory University course [QTM 490 "Feminist Data Science"](https://github.com/laurenfklein/feminist-data-science).

You'll also need two data files for this option: `test.csv` and `train.csv`
- GitHub
  * Test: https://raw.githubusercontent.com/kwaldenphd/building-a-ml-model/main/data/test.csv
  * Train: https://raw.githubusercontent.com/kwaldenphd/building-a-ml-model/main/data/train.csv
- Google Drive
  * [Test](https://drive.google.com/file/d/1YIKOH2upzQQUAiIqLwIqk5xdddetDsfR/view?usp=sharing)
  * [Train](https://drive.google.com/file/d/1CcHmC4hvVrLeM8AeSHmxCJSUvdXofGN_/view?usp=sharing)

This is a guided (i.e. not open-ended), moderately complex option.



---



<em>SOURCE: Adapted from [Lauren F. Klein](https://lklein.com/)'s  [implementation of Broussard's exercise](https://github.com/laurenfklein/feminist-data-science/blob/master/notebooks/lab2-survival/lab2-survival-inclass.ipynb), developed for the Spring 2020 Emory University course [QTM 490 "Feminist Data Science"](https://github.com/laurenfklein/feminist-data-science).</em>

# Intersectional Survival Analysis

*Lab adapted from an exercise by [Meredith Broussard](https://merbroussard.github.io/) in [Artificial Unintelligence](https://mitpress.mit.edu/books/artificial-unintelligence)*

In [3]:
# Libraries we need

import pandas as pd # for dataframes
import numpy as np  # for math
from sklearn import tree, preprocessing # for our model

# Data

You'll need two data files for this option: `test.csv` and `train.csv`
- GitHub
  * Test: https://raw.githubusercontent.com/kwaldenphd/building-a-ml-model/main/data/test.csv
  * Train: https://raw.githubusercontent.com/kwaldenphd/building-a-ml-model/main/data/train.csv
- Google Drive
  * [Test](https://drive.google.com/file/d/1YIKOH2upzQQUAiIqLwIqk5xdddetDsfR/view?usp=sharing)
  * [Train](https://drive.google.com/file/d/1CcHmC4hvVrLeM8AeSHmxCJSUvdXofGN_/view?usp=sharing)

## Load from GitHub URLs

In [None]:
# training data
train_df = pd.read_csv("https://raw.githubusercontent.com/kwaldenphd/building-a-ml-model/main/data/train.csv")

# show training data
train_df

In [None]:
# test data
test_df = pd.read_csv("https://raw.githubusercontent.com/kwaldenphd/building-a-ml-model/main/data/test.csv")

# show test data
test_df

## Load from Files

If working with Jupyter Notebooks on your local computer, you'll need to move files into the same directory (folder) as the Jupyter Notebook.
- Alternatively, you can provide the full file path.

If working in Google Colab, you'll either need to upload the files to your session or mount Google Drive to access the files.
- [Uploading files](https://youtu.be/6HFlwqK3oeo?t=177)
- [Mounting Google Drive](https://www.marktechpost.com/2019/06/07/how-to-connect-google-colab-with-google-drive/)

In [None]:
# load training data
train_file = "train.csv"

# create df
train_df = pd.read_csv(train_file)

# show df
train_df

In [None]:
# load test data
test_file = "test.csv"

# create df
test_df = pd.read_csv(test_file)

# show df
test_df

# Exploring the Data

Some of the headers are self-explanatory, but some are not. 

For instance, what's `Parch`? 

Thanks to this dataset's [data dictionary](https://www.kaggle.com/c/titanic/data), we're able to determine the following:

## Data Dictionary

| Variable  | Definition                                 | Key                                            |
| :-------- | :----------------------------------------- | :--------------------------------------------- |
| survival  | Survival                                   | 0 = No, 1 = Yes                                |
| pclass    | Ticket class                               | 1 = 1st, 2 = 2nd, 3 = 3rd                      |
| sex       | Sex                                        |                                                |
| Age       | Age in years                               |                                                |
| sibsp     | # of siblings / spouses aboard the Titanic |                                                |
| parch     | # of parents / children aboard the Titanic |                                                |
| ticket    | Ticket number                              |                                                |
| fare      | Passenger fare                             |                                                |
| cabin     | Cabin number                               |                                                |
| embarked  | Port of Embarkation                        | C = Cherbourg, Q = Queenstown, S = Southampton |

## Additional Variable Notes

`pclass`: A proxy for socio-economic status (SES)
  * `1` = first class (upper class)
  * `2` = second class (middle class)
  * `3` = third class (lower class)

`age`: Age is fractional if less than 1. If the age is estimated, it is in the form of `xx.5`.

`sibsp`: The dataset defines family relations in this way:
  * Sibling = brother, sister, stepbrother, stepsister
  * Spouse = husband, wife (mistresses and fiancés were ignored)

`parch`: The dataset defines family relations in this way:
  * Parent = mother, father
  * Child = daughter, son, stepdaughter, stepson
  * Some children travelled only with a nanny, therefore `parch=0` for them.

With this in mind, let's take a look at the first line of data and see if we can interpret it.

In [8]:
# show first line of training data
train_df.loc[0]

PassengerId                          1
Survived                             0
Pclass                               3
Name           Braund, Mr. Owen Harris
Sex                               male
Age                                 22
SibSp                                1
Parch                                0
Ticket                       A/5 21171
Fare                              7.25
Cabin                              NaN
Embarked                             S
Name: 0, dtype: object

**What are we able to tell from this row data? What kind of information is included (and how is it recorded or notated)? What information might be missing or unclear?**

We can also take a look at the first five rows of the test data.

In [9]:
# check out the test data

test_df.head()

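| Unnamed: 0 | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| :--------- | :---------- | :----- | :--- | :-- | :-- | :---- | :---- | :----- | :--- | :---- | :------- |
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | | S |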


**What are we able to tell from this row data? What kind of information is included (and how is it recorded or notated)? What information might be missing or unclear?**

**How is this data similar to or different from the training data?**

Let's dig into our data a bit more by using the convenient `describe` function:

Note that it excludes non-numeric datatypes.

In [None]:
# show data description
train_df.describe()

**What information or details can we glean from using `.describe()`?**

Specifically...
- What does it mean that the count of the "Age" column is lower than the rest?
- What does the mean of the "Survived" column tell us?
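
As a hedged illustration of both questions, the sketch below uses a small made-up frame (`toy_df` and its values are hypothetical, not the Titanic files). It shows that `describe()` counts only the non-missing entries in a column, and that the mean of a 0/1 column is the share of rows equal to 1.

```python
import numpy as np
import pandas as pd

# hypothetical toy data, not the Titanic files
toy_df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, np.nan, 35.0, np.nan],
})

# describe() counts only the non-missing entries in each column
age_count = toy_df["Age"].describe()["count"]

# isna().sum() counts the missing entries directly
missing_ages = toy_df["Age"].isna().sum()

# the mean of a 0/1 column is the fraction of rows equal to 1
survival_rate = toy_df["Survived"].mean()

print(age_count, missing_ages, survival_rate)
```

With the real data, the same calls on `train_df` would show how many ages are missing.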

The `value_counts()` function can tell us a bit more:

In [None]:
# show values in Survived column from training data
train_df["Survived"].value_counts()

We can normalize those counts to see the proportion of passengers who died and the proportion who survived.

In [None]:
# normalize values in Survived column
train_df["Survived"].value_counts(normalize=True) # note the capitalization: Python's True, unlike R's TRUE

**If we were going to predict only on this metric, would we predict that a random person would survive or would die?**
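
One way to think about predicting on this metric alone is a majority-class baseline: always guess the most common outcome. The sketch below is a minimal version on made-up labels (`toy_survived` is hypothetical); with the real data you would start from `train_df["Survived"]`.

```python
import pandas as pd

# hypothetical toy labels (0 = died, 1 = survived)
toy_survived = pd.Series([0, 0, 0, 1, 1])

# share of each outcome
shares = toy_survived.value_counts(normalize=True)

# the majority-class guess is the label with the largest share
baseline_guess = shares.idxmax()

# accuracy we would get by always making that guess
baseline_accuracy = shares.max()

print(baseline_guess, baseline_accuracy)
```

Any model we build later should at least beat this kind of baseline.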

# Making Predictions

But there were likely other factors at play on the ship.

For example, the principle of "women and children first" was likely employed, to some degree, since it was a principle used during maritime disasters since at least the 1850s. 

So let's see if our intuition is correct, and that more women than men survived.

First, let's see the gender breakdown on the ship:

In [None]:
# show number of values in Sex column
train_df["Sex"].value_counts()

**How many men and how many women were on the ship?**

Now we're going to use something called [boolean indexing](https://www.geeksforgeeks.org/boolean-indexing-in-pandas/) in order to select only the men on the ship, and then see how many of them survived and how many died:


In [None]:
# show number of male passengers, grouped by survival status
train_df["Survived"][train_df["Sex"] == 'male'].value_counts()

**Of the men, how many died and how many survived?**
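
To see what boolean indexing does step by step, here is a minimal sketch on a made-up frame (`toy_df` is hypothetical). The comparison builds a True/False mask with one entry per row, and indexing with that mask keeps only the rows marked True.

```python
import pandas as pd

# hypothetical toy data
toy_df = pd.DataFrame({
    "Sex": ["male", "female", "male", "female"],
    "Survived": [0, 1, 1, 1],
})

# the comparison produces one True/False value per row
is_male = toy_df["Sex"] == "male"

# indexing with the mask keeps only the rows marked True
male_survived = toy_df.loc[is_male, "Survived"]

print(is_male.tolist())
print(male_survived.tolist())
```

Using `.loc[mask, column]` is one common way to write this; it selects the same rows as the chained form used in the lab.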

Let's just normalize that so we get a percentage:

In [None]:
# normalized male survival status number
train_df["Survived"][train_df["Sex"] == 'male'].value_counts(normalize=True)

**What percentage of men survived?**

We can make the same calculation for female passengers.

In [None]:
# normalized female survival status number
train_df["Survived"][train_df["Sex"] == 'female'].value_counts(normalize=True)

**What does this tell us?**

Note that the order of 1 and 0 is inverted compared with the output for men, since `value_counts` lists the most frequent value first by default.
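
If the changing order is confusing, one option is to sort the result by its index so that 0 always comes before 1. The sketch below assumes a made-up frame (`toy_df` is hypothetical) and is only one way to line the two outputs up.

```python
import pandas as pd

# hypothetical toy data
toy_df = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Survived": [0, 0, 1, 1, 1, 0],
})

# sort_index() orders the result by label (0, then 1) instead of by frequency
male_rates = (
    toy_df.loc[toy_df["Sex"] == "male", "Survived"]
    .value_counts(normalize=True)
    .sort_index()
)
female_rates = (
    toy_df.loc[toy_df["Sex"] == "female", "Survived"]
    .value_counts(normalize=True)
    .sort_index()
)

print(male_rates)
print(female_rates)
```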


**If we were going to manually fill out the "Survived" column in the test data with a prediction, what would we predict?**

## Now It's Your Turn

**Check to see if pclass-- what the data dictionary describes as "a proxy for socio-economic status"-- turns out to matter for a person's probability of survival.**

In [None]:
# get values for passenger class column
train_df["Pclass"].value_counts()

Modify the code used to determine survival rates by gender to determine survival rates by passenger class.

Questions to consider:
- How many passengers in each class survived?
- Which class had the best survival rate?

**What percentage of "upper", "middle", and "lower" class people survived?**

**What would an intersectional approach to this data tell us to do next?**
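
One possible next step is to look at two categories at the same time rather than one after the other. The sketch below groups a made-up frame (`toy_df` is hypothetical) by both `Sex` and `Pclass`; it is an illustration of the idea, not the answer for the real data.

```python
import pandas as pd

# hypothetical toy data
toy_df = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "female", "male"],
    "Pclass": [1, 3, 1, 3, 3, 3],
    "Survived": [1, 0, 1, 0, 1, 0],
})

# the mean of a 0/1 column within each group is that group's survival rate
rates = toy_df.groupby(["Sex", "Pclass"])["Survived"].mean()

print(rates)
```

Each row of the result describes one combination, such as women in third class, which a single-variable breakdown would hide.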

# Building a Decision Tree Machine Learning Model

There are many approaches to this type of question. Today, we're going to use a *decision tree* in order to build our model. 

A decision tree is a popular type of predictive modeling algorithm. More specifically, it's a *supervised* machine learning model that is used to predict a target (in our case, whether a person survives or not) by learning *decision* rules from features (in our case, specific columns in our dataset). One of the most helpful features of decision trees, for our purposes, is that they can tell us which features are most important for making the prediction.

Here's a helpful illustration of decision trees from Lorraine Li, over at *Towards Data Science*:

![decision tree diagram](https://miro.medium.com/max/2000/1*WerHJ14JQAd3j8ASaVjAhw.jpeg)

Based on the features that we specify from our training data, the decision tree model will ask (or "learn" in ML-speak) a series of questions to infer the class labels of the samples. As we can see, decision trees are attractive models if we care about interpretability.

Although the diagram illustrates the concept of a decision tree based on categorical targets (classification), the same concept applies if our targets are real numbers (regression).
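
To make the "series of questions" concrete, the sketch below fits a tiny tree on made-up data and prints the learned rules with `sklearn`'s `tree.export_text`. The toy features and outcomes are hypothetical and were chosen so that the outcome depends only on `Sex`.

```python
from sklearn import tree

# hypothetical toy features: [Pclass, Sex], with Sex coded 0 = female, 1 = male
toy_features = [[1, 0], [3, 0], [1, 1], [3, 1]]
# made-up outcomes that depend only on Sex
toy_target = [1, 1, 0, 0]

# fit a small decision tree classifier
toy_tree = tree.DecisionTreeClassifier(random_state=0)
toy_tree.fit(toy_features, toy_target)

# export_text renders the learned questions as nested if/else rules
rules = tree.export_text(toy_tree, feature_names=["Pclass", "Sex"])
print(rules)
```

Reading the printed rules from top to bottom follows the same path a sample takes through the tree.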

If you want to learn more about decision trees, I'd recommend reading the rest of Li's article, "[Classification and Regression Analysis with Decision Trees](https://towardsdatascience.com/https-medium-com-lorrli-classification-and-regression-analysis-with-decision-trees-c43cdbc58054)"

It's quite easy to implement a decision tree using Python's `sklearn` library. But before we see how easy, we need to fix one thing in our data....

Remember that issue with the "Age" data? There was some missing, right?

That will break our algorithm. So we have two choices:

Exclude the people with missing ages, or make something up. 

So that you see how to do it, we're going to make something up. But we can talk about the implications of this later. 
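
For comparison, here is a minimal sketch of both choices on a made-up column (`toy_df` is hypothetical): dropping the rows with a missing age, and filling the gaps with the median of the known ages.

```python
import numpy as np
import pandas as pd

# hypothetical toy data with one missing age
toy_df = pd.DataFrame({"Age": [20.0, np.nan, 30.0, 40.0]})

# choice 1: exclude the rows with a missing age
dropped = toy_df.dropna(subset=["Age"])

# choice 2: fill the gaps with the median of the known ages
filled = toy_df["Age"].fillna(toy_df["Age"].median())

print(len(dropped), filled.tolist())
```

Dropping shrinks the dataset, while filling keeps every row but adds values that were never observed.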

In [None]:
# assign any missing age the median age

train_df["Age"] = train_df["Age"].fillna(train_df["Age"].median())

In [None]:
# now we're ready to run our decision tree

# first, create the target; remember, this is the prediction we're after

target = train_df["Survived"].values

In [None]:
# we also need to turn our "Sex" variable into a binary --
# more on the problematic nature of the gender binary in a couple of classes!!!

# define the transformation we're about to apply
binary_sex = preprocessing.LabelEncoder()

# fit and transform the data as defined above
train_df["Sex"] = binary_sex.fit_transform(train_df["Sex"])
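
To check which number the encoder assigned to which label, you can inspect its `classes_` attribute. The sketch below uses a separate, hypothetical encoder (`toy_encoder`) on made-up labels; `LabelEncoder` sorts the labels, so a label's position in `classes_` is its code.

```python
from sklearn import preprocessing

# hypothetical encoder fit on made-up labels
toy_encoder = preprocessing.LabelEncoder()
codes = toy_encoder.fit_transform(["male", "female", "female", "male"])

# classes_ lists the original labels in sorted order; position = assigned code
mapping = {str(label): code for code, label in enumerate(toy_encoder.classes_)}

print(codes.tolist(), mapping)
```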

In [None]:
# pull out the features we want
features = train_df[["Pclass", "Sex", "Age", "Fare"]].values

# instantiate and fit the decision tree
my_dtree = tree.DecisionTreeClassifier()
my_dtree.fit(features, target)

# look at the importance and score of the included features

my_dtree.feature_importances_

Here's [an article that talks more about the math of calculating feature_importances](https://towardsdatascience.com/the-mathematics-of-decision-trees-random-forest-and-feature-importance-in-scikit-learn-and-spark-f2861df67e3). 

For now, just know that the higher the value, the more important the feature. Also, the values in `feature_importances_` are listed in the order that the features are entered into the model. So in our case, "Fare" is most important, closely followed by "Sex."
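
Because the importances come back as a bare array, it can help to attach the feature names to them. The sketch below does this for a tiny tree fit on made-up data (the names `toy_tree`, `toy_features`, and `toy_target` are hypothetical); the same pattern should work with `my_dtree` and the four feature names used above.

```python
import pandas as pd
from sklearn import tree

# hypothetical toy data: [Pclass, Sex], outcome depends only on Sex
feature_names = ["Pclass", "Sex"]
toy_features = [[1, 0], [3, 0], [1, 1], [3, 1]]
toy_target = [1, 1, 0, 0]

toy_tree = tree.DecisionTreeClassifier(random_state=0).fit(toy_features, toy_target)

# label each importance with its feature name, in the order the features were entered
importances = pd.Series(toy_tree.feature_importances_, index=feature_names)

print(importances.sort_values(ascending=False))
```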

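Another metric we can look at is the mean accuracy, which tells you the proportion of correct predictions made. `sklearn` has this built in as the `score` method. Keep in mind that the cell below scores the model on the same data it was trained on, so the result is likely to be optimistic.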

In [None]:
my_dtree.score(features, target)

**What is the accuracy of our model? Are you satisfied?**
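
One reason to be cautious about accuracy measured on the training data: an unrestricted decision tree can memorize its training rows. The sketch below is a hedged illustration using random, made-up data with no real pattern and `sklearn`'s `train_test_split` to hold out a validation set, which is a technique the lab itself does not use.

```python
import numpy as np
from sklearn import tree
from sklearn.model_selection import train_test_split

# made-up data: random features and random 0/1 labels, so there is nothing real to learn
rng = np.random.default_rng(0)
toy_features = rng.normal(size=(200, 4))
toy_target = rng.integers(0, 2, size=200)

# hold out 25% of the rows for validation
X_train, X_val, y_train, y_val = train_test_split(
    toy_features, toy_target, test_size=0.25, random_state=0
)

toy_tree = tree.DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# accuracy on the rows the tree has seen vs. rows it has not seen
train_accuracy = toy_tree.score(X_train, y_train)
val_accuracy = toy_tree.score(X_val, y_val)

print(train_accuracy, val_accuracy)
```

The gap between the two numbers is a rough signal of how much the tree has memorized rather than learned.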

**If time: apply our model to our test data--remember that batch of data that didn't have the "Survived" column.**

In [None]:
# fill in missing data

test_df["Fare"] = test_df["Fare"].fillna(test_df["Fare"].median())
test_df["Age"] = test_df["Age"].fillna(test_df["Age"].median())

# convert the "Sex" column to numbers again; reuse the encoder already fit on the training data
test_df["Sex"] = binary_sex.transform(test_df["Sex"])

In [None]:
# pull out the features we want
test_features = test_df[["Pclass", "Sex", "Age", "Fare"]].values

# make prediction
my_prediction = my_dtree.predict(test_features)

my_prediction

In [None]:
# cross reference w/ passenger id

PassengerId = np.array(test_df["PassengerId"]).astype(int)
# use the passenger IDs as the row index of the prediction table
my_solution = pd.DataFrame(my_prediction, index=PassengerId, columns=["Survived"])

my_solution

# Next Steps

[Click here](https://github.com/kwaldenphd/building-a-ml-model/) to return to the main lab page on GitHub.