# Lab 13: Machine Learning II

Please complete this lab by providing answers in cells after the question. Use **Code** cells to write and run any code you need to answer the question and **Markdown** cells to write out answers in words. After you are finished with the assignment, remember to download it as an **HTML file** and submit it in **ELMS**.

In [None]:
import numpy as np
from datascience import *

# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt

from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score

In this lab, we will go over using decision tree models. We'll then include them in the machine learning workflow to find the best models possible. Remember, the machine learning workflow we will use is as follows:
- Create train and test sets
- Fit the models using train set
- Predict using the test set
- Evaluate models using metrics such as precision and recall
- Make your conclusions

## Pulse of the Nation - Climate Change

Suppose we want to predict which people do not believe in climate change or believe it is not caused by people. To build our model, we will use the Cards Against Humanity Pulse of the Nation dataset. 

In [None]:
cah = Table.read_table("201709-CAH_PulseOfTheNation.csv")
cah.show(5)

Note that a fair number of the variables in this dataset are strings. We can't use strings as is with the `sklearn` package, so we need to convert them into numerical or boolean values first. To do this, we'll create what are called **dummy variables**, which convert categorical variables into 0/1 variables.
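
As a minimal sketch of the idea, assuming a small made-up array of responses (the category names below are hypothetical and not taken from the dataset), comparing a NumPy array of strings to one category gives a `True`/`False` array, which can be converted to 0/1. With three categories, two dummy variables are enough, because the left-out category is the one with 0 on both.

```python
import numpy as np

# Hypothetical categorical responses, for illustration only
colors = np.array(['Red', 'Green', 'Blue', 'Red', 'Blue'])

# Comparing the array to a string gives one True/False value per element;
# astype(int) turns True/False into 1/0
red_dummy = (colors == 'Red').astype(int)
green_dummy = (colors == 'Green').astype(int)

# 'Blue' is the left-out category: it has 0 on both dummy variables
print(red_dummy)
print(green_dummy)
```

Boolean arrays can also be used directly as dummy variables, since `True` and `False` behave like 1 and 0 in arithmetic.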

## Creating Dummy variables

Let's take a look at the `Climate Change` variable to start out. This will be our outcome variable, or **label**. 

In [None]:
cah.group('Climate Change')

We want to predict the people who believe climate change is not real or not caused by people. These are two different categories, so we need to find a way to convert the `Climate Change` values into a variable that is 0 if the response was `DK/REF` or `Real and Caused by People`, and 1 if it was `Not Real At All` or `Real but not Caused by People`. We'll do this by creating a function and using `apply` (if you don't remember how to use `apply`, look back at Lab 4). 

In [None]:
def create_label(response):
    '''
    For a Climate Change response, returns 1 or 0 depending on the answer.
    
    Arguments:
    response: str, the response to the Climate Change question.
    
    Returns:
    An int: 1 if the response is 'Real but not Caused by People' or
    'Not Real At All', and 0 otherwise.
    '''
    
    if response == 'Real but not Caused by People':
        return 1
    elif response == 'Not Real At All':
        return 1
    else:
        return 0

In [None]:
climate_change_dummy = cah.apply(create_label, 'Climate Change')
climate_change_dummy

After using `apply` to get an array of 0 and 1 values, we can add it back into the Table and drop the original `Climate Change` variable. We'll call the new variable `label` since it is the label that we are trying to predict.

<font color = 'red'>**Question 1. Add the `climate_change_dummy` variable to the Table as a variable called `label`. Drop the `Climate Change` variable. Call the new Table `cah_dummy_label`.**</font>

### Dummy Variables with Multiple Categories 

If we want to preserve the multiple categories, we can use multiple dummy variables for a given categorical variable. Let's take a look at the `Gender` variable.

In [None]:
cah.group('Gender')

With this variable, we might want to make sure we keep each of the three unique categories. In order to do this, we can create two dummy variables that contain the same information as this one categorical variable. We'll create a variable called `Female` that is a 0 if the person is not Female and 1 if the person is Female. We'll also create a variable called `Male` that is 0 if the person is not Male and 1 if the person is Male. If a person responded "Other", then they will simply have a 0 on both `Female` and `Male`. 

In [None]:
male = cah.column('Gender') == "Male"
female = cah.column('Gender') == "Female"

Next, we can add these back into the Table and drop the `Gender` variable.

In [None]:
cah_dummy_gender_label = cah_dummy_label.with_columns('Male', male,
                                                     'Female', female).drop('Gender')
cah_dummy_gender_label.show(5)

Let's take a look at two other variables that we want to turn into dummy variables.

In [None]:
cah.group('Political Affiliation')

In [None]:
cah.group('Level of Education')

<font color = 'red'>**Question 2. Create dummy variables for all the other variables that are still categorical variables. Make sure to keep as much information as possible by creating multiple dummy variables if there are more than two categories. Leave out the `DK/REF` group for `Political Affiliation` and the `Other` group for `Level of Education`.**</font>

Call the new Table with all of the dummy variables `cah_clean`. Make sure it does not have any of the original categorical variables.

In [None]:
democrat = ...
republican = ...
independent = ...

college = ...
graduate = ...
high_school = ...
some_college = ...

cah_clean = ...
cah_clean.show(5)

### Step 1: Create train and test sets

First, let's split up the data into train and test sets. For this assignment, we will do a simple holdout set, assigning a random 20% of the data as the test data, and building the model on the remaining 80% of the data. 

<font color = 'red'>**Question 3. Create two Tables, one called `test` and one called `train`. The `test` Table should contain a random 20% of the data, while the `train` Table should contain the other 80%.**</font>

*Hint:* You can shuffle the entire dataset (sample the whole dataset without replacement), then just take the top 80% as your train data.
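
The shuffle-then-slice idea can be sketched on plain row indices. This is only an illustration with NumPy on a hypothetical 10-row dataset (the names `n_rows`, `train_idx`, and `test_idx` are made up here), not the Table-based code the question asks for.

```python
import numpy as np

# Hypothetical dataset size, for illustration only
n_rows = 10
rng = np.random.default_rng(0)

# Shuffle the row indices (equivalent to sampling all rows without replacement)
shuffled = rng.permutation(n_rows)

# 0.8 * n_rows is a float, so convert it to an integer before slicing
n_train = int(n_rows * 0.8)

# The first n_train shuffled indices form the train set, the rest the test set
train_idx = shuffled[:n_train]
test_idx = shuffled[n_train:]
print(len(train_idx), len(test_idx))
```

Because every index appears exactly once in the shuffled array, no row can end up in both sets.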

In [1]:
# Find the number of rows you want to take by multiplying the number of rows in
# cah_clean by 0.8. Remember, this needs to be an integer!
rows_to_take = ...

# Shuffle the Table
shuffled_cah = ...

# Use .take and np.arange to split the data into train and test.
# train should be the first rows_to_take rows of shuffled_cah
# test should be the rest
train = ...
test = ...

Make sure you have 248 rows in the train set and 62 rows in the test set.

In [None]:
train.num_rows

In [None]:
test.num_rows

### Step 2: Fitting the models

Now, let's use the `train` data to fit a Decision Tree model. We can do this using `DecisionTreeClassifier`, similar to how we used `KNeighborsClassifier`. 

In [None]:
tree = DecisionTreeClassifier()

predictors = train.drop('label').rows
outcome = train.column('label')

tree.fit(X = predictors, y = outcome)

We can use the `plot_tree` function to get an idea of what the decision making process looks like for this model. 

In [None]:
plot_tree(tree)
plt.show()

This is very hard to read, and the tree makes a lot of splits. It's likely that the tree is **overfitting**, meaning the model follows the training data too closely to generalize well to new data. Let's try fitting a smaller tree model. We'll set `max_depth` to 5, so that the tree goes at most 5 levels deep.

In [None]:
tree = DecisionTreeClassifier(max_depth = 5)

predictors = train.drop('label').rows
outcome = train.column('label')

tree.fit(X = predictors, y = outcome)
plot_tree(tree)
plt.show()

The text is likely still too small to read, but that is ok. This should at least give you an idea of the size of the tree model.
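
One way to check for overfitting numerically is to compare accuracy on the training data with accuracy on held-out data. The sketch below uses synthetic data from `make_classification` rather than the lab's dataset, and the sizes, depth, and variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic data, for illustration only (already shuffled by make_classification)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, y_train = X[:240], y[:240]
X_test, y_test = X[240:], y[240:]

# One tree with no depth limit and one limited to 3 levels
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
small_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# Compare accuracy on the data the tree has seen with data it has not seen
for name, model in [('unrestricted', full_tree), ('max_depth=3', small_tree)]:
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    print(name, 'depth:', model.get_depth(), 'train:', train_acc, 'test:', test_acc)
```

A large gap between train and test accuracy is a common sign that a model has overfit; whether the smaller tree actually does better on the test set depends on the data.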

<font color = 'red'>**Question 4. Create three model objects called `tree2`, `tree3`, and `tree4` that represent the Decision Tree classifiers with max_depth = 2, 3, and 4, respectively. Using the `train` Table you created above, fit each of the three models.**</font>

If you're not sure about the exact format of the data needed, remember that you need to use `.rows` for the `X` values and `.column` for the `y` values.

In [None]:
# Create the model objects
tree2 = ...
tree3 = ...
tree4 = ...

...

### Step 3: Predict Test Set

As before, we can use the `.predict_proba` method with the model objects to generate predicted scores, and use those with a threshold to get our predictions.

In [None]:
# Setting a threshold 
threshold = 0.3

# Make sure you fit the model before running this!
test_features = test.drop('label').rows
tree_predicted = tree.predict_proba(test_features)[:,1] > threshold
tree_predicted

The `True` and `False` values correspond to a prediction of `1` (for `True`) and `0` (for `False`). When we created our `label` variable, we defined the positive case `1` as believing that climate change is not real or not caused by people. The `[:,1]` selects the second column of the `predict_proba` output, which holds the predicted score for the `1` class, so `1` remains our positive case.
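
To see what `predict_proba` returns, here is a minimal sketch on a hypothetical toy dataset with a single feature (the data and the name `model` are made up for illustration). The output has one column per class, in the order given by the model's `classes_` attribute.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: one feature and 0/1 labels
X = np.array([[0], [1], [2], [3], [4], [5]])
y = np.array([0, 0, 1, 0, 1, 1])

model = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)

# One row per example, one column per class
probs = model.predict_proba(X)
print(model.classes_)  # the order of the columns in probs

# Column 1 holds the score for class 1; compare it to a threshold
positive_scores = probs[:, 1]
predicted = positive_scores > 0.3
print(predicted)
```

Each row of `probs` sums to 1, so a higher score for class `1` always means a lower score for class `0`.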

Since our `label` is already a 0/1 variable, we don't need to do anything to use it for our performance metrics. 

In [None]:
expected = test.column('label')

<font color = 'red'>**Question 5. Create additional arrays that contain the predicted values for each of the models that we've fit (call them `tree2_predicted`, `tree3_predicted`, and `tree4_predicted`).**</font>

In [None]:
tree2_predicted = ...
tree3_predicted = ...
tree4_predicted = ...


### Step 4: Evaluate

You can get a confusion matrix using the `confusion_matrix` function that we brought in at the beginning. This is part of the `sklearn.metrics` module.

In [None]:
conf_matrix = confusion_matrix(expected,tree_predicted)

In [None]:
conf_matrix

The columns represent predictions and the rows represent actual values, so the top left is **true negatives (TN)**, the bottom right is **true positives (TP)**, the top right is **false positives (FP)**, and the bottom left is **false negatives (FN)**.

<img src="confusion_matrix.jpeg"/>
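
The layout can be checked on a small hand-made example. The arrays below are hypothetical; `ravel()` flattens the 2x2 matrix row by row, which gives the four counts in the order TN, FP, FN, TP.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels, for illustration only
actual = np.array([0, 0, 0, 1, 1, 1, 1, 0])
predicted = np.array([0, 1, 0, 1, 0, 1, 1, 0])

cm = confusion_matrix(actual, predicted)
print(cm)

# Rows are actual values and columns are predictions,
# so flattening row by row gives TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)
```

In this example there are 3 true negatives, 1 false positive, 1 false negative, and 3 true positives.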

### Evaluation 

Precision measures the accuracy of the classifier when it predicts an example to be positive. It is the ratio of correctly predicted positive examples to examples predicted to be positive. 

$$ Precision = \frac{TP}{TP+FP}$$

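Recall measures the ability of the classifier to find positive examples in the data. It is the ratio of correctly predicted positive examples to all examples that are actually positive.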

$$ Recall = \frac{TP}{TP+FN} $$

By selecting different thresholds we can vary and tune the precision and recall of a given classifier. A conservative classifier (threshold 0.99) will classify a case as 1 only when it is *very sure*, leading to high precision. On the other end of the spectrum, a low threshold (e.g. 0.01) will lead to higher recall. 
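
The trade-off can be worked through by hand. The sketch below uses made-up labels and scores (not output from any model in this lab) and a hypothetical helper, `precision_recall`, that applies the two formulas above directly.

```python
import numpy as np

# Hypothetical true labels and predicted scores, for illustration only
actual = np.array([1, 1, 1, 0, 1, 0, 0, 0])
scores = np.array([0.95, 0.80, 0.60, 0.55, 0.40, 0.30, 0.20, 0.05])

def precision_recall(threshold):
    '''Returns (precision, recall) when predicting 1 for scores above threshold.'''
    predicted = scores > threshold
    tp = np.sum(predicted & (actual == 1))
    fp = np.sum(predicted & (actual == 0))
    fn = np.sum(~predicted & (actual == 1))
    # Note: precision is undefined if nothing is predicted positive (tp + fp == 0)
    return tp / (tp + fp), tp / (tp + fn)

for t in [0.25, 0.5, 0.9]:
    p, r = precision_recall(t)
    print('threshold:', t, 'precision:', p, 'recall:', r)
```

With these made-up numbers, raising the threshold from 0.25 to 0.9 moves precision from about 0.67 up to 1.0, while recall falls from 1.0 to 0.25.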

We can use the `precision_score` and `recall_score` functions to find the value of these measures.

In [None]:
precision_score(expected,tree_predicted)

In [None]:
recall_score(expected,tree_predicted)

<font color = 'red'>**Question 6. Find the confusion matrix for one of the other models. Use the `precision_score` and `recall_score` functions to find precision and recall for your models.**</font>

<font color = 'red'>**Question 7. Which of the two models whose performance we measured above performed best according to precision? According to recall?**</font>

### Step 5: Repeating the steps

We've done one iteration ... but we've only done it with one threshold, and we haven't tuned the parameters much. We won't go through all of the various ways we can fine-tune our models, but we can show how it is done: using loops.
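
The general pattern of trying every combination of two settings and recording the results looks like the sketch below. The settings and the scoring rule are made up purely for illustration; in practice, the line that computes `score` would be replaced by fitting a model and calculating a metric.

```python
import numpy as np

# Empty arrays that will collect one value per combination tried
first_values = np.array([])
second_values = np.array([])
scores = np.array([])

# Hypothetical settings; the nested loops visit every (a, b) pair
for a in [1, 2]:
    for b in [10, 20, 30]:
        score = a * b  # stand-in for fitting a model and computing a metric
        first_values = np.append(first_values, a)
        second_values = np.append(second_values, b)
        scores = np.append(scores, score)

print(len(scores))
```

Because all three arrays are appended to in the same iteration, position `i` in each array refers to the same combination, which is what allows them to be combined as columns afterwards.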

<font color = 'red'>**Question 8. Write a loop that tries thresholds of 0.1, 0.2, 0.3, 0.4, and 0.5 and max_depth values of 2, 3, 4, and 5 to make predictions using Decision Trees. Store all of the values tried within arrays so that they can be combined into a Table afterwards.**</font>

The loop has been started below for you.

In [None]:
thresholds = make_array()
depths = make_array()
precisions = make_array()
recalls = make_array()

predictors = train.drop('label').rows
outcome = train.column('label')

expected = test.column('label')

for threshold in make_array(0.1, 0.2, 0.3, 0.4, 0.5):
    for depth in make_array(2, 3, 4, 5):
        ...
    

Use the code below to look at the results afterwards.

In [None]:
# You can use these to look at results
tree_results = Table().with_columns('Threshold', thresholds,
                                    'Max Depth', depths,
                                    'Precision', precisions,
                                    'Recall', recalls)
tree_results.show(20)

<font color = 'red'>**Question 9. What model and threshold combination gives the best precision? The best recall? If there are ties, choose the one that has the better performance in the other metric. For example, if two model/threshold combinations are tied for best recall, choose the one with the better precision of the two. If they are tied in both, then just choose whichever one you want.**</font>

*Hint:* Use `.sort` to sort by a variable.

### Step 6: Model Selection and Conclusions

Generally, when deciding on the best model, we compare the models we fit with each other, as well as against a baseline.

<font color = 'red'>**Question 10. Consider the threshold used in the best model by precision chosen in Question 9. How well does our best model perform compared to the baseline of a random model?**</font>