Copyright 2020 Andrew M. Olney and made available under [CC BY-SA](https://creativecommons.org/licenses/by-sa/4.0) for text and [Apache-2.0](http://www.apache.org/licenses/LICENSE-2.0) for code.


# Decision trees: Problem solving

We previously looked at predicting whether or not a candy is popular based on its other properties using logistic regression.

This gave us an idea of how different properties **add** together to make a candy popular, but it didn't give us as much of an idea of how the properties act on each other.

In this session, you will predict whether or not a candy is popular using decision trees.

This dataset [was collected](http://fivethirtyeight.com/features/the-ultimate-halloween-candy-power-ranking/) to discover the most popular Halloween candy.

| Variable         | Type              | Description                                                  |
|:-----------------|:------------------|:--------------------------------------------------------------|
| chocolate        | Numeric (binary)  | Does it contain chocolate?                                   |
| fruity           | Numeric (binary)  | Is it fruit flavored?                                        |
| caramel          | Numeric (binary)  | Is there caramel in the candy?                               |
| peanutalmondy    | Numeric (binary)  | Does it contain peanuts, peanut butter or almonds?           |
| nougat           | Numeric (binary)  | Does it contain nougat?                                      |
| crispedricewafer | Numeric (binary)  | Does it contain crisped rice, wafers, or a cookie component? |
| hard             | Numeric (binary)  | Is it a hard candy?                                          |
| bar              | Numeric (binary)  | Is it a candy bar?                                           |
| pluribus         | Numeric (binary)  | Is it one of many candies in a bag or box?                   |
| sugarpercent     | Numeric (0 to 1)  | The percentile of sugar it falls under within the data set.  |
| pricepercent     | Numeric (0 to 1)  | The unit price percentile compared to the rest of the set.   |
| winpercent       | Numeric (percent) | The overall win percentage according to 269,000 matchups     |
| popular          | Numeric (binary)  | 1 if win percentage is over 50% and 0 otherwise              |

<div style="text-align:center;font-size: smaller">
    <b>Source:</b> This dataset is Copyright (c) 2014 ESPN Internet Ventures and distributed under an MIT license.
</div>


## Load the data

First import `pandas`.

Load a dataframe with `"datasets/candy-data.csv"` but use `index_col="competitorname"` to make `competitorname` an ID instead of a variable.
Then display the dataframe.
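As a minimal sketch of the loading step, the example below reads a tiny in-memory CSV instead of the real file so that it runs anywhere. The three rows (`Candy A` to `Candy C`) and their values are made up for illustration and contain only a few of the real columns; with the actual dataset you would pass `"datasets/candy-data.csv"` to `read_csv` instead.

```python
import io
import pandas as pd

# Hypothetical stand-in for "datasets/candy-data.csv": three made-up rows
# with a handful of the real column names (values are NOT from the dataset).
csv_text = """competitorname,chocolate,fruity,sugarpercent,winpercent,popular
Candy A,1,0,0.73,66.9,1
Candy B,0,1,0.91,52.3,1
Candy C,0,1,0.05,32.2,0
"""

# index_col turns competitorname into the row index (an ID) rather than
# leaving it as an ordinary column. With the real file this would be:
# dataframe = pd.read_csv("datasets/candy-data.csv", index_col="competitorname")
dataframe = pd.read_csv(io.StringIO(csv_text), index_col="competitorname")

# Display the dataframe (in a notebook, a bare `dataframe` also works).
print(dataframe)
```

Because `competitorname` becomes the index, it is excluded from later numeric operations such as correlations and model fitting.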

## Explore the data

Since this is a dataset you've looked at before, just make a correlation heatmap to show how the variables are related to each other.

Start by importing `plotly.express`.

And create and show the heatmap figure in one line.
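The numbers behind a correlation heatmap come from the dataframe's correlation matrix. The sketch below computes that matrix with `DataFrame.corr()` on a small made-up dataframe (the values are assumptions, not the candy data), leaving out the plotting call so the example has no dependency beyond `pandas`.

```python
import pandas as pd

# Made-up binary columns standing in for the candy data (assumption).
dataframe = pd.DataFrame({
    "chocolate": [1, 1, 0, 0, 1, 0],
    "fruity":    [0, 0, 1, 1, 0, 1],
    "bar":       [1, 0, 0, 0, 1, 0],
})

# Pairwise Pearson correlations between all numeric columns.
corr = dataframe.corr()
print(corr.round(2))
```

Passing a matrix like `corr` to `plotly.express.imshow` is one way to draw it as a heatmap, with rows and columns labeled by variable name.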

----------------------------
**QUESTION:**

Look at the first two columns.
What variables do these correspond to?

**ANSWER: (click here to edit)**


<hr>

**QUESTION:**

How would you describe their pattern of correlation with other variables?

**ANSWER: (click here to edit)**


<hr>

## Prepare train/test sets

You need to split the dataframe into training data and testing data, and also separate the predictors from the class labels.

Start by dropping the label, `popular`, and its counterpart, `winpercent`, to make a new dataframe called `X`.

Save a dataframe with just `popular` in `Y`.

Import `sklearn.model_selection` to split `X` and `Y` into train and test sets.

Now do the splits. Use `random_state=1` so we all get the same answer.
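The preparation steps above can be sketched as follows. The dataframe here is a made-up eight-row stand-in (an assumption, not the candy data), and the variable names `X_train`, `X_test`, `Y_train`, and `Y_test` are illustrative choices.

```python
import pandas as pd
import sklearn.model_selection as model_selection

# Made-up stand-in for the candy dataframe (assumption).
dataframe = pd.DataFrame({
    "chocolate":  [1, 1, 0, 0, 1, 0, 1, 0],
    "fruity":     [0, 0, 1, 1, 0, 1, 0, 1],
    "winpercent": [67, 60, 35, 42, 71, 39, 55, 46],
    "popular":    [1, 1, 0, 0, 1, 0, 1, 0],
})

# Predictors: everything except the label and the variable it was derived from.
X = dataframe.drop(columns=["popular", "winpercent"])

# Labels: a one-column dataframe holding just `popular`.
Y = dataframe[["popular"]]

# random_state fixes the shuffle so the split is reproducible.
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(
    X, Y, random_state=1
)
print(X_train.shape, X_test.shape)
```

Dropping `winpercent` matters because `popular` is defined from it; leaving it in would let the model recover the label directly.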

## Decision tree model

First import `sklearn.tree`.

Now create the decision tree model.

----------------------------
**QUESTION:**

Why don't we need to scale anything?

**ANSWER: (click here to edit)**


<hr>

Fit the model and get predictions.
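A minimal sketch of creating, fitting, and predicting with a decision tree is shown below. It uses made-up data (an assumption) and default hyperparameters apart from `random_state`; the names `decisionTree` and `predictions` are illustrative.

```python
import pandas as pd
import sklearn.model_selection as model_selection
import sklearn.tree as tree

# Made-up predictors and labels standing in for the candy data (assumption).
X = pd.DataFrame({
    "chocolate": [1, 1, 0, 0, 1, 0, 1, 0],
    "fruity":    [0, 0, 1, 1, 0, 1, 0, 1],
})
Y = pd.DataFrame({"popular": [1, 1, 0, 0, 1, 0, 1, 0]})

X_train, X_test, Y_train, Y_test = model_selection.train_test_split(
    X, Y, random_state=1
)

# Create the model; random_state makes tie-breaking between splits repeatable.
decisionTree = tree.DecisionTreeClassifier(random_state=1)

# Learn the splits from the training data only.
decisionTree.fit(X_train, Y_train)

# Predict class labels for the held-out test rows.
predictions = decisionTree.predict(X_test)
print(predictions)
```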

## Evaluate model performance

Import `sklearn.metrics`.

Get the accuracy.

And get the recall, precision, and f1.

As we can see, the accuracy and the average precision, recall, and f1 are all very good.
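The metric calls can be sketched as below. The label and prediction lists are made up for illustration (an assumption), so the scores printed here say nothing about the candy model; in the notebook you would pass your own `Y_test` and predictions.

```python
import sklearn.metrics as metrics

# Made-up true labels and predictions (assumption), standing in for
# Y_test and the output of the model's predict call.
Y_test      = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = [1, 0, 0, 1, 0, 1, 1, 0]

# Fraction of test rows whose predicted label matches the true label.
accuracy = metrics.accuracy_score(Y_test, predictions)
print(accuracy)

# Per-class and averaged precision, recall, and f1 in one text table.
print(metrics.classification_report(Y_test, predictions))
```

`classification_report` returns a formatted string, so wrapping it in `print` keeps the table layout readable.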

## Display the Tree

First import `graphviz`.

And use it to build the tree. Try to copy this from your other notebook if at all possible.
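One way to get a drawable description of a fitted tree is `sklearn.tree.export_graphviz`, which returns DOT source text when `out_file=None`. The sketch below uses made-up data and illustrative class names (`"unpopular"`, `"popular"`), both assumptions, and stops at the DOT text so it runs without the `graphviz` package.

```python
import pandas as pd
import sklearn.tree as tree

# Made-up predictors and labels standing in for the candy data (assumption).
X = pd.DataFrame({
    "chocolate": [1, 1, 0, 0, 1, 0, 1, 0],
    "fruity":    [0, 0, 1, 1, 0, 1, 0, 1],
})
Y = pd.Series([1, 1, 0, 0, 1, 0, 1, 0], name="popular")

decisionTree = tree.DecisionTreeClassifier(random_state=1).fit(X, Y)

# Export the fitted tree as DOT text; class_names follow ascending
# label order, so index 0 describes label 0 and index 1 describes label 1.
dot_data = tree.export_graphviz(
    decisionTree,
    out_file=None,
    feature_names=list(X.columns),
    class_names=["unpopular", "popular"],
    filled=True,
)
print(dot_data)
```

In a notebook with `graphviz` installed, `graphviz.Source(dot_data)` renders this text as a diagram.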

--------------

**QUESTION:**

Explain the tree - what are the first three important decisions it makes?

**ANSWER: (click here to edit)**


<hr>

**QUESTION:**

Consider the features that the logistic regression model identified as important, shown below.
How does the decision tree compare?

![image.png](attachment:image.png)

**ANSWER: (click here to edit)**


<hr>

**QUESTION:**

Which model (decision tree or logistic regression) do you think is more correct?
How would you know?

**ANSWER: (click here to edit)**


<hr>
