Copyright 2020 Andrew M. Olney and made available under [CC BY-SA](https://creativecommons.org/licenses/by-sa/4.0) for text and [Apache-2.0](http://www.apache.org/licenses/LICENSE-2.0) for code.


# Logistic Regression: Problem solving

In this session, you will predict whether or not a candy is popular based on its other properties.
This dataset [was collected](http://fivethirtyeight.com/features/the-ultimate-halloween-candy-power-ranking/) to discover the most popular Halloween candy.

| Variable         | Type              | Description                                                  |
|:-----------------|:------------------|:--------------------------------------------------------------|
| chocolate        | Numeric (binary)  | Does it contain chocolate?                                   |
| fruity           | Numeric (binary)  | Is it fruit flavored?                                        |
| caramel          | Numeric (binary)  | Is there caramel in the candy?                               |
| peanutyalmondy   | Numeric (binary)  | Does it contain peanuts, peanut butter or almonds?           |
| nougat           | Numeric (binary)  | Does it contain nougat?                                      |
| crispedricewafer | Numeric (binary)  | Does it contain crisped rice, wafers, or a cookie component? |
| hard             | Numeric (binary)  | Is it a hard candy?                                          |
| bar              | Numeric (binary)  | Is it a candy bar?                                           |
| pluribus         | Numeric (binary)  | Is it one of many candies in a bag or box?                   |
| sugarpercent     | Numeric (0 to 1)  | The percentile of sugar it falls under within the data set.  |
| pricepercent     | Numeric (0 to 1)  | The unit price percentile compared to the rest of the set.   |
| winpercent       | Numeric (percent) | The overall win percentage according to 269,000 matchups     |
| popular          | Numeric (binary)  | 1 if win percentage is over 50% and 0 otherwise              |

**Source:** This dataset is Copyright (c) 2014 ESPN Internet Ventures and distributed under an MIT license.

## Load the data

First import `pandas`.

Load a dataframe with `"datasets/candy-data.csv"` and display it.

Notice there is a bogus variable `competitorname` that is actually an ID, also known as an **index**. 
We saw the same thing in KNN regression with the `mpg` dataset, but that time it was the car name.

Load the dataframe again, but this time use `index_col="competitorname"` to fix this.
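
As a reminder of what `index_col` does, here is a minimal sketch on a made-up three-row CSV. The candy names and values are hypothetical, and `io.StringIO` only stands in for a real file path.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for a CSV file on disk
csv_text = """competitorname,chocolate,fruity
Candy A,1,0
Candy B,0,1
Candy C,1,0
"""

# Without index_col, the name is read as an ordinary column
plain = pd.read_csv(io.StringIO(csv_text))

# With index_col, the name becomes the row index instead of a predictor
indexed = pd.read_csv(io.StringIO(csv_text), index_col="competitorname")

print(list(plain.columns))
print(list(indexed.columns))
```

With the index set, the ID no longer shows up among the columns, so it cannot accidentally be used as a predictor.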

## Explore the data

### Descriptive statistics

Describe the data.

Remember that for the 0/1 variables, the mean is the proportion of candies that have that property.
For example, `chocolate` is in 43.5% of the candies.
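
The sketch below illustrates this on a small made-up dataframe (the values are hypothetical, not from the candy data). The mean of a 0/1 column equals the share of rows with a 1, and `isna()` gives a quick count of missing values.

```python
import pandas as pd

# Hypothetical toy data, not the candy dataset
toy = pd.DataFrame({
    "chocolate": [1, 0, 1, 0, 0],
    "hard": [0, 0, 0, 1, 0],
})

# Count, mean, std, min, quartiles, and max for each column
summary = toy.describe()

# For a 0/1 column, the mean is the proportion of rows equal to 1
chocolate_share = toy["chocolate"].mean()

# Number of missing values in each column
missing_counts = toy.isna().sum()

print(summary)
print(chocolate_share)
print(missing_counts)
```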

**QUESTION:**

What is the least common ingredient (there may be more than one that is the same)?

**ANSWER: (click here to edit)**


<hr>

**QUESTION:**

What is the most common ingredient?

**ANSWER: (click here to edit)**


<hr>

**QUESTION:**

Do you see any problems with the data, e.g. missing data?

**ANSWER: (click here to edit)**


<hr>

### Correlations

Create and display a correlation matrix.
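
If you need a refresher, this is a minimal sketch of a correlation matrix on hypothetical toy data. Pulling out one column of the matrix and sorting it is one possible way to read off the strongest relationships with a label.

```python
import pandas as pd

# Hypothetical toy data, not the candy dataset
toy = pd.DataFrame({
    "chocolate": [1, 1, 0, 0, 1, 0],
    "hard":      [0, 0, 1, 1, 0, 0],
    "popular":   [1, 1, 0, 0, 1, 0],
})

# Pairwise Pearson correlations between all columns
corr = toy.corr()

# Correlations with the label only, sorted from most negative to most positive
with_label = corr["popular"].drop("popular").sort_values()

print(corr)
print(with_label)
```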

**QUESTION:**

What property is most positively related to being popular?
What property is most negatively related to being popular?

**ANSWER: (click here to edit)**


<hr>

Create a heatmap for the correlation matrix.
Start by importing `plotly.express`.

Create the heatmap figure.

**QUESTION:**

What color is strongly negative, what color is zero, and what color is strongly positive?

**ANSWER: (click here to edit)**


<hr>

**QUESTION:**

What's going on in the lower right corner?

**ANSWER: (click here to edit)**


<hr>

### Histograms

For binary variables, histograms don't tell us anything that the descriptives don't already tell us.

However, there are two percent-type variables to plot, `sugarpercent` and `pricepercent`.

Plot a histogram of `sugarpercent`.

Plot a histogram of `pricepercent`.

**QUESTION:**

What can you say about the distributions of `sugarpercent` and `pricepercent`?
Is there anything we should be concerned about?

**ANSWER: (click here to edit)**


<hr>

## Prepare train/test sets

You need to split the dataframe into training data and testing data, and also separate the predictors from the class labels.

Start by dropping the label, `popular`, and `winpercent`, the variable it is computed from, to make a new dataframe called `X`.

Save a dataframe with just `popular` in `Y`.

Import `sklearn.model_selection` to split `X` and `Y` into train and test sets.

Now do the splits.
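
The steps above can be sketched on hypothetical toy data as follows. The `test_size` and `random_state` values are illustrative assumptions, not settings prescribed by this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with the same kinds of columns as the candy data
toy = pd.DataFrame({
    "chocolate":    [1, 1, 0, 0, 1, 0, 1, 0],
    "sugarpercent": [0.7, 0.6, 0.9, 0.1, 0.5, 0.3, 0.8, 0.2],
    "winpercent":   [67.0, 71.5, 38.0, 42.1, 59.9, 45.0, 81.2, 33.4],
    "popular":      [1, 1, 0, 0, 1, 0, 1, 0],
})

# Predictors: everything except the label and the variable it is derived from
X = toy.drop(columns=["popular", "winpercent"])

# Label: double brackets keep it as a one-column dataframe
Y = toy[["popular"]]

# test_size and random_state are illustrative choices
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=1
)

print(X_train.shape, X_test.shape)
```

Rows stay paired across the split: each row of `X_train` has the same index as its label in `Y_train`.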

## Logistic regression model

Import libraries for:

- Logistic regression
- Metrics
- Ravel

**NOTE: technically we don't need to scale anything and so don't need a pipeline.**

**QUESTION:**

Why don't we need to scale anything?

**ANSWER: (click here to edit)**


<hr>

Create the logistic regression model.

Train the logistic regression model using the splits.

Get predictions from the model using the test data.
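
Here is a minimal sketch of the create, train, and predict steps on hypothetical toy splits, using default model settings. It also shows where `ravel` comes in: `fit` expects the labels as a flat array rather than a one-column dataframe.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical toy splits, not the candy dataset
X_train = pd.DataFrame({
    "chocolate": [1, 1, 1, 0, 0, 0, 1, 0],
    "hard":      [0, 0, 0, 1, 1, 0, 0, 1],
})
Y_train = pd.DataFrame({"popular": [1, 1, 1, 0, 0, 0, 1, 0]})
X_test = pd.DataFrame({"chocolate": [1, 0], "hard": [0, 1]})

# Create the model with default settings
lm = LogisticRegression()

# np.ravel flattens the one-column dataframe into a 1-D array of labels
lm.fit(X_train, np.ravel(Y_train))

# One predicted class (0 or 1) per test row
predictions = lm.predict(X_test)

print(predictions)
```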

## Assessing the model

Print the model accuracy.
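
Accuracy is easier to judge next to a baseline. This sketch uses hypothetical labels and predictions to compute accuracy with `metrics.accuracy_score`, and compares it to the accuracy you would get by always guessing the more common class.

```python
from sklearn import metrics

# Hypothetical true labels and predictions, not results from the candy model
y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1, 0, 1, 0, 0]

# Share of predictions that match the true label
accuracy = metrics.accuracy_score(y_true, y_pred)

# Share of 1s in the labels, i.e. the mean of a 0/1 variable
share_positive = sum(y_true) / len(y_true)

# Accuracy of always predicting the more common class
baseline = max(share_positive, 1 - share_positive)

print(accuracy, baseline)
```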

**QUESTION:**

How does this compare to the average value of `popular`? 
Is this a good accuracy?

**ANSWER: (click here to edit)**


<hr>

Print precision, recall, and F1.
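
As a sketch, `metrics.classification_report` reports precision, recall, and F1 separately for each class. The labels and predictions below are hypothetical; `output_dict=True` is an optional variant for pulling out single values.

```python
from sklearn import metrics

# Hypothetical true labels and predictions, not results from the candy model
y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1, 0, 1, 0, 0]

# Text table with one row per class
print(metrics.classification_report(y_true, y_pred))

# Same numbers as a nested dictionary keyed by class label
report = metrics.classification_report(y_true, y_pred, output_dict=True)
precision_popular = report["1"]["precision"]
recall_popular = report["1"]["recall"]
recall_unpopular = report["0"]["recall"]
```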

**QUESTION:**

How do the precision/recall/F1 compare for unpopular (0) and popular (1)?

**ANSWER: (click here to edit)**


<hr>

Make an ROC plot. 

**QUESTION:**

If we decreased the recall to .66, what would the false positive rate be? HINT: hover your mouse over the plot line at that value.

**ANSWER: (click here to edit)**


<hr>

This last part is something we didn't really get to develop in the first session, so just run the code.

The odds ratio shows how much a property multiplies the odds that a candy is `popular`, holding the other properties fixed.
For many of these, the property is just presence/absence.
For example, the odds ratio of 4.44 on chocolate means that the odds of being popular are 4.44 times higher for candy with chocolate than for otherwise identical candy without it.
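
To see why an odds ratio describes odds rather than probabilities, here is a small worked sketch with hypothetical numbers: an odds ratio of 4 and a 20% starting probability, neither taken from the model.

```python
import numpy as np

# Hypothetical coefficient chosen so that the odds ratio is 4
coefficient = np.log(4.0)
odds_ratio = np.exp(coefficient)

# Hypothetical starting point: 20% chance of being popular without the property
p_without = 0.20
odds_without = p_without / (1 - p_without)

# The odds ratio multiplies the odds, not the probability
odds_with = odds_without * odds_ratio

# Convert the new odds back to a probability
p_with = odds_with / (1 + odds_with)

print(odds_ratio, odds_without, odds_with, p_with)
```

In this example the odds go from about 0.25 to about 1, so the probability rises from 20% to roughly 50%, not to four times 20%.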

<!-- TODO: move this into the main notebook; we will have space if we take out the data cleaning currently there -->

```python
# Exponentiate each logistic regression coefficient to get its odds ratio
pd.DataFrame({"variable": X.columns, "odds_ratio": np.exp(np.ravel(lm.coef_))})
```

|    | variable         | odds_ratio |
|:---|:-----------------|:-----------|
| 0  | chocolate        | 4.441083   |
| 1  | fruity           | 1.238341   |
| 2  | caramel          | 1.750972   |
| 3  | peanutyalmondy   | 2.435954   |
| 4  | nougat           | 0.759784   |
| 5  | crispedricewafer | 1.219726   |
| 6  | hard             | 0.388651   |
| 7  | bar              | 1.488628   |
| 8  | pluribus         | 1.173178   |
| 9  | sugarpercent     | 2.066613   |


**QUESTION:**

What are the top three *ingredients* that make something popular? Do any surprise you given the correlation matrix?

**ANSWER: (click here to edit)**


<hr>
