Copyright 2022 Andrew M. Olney and made available under [CC BY-SA](https://creativecommons.org/licenses/by-sa/4.0) for text and [Apache-2.0](http://www.apache.org/licenses/LICENSE-2.0) for code.


# Plotting: Problem solving

In this notebook we'll look at the `cereal` dataset, which consists mostly of nutrition information along with some other properties:

| Variable | Type | Description |
|:-------|:-------|:-------|
| name     | Nominal | Name of cereal (an ID)                                                                                                                          |
| mfr      | Nominal | Manufacturer of cereal: (A)merican Home Food Products; (G)eneral Mills; (K)elloggs; (N)abisco; (P)ost; (Q)uaker Oats; (R)alston Purina |
| type     | Nominal | (H)ot or (C)old                                                                                                                        |
| calories | Ratio   | calories per serving                                                                                                                   |
| protein  | Ratio   | grams of protein                                                                                                                       |
| fat      | Ratio   | grams of fat                                                                                                                           |
| sodium   | Ratio   | milligrams of sodium                                                                                                                   |
| fiber    | Ratio   | grams of dietary fiber                                                                                                                 |
| carbo    | Ratio   | grams of complex carbohydrates                                                                                                         |
| sugars   | Ratio   | grams of sugars                                                                                                                        |
| potass   | Ratio   | milligrams of potassium                                                                                                                |
| vitamins | Ordinal | vitamins and minerals: 0, 25, or 100, indicating the typical percentage of the FDA-recommended amount                                |
| shelf    | Ordinal | display shelf (1, 2, or 3, counting from the floor)                                                                                    |
| weight   | Ratio   | weight in ounces of one serving                                                                                                        |
| cups     | Ratio   | number of cups in one serving                                                                                                          |
| rating   | Ratio   | a rating of the cereals (Possibly from Consumer Reports?)                                                                              |
      
<div style="text-align:center;font-size: smaller">
    <b>Source:</b> This dataset was taken from <a href="https://www.kaggle.com/crawford/80-cereals">Kaggle</a>.
</div>
<br>

## Import libraries

We need to load our data into a dataframe and do some plots, so import `pandas` and `plotly.express` below.


## Load data

Load `"datasets/cereal.csv"` into a dataframe and display the dataframe.
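
One way to do this is with `pd.read_csv`. So that the sketch below runs on its own, it reads from a small in-memory stand-in with made-up rows (the cereal names and values are hypothetical, not taken from the dataset); in the notebook you would pass the path `"datasets/cereal.csv"` instead.

```python
import io
import pandas as pd

# Stand-in for the real file: two made-up rows with a few of the columns described above.
# In the notebook, use: dataframe = pd.read_csv("datasets/cereal.csv")
toy_csv = io.StringIO(
    "name,mfr,type,fat,sugars\n"
    "Toy Flakes,K,C,1,6\n"
    "Toy Oats,Q,H,2,1\n"
)
dataframe = pd.read_csv(toy_csv)

# In a notebook, ending a cell with the variable name displays the dataframe.
dataframe
```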

## Research questions

Any plot you make should be designed to address a question.
At this point, we're **not making a statistical/quantitative argument** but rather a qualitative one.

Here are some possible/plausible questions. 
For each one, think about the type of plot that might make the most sense.
Then open up the dropdown to see the plot we're going to do.
If you think we're choosing wrong, bring it up for discussion!

#### Do fat and sugar go together?

We might expect that fatty cereals have a lot of sugar and vice versa.
What would be a good plot?

<details>
  <summary>Answer</summary>
  
  Since both variables are ratio, a scatterplot makes sense. That way we can see the individual datapoints and even label them if we want.
</details>

#### Do some manufacturers have more sugar in their cereals than others?

Maybe all manufacturers have similar product lines, but maybe some specialize in sugary cereal.
What would be a good plot?

<details>
  <summary>Answer</summary>
  
  Since manufacturer is nominal and we want to compare manufacturers on a single value, a barplot makes sense. Then we can use the average amount of sugar for each one.
</details>

#### How do hot and cold cereals compare on healthy attributes like protein, fiber, and vitamins?

Hot cereals are often more traditional and less processed, so maybe they have a better nutritional profile.
What would be a good plot?

<details>
  <summary>Answer</summary>
  
  Since hot/cold is nominal and we want to compare them on multiple values, a line plot makes sense. We would then plot each healthy attribute as its own line on the plot.
</details>

#### Is protein approximately normally distributed, or are there cereals with unusually high and low amounts of protein?

If protein is not being manipulated by manufacturers, we'd expect it to be approximately normal. But if manufacturers are intentionally adding/removing protein, we might see that in the distribution of the variable.
What would be a good plot?

<details>
  <summary>Answer</summary>
  
  A histogram makes sense: it is the only plot we've talked about so far that shows the distribution of a single variable.
</details>

## Do fat and sugar go together?

Make a scatterplot using `fat` and `sugars`.

<details>
  <summary>Interpretation</summary>
  
That's a curious graph!
Notice first that although there are 70+ cereals, we're only seeing about 35 points.
That's because some cereals have the same values for sugar and fat, so they are plotted on top of each other - something to watch out for!

Overall, as fat increases, sugar seems to settle in the middle of its range (about 7) rather than rising with it.
</details>
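
To check how much overplotting there is, one option is to count the distinct (`fat`, `sugars`) pairs and compare that with the number of rows. The sketch below uses made-up values in place of the real dataframe, and the column name `count` is chosen here for illustration.

```python
import pandas as pd

# Made-up (fat, sugars) pairs standing in for the real cereal dataframe.
dataframe = pd.DataFrame({"fat": [0, 0, 1, 1, 2], "sugars": [3, 3, 6, 6, 7]})

# Count how many rows share each (fat, sugars) pair.
pair_counts = dataframe.groupby(["fat", "sugars"]).size().reset_index(name="count")

# Fewer distinct pairs than rows means some points are drawn on top of each other.
print(len(dataframe), "rows,", len(pair_counts), "distinct points")
```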


## Do some manufacturers have more sugar in their cereals than others?

First, group by manufacturers (`mfr`).

Now make a barplot with the means of these groups for `sugars`.

<details>
  <summary>Interpretation</summary>
  
It certainly looks like some manufacturers specialize in low sugar, e.g. `American Home Food Products` and `Nabisco`.
</details>

## How do hot and cold cereals compare on healthy attributes like protein, fiber, and vitamins?

The first notebook made a line plot with the groups.
However, that would show **all** the variables, and this dataframe has too many for that.

So as a first step, create a new dataframe `healthy` with just the columns `type`, `protein`, `fiber`, and `vitamins` in it, then display `healthy`.

*Hint: selecting columns was shown in the 'Nature of data' notebook*.
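
As a reminder, columns can be selected by indexing the dataframe with a list of column names. This is a sketch on made-up values; the extra `sugars` column is only there to show that it gets dropped.

```python
import pandas as pd

# Made-up values standing in for the real cereal dataframe.
dataframe = pd.DataFrame({
    "type": ["C", "H"],
    "protein": [2, 5],
    "fiber": [1.0, 2.0],
    "vitamins": [25, 0],
    "sugars": [6, 1],
})

# Double brackets: the inner list names the columns to keep.
healthy = dataframe[["type", "protein", "fiber", "vitamins"]]
healthy
```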

Now group `healthy` by `type` and store in `healthy_groups`.

Now make the line plot using the mean of `healthy_groups`.

<details>
  <summary>Interpretation</summary>
  
Again, very interesting: it looks like cold cereals are being enriched with vitamins in a way hot cereals are not.
Hot cereals have a bit more protein but, surprisingly, less fiber.
</details>


## Is protein approximately normally distributed, or are there cereals with unusually high and low amounts of protein?

Make a histogram on `protein` using `dataframe`.

<details>
  <summary>Interpretation</summary>
  
That certainly doesn't look normal.
It seems that protein is being manipulated somehow when the level is high.
</details>
