# Coding Homework 3: [Your Name]

Go through this notebook, following the instructions! 
- You can add new cells if you need (with the "+" button above); but, deleting cells could very likely cause your notebook to fail MarkUs autotesting (and you'd have to start over and re-enter your answers into a completely fresh version of the notebook to get things to work again...)

> TAs will mark this assignment by checking ***MarkUs*** autotests; then, manually confirming the correctness of plots in `Q4` and `Q10`; then, manually reviewing the written response to question `Q17`. TAs may or may not spot check the presence of plots and written answers for `Q13`, `Q14`, `Q15`, and `Q18`.
> - The following questions "automatically fail" during automated testing so that MarkUs exposes example answers for student review and consideration for these problems.  These "failed MarkUs tests" are not counted against the student: `Q6`, `Q11`, `Q13`, `Q14` and `Q15`

# Super Bowl Commercials!

The Super Bowl is the annual championship game of the National Football League (NFL) in the United States, drawing massive viewership and having major cultural significance. Data about the commercials shown during past Super Bowls are available in the "superbowl_ads.csv" file.
> This data was posted on [github](https://github.com/fivethirtyeight/superbowl-ads#super-bowl-ads) by the data-oriented reporting outlet [FiveThirtyEight](https://github.com/fivethirtyeight) and subsequently featured on [Tidy Tuesday](https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-03-02/readme.md).  For more information see the above links.

In [None]:
# Import/Load the "superbowl_ads.csv" data

In [None]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.figure_factory as ff

In [None]:
superbowl_ads_csv_asdf = pd.read_csv("superbowl_ads.csv")
superbowl_ads_csv_asdf

## Refining Numerical and Categorical Variables

The distinction between numeric and non-numeric data types, like numbers versus words, should be fairly intuitive to you, and so the distinction between numeric and categorical variables should hopefully be fairly clear as well.  For technical clarity and communication, however, it can be beneficial to break variables down into even finer categories.

- Note that categorical/qualitative data types can be encoded as numerical values. E.g., **ordinal** values could be reported as integer scores "1, 2, 3, 4, 5, etc.", and so could **nominal** values as long as we remember that the numeric order doesn't matter.  **Binary** variables are a simple example since they could be recorded as either $0$ or $1$. An easy example of a **binary variable** is a **logical/boolean** value like how `python` has `True` and `False` values.  In the case of **logical/boolean** values it is standard to coerce `False` to $0$ and `True` to $1$ whenever actual numeric values are required in place of **logical/boolean** values.
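
As a small illustration of this coercion, the sketch below uses a made-up boolean column (the values are hypothetical and not taken from `superbowl_ads.csv`) and converts it to $0$/$1$ with `astype(int)`. A handy consequence is that the sum counts the `True` values and the mean gives their proportion.

```python
import pandas as pd

# Hypothetical boolean column (illustrative values only)
is_funny = pd.Series([True, False, True, True])

as_numbers = is_funny.astype(int)   # False -> 0, True -> 1
n_true = int(is_funny.sum())        # number of True values
prop_true = float(is_funny.mean())  # proportion of True values
print(as_numbers.tolist(), n_true, prop_true)
```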

#### Data types can further be split into subtypes, as seen in the graphic below:
![](im/3/HW3_data_types.JPG)

### Q0: Which of these best describes the data type of the `brand` variable?
A. Nominal Categorical      
B. Ordinal Categorical  
C. Continuous  
D. Discrete  
E. Binary/Boolean

In [None]:
# Q0: your answer will be tested!
Q0 = None # Assign either 'A', 'B', 'C', 'D', or 'E' to `Q0` instead of `None`
# E.g., Q0 = 'A'

In [None]:
hint = "You could sort them alphabetically or perhaps by counts, "
hint += "but is there a deep intrinsic order? "
hint += "Different languages or different data sets could produce different orderings..."
test = Q0=='A'

In [None]:
# test_Q0
assert test, hint

### Q1: Which of these best describes the data type of the `view_count` variable?
A. Nominal Categorical      
B. Ordinal Categorical  
C. Continuous  
D. Discrete  
E. Binary/Boolean

In [None]:
# Q1: your answer will be tested!
Q1 = None # Assign either 'A', 'B', 'C', 'D', or 'E' to `Q1` instead of `None`
# E.g., Q1 = 'A'

In [None]:
hint = "Think about whether it represents a measurement that can take on any value within a range or if it consists of distinct categories or groups"
test = Q1=='D'

In [None]:
# test_Q1
assert test, hint

### Q2: Which of these best describes the data type of the `funny` variable?
A. Nominal Categorical      
B. Ordinal Categorical  
C. Continuous  
D. Discrete  
E. Binary/Boolean

In [None]:
# Q2: your answer will be tested!
Q2 = None # Assign either 'A', 'B', 'C', 'D', or 'E' to `Q2` instead of `None`
# E.g., Q2 = 'A'

In [None]:
hint = "How many (non-NaN) values are possible for the data in the `funny` column?"
test = Q2=='E'

In [None]:
# test_Q2
assert test, hint

### Q3: Which of *these* best describes the data type of the `funny` variable?
> Note: This is not a mistake!    

A. Nominal Categorical      
B. Ordinal Categorical  
C. Continuous  
D. Discrete   

In [None]:
# Q3: your answer will be tested!
Q3 = None # Assign either 'A', 'B', 'C', or 'D' to `Q3` instead of `None`
# E.g., Q3 = 'A'

In [None]:
hint = "Is there a deep intrinsic order? Are the values numbers? "
hint += "Discrete is not really a wrong choice if you're thinking of coercing False and True into a 0 and 1 encoding... "
hint += "However, this question is asking about the data as it is originally stored in the `funny` column."
test = Q3=='A'

In [None]:
# test_Q3
assert test, hint

### Q4: Create 3 histograms to explore the distribution of `view_count`: (i) one with 2 bins, (ii) one with 4 to 8 bins, and (iii) one with about 50 bins. 
> Hint 1: Remember to import `plotly.express`!  
> Hint 2: Googling "plotly histograms" is the trick if you don't immediately remember how to do this

- This will be manually reviewed. 

In [None]:
# (i) create a histogram with 2 bins
fig = px.histogram(superbowl_ads_csv_asdf, 'view_count', nbins=2, color_discrete_sequence=['lightcoral'])
fig.show()

In [None]:
# (ii) create a histogram with 4 to 8 bins
fig = px.histogram(superbowl_ads_csv_asdf, 'view_count', nbins=8, color_discrete_sequence=['lightsalmon'])
fig.show()

In [None]:
# (iii) create a histogram with about 50 bins
fig = px.histogram(superbowl_ads_csv_asdf, 'view_count', nbins=50, color_discrete_sequence=['plum'])
fig.show()

> If you noticed that the solutions `px.histogram(..., x="view_count", nbins=2)`, `px.histogram(..., x="view_count", nbins=8)`, and `px.histogram(..., x="view_count", nbins=50)` don't give the exact number of bins as is specified by the `nbins` parameter, great job!  This behavior is discussed, e.g., [here](https://community.plotly.com/t/plotly-histogram-nbinx-does-not-provide-the-right-number-of-bins-in-python/1126) and [here](https://stackoverflow.com/questions/68013490/what-is-nbins-in-plotly) which are found by googling "plotly nbins not working".
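
If an exact bin count is needed, one option (a sketch only, not something this homework requires) is to compute the bins yourself with `np.histogram`, which returns exactly as many counts as the `bins` argument requests; those counts could then be drawn as a bar chart. The `values` array below is made up for illustration.

```python
import numpy as np

# Hypothetical right-skewed values (not the homework data)
values = np.array([10, 25, 40, 300, 1200, 5000, 20000, 150000])

# Ask for exactly 4 equal-width bins spanning the data range
counts, edges = np.histogram(values, bins=4)
print(len(counts), len(edges))  # 4 counts are delimited by 5 bin edges
```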

### Q5: Create a boxplot of the `view_count` data using an `x="view_count"` specification, then determine the 25th and 75th percentiles (also called the first and third quartiles, Q1 and Q3) of the data and compute the interquartile range (IQR) to measure the spread of the middle 50% of the data

#### Give your answer as a discrete integer numeric value ignoring all decimal values
> Hint 1: Construct a boxplot of the distribution for `view_count` (https://letmegooglethat.com/?q=plotly+boxplot)   
> Hint 2: Hover on the boxplot
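
For intuition about what the boxplot is reporting, the quartiles and IQR can also be computed directly. The sketch below uses a small made-up array (not the homework data) with `np.percentile`. Note that `plotly` and `numpy` may use different interpolation rules for quartiles, so values read off a boxplot can differ slightly from values computed this way; the tested answer is based on the boxplot labels.

```python
import numpy as np

# Hypothetical data with one very large value
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100])

q1, q3 = np.percentile(values, [25, 75])  # 25th and 75th percentiles
iqr = q3 - q1                             # spread of the middle 50% of the data
print(int(q1), int(q3), int(iqr))         # int() drops any decimal part
```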

In [None]:
# create the plot in these cells


In [None]:
px.box(superbowl_ads_csv_asdf, x="view_count")

In [None]:
# Q5: your answer will be tested!
percentile_25 = None
percentile_75 = None
IQR = percentile_75 - percentile_25 

Q5 = (percentile_25, percentile_75, IQR) # Assign a `tuple` of three `int`s to Q5

In [None]:
hint = "Look closely at the labels on the boxplot: Q1 is not 10, if that's your mistake"

In [None]:
# test_Q5
assert Q5 == (4699, 112083, 107384), hint

### Q6: In terms of representing the `view_count` data, which of the previous plots (histograms and boxplot) do you think most informatively and succinctly characterizes its distribution and why? 

#### Write two to three sentences to answer this question in the markdown cell below
- Compare your response to the answer given in the ***MarkUs*** output

> Answer here...


In [None]:
hint = "\n\nAUTOMATICALLY FAILING AUTOTEST: DOES NOT COUNT AGAINST STUDENT\n"
hint += "Included as an example answer for feedback purposes only\n\n"
hint += "Among the three histograms, the one with about 4 to 8 bins does a good job of streamlining the display of the `view_count` distribution to emphasize its key characteristic of being right-skewed and unimodal.  The two-bin histogram is too coarse to give a good sense of the shape of the data, whereas using about 50 bins may be preferred because it shows how quickly the `view_count` data actually decays. From the histogram with about 50 bins we can see that something like half of the `view_count` values are less than 20,000; so, we might expect that the median would be around 20,000 and that the mean would be greater than that due to the long right tail of large values.  The boxplot turns out to be a very nice visualization in this instance: it shows that the 50th percentile (median) of the data is about 35,000 and that the 25th and 75th percentiles are close to 0 and 100,000 respectively, which, in conjunction with the whiskers and outliers of the boxplot, strongly suggests the right-skewed and unimodal nature of the `view_count` data."

In [None]:
# test_Q6
assert False, hint

### Q7: Which of these best describes the shape of the histograms and boxplot above?
> Hint: Skewness refers to the measure of asymmetry or lack of symmetry in a distribution of data, where left-skewed data has a longer tail on the left side of the distribution, while right-skewed data has a longer tail on the right side of the distribution  

A. Left-skewed  
B. Right-skewed  
C. Multimodal  
D. Symmetric

In [None]:
# Q7: your answer will be tested!
Q7 = None # Assign either 'A' or 'B' or 'C' or 'D' to `Q7` instead of `None`
# E.g., Q7 = 'A'

In [None]:
hint = "Look closely at the figures and the definitions of skew!"
test = Q7=='B'

In [None]:
# test_Q7
assert test, hint

### Q8: What is the mean, median and standard deviation of `view_count`?

#### Give each answer as a continuous numeric value rounded to one decimal place

> Hint 1: After importing `numpy` as `import numpy as np` round with `np.round(1234.56, 1)` for your submitted answers  
> Hint 2: there are several ways to do this, but both `pandas` and `numpy` provide the mean, median and standard deviation functionality you're looking for, e.g., `df.view_count.mean().round(1)` or `np.round(np.mean(df.view_count), 1)`
- Note: `pandas` and `numpy` may give different values for standard deviation; this is because by default, `numpy` uses a denominator of $n$ instead of $n-1$. For the correct value when using `numpy`, use the parameter `ddof=1` within the `np.std()` function; whereas, `pandas` uses the $n-1$ divisor [by default](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.std.html) 
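
The difference between the two divisors can be seen on a tiny made-up list (hypothetical values, chosen so that the $n$-divisor result is a round number):

```python
import numpy as np
import pandas as pd

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data

std_n = np.std(values)                  # numpy default: divides by n (ddof=0)
std_n_minus_1 = np.std(values, ddof=1)  # divides by n - 1
std_pandas = pd.Series(values).std()    # pandas default: divides by n - 1

print(std_n, std_n_minus_1, std_pandas)
```

The last two values agree with each other and are slightly larger than the first.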

In [None]:
# Q8: your answer will be tested!
view_count_mean = None
view_count_median = None
view_count_std = None
Q8 = (view_count_mean, view_count_median, view_count_std) # Assign a `tuple` of three `float`s to Q8

In [None]:
hint = "There are many ways to do this, try np.round(np.<metric>(df.view_count),1), np.round(df.view_count.<metric>(),1), or df.view_count.<metric>().round(1)"

In [None]:
# test_Q8
assert Q8==(np.round(superbowl_ads_csv_asdf.view_count.mean(),1), superbowl_ads_csv_asdf.view_count.median().round(1), np.round(np.std(superbowl_ads_csv_asdf.view_count, ddof=1),1)), hint

### Q9: What explains the difference between the mean and the median values observed for the `view_count` variable above?
- Choose the best answer from the options below

A. Outliers      
B. Left-skewed outliers  
C. Means are always greater than medians  
D. The large standard deviation  

In [None]:
# Q9: your answer will be tested!
Q9 = None # Assign either 'A' or 'B' or 'C' or 'D' to `Q9` instead of `None`
# E.g., Q9 = 'A'

In [None]:
hint = "Think about some examples of a small handful of numbers where the mean and median values are different and why"
test = Q9=='A'

In [None]:
# test_Q9
assert test, hint

### Q10: Create a *Kernel Density Estimate* (KDE) plot of the $\log$ of the `view_count` by transforming the  `view_count` column with the `.apply(np.log)` method
> Hint 1: Remember to import `plotly.figure_factory`  
> Hint 2: Googling "plotly kernel density" should be enough to find the "distplots" you're looking for  
> Hint 3: Overlaying a histogram with a KDE is a great way to make this visualization  
> Hint 4: Check for missing `NaN` values  
- This will be manually reviewed. 
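
A minimal sketch of the transformation step on a made-up column containing one missing value (the numbers are hypothetical). `np.log` passes `NaN` through unchanged, so missing entries need to be dropped before plotting, and `np.exp` maps values back to the original scale.

```python
import numpy as np
import pandas as pd

# Hypothetical view counts with one missing value
views = pd.Series([10.0, 100.0, 1000.0, np.nan])

log_views = views.apply(np.log)          # natural log of each value
n_missing = int(log_views.isna().sum())  # NaN stays NaN after the log
log_views_clean = log_views.dropna()     # keep only the observed values
back = np.exp(log_views_clean)           # undo the log transformation

print(n_missing, len(log_views_clean))
```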

In [None]:
# create the plot in these cells


In [None]:
hist_data = [superbowl_ads_csv_asdf.view_count.apply(np.log).dropna()]
group_labels = ['log_view_count'] 
ff.create_distplot(hist_data, group_labels)

### Q11: Describe the effect of the $\log$ transformation on the plot by describing how the distribution of the data changes after the $\log$ transformation

#### Write two to three sentences to answer this question in the markdown cell below
- Compare your response to the answer given in the ***MarkUs*** output

> Answer here...

In [None]:
hint = "\n\nAUTOMATICALLY FAILING AUTOTEST: DOES NOT COUNT AGAINST STUDENT\n"
hint += "Included as an example answer for feedback purposes only\n\n"
hint += "The larger the value is the more it gets reduced under the log transformation. In the case of this data the log transformation effect is so strong that the distribution changes from a right-skewed distribution to a left-skewed distribution.  The distribution is now unimodal and hill-like, as opposed to just a right-skewed decay distribution. There are not extreme outliers in long tails, but the left-skew is quite strong. We'll need to remember to transform back to the original scale (with `np.exp`) if we want to interpret things on the original `view_count` scale though."
hint += "\n\nCODE:\n"
hint += '''
hist_data = [superbowl_ads_csv_asdf.view_count.apply(np.log).dropna()]
group_labels = ['log_view_count'] 
ff.create_distplot(hist_data, group_labels)'''

In [None]:
# test_Q11
assert False, hint

### Q12: Demonstrate that as a result of the change in skewness the mean of the $\log$ of the `view_count` data is now less than its median, and that essentially most (or nearly all) of the data are within 2 (or 3) standard deviations of the mean by computing these values to confirm these statements

- Round with `np.round(1234.56, 1)` or similar for your submitted answers
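
The "within 2 (or 3) standard deviations" statement can be checked by computing the proportion of values that fall inside the interval mean $\pm\, k \times$ standard deviation. The sketch below does this on made-up data; `fraction_within` is a hypothetical helper name, not part of `pandas`.

```python
import pandas as pd

# Hypothetical data with one large value
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 12], dtype=float)
mean, std = values.mean(), values.std()  # pandas std uses the n - 1 divisor


def fraction_within(k):
    """Proportion of values within k standard deviations of the mean."""
    return float(((values - mean).abs() <= k * std).mean())


within_2, within_3 = fraction_within(2), fraction_within(3)
print(within_2, within_3)
```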

In [None]:
# Q12: your answer will be tested!
log_view_count_mean = None
log_view_count_median = None
log_view_count_std = None
Q12 = (log_view_count_mean, log_view_count_median, log_view_count_std) 
# Assign a `tuple` of three `float`s to Q12

In [None]:
# test_Q12
assert Q12 == (superbowl_ads_csv_asdf.view_count.apply(np.log).dropna().mean().round(1), superbowl_ads_csv_asdf.view_count.apply(np.log).dropna().median().round(1), superbowl_ads_csv_asdf.view_count.apply(np.log).dropna().std().round(1))

### Q13: Construct a plot to visualize the distribution of one of the categorical variables listed below; then, describe the distribution in 1 to 2 sentences

> - `show_product_quickly`
> - `celebrity`
> - `funny`
> - `danger`

#### Create the plot and write your answer in the code cell and markdown cells below respectively

- Compare your figure and response to the answer given in the ***MarkUs*** output 

> Hint: a **barplot** and a **histogram** are very similar plots (and are even made using the same function in `plotly`); however, the difference is that a **barplot** is a **histogram** for **categorical** data; so, you're making a **barplot** rather than a **histogram** here if you want to be precise about the terminology.
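
For categorical data, the bar heights are simply the number of times each category occurs, which `value_counts()` computes directly. A sketch with a hypothetical boolean column (illustrative values only):

```python
import pandas as pd

# Hypothetical categorical (boolean) column
danger = pd.Series([False, True, False, False, True, False])

counts = danger.value_counts()  # one count per category, most frequent first
print(counts.to_dict())
```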

In [None]:
# create the plot in these cells
fig = px.histogram(superbowl_ads_csv_asdf, x='danger')
fig.show()

> Answer here...

In [None]:
hint = "\n\nAUTOMATICALLY FAILING AUTOTEST: DOES NOT COUNT AGAINST STUDENT\n"
hint += "Included as an example answer for feedback purposes only\n\n"
hint += "Barplots let us easily compare the frequency of each category. There are significantly more ads that show the product quickly (146) than those that do not (65). Similarly, there are significantly more ads that are funny (144) than those that are not (67). On the other hand, there are significantly fewer ads that feature danger (65) than those that do not (146). There are also significantly fewer ads that feature a celebrity (63) than those that do not (148)."
hint += "\n\nCODE:\n"
hint += '''
fig = px.histogram(superbowl_ads_csv_asdf, x='danger')
fig.show()'''

In [None]:
# test_Q13
assert False, hint

### Q14: Construct a plot to visualize the distribution of the `brand` categorical variable; then, in 2 to 3 sentences, describe the distribution and discuss what "skewness" might mean in the context of this plot

#### Create the plot and write your answer in the code cell and markdown cells below respectively

- Compare your figure and response to the answer given in the ***MarkUs*** output 

In [None]:
# create the plot in these cells
fig = px.histogram(superbowl_ads_csv_asdf, x='brand')
fig.show()

> Answer here...

In [None]:
hint = "\n\nAUTOMATICALLY FAILING AUTOTEST: DOES NOT COUNT AGAINST STUDENT\n"
hint += "Included as an example answer for feedback purposes only\n\n"
hint += "In the barplot, there is significant imbalance between the number of ads for each brand: Bud Light (56), Budweiser (36) and Pepsi (25) have the highest number of ads, while NFL has the fewest ads (6). If we were to order the brands from most to least prevalent, the notion of skewness could describe how dissimilar the distribution is from 'uniform' prevalence; and, could describe how imbalanced the prevalence is away from 'uniform' parity."
hint += "\n\nCODE:\n"
hint += '''
fig = px.histogram(superbowl_ads_csv_asdf, x='brand')
fig.show()'''

In [None]:
# test_Q14
assert False, hint

### Q15: Construct a set of two boxplots showing the distributions of `like_count` separated by whether or not ads included a `celebrity` and provide a written description and relative comparison of these distributions.

#### Create the plot and write your answer in the code cell and markdown cells below respectively

> Hint 1: Use percentiles and indicate locations, spreads, shape, etc. as is helpful for writing your descriptions  
> Hint 2: This should be a single plot  
> Hint 3: You can change the orientation of the boxplot by switching the `x` and `y` function parameters 
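
To back up a written comparison with numbers, as Hint 1 suggests, per-group percentiles can be tabulated with `groupby(...).describe()`. The sketch below uses a tiny made-up data frame (hypothetical values, with the same column names as the homework data).

```python
import pandas as pd

# Hypothetical ads data (illustrative values only)
ads = pd.DataFrame({
    "celebrity": [True, True, True, False, False, False],
    "like_count": [10, 20, 30, 1, 2, 3],
})

# One row per group with count, mean, std, min, 25%, 50%, 75%, max
summary = ads.groupby("celebrity")["like_count"].describe()
medians = summary["50%"].to_dict()
print(summary[["25%", "50%", "75%"]])
```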

- Compare your figure and response to the answer given in the ***MarkUs*** output 

In [None]:
# create the plot in these cells
fig = px.box(superbowl_ads_csv_asdf, x="like_count", y="celebrity")
fig.show()

> Answer here...


In [None]:
hint = "\n\nAUTOMATICALLY FAILING AUTOTEST: DOES NOT COUNT AGAINST STUDENT\n"
hint += "Included as an example answer for feedback purposes only\n\n"
hint += "From the boxplots, it seems that the like counts of ads that did not feature a celebrity have a smaller spread than those of ads that did feature a celebrity: for ads that did not feature a celebrity, there is a smaller IQR and a significantly smaller range. Both groups share the same minimum value of 0. Moreover, the median like count of ads that did not feature a celebrity is slightly less than that of the ads that did feature a celebrity. There are also more outliers for the like count of ads that did not feature a celebrity."
hint += "\n\nCODE:\n"
hint += '''
fig = px.box(superbowl_ads_csv_asdf, x="like_count", y="celebrity")
fig.show()'''

In [None]:
# test_Q15
assert False, hint

### Q16: How many outliers are there for the `like_count` of advertisements that featured a celebrity, as determined by the criterion of a boxplot? 
> Hint 1: The dots in a boxplot are the ‘outliers’ according to the criterion of a boxplot  
> Hint 2: You can zoom in on the boxplot by dragging your cursor through the plot
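
The sketch below illustrates one common boxplot outlier criterion, assumed here: points more than $1.5 \times$ IQR below the first quartile or above the third quartile are drawn as dots. The data are made up, and because `plotly` may compute quartiles slightly differently from `np.percentile`, counts obtained this way should be checked against the plot itself.

```python
import numpy as np

# Hypothetical data with one very large value
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr  # points below this are flagged
upper_fence = q3 + 1.5 * iqr  # points above this are flagged

outliers = values[(values < lower_fence) | (values > upper_fence)]
print(lower_fence, upper_fence, outliers.tolist())
```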

In [None]:
# Q16: your answer will be tested!
Q16 = None # Assign an integer number such as 18, 36, or 1338
# E.g., Q16 = 18 

In [None]:
# test_Q16
assert Q16==6, "Look closely at the boxplot, did you accidentally write a string?"

### Q17: Choose which of histograms, boxplots, and KDEs is your favorite way to represent distributions of data and explain why

#### Write two sentences to answer this question in the markdown cell below

- Your response will be manually reviewed. 

> Answer here...



### Q18: Construct a plot that shows the relationship between `view_count` and `like_count`
> Hint: https://letmegooglethat.com/?q=plotly+scatterplot


In [None]:
fig = px.scatter(superbowl_ads_csv_asdf, x="view_count", y="like_count")
fig.show()

# Tutorial Head Start Preparation

> Optional, but recommended and encouraged (to lessen the demands of this assignment and make your life easier)!

Your ***Tutorial Assignment*** this week will involve describing some visualizations of the `coffee_ratings.csv` data using `plotly` (i.e., histograms, boxplots, barplots, KDEs, or scatterplots). The `coffee_ratings.csv` dataset contains information about various samples of coffee and their ratings. Ideally, your visualizations will provide some interesting insights into the data.

***You will not have enough time to fully complete this assignment during tutorial; so, you are encouraged to explore the data and practice preparing 2-3 interesting visualizations while the `plotly` code for doing so is at your fingertips and still fresh in your mind!***

> The ***Tutorial Assignment*** details follow below for your information; although, note that the terms below will be discussed during tutorial, so you're not necessarily expected to be familiar with these terms yet or able to fully complete this assignment at this stage.

## Tutorial Assignment 

- Submit your work for the assignment through Quercus
- Include the `.ipynb` file upon which your written submission is based

#### Marking is based on addressing the points on the next slide (relative to the prompt on the next-next slide) and the subsequent determination of the TA that they have come to a good sense of the figure(s) you're describing without referring to your submitted `.ipynb` file

- Don't spend more than 60 minutes on this assignment (unless really needed...)    

    - Aim for something close to 200 to 500 words
    - Grammar is *not* the main focus of the assessment, but it is important that you communicate in a clear and professional manner; so, 
        - use full sentences (without slang or emojis) 
    
## Tutorial Assignment 

- Describe the data source, acknowledging any notable limitations of the data<br>(such as sample size, missing data, etc.)
- State the type of graph you're making (and identify the x- and y-axes) 
- Describe the key features and characteristics of the data distribution
    - Where is the data located/centered (approximate values if possible)?
        - Note key relative frequencies in categorical data contexts
    
    - What is the scale/spread of the data (approximate values if possible)? <!-- (Characterized relative to location/center of the data if helpful) -->
    - What is the shape of the data? <br>Symmetric? Left or right skewed? Multiple modes (and how many)?
    - Note the presence or absence of any potential outliers/extreme data: <br> **Provide a description of the nature of outliers in the tails of the distribution if doing so is sensible and helpful**

## Tutorial Assignment (complete at home if needed)

Suppose you're on the phone with your friend, and for whatever reason you're describing some of the data and data visualization techniques you've been working with for STA130.  

Use `plotly` to construct 1-2 plots from the `coffee_ratings.csv` dataset and prepare a small paragraph with your description of the graph(s) for your friend keeping in mind that they cannot see the graph(s). Suppose your friend has not taken STA130, so they will not be as familiar with the statistical vocabulary as you are; so, explain any terms you use in plain language as you describe the data and graph(s) to your friend.

- Do not include any code in your written submission; but, separately include your `.ipynb` file along with your written submission
    - Annotate your notebook with comments where it's helpful for understanding what your code is doing