# Coding Homework 3: [Your Name]

The Super Bowl is an annual championship game of the National Football League (NFL) in the United States, drawing massive viewership and cultural significance. Data about the commercials shown during past Super Bowls are available in the "superbowl_ads.csv" file.
> This data was posted on [GitHub](https://github.com/fivethirtyeight/superbowl-ads#super-bowl-ads) by the data-oriented reporting outlet [FiveThirtyEight](https://github.com/fivethirtyeight) and subsequently featured on [Tidy Tuesday](https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-03-02/readme.md). For more information, see the links above.

> TAs will mark this assignment by checking ***MarkUs*** autotests and by manually reviewing the written responses to questions `Q1`, `Q3`, `Q6`, `Q7` and `Q9`.

In [None]:
# Import/Load the "superbowl_ads.csv" data

In [None]:
import pandas as pd
d = pd.read_csv("superbowl_ads.csv")
d

In [None]:
d.groupby('brand').size()

#### Data types can further be split into subtypes, as seen in the graphic below:
![](HW3_data_types.JPG)

### Q0: Which of these best describes the data type of the `brand` variable?
A. Nominal Categorical      
B. Ordinal Categorical  
C. Continuous  
D. Binary/Boolean

In [None]:
# Q0: your answer will be tested!
Q0 = None # Assign either 'A' or 'B' or 'C' or 'D' to `Q0` instead of `None`
# E.g., Q0 = 'A'

In [None]:
# test_Q0
assert Q0=='A', "You could sort them alphabetically or by counts, but is there a deep intrinsic order?"

### Q1a: Create 3 histograms to explore the distribution of `view_count`: (i) one with 2 bins, (ii) one with 8 bins, and (iii) one with 50 bins. 
> Hint: Remember to import `plotly.express`!
- This will be manually reviewed. 

In [None]:
import plotly.express as px

In [None]:
# (i) create a histogram with 2 bins
fig = px.histogram(d, x='view_count', nbins=2, title="Distribution of Views on Super Bowl Ads")
fig.show()

In [None]:
# (ii) create a histogram with 8 bins
fig = px.histogram(d, x='view_count', nbins=8, title="Distribution of Views on Super Bowl Ads")
fig.show()

In [None]:
# (iii) create a histogram with 50 bins
fig = px.histogram(d, x='view_count', nbins=50, title="Distribution of Views on Super Bowl Ads")
fig.show()

### Q1b: Which of these histograms is most appropriate to describe the distribution of `view_count`? Why? Write a few sentences describing the distribution based on the histogram you chose as most appropriate.
Provide your written answer in the markdown cell below.
- This will be manually reviewed. 

> Answer here...




### Q2: What is the mean, median and standard deviation of `view_count`?
> - After importing `numpy` as `import numpy as np` you can round with `np.round(1234.56, 1)`

In [None]:
d['view_count'].mean(), d['view_count'].median(), d['view_count'].std()

In [None]:
import numpy as np  # needed for np.round here and in the test cell below
np.round(99400.4976, 1)

In [None]:
# Q2: your answer will be tested!
view_count_mean = None
view_count_median = None
view_count_std = None
Q2 = (view_count_mean, view_count_median, view_count_std) # Assign a `tuple` of three `float`s to Q2

In [None]:
# test_Q2
assert Q2 == (np.round(99400.4976, 1), np.round(34565.0, 1), np.round(165964.8491, 1)), "Try `d['view_count'].mean(), d['view_count'].median(), d['view_count'].std()`"

### Q3: Create a kernel density estimation of `view_count`
> Hint 1: Remember to import `plotly.figure_factory`  
> Hint 2: Check for `na` values  
> Hint 3: To make the plot run faster, create a new column that is `view_count / 100000`
- This will be manually reviewed. 
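
Hint 2 can be illustrated with a minimal sketch on made-up values. The `toy` Series below is hypothetical and is not taken from `superbowl_ads.csv`; it only shows how missing values can be counted and then dropped before the remaining values are passed to a density plot.

```python
import pandas as pd

# Hypothetical toy values (not from superbowl_ads.csv); None becomes NaN
toy = pd.Series([1.0, None, 3.0, None, 5.0])

# Count the missing entries
n_missing = toy.isna().sum()

# Keep only the observed values, as a plain Python list
observed = toy.dropna().tolist()

print(n_missing, observed)
```

The same two calls work on a single DataFrame column such as `d['view_count']`.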

In [None]:
import plotly.figure_factory as ff

In [None]:
d.isnull().sum()

In [None]:
d["view_count2"] = d.view_count/100000

In [None]:
group_labels = ['view_count2']
hist_data = [d.view_count2.dropna().values.tolist()]
fig = ff.create_distplot(hist_data, group_labels)
fig.show()  

### Q4: Which of these best describes the shape of the above histogram and kernel density estimation?
> Hint: Skewness measures the asymmetry of a distribution. Left-skewed data has a longer tail on the left side of the distribution, while right-skewed data has a longer tail on the right side.

A. Left-Skewed  
B. Right-Skewed  
C. Multimodal  
D. Symmetric
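
As a rough numerical companion to the definition of skewness in the hint, the sketch below uses a small made-up sample with one large value on the right. The data are hypothetical and unrelated to `view_count`; the point is only that a long right tail tends to pull the mean above the median and gives a positive sample skewness.

```python
import pandas as pd

# Hypothetical toy sample with a long right tail (not from superbowl_ads.csv)
toy = pd.Series([1, 2, 2, 3, 3, 3, 4, 50])

toy_mean = toy.mean()      # pulled upward by the single large value
toy_median = toy.median()  # barely affected by the large value
toy_skew = toy.skew()      # sample skewness; positive suggests right skew

print(toy_mean, toy_median, toy_skew)
```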

In [None]:
# Q4: your answer will be tested!
Q4 = None # Assign either 'A' or 'B' or 'C' or 'D' to `Q4` instead of `None`
# E.g., Q4 = 'A'

In [None]:
# test_Q4
assert Q4=='B', "Look closely at the figure from Q3 and the definitions of skew!"

### Q5: What are the 25th and 75th percentiles of `view_count`? We use this to find the interquartile range (IQR) which is used to measure the spread of the middle 50% of the data!

#### Round your answer to the nearest thousand.
> Hint 1: Construct a boxplot of the distribution for `view_count` (https://letmegooglethat.com/?q=plotly+boxplot)   
> Hint 2: Hover on the boxplot
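
The relationship between the two percentiles and the IQR can be sketched on made-up numbers. The `toy` Series below is hypothetical, and `Series.quantile` is used here as one possible way to compute percentiles; its results may differ slightly from the quartiles shown when hovering on a `plotly` boxplot, since quartile conventions vary.

```python
import pandas as pd

# Hypothetical toy values (not from superbowl_ads.csv)
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

q1 = toy.quantile(0.25)  # 25th percentile
q3 = toy.quantile(0.75)  # 75th percentile
iqr = q3 - q1            # spread of the middle 50% of the data

print(q1, q3, iqr)
```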

In [None]:
fig = px.box(d, x="view_count")
fig.show()

In [None]:
# Q5: your answer will be tested!
percentile_25 = None
percentile_75 = None
IQR = percentile_75 - percentile_25 

Q5 = (percentile_25, percentile_75, IQR) # Assign a `tuple` of three `int`s to Q5

In [None]:
# test_Q5
assert Q5 == (5, 112, 107), "Look closely at the labels on the boxplot"

### Q6: Construct a plot to visualize the distribution of one of these categorical variables: `show_product_quickly`, `funny`, `danger`, or `celebrity`

In [None]:
fig = px.histogram(d, x='danger')
fig.show()

#### Describe the distribution in 1-2 sentences
Provide your written answer in the markdown cell below.
- This will be manually reviewed. 

> Answer here...

### Q7: Construct a set of two boxplots showing visual summaries of the distribution of the number of likes (`like_count`) for whether ads included a celebrity or not (`celebrity`); make sure to specify meaningful axis labels where appropriate
> Hint 1: This should be a single plot  
> Hint 2: By switching the `x` and `y` variables in the parameters, we can get a horizontal boxplot

In [None]:
fig = px.box(d, x="like_count", y="celebrity")
fig.show()

#### Write 3-4 sentences comparing these distributions.
Provide your written answer in the markdown cell below.
- This will be manually reviewed. 

> Answer here...

### Q8: How many outliers are there for the `like_count` of advertisements that featured a celebrity? 
> Hint: You can zoom in on the boxplot by dragging your cursor across the plot

In [None]:
# Q8: your answer will be tested!
Q8 = None # Assign an integer number such as 18, 36, or 1338
# E.g., Q8 = 18 

In [None]:
# test_Q8
assert Q8 == 6, "Look closely at the boxplot, did you accidentally write a string?"

### Q9: Construct a plot that shows the relationship between `view_count` and `like_count`
> Hint: https://letmegooglethat.com/?q=plotly+scatterplot
- This will be manually reviewed. 

In [None]:
fig = px.scatter(d, x="view_count", y="like_count")
fig.show()

# Tutorial Preparation

Your ***Tutorial Assignment*** this week will ask you to create different visualizations of the `coffee_ratings.csv` data using `plotly` (e.g., histograms, boxplots, bar plots, or scatterplots) that provide interesting insights about the data. The `coffee_ratings.csv` dataset contains information about various samples of coffee and their ratings.

***You will not have enough time to fully complete this assignment in tutorial; so, you may want to explore the data, find some relations between variables, and prepare 2-3 interesting visualizations while you still have the ideas fresh in your mind!***

In [None]:
pd.read_csv('coffee_ratings.csv')