# Coding Homework 7: [Your Name]

## Part 1: Broadway

Lin-Manuel Miranda was nominated for “Best Original Song” at the March 27, 2022 Academy Awards
(also known as the Oscars) for his work on the Disney movie Encanto. Miranda had already won an Emmy,
Grammy, and Tony (mostly for his work on the Broadway musical “Hamilton”), so he was very close to
the [EGOT](https://www.vanityfair.com/hollywood/2022/02/oscar-nominations-2022-will-lin-manuelmiranda-finally-egot-for-encanto) (Emmy, Grammy, Oscar and Tony), a rare achievement, as only 16 people
have won all four awards. Unfortunately, Miranda did not win the Oscar in 2022. Perhaps he will soon!


In this question we will look at a sample of weekly Broadway musical data available in `broadway.csv`.

This data set contains a sample of Broadway musical information for 500 weeks from 1985 to 2020. In this
data set, an observation is one Broadway musical in a particular week (ending on a Sunday).

Variables of interest are:

- show: Name of the Broadway musical/show.
- Hamilton: indicates whether the musical is “Hamilton” or not.
- week_ending: Date of the end of the weekly measurement period. Always a Sunday.
- weekly_gross_overall: Weekly box office gross for all shows.
- avg_ticket_price: Average price of tickets sold in a particular week.
- top_ticket_price: Highest price of tickets sold in a particular week.
- seats_sold: Total seats sold for all performances and previews in a particular week.
- pct_capacity: Percent of theatre capacity sold. Shows can exceed 100% capacity by selling standing room tickets.



Let’s explore different ways to estimate the average ticket price for Broadway shows! 


> TAs will mark this assignment by checking ***MarkUs*** autotests and by manually reviewing the written responses and graph-based questions.

In [1]:
# Import/load the "broadway.csv" data and store it with the name broadway



In [2]:
# Take a look at the data to understand it more



### Q0: Use `.drop(<columnName>, axis = 1)` to remove any `columns` with missing values.

#### Hint: Use any() to check if any `columns` have missing values first, not rows!
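
As a minimal sketch of the hint, the toy data frame below (a made-up stand-in, not `broadway.csv`) shows how `isna().any()` flags columns, rather than rows, that contain missing values. The column names and values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real columns come from broadway.csv
toy = pd.DataFrame({
    "show": ["A", "B", "C"],
    "avg_ticket_price": [80.0, 95.5, 110.0],
    "notes": [np.nan, "preview", np.nan],  # this column has missing values
})

# isna() marks each missing cell; any() (default axis=0) reduces each column to one True/False
has_missing = toy.isna().any()
cols_with_missing = list(has_missing[has_missing].index)
print(cols_with_missing)

# Drop the flagged columns, as described in Q0
toy_clean = toy.drop(cols_with_missing, axis=1)
print(list(toy_clean.columns))
```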

### Q1: Make a plot showing the relationship between the average ticket price (on the y-axis) and the weekly gross overall sales (on the x-axis)

> - Hint: You did something like this in the final problem in the week three homework with the `scatter` function from `plotly.express`

### Q2: Add trendline to the plot


> - Use the `trendline="ols"` parameter argument for the `scatter` function

### Q3: Use `np.corrcoef()` to compute the correlation between the variables in the plot(s) above
#### Round to 3 decimal places
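
A small sketch of how `np.corrcoef()` behaves may help here. It returns a 2x2 correlation matrix rather than a single number, so the value of interest is an off-diagonal entry. The arrays below are hypothetical toy values, not taken from `broadway.csv`.

```python
import numpy as np

# Hypothetical toy values, not taken from broadway.csv
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])  # exactly 2 * x, so the points lie on a straight line

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the correlation of x and y
corr_matrix = np.corrcoef(x, y)
r = round(float(corr_matrix[0, 1]), 3)
print(corr_matrix.shape)
print(r)
```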

In [10]:
# Q3: your answer will be tested!
Q3 = None

### Q4: Explain why you do or do not think correlation provides a reasonable summary of the relationship between the variables in the plot(s) above? 

- Correlation measures the strength of "straight line" linear association between two variables
- Compare your answer against the example answer provided by MarkUs

> Answer here... 

### The Four Assumptions of Linear Regression

So far, we have been using linear regression without knowing what important assumptions we must make (this will be important to confirm in future statistics courses!)

#### 1. Normality
The residuals of the model follow a normal distribution. We can check this using Q-Q plots or the Shapiro-Wilk test but for our purposes we can do a visual inspection. We can simply create a histogram of all the residuals and if it looks "bell-shaped" then we can assume it follows a normal distribution.

Here is a rough example. In your case, you would compute the residuals from your fitted model instead of generating them randomly:

In [1]:
import plotly.graph_objects as go
import numpy as np

np.random.seed(123)
residuals = np.random.normal(0, 1, 1000)

histogram = go.Histogram(x=residuals, nbinsx=30)
layout = go.Layout(title='Histogram of Residuals', xaxis=dict(title='Residuals'), yaxis=dict(title='Frequency'))

fig = go.Figure(data=[histogram], layout=layout)
fig.show()
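
To compute residuals from a fitted line rather than simulating them, one possible approach is sketched below. It uses `np.polyfit` for the least-squares fit, which is a choice made here for illustration (the assignment itself uses `trendline="ols"` and Statsmodels), and the data are hypothetical toy values.

```python
import numpy as np

# Hypothetical toy data
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([1.2, 2.1, 2.8, 4.1, 4.9, 5.9, 7.1, 8.2, 8.9, 10.3])

# Fit a straight line by least squares; for degree 1, polyfit returns (slope, intercept)
slope, intercept = np.polyfit(x, y, 1)

# Residual = observed value minus fitted value
fitted = intercept + slope * x
residuals = y - fitted

# With an intercept in the model, least-squares residuals average to (numerically) zero
print(abs(residuals.mean()) < 1e-9)
```

An array like `residuals` could then be passed to `go.Histogram(x=residuals)` in place of the randomly generated values.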

#### 2. Homoscedasticity 

The residuals have a constant variance for every x-value. If this condition is not met, it is called heteroscedasticity. Higher variance makes it more difficult to trust our model. The simplest way to check this is to create a scatter plot of the residuals against the fitted values. If we see a cone-shaped plot, it means the variability differs across fitted values and the model suffers from heteroscedasticity.

In [2]:
# What we are looking for:

import plotly.express as px
import pandas as pd
import numpy as np

np.random.seed(123)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([1.2, 2.1, 2.8, 4.1, 4.9, 5.9, 7.1, 8.2, 8.9, 10.3])

predicted_y = 1.1 * x  
residuals = y - predicted_y
data = pd.DataFrame({'Predicted Values': predicted_y, 'Residuals': residuals})

fig = px.scatter(data, x='Predicted Values', y='Residuals', trendline=None)
fig.update_layout(title='Residual Plot', xaxis_title='Predicted Values', yaxis_title='Residuals')
fig.show()

In [3]:
# What we are not looking for, notice the "cone shape" as the points continue to spread
# Notice the scale of the axis in both plots!

import plotly.graph_objects as go
import numpy as np

np.random.seed(123)
predicted_values = np.linspace(0, 10, 50)
residuals = np.random.normal(0, predicted_values**2, 50)

scatter_residuals = go.Scatter(x=predicted_values, y=residuals, mode='markers', name='Residuals')

layout_residuals = go.Layout(title='Scatter Plot of Residuals (Heteroscedasticity)',
                             xaxis=dict(title='Predicted Values'), yaxis=dict(title='Residuals'))

fig_residuals = go.Figure(data=[scatter_residuals], layout=layout_residuals)
fig_residuals.show()

#### 3. Linearity

The relationship between the independent variables and the dependent variable is linear. In other words, the change in the dependent variable is directly proportional to the change in the independent variables. We should see an indication of a straight line fitting the data. 

In [4]:
# Example:

import plotly.graph_objects as go
import numpy as np

np.random.seed(123)
X = np.linspace(0, 10, 50)

noise = np.random.randn(50)
Y = 2*X + noise

scatter_plot = go.Scatter(x=X, y=Y, mode='markers', name='Data Points')

layout = go.Layout(title='Scatter Plot',
                   xaxis=dict(title='Independent Variable'), yaxis=dict(title='Dependent Variable'))

fig = go.Figure(data=[scatter_plot], layout=layout)
fig.show()

#### 4. Independence

The observations in the dataset are independent of each other. There should be no relationship or dependence between the residuals. We do not want there to be a pattern among consecutive residuals because it can lead to an ineffective model and invalid hypothesis tests.
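
One informal way to look for such a pattern is to correlate each residual with the one that follows it (a lag-1 correlation). The sketch below uses simulated residuals. The lag-1 check is an illustration added here rather than a method prescribed by this assignment, and it only makes sense when the observations have a natural order, such as time.

```python
import numpy as np

rng = np.random.default_rng(123)

# Simulated residuals with no dependence between consecutive values
independent = rng.normal(0, 1, 500)

# Simulated residuals where each value carries over the previous one (a random walk)
dependent = np.cumsum(rng.normal(0, 1, 500))

def lag1_correlation(res):
    """Correlation between each residual and the next one."""
    return float(np.corrcoef(res[:-1], res[1:])[0, 1])

print(lag1_correlation(independent))  # expected to be near 0
print(lag1_correlation(dependent))    # expected to be near 1
```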

### Q5: Relative to the `trendline="ols"` straight line association present in the data, describe the shape of the data relative to the regression assumptions of normality of error terms and homoscedastic (constant) variance, and what this means for the appropriateness of these assumptions for the data.

#### Write your answer in 1 to 3 sentences.

> Answer here... 

### Q6: Add two new columns to the data frame based on the following variable transformations 
- `log_avg_ticket_price = np.log(avg_ticket_price)`
- `weekly_gross_overall_in_100k = weekly_gross_overall/100000`

#### The $\log$ of a number less than or equal to zero is not defined: `np.log` returns `NaN` for negative values and `-inf` for zero. Double check that there are no such values in the $\log$-transformed columns and drop rows if there are any missing values resulting from the transformation.
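
The sketch below illustrates the check on a hypothetical toy column (the name `price` and its values are made up). In NumPy, `np.log` of a negative number gives `NaN` while `np.log(0)` gives `-inf`, so it can be worth checking for both.

```python
import numpy as np
import pandas as pd

# Hypothetical toy prices, including a zero and a negative value to trigger the edge cases
toy = pd.DataFrame({"price": [50.0, 100.0, 0.0, -5.0]})

# Suppress NumPy's runtime warnings for log(0) and log(negative) in this demonstration
with np.errstate(divide="ignore", invalid="ignore"):
    toy["log_price"] = np.log(toy["price"])

# log(negative) is NaN and log(0) is -inf; isna() only catches the NaN
n_nan = int(toy["log_price"].isna().sum())
n_inf = int(np.isinf(toy["log_price"]).sum())
print(n_nan, n_inf)

# Treat infinities as missing too, then drop the affected rows
toy_clean = toy.replace([np.inf, -np.inf], np.nan).dropna(subset=["log_price"])
print(len(toy_clean))
```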

### Q7: Plot the association between the variables you created, add a line of best fit to the plot, and describe the relationship observed in the plot, as well as any notable artifacts present in the plot.

- `log_avg_ticket_price` (on the y-axis)
- `weekly_gross_overall_in_100k` (on the x-axis)
- Use the `trendline="ols"` parameter argument for the `scatter` function

#### Plot your figure in the code cell below and write your answer in 2 to 3 sentences.

- Compare your answer against the example answer provided by MarkUs

> Answer here... 

### Q8: From your plot in Question 7, which of the four assumptions of linear regression has improved the `most`?

A. `Normality`  
B. `Homoscedasticity`  
C. `Linearity`  
D. `Independence`

In [5]:
# Q8: your answer will be tested!
Q8 = None 

### Q9: Calculate the correlation between `log_avg_ticket_price` and `weekly_gross_overall_in_100k` using the `np.corrcoef` function.  Interpret the result in the context of this data.
#### Round to 3 decimal places

In [6]:
# Q9: your answer will be tested!
Q9 = None

### Q10: Which statement most accurately describes the change in correlation between Question 3 and Question 9?

A. `The correlation for Q9 has decreased significantly as more points are tightly fit to the line in Q3.`  
B. `The correlation is similar because both plots look almost identical.`  
C. `The correlation in Q9 is significantly larger as the points are more linear than in Q3.`  
D. `The correlation is rather similar because in Q3 the points were tightly fit to the line on the left side of the graph but now they are more uniform overall.`

In [7]:
# Q10: your answer will be tested!
Q10 = None

### Q11: Write down a simple linear regression model specification with response `log_avg_ticket_price` and explanatory variable `weekly_gross_overall_in_100k`. Explain each component of the model.

#### Refer to the videos on how we wrote and interpreted linear equations for list and sale prices for Amazon books.

> Answer here...

### Recalling Hypothesis Tests

Review the regression model in the lecture videos. Suppose we wanted to see if the Amazon list price had any impact on the Amazon sale price. One way to think about this is to check if the slope in our model is zero or not. A zero slope would indicate that an increase or decrease in list price would not impact the sale price of the book. If the slope were not zero, then a change in the list price would indeed impact the sale price.

To formalize this as a hypothesis test, we check whether $\beta_1$ (the slope of our model) is equal to 0 or not.

$H_0: \beta_1 = 0$

$H_A: \beta_1 \neq 0$
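
To make the test concrete, the sketch below computes the usual t-statistic for the slope, $t = \hat{\beta}_1 / \mathrm{SE}(\hat{\beta}_1)$, by hand on simulated data. The data, the seed, and all variable names are assumptions for illustration. In practice a regression library reports this statistic and its p-value for you.

```python
import numpy as np

rng = np.random.default_rng(123)

# Simulated data with a true slope of 2 (hypothetical, not the Amazon books data)
n = 50
x = np.linspace(0, 10, n)
y = 1.0 + 2.0 * x + rng.normal(0, 1, n)

# Least-squares estimates; for degree 1, polyfit returns (slope, intercept)
beta_1, beta_0 = np.polyfit(x, y, 1)

# Residual standard error uses n - 2 degrees of freedom (two estimated coefficients)
residuals = y - (beta_0 + beta_1 * x)
sigma_hat = np.sqrt(np.sum(residuals**2) / (n - 2))

# Standard error of the slope, then the t-statistic for H0: beta_1 = 0
se_beta_1 = sigma_hat / np.sqrt(np.sum((x - x.mean())**2))
t_stat = beta_1 / se_beta_1
print(t_stat)
```

A t-statistic far from 0 is evidence against $H_0$.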

### Q12: State the null and alternative hypotheses you would use to assess whether the slope is zero in the linear regression model where weekly gross overall income (in hundred thousands) predicts the log average ticket price.

> Answer here... 

### Q13: Use `statsmodels` to fit the linear model that corresponds with your line of best fit above. Report the regression coefficients of the fitted line.


#### Round to 3 decimal places
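
As a rough sketch of the mechanics, the example below fits a line with the `statsmodels` formula interface on simulated data. The data frame `toy` and its columns `x` and `y` are hypothetical. With the real data, the formula would name the response and explanatory columns you created in Q6.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(123)

# Simulated data with intercept 3 and slope 0.5 (hypothetical column names)
toy = pd.DataFrame({"x": np.linspace(0, 10, 100)})
toy["y"] = 3.0 + 0.5 * toy["x"] + rng.normal(0, 0.1, 100)

# Fit y = beta_0 + beta_1 * x by ordinary least squares
fit = smf.ols("y ~ x", data=toy).fit()

# params is indexed by term name; the formula adds "Intercept" automatically
b0 = round(float(fit.params["Intercept"]), 3)
b1 = round(float(fit.params["x"]), 3)
print((b0, b1))
```

`fit.summary()` also reports the standard errors and p-values relevant to the hypothesis test.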

In [11]:
# Q13: your answer will be tested!
beta_0 = None
beta_1 = None
Q13 = (beta_0, beta_1)

### Q14: Report the fitted equation of the line. Interpret the regression coefficients in the context of this data. Make a conclusion about the hypotheses you defined above.

> Answer here...

### Q15: What is the coefficient of determination, $R^2$, for your model in Question 9? Write one sentence interpreting it in context.

Recall: $R^2 = r^2$, where $r$ is the correlation between the response and the explanatory variable (this identity holds for simple linear regression).
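
The identity can be checked numerically. The sketch below uses simulated data (all values and names are hypothetical) and compares $r^2$ with $R^2 = 1 - SS_{res}/SS_{tot}$ from a least-squares line fitted with `np.polyfit`.

```python
import numpy as np

rng = np.random.default_rng(123)

# Simulated data (hypothetical)
x = np.linspace(0, 10, 50)
y = 2.0 * x + rng.normal(0, 3, 50)

# Correlation between x and y
r = np.corrcoef(x, y)[0, 1]

# R^2 from the fitted line: 1 - (residual sum of squares) / (total sum of squares)
slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x
ss_res = np.sum((y - fitted)**2)
ss_tot = np.sum((y - y.mean())**2)
r_squared = 1 - ss_res / ss_tot

print(np.isclose(r**2, r_squared))
```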

#### We will further explore $R^2$ in tutorial!

#### Round to 3 decimal places

In [8]:
# Q14: your answer will be tested!
Q14 = None

> Answer here... 

### Q16: Brief exploration of using categorical variables.

In Question 1, we created a plot to look for a relationship between weekly gross overall sales and average ticket price. One way to study this relationship further is to add a categorical variable to understand why the plot might have a certain shape.

Facet wraps allow you to view individual categories in their own graph, and colouring allows you to colour a point based on the value of the categorical variable.

The two plots below show the data for when the show is not Hamilton and is Hamilton respectively. Titles are added for clarity on our choice.

Use your plot code from Question 1 and add the parameter `facet_col="Hamilton"` in one plot and `color="Hamilton"` in another plot to understand whether a show being Hamilton impacts the average ticket price. Remember to add the appropriate titles.

We will further explore the idea of using categorical variables in our models next week!

In [9]:
# Title needs to reflect the boolean selection

# Assign the figure so that update_layout applies to this plot, not an earlier `fig`
fig = px.scatter(broadway[broadway.Hamilton=='No'], x="weekly_gross_overall", y="avg_ticket_price", trendline="ols")
fig.update_layout(title="Weekly Gross Overall Income vs. Average Ticket Price - Show is not Hamilton")
fig.show()


In [None]:
# Title needs to reflect the boolean selection

# Assign the figure so that update_layout applies to this plot, not an earlier `fig`
fig = px.scatter(broadway[broadway.Hamilton=='Yes'], x="weekly_gross_overall", y="avg_ticket_price", trendline="ols")
fig.update_layout(title="Weekly Gross Overall Income vs. Average Ticket Price - Show is Hamilton")
fig.show()

In [None]:
# Facet Plot


In [None]:
# Colour Plot


### Q17: Compare and contrast faceting and colouring. Provide one advantage and disadvantage for using faceting and one advantage and disadvantage for using colouring.

> Answer here...