# Homework 05

In this exercise, you will practice inferential statistics with confidence intervals, bootstrapping, and hypothesis testing. Problems may involve a combination of math and code. 

Recall that you can use LaTeX to nicely format your math inside Markdown cells by enclosing equations in single dollar signs (e.g., $x^2+4=8$) for inline math or double dollar signs for centered equations like $$P(X > 5) = \frac{1}{6}.$$ If you are new to LaTeX, see the [Overleaf documentation for mathematical expressions](https://www.overleaf.com/learn/latex/mathematical_expressions) for reference. **Please note that Gradescope currently does not render ```\begin{align} ... \end{align}``` and ```\begin{equation} ... \end{equation}```. Consider using ```\begin{aligned} ... \end{aligned}``` as an alternative to save the graders some hassle.**

Show your work and/or briefly explain your answers. In general, you will not receive full credit for numeric answers with no accompanying work or justification (math, code, explanation). For numeric answers, we will accept answers that are very slightly off due to rounding, z score of 2 vs. 1.96, etc. 

When you finish, please go to Kernel --> Restart and Run All, then double-check that your notebook looks correct before submitting your .ipynb file (the notebook file) on Gradescope.

In [2]:
# Run this code cell to import relevant libraries
import numpy as np
import pandas as pd
from scipy import stats

### Question 1

1. A website is trying to increase registration for first-time visitors, exposing a random subset of these visitors to a new site design. Of $752$ randomly sampled visitors over a month who saw the new design, $64$ registered. Construct a $95\%$ confidence interval for the percentage of visitors who would register for the website under the new design using the normal distribution. Save your answer in a tuple `q1_1` with two `numpy.float64` items such that `q1_1[0]` is the left bound and `q1_1[1]` is the right bound. Use percentages for both bounds; for example, save $50.0$ in your answer for $50\%$ of visitors.
2. A study examined the average pay for a random sample of men and women entering the workforce as doctors for $21$ different positions. If each gender was equally paid, then we would expect about half of those positions to have men paid more than women and women would be paid more than men in the other half of positions. In the study, men were, on average, paid more in $17$ of the $21$ positions. Complete a hypothesis test using the normal distribution (two-sided or one-sided, just be clear which you are reporting) to examine whether there is significant evidence (at the $0.05$ level) of gender discrimination in pay in these positions. Report your p-value and interpret the result. Save your p-value in `q1_2` as a `numpy.float64` and interpret it in the **Answer 1.2** cell.
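
As a cross-check for part 1, the normal-approximation interval for a proportion can be written out by hand as $\hat p \pm z_{0.975}\sqrt{\hat p(1-\hat p)/n}$. The sketch below is one way to do that; `wald_interval_pct` is a hypothetical helper name, not something the assignment provides.

```python
import numpy as np
from scipy import stats


def wald_interval_pct(successes, n, confidence=0.95):
    """Normal-approximation CI for a proportion, returned in percent."""
    p_hat = successes / n
    # Standard error of the sample proportion
    se = np.sqrt(p_hat * (1 - p_hat) / n)
    # Two-sided critical value, about 1.96 for a 95% interval
    z = stats.norm.ppf(0.5 + confidence / 2)
    return (100 * (p_hat - z * se), 100 * (p_hat + z * se))


lo, hi = wald_interval_pct(64, 752)
print(lo, hi)
```

Because `stats.norm.interval` computes `loc ± z * scale`, this should agree with an interval built from the same mean and standard error.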

In [5]:
# Code for question 1
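# Sample proportion of registrations and the Bernoulli variance p(1 - p)
E = 64 / 752
var = E - E**2

# The confidence level is passed positionally because newer SciPy releases
# renamed the `alpha` keyword to `confidence`
q1_1 = stats.norm.interval(0.95, loc=64/752*100, scale=np.sqrt(var)/np.sqrt(752)*100)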

# Leave these lines here for grading and ease of debugging
print(q1_1)

z_score=np.sqrt(21)*((17/21-0.5)/0.5)

q1_2 = 1 - stats.norm.cdf(z_score)
# Leave these lines here for grading and ease of debugging
print(q1_2)

(6.516269200219607, 10.505007395525073)
0.0022781749015925445


### Answer 1.2
Write your interpretation for 1.2 here. There is no need to write anything for 1.1.

Null hypothesis $H_0$: men are paid more than women in half of the positions, i.e. $p = 0.5$.

Alternative hypothesis $H_a$: men are paid more than women in more than half of the positions, i.e. $p > 0.5$ (one-sided), or $p \neq 0.5$ (two-sided).

Using the central limit theorem, under $H_0$: $$\begin{aligned} Z = \sqrt n \, \frac{\overline X_n - \mu}{\sigma} &\sim N(0,1) \\ z = \sqrt{21} \, \frac{17 \div 21 - 0.5}{0.5} &\approx 2.8368 \end{aligned}$$

Using `stats.norm.cdf`, the one-sided p-value is 0.0023 (saved in `q1_2`) and the two-sided p-value is 0.0046. Both are far less than 0.05.
Therefore, we reject the null hypothesis $H_0$ at the $0.05$ level: there is significant evidence that men are paid more than women in more than half of these positions.
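
The arithmetic behind this test can be reproduced in a few lines. The sketch below uses only the counts from the question ($17$ of $21$). The exact binomial tail at the end is an extra comparison that the assignment does not ask for, included as a sanity check on the normal approximation with such a small $n$.

```python
import numpy as np
from scipy import stats

n, k, p0 = 21, 17, 0.5

# z statistic under H0: p = 0.5, using the null standard deviation sqrt(p0 * (1 - p0))
z = np.sqrt(n) * (k / n - p0) / np.sqrt(p0 * (1 - p0))

p_one_sided = stats.norm.sf(z)           # P(Z >= z)
p_two_sided = 2 * stats.norm.sf(abs(z))  # P(|Z| >= |z|)

# Exact one-sided binomial tail P(X >= 17) for X ~ Binomial(21, 0.5), for comparison
p_exact = stats.binom.sf(k - 1, n, p0)

print(z, p_one_sided, p_two_sided, p_exact)
```

`stats.norm.sf(z)` is the survival function, equal to `1 - stats.norm.cdf(z)` but computed without subtracting two nearly equal numbers.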

<!-- END QUESTION -->

## Movie Ratings Data
In the remainder of this assignment you will work with the movielens dataset of movie ratings that we have seen before. Below we import and preview the data. It consists of 2 tables: `users` has a row for every individual who has rated any movies, and `movie_ratings` has a row for every rating of a particular movie by a particular user. This means users with multiple ratings appear in `movie_ratings` multiple times. The data is a random sample of all of the movie ratings made on the movielens service.

In [6]:
users = pd.read_csv("users.csv")
users.head()

Unnamed: 0,user_id,age,sex,occupation
0,1,24,M,technician
1,2,53,F,other
2,3,23,M,writer
3,4,24,M,technician
4,5,33,F,other


In [7]:
movie_ratings = pd.read_csv("movies-all.csv")
movie_ratings.head()

Unnamed: 0,user_id,age,sex,occupation,movie_id,rating,movie_title
0,1,24,M,technician,61,4,Three Colors: White (1994)
1,13,47,M,educator,61,4,Three Colors: White (1994)
2,18,35,F,other,61,4,Three Colors: White (1994)
3,58,27,M,programmer,61,5,Three Colors: White (1994)
4,59,49,M,educator,61,4,Three Colors: White (1994)


### Question 2
1. Compute a $95\%$ confidence interval for the mean `age` of users using the normal distribution. Save your answer in a tuple `q2_1` with two `numpy.float64` items such that `q2_1[0]` is the left bound and `q2_1[1]` is the right bound.
2. Compute a $95\%$ confidence interval for the mean `age` of users who have rated the movie `Casablanca (1942)` using the normal distribution. Save your answer in `q2_2`, similar requirements as above.
3. *Casablanca* is an old movie, so one might suspect that the users who rated it are older, on average, than the users in the entire dataset. Just looking at the confidence intervals you computed in steps 1 and 2, can you conclude that there is significant evidence for this belief? Why or why not? Put your answer in the **Answer 2.3** cell.
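
A normal-approximation interval for a mean is $\bar x \pm z_{0.975}\, s/\sqrt{n}$. The sketch below wraps that formula in a hypothetical helper, `normal_mean_ci`, and runs it on made-up ages rather than the real data. It assumes the sample standard deviation with `ddof=1`, whereas `np.std` defaults to `ddof=0`; for large samples the two give nearly the same interval.

```python
import numpy as np
from scipy import stats


def normal_mean_ci(values, confidence=0.95):
    """Normal-approximation CI for the mean of a 1-D sample."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    # ddof=1 gives the sample standard deviation; np.std defaults to ddof=0
    se = values.std(ddof=1) / np.sqrt(len(values))
    z = stats.norm.ppf(0.5 + confidence / 2)
    return (mean - z * se, mean + z * se)


# Made-up ages, for illustration only (not taken from users.csv)
toy_ages = [20, 30, 40, 50]
ci = normal_mean_ci(toy_ages)
print(ci)
```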

In [8]:
# Code for question 2
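# Sample mean, standard deviation (np.std defaults to ddof=0), and sample size
m1=np.mean(users["age"])
sig1=np.std(users["age"])
n=len(users["age"])
# Confidence level passed positionally (`alpha` was renamed `confidence` in newer SciPy)
q2_1=stats.norm.interval(0.95, loc=m1, scale=sig1/np.sqrt(n))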
# Leave these lines here for grading and ease of debugging
print(q2_1)

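# Restrict to the ratings of Casablanca, then repeat the same calculation
cas=movie_ratings[movie_ratings["movie_title"]=="Casablanca (1942)"]
m2=np.mean(cas["age"])
sig2=np.std(cas["age"])
n2=len(cas["age"])
# Confidence level passed positionally (`alpha` was renamed `confidence` in newer SciPy)
q2_2=stats.norm.interval(0.95, loc=m2, scale=sig2/np.sqrt(n2))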
# Leave these lines here for grading and ease of debugging
print(q2_2)

(33.27417039488504, 34.829753253047095)
(34.46345637868268, 37.330782304444895)


### Answer 2.3
Write your interpretation for 2.3 here. There is no need to write anything for 2.1 & 2.2.

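A $95\%$ confidence interval is produced by a procedure that captures the true mean in about $95\%$ of repeated samples.
The two intervals overlap (the upper bound for all users, $34.83$, exceeds the lower bound for *Casablanca* raters, $34.46$), so the intervals alone do not provide significant evidence for this belief.
Overlap does not show that the means are equal either; a further calculation such as a t test is needed to reach a conclusion.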
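
The comparison in this answer comes down to whether the two printed intervals share any values. A minimal sketch of that check, with `intervals_overlap` as a hypothetical helper and the bounds copied from the output of the Question 2 code cell:

```python
# Intervals copied from the printed output of the Question 2 code cell
ci_all_users = (33.27417039488504, 34.829753253047095)
ci_casablanca = (34.46345637868268, 37.330782304444895)


def intervals_overlap(a, b):
    """Return True if closed intervals a and b share at least one point."""
    return max(a[0], b[0]) <= min(a[1], b[1])


print(intervals_overlap(ci_all_users, ci_casablanca))
```

Overlap makes the visual comparison inconclusive. It is not a test of the difference in means.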

<!-- END QUESTION -->

### Question 3
Only $18$ users have rated the movie `Lost in Space (1998)`.
1. Use bootstrapping with $10000$ bootstrap resamples to compute a $95\%$ confidence interval for the average `age` of users who have rated `Lost in Space (1998)`. Save your answer in a tuple `q3_1` with two `numpy.float64` items such that `q3_1[0]` is the left bound and `q3_1[1]` is the right bound.
2. One of the advantages of bootstrapping is that we can easily compute confidence intervals for arbitrary measurements of distributions. Use bootstrapping with $10000$ bootstrap resamples to compute a $95\%$ confidence interval for the **median** `rating` of `Lost in Space (1998)`. Note that numpy provides a vectorized function for [calculating the median](https://numpy.org/doc/stable/reference/generated/numpy.median.html) as well as the mean. Save your answer in a tuple `q3_2` with two `numpy.float64` items. Similar requirements as above.
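
The percentile bootstrap described above follows the same steps for any statistic: resample with replacement, recompute the statistic on each resample, and read off the $2.5$th and $97.5$th percentiles. The sketch below is one possible arrangement, assuming a seeded `np.random.default_rng` generator for reproducibility; `bootstrap_ci` and the ages are hypothetical and not part of the assignment.

```python
import numpy as np


def bootstrap_ci(sample, statistic, n_resamples=10_000, confidence=0.95, seed=0):
    """Percentile bootstrap CI for `statistic` (a function accepting axis=1)."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample)
    # Each row is one resample of the same size, drawn with replacement
    resamples = rng.choice(sample, size=(n_resamples, len(sample)), replace=True)
    estimates = statistic(resamples, axis=1)
    tail = 100 * (1 - confidence) / 2
    return (np.percentile(estimates, tail), np.percentile(estimates, 100 - tail))


# 18 made-up ages, for illustration only (not the Lost in Space raters)
toy_ages = np.array([19, 21, 22, 24, 25, 26, 27, 28, 30,
                     31, 33, 35, 38, 41, 45, 50, 52, 60])
mean_ci = bootstrap_ci(toy_ages, np.mean)
median_ci = bootstrap_ci(toy_ages, np.median)
print(mean_ci, median_ci)
```

Fixing the seed makes the interval identical across "Restart and Run All" runs; without it, the bounds change slightly each time.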

In [9]:
# Code for question 3
# Ages of the users who rated the movie
sample1=movie_ratings[movie_ratings["movie_title"]=="Lost in Space (1998)"]["age"].to_numpy()
# 10000 resamples with replacement, one per row
bootstrap_sample1=np.random.choice(sample1, size=(10000, len(sample1)), replace=True)
sampling_means=np.average(bootstrap_sample1, axis=1)
# Percentile bounds of the bootstrap distribution of the mean
bootstrap_ci_l1=np.percentile(sampling_means, 2.5)
bootstrap_ci_r1=np.percentile(sampling_means, 97.5)
q3_1=(bootstrap_ci_l1, bootstrap_ci_r1)
# Leave these lines here for grading and ease of debugging
print(q3_1)

sample2=movie_ratings[movie_ratings["movie_title"]=="Lost in Space (1998)"]["rating"].to_numpy()
bootstrap_sample2=np.random.choice(sample2, size=(10000, len(sample2)),replace=True)
sampling_medians=np.median(bootstrap_sample2, axis=1)
bootstrap_ci_l2=np.percentile(sampling_medians, 2.5)
bootstrap_ci_r2=np.percentile(sampling_medians, 97.5)
q3_2=(bootstrap_ci_l2, bootstrap_ci_r2)
# Leave these lines here for grading and ease of debugging
print(q3_2)

(26.0, 36.72222222222222)
(2.5, 4.0)
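
The half-point lower bound in the median interval most likely reflects how the median behaves on this sample: with an even number of integer ratings ($18$ here), `np.median` averages the two middle values, so every bootstrap median is a multiple of $0.5$. The sketch below illustrates this with made-up ratings and a seeded generator, both of which are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# 18 made-up integer ratings, for illustration only (not the real data)
toy_ratings = np.array([1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5])

resamples = rng.choice(toy_ratings, size=(10_000, len(toy_ratings)), replace=True)
medians = np.median(resamples, axis=1)

# With an even sample size, each median is the mean of two integers
distinct_medians = np.unique(medians)
print(distinct_medians)
```

Because the bootstrap distribution of the median sits on such a coarse grid, the percentile bounds move in jumps rather than smoothly.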


### Question 4
The `Star Wars (1977)` film is quite popular, with a median rating of `5` out of $5$. However, of those that left a rating, male users gave it a slightly higher average rating of about $4.4$ whereas female users gave the same movie an average rating of about $4.2$.

1. Consider the null hypothesis that the average rating of `Star Wars (1977)` is the same for `sex='F'` and `sex='M'` users. The alternative hypothesis is that the average ratings are not equal. Conduct a two-sided t test using [`stats.ttest_ind`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html) to evaluate this using the sample ratings data. Report your p-value and interpret it at a significance level of $0.05$. Save your p-value in `q4_1` as a `numpy.float64` and interpret it in the **Answer 4** cell.

2. Consider the null hypothesis that $51\%$ of men would rate `Star Wars (1977)` a `5`. Conduct a two-sided hypothesis test using [`stats.t.cdf`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.t.html) to evaluate this in light of the sample ratings data of male users who rated `Star Wars (1977)`. Report your p-value and interpret it at a significance level of $0.05$. Save your p-value in `q4_2` as a `numpy.float64` and interpret it in the **Answer 4** cell.

3. Consider the null hypothesis that women and men were equally likely to rate `Star Wars (1977)` a `5`. Conduct a two-sided t test using [`stats.ttest_ind_from_stats`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind_from_stats.html) to evaluate this in light of the sample data of female and male users who rated `Star Wars (1977)`. Report your p-value and interpret it at a significance level of $0.05$. Save your p-value in `q4_3` as a `numpy.float64` and interpret it in the **Answer 4** cell. You should observe a different p-value than in step 2 despite the hypotheses under consideration being ostensibly similar. Briefly explain why you observe this difference.
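
To see how `stats.ttest_ind_from_stats` relates to `stats.ttest_ind`, the sketch below runs both on made-up 0/1 indicators (1 meaning "rated the movie a 5"). The group sizes and rates are assumptions chosen only for illustration. When the summary statistics use the sample standard deviation (`ddof=1`), the two calls should give the same result.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Made-up 0/1 indicators ("rated the movie a 5"), for illustration only
group_a = rng.binomial(1, 0.55, size=300)
group_b = rng.binomial(1, 0.45, size=150)

# Test on the raw indicators (pooled-variance t test, the default)
raw = stats.ttest_ind(group_a, group_b)

# Same test from summary statistics; std1/std2 are sample std devs (ddof=1)
summary = stats.ttest_ind_from_stats(
    mean1=group_a.mean(), std1=group_a.std(ddof=1), nobs1=len(group_a),
    mean2=group_b.mean(), std2=group_b.std(ddof=1), nobs2=len(group_b),
)

print(raw.pvalue, summary.pvalue)
```

A standard deviation computed as `np.sqrt(p * (1 - p))` corresponds to `ddof=0` and differs from the `ddof=1` value by a factor of $\sqrt{n/(n-1)}$, so for groups of a few hundred users the effect on the p-value is small.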

In [10]:
# Code for question 4
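# Split the Star Wars ratings by sex
sw=movie_ratings[movie_ratings["movie_title"]=="Star Wars (1977)"]
m=sw[sw["sex"]=="M"]
f=sw[sw["sex"]=="F"]
m_ratings=m["rating"].to_numpy().tolist()
f_ratings=f["rating"].to_numpy().tolist()
# Share of each group that rated the movie a 5
f_pct=len(f[f["rating"]==5])/len(f)
m_pct=len(m[m["rating"]==5])/len(m)
# One-sample statistic for H0: the male share of 5s equals 0.51
t_score=np.sqrt(len(m))*abs(m_pct-0.51)/(np.sqrt(0.51*0.49))
# 4.1: two-sample t test on the ratings
q4_1=stats.ttest_ind(m_ratings, f_ratings)[1]
# 4.2: two-sided p-value from the t distribution with len(m)-1 degrees of freedom
q4_2=(1-stats.t.cdf(t_score, len(m)-1))*2
# 4.3: two-sample t test on the shares of 5s, from summary statistics
q4_3=stats.ttest_ind_from_stats(mean1=m_pct, std1=np.sqrt(m_pct*(1-m_pct)), nobs1=len(m), mean2=f_pct, std2=np.sqrt(f_pct*(1-f_pct)), nobs2=len(f))[1]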
# Leave these lines here for grading and ease of debugging
print(q4_1)

# Leave these lines here for grading and ease of debugging
print(q4_2)

# Leave these lines here for grading and ease of debugging
print(q4_3)

0.06606506021398857
0.008010296971218134
0.1717837459146108


### Answer 4

4.1

The null hypothesis $H_0$: "men and women give Star Wars the same average rating".

The alternative hypothesis $H_1$: "men and women do not give Star Wars the same average rating".

The p-value is 0.066, which is greater than 0.05, so we cannot reject $H_0$. The data do not provide significant evidence that the average rating differs between male and female users; this does not prove that the averages are equal.

4.2

The null hypothesis $H_0$: "the proportion of men who rate Star Wars a 5 equals $51\%$".

The alternative hypothesis $H_1$: "the proportion of men who rate Star Wars a 5 does not equal $51\%$".

The p-value is 0.008, which is less than 0.05, so we reject $H_0$. There is significant evidence that the proportion of men who rate Star Wars a 5 differs from $51\%$.

4.3

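The null hypothesis $H_0$: "men and women are equally likely to rate Star Wars a 5".

The alternative hypothesis $H_1$: "men and women are not equally likely to rate Star Wars a 5".

The p-value is 0.172, which is greater than 0.05, so we cannot reject $H_0$. The data do not provide significant evidence that women and men differ in how likely they are to rate Star Wars a 5.

The p-value differs from 4.2 because the two tests answer different questions. In 4.2 the male proportion is compared with a fixed value of $51\%$, which has no sampling variability. In 4.3 it is compared with the female proportion, which is itself an estimate with its own sampling error, so the difference between the two groups is measured with more uncertainty.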
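
For the one-sample test in 4.2, the two-sided p-value can also be computed with the survival function `stats.t.sf` instead of `1 - stats.t.cdf`. The sketch below uses hypothetical counts, not the Star Wars data, and assumes the standard error is taken under the null proportion $p_0$.

```python
import numpy as np
from scipy import stats

# Hypothetical counts, for illustration only (not the Star Wars data)
n, fives, p0 = 400, 230, 0.51

p_hat = fives / n
# Standard error under H0 uses the hypothesized proportion p0
t_score = (p_hat - p0) / np.sqrt(p0 * (1 - p0) / n)

# Two-sided p-value; sf(x) equals 1 - cdf(x) but avoids cancellation in the tail
p_value = 2 * stats.t.sf(abs(t_score), df=n - 1)

print(t_score, p_value)
```

Taking `abs(t_score)` before calling `sf` keeps the calculation correct whether the observed proportion is above or below $p_0$.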

<!-- END QUESTION -->

## Submitting

You should make sure any code that you write to answer the questions is included in this notebook. We recommend you go to the Kernel option and choose "Restart & Run All." Double-check that your entire notebook runs correctly and generates the expected output. Finally, make sure to save your work (the timestamp at the top tells you the last checkpoint and whether there are unsaved changes). When you finish, submit your assignment at [Gradescope](http://gradescope.com/).