# Perform 8
In this exercise, you will demonstrate your learning of inferential statistics with confidence intervals, bootstrapping, and hypothesis testing. Problems may involve a combination of math and code. 

Recall that you can use LaTeX to nicely format your math inside Markdown cellsby enclosing equations in single dollar signs (e.g., $x^2+4=8$) for inline math or double dollar signs for centered equations like $$P(X > 5) = \frac{1}{6}.$$ For a reference if you are new to LaTeX, see the [overleaf documentation for mathematical expressions](https://www.overleaf.com/learn/latex/mathematical_expressions). 

Show your work and/or briefly explain your answers. In general you will not receive full credit for numeric answers with no accompanying work or justification (math, code, explanation). For numeric answers, we will accept answers that are very slightly off due to rounding, z score of 2 vs. 1.96, etc. 

When you finish please go to Kernel --> Restart and Run All, and then double check that your notebook looks correct before submitting your .ipynb file (the notebook file) on gradescope.

In [1]:
# Run this code cell to import relevant libraries
import numpy as np
import pandas as pd
from scipy import stats

### Question 1

1. A website is trying to increase registration for first-time visitors, exposing a random subset of these visitors to a new site design. Of 752 randomly sampled visitors over a month who saw the new design, 64 registered. Construct a 95% confidence interval for the percentage of visitors who would register for the website under the new design using the normal distribution.
2. A study examined the average pay for a random sample of men and women entering the workforce as doctors for 21 different positions. If each gender was equally paid, then we would expect about half of those positions to have men paid more than women and women would be paid more than men in the other half of positions. In the study, men were, on average, paid more in 17 of the 21 positions. Complete a hypothesis test (two-sided or one-sided, just be clear which you are reporting) to examine whether there is significant evidence (at the 0.05 level) of gender discrimination in pay in these positions. Report your p-value and interpret the result.

In [4]:
# Code for question 1

# 1.1
mean=64/752
sigma=np.sqrt(mean*(1-mean))
cinterval= stats.norm.interval(alpha=0.95,loc=mean,scale=sigma/np.sqrt(752))
print(cinterval)

# 1.2
feeder=(np.sqrt(21)*((17/21)-.5))/np.sqrt(.25)
print(1-stats.norm.cdf(feeder))

(0.06516269200219607, 0.10505007395525073)
0.0022781749015925445


The p value of our one sided test is .002, so we can conclude that we can reject the null hypothesis and say that there is disparity in the average pay of men in women as doctors.

### Answer 1

## Movie Ratings Data
In the remainder of this assignment you will work with the movielens dataset of movie ratings that we have seen before. Below we import and preview the data. It consists of 3 tables: `users` has a row for every individual who has rated any movies, `ratings` has a row for every rating of a particular movie by a particular user, and `movies` has a row for every movie relating its `movie_id` to its `movie_title`. The data is a random sample of all of the movie ratings made on the movielens service. You may need to merge parts of the data to answer some of the following questions.

In [5]:
users = pd.read_csv("users.csv")
users.head()

Unnamed: 0,user_id,age,sex,occupation
0,1,24,M,technician
1,2,53,F,other
2,3,23,M,writer
3,4,24,M,technician
4,5,33,F,other


In [6]:
ratings = pd.read_csv("ratings.csv")
ratings.head()

Unnamed: 0,user_id,movie_id,rating
0,196,242,3
1,186,302,3
2,22,377,1
3,244,51,2
4,166,346,1


In [7]:
movies = pd.read_csv("movies.csv")
movies.head()

Unnamed: 0,movie_id,movie_title
0,1,Toy Story (1995)
1,2,GoldenEye (1995)
2,3,Four Rooms (1995)
3,4,Get Shorty (1995)
4,5,Copycat (1995)


### Question 2
1. Compute a 95% confidence interval for the mean `age` of users using the normal distribution.
2. Compute a 95% confidence interval for the mean `age` of users who have rated the movie `Casablanca (1942)` using the normal distribution.
3. Casablanca is an old movie, one might suspect that it has been rated by older individuals on average than the entire dataset. Just looking at the confidence intervals you computed in steps 1 and 2, can you conclude that there is significant evidence for this belief? Why or why not?

In [13]:
# Code for question 2

# 2.1 CI for age
mu21=np.mean(users['age'])
sigma21=np.std(users['age'])
cinterval21= stats.norm.interval(alpha=0.95,loc=mu21,scale=sigma21/np.sqrt(len(users['age'])))
print(mu21)
print(cinterval21)

# 2.2 CI for age of casablanca raters
userratings=pd.merge(users,ratings)
userratingsmovies=pd.merge(userratings,movies)
casablanca=userratingsmovies[userratingsmovies['movie_title']=='Casablanca (1942)']
mu22=np.mean(casablanca['age'])
sigma22=np.std(casablanca['age'])
cinterval22= stats.norm.interval(alpha=0.95,loc=mu22,scale=sigma22/np.sqrt(len(casablanca['age'])))
print(mu22)
print(cinterval22)

34.05196182396607
(33.27417039488504, 34.829753253047095)
35.89711934156379
(34.46345637868268, 37.330782304444895)


### Answer 2 

On average, the confidence interval for average age of reviews for the movie Casablancas is larger and higher then the one for all reviews. It's hard to tell from just this measure alone, but we might conclude that the reviewers for Casablancas had an older age range based on the average and confidence intervals, but there is overlap between the two.

### Question 3
Only 18 users have rated the movie `Lost in Space (1998)`.
1. Use bootstrapping with 10,000 bootstrap resamples to compute a 95% confidence interval for the average `age` of users who have rated `Lost in Space (1998)`.
2. One of the advantages of bootstrapping is that we can easily compute confidence intervals for arbitrary measurements of distributions. Use bootstrapping with 10,000 bootstrap resamples to compute a 95% confidence interval for the **median** `rating` of `Lost in Space (1998)`. Note that numpy provides a vectorized function for [calculating the median](https://numpy.org/doc/stable/reference/generated/numpy.median.html) as well as the mean.

In [18]:
# Code for question 3

# 3.1
lostinspace=userratingsmovies[userratingsmovies['movie_title']=='Lost in Space (1998)']
lostisage=lostinspace['age']
bootstrap_sample31= np.random.choice(lostisage, size=(100000,18), replace=True)
samplemedians=np.median(bootstrap_sample31,axis=1)
c131=np.percentile(samplemedians,2.5)
c231=np.percentile(samplemedians,97.5)

print(c131,c231)

# 3.2

lostisratings=lostinspace['rating']
bootstrap_sample32= np.random.choice(lostisratings, size=(100000,18), replace=True)
samplemedians=np.median(bootstrap_sample32,axis=1)
c132=np.percentile(samplemedians,2.5)
c232=np.percentile(samplemedians,97.5)

print(c132,c232)

22.5 37.5
2.5 4.0


### Answer 3

### Question 4
The `Star Wars (1977)` film is quite popular, with a median rating of 5 out of 5. However, male users gave it a slightly higher average rating of about 4.4 whereas female users gave the same movie an average rating of about 4.2.
1. Consider the null hypothesis that the average rating of `Star Wars (1977)` is the same for `sex='F'` and `sex='M'` users. The alternative hypothesis is that the average ratings are not equal. Conduct a two-sided t test using [`stats.ttest_ind`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html) to evaluate this using the sample ratings data. Report your p-value and interpret it at a significance level of 0.05.
2. About 51% of female users rated `Star Wars (1977)` a `5` (the highest rating). Consider the null hypothesis that 51% of male users rate `Star Wars (1977)` a `5`. Conduct a two-sided hypothesis test  using [`stats.t.cdf`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.t.html) to evaluate this in light of the sample ratings data of male users who rated `Star Wars (1977)`. Report your p-value and interpret it at a significance level of 0.05.
3. Consider the null hypothesis that female and male users are equally likely to rate `Star Wars (1977)` a `5`. Conduct a two-sided t test using [`stats.ttest_ind_from_stats`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind_from_stats.html) to evaluate this in light of the sample data of female and male users who rated `Star Wars (1977)`. Report your p-value and interpret it at a significance level of 0.05. You should observe a different p-value than in step 2 despite the hypotheses under consideration being ostensibly similar. Briefly explain why you observe this difference.

In [35]:
# Code for question 4

# 4.1
starwars=femsw=userratingsmovies[userratingsmovies['movie_title']=='Star Wars (1977)']
femsw=starwars[starwars['sex']=='F']
mensw=starwars[starwars['sex']=='M']
stats.ttest_ind_from_stats(mean1=np.mean(femsw['rating']), std1=np.std(femsw['rating']), nobs1=len(femsw['rating']), mean2=np.mean(mensw['rating']), std2=np.std(mensw['rating']), nobs2=len(mensw['rating']))


# 4.3
femsw5=femsw[femsw['rating']==5]

stats.ttest_ind_from_stats(mean1=femsw5/femsw['rating'], std1=np.std(femsw['rating']), nobs1=len(femsw['rating']), mean2=mensw5/mensw['rating'], std2=np.std(mensw['rating']), nobs2=len(mensw['rating']))


Ttest_indResult(statistic=-1.8447732162807637, pvalue=0.06557932832818517)

4.1

p-value is .066 about, so it is great enough to where we cannot reject the null hypothesis. The average ratings between the sexes are not dissimilar enough for us to make a conclusion.

In [None]:
# 4.2
mensw5=mensw[mensw['rating']==5]
mean42=len(mensw5)/len(mensw)
feeder42=(np.sqrt(len(mensw5))*(np.mean(mensw5)-.51))/np.sqrt(.51*.49)
print(1-stats.t.cdf(mean42))

### Answer 4