# Practice 8
In this exercise, you will practice inferential statistics with confidence intervals, bootstrapping, and hypothesis testing. Problems may involve a combination of math and code. 

Recall that you can use LaTeX to nicely format your math inside Markdown cells by enclosing equations in single dollar signs (e.g., $x^2+4=8$) for inline math or double dollar signs for centered equations like $$P(X > 5) = \frac{1}{6}.$$ For a reference if you are new to LaTeX, see the [overleaf documentation for mathematical expressions](https://www.overleaf.com/learn/latex/mathematical_expressions). 

Show your work and/or briefly explain your answers. In general you will not receive full credit for numeric answers with no accompanying work or justification (math, code, explanation). For numeric answers, we will accept answers that are very slightly off due to rounding, z score of 2 vs. 1.96, etc. 

When you finish, please go to Kernel --> Restart and Run All, then double-check that your notebook looks correct before submitting your .ipynb file (the notebook file) on Gradescope.

In [1]:
# Run this code cell to import relevant libraries
import numpy as np
import pandas as pd
from scipy import stats

### Question 1
The General Social Survey asked the following question to a random sample of 1,155 Americans: “After an average work day, about how many hours do you have to relax or pursue activities that you enjoy?” A 95% confidence interval for the mean number of hours spent relaxing or pursuing activities they enjoy was (1.38, 1.92).
1. Your friend reads the survey and says it means "95% of the survey respondents reported between 1.38 and 1.92 hours." Is this a valid interpretation of the confidence interval? Why or why not?
2. Suppose another set of researchers reported a confidence interval of (1.29, 2.01) based on the same sample of 1,155 Americans. Is this indicative of a higher or lower confidence level (the percentage)?
3. Suppose next year a new survey asking the same question is conducted, and this time the sample size is 2,500. Assuming that the summary statistics (mean and standard deviation) are roughly the same as before, how will the new confidence interval differ from the (1.38, 1.92) computed before? Why?

### Answer 1
1. No, this is not a valid interpretation. A confidence interval describes the population mean, not the spread of individual responses. The interval (1.38, 1.92) means we are 95% confident that the mean number of relaxation hours for all Americans lies between 1.38 and 1.92; that is, intervals constructed this way capture the true mean in about 95% of repeated samples.
2. It indicates a higher confidence level. Both intervals come from the same sample, so they share the same center (1.65) and the same standard error. The only way to get a wider interval is a larger critical value, which corresponds to a higher confidence level.
3. The new interval will be narrower and centered at about the same value. The margin of error for a 95% interval is approximately $\frac{2\sigma}{\sqrt{n}}$, so with $\sigma$ unchanged, increasing $n$ from 1,155 to 2,500 shrinks the margin by a factor of $\sqrt{1155/2500} \approx 0.68$.
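
As a rough check on part 3, the sketch below backs out the margin of error from the reported interval and rescales it for the larger sample. It assumes, as the question states, that the mean and standard deviation stay about the same; the variable names are illustrative only.

```python
import numpy as np

# Reported 95% interval and the two sample sizes from Question 1
lower, upper = 1.38, 1.92
n_old, n_new = 1155, 2500

center = (lower + upper) / 2          # sample mean implied by the interval
half_width_old = (upper - lower) / 2  # margin of error at n = 1155

# With the SD unchanged, the margin of error scales like 1 / sqrt(n)
half_width_new = half_width_old * np.sqrt(n_old / n_new)

new_interval = (center - half_width_new, center + half_width_new)
print(round(half_width_new, 3), np.round(new_interval, 2))
```

Under these assumptions the new interval comes out to roughly (1.47, 1.83), which sits inside the original one.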

### Question 2
1. A random survey of 1,000 US adults found that 42% believe raising the minimum wage will help the economy. Using the normal distribution, construct a 95% confidence interval for the true percentage of US adults who believe this.
2. A study of 19 random Risso's dolphins finds that the average amount of micrograms of mercury per wet gram of muscle in a dolphin is 4.4, with a standard deviation of 2.3. Construct a 95% confidence interval around this empirical mean using the student's t-distribution.   

In [2]:
# Code for question 2 (or can use a hand calculator and show work)

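# 2.1: for a proportion, the standard deviation is sqrt(p * (1 - p))
sigma = 0.49355850717  # sqrt(0.42 * 0.58)
# The confidence level is passed positionally because the keyword was
# renamed from `alpha` to `confidence` in newer SciPy releases
cinterval = stats.norm.interval(0.95, loc=.42, scale=sigma / np.sqrt(1000))
print(cinterval)

# 2.2: t-interval with n - 1 = 18 degrees of freedom
cinterval2 = stats.t.interval(0.95, df=18, loc=4.4, scale=2.3 / np.sqrt(19))
print(cinterval2)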

(0.38940948891043786, 0.4505905110895621)
(3.2914354851665495, 5.508564514833451)


### Answer 2
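1. (0.39, 0.45): we are 95% confident that between 39% and 45% of US adults believe raising the minimum wage will help the economy.
2. (3.3, 5.5): we are 95% confident that the mean mercury concentration in Risso's dolphins is between 3.3 and 5.5 micrograms per wet gram of muscle.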
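
To show the work behind these intervals, here is a minimal sketch that builds both by hand as estimate ± critical value × standard error. It should agree with the `stats.norm.interval` and `stats.t.interval` results; the variable names are my own.

```python
import numpy as np
from scipy import stats

# 2.1: proportion, normal approximation
p_hat, n = 0.42, 1000
z_star = stats.norm.ppf(0.975)             # about 1.96
se_prop = np.sqrt(p_hat * (1 - p_hat) / n)
ci_prop = (p_hat - z_star * se_prop, p_hat + z_star * se_prop)

# 2.2: mean, Student's t with n - 1 degrees of freedom
x_bar, s, m = 4.4, 2.3, 19
t_star = stats.t.ppf(0.975, df=m - 1)      # about 2.10
se_mean = s / np.sqrt(m)
ci_mean = (x_bar - t_star * se_mean, x_bar + t_star * se_mean)

print(np.round(ci_prop, 3), np.round(ci_mean, 2))
```

The t critical value is larger than 1.96 because the small sample adds uncertainty about the standard deviation.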

### Question 3
You have a small dataset of the total number of miles that a random subset of individuals have walked over the last week: `data = [1, 3, 4, 8, 14, 23, 39, 51, 106, 319]` as defined in the code below.
1. Construct a 95% confidence interval for the mean of `data` using the student's t-distribution.
2. Use bootstrapping with 100,000 bootstrap resamples to construct a 95% confidence interval for the mean of `data`.
3. Which confidence interval is more reasonable? Why?

In [3]:
# Run but do not modify this cell
data = np.array([1, 3, 4, 8, 14, 23, 39, 51, 106, 319])

In [4]:
# Code for question 3
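mu = np.mean(data)
# NOTE: np.std defaults to ddof=0 (population SD). A t-interval normally uses the
# sample SD, np.std(data, ddof=1), which would give a slightly wider interval.
sigma2 = np.std(data)
n = len(data)

# 3.1: t-interval; the confidence level is passed positionally because the
# keyword was renamed from `alpha` to `confidence` in newer SciPy releases
cinterval3 = stats.t.interval(0.95, df=n - 1, loc=mu, scale=sigma2 / np.sqrt(n))
print(cinterval3)

# 3.2: percentile bootstrap; each row is one resample of size n drawn with replacement
bootstrap_sample = np.random.choice(data, size=(100000, n), replace=True)
samplemeans = np.average(bootstrap_sample, axis=1)
c1 = np.percentile(samplemeans, 2.5)
c2 = np.percentile(samplemeans, 97.5)

print(c1, c2)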


(-9.412687084679476, 123.01268708467947)
13.797500000000039 121.9


### Answer 3

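3.1 The t-interval is approximately (-9.4, 123.0).

3.2 The bootstrap interval is approximately (13.8, 121.9); the exact endpoints vary slightly from run to run because the resampling is random.

3.3 The bootstrap confidence interval is more reasonable. The t-interval includes negative values, which are impossible for miles walked, because it assumes the sample mean is roughly normally distributed. With only 10 observations and a heavily right-skewed sample (one value of 319), that assumption is doubtful. The bootstrap interval is built from resampled means of the observed data, so it stays within the range of plausible values and reflects the skew.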
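
For comparison, the sketch below recomputes both intervals with two changes that are my own assumptions rather than part of the assignment: the sample standard deviation (`ddof=1`) for the t-interval, and a seeded `np.random.default_rng` generator so the bootstrap result is reproducible.

```python
import numpy as np
from scipy import stats

data = np.array([1, 3, 4, 8, 14, 23, 39, 51, 106, 319])
n = len(data)

# t-interval using the sample SD (ddof=1) instead of the population SD
s = np.std(data, ddof=1)
t_ci = stats.t.interval(0.95, df=n - 1, loc=data.mean(), scale=s / np.sqrt(n))

# Percentile bootstrap with a fixed seed (the seed value 0 is arbitrary)
rng = np.random.default_rng(0)
resamples = rng.choice(data, size=(100_000, n), replace=True)
boot_means = resamples.mean(axis=1)
boot_ci = np.percentile(boot_means, [2.5, 97.5])

print(np.round(t_ci, 1), np.round(boot_ci, 1))
```

The t-interval still dips below zero, while the bootstrap interval cannot, since every resampled mean is an average of non-negative observations.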

### Question 4
#### Part 1. 
It is believed that nearsightedness affects about 8% of all children. In a random sample of 194 children, 21 are nearsighted. Consider the following question: do these data provide evidence that the 8% value is inaccurate? State the specific hypotheses you will test to answer this question and indicate whether it is a one-sided or two-sided test (you can do either, just clarify which). Use a significance level of 0.05. Conduct the hypothesis test and calculate the p-value using the normal distribution. Interpret your result.

#### Part 2.
A USA Today/Gallup poll asked a group of unemployed and underemployed Americans if they have had major problems in their relationships with their spouse or another close family member as a result of not having a job (if unemployed) or not having a full-time job (if underemployed). 27% of the 1,145 unemployed respondents and 25% of the 675 underemployed respondents said they had major problems in relationships as a result of their employment status. Consider the following question: is the percentage of those having major problems different for unemployed versus underemployed Americans? State the specific hypotheses you will test to answer this question and indicate whether it is a one-sided or two-sided test (you can do either, just clarify which). 

Use a significance level of 0.05. Conduct the hypothesis test and calculate the p-value. You can do so most easily using [`scipy.stats.ttest_ind_from_stats`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind_from_stats.html#scipy.stats.ttest_ind_from_stats), though you can also look up the standard error calculations for the difference of proportions in Chapter 6.2 of the openIntro Statistics book referenced in the prepare if you wish to run the test using the normal distribution for a (very) slightly tighter p-value (you will get similar p-values and the same conclusion either way). Interpret your result.

4.1 

Null Hypothesis: The true proportion of nearsighted children is 8% ($H_0: p = 0.08$).

Alternative Hypothesis: The true proportion of nearsighted children is greater than 8% ($H_A: p > 0.08$). This is a one-sided test.

The sample proportion is $21/194 \approx 0.108$, which gives a test statistic of $z \approx 1.45$ and a one-sided p-value of about 0.073. Because 0.073 is greater than the significance level of 0.05, we fail to reject the null hypothesis: this sample does not provide statistically significant evidence that the 8% value is inaccurate.

In [5]:
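# Code for question 4.1
# z statistic for a one-sample proportion test: (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
z = (np.sqrt(194)*((21/194)-.08))/np.sqrt(.08*.92)
# One-sided (upper-tail) p-value
print(1-stats.norm.cdf(z))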

0.07349538001845213
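
The question allows either a one-sided or a two-sided test. As a sketch, the block below computes both p-values from the same z statistic, using `stats.norm.sf` for the upper tail; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

p0, n, successes = 0.08, 194, 21
p_hat = successes / n

# Standard error under the null hypothesis p = p0
se0 = np.sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / se0

p_one_sided = stats.norm.sf(z)           # P(Z > z)
p_two_sided = 2 * stats.norm.sf(abs(z))  # P(|Z| > |z|)

print(round(z, 2), round(p_one_sided, 3), round(p_two_sided, 3))
```

The two-sided p-value is twice the one-sided value, so the conclusion at the 0.05 level is the same either way: fail to reject the null hypothesis.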


4.2

Null Hypothesis: The proportion reporting major relationship problems is the same for unemployed and underemployed Americans ($H_0: p_1 = p_2$).

Alternative Hypothesis: The two proportions differ ($H_A: p_1 \neq p_2$). This is a two-sided test.

The test gives a p-value of about 0.35, which is greater than 0.05, so we fail to reject the null hypothesis. The observed difference (27% versus 25%) is small enough to be explained by sampling variation.


In [6]:
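# Code for question 4.2
# Each response is a 0/1 outcome, so its standard deviation is sqrt(p * (1 - p))
sd1 = np.sqrt(.27*.73)
sd2 = np.sqrt(.25*.75)

# Two-sided, two-sample t-test computed from summary statistics
stats.ttest_ind_from_stats(mean1=.27, std1=sd1, nobs1=1145, mean2=.25, std2=sd2, nobs2=675)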



Ttest_indResult(statistic=0.9368337461051707, pvalue=0.3489685143193123)
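
The question also mentions running this test with the normal distribution. A minimal sketch of that version, assuming the usual pooled standard error for a difference of two proportions under the null hypothesis:

```python
import numpy as np
from scipy import stats

p1, n1 = 0.27, 1145  # unemployed
p2, n2 = 0.25, 675   # underemployed

# Pooled proportion: the best estimate of the common p if the null is true
p_pool = (p1 * n1 + p2 * n2) / (n1 + n2)
se_pool = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

z = (p1 - p2) / se_pool
p_value = 2 * stats.norm.sf(abs(z))  # two-sided

print(round(z, 3), round(p_value, 3))
```

The resulting p-value is close to the t-test value of about 0.35, and the conclusion does not change.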

### Answer 4

### Question 5
Below we import the `university_data` dataset we have looked at before. It contains information about 311 universities in the United States. In general, private universities charge higher tuition rates than public universities. However, private universities often argue that once you take financial aid into account, the cost is often not different. In this question you will explore this issue.
1. First, report the average `tuition` of `public` schools and the average `tuition` of `private` schools to confirm the basic notion that `private` schools charge higher tuition on average.
2. Consider the null hypothesis that `private` and `public` universities have the same average `cost_after_aid`. Conduct a two-sided t-test to determine whether the dataset provides statistically significant evidence to reject the null hypothesis in favor of the alternative hypothesis that they have different average `cost_after_aid`. You will notice that some universities do not have a value recorded for `cost_after_aid`. For now, simply omit those universities from your analysis and assume that the remaining are a random sample of American universities. Report the resulting p-value. Interpret your results at a significance level of 0.05.
3. In the previous step you tested for statistical significance of the difference in `cost_after_aid` between public and private schools. What is the effect size? Report the average `cost_after_aid` of `public` schools and the average `cost_after_aid` of `private` schools.
4. In step 2 we assumed that we could omit the universities with missing data and the remainder would be a random sample of American universities. Is that assumption well justified? Consider especially the average values you computed in steps 1 and 3 and consider which universities are missing the `cost_after_aid` information. Given this, what can you say about the claim that "private universities often argue that once you take financial aid into account, the cost is often not different?"

In [7]:
# Run but do not modify this code
uni = pd.read_csv("university_data.csv")
uni.tail(100)

Unnamed: 0,act_avg,sat_avg,enrollment,city,acceptance_rate,percent_receiving_aid,cost_after_aid,state,hs_gpa_avg,tuition,Institution_name,institution_type,us_rank
211,21.0,960.0,14622.0,Denver,61.0,,,CO,3.4,31209,University of Colorado--Denver,public,207.0
212,20.0,910.0,6999.0,North Dartmouth,76.0,,,MA,3.2,28285,University of Massachusetts--Dartmouth,public,207.0
213,20.0,950.0,10077.0,Missoula,92.0,,,MT,3.3,24943,University of Montana,public,207.0
214,19.0,,18313.0,Kalamazoo,82.0,,,MI,3.3,14699,Western Michigan University,public,207.0
215,23.0,1030.0,45813.0,Miami,49.0,,,FL,3.9,18956,Florida International University,public,216.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
306,,,,Cypress,96.0,,,CA,,9000,Trident University International,proprietary,
307,,,,Cincinnati,,,,OH,,12416,Union Institute and University,private,
308,,,,Phoenix,,,,AZ,,9690,University of Phoenix,proprietary,
309,,,,Minneapolis,,,,MN,,12075,Walden University,proprietary,


### Answer 5

In [8]:
# 5.1
privateset=uni[uni['institution_type']=='private']
publicset=uni[uni['institution_type']=='public']

publictmean=np.mean(publicset['tuition'])
privatetmean=np.mean(privateset['tuition'])

print('public: ')
print(publictmean)
print('private: ')
print(privatetmean)
# feel free to add more code cells

public: 
25968.97894736842
private: 
40871.0350877193


In [9]:
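# 5.2
# Drop universities with no recorded cost_after_aid
publiccostafter = publicset['cost_after_aid'].dropna()
privatecostafter = privateset['cost_after_aid'].dropna()

publicmean = np.mean(publiccostafter)
privatemean = np.mean(privatecostafter)

# NOTE: np.std uses ddof=0 (population SD), while ttest_ind_from_stats expects
# sample SDs (ddof=1). With about 70 schools per group the difference is small.
sdpublic = np.std(publiccostafter)
sdprivate = np.std(privatecostafter)

stats.ttest_ind_from_stats(mean1=publicmean, std1=sdpublic, nobs1=len(publiccostafter), mean2=privatemean, std2=sdprivate, nobs2=len(privatecostafter))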


Ttest_indResult(statistic=3.721426909531116, pvalue=0.0002853807144786082)

5.2 

The two-sided t-test gives a p-value of about 0.0003, which is well below the significance level of 0.05, so we reject the null hypothesis that public and private universities have the same average `cost_after_aid`. Among the universities that report this value, the data provide statistically significant evidence that the average cost after aid differs between the two groups.
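
An alternative is to run the test on the raw values instead of summary statistics, which avoids having to choose `ddof` by hand. The sketch below uses Welch's t-test (`equal_var=False`), a choice of mine that does not assume equal variances. The helper name `welch_cost_test` and the toy frame are hypothetical; the toy values are made up only to show the call and are not the university data.

```python
import numpy as np
import pandas as pd
from scipy import stats

def welch_cost_test(df):
    """Two-sided Welch t-test of cost_after_aid, public vs. private."""
    public = df.loc[df["institution_type"] == "public", "cost_after_aid"].dropna()
    private = df.loc[df["institution_type"] == "private", "cost_after_aid"].dropna()
    return stats.ttest_ind(public, private, equal_var=False)

# Toy frame with made-up numbers, only to demonstrate the call
toy = pd.DataFrame({
    "institution_type": ["public"] * 4 + ["private"] * 4,
    "cost_after_aid": [30000, 35000, np.nan, 38000, 28000, 31000, 33000, np.nan],
})

result = welch_cost_test(toy)
print(result.statistic, result.pvalue)
```

On the real data this would be called as `welch_cost_test(uni)`.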

In [10]:
print(publicmean)
print(privatemean)

36163.055555555555
31647.098591549297


5.3

The effect size is about 4,500: the average `cost_after_aid` is about 36,200 for public schools and about 31,600 for private schools. Among the schools that report this value, public schools are the more expensive group on average.

5.4

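The assumption is not well justified. More than half of the universities (168 of 311) have no `cost_after_aid` value, and the missing values are not spread evenly: only 72 of 190 public schools (about 38%) report it, compared with 71 of 114 private schools (about 62%). The preview of the dataset also shows that the lower-ranked schools at the end of the table are missing this value, so the remaining schools are unlikely to be a random sample of American universities. In addition, the average `cost_after_aid` of the reporting public schools (about 36,200) is higher than the average `tuition` across all public schools (about 26,000), which suggests that the reporting public schools are not typical or that `cost_after_aid` measures something broader than tuition. Among the schools that do report, private schools are actually cheaper after aid, which is consistent with the private universities' claim. Because of the non-random missing data, however, we cannot confidently generalize this result to all universities.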

In [11]:
print(len(uni['cost_after_aid']))
print(len(uni['cost_after_aid'].dropna()))
print()
print(len(publicset['cost_after_aid']))
print(len(publicset['cost_after_aid'].dropna()))
print()
print(len(privateset['cost_after_aid']))
print(len(privateset['cost_after_aid'].dropna()))

311
143

190
72

114
71
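
To make the imbalance in missing data concrete, the sketch below turns the counts printed above into reporting rates and runs a chi-square test of independence between institution type and whether `cost_after_aid` is reported. The test is my own addition, not something the assignment asks for.

```python
from scipy import stats

# (total schools, schools reporting cost_after_aid), from the counts printed above
counts = {"public": (190, 72), "private": (114, 71)}

rates = {kind: reported / total for kind, (total, reported) in counts.items()}
for kind, rate in rates.items():
    print(f"{kind}: {rate:.1%} report cost_after_aid")

# 2x2 table: rows are institution types, columns are (reported, missing)
table = [[reported, total - reported] for total, reported in counts.values()]
chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(p_value)
```

A small p-value here indicates that whether a school reports `cost_after_aid` depends on its type, which is evidence against treating the reporting schools as a random sample.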
