# Chapter 5: Practice Exercises with Solutions

In [1]:
import numpy as np
from scipy.stats import norm

**Exercise 1:** As part of a quality control process for computer chips, an engineer randomly samples 232 chips at a factory during a week of production to test the current rate of chips with severe defects. She finds that 34 of the chips are defective.

A) What is the population under consideration in this study?

B) What parameter is being estimated?

C) What is the point estimate $\hat{p}$ for the parameter?

D) Compute the standard error to measure the uncertainty in the point estimate.

In [2]:
print('''
A) All computer chips manufactured at the factory during the week of production.''')

print('''
B) The proportion of computer chips manufactured at the factory during the week of production that had severe defects.''')

p_hat = 34/232  # keep the unrounded estimate; round only when printing

print('''
C) We estimate the parameter by computing the observed proportion in the data: p_hat = 34/232 =''', '%.2f' % p_hat)

print('''
D) SE = sqrt(p_hat*(1-p_hat)/n) =''', '%.3f' % (np.sqrt(p_hat*(1-p_hat)/232)))


A) All computer chips manufactured at the factory during the week of production.

B) The proportion of computer chips manufactured at the factory during the week of production that had severe defects.

C) We estimate the parameter by computing the observed proportion in the data: p_hat = 34/232 = 0.15

D) SE = sqrt(p_hat*(1-p_hat)/n) = 0.023
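
As a minimal sketch, parts C and D can also be computed directly from the raw counts, carrying the unrounded proportion through and rounding only for display. The variable names `n` and `defective` are illustrative and not part of the original solution.

```python
import numpy as np

n = 232          # chips sampled
defective = 34   # sampled chips with severe defects

# Point estimate of the population proportion of defective chips
p_hat = defective / n

# Standard error of the sample proportion, using the unrounded p_hat
se = np.sqrt(p_hat * (1 - p_hat) / n)

print(f"p_hat = {p_hat:.4f}")
print(f"SE    = {se:.4f}")
```

To three decimal places the standard error still rounds to 0.023, so rounding `p_hat` first makes no visible difference in this exercise. It can matter in longer calculations.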


**Exercise 2:** (fictional) Of all the first year university students, 18% made the Dean's list in the current year. As part of a class project, students randomly sample 40 students and check if those students made the list. They repeat this 1,000 times and build a distribution of sample proportions.

A) What is this distribution called?

B) Calculate the variability of this distribution.

In [3]:
print('A) Sampling distribution.')

print('''
B) The population proportion p = 0.18 is known, so SE = sqrt(p*(1-p)/n) =''', round(np.sqrt(0.18*(1-0.18)/40), 3))

A) Sampling distribution.

B) The population proportion p = 0.18 is known, so SE = sqrt(p*(1-p)/n) = 0.061
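
The class project described in this exercise can be imitated with a short simulation. The sketch below assumes the student population is large enough that each sample of 40 behaves like 40 independent draws with success probability 0.18 (a binomial model). The seed and variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility only

p = 0.18      # population proportion on the Dean's list
n = 40        # students per sample
reps = 1000   # number of repeated samples

# Each binomial draw counts how many of the n sampled students made the list.
counts = rng.binomial(n, p, size=reps)
p_hats = counts / n  # one sample proportion per repetition

simulated_sd = p_hats.std(ddof=1)           # spread of the simulated sampling distribution
theoretical_se = np.sqrt(p * (1 - p) / n)   # formula used in part B

print(f"SD of simulated sample proportions: {simulated_sd:.3f}")
print(f"Theoretical standard error:         {theoretical_se:.3f}")
```

The two numbers should be close but not identical, since the simulated value is itself an estimate based on 1,000 samples.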


**Exercise 3:** A website is trying to increase registration for first-time visitors, exposing 1% of these visitors to a new site design. Of 782 randomly sampled visitors over a month who saw the new design, 64 of them registered.

A) Check any conditions required for constructing a confidence interval.

B) Compute the standard error.

C) Construct and interpret a 90% confidence interval for the fraction of first-time visitors of the site who would register under the new design (assuming stable behaviors by new visitors over time).

D) This time, construct a 95% confidence interval. Is this interval wider or narrower? Why?

In [4]:
print('''
A) (1) The visitors are from a simple random sample, so the independence condition is satisfied. 
   (2) The success-failure condition is also satisfied, with both 64 and 782-64 = 718 above 10. 
   Therefore, we can use a normal distribution to model p_hat and construct a confidence interval.''')

print('''
B) The sample proportion is: p_hat = 64/782 =''', round(64/782, 2))
print("   SE = sqrt(p_hat*(1-p_hat)/n) =", '%.3f' % (np.sqrt(0.08*(1-0.08)/782)))

print('''
C) For a 90% confidence interval, use z =''',  norm.ppf(0.95).round(4))
print('''
   The confidence interval is therefore point estimate +/- z x SE =
   0.08 +/- 1.645 x 0.010 =
  ''', 0.08-1.645*0.010, ",", 0.08+1.645*0.010)
print("   We are 90% confident that 6.35% to 9.65% of first-time site visitors will register using the new design.")

print('''
D) For a 95% confidence interval, use z =''',  norm.ppf(0.975).round(4))
print('''
   The confidence interval is therefore point estimate +/- z x SE =
   0.08 +/- 1.96 x 0.010 =
  ''', 0.08-1.96*0.010, ",", 0.08+1.96*0.010)

print("   We are 95% confident that 6.04% to 9.96% of first-time site visitors will register using the new design.")

print("""
   This interval is wider as we would like to be even more confident that 
   the true proportion of registering first-time visitors is in this interval.""")


A) (1) The visitors are from a simple random sample, so the independence condition is satisfied. 
   (2) The success-failure condition is also satisfied, with both 64 and 782-64 = 718 above 10. 
   Therefore, we can use a normal distribution to model p_hat and construct a confidence interval.

B) The sample proportion is: p_hat = 64/782 = 0.08
   SE = sqrt(p_hat*(1-p_hat)/n) = 0.010

C) For a 90% confidence interval, use z = 1.6449

   The confidence interval is therefore point estimate +/- z x SE =
   0.08 +/- 1.645 x 0.010 =
   0.06355 , 0.09645000000000001
   We are 90% confident that 6.35% to 9.65% of first-time site visitors will register using the new design.

D) For a 95% confidence interval, use z = 1.96

   The confidence interval is therefore point estimate +/- z x SE =
   0.08 +/- 1.96 x 0.010 =
   0.0604 , 0.0996
   We are 95% confident that 6.04% to 9.96% of first-time site visitors will register using the new design.

   This interval is wider as we would like to be even more confident that
   the true proportion of registering first-time visitors is in this interval.
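
The two intervals differ only in the critical value: a higher confidence level needs a larger z, which widens the margin of error. The sketch below wraps the calculation in a small helper to make that visible. The function name `proportion_ci` is a hypothetical choice, and unlike the hand calculation it does not round p_hat to 0.08 or SE to 0.010 before building the interval.

```python
import numpy as np
from scipy.stats import norm

def proportion_ci(successes, n, confidence):
    """Normal-approximation confidence interval for a single proportion."""
    p_hat = successes / n
    se = np.sqrt(p_hat * (1 - p_hat) / n)
    # Two-sided critical value, e.g. about 1.645 for 90% and 1.96 for 95%
    z_star = norm.ppf(1 - (1 - confidence) / 2)
    return p_hat - z_star * se, p_hat + z_star * se

ci_90 = proportion_ci(64, 782, 0.90)
ci_95 = proportion_ci(64, 782, 0.95)

for level, (lo, hi) in [(90, ci_90), (95, ci_95)]:
    print(f"{level}% CI: ({lo:.4f}, {hi:.4f}), width = {hi - lo:.4f}")
```

Because the intermediate values are not rounded here, the endpoints come out slightly different from the hand-calculated ones. The 95% interval is still the wider of the two.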

**Exercise 4:** (fictional) 400 students were randomly sampled from RMIT and 220 of them said they go to the gym at least once a week. Using the Prepare, Check, Calculate, Conclude framework, conduct a hypothesis test to check whether this represents a statistically significant difference from 50%:

A) Using a significance level of 1%.

B) Using a significance level of 5%.

C) Explain how and why the conclusion changes when the significance level changes.
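
As a compact sketch of the calculation this exercise calls for, the two-sided one-proportion z-test can be wrapped in a small helper and evaluated at both significance levels. The function name `one_proportion_z_test` and its arguments are illustrative assumptions, not part of the original solution. The standard error is computed under the null value `p0`, as is conventional for this test.

```python
import numpy as np
from scipy.stats import norm

def one_proportion_z_test(successes, n, p0):
    """Two-sided z-test of H0: p = p0 using the normal approximation."""
    p_hat = successes / n
    se = np.sqrt(p0 * (1 - p0) / n)   # standard error under the null hypothesis
    z = (p_hat - p0) / se
    p_value = 2 * norm.sf(abs(z))     # area in both tails beyond |z|
    return z, p_value

z, p_value = one_proportion_z_test(220, 400, 0.5)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")

# Same data, two significance levels
for alpha in (0.01, 0.05):
    decision = "reject H0" if p_value < alpha else "fail to reject H0"
    print(f"alpha = {alpha:.2f}: {decision}")
```

The data and the p-value are the same in both cases. Only the threshold the p-value is compared against changes.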

In [5]:
print('''A) Using a significance level of 1%

PREPARE
The parameter of interest: The proportion of RMIT students who go to the gym at least once a week.
Set up hypotheses: 
Ho: p = 0.50
Ha: p != 0.50
Significance level (alpha): 1%
''')
p0 = 0.5
p_hat = 220/400
n = 400   

print(f'''CHECK
(1) Independence: The 400 students were randomly sampled from RMIT, so the independence condition is satisfied.
(2) Success-failure: n*p0 = 400*(0.5) = {n*p0} and n*(1-p0) = 400*(0.5) = {n*(1-p0)} are both >= 10, 
hence the success-failure conditions are satisfied.
''')

print('CALCULATE')
se = np.sqrt(p0*(1 - p0)/n)
z_score = round((p_hat - p0)/se, 2)
print(f'''The Z-score is {z_score}. The Z-score's one-tail area: {round(1 - norm.cdf(np.abs(z_score)), 3)}, but this is a two-sided hypothesis test.
Thus, the p-value is twice this one-tail area: {round(2*(1 - norm.cdf(np.abs(z_score))), 3)}
''')

print('''CONCLUDE (with 1%)
Because the p-value of 4.6% is greater than alpha = 1%, we do NOT REJECT the null hypothesis.
The data do not provide sufficient evidence that the proportion of RMIT students who go to the gym at least once a week differs from 50%.
''')

print('''B) Using a significance level of 5%
CONCLUDE (with 5%)
Because the p-value of 4.6% is smaller than alpha = 5%, we REJECT the null hypothesis and
conclude that the proportion of RMIT students who go to the gym at least once a week is different from 50%.
In this particular case, the data suggest that the actual proportion is higher than 50% (since p_hat = 0.55).
''')

print('''C) Why the conclusion changes:
The significance level is the Type 1 error probability we are willing to tolerate, so it sets the threshold
that the p-value must fall below. A 1% level is more conservative and rejects the null hypothesis only with
very strong evidence, while a 5% level rejects it more readily. Here the p-value of 4.6% lies between the two
thresholds, so the null hypothesis is rejected at the 5% level but not at the 1% level.
''')

A) Using a significance level of 1%

PREPARE
The parameter of interest: The proportion of RMIT students who go to the gym at least once a week.
Set up hypotheses: 
Ho: p = 0.50
Ha: p != 0.50
Significance level (alpha): 1%

CHECK
(1) Independence: The 400 students were randomly sampled from RMIT, so the independence condition is satisfied.
(2) Success-failure: n*p0 = 400*(0.5) = 200.0 and n*(1-p0) = 400*(0.5) = 200.0 are both >= 10, 
hence the success-failure conditions are satisfied.

CALCULATE
The Z-score is 2.0. The Z-score's one-tail area: 0.023, but this is a two-sided hypothesis test.
Thus, the p-value is twice this one-tail area: 0.046

CONCLUDE (with 1%)
Because the p-value of 4.6% is greater than alpha = 1%, we do NOT REJECT the null hypothesis.
The data do not provide sufficient evidence that the proportion of RMIT students who go to the gym at least once a week differs from 50%.

B) Using a significance level of 5%
CONCLUDE (with 5%)
Because the p-value of 4.6% is smaller than alp