# Homework 8: Confidence Intervals

**Reading**: 
* [Estimation](https://www.inferentialthinking.com/chapters/13/estimation.html)

Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged. Refer to the policies page to learn more about how to learn cooperatively.

For all problems that you must write out explanations and sentences for, you **must** provide your answer in the designated space. Moreover, throughout this homework and all future ones, please be sure not to re-assign variables throughout the notebook! For example, if you use `max_temperature` in your answer to one question, do not reassign it later on.

In [None]:
# Don't change this cell; just run it. 

import numpy as np
from datascience import *

# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)



## 1. Thai Restaurants


Ben and Frank are trying to see what the best Thai restaurant in Berkeley is. They survey 1500 UC Berkeley students selected uniformly at random, and ask each student what Thai restaurant is the best (*Note: this data is fabricated for the purposes of this homework*). The choices of Thai restaurant are Lucky House, Imm Thai, Thai Temple, and Thai Basil. After compiling the results, Ben and Frank release the following percentages from their sample:

|Thai Restaurant  | Percentage|
|:------------:|:------------:|
|Lucky House | 8% |
|Imm Thai | 52% |
|Thai Temple | 25% |
|Thai Basil | 15% |

These percentages represent a **uniform random sample** of the population of UC Berkeley students. We will attempt to estimate the corresponding *parameters*, or the percentage of the votes that each restaurant will receive from the entire population (the entire population is all UC Berkeley students). We will use confidence intervals to compute a range of values that reflects the uncertainty of our estimates.

The table `votes` contains the results of the survey.

In [None]:
# Just run this cell
votes = Table.read_table('votes.csv')
votes

We are trying to estimate a population parameter, in this case, the mean percentage of votes for Imm Thai. The problem is that all we have is a sample. If we were able to get another sample, it would not be the same as our first sample and the number of votes for Imm Thai would not be the same. This would be true for all samples that we might collect. 

#### So, how do we deal with the uncertainty that is inherent in the sampling process?
If we could collect more samples, we could look at the distribution of votes for Imm Thai across all samples and quantify the amount of uncertainty in the sampling process. However, in real life, collecting more samples is not an option. If it were, we would probably just get a larger sample to begin with.



The `tbl.sample()` method samples from the table named `tbl` with replacement, meaning that an observation can be selected more than once from `tbl`. Unless otherwise specified, it draws as many rows as there are observations in the original table. The table `votes` has 1500 observations, so `votes.sample()` will create a new table with 1500 rows.
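
To make "with replacement" concrete, here is a minimal sketch that uses plain NumPy rather than the `datascience` table. The array `original` and the generator `rng` are made-up names for illustration only; they are not part of this notebook's data.

```python
import numpy as np

# Seeded generator so the sketch gives the same draw each time it is run
rng = np.random.default_rng(0)

# Hypothetical stand-in for the 'Vote' column of a very small table
original = np.array(['Imm Thai', 'Thai Basil', 'Imm Thai', 'Lucky House', 'Thai Temple'])

# Draw as many values as the original holds, allowing repeats
resample = rng.choice(original, size=len(original), replace=True)

print(resample)
```

The resample has the same length as the original, but some values may appear more often than they did originally and others may not appear at all.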

The big idea here is that if the original sample is both
- large enough
- representative of the population

then the new table we made by sampling the original sample **could have** been drawn from the population. We call this new sample (which we invented from the original sample) a **bootstrap** sample.

Run the cell below and note that the new table has the same number of rows as the original table, but it is not the same as the original table. If you run it several times you will see that each new bootstrap sample is not the same as the previous bootstrap samples. Each of these is treated as a sample that we might have collected from the population of interest.

In [None]:
votes.sample()

Let's count the number of votes for Imm Thai in the bootstrap sample. Run the cell several times to get a look at the variability across different samples.

In [None]:
votes.sample().where('Vote','Imm Thai').num_rows

The function `one_resampled_percentage` below returns Imm Thai's **percentage** of votes after simulating one bootstrap sample of `tbl`.

**Note:** `tbl` will always be in the same format as `votes`.


In [None]:
np.random.seed(12345)
def one_resampled_percentage(tbl):
    new_sample = tbl.sample()
    return new_sample.where('Vote','Imm Thai').num_rows/tbl.num_rows*100

one_resampled_percentage(votes)

The `percentages_in_resamples` function returns an array of 2500 bootstrapped estimates of the percentage of voters who will vote for Imm Thai. It uses the `one_resampled_percentage` function above.


In [None]:
def percentages_in_resamples():
    percentage_imm = make_array()
    for i in np.arange(2500):
        new_pct = one_resampled_percentage(votes)
        percentage_imm = np.append(percentage_imm,new_pct)
    return percentage_imm
    
results = percentages_in_resamples()
results

In the following cell, we run the function, `percentages_in_resamples`, and create a histogram of the calculated statistic for the 2,500 bootstrap estimates of the percentage of voters who voted for Imm Thai. Based on what the original Thai restaurant percentages were, does the graph seem reasonable? Talk to a friend or ask your professor if you are unsure!

In [None]:
resampled_percentages = percentages_in_resamples()
Table().with_column('Estimated Percentage', resampled_percentages).hist("Estimated Percentage")

The histogram above shows the amount of uncertainty that we would expect in samples of size 1500 collected from the population. It appears that the middle is about 52 %, but some samples may be as low as 48 % or as high as 56 %. Keep in mind, all we have to work with is one sample, and we don't know where it might fall in this distribution.

At this point, we could say that we believe that the parameter that we are trying to estimate, (Imm Thai's percentage of votes) is between 48 and 56 percent. However, we can't be completely certain since some samples may be outside of that range.

Mathematically, we could say that $\mu = 52 \pm 4$ percent.

**However, we can not be completely certain that the population parameter is within 4% of 52%. How certain can we be?** 

Since we cannot be 100% certain, maybe being 95% certain that we've captured the population parameter is good enough.


**Q1.** Using the array `resampled_percentages`, find the values at the two edges of the **middle 95%** of the bootstrapped percentage estimates. (Compute the lower and upper ends of the interval, named `imm_lower_bound` and `imm_upper_bound`, respectively.) Round the lower and upper bounds to 1 decimal place.

Hint: you may find the `percentile()` function useful here.
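
As a rough illustration of what "the middle X%" means, the sketch below finds the middle 80% of a made-up array. It uses NumPy's `np.percentile`, which interpolates between values and takes the array first; the `datascience` `percentile` function takes the percentile first and may return slightly different values, so treat this only as an analogy. The name `estimates` is hypothetical.

```python
import numpy as np

# Made-up values standing in for bootstrap estimates: 0, 1, ..., 100
estimates = np.arange(101)

# The middle 80% leaves 10% in each tail, so we need the 10th and 90th percentiles
lower, upper = np.percentile(estimates, [10, 90])

print(lower, upper)  # 10.0 90.0
```

The same idea applies to any confidence level: split the leftover percentage evenly between the two tails.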

In [None]:
imm_lower_bound = ...
imm_upper_bound = ...
print("Bootstrapped 95% confidence interval for the percentage of Imm Thai voters in the population: [{:2.1f}, {:2.1f}]".format(imm_lower_bound, imm_upper_bound))

#### What does this mean?

We believe that the parameter we are trying to estimate (the average percentage of votes that Imm Thai would receive from the population) is between the lower bound and the upper bound.


In [None]:
print("Imm Thai's percentage of votes from the population is within the interval: [{:2.1f}, {:2.1f}]".format(imm_lower_bound, imm_upper_bound))


We are 95 % certain (confident) that we have captured the parameter in this range. There is a 5% chance that we did not capture the population mean for Imm Thai's votes.

**Question 2.** Based on the above analysis, which of the following statements are true? Put your selections in an array called `myAnswers2`.

1. If all of the students at Berkeley voted, Imm Thai's percentage of votes would be 52 %.
2. If all of the students at Berkeley voted, Imm Thai's percentage of votes could be 55 %.
3. If all of the students at Berkeley voted, Imm Thai's percentage of votes could be 50 %.
4. We know that the population average for Imm Thai is between 49.5 and 54.5 %
5. We know that the population average for Imm Thai is likely to be between 49.5 and 54.5 %




In [None]:
myAnswers2 = ...
myAnswers2

#### The next 3 cells are used to generate a plot below. You only need to run these cells.

In [None]:
def confidence(tbl, confidence_level, sample_size, num_bootstraps):
    np.random.seed(12345)
    percentage_imm = make_array()
    for i in  np.arange(num_bootstraps):
        new_sample = tbl.sample(sample_size)
        vote_pct = new_sample.where('Vote','Imm Thai').num_rows/sample_size*100
        percentage_imm = np.append(percentage_imm,vote_pct)
    lower_pctile = (100 - confidence_level)/2
    upper_pctile = 100 - lower_pctile
    lower_bound = percentile(lower_pctile,percentage_imm).round(1)
    upper_bound = percentile(upper_pctile,percentage_imm).round(1)
    middle = np.mean(percentage_imm).round(1)
    interval = [lower_bound ,middle, upper_bound]
    return interval
    
results = confidence(votes,95,150,2500)
results

In [None]:
def plot_ci(x,results):
    color='#2187bb'
    middle = results[1]
    horizontal_line_width = 0.25
    left = x - horizontal_line_width / 2
    top = results[2]
    right = x + horizontal_line_width / 2
    bottom = results[0]
    plt.plot([x, x], [top, bottom], color=color)
    plt.plot([left, right], [top, top], color=color)
    plt.plot([left, right], [bottom, bottom], color=color)
    plt.plot(x, middle, 'o', color='#f44336')
    return



In [None]:
x_ticks = [1,2,3,4]
x_lbls = ['80', '90','95', '99']

plt.xticks(x_ticks, x_lbls)
plt.title('Confidence Intervals vs Confidence Levels')

for i in np.arange(len(x_ticks)):
    cl = float(x_lbls[i])
    results = confidence(votes,cl,1500,2500)
    plot_ci(x_ticks[i],results)

The plot above shows confidence intervals for Imm Thai's voting percentage. Confidence intervals are shown for 80%, 90%, 95%, and 99% confidence levels. The red dots indicate the mean of the bootstrap samples for each confidence level.

**Question 3.** Based on the above plot, which of the following statements are true? Put your selections in an array called `myAnswers3`.

1. As the confidence level decreases, the interval gets larger.
2. As the confidence level increases, the interval gets larger.
3. As the confidence level decreases, the margin of error gets smaller.
4. As the confidence level decreases, the margin of error gets larger.



In [None]:
myAnswers3 = ...
myAnswers3

A confidence interval can be represented in two ways:

$[\text{lower bound}, \text{upper bound}]$, or $\text{estimate} \pm \text{margin of error}$

For example, these 2 representations of a confidence interval are equivalent:

$[2,10]$ or $6 \pm 4$
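
The conversion between the two representations is simple arithmetic, sketched below. The function names `to_estimate_and_margin` and `to_bounds` are hypothetical helpers written for this illustration; they are not part of the `datascience` library or this notebook.

```python
def to_estimate_and_margin(lower, upper):
    """Convert [lower, upper] into (estimate, margin of error)."""
    estimate = (lower + upper) / 2   # midpoint of the interval
    margin = (upper - lower) / 2     # half the width of the interval
    return estimate, margin


def to_bounds(estimate, margin):
    """Convert estimate +/- margin into (lower, upper)."""
    return estimate - margin, estimate + margin


print(to_estimate_and_margin(2, 10))  # (6.0, 4.0)
print(to_bounds(6, 4))                # (2, 10)
```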

**Question 4.** Which of the following statements are true? Put your selections in an array called `myAnswers4`.

Using the $\text{estimate} \pm \text{margin of error}$ representation,

1. the estimate increases as the confidence level increases.
2. the estimate decreases as the confidence level increases.
3. the estimate stays the same as the confidence level increases.
4. the margin of error increases as the confidence level increases.
5. the margin of error decreases as the confidence level increases.
6. the margin of error stays the same as the confidence level increases.


In [None]:
myAnswers4 = ...
myAnswers4

#### How is the margin of error affected by sample size?

Run the cell below to visualize the impact of sample size on margin of error

In [None]:
x_ticks = [1,2,3,4]
x_lbls = ['20', '80','320', '1280']
cl = 95
plt.xticks(x_ticks, x_lbls)
plt.title('Confidence Intervals (95%) vs Sample Size')

for i in np.arange(len(x_ticks)):
    samp_size = int(x_lbls[i])
    
    
    results = confidence(votes,cl,samp_size,2500)
    me = round((results[2] - results[0])/2,1)
    print('samp_size = ', samp_size, ', margin of error = ',me)
    plot_ci(x_ticks[i],results)

**Question 5.** Which of the following statements are true? Put your selections in an array called `myAnswers5`.

Using the  $\text{estimate} \pm \text{margin of error}$  representation,

1. the estimate increases as the sample size increases.
2. the estimate decreases as the sample size increases.
3. the estimate stays the same as the sample size increases.
4. the margin of error increases as the sample size increases.
5. the margin of error decreases as the sample size increases.
6. the margin of error stays the same as the sample size increases.

In [None]:
myAnswers5= ...
myAnswers5

**Question 6.** Which of the following statements are true? Put your selections in an array called `myAnswers6`.

Using the  $\text{estimate} \pm \text{margin of error}$  representation,

1. increasing the confidence level will result in a better estimate of the population parameter.
2. decreasing the confidence level will result in a better estimate of the population parameter.
3. increasing the sample size will result in a better estimate of the population parameter.
4. decreasing the sample size will result in a better estimate of the population parameter.
5. the best result would have high confidence and low margin of error.
6. the best result would have low confidence and low margin of error.

In [None]:
myAnswers6 = ...
myAnswers6

**Question 7.** Read in the 'baby.csv' data set and create 2 tables: `smokers` and `nonsmokers`. `smokers` should contain only the rows where 'Maternal Smoker' is True, and `nonsmokers` should contain only the rows where 'Maternal Smoker' is False.




In [None]:
smokers = ...
nonsmokers = ...
smokers.show(5)
nonsmokers.show(5)

#### The function make_boots() creates bootstrap sample means from the column 'column_name' from the table 'tbl'.

For example, 

make_boots(nonsmokers, 'Gestational Days', 500)

will create an array of 500 bootstrap sample means from the column 'Gestational Days' in the table nonsmokers.

In [None]:
def make_boots(tbl, column_name, num_boots = 1000):
    results = []
    for i in np.arange(num_boots):
        
        newBoot = tbl.sample()
        newMean = np.mean(newBoot.column(column_name))
        results = np.append(results,newMean)
    return results



**Question 8.** Create an array called `smoking_weights` that contains 1000 bootstrap sample means of birth weights for smokers using the function `make_boots()`.

In [None]:
smoking_weights = make_boots(smokers,'Birth Weight', 1000)
smoking_weights

**Question 9.** Create an array called `nonsmoking_weights` that contains 1000 bootstrap sample means of birth weights for nonsmokers using the function `make_boots()`.

In [None]:
nonsmoking_weights = ...

**Question 10.** Create a table called baby_weights that has 2 columns:

1. 'smoking' contains the array `smoking_weights`
2. 'nonsmoking' contains the array `nonsmoking_weights`


In [None]:
baby_weights = ...
baby_weights.show(5)

The following cell will create histograms of the bootstrap sample means for smoking and nonsmoking baby weights.

In [None]:
baby_weights.hist()
plt.title('Estimates of Baby Weights')

**Q11.** Do the histograms above provide any evidence that there is a difference in baby weights for moms that smoked and moms that did not smoke? Select all that apply and put your selection in `myAnswers11`.

1. Yes there is a difference, but due to sampling uncertainty, we can't say which group has the higher average weight. 
2. Yes there is a difference, babies born to smoking moms are likely to weigh more.
3. Yes there is a difference, babies born to smoking moms are likely to weigh less.
4. Yes there is a difference, babies born to non smoking moms are likely to weigh more.
5. Yes there is a difference, babies born to non smoking moms are likely to weigh less.


In [None]:
myAnswers11 = ...
myAnswers11

**Q12.** Create a 90% confidence interval for the average weight of babies born to smoking moms (find the middle 90% of the bootstrap means). Put the low and high estimates into an array called smoking90_ci

In [None]:
smoking90_ci =...
smoking90_ci

**Q13.** Create a 90% confidence interval for the average weight of babies born to non-smoking moms (find the middle 90% of the bootstrap means). Put the low and high estimates into an array called nonsmoking90_ci

In [None]:
nonsmoking90_ci = ...
nonsmoking90_ci

#### Is there a difference in the average weight of babies between smoking and non-smoking moms?

Run the cell below to plot the confidence intervals of the 2 groups.

In [None]:
x_ticks = [1,2]
x_lbls = ['smoking', 'non smoking']
plt.xticks(x_ticks, x_lbls)
plt.title('Baby Weights 90% CI')


color='#2187bb'
horizontal_line_width = 0.25
x = x_ticks[0]
left = x - horizontal_line_width / 2
top = smoking90_ci[1]
right = x + horizontal_line_width / 2
bottom = smoking90_ci[0]
plt.plot([x, x], [top, bottom], color=color)
plt.plot([left, right], [top, top], color=color)
plt.plot([left, right], [bottom, bottom], color=color)

x = x_ticks[1]
left = x - horizontal_line_width / 2
top = nonsmoking90_ci[1]
right = x + horizontal_line_width / 2
bottom = nonsmoking90_ci[0]
plt.plot([x, x], [top, bottom], color=color)
plt.plot([left, right], [top, top], color=color)
plt.plot([left, right], [bottom, bottom], color=color)


**Q14.** Do the confidence intervals above provide any evidence that there is a difference in baby weights for moms that smoked and moms that did not smoke? Select all that apply and put your selection in `myAnswers14`.

1. Yes there is a difference, but because the confidence level is only 90%, we can't say which group has the higher average weight. 
2. Maybe there is a difference, but we should increase the confidence level to be more certain.
3. Yes there is a difference, babies born to smoking moms are likely to weigh more.
4. Yes there is a difference, babies born to smoking moms are likely to weigh less.
5. Because the lowest estimate for babies born to non smoking moms is higher than the highest estimate for babies born to smoking moms, the average weight of babies born to non smoking moms is likely to be higher.
6. A baby born to a smoking mom will not weigh more than about 115 ounces

In [None]:
myAnswers14 = ...

#### Is there another way to do this? 

So far, we have used bootstrap samples to investigate sampling variation and create confidence intervals.

However, it turns out that sample averages, once standardized by their estimated standard error, tend to follow a **t-distribution.** Remember that the shape of a t-distribution is the **same as long as the sample size is the same.** In other words, the distribution of average baby weights across many samples follows a known distribution. So, we can use this idea to find the middle 90% (for a 90% confidence level) of the t-distribution.

Python can do this easily for us. We need the `t.interval` function in the package `scipy.stats`.

The `t.interval` function needs 4 arguments: the confidence level, the degrees of freedom (sample size - 1), the sample mean, and the standard error of the mean (passed as `scale`). If `scale` is left out, it defaults to 1 and the interval will not reflect the spread of the data.

To create a 95% confidence interval for a population mean:

        import scipy.stats as stats

        stats.t.interval(0.95, df=len(data)-1, loc=np.mean(data), scale=stats.sem(data))
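
For reference, the standard textbook form of a t-based interval for a mean can be written as below. The symbols are not used elsewhere in this notebook and are introduced here only for illustration: $\bar{x}$ is the sample mean, $s$ is the sample standard deviation, $n$ is the sample size, and $t^{*}_{n-1}$ is the cutoff of the t-distribution with $n-1$ degrees of freedom for the chosen confidence level.

```latex
\bar{x} \;\pm\; t^{*}_{n-1} \cdot \frac{s}{\sqrt{n}}
```

The quantity $s/\sqrt{n}$ is the standard error of the mean, which is why the interval narrows as the sample size grows.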

First, we'll import the package we need

In [None]:
import scipy.stats as stats

#### Run the cell below to create a 90% confidence interval for the weights of smoking moms' babies

In [None]:
data = smokers.column('Birth Weight')
df = len(data)-1
data_mean = np.mean(data)
cl = 0.9
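# scale must be the standard error of the mean (it defaults to 1 if omitted);
# the confidence level is passed positionally because its keyword name differs across SciPy versions
ci = stats.t.interval(cl, df=df, loc=data_mean, scale=stats.sem(data))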

print(ci)
ci_lo = round(ci[0],1)
ci_hi = round(ci[1],1)

print('The average weight of babies born to smoking moms is estimated to be between ', ci_lo,' and', ci_hi,' ounces' )
print(cl*100, '% confidence')

#### Is there a difference in gestational days between smoking and non-smoking moms?

#### Question 15

Create a 95% confidence interval for the gestational days of smoking moms. Name the interval `gd_smoking_95`.


In [None]:
gd_smoking_95 =...
gd_smoking_95

In [None]:
ci_lo = round(gd_smoking_95[0],1)
ci_hi = round(gd_smoking_95[1],1)

print('The gestational period for smoking moms is estimated to be between ', ci_lo,' and', ci_hi,' days' )
print(95, '% confidence')  # this interval is 95%; cl still holds 0.9 from the earlier cell

#### Question 16

Create a 95% confidence interval for the gestational days of non smoking moms. Name the interval `gd_nonsmoking_95`.

In [None]:
gd_nonsmoking_95 = ...

In [None]:
ci_lo = round(gd_nonsmoking_95[0],1)
ci_hi = round(gd_nonsmoking_95[1],1)

print('The gestational period for non smoking moms is estimated to be between ', ci_lo,' and', ci_hi,' days' )
print(95, '% confidence')  # this interval is 95%; cl still holds 0.9 from the earlier cell

#### Run the cell below to visualize the confidence intervals for gestational days.

In [None]:
x_ticks = [1,2]
x_lbls = ['smoking', 'non smoking']
plt.xticks(x_ticks, x_lbls)
plt.title('Gestational Days 95% CI')


color='#2187bb'
horizontal_line_width = 0.25
x = x_ticks[0]
left = x - horizontal_line_width / 2
top = gd_smoking_95[1]
right = x + horizontal_line_width / 2
bottom = gd_smoking_95[0]
plt.plot([x, x], [top, bottom], color=color)
plt.plot([left, right], [top, top], color=color)
plt.plot([left, right], [bottom, bottom], color=color)

x = x_ticks[1]
left = x - horizontal_line_width / 2
top = gd_nonsmoking_95[1]
right = x + horizontal_line_width / 2
bottom = gd_nonsmoking_95[0]
plt.plot([x, x], [top, bottom], color=color)
plt.plot([left, right], [top, top], color=color)
plt.plot([left, right], [bottom, bottom], color=color)

**Q17.** Based on your confidence intervals, is there evidence that the average gestational period for smoking moms is less than that for non smoking moms? Select all that apply and put your selection in `myAnswers17`.

1. Yes, because smoking moms could have a gestational period as low as about 276 days.
2. Yes, because non smoking moms could have a gestational period as high as about 282 days.
3. No, because smoking and nonsmoking moms could both have a gestational period of 279 days.
4. No, because there is a range of days which are the same for both smoking and non smoking moms.

In [None]:
myAnswers17 = ...
myAnswers17

### Congrats!!! You finished homework 8.

In [None]:
print("Q1: [{:2.1f}, {:2.1f}]".format(imm_lower_bound, imm_upper_bound))
print('myAnswers2 = ', myAnswers2)
print('myAnswers3 = ', myAnswers3)
print('myAnswers4 = ', myAnswers4)
print('myAnswers5 = ', myAnswers5)
print('myAnswers6 = ', myAnswers6)
print('myAnswers11 = ', myAnswers11)
print('Q12: confidence interval', smoking90_ci)
print('Q13: confidence interval', nonsmoking90_ci)
print('myAnswers14 = ', myAnswers14)
print('Q15: confidence interval', gd_smoking_95)
print('Q16: confidence interval', gd_nonsmoking_95)
print('myAnswers17 = ', myAnswers17)