# Lambda School Data Science - Unit 1 Sprint 2 Module 3

---

## Assignment: Sampling and Confidence Intervals

### Objectives

* Objective 01 - explain the concepts of statistical estimate, precision, and standard error as they apply to inferential statistics
* Objective 02 - explain the implications of the central limit theorem in inferential statistics
* Objective 03 - explain the purpose of and identify applications for confidence intervals; demonstrate how to build a confidence interval around a sample estimate
* Objective 04 - visualize a confidence interval in order to communicate the precision of sample estimates

## Confidence Intervals

Soft drinks like Coke and Pepsi are manufactured to have a standard caffeine content. For example, a 12-oz serving of Coke has 34mg of caffeine, and a 12-oz serving of Pepsi has 37.6mg of caffeine. However, fountain soft drinks are typically mixed in individual restaurant dispensers, so it is more difficult to maintain a standard level of caffeine per serving. In this study, researchers randomly sampled Coke, Diet Coke, Pepsi, and Diet Pepsi at a set of franchise restaurants and measured the caffeine content in 12oz of each soft drink. The data is found in the Soda.xlsx dataset.

Because individuals can be sensitive to caffeine – and because the manufacturers are interested in product consistency – we wish to estimate the mean caffeine content in 12oz of Coke served in franchise restaurants using a 95% confidence interval. 

You can find the Coke data here: 'https://raw.githubusercontent.com/Chelsea-Myers/Lambda-Intro/master/Coke.csv'

The first variable is the sample ID and the second variable is the caffeine content in the 12-oz sample measured in mg.

Source: A.N. Garand and L.N. Bell (1997). "Caffeine Content of Fountain and Private-Label Store Brand Carbonated Beverages," Journal of the American Dietetic Association, Vol. 97, #2, pp. 179-182.


### 1) Load the dataset and print the first few rows.

* name your DataFrame `coke_df`
* set `skipinitialspace=True`
* set `header=0`

In [None]:
# Import your libraries and load the data
import pandas as pd

coke_df = pd.read_csv(
    'https://raw.githubusercontent.com/Chelsea-Myers/Lambda-Intro/master/Coke.csv',
    skipinitialspace=True,
    header=0,
)

In [None]:
# Take a look at your data
coke_df.head()

Unnamed: 0,Drink,Caffeine
0,1,47.32
1,2,43.78
2,3,48.12
3,4,43.25
4,5,46.42


In [None]:
# This is an ANSWER CHECK cell.
# Don't alter this cell if you want accurate feedback
#------------------------------------------------------------------------------#

# check if the DataFrame was created
assert not coke_df.empty, 'Make sure the df name is accurate and you loaded the correct URL.'
# check the shape of the DataFrame
assert coke_df.shape == (50, 2), 'Is your data loaded with the correct argument?'
print('Correct! Continue to the next question')

Correct! Continue to the next question


### 2) Calculate the mean, standard deviation (SD), and standard error (SE) for the caffeine content, and n for the sample size.

Label your variables as follows:

* `mean_caffeine`
* `sd_caffeine`
* `n_caffeine`
* `se_caffeine`

In [None]:
mean_caffeine = coke_df['Caffeine'].mean()
sd_caffeine = coke_df['Caffeine'].std()   # sample SD (pandas uses ddof=1)
n_caffeine = coke_df['Caffeine'].count()
se_caffeine = sd_caffeine / (n_caffeine ** 0.5)
print(mean_caffeine)
print(sd_caffeine)
print(n_caffeine)
print(se_caffeine)

37.9402
5.243756828216712
50
0.7415792024250598


In [None]:
# This is an ANSWER CHECK cell.
# Don't alter this cell if you want accurate feedback
#------------------------------------------------------------------------------#

# check statistics calculations
assert round(mean_caffeine) == 38, 'Check your mean value'
assert round(sd_caffeine) == 5, 'Check your standard deviation value'
assert n_caffeine == 50, 'Check the sample number'
assert round(se_caffeine, 2) == 0.74, 'Check the standard error value'
print('Correct! Continue to the next question')

Correct! Continue to the next question


### Summarize your results from above in a sentence or two.

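The sample mean caffeine content is 37.94 mg, with a sample standard deviation of 5.24 mg. With n = 50 samples, the standard error of the mean is 0.74 mg.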
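As a cross-check on the standard error formula (SD divided by the square root of n), here is a minimal sketch that compares the hand calculation with `scipy.stats.sem`. It uses only the five readings shown in the `coke_df.head()` output above, so the resulting number is for illustration and is not the standard error of the full sample.

```python
import numpy as np
from scipy.stats import sem

# First five caffeine readings shown in the coke_df.head() output above
values = np.array([47.32, 43.78, 48.12, 43.25, 46.42])

# Standard error by hand: sample SD (ddof=1) divided by sqrt(n)
se_manual = values.std(ddof=1) / np.sqrt(len(values))

# scipy.stats.sem uses ddof=1 by default, so it should agree
se_scipy = sem(values)

print(se_manual, se_scipy)
```

Note that NumPy's `std` defaults to `ddof=0`, while pandas' `Series.std` defaults to `ddof=1`, which is why `ddof=1` is passed explicitly here.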

### 3) Find t* for a 95% confidence interval.

Use the starter code below and *fill in the degrees of freedom*. The `t_star` variable has been created for you.

In [None]:
# Import the stats library
from scipy.stats import t

#Don't worry too much about where the 0.975 comes from.  It has to do
#with wanting to determine the *middle* 95% of the t-distribution
#We're going to learn how to calculate a 95% CI this easy way in just a minute.

#Hint: Recall that n = 223 for the body temp problem. What was the dof for that problem?

### your code here: fill in the degrees of freedom ###

t_star = t.ppf(0.975, df=n_caffeine-1)
print('t_star =', t_star)

t_star = 2.009575234489209


In [None]:
# This is an ANSWER CHECK cell.
# Don't alter this cell if you want accurate feedback
#------------------------------------------------------------------------------#

# check statistics calculations
assert round(t_star) == 2, 'Check the dof you entered!'
print('Correct! Continue to the next question')

Correct! Continue to the next question
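For readers curious about the 0.975 in the starter code: a 95% interval covers the middle of the t-distribution, leaving 2.5% in each tail, so the upper cutoff sits at the 97.5th percentile. The sketch below illustrates this, assuming df = 49 (n = 50 minus 1); the names `confidence`, `alpha`, and `middle_area` are illustrative and not part of the assignment.

```python
from scipy.stats import t

confidence = 0.95
alpha = 1 - confidence          # total area left in the two tails
df = 49                         # n - 1 for the 50 Coke samples

# Upper cutoff: the percentile that leaves alpha/2 in the right tail
t_star = t.ppf(1 - alpha / 2, df=df)

# Area between -t_star and +t_star should equal the confidence level
middle_area = t.cdf(t_star, df=df) - t.cdf(-t_star, df=df)

print('t_star =', t_star)
print('middle area =', middle_area)
```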


### 4) Calculate the margin of error for a 95% confidence interval for the mean caffeine content in a 12-oz Coke. Name your variable `m_err`.



In [None]:
t_star = t.ppf(0.975, df=n_caffeine - 1)
m_err = t_star * se_caffeine
m_err

1.49025919960566

In [None]:
# This is an ANSWER CHECK cell.
# Don't alter this cell if you want accurate feedback
#------------------------------------------------------------------------------#

# check margin of error calculation
assert round(m_err, 2) == 1.49, 'Did you multiply m_err by the correct value?'
print('Correct! Continue to the next question')

Correct! Continue to the next question


### State the margin of error answer with the correct units. (example: The margin of error is 5 pounds per bag of cat food).

The margin of error is about 1.49 mg of caffeine per 12-oz serving of fountain Coke.

### 5) Calculate a 95% CI for the mean caffeine content in a 12-oz fountain Coke with the CI formula using the summary statistics and t* that you calculated above.

Name your variables as follows:

* lower confidence limit: `lower_CL`
* upper confidence limit: `upper_CL`

In [None]:
lower_CL = mean_caffeine - (t_star * se_caffeine)
upper_CL = mean_caffeine + (t_star * se_caffeine)

In [None]:
# This is an ANSWER CHECK cell.
# Don't alter this cell if you want accurate feedback
#------------------------------------------------------------------------------#

# check confidence level calculations
assert round(lower_CL, 2) == 36.45, 'Check your lower CL calculation.'
assert round(upper_CL, 2) == 39.43, 'Check your upper CL calculation.'
print('Correct! Continue to the next question')

Correct! Continue to the next question
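Written out as a worked equation, using the values printed in the earlier cells (rounded to four decimal places), the interval calculation is:

$$
\bar{x} \pm t^{*} \cdot \frac{s}{\sqrt{n}} = 37.9402 \pm 2.0096 \times 0.7416 \approx 37.9402 \pm 1.4903 = (36.4499,\ 39.4305)
$$

Here $\bar{x}$ is the sample mean, $s$ the sample standard deviation, $n$ the sample size, and $t^{*}$ the critical value with $n - 1 = 49$ degrees of freedom.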


### 6) Calculate a **95% confidence interval** for the mean caffeine content in a 12-oz fountain Coke using the t-interval function in Python. Name your variable `t_int_95`.

In [None]:
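# Pass the confidence level positionally: newer SciPy releases renamed the `alpha` keyword to `confidence`
t_int_95 = t.interval(0.95, df=n_caffeine - 1, loc=mean_caffeine, scale=se_caffeine)
t_int_95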

(36.44994080039434, 39.43045919960566)

In [None]:
# This is an ANSWER CHECK cell.
# Don't alter this cell if you want accurate feedback
#------------------------------------------------------------------------------#

# check confidence level calculations
assert round(t_int_95[0], 2) == 36.45, 'Check your interval calculation.'
assert round(t_int_95[1], 2) == 39.43, 'Check your interval calculation.'
print('Correct! Continue to the next question')

Correct! Continue to the next question


### 7) Compare the two confidence intervals you calculated. Do they match? Should they?

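They match, and they should: `t.interval` computes mean ± t* × SE from the same mean, standard error, and degrees of freedom used in the manual calculation.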
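One way to confirm the match numerically rather than by eye is sketched below. It hard-codes the summary statistics printed earlier in the notebook so that it runs on its own; `manual_ci` and `scipy_ci` are illustrative names.

```python
import math
from scipy.stats import t

# Summary statistics copied from the earlier cell output
mean_caffeine = 37.9402
se_caffeine = 0.7415792024250598
df = 49

# Manual interval: mean +/- t* x SE
t_star = t.ppf(0.975, df=df)
manual_ci = (mean_caffeine - t_star * se_caffeine,
             mean_caffeine + t_star * se_caffeine)

# Same interval from scipy (confidence level passed positionally)
scipy_ci = t.interval(0.95, df=df, loc=mean_caffeine, scale=se_caffeine)

# Compare endpoints with a floating-point tolerance
intervals_match = all(math.isclose(a, b) for a, b in zip(manual_ci, scipy_ci))
print(manual_ci, scipy_ci, intervals_match)
```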

### 8) Interpret the meaning of the 95% confidence interval for the mean caffeine content in a 12-oz fountain Coke in a sentence or two.

We are 95% confident that the population mean caffeine content in a 12-oz serving of fountain Coke is between 36.45 mg and 39.43 mg.
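The phrase "95% confident" describes the procedure: if we repeated the sampling many times, about 95% of the intervals built this way would contain the true mean. The simulation below is a rough sketch of that idea. It assumes a hypothetical normal population with mean 38 and SD 5 (made-up values, not estimates from the study) and a fixed random seed.

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(0)

# Hypothetical population parameters, chosen only for illustration
true_mean, true_sd = 38.0, 5.0
n, n_reps = 50, 2000

# Draw n_reps independent samples of size n (one sample per row)
samples = rng.normal(true_mean, true_sd, size=(n_reps, n))

# Build a 95% t-interval from each sample
means = samples.mean(axis=1)
ses = samples.std(axis=1, ddof=1) / np.sqrt(n)
t_star = t.ppf(0.975, df=n - 1)
lower = means - t_star * ses
upper = means + t_star * ses

# Fraction of intervals that contain the true mean
coverage = np.mean((lower <= true_mean) & (true_mean <= upper))
print('coverage =', coverage)
```

The printed coverage should land close to 0.95, with some run-to-run variation if the seed is changed.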

### 9) Using the t-interval Python function, calculate a **90% confidence interval** for the mean caffeine content in a 12-oz Coke. Name your variable `t_int_90` (make sure to use `90` at the end!).


In [None]:
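# Confidence level passed positionally (the `alpha` keyword was renamed to `confidence` in newer SciPy)
t_int_90 = t.interval(0.90, df=n_caffeine - 1, loc=mean_caffeine, scale=se_caffeine)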

In [None]:
# This is an ANSWER CHECK cell.
# Don't alter this cell if you want accurate feedback
#------------------------------------------------------------------------------#

# check confidence level calculations
assert round(t_int_90[0], 2) == 36.70, 'Check your interval calculation.'
assert round(t_int_90[1], 2) == 39.18, 'Check your interval calculation.'
print('Correct! Continue to the next question')

Correct! Continue to the next question


### Is this estimate *more accurate* or *more precise* (pick one) than the 95% confidence interval?


The 90% interval is more precise than the 95% interval: it is narrower (36.70 to 39.18 mg versus 36.45 to 39.43 mg), at the cost of a lower confidence level.

### 10) Using the t-interval Python function, calculate a **99% confidence interval** for the mean caffeine content in a 12-oz Coke. Name your variable `t_int_99` (make sure to use `99` at the end!).




In [None]:
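# Confidence level passed positionally (the `alpha` keyword was renamed to `confidence` in newer SciPy)
t_int_99 = t.interval(0.99, df=n_caffeine - 1, loc=mean_caffeine, scale=se_caffeine)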

In [None]:
# This is an ANSWER CHECK cell.
# Don't alter this cell if you want accurate feedback
#------------------------------------------------------------------------------#

# check confidence level calculations
assert round(t_int_99[0], 2) == 35.95, 'Check your interval calculation.'
assert round(t_int_99[1], 2) == 39.93, 'Check your interval calculation.'
print('Correct! Continue to the next question')

Correct! Continue to the next question


### Is this estimate more *accurate* or more *precise* (pick one) than the 95% confidence interval?

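The 99% interval is more accurate than the 95% interval: it is wider (35.95 to 39.93 mg versus 36.45 to 39.43 mg), so it is more likely to capture the true mean, but it is less precise.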
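The trade-off between confidence level and precision can be seen by comparing interval widths directly. The sketch below uses the summary statistics printed earlier in the notebook; the `widths` dictionary is an illustrative name.

```python
from scipy.stats import t

# Summary statistics copied from the earlier cell output
mean_caffeine = 37.9402
se_caffeine = 0.7415792024250598
df = 49

# Width of the t-interval at each confidence level
widths = {}
for level in (0.90, 0.95, 0.99):
    low, high = t.interval(level, df=df, loc=mean_caffeine, scale=se_caffeine)
    widths[level] = high - low
    print(f'{level:.0%} CI: ({low:.2f}, {high:.2f}), width = {widths[level]:.2f} mg')
```

A higher confidence level always produces a wider interval from the same data, so gaining confidence means giving up precision.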

## Stretch goals:

### 1) The correspondence between confidence intervals and hypothesis tests.

Read [this](https://blog.minitab.com/blog/adventures-in-statistics-2/understanding-hypothesis-tests-confidence-intervals-and-confidence-levels#:~:text=If%20a%20hypothesis%20test%20produces,corresponding%20confidence%20level%20is%2095%25.&text=If%20the%20confidence%20interval%20does,the%20results%20are%20statistically%20significant.) article about the correspondence between confidence intervals and hypothesis tests. Feel free to read the whole article, but the relevant part can be found under the heading "Why P Values and Confidence Intervals Always Agree About Statistical Significance".

Imagine you work for quality control at Coke and are tasked with making sure that the caffeine content in the fountain beverages served in restaurants is the same as in a 12-oz can of Coke (34 mg). If you believe that the mean caffeine content in fountain Coke is not 34 mg, you must re-train the franchise managers to make sure the Coke served has the correct caffeine level.

Based on the confidence interval you calculated in the assignment, do you believe that the mean caffeine content is statistically significantly different from 34 mg in a 12-oz serving?


Because 34 mg is not within the bounds of the 95% confidence interval, we can reject the null hypothesis (at the 5% significance level) that the mean caffeine content in 12 oz of fountain Coke is equal to 34 mg. Instead, we conclude it is between about 36.45 and 39.43 mg.
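The same conclusion can be reached with a two-sided one-sample t-test computed from the summary statistics. This is a sketch under the assumption that the null value is 34 mg and the significance level is 0.05; `mu_0`, `t_stat`, and `p_value` are illustrative names.

```python
from scipy.stats import t

# Summary statistics copied from the earlier cell output
mean_caffeine = 37.9402
se_caffeine = 0.7415792024250598
df = 49

mu_0 = 34.0  # null-hypothesis mean: caffeine in a 12-oz can of Coke (mg)

# Test statistic: how many standard errors the sample mean is from mu_0
t_stat = (mean_caffeine - mu_0) / se_caffeine

# Two-sided p-value from the t-distribution's survival function
p_value = 2 * t.sf(abs(t_stat), df=df)

print('t =', t_stat)
print('p =', p_value)
```

A p-value below 0.05 agrees with the observation that 34 mg falls outside the 95% confidence interval.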

### 2) If we increased the sample size from 50 to 100 but the sample mean and SD remained the same, describe **two** ways the margin of error would change. Would the margin of error become smaller or larger?

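Both t* and the standard error would change. With n = 100 the degrees of freedom rise from 49 to 99, so t* decreases slightly, and the standard error (SD divided by the square root of n) decreases because n is larger.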

In [None]:
t_star = t.ppf(0.975,df=99)


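Because t* and the standard error both decrease, the margin of error becomes smaller: t* drops from about 2.01 to about 1.98 and the standard error from about 0.74 mg to about 0.52 mg, giving a margin of error of roughly 1.04 mg instead of 1.49 mg.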
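To see the size of the effect, the sketch below recomputes the margin of error for n = 50 and n = 100 while holding the SD fixed at the value printed earlier. The helper `margin_of_error` is a hypothetical function written for this illustration.

```python
import math
from scipy.stats import t

sd_caffeine = 5.243756828216712  # sample SD copied from the earlier cell output


def margin_of_error(sd, n, confidence=0.95):
    """Margin of error t* x SD / sqrt(n) for a two-sided t-interval."""
    t_star = t.ppf(1 - (1 - confidence) / 2, df=n - 1)
    return t_star * sd / math.sqrt(n)


m_err_50 = margin_of_error(sd_caffeine, 50)
m_err_100 = margin_of_error(sd_caffeine, 100)

print('n = 50 :', m_err_50)
print('n = 100:', m_err_100)
```

Most of the reduction comes from the standard error shrinking by a factor of about 1.41 (the square root of 2); the change in t* is comparatively small.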
