## Autograded Notebook (Canvas & CodeGrade)

This notebook will be automatically graded. It is designed to test your answers and award points for the correct ones. Follow the instructions for each Task carefully.

### Instructions

* **Download this notebook** as you would any other ipynb file
* **Upload** to Google Colab or work locally (if you have that set up)
* **Delete `raise NotImplementedError()`**
* Write your code in the `# YOUR CODE HERE` space
* **Execute** the Test cells that contain `assert` statements - these help you check your work (others contain hidden tests that will be checked when you submit through Canvas)
* **Save** your notebook when you are finished
* **Download** as an `ipynb` file (if working in Colab)
* **Upload** your complete notebook to Canvas (there will be additional instructions in Slack and/or Canvas)

# Lambda School Data Science - Unit 1 Sprint 2 Module 3

## Module Project: Sampling and Confidence Intervals

### Objectives

* Objective 01 - explain the concepts of statistical estimate, precision, and standard error as they apply to inferential statistics
* Objective 02 - explain the implications of the central limit theorem in inferential statistics
* Objective 03 - explain the purpose of and identify applications for confidence intervals
* Objective 04 - demonstrate how to build a confidence interval around a sample estimate


#### Total notebook points: 9

## Introduction

### Confidence Intervals

Soft drinks like Coke and Pepsi are manufactured to have a standard caffeine content. For example, a 12-oz serving of Coke has 34mg of caffeine, and a 12-oz serving of Pepsi has 37.6mg of caffeine. However, fountain soft drinks are typically mixed in individual restaurant dispensers, so it is more difficult to maintain a standard level of caffeine per serving. 

In this study, researchers randomly sampled Coke, Diet Coke, Pepsi, and Diet Pepsi at a set of franchise restaurants and measured the caffeine content in 12oz of each soft drink.

Because individuals can be sensitive to caffeine – and because the manufacturers are interested in product consistency – **we wish to estimate the mean caffeine content in 12oz of Coke served in franchise restaurants using a 95% confidence interval.** 


### Data set

The data set for Coke is available at the link provided below. The first variable is the sample ID and the second variable is the caffeine content (in mg) in the 12oz sample.

*Source: A.N. Garand and L.N. Bell (1997). "Caffeine Content of Fountain and Private-Label Store Brand Carbonated Beverages," Journal of the American Dietetic Association, Vol. 97, #2, pp. 179-182.*

**Task 1** - Load the data

Load the dataset using the provided URL

* Read in your data as a pandas DataFrame with the variable name `coke_df`
* Use the `.head()` method to take a look at the DataFrame

In [9]:
# Task 1

# Imports
import pandas as pd
import numpy as np

# URL for the data
data_url = 'https://raw.githubusercontent.com/LambdaSchool/data-science-practice-datasets/main/unit_1/Soda/Soda.csv'

# YOUR CODE HERE
# raise NotImplementedError()
coke_df = pd.read_csv(data_url)

# Look at your DataFrame
print(coke_df.shape)
coke_df.head()

(50, 2)


|   | Drink | Caffeine |
|---|-------|----------|
| 0 | 1 | 47.32 |
| 1 | 2 | 43.78 |
| 2 | 3 | 48.12 |
| 3 | 4 | 43.25 |
| 4 | 5 | 46.42 |


**Task 1 - Test**

In [None]:
# Task 1 Test

assert isinstance(coke_df, pd.DataFrame), 'Have you created a DataFrame named `coke_df`?'


**Task 2** - Descriptive statistics

Calculate the following statistical quantities for the `Caffeine` content column. Name your variables as indicated (they need to be an exact match to pass the tests).  Summarize your results in a sentence or two.

* mean - `mean_caffeine`
* standard deviation - `std_caffeine`
* standard error - `se_caffeine`
* number of samples - `n_caffeine`
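
The standard error of the mean is the sample standard deviation divided by the square root of the sample size. The sketch below checks that relationship against the pandas `.sem()` method. The values in `sample` are hypothetical and are not taken from the Coke data set; the names `se_manual` and `se_pandas` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical caffeine measurements (mg); NOT the Coke data set
sample = pd.Series([47.3, 43.8, 48.1, 43.3, 46.4, 39.9, 41.2, 44.0])

n = sample.count()               # number of non-missing observations
std = sample.std()               # sample standard deviation (ddof=1 by default)
se_manual = std / np.sqrt(n)     # standard error of the mean, computed by hand
se_pandas = sample.sem()         # pandas computes the same quantity

print(se_manual, se_pandas)
```

Both printed values should agree, because `.std()` and `.sem()` both default to `ddof=1`.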


In [6]:
# Task 2

# Name your variables exactly as specified in the task

# YOUR CODE HERE
# raise NotImplementedError()
mean_caffeine = coke_df['Caffeine'].mean()
std_caffeine = coke_df['Caffeine'].std()
se_caffeine = coke_df['Caffeine'].sem()
n_caffeine = coke_df['Caffeine'].count()

print(mean_caffeine, std_caffeine, se_caffeine, n_caffeine)

37.940200000000004 5.2437568282167115 0.7415792024250597 50


**Task 2 - Test**

In [10]:
# Task 2 Test

assert n_caffeine == 50, 'Did you correctly calculate the number of samples?'


**Task 2** - ANSWER

Using the statistics you calculated above, write out your answer in words. Use the following format:

*Example: The mean caffeine content is XXmg per 12oz serving with a standard error of XXmg. The sample size is XX.*

This task will not be autograded - but it is part of completing the project.

YOUR ANSWER: The mean caffeine content is 37.94mg per 12oz serving with a standard error of 0.74mg. The sample size is 50.

**Task 3** - Calculate t*

For this task you will calculate t* for a 95% confidence interval.

* set the variable `deg_free` equal to the degrees of freedom for the `Caffeine` variable
* set the variable `t_star` equal to t* using `t.ppf(q, df)` with `q=0.975` and `df = deg_free`

Note: Don't worry too much about where the 0.975 value comes from. A 95% interval leaves 2.5% in each tail of the t-distribution, so the upper cutoff sits at the 0.975 quantile. We're going to learn an easier way to calculate the 95% confidence interval in Task 6.
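
For readers who do want to see where 0.975 comes from, here is a minimal sketch. It assumes 49 degrees of freedom (50 samples minus 1); the names `confidence`, `tail`, `t_upper`, and `t_lower` are illustrative and not part of the graded variables.

```python
from scipy.stats import t

confidence = 0.95
tail = (1 - confidence) / 2      # area left in each tail of the distribution
q = 1 - tail                     # quantile of the upper cutoff

df = 49                          # assumed: 50 samples minus 1
t_upper = t.ppf(q, df)           # upper critical value
t_lower = t.ppf(tail, df)        # lower critical value (mirror image of the upper one)

print(q, t_upper, t_lower)
```

Because the t-distribution is symmetric around zero, the lower cutoff is just the negative of the upper one, which is why a single t* value is enough.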

In [11]:
# Task 3

from scipy.stats import t

# YOUR CODE HERE
# raise NotImplementedError()
deg_free = n_caffeine - 1
t_star = t.ppf(q=0.975, df=deg_free)

# View your answer
print('t_star =', t_star)

t_star = 2.009575234489209


**Task 3 - Test**

In [None]:
# Task 3 Test
# Hidden tests - you will see the results when you submit to Canvas

**Task 4** - Margin of error

In this task you'll calculate the margin of error for a 95% confidence interval (CI) for the mean caffeine content in a 12-oz Coke.

* Assign the margin of error for a 95% CI to the variable `margin_err`

Hint: You already have the value for t* for a 95% CI and the standard error
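
Written as a formula, the hint amounts to the following. The symbols $E$ (margin of error), $s$ (sample standard deviation), and $n$ (sample size) are notation introduced here for illustration; the notebook itself only names the Python variables.

$$
E = t^{*} \cdot SE = t^{*} \cdot \frac{s}{\sqrt{n}}
$$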

In [12]:
# Task 4

# YOUR CODE HERE
# raise NotImplementedError()
margin_err = t_star * se_caffeine

# View your answer
print('Margin of error = ', margin_err)

Margin of error =  1.4902591996056598


**Task 4 - Test**

In [None]:
# Task 4 Test
# Hidden tests - you will see the results when you submit to Canvas

**Task 4** - ANSWER

Using the margin of error you calculated above, write out your answer in words. Use the following format:

*Example: The margin of error is XXmg of caffeine per 12oz serving*

This task will not be autograded - but it is part of completing the project.

YOUR ANSWER: The margin of error is 1.49mg of caffeine per 12oz serving

**Task 5** - Calculate a confidence interval

For this task, you are going to calculate a 95% CI for the mean caffeine content in a 12-oz fountain Coke with the CI formula using the summary statistics and t* that you calculated above.

* Calculate the lower confidence limit and assign it to `lower_cl`
* Calculate the upper confidence limit and assign it to `upper_cl`
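
The whole calculation (mean, standard error, t*, margin of error, limits) can also be wrapped in one helper. This is a sketch under stated assumptions, not the graded solution: the function name `mean_confidence_interval` is hypothetical, and the input values are made up rather than taken from the Coke data.

```python
import numpy as np
from scipy.stats import t

def mean_confidence_interval(values, confidence=0.95):
    """Return (lower, upper) limits of a t-based confidence interval for the mean."""
    values = np.asarray(values, dtype=float)
    n = values.size
    mean = values.mean()
    se = values.std(ddof=1) / np.sqrt(n)              # standard error of the mean
    t_star = t.ppf(1 - (1 - confidence) / 2, n - 1)   # critical value for n - 1 df
    margin = t_star * se                              # margin of error
    return mean - margin, mean + margin

# Hypothetical measurements (mg); NOT the Coke data set
lower, upper = mean_confidence_interval([47.3, 43.8, 48.1, 43.3, 46.4])
print(lower, upper)
```

The interval is always centered on the sample mean, so the midpoint of the two limits recovers the mean.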

In [13]:
# Task 5

# YOUR CODE HERE
# raise NotImplementedError()
lower_cl = mean_caffeine - margin_err
upper_cl = mean_caffeine + margin_err

# View your answers
print('Lower confidence limit =', lower_cl)
print('Upper confidence limit =', upper_cl)

Lower confidence limit = 36.449940800394344
Upper confidence limit = 39.430459199605664


**Task 5 - Test**

In [None]:
# Task 5 Test
# Hidden tests - you will see the results when you submit to Canvas

**Task 5** - ANSWER

Write out the confidence interval you just calculated. Use the following format:

*Example: We are 95% confident that the true mean of the caffeine content is between [lower CL, upper CL]*

This task will not be autograded - but it is part of completing the project.

YOUR ANSWER: We are 95% confident that the true mean of the caffeine content is between [36.45, 39.43].

**Task 6** - 95% confidence interval using t-interval

As promised in Task 3, we're going to calculate the confidence interval the easy way. We'll use the `t.interval()` function to calculate the 95% confidence interval.

* Assign the confidence interval to `t_int_95`
* `alpha` should be set equal to the confidence level (SciPy 1.9 and later name this argument `confidence`)
* `df` is the degrees of freedom
* `loc` is the sample mean
* `scale` is the standard error of the mean
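
A minimal sketch of the call, using the summary statistics printed in Task 2. The confidence level is passed positionally here, which avoids depending on whether the installed SciPy names that argument `alpha` or `confidence`. The names `lower` and `upper` are illustrative.

```python
from scipy.stats import t

# Summary statistics copied from the Task 2 output
mean_caffeine = 37.940200000000004
se_caffeine = 0.7415792024250597
deg_free = 49                      # 50 samples minus 1

# First positional argument is the confidence level
lower, upper = t.interval(0.95, df=deg_free, loc=mean_caffeine, scale=se_caffeine)
print(lower, upper)
```

The two limits should match the values from the manual calculation in Task 5.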

In [14]:
# Task 6

# YOUR CODE HERE
# raise NotImplementedError()
# Confidence level passed positionally: the keyword is `alpha` in older SciPy and `confidence` in newer releases
t_int_95 = t.interval(0.95, df=deg_free, loc=mean_caffeine, scale=se_caffeine)

# View your answer
print(t_int_95)

(36.449940800394344, 39.430459199605664)


**Task 6 - Test**

In [None]:
# Task 6 Test

# Hidden tests - you will see the results when you submit to Canvas

**Task 7** - Compare and interpret confidence intervals

(This part is not graded and is practice for writing out your results.)

Q1 - In this task, you are going to do your own test. Look at the two confidence intervals you calculated; are they equal? Should they be equal?

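ANSWER: Yes, they are equal, and they should be: both methods compute the sample mean plus or minus t* times the standard error.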

Q2 - Interpret the meaning of the 95% confidence interval for the mean caffeine content in the 12oz fountain Coke in a sentence or two.

ANSWER: We are 95% confident that the population mean of caffeine content in the 12oz fountain Coke is between 36.45 and 39.43 mg.
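
One way to read "95% confident" is that, if the sampling were repeated many times, about 95% of the intervals built this way would contain the true mean. The simulation below is a rough sketch of that idea. The population parameters (`true_mean`, `true_sd`) are hypothetical and the normal population is an assumption; none of this comes from the study.

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(42)

# Hypothetical population and sampling plan (assumptions for illustration)
true_mean, true_sd, n, trials = 38.0, 5.0, 50, 2000

t_star = t.ppf(0.975, n - 1)       # critical value for a 95% interval
covered = 0
for _ in range(trials):
    sample = rng.normal(true_mean, true_sd, size=n)
    margin = t_star * sample.std(ddof=1) / np.sqrt(n)
    # Count the interval if it contains the true mean
    if sample.mean() - margin <= true_mean <= sample.mean() + margin:
        covered += 1

coverage = covered / trials
print(coverage)                    # fraction of intervals that captured the true mean
```

The printed fraction should land close to 0.95, though it varies slightly with the seed and the number of trials.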

**Task 8** - 90% confidence interval using t-interval

Now that we've calculated a confidence interval at the 95% level, we'll repeat the calculation for a 90% confidence level.

* assign the confidence interval to `t_int_90`
* `alpha` is the confidence level
* `df`, `loc`, `scale` are the same as for the first calculation

In [15]:
# Task 8

# YOUR CODE HERE
# raise NotImplementedError()
t_int_90 = t.interval(0.90, df=deg_free, loc=mean_caffeine, scale=se_caffeine)

# View your answer
print(t_int_90)

(36.6969047267492, 39.183495273250806)


**Task 8 - Test**

In [None]:
# Task 8 Test

# Hidden tests - you will see the results when you submit to Canvas

**Task 9** - 99% confidence interval using t-interval

And, we'll complete one more confidence interval calculation, this time at the 99% level.

* assign the confidence interval to `t_int_99`
* `alpha` is the confidence level
* `df`, `loc`, `scale` are the same as for the first two calculations

In [16]:
# Task 9

# YOUR CODE HERE
# raise NotImplementedError()
t_int_99 = t.interval(0.99, df=deg_free, loc=mean_caffeine, scale=se_caffeine)

# View your answer
print(t_int_99)

(35.952803352856854, 39.927596647143154)


**Task 9 - Test**

In [None]:
# Task 9 Test

# Hidden tests - you will see the results when you submit to Canvas

**Task 10** - Summarize confidence interval calculations

This part is not autograded and is practice for writing out your results!

Q1 -  Is the 90% confidence interval more accurate or more precise (pick one) than the 95% confidence interval?

ANSWER: More precise.

Q2 -  Is the 99% confidence interval more accurate or more precise (pick one) than the 95% confidence interval?

ANSWER: More accurate.
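
The trade-off can be seen by comparing interval widths: a narrower interval is more precise, while a wider one is more likely to contain the true mean. The sketch below reuses the summary statistics printed in Task 2; the `widths` dictionary is an illustrative name, not a graded variable.

```python
from scipy.stats import t

# Summary statistics copied from the Task 2 output
mean_caffeine = 37.940200000000004
se_caffeine = 0.7415792024250597
deg_free = 49

widths = {}
for level in (0.90, 0.95, 0.99):
    lower, upper = t.interval(level, df=deg_free, loc=mean_caffeine, scale=se_caffeine)
    widths[level] = upper - lower          # width = twice the margin of error
    print(level, round(lower, 2), round(upper, 2), round(widths[level], 2))
```

The width grows as the confidence level rises, since a higher level needs a larger t*.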

**Task 10** - Summarize confidence interval calculations

Select the correct relationship between a 90%, 95% and 99% confidence interval. Specify your answer in the next code block using `Answer = `.  For example, if the correct answer is choice B, you'll type `Answer = 'B'`.

A: A 95% confidence interval is the most accurate and precise confidence interval, which is why we always use it.

B: A 90% confidence interval is more accurate and less precise than a 95% or 99% CI.

C: A 99% confidence interval is more precise and accurate than a 95% or 90% CI.

D: A 90% confidence interval is more precise and less accurate than a 95% or 99% CI.


In [18]:
# Task 10

# YOUR CODE HERE
# raise NotImplementedError()
Answer = 'D'

**Task 10 - Test**

In [None]:
# Task 10 - Test
# Hidden tests - you will see the results when you submit to Canvas