### Review part [20 Min]:

1. See Sep20 TUT and Sep23 LEC

    1. What is a histogram and what does it describe about the data?

    2. What general conclusions about using box plots and histograms for comparisons, and about sample size considerations, can we draw from the following examples?

In [None]:
import pandas as pd
import plotly.express as px
import plotly.offline as pyo
import plotly.graph_objects as go
pyo.init_notebook_mode()

from scipy import stats
import numpy as np

In [None]:
# load / reset df
df = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/2e9bd5a67e09b14d01f616b00f7f7e0931515d24/data/2020/2020-07-07/coffee_ratings.csv")
df = df[df['total_cup_points']>65]
df = df[~df['country_of_origin'].isna()]
df = df.rename(columns={'country_of_origin': 'origin', 'total_cup_points': 'points'})

In [None]:
# fix titles
df.origin = df.origin.str.replace(" (", "<br>(", regex=False) # literal match; "(" is not a valid regex on its own
df.origin = df.origin.str.replace(", ", ",<br>", regex=False)

fig = px.histogram(df, x='points', facet_col='origin', 
             facet_col_wrap=6, height=1000, facet_row_spacing=0.05)

fig.for_each_annotation(lambda a: a.update(text=a.text.replace("origin=", ""))) # fix titles

In [None]:
df.origin = df.origin.str.replace("<br>", " ") # fix labels

fig = px.box(df, x='points', y="origin", height=750)

# order plot to be more visually interpretable
fig.update_yaxes(categoryorder='array', 
                 categoryarray=df.groupby("origin")['points'].mean().sort_values().index)

In [None]:
# add in missing sample sizes
keys = df.origin.value_counts().index.values
vals = df.origin.value_counts().index.values + " (n="+df.origin.value_counts().values.astype(str)+")"
df.origin = df.origin.map({k:v for k,v in zip(keys,vals)})

fig = px.box(df, x='points', y="origin", height=750)
fig.update_yaxes(categoryorder='array', 
                 categoryarray=df.groupby("origin")['points'].mean().sort_values().index)

### Demo [40 minutes]

#### Estimating Averages [20/40 minutes]

1. For which countries above do you think we can most accurately estimate the average "points" score of cups of coffee from a given country? 
2. How does the variability of means of simulated samples change as a function of sample size?
3. Does this seem to change when sampling from a (symmetric) `normal` distribution, a (skewed) `gamma` distribution, or (other empirical shapes) **bootstrapped samples**? 
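
As a rough sketch of question 2, the following simulates sample means at a few sample sizes and compares their spread with the theoretical value $\sigma/\sqrt{n}$. The sample sizes, the seed, and the names `sample_sizes` and `spread_of_means` are illustrative assumptions, not part of the demo cells.

```python
import numpy as np
from scipy import stats

np.random.seed(130)  # fixed seed so the sketch is reproducible
population = stats.norm(loc=0, scale=1)  # standard normal population model
number_of_simulations = 1000
sample_sizes = [10, 100, 1000]  # hypothetical choices for illustration

spread_of_means = {}
for n in sample_sizes:
    # each row is one simulated sample of size n
    samples = population.rvs(size=(number_of_simulations, n))
    # standard deviation of the simulated sample means
    spread_of_means[n] = samples.mean(axis=1).std()
    print(n, spread_of_means[n], 1/np.sqrt(n))
```

The printed spread should shrink roughly in proportion to $1/\sqrt{n}$ as `n` grows.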


In [None]:
# fix titles
df.origin = df.origin.str.replace(" (", "<br>(", regex=False) # literal match; "(" is not a valid regex on its own
df.origin = df.origin.str.replace(", ", ",<br>", regex=False)

fig = px.histogram(df, x='points', facet_col='origin', 
                   facet_col_wrap=6, height=1000, facet_row_spacing=0.05)

fig.for_each_annotation(lambda a: a.update(text=a.text.replace("origin=", ""))) # fix titles

for i,average in enumerate(dict(df.groupby('origin').points.mean()[df.origin.unique()]).values()):
    # facets fill left to right and plotly numbers facet rows from the bottom (assumes 6 rows of 6)
    fig.add_vline(x=average, line_dash="dot", row=6-i//6, col=1+i%6)
fig.show()

In [None]:
# population model
pop_parameter_mu_μ = 0
pop_parameter_sigma_σ = 1
normal_distribution = stats.norm(loc=pop_parameter_mu_μ, scale=pop_parameter_sigma_σ) 

n = 100 # adjust and experiment with this
# np.random.seed(130)
x = normal_distribution.rvs(size=n) # "x" is a sample

print("The sample mean for the current sample is", x.mean()) 
# the sample average "x-bar" is a (sample) "statistic" (not a parameter)
print(x)
fig = px.histogram(pd.DataFrame({'sampled values': x}), x='sampled values',
                   histnorm='probability density') # so the scale matches the pdf below
fig.add_vline(x=x.mean(), line_dash="dot", annotation_text='Sample mean '+str(x.mean()))

# pdf stands for "probability density function"
support = np.linspace(-4,4,100)
fig.add_trace(go.Scatter(x=support, y=normal_distribution.pdf(support), 
                         mode='lines', name='Population Model<br>(normal distribution)'))

In [None]:
number_of_simulations = 1000 # adjust and experiment with this
simulated_means = np.zeros(number_of_simulations)

# np.random.seed(130) # ?
n = 100 # adjust and experiment with this

for i in range(number_of_simulations):
    # np.random.seed(130) # ?
    simulated_means[i] = stats.norm(loc=0, scale=1).rvs(size=n).mean()

title = str(number_of_simulations)+' simulated means for sample of size n = '+str(n)
fig = px.histogram(pd.DataFrame({title: simulated_means}), x=title,
                   histnorm='probability density')    

support = np.linspace(simulated_means.min(),simulated_means.max(),100)
fig.add_trace(go.Scatter(x=support, y=stats.norm(0,scale=1/np.sqrt(n)).pdf(support), 
                         mode='lines', name='A theoretical<br>distribution of<br>"averages"'))

In [None]:
# population model
pop_parameter_alpha_α = 2
pop_parameter_theta_θ = 4
gamma_distribution = stats.gamma(a=pop_parameter_alpha_α, scale=pop_parameter_theta_θ)

n = 100 # adjust and experiment with this
# np.random.seed(130)
x = gamma_distribution.rvs(size=n) # "x" is a sample

print("The sample mean for the current sample is", x.mean()) 
# the sample average "x-bar" is a (sample) "statistic" (not a parameter)
# print(x)

fig = px.histogram(pd.DataFrame({'sampled values': x}), x='sampled values',
                   histnorm='probability density') # so the scale matches the pdf below
fig.add_vline(x=x.mean(), line_dash="dot", annotation_text='Sample mean '+str(x.mean()))

support = np.linspace(0,50,100)
fig.add_trace(go.Scatter(x=support, y=gamma_distribution.pdf(support), 
                         mode='lines', name='Population Model<br>(gamma distribution)'))
# pdf stands for "probability density function"

In [None]:
number_of_simulations = 1000 # adjust and experiment with this
simulated_means = np.zeros(number_of_simulations)

# np.random.seed(130) # ?
n = 100 # adjust and experiment with this

for i in range(number_of_simulations):
    # np.random.seed(130) # ?
    simulated_means[i] = gamma_distribution.rvs(size=n).mean() # sample from the gamma population model

title = str(number_of_simulations)+' simulated means for sample of size n = '+str(n)
fig = px.histogram(pd.DataFrame({title: simulated_means}), x=title,
                   histnorm='probability density')    

support = np.linspace(simulated_means.min(),simulated_means.max(),100)
# normal approximation: centered at the gamma mean (α*θ) with spread (gamma standard deviation)/sqrt(n)
fig.add_trace(go.Scatter(x=support, y=stats.norm(gamma_distribution.mean(),scale=gamma_distribution.std()/np.sqrt(n)).pdf(support), 
                         mode='lines', name='A theoretical<br>distribution of<br>"averages"'))

#### Bootstrapping [20/40 minutes]: pretending a sample is the population

1. Why `replace=True`?
2. Why is `n` the same as the original sample size?
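
One way to explore both questions is the minimal sketch below. It uses a made-up `original_sample` (simulated here, not the coffee data) and NumPy's `np.random.choice`; the names and sizes are assumptions chosen for illustration.

```python
import numpy as np

np.random.seed(130)  # fixed seed so the sketch is reproducible
# hypothetical stand-in for an observed sample (not the coffee data)
original_sample = np.random.gamma(shape=2, scale=4, size=50)
n = len(original_sample)  # bootstrap samples keep the original sample size

number_of_bootstraps = 1000
bootstrapped_means = np.zeros(number_of_bootstraps)
for i in range(number_of_bootstraps):
    # resample WITH replacement, same size as the original sample
    bootstrap_sample = np.random.choice(original_sample, size=n, replace=True)
    bootstrapped_means[i] = bootstrap_sample.mean()

# sampling WITHOUT replacement at the same n only reshuffles the original values,
# so its mean never varies from the original sample mean
reshuffled = np.random.choice(original_sample, size=n, replace=False)
print(np.isclose(reshuffled.mean(), original_sample.mean()))
print(bootstrapped_means.std())  # variability of the bootstrapped means
```

Changing `size=n` to a different value would describe the variability of means for a sample size that was not actually observed.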

In [None]:
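# rerun the "load / reset df" cell first: earlier cells appended "(n=...)" and "<br>" to the origin labels
keep = (df.origin=='Guatemala') | (df.origin=='Mexico')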
px.histogram(df[keep], x='points', facet_col='origin', facet_col_wrap=2, height=300)

In [None]:
contry = 'Mexico' 

# bootstrapping is when `replace=True` and `n` is the original sample size
# and we do this over and over to see the behavior of sample statistics
n_ = (df.origin==contry).sum() # ?
replace_ = True # ?

x = df[df.origin==contry].sample(n=n_, replace=replace_).points
print("The sample mean for the current sample is", x.mean()) 
# the sample average "x-bar" is a (sample) "statistic" (not a parameter)

dat = pd.DataFrame({'values': np.r_[df[df.origin==contry].points.values,x],
                    'sample': np.r_[['Original Sample']*(df.origin==contry).sum(),
                                    ['Bootstrap Sample']*n_]})             

fig = px.histogram(dat, x="values", color="sample", barmode="overlay")
fig.add_vline(x=x.mean(), line_dash="dot", annotation_text='Sample mean '+str(x.mean()))
fig.update_layout(yaxis_range=[0,30])
# Notice that we don't have a "Population Model"... only the "Original Sample"

In [None]:
number_of_simulations = 1000 # adjust and experiment with this
simulated_means = np.zeros(number_of_simulations)

# np.random.seed(130) # ?
n = n_ # bootstrap samples use the original sample size, so the title reports n_

for i in range(number_of_simulations):
    # np.random.seed(130) # ?
    simulated_means[i] = df[df.origin==contry].sample(n=n_, replace=replace_).points.mean()

title = str(number_of_simulations)+' bootstrapped means for sample of size n = '+str(n)
fig = px.histogram(pd.DataFrame({title: simulated_means}), x=title,
                   histnorm='probability density')    

support = np.linspace(simulated_means.min(),simulated_means.max(),100)
# normal approximation centered at the original sample mean with spread s/sqrt(n)
original = df[df.origin==contry].points
fig.add_trace(go.Scatter(x=support, y=stats.norm(original.mean(),scale=original.std()/np.sqrt(n)).pdf(support), 
                         mode='lines', name='A theoretical<br>distribution of<br>"averages"'))

In [None]:
number_of_simulations = 1000 # adjust and experiment with this
simulated_means = np.zeros(number_of_simulations)

# np.random.seed(130) # ?
n = (df.origin==contry).sum()

for i in range(number_of_simulations):
    simulated_means[i] = stats.norm(loc=0, scale=1).rvs(size=n).mean()

title = str(number_of_simulations)+' simulated means for sample of size n = '+str(n)
fig = px.histogram(pd.DataFrame({title: simulated_means}), x=title,
                   histnorm='probability density')    

support = np.linspace(simulated_means.min(),simulated_means.max(),100)
fig.add_trace(go.Scatter(x=support, y=stats.norm(0,scale=1/np.sqrt(n)).pdf(support), 
                         mode='lines', name='A theoretical<br>distribution of<br>"averages"'))

### Communication [40 minutes]

#### Activity 1 [20/40 minutes]

Break into 5 groups of students, assigning each group to one of the questions. Groups discuss questions for 5 minutes, and then each group (in order) provides their answer to the class for 3 minutes.

1. What are the differences between sampling from a "population model" (such as a normal or gamma distribution) compared to bootstrap sampling from an original sample? 

2. What happens to the variability of sample mean statistics when sampling from a "population model" (such as a normal or gamma distribution) as the sample size (n) increases?

3. Why does it make sense to consider changing the sample size when sampling from a "population model" (such as a normal or gamma distribution) but it does not make sense to change the sample size when considering bootstrapped samples from an original sample? 

4. If you had a histogram of bootstrapped sample means representing the variability of means that an observed sample of size n produces, how would you give a range estimating what the sample mean of a future sample of size n might be? 

5. If you had a theoretical distribution of "averages" representing the variability of means that an observed sample of size n produces, how would you give a range estimating what the sample mean of a future sample of size n might be? 
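
As a starting point for questions 4 and 5, the sketch below computes a range covering the middle 95% of values in two ways. The array `simulated_means` is a made-up stand-in for bootstrapped means, and the values `80` and `0.5` are hypothetical, chosen only for illustration.

```python
import numpy as np
from scipy import stats

np.random.seed(130)  # fixed seed so the sketch is reproducible
# hypothetical stand-in for a collection of bootstrapped sample means
simulated_means = np.random.normal(loc=80, scale=0.5, size=1000)

# question 4: the middle 95% of the simulated (bootstrapped) means
percentile_range = np.quantile(simulated_means, [0.025, 0.975])

# question 5: the middle 95% of a theoretical distribution of "averages",
# here assumed to be normal with center 80 and standard deviation 0.5
theoretical_range = stats.norm(loc=80, scale=0.5).ppf([0.025, 0.975])

print(percentile_range)
print(theoretical_range)
```

Both approaches give a lower and an upper endpoint; the first reads them off the simulated values, while the second relies on the assumed theoretical shape.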


#### Activity 2 [20/40 minutes]

Break into 3 groups of students, assigning each group to one of the questions. Groups discuss questions for 5 minutes, and then each group (in order) provides their answer to the class for 5 minutes.

1. What is the process of bootstrapping?
2. What is the main purpose of bootstrapping?
3. If you had a hypothesis about what the average of a sample of size n from a population was, and then you observed the sample, how could you use bootstrapping as evidence for or against your hypothesis? 



### Homework [0 minutes]

> Code and write all your answers in a Python notebook (in code and markdown cells), save your Jupyter notebook in your own account and "repo" on [github.com](https://github.com), and submit a link to that notebook through Quercus for assignment marking.

#### Pre-lecture preparation / homework 

To prepare for this week's lecture, first watch this video [introduction to bootstrapping](https://www.youtube.com/watch?v=Xz0x-8-cgaQ). Then interact with ChatGPT to clarify what is meant by the following concepts introduced in the video.  

1. The video mentioned the "standard error of the mean" as being the "standard deviation" of the distribution of bootstrapped means.  What is the difference between the "standard error of the mean" and the "standard deviation" of the original data? 
    1. Include both your answer and a log of your interactions with ChatGPT in your notebook submission for this question.

2. The video suggested that the "standard error of the mean" could be used to create a confidence interval, but didn't describe exactly how to do this.  How can we use the "standard error of the mean" to create a 95% confidence interval which "covers 95% of the bootstrapped sample means"?
    1. Include both your answer and a log of your interactions with ChatGPT in your notebook submission for this question.

3. Creating the "plus and minus 2 times standard error" confidence interval addressed in the previous problem will indeed cover approximately 95% of the bootstrapped sample means. Alternatively, how do we create a 95% bootstrapped confidence interval which more exactly covers 95% of the bootstrapped sample means based on the sampled distribution (aka values) of the bootstrapped means?
    1. Include both your answer and a log of your interactions with ChatGPT in your notebook submission for this question.

4. The video mentioned that bootstrap confidence intervals could apply to other statistics of the sample, such as the "median". Provide code to produce a 95% bootstrap confidence interval, and comment the code to show where it can be changed to produce a 95% bootstrap confidence interval for a different sample statistic.
    1. Include both your answer and a log of your interactions with ChatGPT in your notebook submission for this question.

5. The video introduced hypothesis testing by saying that "the confidence interval covers zero, so we cannot reject the hypothesis that the drug is not doing anything".  This conclusion could be referred to as "failing to reject the null hypothesis", where the term "null" refers to the concept of "no effect".  Why does a confidence interval overlapping zero "fail to reject the null hypothesis" when the observed sample mean statistic itself is not zero? 
    1. Include both your answer and a log of your interactions with ChatGPT in your notebook submission for this question.

6. A formal null hypothesis has the form $H_0: \mu=0$, which means that the population average $\mu$ is $0$. The alternative would be $H_A: H_0 \text{ is false}$, which means that the population average $\mu$ is not $0$. What is the difference between the observed sample values $x_i$ (for $i = 1, \cdots, n$), the observed sample average $\bar x$, and the actual value of $\mu$?
    1. Include both your answer and a log of your interactions with ChatGPT in your notebook submission for this question.
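
For reference while working through the questions above, the standard textbook notation can be summarized as follows. The symbols $s$ (sample standard deviation) and $\text{SEM}$ are introduced here as assumptions for illustration; the formula for $\text{SEM}$ is the usual approximation and is not taken from the video.

$$\bar x = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar x)^2}, \qquad \text{SEM} \approx \frac{s}{\sqrt{n}}$$

Here the $x_i$ and $\bar x$ are computed from the observed sample, while $\mu$ is a property of the population and is not observed directly.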


#### Post-lecture homework 

7. Complete the following assignment. 


# Project: Vaccine Data 

## Overview

The company AliTech has created a new vaccine that aims to improve the health of the people who take it. Your job is to use what you have learned in the course to give evidence for whether or not the vaccine is effective. 

## Data 
AliTech has released the following data.

~~~csv
PatientID,Age,Gender,InitialHealthScore,FinalHealthScore
1,45,M,84,86
2,34,F,78,86
3,29,M,83,80
4,52,F,81,86
5,37,M,81,84
6,41,F,80,86
7,33,M,79,86
8,48,F,85,82
9,26,M,76,83
10,39,F,83,84
~~~

## Deliverables
While you can choose how to approach the project, we are interested in evaluating your report relative to the following deliverables.

- A formal null hypothesis for this context.
- A visual presentation giving some initial insight into the comparison of interest.
- A quantitative analysis of the data and an explanation of the method and purpose of this method.
- A conclusion regarding the null hypothesis after analyzing the data with your methodology.
- The clarity of your documentation, code, and written report. 

> Consider organizing your report within the following outline template.
> 1. Problem Introduction 
>     1. Statement of the Null Hypothesis
>     2. Data Visualization (motivating and illustrating the comparison of interest)
> 2. Quantitative Analysis
>     1. Methodology Code and Explanations
>     2. Supporting Visualizations
> 3. Findings and Discussion
>     1. Conclusion regarding the Null Hypothesis
>     2. Further Considerations

### Further Instructions:
- When using random functions, you should make your analysis reproducible by using the `np.random.seed()` function
- Create a CSV file and read that file in with your code, but do include the CSV file along with your submission
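
The two instructions above might look like the following minimal sketch. The file name `vaccine_data.csv`, the temporary directory, and the seed value `130` are assumptions made for illustration; any file location and any fixed integer seed would serve the same purpose.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# the data exactly as released above
csv_text = """PatientID,Age,Gender,InitialHealthScore,FinalHealthScore
1,45,M,84,86
2,34,F,78,86
3,29,M,83,80
4,52,F,81,86
5,37,M,81,84
6,41,F,80,86
7,33,M,79,86
8,48,F,85,82
9,26,M,76,83
10,39,F,83,84
"""

# write the CSV file (hypothetical file name), then read it back in with pandas
csv_path = os.path.join(tempfile.mkdtemp(), "vaccine_data.csv")
with open(csv_path, "w") as f:
    f.write(csv_text)
vaccine = pd.read_csv(csv_path)

# seeding before random calls makes the results repeat exactly
np.random.seed(130)
first_draw = np.random.choice(vaccine.PatientID, size=3, replace=True)
np.random.seed(130)
second_draw = np.random.choice(vaccine.PatientID, size=3, replace=True)
print((first_draw == second_draw).all())
```

Resetting the same seed reproduces the same random draws, which is what makes an analysis repeatable by a marker.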
