# Pre-lecture

## 1.

The key factor that makes the difference is whether data is available. Ideas backed by roughly equal amounts of data can be compared. It is much harder to compare ideas backed by unequal amounts of data, and impossible if one or both have no data at all, because an idea that cannot be measured cannot be tested statistically.

The key criterion for a good null hypothesis is that it can be tested and, if the evidence warrants, rejected in favor of an alternative hypothesis. A null hypothesis is never "accepted" or proven true; we either reject it or fail to reject it. A good null hypothesis is therefore a specific, falsifiable statement that the data are able to contradict.

The difference between the null and alternative hypotheses is that the null hypothesis is the default assumption, which stands unless the evidence leads us to reject it, while the alternative hypothesis is what we conclude in its place when the null hypothesis is rejected. In other words, the test is set up to look for evidence against the null hypothesis.

## 2.

The above sentence means that we use sample statistics to draw conclusions about a parameter of the entire population, such as the population mean ($\mu$). The population is the entire group affected by what we are testing, and a sample is a small subset of the population from which we collect and analyze data (doing so for the entire population may be costly, time consuming, and excessively complicated).

When hypothesis testing, individual data points in our sample ($x_i$) and the average of our sample data ($\bar{x}$), along with many other sample statistics, give us values that lead us to either reject or fail to reject the null hypothesis ($H_0$). The null hypothesis states a hypothesized value for the population parameter (for example, $\mu = \mu_0$). It is the hypothesis we look for evidence against, and unless it is rejected, it remains the working assumption about the population.

If we treated sample statistics as the final outcome of testing without relating them to the population, we would get results that are accurate for the sample but may not be accurate for the population as a whole.
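
The gap between a sample statistic and the population parameter can be illustrated with a small simulation. This is only a sketch: the population below (100,000 normally distributed values centered near 50) and the sample size of 40 are made-up numbers chosen for illustration, not data from the course.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical population (made-up values); in practice we rarely see all of it.
population = rng.normal(loc=50, scale=10, size=100_000)
mu = population.mean()  # population parameter

# One sample of 40 observations drawn without replacement.
sample = rng.choice(population, size=40, replace=False)
x_bar = sample.mean()  # sample statistic, the value we actually observe

print(f"population mean mu = {mu:.2f}")
print(f"sample mean x_bar  = {x_bar:.2f}")
print(f"difference         = {x_bar - mu:.2f}")
```

Changing the seed gives a different sample and a different $\bar{x}$, while $\mu$ stays fixed, which is why conclusions have to be stated about the population rather than about one sample.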

## 3.

Suppose the null hypothesis is true, and imagine many parallel universes in which the same study is carried out. Chance alone makes the result come out differently in each universe. The p-value is the fraction of those universes in which the result is at least as extreme as the one we actually observed. A smaller p-value means a result like ours would rarely happen if the null hypothesis were true, while a larger one means it would be common. As such, the p-value indicates how compatible the observed data are with the null hypothesis; it is not the probability that the null hypothesis is true.
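
As a rough formalization, the p-value can be written as a probability computed under the null hypothesis. The symbols $T$ (the test statistic, viewed as random) and $t_{\text{obs}}$ (the value actually observed) are notation assumed here for illustration. For a one-sided test:

$$
p = P\left(T \ge t_{\text{obs}} \mid H_0 \text{ is true}\right)
$$

For a two-sided test, "at least as extreme" is measured as distance from the value expected under $H_0$ in either direction.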

## 4.

Building on the explanation above, a smaller p-value makes the null hypothesis (a hypothesis that suggests there is no effect) look more ridiculous, because it means the observed result would be highly unlikely if the null hypothesis were true. The sampling distribution of the test statistic under the null hypothesis (the variability of the test statistic across samples when the null hypothesis holds) does not change with the p-value. A smaller p-value means the observed test statistic lies further out in the tails of that distribution, which represents stronger evidence against the null hypothesis.
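
A short simulation can make this concrete. The sketch below assumes a coin-flip style null hypothesis, $H_0$: P(heads) = 0.5, with 80 flips per experiment; the function name `simulated_p_value` and the observed counts are hypothetical choices for illustration. It estimates a two-sided p-value as the share of simulated experiments at least as far from the expected count as the observation.

```python
import numpy as np

def simulated_p_value(observed_heads, n_flips=80, n_sims=20_000, seed=0):
    """Two-sided simulated p-value for H0: P(heads) = 0.5."""
    rng = np.random.default_rng(seed)
    # Number of heads in each simulated experiment, generated assuming H0 is true.
    simulated_heads = rng.binomial(n=n_flips, p=0.5, size=n_sims)
    expected = n_flips * 0.5
    # "As or more extreme" = at least as far from the expected count as the observation.
    as_extreme = np.abs(simulated_heads - expected) >= abs(observed_heads - expected)
    return as_extreme.mean()

# Observations further from the expected 40 heads give smaller p-values.
for heads in (42, 49, 56):
    print(heads, simulated_p_value(heads))
```

The simulated null distribution is identical in all three calls because the seed is fixed; only the position of the observed value within it changes.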

# Post-lecture

## 8.

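    import numpy as np
    import pandas as pd

    np.random.seed(1)  # fixed seed so the simulated data are reproducible

    # Encoding (assumed): 1 = correct guess, 0 = incorrect guess.
    # `a` (the values) and `p` (their probabilities) must have the same length.
    guesses = np.random.choice([0, 1], size=80, p=[31/80, 49/80])
    np.random.shuffle(guesses)  # the draws are already random, so this only reorders them

    student_ids = np.arange(1, 81)
    data = {'Student ID': student_ids, 'Correct Guess': guesses}

    df = pd.DataFrame(data)
    df.to_csv('tea_experiment.csv', index=False)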
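
Because `np.random.choice` draws each guess independently, a dataset generated that way will not necessarily contain exactly 49 correct guesses out of 80. If an exact 49/31 split is what is wanted (an assumption about the intent here), a minimal sketch is to build the values first and then shuffle them. The name `df_exact` and the 1/0 encoding are illustrative choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

n_students = 80
n_correct = 49  # assumed to correspond to the 49/80 probability used above

# Exactly 49 ones (correct) and 31 zeros (incorrect), shuffled across students.
guesses = np.array([1] * n_correct + [0] * (n_students - n_correct))
rng.shuffle(guesses)

df_exact = pd.DataFrame({
    'Student ID': np.arange(1, n_students + 1),
    'Correct Guess': guesses,
})
print(df_exact['Correct Guess'].sum(), "correct out of", len(df_exact))
```

Here the shuffle matters, since without it all correct guesses would belong to the first 49 student IDs.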

# NotebookLM Session Summary

### Session Summary: Exploring the Null Hypothesis and Sampling Distributions

>This session focused on understanding two fundamental concepts in statistical hypothesis testing: the null hypothesis and the sampling distribution of the test statistic. 

>*   **The Null Hypothesis:** We established that the null hypothesis (*H*<sub>0</sub>) is a statement we aim to reject using statistical evidence.  It often proposes that there is "no effect" or "no difference" concerning a population parameter.  We used the example of a tea-tasting experiment to illustrate this. A possible null hypothesis for that scenario could be:  *H*<sub>0</sub>: STA130 students have no ability to distinguish whether milk or tea was poured first. This implies that their guesses are random. The sources highlight that we never "prove" a null hypothesis. We can only reject it or fail to reject it based on the evidence. 

>*   **The Sampling Distribution of the Test Statistic:** This distribution illustrates the variability of a test statistic if we repeatedly sampled from a population where the null hypothesis is true. It's important to remember that the sampling distribution is distinct from the distribution of the data itself. We discussed how to create a sampling distribution using simulation:
>
>     1.  Assume the null hypothesis is true.
>     2.  Generate numerous synthetic samples from a population under the conditions of the null hypothesis.
>     3.  Calculate the test statistic for each synthetic sample.
>     4.  Plot the distribution of the calculated test statistics.
>
>     We used the example of testing a fair coin to demonstrate how to simulate the sampling distribution of the proportion of heads under the null hypothesis.

>*   **Relationship Between the Null Hypothesis and Sampling Distribution:** We clarified that the sampling distribution under the null hypothesis provides a benchmark for evaluating the plausibility of observed data.  It allows us to determine the p-value, which quantifies how unlikely the observed data is if the null hypothesis were true. A smaller p-value suggests stronger evidence against the null hypothesis. We also touched upon the connection between confidence intervals and hypothesis testing.

>The session aimed to provide a comprehensive understanding of the null hypothesis and the sampling distribution of the test statistic, relying on simulation to illustrate key concepts and emphasizing the importance of these elements in the process of hypothesis testing.
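
The four simulation steps listed in the summary can be sketched for the fair-coin example as follows. The number of flips per sample, the number of synthetic samples, and the seed are hypothetical choices, and bin counts from `np.histogram` stand in for the plot in step 4 so that the sketch does not depend on a plotting library.

```python
import numpy as np

rng = np.random.default_rng(130)

n_flips = 100        # flips per synthetic sample (hypothetical choice)
n_samples = 10_000   # number of synthetic samples (hypothetical choice)

# Step 1: assume H0 is true, i.e. the coin is fair.
p_heads_under_h0 = 0.5

# Step 2: generate many synthetic samples under H0 (1 = heads, 0 = tails).
flips = rng.binomial(n=1, p=p_heads_under_h0, size=(n_samples, n_flips))

# Step 3: compute the test statistic (proportion of heads) for each sample.
proportions = flips.mean(axis=1)

# Step 4: summarize the distribution; these bin counts are what a histogram would draw.
counts, bin_edges = np.histogram(proportions, bins=20)
print("mean of simulated proportions:", proportions.mean())
print("std of simulated proportions: ", proportions.std())
```

The array `proportions` is the simulated sampling distribution of the test statistic under the null hypothesis; an observed proportion can then be compared against it to estimate a p-value.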