# Introduction

Some terms that are going to come up:
- **Population** - every member of a group we want to study.
- **Sample** - a small set of (hopefully) random members of the population.
- **Parameter** - a characteristic of a population. Often we want to understand parameters.
- **Statistic** - a characteristic of a sample. Often we apply statistical inference to the sample in an attempt to describe the population.
- **Variable** - a characteristic that describes a member of the sample. Can be discrete or continuous.

# Sampling

One of the great benefits of statistical models is that a reasonably sized random sample will almost always reflect the population. The challenge becomes: how do we select members randomly and avoid bias?

There are several forms of bias:
- **Selection Bias**: Perhaps the most common, this type of bias favors those members of a population who are more inclined and able to answer polls.
    - Undercoverage Bias: making too few observations or omitting entire segments of a population (e.g., only polling day workers at a hospital)
    - Self-Selection Bias: people who volunteer may differ significantly from those in the population who don't (e.g., an online survey about a sports team in a city where people feel strongly about that team)
    - Healthy-User Bias: the sample may come from a healthier segment of the overall population - people who walk/jog outside, follow healthier behaviors, etc. (e.g., polling customers at a fruit stand while asking about health).
- **Survivorship Bias**: If a population improves over time, it may be due to lesser members leaving the population due to death, expulsion, relocation, etc. For example, recorded head injuries increased when the British army switched from cloth caps to metal helmets. Head injuries increased because many wounds that were now counted as head injuries would have been fatalities before.

## Types of Sampling
There are several types of sampling:
- Random
- Stratified Random
- Cluster

**Random Sampling**

As its name suggests, random sampling means every member of a population has an equal chance of being selected. However, since samples are usually much smaller than populations, there is a chance that entire demographics might be missed.

**Stratified Random Sampling**

Stratified random sampling ensures that groups within a population are adequately represented. First, divide the population into segments based on some characteristic. Members cannot belong to two groups at once. Next, take random samples from each group. The size of each sample is based on the size of the group relative to the population.

Example: A company wants to conduct a survey of customer satisfaction. They can only survey 10% of their customers. They want to ensure that every age group is fairly represented. The customer breakdown by age group is as follows:

![image-2.png](attachment:image-2.png)

To obtain a 10% sample, we take 10% from each group.

![image-3.png](attachment:image-3.png)
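The proportional allocation described above can be sketched in Python. This is a minimal illustration, not the company's actual procedure: the `stratified_sample` helper, the age-group labels, and the group sizes are all hypothetical and do not reproduce the figures in the tables.

```python
import random

def stratified_sample(groups, fraction, seed=None):
    """Draw the same fraction at random from every group (stratum)."""
    rng = random.Random(seed)
    sample = {}
    for name, members in groups.items():
        # Sample size is proportional to the size of the group
        k = round(len(members) * fraction)
        sample[name] = rng.sample(members, k)
    return sample

# Hypothetical customer IDs per age group (not the data from the tables)
customers = {
    "18-29": list(range(0, 300)),
    "30-49": list(range(300, 800)),
    "50+": list(range(800, 1000)),
}

picked = stratified_sample(customers, fraction=0.10, seed=42)
sizes = {name: len(ids) for name, ids in picked.items()}
print(sizes)  # {'18-29': 30, '30-49': 50, '50+': 20}
```

Because each group is sampled separately, no age group can be missed by chance, which is the risk noted for plain random sampling.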

**Clustering**

A third - and often less precise - method of sampling is clustering. The idea is to break the population down into groups and sample a random selection of groups, or clusters. Usually this is done to reduce costs. 

Examples: A marketing firm sends pollsters to a handful of neighborhoods (instead of canvassing an entire city). A researcher samples only the fishing boats that are in port on a particular day (this is also known as **convenience sampling**).

# Central Limit Theorem

The Central Limit Theorem is what makes sampling such a good statistical tool. Recall that a sample mean often varies from the population mean. The CLT considers a large number of random samples.

The CLT states that the mean values from a group of samples will be *normally distributed* about the population mean, even if the population itself is not normally distributed. That is, about 95% of all sample means should fall within two standard errors ($2\sigma/\sqrt{n}$) of the population mean.

![image.png](attachment:image.png)
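A small simulation can make the theorem concrete. The sketch below assumes a hypothetical, strongly skewed population (exponential with mean 1 and standard deviation 1), draws many samples of size 50, and checks how the sample means behave. The population, sample size, and number of samples are illustrative choices, not values from the text.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical skewed population: exponential with mean 1.0 and sd 1.0
population_mean = 1.0
population_sd = 1.0
n = 50              # size of each sample
num_samples = 2000  # number of repeated samples

sample_means = [
    statistics.fmean(rng.expovariate(1.0) for _ in range(n))
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
expected_se = population_sd / n ** 0.5  # sigma / sqrt(n)

# Share of sample means within two standard errors of the population mean
within_two_se = sum(
    abs(m - population_mean) <= 2 * expected_se for m in sample_means
) / num_samples

print(round(mean_of_means, 3), round(sd_of_means, 3), round(expected_se, 3))
print(round(within_two_se, 3))
```

The mean of the sample means should land close to 1.0, their spread close to $\sigma/\sqrt{n}$, and roughly 95% of them within two standard errors, even though the population itself is far from normal.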

# Standard Error

Imagine we have a population of $N$ = 10,000 people, and we are measuring some parameter $P$ with standard deviation $\sigma$. We of course cannot measure the entire population, so we measure a sample of size $n$. What we are measuring is some sample statistic $\hat{p}$, which comes with a standard error, $SE_{\hat{p}}$.

If for the population of Australia the mean height is 5'9" and for our 100-person survey the mean height is 5'10", then:
- $P = 5'9"$
- $\hat{p} = 5'10"$
- $SE_{\hat{p}} = Standard \; Error \; of \; the \; Mean$

Where the population standard deviation describes how far individual values stray from the population mean, the Standard Error of the Mean describes how far a sample mean may stray from the population mean.

If the population standard deviation $\sigma$ is known, then the sample standard error of the mean can be calculated as:

<center>$SE_\bar{x}=\frac{\sigma}{\sqrt{n}}$</center>

Let's go through an example. Imagine an IQ Test is designed to have a mean score of 100 with a standard deviation of 15 points. If a sample of 10 scores has a mean of 104, can we assume they come from the general population?

<center>$n=10\;\;\bar{x}=104\;\;\sigma=15$</center>
<center>$SE_\bar{x}=\frac{\sigma}{\sqrt{n}} = \frac{15}{\sqrt{10}} = 4.743$</center>

This means that 68% of 10-item sample means are expected to fall between 95.257 and 104.743. Notice how this relates the sample mean to the population mean. The observed mean of 104 lies inside that range, so it is consistent with a sample drawn from the general population.

More generally, we can say with a 95% **confidence level** that the population parameter lies within a **confidence interval** of plus-or-minus two standard errors of the sample statistic.
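The arithmetic for the IQ example can be checked with a few lines of Python. The variable names are illustrative, and the "two standard errors" rule is the same approximation used in the text (the exact multiplier for 95% is closer to 1.96).

```python
import math

sigma = 15   # population standard deviation
n = 10       # sample size
mu = 100     # population mean
x_bar = 104  # observed sample mean

se = sigma / math.sqrt(n)

# About 68% of 10-item sample means fall within one standard error of mu
one_se_band = (mu - se, mu + se)

# Approximate 95% confidence interval: sample mean +/- two standard errors
ci_95 = (x_bar - 2 * se, x_bar + 2 * se)

print(round(se, 3))                        # 4.743
print([round(v, 3) for v in one_se_band])  # [95.257, 104.743]
print([round(v, 3) for v in ci_95])        # [94.513, 113.487]
```

The 95% interval around the sample mean contains the population mean of 100, which agrees with the conclusion that the sample is compatible with the general population.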

![image.png](attachment:image.png)

In the above example, the sample statistic $\hat{p}$ is a **point estimator** of the population parameter **$P$**. 

Note again how the standard error is allowing us to build out some relationship between the sample statistic we calculated from our sample vs the actual population parameter. The Standard Error just gives the

# Hypothesis Testing

Hypothesis testing is the application of statistical methods to real-world questions. We start with an assumption, called the **null hypothesis**. We then run an experiment to test this null hypothesis. Based on the results of the experiment, we either **reject** or **fail to reject** the null hypothesis. If the null hypothesis is rejected, then we say that data supports another, mutually exclusive **alternate hypothesis**. We never "prove" a hypothesis.

**Framing the Hypothesis**

How do we take a real-world question and form a null hypothesis? At the start of the experiment, the null hypothesis is assumed to be true. It is then up to the experiment to reject or fail to reject the null hypothesis. Only if the data fails to support the null hypothesis can we look to the alternative hypothesis.

If testing something assumed to be true, the null hypothesis can reflect the assumption:

Claim: Our shipping product has an average shipping weight of 3.5 kg. 

Null Hypothesis: average weight = 3.5 kg

Alternative hypothesis: average weight != 3.5 kg

If we are testing a claim we *want* to be true, but cannot assume, we test its opposite. 

Claim: This preparatory course improves test scores.

Null hypothesis: old scores >= new scores

Alternate hypothesis: old scores < new scores

The null hypothesis should contain an equality (=, <=, >=)

The alternate hypothesis should not have an equality (!=, <, >)

So what lets us reject or fail to reject the null hypothesis? We run an experiment and record the result. **Assuming our null hypothesis is valid**, if the probability of observing results at least this extreme is very small (less than 0.05), then we reject the null hypothesis. Here 0.05 is our **level of significance**.

The level of significance $\alpha$ is the area inside the *tails* of the probability curve under the null hypothesis. If $\alpha = 0.05$ and the alternative hypothesis is *less than* the null, then the left tail of our probability curve has an area of 0.05.

![image.png](attachment:image.png)

If $\alpha = 0.05$ and the alternative hypothesis is *more than* the null, then the right-tail of our probability curve has an area of 0.05. 

![image-2.png](attachment:image-2.png)

If $\alpha = 0.05$ and the alternative hypothesis is *not equal* to the null, then the two tails of our probability curve *share* an area of 0.05, with 0.025 in each tail.

![image-2.png](attachment:image-2.png)

These areas establish our **critical values** of Z-scores.
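Instead of a printed z-table, the critical values for each tail type can be computed from the standard normal distribution. A minimal sketch using `statistics.NormalDist` from the Python standard library; the `critical_values` helper and its `tail` labels are hypothetical names chosen for this illustration.

```python
from statistics import NormalDist

def critical_values(alpha, tail):
    """Return the critical Z value(s) for a left, right, or two-tailed test."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    if tail == "left":
        return (z.inv_cdf(alpha),)
    if tail == "right":
        return (z.inv_cdf(1 - alpha),)
    if tail == "two":
        upper = z.inv_cdf(1 - alpha / 2)  # the two tails share alpha
        return (-upper, upper)
    raise ValueError("tail must be 'left', 'right', or 'two'")

print([round(v, 3) for v in critical_values(0.05, "left")])   # [-1.645]
print([round(v, 3) for v in critical_values(0.05, "right")])  # [1.645]
print([round(v, 3) for v in critical_values(0.05, "two")])    # [-1.96, 1.96]
```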

**Tests of Mean vs Proportion**

There are two types of tests:
- Test of Means
- Test of Proportions

Each of these two types of tests has its own test statistic to calculate.

**Mean**
- When we look to find an *average* of a specific value in a population, we are dealing with means.

When working with means, the test statistic is:

<center>$Z = \frac{\bar{x}-\mu}{\sigma/\sqrt{n}}$</center>

This assumes we know the population standard deviation. 

**Proportion**
- Whenever we say something like "35%" or "most", we are dealing with proportions.

When working with proportions, the test statistic is (where $q = 1 - p$):

<center>$Z = \frac{\hat{p} - p}{\sqrt{\frac{p \cdot q}{n}}}$</center>
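Both test statistics translate directly into code. The function names below are hypothetical, and the numbers in the usage lines are made up purely to show the calls.

```python
import math

def z_for_mean(x_bar, mu, sigma, n):
    """Z statistic for a test of means (population sigma known)."""
    return (x_bar - mu) / (sigma / math.sqrt(n))

def z_for_proportion(p_hat, p, n):
    """Z statistic for a test of proportions, with q = 1 - p."""
    q = 1 - p
    return (p_hat - p) / math.sqrt(p * q / n)

# Made-up inputs, chosen so the results are easy to verify by hand
print(round(z_for_mean(x_bar=52, mu=50, sigma=10, n=25), 3))  # 1.0
print(round(z_for_proportion(p_hat=0.6, p=0.5, n=100), 3))    # 2.0
```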


In a **traditional test**:
- Take the level of significance $\alpha$
- Use it to determine the critical value
- Compare the test statistic to the critical value

In a **P-value test**:
- Take the test statistic
- Use it to determine the P-value
- Compare the P-value to the level of significance $\alpha$

"If the P-value is low, the null must go! If the P-value is high, the null must fly!"
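The P-value method can be sketched as two small functions: one converts a Z test statistic into a P-value for the chosen tail, the other applies the "low must go" rule. The helper names and the example statistic of 2.0 are assumptions made for illustration.

```python
from statistics import NormalDist

def p_value(z, tail):
    """P-value of a Z test statistic for a left, right, or two-tailed test."""
    cdf = NormalDist().cdf(z)
    if tail == "left":
        return cdf
    if tail == "right":
        return 1 - cdf
    if tail == "two":
        return 2 * min(cdf, 1 - cdf)
    raise ValueError("tail must be 'left', 'right', or 'two'")

def decide(p, alpha=0.05):
    """Low P-value: reject the null. High P-value: fail to reject it."""
    return "reject H0" if p < alpha else "fail to reject H0"

p = p_value(2.0, "right")     # hypothetical test statistic
print(round(p, 4))            # 0.0228
print(decide(p, alpha=0.05))  # reject H0
print(decide(p, alpha=0.01))  # fail to reject H0
```

The same statistic leads to different decisions at different significance levels, which is why $\alpha$ should be fixed before looking at the data.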

# Hypothesis Testing Example # 1

For this example, we will work in the left-hand side of the probability distribution, with negative z-scores. We will show how to run the hypothesis test using the traditional method, and then with the P-value method.

A company is looking to improve their website performance. Currently, pages have a mean load time of 3.125 seconds, with a standard deviation of 0.700 seconds. They hire a consulting firm to improve load times.

<center>$\mu = 3.125 \;\;\; \sigma = 0.700$</center>

Management wants a 99% confidence level that the mean load time improved. A sample run of 40 of the new pages has a mean load time of 2.875 seconds. Are these results statistically faster than before?

<center>$\alpha = 0.01 \;\;\; n = 40 \;\;\; \bar{x} = 2.875$</center>



1. State the null hypothesis: $H_0: \mu \geq 3.125$
2. State the alternative hypothesis: $H_1: \mu < 3.125$
3. Set the level of significance: $\alpha = 0.01$
4. Determine the test type (left tail, right tail, two tail). 
    - Recall that we are saying the null hypothesis mean is greater than or equal to, which means the alternative hypothesis then is less than, so we will be using the left tail test type. 
    
    
![image.png](attachment:image.png)

**Traditional Method**

5. Calculate the test statistic:

<center>$Z = \frac{\bar{x}-\mu}{\sigma/\sqrt{n}} = \frac{2.875-3.125}{0.7/\sqrt{40}} = -2.259$</center>

6. Critical Value:

*z-table lookup on 0.01:* $z = -2.326$

7. Since -2.259 > -2.326, the test statistic falls outside the rejection region, so we fail to reject the null hypothesis and cannot say that the new web pages are statistically faster.

**P-Value Method**

5. Calculate the test statistic

<center>$Z = \frac{\bar{x}-\mu}{\sigma/\sqrt{n}} = \frac{2.875-3.125}{0.7/\sqrt{40}} = -2.259$</center>

6. P-Value

*z-table lookup on -2.26:* $P=0.0119$

7. Since 0.0119 > 0.01, the P-value is greater than the level of significance $\alpha$, so we fail to reject the null hypothesis and cannot say that the webpages are statistically faster.
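Both methods can be reproduced in a few lines of Python, using `statistics.NormalDist` in place of the z-table. Small differences from table lookups are expected, because the table rounds the z-score to two decimals before the lookup.

```python
import math
from statistics import NormalDist

mu, sigma = 3.125, 0.700
n, x_bar, alpha = 40, 2.875, 0.01

# Test statistic for a test of means
z = (x_bar - mu) / (sigma / math.sqrt(n))

# Traditional method: left-tail critical value for alpha
z_critical = NormalDist().inv_cdf(alpha)
reject_traditional = z < z_critical

# P-value method: area to the left of the test statistic
p = NormalDist().cdf(z)
reject_p_value = p < alpha

print(round(z, 3))           # -2.259
print(round(z_critical, 3))  # -2.326
print(round(p, 3))           # 0.012
print(reject_traditional, reject_p_value)  # False False
```

The two methods always agree, since they compare the same quantities on different scales.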

# Hypothesis Testing Example # 2

A video game company surveys 400 of their customers and finds that 58% of the sample are teenagers. Is it fair to say that most of the company's customers are teenagers?

1. Set the null hypothesis: $H_0: P \leq 0.50$
2. Set the alternative hypothesis: $H_1: P > 0.50$
3. Calculate the test statistic:

<center>
    $
    Z = \frac{\hat{p} - p}{\sqrt{\frac{p \cdot q}{n}}}
    = \frac{0.58-0.50}{\sqrt{\frac{0.5(1-0.5)}{400}}}
    = \frac{0.08}{0.025}
    = 3.2
    $

</center>

4. Set a significance level. Since we weren't given one, we assume a standard $\alpha = 0.05$
5. Decide what type of tail is involved. $H_1: P > 0.50$ means a right-tail test. 
6. Look up the critical value: $Z = 1.645$
7. Since our test statistic (3.2) > the critical value (1.645), we reject the null hypothesis and support the claim that most customers are teenagers.

Note that **the size of the sample matters**. If we had started with a sample size of 40 instead of 400, our test statistic would have been only **1.01**, and we would fail to reject the null hypothesis. 
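The effect of sample size can be verified numerically. The sketch below recomputes the test statistic for both sample sizes and compares each against the right-tail critical value; the function name is a hypothetical helper.

```python
import math
from statistics import NormalDist

def z_for_proportion(p_hat, p, n):
    """Z statistic for a test of proportions."""
    return (p_hat - p) / math.sqrt(p * (1 - p) / n)

alpha = 0.05
z_critical = NormalDist().inv_cdf(1 - alpha)  # right-tail test

z_400 = z_for_proportion(0.58, 0.50, 400)
z_40 = z_for_proportion(0.58, 0.50, 40)

print(round(z_critical, 3))                 # 1.645
print(round(z_400, 2), z_400 > z_critical)  # 3.2 True
print(round(z_40, 2), z_40 > z_critical)    # 1.01 False
```

The observed proportion is identical in both cases; only the standard error in the denominator changes with $n$.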

# Type 1 and Type 2 Errors

Often in medical and other scientific fields, hypothesis testing is used to test against results where the "truth" is already known. For example, testing a new diagnostic test for cancer on patients you have already successfully diagnosed by other means.

In this situation, you already know if the Null Hypothesis is True or False. In these situations where you already know the "truth", then you would know it is possible to commit an error with your results. 

This type of analysis is common enough that these errors have specific names:
- Type 1 Error
- Type 2 Error

Type 1 Error: Rejecting a null hypothesis that is actually true.

Example: $H_0$: *There is no fire*. You reject this, so you pull the fire alarm, only to find out there really was no fire.

Type 2 Error: Failing to reject a null hypothesis that should have been rejected.

Example: $H_0$: *There is no fire*. You fail to reject it, when you should have rejected it, don't pull the fire alarm, and there is a fire. 

![image.png](attachment:image.png)
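The four possible outcomes can be written as a small lookup. This is only an illustrative sketch; `classify_outcome` and the wording of its return values are hypothetical.

```python
def classify_outcome(null_is_true, rejected_null):
    """Name the outcome of a test when the truth about H0 is already known."""
    if null_is_true and rejected_null:
        return "Type 1 error (false positive)"
    if not null_is_true and not rejected_null:
        return "Type 2 error (false negative)"
    return "correct decision"

# H0: "There is no fire"
# No fire, but the alarm was pulled
print(classify_outcome(null_is_true=True, rejected_null=True))    # Type 1 error (false positive)
# There is a fire, but the alarm was not pulled
print(classify_outcome(null_is_true=False, rejected_null=False))  # Type 2 error (false negative)
# There is a fire and the alarm was pulled
print(classify_outcome(null_is_true=False, rejected_null=True))   # correct decision
```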

# T-Distribution

Recall that when we used Z scores with a normal distribution, we had to know the population's standard deviation (sigma) in order to calculate Z. But what if in the real world we don't know the population standard deviation?

In that case we estimate it with the sample standard deviation $s$ and use the t-distribution instead. Using the t-table, the T-Test determines if there is a significant difference between two sets of data. Due to variance and outliers, it is not enough just to compare mean values. A T-Test also considers sample variances.

There are multiple types of T-Tests. 

**One-Sample T-Test**: Tests the null hypothesis that the population mean is equal to a specified value $\mu$ based on a sample mean $\bar{x}$.

Example - want to check if a sample of students has the same mean test score as the population of students.

**Independent Two-Sample T-Test**: Tests the null hypothesis that two sample means $\bar{x_1}$ and $\bar{x_2}$ are equal

Example - want to check if the mean test scores of two separate samples of students have a statistically significant difference. 

**Dependent, Paired-Sample T-Test**: Used when the samples are dependent: One sample has been tested twice (repeated measurements) or two samples have been matched or paired together. 

Example: Want to check if the same group of students has improved results on test scores before prep course and after prep course. 

Just like with the Z statistic, we calculate the t statistic.

**One-Sample t-test**

<center>$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$</center>

- $\bar{x}=$ sample mean
- $\mu=$ population mean
- $s=$ sample standard deviation
- $n=$ sample size

Just like with the Z score, we compare to a table of t-scores. These scores depend on:
- Degrees of freedom (based on sample size n)
- Chosen significance level (default 0.05)

Then compare to the critical t-score:

<center>$t \lessgtr t_{n-1,\alpha}$</center>

- $t=$ t-statistic
- $t_{n-1,\alpha}=$ t-critical
- $n-1=$ degrees of freedom
- $\alpha=$ significance level
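A one-sample t-test can be sketched with the standard library alone. The scores below are hypothetical, and since the standard library has no t-distribution, the critical value is typed in from a standard t-table (two-tailed, $\alpha = 0.05$, 9 degrees of freedom).

```python
import math
import statistics

def one_sample_t(sample, mu):
    """Return the t-statistic and degrees of freedom for a one-sample t-test."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (divides by n - 1)
    t = (x_bar - mu) / (s / math.sqrt(n))
    return t, n - 1

# Hypothetical test scores; H0: the population mean is 80
scores = [72, 75, 78, 80, 81, 83, 85, 88, 90, 98]
t, df = one_sample_t(scores, mu=80)

t_critical = 2.262  # from a t-table: two-tailed, alpha = 0.05, df = 9

print(round(t, 3), df)      # 1.241 9
print(abs(t) > t_critical)  # False, so we fail to reject H0
```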

**Independent Two-Sample t-test**

The calculation of the t-statistic differs slightly for the following scenarios:
- equal sample size, equal variance
- unequal sample sizes, equal variance
- equal or unequal sample sizes, unequal variance (most common)

When working with two samples and trying to compare them to each other with a t-test, it is often useful to think of the t-test as a ratio of signal (difference in sample means) to noise (sample variability).

Calculate the t-statistic:

<center>
    $t = \frac{signal}{noise}=\frac{difference \; in \; means}{sample \; variability}=\frac{\bar{{x_1}}-\bar{{x_2}}}{\sqrt{\frac{s^2_1}{n_1}+\frac{s^2_2}{n_2}}}$
</center>

Then compare to a t-score

<center>$t \lessgtr t_{df,\alpha}$</center>

- $t=$ t-statistic
- $t_{df,\alpha}=$ t-critical
- $df=$ degrees of freedom
- $\alpha=$ significance level

Since we have two, potentially unequal-sized samples with different variances, determining the degrees of freedom is a little more complicated.

![image.png](attachment:image.png)

The general formula for df if variances are "close enough" is $df = n_1 + n_2 - 2$.

t-Distributions have fatter tails than normal Z-distributions, since estimating the standard deviation from a sample adds uncertainty.
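For the unequal-variance case, a common choice is Welch's t-test with the Welch-Satterthwaite approximation for the degrees of freedom. The sketch below assumes that approximation (whether it matches the formula in the figure is an assumption) and uses small hypothetical samples.

```python
import math
import statistics

def welch_t(a, b):
    """t-statistic and Welch-Satterthwaite degrees of freedom for two samples."""
    n1, n2 = len(a), len(b)
    # Each term is a sample variance divided by its sample size
    v1 = statistics.variance(a) / n1
    v2 = statistics.variance(b) / n2
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Hypothetical samples with clearly different variances
a = [10, 12, 14, 16, 18]
b = [9, 10, 11, 12, 13]
t, df = welch_t(a, b)

print(round(t, 3), round(df, 3))  # 1.897 5.882
print(len(a) + len(b) - 2)        # 8, the df if equal variances were assumed
```

With unequal variances the approximate df (about 5.9 here) is smaller than $n_1 + n_2 - 2$, which makes the critical value larger and the test more conservative.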

![image.png](attachment:image.png)

# Student T Distribution Example

Imagine an auto manufacturer has two plants that produce the same car. However, due to budget constraints, they are forced to close one of the plants. The company wants to know if there is a significant difference in production between the two plants. The daily production over the same 10 days is as follows:

![image.png](attachment:image.png)

First compare the sample means:

- Mean Plant A = 1222
- Mean Plant B = 1186

So Plant A produces 36 more cars per day than Plant B. Is 36 more cars enough to say that the plants are different?

$H_0: X_A \leq X_B$

$H_1:X_A > X_B$

So we are going to treat this as a one-tailed test. 

We have 18 degrees of freedom (10+10-2)

Compute the variance

![image-2.png](attachment:image-2.png)

Compute the t-value

![image-3.png](attachment:image-3.png)

Look up the critical value from the t-table (one-tailed test, 95% confidence, 18 degrees of freedom)

![image-5.png](attachment:image-5.png)

So our critical value is 1.734. Now compare our t-value to the critical value. Since our computed t-value is *greater* than the critical value, we reject the null hypothesis. 

Conclusion: We believe with 95% confidence that Plant A produces more cars per day than Plant B. 
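The same procedure can be sketched in Python for any pair of equal-sized samples. The daily counts below are hypothetical stand-ins, not the plant data from the table, and the pooled-variance formula is an assumption consistent with using $10 + 10 - 2 = 18$ degrees of freedom. The critical value 1.734 is the one read from the t-table above.

```python
import math
import statistics

def pooled_t(a, b):
    """t-statistic and df for two samples, assuming roughly equal variances."""
    n1, n2 = len(a), len(b)
    df = n1 + n2 - 2
    # Pooled variance: weighted average of the two sample variances
    pooled_var = (
        (n1 - 1) * statistics.variance(a) + (n2 - 1) * statistics.variance(b)
    ) / df
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(
        pooled_var * (1 / n1 + 1 / n2)
    )
    return t, df

# Hypothetical daily counts for two production lines over 10 days
line_a = [20, 22, 24, 26, 28, 30, 32, 34, 36, 38]
line_b = [18, 20, 21, 23, 24, 26, 27, 29, 30, 32]

t, df = pooled_t(line_a, line_b)
t_critical = 1.734  # one-tailed, 95% confidence, df = 18

print(round(t, 3), df)  # 1.664 18
print(t > t_critical)   # False: for these made-up numbers we fail to reject H0
```

For the hypothetical data the difference in means is not large enough relative to the variability, which shows that a visible gap between two averages does not by itself imply significance.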