```
From: https://github.com/ksatola
Version: 0.0.1

TODOs
1.

```

# Table of Contents

## Statistics Basics 3 - Other

- [Small Sample Sizes](#toc01)
    - T-statistic vs. z-statistic
    - T-score tables and degrees of freedom
    - Calculating confidence intervals using t-scores
- [Comparing Two Populations (Proportions)](#toc02)
- [Comparing Two Populations (Means)](#toc03)
- [Chi-Square](#toc04)
    - Goodness-of-fit test
- [ANOVA - Analysis of Variance](#toc05)
    - f-statistic
- [Introduction to Regression](#toc06)

---
<a id='toc01'></a>

## Small Sample Sizes

### T-statistic vs. z-statistic
Up until now, we've used the `z-score` to help us identify how many `standard deviations` a data point might lie from the population mean. It's also been very helpful in developing `confidence intervals`. Now remember, the `z-score` requires that our data be normally distributed. It also requires that we know the `standard deviation` of the population. The central limit theorem tells us that given enough iterations, the mean of our sample will be normally distributed. But often, the population `standard deviation` is unknown. So how can we create confidence intervals when the population standard deviation is unknown? 

<img src="images/t-statistic_01.png" alt="" style="width: 300px;"/>

Believe it or not, you can use the standard deviation of a single sample. But if you have only one sample with a sample size under 30, a relatively small sample size, you can probably guess that your confidence interval will suffer, and this is why the `z-score` is not valid in this situation. 

If you're creating a confidence interval when the population variance is unknown, you must instead use something called the `t-distribution`. Before we discuss the differences between the z- and t-distributions, let's first discuss how they are similar. 

<img src="images/t-statistic_02.png" alt="" style="width: 300px;"/>

Both are symmetrical, bell-shaped distributions. Both require data with a normal distribution. And in both cases, the area under the curve is equal to 1.0. So how is the t-test different from our z-test? Well, the z-test is mostly used to compare the mean of a sample to its larger population when the population standard deviation is known. The `t-test`, on the other hand, is used when the population standard deviation is unknown and must be estimated from the sample itself. It can compare a sample mean to a population mean, and it can also compare two completely independent samples, which don't have to come from the same population. 

So because of these differences, and also because of the small sample size, the `t-distribution` isn't one curve but rather a series of curves. Each curve is representative of the distribution for different sample sizes. The smaller the sample size, the flatter the curve. The larger the sample size, the closer it gets to the `z-distribution`, which we use for the standard normal curve. 

<img src="images/t-statistic_03.png" alt="" style="width: 600px;"/>

Since all of the t-distribution curves are flatter than the z-distribution curve, `the critical scores for t-distributions are higher than those for z-distributions`. You might remember that the appropriate z-score for a 95% confidence interval was 1.96. That's 1.96 standard deviations. How does that compare to t-scores? Well, it depends on the sample size. 

<img src="images/t-statistic_04.png" alt="" style="width: 300px;"/>

For a sample size of three, the t-score is 4.303. For a sample size of 10, the t-score is 2.262. For a sample size of 20, the t-score is 2.093. And by the time our sample size is equal to a hundred, our t-score goes to 1.98. As you can see, `the larger our sample size, the closer the t-score is to the z-score of 1.96`. So where do we get t-scores for all of the different possible sample sizes? Let's take a look at that next.

### T-score tables and degrees of freedom
Since `T-distributions rely on the standard deviation of a sample`, instead of the standard deviation of the population, there is a greater level of uncertainty when creating confidence intervals. As a result, the `z-scores we gather from a z-distribution chart are not sufficient`. Instead, we need to utilize `t-distribution charts`. There's not one single t-distribution chart, but rather multiple charts. Remember, the curve associated with a t-distribution is dependent on the sample size. The smaller the sample size, the flatter the distribution curve, and the greater the uncertainty. 

<img src="images/t-statistic_05.png" alt="" style="width: 300px;"/>

The larger the sample size, the closer the curve gets to the normal distribution. 

<img src="images/t-statistic_06.png" alt="" style="width: 300px;"/>

That is why for each sample size, we need a different t-score distribution table. Just in case you forgot, here's a snapshot of just one part of a z-score table. 

<img src="images/t-statistic_07.png" alt="" style="width: 600px;"/>

So imagine having a different table for each unique sample size. Yep, that would be a lot of really big tables. Most of the time, however, we're looking for the most common confidence intervals. 90%, 95%, 99%. That's why you're more likely to see t-distribution tables that look like this.

<img src="images/t-statistic_08.png" alt="" style="width: 600px;"/>

As you can see, along the top is the confidence level, which is also given as the equivalent of a one- or two-tailed test. Along the left side, you have a column labeled as `df`. This stands for `degrees of freedom`. What are degrees of freedom? The easy answer is that `degrees of freedom` is just our sample size minus one, also referred to as `n-1`. `N` is the number of data points in our sample. 

<img src="images/t-statistic_09.png" alt="" style="width: 300px;"/>

So, for a sample size of five, we have four degrees of freedom. For a sample size of 10, we have nine degrees of freedom. There's a more complex answer, too, but let's leave that for another day. 

Let's go back to our t-distribution table. `Once we have our sample size, we have our degrees of freedom`. Let's suppose we want a 95% confidence interval, and that our sample size was four, which means we have three degrees of freedom. We isolate the column for 95%, we find the row for three degrees of freedom, and the intersection of that row and column brings us to our critical `t-score`, 3.182. 

<img src="images/t-statistic_10.png" alt="" style="width: 600px;"/>

How about if our sample size is 10? Now we look at the row for nine degrees of freedom. Our critical t-score is now 2.262. As you can see, `as our sample size gets larger, our critical t-score gets smaller, because the larger sample size is associated with the curve that is closer to our normal z-distribution`. So, now that you can find t-scores, let's use this to create some confidence intervals.

<img src="images/t-statistic_11.png" alt="" style="width: 600px;"/>
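The table lookups above can also be reproduced numerically. The following is a minimal sketch that uses only the Python standard library; the function names `t_pdf`, `t_cdf`, and `t_critical` are illustrative and do not come from the original notes. It integrates the t-distribution density with Simpson's rule and then bisects for the critical score. In practice a statistics library (for example `scipy.stats.t.ppf`) would normally be used instead.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """P(T <= x) for x >= 0, using Simpson's rule on [0, x] (steps must be even)."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_critical(confidence, df):
    """Two-tailed critical t-score, found by bisection on the CDF."""
    target = 1 - (1 - confidence) / 2  # e.g. 0.975 for a 95% interval
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Degrees of freedom are sample size minus one.
for n in (4, 10, 100):
    print(n, round(t_critical(0.95, n - 1), 3))
```

The printed values should land close to the table entries quoted above (3.182 for a sample size of four, 2.262 for a sample size of 10), and they approach the z-score of 1.96 as the sample size grows.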

### Calculating confidence intervals using t-scores
Let's develop a confidence interval using t scores. Remember that this confidence interval gives us a range of values for an estimated population parameter. 

Imagine that a national testing organization has made some major changes to the standardized exam that most aspiring college students take. The exam scores range between 50 points and 200 points. The old exam typically had an average score of 130 points. They'd like to see how the average score for the updated exam compares to the old version of the exam. Further, this testing organization wants to create a **98% confidence interval** for the updated exam's mean score. 

In order to do so, they gave the updated test to a random sample of 10 aspiring college students. The scores for these 10 students are as follows. 

<img src="images/t-statistic_12.png" alt="" style="width: 300px;"/>

Our sample mean is 126. While we don't know the standard deviation of the exam scores for the entire population of aspiring college students, we can calculate the standard deviation for this sample, 29.51. We also know that since our sample size N was 10, our degrees of freedom is N minus one. In this case, that would be nine. 

We'd like to create a 98% confidence interval. So, when we go to our T distribution table, we find that the critical t score for a 98% confidence interval with nine degrees of freedom is 2.821. 

<img src="images/t-statistic_13.png" alt="" style="width: 600px;"/>

So, to calculate our confidence interval, we use these formulas. 

<img src="images/t-statistic_14.png" alt="" style="width: 300px;"/>

We have our sample mean, 126. We found our t score, 2.821. So now, we need to find our standard error. For this, we'll use this formula. 

<img src="images/t-statistic_15.png" alt="" style="width: 300px;"/>

Remember, our standard deviation of our sample was 29.51. Our sample size N is 10. Therefore, we can calculate our standard error, 9.33. 

Now, we have everything we need to calculate our 98% confidence interval. 

<img src="images/t-statistic_16.png" alt="" style="width: 300px;"/>

Our upper and lower confidence limits will be 126 plus and minus 2.821 times 9.33, which means our 98% confidence interval for the mean score on the updated standardized exam stretches from about 99.7 to 152.3. Again, that means we're 98% confident that the population mean lies between those two values. 

<img src="images/t-statistic_17.png" alt="" style="width: 300px;"/>

That's a rather big spread for an exam where scores can only be as low as 50 and where they can only be as high as 200. 

Suppose we were content with a **95% confidence interval**. We can go back to our t distribution table, where we find that the critical t score for a 95% confidence interval with nine degrees of freedom is 2.262. We plug this new value into our confidence interval limit formulas, and our upper and lower limits are 126 plus or minus our critical t score, 2.262, times 9.33. Our 95% confidence interval would be from about 105 to 147. 

<img src="images/t-statistic_18.png" alt="" style="width: 300px;"/>

Still pretty big but nonetheless, a bit smaller. You're probably thinking, that is still a huge confidence interval. Well, I think `the big lesson here is that without the availability of the population standard deviation, a much larger sample size is needed to provide us with a more meaningful confidence interval`. So, perhaps, this testing organization should go back and administer the updated exam to a much larger random sample.
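The exam example can be checked with a few lines of Python. This is a minimal sketch: the helper name `t_confidence_interval` is made up for illustration, and the critical t-scores (2.821 and 2.262) are simply the table values quoted above rather than computed ones.

```python
import math

def t_confidence_interval(mean, sample_sd, n, t_crit):
    """Return (standard error, lower limit, upper limit) for a sample mean."""
    se = sample_sd / math.sqrt(n)   # standard error of the mean
    margin = t_crit * se            # margin of error
    return se, mean - margin, mean + margin

# Sample of 10 students: mean 126, sample standard deviation 29.51
se, low98, high98 = t_confidence_interval(126, 29.51, 10, t_crit=2.821)
_, low95, high95 = t_confidence_interval(126, 29.51, 10, t_crit=2.262)

print(f"SE = {se:.2f}")
print(f"98% CI: {low98:.1f} to {high98:.1f}")
print(f"95% CI: {low95:.1f} to {high95:.1f}")
```

Because the standard error shrinks with the square root of the sample size, quadrupling the sample would roughly halve the width of either interval, assuming a similar sample standard deviation.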

### Example 1
```
What is the standard error for a sample size of 10 and a standard deviation of 3.64?
The standard error is the standard deviation divided by the square root of the sample size; it reflects uncertainty in the estimate of the mean.
SE = 3.64 / sqrt(10) = 1.15
```

---
<a id='toc02'></a>

## Comparing Two Populations (Proportions)

### Explanation of two populations
Confidence intervals and hypothesis testing. You've now been introduced to both of these important statistical foundations. `Confidence intervals` allow us to take a single sample and create an interval, which we're fairly confident contains the population proportion. `Hypothesis testing` allows us to see if this one sample was likely the result of chance, or if an external force may have impacted the sample data. 

We're now moving on to `comparing two populations`. We'll look to answer questions like: 
- Does taking aspirin reduce the chance of a heart attack? 
- Are young male drivers more likely to get into car accidents than young female drivers? 
- Are people in Los Angeles more likely to be victims of violent crime than people in New York City? 
- Are male high school teachers more likely to have higher salaries than female high school teachers? 

<img src="images/two_populations_prop_01.png" alt="" style="width: 600px;"/>

Notice, we keep using the wording **"more likely"**. Even with our comparisons, we can't be sure, but we can create confidence intervals. But, `what makes all of these questions similar is that each situation can be analyzed by comparing two independent random samples`. One from each population: an `experimental population` and a `control population`. 

<img src="images/two_populations_prop_02.png" alt="" style="width: 600px;"/>

- Those that take aspirin versus those that take a placebo. The placebo is the control group. 
- A sample of young male drivers versus a sample of young female drivers. In this case, either gender can be used as the control. 
- A sample of citizens of Los Angeles versus a sample of New Yorkers. In this case, either city could be used as a control. 
- And of course, a sample of male high school teacher salaries versus a sample of female high school teacher salaries. Again, either gender can be used as the control. 

In this first section, we will look at the comparison of two proportions of two independent populations. We'll use our knowledge of basic proportions. We'll work to create a confidence interval for the difference between these two population proportions, and finally, we will use hypothesis testing to compare the difference between the proportions for each independent sample. Yeah, I know, that sounds like a whole lot of work, but rather than just talk about it, let's walk through a problem.

### Set up a comparison
So, let's compare two independent populations in an effort to figure out if a new drug is effective in reducing the chance of a heart attack. In real life, this testing would take many years. There are several different phases for drug testing, but let's ignore that for now. Imagine that a drug company gathers a large number of subjects. The subjects are randomly placed into two groups. One group of subjects is given this new drug. The other group of subjects is given a placebo, a pill with no medicinal value.

<img src="images/two_populations_prop_03.png" alt="" style="width: 600px;"/>

Suppose, these were the results of this long-term study. 

<img src="images/two_populations_prop_04.png" alt="" style="width: 600px;"/>

For a new drug group, we had a sample size of 2,219. 26 of those had a heart attack. So, our p-hat here is 26 divided by 2,219. For our placebo group, we had a sample size of 2,035. 46 of those people had a heart attack. So, in this case, our p-hat is 46 divided by 2,035. The sample group that took the drug had heart attacks at a rate of 0.0117 or 1.17%. The sample group that took the placebo had heart attacks at a rate of 0.0226 or 2.26%. 

`Remember, these are just samples. So, while the central limit theorem tells us these rather large samples are likely to be representative of the population, they are still only samples`. What we found was that the difference in the proportions of the samples was p-hat one minus p-hat two. 

<img src="images/two_populations_prop_05.png" alt="" style="width: 600px;"/>

In this case, that's 0.0226 minus 0.0117 which gives us 0.0109. A 1.09% difference between the two sample proportions. What we'd like to know is, what is the true difference between the rate of heart attacks for the population when they take the new drug versus when they take the placebo. Since the drug company does not likely have the means to do this type of test, they would have to create a confidence interval. The confidence interval formula looks just like the formulas we've already been using for confidence intervals. 

<img src="images/two_populations_prop_06.png" alt="" style="width: 600px;"/>

Let's start filling in our values. Suppose we want a 95% confidence interval, we go to a Z distribution chart and find a critical Z score of 1.96. 

<img src="images/two_populations_prop_07.png" alt="" style="width: 600px;"/>

That number probably looks familiar by now. We also know the observed difference from our samples. We calculated this to be 0.0109. So now, all we need is our standard error. 

<img src="images/two_populations_prop_08.png" alt="" style="width: 600px;"/>

We have everything we need, p-hat one is the sample proportion for our placebo group, 0.0226, n1 is the sample size for this group, 2,035, p-hat two is the sample proportion for our new drug group, 0.0117, and n2 is the sample size for this group, 2,219. When we plug in all of our numbers, we find that our standard error is 0.0040. 

<img src="images/two_populations_prop_09.png" alt="" style="width: 600px;"/>

And now, we can calculate the limits of our confidence interval. Our upper limit and lower limit are just 0.0109 plus or minus 1.96, our critical value, times 0.004. Therefore, our upper limit is 0.0188. Our lower limit is 0.0030. 

<img src="images/two_populations_prop_10.png" alt="" style="width: 300px;"/>

What do these numbers mean? `It means that we are 95% confident that the new drug reduces the population's chance of having a heart attack by somewhere between 0.3% and 1.88%`. In other words, we are 95% confident this new drug is more effective than the placebo.
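The same calculation can be sketched in Python. The function name `two_proportion_ci` is an illustrative choice, not something from the original notes, and the critical value 1.96 is taken from the text. Because the code works with unrounded proportions, its limits may differ from the rounded figures above in the last decimal place.

```python
import math

def two_proportion_ci(x1, n1, x2, n2, z_crit=1.96):
    """Confidence interval for the difference between two sample proportions.

    x1, x2 are event counts; n1, n2 are sample sizes.
    Returns (difference, standard error, lower limit, upper limit).
    """
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff, se, diff - z_crit * se, diff + z_crit * se

# Group 1: placebo (46 heart attacks out of 2,035)
# Group 2: new drug (26 heart attacks out of 2,219)
diff, se, lower, upper = two_proportion_ci(46, 2035, 26, 2219)

print(f"difference = {diff:.4f}, SE = {se:.4f}")
print(f"95% CI: {lower:.4f} to {upper:.4f}")
```

Since the whole interval lies above zero, the data are consistent with the placebo group having a higher heart attack rate than the new drug group.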

### Hypothesis testing
Before we prepare our hypothesis test, let's briefly recap our example. A company is trying to figure out if a new drug is effective in reducing the chance of a heart attack. The company gathers a large number of subjects. The subjects are randomly placed into two groups. One group of subjects is given this new drug. The other group of subjects is given a placebo. The people in the study are not to be told if they are getting the new drug or the placebo. The results of the study were as follows. 

<img src="images/two_populations_prop_04.png" alt="" style="width: 600px;"/>

For the new drug, we had a sample size 2,219. 26 of those people had a heart attack, so our p-hat was 26 divided by 2,219. For our placebo group, we had a sample size of 2,035. In this case, 46 people had a heart attack, so our p-hat here was 46 divided by 2,035. The results and the resulting 95% confidence interval both provide evidence that the new drug did help reduce the rate of heart attacks.

<img src="images/two_populations_prop_11.png" alt="" style="width: 600px;"/>

The question we have is, what's the probability that our results happened by chance? In other words, we had 4,254 people in the study. 72 of those people suffered a heart attack. `What's the probability that even without the drug or placebo these same people would have suffered heart attacks?` Perhaps these 4,254 people were randomly put into two groups and one group just happened to get a lot more of the people that would end up getting heart attacks. 

<img src="images/two_populations_prop_12.png" alt="" style="width: 600px;"/>

This is extremely important to think about, especially for `random sampling` and `random assignment`. So let's go ahead and perform a hypothesis test. 

<img src="images/two_populations_prop_13.png" alt="" style="width: 300px;"/>

A hypothesis test evaluates two mutually exclusive statements about a population to determine which statement is best supported by the sample data. 

<img src="images/two_populations_prop_14.png" alt="" style="width: 300px;"/>

So for step one, let's develop the hypotheses and state the significance level. Our null hypothesis, H-naught, population proportion of the placebo group minus the population proportion of our new drug group is equal to zero. In other words, the new drug had no effect. Both proportions are identical. Our alternative hypothesis, H sub one, here, our population proportion for the placebo group minus the population proportion for our new drug group is not equal to zero. This means that the proportion of heart attacks for people that took the new drug was different than that of those that took the placebo. This would indicate the new drug had some effect. Let's set our `significance level at 5%`. If there is less than a 5% chance the results of our study could have happened by chance, then we will reject our null hypothesis. 

<img src="images/two_populations_prop_14.png" alt="" style="width: 300px;"/>

So let's go to step two. Let's identify the test statistic. As in previous hypothesis tests the test statistic will be a Z-statistic. In this case, we are using the `Z-statistic for a hypothesis test for the difference between two proportions`. 

<img src="images/two_populations_prop_15.png" alt="" style="width: 300px;"/>

<img src="images/two_populations_prop_16.png" alt="" style="width: 300px;"/>

Yes, a very ugly formula, so let's start plugging in the numbers. The numerator of this formula is looking for the difference between our sample proportions and the true population proportions. Remember, we're testing the null hypothesis. The null hypothesis assumes that the difference between the two populations should be zero. So we can eliminate the second half of our numerator. 

<img src="images/two_populations_prop_17.png" alt="" style="width: 300px;"/>

The first part of the numerator is just the difference between our two samples. Proportion of heart attacks for the placebo was 0.0226. The proportion of heart attacks for the new drug group was 0.0117. Therefore, our numerator is 0.0109. 

<img src="images/two_populations_prop_18.png" alt="" style="width: 300px;"/>

How about that scary denominator with the square root? This is actually our standard error. Let's fill in our two sample sizes first, 2,035 for the placebo group, 2,219 for the new drug group. That leaves us with p-hat for the placebo and new drug, which again are 0.0226 and 0.0117, respectively. When we calculate our denominator, which is the standard error, we get 0.004. Our numerator is 0.0109. Therefore, our Z-statistic is 2.725. 

<img src="images/two_populations_prop_19.png" alt="" style="width: 100px;"/>

So let's see what this looks like. We have our normal distribution. If the null hypothesis were true, the difference between the two populations would be zero. In our samples, we found that the new drug group had a lower proportion of people that suffered heart attacks. The difference between the two groups was 0.0109. 

<img src="images/two_populations_prop_20.png" alt="" style="width: 300px;"/>

This is a two-tailed test with a 5% significance level. The critical Z-score for this would be 1.96. This means that if our actual outcome were more than 1.96 standard deviations from the expected outcome, then we must reject our null hypothesis. Our result was 2.725 standard deviations from the expected outcome. In fact, by looking at a Z distribution chart, we see that a Z-score of 2.725 leaves only about 0.3% of the distribution in the tail beyond it. 

<img src="images/two_populations_prop_21.png" alt="" style="width: 600px;"/>

We have to reject our null hypothesis. In other words, we can feel fairly confident that the positive results exhibited by the group that took the new drug did not occur by chance.
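The test above can be sketched in Python using the standard library's `statistics.NormalDist` for the normal curve. The function name `two_proportion_z_test` is hypothetical. The sketch follows the text in using the unpooled standard error; many textbooks instead pool the two samples under the null hypothesis, which would give a slightly different statistic. With unrounded proportions the statistic comes out at roughly 2.72 rather than the 2.725 obtained from rounded inputs.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(x1, n1, x2, n2):
    """Z-statistic and two-tailed p-value for H0: p1 - p2 = 0."""
    p1, p2 = x1 / n1, x2 / n2
    # Unpooled standard error, matching the formula used in the text
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = (p1 - p2 - 0) / se  # the null hypothesis says the true difference is 0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Placebo: 46 of 2,035; new drug: 26 of 2,219
z, p_value = two_proportion_z_test(46, 2035, 26, 2219)
alpha = 0.05
reject_null = p_value < alpha

print(f"z = {z:.3f}, two-tailed p = {p_value:.4f}, reject H0: {reject_null}")
```

The one-tail area beyond the statistic is about 0.3%, so the two-tailed p-value is roughly twice that, still well below the 5% significance level.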

### Example 1
The two hypotheses to be tested _____.
- must be logically equivalent
- (correct) cannot both be true
- must have differing probabilities of truth
- must both be true

The hypotheses must be mutually exclusive.

### Example 2
How should populations differ when we wish to investigate an effect?
- They should differ by at least five known variables.
- They should differ in size.
- (correct) They should differ by one or more known variables.
- They should differ by country of origin.

We should control at least one variable.

---
<a id='toc03'></a>

## Comparing Two Populations (Means)

### Basics of comparing two population means
Consider these situations: 
- A large national corporation has a hundred senior executives. About 40 of those senior executives are women. The other 60 are men. It is found that the average salary for male senior executives is about $15,000 per year greater than the salaries of the female senior executives. Why are male senior executives at this company paid higher salaries? Does the gender of the senior executives play a role? 

<img src="images/two_populations_mean_01.png" alt="" style="width: 400px;"/>

- A hundred obese males in their 20s are randomly assigned to two groups for a period of three months. One group of males exercise two hours a day but are allowed to eat whatever they please. The other group of males are not required to exercise, but they must adhere to a very strict diet. The males on the strict diet lose an average of four pounds more during the three month period versus the individuals that are required to exercise two hours per day. Is diet a better mechanism for influencing weight loss among young obese males versus daily exercise? 

<img src="images/two_populations_mean_02.png" alt="" style="width: 600px;"/>

- A group of 1,000 high school students are randomly assigned to two groups. One group is required to take a daily multivitamin pill. The other group is given a placebo, a pill with no medicinal value. The group that took the daily multivitamin scored an average of 3% higher on their math exams during their academic year. Does taking this daily vitamin improve a student's ability to perform well on math exams? 

<img src="images/two_populations_mean_03.png" alt="" style="width: 600px;"/>

In each scenario discussed, we had a group of people. They were assigned to two groups, either by their sex or by the stimuli to which they were exposed. In each case, the results of one group differed from the other group. One group had a higher mean salary. One group experienced a higher mean weight loss. And another group had higher mean exam scores on their math tests. 

The question is, did these measurable changes occur because of the stated differences in their groups or by chance? In other words, maybe the gender of the senior executives did not play a role in the salaries. Perhaps this group of female executives didn't hit their sales numbers that would warrant higher salaries. Maybe the weight loss program was not the differentiator in weight loss. It's possible that one group of obese men just happened to be assigned more of the men that were prone to lose weight under any program. And how would a vitamin help someone do better on a math exam? Perhaps the students that were better at math were randomly selected to be in the group that got the daily multivitamin. 

In this section, we'll look at different ways `to figure out whether population means for two populations can be attributed to actual differences or to chance`. Using data, charts, and randomizations, we'll look at different ways to figure out whether stimuli or chance influenced our statistical outcomes.

### Visualization (re-randomizing)
A certain school has 200 students. The students are randomly assigned to two different groups of 100 students. Each of the 100 students in the first group is given a math textbook to learn a certain math concept. We'll call this Group A. The second group of 100 students is asked to watch an online video to learn the same math concept. We'll call this Group B. Each group is given 30 minutes to learn the math concept. After the 30 minutes are up each of the 200 students takes an exam with 20 questions. 

<img src="images/two_populations_mean_04.png" alt="" style="width: 600px;"/>

The students that learned from the online video, Group B, had a median test score that was five questions higher than the students that learned from the textbook, Group A. Students that watched the video had a median score of 17 out of 20. Students that learned the concept from the textbook had a median score of 12 out of 20. 

Is the video a more effective teaching tool or did this outcome just happen by chance? In other words, did most of the students that were better at math just happen to be assigned to the group that had access to the online video? 

One way to visualize the likelihood that this might've happened by chance is by taking all 200 test scores and then randomly assigning them to two different groups of 100. If we randomize the 200 test scores into two groups, we might find that one group of test scores has a median of 15 and the other a median of 14. What happens if we randomize the 200 test scores again? Maybe we'll get one group with a median score of 13 and the other group with a median score of 15. And what happens if we randomize these math test scores a total of a hundred times? 

<img src="images/two_populations_mean_05.png" alt="" style="width: 400px;"/>

For each randomization we record the difference between the median of Group A's 100 test scores and the median of Group B's 100 test scores. This allows us to visualize the difference between the medians of two randomized groups of these test scores. Suppose that this was the resulting distribution, where the X axis measures the difference between the medians of the two random groups. There are 100 dots on our distribution. Each dot represents the difference between the two group medians for a different randomization of the scores. 

<img src="images/two_populations_mean_06.png" alt="" style="width: 600px;"/>

The result of the experiment found that the group that used the online video had a median score five questions higher than the group that used the textbook. From the distribution chart for our 100 test score randomizations we can see that only on three occasions did the median test score for the online video group, Group B, exceed the median test score for Group A by five questions or more. If our significance level were 5% we can see that our experiment results were significant. According to this distribution chart this outcome is less than 5% likely to have occurred by chance. In fact, it's 3% likely to have occurred by chance. Granted, this was only 100 rerandomizations of the data. And this was a rather limited statistical exercise. But this simple example allowed us to, both, understand and visualize how statisticians test whether an outcome was potentially meaningful or if it may have occurred by chance.
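The re-randomization idea can be sketched in a few lines of Python. The notes do not include the raw test scores, so the two score lists below are hypothetical: 10 scores per group rather than 100, chosen only so that the medians are 12 and 17 as in the example. The function name `rerandomization_p_value`, the 1,000 iterations, and the fixed seed are likewise assumptions made for illustration.

```python
import random
from statistics import median

def rerandomization_p_value(group_a, group_b, n_iter=1000, seed=0):
    """Share of random regroupings whose median difference (B minus A)
    is at least as large as the observed difference."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = median(group_b) - median(group_a)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)  # randomly reassign all scores to two groups
        diff = median(pooled[n_a:]) - median(pooled[:n_a])
        if diff >= observed:
            hits += 1
    return observed, hits / n_iter

# Hypothetical scores out of 20 (not from the original notes)
textbook = [9, 10, 11, 12, 12, 12, 13, 14, 15, 16]   # Group A, median 12
video = [13, 15, 16, 17, 17, 17, 18, 18, 19, 20]     # Group B, median 17

observed, p_value = rerandomization_p_value(textbook, video)
print(f"observed difference in medians: {observed}")
print(f"share of regroupings at least that extreme: {p_value}")
```

If that share falls below the chosen significance level (5% in the example above), the observed difference is unlikely to be the result of random group assignment alone.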

### Set up a confidence interval
In the previous section, we discussed a situation where 200 students were randomly placed into two groups. These students were going to be given a test on a math concept. One group of students was allowed to use an online video to prepare for the exam. The second group used a traditional textbook to prepare for the same exam. The exam had 20 questions. 

For this section, let's provide some updated data. The group that used the online video to prepare ended up with 120 students. This group averaged 16.2 correct questions on the exam. The standard deviation of these test scores was 2.5. The group that used the traditional textbook to prepare for the exam had 80 students. The textbook group averaged 14.1 correct questions on the exam. The standard deviation of their scores was 3.6. As we can see, the average score for the online video group was 2.1 correct questions higher than that of the textbook group. 

<img src="images/two_populations_mean_07.png" alt="" style="width: 600px;"/>

This of course is only the difference for these two random groups of students from this one school. If the online video was available to the entire population of students that took this course across the country, what would be the difference in the average exam scores? Well, since we don't have data for every student, `we can use these two samples to create a confidence interval. A confidence interval that would contain the true difference between the population mean score` of students that prepared using the online video versus the population mean score of students that prepared using the textbook. Again, we go back to confidence intervals, and again, we see a familiar formula. Let's begin to fill in the numbers for our math test example. 

<img src="images/two_populations_mean_08.png" alt="" style="width: 600px;"/>

First, we know that the difference in mean scores for these two samples is 2.1. Next, our critical value. Let's say we want to build the typical 95% confidence interval, an interval that excludes 2.5% of the outcomes on either end of the distribution. We go to the Z distribution chart and `we find the appropriate critical score, the one that coincides with 0.9750`. 

<img src="images/two_populations_mean_09.png" alt="" style="width: 600px;"/>

Thus, our critical Z score is the very familiar 1.96. So we have 2.1, we have 1.96, we're just missing our standard error. Since our sample sizes are 80 and 120, `we can utilize the standard deviations of our samples as reasonable estimates for the population standard deviations`. So as you can see here, the standard error for this situation is the standard deviation of the video sample squared, divided by the sample size of the video sample plus the standard deviation of the textbook sample squared, divided by the sample size of the textbook sample. And once we have the sum, we take the square root. 

<img src="images/two_populations_mean_10.png" alt="" style="width: 300px;"/>

In this case, the standard deviation of the video sample is 2.5, and this sample size is 120, and the standard deviation of the textbook sample is 3.6 and this sample size is 80. Once we plug in our numbers, we can find the standard error for this problem. We get a standard error of about 0.462. 
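As a quick check of the arithmetic, here is a minimal Python sketch of the standard error calculation. The function and variable names are illustrative choices for this sketch; only the four input numbers come from the example.

```python
import math

# Sample summaries quoted in the example above.
sd_video, n_video = 2.5, 120   # online video group
sd_text, n_text = 3.6, 80      # textbook group


def standard_error_of_difference(sd1, n1, sd2, n2):
    """Standard error of the difference between two independent sample means."""
    return math.sqrt(sd1**2 / n1 + sd2**2 / n2)


se = standard_error_of_difference(sd_video, n_video, sd_text, n_text)
print(f"standard error = {se:.4f}")
```

The result is roughly 0.46, in line with the value used in the rest of this example.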

<img src="images/two_populations_mean_11.png" alt="" style="width: 300px;"/>

<img src="images/two_populations_mean_12.png" alt="" style="width: 200px;"/>

And now we can calculate the limits of our confidence interval. Our upper and lower limits are 2.1 plus or minus our critical value, 1.96, times our standard error, 0.462. So we get an upper limit of 3.01 and a lower limit of 1.19. 

<img src="images/two_populations_mean_13.png" alt="" style="width: 400px;"/>

So what does this mean? Well it means that we are 95% confident that the online video helps improve the exam scores for this particular test by at least 1.19 questions, and perhaps as high as 3.01 questions. Remember, we're only 95% confident, but since our lower limit is 1.19, we're pretty confident that the online video will improve the average test score of the population by at least one question versus the average score of this population if the students instead use the textbook to study.
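The whole interval can be reproduced with a short Python sketch. It assumes the large-sample z approach described above and uses `statistics.NormalDist` from the standard library in place of the printed z chart; the variable names are illustrative.

```python
import math
from statistics import NormalDist

# Difference between the two sample means and its standard error.
mean_diff = 16.2 - 14.1
se = math.sqrt(2.5**2 / 120 + 3.6**2 / 80)

# Critical z value for a two-sided 95% interval (the 0.9750 entry on the chart).
confidence = 0.95
z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

margin = z_crit * se
lower, upper = mean_diff - margin, mean_diff + margin
print(f"critical z = {z_crit:.2f}")
print(f"95% confidence interval = ({lower:.2f}, {upper:.2f})")
```

Changing `confidence` to another level, such as 0.99, widens the interval accordingly.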

### Hypothesis testing
So let's continue our same example. 200 students prepared for a math test. 120 had access to an online video. This group averaged 16.2 correct questions out of 20 on the exam. The Standard deviation for this group was 2.5. In another group, 80 students learned from a textbook. This group averaged 14.1 correct questions on the exam. The Standard deviation of their scores was 3.6. The average score for the online video group was 2.1 questions higher than that of the textbook group. 

<img src="images/two_populations_mean_07.png" alt="" style="width: 600px;"/>

So does this data make you feel like the online video is more effective than the textbook? To find out, let's perform a Hypothesis Test. Let's begin by setting up our two hypotheses. 

<img src="images/two_populations_mean_14.png" alt="" style="width: 600px;"/>

- Our Null hypothesis will be that the online video and the textbook are equally effective. There is no difference in their ability to prepare the students for this exam. In other words, the difference between the population means of these two groups would be 0. We would state this Null hypothesis as the Population mean exam score for the online video group minus the Population mean exam score for the textbook group is equal to 0. 
- Our alternative hypothesis would be that the online video is more effective in preparing students for this math exam. We would state this alternative hypothesis as the Population mean exam score for the online video group minus the Population mean exam score for the textbook group is greater than 0. 
- For this Hypothesis Test let's use a 1% significance level, or, an Alpha of value 0.01. 

Let's take a look at what we're testing. Here we have a normal curve. 

<img src="images/two_populations_mean_15.png" alt="" style="width: 300px;"/>

The curve is centered at 0. Remember, our Null hypothesis states that this would be the difference between these two populations. Our two samples, though, revealed that the video group scored 2.1 questions higher than the textbook group, so the difference between our two samples is out here to the right. 

<img src="images/two_populations_mean_16.png" alt="" style="width: 600px;"/>

Our Hypothesis Test has a 1% significance level. This means that if an outcome at least as large as 2.1 has less than a 1% chance of occurring when the Null hypothesis is true, we will reject the Null hypothesis. As you can see, we really want to make sure the evidence is strong before we declare the online video a more effective learning tool. To figure this out, we need the z score that is associated with 0.99. On a z distribution chart we find that this z score is 2.33. So if our outcome of 2.1 is more than 2.33 Standard deviations from 0, then we will reject our Null hypothesis. 

<img src="images/two_populations_mean_17.png" alt="" style="width: 400px;"/>

Well, then, to figure this out we'll need to know the size of 1 Standard deviation. Remember, we do not have our population data, we have sample data. But `since our sample sizes are rather large we can use the Standard error of the difference between the two sample means as our Standard deviation`. We plug in our Standard deviations for the two different samples as our two sigmas in this formula. We then plug the appropriate sample sizes into the denominators. This gives us a Standard error of 0.462.

<img src="images/two_populations_mean_18.png" alt="" style="width: 200px;"/>

Remember, we're going to use 0.462 as our Standard deviation, so to find our z score we use this formula. 

<img src="images/two_populations_mean_19.png" alt="" style="width: 300px;"/>

As you can see, that big denominator is just the Standard deviation we've just calculated, 0.462. X1 minus X2 is the difference between our sample means. That was 2.10. And from our Null hypothesis we know that mu1 minus mu2 is equal to 0. So we find that our z score is 4.54. 

Remember, on our chart, the threshold to reject the Null hypothesis was that 2.10 would be more than 2.33 Standard deviations from the center of our distribution, 0. Well, as we just saw, we are far beyond 2.33 Standard deviations from the mean. We are 4.54 Standard deviations from the mean. This puts us in the zone where we can reject our Null hypothesis. This means that we reject that the online video and textbook were equally effective in preparing students, which means we feel there is strong support for the alternative hypothesis. It does seem that the online video is more effective than the textbook in preparing students for the exam.
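The test above can be sketched in a few lines of Python. This is a minimal illustration assuming the one-tailed, large-sample z test described in this section; `statistics.NormalDist` stands in for the z chart, and the variable names are illustrative.

```python
import math
from statistics import NormalDist

# Observed difference between sample means, and the difference under H0.
mean_diff = 16.2 - 14.1
hypothesized_diff = 0.0

# Standard error of the difference between the two sample means.
se = math.sqrt(2.5**2 / 120 + 3.6**2 / 80)

# Right-tailed test at the 1% significance level.
alpha = 0.01
z_stat = (mean_diff - hypothesized_diff) / se
z_crit = NormalDist().inv_cdf(1 - alpha)

reject_null = z_stat > z_crit
print(f"z = {z_stat:.2f}, critical z = {z_crit:.2f}, reject H0: {reject_null}")
```

Because the test statistic lands well past the critical value, the sketch reaches the same decision as the worked example.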

### Example 1
What is a null hypothesis?
- a proposal that cannot be disputed by statistics
- a proposal that all phenomena are unrelated
- (correct) a statement of that which is presumed true in the absence of evidence
- a statement indicating that nothing is true

We wish to find out if information or data can lead us to reject the null hypothesis.

### Example 2
```
Given two samples (one with size 100 and a standard deviation of 4.2, and the other with size 200 and a standard deviation of 9.8), what is the standard error of the difference between the two sample means?
SE = sqrt(4.2^2/100 + 9.8^2/200) = 0.81
Divide each sample variance by its sample size, add the two results, and take the square root.
```

### Example 3
Why is re-randomizing population samples useful?
- It permits the use of data visualization techniques.
- It reduces the data analysis problem to simple arithmetic.
- (correct) It demonstrates the probability of obtaining the original result by chance.
- It allows the presentation of graphical data.

This is a way of deciding whether a result was a fluke.

---
<a id='toc04'></a>

## Chi-Square

### Introduction to chi-square
I know it may seem strange, but there are some parts of the year where the birth rate is a bit higher than in others. Yes, there are more babies born in the summer and fall months than in the winter and spring months. Let's say that at a recent conference, a certain hospital administrator reported that at her hospital, historically, the birth rates by season of the year were as follows. 

<img src="images/chi_square_01.png" alt="" style="width: 600px;"/>

So, 15% of their annual births were in winter. 25% in spring, and 30% for both summer and fall. In an effort to see if the hospital's birth rates followed the stated seasonal distribution, the birth totals by season for last year were collected. And here they are. 

<img src="images/chi_square_02.png" alt="" style="width: 600px;"/>

So in winter, they had 45 births. In spring 48, summer 55, and 52 births in fall. As you can see, the numbers for last year do not seem to match up neatly with the historical rates that were quoted by the hospital administrator. For the 200 total births for this one year, 22.5% of the births occurred in winter versus the expected percentage, 15%. We can also see the data for the other seasons. 

Does this look like the hospital administrator's report was inaccurate? Before you make a judgement, we need to remember this data is for only a single year. But suppose we wanted to know whether the observed frequencies for this one year are consistent with the seasonal birth rates quoted by the hospital administrator. What can we do? Well, we could utilize something called a `Goodness of Fit Test`. This is one type of `chi-square technique that we can use to perform hypothesis tests that compare two or more populations`. You may be wondering, why can't we just use a t-test here? Well, notice that **our data is divided into categories**. The categories are the seasons. `Chi-squared, or the Goodness of Fit Test, is more appropriate for evaluating data sets in which data is categorized`. With a t-test you're evaluating the null hypothesis when two sets of data are collected but not categorized. This Goodness of Fit Test will help us decide if our observed data for this single year follows the probability distribution that was provided.

### Curves and distribution
Chi-square is an interesting distribution in that, as you might guess, we square something. But what is it that we square? Let's say we have a normal distribution. The distribution is centered at zero and the x-axis represents the different numbers in this distribution. 

<img src="images/chi_square_03.png" alt="" style="width: 300px;"/>

These values on the x-axis are the ones that we're going to square. Since we're squaring these values, our chi-square distribution will only have positive numbers. The height of the curve represents the likelihood of that particular outcome. In other words, the y-axis represents probability. 

<img src="images/chi_square_04.png" alt="" style="width: 300px;"/>

According to the distribution, most of the numbers in the distribution are near zero. So, our chi-square distributions will often look like this. 

Now, chi-square distributions allow us to see how multiple independent variables interact. In other words, instead of just one normally distributed variable, we can have two or more. Using our seasonal birth rate example, imagine that winter is normally distributed, spring is normally distributed, and so are summer and fall. Each is an individual and independent normal variable. Each has its own normal distribution curve. For each season, we take a value from its distribution curve. Those values would be squared and summed. 

One other important factor we'll need to consider is the number of degrees of freedom. You might remember that the number of degrees of freedom is often just our sample size minus one. Why do we need the number of degrees of freedom? 

<img src="images/chi_square_05.png" alt="" style="width: 600px;"/>

Well, `for each degree of freedom, we have a different chi-square distribution curve, and the degrees of freedom tell us the mean of the associated curve`. So, where the mean of a curve with one degree of freedom would be 1.0, the mean of a curve with three degrees of freedom would be 3.0. I know it doesn't look like it but remember, this tail goes off right toward infinity. So, the mean of the curve with five degrees of freedom would be five and, of course, the mean of the curve with 10 degrees of freedom would be 10. 
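To see the link between degrees of freedom and the mean, here is a small simulation sketch in Python. It assumes each chi-square value is built as described above, by squaring and summing independent standard normal draws; the seed, the number of draws, and the function name are arbitrary choices for this illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable


def chi_square_draw(df):
    """One simulated value: the sum of `df` squared standard normal draws."""
    return sum(random.gauss(0, 1) ** 2 for _ in range(df))


n_draws = 20_000
sample_means = {}
for df in (1, 3, 5, 10):
    draws = [chi_square_draw(df) for _ in range(n_draws)]
    sample_means[df] = sum(draws) / n_draws
    print(f"df = {df:2d}: simulated mean = {sample_means[df]:.2f}, "
          f"smallest value = {min(draws):.4f}")
```

Each simulated mean should land close to its degrees of freedom, and no simulated value is negative because every term is a square.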

<img src="images/chi_square_06.png" alt="" style="width: 400px;"/>

As you can see, the greater our degrees of freedom, the closer our chi-square distribution gets to a normal distribution. And just like with Z distributions and T distributions, we have a `table with our chi-square critical values`. Here's a small excerpt of our chi-square distribution table. 

<img src="images/chi_square_07.png" alt="" style="width: 600px;"/>

How do you read the chart? We identify our degrees of freedom. 

<img src="images/chi_square_08.png" alt="" style="width: 600px;"/>

This would be on the left-hand side of the chart. As with all of our distribution charts, we then identify a probability threshold. This is along the top of our chart. 

<img src="images/chi_square_09.png" alt="" style="width: 600px;"/>

So, for five degrees of freedom, if we wanted to know the critical chi-square value for a 10% significance level, our chi-square value would be 9.236. 

<img src="images/chi_square_10.png" alt="" style="width: 600px;"/>

This is the value which we would compare to our calculated chi-square. So, with this bit of knowledge, let's go ahead and complete our test of the seasonal birth distribution.
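If no table is at hand, the tail probability behind a table entry can be checked with the standard library. The sketch below is an addition to these notes: it uses a closed-form expression for the chi-square upper tail that holds for whole-number degrees of freedom, and the function name `chi2_upper_tail` is hypothetical.

```python
import math


def chi2_upper_tail(x, df):
    """P(X > x) for a chi-square variable with a whole number of degrees of freedom."""
    half = x / 2.0
    if df % 2 == 0:
        # Even df: finite sum of Poisson-like terms.
        terms = sum(half**i / math.factorial(i) for i in range(df // 2))
        return math.exp(-half) * terms
    # Odd df: a normal-tail term plus a finite sum of half-integer powers.
    normal_part = math.erfc(math.sqrt(half))
    terms = sum(half ** (i - 0.5) / math.gamma(i + 0.5)
                for i in range(1, (df - 1) // 2 + 1))
    return normal_part + math.exp(-half) * terms


# Table entry discussed above: 5 degrees of freedom, 10% significance level.
print(f"P(chi-square > 9.236 | df = 5) = {chi2_upper_tail(9.236, 5):.3f}")
```

A result near 0.10 confirms that 9.236 is the value cutting off the upper 10% of the curve with five degrees of freedom.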

### Goodness-of-fit test
So before we perform our chi-square goodness of fit test, let's recap our situation. Let's take a look at our stated distribution one more time. 

<img src="images/chi_square_11.png" alt="" style="width: 600px;"/>

In an effort to see if the hospital's birth rates follow the stated seasonal distribution, the birth totals by season for last year were collected. There were a total of 200 births. As you can see, the numbers for last year do not seem to match up neatly with the historical rates that were quoted by the hospital administrator. For the 200 total births for this one year, 45 total babies were born in winter. This means 22.5% of the births occurred in winter versus the expected percentage 15%. We can also see the data for the other seasons. 

Now, we want to know if the observed frequencies for this one year provide sufficient evidence to support the seasonal birth rates quoted by the hospital administrator. To do this, we perform a `goodness of fit test`. This is `a type of chi-square hypothesis test used to compare two or more populations`. 

<img src="images/chi_square_12.png" alt="" style="width: 600px;"/>

So, let's begin our goodness of fit hypothesis test. Our null hypothesis, h naught, is as follows. In this case, we will say that the hospital administrator's distribution was accurate, and of course, our alternative hypothesis would then be that the stated distribution was not accurate. Let's set our significance level at 5%. Using the birth data from our table, we can now calculate a chi-square test statistic using this formula. 

<img src="images/chi_square_13.png" alt="" style="width: 600px;"/>

Notice this is not an X. It's the Greek letter chi. Our chi-square test statistic is the sum, over all categories, of the observed value minus the expected value, squared, divided by the expected value. This means that for each season we calculate a chi-square value, and then we add up those individual chi-square values. So let's go ahead and do this. 

<img src="images/chi_square_14.png" alt="" style="width: 600px;"/>

Let's start with winter. Our chi-square for winter is our observed value, 45, minus our expected value, 30, squared, divided by our expected value, 30. This gives us a chi-square for winter of 7.50. 

<img src="images/chi_square_15.png" alt="" style="width: 600px;"/>

We then do the same calculation for spring where we had 48 observed births, and we expected 50 births. As you can see, our chi-square for spring is our 48 observed births minus our 50 expected births, squared, divided by our expected births, 50. This gives us a chi-square for spring of 0.08. Notice, since we always square the numerator, our chi-square values will always be positive. So, here are the chi-square values for all four seasons.

<img src="images/chi_square_16.png" alt="" style="width: 600px;"/>

Notice, our chi-square values are very small when the observed value is very close to our expected value, so `the smaller our chi-square value, the better the goodness of fit`. Now that we have all four seasonal chi-square values, we can add them and get our chi-square test statistic. We add up 7.50, 0.08, 0.42, and 1.07, and we get our chi-square test statistic of 9.07 (or 9.06 if we add the unrounded values). 

<img src="images/chi_square_17.png" alt="" style="width: 600px;"/>

Now that we have our chi-square test statistic, we need to compare that to our chi-square critical value, so let's find our degrees of freedom first. For chi-square, our degrees of freedom are expressed as k minus one, where k is the number of categories. We had four seasons, so k is equal to four. This means we have three degrees of freedom. 

<img src="images/chi_square_07.png" alt="" style="width: 600px;"/>

We then go to our chi-square table. We find the row for three degrees of freedom. Our significance level is 5%, so we go to the column labeled 0.05, and we find that our chi-square critical value for three degrees of freedom and a 5% significance level is 7.815. 

<img src="images/chi_square_18.png" alt="" style="width: 600px;"/>


Let's look at this on our chi-square distribution. 

<img src="images/chi_square_19.png" alt="" style="width: 600px;"/>

7.815 is here. What does this mean? If we are to the left of 7.815, our one year of birth data would be a relatively likely outcome given the stated distribution. 

<img src="images/chi_square_20.png" alt="" style="width: 600px;"/>

On the other hand, if we are on the right side of 7.815, our one year of birth data would be a rather extreme outcome given the stated distribution.

<img src="images/chi_square_21.png" alt="" style="width: 600px;"/>

Our calculated chi-square value was about 9.06 (9.07 if we add the rounded seasonal values). This means we are to the right of 7.815, so we reject our null hypothesis. The goodness of fit test helped us see that, at the 5% significance level, our single year of birth data would be too extreme an outcome to be consistent with the hospital administrator's stated seasonal distribution.
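The full goodness-of-fit calculation can be sketched in Python as follows. The observed counts, stated percentages, and the critical value 7.815 are taken from the example; the variable names are illustrative.

```python
# Observed births last year and the administrator's stated seasonal shares.
observed = {"winter": 45, "spring": 48, "summer": 55, "fall": 52}
stated_share = {"winter": 0.15, "spring": 0.25, "summer": 0.30, "fall": 0.30}

total_births = sum(observed.values())
expected = {season: stated_share[season] * total_births for season in observed}

# Per-season contribution: (observed - expected)^2 / expected.
contributions = {
    season: (observed[season] - expected[season]) ** 2 / expected[season]
    for season in observed
}
chi_square = sum(contributions.values())

degrees_of_freedom = len(observed) - 1   # k - 1 categories
critical_value = 7.815                   # table value for df = 3, alpha = 0.05

for season, value in contributions.items():
    print(f"{season:>6}: {value:.2f}")
print(f"chi-square = {chi_square:.2f}, df = {degrees_of_freedom}")
print("reject H0" if chi_square > critical_value else "do not reject H0")
```

Summing the unrounded contributions gives a statistic of about 9.06, which is past the critical value, so the sketch also rejects the null hypothesis.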

### Example 1
```
For a chi-square calculation with five categories, how many degrees of freedom are there?
4
DOF= k-1
```

### Example 2
```
What is chi-square for one observation of 10.2 with an expected value of 13.4?
X^2 = (10.2 - 13.4)^2 / 13.4 = 0.76
It is the square of the deviation divided by the expected value.
```

### Example 3
When is chi-square useful?
- when there are more than five degrees of freedom
- when the data is widely scattered
- (correct) when there is more than one independent variable
- when the data is collected at different times

One often uses chi-square when data is divided into categories.

---
<a id='toc05'></a>

## ANOVA - Analysis of Variance

### What is analysis of variance?
A luxury resort recently gathered survey data from their guests. These results were all captured within a one-week period in June. The hotel guests were asked to rate their resort experience from a score of zero to ten. Here are the survey results reported by these hotel guests, broken down by the age of the guest. 

<img src="images/anova_01.png" alt="" style="width: 600px;"/>

The question is, are these scores different because of the difference in the age of the guest, or is it possible that the reported scores and their differences were the result of random chance? 

In this section, we're going to discuss the Analysis of Variance, often referred to as ANOVA. `ANOVA is a procedure used to determine if the variation between reported output is the result of some particular factor, or if the variation is simply the result of randomness`. 

In this case, we have `one factor`: age. We also consider the `number of levels`. In this case, we had three levels, since guests were divided into three age ranges. We're going to look at ANOVA procedures in action. But, before we begin, we need to understand that `ANOVA relies on some assumptions`: each population in our comparison is normally distributed; the observations are independent of one another, so the guests do not influence each other's opinions; and each of the populations being compared has an equivalent variance. 

<img src="images/anova_02.png" alt="" style="width: 600px;"/>

We're going to look at `one-way ANOVA`, a procedure that allows us to compare the means of different levels of one factor. This is the most basic form of ANOVA, but it will help us understand the basic goals and capabilities of ANOVA. There are other types of ANOVA, though. For example, we have randomized block ANOVA. `Randomized block ANOVA` would allow us to see if other factors may be influencing the outcomes. For example, you may think the annual income of the hotel guests plays a role in the survey scores. There is also `two-way ANOVA`. Similar to one-way ANOVA, we are comparing means from different levels. But, as you might guess, here we are using two factors. So here, we might be able to look at survey scores based on the age group, our first factor, and also based on the type of room the guest stayed in during their visit, our second factor. Two-way ANOVA can allow us to look at the interaction between these two factors.

### One-way ANOVA and the total sum of squares (SST)
Suppose there are four different mobile service companies: Air Mobile, Binge Tech, ComMobile, and Data Roam. Customers are asked to rate their mobile service on a scale of 1 to 10, with 1 being very poor and 10 being excellent. Here is the individual data for four different customers for each company. 

<img src="images/anova_03.png" alt="" style="width: 600px;"/>

Before we move on, I want to note that while our very simple example has the same number of responses for each of the four companies, `ANOVA actually allows us to have a different number of data points for each individual level`. In any case, if we calculate the mean score for each company, we find Air has a mean score of four, Binge and Data have means of five, and ComMobile has a mean of six. If we average all 16 individual scores, we get the `grand mean` of all data values. Thus, our grand mean, the average of all 16 data values, is 5.0. We are now in a position to calculate the `total sum of squares`. This is easy, but it requires a bit of work. Let's begin with our Air Mobile data. We will take our first data point, five, and subtract from it the grand mean, five. So, we have five minus five and we're going to square that. We will then do this for every data point under Air Mobile. So, we have three minus our grand mean of five and we'll square that. We have five minus our grand mean of five and we'll square that. Finally, from our last data point, three, we'll subtract the grand mean again, five, and we'll square that. 

<img src="images/anova_04.png" alt="" style="width: 600px;"/>

Actually, we're going to do this for all 16 values. The values for Air, Binge, ComMobile, and Data Roam. If we add all these up, the sum of squares for Air is eight, the sum of squares for Binge is 14, the sum of squares for ComMobile is 10, and the sum of squares for Data is 20. If we add up all of those, we get a `total sum of squares` of 52. 

<img src="images/anova_05.png" alt="" style="width: 600px;"/>

Notice, the total sum of squares is often noted as `SST`. What does this mean? It means that the total amount of variation between each data value and the grand mean is 52. 
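Here is a short Python sketch of the Air Mobile step and the final total. Only the four Air Mobile scores are spelled out in the text, so the other three companies enter through the sums of squares quoted above; the variable names are illustrative.

```python
grand_mean = 5.0

# The four Air Mobile scores walked through in the text.
air_mobile = [5, 3, 5, 3]
air_squares = [(score - grand_mean) ** 2 for score in air_mobile]
print("Air Mobile squared deviations:", air_squares)

# Sums of squares about the grand mean, per company.
# Air Mobile is computed here; the other three are the values quoted above.
contributions = {
    "Air Mobile": sum(air_squares),
    "Binge Tech": 14,
    "ComMobile": 10,
    "Data Roam": 20,
}
sst = sum(contributions.values())
print("SST =", sst)
```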

So far, ANOVA has allowed us to calculate the level of variance between all 16 points in our complete data set and the grand mean of the entire set but if we look at our table of data, we can see that while there is variance between all the data points, there's also variance between the data values for each company. Let's calculate that next.

### Variance within and variance between (SSW and SSB)
So far, for our mobile service data, ANOVA has allowed us to calculate the level of variance between all 16 data values in our complete data set and the grand mean of the entire set. But if we look at our table of data, we can see that while there is variance between all the data points and the grand mean, there is also variance between the data values for each company and the mean score for that company.

<img src="images/anova_06.png" alt="" style="width: 600px;"/>

This type of variance is called the variance within. So, let's find the variance between the individual data values for Air Mobile and the mean score for Air Mobile, which is 4.0. Just as before, we will add up our squares. The difference this time is that instead of taking our data value, five, and subtracting the grand mean, here we will subtract the mean for Air Mobile, which is four. We do the same for the other three data values under Air Mobile. If we add all of our squares, we get a sum of squares within for our Air Mobile data of four. But I don't want to do this for only Air Mobile. I'm going to do this for all four companies. The important thing to remember is that instead of subtracting our grand mean, we will be subtracting the individual means for each company. 

<img src="images/anova_07.png" alt="" style="width: 600px;"/>

So, you can see that this is very similar to calculating our total sum of squares. The big difference is that for Air, we subtract four. For Binge and Data, we subtract five, since both of those companies had mean scores of five. And for ComMobile, we subtract their mean score, six. So, when we add up all of these squares, we get 44. That means that our `total sum of squares within` each company's data, often noted as `SSW`, is 44. We'll come back to this number shortly. 

There's another type of variance: the variance between the mean score for each company and the grand mean. Let's find the squares between each company's mean and the grand mean. Remember, the grand mean, the average of all 16 of our data values, is 5.0. So, for Air Mobile, we take the Air Mobile mean, 4.0, and subtract the grand mean, five, and, of course, we square it. Here's what we get for each company. But since each company had four data values, we multiply each square by four. Why? Remember, each company's mean was made up of four data values. So, we need to multiply it by four so each square is representative of each data value. We add all of these up and we find that our `sum of squares between` the companies, often noted as `SSB`, is eight. 

<img src="images/anova_08.png" alt="" style="width: 600px;"/>

So, let's recap. We know that the total sum of squares, SST, was 52. We found that our sum of squares within, SSW, was 44. And we also found that our sum of squares between, SSB, was eight. Look at that. The sum of squares within, 44, and the sum of squares between, eight. They add up to 52 which just happens to be our total sum of squares. This didn't just happen by chance. The sum of squares within plus the sum of squares between always gives us the total sum of squares.
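The relationship between the three sums of squares can be verified with a short Python sketch. One loud caveat: only the Air Mobile scores (5, 3, 5, 3) are quoted in the text. The scores for the other three companies below are hypothetical values chosen so that they reproduce the stated means and sums of squares; they may differ from the table in the image.

```python
from statistics import mean

scores = {
    "Air Mobile": [5, 3, 5, 3],   # quoted in the text
    "Binge Tech": [8, 4, 3, 5],   # hypothetical, mean 5
    "ComMobile": [8, 5, 5, 6],    # hypothetical, mean 6
    "Data Roam": [8, 2, 6, 4],    # hypothetical, mean 5
}

all_scores = [x for group in scores.values() for x in group]
grand_mean = mean(all_scores)

# Total: every value against the grand mean.
sst = sum((x - grand_mean) ** 2 for x in all_scores)

# Within: every value against its own company's mean.
ssw = sum((x - mean(group)) ** 2 for group in scores.values() for x in group)

# Between: each company's mean against the grand mean, weighted by group size.
ssb = sum(len(group) * (mean(group) - grand_mean) ** 2 for group in scores.values())

print(f"SST = {sst}, SSW = {ssw}, SSB = {ssb}")
print("SSW + SSB == SST:", ssw + ssb == sst)
```

Because each group is weighted by its own size in `ssb`, the same sketch also works when the companies have different numbers of responses.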

<img src="images/anova_09.png" alt="" style="width: 300px;"/>

### Hypothesis test and f-statistic
So far, we've been able to calculate our total sum of squares, our sum of squares within, and our sum of squares between. We also found the relationship between these three values. Our total sum of squares is the sum of the squares within and the squares between. The next, and perhaps obvious, question would be: so, what can we do with this? Well, we can start to test our data set. 

<img src="images/anova_10.png" alt="" style="width: 600px;"/>

Presently, it looks like ComMobile is providing better service than its competitors. And it looks like Air Mobile is providing the worst service. I'm simply basing this off of their mean values. The question is, is it possible that these data values happen by chance and that perhaps the services are equal? In other words, perhaps if the entire population of mobile service users got to try all four companies we would find that there really is not a difference between these companies. 

So, if we wanted to establish a hypothesis test, we would begin by stating our hypotheses. 

<img src="images/anova_11.png" alt="" style="width: 600px;"/>

Here's the nice thing about ANOVA: the null hypothesis is always the same. `The null hypothesis always states that all population means are equal`. So for our situation, H-naught is that the population mean for all mobile service providers is equal. Which means our alternative hypothesis states that not all of the population means are equal. So, if we reject the null hypothesis, this would help us see that there is a difference between the four companies, and that some companies are better than others. Let's establish a **5% significance level**. The question is, how will we test this? Well, for this we're going to introduce something new, called the F-statistic. 

<img src="images/anova_12.png" alt="" style="width: 600px;"/>

The `F-statistic` is our SSB divided by m minus one over SSW divided by n-sub t minus m. We can already sort of see what this is doing. `The formula is comparing the variance between the companies versus the variance within the individual company data`. So, if our F-statistic is big, that means there's probably a big difference between the companies. This pushes us to reject the null hypothesis. But of course, if the F-statistic turns out to be small, that means there probably is not a big difference between the companies. This means most of our variance is the result of chance. This would guide us to not reject our null hypothesis. 

<img src="images/anova_13.png" alt="" style="width: 300px;"/>

So, let's get to work. As we look at our F-statistic formula, we know what `SSB` and `SSW` stand for. But how about m and n-sub t? Well, `m` is the number of levels, or groups. In this case, we had four companies, so m is equal to four. That means m minus one is three. You may also recognize this as our `degrees of freedom between the companies`. Now we need n-sub t. `N-sub t` is the total number of observations in our data set. We had 16 different values in our data set. So this is our n-sub t. Therefore, n-sub t minus m is equal to 12. Notice n-sub t minus m is our `degrees of freedom within`. We have 12 degrees of freedom within the data set. Why? Because for each of the four companies, we had four values. That means for each individual company, we had three degrees of freedom. Four companies, each with three degrees of freedom, that gives us a total of twelve degrees of freedom. So, now we have everything we need to calculate our F-statistic. Our F-statistic is equal to 0.727. 
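The F-statistic calculation can be sketched in Python using only the summary values from this example. The names `mean_square_between` and `mean_square_within` are conventional labels for the numerator and denominator, introduced here for readability.

```python
# Summary values from the mobile service example.
ssb, ssw = 8, 44      # sum of squares between and within
m = 4                 # number of levels (companies)
n_t = 16              # total number of observations

df_between = m - 1    # degrees of freedom in the numerator
df_within = n_t - m   # degrees of freedom in the denominator

mean_square_between = ssb / df_between
mean_square_within = ssw / df_within
f_stat = mean_square_between / mean_square_within

print(f"df between = {df_between}, df within = {df_within}")
print(f"F = {f_stat:.3f}")
```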

<img src="images/anova_14.png" alt="" style="width: 300px;"/>

Now it's time to go to our F distribution table. This could get ugly for you because `there is a different F distribution table for every level of significance`. So, there's a whole table for a significance level of 1%. Another for 10%, and still another for 5% significance level. Here's a small excerpt of this 5% table. 

<img src="images/anova_15.png" alt="" style="width: 600px;"/>

Notice along the top we have the degrees of freedom in our numerator. We had three degrees of freedom in our numerator. 

<img src="images/anova_16.png" alt="" style="width: 600px;"/>

And along the left, we have the degrees of freedom in our denominator. We had 12 degrees of freedom in our denominator. 

<img src="images/anova_17.png" alt="" style="width: 600px;"/>

So, if we go to the intersection of those two, we find that our critical F value is 3.49. 

<img src="images/anova_18.png" alt="" style="width: 600px;"/>

If our F-statistic is greater than 3.49, we reject our null hypothesis. If our F-statistic is less than 3.49, we do not reject our null hypothesis. Well, our F-statistic was 0.727, so we definitely do not reject our null hypothesis. This means that there is not enough evidence to support that there is a significant level of difference between these four mobile service providers. It's likely that the reported difference in their mean scores occurred by chance.

### Example 1
```
For a dataset containing five groups and each containing six samples, how many degrees of freedom are there within the total data set?
25 = 30 samples - 5 groups
```

### Example 2
Which statement is true for the total sum of squares?
- It is independent of the total number of observations.
- (correct) It depends upon the total number of observations.
- It is proportional to the total number of observations.
- It can be negative.

It is the sum of the square of each deviation.

### Example 3
In ANOVA, each population variance _____.
- has zero value
- is unique
- (correct) is equal to the others
- takes on one of two values

All the variances are the same.

---
<a id='toc06'></a>

## Introduction to Regression

### What is regression?
Do you ever wonder about the correlation between variables? Whether they're somehow related? Well, `a regression helps us investigate the relationship between two variables`. 

For example, let's take a look at a couple of questions that are asked pretty frequently about education. Does education level impact lifetime earnings? Or, does more studying improve test scores? We can infer some things about these questions through the use of regression. The basics are pretty simple. Usually, we express this graphically with a simple scatter plot. 

<img src="images/regression_01.png" alt="" style="width: 600px;"/>

So to begin, on the X axis we have one variable, study time. On our Y axis we have a second variable, exam score. Let's say these are our data points. 

<img src="images/regression_02.png" alt="" style="width: 400px;"/>

So now, we sprinkle our data points onto our graph. Once we have our data points, the goal of regression is to help us find a line that will fit our data points. Now, obviously, we aren't going to have a single line that can hit every one of these data points, so `regression analysis tries to find the formula for the line that will best fit this distribution`. 

<img src="images/regression_03.png" alt="" style="width: 400px;"/>

Now before we move on, let's refresh our knowledge about lines and the formulas that describe them. The formula for a line is often expressed in slope intercept form. Y equals MX plus B. What does that mean? Let's take this line. 

<img src="images/regression_04.png" alt="" style="width: 400px;"/>

The variable B represents the Y intercept, the point where the line crosses the Y axis. Then we have M. M represents the slope of our line. It tells us how steep our line is. A positive M indicates a positive slope, meaning that the line is rising from left to right. A negative M tells us our line is falling, from left to right. An M of zero means our line is flat horizontally. 

So, let's say that M is equal to two and B is equal to three. That means our line looks like this. 

<img src="images/regression_05.png" alt="" style="width: 400px;"/>

If we plug in one for X, into the formula, we find that Y would be five. This line tells us that for X equals one, we would expect Y to be five. But it's rarely quite that easy in regression. Let's go back to our example. 
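As a quick check of the slope-intercept form, here is a small sketch using the M and B from the text. The function name `line` is just an illustrative choice.

```python
def line(x, m=2, b=3):
    """Slope-intercept form: y = m*x + b."""
    return m * x + b


print(line(1))  # 5, the value the text finds for X equals one
print(line(0))  # 3, the Y intercept
```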

<img src="images/regression_06.png" alt="" style="width: 400px;"/> 

Let's say that this was found to be our regression line. As you can see, the regression line doesn't hit any of our points directly. The line is simply the one that best fits this data. And I think this is a good setup for our next few sections. We're going to learn how to find the formula for our regression line. We're going to use something called `R squared` to understand the relationship in variation between our X and Y variables. And we'll also look at something called the `correlation coefficient`, which will help us understand our regression line and how it fits our data.

### The best-fitting line
The starters for a certain men's college basketball team have the following heights and weights. 

<img src="images/regression_07.png" alt="" style="width: 600px;"/>

Here's what this would look like on our scatter plot. 

<img src="images/regression_08.png" alt="" style="width: 400px;"/>

Now, let's find the line that best fits this data. Remember, we're going to express the line using the slope intercept form, y = ax + b. 

<img src="images/regression_09.png" alt="" style="width: 400px;"/>

In algebra, we often see slope expressed as m. But here, we'll be using a as our slope. B will be the y-intercept. You're going to want to set up six columns. You'll then want to start to fill in your data. 

<img src="images/regression_10.png" alt="" style="width: 600px;"/>

Here are our given data points. For our fourth column, xy, we multiply height and weight. For our fifth column, x squared, we square our heights. And for our sixth column, y squared, we square our weights. We then add up columns two through six. We now have everything we need. The sums at the bottom of each column will be what we end up using to find our slope and intercept. So, let's plug our sums into the formula for our slope, a. 

<img src="images/regression_11.png" alt="" style="width: 400px;"/>

Once we calculate this, we find our slope is 8.832, which means that, for every inch in height, the weight of the player is expected to increase by 8.832 pounds. Let's now move on to our y-intercept, b. 

<img src="images/regression_12.png" alt="" style="width: 400px;"/>

Again, we plug the sums from our chart into this formula. Notice, we'll be using a, our calculated slope, in this formula. Once we plug everything in and calculate, we find our y-intercept, b, is -479.3 pounds. 
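The column-sum procedure can be sketched in Python. This assumes the formulas in the figures are the standard least-squares ones, a = (nΣxy - ΣxΣy) / (nΣx² - (Σx)²) and b = (Σy - aΣx) / n. The function name `fit_line` and the data points are hypothetical, chosen so the result is easy to verify by hand; the players' table itself is only available in the figures. Note that the y squared column is not needed for a and b under these formulas.

```python
def fit_line(xs, ys):
    """Least-squares slope a and intercept b for y = a*x + b, from column sums."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))   # the "xy" column
    sum_x2 = sum(x * x for x in xs)               # the "x squared" column

    a = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = (sum_y - a * sum_x) / n
    return a, b


# Hypothetical points that lie exactly on y = 2x + 3
xs = [1, 2, 3, 4]
ys = [5, 7, 9, 11]
a, b = fit_line(xs, ys)
print(a, b)  # 2.0 3.0
```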

So, based on our limited data set, if we had a player that was zero inches tall, we would expect him to weigh -479.3 pounds. Hey, we never said the line was perfect. In any case, we can now use the line to make some educated guesses about player weights and heights. Imagine that we have a player that's five feet, 10 inches tall. That's 70 inches. I can plug in 70 inches, for the value of x in our formula. According to the regression line, we might expect him to weigh 138.9 pounds. 
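The prediction above is easy to reproduce. A minimal sketch, using the slope and intercept quoted in the text; the names `SLOPE`, `INTERCEPT` and `predict_weight` are illustrative only.

```python
SLOPE = 8.832        # pounds per inch, from the text
INTERCEPT = -479.3   # pounds, from the text


def predict_weight(height_in):
    """Expected weight in pounds for a height given in inches."""
    return SLOPE * height_in + INTERCEPT


print(round(predict_weight(70), 1))  # 138.9
print(round(predict_weight(0), 1))   # -479.3, the (unrealistic) y-intercept
```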

So, we have our line, and while it may not work for everyone, perhaps it works well for male college basketball players. The question is, was this formula a good fit? Or, just the best we could do, given the data? To answer that question, we'll be introducing a new concept, the `coefficient of determination`.

### The coefficient of determination - R2
For our sample of basketball player weights and heights, we have only five values in each category. Graphing these on our scatter plot is simple. Evaluating them by sight is also rather simple, and the appropriate regression line was relatively easy to anticipate visually. For our data, we could see a pattern, and we could tell the formula for our regression line was a good fit, but what happens when you have a huge data set, when the scatter plot is messy, and the regression line is not necessarily logical? How can you tell if the regression line is a good fit for your data? Well, for this, we have something called `r-squared`. It's called `the coefficient of determination`. `R-squared is a number between zero and one. Zero indicates that our regression line is a very poor fit for our data points. An r-squared of 1.0 on the other hand indicates that the line is a perfect fit for our data`. So, how do we calculate r-squared? Well, the formula for r-squared looks pretty simple. 

<img src="images/regression_13.png" alt="" style="width: 400px;"/>

R-squared is equal to SSR divided by SST. `SSR` is equal to the sum of squares regression. `SST` is the total sum of squares. Calculating SSR and SST is not going to be fun though. Let's go to the table with our original data. 

<img src="images/regression_14.png" alt="" style="width: 400px;"/>

Notice, we also calculated the means for x and y values. Next, we'll add a column with our expected y's. In other words, using the formula for our regression line from our previous section, y hat is equal to 8.83 x minus 479.3. We'll plug in each x and calculate the associated y hat. 

Now, let's calculate SSR, the sum of squares regression. 

<img src="images/regression_15.png" alt="" style="width: 400px;"/>

We'll use this formula to find the individual squared regressions. So here we have y hat minus our mean y, and we'll square that. Our first player was expected to be 156.5 pounds. This is our y hat. The mean y was 199. The squared regression for this player is 1809.7. We do the same calculation for our four other players. So, our sum of squares regression is the sum of 1809.7, 257.6, 109, 2.6, and 2094. This gives us an SSR of 4272.8. 

<img src="images/regression_16.png" alt="" style="width: 600px;"/>

Now, for our total sum of squares, SST. To calculate our individual squares here, we use a very similar formula except instead of using the predicted weight for each player, we'll use the observed weight, so here our formula is our observed y minus our mean y, and again, we're going to take the difference and square it. 

<img src="images/regression_17.png" alt="" style="width: 400px;"/>

Our first player actually weighed 160 pounds. This is our observed y. The mean y was 199. The square for this player is 1521. We do the same calculation for our four other players. So, our total sum of squares is the sum of 1521, 361, 441, 81, and 2116. This gives us an SST of 4520. 

<img src="images/regression_18.png" alt="" style="width: 600px;"/>

I want to point out that throughout all the calculations, we have rounded differently, so your answers may be slightly different from mine, and that's perfectly okay. Going through the process is what matters most. 

So now, we're ready to calculate r-squared. 

<img src="images/regression_19.png" alt="" style="width: 200px;"/>

Remember, r-squared is equal to SSR divided by SST. SSR was the sum of squares regression. That was 4272.8. SST is our total sum of squares. That's 4520. Therefore, r-squared is equal to 0.945. That's very close to 1.0 which means our regression line is an excellent fit for our data points just as we expected. 

`R-squared is 0.945. What exactly does that mean? It means that 94.5% of the variation in weight is explained by height`. The other 5.5% of the variation can be attributed to error.
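The arithmetic can be checked with a short sketch that uses only the per-player squares quoted in this section. The variable names are illustrative. Because the quoted squares are already rounded, their sum comes out about 0.1 away from the SSR in the text, which does not change R-squared at three decimals.

```python
# Per-player values quoted in the text (already rounded)
squared_regressions = [1809.7, 257.6, 109, 2.6, 2094]  # (y hat - mean y)^2
squared_totals = [1521, 361, 441, 81, 2116]            # (observed y - mean y)^2

ssr = sum(squared_regressions)   # sum of squares regression
sst = sum(squared_totals)        # total sum of squares
r_squared = ssr / sst

print(f"SSR = {ssr:.1f}, SST = {sst}, R^2 = {r_squared:.3f}")
```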

### The correlation coefficient
In our previous section, we calculated `R squared`. For our basketball player data set, R squared was 0.945. Note that the R in R squared is a capital R. In our world of regression, though, we also have a lowercase r. This `lowercase r is our correlation coefficient and there's a relationship between R squared and r`. 

<img src="images/regression_20.png" alt="" style="width: 300px;"/>

r, the correlation coefficient, is equal to the square root of R squared, and the sign of our correlation coefficient is the same as the sign of the slope of our regression line. In the case of our basketball player data set, the regression line had a slope of positive 8.83. So if we want to calculate r, the correlation coefficient, first, we know the sign would be positive. Second, we take the square root of R squared, 0.945. That means that r, the correlation coefficient, is positive 0.972. What does that mean? Well, let's look at this graphically. 
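The relationship between R squared and r can be sketched as follows. The function name `correlation_from_r_squared` is hypothetical; the inputs are the values from the basketball example.

```python
import math


def correlation_from_r_squared(r_squared, slope):
    """Square root of R squared, carrying the sign of the regression slope."""
    return math.copysign(math.sqrt(r_squared), slope)


r = correlation_from_r_squared(0.945, 8.83)
print(round(r, 3))  # 0.972
```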

<img src="images/regression_21.png" alt="" style="width: 600px;"/>

This is what an r of positive one would look like (on the right). And here is the data for an r of negative one (on the left). In both cases, the data is organized in a way where if we connected the dots, we'd get a perfect straight line. The positive and negative signs simply tell us if the line is climbing from left to right or if it's declining. 

How about an r of -0.85? 

<img src="images/regression_22.png" alt="" style="width: 300px;"/>

Here we can see that the dots are heading in a downward direction as we move to the right. We can also see that, while we can imagine a regression line, the fit of these dots on the line would not be perfect. 

Here's an r of +0.42. 

<img src="images/regression_23.png" alt="" style="width: 300px;"/>

The dots are generally following an upward trend from left to right, but any regression line we would draw would miss a majority of the dots by a rather big margin. 

How about an r of zero? 

<img src="images/regression_24.png" alt="" style="width: 300px;"/>

As you can see, no real trend is visible and the dots are just a disorganized mess. 

So while `r`, `the correlation coefficient`, does not provide specific information about the regression line, it does `tell us the tightness of fit of our data points and also the upward or downward trend of the data from left to right on our axis`. Now, `when you see someone report an r, use that as a guide to tell you about the trend of the data and whether there seems to be a strong linear relationship between the two variables`.
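The correlation coefficient can also be computed directly from raw data. The sketch below uses the standard Pearson formula, which the text does not spell out, so treat it as a supplementary illustration. The function name `pearson_r` and the data points are hypothetical; they reproduce the two extreme cases discussed above, a perfectly rising and a perfectly falling line.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and the two sums of squared deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # points on a rising straight line
falling = [10, 8, 6, 4, 2]   # points on a falling straight line

print(pearson_r(xs, rising))   # 1.0
print(pearson_r(xs, falling))  # -1.0
```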

### Example 1
```
For a coefficient of determination of 0.917 and a slope estimate of -0.342, 
what is the value of the correlation coefficient r?

r = -0.958

The result is the negative of the square root of 0.917, because the slope is negative.
```

### Example 2
```
In the equation y = -(1/12) x + 0.05, what is the slope when plotted as y versus x?

slope = -(1/12)

The slope is the multiplier of x.
```

### Example 3
```
What is the value of the coefficient of determination for a perfect linear fit?

R squared = 1.0
```

### Example 4
```
For a mean of 7 and observations of 5, 6, and 9, what is the total sum of squares?

SST = (5-7)^2 + (6-7)^2 + (9-7)^2 = 4 + 1 + 4 = 9
```

### Example 5
```
Which type of hypothesis test do we use to evaluate proportions from categorized data?

goodness of fit
```

### Example 6
```
Which statistic is used in hypothesis testing for population proportions?
z
```

### Example 7
```
Which formula becomes slightly messy when calculating confidence intervals from samples of two populations?

standard error
```

### Example 8
```
For a sample size of 10, what is the number of degrees of freedom?
9
```

### Example 9
```
For the same confidence level, the t-score is always _____ the z-score?
- equal to
- less than
- larger than (correct - the t-distribution has a lower peak than the normal distribution and more area in the tails)
```
https://www.scribbr.com/statistics/t-distribution/