# Inferential Statistics

Inferential statistics allow you to use a relatively small sample to learn about an entire population.

The primary way scientific experiments create new knowledge is by carefully setting up contrasts between groups, such as a treatment and control group.

## Table of Contents

- [Descriptive and Inferential Statistics](#intro)
    - [Descriptive statistics](#desc)
    - [Inferential Statistics](#infer)
- [Populations](#pop)
- [Subpopulations](#subpop)
- [Population Parameters versus Sample Statistics](#popsampl)
- [Tools for Inferential Statistics](#tools)
- [Sample Size and Margins of Error](#size)
- [Sampling Distributions of the Mean](#dis)
- [Confidence Intervals and Precision](#conf)

---
<a id='intro'></a>

## Descriptive and Inferential Statistics

Descriptive and inferential statistics are two broad categories in the field of statistics. Here’s the difference in a nutshell:

- **Descriptive statistics** `describe a dataset` for a particular group of objects, observations, or people. They don’t attempt to generalize beyond the set of observations.

- **Inferential statistics** `use a dataset to make conclusions about the larger population from which the sample was drawn`. These statistics generalize beyond the specific observations that are in the dataset to a larger group or population.

---
<a id='desc'></a>

### Descriptive statistics

**Descriptive statistics** describe a sample. Use descriptive statistics to summarize and graph the data for a group that you choose. This process allows you to understand that specific set of observations.

The process involves taking a potentially large number of data points in the sample and reducing them to a few meaningful summary values and graphs. This procedure gives us more insight into the data, and a clearer picture of it, than merely poring over row upon row of raw numbers.

**Descriptive statistics** frequently use `statistical measures` to describe a particular group:

- **Central tendency**: Use the `mean` or the `median` to locate the center of the dataset. This measure tells you where most values fall.

- **Dispersion**: How far out from the center do the data extend? You can use the `range` or `standard deviation` to measure the dispersion. Low dispersion indicates that values cluster more tightly around the center. Higher dispersion signifies that data points fall further away from the center. We can also graph the `frequency distribution`.

- **Skewness**: The measure tells you whether the distribution of values is `symmetric` or `skewed`.

- **Correlation**: The strength of the tendency for two variables to change together.

You can present this summary information using both `numbers` and `graphs`.
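As a rough illustration of the measures listed above, the following sketch summarizes one small group using only Python's standard library. The data values and the helper names `skewness` and `pearson_r` are made up for this example and do not come from any real study.

```python
import math
import statistics

# Hypothetical measurements for one small group (illustrative values only).
heights_cm = [150.0, 160.0, 165.0, 170.0, 180.0]
weights_kg = [50.0, 58.0, 63.0, 70.0, 79.0]


def skewness(values):
    """Population skewness: the mean cubed z-score (0 for symmetric data)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


summary = {
    "mean": statistics.fmean(heights_cm),          # central tendency
    "median": statistics.median(heights_cm),       # central tendency
    "range": max(heights_cm) - min(heights_cm),    # dispersion
    "stdev": statistics.stdev(heights_cm),         # dispersion
    "skewness": skewness(heights_cm),              # symmetry
    "correlation": pearson_r(heights_cm, weights_kg),
}
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

These numbers describe only the five observations in the lists. Nothing here generalizes to a larger group.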

---
<a id='infer'></a>

### Inferential Statistics

In most cases, it is simply impossible to measure the entire population to understand its properties. The alternative is to gather a **random sample** and then use the methodologies of **inferential statistics** to analyze the sample data.

**Inferential statistics** takes data from a sample and makes inferences about the larger population from which the sample was drawn. Because the `goal of inferential statistics is to take a sample and generalize its properties to a population`, we need to have confidence that our sample accurately reflects the population. This requirement affects our process. At a broad level, we must do the following:

1. Define the population we are studying.
2. Draw a representative sample from that population.
3. Use analyses that incorporate the sampling error.

We need a sampling procedure that tends to produce a sample that accurately reflects the population from which you draw it. **Random sampling** is a procedure that allows us to have confidence that the sample represents the population. The random nature of this process helps `avoid any systematic bias` that would invalidate our results.

**Random sampling** is a primary method `for obtaining samples that mirror the population on average`. This type of sampling produces statistics, such as the mean, that are not systematically too high or too low. In other words, the critical characteristic of **random samples** is that they produce **sample statistics** that tend to be correct on average.

Consequently, when we obtain a **random sample**, `we can generalize from the sample to the broader population`. Unfortunately, gathering a genuinely random sample can be a complicated process.

When you estimate the properties of a population from a sample, the sample statistics are unlikely to equal the actual population value exactly. For instance, your sample mean is unlikely to equal the population mean exactly.

**Sampling error** `is the difference between the sample statistic and the population value`. Inferential statistics incorporate estimates of this error into the statistical results.
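The small simulation below is one way to see both ideas at once: sampling error as the gap between a sample statistic and the population value, and random sampling as protection against systematic bias. The population is simulated (normal scores with mean 100 and standard deviation 15), and the "convenience sample" of the lowest 50 scores is a deliberately extreme, hypothetical stand-in for a biased procedure.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 scores, sorted so that "the first 50 rows"
# is a convenient but unrepresentative sample.
population = sorted(rng.gauss(100, 15) for _ in range(10_000))
parameter = statistics.fmean(population)  # the population mean

random_sample = rng.sample(population, 50)
convenience_sample = population[:50]  # the 50 lowest scores

# Sampling error = sample statistic - population parameter.
random_error = statistics.fmean(random_sample) - parameter
convenience_error = statistics.fmean(convenience_sample) - parameter

# Averaged over many random samples, the errors roughly cancel out.
errors = [
    statistics.fmean(rng.sample(population, 50)) - parameter
    for _ in range(2_000)
]
average_error = statistics.fmean(errors)

print(f"random sample error:      {random_error:+.2f}")
print(f"convenience sample error: {convenience_error:+.2f}")
print(f"average random error:     {average_error:+.3f}")
```

Any single random sample still misses the population mean by some amount, but the misses are not systematically in one direction. The convenience sample is wrong in the same direction every time.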

Summary values in **descriptive statistics** are straightforward. The average score in a specific class is a known value because we measured all individuals in that class. There is little uncertainty.

To gain the benefits of **inferential statistics**, you must understand the relationship between `populations`, `subpopulations`, `population parameters`, `samples`, and `sample statistics`.

---
<a id='pop'></a>

## Populations

**Populations** can include people, but other examples include objects, events, businesses, and so on. In statistics, there are two general types of populations.

- Populations can be the complete set of all similar items that exist. It’s a finite but potentially extensive list of members.
- A population can be a theoretical construct that is potentially infinite in size.

**Populations** share a set of attributes that you define. For example, the following are populations:

- Stars in the Milky Way galaxy.
- Parts from a production line.
- Citizens of the United States.
- 8th grade students in the State of Pennsylvania.

`Before you begin a study, you must carefully define the population that you are studying`. These populations can be narrowly defined to meet the needs of your analysis. For example, your population can be adult Swedish women who are otherwise healthy but have osteoporosis.

---
<a id='subpop'></a>

## Subpopulations

**Subpopulations** share additional attributes. For instance, the population of the United States contains the subpopulations of men and women. You can also subdivide it in other ways such as region, age, socioeconomic status, and so on. Different studies that involve the same population can divide it into different subpopulations depending on what makes sense for the data and the analyses.

Understanding the **subpopulations** in your study helps you grasp the subject matter more thoroughly. They can also help you produce statistical models that fit the data better. Subpopulations are particularly important when they have characteristics that are systematically different from those of the overall population. When you analyze your data, you need to be aware of these deeper divisions. In fact, you can treat the relevant subpopulations as additional factors in later analyses.

<img src="images/stat-infer.png" alt="" style="width: 400px;"/>

Gender is a crucial **subpopulation** that relates to height and increases our understanding of the subject matter. In future studies about height, we can include gender as a variable. That’s how scientists think about research questions and add to their knowledge.
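To make the effect of a subpopulation concrete, here is a minimal simulated sketch. The group means and standard deviations are illustrative assumptions, not values taken from the figure or from any dataset.

```python
import random
import statistics

rng = random.Random(7)

# Illustrative parameters only: two subpopulations whose mean heights (cm)
# differ systematically but whose within-group spread is the same.
groups = {
    "women": [rng.gauss(162, 7) for _ in range(500)],
    "men": [rng.gauss(176, 7) for _ in range(500)],
}
everyone = groups["women"] + groups["men"]

overall_sd = statistics.stdev(everyone)
print(f"overall: mean={statistics.fmean(everyone):.1f}, sd={overall_sd:.1f}")
for name, heights in groups.items():
    print(f"{name}: mean={statistics.fmean(heights):.1f}, "
          f"sd={statistics.stdev(heights):.1f}")
```

The pooled data are more spread out than either group on its own, because part of the overall variability is simply the difference between the groups. Including the grouping variable in a model accounts for that part.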

---
<a id='popsampl'></a>

## Population Parameters versus Sample Statistics

A **parameter** is a `value that describes a characteristic of an entire population`, such as the population mean. Because you can rarely measure a population as a whole, you usually don’t know the real value of a parameter. While we can’t measure the parameter value, it exists.

While we’ll never know the precise value of these **population parameters**, we can use inferential statistics to estimate them and incorporate a margin of error.

The population mean and standard deviation are two common parameters. `In statistics, Greek symbols usually represent population parameters`, such as `μ` (mu) for the mean and `σ` (sigma) for the standard deviation.

A **statistic** is a `characteristic of a sample`. If you collect a sample and calculate the mean and standard deviation, these are **sample statistics**. Inferential statistics allow you to use sample statistics to make conclusions about a population. However, to draw valid conclusions, you must use **random sampling** techniques, as discussed earlier.

<img src="images/stat-infer2.png" alt="" style="width: 400px;"/>

In **inferential statistics**, `sample statistics are estimates of population parameters`. For example, if we collect a random sample of adult women in the United States and measure their heights, we can calculate the sample mean and standard deviation and use them as estimates of the population parameters. The sample mean is an unbiased estimate; the sample standard deviation is only approximately unbiased, although the difference is small for reasonably large samples.

You can calculate the following types of estimates for population parameters:

- **Point estimates**: These estimates use the sample data to produce a single value that is the most likely value for the population parameter. Sample statistics, such as the mean, are typically the point estimate for the population. Unfortunately, point estimates are always wrong by an unknown amount because of **random sampling error**.

- **Interval estimates**: A range of values that likely contains the value of the population parameter. These intervals include a **margin of error** around the point estimate to account for **random sampling error**.

In short, **point estimates** are the best guess value but are guaranteed to be wrong by at least a little bit because you’re working with a sample that is small in comparison to the population. **Interval estimates** are ranges of values that probably contain the parameter value.
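The two kinds of estimate can be sketched as follows. The sample values are hypothetical, and the interval uses a normal approximation (critical value of about 1.96) purely for simplicity. With a sample this small, a t critical value would usually be preferred.

```python
import math
import statistics
from statistics import NormalDist

# Hypothetical sample of 12 measurements (illustrative values only).
sample = [98, 104, 111, 95, 102, 107, 99, 93, 110, 101, 105, 99]

# Point estimate: a single best-guess value for the population mean.
point_estimate = statistics.fmean(sample)

# Interval estimate: point estimate plus or minus a margin of error.
standard_error = statistics.stdev(sample) / math.sqrt(len(sample))
z = NormalDist().inv_cdf(0.975)  # normal approximation, 95% confidence
margin_of_error = z * standard_error
interval = (point_estimate - margin_of_error, point_estimate + margin_of_error)

print(f"point estimate:    {point_estimate:.2f}")
print(f"interval estimate: [{interval[0]:.2f}, {interval[1]:.2f}]")
```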

---
<a id='tools'></a>

## Tools for Inferential Statistics

Inferential methods can produce similar summary values as descriptive statistics, such as the mean and standard deviation. However, we use them very differently when making inferences.

### Hypothesis tests

**Hypothesis tests** use sample data to answer questions about the population values behind point estimates, such as the following:

- Is the population mean greater than or less than a particular value?
- Are the means of two or more populations different from each other?
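As a minimal sketch of the first question, the function below runs a two-sided, large-sample z-test of whether a population mean differs from a hypothesized value. The function name `one_sample_z_test` and the simulated data are assumptions for illustration. Real analyses typically use a t-test from a statistics package.

```python
import math
import random
import statistics
from statistics import NormalDist


def one_sample_z_test(sample, null_mean):
    """Two-sided large-sample z-test of H0: population mean == null_mean."""
    n = len(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    z = (statistics.fmean(sample) - null_mean) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


rng = random.Random(1)
# Simulated sample of 200 scores from a population with mean 103 (assumed).
scores = [rng.gauss(103, 15) for _ in range(200)]

z, p_value = one_sample_z_test(scores, null_mean=100)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value indicates that a sample mean this far from the hypothesized value would be unusual if the hypothesis were true.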

### Confidence intervals (CIs)

`In inferential statistics, a primary goal is to estimate population parameters`. These parameters are the unknown values for the entire population, such as the population mean and standard deviation. These parameter values are not only unknown but almost always unknowable. The **sampling error** produces uncertainty, or a **margin of error**, around our estimates. 

**Confidence intervals** incorporate the uncertainty and sampling error to create a range of values that the actual population value is likely to fall within. For example, a 95% confidence interval of [176 186] indicates that we can be 95% confident that the real population mean falls within this range.
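One way to see what "likely to fall within" means is to repeat a study many times and count how often the interval captures the true mean. The sketch below assumes a normal population with mean 100 and standard deviation 15, samples of size 30, and a known population standard deviation. All of these are simplifying assumptions.

```python
import math
import random
import statistics
from statistics import NormalDist

rng = random.Random(0)
MU, SIGMA, N, TRIALS = 100, 15, 30, 2_000  # assumed population and design
z = NormalDist().inv_cdf(0.975)            # 95% confidence

hits = 0
for _ in range(TRIALS):
    sample = [rng.gauss(MU, SIGMA) for _ in range(N)]
    mean = statistics.fmean(sample)
    margin = z * SIGMA / math.sqrt(N)  # sigma treated as known for simplicity
    if mean - margin <= MU <= mean + margin:
        hits += 1

coverage = hits / TRIALS
print(f"intervals containing the true mean: {coverage:.1%}")
```

The confidence level describes the procedure: roughly 95% of intervals built this way contain the population mean, although any single interval either does or does not.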

### Regression analysis

**Regression analysis** describes the relationship between a set of `independent variables` and a `dependent variable`. This analysis incorporates hypothesis tests that help determine whether the relationships we observe in the sample data also exist in the population.
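A bare-bones sketch of fitting one independent variable to one dependent variable by ordinary least squares is shown below. The data and the helper name `fit_line` are hypothetical, and the hypothesis tests on the coefficients that a statistics package would report are left out.

```python
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: hours studied (independent) and score (dependent).
hours = [1, 2, 3, 4, 5]
scores = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(hours, scores)
print(f"score = {intercept:.2f} + {slope:.2f} * hours")
```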

---
<a id='size'></a>

## Sample Size and Margins of Error

Having a large sample size is a good thing, but sample statistics are still always wrong to some degree. When you use samples to estimate the properties of populations, you never obtain the correct values exactly.

**Inferential statistics** is a powerful tool because it `allows you to use a relatively small sample to learn about an entire population`. However, to have any chance of obtaining good results, you must follow sampling procedures that help your sample to represent the population faithfully. Upon seeing an estimate, you should wonder - how large is the difference between the estimate and the actual population value? What is the margin of error?

The margin of error relates to the sample size. Let’s explore why larger samples are better.

The primary goal of inferential statistics is to generalize from a sample to a population. To accomplish this objective, the sample must be similar to the population. The simple truth is that `it’s more difficult for a small sample to approximate an entire population closely`. Larger samples tend to better represent the full complexities of a population.

Additionally, `larger sample sizes help you avoid unusual samples`. Think about coin tosses. You expect to obtain heads 50% of the time. If you have four coin tosses, it is not surprising to get heads 3 out of 4 times (75%). That’s a considerable distance from 50%, but it’s just one extra heads. However, if you have 100 coin tosses, it’s improbable that you’d get heads 75% of the time. It’s possible, but you’d need very fluky luck to get those 25 extra heads (50 + 25 = 75).
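The coin-toss comparison can be checked exactly with the binomial distribution. The sketch below computes the probability of getting at least 75% heads from a fair coin. The helper name `prob_at_least` is made up for this example.

```python
from math import comb


def prob_at_least(heads, tosses):
    """P(at least `heads` heads in `tosses` fair coin tosses), computed exactly."""
    favorable = sum(comb(tosses, k) for k in range(heads, tosses + 1))
    return favorable / 2 ** tosses


p_small = prob_at_least(3, 4)     # 75% heads or more in 4 tosses
p_large = prob_at_least(75, 100)  # 75% heads or more in 100 tosses

print(f"4 tosses:   {p_small:.4f}")  # 0.3125
print(f"100 tosses: {p_large:.2e}")
```

With four tosses, 75% heads or more happens almost a third of the time. With 100 tosses, the same proportion is so rare that you would not expect to see it in practice.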


---
<a id='dis'></a>

## Sampling Distributions of the Mean

A vital concept in inferential statistics is that the particular random sample that you draw for a study is just one of a large number of possible samples that you could have pulled from your population of interest. Understanding this broader context of all possible samples and how your study’s sample fits within it provides valuable information.

Suppose we draw a substantial number of random samples of the same size from the same population and calculate the sample mean for each sample. During this process, we’d observe a broad spectrum of sample means, and we can graph their distribution. 

**Sampling distributions of the mean** are distributions of sample means for samples of a particular size that you draw from a population with specific properties. These distributions represent the idea of conducting the same experiment many times and observing the distribution of sample means. **Sampling distributions** also exist for other sample statistics, such as the standard deviation, median, and proportion.

<img src="images/stat-infer3.png" alt="" style="width: 400px;"/>

The first thing to notice is that all three distributions center on the population’s mean IQ of 100 and they are distributed symmetrically around that mean. These properties indicate that the sample means are unbiased because they are not systematically too high or too low. Stated another way, sample means have an equal probability of being too high or too low. They’re correct on average. These sample means are unbiased because the simulation used random sampling.

However, the spreads of the distributions are clearly different. The `variability of these distributions` reflects the amount of **sampling error** associated with different sample sizes.

Let’s start with the `grey distribution for a sample size of five`, which is the widest. While this distribution centers on 100, the broader spread indicates that a greater percentage of sample means will fall further away from the population value in the center and instead be out in the tails. Consequently, obtaining a sample mean as low as 88 or as high as 112 is not surprising when the sample size is five. The wider spread of this distribution shows that you are more likely to obtain an unusual sample mean with a small sample size. In other words, you have greater uncertainty that your sample mean is close to the actual population value. `Estimates from smaller samples have a larger margin of error`.

Now, let’s look at the `blue distribution, which represents samples of size 60`. It’s still centered on 100, but the narrow distribution indicates that you are unlikely to obtain sample means far from 100. It would be implausible to obtain sample means less than 95 or higher than 105. Most sample means are packed close to the population value in the center while few are out in the tails. The narrower spread of this distribution shows that `you are less likely to obtain an unusual sample mean when you have a large sample size`. In other words, you have more confidence that your sample mean is close to the actual population value. `Estimates from larger samples have a smaller margin of error`.

The wider distributions for small sample sizes indicate your sample mean has a higher probability of falling further away from the population mean. As you increase the sample size, the sampling distributions tighten up, which signifies you can be more confident that the sample mean is relatively close to the population mean.
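A simulation along these lines can be sketched in a few lines. It assumes a normal population with mean 100 and standard deviation 15 and uses the two sample sizes discussed above (5 and 60). The number of replicates is an arbitrary choice.

```python
import random
import statistics

rng = random.Random(123)
MU, SIGMA = 100, 15   # IQ-like population (assumed normal)
REPLICATES = 5_000    # number of simulated studies per sample size


def sample_means(n):
    """Means of REPLICATES random samples of size n from Normal(MU, SIGMA)."""
    return [
        statistics.fmean([rng.gauss(MU, SIGMA) for _ in range(n)])
        for _ in range(REPLICATES)
    ]


spread = {}
for n in (5, 60):
    means = sample_means(n)
    spread[n] = statistics.stdev(means)
    print(f"n={n:>2}: center={statistics.fmean(means):.2f}, "
          f"spread={spread[n]:.2f}")
```

Both sets of sample means center near 100, but the spread for samples of five is several times larger. The spread of each distribution is close to the population standard deviation divided by the square root of the sample size.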

In statistics, we refer to this concept as the **precision of the estimates**. `More precise estimates have smaller margins of error around the estimate`. Conversely, less precise estimates have larger margins of error and provide a vaguer idea about a population’s characteristics. You want greater precision because you’ll have better information about the actual value of the population parameter.

**Precision** is a function of both the variability in the population and the size of the sample. However, we can usually control only the sample size in our studies. Consequently, `increasing the sample size is the method to use to improve the precision of your sample estimates`.
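For a sample mean, this dependence on variability and sample size is commonly summarized as a margin of error of roughly z × σ / √n. The sketch below evaluates that expression for a few sample sizes, assuming a known population standard deviation of 15 and 95% confidence. Both are assumptions for illustration.

```python
import math
from statistics import NormalDist

SIGMA = 15  # population standard deviation, assumed known for illustration
z = NormalDist().inv_cdf(0.975)  # about 1.96 for 95% confidence


def margin_of_error(n):
    """Approximate 95% margin of error for a sample mean: z * sigma / sqrt(n)."""
    return z * SIGMA / math.sqrt(n)


for n in (10, 40, 160):
    print(f"n={n:>3}: margin of error is about {margin_of_error(n):.2f}")
```

Because of the square root, quadrupling the sample size only halves the margin of error, so each additional gain in precision costs more data than the last.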

---
<a id='conf'></a>

## Confidence Intervals and Precision

**Confidence intervals** help you assess the precision of your estimates.

<img src="images/stat-infer4.png" alt="" style="width: 400px;"/>

Statisticians usually consider a sample size of 10 to be a bit on the small side. From the histogram, the data do not look much like the original population. The estimates for the mean and standard deviation are 103.25 and 12.89, respectively. They are the **point estimates** for the population parameters, which are both in the right ballpark for the correct values of 100 and 15.

We have our point estimates, but we know that those aren’t exactly correct. Let’s check the **confidence intervals** to see the ranges for where the actual parameter values are likely to fall.

The **confidence interval** for the mean is [94.03 112.46], and for the standard deviation it is [8.86 23.52]. `The population parameters usually fall within their confidence intervals`. Typically, we don’t know the actual parameter values, but for this illustration we can see that both actual values (100 and 15) fall within their intervals. The sample does not provide an exact representation of the population, but the estimates are not too far off. If we didn’t know the actual values, the CIs would give us useful guidance.
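The text does not say how these intervals were computed. As a plausibility check, the sketch below assumes the standard 95% t-interval for the mean and the chi-square interval for the standard deviation, with critical values for 9 degrees of freedom hard-coded from standard tables. Under those assumptions it reproduces the quoted intervals to within rounding.

```python
import math

n = 10
sample_mean, sample_sd = 103.25, 12.89  # point estimates quoted above

# 95% critical values for n - 1 = 9 degrees of freedom, hard-coded from
# standard tables (the standard library has no t or chi-square quantiles).
T_CRIT = 2.262       # t distribution, 0.975 quantile
CHI2_UPPER = 19.023  # chi-square distribution, 0.975 quantile
CHI2_LOWER = 2.700   # chi-square distribution, 0.025 quantile

# Mean: point estimate plus or minus t * s / sqrt(n).
margin = T_CRIT * sample_sd / math.sqrt(n)
mean_ci = (sample_mean - margin, sample_mean + margin)

# Standard deviation: sqrt((n - 1) * s^2 / chi-square quantile).
sum_sq = (n - 1) * sample_sd ** 2
sd_ci = (math.sqrt(sum_sq / CHI2_UPPER), math.sqrt(sum_sq / CHI2_LOWER))

print(f"mean CI: [{mean_ci[0]:.2f}, {mean_ci[1]:.2f}]")
print(f"sd CI:   [{sd_ci[0]:.2f}, {sd_ci[1]:.2f}]")
```

Note that the interval for the standard deviation is not symmetric around 12.89, which is expected because the chi-square distribution is skewed.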

<img src="images/stat-infer5.png" alt="" style="width: 400px;"/>

For this larger sample, this histogram is beginning to look more like the underlying population distribution. The estimates for the mean and standard deviation are 99.553 and 15.597, respectively. Both of these point estimates are closer to the actual population values than their counterparts in the smaller sample.

For the **confidence intervals**, both of the CIs again contain the parameters. However, notice that these intervals are tighter than for the sample size of 10. For example, the CI for the mean is [96.458 102.648] compared to [94.03 112.46] for the sample size of 10. That’s a range of about 6 IQ points rather than 18. The tighter intervals indicate that these estimates are more precise than those from the smaller sample. In other words, the difference between the sample estimate and actual parameter value is likely to be smaller for the larger sample.

`Narrower confidence intervals represent more precise estimates`.

In summary, `use confidence intervals to evaluate the precision of your sample estimates`. If your intervals are too broad to be meaningful, you’ll need to increase your sample size.


---
<a id='res'></a>

# Resources

- [Statistics by Jim](https://statisticsbyjim.com/)
- [onlinemathlearning.com](https://www.onlinemathlearning.com)