# Inferential Statistics

Inferential statistics allow you to use a relatively small sample to learn about an entire population.

The primary way scientific experiments create new knowledge is by carefully setting up contrasts between groups, such as a treatment and control group.

## Table of Contents

- [Descriptive and Inferential Statistics](#intro)
    - [Descriptive statistics](#desc)
    - [Inferential Statistics](#infer)
- [Populations](#pop)
- [Subpopulations](#subpop)
- [Population Parameters versus Sample Statistics](#popsampl)
- [Tools for Inferential Statistics](#tools)

---
<a id='intro'></a>

## Descriptive and Inferential Statistics

Descriptive and inferential statistics are two broad categories in the field of statistics. Here’s the difference in a nutshell:

- **Descriptive statistics** `describe a dataset` for a particular group of objects, observations, or people. They don’t attempt to generalize beyond the set of observations.

- **Inferential statistics** `use a dataset to make conclusions about the larger population from which the sample was drawn`. These statistics generalize beyond the specific observations that are in the dataset to a larger group or population.

---
<a id='desc'></a>

### Descriptive statistics

**Descriptive statistics** describe a sample. Use descriptive statistics to summarize and graph the data for a group that you choose. This process allows you to understand that specific set of observations.

The process involves taking a potentially large number of data points in the sample and reducing them to a few meaningful summary values and graphs. This procedure gives us more insight into the data than merely poring over row upon row of raw numbers.

**Descriptive statistics** frequently use `statistical measures` to describe a particular group:

- **Central tendency**: Use the `mean` or the `median` to locate the center of the dataset. This measure tells you where most values fall.

- **Dispersion**: How far out from the center do the data extend? You can use the `range` or `standard deviation` to measure the dispersion. Low dispersion indicates that values cluster more tightly around the center. Higher dispersion signifies that data points fall further away from the center. We can also graph the `frequency distribution`.

- **Skewness**: This measure tells you whether the distribution of values is `symmetric` or `skewed`.

- **Correlation**: The strength of the tendency for two variables to change together.

You can present this summary information using both `numbers` and `graphs`.
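The measures listed above can be sketched with Python's standard library. This is only an illustration: the `hours` and `scores` values are made up, and the helper names `skewness` and `pearson_r` are hypothetical, written out by hand rather than taken from any particular library.

```python
import math
import statistics

# Made-up example data: hours studied and exam scores for eight students.
hours = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [52, 55, 61, 60, 68, 74, 79, 95]


def skewness(values):
    """Population skewness: the mean cubed z-score (0 for a symmetric dataset)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


summary = {
    "mean": statistics.fmean(scores),          # central tendency
    "median": statistics.median(scores),       # central tendency
    "range": max(scores) - min(scores),        # dispersion
    "std_dev": statistics.stdev(scores),       # dispersion
    "skewness": skewness(scores),              # symmetry of the distribution
    "correlation": pearson_r(hours, scores),   # tendency to change together
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Every value here describes only these eight students. Nothing in the summary claims anything about students outside the dataset.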

---
<a id='infer'></a>

### Inferential Statistics

In most cases, it is simply impossible to measure the entire population to understand its properties. The alternative is to gather a **random sample** and then use the methodologies of **inferential statistics** to analyze the sample data.

**Inferential statistics** take data from a sample and make inferences about the larger population from which the sample was drawn. Because the `goal of inferential statistics is to take a sample and generalize its properties to a population`, we need to have confidence that our sample accurately reflects the population. This requirement affects our process. At a broad level, we must do the following:

1. Define the population we are studying.
2. Draw a representative sample from that population.
3. Use analyses that incorporate the sampling error.

We need a sampling procedure that tends to produce a sample that accurately reflects the population from which we draw it. **Random sampling** is a procedure that allows us to have confidence that the sample represents the population. The random nature of this process helps `avoid any systematic bias` that would invalidate our results.

**Random sampling** is a primary method `for obtaining samples that mirror the population on average`. This type of sampling produces statistics, such as the mean, that are not systematically too high or too low. In other words, the critical characteristic of **random samples** is that they produce **sample statistics** that tend to be correct on average.
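A small simulation can make the "correct on average" idea concrete. The sketch below assumes a hypothetical population of 10,000 simulated heights. The numbers (mean 170, standard deviation 10, sample size 50) are arbitrary choices for illustration, not values from any real study.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 simulated heights in cm.
population = [random.gauss(170, 10) for _ in range(10_000)]
population_mean = statistics.fmean(population)

# Draw many random samples of 50 and record each sample mean.
sample_means = [
    statistics.fmean(random.sample(population, 50))
    for _ in range(2_000)
]
average_of_sample_means = statistics.fmean(sample_means)

# A non-random "sample" for contrast: only the 50 tallest members.
biased_sample_mean = statistics.fmean(sorted(population)[-50:])

print(f"population mean:                {population_mean:.2f}")
print(f"average of random sample means: {average_of_sample_means:.2f}")
print(f"mean of the 50 tallest:         {biased_sample_mean:.2f}")
```

Individual random samples still miss the population mean, but their misses fall on both sides and cancel out on average. The hand-picked sample is systematically too high no matter how often we repeat it.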

Consequently, when we obtain a **random sample**, `we can generalize from the sample to the broader population`. Unfortunately, gathering a genuinely random sample can be a complicated process.

When you estimate the properties of a population from a sample, the sample statistics are unlikely to equal the actual population value exactly. For instance, your sample mean is unlikely to equal the population mean exactly.

**Sampling error** `is the difference between the sample statistic and the population value`. Inferential statistics incorporate estimates of this error into the statistical results.
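As a rough sketch of this definition, the code below computes the sampling error of the mean for simulated data. We can only compute it here because we built the hypothetical `population` ourselves and therefore know its mean. In a real study the population value is unknown, which is why inferential methods estimate the error instead. The function names are illustrative.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population of 100,000 simulated test scores.
population = [random.gauss(100, 15) for _ in range(100_000)]
mu = statistics.fmean(population)  # known only because we simulated it


def sampling_error(sample_size):
    """Sample mean minus population mean for one random sample."""
    sample = random.sample(population, sample_size)
    return statistics.fmean(sample) - mu


def typical_error(sample_size, repeats=300):
    """Average absolute sampling error over many repeated samples."""
    errors = [abs(sampling_error(sample_size)) for _ in range(repeats)]
    return statistics.fmean(errors)


typical = {n: typical_error(n) for n in (10, 100, 1000)}
for n, err in typical.items():
    print(f"n={n:>4}: typical |sample mean - mu| = {err:.2f}")
```

In this simulation the typical error shrinks as the sample size grows, but it never reaches exactly zero for a sample smaller than the population.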

By contrast, summary values in **descriptive statistics** are straightforward. The average score in a specific class is a known value because we measured all individuals in that class. There is little uncertainty.

To gain the benefits of **inferential statistics**, you must understand the relationship between `populations`, `subpopulations`, `population parameters`, `samples`, and `sample statistics`.

---
<a id='pop'></a>

## Populations

**Populations** can include people, but other examples include objects, events, businesses, and so on. In statistics, there are two general types of populations.

- A population can be the complete set of all similar items that exist. It’s a finite but potentially extensive list of members.
- A population can be a theoretical construct that is potentially infinite in size.

**Populations** share a set of attributes that you define. For example, the following are populations:

- Stars in the Milky Way galaxy.
- Parts from a production line.
- Citizens of the United States.
- 8th grade students in the State of Pennsylvania.

`Before you begin a study, you must carefully define the population that you are studying`. These populations can be narrowly defined to meet the needs of your analysis. For example, your population can be adult Swedish women who are otherwise healthy but have osteoporosis.

---
<a id='subpop'></a>

## Subpopulations

**Subpopulations** share additional attributes. For instance, the population of the United States contains the subpopulations of men and women. You can also subdivide it in other ways such as region, age, socioeconomic status, and so on. Different studies that involve the same population can divide it into different subpopulations depending on what makes sense for the data and the analyses.

Understanding the **subpopulations** in your study helps you grasp the subject matter more thoroughly. They can also help you produce statistical models that fit the data better. Subpopulations are particularly important when they have characteristics that are systematically different from the overall population. When you analyze your data, you need to be aware of these deeper divisions. In fact, you can treat the relevant subpopulations as additional factors in later analyses.

<img src="images/stat-infer.png" alt="" style="width: 400px;"/>

Gender is a crucial variable for defining **subpopulations** in studies of height, and accounting for it increases our understanding of the subject matter. In future studies about height, we can include gender as a variable. That’s how scientists think about research questions and add to their knowledge.
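The height example can be sketched in a few lines of Python. The records below are made-up illustrative values, not real measurements. The point is only to show how a single overall mean can hide a systematic difference between subpopulations.

```python
import statistics
from collections import defaultdict

# Made-up illustrative records: (gender, height in cm).
records = [
    ("female", 160), ("female", 165), ("female", 158), ("female", 170),
    ("male", 175), ("male", 180), ("male", 172), ("male", 185),
]

# One summary value for the whole group.
overall_mean = statistics.fmean([height for _, height in records])

# Split the same data into subpopulations and summarize each one.
heights_by_group = defaultdict(list)
for gender, height in records:
    heights_by_group[gender].append(height)

group_means = {
    gender: statistics.fmean(heights)
    for gender, heights in heights_by_group.items()
}

print(f"overall mean: {overall_mean:.2f}")
for gender, mean_height in group_means.items():
    print(f"{gender} mean: {mean_height:.2f}")
```

In a later analysis, the grouping variable used here (gender) is what we would add to the model as an additional factor.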

---
<a id='popsampl'></a>

## Population Parameters versus Sample Statistics

A **parameter** is a `value that describes a characteristic of an entire population`, such as the population mean. Because you can rarely measure a population as a whole, you usually don’t know the real value of a parameter. While we can’t measure the parameter value, it exists.

While we’ll never know the precise value of these **population parameters**, we can use inferential statistics to estimate them and incorporate a margin of error.

The population mean and standard deviation are two common parameters. `In statistics, Greek symbols usually represent population parameters`, such as `μ` (mu) for the mean and `σ` (sigma) for the standard deviation.

A **statistic** is a `characteristic of a sample`. If you collect a sample and calculate the mean and standard deviation, these are **sample statistics**. Inferential statistics allow you to use sample statistics to make conclusions about a population. However, to draw valid conclusions, you must use **random sampling** techniques, as discussed earlier.

<img src="images/stat-infer2.png" alt="" style="width: 400px;"/>

In **inferential statistics**, `sample statistics are estimates of population parameters`. For example, if we collect a random sample of adult women in the United States and measure their heights, we can calculate the sample mean and standard deviation and use them as estimates of the population parameters.
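To illustrate the distinction, the sketch below builds a hypothetical population that we pretend to know in full, then compares its parameters with the statistics from one random sample. The population values are simulated, and the variable names (`mu`, `sigma`, `x_bar`, `s`) simply follow the usual Greek-versus-Latin naming convention.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population that we pretend to know in full.
population = [random.gauss(162, 7) for _ in range(50_000)]

# Population parameters: computed from every member of the population.
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)  # population formula, divides by N

# Sample statistics: computed from a random sample of 200 members.
sample = random.sample(population, 200)
x_bar = statistics.fmean(sample)
s = statistics.stdev(sample)           # sample formula, divides by n - 1

print(f"parameters: mu    = {mu:.2f}, sigma = {sigma:.2f}")
print(f"statistics: x_bar = {x_bar:.2f}, s     = {s:.2f}")
```

Note that `statistics.pstdev` and `statistics.stdev` use different denominators. The first treats the data as the whole population, while the second treats it as a sample from a larger population.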

You can calculate the following types of estimates for population parameters:

- **Point estimates**: These estimates use the sample data to produce a single value that is the best guess for the population parameter. Sample statistics, such as the mean, are typically the point estimate for the population. Unfortunately, point estimates are almost always off by an unknown amount because of **random sampling error**.

- **Interval estimates**: A range of values that likely contains the value of the population parameter. These intervals include a **margin of error** around the point estimate to account for **random sampling error**.

In short, **point estimates** are the best guess value but are almost certain to be off by at least a little bit because you’re working with a sample that is small in comparison to the population. **Interval estimates** are ranges of values that probably contain the parameter value.
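Both kinds of estimate can be sketched for a mean as follows. This is a minimal illustration under stated assumptions: the sample is simulated, the interval is a 95% confidence interval, and it uses a normal approximation for the critical value. For small samples, a t-distribution would usually be preferred.

```python
import math
import random
import statistics

random.seed(7)  # fixed seed so the sketch is reproducible

# Hypothetical random sample of 100 measurements (simulated for illustration).
sample = [random.gauss(50, 8) for _ in range(100)]
n = len(sample)

# Point estimate: a single best-guess value for the population mean.
point_estimate = statistics.fmean(sample)

# Standard error: estimated spread of the sample mean across repeated samples.
standard_error = statistics.stdev(sample) / math.sqrt(n)

# Critical value for 95% confidence under a normal approximation (about 1.96).
z = statistics.NormalDist().inv_cdf(0.975)
margin_of_error = z * standard_error

# Interval estimate: the point estimate plus and minus the margin of error.
interval = (point_estimate - margin_of_error, point_estimate + margin_of_error)

print(f"point estimate:  {point_estimate:.2f}")
print(f"margin of error: {margin_of_error:.2f}")
print(f"95% interval:    ({interval[0]:.2f}, {interval[1]:.2f})")
```

The interval is centered on the point estimate, and its width reflects the estimated random sampling error. With the same variability, a larger sample would give a narrower interval.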

---
<a id='tools'></a>

## Tools for Inferential Statistics

Inferential methods can produce the same kinds of summary values as descriptive statistics, such as the mean and standard deviation. However, we use them very differently when making inferences.

---
<a id='res'></a>

# Resources

- [Statistics by Jim](https://statisticsbyjim.com/)
- [onlinemathlearning.com](https://www.onlinemathlearning.com)