# Measures of Dispersion: Variance and Standard Deviation

Measures of dispersion (also called variability, scatter, or spread) quantify how spread out the values in a dataset are, typically around a central value such as the *mean*. They tell us whether the data points are tightly clustered together or widely dispersed. This complements measures of central tendency (mean, median, mode), which tell us about the *center* of the data.

We can use measures of dispersion to:
* **Summarize Data:** Describe a dataset's variability with a single number.
* **Compare Datasets:** Determine which of several datasets has greater variability.
* **Understand Distributions:** See how widely the values are spread around the center.
* **Analyze Data:** Understand the dispersion of the data before processing it further.

---

## Population vs. Sample
*  **Population (N):** The entire group of interest.  If you have data for *every* member of the group, you have population data. The population size is denoted with a capital N.
*  **Sample (n):** A *subset* of the population, used when it's impractical to collect data on the entire population. The sample size is denoted with a lowercase n.

The formulas for calculating variance and standard deviation differ slightly depending on whether you're working with a population or a sample.

---


## Variance

Variance measures the average *squared* deviation of each data point from the mean. Because the deviations are squared, variance is expressed in squared units (for example, cm² for data measured in cm).

### Population Variance (σ²)

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

*   $\sigma^2$ (sigma-squared): Population variance
*   $x_i$: Individual data points
*   $\mu$ (mu): Population mean
*   $N$: Total number of data points in the *population*
*   $\sum$: Summation
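
The population formula can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function name `population_variance` and the `scores` data are made up for the example, and the list is assumed to contain *every* member of the population. The standard library's `statistics.pvariance` is used as a cross-check.

```python
import statistics


def population_variance(data):
    """Average squared deviation from the population mean (divide by N)."""
    n = len(data)                       # N: size of the population
    mu = sum(data) / n                  # population mean
    return sum((x - mu) ** 2 for x in data) / n


# Hypothetical data, treated here as a complete population.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

print(population_variance(scores))      # 4.0
print(statistics.pvariance(scores))     # same value from the standard library
```

Here the mean is 5, the squared deviations sum to 32, and dividing by N = 8 gives a variance of 4.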

### Sample Variance (s²)

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

*   $s^2$: Sample variance
*   $x_i$: Individual data points
*   $\bar{x}$ (x-bar): Sample mean
*   $n$: Total number of data points in the *sample*
*   $\sum$: Summation
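
A comparable sketch for the sample formula, again with a hypothetical function name (`sample_variance`) and made-up data. The only change from the population version is the `n - 1` denominator; `statistics.variance` from the standard library applies the same denominator and serves as a cross-check.

```python
import statistics


def sample_variance(data):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(data)                       # n: size of the sample
    if n < 2:
        raise ValueError("sample variance needs at least two data points")
    x_bar = sum(data) / n               # sample mean
    return sum((x - x_bar) ** 2 for x in data) / (n - 1)


# Hypothetical data, treated here as a sample drawn from a larger population.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

print(sample_variance(sample))          # 32 / 7, roughly 4.57
print(statistics.variance(sample))      # same value from the standard library
```

The same eight values give a slightly larger result (32 / 7) when treated as a sample than when treated as a population (32 / 8).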

### Why square the deviations?
1.  **Makes all deviations non-negative:**  Deviations from the mean always sum to exactly zero, because the negative deviations (points below the mean) cancel the positive ones (points above). Simply averaging the deviations would therefore give zero no matter how spread out the data is.  Squaring removes the sign, so no cancellation occurs.
2.  **Magnifies larger deviations:**  Squaring gives more weight to data points that are farther from the mean. A point twice as far from the mean contributes four times as much to the variance, which makes the measure sensitive to outliers.
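
Both points can be checked numerically. The snippet below is a small illustration on made-up data (the variable names are chosen for this example only): the raw deviations cancel out, while the squared deviations do not.

```python
# Hypothetical data with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]
mean = sum(data) / len(data)

deviations = [x - mean for x in data]          # signed distances from the mean
squared = [d ** 2 for d in deviations]         # always non-negative

sum_dev = sum(deviations)
sum_sq = sum(squared)

print(sum_dev)   # 0.0  (positive and negative deviations cancel)
print(sum_sq)    # 32.0 (squaring prevents the cancellation)

# The value 9 is twice as far from the mean as the value 7 (4 vs. 2),
# but it contributes four times as much to the sum (16 vs. 4).
print(squared[-1], squared[-2])
```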

### Why divide by `n-1` for the *sample* variance? (Bessel's Correction)
* The sample mean ($\bar{x}$) is used to calculate the sample variance.  Because the sample mean is calculated *from the sample data itself*, it tends to be *closer* to the sample data points than the true (unknown) population mean would be.  This means that using `n` in the denominator would systematically *underestimate* the population variance.
* Dividing by `n - 1` instead of `n` provides a slightly larger value for the sample variance, making it an *unbiased estimator* of the population variance: averaged over many independent random samples, sample variances calculated with `n - 1` equal the population variance, whereas those calculated with `n` come out too small by a factor of `(n - 1) / n`.
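
The effect of Bessel's correction can be seen in a small simulation. This is a rough sketch under stated assumptions, not a proof: the population (the integers 1 to 10), the sample size, the number of trials, and the random seed are all arbitrary choices for illustration, and samples are drawn with replacement so that the draws are independent.

```python
import random
import statistics

rng = random.Random(0)                      # fixed seed so the run is repeatable

population = list(range(1, 11))             # hypothetical population: 1..10
true_variance = statistics.pvariance(population)

n = 5                                       # sample size
trials = 20_000                             # number of repeated samples

biased_total = 0.0
unbiased_total = 0.0
for _ in range(trials):
    sample = rng.choices(population, k=n)   # draw n values with replacement
    x_bar = sum(sample) / n
    ss = sum((x - x_bar) ** 2 for x in sample)
    biased_total += ss / n                  # divide by n
    unbiased_total += ss / (n - 1)          # divide by n - 1 (Bessel's correction)

avg_biased = biased_total / trials
avg_unbiased = unbiased_total / trials

print("population variance:  ", true_variance)
print("average using n:      ", avg_biased)
print("average using n - 1:  ", avg_unbiased)
```

In a run like this, the average of the `n`-denominator estimates should land near `(n - 1) / n` times the population variance, while the average of the `n - 1` estimates should land near the population variance itself.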

---
