## Distributions

*Coding along with the third edition of the online version of __[Think Stats](https://allendowney.github.io/ThinkStats/chap02.html)__ by Allen Downey.*

#### __Distributions in Statistics__

A distribution in statistics describes how data is spread out or arranged - essentially, it's a mathematical function that shows how often different values occur in a dataset. Think of it as a way to understand the "shape" of your data.

Here are the key aspects of distributions:

***Central Tendency*** - This shows where the data tends to cluster, typically measured by:
- Mean (average)
- Median (middle value)
- Mode (most frequent value)

***Spread*** - This indicates how dispersed the data is, measured by:
- Range (difference between highest and lowest values)
- Variance (average squared distance from the mean)
- Standard deviation (square root of the variance; a typical distance from the mean, in the same units as the data)
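As a quick check on these definitions, here is a minimal sketch that computes each measure with Python's standard `statistics` module. The sample values are made up for illustration, and the population versions of variance and standard deviation (`pvariance`, `pstdev`) are an assumption; sample versions would divide by n - 1 instead.

```python
import statistics

# A small, made-up sample used only for illustration.
values = [1.0, 2.0, 2.0, 3.0, 5.0]

# Central tendency
mean = statistics.mean(values)
median = statistics.median(values)
mode = statistics.mode(values)

# Spread (population versions: divide by n, not n - 1)
value_range = max(values) - min(values)
variance = statistics.pvariance(values)
std_dev = statistics.pstdev(values)

print(f"mean={mean}, median={median}, mode={mode}")
print(f"range={value_range}, variance={variance:.2f}, std={std_dev:.3f}")
```

Note that the standard deviation is the square root of the variance, which is why it comes back in the same units as the data.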

__Some common types of distributions include:__

***Normal Distribution (Bell Curve):***
- Symmetrical shape with most data clustered in the middle
- Examples: heights in a population, test scores
- Often occurs naturally in many phenomena

***Uniform Distribution:***
- All values have equal probability of occurring
- Example: rolling a fair die

***Skewed Distributions:***
- Right-skewed: tail extends to the right (higher values)
- Left-skewed: tail extends to the left (lower values)
- Example: income distributions are often right-skewed
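One common sign of skew is that the mean gets pulled toward the long tail, so it tends to sit above the median in right-skewed data and below it in left-skewed data. The sketch below illustrates this with simulated exponential samples. The choice of distribution, the seed, and the sample size are all assumptions made for the example, and the mean-versus-median comparison is a rule of thumb rather than a strict test.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Exponential samples have a long right tail (right-skewed).
right_skewed = [rng.expovariate(1.0) for _ in range(10_000)]
# Negating them mirrors the tail to the left (left-skewed).
left_skewed = [-x for x in right_skewed]

for name, sample in [("right-skewed", right_skewed), ("left-skewed", left_skewed)]:
    m = statistics.mean(sample)
    med = statistics.median(sample)
    print(f"{name}: mean={m:.3f}, median={med:.3f}")
```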

Understanding distributions helps us:
- Make predictions about future data
- Identify unusual values or outliers
- Choose appropriate statistical tests
- Make informed decisions based on data patterns

*(Source: Claude.ai)*

In [None]:
import empiricaldist
import numpy as np
import matplotlib.pyplot as plt
# https://github.com/AllenDowney/ThinkBayes/blob/master/code/thinkstats.py
from thinkstats import decorate

## Histograms

A histogram is a visual representation of a distribution that shows the frequency of data points falling within specific ranges or "bins." Think of it like a bar chart where each bar represents how many data points fall within a particular range.

Here's why histograms are particularly useful:

1. They give you an immediate visual sense of:
   - Where most of your data is concentrated
   - The overall shape of your distribution
   - Any unusual patterns or outliers

2. For example, if you measured the heights of 1000 adults:
   - Each bar might represent a 2-inch range (60-62 inches, 62-64 inches, etc.)
   - The height of each bar shows how many people fall within that range
   - You might see a bell-shaped curve typical of height distributions

What makes histograms especially powerful is that they transform raw numbers into a pattern you can understand at a glance. If someone handed you a spreadsheet with 1000 height measurements, it would be hard to spot patterns. But a histogram instantly shows you if the heights cluster around a central value, if they're skewed to one side, or if there are any unexpected gaps or peaks.
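To make the binning step concrete, here is a rough sketch that groups 1000 simulated heights into 2-inch bins and prints a text histogram. The heights are generated, not real data, and the mean of 67 inches and standard deviation of 3 inches are assumptions chosen for the example.

```python
import random
from collections import Counter

rng = random.Random(1)  # fixed seed so the run is repeatable

# Simulated heights in inches; mean 67 and spread 3 are assumptions.
heights = [rng.gauss(67, 3) for _ in range(1000)]

bin_width = 2
# Map each height to the lower edge of its bin, e.g. 63.4 -> 62.
counts = Counter(int(h // bin_width) * bin_width for h in heights)

# One '#' per 5 people in the bin, printed from shortest to tallest.
for left_edge in sorted(counts):
    label = f"{left_edge}-{left_edge + bin_width}"
    print(f"{label:>6} {'#' * (counts[left_edge] // 5)}")
```

Each printed row plays the role of one bar: its label is the range and its length is proportional to the number of people in that range.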

*(Source: Claude.ai)*

In [None]:
from empiricaldist import Hist

In [None]:
t = [1.0, 2.0, 2.0, 3.0, 5.0] # let's start with a small list of values

In [None]:
# Hist from the empiricaldist package provides a method called `from_seq`
# `from_seq` takes a sequence and makes a `Hist` object
hist = Hist.from_seq(t)
hist

A `Hist` object is a kind of Pandas `Series` that contains quantities and their frequencies.
In our example the quantity `1.0` corresponds to frequency 1, the quantity `2.0` corresponds to frequency 2, etc.
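To see what kind of table this is, the same quantity-to-frequency mapping can be built with `collections.Counter` from the standard library. This is only a sketch of the idea, not a description of how `empiricaldist` implements `Hist`.

```python
from collections import Counter

t = [1.0, 2.0, 2.0, 3.0, 5.0]

# Count how many times each quantity appears in the sequence.
freqs = Counter(t)

# Print quantity-frequency pairs in sorted order of quantity.
for quantity in sorted(freqs):
    print(quantity, freqs[quantity])
```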

In [None]:
# `Hist` provides a method called `bar` that plots the histogram as a bar chart
hist.bar()
decorate(xlabel="Quantity", ylabel="Frequency") # naming the axes

In [None]:
hist[2.0] # looking up a quantity to get its frequency

In [None]:
hist(2.0) # can also be called like a function

In [None]:
hist(4.0) # a quantity that does not appear in the sequence has frequency 0

In [None]:
# qs attribute returns an array of quantities
hist.qs

In [None]:
# ps attribute returns an array of frequencies
hist.ps

In [14]:
# the items method lets us loop through quantity-frequency pairs
for (x, freq) in hist.items():
    print(x, freq)

1.0 1
2.0 2
3.0 1
5.0 1
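Quantity-frequency pairs are enough to recover summary statistics without going back to the raw list. As a small sketch, the pairs printed above are copied into a plain list here (so the snippet does not depend on `empiricaldist`) and used to compute the total count and the mean.

```python
# Quantity-frequency pairs copied from the output above.
pairs = [(1.0, 1), (2.0, 2), (3.0, 1), (5.0, 1)]

# Total number of observations is the sum of the frequencies.
n = sum(freq for _, freq in pairs)

# The mean weights each quantity by how often it occurs.
mean = sum(x * freq for x, freq in pairs) / n

print(n, mean)
```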


## NSFG Distributions

A good way to get started is to explore the variables we plan to use, one at a time, by looking at their histograms.

As an example, let's look at data from the National Survey of Family Growth (NSFG).
In the previous chapter, we downloaded this dataset, read it into a Pandas `DataFrame`, and cleaned a few of the variables.
The code we used to load and clean the data is in a module called `nsfg.py` -- instructions for installing this module are in the notebook for this chapter.

In [15]:
from nsfg import read_fem_preg

In [16]:
preg = read_fem_preg()
preg.head()

BadGzipFile: Not a gzipped file (b've')
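The error means the file being opened does not start with the two gzip magic bytes (`0x1f 0x8b`). The message shows the bytes that were found instead, here `b've'`, which suggests that something other than compressed data was saved under the data file's name. A quick way to check a file before loading it is sketched below. The helper name `looks_gzipped` and the file names are hypothetical, and the demo creates its own throwaway files rather than touching the NSFG data.

```python
import gzip
import tempfile
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream starts with these two bytes


def looks_gzipped(path):
    """Return True if the file at `path` starts with the gzip magic number."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


# Self-contained demo with two throwaway files (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.dat.gz"
    with gzip.open(good, "wb") as f:
        f.write(b"some data")

    bad = Path(tmp) / "bad.dat.gz"
    bad.write_bytes(b"plain text, not gzip")  # wrong content, right extension

    good_ok = looks_gzipped(good)
    bad_ok = looks_gzipped(bad)

print(good_ok, bad_ok)
```

If the check fails on the real data file, downloading it again and confirming the file size is a reasonable next step.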