In [1]:
#| hidden: true
#| echo: false
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# some personal style settings to make the plots look nice
# and save some space in the notebook
plt.style.use("../style.mplstyle")

## Data Generating Processes

In the previous lecture we saw examples of running simulations to generate data, either by drawing samples from a probability distribution (e.g. flipping coins or rolling dice) or by explicitly simulating a more complex process (the busking musician example).

This leads us to a useful way of thinking about data: **all data is generated by some underlying process**. The process can be simple or complex, deterministic or stochastic, observed or unobserved, but it is always there. If that sounds obvious, that is because it is! "Some process" is a catch-all, and of course data doesn't just appear out of nowhere. It is still worth stating explicitly because, as we will see, thinking about **data generating processes** (DGPs) is the key to analyzing data.



## Statistical Models
A **statistical model** is a formal mathematical representation of a data generating process. Specifically, it describes the probability distribution of the data. Based on the model, we can make precise statements about the data generated by the process. For example, we can say how likely it is to observe a certain value or set of values. We can tell what the average (or expected) value is, what the most likely value is, and so on.

Let's return to coin flips once again. The data generating process is the flipping of a coin, which has two possible outcomes: heads or tails. The statistical model for this process is a **Bernoulli distribution**, which describes the probability of each outcome. Specifically, 

$$P(X = x) = \begin{cases}
p & \text{if } x = 1 \text{ (heads)} \\
1 - p & \text{if } x = 0 \text{ (tails)}
\end{cases}$$

where \(X\) is the outcome of the coin flip, \(p\) is the probability of heads, and \(1 - p\) is the probability of tails. If we assume a fair coin, then \(p = 0.5\).
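
To make the model concrete, here is a minimal sketch of the Bernoulli distribution in code. The helper `bernoulli_pmf`, the seed, and the sample size are illustrative choices, not part of the lecture material. The sketch evaluates the model's probabilities directly and then draws simulated flips to check that the share of heads is close to \(p\).

```python
import numpy as np

# Seed chosen arbitrarily so the sketch is reproducible
rng = np.random.default_rng(seed=0)


def bernoulli_pmf(x, p):
    """Return P(X = x) for a Bernoulli(p) variable, with 1 = heads and 0 = tails."""
    if x not in (0, 1):
        raise ValueError("x must be 0 (tails) or 1 (heads)")
    return p if x == 1 else 1 - p


p = 0.5  # fair coin

# A Bernoulli draw is a binomial draw with a single trial (n=1)
flips = rng.binomial(n=1, p=p, size=10_000)
share_heads = flips.mean()

print(f"P(heads) under the model: {bernoulli_pmf(1, p)}")
print(f"P(tails) under the model: {bernoulli_pmf(0, p)}")
print(f"Share of heads in {flips.size} simulated flips: {share_heads:.3f}")
```

Because heads is coded as 1 and tails as 0, the mean of the simulated flips is simply the fraction of heads, which should settle near \(p\) as the number of flips grows.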

Now clearly, this model does not capture the complexity of a real-world coin flip, which is of course influenced by many factors such as the weight of the coin, the force of the flip, air resistance, etc. Statistical models are always reductive in this sense. But the important thing is that a Bernoulli distribution really does do a good job of describing the **outcomes** of a coin flip. As long as that is the case, we can use the model to make predictions about the data generated by the process.
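
As a rough illustration of using the model to make a prediction, the sketch below compares the expected number of heads in a batch of flips with what repeated simulation produces. It assumes the flips are independent, so the expected count is \(n \cdot p\). The batch size, number of repetitions, and seed are arbitrary choices made for this example.

```python
import numpy as np

# Seed chosen arbitrarily so the sketch is reproducible
rng = np.random.default_rng(seed=1)

p = 0.5                # fair coin
n_flips = 100          # flips per experiment (illustrative)
n_experiments = 5_000  # number of repeated experiments (illustrative)

# Each row is one experiment of n_flips independent Bernoulli(p) draws
flips = rng.binomial(n=1, p=p, size=(n_experiments, n_flips))
heads_per_experiment = flips.sum(axis=1)

# Model-based prediction: expected number of heads is n * p
predicted_mean = n_flips * p
simulated_mean = heads_per_experiment.mean()

print(f"Predicted heads per {n_flips} flips: {predicted_mean}")
print(f"Average heads across {n_experiments} simulated experiments: {simulated_mean:.2f}")
```

If the Bernoulli model describes the outcomes well, the simulated average should sit close to the predicted value, even though any single experiment can land some distance from it.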



<!-- Think back to Lecture 00, where we analyzed a dataset of Airbnb listings.  -->