# DATA 335 - Winter 2025 - Lab 3

2025.02.04, 14:00-15:50, MS 521

In [11]:
import numpy as np
import pandas as pd
import scipy.stats as stats

### 1
Let's simulate fake data on IQ scores and course grades and examine the resulting predictions (cf. Active Statistics, page 129).

In [15]:
def make_data():
    n = 100
    iq = stats.norm(100, 15).rvs(size=n, random_state=0)  # IQ, roughly
    gpa = (
        2.5 + 0.02 * (iq - 100) + stats.norm(0, 0.5).rvs(size=n, random_state=1)
    )  # GPA, roughly
    df = pd.DataFrame({"iq": iq, "gpa": gpa})
    return df


data = make_data()
data.head()

|   | iq | gpa |
|---|---|---|
| 0 | 126.460785 | 3.841388 |
| 1 | 106.002358 | 2.314169 |
| 2 | 114.68107 | 2.529536 |
| 3 | 133.613398 | 2.635784 |
| 4 | 128.01337 | 3.492971 |


- Plot the regression of `gpa` on `iq`, overlaid on a scatterplot of the data.

- Use your model to estimate the predictive distribution of the GPA of a student with an IQ of 105.

- What is the probability that a student with an IQ of 105 will have a GPA exceeding 3?
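One way to approach the last two bullets is sketched below. It is a minimal sketch, not the only reasonable answer: it fits the line by least squares with `np.polyfit` and uses a plug-in normal predictive distribution, which is an assumption that ignores the uncertainty in the fitted intercept and slope. The names `a`, `b`, `sigma`, and `pred` are introduced here for illustration, and the plot is left out.

```python
import numpy as np
import scipy.stats as stats

# Same fake data as make_data() in this problem.
n = 100
iq = stats.norm(100, 15).rvs(size=n, random_state=0)
gpa = 2.5 + 0.02 * (iq - 100) + stats.norm(0, 0.5).rvs(size=n, random_state=1)

# Least-squares fit of gpa = a + b * iq (polyfit returns the slope first).
b, a = np.polyfit(iq, gpa, deg=1)

# Residual standard deviation with n - 2 degrees of freedom.
resid = gpa - (a + b * iq)
sigma = np.sqrt(np.sum(resid**2) / (n - 2))

# Plug-in predictive distribution for a student with IQ 105:
# normal, centered on the fitted line, with standard deviation sigma.
# Assumption: uncertainty in a and b is ignored.
pred = stats.norm(a + b * 105, sigma)
p_above_3 = pred.sf(3.0)  # P(GPA > 3 | IQ = 105)

print(f"predictive mean {pred.mean():.2f}, sd {pred.std():.2f}")
print(f"P(GPA > 3) = {p_above_3:.2f}")
```

A simulation-based version would instead draw intercept, slope, and residual noise jointly, which widens the predictive distribution slightly.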

### 2

Here's a fake dataset describing the relationship between cholesterol level (a heart disease risk factor), age (ordinal, four categories, ages 10-30, 30-50, 50-70, and 70-90), and weekly hours of exercise.

In [12]:
def make_data():
    n = 100
    np.random.seed(0)
    age = np.random.choice([0, 1, 2, 3], size=n)
    exercise = 2 * age + 3 * np.random.normal(size=n) + 6
    cholesterol = 200 + 30 * age - 5 * exercise + 10 * np.random.normal(size=n)
    df = pd.DataFrame({"age": age, "exercise": exercise, "cholesterol": cholesterol})
    return df


data = make_data()
data.head()

|   | age | exercise | cholesterol |
|---|---|---|---|
| 0 | 0 | 0.881189 | 197.260788 |
| 1 | 3 | 17.852326 | 207.088683 |
| 2 | 1 | 6.471043 | 221.47623 |
| 3 | 0 | 4.685777 | 186.015909 |
| 4 | 3 | 8.241614 | 239.663708 |


- Plot the (simple) linear regression of `cholesterol` on `exercise`, overlaid on a scatterplot of the data. Do the results surprise you?

- Fit a multiple linear regression of `cholesterol` on `exercise` and `age`. Plot the regression lines corresponding to each of the age groups, overlaid on a scatterplot of the data. Use a different color for each age group. Comment.

This exercise demonstrates a phenomenon known as *Simpson's Paradox*. The inspiration for this exercise comes from §1.2 of **Causal Inference in Statistics** by Pearl, Glymour, and Jewell.
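The contrast between the two fits can be checked numerically before plotting. The following is a rough sketch, assuming `age` enters the model as a single numeric predictor (treating it as categorical is an equally reasonable choice). It regenerates the data as plain arrays and fits both models with `np.linalg.lstsq`; the names `coef_simple` and `coef_multi` are introduced here for illustration.

```python
import numpy as np

# Same fake data as make_data() in this problem, kept as plain arrays.
n = 100
np.random.seed(0)
age = np.random.choice([0, 1, 2, 3], size=n)
exercise = 2 * age + 3 * np.random.normal(size=n) + 6
cholesterol = 200 + 30 * age - 5 * exercise + 10 * np.random.normal(size=n)

# Simple regression: cholesterol ~ exercise.
X_simple = np.column_stack([np.ones(n), exercise])
coef_simple, *_ = np.linalg.lstsq(X_simple, cholesterol, rcond=None)

# Multiple regression: cholesterol ~ exercise + age.
# Assumption: age is entered as one numeric predictor.
X_multi = np.column_stack([np.ones(n), exercise, age])
coef_multi, *_ = np.linalg.lstsq(X_multi, cholesterol, rcond=None)

print("exercise slope, ignoring age:     ", coef_simple[1])
print("exercise slope, adjusting for age:", coef_multi[1])
print("age coefficient:                  ", coef_multi[2])
```

By construction, exercise lowers cholesterol by 5 units per hour within each age group, so the adjusted slope should land near that value. The unadjusted slope mixes this effect with the fact that older people in this dataset both exercise more and have higher cholesterol.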

### 3

In class, we saw how to use simulation to synthesize prior information with new data, leading to posterior estimates. In certain special cases, this can also be approached analytically. 

Suppose we have an estimate $\hat{\theta}_{\text{data}}$ of a population parameter from data that we want to synthesize with a prior estimate $\hat{\theta}_{\text{prior}}$.

Suppose that the distributions giving rise to $\hat{\theta}_{\text{data}}$ and $\hat{\theta}_{\text{prior}}$ are approximately normal with known standard deviations $s_{\text{data}}$ and $s_{\text{prior}}$, respectively. Set
$$
w_{\text{prior}} = \dfrac1{s_{\text{prior}}^2},\quad
w_{\text{data}} = \dfrac1{s_{\text{data}}^2},\quad w = w_{\text{prior}} + w_{\text{data}}
$$
Then the posterior distribution of $\theta$ is normal with mean estimate
$$
\hat{\theta}_{\text{posterior}} = \frac{w_{\text{prior}}\hat{\theta}_{\text{prior}}
+ w_{\text{data}}\hat{\theta}_{\text{data}}}{w}
$$
and standard deviation estimate
$$
s_{\text{posterior}} = \frac1{\sqrt{w}}.
$$

- Write a function that evaluates these posterior estimates given the prior and data estimates.

- Go back to the two examples of Bayesian synthesis presented in class (the polling example and the sex ratio example) and confirm our simulation-based results using your function.
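The formulas above translate almost line by line into code. Here is a minimal sketch; the function name `posterior_normal` and its argument names are choices made here, not something fixed by the lab. The usage example plugs in the normal prior $N(0.524, 0.041^2)$ and the poll of 190 out of 400 mentioned in problem 4, and assumes the usual binomial standard error for the data estimate.

```python
import math


def posterior_normal(theta_prior, s_prior, theta_data, s_data):
    """Precision-weighted combination of a normal prior and a normal data estimate.

    Returns the posterior mean and posterior standard deviation.
    """
    w_prior = 1 / s_prior**2  # precision of the prior estimate
    w_data = 1 / s_data**2    # precision of the data estimate
    w = w_prior + w_data
    theta_post = (w_prior * theta_prior + w_data * theta_data) / w
    s_post = 1 / math.sqrt(w)
    return theta_post, s_post


# Polling example: 190 of 400 respondents.
# Assumption: the data standard error is sqrt(p_hat * (1 - p_hat) / n).
p_hat = 190 / 400
se_data = math.sqrt(p_hat * (1 - p_hat) / 400)
theta_post, s_post = posterior_normal(0.524, 0.041, p_hat, se_data)
print(f"posterior mean {theta_post:.3f}, posterior sd {s_post:.3f}")
```

Because the weights are precisions, the posterior mean always lies between the two estimates and sits closer to whichever has the smaller standard deviation.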

### 4

Here's yet one more take on the Bayesian synthesis polling example from class.

- Using PyMC, estimate the mean and standard deviation of the posterior associated with
  - a $\operatorname{Bin}(400, p)$ likelihood with a single observation of $190$;
  - a [beta distribution prior](https://www.pymc.io/projects/docs/en/latest/api/distributions/generated/pymc.Beta.html) for $p$ with parameters $\alpha=77$ and $\beta=70$.

- Compare with the results obtained in class and in the previous problem.

- Plot the density functions of the beta prior $\operatorname{Beta}(77, 70)$ used above together with the density function of the normal prior $N(0.524, 0.041^2)$ used in class. Comment.
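A closed-form cross-check is available here, because a beta prior is conjugate to a binomial likelihood: the posterior is again a beta distribution, with the successes added to $\alpha$ and the failures added to $\beta$. The sketch below is not the PyMC solution the problem asks for; it only computes the exact moments that the PyMC samples should approximate. The helper `beta_mean_sd` is a name introduced here for illustration.

```python
import math


def beta_mean_sd(alpha, beta):
    """Mean and standard deviation of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    mean = alpha / total
    sd = math.sqrt(alpha * beta / (total**2 * (total + 1)))
    return mean, sd


# Prior and data from this problem.
alpha_prior, beta_prior = 77, 70
n_trials, successes = 400, 190

# Conjugate update: add successes to alpha and failures to beta.
alpha_post = alpha_prior + successes
beta_post = beta_prior + (n_trials - successes)

prior_mean, prior_sd = beta_mean_sd(alpha_prior, beta_prior)
post_mean, post_sd = beta_mean_sd(alpha_post, beta_post)

print(f"prior:     mean {prior_mean:.3f}, sd {prior_sd:.3f}")
print(f"posterior: mean {post_mean:.3f}, sd {post_sd:.3f}")
```

The prior moments printed here can be set side by side with the mean and standard deviation of the normal prior $N(0.524, 0.041^2)$ before plotting the two densities.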