# Section 5.4 — Bayesian difference between means

This notebook contains the code examples from [Section 5.4 Bayesian difference between means]() from the **No Bullshit Guide to Statistics**.

See also:
- [Half_a_dozen_dmeans_in_Bambi.ipynb](./explorations/Half_a_dozen_dmeans_in_Bambi.ipynb)
- [compare_iqs2_many_ways.ipynb](./explorations/compare_iqs2_many_ways.ipynb)
- [t-test.ipynb](./explorations/bambi/t-test.ipynb)
- Examples: https://github.com/treszkai/best/tree/master/examples
- Links: https://www.one-tab.com/page/HoSHco_iSG-MHXG7kXOj7g


#### Notebook setup

In [1]:
# load Python modules
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import bambi as bmb
import arviz as az



In [2]:
# Figures setup
plt.clf()  # needed otherwise `sns.set_theme` doesn't work
from plot_helpers import RCPARAMS
RCPARAMS.update({"figure.figsize": (5, 3)})   # good for screen
# RCPARAMS.update({"figure.figsize": (5, 1.6)})  # good for print
sns.set_theme(
    context="paper",
    style="whitegrid",
    palette="colorblind",
    rc=RCPARAMS,
)

# High-resolution please
%config InlineBackend.figure_format = "retina"

# Where to store figures
DESTDIR = "figures/bayes/dmeans"


In [3]:
# set random seed for repeatability
np.random.seed(42)
#######################################################

## Model

## Example 1: electricity prices

We compare electricity prices collected in the East End and the West End.

### Electricity prices dataset

In [4]:
eprices = pd.read_csv("../datasets/eprices.csv")
eprices.groupby("loc")["price"].describe()

| loc | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| East | 9.0 | 6.155556 | 0.877655 | 4.8 | 5.5 | 6.3 | 6.5 | 7.7 |
| West | 9.0 | 9.155556 | 1.562139 | 6.8 | 8.3 | 8.6 | 10.0 | 11.8 |
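Before fitting a model, it can help to compute the plug-in estimates directly from the summary table above. The sketch below hard-codes the rounded means and standard deviations from that table and assumes the equal-sample-size pooled standard deviation as the denominator of Cohen's d. The `cohend` variable reported later by `ministats` may be defined differently.

```python
import math

# Summary statistics copied from the table above (n = 9 per group)
mean_W, std_W = 9.155556, 1.562139
mean_E, std_E = 6.155556, 0.877655

# Observed difference between sample means (West minus East)
dhat = mean_W - mean_E

# Pooled standard deviation for equal sample sizes (assumed convention)
pooled_std = math.sqrt((std_W**2 + std_E**2) / 2)

# Plug-in estimate of Cohen's d
cohen_d = dhat / pooled_std

print(f"dhat = {dhat:.3f}")
print(f"pooled_std = {pooled_std:.3f}")
print(f"cohen_d = {cohen_d:.2f}")
```

These sample-based values give a rough reference point for the posterior summaries of `dmeans` and `cohend`.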


### Bayesian model
TODO: add formulas
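
As a placeholder, here is a sketch of the model inferred from the Bambi specification used in this notebook: a Student's t family, a `1 + group` formula for the mean, and a `0 + group` formula for `sigma` on the log scale. The symbols $\beta_0$, $\beta_1$, and $\gamma_g$ are notation introduced here for illustration, and the priors are set by `bayes_dmeans` or by Bambi's defaults and are not shown.

$$
\begin{aligned}
X_i &\sim \mathcal{T}\!\left(\nu,\ \mu_{g[i]},\ \sigma_{g[i]}\right), \\
\mu_{\text{ref}} &= \beta_0, \qquad \mu_{\text{other}} = \beta_0 + \beta_1, \\
\log \sigma_g &= \gamma_g .
\end{aligned}
$$

Under this parametrization, the difference between means is the coefficient $\beta_1$, and the group scale parameters are recovered as $\sigma_g = e^{\gamma_g}$.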

### Bambi model

In [5]:
from ministats.bayes import bayes_dmeans

epricesW = eprices[eprices["loc"]=="West"]["price"]
epricesE = eprices[eprices["loc"]=="East"]["price"]
mod1, idata1 = bayes_dmeans(epricesW, epricesE, group_name="loc", var_name="price", groups=["West", "East"])

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [nu, Intercept, loc, sigma_loc]


Output()

Sampling 4 chains for 1_000 tune and 2_000 draw iterations (4_000 + 8_000 draws total) took 3 seconds.


### Model fitting and analysis

In [6]:
from ministats.bayes import calc_dmeans_stats
calc_dmeans_stats(idata1, group_name="loc");
az.summary(idata1, kind="stats", hdi_prob=0.95,
           var_names=["dmeans", "sigma_West", "sigma_East", "dstd", "nu", "cohend"])

KeyError: "No variable named 'sigma'. Variables on the dataset include ['chain', 'draw', 'Intercept', 'loc_dim', 'loc', ..., 'sigma_loc_dim', 'sigma_loc', 'dmeans', 'mu_other', 'mu_West']"

In [None]:
from ministats.bayes import plot_dmeans_stats
plot_dmeans_stats(mod1, idata1, group_name="loc");

### Compare to previous results

### Conclusions

## Example 2: comparing IQ scores

We'll look at IQ scores data taken from the paper *Bayesian Estimation Supersedes the t-Test* (BEST) by John K. Kruschke.

### Data

In [None]:
iqs2 = pd.read_csv("../datasets/exercises/iqs2.csv")
iqs2.groupby("group")["iq"].describe()

In [None]:
sns.histplot(data=iqs2, x="iq", hue="group");

### Bayesian model
TODO: add formulas

### Bambi model

In [None]:
formula2 = bmb.Formula("iq ~ 1 + group",
                       "sigma ~ 0 + group")

mod2 = bmb.Model(formula=formula2,
                 family="t",
                 link="identity",
                 data=iqs2)
mod2

In [None]:
# # ALT use the function
# from ministats.bayes import bayes_dmeans
# treated = iqs2[iqs2["group"]=="treat"]["iq"].values
# controls = iqs2[iqs2["group"]=="ctrl"]["iq"].values
# mod2, idata2 = bayes_dmeans(treated, controls, var_name="iq",
#                             group_name="group", groups=["treat", "ctrl"])

In [None]:
mod2.build()
mod2.graph()

### Model fitting and analysis

In [None]:
idata2 = mod2.fit(draws=5000)

In [None]:
from ministats.bayes import calc_dmeans_stats
calc_dmeans_stats(idata2, group_name="group");
az.summary(idata2, kind="stats", hdi_prob=0.95,
           var_names=["dmeans", "sigma_treat", "sigma_ctrl", "dstd", "nu", "cohend"])
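
The summary above reports 95% highest density intervals (`hdi_prob=0.95`). To illustrate what such an interval is, here is a minimal sketch that finds the narrowest interval containing 95% of a set of samples. The helper `hdi` and the synthetic normal draws are made up for this illustration. This is not ArviZ's implementation, and it assumes a unimodal distribution.

```python
import numpy as np

def hdi(samples, prob=0.95):
    """Narrowest interval containing `prob` of the samples (unimodal case)."""
    xs = np.sort(np.asarray(samples).ravel())
    n = len(xs)
    k = int(np.ceil(prob * n))              # number of samples inside the interval
    widths = xs[k - 1:] - xs[: n - k + 1]   # width of every candidate window
    i = int(np.argmin(widths))              # index of the narrowest window
    return xs[i], xs[i + k - 1]

# Synthetic stand-in for posterior draws (not output from the model above)
rng = np.random.default_rng(42)
draws = rng.normal(loc=0, scale=1, size=20000)

lo, hi = hdi(draws, prob=0.95)
print(f"95% HDI: [{lo:.2f}, {hi:.2f}]")
```

For a symmetric distribution like this one, the interval is close to the central 95% interval. For skewed posteriors such as `nu`, the two can differ noticeably.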

In [None]:
# ALT. manual calculations
# post2 = idata2["posterior"]
# # Calculate sigmas from log-sigmas
# post2["sigma_treat"] = np.exp(post2["sigma_group"][:,:,1])
# post2["sigma_ctrl"] = np.exp(post2["sigma_group"][:,:,0])
# # Difference in standard deviations
# post2["dstd"] = post2["sigma_treat"] - post2["sigma_ctrl"]
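
The manual calculation in the commented cell above can be tried without fitting a model. The sketch below uses made-up log-sigma draws with the same `(chain, draw, group)` layout that the cell assumes for `post2["sigma_group"]`. The numbers are synthetic and unrelated to the IQ data.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up stand-in for post2["sigma_group"]: log-sigma draws with
# dimensions (chain, draw, group); group 0 = ctrl and group 1 = treat
log_sigmas = rng.normal(loc=[1.0, 1.5], scale=0.1, size=(4, 1000, 2))

# Undo the log link to get the sigma parameters on the original scale
sigma_ctrl = np.exp(log_sigmas[:, :, 0])
sigma_treat = np.exp(log_sigmas[:, :, 1])

# Difference in sigmas, one value per posterior draw
dstd = sigma_treat - sigma_ctrl

print(dstd.shape)
print(dstd.mean())
```

Because the transformation is applied draw by draw, the result is a full posterior distribution for the difference, not just a point estimate.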

In [None]:
from ministats.bayes import plot_dmeans_stats
plot_dmeans_stats(mod2, idata2, group_name="group", ppc_xlims=[90,110]);

### Compare to previous results

### Conclusions

## Example 3: lecture and debate curriculums


### Students dataset

In [None]:
students = pd.read_csv("../datasets/students.csv")
students.groupby("curriculum")["score"].describe()

### Bayesian model
TODO: add formulas

### Bambi model

In [None]:
from ministats.bayes import bayes_dmeans

studentsD = students[students["curriculum"]=="debate"]
studentsL = students[students["curriculum"]=="lecture"]
scoresD = studentsD["score"]
scoresL = studentsL["score"]

mod3, idata3 = bayes_dmeans(scoresD, scoresL, group_name="curriculum", var_name="score", groups=["debate", "lecture"])

### Model fitting and analysis

In [None]:
from ministats.bayes import calc_dmeans_stats
calc_dmeans_stats(idata3, group_name="curriculum");
az.summary(idata3, kind="stats", hdi_prob=0.95,
           var_names=["dmeans", "sigma_debate", "sigma_lecture", "dstd", "nu", "cohend"])

In [None]:
from ministats.bayes import plot_dmeans_stats
plot_dmeans_stats(mod3, idata3, group_name="curriculum", ppc_xlims=[50,100]);

### Compare to previous results

### Conclusions

## Explanations

## Discussion

## Exercises

### Exercise 1: small samples

In [None]:
As = [5.77, 5.33, 4.59, 4.33, 3.66, 4.48]
Bs = [3.88, 3.55, 3.29, 2.59, 2.33, 3.59]
groups = ["A"]*len(As) + ["B"]*len(Bs)
df1 = pd.DataFrame({"group": groups, "vals": As + Bs})
# df1

In [None]:
from scipy.stats import t as tdist

# standard deviation of a t-distribution with location 100, scale 10, and 2.1 degrees of freedom
tdist(loc=100, scale=10, df=2.1).std()

In [None]:
# same value from the closed-form formula: scale * sqrt(df / (df - 2))
10 * np.sqrt(2.1 / (2.1-2))
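
The two cells above agree because a Student's t-distribution with scale $\sigma$ and $\nu > 2$ degrees of freedom has a finite standard deviation with a closed form:

$$
\mathrm{std}(X) = \sigma \sqrt{\frac{\nu}{\nu - 2}},
\qquad
\text{so for } \sigma = 10,\ \nu = 2.1: \quad
10 \sqrt{\frac{2.1}{0.1}} = 10\sqrt{21} \approx 45.8 .
$$

When $\nu$ is close to 2, the standard deviation is much larger than the scale parameter, so the `sigma` parameters of a t model should not be read directly as standard deviations.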

In [None]:
# New Default prior in R BEST code
from scipy.stats import gamma

nuMean = 30
nuSD = 30

alpha = nuMean**2 / nuSD**2  # shape
beta = nuMean / nuSD**2   # rate
print(f"{alpha=} {beta=}")

rv_Nu = gamma(a=alpha, scale=1/beta)
xs = np.linspace(0,100)
ax = sns.lineplot(x=xs, y=rv_Nu.pdf(xs));
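
The shape and rate in the cell above come from matching the mean and standard deviation of a gamma distribution, whose mean is shape/rate and whose variance is shape/rate². The sketch below wraps that calculation in a hypothetical helper, `gamma_shape_rate` (a name introduced here, not part of any library), and checks the round trip.

```python
import math

def gamma_shape_rate(mean, sd):
    """Hypothetical helper: shape and rate of a gamma with the given mean and sd."""
    shape = mean**2 / sd**2
    rate = mean / sd**2
    return shape, rate

shape, rate = gamma_shape_rate(mean=30, sd=30)

# Round trip: recover the mean and sd from the shape and rate
mean_back = shape / rate
sd_back = math.sqrt(shape) / rate

print(shape, rate, mean_back, sd_back)
```

With mean and standard deviation both equal to 30, the shape is 1, so this prior is an exponential distribution with mean 30.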

In [None]:
# Bambi default prior for `nu`
rv_Nu2 = gamma(a=2, scale=10)
xs = np.linspace(0,100)
sns.lineplot(x=xs, y=rv_Nu2.pdf(xs))
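
To compare the two priors for `nu` numerically, one option is to ask how much prior mass each puts below a given value. The sketch below uses the closed-form CDFs of gamma distributions with shape 1 (`gamma(a=1, scale=30)`, from `nuMean = nuSD = 30`) and shape 2 (`gamma(a=2, scale=10)`). The cutoff of 30 is an arbitrary choice made for illustration, and the helper names are hypothetical.

```python
import math

def cdf_gamma_shape1(x, scale):
    """CDF of gamma(shape=1, scale), i.e. an exponential distribution."""
    return 1 - math.exp(-x / scale)

def cdf_gamma_shape2(x, scale):
    """CDF of gamma(shape=2, scale), using its closed form."""
    return 1 - math.exp(-x / scale) * (1 + x / scale)

x = 30  # hypothetical cutoff, chosen only for illustration
p_best = cdf_gamma_shape1(x, scale=30)    # gamma(a=1, scale=30)
p_bambi = cdf_gamma_shape2(x, scale=10)   # gamma(a=2, scale=10)

print(f"P(nu < {x}): gamma(1, scale=30) = {p_best:.3f}, gamma(2, scale=10) = {p_bambi:.3f}")
```

The shape-2 prior places more of its mass on smaller values of `nu`, which corresponds to heavier-tailed t-distributions.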

## Links

# BONUS Examples

## Example 4: small example from the BEST vignette

See http://cran.nexr.com/web/packages/BEST/vignettes/BEST.pdf#page=2


In [None]:
y1s = [5.77, 5.33, 4.59, 4.33, 3.66, 4.48]
y2s = [3.88, 3.55, 3.29, 2.59, 2.33, 3.59]

from ministats.bayes import bayes_dmeans
mod4, idata4 = bayes_dmeans(y1s, y2s, groups=["y1", "y2"])

In [None]:
from ministats.bayes import calc_dmeans_stats
calc_dmeans_stats(idata4)
az.summary(idata4, kind="stats", hdi_prob=0.95,
           var_names=["dmeans", "sigma_y1", "sigma_y2", "dstd", "nu", "cohend"])

In [None]:
from ministats.bayes import plot_dmeans_stats
plot_dmeans_stats(mod4, idata4, ppc_xlims=None);

## Example 5: comparing morning to evening

https://github.com/treszkai/best/blob/master/examples/paired_samples.py


In [None]:
morning = [8.99, 9.21, 9.03, 9.15, 8.68, 8.82, 8.66, 8.82, 8.59, 8.14,
           9.09, 8.80, 8.18, 9.23, 8.55, 9.03, 9.36, 9.06, 9.57, 8.38]
evening = [9.82, 9.34, 9.73, 9.93, 9.33, 9.41, 9.48, 9.14, 8.62, 8.60,
           9.60, 9.41, 8.43, 9.77, 8.96, 9.81, 9.75, 9.50, 9.90, 9.13]
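
The linked example file is named `paired_samples.py`, which suggests the two lists are paired measurements. Assuming the i-th entries of `morning` and `evening` belong to the same subject (an assumption not confirmed in this notebook), a quick look at the per-pair differences gives a useful complement to the two-group model fitted below.

```python
import statistics

morning = [8.99, 9.21, 9.03, 9.15, 8.68, 8.82, 8.66, 8.82, 8.59, 8.14,
           9.09, 8.80, 8.18, 9.23, 8.55, 9.03, 9.36, 9.06, 9.57, 8.38]
evening = [9.82, 9.34, 9.73, 9.93, 9.33, 9.41, 9.48, 9.14, 8.62, 8.60,
           9.60, 9.41, 8.43, 9.77, 8.96, 9.81, 9.75, 9.50, 9.90, 9.13]

# Per-pair differences (evening minus morning), assuming matched ordering
diffs = [e - m for e, m in zip(evening, morning)]

mean_diff = statistics.mean(diffs)
sd_diff = statistics.stdev(diffs)

print(f"mean difference = {mean_diff:.3f}")
print(f"sd of differences = {sd_diff:.3f}")
print(f"pairs with evening > morning: {sum(d > 0 for d in diffs)} of {len(diffs)}")
```

If the data are indeed paired, a model of the differences would typically be more sensitive than the independent-groups comparison, because it removes between-subject variability.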

In [None]:
from ministats.bayes import bayes_dmeans
mod5, idata5 = bayes_dmeans(evening, morning, groups=["evening", "morning"])

In [None]:
from ministats.bayes import calc_dmeans_stats
calc_dmeans_stats(idata5)
az.summary(idata5, kind="stats", hdi_prob=0.95,
           var_names=["dmeans", "sigma_evening", "sigma_morning", "dstd", "nu", "cohend"])

In [None]:
from ministats.bayes import plot_dmeans_stats
plot_dmeans_stats(mod5, idata5, ppc_xlims=None);