## Ch. 02 - Programming Probabilistically

In [None]:
# Import pymc and related code
import arviz as az
import pymc as pm
import preliz as pz

In [None]:
# Import other "data science libraries"
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

### 2.1 Probabilistic programming

#### 2.1.1 Flipping coins the PyMC way

In [None]:
# Initialize repeatable random number generator
rng = np.random.default_rng(123)

In [None]:
# Generate "fake real data"
trials = 4
theta_real = 0.35 # unknown in a real experiment
data = pz.Binomial(n=1, p=theta_real).rvs(
    trials,
    random_state=rng.integers(np.iinfo(np.int32).max),
)

In [None]:
plt.scatter(range(trials), data)
plt.show()

In [None]:
with pm.Model() as our_first_model:
    θ = pm.Beta('θ', alpha=1., beta=1.)
    y = pm.Bernoulli('y', p=θ, observed=data)
    idata = pm.sample(1000)

### 2.2 Summarizing the posterior

In [None]:
az.plot_trace(idata)

In [None]:
az.plot_trace(idata, kind='rank_bars', combined=True)

In [None]:
az.plot_posterior(idata)
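
The posterior plot is annotated with a point estimate and a highest-density interval (HDI). To make the HDI concrete, here is a minimal sketch that finds the narrowest interval containing a given fraction of the samples. `narrowest_interval` is a hypothetical helper, not ArviZ's implementation, and `fake_posterior` is a synthetic stand-in for real posterior draws.

```python
import numpy as np

def narrowest_interval(samples, prob=0.94):
    """Shortest interval containing a fraction `prob` of the samples."""
    x = np.sort(np.asarray(samples).ravel())
    n = x.size
    k = int(np.ceil(prob * n))             # samples inside the interval
    widths = x[k - 1:] - x[: n - k + 1]    # width of every window of k samples
    start = int(np.argmin(widths))
    return x[start], x[start + k - 1]

rng = np.random.default_rng(0)
fake_posterior = rng.beta(2, 4, size=4000)  # synthetic, skewed draws
low, high = narrowest_interval(fake_posterior)
inside = np.mean((fake_posterior >= low) & (fake_posterior <= high))
print(f"HDI: [{low:.2f}, {high:.2f}], mass inside: {inside:.3f}")
```

For a skewed posterior such as this one, the HDI is narrower than the equal-tailed interval with the same mass.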

### 2.3 Posterior-based decisions

#### 2.3.1 Savage-Dickey density ratio

In [None]:
az.plot_bf(idata, var_name='θ', prior=rng.uniform(0, 1, 10000), ref_val=0.5)
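
The Savage-Dickey ratio compares the posterior and prior densities at the reference value: $BF_{01} = p(\theta = \theta_0 \mid y) / p(\theta = \theta_0)$. For a Beta prior with Bernoulli data, both densities are Beta densities, so the ratio can be computed exactly. The sketch below assumes that conjugate setting; `beta_pdf` and `savage_dickey_bf01` are hypothetical helpers, and the counts are made up for illustration.

```python
from math import exp, lgamma, log

def beta_pdf(x, a, b):
    """Density of a Beta(a, b) distribution at x, for 0 < x < 1."""
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    return exp(log_norm + (a - 1) * log(x) + (b - 1) * log(1 - x))

def savage_dickey_bf01(heads, tails, ref_val=0.5, a=1.0, b=1.0):
    """BF_01: posterior density divided by prior density at ref_val."""
    prior = beta_pdf(ref_val, a, b)
    posterior = beta_pdf(ref_val, a + heads, b + tails)
    return posterior / prior

# Hypothetical counts: 1 head and 3 tails
bf01 = savage_dickey_bf01(heads=1, tails=3)
print(f"BF_01 = {bf01:.3f}, BF_10 = {1 / bf01:.3f}")
```

A value of `bf01` above 1 means the data moved density toward the reference value; a value below 1 means they moved density away from it.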

#### 2.3.2 Region of Practical Equivalence

In [None]:
az.plot_posterior(idata, rope=[0.45, 0.55])
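
The ROPE comparison boils down to asking how much posterior mass falls inside the interval. A minimal sketch of that calculation from samples follows; `rope_mass` is a hypothetical helper and `fake_posterior` is synthetic, standing in for the draws of the parameter.

```python
import numpy as np

def rope_mass(samples, rope):
    """Fraction of posterior samples that fall inside the ROPE."""
    samples = np.asarray(samples).ravel()
    low, high = rope
    return float(np.mean((samples >= low) & (samples <= high)))

rng = np.random.default_rng(1)
fake_posterior = rng.beta(2, 4, size=4000)  # synthetic stand-in
mass = rope_mass(fake_posterior, (0.45, 0.55))
print(f"P(theta in ROPE) is about {mass:.3f}")
```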

In [None]:
az.plot_posterior(idata, ref_val=0.5)

#### 2.3.3 Loss functions

In [None]:
# Plot the loss
# The plotting part of this code is from 
# [the chapter 02 code](https://github.com/aloctavodia/BAP3/blob/main/code/Chp_02.ipynb).
grid = np.linspace(0, 1, 200)
θ_pos = idata.posterior['θ']
lossf_a = [np.mean(abs(i - θ_pos)) for i in grid]
lossf_b = [np.mean((i - θ_pos) ** 2) for i in grid]

_, ax = plt.subplots(figsize=(12, 3))
for lossf, c in zip([lossf_a, lossf_b], ['C0', 'C1']):
    mini = np.argmin(lossf)
    ax.plot(grid, lossf, c)
    ax.plot(grid[mini], lossf[mini], 'o', color=c)
    ax.annotate('{:.2f}'.format(grid[mini]),
                (grid[mini], lossf[mini] + 0.03),
                color=c)

    ax.set_yticks([])
    ax.set_xlabel(r'$\hat \theta$')

plt.show()
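
The two minima are not arbitrary: the expected absolute loss is minimized by the posterior median, and the expected squared loss by the posterior mean. The sketch below checks this numerically on synthetic samples (a stand-in for `θ_pos`); the grid resolution limits how closely the minima can match.

```python
import numpy as np

rng = np.random.default_rng(2)
samples = rng.beta(2, 4, size=2000)   # synthetic stand-in for posterior draws
grid = np.linspace(0, 1, 201)         # candidate point estimates, step 0.005

# Expected loss of each candidate estimate, averaged over the samples
abs_loss = np.abs(grid[:, None] - samples[None, :]).mean(axis=1)
sq_loss = ((grid[:, None] - samples[None, :]) ** 2).mean(axis=1)

best_abs = grid[np.argmin(abs_loss)]
best_sq = grid[np.argmin(sq_loss)]
print(f"absolute loss: {best_abs:.3f} vs median {np.median(samples):.3f}")
print(f"squared loss:  {best_sq:.3f} vs mean   {np.mean(samples):.3f}")
```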

In [None]:
# A (silly) asymmetric loss function
lossf = []
for i in grid:
    if i < 0.5:
        f = 1 / np.median(θ_pos / np.abs(i**2 - θ_pos))
    else:
        f = np.mean((i - θ_pos) ** 2 + np.exp(-i)) - 0.25

    lossf.append(f)

In [None]:
# Plot the (silly) asymmetric loss function
mini = np.argmin(lossf)
_, ax = plt.subplots(figsize=(12, 3))
ax.plot(grid, lossf)
ax.plot(grid[mini], lossf[mini], 'o')
ax.annotate('{:.2f}'.format(grid[mini]),
            (grid[mini] + 0.01, lossf[mini] + 0.1))
ax.set_yticks([])
ax.set_xlabel(r'$\hat \theta$')

### 2.4 Gaussians all the way down

Gaussians are very appealing. They are easy to work with, and
many operations applied to Gaussians return another Gaussian.
Additionally, many natural phenomena can be approximated using
Gaussians. In general, almost every time we measure the average
of something using a **big enough** sample size, that average
will be approximately Gaussian distributed.
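
A quick simulation can illustrate this claim. The sketch below averages draws from an exponential distribution, which is strongly right-skewed, and checks that the averages are far more symmetric than the raw draws. The sample sizes and the `skewness` helper are illustrative choices, and the result relies on the underlying distribution having finite variance.

```python
import numpy as np

rng = np.random.default_rng(3)
sample_size = 200    # observations per average
n_averages = 5000    # number of averages computed

# Exponential draws are strongly right-skewed (theoretical skewness of 2)
draws = rng.exponential(scale=1.0, size=(n_averages, sample_size))
averages = draws.mean(axis=1)

def skewness(x):
    """Sample skewness: the third standardized moment."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

print(f"skewness of raw draws: {skewness(draws.ravel()):.2f}")
print(f"skewness of averages:  {skewness(averages):.2f}")
print(f"sd of averages: {averages.std():.3f} "
      f"(theory: {1 / np.sqrt(sample_size):.3f})")
```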

Many phenomena are indeed averages; the height of adults is one
example. (Actually, this distribution is a **mixture** of
**two** Gaussians, one for men and one for women.)

Consequently, it is important to learn to build Gaussian models,
but also to learn how to relax the normality assumption.
(This relaxation is surprisingly easy with tools like PyMC.)



#### 2.4.1 Gaussian inferences

**Background**

We can use nuclear magnetic resonance (NMR) to study molecules or
living things such as humans, sunflowers, and yeast. NMR allows one
to measure different **observable** quantities related to **unobservable**
molecular properties. Chemical shift is one of these observable
quantities; it applies to the nuclei of certain types of atoms.
This problem is similar to examples such as:

- The height of a group of people
- The average time to travel back home
- The weights of bags of oranges

All these examples involve continuous variables that can be thought of as an
average plus a dispersion.

Additionally, a discrete variable can be approximated with a Gaussian if
its number of possible values is large enough. For example, the number of
sexual partners of bonobos, a very promiscuous ape.

In our example, we have 48 chemical shift values.

- The median is around 53
- The inter-quartile range is about 52 to 55
- Two values "far away" from the rest of the data appear to be outliers.
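
Summaries like these can be computed directly with NumPy. The sketch below uses a small set of hypothetical measurements (not the notebook's chemical shift data) and flags outliers with the 1.5 × IQR rule commonly used for boxplot whiskers.

```python
import numpy as np

# Hypothetical measurements, NOT the notebook's chemical shift data
values = np.array([51.1, 52.0, 52.4, 53.0, 53.1, 53.6,
                   54.2, 54.9, 55.3, 63.4, 68.6])

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"median: {median:.1f}, IQR: {q1:.1f} to {q3:.1f}")
print("outliers:", outliers)
```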

In [None]:
# Load the data
data = np.loadtxt('./data/chemical_shifts.csv')

In [None]:
# Plot the data using a boxplot
_, ax = plt.subplots(figsize=(12, 3))
ax.boxplot(data, vert=False)
plt.show()

We'll forget about the two outlying points. We will further assume that
a Gaussian is a good description of the data. Since we know neither the mean
nor the standard deviation, we set priors for both of them. Therefore, a
reasonable model is:

$$
\begin{gather}
\mu \sim \mathcal{U}(l, h) \\
\sigma \sim \mathcal{HN}(\sigma_{\sigma}) \\
Y \sim \mathcal{N}(\mu, \sigma)
\end{gather}
$$

where

- $\mathcal{U}(l, h)$ is the Uniform distribution between
  $l$ and $h$
- $\mathcal{HN}(\sigma_{\sigma})$ is the Half-Normal distribution
  with scale $\sigma_{\sigma}$
- $\mathcal{N}(\mu, \sigma)$ is the Gaussian distribution with mean,
  $\mu$, and standard deviation, $\sigma$.
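
Read from top to bottom, the model is a recipe for generating data: draw a mean, draw a standard deviation, then draw the observations. The sketch below follows that recipe with NumPy. `simulate_model` is a hypothetical helper, and the bounds, scale, and number of observations passed to it are illustrative values.

```python
import numpy as np

def simulate_model(low, high, sigma_scale, n_obs, rng):
    """Draw one synthetic data set following the model's generative story."""
    mu = rng.uniform(low, high)                  # mu ~ U(l, h)
    sigma = abs(rng.normal(0.0, sigma_scale))    # sigma ~ HN(sigma_sigma)
    y = rng.normal(mu, sigma, size=n_obs)        # Y ~ N(mu, sigma)
    return mu, sigma, y

rng = np.random.default_rng(4)
# Illustrative values for l, h, sigma_sigma and the sample size
mu, sigma, y = simulate_model(low=40, high=70, sigma_scale=5, n_obs=48, rng=rng)
print(f"mu = {mu:.2f}, sigma = {sigma:.2f}, sample mean = {y.mean():.2f}")
```

Repeating the simulation many times gives a feel for the kind of data the priors consider plausible before any observations are seen.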

Since we do not know the possible values of $\mu$ and $\sigma$ (a typical
situation), we can set priors reflecting our ignorance. For example, we
can set the boundaries of our uniform distribution to be
$l = 40$ and $h = 70$: a range **larger** than the
range of the data.

For the Half-Normal, in the absence of more information, we can choose a
value of $\sigma_{\sigma}$ that is large compared to the **scale** of the
data. The following PyMC code makes the model concrete.

In [None]:
with pm.Model() as model_g:
    mu = pm.Uniform('μ', lower=40, upper=70)
    sigma = pm.HalfNormal('σ', sigma=5)
    Y = pm.Normal('Y', mu=mu, sigma=sigma, observed=data)
    idata_g = pm.sample()

In [None]:
az.plot_trace(idata_g)
plt.show()

In [None]:
az.plot_pair(idata_g, kind='kde', marginals=True)
plt.show()

In [None]:
az.summary(idata_g, kind='stats').round(2)