# Bayesian inference
 
Bayesian inference is a way of reasoning that combines prior knowledge with data to estimate the probability of a hypothesis. In the Bayesian framework, prior knowledge is what you already know or believe about a situation, based on experience, expertise, or assumptions. Data is the new evidence that you collect through observations, experiments, or surveys. The probability of a hypothesis is how likely it is to be true, given both the prior knowledge and the data.

Bayesian inference consists of three steps.

1. **Prior distribution.** We choose a probability density function to model the parameter $θ$, that is, the prior distribution $p(θ)$. This is our best guess about the parameter before we see the data $X$.

2. **Likelihood function.** We choose a probability density function for $p(X|θ)$. In essence, we model what the data $X$ will look like for a given parameter $θ$.

3. **Posterior distribution.** We compute the posterior distribution $p(θ|X)$ and, if a single estimate is needed, choose the $θ$ with the highest $p(θ|X)$.

The posterior distribution then becomes the new prior distribution, and the third step is repeated each time new data arrives.
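The three steps, and the idea that the posterior becomes the next prior, can be sketched on a grid of candidate $θ$ values. This is only an illustration under stated assumptions: the data are made-up 0/1 outcomes, the likelihood is Bernoulli, the prior is flat, and the helper name `update` is hypothetical.

```python
import numpy as np

# Grid of candidate values for theta (probability of success).
theta = np.linspace(0.001, 0.999, 999)

def update(prior, data):
    """One Bayesian update on the grid for 0/1 (Bernoulli) observations."""
    k, n = sum(data), len(data)
    likelihood = theta**k * (1 - theta)**(n - k)   # step 2: p(X | theta)
    posterior = likelihood * prior                 # step 3: Bayes' rule, unnormalized
    return posterior / posterior.sum()             # normalize over the grid

# Step 1: a flat prior over the grid (an assumption made for this sketch).
prior = np.full(theta.shape, 1 / theta.size)

batch_1 = [1, 0, 1, 1]            # made-up observations
batch_2 = [0, 1, 1, 1, 0, 1]      # more made-up observations, arriving later

posterior_1 = update(prior, batch_1)
posterior_2 = update(posterior_1, batch_2)         # the old posterior is the new prior
posterior_all = update(prior, batch_1 + batch_2)   # all data processed at once

theta_map = theta[np.argmax(posterior_2)]          # theta with the highest posterior
print(np.allclose(posterior_2, posterior_all), round(theta_map, 3))
```

Updating in two batches gives the same posterior as processing all observations at once, which is why the update can simply be repeated as new data arrives.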

## Mathematical equation of Bayesian Inference

Bayesian inference uses a mathematical formula called Bayes' theorem to update the probability of a hypothesis as new data becomes available.

$$ 
{p(θ|X)} = \frac{p(X|\theta)\times p(\theta)} {p(X)}
$$

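where

$p(X|θ)$ - the likelihood, that is, the distribution of the observed data $X$ conditional on the parameter $θ$;

$p(θ)$ - the prior distribution;

$p(X)$ - the evidence (marginal likelihood), $p(X) = \int p(X|θ)\,p(θ)\,dθ$, which normalizes the posterior;

$p(θ|X)$ - the posterior distribution.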
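As a small numerical illustration of the formula, the sketch below applies Bayes' theorem to a parameter that can take only two values. The two candidate values, the equal prior weights and the observed counts are all made up for the example.

```python
import numpy as np
from scipy.stats import binom

# Two hypothetical candidate values for theta = P(heads) of a coin.
thetas = np.array([0.5, 0.8])
prior = np.array([0.5, 0.5])          # p(theta): both values equally plausible a priori

heads, tosses = 8, 10                 # made-up observed data X

likelihood = binom.pmf(heads, tosses, thetas)   # p(X | theta) for each candidate
evidence = np.sum(likelihood * prior)           # p(X): sum over all candidates
posterior = likelihood * prior / evidence       # p(theta | X)

print(posterior)
```

With a discrete parameter the denominator $p(X)$ is a sum over the candidates; with a continuous parameter it becomes an integral over $θ$.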


Bayesian inference has three basic building blocks:

#### *The likelihood*

The first building block of a parametric Bayesian model is the likelihood:
$$
p(X|θ)
$$
The likelihood is the probability density of $X$ when the parameters of the distribution that generated the data are equal to $θ$.

For now we assume that $X$ and $θ$ are continuous. We will discuss later how to relax this assumption.

#### *The prior*

The second building block of our inference is the prior:
$$
p(θ)
$$
The prior is the subjective probability density assigned to the parameter $θ$ before seeing the data.

#### *The posterior*

After observing the data $X$, we apply Bayes' rule to update our prior belief about the parameter $θ$ (the formula is given below).

Suppose that we fit a model with parameters $\boldsymbol w$ to the dataset $\boldsymbol D = (\boldsymbol X, \boldsymbol y)$. According to Bayes' formula, the posterior distribution satisfies:

$$
    p(\boldsymbol w \vert \boldsymbol X, \boldsymbol y) \propto p(\boldsymbol y \vert \boldsymbol X, \boldsymbol w) p(\boldsymbol w).
$$

We are particularly interested in the posterior distribution because it allows us to make predictions.

### Probability distribution functions in an interactive plot
Source: https://www.datacamp.com/tutorial/probability-distributions-python

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from ipywidgets import interact, widgets
from scipy.stats import uniform, norm, expon, bernoulli, binom, poisson

In [None]:
def generate_graph(distribution_type, parameter_1, parameter_2):
    """Plot the pdf (continuous) or pmf (discrete) of the chosen distribution."""
    x = np.linspace(0, 1, 1000)  # grid used for the continuous distributions
    discrete = False

    if distribution_type == 'Uniform':
        y = uniform.pdf(x, loc=parameter_1, scale=parameter_2)
        title = f"Uniform distribution: start={parameter_1}, width={parameter_2}"
    elif distribution_type == 'Normal':
        y = norm.pdf(x, loc=parameter_1, scale=parameter_2)
        title = f"Normal distribution: mean={parameter_1}, standard deviation={parameter_2}"
    elif distribution_type == 'Exponential':
        y = expon.pdf(x, loc=parameter_1, scale=parameter_2)
        title = f"Exponential distribution: loc={parameter_1}, 1/lambda={parameter_2}"
    elif distribution_type == 'Bernoulli':
        # A pmf is non-zero only on the integer support, so evaluate it there.
        x, discrete = np.array([0, 1]), True
        p = min(parameter_1, 1.0)  # the slider goes up to 10, but a probability cannot exceed 1
        y = bernoulli.pmf(x, p=p)
        title = f"Bernoulli distribution: probability of success={p}"
    elif distribution_type == 'Binomial':
        n = int(parameter_1)
        x, discrete = np.arange(n + 1), True
        p = min(parameter_2, 1.0)
        y = binom.pmf(x, n=n, p=p)
        title = f"Binomial distribution: n={n}, probability of success={p}"
    elif distribution_type == 'Poisson':
        mu = int(parameter_1)
        x, discrete = np.arange(0, 31), True  # covers most of the mass for lambda <= 10
        y = poisson.pmf(x, mu=mu)
        title = f"Poisson distribution: lambda={mu}"
    else:
        return

    plt.figure(figsize=(6, 3))
    if discrete:
        plt.stem(x, y, label=distribution_type)  # separate bars for a pmf
    else:
        plt.plot(x, y, label=distribution_type)
    plt.title(title)
    plt.xlabel('Value')
    plt.ylabel('Density / probability')
    plt.legend()
    plt.grid(True)
    plt.show()

choose_dist_type = widgets.Dropdown(
    options=['Uniform', 'Normal', 'Exponential', 'Bernoulli', 'Binomial', 'Poisson'],
    value='Uniform',
    description='Distribution type:'
)

parameter_1 = widgets.FloatSlider(value=0, min=0, max=10, step=0.1, description='Parameter 1')
# min=0.1 because parameter 2 is used as a scale, which must be strictly positive
parameter_2 = widgets.FloatSlider(value=1, min=0.1, max=10, step=0.1, description='Parameter 2')


interact_plot = interact(
    generate_graph,
    distribution_type=choose_dist_type,
    parameter_1 = parameter_1,
    parameter_2 = parameter_2
)

## Conjugate distributions

In Bayesian probability theory, if the prior and the posterior belong to the same parametric family (a set of probability distributions that share a common mathematical form, characterized by a set of parameters), then the prior is called conjugate for the likelihood. The posterior itself is obtained from Bayes' theorem:

$$
p(θ|x) = \frac{p(x|θ)p(θ)}{p(x)} = \frac{p(x|θ)p(θ)}{\int p(x|θ)p(θ)\,dθ}
$$

### Definition

Let $Φ $ be a parametric family of probability distributions. A prior distribution $p(θ)$ belonging to $Φ $ is said to be conjugate for the likelihood $p(x|θ)$ if and only if, the resulting posterior distribution $p(θ|x)$ also belongs to $Φ $ .

In symbols, conjugacy means that the family $Φ$ is closed under Bayesian updating:
$$
p(θ) \in Φ \;\Rightarrow\; p(θ|x) \in Φ
$$

In simple words, when we use a conjugate prior, the posterior obtained through the Bayesian updating process belongs to the same parametric family as the prior.

### Binomial likelihood and beta priors


````{admonition} Remark
:class: dropdown

The Binomial distribution represents the probability of obtaining a number of successes in a fixed number of independent trials, where each trial has a binary outcome (success/failure) with probability $p$ of success. The Beta distribution is a continuous probability distribution defined on the interval $[0,1]$, often parametrized by two shape parameters, $\alpha$ and $\beta$.

````

With a Beta prior on $p$, the marginal distribution of the number of successes $x$ in $n$ trials is the beta-binomial distribution:

$$
\begin{aligned}
f(x | n, \alpha, \beta) &= \int_{0}^{1} \mathrm{Bin}(x|n,p)\,\mathrm{Beta}(p|\alpha,\beta)\,dp \\
&= \binom{n}{x} \frac{1}{B(\alpha, \beta)} \int_{0}^{1} p^{x+\alpha-1} (1-p)^{n-x+\beta-1}\,dp \\
&= \binom{n}{x} \frac{B(x+\alpha,\, n-x+\beta)}{B(\alpha, \beta)}
\end{aligned}
$$

Equivalently, in terms of the prior, likelihood and posterior:

1. Prior distribution:
<br>
$p(p) \propto p^{\alpha - 1} \cdot (1-p)^{\beta-1}$

2. Likelihood function:
<br>
$p(X=k|p) = \binom{n}{k} \cdot p^k \cdot (1-p)^{n-k}$
<br>
where $n$ is the fixed number of independent trials and $k$ is the number of successes

3. Posterior distribution:
<br>
$p(p|X=k) \propto p^{\alpha + k - 1} \cdot (1-p)^{\beta+n-k-1}$, that is, a Beta distribution with parameters $\alpha + k$ and $\beta + n - k$
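The conjugate update above can be checked numerically. The sketch below multiplies a Beta prior by a Binomial likelihood on a grid, normalizes the product, and compares it with the closed-form Beta posterior; the normalizing constant is compared with the beta-binomial marginal. The values of `a`, `b`, `n` and `k` are made up, and the helper `trapezoid` is a hypothetical name introduced here.

```python
import numpy as np
from scipy.stats import beta, binom
from scipy.special import comb, beta as beta_fn

a, b = 2.0, 3.0     # hypothetical prior shape parameters alpha, beta
n, k = 20, 7        # hypothetical data: k successes in n trials

p = np.linspace(0, 1, 20001)
dp = p[1] - p[0]

def trapezoid(f, dx):
    """Trapezoidal rule on an evenly spaced grid."""
    return dx * (f.sum() - 0.5 * (f[0] + f[-1]))

# Likelihood times prior, evaluated on the grid (unnormalized posterior).
unnormalized = binom.pmf(k, n, p) * beta.pdf(p, a, b)
evidence_grid = trapezoid(unnormalized, dp)       # numerical marginal f(k | n, a, b)
posterior_grid = unnormalized / evidence_grid

# Closed-form results from the conjugate update.
posterior_exact = beta.pdf(p, a + k, b + n - k)
evidence_exact = comb(n, k) * beta_fn(k + a, n - k + b) / beta_fn(a, b)

print(np.allclose(posterior_grid, posterior_exact, atol=1e-5))
print(evidence_grid, evidence_exact)
```

The practical benefit of conjugacy is visible here: the closed-form route needs only two additions, while the grid route needs an integral.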

### Beta-binomial model

Example with soccer dataset https://www.kaggle.com/datasets/irkaal/english-premier-league-results

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta, binom
import pandas as pd

Let's look at the data. The main columns that we'll work with are **Season** and **FTR = Full Time Result (H=Home Win, D=Draw, A=Away Win)**.

In [None]:
df = pd.read_csv("dataset_for_bayes_inference/results.csv", encoding='latin-1')
df.tail()

We will use the home teams' results from the 2015-16 to 2019-20 seasons to form our prior, and treat the matches of the 2020-21 season as the newly observed data.

In [None]:
selected_matches = df.query("Season in ['2015-16', '2016-17','2017-18', '2018-19', '2019-20']")
selected_matches_number = len(selected_matches)
print("Matches in seasons 2015-16 to 2019-20: ", selected_matches_number)

Our prior belief will be the home win rate in the selected matches.

In [None]:
home_wins = len(selected_matches[selected_matches['FTR'] == 'H'])
home_win_rate = round(home_wins/selected_matches_number,2)
print("Win rate: ", home_win_rate)

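The positive parameters $\alpha$ and $\beta$ control the shape of the Beta distribution. We set the prior parameters so that $\alpha$ corresponds to the home wins and $\beta$ to the remaining matches in the selected seasons: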

In [None]:
alpha_param = home_win_rate * selected_matches_number
beta_param = selected_matches_number - alpha_param

print(alpha_param, beta_param)

Next, we select the matches from the 2020-21 season:

In [None]:
selected_matches_20_21 = df.query("Season == '2020-21'")
selected_matches_20_21_num = len(selected_matches_20_21)

home_wins_20_21 = len(selected_matches_20_21[selected_matches_20_21['FTR'] == 'H'])
win_rate_20_21 = round(home_wins_20_21/selected_matches_20_21_num,2)

print(" Selected matches from 2020-21: ", selected_matches_20_21_num, "\n", "Win matches: ", home_wins_20_21, "\n", "Win rate: ", win_rate_20_21)

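According to the posterior of the Binomial likelihood with a Beta prior derived above, with $y$ home wins in $n$ new matches, the posterior parameters are:

$$ \alpha' = \alpha + y $$
$$ \beta' = \beta + n - y $$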



In [None]:
alpha_pos_param = alpha_param + home_wins_20_21
beta_pos_param = selected_matches_20_21_num - home_wins_20_21 + beta_param
print(alpha_pos_param, beta_pos_param)

Let's calculate the prior, likelihood and posterior for the selected data:

In [None]:
n = selected_matches_20_21_num
y = home_wins_20_21

theta = np.linspace(0, 1, 1000)  # grid of possible values of the parameter (home win probability)

prior = beta.pdf(theta, alpha_param, beta_param)
likelihood = binom.pmf(y, n, theta) 
posterior = beta.pdf(theta, alpha_pos_param, beta_pos_param) 


The plot below shows the result of the Bayesian update:

In [None]:
def plot_show(theta, prior, likelihood, posterior, l_factor):
    """Plot the prior, the scaled likelihood and the posterior over theta.

    l_factor rescales the likelihood curve so that it is visible next to the
    prior and posterior curves, because likelihood values are typically much
    smaller. The scaling is for display only and is not required by the math.
    """
    plt.figure(figsize=(10, 6))

    plt.plot(theta, prior, label="prior")
    plt.plot(theta, l_factor * likelihood, label="likelihood (scaled)")
    plt.plot(theta, posterior, label="posterior")

    plt.ylabel("density")
    plt.xlabel("theta")
    plt.xlim([0, 1])
    plt.legend()
    plt.xticks(np.arange(0, 1.1, 0.1))  # ticks from 0.0 to 1.0
    plt.show()

In [None]:
plot_show(theta, prior, likelihood, posterior, l_factor=900)

In the plot, the blue curve is our prior belief, which peaks slightly above 0.4 on the x-axis (we computed this value as 0.46). The likelihood, shown as the orange curve, peaks below 0.4, and the green posterior curve sits around 0.45 after seeing the data from the 2020-21 season. <br> The posterior peaks at a lower value than the prior, which means that **the probability changed** after observing the new data. Our home teams now have an estimated winning probability of around 0.45.
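One way to see why the posterior lands between the prior and the likelihood: the posterior mean of a Beta-Binomial model is a weighted average of the prior mean and the observed rate, with weights given by the prior "pseudo-count" $\alpha + \beta$ and the number of new matches $n$. The counts in the sketch below are illustrative round numbers, not the values from the dataset.

```python
# Illustrative numbers only (not taken from the dataset).
alpha_prior, beta_prior = 46.0, 54.0   # prior: 100 pseudo-matches, 46% home wins
n_new, y_new = 100, 38                 # new data: 38 home wins in 100 matches

prior_mean = alpha_prior / (alpha_prior + beta_prior)
data_rate = y_new / n_new

# Conjugate update of the Beta parameters.
alpha_post = alpha_prior + y_new
beta_post = beta_prior + n_new - y_new
posterior_mean = alpha_post / (alpha_post + beta_post)

# The same number, written as a weighted average of prior mean and data rate.
weight_prior = (alpha_prior + beta_prior) / (alpha_prior + beta_prior + n_new)
blend = weight_prior * prior_mean + (1 - weight_prior) * data_rate

print(prior_mean, data_rate, posterior_mean, blend)
```

In the notebook the prior is built from five seasons and the new data from a single season, so the prior carries most of the weight and the posterior stays close to it.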

### Uninformative prior

What if we do not want our prior belief to influence the result? In the previous example, the prior belief was calculated from the matches of the 2015-16 to 2019-20 seasons, and the new data came from the 2020-21 season. This period coincided with the COVID-19 pandemic, which may have affected the players. <br> <br> In such a case we can work with a minimal prior belief, called an **uninformative prior**. An uninformative prior gives equal weight to all possible values of the parameter, so all values are equally likely before we see any data.

In [None]:
alpha_param = 1
beta_param = 1

alpha_pos_param = alpha_param + y
beta_pos_param = n - y + beta_param

prior = beta.pdf(theta, alpha_param, beta_param)
likelihood = binom.pmf(y, n, theta) 
posterior = beta.pdf(theta, alpha_pos_param, beta_pos_param) 

plot_show(theta, prior, likelihood, posterior, l_factor=300)

The plot again shows three curves. The blue prior line is flat, which means that all values of the parameter are equally likely before seeing any data. The orange likelihood curve peaks at a value slightly below 0.4. The green posterior curve peaks at the same value: with a flat prior the posterior is proportional to the likelihood, so the two curves have the same shape, and any difference in their height comes only from the `l_factor` scaling.
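The relationship between the posterior and the likelihood under a flat Beta(1, 1) prior can be verified directly. The counts `n` and `y` below are made up for the sketch.

```python
import numpy as np
from scipy.stats import beta, binom

n, y = 50, 19                           # made-up data: 19 successes in 50 trials
theta = np.linspace(0.01, 0.99, 99)     # interior grid, avoids dividing by zero

likelihood = binom.pmf(y, n, theta)
posterior = beta.pdf(theta, 1 + y, 1 + n - y)   # Beta(1, 1) prior updated with the data

# With a flat prior the ratio posterior / likelihood is the same for every theta.
ratio = posterior / likelihood
theta_mode = theta[np.argmax(posterior)]

print(ratio.min(), ratio.max(), theta_mode, y / n)
```

The constant ratio equals $n + 1$, and the posterior mode coincides with the observed rate $y / n$.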

### The normal likelihood and normal priors

When we consider a normal likelihood and a normal prior in Bayesian statistics, both the likelihood and the prior follow the normal (Gaussian) distribution.

The normal parametric family has two parameters, the mean $μ$ and the variance $σ^2$. Here we treat $σ^2$ as known and place a normal prior with mean $μ_0$ and variance $\tau_0^2$ on $μ$. After observing $n$ data points $x = (x_1, \dots, x_n)$ with sample mean $\bar{x}$, the Bayesian updating process can be expressed as follows:

$$
p(μ|x) = (2 \pi \tau_n^2)^{-\frac{1}{2}} \exp{(-\frac{1}{2\tau_n^2} (μ-μ_n)^2)}
$$

where

$$
\tau_n^2 = (\frac{n}{σ^2} + \frac{1}{\tau_0^2})^{-1}, \qquad μ_n = \tau_n^2 (\frac{μ_0}{\tau_0^2} + \frac{n \bar{x}}{σ^2})
$$

Or, step by step:

1. Prior distribution:
<br>
$p(μ) \propto \exp{(- \frac{1}{2\tau_0^2} (μ-μ_0)^2)}$

2. Likelihood function:
<br>
$p(x|μ) \propto \exp{(- \frac{1}{2σ^2} \sum_{i=1}^{n} (x_i-μ)^2)}$

3. Posterior distribution:
<br>
$p(μ|x) \propto \exp{(- \frac{1}{2\tau_n^2}(μ - μ_n)^2)}$
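The normal update can be sketched in a few lines. The sketch assumes the standard setting of a known data variance $σ^2$ and a normal prior on the mean with mean $μ_0$ and variance $\tau_0^2$; the function name `normal_update` and all numbers are hypothetical.

```python
import numpy as np

def normal_update(mu_0, tau_0_sq, x, sigma_sq):
    """Posterior mean and variance of a normal mean with known data variance.

    Assumes a prior N(mu_0, tau_0_sq) on the mean and i.i.d. observations
    x_i ~ N(mean, sigma_sq).
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    # Precisions (inverse variances) of the data and the prior add up.
    tau_n_sq = 1.0 / (n / sigma_sq + 1.0 / tau_0_sq)
    # Precision-weighted combination of the prior mean and the data.
    mu_n = tau_n_sq * (mu_0 / tau_0_sq + x.sum() / sigma_sq)
    return mu_n, tau_n_sq

# Made-up example: vague prior centred at 0, four observations near 1.
mu_n, tau_n_sq = normal_update(mu_0=0.0, tau_0_sq=4.0,
                               x=[1.2, 0.8, 1.5, 1.1], sigma_sq=1.0)
print(mu_n, tau_n_sq)
```

The posterior mean is pulled from the prior mean towards the sample mean, and the posterior variance is smaller than both the prior variance and $σ^2 / n$.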

### QUIZ section | Glossary cells

In [3]:
# pip3 install jupytercards
# pip3 install jupyterquiz
    
from jupytercards import display_flashcards
folder='files/'
display_flashcards(folder+'glossary.json')





In [6]:
from jupyterquiz import display_quiz

display_quiz(folder+'quizzes.json')


## --- Code Daulet ---

The normal distribution is sometimes also called the "bell-shaped distribution", because the graph of its probability density function resembles the shape of a bell.