***<h3>BAYESIAN EXERCISES</h3>***

***EXERCISE 1***:  
Provide a small and concise example of how Bayesian statistics can be applied in your degree programme. You should provide references for your answer if you don’t come up with an example on your own.

Bayesian statistics played a profound role in the development of Natural Language Processing (NLP) in the 1990s (Cohen, 2022).  
A good example of Bayesian statistics in NLP is sentiment analysis, for instance of **movie reviews or tweets**, with a Naive Bayes classifier. The probability of each word given a class is estimated with Laplace smoothing:  
$$P\left(w_{k} | c_{i}\right)=\frac{\operatorname{count}\left(w_{k}, c_{i}\right)+1}{\sum_{w \in V} \operatorname{count}\left(w, c_{i}\right) + |V|}$$  
- **The prior:** $P(c_{i})$ is the probability that a document belongs to class $c_{i}$ before any of its words are considered.  
- **The likelihood function:** $P\left(w_{k} | c_{i}\right)$ is the probability of observing word $w_{k}$ given that the document belongs to class $c_{i}$. 
    - The numerator $\operatorname{count}\left(w_{k}, c_{i}\right)+1$ is the number of occurrences of word $w_{k}$ in documents of class $c_{i}$, plus 1 for Laplace smoothing. 
    - The denominator $\sum_{w \in V} \operatorname{count}\left(w, c_{i}\right) + |V|$ is the total count of all vocabulary words in documents of class $c_{i}$, plus the vocabulary size $|V|$ multiplied by the Laplace smoothing parameter (here 1).  
- **The posterior:** $P(c_{i} | w_{1}, w_{2}, ..., w_{n}) = \frac{P(c_{i}) \prod_{k=1}^{n} P(w_{k} | c_{i})}{P(w_{1}, w_{2}, ..., w_{n})}$ is the probability that the document belongs to class $c_{i}$ given the words $w_{1}, w_{2}, ..., w_{n}$ it contains, assuming the words are conditionally independent given the class. The evidence $P(w_{1}, w_{2}, ..., w_{n}) = \sum_{j} P(w_{1}, w_{2}, ...,w_{n}|c_{j}) \times P(c_{j})$ only normalises the result, so the document is assigned to the class with the largest numerator.  
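
The prior, likelihood and posterior described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production classifier: the four training sentences, the labels `pos`/`neg` and the helper names `likelihood` and `posterior` are all made up for the example, and words outside the vocabulary are simply ignored.

```python
import math
from collections import Counter

# Toy training data (invented for illustration only).
train = [
    ("pos", "great fun movie"),
    ("pos", "great acting"),
    ("neg", "boring movie"),
    ("neg", "terrible boring plot"),
]

classes = sorted({label for label, _ in train})
vocab = sorted({w for _, text in train for w in text.split()})

# Prior P(c_i): fraction of training documents in each class.
doc_counts = Counter(label for label, _ in train)
prior = {c: doc_counts[c] / len(train) for c in classes}

# count(w, c_i): occurrences of each word in documents of class c_i.
word_counts = {c: Counter() for c in classes}
for label, text in train:
    word_counts[label].update(text.split())


def likelihood(word, c):
    """Laplace-smoothed P(w_k | c_i)."""
    total = sum(word_counts[c].values())
    return (word_counts[c][word] + 1) / (total + len(vocab))


def posterior(text):
    """Return P(c_i | w_1..w_n) for every class (out-of-vocabulary words are skipped)."""
    words = [w for w in text.split() if w in vocab]
    log_joint = {
        c: math.log(prior[c]) + sum(math.log(likelihood(w, c)) for w in words)
        for c in classes
    }
    # Normalise by the evidence: the sum over classes of P(words | c) * P(c).
    evidence = sum(math.exp(v) for v in log_joint.values())
    return {c: math.exp(v) / evidence for c, v in log_joint.items()}


probs = posterior("great movie")
predicted = max(probs, key=probs.get)
```

Working in log space avoids numerical underflow when a document contains many words, which is why the joint is accumulated as a sum of logarithms before normalising.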

***EXERCISE 2***:  
Consider the following derivation of the ELBO, a quantity used in variational Bayes inference. For
each of the 4 lines in the derivation, explain its justification (hint: remember the “three power tools
of statistics”).  

1. $\log p_{\boldsymbol{\theta}}(\mathbf{x})=\mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\left[\log p_{\boldsymbol{\theta}}(\mathbf{x})\right]$ The log evidence $\log p_{\boldsymbol{\theta}}(\mathbf{x})$ does not depend on $\mathbf{z}$, so it is a constant inside the expectation. Because $q_\phi(\mathbf{z} \mid \mathbf{x})$ integrates to 1, the expectation of a constant under it equals the constant itself.  
  
2. $= \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\left[\log \left[\frac{p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z})}{p_{\boldsymbol{\theta}}(\mathbf{z} \mid \mathbf{x})}\right]\right]$ This line uses the product rule, $p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}) = p_{\boldsymbol{\theta}}(\mathbf{z} \mid \mathbf{x})\, p_{\boldsymbol{\theta}}(\mathbf{x})$, rearranged to express $p_{\boldsymbol{\theta}}(\mathbf{x})$ as the joint probability divided by the posterior.  
  
3. $=\mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\left[\log \left[\frac{p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z})}{q_{\boldsymbol{\phi}}(\mathbf{z} \mid \mathbf{x})} \frac{q_\phi(\mathbf{z} \mid \mathbf{x})}{p_\theta(\mathbf{z} \mid \mathbf{x})}\right]\right]$ The third line multiplies the fraction inside the logarithm by $\frac{q_\phi(\mathbf{z} \mid \mathbf{x})}{q_\phi(\mathbf{z} \mid \mathbf{x})} = 1$, which leaves its value unchanged, and regroups the factors. This sets up the split performed in line 4.
  
4. $=\underbrace{\mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\left[\log \left[\frac{p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z})}{q_\phi(\mathbf{z} \mid \mathbf{x})}\right]\right]}_{=\mathcal{L}_{\boldsymbol{\theta}, \phi}(\mathbf{x})}+\underbrace{\mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\left[\log \left[\frac{q_\phi(\mathbf{z} \mid \mathbf{x})}{p_{\boldsymbol{\theta}}(\mathbf{z} \mid \mathbf{x})}\right]\right]}_{=D_{K L}\left(q_\phi(\mathbf{z} \mid \mathbf{x}) \| p_{\boldsymbol{\theta}}(\mathbf{z} \mid \mathbf{x})\right)}$ The last line uses the property that the logarithm of a product is the sum of the logarithms, together with the linearity of expectation, to split the expectation into two terms. The first term is the ELBO and the second is the Kullback-Leibler divergence between the variational distribution and the true posterior. Since the KL divergence is never negative, the ELBO is a lower bound on $\log p_{\boldsymbol{\theta}}(\mathbf{x})$.  
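
The identity derived above can be checked numerically on a tiny discrete model. The following sketch assumes a hypothetical binary latent variable and one fixed observation; all probabilities (`p_z`, `p_x_given_z`, `q`) are arbitrary illustrative numbers, not taken from the exercise.

```python
import math

# Hypothetical toy model: binary latent z, one fixed observation x.
p_z = [0.6, 0.4]            # prior p(z)
p_x_given_z = [0.2, 0.9]    # likelihood p(x | z) evaluated at the observed x
q = [0.5, 0.5]              # an arbitrary variational distribution q(z | x)

joint = [p_z[z] * p_x_given_z[z] for z in range(2)]    # p(x, z)
evidence = sum(joint)                                   # p(x), sum rule
post = [j / evidence for j in joint]                    # p(z | x), product rule

# First term of line 4: E_q[log p(x, z) / q(z | x)]
elbo = sum(q[z] * math.log(joint[z] / q[z]) for z in range(2))
# Second term of line 4: KL(q(z | x) || p(z | x))
kl = sum(q[z] * math.log(q[z] / post[z]) for z in range(2))

log_evidence = math.log(evidence)
gap = log_evidence - (elbo + kl)   # should vanish up to rounding error
```

Changing `q` moves probability mass between the two terms, but their sum stays equal to `log_evidence`, which is what makes maximising the ELBO equivalent to minimising the KL term.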

***EXERCISE 3***  
Consider the following PyMC3 model for Bayesian linear regression. There are 4 lines with errors.
Identify them, correct the error, and explain why it was an error (i.e., not just what was wrong, but
WHY it was wrong).

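#Define the model (x_data and y_observed are assumed to be arrays defined earlier)

import pymc3 as pm

with pm.Model() as model:

    x = pm.Data('x', x_data)
    y = pm.Data('y_obs', y_observed)
    
    # Priors for slope and intercept, centred at 0 with unit standard deviation
    a = pm.Normal('slope', mu=0, sigma=1)
    b = pm.Normal('intercept', mu=0, sigma=1)
    # The noise scale cannot be negative, so it gets a HalfNormal prior
    s = pm.HalfNormal('sigma', sigma=0.001)

    # Linear predictor, stored in the trace under the name 'mu'
    mu = pm.Deterministic('mu', a*x+b)

    # The likelihood conditions on the data through observed=y
    likelihood = pm.Normal('y', mu=mu, sigma=s, observed=y) 

    step = pm.NUTS()

    trace = pm.sample(1000, tune=1000, init=None, step=step, cores=2)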

1. Setting `mu` so high makes the model overly biased towards large values of the slope and intercept. Without prior knowledge supporting such values, centring the priors at 0 is the more neutral choice. (2 mistakes)  
2. A standard deviation that large admits a vast range of plausible values. While it may express uncertainty, it makes the prior so uninformative that it no longer regularizes the model properly. (2 mistakes)  
3. The prior for `s` cannot be a Normal: the standard deviation used in the likelihood cannot be negative, and a Normal places probability mass on negative values. It must therefore be a HalfNormal, which only has positive support. (1 mistake)  
4. The observed data was not being used. It must be passed to the likelihood through `observed=y` so that the model conditions on it; otherwise `y` is just another unobserved random variable and sampling only returns draws from the prior. (1 mistake)
5. The `mu` used in the likelihood should be defined with `pm.Deterministic()` so that the deterministic relationship between the data and the parameters is represented explicitly and its values are stored in the trace. (1 mistake)
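
The effect of the prior mean and standard deviation (points 1 and 2) can be illustrated with a much simpler stand-in than the regression model: estimating a single normal mean with known noise, where the posterior has a closed form. The function name `posterior_mean_sd` and every number below are hypothetical choices for illustration; they are not the values from the exercise's original model.

```python
def posterior_mean_sd(prior_mu, prior_sd, data_mean, n, noise_sd):
    """Posterior mean and sd of a normal mean with known noise_sd
    under a Normal(prior_mu, prior_sd) prior (conjugate update)."""
    prior_prec = 1.0 / prior_sd**2      # precision contributed by the prior
    data_prec = n / noise_sd**2         # precision contributed by n observations
    post_prec = prior_prec + data_prec
    post_mean = (prior_prec * prior_mu + data_prec * data_mean) / post_prec
    return post_mean, post_prec ** -0.5


# Hypothetical data: 10 observations averaging 2.0, noise sd 1.0.
centred = posterior_mean_sd(prior_mu=0.0, prior_sd=1.0, data_mean=2.0, n=10, noise_sd=1.0)
shifted = posterior_mean_sd(prior_mu=100.0, prior_sd=1.0, data_mean=2.0, n=10, noise_sd=1.0)
diffuse = posterior_mean_sd(prior_mu=0.0, prior_sd=1000.0, data_mean=2.0, n=10, noise_sd=1.0)
```

With the same data, the prior centred far from the data (`shifted`) drags the posterior mean well away from the sample mean, while the very wide prior (`diffuse`) leaves the posterior mean essentially equal to the sample mean, meaning that it provides almost no regularization.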

<h3>References:</h3>
Cohen, S. (2022). <em>Bayesian analysis in natural language processing</em>. Springer Nature.<br>  
GitHub Copilot v1.162.0 was used as assistance while developing the code for this assignment.