In [1]:
import numpy as np

# Week 3: Inference with Bayes' Theorem for Random Variables

### Product Rule for Random Variables 

- often in real-world problems, we aren't given the joint distribution of two random variables
    - often we can compute it using the product rule
- for random variables X and Y, the product rule is:
$p_{X,Y}(x,y) = p_Y(y)\,p_{X|Y}(x|y)$ for all $x \in \mathcal{X}, y \in \mathcal{Y}$ such that $p_Y(y)>0$

IMPORTANT TO KNOW: if $p_Y(y)=0$, the conditional $p_{X|Y}(\cdot|y)$ is undefined, but then $p_{X,Y}(x,y)$ is 0 for every x, so we can still write $p_{X,Y}(x,y) = p_Y(y)p_{X|Y}(x|y)$ with the right-hand side read as 0
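
A minimal sketch of the product rule as a table operation. The numbers in `p_Y` and `p_X_given_Y` are made up for illustration (they are not from the course), and the layout (rows indexed by x, columns by y) is an assumption of this sketch.

```python
import numpy as np

# Hypothetical marginal for Y: p_Y[j] = P(Y = j)
p_Y = np.array([0.4, 0.6])

# Hypothetical conditional table: p_X_given_Y[i, j] = P(X = i | Y = j).
# Each column is a distribution over x, so each column sums to 1.
p_X_given_Y = np.array([[0.9, 0.2],
                        [0.1, 0.8]])

# Product rule: p_{X,Y}(x, y) = p_Y(y) * p_{X|Y}(x | y).
# Broadcasting scales column j of the conditional table by p_Y[j].
p_XY = p_X_given_Y * p_Y

print(p_XY)
print(p_XY.sum())        # a valid joint table sums to 1
print(p_XY.sum(axis=0))  # summing out x recovers p_Y
```

Summing the joint over x should give back `p_Y`, which is a quick sanity check that the table was built correctly.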


### Bayes' Theorem for Random Variables

"posterior" distribution: our updated "belief" about X after observing that Y takes on a specific value y (given a likelihood $p_{Y|X}$ and a prior distribution $p_X$)
- notation: $p_{X|Y}(\cdot|y)$
- note: it is a distribution over the variable we are inferring (X)
- essentially we start with an initial (prior) probability distribution for X, and as we observe values of Y, we use them to update our beliefs about X

- Bayes' theorem for random variables tells us how to compute the posterior distribution, i.e. how to weight each possible value of X after we see Y = y

$$p_{X|Y}(x|y) = \frac{p_X(x)p_{Y|X}(y|x)}{\sum_{x'}p_X(x')p_{Y|X}(y|x')}$$

This follows from the product rule and marginalization:

$$p_{X|Y}(x|y) = \frac{p_{X,Y}(x,y)}{p_Y(y)} = \frac{p_X(x)p_{Y|X}(y|x)}{p_Y(y)} = \frac{p_X(x)p_{Y|X}(y|x)}{\sum_{x'}p_{X,Y}(x',y)}$$

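NOTE: the posterior $p_{X|Y}(\cdot|y)$ is undefined only when $p_Y(y) = 0$ (the denominator is 0). The likelihood $p_{Y|X}(\cdot|x)$ can also be undefined, but only when $p_X(x) = 0$, and in that case the term $p_X(x)p_{Y|X}(y|x)$ is taken to be 0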

- computationally, this is a 2-step process after observing Y = y
1. weight the score $p_X(x)$ by a factor of $p_{Y|X}(y|x)$

$$\alpha(x|y) \triangleq p_X(x) p_{Y|X}(y|x)$$

alpha is the unnormalized posterior distribution (table) at this point
- this is because after reweighting, the beliefs are no longer guaranteed to add up to 1

2. fix the unnormalized posterior table by renormalizing

$$p_{X|Y}(x|y)=\frac{\alpha(x|y)}{\sum_{x'}\alpha(x'|y)}=\frac{p_X(x)p_{Y|X}(y|x)}{\sum_{x'}p_X(x')p_{Y|X}(y|x')}$$
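
The two steps above can be sketched in NumPy as follows. The prior `p_X`, the likelihood table `p_Y_given_X`, and the observed value `y` are hypothetical values chosen for illustration; the row/column layout is also an assumption of this sketch.

```python
import numpy as np

# Hypothetical prior over X in {0, 1, 2}
p_X = np.array([0.5, 0.3, 0.2])

# Hypothetical likelihood: p_Y_given_X[x, y] = P(Y = y | X = x).
# Each row is a distribution over y, so each row sums to 1.
p_Y_given_X = np.array([[0.9, 0.1],
                        [0.5, 0.5],
                        [0.2, 0.8]])

y = 1  # the observed value of Y

# Step 1: weight each prior entry by the likelihood of the observation.
# alpha is the unnormalized posterior; its entries need not sum to 1.
alpha = p_X * p_Y_given_X[:, y]

# Step 2: renormalize so the entries sum to 1.
# alpha.sum() equals p_Y(y); the posterior is undefined if it is 0.
posterior = alpha / alpha.sum()

print(alpha)
print(posterior)
```

Here `alpha.sum()` is exactly the denominator $\sum_{x'}p_X(x')p_{Y|X}(y|x')$, so the division performs the renormalization in step 2.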

### Maximum A Posteriori (MAP) Estimation

- sometimes we only care about the value with the highest posterior probability (aka the most probable value x that X can take on given Y = y), not the whole posterior distribution

- MAP estimate: the value of X that maximizes the posterior distribution
    denoted by: $\hat{x}_{MAP}(y)$
    is equal to: $\hat{x}_{MAP}(y) = \arg\max_x p_{X|Y}(x|y)$

    where $\arg\max$ specifies that we are finding the value x that yields the highest posterior probability, as opposed to the highest posterior probability itself.

    - can be expensive to compute when X can take on many values
    - the number of outcomes k of the random variable gives a lower bound on the maximum: the highest posterior probability is always at least 1/k
    - ex: if two outcomes, then the MAP estimate has posterior probability of at least 0.5
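
A small sketch of the difference between the max and the arg max, using a made-up posterior table (the values in `posterior` are hypothetical and already normalized).

```python
import numpy as np

# Hypothetical posterior table p_{X|Y}(. | y) over X in {0, 1, 2}
posterior = np.array([0.2, 0.5, 0.3])

# arg max: the VALUE of x with the highest posterior probability
x_map = int(np.argmax(posterior))

# max: the highest posterior probability itself
max_prob = posterior.max()

print(x_map)     # 1
print(max_prob)  # 0.5

# With k outcomes, the largest entry can never be below 1/k
k = len(posterior)
print(max_prob >= 1 / k)  # True
```

If several values tie for the maximum, `np.argmax` returns the first one, so the MAP estimate is not necessarily unique.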

### Introduction to Independence 

independence: knowing the outcome of one event (or random variable) gives no information about the other
    denoted by: $A \perp B$

- events A and B are independent if the probability of A and B both occurring is the same as prob of A * prob of B
- for random variables: $p_{X,Y}(x,y)=p_X(x)p_Y(y)$ for all x and y; equivalently $p_{X|Y}(x|y)=p_X(x)$ whenever $p_Y(y)>0$, which shows that even given Y, the distribution of X stays the same as if Y had never been observed

### Bernoulli and Binomial Random Variables

- Bernoulli: a random variable with only two outcomes (1 and 0), $\textit{can}$ be biased (as in the two outcomes need not be equally likely)
    - prob_table = {1: p, 0: 1-p}
    - if a random variable X has this distribution, we write X ~ Bernoulli(p) ("~" reads as "has distribution")
- Binomial: the number of 1s in n independent Bernoulli(p) trials; specified by the number of trials n and the single probability p shared by every trial
    - like counting the # of heads in n flips of the same biased coin
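
A short sketch of the binomial probability table, using the standard formula $P(S=k)=\binom{n}{k}p^k(1-p)^{n-k}$ (the formula is not written in these notes; it is the usual textbook one). The helper name `binomial_pmf` and the values of `n` and `p` are hypothetical.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k ones in n independent Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Hypothetical biased coin: heads with probability 0.7, flipped 3 times
n, p = 3, 0.7
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

for k, prob in enumerate(pmf):
    print(k, prob)

# The table is a distribution, so it sums to 1 (up to floating point)
print(sum(pmf))

# With a single trial, Binomial(1, p) is just Bernoulli(p)
print(binomial_pmf(1, 1, p), binomial_pmf(0, 1, p))
```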



In [5]:
# Independent Random Variables Exercise 

'''checking whether p_{W,I}(w,i)=p_W(w)p_I(i), and likewise for X and Y'''

# defining two joint tables, each over two variables
prob_W_I = np.array([[1/2, 0], [0, 1/6], [0, 1/3]])
prob_X_Y = np.array([[1/4,1/4],[1/12,1/12],[1/6,1/6]])

# summing out the other variable's axis to get each marginal probability table
prob_W = prob_W_I.sum(axis=1)
print(prob_W)
prob_I = prob_W_I.sum(axis=0)

prob_X = prob_X_Y.sum(axis=1)
prob_Y = prob_X_Y.sum(axis=0)

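# np.outer gives the product of the marginals; compare each result to its original joint table
# (the product matches prob_X_Y, so X and Y are independent; it does not match prob_W_I, so W and I are not)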
print(np.outer(prob_W, prob_I))
print(np.outer(prob_X, prob_Y))

[0.5        0.16666667 0.33333333]
[[0.25       0.25      ]
 [0.08333333 0.08333333]
 [0.16666667 0.16666667]]
[[0.25       0.25      ]
 [0.08333333 0.08333333]
 [0.16666667 0.16666667]]
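
Both printed outer products are identical, so the output alone does not show which pair is independent; each product still has to be compared against its own joint table. A minimal sketch of that comparison, where `is_independent` is a hypothetical helper name and `np.allclose` is used to tolerate floating-point rounding:

```python
import numpy as np

def is_independent(joint):
    """True if a 2-D joint table equals the outer product of its marginals."""
    p_rows = joint.sum(axis=1)  # marginal of the row variable
    p_cols = joint.sum(axis=0)  # marginal of the column variable
    return bool(np.allclose(joint, np.outer(p_rows, p_cols)))

# Same joint tables as in the exercise above
prob_W_I = np.array([[1/2, 0], [0, 1/6], [0, 1/3]])
prob_X_Y = np.array([[1/4, 1/4], [1/12, 1/12], [1/6, 1/6]])

print(is_independent(prob_W_I))  # False
print(is_independent(prob_X_Y))  # True
```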


### Mutual v Pairwise Independence

mutual independence: multiple random variables are mutually independent if their joint probability distribution is = to the product of all the individual (marginal) distributions

pairwise independence: every pair of the variables is independent
- however, this is weaker than mutual independence: looking at three (or more) together, knowing two of them CAN inform the third
- ex: $\oplus$ (XOR) function, $z = x \oplus y$ with x and y independent fair bits, where knowing x and y determines z, although knowing x or y alone doesn't inform z
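
The XOR example can be checked numerically. This sketch assumes X and Y are independent fair bits and Z = X XOR Y; the array layout `joint[x, y, z]` is a choice made here for illustration.

```python
import numpy as np
from itertools import product

# joint[x, y, z] = P(X = x, Y = y, Z = z) with Z = X xor Y
joint = np.zeros((2, 2, 2))
for x, y in product([0, 1], repeat=2):
    joint[x, y, x ^ y] = 0.25

# Single-variable marginals
p_X = joint.sum(axis=(1, 2))
p_Y = joint.sum(axis=(0, 2))
p_Z = joint.sum(axis=(0, 1))

# Two-variable marginals (sum out the remaining variable)
p_XY = joint.sum(axis=2)
p_XZ = joint.sum(axis=1)
p_YZ = joint.sum(axis=0)

# Pairwise independence: every pair factorizes
pairwise = bool(
    np.allclose(p_XY, np.outer(p_X, p_Y))
    and np.allclose(p_XZ, np.outer(p_X, p_Z))
    and np.allclose(p_YZ, np.outer(p_Y, p_Z))
)

# Mutual independence: the full joint equals the product of all three marginals
product_of_marginals = np.einsum('i,j,k->ijk', p_X, p_Y, p_Z)
mutual = bool(np.allclose(joint, product_of_marginals))

print(pairwise)  # True
print(mutual)    # False
```

Every entry of the product of marginals is 1/8, while the joint only has entries 1/4 and 0, which is why the mutual check fails.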

### Conditional Independence

Same as independence, but with a third random variable that both are conditioned on

if $p_{X,Y|Z}(x,y|z) = p_{X|Z}(x|z)p_{Y|Z}(y|z)$ then X and Y are conditionally independent given Z
- that is, once Z is known, learning Y tells us nothing more about X (and vice versa)
- having marginal independence DOESN'T imply conditional independence, and vice versa
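
A sketch showing that neither kind of independence implies the other. Both joint tables are hypothetical: `xor_joint` assumes X and Y are independent fair bits with Z = X XOR Y, and `copy_joint` assumes Z is a fair bit with X and Y each matching Z with probability 0.9, independently given Z. The helper names are made up for this example.

```python
import numpy as np

def marginally_independent(joint_xyz):
    """Check p(x, y) == p(x) p(y), ignoring Z. joint_xyz[x, y, z] is the joint."""
    p_XY = joint_xyz.sum(axis=2)
    return bool(np.allclose(p_XY, np.outer(p_XY.sum(axis=1), p_XY.sum(axis=0))))

def conditionally_independent(joint_xyz):
    """Check p(x, y | z) == p(x | z) p(y | z) for every z with p_Z(z) > 0."""
    p_Z = joint_xyz.sum(axis=(0, 1))
    for z in range(joint_xyz.shape[2]):
        if p_Z[z] == 0:
            continue  # conditioning on a zero-probability value is undefined
        p_XY_given_z = joint_xyz[:, :, z] / p_Z[z]
        p_X_given_z = p_XY_given_z.sum(axis=1)
        p_Y_given_z = p_XY_given_z.sum(axis=0)
        if not np.allclose(p_XY_given_z, np.outer(p_X_given_z, p_Y_given_z)):
            return False
    return True

# Hypothetical table 1: Z = X xor Y for independent fair bits X, Y
xor_joint = np.zeros((2, 2, 2))
# Hypothetical table 2: X and Y are independent noisy copies of a fair bit Z
copy_joint = np.zeros((2, 2, 2))
for x in (0, 1):
    for y in (0, 1):
        xor_joint[x, y, x ^ y] = 0.25
        for z in (0, 1):
            p_x = 0.9 if x == z else 0.1
            p_y = 0.9 if y == z else 0.1
            copy_joint[x, y, z] = 0.5 * p_x * p_y

print(marginally_independent(xor_joint), conditionally_independent(xor_joint))    # True False
print(marginally_independent(copy_joint), conditionally_independent(copy_joint))  # False True
```

In the XOR table, X and Y are independent until Z is observed; in the noisy-copy table, X and Y are dependent until Z is observed.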

QUESTION: is marginal independence the same as mutual independence?

Explaining away: when an observed effect has several possible explanations (causes), once we also observe one of those explanations, our belief in a different, competing explanation may go back down

QUESTION: what is the difference between something being a joint probability and independent? it seems that to find the joint probability, you use P(x,y) = p(x)*p(y) but the same is also used to prove independence of two random variables 

