# Bayesian machine learning

- In the offline world it was sufficient to run experiments, i.e. collect data, perform statistical tests, and modify actions based on the results
- Today, in the online world, things move too fast for that. Web behaviour must be responsive, and there is too much data to handle manually. Actions must be updated from data automatically, hence the use of Bayesian machine learning

## Examples of A/B testing
- Medicine
    - A pharmaceutical company discovers a new drug for blood pressure and needs to find out if it works
    - It then performs an experiment: one group is given the drug and a second group is given a placebo
    - Measure the blood pressure of every person before and after their treatment and perform statistical tests to figure out if there is a difference between the groups
    
- Web pages
    - A website wants to optimize the number of visitors who purchase something (conversion rate)
    - Multiple versions of the website are used to test for trustworthiness of the design
    - A test is performed to see if one version performs better than another

## What is Bayesian machine learning?
- Bayes' rule is just basic probability
- in the Bayesian approach, everything is a random variable
- suppose we want to measure the height of students in a class
    - frequentist approach:
        - measure the height of everyone in the class and model the heights with a normal distribution
        - write down the likelihood of observing that data
        - use maximum likelihood estimation to figure out the parameters (mean and variance)
    - Bayesian approach:
        - in the Bayesian approach the parameters are not scalar numbers
        - they are probability distributions themselves
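
The contrast between the two approaches can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the height values are made up, and the Bayesian half assumes a normal prior on the mean together with a known measurement variance (both are assumptions introduced here so that the posterior has a closed form).

```python
from statistics import fmean


def mle_normal(xs):
    """Frequentist point estimates: maximum likelihood mean and variance."""
    mu = fmean(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)  # MLE divides by n, not n - 1
    return mu, var


def posterior_mean_known_variance(xs, prior_mu, prior_var, noise_var):
    """Bayesian estimate of the mean: returns the posterior distribution's
    mean and variance, assuming a normal prior and a known noise variance."""
    n = len(xs)
    post_precision = 1.0 / prior_var + n / noise_var
    post_var = 1.0 / post_precision
    post_mu = post_var * (prior_mu / prior_var + sum(xs) / noise_var)
    return post_mu, post_var


# Hypothetical heights in cm, for illustration only.
heights_cm = [160.0, 170.0, 180.0]

mu_hat, var_hat = mle_normal(heights_cm)
post_mu, post_var = posterior_mean_known_variance(
    heights_cm, prior_mu=165.0, prior_var=100.0, noise_var=100.0
)

print(f"MLE: mean={mu_hat:.2f}, variance={var_hat:.2f}")
print(f"Posterior over the mean: N({post_mu:.2f}, {post_var:.2f})")
```

The frequentist result is a pair of numbers, while the Bayesian result describes a whole distribution over the mean, whose variance shrinks as more heights are observed.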
        
        
        

## Probability review
- marginal distributions: $p(A), p(B)$
- joint distribution: $p(A, B)$
- conditional distributions: $p(A|B), p(B|A)$

$p(A) = \sum_{B} p(A,B)$<br>
$p(B) = \sum_{A} p(A,B)$<br><br>
$p(A|B) = \frac{p(A,B)}{p(B)} = \frac{p(A,B)}{\sum_{A} p(A,B)}$<br>
$p(B|A) = \frac{p(A,B)}{p(A)} = \frac{p(A,B)}{\sum_{B} p(A,B)}$<br>

$p(B|A) = \frac{p(A,B)}{p(A)} = \frac{p(A|B)p(B)}{\sum_{B} p(A|B)p(B)}$<br><br>

- for continuous distributions, we call the probability distribution the probability density and instead of summations we integrate
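
As a sketch of what this means for the identities above, assuming $B$ is continuous and $p$ now denotes a density, the sums over $B$ become integrals:

$p(A) = \int p(A,B)\, dB$<br>
$p(B|A) = \frac{p(A,B)}{p(A)} = \frac{p(A|B)p(B)}{\int p(A|B)p(B)\, dB}$<br>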

## Examples
- suppose we want to find p(Buy|Country):<br>

|  | CA | US | MX |
|-|-|-|-|
| Buy = True | 20 | 50 | 10 |
| Buy = False | 300 | 500 | 200 | 

- an e-commerce website will likely be very interested if the website is performing poorly in a given country. The causes may be numerous: slower speed in a given country due to distance from datacenters, irrelevant product placement etc.<br><br>

marginal probabilities:
- p(Country=CA) = 320/(320+550+210) = 0.30
- p(Country=US) = 550/(320+550+210) = 0.51
- p(Country=MX) = 210/(320+550+210) = 0.19<br><br>

joint probabilities:
- p(Buy=True, Country=CA) = 20/1080 = 0.019
- p(Buy=False, Country=CA) = 300/1080 = 0.28
- p(Buy=True, Country=US) = 50/1080 = 0.046
- p(Buy=False, Country=US) = 500/1080 = 0.46
- p(Buy=True, Country=MX) = 10/1080 = 0.0093
- p(Buy=False, Country=MX) = 200/1080 = 0.185<br><br>

conditional probabilities (joint divided by marginal; the common factor 1/1080 cancels, so the counts can be used directly):
- p(Buy=True|Country=CA) = 20/320 = 0.06
- p(Buy=False|Country=CA) = 300/320 = 0.94
- p(Buy=True|Country=US) = 50/550 = 0.09
- p(Buy=False|Country=US) = 500/550 = 0.91
- p(Buy=True|Country=MX) = 10/210 = 0.05
- p(Buy=False|Country=MX) = 200/210 = 0.95<br><br>
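
The same quantities can be computed directly from the counts in the table. The sketch below is one possible way to do it in plain Python; the variable names are illustrative choices. It also inverts the conditional with Bayes' rule from the probability review to get p(Country|Buy=True).

```python
# Counts copied from the Buy / Country table: (country, buy) -> number of visitors.
counts = {
    ("CA", True): 20, ("US", True): 50, ("MX", True): 10,
    ("CA", False): 300, ("US", False): 500, ("MX", False): 200,
}
countries = ["CA", "US", "MX"]
total = sum(counts.values())

# Joint distribution p(Buy, Country).
p_joint = {key: n / total for key, n in counts.items()}

# Marginal p(Country): sum the joint over Buy.
p_country = {c: p_joint[(c, True)] + p_joint[(c, False)] for c in countries}

# Conditional p(Buy | Country) = p(Buy, Country) / p(Country).
p_buy_given_country = {
    (c, b): p_joint[(c, b)] / p_country[c] for (c, b) in p_joint
}

# Bayes' rule: p(Country | Buy=True) from p(Buy=True | Country) and p(Country).
evidence = sum(p_buy_given_country[(c, True)] * p_country[c] for c in countries)
p_country_given_buy = {
    c: p_buy_given_country[(c, True)] * p_country[c] / evidence for c in countries
}

for c in countries:
    print(
        f"{c}: p(Country)={p_country[c]:.2f}, "
        f"p(Buy=True|Country)={p_buy_given_country[(c, True)]:.2f}, "
        f"p(Country|Buy=True)={p_country_given_buy[c]:.2f}"
    )
```

Working from the raw counts avoids the small errors that creep in when already rounded probabilities are divided by each other.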

Independence: A and B are independent if<br>
$p(A,B) = p(A)p(B)$

Example:
- consider N successive coin tosses
- the outcome of each toss is independent of the previous ones, since earlier outcomes do not affect its probability
- we say that the coin tosses are iid: independent and identically distributed
- meaning that the value of coin toss k is completely independent of all previous coin tosses
- the gambler's fallacy is the common belief that, after losing many bets in a row, the probability of winning the next bet is somehow higher, since "it should average out over time"; for independent bets, the probability is unchanged
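
A quick simulation can make the coin toss argument concrete. The sketch below assumes a fair coin simulated with Python's `random` module; the function name, the streak length of 3 and the number of tosses are arbitrary choices for illustration. It estimates the probability of heads given that the previous tosses were all tails.

```python
import random


def heads_rate_after_tails_streak(n_tosses, streak, seed=0):
    """Estimate p(heads | the previous `streak` tosses were all tails)."""
    rng = random.Random(seed)
    tosses = [rng.random() < 0.5 for _ in range(n_tosses)]  # True means heads

    heads_after = 0
    total_after = 0
    for k in range(streak, n_tosses):
        # Only look at toss k when the `streak` tosses before it were all tails.
        if not any(tosses[k - streak:k]):
            total_after += 1
            heads_after += tosses[k]

    if total_after == 0:
        raise ValueError("no tails streak of the requested length occurred")
    return heads_after / total_after


estimate = heads_rate_after_tails_streak(n_tosses=200_000, streak=3)
print(f"p(heads | 3 tails in a row) is approximately {estimate:.3f}")
```

If the gambler's fallacy were true, the estimate would sit noticeably above 0.5; under independence it should stay close to 0.5, up to sampling noise.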
