<hr/>

# Introduction to Data Science
**Tamás Budavári** - budavari@jhu.edu <br/>

- Bayesian inference
- Prior: proper vs improper
- Likelihood function
- Maximum Likelihood Estimation
- Links to least squares

<hr/>

<h1><font color="darkblue">Bayesian Inference</font></h1>

### Joint & Conditional Probability
- Consider two events $X$ and $Y$. In general, their **joint probability** does not factor into the product of the individual probabilities

>$\displaystyle P(X, Y) \neq P(X)\,P(Y)$ 
>
> instead
>
>$\displaystyle P(X, Y) = P(X)\,P(Y \lvert X)$ 
>
> where $P(Y  \lvert X)$ is the **conditional probability** of $Y$ given $X$

- For example, if $X$ is the event of flipping heads and $Y$ is the event of flipping tails on the same trial, then $P(X,Y)=0$ because $P(Y \lvert X)=0$.

- But on separate trials, the events would be independent and we would have $P(Y \lvert  X)=P(Y)$.
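
The coin example can be checked by enumerating outcomes. The sketch below assumes two fair flips and uses exact fractions; the helper `prob` is a hypothetical name introduced only for this illustration.

```python
from fractions import Fraction
from itertools import product

# Two fair coin flips: each outcome (first, second) has probability 1/4.
outcomes = list(product("HT", repeat=2))
p_outcome = Fraction(1, len(outcomes))

def prob(event):
    """Probability of an event given as a predicate on (first, second)."""
    return sum(p_outcome for o in outcomes if event(o))

# Same trial: X = "first flip is heads", Y = "first flip is tails".
p_x = prob(lambda o: o[0] == "H")
p_xy_same = prob(lambda o: o[0] == "H" and o[0] == "T")
p_y_given_x_same = p_xy_same / p_x      # zero: the events exclude each other

# Separate trials: X = "first flip is heads", Y = "second flip is tails".
p_y = prob(lambda o: o[1] == "T")
p_xy_sep = prob(lambda o: o[0] == "H" and o[1] == "T")
p_y_given_x_sep = p_xy_sep / p_x        # equals P(Y): the events are independent

print(p_y_given_x_same, p_y_given_x_sep, p_y)
```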


### Bayes' Theorem
- The joint probability of discrete events $X$ and $Y$ can be factored in two ways

>$\displaystyle P(X,Y) = P(X)\,P(Y \lvert X)$ 
>
> and 
>
>$\displaystyle P(Y,X) = P(Y)\,P(X \lvert Y)$ 
>
> Their equality yields
>
>$\displaystyle P(X \lvert Y) = \frac{P(X)\,P(Y \lvert X)}{P(Y)}$ 
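
As a quick numerical check of the theorem, the sketch below uses a made-up joint probability table for two binary events (the numbers are purely illustrative) and compares $P(X \lvert Y)$ read directly from the table with the right-hand side of Bayes' theorem. The helpers `marginal` and `conditional` are hypothetical names.

```python
from fractions import Fraction

# Hypothetical joint probabilities P(X=x, Y=y), keyed by (x, y).
joint = {
    (0, 0): Fraction(3, 10), (0, 1): Fraction(1, 10),
    (1, 0): Fraction(2, 10), (1, 1): Fraction(4, 10),
}

def marginal(axis, value):
    """P(variable on `axis` = value), with axis 0 for X and 1 for Y."""
    return sum(p for key, p in joint.items() if key[axis] == value)

def conditional(axis, value, given_value):
    """P(variable on `axis` = value | the other variable = given_value)."""
    other = 1 - axis
    numerator = sum(p for key, p in joint.items()
                    if key[axis] == value and key[other] == given_value)
    return numerator / marginal(other, given_value)

x, y = 1, 1
lhs = conditional(0, x, y)                                    # P(X | Y) from the table
rhs = marginal(0, x) * conditional(1, y, x) / marginal(1, y)  # P(X) P(Y | X) / P(Y)

print(lhs, rhs)
```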


### Probability Densities
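- Bayes' theorem also holds when some or all variables are continuous, with probability density functions (PDFs) $p(\cdot)$ replacing probabilities $P(\cdot)$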

>$\displaystyle P(X \lvert y) = \frac{P(X)\,p(y \lvert X)}{p(y)}$ 
>
> and
>
>$\displaystyle p(x \lvert Y) = \frac{p(x)\,P(Y \lvert x)}{P(Y)}$ 
>

- When both variables are continuous

>$\displaystyle p(x \lvert y) = \frac{p(x)\,p(y \lvert x)}{p(y)}$ 
>
> where
>
>$\displaystyle p(y) = \int p(x)\,p(y \lvert x)\,dx$ 
>
> to ensure that
>
>$\displaystyle \int p(x \lvert y)\,dx = 1$ 
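
The role of $p(y)$ as a normalization can be illustrated on a grid. This is a minimal sketch under an assumed toy model, $p(x) = N(0, 1)$ and $p(y \lvert x) = N(y;\, x, 1)$, for which the integral is known in closed form: $p(y) = N(y;\, 0, \sqrt{2})$. The function names and the value of `y` are illustrative choices.

```python
import math

def normal_pdf(z, mean, sigma):
    """Gaussian density with the given mean and standard deviation."""
    return math.exp(-0.5 * ((z - mean) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def trapezoid(values, step):
    """Trapezoid rule on an evenly spaced grid."""
    return step * (sum(values) - 0.5 * (values[0] + values[-1]))

y = 1.3                                          # an arbitrary observed value
step = 0.01
grid = [-10 + i * step for i in range(2001)]     # x from -10 to 10

# Numerator of Bayes' theorem, p(x) p(y|x), on the grid.
unnormalized = [normal_pdf(x, 0.0, 1.0) * normal_pdf(y, x, 1.0) for x in grid]

evidence = trapezoid(unnormalized, step)             # p(y) = integral of p(x) p(y|x) dx
posterior = [u / evidence for u in unnormalized]     # p(x|y)
total = trapezoid(posterior, step)                   # should be 1

evidence_exact = normal_pdf(y, 0.0, math.sqrt(2.0))  # closed form for this toy model
print(evidence, evidence_exact, total)
```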

### Probabilistic Model
- From data $D$ we can **infer** the parameters $\theta$ of model $M$ 

>$\displaystyle p(\theta \lvert D) = \frac{p(\theta)\,p(D \lvert \theta)}{p(D)}$ 
>
> or including the model $M$ explicitly
>
>
>$\displaystyle p(\theta \lvert D,M) = \frac{p(\theta \lvert M)\,p(D \lvert \theta,M)}{p(D \lvert M)}$ 



### Likelihood Function
- In shorthand, write the prior as $\pi(\theta) = p(\theta)$ and the likelihood function as ${\cal{}L}\!_D(\theta) = p(D \lvert \theta)$

>$\displaystyle p(\theta \lvert D) = \frac{\pi(\theta)\,{\cal{}L}\!_D(\theta)}{Z}$ 
>
> where the normalization
>
>$\displaystyle Z = \int \pi(\theta)\,{\cal{}L}\!_D(\theta)\ d\theta $ 

- The **posterior** is proportional to the **prior** times the **likelihood function** 
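
A minimal sketch of "posterior $\propto$ prior $\times$ likelihood" on a grid. It assumes a toy problem not taken from the lecture: $\theta$ is the probability of heads, the data are 7 heads in 10 flips, and the prior is flat on $[0, 1]$. The function `grid_posterior` is a hypothetical helper.

```python
def grid_posterior(thetas, prior, likelihood):
    """Posterior density on an evenly spaced grid: prior * likelihood / Z."""
    step = thetas[1] - thetas[0]
    unnorm = [prior(t) * likelihood(t) for t in thetas]
    z = step * (sum(unnorm) - 0.5 * (unnorm[0] + unnorm[-1]))  # trapezoid rule for Z
    return [u / z for u in unnorm]

# Hypothetical data: 7 heads in 10 flips.
heads, flips = 7, 10
thetas = [i / 1000 for i in range(1001)]

def flat_prior(theta):
    return 1.0

def likelihood(theta):
    # The binomial coefficient is omitted: it is constant in theta and cancels in Z.
    return theta ** heads * (1 - theta) ** (flips - heads)

post = grid_posterior(thetas, flat_prior, likelihood)
theta_peak = thetas[post.index(max(post))]   # location of the posterior maximum
print(theta_peak)
```

With a flat prior the posterior peaks where the likelihood does, here at the observed fraction of heads.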

### Data
- A set of independent measurements

>$\displaystyle D = \Big\{x_i\Big\}_{i=1}^N$

- E.g., measuring the temperature of a room

### Model Parametrization

- For example, the model is that all cities in Maryland have the same temperature

> We also need to state our prior knowledge about the temperature

- Let $\mu$ represent that temperature in all cities (same for all)

> We pick an appropriate prior - often people say we use a "flat" prior because we don't know...

### Alternative Parametrization

- We could have chosen another parametrization, say $\mu = \tan \phi$ with $\phi \in \left(-\frac{\pi}{2},\frac{\pi}{2} \right)$

> Clearly a "flat prior" in $\phi$ means something different from a flat prior in $\mu$!
><br/>
> What should be the prior? Needs careful consideration!

- Non-informative prior?

> For more, see [Jeffreys](https://en.wikipedia.org/wiki/Harold_Jeffreys) prior
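
To see how much the choice of parametrization matters, the sketch below assumes the parametrization is $\mu = \tan\phi$ and draws $\phi$ from a flat prior on $\left(-\frac{\pi}{2},\frac{\pi}{2}\right)$. By the change of variables, the implied density in $\mu$ is $\frac{1}{\pi(1+\mu^2)}$, which is far from flat: half of the prior mass falls in $\lvert\mu\rvert < 1$. The sample size and seed are arbitrary.

```python
import math
import random

random.seed(0)
n = 200_000

# Flat prior in phi on (-pi/2, pi/2), mapped to mu = tan(phi).
phis = [random.uniform(-math.pi / 2, math.pi / 2) for _ in range(n)]
mus = [math.tan(phi) for phi in phis]

# Fraction of prior samples with |mu| < 1, i.e. |phi| < pi/4.
frac_inside = sum(abs(m) < 1 for m in mus) / n

def implied_cdf(mu):
    """CDF of mu implied by the flat prior in phi: (arctan(mu) + pi/2) / pi."""
    return math.atan(mu) / math.pi + 0.5

mass_inside = implied_cdf(1.0) - implied_cdf(-1.0)   # analytic counterpart
print(frac_inside, mass_inside)
```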

### What is the likelihood function?

- For a set of independent measurements

>$\displaystyle {\cal L}\!_D(\mu) = p(D \lvert \mu) = p(\{x_i\!\}\lvert\mu) = \prod_{i=1}^N p(x_i\lvert\mu) = \prod_{i=1}^N \ell\!_{i}(\mu)$

- For example, Gaussian uncertainties

>$\displaystyle \ell\!_{i}(\mu) = \frac{1}{\sqrt{2\pi\sigma_i^2}}\ \exp\left\{-\frac{(x_i-\mu)^2}{2\sigma_i^2}\right\}$
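
The product form translates directly into code. The sketch below uses made-up measurements `xs` with uncertainties `sigmas` (illustrative values only) and also evaluates the logarithm as a sum, which is usually preferred numerically because a product of many small terms can underflow.

```python
import math

# Hypothetical temperature measurements and their Gaussian uncertainties.
xs = [20.1, 19.6, 20.8, 20.3]
sigmas = [0.5, 0.4, 0.8, 0.5]

def likelihood_term(x, sigma, mu):
    """Gaussian likelihood of a single measurement x with uncertainty sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def likelihood(mu, xs, sigmas):
    """Product of the individual terms for independent measurements."""
    result = 1.0
    for x, s in zip(xs, sigmas):
        result *= likelihood_term(x, s, mu)
    return result

def log_likelihood(mu, xs, sigmas):
    """Same quantity on the log scale: a sum instead of a product."""
    return sum(-0.5 * math.log(2 * math.pi * s ** 2) - (x - mu) ** 2 / (2 * s ** 2)
               for x, s in zip(xs, sigmas))

print(likelihood(20.0, xs, sigmas), math.exp(log_likelihood(20.0, xs, sigmas)))
```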



### Detour: Improper Priors

- The posterior PDF is

>$\displaystyle p(\mu|D) = \frac{\pi(\mu) \prod {\ell}\!_i(\mu)}{\int \pi(\mu) \prod {\ell}\!_i(\mu)\,d\mu}\ $ 

- Uniform prior?

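> Using $\pi(\mu)\!=\!1$ on the whole real line is not a proper PDF because it cannot be normalized. But if the prior is flat over the interval where the likelihood function is non-zero (if!), the constant prior value cancels from the ratio and the posterior is still well defined.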
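
The cancellation is easy to verify numerically. This sketch assumes made-up measurements and a grid that covers the region where the likelihood is appreciable; it computes the posterior with two different constant prior heights and compares the results. All names and numbers are illustrative.

```python
import math

# Hypothetical measurements and uncertainties.
xs = [20.1, 19.6, 20.8, 20.3]
sigmas = [0.5, 0.4, 0.8, 0.5]

def likelihood(mu):
    """Product of Gaussian terms, up to a constant factor independent of mu."""
    return math.exp(-sum((x - mu) ** 2 / (2 * s ** 2) for x, s in zip(xs, sigmas)))

def posterior_on_grid(grid, prior_height):
    """Posterior on an evenly spaced grid for a constant prior of the given height."""
    step = grid[1] - grid[0]
    unnorm = [prior_height * likelihood(mu) for mu in grid]
    z = step * sum(unnorm)                     # simple Riemann sum for the normalization
    return [u / z for u in unnorm]

grid = [15.0 + 0.01 * i for i in range(1001)]  # mu from 15 to 25

post_a = posterior_on_grid(grid, 1.0)          # "pi(mu) = 1"
post_b = posterior_on_grid(grid, 1.0 / 200.0)  # e.g. uniform on an interval of width 200

max_diff = max(abs(a - b) for a, b in zip(post_a, post_b))
print(max_diff)
```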


### Estimation

- Expected value

>$\displaystyle \int \mu\, p(\mu \lvert D)\, d\mu$

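- Variance: 2nd central moment

>$\displaystyle \int \left(\mu - \bar{\mu}\right)^2 p(\mu \lvert D)\, d\mu\ $ where $\ \displaystyle \bar{\mu} = \int \mu\, p(\mu \lvert D)\, d\mu$ is the expected value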
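
Both moments can be approximated by sums over a grid. The sketch below assumes made-up Gaussian measurements and a flat prior over the grid range, then evaluates the posterior mean and the second central moment numerically. The data values and grid limits are illustrative.

```python
import math

# Hypothetical measurements and uncertainties.
xs = [20.1, 19.6, 20.8, 20.3]
sigmas = [0.5, 0.4, 0.8, 0.5]

def likelihood(mu):
    """Product of Gaussian terms; mu-independent factors are dropped since they cancel."""
    return math.exp(-sum((x - mu) ** 2 / (2 * s ** 2) for x, s in zip(xs, sigmas)))

step = 0.001
grid = [15.0 + step * i for i in range(10001)]   # mu from 15 to 25

unnorm = [likelihood(mu) for mu in grid]         # flat prior assumed over the grid
z = step * sum(unnorm)
posterior = [u / z for u in unnorm]

# Expected value and variance (2nd central moment) of the posterior.
post_mean = step * sum(mu * p for mu, p in zip(grid, posterior))
post_var = step * sum((mu - post_mean) ** 2 * p for mu, p in zip(grid, posterior))
print(post_mean, post_var)
```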




### Maximum Likelihood Estimation

- Maximizing ${\cal{}L}$ is the same as minimizing $\,-\!\log{\cal{}L}$ 

> $\displaystyle -\log{\cal{}L(\mu)} = \mathrm{const.} + \sum_{i=1}^N \frac{(x_i\!-\!\mu)^2}{2\sigma_i^2}$
><br/>
><br/>
> Cf. the method of least squares
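
A minimal sketch of the minimization, assuming made-up measurements. Since $-\log{\cal{}L}$ is a parabola in $\mu$, a simple ternary search is enough here; the helper `minimize_convex` is a hypothetical name and a general-purpose optimizer would work equally well.

```python
# Hypothetical measurements and uncertainties.
xs = [20.1, 19.6, 20.8, 20.3]
sigmas = [0.5, 0.4, 0.8, 0.5]

def neg_log_likelihood(mu):
    """-log L without the mu-independent constant: half the weighted sum of squared residuals."""
    return sum((x - mu) ** 2 / (2 * s ** 2) for x, s in zip(xs, sigmas))

def minimize_convex(f, lo, hi, tol=1e-9):
    """Ternary search for the minimum of a convex function on [lo, hi]."""
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            hi = m2      # the minimum cannot lie to the right of m2
        else:
            lo = m1      # the minimum cannot lie to the left of m1
    return 0.5 * (lo + hi)

mu_hat = minimize_convex(neg_log_likelihood, 15.0, 25.0)
print(mu_hat)
```

Minimizing this sum of squared, $\sigma_i$-scaled residuals is exactly a weighted least-squares fit of a constant.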




### Result

- Weighted average! Using $w_i = 1 \big/ \sigma_i^2$

> $\displaystyle \hat{\mu} = \frac{\sum w_i x_i}{\sum w_i}$

- Also variance!

>$\displaystyle \frac{1}{\sigma_{\mu}^2} = \sum w_i = \sum \frac{1}{\sigma_i^2}$
><br/>
><br/>
> If all measurements have the same $\sigma$, we have
><br/>
><br/>
>$\displaystyle \frac{1}{\sigma_{\mu}^2} = \frac{N}{\sigma^2}$
$\ \ \ \rightarrow\ \ \ \
\displaystyle \sigma_{\mu} = {\sigma} \big/{\sqrt{N}}$
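
The closed-form result is a few lines of code. The sketch below uses made-up measurements and a hypothetical helper `weighted_mean`, and also checks the equal-$\sigma$ special case, where the estimate reduces to the plain average with uncertainty $\sigma/\sqrt{N}$.

```python
import math

def weighted_mean(xs, sigmas):
    """Inverse-variance weighted mean and its uncertainty."""
    weights = [1.0 / s ** 2 for s in sigmas]          # w_i = 1 / sigma_i^2
    w_sum = sum(weights)
    mu_hat = sum(w * x for w, x in zip(weights, xs)) / w_sum
    sigma_mu = 1.0 / math.sqrt(w_sum)                 # from 1 / sigma_mu^2 = sum of w_i
    return mu_hat, sigma_mu

# Hypothetical measurements with different uncertainties.
xs = [20.1, 19.6, 20.8, 20.3]
sigmas = [0.5, 0.4, 0.8, 0.5]
mu_hat, sigma_mu = weighted_mean(xs, sigmas)

# Equal uncertainties: plain average, with sigma_mu = sigma / sqrt(N).
mu_equal, sigma_equal = weighted_mean(xs, [0.5] * len(xs))
print(mu_hat, sigma_mu, mu_equal, sigma_equal)
```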



### Exercise: your 1st classification problem 

> Among some observed objects, 1% belong to a special type, e.g., quasars mixed with many stars. Using a classification method, 99% of these special objects can be correctly selected. The method also erroneously selects 0.5% of the other objects.
><br/>
><br/>
> What is the probability that an object selected by the method is of the special type?
