# Maximum Likelihood 


* *Method of Maximum Likelihood:*
    * A *data likelihood* is how likely the data is given the parameter set
    * So, if we want to maximize how likely the data is to have come from the model we fit, we should find the parameters that maximize the likelihood
    * A common trick for maximizing the likelihood is to maximize the log likelihood instead, which often makes the math much easier.  *Why can we maximize the log likelihood instead of the likelihood and still get the same answer?*
    * Consider: $\max \ln \exp \left\{ -\frac{1}{2}\left(y(x_n, \mathbf{w}) - t_n\right)^2\right\} = \max -\frac{1}{2}\left(y(x_n, \mathbf{w}) - t_n\right)^2$.  The logarithm undoes the exponential, so we go back to our original objective. 
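As a rough numerical illustration of the point above, the sketch below compares the maximizer of a likelihood, the maximizer of its logarithm, and the minimizer of the squared error on a grid. The targets `t`, the constant "model" `w`, and the grid are assumptions made up for this example; they do not come from the notes.

```python
import numpy as np

# Hypothetical targets t_n; the "model" y(x_n, w) is simply the constant w.
t = np.array([1.2, 0.7, 1.9, 1.4, 0.8])

# Candidate values of w on a grid.
w_grid = np.linspace(0.0, 3.0, 301)

# Squared-error objective 0.5 * sum_n (w - t_n)^2 for each candidate w.
sq_err = np.array([0.5 * np.sum((w - t) ** 2) for w in w_grid])

likelihood = np.exp(-sq_err)         # unnormalized Gaussian-style likelihood
log_likelihood = np.log(likelihood)  # equals -sq_err up to rounding

# ln is strictly increasing, so it does not move the location of the maximum.
w_like = w_grid[np.argmax(likelihood)]
w_log = w_grid[np.argmax(log_likelihood)]
w_lsq = w_grid[np.argmin(sq_err)]
print(w_like, w_log, w_lsq)
```

All three searches should land on the same grid point, which here is the mean of the hypothetical targets.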


## Maximum Likelihood for the Bernoulli Distribution

* Let's look at this in terms of binary variables, e.g., flipping a coin:  $x=1$ is heads, $x=0$ is tails
* Let $\mu$ be the probability of heads.  If we know $\mu$, then: $P(x = 1 |\mu) = \mu$ and $P(x = 0|\mu) = 1-\mu$
\begin{eqnarray}
P(x|\mu) = \mu^x(1-\mu)^{1-x} = \left\{\begin{array}{c c}\mu & \text{ if } x=1 \\ 1-\mu & \text{ if } x = 0 \end{array}\right.
\end{eqnarray}

* This is called the *Bernoulli* distribution.  The mean and variance of a Bernoulli distribution are: 
\begin{equation}
E[x] = \mu
\end{equation}
\begin{equation}
E\left[(x-\mu)^2\right] = \mu(1-\mu)
\end{equation}
* So, suppose we conducted many Bernoulli trials (e.g., coin flips) and we want to estimate $\mu$

### Method: Maximum Likelihood
\begin{eqnarray}
p(\mathscr{D}|\mu) &=& \prod_{n=1}^N p(x_n|\mu) \\
&=& \prod_{n=1}^N \mu^{x_n}(1-\mu)^{1-x_n}
\end{eqnarray}

* Maximize: (*What trick should we use?*)
\begin{eqnarray}
\mathscr{L} = \sum_{n=1}^N \left[ x_n \ln \mu + (1-x_n)\ln(1-\mu) \right]
\end{eqnarray}

\begin{eqnarray}
\frac{\partial \mathscr{L}}{\partial \mu} =  0 &=& \frac{1}{\mu}\sum_{n=1}^N x_n - \frac{1}{1-\mu }\sum_{n=1}^N (1 - x_n)\\
0 &=& \frac{(1-\mu) \sum_{n=1}^N x_n - \mu \sum_{n=1}^N (1- x_n)}{\mu(1-\mu)}\\
0 &=& \sum_{n=1}^N x_n - \mu \sum_{n=1}^N x_n - \mu \sum_{n=1}^N 1 + \mu \sum_{n=1}^N x_n\\
0 &=& \sum_{n=1}^N x_n - \mu N\\
\mu &=& \frac{1}{N}\sum_{n=1}^N x_n = \frac{m}{N}
\end{eqnarray}
where $m$ is the number of successful trials. 
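
The closed-form result can be checked numerically. The following is a minimal sketch, assuming simulated coin flips with a made-up "true" probability `true_mu`, an arbitrary seed, and a helper named `bernoulli_log_likelihood` introduced only for this illustration. It compares $m/N$ against a brute-force search over the log likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)               # arbitrary fixed seed
true_mu = 0.3                                # hypothetical probability of heads
x = rng.binomial(n=1, p=true_mu, size=1000)  # simulated Bernoulli trials


def bernoulli_log_likelihood(mu, x):
    """Log likelihood: sum_n [x_n ln(mu) + (1 - x_n) ln(1 - mu)]."""
    return np.sum(x * np.log(mu) + (1 - x) * np.log(1 - mu))


# Closed-form ML estimate: number of successes m divided by N.
mu_ml = x.sum() / x.size

# Brute-force check: evaluate the log likelihood on a grid strictly inside (0, 1).
mu_grid = np.linspace(0.001, 0.999, 999)
ll = np.array([bernoulli_log_likelihood(mu, x) for mu in mu_grid])
mu_grid_best = mu_grid[np.argmax(ll)]

print(mu_ml, mu_grid_best)
```

With 1000 trials, $m/N$ falls on a grid point, so the two estimates should agree up to floating-point rounding.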

* So, if we flip a coin once and get heads, then the ML estimate is $\mu = 1$ and the estimated probability of getting tails is 0.  *Would you believe that? We need a prior!*

## The Gaussian Distribution
* Consider a univariate Gaussian distribution:
\begin{equation}
\mathscr{N}(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi \sigma^2}}\exp\left\{ -\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2} \right\}
\end{equation}
* $\sigma^2$ is the variance, and $\lambda = \frac{1}{\sigma^2}$ is the *precision*
* So, as $\lambda$ gets big, the variance gets smaller and the distribution gets tighter.  As $\lambda$ gets small, the variance gets larger and the distribution gets wider.
* The Gaussian distribution is also called the *Normal* distribution. 
* We will often write $\mathscr{N}(x|\mu, \sigma^2)$ to refer to a Gaussian with mean $\mu$ and variance $\sigma^2$.
* *What is the multivariate Gaussian distribution?* 

* What is the expected value of $x$ for the Gaussian distribution?
\begin{eqnarray}
E[x] &=& \int x p(x) dx \\
     &=& \int x \frac{1}{\sqrt{2\pi \sigma^2}}\exp\left\{ -\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2} \right\} dx
\end{eqnarray}
* *Change of variables:*  Let
\begin{eqnarray}
y &=& \frac{x-\mu}{\sigma} \rightarrow x = \sigma y + \mu\\
dy &=& \frac{1}{\sigma} dx \rightarrow dx = \sigma dy
\end{eqnarray}
* Plugging this into the expectation: 
\begin{eqnarray}
E[x] &=& \int \left(\sigma y + \mu  \right)\frac{1}{\sqrt{2\pi}\sigma} \exp\left\{ - \frac{1}{2} y^2 \right\} \sigma dy \\
&=& \int \frac{\sigma y}{\sqrt{2\pi}} \exp\left\{ - \frac{1}{2} y^2 \right\} dy + \int \frac{\mu}{\sqrt{2\pi}} \exp\left\{ - \frac{1}{2} y^2 \right\} dy 
\end{eqnarray}
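* The integrand of the first term is an odd function, $f(-y) = -f(y)$, so it integrates to zero over the whole real line.  The second term is $\mu$ times the integral of a standard Gaussian density, which is 1.  So, $E[x] = 0 + \mu = \mu$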
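
The derivation can be sanity-checked by numerical integration. The sketch below approximates the integrals with a simple Riemann sum on a wide grid. The values of `mu` and `sigma`, the grid width of ten standard deviations, and the number of grid points are arbitrary choices for illustration.

```python
import numpy as np

mu, sigma = 1.5, 2.0  # hypothetical parameter values

# Grid symmetric about mu, wide enough that the tails are negligible.
x = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 200001)
dx = x[1] - x[0]

# Univariate Gaussian density evaluated on the grid.
pdf = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / np.sqrt(2 * np.pi * sigma**2)

total_prob = np.sum(pdf) * dx           # integral of p(x): close to 1
mean = np.sum(x * pdf) * dx             # integral of x p(x): close to mu
odd_term = np.sum((x - mu) * pdf) * dx  # odd integrand about mu: close to 0

print(total_prob, mean, odd_term)
```

The `odd_term` line corresponds to the first integral in the derivation (up to the change of variables), and `total_prob` to the normalization used for the second.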



## MLE of Mean of Gaussian

*  Let $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N$ be samples from an $l$-dimensional multivariate Normal distribution with known covariance matrix $\Sigma$ and an unknown mean $\mu$.  Given this data, obtain the ML estimate of the mean vector. 
	\begin{equation}
	p(\mathbf{x}_k| {{\mu}}) = \frac{1}{(2\pi)^{\frac{l}{2}}\left| \Sigma \right|^{\frac{1}{2}}}\exp\left( -\frac{1}{2}(\mathbf{x}_k - {{\mu}})^T\Sigma^{-1}(\mathbf{x}_k - {\mu})\right)
	\end{equation}
* We can define our likelihood given the $N$ data points.  We are assuming these data points are independent and identically distributed (i.i.d.):
	\begin{equation}
	\prod_{n=1}^N p(\mathbf{x}_n| {{\mu}}) = \prod_{n=1}^N \frac{1}{(2\pi)^{\frac{l}{2}}\left| \Sigma \right|^{\frac{1}{2}}}\exp\left( -\frac{1}{2}(\mathbf{x}_n - {{\mu}})^T\Sigma^{-1}(\mathbf{x}_n - {\mu})\right)
	\end{equation}
*  We can apply our "trick" to simplify
	\begin{eqnarray}
	\mathscr{L} &=& \ln \prod_{n=1}^N p(\mathbf{x}_n| {{\mu}}) = \ln \prod_{n=1}^N \left[ \frac{1}{(2\pi)^{\frac{l}{2}}\left| \Sigma \right|^{\frac{1}{2}}}\exp\left( -\frac{1}{2}(\mathbf{x}_n - {{\mu}})^T\Sigma^{-1}(\mathbf{x}_n - {\mu})\right) \right]\\
	&=& \sum_{n=1}^N  \ln \left[ \frac{1}{(2\pi)^{\frac{l}{2}}\left| \Sigma \right|^{\frac{1}{2}}}\exp\left( -\frac{1}{2}(\mathbf{x}_n - {{\mu}})^T\Sigma^{-1}(\mathbf{x}_n - {\mu})\right) \right]\\
	&=& \sum_{n=1}^N  \left( \ln \frac{1}{(2\pi)^{\frac{l}{2}}\left| \Sigma \right|^{\frac{1}{2}}} + \left( -\frac{1}{2}(\mathbf{x}_n - {{\mu}})^T\Sigma^{-1}(\mathbf{x}_n - {\mu})\right) \right) \\
	&=&  - N \ln \left( (2\pi)^{\frac{l}{2}}\left| \Sigma \right|^{\frac{1}{2}} \right) + \sum_{n=1}^N  \left( -\frac{1}{2}(\mathbf{x}_n - {{\mu}})^T\Sigma^{-1}(\mathbf{x}_n - {\mu}) \right) 
	\end{eqnarray}
* Now, let's maximize:
	\begin{eqnarray}
	\frac{\partial \mathscr{L}}{\partial \mu} &=& \frac{\partial}{\partial \mu} \left[- N \ln \left( (2\pi)^{\frac{l}{2}}\left| \Sigma \right|^{\frac{1}{2}} \right) + \sum_{n=1}^N  \left( -\frac{1}{2}(\mathbf{x}_n - {{\mu}})^T\Sigma^{-1}(\mathbf{x}_n - {\mu}) \right)\right] = 0 \\
	&\rightarrow& \sum_{n=1}^N  \Sigma^{-1}(\mathbf{x}_n - {\mu}) = 0\\
	&\rightarrow& \sum_{n=1}^N  \Sigma^{-1}\mathbf{x}_n  = \sum_{n=1}^N  \Sigma^{-1} {\mu}\\
	&\rightarrow& \Sigma^{-1} \sum_{n=1}^N \mathbf{x}_n  = \Sigma^{-1} {\mu} N\\
	&\rightarrow& \sum_{n=1}^N \mathbf{x}_n  = {\mu} N\\
	&\rightarrow& \frac{\sum_{n=1}^N \mathbf{x}_n}{N} = {\mu}
	\end{eqnarray}
* So, the ML estimate of $\mu$ is the sample mean!
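
A small numerical check of this result is sketched below, assuming a two-dimensional example with a made-up mean `true_mu`, a made-up known covariance `Sigma`, and an arbitrary seed. The helper names `log_likelihood` and `gradient` are introduced only for this illustration. The sketch verifies that the gradient derived above vanishes at the sample mean and that moving away from the sample mean lowers the log likelihood.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary fixed seed

# Hypothetical true mean and known covariance (l = 2 dimensions).
true_mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
X = rng.multivariate_normal(true_mu, Sigma, size=500)  # shape (N, l)

Sigma_inv = np.linalg.inv(Sigma)


def log_likelihood(mu, X):
    """Gaussian log likelihood of the rows of X with mean mu and known Sigma."""
    N, l = X.shape
    diff = X - mu
    # Sum over n of (x_n - mu)^T Sigma^{-1} (x_n - mu).
    quad = np.einsum("ni,ij,nj->", diff, Sigma_inv, diff)
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * N * (l * np.log(2 * np.pi) + logdet) - 0.5 * quad


def gradient(mu, X):
    """Gradient of the log likelihood w.r.t. mu: sum_n Sigma^{-1} (x_n - mu)."""
    return Sigma_inv @ np.sum(X - mu, axis=0)


mu_ml = X.mean(axis=0)  # sample mean, the closed-form ML estimate

print(mu_ml)
print(gradient(mu_ml, X))  # numerically close to the zero vector
print(log_likelihood(mu_ml, X), log_likelihood(true_mu, X))
```

Note that the sample mean scores at least as high as the value used to generate the data, since it maximizes the likelihood of this particular sample.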

