#### Some Tex macros - don't touch
$$ 
\def\R{{\mathbb R}} 
\newcommand{\tbrkt}[1]{(#1)}
\newcommand{\mbrkt}[1]{\left(#1\right)}
\newcommand{\vecuc}{\mathbf{U}}
\newcommand{\vecu}{\mathbf{u}}
\newcommand{\vecx}{\mathbf{x}}
\newcommand{\vecs}{\mathbf{S}}
\newcommand{\vecz}{\mathbf{z}}
\newcommand{\vecy}{\mathbf{y}}
\newcommand{\vecY}{\mathbf{Y}}
\def\ba{\mathbf{a}}
\def\bb{\mathbf{b}}
\def\be{\mathbf{e}}
\def\bx{\mathbf{x}}
\def\by{\mathbf{y}}
\def\bz{\mathbf{z}}
\def\bc{\mathbf{c}}
\def\bm{\mathbf{m}}
\def\bt{\mathbf{t}}
\def\bv{\mathbf{v}}
\def\bu{\mathbf{u}}
\def\bw{\mathbf{w}}
\def\bF{\mathbf{F}}
\def\bJ{\mathbf{J}}
\def\bA{\mathbf{A}}
\def\bH{\mathbf{H}}
\def\bT{\mathbf{T}}
\def\bX{\mathbf{X}}
\def\bY{\mathbf{Y}}
\def\bW{\mathbf{W}}
\def\bB{\mathbf{B}}
\def\bD{\mathbf{D}}
\def\bS{\mathbf{S}}
\def\bI{\mathbf{I}}
\def\bL{\mathbf{L}}
\def\bU{\mathbf{U}}
\def\bV{\mathbf{V}}
\def\bZ{\mathbf{Z}}
\def\b0{\mathbf{0}}
\def\b1{\mathbf{1}}
\def\cC{\mathcal{C}}
\def\cD{\mathcal{D}}
\def\cE{\mathcal{E}}
\def\cL{\mathcal{L}}
\def\cN{\mathcal{N}}
\def\cO{\mathcal{O}}
\def\cU{\mathcal{U}}
\def\cX{\mathcal{X}}
\def\cW{\mathcal{W}}
\def\mR{\mathbb{R}}
\def\mN{\mathbb{N}}
\def\EE{\mathbb{E}}
\def\btheta{\mathbf{\theta}}
\def\bvartheta{\mbox{\boldmath $\vartheta$}}
\def\bomega{\mbox{\boldmath $\omega$}}
$$

# Why use Log Likelihood instead of Likelihood?

Note that likelihood usually enters the ML pipeline as the negative of the cost function, i.e. maximizing the likelihood corresponds to minimizing the cost function. In fact, [most cost functions are some form of likelihoods](https://datascience.stackexchange.com/q/10188/38717). E.g. in linear regression, the squared error loss comes directly from the assumption of Gaussian errors. Similarly, the cross-entropy loss used in logistic regression comes directly from the Bernoulli distribution. Thus, likelihood plays an important role in almost all ML problems. Here we take a closer look at why we try to work with **log likelihood instead of likelihood**.

#### Independence assumption
1. Given $n$ datapoints $\bX = (x_1,x_2,\dots,x_n)$ which are **sampled independently from the same distribution** $f(.,\btheta)$, we usually want to make some inference about the parameters $\btheta$ of the distribution. E.g. $f$ could be a Gaussian distribution and we want to estimate its parameters - mean $\theta_1$ and variance $\theta_2$. One of the widely used approaches to inference is to find the parameters that maximize the probability of observing the given data $\bX$ - this procedure is known as [Maximum Likelihood Estimation](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation).
The likelihood of observing the data under the **independence** setting is given by $f(x_1,\btheta)f(x_2,\btheta)\dots f(x_n,\btheta)$. Maximizing this product with respect to $\btheta$ is cumbersome even if the density $f$ has a simple expression such as a polynomial, because the product rule gives a gradient with $n$ terms that each involve all $n$ datapoints: $\frac{\partial f(x_1,\btheta)}{\partial \btheta}f(x_2,\btheta)\dots f(x_n,\btheta) + f(x_1,\btheta) \frac{\partial f(x_2,\btheta)}{\partial \btheta} \dots f(x_n,\btheta) + \dots + f(x_1,\btheta)\dots f(x_{n-1},\btheta)\frac{\partial f(x_n,\btheta)}{\partial \btheta}$. Most real-world probability densities have much more complicated expressions, which makes this maximization almost intractable. Taking the logarithm of the product helps with tractability.
Since log is monotonically increasing, the same $\btheta^{*}$ which maximizes the product expression also maximizes $\log (f(x_1,\btheta)f(x_2,\btheta)\dots f(x_n,\btheta)) = \sum_i \log f(x_i,\btheta)$. The gradient of this log likelihood is much easier to handle: we just have to evaluate $\sum_i \frac{1}{f(x_i,\btheta)}\frac{\partial f(x_i,\btheta)}{\partial \btheta}$, which does not involve any product terms. Further, this gradient can be computed *in parallel* over batches of the input datapoints, since each batch contributes an independent partial sum. This cannot be done with the product expression above, where every term involves all the datapoints. **This is a big deal especially when we are dealing with thousands of datapoints.**
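
As a rough illustration of the batching argument, here is a minimal sketch for a Gaussian with unknown mean and known unit variance (that simplification, and the helper names `log_likelihood`, `grad_log_density` and `batch_gradient`, are assumptions made for this example only). It sums the per-datapoint gradients batch by batch and compares the result with the gradient over the full dataset.

```python
import math
import random

def log_likelihood(data, mu, sigma=1.0):
    """Sum of log f(x_i; mu, sigma) for a Gaussian density f."""
    const = -0.5 * math.log(2 * math.pi * sigma ** 2)
    return sum(const - (x - mu) ** 2 / (2 * sigma ** 2) for x in data)

def grad_log_density(x, mu, sigma=1.0):
    """d/dmu of log f(x; mu, sigma), i.e. (1/f) * df/dmu."""
    return (x - mu) / sigma ** 2

def batch_gradient(batch, mu):
    """Gradient contribution of one batch; it needs no other batch."""
    return sum(grad_log_density(x, mu) for x in batch)

random.seed(0)
data = [random.gauss(2.0, 1.0) for _ in range(1000)]  # hypothetical sample
mu = 0.5  # arbitrary point at which the gradient is evaluated

# Gradient over the full dataset in one pass.
full_grad = sum(grad_log_density(x, mu) for x in data)

# Same gradient assembled from 10 independent batches of 100 points.
batches = [data[i:i + 100] for i in range(0, len(data), 100)]
batched_grad = sum(batch_gradient(b, mu) for b in batches)

# Setting the gradient to zero gives the sample mean as the estimate of mu.
mu_hat = sum(data) / len(data)
```

Here the batches are processed sequentially, but since no batch depends on another, each call to `batch_gradient` could be handed to a separate worker.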

#### Most densities are from the exponential family
2. Most densities we encounter in day-to-day applications come from the [exponential family](https://en.wikipedia.org/wiki/Exponential_family) - e.g. Gaussian, Binomial, Bernoulli, Dirichlet, Poisson, etc. The general form of density of exponential family distributions is given by

$f_X(x,\btheta) = h(x)\,\exp\!\bigl[\,\eta(\btheta) \cdot T(x) -A(\btheta)\,\bigr]$, where $h,\eta,A,T$ are known functions. E.g. for the Gaussian distribution $f(x;\mu,\sigma) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x-\mu)^2}{2 \sigma^2}}$, we have

\begin{align} 
\boldsymbol {\eta} &= \left[\,\frac{\mu}{\sigma^2},~-\frac{1}{2\sigma^2}\,\right]^{\mathsf T} \\
h(x) &= \frac{1}{\sqrt{2 \pi}} \\
T(x) &= \left( x, x^2 \right)^{\rm T} \\
A({\boldsymbol \eta}) &= \frac{\mu^2}{2 \sigma^2} + \log |\sigma| = -\frac{\eta_1^2}{4\eta_2} + \frac{1}{2}\log\left|\frac{1}{2\eta_2} \right|
\end{align}

Hence, when dealing with most real-world distributions, it makes sense to take the logarithm: it cancels the exponential in the density expression and leaves a plain sum of terms, thereby yielding a much simpler gradient expression.
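
To make the cancellation concrete, the sketch below evaluates the Gaussian log density in two ways: directly, by taking the log of the density, and through the exponential-family pieces $\eta$, $h$, $T$, $A$ listed above. It assumes the convention in which $A$ is subtracted inside the exponent, which is the one consistent with the Gaussian $A(\eta)$ given above. The function names are made up for this example.

```python
import math

def natural_params(mu, sigma):
    """eta = (mu / sigma^2, -1 / (2 sigma^2)) for the Gaussian."""
    return (mu / sigma ** 2, -1.0 / (2 * sigma ** 2))

def log_partition(eta):
    """A(eta) = -eta1^2 / (4 eta2) + 0.5 * log|1 / (2 eta2)|."""
    eta1, eta2 = eta
    return -eta1 ** 2 / (4 * eta2) + 0.5 * math.log(abs(1.0 / (2 * eta2)))

def log_density_expfam(x, eta):
    """log f = log h(x) + eta . T(x) - A(eta); no exp() is ever evaluated."""
    log_h = -0.5 * math.log(2 * math.pi)  # h(x) = 1 / sqrt(2 pi)
    t = (x, x * x)                        # T(x) = (x, x^2)
    return log_h + eta[0] * t[0] + eta[1] * t[1] - log_partition(eta)

def log_density_direct(x, mu, sigma):
    """Evaluate the density first, then take its log."""
    f = math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)
    return math.log(f)

# Arbitrary test point and parameters, chosen only for illustration.
x, mu, sigma = 0.3, 1.2, 0.7
via_expfam = log_density_expfam(x, natural_params(mu, sigma))
via_direct = log_density_direct(x, mu, sigma)
```

The two values agree up to floating-point rounding, but the exponential-family version is linear in $T(x)$ and never calls `exp`, which is what makes its gradient simple.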


#### Computational issues
3. Consider the scenario where we have 1000 datapoints sampled from a standard Normal distribution. Try evaluating the product of their densities $f(x_1,\btheta)f(x_2,\btheta)\dots f(x_n,\btheta)$, where each density term is typically of the order of $10^{-1}$ to $10^{-3}$: the result is a very small number, smaller than anything double-precision floating point can represent. If one is not careful, the machine rounds this very small number to zero (underflow). Evaluating the gradients when the absolute value of the objective is so small can lead to errors. With the log likelihood $\sum_i \log f(x_i,\btheta)$, however, the individual terms in the summation have moderate magnitude (usually with a negative sign), and hence no such approximation errors occur. I have burnt my fingers more than once with this approximation of very small densities to zeros.
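
The underflow is easy to reproduce. The sketch below draws 1000 standard Normal samples (the seed and variable names are arbitrary choices for this example), multiplies their densities, and separately sums their log densities.

```python
import math
import random

def normal_pdf(x):
    """Density of the standard Normal distribution at x."""
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def normal_logpdf(x):
    """Log density of the standard Normal, computed without calling exp()."""
    return -0.5 * x * x - 0.5 * math.log(2 * math.pi)

random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(1000)]

# Product of densities: every factor is below 0.4, so the running product
# drops below the smallest representable double and becomes exactly 0.0.
likelihood = 1.0
for x in samples:
    likelihood *= normal_pdf(x)

# Sum of log densities: each term is a moderate negative number,
# so the total stays finite.
log_likelihood = sum(normal_logpdf(x) for x in samples)

print(likelihood)      # 0.0
print(log_likelihood)  # a finite negative number
```

Once the product has collapsed to `0.0`, it carries no information about $\btheta$ at all, whereas the log likelihood can still be compared across parameter values.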