# About
* created on
* references
    * https://towardsdatascience.com/probability-concepts-explained-maximum-likelihood-estimation-c7b4342fdbb1

* Objective

# The 3 types of probability

1. Marginal Probability
* If A is an event, then the marginal probability is the probability of that event occurring, i.e. P(A)
    * The marginal probability of flipping an unbiased coin and it landing on heads is P(H) = 0.5

2. Joint Probability
* The probability of the intersection of two or more events. 
* Visually it is the intersection of the circles of two events on a Venn Diagram
* If A and B are two events then the joint probability of the two events is written as P(A ∩ B). 

3. Conditional Probability
* The conditional probability is the probability that some event(s) occur given that we know other events have already occurred. If A and B are two events then the conditional probability of A occurring given that B has occurred is written as P(A|B). 
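
The three definitions above can be checked by brute-force counting on a small sample space. The sketch below is illustrative only: the two-dice setup and the events A and B are assumptions made for the example, and it uses the standard identity P(A|B) = P(A ∩ B) / P(B), which these notes do not state explicitly.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event, given as a predicate over outcomes."""
    favourable = sum(1 for o in outcomes if event(o))
    return Fraction(favourable, len(outcomes))

def a(o):
    # Hypothetical event A: the first die shows a 6.
    return o[0] == 6

def b(o):
    # Hypothetical event B: the two dice sum to at least 10.
    return o[0] + o[1] >= 10

p_a = prob(a)                              # marginal P(A)
p_b = prob(b)                              # marginal P(B)
p_a_and_b = prob(lambda o: a(o) and b(o))  # joint P(A ∩ B)
p_a_given_b = p_a_and_b / p_b              # conditional P(A|B)

print(p_a, p_b, p_a_and_b, p_a_given_b)
```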

# Maximum Likelihood Estimation
* Maximum likelihood estimation is a method that determines values for the parameters of a model. The parameter values are found such that they maximise the likelihood that the process described by the model produced the data that were actually observed.
* Say we have some data, and looking at its spread we suspect that it follows a Gaussian (Normal) distribution
* Gaussian distributions have 2 parameters, the mean $\mu$ and the standard deviation $\sigma$
    * Different values of these parameters => different curves
* Our question is, <u> which curve was most likely responsible for creating the data points that we observed?  </u>
    * <b> <i> Maximum likelihood estimation is a method that will find the values of μ and σ that result in the curve that best fits the data. </i> </b>
    
    ![1_uLKl0Nz1vFg6bmfiqpCKZQ.png](attachment:1_uLKl0Nz1vFg6bmfiqpCKZQ.png)
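
To see how the parameters change the curve, the sketch below tabulates the height of the Gaussian density at a few points for several parameter choices. The (μ, σ) pairs and the x values are arbitrary, chosen only for illustration; `NormalDist` comes from Python's standard `statistics` module.

```python
from statistics import NormalDist

# Hypothetical (mu, sigma) pairs, chosen only to show how the curve changes.
candidates = [(10.0, 1.0), (10.0, 2.0), (12.0, 1.0)]
xs = [8.0, 10.0, 12.0]

curves = {}
for mu, sigma in candidates:
    dist = NormalDist(mu=mu, sigma=sigma)
    # Height of the density curve at each x.
    curves[(mu, sigma)] = [dist.pdf(x) for x in xs]

for params, heights in curves.items():
    print(params, [round(h, 4) for h in heights])
```

Changing μ shifts the peak along the x-axis, while a larger σ gives a lower, wider curve.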

## Calculating the Maximum Likelihood Estimates
* Now we can move on to learning how to calculate the parameter values. The values that we find are called the maximum likelihood estimates (MLE)

* Example
    * Suppose we have three data points and we assume that they have been generated from a process that is adequately described by a Gaussian distribution
    * The 3 data points: 9, 9.5, 11
    * <i>  How do we calculate the maximum likelihood estimates of the parameter values of the Gaussian distribution μ and σ? </i>
    
* What we want to calculate is the  <b> total probability of observing all of the data, i.e. the joint probability distribution of all observed data points</b>. 
    * In general this requires conditional probabilities (how each data point depends on the others), which can get very difficult.
    * So here we make our first assumption: <i> each data point is generated independently of the others. </i>
    * This assumption makes the maths much easier. If the events (i.e. the processes that generate the data points) are independent, then the total probability of observing all of the data is the product of the probabilities of observing each data point individually (i.e. the product of the marginal probabilities).
 
 
* The probability density of observing a single data point $x$ that is generated from a Gaussian distribution is given by:
![1_t4zrihvhtlZJZsvcX3jRjg.png](attachment:1_t4zrihvhtlZJZsvcX3jRjg.png)
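
For reference, the standard Gaussian density can be written in the notation of this section as follows. This assumes the image above shows the usual form, with $\mu$ the mean and $\sigma$ the standard deviation:

$$
    P(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
$$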

* The semicolon used in the notation P(x; μ, σ) is there to emphasise that the symbols that appear after it are parameters of the probability distribution.

* Plugging in the values of our data points, the joint density of the three observations is
![1_rFzbQ614IR4zEwBM3k1V0Q.png](attachment:1_rFzbQ614IR4zEwBM3k1V0Q.png)
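
Under the independence assumption, this joint density is just the product of three single-point Gaussian densities. The following is a minimal sketch of that calculation; the function names and the candidate (μ, σ) values are assumptions for illustration, not values from the text.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a Gaussian with mean mu and standard deviation sigma at x."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

def likelihood(data, mu, sigma):
    """Joint density of independent points: product of individual densities."""
    return math.prod(gaussian_pdf(x, mu, sigma) for x in data)

data = [9.0, 9.5, 11.0]

# Hypothetical candidate parameters, not values taken from the text.
for mu, sigma in [(9.0, 1.0), (10.0, 1.0), (12.0, 1.0)]:
    print(mu, sigma, likelihood(data, mu, sigma))
```

Of these three candidates, the one with μ = 10 gives the largest value, which matches the intuition that the best-fitting curve sits near the centre of the data.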

* Now, we just have to figure out the values of μ and σ that give the <b> maximum value of the above expression</b>

* One way to do this is with calculus. Yup, say hello to your old friend <i> differentiation </i>
    * By differentiating the function and setting the derivative to 0, we can find the maximum
    * boom! we have our MLE values
* With the above equation, differentiation is definitely a pain to do, so we can simplify it by taking the natural logarithm of the expression
    * this is ok because the natural logarithm is a monotonically increasing function
        * This means that as the value on the x-axis increases, the value on the y-axis also increases.
        * This is important because it ensures that the maximum value of the log of the probability occurs at the same point as the original probability function. 
        * Therefore we can work with the simpler log-likelihood instead of the original likelihood.
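
A quick numerical check of this claim: the sketch below evaluates both the likelihood and the log-likelihood of the three data points over a grid of candidate means and confirms that both pick the same grid point. Holding σ fixed at 1 and the choice of grid are assumptions made only to keep the example small.

```python
import math
from statistics import NormalDist

data = [9.0, 9.5, 11.0]
sigma = 1.0  # held fixed here purely for illustration

# Candidate means from 8.00 to 12.00 in steps of 0.01.
grid = [8.0 + 0.01 * i for i in range(401)]

def likelihood(mu):
    """Product of the individual densities for a candidate mean."""
    dist = NormalDist(mu, sigma)
    return math.prod(dist.pdf(x) for x in data)

def log_likelihood(mu):
    """Sum of the log densities for a candidate mean."""
    dist = NormalDist(mu, sigma)
    return sum(math.log(dist.pdf(x)) for x in data)

best_by_likelihood = max(grid, key=likelihood)
best_by_log_likelihood = max(grid, key=log_likelihood)

print(best_by_likelihood, best_by_log_likelihood)
```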

* Taking the natural log of the original expression and simplifying, we get
![1_xjDrGJ_JHLMa7619jFkjLA.png](attachment:1_xjDrGJ_JHLMa7619jFkjLA.png)
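
Written out in LaTeX, and assuming the standard Gaussian density for each of the three points, the log-likelihood is:

$$
    \ln P(9, 9.5, 11; \mu, \sigma) = -3\ln(\sigma) - \frac{3}{2}\ln(2\pi) - \frac{1}{2\sigma^2}\left[(9-\mu)^2 + (9.5-\mu)^2 + (11-\mu)^2\right]
$$

Each data point contributes one $-\ln(\sigma) - \frac{1}{2}\ln(2\pi)$ term and one squared-deviation term, which is where the factors of 3 come from.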

* This expression can be differentiated to find the maximum. To get the values of each parameter, we conduct partial differentiation.
* In this example we’ll find the MLE of the mean, μ. To do this we take the partial derivative of the function with respect to μ, giving

![1_kdjQQo5jUX9a2Z0kblJ4Hg.png](attachment:1_kdjQQo5jUX9a2Z0kblJ4Hg.png)
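
In LaTeX, differentiating the Gaussian log-likelihood of the three points with respect to $\mu$ (only the squared-deviation terms depend on $\mu$) gives:

$$
    \frac{\partial \ln P(9, 9.5, 11; \mu, \sigma)}{\partial \mu} = \frac{1}{\sigma^2}\left[9 + 9.5 + 11 - 3\mu\right]
$$
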
* Setting the left-hand side to 0 and rearranging for $\mu$ gives

$$
    \mu = \frac{9+9.5+11}{3} \approx 9.833
$$
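
The result can be checked numerically: the estimate is the sample mean, and the partial derivative with respect to μ vanishes there. The estimate for σ in the sketch below is not derived in these notes; it is the standard result of applying the same procedure to σ, and is included only as a labelled extra.

```python
import math
from statistics import fmean

data = [9.0, 9.5, 11.0]

# MLE of the mean: the sample mean.
mu_hat = fmean(data)

def dlogl_dmu(mu, sigma=1.0):
    """Partial derivative of the Gaussian log-likelihood with respect to mu."""
    return sum(x - mu for x in data) / sigma ** 2

# Not derived in the text: the same procedure applied to sigma gives the
# square root of the mean squared deviation (dividing by n, not n - 1).
sigma_hat = math.sqrt(fmean([(x - mu_hat) ** 2 for x in data]))

print(round(mu_hat, 3), dlogl_dmu(mu_hat), round(sigma_hat, 3))
```
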
---

# Can maximum likelihood estimation always be solved in an exact manner?
* No is the short answer. 
* In real-world scenarios the derivative of the log-likelihood function is often analytically intractable (i.e. it’s way too hard/impossible to differentiate the function, or to solve the resulting equations, by hand). Therefore, iterative methods like Expectation-Maximization algorithms are used to find numerical solutions for the parameter estimates. The overall idea is still the same though.
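
As a rough illustration of an iterative approach, the sketch below maximises the Gaussian log-likelihood of the three data points by plain gradient ascent. This is not Expectation-Maximization; it is a simpler stand-in technique. The Gaussian case has an exact solution, so it is used here only because the answer can be checked. The starting point, step size, iteration count, and the choice to optimise log σ (which keeps σ positive) are all assumptions of this sketch.

```python
import math

data = [9.0, 9.5, 11.0]
n = len(data)

def gradients(mu, log_sigma):
    """Gradient of the Gaussian log-likelihood w.r.t. mu and log(sigma)."""
    sigma2 = math.exp(2.0 * log_sigma)
    sq = sum((x - mu) ** 2 for x in data)
    d_mu = sum(x - mu for x in data) / sigma2
    d_log_sigma = -n + sq / sigma2
    return d_mu, d_log_sigma

# Arbitrary starting point and step size (assumptions for this sketch).
mu, log_sigma = 8.0, 0.0
step = 0.05
for _ in range(5000):
    d_mu, d_log_sigma = gradients(mu, log_sigma)
    mu += step * d_mu
    log_sigma += step * d_log_sigma

print(round(mu, 4), round(math.exp(log_sigma), 4))
```

The loop should settle on the same mean as the exact calculation, which is the sense in which the overall idea stays the same when no closed form is available.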


# When is least squares minimisation the same as maximum likelihood estimation?
Least squares minimisation is another common method for estimating parameter values for a model in machine learning. It turns out that when the model's errors are assumed to be Gaussian (with a constant variance), as in the examples above, the maximum likelihood estimates are the same as the least squares estimates.
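
The reason is that the Gaussian log-likelihood equals a constant minus the sum of squared residuals divided by 2σ², so maximising one is the same as minimising the other. The sketch below checks this on a small straight-line fit; the data, the model y = a + b·x, and the fixed σ = 1 are hypothetical choices made for illustration.

```python
import math

# Hypothetical data for a straight-line model y = a + b * x plus noise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(xs)

def sse(a, b):
    """Sum of squared residuals, the quantity least squares minimises."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

def log_likelihood(a, b, sigma=1.0):
    """Gaussian log-likelihood with mean a + b * x and fixed noise sigma."""
    const = -n * math.log(sigma) - 0.5 * n * math.log(2.0 * math.pi)
    return const - sse(a, b) / (2.0 * sigma ** 2)

# Closed-form least squares estimates for a simple linear regression.
x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
b_ls = s_xy / s_xx
a_ls = y_bar - b_ls * x_bar

# Perturbing the least squares fit should only lower the log-likelihood.
offsets = (-0.1, 0.0, 0.1)
neighbours = [(a_ls + da, b_ls + db) for da in offsets for db in offsets]
best = max(neighbours, key=lambda p: log_likelihood(*p))

print(a_ls, b_ls, best)
```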