# WEEK 32: Maximum Likelihood Estimation

 

https://learnwith.campusx.in/s/courses/637339afe4b0615a1bbed390/take

## Difference between Likelihood and Probability

1. __Probability :__ 
    - Probability is a measure of how likely a given event is to occur.
    - It is a mathematical concept used to quantify the chance of an event happening, typically expressed as a number between 0 and 1, where 0 indicates impossibility and 1 indicates certainty.
    - For example, when you roll a fair six-sided die, the probability of rolling a 3 is 1/6, as there is only one favorable outcome out of the six possible outcomes.

2. __Likelihood :__ 
    - Likelihood, on the other hand, is used in statistical inference and represents how well a particular statistical model explains the observed data.
    - Numerically, it is the probability (or probability density) of the observed data under a specific parameter value, but it is read as a function of the parameter with the data held fixed. It is therefore not the probability of an event, but rather __a measure of the support provided by the data for a particular model or hypothesis__.
    - For instance, in a coin-flipping experiment, if you observe 10 heads out of 10 tosses, the likelihood of a coin biased towards heads is higher than the likelihood of a fair coin.

In simpler terms:

- __Probability is used to describe the chance of a future event occurring.__


- __Likelihood is used to describe the support provided by observed data for a particular hypothesis or model.__

In summary, probability deals with future events, while likelihood deals with the support of data for a particular explanation or model.

## Probability vs Likelihood

### **Probability:**

Probability measures the chance of a future event occurring based on known information. It is used when the model and its parameters are already known and you want to predict or assess how likely various outcomes are.

**Example of Probability:**
Let's say you are rolling a fair six-sided die. The probability of rolling a 4 is 1/6 because there is 1 favorable outcome (rolling a 4) out of 6 possible outcomes (rolling any number from 1 to 6).

### **Likelihood:**

Likelihood, on the other hand, is used when you have observed some data and you want to make inferences about the parameters or characteristics of a statistical model. It measures how well a particular model or set of parameters explains the observed data.

**Example of Likelihood:**
Imagine you have a bag of 100 marbles, and you want to estimate the probability $p$ of drawing a red marble. You draw 10 marbles, and 8 of them are red. Given this observed data (8 red out of 10), the likelihood of $p = 0.8$ (80%) is high, much higher than for values such as $p = 0.5$ or $p = 0.2$. In this case, you are using likelihood to judge which value of the parameter is best supported by the observed data.
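The marble example can be sketched numerically. This is only an illustration: the helper name `binomial_likelihood` and the list of candidate values are made up here, and the draws are assumed to be independent with a constant chance of red (as if drawn with replacement).

```python
import math

def binomial_likelihood(p, k, n):
    """Likelihood of success probability p, given k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

# Observed data from the example: 8 red marbles in 10 draws.
candidates = [0.2, 0.5, 0.8, 0.9]  # illustrative candidate values of p
likelihoods = {p: binomial_likelihood(p, k=8, n=10) for p in candidates}

for p, value in likelihoods.items():
    print(f"p = {p}: likelihood = {value:.4f}")

# The candidate that the observed data supports most strongly.
best_p = max(likelihoods, key=likelihoods.get)
print("best supported candidate:", best_p)
```

Among these candidates, the data supports $p = 0.8$ the most, which matches the observed fraction of red marbles.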

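### Formula for likelihood:

For a single observation $x$ assumed to come from a Normal distribution, the likelihood of the parameters $(\mu, \sigma)$ is the Normal density evaluated at $x$:

$$L(\mu,\sigma|x)=\frac{1}{\sqrt {2 \pi \sigma^2}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$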

##### In summary, probability deals with predicting future events based on known information, while likelihood deals with assessing how well a particular model or parameter explains observed data. 



##### Probability is forward-looking, while likelihood is backward-looking and is often used in statistical modeling and inference.

### Example 1 - Coin Toss:

__for probability :__
- It is a **Bernoulli distribution**, as the outcome is binary, i.e. head or tail


- Bernoulli has one parameter, called "p"


- here $p = p(H)$ = probability of getting head = 0.5


- $q = (1-p) = p(T)$ = probability of getting tail = 0.5


- **probability mass function (PMF)** = $p \cdot k + (1-p)(1-k)$,
    - where k = 1 for head and k = 0 for tail
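As a small sketch, the PMF above can be written directly in code. The function names are illustrative, and the second function is the equivalent power form $p^k(1-p)^{1-k}$ found in many textbooks, included only to show that both agree for k in {0, 1}.

```python
def bernoulli_pmf(k, p):
    """PMF in the form used above: p*k + (1-p)*(1-k), for k in {0, 1}."""
    if k not in (0, 1):
        raise ValueError("k must be 0 (tail) or 1 (head)")
    return p * k + (1 - p) * (1 - k)

def bernoulli_pmf_power_form(k, p):
    """Equivalent power form: p**k * (1-p)**(1-k)."""
    return p**k * (1 - p) ** (1 - k)

p = 0.5  # fair coin
print(bernoulli_pmf(1, p))  # probability of head
print(bernoulli_pmf(0, p))  # probability of tail
```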

#### for likelihood:

- flipping the coin 5 times and getting H,H,H,H,H


- __so how likely is it that the coin is fair?__


- the probability of this data under a fair coin = $p(H)^5=(0.5)^5=0.03125$, which is a small number


- __hence, the likelihood that the coin is fair, i.e. p(H) = 0.5, given 5 consecutive heads, is low__
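The same calculation can be sketched in code. The helper `coin_likelihood` and the comparison value 0.9 are illustrative choices, and the tosses are assumed to be independent.

```python
def coin_likelihood(p_head, tosses):
    """Likelihood of p_head given a sequence of independent tosses ('H' or 'T')."""
    likelihood = 1.0
    for toss in tosses:
        likelihood *= p_head if toss == "H" else (1 - p_head)
    return likelihood

observed = "HHHHH"
fair = coin_likelihood(0.5, observed)
biased = coin_likelihood(0.9, observed)  # 0.9 is an arbitrary biased coin for comparison

print(fair)    # 0.03125
print(biased)  # larger than the fair-coin likelihood
```

The data is the same in both calls. Only the parameter value being judged changes, which is exactly what makes this a likelihood rather than a probability calculation.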



### conclusion:
> __If we know the value of the parameter (p) and calculate the chance of an event happening, that is called probability.__



> __But if we have already observed the outcome and assess how plausible a particular parameter value is given that outcome, that is called likelihood.__

### example for continuous:

#### Normal distribution of heights $\;\mu=150cm,\;\sigma=10cm$

- if the distribution is known, then we know its parameters, i.e. $(\mu,\;\sigma)$


- the chance of an event happening under that distribution is called probability

###### chance of the height of a random person being between 170 cm and 180 cm


#### $\rightarrow\;$probability:


![Screenshot%202023-09-04%20005517.png](attachment:Screenshot%202023-09-04%20005517.png)


##### p (170 < x < 180) = 0.0214
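The value above can be checked with the Normal CDF. This is a minimal sketch using only the standard library; `normal_cdf` is a helper name introduced here, built on `math.erf`.

```python
import math

def normal_cdf(x, mean, std_dev):
    """Cumulative distribution function of a Normal distribution, via the error function."""
    z = (x - mean) / (std_dev * math.sqrt(2))
    return 0.5 * (1 + math.erf(z))

# P(170 < X < 180) for heights distributed as N(mu=150, sigma=10)
prob = normal_cdf(180, 150, 10) - normal_cdf(170, 150, 10)
print(round(prob, 4))  # 0.0214
```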

#### $\rightarrow\;$likelihood:

- given an observed height of 100 cm, how plausible is it that it came from this distribution, i.e. __N ($\mu$ = 150, $\sigma$ = 10cm)__?


$$L(\mu,\sigma|X=100)=\frac{1}{\sqrt {2 \pi \sigma^2}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

![Screenshot%202023-09-04%20015015.png](attachment:Screenshot%202023-09-04%20015015.png)

In [37]:
import math

In [38]:
def normal_distribution_pdf(x, mean, std_dev):
    """
    calculate the Normal distribution PDF.
    """
    exponent = -((x-mean)**2)/(2*(std_dev**2))
    return (1/(std_dev * math.sqrt(2 * math.pi))) * math.exp(exponent)

In [39]:
normal_distribution_pdf(100, 150, 10)

# 0.0000001486719514734298

1.486719514734298e-07

In [42]:
normal_distribution_pdf(130, 150, 10)

# still a low likelihood, but much higher than for x = 100

0.005399096651318806

In [43]:
normal_distribution_pdf(140, 150, 10)
# x is closer to the mean, so the likelihood is higher still

0.02419707245191434

In [40]:
normal_distribution_pdf(100, 180, 10)

# 0.0000000000000005052271083536894

5.052271083536894e-16

In [41]:
normal_distribution_pdf(100, 190, 10)

# 0.00000000000000000010279773571668917

1.0279773571668917e-19
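The last cells keep the observation fixed at x = 100 and change $\mu$, which is the likelihood view: the data stays the same and the parameter varies. A small sketch of that idea follows, scanning a grid of candidate means. The grid and the name `normal_pdf` are illustrative, and $\sigma$ is held at 10 as in the cells above.

```python
import math

def normal_pdf(x, mean, std_dev):
    """Normal density, the same formula as in the cells above."""
    exponent = -((x - mean) ** 2) / (2 * std_dev**2)
    return math.exp(exponent) / (std_dev * math.sqrt(2 * math.pi))

x_observed = 100
candidate_means = range(90, 200, 10)  # 90, 100, ..., 190 (illustrative grid)

# Likelihood of each candidate mean given the single observation x = 100
likelihood_by_mean = {mu: normal_pdf(x_observed, mu, 10) for mu in candidate_means}

best_mean = max(likelihood_by_mean, key=likelihood_by_mean.get)
print(best_mean)  # 100
```

With a single observation, the most plausible mean is the observation itself, and the likelihood drops quickly as the candidate mean moves away from it.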

### Probability Vs Likelihood

**Probability :** This is a measure of the chance that a certain event will occur out
of all possible events. It's usually presented as a ratio or fraction, and it ranges
from 0 (meaning the event will not happen) to 1 (meaning the event is certain
to happen).



**Likelihood :** In a statistical context, likelihood is a function that measures the
plausibility of a particular parameter value given some observed data. It
quantifies how well the observed data supports specific parameter values.



**More Definitions**

- A probability quantifies how often you observe a certain outcome of a test,
given a certain understanding of the underlying data.
- A likelihood quantifies how good one's model is, given a set of data that's been
observed.



Probabilities describe test outcomes, while likelihoods describe models.

## Maximum Likelihood Estimation:

Maximum Likelihood Estimation (MLE) is a method of estimating the parameters of a
statistical model given some observed data.

### 1. Coin toss:

![Screenshot%202023-09-04%20025826.png](attachment:Screenshot%202023-09-04%20025826.png)

$$L(p=0.6)=(0.6)^5\;>\;L(p=0.5)=(0.5)^5$$

##### For 5 heads in 5 tosses, the likelihood $L(p)=p^5$ is maximum for a fully biased coin with $p(H)=1$, where it equals $(1)^5=1$
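A rough way to see this maximization is a grid search over candidate values of p. The sketch below assumes independent tosses. The grid resolution, the function names, and the second data set (3 heads, 2 tails) are illustrative additions, not part of the example above.

```python
def bernoulli_likelihood(p, heads, tails):
    """Likelihood of p given counts of heads and tails from independent tosses."""
    return p**heads * (1 - p) ** tails

# Candidate values p = 0.00, 0.01, ..., 1.00
grid = [i / 100 for i in range(101)]

def grid_mle(heads, tails):
    """Grid value of p with the highest likelihood."""
    return max(grid, key=lambda p: bernoulli_likelihood(p, heads, tails))

print(grid_mle(5, 0))  # 1.0  (five heads in five tosses)
print(grid_mle(3, 2))  # 0.6  (hypothetical data: three heads, two tails)
```

In both cases the maximizing value equals the observed fraction of heads, which is the general maximum likelihood estimate for a Bernoulli parameter.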

### 2. 2 green, 3 red balls: likelihood of getting a green ball

![Screenshot%202023-09-04%20030520.png](attachment:Screenshot%202023-09-04%20030520.png)

##### the likelihood of getting a green ball on every pick is maximum when all the balls are green, $p(G)=1$; for 5 picks it equals $(1)^5=1$
 

### Maximum Likelihood Estimation for Normal Distribution

- if we have "n" data points and we assume they follow a Normal distribution, the maximum likelihood estimate of $\mu$ is the mean of the observed data points.


- under the same assumption, the maximum likelihood estimate of $\sigma$ is the standard deviation of the observed data points (computed with "n", not "n - 1", in the denominator).
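The two statements above can be checked numerically. The following is a sketch on a made-up sample of heights. `normal_log_likelihood` is a helper written for this illustration, and `statistics.pstdev` is the population standard deviation, which divides by n.

```python
import math
import statistics

heights = [148.0, 152.5, 160.0, 139.5, 155.0, 150.0]  # made-up illustrative sample

def normal_log_likelihood(data, mean, std_dev):
    """Log-likelihood of (mean, std_dev) for independent Normal observations."""
    n = len(data)
    squared_error = sum((x - mean) ** 2 for x in data)
    return -n * math.log(std_dev * math.sqrt(2 * math.pi)) - squared_error / (2 * std_dev**2)

mu_hat = statistics.mean(heights)       # MLE of mu: the sample mean
sigma_hat = statistics.pstdev(heights)  # MLE of sigma: std dev with n in the denominator

best = normal_log_likelihood(heights, mu_hat, sigma_hat)

# Nudging either parameter away from its estimate lowers the log-likelihood.
perturbed = [
    normal_log_likelihood(heights, mu_hat + d_mu, sigma_hat + d_sigma)
    for d_mu, d_sigma in [(1, 0), (-1, 0), (0, 1), (0, -1)]
]
print(mu_hat, sigma_hat)
print(all(value < best for value in perturbed))  # True
```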

### Maximum Likelihood Estimation in Machine Learning

__NOTE :__ ___MLE is applied only to parametric ML algorithms such as linear regression, logistic regression, neural networks, etc. It does not apply to non-parametric algorithms such as decision trees and random forests.___

![Screenshot%202023-09-04%20145002.png](attachment:Screenshot%202023-09-04%20145002.png)

## Some Important Questions

### 1. Is MLE a general concept applicable to all machine learning algorithms?



Maximum Likelihood Estimation (MLE) is a general statistical concept that
can be applied to many machine learning algorithms, particularly those that
are parametric (i.e., defined by a set of parameters), but it's not applicable
to all machine learning algorithms.


MLE is commonly used in algorithms such as linear regression, logistic
regression, and neural networks, among others. These algorithms use MLE
to find the optimal values of the parameters that best fit the training data.
However, there are some machine learning algorithms that don't rely on
MLE. 


For example:


1. __Non-parametric methods :__ Some machine learning methods, such as
k-Nearest Neighbors (k-NN) and Decision Trees, are non-parametric and do not
make strong assumptions about the underlying data distribution. These methods
don't have a fixed set of parameters that can be optimized using MLE.


2. __Unsupervised learning algorithms :__ Some unsupervised learning
algorithms, like K-means clustering, use different objective functions,
not necessarily tied to a probability distribution.


3. __Reinforcement Learning :__ Reinforcement Learning methods generally
don't use MLE, as they are more focused on learning from rewards and
punishments over a sequence of actions rather than fitting to a specific
data distribution.

### 2. How is MLE related to the concept of loss functions?

In machine learning, a loss function measures how well a model's
predictions align with the actual values. The goal of training a machine
learning model is often to find the model parameters that minimize the loss
function.


Maximum Likelihood Estimation (MLE) is a method of estimating the
parameters of a statistical model to maximize the likelihood function, which
is conceptually similar to minimizing a loss function. In fact, for many
common models, minimizing the loss function is equivalent to maximizing
the likelihood function.


MLE and the concept of loss functions in machine learning are closely
related. Many common loss functions can be derived from the principle of
maximum likelihood estimation under certain assumptions about the data
or the model. By minimizing these loss functions, we're effectively
performing maximum likelihood estimation.
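One concrete instance of this link is binary cross-entropy, which is the average negative log-likelihood of a Bernoulli model. The sketch below uses made-up labels and predicted probabilities, and the function names are illustrative.

```python
import math

labels = [1, 0, 1, 1, 0]                 # made-up binary targets
predicted = [0.9, 0.2, 0.7, 0.6, 0.1]    # made-up predicted P(y = 1)

def bernoulli_log_likelihood(y_true, y_prob):
    """Sum of log-probabilities that the model assigns to the observed labels."""
    return sum(math.log(p if y == 1 else 1 - p) for y, p in zip(y_true, y_prob))

def binary_cross_entropy(y_true, y_prob):
    """Mean binary cross-entropy loss."""
    total = sum(
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for y, p in zip(y_true, y_prob)
    )
    return total / len(y_true)

n = len(labels)
bce = binary_cross_entropy(labels, predicted)
nll = -bernoulli_log_likelihood(labels, predicted) / n

print(bce)
print(nll)  # same value: the loss is the average negative log-likelihood
```

Because the two quantities are equal, parameters that minimize the cross-entropy also maximize the Bernoulli likelihood.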

### 3. Then why do loss functions exist? Why don't we maximize the likelihood?

The confusion arises from the fact that we're using two different
perspectives to look at the same problem.


In many machine learning algorithms, the aim is to minimize the difference
between the predicted and actual values, and this is typically represented
by a loss function. When we talk about minimizing the loss function, it's
essentially the same as saying we're trying to find the best model
parameters that give us the closest predictions to the actual values.




On the other hand, when we look at the problem from a statistical
perspective, we talk in terms of maximizing the likelihood of seeing the
observed data given the model parameters. This is represented by a
likelihood function.


For many models, these two perspectives are equivalent - minimizing the
loss function is the same as maximizing the likelihood function. In fact,
many common loss functions can be derived from the principle of MLE
under certain assumptions about the data.
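As one worked instance of such a derivation, assume (this assumption is not stated in these notes) that each target equals the model prediction plus independent Gaussian noise of fixed variance, $y_i = f_\theta(x_i) + \varepsilon_i$ with $\varepsilon_i \sim N(0, \sigma^2)$. Using the Normal density from earlier, the log-likelihood of the parameters $\theta$ is:

$$\log L(\theta)=\sum_{i=1}^{n}\log\left[\frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{(y_i-f_\theta(x_i))^2}{2\sigma^2}}\right]=-\frac{n}{2}\log(2\pi\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i-f_\theta(x_i)\right)^2$$

The first term does not depend on $\theta$, so maximizing $\log L(\theta)$ is the same as minimizing $\sum_{i=1}^{n}(y_i-f_\theta(x_i))^2$, i.e. the squared-error loss used in linear regression.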


So why do we often talk about minimizing the loss function instead of
maximizing the likelihood? 

There are a few reasons:

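1. __Computational reasons :__ Optimization routines are conventionally
written to minimize a function, and a raw likelihood (a product of many
small probabilities) is numerically awkward to work with. In practice the
negative log-likelihood is minimized instead, which is itself a loss
function. This is particularly relevant for complex models like neural networks.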


2. __Generalization :__ The concept of a loss function is more general and can
be applied to a wider range of problems. Not all machine learning
problems can be framed in terms of maximizing a likelihood. For
example, many non-parametric methods and unsupervised learning
algorithms don't involve likelihoods.


3. __Flexibility :__ Loss functions can be easily customized to the specific needs
of a problem. For instance, we might want to give more weight to
certain types of errors, or we might want to use a loss function that is
robust to outliers.
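The computational point in the list above can be illustrated with a small sketch. A likelihood is a product of many per-observation values, and with enough observations that product underflows to zero in floating point, while the sum of logs stays well behaved. The numbers here are made up for illustration.

```python
import math

# 200 hypothetical observations, each with likelihood 0.01 under the model
probabilities = [0.01] * 200

# Raw likelihood: the true value is 1e-400, too small for a 64-bit float
product = 1.0
for p in probabilities:
    product *= p
print(product)  # 0.0 (underflow)

# Log-likelihood: a sum of logs, which stays finite and comparable
log_likelihood = sum(math.log(p) for p in probabilities)
print(log_likelihood)
```

Minimizing the negative of this sum is the usual way a likelihood shows up as a loss function in practice.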



In summary, while the concepts of loss function minimization and
maximum likelihood estimation are closely related and often equivalent,
the concept of a loss function is more flexible and computationally
convenient, which is why it's more commonly used in the machine learning
community.

### 4. Then why study about maximum likelihood at all?

The study of Maximum Likelihood Estimation (MLE) is essential for several
reasons, despite the prevalence of loss functions in machine learning:


1. __Statistical Foundation :__ MLE provides a strong statistical foundation for
understanding machine learning models. It gives a principled way of
deriving the loss functions used in many common machine learning
algorithms, and it helps us understand why these loss functions work
and under what assumptions.


2. __Interpretability :__ The MLE framework gives us a way to interpret our
model parameters. The MLEs are the parameters that make the
observed data most likely under our model, which can be a powerful
way of understanding what our model has learned.


3. __Model Comparison :__ MLE gives us a way to compare different models
on the same dataset. This can be done using tools like the Akaike
Information Criterion (AIC) or the Bayesian Information Criterion (BIC),
which are based on the likelihood function and can help us choose the
best model for our data.


4. __Generalization to Other Methods :__ MLE is the starting point for more
general methods used in more complex statistical modelling. For example,
Expectation-Maximization computes maximum likelihood estimates when some
variables are unobserved, and Bayesian inference combines the likelihood
with a prior over the parameters.
Understanding MLE can provide a stepping stone to these more advanced topics.


5. __Deeper Understanding :__ Lastly, understanding MLE can give us a deeper
understanding of our models, leading to better intuition, better model
selection, and ultimately, better performance on our machine learning
tasks.
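The model comparison point in the list above can be sketched with the standard definitions $AIC = 2k - 2\log L$ and $BIC = k\log n - 2\log L$, where $k$ is the number of fitted parameters and $n$ the number of data points. The sample and the two candidate models below are made up for illustration, and lower values indicate the preferred model.

```python
import math
import statistics

data = [148.0, 152.5, 160.0, 139.5, 155.0, 150.0]  # made-up sample

def normal_log_likelihood(values, mean, std_dev):
    """Log-likelihood of (mean, std_dev) for independent Normal observations."""
    n = len(values)
    squared_error = sum((x - mean) ** 2 for x in values)
    return -n * math.log(std_dev * math.sqrt(2 * math.pi)) - squared_error / (2 * std_dev**2)

def aic(log_likelihood, num_params):
    return 2 * num_params - 2 * log_likelihood

def bic(log_likelihood, num_params, num_points):
    return num_params * math.log(num_points) - 2 * log_likelihood

# Model A: mean and std dev both estimated by MLE (2 parameters)
ll_a = normal_log_likelihood(data, statistics.mean(data), statistics.pstdev(data))

# Model B: mean fixed at 150, only the std dev estimated (1 parameter)
sigma_b = math.sqrt(sum((x - 150) ** 2 for x in data) / len(data))
ll_b = normal_log_likelihood(data, 150, sigma_b)

print("AIC:", aic(ll_a, 2), aic(ll_b, 1))
print("BIC:", bic(ll_a, 2, len(data)), bic(ll_b, 1, len(data)))
```

Model A can never have a lower maximized likelihood than Model B, since it contains B as a special case. The criteria penalize its extra parameter so that the comparison is not decided by flexibility alone.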


In short, while you can often get by with a practical understanding of loss
functions and optimization algorithms in applied machine learning,
understanding MLE can be extremely valuable for gaining a deeper
understanding of how and why these models work.