# Likelihood in Statistics

## What is Likelihood?

Likelihood measures how probable a particular set of observations is under specific parameter values of a statistical model, viewed as a function of those parameters. Probability describes possible outcomes when the parameters are known; likelihood works in the opposite direction: it evaluates the plausibility of parameter values given data that have already been observed.

## Importance of Likelihood

The concept of likelihood is central to many statistical methods, including maximum likelihood estimation (MLE), where it is used to estimate the parameters of a statistical model. By maximizing the likelihood function, statisticians find the parameter values under which the observed data are most probable, thereby fitting the model to the data.

## Likelihood vs. Probability

- **Probability** describes how likely different data outcomes are when the parameters are known.
- **Likelihood**, on the other hand, assesses the plausibility of parameter values given the data already observed.

## Example: Tossing a Coin

Consider the scenario of tossing a coin three times, resulting in two heads and one tail. The likelihood function in this case quantifies how probable this specific outcome is for different biases of the coin (parameter values), such as a fair coin (50% heads) or a biased coin (e.g., 60% heads).

## Key Points

- Likelihood provides a way to assess different parameter values for a statistical model based on observed data.
- It is a fundamental concept in statistical inference and is crucial for parameter estimation and model selection.
- Understanding likelihood allows statisticians and data scientists to make informed decisions about the models they use to describe real-world phenomena.

Remember, while likelihood and probability are related concepts, they are used differently within the context of statistical analysis. Likelihood focuses on evaluating parameter values based on observed data, making it an essential tool for statistical modeling and inference.


# Understanding Likelihood with a Coin Toss Example

In statistics, the concept of likelihood helps us understand how plausible different parameter values are, given observed data. Let's explore this through a simple example involving tossing a coin.

## Scenario

You toss a coin three times, observing the sequence: Heads, Heads, Tails (HHT).

## Parameter (\(\theta\))

- \(\theta\) represents the probability of the coin landing on heads.
- For a fair coin, \(\theta = 0.5\). However, the coin could be biased, meaning \(\theta\) could vary between 0 and 1.

## Changing Parameters

To understand how the likelihood changes with \(\theta\), consider:

- **Fair Coin (\(\theta = 0.5\))**: The likelihood of observing HHT is \(0.5^2 \times 0.5 = 0.125\).
- **Biased Coin (\(\theta = 0.6\))**: The likelihood increases to \(0.6^2 \times 0.4 = 0.144\).
- **More Biased Coin (\(\theta = 0.7\))**: The likelihood is \(0.7^2 \times 0.3 = 0.147\).
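The three values above can be reproduced with a few lines of Python. This is a minimal sketch; the function name `likelihood_hht` is made up for illustration and is not part of any library.

```python
def likelihood_hht(theta: float) -> float:
    """Likelihood of the sequence Heads, Heads, Tails when P(heads) = theta."""
    # Two heads contribute theta each; the single tail contributes (1 - theta).
    return theta**2 * (1 - theta)


for theta in (0.5, 0.6, 0.7):
    print(f"theta = {theta}: likelihood = {likelihood_hht(theta):.3f}")
```

The data (HHT) stay fixed in every call; only the parameter changes, which is exactly what distinguishes a likelihood from a probability distribution over outcomes.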

## Observations

- The likelihood of observing HHT changes as we adjust \(\theta\), reflecting different biases of the coin.
- By comparing likelihoods across different \(\theta\) values, we identify which bias (\(\theta\) value) makes the observed outcome most plausible.
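- This example illustrates that a bias towards heads (\(\theta > 0.5\)) is more consistent with observing HHT than a fair coin. The likelihood \(\theta^2(1-\theta)\) peaks at \(\theta = 2/3\), the observed proportion of heads.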

## Conclusion

The essence of the likelihood concept in statistical inference is to evaluate how plausible different parameter values (\(\theta\)) are, given the data observed. In our coin toss example, varying \(\theta\) allows us to hypothesize about the coin's bias and determine which hypothesis best explains the observed data (HHT). The \(\theta\) that maximizes the likelihood function is the maximum likelihood estimate of the coin's true bias based on the data.
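One simple way to locate the maximizing \(\theta\) is to evaluate the likelihood on a fine grid and pick the largest value. The sketch below assumes a grid of 1001 evenly spaced points, which is an arbitrary choice made for illustration.

```python
import numpy as np

# Candidate values of theta between 0 and 1 (grid resolution is an assumption).
thetas = np.linspace(0.0, 1.0, 1001)

# Likelihood of HHT at each candidate value: theta^2 * (1 - theta).
likelihoods = thetas**2 * (1 - thetas)

# The grid point with the highest likelihood approximates the MLE.
theta_hat = thetas[np.argmax(likelihoods)]
print(f"Grid estimate of theta: {theta_hat:.3f}")
print(f"Likelihood at that point: {likelihoods.max():.4f}")
```

The grid estimate lands next to \(2/3\), the observed proportion of heads, which is also what solving the maximization analytically gives.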


Suppose we throw a coin twice, and that the probability of heads on a particular throw is confined to one of six discrete values: θ ∈ {0, 0.2, 0.4, 0.6, 0.8, 1.0}. Using this information we compute the probabilities of each possible outcome, which are displayed in Table 4.1.

In tabular form, we can see the effect of varying the data (moving along each row) and contrast it with the effect of varying θ (moving down each column).

If we hold the parameter fixed, whatever the value of θ, and vary the data by moving along each row, the values sum to 1, meaning that each row is a valid probability distribution. By contrast, when we hold the number of heads fixed and vary the parameter θ by moving down each column, the values do not sum to 1. When θ varies we do not have a valid probability distribution, which merits the use of the term likelihood.

In Bayesian inference, we always vary the parameter and hold the data fixed (we only obtain one sample). Thus, from a Bayesian perspective, we use the term likelihood to remind us that p(data|θ), viewed as a function of θ, is not a probability distribution.

Table 4.1  The probabilities/likelihoods for two coin flips, where the probability of heads (θ) is confined to the discrete values: {0.0, 0.2, 0.4, 0.6, 0.8, 1.0}. X is the number of heads we obtain in two throws of the coin.

| Probability of coin landing heads up, θ | X = 0 | X = 1 | X = 2 |
|-----------------------------------------|-------|-------|-------|
| 0.0                                     | 1.00  | 0.00  | 0.00  |
| 0.2                                     | 0.64  | 0.32  | 0.04  |
| 0.4                                     | 0.36  | 0.48  | 0.16  |
| 0.6                                     | 0.16  | 0.48  | 0.36  |
| 0.8                                     | 0.04  | 0.32  | 0.64  |
| 1.0                                     | 0.00  | 0.00  | 1.00  |
| Total (over θ)                          | 2.20  | 1.60  | 2.20  |
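The entries of Table 4.1 can be recomputed from the binomial formula, which also makes the row and column sums easy to check. This is a sketch only; the variable names are chosen here for illustration.

```python
from math import comb

import numpy as np

thetas = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]  # allowed values of P(heads)
n_flips = 2

# table[i, x] = P(X = x | theta_i) for x = 0, 1, 2 heads in two flips.
table = np.array([
    [comb(n_flips, x) * theta**x * (1 - theta) ** (n_flips - x)
     for x in range(n_flips + 1)]
    for theta in thetas
])

row_sums = table.sum(axis=1)  # theta fixed, data varies: a probability distribution
col_sums = table.sum(axis=0)  # data fixed, theta varies: a likelihood

print("Row sums:   ", np.round(row_sums, 2))
print("Column sums:", np.round(col_sums, 2))
```

Every row sums to 1, while the columns sum to 2.20, 1.60 and 2.20, matching the Total row of the table.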

# Likelihood Functions vs. Probability Distributions

## Probability Distributions
- **Definition**: Mathematical functions that give the probabilities of different outcomes.
- **Requirement**: Probabilities must sum to 1 across all possible outcomes.

## Likelihood Functions
- **Definition**: Functions of parameters given observed data, indicating how well the parameters explain the data.
- **Key Property**: The values of a likelihood function need not sum (or integrate) to 1 over parameter values.

## Why Don't Likelihoods Sum to 1?
- **Different Purpose**: Likelihoods are not probabilities but measures of plausibility for parameters given data.
- **Parameter Estimation**: Likelihoods are used to estimate parameters, not to describe the complete range of a random variable's outcomes.
- **Model Fitting**: They help in fitting statistical models to observed data, unlike probability distributions which predict data given known parameters.

## Bayesian Inference and Likelihood
- **Role in Bayesian Inference**: Likelihoods are combined with prior distributions to form posterior distributions.
- **Normalization**: The posterior distribution is a probability distribution over the parameters. Dividing the product of likelihood and prior by its total makes it sum (or integrate) to 1.
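To make the normalization step concrete, the sketch below takes the likelihoods for X = 1 head in two flips from Table 4.1 and combines them with a prior. The uniform prior over the six θ values is an assumption made purely for illustration.

```python
import numpy as np

thetas = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])

# Likelihood of observing exactly one head in two flips: 2 * theta * (1 - theta).
likelihood = 2 * thetas * (1 - thetas)

# Assumed prior: every allowed value of theta is equally plausible beforehand.
prior = np.full(len(thetas), 1 / len(thetas))

# Posterior is proportional to likelihood times prior; dividing by the total normalizes it.
unnormalized = likelihood * prior
posterior = unnormalized / unnormalized.sum()

print("Sum of likelihoods:", round(likelihood.sum(), 2))
print("Posterior:", np.round(posterior, 2))
print("Sum of posterior:", round(posterior.sum(), 2))
```

The likelihoods sum to 1.60, yet the posterior sums to 1 after normalization, with values 0, 0.2, 0.3, 0.3, 0.2 and 0 across the six θ values.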

## Conclusion
- **No Problem**: It is not a problem that likelihoods do not sum to 1; it is by design and serves a different purpose from probability distributions.


# Exchangeability and Random Sampling

## Exchangeability

- **Definition**: 
  - A sequence of random variables is *exchangeable* if their joint probability distribution is invariant to permutations. The order of data does not affect their joint probability.

- **Implications**:
  - Exchangeability suggests that each data point is equally informative about the underlying distribution and there is no inherent ordering to the data.

- **In Bayesian Statistics**:
  - Exchangeability allows for modeling without specifying the data's sequence, often implying a shared underlying parameter influencing all data points.
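As a small illustration of invariance to permutations, the sketch below computes the joint probability of every ordering of two heads and one tail. It assumes independent flips with a shared probability of heads (0.6 here, an arbitrary value); the helper name `sequence_probability` is hypothetical.

```python
from itertools import permutations


def sequence_probability(sequence, theta):
    """Joint probability of a flip sequence when each flip is heads with probability theta."""
    prob = 1.0
    for flip in sequence:
        prob *= theta if flip == "H" else 1 - theta
    return prob


theta = 0.6  # assumed probability of heads, for illustration only

# All distinct orderings of two heads and one tail: HHT, HTH, THH.
orderings = sorted(set(permutations("HHT")))
probabilities = {"".join(o): sequence_probability(o, theta) for o in orderings}

for ordering, prob in probabilities.items():
    print(f"{ordering}: {prob:.3f}")
```

All three orderings receive the same joint probability, so only the number of heads matters, not where in the sequence they occur.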

## Random Sampling

- **Definition**: 
  - *Random sampling* (simple random sampling) is a technique where every subset of a given size from a population has an equal chance of selection, which guards against systematic selection bias and makes the sample representative on average.

- **Requirement**:
  - Assumes each data point is independent and identically distributed (i.i.d.), which is essential for the representativeness of the sample.

## Connection Between Exchangeability and Random Sampling

- **From Random to Exchangeable**: 
  - An i.i.d. random sample is exchangeable: its joint probability is a product of identical terms, so reordering the data leaves it unchanged.

- **From Exchangeable to Random**:
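  - Exchangeability is a weaker condition than i.i.d., because exchangeable observations can be dependent. By de Finetti's theorem, however, an infinite exchangeable sequence behaves as if it were i.i.d. conditional on an underlying parameter, which is why an exchangeable sample can often be treated like a random sample.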

## Practical Implication

- **Exchangeability as a Safety Net**:
  - In real-world scenarios where true random sampling is challenging, assuming exchangeability provides a basis for robust statistical inference.

In essence, exchangeability allows us to treat a sample as if it were randomly drawn, even when perfect randomness in the sampling process is unattainable. This assumption is fundamental to many statistical techniques, particularly in Bayesian inference, where it justifies the use of likelihoods in updating beliefs based on observed data.


# Maximum Likelihood Estimation (MLE)

## Overview
- **Purpose**: MLE is used to find the parameter values that make the observed data most probable under a specified statistical model.

## The Likelihood Function
- **Definition**: 
  - A likelihood function `L(θ | data)` is a function of the parameters `θ` that measures the probability of the observed data under those parameters.
  - Unlike a probability distribution, it is not normalized; it's a relative measure.

## Estimating Parameters
- **Process**:
  - Choose a model with parameters `θ` that could have generated the data.
  - Define the likelihood function for this model given the observed data.
  - Find the parameter values `θ` that maximize this function.

## Maximization
- **Techniques**:
  - Often involves taking the derivative of the likelihood function with respect to `θ`, setting it to zero, and solving for `θ`.
  - In practice, it's common to work with the natural logarithm of the likelihood function, known as the log-likelihood, since it simplifies the calculus and the maximization problem without affecting the parameter estimates.
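As a worked instance of the derivative approach, consider a coin-flip model. The symbols \(n\) (number of independent flips) and \(k\) (number of heads) are introduced here for illustration. The log-likelihood is

\[
\ell(\theta) = \log L(\theta \mid \text{data}) = k \log \theta + (n - k) \log (1 - \theta).
\]

Setting its derivative to zero,

\[
\frac{d\ell}{d\theta} = \frac{k}{\theta} - \frac{n - k}{1 - \theta} = 0,
\]

and solving for \(\theta\) gives

\[
\hat{\theta} = \frac{k}{n}.
\]

The estimate is simply the observed proportion of heads. Because the logarithm is strictly increasing, maximizing \(\ell(\theta)\) and maximizing \(L(\theta \mid \text{data})\) yield the same \(\hat{\theta}\).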

## Properties of MLE
- **Consistency**: 
  - As the sample size increases, the MLE tends to converge to the true parameter value.
- **Efficiency**: 
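  - Under regularity conditions, the MLE is asymptotically efficient: as the sample size grows, its variance approaches the Cramér-Rao lower bound, the smallest variance an unbiased estimator can achieve.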
- **Asymptotic Normality**:
  - As the sample size grows, the distribution of MLE tends to approach a normal distribution.
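Consistency can be seen informally in a simulation. The sketch below assumes a true probability of heads of 0.6 and a fixed random seed, both chosen arbitrarily, and uses the sample proportion of heads as the MLE for a coin-flip model.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable
true_theta = 0.6                     # assumed true probability of heads

estimates = {}
for n in (10, 100, 1_000, 10_000, 100_000):
    flips = rng.binomial(1, true_theta, size=n)  # 1 = heads, 0 = tails
    estimates[n] = flips.mean()                  # MLE: observed proportion of heads
    print(f"n = {n:>6}: theta_hat = {estimates[n]:.4f}")
```

With small samples the estimate can sit noticeably away from 0.6; as \(n\) grows it settles close to the true value, and the shrinking spread is what asymptotic normality describes.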

## Practical Considerations
- **Complex Models**:
  - For models that result in complex likelihood functions, numerical methods may be required to find the MLE.
- **Limitations**:
  - MLE assumes the model is correct and can be biased for small sample sizes.
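When no closed-form solution is available, an iterative method such as Newton-Raphson is one option. The sketch below applies it to a coin-flip log-likelihood with hypothetical data of 6 heads in 10 flips; this model does have a closed-form answer (0.6), which makes the result easy to check. The starting value and stopping tolerance are arbitrary choices.

```python
heads, n = 6, 10  # hypothetical data: 6 heads in 10 flips


def score(p: float) -> float:
    """First derivative of the log-likelihood with respect to p."""
    return heads / p - (n - heads) / (1 - p)


def curvature(p: float) -> float:
    """Second derivative of the log-likelihood with respect to p."""
    return -heads / p**2 - (n - heads) / (1 - p) ** 2


p = 0.3  # assumed starting value
for iteration in range(50):
    step = score(p) / curvature(p)
    p -= step                          # Newton-Raphson update
    p = min(max(p, 1e-9), 1 - 1e-9)    # keep p strictly inside (0, 1)
    if abs(step) < 1e-12:
        break

print(f"Numerical MLE: {p:.6f}")
```

General-purpose optimizers follow the same idea but add safeguards for likelihoods that are less well behaved than this one.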

MLE is a foundational tool in statistics for parameter estimation, widely used for its strong theoretical properties and practical effectiveness.


```python
import numpy as np
from statsmodels.base.model import GenericLikelihoodModel

# Sample data: 1 represents heads, 0 represents tails
data = np.array([1, 0, 1, 1, 1, 0, 0, 1, 0, 1])

# Define a class for the binomial model
class BinomialModel(GenericLikelihoodModel):
    def loglike(self, params):
        p = params[0]
        # Calculate the log-likelihood for binomial distribution
        log_likelihood = np.sum(self.endog * np.log(p) + (1 - self.endog) * np.log(1 - p))
        return log_likelihood

# Instantiate the model with data
model = BinomialModel(data)

# Fit the model by maximizing the log-likelihood
results = model.fit(start_params=np.array([0.5]), method='nm', disp=0)

# Estimated parameter (probability of heads)
p_hat = results.params[0]

print(f"Estimated probability of heads (p): {p_hat}")
print(f"Confidence Interval: {results.conf_int()}")

# Evaluate the goodness of fit
log_likelihood = model.loglike(results.params)
print(f"Log-Likelihood: {log_likelihood}")
```


Output:

```text
Estimated probability of heads (p): 0.6000000000000003
Confidence Interval: [[0.29636369 0.90363631]]
Log-Likelihood: -6.730116670092565
```




# MLE Interpretation of Binomial Distribution Parameter Estimation

Given a dataset from coin flips, where `1` represents heads and `0` represents tails, we use Maximum Likelihood Estimation (MLE) to estimate the probability of flipping heads (`p`).

## Results from Python MLE:
- The `statsmodels` library allows us to fit our model to the data, maximizing the log-likelihood to find our parameter estimate.

### Estimated Probability of Heads (`p_hat`):
- The value of `p_hat` is the MLE for the probability of flipping heads.
- This is the value that, given our model, makes the observed data most likely.

### Confidence Interval:
- The confidence interval provides a range of plausible values for `p` based on the observed data.
- A typical confidence interval used is 95%, which informs us that, if we were to repeat our experiment many times, 95% of the confidence intervals calculated from those experiments would contain the true `p`.
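The interval printed in the output above agrees, to several decimals, with a normal-approximation (Wald) interval around the estimate. The sketch below recomputes it by hand for 6 heads in 10 flips. That `statsmodels` arrives at its interval in exactly this way is an assumption here, not something the output confirms.

```python
from math import sqrt
from statistics import NormalDist

heads, n = 6, 10          # counts taken from the sample data above
p_hat = heads / n         # maximum likelihood estimate

# Standard error of a proportion under the normal approximation.
standard_error = sqrt(p_hat * (1 - p_hat) / n)

# Two-sided 95% critical value of the standard normal distribution.
z = NormalDist().inv_cdf(0.975)

lower = p_hat - z * standard_error
upper = p_hat + z * standard_error
print(f"95% Wald interval: [{lower:.4f}, {upper:.4f}]")
```

With only ten flips the interval is wide, roughly 0.30 to 0.90, which reflects how little a small sample pins down `p`.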

### Log-Likelihood:
- The log-likelihood is the natural logarithm of the probability of observing the given data under the estimated model parameters.
- For the same data, a higher log-likelihood means a better fit. Log-likelihoods computed on different datasets are not directly comparable.
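To see what "higher means better fit" looks like in numbers, the sketch below evaluates the log-likelihood of the ten flips from the example at the estimate 0.6 and at 0.5. The function name `log_likelihood_at` is made up for illustration.

```python
import numpy as np

# Same sample data as in the example: 1 = heads, 0 = tails.
data = np.array([1, 0, 1, 1, 1, 0, 0, 1, 0, 1])


def log_likelihood_at(p: float) -> float:
    """Log-likelihood of the observed flips when P(heads) = p."""
    return float(np.sum(data * np.log(p) + (1 - data) * np.log(1 - p)))


ll_mle = log_likelihood_at(0.6)   # at the maximum likelihood estimate
ll_fair = log_likelihood_at(0.5)  # at a fair coin, for comparison

print(f"Log-likelihood at p = 0.6: {ll_mle:.4f}")
print(f"Log-likelihood at p = 0.5: {ll_fair:.4f}")
```

Both values are negative because they are logarithms of probabilities below 1; the value at 0.6 is the larger (less negative) of the two, as expected at the maximum.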

## Interpretation:
- The MLE gives us a point estimate for the parameter `p`. It tells us the most likely proportion of heads in a coin flip based on our data.
- The confidence interval around `p_hat` gives us an idea of the precision of our estimate.
- We assume the coin flips are independent events, so the likelihood of a sequence of coin flips is the product of the probabilities of the individual flips.

## Considerations:
- This example assumes a binomial model is appropriate for the data.
- It is essential to check model assumptions in practice, which can include the independence of events and the appropriateness of the binomial distribution for the data.

This simplified example illustrates how MLE is used to estimate parameters that explain our data under a given model. The `statsmodels` library facilitates this process by providing tools to maximize the likelihood function and compute confidence intervals.
