## Maximum Likelihood Estimation (MLE)

#### Resources

- https://www.quantstart.com/articles/Maximum-Likelihood-Estimation-for-Linear-Regression
- https://www.cscu.cornell.edu/news/statnews/stnews50.pdf
- http://stats.stackexchange.com/questions/112451/maximum-likelihood-estimation-mle-in-layman-terms

#### Overview

MLE is a method of estimating the parameters of a statistical model from a set of observations. It is important to remember that:
- a statistical model assumes that each observation (e.g. $x_{1},y_{1}$) is drawn from some type of probability distribution. Broadly speaking, a probability distribution assigns a probability (or, for continuous variables, a probability density) to each possible outcome:
    - In Logistic Regression, each response is assumed to follow a Bernoulli distribution whose probability of a positive outcome is given by the Sigmoid function
    - In Linear Regression, each response is assumed to follow a Gaussian (normal) distribution centered on the model's prediction

In Machine Learning, the likelihood plays the role of the cost function (in practice, the negative log-likelihood is usually minimized). It is the metric used to determine the quality of the model's fit to a set of observations (i.e. training data).

##### Probability Distributions: A Quick Review

Example: You work at a potato chip manufacturing company. You want to determine how much variation there is in a 500 g bag of potato chips. Over the years, you sample the weights of 1 million bags. You plot the measurements on a frequency chart (histogram) and find that they form a $\textbf{normal distribution}$ curve.
- this is plausible because a bag's weight is the sum of many small, independent sources of variation, which the $\textit{Central Limit Theorem (CLT)}$ suggests will tend toward a normal distribution. The mean (center of the curve) represents the expected value, which for a normal distribution is also the most likely potato chip bag weight (highest density)
- some bags will be a lot heavier than the mean, while others will be lighter. However, the probability of observing these weights will be much lower
- the spread of the weights around the mean is summarized by the variance $\sigma^{2}$ (the average squared deviation from the mean). Note that this is not the same as the range from the lowest to the highest observed weight
- once this distribution curve has been created (with mean $\mu$ and variance $\sigma^{2}$ unique to that distribution), one can determine the chances of manufacturing a bag of chips above or below a certain weight by calculating the $\textbf{z-score}$ statistic
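
As a minimal sketch of that last step, the snippet below converts a bag weight into a z-score and then into a probability using the standard normal CDF. The mean of 500 g and the standard deviation of 5 g are illustrative assumptions rather than measured values, and `z_score` and `normal_cdf` are helper names introduced only for this example.

```python
import math

# Assumed (hypothetical) parameters of the bag-weight distribution.
mu = 500.0     # mean weight in grams
sigma = 5.0    # standard deviation in grams

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

def normal_cdf(z):
    """P(Z <= z) for a standard normal variable, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

z = z_score(510.0, mu, sigma)      # 510 g is 2 standard deviations above the mean
p_heavier = 1.0 - normal_cdf(z)    # chance of a bag heavier than 510 g
print(z, p_heavier)
```

Under these assumed parameters, only a little over 2% of bags would be expected to weigh more than 510 g.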

### MLE in Detail

As mentioned above, the likelihood serves as the cost function in a Machine Learning problem. Therefore, the goal of MLE is to find the parameters of a particular statistical model that best fit, i.e. are most likely to have produced, the observed data.

#### Analogy

You have been summoned to be part of a jury. You are presented with the following information:
- the charges that describe the purpose of the trial
- the truth of what happened according to the prosecution
- the truth of what happened according to the defence
- the evidence

Goal: you, the juror, within the context of the specified charges and given the evidence, must select the version of the truth that most likely occurred

In MLE:
- the framework of the trial is the statistical model selected by the user
- the prosecution is one set of parameter values used to fit the statistical model
- the defence is another set of parameter values used to fit the statistical model
- the evidence is the set of observations, i.e. the training data

Goal: In reality there are many more 'versions of the truth' than just two (prosecution and defence). However, the goal is the same: find the parameter values that best fit the statistical model to the training data
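
The analogy can be sketched in code. The example below assumes a Gaussian model and compares two hand-picked parameter sets against the same evidence. The data values, the two candidate parameter sets, and the helper names `gaussian_pdf` and `likelihood` are all made up for illustration. The product over observations relies on the independence assumption discussed in the next section.

```python
import math

# Evidence: a small, made-up sample of observed bag weights (grams).
evidence = [499.2, 501.5, 500.8, 498.9, 500.3]

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std sigma at x."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-0.5 * ((x - mu) / sigma) ** 2)

def likelihood(data, mu, sigma):
    """Product of the densities of independent observations."""
    result = 1.0
    for x in data:
        result *= gaussian_pdf(x, mu, sigma)
    return result

# Two competing "versions of the truth" (hypothetical parameter values).
prosecution = {"mu": 500.0, "sigma": 1.0}
defence = {"mu": 503.0, "sigma": 1.0}

l_prosecution = likelihood(evidence, **prosecution)
l_defence = likelihood(evidence, **defence)

# The juror sides with whichever version makes the evidence more likely.
verdict = "prosecution" if l_prosecution > l_defence else "defence"
print(l_prosecution, l_defence, verdict)
```

MLE generalizes this comparison from two candidates to every possible parameter value.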

#### How does MLE do this ?

- given a Supervised Learning training set, every value of x (i.e. the feature, predictor, independent variable) will have some observed output y (i.e. response, dependent variable). Your task is to model the relationship between x and y, i.e. estimate output y given x and parameters $\theta$
- the likelihood measures how probable the observed response, y, is under our current model and current model parameters
- to calculate the $\textit{Likelihood}$, we assume that each and every observation is $\textbf{independent}$. This means that the likelihoods of seeing y for every x, parameterized by $\theta$, are $\textbf{multiplied}$. Intuitively, this is no different than calculating the probability of rolling a 4 twice in a row using a die (1/6 * 1/6 = 1/36)

<blockquote>
<strong>Note: Probability vs Likelihood</strong>
<p>Probability is used before data are available to describe possible future outcomes given a fixed value for the parameter (or parameter vector). Likelihood is used after data are available to describe a function of a parameter (or parameter vector) for a given outcome.</p>
</blockquote>
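
To make the distinction in the note concrete, here is a small sketch using coin flips, which is an assumed example and not part of the original discussion. The same formula is evaluated in two ways: once with the parameter fixed and the outcome varying (probability), and once with the outcome fixed and the parameter varying (likelihood). `binomial_pmf` is a helper name chosen for this example.

```python
import math

def binomial_pmf(k, n, p):
    """P(k heads in n flips) for a coin with heads probability p."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)

n = 10

# Probability: the parameter is fixed (p = 0.5), the outcome varies.
probabilities = [binomial_pmf(k, n, 0.5) for k in range(n + 1)]

# Likelihood: the outcome is fixed (7 heads observed), the parameter varies.
candidates = [i / 10 for i in range(1, 10)]   # p = 0.1, 0.2, ..., 0.9
likelihoods = [binomial_pmf(7, n, p) for p in candidates]
best_p = candidates[likelihoods.index(max(likelihoods))]

print(sum(probabilities))   # probabilities over all outcomes sum to 1
print(best_p)               # candidate p under which 7 heads is most likely
```

Note that the likelihood values, unlike the probabilities, are not required to sum to 1 across the candidate parameters.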

- Therefore, we can represent the Likelihood function for a particular model with a continuous probability distribution as:

$$\mathcal{L}(\theta|x)=f_{\theta}(x)$$

- The MLE selects the set of values of the model parameters that maximizes the likelihood function. For a given set of model parameters, the likelihood of the whole dataset is calculated by taking the product of the likelihood function evaluated at each observation, $x_{i}$:

$$\mathcal{L}(\theta;x_{1},...,x_{n})=f(x_{1},...,x_{n}|\theta)=\prod_{i=1}^{n} f(x_{i}|\theta)$$

- In practice, it is more convenient to work with the natural logarithm of the likelihood function. This becomes clear during optimization. Taking the derivative of a product of many terms requires repeated use of the product rule, which becomes exceedingly complicated as the number of observations increases (it does not scale). Remember, we need the partial derivatives of the cost function (its gradient) in order to implement the gradient ascent/descent algorithm. Because the logarithm is monotonically increasing, the log-likelihood is maximized by the same parameter values as the likelihood, and taking the log turns the product into a sum that is much easier to differentiate:

$$\ln\mathcal{L}(\theta;x_{1},...,x_{n})=\sum_{i=1}^{n} \ln f(x_{i}|\theta)$$
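
Besides simplifying the calculus, the log form also tends to behave better numerically. The sketch below uses randomly generated per-observation likelihood values, which are an assumption made purely for illustration. It shows that the sum of logs matches the log of the product on a small sample, and that the raw product of many small values underflows in double precision while the log-likelihood stays finite.

```python
import numpy as np

# Hypothetical per-observation likelihoods f(x_i | theta), each well below 1.
rng = np.random.default_rng(0)
f = rng.uniform(0.01, 0.2, size=1000)

# On a small sample the two forms agree: log(prod) == sum(log).
small = f[:5]
log_of_product = np.log(np.prod(small))
sum_of_logs = np.sum(np.log(small))

# On the full sample the raw product is too small to represent.
likelihood = np.prod(f)
log_likelihood = np.sum(np.log(f))

print(log_of_product, sum_of_logs)
print(likelihood)       # underflows to 0.0
print(log_likelihood)   # a finite negative number
```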


```python
import numpy as np

# Calculate the weighted linear relationship (the score):
# h(x) = weight_1 * x_1 + weight_2 * x_2
weight = np.array([1, -1.5])
x_1 = np.array([2, 0, 3, 4])
x_2 = np.array([1, 2, 3, 1])

features = np.column_stack([x_1, x_2])
score = np.dot(features, weight)

# Actual observations. Positive outcomes are separated from negative ones
y = np.array([1, 0, 0, 1])
pos_index = np.flatnonzero(y == 1)  # rows 0 and 3
neg_index = np.flatnonzero(y == 0)  # rows 1 and 2

# Probability of a positive output under the Logistic Model
prob_pos = 1 / (1 + np.exp(-score))

# Probability the model assigns to what was actually observed:
# P(y=1) for positive rows, and the complement 1 - P(y=1) for negative rows
pos_likelihood = prob_pos[pos_index]
neg_likelihood = 1 - prob_pos[neg_index]

# Likelihood: the product of the probabilities of the independent observations
likelihood = np.prod(pos_likelihood) * np.prod(neg_likelihood)

print(f"{likelihood:.3f}")
```

```
0.448
```


## Using MLE in Linear Regression

In Linear Regression, the most common way to fit our model to the observed data is to minimize the Residual Sum of Squares (RSS). This way of estimating parameters is very intuitive; however, it's not the only way. We can also approach the same problem from a statistical framework, using MLE to accomplish the same thing.

The linear model:
$$y=\theta x + \epsilon, \qquad h_{\theta}(x)=\theta x$$

where $y$ is the observed response, $h_{\theta}(x)$ is the estimated value, $\theta$ is the parameter of the model, $x$ is the feature, and $\epsilon$ is the error. Note that we did not include an intercept. This simplifies the notation, and the same reasoning applies when an intercept is included.

We can assume that the error is normally distributed with a mean centered at 0. This means that the probability distribution of the error term follows the Gaussian function. When the mean is $\mu=0$, there is an equal chance that the estimated value $h_{\theta}(x)$ will be larger or smaller than the real observation $y$. Since the error is the difference between the observation and the estimate, this implies that $y$, given $x$, is also normally distributed, with mean $\theta x$ and the same variance as the error:

$$\epsilon=y - h_{\theta}(x)=y - \theta x$$
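
As a sketch of where this assumption leads, the derivation below writes $y_{i}$ for the observed response at $x_{i}$ and treats the error variance $\sigma^{2}$ as constant. This notation is introduced here for the derivation. Each observation has the Gaussian density

$$f(y_{i}|x_{i};\theta,\sigma)=\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left(-\frac{(y_{i}-\theta x_{i})^{2}}{2\sigma^{2}}\right)$$

and, for independent observations, the log-likelihood is the sum of the log-densities:

$$\ln\mathcal{L}(\theta,\sigma)=-\frac{n}{2}\ln(2\pi\sigma^{2})-\frac{1}{2\sigma^{2}}\sum_{i=1}^{n}(y_{i}-\theta x_{i})^{2}$$

The first term does not depend on $\theta$, so maximizing the log-likelihood over $\theta$ is the same as minimizing the sum of squared residuals. Setting the derivative with respect to $\theta$ to zero gives, for this no-intercept model:

$$\hat{\theta}=\frac{\sum_{i=1}^{n}x_{i}y_{i}}{\sum_{i=1}^{n}x_{i}^{2}}$$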

What this means:
- the linear regression model outputs a continuous variable (e.g. house price)
- to use MLE, we need to describe that output with a probability distribution
- at each value x in the dataset, the point that lies on the linear model is considered the mean of a distribution with constant variance. In other words, each point on the line is the center of a Gaussian distribution, and how far an observation deviates from that mean determines its probability density under that distribution
- our hypothesis function is no longer just the equation for a line as we saw in the standard Linear Regression problem using RSS, but the density curve of the normal distribution, with parameters $\mu$ and $\sigma$
- the intuition behind this is that for every single $x$ and constant parameter $\sigma$, the Gaussian model will have an expected value, $\mu$, at $h(x)$. What is the likelihood of seeing the observed $y$?
- As before, we repeat this for every single $x$ in our dataset using the same parameters, multiply the densities, and come up with a score (the likelihood)
- this process is repeated using the gradient ascent algorithm (incremental steps in parameter values) until the likelihood is maximized
- Note that the resulting parameter values match those obtained by minimizing RSS (up to numerical tolerance), because maximizing the Gaussian log-likelihood with constant variance is equivalent to minimizing the sum of squared residuals
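
The procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the data are synthetic, the true slope of 2.5, the noise level, the learning rate, and the iteration count are all hand-picked, and $\sigma$ is treated as known. The function names `log_likelihood` and `gradient` are introduced only for this example.

```python
import numpy as np

# Synthetic data for illustration: y = 2.5 * x + Gaussian noise (values assumed).
rng = np.random.default_rng(42)
x = rng.uniform(0.0, 10.0, size=200)
y = 2.5 * x + rng.normal(0.0, 1.0, size=200)

sigma = 1.0  # noise standard deviation, treated as known and constant

def log_likelihood(theta):
    """Gaussian log-likelihood of the data for the model y = theta * x."""
    residuals = y - theta * x
    n = len(x)
    return (-0.5 * n * np.log(2 * np.pi * sigma**2)
            - np.sum(residuals**2) / (2 * sigma**2))

def gradient(theta):
    """Derivative of the log-likelihood with respect to theta."""
    return np.sum((y - theta * x) * x) / sigma**2

theta = 0.0
learning_rate = 1e-5   # small step size chosen by hand for this data
for _ in range(2000):
    theta += learning_rate * gradient(theta)   # ascent: step along the gradient

# Closed-form least-squares (RSS-minimizing) estimate for a no-intercept line.
theta_rss = np.sum(x * y) / np.sum(x * x)

print(theta, theta_rss, log_likelihood(theta))
```

With these settings the gradient ascent estimate and the least-squares estimate should agree to many decimal places, which illustrates the equivalence noted in the last bullet.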


## Using MLE in Logistic Regression