# CPS600 - Python Programming for Finance 
<img src="https://www.syracuse.edu/wp-content/themes/g6-carbon/img/syracuse-university-seal.svg?ver=6.3.9" style="width: 200px;"/>

# Distributions, Likelihood & Bayesian Analysis

###  October 16, 2018

Much of today's content is drawn from the [SESYNC Workshop](https://www.sesync.org/opportunities/research-short-courses/bayesian-modeling-for-socio-environmental-data) on Bayesian modeling.

Last time we had a whirlwind of probability and statistics ideas. This time, we'll slow down and look at some interesting applications of Bayes' theorem.

**Lab Review**

After we see how to do maximum likelihood estimation, you will be *theoretically* equipped to do maximum likelihood estimation on the Weibull distribution.

In [1]:
import pandas as pd
# Loading the data
df = pd.read_csv('distroCSV.csv')
distro1 = df['distro1']
distro2 = df['distro2']
distro3 = df['distro3']
distro4 = df['distro4']

In [3]:
# Thanks to Fangzhi for the 'solutions' below.
# (Some edits are necessary - demonstration purposes only)
import math
import numpy as np
from scipy.optimize import fsolve

d1_std, d2_var, d3_std, d4_var = (np.std(np.array(distro1)),
                               np.var(np.array(distro2)),
                               np.std(np.array(distro3)),
                               np.var(np.array(distro4)))
                               
d1_avg, d2_avg, d3_avg, d4_avg = (np.mean(np.array(distro1)),
                                  np.mean(np.array(distro2)),
                                  np.mean(np.array(distro3)),
                                  np.mean(np.array(distro4)))

print("parameters of unknown1, normal distribution:","sigma:",d1_std,"mu:",d1_avg)

# For the log-normal
def logNormal(x):
    sigma,mu = x
    return [math.exp(mu+sigma**2/2)-d4_avg, 
            (math.exp(sigma**2)-1)*math.exp(2*mu+sigma**2)-d4_var]
result = fsolve(logNormal,[1,1])
print("parameters of unknown4, log normal distribution:","sigma:",
      result[0],"mu:",
      result[1])


# unknown2: gamma distribution, where E(x) = k * theta and variance = k * theta^2
theta = d2_var/d2_avg
k = d2_avg/theta
print("parameters of unknown2, gamma distribution:","k:",k,"theta:",theta)

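# unknown3: Weibull distribution
def w(x):
    lam,k = x
    # Caution: the second expression is the Weibull *variance*, so its target
    # should be d3_std**2; the output below comes from the code as written.
    return [lam * math.gamma(1+1/k)-d3_avg,
            (lam**2)*(math.gamma(1+2/k)-(math.gamma(1+1/k))**2)-d3_std]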
result = fsolve(w, [1,1])
print('parameters of unknown3, weibull distribution: lam:',result[0],'k:',result[1])

parameters of unknown1, normal distribution: sigma: 1.0264711046684125 mu: 0.9951602680401787
parameters of unknown4, log normal distribution: sigma: 0.6166121027028655 mu: -0.48964165969425616
parameters of unknown2, gamma distribution: k: 0.06820157375523678 theta: 368.71742057622646
parameters of unknown3, weibull distribution: lam: 0.3484598061921611 k: 0.6263839326870151


Good documentation for `fsolve` is available [here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.fsolve.html).
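For the log-normal case, the two moment equations can also be inverted by hand, which gives a quick sanity check on whatever `fsolve` returns. The sketch below assumes made-up values of $\mu$ and $\sigma$ (not the lab data), and the helper name `lognormal_from_moments` is hypothetical.

```python
import math

def lognormal_from_moments(mean, var):
    """Invert the log-normal mean/variance formulas in closed form."""
    sigma2 = math.log(1.0 + var / mean**2)
    mu = math.log(mean) - sigma2 / 2.0
    return mu, math.sqrt(sigma2)

# Made-up parameters, used only to check the round trip.
mu_true, sigma_true = -0.5, 0.6
mean = math.exp(mu_true + sigma_true**2 / 2)
var = (math.exp(sigma_true**2) - 1) * math.exp(2 * mu_true + sigma_true**2)

mu_hat, sigma_hat = lognormal_from_moments(mean, var)
print(mu_hat, sigma_hat)
```

If the closed form and the numerical root disagree noticeably, the starting guess passed to the solver is a reasonable first thing to revisit.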

**Bayes' Theorem**

Once again, before we dive in, recall what we learned last time about conditional probability and *Bayes' Theorem*.

$$ P(B \mid A) = \frac{P(A \mid B)P(B)}{P(A)} = \frac{P(A \mid B)P(B)}{\sum_k P(A \mid B_k)P(B_k)} $$
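As a small numerical illustration of the formula above, here is a sketch with three hypotheses $B_k$. The priors and conditional probabilities are made-up numbers chosen only for the example.

```python
# Hypothetical example: three hypotheses B_k with made-up priors and
# made-up probabilities of observing the event A under each one.
priors = [0.5, 0.3, 0.2]          # P(B_k), must sum to 1
likelihoods = [0.10, 0.40, 0.80]  # P(A | B_k)

# Denominator: law of total probability, P(A) = sum_k P(A | B_k) P(B_k)
p_a = sum(l * p for l, p in zip(likelihoods, priors))

# Bayes' theorem applied to every hypothesis at once
posteriors = [l * p / p_a for l, p in zip(likelihoods, priors)]
print(p_a, posteriors)
```

The posteriors sum to one because the denominator is exactly the sum of the numerators.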


The *likelihood* of a parameter (or vector of parameters) $p$ given some observations $X$ is defined to be the probability of observing the data given the parameter.

$$ L(p \mid X) = P(X \mid p)$$

Actually, the definition is looser than that, because we do not require the likelihood to be a probability per se.

**Likelihood**

Why *likelihood*?
* Likelihood is a component of all Bayesian models
* Maximum likelihood is an important statistical approach in its own right.

Our objectives here:
* Understand the concept of likelihood and its relationship to the probability of data conditional on parameters
* Describe a likelihood profile and how it differs from a probability density function
* Understand how to compose likelihoods for multiple parameters and multiple observations

There is a distinction between estimating data from a parameter vs estimating a parameter from data.

<img src="infLike.png" style="width: 800px;"/>


Example:
> The true proportion of defective units, $\theta$, of widgets produced by a Syracuse factory is $.12$. A sample of $24$ units includes $4$ defective units. What is the probability of obtaining these data assuming the estimated proportion is true? 

<img src="inferFinal.png" style="width: 800px;"/>
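The widget question can be answered directly if we assume the units are independent, so that the number of defectives follows a binomial distribution. That modeling choice is an assumption of this sketch, and `binom_pmf` is a hypothetical helper name.

```python
import math

def binom_pmf(y, n, theta):
    """Probability of y successes in n independent trials."""
    return math.comb(n, y) * theta**y * (1 - theta)**(n - y)

theta = 0.12   # proportion of defective units, treated as known
n, y = 24, 4   # sample size and observed number of defectives

p_data = binom_pmf(y, n, theta)
# Sanity check: the PMF sums to 1 over all possible outcomes 0..n
total = sum(binom_pmf(k, n, theta) for k in range(n + 1))
print(p_data, total)
```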


On the other hand, *Bayesian inference* is based on the likelihood function.

<img src="BayesThetaPic.png" style="width: 800px;"/>


Example:
> We obtain a new sample of $24$ widgets from the same Syracuse factory and find that it includes $4$ defective units. In light of these data, what is the probability that the true value of prevalence, $\theta$, is found in $[q_L, q_U]$?

The key idea in likelihood is this:

* In a probability mass or probability density function, the parameter $\theta$ is constant and the data $y$ are random variables. The function sums or integrates to $1$ over its support.

* In a likelihood function, the data are constant (known) and the parameter is unknown but fixed. We use $[y\mid \theta]$ to assess the likelihood of different values of $\theta$ in light of the data. In this case, the function does not sum or integrate to one over all possible values of the parameter.

$$ L(\theta \mid y) \propto [y \mid \theta] $$

Likelihood is *proportional* to probability or probability density.
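The contrast between the two bullets can be checked numerically. The sketch below assumes a binomial model with $n = 3$ trials (an illustrative choice): summing over the data gives $1$, while integrating the same expression over $\theta$ does not.

```python
import math

def binom_pmf(y, n, theta):
    return math.comb(n, y) * theta**y * (1 - theta)**(n - y)

n = 3

# Fixed parameter, varying data: a PMF, so the values sum to 1.
theta = 0.5
pmf_sum = sum(binom_pmf(y, n, theta) for y in range(n + 1))

# Fixed data, varying parameter: a likelihood. Integrate over theta in
# [0, 1] with a midpoint rule; the area is not 1.
y = 2
steps = 10_000
width = 1.0 / steps
lik_area = sum(binom_pmf(y, n, (i + 0.5) * width) for i in range(steps)) * width

print(pmf_sum, lik_area)
```

For this particular case the area under the likelihood works out to $1/(n+1) = 0.25$.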

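Let's pause and discuss this notation a little. The three lines below say the same thing with increasing specificity: likelihood is proportional to the probability (or density) of the data, the constant of proportionality $c$ does not depend on $\theta$, and by convention we usually take $c = 1$.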

$$L(\theta \mid y) \propto [y \mid \theta]$$
$$L(\theta \mid y) = c[y \mid \theta]$$
$$L(\theta \mid y) = [y \mid \theta]$$

Now let's have some more concrete examples.

*when the parameter is known and the data are random*

<img src="canBeans.png" style="width: 800px;"/>


* What are the possible outcomes?
* What PMF (probability mass function) would you use to model these data?
* What is the probability of each outcome?
* What is the sum of the individual probabilities?

Example:

<img src="beansNumbs.png" style="width: 800px;"/>



*when data are known and parameter is random*

<img src="manyCans.png" style="width: 800px;"/>


* What is the likelihood of each possible value of the parameters?
* How do the values sum?

Example:

> We have three hypothesized parameter values, $\frac{5}{6}, \frac{1}{2}, \frac{1}{6}$. Data in hand are $2$ white beans on $3$ draws. The likelihood of each parameter value is given in the table.



<img src="likelyTab.png" style="width: 800px;"/>
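The values for this example can be reproduced with exact fractions. This sketch assumes $\theta$ is the probability of drawing a white bean and that draws are independent, so the binomial PMF applies.

```python
from fractions import Fraction
from math import comb

n, y = 3, 2  # 3 draws, 2 white beans
thetas = [Fraction(5, 6), Fraction(1, 2), Fraction(1, 6)]

# Binomial likelihood of each hypothesized value given the same data
likelihoods = {t: comb(n, y) * t**y * (1 - t)**(n - y) for t in thetas}

for t, lik in likelihoods.items():
    print(f"theta = {t}: L = {lik} (about {float(lik):.3f})")
print("sum =", float(sum(likelihoods.values())))
```

Note that the three likelihoods do not add up to one, which is the point of the second question above.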



We can visualize this in a plot (I will later ask you to generate such a plot).

*A Likelihood Profile: 2 White Beans on 3 Draws*

<img src="likelyProfl.png" style="width: 800px;"/>



Class Exercise: Can of Beans
* Let's draw $10$ beans from one of three cans
* How much more likely is this can than the other two?

In [5]:
import numpy as np
np.random.binomial(10,1/2)

4
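One way to answer "how much more likely" for a draw of $4$ white beans out of $10$ is with likelihood ratios. The sketch assumes the three cans have the same white-bean proportions as the earlier example ($\frac{5}{6}, \frac{1}{2}, \frac{1}{6}$), which is an assumption made for illustration.

```python
from fractions import Fraction
from math import comb

n, y = 10, 4  # 10 draws, 4 white beans
thetas = [Fraction(5, 6), Fraction(1, 2), Fraction(1, 6)]

lik = {t: comb(n, y) * t**y * (1 - t)**(n - y) for t in thetas}
best = max(lik, key=lik.get)  # the can with the highest likelihood

for t in thetas:
    ratio = lik[best] / lik[t]
    print(f"theta = {t}: L = {float(lik[t]):.4f}, best/this = {float(ratio):.1f}")
```

The binomial coefficient cancels in every ratio, so only the $\theta^y (1-\theta)^{n-y}$ part matters for the comparison.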

*A Likelihood Profile: 4 White Beans on 10 Draws*

<img src="fourBeans.png" style="width: 800px;"/>



Things to notice here:
* The shape of the curve has narrowed
* The center of the curve has shifted
* One of the possible values is much more likely than the others.
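The first two observations can be checked without a plot by evaluating both profiles on a grid of $\theta$ values. In this sketch the "width" is measured where the likelihood is at least half its peak; that cutoff is an arbitrary choice for illustration, and the helper names are hypothetical.

```python
from math import comb

def likelihood_profile(y, n, grid):
    """Binomial likelihood of each theta in grid for fixed data (y, n)."""
    return [comb(n, y) * t**y * (1 - t)**(n - y) for t in grid]

def summarize(y, n, points=1001):
    """Return the grid value with the highest likelihood and the width
    of the region where the likelihood is at least half its peak."""
    grid = [i / (points - 1) for i in range(points)]
    profile = likelihood_profile(y, n, grid)
    peak = max(profile)
    theta_hat = grid[profile.index(peak)]
    inside = [t for t, lik in zip(grid, profile) if lik >= peak / 2]
    return theta_hat, max(inside) - min(inside)

theta_hat_3, width_3 = summarize(2, 3)     # 2 white beans on 3 draws
theta_hat_10, width_10 = summarize(4, 10)  # 4 white beans on 10 draws
print(theta_hat_3, width_3)
print(theta_hat_10, width_10)
```

Passing the same grid and profile to a plotting library would reproduce the shape of the curves.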

In summary:

*Likelihood vs Probability, Summarized*

<img src="summary.png" style="width: 900px;"/>



You're probably wondering about multiple parameters. We won't get into those computations, but let's look at how that is handled conceptually.

We used a single parameter, and that made it easy for us to visualize the problem. In general, we can have any number of parameters in a much more complex model of some process, e.g.

$$  \mu_i = g(\theta, x_i) $$
$$ L(\mu_i \mid y_i) \propto [y_i \mid \mu_i]$$

Going up one dimension, we can visualize this as a *likelihood surface*

<img src="likelySurf.png" style="width: 900px;"/>


If we assume *conditional independence*, then the likelihood is the **product of the individual likelihoods**. This is a pretty big assumption, but you've got to start somewhere.

$$ L(\theta \mid {\bf y}) = c\prod_{i=1}^n [y_i \mid {\bf \theta}]$$

For us, the consequence of conditional independence is that the *chain rule*

$$ P(y_1,\ldots,y_n) = P(y_1 \mid y_2,\ldots,y_n)P(y_2 \mid y_3,\ldots,y_n)\ldots P(y_n)$$

reduces to a simple product of the individual probabilities, which is what we wanted:

$$ P(y_1,\ldots,y_n) = P(y_1)\cdots P(y_n)$$

For example, the binomial PMF:

$$ [k \mid b] = \binom{n}{k} b^k (1-b)^{n-k} $$

Theoretically, this is all we need. But *numerically* it is not feasible to multiply very many probabilities together: the product underflows floating-point precision and information is lost rapidly. We therefore take logs and sum them:

$$ \log L(\theta \mid {\bf y}) = \log c + \sum_{i=1}^n \log [y_i \mid {\bf \theta}] $$
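The numerical problem is easy to reproduce. The sketch below uses made-up data (2000 observations, each with probability $0.5$ under the model) to compare the direct product with the sum of logs.

```python
import math

# Made-up data: 2000 independent observations, each with probability 0.5
# under the model being evaluated.
probs = [0.5] * 2000

# Direct product underflows to exactly 0.0 in double precision.
product = 1.0
for p in probs:
    product *= p

# Sum of logs stays finite and easy to compare across parameter values.
log_lik = sum(math.log(p) for p in probs)

print(product, log_lik)
```

Once the product has collapsed to zero, every candidate parameter value looks equally bad, whereas the log-likelihoods can still be ranked.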

**Pop Quiz** Why can we get away with this?

Let's see a *continuous* example.

Example: The Exponential Distribution & POTUS tweets

Supported on the *positive semiaxis*, an exponential random variable is described by a single parameter $\lambda$.

$$ P(t_i \mid \lambda) = \lambda e^{-\lambda t_i} $$

In a picture:

<img src="expPlots.png" style="width: 600px;"/>


The exponential distribution describes the 'waiting times' between events resulting from a [Poisson process](https://en.wikipedia.org/wiki/Poisson_point_process).

The parameter $\lambda$ is the average number of occurrences per unit time.
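Since $\lambda$ is a rate, the average waiting time should be close to $1/\lambda$. The sketch below checks this by simulation with the standard library; the rate, seed, and sample size are arbitrary choices for illustration.

```python
import random

lam = 6.6                # assumed rate: events per unit time
rng = random.Random(0)   # fixed seed so the run is repeatable

# random.expovariate takes the rate lambda, not the mean
waits = [rng.expovariate(lam) for _ in range(100_000)]
mean_wait = sum(waits) / len(waits)

print(mean_wait, 1 / lam)
```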

Let's look at POTUS tweets.

> Generate a datum quantifying how long we have to wait for a new tweet from POTUS. Write out the likelihood function *assuming wait times for POTUS tweets are governed by an exponential distribution*. Determine the maximum likelihood estimate (MLE) for $\lambda$.

Let's suppose we're looking to quantify the rate *per day* of tweets. From a certain sample we get:

* $991 / 151 \approx 6.6$ tweets per day
* $t_1 = 1 / 6.6 \approx 0.15$ days between tweets

With this single datapoint, we can analytically solve for the maximum likelihood estimator $\lambda_{MLE}$.

The definition

$$ L(\lambda \mid t_1) = [t_1 \mid \lambda] $$

The assumption on the distribution

$$ L(\lambda) = \lambda e^{-\lambda t_1} $$

Taking logs

$$ LL(\lambda) = \log(\lambda) - \lambda t_1 $$

Differentiating...

$$ \frac{dLL}{ d\lambda} = \frac{1}{\lambda} - t_1  $$

Equating to zero

$$ 0 = \frac{1}{\lambda_{MLE}} - t_1 $$

Solving...

$$ \lambda_{MLE} = \frac{1}{t_1} = 6.6 $$
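The analytic answer can be cross-checked by evaluating the log-likelihood on a grid of candidate rates and picking the largest value. This is a rough sketch: the grid range and spacing are arbitrary, and $t_1$ is taken as $1/6.6$ days as above.

```python
import math

t1 = 1 / 6.6  # the single observed waiting time, in days

def log_lik(lam):
    """Exponential log-likelihood for one observation t1."""
    return math.log(lam) - lam * t1

# Coarse grid search over candidate rates (tweets per day)
grid = [0.01 * i for i in range(1, 3001)]   # 0.01 ... 30.00
lam_grid = max(grid, key=log_lik)

print(lam_grid, 1 / t1)
```

A grid search only locates the maximum to within the grid spacing, so agreement with $1/t_1$ is expected to about two decimal places here.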

But what does this look like? Here is the likelihood profile:

<img src="potusProfl.png" style="width: 800px;"/>


But what about the density? We would like a *posterior distribution*, and not just the *likelihood profile*.