# 2 (Optional) Probability, Distributions, and Measurement

This notebook is optional, but it's recommended if you want to learn more about Python and/or probability theory.

### More on Python

If you're new to programming in general, you should be working your way through all the chapters of the [Python for Data Science](https://www.datacamp.com/courses/intro-to-python-for-data-science) DataCamp tutorial.

If you're a more accomplished programmer but new(ish) to Python, you can get a more detailed primer on the same material by going through chapters 1-4 of the [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/).

Assuming you've completed these activities, the sections below will be a review of some key concepts.

### Objects and packages

Python is firmly grounded in the tradition of *object-oriented programming* (OOP). In this approach, the logic for dealing with different kinds of data is *encapsulated* with, or *attached* to, the data. A data object can

- have *attributes* that tell us about that object instance
- have *methods* that are functions an object can perform
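
As a small illustration of the difference, the sketch below uses Python's built-in complex number type. The variable name `z` is only for this example.

```python
# A complex number is an object with both attributes and methods.
z = 3 + 4j

# Attributes are data stored on the object (no parentheses).
print(z.real)   # the real part
print(z.imag)   # the imaginary part

# Methods are functions attached to the object (called with parentheses).
print(z.conjugate())
```

Note that accessing an attribute never takes arguments, while a method call always needs parentheses, even when there is nothing to pass in.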


### What is an instance?

Every value in Python is an *instance* of some type. Each time you create a new value, the interpreter is *instantiating* an object of a defined type, allocating memory and setting initial values.

Thus, defining a variable simply points the variable's name (in the current namespace) to the chunk of memory containing the instance.

Python keeps track of all object instances and cleans them up when they are no longer needed.
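
One consequence of names pointing to instances is that two names can refer to the same object. The following is a minimal sketch using a plain list; the names `a`, `b`, and `c` are arbitrary.

```python
a = [1, 2, 3]
b = a            # b is a second name for the same list instance
b.append(4)

print(a)         # the change is visible through both names
print(a is b)    # True: both names refer to one instance

c = list(a)      # an explicit copy creates a new instance
print(c == a)    # True: same contents
print(c is a)    # False: different objects in memory
```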

You can determine the type of a value or variable using the `type` function:

In [None]:
x = 42
type(x)

The attributes and methods of an object are accessed by using the `.` operator. You can *inspect* the attributes and methods of any object in the notebook by pressing the `Tab` key after entering the variable name and `.`, as below:

In [None]:
# let's explore x:
# (place the cursor after x. and press the tab key)
x.

Let's try with a different type:

In [None]:
y = 'The answer to the ultimate question.'
type(y)

In [None]:
# let's explore the string type:
# (press tab after y.)
y.

Note: you can access documentation for functions and methods by pressing `Shift-Tab`.

In [None]:
# place the cursor between the parentheses and press `Shift-Tab`
y.lower()

**Question:** In your own words, what does the `lower` method do?

(double-click the next cell and enter your answer)

### What is a package?

Programming is all about abstraction, and code is meant to be reused. Python has tons of useful packages that extend its basic functionality. This means you don't have to (and shouldn't) reinvent basic algorithms. 

We'll talk more about how to find and evaluate packages later.

You can access external packages (or modules) by **importing** them into your workspace. When you import a package, you should give it a short name that you'll use to access the functions, etc, in the package.

You'll see (and start to use) lots of statements like the following:

```python
import numpy as np
```

After executing this statement, you can access the contents of the module as if they were attributes and methods of an object called `np`, like so: 

In [None]:
import numpy as np
np.arange(10)

### Arrays

The `numpy` package provides Python with an important data structure, the **array**. Arrays are so important to scientific programming that you will almost inevitably import `numpy` at the beginning of every script and notebook you use.

As we discussed previously, **scalar** data types like `int` and `float` can only represent a single number. But time series and point processes are inherently ordered collections of data, and arrays give us the ability to store and manipulate these large aggregates. 

#### Array terminology

- An **element** of an array is one item. It occupies a specific "slot" in the sequence.
- The **length** or **size** of an array is the number of elements.
- The **index** of an element is the numerical position of the element in the array. Python uses *zero-based indexing*, which means that the first element has the index `0`.
- The **data type** of the array is the type of the elements in the array. In arrays, all the elements have the same type. In `numpy`, the data type is stored in the `dtype` attribute.
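
To make these terms concrete, here is a short sketch with a small made-up array (the name `temps` and its values are hypothetical):

```python
import numpy as np

temps = np.array([20.5, 21.0, 19.8, 22.3])  # hypothetical measurements

print(temps.size)    # number of elements
print(len(temps))    # same thing for a 1-D array
print(temps.dtype)   # data type shared by all elements
print(temps[0])      # first element (index 0)
print(temps[temps.size - 1])  # last element has index size - 1
```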

#### Array operations

- **indexing** allows you to access specific elements of an array
- **slicing** allows you to access specific subsets of an array
- **iteration** allows you to process each element of the array in sequence

In numpy, both indexing and slicing use square brackets (`[` and `]`).

In [None]:
# first, let's initialize the random seed so that we all get the same answers
np.random.seed(1)
# create an array with 100 random numbers
my_data = np.random.randn(100)
# use indexing to get the first element:
my_data[0]

Slicing also uses square brackets, but instead of a single index, you use two indices separated by `:` to specify a range:

In [None]:
# this will retrieve the first 10 elements of my_data
my_data[0:10]

Notice how indexing returns a scalar, but slicing returns a new array (technically, a view of a part of the original array). Also notice that the last index is *exclusive*; that is, element 10 is NOT returned.
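
The distinction between a view and a copy matters in practice, because writing to a view changes the original array. A minimal sketch with a small throwaway array (the names `values`, `first_three`, and `safe` are only for illustration):

```python
import numpy as np

values = np.arange(6)      # array([0, 1, 2, 3, 4, 5])
first_three = values[0:3]  # a view, not a copy

first_three[0] = 100       # modify the view...
print(values)              # ...and the original changes too

safe = values[0:3].copy()  # an explicit copy is independent
safe[1] = -1
print(values[1])           # still 1
```

If you need to modify a slice without touching the source data, calling `.copy()` as shown is one way to do it.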

#### Some simple problems

Write an expression for the 88th value of `my_data` in the cell below:

Write an expression for the last 10 elements of `my_data`:

### Basics of Probability

Let's consider two events, $A$ and $B$.

**The probability of $A$**

$$p(A) \in [0, 1]$$

**The probability of not $A$**

$$p(\neg A) = 1 - p(A)$$

**The probability of $A$ *or* $B$**

$$p(A \cup B) = p(A) + p(B) - p(A \cap B)$$ 

If $A$ and $B$ are mutually exclusive, 

$$p(A \cup B) = p(A) + p(B)$$

**The probability of $A$ and $B$. This is also called the joint probability**

$$p(A,B) = p(A \mid B) p(B) = p(B \mid A) p(A)$$

If $A$ and $B$ are independent,

$$p(A,B) = p(A) p(B)$$

**The probability of $A$ given $B$. This is called the conditional probability**

$$p(A \mid B) = \frac{p(A \cap B)}{p(B)} = \frac{p(B \mid A) p(A)}{p(B)}$$
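
These identities can be checked by brute-force enumeration. The sketch below uses a fair six-sided die and two events chosen only for illustration ("roll is odd" and "roll is at least 5"); exact fractions avoid rounding error. The helper `prob` is a hypothetical name introduced here.

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}              # fair six-sided die
A = {n for n in outcomes if n % 2 == 1}    # "roll is odd"
B = {n for n in outcomes if n >= 5}        # "roll is at least 5"

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(outcomes))

p_A, p_B = prob(A), prob(B)
p_A_and_B = prob(A & B)            # joint probability
p_A_or_B = prob(A | B)             # probability of A or B
p_A_given_B = p_A_and_B / p_B      # conditional probabilities
p_B_given_A = p_A_and_B / p_A

# each line prints the left- and right-hand side of one identity
print("complement: ", prob(outcomes - A), "=", 1 - p_A)
print("union:      ", p_A_or_B, "=", p_A + p_B - p_A_and_B)
print("joint:      ", p_A_and_B, "=", p_A_given_B * p_B)
print("conditional:", p_A_given_B, "=", p_B_given_A * p_A / p_B)
```

For this particular pair of events the joint probability also equals $p(A) p(B)$, so they happen to be independent; that will not be true for most pairs of events.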



## Simple exercises

Using the equations above, let's try and figure out the probabilities of the following:

1. Rolling a 5 on a 6-sided die.
2. Not rolling a 3 on a 6-sided die.
3. Rolling a 4 or a 5 on a 6-sided die.
4. Rolling less than 4 or an even number on a 6-sided die.
5. Rolling two 3's in a row on a 6-sided die.

Enter your answers as text in the cell below:

## Harder question

#### The case of Tim Tebow

What is the probability of becoming a Major League Baseball (MLB) player if you hit a home run (HR) in your first at-bat in the minor leagues? We have the following important information:

a) 5% of future MLB players hit a home run in their first minor league at-bat.

b) 1% of minor league players make it to MLB.

c) Only 0.1% of players hit a home run in their first minor league at-bat.

Hint: You are trying to solve $P(MLB \mid HR)$.

**Bonus**: What percentage of players who don't make it to MLB hit a home run in their first at-bat?

Write code that *prints the answer* to the question(s):

## Probability Distributions

Some key terms:

- A **random variable** can take on different values depending on the outcome of a random phenomenon.
  - Random variables are usually denoted by capital letters (e.g. $X$)
  - A concrete value of a random variable is denoted by a lowercase letter (e.g. $x$)
- The **support** of a random variable is the set of values it can have
  - A **discrete** support includes only a finite, or countably infinite, set of values. 
  - A **continuous** support includes all the real numbers within some range. The number of possible values is *uncountably* infinite.
- A **distribution** describes the probability of a random variable having some value.
  - Distributions can be mathematical functions of some number of **parameters**.
      - Parameters are NOT observable


### Discrete vs continuous distributions

A **probability mass function** (*pmf*) describes the distribution of a *discrete* variable. For example, a fair N-sided die has a probability of coming up as 1 equal to $1/N$:

$$P(X = 1) = 1/N$$

A **probability density function** (*pdf*) describes the distribution of a *continuous* variable. Density functions can be used to calculate the probability of obtaining a value in a range, using integration:

$$P(a \leq X \leq b) = \int_a^b p(x) dx$$ 

We will use $P$ to denote discrete probabilities and $p$ to denote probability density functions.
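
To illustrate the integral, the sketch below approximates $P(-1 \leq X \leq 1)$ for a standard normal variable using the trapezoid rule. It uses only the standard library; the helper names `normal_pdf` and `integrate` are hypothetical and not part of the course package.

```python
import math

def normal_pdf(x, mean=0.0, std=1.0):
    """Density p(x) of a normal distribution."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def integrate(f, a, b, n=10_000):
    """Approximate the integral of f from a to b with the trapezoid rule."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n))
    return total * h

# P(-1 <= X <= 1) for a standard normal variable
prob_within_one_std = integrate(normal_pdf, -1.0, 1.0)
print(round(prob_within_one_std, 4))

# a density value is not itself a probability: a narrow normal exceeds 1 at its peak
print(normal_pdf(0.0, std=0.1))
```

The second print is a reminder that only the area under a density, not its height, has to stay between 0 and 1.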

## Examples

Let's look at some common discrete and continuous probability distributions. Run the following two cells to import some functions.

In [None]:
# load matplotlib inline mode
%matplotlib inline

# import some useful libraries
import numpy as np                # numerical analysis, linear algebra
import matplotlib.pyplot as plt   # plotting

# import some distributions
import sys
sys.path.insert(0,"/project/psyc5270-cdm8j/comp-neurosci")
from comp_neurosci_uva.dists import uniform, normal, beta, gamma, invgamma, exp, poisson, laplace, students_t, noncentral_t, halfcauchy

In [None]:
# this cell defines a function that will let us plot probability distributions easily

def plot_prob(dist, support=(-5, 5), npoints=100, **kwargs):
    """Plot a probability distribution over a support interval.
    
    dist - a scipy distribution function, discrete or continuous
    support - the range of values over which to evaluate the distribution
    npoints - the number of points within the support to evaluate (for density functions only)
    """
    try:
        x = np.linspace(support[0], support[1], npoints)        
        prob = dist.pdf(x)
    except AttributeError:
        x = np.arange(support[0], support[1])
        prob = dist.pmf(x)
    plt.plot(x, prob, lw=3, **kwargs)
    plt.xlabel('Value')
    plt.ylabel('Probability')

### [Uniform](https://en.wikipedia.org/wiki/Uniform_distribution_(continuous))

A continuous probability distribution assigning equal density to every value in a (continuous) range. The parameters of the uniform distribution are the lower and upper bounds, $l$ and $u$. The density is the reciprocal of the difference between the bounds (and zero outside them):

$$U(x \mid l,u) = \frac{1}{u - l}, \quad l \leq x \leq u$$

In [None]:
# plot the PDF
plot_prob(uniform(lower=-2, upper=2))
plot_prob(uniform(lower=-1, upper=1))


**Question**: Is this a density or a mass function? Why is the height of the distribution lower when the range is larger?

### [Normal](https://en.wikipedia.org/wiki/Normal_distribution)

Normal distributions, also known as Gaussian distributions (named after Carl Friedrich Gauss), are found throughout math and nature, in large part because of the Central Limit Theorem. Like the uniform distribution, the normal distribution has two parameters, the mean ($\mu$) and the standard deviation ($\sigma$). If a random variable $X$ is drawn from a normal distribution, it can be denoted as follows (by convention, the second argument is written as the variance, $\sigma^2$):

$$X \sim N(\mu,\sigma^2)$$

In [None]:
plot_prob(normal(mean=0, std=1))
plot_prob(normal(mean=1, std=2), support=[-5, 8])

### [Beta](https://en.wikipedia.org/wiki/Beta_distribution)

The beta distribution only has support between 0 and 1, which makes it useful for describing uncertainty about a probability. It also has two parameters, $\alpha$ and $\beta$.

In [None]:
plot_prob(beta(alpha=0.5, beta=0.5), support=[0,1])
plot_prob(beta(alpha=2, beta=5), support=[0,1])

### [Poisson](https://en.wikipedia.org/wiki/Poisson_distribution)

The Poisson distribution reflects the probability of observing some number of events within a window of time (for example, the number of buses that go by some point on the road). The Poisson distribution has only a single parameter, $\lambda$.

In [None]:
plot_prob(poisson(1), support=[0,10], marker='o')

**Question**: Is the Poisson distribution a density or a mass function? Why?

#### And more...

Take some time now to explore:

- Gamma
- Inverse Gamma
- Exponential
- Student's t
- Half Cauchy
- Poisson

**Assignment**: choose two distributions. Try to replicate the illustrative plots on their respective pages on Wikipedia. Insert code cells below to generate the plots.

### 3C Observation and Inference

Why is probability theory important to computational neuroscience? 

Fundamentally, the problem we face is that we can't directly determine any physical quantity. We can only take a measurement of it, and that measurement will have errors. Thus, each time we make a measurement, we are going to get a value that comes from a **distribution**.

Even more troubling, we rarely are able to directly measure the actual quantities we care about. Think back to the last exercise (or run it, if you don't remember!). It was clear that a limited set of sounds were activating the neuron (i.e., generating synaptic excitation), but we didn't have a direct measurement of how strong that excitation was. We could only observe that the neuron produced action potentials at higher rates during certain intervals.

The process of using observations to gain information about unobservable quantities is called **inference** or **estimation**, and probability theory gives us the tools we need to make this connection.

### Sampling and Measurement

Let's come up with an **observational model** that formalizes what we think is going on when we make a measurement.

Observational models are fundamental to all statistical analyses and will be a part of many of the more complex models we develop.

Let's say that we have a bar of length $\mu$. We can measure the bar's length as many times as we like, and each time we do so, we'll get a value:

$$y_i = \mu + \varepsilon_i$$

Notice the subscript $i$. This is a numerical **index** for the measurement. Let's stick with the Python convention and have $y_0$ indicate the value of the first measurement.

This model simply says that any given measurement $y_i$ will be the true length of the bar plus some (hopefully small) error $\varepsilon_i$. That is, we're assuming **additive error**.

### The normal error model

There are many potential sources of error in our measurement.

The bar might be fluctuating slightly in length due to changes in temperature. This is an example of an **intrinsic error**.

The instrument making the measurement might have limited precision due to how it's constructed or because of electrical interference. This is an example of **extrinsic error**.

Thanks to the [Central Limit Theorem](https://en.wikipedia.org/wiki/Central_limit_theorem), a reasonable assumption is that the sum of all these sources of error will have a normal (or Gaussian) distribution.

We can formalize this assumption by saying that $\varepsilon_i$ is **drawn** or **sampled** from a normal distribution. This relationship is often signified with $\sim$.

$$\varepsilon_i \sim N(0, \sigma^2)$$

$N$ represents the normal distribution, and as we saw above, this distribution has two parameters: mean and variance.

**Question**: Why is the mean for the error distribution zero? What does the variance correspond to?

### Simulating measurements

Python can generate numbers from a distribution. The numbers are usually not truly random, because computers are (generally) deterministic, but the values will occur with the probability specified by the underlying PDF. Let's generate 10 measurements of our bar:

In [None]:
from comp_neurosci_uva import data
# true length of bar:
mu = 12.1
# number of measurements
N = 10
# errors
epsilon = normal(mean=0.0, std=data.e2_std).rvs(N)
# measurements
y = mu + epsilon

# inspect the variables
print("errors:", epsilon)
print("measurements:", y)

An aside: numpy arrays support **broadcasting**, which is what allows us to simply add a scalar `mu` to an array `epsilon` and get a new array in which `mu` has been added to every element of `epsilon`. If you're used to lower-level languages like C or Java, make sure you take advantage of this feature, as it's MUCH faster than iterating through the array.
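
The sketch below shows that broadcasting gives the same result as an explicit loop. It uses numpy's own random generator instead of the course package, and the error standard deviation of `0.1` is an arbitrary value chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)          # seeded so the result is repeatable
true_length = 12.1                      # same value as `mu` above
errors = rng.normal(0.0, 0.1, size=5)   # 0.1 is an arbitrary std for this sketch

# broadcasting: the scalar is added to every element in one step
measured = true_length + errors

# the equivalent explicit loop
measured_loop = np.empty_like(errors)
for i in range(len(errors)):
    measured_loop[i] = true_length + errors[i]

print(np.allclose(measured, measured_loop))
```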

### Histograms

It's useful to look at the raw data, but distributions are usually better visualized as **histograms**. A histogram is a plot that divides a range into a set of intervals or **bins**, and then counts the number of values in each bin.
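
To see what the binning step does on its own, here is a minimal sketch using `np.histogram` on a handful of made-up values (the array `samples` is hypothetical):

```python
import numpy as np

samples = np.array([0.2, 0.7, 1.1, 1.5, 1.9, 2.4, 2.5, 3.8])  # made-up values

# 4 equal-width bins spanning the range 0 to 4
counts, edges = np.histogram(samples, bins=4, range=(0, 4))

print(edges)   # bin boundaries: 0, 1, 2, 3, 4
print(counts)  # number of samples falling in each bin
```

With `bins=4` there are five edges, because each bin needs a left and a right boundary.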

Matplotlib has a histogram function, so no need to reinvent the wheel here:

In [None]:
freq, bins, _ = plt.hist(y, range=(10, 14), bins=10, density=True)

Oops! All the observations fall in two bins. **Assignment**: Copy the code cell above, paste it below this cell, and edit to change the `range` and/or `bins` parameters so that the plot gives you more useful information.

### Fitting the observational model

Once you've got a nice histogram, try fitting it to a normal PDF. Copy the plot statement from the code cell above into the cell below, then adjust the mean and std parameters in the `plot_prob` function until you get what looks like a good fit.

In [None]:
### COPY histogram command in the line below

### EDIT this line to adjust the PDF
plot_prob(normal(mean=???, std=???), support=(11, 13))

The `mean` and `std` values you settle on are your parameter **estimates**.

It may not be possible to achieve a good fit with so few observations. What effect does this have on your ability to come up with good parameter estimates?

Explore whether increasing the sample size helps. Copy and paste ONLY the relevant lines of code from above and paste them into the cell below. (Normally we avoid copying and pasting like the plague, but this will help you to think about what each statement is doing so that you can choose ONLY the relevant ones).

In [None]:
## paste your code here

### Summary statistics

Obviously, manually fitting a PDF to the histogram is both tedious and error-prone.

Because we're using a normal error model, we can estimate the parameters directly by computing the *sample mean* and *sample standard deviation*.
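
For reference, the usual definitions for $N$ measurements $y_0, \ldots, y_{N-1}$ are given below. The symbols $\bar{y}$ and $s$ are notation introduced here for convenience.

$$\bar{y} = \frac{1}{N} \sum_{i=0}^{N-1} y_i, \qquad s = \sqrt{\frac{1}{N-1} \sum_{i=0}^{N-1} (y_i - \bar{y})^2}$$

Some texts and software defaults divide by $N$ rather than $N - 1$ in the standard deviation; the difference is small unless $N$ is small.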

Consult the numpy documentation (see link under the `Help` menu) and find the array methods that will compute these summary statistics, then edit the code cell below so that it prints out the values.

In [None]:
print("The mean is", y.<insert-method-call-here>)
print("The standard deviation is", y.<insert-method-call-here>)

**Question**: How close were your estimates to the summary statistics? How close are the summary statistics to the values of $\mu$ and $\sigma$ that were used to generate the data?

### 3D: Data and Probability Models

#### Readings and other resources

- If you'd like to explore the concept of maximum likelihood more, check out this [interactive visualization](https://rpsychologist.com/d3/likelihood/) from [@krstoffr](https://twitter.com/krstoffr).

## Data as random variables

Most statistics operates under the assumption that any observed data are actual samples drawn from some (to be learned) model of the world.

In the standard approach, these models are simply probability distributions, with parameters that govern the behavior of the model (i.e., what we can expect they will produce).

## Uncertainty

If the observed data samples are simply random draws from a probability distribution, then the level of uncertainty will decrease as we gain more data samples.
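
This can be checked by simulation. The sketch below draws many datasets of a given size from a normal distribution and measures how much the sample mean varies from one dataset to the next. The parameters and the helper name `spread_of_sample_mean` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
true_mean, true_std = 0.0, 1.0   # arbitrary parameters for this sketch

def spread_of_sample_mean(n, repeats=2000):
    """Std. dev. of the sample mean across many simulated datasets of size n."""
    datasets = rng.normal(true_mean, true_std, size=(repeats, n))
    return datasets.mean(axis=1).std()

for n in (10, 100, 1000):
    print(n, spread_of_sample_mean(n))
```

The spread should shrink roughly in proportion to $1/\sqrt{n}$, so a hundredfold increase in data buys about a tenfold reduction in uncertainty about the mean.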

Statistical inference involves figuring out what model (e.g., probability distribution, but we will be building more complicated models later in the course) and parameters generated the data. 

Let's spend some time trying to perform this task by hand...

In [None]:
# load matplotlib inline mode
%matplotlib inline

# import some useful libraries
import numpy as np                # numerical analysis, linear algebra
import pandas as pd               # efficient tables
import matplotlib.pyplot as plt   # plotting
import ipywidgets as widgets      # interactive widgets
from ipywidgets.widgets.interaction import show_inline_matplotlib_plots
from IPython.display import display, clear_output
import pickle

# import the distributions wrapped from scipy
from comp_neurosci_uva import dists

In [None]:
# load in the data
with open('data/random_data.pickle', 'rb') as f:
    data = pickle.load(f)

# tell us something about the data
for i,d in enumerate(data):
    print('Dataset %d has %d samples' % (i, len(d)))

In [None]:
# mapper between dist name and object
dist_dict = {'Beta': dists.beta,
             'Exponential': dists.exp,
             'Gamma': dists.gamma,
             'Normal': dists.normal,
             'Uniform': dists.uniform}

# Dropdown widget to pick datasets
ds_ind = widgets.Dropdown(options=range(len(data)),
                          description="Dataset")

# Checkbox for whether to show log likelihood
like_check = widgets.Checkbox(description='Show Log Likelihood', 
                              value=False)

# Checkbox for whether to show stem plot
stem_check = widgets.Checkbox(description='Show Stem Plot', 
                              value=False)

# set up the distributions tab
dist_tab = widgets.Tab(allow_none=False)
children = []
titles = []

# beta params
beta_alpha = widgets.FloatText(value=.5,
                               description='alpha')
beta_beta = widgets.FloatText(value=.5,
                              description='beta')
children.append(widgets.HBox([beta_alpha, beta_beta]))
titles.append('Beta')

# exp params
exp_lam = widgets.FloatText(value=5.0,
                            description='lam')
children.append(widgets.HBox([exp_lam]))
titles.append('Exponential')

# gamma params
gamma_alpha = widgets.FloatText(value=.5,
                                description='alpha')
gamma_beta = widgets.FloatText(value=.5,
                               description='beta')
children.append(widgets.HBox([gamma_alpha, gamma_beta]))
titles.append('Gamma')

# normal params
normal_mean = widgets.FloatText(value=0.0,
                                description='mean')
normal_std = widgets.FloatText(value=1.0,
                               description='std')
children.append(widgets.HBox([normal_mean, normal_std]))
titles.append('Normal')

# uniform params
uniform_lower = widgets.FloatText(value=0.0,
                                  description='lower')
uniform_upper = widgets.FloatText(value=1.0,
                                  description='upper')
children.append(widgets.HBox([uniform_lower, uniform_upper]))
titles.append('Uniform')

# add all the children and set the tab titles
dist_tab.children = children
for i in range(len(titles)):
    dist_tab.set_title(i, titles[i])

# set the full user interface
ui = widgets.VBox([widgets.HBox([ds_ind, like_check, stem_check]), dist_tab])

# define plotting function
def plot_data_and_dist(*vals, **kwargs):
    # first plot the data
    dat = data[ds_ind.value]
    plt.hist(dat, bins='auto', density=True, alpha=.5);
    
    # now plot the pdf of the dist
    npoints = 100
    
    # add support for 10% of the data range on either side
    support = (dat.min() - np.ptp(dat)*.1,
               dat.max() + np.ptp(dat)*.1)
    x = np.linspace(support[0], support[1], npoints)
    
    # get the selected dist and params
    params = {c.description: c.value for c in 
              dist_tab.children[dist_tab.selected_index].children}
    dist = dist_dict[dist_tab.get_title(dist_tab.selected_index)](**params)
    
    # calculate the pdf
    pdf = dist.pdf(x)
            
    # plot the pdf and add labels
    plt.plot(x, pdf, lw=3)
    plt.xlabel('Value')
    plt.ylabel('Probability')
    if like_check.value:
        # calculate the log like
        log_like = np.log(dist.pdf(dat)).sum()

        # add it to the plot with some formatting
        plt.title('Log Like: {:3.4f}'.format(log_like))
    
    if stem_check.value:
        # include the stem plot
        plt.stem(dat, dist.pdf(dat), 'g')

        
# set up triggers for updating the plot
out = widgets.interactive_output(plot_data_and_dist, 
                                 {'ds_ind': ds_ind,
                                  'like_check': like_check,
                                  'stem_check': stem_check,
                                  'beta_alpha': beta_alpha,
                                  'beta_beta': beta_beta,
                                  'exp_lam': exp_lam,
                                  'gamma_alpha': gamma_alpha,
                                  'gamma_beta': gamma_beta,
                                  'normal_mean': normal_mean,
                                  'normal_std': normal_std,
                                  'uniform_lower': uniform_lower,
                                  'uniform_upper': uniform_upper,
                                 })

# wrapper for tab change
# required b/c selecting tabs can't trigger plots like other widgets
def tab_change(*args, **kwargs):
    with out:
        clear_output(wait=True)
        plot_data_and_dist()
        show_inline_matplotlib_plots()
dist_tab.observe(tab_change, 'selected_index')

## Interactive data fitting (part 2)

The widget below (h/t Per Sederberg) shows the histogram of a randomly generated distribution in blue, overlaid with the PDF from one of five different distribution types in orange. Select one of the datasets and try to figure out what distribution it was generated from. You'll likely need to adjust the parameters of the distribution.

Refer back to Notebook 2 and/or Wikipedia if you want to remind yourself of what the different distributions look like and how they change shape with their parameters.

In [None]:
# show everything
display(ui, out)

**Q**: What dataset did you pick? What distribution and parameters best fit the observations by eye?

## Likelihood calculation

We don't have to do this by eye. Because our model has a probability density function (PDF), we can calculate the likelihood of observing the data given the model and parameters.

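For any given model and parameters, you can determine the likelihood of any individual data point by evaluating the PDF at the value of that data point. (*Turn on the Stem Plot* to see these values.) Strictly speaking, for a continuous distribution this value is a probability density rather than a probability.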


## Maximum Likelihood Estimation

The goal then becomes to maximize the likelihood of observing the data given a model and parameters:

$$P(D \mid \theta, M)$$

As long as the data points are all independent, the likelihood of observing all of them is the product of all the probabilities.

$$P(D \mid \theta, M) = \prod_i p(d_i \mid \theta, M)$$

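Multiplying many small numbers quickly underflows floating-point precision, so it is more efficient and numerically stable to maximize the log of $P(D \mid \theta, M)$. Because the logarithm is monotonic, the same parameters maximize both, and the product becomes a sum of log likelihoods.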

$$ \log P(D \mid \theta, M) = \sum_i \log p(d_i \mid \theta, M)$$
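
A short sketch of why the log form is preferred in practice: with a few thousand data points, the raw product of densities is too small to represent as a floating-point number, while the sum of logs is fine. The data here are simulated with numpy rather than taken from the datasets above, and the standard normal density is written out by hand.

```python
import numpy as np

rng = np.random.default_rng(2)
d = rng.normal(0.0, 1.0, size=2000)   # simulated data for this sketch

# standard normal density evaluated at every data point
dens = np.exp(-0.5 * d**2) / np.sqrt(2 * np.pi)

likelihood = np.prod(dens)              # product of 2000 numbers below 0.4
log_likelihood = np.sum(np.log(dens))   # same quantity on the log scale

print(likelihood)       # underflows to 0.0 in double precision
print(log_likelihood)   # a finite (large negative) number
```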

**Q** Turn on the Log Likelihood checkbox above and see if you can do better. Note that you'll get a numerical error if any of the data lie outside the support of the distribution, but this will go away once you fix that issue. In the cell below, enter your best estimates of the parameters.

## Automated optimization

Many approaches have been developed for searching parameter spaces to find the parameters that generate the maximum or minimum value of a function.

### Nelder--Mead Simplex

One very popular algorithm is the Nelder--Mead simplex.

It involves growing and shrinking a simplex (a generalization of a triangle to multiple dimensions) to search the parameter space efficiently to minimize a function.

![](https://upload.wikimedia.org/wikipedia/commons/thumb/d/de/Nelder-Mead_Himmelblau.gif/640px-Nelder-Mead_Himmelblau.gif)
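
Before applying the method to a likelihood, it may help to see it on a function whose minimum is known. The sketch below minimizes a simple bowl-shaped surface with `scipy.optimize.minimize`; the function `bowl` and the starting point are made up for this example.

```python
import numpy as np
import scipy.optimize as opt

def bowl(params):
    """A simple surface whose minimum is at (1, -2)."""
    a, b = params
    return (a - 1.0) ** 2 + (b + 2.0) ** 2

result = opt.minimize(bowl, x0=[5.0, 5.0], method='Nelder-Mead')

print(result.x)        # location of the minimum found
print(result.fun)      # function value there
print(result.success)  # whether the optimizer reports convergence
```

The returned object bundles the best parameters (`x`), the function value at that point (`fun`), and diagnostic information about the search.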

In [None]:
# scipy includes lots of optimization methods
import scipy.optimize as opt

In [None]:
# Define a likelihood function
def like_fun(params, *args):
    # pull the model and dat out of the args
    model = args[0]
    dat = args[1]
    
    # instantiate the model with the params
    dist = model(*params)
    
    # calc the log like
    log_like = np.log(dist.pdf(dat)).sum()
    if np.isnan(log_like):
        log_like = -np.inf
    
    # return the negative of it to minimize
    return -log_like

## Calculate the best-fitting params

Execute the following code block to run Nelder-Mead on your dataset, using your chosen model from the widget above. Hopefully, in some small amount of time, the algorithm will converge on a solution, and you'll get a printout. See if you can figure out what the best parameter estimates are from the output.

Notice that each time you run the code below, a new starting value for the optimization is generated. Try running the optimization a few times and see if you get similar answers. Why is it important to do this?

If you're not sure if you've chosen the right PDF, go back to the widget, pick a different one, and rerun the cell below.

In [None]:
# grab the distribution and data from the UI above
dist_name = dist_tab.get_title(dist_tab.selected_index)
model = dist_dict[dist_name]
dat = data[ds_ind.value]

# set the bounds for the distribution
bound_dict = {'Beta': [(0, 10), (0, 10)],
              'Exponential': [(0, 20)],
              'Gamma': [(0, 10), (0, 10)],
              'Normal': [(-10, 10), (0, 10)],
              'Uniform': [(-10, 10), (-10, 10)]}
bounds = bound_dict[dist_name]

# generate a random starting point based on the bounds
# NB: it's possible to generate invalid starting points
x0 = [dists.uniform(*b).rvs() for b in bounds]

# print some information about the distribution and starting values
print('Dataset:', ds_ind.value)
print('Distribution:', dist_name)
print('Starting value:', x0)
print()

# run the optimizer
# NOTE: not all methods make use of the bounds argument
res = opt.minimize(like_fun, x0, args=(model, dat), 
                   #bounds=bounds,
                   #method='L-BFGS-B',
                   #method='BFGS',
                   method='Nelder-Mead',
                   #method='TNC'
                  )
print(res)

## Visualize the fit

Let's visualize the likelihood as a 2D function of the parameters and see how our fit looks. If you're far away from where you think you should be, try picking some different values in the widget and/or re-running the optimization.

Note that this will only work for PDFs that have two parameters.

In [None]:
# set the grid of points to evaluate
x = np.linspace(bounds[0][0], bounds[0][1], 100)
y = np.linspace(bounds[1][0], bounds[1][1], 100)
xx, yy = np.meshgrid(x, y, sparse=True)

# evaluate the likelihood
z = np.log(model(xx, yy).pdf(dat[:, np.newaxis, np.newaxis])).sum(0)
z = np.exp(z)

# plot the contour and the best-fit value
plt.contourf(x, y, z, 100)
plt.plot(res.x[0], res.x[1], 'x', markersize=12, color='red')
plt.colorbar()
#plt.xlabel()
#plt.ylabel()

**Q**: Is your estimate at the peak of the likelihood function? Are some parameters constrained more or less than others?

## Assessing model fit

An important question you need to address when using a statistical model is how well your model fits the data.

Your fit can be bad for any number of reasons, but one rather common one is that your model is wrong.

If your model is really wrong, then the parameter estimates are not worth much.

However, because all models are wrong (see Box's Dictum), you can only compare models to see which are less wrong.

How do we do this?

We could simply compare the maximum likelihoods between the models, but that doesn't take into account the complexity of each model and amount of data. 

One more principled approach is the Bayesian Information Criterion (BIC):

$$BIC = \ln(n)\,k - 2\ln(\hat{L}),$$

where $n$ is the number of data points, $k$ is the number of parameters, and $\hat{L}$ is the maximum likelihood value of the model $M$.

***SMALLER BIC values are better!!!***

In [None]:
# Calculate the model fit with BIC
# recall that the optimization returns the negative log likelihood
n = len(dat)
k = len(res.x)
L = -res.fun
bic = np.log(n)*k - 2*(L)
print('BIC:', bic)

## Model Comparison

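But how do we decide if one model is better than the other? We can compare BIC values between models, turning the difference into an (approximate) Bayes Factor!

$$BF = \exp\left(\frac{BIC_0 - BIC_1}{2}\right)$$

Values above 1 favor model 1 (the model with the smaller BIC); values below 1 favor model 0, in which case you can take the reciprocal. The strength of the evidence is interpreted with the help of the following guidelines: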

| Bayes Factor | Evidence |
|--------------|----------|
| 1--3         | Weak     |
| 3--20        | Positive |
| 20--150      | Strong   |
| >150         | Very Strong | 
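
The conversion and the lookup in the table can be sketched as a pair of small helpers. The function names `bayes_factor` and `evidence_label` are hypothetical, the BIC values in the example are made up, and how the boundaries between rows are handled (for example, whether exactly 3 counts as "Weak" or "Positive") is an assumption of this sketch.

```python
import numpy as np


def bayes_factor(bic_0, bic_1):
    """Approximate Bayes Factor for model 1 over model 0 from BIC values."""
    return np.exp((bic_0 - bic_1) / 2.0)


def evidence_label(bf):
    """Map a Bayes Factor onto the evidence categories in the table."""
    if bf < 1:
        # the comparison favors model 0; use 1/bf to gauge how strongly
        return 'Favors model 0'
    elif bf < 3:
        return 'Weak'
    elif bf < 20:
        return 'Positive'
    elif bf <= 150:
        return 'Strong'
    return 'Very Strong'


# made-up BIC values: model 1 has the smaller (better) BIC
bf = bayes_factor(bic_0=100.0, bic_1=94.0)
print('Bayes Factor:', bf, '->', evidence_label(bf))
```

When the Bayes Factor is below 1, taking its reciprocal gives the evidence in favor of model 0 on the same scale.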


**Q**: Copy your first BIC value into the cell below, assigning it to `bic_0`. Now try fitting your data to another distribution function using the widget above. Assign the second BIC value to `bic_1` below and then execute the cell to calculate the Bayes Factor. Then, in the text cell below, describe which two PDFs you considered, the maximum likelihood *parameter estimates*, and which model fit the best.

In [None]:
# calc Bayes Factor
# (enter numbers from the BIC assessments above)
bic_0 = -33.93
bic_1 = -31.07
np.exp((bic_0 - bic_1)/2.)

## t-test example

We now have all the tools necessary to perform statistical inference (though we will improve on all these approaches in the coming weeks). You can:

- Use optimization techniques to identify the parameters that give rise to the maximum likelihood of observing the data given the model
- Assess model fit
- Compare models to guide model selection

Let's try a simple example of performing a t-test via model comparison approaches!

You're hopefully quite familiar with how you can use a t-test to determine if the mean of a sample is significantly different from zero, or to determine if two samples have different means. But what's going on under the hood?

Let's consider the one-sample t-test. You run a t-test and get a $p$ value, which tells you the probability that you'd get observations at least as extreme as yours if the data were from a normal distribution with mean $\mu = 0$ (the null hypothesis).

The model comparison standpoint is similar, but what you're testing is whether there's more evidence for a model where $\mu \neq 0$ than for one in which $\mu = 0$. We can make this comparison using Bayes Factors.
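
As a rough preview of the idea, here is a self-contained sketch that compares the two models using a plain normal likelihood, where the maximum likelihood estimates have closed forms and no optimizer is needed. This is a simplification chosen for illustration, not the Student's t model fit in the cells below, and the helper `normal_bic` and the simulated `sample` are hypothetical names.

```python
import numpy as np
from scipy import stats


def normal_bic(sample, mu, sigma, k):
    """BIC of a normal model with k free parameters at the given estimates."""
    log_like = stats.norm.logpdf(sample, loc=mu, scale=sigma).sum()
    return np.log(len(sample)) * k - 2.0 * log_like


# simulated data, used only for this illustration
rng = np.random.default_rng(1)
sample = rng.normal(loc=0.3, scale=0.5, size=10)

# full model: mean and standard deviation are both free (k = 2)
bic_full = normal_bic(sample, sample.mean(), sample.std(ddof=0), k=2)

# null model: mean fixed at zero, only the standard deviation is free (k = 1)
# with the mean fixed at 0, the ML estimate of sigma is sqrt(mean(x**2))
sigma_null = np.sqrt(np.mean(sample ** 2))
bic_null = normal_bic(sample, 0.0, sigma_null, k=1)

# Bayes Factor for the full model over the null model
bf = np.exp((bic_null - bic_full) / 2.0)
print('BIC null:', bic_null, 'BIC full:', bic_full)
print('Bayes Factor (full vs. null):', bf)
```

The full model always fits at least as well as the null, since it contains the null as a special case, so the comparison comes down to whether the improvement in fit outweighs the $\ln(n)$ penalty for the extra parameter.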

In [None]:
# generate some data that may or may not be significantly different from zero.
# you can play around with the sample size, mean, and std deviation later
A = dists.normal(.3, .5).rvs(10)

# plot it
plt.hist(A, bins='auto', density=True);

## Standard t-test

First we'll perform a standard one-sample t-test on our data.

In [None]:
# do a one-sample t-test
import scipy.stats as stats

stats.ttest_1samp(A, 0.0)
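
To see what the test is doing under the hood, the statistic and two-sided $p$ value can be reproduced by hand. This is a minimal sketch on simulated data (the variable `sample` is hypothetical and stands in for `A`), using the textbook definition $t = \bar{x} / (s / \sqrt{n})$ with $n - 1$ degrees of freedom.

```python
import numpy as np
from scipy import stats

# simulated data, used only for this illustration
rng = np.random.default_rng(42)
sample = rng.normal(loc=0.3, scale=0.5, size=10)

n = len(sample)

# t statistic: sample mean divided by its standard error
# (ddof=1 gives the unbiased sample standard deviation)
t_stat = sample.mean() / (sample.std(ddof=1) / np.sqrt(n))

# two-sided p value: probability of a |t| at least this large under the null
p_value = 2 * stats.t.sf(np.abs(t_stat), df=n - 1)

# compare with scipy's built-in one-sample t-test
result = stats.ttest_1samp(sample, 0.0)
print('by hand:', t_stat, p_value)
print('scipy:  ', result.statistic, result.pvalue)
```

The two lines of output should agree, since `ttest_1samp` computes the same quantities.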

## Fit a Student's t model

Next we fit the full model to the data.

In [None]:
# Define a likelihood function
def students_like(params, *args):
    # pull the data out of the args
    dat = args[0]
    df = len(dat) - 1
    
    # instantiate the model with the params, 
    # the df is determined from the data
    dist = dists.students_t(params[0], params[1], df)
    
    # calc the log like
    log_like = np.log(dist.pdf(dat)).sum()
    if np.isnan(log_like):
        log_like = -np.inf
    
    # return the negative of it to minimize
    return -log_like

# pick a central starting point
x0 = [0.0, 1.0]

# run the optimization
res = opt.minimize(students_like, x0, args=(A,), 
                   #bounds=bounds,
                   #method='L-BFGS-B',
                   #method='BFGS',
                   method='Nelder-Mead',
                   #method='TNC'
                  )
print(res)

# calculate the BIC for this model and save it
n = len(A)
k = len(res.x)
L = -res.fun
bic_1 = np.log(n)*k - 2*(L)
print('BIC:', bic_1)

## Fit a null hypothesis model

Now we fit a model representing the null hypothesis that the mean of the data is actually 0.0. 

Note how we simply fix the mean of the Student's t distribution to zero, but still fit the standard deviation.

In [None]:
# Define a likelihood function
def students_null_like(params, *args):
    # pull the data out of the args
    dat = args[0]
    df = len(dat) - 1
    
    # instantiate the model with the params
    # mean is fixed at zero and the df is determined from the data
    dist = dists.students_t(0.0, params[0], df)
    
    # calc the log like
    log_like = np.log(dist.pdf(dat)).sum()
    if np.isnan(log_like):
        log_like = -np.inf
    
    # return the negative of it to minimize
    return -log_like

# start at same point (though mean is fixed at zero)
x0 = [1.0]

# run the optimization
res = opt.minimize(students_null_like, x0, args=(A,), 
                   #bounds=bounds,
                   #method='L-BFGS-B',
                   #method='BFGS',
                   method='Nelder-Mead',
                   #method='TNC'
                  )
print(res)

# calculate and print the BIC
n = len(A)
k = len(res.x)
L = -res.fun
bic_0 = np.log(n)*k - 2*(L)
print('BIC:', bic_0)

## Model Comparison

Now that we have the BIC values for each model, we can use the Bayes Factor to determine whether the full model is preferred to the null model. 

We want a big number here. If it's less than 1.0 then there is no evidence that the alternative/full model should be preferred to the null model (i.e., no evidence that the mean of the distribution differs from 0.0).

In [None]:
# calculate Bayes Factor
bf = np.exp((bic_0 - bic_1)/2.)
print('Bayes Factor:', bf)

**Q**: In the cell that generates the observations, try three different values for the mean, standard deviation OR sample size. Then run the model-fitting code for each case. In the cell below, compare and contrast how changing these values affects the Bayes Factor and the p-value in the standard t-test.