# 2 Probability, Distributions, and Measurement

## 2A More on Python

If you're new to programming in general, you should be working your way through all the chapters of the [Python for Data Science](https://www.datacamp.com/courses/intro-to-python-for-data-science) DataCamp tutorial.

If you're a more accomplished programmer but new(ish) to Python, you can get a more detailed primer on the same material by going through chapters 1-4 of the [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/).

Assuming you've completed these activities, the sections below will be a review of some key concepts.

### Objects and packages

Python is firmly grounded in the tradition of *object-oriented programming* (OOP). In this approach, the logic for dealing with different kinds of data is *encapsulated* with, or *attached* to, the data. A data object can

- have *attributes* that tell us about that object instance
- have *methods* that are functions an object can perform
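
As a quick illustration of the difference, here is a minimal sketch using Python's built-in complex number type (the variable name `z` is just for illustration). Attributes are read without parentheses; methods are called with them.

```python
# A complex number is an object with both attributes and methods.
z = 3 + 4j

# attributes: data stored on the instance (no parentheses)
print(z.real)   # 3.0
print(z.imag)   # 4.0

# methods: functions attached to the instance (called with parentheses)
print(z.conjugate())   # (3-4j)
```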


### What is an instance?

Every value in Python is an *instance* of some type. Each time you create a new value, the interpreter is *instantiating* an object of a defined type, allocating memory and setting initial values.

Thus, defining a variable simply points the variable's name (in the current namespace) to a chunk of memory containing the instance.

Python keeps track of all object instances and cleans them up when they are no longer needed.
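
One consequence of names pointing to instances is that two names can refer to the same object. The sketch below (with made-up variable names `a`, `b`, and `c`) shows the difference between giving an existing instance a second name and instantiating a new one.

```python
# Two names can point at the same instance.
a = [1, 2, 3]
b = a              # no copy is made; b is another name for the same list
b.append(4)

print(a)           # the change is visible through both names
print(a is b)      # True: same object in memory

# Making a copy instantiates a new, independent object.
c = list(a)
print(c == a)      # True: equal contents
print(c is a)      # False: different instance
```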

You can determine the type of a value or variable using the `type` function:

In [None]:
x = 42
type(x)

The attributes and methods of an object are accessed by using the `.` operator. You can *inspect* the attributes and methods of any object in the notebook by typing `tab` after entering the variable name and `.`, as below:

In [None]:
# let's explore x:
# (place the cursor after x. and press the tab key)
x.

Let's try with a different type:

In [None]:
y = 'The answer to the ultimate question.'
type(y)

In [None]:
# let's explore the string type:
# (press tab after y.)
y.

Note: you can access documentation for functions and methods by using `Shift-Tab`.

In [None]:
# place the cursor between the parentheses and type `Shift-Tab`
y.lower()

### What is a package?

Programming is all about abstraction, and code is meant to be reused. Python has tons of useful packages that extend its basic functionality. This means you don't have to (and shouldn't) reinvent basic algorithms. 

We'll talk more about how to find and evaluate packages later.

You can access external packages (or modules) by **importing** them into your workspace. When you import a package, you should give it a short name that you'll use to access the functions, etc, in the package.

You'll see (and start to use) lots of statements like the following:

```python
import numpy as np
```

After executing this statement, you can access the contents of the module as if they were attributes and methods of an object called `np`, like so: 

In [None]:
import numpy as np
np.arange(10)

### Arrays

The `numpy` package provides Python with an important data structure, the **array**. Arrays are so important to scientific programming that you will almost inevitably import `numpy` at the beginning of every script and notebook you use.

As we discussed previously, **scalar** data types like `int` and `float` can only represent a single number. But time series and point processes are inherently ordered collections of data, and arrays give us the ability to store and manipulate these large aggregates. 

#### Array terminology

- An **element** of an array is one item. It occupies a specific "slot" in the sequence.
- The **length** or **size** of an array is the number of elements.
- The **index** of an element is the numerical position of the element in the array. Python uses *zero-based indexing*, which means that the first element has the index `0`.
- The **data type** of the array is the type of the elements in the array. In arrays, all the elements have the same type. In `numpy`, the data type is stored in the `dtype` attribute.
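
Here's a short sketch that maps each of these terms onto numpy. The array `temps` and its values are invented for illustration.

```python
import numpy as np

temps = np.array([20.5, 21.0, 19.8, 22.3])  # hypothetical readings

print(temps.size)    # number of elements: 4
print(len(temps))    # same thing for a 1-D array: 4
print(temps[0])      # element at index 0 (the first one): 20.5
print(temps.dtype)   # data type shared by all elements: float64

# If you mix ints and floats, numpy picks one dtype for the whole array
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)   # float64
```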

#### Array operations

- **indexing** allows you to access specific elements of an array
- **slicing** allows you to access specific subsets of an array
- **iteration** allows you to process each element of the array in sequence

In numpy, indexing uses square brackets (`[` and `]`):

In [None]:
# first, let's initialize the random seed so that we all get the same answers
np.random.seed(1)
# create an array with 100 random numbers
my_data = np.random.randn(100)
# use indexing to get the first element:
my_data[0]

Slicing also uses square brackets, but instead of a single index, you use two indices separated by `:` to specify a range:

In [None]:
# this will retrieve the first 10 elements of my_data
my_data[0:10]

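Notice how indexing returns a scalar, but slicing returns an array. (For a basic slice like this one, the result is a *view* onto the same data rather than a copy.) Also notice that the last index is *exclusive*; that is, the element at index 10 is NOT returned.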
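
Iteration, the third operation in the list above, uses an ordinary `for` loop. This is a minimal sketch with a small made-up array called `samples`, so it doesn't depend on any other cell.

```python
import numpy as np

samples = np.array([0.5, -1.2, 2.0, 0.1])  # hypothetical values

# iterate over the elements in order
for value in samples:
    print(value)

# enumerate() also gives you the index of each element
n_positive = 0
for i, value in enumerate(samples):
    if value > 0:
        n_positive += 1
        print("element", i, "is positive")

print(n_positive)   # 3
```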

#### Some simple problems

Write an expression for the 88th value of `my_data` in the cell below:

Write an expression for the last 10 elements of `my_data`:

## 2B Basics of Probability

Let's consider two events, $A$ and $B$.

**The probability of $A$**

$$P(A) \in [0, 1]$$

**The probability of not $A$**

$$P(\lnot A) = 1 - P(A)$$

**The probability of $A$ *or* $B$**

$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$ 

If $A$ and $B$ are mutually exclusive, 

$$P(A \cup B) = P(A) + P(B)$$

**The probability of $A$ and $B$. This is also called the joint probability**

$$P(A,B) = P(A \cap B) = P(A \mid B) P(B) = P(B \mid A) P(A)$$

If $A$ and $B$ are independent,

$$P(A,B) = P(A) P(B)$$

**The probability of $A$ given $B$. This is called the conditional probability**

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(B \mid A) P(A)}{P(B)}$$
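
To see these rules in action, here is a sketch that checks them by brute-force counting over a standard 52-card deck. The events ($A$ = "the card is a heart", $B$ = "the card is a face card") and the helper `prob` are chosen purely for illustration.

```python
from fractions import Fraction
from itertools import product

ranks = ['A', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K']
suits = ['hearts', 'diamonds', 'clubs', 'spades']
deck = list(product(ranks, suits))   # 52 equally likely (rank, suit) pairs

def prob(event):
    """Probability of an event, given as a function card -> bool."""
    favorable = sum(1 for card in deck if event(card))
    return Fraction(favorable, len(deck))

def is_heart(card):
    return card[1] == 'hearts'

def is_face(card):
    return card[0] in ('J', 'Q', 'K')

p_a = prob(is_heart)                                    # 1/4
p_b = prob(is_face)                                     # 3/13
p_a_and_b = prob(lambda c: is_heart(c) and is_face(c))  # 3/52
p_a_or_b = prob(lambda c: is_heart(c) or is_face(c))    # 11/26

# addition rule: P(A or B) = P(A) + P(B) - P(A and B)
print(p_a_or_b == p_a + p_b - p_a_and_b)   # True

# conditional probability: P(A | B) = P(A and B) / P(B)
p_a_given_b = p_a_and_b / p_b
print(p_a_given_b)                         # 1/4

# suit and rank are independent, so P(A and B) = P(A) P(B)
print(p_a_and_b == p_a * p_b)              # True
```

Because $P(A \mid B) = P(A)$ here, knowing that the card is a face card tells you nothing about whether it's a heart. That is exactly what independence means.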



### Simple examples

Using the rules above, let's try and figure out the probabilities of the following:

1. Rolling a 5 on a 6-sided die.
2. Not rolling a 3 on a 6-sided die.
3. Rolling a 4 or a 5 on a 6-sided die.
4. Rolling less than 4 or an even number on a 6-sided die.
5. Rolling two 3's in a row on a 6-sided die.

Enter your answers as text in the cell below:

### Harder question

#### The case of Tim Tebow

What is the probability of becoming a Major League Baseball (MLB) player if you hit a home run (HR) in your first at-bat in the minor leagues? We have the following important information:

a) 5% of future MLB players hit a home run in their first minor league at-bat.

b) 1% of minor league players make it to MLB.

c) Only 0.1% of players hit a home run in their first minor league at-bat.

Hint: You are trying to solve $P(MLB \mid HR)$.

Bonus: What percentage of players who don't make it to MLB hit a home run in their first at-bat?

Write code to answer the question in the cell below:

### Probability Distributions

At the core of probability theory are the mathematical functions determining the probability of the potential outcomes of an experiment. 

In statistics, these distributions represent the models that attempt to describe the observed data. The equations take in parameters that determine the shape of the probability distributions.
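
To make "parameters determine the shape" concrete, here is a sketch that writes out the normal PDF by hand using only numpy. The function `normal_pdf` is a hypothetical helper written for this example; it is not part of the `tools.dists` module used in this notebook.

```python
import numpy as np

def normal_pdf(x, mean, std):
    """Normal probability density, written out from its formula."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]

peaks = {}
areas = {}
for std in (0.2, 1.0, 2.0):
    pdf = normal_pdf(x, mean=0.0, std=std)
    peaks[std] = pdf.max()        # height of the curve at its highest point
    areas[std] = pdf.sum() * dx   # crude numerical integral over the grid
    print("std:", std, "peak:", peaks[std], "area:", areas[std])
```

Changing `std` makes the curve taller and narrower or shorter and wider, but the area under it stays at (approximately) 1. Note that the peak can exceed 1 when `std` is small: the height of a PDF is a probability *density*, not a probability.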

Let's explore some continuous and discrete probability distributions relevant to quantifying data and models!!!

In [None]:
# load matplotlib inline mode
%matplotlib inline

# import some useful libraries
import numpy as np                # numerical analysis and linear algebra
import matplotlib.pyplot as plt   # plotting
import ipywidgets as widgets      # interactive widgets

# import some distributions
from tools.dists import uniform, normal, beta, gamma, invgamma, exp, poisson, laplace, students_t, noncentral_t, halfcauchy

In [None]:
# this cell defines a function that will let us plot probability distributions easily

def plot_pdf(dist, support=(-5, 5), npoints=100):
    """Plot a probability density function over a support interval.

    dist - a distribution object with a pdf() method (e.g. a scipy distribution)
    support - the (lower, upper) range of values over which to evaluate the PDF
    npoints - the number of points within the support to evaluate
    """
    # evenly spaced points spanning the support
    x = np.linspace(support[0], support[1], npoints)
    pdf = dist.pdf(x)
    plt.plot(x, pdf, lw=3)
    plt.xlabel('Value')
    # the height of a PDF is a density, not a probability
    plt.ylabel('Probability density')

#### Uniform

A continuous probability distribution assigning equal probability over a range.

What happens when we change the range?


In [None]:
# plot the PDF
plot_pdf(uniform(lower=-2, upper=2))
plot_pdf(uniform(lower=-1, upper=1))


#### Normal

Need I say more?

In [None]:
plot_pdf(normal(mean=0, std=1))
plot_pdf(normal(mean=1, std=2))

#### Beta

Has support only between 0 and 1, which makes it useful for describing the probability of a probability.

We'll spend some time with Beta distributions in subsequent classes.

In [None]:
plot_pdf(beta(alpha=0.5, beta=0.5), support=[0,1])
plot_pdf(beta(alpha=2, beta=5), support=[0,1])

#### And more...

Take some time now to explore:

- Gamma
- Inverse Gamma
- Exponential
- Student's t
- Half Cauchy
- Poisson

**Assignment**: choose two distributions. Try to replicate the illustrative plots on their respective pages on Wikipedia. Insert code cells below to generate the plots.

## Observation and Inference

Why is probability theory important to computational neuroscience? 

Fundamentally, the problem we face is that we can't directly determine any physical quantity. We can only take a measurement of it, and that measurement will have errors. Thus, each time we make a measurement, we are going to get a value that comes from a **distribution**.

Even more troubling, we rarely are able to directly measure the actual quantities we care about. Think back to the last exercise (or run it, if you don't remember!). It was clear that a limited set of sounds were activating the neuron (i.e., generating synaptic excitation), but we didn't have a direct measurement of how strong that excitation was. We could only observe that the neuron produced action potentials at higher rates during certain intervals.

The process of using observations to gain information about unobservable quantities is called **inference** or **estimation**, and probability theory gives us the tools we need to make this connection.

## Sampling and Measurement

Let's come up with an **observational model** that formalizes what we think is going on when we make a measurement.

Observational models are fundamental to all statistical analyses and will be a part of many of the more complex models we develop.

Let's say that we have a bar of length $\mu$. We can measure the bar's length as many times as we like, and each time we do so, we'll get a value:

$$y_i = \mu + \varepsilon_i$$

Notice the subscript $i$. This is a numerical **index** for the measurement. Let's stick with the Python convention and have $y_0$ indicate the value of the first measurement.

This model simply says that any given measurement $y_i$ will be the true length of the bar plus some (hopefully small) error $\varepsilon_i$. That is, we're assuming **additive error**.

### The normal error model

There are many potential sources of error in our measurement.

The bar might be fluctuating slightly in length due to changes in temperature. This is an example of an **intrinsic error**.

The instrument making the measurement might have limited precision due to how it's constructed or because of electrical interference. This is an example of **extrinsic error**.

Thanks to the [Central Limit Theorem](https://en.wikipedia.org/wiki/Central_limit_theorem), a reasonable assumption is that the sum of all these sources of error will have a normal (or Gaussian) distribution.
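
Here's a sketch of that idea. It assumes, purely for illustration, that each measurement is disturbed by 20 independent error sources, each drawn from a *uniform* distribution. None of these numbers come from a real instrument.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sources = 20             # hypothetical number of independent error sources
n_measurements = 100_000

# each row is one measurement; each column is one small, non-normal error source
small_errors = rng.uniform(-0.5, 0.5, size=(n_measurements, n_sources))
total_error = small_errors.sum(axis=1)

# a uniform distribution on [-0.5, 0.5] has variance 1/12, and the
# variances of independent errors add
expected_std = np.sqrt(n_sources / 12)

# for a normal distribution, about 68% of values fall within one std of the mean
within_one_std = np.mean(np.abs(total_error) < expected_std)

print("mean of total error:", total_error.mean())
print("std of total error:", total_error.std(), "expected:", expected_std)
print("fraction within one std:", within_one_std)
```

Even though no single source is normally distributed, the summed error behaves very much like a normal distribution with mean zero.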

We can formalize this assumption by saying that $\varepsilon_i$ is **drawn** or **sampled** from a normal distribution. This relationship is often signified with $\sim$.

$$\varepsilon_i \sim N(0, \sigma^2)$$

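$N$ represents the normal distribution, and as we saw above, this distribution has two parameters: mean and variance. (The `normal` function we used above takes the standard deviation, `std`, which is the square root of the variance.)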

Why is the mean for the error distribution zero? What does the variance correspond to?

### Simulating measurements

Python can generate numbers from a distribution. The numbers are usually not truly random, because computers are (generally) deterministic, but the values will occur with the probability specified by the underlying PDF. Let's generate 10 measurements of our bar:

In [None]:
from tools import data
# true length of bar:
mu = 12.1
# number of measurements
N = 10
# errors
epsilon = normal(mean=0.0, std=data.e2_std).rvs(N)
# measurements
y = mu + epsilon

# inspect the variables
print("errors:", epsilon)
print("measurements:", y)

An aside: numpy arrays support **broadcasting**, which is what allows us to simply add a scalar `mu` to an array `epsilon` and get a new array in which `mu` has been added to every element of `epsilon`. If you're used to lower-level languages like C or Java, make sure you take advantage of this feature, as it's MUCH faster than iterating through the array.
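
To see what broadcasting saves you from writing, here is a sketch comparing it with the explicit loop you might write in C. The names `mu_demo` and `errors_demo` and their values are made up so that this cell doesn't overwrite the variables above.

```python
import numpy as np

mu_demo = 12.1                                # hypothetical scalar
errors_demo = np.array([0.02, -0.05, 0.01])   # hypothetical errors

# broadcasting: the scalar is added to every element in one step
broadcast_result = mu_demo + errors_demo

# the equivalent explicit loop
loop_result = np.empty_like(errors_demo)
for i in range(errors_demo.size):
    loop_result[i] = mu_demo + errors_demo[i]

print(np.allclose(broadcast_result, loop_result))   # True
```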

### Histograms

It's useful to look at the raw data, but distributions are usually better visualized as **histograms**. A histogram is a plot that divides a range into a set of intervals or **bins**, and then counts the number of values in each bin.
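
To make the definition concrete, here is a sketch that does the binning and counting without drawing anything, using `np.histogram` on a handful of made-up values.

```python
import numpy as np

values = np.array([0.1, 0.4, 0.5, 1.2, 1.9, 2.0])  # hypothetical data

# divide the range [0, 2] into 2 equal bins and count the values in each
counts, edges = np.histogram(values, bins=2, range=(0, 2))
print(edges)    # [0. 1. 2.]
print(counts)   # [3 3]

# with density=True, the counts are rescaled so the bar areas sum to 1
density, _ = np.histogram(values, bins=2, range=(0, 2), density=True)
print(density)  # [0.5 0.5]
```

Note that a value sitting exactly on the upper edge of the range (here, 2.0) is counted in the last bin.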

Matplotlib has a histogram function, so no need to reinvent the wheel here:

In [None]:
freq, bins, _ = plt.hist(y, range=(10, 14), bins=10, density=True)

Oops! All the observations fall in two bins. Edit the code cell above to change the `range` and/or `bins` parameters so that the plot gives you more useful information.

### Fitting the observational model

Once you've got a nice histogram, try fitting it to a normal PDF. Copy the plot statement from the code cell above into the cell below, then adjust the mean and std parameters in the `plot_pdf` until you get what looks like a good fit.

In [None]:
### COPY histogram command in the line below

### EDIT this line to adjust the PDF
plot_pdf(normal(mean=???, std=???), support=(11, 13))

The `mean` and `std` parameters you settle on are called your **estimates**.

It may not be possible to achieve a good fit with so few observations. What effect does this have on your ability to come up with good parameter estimates?

Explore whether increasing the sample size helps. Copy ONLY the relevant lines of code from above and paste them into the cell below. (Normally we avoid copying and pasting like the plague, but this will help you think about what each statement is doing so that you can choose ONLY the relevant ones.)

In [None]:
## paste your code here

### Summary statistics

Obviously, manually fitting a PDF to the histogram is both tedious and error-prone.

Because we're using a normal error model, we can estimate the parameters directly by computing the mean and standard deviation.
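
For reference, these are the usual formulas for those two summary statistics, using the zero-based indexing from earlier. The hats on $\hat{\mu}$ and $\hat{\sigma}$ are notation introduced here to distinguish the estimates from the true values.

$$\hat{\mu} = \frac{1}{N} \sum_{i=0}^{N-1} y_i$$

$$\hat{\sigma} = \sqrt{\frac{1}{N} \sum_{i=0}^{N-1} (y_i - \hat{\mu})^2}$$

Some texts divide by $N - 1$ instead of $N$ in the second formula. For large $N$ the difference is negligible, but it can matter for small samples like ours.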

Consult the numpy documentation (see link under the `Help` menu) and find the array methods that will compute these summary statistics, then edit the code cell below so that it prints out the values.

In [None]:
print("The mean is", y.<insert-method-call-here>)
print("The standard deviation is", y.<insert-method-call-here>)

How close were your estimates to the summary statistics? How close are the summary statistics to the values of $\mu$ and $\sigma$ that were used to generate the data?

## Bonus activity

If you have time, try fitting a normal additive error model to data that were generated using **multiplicative error**:

In [None]:
z = data.mult_error_data(mu, N)