>#### "One sees, from this Essay, that the theory of probabilities is basically just common sense reduced to calculus; it makes one appreciate with exactness that which accurate minds feel with a sort of instinct, often without being able to account for it."
> "Théorie Analytique des Probabilités" Pierre-Simon Laplace

# Probability

Because they deal with uncertain events, most machine learning methods can be framed in the language of probability calculus. 
In this notebook I will very briefly recall the basic concepts of probability calculus and introduce the notation I will be using, hopefully consistently, throughout the lecture.

But keep in mind that this is not supposed to be a textbook on probability! Please treat this as a list of concepts and definitions that you have to refresh. It will also serve as a brief introduction to various Python packages. But again, this is not a tutorial on Python. The code is provided as guidance for you and it's up to you to look up explanations in the documentation if needed. I am of course also happy to help. You can consult me on the chat on Teams. 

The lecture includes some simple problems to help you check your understanding. Some problems have answers right in the notebook. I will try to hide the content of these cells; please try to solve the problem before looking at the answer. 

## Random events

Imagine any process that can have an unpredictable outcome. This could be the result of a coin toss, the number of passengers on a bus, etc. Let's however assume that we know the set of all possible outcomes of this process and call this set $S$. This set is often called the _sample space_.

Any subset $A$ of $S$, denoted $A\subseteq S$, will be called an _event_. If the process has an outcome $s\in S$ then we say that the event $A$ happened if $s\in A$. An event that contains only one element, $\{s\}$, will be called an _elementary_ event, _atomic_ event or _sample point_.

A typical textbook example would be a coin toss. In this case $S=\{H, T\}$ and there are only four possible events (including the empty set): $\varnothing$, $\{H\}$, $\{T\}$ and $\{H,T\}$. There are two elementary events, $\{H\}$ and $\{T\}$.  
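
The four events of the coin toss can be listed mechanically as all subsets of $S$. The sketch below is only an illustration; the helper name `all_events` is my own and is not used elsewhere in the notebook.

```python
from itertools import chain, combinations

S_coin = {"H", "T"}

def all_events(sample_space):
    """Return every subset (event) of a finite sample space as a frozenset."""
    items = sorted(sample_space)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return [frozenset(c) for c in subsets]

events = all_events(S_coin)
print(len(events))
for event in events:
    print(set(event) if event else "empty set")
```

A sample space with $n$ elements has $2^n$ events, so this enumeration is practical only for very small examples.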

#### Example: Dice roll

What is the set of all possible outcomes of a roll of two dice? How many elements does it contain? Write down the event $A$: "the sum of the points is 9".  

$$S=\{(i,j): i,j\in\{1,2,3,4,5,6\}\},\quad \#S=36,\quad A=\{(3,6), (4,5), (5,4), (6,3)\}$$

For larger examples this would be impractical, but just for fun let's code this in Python

In [None]:
from itertools import product

In [None]:
S_dice = {(i,j) for i,j in product(range(1,7), range(1,7))}

In [None]:
print(len(S_dice))
print(S_dice)

In [None]:
A = set( filter(lambda s: sum(s)==9, S_dice) )
print(A)

### Probability of an event

Because the outcome of a process is unpredictable, so are the events. However, some events are more likely to happen than others and we can quantify this by assigning a number to each event that we call the _probability_ of that event:

$$0\leq P(A) \leq 1$$

What this number really means is still subject to discussion and interpretation and I will not address this issue. For me this is a measure of the "degree of certainty", with zero probability denoting an impossible event and one denoting a certain event. What is important is that those numbers cannot be totally arbitrary. To be considered a valid measure, probabilities must satisfy several axioms consistent with our common sense: 

1.

$$P(A)\ge 0$$

2.

$$P(S)=1$$

3.

For any integer $k>1$, including $k=\infty$, if the events $A_i$ are mutually disjoint, that is $A_i \cap A_j =\varnothing$ for each $i\neq j$, then 

$$P(A_1\cup A_2\cup\cdots \cup A_k) = P(A_1)+P(A_2) + \cdots + P(A_k)$$

An important corollary is that when the set of outcomes is countable, the probability of an event $A$ is the sum of the probabilities of the elementary events contained in $A$:

$$P(A) = \sum_{s\in A}P(\{s\})$$

A set is countable when we can assign a unique natural number to each of its elements; in other words, we can count its elements. All finite sets are of course countable. Examples of uncountable sets are the real numbers or any interval $[a,b)$ with $b>a$. 

This means that in case of countable outcomes it is enough to specify the probability of each elementary event. 

In the following I will omit the set braces for the elementary events, i.e. assume $P(s)\equiv P(\{s\})$. It follows from axioms 2 and 3 that

$$\sum_{s\in S} P(s) = 1$$
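
As a small illustration of the corollary, the sketch below stores the probability of every elementary event of the two dice roll in a dictionary and obtains $P(A)$ by summation. I assume here that all 36 outcomes are equally likely; the names `P_elem` and `prob` are made up for this example.

```python
from fractions import Fraction
from itertools import product

# Model assumption: two fair dice, so every outcome has probability 1/36.
S_two_dice = set(product(range(1, 7), repeat=2))
P_elem = {s: Fraction(1, len(S_two_dice)) for s in S_two_dice}

def prob(event, pmf):
    """P(A) as the sum of the probabilities of the elementary events in A."""
    return sum((pmf[s] for s in event), Fraction(0))

A_nine = {s for s in S_two_dice if sum(s) == 9}
print(prob(A_nine, P_elem))       # probability that the sum is 9
print(prob(S_two_dice, P_elem))   # probability of the whole sample space
```

The same `prob` function works for any assignment of probabilities to elementary events, not only for the equiprobable one.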

#### Problem: Complementary event

Prove that 

$$P(S\setminus A)= 1-P(A)\text{ where } S\setminus A = \{s\in S: s\notin A\}$$

__Answer__

It follows directly from the second and third axiom after noting that

$$(S\setminus A) \cup A = S \text{ and } (S\setminus A) \cap A = \varnothing$$

## Calculating probability

The concept of the probability can be somewhat hazy and verges upon philosophy. My take on this is that to calculate the probability we need a _model_ of the process. E.g. for the dice example the model is that all elementary events are equiprobable,  leading to assigning probability $1/36$ to every possible two dice roll outcome. 

The connection with experiment (reality) is given by Borel's law of large numbers. It states that if we repeat an experiment under the same conditions many times, the fraction of times an event happens will converge to the probability of this event. This is the foundation of the _frequentist_ interpretation of probability. 
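
The law of large numbers can be watched at work in a simple simulation. The sketch below, which assumes two fair dice and uses only the standard `random` module, estimates the probability of the event "the sum is seven" from the fraction of rolls in which it happens. The function name `relative_frequency` is my own.

```python
import random

def relative_frequency(n_rolls, seed=0):
    """Fraction of n_rolls two-dice rolls whose sum equals 7."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_rolls):
        if rng.randint(1, 6) + rng.randint(1, 6) == 7:
            hits += 1
    return hits / n_rolls

for n_rolls in [100, 10_000, 200_000]:
    print(n_rolls, relative_frequency(n_rolls))
```

With a growing number of rolls the printed fraction should settle close to $1/6$, the value predicted by the equiprobable model.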

It is harder to interpret the probability of one-off events, e.g. "there is a 30% chance that it will rain tomorrow" or "there is an 80% chance that Andrzej Duda will win the elections". By the way, the last statement should actually be a conditional probability: "assuming that the elections will take place". Please take some time to think how such statements can be interpreted.

## Conditional probability

How does the probability of an event change when we know that some other event happened? That is a central question in machine learning and it is answered by _conditional_ probability

$$P(A|B)$$

This denotes the probability that event $A$ happened on condition that the event $B$ happened. The formal definition (valid when $P(B)>0$) is

$$P(A|B) = \frac{P(A\cap B)}{P(B)}$$

Let's take as an example a roll of two dice. What is the probability that the sum is seven?

And now suppose that someone tells us that we have rolled a three on one die. Did the probability change? Again I will use some Python code, although it is probably faster to calculate this "by hand". Try it before proceeding further.

First let's calculate the probability of that event without any conditions.  The event $A$ - "we have rolled seven points in total" contains $6$ elementary events 

In [None]:
A = set( filter(lambda s: sum(s)==7, S_dice) )
print(len(A))
print(A)

In [None]:
# Just to have nice fractions instead of floats
from fractions import Fraction 
P_A =  Fraction(len(A),len(S_dice))
print(P_A)

Event $B$ - "there is a three on one die" contains $11$ elements

In [None]:
B = set( filter(lambda s: s[0]==3 or s[1]==3 , S_dice) )
print(len(B))
print(B)

In [None]:
P_B = Fraction(len(B), len(S_dice))
print(P_B)

Event $A\cap B$ has only two elements

In [None]:
A_cap_B = A.intersection(B)
P_A_cap_B = Fraction(len(A_cap_B), len(S_dice))
print(len(A_cap_B))
print(A_cap_B)
print(P_A_cap_B)

And finally

In [None]:
P_A_cond_B = P_A_cap_B/P_B
print(P_A_cond_B)

So this is marginally bigger than $P(A)=1/6$. 

## Bayes theorem

It is very important to keep in mind that conditional probability $P(A|B)$ is not symmetric! _E.g._ when it rains the probability that the sidewalk will be wet is one. On the other hand, a wet sidewalk does not imply with certainty that it has rained; it may have been, _e.g._, washed by our neighbour. But as we will see many times in the course of this lecture, the ability to "invert" conditional probability comes in very handy. 

By definition

$$P(B|A) = \frac{P(A \cap B)}{P(A)}\quad\text{and}\quad P(A|B) = \frac{P(A \cap B)}{P(B)}$$

We can use the second expression to calculate $P(A\cap B)$ and substitute it into the first to obtain

$$\boxed{P(B|A) = \frac{P(A|B)P(B)}{P(A)}}$$

This formula is known as Bayes theorem. 

Let's apply it to the "wet sidewalk problem" i.e. we look in the morning through our window and see wet sidewalk. What is the probability that it has rained at night? 

If $A$ is the event "sidewalk is wet" and $B$ is the event "it has rained" then $P(A|B)=1$ and 

$$P(rain|wet)=\frac{P(rain)}{P(wet)}=\frac{P(rain)}{P(wet|rain)P(rain)+P(wet|wash)P(wash)}=\frac{P(rain)}{P(rain)+P(wash)}$$

This leads to a paradox if $P(rain)+P(wash) >1 $. It would imply that after seeing a wet sidewalk the probability of rain _decreases_. However, in writing the denominator we have made an implicit but reasonable assumption that the events "rain" and "wash" are mutually exclusive, and so the sum of their probabilities cannot exceed one. We can write this explicitly 

$$\frac{P(rain)}{P(rain)+P(wash|\neg rain)(1-P(rain))}  $$

Let's consider some "corner cases". If our neighbour always washes the sidewalk when it does not rain then the result is $P(rain)$: the sidewalk is always wet, so seeing it wet gives us no additional information.  

If our neighbour never washes the sidewalk then the result is one: the only reason for a wet sidewalk is rain, so when it is wet it must have rained.

If our neighbour washes the sidewalk only half of the time when it does not rain we obtain

$$\frac{ 2 P(rain)}{1+P(rain)}$$

So if _e.g._ $P(rain)=1/7$, seeing a wet sidewalk increases that chance to

In [None]:
print(2 * Fraction(1,7)/(1+Fraction(1,7)))

Let's plot this using `matplotlib`  and `numpy` libraries. 

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams["figure.figsize"] = [12,8]

We can plot the whole family of plots corresponding to different values of $P(wash|\neg rain)$

In [None]:
ps = np.linspace(0,1,100)
plt.xlabel("P(rain)")
plt.ylabel("P(rain|wet)");
plt.plot(ps, ps, c='grey', linewidth = 1);
for pw in [0.1, 0.2, 0.3, 0.4, 0.5, 0.75]:
    plt.plot(ps, ps/(ps+pw*(1-ps)),label = "P(w|not r) = {:.2f}".format(pw)); 
plt.grid()
plt.legend();

__Problem__: Base rate fallacy

You are tested for a rare disease (1 person in 250 has it). The test has an 80% true positive rate and a 10% false positive rate, i.e. the test gives a positive result (you are ill) for 80% of ill patients and for 10% of healthy patients.   

You test positive. What are the chances that you have the disease? 

__Answer__

What we need is the  probability that we are ill on condition that we have been tested positive:

$$P(ill|P)= \frac{P(ill, P)}{P(P)}$$

The probability of being ill and tested positive is 

In [None]:
p_ill_p = 0.004 * 0.8  

The probability of being tested positive is

$$P(P)=P(ill,P)+P(\neg ill, P)$$

In [None]:
p_p = .004*0.8 + 0.996*0.1

and finally 

In [None]:
p_ill_cond_p = p_ill_p/p_p
print("{:4.1f}%".format(100*p_ill_cond_p))

So there is no cause to despair yet :) 

### Increase of information (learning)

One could say that this test is useless if a positive result gives only a $3\%$ chance of being ill. And this particular test was actually discarded. But it is not totally useless. Before taking the test our chance of being ill was $0.4\%$. After seeing the positive result it "jumped" almost eight times to $3.1\%$. 

$$0.004 \longrightarrow 0.031$$

After seeing a negative result our chance of being ill drops more than four times:

$$0.004 \longrightarrow 0.001 $$ 
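
Both updates can be reproduced with one small function applying Bayes theorem. This is just a sketch with exact fractions; the function name `posterior_ill` and the variable names are made up for this example.

```python
from fractions import Fraction

prior_ill = Fraction(1, 250)            # prevalence of the disease
true_positive_rate = Fraction(8, 10)    # P(positive | ill)
false_positive_rate = Fraction(1, 10)   # P(positive | healthy)

def posterior_ill(prior, likelihood_ill, likelihood_healthy):
    """Bayes theorem: P(ill | result) from P(result | ill) and P(result | healthy)."""
    evidence = likelihood_ill * prior + likelihood_healthy * (1 - prior)
    return likelihood_ill * prior / evidence

p_ill_given_pos = posterior_ill(prior_ill, true_positive_rate, false_positive_rate)
p_ill_given_neg = posterior_ill(
    prior_ill, 1 - true_positive_rate, 1 - false_positive_rate
)
print(float(p_ill_given_pos), float(p_ill_given_neg))
```

For the negative result the likelihoods are the complements of the two rates: an ill patient tests negative with probability $0.2$ and a healthy one with probability $0.9$.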

## Independent events

It may happen that  knowledge that $B$ happened does not change  the probability of $A$

$$P(A|B) = P(A)$$

We say then that  events $A$ and $B$ are _independent_. 

For example, when tossing a coin the outcome of a toss does not depend in any way on the outcomes of previous tosses, and when rolling dice the faces they land on are independent. 

Substituting the definition of conditional probability 

$$\frac{P(A\cap B)}{P(B)}  = P(A)$$

we obtain  a more familiar factorisation criterion for joint probability of independent events

$$P(A\cap B) = P(A)\cdot P(B)$$
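
The factorisation criterion is easy to test on the two dice example. In the sketch below I assume equiprobable outcomes and compare two pairs of events; the helper names `P` and `independent` are my own.

```python
from fractions import Fraction
from itertools import product

S_two_dice = set(product(range(1, 7), repeat=2))

def P(event):
    """Probability of an event, assuming all 36 outcomes are equally likely."""
    return Fraction(len(event), len(S_two_dice))

def independent(A, B):
    """Factorisation criterion: P(A and B) == P(A) * P(B)."""
    return P(A & B) == P(A) * P(B)

sum_is_seven = {s for s in S_two_dice if sum(s) == 7}
first_is_three = {s for s in S_two_dice if s[0] == 3}
any_is_three = {s for s in S_two_dice if 3 in s}

print(independent(sum_is_seven, first_is_three))
print(independent(sum_is_seven, any_is_three))
```

The event "the sum is seven" turns out to be independent of "the first die shows three", but not of "there is a three on one die", which is the event used in the conditional probability example.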

## Discrete random variables

The notion of an unpredictable process is too general and in the following we will restrict ourselves to outcome sets that are subsets of $\mathbb{R}^M$. We will call such a process a _random variable_. 

If the outcome set is countable, in particular if it is finite, then we call such a random variable _discrete_. As shown above, to characterise such a variable it is enough to assign a probability to each of the elements of the outcome set. This assignment is called the _probability mass function_ (pmf). 

We will denote the probability of random variable $X$  taking a value $s\in S$ by

$$P(X=s)\equiv P_X(\{s\})$$ 

However we will often abbreviate it further

$$ P_X(s) \equiv P_X(\{s\})$$

I will omit the subscript $X$ when it's clear from the context  which random variable I have in mind. 

### Joint probability distribution

When we have two random variables $X$ and $Y$ with outcome sets $S_X$ and $S_Y$ we can treat them together as one random variable with outcome set 
$S_{X\times  Y}=S_{X}\times S_Y$ and joint probability mass function

$$P_{X\times Y}(X=x, Y=y)$$

If we are interested in only one of the variables we can calculate its probability mass function as _marginal_ pmf

$$P_X(X=x)= \sum_y P(X=x, Y=y)\qquad P_Y(Y=y)= \sum_x P(X=x, Y=y)$$
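
For finite outcome sets a joint pmf can be stored as a two-dimensional array, and marginalisation becomes a sum over one axis. The numbers below are a made-up example, not data from the lecture.

```python
import numpy as np

# Hypothetical joint pmf of X (rows, values 0 and 1) and Y (columns, values 0, 1, 2).
joint = np.array([[0.10, 0.20, 0.10],
                  [0.30, 0.10, 0.20]])

p_x = joint.sum(axis=1)  # sum over y gives the marginal pmf of X
p_y = joint.sum(axis=0)  # sum over x gives the marginal pmf of Y
print(p_x)
print(p_y)
print(joint.sum())       # a valid joint pmf sums to one
```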

### Independent random variables

The concept of independence applies also to random variables. We say that two random variables $X$ and $Y$ are independent iff (if and only if)

$$P(X=x|  Y=y)= P(X=x)\quad\text{for all }x,y$$

or equivalently

$$P_{X\times Y}(X=x, Y=y)= P_X(X=x)\cdot P_Y(Y=y) \quad\text{for all }x,y$$

For example, when $X$ and $Y$ represent the first and second toss of a coin they are independent random variables. 
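
When the joint pmf is given as an array, the factorisation condition can be checked by comparing it with the outer product of its marginals. The sketch below uses two made-up joint distributions; the function name `is_independent` is my own.

```python
import numpy as np

def is_independent(joint, tol=1e-12):
    """True if the joint pmf equals the outer product of its marginals."""
    joint = np.asarray(joint, dtype=float)
    p_x = joint.sum(axis=1)
    p_y = joint.sum(axis=0)
    return bool(np.allclose(joint, np.outer(p_x, p_y), atol=tol))

two_fair_tosses = np.full((2, 2), 0.25)      # two independent fair tosses
same_toss_twice = np.array([[0.5, 0.0],
                            [0.0, 0.5]])     # Y is a copy of X
print(is_independent(two_fair_tosses), is_independent(same_toss_twice))
```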

### Expectation value

Expectation value of a function with respect to a random variable $X$ is defined as

$$E_X[f(X)] \equiv \sum_i f(x_i)P(X=x_i),\quad x_i\in S_X$$ 

In particular the expectation value of the random variable _i.e._ its _mean_ or _average_ is 

$$E_X[X] \equiv \sum_i x_i P(X=x_i)$$ 

and the variance

$$\operatorname{var}(X)=\sigma^2 = E[(X-E[X])^2]$$

The square root of variance $\sigma$  is called _standard deviation_. 
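
These definitions translate directly into code. As a sketch, assuming a single fair die, the function below computes $E[f(X)]$ from a pmf stored as a dictionary; the name `expectation` is chosen only for this example.

```python
from fractions import Fraction

# pmf of a single fair die (assumed model): value -> probability
pmf_die = {k: Fraction(1, 6) for k in range(1, 7)}

def expectation(f, pmf):
    """E[f(X)] = sum_i f(x_i) P(X = x_i) for a discrete pmf given as a dict."""
    return sum((f(x) * p for x, p in pmf.items()), Fraction(0))

mean = expectation(lambda x: x, pmf_die)
variance = expectation(lambda x: (x - mean) ** 2, pmf_die)
print(mean, variance)
```

Using `Fraction` keeps the results exact, so they can be compared with a calculation done by hand.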

#### Problem: linearity of expectation value

Show that

$$E_{X\times Y}[a X + b  Y]= a E_X[X] + b E_Y[Y]\quad\text{where }a,b\text{ are constants}$$

$$E_{X\times Y}[a X + b  Y]=\sum_{x,y}\left(a x + b y\right) P(X=x, Y=y) = a \sum_{x,y} x  P(X=x, Y=y)+b \sum_{x,y}  y P(X=x, Y=y) $$

$$a \sum_{x,y} x  P(X=x, Y=y) = a\sum_x  x \sum_y P(X=x, Y=y)= a\sum_x x P(X=x) = a E[X]$$

and the same for the other term.

__Problem:__ Variance

Show that 

$$\operatorname{var}(X) = E[X^2]-E[X]^2$$

__Answer__

$$\operatorname{var}(X) = E[(X-E[X])^2] = E\left[X^2-2 X E[X]+E[X]^2\right]$$

$E[X]$ is a constant so using the linearity of expectation value we obtain

$$E\left[X^2-2 X E[X]+E[X]^2\right]=E[X^2]-2E[X]E[X]+E[X]^2=E[X^2]-E[X]^2$$

### Covariance and correlation

The expectation value of a product of two random variables is given by 

$$E_{X\times Y}[X\cdot Y]=\sum_{x,y} x\cdot y\, P(X=x , Y=y)$$

If the two random variables are independent then

$$E_{X\times Y}[X\cdot Y]=\sum_{x,y} x y P(X=x , Y=y)
=\sum_{x,y} x y P(X=x)P(Y=y)=
\left(\sum_{x} x P(X=x)\right)
\left(\sum_{y} y P(Y=y)\right)
$$

leading to the familiar result that the expectation value of a product of independent random variables factorises

$$E_{X\times Y}[X\cdot Y]=E_X[X] E_Y[Y]$$

The quantity 

$$\operatorname{cov}(X,Y)=E_{X\times Y}[X\cdot Y]-E_X[X] E_Y[Y]=E[(X-E[X])(Y-E[Y])]$$

is called the _covariance_, and when the variables $X$ and $Y$ are independent it is equal to zero. Please note however that zero covariance does not imply independence.

The magnitude of the covariance depends on the magnitude of the random variables, e.g. scaling one variable by $a$ will also scale the covariance by $a$. That is why a normalised version called the _correlation_ coefficient is often used:

$$\operatorname{corr}(X,Y)=\frac{E\left[(X-E[X])(Y-E[Y])\right]}{\sqrt{E\left[(X-E[X])^2\right]E\left[(Y-E[Y])^2\right]}}$$

__Problem__: Linear dependence

Please check that when variables $X$ and $Y$ are linearly dependent, _i.e._ $Y =a \cdot X + b$ with $a\neq 0$, the correlation between them is $1$ (for $a>0$) or $-1$ (for $a<0$). 

Let's illustrate this with some Python code

In [None]:
xs = np.random.uniform(size=10000)
ys = np.random.uniform(size=10000)

In [None]:
#covariance 
np.mean( (xs-xs.mean())*(ys-ys.mean() ))

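We get almost the same result using the built-in function. Note that `np.cov` returns the whole covariance matrix (the covariance of `xs` and `ys` is the off-diagonal element) and by default normalises by $N-1$ instead of $N$.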

In [None]:
np.cov(xs,ys)

In [None]:
#correlation
np.mean( (xs-xs.mean())*(ys-ys.mean() ))/np.sqrt(np.mean( (xs-xs.mean())**2)*np.mean((ys-ys.mean() )**2))

In [None]:
np.corrcoef(xs,ys)

In [None]:
zs = xs + ys 
np.corrcoef((xs,ys,zs))
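
The linear dependence problem stated above can also be checked numerically. This is only a sanity check on sampled data, not a proof; the coefficients below are arbitrary and the variable names are chosen so as not to clash with the cells above.

```python
import numpy as np

rng = np.random.default_rng(0)
x_lin = rng.uniform(size=10_000)

correlations = {}
for a, b in [(2.0, 1.0), (-0.5, 3.0)]:
    y_lin = a * x_lin + b                      # exact linear dependence
    correlations[(a, b)] = np.corrcoef(x_lin, y_lin)[0, 1]
    print(a, b, correlations[(a, b)])
```

Up to floating point rounding the correlation depends only on the sign of $a$, not on its magnitude or on $b$.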

__Problem:__ Variance of sum  of independent random variables

Show that if random variables $X$ and $Y$ are independent then

$$\operatorname{var}(X+Y) = \operatorname{var}(X) +  \operatorname{var}(Y)$$

Some other characteristics of the random variables include

### Median

Median $m$ is a number that divides the values of the random variable into two  sets as equiprobable as possible

$$P(X\le m) \ge \frac{1}{2}\text{ and } P(X\ge m) \ge \frac{1}{2}$$

#### Problem: Median for coin toss

What is a median for coin toss if you assign value $1$ to heads and $0$ to tails? 

### Mode 

The mode is the value for which the probability mass function has its maximum. That's an element most likely to be sampled.

$$\operatorname{mode}(X)=\underset{x_k}{\operatorname{argmax}} P(X=x_k)$$

### Entropy

The last characteristic of a distribution that I would like to introduce is the _entropy_ 

$$H[X] \equiv -\sum_i P_X(x_i) \log P_X(x_i)=-E_X[\log P_X(X)] $$

Entropy is a "measure of randomness": the greater the entropy, the greater the randomness and the harder it is to predict the outcome. 

Take for example a toss of an unfair coin with probability $p$ of coming up heads. The entropy is 

$$-p\log p - (1-p)\log(1-p)$$

In [None]:
ps = np.linspace(0,1,100)[1:-1] # we reject 0 and 1
plt.plot(ps, -ps *np.log(ps)-(1-ps)*np.log(1-ps));

We can see that the entropy is maximum when $p=1/2$ and zero when $p=0$ or $p=1$ that is when the outcome is certain. 
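
The same formula works for any finite pmf. Below is a minimal sketch of an entropy function using the natural logarithm and the usual convention $0\log 0 = 0$; the name `entropy` is my own choice for this example.

```python
import numpy as np

def entropy(pk):
    """Shannon entropy sum p log(1/p) (natural log), with 0 log 0 taken as 0."""
    pk = np.asarray(pk, dtype=float)
    nonzero = pk[pk > 0]
    return float(np.sum(nonzero * np.log(1.0 / nonzero)))

print(entropy([0.5, 0.5]))   # fair coin
print(entropy([0.9, 0.1]))   # biased coin
print(entropy([1.0, 0.0]))   # certain outcome
```

Dropping the zero-probability entries before taking the logarithm avoids the warnings that `np.log(0)` would otherwise produce.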

## Some common discrete random variables

### Bernoulli distribution

The Bernoulli distribution has a two-element outcome set $S=\{0,1\}$. It is characterised by a single parameter

$$p\equiv P(X=1)$$

The expectation value of this distribution  is equal to

$$0\cdot (1-p) + 1\cdot p = p$$

#### Problem: variance of the Bernoulli distribution

Calculate the variance of the Bernoulli distribution.

#### Problem: median of the Bernoulli distribution 

Calculate the median of the Bernoulli distribution.

The `scipy.stats` module contains a large number of distribution objects.

In [None]:
import scipy.stats as st

In [None]:
b = st.bernoulli(0.7)

This produces a Bernoulli distribution object with $p=0.7$. In scipy this is called a _frozen_ distribution, meaning that its parameters are set (frozen). Using this object we have access to different properties of the distribution:

In [None]:
print(b.pmf(0), b.pmf(1), b.mean(), b.var())

A very useful feature is the possibility to generate random numbers according to the distribution

In [None]:
b.rvs(size = 10)

Another way to use distributions is to pass the parameters directly without creating a frozen distribution

In [None]:
st.bernoulli.rvs(p = 0.5,size = 10)

### Binomial distribution

The binomial distribution is characterised by two parameters, $n$ and $p$, and describes the number of successes (ones) in $n$ independent Bernoulli trials. So the outcome set $S=\{0,\ldots,n\}$
has $n+1$ elements. The probability mass function is

$$P(X=k) = \binom{n}{k}p^k(1-p)^{n-k}$$

#### Problem: Mean and variance of binomial random variable

Calculate the mean (expectation value) and variance of binomial random variable.

#### Hint
Use the fact that a binomial random variable can be expressed as a sum of $n$ independent Bernoulli random variables.

In [None]:
n=12
binom_dist = st.binom(n=n,p=0.7)  # named binom_dist because `bin` would shadow a Python built-in

In [None]:
plt.bar(np.arange(0,n+1), binom_dist.pmf(np.arange(0,n+1)));
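
The hint above can be illustrated by simulation: summing $n$ independent Bernoulli outcomes should reproduce the binomial pmf. The sketch below is a numerical check only, with an arbitrary seed and sample size; the variable names are my own.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(1)
n_trials, p_success, n_samples = 12, 0.7, 100_000

# Each row holds n_trials independent Bernoulli(p) outcomes; row sums count successes.
bernoulli_draws = rng.random((n_samples, n_trials)) < p_success
successes = bernoulli_draws.sum(axis=1)

empirical_pmf = np.bincount(successes, minlength=n_trials + 1) / n_samples
exact_pmf = st.binom.pmf(np.arange(n_trials + 1), n_trials, p_success)
print(np.abs(empirical_pmf - exact_pmf).max())
```

The printed number is the largest difference between the simulated frequencies and the exact pmf; it should shrink as `n_samples` grows.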

### Categorical (multinoulli) distribution

The categorical distribution is a straightforward generalisation of the Bernoulli distribution. While a Bernoulli random variable has two possible outcomes, a categorical 
random variable has $m$. It is characterised by providing its probability mass function, that is the probability of every category

$$p_k= P(X=x_k),\; k=1,\ldots,m\quad \sum_{k=1}^m p_k = 1$$

We can construct the multinoulli distribution in Python as follows

In [None]:
pk = np.asarray([0.3,0.4,0.1,0.2])
m = st.rv_discrete(values=([1,2,3,4],pk));

We can visualise the distribution by histogramming a sample drawn from it. The `pyplot.hist` function conveniently both creates and plots a histogram. 

In [None]:
plt.hist(m.rvs(size = 1000), bins=4, range=(0.5,4.5) , rwidth =0.5,  align = 'mid', histtype='bar');
plt.scatter(np.arange(1,5), pk*1000,s=40,c='red', zorder = 5, label="$\\bar{n}_k$")
plt.xticks(np.arange(1,5));
plt.xlabel("k")
plt.legend();


In analogy to Bernoulli distribution  the entropy of the categorical distribution is  greatest when all categories are equally probable with $p_k=1/m$.

### Multinomial distribution

The multinomial distribution generalises the binomial distribution in the same way that the multinoulli distribution generalises the Bernoulli distribution. 
The binomial distribution counts the number of successes in $n$ Bernoulli trials. The multinomial random variable counts the number of results in each category for $n$ trials of the categorical random variable. 

$$S=\left\{(n_1,\ldots,n_m): n_k\ge 0, \sum_{k=1}^m n_k = n \right\}$$

$n_k$ is the number of times that category $k$ has shown up in the $n$ trials of the categorical random variable. Its pmf is given by

$$P(n_1,\ldots,n_m)=\frac{n!}{n_1!\cdots n_m!}p_1^{n_1}\cdots p_m^{n_m}$$
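
To make the formula concrete, the sketch below evaluates it directly and compares the result with `scipy.stats.multinomial.pmf`. The counts and probabilities are an arbitrary example and the function name `multinomial_pmf` is my own.

```python
from math import factorial, prod
from scipy.stats import multinomial

def multinomial_pmf(counts, probs):
    """Evaluate n!/(n_1! ... n_m!) * p_1^n_1 ... p_m^n_m directly from the formula."""
    n_total = sum(counts)
    coefficient = factorial(n_total)
    for n_k in counts:
        coefficient //= factorial(n_k)   # stays an integer at every step
    return coefficient * prod(p ** n_k for p, n_k in zip(probs, counts))

counts = [3, 5, 2]
probs = [0.4, 0.5, 0.1]
print(multinomial_pmf(counts, probs))
print(multinomial.pmf(counts, n=sum(counts), p=probs))
```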

In [None]:
from scipy.stats import multinomial

In [None]:
mult = multinomial(n=10,p=[0.4,0.5,0]) # when the probabilities do not add up to one, the last probability is adjusted accordingly

In [None]:
mult.p

In [None]:
sample = mult.rvs(10)
print(sample)

In [None]:
sample.sum(axis =1)

### Poisson distribution

Assume that the average number of clients in a shop (or clicks on a web page) is $\mu$ per hour. If the probability of a visit or click is constant in time then the distribution of the number of clients in one hour is given by the Poisson probability mass function

$$P(X = k) = e^{-\mu}\frac{\mu^k}{k!}$$

In [None]:
mu = 7.7
ks = np.arange(0,20)
plt.scatter(ks, st.poisson.pmf(ks,mu=mu))
plt.axvline(mu, linewidth=1, c='grey')
plt.xticks(ks);

It is easy to check that the expectation value of this distribution is indeed $\mu$

$$E[X]=e^{-\mu}\sum_{k=0}^\infty k \frac{\mu^k}{k!} =
e^{-\mu}\mu \frac{\text{d}}{\text{d}\mu}\sum_{k=0}^\infty  \frac{\mu^k}{k!} = e^{-\mu}\mu \frac{\text{d}}{\text{d}\mu} e^\mu = \mu$$
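
This result can be confirmed numerically by summing the series directly. The sketch below uses only the standard library and truncates the sum at $k=100$, which I assume is enough for the value of $\mu$ used here; the names `poisson_pmf` and `mu_check` are my own.

```python
from math import exp, factorial

def poisson_pmf(k, mu):
    """P(X = k) = exp(-mu) * mu^k / k!"""
    return exp(-mu) * mu ** k / factorial(k)

mu_check = 7.7
k_values = range(0, 100)  # terms beyond k = 100 are negligible for this mu
total = sum(poisson_pmf(k, mu_check) for k in k_values)
mean_check = sum(k * poisson_pmf(k, mu_check) for k in k_values)
print(total, mean_check)
```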

__Problem__: Variance of Poisson distribution

Calculate the variance of the Poisson distribution. 

__Hint__: Use the same "differentiation trick". 

### Long tailed distributions

All the above distributions were "well behaved" in the sense that we could calculate both the mean and the variance for each. Actually all of them have all the moments, where the $n$th moment is defined as the expectation value $E[X^n]$. But there exist distributions that don't even have a mean! Take for example the Zipf distribution with probability mass function

$$P(Z=k|\alpha)=\frac{1}{\zeta(\alpha)}\frac{1}{k^\alpha}\quad k>0,\; \alpha>1$$

where $\zeta(\alpha)$ is the Riemann zeta function. This distribution does not have any moments with $n\ge\alpha-1$ as

$$E_Z[Z^n]=\frac{1}{\zeta(\alpha)}\sum_{k=1}^\infty \frac{k^n}{k^{\alpha}}=\frac{1}{\zeta(\alpha)}\sum_{k=1}^\infty \frac{1}{k^{\alpha-n}}$$

which diverges when $\alpha-n\le1$.

In [None]:
ns = np.arange(1,20)
plt.bar(ns,  st.zipf.pmf( ns, 1.7), width=.5);

The distribution above does not have a finite expectation value

In [None]:
st.zipf.mean(1.7)

Please note that median is well defined even in such case

In [None]:
st.zipf.median(1.7)

When a random variable does not have a well defined average the sample average does not converge! When gathering more and more data we will get erratic, eventually diverging behaviour. Below I have plotted the running average calculated on sample sizes from 1 to 10000. For comparison I have included the same plot for a Poisson distribution with the same mean (or with $\mu=10$ when the Zipf mean is infinite). Please run this cell several times and see how the output changes! Then experiment with different values of the parameter $\alpha$.

In [None]:
n=10000
alpha = 2.1
zipf_mu  = st.zipf.mean(alpha)
zipf_data = st.zipf.rvs(alpha,size = n)
plt.plot(np.cumsum(zipf_data)/np.arange(1,n+1),label="Zipf $\\alpha = {:5.2f}\\; \\mu = {:5.2f}$".format(alpha,zipf_mu));

mu = 10 if np.isinf(zipf_mu) else zipf_mu
poisson_data = st.poisson(mu).rvs(n)
plt.plot(np.cumsum(poisson_data)/np.arange(1,n+1), label="Poisson $\\mu = {:5.2f}$".format(mu));
plt.axhline(mu,c = 'red', linewidth = 1, linestyle ='--')
plt.legend();

You can see that the average can change abruptly after adding one more sample. You may think that those distributions are "pathological". Nevertheless they occur quite frequently in real settings, _e.g._ the distribution of wealth follows a similar curve. So when, for example, you take 1000 people, measure their average height and then add one more person, you will not change this average drastically even when that person is extremely short or extremely tall. But when you calculate the average income it may happen that the person you add is Bill Gates and the average will change dramatically.

# Continuous random variables

By continuous random variables we will understand variables which have a connected subset of $\mathbb{R}$, e.g. an interval, as the outcome set. 

### Probability density function

When the set of outcomes is not countable, _i.e._ we cannot enumerate them, we cannot specify the probability of an event by adding the probabilities of the elementary events it contains. Actually for most of the interesting continuous random variables the probability of a single outcome is zero

$$P(X=x) = 0$$

However we can ask for the probability that the outcome is not greater than some number $x$:

$$F(x) = P(X\le x)$$

This is called the _cumulative distribution function_ (cdf).

We can also ask for the probability that the outcome lies in a small interval $\Delta x$

$$P(x<X<x+\Delta x)$$

For small intervals and "well behaved" random variables we expect that this probability will be proportional to $\Delta x$, so let's take the ratio and go to the limit $\Delta x\rightarrow 0$

$$\frac{P(x<X<x+\Delta x)}{\Delta x}\underset{\Delta x\rightarrow 0}{\longrightarrow} P_X(x)$$

If this limit exists it's called probability density function (pdf).  

There is a relation between cdf and pdf  

$$ P_X(x) =\frac{\text{d}}{\text{d}x}F_X(x)\qquad F_X(x) = \int\limits_{-\infty}^x P_X(x')\text{d}x'$$
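
The first of these relations can be checked numerically by differentiating the cdf with finite differences. The sketch below uses a normal distribution from `scipy.stats` purely as a convenient example; the parameters and the grid are arbitrary choices.

```python
import numpy as np
import scipy.stats as st

dist_check = st.norm(loc=1, scale=0.5)
grid = np.linspace(-2, 4, 2001)

# The finite-difference derivative of the cdf should approximate the pdf.
pdf_from_cdf = np.gradient(dist_check.cdf(grid), grid)
max_error = np.abs(pdf_from_cdf - dist_check.pdf(grid)).max()
print(max_error)
```

The printed value is the largest deviation between the numerical derivative and the exact pdf on the grid; it should decrease when the grid is made finer.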

Most of the definitions and properties of the probability mass function apply to probability density function with summation changed to integral _e.g._

$$E_X[f(X)]\equiv \int\text{d}x f(x) P(x)$$
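
In code the integral can be evaluated with numerical quadrature. The following sketch uses `scipy.integrate.quad` and again a normal distribution as an example; the function name `expectation_continuous` is made up for this illustration.

```python
import numpy as np
import scipy.stats as st
from scipy.integrate import quad

dist_example = st.norm(loc=1, scale=0.5)

def expectation_continuous(f, pdf, lower=-np.inf, upper=np.inf):
    """E[f(X)] as the integral of f(x) * pdf(x); quad returns (value, error estimate)."""
    value, _ = quad(lambda x: f(x) * pdf(x), lower, upper)
    return value

mean_num = expectation_continuous(lambda x: x, dist_example.pdf)
var_num = expectation_continuous(lambda x: (x - mean_num) ** 2, dist_example.pdf)
print(mean_num, var_num)
```

For this example the results should agree with `loc` and `scale**2` up to the accuracy of the quadrature.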

## Some useful continuous random variables

### Normal distribution

Probably  the most known, if not the only known, continuous distribution is the _normal_ or Gaussian distribution. It is characterised by its mean $\mu$  and  variance $\sigma^2$. Its probability density function is

$$P(x|\mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{\displaystyle -\frac{(x-\mu)^2}{2\sigma^2}}$$

and it has a characteristic bell-like shape

In [None]:
xs = np.linspace(-5,7,500)
for s in [0.25, 0.5, 1,2]:
    plt.plot(xs,st.norm.pdf(xs, loc=1, scale=s), label="$\\sigma = {:4.2f}$".format(s));
plt.axvline(1, c='grey', linewidth=1);
plt.legend();

The prevalence of this random variable can be attributed to the central limit theorem, which states that, under some mild assumptions, the suitably normalised sum of independent random variables approaches the normal random variable as the number of variables tends to infinity. 
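
A quick simulation gives a feeling for this theorem. The sketch below sums 50 independent uniform variables, standardises the sum using the mean $1/2$ and variance $1/12$ of a single term, and compares a few statistics with those of the standard normal distribution. The number of terms, sample size and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n_terms, n_samples = 50, 100_000

# Sum n_terms independent Uniform(0, 1) variables, then standardise the sum
# using the known mean 1/2 and variance 1/12 of a single term.
sums = rng.uniform(size=(n_samples, n_terms)).sum(axis=1)
z = (sums - n_terms * 0.5) / np.sqrt(n_terms / 12)

within_one_sigma = np.mean(np.abs(z) < 1)
print(z.mean(), z.std())
print(within_one_sigma)  # about 0.683 for a standard normal variable
```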

Another feature  of the normal distribution is that it is the distribution with highest entropy given mean and variance. 

As you can see on the plot above, the probability density function $P_X(x)$ is not restricted to be less than one. That's because it is a _density_. We can only meaningfully ask about the probability of $X$ having an outcome in an interval, which is given by the area under the corresponding fragment of the curve

In [None]:
distrib  = st.norm(loc=1, scale=0.25)
a = 0.75
b = 0.90
xs = np.linspace(0,2,500)
ab = np.linspace(a,b,100)

plt.plot(xs,distrib.pdf(xs));
plt.fill_between(ab,distrib.pdf(ab), alpha=0.5 )
plt.axvline(1, c='grey', linewidth=1);
area = distrib.cdf(b)-distrib.cdf(a)
plt.text(0.2, 1.4, "$P(a<X<b) = {:.4f}$".format(area), fontsize=14)

In [None]:
n=100000
sample = distrib.rvs(size=n)
( (a<sample) & (sample<b)).sum()/n

The area was calculated using the cumulative distribution function

$$P(a<X<b)=F_X(b)-F_X(a)$$

In [None]:
xs = np.linspace(0,2,500)
plt.plot(xs,distrib.cdf(xs));
plt.plot([a,a,0],[0,distrib.cdf(a), distrib.cdf(a)], c='grey')
plt.plot([b,b,0],[0,distrib.cdf(b),distrib.cdf(b)], c='grey')

### Beta distribution

The  [Beta distribution](https://en.wikipedia.org/wiki/Beta_distribution) has two parameters  $\alpha$ and $\beta$  and its probability density function is

$$P(x|\alpha,\beta) =  \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}
x^{\alpha-1}(1-x)^{\beta-1},\quad 0\leq x\leq 1
$$

Its importance stems from  the fact that it is a _conjugate_ prior to Bernoulli distribution so it is used to set the "probability on probability". You will learn more about this  in bayesian_analysis notebook. 

Here are plots of the probability density function for some values of $\alpha=\beta$

In [None]:
xs =np.linspace(0,1,250)
for a in [0.25,0.5,1,2,5,10]:
    ys = st.beta(a,a).pdf(xs)
    plt.plot(xs,ys, label='%4.2f' %(a,))
plt.legend(loc='best', title='$\\alpha=\\beta$');

And here for some values of $\alpha\neq\beta$

In [None]:
xs =np.linspace(0,1,250)
for a in [0.25,0.5,1,5]:
    ys = st.beta(a,2.0).pdf(xs)
    plt.plot(xs,ys, label='%4.2f' %(a,))
plt.legend(loc=1, title='$\\alpha$');

It can be more convenient to parametrise the Beta distribution by its mean and variance. The mean and variance of the Beta distribution are 

$$\mu = \frac{\alpha}{\alpha+\beta}\quad\text{and}\quad \sigma^2=\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$$

Introducing a new auxiliary variable

$$\nu = \alpha+\beta$$

we have 

$$\alpha = \mu \nu,\quad \beta = (1-\mu)\nu,\quad \sigma^2=\frac{\mu(1-\mu)}{\nu +1} $$

so

$$\nu=\frac{\mu(1-\mu)}{\sigma^2}-1$$

and finally

$$\alpha = \mu \left(\frac{\mu(1-\mu)}{\sigma^2}-1\right)\quad\text{and}\quad\beta = (1-\mu) \left(\frac{\mu(1-\mu)}{\sigma^2}-1\right)$$
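
The conversion is short enough to wrap in a helper function. This is a sketch; the function name `beta_params_from_mean_var` is my own. Note that the formulas give positive $\alpha$ and $\beta$ only when $0<\mu<1$ and $0<\sigma^2<\mu(1-\mu)$, so the function checks for that.

```python
import scipy.stats as st

def beta_params_from_mean_var(mu, sigma2):
    """Convert the mean and variance of a Beta distribution to (alpha, beta)."""
    if not 0 < mu < 1:
        raise ValueError("mean must lie strictly between 0 and 1")
    if not 0 < sigma2 < mu * (1 - mu):
        raise ValueError("variance must be positive and smaller than mu * (1 - mu)")
    nu = mu * (1 - mu) / sigma2 - 1
    return mu * nu, (1 - mu) * nu

alpha_param, beta_param = beta_params_from_mean_var(0.3, 0.01)
beta_dist = st.beta(alpha_param, beta_param)
print(alpha_param, beta_param, beta_dist.mean(), beta_dist.var())
```

The round trip through `st.beta` should give back the mean and variance we started from.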

## To be continued ... 