In [1]:
import sys
sys.path.append('..')

import math
import numpy as np

import metrics

# Probabilities

## Events

$$P(A\cup B) = P(A) + P(B) - P(A \cap B)$$
$$P(E|F) = \frac{P(E \cap F)}{P(F)}$$
$A$ and $B$ are independent $\iff P(A \cap B) = P(A)P(B)$

### Bayes formula

$$P(E|F) = \frac{P(F|E)P(E)}{P(F)}$$
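As a rough illustration of Bayes' formula, the sketch below computes $P(E|F)$ when $P(F)$ is not given directly and is expanded with the law of total probability. The function name `bayes_posterior` and the numbers (1% prior, 95% true-positive rate, 10% false-positive rate) are assumptions made for the example.

```python
def bayes_posterior(p_f_given_e, p_e, p_f_given_not_e):
    """Return P(E|F) from P(F|E), P(E) and P(F|not E)."""
    # Law of total probability: P(F) = P(F|E)P(E) + P(F|not E)P(not E)
    p_f = p_f_given_e * p_e + p_f_given_not_e * (1 - p_e)
    # Bayes' formula
    return p_f_given_e * p_e / p_f

# Hypothetical numbers, chosen only to illustrate the formula
posterior = bayes_posterior(p_f_given_e=0.95, p_e=0.01, p_f_given_not_e=0.10)
print(posterior)
```

With a small prior $P(E)$, the posterior stays small even though $P(F|E)$ is high, because most occurrences of $F$ come from the much more frequent complement of $E$.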

## Discrete random variable

$X$ is discrete: it can only take a countable number of values.  

### Probability Mass Function (PMF)

$$p(a) = P\{X = a\}$$
$$\forall i: \space p(x_i) \geq 0$$
$$\sum_{i=1}^\infty p(x_i) = 1$$


### Cumulative Distribution Function (CDF)

$$F(a) = P\{X \leq a\} = \sum_{x_i \leq a} p(x_i)$$
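A small sketch of the PMF and CDF definitions, using a fair six-sided die as an assumed example. The names `values`, `pmf` and `cdf` are illustrative.

```python
import numpy as np

# Fair six-sided die (illustrative example)
values = np.arange(1, 7)
pmf = np.full(6, 1 / 6)

def cdf(a):
    """F(a) = sum of p(x_i) over all x_i <= a."""
    return pmf[values <= a].sum()

print(pmf.sum())  # the PMF sums to 1
print(cdf(3))     # P{X <= 3}
```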

## Continuous Random Variable

$X$ is continuous: it can take any value in an interval of the real line (uncountably many values).

### Probability Density Function (PDF)

$$P\{X \in B\} = \int_{B} f(x)dx$$
$$P\{a \leq X \leq b\} = \int_a^b f(x)dx$$
$$P \{ X \in ]-\infty; +\infty[ \} = 1$$
$$P \{ X = a \} = 0$$

### Cumulative Distribution Function (CDF)

$$F(a) = P \{X < a \} = P \{ X \leq a \} = \int_{-\infty}^a f(x)dx$$
$$\frac{d}{da} F(a) = f(a)$$
$$P \{a \leq X \leq b \} = F(b) - F(a)$$
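The relation $P\{a \leq X \leq b\} = F(b) - F(a)$ can be checked numerically. The sketch below assumes an exponential distribution with rate 1 (density $e^{-x}$ for $x \geq 0$, CDF $1 - e^{-x}$), picked only because both functions have a closed form, and integrates the density with a hand-written trapezoidal rule.

```python
import numpy as np

def pdf(x):
    # Density of an exponential distribution with rate 1 (illustrative choice)
    return np.exp(-x)

def cdf(a):
    # Closed-form CDF of the same distribution, valid for a >= 0
    return 1.0 - np.exp(-a)

a, b = 0.5, 2.0
xs = np.linspace(a, b, 10001)
fx = pdf(xs)
# Trapezoidal rule for the integral of the density over [a, b]
integral = np.sum((fx[1:] + fx[:-1]) / 2 * np.diff(xs))

print(integral)
print(cdf(b) - cdf(a))
```

Both printed values should agree up to the discretization error of the trapezoidal rule.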

## Joint distributions

### Discrete joint distributions

$$p(x,y) = P \{X=x, Y=y\}$$
$$p_X(x) = \sum_y p(x,y)$$
$$p_Y(y) = \sum_x p(x,y)$$
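For a finite joint PMF stored as a table, the marginals are row and column sums. The table below is a made-up example in which rows index the values of $X$ and columns the values of $Y$.

```python
import numpy as np

# Hypothetical joint PMF: joint[i, j] = P{X = x_i, Y = y_j}
joint = np.array([[0.1, 0.2],
                  [0.3, 0.4]])

p_x = joint.sum(axis=1)  # p_X(x) = sum over y of p(x, y)
p_y = joint.sum(axis=0)  # p_Y(y) = sum over x of p(x, y)

print(p_x)
print(p_y)
```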

### Continuous joint distributions

$$P \{X \in A, Y \in B\} = \int_B \int_A f(x,y)dxdy$$
$$f(a, b) = \frac{\partial^2}{\partial a \partial b} F(a,b)$$
$$P \{X \in A\} = \int_A \int_{-\infty}^{+\infty} f(x,y)dydx$$
$$P \{Y \in B\} = \int_B \int_{-\infty}^{+\infty} f(x,y)dxdy$$

## Conditional distributions

$$p_{X|Y}(x|y) = P\{X = x | Y=y\} = \frac{p(x,y)}{p_Y(y)} \text{ (discrete variables)}$$
$$f_{X|Y}(x|y) = \frac{f(x,y)}{f_Y(y)} \text{ (continuous variables)}$$
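In the discrete case, the conditional PMF is obtained by dividing each column of the joint table by the corresponding marginal. A minimal sketch, assuming a made-up joint table with rows indexing $X$ and columns indexing $Y$:

```python
import numpy as np

# Hypothetical joint PMF: joint[i, j] = P{X = x_i, Y = y_j}
joint = np.array([[0.1, 0.2],
                  [0.3, 0.4]])

p_y = joint.sum(axis=0)      # marginal of Y
# Broadcasting divides column j by p_Y(y_j), so column j holds p(x | y_j)
p_x_given_y = joint / p_y

print(p_x_given_y)
```

Each column of `p_x_given_y` is itself a PMF over the values of $X$, so it sums to 1.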

## Dependence

Two random variables are independent if the value of one does not affect the probability distribution of the other.  

Two discrete random variables $X$ and $Y$ are independent if and only if:
$$p_{X,Y}(x,y) = p_X(x)p_Y(y) \space \forall x,y$$

Two continuous random variables $X$ and $Y$ are independent if and only if:
$$F_{X,Y}(x,y) = F_X(x)F_Y(y) \space \forall x,y$$

The same condition holds with the probability density $f$.  

If two random variables are not independent, they are called dependent.
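For a finite joint PMF, the factorization condition can be tested cell by cell. The helper `is_independent` and both tables below are assumptions made for this sketch.

```python
import numpy as np

def is_independent(joint, tol=1e-12):
    """Check p(x, y) == p_X(x) p_Y(y) for every cell of a joint PMF table."""
    p_x = joint.sum(axis=1)
    p_y = joint.sum(axis=0)
    return bool(np.allclose(joint, np.outer(p_x, p_y), rtol=0, atol=tol))

# Built as a product of marginals, so independent by construction
independent_joint = np.outer([0.3, 0.7], [0.4, 0.6])
# Does not factorize: 0.1 != 0.3 * 0.4
dependent_joint = np.array([[0.1, 0.2],
                            [0.3, 0.4]])

print(is_independent(independent_joint))
print(is_independent(dependent_joint))
```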

## Expectation

The expectation of a random variable is its average value, weighted by probability.

$$\mathbb{E}[X] = \sum_{i=1}^n x_ip(x_i) \text{  (discrete variable)}$$
$$\mathbb{E}[X] = \int_{-\infty}^{+\infty} xf(x)dx \text{  (continuous variable)}$$

The expectation of an expression, $\mathbb{E}_{x \sim X}[g(x)]$, is the average value of $g$ when $x$ is drawn from the random variable $X$.

$$\mathbb{E}[g(X)] = \sum_{i=1}^n g(x_i)p(x_i) \text{  (discrete variable)}$$
$$\mathbb{E}[g(X)] = \int_{-\infty}^{+\infty} g(x)f(x)dx \text{  (continuous variable)}$$

### Properties

$$\mathbb{E}[X+Y] = \mathbb{E}[X] + \mathbb{E}[Y]$$
$$\mathbb{E}[\alpha X] = \alpha \mathbb{E}[X], \space \alpha \in \mathbb{R}$$
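The definitions and both properties can be verified exactly on a small discrete example. The sketch assumes a fair six-sided die and a hypothetical helper `expectation(g)` that evaluates $\mathbb{E}[g(X)]$ from the PMF.

```python
import numpy as np

# Fair six-sided die (illustrative example)
values = np.arange(1, 7)
pmf = np.full(6, 1 / 6)

def expectation(g=lambda x: x):
    """E[g(X)] = sum of g(x_i) p(x_i)."""
    return np.sum(g(values) * pmf)

e_x = expectation()                              # E[X]
e_x2 = expectation(lambda x: x ** 2)             # E[X^2]
e_sum = expectation(lambda x: x + x ** 2)        # E[X + X^2]
e_scaled = expectation(lambda x: 2 * x)          # E[2X]

print(e_x, e_x2)
print(e_sum, e_x + e_x2)    # additivity
print(e_scaled, 2 * e_x)    # scaling
```

Additivity holds here even though $X$ and $X^2$ are dependent: linearity of expectation does not require independence.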

### Estimate

Let $x \in \mathbb{R}^N$ be a sample of size $N$ from the random variable $X$.  An estimate of the expectation (or mean) of $X$ is:
$$\bar{x} = \frac{1}{N} \sum_{i=1}^N x_i$$

In [2]:
def mean(x):
    N = len(x)
    res = 0
    for i in range(N):
        res += x[i]
    return res / N

x = np.random.randn(137) * 1.5 + 3.1
print(mean(x))
print(np.mean(x))

2.9529466934036375
2.952946693403637


## Variance

The variance is a measure of how much the values of a random variable deviate from its expected value.

$$\text{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$$

$$\text{Var}(\alpha X + \beta) = \alpha^2 \text{Var}(X), \space \alpha, \beta \in \mathbb{R}$$

Standard deviation $\sigma(X)$:
$$\sigma(X) = \sqrt{\text{Var}(X)}$$  
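Both expressions of the variance, and the effect of an affine transformation, can be checked exactly on a discrete distribution. The sketch assumes a fair six-sided die; `expect` is a hypothetical helper and $\alpha = 3$, $\beta = -2$ are arbitrary.

```python
import numpy as np

# Fair six-sided die (illustrative example)
values = np.arange(1, 7)
pmf = np.full(6, 1 / 6)

def expect(g):
    """E[g(X)] computed from the PMF."""
    return np.sum(g(values) * pmf)

mean = expect(lambda x: x)
var_centered = expect(lambda x: (x - mean) ** 2)    # E[(X - E[X])^2]
var_moments = expect(lambda x: x ** 2) - mean ** 2  # E[X^2] - E[X]^2

# Variance of alpha * X + beta, computed from the definition
alpha, beta = 3.0, -2.0
mean_t = expect(lambda x: alpha * x + beta)
var_t = expect(lambda x: (alpha * x + beta - mean_t) ** 2)

print(var_centered, var_moments)
print(var_t, alpha ** 2 * var_centered)
print(np.sqrt(var_centered))  # standard deviation
```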

Let $x \in \mathbb{R}^N$ be a sample of size $N$ from the random variable $X$.  An estimate of the variance of $X$ is:
$$\text{Var}(x) = \frac{1}{N} \sum_{i=1}^N (x_i - \bar{x})^2$$

This estimate is biased. Bessel's correction removes the bias by dividing by $N-1$ instead of $N$:

$$\text{Var}(x) = \frac{1}{N-1} \sum_{i=1}^N (x_i - \bar{x})^2$$

In [3]:
def variance(x, bias=True):
    N = len(x)
    div = N if bias else N - 1
    mu = np.mean(x)
    return np.sum((x - mu)**2) / div

def std(x, bias=True):
    return np.sqrt(variance(x, bias))

x = np.random.randn(137) * 1.5 + 3.1
print(variance(x))
print(np.var(x))
print(std(x))
print(np.std(x))

print(np.cov(x.reshape(1, -1)))
print(variance(x, bias=False))

2.07029010630017
2.07029010630017
1.438850272370329
1.438850272370329
2.0855128276700245
2.085512827670024


## Covariance

The covariance is a measure of the joint variability of two random variables.  
A positive value indicates a positive linear relationship ($X$ tends to take large values when $Y$ takes large values, and small values when $Y$ takes small values).  
A negative value indicates a negative linear relationship ($X$ tends to take large values when $Y$ takes small values, and small values when $Y$ takes large values).  

Covariance between two random variables $X$ and $Y$:
$$\text{cov}(X,Y) = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])]$$
$$\text{cov}(X,Y) = \mathbb{E}[XY] - \mathbb{E}[X] \mathbb{E}[Y]$$


### Properties

$$\text{cov}(X,X) = \text{Var}(X)$$
$$\text{cov}(X,Y) = \text{cov}(Y,X)$$

### Estimate

Let $x, y \in \mathbb{R}^N$ be samples from the random variables $X$ and $Y$ respectively. An estimate of the covariance between $X$ and $Y$ is:

$$\text{cov}(x,y) = \frac{1}{N} \sum_{i=1}^N (x_i - \bar{x})(y_i - \bar{y})$$

As for the variance, we can correct the bias:
$$\text{cov}(x,y) = \frac{1}{N-1} \sum_{i=1}^N (x_i - \bar{x})(y_i - \bar{y})$$

### Covariance Matrix

Let $X = (X_1, \text{...}, X_n)$ be a random vector, where each entry $X_i$ is a random variable.  
We define the covariance matrix $\Sigma \in \mathbb{R}^{n \times n}$ such that the entry $(i,j)$ is the covariance between $X_i$ and $X_j$:
$$\Sigma_{ij} = \text{cov}(X_i, X_j)$$
$$\Sigma_{ii} = \text{Var}(X_i)$$
$$\Sigma_{ij} = \Sigma_{ji}$$


It is also called the auto-covariance matrix or the variance-covariance matrix.  
It generalizes the notion of variance to multiple dimensions.  

Let $X \in \mathbb{R}^{n \times p}$, where row $i$ is a sample of size $p$ of the random variable $X_i$.  
We can compute an estimate of the covariance matrix $\Sigma \in \mathbb{R}^{n \times n}$:
$$\Sigma_{ij} = \text{cov}(x_i, x_j) = \frac{1}{p} \sum_{k=1}^p (x_{ik} - \bar{x}_i)(x_{jk} - \bar{x}_j)$$

When the matrix $X$ is centered (each row has mean 0), it simplifies to:
$$\Sigma = \frac{1}{p} \sum_{k=1}^p  x_{:,k} x_{:,k}^T = \frac{1}{p} X X^T$$

In [4]:
def covar(x, y, bias=True):
    N = len(x)
    div = N if bias else N - 1
    mux = np.mean(x)
    muy = np.mean(y)
    return np.sum((x - mux) * (y - muy)) / div

def covar_mat(X, bias=True):
    n = len(X)
    C = np.empty((n,n))
    for i in range(n):
        C[i,i] = variance(X[i], bias=bias)
    for i in range(n):
        for j in range(i+1,n):
            C[i,j] = covar(X[i], X[j], bias=bias)
            C[j,i] = C[i,j]
    return C

def covar_mat2(X, bias=True):
    p = X.shape[1]
    div = p if bias else p - 1
    # Center each row on a copy so the caller's array is not modified in place
    Xc = X - np.mean(X, axis=1, keepdims=True)
    return (Xc @ Xc.T) / div

x = np.random.randn(137) * 1.1 - 1.7
y = np.random.randn(137) * 0.3 + 4.1

print(covar(x, y))
print(np.cov(np.vstack((x,y)), bias=True)[0,1])

X = np.random.randn(108, 37) * 1.45 - 0.67
C1 = np.cov(X, bias=True)
C2 = covar_mat(X, bias=True)
print(metrics.tdist(C1, C2))
C1 = np.cov(X, bias=False)
C2 = covar_mat(X, bias=False)
print(metrics.tdist(C1, C2))

C1 = np.cov(X, bias=True)
C2 = covar_mat2(X, bias=True)
print(metrics.tdist(C1, C2))

-0.007947126398190951
-0.00794712639819096
9.513896123207123e-15
9.881154319026808e-15
3.904817326864065e-15


## Correlation

Correlation is a measure of how close two variables are to having a linear relationship with each other.  
Correlation is usually measured by a correlation coefficient. There are several types of correlation coefficients.  

Two variables are said to be uncorrelated when their correlation coefficient is $0$. It means there is no increasing or decreasing linear trend between the two variables.  

## Correlation and dependence

Correlation can be seen as a more specific kind of dependence. Two variables can be uncorrelated (no linear trend between them) and yet be dependent.  
$X$ and $Y$ correlated implies that they are dependent (but the converse is false).  
Equivalently, $X$ and $Y$ independent implies that they are uncorrelated (but the converse is false). 
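A classic counterexample makes the distinction concrete: take $X$ uniform on $\{-1, 0, 1\}$ and $Y = X^2$. $Y$ is fully determined by $X$, yet their covariance is zero. The variable names below are illustrative, and everything is computed exactly from the PMF rather than from samples.

```python
import numpy as np

# X uniform on {-1, 0, 1}, Y = X^2 (illustrative example)
x_vals = np.array([-1.0, 0.0, 1.0])
p = np.full(3, 1 / 3)
y_vals = x_vals ** 2

e_x = np.sum(x_vals * p)             # E[X]
e_y = np.sum(y_vals * p)             # E[Y]
e_xy = np.sum(x_vals * y_vals * p)   # E[XY] = E[X^3]
cov_xy = e_xy - e_x * e_y

# Independence would require P{X=0, Y=0} = P{X=0} P{Y=0}
p_joint_00 = 1 / 3   # Y = 0 exactly when X = 0
p_x0 = 1 / 3
p_y0 = 1 / 3

print(cov_xy)
print(p_joint_00, p_x0 * p_y0)
```

The covariance vanishes, while the joint probability differs from the product of the marginals, so $X$ and $Y$ are uncorrelated but dependent.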

### Pearson correlation coefficient

It is a measure of the linear correlation between 2 variables $X$ and $Y$, ranging from $-1$ to $1$, with $+1$ a total positive linear correlation, $-1$ a total negative linear correlation, and $0$ no linear correlation.  

![correlation_examples](https://upload.wikimedia.org/wikipedia/commons/0/02/Correlation_examples.png)

$$\rho(X,Y) = \frac{\text{cov}(X,Y)}{\sigma(X)\sigma(Y)}$$

### Estimate

Let $x, y \in \mathbb{R}^N$ be samples from the random variables $X$ and $Y$ respectively. An estimate of the Pearson correlation coefficient between $X$ and $Y$ is:

$$\rho(x,y) =\frac{\text{cov}(x,y)}{\sigma(x)\sigma(y)} = \frac{\sum_{i=1}^N (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^N (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^N (y_i - \bar{y})^2}}$$

### Correlation Matrix

The idea is similar to the covariance matrix. We present the correlation matrix for the Pearson correlation coefficient.  
Let $X = (X_1, \text{...}, X_n)$ be a random vector, where each entry $X_i$ is a random variable.  
We define the correlation matrix $\Sigma \in \mathbb{R}^{n \times n}$ such that the entry $(i,j)$ is the correlation between $X_i$ and $X_j$:
$$\Sigma_{ij} = \rho(X_i, X_j)$$
$$\Sigma_{ii} = 1$$
$$\Sigma_{ij} = \Sigma_{ji}$$  

Let $X \in \mathbb{R}^{n \times p}$, where row $i$ is a sample of size $p$ of the random variable $X_i$.  
We can compute an estimate of the correlation matrix $\Sigma \in \mathbb{R}^{n \times n}$:
$$\Sigma_{ij} = \rho(x_i, x_j)$$

When each row of $X$ has mean $0$ and standard deviation $1$, it simplifies to:
$$\Sigma = \frac{1}{p} \sum_{k=1}^p  x_{:,k} x_{:,k}^T = \frac{1}{p} X X^T$$

In [16]:
def corr(x, y):
    sx = std(x)
    sy = std(y)
    return covar(x,y) / (sx * sy)

def corr_mat(X):
    p = X.shape[1]
    # Standardize each row on a copy so the caller's array is not modified in place
    Z = X - np.mean(X, axis=1, keepdims=True)
    Z = Z / np.std(Z, axis=1, keepdims=True)
    return (Z @ Z.T) / p
    

x = np.random.randn(137) * 1.1 - 1.7
y = np.random.randn(137) * 0.3 + 4.1

print(corr(x, y))
print(np.corrcoef(np.vstack((x,y)))[0,1])

X = np.random.randn(4, 37) * 1.45 - 0.67

C1 = np.corrcoef(X)
C2 = corr_mat(X)
print(metrics.tdist(C1, C2))

-0.07060091200764533
-0.07060091200764526
9.559066178840504e-16


## Common distributions

### Normal distribution (Gaussian)

$$X \sim \mathcal{N}(\mu, \sigma^2)$$
Parameters:
- $\mu$: mean
- $\sigma^2 > 0$: variance


$$\text{PDF: } f(x) = \frac{1}{\sqrt{2\pi \sigma^2}} \text{exp}(-\frac{(x - \mu)^2}{2\sigma^2})$$

$$\text{CDF: } F(x) = \frac{1}{2}[1 + \text{erf}(\frac{x - \mu}{\sigma \sqrt{2}})]$$
$$\mathbb{E}[X] = \mu$$
$$\text{Var}(X) = \sigma^2$$

$$ \text{erf}(x) = \frac{1}{\sqrt{\pi}} \int_{-x}^{x} e^{-t^2}dt$$
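The CDF formula translates directly to code with `math.erf` from the standard library. The function name `normal_cdf` and the test points are choices made for this sketch.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of N(mu, sigma^2), written with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

print(normal_cdf(0.0))                     # half the mass lies below the mean
print(normal_cdf(1.96))                    # close to 0.975
print(normal_cdf(1.0) - normal_cdf(-1.0))  # mass within one standard deviation
```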

In [6]:
_box_muller = [None]
def norm_box_muller():    
    if _box_muller[0] is not None:
        res = _box_muller[0]
        _box_muller[0] = None
        return res
    
    u1, u2 = np.random.rand(2)
    # rand() samples [0, 1); use 1 - u1 in (0, 1] so the logarithm stays finite
    r = np.sqrt(-2*np.log(1 - u1))
    theta = 2*np.pi*u2
    x = r * np.cos(theta)
    y = r * np.sin(theta)
    _box_muller[0] = x
    return y

_marsagalia_polar = [None]
def norm_marsagalia_polar():
    if _marsagalia_polar[0] is not None:
        res = _marsagalia_polar[0]
        _marsagalia_polar[0] = None
        return res

    while True:
        x, y = 2 * np.random.rand(2) - 1
        s = x**2 + y**2
        if s < 1 and s>0:
            break
    
    f = np.sqrt((-2*np.log(s))/s)
    a, b = x*f, y*f
    _marsagalia_polar[0] = a
    return b
    
    

N = 1000000

x = np.random.randn(N) * 4.5 - 1.3
print('[NP]  mu =', np.mean(x))
print('[NP] std =', np.std(x))


x = np.empty(N)
for i in range(N): x[i] = 4.5 * norm_box_muller() - 1.3
print('[BM]  mu =', np.mean(x))
print('[BM] std =', np.std(x))

x = np.empty(N)
for i in range(N): x[i] = 4.5 * norm_marsagalia_polar() - 1.3
print('[MP]  mu =', np.mean(x))
print('[MP] std =', np.std(x))

[NP]  mu = -1.3049802174821152
[NP] std = 4.500744466064507
[BM]  mu = -1.2979656266302555
[BM] std = 4.498800445728129
[MP]  mu = -1.2969973770770145
[MP] std = 4.502027093734645


In [7]:
# Generate from a Gaussian using the quantile function (inverse CDF)
import scipy.optimize
import scipy.special
import scipy.stats


def norm_cdf(x):
    return 1/2 * (1 + scipy.special.erf(x / np.sqrt(2)))

def norm_quantile(x):
    def f(v):
        return norm_cdf(v) - x
    return scipy.optimize.brentq(f, -10, 10)

def randn_qt(size):
    u = np.random.rand(size)
    x = np.array([norm_quantile(v) for v in u])
    return x

v = 0.6
b1 = scipy.stats.norm.ppf(v)
b2 = norm_quantile(v)
print(b1)
print(b2)
print(metrics.tdist(b1, b2))





x = randn_qt(100000) * 4.5 - 1.3
print('[QT]  mu =', np.mean(x))
print('[QT] std =', np.std(x))

0.2533471031357997
0.2533471031357997
0.0
[QT]  mu = -1.3085498511283453
[QT] std = 4.504173063970272


###  Binomial distribution

$$X \sim B(n, p)$$
Parameters:
- $n$: number of trials
- $p \in [0, 1]$: success probability for each trial.

$P(X = k)$ is the probability of observing exactly $k$ successes among the $n$ trials.

$$\text{PMF: } f(k) = \binom{n}{k} p^k(1-p)^{n-k}$$
$$\mathbb{E}[X] = np$$
$$\text{Var}(X) = np(1 - p)$$

$$\binom{n}{k} = \frac{n!}{k!(n-k)!}$$
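The PMF, mean and variance can be cross-checked by direct enumeration. This sketch uses `math.comb` (Python 3.8+); the function name `binomial_pmf` and the parameters $n = 10$, $p = 0.3$ are arbitrary choices.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 10, 0.3
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

total = sum(pmf)
mean = sum(k * pk for k, pk in enumerate(pmf))
var = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))

print(total)                   # should be 1
print(mean, n * p)             # E[X] = np
print(var, n * p * (1 - p))    # Var(X) = np(1 - p)
```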

### Multinomial distribution

Parameters:
- $n$: number of trials
- $p_i$: probability of event $i$, with $\sum_{i=1}^K p_i = 1$ and $p_i \geq 0$

$X$ is a discrete vector of size $K$, where $X_i$ is the number of realisations of event $i$ (so $\sum_{i=1}^K X_i = n$).

$$\text{PMF: } f(x) = \binom{n}{x_1\text{...} x_K} \prod_{i=1}^K p_i^{x_i}$$

$$\mathbb{E}[X_i] = np_i$$
$$\text{Var}(X_i) = np_i(1 - p_i)$$
$$\text{Cov}(X_i, X_j) = -np_ip_j \space (i \neq j)$$

$$\binom{n}{k_1 \text{...} k_m}= \frac{n!}{\prod_{i=1}^m k_i!}$$
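A direct transcription of the PMF, as a sketch: `multinomial_pmf` is a hypothetical helper that takes the count vector $x$ and the probability vector $p$, and infers $n$ as the sum of the counts.

```python
import math

def multinomial_pmf(x, p):
    """PMF of the multinomial distribution at counts x, with n = sum(x)."""
    n = sum(x)
    # Multinomial coefficient n! / (x_1! ... x_K!); each partial quotient is an integer
    coef = math.factorial(n)
    for xi in x:
        coef //= math.factorial(xi)
    prob = 1.0
    for xi, pi in zip(x, p):
        prob *= pi ** xi
    return coef * prob

print(multinomial_pmf((1, 1, 1), (0.2, 0.3, 0.5)))
# With K = 2 categories, the PMF reduces to the binomial PMF
print(multinomial_pmf((3, 7), (0.3, 0.7)))
print(math.comb(10, 3) * 0.3 ** 3 * 0.7 ** 7)
```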

In [8]:
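def rand_multinomial(p):
    # Draw one category index by inverting the cumulative distribution of p
    # (a single multinomial trial, n = 1)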
    s = 0
    p2 = np.empty(len(p))
    for i in range(len(p)-1):
        s += p[i]
        p2[i] = s
    p2[-1] = 1
    
    u = np.random.rand()
    k = 0
    while u > p2[k]:
        k += 1
    return k

N = 1000000
x = np.empty(N, dtype=int)
p = [0.1, 0.6, 0.3]
for i in range(N):
    x[i] = rand_multinomial(p)
    
print('p[0]:', np.mean(x==0))
print('p[1]:', np.mean(x==1))
print('p[2]:', np.mean(x==2))

p[0]: 0.099862
p[1]: 0.599834
p[2]: 0.300304


### Multivariate Normal distribution

$$X \sim \mathcal{N}(\mu, \Sigma)$$
Parameters:
- $\mu \in \mathbb{R}^p$: mean
- $\Sigma \in \mathbb{R}^{p \times p}$: covariance matrix (symmetric positive definite, so that $\Sigma^{-1}$ and the density exist)

$$\text{PDF: } f(x) = ((2\pi)^{p} \text{det}(\Sigma))^{-\frac{1}{2}} \exp(-\frac{1}{2} (x - \mu)^T \Sigma^{-1}(x-\mu))$$

$$\mathbb{E}[X] = \mu$$
$$\text{Var}(X) = \Sigma$$
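The density can be evaluated as written, as in the sketch below. The name `mvn_pdf` is an assumption, and the quadratic form is computed with `np.linalg.solve` rather than an explicit inverse. With a diagonal $\Sigma$ the result should match a product of univariate normal densities, which gives an easy sanity check.

```python
import numpy as np

def mvn_pdf(x, mu, sigma):
    """Density of N(mu, sigma) at x, for a positive definite sigma."""
    p = len(mu)
    diff = x - mu
    # (x - mu)^T sigma^{-1} (x - mu), via a linear solve
    quad = diff @ np.linalg.solve(sigma, diff)
    norm = ((2 * np.pi) ** p * np.linalg.det(sigma)) ** -0.5
    return norm * np.exp(-0.5 * quad)

def normal_pdf(x, mu, var):
    """Univariate normal density, used only for the sanity check."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

mu = np.array([0.0, 1.0])
sigma = np.diag([4.0, 9.0])
x = np.array([1.0, -2.0])

print(mvn_pdf(x, mu, sigma))
print(normal_pdf(x[0], mu[0], 4.0) * normal_pdf(x[1], mu[1], 9.0))
```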

In [9]:

rmu = np.array([0.5, -1.2, 4.6])
rsig = np.array([[0.4, 1.2, -1.8],[2.5,-2.8,-1.9],[-1.4,6.7,2.5]])
rsig = rsig.T @ rsig
N = 1000000

print('mu =', rmu)
print('sig=')
print(rsig)

X = np.random.multivariate_normal(rmu, rsig, size=N, check_valid='raise')
mu = np.mean(X, axis=0)
sig = 1/N * (X - mu.reshape(1,3)).T @ (X - mu.reshape(1,3))
print('[NP] mu =', mu)
print('[NP] sig=')
print(sig)


def normal_multivariate(mu, sig, size):
    N = size
    p = len(mu)
    X = np.empty((N,p))
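    # sig is symmetric, so eigh returns real eigenvalues d and orthonormal eigenvectors V
    d, V = np.linalg.eigh(sig)
    # Q = V diag(sqrt(d)) satisfies Q Q^T = sig
    Q = np.sqrt(d).reshape(1,p) * V 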
    
    
    for i in range(N):
        xn = np.random.randn(p)
        X[i] = Q @ xn + mu
    return X
    
X = normal_multivariate(rmu, rsig, size=N)
mu = np.mean(X, axis=0)
sig = 1/N * (X - mu.reshape(1,3)).T @ (X - mu.reshape(1,3))
print('mu =', mu)
print('sig=')
print(sig)

mu = [ 0.5 -1.2  4.6]
sig=
[[  8.37 -15.9   -8.97]
 [-15.9   54.17  19.91]
 [ -8.97  19.91  13.1 ]]
[NP] mu = [ 0.50405347 -1.21032113  4.59377737]
[NP] sig=
[[  8.38767344 -15.93293949  -8.99163212]
 [-15.93293949  54.22101876  19.95884064]
 [ -8.99163212  19.95884064  13.12666483]]
mu = [ 0.50254462 -1.20765602  4.59670213]
sig=
[[  8.37677982 -15.9048349   -8.97634335]
 [-15.9048349   54.09143531  19.90138387]
 [ -8.97634335  19.90138387  13.09848767]]


### Beta distribution

$$X \sim \text{Beta}(\alpha, \beta)$$

Parameters:
- $\alpha \in \mathbb{R} > 0$
- $\beta \in \mathbb{R} > 0$

The variable $x \in \mathbb{R}$ must lie in $[0,1]$.

$$\text{PDF: } f(x) = \frac{x^{\alpha-1} (1-x)^{\beta - 1}}{B(\alpha,\beta)}$$

$$\text{where } B(\alpha,\beta) = \frac{\Gamma (\alpha) \Gamma(\beta)}{\Gamma (\alpha + \beta)}$$

$$\text{where } \Gamma(z) = \int_{0}^{+\infty} x^{z-1} e^{-x}dx$$  

$$\mathbb{E}[X] = \frac{\alpha}{\alpha + \beta}$$
$$\text{Var}(X) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$$

The beta distribution is the conjugate prior of the Bernoulli, binomial, and geometric distributions.  
It is usually used to describe prior knowledge concerning the probability of success of an event.
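Conjugacy means the posterior stays in the beta family. For Bernoulli observations, a $\text{Beta}(\alpha, \beta)$ prior combined with $s$ successes and $f$ failures gives a $\text{Beta}(\alpha + s, \beta + f)$ posterior. The sketch below applies this update with hypothetical helper names and made-up counts.

```python
def beta_posterior(alpha, beta, successes, failures):
    """Posterior parameters after observing Bernoulli outcomes."""
    return alpha + successes, beta + failures

def beta_mean(alpha, beta):
    return alpha / (alpha + beta)

def beta_var(alpha, beta):
    s = alpha + beta
    return alpha * beta / (s ** 2 * (s + 1))

# Hypothetical prior Beta(2, 2) and data: 7 successes, 3 failures
a_post, b_post = beta_posterior(2.0, 2.0, successes=7, failures=3)

print(a_post, b_post)
print(beta_mean(2.0, 2.0), beta_mean(a_post, b_post))
print(beta_var(2.0, 2.0), beta_var(a_post, b_post))
```

The posterior mean moves from the prior mean toward the observed success rate, and the variance shrinks as evidence accumulates.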

### Dirichlet distribution

$$X \sim \text{Dir}(\alpha)$$

Parameters:
$\alpha \in \mathbb{R}^K$, $K \geq 2$, $\alpha_k > 0$

Input: $x \in \mathbb{R}^K$, with $x_k \in [0,1]$, and $\sum_{k=1}^Kx_k=1$

$$\text{PDF: } f(x) = \frac{1}{B(\alpha)} \prod_{i=1}^K x_i^{\alpha_i-1}$$

$$\text{where } B(\alpha) = \frac{\prod_{i=1}^K \Gamma(\alpha_i)}{\Gamma(\sum_{i=1}^K\alpha_i)}$$

$$\mathbb{E}[X_i] = \frac{\alpha_i}{\sum_{k=1}^K \alpha_k}$$
$$\text{Var}(X_i) = \frac{\alpha_i(\alpha_0 - \alpha_i)}{\alpha_0^2(\alpha_0 + 1)}$$
$$\text{where } \alpha_0 = \sum_{i=1}^K \alpha_i$$  

The Dirichlet distribution is a multivariate generalization of the beta distribution.  
It is the conjugate prior of the categorical and multinomial distributions.
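The mean and variance formulas can be compared against samples drawn with NumPy's `Generator.dirichlet`. The seed, the sample size and $\alpha = (2, 3, 5)$ are arbitrary choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([2.0, 3.0, 5.0])
samples = rng.dirichlet(alpha, size=100_000)  # each row lies on the simplex

alpha0 = alpha.sum()
expected_mean = alpha / alpha0
expected_var = alpha * (alpha0 - alpha) / (alpha0 ** 2 * (alpha0 + 1))

print(samples.sum(axis=1)[:3])   # every row sums to 1
print(samples.mean(axis=0), expected_mean)
print(samples.var(axis=0), expected_var)
```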