This notebook covers the definitions from the Probability and Information Theory chapter of the MIT Press Deep Learning book, along with equivalent NumPy and PyTorch implementations.


1. [Study group](http://www.youtube.com/watch?v=Db7B8yBAnHQ)
2. [Book](http://www.deeplearningbook.org/)




## Import libraries

In [None]:
import torch
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import math

# Probability 
Probability is the study of uncertainty.
1. We can use probability to represent a degree of belief, e.g. a doctor diagnoses a patient with a disease at a certain level of confidence, with 1 meaning absolute certainty that the patient has the disease and 0 meaning absolute certainty that they do not.
2. We can use probability to denote the rate at which an event occurs, e.g. the probability that a die lands on 6 is 1/6, and the probability that a coin lands heads is 0.5.

To work with probability we need two pieces of information: The random variable and the probability distribution.

## Random variable

A random variable is a variable that can take on different values randomly. On its own, a random variable is just a description of the states that are possible; it must be coupled with a probability distribution that specifies how likely each of these states is. Random variables can be discrete or continuous. $\text{x}$ denotes a random variable and $x$ denotes one of its values.

## Probability distribution

A probability distribution is a description of how likely a random variable or set of random variables is to take on each of its possible states. The description of a probability distribution depends on whether the variables are discrete or continuous.

## Discrete variables and probability mass functions (PMF)
The probability mass function (PMF) maps from a state of a random variable to the probability of that random variable taking on that state. The probability that $\text{x} = x$ is denoted as $P(x)$, with a probability of 1 indicating that $\text{x} = x$ is certain and a probability of 0 indicating that $\text{x} = x$ is impossible.

1. The distributions are usually written in this form: $\text{x}\sim P(\text{x})$
2. When a PMF acts on many variables at the same time, it is called a joint probability distribution, e.g. $P(\text{x}=x,\text{y}=y)$.

PMFs must satisfy the following properties:
* The domain of $P$ must be the set of all possible states of x.
* $\forall{x} \in \text{x}, 0 \leq P(x) \leq 1$. The probability of any given state must be greater than or equal to 0 and less than or equal to 1.
* $\sum_{x \in \text{x}} P(x)=1$. The probabilities of all the states must sum to 1.
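
The properties above can be checked numerically. The following is a minimal sketch, assuming a fair six-sided die as the example distribution; the helper name `is_valid_pmf` is hypothetical and does not come from the book.

```python
import numpy as np

def is_valid_pmf(probs, tol=1e-9):
    """Return True if every entry lies in [0, 1] and the entries sum to 1."""
    probs = np.asarray(probs, dtype=float)
    in_range = np.all((probs >= 0.0) & (probs <= 1.0))
    sums_to_one = abs(probs.sum() - 1.0) <= tol
    return bool(in_range and sums_to_one)

# Assumed example: a fair six-sided die, one probability per face.
die_pmf = np.full(6, 1 / 6)

print(is_valid_pmf(die_pmf))     # every face has probability 1/6
print(is_valid_pmf([0.5, 0.6]))  # sums to 1.1, so this is not a PMF
```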

## Continuous variables and probability density functions (PDF)
A probability density function (PDF) describes the probability that a continuous random variable falls within a range of values, as opposed to taking on any single value. It is denoted as $p(x)$. It must satisfy the following conditions:

* The domain of $p$ must be the set of all possible states of x.
* $\forall{x} \in \text{x}, p(x) \geq 0$
* $\int p(x)dx=1$

Since a PDF does not give the probability of a single value directly, we find the probability that $x$ lies in a range by integrating the PDF over that range: $\int_{[a,b]}p(x)dx$.
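
As a rough illustration, the integral over a range can be approximated numerically. This sketch assumes a standard normal density and the interval $[-1, 1]$, neither of which comes from the text; it uses a simple trapezoidal sum and compares the result with the closed form obtained from the error function.

```python
import math
import numpy as np

def standard_normal_pdf(x):
    """Density of a normal distribution with mean 0 and variance 1."""
    return np.exp(-0.5 * x**2) / math.sqrt(2.0 * math.pi)

def standard_normal_cdf(z):
    """Closed-form cumulative distribution of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Assumed interval, chosen only for illustration.
a, b = -1.0, 1.0
grid = np.linspace(a, b, 10_001)
density = standard_normal_pdf(grid)

# Trapezoidal rule: average neighbouring heights times the step width.
prob_numeric = np.sum((density[1:] + density[:-1]) / 2.0 * np.diff(grid))

# P(a <= x <= b) = CDF(b) - CDF(a)
prob_exact = standard_normal_cdf(b) - standard_normal_cdf(a)

print(f"numeric: {prob_numeric:.6f}, exact: {prob_exact:.6f}")
```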

## Marginal probability
When we have a probability distribution over a set of variables, the marginal probability distribution is the probability distribution over a subset of them.

For example, if we have random variables $\text{x}$ and $\text{y}$ and $P(\text{x},\text{y})$ is known, then $P(\text{x})$ can be calculated using the sum rule:
$$
\forall x \in \text{x}, P(\text{x} = x) = \sum_{y}P(\text{x} = x, \text{y} = y)
$$
For continuous variables, the sum becomes an integral over $y$:
$$
p(x) = \int p(x,y)dy
$$
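
For a discrete joint distribution stored as a table, the sum rule is a sum along one axis. The sketch below assumes a small made-up joint table with two states for x and three for y; the numbers are illustrative only.

```python
import numpy as np

# Assumed joint distribution P(x, y): rows index x (2 states), columns index y (3 states).
joint = np.array([
    [0.10, 0.20, 0.10],
    [0.30, 0.05, 0.25],
])

# Sum rule: sum over y (columns) to get P(x), and over x (rows) to get P(y).
p_x = joint.sum(axis=1)
p_y = joint.sum(axis=0)

print(f"P(x): {p_x}")
print(f"P(y): {p_y}")
```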

## Conditional probability
Conditional probability is the probability of an event given that another event has occurred. It is defined by the following formula only when $P(\text{x}=x) > 0$:
$$
P(\text{y} = y | \text{x} = x) = \frac{P(\text{y}=y,\text{x}=x)}{P(\text{x}=x)}
$$
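
The formula can be applied to a whole joint table at once by dividing each row by its marginal. This is a minimal sketch, assuming a made-up joint table with rows indexing x and columns indexing y.

```python
import numpy as np

# Assumed joint distribution P(x, y): rows index x, columns index y.
joint = np.array([
    [0.10, 0.20, 0.10],
    [0.30, 0.05, 0.25],
])

# Marginal P(x), needed as the denominator.
p_x = joint.sum(axis=1)

# The conditional is only defined where P(x) > 0.
if np.any(p_x <= 0):
    raise ValueError("P(x) must be positive for every state of x")

# P(y | x) = P(y, x) / P(x); broadcasting divides each row by its own marginal.
p_y_given_x = joint / p_x[:, np.newaxis]

print(p_y_given_x)
```

Each row of `p_y_given_x` is itself a probability distribution over y, so each row sums to 1.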



## Independence and conditional independence


Two random variables x and y are independent if their probability distribution can be expressed as a product of two factors, one involving only x and one involving only y:
$$
\forall x \in \text{x},y \in \text{y}, p(\text{x} = x, \text{y} = y) = p(\text{x} = x)p(\text{y} = y)
$$
Two random variables x and y are conditionally independent given a random variable z if the conditional probability distribution over x and y factorizes in this way for every value of z:
$$
\forall x \in \text{x},y \in \text{y}, z \in \text{z}, p(\text{x} = x, \text{y} = y | \text{z} = z) = p(\text{x} = x | \text{z} = z)p(\text{y} = y | \text{z} = z)
$$
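
For discrete variables, the independence condition says the joint table equals the outer product of its marginals. The sketch below checks only this (unconditional) independence; the helper name `is_independent` and both example tables are assumptions made for illustration.

```python
import numpy as np

def is_independent(joint, tol=1e-9):
    """Check whether a joint table P(x, y) equals the outer product of its marginals."""
    joint = np.asarray(joint, dtype=float)
    p_x = joint.sum(axis=1)
    p_y = joint.sum(axis=0)
    return bool(np.allclose(joint, np.outer(p_x, p_y), atol=tol))

# Assumed example 1: built as an outer product, so independent by construction.
independent_joint = np.outer([0.4, 0.6], [0.5, 0.3, 0.2])

# Assumed example 2: x and y always take the same value, so they are dependent.
dependent_joint = np.array([
    [0.5, 0.0],
    [0.0, 0.5],
])

print(is_independent(independent_joint))
print(is_independent(dependent_joint))
```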

## Expectation, Variance and Covariance

The expectation or expected value of some function $f(x)$ with respect to a probability distribution $P(x)$ is the average or mean value that $f$ takes on when $x$ is drawn from $P$. For discrete variables:
$$
\mathbb{E}_{\text{x}\sim P}[f(x)] = \sum_{x}P(x)f(x)
$$
For continuous variables:
$$
\mathbb{E}_{\text{x}\sim p}[f(x)] = \int p(x)f(x)dx
$$
The variance gives a measure of how much the values of a function of a random variable x vary as we sample different values of x from its probability distribution. If the variance is low, the values of $f(x)$ cluster near their expected value. The square root of the variance is the standard deviation. The variance is defined as:
$$
Var(f(x))= \mathbb{E}[(f(x) - \mathbb{E}[f(x)])^2]
$$
The covariance gives some sense of how much two values are linearly related to each other, as well as the scale of these variables:
$$
Cov(f(x),g(y))=\mathbb{E}[(f(x) - \mathbb{E}[f(x)])(g(y) - \mathbb{E}[g(y)])]
$$
* High absolute values of the covariance mean that the values change very much and are both far from their respective means at the same time.
* If the sign of the covariance is positive, then both variables tend to take on relatively high values simultaneously.
* If the sign of the covariance is negative, then one variable tends to take on a relatively high value at the times that the other takes on a relatively low value, and vice versa.
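
The three definitions can be evaluated directly for a small discrete distribution. This sketch assumes a fair six-sided die with $f(x) = x$ and $g(x) = x^2$; for simplicity both functions depend on the same variable, and the helper name `expectation` is hypothetical.

```python
import numpy as np

def expectation(values, probs):
    """E[f(x)] = sum_x P(x) f(x) for a discrete distribution."""
    return float(np.sum(probs * values))

# Assumed example: a fair six-sided die.
states = np.arange(1, 7)
pmf = np.full(6, 1 / 6)

f = states.astype(float)  # f(x) = x
g = f**2                  # g(x) = x^2

mean_f = expectation(f, pmf)
mean_g = expectation(g, pmf)

# Var(f) = E[(f - E[f])^2]
var_f = expectation((f - mean_f) ** 2, pmf)

# Cov(f, g) = E[(f - E[f]) (g - E[g])]
cov_fg = expectation((f - mean_f) * (g - mean_g), pmf)

print(f"E[f] = {mean_f:.4f}, Var(f) = {var_f:.4f}, Cov(f, g) = {cov_fg:.4f}")
```

The covariance is positive here because $x$ and $x^2$ grow together on the faces of a die.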


# Common probability distributions

## Bernoulli distribution
The Bernoulli distribution is a distribution over a single binary random variable. It is controlled by a single parameter $\phi \in [0, 1]$, which gives the probability of the random variable being equal to 1.
$$
P(\text{x}=x)= \phi^{x} (1-\phi)^{1-x}\\
\mathbb{E}_{x}[x] = \phi\\
Var_{x}(x)=\phi(1-\phi)
$$
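
The mean and variance formulas can be verified by summing over the two states. This is a minimal sketch, assuming $\phi = 0.7$; the helper name `bernoulli_pmf` is hypothetical.

```python
import numpy as np

def bernoulli_pmf(x, phi):
    """P(x) = phi**x * (1 - phi)**(1 - x) for x in {0, 1}."""
    return phi**x * (1.0 - phi) ** (1 - x)

phi = 0.7  # assumed parameter value
states = np.array([0, 1])
pmf = bernoulli_pmf(states, phi)

# Expectation and variance computed from their definitions.
mean = np.sum(pmf * states)
variance = np.sum(pmf * (states - mean) ** 2)

print(f"P(x=0) = {pmf[0]:.2f}, P(x=1) = {pmf[1]:.2f}")
print(f"mean = {mean:.2f}, variance = {variance:.2f}")
```
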
## Implementation

In [None]:
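# Torch: draw one sample from a Bernoulli distribution with phi = 0.7
bernoulli = torch.distributions.bernoulli.Bernoulli(probs=0.7)
print(f"Bernoulli sample: {bernoulli.sample((1,))}")

# Numpy: a binomial distribution with a single trial is a Bernoulli distribution
bernoulli_numpy = np.random.binomial(n=1, p=0.7)
print(f'\nBernoulli sample numpy: {bernoulli_numpy}')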


## Multinomial distribution

A multinomial distribution is the distribution over vectors in $\{0, \dots, n\}^{k}$ representing how many times each of the $k$ categories is visited when $n$ samples are drawn from a multinoulli distribution. The book focuses on the multinoulli (categorical) distribution, which is the special case of the multinomial distribution with $n=1$.

## Implementation
The torch output lists all 20 samples as category indices, while the numpy output shows how many times each event occurred, e.g. if the output is [7,3,8,2], the first event (index 0) occurred 7 times.

In [None]:
# Torch: draw 20 category indices from a categorical (multinoulli) distribution
multinomial = torch.distributions.categorical.Categorical(probs=torch.tensor([0.4, 0.2, 0.3, 0.1]))
print(f"Multinomial sample: {multinomial.sample((20,))}")

# Numpy: count how often each category occurs across 20 draws
multinomial_numpy = np.random.multinomial(n=20, pvals=[0.4, 0.2, 0.3, 0.1])
print(f"Multinomial numpy sample: {multinomial_numpy}")

## Gaussian distribution
The most commonly used distribution over real numbers is the normal distribution, also known as the Gaussian distribution:
$$
\mathcal{N}(x;\mu,\sigma^{2}) = \sqrt{\frac{1}{2\pi\sigma^{2}}}\exp\left(-\frac{1}{2\sigma^{2}}(x-\mu)^{2}\right)
$$
The two parameters $\mu \in \mathbb{R}$ and $\sigma \in (0, \infty)$ control the normal distribution. The parameter $\mu$ gives the coordinate of the central peak. This is also the mean of the distribution: $\mathbb{E}[x] = \mu$. The standard deviation of the distribution is given by $\sigma$, and the variance by $\sigma^{2}$.
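
The roles of $\mu$ and $\sigma$ can be checked by coding the density by hand and integrating it numerically. This sketch assumes $\mu = 1$ and $\sigma = 2$ purely for illustration, and approximates the integrals with a Riemann sum over a grid spanning eight standard deviations on each side of the mean.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    variance = sigma**2
    return np.sqrt(1.0 / (2.0 * np.pi * variance)) * np.exp(-((x - mu) ** 2) / (2.0 * variance))

# Assumed parameter values.
mu, sigma = 1.0, 2.0

grid = np.linspace(mu - 8 * sigma, mu + 8 * sigma, 20_001)
dx = grid[1] - grid[0]
density = gaussian_pdf(grid, mu, sigma)

# Riemann-sum approximations of the integrals.
area = np.sum(density) * dx                         # should be close to 1
mean = np.sum(grid * density) * dx                  # should be close to mu
variance = np.sum((grid - mean) ** 2 * density) * dx  # should be close to sigma**2

print(f"area = {area:.6f}, mean = {mean:.6f}, variance = {variance:.6f}")
```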


## Bell curve

In [None]:
mu = 0
variance = 1
sigma = math.sqrt(variance)
x = np.linspace(mu - 3*sigma, mu + 3*sigma, 100)
plt.plot(x, stats.norm.pdf(x, mu, sigma))
plt.title('Bell curve')
plt.ylabel('p(x)')
plt.show()

## Implementation

In [None]:
# Torch
t_n = torch.distributions.normal.Normal(loc=0.0, scale=1.0)
print(f"Sample: {t_n.sample((20,))}")

# Numpy
t_num = np.random.normal(loc=0.0,scale=1.0,size=(20,))
print(f"\nSample: {t_num}")

## Exponential and Laplace distributions
The exponential distribution has a sharp point at $x=0$:
$$
p(x;\lambda) = \lambda 1_{x \geq 0}\exp(-\lambda x)
$$
The exponential distribution uses the indicator function $1_{x \geq 0}$ to assign probability zero to all negative values of $x$.

The Laplace distribution has a sharp peak of probability mass at an arbitrary point $\mu$:
$$
\text{Laplace}(x;\mu,\gamma) = \frac{1}{2\gamma}\exp\left(-\frac{|x-\mu|}{\gamma}\right)
$$
![Laplace distribution](http://upload.wikimedia.org/wikipedia/commons/thumb/0/0a/Laplace_pdf_mod.svg/1280px-Laplace_pdf_mod.svg.png)
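
Both densities are short enough to write out directly from the formulas. The sketch below assumes $\lambda = 1$, $\mu = 0$ and $\gamma = 1$; the function names `exponential_pdf` and `laplace_pdf` are hypothetical.

```python
import numpy as np

def exponential_pdf(x, lam):
    """lam * 1_{x >= 0} * exp(-lam * x)."""
    x = np.asarray(x, dtype=float)
    # np.where plays the role of the indicator function; clipping keeps
    # exp from overflowing on large negative inputs that are discarded anyway.
    return np.where(x >= 0, lam * np.exp(-lam * np.clip(x, 0, None)), 0.0)

def laplace_pdf(x, mu, gamma):
    """exp(-|x - mu| / gamma) / (2 * gamma)."""
    x = np.asarray(x, dtype=float)
    return np.exp(-np.abs(x - mu) / gamma) / (2.0 * gamma)

xs = np.array([-1.0, 0.0, 1.0])
exp_values = exponential_pdf(xs, lam=1.0)
lap_values = laplace_pdf(xs, mu=0.0, gamma=1.0)

print(f"exponential: {exp_values}")
print(f"laplace:     {lap_values}")
```

The exponential density is zero at $x=-1$, while the Laplace density is symmetric around $\mu$ and takes the same value at $x=-1$ and $x=1$.
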
## Implementation

In [None]:
# Torch
m = torch.distributions.exponential.Exponential(torch.tensor([1.0]))
print(f"Sample: {m.sample((20,))}\n")

# loc is the mean; scale is the diversity parameter gamma (the std is sqrt(2) * scale)
m = torch.distributions.laplace.Laplace(loc=torch.tensor([0.0]), scale=torch.tensor([1.0]))
print(f"Sample: {m.sample((20,))}\n")


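# Numpy
# scale is 1/lambda here, whereas torch's Exponential takes the rate lambda directly
m = np.random.exponential(scale=1.0, size=20)
print(f"Sample: {m}\n")

m = np.random.laplace(loc=0.0, scale=1.0, size=20)
print(f"Sample: {m}\n")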

## Dirac and Empirical distributions

The Dirac distribution is a probability distribution in which all of the mass is concentrated at a single point. It is defined using the Dirac delta function $\delta(x)$:
$$
p(x)=\delta(x-\mu)
$$
Here $\delta$ is shifted by $\mu$, so in the diagram you can see that there is a peak of probability mass at $x=\mu$.
The Dirac delta function is defined such that it is zero-valued everywhere except 0, yet integrates to 1. Here is a diagram showing the Dirac function:
[Dirac function](http://en.wikipedia.org/wiki/Dirac_delta_function#/media/File:Dirac_distribution_PDF.svg.png).

The book mentions that the Dirac delta function is a generalized function: it is not an ordinary function that associates a real value with each value $x$, but is instead defined in terms of its properties when integrated. It can be thought of as the limit of a series of functions that put less and less mass on all points other than 0.

The Dirac delta distribution is used as a component of the empirical distribution:
$$
\hat{p}(x) = \frac{1}{m}\sum_{i=1}^{m} \delta (x-x^{(i)})
$$
In this equation, each of the $m$ points $x^{(1)}, ..., x^{(m)}$ has a probability mass of $\frac{1}{m}$. For discrete variables, the probability of each state is simply the empirical frequency of that state in the training set.
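
For discrete data, the empirical distribution reduces to counting. The following sketch assumes a tiny made-up "training set" of eight integer observations.

```python
import numpy as np

# Assumed toy training set of discrete observations.
samples = np.array([2, 0, 1, 2, 2, 1, 0, 2])

# Each of the m samples contributes mass 1/m, so the probability of a state
# is its count divided by the number of samples.
states, counts = np.unique(samples, return_counts=True)
empirical_pmf = counts / samples.size

for state, prob in zip(states, empirical_pmf):
    print(f"P(x={state}) = {prob:.3f}")
```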



# Common functions
## Sigmoid
$$
\sigma(x) = \frac{1}{1+\exp(-x)}
$$
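
A NumPy equivalent of the formula is sketched below. It is a direct transcription rather than a numerically hardened implementation, so `np.exp` can emit an overflow warning for very negative inputs; the function name `sigmoid` is an assumption.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + exp(-x))."""
    x = np.asarray(x, dtype=float)
    return 1.0 / (1.0 + np.exp(-x))

xs = np.arange(-10, 10, 0.2)
ys = sigmoid(xs)

# The output always lies strictly between 0 and 1 on this range.
print(f"min = {ys.min():.6f}, max = {ys.max():.6f}, sigmoid(0) = {sigmoid(0.0)}")
```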


In [None]:
# Torch
t = torch.arange(-10,10,0.2)
s = torch.special.expit(t)
plt.plot(t,s)
plt.ylabel('𝜎(x)')

## Softplus

$$
\zeta(x)=\log(1+\exp(x))
$$
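
A NumPy equivalent can be written with `np.logaddexp`, which computes $\log(\exp(a) + \exp(b))$; with $a = 0$ this is exactly $\log(1 + \exp(x))$ and it avoids overflow for large $x$. The function name `softplus` is an assumption.

```python
import numpy as np

def softplus(x):
    """Softplus: log(1 + exp(x)), computed as log(exp(0) + exp(x))."""
    x = np.asarray(x, dtype=float)
    return np.logaddexp(0.0, x)

xs = np.arange(-10, 10, 0.2)
ys = softplus(xs)

# On a moderate range the result matches the formula written out directly.
naive = np.log(1.0 + np.exp(xs))
print(f"max difference from the direct formula: {np.max(np.abs(ys - naive)):.2e}")
```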

In [None]:
# Torch
t = torch.arange(-10,10,0.2)
softplus = torch.nn.Softplus()
s = softplus(t)

plt.plot(t,s)
plt.ylabel('𝜁(𝑥)')