## 1. Probability for ML

Probability plays a key role in modern machine learning applications, because machine learning is largely about modeling uncertainty and making predictions. For instance:
* predicting the probability that a patient will develop an eye-related disease (glaucoma, etc.) in the next year, given the person's medical history
* detecting anomalies and spam
* structuring reward and punishment mechanisms in reinforcement learning while an agent performs certain tasks
* in recommendation engines, predicting the probability that a user will buy a particular product, in order to recommend related products alongside it.

To understand the use cases above clearly, we first need a solid understanding of ```probability and information theory```.

While ***probability theory*** is a fundamental mathematical framework for representing uncertainty, ***information theory*** is a way to quantify the amount of uncertainty.

### 1.1 Why study probability
In machine learning applications we deal with uncertain and stochastic (nondeterministic) quantities. This uncertainty comes from several sources, such as:

* ```Incomplete observability```: When making predictions about an event, we usually do not have all the observations about that event.
 
``` 
[TODO]
* add example
```

* ```Incomplete modeling```: When building machine learning models, it is not always possible to include all of the observed information in the model. As a result, the model's predictions are uncertain.

``` 
[TODO]
* add example
```

Another scenario in which we need to apply probability is that in certain cases it is more useful to use a simple probabilistic rule than to model a complex deterministic rule.


``` 
[TODO]
* add example
```


### 1.2 Random Variable
We will use random variables a lot throughout this notebook. A random variable is a variable that can take on different values randomly; it describes the states that are possible.

We will denote a random variable by $\mathrm{x}$, one of its values by $x$, and its possible values by $x_1, x_2,...,x_n$.

A random variable may be discrete or continuous. A discrete random variable has a finite or countably infinite number of states, while a continuous random variable takes real values.

https://www.math.ubc.ca/~pwalls/math-python/jupyter/latex/

https://en.wikibooks.org/wiki/LaTeX/Mathematics

https://towardsdatascience.com/probability-and-statistics-explained-in-the-context-of-deep-learning-ed1509b2eb3f

https://machinelearningmastery.com/why-learn-probability-for-machine-learning/

https://d2l.ai/chapter_preliminaries/probability.html#independence

### 1.3 Probability Distributions
A probability distribution is a mathematical function that describes how likely a random variable, or a set of random variables, is to take each of its possible values. How it is defined depends on whether the variables are discrete or continuous.



#### 1.3.1 Discrete Variables and Probability Mass Functions

We usually describe a probability distribution over discrete variables with a ***probability mass function (PMF)***, denoted by a capital ***P***.

A PMF maps each state of a random variable to the probability of the random variable taking on that state. In mathematical terms, the probability that $\mathrm{x} = x$ is written $P(\mathrm{x} = x)$, or simply $P(x)$.

To define a ***joint probability*** over several variables at the same time, we write $P(\mathrm{x} = x,\mathrm{y} = y)$, which denotes the probability that $\mathrm{x} = x$ and $\mathrm{y} = y$ simultaneously. It can also be written as $P(x, y)$.

There are certain conditions that $P$ must satisfy in order to be a PMF of a random variable:
* $\forall x \in \mathrm{x}, 0 \leq P(x) \leq 1$. In other words, every probability must be between 0 and 1.
* $\sum_{x \in \mathrm{x}} P(x) = 1$. This property is called ***normalization***: the probabilities of all states sum to 1, which guarantees that no state, and no set of states, has a probability greater than 1.

#### Sample 1: Discrete Uniform distribution
We demonstrate this by plotting a uniform distribution. The ***uniform distribution*** makes each state equally likely: with $k$ states, $P(\mathrm{x} = x_i) = \frac{1}{k}$ for every state $x_i$.

[TODO]
https://docs.scipy.org/doc/scipy/reference/tutorial/stats/discrete_randint.html


In [5]:
import numpy as np
import matplotlib.pyplot as plt

# Discrete uniform distribution over k states: each state has probability 1/k
states = np.arange(1, 6)                      # the states 1, 2, 3, 4, 5
pmf = np.full(len(states), 1 / len(states))   # P(x) = 1/5 for every state

# Stem plot: one vertical line per state, all of equal height
plt.stem(states, pmf)
plt.xlabel("x")
plt.ylabel("P(x)")
plt.title("Discrete uniform PMF")
plt.show()
#https://plot.ly/python/getting-started/#initialization-for-offline-plotting
#https://realpython.com/python-histograms/
#https://jakevdp.github.io/PythonDataScienceHandbook/04.09-text-and-annotation.html
#https://matplotlib.org/3.1.1/gallery/text_labels_and_annotations/usetex_demo.html#sphx-glr-gallery-text-labels-and-annotations-usetex-demo-py

#### 1.3.2 Continuous Variables and Probability Density Functions

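When we work with continuous variables, we use a ***probability density function (PDF)*** instead of a PMF. We denote a PDF by $p$. A density does not give the probability of a single state directly; probabilities are obtained by integrating $p$ over a region, so $p(x)$ itself may be greater than 1.
A PDF must satisfy the following conditions: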
* $\forall x \in \mathrm{x}, p(x) \geq 0$
* $\int p(x) dx = 1$

#### Sample 2: Continuous Uniform Distribution
We can define the uniform distribution for a continuous variable with a function $u(x;a,b)$, where $a$ and $b$ are the endpoints of the interval, with $b > a$. Inside the interval the density is constant, and outside it is zero:
$$
u(x;a,b) = \begin{cases} \frac{1}{b-a} & \text{if } x \in [a,b] \\ 0 & \text{otherwise} \end{cases}
$$
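
As a rough numerical check, the sketch below evaluates a continuous uniform density (constant $\frac{1}{b-a}$ on $[a,b]$, zero elsewhere) and approximates its integral with a Riemann sum. The function name `uniform_pdf` and the endpoints are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def uniform_pdf(x, a, b):
    """Density of the continuous uniform distribution on [a, b] (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    # 1/(b-a) inside the interval, 0 outside
    return np.where((x >= a) & (x <= b), 1.0 / (b - a), 0.0)

a, b = 2.0, 5.0  # illustrative endpoints, chosen only for this example
xs = np.linspace(a - 1.0, b + 1.0, 60001)
dx = xs[1] - xs[0]

# Riemann-sum approximation of the integral of p(x) over the grid
area = float(np.sum(uniform_pdf(xs, a, b)) * dx)
print(area)  # close to 1, up to discretization error
```

The sum is only an approximation of the integral, so it matches 1 up to the grid resolution.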

### 1.3 Marginal Probability

Marginalization is the operation of determining  $P(y)$  from  $P(x,y)$.

$$
P(y) = \sum_{x}P(x,y)
$$
 
This is also known as the **sum rule**. The resulting probability or distribution is called a marginal probability or a marginal distribution. It answers the following question:
 * what is the probability distribution over a subset of the random variables, when we know the joint distribution over a superset of them?
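
The sum rule can be sketched with a small joint table. The numbers below are made up for illustration; rows are assumed to index the values of $\mathrm{x}$ and columns the values of $\mathrm{y}$.

```python
import numpy as np

# Hypothetical joint PMF P(x, y): rows index x, columns index y
joint = np.array([[0.10, 0.20],
                  [0.30, 0.15],
                  [0.05, 0.20]])

p_y = joint.sum(axis=0)  # sum over x gives the marginal P(y)
p_x = joint.sum(axis=1)  # sum over y gives the marginal P(x)

print(p_y)  # two entries, one per value of y
print(p_x)  # three entries, one per value of x
```

Both marginals sum to 1 because the joint table does.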

### 1.4 Conditional Probability

In many machine learning problems we are interested in the probability of some event given that some other event has occurred. This is called a conditional probability. We denote the conditional probability that $\mathrm{y} = y$ given $\mathrm{x} = x$ by $P(\mathrm{y} = y | \mathrm{x} = x)$. It is defined only when $P(\mathrm{x} = x) > 0$ and can be computed as follows:

$$
P(\mathrm{y} = y  |  \mathrm{x} = x) = \frac{P(\mathrm{y} = y,\mathrm{x} = x)}{P(\mathrm{x} = x)}
$$
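
A minimal sketch of this formula on a made-up joint table (rows are assumed to index $\mathrm{x}$, columns $\mathrm{y}$): dividing each row of the joint by the corresponding $P(\mathrm{x} = x)$ gives $P(\mathrm{y} | \mathrm{x} = x)$.

```python
import numpy as np

# Hypothetical joint PMF P(x, y): rows index x, columns index y
joint = np.array([[0.10, 0.20],
                  [0.30, 0.15],
                  [0.05, 0.20]])

# Marginal P(x), kept as a column so it broadcasts across each row
p_x = joint.sum(axis=1, keepdims=True)

# Row i holds the conditional distribution P(y | x = x_i)
cond_y_given_x = joint / p_x
print(cond_y_given_x)
```

Each row of `cond_y_given_x` is itself a valid PMF and sums to 1.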

### 1.5 Independence and Conditional Independence

It is also important to understand the difference between dependence and independence in probability. Two random variables $\mathrm{x}$ and $\mathrm{y}$ are independent if knowing the value of one reveals no information about the value of the other. Equivalently, their joint distribution is the product of their marginal distributions.

In mathematical terms:

$$
\forall x \in \mathrm{x}, \forall y \in \mathrm{y},P(\mathrm{x} = x,\mathrm{y} = y) =P(\mathrm{x} = x)P(\mathrm{y} = y) 
$$

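Another useful property is conditional independence. Two random variables $\mathrm{x}$ and $\mathrm{y}$ are conditionally independent given a random variable $\mathrm{z}$ if the conditional distribution over $\mathrm{x}$ and $\mathrm{y}$ factorizes in the following way for every value of $\mathrm{z}$: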
$$
\forall x \in \mathrm{x}, \forall y \in \mathrm{y}, \forall z \in \mathrm{z},P(\mathrm{x} = x,\mathrm{y} = y |\mathrm{z} = z)   = P(\mathrm{x} = x |\mathrm{z} = z)P(\mathrm{y} = y |\mathrm{z} = z) 
$$

The compact notations of independence and conditional independence are given below:
* ```Independence```: $x \perp y$
* ```Conditional Independence```: $x \perp y | z$
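
One way to check independence numerically is to compare a joint table with the outer product of its marginals. This is a sketch; the helper name `is_independent` and both tables are assumptions made for illustration.

```python
import numpy as np

def is_independent(joint, tol=1e-12):
    """Return True if P(x, y) == P(x) P(y) for every cell (hypothetical helper)."""
    p_x = joint.sum(axis=1)
    p_y = joint.sum(axis=0)
    return bool(np.allclose(joint, np.outer(p_x, p_y), atol=tol))

# Built as a product of marginals, so it is independent by construction
independent = np.outer([0.3, 0.7], [0.4, 0.6])

# x and y always take the same value, so knowing one reveals the other
dependent = np.array([[0.5, 0.0],
                      [0.0, 0.5]])

print(is_independent(independent))  # True
print(is_independent(dependent))    # False
```

Conditional independence can be checked the same way by applying the test to each slice $P(\mathrm{x}, \mathrm{y} | \mathrm{z} = z)$.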

### 1.6 Expectation, Variance and Covariance
The **expected value** of some function $f(x)$ with respect to a probability distribution $P(\mathrm{x})$ is the mean value that $f$ takes when $x$ is drawn from $P$. For discrete variables:

$$
E_{\mathrm{x} \sim P}[f(x)] = \sum_{x}P(x)f(x)
$$


**Variance** measures how much the values of a function of a random variable $\mathrm{x}$ vary as we draw different values of $x$ from its probability distribution, that is, how far $f(x)$ tends to be from its expected value.
$$
Var(f(x)) = E[(f(x) -E[f(x)])^2]
$$

If the variance is low, the values of $f(x)$ cluster near their expected value.

**Covariance** measures how much two values are linearly related to each other, as well as the scale of these values.
$$
Cov(f(x),g(y)) = E[(f(x) -E[f(x)])(g(y) -E[g(y)])]
$$


The covariance matrix of a random vector $x \in \mathbb{R}^n$ is an $n \times n$ matrix whose diagonal elements give the variances of the individual components.
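
The three definitions can be evaluated exactly for small discrete distributions. The sketch below assumes an illustrative PMF, the function $f(x) = x^2$, and a made-up joint table for the covariance; none of these come from the notebook.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0])  # illustrative states of x
probs = np.array([0.1, 0.2, 0.3, 0.4])   # illustrative PMF P(x)

def f(x):
    return x ** 2

# E[f(x)] = sum_x P(x) f(x)
expectation = float(np.sum(probs * f(values)))
# Var(f(x)) = E[(f(x) - E[f(x)])^2]
variance = float(np.sum(probs * (f(values) - expectation) ** 2))

# Covariance of x and y under a hypothetical joint PMF (rows: x, columns: y)
x_vals = np.array([0.0, 1.0])
y_vals = np.array([0.0, 1.0])
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
e_x = float(np.sum(joint.sum(axis=1) * x_vals))
e_y = float(np.sum(joint.sum(axis=0) * y_vals))
# Cov(x, y) = E[(x - E[x])(y - E[y])]
covariance = float(np.sum(joint * np.outer(x_vals - e_x, y_vals - e_y)))

print(expectation, variance, covariance)  # about 10, 30 and 0.15
```

The positive covariance reflects that the joint table puts most of its mass on the cells where $\mathrm{x}$ and $\mathrm{y}$ agree.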

### 1.7 Commonly used Probability Distributions

#### 1.7.1 Binomial distribution
A binomial random variable is the number of successes in $n$ independent trials of a random experiment, where each trial has only two possible outcomes (success or failure) and the same probability of success $p$. Since it counts successes, the binomial distribution is a distribution over discrete random variables.

In [4]:
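import numpy as np
import matplotlib.pyplot as plt

n = 10    # number of trials
p = 0.5   # probability of success in each trial
s = 1000  # number of samples to draw
b_d = np.random.binomial(n, p, s)  # each sample is a success count between 0 and n

# Histogram of the sampled success counts, one bin per possible count
plt.hist(b_d, bins=np.arange(n + 2) - 0.5)
plt.xlabel("number of successes")
plt.show()
#https://www.datacamp.com/community/tutorials/probability-distributions-python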

In [6]:
import tensorflow_probability as tfp


#### 1.7.2 Bernoulli Distribution
A Bernoulli distribution has only two possible outcomes, namely 1 (success) and 0 (failure), and a single trial, for example, a coin toss. So a random variable $x$ with a Bernoulli distribution takes the value 1 with the probability of success, $p$, and the value 0 with the probability of failure, $q = 1 - p$. The two outcomes need not be equally likely. The Bernoulli distribution is a special case of the binomial distribution where a single trial is conducted $(n=1)$.
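
Because a Bernoulli variable is a binomial variable with $n=1$, one possible way to sample it is through NumPy's binomial sampler. This is a sketch; the value of `p`, the seed, and the sample size are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is reproducible
p = 0.3                          # illustrative probability of success

# A Bernoulli draw is a binomial draw with a single trial (n=1)
samples = rng.binomial(n=1, p=p, size=10_000)

# The fraction of ones estimates p
empirical_p = float(samples.mean())
print(empirical_p)
```

With this many samples the empirical frequency should land close to `p`, though the exact value depends on the seed.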

In [5]:
#code

#### 1.7.3 Multinoulli Distribution


The __multinoulli__ or __categorical distribution__ is a distribution over a single discrete variable with *k* different states, where *k* is finite. The multinoulli distribution is a special case of the __multinomial distribution__, which is a generalization of the binomial distribution. A multinomial distribution is the distribution over vectors in $\{0, \cdots, n\}^k$ representing how many times each of the *k* categories is visited when *n* samples are drawn from a multinoulli distribution.
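
The relationship between the two distributions can be sketched with NumPy's random generator: a single categorical draw returns one state, while a multinomial draw returns a vector of counts over the *k* categories. The category probabilities and `n` below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
category_probs = [0.2, 0.5, 0.3]  # illustrative probabilities of k = 3 states
n = 10                            # number of multinoulli samples to count

# One multinoulli (categorical) draw: an index in {0, ..., k-1}
one_draw = int(rng.choice(len(category_probs), p=category_probs))

# One multinomial draw: how many of the n samples fell into each category
counts = rng.multinomial(n, category_probs)

print(one_draw)
print(counts)  # a length-k vector of counts that sums to n
```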

#### 1.7.4 Gaussian Distribution

The most commonly used distribution over real numbers is the __normal distribution__, also known as the __Gaussian distribution__. It has a bell-shaped density curve described by its mean $\mu$ and standard deviation $\sigma$. The probability density function of a normal distribution with mean $\mu$ and standard deviation $\sigma$ at a given point $x$ is given by:

$$\color{green}{\mathcal{N}(x; \mu, \sigma^2) = \sqrt{\frac{1}{2 \pi \sigma^2}} exp \Big(- \frac{1}{2 \sigma^2} (x - \mu)^2 \Big) \tag{15}}$$

The two parameters $\mu \in \mathbb{R}$ and $\sigma \in (0, \infty)$ control the normal distribution. The parameter $\mu$ gives the coordinate of the central peak. This is also the mean of the distribution: $\mathbb{E}[\mathrm{x}] = \mu$. The standard deviation of the distribution is given by $\sigma$, and the variance by $\sigma^2$.

The density curve is symmetrical, centered about its mean, with its spread determined by its standard deviation showing that data near the mean are more frequent in occurrence than data far from the mean.


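In the absence of prior knowledge about what form a distribution over the real numbers should take, the normal distribution is a good default choice for two reasons. First, out of all distributions over the real numbers with the same variance, it has the maximum entropy, so it encodes the least additional assumptions. Second, the central limit theorem shows that the sum of many independent random variables is approximately normally distributed.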


The normal distribution generalizes to $\mathbb{R}^n$, in which case it is known as the __multivariate normal distribution__. It may be parameterized with a positive definite symmetric matrix $\Sigma$:

$$\color{green}{\mathcal{N}(x; \mu, \Sigma) = \sqrt{\frac{1}{(2 \pi)^n \det(\Sigma)}} \exp \Biggl(- \frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \Biggr) \tag{16}}$$

The parameter $\mu$ still gives the mean of the distribution, though now it is vector valued. The parameter $\Sigma$ gives the covariance matrix of the distribution.
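
The univariate density can be checked numerically. The sketch below assumes illustrative parameters and a hypothetical helper `normal_pdf` that evaluates $\sqrt{\frac{1}{2 \pi \sigma^2}} \exp \big(- \frac{1}{2 \sigma^2} (x - \mu)^2 \big)$ directly, then approximates its integral on a wide grid.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Univariate normal density at x (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    norm_const = np.sqrt(1.0 / (2.0 * np.pi * sigma ** 2))
    return norm_const * np.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

mu, sigma = 1.0, 2.0  # illustrative parameters

# Grid covering ten standard deviations on each side of the mean
xs = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 200001)
area = float(np.sum(normal_pdf(xs, mu, sigma)) * (xs[1] - xs[0]))

# The density peaks at the mean with height 1 / sqrt(2 * pi * sigma^2)
peak = float(normal_pdf(mu, mu, sigma))

print(area)  # close to 1
print(peak)  # close to 0.1995 for sigma = 2
```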

### 1.8 Mixture of Distributions

One common way of combining simpler distributions to define a richer probability distribution is to construct a __mixture distribution__. A mixture distribution is made up of several component distributions. On each trial, the choice of which component distribution should generate the sample is determined by sampling a component identity from a multinoulli distribution:

$$\color{orange}{P(\mathrm{x}) = \displaystyle\sum_i P(c = i) \ P(\mathrm{x} | c = i) \tag{21}}$$

where $P(c)$ is the multinoulli distribution over component identities.

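The mixture model allows us to briefly glimpse a concept that will be of paramount importance later: the __latent variable__. A latent variable is a random variable that we cannot observe directly. The component identity variable $c$ of the mixture model is an example. Latent variables may be related to $\mathrm{x}$ through the joint distribution, in this case $P(\mathrm{x}, c) = P(\mathrm{x} | c) P(c)$.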
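
Sampling from a mixture follows the two steps described above: first draw the component identity $c$, then draw $\mathrm{x}$ from that component. The sketch below assumes a mixture of two Gaussians with made-up weights, means, and standard deviations.

```python
import numpy as np

rng = np.random.default_rng(0)

weights = np.array([0.3, 0.7])  # P(c = i), illustrative
means = np.array([-2.0, 3.0])   # mean of each Gaussian component, illustrative
stds = np.array([0.5, 1.0])     # standard deviation of each component, illustrative
n_samples = 20_000

# Step 1: sample the latent component identity c from the multinoulli P(c)
components = rng.choice(len(weights), size=n_samples, p=weights)

# Step 2: sample x from the component selected for each trial, P(x | c)
samples = rng.normal(means[components], stds[components])

# The mixture mean is the weighted average of the component means
mixture_mean = float(np.sum(weights * means))
print(mixture_mean, float(samples.mean()))
```

In real data only `samples` would be observed; `components` plays the role of the latent variable.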

### 1.9 - Bayes' Rule

__Bayes' rule__ is a useful tool that computes the conditional probability $P( x | y)$ from $P(y | x)$. Here 

- $P( x | y)$ is called the _posterior_; what we are trying to estimate, 
- $P(y | x)$ is called the _likelihood_; the probability of observing the new evidence, given our initial hypothesis, 
- $P(x)$ is called the _prior_; this is the probability of our hypothesis before observing any evidence,
- $P(y)$ is called the _marginal likelihood_; this is the total probability of observing the evidence.

The Bayes' rule can be summed up as:

$$\color{orange}{P(x | y) = \frac{P(x) \ P(y | x)}{P(y)} \tag{24}}$$

Even though $P(y)$ appears in the formula, it is usually feasible to compute $P(y) = \sum_x P(y | x) P(x)$, so we do not need to begin with knowledge of $P(y)$.
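
A small worked example, with made-up numbers for a diagnostic test: $\mathrm{x}$ is whether a patient has a disease and $\mathrm{y}$ is a positive test result. The prior, sensitivity, and false positive rate below are assumptions chosen only to illustrate the rule.

```python
# Hypothetical numbers for a diagnostic test
prior = 0.01                # P(x): probability of disease before testing
likelihood = 0.95           # P(y | x): positive test given disease
false_positive_rate = 0.05  # P(y | not x): positive test without disease

# Marginal likelihood P(y) = sum over x of P(y | x) P(x)
evidence = likelihood * prior + false_positive_rate * (1.0 - prior)

# Bayes' rule: P(x | y) = P(x) P(y | x) / P(y)
posterior = prior * likelihood / evidence

print(evidence)   # about 0.059
print(posterior)  # about 0.16
```

Even with an accurate test, the posterior stays modest here because the prior is small.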

### 1.10 - Information Theory

Information theory is a branch of applied mathematics that revolves around quantifying how much information is present in a signal. It was originally developed to study sending messages from discrete alphabets over a noisy channel, where it tells us how to design optimal codes and compute expected message lengths. In the context of machine learning, we can also apply information theory to continuous variables, where some of these message length interpretations do not apply.

The basic intuition behind information theory is that a likely event should have low information content, a less likely event should have higher information content, and independent events should have additive information.


To satisfy these properties, we define the __self-information__ of an event $\mathrm{x} = x$ to be:

$$\color{orange}{I(x) = -log \ P(x) \tag{27}}$$

Self information deals only with a single outcome. We can quantify the amount of uncertainty in an entire probability distribution using the __Shannon entropy__:

$$\color{orange}{H(\mathrm{x}) = \mathbb{E}_{x \sim P} [I(x)] = -\mathbb{E}_{x \sim P}[log \ P(x)] \tag{28}}$$

also denoted as $H(P)$.

Shannon entropy of a distribution is the expected amount of information in an event drawn from that distribution. It gives a lower bound on the number of bits needed on average to encode symbols drawn from a distribution P. Distributions that are nearly deterministic (where the outcome is nearly certain) have low entropy; distributions that are closer to uniform have high entropy. When $\mathrm{x}$ is continuous, the Shannon entropy is known as the __differential entropy__.

Entropy is remarkable less for its interpretation than for its properties. For example, unlike variance, entropy does not depend on the actual values of *x*; it only considers their probabilities. So if the probability mass is spread over more values of *x*, the probabilities become less concentrated and the entropy increases.
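
These properties can be illustrated with a few small distributions. The sketch below assumes base-2 logarithms, so entropy is measured in bits; the helper name `entropy_bits` and the distributions are illustrative.

```python
import math

def entropy_bits(probs):
    """Shannon entropy H(P) = -sum_x P(x) log2 P(x), in bits (hypothetical helper)."""
    # States with zero probability contribute nothing to the sum
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair_coin = entropy_bits([0.5, 0.5])      # uniform over 2 states: 1 bit
skewed_coin = entropy_bits([0.99, 0.01])  # nearly deterministic: low entropy
uniform_four = entropy_bits([0.25] * 4)   # uniform over 4 states: 2 bits

print(fair_coin, skewed_coin, uniform_four)
```

The nearly deterministic coin has the lowest entropy, and spreading the mass uniformly over more states raises it.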

If we have two separate probability distributions P(x) and Q(x) over the same random variable x, we can measure how different these two distributions are using the __Kullback-Leibler (KL) divergence__:

$$\color{orange}{D_{KL} (P \| Q) = \mathbb{E}_{x \sim P} \Big[ log \ \frac{P(x)}{Q(x)} \Big] = \mathbb{E}_{x \sim P} [log \ P(x) - log \ Q(x)] \tag{29}}$$

In the case of discrete variables, it is the extra amount of information needed to send a message containing symbols drawn from probability distribution P, when we use a code that was designed to minimize the length of messages drawn from probability distribution Q.



The KL divergence has many useful properties, most notably being nonnegative. The KL divergence is 0 if and only if P and Q are the same distribution in the case of discrete variables, or equal “almost everywhere” in the case of continuous variables.

A quantity that is closely related to the KL divergence is the __cross-entropy__ $H(P, Q) = H(P) + D_{KL} (P \| Q)$, which is similar to the KL divergence but lacks the $log \ P(x)$ term inside the expectation:

$$\color{orange}{H(P, Q) = - \mathbb{E}_{x \sim P} \ log \ Q(x) \tag{30}}$$

Minimizing the cross-entropy with respect to Q is equivalent to minimizing the KL divergence, because Q does not participate in the omitted term.
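
The identity $H(P, Q) = H(P) + D_{KL} (P \| Q)$ can be verified numerically for discrete distributions. The sketch below uses natural logarithms and two made-up distributions over three states; the helper names are assumptions.

```python
import math

def entropy(p):
    """H(P) = -sum_x P(x) log P(x)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) log(P(x) / Q(x)); assumes Q(x) > 0 wherever P(x) > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) log Q(x)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]  # illustrative distribution P
q = [0.4, 0.4, 0.2]  # illustrative distribution Q

print(cross_entropy(p, q), entropy(p) + kl_divergence(p, q))  # the two values agree
print(kl_divergence(p, q), kl_divergence(q, p))               # KL is not symmetric
print(kl_divergence(p, p))                                    # zero when P and Q match
```

Since `entropy(p)` does not depend on `q`, any `q` that lowers the cross-entropy lowers the KL divergence by the same amount.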

### Probabilistic Models
Explain in Graph Neural Nets and Bayesian Nets section