### Probability

A note on reading 😅: the vertical bar '|' in $Pr(x|y)$ is read in English as "**given**".

%%HTML
<style>
    body {
        --vscode-font-family: "Arial";
    }
</style>

Probability is critical to deep learning. In supervised learning, deep networks implicitly rely on a probabilistic formulation of the loss function. In unsupervised learning,
generative models aim to produce samples that are drawn from the same probability
distribution as the training data. Reinforcement learning occurs within Markov decision
processes, and these are defined in terms of probability distributions. 

**Random variable and probability distribution**:

A random variable $x$ denotes a quantity that is uncertain. It may be discrete (take only
certain values, for example integers) or continuous (take any value on a continuum, for
example real numbers). If we observe several instances of a random variable $x$, it will
take different values, and the relative propensity to take different values is described by
a probability distribution $Pr(x)$.

For a discrete variable, this distribution associates a probability $Pr(x=k) \in [0,1]$ with
each potential outcome $k$, and the sum of these probabilities is one. For a continuous
variable, there is a non-negative probability density $Pr(x=a) \geq 0$ associated with each
value $a$ in the domain of $x$, and the **integral** of this **probability density function (PDF)** over this domain must be one. Note that the density at a point $a$ can be greater than one; only its integral is constrained. From here on, we assume that the random variables are continuous. The ideas are exactly the same for discrete distributions but with sums replacing integrals.
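As a quick numerical illustration of the two normalization conditions, here is a minimal sketch using NumPy. The fair die and the narrow Gaussian (`sigma = 0.1`) are hypothetical examples chosen for this note, not taken from the text above.

```python
import numpy as np

# Discrete case: a fair six-sided die (hypothetical example).
die_probs = np.full(6, 1.0 / 6.0)
discrete_total = die_probs.sum()  # the probabilities sum to one


# Continuous case: a Gaussian density with a small standard deviation.
def gaussian_pdf(a, mu=0.0, sigma=0.1):
    """Density of a normal distribution evaluated at a."""
    return np.exp(-0.5 * ((a - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))


grid = np.linspace(-1.0, 1.0, 20001)  # covers +/- 10 standard deviations
dx = grid[1] - grid[0]
density = gaussian_pdf(grid)

peak_density = density.max()   # the density itself exceeds one near the mean
area = np.sum(density) * dx    # Riemann-sum approximation of the integral

print(discrete_total, peak_density, area)
```

The density peaks well above one, yet the area under the curve is still (approximately) one, which is the only constraint a PDF must satisfy.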

**Joint probability**:

Consider the case where we have two random variables $x$ and $y$. The joint distribution $Pr(x,y)$ tells us about the propensity that $x$ and $y$ take particular combinations of values.

<img src=./images/joint_probability.png width=650>

**Marginalization**:

If we know the joint distribution $Pr(x,y)$ over two variables, we can recover the marginal
distributions $Pr(x)$ and $Pr(y)$ by integrating over the other variable:

$$\int \text{Pr}(x, y) \, dx = \text{Pr}(y)$$

$$\int \text{Pr}(x, y) \, dy = \text{Pr}(x)$$

This process is called marginalization and has the interpretation that we are computing the distribution of one variable regardless of the value the other one took. The
idea of marginalization extends to higher dimensions, so if we have a joint distribution $Pr(x,y,z)$, we can recover the joint distribution $Pr(x,z)$ by integrating over $y$.
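For discrete variables the integrals become sums, so marginalization amounts to summing a joint probability table along one axis. The sketch below assumes a small, made-up joint table; the numbers are purely illustrative.

```python
import numpy as np

# Hypothetical joint distribution Pr(x, y):
# rows index the values of x, columns index the values of y.
joint = np.array([
    [0.10, 0.20, 0.10],
    [0.05, 0.25, 0.30],
])

pr_x = joint.sum(axis=1)  # sum over y  -> marginal Pr(x)
pr_y = joint.sum(axis=0)  # sum over x  -> marginal Pr(y)

print("Pr(x):", pr_x)
print("Pr(y):", pr_y)
```

Each marginal is itself a valid distribution: its entries are non-negative and sum to one.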

$?$ What is a **probability density function**?

A probability density function (PDF) describes the relative likelihood of a continuous random variable taking on a particular value. In simple terms, the PDF tells you how dense the probability is around a value.

The core idea:

For a continuous variable $X$ with a PDF $f(x)$, the probability that $X$ lies within an interval $[a, b]$ is:

$$P(a \leq X \leq b) = \int_a^b f(x) \, dx$$

Key Properties:
- $f(x) \geq 0$ for all $x$
- Total area under the curve = 1: $\int_{-\infty}^{\infty} f(x) \, dx = 1$
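The interval formula can be tried out numerically. This is a minimal sketch that assumes $X$ is a standard normal variable (an assumption made only for this example) and integrates its PDF over $[-1, 1]$ with a midpoint rule, then compares the result with the closed form based on the error function.

```python
import math

import numpy as np


def std_normal_pdf(x):
    """PDF of the standard normal distribution N(0, 1)."""
    return np.exp(-0.5 * x ** 2) / math.sqrt(2.0 * math.pi)


a, b = -1.0, 1.0
n = 100_000                               # number of sub-intervals
edges = np.linspace(a, b, n + 1)
midpoints = 0.5 * (edges[:-1] + edges[1:])
dx = (b - a) / n

# Midpoint-rule approximation of the integral of f(x) over [a, b].
prob_numeric = np.sum(std_normal_pdf(midpoints)) * dx

# Closed form for the standard normal: P(-1 <= X <= 1) = erf(1 / sqrt(2)).
prob_exact = math.erf(1.0 / math.sqrt(2.0))

print(prob_numeric, prob_exact)
```

Both values are about 0.683, the familiar "one standard deviation" probability of a Gaussian.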

**Conditional Probability**:

$Pr(x|y)$ is the probability of $x$ given $y$. The conditional probability $Pr(x|y)$ can be
found by taking a slice through the joint distribution $Pr(x,y)$ for a fixed $y$. This slice is
then divided by the probability of that value $y$ occurring (the total area under the slice)
so that the conditional distribution sums to one:

$$\Pr(x|y) = \frac{\Pr(x, y)}{\Pr(y)}$$  

This is a function of $x$ when $y$ is fixed.

$$\Pr(y|x) = \frac{\Pr(x, y)}{\Pr(x)}$$

This is a function of $y$ when $x$ is fixed.

Note that when we consider the conditional probability $Pr(x|y)$ as a function of $x$, it *must* sum
to one. When we consider the same quantity $Pr(x|y)$ as a function of $y$, it is termed the
**likelihood** of $x$ given $y$ and does *not* have to sum to one.

More on **likelihood**:

Interpreting $\Pr(x \mid y)$ as a Function of $y$:

If we fix $x$, and view $\Pr(x \mid y)$ as a function of $y$ instead of $x$, it is no longer a probability distribution. Instead, we call it the **likelihood**: $$\mathcal{L}(y) = \Pr(x \mid y) \quad \text{(as a function of } y \text{, with } x \text{ fixed)}$$

$?$ This likelihood function does not need to sum to 1 over $y$. Why?

$!$ Because it is not a probability distribution over $y$.

Instead, it measures how likely the fixed data $x$ is under different hypothetical values of $y$. In statistics and machine learning, this comes up in parameter estimation, especially in **maximum likelihood estimation (MLE)**.
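The difference between a conditional distribution and a likelihood is easy to see on a small discrete table. The sketch below uses a hypothetical joint table (the numbers are invented for illustration): each column of $Pr(x|y)$ sums to one, but a row of the same array, read as a function of $y$ for fixed $x$, does not.

```python
import numpy as np

# Hypothetical joint distribution Pr(x, y): rows index x, columns index y.
joint = np.array([
    [0.10, 0.20, 0.10],
    [0.05, 0.25, 0.30],
])

pr_y = joint.sum(axis=0)          # marginal Pr(y)
cond_x_given_y = joint / pr_y     # Pr(x | y): each column is a slice divided by Pr(y)

# As a function of x (y fixed): every column sums to one.
column_sums = cond_x_given_y.sum(axis=0)

# As a function of y (x fixed at the first value): this is the likelihood.
likelihood_of_x0 = cond_x_given_y[0, :]
likelihood_total = likelihood_of_x0.sum()

print("column sums:", column_sums)
print("likelihood values:", likelihood_of_x0, "sum:", likelihood_total)
```

The likelihood values here add up to roughly 1.36, which confirms that they do not form a probability distribution over $y$.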

#### Bayes' Rule:

From the two equations above, both $Pr(x|y)Pr(y)$ and $Pr(y|x)Pr(x)$ equal the joint probability $Pr(x,y)$. Equating them and dividing by $Pr(y)$ gives:

$$\Pr(x|y) = \frac{\Pr(y|x) \Pr(x)}{\Pr(y)}.$$

This expression relates the conditional probability $Pr(x|y)$ of $x$ given $y$ to the conditional
probability $Pr(y|x)$ of $y$ given $x$ and is known as **Bayes’ rule**.

The key concept of Bayes' rule:

Each term in Bayes’ rule has a name. The term $Pr(y|x)$ is the *likelihood* of $y$
given $x$, and the term $Pr(x)$ is the *prior probability* of $x$. The denominator $Pr(y)$ is
known as the *evidence*, and the left-hand side $Pr(x|y)$ is termed the *posterior probability*
of $x$ given $y$. **The equation maps from the prior $Pr(x)$ (what we know about $x$ before observing $y$) to the posterior $Pr(x|y)$ (what we know about $x$ after observing $y$).**
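A small discrete example makes the four terms concrete. The prior and likelihood values below are hypothetical, chosen only so the arithmetic is easy to follow; the evidence is obtained by marginalizing the product of likelihood and prior over $x$.

```python
import numpy as np

# Hypothetical two-state variable x and one observed value of y.
prior = np.array([0.4, 0.6])         # Pr(x) for x = 0 and x = 1
likelihood = np.array([0.25, 0.50])  # Pr(y_observed | x) for x = 0 and x = 1

# Evidence: Pr(y_observed) = sum over x of Pr(y_observed | x) Pr(x).
evidence = np.sum(likelihood * prior)

# Bayes' rule: posterior = likelihood * prior / evidence.
posterior = likelihood * prior / evidence

print("evidence:", evidence)
print("posterior:", posterior)
```

Observing $y$ shifts the belief about $x$ from the prior $(0.4, 0.6)$ to the posterior $(0.25, 0.75)$, because the observation is twice as likely under $x = 1$.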

### Expectation:

Formally, the expectation of a random variable $X$, denoted as $\mathbb{E}[X]$, is a weighted average of all possible values that $X$ can take, weighted by their probabilities.

If $X$ is continuous, then we integrate over all values of $X$:

$$\mathbb{E}[X] = \int_{-\infty}^{\infty} x \cdot f_X(x) \, dx$$

Where $f_X(x)$ is the probability density function (PDF) of $X$.

Consider a function $f[x]$ and a probability distribution $Pr(x)$ defined over $x$. The expected value of a function $f[\bullet]$ of a random variable $x$ with respect to the probability distribution $Pr(x)$ is defined as:

$$\mathbb{E}_x[f(x)] = \int f(x) \Pr(x) \, dx$$

As the name suggests, this is the expected or average value of $f[x]$ after taking into account
the probability of seeing different values of $x$. This idea generalizes to functions $f[\bullet,\bullet]$ of
more than one random variable:

$$\mathbb{E}_{x,y}[f(x,y)] = \iint f(x,y) \Pr(x,y) \, dx \, dy$$

**An expectation is always taken with respect to a distribution over one or more variables.**

If we drew a large number $I$ of samples $\{x_i\}_{i=1}^I$ from $Pr(x)$, calculated $f[x_i]$ for
each sample and took the average of these values, the result would approximate the
expectation $\mathbb{E}[f[x]]$ of the function:

$$\mathbb{E}_x[f(x)] \approx \frac{1}{I} \sum_{i=1}^I f(x_i)$$
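The sample-average approximation can be sketched in a few lines. The example assumes $Pr(x)$ is a standard normal distribution and $f[x] = x^2$ (both choices are illustrative), in which case the exact expectation is the variance, 1.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed for reproducibility
num_samples = 200_000                    # the number of samples, I
samples = rng.standard_normal(num_samples)


def f(x):
    """Function whose expectation we want; here f[x] = x^2."""
    return x ** 2


# Monte Carlo estimate: average f over samples drawn from Pr(x).
estimate = np.mean(f(samples))

print("estimate of E[x^2]:", estimate)   # close to the exact value 1
```

The estimate fluctuates around the exact value, and the fluctuation shrinks roughly as $1/\sqrt{I}$ as more samples are drawn.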

### Mean, Variance and Covariance

For some choices of function $f[\bullet]$, the expectation is given a special name. These quantities are often used to summarize the properties of complex distributions. For example,
when $f[x] = x$, the resulting expectation $\mathbb{E}[x]$ is termed the mean, $\mu$. It is a measure of the center of a distribution. Similarly, the expected squared deviation from the
mean $\mathbb{E}[(x-\mu)^2]$ is termed the variance, $\sigma^2$. This is a measure of the spread of the
distribution. The standard deviation $\sigma$ is the positive square root of the variance. It
also measures the spread of the distribution but has the merit that it is expressed in the
same units as the variable $x$.

As the name suggests, the **covariance** $\mathbb{E}[(x-\mu_x)(y-\mu_y)]$ of two variables $x$ and $y$
measures the degree to which they co-vary. Here $\mu_x$ and $\mu_y$ represent the mean of the
variables $x$ and $y$, respectively. The covariance will be large when the variance of both
variables is large and when the value of $x$ tends to increase when the value of $y$ increases.

Important note on **covariance**:

The covariances of multiple random variables stored in a column vector $x \in \mathbb{R}^D$ can be
represented by the $D \times D$ covariance matrix $\mathbb{E}[(x-\mu_x)(x-\mu_x)^T]$, where the vector $\mu_x$
contains the means $\mathbb{E}[x]$. The element at position $(i,j)$ of this matrix represents the
covariance between variables $x_i$ and $x_j$.
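The covariance matrix can be estimated from samples by replacing the expectation with a sample average. The sketch below assumes a two-dimensional Gaussian with a made-up mean and covariance, stores one sample per row, and compares the hand-written estimate with NumPy's `np.cov`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground-truth parameters, used only to generate data.
true_mean = np.array([1.0, -2.0])
true_cov = np.array([[2.0, 0.8],
                     [0.8, 1.0]])

# X has shape (I, D): one D-dimensional sample per row.
X = rng.multivariate_normal(mean=true_mean, cov=true_cov, size=100_000)

mu = X.mean(axis=0)                      # estimate of the mean vector
centered = X - mu                        # x - mu for every sample

# Sample version of E[(x - mu)(x - mu)^T]: average of outer products.
cov_est = centered.T @ centered / X.shape[0]

# np.cov with bias=True uses the same 1/I normalization.
cov_numpy = np.cov(X, rowvar=False, bias=True)

print(cov_est)
```

Entry $(i,j)$ of `cov_est` estimates the covariance between components $i$ and $j$, so the diagonal holds the variances and the matrix is symmetric.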

**Variance Identity**:

$$\mathbb{E}[(x - \mu)^2] = \mathbb{E}[x^2] - \mathbb{E}[x]^2$$

This follows by expanding the square and using $\mathbb{E}[x] = \mu$:

$$
\begin{align*}
\mathbb{E}[(x - \mu)^2] &= \mathbb{E}[x^2 - 2\mu x + \mu^2] \\
&= \mathbb{E}[x^2] - \mathbb{E}[2\mu x] + \mathbb{E}[\mu^2] \\
&= \mathbb{E}[x^2] - 2\mu \cdot \mathbb{E}[x] + \mu^2 \\
&= \mathbb{E}[x^2] - 2\mu^2 + \mu^2 \\
&= \mathbb{E}[x^2] - \mu^2 \\
&= \mathbb{E}[x^2] - \mathbb{E}[x]^2
\end{align*}
$$
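The identity also holds exactly for sample averages, which gives a simple numerical check. The sketch assumes samples from an exponential distribution (an arbitrary, non-symmetric choice for illustration) and computes both sides.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=10_000)  # hypothetical data

mu = x.mean()

lhs = np.mean((x - mu) ** 2)        # E[(x - mu)^2]
rhs = np.mean(x ** 2) - mu ** 2     # E[x^2] - E[x]^2

print(lhs, rhs)
```

The two numbers agree up to floating-point rounding, regardless of which distribution the samples come from.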

#### Standardization:

Setting the mean of a random variable to zero and the variance to one is known as
*standardization*. This is achieved using the transformation:

$$z = \frac{x - \mu}{\sigma}$$

This transformation has the following properties:
- $\mathbb{E}[z] = 0$
- $\text{Var}(z) = 1$

So $z$ is a new variable that represents how many standard deviations a particular value of $x$ is from its mean. That is why $z$ is often called a z-score in statistics.

This is useful for:
1. Comparison across distributions: if you have two variables (say height and weight), they may be measured on completely different scales. Standardizing them allows you to compare them directly.
2. Normalization for Gaussian distributions: if $x \sim \mathcal{N}(\mu, \sigma^2)$, then $z = \frac{x - \mu}{\sigma} \sim \mathcal{N}(0, 1)$. This is called the **standard normal distribution**.

The mean of the new distribution over z is given by:

$$
\begin{align*}
\mathbb{E}[z] &= \mathbb{E}\left[\frac{x - \mu}{\sigma}\right] \\
&= \frac{1}{\sigma} \mathbb{E}[x - \mu] \\
&= \frac{1}{\sigma} (\mathbb{E}[x] - \mathbb{E}[\mu]) \\
&= \frac{1}{\sigma} (\mu - \mu) = 0
\end{align*}
$$

The variance of the new distribution is given by:

$$
\begin{align*}
\mathbb{E}[(z - \mu_z)^2] &= \mathbb{E}\left[(z - \mathbb{E}[z])^2\right] \\
&= \mathbb{E}[z^2] \\
&= \mathbb{E}\left[\left(\frac{x - \mu}{\sigma}\right)^2\right] \\
&= \frac{1}{\sigma^2} \mathbb{E}[(x - \mu)^2] \\
&= \frac{1}{\sigma^2} \cdot \sigma^2 = 1
\end{align*}
$$
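The two derivations above can be checked on data. This is a minimal sketch assuming a hypothetical sample of heights in centimetres; the mean and standard deviation are estimated from the sample itself, so the standardized values have zero mean and unit variance up to rounding.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical measurements: heights in cm.
x = rng.normal(loc=170.0, scale=8.0, size=50_000)

mu = x.mean()        # sample mean
sigma = x.std()      # sample standard deviation (1/I normalization)

z = (x - mu) / sigma  # z-scores

z_mean = z.mean()
z_var = z.var()

print(z_mean, z_var)
```

A z-score of, say, 1.5 then reads as "1.5 standard deviations above the mean", independent of the original units.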

In the multivariate case, we can standardize a variable $x$ with mean $\mu$ and covariance
matrix $\Sigma$ using:

$$z = \Sigma^{-1/2} (x - \mu)$$

So the vector $z$ is a transformed version of $x$ that has been:
1. Centered (zero mean),
2. Scaled and rotated (via $\Sigma^{-1/2}$) so that its covariance becomes the identity matrix.

The result will have a mean $\mathbb{E}[z] = 0$ and an identity covariance matrix $\mathbb{E}[(z-\mathbb{E}[z])(z-\mathbb{E}[z])^T] = I$. To reverse this process, we use:

$$x = \mu + \Sigma^{1/2} z$$
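The multivariate case can be sketched with an eigendecomposition. The matrix square root is not unique; the code below assumes the symmetric square root $\Sigma^{1/2} = V \Lambda^{1/2} V^T$, which is one common choice (a Cholesky factor would be another). The data-generating parameters and the helper name `sqrt_and_inv_sqrt` are hypothetical.

```python
import numpy as np


def sqrt_and_inv_sqrt(cov):
    """Symmetric square root and inverse square root of a positive-definite matrix."""
    eigvals, eigvecs = np.linalg.eigh(cov)  # cov = V diag(eigvals) V^T
    sqrt = eigvecs @ np.diag(np.sqrt(eigvals)) @ eigvecs.T
    inv_sqrt = eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T
    return sqrt, inv_sqrt


rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[1.0, -2.0],
                            cov=[[2.0, 0.8], [0.8, 1.0]],
                            size=20_000)           # one sample per row

mu = X.mean(axis=0)
Sigma = np.cov(X, rowvar=False, bias=True)         # sample covariance matrix
Sigma_sqrt, Sigma_inv_sqrt = sqrt_and_inv_sqrt(Sigma)

# z = Sigma^{-1/2} (x - mu), applied to every row.
# Sigma^{-1/2} is symmetric, so right-multiplying the rows is equivalent.
Z = (X - mu) @ Sigma_inv_sqrt
Z_cov = np.cov(Z, rowvar=False, bias=True)

# Reverse the process: x = mu + Sigma^{1/2} z.
X_back = mu + Z @ Sigma_sqrt

print(Z.mean(axis=0))
print(Z_cov)
```

The standardized samples have (numerically) zero mean and identity covariance, and the inverse transformation recovers the original data.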