# Probability
Author: Vo, Huynh Quang Nguyen

In [1]:
import numpy as np

# Acknowledgements:
The contents of this note are based on the lecture notes and the materials from the sources listed below:

1. _Essential Math for Data Science_ in 6 Weeks webinar given by Dr. Thomas Nield.
Available in O'Reilly Learning: [Essential Math for Data Science in 6 Weeks](https://learning.oreilly.com/attend/essential-math-for-data-science-in-6-weeks/0636920055929/0636920055928/)


2. _Probability Cheatsheet v2.0_ given by William Chen, and Joe Blitzstein, with contributions from Sebastian Chiu, Yuan Jiang, Yuqi Hou, and Jessy Hwang.
Available in Github: [http://github.com/wzchen/probability_cheatsheet]


3. _Deep Learning_ textbook by Dr. Ian Goodfellow, Prof. Yoshua Bengio, and Prof. Aaron Courville. The book is available for public access via a designated website: [Deep Learning textbook](https://www.deeplearningbook.org/)


4. _Introduction to Probability and Statistics for Engineers and Scientists_ textbook by Prof. Sheldon Ross from University of Southern California.

# Table of Contents
1. [Symbols and Abbreviation](#Section1)
2. [Introduction to Probability](#Section2)
3. [Overview of Probability](#Section3)
4. [Bayes' theorem](#Section4)
5. [Random Variables](#Section5)

# I. Symbols and Abbreviation <a name = "Section1"></a>

# II. Introduction to Probability <a name = "Section2"></a>
## 1. Why do we care about probability?
1. Machine learning must always deal with uncertain quantities, and sometimes may also need to deal with stochastic (non-deterministic) quantities which come from many sources. There are three main sources:
     * **Inherent stochasticity** in the system being modeled. For example, the dynamics of subatomic particles in quantum mechanics are probabilistic. Another example is a hypothetical card game in which we assume that the cards are truly shuffled into a random order.
     * **Incomplete observability**. Even deterministic systems can appear stochastic when we cannot observe all of the variables that drive the behavior of the system. For example, in the Monty Hall problem, the outcome given the contestant’s choice is deterministic but uncertain from the contestant's point of view.
     * **Incomplete modeling**. For example, when we use a model that must discard some of the information we have observed, the discarded information results in uncertainty in the model’s predictions.


2. Note that the term _stochasticity_ refers to randomness. Meanwhile, the term _determinism_ describes a system whose outputs are always the same given the same starting conditions or initial state.

<div>
    <img src = "images/monty_hall.png" width = 100%/>
    </div>

Figure 1: Visualization of the famous Monty Hall problem and its decision flowchart (adapted from **Brilliant Math & Science Wiki**). The Monty Hall problem is a thought experiment in which you are asked to choose one of three doors. Behind one door there is a car, and behind each of the other two there is a goat. You choose a door. The host, Monty Hall, then picks one of the other doors, which he knows has a goat behind it, and opens it, showing you the goat. Monty then asks whether you would like to switch your choice to the remaining closed door. Assuming you prefer the car to a goat, and that according to the game rules the host will always reveal a goat, do you choose to switch or not? This thought experiment is a prime example of incomplete observability.


## 2. Applications of probability
Together with linear algebra, statistics and multivariate calculus, probability plays a crucial role in machine learning. Below are several examples of its application in machine learning.

<div>
    <img src="images/naive_bayes.png" />
    </div>
    
Figure 2: Illustration behind the Naive Bayes algorithm. In this algorithm, we estimate the probability distribution of our data ($P(x_{\alpha}|y)$) independently in each dimension, and then obtain an estimate of the full data distribution by assuming conditional independence $P(x|y)=\prod_{\alpha} P(x_{\alpha}|y)$. 

## 3. Probability vs. statistics
1. Probability and statistics are often confused, and the two terms are used interchangeably, but there is a distinction:
    * Probability is solely about studying likelihood.
    * Statistics utilizes data to discover likelihood.


2. In practice, the two are tightly tied together, as one can argue it is hard to have probability without data.

# III. Overview of Probability <a name = "Section3"></a>
## 1. What is probability?
1. We can simply understand probability as how likely it is that an event will happen, based on observations or belief. Some examples of probability are:
    * How likely is it we will get 7 heads in 10 fair coin flips?
    * What is the likelihood our flight will be late?
    * How certain are we that a product is defective?
    
    
2. Probability is expressed in two ways:
    * As a percentage: 60% chance our flight will be late 
    * As a ratio: 3:2 odds our flight will be late
    
    
## 2. Probability philosophies
There are two philosophies of probability:
1. **Frequentist probability**, which is the most popularly understood approach to probability, believes that the frequency of an event provides hard evidence of the probability ~ related directly to the rates at which events occur.
    * This implies if we repeated an event infinitely many times, then proportion $p$ of the repetitions would result in that outcome. Therefore, if we gather more data, we will increase confidence in the probability. 
    * Frequentist probability tends to work best when a lot of data is available, reliable, and complete.
    * Commonly used tools include p-values, confidence intervals, prediction intervals, tolerance intervals, etc.
    * Note that when we refer to probability in the frequentist sense, we usually use the term chance.
    
    
2. **Bayesian probability**, which is much more abstract in that it assigns subjective beliefs in a probability and not just data ~ related to the degree of belief.
    * This implies an arbitrary probability can be assigned based on subjective beliefs, and then data can be used to gradually update that belief.
    * Bayesian methods tend to work well when data is limited, a large amount of domain knowledge is present, or uncertainty is hard to eliminate. 
    * Bayesian tools include the Bayes factor and credible intervals. 
<div>
    <img src = 'images/bayes_example.png' width = 35%>
    </div>

Figure 1: An example of Bayesian probability: imagine we are flipping a coin, and we initially believe that it has a 50% chance of landing on **heads**. We flip the coin 10 times and get 7 heads. As a result, we update the beta distribution for the probability of **heads** (to the right) and see that values above 50% have become more likely. 

## 3. Fundamentals of probability

### a) Sample space and event space
1. In probabilistic computation, we always couple the probability with a sample space and an event space, so what are they?


2. The sample space is the set of all possible outcomes of an experiment. We usually denote the sample space as $\Omega$. For example, two successive coin tosses have a sample space of $\{HH, HT, TH, TT\}$, where “H” denotes “head” and “T” denotes “tail”.


3. The event space $\mathcal{A}$ is the space of potential results of an experiment. The event space $\mathcal{A}$ is obtained by considering the collection of subsets of $\Omega$.


4. The sample space $\Omega$ and the event space $\mathcal{A}$ are often confused. To clarify, let's consider the example of flipping a coin once. We know that there are only two possible outcomes (either "head" or "tail"); therefore, the sample space is $\Omega = \{H,T\}$. The event space is different because each event is a set of outcomes, and there are three possible non-empty events: 
    * Flipping a coin and only getting "heads": $\{H\}$;
    * Flipping a coin and only getting "tails": $\{T\}$;
    * Flipping a coin and getting either "heads" or "tails": $\{H,T\}$.
    
Therefore, our event space is $\mathcal{A} = \{\{H\},\{T\}, \{H,T\}\}$, to which the impossible event $\emptyset$ is formally added. Note that each event is just a subset of the sample space $\Omega$. 
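
The construction of an event space from a sample space can be sketched in a few lines of Python. This is a minimal illustration, assuming the event space is taken to be the full collection of subsets (the power set) of $\Omega$; the helper name `power_set` is hypothetical and introduced only for this example.

```python
from itertools import chain, combinations

def power_set(sample_space):
    """Return every subset of sample_space as a frozenset."""
    items = list(sample_space)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return [frozenset(s) for s in subsets]

omega = {"H", "T"}               # sample space of a single coin flip
event_space = power_set(omega)   # empty set, {H}, {T}, {H, T}

for event in event_space:
    print(set(event) if event else "empty set")
```

A sample space with $n$ outcomes yields $2^n$ events, so the single coin flip gives four events once the empty set is counted.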


### b) Basics of probability
1. As mentioned above, we understand that probability is a measure of how likely an outcome $x$ is. Probability is typically represented as a number $P(x)$ between 0.0 and 1.0, or as a percentage between 0% and 100%.


2. The probability of an event not occurring can be calculated as $1.0 - P(x)$, because the probabilities of the event occurring and not occurring must add up to 1.0. This is the basis of the complement rule.


3. When we work with a single simple probability, it is known as a marginal probability.


4. As mentioned above, we know that probability can be based on data, a belief, or both.
    * Based on data: as an example, if we sample 10 products from a factory line and find 4 items are defective, that would be a 40% defective rate.
    * Based on belief: as an example, an engineer realizes an inferior material was used and guesses the defective rate for the product will be 50%. 
    * Based on data + belief: we can quantify the engineer’s belief and the data, merge them together, and find a 44.44% probability is most likely.
    

5. Combining the probability with the sample space and the event space, we get the probability space $(\Omega,\mathcal{A},P)$. Given a probability space $(\Omega,\mathcal{A},P)$, we want to use it to model some real-world phenomenon. 


6. In machine learning, we often avoid explicitly referring to the probability space, but instead refer to probabilities on quantities of interest $\mathcal{T}$. We usually define $\mathcal{T}$ as the target space and refer to elements of $\mathcal{T}$ as states. Additionally, we introduce a target space function $X: \Omega  \rightarrow \mathcal{T}$ that takes an element of $\Omega$ (an outcome) and returns a particular quantity of interest $x$, which is a value in $\mathcal{T}$.


#### Expressing probability as odds
1. We know that probability can be expressed as odds, which state how many times more strongly we believe in something being true than in it not being true. Odds are also a helpful way to quantify subjective beliefs by means of “betting.”


2. Consider this example: if a friend of ours is willing to pay us 200$\$$ if the Vietnamese football team qualifies for the World Cup 2022, but we must pay him 50$\$$ if they do not, that means he believes the Vietnamese football team is 4x more likely to fail than to succeed ($\frac{200}{50} = 4.0$). In another case, if he pays 200$\$$ for them succeeding but we must pay him only 1$\$$ if they don't, that means he REALLY believes the Vietnamese team is going to fail: 200x more likely ($\frac{200}{1}=200$). 


#### Turning Odds into Probabilities
1. We can turn an odds ratio $O(x)$ into a probability by using the following formula:
$$
P(x) = \frac{O(x)}{1 + O(x)}
$$

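2. Returning to the previous example, the odds of failing are $O_1(x) = \frac{200}{50} = 4$ in the first bet and $O_2(x) = \frac{200}{1} = 200$ in the second, so we can quantify our friend's belief that the team will fail as:
$$
P_1(x) = \frac{200}{200 + 50} = 0.8 
$$

$$
P_2(x) = \frac{200}{200 + 1} \approx 0.995
$$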
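
The conversion between odds and probabilities can be sketched as follows. This is a minimal example, assuming the stakes of the betting example (200 against 50, and 200 against 1) are read as odds of the team failing; the function names are hypothetical.

```python
def odds_to_probability(odds: float) -> float:
    """Convert odds O(x) into a probability P(x) = O(x) / (1 + O(x))."""
    return odds / (1.0 + odds)

def probability_to_odds(p: float) -> float:
    """Inverse conversion: O(x) = P(x) / (1 - P(x))."""
    return p / (1.0 - p)

# Odds of failing implied by the two bets
p_fail_1 = odds_to_probability(200 / 50)  # odds of 4 to 1
p_fail_2 = odds_to_probability(200 / 1)   # odds of 200 to 1

print(round(p_fail_1, 3), round(p_fail_2, 3))
```

Applying `probability_to_odds` to either result recovers the original odds, which is a quick way to check that the two formulas are inverses of each other.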


### c) Probabilistic computation

#### Joint probabilities
1. Imagine we have two events that occur independently, meaning they do not affect each other's outcomes. The probability of both events occurring simultaneously is:
$$
P(A\cap B) = P(A,B) = P(A) \times P(B)
$$


2. For example, consider the probability of flipping a coin and getting heads (Event A), and of rolling a die and getting a six (Event B). Since A and B are independent, the probability of both events occurring simultaneously is:
$$
P(A\cap B) = \frac{1}{2} \times \frac{1}{6} = \frac{1}{12}
$$


3. Joint probabilities work on the so-called product rule, and we can use this to combine as many probabilities as we want. 
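
As a sanity check, the joint probability of the coin-and-die example can be obtained by brute-force enumeration and compared with the product rule. This is a small sketch, assuming a fair coin and a fair six-sided die so that all 12 combined outcomes are equally likely.

```python
from fractions import Fraction
from itertools import product

coin = ["H", "T"]
die = [1, 2, 3, 4, 5, 6]

# Every (coin, die) pair is equally likely under the fairness assumption
outcomes = list(product(coin, die))

# Event A and B together: heads on the coin and a six on the die
favourable = [(c, d) for c, d in outcomes if c == "H" and d == 6]
p_joint = Fraction(len(favourable), len(outcomes))

# Product rule for independent events
p_product = Fraction(1, 2) * Fraction(1, 6)

print(p_joint, p_product)
```

Using `Fraction` keeps the arithmetic exact, so the two results can be compared with `==` instead of a floating-point tolerance.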

#### Union probabilities
1. There are two kinds of union probabilities: mutually exclusive (m.u) and non-mutually exclusive (n.m.u).


2. When two events are mutually exclusive, meaning that only one of the events can occur but not both, the resulting probability is the sum of the individual events' probabilities. For example, we want to compute the probability of getting a “4” or a “6” on a die roll: because we cannot get a “4” and a “6” simultaneously, we just add these probabilities together.
$$
P(A\cup B) = \frac{1}{6} + \frac{1}{6} = \frac{1}{3}
$$


3. When two events are non-mutually exclusive, meaning that two events can occur simultaneously, then the resulting probability follows the so-called sum rule. For example, we want to compute the probability of getting a prime or even number on a die roll.
$$
P(A\cup B) = P(A) + P(B) - P(A\cap B)
$$


4. Note that the sum rule is applicable to all union probabilities: for mutually exclusive events, the joint probability is simply zero. Similar to the product rule, we can apply the sum rule to as many probabilities as we like.
<div>
    <img src ="images/union_probabilities.png"/>
    </div>
    
Figure 2: An example of union probability. Consider a card deck: the mutually exclusive events are Aces and Kings because they cannot occur simultaneously (a). Meanwhile, the non-mutually exclusive events are Hearts and Kings because they can occur simultaneously (b). If we want to compute the n.m.u union probability, we must subtract the joint probability to avoid counting the overlap twice.
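
The two die-roll examples for union probabilities can be worked through with sets. This is a minimal sketch assuming a fair six-sided die; the helper `prob` is hypothetical and simply counts favourable outcomes.

```python
from fractions import Fraction

die = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of an event (a set of faces) on a fair die."""
    return Fraction(len(event & die), len(die))

# Mutually exclusive events: a "4" or a "6"
p_4_or_6 = prob({4}) + prob({6})

# Non-mutually exclusive events: a prime number or an even number
prime = {2, 3, 5}
even = {2, 4, 6}
p_union_sum_rule = prob(prime) + prob(even) - prob(prime & even)
p_union_direct = prob(prime | even)  # count the union directly

print(p_4_or_6, p_union_sum_rule, p_union_direct)
```

The face "2" is both prime and even, which is exactly the overlap that the sum rule subtracts once.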


#### Conditional probability
1. Conditional probability describes the probability $P(A|B)$ that Event A occurs given that Event B occurs:
    * If Event B has no impact on whether Event A occurs, then $P(A) = P(A|B)$.
    * If Event B does have an impact on Event A by increasing or decreasing the latter's probability, then $P(A)\neq P(A|B)$.


2. The conditional probability is computed as follows:
$$
P(A|B) = \frac{P(A\cap B)}{P(B)}
$$
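
The conditional probability formula can be illustrated on a single die roll. The events below are assumptions chosen only for illustration (they do not come from the notes): B is "the roll is even", A is "the roll is a six", and C is "the roll is at most 2".

```python
from fractions import Fraction

die = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of an event (a set of faces) on a fair die."""
    return Fraction(len(event & die), len(die))

def conditional(a, b):
    """P(A|B) = P(A and B) / P(B)."""
    return prob(a & b) / prob(b)

even = {2, 4, 6}
six = {6}
at_most_two = {1, 2}

# Knowing the roll is even raises the probability of a six from 1/6 to 1/3
p_six_given_even = conditional(six, even)

# Knowing the roll is even does not change the probability of "at most 2"
p_low_given_even = conditional(at_most_two, even)

print(prob(six), p_six_given_even)
print(prob(at_most_two), p_low_given_even)
```

The second pair of numbers is equal, which is the case $P(A) = P(A|B)$ described above.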

# IV. Bayes' theorem <a name = "Section4"></a> 
## 1. Overview of Bayes' theorem
### a) What is Bayes' theorem?
1. As an example, consider the likelihood of being colorblind in humans. We know that the chance of someone being colorblind is 4.25%, and the probability of a male being colorblind is 8%. Does it mean:
    * Any colorblind person is 8% likely to be male? Or,
    * Any male is 8% likely to be colorblind?
    
    
2. We can reframe our questions to become:
    * What is the likelihood of any colorblind person to be male?
    * What is the likelihood of any male to be colorblind?
    
    
3. These questions can be easily answered using the famous Bayes' theorem:
$$
P(A|B) = \frac{P(A)\times P(B|A)}{P(B)}
$$

***
Likelihood of being colorblind: $P(\text{colorblind}) = 4.25\% = 0.0425$

Likelihood of a male being colorblind: $P(\text{colorblind}|\text{male}) = 8\% = 0.08$

Likelihood of being a male: $P(\text{male}) = 50\% = 0.5$

Likelihood of a colorblind person being male: 
$$
P(\text{male}|\text{colorblind}) = \frac{P(\text{male})\times P(\text{colorblind}|\text{male})}{P(\text{colorblind})} = \frac{0.5 \times 0.08}{0.0425} \approx 0.9412
$$

***
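
The colorblindness calculation can be reproduced with a small helper. This is a sketch using the three figures stated above; the function name `bayes` and its argument names are hypothetical.

```python
def bayes(p_a: float, p_b_given_a: float, p_b: float) -> float:
    """Bayes' theorem: P(A|B) = P(A) * P(B|A) / P(B)."""
    return p_a * p_b_given_a / p_b

p_male = 0.5                    # P(male)
p_colorblind_given_male = 0.08  # P(colorblind | male)
p_colorblind = 0.0425           # P(colorblind)

p_male_given_colorblind = bayes(p_male, p_colorblind_given_male, p_colorblind)
print(round(p_male_given_colorblind, 4))
```

The result answers the first of the two reframed questions: a colorblind person is very likely to be male, even though only 8% of males are colorblind.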

4. We can chain several conditional probabilities that affect an event of interest, assuming each condition is independent of the other conditions.

<div>
    <img src = "images/data-science-bayes-theorem.jpg" width  = 50%/>
    </div>

Figure 1: Visualization of Bayes' theorem. We can interpret this theorem as an approach to revise and update our probability of an event occurring after taking into consideration new information.

### b) Properties of Bayesian probability



# V. Random Variables <a name = "Section5"></a>
## 1. What is a random variable?
1. In probabilistic modeling and computation, a random variable $\mathbf{X}$ is a variable that can take on different values randomly. Simply put, we can understand a random variable as a variable whose possible values are numerical outcomes of a phenomenon.


2. A random variable has the following properties:
    * A random variable can be vector-valued: $\mathbf{X} = [x_1, x_2, ..., x_n]$.
    * On its own, a random variable is a description of the possible states of a phenomenon. Therefore, it must be coupled with a probability distribution that specifies how likely each of these states is.
    * Random variables may be discrete or continuous. 
    
    
3. Recall the mapping of the sample space into the target space $\mathbf{X}: \Omega \rightarrow \mathcal{T}$ in machine learning. This association/mapping is also referred to as a random variable. 
    
    
### a) Discrete random variable:
1. A discrete random variable is one which may take on only a countable number of distinct values such as 0,1,2,3,4,... Discrete random variables are usually (but not necessarily) counts. If a random variable can take only a finite number of distinct values, then it must be discrete.


2. Examples of discrete random variables include the number of children in a family, or the number of defective chips in a lot.


### b) Continuous random variable:
1. A continuous random variable is one which can take any of an uncountably infinite number of possible values, typically every value in an interval of real numbers. Continuous random variables are usually measurements. 


2. Examples include height, weight, the amount of sugar in an orange, or the time required to run a mile.


## 2. Probability Distribution
### a) What is a probability distribution?
1. A probability distribution is a description of how likely a random variable or set of random variables is to take on each of its possible states. 


2. We describe a probability distribution based on the type of our random variable:
    * If the random variable $X$ is discrete, we will use a probability mass function $P(X = x)$.
    * If the random variable $X$ is continuous, we will use a probability density function.
    

3. We usually use the phrase univariate distribution to refer to distributions of a single random variable (whose states are denoted as $x$). On the other hand, we usually refer to distributions of more than one random variable as multivariate distributions, and will usually consider a vector of random variables (whose states are denoted by $\mathbf{x}$).


### b) Probability distribution of a discrete random variable
1. The probability distribution of a discrete random variable is a list of probabilities associated with each of its possible values.

2. Consider a random variable $\mathbf{X}$ that may take $k$ different values, with the probability that $X = x_i$ defined to be $p_i$ and given by the distribution $P(X)$. These probabilities $p_i$ must satisfy the following:
    * Each probability $p_i$ must lie within the range $0 \leq p_i \leq 1$ for each possible value $i$. In practice, when at least two of the values can actually occur, each of those values has a probability strictly inside the range, $0 < p_i < 1$.
    * The sum of all probabilities is one: $p_1 + p_2 + ... + p_k = 1$.
 
 
3. As an example, suppose a variable $X$ can take the values of 1, 2, 3, or 4. The probabilities with each outcome are described as follows:

| Outcome     	| 1   	| 2   	| 3   	| 4   	|
|-------------	|-----	|-----	|-----	|-----	|
| Probability 	| 0.1 	| 0.2 	| 0.3 	| 0.4 	|

4. We can visualize this probability distribution with a histogram. In addition, we can compute the probability that $X$ is equal to 2 or 3 as: $P(X = 2 \cup X = 3) = P(X = 2) + P(X = 3) = 0.2 + 0.3 = 0.5$. Similarly, the probability that $X$ is greater than 1 is equal to $P(X > 1) = 1 - P(X = 1) = 1 - 0.1 = 0.9$.


5. As mentioned above, we know that a probability mass function is used to describe the probability distribution of a discrete variable, so what is it exactly? We can simply understand it as a function that gives the probability that our random variable is exactly equal to some value. For example, considering the rolling of a die, we can visualize the probability of each possible value as follows:

<div>
    <img src= "images/fair_dice_probability_distribution.png" width = 50%/>
    </div>

Figure 1: Visualization of a probability mass function of the rolling of a die. All the numbers on the die have an equal chance of appearing on top when the die stops rolling.

6. We can use the following probability mass function to express the same information:
$$
p_X(x) = \begin{cases} \frac{1}{6}, x \in \{1,2,3,4,5,6\} \\ 0, \text{otherwise}
\end{cases}
$$
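
Both discrete examples in this subsection can be written down directly as code. This is a minimal sketch: `die_pmf` mirrors the piecewise probability mass function of a fair die, and `table_pmf` holds the four-outcome table from the earlier example; both names are hypothetical.

```python
from fractions import Fraction

def die_pmf(x):
    """PMF of a fair six-sided die: 1/6 on the faces 1..6, 0 elsewhere."""
    return Fraction(1, 6) if x in {1, 2, 3, 4, 5, 6} else Fraction(0)

# The probabilities over all possible values must sum to one
total = sum(die_pmf(x) for x in range(1, 7))

# The four-outcome table, stored as a dictionary from value to probability
table_pmf = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}
p_2_or_3 = table_pmf[2] + table_pmf[3]   # P(X = 2 or X = 3)
p_greater_than_1 = 1 - table_pmf[1]      # P(X > 1) via the complement rule

print(total, die_pmf(7))
print(round(p_2_or_3, 2), round(p_greater_than_1, 2))
```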

### c) Probability distribution of a continuous random variable
1. For a continuous random variable, the probability is not defined at any specific value. Instead, it is defined over an interval of values, and is represented by the area under a curve, in other words, an integral. As a result, the probability of observing any single value is equal to 0, since the number of values which may be assumed by the random variable is infinite.


2. Consider a random variable $X$ that may take all values over an interval of real numbers. The probability that $X$ is in the set of outcomes $A$, in other words $P(X \in A)$, is the area under a curve above $A$. The curve, which is represented by a function $p_X(x)$, must satisfy the following:
    * The curve has no negative values: $p_X(x) \geq 0,  \forall x$.
    * The total area under the curve is equal to 1.


3. The curve that satisfies the above conditions is called the density curve. Because the probability of a continuous random variable is described as the area under the density curve, we call the distribution function of this type of variable the probability density function.  


4. As an example, suppose we are studying the distribution of intelligence quotient (IQ) of a designated population. When we plot the population distribution as a function of IQ, we will get the famous bell-shaped curve. Using this curve, and taking a mean of 100 and a standard deviation of 15, we can compute the probability that a randomly encountered person has an IQ score within the range [70,130] to be approximately 95.45%.
<div> 
    <img src = 'images/iq.jpg' width = 80%/>
    </div>

Figure 2: Visualization of the probability density function describing the distribution of IQ Scores in the general population (adapted from the Wechsler intelligence score).
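
The area under the IQ density curve can be computed with the standard library. This is a sketch assuming IQ scores follow a normal distribution with mean 100 and standard deviation 15; the area over an interval is obtained as a difference of cumulative distribution values.

```python
from statistics import NormalDist

# Assumed model: IQ ~ Normal(mean = 100, standard deviation = 15)
iq = NormalDist(mu=100, sigma=15)

# P(70 <= X <= 130) is the area under the density curve between 70 and 130
p_70_to_130 = iq.cdf(130) - iq.cdf(70)

# The area above a single point is zero for a continuous random variable
p_exactly_100 = iq.cdf(100) - iq.cdf(100)

print(round(p_70_to_130, 4), p_exactly_100)
```

The interval [70, 130] spans two standard deviations on each side of the mean, which is why the result matches the familiar "about 95%" rule for normal distributions.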

### d) Commonly used probability distribution

#### Binomial distribution
1. Consider a random experiment or event that has only two possible outcomes ("Yes" or "No", "Pass" or "Fail"), where one outcome occurs with probability $p$ and the other with probability $1-p$. We define the binomial distribution as the probability distribution of the number of times a given outcome occurs when the experiment is repeated independently multiple times. 


2. Consider an event with $n$ trials, a probability $p$ of a specific outcome in each trial, and let $x$ be the number of times that outcome occurs within the $n$ trials. The formula for the binomial distribution is:
$$
P(X = x) = \binom{n}{x}p^x(1-p)^{n-x}
$$


3. As an example, suppose we are investigating the disk manufacturing process of a certain company. We know that the probability of a disk being defective is 0.01, independently of the other disks. The company sells the disks in packages of 10 and offers a money-back guarantee that at most 1 of the 10 disks is defective. We try to answer the following questions:
    * What proportion of packages is returned? And,
    * If someone buys three packages, what is the probability that exactly one of them will be returned?

***
* We denote $X$ as the number of defective disks in a package. Because each disk has two possible outcomes ("DEFECTIVE" or "FUNCTIONAL"), $X$ is a discrete random variable that follows a binomial distribution, and we can answer our questions by computing the probability that a package will have to be returned.


* To do so, firstly, we have to identify our $p$. According to the information about the defect probability, we know that $p = 0.01$.


* Secondly, we have to identify our $n$. Because the company sells the disks in packages of 10, our $n$ is 10.


* Thirdly, we have to identify our $x$. The guarantee covers packages with at most 1 defective disk, so, assuming that customers always take advantage of the guarantee, a package is returned when it contains more than 1 defective disk. Our $x$ therefore ranges from 2 to 10. 


* Therefore, the probability that a package is returned is as follows:
$$
P(X \geq 2) = P(X = 2) + ... + P(X = 10) = 1 - P(X = 0) - P(X = 1) = 1 - \binom{10}{0}(0.01)^0 (1-0.01)^{10} - \binom{10}{1}(0.01)^1 (1-0.01)^{9} \approx 0.0043
$$
***
<div>
    <img src="images/binominal.png"/>
    </div>

#### 

In [2]:
##
# DEMO: COMPUTE THE IQ PROBABILITY
#
import math

def normal_pdf(x: float, mean: float, std_dev: float) -> float:
    # Density of the normal distribution with the given mean and standard deviation
    return (1.0 / (2.0 * math.pi * std_dev ** 2) ** 0.5) * math.exp(-1.0 * ((x - mean) ** 2 / (2.0 * std_dev ** 2)))

def approximate_integral(a, b, n, f):
    # Midpoint rule: split [a, b] into n rectangles of equal width
    delta_x = (b - a) / n
    
    total_sum = 0
    for i in range(1, n + 1):
        # Midpoint of the i-th rectangle
        midpoint = 0.5 * (2 * a + delta_x * (2 * i - 1))
        total_sum += f(midpoint)
        
    return total_sum * delta_x

p_between_70_and_130 = approximate_integral(a = 70, b = 130, n = 1000, f = lambda x: normal_pdf(x,100,15))
print(round(p_between_70_and_130, 4))

##
# DEMO: BINOMIAL DISTRIBUTION COMPUTATION
#
from scipy.stats import binom

n = 10 # number of disks in a package 
p = 0.01 # probability of a disk being defective

# P(X = x) for x = 0, 1, ..., n
probability_array = []

for x in range(n+1):
    probability = binom.pmf(x,n,p)
    probability_array.append(probability)

# A package is returned when more than 1 of its disks is defective
result = 1 - probability_array[0] - probability_array[1]
print(round(result, 4))  

0.9545
0.0043
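
Both questions about the disk packages can also be worked through with the standard library alone. This is a sketch under two stated assumptions: a package is returned when it contains more than one defective disk (the reading of the "at most 1 defective" guarantee), and the three purchased packages are independent of each other. The helper `binom_pmf` is hypothetical and implements the binomial formula given earlier.

```python
from math import comb

def binom_pmf(x: int, n: int, p: float) -> float:
    """Binomial probability of exactly x occurrences in n independent trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n = 10     # disks per package
p = 0.01   # probability that a single disk is defective

# Question 1: a package is returned when 2 or more disks are defective
p_return = 1 - binom_pmf(0, n, p) - binom_pmf(1, n, p)

# Question 2: exactly one of three packages is returned, where each
# package is itself a trial with "success" probability p_return
p_one_of_three = binom_pmf(1, 3, p_return)

print(round(p_return, 4), round(p_one_of_three, 4))
```

Note that the second question reuses the binomial distribution at a different level: the trials are now packages rather than disks.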


## 3. Expectation, Variance and Covariance
### a) Expectation (expected value)
1. Consider a random variable $X$ taking the possible values $\mathbf{x} = \{x_1, x_2,...,x_n\}$. We know that the probability of all possible values for this random variable can be described by a probability function $p_X(x)$. Thus, the expected value (or expectation) is the mean value that $X$ takes on when its possible values $\mathbf{x}$ are drawn from $p_X(x)$.


2. Simply put, from the frequentist point of view, an expected value is the theoretical mean value of a random phenomenon after it has occurred many times. In other words, the expected value is a measure of central tendency: a value toward which the average of the results will tend.


3. Depending on whether the random variable is discrete or continuous, we have two different methods to compute the expected value:
    * For discrete variables, the expected value is a weighted average.
$$
E[X] = \sum_i x_i p_X(x_i) = p_X(x_1)x_1 + p_X(x_2)x_2 + ... + p_X(x_n)x_n
$$

    * For continuous variables, the expected value is the integral of $x$ weighted by the density function $f(x)$ over all possible values. 
$$
E[X] = \int_{-\infty}^{\infty} xf(x)dx
$$


3. As an example, considering the value $X$ we get from rolling a die, we know that all possible values are "1", "2", "3", "4", "5" and "6". Each possible value has a chance of $\frac{1}{6}$ of taking place. Thus, the expected value is:
$$
E[X] =\frac{1}{6}\times1 + \frac{1}{6}\times2  + \frac{1}{6}\times3 + \frac{1}{6}\times4 + \frac{1}{6}\times5 + \frac{1}{6}\times6 = 3.5
$$


4. Another fun example is computing the expected value of a lottery (in this case, the Eurojackpot) to determine whether it is worth buying. The Eurojackpot is a transnational European lottery where the prize starts at 10,000,000€ and can roll over up to 90,000,000€. Playing the Eurojackpot costs 2€ per line. Considering a pot of 47,000,000€, we can compute the expected value of a line by multiplying each prize by its probability of being won, and summing all the products. The resulting expected value is 9.84€, meaning that if we play this pot, we expect to win 9.84 - 2 = 7.84€ on average. Therefore, it is worth buying.

<div>
    <img src = "images/eurojackpot.png" width = 80%/>
    </div>
    
Figure 2: An example of computing the expected value of a pot in the Eurojackpot lottery. Note that the odds mentioned here are based on the information provided by Veikkaus (the Finnish National Betting Agency), and are applicable before 10.10.2014. Since 10.10.2014, the odds have been adjusted to decrease the chance of winning. 

5. Expectations are linear. For example, consider two random variables $X$ and $Y$:
$$
E[\alpha X + \beta Y] = \alpha E[X] + \beta E[Y]
$$


6. Note that because the expectation is also the mean of a random variable $X$, we sometimes denote it as $\mu_X$.
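
The die example and the linearity property can be checked numerically. This is a small sketch assuming two independent fair dice $X$ and $Y$; the helper `expectation` and the coefficients `alpha` and `beta` are chosen only for illustration.

```python
from fractions import Fraction
from itertools import product

def expectation(pmf):
    """Expected value of a discrete random variable given as {value: probability}."""
    return sum(x * p for x, p in pmf.items())

# PMF of a fair six-sided die
die = {x: Fraction(1, 6) for x in range(1, 7)}
e_die = expectation(die)

# Linearity: compute E[alpha * X + beta * Y] directly over all 36 pairs
alpha, beta = 2, 3
e_combo = sum(
    (alpha * x + beta * y) * die[x] * die[y]
    for x, y in product(die, die)
)

print(e_die, e_combo, alpha * e_die + beta * e_die)
```

The direct computation over all pairs and the shortcut through linearity give the same number, without ever needing the joint distribution in the second case.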

### b) Variance
1. Considering a random variable $X$ taking possible values $\{x_1,x_2,...,x_n\}$ that are drawn from a probability distribution $P(x)$, the variance gives us a measure of how much the values of our random variable vary as we sample different values of $X$ from $P(x)$:
$$
Var(X) = E[(X - E[X])^2]
$$


2. An alternative formula for $Var(X)$ is expressed as follows:
$$
Var(X) = E[X^2] - (E[X])^2
$$


3. Using the definition of expected value above, we can interpret the variance for discrete and continuous random variables as follows:
    * Discrete $X$: $Var(X) = \sum_i x_i^2 P(X = x_i) - (E[X])^2$   
    
    
    * Continuous $X$: $Var(X) = \int_{-\infty}^{\infty} x^2f(x)dx - (E[X])^2$


4. As an example, considering the rolling of a die, we compute its variance to be 2.92.
<div>
    <img src = "images/variance.png" width = 50%/>
    </div>

Figure 3: An example of computing the variance for rolling a die.


5. The variance has the following interesting property:
$$
Var(\alpha X + \beta) = \alpha^2 Var(X)
$$
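
The variance of a die roll, the alternative formula, and the scaling property can all be verified exactly. This is a sketch assuming a fair six-sided die; the helper names and the constants `alpha` and `shift` are hypothetical.

```python
from fractions import Fraction

def expectation(pmf):
    """Expected value of a discrete random variable given as {value: probability}."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """Var(X) = E[(X - E[X])^2]."""
    mu = expectation(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

die = {x: Fraction(1, 6) for x in range(1, 7)}

var_die = variance(die)

# Alternative formula: E[X^2] - (E[X])^2
var_alt = sum(x ** 2 * p for x, p in die.items()) - expectation(die) ** 2

# Scale by alpha and add a constant: only alpha affects the variance
alpha, shift = 3, 2
transformed = {alpha * x + shift: p for x, p in die.items()}
var_transformed = variance(transformed)

print(var_die, float(var_die), var_transformed)
```

The exact variance is 35/12, which rounds to the 2.92 quoted for the die example.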


### c) Covariance
1. Consider two random variables $X$ and $Y$ taking possible values $\{x_1,x_2,...,x_n\}$ and $\{y_1,y_2,...,y_n\}$ drawn from two probability distributions $P(x)$ and $P(y)$, respectively. The covariance between $X$ and $Y$ gives us some sense of how much the two variables are linearly related to each other, as well as the scale of these variables:
$$
Cov[X, Y] = E[(X - E[X])(Y - E[Y])]
$$

2. We can also express the covariance with the following alternative formula:
$$
Cov[X,Y] = E[XY] -E[X]E[Y].
$$

3. The covariance has the following interesting properties:
    * The covariance is symmetric: $Cov(X,Y) = Cov(Y,X)$.
    * The covariance is additive in each argument: $Cov(X_1 + X_2, Y) = Cov(X_1,Y) + Cov(X_2,Y)$.
    * $Cov(X,X) = Var(X)$.
    * $Cov(\alpha X,Y) = \alpha Cov(X,Y)$.
    
    
4. High absolute values of the covariance mean that the values change very much and are both far from their respective means (expectations) at the same time. If the sign of the covariance is positive, then both variables tend to take on relatively high values simultaneously. If the sign of the covariance is negative, then one variable tends to take on a relatively high value at the times that the other takes on a relatively low value and vice versa.
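
The two covariance formulas can be compared on a tiny joint distribution. The example below is an assumption made for illustration: two fair coin flips, with $X$ the number of heads on the first flip and $Y$ the total number of heads, so the two variables tend to be large together.

```python
from fractions import Fraction

# Joint PMF of (X, Y): each of the four flip sequences has probability 1/4
joint = {
    (0, 0): Fraction(1, 4),  # tail, tail
    (0, 1): Fraction(1, 4),  # tail, head
    (1, 1): Fraction(1, 4),  # head, tail
    (1, 2): Fraction(1, 4),  # head, head
}

def expect(g):
    """E[g(X, Y)] under the joint PMF."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

e_x = expect(lambda x, y: x)
e_y = expect(lambda x, y: y)

# Definition: E[(X - E[X])(Y - E[Y])]
cov_definition = expect(lambda x, y: (x - e_x) * (y - e_y))

# Alternative formula: E[XY] - E[X]E[Y]
cov_alternative = expect(lambda x, y: x * y) - e_x * e_y

# Cov(X, X) reduces to Var(X)
var_x = expect(lambda x, y: (x - e_x) ** 2)
cov_xx = expect(lambda x, y: x * x) - e_x * e_x

print(cov_definition, cov_alternative, var_x)
```

The covariance is positive here, in line with the interpretation above: a head on the first flip pushes the total number of heads above its mean.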

### d) Statistical independence

# Appendix
## 1. How the joint probabilities work

## 2. Revisiting conditional probabilities

## 3. Revisiting the product rule
Consider a joint probability $P(\mathbf{x})$ over many variables $x^{(1)},..., x^{(n)}$. We can generalize the product rule as follows: 
$$
P(\mathbf{x}) = P(x^{(1)},...,x^{(n)}) = P(x^{(1)})\prod_{i = 2}^n P(x^{(i)}|x^{(1)},...,x^{(i-1)}) 
$$
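
The generalized product rule (often called the chain rule of probability) can be checked numerically: each factor conditions a variable on all the variables that come before it. This is a sketch on an arbitrary joint distribution over three binary variables; the weights and helper names are hypothetical and chosen only so that every probability is non-zero.

```python
from fractions import Fraction
from itertools import product

# Arbitrary joint distribution over (x1, x2, x3), each variable in {0, 1}
states = list(product([0, 1], repeat=3))
weights = [1, 2, 3, 4, 5, 6, 7, 8]  # illustrative weights, sum to 36
joint = {s: Fraction(w, sum(weights)) for s, w in zip(states, weights)}

def marginal(prefix):
    """P(x1, ..., xk = prefix), summing out the remaining variables."""
    k = len(prefix)
    return sum(p for s, p in joint.items() if s[:k] == prefix)

def chain_rule(state):
    """P(x1) * product over i >= 2 of P(xi | x1, ..., x(i-1))."""
    prob = marginal(state[:1])
    for i in range(2, len(state) + 1):
        # Conditional probability as a ratio of two marginals
        prob *= marginal(state[:i]) / marginal(state[:i - 1])
    return prob

matches = all(chain_rule(s) == joint[s] for s in states)
print(matches)
```

Each conditional factor is a ratio of consecutive marginals, so the product telescopes back to the full joint probability.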