<img src='./fig/vertical_COMILLAS_COLOR.jpg' style= 'width:70mm'>

<h1 style='font-family: Optima;color:#ecac00'>
Máster en Big Data. Tecnología y Analítica Avanzada (MBD).
<a class="tocSkip">
</h1>

<h1 style='font-family: Optima;color:#ecac00'>
Fundamentos Matemáticos del Análisis de Datos (FMAD). 2022-2023.
<a class="tocSkip">
</h1>

<h1 style='font-family: Optima;color:#ecac00'>
04 Random Variables
<a class="tocSkip">   
</h1>  

## <span style='background:yellow; color:red'> Remember:<a class="tocSkip"> </span>     

+ Navigate to your `fmad2223` folder in the console/terminal.  
+ Execute `git pull origin main` to update the code.
+ **Do not modify the files in that folder**; copy them elsewhere and work on the copies.

In [1]:
# Standard Data Science Libraries Import

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy as scp


## Discrete Random Variables

### Theoretical Models Vs Empirical Data

+ We begin with a simple thought experiment. Imagine that we roll a die (a fair, unloaded one) a million times and look at the relative frequency of each possible result. What is your guess for the numbers in the second row of this table?  
$$
\quad\\
\begin{array}{|c|c|c|c|c|c|c|}
\hline
\text{value} & 1 & 2 & 3 & 4 & 5 & 6 \\
\hline
\text{relative frequency} & ? & ? & ? & ? & ? & ? \\
\hline
\end{array}
\quad\\
$$
    Those values that you clearly have in your mind are a **theoretical model** (your *prior*) of the outcome of this experiment. Of course, when we run the experiment and we get **empirical data** we do not expect the results to be a perfect match with the theory, because this is a **random experiment**. 

+ And that is precisely the notion of a discrete random variable $X$: *a theoretical model for the outcome of a random experiment with a finite number of possible outcomes.* More precisely (from the mathematical point of view), the set of possible outcomes of the experiment must be discrete, that is, finite or countably infinite.

+ Therefore, in order to describe a discrete random variable $X$ we need to provide its **probability density table or function** (for discrete variables this is also commonly called the *probability mass function*). That is a table of all the possible values of $X$ and their corresponding probabilities:
$$
\quad\\
\begin{array}{|c|c|c|c|c|c|c|}
\hline
\text{value of }X: & x_1 & x_2 & \cdots & x_k \\
\hline
\text{Probability for that value: }P(X = x_i) & p_1 & p_2 & \cdots & p_k \\
\hline
\end{array}
\quad\\
$$
where $p_1 + p_2 + \cdots + p_k = 1$. Sometimes we will use *function notation* $f(x_i) = P(X = x_i)$, especially when we want to give a *formula* for the probability. We will soon see examples. 

+ **Exercise:** use NumPy to do the experiment with a million dice rolls and get their absolute and relative frequency table.  

In [101]:
# %load "./exclude/S04-001.py"
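
One possible way to approach this exercise is sketched below; it is not necessarily the course solution. It assumes NumPy's `default_rng` generator with an arbitrary seed, and the names `rng`, `n_rolls`, `rolls`, `abs_freq` and `rel_freq` are illustrative choices.

```python
import numpy as np

# Seeded generator so that the run is reproducible (the seed is arbitrary).
rng = np.random.default_rng(2022)

n_rolls = 1_000_000
# rng.integers excludes the upper endpoint by default, so this yields 1..6.
rolls = rng.integers(low=1, high=7, size=n_rolls)

# Absolute frequency of each face, then the relative frequencies.
values, abs_freq = np.unique(rolls, return_counts=True)
rel_freq = abs_freq / n_rolls

for v, f, fr in zip(values, abs_freq, rel_freq):
    print(f"{v}: {f:7d}  {fr:.4f}")
```

The relative frequencies should all be close to, but not exactly equal to, the theoretical value $1/6 \approx 0.1667$.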

### Mean and Variance for a Discrete Random Variable

+ A discrete random variable $X$ is therefore a theoretical model for the distribution of the values of a variable in a population. The **population mean** or **expectation** of $X$ represents the mean or average of the values that $X$ takes *in the population*. It is denoted with the Greek letter $\mu$ and also with the symbol $E(X)$. When we need to clarify the random variable involved we will sometimes use a symbol such as $\mu_X$.

+ Similarly we define the **population variance** $\sigma^2$, using all the values in the population. Both $\mu$ and $\sigma^2$ should be regarded as abstract, *hidden* values: we want to *estimate* them, obtaining approximate values, but we cannot actually determine them with certainty.

+ One of the main goals of Statistics is to use sample data to estimate parameters of a population. Suppose that the discrete variable $X$ takes $k$ different values $x_1, x_2,\ldots, x_k$. If we have a sample of $X$ of size $n$ and the absolute frequencies **in that sample** are $f_1, f_2, \ldots,f_k$ (so that $f_1 + \cdots + f_k = n$), then we can use that sample to give an estimate of the *population mean* $\mu$ using the *sample mean*: 
$$
\quad\\
\bar X = \dfrac{x_1 f_1 + \cdots + x_k f_k}{n} = x_1 fr_1 + \cdots + x_k fr_k
\quad\\
$$  
  where $fr_1, \ldots, fr_k$ are the *sample relative frequencies*. It is very important to realize that $\bar X$ is an empirical quantity that comes out of a sample, and therefore something that we can compute from the sample that we have. On the other hand, $\mu$ is a theoretical quantity because it belongs to the population, and we do not have access to the population (we would not need Statistics if we did!).

+ Now, looking at the last formula, recall that relative frequencies are closely related to probabilities. In fact, the idea of probability first appeared as a theoretical model of the relative frequency. And so, if we want to give a theoretical definition of the mean or expectation of a discrete random variable, the only sensible choice is this:
<h5 style= 'text-align:center;color:#ecac00'>
Mean of a Discrete Random Variable $X$
<a class="tocSkip">
</h5>
$$
\quad\\
\fbox{$\displaystyle\mu = E(X) = x_1 p_1 + \cdots + x_k p_k$}
\quad\\
$$
  That is, we have simply replaced relative frequencies with probabilities to go from empirical to theoretical. Similar reasoning leads to this expression for the:
<h5 style= 'text-align:center;color:#ecac00'>
Variance of a Discrete Random Variable $X$
<a class="tocSkip">
</h5>
$$
\quad\\
\fbox{$\displaystyle\sigma^2 = \operatorname{Var}(X) = 
(x_1 - \mu)^2 p_1 + \cdots + (x_k - \mu)^2 p_k$}
\quad\\
$$    

    The positive square root $\sigma$ of the variance is called the **standard deviation** of $X$.

**Exercise:** use Python (with NumPy or pandas) to compute $\mu$ and $\sigma^2$ for the random variable $X$ representing the outcome of a fair die. 

In [124]:
# %load "./exclude/S04-002.py"
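
A minimal sketch of how the two boxed formulas translate into NumPy, assuming the fair die is described by the arrays `values_X` and `probs_X` (illustrative names). It is one way to do the exercise, not necessarily the course solution.

```python
import numpy as np

# Probability table of a fair die: values 1..6, each with probability 1/6.
values_X = np.arange(1, 7)
probs_X = np.ones(6) / 6

# mu = x_1 p_1 + ... + x_k p_k
mu = np.sum(values_X * probs_X)

# sigma^2 = (x_1 - mu)^2 p_1 + ... + (x_k - mu)^2 p_k
sigma2 = np.sum((values_X - mu) ** 2 * probs_X)

# The standard deviation is the positive square root of the variance.
sigma = np.sqrt(sigma2)

print(mu, sigma2, sigma)
```

Up to floating-point rounding, this gives $\mu = 3.5$ and $\sigma^2 = 35/12 \approx 2.917$.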

### Sampling Discrete Random Variables with Python


+ Suppose we have a discrete random variable $X$ with values $x_1, \ldots, x_k$ and corresponding probabilities $p_1, \ldots, p_k$. In order to run simulations of our experiments with $X$ we would like to be able to use Python to generate synthetic random samples of $X$, according to its probability distribution. We can do that with this code:

In [133]:
values_X = np.arange(1, 7)
probs_X = np.ones(shape = 6) / 6

np.random.choice(values_X, 10, p=probs_X)

array([2, 4, 1, 3, 4, 6, 4, 6, 2, 1])
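
The same idea works when the probabilities are not all equal. The sketch below uses a hypothetical loaded die (the probabilities in `probs_X` are made up for illustration) and NumPy's newer `default_rng` generator instead of `np.random.choice`, then checks that the sample relative frequencies land close to the probabilities we asked for.

```python
import numpy as np

rng = np.random.default_rng(2022)  # arbitrary seed, for reproducibility

values_X = np.arange(1, 7)
# Hypothetical loaded die: a six comes up half of the time.
probs_X = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.5])

sample = rng.choice(values_X, size=100_000, p=probs_X)

# np.bincount counts the occurrences of 0, 1, ..., 6; drop the count for 0.
abs_freq = np.bincount(sample, minlength=7)[1:]
rel_freq = abs_freq / sample.size

print(rel_freq)
```

The printed relative frequencies are empirical, so they will only approximate `probs_X`.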

### Operations on Discrete Random Variables

+ **Example:** Assume that the population of interest is the set of households in a given city, and let the random variable $X$ represent the annual home insurance paid by each household. Similarly, let $Y$ represent the annual life insurance paid by each household. When we want to obtain the total amount of both insurance payments combined we need to consider the sum of the random variables $X + Y$. In many examples like this we would like to use the information about $X$ and $Y$ to obtain the properties of their sum $X + Y$ without having to redo the calculation from scratch. 

+ More generally, we are often interested in *linear combinations* of random variables, such as 
$$
\quad\\
a\,X + b\,Y,\qquad\text{ where }a\text{ and }b\text{ are numeric coefficients.}
\quad\\
$$

+ The mean or expectation of such a linear combination is simply the same linear combination of the expectations  of the individual variables:
$$
\quad\\
E(a\,X + b\,Y) = a\,E(X) + b\,E(Y)
\quad\\
$$

+ For the variance things get a little more complicated, because we need the notion of independence. Informally, $X$ and $Y$ are independent if knowledge about the value of $X$ does not affect the probability of the values of $Y$. The **covariance** of $X$ and $Y$ is
$$
\quad\\
\operatorname{cov}(X, Y) = E((X - \mu_X)(Y - \mu_Y))
\quad\\
$$
and the most general result says that
$$
\quad\\
\sigma^2(a\,X + b\, Y) = a^2\,\sigma^2_X + b^2\,\sigma^2_Y + 2\,a\,b\, \operatorname{cov}(X, Y)
\quad\\
$$
+ When $X$ and $Y$ are independent it turns out that $\operatorname{cov}(X, Y) = 0$ (careful: the converse is not true in general) and therefore **in the independence case** we get a simpler formula:
$$
\quad\\
\sigma^2(a\,X + b\, Y) = a^2\,\sigma^2_X + b^2\,\sigma^2_Y
\quad\\
$$
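
These formulas can be checked exactly in a small case. The sketch below assumes that $X$ and $Y$ are two independent fair dice and picks arbitrary coefficients $a = 2$, $b = 3$. It builds the joint probability table of the pair (by independence, $P(X = x_i, Y = y_j) = p_i\,p_j$) and computes the mean and variance of $a\,X + b\,Y$ directly from that table. The helper `mean_var` is an illustrative name.

```python
import numpy as np

values = np.arange(1, 7)
probs = np.ones(6) / 6


def mean_var(vals, p):
    """Mean and variance of a discrete variable given its values and probabilities."""
    mu = np.sum(vals * p)
    return mu, np.sum((vals - mu) ** 2 * p)


mu_X, var_X = mean_var(values, probs)
mu_Y, var_Y = mu_X, var_X  # Y is a second, independent fair die

a, b = 2, 3  # arbitrary coefficients

# Joint table for independent X and Y, and the value of aX + bY in each cell.
joint_probs = np.outer(probs, probs)
combo_values = a * values[:, None] + b * values[None, :]

mu_combo, var_combo = mean_var(combo_values, joint_probs)

print(mu_combo, a * mu_X + b * mu_Y)
print(var_combo, a**2 * var_X + b**2 * var_Y)
```

Both lines should print matching pairs: $17.5$ for the mean and $455/12 \approx 37.92$ for the variance, up to floating-point rounding.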



### The Distribution Function

+ The **distribution function** $F_X$ of a random variable $X$ (discrete or continuous) is defined by:
$$
\quad\\
F_X(k) = P(X\leq k)\qquad\text{ for any number }k
\quad\\
$$
You may think of $F_X(k)$ as the theoretical version of the table of cumulative relative frequencies. Therefore, it answers the question "*what is the probability that $X$ takes a value $\leq k$?*"

+ Because its values are probabilities, and because of its cumulative nature, the typical graph of the distribution function of a discrete variable is a **staircase-shaped** graph like this one. It climbs from 0 to 1, and at each value of $X$ it has a jump equal to the probability of that value:
![](fig/04-01-FuncionDistribucionVariableAleatoria.png)
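
For a discrete variable the distribution function is just a cumulative sum of the probability table. A minimal sketch for the fair die follows; the function name `F_X` and the use of `np.searchsorted` are illustrative choices, and the code assumes that `values_X` is sorted in increasing order.

```python
import numpy as np

values_X = np.arange(1, 7)      # must be sorted in increasing order
probs_X = np.ones(6) / 6
cum_probs = np.cumsum(probs_X)  # P(X <= x_i) for each value x_i


def F_X(k):
    """Distribution function P(X <= k) for any real number k."""
    # Number of values of X that are <= k.
    idx = np.searchsorted(values_X, k, side="right")
    return 0.0 if idx == 0 else float(cum_probs[idx - 1])


for k in [0.5, 1, 3, 3.7, 6, 10]:
    print(k, F_X(k))
```

Note that `F_X(3)` and `F_X(3.7)` coincide: the function stays flat between two consecutive values of $X$, which is exactly the staircase shape.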

## Binomial Variables

### Bernoulli Random Variables

+ A Bernoulli random variable is a very simple discrete random variable that only takes two values, 0 and 1, with the following probability table:
$$
\quad\\
\begin{array}{|l|c|c|}
    \hline
    \rule{0cm}{0.5cm}\text{Value of }X:&1&0\\
    \hline
    \rule{0cm}{0.5cm}\text{Probability for that value:}& p & q = 1 - p\\
    \hline
\end{array}
\quad\\
$$
These values 1 and 0 are (arbitrarily) called *success* and *failure* respectively.

+ These Bernoulli-type variables are useful because they are the building blocks for more complex types of variables, as we will soon see.

+ **Example:** the variable $X = $ "number of appearances of six when rolling a single die" is a Bernoulli variable with $p = 1/6$ and $q = 5/6$. We denote this with $X\sim Bernoulli(p)$ (the symbol $\sim$ is read "is of type ...").

+ The mean of a random variable $X\sim Bernoulli(p)$ is $\mu = p$, and its variance is $\sigma^2 = p\cdot q = p(1 - p)$.
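
Both values follow directly from the general formulas for the mean and variance of a discrete random variable, applied to the two-row probability table above (values $1$ and $0$ with probabilities $p$ and $q$). As a short worked derivation:
$$
\quad\\
\begin{aligned}
\mu &= 1\cdot p + 0\cdot q = p,\\[2mm]
\sigma^2 &= (1 - p)^2\,p + (0 - p)^2\,q = q^2\,p + p^2\,q = p\,q\,(q + p) = p\,q,
\end{aligned}
\quad\\
$$
where the last step uses $p + q = 1$.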


### Binomial Random Variables

+ **Example:** Suppose that we roll a die 11 times and we use that experiment (the whole set of 11 rolls of the die) to define a random variable $X$ where:
$$
\quad\\
X = \textit{number of appearances of 6 in those 11 rolls of the die}
\quad\\
$$

+ The situation in this example has these characteristics:

  $(1)$ There is a **basic experiment**, rolling a die in this case, that gets **repeated $n$ times** (in the example $n = 11$).  
  
  $(2)$  The $n$ repeated basic experiments are **independent** of each other. That is, the outcome of one of the experiments is not affected in any way by the outcome of the others.
  
  $(3)$  Every individual instance or trial of the basic experiment can only result in **success** (in the example, rolling a 6), represented with value $1$, or in **failure** (not rolling a 6), represented with value $0$.  
  
  $(4)$  The **probability of success** for every trial is $p$ and that for failure is therefore $q = 1- p$. In the example $p = 1/6, q= 5/6$.  
  
  $(5)$ Finally, **our variable $X$ is the number of successful trials (with outcome 1) in the whole set of $n$ independent trials**.

+ **Definition of Binomial Variable**  
  A discrete random variable  $X$ with the above characteristics is a binomial variable with parameters $n$ and $p$, and we will use the symbol $X \sim B(n, p)$ to denote this. 
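
The five characteristics above can be mirrored almost literally in a simulation. The sketch below repeats the 11-roll experiment of the example many times; the number of repetitions `N`, the seed, and the variable names are illustrative assumptions. Each row of `rolls` is one run of the whole experiment, each entry of `is_success` is a Bernoulli trial, and `X_samples` holds one value of $X$ per run.

```python
import numpy as np

rng = np.random.default_rng(2022)  # arbitrary seed

n = 11        # trials per experiment, as in the example
p = 1 / 6     # probability of success (rolling a 6)
N = 100_000   # how many times we repeat the whole experiment (arbitrary)

# (1), (2): n independent rolls of a fair die, repeated N times.
rolls = rng.integers(low=1, high=7, size=(N, n))

# (3), (4): each trial is a success (1) if it shows a 6, otherwise a failure (0).
is_success = (rolls == 6).astype(int)

# (5): X is the number of successes among the n trials.
X_samples = is_success.sum(axis=1)

print(X_samples[:10])
print(X_samples.mean(), n * p)
```

Since $X$ is a sum of $n$ Bernoulli variables with mean $p$ each, the linearity of the expectation gives $E(X) = n\,p$, and the sample mean printed on the last line should be close to that value.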

### Experiments with Binomial Variables using Python

+ Let us see an example of a binomial variable. We will use the `prevalentHyp` variable in the `framingham` table that we have used in previous sessions. The variable takes the value 1 if the patient is hypertensive and 0 otherwise. Keep in mind that 1 and 0 are arbitrary, and so in this example *success* actually means that the patient is in fact hypertensive. 

+ **Exercise:**  
    (a) Load the data table into the `framingham` pandas DataFrame. You have done this before.  
    (b) Find the probability that a randomly chosen patient is hypertensive, and call it $p$.  
    (c) Instead of choosing a single patient, suppose that we choose seven patients at random and with replacement. Let $X$ denote the number of hypertensive patients among those seven. What values can this variable $X$ actually take?  
    (d) Use Python to choose a sample of seven patients (with replacement) and count the number of hypertensive patients in that sample.  
    (e) Iterate the previous step $N = 50000$ times and store the 50000 results in a NumPy array called `X_samples`. Get a relative frequency table of the different values in `X_samples`.  
    (f) Choose the right plot to illustrate the contents of `X_samples`.
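
A hedged sketch of steps (b), (d) and (e) follows. Because the real data file is not loaded here, the code builds a **synthetic stand-in** table called `framingham_demo` with a made-up proportion of hypertensive patients. The table size (4000) and the proportion (0.3) are assumptions for illustration only and say nothing about the real `framingham` data. With the real table, step (a) would replace the synthetic construction.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2022)  # arbitrary seed

# ASSUMPTION: synthetic stand-in for the real table (made-up size and proportion).
framingham_demo = pd.DataFrame(
    {"prevalentHyp": rng.choice([0, 1], size=4000, p=[0.7, 0.3])}
)
hyp = framingham_demo["prevalentHyp"].to_numpy()

# (b) The mean of a 0/1 column is the proportion of ones.
p = hyp.mean()

# (d) One sample of seven patients, with replacement, and its number of successes.
n = 7
one_sample = rng.choice(hyp, size=n, replace=True)
one_count = one_sample.sum()

# (e) Repeat N times: each row is one sample of seven patients.
N = 50000
X_samples = rng.choice(hyp, size=(N, n), replace=True).sum(axis=1)

rel_freq = pd.Series(X_samples).value_counts(normalize=True).sort_index()
print(p, one_count)
print(rel_freq)
```

Since `X_samples` only takes a few integer values, a bar chart of `rel_freq` is one reasonable choice for step (f).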

### Probability Density for Binomial Variables

+ The table of relative frequencies that you obtained in the previous exercise is an empirical approximation of the following theoretical expression:
$$
\fbox{
$
\quad\\
\textbf{Theoretical probability density of a binomial variable }X\sim B(n, p)
\quad\\
\quad\quad\quad\quad
P(X = k) =\displaystyle\binom{n}{k}\,p^k\,q^{(n -k)}\quad\text{ for }\quad k = 0, 1, 2, \ldots, n
$
}
$$
where, as before, $q = 1 - p$, and the *binomial coefficient* is defined as:
$$
\dbinom{n}{k}=\frac{\overbrace{n\left( n-1\right) \left( n-2\right) \cdots \left( n-k+1\right) }^{k\mbox{ factors}}}{k!}
$$
where $k! = k\cdot(k - 1)\cdot(k - 2)\cdot\,\cdots\,\cdot 2\cdot 1$ is the factorial of $k$.

+ Luckily you will not have to compute these by hand: Python will do the hard work for us.
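
As an illustration, the sketch below evaluates the binomial probabilities for the earlier example $X \sim B(11, 1/6)$ in two ways: directly from the boxed formula, using `math.comb` for the binomial coefficient, and with `scipy.stats.binom.pmf`. The variable names are illustrative.

```python
from math import comb

import numpy as np
from scipy import stats

n, p = 11, 1 / 6
q = 1 - p
k = np.arange(0, n + 1)  # possible values 0, 1, ..., n

# Direct translation of P(X = k) = C(n, k) p^k q^(n - k).
pmf_formula = np.array([comb(n, int(ki)) * p**ki * q ** (n - ki) for ki in k])

# The same probabilities computed by SciPy.
pmf_scipy = stats.binom.pmf(k, n, p)

print(np.allclose(pmf_formula, pmf_scipy))
print(pmf_formula.sum())
```

The two arrays should agree, and the probabilities should add up to 1 (up to floating-point rounding), as required of any probability table.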

## Continuous Random Variables

## Normal Random Variables