<img src='./fig/vertical_COMILLAS_COLOR.jpg' style= 'width:70mm'>

<h1 style='font-family: Optima;color:#ecac00'>
Máster en Big Data. Tecnología y Analítica Avanzada (MBD).
<a class="tocSkip">
</h1>

<h1 style='font-family: Optima;color:#ecac00'>
Fundamentos Matemáticos del Análisis de Datos (FMAD). 2022-2023.
<a class="tocSkip">
</h1>

<h1 style='font-family: Optima;color:#ecac00'>
04 Random Variables
<a class="tocSkip">    
</h1>  

## <span style='background:yellow; color:red'> Remember:<a class="tocSkip"> </span>     

+ Navigate to your `fmad2223` folder in the console/terminal.  
+ Execute `git pull origin main` to update the code
+ **Do not modify the files in that folder**, copy them elsewhere

In [1]:
# Standard Data Science Libraries Import

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy as scp


## Discrete Random Variables

### Theoretical Models Vs Empirical Data

+ We begin with a simple thought experiment. Imagine we roll a die (a fair one, not loaded) a million times and look at the relative frequency of every possible result. What is your guess for the numbers in the second row of this table?  
$$
\quad\\
\begin{array}{|c|c|c|c|c|c|c|}
\hline
\text{value} & 1 & 2 & 3 & 4 & 5 & 6 \\
\hline
\text{relative frequency} & ? & ? & ? & ? & ? & ? \\
\hline
\end{array}
\quad\\
$$
    The values that you clearly have in mind are a **theoretical model** (your *prior*) of the outcome of this experiment. Of course, when we run the experiment and get **empirical data**, we do not expect the results to match the theory perfectly, because this is a **random experiment**. 

+ And that is precisely the notion of a discrete random variable $X$: *a theoretical model for the outcome of a random experiment with a finite number of possible outcomes.* More precisely (from the mathematical point of view), the set of possible outcomes only needs to be discrete, that is, finite or countably infinite.

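+ Therefore, in order to describe a discrete random variable $X$ we need to provide its **probability table or probability function** (often called the *probability mass function*; the term *density* is usually reserved for continuous variables). That is a table of all the possible values of $X$ and their corresponding probabilities: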
$$
\quad\\
\begin{array}{|c|c|c|c|c|}
\hline
\text{value of }X: & x_1 & x_2 & \cdots & x_k \\
\hline
\text{Probability for that value: }P(X = x_i) & p_1 & p_2 & \cdots & p_k \\
\hline
\end{array}
\quad\\
$$
where $p_1 + p_2 + \cdots + p_k = 1$. Sometimes we will use *function notation* $f(x_i) = P(X = x_i)$, especially when we want to give a *formula* for the probability. We will soon see examples. 

+ **Exercise:** use NumPy to simulate the experiment with a million die rolls and build the table of their absolute and relative frequencies.  

In [101]:
# %load "./exclude/S04-001.py"
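
One possible way to approach this exercise is sketched below. It is not the course solution stored in `./exclude/S04-001.py`; it assumes the newer `np.random.default_rng` generator, and the seed and the names `rng`, `n_rolls` and `freq_table` are arbitrary choices made for illustration.

```python
import numpy as np
import pandas as pd

# Seeded generator so the run is reproducible (the seed value is arbitrary).
rng = np.random.default_rng(2022)

n_rolls = 1_000_000
# integers() excludes the upper bound, so this draws faces 1..6 uniformly.
rolls = rng.integers(1, 7, size=n_rolls)

# Absolute frequency of each face that appears in the simulated rolls.
values, abs_freq = np.unique(rolls, return_counts=True)

# Frequency table: one row per face, absolute and relative frequencies.
freq_table = pd.DataFrame(
    {"absolute": abs_freq, "relative": abs_freq / n_rolls},
    index=pd.Index(values, name="value"),
)
print(freq_table)
```

The relative frequencies should all be close to, but not exactly equal to, the theoretical value $1/6 \approx 0.1667$.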

### Mean and Variance for a Discrete Random Variable

+ A discrete random variable $X$ is therefore a theoretical model for how the values of a variable are distributed in a population. The **population mean** or **expectation** of $X$ represents the average of the values that $X$ takes *in the population*. It is denoted by the Greek letter $\mu$ and also by the symbol $E(X)$. Similarly we define the **population variance** $\sigma^2$.

+ One of the main goals of Statistics is to use sample data to estimate parameters of a population. Suppose that the discrete variable $X$ takes $k$ different values $x_1, x_2,\ldots, x_k$. If we have a sample of $n$ observations of $X$ and the absolute frequencies **in that sample** are $f_1, f_2, \ldots,f_k$ (so that $f_1 + \cdots + f_k = n$), then we can use that sample to estimate the *population mean* $\mu$ with the *sample mean*: 
$$
\quad\\
\bar X = \dfrac{x_1 f_1 + \cdots + x_k f_k}{n} = x_1 fr_1 + \cdots + x_k fr_k
\quad\\
$$  
  where $fr_i = f_i / n$ (for $i = 1, \ldots, k$) are the *sample relative frequencies*. It is very important that you realize that $\bar X$ is an empirical quantity that comes out of a sample. Therefore it is something that we can compute using the sample that we have. On the other hand, $\mu$ is a theoretical quantity because it belongs to the population, and we do not have access to the population (we would not need Statistics if we did!).

+ Now, looking at the last formula, recall that relative frequencies are closely related to probabilities. In fact, the idea of probability first appeared as a theoretical model of the relative frequency. So if we want to give a theoretical definition of the mean or expectation of a discrete random variable, the only sensible choice is this:
<h5 style= 'text-align:center;color:#ecac00'>
Mean of a Discrete Random Variable $X$
<a class="tocSkip">
</h5>
$$
\quad\\
\fbox{$\displaystyle\mu = E(X) = x_1 p_1 + \cdots + x_k p_k$}
\quad\\
$$
  That is, we have simply replaced relative frequencies with probabilities to go from empirical to theoretical. Similar reasoning leads to this expression for the:
<h5 style= 'text-align:center;color:#ecac00'>
Variance of a Discrete Random Variable $X$
<a class="tocSkip">
</h5>
$$
\quad\\
\fbox{$\displaystyle\sigma^2 = \operatorname{Var}(X) = 
(x_1 - \mu)^2 p_1 + \cdots + (x_k - \mu)^2 p_k$}
\quad\\
$$    

    The positive square root $\sigma$ of the variance is called the **standard deviation** of $X$.
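
The sample-mean formula above can be checked numerically. The following is a minimal sketch, assuming a small made-up sample of ten die rolls (the data are hypothetical and serve only to show that the two versions of the formula agree with the ordinary average).

```python
import numpy as np

# A small hypothetical sample of die rolls (made up for illustration).
sample = np.array([3, 1, 6, 3, 2, 6, 6, 4, 3, 5])
n = sample.size

# Distinct observed values x_i and their absolute frequencies f_i.
values, abs_freq = np.unique(sample, return_counts=True)
# Relative frequencies fr_i = f_i / n.
rel_freq = abs_freq / n

# Sample mean computed with absolute and with relative frequencies.
mean_from_abs = np.sum(values * abs_freq) / n
mean_from_rel = np.sum(values * rel_freq)

# Both should agree (up to rounding) with the plain average of the sample.
print(mean_from_abs, mean_from_rel, sample.mean())
```

Replacing `rel_freq` with the theoretical probabilities is exactly the step that turns this empirical computation into the definition of $\mu$.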

**Exercise:** use Python (with NumPy or pandas) to compute $\mu$ and $\sigma^2$ for the random variable $X$ representing the outcome of a fair die. 

In [124]:
# %load "./exclude/S04-002.py"
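
As a hedged sketch of how this exercise could be approached (not the course solution stored in `./exclude/S04-002.py`), the two boxed formulas translate almost literally into NumPy. The names `mu`, `sigma2` and `sigma` are chosen here for illustration.

```python
import numpy as np

# Values and probabilities of a fair die.
values_X = np.arange(1, 7)
probs_X = np.ones(6) / 6

# mu = x_1 p_1 + ... + x_k p_k
mu = np.sum(values_X * probs_X)

# sigma^2 = (x_1 - mu)^2 p_1 + ... + (x_k - mu)^2 p_k
sigma2 = np.sum((values_X - mu) ** 2 * probs_X)

# The standard deviation is the positive square root of the variance.
sigma = np.sqrt(sigma2)

print(mu, sigma2, sigma)
```

For a fair die the exact values are $\mu = 7/2$ and $\sigma^2 = 35/12 \approx 2.917$, so the printed numbers should match these up to floating-point rounding.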

### Sampling Discrete Random Variables with Python


+ Suppose we have a discrete random variable $X$ with values $x_1, \ldots, x_k$ and corresponding probabilities $p_1, \ldots, p_k$. In order to run simulations of our experiments with $X$, we would like to use Python to generate synthetic random samples of $X$ according to its probability distribution. We can do that with the following code:

In [133]:
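# Values of X (the faces of the die) and their probabilities (fair die: 1/6 each)
values_X = np.arange(1, 7)
probs_X = np.ones(shape = 6) / 6

# Draw a random sample of 10 values of X according to those probabilities
np.random.choice(values_X, 10, p=probs_X)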

array([2, 4, 1, 3, 4, 6, 4, 6, 2, 1])
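
The same idea works for any probability table, not only the uniform one. The sketch below assumes a hypothetical loaded die (face 6 three times as likely as each of the others) and uses the newer `np.random.default_rng` generator in place of `np.random.choice`; the seed and sample size are arbitrary. It compares the sample relative frequencies and the sample mean with their theoretical counterparts.

```python
import numpy as np

# Seeded generator so the run is reproducible (the seed value is arbitrary).
rng = np.random.default_rng(42)

values_X = np.arange(1, 7)
# Hypothetical loaded die: face 6 is three times as likely as each other face.
probs_X = np.array([1, 1, 1, 1, 1, 3]) / 8

# Large synthetic sample drawn according to probs_X.
sample = rng.choice(values_X, size=100_000, p=probs_X)

# bincount gives counts for 0..6; drop index 0 to keep faces 1..6.
rel_freq = np.bincount(sample, minlength=7)[1:] / sample.size

# Theoretical mean from the probability table.
mu = np.sum(values_X * probs_X)

print(np.round(rel_freq, 3))  # empirical, should be close to probs_X
print(probs_X)                # theoretical
print(sample.mean(), mu)      # sample mean versus population mean
```

With a sample this large the empirical and theoretical rows should be close, but they will not coincide exactly, which is the point made at the start of this section.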

### Operations on Discrete Random Variables

### The Distribution Function

## Binomial Variables

## Continuous Random Variables

## Normal Random Variables