# 05. Random Variables, Expectations, Data, Statistics, Arrays and Tuples
## [Mathematical Statistical and Computational Foundations for Data Scientists](https://lamastex.github.io/scalable-data-science/360-in-525/2018/04/)

&copy;2018 Raazesh Sainudiin. [Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/)

### Topics

- Continuous Random Variables
- Expectations
- Data and Statistics
- Sample Mean
- Sample Variance
- Order Statistics
- Frequencies
- Empirical Mass Function
- Empirical Distribution Function
- Arrays
- Tuples
 

# Random Variables

A random variable is a mapping from the sample space $\Omega$ to the set of real numbers $\mathbb{R}$.  In other words, it is a numerical value determined by the outcome of the experiment.

We already saw *discrete random variables* that take values in a discrete set, of two types:

- those with finitely many values, e.g., the two values in $\{0,1\}$ for the Bernoulli$(\theta)$ RV
- those with *countably infinitely many* values, e.g., values in the set of all non-negative integers $\{0,1,2,\ldots\}$, for the 'infinite coin tossing experiment' that records the number of times you wait until the first Heads occurs.

Now, we will see the other main type of real-valued random variable.

## Continuous random variable

When a random variable takes on values in the continuum we call it a continuous RV.

### Examples

- Volume of water that fell on the Southern Alps yesterday (See video link below)
- Vertical position above sea level, in micrometers, since the original release of a pollen grain at the headwaters of a river
- Resting position in degrees of a roulette wheel after a brisk spin

## Probability Density Function

A RV $X$ with DF $F$ is called continuous if there exists a piecewise-continuous function $f$, called the probability density function (PDF) of $X$, such that, for any $a, b \in \mathbb{R}$ with $a < b$,

$$
P(a < X \le b) = F(b)-F(a) = \int_a^b f(x) \ dx \ .
$$


The following hold for a continuous RV $X$ with PDF $f$:

- For any $x \in \mathbb{R}$, $P(X=x)=0$.
- Consequently, for any $a,b \in \mathbb{R}$ with $a \le b$,
  $$P(a < X < b ) = P(a < X \le b) = P(a \leq X \le b) = P(a \le X < b)$$
- By the fundamental theorem of calculus, except possibly at finitely many points (where the continuous pieces come together in the piecewise-continuous $f$),
  $$f(x) = \frac{d}{dx} F(x)$$
- And of course $f$ must satisfy
  $$\int_{-\infty}^{\infty} f(x) \ dx = P(-\infty < X < \infty) = 1$$


### You try at home
Watch the Khan Academy [video about probability density functions](https://youtu.be/Fvi9A_tEmXQ) to warm up to the meaning behind the maths above. Consider the continuous random variable $Y$ that measures the exact amount of rain tomorrow in inches. Think of the probability space $(\Omega,\mathcal{F},P)$ underpinning this random variable $Y:\Omega \to \mathbb{Y}$. Here the sample space, range or support of the random variable $Y$, denoted by $\mathbb{Y}$, is $[0,\infty) =\{y : 0 \leq y < \infty\}$.


## The Uniform$(0,1)$ RV

The Uniform$(0,1)$ RV is a continuous RV with a probability density function (PDF) that takes the value 1 if $x \in [0,1]$ and $0$ otherwise.  Formally, this is written  


$$
\begin{equation}
f(x) = \mathbf{1}_{[0,1]}(x) =
\begin{cases}
1 & \text{if } 0 \le x \le 1 ,\\
0 & \text{otherwise}
\end{cases}
\end{equation}
$$


and its distribution function (DF) or cumulative distribution function (CDF) is:


$$
\begin{equation}
F(x) := \int_{- \infty}^x f(y) \ dy =
\begin{cases}
0 & \text{if } x < 0 , \\
x & \text{if } 0 \le x \leq 1 ,\\
1 & \text{if } x > 1
\end{cases}
\end{equation}
$$


Note that the DF is the identity map on $[0,1]$.
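As a quick sanity check of the two formulas above, here is a minimal sketch in plain Python (it should also run in a SageMath cell). The function names `uniform01_pdf`, `uniform01_cdf` and `midpoint_integral` are made up for this illustration. It compares $F(b)-F(a)$ with a numerical approximation of $\int_a^b f(x)\,dx$.

```python
def uniform01_pdf(x):
    """PDF of the Uniform(0,1) RV: 1 on [0,1] and 0 elsewhere."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def uniform01_cdf(x):
    """DF of the Uniform(0,1) RV: 0 below 0, x on [0,1], 1 above 1."""
    if x < 0.0:
        return 0.0
    if x <= 1.0:
        return x
    return 1.0

def midpoint_integral(f, a, b, n=10000):
    """Approximate the integral of f over [a,b] by the midpoint rule with n strips."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

a, b = 0.25, 0.75
prob_from_cdf = uniform01_cdf(b) - uniform01_cdf(a)     # F(b) - F(a)
prob_from_pdf = midpoint_integral(uniform01_pdf, a, b)  # integral of f from a to b
print(prob_from_cdf, prob_from_pdf)  # the two numbers should agree closely
```

Both numbers approximate $P(0.25 < X \le 0.75)$, which is just the length of the interval since the density is 1 on $[0,1]$.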

The PDF, CDF and inverse CDF for a Uniform$(0,1)$ RV are shown below.

<img src="images/Uniform01ThreeCharts.png" alt="Uniform01ThreeCharts" width=500>

The Uniform$(0,1)$ is sometimes called the Fundamental Model.

The Uniform$(0,1)$ distribution is a member of the Uniform$(a,b)$ family, with $a < b$, whose PDF is

$$
\begin{equation}
f(x) = \frac{1}{b-a} \mathbf{1}_{[a,b]}(x) =
\begin{cases}
\frac{1}{(b-a)} & \text{if } a \le x \le b,\\
0 & \text{otherwise}
\end{cases}
\end{equation}
$$

This says that if $X$ is a Uniform$(a,b)$ RV, then its density is constant on $[a,b]$, so that sub-intervals of $[a,b]$ with the same length are equally probable. The Uniform$(0,1)$ RV is the member of the family where $a=0$, $b=1$.
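As a worked step (not spelled out in the original notes), integrating the constant density $\frac{1}{b-a}$ over $[a,x]$ gives the DF of the Uniform$(a,b)$ RV:

$$
F(x) = \int_{-\infty}^x f(y) \ dy =
\begin{cases}
0 & \text{if } x < a , \\
\frac{x-a}{b-a} & \text{if } a \le x \le b ,\\
1 & \text{if } x > b
\end{cases}
$$

Setting $a=0$ and $b=1$ recovers the DF of the Uniform$(0,1)$ RV given earlier.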

The PDF and CDF for a Uniform$(a,b)$ RV, taken from Wikipedia, are shown below.

<table style="width:95%">
  <tr>
    <th><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/9/96/Uniform_Distribution_PDF_SVG.svg/500px-Uniform_Distribution_PDF_SVG.svg.png" alt="500px-Uniform_Distribution_PDF_SVG.svg.png" width=250></th>
    <th><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/6/63/Uniform_cdf.svg/500px-Uniform_cdf.svg.png" alt="wikipedia image 500px-Uniform_cdf.svg.png" width=250></th> 
  </tr>
</table>

You can dive deeper into this family of random variables <a href="https://en.wikipedia.org/wiki/Uniform_distribution_(continuous)">here</a>.

SageMath has a function for simulating samples from a Uniform$(a,b)$ distribution.  We will learn more about this later in the course. Let's go ahead and use it to simulate a sample below.

In [3]:
uniform(-1,1)  # reevaluate the cell to see how the samples change upon each re-evaluation

-0.15206205043245413
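To see more than one sample at a time, here is a hedged sketch that uses `random.uniform` from Python's standard library. This is an assumption made for illustration: it is not necessarily the same function as the SageMath `uniform` called above, although both draw from Uniform$(a,b)$. The seed and the sample size are arbitrary choices.

```python
import random

random.seed(0)  # fix the seed so that re-running gives the same samples

a, b = -1.0, 1.0
samples = [random.uniform(a, b) for _ in range(10000)]

# every simulated value should land in [a, b]
inside = all(a <= s <= b for s in samples)

# the fraction of samples below 0 should be near F(0) = (0 - a)/(b - a) = 1/2
frac_negative = sum(1 for s in samples if s < 0) / len(samples)

print(inside, frac_negative)
```

With many samples, the fraction of values in any sub-interval should be close to that sub-interval's length divided by $b-a$.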

# Expectations

The *expectation* of $X$ is also known as the *population mean*, *first moment*, or *expected value* of $X$.

$$
\begin{equation}
E\left(X\right) := \int x \, dF(x) =
\begin{cases}
\sum_x x \, f(x) & \qquad \text{if }X \text{ is discrete} \\
\int x \, f(x)\,dx  & \qquad \text{if } X \text{ is continuous}
\end{cases}
\end{equation}
$$

Sometimes, we denote $E(X)$ by $E X$ for brevity.  The expectation is a single-number summary of the RV $X$ and may be thought of as the average.
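The two cases of the definition can be sketched in plain Python as follows. The helper names `expectation_discrete` and `expectation_continuous` are hypothetical, the value $\theta = 0.3$ is an arbitrary choice, and the integral is only approximated by a midpoint rule, so treat this as an illustration rather than a general-purpose routine.

```python
def expectation_discrete(pmf):
    """E(X) = sum over x of x * f(x), for a discrete RV given as a dict {x: f(x)}."""
    return sum(x * p for x, p in pmf.items())

def expectation_continuous(pdf, lo, hi, n=100000):
    """E(X) = integral of x * f(x) dx over [lo, hi], approximated by the midpoint rule."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        total += x * pdf(x)
    return total * h

# Bernoulli(theta): f(1) = theta and f(0) = 1 - theta, so E(X) = theta
theta = 0.3
bernoulli_pmf = {0: 1 - theta, 1: theta}
mean_bernoulli = expectation_discrete(bernoulli_pmf)

# Uniform(0,1): f(x) = 1 on [0,1], so E(X) = integral of x dx over [0,1] = 1/2
mean_uniform = expectation_continuous(lambda x: 1.0, 0.0, 1.0)

print(mean_bernoulli, mean_uniform)
```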

In general though, we can talk about the Expectation of a function $g$ of a RV $X$.  

The Expectation of a function $g$ of a RV $X$ with DF $F$ is:

$$
\begin{equation}
E\left(g(X)\right) := \int g(x)\,dF(x) =
\begin{cases}
\sum_x g(x) f(x) & \qquad \text{if }X \text{ is discrete} \\
\int g(x) f(x)\,dx  & \qquad \text{if } X \text{ is continuous}
\end{cases}
\end{equation}
$$


provided the sum or integral is well-defined.  We say the expectation exists if


$$
\begin{equation}
\int \left|g(x)\right|\,dF(x) < \infty \ .
\end{equation}
$$

When we are looking at the Expectation of $X$ itself, we have $g(x) = x$.

Thinking about Expectations like this, can you see that the familiar Variance of $X$ is in fact the Expectation of $g(X) = (X - E(X))^2$?

The variance of $X$ (a.k.a. the second central moment)

Let $X$ be a RV with mean or expectation $E(X)$.  The variance of $X$ denoted by $V(X)$ or $VX$ is

$$
V(X) := E\left((X-E(X))^2\right) = \int (x-E(X))^2 \,d F(x)
$$

provided this expectation exists.  The standard deviation is denoted by $\sigma(X)$ and defined as $\sigma(X) := \sqrt{V(X)}$.

Thus variance is a measure of "spread" of a distribution.
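As a worked example, take $X$ to be the Uniform$(0,1)$ RV with PDF $f(x) = \mathbf{1}_{[0,1]}(x)$. Then

$$
E(X) = \int_0^1 x \ dx = \frac{1}{2}, \qquad
V(X) = \int_0^1 \left(x - \frac{1}{2}\right)^2 dx = \left[ \frac{1}{3}\left(x - \frac{1}{2}\right)^3 \right]_0^1 = \frac{1}{12},
$$

so the standard deviation is $\sigma(X) = \sqrt{1/12} \approx 0.2887$.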

The $k$-th moment of a RV comes from the Expectation of $g(x) = x^k$.

We call

$$
E(X^k) = \int x^k\,dF(x)
$$


the $k$-th moment of the RV $X$ and say that the $k$-th moment exists when $E(|X|^k) < \infty$.  
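For instance, for the Uniform$(0,1)$ RV every moment exists, because for each $k \in \{1,2,\ldots\}$

$$
E(X^k) = \int_0^1 x^k \ dx = \frac{1}{k+1} < \infty \ .
$$

In particular, the first two moments are $E(X) = \frac{1}{2}$ and $E(X^2) = \frac{1}{3}$.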


## Properties of Expectations



- If the $k$-th moment exists and if $j<k$ then the $j$-th moment exists.
- If $X_1,X_2,\ldots,X_n$ are RVs and $a_1,a_2,\ldots,a_n$ are constants, then $E \left( \sum_{i=1}^n a_i X_i \right) = \sum_{i=1}^n a_i E(X_i)$
- Let $X_1,X_2,\ldots,X_n$ be independent RVs, then $E \left(  \prod_{i=1}^n X_i \right) = \prod_{i=1}^{n} E(X_i)$
- For any RV $X$ whose second moment exists, $V(X) = E(X^2) - (E(X))^2$
- If $a$ and $b$ are constants, then $V \left(aX + b\right) = a^2V(X)$
- If $X_1,X_2,\ldots,X_n$ are independent and $a_1,a_2,\ldots,a_n$ are constants, then $V \left(  \sum_{i=1}^n a_i X_i \right) = \sum_{i=1}^n a_i^2 V(X_i)$
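Several of these properties can be checked exactly on small discrete RVs. The sketch below uses exact rational arithmetic from Python's `fractions` module. The two RVs (a fair six-sided die and an independent Bernoulli$(1/3)$ RV), the constants $a=3$, $b=-2$, and the helper names `E`, `V` and `pmf_of` are all illustrative assumptions, not part of the original notes.

```python
from fractions import Fraction
from itertools import product

def E(pmf, g=lambda x: x):
    """Expectation of g(X) for a discrete RV given as a dict {x: f(x)}."""
    return sum(g(x) * p for x, p in pmf.items())

def V(pmf):
    """Variance of X, computed from the definition E((X - E(X))^2)."""
    m = E(pmf)
    return E(pmf, lambda x: (x - m) ** 2)

def pmf_of(pmf, g):
    """Probability mass function of the transformed RV g(X)."""
    out = {}
    for x, p in pmf.items():
        out[g(x)] = out.get(g(x), 0) + p
    return out

die = {k: Fraction(1, 6) for k in range(1, 7)}   # X: fair six-sided die
coin = {0: Fraction(2, 3), 1: Fraction(1, 3)}    # Y: Bernoulli(1/3)

# joint PMF of (X, Y) under independence: product of the marginals
joint = {(x, y): px * py
         for (x, px), (y, py) in product(die.items(), coin.items())}

def E_joint(g):
    """Expectation of g(X, Y) under the joint PMF."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

a, b = 3, -2
var_die = V(die)

# V(X) = E(X^2) - (E(X))^2
shortcut_ok = var_die == E(die, lambda x: x ** 2) - E(die) ** 2
# V(aX + b) = a^2 V(X)
affine_ok = V(pmf_of(die, lambda x: a * x + b)) == a ** 2 * var_die
# E(XY) = E(X) E(Y) for independent X and Y
product_ok = E_joint(lambda x, y: x * y) == E(die) * E(coin)
# V(aX + bY) = a^2 V(X) + b^2 V(Y) for independent X and Y
mean_sum = E_joint(lambda x, y: a * x + b * y)
var_sum = E_joint(lambda x, y: (a * x + b * y - mean_sum) ** 2)
sum_ok = var_sum == a ** 2 * var_die + b ** 2 * V(coin)

print(shortcut_ok, affine_ok, product_ok, sum_ok)
```

Because the arithmetic is exact, each check is an equality rather than an approximate comparison. Such a check only confirms the identities for these particular RVs; it is not a proof.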

### You try at home

Watch the Khan Academy videos about [probability density functions](https://youtu.be/Fvi9A_tEmXQ) and [expected value](https://youtu.be/j__Kredt7vY) if you want to get another angle on the material more slowly step-by-step: