# Module 2: Properties of random variables



## Outline for the day

* Review Lecture 1
    - Where does the functionality come from?
    - Discussion: Getting reproducible
    - What is a random variable?
* Main descriptors of random variables
    - (Algebraic and central) moments
    - Population parameters vs. sample statistics
* Jointly distributed random variables
    - Covariance and correlation


## Where does the functionality come from?

* Python:
    - high-level programming language
    - readable syntax
    - flexible, modular, wide-spread
    - version 2.7 vs. 3.6
* Python modules:
    - numpy
    - matplotlib
    - pandas
    - scipy (scipy.stats)
    - scikit-learn
    - ...
* Editors and working environments:
    - Terminal
    - Text editor (Emacs, VI, ...)
    - PyCharm, Spyder, Anaconda, ...
    - Jupyter (local web server, many languages, Markdown cells, ...)
    

## Nature technology feature: Data transparency

* Have you done data analysis in point-and-click graphical applications such as Microsoft Excel?

* Have you covered reproducibility before elsewhere?

* Does automatisation = high quality results?

* Reproducible for whom?

* Can anyone steal and criticise my ideas, if I make my code open-source?

* "At the American Journal of Political Science, a recent emphasis on reproducibility extended average publication times by more than seven weeks"

![the_general_problem.png](attachment:the_general_problem.png)
-- The general problem by xkcd



## Random Variables

A random variable is obtained by assigning a numerical value to each
**outcome** of an experiment. (Technically, it's a real-valued function on $S$).

* $x$ is a realisation of $X$
* The distribution of $X$ is given by its p.d.f. $f_X(x)$ (continuous case) or its p.m.f. $p_X(x)=P(X=x)$ (discrete case)

Random variables may be **discrete** or
**continuous**. In some cases, the outcomes of an experiment/observation
are already numeric, and these values will form the random variable of
interest (e.g. streamflow, annual discharge, peak discharge, etc.). In
other cases, we may have **categorical** data or other non-numeric data
for which we need to assign values to create our set of random
variables. An example may be ephemeral flow in a stream (e.g.
flow/no-flow), or cloud cover (here we could use oktas).


> **Question**
A radioactive mass emits particles at an average rate of 15 particles per minute. Define the random variable X to be the number of particles emitted in a 10-minute time frame. Then X is what type of random variable?
- (a) discrete
- (b) continuous


> **Question**
One of these particles is emitted at noon today. Define the random variable Y to be the time elapsed between noon and the next emission. Then Y is what type of random variable?
- (a) discrete
- (b) continuous


### Vocabulary


__Population__ - a complete assemblage of all values representative of a random process, i.e. typically unknown

__Sample__ - a group of values selected for studying the properties of a population. I.e. a subset of the population.

__Statistical Inference__ - use information from a sample to draw conclusions (inferences) about
the population from which the sample was taken

__Parameter__ - a parameter is a value, usually unknown (and which therefore has to be estimated),
used to represent a certain population characteristic

__Statistic__ - A statistic is a quantity that is calculated from a sample of data used to estimate the
parameters of the population. So, it's a function of a random variable, i.e. itself a random variable

* **So: sample statistics are estimates for population parameters**.
* **...that means they are functions of random variables, and thereby themselves random variables**

__Estimator__ - An estimator is any quantity calculated from the sample data which is used to give
information about an unknown quantity in the population. For example, the sample mean is an
estimator of the population mean.

__Estimate__ - an estimate is the particular value of an estimator that is obtained from a particular
sample of data and used to indicate the value of a parameter. If the value of the estimator in a
particular sample is found to be 5, then 5 is the estimate of the population mean μ.

__Estimation__ - Estimation is the process by which sample data are used to indicate the value of an
unknown quantity in a population

__Moments__ - Parameters that describe a distribution. Typically: central tendency, dispersion, and symmetry




## Moments of a Distribution
When we consider a statistical analysis of data and start to think about the predictability of the events the data represent (certain discharge quantities, for example), we need some mechanism to describe the probability of the event occurring.


### Analogy from classical mechanics
The calculation of moments is based on a concept from classical mechanics (torque, or moment of force). For the first moment of a part $dA$ of an object $A$ about the origin we get $x \cdot dA$, so that we may define $\mu_1 ' = \int_A x \cdot dA$, where x is the distance from the origin (and $A$ its mass).

In general, the i-th moment of X with $p.d.f.$ $f(x)$ about the origin is:

$$\mu_i ' = \int_{-\infty}^\infty {x^i \cdot f(x)dx}$$

For central moments, the distance to the mean (instead of the origin) is used (then $\mu '$ becomes $\mu$).

While moments are useful in describing a random variable,
it is important to keep in mind that two random variables may have the
same moments, but their probability distributions may
be quite different.
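The moment definitions above can be checked numerically. The sketch below is only an illustration: it assumes an exponential p.d.f. with rate 1 (a choice not made in the lecture) and approximates the integrals with a midpoint rule on a truncated grid; the helper names `raw_moment` and `central_moment` are hypothetical.

```python
import numpy as np

# Midpoint-rule grid on [0, 50]; the assumed p.d.f. is negligible beyond that
dx = 1e-3
x = (np.arange(50_000) + 0.5) * dx
f = np.exp(-x)  # assumed p.d.f.: exponential distribution with rate 1

def raw_moment(i):
    """i-th moment about the origin, approximated numerically."""
    return np.sum(x**i * f) * dx

def central_moment(i):
    """i-th moment about the mean, approximated numerically."""
    mu = raw_moment(1)
    return np.sum((x - mu)**i * f) * dx

print(raw_moment(1))      # close to 1 (the mean)
print(raw_moment(2))      # close to 2
print(central_moment(2))  # close to 1 (the variance)
```

For this particular p.d.f. the analytical values are 1, 2 and 1, so the printed numbers give a quick sanity check of the two definitions.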

### Expected Value of a Random Variable

The expected value (or population mean) of a random variable X is denoted $E(X)$ (or $\mu$ or $\mu_1 '$) and is usually unknown.

For a *continuous* random variable with $p.d.f.$ $f(x)$, the expected value is defined as:

$$E(X) = \int_{-\infty}^\infty {x \cdot f(x)dx}$$

If $g(X)$ is a function of X, then the expected value of $g(X)$ is given by:

$$E[g(X)] = \int_{-\infty}^\infty {g(x) \cdot f(x)dx}$$

Therefore,
* The expected value of $x^i$ is equal to the $i^{ th}$ moment about origin $E[x^i ] = \mu_i '$
* The expected value of $(x- \mu)^i$ is equal to the $i^{th}$ central moment $E[(x- \mu)^i ] = \mu_i$

> **Question:** What are the values of the central moments $\mu_0$ and $\mu_1$?

#### Discrete Case

The expected value or expectation of a *discrete* random variable with a
**probability mass function** $P(X = x_i) = p_i$ is:

$$E(X) = \sum_i {p_i  x_i}$$

$E(X)$ provides a summary measure of the average value taken by the
random variable and is also known as the **mean** of the random
variable.

It is possible to think of the *expected value* as a *weighted average*
within the sample space, $S$, where the weights are the probabilities of
$x_i$.
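As a small worked example of the weighted-average view, consider a fair six-sided die. The die is an assumption chosen for illustration; exact fractions are used so that no rounding obscures the result.

```python
from fractions import Fraction

# Fair six-sided die (illustrative): outcomes x_i, each with probability p_i = 1/6
outcomes = [1, 2, 3, 4, 5, 6]
probs = [Fraction(1, 6)] * 6

# E(X) = sum of p_i * x_i
expected = sum(p * x for p, x in zip(probs, outcomes))

# E[g(X)] with g(x) = x^2, using the same weights
expected_square = sum(p * x**2 for p, x in zip(probs, outcomes))

print(expected)         # 7/2
print(expected_square)  # 91/6
```

Note that the expected value 3.5 is not itself a possible outcome of the die; it is a summary of the distribution, not a prediction of a single draw.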


### Measures of Central Tendency

#### Arithmetic Mean

A sample estimate of the population mean is the arithmetic average (no $p.d.f.$ needed):

$$\bar{X} = {1\over n} \sum_i {x_i}$$

#### Geometric Mean

Used for the lognormal distribution (we'll cover this later). Evaluates
the *central tendency* by using the product of values (as opposed to the
arithmetic mean which uses their sum). The geometric mean is defined as
the $n^{th}$ root of the product of $n$ numbers:

$$\bar X_G = \left(\prod_{i=1}^n x_i \right)^{1/n} = \sqrt[n]{x_1 x_2 \cdots x_n}$$

where $\prod x_i = x_1 x_2 x_3 \cdots x_n$.

For instance, the geometric mean of two numbers, say 2 and 8, is just
the square root of their product; that is $\sqrt{2\cdot 8}=4$. As
another example, the geometric mean of the three numbers 4, 1, and 1/32
is the cube root of their product (1/8), which is 1/2; that is
$\sqrt[3]{4\cdot 1\cdot 1/32}=1/2$.

The geometric mean is useful when comparing or combining values of
different magnitudes, because it naturally *normalizes* the values so
that they may be compared.
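The two numerical examples above can be reproduced with a few lines of Python. This is a minimal sketch; the function name `geometric_mean` is hypothetical, and the computation goes through logarithms, which assumes all values are strictly positive.

```python
import math

def geometric_mean(values):
    """n-th root of the product, computed via logs (values must be > 0)."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

print(geometric_mean([2, 8]))          # approximately 4
print(geometric_mean([4, 1, 1 / 32]))  # approximately 0.5
```

Averaging the logarithms instead of multiplying the values directly avoids overflow or underflow for long series.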

#### Weighted Mean

The weighted mean is the quantity

$$\bar{x} = \frac{ \sum_{i=1}^n w_i x_i}{\sum_{i=1}^n w_i},$$

which means:

$$\bar{x} = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{w_1 + w_2 + \cdots + w_n}.$$

Therefore data elements with a high weight contribute more to the
weighted mean than do elements with a low weight. The weights cannot be
negative. Some may be zero, but not all of them (since division by zero
is not allowed).

The common mean $\frac {1}{n}\sum_{i=1}^n {x_i}$ is a special case of
the weighted mean where all data have equal weights, $w_i=w$. When the
weights are normalized then $w_i'=\frac{1}{n}.$
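A short sketch of the weighted mean, using made-up values and weights (they are assumptions for illustration only). It computes the formula directly, compares it with `np.average`, and shows that equal weights recover the common mean.

```python
import numpy as np

# Hypothetical data and weights, for illustration only
x = np.array([10.0, 20.0, 30.0])
w = np.array([1.0, 1.0, 2.0])

weighted = np.sum(w * x) / np.sum(w)      # direct use of the definition
weighted_np = np.average(x, weights=w)    # same quantity via numpy
equal_weights = np.average(x, weights=np.ones_like(x))

print(weighted)       # 22.5
print(weighted_np)    # 22.5
print(equal_weights)  # 20.0, identical to x.mean()
```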

#### Median

The median is that value of $x$ for which $P(X≤x) = P(X≥x) = 0.5$. In
the case of a **continuous** distribution the median corresponds to the
ordinate which separates the density curve into two parts having equal
areas of 1⁄2 each.

$$\int_{-\infty}^{\mu_{md}} {f(x)dx} = \int_{\mu_{md}}^{\infty} {f(x)dx} = 0.5$$

Simply, there are equal chances of selecting data above and below the
median.

For a **discrete** random variable, the median is the value halfway
through the *ordered data set*, below and above which there is an equal
number of data values.

$$\mu_{md} = x_p$$

where $p$ is determined from $\sum_{i=1}^p f(x_i) = 0.5$

#### Mode

Mode is simply a measure of where the most values are located in the
distribution. However, it is not necessarily equal to the mean or
median. The shape of the distribution has a greater impact on the median
and mean, and their lack of equivalence to the mode can indicate extreme
values.

For the **continuous** case, the mode is the location where the
$p.d.f.$ reaches a maximum value:

$${df(x) \over dx } = 0 \quad \text{and} \quad {d^2f(x) \over dx^2} \lt 0$$

For a **discrete** variable, the mode is the $x$ value associated with
$\max_{i} f( x_i )$.

*A sample or population may have none, one or more than one mode. (Can
you give an example of each case?)*
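For a sample, median and mode can be obtained with Python's standard `statistics` module. The small lists below are hypothetical and chosen only to illustrate the definitions; `statistics.multimode` requires Python 3.8 or newer.

```python
import statistics

# Hypothetical small samples, chosen only to illustrate the definitions
one_mode = [1, 2, 2, 3, 4]
two_modes = [1, 1, 2, 3, 3]

median_odd = statistics.median(one_mode)       # middle value of the ordered data
median_even = statistics.median([1, 2, 3, 4])  # mean of the two middle values
modes_one = statistics.multimode(one_mode)     # list of the most frequent values
modes_two = statistics.multimode(two_modes)

print(median_odd, median_even)  # 2 2.5
print(modes_one, modes_two)     # [2] [1, 3]
```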

> **Question** Suppose you want to assess drinking water quality in a developing mining community. You take 10 water samples, analyse them in your lab, and obtain the following cyanide concentrations (in ug/L): 120, 130, 380, 390, 390, 400, 410, 410, 420, 430. What should be done with the values 120 and 130? Which of the following is the best course of action?
- (a) Delete them from the data set since they are outliers.
- (b) Keep them in the data set even though they are outliers.
- (c) Determine why these values were so much lower than the rest, then delete them.
- (d) Determine why these values were so much lower than the rest, then keep them in the data set, provided they were produced by the *same processes* as the rest of the dataset.

In [1]:
%matplotlib notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')



![ec_finse2.png](attachment:ec_finse2.png)

In [2]:
#read in data
df = pd.read_csv('2018-08-13T113000.csv',index_col=0,parse_dates=True)

df.head()

| Unnamed: 0 | u_m/s | v_m/s | w_m/s | T_degC | CO2_ppm | H2O_ppt |
|---|---|---|---|---|---|---|
| 2018-08-13 11:30:00.000 | -0.259165 | 3.33241 | 0.406093 | 10.876847 | 394.112 | 7.4551 |
| 2018-08-13 11:30:00.100 | -0.114277 | 3.36608 | 0.832847 | 11.093791 | 394.535 | 7.32271 |
| 2018-08-13 11:30:00.200 | -0.27447 | 3.76503 | 0.525727 | 11.33101 | 395.316 | 6.98443 |
| 2018-08-13 11:30:00.300 | -0.258144 | 4.00175 | 0.402011 | 10.955879 | 395.061 | 6.91087 |
| 2018-08-13 11:30:00.400 | -0.33773 | 3.95175 | 0.33824 | 10.984468 | 395.143 | 6.85304 |


In [3]:
#explore data range and coverage
df['T_degC'].plot()
df['T_degC'].resample('1min').mean().plot(label='mean')
df['T_degC'].resample('1min').median().plot(label='median')
plt.legend()


In [4]:
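# histogram of temperature with three measures of central tendency
from scipy.stats.mstats import gmean

plt.figure()
ax = plt.subplot(111)

cnt, bins, patches = ax.hist(df['T_degC'], bins=20)

plt.axvline(x=np.mean(df['T_degC']), color='b')    # arithmetic mean (blue)
plt.axvline(x=np.median(df['T_degC']), color='k')  # median (black)
plt.axvline(x=gmean(df['T_degC']), color='m')      # geometric mean (magenta), needs positive values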




### Measures of Dispersion

The term dispersion refers to the spread of the data.

We are interested
in understanding something about the *variability* of the data. If we
are told a distribution has an expected value, $E(X)$, of 0.5 --does
that give us any information about the *shape* of the distribution? To
have an idea of the probability properties of a distribution, we need to
understand more information than just the *central tendencies*. Some
common measures of dispersion include: **Range**, **Variance**, and the
**Coefficient of Variation**.

#### Range

The simplest measure, it is the difference between the maximum and
minimum values

$$Range = X_{max}  - X_{min}$$

#### Variance

Variance provides a **more robust** measure of the variability of a
population than the range. The variance is a non-negative number giving
us insight into how widely distributed the values of the random variable
are likely to be. If we have very high variance, it decreases the
'chance' that we will draw a value close to the central value. In other
words, the $x_i$ are more scattered within the distribution. Think of
*variance* ~ *variability*, or mean squared deviation. We symbolize
variance with $Var(X)$, $V(X)$, or $\sigma^2$:

$$V(X) = \sigma^2 = E[(X-E(X))^2] = E(X^2)-E^2(X)$$

$$\sigma^2= E[(x-\mu)^2] = \int (x-\mu)^2 f(x)dx = \mu_2$$

Note that the variance is also the **second central moment**.

For a **discrete** variable with $n$ equally likely values, we can calculate the variance:

$$\sigma_x^2 = {\sum{(X_i - \mu)^2 } \over n }$$

We can make a sample estimation of *population* variance from
observations:

$$S^2 = {\sum_{i=1}^n{(X_i - \bar x)^2 } \over n-1 }$$

Comparing these equations, note two distinct differences: (1) $\mu$
(*population mean*, which we cannot know) is replaced by $\bar x$
(*sample* mean, an estimate of $\mu$), and (2) $n$ is replaced by $n-1$
in order to have an unbiased estimation of population variance.

Take it as an exercise to prove this by deriving:
$E(S^2) = ... \sigma^2$
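Before attempting the derivation, the effect of the $n-1$ divisor can be explored by simulation. The sketch below is not a proof; it assumes normally distributed data with a known variance of 4 and a deliberately small sample size, and compares the average of many variance estimates computed with divisor $n$ and with divisor $n-1$.

```python
import numpy as np

rng = np.random.default_rng(0)

true_var = 4.0  # assumed population variance (sigma = 2)
n = 5           # deliberately small sample size

# 200,000 independent samples of size n, one per row
samples = rng.normal(loc=0.0, scale=2.0, size=(200_000, n))

mean_var_n = samples.var(axis=1, ddof=0).mean()          # divisor n
mean_var_n_minus_1 = samples.var(axis=1, ddof=1).mean()  # divisor n-1

print(mean_var_n)          # systematically below 4, near (n-1)/n * 4 = 3.2
print(mean_var_n_minus_1)  # close to 4
```

The estimate with divisor $n$ underestimates the population variance on average, and the gap shrinks as $n$ grows.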

Often it is convenient to work with the **standard deviation** rather
than the variance because it has the *same units* as the original
variable. We can find the standard deviation simply from the variance:

$$\sigma = \sqrt{V(X)} = \sqrt { \sigma^2}$$

or from a *Sample Estimation*:

$$S = \sqrt{ {\sum_{i=1}^n (X_i - \bar x)^2} \over n-1 }$$

#### Coefficient of Variation

The coefficient of variation measures the spread of a set of data as a
proportion of its mean. It is often expressed as a percentage. It is
dimensionless, so it can be used for cross comparison, i.e. for
comparison of the variability of the same variable at different places,
etc. It is the ratio of the sample standard deviation to the sample
mean:

$$Cv = {S \over \bar x }$$

There is an equivalent definition for the coefficient of variation of a
population, which is based on the expected value and the standard
deviation of a random variable:

$$Cv = { \sqrt{V(X)} \over E(X)}$$

In [5]:
plt.figure()
ax = plt.subplot(111)


In [6]:
# mean and standard deviation of the temperature record
t_mean = np.mean(df['T_degC'])
t_std = np.std(df['T_degC'])

cnt, bins, patches = ax.hist(df['T_degC'], bins=20)
plt.axvline(x=t_mean, color='b')                   # mean
plt.axvline(x=t_mean - t_std, ls='--', color='b')  # mean minus one standard deviation
plt.axvline(x=t_mean + t_std, ls='--', color='b')  # mean plus one standard deviation

print("Cv=%2.3f" % (t_std / t_mean))

Cv=0.055


In [7]:
df2 = pd.read_csv('2018-08-13T093000.csv',index_col=0,parse_dates=True)
cnt, bins, patches = ax.hist(df2['T_degC'], bins=20)
plt.axvline(x=np.mean(df2['T_degC']), color='r')
#plt.axvline(x=np.mean(df2['T_degC']-np.std(df2['T_degC'])), ls='--',color='r')
#plt.axvline(x=np.mean(df2['T_degC']+np.std(df2['T_degC'])), ls='--',color='r')

print("Cv=%2.3f"% (np.std(df2['T_degC'])/np.mean(df2['T_degC'])) ) 

Cv=0.065


### Measures of Symmetry

Two further *moments* of the distribution are commonly used: **skewness**
and **kurtosis**. They provide further information regarding the shape
of the distribution. Recall that all four of these moments or measures
are simply parameters that describe the shape of the distribution
function. We are concerned with what we may expect when we 'pull' a
sample from a population: given a distribution function, we have a
better idea of what to expect beyond simply an expected value. Having
looked at central tendency and width, we now look at two measures that
give information on the left/right-ness of the distribution and its
peakedness.

![haan_fig33.png](attachment:haan_fig33.png)

#### Skewness

Skewness is a parameter that describes asymmetry in a random variable’s
probability distribution.

$$\text{skewness} = \int(x-\mu)^3f(x)dx = \mu_3$$

It is also the **3rd central moment** of a population.

**Coefficient of skewness** (dimensionless, more commonly used). Several
types of skewness are defined, the terminology and notation of which are
unfortunately rather confusing. "The" coefficient of skewness of a
distribution, the **population skewness**, is defined to be:

$$\gamma = {\mu_3 \over (\mu_2)^{3/2 }} = {\mu_3 \over \sigma^3}$$

Where $\mu_3$ is the 3rd moment about mean, and $\sigma$ is the standard
deviation.

**Sample unbiased estimate** is:

$$Cs = { n \sum (x_i - \bar x)^3 \over (n-1)(n-2)S^3 }$$

*NOTE*: if $Cs \gt 0$, the skew is positive (long tail on the right; most
hydrological variables, e.g. discharge hydrograph, rainfall, etc.). If
$Cs \lt 0$, the skew is negative (long tail on the left).
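The sample estimate $Cs$ can be written directly from its formula. The sketch below uses a hypothetical helper `sample_skewness` and made-up data. To the best of my understanding this is the same bias-adjusted form that pandas reports from `skew()`, but that is worth confirming against the pandas documentation rather than taking from this sketch.

```python
import numpy as np

def sample_skewness(x):
    """Cs = n * sum((x_i - mean)^3) / ((n-1) * (n-2) * S^3), S with divisor n-1."""
    x = np.asarray(x, dtype=float)
    n = x.size
    s = x.std(ddof=1)
    return n * np.sum((x - x.mean()) ** 3) / ((n - 1) * (n - 2) * s**3)

# Hypothetical data, for illustration only
right_tailed = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 10.0])
symmetric = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

cs_right = sample_skewness(right_tailed)   # positive: long tail on the right
cs_left = sample_skewness(-right_tailed)   # mirrored data: same size, negative sign
cs_sym = sample_skewness(symmetric)        # zero for symmetric data

print(cs_right, cs_left, cs_sym)
```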

#### Kurtosis

Kurtosis is a parameter that describes how peaked a random variable’s
probability distribution is.

$$\text{kurtosis} = \int(x-\mu)^4f(x)dx = \mu_4$$

It is also the **4th central moment** of a population.

For a **sample** of $n$ values the *sample excess kurtosis* is:

$$k_2 = \frac{m_4}{m_{2}^2} -3 = \frac{\tfrac{1}{n} \sum_{i=1}^n (x_i - \overline{x})^4}{\left(\tfrac{1}{n} \sum_{i=1}^n (x_i - \overline{x})^2\right)^2} - 3$$

where $m_4$ is the fourth sample moment about the mean, $m_2$ is the
second sample moment about the mean (that is, the **variance** computed
with divisor $n$), $x_i$ is the $i^{th}$ value, and $\overline{x}$ is the
**sample mean**. Subtracting 3 makes the excess kurtosis of a normal
distribution equal to zero.
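The sample excess kurtosis formula above translates almost literally into code. This is a minimal sketch with a hypothetical helper `excess_kurtosis`, applied to simulated normal and uniform data (both assumptions for illustration). Note that library routines may apply a bias correction; as far as I know pandas' `kurtosis()` does, so its values can differ slightly from this moment-based form.

```python
import numpy as np

def excess_kurtosis(x):
    """m4 / m2^2 - 3, with both sample moments taken about the mean (divisor n)."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    m2 = np.mean(d**2)
    m4 = np.mean(d**4)
    return m4 / m2**2 - 3.0

rng = np.random.default_rng(1)
k_normal = excess_kurtosis(rng.normal(size=100_000))    # near 0
k_uniform = excess_kurtosis(rng.uniform(size=100_000))  # near -1.2 (flatter than normal)

print(k_normal, k_uniform)
```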

In [8]:
#calculate skewness
df.hist()
print(df.skew())


u_m/s     -0.663962
v_m/s      0.026707
w_m/s      0.530051
T_degC     0.675111
CO2_ppm   -0.723968
H2O_ppt    0.543470
dtype: float64


In [9]:
print(df.kurtosis())

u_m/s      0.058661
v_m/s     -0.437510
w_m/s      0.876392
T_degC     0.110910
CO2_ppm    0.274278
H2O_ppt   -0.041216
dtype: float64


### Jointly distributed random variables

#### Covariance

$Cov(X,Y) = \sigma_{X,Y}=\mu_{1,1} = \int \int (x-\mu_X)(y-\mu_Y)f_{X,Y}(x,y)dxdy$

* measures linear dependence of two random variables
* if X and Y are independent $\implies \sigma_{X,Y}=0$ (but the converse does not hold!)
* also, it's not a cause-and-effect measure
* has units of [X] times [Y]
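That zero covariance does not imply independence can be seen in a tiny constructed example (the values are made up for illustration): $Y$ is completely determined by $X$, yet the covariance vanishes because the dependence is not linear.

```python
import numpy as np

# X is symmetric about zero and Y = X^2 depends on X deterministically
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x**2

# Covariance as the mean product of deviations (divisor n)
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

print(cov_xy)  # 0.0
```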

In [14]:
df.plot(x='T_degC',y='w_m/s',kind='scatter')
#df.plot(x='CO2_ppm',y='w_m/s',kind='scatter')
#df.plot(x='H2O_ppt',y='w_m/s',kind='scatter')

#df.plot.hexbin(x='T_degC',y='w_m/s', gridsize=50)


In [15]:
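# show marginal distributions alongside the joint distribution
import seaborn as sns

# x and y are passed as keywords; recent seaborn versions no longer accept them positionally
sns.jointplot(x='CO2_ppm', y='w_m/s', data=df, kind="hex")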



#### Correlation

$\rho_{X,Y}=\frac{\sigma_{X,Y}}{\sigma_X \sigma_Y}$

* dimensionless
* also only measures linear dependence and it's also not a cause-and-effect measure
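The definition of the correlation coefficient can be checked against `np.corrcoef`. The sketch below uses simulated data (an assumption for illustration, with $Y = 2X + \text{noise}$) and also shows that, unlike covariance, correlation does not change when one variable is rescaled.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=1000)
y = 2.0 * x + rng.normal(size=1000)  # linearly related to x, plus noise

# Correlation from its definition: covariance over the product of standard deviations
cov_xy = np.cov(x, y)[0, 1]  # np.cov uses divisor n-1 by default
rho = cov_xy / (x.std(ddof=1) * y.std(ddof=1))

rho_numpy = np.corrcoef(x, y)[0, 1]         # same quantity via numpy
rho_scaled = np.corrcoef(100.0 * x, y)[0, 1]  # change of units for x

print(rho, rho_numpy, rho_scaled)  # all three agree
```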

![haan_fig35.png](attachment:haan_fig35.png)

In [None]:
df.corr()
#df.cov()

### Further properties

If Z is a linear function of two random variables X and Y, i.e.

$Z=aX+bY$

$E(Z)=E(aX+bY)=aE(X)+bE(Y)$

$Var(Z)=Var(aX+bY)=E[(aX+bY)^2] - E^2(aX+bY)$

which can be shown to be

$Var(Z)=a^2 Var(X)+b^2 Var(Y)+2abCov(X,Y)$

* So: For uncorrelated random variables, the variance of the sum is equal to the sum of the variances.
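The expression for $Var(aX+bY)$ can be verified numerically. In the sketch below the data are simulated and the coefficients $a$ and $b$ are arbitrary (all assumptions for illustration). The identity holds exactly for sample moments as long as the same divisor is used throughout, here $n$.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=10_000)
y = 0.5 * x + rng.normal(size=10_000)  # deliberately correlated with x
a, b = 2.0, -3.0                        # arbitrary coefficients

lhs = np.var(a * x + b * y)  # variance of the linear combination

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
rhs = a**2 * np.var(x) + b**2 * np.var(y) + 2 * a * b * cov_xy

print(lhs, rhs)  # equal up to floating-point rounding
```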

This can be generalised for Z being a linear function of n random variables:

$Z=\sum a_i X_i$

Then:

$Var(Z)=\sum_i a_i^2Var(X_i)+2 \sum_{i<j} a_i a_j Cov(X_i,X_j)$

For a special case of this, let $X_i$ be a random sample of size n, and let all $a_i$ be equal to 1/n. Then Z becomes the sample mean $\overline{X}$, and $Var(Z)=Var(\overline{X})$.

Since the $X_i$ form a random sample, $Cov(X_i,X_j)=0$ for $i \neq j$.

$Var(Z)=Var(\overline{X})=\sum_{i=1}^n \frac{1}{n^2} Var(X)=\frac{n}{n^2} Var(X)=\frac{Var(X)}{n}$

So, the variance of the mean of a random sample is the variance of the population divided by the number of observations in the sample.
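A simulation makes this result tangible. The sketch below assumes normally distributed data with variance 9 and samples of size 25 (illustrative choices only), draws many such samples, and looks at the spread of their means.

```python
import numpy as np

rng = np.random.default_rng(4)

sigma2 = 9.0  # assumed population variance (sigma = 3)
n = 25        # observations per sample

# 100,000 independent samples of size n; one sample mean per row
sample_means = rng.normal(loc=10.0, scale=3.0, size=(100_000, n)).mean(axis=1)

var_of_mean = sample_means.var()
expected = sigma2 / n

print(var_of_mean)  # close to 0.36
print(expected)     # 0.36
```

Averaging 25 independent observations therefore reduces the standard deviation by a factor of 5, which is why longer averaging periods give smoother estimates.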

#### References / Notes

Concepts and Examples are based on

1. Ang H-S.A., Tang W.H.; Probability Concepts in Engineering Planning and Design, Volume 1: Basic Principles <http://books.google.no/books/about/Probability_Concepts_in_Engineering_Plan.html?id=EIRRAAAAMAAJ&redir_esc=y>
2. Haan, C.; Statistical Methods in Hydrology 2nd. Edition. 