# Numerical characteristics of random variables

### RV characteristics can be obtained mathematically (from equations) if we know the distribution, or they can be *estimated* if we don't know the distribution.

### If the distribution is known, the characteristics are obtained from equations:

## Mathematical expectation

In Python, we can calculate the mathematical expectation of a discrete RV using the formula $E[X] = \sum_x P(X=x)\,x$.

For continuous RVs, we must approximate the integral $E[X] = \int_{-\infty}^{\infty} x\,p(x)\,dx$ numerically, where $p(x)$ is the PDF of the RV.

<mark>
Take every possible value 𝑥, multiply it by how likely it is to occur, and add everything up. <br>
Mathematical expectation = Mean</mark>


In [20]:
import numpy as np

PP=np.array([[2,2,4],[1,1,2]])  # Unnormalized weights (rows: x, columns: y); only their ratios matter
P=PP/np.sum(PP)   # Normalization so that the sum is equal to 1
# Possible values of RVs
y=np.array([1,2,6])
x=np.array([-1,1])
# Marginal distributions (separate distributions of only one variable):
Px=P.sum(axis=1)
Py=P.sum(axis=0)
Px,Py

(array([0.66666667, 0.33333333]), array([0.25, 0.25, 0.5 ]))

In [21]:
# Calculating mathematical expectation
# Without built-in numpy functions:
Ex = 0
for i in range(len(Px)):
    Ex+=Px[i]*x[i]
Ey = 0
for i in range(len(Py)):
    Ey+=Py[i]*y[i]

Ex,Ey

(np.float64(-0.3333333333333333), np.float64(3.75))

In [22]:
# np.dot(A,B) calculates exactly the sum of the products of the elements of two vectors:
Ex=np.dot(Px,x)
Ey=np.dot(Py,y)
Ex,Ey

(np.float64(-0.3333333333333333), np.float64(3.75))

## Variance
For discrete RVs: $\text{Var}[X] = \sum_x P(X=x)(x-E[X])^2$.

For continuous RVs: $\text{Var}[X] = \int_{-\infty}^{\infty} (x-E[X])^2 p(x)\,dx$

<mark>
Take (x − mean)² for each possible x, multiply by its probability, and sum everything.</mark>
 
If the variance is small:

- values are close to the mean
- low variability
- predictable

If the variance is large:

- values are far from the mean
- high variability
- unpredictable large swings


$\text{SD}(X) = \sqrt{\sum_x P(X=x)\,(x - E[X])^2} = \sqrt{\text{Var}[X]}$


In [4]:
from math import sqrt
varx = 0
for i in range(len(x)):
    varx+=Px[i]*(x[i] - Ex)**2
stdx = sqrt(varx)

vary = 0
for i in range(len(y)):
    vary+=Py[i]*(y[i] - Ey)**2
stdy = sqrt(vary)

stdx,stdy

(0.9428090415820634, 2.277608394786075)

In [24]:
# Using np.dot and the identity Var[X] = E[X^2] - (E[X])^2
Ex2=np.dot(Px,x**2)
Ey2=np.dot(Py,y**2)
stdx=sqrt(Ex2-Ex**2)
stdy=sqrt(Ey2-Ey**2)
stdx,stdy

(0.9428090415820634, 2.277608394786075)
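
The `np.dot` cell above relies on a shortcut for the variance. As a brief derivation in the notation of this section (a standard identity, written out here for reference), expanding the square and using $\sum_x P(X=x) = 1$ gives:

$$
\begin{aligned}
\text{Var}[X] &= \sum_x P(X=x)\,(x - E[X])^2 \\
&= \sum_x P(X=x)\,x^2 - 2E[X]\sum_x P(X=x)\,x + E[X]^2 \sum_x P(X=x) \\
&= E[X^2] - 2E[X]^2 + E[X]^2 = E[X^2] - E[X]^2
\end{aligned}
$$

For the example, $E[X^2] = 1$ and $E[X] = -1/3$, so $\text{Var}[X] = 1 - 1/9 = 8/9$ and $\text{SD}(X) = \sqrt{8/9} \approx 0.9428$, which agrees with the printed value.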

## Covariance
Discrete RVs: $\text{Covar}[X,Y] = \sum_{x,y} P(X=x,Y=y)(x-E[X])(y-E[Y])$.

Continuous RVs: $\text{Covar}[X,Y] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} (x-E[X])(y-E[Y])\, p(x,y)\,dx\,dy$

In [25]:
# Covariance
nx=x-Ex
ny=y-Ey
s=0
for i in range(len(x)):
    for j in range(len(y)):
        s+=P[i,j]*nx[i]*ny[j]
nx, ny, s

(array([-0.66666667,  1.33333333]),
 array([-2.75, -1.75,  2.25]),
 5.551115123125783e-17)

In [27]:
# The same double sum with numpy: flatten P and the outer product of the centered values

print('The covariance is', np.dot(P.flatten(), np.outer(nx, ny).flatten()))

The covariance is 5.551115123125783e-17
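
The covariance above is zero up to floating-point error. One possible explanation, sketched below, is that in this example the joint table factorizes into the product of its marginals, meaning $X$ and $Y$ are independent. The variable name `is_independent` is introduced here only for illustration.

```python
import numpy as np

# Same joint distribution as in the example above (rows: x, columns: y).
P = np.array([[2, 2, 4], [1, 1, 2]], dtype=float)
P /= P.sum()

Px = P.sum(axis=1)  # marginal distribution of X
Py = P.sum(axis=0)  # marginal distribution of Y

# X and Y are independent exactly when P(X=x, Y=y) = P(X=x) * P(Y=y) for every pair.
is_independent = np.allclose(P, np.outer(Px, Py))
print(is_independent)
```

Independence implies zero covariance, but the converse does not hold in general, so a zero covariance alone would not be enough to conclude independence.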


## Correlation

In [28]:
s/(stdx*stdy),stdx,stdy

(2.5851005526422704e-17, 0.9428090415820634, 2.277608394786075)
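
The cell above divides the covariance by the product of the two standard deviations. Written in the notation of the previous sections, this is the usual (Pearson) correlation coefficient; the symbol $\rho$ is introduced here for illustration:

$$
\rho(X,Y) = \frac{\text{Covar}[X,Y]}{\text{SD}(X)\,\text{SD}(Y)}, \qquad -1 \le \rho(X,Y) \le 1
$$

For the example, $\rho \approx 5.55\cdot 10^{-17} / (0.9428 \cdot 2.2776) \approx 2.59\cdot 10^{-17}$, which is zero up to floating-point error.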

### Function which calculates all the characteristics introduced above

In [32]:
import numpy as np
from math import sqrt

def num_characteristics(P,x,y):
    P=P/np.sum(P) # normalization (not in place: the caller's array stays unchanged and integer input works)
    Px=np.sum(P,axis=1) # marginal distribution
    Py=np.sum(P,axis=0)
    Ex=np.dot(Px,x)
    Ey=np.dot(Py,y)

    Ex2=np.dot(Px,x**2)
    Ey2=np.dot(Py,y**2)

    stdx=sqrt(Ex2-Ex**2)
    stdy=sqrt(Ey2-Ey**2)

    nx=x-Ex
    ny=y-Ey

    cov=np.dot(P.flatten(), np.outer(nx, ny).flatten())
    corr=cov/(stdx*stdy)
    return {
        'P':P,
        'x':x,
        'y':y,
        'Px':Px,
        'Py':Py,
        'Ex':Ex,
        'Ey':Ey,
        'stdx':stdx,
        'stdy':stdy,
        'cov':cov,
        'corr':corr
    }


In [33]:
x=np.arange(1.,2.,0.2)
y=np.arange(0.,1.,0.2)
P=np.eye(5)
print(x,y)
P
A=num_characteristics(P,x,y)
print(A['cov'],A['corr'])

[1.  1.2 1.4 1.6 1.8] [0.  0.2 0.4 0.6 0.8]
0.07999999999999999 0.9999999999999981
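
The correlation of (almost exactly) 1 above comes from the diagonal joint table, which puts all probability on pairs with $y = x - 1$. As a minimal sketch, the compact helper below (`corr_from_joint` is a hypothetical name, not part of the notebook) repeats the calculation and also tries an anti-diagonal table, where $y$ decreases linearly as $x$ increases.

```python
import numpy as np

def corr_from_joint(P, x, y):
    """Correlation of X and Y from a joint probability table (hypothetical helper)."""
    P = np.asarray(P, dtype=float)
    P = P / P.sum()                      # normalize without modifying the input
    Px, Py = P.sum(axis=1), P.sum(axis=0)
    nx, ny = x - Px @ x, y - Py @ y      # centered values
    cov = nx @ P @ ny                    # sum over i,j of P[i,j] * nx[i] * ny[j]
    return cov / np.sqrt((Px @ nx**2) * (Py @ ny**2))

x = np.arange(1., 2., 0.2)
y = np.arange(0., 1., 0.2)

r_diag = corr_from_joint(np.eye(5), x, y)             # y = x - 1 with probability 1
r_anti = corr_from_joint(np.fliplr(np.eye(5)), x, y)  # y decreases as x increases
print(r_diag, r_anti)
```

A perfectly linear increasing relationship should give a correlation of 1 and a perfectly linear decreasing one should give -1, up to rounding.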


### Mathematical expectation of the binomial distribution

In [34]:
from scipy import stats
n=10
p=0.5
k = np.arange(0, n+1)
P_binom = stats.binom.pmf(k, n, p)

E_binom=np.dot(P_binom,k)
E_binom

4.999999999999999

### Standard deviation (square root of the variance) of the binomial distribution

In [35]:
E2=np.dot(P_binom,k**2)
std_binom=sqrt(E2-E_binom**2)
std_binom

1.581138830084192
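
As a cross-check, the sums can be compared with the closed-form results for the binomial distribution, $E[X] = np$ and $\text{Var}[X] = np(1-p)$, and with the `mean`, `var` and `std` methods of `scipy.stats.binom`. This is a sketch that reuses the same `n` and `p`; the names `E_sum` and `var_sum` are introduced here for illustration.

```python
import numpy as np
from scipy import stats

n, p = 10, 0.5
k = np.arange(0, n + 1)
pmf = stats.binom.pmf(k, n, p)

E_sum = np.dot(pmf, k)                     # expectation from the definition
var_sum = np.dot(pmf, (k - E_sum) ** 2)    # variance from the definition

print(E_sum, n * p)                        # closed form: n*p
print(var_sum, n * p * (1 - p))            # closed form: n*p*(1-p)
print(stats.binom.mean(n, p), stats.binom.var(n, p), stats.binom.std(n, p))
```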

### Mathematical expectation of the normal distribution

In [36]:
step=0.01
x = np.arange(-100, 100,step)
m=5
var=10
sig=var**0.5
P_norm = stats.norm.pdf(x, m, sig)

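# Riemann-sum approximation of the integral, element by element:
E_norm=np.sum(np.array([step*p_xi*xi for xi,p_xi in zip(x,P_norm)]))
# The same sum written with np.dot:
E_norm=np.dot(P_norm,x)*step
E_norm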

4.999999999997442

### Variance of the normal distribution

In [37]:
step=0.1
x = np.arange(-100, 100,step)
m=5
var=10
sig=var**0.5
P_norm = stats.norm.pdf(x, m, sig)

# Riemann sum of (x - E[X])^2 * p(x) * dx
Var_norm=np.sum(np.array([step*p_xi*(xi-E_norm)**2 for xi,p_xi in zip(x,P_norm)]))
Var_norm

10.00000000000057
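
The Riemann sums above use a fixed step. As an alternative sketch (not the method used in this notebook), the same integrals can be handed to the adaptive integrator `scipy.integrate.quad`. The integration limits of ten standard deviations around the mean are an assumption made here to keep the interval finite.

```python
import numpy as np
from scipy import integrate, stats

m, var = 5.0, 10.0
sig = np.sqrt(var)
lo, hi = m - 10 * sig, m + 10 * sig   # assumption: the mass beyond 10 sigma is negligible

# quad returns (value, estimated absolute error); only the value is kept here.
E_quad, _ = integrate.quad(lambda t: t * stats.norm.pdf(t, m, sig), lo, hi)
Var_quad, _ = integrate.quad(lambda t: (t - E_quad) ** 2 * stats.norm.pdf(t, m, sig), lo, hi)
print(E_quad, Var_quad)
```

Both values should land close to the parameters `m` and `var` that define the distribution.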

## Median

In [38]:


x=np.array([1,2,6])
Px=np.array([ 0.25,  0.25,  0.5 ])
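# Median: indices i with P(X <= x_i) >= 0.5 and P(X >= x_i) >= 0.5; if several qualify, their values are averaged
med_index=[i for i in np.arange(0,len(Px)) if (np.sum(Px[:i+1]) >= 0.5) and  (np.sum(Px[i:]) >= 0.5)]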

med=np.mean(x[med_index])
med

4.0

### Median of normal distribution

In [44]:
step=0.01
x = np.arange(-100, 100,step)
m=5
var=10
sig=var**0.5
P_norm = stats.norm.pdf(x, m, sig)

med_index_norm=[i for i in np.arange(0,len(P_norm)) if (step*np.sum(P_norm[:i+1]) >= 0.5) and (step*np.sum(P_norm[i:]) >= 0.5)]

med_norm=np.mean(x[med_index_norm])
med_norm

5.000000000053717

### Median and mathematical expectation of the exponential distribution

In [45]:
lam=2
step=0.01
x = np.arange(0, 10, step)
P_exp=stats.expon.pdf(x, scale=1/lam)

E_exp=np.sum(np.array([step*z*x for x,z in zip(x,P_exp)]))

med_index_exp=[i for i in np.arange(0,len(P_exp)) if (step*np.sum(P_exp[:i+1]) >= 0.5) and (step*np.sum(P_exp[i:]) >= 0.5)]

med_exp=np.mean(x[med_index_exp])
E_exp, med_exp

(0.4999833118177803, 0.34500000000000003)
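
The list comprehensions above recompute a partial sum for every index, which gets slow for fine grids. A possible alternative, sketched here under the same grid assumptions, is to build an approximate CDF once with `np.cumsum` and look up where it first reaches 0.5. The names `cdf_approx` and `med_cumsum` are introduced for illustration.

```python
import numpy as np
from scipy import stats

lam = 2
step = 0.01
x = np.arange(0, 10, step)
pdf = stats.expon.pdf(x, scale=1 / lam)

cdf_approx = np.cumsum(pdf) * step        # running Riemann sum approximates the CDF
idx = np.searchsorted(cdf_approx, 0.5)    # first index where cdf_approx >= 0.5
med_cumsum = x[idx]

# The closed-form median of the exponential distribution is ln(2)/lam.
print(med_cumsum, np.log(2) / lam)
```

The result is only accurate to about one grid step, so a smaller `step` gives a closer match to the closed-form value.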

## Quantiles

In [46]:
# ppf (percent point function) is the inverse of the CDF

p1=stats.norm.ppf(0.01, loc=0, scale=1)
p2=stats.norm.ppf(0.99, loc=0, scale=1)
p1,p2

(-2.3263478740408408, 2.3263478740408408)

In [47]:
# The median is the quantile of order 0.5 (the 50th percentile)
p50=stats.expon.ppf(0.5, scale=1/lam)
p50

0.34657359027997264
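
Since `ppf` is the inverse of the CDF, applying `cdf` to a quantile should return its order. The short sketch below checks this for a few orders of the standard normal distribution; the array `q` and the name `central_mass` are chosen here for illustration.

```python
import numpy as np
from scipy import stats

q = np.array([0.01, 0.25, 0.5, 0.75, 0.99])
z = stats.norm.ppf(q, loc=0, scale=1)        # quantiles of the standard normal
q_back = stats.norm.cdf(z, loc=0, scale=1)   # the CDF maps them back to the orders

# Probability mass between the quantiles of order 0.01 and 0.99
central_mass = stats.norm.cdf(z[-1]) - stats.norm.cdf(z[0])
print(np.allclose(q_back, q), central_mass)
```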

##  If the probability distribution is not known, we must STATISTICALLY ESTIMATE the characteristics. The more (independent) data points we have, the better the approximation.

# Estimating RV characteristics by *generating samples* from distributions

In [48]:
# numpy mean value - ESTIMATE of the mathematical expectation
import numpy as np
x = np.arange(10)
x,np.mean(x)

(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), 4.5)

In [49]:
# numpy function median() - ESTIMATE of the true median
np.median(x)

4.5

In [50]:
# The median is more robust to asymmetric outliers:
x =np.append(x,100)
np.mean(x),np.median(x)

(13.181818181818182, 5.0)

In [51]:
# Mode - the most probable value
# This function estimates it as the most frequent value the RV takes
x =np.append(x,7)
stats.mode(x)

ModeResult(mode=7, count=2)

### Variability characterization:

In [52]:
# peak-to-peak (range = max - min):
raspon = np.ptp(x)
raspon

100

#### Estimate of the variance:
Two typical formulas:

$Var_1=\frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n}$

$Var_2=\frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}$

The first one is the maximum likelihood estimate (for normally distributed data), and it is biased (the mean value of the estimate is not equal to the true variance). The second one is unbiased and is used more often.

In [53]:
data = np.arange(7,14)
np.std(data, ddof=0)  # ddof (delta degrees of freedom) selects the estimate: ddof=0 gives the square root of Var1

2.0

In [54]:
np.std(data, ddof=1)  # ddof=1 gives the square root of Var2

2.160246899469287

In [56]:
# Estimate the standard deviation of a normal distribution (true value: 1):
# generating samples:
import numpy as np
from scipy import stats
samples=stats.norm.rvs(size=1000, loc=0, scale=1)
np.std(samples, ddof=0),np.std(samples, ddof=1)

(1.0103029259993872, 1.0108084566419804)

In [57]:
# With fewer samples the estimate is worse; ddof=0 typically underestimates the true variance (and hence the standard deviation)
samples=stats.norm.rvs(size=4, loc=0, scale=1)
np.std(samples, ddof=0),np.std(samples, ddof=1)

(0.45226841331586454, 0.5222345803477586)
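
A single small sample says little about bias, because bias is a statement about the average of the estimate over many samples. The simulation below is a sketch of that idea: it draws many samples of size 4 from a standard normal distribution and averages both variance estimates. The seed, the number of trials and the variable names are choices made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)        # fixed seed so the sketch is repeatable
n, trials = 4, 20000
samples = rng.normal(loc=0.0, scale=1.0, size=(trials, n))  # true variance is 1

mean_var0 = np.var(samples, axis=1, ddof=0).mean()   # average of Var1 over all trials
mean_var1 = np.var(samples, axis=1, ddof=1).mean()   # average of Var2 over all trials
print(mean_var0, mean_var1, (n - 1) / n)
```

The average of the `ddof=0` estimate should come out near $(n-1)/n = 0.75$ times the true variance, while the `ddof=1` average should be near 1.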

### Comparison of estimated characteristics with the true values.

In [58]:
x=np.array([-1,0,1])

Px=np.array([ 2/5,  1/5,  2/5])

In [59]:
# generate samples from a discrete distribution:
np.random.choice(x,size=100,replace=True, p=Px)

array([ 1,  0,  1,  1,  1, -1,  1, -1,  0, -1, -1, -1, -1,  1, -1,  1,  1,
        1, -1,  1,  1,  1, -1, -1,  0, -1,  1, -1,  1, -1, -1, -1, -1,  1,
       -1,  0,  0,  1,  0,  1, -1,  1, -1,  0,  1, -1,  1,  1,  0, -1,  0,
        0,  0, -1,  1,  0,  0, -1, -1,  0, -1,  1,  1,  1,  0,  1, -1, -1,
        1,  1, -1,  0,  1,  1,  1,  1,  1, -1, -1, -1,  1,  1,  0,  1,  1,
        0, -1, -1,  1,  1,  1,  1, -1, -1,  1, -1, -1, -1,  1,  0])

In [60]:
# sample means for increasing sample sizes (the true mean is 0)
num_samples = [2,10,100,100000]

for n in num_samples:
    print(np.mean(np.random.choice(x, n, True, Px)))

-0.5
-0.2
0.05
0.00147
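
To make the comparison explicit, the true characteristics can be computed from the distribution and printed next to the estimates from a large sample. This is a minimal sketch that uses NumPy's `default_rng` generator instead of `np.random.choice`; the seed and the variable names are assumptions made here.

```python
import numpy as np

x = np.array([-1, 0, 1])
Px = np.array([2/5, 1/5, 2/5])

E_true = np.dot(Px, x)                      # true expectation from the definition
Var_true = np.dot(Px, (x - E_true) ** 2)    # true variance from the definition

rng = np.random.default_rng(1)              # fixed seed so the sketch is repeatable
samples = rng.choice(x, size=100000, replace=True, p=Px)
E_est = samples.mean()                      # estimate of the expectation
Var_est = samples.var(ddof=1)               # unbiased estimate of the variance
print(E_true, E_est, Var_true, Var_est)
```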


# Estimating characteristics on real-world data

Again we use the dataset [student alcohol consumption](https://www.kaggle.com/uciml/student-alcohol-consumption/home).

In [62]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

plt.style.use('seaborn-v0_8-talk')  # 'seaborn-talk' was renamed in Matplotlib 3.6; use the old name on earlier versions
atributi = ['G3','Dalc']
data_por = pd.read_csv("student-por.csv",usecols=atributi)
data_por.head()




| | Dalc | G3 |
|---|---|---|
| 0 | 1 | 11 |
| 1 | 1 | 11 |
| 2 | 2 | 12 |
| 3 | 1 | 14 |
| 4 | 1 | 13 |


In [63]:
# statistical estimation of MEDIAN
med_alk=data_por['Dalc'].median()
med_alk

1.0

In [64]:
# mean value
sr_vr_alk=data_por['Dalc'].mean()
sr_vr_alk

1.50231124807396

In [65]:
# Standard deviation estimate, here for the final grade G3 (pandas uses ddof=1 by default)
std_alk=data_por['G3'].std()
std_alk

3.230656242804805

In [66]:
# All columns at once:
std_alk=data_por.std()
std_alk

Dalc    0.924834
G3      3.230656
dtype: float64

In [67]:
# Quantiles estimation:

data_por.quantile(0.1)  # e.g. this is the quantile of order 0.1

Dalc    1.0
G3      8.8
Name: 0.1, dtype: float64

In [68]:
# using pandas describe() method we can get all the important estimates in one command:
data_por.describe()

| | Dalc | G3 |
|---|---|---|
| count | 649.0 | 649.0 |
| mean | 1.502311 | 11.906009 |
| std | 0.924834 | 3.230656 |
| min | 1.0 | 0.0 |
| 25% | 1.0 | 10.0 |
| 50% | 1.0 | 12.0 |
| 75% | 2.0 | 14.0 |
| max | 5.0 | 19.0 |


In [69]:
# estimate of the covariance matrix:
data_por.cov()

| | Dalc | G3 |
|---|---|---|
| Dalc | 0.855319 | -0.611665 |
| G3 | -0.611665 | 10.43714 |


In [70]:
#Correlation matrix:
data_por.corr()

| | Dalc | G3 |
|---|---|---|
| Dalc | 1.0 | -0.204719 |
| G3 | -0.204719 | 1.0 |
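
The correlation matrix is tied to the covariance matrix by the same relation used earlier for known distributions: correlation equals covariance divided by the product of the standard deviations. The sketch below checks this with the rounded numbers printed above rather than by reloading the CSV; the variable names are introduced here for illustration.

```python
# Values copied from the std and cov outputs above (rounded as printed).
cov_dalc_g3 = -0.611665
std_dalc, std_g3 = 0.924834, 3.230656

# Correlation = covariance / (std of Dalc * std of G3)
r = cov_dalc_g3 / (std_dalc * std_g3)
print(r)
```

Up to rounding of the inputs, `r` should agree with the off-diagonal entry of `data_por.corr()`.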
