In [1]:
%pylab inline
import numpy as np

Populating the interactive namespace from numpy and matplotlib


## Compute statistics for a known distribution

Requested additions:
1. Do the same calcs for a sample.
2. Show how to do the calcs without .dot and outer, just standard python (much slower, but ok).
3. Add text to explain each step

### Defining a joint distribution over two discrete random variables
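The probabilities are organized in a 2D array, where the columns correspond to values of $x$ and the rows correspond to values of $y$. Row index `j` and column index `i` therefore give `P[j,i]`, the probability that $Y$ takes its `j`-th value and $X$ takes its `i`-th value.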

In [2]:
# We start with positive weights that don't sum to 1
P=np.array([[1.,1,2],[1,2,1]])
P2=copy(P)
P

array([[1., 1., 2.],
       [1., 2., 1.]])

In [4]:
# We then normalize the weights
# using Pure Python

#Compute the sum
s=0
for i in range(shape(P)[0]):
    for j in range(shape(P)[1]):
        s+=P[i,j]
print('the sum is ',s)
#divide by the sum
for i in range(shape(P)[0]):
    for j in range(shape(P)[1]):
        P[i,j] /= s
P

the sum is  8.0


array([[0.125, 0.125, 0.25 ],
       [0.125, 0.25 , 0.125]])

In [5]:
# Using Numpy we can write it in a much shorter way
P2/=sum(P2)
P2

array([[0.125, 0.125, 0.25 ],
       [0.125, 0.25 , 0.125]])

In [6]:
# The values that the random variables X and Y take
x=np.array([1,2,3])
y=np.array([-1,1])

#### Computing Marginals
The marginal distributions are the probabilities associated with each random variable alone.

In [7]:
# The python way
Px=[0]*shape(P)[1]
Py=[0]*shape(P)[0]
for i in range(shape(P)[0]):
    for j in range(shape(P)[1]):
        # P[i,j] contributes to the marginal of x (column j) and of y (row i)
        Px[j]+=P[i,j]
        Py[i]+=P[i,j]

In [8]:
#the numpy way:
Px=sum(P,axis=0)
Py=sum(P,axis=1)
Px,Py

(array([0.25 , 0.375, 0.375]), array([0.5, 0.5]))
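
As a quick sanity check on the marginals above, each one should itself sum to 1. The following is a minimal, self-contained sketch that rebuilds the same table `P` and verifies this; it is an illustration, not part of the original notebook cells.

```python
import numpy as np

# Same joint distribution as above: rows index y, columns index x.
P = np.array([[1., 1, 2], [1, 2, 1]])
P /= P.sum()

Px = P.sum(axis=0)  # sum over rows (values of y) -> marginal of X
Py = P.sum(axis=1)  # sum over columns (values of x) -> marginal of Y

# Each marginal is itself a probability distribution.
assert np.isclose(Px.sum(), 1.0)
assert np.isclose(Py.sum(), 1.0)
print(Px, Py)
```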

### Check whether $x$ and $y$ are independent

If $X$ and $Y$ are independent, then $P(x,y) = P(x)P(y)$ for all values $x$ and $y$. We can check this as follows:

In [10]:
#The python way
indep = 1
for i in range(3):
    for j in range(2):
        Pxy = P[j,i]
        if Pxy != Px[i]*Py[j]:
            indep = 0
if indep:
    print("X and Y are independent")
else:
    print("X and Y are not independent")

X and Y are not independent


In [14]:
# In numpy we can replace the nested loop above with the outer product of the two vectors Px and Py.
# np.outer(Px,Py) is a matrix whose (i,j)th element is the product Px[i]*Py[j];
# we transpose it so that its layout (rows: y, columns: x) matches P.

outerprod_diff = np.outer(Px,Py).T - P
print(outerprod_diff)

if np.all(outerprod_diff==0):
    print("X and Y are independent")
else:
    print("X and Y are not independent")
          

[[ 0.      0.0625 -0.0625]
 [ 0.     -0.0625  0.0625]]
X and Y are not independent
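
Exact equality tests on floating-point numbers can fail because of rounding, even when the distribution really does factorize. A possible tolerance-based variant of the check is sketched below. The helper name `is_independent`, the tolerance `atol`, and the table `Q` are assumptions introduced for illustration only.

```python
import numpy as np

def is_independent(P, atol=1e-12):
    """Return True if the joint table P (rows: y, columns: x) factorizes
    into the product of its marginals, up to a small tolerance.
    Hypothetical helper, not part of the original notebook."""
    P = np.asarray(P, dtype=float)
    Px = P.sum(axis=0)
    Py = P.sum(axis=1)
    # np.outer(Py, Px)[j, i] == Py[j] * Px[i], which matches P's layout.
    return bool(np.allclose(P, np.outer(Py, Px), atol=atol))

P = np.array([[1., 1, 2], [1, 2, 1]]) / 8     # the table used above
Q = np.outer([0.3, 0.7], [0.2, 0.3, 0.5])     # independent by construction

print(is_independent(P), is_independent(Q))
```

Using `np.outer(Py, Px)` directly avoids the transpose, because the result already has one row per value of $y$.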


### Calculating the mean and standard deviation
To calculate the mean of $X$ and $Y$ under this distribution in Python, we iterate through the values of $x$ and $y$ and plug them into the formula $E[X] = \sum_x P(X=x)x$. The standard deviation is computed similarly, as the square root of the variance $\sum_x P(X=x)(x - E[X])^2$.


In [18]:
from math import sqrt

Ex = 0
for i in range(3):
    Ex+=Px[i]*x[i]
Ey = 0
for i in range(2):
    Ey+=Py[i]*y[i]

varx = 0
for i in range(3):
    varx+=Px[i]*(x[i] - Ex)**2
stdx = sqrt(varx)

vary = 0
for i in range(2):
    vary+=Py[i]*(y[i] - Ey)**2
stdy = sqrt(vary)

Ex,Ey,stdx,stdy

(2.125, 0.0, 0.7806247497997998, 1.0)

In [19]:
# In numpy you can use np.dot(A,B), which for 1-D arrays A and B multiplies corresponding elements and sums the results

Ex=np.dot(Px,x)
Ey=np.dot(Py,y)
Ex2=np.dot(Px,x**2)
Ey2=np.dot(Py,y**2)
stdx=sqrt(Ex2-Ex**2)
stdy=sqrt(Ey2-Ey**2)
Ex,Ey,stdx,stdy


(2.125, 0.0, 0.7806247497997998, 1.0)
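
The loop version computes the variance as $E[(X - E[X])^2]$, while the numpy version uses the equivalent shortcut $E[X^2] - E[X]^2$. The sketch below checks that the two expressions agree for the marginal of $X$; the variable names `var_centered` and `var_shortcut` are illustrative and do not appear in the cells above.

```python
import numpy as np

x = np.array([1, 2, 3])
Px = np.array([0.25, 0.375, 0.375])  # marginal of X computed earlier

Ex = np.dot(Px, x)
var_centered = np.dot(Px, (x - Ex) ** 2)     # E[(X - E[X])^2]
var_shortcut = np.dot(Px, x ** 2) - Ex ** 2  # E[X^2] - E[X]^2

print(var_centered, var_shortcut, np.sqrt(var_centered))
```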

## Subtracting the mean 

We can create two new random variables $NX = X - E[X]$ and $NY = Y - E[Y]$ by subtracting the means from our original random variables.
These variables have the same standard deviations as $X$ and $Y$, but their expected values are now 0.

In [20]:
nx=x-Ex
nx

array([-1.125, -0.125,  0.875])

In [21]:
ny=y-Ey
ny

array([-1.,  1.])
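
The claim that centering leaves the standard deviation unchanged and moves the mean to 0 can be verified numerically. This is a small sketch for $X$ only; `mean_std` is a hypothetical helper introduced here, not a function from the notebook.

```python
import numpy as np

x = np.array([1, 2, 3])
Px = np.array([0.25, 0.375, 0.375])  # marginal of X computed earlier

def mean_std(values, probs):
    """Mean and standard deviation of a discrete random variable."""
    m = np.dot(probs, values)
    return m, np.sqrt(np.dot(probs, (values - m) ** 2))

Ex, stdx = mean_std(x, Px)
Enx, stdnx = mean_std(x - Ex, Px)  # centered variable NX

print(Enx, stdnx, stdx)
```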

### Calculate the covariance


In [24]:
s=0
for i in range(len(x)):
    for j in range(len(y)):
        s+=P[j,i]*nx[i]*ny[j]
print(f"Covariance = {s} ") # nx and ny already have mean 0, so there is nothing left to subtract

Covariance = -0.125 


In [25]:
# numpy
print(f"the covariance is  {np.dot(P.flatten(), np.outer(ny,nx).flatten())}")

the covariance is  -0.125
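
The covariance here is $\sum_{x,y} P(x,y)\,(x - E[X])(y - E[Y])$. The flatten-and-dot expression is one way to evaluate that double sum; two other equivalent forms are sketched below under the same table layout (rows: $y$, columns: $x$). The names `cov_flat`, `cov_sum`, and `cov_bilinear` are illustrative.

```python
import numpy as np

P = np.array([[1., 1, 2], [1, 2, 1]]) / 8   # rows: y, columns: x
x = np.array([1, 2, 3])
y = np.array([-1, 1])

# Centered values, using the marginals to compute the means.
nx = x - P.sum(axis=0) @ x
ny = y - P.sum(axis=1) @ y

cov_flat = np.dot(P.flatten(), np.outer(ny, nx).flatten())  # as in the cell above
cov_sum = np.sum(P * np.outer(ny, nx))                      # elementwise product, then sum
cov_bilinear = ny @ P @ nx                                  # vector-matrix-vector product

print(cov_flat, cov_sum, cov_bilinear)
```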


## Correlation


In [27]:
print('Correlation is', np.dot(P.flatten(), np.outer(ny,nx).flatten())/(stdx*stdy))

Correlation is -0.16012815380508713


## Empirical statistics

If we now draw samples from these distributions, we can see that the empirical statistics (the mean, standard deviation, and covariance computed from the samples) approach the true mean, standard deviation, and covariance of the distribution as the number of samples grows.

In [39]:
# np.random.choice()

In [40]:
nx

array([-1.125, -0.125,  0.875])

In [35]:
numsamples = [2,10,100,100000]

for num in numsamples: 
    print (f"Population mean after drawing {num} samples = {np.mean(np.random.choice(nx, num, True, Px))}")

Population mean after drawing 2 samples = 0.375
Population mean after drawing 10 samples = -0.025
Population mean after drawing 100 samples = -0.035
Population mean after drawing 100000 samples = -0.00066


In [36]:
for num in numsamples: 
    print(f"Population std dev after drawing {num} samples = {np.std(np.random.choice(nx, num, True, Px))}")

Population std dev after drawing 2 samples = 0.0
Population std dev after drawing 10 samples = 0.7483314773547882
Population std dev after drawing 100 samples = 0.7522632517942108
Population std dev after drawing 100000 samples = 0.7807668886421862


In [37]:
# To calculate the covariance, we generate samples (x,y) from the joint distribution.
# nxy lists all possible (nx, ny) pairs; P.T.flatten() lists their probabilities in the same order.
nxy =  np.array([(i,j) for i in nx for j in ny])
for num in numsamples:
    samples = np.random.choice(nxy.shape[0], num, True, P.T.flatten()) # choose rows
    print(f"Population covariance after drawing {num} samples = {np.cov(nxy[samples][:,0],nxy[samples][:,1])[0,1]}")

Population covariance after drawing 2 samples = 0.0
Population covariance after drawing 10 samples = 0.4
Population covariance after drawing 100 samples = -0.041616161616161655
Population covariance after drawing 100000 samples = -0.1288968973689736
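
The runs above are unseeded, so the printed numbers change on every execution. A reproducible variant can be sketched with NumPy's `Generator` API, drawing $x$ and $y$ together from the joint table. The seed `0`, the sample size `n`, and the uncentered values `x`, `y` are choices made for this illustration, not taken from the cells above.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily, for reproducibility

P = np.array([[1., 1, 2], [1, 2, 1]]) / 8   # rows: y, columns: x
x = np.array([1, 2, 3])
y = np.array([-1, 1])

n = 200_000
# Draw flat cell indices with probability P[j, i], then recover (row, column).
flat = rng.choice(P.size, size=n, p=P.ravel())
j, i = np.unravel_index(flat, P.shape)
xs, ys = x[i], y[j]

print("mean of x:", xs.mean())              # true value is 2.125
print("std of x:", xs.std())                # true value is about 0.7806
print("covariance:", np.cov(xs, ys)[0, 1])  # true value is -0.125
```

Note that `np.std` divides by $n$ by default, while `np.cov` divides by $n-1$; for large samples the difference is negligible.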
