<h1>Bivariate normal distribution</h1>

<h2>CIE4604: Simulation and Visualisation</h2>
 
Created:  November 2014 by Hans van der Marel<br>
Updated:  November 2019 by Ullas Rajvanshi (converted the script to Python)<br>
<br>
Script that illustrates several things you can do with a simulated bivariate normal distribution. See the lecture notes for a more detailed explanation.

In [None]:
# importing all the libraries
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2, ncx2
import math

### 1. Generate 10000 normally distributed measurements x and y
First generate independent, normally distributed measurements x and y. Note that we could also have used a two-dimensional array |xy = np.random.randn(2, n)| to simulate these in one go.
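
The one-go alternative can be sketched as follows. The example draws both coordinates as a single 2-by-n array and unpacks its rows. It uses NumPy's |default_rng| generator with an arbitrary seed, which is an assumption made here for reproducibility and is not part of the original script.

```python
import numpy as np

n = 10000
rng = np.random.default_rng(seed=1)  # seed chosen arbitrarily (assumption)

# One 2-by-n array: row 0 holds the x values, row 1 the y values
xy = rng.standard_normal((2, n))
x, y = xy  # unpack the two rows

print(xy.shape, x.shape, y.shape)
```

Note that |x| and |y| are one-dimensional here, whereas the cell below creates n-by-1 column arrays.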

In [None]:
n = 10000
x = np.random.randn(n, 1)
y = np.random.randn(n, 1)

Next we make a scatter plot of x and y. The plotting functions return handles (for example, |plt.scatter| returns a collection object) that can be used to make changes to the plot later, e.g. to choose a different color. Note that for a scatter plot the axes should have equal scaling.

In [None]:
plt.figure(figsize=(10, 10))
plt.scatter(x, y, marker='d', s=5, facecolors='b', edgecolors='y')
plt.axis('equal')
plt.xlabel('x')
plt.ylabel('y')
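
To illustrate the use of a plot handle, here is a minimal sketch using the object-oriented matplotlib interface. The variable names |fig|, |ax| and |sc| and the seed are assumptions introduced for this example only.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=1)  # arbitrary seed (assumption)
n = 10000
x = rng.standard_normal(n)
y = rng.standard_normal(n)

fig, ax = plt.subplots(figsize=(10, 10))
sc = ax.scatter(x, y, marker='d', s=5)  # sc is the handle of the scatter plot
ax.set_aspect('equal')                  # same scale on both axes
ax.set_xlabel('x')
ax.set_ylabel('y')

sc.set_color('r')  # change the color afterwards through the handle
```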

### 2. Calculate the probability that a measurement x,y falls outside a circle of radius 2
First we count the number of samples |N| with a distance larger than 2 from the origin. We use |np.sum| in combination with a logical test to do the counting: the comparison returns |True| for every sample that passes the test, and |True| counts as |1| in the sum. For the probability we have to divide by the number of measurements |n|.

In [None]:
r = 2
N = np.sum(x ** 2 + y ** 2 > r ** 2)  # counting number of samples with a distance larger than 2 from the origin
print("Number of samples outside the circle: ", N)
print("Estimated probability of a sample outside the circle: ", N / n)
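
The counting trick can be seen on a tiny example. The four squared distances below are made up for illustration.

```python
import numpy as np

d2 = np.array([0.5, 4.1, 9.0, 3.9])  # made-up squared distances
outside = d2 > 2 ** 2                 # boolean array: [False, True, True, False]

count = np.sum(outside)      # True counts as 1, so this is the number of hits
fraction = np.mean(outside)  # same as count / len(d2)
print(count, fraction)
```

Taking the mean of the boolean array gives the estimated probability in one step.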

To calculate the probability for different sample sizes we use the numpy function |cumsum| to compute the cumulative sum (in combination with the test). The result of |cumsum| is an array with the cumulative sum. To obtain the probability each element has to be divided by the number of samples included in that partial sum, which is |1, 2, ..., n|.

In [None]:
Ncs = np.cumsum(x ** 2 + y ** 2 > r ** 2)
ss = np.arange(1, n + 1)  # number of samples in each partial sum (starting at 0 would divide by zero)
plt.figure(figsize=(12, 5))
plt.plot(ss, Ncs / ss)
plt.xlabel('# of samples')
plt.ylabel(r'Probability $x^2+y^2>r^2$')
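
As a check on the running estimate, the sketch below (with freshly simulated data and an arbitrary seed, both assumptions of this example) verifies that the last element of the running probability equals the overall estimate.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed (assumption)
n = 10000
r = 2
d2 = rng.standard_normal(n) ** 2 + rng.standard_normal(n) ** 2

outside = d2 > r ** 2
counts = np.arange(1, n + 1)             # samples used so far: 1, 2, ..., n
running_p = np.cumsum(outside) / counts  # running estimate of the probability

# The last running value uses all n samples
print(running_p[-1], outside.mean())
```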

### 3. Normalized histogram of the distance squared
First we compute a histogram of the squared distance from the origin using the NumPy function |np.histogram|. We use a bin size of 0.1 and make a histogram in the range from 0 to 15. The return value |N| contains the counts and |X| the bin edges, so |X| has one more element than |N|. Python is case sensitive, so |X| and |x| are not the same. Type |help(np.histogram)| for more information.

In [None]:
binsize = 0.1
x = np.random.randn(n, 1)
y = np.random.randn(n, 1)
[N, X] = np.histogram(x ** 2 + y ** 2, np.arange(0, 15, binsize))

Then we use |plot| to plot the *normalized* histogram, using a correction for the number of measurements and the bin size. The normalized histogram should be a realization of a central Chi-square distribution with 2 degrees of freedom. For comparison we add the pdf of the Chi-square distribution to the plot in a different color. We use the |scipy.stats| function |chi2.pdf| to compute the pdf for values of |X|. See |help(chi2.pdf)|.

In [None]:
plt.figure(figsize=(12, 5))
plt.plot(X[:-1] + binsize / 2, N / (n * binsize))  # X holds the bin edges; plot at the bin centers
# for -1 read this https://stackoverflow.com/questions/18065951/why-does-numpy-histogram-python-leave-off-one-element
# -as-compared-to-hist-in-m
plt.plot(X, chi2.pdf(X, 2), color='r')
plt.xlabel(r'$Radius^2$')
plt.ylabel('PDF')
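
The normalization can be checked numerically. The sketch below uses its own simulated data, an arbitrary seed, and |np.linspace| for the bin edges, all of which are choices made for this example. It computes bin centers from the edges and confirms that the normalized histogram integrates to the fraction of samples that fall inside the histogram range.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed (assumption)
n = 10000
binsize = 0.1
d2 = rng.standard_normal(n) ** 2 + rng.standard_normal(n) ** 2

edges = np.linspace(0, 15, 151)           # 151 edges -> 150 bins of width 0.1
counts, edges = np.histogram(d2, edges)
centers = 0.5 * (edges[:-1] + edges[1:])  # one center per bin
pdf_est = counts / (n * binsize)          # normalized histogram

area = np.sum(pdf_est * binsize)          # fraction of samples inside [0, 15]
print(len(edges), len(counts), area)
```

The area is slightly below 1 because a few samples have a squared distance larger than 15.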

### 4. Cumulative distribution function of the distance squared
Instead of a normalized histogram and pdf, we can also plot the empirical cumulative histogram and cumulative distribution function (cdf).

In [None]:
cc = np.cumsum(N / n)  # cumulative fraction of samples up to each upper bin edge
c = np.concatenate(([0], cc))  # prepend 0 so that c lines up with the bin edges in X
plt.figure(figsize=(12, 5))
plt.plot(X, c)
plt.plot(X, chi2.cdf(X, 2), color='r')
plt.xlabel(r'$Radius^2$')
plt.ylabel('CDF')

From the plot we can read off the probability P(r^2 < 4) (value on the y axis) for a squared distance of 4 (value on the x axis). This should be approximately equal to one minus the probability found in section 2.
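
The value at a squared distance of 4 can also be computed rather than read from the plot. A minimal sketch, assuming linear interpolation of the empirical cumulative histogram with |np.interp| and freshly simulated data with an arbitrary seed:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(seed=1)  # arbitrary seed (assumption)
n = 10000
d2 = rng.standard_normal(n) ** 2 + rng.standard_normal(n) ** 2

edges = np.linspace(0, 15, 151)
counts, edges = np.histogram(d2, edges)
ecdf = np.concatenate(([0.0], np.cumsum(counts) / n))  # value at each bin edge

p_inside = np.interp(4.0, edges, ecdf)  # empirical P(r^2 < 4)
print(p_inside, chi2.cdf(4.0, 2))       # compare with the theoretical cdf
```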

### 5. True probability using the SciPy cdf function
To calculate the true probability that a sample falls outside the circle with radius 2 we use the |cdf| function of the proper distribution, here |chi2.cdf| with 2 degrees of freedom.

In [None]:
prob = 1 - chi2.cdf(r ** 2, 2)
print('Real probability of sample outside the circle: ', prob)
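
For 2 degrees of freedom the Chi-square distribution is an exponential distribution with mean 2, so the probability has the closed form exp(-r^2/2). The sketch below compares this with SciPy, using the survival function |chi2.sf| as an alternative to one minus the cdf.

```python
import math
from scipy.stats import chi2

r = 2
p_sf = chi2.sf(r ** 2, 2)         # survival function: P(distance^2 > r^2)
p_cdf = 1 - chi2.cdf(r ** 2, 2)   # same quantity via the cdf
p_closed = math.exp(-r ** 2 / 2)  # closed form for 2 degrees of freedom

print(p_sf, p_cdf, p_closed)
```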

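### 6. Radius of the circle containing 95% of the measurements
To compute the radius from a probability we need the inverse cumulative distribution function, which SciPy calls |ppf| (percent point function). It returns the squared radius, so we take the square root.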

In [None]:
R = math.sqrt(chi2.ppf(0.95, 2))
print('Radius of circle containing 95% of the measurements: ', R)
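
As a cross-check, for 2 degrees of freedom the inverse cdf has the closed form R^2 = -2 ln(1 - p). The sketch below compares both routes and feeds the radius back into the cdf.

```python
import math
from scipy.stats import chi2

p = 0.95
R = math.sqrt(chi2.ppf(p, 2))               # radius from the inverse cdf
R_closed = math.sqrt(-2 * math.log(1 - p))  # closed form for 2 degrees of freedom

print(R, R_closed, chi2.cdf(R ** 2, 2))     # the last value should reproduce p
```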

### 7. Repeat the simulation 100 times, make histogram and compute standard deviation
We repeat the previous experiment 100 times. The empirical probabilities are not all the same; they have a distribution of their own. We make a histogram of the computed probabilities and compute the standard deviation of the probability.

In [None]:
x = np.random.randn(n, 100)
y = np.random.randn(n, 100)
P = np.sum((x ** 2 + y ** 2 > r ** 2) / n, axis=0)
binsize = 0.001
plt.figure(figsize=(12, 5))
plt.hist(P, np.arange(0.12, 0.15, binsize), edgecolor='black', linewidth=0.5)
print('Standard deviation of the probability: ', P.std())

### 8. Compare standard deviation with theory
We can also compute the standard deviation of the probability using a formula given in Teunissen et al. (2009), evaluated with the true probability from section 5.

In [None]:
standard_deviation = math.sqrt(n * prob * (1 - prob)) / n  # prob is the true probability from section 5
print('"Real" standard deviation: ', standard_deviation)
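
The expression in the cell above is the standard deviation of a binomial count (n independent samples, each outside the circle with probability p) divided by n, which simplifies to sqrt(p(1 - p)/n). The following self-contained sketch, with an arbitrary seed as an assumption, compares it with a simulation of 100 repetitions.

```python
import math
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(seed=1)  # arbitrary seed (assumption)
n, repeats, r = 10000, 100, 2

p_true = chi2.sf(r ** 2, 2)  # true probability of a sample outside the circle
x = rng.standard_normal((n, repeats))
y = rng.standard_normal((n, repeats))
P = np.mean(x ** 2 + y ** 2 > r ** 2, axis=0)  # one estimate per repetition

sigma_theory = math.sqrt(p_true * (1 - p_true) / n)
sigma_sim = P.std()
print(sigma_theory, sigma_sim)
```

With only 100 repetitions the simulated value itself is uncertain by several percent, so the two numbers agree only approximately.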

### 9. Transform the variables
Compute new variables |x_n| and |y_n| from |x| and |y| using a linear transformation. We use the first column of |x| and |y|, i.e. the first of the 100 repetitions.

In [None]:
x_n = x[:, 0]
y_n = y[:, 0] + 2 * x[:, 0]

According to the covariance propagation law the formal (theoretical) covariance matrix is

In [None]:
A = np.array([[1, 0], [2, 1]])
Qn_t = A @ np.identity(2) @ A.T  # matrix product; * would multiply element by element
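
Covariance propagation needs the matrix product, which in NumPy is the |@| operator; |*| multiplies element by element. The sketch below shows both for the transformation matrix used here. The names |Qn_matrix| and |Qn_elementwise| are introduced for this example only.

```python
import numpy as np

A = np.array([[1, 0], [2, 1]])
Q = np.identity(2)  # covariance matrix of the original x and y

Qn_matrix = A @ Q @ A.T       # matrix product: [[1, 2], [2, 5]]
Qn_elementwise = A * Q * A.T  # elementwise product: the identity matrix

print(Qn_matrix)
print(Qn_elementwise)
```

Only the matrix product gives the variance of 5 for |y_n| and the covariance of 2 between |x_n| and |y_n|.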

We can also compute the empirical covariance matrix using the NumPy function |np.cov|.

In [None]:
Qn_e = np.cov(np.array([x_n, y_n]))
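
The empirical and the formal covariance matrix can be compared directly. A minimal sketch with freshly simulated data and an arbitrary seed (both assumptions of this example):

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed (assumption)
n = 10000
x = rng.standard_normal(n)
y = rng.standard_normal(n)

x_n = x
y_n = y + 2 * x

Qn_e = np.cov(np.array([x_n, y_n]))        # each row is one variable
Qn_t = np.array([[1.0, 2.0], [2.0, 5.0]])  # A Q A^T for A = [[1, 0], [2, 1]], Q = I

print(Qn_e - Qn_t)  # differences shrink as n grows
```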

### 10. Scatter plot of the transformed variables
We make a scatter plot of the transformed variable of the previous section, and overlay it with a second transformed variable.

In [None]:
plt.figure(figsize=(6, 6))
plt.scatter(x_n, y_n)
plt.scatter(x[:, 0], y[:, 0] - 2 * x[:, 0], color='r')
plt.xlabel('x')
plt.ylabel('y')
plt.axis('equal')

### 11. Normalized histogram of distance squared for the transformed variables
We repeat section 3. The results will now be different because the distribution has changed. The pdf of the Chi-square distribution will no longer fit the data.

In [None]:
N = np.sum(x_n ** 2 + y_n ** 2 > r ** 2)  # counting number of samples with a distance larger than 2 from the origin
print("Number of samples outside the circle: ", N)
print("Estimated probability of a sample outside the circle: ", N / n)
binsize = 0.1
[N, X] = np.histogram(x_n ** 2 + y_n ** 2, np.arange(0, 25, binsize))
plt.figure(figsize=(12, 5))
plt.plot(X[:-1] + binsize / 2, N / (n * binsize))  # X holds the bin edges; plot at the bin centers
plt.plot(X, chi2.pdf(X, 2), color='r')
plt.xlabel(r'$Radius^2$')
plt.ylabel('PDF')
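
One quick way to see the mismatch is through the mean. A Chi-square distribution with 2 degrees of freedom has mean 2, whereas for zero-mean variables the expected squared distance equals the trace of the covariance matrix, here 1 + 5 = 6. The sketch below checks this with freshly simulated data (arbitrary seed, an assumption of this example).

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed (assumption)
n = 10000
x = rng.standard_normal(n)
y = rng.standard_normal(n)

x_n = x
y_n = y + 2 * x
d2 = x_n ** 2 + y_n ** 2  # squared distance of the transformed variables

A = np.array([[1, 0], [2, 1]])
Qn_t = A @ A.T            # formal covariance matrix

print(d2.mean(), np.trace(Qn_t))
```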

### 12. Add a bias to the x and y measurements, and repeat distance calculation
We add a bias to the data and repeat section 3. The probability density function of the distance squared is now a non-central Chi-square distribution, with non-centrality parameter 3^2+1^2=10.

In [None]:
x_b = x[:, 0] + 3
y_b = y[:, 0] + 1
[N, X] = np.histogram(x_b ** 2 + y_b ** 2, np.arange(0, 25, binsize))
plt.figure(figsize=(12, 5))
plt.plot(X[:-1] + binsize / 2, N / (n * binsize), label='Measured')  # plot at the bin centers
plt.plot(X, chi2.pdf(X, 2), label=r'$\chi^2 (2, 0)$', color='r')
plt.plot(X, ncx2.pdf(X, 2, 10), label=r'$\chi^2 (2, 10)$', color='k')
plt.xlabel(r'$Radius^2$')
plt.ylabel('PDF')
plt.legend()
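
The non-centrality parameter can be checked through the mean: a non-central Chi-square distribution with k degrees of freedom and non-centrality parameter λ has mean k + λ, here 2 + 10 = 12. A minimal sketch with freshly simulated data and an arbitrary seed (assumptions of this example):

```python
import numpy as np
from scipy.stats import ncx2

rng = np.random.default_rng(seed=1)  # arbitrary seed (assumption)
n = 10000
x_b = rng.standard_normal(n) + 3  # bias of 3 in x
y_b = rng.standard_normal(n) + 1  # bias of 1 in y
d2 = x_b ** 2 + y_b ** 2

nc = 3 ** 2 + 1 ** 2              # non-centrality parameter
mean_theory = ncx2.mean(2, nc)    # mean of the distribution: 2 + nc

print(d2.mean(), mean_theory)
```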

### 13. Radius of the circle containing 95% of the measurements
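The squared radius can again be computed from the inverse cumulative distribution function, using the proper distribution (non-central Chi-square distribution with two degrees of freedom and a non-centrality parameter of 10). The radius is its square root. Note that this circle is centered at the origin, not at the mean of the biased measurements.

In [None]:
R_b = math.sqrt(ncx2.ppf(0.95, 2, 10))  # ppf returns the squared radius
print('Radius of circle containing 95% of the biased measurements: ', R_b)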
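
The result can be verified by simulation. The sketch below, with freshly simulated biased data and an arbitrary seed (assumptions of this example), counts the fraction of samples inside the circle around the origin. The name |R_check| is introduced here only.

```python
import math
import numpy as np
from scipy.stats import ncx2

rng = np.random.default_rng(seed=1)  # arbitrary seed (assumption)
n = 10000
x_b = rng.standard_normal(n) + 3
y_b = rng.standard_normal(n) + 1

R_check = math.sqrt(ncx2.ppf(0.95, 2, 10))        # radius, not squared radius
inside = np.mean(x_b ** 2 + y_b ** 2 <= R_check ** 2)  # fraction inside the circle

print(R_check, inside)  # the fraction should be close to 0.95
```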