# Workshop 7, October 6, 2023

**Due by 9pm, October 10, 2023**


# Problem 1 Cumulative Distribution Function (CDF)


A Cumulative Distribution Function (CDF) is a probability distribution function that describes the probability that a random variable takes on a value less than or equal to a specific value. It is often used in probability and statistics to characterize the distribution of a random variable.

The CDF of a random variable X, denoted as F(x), is defined as:

$\large F(x) = P(X \leq x)$

Here, $F(x)$ represents the CDF of the random variable $X$, and $P(X \leq x)$ represents the probability that $X$ is less than or equal to $x$. The CDF provides a way to understand the cumulative probabilities associated with the values of $X$, which can be helpful in various statistical analyses and applications.
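As a small illustration of the definition, here is a sketch for a discrete toy case that is not part of the workshop data: a fair six-sided die. The names `outcomes`, `pmf`, and `cdf` are chosen for this example only. Because $F(x)$ accumulates probability from the left, it is the running sum of the individual probabilities.

```python
import numpy as np

# Toy example (not the workshop data): outcomes of a fair six-sided die
outcomes = np.arange(1, 7)
pmf = np.full(6, 1 / 6)      # P(X = k) for k = 1..6

# F(k) = P(X <= k) is the running sum of the probabilities
cdf = np.cumsum(pmf)

for k, F in zip(outcomes, cdf):
    print(f"F({k}) = {F:.3f}")
```

The resulting values never decrease and end at 1, which is true of any CDF.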


In [None]:
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# Fix random seed so that the data set is reproducible
np.random.seed(2023)
data = np.random.poisson(100,10000) + np.random.normal(125,5,10000)

Plot the distribution of `data` as a histogram. Determine the mean and standard deviation (std. dev.) of `data`, and set the range of the histogram to mean ± 5 × std. dev. In the `hist()` function, turn the `density=True` option on and off and see how the y-axis values change. In addition, with the density option on, change the `bins` parameter of `hist()` and see how the y-axis values change. Search for what this option does in `matplotlib.pyplot` and summarize your understanding in a markdown cell.
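One way to probe the option numerically, offered as a sketch rather than the expected answer, is to use `np.histogram`, which accepts the same `density` argument. The sample `toy` and the bin settings below are assumptions made for illustration and are not the workshop `data`.

```python
import numpy as np

rng = np.random.default_rng(0)
toy = rng.normal(0.0, 1.0, 5000)     # hypothetical toy sample, not `data`

for nbins in (10, 50):
    counts, edges = np.histogram(toy, bins=nbins, range=(-5, 5))
    dens, _ = np.histogram(toy, bins=nbins, range=(-5, 5), density=True)
    widths = np.diff(edges)
    # Compare the raw total with the sum of (bin height * bin width)
    print(nbins, counts.sum(), np.sum(dens * widths))
```

Comparing the two printed totals for each bin count is a useful starting point for the written summary.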

In [None]:
# Your plotting code


In [None]:
# Turn this into a markdown and explain what density = True does

## Create the CDF for this data distribution

There are different ways of doing this. You can sort the original dataset in ascending order and then count the number of entries at or below each of a series of values of $x$. Alternatively, you can create a histogram with density turned on, read out the bin contents (the y-axis value of each bin), multiply each by the bin width, and calculate the cumulative sum from the lower end of the range to the upper end. Either way, it takes some research to work out how these approaches can be implemented technically.
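The two approaches can be sketched on a toy sample as follows. This is a minimal sketch under stated assumptions: the exponential sample `toy`, the grid `x_grid`, and the choice of `np.searchsorted` for the counting step are illustrative choices, not the required solution.

```python
import numpy as np

rng = np.random.default_rng(1)
toy = rng.exponential(2.0, 20000)       # toy sample, not the workshop `data`
x_grid = np.linspace(0.0, 10.0, 51)     # values of x at which to evaluate F(x)

# Approach 1: sort, then count the entries <= x with a binary search
toy_sorted = np.sort(toy)
cdf_sorted = np.searchsorted(toy_sorted, x_grid, side="right") / toy.size

# Approach 2: density histogram, then cumulative sum of (height * bin width)
dens, edges = np.histogram(toy, bins=50, range=(0.0, 10.0), density=True)
cdf_hist = np.cumsum(dens * np.diff(edges))   # F at the upper edge of each bin

print(cdf_sorted[-1], cdf_hist[-1])
```

Note that the histogram version is normalized to the entries inside the chosen range, so it always ends at exactly 1, while the sorted version ends slightly below 1 if some entries lie beyond the last grid point.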

In [None]:
# Your code here

# Problem 2 Verify the Central Limit Theorem

The CLT states that if a variable $X_N$ is the mean of $N$ random numbers that are independently and identically drawn from a probability distribution, then $X_N$ approximately follows a Gaussian distribution when $N$ is large:

$$X_N = \frac{1}{N}\sum_{i=1}^{N} x_i$$

where $x_i$ is drawn from $f(x)$, which has a mean of $\mu$ and a std. dev. of $\sigma$.

The distribution of $X_N$ should have a mean of $\mu$ and a std. dev. of $\sigma/\sqrt{N}$.
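The predicted mean and $\sigma/\sqrt{N}$ scaling can be checked numerically. The sketch below deliberately uses an exponential distribution with scale 1 (mean 1, std. dev. 1) instead of the uniform distribution used in this exercise; the sample sizes and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 1.0, 1.0    # exponential with scale 1: mean 1, std. dev. 1

for N in (2, 10, 50):
    x = rng.exponential(1.0, (20000, N))   # 20000 experiments, N draws each
    X_N = x.mean(axis=1)                   # one X_N per experiment
    # Columns: N, observed mean, observed std. dev., predicted std. dev.
    print(N, X_N.mean(), X_N.std(), sigma / np.sqrt(N))
```

The mean and std. dev. relations hold for any $N$; what improves as $N$ grows is how closely the shape of the distribution resembles a Gaussian.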


In this exercise, we generate $x_i$ from a uniform distribution between 0 and 1, and we compare the distribution of $X_N$ with a Gaussian distribution with its mean and sigma predicted by the Central Limit Theorem.

In [None]:
# generate a sample 
np.random.seed(2023)
sample1 = np.random.uniform(0,1,(10000,2))
# this gives you a numpy array with 10000 entries in the axis 0
# and 2 entries in axis 1
# each entry in axis 0 has two elements 

Let's check if $X_2 = (x_1 + x_2)/2$ is a random number that follows a Gaussian distribution.

Answer the following questions to guide your check:
1. Based on the CLT, what would be the mean and standard deviation for this distribution?
2. How do you create a reference that is perfectly normal and has the desired mean and std. dev.? If you generate a random distribution under a PDF and the distribution has a sufficiently large number of entries, then that distribution can serve as a proxy of the PDF.
3. Can you overlay X2 distribution with your reference and check if they are compatible? 

Create a plot like this one (does this look like the one you have?) ![CLT2.png](attachment:CLT2.png)

In [None]:
# Fix the code below so that the plot looks like the one shown above

X2 = 
# plot X2 as a histogram
plt.hist(X2,bins=,range=(0,1),density=True,label="$X_2$")

# The central limit theorem tells you that
# Xn = (x1 + x2 + ... + xn)/n follows a
# Gaussian distribution with a mean of mu and a standard deviation
# of sigma/sqrt(n),
# where mu is the mean of the original random distribution followed by x_i
# and sigma is the standard deviation of that distribution

# Here we want to construct a reference histogram from the Gaussian distribution 
# What are the expected mean and sigma for Gaussian based on the CLT?

mean = 
sigma = 
reference = 

plt.hist(reference,bins=, range=,histtype='step',label="reference")
plt.legend()

### Now let's check N = 4
Let's check if $X_4 = (x_1 + x_2 + x_3 + x_4)/4$ is a random number that follows a Gaussian distribution.

Make the same plot as done in the last step.

### Repeat this check for N = 10

Make the same plot as done in the last step.


In [None]:
# Your code for N = 4 case

In [None]:
# Your code for N = 10 case

# Problem 3 Monte Carlo integration

We have the following function and we want to calculate its integral from 0 to 1.

$\large f(x) = x \cdot \sin( 3 \pi \cdot x)$

This can be achieved with the Monte Carlo method.

In [None]:
# Define this function and visualize this function
def f(x):
    return x*np.sin(3*np.pi*x)

xval = np.linspace(0,1.0,1000)
plt.plot(xval,f(xval))
plt.xlabel('x')
plt.ylabel('y = f(x)')


Let's generate points uniformly distributed between 0 and 1.0 in $x$ and between -1 and +1 in $y$, and visualize them.

In [None]:
npoints = 100000 # Once your code has been developed, you may increase this to a larger number to improve the estimate
np.random.seed(1006)
xr = np.random.uniform(0,1.0,npoints)
yr = np.random.uniform(-1,1,npoints)

# Visualization code
plt.plot(xval,f(xval))
plt.xlabel('x')
plt.ylabel('y = f(x)')
plt.scatter(xr,yr,s=0.5,c='red')

Geometrically, the integral is equivalent to calculating the area under the curve and above y = 0. Where the function value is negative, the area above the curve and below y = 0 contributes negatively to the integral.

The cell below demonstrates how you can select the points under the curve.

In [None]:
# This cell shows how to get area (points) under the curve

x_under_f = xr[yr < f(xr)]
y_under_f = yr[yr < f(xr)]

# Visualization code
plt.plot(xval,f(xval))
plt.xlabel('x')
plt.ylabel('y = f(x)')
plt.scatter(x_under_f,y_under_f,s=0.5,c='red')


In [None]:
# Your turn to develop the code to show how to get the points above the function



In [None]:
# Count points under the function where f(x) > 0 

x_positive = x_under_f[y_under_f>0]
y_positive = y_under_f[y_under_f>0]

# Counting

positive_count = x_positive.size

# Visualization code
plt.plot(xval,f(xval))
plt.xlabel('x')
plt.ylabel('y = f(x)')
plt.scatter(x_positive,y_positive,s=0.5,c='blue')



In [None]:
# your turn to count points above the function where f(x) <= 0 

Now we have all the ingredients. Let's calculate the integral step by step.

1. Calculate the difference between the `positive counts` and `negative counts`
2. Calculate the total area covered by the random points (without any selection). 
3. The integral is then $ \large A_{total}\cdot \frac{ N_{pos} - N_{neg} }{N_{total}}$
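The steps above can be sketched on a different integrand whose integral is known exactly. This is a minimal sketch, assuming a toy function $g(x) = x^2 - 0.25$ on $[0, 1]$ (exact integral $1/12$) and the same sampling box as in this problem; the names `g`, `n_pos`, `n_neg`, and `a_total` are illustrative only.

```python
import numpy as np

def g(x):
    # Toy integrand (not the workshop f): exact integral over [0, 1] is 1/12
    return x**2 - 0.25

rng = np.random.default_rng(3)
n_total = 200000
xs = rng.uniform(0.0, 1.0, n_total)
ys = rng.uniform(-1.0, 1.0, n_total)

n_pos = np.count_nonzero((ys > 0) & (ys < g(xs)))   # under the curve, above y = 0
n_neg = np.count_nonzero((ys < 0) & (ys > g(xs)))   # above the curve, below y = 0

a_total = (1.0 - 0.0) * (1.0 - (-1.0))              # area of the sampling box
estimate = a_total * (n_pos - n_neg) / n_total
print(estimate, 1 / 12)
```

The estimate fluctuates from seed to seed, and the size of the fluctuation shrinks roughly as $1/\sqrt{N_{total}}$.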

In [None]:
# Your code here

Verify the calculation with a numerical integration method built into scipy.

In [None]:
from scipy.integrate import quad
integral_value, _= quad(f, 0,1)

print(f"Integral value is: {integral_value}")

How does your result compare to the estimate with scipy?

## Calculate the area within the `Heart Shape`
The Heart Shape is defined by the following parametric equations:

$$
\begin{aligned}
x(t) &= 16 \sin^3(t) \\
y(t) &= 13 \cos(t) - 5 \cos(2t) - 2 \cos(3t) - \cos(4t)
\end{aligned}
$$

These equations describe the coordinates $(x(t), y(t))$ of points on the Heart Shape curve as a function of the parameter $t$.
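A deterministic cross-check can be useful for an area enclosed by a parametric curve, in the same way `quad` was used to check the Monte Carlo integral. The sketch below is not the Monte Carlo approach this workshop asks for: it applies the shoelace (polygon area) formula to a toy ellipse with exact area $\pi a b$, and the names `t_e`, `x_e`, and `y_e` are assumptions for illustration.

```python
import numpy as np

# Toy closed curve (not the heart): ellipse with semi-axes a and b
a, b = 3.0, 2.0
t_e = np.linspace(0.0, 2.0 * np.pi, 10001)
x_e = a * np.cos(t_e)
y_e = b * np.sin(t_e)

# Shoelace formula: A = 1/2 |sum(x_i * y_{i+1} - x_{i+1} * y_i)|
# summed over consecutive points of the closed polygon
area = 0.5 * abs(np.sum(x_e[:-1] * y_e[1:] - x_e[1:] * y_e[:-1]))
print(area, np.pi * a * b)
```

The same formula should apply to any closed curve sampled densely in $t$, provided the curve does not cross itself.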


In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Define the parametric equations
t = np.linspace(-1.0*np.pi, 1.0*np.pi, 10000)
x = 16 * np.sin(t)**3
y = 13 * np.cos(t) - 5 * np.cos(2*t) - 2 * np.cos(3*t) - np.cos(4*t)

# Plot the heart shape
plt.plot(x, y, 'r')
plt.title("Heart Shape")
plt.axis('equal')  # Ensure the aspect ratio is equal
#plt.axis('off')  # Turn off the axes

In [None]:
# your code here

# Here are some hints: 1) the shape is symmetric,
# and therefore we just need to figure out the area of one side of it (e.g., x>0).
# 2) t is defined between -pi and +pi. Which part of the curve does (0,pi/2) correspond to?
