# Workshop 6, September 28, 2023
due by 9 pm October 3, 2023

# Part 1 Visualization of Probability Density Functions using Scipy


## SciPy introduction
SciPy (pronounced "sigh-pie") is a Python-based ecosystem of open-source software for mathematics, science, and engineering. It is built on top of the NumPy library and provides a host of functionalities that extend beyond NumPy's capabilities. The name "SciPy" also refers specifically to the central library within this ecosystem, which is used for high-level computations.


The **scipy.stats** module is a significant part of the SciPy library, dedicated to statistical functions and algorithms.

- Diverse Distributions: The module provides tools to work with a wide range of probability distributions, both continuous and discrete. This includes standard distributions like normal, uniform, binomial, and Poisson, as well as many more specialized ones.

- Statistical Functions: Functions for computing statistics are available, such as mean, median, variance, and standard deviation. It also provides functions for more advanced statistical tests, like t-tests, ANOVA, chi-squared tests, and correlation coefficients.

- Random Sampling and Data Generation: Tools to generate random samples from various distributions, which can be particularly useful for simulations or generating synthetic data.

- Statistical Utilities: Functions for tasks like histogram creation, data transformation, and kernel density estimation.

**We will use the functions available in the scipy.stats module to understand the qualitative behavior of the most common PDFs/PMFs.**
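As a quick orientation, the sketch below shows the interface that `scipy.stats` distribution objects share: `pmf` or `pdf` to evaluate the distribution, `mean` and `var` for summary statistics, and `rvs` for random sampling. The parameter values are arbitrary illustrations and are not part of the assignment.

```python
from scipy.stats import binom, norm

# Discrete distributions expose pmf(); continuous ones expose pdf().
p_three = binom.pmf(3, 10, 0.5)                       # P(X = 3) for n = 10, p = 0.5
density_at_zero = norm.pdf(0.0, loc=0.0, scale=1.0)   # standard normal density at x = 0

# Summary statistics come from the same objects.
binom_mean = binom.mean(10, 0.5)
binom_var = binom.var(10, 0.5)

# rvs() draws random samples; random_state makes the draw repeatable.
samples = norm.rvs(loc=0.0, scale=1.0, size=5, random_state=0)

print("P(X = 3):", p_three)
print("Normal density at 0:", density_at_zero)
print("Binomial mean and variance:", binom_mean, binom_var)
print("Five normal samples:", samples)
```

The same pattern applies to the other distributions used in this workshop (`poisson`, `expon`), with their own parameters.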

## Binomial Distribution
The probability mass function (PMF) for the Binomial distribution is:
$ P(X=k) = \binom{n}{k} p^k (1-p)^{n-k} $
where:
- $ n $ is the number of trials.
- $ p $ is the probability of success in any given trial.
- $ k $ is the number of successes.


**Variance and Standard Deviation of the Binomial distribution**

Variance ($ \sigma^2 $):
$ \sigma^2 = n \times p \times (1-p) $

Standard Deviation ($ \sigma $):
$ \sigma = \sqrt{n \times p \times (1-p)} $

Comment: The binomial distribution models the number of successes in a fixed number of independent Bernoulli trials. It's commonly used in scenarios like coin tosses or quality control in manufacturing.
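The variance formula above can be checked numerically. The sketch below, with arbitrarily chosen values of $n$ and $p$, computes the mean and variance directly from the PMF and compares the variance with $n \times p \times (1-p)$.

```python
import numpy as np
from scipy.stats import binom

n, p = 20, 0.3  # arbitrary illustrative values
k = np.arange(0, n + 1)
pmf = binom.pmf(k, n, p)

# A PMF must sum to 1 over all possible outcomes.
total = pmf.sum()

# Mean and variance computed directly from their definitions.
mean_from_pmf = np.sum(k * pmf)
var_from_pmf = np.sum((k - mean_from_pmf) ** 2 * pmf)

print("Sum of PMF:", total)
print("Variance from PMF:", var_from_pmf)
print("n * p * (1 - p): ", n * p * (1 - p))
```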



### Practice
Read the example code cell below to see how the scipy module can be used to draw the PMF of a Binomial distribution. Create three curves with different values of $n$ and $p$ and overlay them on the same plot. In a markdown cell, summarize how the shape of the PMF changes as a function of $n$ and $p$.


In [None]:
# Import necessary libraries
from scipy.stats import binom
import matplotlib.pyplot as plt
import numpy as np

# Define the parameters for the binomial distribution
n, p = 100, 0.5

# Create a range of possible outcomes (from 0 successes to n successes)
x = np.arange(0, n+1)

# Calculate the PMF values for each outcome
pmf = binom.pmf(x, n, p)

# Plot the PMF values
plt.plot(x, pmf)
plt.xlabel('Number of Successes')
plt.ylabel('Probability')
plt.title('PMF of Binomial Distribution (n=100, p=0.5)')
plt.show()

In [None]:
# Practice

Your observation

## Poisson Distribution
The PMF for the Poisson distribution is:
$ P(X=k) = \frac{e^{-\lambda} \lambda^k}{k!} $
where:
- $ \lambda $ is the mean number of occurrences in a given interval.
- $ k $ is the actual number of occurrences.

Comment: The Poisson distribution models the number of events occurring in a fixed interval of time or space. It's used in scenarios such as modeling the number of emails received in a day or the number of phone calls at a call center in an hour.
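To connect the formula with the library call, the sketch below evaluates the PMF expression by hand and compares it with `poisson.pmf`. The value of $\lambda$ and the range of $k$ are arbitrary choices for illustration.

```python
import numpy as np
from scipy.special import factorial
from scipy.stats import poisson

lam = 3.0                 # illustrative value of lambda
k = np.arange(0, 15)

# PMF written out from the formula: exp(-lambda) * lambda^k / k!
pmf_formula = np.exp(-lam) * lam**k / factorial(k)

# The same PMF from scipy.stats
pmf_scipy = poisson.pmf(k, lam)

max_diff = np.max(np.abs(pmf_formula - pmf_scipy))
print("Largest difference between formula and scipy:", max_diff)
```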

**Do you remember the variance and standard deviation of the Poisson distribution?**

### Practice
Read the example code cell below to see how the scipy module can be used to draw the PMF of a Poisson distribution. Create three curves with different expectation values and overlay them on the same plot. In a markdown cell, summarize how the shape of the PMF changes as a function of $\lambda$.

In [None]:
from scipy.stats import poisson

lambda_val = 3
x = np.arange(0, 2*lambda_val)
pmf = poisson.pmf(x, lambda_val)
plt.bar(x, pmf)
plt.xlabel('Number of Events')
plt.ylabel('Probability')
plt.title('PMF of Poisson Distribution (lambda=3)')
plt.show()


In [None]:
# Practice

In [None]:
Your observation


## Gaussian (Normal) Distribution
The probability density function (PDF) for the Gaussian distribution is:
$ f(x|\mu,\sigma^2) = \frac{1}{\sigma \sqrt{2\pi}} e^{ - \frac{(x-\mu)^2}{2\sigma^2} } $
where:
- $ \mu $ is the mean of the distribution.
- $ \sigma $ is the standard deviation.

Comment: The Gaussian or normal distribution is ubiquitous in statistics and is used in a plethora of scenarios due to the Central Limit Theorem. Examples include modeling grades in a large class or errors in measurements.
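As a sanity check on the formula, the sketch below evaluates the Gaussian PDF by hand, compares it with `norm.pdf`, and confirms with a simple Riemann sum that the density integrates to approximately 1. The values of $\mu$ and $\sigma$ and the grid are arbitrary illustrative choices.

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 2.0, 1.5  # illustrative values
x = np.linspace(mu - 8 * sigma, mu + 8 * sigma, 4001)

# PDF written out from the formula
pdf_formula = np.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

# The same PDF from scipy.stats (loc is the mean, scale is the standard deviation)
pdf_scipy = norm.pdf(x, loc=mu, scale=sigma)
max_diff = np.max(np.abs(pdf_formula - pdf_scipy))

# A PDF is a density: the area under the curve, not the curve height, equals 1.
dx = x[1] - x[0]
area = np.sum(pdf_scipy) * dx

print("Largest difference between formula and scipy:", max_diff)
print("Approximate area under the PDF:", area)
```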

**Do you remember the variance and standard deviation of the Gaussian distribution?**


### Practice
Read the example code cell below to see how the scipy module can be used to draw the PDF of a Normal distribution. Create three curves with different values of $\mu$ and $\sigma$ and overlay them on the same plot. In a markdown cell, summarize how the shape of the PDF changes as a function of $\mu$ and $\sigma$.

In [None]:
from scipy.stats import norm

mu, sigma = 0, 1
x = np.linspace(-5, 5, 1000)
pdf = norm.pdf(x, mu, sigma)
plt.plot(x, pdf)
plt.xlabel('Value')
plt.ylabel('Probability Density')
plt.title('PDF of Normal Distribution (mu=0, sigma=1)')
plt.show()


In [None]:
# Practice 

In [None]:
Your observation

## Exponential Distribution
The PDF for the Exponential distribution is:
$ f(x|\lambda) = \lambda e^{-\lambda x} $
for $ x \geq 0 $, and $ f(x|\lambda) = 0 $ otherwise,
where:
- $ \lambda $ is the rate parameter.

Comment: The Exponential distribution is often used to model the time elapsed between events, such as the time between customer arrivals in a queue or the lifetime of electrical components.
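One detail worth knowing before using `scipy.stats.expon`: it is parameterized by `loc` and `scale` rather than by the rate, and the formula above corresponds to `scale = 1/lambda`. The sketch below, with an arbitrary $\lambda$, compares the hand-written formula with the correct call and with a common mistake (passing $\lambda$ positionally, which scipy interprets as the shift `loc`).

```python
import numpy as np
from scipy.stats import expon

lam = 0.5  # illustrative rate parameter
x = np.linspace(0, 10, 101)

# PDF written out from the formula: lambda * exp(-lambda * x)
pdf_formula = lam * np.exp(-lam * x)

# Correct scipy call: the rate enters through scale = 1 / lambda
pdf_scipy = expon.pdf(x, scale=1 / lam)

# Common mistake: the second positional argument is loc (a shift), not the rate
pdf_shifted = expon.pdf(x, lam)

print("Formula matches scale=1/lambda:", np.allclose(pdf_formula, pdf_scipy))
print("Formula matches positional lambda:", np.allclose(pdf_formula, pdf_shifted))
```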

**Do you remember the variance and standard deviation of the Exponential distribution?**


### Practice
Read the example code cell below to see how the scipy module can be used to draw the PDF of an Exponential distribution. Create three curves with different values of $\lambda$ and overlay them on the same plot. In a markdown cell, summarize how the shape of the PDF changes as a function of $\lambda$.

In [None]:
from scipy.stats import expon

lambda_val = 0.5
x = np.linspace(0, 5, 1000)
pdf = expon.pdf(x, scale=1/lambda_val)
plt.plot(x, pdf)
plt.xlabel('Value')
plt.ylabel('Probability Density')
plt.title('PDF of Exponential Distribution (lambda=0.5)')
plt.show()


In [None]:
# Practice

In [None]:
Your observation

# The real exercise

Draw the Poisson distribution with $\lambda = 1000$. Now overlay a Gaussian distribution and a Binomial distribution on this Poisson distribution. Tune the parameters of the Gaussian and the Binomial so that their shapes are as close to the Poisson distribution as possible.

- Hint: which descriptive statistics tell you about the shape of a distribution?

# Part 2 Random Number Generation with Numpy

NumPy's legacy random functions used in this workshop (such as `np.random.seed` and `np.random.rand`) are based on the Mersenne Twister algorithm, a widely used method for generating pseudo-random numbers.

Key Features of NumPy's Random Module:

- Broad Range of Distributions: Beyond basic uniform and normal distributions, NumPy offers functions to generate samples from dozens of distributions, such as Binomial, Poisson, Exponential, and many more.

- Versatility: You can generate single random numbers, arrays of random numbers, or even matrices. These numbers can be integers or floats, depending on your needs.

- Reproducibility with Seeds: NumPy allows setting a 'seed' for its random number generator. By using the same seed, you can reproduce the same sequence of random numbers, which is invaluable for debugging or sharing results with others.

- Random Sampling: NumPy provides functions to randomly sample from arrays, which can be used for bootstrapping or other resampling techniques.
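The reproducibility point can be seen in a few lines. In this minimal sketch the seed values are arbitrary: resetting the same seed replays the same sequence, while a different seed produces a different one.

```python
import numpy as np

# Same seed, same sequence
np.random.seed(123)
first_run = np.random.rand(3)

np.random.seed(123)
second_run = np.random.rand(3)

# Different seed, different sequence
np.random.seed(456)
other_seed = np.random.rand(3)

print("Same seed gives identical numbers:", np.array_equal(first_run, second_run))
print("Different seed gives identical numbers:", np.array_equal(first_run, other_seed))
```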

## Basic Usage Examples:

In [None]:
# Importing the necessary module
import numpy as np

# --- Seed for Reproducibility ---
# Setting a seed ensures that the random numbers are reproducible. If you run this code multiple times,
# you'll get the same sequence of random numbers, which is useful for debugging.
# Try changing the seed and check whether the outcomes below stay the same
np.random.seed(32)

# --- Generating Random Floats ---
# Generate a single random float between 0 (inclusive) and 1 (exclusive)
single_random_float = np.random.rand()
print("Single Random Float:", single_random_float)

# Generate a 1D array of random floats
random_floats_1d = np.random.rand(5)
print("\n1D Array of Random Floats:", random_floats_1d)

# Generate a 2D array (matrix) of random floats
random_floats_2d = np.random.rand(3, 2)  # 3 rows, 2 columns
print("\n2D Array of Random Floats:\n", random_floats_2d)

# --- Generating Random Integers ---
# Generate a single random integer between 1 (inclusive) and 7 (exclusive), simulating a dice roll
dice_roll = np.random.randint(1, 7)
print("\nDice Roll (1 to 6):", dice_roll)

# Generate a 1D array of random integers within a range
random_integers_1d = np.random.randint(1, 10, 5)  # 5 integers between 1 (inclusive) and 10 (exclusive)
print("\n1D Array of Random Integers:", random_integers_1d)

# Generate a 2D array (matrix) of random integers within a range
random_integers_2d = np.random.randint(1, 100, (3, 2))  # 3 rows, 2 columns with numbers between 1 and 99
print("\n2D Array of Random Integers:\n", random_integers_2d)

# --- Controlling the Sample Size ---
# Randomly choose 3 numbers from a given list (with replacement)
sample_with_replacement = np.random.choice([1, 2, 3, 4, 5], size=3)
print("\nSample with Replacement:", sample_with_replacement)

# Randomly choose 3 numbers from a given list (without replacement)
sample_without_replacement = np.random.choice([1, 2, 3, 4, 5], size=3, replace=False)
print("\nSample without Replacement:", sample_without_replacement)



## Uniform distribution

- Follow the example below and generate random numbers from a uniform distribution between $\pi$ and $+3\pi$.

In [None]:
# Import necessary modules
import numpy as np
import matplotlib.pyplot as plt

# Setting seed ensures reproducibility of random numbers
np.random.seed(0)

# --- Uniform Distribution ---
# Generate 1000 random numbers from a uniform distribution between 0 and 1
uniform_numbers = np.random.rand(1000)

# Plotting the histogram for the uniform distribution
plt.hist(uniform_numbers, bins=20, density=True)
plt.title('Uniform Distribution [0, 1]')
plt.show()



In [None]:
# Your code here

## Poisson distribution

Follow the example below, as well as the np.random web reference, and generate a 2D numpy array of shape (2,100) in which the first row contains 100 Poisson random numbers with a mean of 50 and the second row contains 100 Poisson random numbers with a mean of 150.

In [None]:
# --- Poisson Distribution ---
# Parameters for the Poisson distribution (mean and variance are both lambda)
lambda_val = 5

# Generate 1000 random numbers from a Poisson distribution with lambda = 5
poisson_numbers = np.random.poisson(lambda_val, 1000)

# Plotting the histogram for the Poisson distribution
plt.hist(poisson_numbers, bins=30, density=True)
plt.title('Poisson Distribution (lambda=5)')
plt.show()



In [None]:
# Your code here

## Gaussian random numbers

Follow the example below and generate a set of random numbers, each of which is the product of two Gaussian random numbers, $A$ and $B$. $A$ should be drawn from a Gaussian PDF with (mu = 10, sigma = 3) and $B$ from a Gaussian PDF with (mu = 15, sigma = 5).

Draw the distribution of this set of random numbers $A \cdot B$.

In [None]:
# --- Normal (Gaussian) Distribution ---
# Parameters for the normal distribution
mu, sigma = 0, 1

# Generate 1000 random numbers from a normal distribution with mean = 0 and standard deviation = 1
normal_numbers = np.random.normal(mu, sigma, 1000)

# Plotting the histogram for the normal distribution
plt.hist(normal_numbers, bins=30, density=True)
plt.title('Normal Distribution (mu=0, sigma=1)')
plt.show()



In [None]:
# Your code here



## Exponential random numbers

Building on the code below, calculate the mean of all the random numbers generated, the mean of the random numbers that are smaller than 8, and the mean of the random numbers that are smaller than 6. Make a scatter plot of these three sets. Each set contributes one point, given by a pair of values: the number of random numbers in the set and their mean.
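As a hint on the mechanics only, selecting the entries of an array that satisfy a condition is usually done with a boolean mask. The sketch below uses a small hand-written array and an arbitrary threshold, not the exercise data.

```python
import numpy as np

values = np.array([1.0, 4.0, 7.0, 10.0])  # toy data, not the exercise data

# A comparison on an array yields a boolean array of the same shape.
mask = values < 5.0

# Indexing with the mask keeps only the entries where the mask is True.
selected = values[mask]

print("Selected values:", selected)
print("How many were selected:", selected.size)
print("Mean of the selected values:", selected.mean())
```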

In [None]:
# --- Exponential Distribution ---
# Parameters for the exponential distribution (mean = 1/lambda)
lambda_exp = 0.5

# Generate 1000 random numbers from an exponential distribution with lambda = 0.5
exponential_numbers = np.random.exponential(1/lambda_exp, 1000)

# Plotting the histogram for the exponential distribution
plt.hist(exponential_numbers, bins=30, density=True)
plt.title('Exponential Distribution (lambda=0.5)')
plt.show()



In [None]:
# Your code here

## Binomial Distribution

Use scipy.stats to produce a Binomial PMF that matches the distribution used to generate the random numbers. Overlay that PMF on the random data set generated below.

In [None]:
# --- Binomial Distribution ---
# Parameters for the binomial distribution
n, p = 100, 0.5

# Generate 1000 random numbers from a binomial distribution with n=100 trials and p=0.5 probability of success
binomial_numbers = np.random.binomial(n, p, 1000)

# Plotting the histogram for the binomial distribution
plt.hist(binomial_numbers, density=True, bins=50, range=(25, 75))
plt.title('Binomial Distribution (n=100, p=0.5)')
plt.show()


In [None]:
# Your code here

# Part 3 Independent events vs correlated events

## Part 3.1
Generate two sets, x and y, of Gaussian random numbers using the same PDF (mu = 10, sigma = 2). Each set has 10,000 entries. Use the code below to show them in a two-dimensional histogram.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# How do you do the generation?



# Plot 2D histogram
plt.hist2d(x, y, bins=[np.linspace(0, 20, 50), np.linspace(0, 20, 50)], cmap='inferno')
plt.colorbar(label='Counts')
plt.xlabel('X values')
plt.ylabel('Y values')
plt.title('2D Histogram')
plt.show()



## Part 3.2
First, generate a set of Gaussian random numbers $k = \{k_i\}$ from the PDF (mu = 10, sigma = 2). Then use the generated $k$ values to generate another set of random numbers $l$, where each value $l_i$ is drawn from a Gaussian PDF with mu = $k_i$ and sigma = 2. Show $k$ and $l$ in a two-dimensional histogram like the one in the cell above. Compare and contrast the two plots, and write down your analysis.
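One possible way to put a number on the visual comparison is the Pearson correlation coefficient, available as `np.corrcoef`. This is a suggestion for the analysis rather than a requirement of the assignment, and the toy arrays below are hypothetical. Independent samples give a coefficient near zero, although a coefficient near zero does not by itself prove independence.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
b = 2.0 * a + 1.0                      # fully determined by a (perfect linear relation)
c = np.array([1.0, -1.0, -1.0, 1.0])   # chosen to have no linear relation to a

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the coefficient.
r_ab = np.corrcoef(a, b)[0, 1]
r_ac = np.corrcoef(a, c)[0, 1]

print("Correlation of a and b:", r_ab)
print("Correlation of a and c:", r_ac)
```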

In [None]:
import numpy as np
import matplotlib.pyplot as plt


# How do you generate the data set?

# Plot 2D histogram
plt.hist2d(k, l, bins=[np.linspace(0, 20, 50), np.linspace(0, 20, 50)], cmap='inferno')
plt.colorbar(label='Counts')
plt.xlabel('k values')
plt.ylabel('l values')
plt.title('2D Histogram')
plt.show()
