# Probability Assignment

To get full credit in this assignment you need to use the `numpy`, `scipy` and `pandas` libraries. When you need to write equations, type them in LaTeX math notation. You can use any plotting library to produce the plots.

PS1: We run the assignment questions through ChatGPT, and you will be referred to the Dean if we find that a robot answered your questions.

PS2: We are also monitoring solution websites, and we will take action against anyone who uploads this assignment to one.

## Problem 1 (80 points)

A surgeon analyzes surgical videos and models the events that occur. He describes the problem statement [here](https://thomasward.com/simulating-correlated-data/). Your job is to replicate the solution in Python and demonstrate your understanding of the steps performed by including adequate explanation of the code, either in markdown cells or inline in the code. You can insert as many markdown or code cells as you need to perform the analysis.


## Question 1a (10 points)

Write the code for generating the `gs` variable. This is the simplest random variable of the problem and can be generated independently of the others.

In [None]:
import numpy as np

mu, sigma = 7.25, 0.875  # mean and standard deviation; 5.5 and 9 lie two standard deviations from the mean
gs = np.random.normal(mu, sigma, 10000)  # generate 10000 glove size samples
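
As a quick sanity check on these parameters, the sketch below estimates the fraction of draws that fall inside the 5.5 to 9 range. It assumes a seeded `np.random.default_rng` generator (not used in the cell above) so the result is reproducible; the names `rng`, `gs_check` and `inside` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator: an assumption for reproducibility
mu, sigma = 7.25, 0.875         # mean and standard deviation of the glove sizes

gs_check = rng.normal(mu, sigma, 10_000)

# 5.5 and 9 sit exactly two standard deviations from the mean,
# so roughly 95% of the samples should land inside that interval.
inside = np.mean((gs_check >= 5.5) & (gs_check <= 9.0))
print(f"fraction inside [5.5, 9]: {inside:.3f}")
```

The remaining samples fall outside the range, so a normal model only approximately respects the stated glove size limits.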

## Question 1b (20 points)

We have three variables, `ak`, `pp`, and `ptime`. Write the code for generating these variables from a multivariate Gaussian distribution and replicate the associated plots.



In [None]:
import pandas as pd
import matplotlib.pyplot as plt

mean = [0, 0, 0]  # all variables have zero mean
cov = [[1, 0.6, -0.9], [0.6, 1, -0.5], [-0.9, -0.5, 1]]  # 3x3 covariance matrix (unit variances, so also the correlation matrix)
APT = np.random.multivariate_normal(mean, cov, 10000)  # generate 10000 samples, one row per sample

# Inspect some sample properties
print(APT.mean(axis=0))  # sample mean of each variable
print(np.cov(APT.T))  # sample covariance matrix
print(np.corrcoef(APT.T))  # sample correlation matrix

# 3x3 scatter matrix of the samples; the off-diagonal panels show the pairwise correlations
df = pd.DataFrame(APT, columns=['ak', 'pp', 'ptime'])  # build the data frame
axes = pd.plotting.scatter_matrix(df, alpha=1)  # draw the scatter matrix
plt.tight_layout()
plt.grid()
plt.show()

# Each marginal is a normal distribution. For example, ak:
plt.hist(APT[:, 0], bins=20)
plt.grid()
plt.show()
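
To illustrate where the correlation in the samples comes from, here is a minimal sketch that builds correlated Gaussian samples by hand from independent ones using a Cholesky factor. This is one standard construction and is not necessarily what `np.random.multivariate_normal` does internally; the names `rng`, `L`, `Z`, `X` and `sample_cov` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)  # seeded generator: an assumption for reproducibility
cov = np.array([[1.0, 0.6, -0.9],
                [0.6, 1.0, -0.5],
                [-0.9, -0.5, 1.0]])

# Cholesky factor: cov = L @ L.T (requires cov to be positive definite)
L = np.linalg.cholesky(cov)

# Independent standard normals, one row per sample
Z = rng.standard_normal((10_000, 3))

# Each row z is mapped to L @ z, whose covariance is L @ I @ L.T = cov
X = Z @ L.T

sample_cov = np.cov(X.T)
print(np.round(sample_cov, 2))
```

The printed sample covariance should be close to `cov`, up to sampling noise.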

## Question 1c (20 points)

Perform the probability integral transform and replicate the associated plots.

In [None]:
from scipy.stats import norm

U = norm.cdf(APT, loc=0, scale=1)  # apply the standard normal CDF; each column becomes Uniform(0, 1)
# Histogram of the first column of U: it is approximately uniform
plt.hist(U[:, 0], bins=20)
plt.grid()
plt.show()

print(np.corrcoef(U.T))  # sample correlation matrix of the transformed data

# 3x3 scatter matrix of the uniform samples; the correlation is still visible
df = pd.DataFrame(U, columns=['ak', 'pp', 'ptime'])  # build the data frame
axes = pd.plotting.scatter_matrix(df, alpha=0.2)  # draw the scatter matrix
plt.tight_layout()
plt.grid()
plt.show()
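
The probability integral transform says that if $X$ has a continuous CDF $F_X$, then $U = F_X(X)$ is uniform on $(0, 1)$. The sketch below checks this numerically for a single standard normal variable. The seeded generator and the names `x`, `u` and `ks` are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import norm, kstest

rng = np.random.default_rng(2)   # seeded generator: an assumption for reproducibility
x = rng.standard_normal(10_000)  # X ~ N(0, 1)
u = norm.cdf(x)                  # U = F_X(X) should be Uniform(0, 1)

# Uniform(0, 1) has mean 1/2 and variance 1/12
print(u.mean(), u.var())

# Kolmogorov-Smirnov distance between the sample and Uniform(0, 1)
ks = kstest(u, "uniform")
print(ks.statistic)
```

A small Kolmogorov-Smirnov statistic indicates that the empirical distribution of `u` is close to uniform.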


## Question 1d (20 points)

Perform the inverse transform sampling.

In [None]:
from scipy.stats import poisson

# Poisson distribution with a mean of 5 air knots (ak)
ak = poisson.ppf(U[:, 0], 5)  # uses the first column of U
plt.hist(ak, bins=15)
plt.grid()
plt.show()

# Poisson distribution with a mean of 15 passing points (pp)
pp = poisson.ppf(U[:, 1], 15)  # uses the second column of U
plt.hist(pp, bins=15)
plt.grid()
plt.show()

# Normal distribution with a mean of 120 and a standard deviation of 30 (ptime)
ptime = norm.ppf(U[:, 2], 120, 30)  # uses the third column of U
plt.hist(ptime, range=(0, 250), bins=20)
plt.show()
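
Inverse transform sampling pushes uniform samples through the quantile function (`ppf`) of the target distribution. The following sketch shows the idea in isolation for a Poisson target with mean 5, starting from independent uniforms rather than from `U`. The seeded generator and the names `u` and `counts` are illustrative assumptions.

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(3)  # seeded generator: an assumption for reproducibility
u = rng.uniform(size=10_000)    # U ~ Uniform(0, 1)

# ppf is the quantile function: the smallest k with P(X <= k) >= u
counts = poisson.ppf(u, 5)

# ppf returns floats even though the distribution is discrete
print(counts.dtype, counts[:5])

# A Poisson(5) variable has mean 5 and variance 5
print(counts.mean(), counts.var())
```

If integer counts are needed downstream, the result can be cast with `counts.astype(int)`.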



## Question 1e (10 points)

Replicate the final plot showcasing the correlations between the variables.


In [None]:
# Check the pairwise correlations one by one
print(np.corrcoef(ak, pp))
print(np.corrcoef(ak, ptime))
print(np.corrcoef(pp, ptime))
# gs was generated independently, so its correlations should be close to zero
print(np.corrcoef(gs, ak))
print(np.corrcoef(gs, pp))
print(np.corrcoef(gs, ptime))

# Final plot: 4x4 scatter matrix of all variables
data = {"ak": ak, "pp": pp, "ptime": ptime, "gs": gs}  # data dictionary
df = pd.DataFrame(data, columns=['ak', 'pp', 'ptime', 'gs'])  # build the data frame
axes = pd.plotting.scatter_matrix(df, alpha=0.2)  # draw the scatter matrix
plt.tight_layout()
plt.grid()
plt.show()
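
The marginal transforms are monotone, so the dependence between the variables largely survives them. The sketch below compares Pearson and Spearman correlations before and after the transform for a single pair with target correlation 0.6. It is an illustration under assumed names (`z`, `ak_demo`, `pp_demo`) and a seeded generator, and it does not reuse the variables from the cells above.

```python
import numpy as np
from scipy.stats import norm, poisson, spearmanr

rng = np.random.default_rng(4)  # seeded generator: an assumption for reproducibility
cov = [[1.0, 0.6], [0.6, 1.0]]
z = rng.multivariate_normal([0.0, 0.0], cov, size=10_000)

# Gaussian -> uniform -> Poisson marginals, as in Questions 1c and 1d
u = norm.cdf(z)
ak_demo = poisson.ppf(u[:, 0], 5)
pp_demo = poisson.ppf(u[:, 1], 15)

pearson_before = np.corrcoef(z[:, 0], z[:, 1])[0, 1]
pearson_after = np.corrcoef(ak_demo, pp_demo)[0, 1]
rho_before, _ = spearmanr(z[:, 0], z[:, 1])
rho_after, _ = spearmanr(ak_demo, pp_demo)

print(f"Pearson:  {pearson_before:.3f} -> {pearson_after:.3f}")
print(f"Spearman: {rho_before:.3f} -> {rho_after:.3f}")
```

The rank correlation changes only slightly, and the small change comes from the ties that the discrete Poisson marginals introduce.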

## Problem 2 (20 points)

You now pretend that the $n=4$ dimensional data you generated in Problem 1 arrive sequentially, one at a time (the so-called **online** learning setting). Introduce the index $i$ to represent the $i$-th arriving data sample $\mathbf x_i$.

1. Write the expression of the *sample* correlation matrix (5 points)
2. Write the expression of the sample correlation matrix that can be estimated recursively and plot the elements of the sample correlation matrix from $i=1$ to $i=100$ (15 points)
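
One common way to write these two expressions is sketched here. The symbols $N$, $\bar{\mathbf x}_i$, $\mathbf S_i$, $\mathbf M_i$, $\mathbf D_N$ and $\mathbf R_i$ are notation assumed for this sketch and are not fixed by the assignment. The recursion shown is a Welford-style update, which is one of several equivalent choices.

For $N$ samples, the sample mean, sample covariance matrix and sample correlation matrix are

$$
\bar{\mathbf x}_N = \frac{1}{N}\sum_{i=1}^{N}\mathbf x_i,\qquad
\mathbf S_N = \frac{1}{N-1}\sum_{i=1}^{N}(\mathbf x_i-\bar{\mathbf x}_N)(\mathbf x_i-\bar{\mathbf x}_N)^\top,\qquad
\mathbf R_N = \mathbf D_N^{-1/2}\,\mathbf S_N\,\mathbf D_N^{-1/2},\quad
\mathbf D_N = \operatorname{diag}(\mathbf S_N).
$$

A recursive form, starting from $\bar{\mathbf x}_0 = \mathbf 0$ and $\mathbf M_0 = \mathbf 0$, updates the running mean and the running sum of outer products when $\mathbf x_i$ arrives:

$$
\bar{\mathbf x}_i = \bar{\mathbf x}_{i-1} + \frac{1}{i}\,(\mathbf x_i - \bar{\mathbf x}_{i-1}),\qquad
\mathbf M_i = \mathbf M_{i-1} + (\mathbf x_i - \bar{\mathbf x}_{i-1})(\mathbf x_i - \bar{\mathbf x}_i)^\top .
$$

For $i \ge 2$ the estimates follow as

$$
\mathbf S_i = \frac{\mathbf M_i}{i-1},\qquad
\mathbf R_i = \operatorname{diag}(\mathbf M_i)^{-1/2}\,\mathbf M_i\,\operatorname{diag}(\mathbf M_i)^{-1/2}.
$$

The factor $1/(i-1)$ cancels in $\mathbf R_i$, so the correlation matrix can be formed directly from $\mathbf M_i$. For $i = 1$ the correlation is undefined because $\mathbf M_1 = \mathbf 0$.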

In [None]:
import time

# ak, pp, ptime and gs each hold 10000 samples. We treat them as arriving
# one at a time and update the correlation matrix after each arrival.
# Note: this recomputes the matrix from the first i samples at every step
# instead of using a recursive update.

for i in range(2, 101):  # a correlation needs at least two samples, so start at i = 2
    # Keep only the first i samples of each variable
    aknew = ak[0:i]
    ppnew = pp[0:i]
    ptimenew = ptime[0:i]
    gsnew = gs[0:i]

    data = {"ak": aknew, "pp": ppnew, "ptime": ptimenew, "gs": gsnew}  # data dictionary
    df = pd.DataFrame(data, columns=['ak', 'pp', 'ptime', 'gs'])  # data frame with the first i samples
    corr = df.corr()  # 4x4 sample correlation matrix
    print(corr)
    plt.matshow(corr)  # correlation matrix plot
    plt.show()
    time.sleep(1)
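
For comparison, a truly recursive estimate keeps only a running mean and a running sum of outer products, and never revisits old samples. The following is a minimal sketch of a Welford-style update on stand-in Gaussian data. The seeded generator, the stand-in array `X`, and the names `mean`, `M` and `history` are assumptions for illustration, not part of the cell above.

```python
import numpy as np

rng = np.random.default_rng(5)     # seeded generator: an assumption for reproducibility
X = rng.standard_normal((100, 4))  # stand-in data: 100 samples, n = 4

n = X.shape[1]
mean = np.zeros(n)    # running sample mean
M = np.zeros((n, n))  # running sum of outer products of deviations
history = []          # correlation matrix after each sample, from i = 2

for i, x in enumerate(X, start=1):
    delta = x - mean                # deviation from the previous mean
    mean += delta / i               # update the mean with the new sample
    M += np.outer(delta, x - mean)  # Welford-style rank-one update
    if i >= 2:                      # correlation is undefined for one sample
        d = np.sqrt(np.diag(M))
        history.append(M / np.outer(d, d))

history = np.array(history)  # shape (99, 4, 4): entries for i = 2, ..., 100
print(history[-1].round(3))
```

Each element can then be plotted against the sample index, for example with `plt.plot(range(2, 101), history[:, j, k])` for a chosen pair `(j, k)`.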