Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE".

---

# Section B: Question 2

*In which you will write functions to compute conditional probabilities from an (estimated) probability distribution.*

Suppose we have an $n \times p$ matrix $X$ with entries $x_{ij}$ (so that $1 \leq i \leq n$, $1 \leq j \leq p$).

Let's define some (mathematical) functions $F_j(v), S_j(v)$ given by
$$
F_j(v) = \frac{1}{n}\sum_{i = 1}^n I(x_{ij} \leq v)~~~~\text{and} ~~~~~ S_j(v) = \frac{1}{n}\sum_{i = 1}^n I(x_{ij} > v),
$$
where
* $j$ is some column index of $X$;
* $v$ is some arbitrary number;
* $I($"statement"$)$ is the *indicator function*: it returns 0 if the statement is false, and 1 otherwise. For example, this means that if $x_{ij} \leq v$, then $I(x_{ij} \leq v) = 1$; if $x_{ij} > v$, then $I(x_{ij} \leq v) = 0$. 

In plain English: the function $F_j(v)$ is the proportion of rows of $X$ whose entry in the $j$-th column is less than or equal to some value $v$, and $S_j(v)$ is the proportion of rows whose entry in the $j$-th column is greater than $v$. We can think of $F_j(v)$ as an (empirical) cumulative distribution function, $P(X_j \leq v)$.
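
As a quick sanity check of these definitions, the sketch below evaluates $F_j(v)$ and $S_j(v)$ directly from the formulas on a small made-up matrix. The helper names `indicator`, `F` and `S` and the example data are assumptions made for illustration, not part of the assignment, and columns are indexed from 0 as in Python.

```python
# Illustrative sketch only: helper names and data are made up for this example.
X = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]

def indicator(statement):
    """Return 1 if the statement is true and 0 otherwise."""
    return 1 if statement else 0

def F(X, j, v):
    """Proportion of rows whose j-th entry is <= v (0-based column index)."""
    return sum(indicator(row[j] <= v) for row in X) / len(X)

def S(X, j, v):
    """Proportion of rows whose j-th entry is > v (0-based column index)."""
    return sum(indicator(row[j] > v) for row in X) / len(X)

print(F(X, 0, 3))  # one of the three rows has a first entry <= 3
print(S(X, 0, 3))  # the other two rows have a first entry > 3
```

Every row satisfies exactly one of the two conditions, so $F_j(v) + S_j(v) = 1$ for any $v$.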

Likewise, let's define $F_{jk}(v_1, v_2), S_{jk}(v_1, v_2)$ as
$$
F_{jk}(v_1, v_2) = \frac{1}{n}\sum_{i = 1}^n I(x_{ij} \leq v_1)I(x_{ik} \leq v_2) ~~~ \text{and} ~~~ S_{jk}(v_1, v_2) = \frac{1}{n}\sum_{i = 1}^n I(x_{ij} > v_1)I(x_{ik} > v_2)
$$
where both $j$ and $k$ represent column indices of $X$.
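
The joint versions can be checked in the same way. The sketch below is only an illustration under assumed names (`F_joint`, `S_joint`) and made-up data, again with 0-based column indices. It also shows that, unlike the one-column case, $F_{jk}$ and $S_{jk}$ need not add up to 1, because a row can satisfy one condition but not the other.

```python
# Illustrative sketch only: helper names and data are made up for this example.
X = [[1, 2],
     [4, 5],
     [7, 2],
     [2, 9]]

def F_joint(X, j, k, v1, v2):
    """Proportion of rows with entry j <= v1 AND entry k <= v2."""
    return sum(1 for row in X if row[j] <= v1 and row[k] <= v2) / len(X)

def S_joint(X, j, k, v1, v2):
    """Proportion of rows with entry j > v1 AND entry k > v2."""
    return sum(1 for row in X if row[j] > v1 and row[k] > v2) / len(X)

f = F_joint(X, 0, 1, 3, 4)  # only the row [1, 2] is below both thresholds
s = S_joint(X, 0, 1, 3, 4)  # only the row [4, 5] is above both thresholds
print(f, s, f + s)          # the rows [7, 2] and [2, 9] count towards neither
```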

Your tasks are:

## Part 1a
Write a (Python) function called `marginal_dist(x_matrix, j, v, lower)`, where `x_matrix` is a two-dimensional *list* meant to represent some matrix $X$; `j` is a valid column index for `x_matrix`; `v` is a number; and `lower` is a Boolean variable. This function should return $F_j(v)$ if `lower=True` and $S_j(v)$ if `lower=False`. The default value of `lower` should be `True`.

*In the implementation of this function you are not allowed to use any packages.*

In [1]:
def marginal_dist(x_matrix, j, v, lower = True):

    n=len(x_matrix)  #number of rows
    sum_of_I=0
    if lower:
        for i in range(n):   #loop that goes through every row of the matrix
            if x_matrix[i][j]<=v:   #checks if the element is less than or equal to v
                sum_of_I+=1  #1 is added if the condition is satisfied
        Fj_v=sum_of_I/n  #calculates the cumulative distribution F_j(v)
        return Fj_v
    
    else:
        for i in range(n):   #loop that goes through every row of the matrix
            if x_matrix[i][j]>v:  #checks if the element is greater than v
                sum_of_I+=1  #1 is added if the condition is satisfied
        Sj_v=sum_of_I/n  #calculates the proportion S_j(v)
        return Sj_v
    # YOUR CODE HERE

In [2]:
# Test your code in this block
matrix_1=[[1,2,3],[4,5,6],[7,8,9]]
marginal_dist(matrix_1, 0, 3, lower=False)

0.6666666666666666

## Part 1b
Write a separate Python function `nd_marginal_dist(x_matrix, j, v, lower = True)` with analogous functionality, but where `x_matrix` is an `ndarray` and you are allowed to use `numpy` in any way you want, but *no other package*.

In [3]:
import numpy as np #import the necessary modules

def nd_marginal_dist(x_matrix, j, v, lower = True):

    column=np.asarray(x_matrix)[:,j]  #j-th column of the matrix (also works if a list is passed)
    if lower:
        indicator=column<=v  #Boolean array: True for rows where the entry is less than or equal to v
    else:
        indicator=column>v  #Boolean array: True for rows where the entry is greater than v
    return indicator.mean()  #mean of the indicators, i.e. (1/n) * sum of I
    
    # YOUR CODE HERE

In [4]:
# Test your code in this block
import numpy as np
matrix_1=np.array([[1,2,3],[4,5,6],[7,8,9]])
nd_marginal_dist(matrix_1, 0, 3, lower=False)

0.6666666666666666

In [None]:
# Leave this block empty


## Part 2a

Repeat the same idea now for an analogous function (without *NumPy*) to be called `pairwise_dist(x_matrix, j, k, v1, v2, lower)`. This function should return $F_{jk}(v_1, v_2)$ if `lower=True` and $S_{jk}(v_1, v_2)$ if `lower=False`. The default value of `lower` should be `True`.


In [5]:
def pairwise_dist(x_matrix, j, k, v1, v2, lower = True):
    
    n=len(x_matrix)  #number of rows
    sum_of_I=0
    if lower:
        for i in range(n):  #loop that goes through every row of the matrix
            if x_matrix[i][j]<=v1 and x_matrix[i][k]<=v2:  #checks if the entry in the j-th column is less than or equal to v1 and the entry in the k-th column of the same row is less than or equal to v2
                sum_of_I+=1  #1 is added if the condition is satisfied
        Fjk_v=sum_of_I/n  #calculates the joint proportion F_jk(v1, v2)
        return Fjk_v
    
    else:
        for i in range(n):  #loop that goes through every row of the matrix
            if x_matrix[i][j]>v1 and x_matrix[i][k]>v2:  #checks if the entry in the j-th column is greater than v1 and the entry in the k-th column of the same row is greater than v2
                sum_of_I+=1  #1 is added if the condition is satisfied
        Sjk_v=sum_of_I/n  #calculates the joint proportion S_jk(v1, v2)
        return Sjk_v
    # YOUR CODE HERE

In [6]:
# Test your code in this block
matrix_1=[[1,2,3],[4,5,6],[7,8,9]]
pairwise_dist(matrix_1, 0, 1, 3,4,lower=False)

0.6666666666666666

## Part 2b

Now write the corresponding *NumPy* version of this function `nd_pairwise_dist(x_matrix, j, k, v1, v2, lower = True)`, but where `x_matrix` is an `ndarray` and you are allowed to use `numpy` in any way you want, but *no other package*.

In [7]:
import numpy as np  #import the necessary modules

def nd_pairwise_dist(x_matrix, j, k, v1, v2, lower = True):
    x=np.asarray(x_matrix)  #make sure columns can be sliced, even if a list is passed
    if lower:
        indicator=(x[:,j]<=v1) & (x[:,k]<=v2)  #True for rows where column j is less than or equal to v1 and column k is less than or equal to v2
    else:
        indicator=(x[:,j]>v1) & (x[:,k]>v2)  #True for rows where column j is greater than v1 and column k is greater than v2
    return indicator.mean()  #proportion of rows satisfying both conditions, i.e. (1/n) * sum of I
    
    # YOUR CODE HERE

In [8]:
# Test your code in this block
import numpy as np
matrix_1=np.array([[1,2,3],[4,5,6],[7,8,9]])
nd_pairwise_dist(matrix_1, 0, 1, 3,4,lower=False)

0.6666666666666666

In [None]:
# Leave this block empty


## Part 3a

The conditional probability $\mathbb{P}(X_j \leq v_1 \mid X_k \leq v_2)$ is given by

$$ \mathbb{P}(X_j \leq v_1 \mid X_k \leq v_2) = \begin{cases} \frac{F_{jk}(v_1, v_2)}{F_k(v_2)} & \text{if } F_k(v_2) > 0 \\ \text{undefined} & \text{otherwise}\end{cases}$$

Write a function (without *NumPy*) to be called `conditional_prob(x_matrix, j, k, v1, v2)` to calculate $\mathbb{P}(X_j \leq v_1 \mid X_k \leq v_2)$. You may use your functions from the previous parts, if you wish. Your function should print the warning `"Conditional probability undefined"` and return `None` if the conditional probability is undefined.
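
To make the formula concrete, here is a minimal sketch of the same ratio on a small made-up matrix. The name `cond_prob_sketch` and the data are assumptions for illustration only; it counts rows directly instead of calling the functions from the earlier parts, and uses 0-based column indices.

```python
# Illustrative sketch only: the function name and data are made up for this example.
def cond_prob_sketch(X, j, k, v1, v2):
    """Estimate P(X_j <= v1 | X_k <= v2) from the rows of X (0-based columns)."""
    # Rows satisfying the conditioning event X_k <= v2
    conditioning_rows = [row for row in X if row[k] <= v2]
    if not conditioning_rows:  # F_k(v2) = 0, so the ratio is undefined
        print("Conditional probability undefined")
        return None
    # Among those rows, count the ones that also satisfy X_j <= v1
    joint_count = sum(1 for row in conditioning_rows if row[j] <= v1)
    # The 1/n factors in F_jk and F_k cancel, leaving a ratio of counts
    return joint_count / len(conditioning_rows)

X = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]

print(cond_prob_sketch(X, 2, 0, 3, 4))  # 1 of the 2 rows with X_0 <= 4 has X_2 <= 3
print(cond_prob_sketch(X, 2, 0, 3, 0))  # no row has X_0 <= 0, so this is undefined
```

Dividing the counts gives the same value as $F_{jk}(v_1, v_2) / F_k(v_2)$, since both proportions share the factor $\frac{1}{n}$.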

In [9]:
def conditional_prob(x_matrix, j, k, v1, v2):
    Fk_v2=marginal_dist(x_matrix, k, v2, lower = True)  #F_k(v2), the denominator
    if Fk_v2>0:  #checks if F_k(v2)>0 so the conditional probability is defined
        prob=pairwise_dist(x_matrix, j, k, v1, v2)/Fk_v2  #P(X_j<=v1 | X_k<=v2) = F_jk(v1,v2)/F_k(v2), using the functions that do not use NumPy
        return prob 
    else:
        print("Conditional probability undefined")
        return None  #the conditional probability is undefined

    # YOUR CODE HERE

In [10]:
# Test your code in this block
matrix_1=[[1,2,3],[4,5,6],[7,8,9]]
conditional_prob(matrix_1, 2, 0, 3,4)

0.5

## Part 3b
Now write the corresponding *NumPy* version of this function `nd_conditional_prob(x_matrix, j, k, v1, v2)`, but where `x_matrix` is an `ndarray` and you are allowed to use `numpy` in any way you want, but *no other package*. You may use your functions from the previous parts, if you wish.

In [11]:
import numpy as np

def nd_conditional_prob(x_matrix, j, k, v1, v2):
    Fk_v2=nd_marginal_dist(x_matrix, k, v2, lower = True)  #F_k(v2), the denominator
    if Fk_v2>0:  #checks if F_k(v2)>0 so the conditional probability is defined
        prob=nd_pairwise_dist(x_matrix, j, k, v1, v2)/Fk_v2  #P(X_j<=v1 | X_k<=v2) = F_jk(v1,v2)/F_k(v2), using the NumPy functions
        return prob
    else:
        print("Conditional probability undefined")
        return None  #the conditional probability is undefined
    
    # YOUR CODE HERE

In [12]:
# Test your code in this block
import numpy as np
matrix_2=np.array([[1,2,3],[4,5,6],[7,8,9]])
nd_conditional_prob(matrix_2, 2, 0, 3, 4)

0.5

In [None]:
# Leave this block empty


## Part 4
Write a function called `benchmark_conditional_prob` where you provide some test data, which will take the form of a two-dimensional list and an `ndarray`, and report the wall-clock time of calling each of the two Python functions from Part 3.
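
A single timed call on a tiny matrix finishes in microseconds, so the reading is dominated by noise. One possible way to get steadier numbers, sketched below, is to repeat the call many times with `time.perf_counter` and keep the best average. The helper `best_wall_time`, the stand-in function `proportion_leq` and the test data are hypothetical names chosen for this illustration, not part of the assignment.

```python
import time

def best_wall_time(func, *args, repeats=5, calls=1000):
    """Return the best average wall-clock time (in seconds) of func(*args)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()      # high-resolution wall-clock timer
        for _ in range(calls):
            func(*args)
        elapsed = (time.perf_counter() - start) / calls
        best = min(best, elapsed)        # keep the least disturbed run
    return best

# Hypothetical stand-in for the function being benchmarked
def proportion_leq(rows, j, v):
    return sum(1 for row in rows if row[j] <= v) / len(rows)

data = [[i, i + 1, i + 2] for i in range(100)]  # made-up test data
print(best_wall_time(proportion_leq, data, 0, 50))
```

The same helper could be called once with a list-based function and once with a NumPy-based one, on larger inputs, to compare the two on an equal footing.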

In [13]:
import time

def benchmark_conditional_prob():
    matrix_1=[[1,2,3],[4,5,6],[2,4,5]]
    start_1=time.time()  #get the current time
    conditional_prob(matrix_1, 1, 2, 4.5, 4)  #call the first function which doesn't use NumPy
    end_1=time.time()  #get the current time
    time_1=end_1-start_1  #calculate the time the first function took to run
    print("The wall-clock time for matrix_1 is", time_1)
    
    matrix_2=np.array([[1,2,3],[4,5,6],[2,4,5]])
    start_2=time.time()  #get the current time
    nd_conditional_prob(matrix_2, 1, 2, 4.5, 4)  #call the second function, which uses NumPy
    end_2=time.time()  #get the current time
    time_2=end_2-start_2  #calculate the time the second function took to run
    print("The wall-clock time for matrix_2 is", time_2)
    return time_1, time_2  #return the time both functions took to run

    # YOUR CODE HERE

In [14]:
# Test your code in this block
benchmark_conditional_prob()

The wall-clock time for matrix_1 is 7.867813110351562e-06
The wall-clock time for matrix_2 is 0.00011110305786132812


(7.867813110351562e-06, 0.00011110305786132812)

In [None]:
# Leave this block empty


## Part 5

Explain what each function does, concisely but in enough detail so that any of your peers can follow. Aim for a total of 6-10 sentences.

The function in Part 1a computes $F_j(v)$ or $S_j(v)$: it loops through the rows of a two-dimensional list and returns the proportion of rows whose entry in column $j$ is less than or equal to $v$ (`lower=True`) or greater than $v$ (`lower=False`). The function in Part 1b returns the same quantity for an `ndarray`, using NumPy to evaluate the indicator $I$.
The function in Part 2a computes $F_{jk}(v_1, v_2)$ or $S_{jk}(v_1, v_2)$: the proportion of rows in which the entries in columns $j$ and $k$ are both less than or equal to, or both greater than, their respective thresholds $v_1$ and $v_2$. This is done by checking both columns of each row inside a single loop. The function in Part 2b is the NumPy version of the same calculation.
The function in Part 3a calls the function from Part 2a to calculate $F_{jk}(v_1, v_2)$ and divides it by $F_k(v_2)$, which is calculated with the function from Part 1a. The division is only carried out if $F_k(v_2)$ is greater than 0; otherwise the conditional probability is undefined, so a warning is printed and `None` is returned. The function in Part 3b does the same with the NumPy functions from Parts 2b and 1b.
The function in Part 4 measures how long the functions from Parts 3a and 3b take to run on the same test data, given once as a list and once as an `ndarray`. The `time` module is used to record the time just before and just after each call, and the difference between the two readings is the wall-clock time.