# Solution for Homework 4

In all problems below, please comment your code well enough that the grader can follow what you are doing with ease. For non-coding answers, please formulate your explanations and answers as complete English sentences. It is not sufficient to leave only comments in the code (without full-sentence explanations in problems that ask for them) or, vice versa, to leave full-sentence explanations without code comments. You will need both to get full credit.

## Problem 1

Write a function named "border" that takes as its input two integers n and m ($1 \le n,m \le 100$) and outputs a NumPy array of size $(n,m)$. Your array should be filled with zeros except for the "border" (that is, the first and last column and the first and last row) which should be filled with ones. 

In [2]:
import numpy as np

def border(n, m):
    """Creates an array with shape (n, m) that is all zeros
    except for the border (i.e., the first and last rows and
    columns), which should be filled with ones."""
    
    grid = np.ones((n,m))
    grid[1:n-1,1:m-1]=0
    return grid

# check that your function works as expected (if it does, this check should return "True")
np.array_equal(border(3,3), np.array([[1, 1, 1], [1, 0, 1], [1, 1, 1]]))

True
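
As a cross-check, the same array can be built the other way around: start from zeros and switch on the four edges. This is only a sketch; the name `border_from_zeros` is made up here and is not part of the assignment.

```python
import numpy as np

def border_from_zeros(n, m):
    """Hypothetical alternative to border(): start from zeros, then set the edges to one."""
    grid = np.zeros((n, m))
    grid[[0, -1], :] = 1   # first and last row
    grid[:, [0, -1]] = 1   # first and last column
    return grid

# a 1-row or 1-column array consists only of border, so it is all ones
print(border_from_zeros(1, 4))
print(border_from_zeros(4, 5))
```

Both versions handle the edge cases n = 1 and m = 1, where every entry belongs to the border.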


## Problem 2

(a) Suppose that x and y are both NumPy arrays of size $(1,n)$. 
Write a function that computes Pearson's sample correlation coefficient $r$ 
between the entries in x and y. 

Recall that Pearson's sample correlation coefficient is defined as 
$$r(x,y) = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum\limits_{i=1}^n (x_i - \bar{x})^2 \sum\limits_{i=1}^n(y_i - \bar{y})^2}} $$

In [3]:
def pearsons_correlation(x, y):
    """Computes Pearson's correlation coefficient between vectors x and y."""
    
    dx = x - x.mean()   # deviations of x from its mean
    dy = y - y.mean()   # deviations of y from its mean
    return np.sum(dx*dy)/np.sqrt(np.sum(dx**2)*np.sum(dy**2))
    
# check that your function works
# if it does, the comparison below should return "True"

import numpy as np

np.random.seed(10)
n = 10
x = np.random.random((1,n))
y = np.random.random((1,n))

round(pearsons_correlation(x,y),6) == round(np.corrcoef(x,y)[0,1],6)
# we're rounding because floating-point results from NumPy's built-in
# correlation routine can differ from our home-made function in the last few digits.

True
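
Comparing rounded values works here, but it can fail when two nearly equal numbers fall on opposite sides of a rounding boundary. A tolerance-based comparison with `np.isclose` avoids that. The sketch below repeats the formula from part (a) under the hypothetical name `pearson_r` so that it runs on its own.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's r, same formula as in part (a); the name is hypothetical."""
    dx = x - x.mean()
    dy = y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

np.random.seed(10)
x = np.random.random((1, 10))
y = np.random.random((1, 10))

# compare within a small tolerance instead of rounding both sides
print(np.isclose(pearson_r(x, y), np.corrcoef(x, y)[0, 1]))
```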

(b) Create a NumPy array filled with uniform(0,1) random values with n rows and m columns (for the combinations of n and m listed below). In each case, find the maximum of the Pearson correlation coefficients for each of the $\binom{m}{2}$ pairs of columns in your array. 

That is, if the columns of your array are $x_1, \ldots, x_m$, find 
$$\max(r) = \max\limits_{a,b \in \{1,\ldots,m\}, a \neq b} r(x_a, x_b)$$

Enter the values you find into the table below (rounded to two digits after the decimal point).

In [6]:
import numpy as np
np.random.seed(10)

def max_corr(n,m):
    """find maximum correlation between columns of (n,m) random uniform (0,1) array"""
    array = np.random.random((n,m)) 
    correlations = np.zeros((m,m)) # make matrix to store correlations between columns
    for a in range(m):             # loop over columns
        for b in range(m):
            correlations[a,b] = np.corrcoef(array[:,a], array[:,b])[0,1] # find Pearson's correlation between columns
                # Alternatively, you can also use your own code from part (a)
    return round(max(correlations[np.eye(m)==0]),2) # disregard correlations of columns with themselves 
                # that is, ignore the "1" entries in the diagonal and find the max of the rest. 

max_corr(100,50)

0.34

|  rows n | columns m  | max(r)  |
|:-:|:-:|:-:|
| 10  | 20  | |
| 10  | 50  |  |
| 10  | 100  |  |
| 100  | 20  |  |
| 100  | 50  |  |
| 100  | 100  |  |

Copy and paste this table into the Markdown cell below and complete it. 

|  rows n | columns m  | max(r)  |
|:-:|:-:|:-:|
| 10  | 20  | 0.73 |
| 10  | 50  | 0.82 |
| 10  | 100  | 0.96 |
| 100  | 20  | 0.26 |
| 100  | 50  | 0.34 |
| 100  | 100  | 0.41 |


**Remark:** Observing many variables (large m) on only a few individuals (small n) leads to high "spurious" correlations, even though the columns here are generated independently of each other. 
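
The double loop in part (b) can also be replaced by a single call to `np.corrcoef` on the whole array. The following is a sketch under two assumptions: the function name `max_corr_vectorized` is made up, and it draws from a `np.random.default_rng` generator, so its numbers will not reproduce the table above exactly. The pattern (larger maxima for small n and large m) should still show up.

```python
import numpy as np

def max_corr_vectorized(n, m, rng):
    """Largest off-diagonal correlation among the m columns of an (n, m) uniform(0,1) array."""
    array = rng.random((n, m))
    corr = np.corrcoef(array, rowvar=False)   # (m, m) matrix, columns are the variables
    upper = np.triu_indices(m, k=1)           # each pair a < b exactly once
    return corr[upper].max()

rng = np.random.default_rng(10)   # assumption: a Generator, not the np.random.seed(10) stream
for n in (10, 100):
    for m in (20, 50, 100):
        print(n, m, round(max_corr_vectorized(n, m, rng), 2))
```

Because the correlation matrix is symmetric, looking only at the entries above the diagonal covers all $\binom{m}{2}$ pairs.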

## Problem 3

Suppose that $X$ is a matrix with n rows and m columns. In statistics, these matrices arise frequently when we collect (numerical) data on m variables which are each observed on n independent individuals. Denote the columns of $X$ by $x_1, \ldots, x_m$, so that $X = (x_1, \ldots, x_m)$. Then the covariance matrix of $X$ is defined as 

$$ \mbox{Cov}(X) = \left( \begin{array}{cccc}
Var(x_1) & Cov(x_1, x_2) & \cdots & Cov(x_1, x_m) \\
Cov(x_2, x_1) & Var(x_2) & \cdots & Cov(x_2, x_m) \\
\vdots & \vdots & \ddots & \vdots \\
Cov(x_m, x_1) & Cov(x_m, x_2) & \cdots & Var(x_m) 
\end{array} \right) $$

where Var($x_i$) is the sample variance of the entries in the $i^{th}$ column of $X$ and Cov($x_i, x_j$) ($i \neq j$) is the sample covariance of the entries in columns $i$ and $j$. 

$$ Cov(x_i, x_j) = \frac{1}{n-1}\sum\limits_{k=1}^n (x_{ki} - \bar{x}_i)(x_{kj}-\bar{x}_j)$$

That is, the covariance matrix is a square $m\times m$ matrix whose diagonal entries are the sample variances of the columns of $X$ and whose off-diagonal entries are covariances between two columns of $X$, respectively. 

Write *your own* function that takes a NumPy array of shape (n,m) as its input and returns the covariance matrix of the array. Do not use any built-in functions to compute variance and/or covariance; write your own functions instead. It's ok to use NumPy routines for sums, squares, or square roots. 

Check your work by comparing your result to that of the ```np.cov()``` function applied to the same input.

In [8]:
def covariance_matrix(X):
    """ Finds the covariance matrix of (n,m) shaped array X"""
    
    n = np.shape(X)[0]
    m = np.shape(X)[1]
    cov_matrix = np.zeros((m,m)) # make "empty" matrix of floats to later store covariances in 
    for i in range(m):            # iterating over COLUMNS of X
        for j in range(m):
            cov_matrix[i,j] = (1/(n-1))*np.sum((X[:,i]-X[:,i].mean())*(X[:,j]-X[:,j].mean())) 
                # assemble covariance matrix
    return cov_matrix
    
# check your work
X = np.random.random((5,3))  # or make up some other matrix for X 
np.array_equal(np.round(np.cov(X.T),4), np.round(covariance_matrix(X),4)) 
# again, we're rounding because your "hand" computation will differ slightly from NumPy's internal computations

True
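
The double loop above can be collapsed into one matrix product: after subtracting each column's mean, entry $(i,j)$ of $X_c^T X_c$ is exactly the sum in the covariance formula. The sketch below uses the made-up name `covariance_matrix_vectorized` and a separate array `X_demo`, and it compares with `np.allclose` instead of rounding.

```python
import numpy as np

def covariance_matrix_vectorized(X):
    """Covariance matrix of the columns of X via one matrix product (hypothetical name)."""
    n = X.shape[0]
    centered = X - X.mean(axis=0)            # subtract each column's mean
    return centered.T @ centered / (n - 1)   # entry (i, j) is Cov(x_i, x_j)

np.random.seed(10)
X_demo = np.random.random((5, 3))
print(np.allclose(covariance_matrix_vectorized(X_demo), np.cov(X_demo, rowvar=False)))
```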

## Problem 4

The code given below (don't change the seed, please) generates n=10,000 IID random samples from Student's t-distribution with $\nu = 5$ degrees of freedom. Let $X \sim t(\nu=5)$ be a random variable with a $t_5$ distribution. For each of the following quantities, find its value using the methods you have learned in Math 161A. State your answer together with a short description of how you found it. Also find the best estimate of each quantity you can produce based on the 10,000 generated random numbers.

In [10]:
from scipy import stats
import numpy as np

np.random.seed(10)
Data = np.array(stats.t.rvs(df=5,size=10000))

(a) $P(-1 \le X < 2)$

In [11]:
np.sum((Data>=-1)&(Data<2))/len(Data)     # find relative frequency of data in [-1,2)

0.7648

The estimate of $P(-1 \le X < 2)$ is 0.7648.

(b) The $77^{th}$ percentile of $X$.

In [12]:
sorted(Data)[7700] # the 7701st smallest of the 10,000 values approximates the 77th percentile

0.796361368761396

The estimate of the 77th percentile of X is 0.7964.

(c) Var($X$) (that is the sample variance of $X$)

In [13]:
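print((1/(len(Data)))*np.sum((Data-Data.mean())**2)) # variance of the data with divisor n
# alternatively (np.var uses divisor n by default, i.e. ddof=0;
# pass ddof=1 for the n-1 divisor used in Problem 3)
print(np.var(Data))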

1.6497983797104132
1.649798379710413


The estimate of $Var(X)$ is 1.6498.
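
The problem also asks for the exact values. One way to obtain them, sketched here with `scipy.stats.t` instead of a t-table, is to use the CDF for (a), the inverse CDF for (b), and the formula $\nu/(\nu-2)$ for the variance of a t-distribution with $\nu > 2$ for (c). The variable names are chosen for illustration only.

```python
from scipy import stats

nu = 5

# (a) P(-1 <= X < 2) = F(2) - F(-1); X is continuous, so < and <= give the same value
prob = stats.t.cdf(2, df=nu) - stats.t.cdf(-1, df=nu)

# (b) 77th percentile: the value q with F(q) = 0.77
q77 = stats.t.ppf(0.77, df=nu)

# (c) Var(X) = nu / (nu - 2) for nu > 2, which is 5/3 here
var = nu / (nu - 2)

print(prob, q77, var)
```

For (c), the exact value $5/3 \approx 1.6667$ can be compared with the estimate 1.6498 obtained from the data.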

## Problem 5

The code below generates a NumPy array X of shape (12,8) filled with integers. Please don't change the seed.

In [None]:
import numpy as np

np.random.seed(10)
X = np.random.randint(10,size = (12,8)) 
X

(a) Write a function called "swap" that takes as its input the array $X$ and two integers n and m ($ 1 \le n,m \le 12, n \neq m$) and returns the matrix X in which rows n and m have been swapped. 

In [15]:
def swap(X,n,m):
    """swaps rows n and m of array X"""

    fancy_idx = list(range(np.shape(X)[0]))
    fancy_idx[n-1] = m-1
    fancy_idx[m-1] = n-1
    return X[fancy_idx]
    
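# check your work (this should return True when X is the (12,8) integer array generated above;
# if X has been overwritten since, e.g. by the (5,3) array in Problem 3, re-run that cell first)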
np.array_equal(swap(X,1,2), np.array([[9, 0, 8, 6, 4, 3, 0, 4],
       [9, 4, 0, 1, 9, 0, 1, 8],
       [6, 8, 1, 8, 4, 1, 3, 6],
       [5, 3, 9, 6, 9, 1, 9, 4],
       [2, 6, 7, 8, 8, 9, 2, 0],
       [6, 7, 8, 1, 7, 1, 4, 0],
       [8, 5, 4, 7, 8, 8, 2, 6],
       [2, 8, 8, 6, 6, 5, 6, 0],
       [0, 6, 9, 1, 8, 9, 1, 2],
       [8, 9, 9, 5, 0, 2, 7, 3],
       [0, 4, 2, 0, 3, 3, 1, 2],
       [5, 9, 0, 1, 0, 1, 9, 0]]))  

False
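
Another way to swap two rows is to assign both rows at once on a copy of the array. This is a sketch only; `swap_copy` and the small test array `A` are made up for illustration.

```python
import numpy as np

def swap_copy(X, n, m):
    """Swap rows n and m (1-based) on a copy of X; the name is hypothetical."""
    Y = X.copy()                            # leave the caller's array untouched
    Y[[n - 1, m - 1]] = Y[[m - 1, n - 1]]   # fancy indexing on the right makes a copy first
    return Y

A = np.arange(12).reshape(4, 3)
print(swap_copy(A, 1, 3))
```

Like the index-list version above, this returns a new array and does not modify its input.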

(b) Write a function called "sort_by_column" that takes as its input the array X and an integer k ($1\le k \le 8$) and sorts the rows of X by the values of column k. Numbers that were in the same row before should still be in the same row. Don't worry about ties: if there are ties, I don't care how you order them. 

In [14]:
def sort_by_column(X,k):
    """ sort rows of X by values in column k"""
    
    idx = X[:,k-1].argsort() # create index based on k-th column
    return X[idx,:]
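
The problem leaves the order of ties open. If a predictable order is wanted anyway, `argsort` accepts `kind="stable"`, which keeps tied rows in their original order. The sketch below uses the made-up name `sort_by_column_stable` and a small hand-written array so the result can be checked by eye.

```python
import numpy as np

def sort_by_column_stable(X, k):
    """Sort rows of X by column k (1-based), keeping tied rows in their original order."""
    idx = np.argsort(X[:, k - 1], kind="stable")
    return X[idx, :]

A = np.array([[3, 1],
              [1, 2],
              [3, 0],
              [2, 5]])
S = sort_by_column_stable(A, 1)
print(S)
# rows come out as [1, 2], [2, 5], [3, 1], [3, 0]: the two rows starting with 3 keep their order
```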