### CS4102 - Geometric Foundations of Data Analysis I
Prof. Götz Pfeiffer<br />
School of Mathematical and Statistical Sciences<br />
University of Galway

# Week 6

## 0. Questions

* How to compute the $r^2$ for the skin care example?
* How to compute simultaneous interval estimates?
* ...

## 1. Numpy - a quick overview

* `numpy` is *the* library for matrix algebra in Python.
* Usually, its name is abbreviated as `np`.

In [None]:
import numpy as np

###  Arrays

* The fundamental data type in numpy is `ndarray`, often just called 'array'.
* We can use this data type for vectors and matrices, and also for higher-dimensional tensors.

In [None]:
np.array([1,7,3,4,9,11])  #  a vector

In [None]:
np.array([[3,5],[-1,2]])  # a matrix

In [None]:
np.array([[[1,2],[3,4]],[[5,6],[7,8]]]) # a 3-dim'l tensor
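
* As a quick check (a small sketch, not one of the original cells), the attributes `ndim` and `shape` report how many dimensions an array has and how large it is in each of them. The names `v`, `m` and `t` are chosen here for illustration only.

```python
import numpy as np

v = np.array([1, 7, 3, 4, 9, 11])                    # a vector
m = np.array([[3, 5], [-1, 2]])                      # a matrix
t = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])   # a 3-dim'l tensor

for name, arr in [("v", v), ("m", m), ("t", t)]:
    # ndim = number of dimensions, shape = size in each dimension
    print(name, "ndim =", arr.ndim, "shape =", arr.shape)
```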

### Special Matrices

* There are commands for creating matrices of zeros or ones

In [None]:
np.zeros((3,2))

In [None]:
np.ones((3,2))
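
* A few related constructors may also be handy; the following sketch goes slightly beyond the two commands above and uses `np.eye`, `np.full` and `np.diag`. The names `I`, `F` and `D` are illustrative.

```python
import numpy as np

I = np.eye(3)              # 3x3 identity matrix
F = np.full((3, 2), 7.0)   # 3x2 matrix with every entry equal to 7
D = np.diag([1, 2, 3])     # diagonal matrix with the given diagonal entries

print(I)
print(F)
print(D)
```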

### Reshaping

* Each array has a shape: its size in each dimension.
* The shape of an array can be modified.

In [None]:
a = np.array(range(16))
a

In [None]:
a.reshape(4,4)

In [None]:
a.reshape(-1,8)

In [None]:
a.reshape((2,2,2,2))
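
* Two details about `reshape` are worth a quick experiment (a sketch, with `b` as an illustrative name): the value `-1` asks numpy to infer that dimension, and `reshape` returns a new array object rather than changing `a` itself. A shape that does not match the number of entries raises a `ValueError`.

```python
import numpy as np

a = np.arange(16)
b = a.reshape(-1, 8)   # -1 lets numpy infer this dimension: 16 / 8 = 2
print(b.shape)

# reshape does not change the shape of `a` itself
print(a.shape)

# the total number of entries must be preserved
try:
    a.reshape(3, 5)
except ValueError as err:
    print("cannot reshape:", err)
```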

### Floating Point Ranges: `arange` vs. `linspace`

In [None]:
np.arange(16)

In [None]:
np.arange(1,32,2)  # start, stop (exclusive), stepsize

In [None]:
np.arange(1, 1.3, 0.1)  # exclusive?  yes, but there are rounding errors

In [None]:
np.linspace(1, 1.3, 4)  # start, stop (inclusive), count
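
* The difference can be seen side by side in the following sketch. With a floating point step, the length of an `arange` result depends on how `(stop - start) / step` is rounded, whereas `linspace` takes the number of points explicitly. Scaling an integer range, as in the (illustrative) array `k`, is one way to keep the count under control.

```python
import numpy as np

r = np.arange(1, 1.3, 0.1)   # length depends on rounding of (stop - start) / step
s = np.linspace(1, 1.3, 4)   # exactly 4 points, both endpoints included
k = 1 + 0.1 * np.arange(3)   # integer range scaled afterwards: always 3 points

print(len(r), r)
print(len(s), s)
print(len(k), k)
```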

### Mathematical Operations

* We can use `+` and `-` for adding and subtracting matrices (of the same shape)

In [None]:
a = np.array(range(1,7)).reshape(2,3)
a

In [None]:
a + a

In [None]:
a - a

* We can use `*` to multiply a matrix with a scalar ...

In [None]:
3 * a

In [None]:
a * 0

* ... but not for matrix multiplication :-(
* The `*` operator applied to matrices $A = (a_{ij})$ and $B = (b_{ij})$ (of the same shape) yields their *Hadamard product*, that is, the matrix $C = (c_{ij})$ (of the same shape as $A$ and $B$) with $c_{ij} = a_{ij} b_{ij}$.

In [None]:
a * a
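
* For comparison, here is a small sketch contrasting the Hadamard product with the matrix product, which numpy writes as `@`. Since `a` has shape $2 \times 3$, it is multiplied by its transpose `a.T`. The names `H` and `P` are illustrative.

```python
import numpy as np

a = np.array(range(1, 7)).reshape(2, 3)

H = a * a      # Hadamard product: entrywise, shape (2, 3)
P = a @ a.T    # matrix product: (2, 3) times (3, 2) gives shape (2, 2)

print(H)
print(P)
```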

## 2. Hypothesis Testing: Skin Care Example

* We apply the same procedure as before, to find the coefficient of determination for the data in the file `cream.csv`.
* We start by importing the packages (`csv` and `numpy`).

In [None]:
import csv
import numpy as np

In [None]:
with open('cream.csv') as csvfile:
    rows = list(csv.DictReader(csvfile))

* The main difference is that the data file now has one more column of $x$-values.

In [None]:
rows[0]

* Still, we can build numpy arrays `X` and `Y`, representing the matrices $X$ and $Y$, in a similar fashion as before.

In [None]:
X = np.array([[1, row['xone'], row['xtwo']] for row in rows], dtype=float)
Y = np.array([[row['y']] for row in rows], dtype=float)

* The matrix formula for computing the coefficients $B = (b_0, b_1, b_2)^t$ of the least squares fit $y = b_0 + b_1 x_1 + b_2 x_2$ is still the same:
$$
  B = (X^t X)^{-1} X^t Y
$$
* And so is the sequence of steps used to compute it.

In [None]:
XtX = X.T @ X  # T for transpose, @ for matrix multiplication
XtY = X.T @ Y
B = np.linalg.inv(XtX) @ XtY
print(B)
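
* The explicit inverse mirrors the formula, but the same coefficients can also be obtained by solving the normal equations $(X^t X) B = X^t Y$ directly, or with numpy's least squares routine. The following sketch compares the three on made-up data (the random numbers below are an assumption for illustration and are **not** the contents of `cream.csv`).

```python
import numpy as np

# Hypothetical data (not cream.csv): 15 rows, two predictors
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, size=15)
x2 = rng.uniform(0, 5, size=15)
y = 2.0 + 0.5 * x1 - 1.5 * x2 + rng.normal(0, 0.3, size=15)

X = np.column_stack([np.ones(15), x1, x2])   # column of ones for b_0
Y = y.reshape(-1, 1)

B_inv = np.linalg.inv(X.T @ X) @ (X.T @ Y)        # formula as in the text
B_solve = np.linalg.solve(X.T @ X, X.T @ Y)       # solve (X^t X) B = X^t Y
B_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)   # least squares without forming X^t X

print(np.allclose(B_inv, B_solve), np.allclose(B_inv, B_lstsq))
```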

* From this, we compute the ingredients for the quantities SSE, SSR and SSTO as before ...

In [None]:
Yhat = X @ B
ybar = Y.mean()  # mean of the y-values, as a scalar
BtXtY = B.T @ XtY
YtY = Y.T @ Y
nybar2 = len(Y) * ybar**2

* ... and then the quantities themselves.

In [None]:
SSR = BtXtY - nybar2
SSE = YtY - BtXtY
SSTO = YtY - nybar2
SSR = SSR[0,0]
SSE = SSE[0,0]
SSTO = SSTO[0,0]
print("SSR =", SSR, ", SSE =", SSE, ", SSTO =", SSTO)

* Finally, $r^2 = \mathrm{SSR}/\mathrm{SSTO}$.

In [None]:
r2 = SSR/SSTO
r2
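
* As a sanity check, the three sums of squares can also be computed directly from their definitions, and they should satisfy $\mathrm{SSTO} = \mathrm{SSR} + \mathrm{SSE}$. The sketch below does this on made-up data (again an assumption for illustration, not the data in `cream.csv`).

```python
import numpy as np

# Hypothetical data, not cream.csv
rng = np.random.default_rng(1)
n = 15
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n), rng.uniform(0, 5, n)])
Y = X @ np.array([[2.0], [0.5], [-1.5]]) + rng.normal(0, 0.5, (n, 1))

B = np.linalg.solve(X.T @ X, X.T @ Y)
Yhat = X @ B
ybar = Y.mean()

SSE = float(np.sum((Y - Yhat) ** 2))      # sum of squared residuals
SSR = float(np.sum((Yhat - ybar) ** 2))   # variation explained by the fit
SSTO = float(np.sum((Y - ybar) ** 2))     # total variation

print("SSR + SSE =", SSR + SSE, ", SSTO =", SSTO)
print("r^2 =", SSR / SSTO, "=", 1 - SSE / SSTO)
```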

* Next, the F-test requires slightly modified quantities:
$$
\mathrm{MSR} = \frac{\mathrm{SSR}}{p-1}, \qquad
\mathrm{MSE} = \frac{\mathrm{SSE}}{n-p}, \qquad
F^* = \frac{\mathrm{MSR}}{\mathrm{MSE}}
$$
* Here, $p = 3$ and $n = 15$.
* Let's compute $F^*$!

In [None]:
n = len(X)
p = len(X[0])
print("n =", n, ", p =", p)
MSR = SSR/(p-1)
MSE = SSE/(n-p)
Fstar = MSR/MSE
Fstar

* Then, assuming that the errors $\epsilon_i$ are independent $N(0, \sigma^2)$, we choose a significance level $\alpha = 0.05$ and find the critical value $F(1 - \alpha; p-1, n-p)$, that is, the $(1-\alpha)$-quantile of the $F$-distribution with $p-1$ and $n-p$ degrees of freedom.
* This value can be found in a table, online or off-line, or with the help of the `scipy.stats` package

In [None]:
from scipy.stats import f
alpha = 0.05
f.ppf(1 - alpha, p-1, n-p)  # ppf = inverse CDF: the (1 - alpha)-quantile
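
* A critical value is a quantile, so in `scipy.stats` it comes from the inverse CDF `ppf`. Equivalently, one can compare the $p$-value `f.sf(Fstar, ...)` with $\alpha$. The sketch below uses $p = 3$ and $n = 15$ from the text, but the value of `Fstar` is a made-up number for illustration only.

```python
from scipy.stats import f

alpha = 0.05
p, n = 3, 15    # values from the text
Fstar = 10.0    # hypothetical test statistic, for illustration only

Fcrit = f.ppf(1 - alpha, p - 1, n - p)   # (1 - alpha)-quantile of F(p-1, n-p)
pvalue = f.sf(Fstar, p - 1, n - p)       # P(F > Fstar) = 1 - cdf(Fstar)

print("critical value:", Fcrit)
print("p-value:", pvalue)
print("reject:", Fstar > Fcrit, pvalue < alpha)
```

* Both criteria always agree: $F^*$ exceeds the critical value exactly when the $p$-value is below $\alpha$.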

* As this value is clearly smaller than $F^*$, we can reject the null hypothesis $\mathcal{C}_0$ at level $\alpha$.

* In order to quickly check whether the $\epsilon_i$ are independent and normally distributed, we can plot the residuals $e_i = y_i - \hat{y}_i$ (i) against the fitted values $\hat{y}_i$, (ii) against the input data $x_{i1}$, (iii) against the input data $x_{i2}$.
* The $x_{i1}$ reside in column $1$ of the array `X`, from where we can extract them as `X[:,1]`, using a *slice* (`:` for all) in the first dimension, and an index (`1` for column $1$) in the second dimension.

In [None]:
X[:,1]
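
* Slices and indices can be combined freely. The following sketch tries a few variants on a small made-up matrix (the entries and the names `col1`, `row0`, `tail`, `keep` are illustrative).

```python
import numpy as np

X = np.array([[1, 10, 100],
              [1, 20, 200],
              [1, 30, 300]])   # small made-up matrix

col1 = X[:, 1]     # all rows, column 1: 1-dim'l array of shape (3,)
row0 = X[0, :]     # row 0, all columns
tail = X[:, 1:]    # all rows, columns 1 and 2: shape (3, 2)
keep = X[:, [1]]   # a list as index keeps the column as a (3, 1) matrix

print(col1, row0)
print(tail.shape, keep.shape)
```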

* For plotting, we use the `matplotlib.pyplot` package under its nickname `plt`.

In [None]:
import matplotlib.pyplot as plt

In [None]:
Yhat = X @ B
E = Y - Yhat
plt.plot(Yhat, E, 'b.')

In [None]:
plt.plot(X[:,1], E, 'go')

In [None]:
plt.plot(X[:,2], E, 'r+')

* The **estimated covariance matrix** of the estimated coefficients $B$ in the least squares model 
$$
y_i = \beta_0 + \beta_1 x_{i,1} + \dots + \beta_{p-1} x_{i,p-1} + \epsilon_i
$$
is $S^2(B) = \mathrm{MSE} (X^t X)^{-1}$.

In [None]:
S2B = MSE * np.linalg.inv(XtX)
S2B

* Theory says that if $q$ of the $\beta_k$ are jointly estimated, the confidence intervals
with coefficient $1 - \alpha$ are
$$
b_k - T \cdot s(b_k) \leq \beta_k \leq b_k + T \cdot s(b_k),
$$
where $T = t(1 - \frac{\alpha}{2q}, n - p)$.
* This quantile of the $t$-distribution can be found in a table, online or off-line, or again with the help of the `scipy.stats` package.

In [None]:
from scipy.stats import t
q = 2
T = t.ppf(1 - alpha/2/q, n - p)  # ppf = inverse CDF: the quantile t(1 - alpha/(2q); n - p)
T

* So, when estimating $\beta_1$ and $\beta_2$ jointly, after extracting the values $s(b_k)$ as square roots of the diagonal values of the array `S2B`, we can find the *lower bounds* of the confidence intervals for the $\beta_k$ as follows.

In [None]:
SB = np.diagonal(S2B)**0.5
B[:,0] - T * SB

* And the *upper bounds*:

In [None]:
B[:,0] + T * SB
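
* Putting the pieces together, here is a self-contained sketch of the joint (Bonferroni) interval estimates, from the data matrix to the interval bounds. It runs on made-up data (an assumption for illustration, not the data in `cream.csv`), with $n = 15$, $p = 3$ and $q = 2$ as in the text.

```python
import numpy as np
from scipy.stats import t

# Hypothetical data, not cream.csv
rng = np.random.default_rng(2)
n, p = 15, 3
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n), rng.uniform(0, 5, n)])
Y = X @ np.array([[2.0], [0.5], [-1.5]]) + rng.normal(0, 0.5, (n, 1))

XtX_inv = np.linalg.inv(X.T @ X)
B = XtX_inv @ X.T @ Y                       # least squares coefficients
E = Y - X @ B                               # residuals
MSE = float(np.sum(E ** 2)) / (n - p)       # SSE / (n - p)
SB = np.sqrt(np.diagonal(MSE * XtX_inv))    # standard errors s(b_k)

alpha, q = 0.05, 2
T = t.ppf(1 - alpha / (2 * q), n - p)       # Bonferroni multiplier

lower = B[:, 0] - T * SB
upper = B[:, 0] + T * SB
for k in (1, 2):                            # the two jointly estimated slopes
    print(f"b_{k} = {B[k, 0]:.3f}, interval [{lower[k]:.3f}, {upper[k]:.3f}]")
```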