02-computations.Rmd

# Computational considerations


In this tutorial, we will explore some basic **R** commands and illustrate their use on the Auto dataset (`Auto`) from the `ISLR` package.


## Calculation of least square estimates

Consider as usual $\boldsymbol{y}$ and $n$-vector of response variables and a full-rank $n \times p$ design matrix $\mathbf{X}$. We are interested in finding the ordinary least square coefficient $\hat{\boldsymbol{\beta}}$, the fitted values $\hat{\boldsymbol{y}} =  \mathbf{X}\hat{\boldsymbol{\beta}}$ and the residuals $\boldsymbol{e} = \boldsymbol{y} - \mathbf{X}\boldsymbol{\beta}$.

Whereas orthogonal projection matrices are useful for theoretical derivations, they are not used for computations. Building $\mathbf{H}_{\mathbf{X}}$ involves a matrix inversion and the storage of an $n \times n$ matrix. In Exercise series 2, we looked at two matrix decompositions: a singular value decomposition (SVD) and a QR decomposition. These are more numerically stable than using the normal equations $(\mathbf{X}^\top\mathbf{X})\boldsymbol{\beta} = \mathbf{X}^\top\boldsymbol{y}$ (the condition number of the matrix $\mathbf{X}^\top\mathbf{X}$ is the square of that of $\mathbf{X}$ --- more on this later). 
The code related to the SVD and QR decompositions is provided for reference, so you can validate the derivations in the exercise. You won't need it in practice.

**Optional** material: for more details about the complexity and algorithms underlying the different methods, the reader is referred to these notes of [Lee](www.math.uchicago.edu/~may/REU2012/REUPapers/Lee.pdf). 

We can fit a simple linear model with an intercept and a linear effect for the weight,
$$ \texttt{mpg}_i = \beta_0 + \texttt{hp}_i\beta_1 +\varepsilon_i.$$

We form the design matrix $(\boldsymbol{1}_n^\top, \texttt{hp}^\top)^\top$ and the vector of regressand $\texttt{mpg}$, then proceed with calculating the OLS coefficients $\hat{\boldsymbol{\beta}}$, the fitted values $\hat{\boldsymbol{y}}$ and the residuals $\boldsymbol{e}$.

We can compute first the ordinary least square estimates using the formula $\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\boldsymbol{y}$. The fitted values are $\hat{\boldsymbol{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}$ and the residuals $\boldsymbol{e} = \boldsymbol{y} - \hat{\boldsymbol{y}}$.

```{r loadmtcars}
data(Auto, package = "ISLR")
y <- Auto$mpg
X <- cbind(1, Auto$horsepower)
n <- nrow(X)
p <- ncol(X)
# Estimation of beta_hat:
XtX <- crossprod(X)
Xty <- crossprod(X, y)
# Solve normal equations
beta_hat <- as.vector(solve(XtX, Xty))
#same as beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y

##Create residuals and fitted values
fitted <- as.vector(X %*% beta_hat)
res <- y - fitted
```


The residuals $\boldsymbol{e} = \boldsymbol{y} -\hat{\boldsymbol{y}}$ can be interpreted as the *vertical* distance between the regression slope and the observation. For each observation $y_i$, a vertical line at distance $e_i$ is drawn from the prediction $\hat{y}_i$. 

```{r verticaldist}
plot(mpg ~ horsepower,  data = Auto, 
     xlab = "Power of engine (hp)", 
     ylab = "Fuel economy (miles/US gallon)", 
     main = "Fuel economy of automobiles",
     ylim = c(0, 50),
     # the subsequent commands for `plot`  tweak the display
     # check for yourself the effect of removing them
     # bty = "l" gives L shaped graphical windows (not boxed)
     # pch = 20 gives full dots rather than empty circles for points
     bty = "l", pch = 20) 
#Line of best linear fit
abline(a = beta_hat[1], b = beta_hat[2])

#Residuals are vertical distance from line to 
for(i in 1:nrow(X)){
  segments(x0 = Auto$horsepower[i], y0 = fitted[i], y1 = fitted[i] + res[i], col = 2)
}
```

The same scatterplot, this time using `ggplot2`.

```{r ggplotmtcars}
library(ggplot2, warn.conflicts = FALSE, quietly = TRUE)
#Create data frame with segments
vlines <- data.frame(x1 = Auto$horsepower, y1 = fitted, y2 = fitted + res)
ggg <- ggplot(Auto, aes(x = horsepower, y = mpg)) + 
        geom_point() + 
        labs(x = "Power of engine (hp)", 
             y = "Fuel economy (miles/US gallon)", 
             title = "Fuel economy of automobiles") +
      geom_segment(aes(x = x1, y = y1, xend = x1, yend = y2, color = "red"), 
                   data = vlines, show.legend = FALSE) + 
      geom_abline(slope = beta_hat[2], intercept = beta_hat[1])
print(ggg)
```

### Interpretation of the coefficients

If the regression model is 
$$y_i = \beta_0 + \mathrm{x}_{i1}\beta_1 + \mathrm{x}_{i2}\beta_2 + \varepsilon_i,$$ the interpretation of $\beta_1$ in the linear model is as follows: a unit increase in $x$ leads to $\beta_1$ units increase in $y$, everything else (i.e., $\mathrm{x}_{i2}$) being held constant. 

For the `Auto` regression above, an increase of the power of the engine by one horsepower leads to an average decrease of `r abs(round(beta_hat[2], 2))` miles per US gallon in distance covered by the car. We could easily get an equivalent statement in terms of increase of the car fuel consumption for a given distance.

## The `lm` function 

The function `lm` is the workshorse for fitting linear models. It takes as input a formula: suppose you have a data frame containing columns `x` (a regressor) and `y` (the regressand); you can then call `lm(y ~ x)` to fit the linear model $y = \beta_0 + \beta_1x + \varepsilon$. The explanatory variable `y` is on the left hand side,
while the right hand side should contain the predictors, separated by a `+` sign if there are more than one.
If you provide the data frame name using `data`, then the shorthand `y ~ .` fits all the columns of the data frame  (but `y`) as regressors.

To fit higher order polynomials or transformations, use the `I` function to tell **R** to interpret the input "as is". 
Thus, `lm(y~x+I(x^2))`, would fit a linear model with design matrix $(\boldsymbol{1}_n, \mathbf{x}^\top, \mathbf{x}^2)^\top$. A constant is automatically included in the regression, but can be removed by writing `-1` or `+0` on the right hand side of the formula.

```{r}
# The function lm and its output
fit <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)
fit_summary <- summary(fit)
```

The `lm` output will display OLS estimates along with standard errors, $t$ values for the Wald test of the hypothesis $\mathrm{H}_0: \beta_i=0$ and the associated $P$-values. Other statistics and information about the sample size, the degrees of freedom, etc., are given at the bottom of the table.

Many methods allow you to extract specific objects. For example, the functions `coef`, `resid`, `fitted`, `model.matrix` will return $\hat{\boldsymbol{\beta}}$, $\boldsymbol{e}$, $\hat{\boldsymbol{y}}$ and $\mathbf{X}$, respectively.

```{r} 
names(fit)
names(fit_summary)
```


The following simply illustrates what has been derived in Exercise series 2.  **R** has devoted functions that are coded more efficiently. 

### Singular value decomposition


The SVD decomposition in **R** returns a list with elements `u`, `d` and `v`. `u` is the orthonormal $n \times p$ matrix, `d` is a vector containing the diagonal elements of $\mathbf{D}$ and `v` is the $p \times p$ orthogonal matrix. Recall that the decomposition is 
\[\mathbf{X} = \mathbf{UDV}^\top\]
and that $\mathbf{VV}^\top= \mathbf{V}^\top\mathbf{V}=\mathbf{U}^\top\mathbf{U}=\mathbf{I}_p$. The matrix $\mathbf{D}$ contains the singular values of $\mathbf{X}$, and the diagonal elements $\mathrm{d}_{ii}^2$ corresponds to the (ordered) eigenvalues of $\mathbf{X}^\top\mathbf{X}$.

The following shows how to use the SVD decomposition in **R**. This material is **optional** and provided for reference only.

```{r svd}
svdX <- svd(X)
# Projection matrix
Hx <- tcrossprod(svdX$u)
# t(U) %*% U gives p by p identity matrix
all.equal(crossprod(svdX$u), diag(p))
# V is an orthogonal matrix
all.equal(tcrossprod(svdX$v), diag(p))
all.equal(crossprod(svdX$v), diag(p))
# D contains singular values
all.equal(svdX$d^2, eigen(XtX, only.values = TRUE)$values)
# OLS coefficient from SVD
beta_hat_svd <- c(svdX$v %*%  diag(1/svdX$d) %*% t(svdX$u) %*% y)
all.equal(beta_hat_svd, beta_hat)
```


### QR decomposition

**R** uses a QR-decomposition to calculate the OLS estimates in the function `lm`. There are specific functions to return coefficients, fitted values and residuals. One can also obtain the $n \times p$ matrix $\mathbf{Q}_1$ and the upper triangular $p \times p$ matrix $\mathbf{R}$ from the thinned QR decomposition, 
\[\mathbf{X} = \mathbf{Q}_1\mathbf{R}.\]

The following shows how to use the QR decomposition in **R**. This material is **optional** and provided for reference only.

```{r qr}
Hx <- X %*% solve(crossprod(X)) %*% t(X)
qrX <- qr(X)
Q1 <- qr.Q(qrX)
R <- qr.R(qrX)
# Compute beta_hat from QR
beta_hat_qr1 <- qr.coef(qrX, y) #using built-in function
beta_hat_qr2 <- c(backsolve(R, t(Q1) %*% y)) #manually
all.equal(beta_hat, beta_hat_qr1, check.attributes = FALSE)
all.equal(beta_hat, beta_hat_qr2, check.attributes = FALSE)
# Compute residuals
qre <- qr.resid(qrX, y)
all.equal(qre, c(y - X %*% beta_hat), check.attributes = FALSE)
# Compute fitted values
qryhat <- qr.fitted(qrX, y)
all.equal(qryhat, c(X %*% beta_hat), check.attributes = FALSE)
# Compute orthogonal projection matrix
qrHx <- tcrossprod(Q1)
all.equal(qrHx, Hx)
```

## The hyperplane of fitted values

In class, we presented a linear model for the `Auto` dataset of the form
$$\mathsf{mpg}_i = \beta_0 + \beta_1 \mathsf{hp}_i + \beta_2 \mathsf{hp}_i^2 + \varepsilon_i$$ 
and claimed this was a linear model. This is indeed true because we can form the design matrix $[\mathbf{1}_n,  \mathsf{hp}, \mathsf{hp}^2]$ and obtain coefficients $\hbb$. The graphical depiction is counterintuitive.

```{r, echo = FALSE}
library(ggplot2)
data(Auto, package = "ISLR")
mod <- lm(mpg ~ horsepower + I(horsepower^2),  data = Auto)
ggplot(data = Auto, aes(x = horsepower, y = mpg)) + 
  geom_point() + 
  labs(x = "Power of engine (hp)", 
             y = "Fuel economy (miles/US gallon)", 
             title = "Fuel economy of automobiles") +
      geom_line(data = data.frame(hp = Auto$horsepower, fitted = mod$fitted),
                aes(hp, fitted, col = "red"), show.legend = FALSE)
```

This quadratic curve is nothing like an hyperplane! Let $\bs{y} \equiv \texttt{mpg}$, $\mathsf{x} = \texttt{hp}$ and $\mathsf{z} = \texttt{hp}^2$. But recall that we are working in three dimensions (the intercept gives the height of the hyperplane) and the coordinates of our hyperplane are  
$$\beta_0 + \beta_1x-y +\beta_2z =0.$$
However, the observations will always be such that $z = x^2$, so our fitted values will lie on a one-dimensional subspace of this hyperplane.

The following 3D depiction hopefully captures this better and shows the fitted hyperplane along with the line on which all the ($x_i, z_i$) observations lie.

```{r hyperplane_config, echo = FALSE, eval = FALSE}
library(knitr)
knit_hooks$set(rgl = hook_rgl)
```

```{r hyperplane, echo = FALSE}
library(rgl)
plot3d(y = Auto$mpg, x = Auto$horsepower, z = I(Auto$horsepower^2),
          xlab = "Power of engine (hp)", 
          ylab = "Fuel economy (miles/US gallon)", 
          zlab = expression(paste("squared power (", hp^2,")")),
          axis.col = rep("black", 3))
ols <- coef(mod)
ran <- range(Auto$horsepower)
hor_seq <- seq(from = ran[1], to = ran[2], length = 1000)
hor2_seq <- hor_seq^2
mpg_seq <- ols[1] + ols[2]*hor_seq + ols[3]*hor2_seq

points3d(x = hor_seq, z = hor2_seq, y = mpg_seq, col = "red")
planes3d(a = ols[2], c = ols[3], b = -1, d = ols[1], alpha = 0.1)
rglwidget()
```


## (Centered) coefficient of determination

Recall the decomposition of observations into fitted and residual vectors,
$$\boldsymbol{y} = (\boldsymbol{y} - \mX\hbb) + \mX \hbb = \bs{e} + \hat{\bs{y}}$$
where $\bs{e} \equiv \Mmat_{\mX}\bs{y} \perp \hat{\bs{y}} \equiv \Hmat_{\mX}\bs{y}$.


The centered coefficient of determination, $R^2_c$ measures the proportion of variation explained by the centered fitted values relative to the centered observations, i.e.,
$$ R^2_c = \frac{\|\hat{\bs{y}}-\bar{y}\mathbf{1}_n\|^2}{\|\bs{y}-\bar{y}\mathbf{1}_n\|^2}=\frac{\|\hat{\bs{y}}\|^2-\|\bar{y}\mathbf{1}_n\|^2}{\|\bs{y}\|^2-\|\bar{y}\mathbf{1}_n\|^2}.$$
since the vectors $\bar{y}\mathbf{1}_n \perp \hat{\bs{y}}-\bar{y}\mathbf{1}_n$.

Provided that $\mathbf{1}_n \in \Sp(\mX)$, it is obvious that the fitted values $\hat{\bs{y}}$ are invariant to linear transformations of the covariates $\mathbf{X}$ (by which I mean you can transform the design matrix column by column, with $\mathbf{x}_i \mapsto \alpha_i+\mathbf{x}_i\gamma_i$ for $i=1, \ldots, p$). Multiplicative changes in $\bs{y}$ lead to an equivalent change in $\bs{e}$ and $\hat{\bs{y}}$. However, location-changes in $\bs{y}$ are only reflected in $\hat{\bs{y}}$ (they are absorbed by the intercept). This is why $R^2$ is not invariant to location-changes in the response, since the ratio $\|\hat{\bs{y}}\|^2/\|\bs{y}\|^2$ increases to 1 if $\by \mapsto \by + a \mathbf{1}_n$.

This invariance is precisely the reason we dismissed $R^2$. For example, a change of units from Farenheit to Celcius, viz. $T_c = 5 (T_F - 32)/9$, leads to different values of $R^2$:

```{r faraway}
data(aatemp, package = "faraway")
plot(temp ~ year, data = aatemp, ylab = "Temperature (in F)", bty = "l")
#Form design matrix and two response vectors
yF <- aatemp$temp
n <- length(yF)
yC <- 5/9*(aatemp$temp - 32)
X <- cbind(1, aatemp$year)
# Obtain OLS coefficients and fitted values
XtX <- solve(crossprod(X))
beta_hat_F <- XtX %*% crossprod(X, yF)
abline(a = beta_hat_F[1], b = beta_hat_F[2])
beta_hat_C <- XtX %*% crossprod(X, yC)
fitted_F <- X %*% beta_hat_F
fitted_C <- X %*% beta_hat_C
# Compute coefficient of determination
R2_F <- sum(fitted_F^2)/sum(yF^2)
R2_C <-  sum(fitted_C^2)/sum(yC^2)
#Centered R^2
R2c_F <- sum((fitted_F-mean(yF))^2)/sum((yF-mean(yF))^2)
R2c_C <-  sum((fitted_C-mean(yC))^2)/sum((yC-mean(yC))^2)
isTRUE(all.equal(R2c_F, R2c_C))
```

The difference $R^2(F)-R^2(C)=$ `r round(R2_F-R2_C,5)` is small because the $R^2$ value is very high, but the coefficient itself is also meaningless. In this example, $R^2(F)=$ `r round(R2_F,4)`, which seems to indicate excellent fit but in fact only `r round(100*R2c_C, 2)`% of the variability is explained by year and we do an equally good job by simply taking $\hat{y}_i=\bar{y}$. 

$R^2_c$ makes the comparison between the adjusted linear model and the null model with only a constant, which predicts each $y_i (i=1, \ldots, n)$ by the average $\bar{y}$.

If $R^2_c$ gives a very rough overview of how much explanatory power $\mX$ has, it is not a panacea. If we add new covariates in $\mX$, the value of $R^2_c$ necessarily increases. In the most extreme scenario, we could add a set of $n-p$ linearly independent vectors to $\mX$ and form a new design matrix $mX^*$ with those. The fitted values from running a regression with $\mX^*$ will be exactly equal to the observations $\bs{y}$ and thus $R^2_c=1$. However, I hope it is clear that this model will _not_ be useful. Overfitting leads to poor predictive performance; if we get a new set of $\mathbf{x}_*$, we would predict the unobserved $y_*$ using its conditional average $\mathbf{x}_i^*\hbb$ and this estimate will be rubish if we included too many meaningless covariates. 

Other versions of $R^2_c$ exist that include a penalty term for the number of covariates; these are not widely used and can be negative in extreme cases. We will cover better goodness-of-fit diagnostics later in the course.


```{block2, type = "rmdcaution"}
In **R**, the function `lm` returns $R^2_c$ by default (in the `summary` table, under the label `Multiple R-squared`. However, if you remove the intercept, you will get $R^2$ without warning!
Contrast
```

```{r}
mod <- lm(mpg ~ horsepower, data = Auto)
rsqc_lm <- summary(mod)$r.squared
#same model, now X = [1 horsepower] and y = mpg
X <- cbind(1, Auto$horsepower)
y <- Auto$mpg
rsq_lm <- summary(lm(y ~ X - 1))$r.squared 

#Compute quantities manually
rsqc_man <- c(crossprod(fitted(mod) - mean(y)) / crossprod(y - mean(y)))
isTRUE(all.equal(rsqc_man, rsqc_lm))
rsq_man <- c(crossprod(fitted(mod))/crossprod(y))
isTRUE(all.equal(rsq_man, rsq_lm))
```

## Summary of week 2

If $\mathbf{X}$ is an $n \times p$ design matrix containing _covariates_ and $\boldsymbol{Y}$ is our response variable, we can obtain the _ordinary least squares_ (OLS) coefficients for the linear model 
$$\boldsymbol{y} = \mathbf{X}\bbeta + \beps, \qquad \mathrm{E}(\beps)=\boldsymbol{0}_n,$$
by projecting $\boldsymbol{y}$ on to $\mathbf{X}$; it follows that 
$$\mX\hat{\boldsymbol{\beta}}=\mX(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\by$$ and 
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\by.$$


The dual interpretation (which is used for graphical diagnostics), is the row geometry: each row corresponds to an individual and the response is a $1$ dimensional point.
$\hbb$ describes the parameters of the hyperplane that minimizes the sum of squared Euclidean vertical distances between the fitted value $\hat{y}_i$ and the response $y_i$. 
The problem is best written using vector-matrix notation, so

$$ \mathrm{argmin}_{\bbeta} \sum_{i=1}^n (y_i- \mathbf{x}_i\bbeta)^2 \equiv \mathrm{argmin}_{\bbeta} (\bs{y} - \mX\bbeta)^\top(\bs{y}-\mX\bbeta) \equiv \bs{e}^\top\bs{e}.
$$

The solution to the OLS problem has a dual interpretation in the column geometry, in which we treat the vector of stacked observations $(y_1, \ldots, y_n)^\top$ (respectively the vertical distances $(e_1, \ldots, e_n)^\top$) as elements of $\mathbb{R}^n$. There, the response $\bs{y}$ space can be decomposed into _fitted values_ $\hat{\by} \equiv \Hmat_{\mX} = \mX\hbb$ and _residuals_ $\bs{e} = \Mmat_{\mX} = \bs{y} - \mX\hbb$. By construction, $\bs{e} \perp \hat{\by}$.

We therefore get $$\bs{y} = \hat{\bs{y}} + \bs{e}$$
and since these form a right-angled triangle, Pythagoras' theorem can be used to show that 
$\|\bs{y}\|^2 = \|\hat{\bs{y}}\|^2 + \|\bs{e}\|^2.$


## Solutions


The following questions refer to the dataset `prostate` from the package `ElemStatLearn`.

a. Briefly describe the dataset.
b. Look at summaries of `lbph`. What likely value was imputed in places of zeros in `lbph} (before taking the logarithm)?
c. Produce a plot of the pair of variables `lcavol` and `lpsa` on the log and on the original scale. Comment on the relationship between `lcavol` and `lpsa`.
d.  Fit a linear model using the log cancer volume as response variable, including a constant and the log prostate specific antigen as covariates. Obtain numerically the OLS estimates $\hbb$ of the parameters, the fitted values $\hat{\bs{y}}$ and the residuals $\bs{e}$ using the formulas given in class. 
e. Compare the quantities you obtained with the output of the function `lm`.
f. Add the fitted regression line to the scatterplot of `lcavol` against `lpsa`.
g.  Interpret the changes in cancer volume (not the log cancer volume), including any units in your interpretations.
h.  Obtain the orthogonal projection matrix $\mathbf{H}_\mX$ and the OLS coefficients $\hbb$ using a SVD decomposition of $\mX$ (`svd`).
i.  Compute the $R^2_c$ coefficient and compare with the one in `summary` output of the `lm` function. What can you say about the explanatory power of the covariate `lpsa`?


### Exercise 3.5 - Prostate cancer

The following questions refer to the dataset `prostate` from the package `ElemStatLearn`.

a. Briefly describe the data set.

Running `?ElemStatLearn::prostate` gives the help file for the data set. Since we will be coming back to this example, detailed informations are provided below.

This data set was extracted from 

> Stamey, T.A., Kabalin, J.N., McNeal, J.E., Johnstone, I.M., Freiha, F., Redwine, E.A. and Yang, N. (1989)
Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate: II. radical prostatectomy treated 
patients, Journal of Urology 141(5), 1076–1083.

This data set is described in Wakefield (2013), pp. 5-6.                         

> The data were collected on $n=97$ men before radical prostatectomy, a major surgical operation that 
removes the entire prostate gland along with some surrounding tissue.  

> In Stamey et al. (1989), prostate specific antigen 
(PSA) was proposed as a preoperative marker to predict the clinical stage of cancer. As well as modeling the stage of cancer as a 
function of PSA, the authors also examined PSA as a function of age and seven other histological and morphometric covariates. 

> The BPH and capsular penetration variables originally contained zeros, and a small number was substituted before the log transform was taken. It is not clear from the original paper why the log transform was taken though PSA varies over a wide range, and so linearity 
of the mean model may be aided by the log transform. It is also not clear why the variable PGS45 was constructed.

The data set contains the following variables:

- `lcavol`: log of cancer volume, measured in milliliters (cc). The area of cancer was measured from digitized images and 
multiplied by a thickness to produce a volume.
- `lweight`: log of the prostate weight, measured in grams.
- `age`: The age of the patient, in years.
- `lbph`: log of the amount of benign prostatic hyperplasia (BPH), a noncancerous enlargement of the prostate gland, as 
an area in a digitized image and reported in cm${}^2$.
- `svi`: seminal vesicle invasion, a 0/1 indicator of whether prostate cancer cells have invaded the seminal vesicle.
- `lcp`: log of the capsular penetration, which represents the level of extension of cancer into the capsule (the fibrous 
tissue which acts as an outer lining of the prostate gland), measured as the linear extent of penetration, in cm.
- `gleason`: Gleason score, a measure of the degree of aggressiveness of the tumor. The Gleason grading system assigns a 
grade (1–5) to each of the two largest areas of cancer in the tissue samples with 1 being the least aggressive and 5 the most 
aggressive; the two grades are then added together to produce the Gleason score.
- `pgg45`: percentage of Gleason scores that are 4 or 5.
- `lpsa`: log of prostate specific antigen (PSA), a concentration measured in ng/m

To load the data set, use

```{r prostate_question_b}
#Install package if you get an error message
#install.packages("ElemStatLearn")
data(prostate, package = "ElemStatLearn")
?ElemStatLearn::prostate
attach(prostate) 
```
The command `attach` allows you to access column (variables) without using `$` by adding the columns of the data frame to your work environment. **Always** detach the data once you are done with your analysis to avoid overriding or hidding variables.

b. Look at summaries of `lbph`. What likely value was imputed in places of zeros in `lbph` (before taking the logarithm)?

```{r prostate_question_b2}
bph <- exp(lbph) 
head(bph) #look up first elements
min(bph) #return minimum
hist(bph, main = "Histogram", xlab = "benign prostatic hyperplasia")
rug(bph) 
#histogram, with lines below where the observations are
```

It seems likely that in order to take a logarithm, zeros were changed to 0.25. As such, we have to be careful with the interpretation of this coefficient if we include `bph` in the regression.


b. Produce a plot of the pair of variables `lcavol` and `lpsa` on the log and on the original scale. Comment on the relationship between `lcavol` and `lpsa`.

```{r question_b, eval = FALSE}
par(mfrow = c(1, 2)) #graphical parameters: two graphs per window
#Function plot is plot(x = , y = ) or plot(y ~ x)
#this works for vectors! (error message otherwise)
plot(exp(lpsa) ~ exp(lcavol),
xlab = "Cancer volume (milliliters per cc)", #y-axis label
ylab = "prostate specific antigen (ng/ml)", #x-axis label
main = "Prostate cancer dataset", #title
bty = "l", pch = 20) #bty: remove box, only x-y axis
#pch: type of plotting symbol (small filled circle)
plot(x = lcavol, y = lpsa,
xlab = "cancer volume (milliliters per cc), log scale",
ylab = "prostate specific antigen (ng/ml), log scale", 
main = "Prostate cancer dataset",
bty = "l", pch = 20)

hist(exp(lcavol), xlab = "cancer volume (milliliters per cc)", main = "Histogram")
rug(exp(lcavol))
hist(exp(lpsa), xlab = "prostate specific antigen (ng/ml)", main = "Histogram")
rug(exp(lpsa))
```


With `ggplot2`, the same graphs

```{r}
library(ggplot2)

ggplot(data = prostate, aes(x = lcavol, y = lpsa)) + 
  geom_point() +
  labs(y = "prostate specific antigen (ng/ml), log scale",
       x = "cancer volume (milliliters per cc), log scale",
       title = "Prostate cancer dataset")

ggplot(data = prostate, aes(x = exp(lcavol), y = exp(lpsa))) + 
  geom_point() +
  labs(y = "prostate specific antigen (ng/ml)",
       x = "cancer volume (milliliters per cc)",
       title = "Prostate cancer dataset")

ggplot(data = prostate, aes(x = exp(lcavol))) + 
  geom_histogram(bins = 30) + geom_rug() + 
  labs(x = "cancer volume (milliliters per cc)",
       title = "Histogram")

```


We can see that both variables are positive and positively skewed, so a log transform may lead to a more linear relationship, as indicated by the pairs plot. A multiplicative model on the original scale is thus reasonable.

d. Fit a linear model using the log prostate specific antigen as response variable, including a constant and the log cancer volume as covariates. Obtain numerically the OLS estimates $\hat{\boldsymbol{\beta}}$ of the parameters, the fitted values $\hat{\boldsymbol{y}}$ and the residuals $\boldsymbol{e}$ using the formulae given in class.

```{r question_c}
fit <- lm(lpsa ~ lcavol, data = prostate)
summary(fit)

#Create response vector and design matrix
y <- lpsa
X <- cbind(1, lcavol)
#Create function to compute coefs "by hand"
coefs_vals <- function(x, y){
  c(solve(crossprod(x), crossprod(x, y)))
}
# Compute coefficients, fitted values and residuals
beta_hat <- coefs_vals(x = X, y = lpsa)
yhat <- c(X %*% beta_hat)
e <- y - yhat
```

The function `lm` fits a linear model by least squares to a dataset. The function `summary` will return coefficient estimates, standard errors and various other statistics and print them in the console. 

The formula for `lm` must be of the form `y ~ `, and any combination of the variables appearing on the right hand side of the `~` will be added as new columns of the design matrix. By default, the latter includes a column of ones. To remove it, use `+0` or `-1`. If you have two covariates `x1` and `x2`, the model `x1+x2` will have for $i$th row $(1, x_{i1}, x_{i2})$, while the model `x1+x2+x1:x2`$\equiv$`x1*x2` will include an *interaction* term `x1:x2`. The latter just means product, so the $i$th row of the design matrix would be $(1, x_{i1}, x_{i2}, x_{i1}x_{i2})$. **R** will drop any collinear vectors, warn you and report `NA` in the summary output.

e. Compare the quantities you obtained in the last question with the output of the function `lm`.

```{r, echo = FALSE}
beta_hat # equivalent to print(beta_hat)
coef(fit) # coefficients from object of class `lm`
isTRUE(all.equal(beta_hat, coef(fit), check.attributes = FALSE))
isTRUE(all.equal(c(yhat), fitted(fit), check.attributes = FALSE))
isTRUE(all.equal(e, resid(fit), check.attributes = FALSE))
```

f. Add the fitted regression line to the scatterplot of lcavol against lpsa .

```{r question_e, eval = FALSE}
par(mfrow = c(1, 1))
plot(lpsa ~ lcavol, data = prostate,
  xlab = "Cancer volume (milliliters per cc), log scale",
  ylab = "prostate specific antigen (ng/ml), log scale", 
  main = "Prostate cancer dataset",
  bty = "l", pch = 20)
abline(fit, lwd = 2) #simply add regression line, lwd is line width
```

```{r question_eggplot2}
ggplot(data = prostate, aes(x = lcavol, y = lpsa)) + 
  geom_point() +
  labs(y = "prostate specific antigen (ng/ml), log scale",
       x = "cancer volume (milliliters per cc), log scale",
       title = "Prostate cancer dataset") + 
  geom_smooth(method = "lm", se = FALSE)


```

g. Interpret the changes in prostate specific antigen (not the log prostate specific antigen), including any units in your interpretations.

The interpretation is as follows. We fit
\[\log(\texttt{psa}_i) = \beta_0 + \beta_1 \log(\texttt{cavol}_i) + \varepsilon_i.\]

On the original scale, this translates into the multiplicative model $\texttt{psa}_i= \exp^{\beta_0}\texttt{cavol}_i^{\beta_1}\exp(\varepsilon_i)$.
The effect of an increase of  the volume of cancer of prostate cancer by one milliliter per cubic centimeter depends on the size of the latter of $\texttt{cavol}$, $(\texttt{cavol}_1/\texttt{cavol}_2)^{\beta_1}$ for levels $\texttt{cavol}_1$ and $\texttt{cavol}_2$.
For example, an increase of the cancer volume from 2 ml per cc to 3 ml per cc leads to an increase of the concentration of PSA of `r round(3^beta_hat[2]/2^beta_hat[2], 2)` ng/ml.


h. Using the results of Exercise 4.2, obtain the orthogonal projection matrix $\mathbf{H}_{\mathbf{X}}$ and $\hat{\boldsymbol{\beta}}$ using a SVD decomposition (`svd`). Check your output.

```{r, question_h}
#Hat matrix
Hmat <- X %*% solve(crossprod(X)) %*% t(X)
#SVD decomposition of X
svdX <- svd(X)
#OLS coefficients
beta_hat_svd <- svdX$v %*% (t(svdX$u) %*% lpsa / svdX$d)
Hmat_svd <- tcrossprod(svdX$u)
#Check that both quantities are equal
all.equal(Hmat, Hmat_svd, check.attributes = FALSE) 
#use check.attributes = FALSE 
#if you want to compare only the values
#and not e.g. the column names
all.equal(c(beta_hat_svd), beta_hat)
```


i. Compute the $R^2_c$ coefficient and compare with the one in summary output of the `lm` function. What
can you say about the explanatory power of the covariate `lpsa` ?

```{r rsqc}
R2c <- sum((yhat-mean(y))^2)/sum((y-mean(y))^2)
R2c_lm <- summary(fit)$r.squared #this is centered version
all.equal(R2c, R2c_lm)

#Detach prostate from environment
detach(prostate)
```

The value of $R^2_c$ is about `r round(R2c, 2)`, so about half the variability can be explained by the model. There is reasonable explanatory power. Note that presence of cancer causes the prostate specific antigens to increase (not the other way around!). A linear model could nevertheless be sensible here if we wished to obtain a non-invasive detector for predicting presence/absence of cancer, assuming the antigen is present in blood samples, but that detection of cancer would require otherwise a biopsy.

j. Perform an explanatory data analysis of the `prostate` data set. Summarize your findings.

Here are some of the most important features we could detect by looking at the description, plots and summaries of the data set.

- Goal of the study: "PSA was proposed as a preoperative marker to predict the clinical stage of cancer"; 
- Individuals in the data set form a subset of the population of the study; the subset consists of men about to undergo radical prostatectomy. This implies they are in late stages of prostate cancer (see `gleason`);
- No missing values, no obvious outlier;
- Many variables are given on the log scale, potentially to remove the skewness (`lcavol`, `lweight`, `lcp`, `lbph`). This makes sense for volume (why?), less so for other variables;
- The most relevant explanatory variables are cancer volume (`lcavol`), weight (`lweight`) and SVI (`svi`);
- It is not clear why and how `pgg45` was constructed;
- 0.25 was added to benign prostatic hyperplasia and capsular penetration before taking the log-transform (to get `lbph` and `lcp`). It would perhaps be more adequate for interpretability to transform the capsular penetration back to the original scale;
- The weight of the tumor (`lweight`) is correlated with benign prostatic hyperplasia (consider an interaction term);
- Gleason is an ordered categorical, so it makes sense to cast it to a factor, with categories 6, 7 and (8,9). The seminal vesicle invasion (`svi`) is already binary;
- Obs. 37 is the only one with a Gleason score of 8 --- keeping it leads to perfect fit for this data point and will lead to problems with cross-validation if `gleason` is included as a factor in the mean model;


Note that we cannot say anything about the distribution of the response `exp(lpsa)`, because the Gaussian linear model assumes the mean is $\mathbf{X}\boldsymbol{\beta}$ (so the skewness could be removed through the inclusion of covariates). Rather, one could fit both models on `exp(lpsa)` and `lpsa` scale and compare the diagnostics for the residuals.