### 3.1.5 Derive the PC in Practice
---

In practice, the VCV matrix $\pmb C_X$ of the $X_i$'s is usually unknown and must be estimated. From equation *(a.1)* it is known that the means of the $X_i$'s can be estimated without bias, so every matrix $\pmb X$ can be transformed into a matrix whose columns come from random variables with zero mean. From equation *(a.2)* it is known that the VCV matrix of the $X_i$'s can then be estimated by $\frac{1}{n-1} \pmb X' \pmb X$. This estimate can be used to solve the optimization problem in *(1.3)* with the estimated variances and covariances of the $X_i$'s. However, the same approach as in *(a.2)* can also be applied directly to the $Z_m$'s, whose VCV matrix is estimated by $\frac{1}{n-1} \hat{\pmb Z}' \hat{\pmb Z}$. I use hat notation for the $\pmb Z$ matrices because the variance of $Z_m$ in equation *(2.3)* is estimated, so $\phi_m$ is also estimated and is therefore a random variable. Consequently, $z_m$ is estimated as well, by replacing the true value of $\phi_m$ in equation *(2.2)* with its estimate. Accordingly, I denote the estimated vectors by $\hat{\phi}_m$ and $\hat{z}_m$. \
In the next step, the columns of $\hat{\pmb \phi}$ are computed recursively. First, the vector $\hat{\phi}_1$ is computed in the same way as in *(2.3)*, but with the estimated variance. Hence, the empirical maximization problem is

\begin{align}
\hat{\phi}_1 = \arg \max_{||w|| = 1} \left( \frac{1}{n-1} \hat{z}'_1 \hat{z}_1 \right)= \arg \max_{||w|| = 1} \left( \hat{z}'_1 \hat{z}_1 \right)= \arg \max_{||w|| = 1} \left( (\pmb X w)'(\pmb X w) \right) = \arg \max_{||w|| = 1} \left( w' \pmb X' \pmb X w\right)
\tag{3.11}
\end{align}

From here onward, the solution is the same as in the theoretical case, except that estimated quantities are used. I have therefore moved the detailed derivation to the appendix (*A.4*). The estimated matrix $\hat{\pmb \phi}$ consists of the eigenvectors of the estimated VCV matrix of the $X_i$'s.

\begin{align}
\hat{\pmb \phi} = \begin{pmatrix} \hat{\phi_1} & \hat{\phi_2 }& \dots & \hat{\phi_M} \end{pmatrix} = \begin{pmatrix} \hat{v_1} & \hat{v_2 }& \dots & \hat{v_M} \end{pmatrix}
\tag{3.12}
\end{align}

The estimated principal components are computed as in *(2.2)* replacing the true $\pmb \phi$ by the estimate $\hat{\pmb \phi}$, such that $\hat{\pmb Z} = \pmb X \hat{\pmb \phi}$ is a stochastic matrix. Note that by the same reasoning as in subsection *2.1.2* the estimated $\hat{\pmb \phi}$ matrix is again not unique and its class is denoted $[\hat{\phi}]$.
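
The steps of this subsection can be sketched numerically. The following is a minimal illustration, not the implementation used in this paper: the function name `estimate_pcs` and the simulated data are assumptions made for the example, and the columns of $\pmb X$ are centered explicitly before $\frac{1}{n-1} \pmb X' \pmb X$ is formed.

```python
import numpy as np

def estimate_pcs(X):
    """Hypothetical helper: estimated eigenvalues, phi_hat and Z_hat for an n x p matrix X."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    Xc = X - X.mean(axis=0)                   # center every column (estimated means removed)
    C_hat = Xc.T @ Xc / (n - 1)               # estimated VCV matrix of the X_i's
    eigvals, eigvecs = np.linalg.eigh(C_hat)  # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]         # reorder so the largest eigenvalue comes first
    lam_hat = eigvals[order]
    phi_hat = eigvecs[:, order]               # columns are the estimated eigenvectors
    Z_hat = Xc @ phi_hat                      # estimated principal components
    return lam_hat, phi_hat, Z_hat

# Simulated data, purely for illustration
rng = np.random.default_rng(0)
mix = np.array([[2.0, 0.5, 0.0],
                [0.0, 1.0, 0.3],
                [0.0, 0.0, 0.5]])
X = rng.normal(size=(200, 3)) @ mix
lam_hat, phi_hat, Z_hat = estimate_pcs(X)

# Estimated VCV matrix of the Z_m's: diagonal, with the eigenvalues on the diagonal
C_Z = Z_hat.T @ Z_hat / (X.shape[0] - 1)
```

The sign of each column returned by `np.linalg.eigh` is arbitrary, which is one concrete way in which $\hat{\pmb \phi}$ is not unique.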

## 3.2 Principal Component Regression (PCR)
---

So far, it has been shown how to compute the principal components. Since one is usually not interested only in the principal components themselves, I will now show how they can be used for prediction. Note that inference on the $x_i$ is not feasible with PCR, since the original structure of the data is lost.
To illustrate the intuition behind PCR, I add some assumptions. The data generating process (DGP) is assumed to be linear, as in OLS, such that

\begin{align}
Y = \pmb X \gamma + \varepsilon,
\tag{3.13}
\end{align}

where $\varepsilon \in \mathbb{M}_{n \times 1}$ is a vector of uncorrelated error terms with mean zero that, in general, follows some specific distribution, and $\gamma \in \mathbb{R}^p$. In most cases, and in the subsequent application, this distribution is normal. Since the distribution of the errors is not of major concern in this paper, I assume throughout that any vector of error terms follows $\varepsilon \sim \mathcal{N}\left(\pmb 0, \sigma_x^2 \pmb I_{n \times n} \right)$.

The idea of PCR is that $Y$ can be estimated using the $M$ constructed principal components. Recall that $1 \leq M \leq p$ to ensure the independence of the $Z_m$. The advantage over an ordinary OLS model is that a small number of principal components is often sufficient to explain the variability in $X$ and its correlation with $Y$. It is thus assumed that the directions in which the $x_i$ show the most variation are the directions that are associated with $Y$ (James et al., 2013). These directions are the respective eigenvectors of the (estimated) VCV matrix $\pmb C_X$. How much variation a principal component captures is given, as stated above in equation *(2.11)*, by the proportion of the respective eigenvalue. Feasible ways to find a suitable $M$ are, for example, choosing a cutoff point $c$ such that $\lambda_M$ is the smallest eigenvalue with $\Phi(\lambda_M) \geq c$, or using K-fold cross validation with the average prediction error as objective function.
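
The cutoff rule can be sketched as follows. This is only an illustration under an assumption: equation *(2.11)* is not repeated here, so the sketch takes $\Phi(\lambda_m) = \lambda_m / \sum_{j} \lambda_j$, the share of a single eigenvalue in the total. The function name `choose_M` is hypothetical.

```python
import numpy as np

def choose_M(eigenvalues, c):
    """Hypothetical helper: number of components whose eigenvalue share is at least c.

    Assumes Phi(lambda_m) = lambda_m / sum_j lambda_j (share of a single eigenvalue).
    """
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]  # largest eigenvalue first
    share = lam / lam.sum()                                    # proportion per component
    M = int(np.sum(share >= c))                                # lambda_M is the smallest one kept
    return max(M, 1)                                           # keep at least one component

# Shares are 0.4, 0.3, 0.2, 0.1, so a cutoff of 0.25 keeps the first two components
M_example = choose_M([4.0, 3.0, 2.0, 1.0], c=0.25)
```

If $\Phi$ is instead meant as a cumulative proportion, the same idea applies with `np.cumsum(lam) / lam.sum()` and the first index at which the cutoff is reached.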

The natural question that arises is: why should that be useful? The line of argument in the literature, for example in Friedman et al. (2001) and James et al. (2013), is essentially the following: imagine a case where the number of regressors $p$ is close to the number of observations $n$. In such a case the OLS estimate tends to overfit the data and thus yields poor prediction results. PCR circumvents this problem by reducing the number of parameters to estimate. This is true if the true VCV $\pmb C_X$ is known. However, as stated in the previous subsection, in practice the true VCV is not known and $\pmb \phi$ is estimated by the eigenvectors of $\hat{\pmb C}_X$. So instead of $p$ parameters, there are $M$ parameters to estimate in the regression and $p$ vectors to estimate before the regression. The latter adds additional randomness to the $M$ estimated parameters. As in the previous subsection, I will first assume that the true class $[\phi]$ is known and subsequently show what is done in practice, when the VCV and therefore the class must be estimated by $[\hat{\phi}]$.

### 3.2.1 PCR in Theory
---

As mentioned above, in this subsection it is assumed that the true VCV $\pmb C_X$ is known and therefore the matrix of eigenvectors $\pmb \phi$ is deterministic after it is chosen from the class $[\phi]$. The equation to be estimated is therefore

\begin{align}
Y = \pmb Z \beta_t + \varepsilon_Z = \pmb X \pmb \phi \beta_t + \varepsilon_Z,
\tag{3.14}
\end{align}

where $\varepsilon_Z$ is again a vector of independent homoscedastic error terms, defined as in the previous subsection, and $\beta_t \in \mathbb{R}^{M}$. The subscript $t$ indicates that the estimator is from the theoretical case. The model in *(3.14)* is a linear model and can therefore be estimated by minimizing the sum of squared residuals. The well-known solution to this problem is given by

\begin{align}
\hat{\beta_t} = \arg \min\limits_{b \in \mathbb{R}^M} ||Y- \pmb Z b||^2_2 = \left (\pmb Z' \pmb Z \right)^{-1}\pmb Z' Y = \left (\pmb \phi' \pmb X' \pmb X \pmb \phi\right)^{-1}\pmb \phi' \pmb X' Y.
\tag{3.15}
\end{align}
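
The closed-form solution in *(3.15)* can be sketched in a few lines. The setup below is an assumption made purely for illustration: a known $\pmb C_X$ is chosen, $\pmb \phi$ is taken from its eigenvectors, and $\pmb X$ and $Y$ are simulated. The function name `pcr_coefficients` and all numerical values are hypothetical.

```python
import numpy as np

def pcr_coefficients(X, Y, phi):
    """Hypothetical helper: (phi' X' X phi)^{-1} phi' X' Y for a given p x M matrix phi."""
    Z = X @ phi                                # principal components Z = X phi
    return np.linalg.solve(Z.T @ Z, Z.T @ Y)   # solves the normal equations of Y on Z

# A known VCV matrix (illustrative values) and its eigenvectors
C_X = np.array([[2.0, 0.8, 0.3],
                [0.8, 1.0, 0.2],
                [0.3, 0.2, 0.5]])
eigvals, eigvecs = np.linalg.eigh(C_X)
phi_full = eigvecs[:, ::-1]                    # columns ordered by decreasing eigenvalue

# Simulated DGP: Y = X gamma + eps with normal errors
rng = np.random.default_rng(1)
n = 500
X = rng.multivariate_normal(np.zeros(3), C_X, size=n)
gamma = np.array([1.0, -0.5, 0.25])
Y = X @ gamma + rng.normal(scale=1.0, size=n)

beta_t_hat = pcr_coefficients(X, Y, phi_full[:, :2])   # PCR with M = 2 components
beta_full = pcr_coefficients(X, Y, phi_full)           # PCR with M = p components
gamma_ols = np.linalg.solve(X.T @ X, X.T @ Y)          # OLS on the original regressors
```

With $M = p$ the matrix $\pmb \phi$ is orthogonal, so $\pmb \phi \hat{\beta}_t$ coincides with the OLS estimate of $\gamma$. The dimension reduction only takes effect for $M < p$.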

Since the purpose of the analysis is to compare the variance of $\hat{\beta}_t$ in the non-stochastic $\phi$ case with the variance in the stochastic case, it seems natural to compute its variance analytically. Hence, I will compute the unconditional variance of $\hat{\beta}_t$. I make use of the well-known fact that the OLS estimate is unbiased conditional on $\pmb Z$, so the conditional expectation is constant and its variance vanishes (ii), and that the conditional variance is given by $\text{Var}(\hat{\beta}_t|\pmb Z) = \sigma^2_z \left(\pmb Z' \pmb Z \right)^{-1}$ (iii). Furthermore, I apply the variance decomposition (i) derived in the appendix *(a.8)*. If the variance of $\varepsilon_Z$ must be estimated, this yields

\begin{align}
\text{Var}(\hat{\beta}_t) \stackrel{\text{i}}{=} \text{E} \left[\text{Var}(\hat{\beta}_t|\pmb Z) \right] + \text{Var} \left[\text{E}(\hat{\beta}_t|\pmb Z) \right] \stackrel{\text{ii}}{=} \text{E} \left[\text{Var}(\hat{\beta}_t|\pmb Z) \right] \stackrel{\text{iii}}{=}   \text{E} \left[\hat{\sigma}^2_z  \left( \pmb Z' \pmb Z \right)^{-1} \right] = \text{E} \left[\hat{\sigma}^2_z \left( \pmb \phi' \pmb X' \pmb X \pmb \phi \right)^{-1} \right]
\tag{3.16}
\end{align}

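Note that $\hat{\beta}_t$ is not invariant to the choice of $\pmb \phi$ from the class $[\phi]$, but this does not matter for prediction: choosing a different $\pmb \phi$ changes $\hat{\beta}_t$, while the fitted values of $Y$ stay the same. This is also shown in the simulation.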
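
The point about the choice of $\pmb \phi$ can be checked numerically. The sketch below rests on an assumption, since the class $[\phi]$ is defined in an earlier subsection: it takes two members of the class to differ only by the signs of their columns, that is $\tilde{\pmb \phi} = \pmb \phi \pmb D$ with $\pmb D$ diagonal and entries $\pm 1$. All names and numerical values are hypothetical.

```python
import numpy as np

def ols(Z, Y):
    """Least-squares coefficients (Z'Z)^{-1} Z'Y."""
    return np.linalg.solve(Z.T @ Z, Z.T @ Y)

# Known VCV matrix (illustrative values) and its first M = 2 eigenvectors
C_X = np.array([[2.0, 0.8, 0.3],
                [0.8, 1.0, 0.2],
                [0.3, 0.2, 0.5]])
eigvals, eigvecs = np.linalg.eigh(C_X)
phi = eigvecs[:, ::-1][:, :2]

# Assumed alternative member of the class: flip the sign of the second column
D = np.diag([1.0, -1.0])
phi_alt = phi @ D

# Simulated data, purely for illustration
rng = np.random.default_rng(2)
n = 300
X = rng.multivariate_normal(np.zeros(3), C_X, size=n)
Y = X @ np.array([1.0, -0.5, 0.25]) + rng.normal(size=n)

beta = ols(X @ phi, Y)
beta_alt = ols(X @ phi_alt, Y)       # equals D @ beta: the coefficients change sign
fitted = X @ phi @ beta
fitted_alt = X @ phi_alt @ beta_alt  # identical fitted values, since D @ D is the identity
```

Algebraically, $\tilde{\beta} = (\pmb D \pmb \phi' \pmb X' \pmb X \pmb \phi \pmb D)^{-1} \pmb D \pmb \phi' \pmb X' Y = \pmb D \hat{\beta}_t$, and hence $\pmb X \pmb \phi \pmb D \pmb D \hat{\beta}_t = \pmb X \pmb \phi \hat{\beta}_t$.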