--- 
Project for the course in Computational Statistics | Summer 2020, M.Sc. Economics, Bonn University | [Manuel Huth](https://github.com/manuhuth)

# Variance Increase in PCR <a class="tocSkip">   
---

The following notebook contains ... (one sentence)

#### Downloading and viewing this notebook

* To ensure that every image and all formatting are displayed properly, I recommend downloading this notebook from its repository on [GitHub](https://github.com/manuhuth/PCR-Parameter-Variance-Analysis). Other viewing options like _MyBinder_ or _NBViewer_ might have issues displaying formulas and formatting.


#### Information about the setup
* 2-3 bullet points


---
<h1>Table of Contents<span class="tocSkip"></span></h1>

---

In [None]:
#read R-Files

---
# 1. Introduction

- Earnings Introduction
    - why important in economics?
    - how to predict? (many highly correlated regressors) usually not all of them can be used
    - link to highly correlated regressors
    
- PCA/PCR Introduction
    - what is done? uses eigenvectors of the VCV to reduce the dimension; OLS overfits if the number of observations is small and the number of parameters is high. However: the VCV is usually estimated (differences are emphasized in section 2), which increases the unconditional variance of the parameter estimates. Analytically not tractable without making assumptions about the joint distribution of the estimated eigenvectors and the variables. Hence: simulation (-> also increases variance of estimate.) 
    - I check this by running a simulation
    
- structure of paper
    - PCA
    - PCR
    - Structural model and parameters
    - simulation
    - conclusion
    
---

# 2. PCR Methodology
--- 

Let $I = \{1, \dots, p\}$ be an index set and let $x_i \in \mathbb{M}_{n \times 1}$, $\forall i \in I$, be a random vector whose entries are independent and follow the same distribution $X_i$ with finite first and second moments. Note that it would be sufficient to assume that the expected values and variances are equal across the entries of $x_i$. However, for ease of notation I stick to the case of equal distributions. The variance-covariance matrix (VCV) of these random variables is denoted by $\pmb C_X$. The collection of the vectors $x_i$ is defined by the matrix 

\begin{align}
\pmb X = \begin{pmatrix} x_1 & x_2 & \dots & x_p \end{pmatrix} \in \mathbb{M}_{n \times p}.
\end{align}

It is assumed that the random vector $x_i$ is already demeaned by the mean of the random variable $X_i$. Even though it might seem restrictive at first glance, this assumption is not. It is shown in the appendix *(A.1)* that every matrix whose column vectors each have identically distributed entries can be transformed into a matrix whose column vectors have an expected value of zero. For ease of notation it is therefore assumed that $\text{E}(X_i) = 0$ $\forall i \in I$. 

## 2.1 Principal Component Analysis (PCA)
---

### 2.1.1 Aim of PCA
---

The aim of PCA is to build $M$ new variables $z_1, z_2, \dots, z_M$ as orthogonal linear combinations of $x_1, x_2, \dots, x_p$. Note that $M \leq p$, since otherwise the $z_m$'s cannot be orthogonal to each other. Denoting the vector of weights used to build $z_m$ by $\phi_m \in \mathbb{M}_{p \times 1}$, one can express $z_m$ as 

\begin{align}
z_m = \pmb X \cdot \phi_m,
\tag{1.1}
\end{align}

where $z_m$ is the $m$-th principal component. By defining $\pmb \phi = \begin{pmatrix} \phi_1 & \phi_2 & \dots & \phi_M \end{pmatrix}$ it is possible to shorten the above notation. This will later be useful to compute the values of all $z_m$ in one equation:

\begin{align}
\pmb Z = \begin{pmatrix} z_1 & z_2 & \dots & z_M \end{pmatrix} = \begin{pmatrix} \pmb X \cdot \phi_1 & \pmb X \cdot \phi_2 & \dots & \pmb X \cdot \phi_M \end{pmatrix} = \pmb X \pmb \phi.
\tag{1.2}
\end{align}

First, I follow many textbooks and take $\pmb \phi$ as deterministic. In practice, however, it is a matrix that has to be estimated and therefore adds additional randomness to the $\pmb Z$ matrix. The empirical case with unknown true distributions and correlations is discussed in 2.1.6.

### 2.1.2 Distribution of the $Z_m$ variables
---

The exact distributions depend on the distributions of the $X_i$ variables, their convolution properties and the coordinates $\phi_m$. All $n$ random variables in $z_{m}$ come from the same distribution $Z_m$, since they are all the same linear combination of the entries of the random vectors $x_i$ with distributions $X_i$, given by

\begin{align}
 Z_m  = \begin{pmatrix} X_1 & X_2 & \dots & X_p \end{pmatrix} \phi_m \Longleftrightarrow \mathcal{L}\left( Z_m \right) = \mathcal{L} \left( \begin{pmatrix} X_1 & X_2 & \dots & X_p \end{pmatrix} \phi_m \right)
    \tag{1.3}
\end{align}

In the following, we take a closer look at the first and second moments. The expected value of $Z_m$, computed in formula *(1.4)*, is found to be zero.

\begin{align}
\text{E}(Z_m) = \text{E}(\begin{pmatrix} X_1 & X_2 & \dots & X_p \end{pmatrix} \cdot \phi_m) = \text{E} \left(\sum_{j=1}^p X_j \phi_{jm} \right) = \sum_{j=1}^p \underbrace{E(X_j)}_{= 0} \phi_{jm} = 0.
\tag{1.4}
\end{align}

The variance, derived subsequently, heavily depends on the magnitudes of the entries in the vector $\phi_m$:

\begin{align}
    \text{Var}(Z_m) = \text{E} \left( \left[ Z_m - \underbrace{\text{E}(Z_m)}_{0} \right]^2 \right) =  \text{E} \left( Z_m^2 \right) = \text{E} \left( \begin{pmatrix} X_1 & X_2 & \dots & X_p \end{pmatrix}  \phi_m \cdot \phi'_m   \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_p \end{pmatrix} \right) = \text{E} \left(\phi'_m  \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_p \end{pmatrix} \begin{pmatrix} X_1 & X_2 & \dots & X_p \end{pmatrix}  \phi_m \right) = \phi'_m \text{E} \left( \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_p \end{pmatrix} \begin{pmatrix} X_1 & X_2 & \dots & X_p \end{pmatrix} \right)\phi_m = \phi'_m \pmb C_X  \phi_m
    \tag{1.5}
\end{align}

Note that since $\pmb C_X$ is assumed to be positive definite, the variance of $Z_m$ is strictly positive for every $\phi_m \neq 0$.
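
As a small worked illustration with assumed numbers (they do not come from the analysis in this notebook), take $p = 2$, a VCV matrix $\pmb C_X$ and two unit-length weight vectors $w_a$ and $w_b$. Evaluating the quadratic form from *(1.5)* gives

\begin{align}
\pmb C_X = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}, \quad
w_a = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \quad
w_b = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ 1 \end{pmatrix}: \qquad
w'_a \pmb C_X w_a = 2, \qquad
w'_b \pmb C_X w_b = \frac{1}{2} (2 + 1 + 1 + 2) = 3. \notag
\end{align}

Both weight vectors have length one, yet the resulting variances differ, which is why the choice of $\phi_m$ matters.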

### 2.1.3 Derive the PC theoretically
---

In the next step it is discussed how the $\phi_m$ are chosen to ensure uncorrelated $Z_m$'s with $\text{Var}(Z_m) \geq \text{Var}(Z_{m+1})$ $\forall m \in I_{-p}$. The first coordinates $\phi_1$ are chosen to maximize the variance of the random variable $Z_1$. Recall from *(1.4)* and *(1.5)* that $Z_m$ has expected value zero and a variance that is a function of the coordinates $\phi_m$. Since the variance of $Z_m$ can be made arbitrarily large by scaling $\phi_m$, see formula *(1.5)*, the choice of $\phi_m$ is restricted to vectors of length one.

\begin{align}
    \phi_1 = \arg \max_{||w|| = 1} \text{Var}\left(\begin{pmatrix} X_1 & X_2 & \dots & X_p \end{pmatrix} w \right) = \arg \max_{||w|| = 1} w' \pmb C_X w
    \tag{1.6}
\end{align}

This problem can be solved using the Lagrangian with the constraint $w' w = 1$:
\begin{align}
    \mathscr{L}(w,\lambda) =& w' \pmb C_X w - \lambda \left( w'w-1 \right) \notag \\
    \frac{\partial \mathscr{L}}{\partial \lambda} =& w'w -1 = 0 \notag \\
    \frac{\partial \mathscr{L}}{\partial w} =& 2 \pmb C_X w - 2\lambda w = 0
    \tag{1.7}
\end{align}

From equation *(1.7)* it follows that $\pmb C_X w = \lambda w$. Hence, the solution vector $w$ is an eigenvector of $\pmb C_X$ with eigenvalue $\lambda$. Since $\pmb C_X$ is a symmetric $p \times p$ matrix, it has $p$ eigenvalues $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_p > 0$ with $p$ corresponding unit-length eigenvectors $v_1, v_2, \dots, v_p$. All $\lambda_i$ are greater than zero, since $\pmb C_X$ is positive definite. Thus, the question arises which eigenvalue to use to maximize the variance. Since every candidate solution is an eigenvector and is therefore paired with an eigenvalue, it is sufficient to maximize over the eigenvalues.

\begin{align}
    \arg \max_{\lambda \in \{\lambda_1, \dots, \lambda_p \}} w' \pmb C_X w \stackrel{(1.7)}{=}  \arg \max_{\lambda \in \{\lambda_1, \dots, \lambda_p \}} w' \lambda w = \arg \max_{\lambda \in \{\lambda_1, \dots, \lambda_p \}} \underbrace{w' w}_{1} \lambda = \arg \max_{\lambda \in \{\lambda_1, \dots, \lambda_p \}} \lambda
    \tag{1.8}
\end{align}

Hence, the maximum variance is attained by choosing the largest eigenvalue, which is, by definition, $\lambda_1$. The corresponding eigenvector $v_1$ is then chosen to be $\phi_1$. Thus, it is enough to compute the eigenvector belonging to the largest eigenvalue of the VCV matrix of the $X_i$ to obtain $\phi_1$ and with that $z_1$. Armed with $\phi_1$, it is possible to compute $\phi_2, \phi_3, \dots, \phi_M$ iteratively by defining new random variables 

\begin{align}
\begin{pmatrix} X_1^{(m)} \\ X_2^{(m)} \\ \vdots \\ X_p^{(m)} \end{pmatrix}= \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_p \end{pmatrix} - \sum_{j = 1}^{m-1} Z_j \phi_j
\tag{1.9}
\end{align}

and solving the former problem for $\text{Var}(Z_m)$ using the random variables $\begin{pmatrix} X_1^{(m)} & X_2^{(m)} & \dots & X_p^{(m)} \end{pmatrix}'$ to get $\phi_m$. It turns out that $\pmb \phi$ equals the matrix $\pmb V = \begin{pmatrix} v_1 & v_2 & \dots & v_M \end{pmatrix} $, where $v_i$ is the unit-length eigenvector of $\pmb C_X$ corresponding to the $i$-th highest eigenvalue $\lambda_i$. Hence, we can compute $\pmb \phi$ by computing the eigenvectors of $\pmb C_X$ and sorting them in descending order of the corresponding eigenvalues. Note that the columns of $\pmb \phi$ are therefore orthonormal. Subsequently, $\pmb Z$ is obtained by matrix multiplication of $\pmb X$ and $\pmb \phi$ as in formula *(1.2)*. In theory, there is thus a true, non-stochastic matrix $\pmb \phi$ that maps $\pmb X$ into the true matrix $\pmb Z$ of principal components.
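
The iterative construction can be sketched numerically. The following is a minimal illustration and not the implementation used in this project: it assumes a known $2 \times 2$ VCV matrix with made-up entries, finds the leading eigenvector by power iteration (a numerical technique swapped in here for illustration), and removes the captured variance before the next step, which corresponds to the VCV of the deflated variables in *(1.9)*. The function names `leading_eigenpair` and `principal_directions` are hypothetical.

```python
import math

def mat_vec(a, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]

def leading_eigenpair(c, iterations=200):
    """Power iteration: approximate the largest eigenvalue of a symmetric
    positive semi-definite matrix and a unit-length eigenvector."""
    w = [1.0 / (k + 1) for k in range(len(c))]  # arbitrary non-zero start
    for _ in range(iterations):
        w = mat_vec(c, w)
        norm = math.sqrt(sum(x * x for x in w))
        w = [x / norm for x in w]
    eigenvalue = sum(x * y for x, y in zip(w, mat_vec(c, w)))  # w' C w
    return eigenvalue, w

def principal_directions(c, m):
    """Return the first m eigenvalues and unit eigenvectors of c by
    repeatedly removing the variance already captured (deflation)."""
    c = [row[:] for row in c]  # work on a copy
    eigenvalues, vectors = [], []
    for _ in range(m):
        lam, v = leading_eigenpair(c)
        eigenvalues.append(lam)
        vectors.append(v)
        # VCV of the deflated variables in (1.9): C - lambda * v v'
        c = [[c[i][j] - lam * v[i] * v[j] for j in range(len(c))]
             for i in range(len(c))]
    return eigenvalues, vectors

C_X = [[2.0, 1.0], [1.0, 2.0]]  # assumed VCV, not taken from the notebook
lambdas, phis = principal_directions(C_X, 2)
print(lambdas)  # eigenvalues in descending order
print(phis)     # corresponding unit-length eigenvectors
```

For this assumed matrix the eigenvalues are 3 and 1, with eigenvectors proportional to $(1, 1)'$ and $(1, -1)'$, so the printed values should be close to these up to numerical error and sign.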

### 2.1.4 Computed Principal Components have covariance zero
---

To show that the PCs constructed above indeed have covariances of zero, I compute the VCV matrix $\pmb C_Z$ of the $Z_i$'s with respect to the derived $\pmb \phi$ matrix. Recall from equation *(1.4)* that the expected value is zero for every random variable $Z_i$.

\begin{align}
\pmb C_Z =& \text{E} \left( \begin{pmatrix} Z_1 \\ Z_2 \\ \vdots \\ Z_M \end{pmatrix} \begin{pmatrix} Z_1 & Z_2 & \dots & Z_M \end{pmatrix} \right) =  \pmb \phi' \text{E} \left( \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_p \end{pmatrix} \begin{pmatrix} X_1 & X_2 & \dots & X_p \end{pmatrix} \right) \pmb \phi = \pmb \phi' \pmb C_X  \pmb \phi \notag\\
\stackrel{\pmb C_X v_i = \lambda_i v_i}{=}& 
\pmb \phi'
\begin{pmatrix} 
    \lambda_1 v_1 & \dots & \lambda_M v_M 
\end{pmatrix}
=
\underbrace{\pmb \phi'
\pmb \phi}_{= I_{M \times M}}
\begin{pmatrix} 
    \lambda_1 & \dots & 0\\
    \vdots & \ddots & \vdots \\
    0 & \dots & \lambda_M
\end{pmatrix}
= \begin{pmatrix} 
    \lambda_1 & \dots & 0\\
    \vdots & \ddots & \vdots \\
    0 & \dots & \lambda_M
\end{pmatrix}
\tag{1.10}
\end{align} 

### 2.1.5 Portion of variance captured by the $\lambda_i$'s
---

What we also learn from equations *(1.5)* and *(1.10)* is that $\text{Var}(Z_i) = \lambda_i$ $\forall i \in I$, that is, the $i$-th highest eigenvalue equals the variance of the $i$-th PC. Hence, the proportion of variance captured by $Z_m$ is given by

\begin{align}
   \Phi(Z_m) = \frac{\lambda_m}{\sum_{i = 1}^p \lambda_i}
   \tag{1.11}
\end{align}
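
As a brief numerical illustration with assumed eigenvalues, let $p = 2$, $\lambda_1 = 3$ and $\lambda_2 = 1$. Formula *(1.11)* then gives

\begin{align}
\Phi(Z_1) = \frac{3}{3 + 1} = 0.75, \qquad \Phi(Z_2) = \frac{1}{3 + 1} = 0.25, \notag
\end{align}

so the first principal component alone would capture three quarters of the total variance.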

### 2.1.6 Derive the PC in practice
---

The common problem in practice is that the VCV matrix $\pmb C_X$ of the $X_i$'s is usually unknown and has to be estimated. From equation *(a.1)* it is known that the means of the $X_i$'s can be estimated without bias and that every matrix $\pmb X$ can be transformed into a matrix whose columns come from random variables with zero mean. From equation *(a.2)* it is known that the VCV matrix of the $X_i$'s can be estimated by $\frac{1}{n-1} \pmb X' \pmb X$. We can use this estimate to solve the optimization problem in *(1.6)* with the estimated variances and covariances of the $X_i$'s. However, one can also proceed as in *(a.2)* and directly compute the estimated VCV matrix of the $Z_m$'s by $\frac{1}{n-1} \hat{\pmb Z}' \hat{\pmb Z}$. I use hat notation for the $\pmb Z$ matrices: since the variance of $Z_m$ in equation *(1.6)* is estimated, $\phi_m$ is also estimated and therefore a random variable. Consequently, $z_m$ is estimated as well, by replacing the true value of $\phi_m$ in equation *(1.2)* with the estimated value. Consistently, I denote the estimated vectors by $\hat{\phi}_m$ and $\hat{z}_m$. \
In the next step the values in $\hat{\pmb \phi}$ are computed recursively. First, the vector $\hat{\phi}_1$ is computed in the same fashion as in *(1.6)* but with the estimated variance. Hence, the empirical maximization problem is

\begin{align}
\hat{\phi}_1 = \arg \max_{||w|| = 1} \left( \frac{1}{n-1} \hat{z}'_1 \hat{z}_1 \right)= \arg \max_{||w|| = 1} \left( \hat{z}'_1 \hat{z}_1 \right)= \arg \max_{||w|| = 1} \left( (\pmb X w)'(\pmb X w) \right) = \arg \max_{||w|| = 1} \left( w' \pmb X' \pmb X w\right)
\tag{1.12}
\end{align}

From here onward, the solution is the same as for the theoretical case but estimated variables are used. Thus, I have moved a detailed description to the appendix (*A.4*). The estimated phi matrix consists of estimated eigenvectors, which are the eigenvectors of the estimated VCV matrix of the $X_i$'s. 

\begin{align}
\hat{\pmb \phi} = \begin{pmatrix} \hat{\phi_1} & \hat{\phi_2 }& \dots & \hat{\phi_M} \end{pmatrix} = \begin{pmatrix} \hat{v_1} & \hat{v_2 }& \dots & \hat{v_M} \end{pmatrix}
\tag{1.13}
\end{align}

The estimated principal components are computed as in *(1.2)* replacing the true $\pmb \phi$ by the estimate $\hat{\pmb \phi}$, such that $\hat{\pmb Z} = \pmb X \hat{\pmb \phi}$ is a stochastic matrix.
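
The empirical procedure can be sketched as follows. This is an illustration under stated assumptions and not the code of this project: the data are simulated with made-up parameters, the columns are demeaned by their sample means, the VCV is estimated by $\frac{1}{n-1} \pmb X' \pmb X$, and the eigenvectors are approximated by power iteration with deflation. The names `empirical_pca` and `leading_eigenpair` are hypothetical.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def leading_eigenpair(s, iterations=500):
    """Power iteration for a symmetric positive semi-definite matrix."""
    w = [1.0 / (k + 1) for k in range(len(s))]  # arbitrary non-zero start
    for _ in range(iterations):
        w = [dot(row, w) for row in s]
        norm = math.sqrt(dot(w, w))
        w = [x / norm for x in w]
    return dot(w, [dot(row, w) for row in s]), w

def empirical_pca(x, m):
    """Demean the columns of x (list of rows), estimate the VCV by
    X'X / (n - 1) and return (lambda_hat, phi_hat, z_hat)."""
    n, p = len(x), len(x[0])
    means = [sum(row[i] for row in x) / n for i in range(p)]
    xc = [[row[i] - means[i] for i in range(p)] for row in x]
    s = [[sum(r[i] * r[j] for r in xc) / (n - 1) for j in range(p)]
         for i in range(p)]
    lambdas, phis = [], []
    for _ in range(m):
        lam, v = leading_eigenpair(s)
        lambdas.append(lam)
        phis.append(v)
        # remove the variance captured by the component just found
        s = [[s[i][j] - lam * v[i] * v[j] for j in range(p)] for i in range(p)]
    # Z_hat = X phi_hat: one row per observation, one column per component
    z = [[dot(row, v) for v in phis] for row in xc]
    return lambdas, phis, z

random.seed(0)  # assumed data: two correlated regressors and one independent
data = []
for _ in range(200):
    common = random.gauss(0.0, 1.0)
    data.append([common + random.gauss(0.0, 0.5),
                 common + random.gauss(0.0, 0.5),
                 random.gauss(0.0, 1.0)])

lam_hat, phi_hat, z_hat = empirical_pca(data, 2)
n = len(z_hat)
c_z_hat = [[sum(r[a] * r[b] for r in z_hat) / (n - 1) for b in range(2)]
           for a in range(2)]
print(lam_hat)
print(c_z_hat)  # approximately diagonal with lam_hat on the diagonal
```

The estimated VCV of the $\hat{z}_m$ is diagonal up to numerical error, mirroring the theoretical result in *(1.10)* with estimated quantities.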

## 2.2 Principal Component Regression (PCR)
---

So far it has been shown how to compute the principal components. Since one is usually not only interested in the principal components themselves, I now show how they can be used for prediction. Note that inference about the $x_i$ is not feasible with PCR, since the original structure of the data is lost.
To illustrate the intuition behind PCR, I add some assumptions to those made about $\pmb X$. The data generating process (DGP) is set up in a linear, OLS-like way, such that

\begin{align}
Y = \pmb X \beta + \varepsilon,
\tag{2.14}
\end{align}

where $\varepsilon \in \mathbb{M}_{n \times 1}$ is a vector of uncorrelated error terms with mean zero that, in general, follows some specific distribution. In most cases, and in the subsequent application, this will be a normal distribution. Since the distribution of the errors is not of major concern in this paper, I assume that any vector of error terms follows $\varepsilon \sim \mathcal{N}\left(\pmb 0, \sigma_x^2 \pmb I_{n \times n} \right)$.  

The idea of PCR is that $Y$ can be estimated using the $M$ constructed principal components. Recall that $1 \leq M \leq p$ to ensure that the $Z_m$ are uncorrelated. The advantage over an ordinary OLS model is that often a small number of principal components is sufficient to explain the variability in $\pmb X$ and its correlation with $Y$. Hence, it is assumed that the directions in which the $x_i$ show the most variation are the directions that are associated with $Y$ (James et al., 2013). These directions are the respective eigenvectors of the (estimated) VCV matrix $\pmb C_X$. How much variation is captured by a principal component is given, as stated above in equation *(1.11)*, by the proportion of the respective eigenvalue. Feasible ways to find a suitable $M$ are, for example, choosing a cutoff point $c$ such that $\lambda_M$ is the smallest eigenvalue with $\Phi(Z_M) \geq c$, or using K-fold cross validation with the average prediction error as objective function.
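
The cutoff rule can be written down in a few lines. The sketch below is only one possible reading of the rule: it keeps every component whose individual share of variance from *(1.11)* is at least $c$ and always keeps at least one component. The function names and the example eigenvalues are assumptions made for illustration.

```python
def explained_shares(eigenvalues):
    """Share of total variance per component, as in (1.11)."""
    total = sum(eigenvalues)
    return [lam / total for lam in eigenvalues]

def choose_m(eigenvalues, cutoff):
    """Number of components whose individual share is at least cutoff.
    eigenvalues must be sorted in descending order; at least one
    component is always kept, in line with 1 <= M <= p."""
    shares = explained_shares(eigenvalues)
    m = sum(1 for share in shares if share >= cutoff)
    return max(m, 1)

example_eigenvalues = [2.25, 1.0, 0.25]  # assumed values for illustration
print(explained_shares(example_eigenvalues))
print(choose_m(example_eigenvalues, 0.1))
```

With these assumed eigenvalues the third component captures about 7% of the variance, so a cutoff of $c = 0.1$ keeps the first two components.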

The natural question that arises is: Why should that be useful? The line of argumentation in the literature, for example in Friedman et al. (2001) and James et al. (2013), is basically the following: Imagine a case where the number of regressors $p$ is close to the number of observations $n$. In such a case the OLS estimate tends to overfit the data and thus yields bad prediction results. PCR circumvents this problem by reducing the number of parameters to estimate. This is true if the true VCV $\pmb C_X$ is known. However, as stated in the previous subsection, in practice the true VCV is not known and $\pmb \phi$ is estimated by the eigenvectors of $\hat{\pmb C_X}$. So instead of $p$ parameters to estimate, there are $M$ parameters to estimate in the regression and $p$ vectors to estimate before the regression. The latter adds additional randomness to the $M$ estimated parameters. As in the previous subsection, I first assume that the true $\pmb \phi$ matrix is known and subsequently show what is done in practice, when the VCV must be estimated.

### 2.2.1 PCR in Theory
---

As mentioned above, in this subsection it is assumed that the true VCV $\pmb C_X$ is known and therefore the matrix of eigenvectors $\pmb \phi$ is deterministic. The equation to be estimated is therefore

\begin{align}
Y = \pmb Z \beta_Z + \varepsilon_Z = \pmb X \pmb \phi \beta_Z + \varepsilon_Z,
\tag{2.15}
\end{align}

where $\varepsilon_Z$ is again a vector of independent homoscedastic error terms defined as in the previous subsection and $\beta_Z \in \mathbb{M}_{M \times 1}$. The model in *(2.15)* is a linear model and can therefore be estimated by minimizing the sum of squared residuals. The well-known solution to this problem is given by

\begin{align}
\hat{\beta}_Z = \arg \min\limits_{b \in \mathbb{R}^M} ||Y- \pmb Z b||^2_2 = \left (\pmb Z' \pmb Z \right)^{-1}\pmb Z' Y = \left (\pmb \phi' \pmb X' \pmb X \pmb \phi\right)^{-1}\pmb \phi' \pmb X' Y.
\tag{2.16}
\end{align}
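
A minimal sketch of the estimator in *(2.16)* with a known $\pmb \phi$ is given below. The tiny data set, the weight vectors and the helper names `solve` and `pcr_coefficients` are assumptions chosen so that the result can be checked by hand; this is not the code of the project.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination
    with partial pivoting."""
    k = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]

def pcr_coefficients(x, y, phi):
    """OLS of y on Z = X phi, i.e. (Z'Z)^{-1} Z'y as in (2.16).
    phi is a list of M weight vectors (the columns of the phi matrix)."""
    z = [[sum(xi * wi for xi, wi in zip(row, w)) for w in phi] for row in x]
    m = len(phi)
    ztz = [[sum(r[a] * r[b] for r in z) for b in range(m)] for a in range(m)]
    zty = [sum(r[a] * yi for r, yi in zip(z, y)) for a in range(m)]
    return solve(ztz, zty)

# Assumed toy data with zero column means
X = [[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]]
Y = [2.0, -2.0, 0.5, -0.5]
s = 1.0 / math.sqrt(2.0)
phi_1, phi_2 = [s, s], [s, -s]  # assumed known unit-length weight vectors

beta_one = pcr_coefficients(X, Y, [phi_1])         # M = 1
beta_two = pcr_coefficients(X, Y, [phi_1, phi_2])  # M = p = 2
print(beta_one, beta_two)
```

For this toy data the coefficient on the first component equals $\sqrt{2}$ in both cases, because the two components are orthogonal in the sample. With $M = p$ the implied coefficients on $\pmb X$, $\pmb \phi \hat{\beta}_Z = (1.25, 0.75)'$, coincide with the OLS estimate.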

Since the purpose of the analysis is to compare the variance of $\hat{\beta}_Z$ in the non-stochastic $\pmb \phi$ case with the variance in the stochastic case, it seems natural to try to compute its variance analytically. Hence, I compute the unconditional variance of $\hat{\beta}_Z$. It will turn out that the variance cannot be computed without making further assumptions. I make use of the well-known facts that the OLS estimate is unbiased conditional on $\pmb Z$, so that $\text{E}(\hat{\beta}_Z|\pmb Z) = \beta_Z$ is constant and has variance zero (ii), and that the conditional variance is given by $\text{Var}(\hat{\beta}_Z|\pmb Z) = \sigma^2_z \left(\pmb Z' \pmb Z \right)^{-1}$ (iii). Furthermore, I apply the variance decomposition (i) derived in the appendix *(a.8)*. 

\begin{align}
\text{Var}(\hat{\beta}_Z) \stackrel{\text{i}}{=} \text{E} \left[\text{Var}(\hat{\beta}_Z|\pmb Z) \right] + \text{Var} \left[\text{E}(\hat{\beta}_Z|\pmb Z) \right] \stackrel{\text{ii}}{=} \text{E} \left[\text{Var}(\hat{\beta}_Z|\pmb Z) \right] \stackrel{\text{iii}}{=} \sigma^2_z  \text{E} \left[\left(\pmb Z' \pmb Z \right)^{-1} \right]
\tag{2.17}
\end{align}

### 2.2.2 PCR in Practice
---

As in subsection 2.1.6, it is assumed in this subsection that the true $\pmb C_X$ is unknown and therefore the matrix of eigenvectors $\pmb \phi$ is estimated by $\hat{\pmb \phi}$ prior to the regression. I denote $\beta_Z^s$ with the index $s$ to indicate the stochastic case.

\begin{align}
Y =  \hat{\pmb Z} \beta^s_Z + \varepsilon_Z = \pmb X \hat{\pmb \phi} \beta^s_Z + \varepsilon_Z,
\tag{2.18}
\end{align}
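
The comparison between the two cases can be explored with a small Monte Carlo experiment. The sketch below is not the simulation of this project: it assumes $p = 2$ correlated standard normal regressors, $M = 1$, and made-up values for the correlation, the coefficients and the sample size. Because eigenvectors are only identified up to sign, the estimated eigenvector is aligned with the true one, which is an additional assumption of this sketch.

```python
import math
import random
import statistics

RHO = 0.8          # assumed correlation of the two regressors
BETA = (1.0, 0.5)  # assumed true coefficients in Y = X beta + eps
PHI_TRUE = (1 / math.sqrt(2), 1 / math.sqrt(2))  # first eigenvector if RHO > 0

def draw_sample(n, rng):
    """Draw n observations of two correlated standard normal regressors
    and the outcome of the linear DGP in (2.14)."""
    x, y = [], []
    for _ in range(n):
        e1, e2 = rng.gauss(0, 1), rng.gauss(0, 1)
        row = (e1, RHO * e1 + math.sqrt(1 - RHO ** 2) * e2)
        x.append(row)
        y.append(BETA[0] * row[0] + BETA[1] * row[1] + rng.gauss(0, 1))
    return x, y

def first_eigenvector(x):
    """Unit eigenvector of X'X for its largest eigenvalue (2 x 2 closed form)."""
    a = sum(r[0] * r[0] for r in x)
    b = sum(r[0] * r[1] for r in x)
    d = sum(r[1] * r[1] for r in x)
    if b == 0.0:
        return (1.0, 0.0) if a >= d else (0.0, 1.0)
    lam = (a + d) / 2 + math.sqrt(((a - d) / 2) ** 2 + b ** 2)
    v = (b, lam - a)
    norm = math.hypot(v[0], v[1])
    v = (v[0] / norm, v[1] / norm)
    # Eigenvectors are only identified up to sign; align with PHI_TRUE.
    if v[0] * PHI_TRUE[0] + v[1] * PHI_TRUE[1] < 0:
        v = (-v[0], -v[1])
    return v

def slope_on_component(x, y, phi):
    """OLS coefficient of y on the single component z = X phi."""
    z = [r[0] * phi[0] + r[1] * phi[1] for r in x]
    return sum(zi * yi for zi, yi in zip(z, y)) / sum(zi * zi for zi in z)

def simulate(n=30, replications=2000, seed=1):
    """Sample variance of the coefficient with known and estimated phi."""
    rng = random.Random(seed)
    fixed, estimated = [], []
    for _ in range(replications):
        x, y = draw_sample(n, rng)
        fixed.append(slope_on_component(x, y, PHI_TRUE))
        estimated.append(slope_on_component(x, y, first_eigenvector(x)))
    return statistics.variance(fixed), statistics.variance(estimated)

var_fixed, var_estimated = simulate()
print(var_fixed, var_estimated)
```

The two printed numbers are Monte Carlo estimates of the unconditional variance of $\hat{\beta}_Z$ and $\hat{\beta}^s_Z$ for this particular assumed design. They say nothing about other designs, and the sign alignment is only one of several possible conventions.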

---
# 3. Model
---

Write about it in general. Paper
> Blundell, R., Pistaferri, L. and Saporta-Eksten, I. Consumption Inequality and Family Labor Supply. American Economic Review 2016, 106(2): 387–435


## 3.1 Simple Model
- loosely based on the Two Earners Life-Cycle Model (wage process)
- paper page 392 equation (1) without transitory shock
- nice to test variance of coefficients in an easy set up

Mincer like equation:
\begin{align}
\ln(Y_{i}) = \alpha_i + \beta_s s_i + \beta_{w_1} w_i + \beta_{w_2} w_i^2 + \pmb X_i + \varepsilon_i
\end{align}
approximate alpha by test scores

## 3.2 Extended by transitory shock
- wage process of two earners life-cycle model
- paper page 392 equation (1) with transitory shock
- test in a more realistic set up

---
# 4. Simulation

Leave out as many dummies as possible

---



variables from page 496 of
> Blundell, R., Dearden, L. and Sianesi, B. Evaluating the effect of education on earnings: models, methods and results from the National Child Development Survey. Journal of the Royal Statistical Society: Series A 2005, 168(3), 473–512

Covariates for ability = Mathematics, Reading ability at 7,11 years

Qualifications: O-Levels, A-levels, HE
White
Father's years of education, Mother's years of education,
Father's age, mother's age
Parent's social class (professional, intermediate, skilled non-manual, skilled manual, semi-skilled non-manual, semi-skilled manual, unskilled)
parent's interest in education
regional factor?

---
# 5. Conclusion
---

---
## Appendix
---

### A.1 Every matrix in style of $\pmb X$ can be transformed to a matrix with random variables of mean zero

Let $I = \{1, \dots, p\}$ be an index set and let $x_i \in \mathbb{M}_{n \times 1}$, $\forall i \in I$, be a random vector whose entries are independent and follow the same distribution $X_i$. The collection of the vectors $x_i$ is defined by the matrix 

\begin{align}
\pmb X = \begin{pmatrix} x_1 & x_2 & \dots & x_p \end{pmatrix} \in \mathbb{M}_{n \times p}.
\end{align}

Since all entries $x_{ji}$, $j = 1, \dots, n$, of a random vector $x_i$ are i.i.d. with $\text{E}\left(x_{ji} \right) = \mu_i$, the sample mean $\overline{x}_i = \frac{1}{n} \sum_{j = 1}^n x_{ji}$ satisfies

\begin{align}
    \text{E}\left(\overline{x}_i\right) = \text{E}\left(\frac{1}{n} \sum_{j = 1}^n x_{ji} \right) = \frac{1}{n} \sum_{j = 1}^n \text{E}\left(x_{ji}\right) = \frac{1}{n} \sum_{j = 1}^n \mu_i = \mu_i
\end{align}

and is therefore an unbiased estimator of $\mu_i$, such that
\begin{align}
\text{E}\left(x_{ji} - \overline{x}_i \right) = \text{E}\left(x_{ji} \right) - \text{E}\left(\overline{x}_i \right) = \mu_i -\mu_i = 0 \text{ for all } j = 1, \dots, n \text{ and } i \in I. 
\tag{a.1}
\end{align} 

Therefore, every matrix $\pmb X$ can be transformed into the desired form with zero means, independent of the magnitude of the $\mu_i$'s.

### A.2 The matrix product $\frac{1}{n-1}\pmb X' \pmb X$ is an unbiased estimator of the variance-covariance (VCV) matrix of the $X_i$'s

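Since the columns of $\pmb X$ are demeaned, only one matrix multiplication is needed to estimate the VCV matrix of the $X_i$'s. The factor $\frac{1}{n-1}$ yields an unbiased estimator when the columns are demeaned by their sample means as in *(a.1)*.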

\begin{align}
   \text{E}\left(\frac{1}{n-1}\pmb X' \pmb X \right) = \frac{1}{n-1} \text{E}\left(\begin{pmatrix} x'_1 \\ \vdots \\ x'_p \end{pmatrix} \begin{pmatrix} x_1 & \dots & x_p \end{pmatrix}\right) = 
   \frac{1}{n-1}\text{E}\begin{pmatrix} 
   x'_1 x_1 & \dots & x'_1 x_p \\
   x'_2 x_1 & \dots & x'_2 x_p \\
   \vdots & \vdots & \vdots \\
   x'_p x_1 & \dots & x'_p x_p \\  
   \end{pmatrix} =
   \begin{pmatrix} \text{var}(X_1)  & \dots & \text{cov}(X_1, X_p) \\
        \text{cov}(X_2, X_1)  & \dots & \text{cov}(X_2, X_p) \\
        \vdots &  \vdots & \vdots \\
        \text{cov}(X_p, X_1) & \dots & \text{var}(X_p) 
    \end{pmatrix}
    \tag{a.2}
\end{align}

### A.3 Multiplying a square matrix by a positive scalar does not change its eigenvectors and the order of its eigenvalues

Let $\pmb A \in \mathbb{M}_{p \times p}$ with ordered eigenvalues $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_p > 0$ and let $v_1, v_2, \dots, v_p$ be the eigenvectors corresponding to the respective $\lambda_i$. Furthermore, let $a \in \mathbb{R}_{++}$ and define $\pmb A' = a \pmb A$ and $\lambda'_i = a \lambda_i$ (in this subsection the prime marks the scaled quantities, not a transpose). From the properties of eigenvectors and eigenvalues it follows for all $i \in I$ that
\begin{align}
    \pmb A v_i = \lambda_i v_i \Longleftrightarrow a \pmb A v_i = a \lambda_i v_i \Longleftrightarrow \pmb A' v_i = \lambda'_i v_i
    \tag{a.3}
\end{align}
 
Since $a > 0$, the order of the different $\lambda'_i$ is still $\lambda'_1 \geq \lambda'_2 \geq \dots \geq \lambda'_p > 0$. A side note for later purposes is that the quotient of an eigenvalue and the sum of all eigenvalues is also invariant to multiplication by a positive scalar:

\begin{align}
    \frac{\lambda'_i}{\sum_{j = 1}^p \lambda'_j} = \frac{a \lambda_i}{\sum_{j = 1}^p a \lambda_j} = \frac{a \lambda_i}{a \sum_{j = 1}^p \lambda_j} = \frac{\lambda_i}{\sum_{j = 1}^p \lambda_j}
    \tag{a.4}
\end{align}

### A.4 Derivation of the empirical PC

From equation *(a.1)* it is known that the means of the $X_i$'s can be estimated without bias and that every matrix $\pmb X$ can be transformed into a matrix whose columns come from random variables with zero mean. From equation *(a.2)* it is known that the VCV matrix of the $X_i$'s can be estimated by $\frac{1}{n-1} \pmb X' \pmb X$. We can use this estimate to solve the optimization problem in *(1.6)* with the estimated variances and covariances of the $X_i$'s. However, one can also proceed as in *(a.2)* and directly compute the estimated VCV matrix of the $Z_m$'s by $\frac{1}{n-1} \hat{\pmb Z}' \hat{\pmb Z}$. I use hat notation for the $\pmb Z$ matrices: since the variance of $Z_m$ in equation *(1.6)* is estimated, $\phi_m$ is also estimated and therefore a random variable. Consequently, $z_m$ is estimated as well, by replacing the true value of $\phi_m$ in equation *(1.2)* with the estimated value. Consistently, I denote the estimated vectors by $\hat{\phi}_m$ and $\hat{z}_m$. \
In the next step the values in $\hat{\pmb \phi}$ are computed recursively. First, the vector $\hat{\phi}_1$ is computed in the same fashion as in *(1.6)* but with the estimated variance. Hence, the maximization problem at hand is

\begin{align}
\hat{\phi}_1 = \arg \max_{||w|| = 1} \left( \frac{1}{n-1} \hat{z}'_1 \hat{z}_1 \right)= \arg \max_{||w|| = 1} \left( \hat{z}'_1 \hat{z}_1 \right)= \arg \max_{||w|| = 1} \left( (\pmb X w)'(\pmb X w) \right) = \arg \max_{||w|| = 1} \left( w' \pmb X' \pmb X w\right)
\tag{a.5}
\end{align}

with the corresponding Lagrangian

\begin{align}
    \mathscr{L}(w,\lambda) =& w' \pmb X' \pmb X w - \lambda \left( w'w-1 \right) \notag \\
    \frac{\partial \mathscr{L}}{\partial \lambda} =& w'w -1 = 0 \notag \\
    \frac{\partial \mathscr{L}}{\partial w} =& 2 \pmb X' \pmb X w- 2\lambda w = 0
    \tag{a.6}
\end{align}

From equation *(a.6)* it follows that $\pmb X' \pmb X w = \lambda w$. Hence, the solution vector $w$ is an eigenvector of $\pmb X' \pmb X$ with eigenvalue $\lambda$. Since $\pmb X' \pmb X$ is a symmetric $p \times p$ matrix, it has $p$ eigenvalues $\hat{\lambda_1} \geq \hat{\lambda_2} \geq \dots \geq \hat{\lambda_p} > 0$ with $p$ corresponding unit-length eigenvectors $\hat{v_1}, \hat{v_2}, \dots, \hat{v_p}$. All $\hat{\lambda_i}$ are greater than zero, since $\pmb X' \pmb X$ is positive definite whenever $\pmb X$ has full column rank. Thus, the question arises which eigenvalue to use to maximize the variance. Since every candidate solution is an eigenvector and is therefore paired with an eigenvalue, it is sufficient to maximize over the eigenvalues.

\begin{align}
    \arg \max_{\lambda \in \{\hat{\lambda_1}, \dots, \hat{\lambda_p} \}} w' \pmb X' \pmb X w \stackrel{(a.6)}{=}  \arg \max_{\lambda \in \{\hat{\lambda_1}, \dots, \hat{\lambda_p} \}} w' \lambda w = \arg \max_{\lambda \in \{\hat{\lambda_1}, \dots, \hat{\lambda_p} \}} \underbrace{w' w}_{1} \lambda = \arg \max_{\lambda \in \{\hat{\lambda_1}, \dots, \hat{\lambda_p} \}} \lambda
    \tag{a.7}
\end{align}

Hence, the maximum variance is attained by choosing the largest eigenvalue, which is, by definition, $\hat{\lambda_1}$. The corresponding eigenvector $\hat{v_1}$ is then chosen to be $\hat{\phi_1}$. From equation *(a.3)* it follows that $\frac{1}{n-1} \pmb X' \pmb X$ and $\pmb X' \pmb X$ have the same eigenvectors, and that their eigenvalues differ only by the scalar factor and therefore have the same order. Hence, it is enough to compute the eigenvector belonging to the largest eigenvalue of the estimated VCV matrix of the $X_i$ to obtain $\hat{\phi}_1$ and with that $\hat{z_1}$. Armed with $\hat{\phi}_1$, it is possible to compute $\hat{\phi}_2, \hat{\phi}_3, \dots, \hat{\phi}_M$ iteratively by setting $\pmb X_m = \pmb X - \sum_{j = 1}^{m-1} \pmb X \hat{\phi}_j \hat{\phi}'_j$ and solving the former problem for $\pmb X_m$ to get $\hat{\phi}_m$.

It turns out that $\hat{\pmb \phi}$ equals the matrix $\hat{\pmb V} = \begin{pmatrix} \hat{v_1} & \hat{v_2 }& \dots & \hat{v_M} \end{pmatrix} $, where $\hat{v_i}$ is the unit-length eigenvector of $\pmb X' \pmb X$ corresponding to the $i$-th highest eigenvalue $\hat{\lambda_i}$. Hence, we can compute $\hat{\pmb \phi}$ by computing the eigenvectors of $\pmb X' \pmb X$ and sorting them in descending order of the corresponding eigenvalues. Subsequently, $\hat{\pmb Z}$ is obtained by matrix multiplication of $\pmb X$ and $\hat{\pmb \phi}$ as in formula *(1.2)*. 

### A.5 Law of total variance

Let $y \in \mathbb{M}_{p \times 1}$ be a random vector and $x$ be a random matrix. The variance of $y$ can be decomposed by

\begin{align}
\text{Var}(y) =& \text{E}(yy') - \text{E}(y)\text{E}(y') \\
=& \text{E} \left(\text{E}(yy'|x) \right) - \text{E} \left[ \text{E}(y|x) \right] \text{E} \left[ \text{E}(y'|x) \right] \\
=& \text{E} \left(\text{Var}(y|x) + \text{E}(y|x) \text{E}(y'|x) \right) - \text{E} \left( \text{E}(y|x) \right) \text{E} \left( \text{E}(y'|x) \right) \\
=& \text{E} \left[\text{Var}(y|x) \right] + \text{E} \left[ \text{E}(y|x) \text{E}(y'|x) \right] - \text{E} \left[ \text{E}(y|x) \right] \text{E} \left[ \text{E}(y'|x) \right]\\
=& \text{E} \left[\text{Var}(y|x) \right] + \text{Var} \left[\text{E}(y|x) \right],
\tag{a.8}
\end{align}

where the law of iterated expectations is used in the second row. The remaining steps apply the identity from the first row, conditional on $x$ in the third row and to $\text{E}(y|x)$ in the last row. 
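
The decomposition can be verified exactly for a small discrete example. The joint distribution below is hypothetical and scalar-valued, chosen only so that all three terms in *(a.8)* can be computed with exact fractions.

```python
from fractions import Fraction

# Hypothetical joint distribution: x is 0 or 1 with probability 1/2 each;
# given x, y takes one of two values with probability 1/2 each.
Y_GIVEN_X = {0: [0, 2], 1: [1, 5]}
P_X = {0: Fraction(1, 2), 1: Fraction(1, 2)}

def mean(values):
    return Fraction(sum(values), len(values))

def variance(values):
    mu = mean(values)
    return sum((Fraction(v) - mu) ** 2 for v in values) / len(values)

cond_mean = {x: mean(ys) for x, ys in Y_GIVEN_X.items()}
cond_var = {x: variance(ys) for x, ys in Y_GIVEN_X.items()}

expected_cond_var = sum(P_X[x] * cond_var[x] for x in P_X)  # E[Var(y|x)]
overall_mean = sum(P_X[x] * cond_mean[x] for x in P_X)      # E[E(y|x)]
var_cond_mean = sum(P_X[x] * (cond_mean[x] - overall_mean) ** 2 for x in P_X)

# Unconditional variance computed directly from the joint distribution
total_var = sum(P_X[x] * Fraction(1, len(ys)) * (y - overall_mean) ** 2
                for x, ys in Y_GIVEN_X.items() for y in ys)

print(total_var, expected_cond_var + var_cond_mean)  # prints "7/2 7/2"
```

Here $\text{E}\left[\text{Var}(y|x)\right] = 5/2$ and $\text{Var}\left[\text{E}(y|x)\right] = 1$, which add up to the unconditional variance of $7/2$.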

# References

> Blundell, R., Dearden, L., & Sianesi, B. (2005). Evaluating the effect of education on earnings: models, methods and results from the National Child Development Survey. Journal of the Royal Statistical Society: Series A (Statistics in Society), 168(3), 473-512.

> Blundell, R., Pistaferri, L., & Saporta-Eksten, I. (2016). Consumption inequality and family labor supply. American Economic Review, 106(2), 387-435.

> Friedman, J., Hastie, T., & Tibshirani, R. (2001). The elements of statistical learning (Vol. 1, No. 10). New York: Springer series in statistics.

> James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112, p. 18). New York: Springer.

---
derivation of PCA
https://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch18.pdf

get number of PCA components
 Kaiser criterion (Guttman, 1954; Kaiser, 1960)
 acceleration factor (Cattell, 1966; Raiche, Roipel, and Blais, 2006) and parallel analysis (Horn, 1965)

comparison of pca and ridge
https://www.researchgate.net/publication/259265422_A_Monte_Carlo_Comparison_between_Ridge_and_Principal_Components_Regression_Methods

Introduction
- Black (1997) shows that parents are willing to pay a premium to buy a house in a neighborhood with a school that scores well. (Do Better Schools Matter? Parental Valuation of Elementary Education) - Harvard paper
- https://www.researchgate.net/profile/Duncan_Thomas/publication/5194918_Early_Test_Scores_Socioeconomic_Status_and_Future_Outcomes/links/575812c508ae5c6549074510/Early-Test-Scores-Socioeconomic-Status-and-Future-Outcomes.pdf nice to find literature on test scores

1. structural model with equation
2. relationship $y = X \beta + \epsilon$
    Xs are correlated -> pcr
3. simulate structural model using parameter
4. see how large the variance is of the pcr beta and compare to the ones estimated. (can be computed)