---
# 4. Simulation
---

In this simulation study I analyze how the variance of the PCR coefficients differs when the true VCV matrix is not known, applying the model introduced in section three. Since the covariances of some model variables are not analytically tractable, and to obtain a realistic set-up, I simulate a true population from which I subsequently draw different samples to examine the behavior of the parameter variance.

## 4.1 Parameterization of the Model

To ease readability, the following table lists for each parameter the full name, the mathematical abbreviation from section three, the chosen parameterization, and a short reasoning for the chosen value.

**Table X.X - Parameterization**

|Dependency| Name | Abbreviation   | Value | Reasoning | Source|
|------|------|------|------|------|------|
|Ability||||||
||Variance of Ability | $\sigma^2_a$ | 1 | Ability is an artificial variable that cannot be measured, and an increase of its variance has the same effect as increasing $\lvert\gamma_{ability}\rvert$. Hence, I chose the standard normal distribution for my parameterization.| -|
|Age||||||
||Maximum Age | - | 68 | I chose the state pension age as the maximum age of the individuals.| British Government [(2018)](https://www.gov.uk/state-pension-age)|
||Minimum Age | - | 33 | I chose the age that is used for the full population in Blundell et al. (2005). This has the advantage that all individuals have finished school.| Blundell et al. (2005)|
|General||||||
|| Population Size | N | 460,000 | In 1991 there were 22,997,199 people in the UK. To reduce computational costs, I set the population size to around 2\% of the UK's true population in 1991. This still allows me to draw large samples in the simulation study without the samples being very likely to be substantially equal.| [Office for National Statistics](https://www.ons.gov.uk/file?uri=/peoplepopulationandcommunity/populationandmigration/populationestimates/datasets/populationestimatesforukenglandandwalesscotlandandnorthernireland/mid2015/ukandregionalpopulationestimates18382015.zip)|
|Number of Siblings||||||
||Decrease in Expected Value of Number of Siblings for each Unit of Parent's Education| $p$ | 0.1 |  Cygan-Rehm and Maeder (2013) report this value to be 0.1. The model used in their paper is a linear model. However, the quantity has the same interpretation and thus I use the same number. | Cygan-Rehm and Maeder (2013)|
||Mean/Variance of Number of Siblings | $(\mu_n, \sigma_n^2)$ | (1.692, 2.89)  | Taken from the summary statistics of Blundell et al. (2005). | Blundell et al. (2005) |
|Parent's Education||||||
|| Mean/Variance of Parent's Years of Education | $(\mu_e, \sigma_e^2)$ | (13.342, 21.215)  | The summary statistics of Blundell et al. (2005) give the mean and standard deviation of father's and mother's education, both measured as years of education after 6th grade. Hence, I did my calculations using the value reported in the paper plus six. This shift does not affect the variance. The result is in line with the value reported by the Human Development Report for the United Kingdom in 1991. Since the moments do not differ meaningfully between fathers and mothers, I decided to only use the mother's education.  | Blundell et al. (2005), [Human Development Report](http://hdr.undp.org/en/indicators/103006) | 
|Test Scores||||||
||Influence of Ability and Parent's Education on the Latent Test Scores| $(\gamma_{ability},\gamma_{parent})$ | (2,2) | I chose $\gamma_{ability}$ and $\gamma_{parent}$ to be equal. The magnitude turned out to yield reasonable results. | - |
||Proportion of Individuals that Scored a (5, 4, 3, 2, 1) in 7th Grade Math | $Gr_{7,m}$ | (0.141, 0.158, 0.185, 0.190, 0.212)| I use the reported quintiles for each test as an approximation of the probability mass.| Blundell et al. (2005)|
||Proportion of Individuals that Scored a (5, 4, 3, 2, 1) in 11th Grade Math | $Gr_{11,m}$ | (0.122,0.152, 0.157, 0.179, 0.199)| I use the reported quintiles for each test as an approximation of the probability mass.| Blundell et al. (2005)|
||Proportion of Individuals that Scored a (5, 4, 3, 2, 1) in 7th Grade Reading | $Gr_{7,r}$ | (0.166, 0.179, 0.188, 0.187, 0.165)| I use the reported quintiles for each test as an approximation of the probability mass.| Blundell et al. (2005)|
||Proportion of Individuals that Scored a (5, 4, 3, 2, 1) in 11th Grade Reading | $Gr_{11,r}$ | (0.132, 0.163, 0.163, 0.176, 0.176)| I use the reported quintiles for each test as an approximation of the probability mass.| Blundell et al. (2005)|
|| Variance of Normally Distributed Errors for Test Scores| $\sigma^2$| 1 | A higher variance of the error decreases the correlation of ability and parent's education with the test outcomes. This parameter must be seen relative to $\gamma_{ability}$ and $\gamma_{parent}$. Hence, I chose standard normally distributed errors to allow for a high correlation between the variables without being forced to use very high values of $\gamma_{ability}$ and $\gamma_{parent}$.| - | 
|Wages||||||
||Lower Bounds for the Betas in the Wage Regression (schooling, working, working_sqr/100, number of siblings, parent's education)| $\beta^{min}$ | (0.03, 0.01, -0.06, -10, 0.01) | The values for schooling, work experience and squared work experience are chosen such that they are close to the values in Björklund and Kjellström (2002). For the number of siblings I wanted to set a non-binding bound. To allow changes to this, I found it reasonable to still incorporate the constraint in the code and set it to a value that is very large in absolute terms. Parent's education should have a positive influence on wages, hence I chose a small positive number. | Björklund and Kjellström (2002) |
||Mean/Variance of Logarithmic Wage| $(\mu_Y, \sigma^2_Y)$ | (2.040, 1.5) |  Blundell et al. (2005) | Blundell et al. (2005) |
||Upper Bounds for the Betas in the Wage Regression (schooling, working, working_sqr/100, number of siblings, parent's education) | $\beta^{max}$ | (0.06, 0.06, -0.03, 10, 10) | The values for schooling, work experience and squared work experience are chosen such that they are close to the values in Björklund and Kjellström (2002). For the number of siblings and parent's education I wanted to set a non-binding upper bound and thus chose a high value for the coefficients. To allow changes in the latter two quantities, I found it reasonable to still incorporate the constraint in the code.  | Björklund and Kjellström (2002) |
||Weighting of Squared Deviation from the Expected Value and the Standard Deviation of the Logarithmic Wage | $\tau$ | 0.5 | I chose equal weights for the expected value and the standard deviation. A change in $\tau$ changed the results only by a very small magnitude.| - |
|Years of Schooling||||||
||Maximum Years of Schooling| m | 29 | I allow an individual to be in education from school entry at age 4 until age 33. | - |
||Mean/Variance of Years of Schooling | $(\mu_s, \sigma_s^2)$ | (13.342, 21.215)  | The summary statistics of Blundell et al. (2005) do not give the mean and standard deviation of an individual's years of schooling. Hence, I decided to use the same moments as for parent's education.  | Blundell et al. (2005)| 
||Minimum Years of Schooling| - | 0 | I decided to use zero. This comes at the cost that some individuals have unrealistically few years of schooling. However, it helps to stay more in line with the given moments. Setting this to a higher value and modifying the structure of the years of schooling is a desirable addition for further analysis. | - |
||Persistence of Parent's Education on Children's Education | $q$ | 0.85 | Using the given moments of parent's education and schooling, the coefficient that sets the lower bound for $q$ in equation *3.13* is equal to $\frac{21.215 - 13.342^2/3}{13.342 - 13.342^2/3} \approx 0.8288$. Hence, I decided to use 0.85. The choice of the coefficient does not change the results qualitatively. | -|
|Years of Working Experience||||||
||Age at School Enrollment | $age\_schoolenrollment$ | 4 | The enrollment age for school of most children is four in Great Britain. |  [British Government](https://www.gov.uk/schools-admissions/school-starting-age)|
||Probability Mass of Gap Years (0,1,2,3,4) | $F_{gap}$ |  (0.59, 0.11,0.7,0.04, 0.03) | Unfortunately I could not find data for the UK. Holmlund et al. (2008) report Swedish data for gap years between schooling and university in 1991. I use this data and assume that everyone who does not go to university does not take any gap year.  | Holmlund et al. (2008) |
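
Two of the values in the table follow from simple arithmetic on the reported figures: the population share behind $N$ and the lower bound for $q$. The following lines are only a small sanity check of these two numbers; the variable names are chosen for illustration and do not appear in the repository.

```r
# Share of the 1991 UK population figure quoted in the table that N represents.
uk_population_1991 <- 22997199
N_check <- 460000
population_share <- N_check / uk_population_1991

# Lower bound for q, using the moments of parent's education / schooling
# exactly as in the expression given in the table.
mu_e  <- 13.342
var_e <- 21.215
q_lower <- (var_e - mu_e^2 / 3) / (mu_e - mu_e^2 / 3)

round(population_share, 4)
round(q_lower, 4)
```

The chosen value $q = 0.85$ lies slightly above the computed bound.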

## 4.2 Description of the whole Population



First I simulate the whole population with the parameterization presented in section *4.1*. The whole data generating process is wrapped in a function, which can be found in the file *DGP_function* in the folder ['R'](https://github.com/manuhuth/PCR-Parameter-Variance-Analysis/tree/master/R) of the GitHub repository. Since the data set is very large, I simulated it in advance and uploaded the data to GitHub. The code I used for the simulation is given, commented out, in the next cell. The inputs are listed in the same order as they appear in Table X.X. The function additionally allows tuning the optimization that is used to obtain the logarithmic wage.

In [21]:
#At the end set this on top of the notebook
setwd('C:/Users/Mhuth/Desktop/PCRPVA') #only for local computations
files <- c('PCA_PropVar', 'PCA', 'PCR_cv', 'PCR_predict', 'PCR', 'random_discreteVariables', 'random_VCV', 'DGP_function') #names of files to read
for (file in files) { #source every helper file from the folder 'R'
    source(file.path('R', paste0(file, '.R')))
}

set.seed(123)
N <- 460000
#population <- dgp_model(var_ability = 1, max_age = 68, min_age = 33, n = N, prob_numbSiblings = 0.1, mean_numberSiblings = 1.692, var_numberSiblings = 2.89,
#                      mean_parent_educ = 13.342, variance_parent_educ = 21.215,
#                      gamma_ability = 2,  gamma_parent_educ = 2,
#                      breaks_test7_m = c(0, 0.141, 0.158, 0.185, 0.190, 0.212),  breaks_test11_m = c(0, 0.122,0.152, 0.157, 0.179, 0.199),
#                      breaks_test7_r =  c(0, 0.166, 0.179, 0.188, 0.187, 0.165),  breaks_test11_r = c(0, 0.132, 0.163, 0.163, 0.176, 0.176), test_cat = TRUE, var_err = 1,
#                      beta_min = c(0.03, 0.01, -0.06, -10, 0.01), mean_wage = 2.040, variance_wage = 1.5,  beta_max = c(0.06, 0.06, -0.03, 10, 10), tau = 0.5,  
#                      max_yearsSchooling = 29,  mean_schooling = 13.342, variance_schooling = 21.215, min_yearsSchooling = 0, q = 0.85,
#                      age_school_count = 4, probs_gap = c(0.59, 0.11,0.7,0.04, 0.03), gap_years = c(0,1,2,3,4))

load("SimData/population.RData")
#keep the variables used for PCA and PCR and add the squared work experience (scaled by 1/100)
X <- population[c('test7_m', 'test11_m', 'test7_r', 'test11_r', 'parent_educ', 'schooling', 'working')]
X$working_squ <- X$working^2/100

**TO-DO** histograms of unstandardized variables to appendix

Since I use standardized variables for PCA and PCR, I only report the variance-covariance matrix of the standardized variables. The function *scale* uses $N-1$ in the denominator to compute the standard deviations. Since the data constitute the whole population, the denominator should actually be $N$. However, this error cancels out when dividing by $(N-1)$ in the subsequent line, which yields the true standardized population variance-covariance matrix.

In [22]:
population_stand <- scale(X, center = TRUE, scale = TRUE)  
VCV <- t(population_stand)%*%population_stand/(N-1) #since all variables are standardized, this equals the Pearson-Bravais correlations.
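
The cancellation of the two denominators can be checked on a small toy data set. The sketch below is only an illustration: the data frame `toy` and all other names are made up and are not part of the simulated population. It compares the route used above (*scale* followed by a division by $N-1$) with a standardization that uses $N$ throughout.

```r
set.seed(42)
N_toy <- 1000

# Hypothetical toy data with three correlated columns.
toy <- data.frame(a = rnorm(N_toy))
toy$b <- 0.5 * toy$a + rnorm(N_toy)
toy$c <- toy$a^2

# Route 1: scale() standardizes with N - 1, then divide the cross product by N - 1.
toy_stand <- scale(toy, center = TRUE, scale = TRUE)
vcv_n_minus_1 <- crossprod(toy_stand) / (N_toy - 1)

# Route 2: standardize with the population standard deviation (denominator N),
# then divide the cross product by N.
centered <- sweep(as.matrix(toy), 2, colMeans(toy))
sd_population <- sqrt(colSums(centered^2) / N_toy)
toy_population <- sweep(centered, 2, sd_population, "/")
vcv_n <- crossprod(toy_population) / N_toy

# Both routes should agree with each other and with the correlation matrix.
routes_agree <- isTRUE(all.equal(vcv_n_minus_1, vcv_n, check.attributes = FALSE))
equals_cor   <- isTRUE(all.equal(vcv_n_minus_1, cor(toy), check.attributes = FALSE))
c(routes_agree = routes_agree, equals_cor = equals_cor)
```

Both routes give the correlation matrix because the factor $N-1$ or $N$ appears once in the squared standard deviations and once in the final division, so it drops out of the ratio.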

**TO-DO** Description of VCV matrix

**Table X.X - Variance-Covariance Matrix of the Standardized Variables for the whole Population**

In [23]:
VCV

| |test7_m|test11_m|test7_r|test11_r|parent_educ|schooling|working|working_squ|
|------|------|------|------|------|------|------|------|------|
|test7_m|1.0|0.9888924|0.8403064|0.8411219|0.6200279|0.7208458|-0.2812663|-0.2720736|
|test11_m|0.9888924|1.0|0.839665|0.8409315|0.6205272|0.7215375|-0.2816682|-0.2726303|
|test7_r|0.8403064|0.839665|1.0|0.971708|0.6249727|0.7256524|-0.2829591|-0.2723009|
|test11_r|0.8411219|0.8409315|0.971708|1.0|0.6242053|0.7250765|-0.2823846|-0.2724868|
|parent_educ|0.6200279|0.6205272|0.6249727|0.6242053|1.0|0.9085257|-0.3553868|-0.3359949|
|schooling|0.7208458|0.7215375|0.7256524|0.7250765|0.9085257|1.0|-0.3910614|-0.3708614|
|working|-0.2812663|-0.2816682|-0.2829591|-0.2823846|-0.3553868|-0.3910614|1.0|0.9822071|
|working_squ|-0.2720736|-0.2726303|-0.2723009|-0.2724868|-0.3359949|-0.3708614|0.9822071|1.0|


From the variance-covariance matrix we can compute the true matrix of eigenvectors $\pmb \phi$ and eigenvalues $\lambda$.

In [24]:
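eig <- eigen(VCV) #compute the eigendecomposition only once
phi <- eig$vectors #columns are the eigenvectors
lambda <- eig$values #eigenvalues in decreasing order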
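
As a small illustration of what the decomposition delivers, the following sketch applies *eigen* to a 3x3 sub-block of the matrix above, using the reported entries for *test7_m*, *parent_educ* and *working*. The names `vcv_sub`, `phi_sub` and `lambda_sub` are introduced only for this example. It checks that the eigenvectors are orthonormal, that $\pmb \phi \, diag(\lambda) \, \pmb \phi'$ reproduces the matrix, and computes the share of the total variance attributed to each component.

```r
# Sub-block of the reported population matrix (test7_m, parent_educ, working).
vars <- c("test7_m", "parent_educ", "working")
vcv_sub <- matrix(c( 1.0000000,  0.6200279, -0.2812663,
                     0.6200279,  1.0000000, -0.3553868,
                    -0.2812663, -0.3553868,  1.0000000),
                  nrow = 3, byrow = TRUE, dimnames = list(vars, vars))

eig_sub    <- eigen(vcv_sub)
phi_sub    <- eig_sub$vectors   # columns are the eigenvectors
lambda_sub <- eig_sub$values    # eigenvalues in decreasing order

# The eigenvectors are orthonormal and reconstruct the original matrix.
orthonormal   <- isTRUE(all.equal(crossprod(phi_sub), diag(3)))
reconstructed <- phi_sub %*% diag(lambda_sub) %*% t(phi_sub)
reconstructs  <- isTRUE(all.equal(reconstructed, vcv_sub, check.attributes = FALSE))

# For standardized variables the eigenvalues sum to the number of variables,
# so each eigenvalue divided by that sum is the share of variance of a component.
prop_var <- lambda_sub / sum(lambda_sub)
cumsum(prop_var)
```

Note that eigenvectors are only determined up to their sign, so a column of $\pmb \phi$ multiplied by $-1$ is an equally valid result.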

**Table X.X - Eigenvectors of the whole Population Variance-Covariance Matrix of the Standardized Variables**

In [25]:
phi

|Variable|1|2|3|4|5|6|7|8|
|------|------|------|------|------|------|------|------|------|
|test7_m|-0.3971789|0.179214258|0.2558676|-0.492516983|0.0428211232|-0.011280568|0.01241045|0.7070891398|
|test11_m|-0.3972409|0.178752432|0.254699|-0.494033593|0.0392241721|0.007511554|-0.00526989|-0.7068830809|
|test7_r|-0.3964945|0.176034246|0.2294243|0.508252053|0.0461715977|-0.705756017|-0.02291929|-0.0099854341|
|test11_r|-0.3965419|0.176409041|0.2314812|0.504526626|0.0471567715|0.707509489|0.02536034|0.0081932756|
|parent_educ|-0.3528383|0.001418523|-0.6908983|-0.020474768|0.6306680198|0.000514802|-6.836487e-05|-0.0009886973|
|schooling|-0.3870243|0.017236135|-0.5054648|-0.007950573|-0.770562383|0.002021462|-0.02406783|0.0027256095|
|working|0.2216269|0.658682847|-0.1122628|-0.00284434|-0.0006473233|0.024118708|-0.7097486|0.0090678749|
|working_squ|0.2158564|0.663068189|-0.1341751|0.000214979|-0.0276275565|-0.023854355|0.7030843|-0.0091535166|


## 4.3 Convergence of Eigenvectors in the 2-Dimensional Case

Since the eigenvectors are 8-dimensional, they cannot be illustrated in a graph. Hence, I start with a 2-dimensional case to give a first intuition of how $\hat{\pmb \phi}$ converges to $\pmb \phi$ as the sample size increases. For this purpose I use the variables *years of schooling* and *years of parent's education*. I chose these two variables since they have a high correlation of 0.9085 and had the same scale in the unstandardized set-up.
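
The idea can be sketched independently of the simulated population. The example below is a simplification and not the procedure used in this study: it assumes two bivariate normal variables with unit variances and the correlation of 0.9085, draws samples of increasing size, and measures the angle between the first eigenvector of the sample covariance matrix and the true first eigenvector. All names (`draw_sample`, `angle_deg`, the sample sizes) are chosen for illustration. The sample covariance matrix is used on purpose, since the eigenvectors of a 2x2 correlation matrix do not depend on the size of the correlation.

```r
set.seed(1)
rho <- 0.9085                                  # correlation quoted in the text
Sigma <- matrix(c(1, rho, rho, 1), nrow = 2)   # assumed population covariance
phi_true <- eigen(Sigma)$vectors[, 1]          # true first eigenvector

# Angle in degrees between two directions; the absolute value removes the
# arbitrary sign of eigenvectors.
angle_deg <- function(u, v) {
  cosine <- min(1, abs(sum(u * v)) / sqrt(sum(u^2) * sum(v^2)))
  acos(cosine) * 180 / pi
}

# Draw n observations with covariance Sigma via its Cholesky factor.
draw_sample <- function(n) {
  z <- matrix(rnorm(2 * n), ncol = 2)
  z %*% chol(Sigma)
}

sample_sizes <- c(50, 500, 5000, 50000)
angles <- sapply(sample_sizes, function(n) {
  phi_hat <- eigen(cov(draw_sample(n)))$vectors[, 1]
  angle_deg(phi_hat, phi_true)
})

data.frame(n = sample_sizes, angle_deg = angles)
```

In this sketch the angle between $\hat{\pmb \phi}$ and $\pmb \phi$ should shrink towards zero for the larger sample sizes, although it need not decrease monotonically for a single draw per sample size.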