* Here you'll generate your own data to make sure you understand what PCA is doing

* Generate 4 variables W, X, Y, and Z

1. X and Y should not be correlated
  * They are independent

2. W and X should have a mild correlation ( < 0.5)

3. Y and Z should have a strong correlation ( > 0.9)

* Here, it's important to note that:
  1. The correlation between two variables A and B is not determined by the regression coefficient
    * $B = 0.4 \times A$ does not mean that the correlation cor(A,B) = 0.4; without noise, the correlation is exactly 1
  2. The correlation depends on the noise in the linear regression.
     * For example, for $B = 0.4 \times A + \epsilon$, the larger the noise component, the smaller the correlation between A and B
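
To make the note above concrete, here is a short derivation. It assumes that $\epsilon$ is independent of $A$, and it writes $\sigma_A$ and $\sigma_\epsilon$ for the standard deviations of $A$ and of the noise (notation introduced here for illustration only). For $B = b \times A + \epsilon$:

$$cov(A, B) = b \times \sigma_A^2, \qquad var(B) = b^2 \times \sigma_A^2 + \sigma_\epsilon^2$$

$$cor(A, B) = \frac{cov(A, B)}{\sigma_A \times \sqrt{var(B)}} = \frac{b \times \sigma_A}{\sqrt{b^2 \times \sigma_A^2 + \sigma_\epsilon^2}}$$

With no noise ($\sigma_\epsilon = 0$) this is $\pm 1$ for any non-zero slope. With $b = 0.4$ and $\sigma_A = \sigma_\epsilon = 1$ it is $0.4 / \sqrt{1.16} \approx 0.37$, and it shrinks further as $\sigma_\epsilon$ grows.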
   

4. Generate a variable `outcome` as a linear combination of W, X, Y, and Z
  * i.e., choose values for the coefficients $\beta_0$, $\beta_1$, etc., and compute `outcome`
 
 
 $$outcome = \beta_0 + \beta_1 \times W + \beta_2 \times X + \beta_3 \times Y + \beta_4 \times Z$$

5. Model your outcome using W, X, Y, and Z.
  * Do your results match your model params?

6. Use PCA to reduce the dimensionality of your dataset
  * Can you explain why you don't need to include the outcome?     

7. Use the bi-plot to visualize the contributions of your initial variables

8. How efficient is the new lower-dimensional space representation at predicting the outcome?
  * Do your results match your model params?
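
Step 5 can be sketched as follows. This is a minimal illustration, not a reference solution: the seed, the sample size `n = 500`, the noise standard deviations, and the names `beta` and `fit` are assumptions chosen for the example. Because `outcome` is built without noise, `lm()` should recover the chosen coefficients up to floating-point error.

```r
# Minimal sketch: simulate the predictors, build a noise-free outcome,
# and check that lm() recovers the coefficients that were chosen.
set.seed(42)   # arbitrary seed, for reproducibility only
n <- 500       # assumed sample size; larger n gives more stable correlations

W <- rnorm(n, mean = 2, sd = 1)
X <- 0.25 * W + rnorm(n)          # mild correlation with W
Y <- rnorm(n, mean = 2, sd = 1)   # drawn independently of W and X
Z <- 3 * Y + rnorm(n)             # strong correlation with Y

# beta_0 ... beta_4 (illustrative values)
beta <- c(7, 9, 3, 1, 2)
outcome <- beta[1] + beta[2] * W + beta[3] * X + beta[4] * Y + beta[5] * Z

fit <- lm(outcome ~ W + X + Y + Z)

# Compare the estimates with the chosen coefficients
print(round(coef(fit), 6))
print(c(cor_WX = cor(W, X), cor_YZ = cor(Y, Z)))
```

Note that calling `summary(fit)` on a noise-free outcome triggers an "essentially perfect fit" warning; adding a small noise term to `outcome` avoids this and makes the standard errors meaningful.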

In [None]:
# Code from R Markdown, with help from Sam Shedd

library(tidyverse)   # attaches dplyr and ggplot2, among others
library(factoextra)  # provides fviz_pca_biplot()

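set.seed(123)  # for reproducibility
n <- 1000      # a larger sample keeps the sample correlations close to their targets

# W and X: mild correlation (< 0.5)
W <- rnorm(n, 2, 1)
X <- 2*W + 1               # no noise: X is an exact linear function of W
cor(W, X)                  # exactly 1, regardless of the slope
error <- rnorm(n, 2, 1)
X <- 0.25*W + error        # the noise dominates the small slope
cor(W, X)                  # mild correlation

# Y and Z: strong correlation (> 0.9)
Y <- rnorm(n, 2, 1)        # drawn independently of W and X
Z <- 2*Y + 1
cor(Y, Z)                  # exactly 1 again
error_1 <- rnorm(n, 2, 1)
Z <- 3*Y + error_1         # use error_1 so that Z does not share its noise with X
cor(Y, Z)                  # strong correlation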

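df_1 <- data.frame(W, X, Y, Z)
head(df_1)  # View(df_1) also works in an interactive RStudio session

# Column names are case-sensitive: the columns are W, X, Y, Z
lm_xy = lm(X ~ Y, df_1)
lm_yz = lm(Y ~ Z, df_1)
lm_xw = lm(X ~ W, df_1)
summary(lm_yz)
summary(lm_xw)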

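outcome <- 7 + 9 * W + 3 * X + 1 * Y + 2 * Z

# Plot each predictor against the outcome.
# outcome is not a column of df_1, so ggplot looks it up in the workspace.
# The bare geom_point() was dropped: it has no y aesthetic and would raise an error.
ggplot(df_1, aes(x = outcome)) +
  geom_line(aes(y = W), color = "darkred") + 
  geom_line(aes(y = X), color = "blue") +
  geom_line(aes(y = Y), color = "green") +
  geom_line(aes(y = Z), color = "purple")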

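# PCA on the predictors only; the outcome is not a column of df_1
pca_wxyz = prcomp(df_1, scale. = TRUE)
str(pca_wxyz)
pca_wxyz$sdev^2 / sum(pca_wxyz$sdev^2)  # proportion of variance explained by each PC

# fviz_pca_biplot() comes from the factoextra package
factoextra::fviz_pca_biplot(pca_wxyz)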
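
One possible way to approach step 8 is to regress the outcome on the first k principal-component scores and watch how $R^2$ changes with k. The sketch below is self-contained and makes its own assumptions: the seed, `n = 500`, the noise levels, and the helper name `r2_for_k` are illustrative choices, not part of the original code. The outcome stays out of `prcomp()`, so the components are built from the predictors alone.

```r
# Minimal sketch of principal-component regression on simulated data.
set.seed(1)    # arbitrary seed, for reproducibility only
n <- 500       # assumed sample size

W <- rnorm(n, 2, 1)
X <- 0.25 * W + rnorm(n)   # mild correlation with W
Y <- rnorm(n, 2, 1)
Z <- 3 * Y + rnorm(n)      # strong correlation with Y
predictors <- data.frame(W, X, Y, Z)
outcome <- 7 + 9 * W + 3 * X + 1 * Y + 2 * Z

# PCA on centred, standardised predictors only
pca <- prcomp(predictors, center = TRUE, scale. = TRUE)

# R^2 of a linear model that uses only the first k PC scores
# (r2_for_k is a hypothetical helper defined here for the example)
r2_for_k <- function(k) {
  scores <- pca$x[, 1:k, drop = FALSE]
  fit <- lm(outcome ~ scores)
  1 - sum(resid(fit)^2) / sum((outcome - mean(outcome))^2)
}

r2_by_k <- sapply(1:4, r2_for_k)
names(r2_by_k) <- paste0("PC1..PC", 1:4)
print(r2_by_k)
```

With all four components the fit is exact, because the scores are an invertible linear transformation of the standardised predictors. How much is lost with fewer components depends on how the outcome's coefficients line up with the directions of largest variance, which is what the exercise asks you to explore.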
