* Here you'll generate your own data to make sure you understand what PCA is doing

* Generate 4 variables W, X, Y, and Z

1. X and Y should not be correlated
  * They are independent

2. W and X should have a mild correlation ( < 0.5)

3. Y and Z should have a strong correlation ( > 0.9)

* Here, it's important to note that:
  1. The correlation of two variables A and B is not determined by the slope of the linear regression
    * $B = 0.4 \times A$ does not mean that the correlation cor(A,B) = 0.4 (with no noise term, the correlation is exactly 1)
  2. Correlation is a function of the noise in the linear regression.
     * For example, for $B = 0.4 \times A + \epsilon$, the larger the noise component, the smaller the correlation between A and B will be
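
To make the point about noise concrete, here is a minimal sketch on simulated data. The seed, sample size, and noise levels are arbitrary choices for illustration and are not part of the exercise. The slope stays at 0.4 throughout; only the standard deviation of the noise changes.

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
n <- 10000                       # large n so the sample correlations are stable
A <- rnorm(n)                    # standard normal predictor

# Same slope (0.4) every time; only the noise sd changes
noise_sd <- c(0, 0.1, 0.4, 2)
cors <- sapply(noise_sd, function(s) cor(A, 0.4 * A + rnorm(n, sd = s)))

data.frame(noise_sd, correlation = round(cors, 3))
```

For a unit-variance A, the population correlation is $0.4 / \sqrt{0.4^2 + \sigma^2}$, where $\sigma$ is the noise standard deviation, so it equals 1 with no noise and shrinks toward 0 as the noise grows.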
   

4. Generate a variable `outcome` as a linear combination of W, X, Y, and Z
  * i.e., choose values for the coefficients $\beta_0$, $\beta_1$, etc., and compute `outcome`
 
 
 $$outcome = \beta_0 + \beta_1 \times W + \beta_2 \times X + \beta_3 \times Y + \beta_4 \times Z$$

In [2]:
install.packages("factoextra")
install.packages("moderndive")
install.packages("readr")
library(tidyverse)
library(geometry)
library(factoextra)
library(moderndive)
library(readr)
library(ggplot2)


ERROR: Error in library(moderndive): there is no package called 'moderndive'


In [20]:
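# Note: the same `error` vector is reused for w, z, and (later) outcome,
# so these noise terms are identical rather than independent.
x = rnorm(30, 2, 1)
error = rnorm(30, 0, 6)   # shared noise, sd = 6
w = 2*x + error           # population cor(w, x) is about 0.32
y = rnorm(30, 3, 3)       # drawn independently of x
z = 3*y + error           # population cor(y, z) is about 0.83, below the 0.9 target
cor(w, x)
cor(y, z)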
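
One possible way to meet all three correlation targets is to give each derived variable its own noise vector and to shrink the noise on Z. The sketch below is an illustration under assumed settings (the seed, `n = 1000`, and the noise standard deviations are my choices, not values from the exercise).

```r
set.seed(42)                     # arbitrary seed (assumption, not from the original)
n <- 1000                        # larger than the original 30 to reduce sampling noise

x <- rnorm(n, mean = 2, sd = 1)
y <- rnorm(n, mean = 3, sd = 3)  # generated independently of x

# Separate noise vectors for w and z, so that no variable is an exact
# linear combination of the others
w <- 2 * x + rnorm(n, sd = 6)    # population cor(w, x) is about 0.32
z <- 3 * y + rnorm(n, sd = 3)    # population cor(y, z) is about 0.95

round(cor(cbind(w, x, y, z)), 2) # full correlation matrix
```

The population values follow from cor = slope × sd(predictor) / sd(response): for Z this is 3 × 3 / sqrt(81 + 9), which is roughly 0.95.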

In [21]:
outcome = 1 + 1.5*w  + 2.1*x + 1.8*y + 0.9*z + error
outcome

5. Model your outcome using W, X, Y, and Z.
  * Do your results match your model params?

In [39]:
model = lm(outcome~w + x + y + z + error)
summary(model)
model$coefficients
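# The intercept matches, but the slopes do not: w, z, and `error` share one noise
# vector, so z = w - 2*x + 3*y and error = w - 2*x hold exactly (hence the NA rows).
# The fit is perfect because `error` is itself a predictor, so no noise is left unexplained.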

"essentially perfect fit: summary may be unreliable"


Call:
lm(formula = outcome ~ w + x + y + z + error)

Residuals:
       Min         1Q     Median         3Q        Max 
-2.519e-14 -2.455e-15  3.022e-16  3.356e-15  8.640e-15 

Coefficients: (2 not defined because of singularities)
              Estimate Std. Error    t value Pr(>|t|)    
(Intercept)  1.000e+00  3.036e-15  3.294e+14   <2e-16 ***
w            3.400e+00  2.603e-16  1.306e+16   <2e-16 ***
x           -1.700e+00  1.222e-15 -1.391e+15   <2e-16 ***
y            4.500e+00  3.599e-16  1.250e+16   <2e-16 ***
z                   NA         NA         NA       NA    
error               NA         NA         NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.615e-15 on 26 degrees of freedom
Multiple R-squared:      1,	Adjusted R-squared:      1 
F-statistic: 1.335e+32 on 3 and 26 DF,  p-value: < 2.2e-16
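
The fitted coefficients in the output above can be reproduced by hand. Because `w`, `z`, and `outcome` all reuse the same `error` vector, the identities $\epsilon = w - 2x$ and $z = 3y + w - 2x$ hold exactly. Substituting both into the definition of `outcome` gives:

$$
\begin{aligned}
\text{outcome} &= 1 + 1.5w + 2.1x + 1.8y + 0.9z + \epsilon \\
&= 1 + 1.5w + 2.1x + 1.8y + 0.9(3y + w - 2x) + (w - 2x) \\
&= 1 + (1.5 + 0.9 + 1)w + (2.1 - 1.8 - 2)x + (1.8 + 2.7)y \\
&= 1 + 3.4w - 1.7x + 4.5y
\end{aligned}
$$

This matches the estimates 3.4, -1.7, and 4.5. It also accounts for the two `NA` rows: `z` and `error` are each an exact linear combination of `w`, `x`, and `y`, so `lm()` drops them, and nothing random is left over for the residuals.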


6. Use PCA to reduce the dimensionality of your dataset
  * Can you explain why you don't need to include the outcome?     

In [38]:
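# Only `outcome` (a single vector) is passed here, so prcomp() returns one component.
# To reduce the predictors, pass a matrix or data frame of w, x, y, and z instead.
pca = prcomp(outcome, scale=TRUE)
summary(pca)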

Importance of components:
                       PC1
Standard deviation       1
Proportion of Variance   1
Cumulative Proportion    1

In [31]:
str(pca)

List of 5
 $ sdev    : num 1
 $ rotation: num [1, 1] 1
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr "PC1"
 $ center  : num 23.6
 $ scale   : num 24.6
 $ x       : num [1:30, 1] -0.299 1.378 1.341 -0.89 -1.388 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr "PC1"
 - attr(*, "class")= chr "prcomp"


In [34]:
pca$rotation

PC1
1


In [40]:
pca$sdev^2

In [41]:
pca$sdev^2 / sum(pca$sdev^2)
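
For comparison, here is a sketch of PCA run on the four predictors rather than on `outcome`. It regenerates its own data with independent noise so that it is self-contained; the seed, sample size, and noise levels are assumptions made for illustration.

```r
set.seed(42)                               # arbitrary seed (assumption)
n <- 1000
x <- rnorm(n, 2, 1)
y <- rnorm(n, 3, 3)
w <- 2 * x + rnorm(n, sd = 6)              # independent noise per variable
z <- 3 * y + rnorm(n, sd = 3)

predictors <- data.frame(w, x, y, z)       # outcome deliberately left out
pca_x <- prcomp(predictors, center = TRUE, scale. = TRUE)

summary(pca_x)                             # importance of PC1 to PC4
var_explained <- pca_x$sdev^2 / sum(pca_x$sdev^2)
round(var_explained, 3)                    # proportion of variance per component
pca_x$rotation                             # loadings of w, x, y, z on each PC
```

PCA is unsupervised: it summarizes the shared variance among the predictors. The outcome is what the reduced representation is later asked to predict, so including it would leak the target into the features.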

7. Use the bi-plot to visualize the contributions of your initial variables

In [42]:
fviz_pca_biplot(pca)

ERROR: Error in facto_summarize(X, element = "var", result = c("coord", "contrib", : The value of the argument axes is incorrect. The number of axes in the data is: 1. Please try again with axes between 1 - 1
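
The error above arises because a biplot needs at least two axes, and the PCA object here has only one. A hedged sketch of a biplot on a four-variable PCA follows. The data are regenerated with assumed settings, and the code falls back to base R's `biplot()` if `factoextra` is not installed.

```r
set.seed(42)                               # arbitrary seed (assumption)
n <- 200
x <- rnorm(n, 2, 1)
y <- rnorm(n, 3, 3)
w <- 2 * x + rnorm(n, sd = 6)
z <- 3 * y + rnorm(n, sd = 3)

pca_x <- prcomp(data.frame(w, x, y, z), scale. = TRUE)

if (requireNamespace("factoextra", quietly = TRUE)) {
  # Label only the variable arrows; observations are drawn as points
  print(factoextra::fviz_pca_biplot(pca_x, label = "var"))
} else {
  biplot(pca_x)                            # base R fallback
}
```

In the resulting plot, arrows that point in nearly the same direction (here, most likely `y` and `z`) indicate strongly correlated variables.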


8. How efficient is the new lower-dimensional space representation at predicting the outcome?
  * Do your results match your model params?

In [None]:
#I don't understand why it's a perfect fit.
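
One way to approach step 8 is principal component regression: fit `outcome` on the first few component scores and compare the fit with the full four-variable model. The sketch below uses assumed settings (seed, `n = 1000`, independent noise, and a fresh noise term with sd 2 in `outcome`), so its numbers will not match the cells above.

```r
set.seed(42)                               # arbitrary seed (assumption)
n <- 1000
x <- rnorm(n, 2, 1)
y <- rnorm(n, 3, 3)
w <- 2 * x + rnorm(n, sd = 6)
z <- 3 * y + rnorm(n, sd = 3)

# Fresh noise for the outcome, not shared with w or z
outcome <- 1 + 1.5 * w + 2.1 * x + 1.8 * y + 0.9 * z + rnorm(n, sd = 2)

pca_x  <- prcomp(data.frame(w, x, y, z), scale. = TRUE)
scores <- as.data.frame(pca_x$x)           # columns PC1..PC4

full_fit <- lm(outcome ~ w + x + y + z)    # all four original variables
pc2_fit  <- lm(outcome ~ PC1 + PC2, data = scores)  # first two components only

coef(full_fit)                             # should be near 1, 1.5, 2.1, 1.8, 0.9
c(full    = summary(full_fit)$r.squared,
  two_pcs = summary(pc2_fit)$r.squared)
```

With independent noise and `error` left out of the predictors, the full model recovers the chosen coefficients approximately rather than perfectly, and the gap between the two R-squared values shows how much predictive information the discarded components carried.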