# Principal Component Analysis

This notebook runs a principal component analysis (PCA) on the top 10 features of the Ames, Iowa housing dataset, as ranked by a linear regression.

Prepare a principal component analysis of these 10 features.
Visualize the loadings and prepare a hypothesis about what each of the important principal components means, phrased so that a non-data person can understand it.
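
As a rough sketch of the workflow this task describes, the snippet below fits a PCA and plots the loadings of the first component. It uses the built-in `USArrests` data as a stand-in, an assumption made only so the example runs without downloading anything; `pca_demo` and `loadings_demo` are illustrative names, not objects from this notebook.

```r
# Stand-in data: the built-in USArrests data frame (4 numeric columns).
# This is an assumption for illustration; the notebook uses the Ames columns.
pca_demo <- princomp(USArrests, cor = TRUE)

# Loadings as a plain matrix: one row per original variable,
# one column per principal component.
loadings_demo <- unclass(loadings(pca_demo))

# Bar chart of the first component's loadings, one bar per variable.
# Large positive or negative bars mark the variables that drive the component.
barplot(loadings_demo[, 1],
        main = "Loadings on Comp.1 (USArrests stand-in)",
        ylab = "Loading", las = 2)
```

Reading the bars by sign and size is one way to arrive at a plain-language label for a component; the same call pattern should apply to a `princomp` fit on the Ames columns.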

In [1]:
library(caret)
library(ggplot2)
library(dplyr)
set.seed(100)

Loading required package: lattice
Loading required package: ggplot2

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



In [39]:
### Loading in Data Set - Refer to 01. EDA for more details
ames_URL <- 'https://ww2.amstat.org/publications/jse/v19n3/decock/AmesHousing.txt'
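# stringsAsFactors = TRUE keeps text columns as factors on R >= 4.0, which the levels() calls below rely on
ames <- read.table(url(ames_URL), sep = '\t', header = TRUE, stringsAsFactors = TRUE)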

ames$Order <- NULL
ames$PID <- NULL

to_be_factors <- c("MS.SubClass", "Overall.Qual", "Overall.Cond", "Bsmt.Full.Bath", "Bsmt.Half.Bath", "Full.Bath", 
                   "Half.Bath", "Bedroom.AbvGr", "Kitchen.AbvGr", "TotRms.AbvGrd", "Fireplaces", "Garage.Cars", "Mo.Sold",
                   "Yr.Sold", "Year.Built", "Year.Remod.Add")
ames[to_be_factors] <- lapply(ames[to_be_factors], factor)

# Impute missing values in these numeric columns with the column mean
ames$Lot.Frontage[is.na(ames$Lot.Frontage)] <- mean(ames$Lot.Frontage, na.rm=TRUE)
ames$Mas.Vnr.Area[is.na(ames$Mas.Vnr.Area)] <- mean(ames$Mas.Vnr.Area, na.rm=TRUE)
ames$Garage.Yr.Blt[is.na(ames$Garage.Yr.Blt)] <- mean(ames$Garage.Yr.Blt, na.rm=TRUE)

# For these features, NA means the house lacks the feature rather than that the value is unknown
empty_means_without <- c("Alley","Bsmt.Qual","Bsmt.Cond","Bsmt.Exposure","BsmtFin.Type.1", "BsmtFin.Type.2", "Fireplace.Qu",
                         "Garage.Type","Garage.Finish", "Garage.Qual","Garage.Cond","Pool.QC","Fence","Misc.Feature")

replace_empty_with_without <- function(feature) {
    levels(feature) <- c(levels(feature), "Without")
    feature[is.na(feature)] <- "Without"
    return(feature)
}

for (feature in empty_means_without) {
    ames[,feature] <- replace_empty_with_without(ames[,feature])
}

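# Drop any rows that still contain missing values
ames <- na.omit(ames)

# One-hot encode every factor column (one 0/1 column per level)
dummy <- dummyVars(" ~ .", data = ames)
ames <- data.frame(predict(dummy, newdata = ames))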

In [40]:
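# Top 10 features by their original names (kept for reference; overwritten by the assignment below)
top10 <- c('Bedroom.AbvGr','TotRms.AbvGrd','Overall.Qual','X2nd.Flr.SF',
           'Bsmt.Full.Bath','Roof.Style','X1st.Flr.SF','BsmtFin.SF.1','Full.Bath','Overall.Cond')

# Top 10 dummy-coded columns; these are the ones actually passed to the PCA
top10 <- c('Bedroom.AbvGr.3','Bedroom.AbvGr.2','TotRms.AbvGrd.6','TotRms.AbvGrd.7','TotRms.AbvGrd.5','Bedroom.AbvGr.4',
           'TotRms.AbvGrd.8','TotRms.AbvGrd.4','Bedroom.AbvGr.1','TotRms.AbvGrd.9')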

In [41]:
head(ames[top10])

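| Bedroom.AbvGr.3 | Bedroom.AbvGr.2 | TotRms.AbvGrd.6 | TotRms.AbvGrd.7 | TotRms.AbvGrd.5 | Bedroom.AbvGr.4 | TotRms.AbvGrd.8 | TotRms.AbvGrd.4 | Bedroom.AbvGr.1 | TotRms.AbvGrd.9 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |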


In [42]:
pca <- princomp(ames[top10], cor = TRUE)

In [44]:
summary(pca)

Importance of components:
                         Comp.1    Comp.2    Comp.3    Comp.4    Comp.5
Standard deviation     1.459887 1.3398605 1.1502210 1.1404091 1.0538219
Proportion of Variance 0.213127 0.1795226 0.1323008 0.1300533 0.1110541
Cumulative Proportion  0.213127 0.3926496 0.5249504 0.6550037 0.7660578
                           Comp.6     Comp.7     Comp.8      Comp.9     Comp.10
Standard deviation     0.99924607 0.86707356 0.69704634 0.269702670 0.174642273
Proportion of Variance 0.09984927 0.07518166 0.04858736 0.007273953 0.003049992
Cumulative Proportion  0.86590704 0.94108869 0.98967605 0.996950008 1.000000000
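
The rows of this summary are linked by simple arithmetic, which the sketch below reproduces from the printed standard deviations. The vector `sdev` is copied by hand from the output above, and the 90% cutoff is an illustrative assumption, not a threshold chosen in this notebook.

```r
# Standard deviations copied from the summary(pca) output above.
sdev <- c(1.459887, 1.3398605, 1.1502210, 1.1404091, 1.0538219,
          0.99924607, 0.86707356, 0.69704634, 0.269702670, 0.174642273)

# Each component's variance (eigenvalue) is its standard deviation squared.
eigenvalues <- sdev^2

# With cor = TRUE the eigenvalues sum to the number of variables (10 here),
# so each proportion is simply eigenvalue / 10, up to rounding.
prop_var <- eigenvalues / sum(eigenvalues)
cum_var  <- cumsum(prop_var)

# Smallest number of components whose cumulative share reaches 90%
# (the 90% cutoff is a hypothetical choice for illustration).
n_keep <- which(cum_var >= 0.90)[1]
n_keep  # 7: the cumulative share is about 0.866 after 6 components and 0.941 after 7
```

On these numbers the first two components together cover only about 39% of the variance, so any plain-language interpretation probably needs to look beyond the first pair.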