# Factor Analysis
By Lorenz Menendez & Isaac Kamber

## Introduction
This notebook shows how we applied Principal Component Analysis (PCA) in R to better understand voting outcomes in the 2016 U.S. Presidential Election. It begins by importing election and demographic data for outlying counties (i.e., counties that voted significantly differently from their neighboring counties). Then we show you how to construct a correlation matrix in R, validate that a PCA is appropriate, and finally complete the PCA. Along the way, we touch on some of the statistical reasoning behind our choice of methods as it relates to our interpretation of the election data. A more detailed account of implementing PCA can be found at the [UCLA Institute for Digital Research & Education](https://stats.idre.ucla.edu/spss/seminars/introduction-to-factor-analysis/a-practical-introduction-to-factor-analysis/).

## Specific Goals
Running a PCA will help us better understand which characteristics define outlying counties by reducing the number of variables (or "dimensions") so that we can interpret overall trends. At the end of the analysis, we will have a breakdown of how each demographic variable contributes to a smaller set of more general components, which are themselves uncorrelated.
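
To make the idea of "uncorrelated components" concrete, here is a minimal sketch using base R's `prcomp()` on the built-in `USArrests` dataset. This is only an illustration: the dataset is a stand-in for the election data, and the PCA call used later in this notebook may differ.

```r
# Illustrative only: USArrests is a built-in R dataset, not the election data.
# Centering and scaling puts all variables on a comparable footing.
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Proportion of total variance captured by each component
varExplained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(varExplained, 3))

# Loadings: how each original variable contributes to each component
print(round(pca$rotation, 2))

# Component scores are uncorrelated: off-diagonal correlations are ~0
scoreCor <- cor(pca$x)
print(round(scoreCor, 3))
```

The loadings table is the kind of "breakdown" described above: each column is a component, and each row shows how strongly one original variable contributes to it.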

## Libraries
To begin, you need to load the following packages with `library()`. We will load additional packages as we go through the tutorial so you can see where each package's functions are used.

In [2]:
library(sf)
library(dplyr)

“package ‘sf’ was built under R version 3.4.4”
Linking to GEOS 3.6.1, GDAL 2.1.3, PROJ 4.9.3
“package ‘dplyr’ was built under R version 3.4.4”
Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



## Importing County-Level Election Data
In a previous notebook, we learned how to identify outlying counties using LISA maps. Now we are going to import demographic, electoral, and geographic data. We've simplified the data wrangling by compiling the necessary data for you into a single GeoJSON file. To load it, simply run the following code.

In [3]:
outlierData = st_read("DATA/studyData.geojson") %>% # Importing data from GeoJSON
        filter(LowHigh == 1 | HighLow == 1) %>% # Filter for Outlying Counties
        select(-POLY_ID) # Remove GeoDa's Primary Key

Reading layer `OGRGeoJSON' from data source `/Users/LorenzMenendez/Google Drive/2018-2019 Homework/Spring Quarter/2016Election/Notebooks/DATA/studyData.geojson' using driver `GeoJSON'
Simple feature collection with 3142 features and 59 fields
geometry type:  MULTIPOLYGON
dimension:      XY
bbox:           xmin: -179.1473 ymin: 18.91747 xmax: 179.7785 ymax: 71.35256
epsg (SRID):    4269
proj4string:    +proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs


To verify that the data loaded successfully, run the following code and make sure that R returns 'sf' and 'data.frame'.

In [4]:
class(outlierData)
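
If you would rather have R stop with an error than inspect the output by eye, the same check can be written as an assertion. The sketch below uses a toy one-row `sf` object as a stand-in for `outlierData`; the `name` column and the point coordinates are hypothetical.

```r
library(sf)

# Toy stand-in for outlierData: a one-row sf object with a point geometry.
# The column name and coordinates are made up for illustration.
toy <- st_sf(
  name = "Example County",
  geometry = st_sfc(st_point(c(-87.6, 41.8)), crs = 4269)
)

# Same check as above: the result should include "sf" and "data.frame"
print(class(toy))

# Fail early if the object is not what we expect
stopifnot(inherits(toy, "sf"), inherits(toy, "data.frame"))
```

In the notebook itself, you would replace `toy` with `outlierData`.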

## Creating a Correlation Matrix
Before attempting a PCA, we have to make sure that the measured variables (or "factors") we chose are actually correlated with the dependent variable we are trying to explain. In this case, we want to confirm that our socioeconomic and demographic factors are correlated with the voting outcome.

For a PCA to be meaningful, we would like to see that many factors are strongly correlated, positively or negatively, with the dependent variable. Here, that means our county-level variables should be correlated with the percent of the vote that went to Hillary Clinton in each county.
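
As a rough sketch of this check, the code below builds a correlation matrix with base R's `cor()` on simulated data and pulls out the column for the dependent variable. All column names here (`pctClinton`, `pctCollege`, `medianAge`, `noise`) are hypothetical stand-ins, and the numbers are randomly generated rather than taken from the election data.

```r
set.seed(42)  # reproducible toy data

# Hypothetical stand-in columns; the real data has different fields
n <- 200
pctCollege <- rnorm(n, mean = 25, sd = 8)
medianAge  <- rnorm(n, mean = 40, sd = 5)
noise      <- rnorm(n)  # unrelated to the outcome by construction
pctClinton <- 20 + 1.2 * pctCollege - 0.5 * medianAge + rnorm(n, sd = 5)

toy <- data.frame(pctClinton, pctCollege, medianAge, noise)

# Full pairwise Pearson correlation matrix
corMatrix <- cor(toy, use = "pairwise.complete.obs")
print(round(corMatrix, 2))

# Correlation of each candidate variable with the dependent variable
corWithOutcome <- corMatrix[-1, "pctClinton"]
print(round(sort(corWithOutcome, decreasing = TRUE), 2))
```

Variables whose correlation with the outcome sits near zero (like `noise` here) are the ones to question before running a PCA. Note that `cor()` needs numeric columns only, so with an `sf` object you would likely drop the geometry first, for example with `st_drop_geometry()`.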