# Factor Analysis
By Lorenz Menendez & Isaac Kamber

## Introduction
This notebook shows how we applied Principal Component Analysis (PCA) in R to better understand voting outcomes in the 2016 U.S. Presidential Election. It begins by importing election and demographic data for outlying counties (i.e. counties that voted significantly differently from their neighboring counties). Then, we show you how to construct a correlation matrix in R, check whether a PCA is appropriate for the data, and finally run the PCA. Along the way, we touch on some of the statistical reasoning behind our choice of methods as it relates to our interpretation of the election data. A more detailed account of implementing PCA can be found at the [UCLA Institute for Digital Research & Education](https://stats.idre.ucla.edu/spss/seminars/introduction-to-factor-analysis/a-practical-introduction-to-factor-analysis/).

## Specific Goals
Running a PCA will help us better understand which characteristics define outlying counties by reducing the number of variables (or "dimensions") so that we can interpret overall trends. At the end of the analysis, we will have a breakdown of how each demographic variable contributes to a smaller set of more general components, which are themselves uncorrelated.
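
As a rough illustration of what "uncorrelated components" means, the sketch below runs a PCA with base R's `prcomp()` on the built-in `USArrests` dataset. The dataset is only a stand-in chosen for illustration; it is not the election data, and this is not necessarily how the PCA is run later in this notebook.

```r
# Illustrative only: USArrests is a built-in dataset, not the election data.
pca <- prcomp(USArrests, scale. = TRUE)  # standardize each variable first

# Loadings: how much each original variable contributes to each component
loadings <- pca$rotation

# Share of the total variance captured by each component
var_share <- pca$sdev^2 / sum(pca$sdev^2)

# Component scores are mutually uncorrelated: off-diagonal correlations are ~0
score_cor <- cor(pca$x)
max_offdiag <- max(abs(score_cor[upper.tri(score_cor)]))

print(round(loadings, 2))
print(round(var_share, 2))
```

The loadings table is the "breakdown" referred to above: each column is a component, and each row shows how strongly one original variable contributes to it.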

## Libraries
To begin, you need to load the following packages. We will load additional packages as we go through the tutorial so you can see where each package's functions are used.

In [2]:
library(sf)
library(dplyr)

“package ‘sf’ was built under R version 3.4.4”
Linking to GEOS 3.6.1, GDAL 2.1.3, PROJ 4.9.3
“package ‘dplyr’ was built under R version 3.4.4”
Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



## Importing County-Level Election Data
In a previous notebook, we learned how to identify outlying counties using LISA maps. Now, we are going to import demographic, electoral, and geographic data. We've simplified the data wrangling by compiling the necessary data into a single file. To load it, simply run the following code.

In [3]:
outlierData = st_read("https://github.com/isaacnk/2016Election/raw/master/Notebooks/DATA/studyData.geojson") %>% 
        filter(LowHigh == 1 | HighLow == 1) %>% 
        select(-POLY_ID) 

Reading layer `OGRGeoJSON' from data source `https://github.com/isaacnk/2016Election/raw/master/Notebooks/DATA/studyData.geojson' using driver `GeoJSON'
Simple feature collection with 3142 features and 59 fields
geometry type:  MULTIPOLYGON
dimension:      XY
bbox:           xmin: -179.1473 ymin: 18.91747 xmax: 179.7785 ymax: 71.35256
epsg (SRID):    4269
proj4string:    +proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs


To verify that the data downloaded successfully, run the following code and make sure that R returns 'sf' and 'data.frame'.

In [4]:
class(outlierData)

## Creating a Correlation Matrix
Before attempting a PCA, we have to make sure that the measured variables (or "factors") we chose are actually correlated with the dependent variable we are trying to explain. In this case, we want to make sure that our socioeconomic and demographic factors are actually correlated with the voting outcome.

For a PCA to be meaningful, we would like to see that many factors are highly correlated or anticorrelated with the dependent variable. Here, we want to see that our county-level variables are correlated with the percent of the vote that went to Hillary Clinton in that county (the `pct_hll` column). To do that, we tell R to create a correlation matrix for our imported dataset.

In [5]:
outlierCor = select(outlierData, -area_name, -fips, -state_abbreviation, -LowHigh, -HighLow, -pct_trm) %>% # Removing Unnecessary Variables
        st_drop_geometry() %>% # Dropping the Geometry Column for speed optimization
        cor() # Create a Correlation Matrix

“the standard deviation is zero”
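
The warning above suggests that at least one of the remaining columns takes a single value across the selected counties, so its correlations come back as `NA`. A minimal sketch of how such columns could be found and dropped is shown below. It uses a toy data frame with made-up values, since we do not know from the output which column in `outlierData` triggers the warning.

```r
# Toy data frame (hypothetical values). `flat` is constant, like a column
# that takes a single value across all of the selected counties.
toy <- data.frame(
  a    = c(1, 2, 3, 4),
  b    = c(2, 1, 4, 3),
  flat = c(5, 5, 5, 5)
)

# Find the columns whose standard deviation is zero
sds <- vapply(toy, sd, numeric(1))
constant_cols <- names(sds)[sds == 0]

# Drop them before computing correlations so that no NA entries appear
toy_cor <- cor(toy[, setdiff(names(toy), constant_cols), drop = FALSE])

print(constant_cols)
print(toy_cor)
```

Note that `sort()` drops `NA` values by default, so a constant column silently disappears from a sorted list of correlations.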

Now let's view the data to see how our variables are correlated with Clinton's vote share:

In [12]:
outlierCor[,"pct_hll"] %>% sort(decreasing = TRUE)
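
If you only want the strongest relationships, the same idea can be extended to pull out the top and bottom few entries. The sketch below uses the built-in `mtcars` dataset as a stand-in, with `mpg` playing the role of `pct_hll`; both choices are assumptions made purely for illustration.

```r
# Illustrative only: mtcars stands in for the county data, "mpg" for "pct_hll".
cors <- cor(mtcars)[, "mpg"]
cors <- cors[names(cors) != "mpg"]   # drop the trivial self-correlation of 1
ranked <- sort(cors, decreasing = TRUE)

top_positive <- head(ranked, 3)  # most positively correlated variables
top_negative <- tail(ranked, 3)  # most negatively correlated variables

print(top_positive)
print(top_negative)
```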

We notice that the variables 'SBO115207', 'RHI325214', and 'HSG096213' are the most positively correlated variables to the percent of the vote going to Clinton. Similarly, 'HSG445213', 'RHI125214', and 'LFE305213' are the most negatively correlated.

To make this output more usable, let's read in the definitions for our variables.

In [18]:
read.csv("DATA/county_facts_dictionary.csv")  # Importing dictionary from CSV

column_name,description
PST045214,"Population, 2014 estimate"
PST040210,"Population, 2010 (April 1) estimates base"
PST120214,"Population, percent change - April 1, 2010 to July 1, 2014"
POP010210,"Population, 2010"
AGE135214,"Persons under 5 years, percent, 2014"
AGE295214,"Persons under 18 years, percent, 2014"
AGE775214,"Persons 65 years and over, percent, 2014"
SEX255214,"Female persons, percent, 2014"
RHI125214,"White alone, percent, 2014"
RHI225214,"Black or African American alone, percent, 2014"


Using the dictionary, we see that Clinton's share of the vote is positively correlated with the percent of American Indian-owned firms, the percent of the population that is American Indian, and the percent of housing units in multi-unit structures.

Clinton's share of the vote is negatively correlated with the homeownership rate, the percent of the population identified as White alone, and mean travel time to work.
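
Translating variable codes into descriptions can also be done programmatically instead of by eye. The sketch below assumes a dictionary with the `column_name` and `description` columns shown in the output above, and rebuilds only a few of its rows by hand. `describe_vars` is a hypothetical helper name, not something defined elsewhere in this notebook.

```r
# A few rows copied from the dictionary output above; the real file has more.
dict <- data.frame(
  column_name = c("PST045214", "RHI125214", "RHI225214"),
  description = c("Population, 2014 estimate",
                  "White alone, percent, 2014",
                  "Black or African American alone, percent, 2014"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the description for each variable code.
# Codes that are missing from the dictionary come back as NA.
describe_vars <- function(codes, dictionary) {
  dictionary$description[match(codes, dictionary$column_name)]
}

describe_vars(c("RHI125214", "PST045214"), dict)
```

With the full dictionary loaded, the same helper could be applied to the names of the sorted correlation vector to label every variable at once.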

As you can tell, studying the correlations one variable at a time does not reveal much about the underlying trends in outlying counties. A PCA, which summarizes many variables at once, is therefore a sensible next step.

## Implementing a Principal Component Analysis

### I. Verify Sampling Adequacy
We will use the [Kaiser-Meyer-Olkin (KMO)](https://www.statisticshowto.datasciencecentral.com/kaiser-meyer-olkin/) test to determine whether our variables are adequately sampled to run a PCA. The KMO statistic ranges from 0 (low adequacy) to 1 (high adequacy). For the PCA, we are looking for a KMO greater than 0.5.
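
To give a sense of what the statistic measures, here is a minimal base-R sketch of the overall KMO value, computed from the standard formula that compares squared correlations with squared partial correlations. It uses the built-in `mtcars` dataset as a stand-in for the election data, and `kmo_overall` is a hypothetical helper name. This is an illustration under those assumptions, not necessarily the function this notebook relies on.

```r
# Hypothetical helper: overall KMO from a correlation matrix R
kmo_overall <- function(R) {
  Rinv <- solve(R)                                  # inverse correlation matrix
  P <- -Rinv / sqrt(outer(diag(Rinv), diag(Rinv)))  # partial correlations
  diag(R) <- 0                                      # keep off-diagonal terms only
  diag(P) <- 0
  sum(R^2) / (sum(R^2) + sum(P^2))
}

# Illustrative only: mtcars stands in for the county data
kmo <- kmo_overall(cor(mtcars))
kmo > 0.5  # TRUE means the data clear the threshold used in this tutorial
```

The helper needs an invertible correlation matrix with no `NA` entries, so any constant columns have to be removed before calling it.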