---
title: "Intro to longpca with nycflights data"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Intro to longpca with nycflights data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
editor_options:
  chunk_output_type: console
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
## Introduction
If you've already read the README, you can skip ahead to the section *PCA the nycflights*.
This package introduces a novel formula syntax for PCA. In modern applications (where data is often in "long format"), the formula syntax helps to fluidly imagine PCA without thinking about matrices. In other words, it provides a layer of abstraction above matrices. Given the formula and the (long) data, the code in this package transforms your data into a proper format for fast PCA via sparse linear algebra. The package also provides code to 1) help pick the number of dimensions to compute, 2) diagnose the suitability of PCA (both pre and post PCA), 3) rotate the PCs with varimax, 4) visualize and interpret the dimensions uncovered, and (not yet) 5) make predictions. This package uses "PCA" as a broad term for computing the leading singular vectors of a normalized (sometimes incomplete) matrix. Some might refer to specific instances as factor analysis, correspondence analysis, latent semantic analysis, social network analysis, or low-rank matrix completion, among other possible terms. This is big-tent PCA, all included. `longpca` is in development, so functions and syntax might change.
The current approach to PCA (principal components analysis) is *matrix first*. This note begins to explore an alternative path, one that is *model first*. The formula syntax provides an alternative way to think about PCA, one that makes matrices transparent: they stay completely hidden unless you want to see the code.
I hope this makes PCA legible for folks who have not yet learned linear algebra (just as linear models are legible without solving linear systems of equations).
I am personally inspired by this approach because (despite the fact that I love matrices and linear algebra) I find that this *model first* way of thinking is so much easier and more direct.
This document gives an illustration with a data analysis of the popular `nycflights13` data via PCA. Headline: we find two seasonal effects (annual and weekly) and also the "fly-over-zone" (midwest 4ever. ride or die \<3 much love to my midwest fam). Code details follow this analysis.
(Disclaimer: this is very early in this project. So, the syntax and the code is likely to change a great deal. Input is very welcome about ways to improve it.)
#### Install
The functions for PCA for the People are contained in an R package `longpca`. If you do not already have `devtools` installed, you will first need to install that:
```{r, eval = FALSE}
install.packages("devtools")
devtools::install_github("karlrohe/longpca")
```
Thank you to Alex Hayes for helpful feedback in this process and suggesting the name `longpca`.
### PCA the nycflights.
To illustrate the code, consider the popular data example `nycflights13` which contains a row for every flight departing from the 3 main New York City airports during 2013 (LGA, JFK, and EWR). It includes things like date, destination, and information about delays.
```{r}
library(nycflights13)
flights
```
There are so many matrices "inside" of this data, but you don't think of them when you see this data. Many applications are like this. The data does not look like a matrix. Instead, it looks like a spreadsheet, a SQL database, or a tidy tibble. That is how users often think about their data. *And underneath it*, there are so many possible matrices. This code will reveal these possibilities.
```{r, echo=FALSE,include=FALSE}
library(tidyverse)
library(longpca)
```
The first step to using `longpca` is the function `make_interaction_model`. This requires two arguments, the data (tidy data in "long format") and the formula (explained in detail below).
Right now, there are four ways to specify a model with `make_interaction_model`. Hopefully, we can find more. Perhaps you will have a way?
Once you make an `interaction_model`, there are many things you can do. Before running PCA, you can `diagnose` to see if the data is "too sparse", then if so, you can `core` to find the dense core of observed data. You can also `pick_dim` to estimate how large of a model `k` your data can infer.
Here, for the introduction, we will just run PCA right away:
```{r, cache=TRUE}
library(nycflights13)
im = make_interaction_model(flights, ~ (month & day)*(dest))
pcs = pca(im, k = 6)
```
This performs PCA. The key innovation is the "formula":
`~ (month & day)*(dest)`,
which specifies the model that we want. If you are already comfortable with matrices and the SVD, then you can imagine this as the matrix we will apply SVD to (in the future, this will be imagined as fitting a statistical model via least squares).
The basic syntax of the formula is this:
`outcome ~ unit * context`.
The outcome can be left blank (Specification 1 below), otherwise it should be a variable in the data. `unit` and `context` can be variables in the data, or they might be specified by multiple variables in the data. For example, in the formula `~ (month & day)*(dest)`, the units are specified by `(month & day)`.
The next section discusses the various ways of using the formula.
## Four ways to specify an interaction_model
##### Specification 1: Empty left-side
The first and most basic use (perhaps also the most powerful) is to not specify an outcome:
`~ (month & day)*(dest)`.
*Specification 1* happens when the left-side of the formula is empty. It specifies a matrix where
1) the units (or rows) are indexed by `(month & day)`, two variables of the `flights` data,
2) the context (or columns) are indexed by destinations `dest`, and
3) the outcome (or elements of the matrix) are the number of flights to each `dest` on each `(month & day)`.
So, there are 365 rows (one for each day), about 100 columns (one for each destination), and the sum of all elements in the matrix is `336,776`, the number of rows in the flights data.
Said another way, if you leave the left-side empty, it counts co-occurrences of `(month & day)` with `dest` and makes a matrix of those co-occurrences. Some people call this matrix a cross-tab or a contingency table. When we do PCA to this matrix of counts, some folks call that Correspondence Analysis.
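As a minimal sketch of this counting idea (using a made-up toy table, not the `longpca` API), base R's `xtabs` builds the same kind of contingency table:

```r
# Toy "long format" data, invented for illustration:
toy_flights <- data.frame(
  day  = c(1, 1, 1, 2, 2, 3),
  dest = c("ORD", "ORD", "MSN", "ORD", "MSN", "MSN")
)
# Rows index days, columns index destinations, and each cell counts
# how often that (day, dest) pair co-occurs in the long data:
xtabs(~ day + dest, data = toy_flights)
# The grand total of the table equals nrow(toy_flights), just as the
# nycflights matrix sums to 336,776.
```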
##### Specification 2: A variable on the left-side that counts something
In this formulation, the outcome (on the left-side of the formula) will have a variable. For example, `nycflights13` has the number of seats available in each plane. We can join this data to the flights to find the number of seats available on each flight.
```{r, cache=TRUE}
dat = flights %>%
  select(month, day, dest, tailnum) %>%
  left_join(planes %>% select(tailnum, seats), by = "tailnum")
```
Then, we can use the formula,
`seats ~ (month & day)*dest`
with the new data `dat` in `make_interaction_model`. This is *Specification 2*. It happens when (i) we specify an outcome on the left-hand side of the formula and (ii) we use the default settings of `make_interaction_model`.
```{r, cache=TRUE}
im_seats = make_interaction_model(dat, seats ~ (month & day)*dest)
```
In this formula above, the units and context are identical to the example for Specification 1. The difference is that now the outcome counts the total number of `seats` that flew to each destination, on each day.
In fact, Specification 2 is very similar to Specification 1. Imagine that `flights` had a variable called `1` and every element of that variable is a numeric 1. Then, the formula
`1 ~ (month & day)*(dest)`
could be interpreted via Specification 2. And in fact, you can type that formula without having a variable `1` in your data and `make_interaction_model` will understand it the same as `~ (month & day)*(dest)`.
##### Specification 3: The outcome variable (the thing on the left-side of the formula) should be averaged
The flights data contains a column `arr_delay` which gives the arrival delay for the flight. One possible model is
`arr_delay ~ (month & day)*(dest)`.
It does not make sense to simply add up the arrival delays. It makes sense to average them. This is specified in `make_interaction_model` with the argument `duplicates = "average"`:
`make_interaction_model(flights, arr_delay ~ (month & day)*(dest), duplicates = "average")`
This uses the same units and context as the last two examples, but it is helpful to begin imagining the things that correspond to arrival delays and how they might interact. From this, we build our model. For example, in my experience, flights later in the day are more likely to be delayed. Also, weekday flights are worse than weekend flights. I wonder if there is an interaction between them.
```{r}
day_dat = flights |>
  mutate(day = wday(time_hour, label = TRUE, abbr = FALSE))
im_delays = day_dat |>
  select(hour, day, arr_delay) |>
  make_interaction_model(arr_delay ~ hour*day, duplicates = "average")
```
*Specification 3* happens when (i) there is a variable on the left-hand side of the formula, (ii) `duplicates = "average"` in `make_interaction_model`, and (iii) **typically** you then want to use `pca_na` instead of `pca` to compute the principal components:
```{r}
pc_delays = pca_na(im_delays, k = 3)
```
Often, many (or most) of the possible combinations of the (unit, context) pairs do not appear in the data. For example, there were no flights to MSN (Madison, Wisconsin) on January 5th. When most of the possible combinations do not appear, we say that the data is "sparse". (You can diagnose the level of sparsity with `diagnose` and you can look at a "dense subset" of the data with `core`).
If any of the combinations do not appear in the data, then you need to decide if PCA should understand those values to be zero or `NA`. You need to think about your problem to decide which one makes sense. In Specifications 1 and 2, it makes sense to make the missing elements zero because *the sum of zero numbers is zero*. In this case, you use the function `pca`. In Specification 3, it likely makes sense to have the missing values be `NA` because *the average of zero numbers is `NA`*. In this case, call `pca_na` which computes the PCs via `softImpute::softImpute`.
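Base R itself illustrates the distinction (a tiny sanity check, not `longpca` code):

```r
# Summing zero numbers is well-defined, so missing counts are zero:
sum(numeric(0))   # 0   -> use pca() for Specifications 1 and 2
# Averaging zero numbers is undefined, so missing averages are NA:
mean(numeric(0))  # NaN -> use pca_na() for Specification 3
```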
Another way to specify this model (i.e. a different interaction to explore) would be to make the unit `(day & hour)` and the context `dest`:
```{r}
im_dest_delay = day_dat |>
make_interaction_model(arr_delay ~ (day&hour)*dest,duplicates = "average")
```
##### Specification 4: The variable after the \* is a text field (or a sequence) that needs to be parsed
There are no good motivating examples for Specification 4 in the flights data. Instead, please see [the `parse_text` vignette](articles/parse_text.html). You can access that functionality with the argument `parse_text = TRUE` inside `make_interaction_model`.
## What is inside the output of pca
The output of `pca` (and `pca_na`) contains the pc's and their loadings in a tidy format:
```{r, cache=TRUE}
im = make_interaction_model(flights, ~ (month & day)*(dest))
pcs = pca(im, k = 6)
pcs$row_features %>% sample_n(size = 3)
pcs$column_features %>% sample_n(size = 3)
```
Because they are tidy, they are pretty easy to ggplot.
First, let's do the units (i.e. dates/rows). To interpret PCs, it is best to plot them in their *native space*. For dates, the native space is a time series or a sequence. Let's plot it there. I give my interpretation after the plots.
```{r, cache=TRUE}
im = make_interaction_model(flights, ~ (month & day)*(dest))
pcs = pca(im, k = 6)
pcs$row_features %>%
mutate(date = make_date(day = day, month=month, year = 2013)) %>%
select(date, contains("pc_")) %>%
pivot_longer(contains("pc_"), names_to = "pc_dimension", values_to = "loadings") %>%
ggplot(aes(x = date, y = loadings)) + geom_line() +
facet_wrap(~pc_dimension, scales= "free") + geom_smooth()
```
I always think of the first PC as the "mean". What we see is that flights are more or less constant throughout the year (see y-axis). I presume that the oscillations are for the weekends. `pc_1` says that, across destinations, there are more flights during the work week and fewer flights on the weekends. The second pc gives a seasonal effect (fewer flights in winter, more in summer); importantly, after `pc_1`, some destinations will have negative values of this (i.e. more in the winter, fewer in the summer). The third pc is positive on weekend destinations (more flights on the weekends and fewer during the weekdays relative to `pc_1`). Again, like `pc_2` some destinations will have a negative value (i.e. more flights on the weekends and fewer during the weekdays relative to the previous two pc's). The last three are harder to interpret. My uninformed guess is that it is some artifact of airline decisions. If you have a guess, I'd love to hear it. Also, later on with `pick_dim`, we have some evidence that they are noise.
Now, let's do the columns (i.e. destinations). The "native space" for destinations is a map. Let's plot it there. Be sure you have `maps` installed.
```{r, cache=TRUE}
airports %>% sample_n(size = 3)
# first, get the lat and lon for the airports:
airport_dat = pcs$column_features %>%
left_join(airports %>% select(dest=faa, lat,lon)) %>%
select(lat, lon, contains("_col")) %>%
pivot_longer(contains("pc_"),
names_to = "pc_dimension", values_to = "loadings") %>%
drop_na()
library(maps)
usa_map <- map_data("state")
p <- ggplot() +
geom_polygon(data = usa_map, aes(x = long, y = lat, group = group),
fill = "white", color = "black") +
coord_fixed(1.3, xlim = c(-125, -65), ylim = c(25, 50))
# i'm only keeping lower 48 states, dropping Anchorage and Honolulu.
p + geom_point(data = airport_dat, aes(x = lon, y = lat,
size = abs(loadings), color = loadings)) +
facet_wrap(~ pc_dimension) +
scale_color_gradient2(low = "red", high = "blue", mid = "white")
```
Here `pc_1` should align with larger and smaller airports (bigger airports \<-\> more flights throughout the year). `pc_2` is negative on Denver and Florida and positive in Maine. Looking back at the time series plots, I interpret this to mean that people go to Denver (skiing) and Florida (beach) in the winter and Maine (coastline) in the summer. `pc_3` picks up the "fly-over zone"... looking back at the time series, folks prefer to travel here during the work week. So, the blue areas are more weekend (vacation) destinations and the red areas are the fly-over. The other pc's are difficult for me to interpret (my guess is that they are weird artifacts of airline things... noise). We do see that the last three are heavily localized on a few airports, looking back at the pairs plots you can see this localization. Given that, my sense is that they are not so interesting, but if I needed to make sense of them, I would print out their most extreme elements and dig into those airports. Making this function is a todo item.
So, using the code is easy. You just need to specify a formula. It's fun to think of other combinations and easy to try them out.
There are three functions that you might like: `diagnose`, `pick_dim`, and `plot`. They are explained below.
# A deeper look inside the code.
### What is an interaction_model?
To make maximum use of this package, it is helpful to think about *models*, not *matrices*. Each of the key functions in this package is handling a class `interaction_model`:
```{r, cache=TRUE}
formula = ~ (month & day)*dest
im = make_interaction_model(flights, formula)
class(im)
names(im)
```
`im` is a list of four elements. First, `$interaction_tibble` which can be thought of as a sparse matrix in triplet form; `get_Matrix(im)` uses this to construct a sparse matrix. Then, `$row_universe` and `$column_universe` which can be thought of as holding the information corresponding to each row/column. Finally, `$settings` contains various details about the construction.
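To make the triplet-form idea concrete, here is a minimal sketch with the `Matrix` package (the column names `row_num`, `col_num`, and `outcome` are illustrative assumptions, not necessarily longpca's internal names):

```r
library(Matrix)

# A triplet table: one row per non-zero matrix entry, invented here:
triplets <- data.frame(
  row_num = c(1, 1, 2, 3),
  col_num = c(1, 2, 2, 3),
  outcome = c(5, 2, 1, 4)
)
# sparseMatrix() assembles the triplets into a sparse matrix, which is
# roughly what get_Matrix(im) does with $interaction_tibble:
A <- sparseMatrix(
  i = triplets$row_num,
  j = triplets$col_num,
  x = triplets$outcome
)
A
```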
# Let's do a more careful analysis
### Examining the matrix sparsity
This package contains a few helper functions. First, if lots of rows or columns have very few non-zero elements, this can cause "localization issues". The matrix needs to be "dense enough" for the PCA to find good stuff. So, `diagnose` prints some simple diagnostics and plots the "degree distribution" for the rows and columns. Here, "degree" is the number of non-zero elements in that row or column.
```{r, cache=TRUE, message=FALSE, warning=FALSE}
# inspect "degree distributions" with this function:
# recall that im is the interaction_model defined above.
diagnose(im)
```
For example, if either average degree was less than 10, then I might be worried. Another diagnostic in the print out is `percent_le_x` which gives the percent of rows/columns that have row/col sums less than or equal to `x`. If a majority of rows or columns has degree less than or equal to 3, then I would be worried. These are clues that the matrix is very sparse and you might have trouble. Issues with sparsity will likely manifest in localization; something that will be evaluated in functions below.
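These diagnostics can be computed by hand on any sparse matrix; here is a hedged sketch (my reconstruction of the ideas above, not the actual `diagnose` implementation):

```r
library(Matrix)
set.seed(1)
# A random sparse matrix standing in for the flights matrix:
A <- rsparsematrix(nrow = 100, ncol = 50, density = 0.05)
row_degree <- rowSums(A != 0)  # non-zero entries per row
col_degree <- colSums(A != 0)  # non-zero entries per column
mean(row_degree)               # worry if much less than 10
# the percent_le_x idea: percent of columns with degree <= 3
mean(col_degree <= 3) * 100
```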
One possibility is to take the "core":
```{r}
im_cored = core(im,core_threshold = 3)
nrow(im$column_universe)
nrow(im_cored$column_universe)
diagnose(im_cored)
```
This finds the largest subset of rows and columns such that each row and column has at least `core_threshold = 3` data points (and it will also return the largest connected component). In this case, it discards two destination airports, LEX and LGA:
```{r}
im$column_universe |> anti_join(im_cored$column_universe, by = "dest")
```
### Picking k with cross-validated eigenvalues
When doing a PCA, you need to pick the model size $k$. The way that we do this in my lab is with cross-validated eigenvalues. It gives you a Z-score and a p-value. [Here is the arxiv paper](https://arxiv.org/abs/2108.03336). Alex Hayes and Fan Chen made it a proper R package on CRAN [gdim](https://cran.r-project.org/web/packages/gdim/gdim.pdf). For this example, it picks `k=4`.
```{r, cache=TRUE}
cv_eigs = pick_dim(im, dimMax = 10,num_bootstraps = 5)
plot(cv_eigs)
cv_eigs
```
Notice that the top line of the printout says that the estimated graph dimension is 4. So, when we use `k = 6` below, we should expect the last dimensions to be harder to interpret. This is what we would expect if they were just noise... but also, maybe they are not just noise?
### Let's get the people some PCA
For right now, the default of `make_interaction_model` is that if there are multiple rows of the long data that have the same values for `(month & day)` and also for `dest`, then the value inside the matrix is a *sum* of the values on the left hand side of the formula. If there is no variable specified on the left hand side, then it is like imagining there is an additional column in your data, `1`, that has the value 1 in every row. So, `formula = ~ (month & day)*(dest)` is equivalent to `formula = 1 ~ (month & day)*(dest)`. Summing over a column of `1`s simply *counts* how many times each entry appears.
Because it is a count, and the variance stabilizing transformation for a Poisson is square root, the default in the code for `pca` is to square root each count. Then, the code computes the normalized and regularized Laplacian `L` (using the number of non-zero entries as the degree). Then, computes the leading `k` singular vectors. This is all done with sparse linear algebra via the packages `Matrix` and `irlba`.
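A hedged sketch of that pipeline on a random sparse matrix (my reading of the description above, not longpca's exact code; the regularization choice `tau = mean degree` is an assumption):

```r
library(Matrix)
library(irlba)
set.seed(1)
# Random counts standing in for the (month & day) x dest matrix:
A <- rsparsematrix(200, 100, density = 0.05,
                   rand.x = function(n) rpois(n, 3) + 1)
A@x <- sqrt(A@x)              # variance-stabilizing square root
deg_row <- rowSums(A != 0)    # "degree" = number of non-zero entries
deg_col <- colSums(A != 0)
tau_row <- mean(deg_row)      # regularization (an assumed choice)
tau_col <- mean(deg_col)
# normalized, regularized Laplacian L = D_r^{-1/2} A D_c^{-1/2}:
L <- Diagonal(x = 1 / sqrt(deg_row + tau_row)) %*% A %*%
     Diagonal(x = 1 / sqrt(deg_col + tau_col))
s <- irlba(L, nv = 6)         # leading 6 singular vectors, sparsely
dim(s$u); dim(s$v)
```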
```{r}
pcs = pca(im, k = 6)
names(pcs)
```
The `row_features` and `column_features` are the PC's and loadings (I don't prefer those old terms). `middle_B` gives the singular values. `settings` contains some details that are handy in later functions.
```{r}
sample_n(pcs$row_features, size = 3)
sample_n(pcs$column_features, size=3)
```
Notice that these features are in a wide and tidy form, making it easy to `lubridate::make_date` (for `row_features`) and left-join with airports (to get latitude and longitude) for `column_features`.
##### Diagnostic plots
You can `plot(pcs)`. It makes these five plots, each described after all plots are displayed.
```{r, cache=TRUE}
plot(pcs)
```
These are the five plots:
1) Screeplot: The top `k` singular values of `L`.\
2) Better screeplot: singular values `2:k` (the first is usually so dominant that it is difficult to see an elbow past it).\
3) A "localization plot" which is very similar (maybe exact?) to [this stuff](https://github.com/karlrohe/LocalizationDiagnostic); for each row (and column) compute its degree and its leverage score. Take the log of both. Fit a linear model `log(leverage)~log(degree)` and plot the residuals against `log(degree)`. If there is localization, I suspect that there will be a big curl on the right side.
4) Pairs plot of `row_features`. This is the plot emphasized in the varimax paper. In these example plots, we do not see very clear radial streaks.\
5) A pairs plot for `column_features`. In both pairs plots, if there are more than 1000 points, then the code samples 1000 points with probability proportional to their leverage scores. It will plot up to `k=10` dimensions. If `k` is larger, then it plots the first 5 and the last 5.
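For item 3, here is a hand-rolled sketch of the localization diagnostic (assuming leverage is the row sums of squared singular vectors; this is my reconstruction, not the package's plotting code):

```r
library(Matrix)
library(irlba)
set.seed(2)
A <- rsparsematrix(300, 80, density = 0.05)
s <- irlba(A, nv = 5)
degree   <- rowSums(A != 0)
leverage <- rowSums(s$u^2)    # assumed definition of leverage score
keep <- degree > 0 & leverage > 0
fit <- lm(log(leverage[keep]) ~ log(degree[keep]))
plot(log(degree[keep]), resid(fit),
     xlab = "log(degree)", ylab = "residual")
# a big curl on the right side would suggest localization
```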