---
title: "netlit Vignette"
author: "Devin Judge-Lord, Adeline Lo & Kyler Hudson"
subtitle: Redistricting Literature
output: rmarkdown::html_vignette
#output: pdf_document
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{netlit Vignette}
  %\VignetteEncoding{UTF-8}
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
                      cache = FALSE,
                      fig.width = 10,
                      fig.height = 7,
                      out.width = "100%",
                      split = TRUE,
                      fig.align = 'center',
                      fig.path = '../man/figures/',
                      fig.retina = 1,
                      warning = FALSE,
                      message = FALSE)

library(knitr)
library(kableExtra) # Table formatting and pipe (%>%)

# format kable for document type
kable <- function(...){
  if (knitr::is_latex_output()){
    # LaTeX output: show only the first 25 rows in a static table
    head(..., 25) %>%
      knitr::kable(booktabs = TRUE, format = 'latex') %>%
      kable_styling(latex_options = c("striped", "scale_down", "HOLD_position"))
  } else {
    # HTML output: show the full table in a scrollable box
    knitr::kable(...) %>%
      kable_styling() %>%
      scroll_box(height = "200px")
  }
}
```
Understanding the gaps and connections across existing theories and findings is a perennial challenge in scientific research. Systematically reviewing scholarship is especially challenging for researchers who may lack domain expertise, including junior scholars or those exploring new substantive territory. Conversely, senior scholars may rely on longstanding assumptions and social networks that exclude new research. In both cases, ad hoc literature reviews hinder accumulation of knowledge. Scholars are rarely systematic in selecting relevant prior work or then identifying patterns across their sample. To encourage systematic, replicable, and transparent methods for assessing literature, we propose an accessible network-based framework for reviewing scholarship. In our method, we consider a literature as a network of recurring concepts (nodes) and theorized relationships among them (edges).
Network statistics and visualization allow researchers to see patterns and offer reproducible characterizations of assertions about the major themes in existing literature.
`netlit` provides functions to generate network statistics from a literature review. Specifically, it processes a dataset where each row is a proposed relationship ("edge") between two concepts or variables ("nodes").
The aim is to offer easy tools to begin using the power of network analysis in R for literature reviews. Using `netlit` simply requires researchers to enter relationships they observe in prior studies into a simple spreadsheet.
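As a minimal sketch of that input format, the chunk below builds a small edge list in base R. The concept names and citation strings are hypothetical placeholders rather than rows from any real review; only the column names `from`, `to`, and `cites` follow the example data used later in this vignette.

```{r}
# Hypothetical edge list: one row per theorized relationship (illustrative only)
toy_literature <- data.frame(
  from  = c("concept a", "concept a", "concept b"),
  to    = c("concept b", "concept c", "concept c"),
  cites = c("Author 2010; Author 2012", "Author 2015", "Author 2018"),
  stringsAsFactors = FALSE
)

# Every distinct value that appears in `from` or `to` is a node
toy_nodes <- sort(unique(c(toy_literature$from, toy_literature$to)))
toy_nodes

# Every row is an edge
nrow(toy_literature)
```

A spreadsheet with these columns, saved as a CSV file and read into R, would have the same shape.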
# Using the `netlit` R Package
Under the hood, `netlit` wraps `igraph` functions to facilitate network analysis of literature reviews.
Install the package from GitHub with:
```{r, eval=FALSE}
devtools::install_github("judgelord/netlit")
```
To install `netlit` from CRAN, run the following:
```{r, eval=FALSE}
install.packages("netlit")
```
## Basic Usage
The `review()` function takes in a dataframe, `data`, that includes `from` and `to` columns (a directed graph structure).
The example below uses data from [this project on redistricting](https://github.com/judgelord/redistricting). These data are a set of related concepts (`from` and `to`) in the redistricting literature, along with citations for each relationship (`cites` and `cites_empirical`). See `vignette("netlit")` for more details on this example.
```{r literature}
library(netlit)
data("literature")
literature %>% kable()
```
---
`netlit` offers four main functions: `make_edgelist()`, `make_nodelist()`, `augment_nodelist()`, and `review()`.
`review()` is the primary function. The others are helper functions that perform the individual steps that `review()` does all at once. `review()` takes in a dataframe with at least two columns representing linked concepts (e.g., a cause and an effect) and returns data augmented with network statistics. Users must either specify "from" nodes and "to" nodes with the `from` and `to` arguments or include columns named `from` and `to` in the supplied `data` object.
`review()` returns a list of three objects:
1. an augmented `edgelist` (a list of relationships with `edge_betweenness` calculated),
2. an augmented `nodelist` (a list of concepts with `degree` and `betweenness` calculated), and
3. a `graph` object suitable for use in other `igraph` functions or other network visualization packages.
Users may wish to include edge attributes (e.g., information about the relationship between the two concepts) or node attributes (information about each concept). We show how to do so below. But first, consider the basic use of `review()`:
```{r}
lit <- review(literature, from = "from", to = "to")
lit
edges <- lit$edgelist
edges %>% kable()
nodes <- lit$nodelist
nodes %>% kable()
```
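To show what a `graph` object of this kind looks like, the sketch below converts a made-up edge list into an `igraph` graph and back again, calling `igraph` directly. It illustrates the data structures involved and is not `netlit`'s internal code; the `toy_*` names are hypothetical.

```{r}
# Hypothetical three-edge literature (illustrative only)
toy_edges <- data.frame(
  from = c("a", "a", "b"),
  to   = c("b", "c", "c"),
  stringsAsFactors = FALSE
)

# Build a directed graph; node names are taken from the first two columns
toy_graph <- igraph::graph_from_data_frame(toy_edges, directed = TRUE)

igraph::vcount(toy_graph) # number of nodes (concepts)
igraph::ecount(toy_graph) # number of edges (relationships)

# Convert the graph back into an edge list and a node list
toy_edges_back <- igraph::as_data_frame(toy_graph, what = "edges")
toy_nodes_back <- igraph::as_data_frame(toy_graph, what = "vertices")
toy_edges_back
toy_nodes_back
```

Calls are namespaced with `igraph::` rather than attaching the package, which avoids masking functions from other attached packages.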
Edge and node attributes can be added using the `edge_attributes` and `node_attributes` arguments. `edge_attributes` is a vector that identifies columns in the supplied dataframe that the user would like to retain. `node_attributes` is a separate dataframe that contains attributes for each node in the primary data set. The example `node_attributes` data include one column, `type`, indicating a type for each node/variable/concept.
```{r}
data("node_attributes")
node_attributes %>% kable()
lit <- review(literature,
edge_attributes = c("cites", "cites_empirical"),
node_attributes = node_attributes)
lit
```
Tip: to retain all variables from `literature`, use `edge_attributes = names(literature)`.
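Before supplying a node attribute table, it can help to check that it covers every concept in the edge list. The sketch below does this in base R on made-up data. The `toy_*` objects and the `type` values are hypothetical, and the join only illustrates the idea of attaching attributes to nodes; how `review()` matches them internally is not shown here.

```{r}
# Hypothetical edge list and node attribute table (illustrative only)
toy_edges <- data.frame(
  from = c("a", "a", "b"),
  to   = c("b", "c", "c"),
  stringsAsFactors = FALSE
)
toy_node_attributes <- data.frame(
  node = c("a", "b", "c"),
  type = c("condition", "condition", "effect"),
  stringsAsFactors = FALSE
)

# All concepts that appear anywhere in the edge list
toy_nodes <- data.frame(
  node = sort(unique(c(toy_edges$from, toy_edges$to))),
  stringsAsFactors = FALSE
)

# Concepts with no row in the attribute table (empty when coverage is complete)
missing_attrs <- setdiff(toy_nodes$node, toy_node_attributes$node)
missing_attrs

# Left join: keep every node, attach its attributes where available
toy_nodelist <- merge(toy_nodes, toy_node_attributes, by = "node", all.x = TRUE)
toy_nodelist
```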
## More Advanced Uses: larger networks, visualizing your network, network descriptives
<!-- Additional columns in the redistricting literature data include descriptions of the `edge` (the relationship between the `to` and `from` concepts), the theorized `mechanism`, and `cite_weight`, the number of studies in the literature that cite that causal relationship.
### A Larger Edgelist and Nodelist -->
```{r libraries, message=FALSE, warning=FALSE}
library(tidyverse)
library(magrittr)
library(ggraph)
```
```{r include=FALSE}
# helper that breaks long concept names across lines so they fit as plot labels
clean <- . %>%
  str_replace_all("([a-z| |-]{8}) ","\\1\n") %>%
  str_replace_all(" ([a-z| |-]{9})", "\n\\1") %>%
  str_to_title() %>%
  str_replace("\nOf\n", "\nOf ") %>%
  str_replace("\nFellow ", " Fellow\n") %>%
  str_replace("\nState\n", " State\n") %>%
  str_replace("\nDistrict\n", " District\n") %>%
  str_replace("\nWith\n", " With\n")

# apply the same cleaning to every column that holds node names
literature$from %<>% clean()
literature$to %<>% clean()
node_attributes$node %<>% clean()
```
Multiple citations for a theorized relationship are separated by semicolons.
Let's count the total number of citations, and the number of citations to empirical work, by splitting each string into individual cites and measuring the length of the resulting vector.
```{r}
# count cites: split each semicolon-separated string and count the pieces
literature %<>%
  group_by(to, from) %>%
  mutate(cite_weight = str_split(cites, ";")[[1]] %>% length(),
         cite_weight_empirical = str_split(cites_empirical, ";")[[1]] %>% length(),
         # a missing value still splits into one piece, so reset it to zero
         cite_weight_empirical = ifelse(is.na(cites_empirical), 0, cite_weight_empirical)) %>%
  ungroup()

# subsets
literature %<>%
  mutate(communities_node = str_c(to, from) %>% str_detect("Commun"),
         confound = case_when(
           from == "Preserve\nCommunities\nOf Interest" & to == "Rolloff" ~ TRUE,
           from == "Voter\nInformation\nAbout Their\nDistrict" & to == "Rolloff" ~ TRUE,
           from == "Preserve\nCommunities\nOf Interest"
           & to == "Voter\nInformation\nAbout Their\nDistrict" ~ TRUE,
           TRUE ~ FALSE),
         empirical = ifelse(!is.na(cites_empirical),
                            "Empirical work",
                            "No empirical work"))
```
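The counting step can also be sketched in base R without grouping. The example below uses made-up citation strings (the `toy_*` names are hypothetical) and shows why a missing value needs special handling: splitting `NA` yields a single `NA` piece, which would otherwise be counted as one citation.

```{r}
# Hypothetical citation strings; NA marks a relationship with no citations
toy_cites <- c("Author 2010; Author 2012; Author 2015", "Author 2018", NA)

# strsplit() returns one character vector per element; lengths() counts the pieces
toy_counts <- lengths(strsplit(toy_cites, ";", fixed = TRUE))
toy_counts # the NA entry is counted as one piece

# Reset missing entries to zero citations
toy_counts[is.na(toy_cites)] <- 0L
toy_counts
```

This is a vectorized alternative to the grouped `mutate()` above, not a replacement for it; it counts each row independently.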
Now we use `review()` on this expanded edgelist, including all variables in the `literature` data with `edge_attributes = names(literature)`.
```{r}
# now with all node and edge attributes
lit <- review(literature,
edge_attributes = names(literature),
node_attributes = node_attributes
)
edges <- lit$edgelist
edges %>% kable()
nodes <- lit$nodelist
nodes %>% kable()
```
### The `igraph` object
```{r}
# define igraph object as g
g <- lit$graph
g
```
The printed summary can be read as follows:
- `D` means the graph is directed
- `N` means the graph is named (nodes have a `name` attribute)
- `W` means the graph is weighted
- `name (v/c)` means _name_ is a node (vertex) attribute of type character
- `cite_weight (e/n)` means _cite_weight_ is an edge attribute of type numeric
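The flags in that summary can also be queried directly. The sketch below builds a made-up two-edge graph (the `toy_*` names and values are hypothetical) and checks each property with `igraph` helpers. One detail worth knowing: `igraph` treats a graph as weighted only when it has an edge attribute named `weight`, so an attribute such as `cite_weight` does not by itself set the `W` flag.

```{r}
# Hypothetical edge list with one numeric edge attribute (illustrative only)
toy_edges <- data.frame(
  from = c("a", "b"),
  to   = c("b", "c"),
  cite_weight = c(2, 1),
  stringsAsFactors = FALSE
)
toy_graph <- igraph::graph_from_data_frame(toy_edges, directed = TRUE)

igraph::is_directed(toy_graph) # corresponds to the D flag
igraph::is_named(toy_graph)    # corresponds to the N flag
igraph::is_weighted(toy_graph) # W flag: requires an edge attribute called "weight"

# Attribute names, as listed after the flags in the printed summary
toy_vertex_attrs <- igraph::vertex_attr_names(toy_graph)
toy_edge_attrs <- igraph::edge_attr_names(toy_graph)
toy_vertex_attrs
toy_edge_attrs
```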
---
## With `ggraph`
We can plot the `igraph` object with the `ggraph` package.
This package allows us to plot self-ties, but it is slightly more difficult to use ggplot features (e.g., colors and legend labels) than with `ggnetwork`.
```{r ggraph, cache=FALSE}
set.seed(5)

p <- ggraph(g, layout = 'fr') +
  geom_node_point(
    aes(color = degree_total %>% as.factor()),
    size = 6,
    alpha = .7
  ) +
  geom_edge_arc2(
    start_cap = circle(3, 'mm'),
    end_cap = circle(6, 'mm'),
    aes(
      color = cite_weight,
      linetype = empirical
    ),
    curvature = 0,
    arrow = arrow(length = unit(2, 'mm'),
                  type = "open")
  ) +
  geom_edge_loop(
    start_cap = circle(5, 'mm'),
    end_cap = circle(2, 'mm'),
    aes(
      color = cite_weight,
      linetype = empirical
    ),
    n = 300,
    strength = .6,
    arrow = arrow(length = unit(2, 'mm'),
                  type = "open")
  ) +
  geom_node_text(aes(label = name), size = 2.3) +
  ggplot2::theme_void() +
  theme(legend.position = "bottom") +
  labs(edge_color = "Number of\nPublications",
       color = "Total Degree\nCentrality",
       edge_linetype = "") +
  scale_edge_colour_viridis(
    discrete = FALSE,
    option = "plasma",
    begin = 0,
    end = .9,
    direction = -1,
    guide = "legend",
    aesthetics = "edge_colour") +
  scale_color_viridis_d(option = "mako",
                        begin = 1,
                        end = .5)

p
```
---
#### Subgraphs
```{r ggraph-subset,fig.width=20, cache=FALSE, fig.retina=8}
p + facet_wrap("communities_node")
p + facet_wrap("confound")
```
### Betweenness
#### Edge Betweenness
```{r ggraph-edge-betweenness}
ggraph(g, layout = 'fr') +
  geom_node_point(size = 10,
                  alpha = .1) +
  theme_void() +
  theme(legend.position = "bottom") +
  scale_color_viridis_c(begin = .5,
                        end = 1,
                        direction = -1,
                        option = "cividis") +
  scale_edge_color_viridis(begin = 0.2,
                           end = .9,
                           direction = -1,
                           guide = "legend",
                           option = "cividis") +
  geom_edge_arc2(
    start_cap = circle(3, 'mm'),
    end_cap = circle(5, 'mm'),
    aes(
      color = edge_betweenness,
      linetype = empirical
    ),
    curvature = .1,
    arrow = arrow(length = unit(2, 'mm'),
                  type = "closed")) +
  geom_edge_loop(aes(color = edge_betweenness)) +
  geom_node_text(aes(label = name),
                 size = 2.3) +
  labs(edge_color = "Edge Betweenness",
       color = "Node Betweenness",
       edge_linetype = "")
```
#### Node Betweenness
```{r ggraph-betweenness}
# nodes colored by betweenness, edges colored by number of citations
p <- ggraph(g, layout = 'fr') +
  geom_node_point(
    aes(color = betweenness),
    size = 6,
    alpha = .7
  ) +
  geom_edge_arc2(
    start_cap = circle(3, 'mm'),
    end_cap = circle(6, 'mm'),
    aes(
      color = cite_weight,
      linetype = empirical
    ),
    curvature = 0,
    arrow = arrow(length = unit(2, 'mm'),
                  type = "open")
  ) +
  geom_edge_loop(
    start_cap = circle(5, 'mm'),
    end_cap = circle(2, 'mm'),
    aes(
      color = cite_weight,
      linetype = empirical
    ),
    n = 300,
    strength = .6,
    arrow = arrow(length = unit(2, 'mm'),
                  type = "open")
  ) +
  geom_node_text(aes(label = name),
                 size = 2.3) +
  theme_void() +
  theme(legend.position = "bottom") +
  labs(edge_color = "Number of\nPublications",
       color = "Betweenness",
       edge_linetype = "") +
  scale_edge_color_viridis(option = "plasma",
                           begin = 0,
                           end = .9,
                           direction = -1,
                           guide = "legend") +
  scale_color_gradient2()

p

# nodes colored by betweenness, edges colored by edge betweenness
ggraph(g, layout = 'fr') +
  geom_node_point(aes(color = betweenness),
                  size = 10,
                  alpha = 1) +
  theme_void() +
  theme(legend.position = "bottom") +
  scale_color_viridis_c(begin = .5,
                        end = 1,
                        direction = -1,
                        option = "cividis") +
  scale_edge_color_viridis(begin = 0.2,
                           end = .9,
                           direction = -1,
                           option = "cividis",
                           guide = "legend") +
  geom_edge_arc2(
    start_cap = circle(3, 'mm'),
    end_cap = circle(5, 'mm'),
    aes(
      color = edge_betweenness,
      linetype = empirical
    ),
    curvature = .1,
    arrow = arrow(length = unit(2, 'mm'),
                  type = "closed")) +
  geom_edge_loop(aes(color = edge_betweenness)) +
  labs(edge_color = "Edge Betweenness",
       color = "Node Betweenness",
       edge_linetype = "") +
  geom_node_text(aes(label = name),
                 size = 2.3)
```
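For intuition about what these betweenness scores measure, the sketch below computes them with `igraph` on a made-up directed chain, a → b → c (the `toy_*` names are hypothetical). Betweenness counts the shortest paths that pass through a node or an edge. This is an illustration of the measure itself, under the assumption that no edge weights are used, and not necessarily the exact call `netlit` makes.

```{r}
# Hypothetical chain of three concepts: a -> b -> c
toy_path <- igraph::graph_from_data_frame(
  data.frame(from = c("a", "b"),
             to   = c("b", "c"),
             stringsAsFactors = FALSE),
  directed = TRUE
)

# Node betweenness: only b lies between two other nodes (on the path a -> c)
toy_node_btw <- igraph::betweenness(toy_path, directed = TRUE)
toy_node_btw

# Edge betweenness: each edge carries two shortest paths
# (a -> b carries a-to-b and a-to-c; b -> c carries b-to-c and a-to-c)
toy_edge_btw <- igraph::edge_betweenness(toy_path, directed = TRUE)
toy_edge_btw
```

In a literature network, a concept with high betweenness connects otherwise separate parts of the scholarship.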
<!--### Coreness -->
```{r ggraph-coreness, eval=FALSE, include=FALSE}
#TODO
p <- ggraph(g, layout = 'fr') +
geom_node_point(
aes(color = coreness),
size = 6,
alpha = .7
) +
geom_edge_arc2(
start_cap = circle(3, 'mm'),
end_cap = circle(6, 'mm'),
aes(
color = cite_weight %>% as_factor(),
linetype = empirical
),
curvature = 0,
arrow = arrow(length = unit(2, 'mm'),
type = "open")
) +
geom_edge_loop(
start_cap = circle(5, 'mm'),
end_cap = circle(2, 'mm'),
aes(
color = cite_weight %>% as_factor(),
linetype = empirical
),
n = 300,
strength = .6,
arrow = arrow(length = unit(2, 'mm'),
type = "open")
) +
geom_node_text(aes(label = name),
size = 2.3) +
theme_void() +
theme(legend.position="bottom") +
labs(edge_color = "Number of\nPublications",
color = "Coreness",
edge_linetype = "") +
scale_edge_color_viridis(discrete = TRUE,
option = "plasma",
begin = 0,
end = .9,
direction = -1) +
scale_color_gradient2()
p
```
### Degree
```{r ggraph-degree-total, fig.retina=8}
ggraph(g, layout = 'fr') +
  geom_node_point(aes(color = degree_total),
                  size = 10,
                  alpha = 1) +
  theme_void() +
  theme(legend.position = "bottom") +
  scale_color_gradient2() +
  scale_edge_color_viridis(begin = 0.2,
                           end = .9,
                           direction = -1,
                           option = "cividis",
                           guide = "legend") +
  geom_edge_arc2(
    start_cap = circle(3, 'mm'),
    end_cap = circle(5, 'mm'),
    aes(
      color = edge_betweenness,
      linetype = empirical
    ),
    curvature = .1,
    arrow = arrow(length = unit(2, 'mm'),
                  type = "closed")) +
  geom_edge_loop(aes(color = edge_betweenness)) +
  labs(edge_color = "Edge Betweenness",
       color = "Total Degree",
       edge_linetype = "") +
  geom_node_text(aes(label = name),
                 size = 2.3)
```
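As a small worked example of degree, the sketch below counts incoming, outgoing, and total ties with `igraph` on a made-up graph that includes one self-tie. The `toy_*` names are hypothetical, and the calls illustrate the measure rather than `netlit`'s own implementation.

```{r}
# Hypothetical edge list; the last row is a self-tie (c -> c)
toy_edges <- data.frame(
  from = c("a", "a", "b", "c"),
  to   = c("b", "c", "c", "c"),
  stringsAsFactors = FALSE
)
toy_graph <- igraph::graph_from_data_frame(toy_edges, directed = TRUE)

toy_degree_in <- igraph::degree(toy_graph, mode = "in")     # ties pointing to a node
toy_degree_out <- igraph::degree(toy_graph, mode = "out")   # ties leaving a node
toy_degree_total <- igraph::degree(toy_graph, mode = "all") # in plus out

# A self-tie adds one incoming and one outgoing tie to the same node
data.frame(node = igraph::V(toy_graph)$name,
           degree_in = toy_degree_in,
           degree_out = toy_degree_out,
           degree_total = toy_degree_total,
           row.names = NULL)
```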
---
# About the example data
Articles in the example data were chosen according to explicit selection criteria. We first identified articles published since 2010 that either 1) appeared in one of eight high-ranking journals or 2) had at least 50 citations according to Google Scholar. From these, we kept articles that contained one of four possible key terms in the title or abstract.
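Selection rules like these can be written as a reproducible filter. The sketch below applies them to a made-up table in base R. Every column name (`year`, `top_journal`, `scholar_cites`, `has_key_term`) and every value is hypothetical, and reading "since 2010" as `year >= 2010` is an assumption; the actual journals and key terms are not reproduced here.

```{r}
# Hypothetical candidate articles (illustrative values only)
toy_articles <- data.frame(
  year          = c(2008, 2012, 2015, 2019),
  top_journal   = c(TRUE, TRUE, FALSE, FALSE), # in one of the selected journals?
  scholar_cites = c(120, 10, 75, 5),           # citation count
  has_key_term  = c(TRUE, TRUE, TRUE, TRUE)    # key term in title or abstract?
)

# Recent AND (top journal OR highly cited) AND contains a key term
toy_keep <- with(toy_articles,
                 year >= 2010 &
                   (top_journal | scholar_cites >= 50) &
                   has_key_term)

toy_articles[toy_keep, ]
```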
```{r}
# Journal articles in example data
data("literature_metadata")

literature_metadata %>% kable()

# count publications per journal
pub_table <- literature_metadata %>%
  # keep only articles whose author string appears in the cited literature
  filter(str_detect(paste(literature$cites, collapse = "|"), Author)) %>%
  count(Publication, name = "Articles") %>%
  # expand journal abbreviations to full names
  mutate(Publication = case_when(
    Publication == "AJPS" ~ "American Journal of Political Science",
    Publication == "APSR" ~ "American Political Science Review",
    Publication == "BJPS" ~ "British Journal of Political Science",
    Publication == "JOP" ~ "The Journal of Politics",
    Publication == "NCL Review" ~ "North Carolina Law Review",
    Publication == "QJPS" ~ "Quarterly Journal of Political Science",
    TRUE ~ Publication
  ))

pub_table %>% kable()
```