---
title: "Original Round of Assigning Articles to Evaluators"
author: "Adam H Sparks"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{assigning_articles_first_round}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
## Creating a sample of articles
Twenty-one journals in the discipline of plant pathology were selected by the authors
as the primary choices for most plant pathologists when publishing manuscripts.
Assuming that journals publish an average of 10 articles per issue, for
a population of ~1000 articles, a confidence level of 95% and a margin of
error of 10%, we need 88 samples. In
[Issue #3](https://github.com/openplantpathology/Reproducibility_in_Plant_Pathology/issues/3),
we decided to select 200 articles to have a large sample of the population.
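
The figure of 88 can be reproduced with Cochran's sample-size formula plus a
finite population correction. This is a sketch of one common way to do the
calculation, assuming maximum variability (p = 0.5) and reading the 10% figure
as the margin of error. The source does not state which calculator was used,
and the variable names here are illustrative.

```{r sample_size_sketch, eval=FALSE}
N <- 1000          # assumed population of articles
e <- 0.10          # margin of error
z <- qnorm(0.975)  # critical value for a 95% confidence level
p <- 0.5           # assumed proportion, which gives the largest sample size

n0 <- z^2 * p * (1 - p) / e^2  # sample size for an infinite population
n <- n0 / (1 + (n0 - 1) / N)   # finite population correction
ceiling(n)                     # 88
```
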
There are three distinct steps involved:

1. Set up the journals, years, page numbers and evaluators, and assign them
   randomly. Create a tibble that holds these data along with empty fields
   for the evaluation of the articles. Create a Google Sheets object to serve as a
   template for the second step.
2. Look up the articles, check that the selected journal, year and page number
   are actually appropriate, and enter the DOI and other comments regarding
   article selection in Google Sheets, which is used in the third step.
3. Import article notes and DOIs from Google Sheets back to R and retrieve as many
   bib entries as possible using the DOI. Merge the bib entries with the notes
   and DOI values. Create a new tab in the Google Sheets object for the evaluators
   to carry out evaluations.

## Step 1
### R setup
```{r library, message=FALSE}
library(dplyr)
library(readr)
library(bib2df)
library(googlesheets)
library(stringr)
library(Reproducibility.in.Plant.Pathology)
set.seed(1)
# For printing tibble in total
options(tibble.print_max = 21, tibble.print_min = 21)
```
### Create list of journals
We hand-picked a list of 21 journals that we felt represented plant pathology
research. In this step, we will create a `tibble` in R of these journals,
assigning them a number so that we can randomise them.
```{r journal_list}
journal_list <- tibble(
  number = 1:21,
  journal = c(
    "Australasian Plant Pathology",
    "Canadian Journal of Plant Pathology",
    "Crop Protection",
    "European Journal of Plant Pathology",
    "Forest Pathology",
    "Journal of General Plant Pathology",
    "Journal of Phytopathology",
    "Journal of Plant Pathology",
    "Virology Journal (Plant Viruses Section)",
    "Molecular Plant-Microbe Interactions",
    "Molecular Plant Pathology",
    "Nematology",
    "Physiological and Molecular Plant Pathology",
    "Phytoparasitica",
    "Phytopathologia Mediterranea",
    "Phytopathology",
    "Plant Disease",
    "Plant Health Progress",
    "Plant Pathology",
    "Revista Mexicana de Fitopatología",
    "Tropical Plant Pathology"
  )
)
```
### Create list of evaluators
#### First round of reviews
When we started this work, four authors were working on it together.
For these four original authors we created a list that is used to randomly
assign the articles that each of us evaluates for this manuscript.

Later we decided to add more authors and more articles. That work is detailed
in this [second document](b2_assigning_articles).
```{r assignees}
assignees <- rep(c("Adam", "Emerson", "Zach", "Nik"), 50)
```
### Create randomised list
Now we will create a randomised list of journal articles to assign to each of
the four authors for this paper.
#### Create a randomised list of the journals
```{r journal_sample}
journals <- tibble(sample(1:21, 200, replace = TRUE))
names(journals) <- "number"
journals <- left_join(journals, journal_list, "number")
```
#### Randomly select articles
Generate a random list of years between 2012 and 2016 and a random list of start
pages between 1 and 150 since some journals start numbering at 1 with every
issue. Then bind the columns of the randomised list of journals with the
randomised years and page start numbers. This then assumes that there is no
temporal effect, _i.e._, the time of year an article is published does not
affect whether or not it is reproducible.
```{r years_and_articles}
year <- sample(2012:2016, 200, replace = TRUE)
contains_page <- sample.int(150, 200, replace = TRUE)
journals <- cbind(journals[, -1], year, contains_page, assignees)
journals <- arrange(.data = journals, assignees, journal, year, contains_page)
```
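Because journals, years and pages are all sampled with replacement, the same
journal/year/page combination can in principle be drawn more than once. A
minimal sketch of how such repeats could be flagged, using base R and a small
hypothetical data frame (`draws` is an illustrative stand-in with made-up
values, not the `journals` object above):

```{r duplicate_check_sketch, eval=FALSE}
# Hypothetical stand-in for the sampled table; values are illustrative only
draws <- data.frame(
  journal = c("Plant Disease", "Phytopathology", "Plant Disease"),
  year = c(2014, 2013, 2014),
  contains_page = c(37, 102, 37),
  stringsAsFactors = FALSE
)

# TRUE for every row that repeats an earlier journal/year/page combination
is_repeat <- duplicated(draws[, c("journal", "year", "contains_page")])

# Rows that would need a replacement article
draws[is_repeat, ]
```
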
#### Check the number of articles per journal
```{r, count}
journals %>%
  group_by(journal) %>%
  tally(sort = TRUE)
```
Once this is done, the articles are manually examined for suitability. Reference
articles or off-topic articles are not included. Notes are provided regarding
these cases in `assigned_article_notes`. If the selected page number/article was
not suitable, the next sequential article was selected manually.
### Add empty columns for evaluations
A variety of information will be collected for each article to be used in the
analysis later. It is easiest to enter this using a spreadsheet application, so
we add columns for the information we want to collect before exporting the
table for manual editing.
```{r}
to_record <- c(
  "doi",
  "IF_5year",
  "country",
  "open",
  "repro_inst",
  "iss_per_year",
  "art_class",
  "supl_mats",
  "comp_mthds_avail",
  "software_avail",
  "software_cite",
  "analysis_auto",
  "data_avail",
  "data_annot",
  "data_tidy",
  "reproducibility_score"
)
journals[to_record] <- ""
journals <-
  journals %>%
  group_by(assignees) %>%
  as_tibble()
```
### Write to Google Sheets
We decided to use Google Sheets so that we could edit the file concurrently
more easily. Once we're done filling in our evaluations, we will import the
data back to R. Jenny Bryan has created a handy package, _googlesheets_, that we
use here. _Note that this must be run interactively for `gs_new()` to work.
Knitting this .Rmd file will not generate any Google Sheets; it must be done in
an interactive R session._

Create a Google Sheets workbook to hold worksheets for this project. This first
sheet will serve as a template and is thus named "template".
```{r setup_gs, eval=FALSE}
# Give googlesheets permission to access spreadsheets and Google Drive
gs_auth()
# create Google Sheet for concurrent edits, first sheet: article notes template
gs_new(
  "article_notes",
  ws_title = "template",
  input = journals,
  trim = TRUE,
  verbose = FALSE
)
```
## Step 2
### Add article DOIs
This step is completed in Google Sheets. Adam looked up the articles and added
notes and DOIs to a Google Sheet, "article_notes", which is imported into R in
the next step.
## Step 3
### Generate a bibliography file to use with the paper
Using the DOIs in this file, we can retrieve the BibTeX citations for those
articles that provide a DOI. Note that not all of the selected articles have a
DOI, so we'll have to manually generate those entries, or fetch them from the
journals' websites later.
```{r join_files, warning=FALSE, message=FALSE, results="hide", eval=FALSE}
article_notes <- gs_title("article_notes")
notes <-
  article_notes %>%
  gs_read(ws = "article_notes")
# format DOIs to all be uniform, lower-case
notes$doi <- tolower(notes$doi)
# convert encodings for text with accents
notes$journal <- iconv(notes$journal, "latin1", "UTF-8")
# get bib references from the DOI (that exist)
bib <- unlist(lapply(notes$doi, doi2bib))
```
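Lower-casing handles one kind of inconsistency in hand-entered DOIs. The
following is a minimal sketch in base R of how the clean-up could be extended,
assuming DOIs may also have been pasted with a resolver prefix, a `doi:` label
or stray whitespace. `normalise_doi()` is a hypothetical helper, not part of
the project package, and the DOIs shown are made up.

```{r normalise_doi_sketch, eval=FALSE}
# Hypothetical helper: tidy hand-entered DOIs before looking them up
normalise_doi <- function(doi) {
  # drop leading and trailing whitespace
  doi <- trimws(doi)
  # drop a resolver prefix such as https://doi.org/ or http://dx.doi.org/
  doi <- sub("^https?://(dx\\.)?doi\\.org/", "", doi, ignore.case = TRUE)
  # drop a leading "doi:" label
  doi <- sub("^doi:[[:space:]]*", "", doi, ignore.case = TRUE)
  # DOIs are case-insensitive, so lower-case gives one canonical form
  tolower(doi)
}

# Made-up DOIs for illustration only
normalise_doi(c(
  " 10.1234/Example.001 ",
  "https://doi.org/10.1234/EXAMPLE.002",
  "doi: 10.1234/example.003"
))
```
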
The _Journal of Plant Pathology_ does not supply valid bib entries via the DOI,
so we need to clean them up.
```{r first_bib_cleaning, warning=FALSE, message=FALSE, results="hide", eval=FALSE}
# locations of "Journal of Plant Pathology" entries in the bib object
clean_pp <- grep("journal=\\{Journal of Plant Pathology\\}", bib)
# create a dataset of just those bib entries
bib_clean_pp <- bib[clean_pp]
# function to clean the entries
clean_articles_pp <- function(x) {
  title_loc <- unlist(gregexpr(pattern = "title", x))
  y <- substr(x, start = title_loc[1], stop = nchar(x))
  y <- paste0("@article{jpp,\n\t", y)
  y <- gsub("}, ", "},\n\t", y)
  y <- gsub("=", " = ", y)
  y <- gsub("}}\n", "}\n}", y)
  y
}
# apply function and clean the bib entries
cleaned_pp <- unlist(lapply(bib_clean_pp, clean_articles_pp))
# remove the incorrect entries for Journal of Plant Pathology
# (guarded, since negative indexing with an empty index would drop every entry)
if (length(clean_pp) > 0) bib <- bib[-clean_pp]
```
_Phytopathologia Mediterranea_ also provides some malformed bib entries. The
cite keys are malformed and `<em> </em>` is used rather than the proper
`\textit{}` for a BibTeX file.
```{r second_bib_cleaning, warning=FALSE, message=FALSE, results="hide", eval=FALSE}
clean_pm <- grep("journal=\\{Phytopathologia Mediterranea\\}", bib)
bib_clean_pm <- bib[clean_pm]
clean_articles_pm <- function(x) {
  x <- gsub(" @article\\{[[:print:]]+, title=\\{", "@article\\{PM, title = \\{", x)
  # "\\\\" yields a single literal backslash in the replacement text
  x <- gsub("<em>", "\\\\textit\\{", x)
  x <- gsub("</em>", "}", x)
  x
}
# apply function and clean the bib entries
cleaned_pm <- unlist(lapply(bib_clean_pm, clean_articles_pm))
# guard against an empty index, which would otherwise drop every entry
if (length(clean_pm) > 0) bib <- bib[-clean_pm]
```
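Escaping a backslash in a `gsub()` replacement is easy to get wrong. A literal
backslash in the output has to be written as four backslashes in R source,
while `"\t"` inside a string is a tab character. A minimal sketch on a made-up
title fragment, for illustration only:

```{r textit_escape_sketch, eval=FALSE}
# Made-up title fragment with HTML emphasis tags, for illustration only
entry <- "title={Detection of <em>Fusarium</em> in wheat}"

# "\\\\" in R source is two backslash characters, which gsub() emits as one
entry <- gsub("<em>", "\\\\textit{", entry)
entry <- gsub("</em>", "}", entry, fixed = TRUE)

# cat() shows the raw characters rather than the escaped R representation
cat(entry, "\n")
```
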
The larger list of bib objects has some items that are supposedly invalid DOIs.
In this step we remove them and clean up the list. They'll be added back to the
database when we merge the notes at the end.

The list also somehow contains duplicates, so we use `unique()` to remove them.
```{r, clean_bib, eval=FALSE}
bib <- bib[bib != "Invalid DOI"]
bib <- unique(bib)
```
Next, export the items to disk as a .bib file to be cleaned using
JabRef. This will automatically assign unique cite keys and fix some issues with
characters in the bib file. Open the file in JabRef and click "yes" when
prompted regarding the cite keys.
```{r export_import_bib, eval=FALSE}
# clean up any existing files since we're appending
unlink("~/tmp/bib_article_notes.bib")
# write files to a local directory and clean up with JabRef
lapply(unlist(bib), write, "~/tmp/bib_article_notes.bib", append = TRUE)
lapply(unlist(cleaned_pp), write, "~/tmp/bib_article_notes.bib", append = TRUE)
lapply(unlist(cleaned_pm), write, "~/tmp/bib_article_notes.bib", append = TRUE)
```
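The `lapply()` calls append entries one at a time, which is why the file is
deleted first. A possible alternative, sketched here with made-up entries and a
temporary file rather than `~/tmp`, is to write everything in a single
`writeLines()` call. That call overwrites the target file, so no clean-up step
is needed. The object names are illustrative stand-ins for `bib`, `cleaned_pp`
and `cleaned_pm`.

```{r write_bib_sketch, eval=FALSE}
# Made-up stand-ins for the real bib entries
bib_demo <- c(
  "@article{a,\n\ttitle = {First}\n}",
  "@article{b,\n\ttitle = {Second}\n}"
)
cleaned_demo <- "@article{jpp,\n\ttitle = {Third}\n}"

# Temporary path used for illustration; the vignette writes to ~/tmp instead
bib_file <- tempfile(fileext = ".bib")

# One call writes every entry and replaces any previous file contents
writeLines(c(bib_demo, cleaned_demo), bib_file)

# Read the file back to confirm what was written
readLines(bib_file)
```
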
*Important*: Open the resulting file in the `~/tmp` directory in JabRef, create
new keys and resave. JabRef will generate unique keys and fix any inconsistencies
that may exist from the DOI search.
### Export Google Sheets
In this last bit, import the BibTeX file back into R and export to a tab in the
Google Sheets for collaborative edits.
```{r google, warning=FALSE, message=FALSE, results="hide", eval=FALSE}
# import the JabRef-cleaned bib file (includes the Journal of Plant Pathology entries)
bib <- bib2df("~/tmp/bib_article_notes.bib")
names(bib) <- tolower(names(bib))
# join notes with bibliographic data
bib_df <- left_join(notes, bib)
bib_df <- arrange(bib_df, assignees)
# create Google Sheet for concurrent edits
article_notes <-
  article_notes %>%
  gs_ws_new(
    ws_title = "article_evaluations",
    input = bib_df,
    trim = TRUE,
    verbose = FALSE
  )
# clean up files to build R package properly with no NOTES
unlink(".httr-oauth")
```
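`left_join()` without a `by` argument joins on every column name that the two
tables share. A minimal sketch of the same idea with an explicit key, using
base R's `merge()` on made-up data and assuming `doi` is the intended key (the
source does not state which columns are matched):

```{r join_sketch, eval=FALSE}
# Made-up stand-ins for the notes and bib tables
notes_demo <- data.frame(
  doi = c("10.1234/example.001", "10.1234/example.002", NA),
  assignees = c("Adam", "Zach", "Nik"),
  stringsAsFactors = FALSE
)
bib_demo <- data.frame(
  doi = c("10.1234/EXAMPLE.001", "10.1234/example.002"),
  title = c("First", "Second"),
  stringsAsFactors = FALSE
)

# Match case on both sides so the keys compare equal
bib_demo$doi <- tolower(bib_demo$doi)

# all.x = TRUE keeps every note, like a left join; unmatched rows get NA
joined <- merge(notes_demo, bib_demo, by = "doi", all.x = TRUE)
joined
```

Naming the key makes the match explicit, so an unrelated column that happens to
share a name in both tables cannot silently change which rows are joined.
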
### Last steps
The last few steps are completed outside of R in Google Sheets: the missing
information is added to the spreadsheet manually and the assignees add their
evaluations of their respective articles.
The Google Sheets can be found here:
https://docs.google.com/spreadsheets/d/19gXobV4oPZeWZiQJAPNIrmqpfGQtpapXWcSxaXRw1-M/edit#gid=1699540381.