-
Notifications
You must be signed in to change notification settings - Fork 0
/
basic_use.Rmd
182 lines (135 loc) · 6.63 KB
/
basic_use.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
---
title: "Basic use"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Basic use}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
library(pubmedRecords)
library(dplyr)
library(ggthemes)
library(stringr)
library(tidyr)
library(tidytext)
library(wordcloud2)
```
This package provides tools to download records from the NCBI PubMed database based on user-specified search criteria, and to add CrossRef citation data for the returned records. The output is in tidy data format, facilitating downstream analysis using tools from the 'tidyverse'.
This vignette illustrates the use of the package to download data for an author from PubMed and the building a word-cloud from the titles of their publications.
I will be using Prof Rolf-Detlef Treede, a renowned scientist in the field of pain research, in this example.
### Step 1
Load the packages required for this vignette.
```{r setup_demo, eval = FALSE}
# install.packages("devtools")
# devtools::install_github("kamermanpr/pubmedRecords")
library(pubmedRecords)
library(dplyr)
library(stringr)
library(tidyr)
library(tidytext)
library(wordcloud2)
```
### Step 2
Enter search parameters and perform a search using the `get_records` function.
The function parameters are:
- **search_terms:** A character string of terms that define the scope of the PubMed database query. Boolean operators (AND, OR, NOT) and search field tags may be used to create more complex search criteria. Commonly used search fields tags include:
- \[TI]: Word in title
- \[TIAB]: Word in title or abstract
- \[MH]: Medical Subject Heading (MeSH)
- \[AU]: Author name (e.g., Doe J)
- \[AD]: Author institutional affiliation
- \[TA]: Journal title (e.g., J Pain)
For a full set of search fields tags: [PubMed search field tags](https://www.ncbi.nlm.nih.gov/books/NBK3827/#_pubmedhelp_Search_Field_Descriptions_and_).
Note that the article publication type, date type, and date range are modified using the _pub\_type_, _date\_type_, _min\_date_ and _max\_date_ arguments below.
- **pub_type:** A character string specifying the type of publication the search must return. The default value is _'journal article'_. For more information: [PubMed article types](https://www.ncbi.nlm.nih.gov/books/NBK3827/table/pubmedhelp.T.publication_types/?report=objectonly).
- **min_date:** A character string in the format _'YYYY/MM/DD'_, _'YYYY/MM'_ or _'YYYY'_ specifying the starting date of the search. The default value is _1966/01/01'_.
- **max_date:** A character string in the format _'YYYY/MM/DD'_, _'YYYY/MM'_ or _'YYYY'_ specifying the end date of the search. The default value is `Sys.Date()`.
- **date_type:** A character string specifying the publication date type that is being specified in the search. Available values are:
- PDAT: Date the article was published (default)
- MDAT: Date the PubMed entry was modified.
- EDAT: Date the entry was added to PubMed.
- **has_abstract:** Logical specifying whether the returned records should be limited to those records that have an abstract. The default value is _TRUE_.
- **api_key:** An API character string obtained from the users NBCI account. The key is not essential, but it specifying a key gives substantially faster record query rates.
**Returning the records can take a while if there are a lot of records, so I suggest that you use `count_records` before `get_records` (they use the same parameters) to check how many record queries will be made before executing a request**.
```{r search_1_demo, eval = FALSE}
# Search for journal articles by RD Treede in the journal "PAIN" and
# which were published between 1 January 2000 and 31 December 2018
df <- get_records(search_terms = "Treede RD[AU] AND Pain[TA]",
min_date = '2000/01/01',
max_date = '2018/12/31',
api_key = NULL, # Add only if you have one (see documentation)
pub_type = 'journal article',
date_type = 'PDAT')
```
```{r search_1, echo = FALSE}
df <- get_records(search_terms = "Treede RD[AU]",
min_date = '2000/01/01',
max_date = '2018/12/31',
pub_type = 'journal article',
date_type = 'PDAT')
```
Have a quick look at the output.
```{r glimpse}
# Print first 10 lines
print(df)
# View structure
glimpse(df)
```
Each author of a paper is found on a separate row, with the rest of the information duplicated down the authors of a given article. Making each row a unique co-author record helps keep the data in a tidy format, and makes filtering records by co-authors easier. The downside is that the returned dataframe can be quite large because of all the duplicated information.
----
Although not essential for this example, you can added CrossRef citation counts to the records using the `citation_metrics` function. This function requires you to pass to it the output from `get_records`.
**The addition of citations also can take a while if there are a lot of records**.
```{r, citation}
# Add a column called "crossref_citations" to the first 6 observations
df_citations <- get_citations(head(df))
# View structure
glimpse(df_citations)
```
----
### Step 3
Now that we have the data we can generate the wordcloud from article titles.
#### First, select the title column
```{r title}
words <- df %>%
# Select the title column
select(title) %>%
# extract unique entries only
unique(.)
```
#### Second, extract 2-ngrams
```{r extract_ngram}
tidy_words <- words %>%
unnest_tokens(word, title, token = "ngrams", n = 2) %>%
# Remove stopwords
separate(word, into = c('word1', 'word2'), sep = ' ') %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word) %>%
# Convert terms containing numerals to NA
mutate(word1 = ifelse(str_detect(word1, '[0-9]'),
yes = NA,
no = paste(word1))) %>%
mutate(word2 = ifelse(str_detect(word2, '[0-9]'),
yes = NA,
no = paste(word2))) %>%
# Remove NA
filter(!is.na(word1)) %>%
filter(!is.na(word2)) %>%
# Join word columns them back together to form 2-ngrams
unite(word, word1, word2, sep = ' ')
```
#### Third, count the number of occurances of each 2-ngram
```{r ngram_count}
ngram_count <- tidy_words %>%
count(word) %>%
arrange(desc(n))
```
#### Fourth, strip out the top 100 2-ngrams and plot
```{r plot, fig.width = 9, fig.height = 7}
word_cloud <- ngram_count[1:100, ] %>%
rename(freq = n)
wordcloud2(data = word_cloud,
fontFamily = 'arial',
size = 0.4,
color = tableau_color_pal(palette = 'Color Blind')(10))
```