-
Notifications
You must be signed in to change notification settings - Fork 1
/
README.Rmd
211 lines (136 loc) · 10.1 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
---
output: github_document
---
# handcodeR <img src="man/figures/logo.png" align="right" height="139" />
[![codecov](https://codecov.io/gh/liserman/handcodeR/branch/master/graph/badge.svg?token=GVL875HZ14)](https://app.codecov.io/gh/liserman/handcodeR)
[![CRAN_Status_Badge](https://www.r-pkg.org/badges/version/handcodeR)](https://cran.r-project.org/package=handcodeR)
[![CRAN_latest_release_date](https://www.r-pkg.org/badges/last-release/handcodeR)](https://cran.r-project.org/package=handcodeR)
[![Downloads](https://cranlogs.r-pkg.org/badges/handcodeR)](https://cran.r-project.org/package=handcodeR)
[![Downloads](https://cranlogs.r-pkg.org/badges/grand-total/handcodeR?color=yellow)](https://cran.r-project.org/package=handcodeR)
[![DOI](https://zenodo.org/badge/608736610.svg)](https://zenodo.org/badge/latestdoi/608736610)
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
warning = FALSE,
message = FALSE,
comment = "#>",
fig.path = "man/figures/",
out.width = "100%"
)
```
# handcodeR
R-Package to facilitate the annotation of text data by hand in R.
The goal of the handcodeR package is to provide an easy to use app to annotate text data by hand. Often times when we work with text data, we rely on hand coded annotations of texts either as unit of analysis in itself, or as training and test samples for supervised machine learning tools to classify text data. handcodeR offers a Shiny-App that can be run within R to annotate individual texts one by one in up to six different variables. To do so, the package uses the function `handcode()`:
- `handcode()` opens a Shiny-App which allows for hand-coding strings of text into pre-defined categories. You can code between one and six variables at a time. It returns a data frame with your coded annotations.
I present a short step-by-step guide as well as the functions in more detail below.
## How to cite this package
To cite the handcodeR package, you can use:
> Isermann, Lukas. (2023). handcodeR: Text annotation app.
> R package version 0.1.2.
> <http://doi.org/10.5281/zenodo.8075100>.
You can also access the preferred citation as well as the bibtex entry for the handcodeR Package via R:
```{r}
citation("handcodeR")
```
## Installation
A stable version of `handcodeR` can be directly accessed on CRAN:
```{R, message = FALSE, warning = FALSE, results = "hide", eval = FALSE}
install.packages("handcodeR", force = TRUE)
```
To install the latest development version of `handcodeR` directly from [GitHub](https://github.com/liserman/handcodeR) use:
```{R, message=FALSE, warning=FALSE, results = "hide", eval=FALSE}
library(devtools) # Tools to Make Developing R Packages Easier
devtools::install_github("liserman/handcodeR", force = TRUE)
```
## How to use this package
First, load the package
```{R, message=FALSE, warning=FALSE}
library(handcodeR) # classify texts by hand in R
```
In the following, we are going to exemplify the workflow of the package using a minimal working example.
The workflow of the package follows a simple rule:
1. If you start the coding process, initialize the coding with `handcode()` by providing a text vector of texts you wish to annotate as `data` input, and up to three named character vectors of categories you want to code. Hand code as much data as you would like and return the output data frame via the `save and exit`-button.
2. If you want to resume coding that you have already been working on, continue the coding with `handcode()` by providing the data frame you received as output from your last call of `handcode()` as `data` input.
### handcode
The main function of the handcodeR package is `handcode()`. `handcode()` takes either a vector of texts and up to 6 named character vectors with classification categories, or a data frame already initialized by `handcode()` as input. The function allows users to annotate texts using the pre-defined categories in an interactive Shiny-App and returns a data frame of the texts with their annotations.
In order to demonstrate the functionality of `handcode()`, we first use the R-package [`archiveRetriever`](https://github.com/liserman/archiveRetriever) to download a New York Times article on the presidential debate between Joe Biden and Donald Trump in the 2020 American presidential campaign. We split the article in individual sentences which we can then annotate with `handcode()`.
```{r download data}
# Install pacman if not already installed
if(!require(pacman)) install.packages("pacman")
# Use pacman to install and load archiveRetriever and stringr
pacman::p_load(archiveRetriever,
stringr)
# Use the archiveRetriever to download article
nytimes_article <- scrape_urls(Urls = "http://web.archive.org/web/20201001004918/https://www.nytimes.com/2020/09/30/opinion/biden-trump-2020-debate.html",
Paths = c(title = "//h1[@itemprop='headline']",
author = "//span[@itemprop='name']",
date = "//time//text()",
article = "//section[@itemprop='articleBody']//p"))
# Split up the article in different sentences
sentences <- unlist(str_split(nytimes_article$article, pattern = "(?<=(?<!Mr)[\\.!?])\\s"))
head(sentences)
```
We can now use these sentences as input in `handcode()` to annotate the individual sentences of the New York Times article. We will annotate two variables measuring the candidate a sentence talks about and the sentiment of the statement.
```{r, eval = FALSE}
annotated <- handcode(data = sentences,
candidate = c("Joe Biden", "Donald Trump"),
sentiment = c("positive", "negative"))
```
```{r, echo=FALSE, out.width="350px"}
knitr::include_graphics("man/figures/App_1.PNG")
```
If we want to see not only the sentence we are currently coding, but also the surrounding sentences, we can use the option `context = TRUE`. This gives us our current sentence alongside its previous and following sentence. To not generate any confusion about which sentence is currently evaluated, the surrounding sentences are shown in gray.
```{r, eval = FALSE}
annotated <- handcode(data = sentences,
candidate = c("Joe Biden", "Donald Trump"),
sentiment = c("positive", "negative"),
context = TRUE)
```
```{r, echo=FALSE, out.width="350px"}
knitr::include_graphics("man/figures/App_2.PNG")
```
If our text vector does not form a continuous text, but you nonetheless want to provide a previous and next sentence as context, you can also specify a vector with all previous and a vector with all next sentences as `pre` and `post` inputs.
```{r, eval = FALSE}
# Vectors of all previous and all subsequent sentences
previous <- c("", sentences[2:length(sentences)])
subsequent <- c(sentences[2:length(sentences)-1])
annotated <- handcode(data = sentences,
candidate = c("Joe Biden", "Donald Trump"),
sentiment = c("positive", "negative"),
context = TRUE,
pre = previous,
post = subsequent)
```
We can stop the annotation process at any point by clicking on the button `save and exit`. Once we click this button, the app will close and the function returns a data frame with our texts and annotations.
```{r, echo=FALSE}
annotated <- tidyr::tibble(texts = sentences, candidate = factor("", levels = c("", "Not applicable", "Joe Biden", "Donald Trump")), sentiment = factor("", levels = c("", "Not applicable", "positive", "negative")))
annotated$candidate[1:2] <- c("Joe Biden", "Not applicable")
annotated$sentiment[1:2] <- c("negative", "negative")
```
```{r}
annotated
```
```{r, echo=FALSE}
```
We can resume the annotation process at any point by using the returned data frame from our last execution of `handcode()` as input to a new `handcode()` command. By default, the function will resume the annotation at the first text that has not been annotated yet.
```{r, eval = FALSE}
annotated <- handcode(data = annotated,
context = TRUE)
```
```{r, echo=FALSE, out.width="350px"}
knitr::include_graphics("man/figures/App_3.PNG")
```
To facilitate the classification process, `handcode()` takes the keyboard shortcuts `space` for `previous` and `enter` for `next`. If you go back to already coded lines of your data, the app automatically displays your previous coding, if you go to new lines of your data, the default values for your variables always are "". If the last row of your data is reached, `next` automatically leads to the saving of the data and exit from the Shiny-App.
#### Beyond the basics
By default, `handcode` uses the first line in the input data that has not been annotated yet as start value. However, the option `start` allows users to specify with which observation they want to start their coding process. If we have not yet annotated lines of data that lie between already coded lines of data, you can also specify `start = "all_empty"` to annotate all lines that have not been coded yet in the order in which they appear.
Sometimes, we explicitly want to display texts in a random order to rule out that the context of a text within the larger body of texts influences our annotations. If we want to randomize the order in which texts are displayed, we can set the option `randomize = TRUE`. This will, however, not influence the order in which texts are sorted in the resulting output.
As a default, `handcode` will display one missing category "Not applicable". If you want a different, or more than one missing category, you can provide a character vector of missing categories you would like to have displayed as `missing`. Missing categories will automatically be displayed in gray. In the output these values will be returned with a leading and trailing `_`.
```{r, eval = FALSE}
annotated <- handcode(data = sentences,
candidate = c("Joe Biden", "Donald Trump"),
sentiment = c("positive", "negative"),
missing = c("Not applicable", "Undecided"))
```
```{r, echo=FALSE, out.width="350px"}
knitr::include_graphics("man/figures/App_4.PNG")
```