---
title: "Introduction to automated text analysis"
author: Pablo Barbera
date: June 28, 2017
output: html_document
---
### String manipulation with R
We will start with basic string manipulation with R.
Our running example will be a random sample of 10,000 tweets mentioning the names of the candidates in the 2014 EP elections in the UK. We'll save the text of these tweets as a vector called `text`.
```{r}
tweets <- read.csv("data/EP-elections-tweets.csv", stringsAsFactors=F)
head(tweets)
text <- tweets$text
```
R stores strings in character vectors. `length` returns the number of elements in a vector, while `nchar` returns the number of characters in each element.
```{r}
length(text)
text[1]
nchar(text[1])
```
Note that we can work with multiple strings at once.
```{r}
nchar(text[1:10])
sum(nchar(text[1:10]))
max(nchar(text[1:10]))
```
We can merge different strings into one using `paste`:
```{r}
paste(text[1], text[2], sep='--')
```
Character vectors can be compared using the `==` and `%in%` operators:
```{r}
tweets$screen_name[1]=="martinwedge"
"DavidCoburnUKip" %in% tweets$screen_name
```
As we will see later, it is often convenient to convert all words to lowercase or uppercase.
```{r}
tolower(text[1])
toupper(text[1])
```
We can grab substrings with `substr`. The first argument is the string, the second is the beginning index (starting from 1), and the third is the final index.
```{r}
substr(text[1], 1, 2)
substr(text[1], 1, 10)
```
This is useful when working with date strings as well:
```{r}
dates <- c("2015/01/01", "2014/12/01")
substr(dates, 1, 4) # years
substr(dates, 6, 7) # months
```
We can split up strings by a separator using `strsplit`. If we choose space as the separator, this is in most cases equivalent to splitting into words.
```{r}
strsplit(text[1], " ")
```
Let's dig into the data a little bit more. Given how the dataset was constructed, we can expect many tweets to mention the names of the candidates, such as @Nigel_Farage. We can use the `grep` command to identify these. `grep` returns the indices of the elements that contain the pattern.
```{r}
grep('@Nigel_Farage', text[1:10])
```
`grepl` returns `TRUE` or `FALSE`, indicating whether each element of the character vector contains that particular pattern.
```{r}
grepl('@Nigel_Farage', text[1:10])
```
Going back to the full dataset, we can use the results of `grep` to get particular rows. First, check how many tweets mention the handle "@Nigel_Farage".
```{r}
nrow(tweets)
grep('@Nigel_Farage', tweets$text[1:10])
length(grep('@Nigel_Farage', tweets$text))
```
It is important to note that matching is case-sensitive by default. You can use the `ignore.case` argument to make the matching case-insensitive.
```{r}
nrow(tweets)
length(grep('@Nigel_Farage', tweets$text))
length(grep('@Nigel_Farage', tweets$text, ignore.case = TRUE))
```
### Regular expressions
Another useful tool for working with text data is regular expressions. You can learn more about regular expressions [here](http://www.zytrax.com/tech/web/regex.htm). Regular expressions let us build complex rules for matching strings and for extracting elements from them.
For example, we could look at tweets that mention more than one handle using the operator "|" (equivalent to "OR").
```{r}
nrow(tweets)
length(grep('@Nigel_Farage|@UKIP', tweets$text, ignore.case=TRUE))
```
We can also use a question mark to indicate that a character is optional.
```{r}
nrow(tweets)
length(grep('MEPs?', tweets$text, ignore.case=TRUE))
```
A pattern like `MEPs?` will match both MEP and MEPs, because the `?` makes the preceding `s` optional.
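To build intuition, here is a toy illustration of the optional-character idea on a small made-up vector (not the tweets):

```{r}
# "?" makes the preceding "s" optional, so "MEPs?" matches both forms
x <- c("An MEP spoke", "Two MEPs spoke", "A member spoke")
grepl("MEPs?", x)
```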
Other common expression patterns are:
- `.` matches any character, `^` and `$` match the beginning and end of a string.
- A character followed by `{3}`, `*`, or `+` is matched exactly 3 times, 0 or more times, or 1 or more times, respectively.
- `[0-9]`, `[a-zA-Z]`, and `[[:alnum:]]` match any digit, any letter, and any alphanumeric character, respectively.
- Special characters such as `.`, `\`, `(` or `)` must be escaped with a backslash (written as a double backslash inside an R string, e.g. `\\.`).
- See `?regex` for more details.
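A few quick illustrations of these patterns on made-up strings (not the tweets):

```{r}
# "^" anchors the match to the start of the string
grepl("^RT", c("RT @user hello", "hello RT"))
# "+" requires one or more of the preceding character class
grepl("[0-9]+", c("vote2014", "no digits here"))
# a literal dot must be escaped (doubled backslash inside an R string)
gsub("\\.", "!", "The end.")
```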
For example, how many tweets are direct replies to @Nigel_Farage? How many tweets are retweets? How many tweets mention any username?
```{r}
length(grep('^@Nigel_Farage', tweets$text, ignore.case=TRUE))
length(grep('^RT @', tweets$text, ignore.case=TRUE))
length(grep('@[A-Za-z0-9_]+', tweets$text, ignore.case=TRUE))
```
Another function that we will use is `gsub`, which replaces a pattern (or a regular expression) with another string:
```{r}
gsub('@[0-9_A-Za-z]+', 'USERNAME', text[1])
```
To extract the matched pattern rather than replace it, wrap it in parentheses (a capture group) and use `"\\1"` as the replacement:
```{r}
gsub('.*@([0-9_A-Za-z]+) .*', "\\1", text[1])
```
To extract every match rather than just the first, use `gregexpr`, which returns the locations of all matches, together with `regmatches`:
```{r}
handles <- gregexpr('@([0-9_A-Za-z]+)', text)
handles <- regmatches(text, handles)
handles <- unlist(handles)
head(sort(table(handles), decreasing=TRUE), n=25)
# now with hashtags...
hashtags <- regmatches(text, gregexpr("#(\\d|\\w)+",text))
hashtags <- unlist(hashtags)
head(sort(table(hashtags), decreasing=TRUE), n=25)
```
Now let's try to identify which tweets are related to UKIP and extract them. How would we do it? First, let's create a new column in the data frame that takes the value `TRUE` for tweets that mention this keyword and `FALSE` otherwise. Then we can keep the rows with value `TRUE`.
```{r}
tweets$ukip <- grepl('ukip|farage', tweets$text, ignore.case=TRUE)
table(tweets$ukip)
ukip.tweets <- tweets[tweets$ukip==TRUE, ]
```
### Preprocessing text with quanteda
As we discussed earlier, before we can do any type of automated text analysis, the text needs to go through several "preprocessing" steps so that it can be passed to a statistical model. We'll use the [quanteda](https://github.com/kbenoit/quanteda) package here.
The basic unit of work for the `quanteda` package is called a `corpus`, which represents a collection of text documents with some associated metadata. Documents are the subunits of a corpus. You can use `summary` to get some information about your corpus.
```{r}
library(quanteda)
twcorpus <- corpus(tweets$text)
summary(twcorpus)
```
A useful feature of corpus objects is _keywords in context_, which returns all the appearances of a word (or combination of words) in its immediate context.
```{r}
kwic(twcorpus, "brexit", window=10)
kwic(twcorpus, "miliband", window=10)
kwic(twcorpus, "eu referendum", window=10)
```
We can then convert a corpus into a document-feature matrix using the `dfm` function.
```{r}
twdfm <- dfm(twcorpus, verbose=TRUE)
twdfm
```
`dfm` has many useful options. Let's actually use it to stem the text, extract n-grams, remove punctuation, keep Twitter features...
```{r}
?dfm
twdfm <- dfm(twcorpus, tolower=TRUE, stem=TRUE, remove_punct = TRUE, ngrams=1:3, verbose=TRUE)
twdfm
```
Note that here we use ngrams -- this will extract all combinations of one, two, and three words (e.g. it will include "human", "rights", and "human rights" as features in the matrix).
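For intuition, here is a rough base-R sketch of the 1- to 3-grams generated for a short sentence. quanteda does this internally, joining the words of each n-gram with `_`; the `ngrams` helper below is purely illustrative, not part of any package:

```{r}
# Hypothetical helper: build all n-grams of length n from a word vector
ngrams <- function(w, n) {
  if (length(w) < n) return(character(0))
  sapply(seq_len(length(w) - n + 1),
         function(i) paste(w[i:(i + n - 1)], collapse = "_"))
}
words <- strsplit("human rights are universal", " ")[[1]]
# all 1-, 2-, and 3-grams of the sentence
unlist(lapply(1:3, function(n) ngrams(words, n)))
```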
Stemming relies on the `SnowballC` package's implementation of the Porter stemmer:
```{r}
tokenize(tweets$text[1])
tokens_wordstem(tokenize(tweets$text[1]))
```
In a large corpus like this, many features often appear in only one or two documents. In some cases it's a good idea to remove those features, to speed up the analysis or because they're not relevant. We can trim the dfm with `dfm_trim`:
```{r}
twdfm <- dfm_trim(twdfm, min_docfreq=3, verbose=TRUE)
```
It's often a good idea to take a look at a wordcloud of the most frequent features to see if there's anything weird.
```{r}
textplot_wordcloud(twdfm, rot.per=0, scale=c(3.5, .75), max.words=100)
```
What is going on? We probably want to remove words and symbols that are not of interest, such as "http" here. Very common words that act as connectors in a given language (e.g. "a", "the", "is") are called stopwords and are usually not relevant either. We can inspect the most frequent features with `topfeatures`:
```{r}
topfeatures(twdfm, 25)
```
We can remove the stopwords when we create the `dfm` object:
```{r}
twdfm <- dfm(twcorpus, remove_punct = TRUE, remove=c(
stopwords("english"), "t.co", "https", "rt", "amp", "http", "t.c", "can", "u"), verbose=TRUE)
textplot_wordcloud(twdfm, rot.per=0, scale=c(3.5, .75), max.words=100)
```
One nice feature of quanteda is that we can easily add metadata to the corpus object.
```{r}
docvars(twcorpus) <- data.frame(screen_name=tweets$screen_name, polite=tweets$polite)
summary(twcorpus)
```
We can then use this metadata to subset the dataset:
```{r}
polite.tweets <- corpus_subset(twcorpus, polite=="impolite")
```
And then extract the text:
```{r}
mytexts <- texts(polite.tweets)
```
We'll come back to this dataset later.
### Importing text with quanteda
There are different ways to read text into `R` and create a `corpus` object with `quanteda`. We have already seen the most common way, importing the text from a csv file and then adding the metadata, but `quanteda` has a built-in function to help with this:
```{r}
library(readtext)
tweets <- readtext(file='data/EP-elections-tweets.csv')
twcorpus <- corpus(tweets)
```
This function also works with text stored in multiple files. To read them all at once, we pass `readtext` a pattern that uses the 'glob' operator '*':
```{r}
myCorpus <- readtext(file='data/inaugural/*.txt')
inaugCorpus <- corpus(myCorpus)
```