---
title: "Dictionary methods"
author: Pablo Barbera
date: June 28, 2017
output: html_document
---
#### Dictionary methods
A different type of keyword analysis consists of applying dictionary methods, or lexicon-based approaches, to measure tone or to predict different categories related to the content of the text.
The most common application is sentiment analysis: using a dictionary of positive and negative words, we compute a sentiment score for each individual document.
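To make the scoring rule concrete before turning to real data, here is a minimal sketch in base R. The word lists, the documents, and the helper `score_doc()` are all made up for illustration and are not part of the workshop data; the score is simply the number of positive words minus the number of negative words in each document.

```{r}
# Toy lexicon and documents (hypothetical, for illustration only)
toy_pos <- c("good", "great", "happy")
toy_neg <- c("bad", "sad", "terrible")
toy_docs <- c("A great day with good friends",
              "Terrible service and a bad meal",
              "The meeting is at noon")

# Score one document: count of positive words minus count of negative words
score_doc <- function(text, pos, neg) {
  # lowercase the text and split on anything that is not a letter
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  sum(words %in% pos) - sum(words %in% neg)
}

# apply the scoring function to every document
toy_scores <- vapply(toy_docs, score_doc, numeric(1),
                     pos = toy_pos, neg = toy_neg, USE.NAMES = FALSE)
toy_scores
```

The first document matches two positive words, the second two negative words, and the third none, so the scores are 2, -2 and 0.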
Let's apply this technique to tweets by the four leading candidates in the 2016 Presidential primaries.
```{r}
library(quanteda)
tweets <- read.csv('data/candidate-tweets.csv', stringsAsFactors=FALSE)
```
```{r}
# loading lexicon of positive and negative words (from Neal Caren)
lexicon <- read.csv("data/lexicon.csv", stringsAsFactors=FALSE)
pos.words <- lexicon$word[lexicon$polarity=="positive"]
neg.words <- lexicon$word[lexicon$polarity=="negative"]
# a look at a random sample of positive and negative words
sample(pos.words, 10)
sample(neg.words, 10)
```
As we did earlier today, we will convert our text to a corpus object.
```{r}
twcorpus <- corpus(tweets$text)
```
Now we're ready to run the sentiment analysis!
```{r}
# first we construct a dictionary object
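mydict <- dictionary(list(negative = neg.words,
                          positive = pos.words))
# apply it to our corpus
# (recent quanteda versions need explicit tokens() and dfm_lookup() steps;
# older versions also accepted dfm(twcorpus, dictionary = mydict))
sent <- dfm_lookup(dfm(tokens(twcorpus)), dictionary = mydict)
# and add it as a new variable: positive count minus negative count
tweets$score <- as.numeric(sent[, "positive"]) - as.numeric(sent[, "negative"])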
```
```{r}
# what is the average sentiment score?
mean(tweets$score)
# what is the most positive and most negative tweet?
tweets[which.max(tweets$score),]
tweets[which.min(tweets$score),]
# what is the proportion of positive, neutral, and negative tweets?
tweets$sentiment <- "neutral"
tweets$sentiment[tweets$score<0] <- "negative"
tweets$sentiment[tweets$score>0] <- "positive"
table(tweets$sentiment)
```
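Note that `table()` returns counts rather than proportions. As a small sketch on made-up scores (the `ex_` objects below are hypothetical and separate from the tweets data), `sign()` can do the three-way classification in one step and `prop.table()` converts the counts into shares.

```{r}
# Hypothetical sentiment scores, for illustration only
ex_scores <- c(2, -1, 0, 0, 3, -2, 1, 0)

# sign() maps each score to -1, 0 or 1; factor() attaches readable labels
ex_sentiment <- factor(sign(ex_scores), levels = c(-1, 0, 1),
                       labels = c("negative", "neutral", "positive"))

# counts per category, then shares that sum to 1
ex_counts <- table(ex_sentiment)
ex_shares <- prop.table(ex_counts)
round(ex_shares, 2)
```

Fixing the factor levels in advance keeps a category in the table even when no document falls into it.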
We can also disaggregate by groups of tweets, for example according to the candidate who posted them.
```{r}
# loop over candidates
candidates <- c("realDonaldTrump", "HillaryClinton", "tedcruz", "BernieSanders")
for (cand in candidates){
  message(cand, " -- average sentiment: ",
          round(mean(tweets$score[tweets$screen_name==cand]), 4)
  )
}
```
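The same group averages can be computed without an explicit loop. The sketch below uses a made-up data frame, `ex_tweets`, that only mimics the `screen_name` and `score` columns; the account names and values are hypothetical.

```{r}
# Hypothetical data with the same column names as the tweets data frame
ex_tweets <- data.frame(
  screen_name = c("userA", "userA", "userB", "userB", "userB"),
  score = c(1, -1, 2, 0, 1),
  stringsAsFactors = FALSE
)

# average score per account as a named vector
ex_means <- tapply(ex_tweets$score, ex_tweets$screen_name, mean)
round(ex_means, 4)

# aggregate() returns the same information as a data frame
ex_agg <- aggregate(score ~ screen_name, data = ex_tweets, FUN = mean)
ex_agg
```

`tapply()` is convenient for a quick look, while the data frame returned by `aggregate()` is easier to merge or plot later.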
One important note: dictionary methods can be very sensitive to specific words that appear very often. Let's see one example: what happens to the average scores if we drop the word "great" from the dictionary?
```{r}
# remove word "great" from dictionary
# (logical indexing is safe even if the word is not in the lexicon)
lexicon <- lexicon[lexicon$word != "great",]
pos.words <- lexicon$word[lexicon$polarity=="positive"]
neg.words <- lexicon$word[lexicon$polarity=="negative"]
# construct dictionary object again
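mydict <- dictionary(list(negative = neg.words,
                          positive = pos.words))
# apply it to our corpus
# (recent quanteda versions need explicit tokens() and dfm_lookup() steps;
# older versions also accepted dfm(twcorpus, dictionary = mydict))
sent <- dfm_lookup(dfm(tokens(twcorpus)), dictionary = mydict)
# and add it as a new variable: positive count minus negative count
tweets$score <- as.numeric(sent[, "positive"]) - as.numeric(sent[, "negative"])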
# loop over candidates
candidates <- c("realDonaldTrump", "HillaryClinton", "tedcruz", "BernieSanders")
for (cand in candidates){
  message(cand, " -- average sentiment: ",
          round(mean(tweets$score[tweets$screen_name==cand]), 4)
  )
}
```
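One way to spot such influential words before computing scores is to count how often each dictionary word occurs in the texts. This is only a sketch on made-up data: `ex_texts` and `ex_pos` are hypothetical and stand in for the tweets and the positive word list.

```{r}
# Hypothetical texts and positive-word list, for illustration only
ex_texts <- c("what a great crowd tonight",
              "a great rally with great people",
              "a good plan for a strong economy")
ex_pos <- c("great", "good", "strong")

# tokenize all texts into one vector of lowercase words
ex_words <- unlist(strsplit(tolower(ex_texts), "[^a-z]+"))

# keep only dictionary words and count how often each appears
ex_freq <- sort(table(ex_words[ex_words %in% ex_pos]), decreasing = TRUE)
ex_freq

# share of all dictionary matches accounted for by the most frequent word
ex_top_share <- max(ex_freq) / sum(ex_freq)
ex_top_share
```

When a single word accounts for a large share of all matches, as "great" does in this toy example, it is worth checking how much the results change once that word is removed.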