---
title: "A tidy text analysis of Rick and Morty"
author: "Tamas Szilagyi"
date: 2017-10-07T23:15:14-05:00
categories: ["R"]
tags: ["tidytext", "R", "tidyverse"]
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE)
```

```{css, echo = FALSE}
pre code, pre, code {
white-space: pre !important;
overflow-x: scroll !important;
word-break: keep-all !important;
word-wrap: initial !important;
}
```
## Adventures in the multiverse
For those unfamiliar with the show, Rick and Morty is an animated series about the inter-universal exploits of Rick, a half-drunk mad scientist, and his daft grandson Morty. Living under one roof with his daughter, Rick constantly drags Morty along on adventures into unusual worlds inhabited by surreal creatures. At first hesitant to accompany his eccentric granddad, Morty slowly grows into an indispensable sidekick. Using Rick's portal gun, they leave the rest of their dysfunctional family at home and travel through space and time.
Most episodes draw inspiration from, or poke fun at, cult movies such as Back to the Future, A Nightmare on Elm Street, Inception and many other classics by the likes of John Carpenter or David Cronenberg. Besides the ruthless humor and over-the-top visuals, the show brilliantly builds self-contained sci-fi realms, each going about its day-to-day according to its own wacky rules.
## One man's weekend project, another man's treasure
After reading the book [Tidy Text Mining](http://tidytextmining.com/) online, I had been wanting to try out some of the concepts it outlines, and the functions of the [accompanying package](https://github.com/juliasilge/tidytext), on an interesting dataset. So I was pretty stoked to find [Francois Keck's **subtools package** on GitHub](https://github.com/fkeck/subtools), which reads *.srt* files (the usual format for subtitles) straight into R. With season 3 of Rick and Morty coming to an end last week, the stars finally aligned for me to roll up my sleeves and have some fun with text mining.
It is very easy to find English subtitles for pretty much anything on the Internet. With subtools, an entire series can be read in with a single command, `read.subtitles.serie()`, pointed at the containing folder. We then convert the resulting MultiSubtitles object to a data.frame with a second command, `subDataFrame()`.
```{r, eval = F}
library(subtools)
a <- read.subtitles.serie(dir = "/series/rick and morty/")
df <- subDataFrame(a)
str(df)
```
```{r, echo = F}
library(subtools)
a <- read.subtitles.serie(dir = "/Users/tamas/Dropbox/my_site/data1/rick and morty/")
df <- subDataFrame(a)
str(df)
```
The `$Text` column contains the subtitle text, accompanied by additional variables for line id, timestamp, season and episode number. This is the structure preferred by the tidytext package, as it is by the rest of the tidyverse.
## *"Morty, you got to...come on."*
Let's start with the bread and butter of text mining, *term frequencies*. We split the text by word, exclude stop words,
```{r, echo = FALSE}
library(tidytext)
library(dplyr)
data(stop_words)
# Tokenise into single words and remove stop words, plus a few
# subtitle-credit terms that are not part of the script
tidy_df <- df %>%
  unnest_tokens(word, Text) %>%
  anti_join(stop_words %>%
              rbind(data.frame(word = c("xhmikosr", "www.addic7ed.com",
                                        "sync", "corrections"),
                               lexicon = "custom")))
```
```{r, eval = F}
data(stop_words)
tidy_df <- df %>%
  unnest_tokens(word, Text) %>%
  anti_join(stop_words)
```
and aggregate and plot the top 10 words per season.
```{r, eval = F}
library(dplyr)
library(ggplot2)
tidy_df %>%
  group_by(season) %>%
  count(word, sort = TRUE) %>%
  top_n(10) %>%
  ggplot(aes(reorder(word, n), n, fill = season)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~season, scales = "free_y") +
  labs(x = NULL) +
  guides(fill = FALSE) +
  scale_fill_brewer(palette = "Set1")
```

Every season is dominated by, well, Rick and Morty. The main characters tirelessly address each other, talking one another either into or out of whatever mess they find themselves in. What stands out most is the absence of Rick's daughter, Beth, from the top 10 in all seasons. She's perhaps the only sane person in the family, but then again, sanity doesn't get much airtime on this show.
## Network analysis on bi-grams
We can similarly count how often each pair of consecutive words, a so-called *bi-gram*, appears. Besides calculating summary statistics on bi-grams, we can now construct a network of words according to co-occurrence using [igraph](https://cran.r-project.org/web/packages/igraph/index.html), the go-to package for network analysis in R.
```{r, echo = FALSE}
options(max.print = 5)
stop_words <- stop_words %>%
  rbind(data.frame(word = c("xhmikosr", "www.addic7ed.com",
                            "sync", "corrections"),
                   lexicon = "custom"))
```
```{r, warning = FALSE}
library(tidyr)
library(igraph)
# Tokenise into bi-grams, drop pairs containing stop words, keep
# pairs occurring more than twice and turn them into a graph
bigram_graph <- df %>%
  unnest_tokens(bigram, Text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  group_by(season) %>%
  count(word1, word2, sort = TRUE) %>%
  select(word1, word2, season, n) %>%
  filter(n > 2) %>%
  graph_from_data_frame()
print(bigram_graph)
```
This igraph object contains a directed network, where the vertices are the words and an edge connects two words that appear after one another more than twice. Representing the text as a graph, we can calculate things such as [degree centrality](https://en.wikipedia.org/wiki/Centrality#Degree_centrality), and plot the results.
```{r, echo = F, eval = F}
library(ggraph)
# Decompose the graph and keep the largest connected component
largest_graph <- decompose.graph(bigram_graph)[[1]]
# Calculate degree centrality and add it as a vertex attribute
vertex_attr(largest_graph, "dgc") <- centr_degree(largest_graph)$res
ggraph(largest_graph, layout = "kk") +
  geom_edge_link(alpha = .5) +
  geom_node_point(aes(color = dgc, size = dgc * 5)) +
  geom_node_text(aes(label = name, size = dgc * 5),
                 vjust = 1, hjust = 1, check_overlap = TRUE) +
  scale_colour_gradient(low = "yellow", high = "red") +
  guides(size = FALSE) +
  ggtitle("Degree Centrality of the Largest Bi-gram Network") +
  theme_graph() +
  theme(legend.position = "bottom")
```

Looking at the largest connected network, we arrive at the same conclusion as with term frequencies. Rick and Morty are the most important words. They are at the center of the network and so have the highest degree centrality scores.
Besides visualising the importance of words in our network, we can also differentiate between the words that precede either Rick or Morty. These are the first-degree connections (words) that have an edge pointing towards one of the main characters, but aren't shared between the two.
```{r, echo = FALSE, message = F, eval = F}
# first (in)degree connections to Rick
# First-degree in-neighbours of Rick
r1 <- names(unlist(neighborhood(largest_graph, 1, nodes = "rick", mode = "in")))
r1 <- r1[r1 != "morty"]
# First-degree in-neighbours of Morty
m1 <- names(unlist(neighborhood(largest_graph, 1, nodes = "morty", mode = "in")))
m1 <- m1[m1 != "rick"]
# Connections shared by both characters
s <- setdiff(m1[m1 %in% r1], c("rick", "morty"))
# Create a vertex attribute marking which character a word precedes
vertex_attr(largest_graph, "precedes") <- ifelse(V(largest_graph)$name %in% r1, "rick",
                                          ifelse(V(largest_graph)$name %in% m1, "morty", "out of range"))
vertex_attr(largest_graph, "precedes") <- ifelse(V(largest_graph)$name %in% s, "shared",
                                                 V(largest_graph)$precedes)
ggraph(largest_graph, layout = "kk") +
  geom_edge_link(alpha = .5) +
  geom_node_point(aes(color = precedes, size = 1)) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  scale_colour_brewer(palette = "Set1") +
  guides(size = FALSE) +
  ggtitle("The words preceding Rick and Morty") +
  theme_graph() +
  theme(legend.position = "bottom")
```

Looking at the red nodes, we recognize many of the things Rick throws at Morty: *"Relax Morty!... It's science Morty!... Run Morty!"*. There is also a handful of words that precede both characters, like *"Geez"*, *"Boy"* or *"God"*. All other words that are more than one degree away are colored blue as out of range.
## tf-idf
Thus far we have looked at all words across seasons. But where do the seasons differ from each other? And can we summarise each season using a handful of topics? To answer the first question, text mining's most notorious statistic, [**tf-idf**](https://en.wikipedia.org/wiki/Tf%E2%80%93idf), comes to the rescue. It stands for **term frequency - inverse document frequency**. We take the word counts per season and multiply them by the *scaled inverse fraction of seasons that contain the word*. Simply put, we penalize words that are common across all seasons, and reward ones that are not. This way, we bring forth the words most typical of each season. Again, the tidytext implementation is super easy.
```{r}
tf_idf_df <- tidy_df %>%
  count(season, word, sort = TRUE) %>%
  bind_tf_idf(word, season, n)
```
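To make the statistic more concrete, here is a minimal sketch of my own (not part of the original pipeline) that recomputes tf-idf by hand, mirroring what `bind_tf_idf()` does; the word filtered at the end is just an illustrative pick and may or may not occur in the data.
```{r, eval = FALSE}
# Hand-rolled tf-idf, assuming tidy_df as built above
word_counts <- tidy_df %>%
  count(season, word)

manual_tf_idf <- word_counts %>%
  group_by(season) %>%
  mutate(tf = n / sum(n)) %>%   # term frequency within a season
  ungroup() %>%
  group_by(word) %>%
  # idf = log(number of seasons / number of seasons containing the word)
  mutate(idf = log(n_distinct(word_counts$season) / n())) %>%
  ungroup() %>%
  mutate(tf_idf = tf * idf)

# Compare against bind_tf_idf() for a single (hypothetical) word
manual_tf_idf %>% filter(word == "squanch")
```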
```{r, echo = F, warning=F, eval = F}
tf_idf_df %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%
  top_n(20) %>%
  ggplot(aes(word, tf_idf, fill = season)) +
  geom_col() +
  labs(x = NULL, y = "tf-idf") +
  coord_flip() +
  ggtitle("Words with the Highest TF-IDF Scores") +
  scale_colour_brewer(palette = "Set1") +
  theme(legend.position = "bottom") +
  scale_fill_brewer(palette = "Set1")
```

What we get back are the most important elements, characters, motives or places across episodes. I'm somewhat surprised that Mr. Meeseeks didn't come in first though. I was sure as hell annoyed out of my mind after hearing it uttered for the 100th time during the episode [Meeseeks and Destroy](https://en.wikipedia.org/wiki/Meeseeks_and_Destroy). But then again, Mr Meeseeks does make a cameo in two other seasons, so that kind of torpedoes his chances for the first spot.
## Topic models
Having seen the most distinctive words of the script by season, we will take our analysis one last step further and try to capture the gist of the show using topic modeling. Broadly speaking, it's an unsupervised classification method that tries to represent a document as a collection of topics. Here, I will take the classic [Latent Dirichlet Allocation (LDA)](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) algorithm for a spin. The basic idea is that
> "...a topic is defined as a mixture over words where each word has a probability of belonging to a topic. And a document is a mixture over topics, meaning that a single document can be composed of multiple topics."
We could, for example, take season two and tell `LDA()` that we want to compress its 10 episodes into just 6 topics. To compensate for the omnipresence of the top words across episodes, I exclude them here so that the topics separate more clearly.
```{r, warning = FALSE}
library(topicmodels)
popular_words <- c("rick", "morty", "yeah", "hey",
                   "summer", "jerry", "uh", "gonna")
# Build a document-term matrix for season 2, one document per episode
episodes_dtm <- tidy_df %>%
  filter(season_num == 2 & !word %in% popular_words) %>%
  group_by(episode_num) %>%
  count(word, sort = TRUE) %>%
  cast_dtm(episode_num, word, n)
episodes_lda <- LDA(episodes_dtm, k = 6, control = list(seed = 1234))
```
After `tidy()`ing the results, we can plot the top 10 words that contribute most (*beta*) to each topic.
```{r, echo = FALSE, eval=F}
tidy(episodes_lda, matrix = "beta") %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip() +
  scale_fill_brewer(palette = "Set1") +
  ggtitle("Word / Topic Probabilities")
```

There are definitely a few topics that contain multiple elements of a particular episode. Take for example **topic 1**. It includes "*Roy*", the name of the video game Morty plays, and "*Fart*", a gaseous creature kept under lock and key by aliens, both from the same episode. Or **topic 5**, which probably relates to the episode where Rick visits his old lover "*Unity*". It further contains words such as "*remember*" and "*memories*". The episode ends with Unity repeating "I want it *real*".
Not only can we examine the **word per topic probabilities**, we can also plot the **topic per document probabilities**, or *gamma* values. This lets us see which topic belongs to which episode.
```{r, echo = FALSE}
# Season 2 episode titles, in order
t <- c("A Rickle in Time", "Mortynight Run", "Auto Erotic Assimilation", "Total Rickall",
       "Get Schwifty", "The Ricks Must Be Crazy", "Big Trouble in Little Sanchez",
       "Interdimensional Cable 2", "Look Who's Purging Now", "The Wedding Squanchers")
titles <- data.frame(document = as.character(1:10), title = factor(t, levels = t))
```
```{r, warning = F, message = F, eval = F}
tidy(episodes_lda, matrix = "gamma") %>%
  inner_join(titles) %>%
  ggplot(aes(factor(topic), gamma)) +
  geom_boxplot() +
  facet_wrap(~ title) +
  ggtitle("Dominant Topics per Episode")
```

Our previous assumptions are confirmed: the first topic does belong to the episode *Mortynight Run*, as does the fifth topic to *Auto Erotic Assimilation*. It is important to note that the results strongly depend on the number of topics supplied to `LDA()`, so inevitably, some experimentation with `k` is required to arrive at meaningful results; one rough way to compare settings is sketched below.
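As a loose illustration of that experimentation (my addition, assuming the default VEM fit used above), the sketch below refits the model for a few values of `k` and compares perplexity. Take it as a rough guide only: perplexity measured on the training data tends to keep improving as `k` grows, so the shape of the curve matters more than its minimum.
```{r, eval = FALSE}
# Refit the LDA model for several values of k and compare perplexity
# (lower is better); a sketch only, not part of the original pipeline
ks <- c(2, 4, 6, 8)
perplexities <- sapply(ks, function(k) {
  m <- LDA(episodes_dtm, k = k, control = list(seed = 1234))
  perplexity(m)
})
data.frame(k = ks, perplexity = perplexities)
```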
## Final thoughts
I ran through some very interesting concepts fairly quickly in this post. I owe much of that to the tidytext package: with very little code, we can mine a tremendous amount of insight from textual data. And I have only just scratched the surface of what's possible. The seamless integration with the tidyverse, as well as with igraph and topicmodels, makes a huge difference.
Nonetheless, text mining is a complex topic, and for more advanced applications, [further reading](https://github.com/trinker/topicmodels_learning) on the inner workings of these algorithms might come in handy. The full [data](https://github.com/mtoto/mtoto.github.io/tree/master/data/2017-10-07-tidyrick/rick%20and%20morty) and [code](https://github.com/mtoto/mtoto.github.io/blob/master/blog/2017/2017-10-07-tidyrick.Rmd) for this post are available, as usual, on my GitHub.