---
title: "A tidy text analysis of Rick and Morty"
author: "Tamas Szilagyi"
date: 2017-10-07T23:15:14-05:00
categories: ["R"]
tags: ["tidytext", "R", "tidyverse"]
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE)
```

```{css, echo = FALSE}
pre code, pre, code {
  white-space: pre !important;
  overflow-x: scroll !important;
  word-break: keep-all !important;
  word-wrap: initial !important;
}
```
## Adventures in the multiverse

For those unfamiliar with the show, Rick and Morty is an animated series about the inter-universal exploits of Rick, a half-drunk mad scientist, and his daft grandson Morty. Living under one roof with his daughter's family, Rick constantly drags Morty along on adventures into unusual worlds inhabited by surreal creatures. At first hesitant to accompany his eccentric granddad, Morty slowly grows into an indispensable sidekick. Using Rick's portal gun, the two leave the rest of their dysfunctional family at home and travel through space and time.

Most episodes draw inspiration from, or make fun of, cult movies such as Back to the Future, A Nightmare on Elm Street, Inception and many other classics by the likes of John Carpenter or David Cronenberg. Beyond the ruthless humor and over-the-top visuals, the show brilliantly builds self-contained sci-fi realms, each going about its day-to-day business according to its own wacky rules.
## One man's weekend project, another man's treasure

After reading the book [Tidy Text Mining](http://tidytextmining.com/) online, I had been wanting to try out some of the concepts it outlines, and the functions of the [accompanying package](https://github.com/juliasilge/tidytext), on an interesting dataset. So I was pretty stoked to find [Francois Keck's **subtools package** on GitHub](https://github.com/fkeck/subtools), which allows reading *.srt* files (the usual format for subtitles) straight into R. With season 3 of Rick and Morty coming to an end last week, the stars finally aligned for me to roll up my sleeves and have some fun with text mining.

It is very easy to find English subtitles for pretty much anything on the Internet. With subtools, an entire series can be read from its containing folder with a single command, `read.subtitles.serie()`. We then convert the resulting MultiSubtitles object to a data.frame with a second command, `subDataFrame()`.
```{r, eval = F}
library(subtools)
a <- read.subtitles.serie(dir = "/series/rick and morty/")
df <- subDataFrame(a)
str(df)
```

```{r, echo = F}
library(subtools)
a <- read.subtitles.serie(dir = "/Users/tamas/Dropbox/my_site/data1/rick and morty/")
df <- subDataFrame(a)
str(df)
```
The `$Text` column contains the subtitle text, alongside additional variables for line id, timestamp, season and episode number. This is the structure preferred by the tidytext package, as it is by the rest of the tidyverse.
## *"Morty, you got to...come on."* | |
Let's start with the bread and butter of text mining, *term frequencies*. We split the text by word, exclude stop words, | |
```{r, echo = FALSE}
library(tidytext)
library(dplyr)
data(stop_words)
tidy_df <- df %>%
  unnest_tokens(word, Text) %>%
  anti_join(stop_words %>%
              rbind(data.frame(word = c("xhmikosr", "www.addic7ed.com",
                                        "sync", "corrections"),
                               lexicon = "custom")))
```

```{r, eval = F}
data(stop_words)
tidy_df <- df %>%
  unnest_tokens(word, Text) %>%
  anti_join(stop_words)
```
and aggregate and plot the top 10 words per season.

```{r, eval = F}
library(dplyr)
library(ggplot2)
tidy_df %>% group_by(season) %>%
  count(word, sort = TRUE) %>%
  top_n(10) %>%
  ggplot(aes(reorder(word, n), n, fill = season)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~season, scales = "free_y") +
  labs(x = NULL) +
  guides(fill = FALSE) +
  scale_fill_brewer(palette = "Set1")
```
Every season is dominated by, well, Rick and Morty. The main characters tirelessly address each other, talking one another either into or out of whatever mess they find themselves in. What stands out most is the absence of Rick's daughter, Beth, from the top 10 in every season. She's perhaps the only sane member of the family, but then again, sanity doesn't get much airtime on this show.
## Network analysis on bi-grams

We can similarly count how often each *pair of consecutive words* appears; such pairs are called *bi-grams*. Besides calculating summary statistics on bi-grams, we can construct a network of words according to co-occurrence using [igraph](https://cran.r-project.org/web/packages/igraph/index.html), the go-to package for network analysis in R.
```{r, echo = FALSE}
options(max.print = 5)
stop_words <- stop_words %>%
  rbind(data.frame(word = c("xhmikosr", "www.addic7ed.com",
                            "sync", "corrections"),
                   lexicon = "custom"))
```

```{r, warning = FALSE}
library(tidyr)
library(igraph)
bigram_graph <- df %>%
  unnest_tokens(bigram, Text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  group_by(season) %>%
  count(word1, word2, sort = TRUE) %>%
  select(word1, word2, season, n) %>%
  filter(n > 2) %>%
  graph_from_data_frame()
print(bigram_graph)
```
This igraph object contains a directed network, where the vertices are the words and an edge exists between any two words that appear after one another more than twice. Representing the text as a graph, we can compute metrics such as [degree centrality](https://en.wikipedia.org/wiki/Centrality#Degree_centrality), and plot the results.
```{r, echo = F, eval = F}
library(ggraph)
# Decompose the graph to get the largest connected component
largest_graph <- decompose.graph(bigram_graph)[[1]]
# Calculate degree centrality
# and add it as a vertex attribute
vertex_attr(largest_graph, "dgc") <- centr_degree(largest_graph)$res
ggraph(largest_graph, layout = "kk") +
  geom_edge_link(alpha = .5) +
  geom_node_point(aes(color = dgc, size = dgc * 5)) +
  geom_node_text(aes(label = name, size = dgc * 5), vjust = 1, hjust = 1, check_overlap = T) +
  scale_colour_gradient(low = "yellow", high = "red") +
  guides(size = FALSE) +
  ggtitle("Degree Centrality of the Largest Bi-gram Network") +
  theme_graph() +
  theme(legend.position = "bottom")
```
Looking at the largest connected component, we arrive at the same conclusion as with term frequencies: Rick and Morty are the most important words. They sit at the center of the network and thus have the highest degree centrality scores.

Besides visualising the importance of words in our network, we can also differentiate between the words that precede either Rick or Morty. These are all the first-degree connections (words) that have an edge pointing towards one of the main characters, but aren't shared between them.
```{r, echo = FALSE, message = F, eval = F}
# first (in)degree connections to Rick
r1 <- names(unlist(neighborhood(largest_graph, 1, nodes = "rick", mode = "in")))
r1 <- r1[r1 != "morty"]
# first (in)degree connections to Morty
m1 <- names(unlist(neighborhood(largest_graph, 1, nodes = "morty", mode = "in")))
m1 <- m1[m1 != "rick"]
# shared connections
s <- setdiff(m1[m1 %in% r1], c("rick", "morty"))
# create vertex attribute
vertex_attr(largest_graph, "precedes") <- ifelse(V(largest_graph)$name %in% r1, "rick",
                                                 ifelse(V(largest_graph)$name %in% m1, "morty", "out of range"))
vertex_attr(largest_graph, "precedes") <- ifelse(V(largest_graph)$name %in% s, "shared",
                                                 V(largest_graph)$precedes)
ggraph(largest_graph, layout = "kk") +
  geom_edge_link(alpha = .5) +
  geom_node_point(aes(color = precedes, size = 1)) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  guides(size = FALSE) +
  scale_colour_brewer(palette = "Set1") +
  ggtitle("The words preceding Rick and Morty") +
  theme_graph() +
  theme(legend.position = "bottom")
```
Looking at the red nodes, we recognize many of the things Rick throws at Morty: *"Relax Morty!... It's science Morty!... Run Morty!"*. There is also a handful of words that precede both characters, like *"Geez"*, *"Boy"* or *"God"*. All other words that are more than one degree away are colored blue, as out of range.
## tf-idf

Thus far we have looked at all words across seasons. But where do the seasons differ from each other? And can we summarise each season using a handful of topics? To answer the first question, text mining's most notorious statistic, [**tf-idf**](https://en.wikipedia.org/wiki/Tf%E2%80%93idf), comes to the rescue. It stands for **term frequency - inverse document frequency**: we take a word's frequency within a season and weight it by the *inverse fraction of seasons that contain the word*, on a log scale. Simply put, we penalize words that are common across all seasons, and reward ones that are not. This way, we bring forth the words most typical of each season. Again, the tidytext implementation is super easy.
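In formula form, this is just the standard definition (a sketch, writing $n_{t,s}$ for the count of word $t$ in season $s$ and $N$ for the number of seasons; `bind_tf_idf()` uses the natural logarithm):

$$\text{tf-idf}(t,s) = \underbrace{\frac{n_{t,s}}{\sum_{t'} n_{t',s}}}_{\text{term frequency}} \times \underbrace{\ln \frac{N}{|\{s' : t \in s'\}|}}_{\text{inverse document frequency}}$$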
```{r}
tf_idf_df <- tidy_df %>%
  count(season, word, sort = TRUE) %>%
  bind_tf_idf(word, season, n)
```
```{r, echo = F, warning = F, eval = F}
tf_idf_df %>% arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%
  top_n(20) %>%
  ggplot(aes(word, tf_idf, fill = season)) +
  geom_col() +
  labs(x = NULL, y = "tf-idf") +
  coord_flip() +
  ggtitle("Words with the Highest TF-IDF Scores") +
  theme(legend.position = "bottom") +
  scale_fill_brewer(palette = "Set1")
```
What we get back are the most important elements, characters, motifs or places across episodes. I'm somewhat surprised that Mr. Meeseeks didn't come in first though. I was sure as hell annoyed out of my mind after hearing it uttered for the 100th time during the episode [Meeseeks and Destroy](https://en.wikipedia.org/wiki/Meeseeks_and_Destroy). But then again, Mr. Meeseeks does make a cameo in two other seasons, so that kind of torpedoes his chances for the first spot.
## Topic models

Having seen the most unique words of the script by season, we will take our analysis one last step further and try to capture the gist of the show using topic modeling. Broadly speaking, topic modeling is an unsupervised classification method that tries to represent a document as a collection of topics. Here, I will take the classic [Latent Dirichlet Allocation, or LDA for short,](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) algorithm for a spin. The basic idea is that

> "...a topic is defined as a mixture over words where each word has a probability of belonging to a topic. And a document is a mixture over topics, meaning that a single document can be composed of multiple topics."

We could, for example, take season two and tell `LDA()` that we want to compress its 10 episodes into just 6 topics. To compensate for the omnipresence of the top words across episodes, I will exclude them for a clearer separation of topics.
```{r, warning = FALSE}
library(topicmodels)
popular_words <- c("rick", "morty", "yeah", "hey",
                   "summer", "jerry", "uh", "gonna")
episodes_dtm <- tidy_df %>% filter(season_num == 2 & !word %in% popular_words) %>%
  group_by(episode_num) %>%
  count(word, sort = TRUE) %>%
  cast_dtm(episode_num, word, n)
episodes_lda <- LDA(episodes_dtm, k = 6, control = list(seed = 1234))
```
After `tidy()`ing the results, we can plot the top 10 words that contribute most to each topic (their *beta* probabilities).
```{r, echo = FALSE, eval = F}
tidy(episodes_lda, matrix = "beta") %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip() +
  scale_fill_brewer(palette = "Set1") +
  ggtitle("Word / Topic Probabilities")
```
There are definitely a few topics that contain multiple elements of a particular episode. Take for example **topic 1**: it includes *"Roy"*, the name of the video game Morty plays in the same episode where *"Fart"*, a gaseous creature kept under lock and key by aliens, appears. Or **topic 5**, which probably relates to the episode where Rick visits his old lover *"Unity"*; it further contains words such as *"remember"* and *"memories"*, and the episode ends with Unity repeating "I want it *real*".

Not only can we examine the **word per topic probabilities**, we can also plot the **topic per document probabilities**, or *gamma* values. This lets us see which topic belongs to which episode.
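One helpful way to read these values: for a given episode $d$, the gamma values form a probability distribution over the six topics, $\sum_{k=1}^{6} \gamma_{d,k} = 1$, so a gamma close to 1 for a single topic means that topic makes up the episode almost exclusively.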
```{r, echo = FALSE}
t <- c("A Rickle in Time", "Mortynight Run", "Auto Erotic Assimilation", "Total Rickall",
       "Get Schwifty", "The Ricks Must Be Crazy", "Big Trouble in Little Sanchez",
       "Interdimensional Cable 2", "Look Who's Purging Now", "The Wedding Squanchers")
titles <- data.frame(document = as.character(1:10), title = factor(t, levels = t))
```
```{r, warning = F, message = F, eval = F}
tidy(episodes_lda, matrix = "gamma") %>%
  inner_join(titles) %>%
  ggplot(aes(factor(topic), gamma)) +
  geom_boxplot() +
  facet_wrap(~ title) +
  ggtitle("Dominant Topics per Episode")
```
Our earlier hunches are confirmed: the first topic does belong to the episode *Mortynight Run*, as does the fifth topic to *Auto Erotic Assimilation*. It is important to note that the results depend strongly on the number of topics supplied to `LDA()`, so inevitably, some experimentation is required to arrive at meaningful results.
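One way to guide that experimentation, rather than pure trial and error, is to refit the model for a handful of candidate values of *k* and compare their perplexity scores. The snippet below is a minimal sketch along those lines, assuming the `episodes_dtm` object from earlier; the candidate values are arbitrary, and perplexity is only a rough heuristic, since it tends to keep decreasing as *k* grows.

```{r, eval = FALSE}
library(topicmodels)
# Refit the LDA model for several candidate numbers of topics and
# compute the perplexity of each fit; lower suggests a better fit,
# but watch out: perplexity generally keeps falling as k increases.
ks <- c(2, 4, 6, 8, 10)
perplexities <- sapply(ks, function(k) {
  perplexity(LDA(episodes_dtm, k = k, control = list(seed = 1234)))
})
data.frame(k = ks, perplexity = perplexities)
```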
## Final thoughts

I ran through some very interesting concepts fairly quickly in this post, and I owe much of that to the tidytext package. With very little code, we can mine a tremendous amount of insight from textual data, and I have only scratched the surface of what's possible. The seamless integration with the tidyverse, as well as with igraph and topicmodels, makes a huge difference.

Nonetheless, text mining is a complex topic, and when moving on to more advanced material, [further reading](https://github.com/trinker/topicmodels_learning) on the inner workings of these algorithms might come in handy for effective use. The full [data](https://github.com/mtoto/mtoto.github.io/tree/master/data/2017-10-07-tidyrick/rick%20and%20morty) and [code](https://github.com/mtoto/mtoto.github.io/blob/master/blog/2017/2017-10-07-tidyrick.Rmd) for this post are available, as usual, on my GitHub.