/
ex-episode-analysis.Rmd
158 lines (120 loc) · 6.94 KB
/
ex-episode-analysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
---
title: "Episode analysis"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Episode analysis}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, include=FALSE}
set.seed(1)
knitr::opts_chunk$set(
collapse = TRUE, comment = "#>", message = FALSE, warning = FALSE, error = FALSE,
fig.width = 12, fig.height = 7
)
library(rtrek)
```
## Importing transcript data
A curated data frame of metadata and text variables derived from episode and movie transcripts can be downloaded with `st_transcripts`. The format is one episode per row. There are metadata columns and a list column of nested text variables.
```{r ep1}
library(dplyr)
library(tidyr)
library(ggplot2)
library(ggrepel)
(scriptData <- st_transcripts())
scriptData$text[[81]]
```
## Basic TNG summary
Consider TNG episodes. A rough estimate of the relative amount of speaking parts can be obtained by counting up the lines for each character. A better measure would be an estimate of word count taken from each spoken line. Calculate these statistics by season and episode as well as character.
Also remove unneeded columns. As is common in text analysis, even a clean dataset may need further preparation for a specific task. In this case it is important to strip references to things such as voice over `(V.o.)` from the `character` column.
```{r ep2}
pat <- "('s\\s|\\s\\(|\\sV\\.).*"
x <- filter(scriptData, format == "episode" & series == "TNG") %>%
unnest(text) %>%
select(season, title, character, line) %>%
mutate(character = gsub(pat, "", character)) %>%
group_by(season, title, character) %>%
summarize(lines = n(), words = length(unlist(strsplit(line, " "))))
x
```
Another limitation of using lines is that they may not always mimic the natural breaks in spoken lines in the episodes. While word count here is simply estimated by breaking text on spaces, it is likely more representative than the line count.
### Total spoken words
Next, focus on the top eight characters.
```{r ep3}
totals <- group_by(x, character) %>%
summarize(lines = sum(lines), words = sum(words)) %>%
arrange(desc(words)) %>% top_n(8)
totals
```
By the way, look at the total estimated lines and words spoken by each character. These are of course rough estimates. Nevertheless, it is interesting to see that Picard has nearly twice as much to say as the next most talkative character on the show, the Android, Data. It must be all those impromptu diplomatic speeches.
```{r ep4}
id <- totals$character
chr <- factor(totals$character, levels = id)
uniform_colors <- c("#5B1414", "#AD722C", "#1A6384")
ulev <- c("Command", "Operations", "Science")
uniform <- factor(ulev[c(1, 2, 1, 2, 3, 3, 2, 3)], levels = ulev)
totals <- mutate(totals, character = chr, uniform = uniform)
ggplot(totals, aes(character, words, fill = uniform)) +
geom_col(color = NA, show.legend = FALSE) +
scale_fill_manual(values = uniform_colors) +
geom_text(aes(label = paste0(round(words / 1000), "K")),
size = 10, color = "white", vjust = 1.5) +
scale_x_discrete(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0)) +
theme_minimal(18)
```
### Prominent roles
What is each character's biggest episode per season in terms of estimated spoken words?
```{r ep5}
biggest <- filter(x, character %in% id) %>%
mutate(character = factor(character, levels = id)) %>%
group_by(season, character) %>%
summarize(title = title[which.max(words)], words = max(words)) %>%
arrange(character)
biggest
biggest <- mutate(biggest, winner = character[which.max(words)],
ymn = min(words), ymx = max(words)) %>% ungroup() %>%
mutate(uniform = factor(uniform[match(biggest$character, id)], levels = ulev))
ggplot(biggest, aes(season, words)) +
geom_linerange(aes(ymin = ymn, ymax = ymx)) +
geom_point(aes(fill = uniform), shape = 21, size = 2, show.legend = FALSE) +
scale_fill_manual(values = uniform_colors) +
geom_text_repel(aes(label = paste0(title, " (", character, ")")),
size = 2.3, hjust = -0.1, direction = "y", min.segment.length = 0.65) +
scale_x_continuous(breaks = 1:7, labels = 1:7, expand = expand_scale(0, c(0.1, 0.8))) +
theme_minimal(18) + theme(panel.grid.minor = element_blank())
```
## Transferring from TNG to DS9
Finally, look at both Worf and O'Brien, making a comparison between TNG and DS9 in terms of their prominence.
```{r ep6}
x <- filter(scriptData, format == "episode" & series %in% c("TNG", "DS9")) %>%
unnest(text) %>%
select(series, season, title, character, line) %>%
mutate(character = gsub(pat, "", character)) %>%
filter(character %in% c("Worf", "O'brien")) %>%
group_by(series, title, character) %>%
summarize(lines = n(), words = length(unlist(strsplit(line, " "))))
avg <- group_by(x, series, character) %>%
summarize(lines = mean(lines), words = mean(words), n_episodes = n()) %>%
arrange(character, series) %>% ungroup()
avg$character <- gsub("O'brien", "O'Brien", avg$character)
avg
id <- rev(unique(avg$character))
chr <- factor(avg$character, levels = id)
avg <- mutate(avg, series = factor(series, levels = c("TNG", "DS9")), character = chr)
ggplot(avg, aes(character, words, fill = series)) +
geom_col(color = NA, show.legend = FALSE, position = position_dodge()) +
scale_fill_manual(values = c("dodgerblue", "orange")) +
geom_text(aes(label = paste0(series, ": ", round(words))),
size = 10, color = "white", vjust = 1.5,
position = position_dodge(width = 0.9)) +
labs(y = "Average words per episode") +
scale_x_discrete(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0)) +
theme_minimal(18)
```
This is a simple exploration of the data, but the results are interesting. Starfleet promotions in rank aside, Worf did not even quite receive a 10% pay bump in terms of average spoken words per episode. On the other hand, O'Brien had an increase of 211%, or more than triple the average number of spoken words per episode when moving from the Enterprise to Deep Space Nine. All things considered, take the transfer. But on spoken words alone, O'Brien was clearly favored. The squeaky wheel gets the grease.
### Considerations
This calculation accounts for the number of episodes each character has speaking lines in. For example, TNG episodes missing O'Brien after DS9 began and DS9 episodes missing Worf before he joined the show are not counted against them.
It does not account for episodes where a character may appear in an episode, but without any speaking lines, which should drop their averages. However, this is rare and, if anything, I would expect it to exacerbate the difference seen here rather than diminish it. I'm just guessing there may have been some early TNG episodes where O'Brien was shown but never spoke.
A more rigorous approach that I may show in a subsequent example would be to join this transcript data with episode casting data from STAPI so that there is no need to rely on speaking lines in transcripts to guess at whether someone was featured in an episode.