/
2018-01-07-comparing-33c3-and-34c3-tweets.Rmd
243 lines (171 loc) · 10.6 KB
/
2018-01-07-comparing-33c3-and-34c3-tweets.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
---
title: Comparing 33C3 and 34C3 tweets
author: Knut Behrends
date: '2018-01-07'
slug: tweets-33C3-vs-33C4
categories: ['personal-blog']
tags: [twitter, fun, rstats]
banner: img/post/thumb/tweets--33c3--34cc-histo-per-day--80x.png
draft: false
---
### Why this analysis?
There are two reasons:
- learn to process Twitter data at scale (~100MB-1GB/file - that's large enough for me)
- get more info from/about the inspiring tweets from the CCC community
(This blogpost ist still a DRAFT - I want to publish quickly, even when an analysis is incomplete, or the text is a bit rough in places)
### Background
Occasionally, over the last two years, I have gathered data from the Twitter Streaming API. However I have never really analysed these datasets.
Often my tweet-gathering happens during specific events, each with limited duration. When the period of data harvesting is limited to a few days or hours, data files do not get too big.
For instance, I have collected tweets during the Euro 2016 Championship Final, during the US elections 2016, from the AGU/EGU geosciences conferences, and a few more. Until now these files just sat on my hard drive, untouched. They are typically less than 1 GB in size. For this study, I've used files 200 MB and 600 MB in size, containing tweets about the 33C3 and the 34C3 conferences, respectively.
I really like the [Chaos Communication Congress](https://de.wikipedia.org/wiki/Chaos_Communication_Congress#Kongress-Mottos_und_Veranstaltungsorte_1984_bis_heute)), nicknamed XXC3. I remember well how I first happened to know about it, in 2003 or 2004. Waiting at a Berlin train station I saw that a high-rise building had become a [giant light-installation](https://de.wikipedia.org/wiki/Projekt_Blinkenlights) ([Projekt Blinkenlights](https://en.wikipedia.org/wiki/Projekt_Blinkenlights)). Baffled, I went to the conference center, purchased a ticket, and went inside. It was a great experience, so informative! (Back in those days, conference tickets were really easy to come by. Today ticket batches sell out within minutes, and are given preferably to CCC members and friends). Bitcoin, hardware hacking, security nightmares... I heard first about all these things, and many others, at the CCC congresses. During the last years, I didn't attend any CCC conferences (maybe I'm too old for this...?)
Well, I never was a CCC member, and I'm not a really a hardware hacker or electronics guy. However, I think continued computer-education is important. Therefore I try to watch the live streams from the annual CCC conferences. Many of the streamed talks contain very high quality content, for free.
For a while now, I've also noticed that the tweets created during these conferences are highly inspiring. So let's analyse the tweets a bit.
### Preprocessing and simple timeseries plots
Load the necessary R packages, and the datafiles. (For details see [github page](https://github.com/knbknb/kbehrends.netlify.com/blob/master/content/post/2018-01-07-comparing-33c3-and-34c3-tweets.Rmd)).
```{r libs, warning=FALSE, message=FALSE, echo=FALSE}
options(knitr.table.format = "html")
options(fig.width = 9)
library(here) # filesystem
library(stringr) #
library(lubridate) # dateformats
library(threejs) # interactive plot
library(tidyverse) #
library(igraph) # network
library(jsonlite) # JSON
library(kableExtra) # styled output
theme_set(theme_bw())
datadir <- "data_private/twitter"
# (extracts from the full tweets data, only id, creation_date, tweet_text)
infile0 <- here::here(datadir, "33c3.created_at.json")
infile1 <- here::here(datadir, "34c3.created_at.json")
infile3 <- "33c3_strlength.txt"
infile4 <- "34c3_strlength.txt"
# read in simple JSON files, simplifyVector converts to data frames
tweets33c3 <- read_json(infile0, simplifyVector = TRUE) %>%
mutate(event = "33c3")
tweets34c3 <- read_json(infile1, simplifyVector = TRUE) %>%
mutate(event = "34c3")
# read in tweet lengths, discussed below
tweetlengths_33c3 <- read_csv(here::here(datadir, infile3), col_names = FALSE)
tweetlengths_34c3 <- read_csv(here::here(datadir, infile4), col_names = FALSE)
tweetlengths_33c3 <- tweetlengths_33c3 %>% mutate(event = "33c3")
tweetlengths_34c3 <- tweetlengths_34c3 %>% mutate(event = "34c3")
tweetlengths <- bind_rows(tweetlengths_33c3, tweetlengths_34c3)
```
#### Distribution of tweet lengths
Usually each tweet about any subject consists *at least* of 2000 characters. This might surprise you, because a tweet-text used to be 140 characters in length, until Twitter increased it to 280 characters in 2017. But a tweet contains a lot of metadata, and this makes a tweet exceed 2000 characters in length when serialized to a compact JSON string (*not* pretty-printed).
Line lengths of serialized tweet data-structures can be determined with:
`awk '{print length}' stream_of_tweets.json > 34c3_strlength.txt` .
The following histograms show that there were many more tweet data-structures for the 34C3 congress than for the 33C3 congress. There are also many junk tweet-fragments, from 0-2500 characters in length, that are not present in the 33C3 histogram (red rectangle).
I think the multi-modal peak structure of the histograms represents tweets that have been retweeted once or twice, or even more often. Actually I haven't investigated that.
```{r len, fig.width=9, echo=FALSE, warning=FALSE}
myColors <- c("dodgerblue", blues9[6])
ggplot(tweetlengths, aes(X1, fill=factor(event))) +
geom_histogram( binwidth=100) +
geom_rect(xmin=0, xmax= 2000, ymin=0, ymax=300, color="tomato", alpha=0) +
facet_wrap(~ event, ncol=2, nrow=1) +
xlim(c(0,20000)) +
xlab("Tweet length [characters]") +
ylab("Count of Tweets") +
scale_fill_manual(values=myColors) +
theme(legend.position = "none") +
ggtitle("Histograms of lengths of 33C3/34C3 tweet texts",
subtitle = "Scraping often failed to separate 34C3 tweets properly, creating garbage fragments.")
```
For 34C3, the harvesting script was not able to extract all tweets as syntactically valid JSON files. Visual inspection revealed that for some unknown reason they consist of many corrupted and duplicated tweets. I have not figured out exactly what was wrong. It seems that many tweets were nested inside each other randomly, but the nesting occured almost always on a single line. Still, some tweet fragments from the 34C3 scrape *did* span multiple lines; see the red rectangle in the figure above. However, there were so many garbage tweets in the 600MB .json-file that it was impossible to remove them all manually.
#### Bad JSON
So, what did the JSON look like? *Not* like this:
```
tweet1
tweet2
tweet3
tweet4
```
...but like this:
```
tweet1_tweet2fragment1
tweet2fragment2
tweet3
tweet4
```
Using a series of shell commands and pipelines, I tried my best to recover the first tweet from each line, discarding the fragmented JSON objects tweets misplaced and appended to each line. This recovery step might produce duplicated tweets for some reason.
Intermediate result:
```
tweet1
tweet1
fragment2
tweet3
tweet4
```
#### Valid tweets
Valid tweets always start with '"{created_at:"', at the beginning of the line.
- keep these
- remove duplicated tweets
Thus I had to further preprocess the tweets with a command similar to this pipe:
```
grep "^{\"created_at" stream__34c3.id_date_text.json |
uniq > 34c3.created_at.json`
```
Convert dates to strings, group them into day-of-month, and hours-of-day.
```{r dates, echo=FALSE, results='hide'}
tweetdates <- bind_rows(tweets33c3, tweets34c3)
# rstats_tweettimes.txt: "Sun Jul 10 08:14:23 +0000 2016"
tweetdates <- tweetdates %>%
mutate(dt = str_replace(dt, "\\+0000 ", ""),
dt = str_replace(dt, "^\\w+ ", ""),
dt = parse_date_time(dt, orders = c("b d H:M:S Y")),
y_m_d = format(dt, "%Y-%m-%d"),
day = day(dt),
hour = hour(dt),
year= year(dt))
```
Tweets gathered during the two events:
```{r tab1, echo=FALSE}
tweetsumm <- tweetdates %>%
group_by(event) %>%
summarize(from=min(format( dt, "%Y-%m-%d")),
to = max(format( dt, "%Y-%m-%d")),
n_tweets = n(),
median_text_length = median(str_length(txt)))
summ_html <- knitr::kable(tweetsumm,
align = "lllcc", col.names = c("Event", "From", "To", "Num Tweets", "Median Length of text"))
kable_styling(summ_html, "striped", position = "left")
```
There exist still more tweets about the 34C3 conference than about the 33C3, but the difference in tweet count is no longer as big as the first histogram above suggested.
#### Plot: User activity over time
This plot compares user activity on twitter during the 2 conferences, one year apart:
```{r plot1, fig.width=9, echo=FALSE}
tweetdates_27 <- tweetdates %>%
filter((event == "34c3" & (day >= 27 | day == 1))|(event == "33c3"))
tweetdates_27 %>%
ggplot(aes(hour)) +
geom_bar(fill="dodgerblue") +
facet_wrap(event ~ y_m_d, ncol=6) +
xlab("Hour of Day")+ ylab("Tweets per Hour") +
ggtitle("#33C3 vs #34C3 Tweets (from Dec 2016 and 2017)",
subtitle = "Harvested from the Twitter Streaming API. Local Time Zone: Central European Time.")
```
For the 33C3 I've started gathering tweets later (on Dec 27, 2016) and finished earlier, after 6 days. For the 34C3 I've let the script run unattended for 13 days (see also 'Extra plots', at end of post).
**Observations (TBC)**
The 34C3 congress had 15000 visitors, much more than 33C3 (12000), but absolute count of tweets per hour seems to be lower in 2017, during the first 2 days, whereas during the last 2 days users seem to have tweeted a bit more, though.
- Maybe Twitter has become less popular among the visitors?
- Maybe my preprocessing script was too aggressive in cleaning up things.
- Maybe the Twitter streaming API works differently?
(To be continued:)
- Wordcloud?
- Network Diagrams from mentions and retweets?
### Extra plots
This plot shows that there was some user activity on Twitter even before the 34c3 started, and after it had ended.
```{r plotextra1, fig.width=9, fig.height=7, echo=FALSE}
tweetdates %>%
filter(event == "34c3") %>%
ggplot(aes(hour)) +
geom_bar(fill="dodgerblue") +
facet_wrap(event ~ y_m_d, ncol=6, nrow=3) +
expand_limits(y=1400) +
#scale_color_manual(c("dodgerblue", blues9[6])) +
xlab("Hour of Day")+ ylab("Tweets per Hour") +
ggtitle("#34C3 Tweets (27.-30. Dec 2017) - Pre- and Post-Conference Tweets",
subtitle = "Harvested from the Twitter Streaming API. Local Time Zone: Central European Time.")
```
(TBC)