---
title: Your and my 2019 R goals
date: 2019-01-01
tags:
  - Twitter
  - rtweet
slug: r-goals
output:
  md_document:
    variant: markdown_github
    preserve_yaml: true
---

Here we go again, using a Twitter trend as blog fodder! Colin Fay launched an inspiring movement by sharing his R goals of 2019.

{{< tweet 1078953442385772544 >}}

It’s been quite interesting reading the objectives of other tweeps: what they want to learn, make, how they want to get involved in the community, etc. As Mike Kearney, rtweet’s maintainer, underlined, it is excellent reading material!

{{< tweet 1079865464241774593 >}}

… but also blogging material! Let me fetch and tokenize these tweets to summarize them!

Disclaimer: I later saw that Jason Baik had the same idea and was faster than me; find his analysis here.

Collect Twitter data

If you’re using rtweet for the first time, check out its website for information about use and setup, and refer to the Twitter API docs themselves to learn more about rate limiting and, e.g., to find out that the search endpoint won’t return tweets older than 6-9 days.

tweets <- rtweet::search_tweets("Rstats goals 2019",
                                include_rts = FALSE)

I obtained 87 tweets from 85 unique users. Definitely not big data, but not bad!
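Counts like these are quick to compute. Here is a minimal sketch, assuming a data frame with one row per tweet and a `screen_name` column identifying the author; the toy data is invented and stands in for the real search result.

```r
# Toy stand-in for the data frame returned by the search (values are invented)
toy_tweets <- data.frame(
  screen_name = c("user_a", "user_b", "user_a", "user_c"),
  text = c("goal 1", "goal 2", "goal 3", "goal 4"),
  stringsAsFactors = FALSE
)

n_tweets <- nrow(toy_tweets)                       # one row per tweet
n_users <- length(unique(toy_tweets$screen_name))  # distinct authors

cat(n_tweets, "tweets from", n_users, "unique users\n")
```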

Tokenize tweets

I then set out to tokenize the tweets into words using the tweet-specific tokenizer tokenizers::tokenize_tweets() via the tidytext package. If you’re new to tidytext, I’d recommend reading the book written by its authors. A token in natural language processing can be a word, a line, etc., which is a totally different concept from a token for rtweet functions (your API credentials).

The tweet tokenization is a “tokenization by word that preserves usernames, hashtags, and URLS”. So awesome, and today is the first time I’ve found an occasion to use it! I also removed stopwords.

library("magrittr")

stopwords <- rcorpora::corpora("words/stopwords/en")$stopWords

tokens <- tweets %>%
  dplyr::select(text) %>%
  tidytext::unnest_tokens(token, text,
                          token = "tweets",
                          drop = FALSE) %>%
  dplyr::filter(!token %in% stopwords)
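To give an idea of what “preserving usernames, hashtags, and URLs” means, here is a rough base R sketch. It is not the algorithm used by tokenizers: the `tokenize_tweetish()` helper, the toy tweets and the tiny stopword list are all made up for illustration. It may also be useful as a fallback, since newer releases of tokenizers and tidytext may no longer provide the "tweets" option.

```r
# Invented tweet-like strings
toy_texts <- c(
  "My #rstats goals: learn purrr with @someone https://example.org/x",
  "Finish THE package!"
)

# Hypothetical helper: keep URLs, @usernames and #hashtags whole,
# split everything else into words
tokenize_tweetish <- function(x) {
  pattern <- "https?://[^[:space:]]+|[@#][[:alnum:]_]+|[[:alnum:]']+"
  tokens <- regmatches(x, gregexpr(pattern, x))
  lapply(tokens, function(tok) {
    is_url <- grepl("^https?://", tok)
    tok[!is_url] <- tolower(tok[!is_url])  # lowercase everything but URLs
    tok
  })
}

toy_stopwords <- c("my", "the", "with")  # tiny made-up stopword list

tokens_list <- tokenize_tweetish(toy_texts)
kept <- lapply(tokens_list, function(tok) tok[!tok %in% toy_stopwords])
print(kept)
```

With a plain word tokenizer, `#rstats` and `@someone` would typically lose their prefixes and the URL would be split into pieces.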

Analyze tweets

Most mentioned topics

I was first able to draw a figure similar to Jason Baik’s, showing the most common tokens. Like him, I removed digits.

library("ggalt")
tokens %>%
  dplyr::mutate(token = stringr::str_remove_all(token, "[^\x01-\x7F]")) %>%
  dplyr::mutate(token = stringr::str_remove_all(token, "[[:digit:]]")) %>%
  dplyr::filter(! token %in% c("", "#rstats", "goals")) %>%
  dplyr::count(token, sort = TRUE) %>%
  dplyr::mutate(token = reorder(token, n)) %>%
  head(n = 18) %>%
  ggplot()  +
  geom_lollipop(aes(token, n),
                size = 1.5, col = "salmon") +
  hrbrthemes::theme_ipsum(base_size = 12,
                          axis_title_size = 12) +
  coord_flip()

Most common tokens in 2019 R goals tweets
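To make the cleaning steps of the pipeline above more concrete, here is a small base R sketch on invented tokens. It uses `gsub()` and `table()` as stand-ins for the stringr and dplyr calls, and shows why empty strings need to be filtered out: tokens made only of emoji or digits become empty once those characters are removed.

```r
# Invented tokens; "1\ufe0f\u20e3" is the keycap emoji for 1
toy_tokens <- c("learn", "1\ufe0f\u20e3", "shiny", "2019", "learn",
                "#rstats", "shiny", "learn", "goals")

# Drop non-ASCII characters, then digits (stand-ins for str_remove_all())
cleaned <- gsub("[^\\x01-\\x7F]", "", toy_tokens, perl = TRUE)
cleaned <- gsub("[[:digit:]]", "", cleaned)

# Remove the now-empty tokens and the search terms themselves
kept <- cleaned[!cleaned %in% c("", "#rstats", "goals")]

# Count and sort, as dplyr::count(sort = TRUE) does
counts <- sort(table(kept), decreasing = TRUE)
print(counts)
```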

What actions?

In this figure I can identify verbs such as learn, finish, write, build and contribute. Let me look at a sample of three lines for each of them.

lines <- tweets %>%
  dplyr::select(text) %>%
  tidytext::unnest_tokens(line, text,
                          token = "lines")

sample_verb <- function(verb, lines){
  set.seed(42)
  dplyr::filter(lines, stringr::str_detect(line, paste0(verb, " "))) %>%
    dplyr::sample_n(3)
}

samples <- purrr::map_df(c("learn", "finish", "write", "build", "contribute"), sample_verb, lines)

knitr::kable(samples)
line
3️⃣ learn how to make r packages and write my code so it could be made into an r package more easily
1. learn how to do spatial analysis in r
2️⃣ learn better way to automate feature engineering (neural nets) for text
1️⃣ finally finish all the courses and certifications i started last year on #coursera and #datacamp
2️⃣ finish my track and field r package
3). finish that text mining project i started in october
4️⃣ write an advanced shiny book with bookdown 🎈
3️⃣ learn how to make r packages and write my code so it could be made into an r package more easily
1️⃣ write the htmlwidgets book
- build my first #rstats package (aiming for 2 but 1 would be great :d)
2⃣ build a shiny web app to explore tx staar data
- use f(x) regularly & build own package. cease patching.
5⃣ contribute to foss https://t.co/oh7mwcq50r
2 contribute more to #rstats community through #scicomm, #stackoverflow, etc
3) contribute to #swdchallenge (with r, duh)

These actions are quite varied, e.g. writing is applied to software as well as reading material. My goal was to summarize tweets, but I keep thinking reading all of them is interesting!
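A note on the matching in sample_verb(): the trailing space in `paste0(verb, " ")` keeps “learn” from matching “learning”, but it also misses a verb at the very end of a line and still matches longer words ending in the verb. The sketch below compares it with a word-boundary pattern on invented lines. The word-boundary version is an alternative, not what was used for the table above.

```r
# Invented lines
toy_lines <- c("learn how to make r packages",
               "keep learning",
               "things i want to learn",
               "relearn statistics basics")

# Verb followed by a space, as in sample_verb()
with_space <- grepl("learn ", toy_lines, fixed = TRUE)

# Alternative (not used in the post): word boundaries on both sides
with_boundary <- grepl("\\blearn\\b", toy_lines, perl = TRUE)

print(data.frame(line = toy_lines, with_space, with_boundary))
```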

Packages?

I wondered how many of the tokens correspond to a package name. I limited myself to CRAN packages, by using the available.packages() function, but one could have a look at the source code of the available package to get an idea of how to find names of packages from Bioconductor and GitHub.

cran_pkgs <- as.character(
  available.packages(contrib.url('https://cran.r-project.org', 'source'))[,"Package"])
pkg_tokens <- dplyr::mutate(tokens,
                            token = gsub("#", "", token)) %>%
  dplyr::filter(token %in% cran_pkgs)

Using this data, I’ll look at the tweets mentioning the most packages, and at the most frequently mentioned packages.

pkg_tokens %>%
  dplyr::group_by(text) %>%
  dplyr::mutate(pkg_text = paste(toString(token), text)) %>%
  dplyr::count(pkg_text, sort = TRUE) %>%
  head(n = 3) %>%
  dplyr::pull(pkg_text)
## [1] "portfolio, blogdown, rmarkdown, knitr, shiny, maps I really like seeing all these #rstats 2019 goals. My own, in order of urgency:\n1) Finish my personal website and online portfolio using blogdown\n2) Get rolling with project workflows, rmarkdown, and knitr \n3) Create shiny apps for custom interactive maps"                                           
## [2] "inference, projects, import, rvest, httr, xml2 My #rstats 2019 goals:\n1. Improve my statistical modeling and inference skills\n2. Develop business literacy and apply it in data analysis projects\n3. Continue to post on my blog (1 post every 2 months)\n4. Learn to import data using DBI, rvest, httr, and xml2"                                           
## [3] "shiny, templates, shiny, shiny, bookdown, shiny My #RStats goals for 2019: \n\n1<U+FE0F><U+20E3> Improve shinydashboardPlus, bs4Dash and argonDash ..<U+0001F973> \n\n2<U+FE0F><U+20E3> Release new shiny templates              \n3<U+FE0F><U+20E3> Open a consulting service for https://t.co/k3PAbxyVMa about shiny \n4<U+FE0F><U+20E3> Write an advanced shiny book with bookdown <U+0001F388>\n#rstats #shiny #consulting https://t.co/Fyc7MhaeW8"

There are false positives, e.g. projects was here meant as a word, not a package name. What about the most popular packages among the tweets?

dplyr::count(pkg_tokens, token, sort = TRUE)
## # A tibble: 66 x 2
##    token         n
##    <chr>     <int>
##  1 shiny        16
##  2 blogdown      9
##  3 projects      9
##  4 rmarkdown     9
##  5 tidyverse     6
##  6 bookdown      5
##  7 purrr         4
##  8 track         4
##  9 caret         3
## 10 markdown      3
## # ... with 56 more rows

In this table, we get a glimpse of currently popular packages, apart from “projects”, “track” and “markdown”. If I’m reading the list correctly, they’re all developed at RStudio!
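The false positives come from the matching being nothing more than an exact comparison of strings. The toy sketch below shows this, with a short invented vector standing in for the real list of CRAN package names and a handful of made-up tokens.

```r
# Hypothetical stand-in for the vector of CRAN package names
toy_pkgs <- c("shiny", "purrr", "projects", "track")

# Invented tokens, some with a leading hashtag
toy_tokens <- c("#shiny", "projects", "learn", "purrr", "track", "#rstats")

# Same logic as above: strip "#" and keep tokens that are package names
stripped <- gsub("#", "", toy_tokens)
matches <- stripped[stripped %in% toy_pkgs]
print(matches)
```

Nothing in this comparison can tell whether “projects” was meant as the package or as the ordinary word, so reading the tweets themselves remains the safest check.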

Conclusion

In this post I followed an approach similar to Jason Baik’s to summarize tweets about 2019 R goals announced on Twitter: I collected tweets with rtweet and then used tidytext and the tidyverse to summarize them. Goals often included learning about stuff, building packages (find my list of resources and don’t miss this offer by Steph de Silva), and mentions of RStudio packages.

What about my own R goals, which I haven’t tweeted? I have not made any list, but I have exciting projects at work, and hope to keep posting semi-consistently on this blog. In January I’ll also get to start 2019 by giving two R talks, one at R-Ladies Paris and a remote one at ConectaR 2019! Happy 2019, I hope you can meet your own R goals!