Mining CRAN DESCRIPTION Files
Text analysis and more
library(knitr) knitr::opts_chunk$set(cache = TRUE, warning = FALSE, message = FALSE, cache.lazy = FALSE, dpi = 180) options(width=120, dplyr.width = 150) library(ggplot2) library(silgelib) theme_set(theme_roboto())
A couple of weeks ago, I saw on Dirk Eddelbuettel's blog that R 3.4.0 was going to include a function for obtaining information about packages currently on CRAN, including basically everything in DESCRIPTION files. When R 3.4.0 was released, this was one of the things I was most immediately excited about exploring, because although I recently dabbled in scraping CRAN to try to get this kind of information, it was rather onerous.
library(tidyverse) cran <- tools::CRAN_package_db() # the returned data frame has two columns with the same name??? cran <- cran[,-65] # make it a tibble cran <- tbl_df(cran) cran
There you go, all the packages currently on CRAN!
Practices of CRAN maintainers
Some of the fields in the DESCRIPTION file of an R package tell us a bit about how a CRAN maintainer works, and in aggregate we can see how R package developers are operating.
How many packages have a URL, a place to go like GitHub to see the code and check out what is going on?
cran %>% summarise(URL = mean(!is.na(URL)))
What about a URL for bug reports?
cran %>% summarise(BugReports = mean(!is.na(BugReports)))
How many packages have a package designated as a
cran %>% count(VignetteBuilder, sort = TRUE)
Are there packages that have vignettes but also have
VignetteBuilder? Yes, those would be packages that use Sweave, the built-in vignette engine that comes with R. This must be biased toward older packages and it can't be a large proportion of the total, given when CRAN has been growing the fastest. I know there are still packages with Sweave vignettes, but these days, having something in
VignetteBuilder is at least somewhat indicative of whether a package has a vignette. There isn't anything else in the DESCRIPTION file, to my knowledge, that indicates whether a package has a vignette or not.
How many packages use testthat or RUnit for unit tests?
library(stringr) cran %>% mutate(tests = ifelse(str_detect(Suggests, "testthat|RUnit"), TRUE, FALSE), tests = ifelse(is.na(tests), FALSE, tests)) %>% summarise(tests = mean(tests))
(Another handful of packages have these testing suites in Imports or Depends, but not enough to change that proportion much.)
Is it the same ~20% of packages that are embracing the practices of unit tests, building vignettes, and providing a URL for bug reports?
cran %>% mutate(tests = ifelse(str_detect(Suggests, "testthat|RUnit"), TRUE, FALSE), tests = ifelse(is.na(tests), FALSE, tests), bug_report = ifelse(is.na(BugReports), FALSE, TRUE), vignette = ifelse(is.na(VignetteBuilder), FALSE, TRUE)) %>% count(tests, bug_report, vignette)
Huh, so no, actually. I would have guessed that there would have been more packages in the
TRUE/TRUE/TRUE bin in this data frame and fewer in the bins that are mixes of
FALSE. What does that distribution look like?
library(tidyr) cran %>% mutate(tests = ifelse(str_detect(Suggests, "testthat|RUnit"), "Tests", "No tests"), tests = ifelse(is.na(tests), "No tests", tests), bug_report = ifelse(is.na(BugReports), "No bug report", "Bug report"), vignette = ifelse(is.na(VignetteBuilder), "No vignette builder", "Vignette builder")) %>% count(tests, bug_report, vignette) %>% mutate(percent = n / sum(n)) %>% arrange(desc(percent)) %>% unite(practices, tests, bug_report, vignette, sep = "\n") %>% mutate(practices = reorder(practices, -percent)) %>% ggplot(aes(practices, percent, fill = practices)) + geom_col(alpha = 0.7, show.legend = FALSE) + scale_y_continuous(labels = scales::percent_format()) + labs(x = NULL, y = "% of CRAN pacakges", title = "How many packages on CRAN have units tests, a URL for bug reports, or a vignette builder?", subtitle = "About 6% of packages currently on CRAN have all three")
Maybe I should not be surprised, since a package that I myself maintain has unit tests and a URL for bug reports but no vignette. And remember that a few of the "No vignette builder" packages are maintainers choosing to produce vignettes via Sweave, OLD SCHOOL.
Yo dawg I heard you like Descriptions in your DESCRIPTION
One of the fields in the DESCRIPTION file for an R package is the
Description for the package.
cran %>% filter(Package == "tidytext") %>% select(Description)
library(tidytext) tidy_cran <- cran %>% unnest_tokens(word, Description) word_totals <- tidy_cran %>% anti_join(stop_words) %>% count(word, sort = TRUE)
word_totals %>% top_n(20) %>% mutate(word = reorder(word, n)) %>% ggplot(aes(word, n)) + geom_col(fill = "cyan4", alpha = 0.8) + coord_flip() + scale_y_continuous(expand = c(0,0)) + labs(x = NULL, y = "Number of uses in CRAN descriptions", title = "What are the most commonly used words in CRAN package descriptions?", subtitle = "After removing stop words")
Now let's see what the relationships between all these description words are. Let's look at how words are correlated together within description fields and make a word network.
library(igraph) library(ggraph) library(widyr) word_cors <- tidy_cran %>% anti_join(stop_words) %>% group_by(word) %>% filter(n() > 150) %>% # filter for words used at least 150 times ungroup %>% pairwise_cor(word, Package, sort = TRUE) filtered_cors <- word_cors %>% filter(correlation > 0.2, item1 %in% word_totals$word, item2 %in% word_totals$word) vertices <- word_totals %>% filter(word %in% filtered_cors$item1) set.seed(1234) filtered_cors %>% graph_from_data_frame(vertices = vertices) %>% ggraph(layout = "fr") + geom_edge_link(aes(edge_alpha = correlation), width = 2) + geom_node_point(aes(size = n), color = "cyan4") + geom_node_text(aes(label = name), repel = TRUE, point.padding = unit(0.2, "lines"), family = "RobotoCondensed-Regular") + theme_graph(base_family = "RobotoCondensed-Regular") + theme(plot.title=element_text(family="Roboto-Bold")) + scale_size_continuous(range = c(1, 15)) + labs(size = "Number of uses", edge_alpha = "Correlation", title = "Word correlations in R package descriptions", subtitle = "Which words are more likely to occur together than with other words?")
If you are interested in this approach to text analysis in R, check out the book Dave and I are publishing with O'Reilly, to be released this summer, available online as well. I found it really interesting to get a glimpse into this ecosystem that is such an important part of my professional and open-source life, both to see the overlap with the areas that I work in and the vast areas that I do not! The R Markdown file used to make this blog post is available here. I am very happy to hear feedback and questions!