Switch branches/tags
Nothing to show
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
200 lines (150 sloc) 8.38 KB
layout title date output share categories excerpt tags
Text analysis and more
knitr::opts_chunk$set(cache = TRUE, warning = FALSE, message = FALSE, 
                      cache.lazy = FALSE, dpi = 180)
options(width=120, dplyr.width = 150)

A couple of weeks ago, I saw on Dirk Eddelbuettel's blog that R 3.4.0 was going to include a function for obtaining information about packages currently on CRAN, including basically everything in DESCRIPTION files. When R 3.4.0 was released, this was one of the things I was most immediately excited about exploring, because although I recently dabbled in scraping CRAN to try to get this kind of information, it was rather onerous.


cran <- tools::CRAN_package_db()

# the returned data frame has two columns with the same name???
cran <- cran[,-65]

# make it a tibble
cran <- tbl_df(cran)


There you go, all the packages currently on CRAN!

Practices of CRAN maintainers

Some of the fields in the DESCRIPTION file of an R package tell us a bit about how a CRAN maintainer works, and in aggregate we can see how R package developers are operating.

How many packages have a URL, a place to go like GitHub to see the code and check out what is going on?

cran %>% 
    summarise(URL = mean(!

What about a URL for bug reports?

cran %>% 
    summarise(BugReports = mean(!

How many packages have a package designated as a VignetteBuilder?

cran %>% 
    count(VignetteBuilder, sort = TRUE)

Are there packages that have vignettes but also have NA for VignetteBuilder? Yes, those would be packages that use Sweave, the built-in vignette engine that comes with R. This must be biased toward older packages and it can't be a large proportion of the total, given when CRAN has been growing the fastest. I know there are still packages with Sweave vignettes, but these days, having something in VignetteBuilder is at least somewhat indicative of whether a package has a vignette. There isn't anything else in the DESCRIPTION file, to my knowledge, that indicates whether a package has a vignette or not.

How many packages use testthat or RUnit for unit tests?


cran %>% 
    mutate(tests = ifelse(str_detect(Suggests, "testthat|RUnit"), TRUE, FALSE),
           tests = ifelse(, FALSE, tests)) %>%
    summarise(tests = mean(tests))

(Another handful of packages have these testing suites in Imports or Depends, but not enough to change that proportion much.)

Is it the same ~20% of packages that are embracing the practices of unit tests, building vignettes, and providing a URL for bug reports?

cran %>%
    mutate(tests = ifelse(str_detect(Suggests, "testthat|RUnit"), TRUE, FALSE),
           tests = ifelse(, FALSE, tests),
           bug_report = ifelse(, FALSE, TRUE),
           vignette = ifelse(, FALSE, TRUE)) %>%
    count(tests, bug_report, vignette)

Huh, so no, actually. I would have guessed that there would have been more packages in the TRUE/TRUE/TRUE bin in this data frame and fewer in the bins that are mixes of TRUE and FALSE. What does that distribution look like?


cran %>%
    mutate(tests = ifelse(str_detect(Suggests, "testthat|RUnit"), "Tests", "No tests"),
           tests = ifelse(, "No tests", tests),
           bug_report = ifelse(, "No bug report", "Bug report"),
           vignette = ifelse(, "No vignette builder", "Vignette builder")) %>%
    count(tests, bug_report, vignette) %>%
    mutate(percent = n / sum(n)) %>%
    arrange(desc(percent)) %>%
    unite(practices, tests, bug_report, vignette, sep = "\n") %>%
    mutate(practices = reorder(practices, -percent)) %>%
    ggplot(aes(practices, percent, fill = practices)) +
    geom_col(alpha = 0.7, show.legend = FALSE) +
    scale_y_continuous(labels = scales::percent_format()) +
    labs(x = NULL, y = "% of CRAN pacakges",
         title = "How many packages on CRAN have units tests, a URL for bug reports, or a vignette builder?",
         subtitle = "About 6% of packages currently on CRAN have all three")

Maybe I should not be surprised, since a package that I myself maintain has unit tests and a URL for bug reports but no vignette. And remember that a few of the "No vignette builder" packages are maintainers choosing to produce vignettes via Sweave, OLD SCHOOL.

Yo dawg I heard you like Descriptions in your DESCRIPTION

One of the fields in the DESCRIPTION file for an R package is the Description for the package.

cran %>%
    filter(Package == "tidytext") %>%

Let's use the tidytext package that I have developed with David Robinson to take a look at the words maintainers use to describe their packages. What words do they use the most often?


tidy_cran <- cran %>%
    unnest_tokens(word, Description)

word_totals <- tidy_cran %>%
    anti_join(stop_words) %>%
    count(word, sort = TRUE)

word_totals %>%
    top_n(20) %>%
    mutate(word = reorder(word, n)) %>%
    ggplot(aes(word, n)) +
    geom_col(fill = "cyan4", alpha = 0.8) +
    coord_flip() +
    scale_y_continuous(expand = c(0,0)) +
    labs(x = NULL, y = "Number of uses in CRAN descriptions",
         title = "What are the most commonly used words in CRAN package descriptions?",
         subtitle = "After removing stop words")

Now let's see what the relationships between all these description words are. Let's look at how words are correlated together within description fields and make a word network.


word_cors <- tidy_cran %>%
    anti_join(stop_words) %>%
    group_by(word) %>%
    filter(n() > 150) %>% # filter for words used at least 150 times
    ungroup %>%
    pairwise_cor(word, Package, sort = TRUE)

filtered_cors <- word_cors %>%
  filter(correlation > 0.2,
         item1 %in% word_totals$word,
         item2 %in% word_totals$word)

vertices <- word_totals %>%
    filter(word %in% filtered_cors$item1)

filtered_cors %>%
    graph_from_data_frame(vertices = vertices) %>%
    ggraph(layout = "fr") +
    geom_edge_link(aes(edge_alpha = correlation), width = 2) +
    geom_node_point(aes(size = n), color = "cyan4") +
    geom_node_text(aes(label = name), repel = TRUE, point.padding = unit(0.2, "lines"),
                   family = "RobotoCondensed-Regular") +
    theme_graph(base_family = "RobotoCondensed-Regular") +
    theme(plot.title=element_text(family="Roboto-Bold")) +
    scale_size_continuous(range = c(1, 15)) +
    labs(size = "Number of uses",
         edge_alpha = "Correlation",
         title = "Word correlations in R package descriptions",
         subtitle = "Which words are more likely to occur together than with other words?")

The End

If you are interested in this approach to text analysis in R, check out the book Dave and I are publishing with O'Reilly, to be released this summer, available online as well. I found it really interesting to get a glimpse into this ecosystem that is such an important part of my professional and open-source life, both to see the overlap with the areas that I work in and the vast areas that I do not! The R Markdown file used to make this blog post is available here. I am very happy to hear feedback and questions!