R package for computation and visualization of summary trees
summarytrees is an R package to help you summarize and visualize a potentially large data set that can be represented by a rooted, node-weighted tree.
The input data looks something like the table below, containing the node ID, the ID of the node's parent, the node's (non-negative) weight, and its label. The example table pictured below contains a snapshot of the DMOZ topic hierarchy circa 2015, with over 635,000 nodes (unique topics) in the tree, where each node weight has been set to the number of URLs belonging to the given topic in the hierarchy. There are roughly 3.7 million total URLs (this is the sum of the weights). (The level of each node is also shown in the table, but is not required as input to compute summary trees.) The question that
summarytrees helps to answer is "What is the distribution of weights in this large tree?"
summarytrees package implements a dynamic programming algorithm that aggregates the nodes of the input tree in an optimal way, maximizing the entropy of the distribution of the node weights of the aggregated tree among all possible aggregations of a given size, subject to certain constraints. The resulting set of summary trees can be visualized using d3.js to allow for exploratory data analysis of the strucure and weight distribution of the input tree. Below is a snapshot of the 18-node maximum entropy summary tree of the DMOZ data. Click on the picture to see the interactive version of the visualization.
To see some demos, visit one of the links in the first column of the table below. The running times reported for each data set (in minutes and seconds) are from a single run (not averaged over multiple runs) computing maximum entropy summary trees for k = 1, 2, .., K = 100 on a single machine with a 2.20 GHz CPU and over 100 GB of RAM. The rightmost three columns report running times for the 3 different versions of the algorithm that are available. See Section 7 of our paper for more details.
|Data||# Nodes||Total Weight||Max Depth||Optimal||Approx (epsilon = 0.1)||Approx (epsilon = 0.5)||Greedy|
|Tree of Life||94080||54121||123||11:53||20:44||2:34||0:09|
summarytrees package is still under development, and is currently available here on Github, but not yet on CRAN. To install it, type
library(devtools) install_github("kshirley/summarytrees", build_vignettes = TRUE)
It has been developed on a Mac OS X 10.9.5, using R version 3.1.2, Chrome Version 44.0.2403.130 (64-bit), and Firefox 39.0. More testing is planned for the near future to make sure it all works in multiple environments.
There are two vignettes included with the package.
1. The "Gauss" vignette contains an analysis of the Carl Gauss subtree of the Math Genealogy Project (MGP), where the hierarchy of the tree is defined by the advisor-student relationships among mathematicians (starting with Carl Gauss as an advisor) and the node weight for each mathematician is set to 1 by default. To load the vignette, type:
vignette("Gauss", package = "summarytrees")
Note: This sample of data has been shared with the permission and cooperation of the Math Genealogy Project; please do not re-distribute it. See
help("Gauss", package = "summarytrees") for more.
2. The "DMOZ" vignette contains an analysis of a subset of URLs from the DMOZ (aka Open Directory Project) directory of websites. This data contains all the URLs which are classified into the "Top/Sports" branch of the tree, which consists of about 76,000 URLs (about 2% of the total data set) spread out over about 14,000 unique topics. The node weights are set to the number of URLs belonging to each topic in the hierarchy.
vignette("DMOZ", package = "summarytrees")
The original paper describing the algorithm and resulting visualizations is:
(2013). Howard Karloff and Kenneth E. Shirley. "Maximum Entropy Summary Trees", Computer Graphics Forum (Proc. EuroVis), Volume 32, Issue 3, Part 1, pp. 71-80.
A copy of the paper is available here.
A website describing the work is here.