There already exists a great package for information-theory measures (Cover and Thomas 2001) called "infotheo" (Meyer 2014). tidyinftheo wraps a few of the functions in the "infotheo" package for 'tidy-style' data manipulation in R. Some key differences are that this package:
- just calculates Shannon Entropy, Conditional Shannon Entropy, Mutual Information, and Normalized Mutual Information.
- just calculates the "empirical" versions of these measures, as opposed to estimates.
- prefers "bits" (base-2 logs) vs "nats" (natural logs).
- includes a function for aggregating the pairwise comparison of mutual information across more than two variables, yielding a triangular matrix analogous to a correlation matrix for continuous variables.
- is fairly flexible about the type of the input. Factors, integers, and strings should all work; doubles won't.
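As a point of reference for the "empirical" and "bits" choices above, the empirical Shannon entropy is just the plug-in value computed from observed frequencies with base-2 logs. A minimal base-R sketch (not using the package; the data vector here is made up for illustration):

```r
# Empirical Shannon entropy in bits, computed directly from observed frequencies
x <- c("a", "a", "b", "c")
p <- as.numeric(table(x)) / length(x)  # empirical probabilities: 0.5, 0.25, 0.25
-sum(p * log2(p))                      # 1.5 bits
```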
```r
shannon_entropy(.data, ..., na.rm=FALSE)
shannon_cond_entropy(.data, ..., na.rm=FALSE)
mutual_info(.data, ..., normalized=FALSE, na.rm=FALSE)
mutual_info_matrix(.data, ..., normalized=FALSE, na.rm=FALSE)
mutual_info_heatmap(mi_matrix, title=NULL, font_sizes=c(12,12))
```
You can install the released version the traditional way, from CRAN, with:
```r
install.packages("tidyinftheo")
```
or, if you have the devtools package installed, install tidyinftheo from GitHub with:
```r
devtools::install_github("pohlio/tidyinftheo")
```
then load:
```r
library(tidyinftheo)
```
The GitHub version should still be stable, and may in fact be the same version as on CRAN. Unstable, in-development versions of the package with experimental features will be on other branches of this repository, most notably the dev branch. Using branches other than master is not recommended unless you're interested in contributing new features or fixes to the package.
Calculate (in bits) the Shannon entropy of the eye_color variable in the starwars dataset:
```r
starwars %>% shannon_entropy(eye_color)
#> [1] 3.117176
```
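The other measures follow the same pattern, taking the columns to compare as bare names (per the signatures above). A sketch comparing two starwars columns; outputs are omitted since the exact values depend on the data:

```r
# Conditional entropy of eye_color given hair_color, in bits
starwars %>% shannon_cond_entropy(eye_color, hair_color)

# Normalized mutual information between the same two columns, scaled to [0,1]
starwars %>% mutual_info(eye_color, hair_color, normalized=TRUE)
```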
With the classic mtcars dataset, choose some columns to make a matrix of pairwise mutual-information comparisons. In particular, the cyl, vs, am, gear, and carb columns are all whole numbers, indicating they represent categories. The other columns are continuous and are better suited to correlation comparisons, unless they're discretized. Here are the first few rows of mtcars:
```r
mtcars %>% select(cyl, vs, am, gear, carb) %>% head()
```
|                   | cyl | vs | am | gear | carb |
|-------------------|-----|----|----|------|------|
| Mazda RX4         | 6   | 0  | 1  | 4    | 4    |
| Mazda RX4 Wag     | 6   | 0  | 1  | 4    | 4    |
| Datsun 710        | 4   | 1  | 1  | 4    | 1    |
| Hornet 4 Drive    | 6   | 1  | 0  | 3    | 1    |
| Hornet Sportabout | 8   | 0  | 0  | 3    | 2    |
| Valiant           | 6   | 1  | 0  | 3    | 1    |
And here is our comparison table. There should be 5-choose-2 = 10 different combinations. NMI stands for Normalized Mutual Information: the mutual information, normally given in bits, is scaled to between 0 and 1:
```r
mi_matr <- as_tibble(mtcars) %>%
    mutate_if(is.double, as.character) %>%  # doubles won't work, so convert to character first
    mutual_info_matrix(cyl, vs, am, gear, carb, normalized=TRUE)
mi_matr
```
| V1   | V2   | MI        |
|------|------|-----------|
| cyl  | vs   | 0.4937932 |
| cyl  | am   | 0.1672528 |
| cyl  | gear | 0.3504372 |
| cyl  | carb | 0.3983338 |
| vs   | am   | 0.0208314 |
| vs   | gear | 0.2397666 |
| vs   | carb | 0.2861119 |
| am   | gear | 0.5173527 |
| am   | carb | 0.1149038 |
| gear | carb | 0.1905054 |
The matrix is already in a convenient format to plot:
```r
p <- mutual_info_heatmap(mi_matr)
print(p)
```
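The heatmap function also accepts an optional title and a pair of font sizes, per its signature above. A sketch (the title text is made up, and exactly which elements the two font sizes control is an assumption here; c(12, 12) is the default):

```r
# Add a plot title and shrink the fonts, e.g. for a figure in a paper
p <- mutual_info_heatmap(mi_matr,
                         title="Pairwise NMI of mtcars categorical columns",
                         font_sizes=c(10, 8))
print(p)
```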
NOTE: The above SVG may or may not render 100% correctly. See the issue for more discussion.
Cover, Thomas M., and Joy A. Thomas. 2001. Elements of Information Theory. 2nd ed. 10th Ser. New York, NY: John Wiley & Sons, Inc.
Meyer, Patrick E. 2014. Infotheo: Information-Theoretic Measures. https://CRAN.R-project.org/package=infotheo.