You can see the pkgdown site here.
You can install the current version of ICIKendallTau via GitHub:
remotes::install_github("MoseleyBioinformaticsLab/ICIKendallTau")
You can also install Windows or Mac binaries using our r-universe:
options(repos = c(
moseleybioinformaticslab = 'https://moseleybioinformaticslab.r-universe.dev',
CRAN = "https://cloud.r-project.org"))
install.packages("ICIKendallTau")
- How to handle missing data (i.e.
NA
’s) in calculating a correlation between two variables. - Current calculations of correlation are based on having all pairs of
observations for two variables.
- However, whether an observation is present or missing is semi-quantitative information for many analytical measurements with sensitivity limits.
- i.e. in many cases, missing observations are not “missing-at-random”, but “missing-not-at-random” due to falling below the detection limit.
- In these cases, NA is informative.
- Therefore, in most analytical measurements (gene expression, proteomics, metabolomics), missing measurements should be included, and contribute to the correlation.
If you want to read more on how we solve this problem, see the package vignette.
The functions that implement this include:
ici_kt
: the C++ workhorse, actually calculating a correlation between an X and Y.- The option
perspective
will control how theNA
values influence ties. - When comparing samples, you likely want to use
perspective = "global"
.
- The option
ici_kendallt
: Handles comparisons for a large matrix.- Rows are features, columns are samples.
- Implicitly parallel, but have to call:
library(furrr)
plan(multiprocess)
- Otherwise will only use a single core.
We’ve also included a function for testing if the missingness in your
data comes from left-censorship, test_left_censorship
. We walk through
creating example data and testing it in the vignette Testing for Left
Censorship.
In addition to testing, you can also visualize the missing data pattern
by feature rank using the rank_order_data
function, and use
visdat::vis_miss()
on the original and reordered missing data.
The most common case is a large matrix of independent samples (columns) and measured features in each of the samples (i.e. gene expression).
Here we will make some artificial data to show how the correlation changes as we introduce missingness.
set.seed(1234)
library(ICIKendallTau)
s1 = sort(rnorm(1000, mean = 100, sd = 10))
s2 = s1 + 10
matrix_1 = cbind(s1, s2)
r_1 = ici_kendalltau(matrix_1)
r_1$cor
#> s1 s2
#> s1 1 1
#> s2 1 1
Now we introduce some missing values at the low end of each one. We will
just do the simplest thing and introduce NA
values in the bottom set.
s3 = s1
s3[sample(100, 50)] = NA
s4 = s2
s4[sample(100, 50)] = NA
matrix_2 = cbind(s3, s4)
r_2 = ici_kendalltau(matrix_2)
r_2$cor
#> s3 s4
#> s3 1.0000000 0.9944616
#> s4 0.9944616 1.0000000
The C++ code implementation (thanks {Rcpp}!) is based on the SciPy implementation, which uses two merge sorts of the ranks of each vector, and then looks for differences between them. This is the fastest method we know of, and has a complexity of O(nlogn). The naive way of computing it, which explicitly examines all of the pairs, has a complexity of n^2. Our implementation was compared to the {pcaPP::cov.fk} function, and the use of {Rcpp} and our inefficient copying of vectors makes ours 3X slower than that one. Which honestly isn’t too bad.
library(microbenchmark)
x = rnorm(1000)
y = rnorm(1000)
x2 = rnorm(40000)
y2 = rnorm(40000)
library(microbenchmark)
microbenchmark(
cor(x, y, method = "kendall"),
ici_kt(x, y, "global"),
ici_kt(x2, y2, "global"),
times = 5
)
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> cor(x, y, method = "kendall") 11666.371 11671.671 12405.0886 12084.801 13276.95 13325.649 5
#> ici_kt(x, y, "global") 253.826 255.717 430.9002 277.985 317.24 1049.733 5
#> ici_kt(x2, y2, "global") 13405.302 13731.770 15208.7108 14693.928 15415.66 18796.894 5
In the case of 40,000 features, the average time on a modern CPU is 14 milliseconds.
Of course, if you want to use it to calculate Kendall-tau-b without incorporating missingness, it can do that just fine as well.
k_tau = ici_kt(x, y, "global")
all.equal(k_tau[[1]] ,cor(x, y, method = "kendall"))
#> [1] TRUE
We also provide the kt_fast
function, if you want something that
treats NA
values similarly to stats::cor
.
k_tau_fast = kt_fast(x, y)
k_tau_fast
#> tau pvalue
#> -0.003411411 0.871672260
If you have {future} and the {furrr} packages installed, then it is also possible to split up the a set of matrix comparisons across compute resources for any multiprocessing engine registered with {future}.
library(furrr)
future::plan(multicore, workers = 4)
r_3 = ici_kendalltau(matrix_2)
In the case of hundreds of thousands of comparisons to be done, the
result matrices can become very, very large, and require lots of memory
for storage. They are also inefficient, as both the lower and upper
triangular components are stored. An alternative storage format is as a
data.frame
, where there is a single row for each comparison performed.
This is actually how the results are stored internally, and then they
are converted to a matrix form if requested (the default). To keep the
data.frame
output, add the argument return_matrix=FALSE
to the call
of ici_kendalltau
.
r_4 = ici_kendalltau(matrix_2, return_matrix = FALSE)
r_4
#> $cor
#> s1 s2 core raw pvalue taumax cor
#> 1 s3 s4 1 0.9924359 0 0.997963 0.9944616
#> 2 s3 s3 0 1.0000000 0 1.000000 1.0000000
#> 3 s4 s4 0 1.0000000 0 1.000000 1.0000000
#>
#> $run_time
#> [1] 0.01783729
Please note that the ICIKendallTau project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.