-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve addDivergence #635
Conversation
Note: in many divergence functions it is not necessary to compute the entire dissimilarity matrix if only specific pairwise dissimilarities are needed (e.g. consecutive time points, or time points vs. baseline etc.). In those cases calculating the entire dissimilarity matrix can be too much to calculate. Good to check this one? |
Yes, that is good to check. However, it still might be easier and faster to calculate the whole dissimilarity matrix instead of calculating one by one especially since vegdist is calculated with C. One option is to calculate dissimilarities in batches, for each unique reference sample. However, in case of time divergence, this causes a bottle neck (there are lots of reference samples) |
In general it is faster to not calculate the entire dissimilarity matrix but it may depend a bit on feature and sample set sizes. This is relatively straightfwd to quantify experimentally. This problem is particularly remarkable with larger sample sizes as the number of comparisons grows N^2 |
Yes, well that is true. Can you check if that is causing problems @Daenarys8 ? In miaTime we have a dataset with over 1,000 samples. If there is a single reference sample for all samples, we could calculate dissimilarity only between samples and this reference sample. If there are multiple reference samples, we could calculate the whole dissimilarity matrix. Anyway, we do not have to change the function structure. The only difference would be in the lines in .calc_divergence() function. |
If there are multiple reference samples, I think it would still be faster to apply/sweep/loop over those references, and calculate reference-vs-othersamples for each reference. This will be R x N comparisons, whereas all-vs-all will always be N x N comparisons. |
That is something to check. The initial idea for this came because I noticed that this line is not working https://github.com/microbiome/miaTime/blob/2a78096888432796c95fe488af09c0ba0888e993/R/getStepwiseDivergence.R#L230 It loops through all reference samples. As loops are slow in R, it took forever to complete. When I calculated the whole dissimilarity matrix, it took only couple of seconds since vegdist is coded in C. |
Signed-off-by: Daena Rys <rysdaena8@gmail.com>
Signed-off-by: Daena Rys <rysdaena8@gmail.com>
Signed-off-by: Daena Rys <rysdaena8@gmail.com>
You are correct @antagomir. If we calculate all dissimilarities, the time is exponential and if we loop through sample-pairs the time grows linearly. I think the code in miaTime has some other issues which is why it takes so long (for around 1,000 samples it does not complete in sensible amount of time). I thought that it was something to do with calling and initializing C from R. I updated the code. @Daenarys8 can you still check that the documentation is clearly stating that this function utilizes getDissimilarity()? Also, can you then update miaTime's divergence functions based on this (check the branch above)? They should have only functionality to add reference sample information to colData and then call getDivergence(). The branch that I initialized, have some idea, however, you can remove functionality from there as only the reference sample mapping should be included. |
Signed-off-by: Daena Rys <rysdaena8@gmail.com>
Signed-off-by: Daena Rys <rysdaena8@gmail.com>
Signed-off-by: Daena Rys <rysdaena8@gmail.com>
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## devel #635 +/- ##
========================================
Coverage ? 68.12%
========================================
Files ? 44
Lines ? 5411
Branches ? 0
========================================
Hits ? 3686
Misses ? 1725
Partials ? 0
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
This PR improves and streamlines the addDivergence() function. Basically, now dissimilarity matrix is calculated first for all sample pairs. From that matrix, the values are assigned to correct sample-pairs. This makes the calculation a lot faster since the dissimilarities are calculated at once for all. The computational cost was an issue especially in miaTime's divergence functions. Moreover, the function now uses getDissimilarity() function which makes the maintenance easier.
This new approach also allows users to specify different reference samples. Previously it was only possible to specify a single reference sample for all samples.
To make the usage simpler, user is advised to store reference sample information to colData (if reference is not "mean" or "median"). However, the function still works with single sample name, vector of samples or vector of numeric values denoting a single sample.
The aim of this all is that now we can utilize this directly from divergence functions from miaTime. I have initialized there a branch, however, I think it is already outdated but might give some idea.
In miaTime functions, we only need to specify the reference samples from previous time points. After this is done, we can call mia::getDivergence().
@Daenarys8 Would you be able to first fix the documentation in this mia::addDivergence() (old description style), check that everything works and then incorporate this function to miaTime divergence functions. You could update miaTime functions before this branch is merged to devel (install mia from this branch). This is to ensure that everything works.