Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve addDivergence #635

Merged
merged 17 commits into from
Sep 22, 2024
Merged

Improve addDivergence #635

merged 17 commits into from
Sep 22, 2024

Conversation

TuomasBorman
Copy link
Contributor

This PR improves and streamlines the addDivergence() function. Basically, now dissimilarity matrix is calculated first for all sample pairs. From that matrix, the values are assigned to correct sample-pairs. This makes the calculation a lot faster since the dissimilarities are calculated at once for all. The computational cost was an issue especially in miaTime's divergence functions. Moreover, the function now uses getDissimilarity() function which makes the maintenance easier.

This new approach also allows users to specify different reference samples. Previously it was only possible to specify a single reference sample for all samples.

To make the usage simpler, user is advised to store reference sample information to colData (if reference is not "mean" or "median"). However, the function still works with single sample name, vector of samples or vector of numeric values denoting a single sample.

The aim of this all is that now we can utilize this directly from divergence functions from miaTime. I have initialized there a branch, however, I think it is already outdated but might give some idea.

In miaTime functions, we only need to specify the reference samples from previous time points. After this is done, we can call mia::getDivergence().

@Daenarys8 Would you be able to first fix the documentation in this mia::addDivergence() (old description style), check that everything works and then incorporate this function to miaTime divergence functions. You could update miaTime functions before this branch is merged to devel (install mia from this branch). This is to ensure that everything works.

@antagomir
Copy link
Member

Note: in many divergence functions it is not necessary to compute the entire dissimilarity matrix if only specific pairwise dissimilarities are needed (e.g. consecutive time points, or time points vs. baseline etc.). In those cases calculating the entire dissimilarity matrix can be too much to calculate. Good to check this one?

@TuomasBorman
Copy link
Contributor Author

TuomasBorman commented Aug 29, 2024

Yes, that is good to check. However, it still might be easier and faster to calculate the whole dissimilarity matrix instead of calculating one by one especially since vegdist is calculated with C.

One option is to calculate dissimilarities in batches, for each unique reference sample. However, in case of time divergence, this causes a bottle neck (there are lots of reference samples)

@antagomir
Copy link
Member

In general it is faster to not calculate the entire dissimilarity matrix but it may depend a bit on feature and sample set sizes. This is relatively straightfwd to quantify experimentally. This problem is particularly remarkable with larger sample sizes as the number of comparisons grows N^2

@TuomasBorman
Copy link
Contributor Author

Yes, well that is true. Can you check if that is causing problems @Daenarys8 ? In miaTime we have a dataset with over 1,000 samples.

If there is a single reference sample for all samples, we could calculate dissimilarity only between samples and this reference sample. If there are multiple reference samples, we could calculate the whole dissimilarity matrix.

Anyway, we do not have to change the function structure. The only difference would be in the lines in .calc_divergence() function.

@antagomir
Copy link
Member

If there are multiple reference samples, I think it would still be faster to apply/sweep/loop over those references, and calculate reference-vs-othersamples for each reference. This will be R x N comparisons, whereas all-vs-all will always be N x N comparisons.

@TuomasBorman
Copy link
Contributor Author

That is something to check. The initial idea for this came because I noticed that this line is not working https://github.com/microbiome/miaTime/blob/2a78096888432796c95fe488af09c0ba0888e993/R/getStepwiseDivergence.R#L230

It loops through all reference samples. As loops are slow in R, it took forever to complete. When I calculated the whole dissimilarity matrix, it took only couple of seconds since vegdist is coded in C.

@Daenarys8 Daenarys8 self-assigned this Aug 30, 2024
Daenarys8 and others added 6 commits August 30, 2024 11:30
Signed-off-by: Daena Rys <rysdaena8@gmail.com>
Signed-off-by: Daena Rys <rysdaena8@gmail.com>
Signed-off-by: Daena Rys <rysdaena8@gmail.com>
@TuomasBorman
Copy link
Contributor Author

You are correct @antagomir. If we calculate all dissimilarities, the time is exponential and if we loop through sample-pairs the time grows linearly.

I think the code in miaTime has some other issues which is why it takes so long (for around 1,000 samples it does not complete in sensible amount of time). I thought that it was something to do with calling and initializing C from R.

image

I updated the code. @Daenarys8 can you still check that the documentation is clearly stating that this function utilizes getDissimilarity()? Also, can you then update miaTime's divergence functions based on this (check the branch above)? They should have only functionality to add reference sample information to colData and then call getDivergence(). The branch that I initialized, have some idea, however, you can remove functionality from there as only the reference sample mapping should be included.

antagomir and others added 6 commits September 9, 2024 20:51
Signed-off-by: Daena Rys <rysdaena8@gmail.com>
Signed-off-by: Daena Rys <rysdaena8@gmail.com>
Signed-off-by: Daena Rys <rysdaena8@gmail.com>
Copy link

codecov bot commented Sep 21, 2024

Codecov Report

Attention: Patch coverage is 95.60440% with 4 lines in your changes missing coverage. Please review.

Please upload report for BASE (devel@04d0e04). Learn more about missing BASE report.

Files with missing lines Patch % Lines
R/addDissimilarity.R 85.71% 3 Missing ⚠️
R/estimateDivergence.R 98.57% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##             devel     #635   +/-   ##
========================================
  Coverage         ?   68.12%           
========================================
  Files            ?       44           
  Lines            ?     5411           
  Branches         ?        0           
========================================
  Hits             ?     3686           
  Misses           ?     1725           
  Partials         ?        0           
Flag Coverage Δ
68.12% <95.60%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@TuomasBorman TuomasBorman merged commit bb33807 into devel Sep 22, 2024
3 checks passed
@TuomasBorman TuomasBorman deleted the update_divergence branch September 22, 2024 08:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants