OTU Table Read Abundance vs. Present/Absent Data #21

Open
timz0605 opened this issue Feb 19, 2024 · 6 comments

Comments

@timz0605

Hello,

First of all, thanks for the code and package! This is something I've been thinking about and trying to do myself, and I'm glad to see work has already been done on it.

For the input OTU table, does the code only consider read count data? Many potential biases can be introduced during PCR and the bioinformatics pipeline, so in many metazoan metabarcoding studies people convert read counts to presence/absence data (1 vs. 0) for downstream analyses. I am curious which approach this code takes.

@lentendu
Owner

Hi,

There is no special implementation in the code to handle 1/0 data.
If you use presence/absence data, you will probably want to skip the normalization of read counts by using the option -n no.
The rest is based on Spearman's rank correlation and randomized matrices, so you still need to choose the null model that suits your data.
I have not tested the analysis of 1/0 data. In microbiology we also face sequencing depth bias, but we consider relative abundance to still be valuable information. A log or square-root transformation of the relative abundance (i.e. the options -n ratio_log or -n ratio_sqrt) is then recommended to reduce the weight of hyper-abundant taxa, which sometimes results from PCR amplification bias.
So, you might want to run NetworkNullHPC on a test dataset for which you are confident in the counts, to investigate the potential impact of the 1/0 transformation on the co-occurrence and co-exclusion results.
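
For reference, a minimal R sketch of such a 1/0 transformation, assuming an OTU table otu with samples as rows and OTUs as columns (decostand() is from the vegan package):

    # convert read counts to presence/absence (1/0) with vegan
    library(vegan)
    otu_pa <- decostand(otu, method = "pa")

    # equivalent base R version, without vegan
    otu_pa_base <- (otu > 0) * 1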

@timz0605
Author

Hello @lentendu

Thank you for the quick response!

I am relatively new to Linux systems and to running programs that combine different languages. I was wondering if you could help me with the process? I am trying to run this locally on my computer using WSL, and I have installed R in WSL along with all the required packages.

@lentendu
Owner

lentendu commented Feb 21, 2024

As mentioned in the readme, this tool is only for Linux servers with a SLURM job scheduler.

The individual R scripts are available in the rscripts directory if you want to re-implement the workflow as a single script, but I cannot invest time in that myself.

Alternatives are the original Matlab code of Connor, Barberàn and Clauset (2017), or a different way to produce networks, e.g. using the RMThreshold R package to detect the correct Spearman's rank correlation threshold; see for example Bunick et al. (2021).
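
For illustration only, a rough R sketch of that second route; the threshold value below is a hypothetical placeholder that would in practice be estimated with RMThreshold (random matrix theory) rather than fixed by hand:

    # otu: matrix with samples as rows and OTUs/ASVs as columns
    cor_mat <- cor(otu, method = "spearman")   # pairwise Spearman's rank correlations

    thr <- 0.7                 # placeholder; replace with the RMThreshold estimate
    adj <- cor_mat
    adj[abs(adj) < thr] <- 0   # keep only correlations above the threshold
    diag(adj) <- 0             # drop self-correlations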

@timz0605
Author

timz0605 commented Mar 5, 2024

Hello @lentendu,

I have had some preliminary success running the whole program (after some debugging and editing the script to fit the HPC I use), and I guess the next step will be playing around with the parameters to see how they affect my results.

Meanwhile, I want to double-check that I have the OTU table format right. You mentioned in the readme that rows should be samples and columns should be OTUs, correct? The OTU table output by bioinformatics pipelines (e.g. vsearch) usually has OTUs as rows and samples/locations as columns.

@timz0605
Author

timz0605 commented Mar 5, 2024

Besides, I am also curious how you visualize the network after obtaining the edge list as the final output. In the paper, you plotted the network with each node representing one OTU and an edge between two nodes representing significant co-occurrence. I was wondering if you had any other thoughts or intuitions while exploring the data?

Right now, using all default options, I only obtain approx. 10 pairs of OTUs with significant co-occurrence patterns (not ideal for network visualization). However, the median Spearman's rank correlation values for those pairs are all above 0.9. Is it possible to select/filter/adjust the threshold, e.g. so that all pairs with a correlation value above 0.5 or 0.8 are retained?

@lentendu
Owner

lentendu commented Mar 7, 2024

Hi @timz0605,
here are my replies to your last questions:

  • the OTU table format follows the standard of the R vegan package, that is, sites as rows and OTUs/ASVs/species as columns. You can easily transpose your matrix in R with the function t() if needed (see the sketch after this list).
  • for visualization, you can use the igraph and ggnetwork packages in R, or other software such as Cytoscape or Gephi.
  • the heart of this co-occurrence network computation approach is to learn the appropriate Spearman's rank correlation threshold from your data, that is, the correlation level that does not originate from random co-occurrence. The threshold can vary a lot depending on the size (number of sites and species) of your matrix. With small matrices or when using presence/absence data, the threshold will be relatively high. You should really avoid setting a hard threshold. I do not know your data, but it might just be that only 10 pairs of OTUs have non-random co-occurrences across your samples.
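
To illustrate the first two points, a minimal sketch in R; the file names otutab.txt and edges.txt are placeholders for your actual vsearch output and your edge list, and the edge list is assumed to hold one OTU pair per row in its first two columns:

    library(igraph)

    # vsearch-style table: OTUs as rows, samples as columns
    otu_raw <- read.table("otutab.txt", header = TRUE, row.names = 1, sep = "\t")
    otu <- t(otu_raw)   # transpose to the vegan convention: samples as rows, OTUs as columns

    # build and plot the co-occurrence network from the edge list
    edges <- read.table("edges.txt", header = TRUE, sep = "\t")
    g <- graph_from_data_frame(edges[, 1:2], directed = FALSE)
    plot(g, vertex.size = 5, vertex.label.cex = 0.7)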
