taxonomy ranks are too restrictive #219
Comments
One aspect in using TAXONOMY_RANKS is to enforce certain standards. However, there are situations where some other categories are needed for a similar purpose. We are now working on that in a |
The current set of values in TAXONOMY_RANKS is very 16S centric. Some 18S reference databases use different/more ranks. And with the ongoing development of full rRNA seq methods, amplicon data may very well be at Strain level in the near future. I get that standardization is important, but hard coding is almost never a good idea. There are always situations where one needs to work with modified setups. |
This is a good point. It would be great to see a comment from @FelixErnst on this, but I know that it may take some time before he can respond. Should we allow users to define the rank names as they wish? We already had some discussion on this. |
PS: I personally would leave the default TAXONOMY_RANKS as is, but move it to options(), so it can be (globally) modified by the user as needed. And if it is in options(), it can also be used by other packages. |
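A minimal base R sketch of the options() idea. The option name `demo.taxonomy_ranks` and the accessor `get_ranks()` are hypothetical and are not part of mia; the default vector is only illustrative.

```r
# Illustrative default ranks (an assumption, not necessarily mia's values).
DEFAULT_RANKS <- c("domain", "kingdom", "phylum", "class",
                   "order", "family", "genus", "species")

# Hypothetical accessor: use the session-wide option if set, else the default.
get_ranks <- function() {
    getOption("demo.taxonomy_ranks", default = DEFAULT_RANKS)
}

ranks_before <- get_ranks()

# A user extends the ranks globally for the current session.
old <- options(demo.taxonomy_ranks = c(DEFAULT_RANKS, "strain"))
ranks_custom <- get_ranks()

# Restore the previous state so the default applies again.
options(old)
ranks_after <- get_ranks()
```

Because the value lives in options(), any package that reads the same option would see the user's change. The trade-off is that it is global to the session.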
I would phrase it the other way around like this: taxonomyRanks breaks one's taxonomy ranks. This is the intention. It is the equivalent of "Mind the gap". From my point of view standardization is a very good idea. Sometimes shortcuts and not following standards lead to a lot of work afterwards. However, shortcuts are sometimes good, but only if they are used as shortcuts and not as the new standards. Since you mentioned 18S rRNA references: Can you specify which ranks are available in those reference databases, which are not part of TAXONOMY_RANKS? Can you provide a reproducible example including the data visualising the issue? What reference other than "it is in that reference database" can you cite? As far as I know ICN, ICZN, ICNP, ICNCP, ICPN and ICVCN all use this type of taxonomic ranks or a subset of those. Sometimes ranks contain two words, but from a technical point of view this is not a problem. From personal experience: I cannot remember a single dataset for which I didn't have to adjust some aspects of the taxonomic data due to errors in reference databases. Is this case different? |
How would that solve the problem, when you have two datasets with different assumptions in the same session? |
Here is a real world example: https://benjjneb.github.io/dada2/training.html: |
Currently, I still don't see that this is a problem to be addressed in the
Thanks edit: Phylum is sometimes called division and supergroup might just be a different name for domain... |
Ups, yeah, that is my custom wrapper function. But it just calls
Sure, I could rename everything. However, the consensus for the DADA2 PR2 reference database is to use "Supergroup" and "Division", so when I deliver results to our collaborators, I would have to rename everything back.
Here is an example based on my first comment using Strain information:
library(mia)
data("GlobalPatterns")
# let's simulate that we have Strain information included in this data
rowData(GlobalPatterns)$Strain <- stringr::str_c(rowData(GlobalPatterns)$Genus, " ", rowData(GlobalPatterns)$Species, " ", stringi::stri_rand_strings(nrow(rowData(GlobalPatterns)), 1))
agglomerateByRank(GlobalPatterns, "Species") # works
agglomerateByRank(GlobalPatterns, "Strain") # does not work
Again, I think it makes more sense to enforce standardization by good default values, but also allow people to customize. Like dada2 does for example: https://github.com/benjjneb/dada2/blob/2e8360d08912b429533d1495198a4ac02e00ba32/R/taxonomy.R#L66 There are always unforeseen situations where customization is needed. |
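The "good defaults, but customizable" pattern can be sketched in base R as a per-call argument. The function `check_rank()` and the vector `DEFAULT_RANKS` are hypothetical names for illustration only, not mia or dada2 API.

```r
# Illustrative default ranks (an assumption for this sketch).
DEFAULT_RANKS <- c("Kingdom", "Phylum", "Class", "Order",
                   "Family", "Genus", "Species")

# Hypothetical validator: 'ranks' defaults to the standard set but can be
# overridden per call, so two datasets in one session can use different sets.
check_rank <- function(rank, ranks = DEFAULT_RANKS) {
    if (!rank %in% ranks) {
        stop("'", rank, "' is not one of: ", paste(ranks, collapse = ", "))
    }
    invisible(rank)
}

# Accepted with the defaults.
check_rank("Species")

# Rejected with the defaults; the error is caught here for demonstration.
strict <- tryCatch(check_rank("Strain"), error = function(e) "rejected")

# Accepted once the caller supplies an extended rank set.
custom <- check_rank("Strain", ranks = c(DEFAULT_RANKS, "Strain"))
```

Since the override is scoped to a single call, it does not affect other datasets in the same session.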
What would be an example of a case where a customization option leads to problems, when we assume that users in general know what they are doing in such cases? I tend to think that it is positive to allow users the freedom to do this also for other than some default groupings that are not fundamentally universal in the end. The |
Well, you spelled it out there. By experience, I have to assume that nobody reads the man pages and therefore they don't know what they are doing.
I think they are very different functions and that they are named so closely sets my teeth on edge. We are talking about a functionality called I will prepare a PR, but @antagomir or @TuomasBorman has to follow-up. I probably don't have the time to complete it. |
If the other split function |
It is not. Have a look at the code. |
Right, it was only in the discussion part so far. It is better to not mix those indeed, and I agree that |
The issue #185 had discussion on including the full abundance table, optionally, as one of the outputs in |
I don't know what you mean by "full abundance table".
No, the
No, it has not. See the PR. The scope was as muddy as it could get, which led to the whole confusion. I commented in #185
That is exactly what has happened here. An implementation for a problem which didn't exist, while missing the implementation @and3k and @antagomir both have opened the same issue for... aka.
lapply(as.list(rowData(x)[,c("Kingdom","Phylum")]), mergeRows, x = x) # replace Kingdom and Phylum with any rank or column data of your choice |
@and3k: just to make it clear, the solution to your problem is:
mergeRows(x = x, f = rowData(x)[,"Strain"])
and if you want to agglomerate on more than one rank, it is the example given above:
lapply(as.list(rowData(x)[,c("Kingdom","Phylum")]), mergeRows, x = x)
Works with any data in the rowData, any additional taxonomic information you may encounter in any reference dataset. |
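The idea of grouping on an arbitrary column can be illustrated without mia. This is a minimal base R sketch on toy data, using rowsum() as a stand-in for the summing step. It is not how mergeRows is implemented, and it ignores rowData, trees and everything else the real function handles.

```r
# Toy abundance matrix: 4 features (rows) x 2 samples (columns).
counts <- matrix(c(1, 2, 3, 4,
                   10, 20, 30, 40),
                 nrow = 4,
                 dimnames = list(paste0("f", 1:4), c("s1", "s2")))

# Toy feature annotations; "Strain" stands in for any non-standard column.
row_data <- data.frame(
    Phylum = c("A", "A", "B", "B"),
    Strain = c("x1", "x1", "x2", "x3"),
    row.names = rownames(counts)
)

# Sum rows that share a value in one grouping column.
by_strain <- rowsum(counts, group = row_data$Strain)

# Do the same for several columns at once, one result per column.
by_column <- lapply(row_data[c("Phylum", "Strain")],
                    function(f) rowsum(counts, group = f))
```

The grouping vector only needs one value per row, so the column does not have to be a recognized rank.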
I would be fine with just having this as an extra example in |
Sure go ahead. |
Encountering this issue again. Some rowData has superkingdom, and since it is not included in TAXONOMY_RANKS, the "superkingdom" taxonomy rank is not parsed. Not sure what the best solution is, but we could have "expanded" taxonomy ranks with unofficial ranks and aliases --> we could test if that works. If we need expanded ranks, we can just overwrite TAXONOMY_RANKS with TAXONOMY_RANKS_EXP |
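One way the alias part could look, as a base R sketch. The table `RANK_ALIASES` and the helper `normalize_rank_names()` are hypothetical, and the alias pairs are assumptions taken from names mentioned in this thread, not an authoritative mapping.

```r
# Hypothetical alias table: names are aliases, values are canonical ranks.
# The pairs follow names mentioned in this thread and are assumptions.
RANK_ALIASES <- c(superkingdom = "domain", division = "phylum")

# Rename alias columns to their canonical rank, leaving others untouched.
normalize_rank_names <- function(col_names, aliases = RANK_ALIASES) {
    key <- tolower(col_names)
    hit <- key %in% names(aliases)
    col_names[hit] <- unname(aliases[key[hit]])
    col_names
}

normalized <- normalize_rank_names(c("Superkingdom", "Division", "Genus"))
```

A real implementation would also need to decide what happens when the canonical name already exists as a column, which this sketch does not handle.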
It is possible, although I have concerns that those who bump into situations where it is needed may have a hard time discovering that such a solution exists. The problem is, I guess, that the TAXONOMY_RANKS are treated differently from other rowData fields in aggregation and related tasks, therefore we cannot simply provide an alternative where any fields could be freely defined to be in the ranks. How about just expanding TAXONOMY_RANKS but then additionally having a list of "official" ranks either in the function internally, or in the function argument (so that users can modify although it gets complicated too..), and whenever the data contains some of the unofficial TAXONOMY_RANKS, then a warning could be thrown, with a suggestion on how the user could redefine TAXONOMY_RANKS or the list of official ranks? |
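The proposal of an expanded rank set plus a warning for unofficial ranks might be sketched like this in base R. `OFFICIAL_RANKS`, `EXPANDED_RANKS` and `detect_ranks()` are hypothetical names with illustrative values; nothing here is existing mia API.

```r
# Hypothetical rank sets for this sketch; the values are illustrative.
OFFICIAL_RANKS <- c("domain", "kingdom", "phylum", "class",
                    "order", "family", "genus", "species")
EXPANDED_RANKS <- c("superkingdom", OFFICIAL_RANKS, "strain")

# Return the columns treated as ranks, warning about unofficial ones.
detect_ranks <- function(col_names, ranks = EXPANDED_RANKS,
                         official = OFFICIAL_RANKS) {
    found <- col_names[tolower(col_names) %in% ranks]
    unofficial <- found[!tolower(found) %in% official]
    if (length(unofficial) > 0) {
        warning("Unofficial ranks in data: ",
                paste(unofficial, collapse = ", "),
                ". Pass 'official' to change which ranks are standard.")
    }
    found
}

# The warning is suppressed here only to keep the demonstration quiet.
found <- suppressWarnings(detect_ranks(c("superkingdom", "Genus", "notes")))
```

The unofficial rank is still parsed, so nothing breaks, while the warning tells the user that the data departs from the standard set.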
@ake123 Can you have a look? Use the data from Slack --> add those taxonomy ranks that are provided in one of rowData's columns --> modify TAXONOMY_RANKS to support these ranks --> check what breaks |
I made TAXONOMY_RANKS as follows
TAXONOMY_RANKS <- c("taxonomy1","taxonomy2","taxonomy3",
"taxonomy4","taxonomy5","taxonomy6","taxonomy7",
"taxonomy8","taxonomy9","taxonomy10","taxonomy11",
"taxonomy12","taxonomy13")
I tested almost all the functions and they all seem to be working
addTaxonomyTree(tse)
Error: rownames of 'x' mismatch with node labels of the tree
Try 'changeTree' with 'rowNodeLab' provided.
In addition: Warning messages:
1: In toTree(td) : 10 duplicated rows are removed
2: In toTree(td) : The root is added with label 'ALL'
For
x21 <- mergeFeaturesByRank(tse, rank = "taxonomy11", agglomerateTree = TRUE)
Error: rownames of 'x' mismatch with node labels of the tree
Try 'changeTree' with 'rowNodeLab' provided.
In addition: Warning message:
In toTree(td) : The root is added with label 'ALL' |
How about if you drop "rank_lineage" column (it does not belong to these ranks) Check what is causing this issue. Seems that it's a problem that can be easily fixed. |
I actually made a correction in
TAXONOMY_RANKS <- c("taxonomy1","taxonomy2","taxonomy3","taxonomy4","taxonomy5",
"taxonomy6","taxonomy7","taxonomy8","taxonomy9","taxonomy10",
"taxonomy11","taxonomy12","taxonomy13","taxonomy14","taxonomy15",
"taxonomy16","taxonomy17")
I checked the following
> taxonomyTree(tse)
Phylogenetic tree with 26 tips and 218 internal nodes.
Tip labels:
taxonomy17:_32_1_1_1_1_1, taxonomy17:Saccharomyces cerevisiae, taxonomy17:_36, taxonomy17:_34_3, taxonomy17:_32_1_1_3, taxonomy17:_34_1_1_1_1_3, ...
Node labels:
root:ALL, taxonomy1:, taxonomy2:_1, taxonomy3:_1, taxonomy4:_6_1, taxonomy5:_5_1, ...
Rooted; includes branch lengths.
Warning messages:
1: In toTree(td) : 17 duplicated rows are removed
2: In toTree(td) : The root is added with label 'ALL'
> any( !rownames(tse) %in% c( tree$node.labels, tree$tip.labels) )
[1] TRUE |
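One thing worth double-checking in a label check like the one above, assuming `tree` is an ape `phylo` object: phylo stores its labels in `tip.label` and `node.label` (singular). The base R sketch below uses a plain list with made-up labels to show how a misspelled field name can make every row look unmatched. It is not a claim about what caused the error reported in this thread.

```r
# Plain list mimicking the label fields of an ape 'phylo' object.
tree <- list(tip.label = c("t1", "t2"), node.label = c("root", "n1"))
row_names <- c("t1", "n1")

# '$' returns NULL for names that do not exist, without raising an error,
# so a misspelled field makes every row look unmatched.
mismatch_typo <- any(!row_names %in% c(tree$node.labels, tree$tip.labels))

# With the singular field names the same rows are found.
mismatch_ok <- any(!row_names %in% c(tree$node.label, tree$tip.label))
```

If the two results differ on real data, the apparent mismatch comes from the check itself and not from the tree.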
Hi @ake123 Sorry for the late reply. Your solution as such did not work (at least as I quickly tried it), but this seems to work
Proposal
Things to consider (my free thoughts): We can also have ranks for columns. Currently, our methods do not offer much support for these column ranks / colTree, but we have decided to support this more. --> Should we have |
Hi,
in case one's taxonomy ranks fall outside of what is hard-coded in mia::TAXONOMY_RANKS, it breaks functions like taxonomyRanks. For example:
One solution would be to be able to customize TAXONOMY_RANKS. Or include all possible ranks in TAXONOMY_RANKS. Eukaryotes in particular have many more ranks.
Thanks!
Bela