taxonomy ranks are too restictive #219

nr0cinu · 2022-02-07T08:58:29Z

Hi,

in case one's taxonomy ranks fall outside of what is hard-coded in mia::TAXONOMY_RANKS, it breaks functions like taxonomyRanks.

For example:

> agglomerateByRank(my_tse, "Strain")
 Error: 'rank' must be a value from 'taxonomyRanks()'

One solution would be to be able to customize TAXONOMY_RANKS. Or include all possible ranks in TAXONOMY_RANKS. Especially eukaryotes have a bunch of more ranks.

Thanks!
Bela

The text was updated successfully, but these errors were encountered:

antagomir · 2022-02-07T09:37:57Z

One aspect in using TAXONOMY_RANKS is to enforce certain standards.

However, there are situations where some other categories are needed for a similar purpose. We are now working on that in a splitBy issue, which is in preparation #216

nr0cinu · 2022-02-07T10:30:39Z

One aspect in using TAXONOMY_RANKS is to enforce certain standards.

The current set of values in TAXONOMY_RANKS is very 16S centric. Some 18S reference databases use different/more ranks. And with the ongoing development of full rRNA seq methods, amplicon data may very well be at Strain level in the near future.

I get that standardization is important, but hard coding is almost never a good idea. There are always situations where one needs to work with modified setups.

antagomir · 2022-02-07T10:33:36Z

This is good point. It would be great to see a comment from @FelixErnst on this but I know that it may take some time before he can respond. Should we allow users to define the rank names as they wish? We already had some discussion on this.

nr0cinu · 2022-02-07T10:38:40Z

PS: I personally would leave the default TAXONOMY_RANKS as is, but move it to options(), so I can be (globally) modified by the user as needed. And if it is in options() in can also be used by other packages.

FelixErnst · 2022-02-08T17:40:57Z

in case one's taxonomy ranks fall outside of what is hard-coded in mia::TAXONOMY_RANKS, it breaks functions like taxonomyRanks.

I would phrase it the other way around like this: taxonomyRanks breaks one's taxonomy ranks. This is the intention. It is the equivalent of "Mind the gap".

From my point of view standardization is a very good idea. Sometimes shortcuts and not following standards leads to a lot of work afterwards. However, shortcuts are sometimes good, but only if they are used like this and not as the new standards.

Since you mentioned 18S rRNA references: Can you specify which ranks are available in those reference databases, which are not part of TAXONOMY_RANKS? Can you provide a reproducable example including the data visualising the issue? What reference other than "it is in that reference database" can you cite?

As far as I know ICN, ICZN, ICNP, ICNCP, ICPN and ICVCN all use this type of taxonomic ranks or a subset of those. Sometimes ranks contain two words, but from a technical point of view this is not a problem.

On personal experience: I cannot remember a single dataset, for which I didn't have to adjust some aspects of the taxonomic data due to errors in reference databases. Is this case different?

FelixErnst · 2022-02-08T17:43:01Z

PS: I personally would leave the default TAXONOMY_RANKS as is, but move it to options(), so I can be (globally) modified by the user as needed. And if it is in options() in can also be used by other packages.

How would that solve the problem, when you have two datasets with different assumptions in the same session?

nr0cinu · 2022-02-09T07:36:02Z

Since you mentioned 18S rRNA references: Can you specify which ranks are available in those reference databases, which are not part of TAXONOMY_RANKS?

Here is a real world example: https://benjjneb.github.io/dada2/training.html: Note: PR2 has different taxLevels than the dada2 default. When assigning taxonomy against PR2, use the following: assignTaxonomy(..., taxLevels = c("Kingdom","Supergroup","Division","Class","Order","Family","Genus","Species"))

FelixErnst · 2022-02-09T17:12:54Z

Currently, I still don't see that this is a problem to be addressed in the mia package.

Your example includes a function named agglomerate_by_rank, which is not available from mia. (In mia there is a function called agglomerateByRank. Do you mean that function? Did you use it while encountering the problem?)
The link to the dada2 website doesn't contain a reproducable example. It mentions that a reference dataset does not include Phylum information, but includes supergroup and division. Is this just a naming issue? Can those columns by renamed to work with agglomerateByRank? (One would need to know, what values supergroup and division contain)
Without the reproducable example it is hard to judge, what the actual problem is. I can make guesses, but before changes to the code can be thought of, we have get to point of knowing. Can you provide a minimal script and minimal dataset demonstrating the problem?

Thanks

edit: Phylum is sometimes called division and supergroup might just be a different name for domain...

nr0cinu · 2022-02-10T11:35:14Z

Your example includes a function named agglomerate_by_rank, which is not available from mia. (In mia there is a function called agglomerateByRank. Do you mean that function? Did you use it while encountering the problem?)

Ups, yeah, that is my custom wrapper function. But it just calls agglomerateByRank which is the cause of the error. I fixed it in the first example.

The link to the dada2 website doesn't contain a reproducable example. It mentions that a reference dataset does not include Phylum information, but includes supergroup and division. Is this just a naming issue? Can those columns by renamed to work with agglomerateByRank? (One would need to know, what values supergroup and division contain)

Sure, I could rename everything. However, the consensus for the DADA2 PR2 reference database is to use "Supergroup" and "Division", so when I deliver results to our collaborators, I would have to rename everything back.

Without the reproducable example it is hard to judge, what the actual problem is. I can make guesses, but before changes to the code can be thought of, we have get to point of knowing. Can you provide a minimal script and minimal dataset demonstrating the problem?

Here is an example based on my first comment using Strain information:

library(mia)
data("GlobalPatterns")
# let's simulate that we have Strain information included in this data
rowData(GlobalPatterns)$Strain <- stringr::str_c(rowData(GlobalPatterns)$Genus, " ", rowData(GlobalPatterns)$Species, " ", stringi::stri_rand_strings(nrow(rowData(GlobalPatterns)), 1))
agglomerateByRank(GlobalPatterns, "Species") # works
agglomerateByRank(GlobalPatterns, "Strain") # does not work

Again, I think it makes more sense to enforce standardization by good default values, but also allow people to customize. Like dada2 does for example:

https://github.com/benjjneb/dada2/blob/2e8360d08912b429533d1495198a4ac02e00ba32/R/taxonomy.R#L66

There are always unforeseen situations where customization is needed.

antagomir · 2022-02-10T15:41:58Z

What would be an example of a case where customization option leads to problems, when we assume that users in general know what they are doing in such cases? I tend to think that it is positive to allow users the freedom to do this also for other than some default groupings that are not fundamentally universal in the end. The splitBy(in the making) could solve this but then the added value of splitByRanks may become questionable.

FelixErnst · 2022-02-10T20:11:11Z

when we assume that users in general know what they are doing in such cases?

Well you spelled it out there. By experience, I have to assume that nobody reads the man pages and therefore they don't know what the are doing.

The splitBy(in the making) could solve this but then the added value of splitByRanks may become questionable.

I think they are very different functions and that they are named so closely sets my teeth on edge. splitByRanks is a fire-and-forget function, which works like it is named. splitBy is a crutch of some sort, because sometimes you want to split by something else. This would actually not be needed, if relevant functions would be able to use a grouping factor, the f in splitBy, on-the-fly.

We are talking about a functionality called splitBySomeOtherRanks, which disable sanity checks required for correct taxonomic labeling and such.

I will prepare a PR, but @antagomir or @TuomasBorman has to follow-up. I probably don't have the time to complete it.

antagomir · 2022-02-10T20:19:38Z

If the other split function splitBySomeOtherRanks (with some other name) will be made, then splitBy could be potentially reserved exclusively for the case using the grouping factor f. It is now also providing functionality which is similar to splitBySomeOtherRanks.

FelixErnst · 2022-02-10T20:20:49Z

It is not. Have a look at the code.

antagomir · 2022-02-10T20:37:09Z

Right, it was only in the discussion part so far. It is better to not mix those indeed, and I agree that splitBy is disturbingly similarly named with splitByRanks bc of the differences. The splitBySomeOtherRanks would be useful for us too.

antagomir · 2022-02-10T20:44:50Z

The issue #185 had discussion on including the full abundance table, optionally, as one of the outputs in splitByRanks, or a similar function. As well as about implementing something like splitBySomeOtherRanks. The conclusion in that issue seemed to be that splitBy will solve that need. Now it does not, which is good because it has another clearly defined scope. But splitBySomeOtherRanks could benefit from having the option to include the full abundance table in the output list, for the reasons that were discussed in #185

FelixErnst · 2022-02-10T21:26:16Z

The issue #185 had discussion on including the full abundance table, optionally, as one of the outputs in splitByRanks, or a similar function. As well as about implementing something like splitBySomeOtherRanks.

I don't what you mean by "full abundance table".

The conclusion in that issue seemed to be that splitBy will solve that need.

No, the splitBy implementation confused the issue

Now it does not, which is good because it has another clearly defined scope.

No it has not. See the PR. The scope was as muddy as it could get

But splitBySomeOtherRanks could benefit from having the option to include the full abundance table in the output list, for the reasons that were discussed in #185

which let to the whole confusion.

I commented in #185

If you really really really think, that saving 2 lines of code is worth writting a function, than write a wrapper and call it splitByAddRanks or something., IMO writting 100 lines for man page, examples, the generic and the actual function to save two lines of code is really not worth it.

That exactly what has happened here. An implementation for a problem, which didn't exist and missing the implementation @and3k and @antagomir both have opened the same issue for... aka. splitByAlternativeRank which adds a lot of line codes to wrap this oneliner

lapply(as.list(rowData(x)[,c("Kingdom","Phylum")]), mergeRows, x = x) # replace Kingdom and Phylum with any rank or column data of you choice

FelixErnst · 2022-02-10T21:37:32Z

@and3k: just to make it clear the solution to your problem is:

mergeRows(x = x, f = rowData(tse)[,"Strain"])

and if you want to agglomerate on more than one rank, it is the example given above:

lapply(as.list(rowData(x)[,c("Kingdom","Phylum")]), mergeRows, x = x)

Works with any data in the rowData, any additional taxonomic information you may encounter in any reference dataset

antagomir · 2022-02-10T22:07:02Z

I would be fine with just having this as an extra example in splitByRanks (logical place to find), and/or in the relevant section OMA so that it can be found. Should we do that instead, to avoid creating many functions and go with that until there is more evidence that a new function is needed?

FelixErnst · 2022-02-10T22:14:19Z

Sure go ahead.

TuomasBorman · 2023-08-31T09:54:11Z

Encountering this issue again. Some rowData has superkingdom, and since it is not included in TAXONOMY_RANKS "superkingdom" taxonomy rank is not parsed. Not sure what is the most optimal solution, but we could have "expanded" taxonomy ranks with unofficial ranks and aliases --> we could test if that works

If we need expanded ranks, we can just overwrite TAXONOMY_RANKS with TAXONOMY_RANKS_EXP

antagomir · 2023-09-03T07:33:57Z

It is possible although I have concerns that those who bump into situations where it is needed may have hard time coming up with the fact that such solution exists.

The problem is, I guess, that the TAXONOMY_RANKS are treated differently from other rowData fields in aggregation and related tasks, therefore we cannot simply provide an alternative where any fields could be freely defined to be in the ranks.

How about just expanding TAXONOMY_RANKS but then additionally having a list of "official" ranks either in the function internally, or in the function argument (so that users can modify although it gets complicated too..), and whenever the data contains some of the unofficial TAXONOMY_RANKS, then a warning could be thrown, with a suggestion on how the user could redefine TAXONOMY_RANKS or the list of official ranks?

TuomasBorman · 2023-09-14T08:29:06Z

@ake123 Can you have a look? Use the data from Slack --> add those taxonomy ranks that are provided in one of rowData's columns --> modify TAXONOMY_RANKS to support these ranks --> check what breaks

ake123 · 2023-09-14T11:17:52Z

I made TAXONOMY_RANKS as follows

TAXONOMY_RANKS <- c("taxonomy1","taxonomy2","taxonomy3",
                    "taxonomy4","taxonomy5","taxonomy6","taxonomy7",
                    "taxonomy8","taxonomy9","taxonomy10","taxonomy11",
                    "taxonomy12","taxonomy13")

I tested almost all the functions and they all seem to be working
Here are the errors I came across

addTaxonomyTree(tse)
Error: rownames of 'x' mismatch with node labels of the tree 
 Try 'changeTree' with 'rowNodeLab' provided.
In addition: Warning messages:
1: In toTree(td) : 10 duplicated rows are removed
2: In toTree(td) : The root is added with label 'ALL'

For mergeFeaturesByRank when agglomerateTree=FALSE no error appears. But the following error appears when agglomerateTree=TRUE

 x21 <- mergeFeaturesByRank(tse, rank = "taxonomy11",agglomerateTree = TRUE)
Error: rownames of 'x' mismatch with node labels of the tree 
 Try 'changeTree' with 'rowNodeLab' provided.
In addition: Warning message:
In toTree(td) : The root is added with label 'ALL'

TuomasBorman · 2023-09-14T13:06:53Z

How about if you drop "rank_lineage" column (it does not belong to these ranks)

Check what is causing this issue. Seems that it's a problem that can be easily fixed. any( !rownames(x) %in% c( tree$node.labels, tree$tip.labels) ) might give a hint --> why some rownames are missing from the tree or why these nodes are named differently than the rownames? --> the reason might be that rowData includes only taxonomy table and not the last rank which is rownames... --> how about if you add rownames as taxonomy14?

ake123 · 2023-09-25T10:43:02Z

I actually made a correction in TAXONOMY_RANKS

TAXONOMY_RANKS <- c("taxonomy1","taxonomy2","taxonomy3","taxonomy4","taxonomy5",
                    "taxonomy6","taxonomy7","taxonomy8","taxonomy9","taxonomy10",
                    "taxonomy11","taxonomy12","taxonomy13","taxonomy14","taxonomy15",
                    "taxonomy16","taxonomy17")

I checked the following

> taxonomyTree(tse)

Phylogenetic tree with 26 tips and 218 internal nodes.

Tip labels:
  taxonomy17:_32_1_1_1_1_1, taxonomy17:Saccharomyces cerevisiae, taxonomy17:_36, taxonomy17:_34_3, taxonomy17:_32_1_1_3, taxonomy17:_34_1_1_1_1_3, ...
Node labels:
  root:ALL, taxonomy1:, taxonomy2:_1, taxonomy3:_1, taxonomy4:_6_1, taxonomy5:_5_1, ...

Rooted; includes branch lengths.
Warning messages:
1: In toTree(td) : 17 duplicated rows are removed
2: In toTree(td) : The root is added with label 'ALL'

>any( !rownames(tse) %in% c( tree$node.labels, tree$tip.labels) )
[1] TRUE

TuomasBorman · 2023-11-06T07:36:01Z

Hi @ake123

Sorry for late reply. Your solution as such did not work (at least as I quickly tried it), but this seems to work

library("mia")
data("GlobalPatterns")

tse <- GlobalPatterns

colnames(rowData(tse)) <- c("Kingdom", "Phylum", "Class", "Taxa1", "Family", "Genus", "Species")

rowData(tse)[["extra_column"]] <- rep(c("asd", "qwe"))

ranks <- c("kingdom", "phylum", "class", "taxa1", "family", "genus", "species", "extra_column")

assignInNamespace("TAXONOMY_RANKS", ranks, ns = asNamespace("mia"))

splitByRanks(tse)

addTaxonomyTree(tse)

rowData(agglomerateByRank(tse, "Genus"))

rowData(agglomerateByRank(tse, "Taxa1"))

Proposial

Add super kingdom to TAXONOMY_RANKS because it seems to be a common. Also check importing functions like loadFromBiom because it detects ranks based on prefix (such as "k__" --> Kingdom) --> add sk__ --> superkingdom.
Add setTaxonomyRanks and getTaxonomyRanks functions to modify TAXONOMY_RANKS --> user can define taxonomy ranks themselves.
Add examples on this.

Things to consider (my free thoughts): We can have also ranks for columns. Currently, our methods do not support too much this column ranks / colTree, but we have decided to support this more. --> Should we have colRanks variable? --> is this one-rank-system enough (only TAXONOMY_RANKS) --> I think it is

@antagomir

FelixErnst linked a pull request Feb 10, 2022 that will close this issue

added splitByAlternativeRanks #221

Closed

FelixErnst closed this as completed Feb 10, 2022

antagomir reopened this Feb 10, 2022

antagomir mentioned this issue Oct 6, 2023

plotAbundance problem microbiome/miaViz#100

Open

antagomir assigned TuomasBorman Jul 18, 2023

TuomasBorman assigned ake123 Nov 27, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

taxonomy ranks are too restictive #219

taxonomy ranks are too restictive #219

nr0cinu commented Feb 7, 2022 •

edited

antagomir commented Feb 7, 2022

nr0cinu commented Feb 7, 2022

antagomir commented Feb 7, 2022 •

edited

nr0cinu commented Feb 7, 2022

FelixErnst commented Feb 8, 2022 •

edited

FelixErnst commented Feb 8, 2022

nr0cinu commented Feb 9, 2022

FelixErnst commented Feb 9, 2022 •

edited

nr0cinu commented Feb 10, 2022

antagomir commented Feb 10, 2022

FelixErnst commented Feb 10, 2022

antagomir commented Feb 10, 2022

FelixErnst commented Feb 10, 2022

antagomir commented Feb 10, 2022

antagomir commented Feb 10, 2022

FelixErnst commented Feb 10, 2022

FelixErnst commented Feb 10, 2022

antagomir commented Feb 10, 2022

FelixErnst commented Feb 10, 2022

TuomasBorman commented Aug 31, 2023

antagomir commented Sep 3, 2023

TuomasBorman commented Sep 14, 2023

ake123 commented Sep 14, 2023 •

edited

TuomasBorman commented Sep 14, 2023

ake123 commented Sep 25, 2023 •

edited

TuomasBorman commented Nov 6, 2023 •

edited

taxonomy ranks are too restictive #219

taxonomy ranks are too restictive #219

Comments

nr0cinu commented Feb 7, 2022 • edited

antagomir commented Feb 7, 2022

nr0cinu commented Feb 7, 2022

antagomir commented Feb 7, 2022 • edited

nr0cinu commented Feb 7, 2022

FelixErnst commented Feb 8, 2022 • edited

FelixErnst commented Feb 8, 2022

nr0cinu commented Feb 9, 2022

FelixErnst commented Feb 9, 2022 • edited

nr0cinu commented Feb 10, 2022

antagomir commented Feb 10, 2022

FelixErnst commented Feb 10, 2022

antagomir commented Feb 10, 2022

FelixErnst commented Feb 10, 2022

antagomir commented Feb 10, 2022

antagomir commented Feb 10, 2022

FelixErnst commented Feb 10, 2022

FelixErnst commented Feb 10, 2022

antagomir commented Feb 10, 2022

FelixErnst commented Feb 10, 2022

TuomasBorman commented Aug 31, 2023

antagomir commented Sep 3, 2023

TuomasBorman commented Sep 14, 2023

ake123 commented Sep 14, 2023 • edited

TuomasBorman commented Sep 14, 2023

ake123 commented Sep 25, 2023 • edited

TuomasBorman commented Nov 6, 2023 • edited

nr0cinu commented Feb 7, 2022 •

edited

antagomir commented Feb 7, 2022 •

edited

FelixErnst commented Feb 8, 2022 •

edited

FelixErnst commented Feb 9, 2022 •

edited

ake123 commented Sep 14, 2023 •

edited

ake123 commented Sep 25, 2023 •

edited

TuomasBorman commented Nov 6, 2023 •

edited