Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

taxonomy ranks are too restictive #219

Open
nr0cinu opened this issue Feb 7, 2022 · 26 comments
Open

taxonomy ranks are too restictive #219

nr0cinu opened this issue Feb 7, 2022 · 26 comments
Assignees

Comments

@nr0cinu
Copy link

nr0cinu commented Feb 7, 2022

Hi,

in case one's taxonomy ranks fall outside of what is hard-coded in mia::TAXONOMY_RANKS, it breaks functions like taxonomyRanks.

For example:

> agglomerateByRank(my_tse, "Strain")
 Error: 'rank' must be a value from 'taxonomyRanks()' 

One solution would be to be able to customize TAXONOMY_RANKS. Or include all possible ranks in TAXONOMY_RANKS. Especially eukaryotes have a bunch of more ranks.

Thanks!
Bela

@antagomir
Copy link
Member

One aspect in using TAXONOMY_RANKS is to enforce certain standards.

However, there are situations where some other categories are needed for a similar purpose. We are now working on that in a splitBy issue, which is in preparation #216

@nr0cinu
Copy link
Author

nr0cinu commented Feb 7, 2022

One aspect in using TAXONOMY_RANKS is to enforce certain standards.

The current set of values in TAXONOMY_RANKS is very 16S centric. Some 18S reference databases use different/more ranks. And with the ongoing development of full rRNA seq methods, amplicon data may very well be at Strain level in the near future.

I get that standardization is important, but hard coding is almost never a good idea. There are always situations where one needs to work with modified setups.

@antagomir
Copy link
Member

antagomir commented Feb 7, 2022

This is good point. It would be great to see a comment from @FelixErnst on this but I know that it may take some time before he can respond. Should we allow users to define the rank names as they wish? We already had some discussion on this.

@nr0cinu
Copy link
Author

nr0cinu commented Feb 7, 2022

PS: I personally would leave the default TAXONOMY_RANKS as is, but move it to options(), so I can be (globally) modified by the user as needed. And if it is in options() in can also be used by other packages.

@FelixErnst
Copy link
Contributor

FelixErnst commented Feb 8, 2022

in case one's taxonomy ranks fall outside of what is hard-coded in mia::TAXONOMY_RANKS, it breaks functions like taxonomyRanks.

I would phrase it the other way around like this: taxonomyRanks breaks one's taxonomy ranks. This is the intention. It is the equivalent of "Mind the gap".

From my point of view standardization is a very good idea. Sometimes shortcuts and not following standards leads to a lot of work afterwards. However, shortcuts are sometimes good, but only if they are used like this and not as the new standards.

Since you mentioned 18S rRNA references: Can you specify which ranks are available in those reference databases, which are not part of TAXONOMY_RANKS? Can you provide a reproducable example including the data visualising the issue? What reference other than "it is in that reference database" can you cite?

As far as I know ICN, ICZN, ICNP, ICNCP, ICPN and ICVCN all use this type of taxonomic ranks or a subset of those. Sometimes ranks contain two words, but from a technical point of view this is not a problem.

On personal experience: I cannot remember a single dataset, for which I didn't have to adjust some aspects of the taxonomic data due to errors in reference databases. Is this case different?

@FelixErnst
Copy link
Contributor

PS: I personally would leave the default TAXONOMY_RANKS as is, but move it to options(), so I can be (globally) modified by the user as needed. And if it is in options() in can also be used by other packages.

How would that solve the problem, when you have two datasets with different assumptions in the same session?

@nr0cinu
Copy link
Author

nr0cinu commented Feb 9, 2022

Since you mentioned 18S rRNA references: Can you specify which ranks are available in those reference databases, which are not part of TAXONOMY_RANKS?

Here is a real world example: https://benjjneb.github.io/dada2/training.html: Note: PR2 has different taxLevels than the dada2 default. When assigning taxonomy against PR2, use the following: assignTaxonomy(..., taxLevels = c("Kingdom","Supergroup","Division","Class","Order","Family","Genus","Species"))

@FelixErnst
Copy link
Contributor

FelixErnst commented Feb 9, 2022

Currently, I still don't see that this is a problem to be addressed in the mia package.

  1. Your example includes a function named agglomerate_by_rank, which is not available from mia. (In mia there is a function called agglomerateByRank. Do you mean that function? Did you use it while encountering the problem?)
  2. The link to the dada2 website doesn't contain a reproducable example. It mentions that a reference dataset does not include Phylum information, but includes supergroup and division. Is this just a naming issue? Can those columns by renamed to work with agglomerateByRank? (One would need to know, what values supergroup and division contain)
  3. Without the reproducable example it is hard to judge, what the actual problem is. I can make guesses, but before changes to the code can be thought of, we have get to point of knowing. Can you provide a minimal script and minimal dataset demonstrating the problem?

Thanks

edit: Phylum is sometimes called division and supergroup might just be a different name for domain...

@nr0cinu
Copy link
Author

nr0cinu commented Feb 10, 2022

Your example includes a function named agglomerate_by_rank, which is not available from mia. (In mia there is a function called agglomerateByRank. Do you mean that function? Did you use it while encountering the problem?)

Ups, yeah, that is my custom wrapper function. But it just calls agglomerateByRank which is the cause of the error. I fixed it in the first example.

The link to the dada2 website doesn't contain a reproducable example. It mentions that a reference dataset does not include Phylum information, but includes supergroup and division. Is this just a naming issue? Can those columns by renamed to work with agglomerateByRank? (One would need to know, what values supergroup and division contain)

Sure, I could rename everything. However, the consensus for the DADA2 PR2 reference database is to use "Supergroup" and "Division", so when I deliver results to our collaborators, I would have to rename everything back.

Without the reproducable example it is hard to judge, what the actual problem is. I can make guesses, but before changes to the code can be thought of, we have get to point of knowing. Can you provide a minimal script and minimal dataset demonstrating the problem?

Here is an example based on my first comment using Strain information:

library(mia)
data("GlobalPatterns")
# let's simulate that we have Strain information included in this data
rowData(GlobalPatterns)$Strain <- stringr::str_c(rowData(GlobalPatterns)$Genus, " ", rowData(GlobalPatterns)$Species, " ", stringi::stri_rand_strings(nrow(rowData(GlobalPatterns)), 1))
agglomerateByRank(GlobalPatterns, "Species") # works
agglomerateByRank(GlobalPatterns, "Strain") # does not work

Again, I think it makes more sense to enforce standardization by good default values, but also allow people to customize. Like dada2 does for example:

https://github.com/benjjneb/dada2/blob/2e8360d08912b429533d1495198a4ac02e00ba32/R/taxonomy.R#L66

There are always unforeseen situations where customization is needed.

@antagomir
Copy link
Member

What would be an example of a case where customization option leads to problems, when we assume that users in general know what they are doing in such cases? I tend to think that it is positive to allow users the freedom to do this also for other than some default groupings that are not fundamentally universal in the end. The splitBy(in the making) could solve this but then the added value of splitByRanks may become questionable.

@FelixErnst
Copy link
Contributor

when we assume that users in general know what they are doing in such cases?

Well you spelled it out there. By experience, I have to assume that nobody reads the man pages and therefore they don't know what the are doing.

The splitBy(in the making) could solve this but then the added value of splitByRanks may become questionable.

I think they are very different functions and that they are named so closely sets my teeth on edge. splitByRanks is a fire-and-forget function, which works like it is named. splitBy is a crutch of some sort, because sometimes you want to split by something else. This would actually not be needed, if relevant functions would be able to use a grouping factor, the f in splitBy, on-the-fly.

We are talking about a functionality called splitBySomeOtherRanks, which disable sanity checks required for correct taxonomic labeling and such.

I will prepare a PR, but @antagomir or @TuomasBorman has to follow-up. I probably don't have the time to complete it.

@antagomir
Copy link
Member

If the other split function splitBySomeOtherRanks (with some other name) will be made, then splitBy could be potentially reserved exclusively for the case using the grouping factor f. It is now also providing functionality which is similar to splitBySomeOtherRanks.

@FelixErnst
Copy link
Contributor

It is not. Have a look at the code.

@antagomir
Copy link
Member

Right, it was only in the discussion part so far. It is better to not mix those indeed, and I agree that splitBy is disturbingly similarly named with splitByRanks bc of the differences. The splitBySomeOtherRanks would be useful for us too.

@antagomir
Copy link
Member

The issue #185 had discussion on including the full abundance table, optionally, as one of the outputs in splitByRanks, or a similar function. As well as about implementing something like splitBySomeOtherRanks. The conclusion in that issue seemed to be that splitBy will solve that need. Now it does not, which is good because it has another clearly defined scope. But splitBySomeOtherRanks could benefit from having the option to include the full abundance table in the output list, for the reasons that were discussed in #185

@FelixErnst FelixErnst linked a pull request Feb 10, 2022 that will close this issue
@FelixErnst
Copy link
Contributor

The issue #185 had discussion on including the full abundance table, optionally, as one of the outputs in splitByRanks, or a similar function. As well as about implementing something like splitBySomeOtherRanks.

I don't what you mean by "full abundance table".

The conclusion in that issue seemed to be that splitBy will solve that need.

No, the splitBy implementation confused the issue

Now it does not, which is good because it has another clearly defined scope.

No it has not. See the PR. The scope was as muddy as it could get

But splitBySomeOtherRanks could benefit from having the option to include the full abundance table in the output list, for the reasons that were discussed in #185

which let to the whole confusion.

I commented in #185

If you really really really think, that saving 2 lines of code is worth writting a function, than write a wrapper and call it splitByAddRanks or something., IMO writting 100 lines for man page, examples, the generic and the actual function to save two lines of code is really not worth it.

That exactly what has happened here. An implementation for a problem, which didn't exist and missing the implementation @and3k and @antagomir both have opened the same issue for... aka. splitByAlternativeRank which adds a lot of line codes to wrap this oneliner

lapply(as.list(rowData(x)[,c("Kingdom","Phylum")]), mergeRows, x = x) # replace Kingdom and Phylum with any rank or column data of you choice

@FelixErnst
Copy link
Contributor

@and3k: just to make it clear the solution to your problem is:

mergeRows(x = x, f = rowData(tse)[,"Strain"])

and if you want to agglomerate on more than one rank, it is the example given above:

lapply(as.list(rowData(x)[,c("Kingdom","Phylum")]), mergeRows, x = x)

Works with any data in the rowData, any additional taxonomic information you may encounter in any reference dataset

@antagomir
Copy link
Member

I would be fine with just having this as an extra example in splitByRanks (logical place to find), and/or in the relevant section OMA so that it can be found. Should we do that instead, to avoid creating many functions and go with that until there is more evidence that a new function is needed?

@antagomir antagomir reopened this Feb 10, 2022
@FelixErnst
Copy link
Contributor

Sure go ahead.

@TuomasBorman
Copy link
Contributor

Encountering this issue again. Some rowData has superkingdom, and since it is not included in TAXONOMY_RANKS "superkingdom" taxonomy rank is not parsed. Not sure what is the most optimal solution, but we could have "expanded" taxonomy ranks with unofficial ranks and aliases --> we could test if that works

If we need expanded ranks, we can just overwrite TAXONOMY_RANKS with TAXONOMY_RANKS_EXP

@antagomir
Copy link
Member

It is possible although I have concerns that those who bump into situations where it is needed may have hard time coming up with the fact that such solution exists.

The problem is, I guess, that the TAXONOMY_RANKS are treated differently from other rowData fields in aggregation and related tasks, therefore we cannot simply provide an alternative where any fields could be freely defined to be in the ranks.

How about just expanding TAXONOMY_RANKS but then additionally having a list of "official" ranks either in the function internally, or in the function argument (so that users can modify although it gets complicated too..), and whenever the data contains some of the unofficial TAXONOMY_RANKS, then a warning could be thrown, with a suggestion on how the user could redefine TAXONOMY_RANKS or the list of official ranks?

@TuomasBorman
Copy link
Contributor

@ake123 Can you have a look? Use the data from Slack --> add those taxonomy ranks that are provided in one of rowData's columns --> modify TAXONOMY_RANKS to support these ranks --> check what breaks

@ake123
Copy link
Collaborator

ake123 commented Sep 14, 2023

I made TAXONOMY_RANKS as follows

TAXONOMY_RANKS <- c("taxonomy1","taxonomy2","taxonomy3",
                    "taxonomy4","taxonomy5","taxonomy6","taxonomy7",
                    "taxonomy8","taxonomy9","taxonomy10","taxonomy11",
                    "taxonomy12","taxonomy13")

I tested almost all the functions and they all seem to be working
Here are the errors I came across

addTaxonomyTree(tse)
Error: rownames of 'x' mismatch with node labels of the tree 
 Try 'changeTree' with 'rowNodeLab' provided.
In addition: Warning messages:
1: In toTree(td) : 10 duplicated rows are removed
2: In toTree(td) : The root is added with label 'ALL'

For mergeFeaturesByRank when agglomerateTree=FALSE no error appears. But the following error appears when agglomerateTree=TRUE

 x21 <- mergeFeaturesByRank(tse, rank = "taxonomy11",agglomerateTree = TRUE)
Error: rownames of 'x' mismatch with node labels of the tree 
 Try 'changeTree' with 'rowNodeLab' provided.
In addition: Warning message:
In toTree(td) : The root is added with label 'ALL'

@TuomasBorman
Copy link
Contributor

How about if you drop "rank_lineage" column (it does not belong to these ranks)

Check what is causing this issue. Seems that it's a problem that can be easily fixed. any( !rownames(x) %in% c( tree$node.labels, tree$tip.labels) ) might give a hint --> why some rownames are missing from the tree or why these nodes are named differently than the rownames? --> the reason might be that rowData includes only taxonomy table and not the last rank which is rownames... --> how about if you add rownames as taxonomy14?

@ake123
Copy link
Collaborator

ake123 commented Sep 25, 2023

I actually made a correction in TAXONOMY_RANKS

TAXONOMY_RANKS <- c("taxonomy1","taxonomy2","taxonomy3","taxonomy4","taxonomy5",
                    "taxonomy6","taxonomy7","taxonomy8","taxonomy9","taxonomy10",
                    "taxonomy11","taxonomy12","taxonomy13","taxonomy14","taxonomy15",
                    "taxonomy16","taxonomy17")

I checked the following

> taxonomyTree(tse)

Phylogenetic tree with 26 tips and 218 internal nodes.

Tip labels:
  taxonomy17:_32_1_1_1_1_1, taxonomy17:Saccharomyces cerevisiae, taxonomy17:_36, taxonomy17:_34_3, taxonomy17:_32_1_1_3, taxonomy17:_34_1_1_1_1_3, ...
Node labels:
  root:ALL, taxonomy1:, taxonomy2:_1, taxonomy3:_1, taxonomy4:_6_1, taxonomy5:_5_1, ...

Rooted; includes branch lengths.
Warning messages:
1: In toTree(td) : 17 duplicated rows are removed
2: In toTree(td) : The root is added with label 'ALL'
>any( !rownames(tse) %in% c( tree$node.labels, tree$tip.labels) )
[1] TRUE

@TuomasBorman
Copy link
Contributor

TuomasBorman commented Nov 6, 2023

Hi @ake123

Sorry for late reply. Your solution as such did not work (at least as I quickly tried it), but this seems to work

library("mia")
data("GlobalPatterns")

tse <- GlobalPatterns

colnames(rowData(tse)) <- c("Kingdom", "Phylum", "Class", "Taxa1", "Family", "Genus", "Species")

rowData(tse)[["extra_column"]] <- rep(c("asd", "qwe"))

ranks <- c("kingdom", "phylum", "class", "taxa1", "family", "genus", "species", "extra_column")

assignInNamespace("TAXONOMY_RANKS", ranks, ns = asNamespace("mia"))

splitByRanks(tse)

addTaxonomyTree(tse)

rowData(agglomerateByRank(tse, "Genus"))

rowData(agglomerateByRank(tse, "Taxa1"))

Proposial

  1. Add super kingdom to TAXONOMY_RANKS because it seems to be a common. Also check importing functions like loadFromBiom because it detects ranks based on prefix (such as "k__" --> Kingdom) --> add sk__ --> superkingdom.
  2. Add setTaxonomyRanks and getTaxonomyRanks functions to modify TAXONOMY_RANKS --> user can define taxonomy ranks themselves.
  3. Add examples on this.

Things to consider (my free thoughts): We can have also ranks for columns. Currently, our methods do not support too much this column ranks / colTree, but we have decided to support this more. --> Should we have colRanks variable? --> is this one-rank-system enough (only TAXONOMY_RANKS) --> I think it is

@antagomir

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: No status
Development

Successfully merging a pull request may close this issue.

5 participants