Genus names at Family rank #1

Closed
microsud opened this issue Mar 9, 2021 · 4 comments

@microsud

microsud commented Mar 9, 2021

Hi Mike,
Thanks for creating this useful resource. I downloaded the following files from Zenodo on 8 February 2021, from Michael R. McLaren (2020), "Silva SSU taxonomic training data formatted for DADA2 (Silva version 138)" (Version 2):

  • silva_nr99_v138_train_set.fa.gz
  • silva_species_assignment_v138.fa

I noticed that the genera below were placed at the Family rank, with NA at the Genus rank, when classifying ASVs with dada2::assignTaxonomy followed by dada2::addSpecies. I am not sure whether the original SILVA database places them like this.

| Kingdom | Phylum | Class | Order | Family | Genus | Species |
| --- | --- | --- | --- | --- | --- | --- |
| Bacteria | Firmicutes | Clostridia | Peptostreptococcales-Tissierellales | Anaerococcus | NA | NA |

Others include the following
"Peptoniphilus", "Finegoldia", "Fenollaria", "Parvimonas", "Ezakiella", 
"Fastidiosipila", "Murdochiella","Gallicola","[Eubacterium] coprostanoligenes group", 
"Helcococcus", "Ruminiclostridium", "Tissierella", "Lutispora"

As a result, phyloseq::tax_glom at the Genus rank drops these taxa (NArm=TRUE).
These are the genera I found in my own analysis; the database may contain other discrepancies.

Cheers,
Sudarshan

@mikemc
Owner

mikemc commented Mar 9, 2021

Hi @microsud, thanks for bringing this to my attention. Apparently there were some problems in the taxonomy in Silva 138 which led them to release a new minor version, Silva 138.1, but I was unable to find a description of what the problems were, and it looks like you may have discovered some of them. I actually just created the DADA2-formatted 138.1 database this weekend, which you can find at https://zenodo.org/record/4587955, and it looks like these taxa are correct in Silva 138.1 and this new DADA2-formatted database. So I recommend redoing your assignment with these new Silva 138.1 files and seeing if that fixes your problem.

@mikemc
Owner

mikemc commented Mar 10, 2021

@microsud, can you let me know how this works if you try it? I think the problem will be fixed for your Peptostreptococcales-Tissierellales taxa but not for everything (e.g. I suspect "Lutispora" will still end up as the family instead of the genus).

@microsud
Author

I did not try the updated database.
Instead, I fixed the names as follows, assuming ps.clean is a phyloseq object and that valid Family names end with "ceae".
I was interested in the genus level.

library(phyloseq)
library(dplyr)
library(tibble)
# get all unique family names
fam.names <- get_taxa_unique(ps.clean, "Family")
# find the family names that don't end with 'ceae'
fam.names[which(!grepl("ceae$",fam.names))]

I found the following: Peptoniphilus, Finegoldia, Anaerococcus, Fenollaria, Parvimonas, Ezakiella, Fastidiosipila, Murdochiella, Gallicola, [Eubacterium] coprostanoligenes group, Helcococcus, Ruminiclostridium, Tissierella, Lutispora.
I also inspected the file manually in Excel.
I then copied the genus names found in the Family column to the Genus column.

change.genus <- c("Peptoniphilus", "Finegoldia", "Anaerococcus", "Fenollaria", "Parvimonas", "Ezakiella", "Fastidiosipila", "Murdochiella","Gallicola","[Eubacterium] coprostanoligenes group", "Helcococcus", "Ruminiclostridium", "Tissierella", "Lutispora")

tax_tib <- tax_table(ps.clean) %>% 
  as("matrix") %>% 
  as.data.frame(stringsAsFactors = FALSE) %>% 
  rownames_to_column("ASVID") %>% 
  as_tibble()

tax_tib <- tax_tib %>% 
  mutate(Genus = ifelse(Family %in% change.genus, Family, Genus))
#unique(tax_tib$Genus)

tax_tib <- as.data.frame(tax_tib) %>% column_to_rownames("ASVID") %>% 
  as.matrix()

# replace with corrected genus names
tax_table(ps.clean) <- tax_table(tax_tib)

This code may not be ideal, but I almost always inspect the taxonomy files for discrepancies: automated database generation always misses something, and the taxonomy itself changes continuously.
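
For anyone who wants to try the same idea without a phyloseq object at hand, here is a minimal base-R sketch on a toy taxonomy matrix. The ASV IDs and the Lactobacillus row are invented for illustration, and unlike the code above it relies only on the "ceae" suffix heuristic rather than a curated list, so treat it as a rough starting point and not a complete fix.

```r
# Toy taxonomy matrix; ASV IDs and the second row are made up for illustration.
ranks <- c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus")
tax <- rbind(
  ASV1 = c("Bacteria", "Firmicutes", "Clostridia",
           "Peptostreptococcales-Tissierellales", "Anaerococcus", NA),
  ASV2 = c("Bacteria", "Firmicutes", "Bacilli",
           "Lactobacillales", "Lactobacillaceae", "Lactobacillus")
)
colnames(tax) <- ranks

# Family entries that are present but do not end in "ceae" are
# candidates for genus names sitting at the wrong rank.
suspect <- !is.na(tax[, "Family"]) & !grepl("ceae$", tax[, "Family"])

# Copy the Family entry into Genus only where Genus is still missing,
# so existing genus assignments are never overwritten.
fix <- suspect & is.na(tax[, "Genus"])
tax[fix, "Genus"] <- tax[fix, "Family"]
```

The suffix check may also flag family labels that legitimately lack "ceae", so it is worth looking over the flagged names before applying the change to real data.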

Let me know your thoughts.

@mikemc
Owner

mikemc commented Mar 10, 2021

Hi @microsud, your approach basically makes sense to me, and I think it will mostly fix the genus names. I created a list of all the taxa that seem to have the problem here. These are taxa for which Silva did not assign every rank between domain and genus, which caused the DADA2 formatting function to promote the lower ranks to fill the missing rank.
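
To make the mechanism concrete, here is a toy sketch of how a lineage with a missing rank can end up with its genus in the Family column. This is not the actual formatting code used to build the database; `naive_fill` is a hypothetical helper that simply assigns names to rank slots by position, and the lineage string is reconstructed from the Anaerococcus row reported above.

```r
ranks <- c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus")

# Hypothetical helper: fill the rank slots purely by position,
# padding with NA when the lineage has fewer entries than ranks.
naive_fill <- function(lineage) {
  parts <- strsplit(lineage, ";", fixed = TRUE)[[1]]
  length(parts) <- length(ranks)  # pads with NA (or truncates if longer)
  setNames(parts, ranks)
}

# Lineage with no family-level entry between the order and the genus.
lineage <- "Bacteria;Firmicutes;Clostridia;Peptostreptococcales-Tissierellales;Anaerococcus"
assigned <- naive_fill(lineage)
print(assigned)
```

With positional filling, the genus name lands in the Family slot and the Genus slot stays NA, which matches the pattern in the table at the top of this issue.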
