Why aren't OTUs with the same consensus taxonomy combined? #209

Closed
mothur-westcott opened this issue Feb 8, 2016 · 8 comments
Comments

@mothur-westcott
Contributor

Very good question, and one we get a lot :). There is a difference between taxonomic OTU assignment and distance-based OTU assignment. The cluster commands group sequences into OTUs based on the distances between sequences. The phylotype command groups sequences into OTUs based on their classifications. Two OTUs may have identical consensus taxonomies, but a matching taxonomy does not carry the same level of accuracy as distance-based clustering.
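To make the distinction concrete, here is a minimal Python sketch on toy data. Everything in it is an assumption for illustration: the reads, the genus labels, the 0.1 cutoff, and the naive single-linkage grouping, which is not how mothur's cluster commands are implemented. It only shows that grouping by distance and grouping by classification can give different numbers of OTUs for the same reads.

```python
from itertools import combinations

# Hypothetical, already-aligned toy reads with made-up genus labels.
reads = {
    "seq1": ("ACGTACGTAC", "Acinetobacter"),
    "seq2": ("ACGTACGTAC", "Acinetobacter"),
    "seq3": ("ACGTTTGTAA", "Acinetobacter"),
    "seq4": ("TTGTACGAAC", "Pseudomonas"),
}


def distance(a, b):
    """Fraction of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def cluster_by_distance(reads, cutoff):
    """Naive single-linkage grouping (illustration only, not mothur's algorithm)."""
    otu_of = {name: i for i, name in enumerate(reads)}
    for a, b in combinations(reads, 2):
        if distance(reads[a][0], reads[b][0]) <= cutoff:
            old, new = otu_of[b], otu_of[a]
            # Relabel every read of the absorbed group.
            for name in otu_of:
                if otu_of[name] == old:
                    otu_of[name] = new
    groups = {}
    for name, otu in otu_of.items():
        groups.setdefault(otu, []).append(name)
    return sorted(groups.values())


def group_by_taxonomy(reads):
    """Phylotype-style grouping: one bin per classification label."""
    groups = {}
    for name, (_, taxon) in reads.items():
        groups.setdefault(taxon, []).append(name)
    return groups


distance_otus = cluster_by_distance(reads, cutoff=0.1)
phylotypes = group_by_taxonomy(reads)
print(len(distance_otus), "distance-based OTUs;", len(phylotypes), "phylotypes")
```

In this toy case seq3 differs too much from seq1 and seq2 to join their distance-based OTU, yet all three carry the same genus label, so two distance-based OTUs end up with the same taxonomy.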

@Vieira34

Vieira34 commented Apr 6, 2017

Hi Sarah,

I searched for a similar question here and did not find one. I don't know if this is the right place to post it, but... I'm using a dataset with 4 samples (small and easy to check). After classifying the OTUs and opening the cons.taxonomy file, I see a total of 3019 OTUs. OK, cool. I ran the get.rep command to see how many sequences are in each OTU. It looks like many OTUs have the same consensus taxonomy: of the 3019 OTUs, only 511 are unique. At the end of the day, can I treat each of these 3019 OTUs as a different genus, for example, or should I count only the unique ones (511)?

Thank you for your attention and time!

best regards,
Fabricio.

@mothur-westcott
Contributor Author

Often there are duplicate consensus taxonomies for different OTUs. Several things contribute to this. One reason, as you said, is that the consensus taxonomy does not always extend as far down as we would like. For a taxon to be included, it must be the classification of at least 51% of the sequences at that level in the given OTU. You should include all the OTUs.
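The 51% rule can be sketched in a few lines of Python. This is a simplified illustration, not mothur's classify.otu code: the function name, the four-rank lineages, and the way a lineage simply stops at the first rank that misses the cutoff are all assumptions made for the example.

```python
from collections import Counter


def consensus_taxonomy(classifications, cutoff=51):
    """Keep the majority taxon at each rank while it covers >= cutoff percent
    of all sequences in the OTU (simplified illustration)."""
    members = list(classifications)
    total = len(members)
    depth = max(len(lineage) for lineage in members)
    consensus = []
    for level in range(depth):
        names = [lineage[level] for lineage in members if len(lineage) > level]
        if not names:
            break
        taxon, count = Counter(names).most_common(1)[0]
        percent = 100 * count / total
        if percent < cutoff:
            break  # no taxon reaches the cutoff, so the consensus stops here
        consensus.append(f"{taxon}({round(percent)})")
        # Only sequences that agree so far vote at the next rank.
        members = [l for l in members if len(l) > level and l[level] == taxon]
    return ";".join(consensus) + ";"


# Hypothetical OTUs of 10 sequences each; ranks are shortened for readability.
family = ["Bacteria", "Proteobacteria", "Moraxellaceae"]
clear_otu = [family + ["Acinetobacter"]] * 9 + [family + ["Psychrobacter"]]
mixed_otu = (
    [family + ["Acinetobacter"]] * 4
    + [family + ["Psychrobacter"]] * 3
    + [family + ["Moraxella"]] * 3
)

print(consensus_taxonomy(clear_otu))
print(consensus_taxonomy(mixed_otu))
```

In the mixed OTU no genus reaches 51%, so the consensus stops at the family. Several OTUs whose consensus stops at the same rank will therefore report the same taxonomy even though they are distinct clusters.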

@Vieira34

Vieira34 commented Apr 7, 2017

First, thank you for the explanation. So, in that case, I should use all OTUs (3019) to describe the diversity in my sample, right? Another question: after normalization (sample size) I ran the venn command, and I assume that the Venn figure contains only the unique OTUs, because the number of OTUs is much lower than 3019, around 311 (total richness for a 4-way Venn, 4 samples). The most abundant OTU (in this case) has a duplicated consensus taxonomy (around 300 OTUs at the same classification level, Acinetobacter), which I am interested in looking into. Can I pick these 300 OTUs and make a tree based on sequence similarity? On the other hand, among those 300 duplicate OTUs, one holds 99% of the sequences assigned to this classification, and almost all of the others are singletons. Sorry to bore you with so many questions. I use mothur very often and took the mothur workshop a year and a half ago, but some questions are still open for me.

Thank you for your patience.

best,
Fabricio.

@mothur-westcott
Contributor Author

> So, in that case, I should use all OTUs (3019) to describe the diversity in my sample, right?

Yes.

> Another question: after normalization (sample size) I ran the venn command, and I assume that the Venn figure contains only the unique OTUs, because the number of OTUs is much lower than 3019, around 311 (total richness for a 4-way Venn, 4 samples).

Do you have more than 4 samples? The 311 OTUs are the OTUs associated with those four samples. It looks like there are OTUs that contain no sequences from any of the 4 samples, so they would not be included in the venn diagram.
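As a rough illustration of why a Venn diagram can report fewer OTUs than the full list, the sketch below counts only OTUs that have at least one sequence in the chosen samples. The table, the OTU labels, and the sample names A to D are hypothetical, and the dictionary is only a stand-in for a shared file, not mothur's actual format.

```python
# Hypothetical OTU-by-sample counts (a stand-in for a shared file).
shared = {
    "Otu001": {"A": 120, "B": 80, "C": 2, "D": 5},
    "Otu002": {"A": 0, "B": 0, "C": 0, "D": 0},  # no sequences in these samples
    "Otu003": {"A": 0, "B": 3, "C": 7, "D": 0},
    "Otu004": {"A": 1, "B": 0, "C": 0, "D": 0},
}
samples = ["A", "B", "C", "D"]


def otus_present(shared, samples):
    """OTUs with at least one sequence in any of the given samples."""
    return sorted(
        otu
        for otu, counts in shared.items()
        if any(counts.get(sample, 0) > 0 for sample in samples)
    )


# OTU membership per sample, which is what each Venn region is built from.
per_sample = {
    sample: {otu for otu, counts in shared.items() if counts[sample] > 0}
    for sample in samples
}
in_all_samples = set.intersection(*per_sample.values())

print("OTUs in the table:", len(shared))
print("OTUs seen in the four samples:", len(otus_present(shared, samples)))
print("OTUs shared by all four samples:", sorted(in_all_samples))
```

Otu002 is listed in the table but contributes nothing to any region of the diagram, so the total richness is lower than the number of rows.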

> The most abundant OTU (in this case) has a duplicated consensus taxonomy (around 300 OTUs at the same classification level, Acinetobacter), which I am interested in looking into. Can I pick these 300 OTUs and make a tree based on sequence similarity?

You can select OTUs for processing in several ways. The get.lineage command, https://mothur.org/wiki/Get.lineage#Running_with_a_constaxonomy_file, allows you to select OTUs based on their classification. Here's how:

get.lineage(constaxonomy=yourConstaxonomy, list=yourListFile, taxon=yourTaxons, count=yourCountFile)
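For readers who want to see what "selecting OTUs by classification" amounts to, here is a small Python sketch that pulls OTU labels out of a cons.taxonomy-style table. It is not a replacement for get.lineage and does not reproduce its matching rules. The embedded table, the shortened lineages, and the function name are hypothetical.

```python
import csv
import io
import re

# Hypothetical cons.taxonomy-style content (tab-separated: OTU, Size, Taxonomy).
CONS_TAXONOMY = (
    "OTU\tSize\tTaxonomy\n"
    "Otu001\t9900\tBacteria(100);Proteobacteria(100);Acinetobacter(100);\n"
    "Otu002\t1\tBacteria(100);Proteobacteria(100);Acinetobacter(100);\n"
    "Otu003\t40\tBacteria(100);Proteobacteria(100);Pseudomonas(98);\n"
    "Otu004\t1\tBacteria(100);Proteobacteria(100);Acinetobacter(100);\n"
)


def otus_in_lineage(cons_text, taxon):
    """Return labels of OTUs whose consensus lineage contains the given taxon."""
    reader = csv.DictReader(io.StringIO(cons_text), delimiter="\t")
    selected = []
    for row in reader:
        # Drop the trailing confidence score, e.g. "Acinetobacter(100)".
        ranks = [
            re.sub(r"\(\d+\)$", "", rank)
            for rank in row["Taxonomy"].strip(";").split(";")
        ]
        if taxon in ranks:
            selected.append(row["OTU"])
    return selected


print(otus_in_lineage(CONS_TAXONOMY, "Acinetobacter"))
```

The selected labels are the OTUs one would then carry forward, for example to build a tree from their representative sequences.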

> On the other hand, among those 300 duplicate OTUs, one holds 99% of the sequences assigned to this classification, and almost all of the others are singletons.

It's normal to have duplicate consensus taxonomies. Singleton OTUs are reported as having the classification of the sequence they contain.

@Vieira34

Sarah,

Thank you for answering those questions. I figured out what happened with the Venn diagram: I had used the wrong file (the phylotype file). Now it's OK.

I'll follow your suggestion (selecting OTUs) to see what I can get.

Yes, singleton OTUs are reported as having the classification of the sequence they contain. In my case, I'm not sure whether they are rare or low-frequency OTUs in my sample or errors introduced during sample processing, such as PCR, library preparation, or sequencing.

Thanks again.

best,
Fabricio.

@rtrpaine

Hi Sarah,

I wanted to follow up on this question. I'll preface this by saying I am using mothur (v1.39.5) to process sequences amplified from a hypervariable region of 12S rRNA of freshwater fishes.

I was having the same issue as Fabricio. I have changed the clustering and classify.otu settings several times, and the results range from a few hundred OTUs (~800) to several thousand (~2400), but no matter what, many OTUs are classified as the same species.

However, unlike with microbial communities, I actually have a good estimate of the number of species in my system (~150). So would it not artificially inflate the biodiversity to count all OTUs as contributing to community diversity (as you told Fabricio to consider all 3000 OTUs)? In my case, I know there aren't 2400 species, or even 800 species.

From what I gather from your initial answer (taxonomic OTU assignment vs. distance-based OTU assignment), I could potentially just clump the OTUs that have the same taxonomic rank. As Fabricio inferred, this would create the unique OTUs. Would that provide a more accurate representation of the true communities in my system? To elaborate on this idea, a single taxonomic assignment could represent several OTUs. While the OTUs themselves may differ from a distance-based perspective, they all still have the same taxonomic ranking. The differences seen from a distance-based perspective may simply represent variation within the population itself.

Cheers,

Robert

@pschloss
Contributor

For @Vieira34, I suspect he has 511 genera and not 3019 genera. For 16S, when we bin sequences into OTUs we commonly see many OTUs with the same genus, because an OTU is likely to be at a taxonomic resolution that is finer than the genus (I'm trying not to say species 😄). This is one reason why we prefer OTUs over classification data.

In your case, I suspect you might really have 150 true fish species. So why might you have so many more OTUs or taxa than 150? To be perfectly frank, I know nothing about 12S diversity. My initial thoughts would be...

  1. You have high sequencing error, which is inflating the number of unique sequences, OTUs, and taxa. We see this when the paired reads do not fully overlap with each other.
  2. There could be many copies of the 12S rRNA gene in the fish genomes and there is variation amongst the copies (there is for 16S, but it's small and there's generally not as many as I understand there to be for eukaryotes). In this case, different 12S copies from the same genome could end up in different OTUs. This might argue for using a larger distance threshold for doing your clustering.

As to whether to lump together OTUs that have the same taxonomy, that would be acceptable. I would probably skip the OTU steps and go straight to the phylotype command, without the headaches of distance-based clustering.
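Lumping OTUs that share a taxonomy is essentially a group-and-sum operation. The Python sketch below shows the idea on a hypothetical table. The OTU labels, sizes, and fish lineages are made up for illustration, and this is not what the phylotype command does internally, since phylotype works from the sequence classifications rather than from existing OTUs.

```python
from collections import defaultdict

# Hypothetical (OTU label, size, consensus taxonomy) records.
otus = [
    ("Otu001", 5200, "Actinopterygii;Cypriniformes;Cyprinidae;Cyprinus;"),
    ("Otu002", 310, "Actinopterygii;Cypriniformes;Cyprinidae;Cyprinus;"),
    ("Otu003", 1, "Actinopterygii;Cypriniformes;Cyprinidae;Cyprinus;"),
    ("Otu004", 870, "Actinopterygii;Salmoniformes;Salmonidae;Salmo;"),
    ("Otu005", 2, "Actinopterygii;Salmoniformes;Salmonidae;Salmo;"),
]


def lump_by_taxonomy(otus):
    """Merge OTUs with identical consensus taxonomy, summing their sizes."""
    lumped = defaultdict(lambda: {"size": 0, "otus": []})
    for label, size, taxonomy in otus:
        lumped[taxonomy]["size"] += size
        lumped[taxonomy]["otus"].append(label)
    return dict(lumped)


lumped = lump_by_taxonomy(otus)
for taxonomy, info in lumped.items():
    print(taxonomy, info["size"], len(info["otus"]), "OTUs merged")
```

Five distance-based OTUs collapse to two taxonomic bins here. The trade-off is that any real within-taxon structure, as well as any misclassification, is hidden after the merge.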

@rtrpaine

Hi Dr. Schloss,

So, being the impatient person that I am, I actually used the phylotype command while waiting for a response. This seems to have solved my problem. The 2400 OTUs in my data set were lumped together into 48 OTUs with a taxonomic ranking of genus and/or species. I will mention that 3 or 4 of the OTUs are ranked at the class or order level because we don't have all the species in our database yet. We're aiming to complete our database this summer, which will potentially break up these "super-clusters" into more species-specific clusters. At this point in the analysis, the data seem to at least fit what we know about the community (maximum of ~150 species). We have data from local government agencies that have sampled the site, telling us which fish species they have collected there over the past 20 years.

One possibility I considered with my dataset is that there does indeed seem to be some variation. I collected sequences from multiple individuals of the same species and compared those sequences to each other. There does seem to be some within-species variation in 12S. While some areas of the 12S gene are highly conserved (no bp differences), other areas are highly variable (~8-10 bp differences within a 200 bp length). I did this for about three species in three different families, just to get a sense of what was happening. Additionally, I looked at this within my specific region of interest, which is on average 172 bp. There seem to be about 2-4 bp differences. So, long story short, there may be some intra-population differences.

With this information in mind, I think the phylotype command makes sense in my case because, while there may be population differences within a species, as long as it's robustly classified I can group the sequences together into one cluster, which would represent a species. And at this point, what I'm interested in is presence or absence.
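The base-pair differences above can be turned into approximate pairwise distances with simple arithmetic. The sketch below does that and compares each value with a 0.03 cutoff. That cutoff is an assumption chosen only for illustration, since the thread does not say which clustering cutoff was used, and a plain differences-over-length ratio ignores how gaps are counted.

```python
# Assumed clustering cutoff for illustration; not stated in the thread.
ASSUMED_CUTOFF = 0.03


def uncorrected_distance(differences, length):
    """Proportion of differing positions between two sequences."""
    return differences / length


# (differences, region length in bp) taken from the figures quoted above.
observations = {
    "172 bp target region, 2 differences": (2, 172),
    "172 bp target region, 4 differences": (4, 172),
    "200 bp variable region, 8 differences": (8, 200),
    "200 bp variable region, 10 differences": (10, 200),
}

for label, (diffs, length) in observations.items():
    d = uncorrected_distance(diffs, length)
    side = "above" if d > ASSUMED_CUTOFF else "at or below"
    print(f"{label}: distance {d:.3f} ({side} the assumed cutoff)")
```

With a short amplicon, each single base difference moves the distance by more than half a percent, so a few within-species differences can bring sequences close to, or past, a typical cutoff.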

Thank you (and Sarah) for your help and hard work with mothur! It's greatly appreciated.

Cheers,

Robert

4 participants