Alternative order of samples/taxa #230

Closed
ConnieHa opened this Issue · 8 comments

3 participants

ConnieHa Paul J. McMurdie Mikeyj
ConnieHa

Hi Joey,

This is my first time using R so I have lots of questions on how to generate heatmaps. After following your extensive tutorial I managed to plot a heatmap using my own data (see attached), great colours but I need help to make it better.

[Attached figure: hfdgenuslog10]

1) My aim was to show the effect of diet on microbiota at the genus level. Is there a way to group all the samples by diet? For instance, all control samples on the left and all HFD on the right?

2) I also want to group genera by their phylogenetic relationship but at this stage they seem to be all over the place. For this heatmap I used otu table from qiime, a tree file of all the otus and a mapping file as the input. Do I need additional info to group the taxa phylogenetically or just additional commands on top of the ones you showed in the demo?

3) Can you provide more information on how the program computes abundance from the otu table? The number of reads varies between samples (ranging from 2000 to 10000), so I was wondering: does the plot_heatmap command normalise the data before generating a heatmap, or do I have to do that manually (e.g. provide a table of relative abundances rather than absolute counts)?

It would be great if you can give me a few pointers!

Cheers,
Connie

Paul J. McMurdie joey711 was assigned
Paul J. McMurdie
Owner

@ConnieHa

First of all, how did you include your output figure so nicely in a GitHub issue post? I'm fairly fluent in markdown, but I noticed your figure even has a GitHub URL. I'd like to be able to do the same in user responses from time to time.

Second of all, I would normally request that you provide reproducible code to illustrate your issue, but in this case the figure probably does enough (and I know how you generated it, already).

Answers to questions:
(1) Unfortunately this is a circular argument in the context of a heatmap. I have answered this question in many forms, many times, and I need to figure out how to better emphasize this widely misunderstood aspect of heatmaps:
Clustering is everything.
The choices you made for your ordination determined how the OTUs (in this case, genera) and samples (labeled by diet) are clustered relative to one another. Even for data with very clear structure, it is very easy to create misleading or obfuscating heatmaps if the axis ordering is done improperly/maliciously. In the figure you've provided, the samples are already clustered by diet, save for one outlying control sample. The edges of the heatmap don't matter, because the index order is based on the radial position of the ordination results, so could start anywhere. What I should probably do is include a user-definable option for the "start" sample and/or OTU, in case this helps better center the key clusters in the heatmap.

(2) Here is the rub. Your Question (1) answered itself because the samples naturally clustered by experimental design in the ordination, which I imagine is what you were hoping to see. I still recommend trying a few different ordination approaches to see how robust this is. In the case of OTUs, this is the same story. I have not yet implemented a phylogenetic tree ordering, opting instead to order the OTUs according to their position in the user-specified ordination, or, if none is provided, I believe it takes the original order in the abundance matrix -- or possibly the order of the OTU names, neither of which is desirable... (except that if you have a tree provided in your data object, the OTU table is automatically ordered the same as that tree).

Why is it like this? For the same reason as for the samples. Heatmaps are actually a way of displaying clustering, and they are semi-quantitative at best. They can be awesomely compact representations of the data, but they are still massively sensitive to the order that you choose. This is bad. We want display methods to show us robust patterns that are not mirages of our accidental choices or desired outcomes. Maybe that sounds preachy and not helpful (I hope it's at least a little helpful), but here's the more helpful part: there are clear clusters of OTUs in your figure; they just do not reflect something obvious in the taxonomic label you chose. For example, one of the more obvious patterns is the small cluster of Fervidicola, Syntrophococcus, and Howardella, all of which are zero or near-zero in your test samples. That seems like something you might want to further explore.

(3) plot_heatmap does not "calculate" abundances. It does relate the abundances you provide to a continuous color scale, usually with a logarithmic transformation, and this is user-definable, but the original abundance data is not manipulated, and should be considered as accurately represented within the limitations of the chosen scale transformation.

Normalization of the abundance values to account for different library sizes is essential for comparing colors between samples. It is also essential for some ordination methods, while others already account for sample sizes intrinsically (like correspondence analysis). While I'm aware of some issues/limitations of "rarefying", it is one particularly common approach to achieve this, and will work for plot_heatmap. The relevant function is rarefy_even_depth.
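To make the two options concrete, here is a rough sketch of normalizing library sizes before calling plot_heatmap, either by rarefying or by converting counts to within-sample proportions. It uses the GlobalPatterns example dataset that ships with phyloseq as a stand-in for your data; the object names and the seed value are illustrative assumptions only.

```r
library(phyloseq)

# Example data bundled with phyloseq, standing in for the user's own object.
data(GlobalPatterns)
gp <- GlobalPatterns

# Option A: rarefy every sample down to the smallest library size.
# rngseed is fixed only so that the random subsampling is repeatable.
gp_rare <- rarefy_even_depth(gp, sample.size = min(sample_sums(gp)),
                             rngseed = 711, verbose = FALSE)

# Option B: convert counts to within-sample proportions instead.
gp_rel <- transform_sample_counts(gp, function(x) x / sum(x))

# Compare library sizes before and after each approach.
range(sample_sums(gp))
range(sample_sums(gp_rare))
range(sample_sums(gp_rel))
```

Either object can then be passed to plot_heatmap in place of the raw counts. Which approach is appropriate depends on the data and on the ordination method you pair it with.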

Also, it might help to review the introduction to the plot_heatmap tutorial, in particular the summary of the reasoning behind organizing a heatmap using ordination rather than trees (whether hclust or phylogenetic), which is described nicely by Rajaram and Oono, 2010.

Hope that helps. Thank you for the feedback. I will think about whether any additional options should be added to plot_heatmap. If so, this issue will become a feature request.

Cheers

joey

Paul J. McMurdie
Owner

@ConnieHa

I was hoping for some additional discussion, as this is a topic that I think is interesting to many users and I was also wondering if some of the trends evident in the plot that you included in your original question turned out to be useful.

In the absence of further feedback, however, I will close this issue. We can always re-open it if you have additional comments, requests, problems, or other feedback; and you are free to continue to comment on a "closed" issue, anyway.

Thanks again for the feedback. It is helpful, and I will attempt to address some of the issues related to this in the documentation and interface of plot_heatmap.

Paul J. McMurdie joey711 closed this
ConnieHa

Hi Paul,

Apologies for the late reply; I was caught up with other experiments. I honestly don't know how to include figures nicely in the comments; I merely followed the instructions in GitHub's write box. I understand what you're trying to say about clustering and grouping OTUs by ordination rather than phylogenetic distances, but at the end of the day it's not the most useful way to present my data. Obviously various projects have different needs in terms of data presentation, but below are the reasons why I don't want plot_heatmap to cluster my samples and OTUs. I think it would be great to have the option of purely overlaying colours on normalised abundance data, where samples are strictly categorised by diet or timepoint and OTUs are grouped by phylogeny, and to take clustering out of the equation.

The reason why I want to generate a heatmap is to convey multiple messages/microbial trends to audience outside the microbiome field. In other words I want to condense a lot of findings into one figure and other people can get the message straight away based on the colour patterns of normalised data.

1) For instance, I want to group genera phylogenetically so it's easier to tell the audience that there is a difference in the ratio of Firmicutes and Bacteroidetes due to treatments (by comparing the intensity of colours between categories on X axis).

2) But at a finer scale not all genera within the same phylum respond to treatment in the same way. This is where lumping by OTU relatedness comes in handy, so I can easily point out that Genus A belongs to Firmicutes and its colour is hot in samples exposed to HFD, hence its relative abundance is greater in HFD mice, while Genus B also belongs to Firmicutes but its colour intensity is the same between control and HFD, so it wasn't enriched by HFD.

3) I want to strictly group samples by categories because even if the dataset has one or two outliers in the treatment group it's still easy to see the global trends of a particular treatment. Plus we can easily determine how the outlier deviates from the rest of the samples in that category. If the aberrant sample is all the way on the other end of the heatmap it is very difficult to have a visual comparison by colours.

This is my POV on heatmap figures, I'm happy to have further discussion.

Cheers,
Connie

Mikeyj

I would have to support Connie's argument here. It's vital to be able to cluster by a variable associated with the data in addition to clustering by ordination or hclust. I don't know whether this is something peculiar to human/clinical research, but whenever I present to clinicians they expect to see cases and controls grouped together and then visually compare across those rows. I would highlight whichever OTUs turn out to be significantly different on the heatmap in a contrasting colour or by boxing them. It is also appropriate when there is no significant clustering to default to clustering by a variable.

The phylogenetic tree sorting on the left-hand side is a natural way to order the OTUs, particularly as phylogenetically related organisms might be expected to have a similar response, so any response is clearer on the heatmap. Alphabetising or sorting by taxonomy doesn't put related OTUs together, and neither does taking the input order (which pretty much randomises everything).

Sorting by ordination is a really nice feature and I do prefer it to hclust, not least because other figures in the paper are more likely to be an ordination and there's additional consistency there. It is a feature unique to PhyloSeq, as far as I'm aware.

Also illustrated here is another reason for mixed level taxa labels - as per issue 213 #213 in heatmaps of the most significantly different organisms (which is probably all you'll have room to display in a publication or presentation) you can read every label and it's important that these are informative. Here it looks like Connie has glommed them by genus, but if I did this I would have multiple genera that were NA, and I am sure this would be common in other datasets too. Will continue that discussion in the other thread though and am working slowly on a way to implement it myself :)

Paul J. McMurdie
Owner

You have both convinced me this would be a useful feature, and I will open this as a feature request.

To re-word what you're describing, this is the difference between

(1) using plot_heatmap for exploratory analysis (the way it is now) ...

(2) as opposed to using plot_heatmap to confirm the presence/absence of a pre-conceived relationship between samples (experimental design) or taxa (evolutionary tree) -- (what you are asking for).

I can see why the latter is a useful option to have, and not only in human/clinical research.

I will appropriate some time to add these features, which will be in addition to Issue 237.

Please don't hesitate to make further suggestions about the interface for this.

Thank you both for the useful feedback. I'm guessing this will be helpful for many other users as well.

Paul J. McMurdie joey711 reopened this
ConnieHa

It would be great if there were an option for users to order the samples on the x axis manually, in addition to grouping samples by a variable. I know this is quite easy to do in plot_bar using levels (Issue #240). Can you add this feature to plot_heatmap?
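For readers landing on this request later: the `sample.order` argument listed in the 1.7.5 changelog further down this thread appears to cover it. Below is a minimal sketch, assuming a phyloseq version that has that argument, with the GlobalPatterns example data and its `SampleType` variable standing in for a real dataset and its diet variable. The 50-taxon subset is an arbitrary choice to keep the plot small. `taxa.order` is also fixed here (to the existing table order) so that, as far as I can tell, no ordination is run.

```r
library(phyloseq)

# Example data bundled with phyloseq; keep the 50 most abundant taxa.
data(GlobalPatterns)
top50 <- names(sort(taxa_sums(GlobalPatterns), decreasing = TRUE))[1:50]
gp <- prune_taxa(top50, GlobalPatterns)

# Build an explicit sample order: all samples of one group, then the next.
# "SampleType" is a stand-in for a design variable such as diet.
group <- get_variable(gp, "SampleType")
sample_ord <- sample_names(gp)[order(group)]

# A fully manual order is just a character vector of sample names,
# so sample_ord could equally be typed out by hand.
p <- plot_heatmap(gp,
                  sample.order = sample_ord,
                  taxa.order   = taxa_names(gp),
                  sample.label = "SampleType")
print(p)
```

Because the order is a plain character vector, an outlying sample stays next to the rest of its group rather than being placed wherever the ordination puts it.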

Paul J. McMurdie
Owner

I'm looking into this. Some of these features are already supported because the output is a valid ggplot2 plot object ("ggplot" class). At the very least I'll add documentation about this.
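As one illustration of what the ggplot2 object allows, the sketch below adds facets so that each sample group gets its own panel. It assumes that the data frame stored in the plot carries the sample variables (which seems to be the case in current phyloseq versions), uses GlobalPatterns and `SampleType` as stand-ins, and fixes both axis orders explicitly so that no ordination is needed. Treat it as a starting point rather than the recommended interface.

```r
library(phyloseq)
library(ggplot2)

# Example data bundled with phyloseq; keep the 50 most abundant taxa.
data(GlobalPatterns)
top50 <- names(sort(taxa_sums(GlobalPatterns), decreasing = TRUE))[1:50]
gp <- prune_taxa(top50, GlobalPatterns)

# Base heatmap with both axis orders given explicitly.
p <- plot_heatmap(gp,
                  sample.order = sample_names(gp),
                  taxa.order   = taxa_names(gp))

# The result is an ordinary ggplot object, so further layers can be added.
# Here: one panel per sample group, plus rotated sample labels.
p2 <- p +
  facet_grid(~ SampleType, scales = "free_x", space = "free_x") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 6))
print(p2)
```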

Paul J. McMurdie
Owner

A trial version of this is now available in:
https://github.com/joey711/phyloseq/tree/heatmap-options

I'm running a few tests on it before I roll it into the master. Should be soon.

Paul J. McMurdie joey711 referenced this issue from a commit
Paul J. McMurdie 1.7.6 new features for plot_heatmap() and improved import_mothur()
CHANGES IN VERSION 1.7.5
-------------------------

NEW FEATURES

	- User-specified axis ordering to plot_heatmap()

	- User-specified axis edges to plot_heatmap()

	- This addresses:
	  [Issue 237](#237)
	  [Issue 230](#230)

USER-VISIBLE CHANGES

	- New arguments to plot_heatmap():
	  `taxa.order`, `sample.order`, `first.sample`, `first.taxa`

CHANGES IN VERSION 1.7.4
-------------------------

NEW FEATURES

	- import_mothur now handles more formats

	- Added documentation to discourage .group/.list formats
b751904
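A short sketch of how the new `taxa.order` and `sample.order` arguments from the changelog above might be combined to get the "confirmatory" layout discussed in this issue: taxa grouped by relatedness, samples grouped by design. GlobalPatterns, the 50-taxon subset, and the choice of ranks are illustrative assumptions. Note also that the stored tip-label order of a tree usually, but not necessarily, matches the order in which the tree would be drawn.

```r
library(phyloseq)

# Example data bundled with phyloseq; keep the 50 most abundant taxa.
data(GlobalPatterns)
top50 <- names(sort(taxa_sums(GlobalPatterns), decreasing = TRUE))[1:50]
gp <- prune_taxa(top50, GlobalPatterns)

# Taxa order, option 1: the tip order stored in the (pruned) tree.
taxa_by_tree <- phy_tree(gp)$tip.label

# Taxa order, option 2: sort by taxonomic ranks, so that members of the
# same phylum (then class) sit next to each other. NAs sort last.
tax <- as(tax_table(gp), "matrix")
taxa_by_rank <- rownames(tax)[order(tax[, "Phylum"], tax[, "Class"])]

# Sample order: group samples by a design variable.
sample_ord <- sample_names(gp)[order(get_variable(gp, "SampleType"))]

# With both orders supplied, the axes follow the vectors given here.
p <- plot_heatmap(gp,
                  sample.order = sample_ord,
                  taxa.order   = taxa_by_rank,
                  taxa.label   = "Phylum",
                  sample.label = "SampleType")
print(p)
```

Swapping `taxa_by_rank` for `taxa_by_tree` gives the tree-based ordering instead.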
Paul J. McMurdie joey711 referenced this issue from a commit
Paul J. McMurdie 1.7.5 User-specified axis ordering and edges to plot_heatmap()
NEW FEATURES

	- User-specified axis ordering to plot_heatmap()

	- User-specified axis edges to plot_heatmap()

	- This addresses:
	  [Issue 237](#237)
	  [Issue 230](#230)
ec43b76
Paul J. McMurdie joey711 closed this
Paul J. McMurdie joey711 referenced this issue
Closed

plot_heatmap #362
