Clustered heatmap #5646

olgabot · 2013-12-05T06:30:52Z

Hello there,
I'm working on a clustered heatmap figure for pandas, like R's heatmap.2 which I used I ton when I was still in R-land.

I'm having some trouble wrapping my mind around pandas.tools.plotting but I have a draft version of this heatmap function with some initial parameters in my fork: olgabot@4759d31 But I'm unsure how to proceed with plumbing this function into the current pandas plotting tools.

Here's a notebook of a simple and complex example: http://nbviewer.ipython.org/gist/olgabot/7801024

Also, unlike other many plotting functions, heatmap creates a whole figure. I currently return the fig instance along with the dendrogram objects for both the rows and columns in case the user wants to do some stats on them. It seems like for this function, accepting an ax argument (or even a fig) doesn't make any sense because there's at least 4 (up to 6 if you label columns or rows with colors) ax instances to create.

FYI besides adding documentation and tests and such I'm still working on:

reasonable dendrogram sizing for when the dimensions are really large. Right now the dendrogram axis size is a proportion of the figure size, but this should change for even medium (100+) size datasets because then the dendrogram takes over the figure.
Allowing for multiple colors labels for each row/column
Adding optimal leaf ordering (important for time-dependent data)
vmin/vmax for pcolormesh
maybe add some padding between the dendrogram, colorbars, and heatmap?
Other suggestions welcome!

I'm also still working through all the developer FAQs and such so I'm probably making tons of n00b mistakes so please point them out to me, especially if there are some pandas/python conventions I'm totally not following.

olgabot · 2013-12-05T06:42:10Z

Another issue that people may run into is that the clustering a large matrix will run into recursion limit issues, so there should be a way to smartly change sys.setrecursionlimit(get_recursionlimit(df)) based on the size of the dataframe.

TomAugspurger · 2013-12-05T14:13:50Z

I'll take a closer look this weekend. These are just some quick comments.

There's an open issue about returning figures vs. axes in issue API: decide what to return from plotting functions #4020; This may be a good time to sort that out definitively.
Have a look at the numpy docstring format
Just a couple PEP8 errors. Your text editor probably has a plugin to check these automatically

olgabot · 2013-12-05T16:46:19Z

Re figs vs axes: you can access all axes of a fig with fig.axes() which returns an array of axes. Returning just the fig could be the default, but maybe we could also add the row and column dendrograms if the user specifies, because they might want to do something else with them later.

olgabot · 2013-12-05T16:49:35Z

Also, here are the R doc for heatmap.2: http://hosho.ees.hokudai.ac.jp/~kubo/Rdoc/library/gplots/html/heatmap.2.html (this is the main feature set I'm trying to create)

And the Matlab docs for clustergram: http://www.mathworks.com/help/bioinfo/ref/clustergram.html I haven't used this much but I know other people in my lab do. This defaults to optimal leaf ordering on, but I'm still getting through the paper describing the efficient optimal leaf ordering algorithm: http://www.cimms.ou.edu/~zhang/pdfs/BIDM04-JiantingZhang.pdf

…t" better

olgabot · 2013-12-05T17:18:57Z

What's the correct way to test if an object is None, if the other option is that it is a pd.DataFrame (or other iterable)?

According to PEP8, what I'm doing here is wrong:

if type(plot_df) is type(None):
    plot_df = df

but when I do this as suggested by PEP8:

if isinstance(plot_df, None):
    plot_df = df

I get:

TypeError: isinstance() arg 2 must be a class, type, or tuple of classes and types

… (For Travis-CI)

jreback · 2013-12-05T17:22:09Z

just do

if x is None:
    ....

…ngs and not necessarily indexable by a list of indices

olgabot · 2013-12-05T17:35:39Z

That worked, thanks @jreback! I think I was getting the if variable: issues with iterables (that stated that the truth value of an array is ambiguous), confused with if variable is None:

jreback · 2013-12-05T17:38:45Z

if variable:
   ....

implicity calls __nonzero__, which doesn't make sense on a pandas object (though works on None), which is why it raises (this is similar to what numpy does)

…specifications

…er fiddling

…e other colorbar locations not implemented)

…=log

jreback · 2013-12-05T23:23:15Z

this should close #3497, or if not, pls create a separate issue for this? (just so we can put tags on it and reference it).

TomAugspurger · 2013-12-07T13:08:03Z

I'm thinking a bit more about the API. It can be a bit intimidating to see a function signature with 15-20 keyword arguments. Things like title, title_fontsize, figsize, xlabel_fontsize, ylabel_fontsize, linewidth, edgecolor and maybe some others could all be moved into a **kwargs argument. Basically anything that is getting passed to the matplotlib plot command (pcolormesh I guess). This is a bit trickier since you'll have the multiple parts with the heat map and dendogram. But it will also give the benefit of being able to use all the matplotlib formatting commands, and not just the ones you've included.

And I guess this adds a new optional dependency with the fastcluster library. One of the maintainers will have to weigh in on that.

TomAugspurger · 2013-12-07T13:16:11Z

pandas/tools/plotting.py

+    if plot_df is None:
+        plot_df = df
+
+    if any(plot_df.index != df.index):


This is negligible, but you may want to use if (plot_df.index != df.index).any(): here; That uses the numpy any instead of the python builtin. It's faster for big arrays.

olgabot · 2013-12-08T02:32:31Z

It is quite overwhelming, but there a lot of aspects to control at a time. R's heatmap.2 strategy is to have all non-image (pcolormesh in our case) arguments to be in the main args, then **kwargs for pcolormesh separately. Matlab's clustergram is to have everything in the main args, which is overwhelming. So I suggest adopting the same strategy, which will move linewidth, edgecolor, cmap, norm (not currently an argument), vmin, vmax (vmin and vmax will have to be moved out of where they are now) into the kwargs. That's not very many, but it's something.

Another feature that both R and Matlab have is the ability to z-score-ify the matrix, i.e. normalize across either rows or columns such that 0 is the middle value. Should this be an option as well?

One note is that the title is not actually placed as the title of pcolormesh, but rather as the title of the column dendrogram, since that will appear at the top of the figure, rather than in between the dendrogram and the false color image.

lbeltrame · 2013-12-13T06:50:16Z

I can only say I'm extremely interested in this. Given that heatmap.2 has such a horrid syntax, I'm all welcoming this!

As a suggestion, I wonder if it would be possible to pass scipy dendrograms to the function, in case they are calculated from different algorithms than the base ones (e.g., when doing bootstrapped clustering).

olgabot · 2013-12-13T20:16:02Z

@lbeltrame that is a great idea! The way heatmap.2 does it is that Colv and Rowv can be NULL or FALSE , both of which result in no clustering, or TRUE, which will cluster, or a dendrogram, and then it will use this instead of computing the dendrogram. This syntax seems confusing to me, but I'm not sure if having cluster_cols=True and col_dendrogram=None separate makes much sense either. What do you suggest?

lbeltrame · 2013-12-13T20:19:48Z

col_dendrogram=None separate makes much sense either. What do you
suggest?

Type checking is evil, but IMO less evil than having more confusing options. I
would use an instance of a dendrogram to use it as is, False to not cluster
and None to cluster.

I know it sounds ugly, that's why I'm open to suggestions. ;)

lbeltrame · 2013-12-16T06:43:31Z

Likewise I'd change AssertionError (which is more like "this should never happen") throughout the code to ValueError ("incorrect input", in this case).

I also second @TomAugspurger's comments, the matplotlib stuff should go into **kwargs.

lbeltrame · 2013-12-16T06:48:09Z

pandas/tools/plotting.py

+
+    # heatmap with row names
+
+    def get_width_ratios(half_width, side_colors,


Would be nice to have a docstring to explain what this does, even if it's internal.

olgabot · 2013-12-16T17:04:02Z

How should tests be written for figures? For prettyplotlib I used matplotlib's @image_comparison (from matplotlib.testing.decorators import image_comparison: http://matplotlib.org/devel/testing.html#writing-an-image-comparison-test) But I haven't seen that in the pandas code.

olgabot · 2013-12-16T17:16:36Z

@lbeltrame so both True and None would cluster? That sounds fishy.. I think None and False should both not cluster.

jreback · 2013-12-16T17:22:15Z

@olgabot why don't you have a cluster= kw that takes a string and make False/None just not cluster.

jtratner · 2013-12-16T17:27:45Z

@olgabot I'd be quite happy if we put in tests with image_comparison - seems like a good idea. Make sure you mark your tests with slow though (might be time to add a plotting decorator too...)

olgabot · 2013-12-16T17:30:34Z

It would need to be separate for columns and rows, so cluster_cols and cluster_rows need to separately address:

whether or not to cluster this dimension (True/False/None?)
If the user has already clustered, accept a dendrogram (a dict with keys icoord, dcoord and leaves)
This could also be confused for the metric or linkage method, though I've separated those to metric, col_linkage_method, and row_linkage_method respectively so hopefully that won't be the case

So to address all of the above issues, I propose these acceptable arguments to cluster_cols and cluster_rows:

False : do not cluster
True : cluster using the *linkage_method and metric provided
None: ?? do not cluster I think
dict: check for icoord, dcoord, and leaves in the keys, and we're good to go

jtratner · 2013-12-16T17:32:41Z

@olgabot - would you mind listing out the different outcome decisions
(maybe a table?) [or, alternatively, a set of example calls] - might help
to come to a simpler API for this.

olgabot · 2013-12-16T17:40:45Z

The main outcomes are

cluster (using other params in the program)
don't cluster
use this other dendrogram I made to arrange the heatmap

Does that help?

jtratner · 2013-12-16T17:48:46Z

I agree with what your suggestion then, just tweaked such that anything
Falsey (None/False/empty dict, whatever) should not cluster. Is it invalid
to get a dendogram and also get a linkage_method?

Also, if you end up with linkage_method_rows, linkage_method_cols, etc,
you might consider passing two dicts: col_clustering = dict(linkage_method=...), row_clustering = dict(dendogram=...). Will
probably make it cleaner.

On Mon, Dec 16, 2013 at 12:40 PM, Olga Botvinnik
notifications@github.comwrote:

The main outcomes are

cluster (using other params in the program)

don't cluster

use this other dendrogram I made to arrange the heatmap

Does that help?

—
Reply to this email directly or view it on GitHubhttps://github.com//pull/5646#issuecomment-30681905
.

lbeltrame · 2013-12-16T22:47:20Z

@olgabot +1 on your idea with API.

With regards to linkage and clustering methods, IMO it makes no sense to use two different methods for rows and columns (at least in my common usage), so I would rather avoid that.

lbeltrame · 2013-12-20T10:44:57Z

Can you update either the PR text or add here a TODO of what's left? Might be useful if someone else wants to contribute.

BTW, how do you handle row text when rows are hundreds or even thousands? Even R can't get this quite right (I end up omitting row names in such cases).

ghost · 2013-12-22T18:21:15Z

@olgabot looks like this PR found a more suitable home in seaborn, as discussed. Close?

olgabot · 2013-12-22T19:37:01Z

Yes sure

Sent from my mobile device.
On Dec 22, 2013 10:21 AM, "y-p" notifications@github.com wrote:

@olgabot https://github.com/olgabot looks like this PR found a more
suitable home in seaborn as discussed. Close?

—
Reply to this email directly or view it on GitHubhttps://github.com//pull/5646#issuecomment-31092646
.

olgabot added 3 commits December 4, 2013 21:12

Initial commit of clustered heatmap

4759d31

Fixed bugs that arose while making an example. Funny how that happens!

d8eb771

Added citation of Chris DeBoever

7e2b64d

olgabot added 4 commits December 4, 2013 22:48

Added some docstrings

94a92c5

Added some docstrings

b6ce06c

Changed exceptions to AssertionErrors

049a36e

Fixed relabeling of columns

a190062

Fixed some PEP8 and other formatting stuff

e955409

Changed default sequential colormap to YlGnBu and explained "divergen…

c0f03d8

…t" better

olgabot added 2 commits December 5, 2013 09:20

Changed dictionary comprehension to explicit dict() call on generated…

f82ebb4

… (For Travis-CI)

fixed incomplete typo undoing

7634d20

olgabot added 4 commits December 5, 2013 09:26

Fixed yticklabel indexing so yticklabels could be just a list of stri…

5ee83c0

…ngs and not necessarily indexable by a list of indices

Made AssertionError of label_cols and label_rows more descriptive

542f10d

Removed extraneous print statements

de47e83

Removed more extraneous print statements

f4c2f27

olgabot added 4 commits December 5, 2013 10:18

Changed docstrings to reStructuredText and added linewidth+edgecolor …

8f03b1e

…specifications

Fixed log scale vmin/vmax stuff and added some more parameters for us…

2517e71

…er fiddling

Added width ratio finder for various colorbar locations (but otherwis…

8286c88

…e other colorbar locations not implemented)

Changed clustering to operate on log10 of the data if the color_scale…

fd195b1

…=log

olgabot added 2 commits December 5, 2013 22:23

Added acknowledgement to Michael Lovci

ae86982

Added fastcluster as optional import for really large matrices

ff63dd9

olgabot added 4 commits December 6, 2013 01:40

added metric option for calculating distances

7caf10b

added metric option for calculating distances

87e989d

removed extraneous spaces

b7b2340

set default linkage method to average

4c31c6c

TomAugspurger reviewed Dec 7, 2013
View reviewed changes

changed any(df) to df.any()

a2a1313

lbeltrame reviewed Dec 16, 2013
View reviewed changes

Changed AssertionError's to ValueError's

0707d0f

Added docstring explaining get_width_ratios

b2e4552

olgabot mentioned this pull request Dec 16, 2013

Heatmaps? mwaskom/seaborn#41

Closed

Unified row and column linkage methods

d6785a2

ghost closed this Dec 22, 2013

This pull request was closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clustered heatmap #5646

Clustered heatmap #5646

olgabot commented Dec 5, 2013

olgabot commented Dec 5, 2013

TomAugspurger commented Dec 5, 2013

olgabot commented Dec 5, 2013

olgabot commented Dec 5, 2013

olgabot commented Dec 5, 2013

jreback commented Dec 5, 2013

olgabot commented Dec 5, 2013

jreback commented Dec 5, 2013

jreback commented Dec 5, 2013

TomAugspurger commented Dec 7, 2013

TomAugspurger Dec 7, 2013

olgabot commented Dec 8, 2013

lbeltrame commented Dec 13, 2013

olgabot commented Dec 13, 2013

lbeltrame commented Dec 13, 2013

lbeltrame commented Dec 16, 2013

lbeltrame Dec 16, 2013

olgabot commented Dec 16, 2013

olgabot commented Dec 16, 2013

jreback commented Dec 16, 2013

jtratner commented Dec 16, 2013

olgabot commented Dec 16, 2013

jtratner commented Dec 16, 2013

olgabot commented Dec 16, 2013

jtratner commented Dec 16, 2013

lbeltrame commented Dec 16, 2013

lbeltrame commented Dec 20, 2013

ghost commented Dec 22, 2013

olgabot commented Dec 22, 2013


		# heatmap with row names

		def get_width_ratios(half_width, side_colors,

Clustered heatmap #5646

Clustered heatmap #5646

Conversation

olgabot commented Dec 5, 2013

olgabot commented Dec 5, 2013

TomAugspurger commented Dec 5, 2013

olgabot commented Dec 5, 2013

olgabot commented Dec 5, 2013

olgabot commented Dec 5, 2013

jreback commented Dec 5, 2013

olgabot commented Dec 5, 2013

jreback commented Dec 5, 2013

jreback commented Dec 5, 2013

TomAugspurger commented Dec 7, 2013

TomAugspurger Dec 7, 2013

Choose a reason for hiding this comment

olgabot commented Dec 8, 2013

lbeltrame commented Dec 13, 2013

olgabot commented Dec 13, 2013

lbeltrame commented Dec 13, 2013

lbeltrame commented Dec 16, 2013

lbeltrame Dec 16, 2013

Choose a reason for hiding this comment

olgabot commented Dec 16, 2013

olgabot commented Dec 16, 2013

jreback commented Dec 16, 2013

jtratner commented Dec 16, 2013

olgabot commented Dec 16, 2013

jtratner commented Dec 16, 2013

olgabot commented Dec 16, 2013

jtratner commented Dec 16, 2013

lbeltrame commented Dec 16, 2013

lbeltrame commented Dec 20, 2013

ghost commented Dec 22, 2013

olgabot commented Dec 22, 2013