Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clustered heatmap #5646

Closed
wants to merge 31 commits into from
Closed

Clustered heatmap #5646

wants to merge 31 commits into from

Conversation

olgabot
Copy link

@olgabot olgabot commented Dec 5, 2013

Hello there,
I'm working on a clustered heatmap figure for pandas, like R's heatmap.2 which I used I ton when I was still in R-land.

I'm having some trouble wrapping my mind around pandas.tools.plotting but I have a draft version of this heatmap function with some initial parameters in my fork: olgabot@4759d31 But I'm unsure how to proceed with plumbing this function into the current pandas plotting tools.

Here's a notebook of a simple and complex example: http://nbviewer.ipython.org/gist/olgabot/7801024

Also, unlike other many plotting functions, heatmap creates a whole figure. I currently return the fig instance along with the dendrogram objects for both the rows and columns in case the user wants to do some stats on them. It seems like for this function, accepting an ax argument (or even a fig) doesn't make any sense because there's at least 4 (up to 6 if you label columns or rows with colors) ax instances to create.

FYI besides adding documentation and tests and such I'm still working on:

  • reasonable dendrogram sizing for when the dimensions are really large. Right now the dendrogram axis size is a proportion of the figure size, but this should change for even medium (100+) size datasets because then the dendrogram takes over the figure.
  • Allowing for multiple colors labels for each row/column
  • Adding optimal leaf ordering (important for time-dependent data)
  • vmin/vmax for pcolormesh
  • maybe add some padding between the dendrogram, colorbars, and heatmap?
  • Other suggestions welcome!

I'm also still working through all the developer FAQs and such so I'm probably making tons of n00b mistakes so please point them out to me, especially if there are some pandas/python conventions I'm totally not following.

@olgabot
Copy link
Author

olgabot commented Dec 5, 2013

Another issue that people may run into is that the clustering a large matrix will run into recursion limit issues, so there should be a way to smartly change sys.setrecursionlimit(get_recursionlimit(df)) based on the size of the dataframe.

@TomAugspurger
Copy link
Contributor

I'll take a closer look this weekend. These are just some quick comments.

@olgabot
Copy link
Author

olgabot commented Dec 5, 2013

Re figs vs axes: you can access all axes of a fig with fig.axes() which returns an array of axes. Returning just the fig could be the default, but maybe we could also add the row and column dendrograms if the user specifies, because they might want to do something else with them later.

@olgabot
Copy link
Author

olgabot commented Dec 5, 2013

Also, here are the R doc for heatmap.2: http://hosho.ees.hokudai.ac.jp/~kubo/Rdoc/library/gplots/html/heatmap.2.html (this is the main feature set I'm trying to create)

And the Matlab docs for clustergram: http://www.mathworks.com/help/bioinfo/ref/clustergram.html I haven't used this much but I know other people in my lab do. This defaults to optimal leaf ordering on, but I'm still getting through the paper describing the efficient optimal leaf ordering algorithm: http://www.cimms.ou.edu/~zhang/pdfs/BIDM04-JiantingZhang.pdf

@olgabot
Copy link
Author

olgabot commented Dec 5, 2013

What's the correct way to test if an object is None, if the other option is that it is a pd.DataFrame (or other iterable)?

According to PEP8, what I'm doing here is wrong:

if type(plot_df) is type(None):
    plot_df = df

but when I do this as suggested by PEP8:

if isinstance(plot_df, None):
    plot_df = df

I get:

TypeError: isinstance() arg 2 must be a class, type, or tuple of classes and types

@jreback
Copy link
Contributor

jreback commented Dec 5, 2013

just do

if x is None:
    ....

@olgabot
Copy link
Author

olgabot commented Dec 5, 2013

That worked, thanks @jreback! I think I was getting the if variable: issues with iterables (that stated that the truth value of an array is ambiguous), confused with if variable is None:

@jreback
Copy link
Contributor

jreback commented Dec 5, 2013

if variable:
   ....

implicity calls __nonzero__, which doesn't make sense on a pandas object (though works on None), which is why it raises (this is similar to what numpy does)

@jreback
Copy link
Contributor

jreback commented Dec 5, 2013

this should close #3497, or if not, pls create a separate issue for this? (just so we can put tags on it and reference it).

@TomAugspurger
Copy link
Contributor

I'm thinking a bit more about the API. It can be a bit intimidating to see a function signature with 15-20 keyword arguments. Things like title, title_fontsize, figsize, xlabel_fontsize, ylabel_fontsize, linewidth, edgecolor and maybe some others could all be moved into a **kwargs argument. Basically anything that is getting passed to the matplotlib plot command (pcolormesh I guess). This is a bit trickier since you'll have the multiple parts with the heat map and dendogram. But it will also give the benefit of being able to use all the matplotlib formatting commands, and not just the ones you've included.

And I guess this adds a new optional dependency with the fastcluster library. One of the maintainers will have to weigh in on that.

if plot_df is None:
plot_df = df

if any(plot_df.index != df.index):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is negligible, but you may want to use if (plot_df.index != df.index).any(): here; That uses the numpy any instead of the python builtin. It's faster for big arrays.

@olgabot
Copy link
Author

olgabot commented Dec 8, 2013

It is quite overwhelming, but there a lot of aspects to control at a time. R's heatmap.2 strategy is to have all non-image (pcolormesh in our case) arguments to be in the main args, then **kwargs for pcolormesh separately. Matlab's clustergram is to have everything in the main args, which is overwhelming. So I suggest adopting the same strategy, which will move linewidth, edgecolor, cmap, norm (not currently an argument), vmin, vmax (vmin and vmax will have to be moved out of where they are now) into the kwargs. That's not very many, but it's something.

Another feature that both R and Matlab have is the ability to z-score-ify the matrix, i.e. normalize across either rows or columns such that 0 is the middle value. Should this be an option as well?

One note is that the title is not actually placed as the title of pcolormesh, but rather as the title of the column dendrogram, since that will appear at the top of the figure, rather than in between the dendrogram and the false color image.

@lbeltrame
Copy link
Contributor

I can only say I'm extremely interested in this. Given that heatmap.2 has such a horrid syntax, I'm all welcoming this!

As a suggestion, I wonder if it would be possible to pass scipy dendrograms to the function, in case they are calculated from different algorithms than the base ones (e.g., when doing bootstrapped clustering).

@olgabot
Copy link
Author

olgabot commented Dec 13, 2013

@lbeltrame that is a great idea! The way heatmap.2 does it is that Colv and Rowv can be NULL or FALSE , both of which result in no clustering, or TRUE, which will cluster, or a dendrogram, and then it will use this instead of computing the dendrogram. This syntax seems confusing to me, but I'm not sure if having cluster_cols=True and col_dendrogram=None separate makes much sense either. What do you suggest?

@lbeltrame
Copy link
Contributor

col_dendrogram=None separate makes much sense either. What do you
suggest?

Type checking is evil, but IMO less evil than having more confusing options. I
would use an instance of a dendrogram to use it as is, False to not cluster
and None to cluster.

I know it sounds ugly, that's why I'm open to suggestions. ;)

@lbeltrame
Copy link
Contributor

Likewise I'd change AssertionError (which is more like "this should never happen") throughout the code to ValueError ("incorrect input", in this case).

I also second @TomAugspurger's comments, the matplotlib stuff should go into **kwargs.


# heatmap with row names

def get_width_ratios(half_width, side_colors,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to have a docstring to explain what this does, even if it's internal.

@olgabot
Copy link
Author

olgabot commented Dec 16, 2013

How should tests be written for figures? For prettyplotlib I used matplotlib's @image_comparison (from matplotlib.testing.decorators import image_comparison: http://matplotlib.org/devel/testing.html#writing-an-image-comparison-test) But I haven't seen that in the pandas code.

@olgabot
Copy link
Author

olgabot commented Dec 16, 2013

@lbeltrame so both True and None would cluster? That sounds fishy.. I think None and False should both not cluster.

@jreback
Copy link
Contributor

jreback commented Dec 16, 2013

@olgabot why don't you have a cluster= kw that takes a string and make False/None just not cluster.

@jtratner
Copy link
Contributor

@olgabot I'd be quite happy if we put in tests with image_comparison - seems like a good idea. Make sure you mark your tests with slow though (might be time to add a plotting decorator too...)

@olgabot
Copy link
Author

olgabot commented Dec 16, 2013

It would need to be separate for columns and rows, so cluster_cols and cluster_rows need to separately address:

  • whether or not to cluster this dimension (True/False/None?)
  • If the user has already clustered, accept a dendrogram (a dict with keys icoord, dcoord and leaves)
  • This could also be confused for the metric or linkage method, though I've separated those to metric, col_linkage_method, and row_linkage_method respectively so hopefully that won't be the case

So to address all of the above issues, I propose these acceptable arguments to cluster_cols and cluster_rows:

  • False : do not cluster
  • True : cluster using the *linkage_method and metric provided
  • None: ?? do not cluster I think
  • dict: check for icoord, dcoord, and leaves in the keys, and we're good to go

@jtratner
Copy link
Contributor

@olgabot - would you mind listing out the different outcome decisions
(maybe a table?) [or, alternatively, a set of example calls] - might help
to come to a simpler API for this.

@olgabot
Copy link
Author

olgabot commented Dec 16, 2013

The main outcomes are

  • cluster (using other params in the program)
  • don't cluster
  • use this other dendrogram I made to arrange the heatmap

Does that help?

@jtratner
Copy link
Contributor

I agree with what your suggestion then, just tweaked such that anything
Falsey (None/False/empty dict, whatever) should not cluster. Is it invalid
to get a dendogram and also get a linkage_method?

Also, if you end up with linkage_method_rows, linkage_method_cols, etc,
you might consider passing two dicts: col_clustering = dict(linkage_method=...), row_clustering = dict(dendogram=...). Will
probably make it cleaner.

On Mon, Dec 16, 2013 at 12:40 PM, Olga Botvinnik
notifications@github.comwrote:

The main outcomes are

  • cluster (using other params in the program)
  • don't cluster
  • use this other dendrogram I made to arrange the heatmap

Does that help?


Reply to this email directly or view it on GitHubhttps://github.com//pull/5646#issuecomment-30681905
.

@olgabot olgabot mentioned this pull request Dec 16, 2013
@lbeltrame
Copy link
Contributor

@olgabot +1 on your idea with API.

With regards to linkage and clustering methods, IMO it makes no sense to use two different methods for rows and columns (at least in my common usage), so I would rather avoid that.

@lbeltrame
Copy link
Contributor

Can you update either the PR text or add here a TODO of what's left? Might be useful if someone else wants to contribute.

BTW, how do you handle row text when rows are hundreds or even thousands? Even R can't get this quite right (I end up omitting row names in such cases).

@ghost
Copy link

ghost commented Dec 22, 2013

@olgabot looks like this PR found a more suitable home in seaborn, as discussed. Close?

@olgabot
Copy link
Author

olgabot commented Dec 22, 2013

Yes sure

Sent from my mobile device.
On Dec 22, 2013 10:21 AM, "y-p" notifications@github.com wrote:

@olgabot https://github.com/olgabot looks like this PR found a more
suitable home in seaborn as discussed. Close?


Reply to this email directly or view it on GitHubhttps://github.com//pull/5646#issuecomment-31092646
.

@ghost ghost closed this Dec 22, 2013
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants