Reboot of clustering #230

Closed
wants to merge 4,502 commits into
from

Projects

None yet
@olgabot
Contributor
olgabot commented Jul 2, 2014

This is pulled off of master this time instead of dev, fixing weird merging issues occurring in #73

Trailing Todos:

  • Reformat params with heatmap_kws, row_kws and col_kws or something to clean up the arguments
  • Is there a way to set_axis_style('nogrid') within a single plot? I couldn't find a way to detect the current axis_style to switch, and then return the rcParams to what they were.
  • Allow for several levels of labeling groups in the heatmap (multiple levels of colors) [probably not in this release]
  • Allow for user to specify their own linkage matrices
  • Constrain colorbar scale to have at most 3 labels. Currently this works only for divergent, linear maps. [probably not in this release]

Here's a notebook of the current abilities: http://nbviewer.ipython.org/gist/anonymous/10da01312dc2486888f5

@olgabot olgabot referenced this pull request Jul 2, 2014
Closed

Clustered heatmap #73

3 of 5 tasks complete
@mwaskom
Owner
mwaskom commented Jul 2, 2014

Awesome!

For tutorials, the current/future approach is to use a hybrid Sphinx/IPyNB approach. The best thing to look at for that is the linear models notebook (source, raw IPyNB, online docs).

The way this works is that I'm composing the code in an IPyNB, but writing the narrative parts as Sphinx .rst in raw text (not markdown) cells. Then when you build the docs, this gets executed, converted to .rst (using nbconvert), and built as part of the Sphinx site. Add a line to this Makefile so that will happen. But the key is that the executed notebooks are never stored in version control, so the repository stays manageable.

It's best to try to use the full power of Sphinx .rst so cross-linking, page structure, etc, will work. You can check out the docs or just try to copy what's done in the other notebooks.

If we can come up with a simple example dataset or two to use for the docs, that would be good. Ideally something that makes some sense to a range of people so they don't need to use too many cognitive resources figuring out what the plots in the docs are showing, and can focus on how they show it. That also makes it easier to do quick examples/testing without regenerating a whole simulated dataset every time. I host those in a separate repository for space reasons, and load the data over the internet with seaborn.load_dataset).

@olgabot olgabot referenced this pull request in YeoLab/flotilla Jul 3, 2014
Closed

Documentation and examples on readthedocs.org #57

@olgabot
Contributor
olgabot commented Jul 10, 2014

For the data, how about this basketball data? (link to data in question) Heatmaps are good for finding categorical clusters, so something like this could be good. I'll try to whip up an example.

@olgabot
Contributor
olgabot commented Jul 11, 2014

Something like this? I'm thinking of adding color labels for teams and stat type.

@mwaskom
Owner
mwaskom commented Jul 11, 2014

That looks pretty cool!

On Thu, Jul 10, 2014 at 9:09 PM, Olga Botvinnik notifications@github.com
wrote:

Something like this
http://nbviewer.ipython.org/gist/olgabot/448be6d27a1f0ed89ac9? I'm
thinking of adding color labels for teams and stat type.


Reply to this email directly or view it on GitHub
#230 (comment).

@olgabot
Contributor
olgabot commented Jul 11, 2014

Started writing up a notebook for the examples: http://nbviewer.ipython.org/gist/olgabot/bfe1e3638af3eea52fb1

@olgabot
Contributor
olgabot commented Jul 15, 2014

Do you have any comments on the example notebook? I'd love to get this function into people's hands and get it stress-tested.

@mwaskom
Owner
mwaskom commented Jul 15, 2014

I have some thoughts but need a little time to compose and type them up (I'm wrapping up a vacation on the east coast right now). I'll try to get to it soon, feel free to keep bugging me about it.

@mwaskom
Owner
mwaskom commented Jul 15, 2014

Overall I would say that it's looking awesome, though!

@olgabot
Contributor
olgabot commented Jul 15, 2014

Enjoy your vacation! For now, we'll make this fork a requirement for our
flotilla package, because clustered heat maps have been our most requested
feature.

Sent from my mobile device

I have some thoughts but need a little time to compose and type them up
(I'm wrapping up a vacation on the east coast right now). I'll try to get
to it soon, feel free to keep bugging me about it.


Reply to this email directly or view it on GitHub
#230 (comment).

@mwaskom
Owner
mwaskom commented Jul 18, 2014

Hey would you mind rebasing off master? There's a conflict in __init__.py with the silly xkcd stuff I added recently. Also can you get rid of the .gitignore~ file? Thanks!

@mwaskom
Owner
mwaskom commented Jul 18, 2014

Looks like there's a bug in positioning the dendrograms with fewer than cells on a side. Here's an example with real data but I can reproduce with random dataframes of size < 6 on either the rows or columns:

screenshot 2014-07-18 09 31 55

@mwaskom
Owner
mwaskom commented Jul 18, 2014

I'm not really a fan of how the colorbar is positioned, nor of how it changes size with the dimensions of the plot. I get that it's stuck in the corner to maximize the use of space, which I think is a good idea, but it feels really awkward. I think I'd prefer putting it on one side of the figure, the way colorbars are conventionally positioned when you call plt.colorbar(). It might be easiest to do it tall and thin on the left side, since the variable size of the labels might make the left or bottom tricky?

Also is there a kwarg I'm missing to control the name associated with the colorbar? It's "values" in all your examples, which isn't really adding anything. I'd think it could be removed.

@mwaskom
Owner
mwaskom commented Jul 18, 2014

When using a "tidy" formatted dataframe, it seems like the pivot_kws assume there is one observation for each (row_value, col_value) tuple. But it's also possible there are multiple values and one would want to aggregate them. This can be done using pd.pivot_table. If we switch to using that, the default aggregation func is mean which, when you only have one value in each cell, will give the same answer as df.pivot. But it makes the function more flexible. Does that make sense?

@mwaskom
Owner
mwaskom commented Jul 18, 2014

I got an error when running the tests:

======================================================================
ERROR: seaborn.tests.test_clustering.TestClusteredHeatmapPlotter.test_get_linkage_function_large_data
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/mwaskom/anaconda/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/Users/mwaskom/Code/seaborn/seaborn/tests/test_clustering.py", line 260, in test_get_linkage_function_large_data
    shape=(100, 100), use_fastcluster=False)
  File "/Users/mwaskom/Code/seaborn/seaborn/clustering.py", line 410, in get_linkage_function
    RuntimeWarning)
TypeError: exceptions must be old-style classes or derived from BaseException, not NoneType
@mwaskom
Owner
mwaskom commented Jul 18, 2014

Also you may have noticed this, but you can run make coverage to get a test coverage report. Currently the clustering module is at 92%, which is really good, but if you see any low hanging fruit in what's left out it would be good to add. Here's the text report:

Name                    Stmts   Miss  Cover   Missing
-----------------------------------------------------
seaborn.clustering        282     23    92%   29, 135, 209, 212, 215, 239, 251, 307, 401, 457, 551, 564-568, 641-644, 661, 673, 814-829

When you run make coverage, it will also generate an html report in cover/ within the source directory that you can browse.

@mwaskom
Owner
mwaskom commented Jul 18, 2014

So I am not sure how I feel about the return value from this function. All other seaborn functions do one of two things, they return either the Axes instance they have plotted into, or they return a seaborn *Grid object that organizes the parts of a complex figure. The latter is the case for lmplot, factorplot, jointplot, etc. In those cases, the object returned is a public-facing class that is also useful to the user (i.e. you can do a lot using FacetGrid directly). It's not clear to me that there's such benefit from having a separate class to structure the clustered heatmap figure, because I'm not sure what else people might use it for.

But, there would be some advantages to having the function return an object that provides easy access to more than just the dendrogram dictionaries. For example, if you want to tweak the heatmap Axes, you need to do something like plt.gcf().axes and then figure out which object in that list is the one you need. That's not obvious. It would be better to do

grid = sns.clusteredheatmap(...)
grid.heatmap_ax

One thing that makes me think this could be really useful is the strong admonition about saving figures, which is confusing and will be missed by a lot of people. In general, I really try to avoid that kind of complication, and if clusteredheatmap returned an object, that would be easy to do because you could add a savefig() method that turns around and calls fig.savefig() with bbox_inches="tight":

grid = sns.clusteredheatmap(...)
grid.savefig("my_heatmap.png")

Moving all of the figure structure to a separate object would make the code cleaner and better separate the statistical and drawing parts from the figure architecture. Because you've already written the code this is less useful, but still nice code is good to strive for (and will make maintaining it easier). Plus, to the extent that there are named methods on the object it might make it easier for people to do post-processing tweaks.

Finally, that reorganization would make clusteredheatmap match the rest of seaborn better. What do you think? I know you've already done a lot of coding here. It's also possibly something that could be done post-merge, so people can start using the general interface.

@mwaskom
Owner
mwaskom commented Jul 18, 2014

The default calculation of the figure size leads to VERY large figures, that only look reasonable in the notebook because IPython is scaling them. This might be tricky to figure out correctly, but we should aim for a calculation that ends up with most figures in a reasonable (i.e. close to other matplotlib figures) size.

@mwaskom
Owner
mwaskom commented Jul 18, 2014

Because normalizing the input data is something that will very often be useful, is it something that could be added to the function itself and then controlled with a normalize_data or similar kwarg?

@mwaskom
Owner
mwaskom commented Jul 18, 2014

I'm not sure why there need to be separate data and data_na_ok inputs. Why not have a parameter called fillna (it can default to 0 I guess) and then call data.fillna(fillna) inside the function?

@mwaskom
Owner
mwaskom commented Jul 18, 2014

It also seems like we could make it easier to add the row/column color labels. Often that kind of information is going to be in the dataframe, and I would expect the user to often not care about the particularly colors used but just to want to have distinct colors. So that suggests that the parameter should accept some combination of the distinct values (possibly in a way that infers them from the dataframe, i.e. a row/column label) and then the name of a palette. As with other functions, the palette argument can also take a dictionary if the user wants to be precise about label > color mappings.

@olgabot
Contributor
olgabot commented Jul 18, 2014

Thanks for taking such a thorough look at this! I'll make a checklist so it's easier for me to keep track.

  • Rebasing: in progress
  • Bug with small dataframes
  • Failing tests
  • Test coverage
  • Re: colorbar: Currently the axes are positioned this way:

image

Which is similar to how R's heatmap.2 does it:

R heatmap.2

It's possible to put it over on the left side, and then there'll be a bunch of whitespace in the upper left. I've seen heatmaps with the colorbar on the right side, but you're right, it would interfere with the labels.

The keyword argument to change from "values" is colorbar_kws=dict(label="not_values")

  • Re: figure size: Currently it's scaled because I usually have huge dataframes and want to see all the labels, but I can see how that's not always viable. Maybe it could just be fixed at figsize=(10, 10) and the user can change it if it's too small?
  • Re: Pivot. pivot_table and pivot have different arguments, like "rows" instead of "index" - will this be a problem?
  • Re: returning the dendrogram/figure saving - I will often use the dendrogram ordering after the fact, so it would still be important to have around, however not all users will want that. And this particular way of saving figures would definitely be best abstracted, because it's so odd and unintuitive.
  • re: normalizing data - I thought about that, and there's so many different ways that a user would want to normalize data that I didn't want to have to provide for ALL of them. There are quite a few that are often used, e.g. the z-scores I used for the basketball data, so it's possible to provide at least that one. We would just need to specify which axis is the features and which is the samples, since usually you normalize across features.
  • re: data_na_ok - data.fillna() will work if you are replacing the nas along columns but not across rows, which I thought would be a problem.
  • re: Colors - Good idea about the colors! Maybe a groupby or something? I haven't thought about this too in-depth.
  • re: matching the rest of seaborn - I'm happy to do it, because working on just this tiny part of the project has made me a better programmer so it'd be great to pick up more good habits. Could you point me to some paradigms of what you're trying to follow or specific examples of what you mean? I was using JointPlot as an example, but it's probably been updated since I looked at it.

I'm about to go out of town so sorry if I missed anything, but I'll be able to address the rest and actually code next week.

@mwaskom
Owner
mwaskom commented Jul 18, 2014

It's possible to put it over on the left side, and then there'll be a bunch of whitespace in the upper left. I've seen heatmaps with the colorbar on the right side, but you're right, it would interfere with the labels.

Yeah, I think maybe my issue is that there's some assumption that things in the top left are the most important (since that's how reading English works) and it seemed weird to put the colorbar in such a place of prominence. So it might be fine in a corner, but if the plot were restructured a bit maybe so the whitespace were in the lower right? I also think standardizing the size of the colorbar (at least in figure coordinates) would be good.

@mwaskom
Owner
mwaskom commented Jul 18, 2014

Re: returning the dendrogram/figure saving - I will often use the dendrogram ordering after the fact, so it would still be important to have around, however not all users will want that.

Yeah these should definitely be attributes on the object that gets returned (which could be called a DendrogramGrid I guess, through that is a long name so alternatives might be good too).

@mwaskom
Owner
mwaskom commented Jul 18, 2014

Normalizing data - I thought about that, and there's so many different ways that a user would want to normalize data that I didn't want to have to provide for ALL of them. There are quite a few that are often used, e.g. the z-scores I used for the basketball data.

Yeah this kind of abstraction is hard, I don't have a good sense for what might be most common. But if we can think of something that covers the modal usecase, or a good abstraction, it seems nice to have.

@mwaskom
Owner
mwaskom commented Jul 18, 2014

re: data_na_ok - data.fillna() will work if you are replacing the nas along columns but not across rows, which I thought would be a problem.

Yeah, hm, I guess the column order matters for some of the methods. I had in mind filling nulls with a constant, but maybe that doesn't cover enough usecases.

I wonder if a better interface is for the second parameter to take a boolean array that is used to mask the heatmap. Then it would be useful for when you have NAs, but maybe for other things too (outlier values, etc.)

@olgabot
Contributor
olgabot commented Jul 21, 2014

Just tried to rebase... did it work?

@mwaskom
Owner
mwaskom commented Jul 21, 2014

Still showing a merge conflict, I bet the push got rejected, you'll need to force push.

@olgabot
Contributor
olgabot commented Jul 22, 2014

Looks like there's a bug in positioning the dendrograms with fewer than cells on a side. Here's an example with real data but I can reproduce with random dataframes of size < 6 on either the rows or columns

Can you try again with the latest version? I wasn't able to reproduce it:

image

The default calculation of the figure size leads to VERY large figures, that only look reasonable in the notebook because IPython is scaling them. This might be tricky to figure out correctly, but we should aim for a calculation that ends up with most figures in a reasonable (i.e. close to other matplotlib figures) size.

Changed the default figure size to (10, 10) - what do you think?

@olgabot
Contributor
olgabot commented Jul 22, 2014

Yeah, I think maybe my issue is that there's some assumption that things in the top left are the most important (since that's how reading English works) and it seemed weird to put the colorbar in such a place of prominence. So it might be fine in a corner, but if the plot were restructured a bit maybe so the whitespace were in the lower right? I also think standardizing the size of the colorbar (at least in figure coordinates) would be good.

Good point, what about in the lower right like this? Size is flexible.

image

@olgabot
Contributor
olgabot commented Jul 22, 2014

Yeah these should definitely be attributes on the object that gets returned (which could be called a DendrogramGrid I guess, through that is a long name so alternatives might be good too).

Do you prefer DendrogramGrid or some other separate object name over the current _ClusteredHeatmapPlotter?

@olgabot
Contributor
olgabot commented Jul 22, 2014

It also seems like we could make it easier to add the row/column color labels. Often that kind of information is going to be in the dataframe, and I would expect the user to often not care about the particularly colors used but just to want to have distinct colors. So that suggests that the parameter should accept some combination of the distinct values (possibly in a way that infers them from the dataframe, i.e. a row/column label) and then the name of a palette. As with other functions, the palette argument can also take a dictionary if the user wants to be precise about label > color mappings.

Are you thinking a color_groupby or something?

@olgabot
Contributor
olgabot commented Jul 22, 2014

Yeah this kind of abstraction is hard, I don't have a good sense for what might be most common. But if we can think of something that covers the modal usecase, or a good abstraction, it seems nice to have.

Just added z_score and standard_scale, though I'm not yet quite happy with how they're implemented. Specifically, I think the vmin argument of standard_scale is misleading, but it's basically just saying whether the data should range from 0 to 1 or from

data_range = data.max()  - data.min()

# Minimum of the data
data.min()/(data_range)

# Maximum of the data
data.max()/data_range

E.g. from the docstring,

>>> import numpy as np
>>> d = np.arange(5, 8, 0.5)
>>> standard_scale(d)
[ 0.   0.2  0.4  0.6  0.8  1. ]
>>> standard_scale(d, vmin=None)
[ 2.   2.2  2.4  2.6  2.8  3. ]
@FedericoV

Hey olga,

awesome work. I just pulled your branch to play with a dataset that I was dealing with, and it works very well.

I did, however, get this error:

row_dendrogram, col_dendrogram = clusteredheatmap(signed_reduced_ks_df, figsize=(20, 20))
TypeError: '_ClusteredHeatmapPlotter' object is not iterable

The plot came out fine, so I'm not sure what the nature of the error is. I'm dealing with a 43, 43 dataframe, with identical column and index labels.

@mwaskom
Owner
mwaskom commented Jul 23, 2014

Can you try again with the latest version? I wasn't able to reproduce it:

It seems you have fixed that problem, but introduced something new when trying to use side colors:

screenshot 2014-07-23 07 50 44

@mwaskom
Owner
mwaskom commented Jul 23, 2014

Good point, what about in the lower right like this? Size is flexible.

I guess I had in mind that the dendrograms would now be on the bottom and right side of the figure. But that might violate expectations too much for people who are used to R (and maybe is hard to do with the scipy functions?)

I think the colorbar in the top left is probably good, it's the best use of space. But I think it should stay a fixed, relatively small size, so that it doesn't draw the attention too much. I also think the label for the colorbar should probably default to "" instead of "values", what do you think?

@mwaskom
Owner
mwaskom commented Jul 23, 2014

Just added z_score and standard_scale, though I'm not yet quite happy with how they're implemented.

These look good, I'm not sure None is the most obvious way to specify that but I think it's useful functionality. Maybe it would be clearer as new_vmin or something like that.

@mwaskom
Owner
mwaskom commented Jul 23, 2014

Do you prefer DendrogramGrid or some other separate object name over the current _ClusteredHeatmapPlotter?

I'll try to write up what I have in mind in some more detail later today, stay tuned.

@mwaskom
Owner
mwaskom commented Jul 23, 2014

Another thing that would be nice (but possibly not worth the effort, since I think it would be hard) is to detect whether the column labels will fit without rotating and draw them horizontally if that is true.

@mwaskom
Owner
mwaskom commented Jul 23, 2014

Also it looks like the rebase ended up doubling the commits? Not sure what happened there.

@olgabot
Contributor
olgabot commented Jul 25, 2014

@FedericoV - Thanks for checking this out! As for the error, try doing your plotting without the assignment. I took out returning of the dendrograms directly because not everyone needs them.

e.g.

clusteredheatmap(signed_reduced_ks_df, figsize=(20, 20));
@olgabot
Contributor
olgabot commented Jul 25, 2014

I guess I had in mind that the dendrograms would now be on the bottom and right side of the figure. But that might violate expectations too much for people who are used to R (and maybe is hard to do with the scipy functions?)

Oh, I see now. I think it's okay to stick to R/Matlab style for now, and add functionality to move the dendrograms and labels around later.

One issue that I've run into, is that xticklabels and yticklabels must be of the same font size - do you know a way around this?

@mwaskom
Owner
mwaskom commented Jul 25, 2014

One issue that I've run into, is that xticklabels and yticklabels must be of the same font size - do you know a way around this?

Not sure what you mean?

screenshot 2014-07-25 15 53 34

@olgabot
Contributor
olgabot commented Jul 25, 2014

One issue that I've run into, is that xticklabels and yticklabels must be of the same font size - do you know a way around this?
Not sure what you mean?

Odd, when I tried to set one to 8pt, then the next as 12pt, only the last value would work. I'll look more into this.

@olgabot
Contributor
olgabot commented Jul 25, 2014

I'm having a lot of trouble coming up with sane ways to specify the z-score/standard scaling that will make both casual and power users happy, that covers most bases and is explicit. So far, these are the combinations that make sense to me:

  • z_score=None, standard_scale=None: Do nothing
  • z_score="rows", standard_scale=None: Do z-scores only (same for "columns")
  • z_score="rows", standard_scale="columns": Illegal. Can only standardize one dimension at a time.
  • z_score="rows", standard_scale="rows": Do z-scores and standard scale from -1 to 1 (same for if both were "columns"). This makes sense to me because you're essentially forcing the data in to a standard normal. I was considering whether someone would ever want to do z-scores but then have their data from 0 to 1, and while that's possible, I don't think it makes statistical sense to move the mean when it was already conveniently at 0.
  • z_score=None, `standard_scale="rows": Scale the values in each row from 0 to 1.

What do you think?

@mwaskom
Owner
mwaskom commented Jul 27, 2014

I think by "standard_scaling" you mean scaling the variable so the smallest value is 0 and the largest value is 1?

z_score="rows", standard_scale="rows": Do z-scores and standard scale from -1 to 1 (same for if both were "columns"). This makes sense to me because you're essentially forcing the data in to a standard normal. I was considering whether someone would ever want to do z-scores but then have their data from 0 to 1, and while that's possible, I don't think it makes statistical sense to move the mean when it was already conveniently at 0.

I don't follow this. I don't see how you could ever have both z-scores and standard scaled data.

@mwaskom
Owner
mwaskom commented Jul 27, 2014

Sorry I've been unresponsive, I'm in paper-writing mode and this is a big context switch.

Do you prefer DendrogramGrid or some other separate object name over the current _ClusteredHeatmapPlotter?

So the basic idea is that these would be separate objects. Think about how jointplot works. It combines relatively high-level plotting functions with a class object, JointGrid. The plotting functions (like distplot and regplot) have a fair amount of complexity in what they do, but they all do it within a specific Axes and don't know anything about how the figure they participate in is laid out. In contrast, the JointGrid knows about how to set up a collection of Axes (for instance, it knows how to translate the kwargs size=6, ratio=3 into a GridSpec, because it's more convenient for the user to specify the former). And it knows some things about how to draw on those axes but in general is agnostic to what the plotting function will actually do, like how the plot_marginals method just requires that it gets a function that takes a vertical= kwarg and applies it to the marginal Axes with the right data. At the end of the day, jointplot returns the JointGrid object, which stores references to the figure and the component axes at sensibly (I hope) named attributes. Here is an example of it in action.

So the basic idea is 1) to separate the code that draws plots from the code that structures figures, 2) to expose a higher-level interface to manipulations on the figure in a way that's both internally cleaner to seaborn and externally useful for the user, and 3) keep references to compoenents of the figure the user might want to access.

What I think that would mean more specifically would be to carve out all of the figure-structure code from _ClusteredHeatmapPlotter into a DendrogramGrid (though I am still ambivalent about the name). Then internally clusterheatmap will initialize a particular instance of a DendrogramGrid and plot stuff onto it using _ClusteredHeamapPlotter, where all the heavy algorithmic and drawing code will still live. Then store references to the axes, figure, dendrogram dictionaries, etc. on the DendrogramGrid and return it.

Does that make more sense? Like I said above, this is a large amount of work and I think the function would overall benefit from getting some wider testing. So I'm open to merging this soon with a clear notice that the API is still somewhat unstable and then think about what makes sense in terms of refactoring.

There are actually a lot of similarities between JointGrid and what we need for here, so it may be a good idea one day (not now) to refactor the codebase so they inherit from a common object and save a lot of duplicated code.

@mwaskom
Owner
mwaskom commented Jul 27, 2014

Also another thing that occurs to me is that seaborn would benefit from a heatmap function that draws the pcolormesh without all of the clustering stuff (and then can be reused here), but that does all of the nice colormap, normalization, and labeling business. That also doesn't need to be part of this PR, but it would be good as you polish stuff before the first merge to keep that in mind so that it's easier to split off the heatmap code.

@mwaskom
Owner
mwaskom commented Jul 27, 2014

Also sorry that Travis is not providing useful feedback right now. There is some issue with installing Microsoft fonts so that the notebook tests pass, and I have not had time to dig into it since debugging Travis issues is a PITA (and I know very little about Ubuntu issues).

@shoyer
Contributor
shoyer commented Jul 29, 2014

Indeed, a heatmap function would be awesome

@olgabot
Contributor
olgabot commented Jul 30, 2014

Indeed, a heatmap function would be awesome

Kind of like this? :D https://github.com/olgabot/prettyplotlib/wiki/Examples-with-code#pcolormesh-heatmaps

I can do that.

@olgabot
Contributor
olgabot commented Jul 30, 2014

Sorry I've been unresponsive, I'm in paper-writing mode and this is a big context switch.

No problem, I understand the paper writing crunch.

What I think that would mean more specifically would be to carve out all of the figure-structure code from _ClusteredHeatmapPlotter into a DendrogramGrid (though I am still ambivalent about the name). Then internally clusterheatmap will initialize a particular instance of a DendrogramGrid and plot stuff onto it using _ClusteredHeamapPlotter, where all the heavy algorithmic and drawing code will still live. Then store references to the axes, figure, dendrogram dictionaries, etc. on the DendrogramGrid and return it.

This helps, thank you!

@mwaskom
Owner
mwaskom commented Aug 2, 2014

Hey @olgabot just a heads up that I fixed the Travis build issue in master, so if you rebase again (sorry) you should get useful feedback.

@mwaskom
Owner
mwaskom commented Aug 2, 2014

Also just wanted to tag #160 as being related to the future existence of an sns.heatmap function.

@mwaskom
Owner
mwaskom commented Aug 11, 2014

I haven't read it closely, but does this mean we might not need the external dependency on fastcluster for new versions of scipy?

@olgabot
Contributor
olgabot commented Aug 11, 2014

Not sure, it looks like it only speeds single-linkage and not the remaining
algorithms (complete, average, WARD, etc)


Olga Botvinnik
PhD Program in Bioinformatics and Systems Biology
Gene Yeo Laboratory http://yeolab.ucsd.edu/yeolab/Home.html | Sanford
Consortium for Regenerative Medicine
University of California, San Diego
www http://olgabotvinnik.com | blog http://blog.olgabotvinnik.com/ |
github http://github.com/olgabot | twitter http://twitter.com/olgabot |
linkedin http://www.linkedin.com/in/olgabotvinnik

2014-08-11 13:13 GMT-07:00 Michael Waskom notifications@github.com:

I haven't read it closely, but does this
http://richardtsai.info/posts/gsoc14-recent-progress mean we might not
need the external dependency on fastcluster for new versions of scipy?


Reply to this email directly or view it on GitHub
#230 (comment).

This was referenced Sep 10, 2014
@olgabot
Contributor
olgabot commented Oct 2, 2014

Hello again! I just wanted to let you know that I have returned from the dead and will work on this again tomorrow.

@olgabot
Contributor
olgabot commented Oct 2, 2014

Okay I tried rebasing twice... and still there's merge conflicts :(

I think I'm just going to start over and merge from current master. this is ridiculous.

and others added some commits Feb 18, 2014
@mwaskom @olgabot Make FacetGrid legend labels match hue_order ab45a61
@mwaskom @olgabot Make FacetGrid legend more robust 546998b
@mwaskom @olgabot Make other plotting functions map-friendly by avoiding kwargs.pop be2f0e3
@mwaskom @olgabot Fix distplot component kwargs 2a68316
@mwaskom @olgabot Fix Py3K tests 65af209
@mwaskom @olgabot Another stab at making distplot delegate kws properly 619ff8f
@mwaskom @olgabot Cleaning up more kwarg issues fa2d7cd
@mwaskom @olgabot Update example code (not prose) so tests are useful 06a448a
@mwaskom @olgabot Add nonfunctional kwargs to interactplot for use with FacetGrid 5d43a08
@mwaskom @olgabot Initial FacetGrid docs aeedacd
@mwaskom @olgabot Add new notebook-based docs on axis_grids and quant lm plots 44457ec
@mwaskom @olgabot Partial update of doc machinery to new notebook-based system 85506b0
@mwaskom @olgabot Move location of axis_grid docs 7755944
@mwaskom @olgabot Add margin_titles to factorplot 394c332
@mwaskom @olgabot Add nbstripout tool for cleaning notebooks eb855a9
@mwaskom @olgabot Add categorical function tutorials and api docs 0431028
@mwaskom @olgabot Tweaks to other tutorial notebooks 23f835b
@mwaskom @olgabot Updates to supporting doc tools 39d0a38
@mwaskom @olgabot Try to fix troublesome cell in timeseries notebook a393293
@mwaskom @olgabot Just remove bad tutorial cells for the time being 23ec6ab
@mwaskom @olgabot Fix warnings about API docs for class objects on doc build 883f7dc
@mwaskom @olgabot Fix typo in distplot docstring 4972e12
@mwaskom @olgabot Add docstrings for new linearmodel functions. e539a2a
@mwaskom @olgabot Remove broken cell from linear_models notebook cd74cf0
@mwaskom @olgabot Update .travis.yml 5ac4195
@mwaskom @olgabot Revert travis config changes to fix on separate branch cc0b909
@mwaskom @olgabot Change where conda gets updated 41a970f
@mwaskom @olgabot Make sure to use conda in non-interactive mode ee20a81
@mwaskom @olgabot Try building docs on Travis 3178437
@mwaskom @olgabot Update categorical lm doc source afd41e8
@mwaskom @olgabot Only buil docs on Python 2 621d2ce
@mwaskom @olgabot Add release notes for v0.3 4f1f598
@mwaskom @olgabot Update gitignore for coverage output e69b96e
@phobson @olgabot phobson ENH: added offset_spines function to utils and trim kwarg to despine b8a4f24
@phobson @olgabot phobson DOC: demo offset_spines and despine(trim=True) in an existing examples
Run hexstrip on aesthetics notebook
8fa5064
@mwaskom @olgabot Make notebook image diff test more lenient since random seeds no long…
…er work
f8ce05c
and others added some commits Apr 25, 2014
@mwaskom @olgabot Add 0.4 release notes to index 185ea12
@mwaskom @olgabot Fix cubehelix narrative docs 99c5cf7
@mwaskom @olgabot Update docs and examples with new spine offset functionality e169c31
@mwaskom @olgabot Add the option to plotting_context 187606e
@mwaskom @olgabot Make setup behave more politely about dependencies (GH #169) 74bd5a6
@olgabot olgabot Finish side_color tests 3648c00
@mwaskom @olgabot Test on minimal deps, with edits to make tests pass e300e0f
@mwaskom @olgabot Test on Python 3.4 3d86b9f
@alimanfoo @olgabot alimanfoo #185 make it possible to override gridsize with hex jointplot c0b35b8
@alimanfoo @olgabot alimanfoo pep8 line length f2cdc9f
@alimanfoo @olgabot alimanfoo fix violinplot doco 8299ff0
@alimanfoo @olgabot alimanfoo typos 72ee2d7
@mwaskom @olgabot Be safer about trying to operate on possible strings b050d9f
@cpcloud @olgabot cpcloud BUG: col and row names should be checked for None and length 2b5ea68
@swederik @olgabot swederik Added pdf.fonttype=42 to rcmod for Illustration-editable text a154bf8
@swederik @olgabot swederik Added pdf.fonttype to style_keys. Added details to release document 10e16a3
@luispedro @olgabot luispedro ENH Draw horizontal violin plots
The interface is analogous to the one for boxplots from matplotlib,
using an option called ``vert`` to distinguish vertical (default) and
horizontal plots.
28ec92c
@luispedro @olgabot luispedro PEP8 Fix Pep8 issues 44df900
@thoppe @olgabot thoppe Fixed a typo fix in the comments. 87c9d86
@mwaskom @olgabot Revert change to functions that reset RC params 801e497
@mwaskom @olgabot Skip some rcParam-related tests on older matplotlibs d764029
@mwaskom @olgabot Don't test deprecated svg parameter due to odd failures 602859e
@tyarkoni @olgabot tyarkoni fixed alpha setting bug in _RegressionPlotter scatter 4266c34
@tyarkoni @olgabot tyarkoni added regplot scatter_kws alpha test fb0cdc7
@mwaskom @olgabot PEP8 0c49e2b
@mwaskom @olgabot Freeze pyopenxl version to avoid warning in notebooks 121bfac
@mwaskom @olgabot Add xkcd color dictionary and palette function 4ea275b
@mwaskom @olgabot Add logx option to regplot dded58c
@mwaskom @olgabot Format travis setup better c5a9436
@mwaskom @olgabot Remove easy_install from install docs 5346c4d
@olgabot olgabot Add tests for labeling, plotting the actual heatmap, and setting the …
…title
b491ce1
@olgabot olgabot Set defaults to plot as None for everything e618706
@olgabot olgabot finished tests? 2e38c2b
@olgabot olgabot added check in case 'side_colors' is None 3927f57
@olgabot olgabot fixed test errors 10993be
@olgabot olgabot add pycharm .idea to gitiginore bd0a213
@olgabot olgabot fix testing incompatibility with numpy 1.6 527c548
@olgabot olgabot fix range error 1e9e544
@olgabot olgabot about to try multiple subplotspec approach... 8e2ce86
@olgabot olgabot tight_layout is evil 0d1041b
@olgabot olgabot remove option to label on the dendrogram. too complicated to keep con…
…sistent with tight_layout if the labels are long
29ce867
@olgabot olgabot fix colorbar positioning and spacing 970faaa
@olgabot olgabot tests pass! faa688e
@olgabot olgabot Drafted ipynb of how to use clusteredheatmaps d63dc0f
@olgabot olgabot fix pep8 errors b1446f9
@diego0020 @olgabot diego0020 documentation typo 1dc6bca
@olgabot olgabot prettify arguments list d004d87
@olgabot olgabot Add note on saving figures and docstring example 4170e76
@olgabot olgabot specify this version as "clustering" 14ed70d
@olgabot olgabot fix bug with fastcluster, spacing issues with side colors 1e09601
@olgabot olgabot reboot of clustering work 4ca1f4d
@olgabot olgabot moving norm and cmap to attributes of _ClusteredHeatmapPlotter 04abcf9
@olgabot olgabot mega refactor abstracting out plotting functions 88d4508
@olgabot olgabot examples work with refactor.. onto testing!! 727f60d
@olgabot olgabot one test fully works! :) d432d67
@olgabot olgabot An instance of mpl.colors.LogNorm should be tested as assert_is_isnta…
…nce and not assert_equal
abe0ddb
@olgabot olgabot Add linkage tests and remove unnecessary width_ratios for gridspec 12dbf68
@olgabot olgabot Add getting gridspec ratios and side_colors matrix and cmap tests 6d03e09
@olgabot olgabot Remove extra "variable =" which was there for no reason 4640cad
@olgabot olgabot refactor dendrogram calculation out of plot_dendrogram 3e7bf60
@olgabot olgabot move establish_axes to plot() 15b5f88
@olgabot olgabot test plot_dendrogram for both row and col 1852f63
@olgabot olgabot Added sidecolors tests and fixed erroneous sidecolor making (imagine …
…that, writing tests helps you fix your code!)
d4d0b08
@olgabot olgabot Added asserts to check that the provided side_colors are the correct …
…length
53e2c1f
@olgabot olgabot Remove cmap/vmin/vmax from pcolormesh_kws, test establish_axes in a b…
…unch of ways
6dfe8be
@olgabot olgabot Finish side_color tests c51bd89
@olgabot olgabot Add tests for labeling, plotting the actual heatmap, and setting the …
…title
b4c8a76
@olgabot olgabot Set defaults to plot as None for everything 3c8cfc0
@olgabot olgabot finished tests? 1d5d832
@olgabot olgabot added check in case 'side_colors' is None 9829c31
@olgabot olgabot fixed test errors bc4ac9d
@olgabot olgabot fix pep8 f67f667
@olgabot olgabot add pycharm .idea to gitiginore 46509ed
@olgabot olgabot fix testing incompatibility with numpy 1.6 34bfcb0
@olgabot olgabot fix range error 7dd8f74
@olgabot olgabot about to try multiple subplotspec approach... 0e6e9c7
@olgabot olgabot tight_layout is evil d766141
@olgabot olgabot remove option to label on the dendrogram. too complicated to keep con…
…sistent with tight_layout if the labels are long
ff18666
@olgabot olgabot fix colorbar positioning and spacing 359df97
@olgabot olgabot tests pass! f40ac6d
@olgabot olgabot Drafted ipynb of how to use clusteredheatmaps 5fecdba
@olgabot olgabot fix pep8 errors a0de7f4
@olgabot olgabot prettify arguments list 5eb9971
@olgabot olgabot Add note on saving figures and docstring example 4e12b4f
@olgabot olgabot fix bug with fastcluster, spacing issues with side colors edc64f5
@olgabot olgabot fix tests 12a7206
@olgabot olgabot Set figsize to (10,10) 9e7a7ad
@olgabot olgabot colorbar on lower right 986eb88
@olgabot olgabot Add savefig to `ClusteredHeatmapPlotter` ea4e33d
@olgabot olgabot Added z_score and standard_scale as functions, don't do anything yet 430450c
@olgabot olgabot Move colorbar to upper left
Change default colorbar label to empty string

Fix colorbar width and height

removed image output from ipython notebook

initial z-score/standard scaling
fef91b4
@olgabot olgabot Merge branch 'clustering2' of github.com:olgabot/seaborn into cluster…
…ing2

* 'clustering2' of github.com:olgabot/seaborn: (2223 commits)
  Move colorbar to upper left
  Added z_score and standard_scale as functions, don't do anything yet
  Add savefig to `ClusteredHeatmapPlotter`
  colorbar on lower right
  Set figsize to (10,10)
  fix tests
  fix bug with fastcluster, spacing issues with side colors
  Add note on saving figures and docstring example
  prettify arguments list
  fix pep8 errors
  Drafted ipynb of how to use clusteredheatmaps
  tests pass!
  fix colorbar positioning and spacing
  remove option to label on the dendrogram. too complicated to keep consistent with tight_layout if the labels are long
  tight_layout is evil
  about to try multiple subplotspec approach...
  fix range error
  fix testing incompatibility with numpy 1.6
  add pycharm .idea to gitiginore
  fix pep8
  ...
3d73e88
@olgabot olgabot Merge branch 'master' of github.com:mwaskom/seaborn into clustering2
* 'master' of github.com:mwaskom/seaborn: (24 commits)
  Add local seed to linear model CI test for stability
  Add note on FacetGrid maps to release notes
  set random seed to make test deterministic
  Add example for FacetGrid with custom projection
  Document heatmap in new tutorial
  Add correlation matrix heatmap example
  Pass read_csv kwargs in load_dataset
  Add ability to set equal aspect in heatmap call
  Add gallery example for heatmap
  Be more thorough in the annotation test
  Use matplotlibrc file to force Agg backend in tests
  Add tests for heatmap
  Plot the matrix with rows running from top to bottom
  Ensure that the pcolormesh covers the full axis limits
  Fix ticklabels test on Python 3
  Refactor heatmap function and add features
  Add utility functions to check for ticklabel overlaps
  Rename source file with heatmap
  add edgecolor, linewidth, kwargs, and fix pep8 lint
  add colorbar params
  ...
6c5fe4b
@olgabot
Contributor
olgabot commented Oct 13, 2014

I'm very confused why there are still conflicts. I could do a git pull upstream master from my repo, where upstream=mwaskom/seaborn and yet there are still conflicts.

@mwaskom
Owner
mwaskom commented Oct 14, 2014

Well rebasing won't automatically fix conflicts, there will still be need for manual intervention if the same part of the codebase was touched on the two branches. It looks like the problem here is that at some point about 6 months of history got pulled into this branch, which likely spawned a lot of conflicts outside of the parts you've touched for the new cluster plots.

@mwaskom
Owner
mwaskom commented Oct 14, 2014

I think in general you don't want to mix merging with rebasing, although I'm not sure exactly what's happening with the commit history here.

@mwaskom
Owner
mwaskom commented Oct 17, 2014

Closing in favor of #332

@mwaskom mwaskom closed this Oct 17, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment