@rickymagner rickymagner commented May 12, 2023

This PR is heavily inspired by this one, building on this forum post, which gave a minimal implementation of UpSet plots using the figure_factory. UpSet plots are a more natural way to generalize data represented via Venn diagrams: they scale better, and differences in bar sizes are easier to see than differences in circle sizes. This PR builds on code introduced in the other PR, but vastly extends functionality, adds the characteristic "marginal" side plot, and includes "full" documentation.

I know that in the previous PR it was stated that resources are limited for figure_factory PRs, so I hope that by making this as "complete" as possible, it'll be much easier to get merged. I'd be happy to discuss the code with any reviewers, and I provide more details on the code below. As this includes both new features and corresponding documentation, I have both checklists here. I greatly appreciate any feedback on getting this PR compliant and improving the code.

Documentation PR

  • I've seen the doc/README.md file
  • This change runs in the current version of Plotly on PyPI and targets the doc-prod branch OR it targets the master branch
  • If this PR modifies the first example in a page or adds a new one, it is a px example if at all possible
  • Every new/modified example has a descriptive title and motivating sentence or paragraph
  • Every new/modified example is independently runnable
  • Every new/modified example is optimized for short line count and focuses on the Plotly/visualization-related aspects of the example rather than the computation required to produce the data being visualized
  • Meaningful/relatable datasets are used for all new examples instead of randomly-generated data where possible
  • The random seed is set if using randomly-generated data in new/modified examples
  • New/modified remote datasets are loaded from https://plotly.github.io/datasets and added to https://github.com/plotly/datasets
  • Large computations are avoided in the new/modified examples in favour of loading remote datasets that represent the output of such computations
  • Imports are plotly.graph_objects as go / plotly.express as px / plotly.io as pio
  • Data frames are always called df
  • fig = <something> call is high up in each new/modified example (either px.<something> or make_subplots or go.Figure)
  • Liberal use is made of fig.add_* and fig.update_* rather than go.Figure(data=..., layout=...) in every new/modified example
  • Specific adders and updaters like fig.add_shape and fig.update_xaxes are used instead of big fig.update_layout calls in every new/modified example
  • fig.show() is at the end of each new/modified example
  • plotly.plot() and plotly.iplot() are not used in any new/modified example
  • Hex codes for colors are not used in any new/modified example in favour of these nice ones

Code PR

  • I have read through the contributing notes and understand the structure of the package. In particular, if my PR modifies code of plotly.graph_objects, my modifications concern the codegen files and not generated files.
  • I have added tests (if submitting a new feature or correcting a bug) or
    modified existing tests.
  • For a new feature, I have added documentation examples in an existing or
    new tutorial notebook (please see the doc checklist as well).
  • I have added a CHANGELOG entry if fixing/changing/adding anything substantial.
  • For a new feature or a change in behaviour, I have updated the relevant docstrings in the code to describe the feature or behaviour (please see the doc checklist as well).

Notes on the Code

To make it easier to review, I'll provide a brief description of the code layout. The code is somewhat similar to the create_quiver function in the figure_factory. The main create_upset function creates an instance of the _Upset class using the user inputs. Aside from a few utilities for doing some preprocessing, most of the plot-generating methods are contained in this class. This structure was used to make it a little easier for the major conceptual steps in generating the plot to share data freely via the class attributes.

The _Upset class performs the following steps:

  1. Validate that certain user inputs belong to an explicit set of allowable entries (e.g. sort_by is either Counts or Intersections).
  2. Perform some data manipulation to collect the appropriate subset/intersection counts. This includes inferring whether the data was provided in wide or condensed format, and handling the logic for splitting across color and x.
  3. Create the "primary plot" which sits at the top of the final output, typically a bar chart (though the user can modify this choice when considering a distribution of counts over x values).
  4. Add a "switchboard plot" (i.e. a carefully crafted scatter plot) below the primary plot, representing which intersection corresponds to the figure above it.
  5. Add a "marginal plot" on the left, representing the counts of each of the subset/tag categories.
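
For readers who want a feel for steps 1 and 2 without reading the PR code, here is a minimal plain-Python sketch. It is not the PR's implementation: the function name `count_intersections`, the row layout (one dict of set name to bool per observation), and the meaning given to the "Intersections" sort order are all assumptions made for illustration.

```python
from collections import Counter

# Allowed values as named in step 1; everything else here is hypothetical.
ALLOWED_SORT_BY = {"Counts", "Intersections"}


def count_intersections(rows, set_names, sort_by="Counts"):
    """Count rows per exact combination of sets, plus per-set totals.

    rows: iterable of dicts mapping set name -> bool (a "wide" layout).
    Returns (ordered_intersections, marginal_counts).
    """
    # Step 1: reject values outside the explicit allow-list.
    if sort_by not in ALLOWED_SORT_BY:
        raise ValueError(f"sort_by must be one of {sorted(ALLOWED_SORT_BY)}")

    # Step 2: tally each exact membership combination and each set's total.
    intersections = Counter()
    marginals = Counter()
    for row in rows:
        members = tuple(name for name in set_names if row.get(name))
        intersections[members] += 1
        for name in members:
            marginals[name] += 1

    if sort_by == "Counts":
        # Largest intersections first.
        ordered = sorted(intersections.items(), key=lambda kv: -kv[1])
    else:
        # Assumed meaning: order by number of sets involved, then by name.
        ordered = sorted(intersections.items(), key=lambda kv: (len(kv[0]), kv[0]))
    return ordered, dict(marginals)


rows = [
    {"A": True, "B": False, "C": True},
    {"A": True, "B": False, "C": True},
    {"A": True, "B": True, "C": False},
    {"A": False, "B": False, "C": True},
]
ordered, marginals = count_intersections(rows, ["A", "B", "C"])
```

In this sketch, the ordered intersection counts would feed the primary plot and the switchboard plot, while the per-set totals would feed the marginal plot.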

Any feedback is greatly appreciated!

A Preview

As motivation, here's a nice example plot generated in one line with a well-formatted DataFrame:

ff.create_upset(df, color='color', title='My UpSet Plot')

[Image: the resulting UpSet plot]

@rickymagner
Author

It seems some tests are failing on older versions of Python because of a (presumably) older version of pandas which doesn't have value_counts for DataFrames yet. Is there a way to have tests require a certain version of pandas, or otherwise ignore those older versions?

Also, if anyone has some insight in getting the notebook test to pass, that'd be great. It seems to be currently failing because it doesn't like the permalink attribute. I just copied and modified the notebook attributes from another example.

@alexcjohnson
Collaborator

Thanks for the PR @rickymagner !

re: Pandas: looks like our Python 3.6 and 3.7 "optional" jobs still run against Pandas 0.24:


We do want to be flexible in the pandas versions we support, though 0.24 is pretty ancient. Looks like 1.1.5 is the last version that keeps Python 3.6 support and that's 2.5 years old so I'd be comfortable bumping the version in the above two files to 1.1.5 at this point. Is that new enough to support value_counts? If not, we'll need to include fallback code to mimic value_counts using older methods.
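
As far as I know, DataFrame.value_counts was introduced in pandas 1.1.0, so 1.1.5 should include it. If older versions did need to be supported, a fallback along the following lines might work. This is only a sketch: the helper name `value_counts_compat` is hypothetical, and the ordering of tied counts may differ between the two branches.

```python
import pandas as pd


def value_counts_compat(df, columns):
    """Count unique row combinations over `columns`.

    Uses DataFrame.value_counts when the installed pandas provides it,
    otherwise falls back to groupby(...).size().
    """
    if hasattr(pd.DataFrame, "value_counts"):
        return df.value_counts(subset=columns)
    # Fallback for older pandas: one group per unique combination,
    # sized by row count, then sorted so the largest groups come first.
    return df.groupby(columns).size().sort_values(ascending=False)


# Hypothetical wide-format data: 1 means the row belongs to that set.
df = pd.DataFrame({"A": [1, 1, 0, 1], "B": [0, 0, 1, 1]})
counts = value_counts_compat(df, ["A", "B"])
```

Both branches return a Series indexed by the column combinations, so downstream code can treat the result the same way in either case.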

@alexcjohnson
Collaborator

@rickymagner sorry for the contradictory notes but... thinking about this a bit more, I'd like us not to add more figure factories to plotly.py. As mentioned in #3833 (comment) further extensions like this would be better in a separate package - either one package to collect all sorts of new figure factories, or a package just for upset plots.

The challenge for us of adding figure factories here is it confuses people about plotly.py vs other ways to make Plotly charts, such as direct usage of plotly.js.

@rickymagner
Author

Thanks for getting back on this. If anything changes in the future and you'd like to discuss merging this into the FF package, let me know!
