
UpSet plot fails for certain number of entries in input #193

Closed
mumichae opened this issue Nov 21, 2022 · 2 comments · Fixed by #202

mumichae commented Nov 21, 2022

For some datasets with a large number of sets I get the following error.

temp2_big.csv

import pandas as pd
from upsetplot import UpSet

tmp = pd.read_table('temp2_big.tsv')
tmp = tmp.set_index([x for x in tmp.columns if x != '0'])

tmp.index.shape
Out: (2,)

len(tmp.index[0])
Out: 275

UpSet(tmp)
Error Traceback
---------------------------------------------------------------------------
InvalidIndexError                         Traceback (most recent call last)
Cell In [522], line 1
----> 1 UpSet(tmp)

File ~/mambaforge/envs/plots/lib/python3.9/site-packages/upsetplot/plotting.py:416, in UpSet.__init__(self, data, orientation, sort_by, sort_categories_by, subset_size, sum_over, min_subset_size, max_subset_size, min_degree, max_degree, facecolor, other_dots_color, shading_color, with_lines, element_size, intersection_plot_elements, totals_plot_elements, show_counts, show_percentages)
    412 self._show_counts = show_counts
    413 self._show_percentages = show_percentages
    415 (self.total, self._df, self.intersections,
--> 416  self.totals) = _process_data(data,
    417                               sort_by=sort_by,
    418                               sort_categories_by=sort_categories_by,
    419                               subset_size=subset_size,
    420                               sum_over=sum_over,
    421                               min_subset_size=min_subset_size,
    422                               max_subset_size=max_subset_size,
    423                               min_degree=min_degree,
    424                               max_degree=max_degree,
    425                               reverse=not self._horizontal)
    426 self.subset_styles = [{"facecolor": facecolor}
    427                       for i in range(len(self.intersections))]
    428 self.subset_legend = []

File ~/mambaforge/envs/plots/lib/python3.9/site-packages/upsetplot/plotting.py:196, in _process_data(df, sort_by, sort_categories_by, subset_size, sum_over, min_subset_size, max_subset_size, min_degree, max_degree, reverse)
    194 df_packed = _pack_binary(df.index.to_frame())
    195 data_packed = _pack_binary(agg.index.to_frame())
--> 196 df['_bin'] = pd.Series(df_packed).map(
    197     pd.Series(np.arange(len(data_packed))[::-1 if reverse else 1],
    198               index=data_packed))
    199 if reverse:
    200     agg = agg[::-1]

File ~/mambaforge/envs/plots/lib/python3.9/site-packages/pandas/core/series.py:4539, in Series.map(self, arg, na_action)
   4460 def map(
   4461     self,
   4462     arg: Callable | Mapping | Series,
   4463     na_action: Literal["ignore"] | None = None,
   4464 ) -> Series:
   4465     """
   4466     Map values of Series according to an input mapping or function.
   4467
   (...)
   4537     dtype: object
   4538     """
-> 4539     new_values = self._map_values(arg, na_action=na_action)
   4540     return self._constructor(new_values, index=self.index).__finalize__(
   4541         self, method="map"
   4542     )

File ~/mambaforge/envs/plots/lib/python3.9/site-packages/pandas/core/base.py:862, in IndexOpsMixin._map_values(self, mapper, na_action)
    858     return cat.map(mapper)
    860 values = self._values
--> 862 indexer = mapper.index.get_indexer(values)
    863 new_values = algorithms.take_nd(mapper._values, indexer)
    865 return new_values

File ~/mambaforge/envs/plots/lib/python3.9/site-packages/pandas/core/indexes/base.py:3905, in Index.get_indexer(self, target, method, limit, tolerance)
   3902 self._check_indexing_method(method, limit, tolerance)
   3904 if not self._index_as_unique:
-> 3905     raise InvalidIndexError(self._requires_unique_msg)
   3907 if len(target) == 0:
   3908     return np.array([], dtype=np.intp)

InvalidIndexError: Reindexing only valid with uniquely valued Index objects

I have no idea where the error comes from, given that I use pandas.value_counts, which should give me a unique index. Oddly, the same way of constructing the value counts works fine on a smaller set of samples.

Unfortunately I couldn't reproduce the error on a random dataset, which makes it all the more confusing.

import numpy as np
import pandas as pd
from upsetplot import UpSet


df = pd.DataFrame(np.random.choice([True, False], size=(100000, 1000)))
df.loc[:, :990] = True

UpSet(df.value_counts())

RockyCanyon commented Jan 3, 2023

I am able to replicate it when using the generate_samples function:

import matplotlib.pyplot as plt
import pandas as pd
from upsetplot import UpSet, generate_samples

test = generate_samples(n_samples=100, n_categories=90)
UpSet(test).plot()
plt.show()

@jnothman is this a limitation on how many categories/sets you can plot? Using n_categories=40 in the above code works. I'm working on data sets where there could be 150+ memberships per element, e.g. you have 10 people, each with 150 different characteristics, and you want the overlap among the 10 people.

jnothman (Owner) commented Jan 5, 2023

I can't reproduce with your snippet, @RockyCanyon, but I can with @mumichae's.

I've got a fix in #202.

The issue here was integer overflow. We create an integer representation of the binary sequence indicated by the category membership masks. With more than 63 categories the int64 value overflows, first into negative-number territory. In the case of temp2_big.csv, the trailing 64+ columns are all False, so repeatedly multiplying an already-overflowed number by 2 and adding 0 produced the integer 0 for both rows of the dataset. Thus we crashed on duplicate values.
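The failure mode can be sketched as below. Note this is an editor's reconstruction of the packing scheme described above, not upsetplot's actual `_pack_binary`, and the object-dtype variant is only one possible remedy, not necessarily what #202 implements:

```python
import numpy as np
import pandas as pd

def pack_binary_int64(df):
    """Fold each row's boolean mask into a single int64 via repeated
    'multiply by 2, add bit' -- numpy wraps silently on overflow."""
    acc = pd.Series(0, index=df.index, dtype=np.int64)
    for col in df.columns:
        acc = acc * 2 + df[col].astype(np.int64)
    return acc

def pack_binary_exact(df):
    """Same folding, but with an object-dtype accumulator so Python's
    arbitrary-precision ints are used and overflow cannot occur."""
    acc = pd.Series(0, index=df.index, dtype=object)
    for col in df.columns:
        acc = acc * 2 + df[col].astype(int)
    return acc

# Two distinct 70-category masks whose trailing 64+ columns are all False:
masks = pd.DataFrame([[True] + [False] * 69,
                      [False, True] + [False] * 68])

bad = pack_binary_int64(masks)   # both rows wrap around to 0 -> duplicates
good = pack_binary_exact(masks)  # 2**69 and 2**68 -> still distinct
```

Once the leading (distinct) bits have been shifted past bit 63, multiplying by 2 another 64 times leaves every int64 value at 0, so the two rows collide and the downstream `.map()` over a now non-unique index raises the InvalidIndexError seen above.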
