
UpSet plot fails for certain number of entries in input #193

Closed
mumichae opened this issue Nov 21, 2022 · 2 comments · Fixed by #202

mumichae commented Nov 21, 2022

For some datasets with a large number of sets I get the following error.

temp2_big.csv

import pandas as pd
from upsetplot import UpSet

tmp = pd.read_table('temp2_big.tsv')
tmp = tmp.set_index([x for x in tmp.columns if x != '0'])

tmp.index.shape
Out: (2,)

len(tmp.index[0])
Out: 275

UpSet(tmp)
Error Traceback
---------------------------------------------------------------------------
InvalidIndexError                         Traceback (most recent call last)
Cell In [522], line 1
----> 1 UpSet(tmp)

File ~/mambaforge/envs/plots/lib/python3.9/site-packages/upsetplot/plotting.py:416, in UpSet.__init__(self, data, orientation, sort_by, sort_categories_by, subset_size, sum_over, min_subset_size, max_subset_size, min_degree, max_degree, facecolor, other_dots_color, shading_color, with_lines, element_size, intersection_plot_elements, totals_plot_elements, show_counts, show_percentages)
    412 self._show_counts = show_counts
    413 self._show_percentages = show_percentages
    415 (self.total, self._df, self.intersections,
--> 416  self.totals) = _process_data(data,
    417                               sort_by=sort_by,
    418                               sort_categories_by=sort_categories_by,
    419                               subset_size=subset_size,
    420                               sum_over=sum_over,
    421                               min_subset_size=min_subset_size,
    422                               max_subset_size=max_subset_size,
    423                               min_degree=min_degree,
    424                               max_degree=max_degree,
    425                               reverse=not self._horizontal)
    426 self.subset_styles = [{"facecolor": facecolor}
    427                       for i in range(len(self.intersections))]
    428 self.subset_legend = []

File ~/mambaforge/envs/plots/lib/python3.9/site-packages/upsetplot/plotting.py:196, in _process_data(df, sort_by, sort_categories_by, subset_size, sum_over, min_subset_size, max_subset_size, min_degree, max_degree, reverse)
    194 df_packed = _pack_binary(df.index.to_frame())
    195 data_packed = _pack_binary(agg.index.to_frame())
--> 196 df['_bin'] = pd.Series(df_packed).map(
    197     pd.Series(np.arange(len(data_packed))[::-1 if reverse else 1],
    198               index=data_packed))
    199 if reverse:
    200     agg = agg[::-1]

File ~/mambaforge/envs/plots/lib/python3.9/site-packages/pandas/core/series.py:4539, in Series.map(self, arg, na_action)
   4460 def map(
   4461     self,
   4462     arg: Callable | Mapping | Series,
   4463     na_action: Literal["ignore"] | None = None,
   4464 ) -> Series:
   4465     """
   4466     Map values of Series according to an input mapping or function.
   4467
   (...)
   4537     dtype: object
   4538     """
-> 4539     new_values = self._map_values(arg, na_action=na_action)
   4540     return self._constructor(new_values, index=self.index).__finalize__(
   4541         self, method="map"
   4542     )

File ~/mambaforge/envs/plots/lib/python3.9/site-packages/pandas/core/base.py:862, in IndexOpsMixin._map_values(self, mapper, na_action)
    858     return cat.map(mapper)
    860 values = self._values
--> 862 indexer = mapper.index.get_indexer(values)
    863 new_values = algorithms.take_nd(mapper._values, indexer)
    865 return new_values

File ~/mambaforge/envs/plots/lib/python3.9/site-packages/pandas/core/indexes/base.py:3905, in Index.get_indexer(self, target, method, limit, tolerance)
   3902 self._check_indexing_method(method, limit, tolerance)
   3904 if not self._index_as_unique:
-> 3905     raise InvalidIndexError(self._requires_unique_msg)
   3907 if len(target) == 0:
   3908     return np.array([], dtype=np.intp)

InvalidIndexError: Reindexing only valid with uniquely valued Index objects

I have no idea where the error comes from, given that I use pandas.value_counts, which should give me a unique index. Oddly, the same way of constructing the value counts works fine on a smaller set of samples.

Unfortunately I couldn't reproduce the error on a random dataset, which makes it all the more confusing.

import numpy as np
import pandas as pd
from upsetplot import UpSet


df = pd.DataFrame(np.random.choice([True, False], size=(100000, 1000)))
df.loc[:, :990] = True

UpSet(df.value_counts())

RockyCanyon commented Jan 3, 2023

I am able to replicate it when using the generate_samples function:

import matplotlib.pyplot as plt
import pandas as pd
from upsetplot import UpSet, generate_samples

test = generate_samples(n_samples=100, n_categories=90)
UpSet(test).plot()
plt.show()

@jnothman is this a limitation on how many categories/sets you can plot? Using n_categories=40 in the above code works. I'm working on data sets where there could be 150+ memberships per element, e.g. you have 10 people, each with 150 different characteristics, and you want the overlap among the 10 people.

jnothman (Owner) commented Jan 5, 2023

I can't reproduce with your snippet, @RockyCanyon, but I can with @mumichae's.

I've got a fix in #202.

The issue here was integer overflow. We create an integer representation of the binary sequence indicated by the category membership masks. With more than 63 categories the int64 value overflows, first into negative-number territory. In the case of temp2_big.csv, the trailing 64+ columns are all False, so repeatedly multiplying an already-overflowed number by 2 and adding 0 produced the integer 0 for both rows of the dataset. Thus we crashed on duplicate values.
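The failure mode can be sketched as below. Note this is an editor's reconstruction of the packing scheme described above, not upsetplot's actual `_pack_binary`, and the object-dtype variant is only one possible remedy, not necessarily what #202 implements:

```python
import numpy as np
import pandas as pd

def pack_binary_int64(df):
    """Fold each row's boolean mask into a single int64 via repeated
    'multiply by 2, add bit' -- numpy wraps silently on overflow."""
    acc = pd.Series(0, index=df.index, dtype=np.int64)
    for col in df.columns:
        acc = acc * 2 + df[col].astype(np.int64)
    return acc

def pack_binary_exact(df):
    """Same folding, but with an object-dtype accumulator so Python's
    arbitrary-precision ints are used and overflow cannot occur."""
    acc = pd.Series(0, index=df.index, dtype=object)
    for col in df.columns:
        acc = acc * 2 + df[col].astype(int)
    return acc

# Two distinct 70-category masks whose trailing 64+ columns are all False:
masks = pd.DataFrame([[True] + [False] * 69,
                      [False, True] + [False] * 68])

bad = pack_binary_int64(masks)   # both rows wrap around to 0 -> duplicates
good = pack_binary_exact(masks)  # 2**69 and 2**68 -> still distinct
```

Once the leading (distinct) bits have been shifted past bit 63, multiplying by 2 another 64 times leaves every int64 value at 0, so the two rows collide and the downstream `.map()` over a now non-unique index raises the InvalidIndexError seen above.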
