
python3: regression for unique on dtype=object arrays with varying items types (Trac #2188) #641

Open
numpy-gitbot opened this issue Oct 19, 2012 · 19 comments


@numpy-gitbot numpy-gitbot commented Oct 19, 2012

Original ticket http://projects.scipy.org/numpy/ticket/2188 on 2012-07-23 by @yarikoptic, assigned to unknown.

tested against current master (present in 1.6.2 as well):

With the python2.x series it works OK:

$> python2.7 -c 'import numpy as np; print repr(repr(np.unique(np.array([1,2, None, "str"]))))' 
'array([None, 1, 2, str], dtype=object)'

NB I will report a bug on repr here separately if not yet filed

it fails with python3.x altogether:

$> python3.2 -c 'import numpy as np; print(repr(repr(np.unique(np.array([1,2,None, "str"])))))'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python3.2/dist-packages/numpy/lib/arraysetops.py", line 194, in unique
    ar.sort()
TypeError: unorderable types: int() > NoneType()

whereas IMHO it should operate correctly -- the semantics of unique() should not require that the elements be sortable

Contributor

@yarikoptic yarikoptic commented Sep 17, 2013

any fresh ideas on this issue?

Member

@njsmith njsmith commented Sep 17, 2013

The only options for implementing unique are:

  • sorting the array
  • putting everything in a hash table
  • doing brute-force == comparisons of all objects against all objects

Only the sorting and hashing strategies have reasonable speed, and only the sorting and all-against-all strategies have reasonable memory overhead for large arrays. So I guess we could add fallback options to unique where if sorting doesn't work then it tries one of the other strategies? But OTOH it's not nice to have a function that sometimes suddenly takes massively more CPU or memory depending on what input you give it.

I guess I'd be +1 on a patch that adds a strategy={"sort", "hash", "bruteforce"} option to np.unique, so users with weird data can decide on what makes sense for their situation. If you want to write such a thing :-)
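For illustration, a minimal sketch of the hash strategy (the helper name is hypothetical, not NumPy API), assuming all elements are hashable:

```python
import numpy as np

def unique_by_hash(arr):
    # dict preserves insertion order (Python 3.7+), so the first
    # occurrence of each value is kept; only hashability is required,
    # elements are never compared with <
    return np.array(list(dict.fromkeys(arr.ravel())), dtype=object)

print(unique_by_hash(np.array([1, 2, None, "str", 2, None], dtype=object)))
# [1 2 None 'str']
```

This is O(n) and never sorts, but it raises TypeError for unhashable elements such as lists.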

Contributor

@yarikoptic yarikoptic commented Sep 17, 2013

at first I wondered if it could be sorting + a hash table for unsortable items (I didn't check whether __cmp__ of the elements is used when sorting a dtype=object array), so that sorting via __cmp__ could position them in 'first-come-first-in-line' order?
but then I realized that this doesn't provide relief in general for uncomparable types, e.g. a mix of int and str... so I wondered whether, for dtype=object, it would be feasible to first deduce the participating dtypes and 'unique' (possibly via sort) within each dtype, relying on hash tables for dtypes without __cmp__?

Contributor

@yarikoptic yarikoptic commented Sep 17, 2013

just for those who might need a workaround, here is how I do it via 'hashing' through the built-in sets for my case:

$> python3.3 -c 'import numpy as np; print(np.array(list(set([1,2,"str", None])), dtype=object))' 
[None 1 2 'str']
Member

@njsmith njsmith commented Sep 17, 2013

Not sure what you just said :-)

But on further thought, sorting is really not reliable for dtype=object
anyway. I've probably written dozens of classes that override __eq__ but
keep the default __lt__, which means that sort-based unique will just
silently return the wrong answer. This is a pretty nasty bug, I guess.

If the objects are hashable then you can just do set(arr) to get the unique
elements, but there's no guarantee that they're hashable in general. (But at
least everyone agrees that for hashable objects this should work, which is
not true for sorting.) Maybe this would be a better default implementation
of np.unique for object arrays.
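The silent-wrong-answer scenario can be demonstrated with a hypothetical class (not from any library) whose __eq__ and __lt__ disagree -- case-insensitive equality, case-sensitive ordering:

```python
import numpy as np

class Tag:
    def __init__(self, s):
        self.s = s
    def __eq__(self, other):
        return self.s.lower() == other.s.lower()  # case-insensitive equality
    def __lt__(self, other):
        return self.s < other.s                   # case-sensitive ordering (used by sort)
    def __repr__(self):
        return "Tag(%r)" % self.s

# 'B' sorts before both 'a' and 'b', so the ==-equal Tag('B') and Tag('b')
# are not adjacent after sorting, and sort-based unique keeps both:
u = np.unique(np.array([Tag("B"), Tag("a"), Tag("b")], dtype=object))
print(len(u))  # 3, even though only 2 values are distinct under ==
```

No error is raised; the duplicate simply survives, which is exactly the silent failure described above.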


Contributor

@yarikoptic yarikoptic commented Sep 17, 2013

gy... ok -- a crude description in Python:

import numpy as np

def bucketed_unique(a):
    buckets = {}
    for x in a:
        t = type(x)
        if t not in buckets:
            buckets[t] = bucket = []
        else:
            bucket = buckets[t]
        bucket.append(x)
    out = []
    for bucket in buckets.values():
        # here could actually be a set of conditions instead of a blind try/except
        try:
            out.append(np.unique(bucket))
        except TypeError:
            out.append(np.array(list(set(bucket)), dtype=object))
    return np.hstack(out)

print(bucketed_unique([1, 2, 'str', None, np.nan, None, np.inf, int]))
# e.g. [1 2 'str' None <class 'int'> nan inf] (bucket/set order may vary)
Contributor

@yarikoptic yarikoptic commented Sep 17, 2013

sure thing -- no 'bucketing' should be done for non-object ndarrays

Member

@njsmith njsmith commented Sep 17, 2013

That algorithm doesn't use == as its definition of uniqueness. Objects of
different types can be == (easy example: 1 and 1.0), so its definition
doesn't correspond to any standard Python concept.
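A two-line illustration (plain Python) of how per-type buckets and == disagree:

```python
xs = [1, 1.0]                      # distinct types, but 1 == 1.0 and hash(1) == hash(1.0)
print(len(set(xs)))                # 1 -- ==/hash-based unique merges them
print(len({type(x) for x in xs}))  # 2 -- type bucketing would keep both
```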

Contributor

@yarikoptic yarikoptic commented Sep 17, 2013

indeed! I'm not sure, but maybe a post-hoc analysis across buckets would make sense... BTW, at the moment the problem also reveals itself for comparisons with complex numbers:

$> python3.3 -c 'import numpy as np; print(np.unique(np.array([1, 1.0, 1+0j], dtype=object)))'  
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/numpy/lib/arraysetops.py", line 194, in unique
    ar.sort()
TypeError: unorderable types: float() > complex()

$> python -c 'import numpy as np; print(np.unique(np.array([1, 1.0, 1+0j], dtype=object)))' 
[1]
Contributor

@yarikoptic yarikoptic commented Sep 17, 2013

although, on second thought -- what should the dtype of the 'unique' value then be among all available choices (int/float/complex)? with a non-object array it is clear... with a heterogeneous object array, not so -- maybe the different dtypes should even be maintained as such...


@jreback jreback commented Feb 1, 2014

Here's the way I solved argsort blowing up on mixed int/str in py3: https://github.com/pydata/pandas/pull/6222/files

  • order the ints before the strings in object dtypes
  • use a hashtable to map the locations to get the indexer
  • reasonably fast, I think

It uses the pandas hashtable implementation, but could easily be swapped out / adapted to C code, I think.
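The "ints before strings" idea can be sketched in pure Python with a tuple sort key -- a simplification for illustration (not the actual pandas implementation), assuming only numbers and strings occur:

```python
import numpy as np

def mixed_sorted_unique(arr):
    # rank numbers before strings so < is only ever applied within a
    # mutually orderable group, then drop adjacent duplicates as np.unique does
    def key(x):
        if isinstance(x, str):
            return (1, x, 0.0)   # strings after numbers, compared among themselves
        return (0, "", x)        # numbers first, compared among themselves
    s = sorted(arr, key=key)
    out = s[:1] + [b for a, b in zip(s, s[1:]) if b != a]
    return np.array(out, dtype=object)

print(mixed_sorted_unique(np.array([2, "b", 1, "a", 2], dtype=object)))
# [1 2 'a' 'b']
```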

Member

@charris charris commented Feb 16, 2014

Anyone want to take a swing at this? Not sure what to do about record dtypes.


@sobayed sobayed commented Oct 24, 2016

Any updates on this? I encountered this bug when trying to use scikit-learn's LabelEncoder on pandas DataFrame columns with dtype object and missing values.

nbara pushed a commit to nbara/mne-python that referenced this issue Apr 19, 2017
- python3 requires decoding the bytes objects to produce strings
- np.unique() fails for dtype=object in python3 (see
numpy/numpy#641)
- flake, pydocstyle
larsoner added a commit to mne-tools/mne-python that referenced this issue Apr 28, 2017

@ron819 ron819 commented Nov 13, 2018

This one is really old. Is it still relevant?

Contributor

@yarikoptic yarikoptic commented Nov 13, 2018

seems to still be the case, at least with 1.15.4 in Debian:

$> python3 --version
Python 3.6.5

$> PYTHONPATH=.. python3 -c 'import numpy as np; print(np.__version__); print(repr(repr(np.unique(np.array([1,2,None, "str"])))))'                                                                                   
1.15.4
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/numpy/lib/arraysetops.py", line 233, in unique
    ret = _unique1d(ar, return_index, return_inverse, return_counts)
  File "/usr/lib/python3/dist-packages/numpy/lib/arraysetops.py", line 281, in _unique1d
    ar.sort()
TypeError: '<' not supported between instances of 'NoneType' and 'int'

@acdr000 acdr000 commented Feb 18, 2019

Definitely still relevant. Just came across this, trying to call np.unique(x, return_inverse=True) on an object array.

Regarding the question of how to make this work when sorting is undefined: I'd much prefer a slow algorithm over the status quo of raising an error. (In my experience, oftentimes, if you need performant algorithms, you shouldn't be using an object array to begin with.)
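For reference, the slow all-against-all strategy mentioned earlier in the thread needs only ==, so it handles unorderable and even unhashable elements; a hypothetical O(n²) sketch, assuming each element's == returns a plain bool:

```python
import numpy as np

def unique_bruteforce(arr):
    # quadratic scan keeping first occurrences; only == is ever used,
    # so neither ordering nor hashability is required
    out = []
    for x in arr.ravel():
        if not any(x == y for y in out):
            out.append(x)
    return np.array(out, dtype=object)

print(unique_bruteforce(np.array([1, 2, None, "str", None, 1], dtype=object)))
# [1 2 None 'str']
```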

Member

@eric-wieser eric-wieser commented Feb 18, 2019

I think this is a feature request, not a bug. The docs clearly state:

Returns the sorted unique elements of an array.

For the case of an array like [1, None], no such sorted array exists in python 3 since sorting is no longer well-defined.

Member

@charris charris commented Feb 18, 2019

It would be nice to have an option to not return a sorted array, it would allow some optimizations.


@mcgfeller mcgfeller commented Mar 24, 2021

This was the most bothersome regression in porting about 150K lines from Py2.7 to 3.8. Navigating around it led to a lot of ugly workarounds and slower code. I see that this is caused by Guido's decision in Python, but a fast, non-sorted option for unique would be really helpful.
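In the meantime, a fast, non-sorted, first-occurrence unique (with return_inverse support) can be built on a plain dict for hashable elements -- a sketch with a hypothetical helper name, not NumPy API:

```python
import numpy as np

def unique_unsorted(arr, return_inverse=False):
    pos = {}  # value -> index in order of first appearance
    # setdefault evaluates len(pos) before inserting, so a new value
    # gets the next free slot and a seen value returns its old slot
    inverse = [pos.setdefault(x, len(pos)) for x in arr.ravel()]
    uniq = np.array(list(pos), dtype=object)
    if return_inverse:
        return uniq, np.array(inverse)
    return uniq

u, inv = unique_unsorted(np.array([1, 2, None, 2, "str"], dtype=object),
                         return_inverse=True)
print(u)    # [1 2 None 'str']
print(inv)  # [0 1 2 1 3]
```

As with np.unique, u[inv] reconstructs the original array; unlike np.unique, the result order is first appearance rather than sorted, so no < comparisons ever happen.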
