BUG/PERF: Sort mixed-int in Py3, fix Index.difference #13514

Closed
wants to merge 1 commit into
from

Conversation

Projects
None yet
3 participants
Contributor

pijucha commented Jun 26, 2016

  • fixes some issues from #13432
  • closes #12044
  • closes #12814
  • tests added / passed
  • passes git diff upstream/master | flake8 --diff
  • whatsnew entry

  1. Added an internal safe_sort to safely sort mixed-integer
    arrays in Python3.
  2. Changed Index.difference and Index.symmetric_difference
    in order to:
    • sort mixed-int Indexes (#13432)
    • improve performance (#12044)
  3. Fixed DataFrame.join which raised in Python3 with mixed-int
    non-unique indexes (issue with sorting mixed-ints, #12814)
  4. Fixed Index.union returning an empty Index when one of
    arguments was a named empty Index (#13432)

Benchmarks (for index_object only):

]$ asv compare bf47 a02f -f 1.5 -s 
Benchmarks that have improved:
    before     after       ratio
  [bf47    ] [a02f    ]
-  155.66ms    10.32ms      0.07  index_object.index_datetime_set_difference.time_index_datetime_difference
-  154.66ms     2.98ms      0.02  index_object.index_datetime_set_difference.time_index_datetime_difference_disjoint
-  391.45ms    10.65ms      0.03  index_object.index_datetime_set_difference.time_index_datetime_symmetric_difference
-  195.55ms    88.95ms      0.45  index_object.index_int64_set_difference.time_index_int64_difference
-  440.93ms   103.65ms      0.24  index_object.index_int64_set_difference.time_index_int64_symmetric_difference
-    8.77ms     4.88ms      0.56  index_object.index_str_set_difference.time_str_symmetric_difference

@pijucha pijucha and 1 other commented on an outdated diff Jun 26, 2016

pandas/indexes/base.py
@@ -1923,14 +1924,38 @@ def difference(self, other):
other, result_name = self._convert_can_do_setop(other)
- theDiff = sorted(set(self) - set(other))
- return Index(theDiff, name=result_name)
+ this = self._get_unique_index()
+
+ # In what follows, get_indexer doesn't treat NaN's specially,
+ # so it would break the existing behavior, e.g.:
+ # `Index([1, nan]).difference(Index[nan]) == Index([nan, 1])`.
+ # Thus, dropping NaN's from `other` is a hack for backward compat.
+ # We do it here instead of manipulating `the_diff` later.
+ # (Check for MultiIndex is to get around a .hasnans exception)
+ dropna = not isinstance(self, ABCMultiIndex) and \
+ not isinstance(other, ABCMultiIndex) and \
+ self.hasnans and other.hasnans
+ other = other._get_unique_index(dropna=dropna)
@pijucha

pijucha Jun 26, 2016

Contributor

I would gladly remove this dropna stuff and break backward compatibility (so that NaN's are treated as any other element). Would it be dangerous?

@jreback

jreback Jun 26, 2016

Contributor

I think it would be ok to remove the nan stuff. does it still work? e.g. prob need some more tests that have nans in one/both sides

@pijucha

pijucha Jun 26, 2016 edited

Contributor

Removing dropna would make it treat NaN's as any other elements, so the following would be true

pd.Index([1, np.nan]).difference(pd.Index([2, np.nan])) == pd.Index([1])

Similarly for symmetric_difference. It would simplify the code and be consistent with intersection. So, I'm for it.

It would break one test that checks for nan's in symmetric_difference https://github.com/pydata/pandas/blob/master/pandas/tests/indexes/test_base.py#L771
(related to #6444 - sorting of nan's).

@jreback

jreback Jun 27, 2016

Contributor

that looks fine and more correct. Add an example in whatsnew for this. As an aside there are docs source/indexing.rst which may need updating.

@pijucha pijucha and 1 other commented on an outdated diff Jun 26, 2016

pandas/core/algorithms.py
+ """
+ def sort_mixed(values):
+ # order ints before strings, safe in py3
+ str_pos = np.array([isinstance(x, string_types) for x in values],
+ dtype=bool)
+ nums = np.sort(values[~str_pos])
+ strs = np.sort(values[str_pos])
+ return com._ensure_object(np.concatenate([nums, strs]))
+
+ sorter = None
+ try:
+ sorter = values.argsort()
+ ordered = values.take(sorter)
+ except:
+ # unorderable in py3 if mixed str/int
+ ordered = sort_mixed(values)
@pijucha

pijucha Jun 26, 2016

Contributor

Here, I also tried:

if compat.PY3 and lib.infer_dtype(values) == 'mixed-integer':
    ordered = sort_mixed(values)
else:
    try:
        sorter = values.argsort()
        ordered = values.take(sorter)
    except:
         ordered = sort_mixed(values)

which is about 2-5 x faster in python3 with mixed-typed arrays (tested with arrays of size of order 5*10e4).
I was wary of premature optimization though.
Is it ok use this code here?

@jreback

jreback Jun 26, 2016

Contributor

this would be fine. the inference routines are pretty fast as they short circuit.

@jreback jreback commented on the diff Jun 26, 2016

pandas/core/algorithms.py
@@ -142,6 +142,71 @@ def isin(comps, values):
return f(comps, values)
+def safe_sort(values, labels=None, na_sentinel=-1, assume_unique=False):
+ """
+ Sort ``values`` and reorder corresponding ``labels``.
+ ``values`` should be unique if ``labels`` is not None.
+ Safe for use with mixed types (int, str), orders ints before strs.
+
@jreback

jreback Jun 26, 2016

Contributor

add a versionadded tag

@pijucha

pijucha Jun 26, 2016

Contributor

OK. Should I also add an entry in whatsnew, New Features?

@jreback

jreback Jun 27, 2016

Contributor

no this is an internal function.

@jreback jreback commented on an outdated diff Jun 26, 2016

pandas/core/algorithms.py
+ nums = np.sort(values[~str_pos])
+ strs = np.sort(values[str_pos])
+ return com._ensure_object(np.concatenate([nums, strs]))
+
+ sorter = None
+ try:
+ sorter = values.argsort()
+ ordered = values.take(sorter)
+ except:
+ # unorderable in py3 if mixed str/int
+ ordered = sort_mixed(values)
+
+ if labels is None:
+ return ordered
+
+ if not assume_unique and len(values) != len(set(values)):
@jreback

jreback Jun 26, 2016

Contributor

make sure testing for this

@jreback jreback commented on an outdated diff Jun 26, 2016

pandas/core/algorithms.py
+ -------
+ ordered : ndarray
+ Sorted ``values``
+ new_labels : ndarray
+ Reordered ``labels``; returned when ``labels`` is not None
+ """
+ def sort_mixed(values):
+ # order ints before strings, safe in py3
+ str_pos = np.array([isinstance(x, string_types) for x in values],
+ dtype=bool)
+ nums = np.sort(values[~str_pos])
+ strs = np.sort(values[str_pos])
+ return com._ensure_object(np.concatenate([nums, strs]))
+
+ sorter = None
+ try:
@jreback

jreback Jun 26, 2016

Contributor

I think you need something like

if not is_list_like(values):
    raise ValueError(.....must be a list-like)
values = np.array(values, copy=False)

same for labels, these are cheap checks that you have ndarrays passed (e.g. a Series would work fine here).

alternatively, you can simply check if its an ndarray and raise otherwise (this fits your guarantees as well).

@jreback jreback commented on an outdated diff Jun 26, 2016

pandas/core/algorithms.py
+ new_labels : ndarray
+ Reordered ``labels``; returned when ``labels`` is not None
+ """
+ def sort_mixed(values):
+ # order ints before strings, safe in py3
+ str_pos = np.array([isinstance(x, string_types) for x in values],
+ dtype=bool)
+ nums = np.sort(values[~str_pos])
+ strs = np.sort(values[str_pos])
+ return com._ensure_object(np.concatenate([nums, strs]))
+
+ sorter = None
+ try:
+ sorter = values.argsort()
+ ordered = values.take(sorter)
+ except:
@jreback

jreback Jun 26, 2016

Contributor

ideally list the exception that you are catching

@jreback jreback commented on the diff Jun 26, 2016

pandas/core/algorithms.py
- t.map_locations(com._ensure_object(uniques))
-
- # order ints before strings
- ordered = np.concatenate([
- np.sort(np.array([e for i, e in enumerate(uniques) if f(e)],
- dtype=object)) for f in
- [lambda x: not isinstance(x, string_types),
- lambda x: isinstance(x, string_types)]])
- sorter = com._ensure_platform_int(t.lookup(
- com._ensure_object(ordered)))
-
- reverse_indexer = np.empty(len(sorter), dtype=np.int_)
- reverse_indexer.put(sorter, np.arange(len(sorter)))
-
- mask = labels < 0
- labels = reverse_indexer.take(labels)
@jreback

jreback Jun 26, 2016

Contributor

+1

@jreback jreback commented on the diff Jun 26, 2016

pandas/indexes/base.py
@@ -1750,7 +1750,7 @@ def _get_consensus_name(self, other):
else:
name = None
if self.name != name:
- return other._shallow_copy(name=name)
@jreback

jreback Jun 26, 2016

Contributor

really? must have not been tested well

@pijucha

pijucha Jun 26, 2016

Contributor

That's right. Only names are tested.

@jreback jreback commented on an outdated diff Jun 26, 2016

pandas/indexes/base.py
@@ -1923,14 +1924,38 @@ def difference(self, other):
other, result_name = self._convert_can_do_setop(other)
- theDiff = sorted(set(self) - set(other))
- return Index(theDiff, name=result_name)
+ this = self._get_unique_index()
+
+ # In what follows, get_indexer doesn't treat NaN's specially,
+ # so it would break the existing behavior, e.g.:
+ # `Index([1, nan]).difference(Index[nan]) == Index([nan, 1])`.
+ # Thus, dropping NaN's from `other` is a hack for backward compat.
+ # We do it here instead of manipulating `the_diff` later.
+ # (Check for MultiIndex is to get around a .hasnans exception)
+ dropna = not isinstance(self, ABCMultiIndex) and \
+ not isinstance(other, ABCMultiIndex) and \
+ self.hasnans and other.hasnans
@jreback

jreback Jun 26, 2016

Contributor

instead of this check, this is more idiomatic

try:
    other = other._get_unique_index(dropna=self.hasnan and other.hasnan)
except NotImplementeError:
   pass

@jreback jreback commented on an outdated diff Jun 26, 2016

pandas/indexes/base.py
+ # We do it here instead of manipulating `the_diff` later.
+ # (Check for MultiIndex is to get around a .hasnans exception)
+ dropna = not isinstance(self, ABCMultiIndex) and \
+ not isinstance(other, ABCMultiIndex) and \
+ self.hasnans and other.hasnans
+ other = other._get_unique_index(dropna=dropna)
+
+ indexer = this.get_indexer(other)
+ indexer = indexer.take((indexer != -1).nonzero()[0])
+
+ label_diff = np.setdiff1d(np.arange(this.size), indexer,
+ assume_unique=True)
+ the_diff = this.values.take(label_diff)
+ try:
+ the_diff = algos.safe_sort(the_diff)
+ except:
@jreback

jreback Jun 26, 2016

Contributor

name this exception if you can

@jreback jreback commented on an outdated diff Jun 26, 2016

pandas/core/algorithms.py
+ ``values`` should be unique if ``labels`` is not None.
+ Safe for use with mixed types (int, str), orders ints before strs.
+
+ Parameters
+ ----------
+ values : ndarray (1-d)
+ Sequence; must be unique if ``labels`` is not None.
+ labels : ndarray (1-d)
+ Indices to ``values``
+ na_sentinel : int, default -1
+ Value in ``labels`` to mark "not found".
+ Ignored when ``labels`` is None.
+ assume_unique : bool, default False
+ When True, doesn't check for uniqness of ``values``.
+ Ignored when ``labels`` is None.
+
@jreback

jreback Jun 26, 2016

Contributor

add a Raises section and detail what exceptions this can raise

@jreback jreback commented on the diff Jun 26, 2016

pandas/core/algorithms.py
+ values : ndarray (1-d)
+ Sequence; must be unique if ``labels`` is not None.
+ labels : ndarray (1-d)
+ Indices to ``values``
+ na_sentinel : int, default -1
+ Value in ``labels`` to mark "not found".
+ Ignored when ``labels`` is None.
+ assume_unique : bool, default False
+ When True, doesn't check for uniqness of ``values``.
+ Ignored when ``labels`` is None.
+
+ Returns
+ -------
+ ordered : ndarray
+ Sorted ``values``
+ new_labels : ndarray
@jreback

jreback Jun 26, 2016

Contributor

no spaces after the label

e.g. ordered: ndarray

@pijucha

pijucha Jul 3, 2016

Contributor

Are you sure about it? All docstrings seem to have spaces there.

@jreback

jreback Jul 3, 2016

Contributor

doc string params should be like:

ordered: ndarray
    Sorted ``values``
new_labels: ndarray

e.g. no space before the colon, 1 after; certainly possible other doc-strings are wrong

@jreback jreback commented on an outdated diff Jun 26, 2016

pandas/indexes/base.py
@@ -1967,8 +1992,33 @@ def symmetric_difference(self, other, result_name=None):
if result_name is None:
result_name = result_name_update
- the_diff = sorted(set((self.difference(other)).
- union(other.difference(self))))
+ this = self._get_unique_index()
+
+ # Dropping NaN's is a hack for NaN's backward compatibility
+ # (see the comment in `Index.difference`).
+ dropna = not isinstance(self, ABCMultiIndex) and \
@jreback

jreback Jun 26, 2016 edited

Contributor

use same idiom as above. you may want to factor out some of this to a shared (private) method for use in difference as well. as this looks like lots of duplicate code.

@jreback jreback commented on the diff Jun 26, 2016

pandas/indexes/base.py
@@ -1977,6 +2027,20 @@ def symmetric_difference(self, other, result_name=None):
sym_diff = deprecate('sym_diff', symmetric_difference)
+ def _get_unique_index(self, dropna=False):
@jreback

jreback Jun 26, 2016

Contributor

add a doc-string (even if this is private). Is this testsed?

@pijucha

pijucha Jun 26, 2016

Contributor

Not yet.

@jreback jreback and 1 other commented on an outdated diff Jun 26, 2016

pandas/tests/indexes/test_base.py
@@ -738,6 +742,13 @@ def test_difference(self):
self.assertEqual(len(result), 0)
self.assertEqual(result.name, first.name)
+ # mixed, GH 13432
+ idx1 = Index([0, 1, 'A', 'B'])
@jreback

jreback Jun 26, 2016

Contributor

these tests don't test very much; only simple cases for a basic Index. The real tests are in indexes/common.py where ALL indexes are tested.

@pijucha

pijucha Jun 26, 2016

Contributor

OK, I see it now.

@jreback jreback commented on an outdated diff Jun 26, 2016

pandas/tests/test_groupby.py
@@ -3169,6 +3169,12 @@ def test_groupby_nonstring_columns(self):
expected = df.groupby(df[0]).mean()
assert_frame_equal(result, expected)
+ def test_groupby_mixed_type_columns(self):
+ # GH 13432, unorderable types in py3
+ df = DataFrame([[0, 1, 2]], columns=['A', 'B', 0])
+ df.groupby('A').first()
@jreback

jreback Jun 26, 2016

Contributor

give an expected output and assert that

@jreback jreback commented on an outdated diff Jun 26, 2016

pandas/tools/tests/test_merge.py
@@ -548,6 +548,16 @@ def test_join_many_non_unique_index(self):
assert_frame_equal(inner, left)
assert_frame_equal(inner, right)
+ def test_join_mixed_non_unique_index(self):
+ # GH 12814, unorderable types in py3 with a non-unique index
+ df1 = DataFrame({'a': [1, 2, 3, 4]}, index=[1, 2, 3, 'a'])
+ df2 = DataFrame({'b': [5, 6, 7, 8]}, index=[1, 3, 3, 4])
+ df1.join(df2)
+
@jreback

jreback Jun 26, 2016

Contributor

give an expected output and assert it

@jreback jreback commented on the diff Jun 26, 2016

pandas/tools/merge.py
@@ -1202,16 +1202,11 @@ def _sort_labels(uniques, left, right):
# tuplesafe
uniques = Index(uniques).values
- sorter = uniques.argsort()
+ l = len(left)
+ labels = np.concatenate([left, right])
@jreback

jreback Jun 26, 2016

Contributor

xref to this: pydata#4588

This may/may not be impacted by this issue (and that should be a separate PR to address anyhow)

@pijucha

pijucha Jun 26, 2016

Contributor

Not related with difference or symmetric_difference but may be with union.

I think that issue is due to indexes.api._union_indexes, which shows an inconsistent behaviour: it looks like it sorts almost always (even a single index) except if all passed indexes are equal.

@jreback

jreback Jun 27, 2016

Contributor

ok thats fine then

Contributor

jreback commented Jun 26, 2016

xref #13504

Contributor

pijucha commented Jun 27, 2016

@jreback Thanks for the comments.

codecov-io commented Jul 3, 2016 edited

Current coverage is 84.39%

Merging #13514 into master will increase coverage by 0.01%

@@             master     #13514   diff @@
==========================================
  Files           142        142          
  Lines         51223      51278    +55   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43224      43276    +52   
- Misses         7999       8002     +3   
  Partials          0          0          

Powered by Codecov. Last updated by 043879f...3a96089

@pijucha pijucha commented on the diff Jul 3, 2016

pandas/indexes/base.py
@@ -1977,6 +2009,36 @@ def symmetric_difference(self, other, result_name=None):
sym_diff = deprecate('sym_diff', symmetric_difference)
+ def _get_unique_index(self, dropna=False):
+ """
+ Returns an index containing unique values.
+
+ Parameters
+ ----------
+ dropna : bool
+ If True, NaN values are dropped.
@pijucha

pijucha Jul 3, 2016

Contributor

Actually, I no longer need a dropna parameter, but decided to keep it for now.

@pijucha pijucha and 1 other commented on an outdated diff Jul 3, 2016

pandas/tests/indexes/test_base.py
@@ -48,7 +48,8 @@ def setUp(self):
catIndex=tm.makeCategoricalIndex(100),
empty=Index([]),
tuples=MultiIndex.from_tuples(lzip(
- ['foo', 'bar', 'baz'], [1, 2, 3])))
+ ['foo', 'bar', 'baz'], [1, 2, 3])),
+ mixedIndex=Index([0, 'a', 1, 'b', 2, 'c']))
@pijucha

pijucha Jul 3, 2016

Contributor

Is it the right way to test it on difference and symmetric_difference? I needed to exclude it from some tests in common.py (as e.g. test_argsort). Still, it triggers RuntimeWarning in Index.union (maybe exclude it from test_union_base as well?)

@jreback

jreback Jul 3, 2016

Contributor

you can just skip those in the acutal test running if need be (this is done for example with MultiIndex a bit, where you are looping thru the various tests objects but skipping for MultiIndex).

@jreback jreback commented on an outdated diff Jul 3, 2016

doc/source/whatsnew/v0.18.2.txt
@@ -412,6 +412,32 @@ Furthermore:
- Passing duplicated ``percentiles`` will now raise a ``ValueError``.
- Bug in ``.describe()`` on a DataFrame with a mixed-dtype column index, which would previously raise a ``TypeError`` (:issue:`13288`)
+.. _whatsnew_0182.api.difference:
+
+``Index.difference`` and ``Index.symmetric_difference`` will now, more consistently, treat ``NaN`` values as amy other values.
@jreback

jreback Jul 3, 2016

Contributor

amy -> any

@jreback jreback commented on an outdated diff Jul 3, 2016

doc/source/whatsnew/v0.18.2.txt
@@ -412,6 +412,32 @@ Furthermore:
- Passing duplicated ``percentiles`` will now raise a ``ValueError``.
- Bug in ``.describe()`` on a DataFrame with a mixed-dtype column index, which would previously raise a ``TypeError`` (:issue:`13288`)
+.. _whatsnew_0182.api.difference:
+
+``Index.difference`` and ``Index.symmetric_difference`` will now, more consistently, treat ``NaN`` values as amy other values.
+
+.. ipython:: python
@jreback

jreback Jul 3, 2016

Contributor

add the issue number here at the end of the first sentence. (if no direct issue, then use the PR number)

@jreback jreback commented on an outdated diff Jul 3, 2016

doc/source/whatsnew/v0.18.2.txt
@@ -444,7 +470,7 @@ Performance Improvements
- Improved performance of float64 hash table operations, fixing some very slow indexing and groupby operations in python 3 (:issue:`13166`, :issue:`13334`)
- Improved performance of ``DataFrameGroupBy.transform`` (:issue:`12737`)
-
+- Improved performance of ``Index.difference`` (:issue: `12044`)
@jreback

jreback Jul 3, 2016

Contributor

no space after :issue:12004``

@jreback jreback and 1 other commented on an outdated diff Jul 3, 2016

doc/source/whatsnew/v0.18.2.txt
@@ -527,3 +553,5 @@ Bug Fixes
- Bug in ``Categorical.remove_unused_categories()`` changes ``.codes`` dtype to platform int (:issue:`13261`)
+- Bug in ``Index.union`` returns an incorrect result with a named empty index (:issue:`13432`)
+- Bugs in ``Index.difference`` and ``DataFrame.join`` raise in Python3 when using mixed-integer indexes (:issue:`13432`), (:issue:`12814`)
@jreback

jreback Jul 3, 2016

Contributor

do this as

(:issue:13432, :issue:12814)

@jreback

jreback Jul 6, 2016

Contributor

is there a specific test for the example in #12814 ? (the merge)

@pijucha

pijucha Jul 7, 2016

Contributor

Yes.

@jreback jreback commented on the diff Jul 3, 2016

pandas/core/algorithms.py
@@ -142,6 +142,103 @@ def isin(comps, values):
return f(comps, values)
+def safe_sort(values, labels=None, na_sentinel=-1, assume_unique=False):
+ """
+ Sort ``values`` and reorder corresponding ``labels``.
+ ``values`` should be unique if ``labels`` is not None.
+ Safe for use with mixed types (int, str), orders ints before strs.
+
+ .. versionadded:: 0.18.2
+
+ Parameters
+ ----------
+ values : list-like
+ Sequence; must be unique if ``labels`` is not None.
@jreback

jreback Jul 3, 2016

Contributor

do we check this? (the check is pretty cheap in any event)

@jreback jreback commented on the diff Jul 3, 2016

pandas/core/algorithms.py
+ ``values`` should be unique if ``labels`` is not None.
+ Safe for use with mixed types (int, str), orders ints before strs.
+
+ .. versionadded:: 0.18.2
+
+ Parameters
+ ----------
+ values : list-like
+ Sequence; must be unique if ``labels`` is not None.
+ labels : list_like
+ Indices to ``values``. All out of bound indices are treated as
+ "not found" and will be masked with ``na_sentinel``.
+ na_sentinel : int, default -1
+ Value in ``labels`` to mark "not found".
+ Ignored when ``labels`` is None.
+ assume_unique : bool, default False
@jreback

jreback Jul 3, 2016

Contributor

if we are checking do we need this?

@pijucha

pijucha Jul 4, 2016

Contributor

I added assume_unique only for performance. It looks like there's some gain:

vals = np.arange(10000)
np.random.shuffle(vals)

%timeit -n1 -r1 pd.core.algorithms.safe_sort(vals, labels=[1, 2, 3])
1 loops, best of 1: 3.82 ms per loop
%timeit -n1 -r1 pd.core.algorithms.safe_sort(vals, labels=[1, 2, 3], assume_unique=True)
1 loops, best of 1: 2.93 ms per loop

%timeit pd.core.algorithms.safe_sort(vals, labels=[1, 2, 3])
100 loops, best of 3: 1.85 ms per loop
%timeit pd.core.algorithms.safe_sort(vals, labels=[1, 2, 3], assume_unique=True)
1000 loops, best of 3: 816 µs per loop

A line profiler says the line len(values) != len(set(values)) takes about 50% of time in this setting. (For strings of the same size, it's up to 10%.)

If I use Index(values).is_unique instead, then this 50% dwindles to about 20% for ints. (There's not much difference for strings - pd.Index(values).unique may be even slightly slower than len(set(values)) for large arrays of strings.)

Should I remove it or let it stay?

@jreback

jreback Jul 6, 2016

Contributor

I guess that's fine. Just document that in the doc-string (e.g. so a reader knows that when assume_unique=True it is a guarantee). This is essentially an internal method anyhow.

@jreback jreback and 1 other commented on an outdated diff Jul 3, 2016

pandas/core/algorithms.py
+ ordered = values.take(sorter)
+ except TypeError:
+ # try this anyway
+ ordered = sort_mixed(values)
+
+ # labels:
+
+ if labels is None:
+ return ordered
+
+ if not com.is_list_like(labels):
+ raise TypeError("Only list-like objects or None are allowed to be"
+ "passed to safe_sort as labels")
+ labels = com._ensure_platform_int(np.asarray(labels))
+
+ if not assume_unique and len(values) != len(set(values)):
@jreback

jreback Jul 3, 2016

Contributor

you can also do Index(values).is_unique not sure of speed diff here.

@pijucha

pijucha Jul 4, 2016

Contributor

Yes, it's much faster for ints, no significant difference for strings.

@jreback jreback commented on the diff Jul 3, 2016

pandas/indexes/base.py
+ if self.is_unique and not dropna:
+ return self
+
+ values = self.values
+
+ if not self.is_unique:
+ values = self.unique()
+
+ if dropna:
+ try:
+ if self.hasnans:
+ values = values[~isnull(values)]
+ except NotImplementedError:
+ pass
+
+ return self._shallow_copy(values)
@jreback

jreback Jul 3, 2016 edited

Contributor

seems useful. Think about if we can use this elsewhere in the index code as well (in future).

@jreback jreback commented on an outdated diff Jul 3, 2016

pandas/tests/indexes/common.py
@@ -287,12 +287,58 @@ def test_duplicates(self):
self.assertEqual(result.name, 'foo')
self.assert_index_equal(result, Index([ind[0]], name='foo'))
+ def test_get_unique_index(self):
+ for ind in self.indices.values():
+
+ if not len(ind):
+ continue
+
+ idx = ind[[0] * 5]
+ # A workaround for MultiIndex
@jreback

jreback Jul 3, 2016

Contributor

you can just skip MultiIndex (and then add a specific test for it there).

@jreback jreback and 1 other commented on an outdated diff Jul 3, 2016

pandas/tests/indexes/common.py
@@ -325,6 +372,11 @@ def test_argsort(self):
def test_numpy_argsort(self):
for k, ind in self.indices.items():
+
+ # 'mixedIndex' unorderable in Python3
+ if k in ['mixedIndex']:
@jreback

jreback Jul 3, 2016

Contributor

yes this is all fine. Though you may want to add another test specifially dealing with mixedIndex, e.g. asserting that argsort raises and such.

@pijucha

pijucha Jul 4, 2016

Contributor

I rather wanted to skip this test because it raises only in python 3. But if you think it's worth adding, then I'll do it.

@jreback

jreback Jul 6, 2016

Contributor

no add a specific test for this. (and test in both py2 & py3) (obviously need an if). This is the behavior need to nail down.

@pijucha

pijucha Jul 7, 2016

Contributor

Ah, ok.

Contributor

jreback commented Jul 3, 2016

@pijucha looking really good. just a couple of comments.

@sinhrks if you can have a look.

Contributor

pijucha commented Jul 4, 2016

I just noticed that Index.sort_values may benefit from safe_sort. So maybe I'll rewrite it too.


Also, I was thinking about MultiIndex #13504. I see two possible solutions:

  1. make MultiIndex use Index.difference and symmetric_difference, and probably modify MultiIndex._shallow_copy (or .from_tuples)
  2. just copy Index.difference to MultiIndex.difference but without the sorting block (I think it's not needed, sortlevel will do the job).
    Solution 2 means some code duplication but it's faster.
    I don't want to interfere with the other PR, so I'll probably add a separate comment there.
Contributor

jreback commented Jul 6, 2016

ok @pijucha looks good. minor doc updates. ping when green. other changes leave to other PR's

@pijucha pijucha and 1 other commented on an outdated diff Jul 8, 2016

pandas/tests/indexes/test_base.py
@@ -1615,6 +1632,122 @@ def test_string_index_repr(self):
self.assertEqual(coerce(idx), expected)
+ def test_mixed_int_index(self):
@pijucha

pijucha Jul 8, 2016

Contributor

Added these tests.
(It turned out to be a bit lengthy. maybe split it or/and put in a separate class?)

@jreback

jreback Jul 14, 2016

Contributor

sure you can put in separate tests / class

@jreback

jreback Jul 14, 2016

Contributor

make the names meaningful, e.g. test_mixed_int_index_sort

@pijucha pijucha commented on the diff Jul 8, 2016

pandas/tests/indexes/test_base.py
+ s3 = s1 * s2
+ self.assertEqual(s3.index.name, 'mario')
+
+ # test_union_base
+ first = ind[3:]
+ second = ind[:5]
+
+ if PY3:
+ with tm.assert_produces_warning(RuntimeWarning):
+ # unorderable types
+ result = first.union(second)
+ expected = Index(['b', 2, 'c', 0, 'a', 1])
+ self.assert_index_equal(result, expected)
+ else:
+ result = first.union(second)
+ expected = Index(['b', 2, 'c', 0, 'a', 1])
@pijucha

pijucha Jul 8, 2016

Contributor

(Another inconsistency in union: in python2, sortedness of a mixed-int result Index is a bit unpredictable.)

Contributor

pijucha commented Jul 8, 2016

@jreback Done

Contributor

jreback commented Jul 10, 2016

@pijucha pls rebase, merging lots of stuff. ping when green. I think this lgtm.

jreback added this to the 0.19.0 milestone Jul 10, 2016

@jreback jreback commented on an outdated diff Jul 10, 2016

doc/source/whatsnew/v0.18.2.txt
@@ -412,6 +412,32 @@ Furthermore:
- Passing duplicated ``percentiles`` will now raise a ``ValueError``.
- Bug in ``.describe()`` on a DataFrame with a mixed-dtype column index, which would previously raise a ``TypeError`` (:issue:`13288`)
+.. _whatsnew_0182.api.difference:
@jreback

jreback Jul 10, 2016

Contributor

move to 0.19.0

Contributor

pijucha commented Jul 11, 2016

@jreback Updated

Contributor

pijucha commented Jul 14, 2016

@jreback I see several merges have been made in the meantime. Should I rebase again?

Contributor

jreback commented Jul 14, 2016

yes

Contributor

pijucha commented Jul 14, 2016

@jreback Rebased, tests green.

@jreback jreback commented on an outdated diff Jul 14, 2016

doc/source/whatsnew/v0.19.0.txt
@@ -446,6 +446,32 @@ Furthermore:
- Passing duplicated ``percentiles`` will now raise a ``ValueError``.
- Bug in ``.describe()`` on a DataFrame with a mixed-dtype column index, which would previously raise a ``TypeError`` (:issue:`13288`)
+.. _whatsnew_0190.api.difference:
@jreback

jreback Jul 14, 2016

Contributor

add this in a sub-section (e.g. like describe above)

@jreback jreback commented on an outdated diff Jul 14, 2016

pandas/core/algorithms.py
@@ -163,6 +163,104 @@ def isin(comps, values):
return f(comps, values)
+def safe_sort(values, labels=None, na_sentinel=-1, assume_unique=False):
+ """
+ Sort ``values`` and reorder corresponding ``labels``.
+ ``values`` should be unique if ``labels`` is not None.
+ Safe for use with mixed types (int, str), orders ints before strs.
+
+ .. versionadded:: 0.18.2
@jreback

jreback Jul 14, 2016

Contributor

-> 0.19.0

@jreback jreback and 1 other commented on an outdated diff Jul 14, 2016

pandas/tests/indexes/test_base.py
result = idx1.symmetric_difference(idx2)
- # expected = Index([0.0, np.nan, 2.0, 3.0, np.nan])
+ expected = Index([0.0, 2.0, 3.0])
+ tm.assert_index_equal(result, expected)
+
+ # GH #6444, sorting of nans. Make sure the number of nans is right
+ # and the correct non-nan values are there. punt on sorting.
+ result = idx1.symmetric_difference(idx3)
+ # expected = Index([0.0, 2.0, 3.0, np.nan])
@pijucha

pijucha Jul 14, 2016

Contributor

The lines testing #6444 were there before. I think I only changed the commented line with expected.

Though, it's very likely the issue with nan ordering no longer exists here. IIRC it's related with python set operations. I'll check if python 2 and 3 differ here and clean it if all is ok.

@jreback jreback commented on an outdated diff Jul 14, 2016

pandas/tests/indexes/test_base.py
+ result = ind.argsort()
+ else:
+ result = ind.argsort()
+ expected = np.array(ind).argsort()
+ tm.assert_numpy_array_equal(result, expected, check_dtype=False)
+
+ # test_numpy_argsort
+ if PY3:
+ with tm.assertRaisesRegexp(TypeError, "unorderable types"):
+ result = np.argsort(ind)
+ else:
+ result = np.argsort(ind)
+ expected = ind.argsort()
+ tm.assert_numpy_array_equal(result, expected)
+
+ # test_copy_name
@jreback

jreback Jul 14, 2016

Contributor

the name checking should not be intermixed with these tests

@jreback jreback commented on an outdated diff Jul 14, 2016

pandas/tests/indexes/test_base.py
+ self.assertTrue(ind.equals(first))
+
+ self.assertEqual(first.name, 'mario')
+ self.assertEqual(second.name, 'mario')
+
+ s1 = Series(2, index=first)
+ s2 = Series(3, index=second[:-1])
+ if PY3:
+ with tm.assert_produces_warning(RuntimeWarning):
+ # unorderable types
+ s3 = s1 * s2
+ else:
+ s3 = s1 * s2
+ self.assertEqual(s3.index.name, 'mario')
+
+ # test_union_base
@jreback

jreback Jul 14, 2016

Contributor

separate test (yeah for sure put all of this in a class), then you can have lots of shorter tests

@jreback jreback commented on the diff Jul 14, 2016

pandas/tests/indexes/test_multi.py
@@ -1877,6 +1877,15 @@ def test_duplicate_meta_data(self):
self.assertTrue(idx.has_duplicates)
self.assertEqual(idx.drop_duplicates().names, idx.names)
+ def test_get_unique_index(self):
+ idx = self.index[[0, 1, 0, 1, 1, 0, 0]]
@jreback

jreback Jul 14, 2016

Contributor

do you tests this for Index? (do we even have it defined?)

@pijucha

pijucha Jul 14, 2016

Contributor

Yes. MultiIndex was excluded from those tests for clarity and is tested here.

Contributor

jreback commented Jul 14, 2016

@pijucha looks really good. just some tests splitting, minor corrections.

ping on green.

Contributor

pijucha commented Jul 15, 2016

@jreback OK, cleaned some tests. It's green.

@jreback jreback and 1 other commented on an outdated diff Jul 15, 2016

pandas/indexes/base.py
Parameters
----------
- other : Index or array-like
+ other: Index or array-like
@jreback

jreback Jul 15, 2016

Contributor

these actually should be other : (a space before the :), sorry about that.

@pijucha

pijucha Jul 15, 2016

Contributor

ok

@jreback jreback commented on the diff Jul 15, 2016

pandas/tests/indexes/test_base.py
+ expected = Index([0, 1, 'a'])
+ self.assert_index_equal(result, expected)
+
+ def test_symmetric_difference(self):
+ # (same results for py2 and py3 but sortedness not tested elsewhere)
+ idx = self.create_index()
+ first = idx[:4]
+ second = idx[3:]
+
+ result = first.symmetric_difference(second)
+ expected = Index([0, 1, 2, 'a', 'c'])
+ self.assert_index_equal(result, expected)
+
+ def test_logical_compat(self):
+ idx = self.create_index()
+ self.assertEqual(idx.all(), idx.values.all())
@jreback

jreback Jul 15, 2016

Contributor

much nicer, thxs!

Contributor

jreback commented Jul 15, 2016

minor doc change. ping on green.

Contributor

jreback commented Jul 15, 2016

some failures on windows. we ultimately test on appveyor.com (you just need a free account), but its very slow. so easiest to spin up a vm if you can.

these are prob because you need to use _ensure_platform_int when you are using an indexer (on windows things like .take and .putmask take a int32 rather than standard int64. in this particular case you actually need to _ensure_int64 (as you actually have platform ints up front).

(pandas3.5) C:\Users\conda\Documents\pandas3.5>nosetests pandas/tests/indexes
..........................................................................................................................................
..........................................................................................................................................
..........................................................................................................................................
.........................................................................S................................................................
......................................................................................................E...................................
................................................E...............................................
======================================================================
ERROR: test_join_non_unique (pandas.tests.indexes.test_numeric.TestInt64Index)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Users\conda\Documents\pandas3.5\pandas\tests\indexes\test_numeric.py", line 755, in test_join_non_unique
    joined, lidx, ridx = left.join(left, return_indexers=True)
  File "C:\Users\conda\Documents\pandas3.5\pandas\indexes\base.py", line 2635, in join
    return_indexers=return_indexers)
  File "C:\Users\conda\Documents\pandas3.5\pandas\indexes\base.py", line 2720, in _join_non_unique
    sort=True)
  File "C:\Users\conda\Documents\pandas3.5\pandas\tools\merge.py", line 909, in _get_join_indexers
    return join_func(lkey, rkey, count, **kwargs)
  File "pandas\src\join.pyx", line 51, in pandas.algos.left_outer_join (pandas\algos.c:33321)
    def left_outer_join(ndarray[int64_t] left, ndarray[int64_t] right,
ValueError: Buffer dtype mismatch, expected 'int64_t' but got 'long'

======================================================================
ERROR: test_join_non_unique (pandas.tests.indexes.test_range.TestRangeIndex)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Users\conda\Documents\pandas3.5\pandas\tests\indexes\test_range.py", line 524, in test_join_non_unique
    res, lidx, ridx = self.index.join(other, return_indexers=True)
  File "C:\Users\conda\Documents\pandas3.5\pandas\indexes\range.py", line 435, in join
    return super(RangeIndex, self).join(other, how, level, return_indexers)
  File "C:\Users\conda\Documents\pandas3.5\pandas\indexes\base.py", line 2642, in join
    return_indexers=return_indexers)
  File "C:\Users\conda\Documents\pandas3.5\pandas\indexes\base.py", line 2720, in _join_non_unique
    sort=True)
  File "C:\Users\conda\Documents\pandas3.5\pandas\tools\merge.py", line 909, in _get_join_indexers
    return join_func(lkey, rkey, count, **kwargs)
  File "pandas\src\join.pyx", line 51, in pandas.algos.left_outer_join (pandas\algos.c:33321)
    def left_outer_join(ndarray[int64_t] left, ndarray[int64_t] right,
ValueError: Buffer dtype mismatch, expected 'int64_t' but got 'long'

----------------------------------------------------------------------
Ran 796 tests in 10.849s

FAILED (SKIP=1, errors=2)
Contributor

pijucha commented Jul 15, 2016

OK, I think I've found the problem but need some time to test it on windows.

@pijucha pijucha BUG/PERF: Sort mixed-int in Py3, fix Index.difference
1. Added an internal `safe_sort` to safely sort mixed-integer
arrays in Python3.

2. Changed Index.difference and Index.symmetric_difference
in order to:
- sort mixed-int Indexes (#13432)
- improve performance (#12044)

3. Fixed DataFrame.join which raised in Python3 with mixed-int
non-unique indexes (issue with sorting mixed-ints, #12814)

4. Fixed Index.union returning an empty Index when one of
arguments was a named empty Index (#13432)
3a96089

@pijucha pijucha commented on the diff Jul 17, 2016

pandas/tools/merge.py
- reverse_indexer = np.empty(len(sorter), dtype=np.int64)
- reverse_indexer.put(sorter, np.arange(len(sorter)))
-
- new_left = reverse_indexer.take(_ensure_platform_int(left))
- np.putmask(new_left, left == -1, -1)
-
- new_right = reverse_indexer.take(_ensure_platform_int(right))
- np.putmask(new_right, right == -1, -1)
+ _, new_labels = algos.safe_sort(uniques, labels, na_sentinel=-1)
+ new_labels = _ensure_int64(new_labels)
@pijucha

pijucha Jul 17, 2016

Contributor

The fix. safe_sort returns platform int labels but the join function needs int64.

Contributor

pijucha commented Jul 17, 2016

@jreback The fix was straightforward and it should be ok now.

However, I'm getting these two errors on windows (64bit) - also on the master branch, so they're independent of this PR. Either I messed up the windows setup or it's not being tested thoroughly.

======================================================================
FAIL: test_next (pandas.io.tests.test_common.TestMMapWrapper)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\GitHub\pandas-pijucha\pandas\io\tests\test_common.py", line 141, in t
est_next
    self.assertEqual(next_line, line)
AssertionError: 'a,b,c\r\n' != 'a,b,c\n'
- a,b,c
?      -
+ a,b,c

======================================================================
FAIL: test_alignment_non_pandas (pandas.tests.frame.test_operators.TestDataFrame
Operators)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\GitHub\pandas-pijucha\pandas\tests\frame\test_operators.py", line 120
2, in test_alignment_non_pandas
    Series([1, 2, 3], index=df.index))
  File "D:\GitHub\pandas-pijucha\pandas\util\testing.py", line 1157, in assert_s
eries_equal
    assert_attr_equal('dtype', left, right)
  File "D:\GitHub\pandas-pijucha\pandas\util\testing.py", line 882, in assert_at
tr_equal
    left_attr, right_attr)
  File "D:\GitHub\pandas-pijucha\pandas\util\testing.py", line 1021, in raise_as
sert_detail
    raise AssertionError(msg)
AssertionError: Attributes are different

Attribute "dtype" are different
[left]:  int32
[right]: int64None
Contributor

jreback commented Jul 17, 2016

2nd is already fixed
make sure u r rebased
first was fixed a while back as well

Contributor

pijucha commented Jul 17, 2016

Ah, then it's probably ok. I tested 0.18.1+202.g0a70b5f on windows.
(This commit is already rebased to the current master. I'll check on windows in a minute.)

Contributor

pijucha commented Jul 17, 2016

Yes, everything is all right.
The second error is fixed, and the first one was apparently due to my git autocrlf config.

Contributor

jreback commented Jul 19, 2016

is #13432 completely closed by this? if not pls make checkboxes in the top (and tick off the things that are done and leave open things that are not).

jreback closed this in b225cac Jul 19, 2016

Contributor

jreback commented Jul 19, 2016

thanks @pijucha pretty awesome!

jreback referenced this pull request Jul 19, 2016

Open

BUG: Index set operations issues #13432

2 of 5 tasks complete
Contributor

pijucha commented Jul 19, 2016

@jreback Thanks.

As for #13432, the issues with difference and symmetric_difference are solved. But union and intersection are still open. I'll comment there soon.

Contributor

jreback commented Jul 19, 2016

thanks!

pijucha deleted the pijucha:setop13432 branch Jul 19, 2016

Contributor

jreback commented Jul 19, 2016 edited

@pijucha this is failing on windows / py2.7 (3 is clean), and only windows: https://ci.appveyor.com/project/jreback/pandas-465/build/1.0.762/job/ghokkwmadwog983y

not sure what is going on

Contributor

pijucha commented Jul 19, 2016

@jreback
This is strange. It's not windows because I'm getting it now on linux (py2.7, numpy 1.11.0). I guess travis tests mostly older numpies and I missed it too.

According to test_categorical.py, numpy (>= 1.10) should sort mixed int-datetime array. But it doesn't:

In [3]: arr = np.array([1, 2, datetime.now(), 0, 3], dtype='O')

In [4]: np.sort(arr)
/home/users/piotr/workspace/pandas-pijucha/pandas_dev_python2/lib/python2.7/site-packages/numpy/core/fromnumeric.py:825: RuntimeWarning: tp_compare didn't return -1 or -2 for exception
  a.sort(axis, kind, order)
Out[4]: array([1, 2, datetime.datetime(2016, 7, 19, 9, 49, 28, 214675), 0, 3], dtype=object)

In [6]: np.__version__
Out[6]: '1.11.0'

Ipython probably interferes here because in pure python2.7 I'm getting

TypeError: can't compare datetime.datetime to int

In the old code in factorize, there was a list comprehension similar to this:

ordered = [np.sort(np.array([e for e in arr if f(e)], dtype=object))
           for f in [lambda x: True, lambda x: False]]

I haven't yet caught it precisely but it looks as if it sometimes swallowed an exception. (New code in safe_sort is simpler - sorts each of the two arrays separately, but still with np.sort.)

It looks to me that Categorical.from_array(arr, ordered=True) should always raise now. And maybe test_constructor_unsortable from test_categorical.py needs to be rewritten. (I'll check later in the evening if it works.)

Contributor

jreback commented Jul 19, 2016

ok can u make a new issue about this? (pretty much copy your above comment(

Contributor

pijucha commented Jul 19, 2016

Sure, i will.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment