BUG: Joining a DataFrame with a PeriodIndex fails #16541

Closed
max-sixty opened this Issue May 30, 2017 · 8 comments

Contributor

max-sixty commented May 30, 2017

Code Sample

In [19]: dates = pd.period_range('20100101','20100105', freq='D')

In [20]: weights = pd.DataFrame(np.random.randn(5, 5), index=dates, columns = ['g1_%d' % x for x in range(5)])

In [21]: weights.join(pd.DataFrame(np.random.randn(5,5), index=dates, columns = ['g2_%d' % x for x in range(5)]))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-21-2fdb8b02f5a4> in <module>()
      1 weights.join(
----> 2             pd.DataFrame(np.random.randn(5,5), index=dates, columns = ['g2_%d' % x for x in range(5)]))

/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in join(self, other, on, how, lsuffix, rsuffix, sort)
   4765         # For SparseDataFrame's benefit
   4766         return self._join_compat(other, on=on, how=how, lsuffix=lsuffix,
-> 4767                                  rsuffix=rsuffix, sort=sort)
   4768
   4769     def _join_compat(self, other, on=None, how='left', lsuffix='', rsuffix='',

/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in _join_compat(self, other, on, how, lsuffix, rsuffix, sort)
   4780             return merge(self, other, left_on=on, how=how,
   4781                          left_index=on is None, right_index=True,
-> 4782                          suffixes=(lsuffix, rsuffix), sort=sort)
   4783         else:
   4784             if on is not None:

/usr/local/lib/python2.7/dist-packages/pandas/core/reshape/merge.pyc in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator)
     52                          right_index=right_index, sort=sort, suffixes=suffixes,
     53                          copy=copy, indicator=indicator)
---> 54     return op.get_result()
     55
     56

/usr/local/lib/python2.7/dist-packages/pandas/core/reshape/merge.pyc in get_result(self)
    567                 self.left, self.right)
    568
--> 569         join_index, left_indexer, right_indexer = self._get_join_info()
    570
    571         ldata, rdata = self.left._data, self.right._data

/usr/local/lib/python2.7/dist-packages/pandas/core/reshape/merge.pyc in _get_join_info(self)
    720             join_index, left_indexer, right_indexer = \
    721                 left_ax.join(right_ax, how=self.how, return_indexers=True,
--> 722                              sort=self.sort)
    723         elif self.right_index and self.how == 'left':
    724             join_index, left_indexer, right_indexer = \

TypeError: join() got an unexpected keyword argument 'sort'

It seems the join method on PeriodIndex doesn't accept the sort kwarg, but the merge internals pass it in regardless.
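For anyone skimming the traceback, here is a minimal pure-Python sketch of the kind of signature mismatch it points to. The class and function names (BaseIndex, BrokenPeriodIndex, merge_like_caller, demo) are hypothetical stand-ins, not pandas code; the only assumption is that the caller always forwards sort while the override's signature lacks it.

```python
class BaseIndex:
    """Hypothetical stand-in for a parent index whose join() accepts ``sort``."""

    def join(self, other, how="left", return_indexers=False, sort=False):
        return "joined(how={}, sort={})".format(how, sort)


class BrokenPeriodIndex(BaseIndex):
    """Hypothetical override whose signature lacks ``sort``."""

    def join(self, other, how="left", return_indexers=False):
        return BaseIndex.join(self, other, how=how,
                              return_indexers=return_indexers)


def merge_like_caller(left, right):
    """Mimics an internal caller that always forwards ``sort``."""
    return left.join(right, how="left", return_indexers=True, sort=False)


def demo():
    """Return the result for the base class and the error text for the override."""
    ok = merge_like_caller(BaseIndex(), BaseIndex())
    try:
        merge_like_caller(BrokenPeriodIndex(), BrokenPeriodIndex())
    except TypeError as exc:
        return ok, str(exc)
    return ok, None
```

Calling demo() returns the normal result for the base class plus the text of the TypeError raised by the override, which names the unexpected sort keyword.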

Output of pd.show_versions()

In [22]: pd.show_versions()
/usr/local/lib/python2.7/dist-packages/xarray/core/formatting.py:16: FutureWarning: The pandas.tslib module is deprecated and will be removed in a future version.
  from pandas.tslib import OutOfBoundsDatetime

INSTALLED VERSIONS

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.13-moby
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.1
pytest: 3.0.7
pip: 9.0.1
setuptools: 35.0.2
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: 0.9.2
IPython: 5.3.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.0
tables: None
numexpr: 2.6.2
feather: None
matplotlib: 2.0.1
openpyxl: None
xlrd: None
xlwt: 1.2.0
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: 0.1.6
pandas_datareader: None

Contributor

jreback commented May 31, 2017

This fixes it, if you can do a PR:

diff --git a/pandas/core/indexes/period.py b/pandas/core/indexes/period.py
index 15fd9b7..50d5958 100644
--- a/pandas/core/indexes/period.py
+++ b/pandas/core/indexes/period.py
@@ -919,14 +919,16 @@ class PeriodIndex(DatelikeOps, DatetimeIndexOpsMixin, Int64Index):
                               self[loc:].asi8))
         return self._shallow_copy(idx)
 
-    def join(self, other, how='left', level=None, return_indexers=False):
+    def join(self, other, how='left', level=None, return_indexers=False,
+             sort=False):
         """
         See Index.join
         """
         self._assert_can_do_setop(other)
 
         result = Int64Index.join(self, other, how=how, level=level,
-                                 return_indexers=return_indexers)
+                                 return_indexers=return_indexers,
+                                 sort=sort)
 
         if return_indexers:
             result, lidx, ridx = result

Obviously we need some more tests on the index join methods as well :>

Here are the tests for datetimes in pandas/tests/indexes/datetimes/test_datetimes.py;
we need to do something like this in periods/test_period.py:

    def test_join_self(self):
        index = date_range('1/1/2000', periods=10)
        kinds = 'outer', 'inner', 'left', 'right'
        for kind in kinds:
            joined = index.join(index, how=kind)
            assert index is joined
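
A rough sketch of what a period analogue of that test might look like, assuming pd.period_range as the counterpart to date_range. The function name check_period_join_self is hypothetical, and this is not necessarily the test that ends up in the PR. It also passes the sort keyword explicitly, since that is the argument the merge internals forward.

```python
import pandas as pd


def check_period_join_self():
    """Join a daily PeriodIndex with itself for every join kind."""
    index = pd.period_range("1/1/2000", periods=10, freq="D")
    for kind in ("outer", "inner", "left", "right"):
        # Pass ``sort`` explicitly, as the merge internals do.
        joined = index.join(index, how=kind, sort=False)
        # The datetime test asserts identity (``index is joined``); equality
        # is checked here because identity is an implementation detail.
        assert joined.equals(index), kind
    return True
```

Checking equality rather than identity is a deliberate relaxation of the datetime test, so the sketch does not depend on whether join returns the same object.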

Contributor

jreback commented May 31, 2017

If you can do this in the next day or two, it can get into 0.20.2 (end of week).

@jreback jreback modified the milestones: 0.20.2, 0.21.0 Jun 1, 2017

rosygupta commented Jun 7, 2017

@jreback Is this issue still open?

Contributor

max-sixty commented Jun 7, 2017

PR waiting here: #16586

Contributor

Dr-Irv commented Jun 8, 2017

I applied the fix suggested above by @jreback to my 0.20.1 install, and that fixed the problem for me, but then a different one cropped up. I'm not sure whether this belongs in a separate issue. In my use case, I converted dates to monthly periods, so the index contains duplicates. Here is a way to reproduce it:

perindex = pd.period_range('2016-01-01', periods=16, freq='M')
perdf = pd.DataFrame([i for i in range(len(perindex))],
                     index=perindex, columns=['pnum'])
df2 = pd.concat([perdf, perdf])
perdf.merge(df2, left_index=True, right_index=True, how='outer')

This produces the following traceback:

TypeError                                 Traceback (most recent call last)
<ipython-input-45-a9a1ea5d6a78> in <module>()
      1 df2 = pd.concat([perdf, perdf])
----> 2 perdf.merge(df2, left_index=True, right_index=True, how='outer')

C:\Anaconda3\envs\py36\lib\site-packages\pandas\core\frame.py in merge(self, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator)
   4818                      right_on=right_on, left_index=left_index,
   4819                      right_index=right_index, sort=sort, suffixes=suffixes,
-> 4820                      copy=copy, indicator=indicator)
   4821 
   4822     def round(self, decimals=0, *args, **kwargs):

C:\Anaconda3\envs\py36\lib\site-packages\pandas\core\reshape\merge.py in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator)
     52                          right_index=right_index, sort=sort, suffixes=suffixes,
     53                          copy=copy, indicator=indicator)
---> 54     return op.get_result()
     55 
     56 

C:\Anaconda3\envs\py36\lib\site-packages\pandas\core\reshape\merge.py in get_result(self)
    567                 self.left, self.right)
    568 
--> 569         join_index, left_indexer, right_indexer = self._get_join_info()
    570 
    571         ldata, rdata = self.left._data, self.right._data

C:\Anaconda3\envs\py36\lib\site-packages\pandas\core\reshape\merge.py in _get_join_info(self)
    720             join_index, left_indexer, right_indexer = \
    721                 left_ax.join(right_ax, how=self.how, return_indexers=True,
--> 722                              sort=self.sort)
    723         elif self.right_index and self.how == 'left':
    724             join_index, left_indexer, right_indexer = \

C:\Anaconda3\envs\py36\lib\site-packages\pandas\core\indexes\period.py in join(self, other, how, level, return_indexers, sort)
    927 
    928         result = Int64Index.join(self, other, how=how, level=level,
--> 929                                  return_indexers=return_indexers, sort=sort)
    930 
    931         if return_indexers:

C:\Anaconda3\envs\py36\lib\site-packages\pandas\core\indexes\base.py in join(self, other, how, level, return_indexers, sort)
   2995             else:
   2996                 return self._join_non_unique(other, how=how,
-> 2997                                              return_indexers=return_indexers)
   2998         elif self.is_monotonic and other.is_monotonic:
   2999             try:

C:\Anaconda3\envs\py36\lib\site-packages\pandas\core\indexes\base.py in _join_non_unique(self, other, how, return_indexers)
   3076         left_idx, right_idx = _get_join_indexers([self.values],
   3077                                                  [other._values], how=how,
-> 3078                                                  sort=True)
   3079 
   3080         left_idx = _ensure_platform_int(left_idx)

C:\Anaconda3\envs\py36\lib\site-packages\pandas\core\reshape\merge.py in _get_join_indexers(left_keys, right_keys, sort, how, **kwargs)
    980 
    981     # get left & right join labels and num. of levels at each location
--> 982     llab, rlab, shape = map(list, zip(* map(fkeys, left_keys, right_keys)))
    983 
    984     # get flat i8 keys from label lists

C:\Anaconda3\envs\py36\lib\site-packages\pandas\core\reshape\merge.py in _factorize_keys(lk, rk, sort)
   1409     if sort:
   1410         uniques = rizer.uniques.to_array()
-> 1411         llab, rlab = _sort_labels(uniques, llab, rlab)
   1412 
   1413     # NA group

C:\Anaconda3\envs\py36\lib\site-packages\pandas\core\reshape\merge.py in _sort_labels(uniques, left, right)
   1435     labels = np.concatenate([left, right])
   1436 
-> 1437     _, new_labels = algos.safe_sort(uniques, labels, na_sentinel=-1)
   1438     new_labels = _ensure_int64(new_labels)
   1439     new_left, new_right = new_labels[:l], new_labels[l:]

C:\Anaconda3\envs\py36\lib\site-packages\pandas\core\algorithms.py in safe_sort(values, labels, na_sentinel, assume_unique)
    476     if compat.PY3 and lib.infer_dtype(values) == 'mixed-integer':
    477         # unorderable in py3 if mixed str/int
--> 478         ordered = sort_mixed(values)
    479     else:
    480         try:

C:\Anaconda3\envs\py36\lib\site-packages\pandas\core\algorithms.py in sort_mixed(values)
    469         str_pos = np.array([isinstance(x, string_types) for x in values],
    470                            dtype=bool)
--> 471         nums = np.sort(values[~str_pos])
    472         strs = np.sort(values[str_pos])
    473         return _ensure_object(np.concatenate([nums, strs]))

C:\Anaconda3\envs\py36\lib\site-packages\numpy\core\fromnumeric.py in sort(a, axis, kind, order)
    820     else:
    821         a = asanyarray(a).copy(order="K")
--> 822     a.sort(axis=axis, kind=kind, order=order)
    823     return a
    824 

pandas\_libs\period.pyx in pandas._libs.period._Period.__richcmp__ (pandas\_libs\period.c:12067)()

TypeError: Cannot compare type 'Period' with type 'int'

Let me know if I should open up a new issue, given that this bug happens when applying the above fix.
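
One possible reading of the traceback above: the values reaching the sort step were inferred as 'mixed-integer', and the final frame fails comparing a Period with an int. The pure-Python sketch below only illustrates that failure mode. ToyPeriod and sort_or_error are hypothetical names, and it is an assumption, not something confirmed in this thread, that period objects and integer ordinals ended up in the same array.

```python
class ToyPeriod:
    """Hypothetical period-like object that only orders against its own type."""

    def __init__(self, ordinal):
        self.ordinal = ordinal

    def __lt__(self, other):
        if not isinstance(other, ToyPeriod):
            raise TypeError(
                "Cannot compare type 'ToyPeriod' with type '{}'".format(
                    type(other).__name__))
        return self.ordinal < other.ordinal


def sort_or_error(values):
    """Return the sorted list, or the TypeError raised while sorting."""
    try:
        return sorted(values)
    except TypeError as exc:
        return exc
```

A homogeneous list such as [ToyPeriod(3), ToyPeriod(1)] sorts fine, while [ToyPeriod(3), 1] comes back as a TypeError instance.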

Contributor

max-sixty commented Jun 8, 2017

Do you get the error when running on that PR?

If so, I would open a new issue.

Contributor

Dr-Irv commented Jun 8, 2017

@MaximilianR I hand-edited pandas 0.20.1 to implement what is in the PR, and got the error. To test it against all PRs, I think I'd need that PR to be merged into master, and then I can pull master and test.

Contributor

max-sixty commented Jun 8, 2017

Great!

FYI you can pull someone's PR for convenience, rather than hand-editing
