pd.concat on two (or more) series produces all-NaN dataframe #11058

Closed
timfeirg opened this issue Sep 11, 2015 · 5 comments · Fixed by #33728
Labels: good first issue, Needs Tests (Unit test(s) needed to prevent regressions)
Milestone: 1.1

Comments

@timfeirg

This is from a StackOverflow question here; you can download the serialized objects here and reproduce with Python 2.7 and pandas 0.16.2.

I'm trying to concat two Series with a MultiIndex using pd.concat([a, b], axis=1), like so:

>>>payed_orders.head()
dt          product_id
2015-01-15  10001          1
            10007          1
            10016         14
            10022          1
            10023          1
Name: payed_orders, dtype: int64

>>>refund_orders.head()
dt          product_id
2015-01-15  10007         1
            10016         4
            10030         1
2015-01-16  10007         3
            10008         1
Name: refund_orders, dtype: int64

>>>pd.concat([payed_orders.head(), refund_orders.head()], axis=1, ignore_index=False)
        payed_orders    refund_orders
dt  product_id      
2015-01-15  10001   NaN NaN
            10007   NaN NaN
            10016   NaN NaN
            10022   NaN NaN
            10023   NaN NaN
            10030   NaN NaN
2015-01-16  10007   NaN NaN
            10008   NaN NaN

I've checked the index types and many other things to make sure no obvious mistakes were made. I've read the docs and learned that concatenating and merging Series and DataFrames may introduce NaN, but I didn't find anything in the docs that explains this behavior.
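
For reference, here is a minimal sketch (not the original serialized data; names and values are illustrative) that builds two Series with datetime.date objects in the dt level of the MultiIndex, matching the situation described above:

import datetime
import pandas as pd

# The date level is held as plain datetime.date objects (object dtype),
# mirroring the data in the report above.
idx_a = pd.MultiIndex.from_arrays(
    [pd.Index([datetime.date(2015, 1, 15)] * 2, dtype=object), [10007, 10016]],
    names=["dt", "product_id"],
)
idx_b = pd.MultiIndex.from_arrays(
    [pd.Index([datetime.date(2015, 1, 15), datetime.date(2015, 1, 16)], dtype=object), [10007, 10007]],
    names=["dt", "product_id"],
)
payed_orders = pd.Series([1, 14], index=idx_a, name="payed_orders")
refund_orders = pd.Series([1, 3], index=idx_b, name="refund_orders")

# On pandas 0.16.2 this reportedly produced an all-NaN frame even though
# the (dt, product_id) keys overlap.
print(pd.concat([payed_orders, refund_orders], axis=1))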

@jreback
Contributor

jreback commented Sep 11, 2015

see #10060

You are using datetime.date, which is really, really frowned upon and just about unsupported. It's pretty hard to actually construct a MultiIndex with these; in fact we were always coercing them to Timestamps, which is the correct dtype here.

Just use Timestamps/datetime.datetime. There is NO benefit to using datetime.date, just more and more issues.

This is the correct usage:

In [23]: s = Series(['a','b'],index=pd.MultiIndex.from_tuples([(1,Timestamp('20130101')),(2,Timestamp('20140101'))],names=['first','seconds']))

In [24]: s2 = Series(['a','b'],index=pd.MultiIndex.from_tuples([(1,Timestamp('20130101')),(2,Timestamp('20150101'))],names=['first','seconds']))

In [25]: pd.concat([s,s2],axis=1)
Out[25]: 
                    0    1
first seconds             
1     2013-01-01    a    a
2     2014-01-01    b  NaN
      2015-01-01  NaN    b

The following, on the other hand, is technically a bug, but again: really, really frowned upon.

In [10]: s = Series(['a','b'],index=pd.MultiIndex.from_arrays([[1,2],Index([datetime.date(2013,1,1),datetime.date(2014,1,1)])]))

In [11]: s
Out[11]: 
1  2013-01-01    a
2  2014-01-01    b
dtype: object

In [12]: s = Series(['a','b'],index=pd.MultiIndex.from_arrays([[1,2],Index([datetime.date(2013,1,1),datetime.date(2014,1,1)])],names=['first','second']))

In [13]: s
Out[13]: 
first  second    
1      2013-01-01    a
2      2014-01-01    b
dtype: object

In [14]: s.index.levels[1].values
Out[14]: array([datetime.date(2013, 1, 1), datetime.date(2014, 1, 1)], dtype=object)

In [15]: pd.concat([s,s],axis=1)
Out[15]: 
                  0  1
first second          
1     2013-01-01  a  a
2     2014-01-01  b  b

In [16]: s = Series(['a','b'],index=pd.MultiIndex.from_arrays([[1,2],Index([datetime.date(2013,1,1),datetime.date(2014,1,1)])],names=['first','second']))

In [17]: s2 = Series(['a','b'],index=pd.MultiIndex.from_arrays([[1,2],Index([datetime.date(2013,1,1),datetime.date(2015,1,1)])],names=['first','second']))

In [18]: s
Out[18]: 
first  second    
1      2013-01-01    a
2      2014-01-01    b
dtype: object

In [19]: s.index.levels[1].values
Out[19]: array([datetime.date(2013, 1, 1), datetime.date(2014, 1, 1)], dtype=object)

In [20]: pd.concat([s,s2],axis=1)
Out[20]: 
                    0    1
first second              
1     2013-01-01  NaN  NaN
2     2014-01-01  NaN  NaN
      2015-01-01  NaN  NaN

I probably won't ever get to this. It really requires the creation of a new dtype to support these, and I don't think it's worth it, nor is it useful in any way.

I'll mark it as a bug in any event.
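
For anyone hitting this in the meantime, a possible workaround (a sketch only, assuming the date level is named "dt" as in the report above; dates_to_timestamps is a hypothetical helper, not a pandas API) is to coerce the datetime.date level to Timestamps before concatenating:

import pandas as pd

def dates_to_timestamps(s, level="dt"):
    # Return a copy of s whose given index level is converted from
    # datetime.date objects to Timestamps via pd.to_datetime, so that
    # concat can align the two indexes.
    out = s.copy()
    pos = out.index.names.index(level)
    out.index = out.index.set_levels(pd.to_datetime(out.index.levels[pos]), level=pos)
    return out

# Using the series names from the report above (or your own data).
result = pd.concat(
    [dates_to_timestamps(payed_orders), dates_to_timestamps(refund_orders)],
    axis=1,
)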

@jreback added the Bug, Timeseries, Dtype Conversions, Algos and MultiIndex labels on Sep 11, 2015
@jreback added this to the Someday milestone on Sep 11, 2015
@jreback
Contributor

jreback commented Sep 11, 2015

The one thing I think might be useful is a note in the docstring for MultiIndex, and a note in the docs, to avoid using datetime.date. Would you do a pull request?

@timfeirg
Author

I think I will do that.
BTW, I notice that if I pd.merge two DataFrames with a MultiIndex, one level of which is datetime.datetime, the values get automatically converted to pandas Timestamps. I wonder if pd.concat could have the same behavior.
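
For what it's worth, a small illustrative sketch of why the two cases differ: pandas coerces datetime.datetime values to Timestamps when building an index, while datetime.date values are kept as plain Python objects (consistent with the level dtypes shown above):

import datetime
import pandas as pd

# datetime.datetime values are coerced to a datetime64[ns] index of Timestamps...
print(pd.Index([datetime.datetime(2015, 1, 15), datetime.datetime(2015, 1, 16)]).dtype)

# ...while datetime.date values stay as plain objects, which is what
# trips up the alignment in concat.
print(pd.Index([datetime.date(2015, 1, 15), datetime.date(2015, 1, 16)]).dtype)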

@mroeschke
Member

This looks to work on master now. It could use a test:

In [106]: s
Out[106]:
first  second
1      2013-01-01    a
2      2014-01-01    b
dtype: object

In [107]: s2
Out[107]:
first  second
1      2013-01-01    a
2      2015-01-01    b
dtype: object

In [108]: pd.concat([s,s2],axis=1)
Out[108]:
                    0    1
first second
1     2013-01-01    a    a
2     2014-01-01    b  NaN
      2015-01-01  NaN    b

In [109]: pd.__version__
Out[109]: '1.1.0.dev0+1027.g767335719'
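
A sketch of the kind of regression test being asked for (test name, placement, and the exact expected frame are illustrative; the test added by the linked PR #33728 may differ):

import datetime

import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal


def test_concat_series_multiindex_datetime_date_level():
    # GH#11058: concat on Series whose MultiIndex has a datetime.date level
    # should align on the index instead of producing an all-NaN frame.
    idx1 = pd.MultiIndex.from_arrays(
        [[1, 2], pd.Index([datetime.date(2013, 1, 1), datetime.date(2014, 1, 1)], dtype=object)],
        names=["first", "second"],
    )
    idx2 = pd.MultiIndex.from_arrays(
        [[1, 2], pd.Index([datetime.date(2013, 1, 1), datetime.date(2015, 1, 1)], dtype=object)],
        names=["first", "second"],
    )
    s1 = pd.Series(["a", "b"], index=idx1)
    s2 = pd.Series(["a", "b"], index=idx2)

    result = pd.concat([s1, s2], axis=1)

    expected_index = pd.MultiIndex.from_tuples(
        [
            (1, datetime.date(2013, 1, 1)),
            (2, datetime.date(2014, 1, 1)),
            (2, datetime.date(2015, 1, 1)),
        ],
        names=["first", "second"],
    )
    expected = pd.DataFrame(
        {0: ["a", "b", np.nan], 1: ["a", np.nan, "b"]},
        index=expected_index,
    )
    assert_frame_equal(result, expected)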

@mroeschke added the good first issue and Needs Tests labels and removed the Algos, Bug, Dtype Conversions, MultiIndex and Timeseries labels on Mar 31, 2020
@simonjayhawkins
Member

Thanks @timfeirg for reporting this issue. Before closing this, we should add a test to prevent regressions.

@simonjayhawkins changed the milestone from Someday to Contributions Welcome on Mar 31, 2020
@jreback changed the milestone from Contributions Welcome to 1.1 on Apr 22, 2020