pd.concat on two (or more) series produces all-NaN dataframe #11058

Closed
timfeirg opened this issue Sep 11, 2015 · 5 comments · Fixed by #33728
Labels: good first issue, Needs Tests (Unit test(s) needed to prevent regressions)
Milestone: 1.1

Comments

@timfeirg

This is from a StackOverflow question here; you can download the serialized objects here and reproduce with Python 2.7 and pandas 0.16.2.

I'm trying to concat two Series with a MultiIndex using pd.concat([a, b], axis=1), like so:

>>>payed_orders.head()
dt          product_id
2015-01-15  10001          1
            10007          1
            10016         14
            10022          1
            10023          1
Name: payed_orders, dtype: int64

>>>refund_orders.head()
dt          product_id
2015-01-15  10007         1
            10016         4
            10030         1
2015-01-16  10007         3
            10008         1
Name: refund_orders, dtype: int64

>>>pd.concat([payed_orders.head(), refund_orders.head()], axis=1, ignore_index=False)
        payed_orders    refund_orders
dt  product_id      
2015-01-15  10001   NaN NaN
            10007   NaN NaN
            10016   NaN NaN
            10022   NaN NaN
            10023   NaN NaN
            10030   NaN NaN
2015-01-16  10007   NaN NaN
            10008   NaN NaN

I've checked the index types and many other things to make sure no obvious mistakes were made. I've read the docs and learned that concatenating and merging Series and DataFrames may introduce NaN, but I didn't find anything in the docs that explains this behavior.
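
For reference, here is a minimal sketch (not the original serialized data; names and values are illustrative) that builds two Series with datetime.date objects in the dt level of the MultiIndex, matching the situation described above:

import datetime
import pandas as pd

# The date level is held as plain datetime.date objects (object dtype),
# mirroring the data in the report above.
idx_a = pd.MultiIndex.from_arrays(
    [pd.Index([datetime.date(2015, 1, 15)] * 2, dtype=object), [10007, 10016]],
    names=["dt", "product_id"],
)
idx_b = pd.MultiIndex.from_arrays(
    [pd.Index([datetime.date(2015, 1, 15), datetime.date(2015, 1, 16)], dtype=object), [10007, 10007]],
    names=["dt", "product_id"],
)
payed_orders = pd.Series([1, 14], index=idx_a, name="payed_orders")
refund_orders = pd.Series([1, 3], index=idx_b, name="refund_orders")

# On pandas 0.16.2 this reportedly produced an all-NaN frame even though
# the (dt, product_id) keys overlap.
print(pd.concat([payed_orders, refund_orders], axis=1))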

@jreback
Contributor

jreback commented Sep 11, 2015

see #10060

You are using datetime.date, which is really, really frowned upon and just about unsupported. It's pretty hard to actually construct a MultiIndex with these; in fact we were always coercing them to Timestamps, which is the correct dtype here.

Just use Timestamps/datetime.datetime. There is NO benefit to using datetime.date, just more and more issues.

This is the correct usage:

In [23]: s = Series(['a','b'],index=pd.MultiIndex.from_tuples([(1,Timestamp('20130101')),(2,Timestamp('20140101'))],names=['first','seconds']))

In [24]: s2 = Series(['a','b'],index=pd.MultiIndex.from_tuples([(1,Timestamp('20130101')),(2,Timestamp('20150101'))],names=['first','seconds']))

In [25]: pd.concat([s,s2],axis=1)
Out[25]: 
                    0    1
first seconds             
1     2013-01-01    a    a
2     2014-01-01    b  NaN
      2015-01-01  NaN    b

The following, on the other hand, is technically a bug, but again: really, really frowned upon.

In [10]: s = Series(['a','b'],index=pd.MultiIndex.from_arrays([[1,2],Index([datetime.date(2013,1,1),datetime.date(2014,1,1)])]))

In [11]: s
Out[11]: 
1  2013-01-01    a
2  2014-01-01    b
dtype: object

In [12]: s = Series(['a','b'],index=pd.MultiIndex.from_arrays([[1,2],Index([datetime.date(2013,1,1),datetime.date(2014,1,1)])],names=['first','second']))

In [13]: s
Out[13]: 
first  second    
1      2013-01-01    a
2      2014-01-01    b
dtype: object

In [14]: s.index.levels[1].values
Out[14]: array([datetime.date(2013, 1, 1), datetime.date(2014, 1, 1)], dtype=object)

In [15]: pd.concat([s,s],axis=1)
Out[15]: 
                  0  1
first second          
1     2013-01-01  a  a
2     2014-01-01  b  b

In [16]: s = Series(['a','b'],index=pd.MultiIndex.from_arrays([[1,2],Index([datetime.date(2013,1,1),datetime.date(2014,1,1)])],names=['first','second']))

In [17]: s2 = Series(['a','b'],index=pd.MultiIndex.from_arrays([[1,2],Index([datetime.date(2013,1,1),datetime.date(2015,1,1)])],names=['first','second']))

In [18]: s
Out[18]: 
first  second    
1      2013-01-01    a
2      2014-01-01    b
dtype: object

In [19]: s.index.levels[1].values
Out[19]: array([datetime.date(2013, 1, 1), datetime.date(2014, 1, 1)], dtype=object)

In [20]: pd.concat([s,s2],axis=1)
Out[20]: 
                    0    1
first second              
1     2013-01-01  NaN  NaN
2     2014-01-01  NaN  NaN
      2015-01-01  NaN  NaN

I probably won't ever get to this. It really requires the creation of a new dtype to support these, and I don't think it's worth it, nor is it useful in any way.

I'll mark it as a bug in any event.
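
For anyone hitting this in the meantime, a possible workaround (a sketch only, assuming the date level is named "dt" as in the report above; dates_to_timestamps is a hypothetical helper, not a pandas API) is to coerce the datetime.date level to Timestamps before concatenating:

import pandas as pd

def dates_to_timestamps(s, level="dt"):
    # Return a copy of s whose given index level is converted from
    # datetime.date objects to Timestamps via pd.to_datetime, so that
    # concat can align the two indexes.
    out = s.copy()
    pos = out.index.names.index(level)
    out.index = out.index.set_levels(pd.to_datetime(out.index.levels[pos]), level=pos)
    return out

# Using the series names from the report above (or your own data).
result = pd.concat(
    [dates_to_timestamps(payed_orders), dates_to_timestamps(refund_orders)],
    axis=1,
)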

@jreback added the Bug, Timeseries, Dtype Conversions, Algos and MultiIndex labels on Sep 11, 2015
@jreback added this to the Someday milestone on Sep 11, 2015
@jreback
Contributor

jreback commented Sep 11, 2015

The one thing I think might be useful is a note in the docstring for MultiIndex, and a note in the docs, to avoid using datetime.date. Would you do a pull request?

@timfeirg
Author

I think I will do that.
BTW, I notice that if I pd.merge two DataFrames with a MultiIndex, one level of which is datetime.datetime, the values get automatically converted to pandas Timestamps. I wonder if pd.concat could have the same behavior.
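
For what it's worth, a small illustrative sketch of why the two cases differ: pandas coerces datetime.datetime values to Timestamps when building an index, while datetime.date values are kept as plain Python objects (consistent with the level dtypes shown above):

import datetime
import pandas as pd

# datetime.datetime values are coerced to a datetime64[ns] index of Timestamps...
print(pd.Index([datetime.datetime(2015, 1, 15), datetime.datetime(2015, 1, 16)]).dtype)

# ...while datetime.date values stay as plain objects, which is what
# trips up the alignment in concat.
print(pd.Index([datetime.date(2015, 1, 15), datetime.date(2015, 1, 16)]).dtype)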

@mroeschke
Member

This looks to work on master now. It could use a test:

In [106]: s
Out[106]:
first  second
1      2013-01-01    a
2      2014-01-01    b
dtype: object

In [107]: s2
Out[107]:
first  second
1      2013-01-01    a
2      2015-01-01    b
dtype: object

In [108]: pd.concat([s,s2],axis=1)
Out[108]:
                    0    1
first second
1     2013-01-01    a    a
2     2014-01-01    b  NaN
      2015-01-01  NaN    b

In [109]: pd.__version__
Out[109]: '1.1.0.dev0+1027.g767335719'
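
A sketch of the kind of regression test being asked for (test name, placement, and the exact expected frame are illustrative; the test added by the linked PR #33728 may differ):

import datetime

import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal


def test_concat_series_multiindex_datetime_date_level():
    # GH#11058: concat on Series whose MultiIndex has a datetime.date level
    # should align on the index instead of producing an all-NaN frame.
    idx1 = pd.MultiIndex.from_arrays(
        [[1, 2], pd.Index([datetime.date(2013, 1, 1), datetime.date(2014, 1, 1)], dtype=object)],
        names=["first", "second"],
    )
    idx2 = pd.MultiIndex.from_arrays(
        [[1, 2], pd.Index([datetime.date(2013, 1, 1), datetime.date(2015, 1, 1)], dtype=object)],
        names=["first", "second"],
    )
    s1 = pd.Series(["a", "b"], index=idx1)
    s2 = pd.Series(["a", "b"], index=idx2)

    result = pd.concat([s1, s2], axis=1)

    expected_index = pd.MultiIndex.from_tuples(
        [
            (1, datetime.date(2013, 1, 1)),
            (2, datetime.date(2014, 1, 1)),
            (2, datetime.date(2015, 1, 1)),
        ],
        names=["first", "second"],
    )
    expected = pd.DataFrame(
        {0: ["a", "b", np.nan], 1: ["a", np.nan, "b"]},
        index=expected_index,
    )
    assert_frame_equal(result, expected)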

@mroeschke added the good first issue and Needs Tests labels and removed the Algos, Bug, Dtype Conversions, MultiIndex and Timeseries labels on Mar 31, 2020
@simonjayhawkins
Member

Thanks @timfeirg for reporting this issue. Before closing this, we should add a test to prevent regressions.

@simonjayhawkins changed the milestone from Someday to Contributions Welcome on Mar 31, 2020
@jreback changed the milestone from Contributions Welcome to 1.1 on Apr 22, 2020