Partial string indexing returns ndarray rather than Series. #27516

Closed
anetbnd opened this issue Jul 22, 2019 · 10 comments · Fixed by #27712
Labels
Indexing (related to indexing on series/frames, not to indexes themselves), Regression (functionality that used to work in a prior pandas version), Timeseries

@anetbnd

anetbnd commented Jul 22, 2019

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np

print(pd.__version__)

df_pass = pd.DataFrame(index=range(1,1000), columns=['A', 'B', 'C'])
df_pass.loc[:, :] = np.random.uniform(-100, 100, size=(len(df_pass.index), len(df_pass.columns)))
print(df_pass.loc[range(1,500), 'A'].sum(skipna=False)) # everything is fine here

df_fail = pd.DataFrame(index=pd.date_range('01-01-2005', '12-01-2006'), columns=['A', 'B', 'C'])
df_fail.loc[:, :] = np.random.uniform(-100, 100, size=(len(df_fail.index), len(df_fail.columns)))
print(df_fail.loc['2005', 'A'].sum(skipna=False)) # Here the TypeError appears

Output:

0.25.0
-847.9947710494175
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-20-26bdb9fa00f7> in <module>()
     10 df_fail = pd.DataFrame(index=pd.date_range('01-01-2005', '12-01-2006'), columns=['A', 'B', 'C'])
     11 df_fail .loc[:, :] = np.random.uniform(-100, 100, size=(len(df_fail .index), len(df_fail .columns)))
---> 12 print(df_fail .loc['2005', 'A'].sum(skipna=False)) # Here the type-error appears
     13

Problem description

Before updating from 0.24.0 to 0.25.0, everything worked fine. I also cannot see that there was an API change here. I would expect the second sum to work without issues.

Expected Output

Output (something like):

0.25.0
-847.9947710494175
-451.5691327012012

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.6.5.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
machine          : AMD64
processor        : Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : None.None

pandas           : 0.25.0
numpy            : 1.14.3
pytz             : 2018.4
dateutil         : 2.7.3
pip              : 19.1.1
setuptools       : 39.1.0
Cython           : 0.28.2
pytest           : 3.10.0
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : 1.0.5
lxml.etree       : 4.2.5
html5lib         : 1.0.1
pymysql          : None
psycopg2         : 2.7.6.1 (dt dec pq3 ext lo64)
jinja2           : 2.10
IPython          : 6.4.0
pandas_datareader: None
bs4              : 4.7.1
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : 4.2.5
matplotlib       : 2.2.2
numexpr          : None
odfpy            : None
openpyxl         : 2.6.0
pandas_gbq       : None
pyarrow          : None
pytables         : None
s3fs             : None
scipy            : 1.1.0
sqlalchemy       : 1.2.12
tables           : None
xarray           : None
xlrd             : 1.1.0
xlwt             : None
xlsxwriter       : 1.0.5
@TomAugspurger
Contributor

That's an indexing bug. Somehow .loc is returning an ndarray rather than a Series.

In [27]: df = pd.DataFrame({"A": 1}, index=pd.date_range("2000", periods=100))

In [28]: df.loc['2000-01', 'A']
Out[28]:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1])
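
As a rough sketch of what a check for this behaviour might look like (not the test that was eventually added to pandas), the snippet below reuses the frame from the example above and asserts that partial string indexing gives back a Series. The variable names `result` and `total` are only illustrative, and the assertions are expected to hold on a pandas version without this regression.

```python
import pandas as pd

# Same frame as in the example above: 100 daily rows starting 2000-01-01.
df = pd.DataFrame({"A": 1}, index=pd.date_range("2000", periods=100))

# Partial string indexing: select every row in January 2000, column "A".
result = df.loc["2000-01", "A"]

# On an unaffected pandas version this is a Series, so Series-only
# keyword arguments such as skipna are accepted.
assert isinstance(result, pd.Series)
assert len(result) == 31
total = result.sum(skipna=False)
```

On an affected version, the `isinstance` assertion would be the first thing to fail, which points at the indexing step rather than at `sum`.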

TomAugspurger added this to the 0.25.1 milestone Jul 22, 2019
TomAugspurger added the Indexing label Jul 22, 2019
@TomAugspurger
Contributor

#27110 seems like the most likely candidate (cc @jbrockmendel).

IIUC, we can't treat ('2000-01', 'A') as a scalar, since it's really shorthand for the expanded indexing. Do you have time to look into this @jbrockmendel?
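
To illustrate what "shorthand for the expanded indexing" means here, this minimal sketch compares the partial string key with an explicit date slice covering the same month. It assumes a pandas version where `.loc` behaves as expected; the column values and the names `partial` and `explicit` are made up for the illustration.

```python
import pandas as pd

# Distinct values per row so the comparison is meaningful.
df = pd.DataFrame({"A": range(100)}, index=pd.date_range("2000", periods=100))

# Partial string key: stands for "all of January 2000".
partial = df.loc["2000-01", "A"]

# The expanded form: an explicit, inclusive label slice over the same month.
explicit = df.loc["2000-01-01":"2000-01-31", "A"]

# Both spellings should select the same 31 rows.
assert partial.equals(explicit)
```

Because the key stands for a range of rows, the row part of the tuple cannot be handled as a single scalar lookup.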

This may warrant an expedited 0.25.1. WDYT @jreback?

@jreback
Contributor

jreback commented Jul 22, 2019

likely more things would show as people actually use the new release

let’s just do a few weeks on this

@jbrockmendel
Member

At first glance, I don't see how #27110 would cause this since that should affect DatetimeTZBlock but not DatetimeBlock.

There have been some other recent PRs that have tried to simplify core.indexing, maybe something got lost in there. I'll take a look.

@TomAugspurger
Contributor

Ah sorry. I was just going through the release notes for something that sounded promising and stopped at that one.

@jbrockmendel
Member

Tracking this down a bit, following Tom's example.

df.loc.__getitem__ eventually calls df._get_value('2000-01', 'A'). In 0.24.2, KeyError is raised by engine.get_value. Now we fall through following that KeyError.
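
The internal `engine.get_value` call is not something to poke at directly, but the public `DatetimeIndex.get_loc` gives a feel for why '2000-01' is not an ordinary scalar key. This is only a sketch under the assumption of a recent pandas version; the exact return type for a partial string (a slice or something else) may differ between versions, so the snippet only looks at how many labels are matched.

```python
import pandas as pd

idx = pd.date_range("2000", periods=100)

# A key at the index's own (daily) resolution resolves to one position.
exact = idx.get_loc("2000-01-01")

# A lower-resolution key resolves to a whole range of positions.
loc = idx.get_loc("2000-01")
matched = idx[loc]

print(exact)         # a single integer position
print(len(matched))  # number of daily labels that fall in January 2000
```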

@jbrockmendel
Member

Looks like the relevant change was #26298

@TomAugspurger
Contributor

Thanks @jbrockmendel. I would not have guessed that based on the name. Do you have a fix in mind?

TomAugspurger added the Regression and Timeseries labels Jul 22, 2019
@jbrockmendel
Member

> Do you have a fix in mind?

In DataFrame._get_value that PR changed the KeyError behavior to only raise for MultiIndex. That will need to raise in more cases. Not yet sure just how tight it will need to be.
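
The role of that KeyError can be pictured with a toy model. This is not pandas source code: `DATA`, `scalar_lookup`, `general_lookup` and `loc` are all hypothetical names, and the dictionary stands in for an index. It only shows the general pattern of a fast scalar path that has to raise so that the caller can fall back to a more general lookup.

```python
# Toy model only; none of these names exist in pandas.
DATA = {"2000-01-01": 1, "2000-01-02": 2, "2000-02-01": 3}


def scalar_lookup(key):
    """Fast path: succeeds only for an exact label, otherwise raises KeyError."""
    return DATA[key]


def general_lookup(key):
    """Slow path: treat the key as a prefix and return every matching row."""
    return {label: value for label, value in DATA.items() if label.startswith(key)}


def loc(key):
    """Try the fast path first and fall back when it raises KeyError."""
    try:
        return scalar_lookup(key)
    except KeyError:
        return general_lookup(key)


print(loc("2000-01-01"))  # exact label: a single value
print(loc("2000-01"))     # partial label: a labelled collection of rows
```

If `scalar_lookup` swallowed the KeyError and returned something of its own, `loc` would never reach `general_lookup`, and the caller would get back a differently shaped result. That is the shape of the problem described in this comment.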

TomAugspurger changed the title from "Pandas 0.25.0: TypeError: _sum() got an unexpected keyword argument 'skipna'" to "Partial string indexing returns ndarray rather than Series." Aug 1, 2019
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Aug 1, 2019
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Aug 2, 2019
jreback pushed a commit that referenced this issue Aug 4, 2019
@anetbnd
Author

anetbnd commented Aug 5, 2019

Thanks for taking care of this.

quintusdias pushed a commit to quintusdias/pandas_dev that referenced this issue Aug 16, 2019
4 participants