Unexpected behavior of the str attribute which is working with Series of list #25240

Open
fran6w opened this issue Feb 9, 2019 · 4 comments
Labels
Docs Nested Data Data where the values are collections (lists, sets, dicts, objects, etc.). Strings String extension data type and string data

Comments

@fran6w

fran6w commented Feb 9, 2019

Code Sample, a copy-pastable example if possible

import pandas as pd
s1 = pd.Series(['AA/aa', 'BB/bb', 'CC/cc'])
s2 = s1.str.split('/')
s2.str[0]

Result:
0 AA
1 BB
2 CC
dtype: object

Problem description

In this example, the second 'str' accessor is applied to a Series of lists, not to a Series of strings.
The [] operator then works on each list and retrieves its first element...

Since this behavior is unexpected even though it works, one may wonder whether it is safe to write code like this (instead of using apply + lambda, for instance). It also works with a Series of dicts, and probably with any object implementing the [] operator.
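
A minimal sketch of the comparison described above, placing the str accessor next to the explicit apply + lambda form. The names via_str, via_apply and same_values are only illustrative, and the check compares values rather than dtypes, since the result dtype may differ between pandas versions.

```python
import pandas as pd

s1 = pd.Series(['AA/aa', 'BB/bb', 'CC/cc'])
s2 = s1.str.split('/')  # each element is now a list such as ['AA', 'aa']

# Indexing through the str accessor, as in the code sample above.
via_str = s2.str[0]

# The explicit alternative: plain Python indexing on each element.
via_apply = s2.apply(lambda parts: parts[0])

# Compare values only; dtypes are deliberately left out of the comparison.
same_values = via_str.tolist() == via_apply.tolist()
print(same_values)
```

The apply form makes it obvious that ordinary Python indexing is happening on each element, which is the part the str accessor hides.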

Expected Output

Warning or error?
Or a word in the documentation?

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 69 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: 3.3.2
pip: 19.0.1
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.14.2
scipy: 1.0.0
pyarrow: None
xarray: 0.10.9
IPython: 6.2.1
sphinx: 1.6.6
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: 2.5.12
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.2.3
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.1
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@gfyoung gfyoung added Strings String extension data type and string data Index Related to the Index class or subclasses Indexing Related to indexing on series/frames, not to indexes themselves Bug and removed Index Related to the Index class or subclasses labels Feb 9, 2019
@gfyoung
Member

gfyoung commented Feb 9, 2019

Hmm...I'm not sure why the str accessor is even available on s2, since the values are lists, not strings.

That looks like a bug to me.

cc @jreback

@fran6w
Author

fran6w commented Feb 10, 2019

In fact, I have looked at the file core/strings.py, lines 1729+. Try:

import pandas as pd
help(pd.Series.str.get)

The str_get() function is documented and explains that it can extract an element from each component at the specified position. The examples cover strings, lists, tuples, and dicts.
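
As a rough illustration of what that docstring describes, the sketch below calls Series.str.get on a Series mixing a string, a list, a tuple, a dict and a too-short list. The data is made up for this example, and the dict case assumes a pandas version whose str.get handles dicts by key lookup, as the docstring mentioned above indicates.

```python
import pandas as pd

# Made-up data: one element of each kind, plus a list too short for position 1.
mixed = pd.Series(
    ['abc', ['x', 'y'], ('p', 'q'), {1: 'one'}, ['only']],
    dtype=object,
)

# Position 1 for sequences; for the dict, 1 is looked up as a key.
second = mixed.str.get(1)
print(second)
```

The too-short list yields a missing value rather than raising an error, which is one place where str.get differs from plain indexing with apply + lambda.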

My opinion is that the str accessor has indeed a broader use than it is explained in the main pandas documentation, e.g., https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.html

@gfyoung gfyoung added the Docs label Feb 10, 2019
@gfyoung
Member

gfyoung commented Feb 10, 2019

@fran6w : Ah, I see. In this case, I think we just need to document this much more thoroughly at the link you provided above. In light of the behavior of str_get, I'm now less inclined to believe your example is buggy; it looks like expected behavior.

cc @jreback

@fran6w
Author

fran6w commented Feb 11, 2019

Indeed. Updating the documentation would be great.

BTW, taking the example below, the overall behavior of the str accessor remains strange. For instance, the contains(regex=False) method works fine on a Series of lists (or dicts).

import pandas as pd
s1 = pd.Series(['AA/aa', 'BB/bb', 'CC/cc'])
s2 = s1.str.split('/')
s2.str.contains('AA', regex=False)

Result:
0 True
1 False
2 False
dtype: bool

In fact, the str accessor works fine whenever the "string" function implementation called after "str" happens to be compatible with the actual type of the objects in the Series... In the case of contains(regex=False), that branch of the code uses a lambda (f = lambda x: pat in x), which appears to work with lists and dicts as well.
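
To make the side effect concrete, here is a small plain-Python sketch of that lambda applied outside pandas. It only illustrates how the in operator behaves for each container type; the variable names are illustrative and this is not the pandas implementation itself.

```python
pat = 'AA'
f = lambda x: pat in x  # the predicate quoted above

as_string = f('AA/aa')            # substring test on a str
as_list = f(['AA', 'aa'])         # element membership on a list
as_dict = f({'AA': 1, 'aa': 2})   # key membership on a dict

# The meaning changes with the type: a partial match only succeeds on a str.
partial = 'A'
in_string = partial in 'AA/aa'
in_list = partial in ['AA', 'aa']

print(as_string, as_list, as_dict, in_string, in_list)
```

So contains(regex=False) on a Series of lists answers "is this exact element present?", not "does any element contain this substring?", which is easy to misread.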

IMHO, those are working side effects...

@mroeschke mroeschke removed the Bug label May 3, 2020
@jbrockmendel jbrockmendel added Nested Data Data where the values are collections (lists, sets, dicts, objects, etc.). and removed Indexing Related to indexing on series/frames, not to indexes themselves labels Jun 12, 2021

4 participants