str.extract raises ValueError with group named "name" #11385

Closed
tdhock opened this issue Oct 20, 2015 · 3 comments

@tdhock
Contributor

tdhock commented Oct 20, 2015

For a Series S, I find the S.str.extract method very useful. It is great how you implemented naming the resulting DataFrame columns according to the names specified in the capturing groups of the regular expression.

However, there seems to be a bug when a capture group is named "name". For example:

>>> import re
>>> import pandas as pd
>>> import numpy as np
>>> data = {
...     'Dave': 'dave@google.com',
...     'multiple': 'rob@gmail.com some text steve@gmail.com',
...     'none': np.nan,
...     }
>>> pattern = r'''
... (?P<name>[a-z]+)
... @
... (?P<domain>[a-z]+)
... \.
... (?P<tld>[a-z]{2,4})
... '''
>>> S = pd.Series(data)
>>> result = S.str.extract(pattern, re.VERBOSE)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas/core/strings.py", line 1370, in extract
    return self._wrap_result(result, name=name)
  File "pandas/core/strings.py", line 1088, in _wrap_result
    name = kwargs.get('name') or getattr(result, 'name', None) or self.series.name
  File "pandas/core/generic.py", line 730, in __nonzero__
    .format(self.__class__.__name__))
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
>>> from pandas.util.print_versions import show_versions
>>> show_versions()

INSTALLED VERSIONS
------------------
commit: 5d953e3fba420b6721c7f1c5d53e5812fe113bbc
python: 2.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.8.0-44-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_CA.UTF-8

pandas: 0.17.0+73.g5d953e3
nose: 1.1.2
pip: None
setuptools: 0.6
Cython: 0.20.1
numpy: 1.9.1
scipy: 0.14.0
statsmodels: None
IPython: 0.12.1
sphinx: None
patsy: None
dateutil: 1.5
pytz: 2015.6
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: 0.7.2
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
>>>

The result I expected was

>>> exp_list = [
...     ("dave", "google", "com"),
...     ("rob", "gmail", "com"),
...     (np.nan, np.nan, np.nan),
...     ]
>>> exp = pd.DataFrame(
...     exp_list,
...     ["Dave", "multiple", "none"],
...     ["name", "domain", "tld"])
>>> print exp
          name  domain  tld
Dave      dave  google  com
multiple   rob   gmail  com
none       NaN     NaN  NaN
>>>
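
For readers on Python 3, the expected behaviour above can be checked with a short script. This is only a sketch: it assumes a pandas release in which this issue is already fixed, it passes `flags` and `expand` as keyword arguments, and the variable names simply mirror the session above.

```python
import re

import numpy as np
import pandas as pd

# Same input as in the report: one match, several matches, and a missing value.
data = {
    "Dave": "dave@google.com",
    "multiple": "rob@gmail.com some text steve@gmail.com",
    "none": np.nan,
}

# Verbose pattern with three named groups, including the one called "name".
pattern = r"""
(?P<name>[a-z]+)
@
(?P<domain>[a-z]+)
\.
(?P<tld>[a-z]{2,4})
"""

S = pd.Series(data)

# extract() keeps only the first match per element and names the
# columns after the capture groups.
result = S.str.extract(pattern, flags=re.VERBOSE, expand=True)
print(result)
```

If the installed pandas still has the bug, the `extract` call raises the `ValueError` shown in the traceback instead of printing a frame.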
@tdhock
Contributor Author

tdhock commented Oct 20, 2015

@sinhrks @jorisvandenbossche @jreback @mortada since you seem to be discussing extract in #10103

@jorisvandenbossche added the Bug and Strings labels Oct 20, 2015
@jreback added this to the Next Major Release milestone Oct 20, 2015
@Winterflower
Contributor

@jreback should we prevent a user from using 'name' as one of the regex capture group names?

(I may be wrong here)

The problem seems to occur because, in _wrap_result in pandas/core/strings.py,
getattr(result, 'name', None)
returns the 'name' column (a Series) instead of a name attribute. Evaluating that Series in the `or` chain is what raises the ValueError in the traceback.

str_extract does not set a name attribute on the result it returns, so getattr would normally fall back to None. The exceptions are @tdhock's use case and any other method that calls _wrap_result with a result whose name has been set explicitly.
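
The mechanism can be reproduced without calling str.extract at all. The snippet below is a minimal sketch, not pandas internals: `result` is a hand-built stand-in for the frame that extract would produce, and `"fallback"` is a placeholder for `self.series.name`.

```python
import pandas as pd

# Stand-in for the extract() result: a frame with a column called "name".
result = pd.DataFrame({"name": ["dave", "rob"], "domain": ["google", "gmail"]})

# A DataFrame has no .name attribute of its own, so attribute access
# resolves to the column and returns a Series rather than None.
looked_up = getattr(result, "name", None)
is_series = isinstance(looked_up, pd.Series)

# An `or` chain has to take the truth value of that Series, which
# pandas refuses to do.
try:
    chosen = None or looked_up or "fallback"
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(is_series)
print(error_message)
```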

One solution, I suppose, would be to check in f, inside str_extract, whether one of the named groups in the pattern is called 'name', but I don't know whether this is a good approach.

Something like:

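import warnings

def str_extract(arr, pat, flags=0):
    # omitting extra stuff (regex and empty_row are defined here)
    def f(x):
        # non-string values (e.g. NaN) produce an all-NaN row
        if not isinstance(x, compat.string_types):
            return empty_row
        m = regex.search(x)
        if m:
            if "name" in m.groupdict():
                # do something to warn the user; a warning is one option
                warnings.warn("capture group 'name' clashes with the name attribute")
            # unmatched optional groups come back as None; store them as NaN
            return [np.nan if item is None else item for item in m.groups()]
        else:
            return empty_row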
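
A different direction, sketched here only for comparison, would be to leave the group names alone and stop relying on truthiness when choosing the name. `pick_name` is a hypothetical helper, not a pandas function, and this is not a claim about how the bug was eventually fixed.

```python
import pandas as pd


def pick_name(result, explicit_name=None, fallback=None):
    """Choose a name without taking the truth value of a Series."""
    # An explicitly passed name always wins.
    if explicit_name is not None:
        return explicit_name
    # Only a Series carries a meaningful .name; on a DataFrame the same
    # lookup can resolve to a column called "name".
    if isinstance(result, pd.Series) and result.name is not None:
        return result.name
    return fallback


frame = pd.DataFrame({"name": ["dave"], "domain": ["google"]})
series = pd.Series(["dave"], name="user")

frame_name = pick_name(frame, fallback="from_series")
series_name = pick_name(series, fallback="from_series")
print(frame_name, series_name)
```

The trade-off is that a name of `0` or an empty string is kept as given, which differs from the `or` chain, where any falsy name is skipped.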

@sinhrks modified the milestones: 0.18.0, Next Major Release Apr 10, 2016
@sinhrks
Member

sinhrks commented Apr 10, 2016

Closed by #11386.

@sinhrks sinhrks closed this as completed Apr 10, 2016