str.extract raises ValueError with group named "name" #11385

tdhock · 2015-10-20T16:15:02Z

For a Series S, I find the S.str.extract method very useful. It is great how you implemented naming the resulting DataFrame columns according to the names specified in the capturing groups of the regular expression.

However, there seems to be a bug when a capture group is named "name". For example:

>>> import re
>>> import pandas as pd
>>> import numpy as np
>>> data = {
...     'Dave': 'dave@google.com',
...     'multiple': 'rob@gmail.com some text steve@gmail.com',
...     'none': np.nan,
...     }
>>> pattern = r'''
... (?P<name>[a-z]+)
... @
... (?P<domain>[a-z]+)
... \.
... (?P<tld>[a-z]{2,4})
... '''
>>> S = pd.Series(data)
>>> result = S.str.extract(pattern, re.VERBOSE)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas/core/strings.py", line 1370, in extract
    return self._wrap_result(result, name=name)
  File "pandas/core/strings.py", line 1088, in _wrap_result
    name = kwargs.get('name') or getattr(result, 'name', None) or self.series.name
  File "pandas/core/generic.py", line 730, in __nonzero__
    .format(self.__class__.__name__))
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
>>> from pandas.util.print_versions import show_versions
>>> show_versions()

INSTALLED VERSIONS
------------------
commit: 5d953e3fba420b6721c7f1c5d53e5812fe113bbc
python: 2.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.8.0-44-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_CA.UTF-8

pandas: 0.17.0+73.g5d953e3
nose: 1.1.2
pip: None
setuptools: 0.6
Cython: 0.20.1
numpy: 1.9.1
scipy: 0.14.0
statsmodels: None
IPython: 0.12.1
sphinx: None
patsy: None
dateutil: 1.5
pytz: 2015.6
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: 0.7.2
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
>>>

The result I expected was

>>> exp_list = [
...     ("dave", "google", "com"),
...     ("rob", "gmail", "com"),
...     (np.nan, np.nan, np.nan),
...     ]
>>> exp = pd.DataFrame(
...     exp_list,
...     ["Dave", "multiple", "none"],
...     ["name", "domain", "tld"])
>>> print exp
          name  domain  tld
Dave      dave  google  com
multiple   rob   gmail  com
none       NaN     NaN  NaN
>>>


tdhock · 2015-10-20T16:16:37Z

@sinhrks @jorisvandenbossche @jreback @mortada since you seem to be discussing extract in #10103

Winterflower · 2015-10-23T13:36:46Z

@jreback should we prevent a user from using 'name' as one of the regex capture group names?

(I may be wrong here)

The problem seems to occur because, in _wrap_result in pandas/core/strings.py, getattr(result, 'name', None) returns the 'name' column (a Series) instead of the name attribute.

str_extract does not set a name attribute on the result it returns, so getattr would normally fall back to None. The exceptions are @tdhock's use case, where a column is itself called 'name', and any other method that calls _wrap_result with a result that explicitly sets name.
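To illustrate the lookup described above, here is a minimal sketch that reproduces the behaviour outside of str.extract. The frame, its values, and the "fallback" string are made up for illustration; only the getattr call and the `or` chain mirror the line shown in the traceback.

```python
import pandas as pd

# A frame shaped like the extract() result in this issue:
# one of its columns is called "name".
df = pd.DataFrame(
    {"name": ["dave", "rob"], "domain": ["google", "gmail"]},
    index=["Dave", "multiple"],
)

# Attribute lookup on a DataFrame falls back to column lookup,
# so this returns the "name" column rather than None.
looked_up = getattr(df, "name", None)
print(type(looked_up).__name__)

# Putting that Series in the middle of an `or` chain forces a truth test,
# which is what raises the ValueError.
message = None
try:
    chosen = None or looked_up or "fallback"
except ValueError as err:
    message = str(err)
print(message)
```

The first `None` stands in for kwargs.get('name') and "fallback" for self.series.name; the error comes from the middle operand alone.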

One solution, I suppose, would be to check in f inside str_extract whether one of the named groups in the pattern is called 'name', but I don't know if this is a good approach.

Something like:

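import warnings

def str_extract(arr, pat, flags=0):
    # omitting extra stuff (regex and empty_row are set up here)
    def f(x):
        if not isinstance(x, compat.string_types):
            return empty_row
        m = regex.search(x)
        if m:
            if "name" in m.groupdict():
                # do something to warn the user, for example:
                warnings.warn("capture group named 'name' may clash "
                              "with the name attribute of the result")
            return [np.nan if item is None else item for item in m.groups()]
        else:
            return empty_row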
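A different direction, sketched here only as an assumption and not as what #11386 actually changed, would be to leave group names alone and avoid truth-testing pandas objects when choosing the name. The helper pick_name and its arguments are hypothetical.

```python
import pandas as pd

def pick_name(explicit, result, fallback):
    """Hypothetical helper: choose a name without truth-testing pandas objects."""
    if explicit is not None:
        return explicit
    # Only a Series carries a meaningful .name; on a DataFrame,
    # .name may resolve to a column, so it is skipped here.
    if isinstance(result, pd.Series) and result.name is not None:
        return result.name
    return fallback

frame = pd.DataFrame({"name": ["dave"], "domain": ["google"]})
series = pd.Series(["dave"], name="user")

print(pick_name(None, frame, "S"))   # S
print(pick_name(None, series, "S"))  # user
print(pick_name("x", series, "S"))   # x
```

Comparing against None explicitly means a column called 'name' is never evaluated in a boolean context, so users could keep naming their groups freely.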

sinhrks · 2016-04-10T01:46:44Z

Closed by #11386.

jorisvandenbossche added the Bug and Strings (String extension data type and string data) labels Oct 20, 2015

jreback added this to the Next Major Release milestone Oct 20, 2015

jreback added Difficulty Novice labels Oct 20, 2015

sinhrks modified the milestones: 0.18.0, Next Major Release Apr 10, 2016

sinhrks closed this as completed Apr 10, 2016
