New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: str.extractall for several matches #11386

Closed
wants to merge 1 commit into
base: master
from

Conversation

Projects
None yet
5 participants
@tdhock

tdhock commented Oct 20, 2015

For a series S, the excellent S.str.extract method returns the first match in each subject of the series:

>>> import re
>>> import pandas as pd
>>> import numpy as np
>>> data = {
...     'Dave': 'dave@google.com',
...     'multiple': 'rob@gmail.com some text steve@gmail.com',
...     'none': np.nan,
...     }
>>> pattern = r'''
... (?P<user>[a-z]+)
... @
... (?P<domain>[a-z]+)
... \.
... (?P<tld>[a-z]{2,4})
... '''
>>> S = pd.Series(data)
>>> S.str.extract(pattern, re.VERBOSE)
          user  domain  tld
Dave      dave  google  com
multiple   rob   gmail  com
none       NaN     NaN  NaN
>>> 

That's great, but sometimes we want to extract all matches in each element of the series. You can do that with S.str.findall but its result does not include the names specified in the capturing groups of the regular expression:

>>> S.str.findall(pattern, re.VERBOSE)
Dave                           [(dave, google, com)]
multiple    [(rob, gmail, com), (steve, gmail, com)]
none                                             NaN
dtype: object
>>> 

I propose the S.str.extractall method which returns a Series the same length as the subject S. Each element of the series is a DataFrame with a row for each match and a column for each group:

>>> result = S.str.extractall(pattern, re.VERBOSE)
>>> result[0]
   user  domain  tld
0  dave  google  com
>>> result[1]
    user domain  tld
0    rob  gmail  com
1  steve  gmail  com
>>> result[2]
Empty DataFrame
Columns: [user, domain, tld]
Index: []
>>> 

Before I write any more testing code, can we start a discussion about whether or not this is an acceptable design choice, in relation to the other functionality of pandas? @sinhrks @jorisvandenbossche @jreback @mortada since you seem to be discussing extract in #10103

Also do you have any ideas about how to get the result (a Series of DataFrames) to print more nicely? With my current fork we have

>>> result
Dave                   user  domain  tld
0  dave  google  com
multiple        user domain  tld
0    rob  gmail  com
1  s...
none        Empty DataFrame
Columns: [user, domain, tld]
I...
dtype: object
>>> 

In R the equivalent functionality is provided by the https://github.com/tdhock/namedCapture package (str_match_all_named returns a list of data.frames), and the resulting printout is readable because of the way that R prints lists:

> library(namedCapture)
> S <- c(
+   Dave='dave@google.com',
+   multiple='rob@gmail.com some text steve@gmail.com',
+   none=NA)
> pattern <- paste0(
+   "(?P<user>[a-z]+)",
+   "@",
+   "(?P<domain>[a-z]+)",
+   "[.]",
+   "(?P<tld>[a-z]{2,4})")
> str_match_all_named(S, pattern)
$Dave
     user   domain   tld  
[1,] "dave" "google" "com"

$multiple
     user    domain  tld  
[1,] "rob"   "gmail" "com"
[2,] "steve" "gmail" "com"

$none
<0 x 0 matrix>

> 
@jreback

This comment has been minimized.

Contributor

jreback commented Oct 20, 2015

never embed DataFrames within a Series!

instead you could simply return a multi-indexed frame, e.g. the first level are the original indexes of the Series, the 2nd level are the number of matches the columns as you have them

@jreback

This comment has been minimized.

Contributor

jreback commented Oct 20, 2015

In [13]: DataFrame({'user' : ['dave','rob','steve',np.nan],
                              'domain' : ['google.com','gmail.com','gmail.com',np.nan]},
                              index=pd.MultiIndex.from_tuples([('Dave',0),('multiple',0),('multiple',1),('none',0)]))
Out[13]: 
                domain   user
Dave     0  google.com   dave
multiple 0   gmail.com    rob
         1   gmail.com  steve
none     0         NaN    NaN
@jreback

This comment has been minimized.

Contributor

jreback commented Oct 20, 2015

@tdhock further you would prob add a parameter on how to handle partial matches (e.g. the some and text fields), e.g. would you be greedy then nan-fill or skip unless all filled (e.g. what you are doing now)

@tdhock

This comment has been minimized.

tdhock commented Oct 21, 2015

Thanks for the suggestions @jreback. Now extractall returns a DataFrame with a MultiIndex:

>>> import re
>>> import pandas as pd
>>> import numpy as np
>>> data_dict = {
...     'single': {
...         "Dave":'dave@google.com',
...         "Toby":'tdhock5@gmail.com',
...         "Maude":'maudelaperriere@gmail.com',
...         },
...     'multiple': {
...         "robAndSteve": 'rob@gmail.com some text steve@gmail.com',
...         "abcdef": 'a@b.com some text c@d.com and e@f.com',
...         },
...     'none': {
...         "missing":np.nan,
...         "empty":"",
...         },
...     }
>>> tuple_list = []
>>> subject_list = []
>>> for k1, d in data_dict.iteritems():
...     for k2, subject in d.iteritems():
...         k = (k1, k2)
...         tuple_list.append(k)
...         subject_list.append(subject)
... 
>>> index = pd.MultiIndex.from_tuples(tuple_list, names=("matches", "subject"))
>>> Si = pd.Series(subject_list, index)
>>> named_pattern = r'''
... (?P<user>[a-z0-9]+)
... @
... (?P<domain>[a-z]+)
... \.
... (?P<tld>[a-z]{2,4})
... '''
>>> iresult = Si.str.extractall(named_pattern, re.VERBOSE)
>>> iresult
                                 user  domain  tld
matches  subject                                  
single   Dave                    dave  google  com
         Maude        maudelaperriere   gmail  com
         Toby                 tdhock5   gmail  com
multiple robAndSteve              rob   gmail  com
         robAndSteve            steve   gmail  com
         abcdef                     a       b  com
         abcdef                     c       d  com
         abcdef                     e       f  com
>>> S = pd.Series(subject_list)
>>> result = S.str.extractall(named_pattern, re.VERBOSE)
>>> result
              user  domain  tld
0             dave  google  com
1  maudelaperriere   gmail  com
2          tdhock5   gmail  com
3              rob   gmail  com
3            steve   gmail  com
4                a       b  com
4                c       d  com
4                e       f  com
>>>

So then we can access all the matches for the subject string with rob and steve via:

>>> iresult.loc["multiple", "robAndSteve"]
                       user domain  tld
matches  subject                       
multiple robAndSteve    rob  gmail  com
         robAndSteve  steve  gmail  com
>>> result.loc[3]
    user domain  tld
3    rob  gmail  com
3  steve  gmail  com

For subjects that have 0 matches I think it would be more consistent and user-friendly if the following would return a DataFrame with 0 rows rather than an exception. Is that possible using some index options?

>>> result.loc[5]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas/core/indexing.py", line 1198, in __getitem__
    return self._getitem_axis(key, axis=0)
  File "pandas/core/indexing.py", line 1342, in _getitem_axis
    self._has_valid_type(key, axis)
  File "pandas/core/indexing.py", line 1304, in _has_valid_type
    error()
  File "pandas/core/indexing.py", line 1291, in error
    (key, self.obj._get_axis_name(axis)))
KeyError: 'the label [5] is not in the [index]'
>>> iresult.loc["none", "empty"]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas/core/indexing.py", line 1196, in __getitem__
    return self._getitem_tuple(key)
  File "pandas/core/indexing.py", line 709, in _getitem_tuple
    return self._getitem_lowerdim(tup)
  File "pandas/core/indexing.py", line 822, in _getitem_lowerdim
    result = self._handle_lowerdim_multi_index_axis0(tup)
  File "pandas/core/indexing.py", line 804, in _handle_lowerdim_multi_index_axis0
    raise e1
KeyError: 'none'
>>> 
@tdhock

This comment has been minimized.

tdhock commented Nov 1, 2015

I further propose the extractiter method which returns an iterator over DataFrames, one for each subject: one row for each match, one column for each group:

>>> for df in Si.str.extractiter(named_pattern, re.VERBOSE):
...   print df
... 
   user  domain  tld
0  dave  google  com
              user domain  tld
0  maudelaperriere  gmail  com
      user domain  tld
0  tdhock5  gmail  com
    user domain  tld
0    rob  gmail  com
1  steve  gmail  com
  user domain  tld
0    a      b  com
1    c      d  com
2    e      f  com
Empty DataFrame
Columns: [user, domain, tld]
Index: []
Empty DataFrame
Columns: [user, domain, tld]
Index: []
>>> for df in Si.str.extractiter(named_pattern, re.VERBOSE):
...   print df["domain"]
... 
0    google
Name: domain, dtype: object
0    gmail
Name: domain, dtype: object
0    gmail
Name: domain, dtype: object
0    gmail
1    gmail
Name: domain, dtype: object
0    b
1    d
2    f
Name: domain, dtype: object
Series([], Name: domain, dtype: object)
Series([], Name: domain, dtype: object)
>>> 
@jreback

This comment has been minimized.

Contributor

jreback commented Nov 1, 2015

@tdhock let's just keep it straightforward for now
we don't support iterators like this so it would be a big API change

@tdhock

This comment has been minimized.

tdhock commented Nov 1, 2015

OK, in that case I deleted extractiter, and added some more tests and examples for extractall.

@jreback

View changes

pandas/core/strings.py Outdated
>>> S.str.extractall("(?P<letter>[ab])(?P<digit>\d)")
letter digit
A a 1

This comment has been minimized.

@jreback

jreback Nov 2, 2015

Contributor

why would this NOT be a multi-index here? Having a duplicated index is not at all convenient.

This comment has been minimized.

@tdhock

tdhock Nov 2, 2015

not sure what you mean. Can you please clarify?

When the input Series is not multi-indexed, there is no reason the output DataFrame should be. This is the same as the behavior of the standard extract method:

>>> from pandas import Series
>>> S = Series(["a1a2", "b1", "c1"], ["A", "B", "C"])
>>> S.str.extractall("(?P<letter>[ab])(?P<digit>\d)")
  letter digit
A      a     1
A      a     2
B      b     1
>>> S.str.extract("(?P<letter>[ab])(?P<digit>\d)")
  letter digit
A      a     1
B      b     1
C    NaN   NaN
>>> e_df = S.str.extract("(?P<letter>[ab])(?P<digit>\d)")
>>> e_df.index
Index([u'A', u'B', u'C'], dtype='object')
>>> e_df.keys()
Index([u'letter', u'digit'], dtype='object')
>>> ea_df = S.str.extractall("(?P<letter>[ab])(?P<digit>\d)")
>>> ea_df.index
Index([u'A', u'A', u'B'], dtype='object')
>>> ea_df.keys()
Index([u'letter', u'digit'], dtype='object')
>>> 

This comment has been minimized.

@jreback

jreback Nov 2, 2015

Contributor

.extract returns a like-index object to the original.

the proposed .extractall by definition will have duplicates of some of the index elements. This is very different. This by its ery nature should return a MultiIndex (or if the input has a multi-index), then add a level.

@jreback

View changes

pandas/core/strings.py Outdated
def str_extractall(arr, pat, flags=0):
"""
Find all non-overlapping groups in each
subject string in the Series using regular expression pat.

This comment has been minimized.

@jreback

jreback Nov 2, 2015

Contributor

explain how this is different than str.extract, e.g. when it would return the same / different

@jreback

This comment has been minimized.

Contributor

jreback commented Nov 25, 2015

can you rebase / squash and i'll take a look

@tdhock

This comment has been minimized.

tdhock commented Nov 25, 2015

OK @jreback this is my first time doing a rebase / squash. Did I do it correctly?

@jreback

This comment has been minimized.

Contributor

jreback commented Nov 25, 2015

yes that looks right

@jreback

This comment has been minimized.

Contributor

jreback commented Nov 25, 2015

  • add to api.rst
  • add a note in v0.18.0.txt (use this PR number as the issue number), in Enhancement
  • update text.rst
  • update test in test_categorical if needed (this implements .str as well)
@jreback

View changes

pandas/core/strings.py Outdated
>>> S.str.extractall("(?P<letter>[ab])?(?P<digit>\d)")
letter digit
match

This comment has been minimized.

@jreback

jreback Nov 25, 2015

Contributor

I think this should have the Index name from the original Series (could be None), for the first level, the 2nd level is ok as match. pls add tests for that as well.

This comment has been minimized.

@tdhock

tdhock Nov 25, 2015

In fact the name of the index is taken from the original Series, and it could be None.

In [3]: S = Series(["a1a2", "b1", "c1"], ["A", "B", "C"])

In [5]: S.str.extractall("(?P<letter>[ab])?(?P<digit>\d)")
Out[5]: 
        letter digit
  match             
A 0          a     1
  1          a     2
B 0          b     1
C 0        NaN     1

In [6]: S.str.extractall("(?P<letter>[ab])?(?P<digit>\d)").index
Out[6]: 
MultiIndex(levels=[[u'A', u'B', u'C'], [0, 1]],
           labels=[[0, 0, 1, 2], [0, 1, 0, 0]],
           names=[None, u'match'])

In [7]: Sn = Series(["a1a2", "b1", "c1"], ["A", "B", "C"])

In [10]: Sn.index.name = "capital"

In [12]: Sn.str.extractall("(?P<letter>[ab])?(?P<digit>\d)")
Out[12]: 
              letter digit
capital match             
A       0          a     1
        1          a     2
B       0          b     1
C       0        NaN     1

Do you think I should add a name to the Series used in the docstring?

This comment has been minimized.

@jreback

jreback Nov 25, 2015

Contributor

not necessary, just make sure have a test for the name. This is now the default to check names (on series/frame comparisons)

This comment has been minimized.

@tdhock

tdhock Nov 25, 2015

OK, indeed there is already a test for a subject Series with a named index.

@jreback

View changes

pandas/core/strings.py Outdated
Indices with no matches will not appear in the result.
>>> S = Series(["a1a2", "b1", "c1"], ["A", "B", "C"])
>>> S.str.extractall("[ab](?P<digit>\d)")

This comment has been minimized.

@jreback

jreback Nov 25, 2015

Contributor

show S here (and make all lower case s)

This comment has been minimized.

@tdhock
@jreback

View changes

pandas/tests/test_strings.py Outdated
expected_tuples, expected_index, expected_columns)
computed_df = S.str.extractall(named_pattern, re.VERBOSE)
tm.assert_frame_equal(computed_df, expected_df)
# The index of the input Series should be used to construct

This comment has been minimized.

@jreback

jreback Nov 25, 2015

Contributor

put blank lines between different tests here (or sections of tests), makes more readable

This comment has been minimized.

@tdhock
@jreback

View changes

pandas/core/strings.py Outdated
regex = re.compile(pat, flags=flags)
# the regex must contain capture groups.
if regex.groups == 0:
raise ValueError("This pattern contains no groups to capture.")

This comment has been minimized.

@jreback

jreback Nov 25, 2015

Contributor

need a tests for this error condition, also if only have 1 pattern, assert that this returns the same as .extract. Though case could be made for always returning a MultiIndex (even if only a single element on the top-level). That way its predictable. (I am leaning towards this). @jorisvandenbossche @shoyer

This comment has been minimized.

@tdhock
@jreback

View changes

pandas/core/strings.py Outdated
first_col = columns[0]
name = first_col if first_col in regex.groupindex else None
result = Series(first_list, index, name=name)
else:

This comment has been minimized.

@jreback

jreback Nov 25, 2015

Contributor

IIRC this is what the expand kw does (or maybe we should always return a DataFrame here?), again so its not depedent on whether have a single or multiple groups, becomes more predictible from a user perspective. What do we do in .find/.findall for this? (e.g. we should match, maybe fix this so its consistent)

This comment has been minimized.

@tdhock

tdhock Nov 25, 2015

I see the expand keyword defined in #10103 as

Or you can specify ``expand=False`` to return Series.

I agree that sometimes returning a DataFrame and sometimes returning a Series is confusing from a user perspective.

This design choice (return a Series if there is only one group) was made to be consistent with the current implementation of extract.

I'm not sure if the analogy with find/findall is appropriate:

  • find takes a string (not a regex) and returns the index of the first match.
  • findall takes a regex with capture groups, and returns the contents of those capture groups.
In [3]: s = Series(["a1a2", "b1", "c1"], ["A", "B", "C"])

In [10]: s.str.find("a")
Out[10]: 
A    0
B   -1
C   -1
dtype: int64

In [11]: s.str.findall("a")
Out[11]: 
A    [a, a]
B        []
C        []
dtype: object

>>> s.str.findall("(a)")
A    [a, a]
B        []
C        []
dtype: object

>>> s.str.findall("(a)([12])")
A    [(a, 1), (a, 2)]
B                  []
C                  []
dtype: object

This comment has been minimized.

@tdhock

tdhock Nov 25, 2015

If it was my choice I would always return a DataFrame, in both extract and extractall!

  • easier to maintain
  • easier for users to understand

This comment has been minimized.

@jreback

jreback Nov 25, 2015

Contributor

@tdhock I think I agree with you here.ok, let's always return a DataFrame. This will entail changing .extract and putting in API breaking changes. @jorisvandenbossche ?

@tdhock

This comment has been minimized.

tdhock commented Nov 26, 2015

TODO update docstrings

@tdhock

This comment has been minimized.

tdhock commented Nov 26, 2015

TODOs

  • double check api.rst
  • add a note in v0.18.0.txt (use this PR number as the issue number), in Enhancement
  • double check text.rst
  • add extractall tests to test_categorical? (this implements .str as well) Is this needed? There are no tests for extract in test_categorical.
@jreback

This comment has been minimized.

Contributor

jreback commented Nov 26, 2015

.str functions are all tested in test--categorical - only the ones that need args are special cased

@jreback

View changes

doc/source/text.rst Outdated
S = pd.Series(["a1a2", "b1", "c1"], ["A", "B", "C"])
S.str.extract("[ab](?P<digit>\d)")
the ``extractall`` method (introduced in version 0.18)

This comment has been minimized.

@jreback

jreback Nov 27, 2015

Contributor

use a versionadded tag here

This comment has been minimized.

@tdhock
@jreback

View changes

pandas/core/strings.py Outdated
if isinstance(arr, Index):
if 1 < regex.groups:
raise ValueError("only one regex group is supported with Index")

This comment has been minimized.

@jreback

jreback Nov 27, 2015

Contributor

do you have a test for this?

This comment has been minimized.

@tdhock

tdhock Nov 27, 2015

Seems to me that there is no reason we should impose a limit of only one regex group for indices. This feature comes from #9980 which explains that they want extract to return an Index when it is called on an Index. In that case it seems like extract with one group should return an Index, and with several groups it should return a MultiIndex (not an error as in current implementation). what do you think?

This comment has been minimized.

@tdhock

tdhock Nov 27, 2015

In my opinion we should not return Index, even when it is called on an Index.

extract/extractall should always return DataFrame, and then if the user wants he can create an Index or MultiIndex.

Again this should be easier to maintain and easier for users to understand.

@jreback

View changes

pandas/core/strings.py Outdated
names = dict(zip(regex.groupindex.values(), regex.groupindex.keys()))
columns = [names.get(1 + i, i) for i in range(regex.groups)]
if len(arr) == 0:
result = DataFrame(columns=columns, dtype=object)
else:

This comment has been minimized.

@jreback

jreback Nov 27, 2015

Contributor

you don't need an else here (as the if returns)

This comment has been minimized.

@tdhock
@jreback

View changes

pandas/core/strings.py Outdated
Series has exactly one match, extractall(pat).xs(0, level='match')
is the same as extract(pat).
Parameters

This comment has been minimized.

@jreback

jreback Nov 27, 2015

Contributor

add a versionadded tag here

This comment has been minimized.

@tdhock
@jreback

View changes

pandas/tests/test_strings.py Outdated
r = s.str.extract(r'(?P<letter>[a-z])')
e = DataFrame({"letter":['a', 'b', 'c']})
tm.assert_frame_equal(r, e)

This comment has been minimized.

@jreback

jreback Nov 27, 2015

Contributor

on these test add a comment with this issue number (and an explanation if the title is not enough)

This comment has been minimized.

@tdhock
@jreback

This comment has been minimized.

Contributor

jreback commented Nov 27, 2015

@tdhock need a big note in whatsnew API breaking changes to show off the prior / new behavior in .extract

@jreback jreback added this to the 0.18.0 milestone Nov 27, 2015

@jreback

View changes

pandas/core/strings.py Outdated
"""
from pandas.core.series import Series
from pandas.core.frame import DataFrame

This comment has been minimized.

@jreback

jreback Jan 19, 2016

Contributor

same import patter here as above

This comment has been minimized.

@tdhock
@jreback

View changes

pandas/tests/test_strings.py Outdated
tm.makeDateIndex, tm.makePeriodIndex]:
i_funs = [
tm.makeStringIndex, tm.makeUnicodeIndex, tm.makeIntIndex,
tm.makeDateIndex, tm.makePeriodIndex

This comment has been minimized.

@jreback

jreback Jan 19, 2016

Contributor

add tm.makeRangeIndex

This comment has been minimized.

@tdhock
@jreback

This comment has been minimized.

Contributor

jreback commented Jan 19, 2016

@tdhock really good!

I didn't see any tests asserting the FutureWarning for passing expand=None. can you add?

@jreback

View changes

doc/source/whatsnew/v0.18.0.txt Outdated
s.str.extract("(?P<letter>[a-zA-Z])", expand=False)
s.str.extract("(?P<letter>[a-zA-Z])", expand=True)
Calling ``index.str.extract`` on a regex with exactly one capture

This comment has been minimized.

@jreback

jreback Jan 19, 2016

Contributor

same comments as in text.rst

This comment has been minimized.

@tdhock
@jreback

View changes

doc/source/whatsnew/v0.18.0.txt Outdated
``NaT`` defines more arithmetic operations with ``datetime64[ns]`` and ``timedelta64[ns]``.
.. ipython:: python

This comment has been minimized.

@jreback

jreback Jan 19, 2016

Contributor

you will need to rebase on master

@jreback

View changes

doc/source/whatsnew/v0.18.0.txt Outdated
``expand=False`` it returns a ``Series``, ``Index``, or ``DataFrame``,
depending on the subject and regular expression pattern (same behavior
as pre-0.18.0). When ``expand=True`` it always returns a
``DataFrame``, which is more consistent and less confusing from the

This comment has been minimized.

@jreback

jreback Jan 19, 2016

Contributor

give a little explanation of the requirement for passing expand (otherwise you will get the warning)

This comment has been minimized.

@tdhock
@tdhock

This comment has been minimized.

tdhock commented Jan 29, 2016

I rebased with master and removed the duplications in the whatsnew file.

I ran the tests on my machine but I am getting many error which are unrelated to my PR

thocking@silene:~/pandas(extractall)$ nosetests pandas/tests/test_strings.py
.E..EE.EEE....E........E..EEE..EE..E......EEEEE....EEE...EEE.EE.E..E..EE......E.EE...
======================================================================
ERROR: test_capitalize (pandas.tests.test_strings.TestStringMethods)
----------------------------------------------------------------------

======================================================================
ERROR: test_title (pandas.tests.test_strings.TestStringMethods)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/thocking/pandas/pandas/tests/test_strings.py", line 288, in test_title
    tm.assert_almost_equal(mixed, exp)
  File "testing.pyx", line 58, in pandas._testing.assert_almost_equal (pandas/src/testing.c:3342)
  File "testing.pyx", line 139, in pandas._testing.assert_almost_equal (pandas/src/testing.c:2287)
  File "/home/thocking/pandas/pandas/core/series.py", line 558, in __getitem__
    result = self.index.get_value(self, key)
  File "/home/thocking/pandas/pandas/indexes/base.py", line 1920, in get_value
    raise IndexError(key)
IndexError: 0

----------------------------------------------------------------------
Ran 85 tests in 0.369s

FAILED (errors=34)
@jreback

This comment has been minimized.

Contributor

jreback commented Jan 29, 2016

did you rebuild the extensions, e.g. make?

@tdhock

This comment has been minimized.

tdhock commented Jan 29, 2016

thanks for the tip. After rebuilding the extensions all tests pass on my machine.

@@ -201,9 +207,106 @@ and optional groups like
.. ipython:: python
pd.Series(['a1', 'b2', '3']).str.extract('(?P<letter>[ab])?(?P<digit>\d)')
pd.Series(['a1', 'b2', '3']).str.extract('([ab])?(\d)')

This comment has been minimized.

@jreback

jreback Feb 1, 2016

Contributor

make extract and extractall sub-sections (I think you might have to use ^^^^) as the sub-headings

This comment has been minimized.

@tdhock
@jreback

View changes

pandas/core/strings.py Outdated
empty_row = [np.nan] * regex.groups
def f(x):

This comment has been minimized.

@jreback

jreback Feb 1, 2016

Contributor

might worth it to separate out this f(x) to an external function factory to generate for _str_extract_noexpand and here to not repeat the function. (only do if it looks better / less confusing)

@jreback

View changes

doc/source/whatsnew/v0.18.0.txt Outdated
perspective of a user (:issue:`11386`). Currently the default is
``expand=None`` which gives a ``FutureWarning`` and uses
``expand=False``. To avoid this warning, please explicitly specify
``expand``.

This comment has been minimized.

@jreback

jreback Feb 1, 2016

Contributor

show an example (maybe in a note), that shows the future warning you will get when you don't supply expand

This comment has been minimized.

@tdhock
@jreback

View changes

doc/source/whatsnew/v0.18.0.txt Outdated
The :ref:`.str.extract <text.extract>` method takes a regular
expression with capture groups, finds the first match in each subject
string, and returns the contents of the capture groups.

This comment has been minimized.

@jreback

jreback Feb 1, 2016

Contributor

put this issue number (e.g. the PR number) at the end of this first sentence (as an issue number)

This comment has been minimized.

@tdhock
@jreback

This comment has been minimized.

Contributor

jreback commented Feb 1, 2016

@tdhock just a couple of doc changes. ping when pushed and green and we'll merge.

groups_or_na = _groups_or_na_fun(regex)
if regex.groups == 1:
result = np.array([groups_or_na(val)[0] for val in arr], dtype=object)

This comment has been minimized.

@tdhock

tdhock Feb 9, 2016

groups_or_na(subject) should be easier to understand than f(subject)

@tdhock

View changes

pandas/core/strings.py Outdated
from pandas.core.index import Index
regex = re.compile(pat, flags=flags)
groups_or_na = _groups_or_na_fun(regex)

This comment has been minimized.

@tdhock

tdhock Feb 9, 2016

To avoid repetition, _groups_or_na_fun(regex) is used in both _str_extract_noexpand and _str_extract_frame

@tdhock

This comment has been minimized.

tdhock commented Feb 9, 2016

OK @jreback I think I have addressed all your concerns.

@jreback jreback closed this in 67730dd Feb 9, 2016

@jreback

This comment has been minimized.

Contributor

jreback commented Feb 9, 2016

@tdhock thanks! great PR! and you put up with our comments!

only last thing: http://pandas-docs.github.io/pandas-docs-travis/ will have the built docs (may take a bit of time as Travis is sometimes queued). This builds all docs & doc-strings. Have a look and pls issue a followup-PR if anything needs clarification / formatting.

jreback added a commit that referenced this pull request Feb 10, 2016

DOC: extract/extractall clarifications
This PR clarifies the new documentation for extract and extractall.
It was requested by @jreback in
#11386 (comment)

Author: Toby Dylan Hocking <tdhock5@gmail.com>

Closes #12281 from tdhock/extract-docs and squashes the following commits:

2019d1b [Toby Dylan Hocking] DOC: extract/extractall clarifications

cldy added a commit to cldy/pandas that referenced this pull request Feb 11, 2016

ENH: str.extractall for several matches
Author: Toby Dylan Hocking <tdhock5@gmail.com>

Closes pandas-dev#11386 from tdhock/extractall and squashes the following commits:

0c1c3d1 [Toby Dylan Hocking] ENH: extract(expand), extractall

cldy added a commit to cldy/pandas that referenced this pull request Feb 11, 2016

DOC: extract/extractall clarifications
This PR clarifies the new documentation for extract and extractall.
It was requested by @jreback in
pandas-dev#11386 (comment)

Author: Toby Dylan Hocking <tdhock5@gmail.com>

Closes pandas-dev#12281 from tdhock/extract-docs and squashes the following commits:

2019d1b [Toby Dylan Hocking] DOC: extract/extractall clarifications
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment