ENH: More convenient regex? #4685

Closed
danielballan opened this issue Aug 27, 2013 · 9 comments · Fixed by #4696


@danielballan
Contributor

commented Aug 27, 2013

How about circumventing annoying match objects by providing an additional, somewhat redundant string method?

s.str.extract(pattern)

as a shortcut for

s.str.match(pattern).str.get(0)

If the pattern contains multiple groups, a DataFrame should be returned. Thoughts? Maybe @hayd would be into this one.

@hayd

Contributor

commented Aug 27, 2013

I do hate match objects (is this like a get-groups thing?)... this sounds useful. I think it should be possible to extract by group name (at the moment get works by integer location only), though I'm not sure how that would work with multiple groups.

I also dislike how findall returns a series of lists...

@danielballan

Contributor Author

commented Aug 27, 2013

This is how I think single and multiple groups should work.

>>> Series(['a1', 'b2', 'c3']).str.extract(r'[ab](\d)')
0      1
1      2
2    NaN
dtype: object

>>> Series(['a1', 'b2', 'c3']).str.extract(r'([ab])(\d)')
     0    1
0    a    1
1    b    2
2  NaN  NaN

What are you thinking for group names?
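
For reference, the single-group and multi-group behaviour described above can be sketched in plain Python with the standard `re` module. This is only an illustration of the proposed semantics, not the pandas implementation: the helper name `extract` is hypothetical, it works on a list rather than a Series, it uses `None` where pandas would show NaN, and it assumes `re.match` (anchored at the start of the string), following the `s.str.match(pattern)` shortcut in the original proposal.

```python
import re


def extract(strings, pattern):
    """Hypothetical helper: capture groups of the first match per string.

    One group   -> a list of strings (None where there is no match).
    Many groups -> a list of tuples, one entry per group.
    """
    regex = re.compile(pattern)
    if regex.groups == 0:
        raise ValueError("pattern contains no capture groups")
    rows = []
    for s in strings:
        m = regex.match(s)
        # No match: the whole row is missing.
        groups = m.groups() if m else (None,) * regex.groups
        rows.append(groups[0] if regex.groups == 1 else groups)
    return rows


print(extract(['a1', 'b2', 'c3'], r'[ab](\d)'))
# ['1', '2', None]
print(extract(['a1', 'b2', 'c3'], r'([ab])(\d)'))
# [('a', '1'), ('b', '2'), (None, None)]
```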

@jtratner

Contributor

commented Aug 27, 2013

Would this (or does this) only work for Series? Also, how would you handle named match groups (which look like '(?P<name>pattern)')?

@danielballan

Contributor Author

commented Aug 27, 2013

I think it would only work neatly for Series. As for named match groups, what about this?

>>> Series(['a1', 'b2', 'c3']).str.extract(r'(?P<letter>[ab])(?P<digit>\d)')
  letter digit
0      a     1
1      b     2
2    NaN   NaN

As above, if there is no match, the whole row is NaN. If there is an optional group and that group is absent from a given entry, only the absent group is NaN.

>>> Series(['a1', 'b2', '3']).str.extract(r'(?P<letter>[ab])?(?P<digit>\d)')  # first group optional
  letter digit
0      a     1
1      b     2
2    NaN     3
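
The named-group rule (no match gives an all-missing row, an absent optional group gives a single missing cell) falls out of `re` fairly directly. A minimal sketch, assuming plain dicts stand in for DataFrame rows and `None` stands in for NaN; the helper name `extract_named` is hypothetical:

```python
import re


def extract_named(strings, pattern):
    """Hypothetical helper: one dict per string, keyed by group name."""
    regex = re.compile(pattern)
    # Group names ordered by their position in the pattern.
    names = sorted(regex.groupindex, key=regex.groupindex.get)
    rows = []
    for s in strings:
        m = regex.match(s)
        # groupdict() already maps non-participating groups to None.
        rows.append(m.groupdict() if m else dict.fromkeys(names))
    return rows


rows = extract_named(['a1', 'b2', '3'], r'(?P<letter>[ab])?(?P<digit>\d)')
for row in rows:
    print(row)
# {'letter': 'a', 'digit': '1'}
# {'letter': 'b', 'digit': '2'}
# {'letter': None, 'digit': '3'}
```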

@jtratner

Contributor

commented Aug 27, 2013

That's a good way to do it.

@hayd

Contributor

commented Aug 27, 2013

Yep, that was exactly what I was getting at with group names, this would be a great enhancement.

The silly thing I was also thinking about was repeated optional groups, but re doesn't appear to support that anyway, since a repeated group only keeps its last match (nor does it really work with unnamed groups):

In [11]: re.match('((?P<letter>[ab]))*','ab').groupdict()
Out[11]: {'letter': 'b'}
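
To make the limitation concrete: a repeated group reports only its last repetition, so collecting every occurrence needs a scan such as `re.finditer` instead. A small sketch using only the standard library (the variable names are illustrative, and this is a workaround in plain `re`, not something the proposed method does):

```python
import re

# A repeated group keeps only its final repetition.
m = re.match(r'((?P<letter>[ab]))*', 'ab')
last_only = m.groupdict()
print(last_only)  # {'letter': 'b'}

# Scanning with finditer yields one match per occurrence instead.
all_letters = [hit.group('letter')
               for hit in re.finditer(r'(?P<letter>[ab])', 'ab')]
print(all_letters)  # ['a', 'b']
```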

@danielballan

Contributor Author

commented Aug 27, 2013

Two examples that work in the PR referenced above. Will write tests out of these, and more.

In [1]: Series(['a1', 'b1', '1', 'a']).str.extract(r'([ab])?(?P<digit>\d)')
Out[1]: 
     1 digit
0    a     1
1    b     1
2  NaN     1
3  NaN   NaN

In [2]: Series(['a1', 'b1', '1', 'a']).str.extract('([ab])?')
Out[2]: 
0      a
1      b
2    NaN
3      a
dtype: object

Notice that, in the column names, I follow the re module's convention of numbering groups from 1, so the unnamed first group above is labeled 1. It's actually the string '1'. Should it be the integer 1? Which is less confusing?
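
One way to derive such labels from a compiled pattern is sketched below. It assumes named groups keep their names and unnamed groups fall back to their 1-based group number; the helper `column_labels` and its `as_str` switch are hypothetical and exist only to show the two options under discussion (string '1' versus integer 1).

```python
import re


def column_labels(pattern, as_str=True):
    """Hypothetical helper: one label per capture group, in pattern order."""
    regex = re.compile(pattern)
    # groupindex maps name -> group number; invert it for lookup by number.
    number_to_name = {num: name for name, num in regex.groupindex.items()}
    labels = []
    for num in range(1, regex.groups + 1):
        fallback = str(num) if as_str else num
        labels.append(number_to_name.get(num, fallback))
    return labels


print(column_labels(r'([ab])?(?P<digit>\d)'))                # ['1', 'digit']
print(column_labels(r'([ab])?(?P<digit>\d)', as_str=False))  # [1, 'digit']
```

Mixing integer and string labels in one set of column names is the trade-off in question: string labels keep a single type, integer labels match how `re` addresses unnamed groups.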

@hayd

Contributor

commented Aug 27, 2013

Oh sorry, just commented about that same thing. Not sure what's better. Probably should follow their (weird) convention.

@danielballan

Contributor Author

commented Aug 27, 2013

Ha. Saw your comment and changed it. I'm really not sure which is better.

3 participants