str.split() should support regex #4819
Comments
As a workaround, can you Like
Just bumped into this. Workaround was to use
Seems like it could be useful if it worked like the other
In Python, if you split on space, a run of spaces is treated as a single separator, whereas in polars there would be multiple list entries, one for each space. At times it is helpful to handle multiple split characters in a row, though.
@evbo That's only if you do not supply a sep:
'hello    world'.split() # sep=None
# ['hello', 'world']
'hello    world'.split(' ')
# ['hello', '', '', '', 'world']
pl.select(pl.lit('hello    world').str.split(' ')).item()
# shape: (5,)
# Series: '' [str]
# [
# "hello"
# ""
# ""
# ""
# "world"
# ]
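For comparison, the run-collapsing behaviour discussed above can be reproduced in plain Python by splitting on a regex with a `+` quantifier. This is only a sketch of the semantics a regex-aware split could offer, using the standard `re` module; it says nothing about how polars would implement it, and the helper name `split_on_runs` is made up for illustration.

```python
import re


def split_on_runs(text: str, chars: str = " ") -> list[str]:
    """Split on one or more consecutive characters taken from `chars`.

    Hypothetical helper for illustration; `chars` must be non-empty.
    """
    # Build a character class such as "[\ ,;]+" so a run of separators
    # counts as a single split point.
    pattern = "[" + re.escape(chars) + "]+"
    return re.split(pattern, text)


print(split_on_runs("hello    world"))
print(split_on_runs("a, b;;c", " ,;"))
```

Leading or trailing separators would still produce an empty string at the corresponding end of the result, unlike `str.split()` with no `sep`.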
@cmdlineluser thanks. I should have clarified: for the Rust API, this is not currently (documented as) supported. If you try to pass
I found this which worked well for my case:
I would accept a PR on this, if we can keep the non-regex fast path.
Also, the regex parser used by polars doesn't appear to support look-ahead/look-behind, which I feel is important for splitting: I often want to split on a zero-length token, for example between text and numbers.
Note this is part of a regex I use frequently in a huggingface (i.e. Rust-backed) tokenizer, so the regex engine they use supports look-around. Edit: huggingface use
@david-waterworth I think they picked the one they did because look-arounds are relatively slow, as they're recursive. One could build a plugin that uses the other regex engine.
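To make the zero-length split mentioned above concrete, here is a minimal sketch in plain Python, whose `re` module does support look-around. The pattern and the function name `split_text_and_numbers` are illustrative assumptions, not the regex from the tokenizer referred to in the comment.

```python
import re

# Zero-width boundary between a letter and a digit, in either direction.
BOUNDARY = re.compile(r"(?<=[A-Za-z])(?=\d)|(?<=\d)(?=[A-Za-z])")


def split_text_and_numbers(text: str) -> list[str]:
    """Split where text meets numbers without consuming any characters."""
    # re.split accepts patterns that match the empty string on Python 3.7+.
    return BOUNDARY.split(text)


print(split_text_and_numbers("abc123def"))
```

Because the match is empty, no characters are dropped: every piece of the input ends up in exactly one token.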
Problem Description
I want to tokenize a string column, and there are multiple split characters; I believe my current options are to use .apply(), explode()/str.split passes, or flatten() and str.split().
It would be nicer to have rsplit or regex support in split itself (contains, replace both already support it).
It would also be nice to have list-flattening support (i.e. not explode, but taking a nested list and making it unnested).
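The difference between the multi-pass approach and a single regex split can be sketched in plain Python. Both function names below are hypothetical, and the code only illustrates the requested semantics with the standard library; it is not polars code and does not reflect how polars would implement either path.

```python
import re
from itertools import chain


def split_multi_pass(text: str, seps: list[str]) -> list[str]:
    """Split on each separator in turn, flattening after every pass."""
    tokens = [text]
    for sep in seps:
        # Each pass yields a nested list; chain.from_iterable un-nests it.
        tokens = list(chain.from_iterable(tok.split(sep) for tok in tokens))
    return tokens


def split_regex(text: str, seps: list[str]) -> list[str]:
    """Single pass: split on an alternation of the escaped separators."""
    pattern = "|".join(re.escape(sep) for sep in seps)
    return re.split(pattern, text)


text = "a,b;c d"
seps = [",", ";", " "]
print(split_multi_pass(text, seps))
print(split_regex(text, seps))
```

For single-character separators the two functions agree; with multi-character separators that overlap, the results may differ depending on pass order.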