
str.split() should support regex #4819

Open
indigoviolet opened this issue Sep 11, 2022 · 9 comments
Labels
accepted (Ready for implementation), enhancement (New feature or an improvement of an existing feature)

Comments

@indigoviolet

indigoviolet commented Sep 11, 2022

Problem Description

I want to tokenize a string column that has multiple split characters; I believe my current options are to

  • use .apply()
  • go through multiple explode()/str.split() passes (see the sketch below)
  • chain a bunch of flatten() and str.split() calls

It would be nicer to have rsplit or regex support in split itself (contains and replace both already support it).

It would also be nice to have list-flattening support (i.e. not explode, but taking a nested list and making it unnested).
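For reference, a minimal sketch of the second option (multiple explode()/str.split() passes), assuming a hypothetical text column and "," / ";" as the two separators:

import polars as pl

df = pl.DataFrame({"text": ["a,b;c"]})

# One split/explode pass per separator: split on a single literal
# separator, then explode the resulting list column into rows.
out = (
    df.with_columns(pl.col("text").str.split(","))
      .explode("text")
      .with_columns(pl.col("text").str.split(";"))
      .explode("text")
)
# rows: "a", "b", "c"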

@deanm0000
Collaborator

deanm0000 commented Oct 20, 2022

As a workaround, can you replace the regex matches with something static and then split on that?

Like with_column(pl.col(yourcol).str.replace('\d{1,2}','|D|D|D|D').str.split('|D|D|D|D'))
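A runnable version of that idea (a sketch only, with a hypothetical column name; note that str.replace() only replaces the first match, so str.replace_all() is likely what is needed here):

import polars as pl

df = pl.DataFrame({"text": ["abc12def3ghi"]})

# Replace every regex match with a sentinel string that does not occur
# in the data, then split on the sentinel.
out = df.with_columns(
    pl.col("text")
      .str.replace_all(r"\d{1,2}", "|D|D|D|D")
      .str.split("|D|D|D|D")
)
# [["abc", "def", "ghi"]]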

@cmdlineluser
Contributor

Just bumped into this.

The workaround was to use .extract_all() then .replace(), which is mostly equivalent.

df = pl.DataFrame({
   "data": [ "AB one ABB two ABBBBBB three ABBBBBBBB"]
})

pattern = r"AB+"

df.select(
   pl.col("data")
     .str.extract_all(rf".*?({pattern}|$)")
     .arr.eval(
        pl.all().str.replace(pattern, ""),
        parallel=True)
)

shape: (1, 1)
┌──────────────────────────────┐
│ data                         │
│ ---                          │
│ list[str]                    │
╞══════════════════════════════╡
│ ["", " one ", ... " three "] │
└──────────────────────────────┘

Seems like it could be useful if it worked like the other .extract() / .replace() methods with a literal: bool option to disable regex matching.

@evbo

evbo commented Sep 23, 2023

Python's split works a bit differently from Polars' split, in that runs of split characters are collapsed in the former.

In Python:
'hello    world'
becomes:
['hello', 'world']

if you split on space, whereas in Polars there would be an extra (empty) list entry for each additional space. At times it is helpful to handle multiple split characters in a row, though.

@cmdlineluser
Contributor

@evbo That's only if you do not supply a sep, is it not?

'hello    world'.split() # sep=None
# ['hello', 'world']

'hello    world'.split(' ')
# ['hello', '', '', '', 'world']

pl.select(pl.lit('hello    world').str.split(' ')).item()
# shape: (5,)
# Series: '' [str]
# [
# 	"hello"
# 	""
# 	""
# 	""
# 	"world"
# ]

@evbo

evbo commented Sep 23, 2023

@cmdlineluser thanks, I should have clarified: for the Rust API this is not currently (documented as) supported. If you try to pass lit(Null {}) to split, it complains that it must be a Utf8 Expr.

SchemaMismatch(
ErrString(
"invalid series dtype: expected Utf8, got null",
),
)

@TheWizier

TheWizier commented Nov 30, 2023

I found this, which worked well for my case:
https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.Expr.str.extract_groups.html
I did:
extract_groups(pattern).struct.rename_fields(["a", "b", "c"]).alias("fields")
and then
unnest("fields")

@ritchie46 added the accepted (Ready for implementation) label Jan 2, 2024
@ritchie46
Member

I would accept a PR on this, if we can keep the non-regex fast path.

@david-waterworth

david-waterworth commented Mar 6, 2024

Also, the regex parser used by Polars doesn't appear to support look-ahead/look-behind, which I feel is important for splitting: I often want to split on a zero-length token, for example between text and numbers.

ComputeError: regex error: regex parse error:
    .*?((?<=[a-zA-Z])(?=\d)|$)
        ^^^^
error: look-around, including look-ahead and look-behind, is not supported

Note this is part of a regex I use frequently in a Hugging Face (i.e. Rust-backed) tokenizer, so the regex engine they use supports look-around.

Edit: Hugging Face uses Oniguruma rather than the Rust regex engine: huggingface/tokenizers#1057
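A possible workaround within the supported engine (a sketch only, not equivalent to the look-around pattern above) is to match the pieces instead of the zero-width boundary, e.g. for the letters/digits case:

import polars as pl

df = pl.DataFrame({"text": ["abc123def45"]})

# Match runs of letters or runs of digits rather than splitting on the
# zero-width boundary between them.
out = df.select(pl.col("text").str.extract_all(r"[A-Za-z]+|\d+"))
# [["abc", "123", "def", "45"]]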

@deanm0000
Collaborator

@david-waterworth I think they picked the one they did because look-arounds are relatively slow, as they're recursive. One could build a plugin that used the other regex engine.
