
str.split() should support regex #4819

Open
indigoviolet opened this issue Sep 11, 2022 · 9 comments
Labels
accepted (Ready for implementation), enhancement (New feature or an improvement of an existing feature)

Comments

@indigoviolet

indigoviolet commented Sep 11, 2022

Problem Description

I want to tokenize a string column that has multiple split characters; I believe my current options are to

  • use .apply()
  • go through multiple explode()/str.split() passes (see the sketch below)
  • chain a bunch of flatten() and str.split() calls

It would be nicer to have rsplit or regex support in split itself (contains and replace both already support it).

It would also be nice to have list-flattening support (i.e. not explode, but taking a nested list and making it unnested).
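For reference, a minimal sketch of the second option (multiple explode()/str.split() passes), assuming a hypothetical text column and "," / ";" as the two separators:

import polars as pl

df = pl.DataFrame({"text": ["a,b;c"]})

# One split/explode pass per separator: split on a single literal
# separator, then explode the resulting list column into rows.
out = (
    df.with_columns(pl.col("text").str.split(","))
      .explode("text")
      .with_columns(pl.col("text").str.split(";"))
      .explode("text")
)
# rows: "a", "b", "c"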

@deanm0000
Collaborator

deanm0000 commented Oct 20, 2022

As a workaround, can you replace the regex matches with something static and then split on that?

Like with_column(pl.col(yourcol).str.replace('\d{1,2}','|D|D|D|D').str.split('|D|D|D|D'))
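A runnable version of that idea (a sketch only, with a hypothetical column name; note that str.replace() only replaces the first match, so str.replace_all() is likely what is needed here):

import polars as pl

df = pl.DataFrame({"text": ["abc12def3ghi"]})

# Replace every regex match with a sentinel string that does not occur
# in the data, then split on the sentinel.
out = df.with_columns(
    pl.col("text")
      .str.replace_all(r"\d{1,2}", "|D|D|D|D")
      .str.split("|D|D|D|D")
)
# [["abc", "def", "ghi"]]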

@cmdlineluser
Contributor

Just bumped into this.

The workaround was to use .extract_all() then .replace(), which is mostly equivalent.

df = pl.DataFrame({
   "data": [ "AB one ABB two ABBBBBB three ABBBBBBBB"]
})

pattern = r"AB+"

df.select(
   pl.col("data")
     .str.extract_all(rf".*?({pattern}|$)")
     .arr.eval(
        pl.all().str.replace(pattern, ""),
        parallel=True)
)

shape: (1, 1)
┌──────────────────────────────┐
│ data                         │
│ ---                          │
│ list[str]                    │
╞══════════════════════════════╡
│ ["", " one ", ... " three "] │
└──────────────────────────────┘

Seems like it could be useful if it worked like the other .extract() / .replace() methods with a literal: bool option to disable regex matching.

@evbo

evbo commented Sep 23, 2023

Python's split works a bit differently from Polars' split, in that runs of split characters are collapsed in the former.

In Python:
'hello    world'
becomes:
['hello', 'world']

if you split on space, whereas in Polars there would be an extra (empty) list entry for each additional space. At times it is helpful to handle multiple split characters in a row, though.

@cmdlineluser
Contributor

@evbo That's only if you do not supply a sep, is it not?

'hello    world'.split() # sep=None
# ['hello', 'world']

'hello    world'.split(' ')
# ['hello', '', '', '', 'world']

pl.select(pl.lit('hello    world').str.split(' ')).item()
# shape: (5,)
# Series: '' [str]
# [
# 	"hello"
# 	""
# 	""
# 	""
# 	"world"
# ]

@evbo

evbo commented Sep 23, 2023

@cmdlineluser thanks, I should have clarified: for the Rust API this is not currently (documented as) supported. If you try to pass lit(Null {}) to split, it complains that it must be a Utf8 Expr.

SchemaMismatch(
ErrString(
"invalid series dtype: expected Utf8, got null",
),
)

@TheWizier

TheWizier commented Nov 30, 2023

I found this, which worked well for my case:
https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.Expr.str.extract_groups.html
I did:
extract_groups(pattern).struct.rename_fields(["a", "b", "c"]).alias("fields")
and then
unnest("fields")

@ritchie46 added the accepted (Ready for implementation) label Jan 2, 2024
@ritchie46
Member

I would accept a PR on this, if we can keep the non-regex fast path.

@david-waterworth

david-waterworth commented Mar 6, 2024

Also, the regex parser used by Polars doesn't appear to support look-ahead/look-behind, which I feel is important for splitting: I often want to split on a zero-length token, for example between text and numbers.

ComputeError: regex error: regex parse error:
    .*?((?<=[a-zA-Z])(?=\d)|$)
        ^^^^
error: look-around, including look-ahead and look-behind, is not supported

Note this is part of a regex I use frequently in a Hugging Face (i.e. Rust-backed) tokenizer, so the regex engine they use supports look-around.

Edit: Hugging Face uses Oniguruma rather than the Rust regex engine: huggingface/tokenizers#1057
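A possible workaround within the supported engine (a sketch only, not equivalent to the look-around pattern above) is to match the pieces instead of the zero-width boundary, e.g. for the letters/digits case:

import polars as pl

df = pl.DataFrame({"text": ["abc123def45"]})

# Match runs of letters or runs of digits rather than splitting on the
# zero-width boundary between them.
out = df.select(pl.col("text").str.extract_all(r"[A-Za-z]+|\d+"))
# [["abc", "123", "def", "45"]]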

@deanm0000
Collaborator

@david-waterworth I think they picked the one they did because look-arounds are relatively slow, as they're recursive. One could build a plugin that used the other regex engine.
