feat: add `String.splitOn_of_valid` #495

chabulhwi · 2024-01-02T13:41:34Z

Add lemmas for Nat.le and List.append.
Add functions on List:
- List.rdropSublist drops a sublist from the tail end of a list.
- List.splitOnSublist splits a list at every occurrence of a separator sublist. The separators are not in the result.
Add simp lemmas for List.rdropSublist.
State and prove two theorems about String: splitOnAux_of_valid and splitOn_of_valid.

Before this pull request gets merged, I need to replace String.splitOnAux and String.splitOn in Lean core with the modified versions I made:

def splitOnAux' (s sep : String) (b : Pos) (i : Pos) (j : Pos) (r : List String) : List String :=
  if sep == "" then
    let r := (s.extract b s.endPos)::r
    r.reverse
  else if h : s.atEnd i then
    let r := (s.extract b i)::r
    r.reverse
  else
    have := Nat.sub_lt_sub_left (Nat.gt_of_not_le (mt decide_eq_true h)) (lt_next s _)
    if sep.atEnd j then
      if s.get i == sep.get 0
        splitOnAux' s sep i (s.next i) (sep.next 0) (s.extract b (i - j)::r)
      else
        splitOnAux' s sep i (s.next i) 0 (s.extract b (i - j)::r)
    else if s.get i == sep.get j then
      splitOnAux' s sep b (s.next i) (sep.next j) r
    else
      splitOnAux' s sep b (s.next i) 0 r
termination_by _ => s.endPos.1 - i.1

def splitOn' (s : String) (sep : String := " ") : List String :=
  splitOnAux' s sep 0 0 0 []

The function splitOnAux' differs from its original version in two ways. First, it checks whether the separator (sep) is empty. Second, it checks whether sep.atEnd j before it checks whether s.get i == sep.get j. For these reasons, the outputs of the original splitOnAux and its modified version are different for some inputs, as shown in the #eval commands below:

import Std.Data.String.Lemmas

namespace Test

open String
set_option linter.missingDocs false

private def l := "A12a".1
private def m := "B12b".1
private def r := "C12c".1

private def sep₁ := "1".1
private def sep₂ := "2".1
private def acc := ["X", "x"]

#eval splitOnAux  ⟨l++m++sep₁++r⟩ ⟨sep₁++sep₂⟩ ⟨utf8Len l⟩ ⟨utf8Len (l++m++sep₁)⟩
  ⟨utf8Len (sep₁++sep₂)⟩ acc -- output: ["x", "X", "B12b1C", "c"]
#eval splitOnAux' ⟨l++m++sep₁++r⟩ ⟨sep₁++sep₂⟩ ⟨utf8Len l⟩ ⟨utf8Len (l++m++sep₁)⟩
  ⟨utf8Len (sep₁++sep₂)⟩ acc -- ["x", "X", "B12", "C", "c"]

#eval splitOnAux  ⟨l++m++"A".1++r⟩ ⟨"A".1++sep₂⟩ ⟨utf8Len l⟩ ⟨utf8Len (l++m++"A".1)⟩
  ⟨utf8Len ("A".1++sep₂)⟩ acc -- ["x", "X", "B12bAC12c"]
#eval splitOnAux' ⟨l++m++"A".1++r⟩ ⟨"A".1++sep₂⟩ ⟨utf8Len l⟩ ⟨utf8Len (l++m++"A".1)⟩
  ⟨utf8Len ("A".1++sep₂)⟩ acc -- ["x", "X", "B12", "C12c"]

end Test

I'll open an issue on Lean 4 GitHub issues after I get feedback about splitOnAux' and splitOn'.

digama0 · 2024-01-02T14:04:24Z

But why? You say that splitOnAux has different behavior, but this is an auxiliary function, its behavior doesn't matter. Does splitOn have different behavior?

Checking the separator for emptiness in splitOnAux is a bad idea, this is the inner loop of splitOn and sep is a constant. Just check it outside the loop and pass in an appropriate hypothesis.

* `List.rdropSublist` drops a sublist from the tail end of a list. * `List.splitOnSublist` splits a list at every occurrence of a separator sublist. The separators are not in the result.

State and prove two theorems about `String`: `splitOnAux_of_valid` and `splitOn_of_valid`.

digama0 · 2024-01-02T14:19:32Z

Looking at the definitions of splitOn and splitOnAux, I think they are correct. splitOnAux has the invariant that sep != "", and j < sep.endPos, which justifies the two aspects that you felt the need to change. (Doing the s.get i = sep.get pos check first is a better design, because most characters are not going to be the same as the first character in sep, so this saves one test per character.)

chabulhwi · 2024-01-02T16:47:42Z

@digama0 There are some errors in splitOnAux', which make splitOn' behave differently from splitOn. I'll reply to your questions after I fix those.

#eval "".splitOn "_" == "".splitOn' "_" -- output: true
#eval "My<>life<>for<>Aiur!".splitOn "<>" == "My<>life<>for<>Aiur!".splitOn' "<>" -- true
#eval "BABABA".splitOn  "A" -- ["B", "B", "B", ""]
#eval "BABABA".splitOn' "A" -- ["B", "B", "BA"]

fgdorais · 2024-01-02T19:01:14Z

Why not use the KMP matcher at Std.Data.Array.Match?

digama0 · 2024-01-03T07:11:26Z

@chabulhwi Based on your example, it seems like splitOn is correct and splitOn' is not, no? If you split "BABABA" by "A" the correct answer is ["B", "B", "B", ""], and if you split "ABABAB" by "A" the correct answer is ["", "B", "B", "B"]. splitOn gets both right, splitOn' only gets the second one right.

As for using the KMP matcher, I was thinking of making the same suggestion but this is a definition which is targeting core, which does not have the KMP matcher. We may consider a splitOn' function which returns the same result as splitOn but is implemented using the KMP matcher.

chabulhwi · 2024-01-03T07:12:29Z

@chabulhwi Based on your example, it seems like splitOn is correct and splitOn' is not, no?

You're right, but I think I can fix this error.

digama0 · 2024-01-03T07:13:07Z

Why not just use and prove properties of the existing function?

Std/Data/List/Basic.lean

chabulhwi · 2024-01-03T09:51:57Z

But why? You say that splitOnAux has different behavior, but this is an auxiliary function, its behavior doesn't matter. Does splitOn have different behavior?

Checking the separator for emptiness in splitOnAux is a bad idea, this is the inner loop of splitOn and sep is a constant. Just check it outside the loop and pass in an appropriate hypothesis.

I think the current definition of splitOnAux makes it more difficult to prove the following two theorems:

theorem splitOnAux_of_valid (sep₁ sep₂ l m r acc) :
    splitOnAux ⟨l++m++sep₁++r⟩ ⟨sep₁++sep₂⟩ ⟨utf8Len l⟩ ⟨utf8Len (l++m++sep₁)⟩ ⟨utf8Len sep₁⟩ acc =
      acc.reverse++(List.splitOnList.go r (sep₁++sep₂) sep₂ (m++sep₁).reverse).map mk := by
  sorry

theorem splitOn_of_valid (s sep) : splitOn s sep = (List.splitOnList s.1 sep.1).map mk := by
  simpa [splitOn] using splitOnAux_of_valid [] sep.1 [] [] s.1 []

When proving splitOnAux_of_valid, we need to consider the case where the separator is empty. Having splitOnAux check the separator for emptiness helps simplify both sides of the equation we aim to prove on each case.

Of course, we should make sure splitOn' has the same behavior as splitOn.

digama0 · 2024-01-03T12:54:23Z

splitOn and splitOnAux are designed for efficient execution, not verification. As long as they produce the right result it is better to spend more work on the verification in exchange for a more efficient function. So unless you have an idea for how to make them more efficient, or if you found a bug in the demonstrated behavior of the function, I suggest taking them as given and proving what you can about them as is.

chabulhwi · 2024-01-03T13:05:27Z

Looking at the definitions of splitOn and splitOnAux, I think they are correct. splitOnAux has the invariant that sep != "", and j < sep.endPos, which justifies the two aspects that you felt the need to change. (Doing the s.get i = sep.get pos check first is a better design, because most characters are not going to be the same as the first character in sep, so this saves one test per character.)

In order to prove splitOnAux_of_valid, we also have to deal with the case where it's true that sep.atEnd j; this is where the problem with the current splitOnAux lies.

When sep.atEnd j is true, the value of sep.get j is default, which is 'A'. In that event, checking whether s.get i == sep.get j is equivalent to checking whether s.get i == 'A'. This leads to the inconsistent behavior of splitOnAux shown below:

#eval "AB".splitOnAux "C" ⟨0⟩ ⟨0⟩ ⟨1⟩ []
#eval "AB".splitOnAux "C" ⟨1⟩ ⟨1⟩ ⟨0⟩ [""]
#eval "AB".splitOnAux "C" ⟨1⟩ ⟨2⟩ ⟨0⟩ [""]
#eval ("B"::[""]).reverse
#eval ["", "B"]

#eval "DB".splitOnAux "C" ⟨0⟩ ⟨0⟩ ⟨1⟩ []
#eval "DB".splitOnAux "C" ⟨0⟩ ⟨1⟩ ⟨0⟩ []
#eval "DB".splitOnAux "C" ⟨0⟩ ⟨2⟩ ⟨0⟩ []
#eval ("DB"::[]).reverse
#eval ["DB"]

Why not just use and prove properties of the existing function?

The weird behavior of splitOnAux I mentioned above makes it painful to prove splitOnAux_of_valid.

digama0 · 2024-01-03T13:27:05Z

In order to prove splitOnAux_of_valid, we also have to deal with the case where it's true that sep.atEnd j; this is where the problem with the current splitOnAux lies.

Like I said earlier, splitOnAux has a precondition that j < sep.endPos, which eliminates this case. splitOnAux_of_valid should likewise have this as a hypothesis. (NB: this also eliminates the sep != "" precondition, since j < sep.endPos implies it.)

The whole reason that *_of_valid functions have to be written in such a weird way is because most string functions have very nontrivial preconditions to do with positions lying on UTF-8 byte boundaries. If you don't assume the correct preconditions they just yield garbage values, and we aren't interested in proving properties about those garbage values.

chabulhwi · 2024-01-03T13:38:46Z

Like I said earlier, splitOnAux has a precondition that j < sep.endPos, which eliminates this case. splitOnAux_of_valid should likewise have this as a hypothesis.

Adding j < sep.endPos to splitOnAux_of_valid as a hypothesis will solve the problems I had. I'll close this pull request and make another one later. But I need to take a break first. 🫠

chabulhwi added 2 commits January 2, 2024 09:38

feat: add lemmas for Nat.le

a528924

feat: add lemmas for List.append

509c9b0

chabulhwi added 3 commits January 2, 2024 23:07

feat: add functions on List

742d8cc

* `List.rdropSublist` drops a sublist from the tail end of a list. * `List.splitOnSublist` splits a list at every occurrence of a separator sublist. The separators are not in the result.

feat: add simp lemmas for List.rdropSublist

3e20dd1

feat: add String.splitOn_of_valid

9b42bad

State and prove two theorems about `String`: `splitOnAux_of_valid` and `splitOn_of_valid`.

chabulhwi force-pushed the String.splitOn_of_valid branch from fbbcb46 to 9b42bad Compare January 2, 2024 14:07

fgdorais added awaiting-review This PR is ready for review; the author thinks it is ready to be merged. WIP work in progress and removed awaiting-review This PR is ready for review; the author thinks it is ready to be merged. labels Jan 2, 2024

digama0 reviewed Jan 3, 2024

View reviewed changes

Std/Data/List/Basic.lean Show resolved Hide resolved

digama0 reviewed Jan 3, 2024

View reviewed changes

Std/Data/List/Basic.lean Show resolved Hide resolved

chabulhwi closed this Jan 3, 2024

digama0 mentioned this pull request Jan 16, 2024

feat: add Std.Data.List.SplitOnList #531

Closed

chabulhwi mentioned this pull request Apr 3, 2024

feat: add String.splitOn_of_valid #719

Closed

chabulhwi mentioned this pull request Apr 16, 2024

feat: add String.splitOn_of_valid #743

Open

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add `String.splitOn_of_valid` #495

feat: add `String.splitOn_of_valid` #495

chabulhwi commented Jan 2, 2024 •

edited

digama0 commented Jan 2, 2024

digama0 commented Jan 2, 2024 •

edited

chabulhwi commented Jan 2, 2024

fgdorais commented Jan 2, 2024

digama0 commented Jan 3, 2024 •

edited

chabulhwi commented Jan 3, 2024 •

edited

digama0 commented Jan 3, 2024

chabulhwi commented Jan 3, 2024 •

edited

digama0 commented Jan 3, 2024

chabulhwi commented Jan 3, 2024 •

edited

digama0 commented Jan 3, 2024 •

edited

chabulhwi commented Jan 3, 2024 •

edited

feat: add String.splitOn_of_valid #495

feat: add String.splitOn_of_valid #495

Conversation

chabulhwi commented Jan 2, 2024 • edited

digama0 commented Jan 2, 2024

digama0 commented Jan 2, 2024 • edited

chabulhwi commented Jan 2, 2024

fgdorais commented Jan 2, 2024

digama0 commented Jan 3, 2024 • edited

chabulhwi commented Jan 3, 2024 • edited

digama0 commented Jan 3, 2024

chabulhwi commented Jan 3, 2024 • edited

digama0 commented Jan 3, 2024

chabulhwi commented Jan 3, 2024 • edited

digama0 commented Jan 3, 2024 • edited

chabulhwi commented Jan 3, 2024 • edited

feat: add `String.splitOn_of_valid` #495

feat: add `String.splitOn_of_valid` #495

chabulhwi commented Jan 2, 2024 •

edited

digama0 commented Jan 2, 2024 •

edited

digama0 commented Jan 3, 2024 •

edited

chabulhwi commented Jan 3, 2024 •

edited

chabulhwi commented Jan 3, 2024 •

edited

chabulhwi commented Jan 3, 2024 •

edited

digama0 commented Jan 3, 2024 •

edited

chabulhwi commented Jan 3, 2024 •

edited