Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add String.splitOn_of_valid #495

Closed

Conversation

chabulhwi
Copy link
Contributor

@chabulhwi chabulhwi commented Jan 2, 2024

  • Add lemmas for Nat.le and List.append.
  • Add functions on List:
    • List.rdropSublist drops a sublist from the tail end of a list.
    • List.splitOnSublist splits a list at every occurrence of a separator sublist. The separators are not in the result.
  • Add simp lemmas for List.rdropSublist.
  • State and prove two theorems about String: splitOnAux_of_valid and splitOn_of_valid.

Before this pull request gets merged, I need to replace String.splitOnAux and String.splitOn in Lean core with the modified versions I made:

def splitOnAux' (s sep : String) (b : Pos) (i : Pos) (j : Pos) (r : List String) : List String :=
  if sep == "" then
    let r := (s.extract b s.endPos)::r
    r.reverse
  else if h : s.atEnd i then
    let r := (s.extract b i)::r
    r.reverse
  else
    have := Nat.sub_lt_sub_left (Nat.gt_of_not_le (mt decide_eq_true h)) (lt_next s _)
    if sep.atEnd j then
      if s.get i == sep.get 0
        splitOnAux' s sep i (s.next i) (sep.next 0) (s.extract b (i - j)::r)
      else
        splitOnAux' s sep i (s.next i) 0 (s.extract b (i - j)::r)
    else if s.get i == sep.get j then
      splitOnAux' s sep b (s.next i) (sep.next j) r
    else
      splitOnAux' s sep b (s.next i) 0 r
termination_by _ => s.endPos.1 - i.1

def splitOn' (s : String) (sep : String := " ") : List String :=
  splitOnAux' s sep 0 0 0 []

The function splitOnAux' differs from its original version in two ways. First, it checks whether the separator (sep) is empty. Second, it checks whether sep.atEnd j before it checks whether s.get i == sep.get j. For these reasons, the outputs of the original splitOnAux and its modified version are different for some inputs, as shown in the #eval commands below:

import Std.Data.String.Lemmas

namespace Test

open String
set_option linter.missingDocs false

private def l := "A12a".1
private def m := "B12b".1
private def r := "C12c".1

private def sep₁ := "1".1
private def sep₂ := "2".1
private def acc := ["X", "x"]

#eval splitOnAux  ⟨l++m++sep₁++r⟩ ⟨sep₁++sep₂⟩ ⟨utf8Len l⟩ ⟨utf8Len (l++m++sep₁)⟩
  ⟨utf8Len (sep₁++sep₂)⟩ acc -- output: ["x", "X", "B12b1C", "c"]
#eval splitOnAux' ⟨l++m++sep₁++r⟩ ⟨sep₁++sep₂⟩ ⟨utf8Len l⟩ ⟨utf8Len (l++m++sep₁)⟩
  ⟨utf8Len (sep₁++sep₂)⟩ acc -- ["x", "X", "B12", "C", "c"]

#eval splitOnAux  ⟨l++m++"A".1++r⟩ ⟨"A".1++sep₂⟩ ⟨utf8Len l⟩ ⟨utf8Len (l++m++"A".1)⟩
  ⟨utf8Len ("A".1++sep₂)⟩ acc -- ["x", "X", "B12bAC12c"]
#eval splitOnAux' ⟨l++m++"A".1++r⟩ ⟨"A".1++sep₂⟩ ⟨utf8Len l⟩ ⟨utf8Len (l++m++"A".1)⟩
  ⟨utf8Len ("A".1++sep₂)⟩ acc -- ["x", "X", "B12", "C12c"]

end Test

I'll open an issue on Lean 4 GitHub issues after I get feedback about splitOnAux' and splitOn'.

@digama0
Copy link
Member

digama0 commented Jan 2, 2024

But why? You say that splitOnAux has different behavior, but this is an auxiliary function, its behavior doesn't matter. Does splitOn have different behavior?

Checking the separator for emptiness in splitOnAux is a bad idea, this is the inner loop of splitOn and sep is a constant. Just check it outside the loop and pass in an appropriate hypothesis.

* `List.rdropSublist` drops a sublist from the tail end of a list.
* `List.splitOnSublist` splits a list at every occurrence of a separator
  sublist. The separators are not in the result.
State and prove two theorems about `String`: `splitOnAux_of_valid` and
`splitOn_of_valid`.
@digama0
Copy link
Member

digama0 commented Jan 2, 2024

Looking at the definitions of splitOn and splitOnAux, I think they are correct. splitOnAux has the invariant that sep != "", and j < sep.endPos, which justifies the two aspects that you felt the need to change. (Doing the s.get i = sep.get pos check first is a better design, because most characters are not going to be the same as the first character in sep, so this saves one test per character.)

@fgdorais fgdorais added awaiting-review This PR is ready for review; the author thinks it is ready to be merged. WIP work in progress and removed awaiting-review This PR is ready for review; the author thinks it is ready to be merged. labels Jan 2, 2024
@chabulhwi
Copy link
Contributor Author

@digama0 There are some errors in splitOnAux', which make splitOn' behave differently from splitOn. I'll reply to your questions after I fix those.

#eval "".splitOn "_" == "".splitOn' "_" -- output: true
#eval "My<>life<>for<>Aiur!".splitOn "<>" == "My<>life<>for<>Aiur!".splitOn' "<>" -- true
#eval "BABABA".splitOn  "A" -- ["B", "B", "B", ""]
#eval "BABABA".splitOn' "A" -- ["B", "B", "BA"]

@fgdorais
Copy link
Collaborator

fgdorais commented Jan 2, 2024

Why not use the KMP matcher at Std.Data.Array.Match?

@digama0
Copy link
Member

digama0 commented Jan 3, 2024

@chabulhwi Based on your example, it seems like splitOn is correct and splitOn' is not, no? If you split "BABABA" by "A" the correct answer is ["B", "B", "B", ""], and if you split "ABABAB" by "A" the correct answer is ["", "B", "B", "B"]. splitOn gets both right, splitOn' only gets the second one right.

As for using the KMP matcher, I was thinking of making the same suggestion but this is a definition which is targeting core, which does not have the KMP matcher. We may consider a splitOn' function which returns the same result as splitOn but is implemented using the KMP matcher.

@chabulhwi
Copy link
Contributor Author

chabulhwi commented Jan 3, 2024

@chabulhwi Based on your example, it seems like splitOn is correct and splitOn' is not, no?

You're right, but I think I can fix this error.

@digama0
Copy link
Member

digama0 commented Jan 3, 2024

Why not just use and prove properties of the existing function?

@chabulhwi
Copy link
Contributor Author

chabulhwi commented Jan 3, 2024

But why? You say that splitOnAux has different behavior, but this is an auxiliary function, its behavior doesn't matter. Does splitOn have different behavior?

Checking the separator for emptiness in splitOnAux is a bad idea, this is the inner loop of splitOn and sep is a constant. Just check it outside the loop and pass in an appropriate hypothesis.

I think the current definition of splitOnAux makes it more difficult to prove the following two theorems:

theorem splitOnAux_of_valid (sep₁ sep₂ l m r acc) :
    splitOnAux ⟨l++m++sep₁++r⟩ ⟨sep₁++sep₂⟩ ⟨utf8Len l⟩ ⟨utf8Len (l++m++sep₁)⟩ ⟨utf8Len sep₁⟩ acc =
      acc.reverse++(List.splitOnList.go r (sep₁++sep₂) sep₂ (m++sep₁).reverse).map mk := by
  sorry

theorem splitOn_of_valid (s sep) : splitOn s sep = (List.splitOnList s.1 sep.1).map mk := by
  simpa [splitOn] using splitOnAux_of_valid [] sep.1 [] [] s.1 []

When proving splitOnAux_of_valid, we need to consider the case where the separator is empty. Having splitOnAux check the separator for emptiness helps simplify both sides of the equation we aim to prove on each case.

Of course, we should make sure splitOn' has the same behavior as splitOn.

@digama0
Copy link
Member

digama0 commented Jan 3, 2024

splitOn and splitOnAux are designed for efficient execution, not verification. As long as they produce the right result it is better to spend more work on the verification in exchange for a more efficient function. So unless you have an idea for how to make them more efficient, or if you found a bug in the demonstrated behavior of the function, I suggest taking them as given and proving what you can about them as is.

@chabulhwi
Copy link
Contributor Author

chabulhwi commented Jan 3, 2024

Looking at the definitions of splitOn and splitOnAux, I think they are correct. splitOnAux has the invariant that sep != "", and j < sep.endPos, which justifies the two aspects that you felt the need to change. (Doing the s.get i = sep.get pos check first is a better design, because most characters are not going to be the same as the first character in sep, so this saves one test per character.)

In order to prove splitOnAux_of_valid, we also have to deal with the case where it's true that sep.atEnd j; this is where the problem with the current splitOnAux lies.

When sep.atEnd j is true, the value of sep.get j is default, which is 'A'. In that event, checking whether s.get i == sep.get j is equivalent to checking whether s.get i == 'A'. This leads to the inconsistent behavior of splitOnAux shown below:

#eval "AB".splitOnAux "C"0⟩ ⟨0⟩ ⟨1⟩ []
#eval "AB".splitOnAux "C"1⟩ ⟨1⟩ ⟨0⟩ [""]
#eval "AB".splitOnAux "C"1⟩ ⟨2⟩ ⟨0⟩ [""]
#eval ("B"::[""]).reverse
#eval ["", "B"]

#eval "DB".splitOnAux "C"0⟩ ⟨0⟩ ⟨1⟩ []
#eval "DB".splitOnAux "C"0⟩ ⟨1⟩ ⟨0⟩ []
#eval "DB".splitOnAux "C"0⟩ ⟨2⟩ ⟨0⟩ []
#eval ("DB"::[]).reverse
#eval ["DB"]

Why not just use and prove properties of the existing function?

The weird behavior of splitOnAux I mentioned above makes it painful to prove splitOnAux_of_valid.

@digama0
Copy link
Member

digama0 commented Jan 3, 2024

In order to prove splitOnAux_of_valid, we also have to deal with the case where it's true that sep.atEnd j; this is where the problem with the current splitOnAux lies.

Like I said earlier, splitOnAux has a precondition that j < sep.endPos, which eliminates this case. splitOnAux_of_valid should likewise have this as a hypothesis. (NB: this also eliminates the sep != "" precondition, since j < sep.endPos implies it.)

The whole reason that *_of_valid functions have to be written in such a weird way is because most string functions have very nontrivial preconditions to do with positions lying on UTF-8 byte boundaries. If you don't assume the correct preconditions they just yield garbage values, and we aren't interested in proving properties about those garbage values.

@chabulhwi
Copy link
Contributor Author

chabulhwi commented Jan 3, 2024

Like I said earlier, splitOnAux has a precondition that j < sep.endPos, which eliminates this case. splitOnAux_of_valid should likewise have this as a hypothesis.

Adding j < sep.endPos to splitOnAux_of_valid as a hypothesis will solve the problems I had. I'll close this pull request and make another one later. But I need to take a break first. 🫠

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
WIP work in progress
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants