Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String class is error prone and has usability issues #107

Open
codec-abc opened this issue Oct 23, 2017 · 2 comments
Open

String class is error prone and has usability issues #107

codec-abc opened this issue Oct 23, 2017 · 2 comments

Comments

@codec-abc
Copy link

codec-abc commented Oct 23, 2017

Following the discussion of this issue there was an agreement that the String class has some paint points that I myself ran into.

One issue is that the String class is not encoding aware and expose the underlying byte sequence. So a user could start with a valid string with a known encoding (UTF-8 for instance) and end-up with an invalid one.

As a user I would like to have a String class that do not expose the inner byte array. As such it should have the following properties:

  • The string should be encoding aware. Either by design (only support an encoding and reject everything else) or because it is in the the type with the actual data.
  • The hard-coded string in source code should have a well defined encoding. I think UTF-8 makes a good default. As such, the compiler should reject any file that is not as strictly valid UTF-8 document (or other encoding if UTF-8 is not the default)
  • The string class should not expose functions that let the user alter the inner buffer in such a way that it become invalid.
  • The string constructor using "untrusted values" (like a byte array for example) should be partial and fail on invalid input to be coherent with the previous point.
  • The string class should implement Seq[String] instead of Seq[U8] as it is more convenient to represent a character as a string of length 1 than exposing the internal byte array.
  • Any string function that use index should use index in terms of characters instead of index of byte in the internal array. Otherwise it is error prone as some common encoding (for instance UTF-8) has variable width characters.
  • The null terminator concern should only surface when the user want to interact with external code with FFI.
  • It would be nicer if the standard library could provide encoding conversion (UTF-8 <=> UTF32 and so on) and even nicer if the common string manipulation would work on every string class no matter which encoding is used.

Here is an example that illustrates a few of the problems:

use "format"

actor Main
  new create(env: Env) =>
    let sentence = " this is a pony -> 🐎"
    try 
      let index_of_pony = sentence.find("🐎")?
      
      (let char_u32 : U32, let nb_bytes_of_char : U8) = sentence.utf32(index_of_pony)?
      env.out.print(recover val String.from_utf32(char_u32) end) // correct
      
      env.out.print(sentence.substring(index_of_pony, index_of_pony + 1)) // incorrect
      env.out.print(sentence.substring(index_of_pony, index_of_pony + 4)) //correct
    end
    
    var sentence2 : String ref = "hi".clone() // so far valid UTF-8
    let invalid_code_point = U32(0x11FFF) // 0x11FFF is not a valid codepoint look at http://unicode.org/faq/utf_bom.html#gen6
    sentence2.push_utf32(invalid_code_point)
    env.out.print(recover val sentence2.clone() end)
    try
      (let char_u32 : U32, let nb_bytes : U8) = sentence2.utf32(2)?
      env.out.print(
        "char as codepoint is " + Format.int[U32](char_u32, FormatHex) + 
        " takes " + nb_bytes.string() + " bytes")
    end

As you can see in the first try, finding a character in a string is actually error prone. Calling naively the substring function can return a invalid UTF-8 string even in the original one is valid. And the input string in only a valid UTF-8 string if the code sample is itself encoded in UTF-8. In the second try we see that we can insert in a valid UTF-8 string a codepoint that is not defined by the unicode standard (or so it seems).

@esoterra
Copy link

A good reference to use for a safe string API would be Rust's std::string::String and std::ffi::OsString which allow users to interact with valid UTF-8 strings and platform-specific string representations respectively.

There are a few differences between Rust's string design and what is proposed above.

  • Rust strings have a String::chars method which allows iterating over the char type instead of treating it as a sequence of strings itself.
  • Rust additionally offers a String::char_indices method that allows you to see the byte indices of each character you iterate over
  • Rust does allow users to do some indexing into strings as byte arrays, but doing so risks a panic (example with index) or returns an Option (example with get).

Users have a tendency to assume indexing is a constant time operation which is not the case for character indexing into UTF-8 and I believe is why that ability is only offered through an iterator (e.g. s.chars().nth(...)) in Rust.

@SeanTAllen
Copy link
Member

FYI there's a String RFC that is being worked on so anyone interested in this might want to check that out:

#179

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants