String class useability concerns #1088

Closed
chrisosaurus opened this issue Jul 30, 2016 · 15 comments

@chrisosaurus
Contributor

chrisosaurus commented Jul 30, 2016

When working on #1087 it was clear how... painful? the String class is to use.

I can easily make one, but as soon as I want to take it apart it becomes very difficult to use.

Characters

Getting a single 'character' out

let str = "Hello world"
let c = str(4) // apply returns the raw byte as a U8
env.out.print(c.string()) // outputs "111", rather than the more useful "o"

How am I meant to take the 'character' I get from applying String and produce an actual representation of that character?
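For comparison, here is a minimal Go sketch (not Pony) of the same situation: indexing a string yields a byte, and turning that byte back into text is a separate, explicit step. The helper names `byteAt` and `charAt` are made up for illustration, and `charAt` is only meaningful when the byte at that index is ASCII.

```go
package main

import "fmt"

// byteAt returns the raw byte stored at index i.
func byteAt(s string, i int) byte {
	return s[i]
}

// charAt returns the one-byte substring at index i.
// This is only a valid character when that byte is ASCII.
func charAt(s string, i int) string {
	return s[i : i+1]
}

func main() {
	s := "Hello world"
	fmt.Println(byteAt(s, 4)) // 111
	fmt.Println(charAt(s, 4)) // o
}
```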

Substring

How can I get a substring ranging from the first occurrence of a character until the end of string? I couldn't find any '.end()' or '.start()' methods.
And a negative index of -1 for the second argument makes it exclusive.
.size() returns USize but substring wants ISize.

Printing a substring requires taking my string and calling .string() on it, presumably to massage the type to get the right capabilities.

  let str = "Hello world"
  let substr = str.substring( str.find("w"), str.find("d")+1 )
  env.out.print( substr.string() )

Follow up

Does Pony support overloading? It feels like most of those String methods would be much nicer with overloaded versions.

I also think string needs a .end() and .start() (similar to C++'s String iterators) to make substring useful.

What am I missing?

@chrisosaurus changed the title from "String class useability" to "String class useability concerns" Jul 30, 2016
@Perelandric
Contributor

Perelandric commented Jul 30, 2016

Getting a single 'character' out

Since a string is really an array of U8, not of characters, you'd need to do this manually since there is no designated encoding in the String type. Having a U8.string() return something other than its own numeric representation would be confusing.

I would imagine the string formatter param to .string() gives you an option to set a UTF encoding, but I don't know. In the meantime String.from_utf32(str(4).u32()) would work.

I could see having a Unicode package like Go has to consolidate character encoding utilities.

How can I get a substring ranging from the first occurrence of a character until the end of string?

The to param is optional and defaults to the end of the string.

let substr = str.substring(str.find("w")) // world
env.out.print(consume substr) // Consume the iso
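The same "from the first occurrence to the end" operation can be sketched in Go for comparison. This is not the Pony API; `suffixFrom` is a hypothetical helper built on the standard `strings.Index`, and the boolean result stands in for the "not found" case.

```go
package main

import (
	"fmt"
	"strings"
)

// suffixFrom returns the part of s that starts at the first occurrence
// of sep. The second result is false when sep does not occur in s.
func suffixFrom(s, sep string) (string, bool) {
	i := strings.Index(s, sep)
	if i < 0 {
		return "", false
	}
	return s[i:], true
}

func main() {
	sub, ok := suffixFrom("Hello world", "w")
	fmt.Println(sub, ok) // world true
}
```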

@chrisosaurus
Contributor Author

chrisosaurus commented Jul 31, 2016

Since a string is really an array of U8, not of characters, you'd need to do this manually since there is no designated encoding in the String type. Having a U8.string() return something other than its own numeric representation would be confusing.

I would imagine the string formatter param to .string() gives you an option to set a UTF encoding, but I don't know. In the meantime String.from_utf32(str(4).u32()) would work.

From a user's POV this is frustrating. As far as I am concerned, when I created the string I had the encoding information, but when I pull out 'characters' I no longer have it.

let str = "hello world"

Here the character at index 4 is clearly 'o', and I would like to be able to perform operations on this without having to specify an encoding - just like how I was able to construct the String without specifying an encoding.

Since the String is really a Seq[U8], doesn't that require an 8-bit encoding like UTF-8 anyway?

Why must I specify the encoding once the characters are pulled out as U8, but not when I put them in as a String literal?

I could see having a Unicode package like Go has to consolidate character encoding utilities.

Pony is consistent with Go here, at least:

package main
import "fmt"
func main() {
    fmt.Println("Hello world"[4])
}

Outputs 111.

The only difference is that Go's Println accepts values of any type, whereas Pony requires an explicit conversion to String.
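To make the byte-versus-character distinction concrete, here is a small Go sketch. Go string literals are UTF-8, so the byte length and the code point count differ as soon as the text leaves ASCII. The helper name `counts` is made up for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts returns the number of bytes and the number of Unicode
// code points in s, assuming s holds valid UTF-8.
func counts(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	fmt.Println(counts("hello"))      // 5 5
	fmt.Println(counts("h\u00e9llo")) // 6 5 (U+00E9 takes two bytes)
}
```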

How can I get a substring ranging from the first occurrence of a character until the end of string?
The to param is optional and defaults to the end of the string.

let substr = str.substring(str.find("w")) // world
env.out.print(consume substr) // Consume the iso

Thanks for this, I missed the second arg being optional.

@SeanTAllen
Member

SeanTAllen commented Aug 2, 2016

String is probably going to be rewritten; there's been lots of noise around that for a while.

For what it is worth, I think both @jemc and I like the Erlang approach.

It's binary and you have different functions to work on it.

A string becomes a view with encoding onto the underlying data.

@Perelandric
Contributor

@SeanTAllen Would that mean that I could use string functions on an Array[U8], for example? That would be super. Tonight I'll be reading up on what Erlang does for sure.

@sparrisable

sparrisable commented Aug 3, 2016

I got curious about the string handling in Erlang and found the following quote regarding performance (from https://erlangcentral.org/wiki/index.php?title=String_Basics):

"To understand why Erlang string handling is less efficient than a language like Perl, you need to know that each character uses 8 bytes of memory. That's right -- 8 bytes, not 8 bits! Erlang stores each character as a 32-bit integer, with a 32-bit pointer for the next item in the list (remember, strings are lists of characters.)"

The advantage seems to be:

"This was not done out of wanton wastefullness; using such large values means that Erlang can easily handle anything the UNICODE people throw at it, and the decision to represent strings as lists of characters means that a host of built-in Erlang primitives work on strings without any work on our parts."

I am a hobby user of Pony who has so far been pleasantly surprised by it, and I really like your philosophy:

"The faster the program can get stuff done, the better. This is more important than anything except a correct result."

So I would think that the Erlang string representation might not be the best fit for Pony, though I might be wrong, of course.

@SeanTAllen
Member

@sparrisable you are reading an awful lot into what I said.

I said nothing about using 8 bytes rather than 8 bits for characters.

I have endorsed nothing beyond:

"It's binary and you have different functions to work on it.

A string becomes a view with encoding onto the underlying data."

That's it.

@sparrisable

@SeanTAllen thanks for your clarification, I suspected that I had misunderstood the discussion.

@Perelandric
Contributor

@sparrisable Thanks for doing that research. Interesting information and concerns worth bringing up.

@SeanTAllen
Member

Closing this. Anyone on this who wants to open an RFC for improvements to the existing String class or an entirely new and better approach please do. We all agree the current API is lacking and painful to work with sometimes and agree that the RFC process is the right way to address this.

@sylvanc
Contributor

sylvanc commented Sep 14, 2016

@mkfifo, @Perelandric, if you would be up for leading or participating in an RFC for a better string API, that would be great. I think your contributions would be a huge help.

@sparrisable

Here is some discussion of the Swift language's string API, which is apparently perceived as hard to use, although this blogger disagrees:

"I'm going to explain just why Swift's String API is designed the way it is (or at least, why I think it is) and why I ultimately think it's the best string API out there in terms of its fundamental design."

https://www.mikeash.com/pyblog/friday-qa-2015-11-06-why-is-swifts-string-api-so-hard.html

(might be interesting input to pony string handling)

@jemc
Member

jemc commented Sep 19, 2016

@sparrisable - thanks for the link - I'll be sure to read through the post.

@malthe
Contributor

malthe commented Sep 22, 2016

It seems very reasonable to go with that design in Pony. This should be an RFC then, but very briefly this would allow us to remove all codec-specific code from the built-in string module and shift it to views:

use "codec"

let s: String = "Hello"
let c = Encoded[String](s, UTF8CodepointCodec)(4) // index by code point
let g = Encoded[String](s, UTF8CharacterCodec)(4) // index by character

(Where the codec argument is a primitive.)
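A rough sketch of the "view plus codec" idea, written in Go for illustration rather than Pony. Everything here is an assumption: the `Codec` interface, the `UTF8Codepoints` codec and the `Encoded` view are hypothetical names, not an existing or proposed API. The point is only that the bytes stay untouched while the codec decides what one indexable unit is.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// Codec is a hypothetical interface: Decode reads one unit from the
// front of b and reports how many bytes that unit occupied.
type Codec interface {
	Decode(b []byte) (unit string, size int)
}

// UTF8Codepoints is a hypothetical codec whose unit is one code point.
type UTF8Codepoints struct{}

func (UTF8Codepoints) Decode(b []byte) (string, int) {
	r, size := utf8.DecodeRune(b)
	return string(r), size
}

// Encoded is a view over raw bytes, interpreted through a codec.
type Encoded struct {
	data  []byte
	codec Codec
}

// At returns the i-th unit by decoding from the start, so it is O(n).
// The second result is false when i is out of range.
func (e Encoded) At(i int) (string, bool) {
	b := e.data
	for len(b) > 0 {
		unit, size := e.codec.Decode(b)
		if i == 0 {
			return unit, true
		}
		b = b[size:]
		i--
	}
	return "", false
}

func main() {
	v := Encoded{data: []byte("h\u00e9llo"), codec: UTF8Codepoints{}}
	fmt.Println(v.At(4)) // o true
}
```

A grapheme-level codec would implement the same interface with a larger unit. The linear cost of `At` is the trade-off for keeping a plain byte buffer underneath.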

@chrisosaurus
Contributor Author

I've been thinking more about String API design lately; I hope replying here is appropriate.

"It's binary and you have different functions to work on it.

A string becomes a view with encoding onto the underlying data."

I agree that the string itself should be encoding aware - when I put the "characters" in I likely had encoding information, it seems silly to drop it.
I might even say that I think all strings should have the same underlying encoding, but I'm not so sure on that point.

@sparrisable
I dislike the Erlang approach of using a fixed-width encoding (8 bytes per character); it wastes a lot of space.
This does mean that rune-index operations are cheaper, but rune-index operations are likely not very common.
It would also make the API nice, in that a String is now just a List/Array of (fixed-size) characters and all of your normal list operations work without adaptation.
I'm a bit uncomfortable about making the runtime memory usage significantly higher so that you can reuse list/array operations :/
For ASCII text this would be an 8-fold increase in runtime string size, which seems absurd.
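The 8-fold figure can be checked with some quick arithmetic, sketched here in Go. It assumes the layout described in the quote above (a 32-bit value plus a 32-bit pointer per character, so 8 bytes each). The function names are made up for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// utf8Size is the storage needed for s as a plain UTF-8 byte sequence.
func utf8Size(s string) int {
	return len(s)
}

// listSize is the storage needed under the assumed list layout:
// 4 bytes of value plus 4 bytes of pointer per character.
func listSize(s string) int {
	return 8 * utf8.RuneCountInString(s)
}

func main() {
	s := "Hello world"
	fmt.Println(utf8Size(s), listSize(s)) // 11 88
}
```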

http://utf8everywhere.org/
Is a reasonably good write-up of why UTF-8 is often preferred.

The approach in Go is pretty good; I'll have to look more into Swift and Erlang.

@jemc
Member

jemc commented Apr 18, 2017

I don't think that implementations that use anything other than simple byte sequences under the hood are very practical.

At the end of the day, you need to be able to read/write the string from/to a file or a socket, so if the internal representation of the string is anything other than a sequence of bytes (for example, if it were a sequence of codepoints represented as U32 or U64), then it would require a transformation/copy of the whole buffer for any read/write operations. With Pony's focus on performance, I think it would probably be unacceptable to make something like that the default representation of a String (though if there is compelling need for it, it could possibly be an alternative option).

For that reason, I think it is the right choice to look at Strings as being a sequence of bytes, viewed through an encoding. And yes, I agree that UTF-8 is an excellent choice for a default encoding to use. In fact, when we're getting started we could probably implement only UTF-8 to begin with, and proceed to implement other encodings as there comes a need to do so.
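The I/O argument can be illustrated with a short Go sketch. This is only an analogy, not Pony code: a byte-backed string can be handed to a writer as-is, while a buffer of code points has to be re-encoded (and therefore copied) first. The names `writeBytes`, `writeCodepoints` and `demo` are hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// writeBytes hands the existing UTF-8 bytes to the writer unchanged.
func writeBytes(w io.Writer, data []byte) (int, error) {
	return w.Write(data)
}

// writeCodepoints must first re-encode the whole buffer to UTF-8,
// which allocates and copies before anything is written.
func writeCodepoints(w io.Writer, cps []rune) (int, error) {
	encoded := []byte(string(cps))
	return w.Write(encoded)
}

// demo writes s through both paths and returns what each one produced.
func demo(s string) (direct, reencoded string) {
	var a, b bytes.Buffer
	writeBytes(&a, []byte(s))
	writeCodepoints(&b, []rune(s))
	return a.String(), b.String()
}

func main() {
	direct, reencoded := demo("Hello")
	fmt.Println(direct == reencoded) // true
}
```

Both paths produce the same output. The difference is the extra transformation and copy in the code point version.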
