String class useability concerns #1088
Comments
chrisosaurus changed the title from "String class useability" to "String class useability concerns" Jul 30, 2016
Perelandric
Jul 30, 2016
Contributor
Getting a single 'character' out
Since a string is really an array of U8, not of characters, you'd need to do this manually since there is no designated encoding in the String type. Having a U8.string() return something other than its own numeric representation would be confusing.
I would imagine the string formatter param to .string() gives you an option to set a UTF encoding, but I don't know. In the meantime String.from_utf32(str(4).u32()) would work.
I could see having a Unicode package like Go has to consolidate character encoding utilities.
How can I get a substring ranging from the first occurrence of a character until the end of string?
The to param is optional and defaults to the end of the string.
let substr = str.substring(str.find("w")) // world
env.out.print(consume substr) // Consume the iso
chrisosaurus
Jul 31, 2016
Contributor
Since a string is really an array of U8, not of characters, you'd need to do this manually since there is no designated encoding in the String type. Having a U8.string() return something other than its own numeric representation would be confusing.
I would imagine the string formatter param to .string() gives you an option to set a UTF encoding, but I don't know. In the meantime String.from_utf32(str(4).u32()) would work.
From a user's POV this is frustrating. As far as I am concerned, I had the encoding information when I created the string, but when I pull out 'characters' I no longer have it.
let str = "hello world"
Here the character at index 4 is clearly 'o', and I would like to be able to perform operations on it without having to specify an encoding, just like how I was able to construct the String without specifying one.
Since the String is really a Seq[U8], doesn't that require an 8-bit encoding like UTF-8 anyway?
Why must I specify the encoding once the characters are pulled out as U8, but not when I put them in as a String literal?
I could see having a Unicode package like Go has to consolidate character encoding utilities.
Pony is consistent with Go here, at least
package main

import "fmt"

func main() {
	fmt.Println("Hello world"[4])
}
Outputs 111.
The only difference is that Go's Println accepts arguments of any type, not just strings, whereas Pony requires an explicit conversion.
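A minimal sketch of the same point in Go, assuming ASCII input: indexing a string yields a byte, and an explicit conversion is needed to get the character back as a string. The helper name `charAt` is made up for illustration.

```go
package main

import "fmt"

// charAt returns the byte at index i of s as a one-character string.
// Assumption: s[i] is an ASCII byte. For non-ASCII text this would
// misinterpret a single byte of a multi-byte UTF-8 sequence.
func charAt(s string, i int) string {
	return string(rune(s[i]))
}

func main() {
	s := "Hello world"
	fmt.Println(s[4])         // prints 111, the raw byte value
	fmt.Println(charAt(s, 4)) // prints o
}
```

The conversion only works here because the byte happens to be a complete character, which is exactly the encoding knowledge the String type does not carry.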
How can I get a substring ranging from the first occurrence of a character until the end of string?
The to param is optional and defaults to the end of the string.
let substr = str.substring(str.find("w")) // world
env.out.print(consume substr) // Consume the iso
Thanks for this, I missed the second arg being optional.
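For comparison, a hedged sketch of the same operation in Go, where the end of a slice expression is likewise optional. `suffixFrom` is a hypothetical helper written for this example, not part of any library.

```go
package main

import (
	"fmt"
	"strings"
)

// suffixFrom returns the part of s starting at the first occurrence of sep.
// The boolean is false when sep does not occur in s.
func suffixFrom(s, sep string) (string, bool) {
	i := strings.Index(s, sep)
	if i < 0 {
		return "", false
	}
	return s[i:], true // omitting the end index means "until the end of s"
}

func main() {
	sub, ok := suffixFrom("hello world", "w")
	fmt.Println(sub, ok) // prints: world true
}
```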
SeanTAllen
Aug 2, 2016
Member
String is probably going to be rewritten, there's been lots of noise around that for a while.
For what it is worth, I think both @jemc and I like the Erlang approach.
It's binary and you have different functions to work on it.
A string becomes a view with encoding onto the underlying data.
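To illustrate the idea of "binary plus different functions", here is a small Go sketch (not Erlang's actual API) in which the same two bytes are read through two different encodings. The function names `countUTF8` and `decodeLatin1` are assumptions made for this example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// countUTF8 interprets data as UTF-8 and counts its code points.
func countUTF8(data []byte) int {
	return utf8.RuneCount(data)
}

// decodeLatin1 interprets data as ISO-8859-1, where every byte is one character.
func decodeLatin1(data []byte) string {
	runes := make([]rune, len(data))
	for i, b := range data {
		runes[i] = rune(b)
	}
	return string(runes)
}

func main() {
	data := []byte{0xC3, 0xA9} // the UTF-8 encoding of "é"
	fmt.Println(countUTF8(data), string(data))   // one code point when read as UTF-8
	fmt.Println(len(data), decodeLatin1(data))   // two characters when read as Latin-1
}
```

The bytes never change; only the function applied to them decides what counts as a character.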
SeanTAllen added the "needs discussion during sync" label Aug 2, 2016
Perelandric
Aug 2, 2016
Contributor
@SeanTAllen Would that mean that I could use string functions on an Array[U8], for example? That would be super. Tonight I'll be reading up on what Erlang does for sure.
sparrisable
Aug 3, 2016
I got curious about the string handling in Erlang and found the following quote regarding performance (from https://erlangcentral.org/wiki/index.php?title=String_Basics):
"To understand why Erlang string handling is less efficient than a language like Perl, you need to know that each character uses 8 bytes of memory. That's right -- 8 bytes, not 8 bits! Erlang stores each character as a 32-bit integer, with a 32-bit pointer for the next item in the list (remember, strings are lists of characters.)"
The advantage seems to be:
"This was not done out of wanton wastefullness; using such large values means that Erlang can easily handle anything the UNICODE people throw at it, and the decision to represent strings as lists of characters means that a host of built-in Erlang primitives work on strings without any work on our parts."
I am a hobby user of Pony who has so far been pleasantly surprised by it, and I really like your philosophy:
"The faster the program can get stuff done, the better. This is more important than anything except a correct result."
So to me, I would think that the Erlang string representation might not be the best fit for pony. But I might be wrong in this regard of course.
SeanTAllen
Aug 3, 2016
Member
@sparrisable you are reading an awful lot into what I said.
I said nothing about using 8 bytes rather than 8 bits for characters.
I have endorsed nothing beyond:
"It's binary and you have different functions to work on it.
A string becomes a view with encoding onto the underlying data."
That's it.
sparrisable
Aug 3, 2016
@SeanTAllen thanks for your clarification, I suspected that I had misunderstood the discussion.
Perelandric
Aug 3, 2016
Contributor
@sparrisable Thanks for doing that research. Interesting information and concerns worth bringing up.
SeanTAllen removed the "needs discussion during sync" label Sep 13, 2016
SeanTAllen
Sep 14, 2016
Member
Closing this. Anyone on this who wants to open an RFC for improvements to the existing String class or an entirely new and better approach please do. We all agree the current API is lacking and painful to work with sometimes and agree that the RFC process is the right way to address this.
SeanTAllen closed this Sep 14, 2016
sylvanc
Sep 14, 2016
Contributor
@mkfifo, @Perelandric, if you would be up for leading or participating in an RFC for a better string API, that would be great. I think your contributions would be a huge help.
sparrisable
Sep 19, 2016
Some discussion regarding the Swift language's string API that apparently is perceived as hard to use, but there is a blogger who disagrees:
"I'm going to explain just why Swift's String API is designed the way it is (or at least, why I think it is) and why I ultimately think it's the best string API out there in terms of its fundamental design."
https://www.mikeash.com/pyblog/friday-qa-2015-11-06-why-is-swifts-string-api-so-hard.html
(might be interesting input to pony string handling)
jemc
Sep 19, 2016
Member
@sparrisable - thanks for the link - I'll be sure to read through the post.
malthe
Sep 22, 2016
Contributor
It seems very reasonable to go with that design in Pony. This should be an RFC then, but very briefly this would allow us to remove all codec-specific code from the built-in string module and shift it to views:
use "codec"
let s: String = "Hello"
let c = Encoded<String>(s, UTF8CodepointCodec)(4)
let g = Encoded<String>(s, UTF8CharacterCodec)(4)
(Where the codec argument is a primitive.)
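The view-plus-codec idea can be sketched in Go roughly as follows. This is only an illustration under assumptions: the `Codec` interface, the `ByteCodec` type, and the `At` method are invented here, and a character (grapheme) codec is left out because Go's standard library has no grapheme segmentation.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// Codec is a hypothetical interface: it decodes the n-th unit of data.
type Codec interface {
	At(data []byte, n int) (rune, bool)
}

// ByteCodec treats every byte as one unit.
type ByteCodec struct{}

func (ByteCodec) At(data []byte, n int) (rune, bool) {
	if n < 0 || n >= len(data) {
		return 0, false
	}
	return rune(data[n]), true
}

// UTF8CodepointCodec treats every UTF-8 code point as one unit.
type UTF8CodepointCodec struct{}

func (UTF8CodepointCodec) At(data []byte, n int) (rune, bool) {
	if n < 0 {
		return 0, false
	}
	for len(data) > 0 {
		r, size := utf8.DecodeRune(data)
		if n == 0 {
			return r, true
		}
		data = data[size:]
		n--
	}
	return 0, false
}

// Encoded is a view: it pairs bytes with a codec without copying them.
type Encoded struct {
	data  []byte
	codec Codec
}

// At returns the n-th unit of the view, as defined by its codec.
func (e Encoded) At(n int) (rune, bool) { return e.codec.At(e.data, n) }

func main() {
	s := []byte("héllo")
	b, _ := Encoded{s, ByteCodec{}}.At(4)
	c, _ := Encoded{s, UTF8CodepointCodec{}}.At(4)
	fmt.Println(b, c) // prints: 108 111
}
```

Index 4 lands on a different letter depending on the codec, because "é" occupies two bytes but is a single code point.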
chrisosaurus
Apr 18, 2017
Contributor
I've been thinking more about String API design lately, I hope replying here is appropriate.
"It's binary and you have different functions to work on it.
A string becomes a view with encoding onto the underlying data."
I agree that the string itself should be encoding aware - when I put the "characters" in I likely had encoding information, it seems silly to drop it.
I might even say that I think all strings should have the same underlying encoding, but I'm not so sure on that point.
@sparrisable
I dislike the Erlang approach using a fixed-width encoding (8 bytes per character), you are wasting a lot of space.
This does mean that rune-index operations are cheaper - but rune-index operations are likely not very common.
This would also make the api nice in that a String is now just a List/Array of (fixed size) Characters and all of your normal list operations work without adaptation.
I'm a bit uncomfortable about making the runtime memory usage significantly higher so that you can re-use list/array operations :/
For ASCII text this would be an 8-fold increase in runtime string size which seems absurd.
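The arithmetic behind that 8-fold figure can be checked with a short Go sketch. It assumes the 8 bytes per character quoted earlier in this thread (a 32-bit value plus a 32-bit pointer); `listBytes` is a hypothetical helper, not a measurement of any real runtime.

```go
package main

import "fmt"

// listBytes estimates the size of s when stored as a linked list that
// spends bytesPerCell bytes on every character.
func listBytes(s string, bytesPerCell int) int {
	return len([]rune(s)) * bytesPerCell
}

func main() {
	s := "hello world"
	// len(s) is the UTF-8 size in bytes; for ASCII that is one byte per character.
	fmt.Println(len(s), listBytes(s, 8)) // prints: 11 88
}
```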
http://utf8everywhere.org/ is a reasonably good write-up of why UTF-8 is often preferred.
The approach in Go is pretty good, I'll have to look more into Swift and Erlang.
jemc
Apr 18, 2017
Member
I don't think that implementations that use anything other than simple byte sequences under the hood are very practical.
At the end of the day, you need to be able to read/write the string from/to a file or a socket, so if the internal representation of the string is anything other than a sequence of bytes (for example, if it were a sequence of codepoints represented as U32 or U64), then it would require a transformation/copy of the whole buffer for any read/write operations. With Pony's focus on performance, I think it would probably be unacceptable to make something like that the default representation of a String (though if there is compelling need for it, it could possibly be an alternative option).
For that reason, I think it is the right choice to look at Strings as being a sequence of bytes, viewed through an encoding. And yes, I agree that UTF-8 is an excellent choice for a default encoding to use. In fact, when we're getting started we could probably implement only UTF-8 to begin with, and proceed to implement other encodings as there comes a need to do so.
chrisosaurus commented Jul 30, 2016 (edited Jul 31, 2016)
When working on #1087 it was clear how... painful? the String class is to use.
I can easily make one, but as soon as I want to take it apart it becomes very difficult to use.
Characters
Getting a single 'character' out
How am I meant to take the 'character' I get from applying String and produce an actual representation of that character?
Substring
How can I get a substring ranging from the first occurrence of a character until the end of string? I couldn't find any '.end()' or '.start()' methods.
And a negative index of -1 for the second argument makes it exclusive.
.size() returns USize but substring wants ISize.
Printing a substring requires taking my string and calling .string on it, presumably to massage the type to get the right capabilities.
Follow up
Does Pony support overloading? It feels like most of those String methods would be much nicer with overloaded versions.
I also think string needs a .end() and .start() (similar to C++'s String iterators) to make substring useful.
What am I missing?