String class useability concerns #1088

chrisosaurus · 2016-07-30T06:12:55Z

When working on #1087 it was clear how... painful? the String class is to use.

I can easily make one, but as soon as I want to take it apart it becomes very difficult to use.

Characters

Getting a single 'character' out

let str = "Hello world"
let c = str(4)
env.out.print(c.string()) // outputs "111", rather than the more useful "o"

How am I meant to take the 'character' I get from applying String and produce an actual representation of that character?

Substring

How can I get a substring ranging from the first occurrence of a character until the end of string? I couldn't find any '.end()' or '.start()' methods.
And a negative index of -1 for the second argument makes it exclusive.
.size() returns USize but substring wants ISize.

Printing a substring requires taking my string and calling .string() on it, presumably to massage the type to get the right capabilities.

  let str = "Hello world"
  let substr = str.substring( str.find("w"), str.find("d")+1 )
  env.out.print( substr.string() )

Follow up

Does Pony support overloading? It feels like most of those String methods would be much nicer with overloaded versions.

I also think string needs a .end() and .start() (similar to C++'s String iterators) to make substring useful.

What am I missing?

Perelandric · 2016-07-30T14:28:06Z

Getting a single 'character' out

Since a string is really an array of U8, not of characters, you'd need to do this manually since there is no designated encoding in the String type. Having a U8.string() return something other than its own numeric representation would be confusing.

I would imagine the string formatter param to .string() gives you an option to set a UTF encoding, but I don't know. In the meantime String.from_utf32(str(4).u32()) would work.

I could see having a Unicode package like Go has to consolidate character encoding utilities.

How can I get a substring ranging from the first occurrence of a character until the end of string?

The to param is optional and defaults to the end of the string.

let substr = str.substring(str.find("w")) // world
env.out.print(consume substr) // Consume the iso

chrisosaurus · 2016-07-31T03:19:49Z

Since a string is really an array of U8, not of characters, you'd need to do this manually since there is no designated encoding in the String type. Having a U8.string() return something other than its own numeric representation would be confusing.

I would imagine the string formatter param to .string() gives you an option to set a UTF encoding, but I don't know. In the meantime String.from_utf32(str(4).u32()) would work.

From a user's POV this is frustrating. As far as I am concerned, when I created the string I had the encoding information, but when I pull out 'characters' I no longer have it.

let str = "hello world"

Here the character at index 4 is clearly 'o', and I would like to be able to perform operations on it without having to specify an encoding, just like I was able to construct the String without specifying an encoding.

Since the String is really a Seq[U8], doesn't that require an 8-bit encoding like UTF-8 anyway?

Why must I specify the encoding once the characters are pulled out as U8, but not when I put them in as a String literal?

I could see having a Unicode package like Go has to consolidate character encoding utilities.

Pony is consistent with Go here, at least:

package main
import "fmt"
func main() {
    fmt.Println("Hello world"[4])
}

Outputs 111.

The only difference is Go's Println function being overloaded for types other than String, whereas Pony requires an explicit conversion.
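To make the byte-versus-character distinction concrete, here is a small Go sketch. The helper names `byteAt` and `runeAt` are made up for this illustration and are not part of any library. It assumes the string is UTF-8 encoded, which is what Go's `range` over a string decodes.

```go
package main

import "fmt"

// byteAt returns the raw byte at index i, which is what s[i] yields in Go.
func byteAt(s string, i int) byte {
	return s[i]
}

// runeAt walks the UTF-8 string and returns the i-th code point as a string.
// The boolean is false when the string holds fewer than i+1 code points.
func runeAt(s string, i int) (string, bool) {
	n := 0
	for _, r := range s {
		if n == i {
			return string(r), true
		}
		n++
	}
	return "", false
}

func main() {
	s := "Hello world"
	fmt.Println(byteAt(s, 4))         // prints 111
	fmt.Println(string(byteAt(s, 4))) // prints o (safe here because the byte is ASCII)

	// With a non-ASCII string, byte index and code point index diverge.
	c, ok := runeAt("h\u00e9llo", 1)
	fmt.Println(c, ok)
}
```

Indexing by code point is a linear scan in this sketch, which is the usual cost of keeping the data as UTF-8 bytes.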

How can I get a substring ranging from the first occurrence of a character until the end of string?
The to param is optional and defaults to the end of the string.

let substr = str.substring(str.find("w")) // world
env.out.print(consume substr) // Consume the iso

Thanks for this, I missed the second arg being optional.
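Continuing the Go comparison from earlier in this comment, the same substring task could be sketched as follows. The helpers `from` and `between` are hypothetical names chosen for this example; only `strings.Index` and slicing come from Go itself.

```go
package main

import (
	"fmt"
	"strings"
)

// from returns the part of s starting at the first occurrence of sep.
// The boolean is false when sep does not occur in s.
func from(s, sep string) (string, bool) {
	i := strings.Index(s, sep)
	if i < 0 {
		return "", false
	}
	return s[i:], true
}

// between returns the part of s from the first occurrence of start through
// the first occurrence of end at or after it, both inclusive.
func between(s, start, end string) (string, bool) {
	i := strings.Index(s, start)
	if i < 0 {
		return "", false
	}
	j := strings.Index(s[i:], end)
	if j < 0 {
		return "", false
	}
	return s[i : i+j+len(end)], true
}

func main() {
	s := "Hello world"
	tail, _ := from(s, "w")
	fmt.Println(tail) // prints world
	mid, _ := between(s, "e", "o")
	fmt.Println(mid) // prints ello
}
```

Because the slice end is exclusive and defaults to the end of the string, no separate `.end()` accessor is needed in this style.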

SeanTAllen · 2016-08-02T22:07:39Z

String is probably going to be rewritten, there's been lots of noise around that for a while.

For what it is worth, I think both @jemc and I like the Erlang approach.

It's binary and you have different functions to work on it.

A string becomes a view with encoding onto the underlying data.

Perelandric · 2016-08-02T22:39:32Z

@SeanTAllen Would that mean that I could use string functions on an Array[U8], for example? That would be super. Tonight I'll be reading up on what Erlang does for sure.

sparrisable · 2016-08-03T07:53:59Z

I got curious about the string handling in Erlang and found the following quote regarding performance (from https://erlangcentral.org/wiki/index.php?title=String_Basics):

"To understand why Erlang string handling is less efficient than a language like Perl, you need to know that each character uses 8 bytes of memory. That's right -- 8 bytes, not 8 bits! Erlang stores each character as a 32-bit integer, with a 32-bit pointer for the next item in the list (remember, strings are lists of characters.)"

The advantage seems to be:

"This was not done out of wanton wastefullness; using such large values means that Erlang can easily handle anything the UNICODE people throw at it, and the decision to represent strings as lists of characters means that a host of built-in Erlang primitives work on strings without any work on our parts."

I am a hobby user of Pony who has so far been pleasantly surprised by it, and I really like your philosophy:

"The faster the program can get stuff done, the better. This is more important than anything except a correct result."

So to me, the Erlang string representation might not be the best fit for Pony. But I might be wrong in this regard, of course.

SeanTAllen · 2016-08-03T12:19:39Z

@sparrisable you are reading an awful lot into what I said.

I said nothing about using 8 bytes rather than 8 bits for characters.

I have endorsed nothing beyond:

"Its binary and you have different functions to work on it.

A string becomes a view with encoding onto the underlying data."

That's it.

sparrisable · 2016-08-03T14:18:09Z

@SeanTAllen thanks for your clarification, I suspected that I had misunderstood the discussion.

Perelandric · 2016-08-03T15:08:00Z

@sparrisable Thanks for doing that research. Interesting information and concerns worth bringing up.

SeanTAllen · 2016-09-14T19:53:25Z

Closing this. Anyone on this who wants to open an RFC for improvements to the existing String class or an entirely new and better approach please do. We all agree the current API is lacking and painful to work with sometimes and agree that the RFC process is the right way to address this.

sylvanc · 2016-09-14T19:55:29Z

@mkfifo, @Perelandric, if you would be up for leading or participating in an RFC for a better string API, that would be great. I think your contributions would be a huge help.

sparrisable · 2016-09-19T14:34:15Z

Here is some discussion of the Swift language's string API, which is apparently perceived as hard to use, although this blogger disagrees:

"I'm going to explain just why Swift's String API is designed the way it is (or at least, why I think it is) and why I ultimately think it's the best string API out there in terms of its fundamental design."

https://www.mikeash.com/pyblog/friday-qa-2015-11-06-why-is-swifts-string-api-so-hard.html

(This might be interesting input for Pony's string handling.)

jemc · 2016-09-19T15:49:21Z

@sparrisable - thanks for the link - I'll be sure to read through the post.

malthe · 2016-09-22T19:31:35Z

It seems very reasonable to go with that design in Pony. This should be an RFC then, but very briefly this would allow us to remove all codec-specific code from the built-in string module and shift it to views:

use "codec"

// Sketch of the proposed API; the "codec" package and Encoded are part of the proposal.
let s: String = "Hello"
let c = Encoded[String](s, UTF8CodepointCodec)(4)
let g = Encoded[String](s, UTF8CharacterCodec)(4)

(Where the codec argument is a primitive.)
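The view idea can be sketched in Go to show how the pieces might fit together. Everything here (`Codec`, `ByteCodec`, `UTF8Codec`, `View`, `At`) is a hypothetical name invented for the illustration, not the proposed Pony API, and the sketch only covers code points, not grapheme clusters.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// Codec decodes one unit from the front of b and reports how many bytes it used.
type Codec interface {
	DecodeFirst(b []byte) (unit rune, size int)
}

// ByteCodec treats every byte as its own unit.
type ByteCodec struct{}

func (ByteCodec) DecodeFirst(b []byte) (rune, int) { return rune(b[0]), 1 }

// UTF8Codec treats every UTF-8 encoded code point as a unit.
type UTF8Codec struct{}

func (UTF8Codec) DecodeFirst(b []byte) (rune, int) { return utf8.DecodeRune(b) }

// View pairs raw bytes with the codec used to interpret them.
// The bytes themselves are never copied or transformed.
type View struct {
	data  []byte
	codec Codec
}

// At returns the i-th unit as seen through the view's codec.
func (v View) At(i int) (rune, error) {
	b := v.data
	for n := 0; len(b) > 0; n++ {
		r, size := v.codec.DecodeFirst(b)
		if n == i {
			return r, nil
		}
		b = b[size:]
	}
	return 0, errors.New("index out of range")
}

func main() {
	data := []byte("h\u00e9llo") // 6 bytes, 5 code points
	for _, c := range []Codec{ByteCodec{}, UTF8Codec{}} {
		r, err := View{data: data, codec: c}.At(1)
		fmt.Printf("%T: %q %v\n", c, r, err)
	}
}
```

The same byte slice gives different answers at index 1 depending on the codec, which is the point of separating storage from interpretation.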

chrisosaurus · 2017-04-18T07:52:26Z

I've been thinking more about String API design lately; I hope replying here is appropriate.

"Its binary and you have different functions to work on it.

A string becomes a view with encoding onto the underlying data."

I agree that the string itself should be encoding aware - when I put the "characters" in I likely had encoding information, it seems silly to drop it.
I might even say that I think all strings should have the same underlying encoding, but I'm not so sure on that point.

@sparrisable
I dislike the Erlang approach of using a fixed-width encoding (8 bytes per character); it wastes a lot of space.
It does mean that rune-index operations are cheaper, but rune-index operations are likely not very common.
It would also make the API nice, in that a String is then just a List/Array of (fixed-size) Characters and all of the normal list operations work without adaptation.
I'm a bit uncomfortable about making the runtime memory usage significantly higher just so that list/array operations can be reused :/
For ASCII text this would be an 8-fold increase in runtime string size, which seems absurd.
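The 8-fold figure can be checked with a few lines of arithmetic, written here in Go. The 8 bytes per character is taken from the wiki quote earlier in the thread and is not verified independently; the function names are made up for the example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// bytesPerListChar is the per-character cost quoted earlier in the thread
// for list-based strings: a 32-bit integer plus a 32-bit pointer.
const bytesPerListChar = 8

// listBytes estimates the size of s stored as a list of characters.
func listBytes(s string) int {
	return utf8.RuneCountInString(s) * bytesPerListChar
}

// utf8Bytes is the size of s stored as plain UTF-8 bytes.
func utf8Bytes(s string) int {
	return len(s)
}

func main() {
	for _, s := range []string{"Hello world", "h\u00e9llo"} {
		fmt.Printf("%q: list=%d bytes, utf8=%d bytes\n", s, listBytes(s), utf8Bytes(s))
	}
}
```

For the 11-character ASCII string this gives 88 bytes against 11, which is the 8-fold increase; for text with multi-byte characters the ratio is smaller.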

http://utf8everywhere.org/ is a reasonably good write-up of why UTF-8 is often preferred.

The approach in Go is pretty good; I'll have to look more into Swift and Erlang.

jemc · 2017-04-18T18:36:12Z

I don't think that implementations that use anything other than simple byte sequences under the hood are very practical.

At the end of the day, you need to be able to read/write the string from/to a file or a socket, so if the internal representation of the string is anything other than a sequence of bytes (for example, if it were a sequence of codepoints represented as U32 or U64), then it would require a transformation/copy of the whole buffer for any read/write operations. With Pony's focus on performance, I think it would probably be unacceptable to make something like that the default representation of a String (though if there is compelling need for it, it could possibly be an alternative option).

For that reason, I think it is the right choice to look at Strings as being a sequence of bytes, viewed through an encoding. And yes, I agree that UTF-8 is an excellent choice for a default encoding to use. In fact, when we're getting started we could probably implement only UTF-8 to begin with, and proceed to implement other encodings as there comes a need to do so.

chrisosaurus changed the title ~~String class useability~~ String class useability concerns Jul 30, 2016

chrisosaurus mentioned this issue Jul 30, 2016

Write class String documentation #725

Closed

SeanTAllen added the needs discussion label Aug 2, 2016

aturley mentioned this issue Aug 3, 2016

String.from_array does too much #1099

Closed

SeanTAllen removed the needs discussion label Sep 13, 2016

SeanTAllen closed this as completed Sep 14, 2016

codec-abc mentioned this issue Oct 23, 2017

String class is error prone and has usability issues ponylang/rfcs#107

Open
