proposal: runes: create new package analogous to bytes, for rune slices #34313

Closed
srinathh opened this issue Sep 16, 2019 · 14 comments

Comments

@srinathh srinathh commented Sep 16, 2019

Working with and manipulating non-English data requires us to use rune slices. If we want to do operations like comparing two rune slices, replacing, indexing, etc., we have to convert to string, do those operations, and convert back, or else write custom functions.

I would therefore like to propose creating a package runes, mirroring the package bytes, with functionality to work directly with rune slices rather than bytes, to support international language use cases.
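
To make the round trip described above concrete, here is a minimal sketch of searching inside a rune slice using only what the standard library offers today. The helper name `indexRunes` is hypothetical and is not part of any existing package; it only illustrates the convert, operate, convert-back pattern the proposal wants to avoid.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// indexRunes is a hypothetical helper (not a standard library function).
// It reports the rune index of the first occurrence of sep in s, or -1.
// It converts both slices to strings, searches with strings.Index, and
// then translates the resulting byte offset back into a rune offset.
func indexRunes(s, sep []rune) int {
	str := string(s) // allocates and encodes to UTF-8
	byteIdx := strings.Index(str, string(sep))
	if byteIdx < 0 {
		return -1
	}
	// Count the code points that precede the match.
	return utf8.RuneCountInString(str[:byteIdx])
}

func main() {
	text := []rune("na\u00efve caf\u00e9") // "naïve café"
	fmt.Println(indexRunes(text, []rune("caf\u00e9")))
}
```

Each call pays for two conversions plus a rune count, which is the overhead the proposal is pointing at.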

@gopherbot gopherbot added this to the Proposal milestone Sep 16, 2019
@gopherbot gopherbot added the Proposal label Sep 16, 2019

@bserdar bserdar commented Sep 16, 2019

This would not be necessary once (if) generics are implemented.


@lootch lootch commented Sep 16, 2019


@bserdar bserdar commented Sep 16, 2019


@robpike robpike commented Sep 16, 2019

I have trouble with your opening sentence: "Working with and manipulating non-English data requires us to use rune slices." That is presented as a fact but is an opinion, one I just don't think is true.

I speak only English but I have spent a lot of time working with text that is not ASCII and, although it can be attractive to work with rune slices, they are not really a good solution. In fact, I think they are a trap: they don't answer most of the questions that persist with multilingual text because, despite what many want to believe, a rune is not a character. (See blog.golang.org/strings for an explanation of this.)

I would therefore prefer not to add such a package as it would promote bad practice.


@srinathh srinathh commented Sep 16, 2019

@robpike I hear you but now I'm really puzzled. My takeaway from your blog post (which I have revisited many times over the years, including just before making this proposal today) is that runes are a better way than bytes to deal with non-English characters, smileys, and whatnot. Ranging over a string gives runes.

Now I do recall from reading the article linked to in your blog that some Unicode code points are modifiers, that some characters can be made from multiple combinations of Unicode code points, and that this can mess things up. But what's a better way to deal with mutable collections of Unicode code points than the slice of runes that Go makes available?


@robpike robpike commented Sep 17, 2019

Runes are code points, from which characters are made. Bytes are also things from which characters are made. Why use both?

Sometimes we need the code points themselves, but providing a package that handles slices of them will encourage the poor practice of converting back and forth between rune slices and byte slices/strings rather than the more efficient method of just iterating the bytes appropriately.
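
One reading of "iterating the bytes appropriately" is to decode one code point at a time from the UTF-8 data, without ever building a []rune. The sketch below assumes that reading; `countRune` is a hypothetical helper name, while `utf8.DecodeRune` is the standard library call doing the work.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// countRune is a hypothetical helper that counts occurrences of r in
// UTF-8 encoded data. It decodes one code point per iteration and
// allocates nothing. Invalid bytes decode as utf8.RuneError with size 1,
// so the loop always makes progress.
func countRune(b []byte, r rune) int {
	n := 0
	for len(b) > 0 {
		got, size := utf8.DecodeRune(b)
		if got == r {
			n++
		}
		b = b[size:]
	}
	return n
}

func main() {
	data := []byte("caf\u00e9 cr\u00e8me br\u00fbl\u00e9e") // "café crème brûlée"
	fmt.Println(countRune(data, '\u00e9'))                  // occurrences of é
}
```

For strings, a plain `for _, r := range s` loop performs the same decoding.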


@srinathh srinathh commented Sep 17, 2019

May I share an example use case? Suppose we're building a simple text editor. When people enter text, they enter Unicode code points to make characters. If we use rune slices, we can simply insert the required rune at the right position.

If we are using byte slices, for each insertion or deletion, we would have to iterate the slice through a function to parse Unicode, find the right position to insert or delete & make the change. Since this iteration can throw an error, we'd have to check for error. If we are using strings, we'd have to reallocate for every single insertion or deletion & then again run iterations.

Essentially, if we want to work with mutable sets of Unicode characters, then neither the bytes solution nor the strings solution seems efficient.
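
The two insertion strategies contrasted above can be sketched side by side. Both function names are hypothetical and only illustrate the comparison; they take a position counted in code points. Note that in Go, decoding invalid UTF-8 does not return an error value: `utf8.DecodeRune` yields `utf8.RuneError` with a width of 1.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// insertRune (hypothetical) inserts r before index i of a rune slice.
// The position is used directly; i must satisfy 0 <= i <= len(s).
func insertRune(s []rune, i int, r rune) []rune {
	s = append(s, 0)     // grow by one element
	copy(s[i+1:], s[i:]) // shift the tail right by one
	s[i] = r
	return s
}

// insertRuneUTF8 (hypothetical) inserts r before the i-th code point of
// UTF-8 data. It must first walk the bytes to turn the code point index
// into a byte offset, then splice in the encoded rune.
func insertRuneUTF8(b []byte, i int, r rune) []byte {
	off := 0
	for ; i > 0 && off < len(b); i-- {
		_, size := utf8.DecodeRune(b[off:])
		off += size
	}
	var buf [utf8.UTFMax]byte
	n := utf8.EncodeRune(buf[:], r)
	out := make([]byte, 0, len(b)+n)
	out = append(out, b[:off]...)
	out = append(out, buf[:n]...)
	return append(out, b[off:]...)
}

func main() {
	fmt.Println(string(insertRune([]rune("hllo"), 1, '\u00e9')))      // inserts é at rune index 1
	fmt.Println(string(insertRuneUTF8([]byte("h\u00e9lo"), 2, 'l'))) // inserts l after é
}
```

Neither version knows anything about characters made of several code points, which is the objection raised later in the thread.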


@apparluk apparluk commented Nov 3, 2019

off topic, but I thought to mention Perl6 here

https://www.evanmiller.org/a-review-of-perl-6.html

cf: Strings and Regexes

caveat, see footnote 2

a contributor to Perl6

https://perlgeek.de/

also wrote this module

https://metacpan.org/pod/Perl6::Str


@apparluk apparluk commented Nov 3, 2019

the idea of using rope data structures in an editor intrigued me at one point

but I've never taken the time to look into it


@robpike robpike commented Nov 3, 2019

"Essentially if we want to work with mutable sets of Unicode characters, then neither the bytes solution nor the strings solution seems efficient"

And the runes solution is misleading and leads to incorrect thinking. Text is hard, and rune slices solve almost none of what makes text hard.


@apparluk apparluk commented Nov 4, 2019

on a side note

A Philosophy of Software Design

by J. Ousterhout

The book includes commentary on a student project of writing a text editor.


@rsc rsc commented Nov 6, 2019

Using runes in a text editor seems like a good idea at first, but it fails badly once you get to Unicode combining sequences, like e + combining acute accent vs. é. The former is two runes while the latter is one. And for some sequences there's not even a single-rune equivalent. In general, Unicode text processing requires considering largish sequences of input, not just a single byte and not just a single rune either. There's little benefit to []rune as the representation, and there are real drawbacks to having two representations. So Go has standardized on []byte/string and UTF-8.
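
The e + acute example can be checked directly. This is a minimal sketch; `describe` is a hypothetical helper that reports how many code points and how many bytes a string holds.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe (hypothetical) returns the number of code points and the
// number of UTF-8 bytes in s.
func describe(s string) (runes, bytes int) {
	return utf8.RuneCountInString(s), len(s)
}

func main() {
	precomposed := "\u00e9" // é as the single code point U+00E9
	combining := "e\u0301"  // e followed by U+0301 COMBINING ACUTE ACCENT

	fmt.Println(describe(precomposed))    // 1 2
	fmt.Println(describe(combining))      // 2 3
	fmt.Println(precomposed == combining) // false
}
```

Both strings render as the same character, yet they differ in rune count, byte count, and equality. Treating them as equal would require Unicode normalization (for example with golang.org/x/text/unicode/norm), which this sketch does not attempt.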

If you find that []rune works really well for your editor somehow (maybe you ignore all the multirune characters), that's fine. A "runes" library forked from "bytes" could easily be maintained as a go get-able package outside the standard library.

Note that generics are not going to help here, because the encoding stored in the underlying data is different between []byte and []rune.
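
One way to read this point, sketched with the type parameters that Go gained later (Go 1.18; they did not exist when this thread was written): a generic function can be written once for both slice types, but it only compares elements, and the elements of a []byte are UTF-8 code units rather than code points. The helper `count` is hypothetical.

```go
package main

import "fmt"

// count is a hypothetical generic helper that counts the elements of s
// equal to x. It knows nothing about text encodings.
func count[T comparable](s []T, x T) int {
	n := 0
	for _, v := range s {
		if v == x {
			n++
		}
	}
	return n
}

func main() {
	const text = "caf\u00e9" // "café"; é is U+00E9, encoded in UTF-8 as 0xC3 0xA9

	// In a []rune, é is one element with value 0xE9.
	fmt.Println(count([]rune(text), '\u00e9')) // 1

	// In a []byte, no single element has value 0xE9.
	fmt.Println(count([]byte(text), 0xE9)) // 0
}
```

The same generic code gives different answers because the two slices store the text in different encodings, so a bytes-like API for text still needs encoding-aware logic.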

This is a likely decline. Leaving open for a week for final comments.

@rsc rsc changed the title from "proposal: create a package `runes` with functionality similar to `bytes` to work with rune slices" to "proposal: runes: create new package analogous to bytes, for rune slices" Nov 6, 2019

@apparluk apparluk commented Nov 7, 2019

Hopefully my comment won't be interpreted as cultural bias.

I'm opposed to this for linguistic reasons.

Rune is used in Plan 9, and also appears in Golang.

The suggested use diverges excessively from the original North Germanic languages' use of the word.

D. Mendeleev used एक (eka) and द्वि (dvi) for certain postulated elements.

экаалюминій, экаборъ, экасилицій (eka-aluminium, eka-boron, eka-silicon)
двимарганец (dvi-manganese)


@rsc rsc commented Nov 13, 2019

There have been no comments objecting to declining this issue. Declined.

@rsc rsc closed this Nov 13, 2019