is_utf8 helper #1

hadley · 2016-06-14T12:44:12Z

We need a helper function that checks whether every element of a character vector is UTF-8.

Not sure how to distinguish this from a class-based test for utf8.

krlmlr · 2016-06-14T12:54:27Z

This will need C code to distinguish between "unknown" encoding and truly "all ASCII", basically an enhanced do_encoding() that also checks IS_ASCII(). Eventually this could be added upstream.

hadley · 2016-06-14T12:56:23Z

What's the difference between unknown and ascii? I thought unknown implied ascii.

krlmlr · 2016-06-14T13:02:55Z

Need to double-check.

krlmlr · 2016-06-14T15:02:41Z

There's a dedicated ASCII bit that is set if the code of all characters is 127 or less.

hadley · 2016-06-14T18:18:13Z

But that never gets used in Encoding(): https://github.com/wch/r-source/blob/2f2e4711ad7089f97f22c6b1ae25ba582d2e99a6/src/main/util.c#L1110

hadley · 2016-06-14T19:14:11Z

Also, none of the internal string representation stuff is in the exported API, so I think that means doing checks in R with Encoding() will be the easiest way forward.

krlmlr · 2016-06-14T20:45:43Z

ASCII implies unknown, but not the other way round. It will be difficult to detect pure ASCII strings using Encoding() only.
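To make the distinction concrete, here is a small R sketch (an illustration, not code from this thread) showing that `Encoding()` reports `"unknown"` both for a pure ASCII string and for a non-ASCII string that carries no encoding mark. The variable names are made up for the example, and the unmarked string is built from raw Latin-1 bytes with `rawToChar()`.

```r
# Pure ASCII: every byte is <= 127, so R never marks an encoding.
ascii <- "abc"

# A \u escape yields a string that is marked as UTF-8.
utf8 <- "caf\u00e9"

# Latin-1 bytes for "café" assembled from raw values; the result
# carries no encoding mark (assumption: illustrative input only).
unmarked <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))

Encoding(c(ascii, utf8, unmarked))
# "unknown" "UTF-8" "unknown"
```

The first and third elements both come back as `"unknown"`, so `Encoding()` alone cannot tell a pure ASCII string from an unmarked non-ASCII one.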

hadley · 2016-06-14T20:58:10Z

Yes, but it'll be accurate >95% of the time, I'd imagine.

- New `encoding()`, returns `"ASCII"` for pure ASCII strings and behaves identically to `base::Encoding()` otherwise.
- New `all_utf8()`, returns an atomic logical that indicates if all elements of a character vector are UTF-8 encoded; this includes pure ASCII strings (#1).
- Remove `Encoding<-` override, with documentation (#7).
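The behaviour described for the two new helpers could be approximated in plain R roughly as follows. This is only a sketch under stated assumptions, not the package's actual implementation: `is_ascii_sketch()`, `encoding_sketch()`, and `all_utf8_sketch()` are hypothetical names, and ASCII detection is done by inspecting bytes with `charToRaw()` rather than by reading R's internal ASCII bit.

```r
# Hypothetical helper: TRUE for each element whose bytes are all <= 127.
# NA handling is left out of this sketch.
is_ascii_sketch <- function(x) {
  stopifnot(is.character(x))
  vapply(
    x,
    function(s) all(as.integer(charToRaw(s)) <= 127L),
    logical(1),
    USE.NAMES = FALSE
  )
}

# Like base::Encoding(), but reports "ASCII" for pure ASCII elements.
encoding_sketch <- function(x) {
  enc <- Encoding(x)
  enc[is_ascii_sketch(x)] <- "ASCII"
  enc
}

# A single TRUE/FALSE: is every element either pure ASCII or marked UTF-8?
all_utf8_sketch <- function(x) {
  all(encoding_sketch(x) %in% c("ASCII", "UTF-8"))
}

x <- c("abc", "caf\u00e9")
encoding_sketch(x)
all_utf8_sketch(x)
```

Checking bytes in R is slower than a C-level test of the ASCII bit, but it relies only on exported base functions.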

github-actions · 2021-03-28T02:28:11Z

This old thread has been automatically locked. If you think you have found something related to this, please open a new issue and link to this old issue if necessary.

krlmlr added a commit that referenced this issue Aug 8, 2016

Merge branch 'master' into f-#1-all-utf8

9810137

krlmlr closed this as completed in cdeaa90 Aug 8, 2016

github-actions bot locked and limited conversation to collaborators Mar 28, 2021
