This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

is_utf8 helper #1

Closed
hadley opened this issue Jun 14, 2016 · 9 comments

Comments

@hadley

hadley commented Jun 14, 2016

We need some helper function that checks if a character vector is all utf-8.

Not sure how to distinguish this from a class based test for utf8.

@krlmlr
Owner

krlmlr commented Jun 14, 2016

This will need C code to distinguish between "unknown" encoding and truly "all ASCII", basically an enhanced do_encoding() that also checks IS_ASCII(). Eventually this could be added upstream.

@hadley
Author

hadley commented Jun 14, 2016

What's the difference between unknown and ascii? I thought unknown implied ascii.

@krlmlr
Owner

krlmlr commented Jun 14, 2016

Need to double-check.

@krlmlr
Owner

krlmlr commented Jun 14, 2016

There's a dedicated ASCII bit that is set if the code of every character is 127 or less.

@hadley
Author

hadley commented Jun 14, 2016

@hadley
Author

hadley commented Jun 14, 2016

Also, none of the internal string representation stuff is in the exported API, so I think that means doing checks in R with Encoding() will be the easiest way forward.

@krlmlr
Owner

krlmlr commented Jun 14, 2016

ASCII implies unknown, but not the other way round. It will be difficult to detect pure ASCII strings using Encoding() only.
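
A small base R illustration of that asymmetry. The example strings are assumptions chosen for the demonstration, not taken from the thread: `rawToChar()` builds a non-ASCII string with no declared encoding, and `intToUtf8()` builds one that is marked as UTF-8.

```r
ascii    <- "abc"                      # pure ASCII
unmarked <- rawToChar(as.raw(0xe4))    # a single non-ASCII byte, no encoding declared
utf8     <- intToUtf8(228L)            # U+00E4, marked as UTF-8

Encoding(c(ascii, unmarked, utf8))
# "unknown" "unknown" "UTF-8"

# Declaring an encoding on a pure ASCII string has no effect.
marked <- ascii
Encoding(marked) <- "UTF-8"
Encoding(marked)
# "unknown"
```

Both the ASCII string and the unmarked byte report `"unknown"`, so `Encoding()` alone cannot tell them apart.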

@hadley
Author

hadley commented Jun 14, 2016

Yes, but it'll be accurate >95% of the time, I'd imagine.
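
One possible reading of the `Encoding()`-only approach is to treat `"unknown"` as if it meant ASCII. A minimal sketch of that heuristic, assuming this reading is what was intended; `all_utf8_guess()` is a hypothetical name, not a function from the package:

```r
# Hypothetical heuristic: trust the encoding mark and assume "unknown" means ASCII.
all_utf8_guess <- function(x) {
  all(Encoding(x) %in% c("UTF-8", "unknown"))
}

# Pure ASCII plus a string marked as UTF-8: accepted, as expected.
ok <- all_utf8_guess(c("abc", intToUtf8(228L)))

# A lone 0xE4 byte is not valid UTF-8, but it carries no encoding mark,
# so the heuristic accepts it as well.
wrong <- all_utf8_guess(rawToChar(as.raw(0xe4)))

c(ok, wrong)
# TRUE TRUE
```

The second case is the kind of input the heuristic misjudges: non-ASCII bytes that were never given an encoding mark.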

@krlmlr krlmlr closed this as completed in cdeaa90 Aug 8, 2016
krlmlr added a commit that referenced this issue Aug 8, 2016
- New `encoding()`, returns `"ASCII"` for pure ASCII strings and behaves identically to `base::Encoding()` otherwise.
- New `all_utf8()`, returns an atomic logical that indicates if all elements of a character vector are UTF-8 encoded; this includes pure ASCII strings (#1).
- Remove `Encoding<-` override, with documentation (#7).
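
The behaviour described in these notes can be approximated in plain R. This is a sketch under assumptions, not the package's implementation: the function names ending in `_sketch` are hypothetical, ASCII is detected by inspecting bytes with `charToRaw()` instead of reading R's internal ASCII bit, and treating `NA` as non-ASCII is an arbitrary choice made here.

```r
# Hypothetical pure-R approximation; names are illustrative only.
is_ascii_sketch <- function(x) {
  vapply(
    x,
    # NA is treated as non-ASCII (an arbitrary choice for this sketch).
    function(s) !is.na(s) && all(as.integer(charToRaw(s)) <= 127L),
    logical(1),
    USE.NAMES = FALSE
  )
}

# Like base::Encoding(), but reports "ASCII" for pure ASCII strings.
encoding_sketch <- function(x) {
  enc <- Encoding(x)
  enc[is_ascii_sketch(x)] <- "ASCII"
  enc
}

# TRUE if every element is either pure ASCII or marked as UTF-8.
all_utf8_sketch <- function(x) {
  all(encoding_sketch(x) %in% c("ASCII", "UTF-8"))
}

ascii    <- "abc"
utf8     <- intToUtf8(228L)            # U+00E4, marked as UTF-8
unmarked <- rawToChar(as.raw(0xe4))    # non-ASCII byte, encoding unknown

encoding_sketch(c(ascii, utf8, unmarked))  # "ASCII" "UTF-8" "unknown"
all_utf8_sketch(c(ascii, utf8))            # TRUE
all_utf8_sketch(c(ascii, unmarked))        # FALSE
```
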
@github-actions

This old thread has been automatically locked. If you think you have found something related to this, please open a new issue and link to this old issue if necessary.

@github-actions github-actions bot locked and limited conversation to collaborators Mar 28, 2021