GitHub - matgat/unicode_text: A small c++ library for unicode encoding and decoding

unicode_text

A single-header C++ library providing Unicode encoding and decoding facilities.

basic example (godbolt)

#include <iostream>
#include <string_view>
using namespace std::literals;
#include "unicode_text.hpp" // utxt::*
int main()
{
    const std::u32string_view utf32 = U"🍕🍞🧀"sv;
    const std::string utf8 = utxt::to_utf8(utf32);
    std::cout << utf8;
}

Features

Never throws: decoding errors are handled by returning or inserting the replacement codepoint � ('\uFFFD', utxt::codepoint::invalid)
Requires C++23 (for std::unreachable)

Encodings enumeration

| Define | Description |
|--------|-------------|
| ~~`ANSI`~~ | ~~8-bit encodings~~ |
| `UTF8` | utf-8 |
| `UTF16LE` | utf-16 little-endian |
| `UTF16BE` | utf-16 big-endian |
| `UTF32LE` | utf-32 little-endian |
| `UTF32BE` | utf-32 big-endian |

Note

No support for 8-bit encodings and related codepages: it's time to drop them.
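To make the little-endian/big-endian distinction concrete, here is a minimal standalone sketch of how one codepoint could be laid out as utf-16 bytes in either byte order. The function `utf16_bytes_sketch` is a hypothetical helper written for this illustration; it is not part of the library and does not claim to mirror its implementation.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of utxt): encodes one codepoint as utf-16,
// writing each 16-bit code unit in the requested byte order.
inline std::string utf16_bytes_sketch(const char32_t cp, const bool big_endian)
{
    // Precondition of this sketch: a valid Unicode scalar value
    assert(cp <= 0x10FFFFu && !(cp >= 0xD800u && cp <= 0xDFFFu));

    std::string out;
    const auto put_unit = [&](const unsigned unit)
    {
        const char hi = static_cast<char>((unit >> 8) & 0xFFu);
        const char lo = static_cast<char>(unit & 0xFFu);
        if (big_endian) { out += hi; out += lo; }
        else            { out += lo; out += hi; }
    };

    if (cp < 0x10000u)
    {   // Basic Multilingual Plane: a single code unit
        put_unit(static_cast<unsigned>(cp));
    }
    else
    {   // Supplementary planes: a high/low surrogate pair
        const unsigned v = static_cast<unsigned>(cp) - 0x10000u;
        put_unit(0xD800u + (v >> 10));
        put_unit(0xDC00u + (v & 0x3FFu));
    }
    return out;
}
```

With this sketch, U+1F525 (🔥) becomes the bytes `D8 3D DD 25` in big-endian order and `3D D8 25 DD` in little-endian order: the code units are the same, only the byte order inside each unit changes.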

API documentation

Decode bytes to utf-32 string

Converts bytes to a utf-32 string

to_utf32<INENC>(…)
to_utf32(…)

Input
- std::string_view encoded as INENC or std::u8string_view
Return value
- std::u32string

example

using enum utxt::Enc;
std::string_view in_bytes = "..."sv;
std::u32string u32str1 = utxt::to_utf32<UTF16LE>(in_bytes);
std::u32string u32str2 = utxt::to_utf32(u8"..."sv);

Encode utf-32 to utf-8

Encodes a char32_t string or codepoint to a utf-8 string

to_utf8(…)

Input
- std::u32string_view or char32_t
Return value
- std::string bytes encoded as utf-8 (avoiding std::u8string until better support in stdlib)

example

std::cout << utxt::to_utf8(U"..."sv)
          << utxt::to_utf8(U'a');

Encode utf-32 to bytes

Encodes a char32_t string or codepoint

encode_as<OUTENC>(…)

Inputs
- utxt::Enc OUTENC output encoding
- std::u32string_view or char32_t codepoints to encode
Return value
- std::string encoded bytes as OUTENC

example

using enum utxt::Enc;
std::string out_bytes = utxt::encode_as<UTF16BE>(U"..."sv);

In case OUTENC is not known at compile time, there's an alternate version of this function that chooses the correct template at runtime:

std::string out_bytes = utxt::encode_as(UTF16BE, U"..."sv);
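The runtime overload described above can be pictured as a switch that forwards to the matching template instantiation. The following is only a sketch of that dispatch pattern under simplifying assumptions: `SketchEnc` and `encode_bmp_sketch` are hypothetical names, the encoder handles only codepoints below U+10000, and nothing here is taken from the library's source.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical stand-in for an encoding enumeration (not utxt::Enc)
enum class SketchEnc { UTF16LE, UTF16BE };

// Compile-time version: the byte order is a template parameter.
// For brevity it only handles codepoints below U+10000 (no surrogate pairs).
template<SketchEnc ENC>
std::string encode_bmp_sketch(const std::u32string_view text)
{
    std::string out;
    out.reserve(2 * text.size());
    for (const char32_t cp : text)
    {
        assert(cp < 0x10000u); // precondition of this simplified sketch
        const char hi = static_cast<char>((cp >> 8) & 0xFFu);
        const char lo = static_cast<char>(cp & 0xFFu);
        if constexpr (ENC == SketchEnc::UTF16BE) { out += hi; out += lo; }
        else                                     { out += lo; out += hi; }
    }
    return out;
}

// Runtime version: a switch selects the matching template instantiation
inline std::string encode_bmp_sketch(const SketchEnc enc, const std::u32string_view text)
{
    switch (enc)
    {
        case SketchEnc::UTF16LE: return encode_bmp_sketch<SketchEnc::UTF16LE>(text);
        case SketchEnc::UTF16BE: return encode_bmp_sketch<SketchEnc::UTF16BE>(text);
    }
    return {}; // not reached for valid enumerators
}
```

The template version lets the compiler drop the untaken branch entirely, while the runtime version pays for one switch per call and is convenient when the encoding comes from user input or a configuration file.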

Re-encode bytes detecting input encoding

Re-encodes a string of bytes (detecting its encoding) to a given encoding

encode_as<OUTENC>(…)

Inputs
- utxt::Enc OUTENC output encoding
- std::string_view input bytes of unknown encoding
- utxt::flags_t optional flags: if flag::SKIP_BOM is specified, the output won't contain the byte order mark
Return value
- std::string output bytes encoded as OUTENC

example

using enum utxt::Enc;
std::string_view in_bytes = "..."sv;
std::string out_bytes = utxt::encode_as<UTF8>(in_bytes);
static_assert( utxt::encode_as<UTF16BE>(U'🔥') == "\xD8\x3D\xDD\x25"sv );

In case OUTENC is not known at compile time, there's an alternate version of this function that chooses the correct template at runtime:

std::string out_bytes = utxt::encode_as(UTF8, in_bytes);

Alternate functions that take a buffer and return a string_view are provided to skip the re-encoding in case the input and output encodings are the same:

example

using enum utxt::Enc;
std::string_view in_bytes = "..."sv;
std::string maybe_reencoded_buf;
std::string_view out_bytes = utxt::encode_if_necessary_as<UTF8>(in_bytes, maybe_reencoded_buf);

The corresponding runtime version:

std::string_view out_bytes = utxt::encode_if_necessary_as(UTF8, in_bytes, maybe_reencoded_buf);
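The buffer-plus-view pattern used by these functions can be sketched in isolation. The example below is a hypothetical illustration, not library code: `utf16_as_le_if_necessary_sketch` returns a view of the input when it is already utf-16 little-endian, and otherwise byte-swaps into the caller's buffer and returns a view of that buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical helper (not part of utxt): yields utf-16 little-endian bytes,
// copying into buf only when the input actually needs conversion.
inline std::string_view utf16_as_le_if_necessary_sketch(const bool input_is_big_endian,
                                                        const std::string_view in,
                                                        std::string& buf)
{
    assert(in.size() % 2 == 0); // this sketch expects whole 16-bit code units
    if (!input_is_big_endian) return in; // same encoding: no copy, buf untouched

    buf.assign(in.data(), in.size());
    for (std::size_t i = 0; i + 1 < buf.size(); i += 2)
    {
        std::swap(buf[i], buf[i + 1]); // swap the two bytes of each code unit
    }
    return buf; // view into the caller-owned buffer
}
```

With this design the returned view is only valid while both the input bytes and the buffer stay alive and unmodified, which is the price paid for avoiding the copy in the common case.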

Re-encode bytes

Re-encodes a string of bytes from one encoding to another

reencode<INENC,OUTENC>(…)

Inputs
- utxt::Enc INENC input encoding
- utxt::Enc OUTENC output encoding
- std::string_view input bytes encoded as INENC
Return value
- std::string output bytes encoded as OUTENC

example

using enum utxt::Enc;
std::string_view in_bytes = "..."sv;
std::string out_bytes = utxt::reencode<UTF8, UTF16LE>(in_bytes);

An alternate function that takes a buffer and returns a string_view is provided to skip the re-encoding in case INENC==OUTENC:

example

std::string_view in_bytes = "..."sv;
std::string maybe_reencoded_buf;
std::string_view out_bytes = utxt::reencode_if_necessary<INENC,OUTENC>(in_bytes, maybe_reencoded_buf);

Low level facilities

`bytes_buffer_t` class

A class that represents a byte stream interpreted with a given encoding.

example

using enum utxt::Enc;
utxt::bytes_buffer_t<UTF8> bytes_buf("..."sv);
while( bytes_buf.has_codepoint() )
   {
    const char32_t cp = bytes_buf.extract_codepoint();
   }
if( bytes_buf.has_bytes() )
   {// Truncated codepoint!
   }

Encoding Detection

Detects the encoding of raw bytes. It only checks the byte order mark; no heuristic analysis of the bytes is performed.

detect_encoding_of(…)

Input
- std::string_view raw bytes
Return value
- struct{ Enc enc; std::uint8_t bom_size; }

example

std::string to_utf8(const std::string_view bytes)
{
    const auto [bytes_enc, bom_size] = utxt::detect_encoding_of(bytes);
    switch(bytes_enc)
       {using enum utxt::Enc;
        case UTF16LE: return utxt::reencode<UTF16LE,UTF8>(bytes);
        case UTF16BE: return utxt::reencode<UTF16BE,UTF8>(bytes);
        case UTF32LE: return utxt::reencode<UTF32LE,UTF8>(bytes);
        case UTF32BE: return utxt::reencode<UTF32BE,UTF8>(bytes);
        default:      return std::string{bytes};
       }
}
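For readers unfamiliar with byte order marks, detection by mark alone can be sketched as a handful of prefix comparisons. The names `BomEnc`, `bom_info_sketch`, `detect_bom_sketch` and `without_bom_sketch` are hypothetical, and falling back to utf-8 when no mark is present is an assumption of this sketch rather than a documented behavior of the library.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>

// Hypothetical stand-ins, not the utxt types
enum class BomEnc : std::uint8_t { UTF8, UTF16LE, UTF16BE, UTF32LE, UTF32BE };
struct bom_info_sketch { BomEnc enc; std::uint8_t bom_size; };

inline bom_info_sketch detect_bom_sketch(const std::string_view bytes)
{
    using namespace std::string_view_literals;
    const auto starts_with = [bytes](const std::string_view bom)
    {
        return bytes.substr(0, bom.size()) == bom;
    };
    // utf-32 LE must be tested before utf-16 LE: its mark also begins with FF FE
    if (starts_with("\xFF\xFE\0\0"sv)) return {BomEnc::UTF32LE, 4};
    if (starts_with("\0\0\xFE\xFF"sv)) return {BomEnc::UTF32BE, 4};
    if (starts_with("\xEF\xBB\xBF"sv)) return {BomEnc::UTF8, 3};
    if (starts_with("\xFF\xFE"sv))     return {BomEnc::UTF16LE, 2};
    if (starts_with("\xFE\xFF"sv))     return {BomEnc::UTF16BE, 2};
    return {BomEnc::UTF8, 0}; // assumption of this sketch: no mark means utf-8
}

// Usage: skip the mark before handing the payload to a decoder
inline std::string_view without_bom_sketch(const std::string_view bytes)
{
    const bom_info_sketch info = detect_bom_sketch(bytes);
    assert(info.bom_size <= bytes.size());
    return bytes.substr(info.bom_size);
}
```

Note the inherent ambiguity: the bytes `FF FE 00 00` could also be a utf-16 little-endian mark followed by a NUL character, so a detector based only on the mark has to pick one interpretation.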

Decoding a single codepoint

Extracts a codepoint from a string of raw bytes interpreted with encoding Enc, updating the current position that points to the data.

extract_codepoint<Enc>(…)

Inputs
- std::string_view raw bytes encoded as Enc
- std::size_t& current position
Preconditions
- Assumes enough remaining bytes to extract the codepoint, undefined behavior otherwise
Return value
- char32_t extracted codepoint, codepoint::invalid in case of decoding errors

example

using enum utxt::Enc;
std::string_view bytes = "\xD8\x3D\xDD\x25"sv;
std::size_t pos = 0;
assert( (pos+sizeof(char32_t)) <= bytes.size() );
const char32_t cp = utxt::extract_codepoint<UTF16BE>(bytes, pos);
assert( cp == U'🔥' );
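The arithmetic behind the example above (two utf-16 code units combining into U+1F525) can be shown with a small standalone decoder. This is a sketch under stated assumptions: `extract_utf16be_sketch` and `replacement_sketch` are hypothetical names, and the error policy (return U+FFFD for lone or mismatched surrogates) is modeled on the behavior described in this document, not copied from the library.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

inline constexpr char32_t replacement_sketch = U'\uFFFD';

// Hypothetical helper (not utxt::extract_codepoint): decodes one codepoint
// from utf-16 big-endian bytes starting at pos, advancing pos past it.
inline char32_t extract_utf16be_sketch(const std::string_view bytes, std::size_t& pos)
{
    // Reads the big-endian 16-bit code unit at index i without advancing
    const auto unit_at = [bytes](const std::size_t i) -> char32_t
    {
        const auto hi = static_cast<unsigned char>(bytes[i]);
        const auto lo = static_cast<unsigned char>(bytes[i + 1]);
        return static_cast<char32_t>((hi << 8) | lo);
    };

    assert(pos + 2 <= bytes.size()); // precondition: at least one code unit left
    const char32_t first = unit_at(pos);
    pos += 2;

    if (first < 0xD800u || first > 0xDFFFu) return first;  // not a surrogate
    if (first >= 0xDC00u) return replacement_sketch;         // lone low surrogate
    if (pos + 2 > bytes.size()) return replacement_sketch;   // truncated pair

    const char32_t second = unit_at(pos);
    if (second < 0xDC00u || second > 0xDFFFu)
    {   // Not a low surrogate: leave it in place so the next call decodes it
        return replacement_sketch;
    }
    pos += 2;
    return 0x10000u + ((first - 0xD800u) << 10) + (second - 0xDC00u);
}
```

For the bytes `D8 3D DD 25` the high surrogate contributes `0x3D << 10` and the low surrogate contributes `0x125`, giving `0x10000 + 0xF400 + 0x125 = 0x1F525`.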

Encoding a single codepoint

Appends a codepoint to a given string of bytes using encoding Enc

append_codepoint<Enc>(…)

Input
- char32_t codepoint to encode
Output
- std::string& destination bytes encoded as Enc

example

using enum utxt::Enc;
std::string bytes = "a ";
char32_t codepoint = U'🔥';
utxt::append_codepoint<UTF8>(codepoint, bytes);
assert( bytes == "a \xF0\x9F\x94\xA5"sv );
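The four bytes in the assertion above follow directly from the utf-8 bit layout. As an illustration, a minimal standalone encoder might look like the following; `append_utf8_sketch` is a hypothetical function written for this explanation and is not the library's implementation.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not utxt::append_codepoint): appends the utf-8
// encoding of cp to out, using 1 to 4 bytes depending on its magnitude.
inline void append_utf8_sketch(const char32_t cp, std::string& out)
{
    assert(cp <= 0x10FFFFu); // precondition of this sketch: within Unicode range
    if (cp < 0x80u)
    {   // 0xxxxxxx
        out += static_cast<char>(cp);
    }
    else if (cp < 0x800u)
    {   // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0u | (cp >> 6));
        out += static_cast<char>(0x80u | (cp & 0x3Fu));
    }
    else if (cp < 0x10000u)
    {   // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0u | (cp >> 12));
        out += static_cast<char>(0x80u | ((cp >> 6) & 0x3Fu));
        out += static_cast<char>(0x80u | (cp & 0x3Fu));
    }
    else
    {   // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0u | (cp >> 18));
        out += static_cast<char>(0x80u | ((cp >> 12) & 0x3Fu));
        out += static_cast<char>(0x80u | ((cp >> 6) & 0x3Fu));
        out += static_cast<char>(0x80u | (cp & 0x3Fu));
    }
}
```

For U+1F525 the four groups of payload bits are `0x00`, `0x1F`, `0x14` and `0x25`, which combined with the prefixes give `F0 9F 94 A5`.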

Build

Build with at least -std=c++23 (/std:c++23 in case of msvc): the library relies on std::unreachable.

Testing

Run unit tests directly in godbolt or:

$ git clone https://github.com/matgat/unicode_text.git
$ cd unicode_text
$ curl -O https://raw.githubusercontent.com/boost-ext/ut/master/include/boost/ut.hpp
$ g++ -std=c++23 -Wall -Wextra -Wpedantic -Wconversion -Wsign-conversion -o test test.cpp && ./test

Note

On windows:

$ cl /std:c++latest /permissive- /utf-8 /W4 /WX /EHsc test.cpp
