GitHub - grandchild/base32k: binary-to-text encoding with a better encoding ratio in character-limited situations such as twitter

base32k

(You should really use base2048 instead, it's far superior. I had fun with this one, but I didn't know about base2048 at the time.)

base32k is a slightly whimsical binary-to-text encoding that transforms raw binary data into (possibly obscene combinations of) UTF-8-encoded characters from the CJK and Hangul Unicode blocks. Its alphabet consists of 2^15 = 32768 characters, hence the name. Compared to other encodings like base64 or base122, it does not save space in terms of bytes, but its output is shorter than theirs in terms of characters. This is only useful when, for some reason, the medium (*cough* twitter *cough*) is character-limited rather than byte-limited.
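To illustrate what "15 bits per character" means in practice, here is a minimal sketch of splitting a byte stream into 15-bit groups. It does not reproduce this project's actual alphabet or padding scheme; the function name `chunk15`, the most-significant-bit-first order, and the zero-padding of the last group are assumptions made for illustration only.

```go
package main

import "fmt"

// chunk15 splits data into 15-bit groups, most significant bit first.
// If the input length in bits is not a multiple of 15, the final group is
// zero-padded on the right. (Hypothetical helper, not the library's API.)
func chunk15(data []byte) []uint16 {
	var groups []uint16
	var acc uint32 // bit accumulator, holds at most 22 valid bits
	nbits := 0     // number of valid bits currently in acc
	for _, b := range data {
		acc = acc<<8 | uint32(b)
		nbits += 8
		for nbits >= 15 {
			nbits -= 15
			// Emit the top 15 bits, then keep only the leftover low bits.
			groups = append(groups, uint16(acc>>uint(nbits))&0x7FFF)
			acc &= (1 << uint(nbits)) - 1
		}
	}
	if nbits > 0 {
		// Left-align the leftover bits inside a final 15-bit group.
		groups = append(groups, uint16(acc<<uint(15-nbits))&0x7FFF)
	}
	return groups
}

func main() {
	// Each group is a value in [0, 32768) and could index a 2^15-entry alphabet.
	fmt.Println(chunk15([]byte("testing\n")))
}
```

Each resulting value would then select one character from the 32768-entry alphabet; how the real implementation maps values to code points is not shown here.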

Example

$ echo testing | base32k
整棦릥茻l

$ echo 整棦릥茻l | base32k -d
testing

Installation

Library

go get github.com/grandchild/base32k

Executable Binary

go get github.com/grandchild/base32k/base32k

Encoding Ratio

base32k encodes 15 bits per Unicode glyph. Since each glyph takes 3 bytes in UTF-8, this amounts to a byte ratio of 15/24 (0.625), plus one byte of padding in 14 out of 15 cases. In terms of bytes, this makes it a worse encoding than base64, which has a ratio of 3/4 (0.75), and base122, which has 7/8 (0.875).
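The byte ratios above can be checked with a few lines of arithmetic. The sketch below uses the idealized per-character figures quoted in this section and ignores padding; the function name `encodedBytes` and the 1050-byte payload are hypothetical, chosen so that all three encodings divide evenly.

```go
package main

import "fmt"

// encodedBytes returns the size in bytes of n input bytes after encoding
// with an alphabet that carries bitsPerChar bits per character, where each
// character occupies bytesPerChar bytes. Padding is ignored.
func encodedBytes(n, bitsPerChar, bytesPerChar int) int {
	chars := (n*8 + bitsPerChar - 1) / bitsPerChar // ceiling division
	return chars * bytesPerChar
}

func main() {
	const n = 1050 // hypothetical payload size in bytes
	for _, e := range []struct {
		name                      string
		bitsPerChar, bytesPerChar int
	}{
		{"base64", 6, 1},
		{"base122", 7, 1},
		{"base32k", 15, 3},
	} {
		out := encodedBytes(n, e.bitsPerChar, e.bytesPerChar)
		fmt.Printf("%-8s %d bytes, ratio %.3f\n", e.name, out, float64(n)/float64(out))
	}
}
```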

A tweet may contain up to 280 characters, but only 140 CJK glyphs. Still, this encoding slightly outperforms base64 and even base122 in the amount of data that fits in a single tweet:

encoding	space ratio (bytes)	bits per character	bytes per tweet
base64	0.75	6	210
base122	0.875	7	245
base32k	0.625	15	256

(More is better in all columns.)

base32k outperforms base122 on twitter because twitter counts a CJK or Hangul glyph as two characters, whereas in UTF-8 it actually occupies 3 bytes. This gives us, in effect, an encoding ratio of 15/16 (0.9375) compared to base122's 7/8 (0.875), a slight advantage.
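The effective ratios above follow from dividing the bits carried per glyph by the number of counted characters it costs, treating each counted character as worth 8 bits. A minimal sketch, where `effectiveRatio` is a hypothetical name and the weights (1 for base64 and base122 characters, 2 for CJK and Hangul glyphs) are taken from the description above:

```go
package main

import "fmt"

// effectiveRatio returns payload bits per 8 bits of character budget, for
// an encoding carrying bitsPerChar bits in a glyph that the medium counts
// as `weight` characters.
func effectiveRatio(bitsPerChar, weight int) float64 {
	return float64(bitsPerChar) / float64(weight*8)
}

func main() {
	fmt.Println("base64: ", effectiveRatio(6, 1))
	fmt.Println("base122:", effectiveRatio(7, 1))
	fmt.Println("base32k:", effectiveRatio(15, 2))
}
```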

So, given good-enough font coverage of the Unicode Basic Multilingual Plane, this can be used to transmit data in situations where the limit is on characters rather than on disk space.

Stability

This implementation will run out of memory when encoding or decoding very large chunks of data (several gigabytes). But since it is aimed at character-limited settings, this is not likely to be an issue.

