utf8strings

Extract strings of UTF-8 (four characters or longer) from binary blobs.

Usage

Compilation:
    make utf8strings

Usage:
    utf8strings [ filename ]

Examples:
    utf8strings /usr/sbin/bomb
    utf8strings /dev/mem | less
    somebinaryemittingprogram | utf8strings

Why?

"Binary" files often have text strings embedded in them, but the standard strings utility that comes with GNU binutils does not (yet) understand UTF-8. This is a serious problem because UTF-8 has become the de facto standard for text on UNIX systems and on the Internet.

How

UTF-8 is a beautiful design and includes the ability to self-synchronize. Each character in a UTF-8 string is encoded as a sequence of one to four bytes. By looking at the first two bits of a byte, one knows immediately whether the byte represents an ASCII character (00, 01), an initial byte in a multi-byte sequence (11), or a continuation byte (10). That means there is never any confusion about possibly overlapping UTF-8 interpretations.
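The two-bit test described above can be sketched in C as follows. This is only an illustration: the names `byte_kind`, `classify_byte`, and `next_boundary` are hypothetical and are not taken from utf8strings.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration; not from utf8strings.c. */
enum byte_kind { BYTE_ASCII, BYTE_CONTINUATION, BYTE_LEAD };

/* Classify a byte by its two most significant bits. */
enum byte_kind classify_byte(uint8_t b)
{
    switch (b >> 6) {
    case 0:  /* 00xxxxxx */
    case 1:  /* 01xxxxxx */
        return BYTE_ASCII;
    case 2:  /* 10xxxxxx */
        return BYTE_CONTINUATION;
    default: /* 11xxxxxx: a lead byte candidate; still needs validation */
        return BYTE_LEAD;
    }
}

/*
 * Self-synchronization: starting at any offset, skip continuation bytes
 * until the next byte that can start a character (or the end of the buffer).
 */
size_t next_boundary(const uint8_t *buf, size_t len, size_t pos)
{
    assert(pos <= len);
    while (pos < len && classify_byte(buf[pos]) == BYTE_CONTINUATION)
        pos++;
    return pos;
}
```

Note that a byte classified as a lead byte here may still be invalid (for example C0, C1, or F5 to FF), so a real scanner has to check further before accepting it.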

Initial release

This was designed to be simple and correct. It is implemented in bog-standard C. No thought has been put into optimization yet. It correctly identifies valid UTF-8 sequences and rejects non-UTF-8. It shows strings with a minimum length of four characters (not bytes). It reads from stdin, or from a single file if a filename is specified.
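To illustrate the difference between counting characters and counting bytes, here is a minimal sketch. It assumes the buffer has already been validated as UTF-8; the names `utf8_char_count`, `long_enough`, and `MIN_CHARS` are made up for this example and may not match the actual source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MIN_CHARS 4 /* the hard-coded minimum length, in characters */

/*
 * Count characters in a buffer that is assumed to be valid UTF-8.
 * Every character contains exactly one byte that is not a continuation
 * byte (10xxxxxx), so counting those bytes counts characters.
 */
size_t utf8_char_count(const uint8_t *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if ((buf[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}

/* Nonzero if the candidate string is long enough to be printed. */
int long_enough(const uint8_t *buf, size_t len)
{
    return utf8_char_count(buf, len) >= MIN_CHARS;
}
```

Under this rule a nine-byte string of three CJK characters is too short, while the five-byte ASCII string "hello" qualifies.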

It works for my purposes and probably will be fine for you as well.

Deficiencies

Hardcoded to strings of minlength 4.
Could be a lot faster with some simple optimizations.
Does not handle any options.
Should be merged with strings from GNU binutils.

Future

I've licensed this code under the same license as GNU binutils in the hope that it will be useful to the GNU folks as they improve the official version of strings to support UTF-8.

Implementation Notes

A. INVALID UTF-8 SEQUENCES are correctly discarded:

For example,

Non-ASCII bytes that don't begin with one of UTF-8's magic bit patterns (10*, 110*, 1110*, or 11110*).
A byte with the correct magic bits, but all 0s for data. (E.g., 11110000).
Incorrect usage of continuation bytes (10*):
1. After 110*, there must be one continuation byte.
2. After 1110*, there must be two continuation bytes.
3. After 11110*, there must be three continuation bytes.
4. Continuation bytes (10*) not preceded by one of the above are invalid.
Bytes C0 and C1. (They would encode ASCII as two bytes).
U+D800 to U+DFFF are reserved for UTF-16's surrogate halves.
Leading byte of F4 when the code point is beyond Unicode's limit (> 0x10FFFF).
Leading byte of F5 to FD. (Codepoint is greater than 0x10FFFF).
Leading byte of FE or FF. (Undefined in UTF-8 to allow for UTF-16 BOM).
Code points U+80 to U+9F are skipped as control characters.
End of file before a complete character is read.
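Most of the rules above can be combined into a single decode-and-validate step. The sketch below is one way to do it, not the code in utf8strings.c: the function name `utf8_decode` and its return convention are assumptions, it uses the standard overlong-encoding check (minimum code point per sequence length) rather than the "all 0s for data" test listed above, and it accepts every ASCII byte without filtering ASCII control characters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Try to decode one character starting at buf[0].
 * Returns the number of bytes consumed (1 to 4) and stores the code point
 * in *cp, or returns 0 if the sequence is invalid, truncated, or encodes
 * a control character in U+0080..U+009F.
 * Hypothetical helper; name and interface are assumptions.
 */
size_t utf8_decode(const uint8_t *buf, size_t len, uint32_t *cp)
{
    /* Smallest code point that legitimately needs N continuation bytes. */
    static const uint32_t min_cp[] = { 0, 0x80, 0x800, 0x10000 };
    size_t need; /* number of continuation bytes expected */
    uint32_t c;

    assert(cp != NULL);
    if (len == 0)
        return 0;

    uint8_t b = buf[0];
    if (b < 0x80) {                       /* 0xxxxxxx: ASCII */
        *cp = b;
        return 1;
    } else if (b >= 0xC2 && b <= 0xDF) {  /* 110xxxxx, excluding C0 and C1 */
        need = 1; c = b & 0x1F;
    } else if (b >= 0xE0 && b <= 0xEF) {  /* 1110xxxx */
        need = 2; c = b & 0x0F;
    } else if (b >= 0xF0 && b <= 0xF4) {  /* 11110xxx, excluding F5..F7 */
        need = 3; c = b & 0x07;
    } else {
        return 0; /* stray continuation byte, C0, C1, or F5..FF */
    }

    if (len < need + 1)
        return 0; /* input ended before the character was complete */

    for (size_t i = 1; i <= need; i++) {
        if ((buf[i] & 0xC0) != 0x80)
            return 0; /* expected a continuation byte (10xxxxxx) */
        c = (c << 6) | (buf[i] & 0x3F);
    }

    if (c < min_cp[need])
        return 0; /* overlong encoding */
    if (c >= 0xD800 && c <= 0xDFFF)
        return 0; /* UTF-16 surrogate halves */
    if (c > 0x10FFFF)
        return 0; /* beyond Unicode's limit */
    if (c >= 0x80 && c <= 0x9F)
        return 0; /* C1 control characters */

    *cp = c;
    return need + 1;
}
```

A caller would typically advance by the returned length on success and by one byte on failure, ending the current candidate string whenever a failure occurs.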

B. MAYBE IT COULD BE BETTER.

Some valid UTF-8 sequences encode code points that are unassigned in Unicode and shouldn't be printed. Similarly, for a strings program like this, we would want to check Unicode's character property tables so we can ignore non-printable characters. Those features have been left out intentionally, as they would be much more complex and would require updating with every new release of the Unicode standard.

C. SOME TESTS:

1a. Values beyond Unicode (>= 0x110000) should NOT be shown:

   echo -n $'XX\xf4\x90\x80\x80XX' | ./utf8strings  | hd

1b. Characters <= 0x10FFFF should show something:

   echo -n $'XX\xf4\x8f\xbf\xbfXX' | ./utf8strings  | hd

2a. UTF-16 surrogate halves should NOT be shown:

   echo -n $'XX\xED\xA0\x80XX' | ./utf8strings | hd

2b. Characters from U+D000 to U+D7FF should be shown:

   echo -n $'XX\xED\x9F\xBFXX' | ./utf8strings | hd

3a. C1 control characters (U+0080 to U+009F) should NOT be shown:

   echo $'XX\xC2\x80XX'  | ./utf8strings | hd

3b. Characters >= 0xA0 should be shown:

   echo $'XX\xC2\xA0XX'  | ./utf8strings | hd
