Extract strings of UTF-8 (four characters or longer) from binary blobs.
Compilation:
make utf8strings
Usage:
utf8strings [ filename ]
Examples:
utf8strings /usr/sbin/bomb
utf8strings /dev/mem | less
somebinaryemittingprogram | utf8strings
"Binary" files often have text strings embedded in them, but the
standard strings
utility that comes with GNU
binutils does not (yet)
understand UTF-8. This is a serious problem because UTF-8 has become
the defacto standard for text in UNIX systems and on the Internet.
UTF-8 is a beautiful design and includes the ability to self synchronize. Each character in a UTF-8 string is made up of a sequence of up to four bytes. By looking at the first two bits of a byte, one knows immediately if the byte represents an ASCII character (00, 01), an initial byte in a sequence (11), or a continuation byte (10). That means that there is never any confusion about possibly overlapping UTF-8 interpretations.
This was designed to be simple and correct. It was implemented in bog-standard C. No thought was put in to optimization, yet. It correctly identifies valid UTF-8 sequences and rejects non-UTF-8. It shows strings with a minimum length of four characters (not bytes). Works on stdin or a single filename may be specified.
It works for my purposes and probably will be fine for you as well.
- Hardcoded to strings of minlength 4.
- Could be a lot faster with some simple optimizations.
- Does not handle any options.
- Should be merged with
strings
from GNU binutils.
I've licensed this code under the same license as GNU binutils in the
hope that it will be useful to the GNU folks as they improve the
official version of strings
to support UTF-8.
For example,
- Bytes that don't begin with UTF's magic (10*, 110*, 1110*, or 11110*).
- A byte with the correct magic bits, but all 0s for data. (E.g., 11110000).
- Incorrect usage of continuation bytes (10*)
- After 110*, there must be one continuation byte.
- After 1110*, there must be two continuation bytes.
- After 11110*, there must be three continuation bytes.
- Continuation bytes (10*) not preceeded by one of the above are invalid.
- Bytes C0 and C1. (They would encode ASCII as two bytes).
- U+D800 to U+DFFF are reserved for UTF-16's surrogate halves.
- Leading byte of F4 and codepoint is beyond Unicode's limit. (>0x10FFFF)
- Leading byte of F5 to FD. (Codepoint is greater than 0x10FFFF).
- Leading byte of FE or FF. (Undefined in UTF-8 to allow for UTF-16 BOM).
- Code points U+80 to U+9F are skipped as control characters.
- End of file before a complete character is read.
Some valid UTF-8 sequences are actually undefined code points in
Unicode and shouldn't be printed. Similarly, for a strings
program like this, we would want to check Unicode's syntactic
tables so we can ignore non-printable characters. Those features
have been left out intentionally as they would be much more complex
and require updating with every new release of the Unicode
standard.
1a. Values beyond Unicode (>= 0x110000) should NOT be shown:
echo -n $'XX\xf4\x90\x80\x80XX' | ./utf8strings | hd
1b. Characters <= 0x10FFFF should show something:
echo -n $'XX\xf4\x8f\xbf\xbfXX' | ./utf8strings | hd
2a. UTF-16 surrogate halves should NOT be shown:
echo -n $'XX\xED\xA0\x80XX' | ./utf8strings | hd
2b. Characters between U+D000 to U+D7FF should be shown:
echo -n $'XX\xED\x9F\xBFXX' | ./utf8strings | hd
3a. UTF-8 Control characters 0x80 to 0x9F should NOT be shown:
echo $'XX\xC2\x80XX' | ./utf8strings | hd
3b. Characters >= 0xA0 should be shown:
echo $XX'\xC2\xA0XX' | ./utf8strings | hd