What you see is not what's there. A hidden carriage return overwrites the screen but the bytes still execute.
bytesight scans files at the byte level for invisible, misleading, or deceptive Unicode content. If it says clean, every byte is tab, newline, or printable ASCII.
Built against Unicode 17.0.0 (https://www.unicode.org/versions/Unicode17.0.0/)
bytesight src/lib.rs src/main.rs
bytesight -r src/
wl-paste -n | bytesight -

Coming soon
See dev-info.md
grep -Pna '[^\x09\x0a\x20-\x7e]'

git clone https://github.com/mrname5/bytesight.git
cd bytesight
cargo build --release

Sha256sum: Install:
    --windows        Allow end-of-line \r\n (Windows line endings)
    --tab-width N    Set tab expansion width, 1-16 (default: 8)
    --wide-line N    Set wide line threshold in columns, 1-10000 (default: 500)
-q, --quiet          Suppress output; exit code only (0=clean, 1=issues, 2=error)
-r, --recursive      Recursively scan directories
-V, --version        Print version and exit
-h, --help           Show help
0    All files clean
1    Issues found
2    Error (cannot read file, bad arguments, etc.)
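
The exit codes make bytesight easy to call from other programs. The following is a minimal sketch, assuming a `bytesight` binary is on the PATH; the helper names `describe_exit` and `scan_quietly` are hypothetical and not part of bytesight itself. Only the documented `-q` flag and the three exit codes above are used.

```rust
use std::process::Command;

/// Maps the documented exit codes to a short label.
/// Any other value (or termination by signal) is reported as "unexpected".
fn describe_exit(code: Option<i32>) -> &'static str {
    match code {
        Some(0) => "clean",
        Some(1) => "issues found",
        Some(2) => "error",
        _ => "unexpected",
    }
}

/// Runs `bytesight -q <path>` and interprets the exit status.
/// Assumption: the binary is installed and named `bytesight`.
fn scan_quietly(path: &str) -> std::io::Result<&'static str> {
    let status = Command::new("bytesight").arg("-q").arg(path).status()?;
    Ok(describe_exit(status.code()))
}
```

A caller such as a pre-commit hook could invoke `scan_quietly("src/main.rs")` and refuse to continue on anything other than "clean".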
bytesight flags anything outside the printable ASCII set plus tab and newline. Specific categories get specific warnings:
- Invisible characters (zero-width spaces, bidi overrides, soft hyphens)
- Trojan Source attack vectors (CVE-2021-42574, CVE-2021-42694)
- Fake spaces (NBSP, em space, ideographic space, 13 others)
- Homoglyphs (Cyrillic, Greek, fullwidth ASCII, math symbols)
- Dangerous control characters (NUL, ESC, backspace)
- Mid-line carriage returns (hide preceding text in the terminal)
- C1 terminal control codes (U+0080-U+009F)
- Combining marks and variation selectors
- Invalid UTF-8 byte sequences
- Unicode noncharacters
- Content pushed past the visible edge of the editor (hidden when text wrapping is off)
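
The baseline rule (tab, newline, or printable ASCII) can be sketched in a few lines of Rust. This is an illustration of the rule as stated above, not bytesight's actual source; the function names `is_plain_byte` and `find_suspicious_bytes` are made up for this example, and the category-specific warnings are not modelled.

```rust
/// Returns true if `b` is tab (0x09), newline (0x0A), or printable ASCII (0x20..=0x7E).
fn is_plain_byte(b: u8) -> bool {
    matches!(b, b'\t' | b'\n' | 0x20..=0x7e)
}

/// Returns the (offset, byte) pair of every byte outside the plain set.
/// An empty result corresponds to a "clean" verdict under the baseline rule.
fn find_suspicious_bytes(data: &[u8]) -> Vec<(usize, u8)> {
    data.iter()
        .copied()
        .enumerate()
        .filter(|&(_, b)| !is_plain_byte(b))
        .collect()
}
```

Working on raw bytes rather than decoded characters means a file does not need to be valid UTF-8 to be checked: each byte of a multi-byte sequence simply falls outside the plain set.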
Files in the demo directory demonstrate real attack vectors that bytesight detects. All files are demonstrations only -- no actual malicious payloads.
The file contains a hidden command. When displayed in a terminal with
cat, the carriage return (0x0D) moves the cursor back to the start
of the line. Everything before the CR is overwritten on screen by
everything after it. The hidden command still executes.
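
To make the effect concrete, here is a rough sketch under two assumptions: a carriage return counts as "mid-line" when it is not immediately followed by a newline, and the terminal is modelled very simply (a CR moves the cursor to column 0 and later characters overwrite earlier ones). Real terminals are more complicated, and both function names are hypothetical, not taken from bytesight.

```rust
/// Byte offsets of every CR (0x0D) that is not immediately followed by LF (0x0A).
/// Assumption: a CR at the very end of the data also counts as mid-line.
fn mid_line_cr_offsets(data: &[u8]) -> Vec<usize> {
    data.iter()
        .enumerate()
        .filter(|&(i, &b)| b == b'\r' && data.get(i + 1) != Some(&b'\n'))
        .map(|(i, _)| i)
        .collect()
}

/// Simplified model of what a terminal shows for a single line:
/// CR resets the cursor to column 0, other characters overwrite in place.
fn rendered_line(line: &str) -> String {
    let mut screen: Vec<char> = Vec::new();
    let mut col = 0;
    for ch in line.chars() {
        if ch == '\r' {
            col = 0;
        } else {
            if col < screen.len() {
                screen[col] = ch;
            } else {
                screen.push(ch);
            }
            col += 1;
        }
    }
    screen.into_iter().collect()
}
```

For an illustrative line such as `"touch /tmp/pwned #\recho 'hello world!'"`, `rendered_line` yields only the `echo` text, while `mid_line_cr_offsets` reports the CR at offset 18.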
The file contains two variables that look identical: admin (Latin)
and a second one starting with Cyrillic 'a' (U+0430). To a reviewer
they are the same word. To the compiler they are different variables.
The function returns the wrong one.
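
The difference is invisible on screen but obvious once code points are printed. A small sketch, with the hypothetical helper `non_ascii_code_points` (not part of bytesight), that lists every non-ASCII character in an identifier together with its byte offset:

```rust
/// Returns (byte offset, "U+XXXX") for every non-ASCII character in `ident`.
fn non_ascii_code_points(ident: &str) -> Vec<(usize, String)> {
    ident
        .char_indices()
        .filter(|(_, ch)| !ch.is_ascii())
        .map(|(i, ch)| (i, format!("U+{:04X}", ch as u32)))
        .collect()
}
```

Calling it on `"admin"` returns an empty list, while `"\u{430}dmin"` reports `U+0430` at offset 0, which is why the two names compare as unequal strings even though they render alike.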
The file contains two functions with visually identical names. One is
validateInput and the other contains a zero-width space (U+200B)
making it validate[invisible]Input. A reviewer sees one function.
The code calls the malicious copy.
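
The same idea applies to zero-width characters. The sketch below checks a name against a small, illustrative set of zero-width code points; the set and the helper name `zero_width_offsets` are assumptions for this example and are not bytesight's actual list.

```rust
/// A small illustrative subset of zero-width characters:
/// ZERO WIDTH SPACE, ZERO WIDTH NON-JOINER, ZERO WIDTH JOINER,
/// WORD JOINER, and ZERO WIDTH NO-BREAK SPACE (BOM).
const ZERO_WIDTH: [char; 5] = ['\u{200B}', '\u{200C}', '\u{200D}', '\u{2060}', '\u{FEFF}'];

/// Byte offsets of every zero-width character (from the set above) in `name`.
fn zero_width_offsets(name: &str) -> Vec<usize> {
    name.char_indices()
        .filter(|(_, ch)| ZERO_WIDTH.contains(ch))
        .map(|(i, _)| i)
        .collect()
}
```

For `"validate\u{200B}Input"` this reports offset 8, right after `validate`, whereas the genuine `"validateInput"` yields nothing.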
A normal Rust file with no hidden content. bytesight reports it clean.
String.fromCodePoint(97)
"a".codePointAt(0)

Claude generated most of the code, with extensive input and guidance from the user.
ChatGPT and Google Gemini were used to verify it.
A human has reviewed the whole codebase in general terms. The specifics of the ASCII and Unicode ranges have not been fully verified yet, only generally reviewed. However, the code logic, argument handling, and general behaviour have been verified by human review.
See LICENSE file.
ASCII:
- ANSI X3.4-1986 (the original American standard)
- ECMA-6 (international equivalent): https://www.ecma-international.org/publications-and-standards/standards/ecma-6/
Unicode:
- The Unicode Standard, full text: https://www.unicode.org/releases/
- Unicode Character Database (every codepoint's properties): https://www.unicode.org/Public/UCD/latest/ucd/
- UnicodeData.txt (the machine-readable master list): https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
- Code charts (visual PDF per block): https://www.unicode.org/charts/
- PropList.txt (properties like Noncharacter): https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt
UTF-8 encoding:
- RFC 3629: https://www.rfc-editor.org/rfc/rfc3629
C1 control characters:
- ECMA-48 (defines all 32 C1 codes): https://ecma-international.org/publications-and-standards/standards/ecma-48/
For verifying combining marks specifically:
- UnicodeData.txt, column 3 (General_Category): the values Mn (nonspacing mark),
  Mc (spacing mark), and Me (enclosing mark) are the combining marks:
  https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
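
For anyone who wants to do that verification mechanically, a minimal sketch follows. It assumes the semicolon-separated layout of UnicodeData.txt, where the first field is the code point in hex and the third is General_Category. The function name `combining_mark_code_point` is hypothetical, and reading the file itself is left out.

```rust
/// Parses one line of UnicodeData.txt and returns the code point
/// if its General_Category (third field) is Mn, Mc, or Me.
/// Returns None for other categories or for lines that do not parse.
fn combining_mark_code_point(line: &str) -> Option<u32> {
    let mut fields = line.split(';');
    let code_point = u32::from_str_radix(fields.next()?, 16).ok()?;
    let _name = fields.next()?;
    let category = fields.next()?;
    matches!(category, "Mn" | "Mc" | "Me").then_some(code_point)
}
```

Feeding every line of the file through this function and collecting the `Some` results gives a list that can be compared against the ranges bytesight treats as combining marks.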