PhantomText Toolkit

Official experiment repository for our paper "The Hidden Threat in Plain Text: Attacking RAG Data Loaders", presented at ACM AISec 2025 (a CCS workshop).

PhantomText is a Python library for content injection, content obfuscation, file scanning, and file sanitization across PDF, DOCX, and HTML documents. It provides tools both to craft hidden or obfuscated content and to detect and remove it.

Features

Content Injection: Inject hidden content into documents using techniques such as zero-size text and transparent text. Camouflage, out-of-bound, and metadata injection are in development.
Content Obfuscation: Obfuscate target text with zero-width characters, homoglyphs, diacritical marks, and bidirectional text reordering.
File Scanning: Scan files for malicious content or vulnerabilities using the FileScanner class, which detects obfuscated and injected content.
File Sanitization: Sanitize files to remove harmful content with the FileSanitizer class.

Attack Families

Obfuscation Attacks

Zero-Width Characters: Uses invisible Unicode characters (Zero Width Space, Zero Width Non-Joiner, etc.) to obfuscate text
Homoglyph Characters: Replaces characters with visually similar Unicode characters from different scripts
Diacritical Marks: Adds combining diacritical marks to characters to alter their appearance
Bidi/Reordering: Uses Unicode bidirectional override characters to manipulate text direction and rendering
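
To make the four obfuscation families above concrete, here is a minimal standard-library sketch of each transformation. It is not PhantomText's implementation: the function names, the choice of code points, and the small homoglyph map are assumptions made for illustration only.

```python
# Illustrative sketch only; names and character choices are hypothetical.
ZERO_WIDTH_SPACE = "\u200b"
COMBINING_ACUTE = "\u0301"
RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING

# Tiny Latin -> Cyrillic look-alike map (a real map would be far larger).
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e", "p": "\u0440", "c": "\u0441"}


def zero_width(text):
    """Insert an invisible zero-width space between every pair of characters."""
    return ZERO_WIDTH_SPACE.join(text)


def homoglyph(text):
    """Swap characters for visually similar ones from another script."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)


def diacritical(text):
    """Append a combining acute accent to every alphabetic character."""
    return "".join(ch + COMBINING_ACUTE if ch.isalpha() else ch for ch in text)


def bidi(text):
    """Store the text reversed and wrap it in an RLO ... PDF pair, so a
    bidi-aware renderer displays it in the original visual order."""
    return RLO + text[::-1] + PDF


target = "email"
for transform in (zero_width, homoglyph, diacritical, bidi):
    obfuscated = transform(target)
    # A plain substring search no longer finds the target.
    print(transform.__name__, target in obfuscated)
```

In each case the stored code points differ from what a reader sees, which is why a plain keyword match on the extracted text fails.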

Injection Attacks

Zero-Size Injection: Injects content using zero or near-zero font sizes to make text invisible
Transparent Injection: Injects content using transparent colors or opacity settings
Camouflage Injection: (In development) Hides content by matching background colors or patterns
Out-of-Bound Injection: (In development) Places content outside visible document boundaries
Metadata Injection: (In development) Embeds content in document metadata
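
As a rough illustration of the zero-size and transparent techniques for HTML, the sketch below wraps a payload in a styled span and shows that a naive text extractor still picks it up. The helper names (hidden_span, inject_into_html, extract_text) and the exact CSS are assumptions for this example, not the library's API or its actual output.

```python
# Illustrative sketch only; helper names and CSS choices are hypothetical.
import html
from html.parser import HTMLParser

STYLES = {
    "zero_size": "font-size:0",
    "transparent": "opacity:0",
}


def hidden_span(payload, technique="zero_size"):
    """Wrap the payload in a span that a browser renders invisibly."""
    return f'<span style="{STYLES[technique]}">{html.escape(payload)}</span>'


def inject_into_html(document, payload, technique="zero_size"):
    """Place the hidden span just before </body>, or append it if absent."""
    span = hidden_span(payload, technique)
    if "</body>" in document:
        return document.replace("</body>", span + "</body>", 1)
    return document + span


class _TextCollector(HTMLParser):
    """Collects every text node, ignoring styling entirely."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_text(document):
    """Naive loader: return all text content regardless of visibility."""
    collector = _TextCollector()
    collector.feed(document)
    collector.close()
    return "".join(collector.chunks)


page = "<html><body><p>Hi</p></body></html>"
injected = inject_into_html(page, "Hidden content", "transparent")
print("Hidden content" in extract_text(injected))
```

The extractor ignores styling, so text that a human never sees still reaches whatever consumes the extracted content.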

Supported Formats

PhantomText supports the following document formats:

PDF
DOCX
HTML

Installation

To install PhantomText, you can use pip:

pip install phantomtext

Usage

Basic Content Injection Example

from phantomtext.content_injection import ContentInjector

injector = ContentInjector()
injector.inject_content('document.pdf', 'New Content')

Content Obfuscation Example

from phantomtext.content_obfuscation import ContentObfuscator

obfuscator = ContentObfuscator()

# Basic obfuscation
obfuscated_content = obfuscator.obfuscate_content('Sensitive Information')

# Advanced obfuscation with specific techniques
content = "Sensitive info: email@example.com and phone 123-456-7890."
target = "email@example.com"

# Zero-width character obfuscation
obfuscated = obfuscator.obfuscate(content, target, 
                                  obfuscation_technique="zeroWidthCharacter", 
                                  modality="default", 
                                  file_format="html")

# Homoglyph character obfuscation
obfuscated = obfuscator.obfuscate(content, target, 
                                  obfuscation_technique="homoglyph", 
                                  file_format="pdf")

# Diacritical marks obfuscation
obfuscated = obfuscator.obfuscate(content, target, 
                                  obfuscation_technique="diacritical", 
                                  modality="heavy", 
                                  file_format="docx")

# Bidi/reordering character obfuscation
obfuscated = obfuscator.obfuscate(content, target, 
                                  obfuscation_technique="bidi", 
                                  modality="default", 
                                  file_format="html")

Technique-Specific Injection Example

from phantomtext.injection.zerosize_injection import ZeroSizeInjection
from phantomtext.injection.transparent_injection import TransparentInjection

# Zero-size injection
injector = ZeroSizeInjection(modality="default", file_format="pdf")
injector.apply(input_document="document.pdf", 
               injection="Hidden content", 
               output_path="injected_document.pdf")

# Transparent injection
injector = TransparentInjection(modality="opacity-0", file_format="html")
injector.apply(input_document="document.html", 
               injection="Invisible text", 
               output_path="injected_document.html")

Supported Attacks

Obfuscation Attacks

Attack Family	Attack Name	Variant	HTML	DOCX	PDF
Obfuscation	diacritical_marks	default	✅	✅	✅
Obfuscation	diacritical_marks	heavy	✅	✅	✅
Obfuscation	homoglyph_characters	default	✅	✅	✅
Obfuscation	zero_width_characters	default	✅	✅	✅
Obfuscation	zero_width_characters	heavy	✅	✅	✅
Obfuscation	bidi_reordering	default	✅	✅	✅
Obfuscation	bidi_reordering	heavy	✅	✅	✅

Injection Attacks

Attack Family	Attack Name	Variant	HTML	DOCX	PDF
Injection	zero_size	default	✅	✅	✅
Injection	zero_size	close-to-zero	✅	❌	✅
Injection	transparent	default	✅	✅	✅
Injection	transparent	opacity-0	✅	❌	✅
Injection	transparent	opacity-close-to-zero	✅	❌	✅
Injection	transparent	vanish	❌	✅	❌
Injection	camouflage	default	🚧	🚧	🚧
Injection	out_of_bound	default	🚧	🚧	🚧
Injection	metadata	default	🚧	🚧	🚧

Legend:

✅ Implemented and working
❌ Not supported for this format
🚧 Placeholder implementation (not yet functional)
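
When scripting experiments, it can help to check the support matrix before calling an attack. The sketch below encodes the two tables above as a lookup. The support_status helper is hypothetical and not part of PhantomText, and the keys follow the attack names in the tables, which may differ from the identifiers the API accepts.

```python
# Illustrative sketch only; this lookup is not part of the PhantomText API.
# Each value lists HTML, DOCX, PDF in that order:
# "Y" = implemented, "N" = not supported, "P" = placeholder.
FORMATS = ("html", "docx", "pdf")

SUPPORT = {
    ("diacritical_marks", "default"): "YYY",
    ("diacritical_marks", "heavy"): "YYY",
    ("homoglyph_characters", "default"): "YYY",
    ("zero_width_characters", "default"): "YYY",
    ("zero_width_characters", "heavy"): "YYY",
    ("bidi_reordering", "default"): "YYY",
    ("bidi_reordering", "heavy"): "YYY",
    ("zero_size", "default"): "YYY",
    ("zero_size", "close-to-zero"): "YNY",
    ("transparent", "default"): "YYY",
    ("transparent", "opacity-0"): "YNY",
    ("transparent", "opacity-close-to-zero"): "YNY",
    ("transparent", "vanish"): "NYN",
    ("camouflage", "default"): "PPP",
    ("out_of_bound", "default"): "PPP",
    ("metadata", "default"): "PPP",
}


def support_status(attack, variant, file_format):
    """Return "Y", "N", or "P" for an attack/variant/format combination."""
    row = SUPPORT[(attack, variant)]
    return row[FORMATS.index(file_format.lower())]


print(support_status("transparent", "vanish", "docx"))
print(support_status("zero_size", "close-to-zero", "docx"))
```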

File Scanning Example

from phantomtext.file_scanning import FileScanner

scanner = FileScanner()

# Scan a single file
result = scanner.scan_file('document.docx')
print(f"Malicious content found: {result['malicious_content_found']}")
print(f"Vulnerabilities: {result['vulnerabilities']}")

# Scan an entire directory
reports = scanner.scan_dir('./output')
for report in reports:
    if report['malicious_content_found']:
        print(f"⚠️ Issues found in {report['file_path']}")
        for vulnerability in report['vulnerabilities']:
            print(f"  - {vulnerability}")

Detection Capabilities

The FileScanner can detect the following obfuscation techniques:

Zero-width character sequences
Homoglyph character substitutions
Diacritical mark insertions
Bidirectional text overrides
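
For intuition about how such checks can work on already-extracted text, here is a small heuristic detector built on the standard unicodedata module. It is a sketch under stated assumptions, not the FileScanner's actual logic: the function name, the character sets, and the mixed-script rule for homoglyphs are all illustrative choices.

```python
# Illustrative sketch only; not the FileScanner implementation.
import unicodedata

ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # embeddings and overrides
    "\u2066", "\u2067", "\u2068", "\u2069",            # isolates
}


def script_of(ch):
    """Approximate a character's script by the first word of its Unicode name."""
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else ""


def find_obfuscation(text):
    """Return a sorted list of suspected obfuscation families in the text."""
    findings = set()
    for word in text.split():
        scripts = set()
        for ch in word:
            if ch in ZERO_WIDTH:
                findings.add("zero_width")
            elif ch in BIDI_CONTROLS:
                findings.add("bidi")
            elif unicodedata.combining(ch):
                findings.add("diacritical")
            elif ch.isalpha():
                scripts.add(script_of(ch))
        # Letters from more than one script inside one word suggest homoglyphs.
        if len(scripts) > 1:
            findings.add("homoglyph")
    return sorted(findings)


print(find_obfuscation("p\u0430ssword"))
print(find_obfuscation("plain text"))
```

These rules are deliberately simple and will flag some legitimate text, for example decomposed accented letters or genuinely multilingual words, so treat the results as signals rather than verdicts.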

File Sanitization Example

from phantomtext.file_sanitization import FileSanitizer

sanitizer = FileSanitizer()
sanitizer.sanitize_file('malicious_file.txt')
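
To show what sanitization can look like at the text level, the sketch below strips invisible characters and combining marks from a string using only the standard library. It is an assumption-laden illustration, not what FileSanitizer does internally: sanitize_text is a hypothetical helper, it is lossy for legitimately accented text, and it does not map homoglyphs back or restore bidi-reversed text.

```python
# Illustrative sketch only; not the FileSanitizer implementation.
import unicodedata

# Zero-width characters plus bidirectional control characters.
INVISIBLE = set(
    "\u200b\u200c\u200d\u2060\ufeff"
    "\u202a\u202b\u202c\u202d\u202e"
    "\u2066\u2067\u2068\u2069"
)


def sanitize_text(text):
    """Remove invisible characters and combining marks from text."""
    # Decompose first so precomposed letters expose their combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    kept = [
        ch for ch in decomposed
        if ch not in INVISIBLE and not unicodedata.combining(ch)
    ]
    return unicodedata.normalize("NFC", "".join(kept))


print(sanitize_text("e\u200bma\u200cil"))
```

Removing every combining mark is a blunt choice: it defeats diacritical obfuscation but also turns "café" into "cafe", so a production sanitizer would likely need a more selective policy.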

Contributing

Contributions are welcome! Please feel free to submit a pull request or open an issue for any enhancements or bug fixes.

Cite

@article{castagnaro2025hidden,
  title={The Hidden Threat in Plain Text: Attacking RAG Data Loaders},
  author={Castagnaro, Alberto and Salviati, Umberto and Conti, Mauro and Pajola, Luca and Pizzi, Simeone},
  journal={arXiv preprint arXiv:2507.05093},
  year={2025}
}

License

This project is licensed under the MIT License. See the LICENSE file for more details.
