
pseudoswapper

License: MIT

A local CLI tool for tokenising sensitive data in files before sharing them with AI assistants.

All processing happens on your machine. No sensitive data or token mappings ever leave your device. Designed for temporary, one-off use — each redact → restore cycle is self-contained and leaves no persistent artefacts. Unlike tools that use encryption keys to support long-term reversible anonymisation, pseudoswapper requires no key management: the mapping exists only for the duration of a single session and is deleted automatically after restore.


What it does

Before you share a document or data file with an AI assistant (or any untrusted or public tool), pseudoswapper scans it and replaces sensitive values — names, email addresses, phone numbers, domains, company identifiers — with human-readable tokens like [PERSON_1], [EMAIL_2], [COMPANY_1]. You share the tokenised file. When the AI returns its output, you run pseudoswapper restore and the original values are substituted back in.

The token-to-value mapping is held in a temporary, user-only directory for the duration of the session and deleted automatically after a successful restore. Nothing is written to a persistent file.
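The redact → restore round trip can be sketched in a few lines of Python. This is only an illustration of the token-mapping idea, not pseudoswapper's actual implementation: it detects a single entity type (email addresses) with one regex, whereas the real tool layers config terms, patterns, and NLP.

```python
import re

# Hypothetical sketch: one regex detector plus a token map.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.\w+")

def redact(text):
    mapping = {}   # token -> original value
    seen = {}      # original value -> token (same value, same token)
    def repl(m):
        value = m.group(0)
        if value not in seen:
            token = f"[EMAIL_{len(seen) + 1}]"
            seen[value] = token
            mapping[token] = value
        return seen[value]
    return EMAIL_RE.sub(repl, text), mapping

def restore(text, mapping):
    # Substitute original values back into the AI's output.
    for token, value in mapping.items():
        text = text.replace(token, value)
    return text

redacted, mapping = redact("Contact ana@example.com or bob@example.com.")
# redacted == "Contact [EMAIL_1] or [EMAIL_2]."
```

Because the mapping stays in memory (or, in the real tool, in a user-only temp directory), only the tokenised text ever needs to leave the machine.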


When to use it

  • You want to ask an AI assistant (ChatGPT, Claude, Gemini, Copilot) to review, summarise, or analyse a document that contains real names, email addresses, or internal identifiers
  • You need to upload a file to an online conversion, translation, or analysis service and don't want to expose its contents
  • You're sharing data with a contractor or external collaborator via a tool or platform you don't fully control
  • You're running a log file, support ticket, or incident report through a third-party debugging or analysis service
  • You want to share the structure and content of a dataset for analysis without revealing who or what it relates to

Two modes

Document mode — for prose: emails, reports, articles, freeform text, Word documents (.docx), and PDFs (.pdf). Detects PII using a combination of exact-match config, regex patterns, and NLP (via Presidio + spaCy en_core_web_lg). For .docx files, replacement is applied at the paragraph level and the output is a valid .redacted.docx file. For .pdf files, text is extracted and the output is a .redacted.txt file (layout is not preserved).
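The first two detection layers (exact-match config terms and regex patterns) might combine along these lines. The term list, pattern table, and function name below are invented for illustration, and the real pipeline adds Presidio/spaCy NLP on top:

```python
import re

COMPANY_TERMS = ["Initech", "Project Falcon"]             # hypothetical config terms
PATTERNS = {"EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.\w+")}  # one sample pattern

def detect(text):
    spans = []  # (start, end, entity_type)
    # Layer 1: exact-match config terms.
    for term in COMPANY_TERMS:
        for m in re.finditer(re.escape(term), text):
            spans.append((m.start(), m.end(), "COMPANY"))
    # Layer 2: regex patterns.
    for etype, pat in PATTERNS.items():
        for m in pat.finditer(text):
            spans.append((m.start(), m.end(), etype))
    return sorted(spans)

text = "Initech staff: mia@initech.test (Project Falcon)."
spans = detect(text)
```

Each span can then be replaced right-to-left so earlier offsets stay valid while tokens are substituted in.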

Structured mode — for CSV, JSON, and XLSX files. Uses an anchor field (a unique identifier column like employee_id or full_name) to correlate all fields in a row to a single entity. The same anchor value always produces the same token across all rows, preserving relational integrity.
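The anchor mechanism can be illustrated with a small sketch. Column names are invented, and replacing every correlated field with the same token is a simplification of whatever the real tool does per field type — the point is only that a repeated anchor value yields a repeated token:

```python
import csv, io

rows = csv.DictReader(io.StringIO(
    "employee_id,full_name,action\n"
    "E100,Ana Silva,login\n"
    "E200,Bob Ngata,upload\n"
    "E100,Ana Silva,logout\n"
))

tokens = {}  # anchor value -> token
out = []
for row in rows:
    anchor = row["employee_id"]
    if anchor not in tokens:
        tokens[anchor] = f"[PERSON_{len(tokens) + 1}]"
    # Correlate the whole row to one entity via the anchor.
    token = tokens[anchor]
    out.append({"employee_id": token, "full_name": token, "action": row["action"]})

# Rows 1 and 3 share anchor E100, so they share [PERSON_1]:
# the fact that the same person logged in and out survives redaction.
```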


Install

Requires Python 3.12.

git clone https://github.com/imauchi-sd/pseudoswapper
cd pseudoswapper
python3.12 -m venv .venv
source .venv/bin/activate
pip install -e .
python -m spacy download en_core_web_lg

Copy and fill in the example config:

cp pseudoswapper_config.example.yaml ~/.pseudoswapper_config.yaml

For platform-specific instructions (Mac, Windows, Linux) and alternative install methods, see the Installation section in USER_GUIDE.md.


Quick start

# Set a work directory so you can omit file paths (optional but convenient)
pseudoswapper workdir --set ~/Documents/sensitive-files

# Redact a prose document — pick from work directory if no path given
pseudoswapper document report.txt
pseudoswapper document            # → prompts file selection from work directory

# Redact a structured file (anchor auto-detected or from config)
# An interactive prompt lists columns and asks which to force-tokenize (Enter to skip)
pseudoswapper structured employees.csv
pseudoswapper structured          # → prompts selection of .csv/.json/.xlsx files

# Override the anchor field on the CLI
pseudoswapper structured access_logs.csv --anchor user_id

# Force-tokenize specific columns, bypassing NER (repeatable; skips interactive prompt)
pseudoswapper structured employees.xlsx --force-fields "Last name, First name" --force-fields "Email"

# Leave certain entity types unreplaced (e.g. keep IPs visible in a security log)
# Protected types (PERSON, EMAIL, COMPANY, ORG) are always tokenized regardless
pseudoswapper document incident.log --passthrough IP
pseudoswapper document incident.log --passthrough IP --passthrough DOMAIN
pseudoswapper structured access_logs.csv --anchor user_id --passthrough IP --passthrough URL

# Supply an employee roster per-invocation (both modes)
pseudoswapper document report.txt --employees-csv ~/company_employees.csv
pseudoswapper structured access_logs.csv --anchor user_id --employees-csv ~/company_employees.csv

# After the AI returns its output, restore original values
pseudoswapper restore ai_output.txt
pseudoswapper restore             # → prompts file selection from work directory
# → writes ai_output.restored.txt

# Show a human-readable summary of what will be tokenized
pseudoswapper config --summary

# Inspect or edit the raw config YAML
pseudoswapper config --show
pseudoswapper config --edit

# Manage work directory
pseudoswapper workdir --show
pseudoswapper workdir --clear

# Abandon a stuck session
pseudoswapper clear-session

Configuration

pseudoswapper reads ~/.pseudoswapper_config.yaml on every run. Use it to define:

  • company_terms — exact strings to always redact (project names, internal system names, domains)
  • employees — known individuals listed inline; guarantees consistent tokenisation even when NLP misses a name
  • employees_csv — path to a CSV file of employees (use instead of or alongside employees for large rosters); must have a full_name column, optionally email and username
  • exclude_terms — words to exclude from NLP detection (prevents over-redaction of common names)
  • passthrough_types — entity types to leave unreplaced (e.g. IP, DOMAIN, URL, PHONE, LOC); useful when certain technical values carry analytical value. Protected types (PERSON, EMAIL, COMPANY, ORG) cannot be bypassed. Also overridable per-run via --passthrough on the CLI.
  • structured.anchor_field — default anchor column for structured mode
  • structured.correlated_fields — columns to correlate to the anchor entity per row
  • structured.force_fields — columns to always tokenize unconditionally, bypassing NLP (useful for name columns with non-Western names or non-standard formatting)
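Put together, a minimal config might look like the following. The keys mirror the list above, but every value is invented, and the shape of the inline employees entries is an assumption — consult pseudoswapper_config.example.yaml for the authoritative template.

```yaml
# ~/.pseudoswapper_config.yaml — illustrative values only
company_terms:
  - Project Falcon
  - initech.test
employees:
  - full_name: Ana Silva
    email: ana@initech.test
exclude_terms:
  - Jira
passthrough_types:
  - IP
  - URL
structured:
  anchor_field: employee_id
  force_fields:
    - Last name
    - First name
```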

You can also pass a CSV per-invocation with --employees-csv on any redact command — this takes priority over the config key.

See pseudoswapper_config.example.yaml for a fully annotated template, and employees_sample.csv for the expected CSV format.
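The protected-types rule for passthrough can be sketched as a simple set subtraction (the function and the detection tuples below are hypothetical; the four protected type names come from the docs above):

```python
# Protected entity types can never be passed through, whatever the
# user requests via config or --passthrough.
PROTECTED = {"PERSON", "EMAIL", "COMPANY", "ORG"}

def effective_passthrough(requested):
    return set(requested) - PROTECTED

detections = [("10.0.0.7", "IP"), ("ana@initech.test", "EMAIL")]
keep_visible = effective_passthrough({"IP", "EMAIL"})  # EMAIL silently dropped
redacted = [
    value if etype in keep_visible else f"[{etype}]"
    for value, etype in detections
]
# The IP stays visible for analysis; the email is still tokenized.
```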

The work directory preference is stored in ~/.pseudoswapper_prefs.yaml, written when you run pseudoswapper workdir --set PATH. This file is managed by the tool and is separate from your config file, so editing one never affects the other.


Session lifecycle

  • pseudoswapper document or pseudoswapper structured succeeds → session created; a .pseudoswapper_session pointer file is written to the current directory
  • pseudoswapper restore succeeds → session and pointer file deleted automatically
  • pseudoswapper restore fails → session preserved; fix the issue and retry
  • pseudoswapper clear-session → deletes the session and pointer file, abandoning the current session
  • System reboot → temp directory gone; the pointer file in the CWD becomes stale (safe to delete)

Run pseudoswapper restore from the same directory where you ran the redact command.
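Under stated assumptions (the mapping file name, JSON layout, and function names here are invented, not pseudoswapper's actual internals), the lifecycle might look like:

```python
import json, os, tempfile

def create_session(mapping, cwd="."):
    # mkdtemp creates a directory readable only by the current user (0700).
    session_dir = tempfile.mkdtemp(prefix="pseudoswapper-")
    with open(os.path.join(session_dir, "mapping.json"), "w") as f:
        json.dump(mapping, f)
    # Pointer file in the CWD records where the session lives,
    # which is why restore must run from the same directory.
    pointer = os.path.join(cwd, ".pseudoswapper_session")
    with open(pointer, "w") as f:
        f.write(session_dir)
    return pointer

def clear_session(pointer):
    # On successful restore (or clear-session), everything is deleted.
    with open(pointer) as f:
        session_dir = f.read()
    os.remove(os.path.join(session_dir, "mapping.json"))
    os.rmdir(session_dir)
    os.remove(pointer)

pointer = create_session({"[PERSON_1]": "Ana Silva"}, cwd=tempfile.gettempdir())
clear_session(pointer)
# No mapping file or pointer file survives the session.
```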


Security notes

  • The redacted file is safe to share. The token mapping never leaves your machine.
  • ~/.pseudoswapper_config.yaml contains employee names and internal identifiers — treat it as sensitive. Do not commit it to version control.
  • pseudoswapper makes no network calls during redact or restore.

See PRIVACY.md for the full privacy policy and user responsibility disclaimers. To report a security vulnerability, see SECURITY.md.


Known limitations

  • spaCy NER may miss names in non-prose contexts (log lines, headers, tables). Mitigate by listing known employees in the config or using force_fields in structured mode to guarantee tokenization of specific columns.
  • Email-to-name inference is not attempted in Document mode. Use Structured mode with an anchor field for correlated data.
  • Single anchor field only — composite identity (e.g. tenant_id + user_id) is not supported in v1.
  • .docx is supported natively — output is a .redacted.docx file. Intra-paragraph run-level formatting (bold/italic on specific words) is lost in paragraphs that contain a replaced token.
  • .pdf is supported natively — output is always a .redacted.txt file (layout not preserved). Scanned/image-only PDFs with no embedded text are not supported.

See USER_GUIDE.md for full documentation including anchor field selection, restoration behaviour, and all known limitations.


Support

If this tool saves you time, consider buying me a coffee.



Development

source .venv/bin/activate
pip install -e ".[dev]"
python -m pytest          # 152 tests, all passing

Python interpreter: the project requires the .venv at the project root (Python 3.12). Always activate it before running any python or pytest commands.

See CONTRIBUTING.md for contribution guidelines and the data safety checklist that applies to all commits.

About

A personal project: a simple, lightweight, local Python tool for pseudonymising sensitive data and reversing the process afterwards.
