# mowen: authorship attribution toolkit
mowen is a modular framework for identifying who wrote a document based on stylometric analysis. It provides a configurable pipeline of text canonicization, feature extraction, event culling, distance measurement, and machine-learning classification — usable as a Python library, CLI, or full-stack web application.
mowen is a clean-room Python successor to JGAAP, the Java Graphical Authorship Attribution Program.
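The five pipeline stages can be illustrated with a minimal, self-contained sketch. This is plain stdlib Python for illustration only, not mowen's actual implementation; the function names here are invented for the example:

```python
from collections import Counter
import math

def canonicize(text):
    """Stage 1: normalize case and collapse whitespace."""
    return " ".join(text.lower().split())

def extract_events(text, n=2):
    """Stage 2: extract character n-gram events."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def cull(events, min_count=1):
    """Stage 3: keep only events occurring at least min_count times."""
    counts = Counter(events)
    return [e for e in events if counts[e] >= min_count]

def cosine_distance(h1, h2):
    """Stage 4: cosine distance between event histograms."""
    dot = sum(v * h2.get(k, 0) for k, v in h1.items())
    n1 = math.sqrt(sum(v * v for v in h1.values()))
    n2 = math.sqrt(sum(v * v for v in h2.values()))
    return 1 - dot / (n1 * n2)

def attribute(known, unknown_text):
    """Stage 5: nearest-neighbor classification over (author, text) pairs."""
    target = Counter(cull(extract_events(canonicize(unknown_text))))
    def dist(pair):
        return cosine_distance(Counter(cull(extract_events(canonicize(pair[1])))), target)
    return min(known, key=dist)[0]

known = [("Hamilton", "The government must be strong."),
         ("Madison", "Factions are controlled by diversity.")]
print(attribute(known, "The government must remain strong."))  # "Hamilton"
```

Every stage in mowen is a pluggable component registered under a name, so each of these toy functions corresponds to a family of interchangeable implementations.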
- 113 pipeline components across 5 stages, all pluggable via a registry system
- 41 event drivers — character/word n-grams, skip-grams, sorted n-grams, POS tags, NER, function words, transformer embeddings, SELMA instruction-tuned embeddings, perplexity/surprisal features, GNN syntactic embeddings, and more
- 26 distance functions — cosine, Manhattan, KL divergence, chi-square, PPM compression, NCD, cross-entropy, Hellinger, and more
- 21 analysis methods — nearest neighbor, KNN, SVM, MLP (with R-Drop), Burrows' Delta, General Imposters, Unmasking, contrastive learning, LLM zero-shot prompting, and more
- 15 event cullers and 10 canonicizers for feature selection and text normalization
- Authorship verification — General Imposters Method (Koppel & Winter 2014) and Unmasking (Koppel & Schler 2004) with calibrated non-answer support
- Style change detection — detect authorship boundaries within multi-author documents
- PAN-standard evaluation — accuracy, F1, EER, c@1, F_0.5u, Brier score, AUROC, MRR, cross-genre and topic-controlled evaluation
- 20 bundled sample corpora — Federalist Papers, Shakespeare, Brontë Sisters, Pauline Epistles, Homer, Russian Literature, State of the Union, and 13 AAAC benchmark problems
- 15 stylometry presets — Burrows' Delta, Cosine Delta, Eder's Delta, Character N-grams, Function Words, Multi-Feature SVM, Transformer Embeddings, SELMA, PAN cngdist, PPM Compression, Forensic Verification, Cross-Genre Robust, Perplexity Profile, General Imposters, Unmasking
- Multi-language support — pluggable tokenizer framework with Chinese segmentation (jieba) and function word lists for 10 languages
- 4 document loaders — plain text, PDF, DOCX, HTML
- JGAAP CSV compatibility for existing experiment files
- React web UI with experiment builder and results viewer
- REST API with OpenAPI docs at `/docs`
- Docker deployment
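Burrows' Delta, one of the analysis methods and presets listed above, ranks candidate authors by the mean absolute difference of z-scored feature frequencies. A minimal sketch of the standard formulation (illustrative only, not mowen's implementation; the feature list and helper names are made up for the example):

```python
from collections import Counter
import statistics

FEATURES = ["the", "of", "and", "to", "a"]  # tiny function-word set for illustration

def rel_freqs(text):
    """Relative frequency of each feature word in the text."""
    words = text.lower().split()
    counts = Counter(words)
    return [counts[f] / len(words) for f in FEATURES]

def burrows_delta(known, unknown_text):
    """known: {author: text}. Return authors sorted by Delta (smaller = more similar)."""
    profiles = {a: rel_freqs(t) for a, t in known.items()}
    # mean and population stdev of each feature across the reference corpus
    cols = list(zip(*profiles.values()))
    mu = [statistics.mean(c) for c in cols]
    sd = [statistics.pstdev(c) or 1.0 for c in cols]  # guard constant features
    def z(v):
        return [(x - m) / s for x, m, s in zip(v, mu, sd)]
    zu = z(rel_freqs(unknown_text))
    deltas = {a: sum(abs(x - y) for x, y in zip(z(p), zu)) / len(FEATURES)
              for a, p in profiles.items()}
    return sorted(deltas, key=deltas.get)

ranked = burrows_delta(
    {"Hamilton": "the union the state the people of america",
     "Madison": "of factions of liberty of property and rights"},
    "the constitution the union the laws",
)
print(ranked[0])  # "Hamilton"
```

In practice Delta is computed over hundreds of most-frequent words; the five-word feature set here only keeps the example short.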
```bash
pip install -e core/
```

```python
from mowen import Pipeline, PipelineConfig, Document

known = [
    Document(text="The government must be strong.", author="Hamilton"),
    Document(text="Factions are controlled by diversity.", author="Madison"),
]
unknown = [Document(text="The federal union requires power.")]

config = PipelineConfig(
    event_drivers=[{"name": "word_ngram", "params": {"n": 2}}],
    distance_function={"name": "cosine"},
    analysis_method={"name": "nearest_neighbor"},
)

results = Pipeline(config).execute(known, unknown)
print(results[0].top_author)  # "Hamilton"
```

```python
from mowen import leave_one_out, PipelineConfig, Document

docs = [
    Document(text="...", author="Hamilton"),
    Document(text="...", author="Madison"),
    # ... more documents with known authors
]

config = PipelineConfig(
    event_drivers=[{"name": "character_ngram", "params": {"n": 3}}],
    distance_function={"name": "cosine"},
    analysis_method={"name": "nearest_neighbor"},
)

result = leave_one_out(docs, config)
print(f"Accuracy: {result.accuracy:.1%}")
print(f"Macro F1: {result.macro_f1:.4f}")
```

```bash
pip install -e cli/
```

```bash
# Run an attribution experiment
mowen run -d docs.csv -e word_ngram -e character_ngram:n=3 --distance cosine

# Evaluate accuracy via leave-one-out cross-validation
mowen evaluate -d corpus.csv -e character_ngram:n=3 --distance cosine --mode loo

# Evaluate with k-fold and export results
mowen evaluate -d corpus.csv -e word_events --mode kfold -k 10 --output-csv results.csv

# List all available components
mowen list-components
mowen list-components event-drivers --json
```

```bash
pip install -e server/
mowen-server
```

Open http://localhost:8000. API docs at http://localhost:8000/docs.

```bash
docker compose up
```

Serves the full app at http://localhost:8000 with data persisted in a Docker volume.
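What leave-one-out evaluation computes can be sketched in a few lines: each document is held out in turn, attributed against the rest, and accuracy is the fraction of correct predictions. A self-contained illustration using character trigrams and cosine similarity (not mowen's code; all names here are invented for the example):

```python
from collections import Counter
import math

def char_ngrams(text, n=3):
    """Histogram of character n-grams over normalized text."""
    t = " ".join(text.lower().split())
    return Counter(t[i:i + n] for i in range(len(t) - n + 1))

def cosine_sim(h1, h2):
    dot = sum(v * h2.get(k, 0) for k, v in h1.items())
    n1 = math.sqrt(sum(v * v for v in h1.values()))
    n2 = math.sqrt(sum(v * v for v in h2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def leave_one_out_accuracy(docs):
    """docs: list of (author, text). Hold each out, attribute via nearest neighbor."""
    correct = 0
    for i, (author, text) in enumerate(docs):
        rest = [d for j, d in enumerate(docs) if j != i]
        target = char_ngrams(text)
        pred = max(rest, key=lambda d: cosine_sim(char_ngrams(d[1]), target))[0]
        correct += pred == author
    return correct / len(docs)

docs = [("A", "the quick brown fox jumps"), ("A", "the quick brown fox leaps"),
        ("B", "import numpy as np arrays"), ("B", "import numpy as np matrix")]
print(leave_one_out_accuracy(docs))  # 1.0
```

k-fold evaluation follows the same pattern with larger held-out partitions.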
```
core/                       Python library (mowen)
  src/mowen/
    pipeline.py             Pipeline orchestrator
    evaluation.py           Cross-validation and metrics
    types.py                Core data types (Document, Event, Histogram, ...)
    canonicizers/           10 text preprocessors
    event_drivers/          41 feature extractors
    event_cullers/          15 feature selectors
    distance_functions/     26 distance/similarity metrics
    analysis_methods/       21 classifiers + verification methods
    tokenizers/             Pluggable word segmentation (whitespace, jieba)
    document_loaders/       File format readers (txt, pdf, docx, html)
    data/                   Function word lists + 20 sample corpora
    compat/                 JGAAP CSV import
cli/                        Command-line interface (mowen-cli)
server/                     FastAPI backend + static frontend serving (mowen-server)
web/                        React/TypeScript frontend
tests/                      790+ tests (pytest)
scripts/                    Corpus build scripts
```
```bash
# Install all packages in development mode
pip install -e core/ -e cli/ -e server/

# Run tests
python -m pytest tests/

# Lint
ruff check core/ cli/ server/ tests/
```

See docs/ONBOARDING.md for the full developer setup guide.
The core library has no required dependencies. Optional features:
| Extra | Install | Enables |
|---|---|---|
| `nlp` | `pip install 'mowen[nlp]'` | POS tagging, NER (spaCy) |
| `transformers` | `pip install 'mowen[transformers]'` | Transformer embeddings (HuggingFace) |
| `chinese` | `pip install 'mowen[chinese]'` | Chinese word segmentation (jieba) |
| `wordnet` | `pip install 'mowen[wordnet]'` | WordNet definition events (NLTK) |
| `pdf` | `pip install 'mowen[pdf]'` | PDF document loading (pdfplumber) |
| `docx` | `pip install 'mowen[docx]'` | DOCX document loading (python-docx) |
| `html` | `pip install 'mowen[html]'` | HTML document loading (BeautifulSoup) |
| `all` | `pip install 'mowen[all]'` | Everything above |
MIT — see LICENSE.
Copyright 2026 John Noecker Jr.
