jnoecker/mowen

mowen (墨紋)

Authorship attribution toolkit

mowen is a modular framework for identifying who wrote a document based on stylometric analysis. It provides a configurable pipeline of text canonicization, feature extraction, event culling, distance measurement, and machine-learning classification — usable as a Python library, CLI, or full-stack web application.
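The five stages can be pictured with a small standalone sketch in plain Python. It does not use mowen's API: the function names below (canonicize, char_ngrams, cull_most_common, cosine_distance, nearest_author) are hypothetical, and the specific choices (lowercasing, character trigrams, a most-common cutoff, cosine distance, nearest neighbor) are just one assumed combination of components.

```python
from collections import Counter
from math import sqrt


def canonicize(text: str) -> str:
    """Stage 1: normalize case and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Stage 2: count overlapping character n-grams (the 'events')."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cull_most_common(histograms: list[Counter], k: int) -> set[str]:
    """Stage 3: keep only the k most frequent events across the given histograms."""
    total: Counter = Counter()
    for hist in histograms:
        total.update(hist)
    return {event for event, _ in total.most_common(k)}


def cosine_distance(a: Counter, b: Counter) -> float:
    """Stage 4: 1 - cosine similarity between two event histograms."""
    dot = sum(count * b[event] for event, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return 1.0 - dot / norm if norm else 1.0


def nearest_author(known: list[tuple[str, str]], unknown: str, n: int = 3, k: int = 200) -> str:
    """Stage 5: attribute the unknown text to the author of the closest known text."""
    known_hists = [(author, char_ngrams(canonicize(text), n)) for author, text in known]
    unknown_hist = char_ngrams(canonicize(unknown), n)
    kept = cull_most_common([hist for _, hist in known_hists], k)

    def restrict(hist: Counter) -> Counter:
        # Drop every event that did not survive culling.
        return Counter({event: count for event, count in hist.items() if event in kept})

    target = restrict(unknown_hist)
    best_author, _ = min(known_hists, key=lambda pair: cosine_distance(restrict(pair[1]), target))
    return best_author
```

Here known is a list of (author, text) pairs. Each stage is an ordinary function, so replacing one (for example, a different distance) leaves the others untouched, which is the property a pluggable pipeline relies on.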

mowen is a clean-room Python successor to JGAAP, the Java Graphical Authorship Attribution Program.

Features

  • 113 pipeline components across 5 stages, all pluggable via a registry system
  • 41 event drivers — character/word n-grams, skip-grams, sorted n-grams, POS tags, NER, function words, transformer embeddings, SELMA instruction-tuned embeddings, perplexity/surprisal features, GNN syntactic embeddings, and more
  • 26 distance functions — cosine, Manhattan, KL divergence, chi-square, PPM compression, NCD, cross-entropy, Hellinger, and more
  • 21 analysis methods — nearest neighbor, KNN, SVM, MLP (with R-Drop), Burrows' Delta, General Imposters, Unmasking, contrastive learning, LLM zero-shot prompting, and more
  • 15 event cullers and 10 canonicizers for feature selection and text normalization
  • Authorship verification — General Imposters Method (Koppel & Winter 2014) and Unmasking (Koppel & Schler 2004) with calibrated non-answer support
  • Style change detection — detect authorship boundaries within multi-author documents
  • PAN-standard evaluation — accuracy, F1, EER, c@1, F_0.5u, Brier score, AUROC, MRR, cross-genre and topic-controlled evaluation
  • 20 bundled sample corpora — Federalist Papers, Shakespeare, Brontë Sisters, Pauline Epistles, Homer, Russian Literature, State of the Union, and 13 AAAC benchmark problems
  • 15 stylometry presets — Burrows' Delta, Cosine Delta, Eder's Delta, Character N-grams, Function Words, Multi-Feature SVM, Transformer Embeddings, SELMA, PAN cngdist, PPM Compression, Forensic Verification, Cross-Genre Robust, Perplexity Profile, General Imposters, Unmasking
  • Multi-language support — pluggable tokenizer framework with Chinese segmentation (jieba) and function word lists for 10 languages
  • 4 document loaders — plain text, PDF, DOCX, HTML
  • JGAAP CSV compatibility for existing experiment files
  • React web UI with experiment builder and results viewer
  • REST API with OpenAPI docs at /docs
  • Docker deployment
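Two of the verification metrics in the list above, c@1 and the Brier score, are short enough to sketch directly. This is a standalone illustration rather than mowen's evaluation code. It assumes the convention common in PAN verification tasks: each prediction is a score in [0, 1], a score above 0.5 means "same author", and a score of exactly 0.5 is a non-answer. The function names are hypothetical.

```python
def c_at_1(truth: list[int], scores: list[float]) -> float:
    """c@1: accuracy that credits non-answers at the rate achieved on answered cases.

    c@1 = (n_correct + n_unanswered * n_correct / n) / n
    """
    n = len(truth)
    if n == 0:
        raise ValueError("c@1 is undefined for an empty evaluation set")
    n_correct = 0
    n_unanswered = 0
    for label, score in zip(truth, scores):
        if score == 0.5:
            n_unanswered += 1  # the system declined to answer
        elif (score > 0.5) == bool(label):
            n_correct += 1
    return (n_correct + n_unanswered * n_correct / n) / n


def brier_score(truth: list[int], scores: list[float]) -> float:
    """Mean squared difference between predicted score and true label (lower is better)."""
    if not truth:
        raise ValueError("Brier score is undefined for an empty evaluation set")
    return sum((score - label) ** 2 for label, score in zip(truth, scores)) / len(truth)
```

With truth [1, 0, 1, 0] and scores [0.9, 0.2, 0.5, 0.7], two of four cases are correct and one is unanswered, so c@1 is (2 + 1 * 2/4) / 4 = 0.625, compared with a plain accuracy of 0.5.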

Quick start

Python library

pip install -e core/

from mowen import Pipeline, PipelineConfig, Document

known = [
    Document(text="The government must be strong.", author="Hamilton"),
    Document(text="Factions are controlled by diversity.", author="Madison"),
]
unknown = [Document(text="The federal union requires power.")]

config = PipelineConfig(
    event_drivers=[{"name": "word_ngram", "params": {"n": 2}}],
    distance_function={"name": "cosine"},
    analysis_method={"name": "nearest_neighbor"},
)

results = Pipeline(config).execute(known, unknown)
print(results[0].top_author)  # top-ranked candidate author, e.g. "Hamilton"
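The configuration above names each component with a {"name": ..., "params": {...}} entry, which fits the registry system mentioned under Features. The sketch below shows one way such name-based lookup could work. It is an assumption for illustration, not mowen's implementation: REGISTRY, register, build, and the toy WordNGram class are all hypothetical.

```python
from typing import Any, Callable

# Hypothetical registry mapping a component name to the class that implements it.
REGISTRY: dict[str, type] = {}


def register(name: str) -> Callable[[type], type]:
    """Class decorator that records a component under the given name."""
    def decorator(cls: type) -> type:
        if name in REGISTRY:
            raise ValueError(f"component already registered: {name!r}")
        REGISTRY[name] = cls
        return cls
    return decorator


@register("word_ngram")
class WordNGram:
    """Toy stand-in for an event driver: whitespace-delimited word n-grams."""

    def __init__(self, n: int = 1) -> None:
        self.n = n

    def events(self, text: str) -> list[tuple[str, ...]]:
        words = text.split()
        return [tuple(words[i:i + self.n]) for i in range(len(words) - self.n + 1)]


def build(spec: dict[str, Any]) -> Any:
    """Instantiate the component described by a {"name": ..., "params": {...}} entry."""
    try:
        cls = REGISTRY[spec["name"]]
    except KeyError:
        raise KeyError(f"unknown component: {spec['name']!r}") from None
    return cls(**spec.get("params", {}))
```

Calling build({"name": "word_ngram", "params": {"n": 2}}).events("The federal union requires power.") yields the four word bigrams of that sentence. Adding a component under this scheme means writing one decorated class, with no change to the code that reads the configuration.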

Cross-validation

from mowen import leave_one_out, PipelineConfig, Document

docs = [
    Document(text="...", author="Hamilton"),
    Document(text="...", author="Madison"),
    # ... more documents with known authors
]
config = PipelineConfig(
    event_drivers=[{"name": "character_ngram", "params": {"n": 3}}],
    distance_function={"name": "cosine"},
    analysis_method={"name": "nearest_neighbor"},
)
result = leave_one_out(docs, config)
print(f"Accuracy: {result.accuracy:.1%}")
print(f"Macro F1: {result.macro_f1:.4f}")
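For readers unfamiliar with the procedure, leave-one-out cross-validation holds out each document in turn, attributes it using all the others, and scores the predictions. The sketch below is a generic standalone version under that definition, not mowen's leave_one_out. The names loo_pairs, accuracy, and macro_f1 are hypothetical, and classify stands for any function that takes (training documents, text) and returns an author.

```python
from typing import Callable

Labeled = tuple[str, str]  # (author, text)


def loo_pairs(
    docs: list[Labeled],
    classify: Callable[[list[Labeled], str], str],
) -> list[tuple[str, str]]:
    """Return (true author, predicted author) for every held-out document."""
    pairs = []
    for i, (author, text) in enumerate(docs):
        training = docs[:i] + docs[i + 1:]  # every document except the held-out one
        pairs.append((author, classify(training, text)))
    return pairs


def accuracy(pairs: list[tuple[str, str]]) -> float:
    """Fraction of held-out documents attributed to their true author."""
    return sum(true == predicted for true, predicted in pairs) / len(pairs)


def macro_f1(pairs: list[tuple[str, str]]) -> float:
    """Unweighted mean of per-author F1 over every author seen as truth or prediction."""
    labels = {label for pair in pairs for label in pair}
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)
```

Macro F1 weights every author equally, so it drops sharply when a classifier ignores an author with few documents, even if overall accuracy stays high.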

CLI

pip install -e cli/
# Run an attribution experiment
mowen run -d docs.csv -e word_ngram -e character_ngram:n=3 --distance cosine

# Evaluate accuracy via leave-one-out cross-validation
mowen evaluate -d corpus.csv -e character_ngram:n=3 --distance cosine --mode loo

# Evaluate with k-fold and export results
mowen evaluate -d corpus.csv -e word_events --mode kfold -k 10 --output-csv results.csv

# List all available components
mowen list-components
mowen list-components event-drivers --json
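The -e character_ngram:n=3 argument reads as a compact form of the {"name": ..., "params": {...}} entries used by the Python API. The sketch below shows how such a string could be mapped to that dictionary form. It is an illustration only: parse_component_spec is a hypothetical helper, and both the comma separator for multiple parameters and the number coercion are assumptions not confirmed by the commands above.

```python
from typing import Any


def _coerce(value: str) -> Any:
    """Interpret a parameter value as int, then float, else leave it as a string."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            continue
    return value


def parse_component_spec(spec: str) -> dict[str, Any]:
    """Turn 'name:key=value,key=value' into {"name": ..., "params": {...}}."""
    name, _, raw_params = spec.partition(":")
    if not name:
        raise ValueError(f"missing component name in {spec!r}")
    params: dict[str, Any] = {}
    # Assumed syntax: comma-separated key=value pairs after the colon.
    for item in filter(None, raw_params.split(",")):
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {item!r}")
        params[key] = _coerce(value)
    return {"name": name, "params": params}
```

Under these assumptions, parse_component_spec("character_ngram:n=3") returns {"name": "character_ngram", "params": {"n": 3}}, and a bare name such as "word_ngram" returns an empty params dictionary.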

Web UI & API

pip install -e server/
mowen-server

Open http://localhost:8000 for the web UI. The API docs are at http://localhost:8000/docs.
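Because the backend is FastAPI, the machine-readable OpenAPI schema behind the /docs page is normally served at /openapi.json. That path is FastAPI's default and is assumed here to be unchanged. The sketch below downloads the schema and lists the operations it declares, without assuming any particular endpoint names; fetch_openapi_schema and list_operations are hypothetical helpers.

```python
import json
from urllib.request import urlopen

# Keys of an OpenAPI path item that denote HTTP operations.
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "options", "head", "trace"}


def fetch_openapi_schema(base_url: str = "http://localhost:8000") -> dict:
    """Download the schema; /openapi.json is FastAPI's default path (assumed unchanged)."""
    with urlopen(f"{base_url}/openapi.json", timeout=10) as response:
        return json.load(response)


def list_operations(schema: dict) -> list[tuple[str, str]]:
    """Return (METHOD, path) pairs declared in an OpenAPI schema, sorted by path."""
    operations = []
    for path, item in schema.get("paths", {}).items():
        for key in item:
            # Skip non-operation keys such as "parameters" or "summary".
            if key.lower() in HTTP_METHODS:
                operations.append((key.upper(), path))
    return sorted(operations, key=lambda op: (op[1], op[0]))
```

With the server running, list_operations(fetch_openapi_schema()) gives a quick inventory of the available routes from a script.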

Docker

docker compose up

Serves the full app at http://localhost:8000 with data persisted in a Docker volume.

Project structure

core/       Python library (mowen)
  src/mowen/
    pipeline.py          Pipeline orchestrator
    evaluation.py        Cross-validation and metrics
    types.py             Core data types (Document, Event, Histogram, ...)
    canonicizers/        10 text preprocessors
    event_drivers/       41 feature extractors
    event_cullers/       15 feature selectors
    distance_functions/  26 distance/similarity metrics
    analysis_methods/    21 classifiers + verification methods
    tokenizers/          Pluggable word segmentation (whitespace, jieba)
    document_loaders/    File format readers (txt, pdf, docx, html)
    data/                Function word lists + 20 sample corpora
    compat/              JGAAP CSV import

cli/        Command-line interface (mowen-cli)
server/     FastAPI backend + static frontend serving (mowen-server)
web/        React/TypeScript frontend
tests/      790+ tests (pytest)
scripts/    Corpus build scripts

Development

# Install all packages in development mode
pip install -e core/ -e cli/ -e server/

# Run tests
python -m pytest tests/

# Lint
ruff check core/ cli/ server/ tests/

See docs/ONBOARDING.md for the full developer setup guide.

Optional dependencies

The core library has no required dependencies. Optional features:

Extra         Install                            Enables
nlp           pip install 'mowen[nlp]'           POS tagging, NER (spaCy)
transformers  pip install 'mowen[transformers]'  Transformer embeddings (HuggingFace)
chinese       pip install 'mowen[chinese]'       Chinese word segmentation (jieba)
wordnet       pip install 'mowen[wordnet]'       WordNet definition events (NLTK)
pdf           pip install 'mowen[pdf]'           PDF document loading (pdfplumber)
docx          pip install 'mowen[docx]'          DOCX document loading (python-docx)
html          pip install 'mowen[html]'          HTML document loading (BeautifulSoup)
all           pip install 'mowen[all]'           Everything above
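To see which optional features are usable in the current environment, one option is to test whether the underlying third-party module can be imported. The sketch below is an assumption-based helper, not part of mowen: the mapping from extra to module name is inferred from the libraries named in the table, and available_extras is a hypothetical function.

```python
from importlib.util import find_spec

# Extra name -> top-level module it is assumed to pull in (inferred from the table).
EXTRA_MODULES = {
    "nlp": "spacy",
    "transformers": "transformers",
    "chinese": "jieba",
    "wordnet": "nltk",
    "pdf": "pdfplumber",
    "docx": "docx",  # python-docx installs the module "docx"
    "html": "bs4",   # BeautifulSoup installs the module "bs4"
}


def available_extras() -> dict[str, bool]:
    """Report, per extra, whether its module can be found without importing it."""
    return {
        extra: find_spec(module) is not None
        for extra, module in EXTRA_MODULES.items()
    }
```

Using find_spec avoids actually importing heavy packages such as transformers just to check that they are present.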

License

MIT — see LICENSE.

Copyright 2026 John Noecker Jr.
