A filename tokenizer for manga, manhwa, manhua, and light novels, written in Rust.
chaptr takes a release filename — `[Group] Title v03 c042.5 (Digital).cbz`, `Some Light Novel v05 (Yen Press) (Digital) [LuCaZ].epub` — and returns a struct with the parts named: title, volume, chapter, group, source, edition, language, revision, extension. The shape mirrors what anitomy-rs does for anime release titles, but for the manga/LN side, where anitomy has no chapter or volume concept and treats `c042` as noise.
It exists because the next version of Ryokan (a self-hosted anime PVR) is being extended to cover manga and light novels, and the torrent-grab path needs structured filename parsing for dupe detection (the same chapter from different scanlation groups), upgrade detection (v2 revisions), range grabs (`c001-050 (Batch).cbz`), and Custom Format scoring. Doing that with template strings or one-off regexes inside the consumer is a known footgun, hence a real library.
```
cargo add chaptr
```

```rust
use chaptr::{manga, novel};

let m = manga::parse("[MangaPlus] Chainsaw Man v12 c103 (Digital).cbz");
assert_eq!(m.title.as_deref(), Some("Chainsaw Man"));
assert_eq!(m.group, Some("MangaPlus"));
assert_eq!(m.source, Some(manga::MangaSource::Digital));
assert_eq!(m.extension, Some("cbz"));
// m.volume and m.chapter are structured NumberRange values

let n = novel::parse("[Unpaid Ferryman] Youjo Senki v01-23 (2018-2024) (Digital) (LuCaZ)");
assert_eq!(n.group, Some("Unpaid Ferryman"));
assert_eq!(n.scanner, Some("LuCaZ"));
assert_eq!(n.year, Some(2018));
assert!(n.is_digital);
// n.volume is a range 1..=23
```

Both entry points are pure functions — string in, struct out, no I/O. String fields borrow from the input via `&'a str` where no normalization is needed. `title` is a `Cow<'a, str>` so underscore-as-space normalization (`B_Gata_H_Kei` → "B Gata H Kei") borrows when possible and allocates only when it must.
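A short sketch of that borrowing behavior, assuming `title` is an `Option<Cow<'_, str>>` as the quickstart's `as_deref()` call suggests:

```rust
use std::borrow::Cow;
use chaptr::manga;

// Clean input: the title borrows from the filename, no allocation.
let clean = manga::parse("Chainsaw Man v12 c103 (Digital).cbz");
assert!(matches!(clean.title, Some(Cow::Borrowed(_))));

// Underscores force a normalized, owned copy.
let snaked = manga::parse("B_Gata_H_Kei v01 (Digital).cbz");
assert_eq!(snaked.title.as_deref(), Some("B Gata H Kei"));
assert!(matches!(snaked.title, Some(Cow::Owned(_))));
```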
Volume detection:
- Single-token: `v01`, `Vol01`, `Volume01`, `S01` (season-style), `t6` / `T2000` (French/Batman)
- Multi-token: `Vol 1`, `Vol. 1`, `Volume 11`, `Том 1` (Russian), `Tome 2` (French)
- Decimals: `v1.1`, `v03.5`
- Ranges: `v01-09`, `v16-17`, `Том 1-4`
- Nested in parens/brackets: `(v01)`, `[Volume 11]`
- CJK postfix: `1巻` (Japanese), `第03卷` (Chinese), `13장` (Korean)
- Range validation rejects backward ranges: `vol_356-1` → 356, not 356-1 (see the sketch after this list)
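A sketch of the range handling; the `NumberRange` field names used here (`start`, an `Option` `end`) are assumptions for illustration, not the confirmed API:

```rust
use chaptr::manga;

// Forward range: both endpoints are captured.
let fwd = manga::parse("Title v01-09 (Digital).cbz");
let vol = fwd.volume.expect("volume range detected");
assert_eq!((vol.start, vol.end), (1, Some(9))); // field names assumed

// Backward "range": validation keeps only the leading number.
let back = manga::parse("Title vol_356-1 (Digital).cbz");
let vol = back.volume.expect("single volume detected");
assert_eq!((vol.start, vol.end), (356, None));
```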
Chapter detection:
- Single-token: `c001`, `Ch001`, `Chp02`, `Chapter001`, `Chapter11v2` (revision silently consumed)
- Multi-token: `Ch 4`, `Ch. 4`, `Chapter 12`, `Глава 3` (Russian), `Episode 406`
- Decimals: `c42.5`
- Ranges: `c001-008`
- CJK postfix: `第25话` / `章` / `回` / `회` / `화`
- Bare-number fallback: `Hinowa ga CRUSH! 018 (2019)` → 18 (requires following metadata; see the sketch after this list)
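A hedged sketch of that fallback's conservatism (the `chapter` field is an `Option`, per the no-`Result` design note below):

```rust
use chaptr::manga;

// Trailing metadata after the bare number qualifies it as a chapter.
let hit = manga::parse("Hinowa ga CRUSH! 018 (2019) (Digital).cbz");
assert!(hit.chapter.is_some()); // chapter 18

// A number that reads as part of the title is left alone.
let miss = manga::parse("Kaiju No. 8.cbz");
assert!(miss.chapter.is_none());
```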
Title extraction:
- Slices from after the leading group bracket(s) to the first marker, trailing bracket, or extension
- Normalizes underscores to spaces (`B_Gata_H_Kei` → "B Gata H Kei")
- Skips leading paren/bracket/curly chains (`(一般コミック) [奥浩哉] いぬやしき` → `いぬやしき`)
- Trims trailing punctuation (`-`, `.`, `_`, `,`, `#`, `:`)
- Disambiguates title-vs-chapter numbers (`Kaiju No. 8 036` → title `Kaiju No. 8`, chapter 36); two of these cases are shown runnable below
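Two of the cases above as assertions; the filenames are padded with illustrative metadata:

```rust
use chaptr::manga;

// Leading paren/bracket chains are skipped, not folded into the title.
let jp = manga::parse("(一般コミック) [奥浩哉] いぬやしき.cbz");
assert_eq!(jp.title.as_deref(), Some("いぬやしき"));

// A number after a numeric title is disambiguated: 036 is the chapter.
let kaiju = manga::parse("Kaiju No. 8 036 (2023) (Digital).cbz");
assert_eq!(kaiju.title.as_deref(), Some("Kaiju No. 8"));
```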
Group / publisher / scanner:
- Manga: first non-volume-keyword bracketed token
- Novel: first leading bracket that isn't a known publisher or scanner
- Publisher/scanner lookup against compile-time tables (Yen Press, J-Novel Club, Seven Seas, LuCaZ, Stick, CleanBookGuy, etc.)
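How the three fields fall out on a novel filename; the `group` and `scanner` assertions mirror the quickstart, while `publisher` as a plain string field is an assumption:

```rust
use chaptr::novel;

let n = novel::parse("[Unpaid Ferryman] Some Light Novel v05 (Yen Press) (Digital) [LuCaZ].epub");

// Leading bracket that is neither a known publisher nor scanner → group.
assert_eq!(n.group, Some("Unpaid Ferryman"));
// Compile-time table hits classify the rest.
assert_eq!(n.publisher, Some("Yen Press")); // field shape assumed
assert_eq!(n.scanner, Some("LuCaZ"));
```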
Source (manga): Digital, MangaPlus, Viz, Kodansha, Lezhin, Naver, Kakao — each with common aliases (Viz Media, Kodansha USA, Naver Webtoon, Kakao Page, Digital-HD).
Edition (manga): Omnibus, Uncensored, and compound forms like Omnibus Edition.
Language: ISO shortcodes (`[EN]`, `[JP]`, `[zh-tw]`), full English names (`(English)`, `(Japanese)`, `(Simplified Chinese)`), and native-script tags (简体中文, 繁體中文, 한국어, русский). Unqualified `CN` / `zh` / `Chinese` defaults to Simplified Chinese, matching Nyaa convention. `[Raw]` is deliberately not mapped — it's a format tag, not a language declaration.
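A hedged sketch of the Chinese default; the `Language` enum path and variant name are assumptions for illustration, not the confirmed API:

```rust
use chaptr::manga;

// Unqualified CN/zh defaults to Simplified Chinese, per Nyaa convention.
let zh = manga::parse("Title c001 [CN].cbz");
assert_eq!(zh.language, Some(manga::Language::SimplifiedChinese)); // variant name assumed

// [Raw] is a format tag, not a language declaration: nothing is set.
let raw = manga::parse("Title c001 [Raw].cbz");
assert_eq!(raw.language, None);
```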
LN-specific: `is_digital` / `is_premium` tags, year extraction, and revision from `{r2}`-style curly tags.
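The LN-side fields pulled together on one illustrative filename; the `(Premium)` tag spelling and the numeric `revision` shape are assumptions:

```rust
use chaptr::novel;

let n = novel::parse("[Group] Title v01 (2021) (Digital) (Premium) {r2}.epub");
assert!(n.is_digital);
assert!(n.is_premium);           // tag spelling assumed
assert_eq!(n.year, Some(2021));
assert_eq!(n.revision, Some(2)); // from the {r2} curly tag; numeric shape assumed
```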
| In | Out |
|---|---|
| Manga, manhwa, manhua filenames (`.cbz`, `.cbr`, `.zip`, `.7z`, `.rar`, `.pdf`) | Anime — use anitomy-rs |
| Light novel filenames (`.epub`, `.pdf`, `.azw3`, `.mobi`, `.txt`) | Web novels — content is HTML scraped into controlled EPUBs, no external filename to parse |
| String-in, struct-out | Sidecar reading (ComicInfo.xml, OPF) — belongs one layer up in the consumer |
| Compile-time tables for known groups, publishers, scanners | Network or filesystem I/O |
Manhwa / manhua live in the manga module. The grammar is close enough that splitting them wouldn't pay for the duplicated lexer logic; source-tag differences (Lezhin / Naver / Kakao) are carried by the `MangaSource` enum.
- 182 unit tests (including 12 named regression tests for past fixed bugs); clippy + fmt clean
- `corpus/manga_kavita.json` — 350 real-world fixtures lifted from Kavita's manga parser tests (GPL-3.0, per-entry attribution). Current aggregate pass rate: 98.5%. Per-method:

  | Method | Rate |
  |---|---|
  | ParseVolumeTest | 100% |
  | ParseDuplicateVolumeTest | 90.5% (Thai-only failures) |
  | ParseChaptersTest | 100% |
  | ParseDuplicateChapterTest | 100% |
  | ParseExtraNumberChaptersTest | 100% |
  | ParseSeriesTest | 97.7% (Thai-only failures) |
  | ParseEditionTest | 100% |

- `corpus/novel_nyaa.json` — 15 hand-picked LN fixtures from Nyaa with full-struct field assertions across nine fields (group, volume_range, publisher, scanner, language, extension, revision, is_digital, is_premium). Current pass rate: 100% (52/52 field asserts).
- `corpus/chapters_mihon.json` — 54 chapter-number edge cases from Mihon (the Tachiyomi successor), Apache-2.0. Converted to `f64` at assertion time for comparison against our `ChapterNumber { whole, decimal }` model. Detection coverage: 54% (our bare-number heuristic is intentionally more conservative than Mihon's); whole-part accuracy when detected: 86% (4 of 29 detected chapters have the wrong integer — an alpha-suffix convention divergence: Mihon maps `Ch.4.a` → 4.1, ours → 4.5).
- `corpus/smoke_novel.txt` — 512 real Nyaa LN filenames. `novel::parse` must not panic on any of them; manga entries are deliberately included as false positives so the LN parser also has to degrade gracefully on out-of-domain input. An aggregate-stat test asserts floors on title-detection (≥95%, currently 100%) and volume-detection (≥75%, currently 82%) rates so silent regressions surface even without per-entry assertions (sketch below).
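A sketch of what that floor test can look like; the field names and harness details are illustrative, not the crate's exact test code:

```rust
// Aggregate-stat floors over corpus/smoke_novel.txt (illustrative harness).
#[test]
fn smoke_novel_floors() {
    let lines: Vec<&str> = include_str!("../corpus/smoke_novel.txt").lines().collect();
    let parsed: Vec<_> = lines.iter().map(|l| chaptr::novel::parse(l)).collect();

    let rate = |hits: usize| hits as f64 / lines.len() as f64;
    let titles = rate(parsed.iter().filter(|p| p.title.is_some()).count());
    let volumes = rate(parsed.iter().filter(|p| p.volume.is_some()).count());

    // Floors, not exact values: silent regressions surface, minor churn passes.
    assert!(titles >= 0.95, "title detection fell to {titles:.3}");
    assert!(volumes >= 0.75, "volume detection fell to {volumes:.3}");
}
```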
All documented in the module-level doc comments. The ones that show up as corpus failures:
- Thai `เล่ม` / `เล่มที่` — Ryokan's intended upstream (Nyaa English-translated) doesn't carry Thai script; supporting it would need lexer changes for Thai combining marks plus additional keyword entries. The five remaining Kavita failures are all in this bucket.
- X-suffix ranges — `c001-006x1` (rare Kavita syntax)
- Kavita "special" empty-series cases — filenames like `Love Hina - Special.cbz` where Kavita expects an empty series; no oneshot/special detection yet
Closed in 1.5.0: language detection (`[EN]` / `(English)` / 简体中文 / etc.) — previously a stubbed field returning `None`. Expanded the LN corpus from 8 to 15 fixtures, now covering language/extension/revision/is_premium fields in addition to the original group/volume/publisher/scanner/is_digital set.
Closed in 1.4.0: reverse-range CJK (`38-1화` → 38), #N-at-end chapter detection (`Episode 3 ... #02` → 2), mixed-prefix chapter ranges (`c01-c04` → 1-4), trailing title-dot preservation (`Hentai Ouji... Neko.`), Russian postfix Том (`5 Том Test` → vol 5).
Closed in 1.2.0: Korean 시즌 multi-char prefix (`시즌34삽화2` → volume 34), alpha-suffix decimals (`Beelzebub_153b` → 153.5, per Kavita convention).
- One library, two modules, shared lexer. `manga::parse` and `novel::parse` consume the same `Token` stream from `lexer::tokenize`, with domain-specific L2 classifiers on top. Shared detectors (volume, chapter, title slicing, CJK markers) live in a private `common` module so a bug fix hits both domains identically.
- `ChapterNumber` is `(whole: u32, decimal: Option<u16>)`, not `f64`. Sort keys, equality, and hashing all work without precision footguns, and decimal chapters (`c42.5`) and revisions (`c42v2`) are distinct values, not colliding ones (see the sketch after this list).
- Lookup tables are compile-time `&'static [(&str, T)]` slices. They graduate to `phf::Map` when any single table exceeds ~20 entries (none do yet).
- No `Result` in the public API. Parsing never fails — it produces a `Parsed` with `None` fields for anything it can't extract.
- Inline `#[cfg(test)] mod tests` at the bottom of each file; no top-level `tests/` directory.
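Why the non-`f64` representation matters for ordering, assuming `ChapterNumber` derives `Ord` and has public fields (the import path and construction here are illustrative):

```rust
use chaptr::manga::ChapterNumber; // import path assumed

let c42   = ChapterNumber { whole: 42, decimal: None };
let c42_5 = ChapterNumber { whole: 42, decimal: Some(5) };
let c43   = ChapterNumber { whole: 43, decimal: None };

// Lexicographic (whole, decimal) ordering: 42 < 42.5 < 43 with no float
// rounding, and a c42.5 decimal can never collide with a c42v2 revision.
assert!(c42 < c42_5 && c42_5 < c43);
```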
Per-call cost is microseconds on modern hardware: tokenization plus a few small token scans, with no heap allocation in the common path (title only allocates when underscore-to-space normalization forces it).
| Bench | Time |
|---|---|
| `manga::parse` on a typical filename | ~1.0 µs |
| `manga::parse` with CJK markers | ~1.5 µs |
| `novel::parse` on a typical filename | ~0.6 µs |
| 512-entry LN smoke corpus batch | ~0.8 ms (~660 K entries/sec) |
For a 100-torrent Nyaa search result batch, total parse time is under 0.15 ms — the parser isn't the bottleneck in a search pipeline.
(1.2.0 dropped ~33% off every bench by removing a per-`Word` `Vec<char>` allocation on the CJK marker check path, which every parse walks through regardless of whether the word has CJK chars.)
Run `cargo bench` for fresh numbers on your hardware. The `benches/` directory has a Criterion harness with representative inputs.
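Adding a case is a standard Criterion bench; a minimal sketch in the same spirit (bench name and input are illustrative):

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;

fn bench_manga_parse(c: &mut Criterion) {
    c.bench_function("manga::parse typical", |b| {
        b.iter(|| chaptr::manga::parse(black_box(
            "[MangaPlus] Chainsaw Man v12 c103 (Digital).cbz",
        )))
    });
}

criterion_group!(benches, bench_manga_parse);
criterion_main!(benches);
```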
```
cargo build
cargo test
cargo clippy --all-targets -- -D warnings
```

Requires Rust 1.95+ (edition 2024). No native dependencies, no build script.
Licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.
Filename strings under `corpus/` are lifted with attribution from upstream sources and carry their own licenses, separate from chaptr's MIT/Apache-2.0:
- Kavita (GPL-3.0). Used only as test data; the test harness reads `corpus/manga_kavita.json` to validate parses, but no Kavita code is linked into the chaptr crate. If GPL-tainted test data is a concern for your downstream use, exclude `corpus/` from your build.
- Mihon (Apache-2.0; same text as LICENSE-APACHE).