yomitori

Tools to generate custom student editions of Japanese texts. Default output is Kindle-sized PDF files, but HTML, Word, and LibreOffice are also supported. This is very much a work in progress, and no matter how messy it looks, trust me that it's far better than the version I've been reading novels with for the past two years.

It runs under Windows with a little bit of work. See windows.md for details.

a4r is a similar project, focused on faithfully converting Aozora Bunko markup to PDF using pLaTeX, without surgically extracting vocabulary or augmenting the original furigana. I intend to borrow their code for handling the many poorly-documented [#...] extensions to the basic AB markup; it's much better than what I was using.
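
For readers who haven't seen Aozora Bunko markup, here is a minimal Python sketch of the two constructs mentioned above: ruby written as 《...》 after its base text (optionally preceded by a ｜ marking where the base starts), and [#...] annotations, which AB files normally write with full-width brackets. This is not ab2yt's code; the function names and the kanji character class are assumptions made for illustration, and real files contain many cases it ignores.

```python
import re

# Editor's annotations; accept both full-width ［＃...］ and ASCII [#...] forms.
ANNOTATION = re.compile(r"[［\[][＃#][^］\]]*[］\]]")
# Ruby base is either everything after an explicit ｜ marker or a run of kanji.
RUBY_PAIR = re.compile(r"(?:｜([^《｜]+)|([\u4e00-\u9fff々〆ヶ]+))《([^》]+)》")
RUBY = re.compile(r"《[^》]*》")


def extract_ruby(line):
    """Return (base, reading) pairs found in one line of AB text."""
    return [(m.group(1) or m.group(2), m.group(3)) for m in RUBY_PAIR.finditer(line)]


def strip_ab_markup(line):
    """Remove annotations, ruby readings, and ruby start markers."""
    line = ANNOTATION.sub("", line)
    line = RUBY.sub("", line)
    return line.replace("｜", "")


if __name__ == "__main__":
    sample = "吾輩《わがはい》は｜猫である《ねこである》。［＃「猫」に傍点］"
    print(extract_ruby(sample))
    print(strip_ab_markup(sample))
```

Removing annotations before ruby keeps a 《 inside an annotation from being mistaken for a reading.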

Requirements

  • Perl 5.10.1+ with DBI, DBD::SQLite, Text::MeCab, XML::Twig, Archive::Zip
  • JMdict, JMnedict, kanjidic2
  • MeCab
  • Unidic: use the full source distribution (unidic-mecab_kana-accent) or one of the binaries; the smaller source zipfile is missing several fields
  • TeXLive 2013
  • dviasm.py (distributed with TeXLive)

Optional

Tools

  • ytmakedict: convert JMdict, JMnedict, and kanjidic2 into a simple SQLite database
  • mobi2yt: convert text from DRM-free, unpacked AZW3 files. Currently very basic.
  • ab2yt: strip out Aozora Bunko markup and optionally save the original ruby to a file, removing the ones that are normal dictionary readings. Attempts to guess input encoding, which can be overridden on the command line. Currently pretty basic, ignoring [#...] markup.
  • ytgloss: add readings and English definitions to a UTF8-encoded text file
  • ytknown: strip out definitions and readings for words the user knows
  • ytruby: convert the embedded readings into proper furigana by stripping out leading, trailing, and interior kana
  • yt2latex: format a document for processing with upLaTeX
  • yt2odt: convert to LibreOffice/OpenOffice, with basic ruby support
  • yt2word: convert to Word HTML, with vertical text and basic ruby support
  • yt2html: convert to HTML with ruby tags and jQueryUI-based tooltips
  • dvicleanruby: use dviasm.py to strip furigana that appear more than once per page.
  • ytvocab: extract a vocabulary list from a document, optionally incorporating page-number information from the upLaTeX .aux file. Normal text output can be pasted into Word/LibreOffice and converted to a table.
  • ytdegloss: strip all embedded readings and glosses from a file; useful for comparing versions of a document.
  • Yomitori.pm: utility functions
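
To make the ytruby step concrete, here is a minimal Python sketch of the idea of stripping leading, trailing, and interior kana from a word's reading so that only the kanji carry furigana. It is an illustration under assumptions, not ytruby's actual algorithm: the function names are hypothetical, and the regex-based alignment is simply one easy way to do it.

```python
import re


def is_kana(ch):
    # Hiragana and katakana blocks (U+3041 to U+30FF), including the long-vowel mark.
    return "\u3041" <= ch <= "\u30ff"


def kata_to_hira(text):
    # Map katakana letters (U+30A1 to U+30F6) onto the matching hiragana.
    return "".join(chr(ord(c) - 0x60) if "\u30a1" <= c <= "\u30f6" else c for c in text)


def split_runs(word):
    """Split a word into (text, is_kana) runs, e.g. 食べ物 -> 食 / べ / 物."""
    runs = []
    for ch in word:
        kana = is_kana(ch)
        if runs and runs[-1][1] == kana:
            runs[-1] = (runs[-1][0] + ch, kana)
        else:
            runs.append((ch, kana))
    return runs


def align_furigana(word, reading):
    """Return [(base, ruby or None), ...]; kana runs get no ruby."""
    runs = split_runs(word)
    # Kana in the word must appear literally in the reading; each
    # non-kana run captures whatever part of the reading is left over.
    pattern = "".join(
        re.escape(kata_to_hira(text)) if kana else "(.+?)" for text, kana in runs
    )
    match = re.fullmatch(pattern, kata_to_hira(reading))
    if match is None:
        # Reading does not line up with the kana in the word: keep it whole.
        return [(word, reading)]
    groups = iter(match.groups())
    return [(text, None if kana else next(groups)) for text, kana in runs]


if __name__ == "__main__":
    print(align_furigana("食べ物", "たべもの"))
    print(align_furigana("お茶", "おちゃ"))
```

A word with no kana in it, such as 今日, comes back as a single segment with the whole reading attached, which is what you want for irregular readings.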

Basic Usage

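# One-time setup: select Japanese fonts for TeX Live and build the dictionary database
kanji-config-updmap auto
ytmakedict

# Extract plain text from either an Aozora Bunko file or an unpacked AZW3
ab2yt foo.ab --encoding cp932 --ruby orig-ruby.txt > foo.txt
mobi2yt foo-azw3-unpacked --ruby orig-ruby.txt > foo.txt

# Add readings and glosses, then strip them for words you already know
ytgloss -f fixparse.txt -g fixgloss.txt foo.txt |
    ytknown -k known.txt -r rubyonly.txt > foo.yt

# Typeset the PDF, removing repeated furigana before the final conversion
ytruby foo.yt | yt2latex > foo.tex
uplatex foo.tex
dvicleanruby foo.dvi
dvipdfmx foo.dvi

# Vocabulary list, using page numbers from foo.aux
ytvocab -t foo.aux -l foo.yt > foo-vocab.tex
uplatex foo-vocab.tex
dvipdfmx foo-vocab.dvi

# Other output formats
ytruby foo.yt | yt2odt -o foo.odt
ytruby foo.yt | yt2word > foo.doc
yt2html foo.yt > foo.html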
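
If you run the PDF steps often, they can be wrapped in a small driver. The following Python sketch only assembles the command lines shown above for a given file stem and prints them unless asked to run them; the function names are hypothetical, and it assumes fixparse.txt, fixgloss.txt, known.txt, and rubyonly.txt sit in the current directory as in the example.

```python
import shlex
import subprocess


def pdf_commands(stem):
    """Command lines that turn <stem>.txt into <stem>.pdf, as in Basic Usage."""
    q = shlex.quote(stem)
    return [
        f"ytgloss -f fixparse.txt -g fixgloss.txt {q}.txt | "
        f"ytknown -k known.txt -r rubyonly.txt > {q}.yt",
        f"ytruby {q}.yt | yt2latex > {q}.tex",
        f"uplatex {q}.tex",
        f"dvicleanruby {q}.dvi",
        f"dvipdfmx {q}.dvi",
    ]


def run(commands, dry_run=True):
    """Print each command; execute it through the shell unless dry_run is set."""
    for cmd in commands:
        print(cmd)
        if not dry_run:
            # check=True stops at the first failing step. For a pipeline the
            # shell reports only the exit status of the last command in it.
            subprocess.run(cmd, shell=True, check=True)


if __name__ == "__main__":
    run(pdf_commands("foo"))  # dry run: prints the five commands
```

Keeping the dry run as the default makes it easy to check the commands before anything overwrites foo.yt or foo.tex.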

TODO

  • mobi2yt: detect inline JPEGs used to replace non-sJIS characters (such as roman numerals); publisher workflow seems to leave these in Kindle books even though they're UTF-8 XHTML
  • document the config file and known/rubyonly/fix* files, with samples
  • yt2html: use decent CSS styling
  • ab2yt: process [#...] markup, warn about embedded HTML
  • add additional paper sizes to yt2latex
  • clean up ytmakedict code and output
  • explain the tricky bits
  • Unidic glitch: can't match いつの間にか because Unidic returns あいだ as the reading for 間; this is probably like getting た for 他 in too many contexts, sigh.
  • Unidic glitch: 今日一日 returns ついたち instead of いちにち
  • small tool to more easily override Unidic glitches; writing Perl one-liners to extract and modify the existing records gets old fast.
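
As a starting point for that last item, here is a rough Python sketch of what a phrase-level override could look like. Everything in it is hypothetical: it assumes the parser output is available as (surface, reading) pairs, that the phrases split into the tokens shown, and that overrides are keyed on a sequence of surface forms. None of this reflects the existing fix* file formats.

```python
def apply_overrides(tokens, overrides):
    """Replace readings wherever a run of surface forms matches an override.

    tokens:    list of (surface, reading) pairs from the parser
    overrides: dict mapping a tuple of surfaces to a tuple of corrected readings
    """
    out = []
    i = 0
    while i < len(tokens):
        for surfaces, readings in overrides.items():
            n = len(surfaces)
            if tuple(s for s, _ in tokens[i:i + n]) == surfaces:
                out.extend(zip(surfaces, readings))
                i += n
                break
        else:
            # No override starts at this token: keep the parser's reading.
            out.append(tokens[i])
            i += 1
    return out


if __name__ == "__main__":
    # Assumed tokenization, for illustration only.
    overrides = {
        ("今日", "一日"): ("きょう", "いちにち"),
        ("いつ", "の", "間", "に", "か"): ("いつ", "の", "ま", "に", "か"),
    }
    tokens = [("今日", "きょう"), ("一日", "ついたち"), ("休む", "やすむ")]
    print(apply_overrides(tokens, overrides))
```

Matching on the surrounding tokens rather than on a single word keeps the override from changing 一日 or 間 in contexts where the parser's reading is already right.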