A set of tools for parsing and studying Japanese

himselfv/jptools

This set contains several command-line tools for people studying kanji and the Japanese language, plus some Delphi libraries/units for parsing and writing file formats commonly encountered when working with Japanese text.

AnkiList

These tools generate tab-separated files suitable for importing into Anki or updating some fields of your Anki deck.

  • AnkiKanjiList - converts a raw kanji list to a tab-separated list with on/kun readings and meanings (uses a KANJIDIC-compatible dictionary).

  • AnkiWordList - converts a word/expression list to a tab-separated list of words and translations (uses an EDICT/CEDICT-compatible dictionary).

  • AnkiExampleList - converts a word/expression list to a tab-separated list of words and example sentences (uses a corpus compatible with the Tanaka Corpus).
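As an illustration of the kind of output these tools produce, here is a minimal Python sketch that turns a raw kanji list into tab-separated rows. It is not the AnkiKanjiList implementation: the `SAMPLE_ENTRIES` table, the `kanji_list_to_tsv` function and the column order are all assumptions made for the example, and a real run would load its data from a KANJIDIC-compatible file.

```python
import csv
import io

# Hypothetical in-memory lookup table (an assumption for this sketch);
# the real tool reads a KANJIDIC-compatible dictionary instead.
SAMPLE_ENTRIES = {
    "日": {"on": ["ニチ", "ジツ"], "kun": ["ひ", "か"], "meanings": ["day", "sun"]},
    "本": {"on": ["ホン"], "kun": ["もと"], "meanings": ["book", "origin"]},
}


def kanji_list_to_tsv(kanji_list, entries):
    """Return one tab-separated row per known kanji: kanji, on, kun, meanings."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for kanji in kanji_list:
        entry = entries.get(kanji)
        if entry is None:
            # Kanji missing from the dictionary are skipped in this sketch.
            continue
        writer.writerow([
            kanji,
            ", ".join(entry["on"]),
            ", ".join(entry["kun"]),
            ", ".join(entry["meanings"]),
        ])
    return out.getvalue()


if __name__ == "__main__":
    print(kanji_list_to_tsv(["日", "本"], SAMPLE_ENTRIES), end="")
```

Each output line maps to one Anki note, with one field per tab-separated column.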

This converter turns Warodai (a large classical JP->RU dictionary) into EDICT2- and JMDict-style dictionaries, which you can use in any program that supports those formats.

The conversion is still far from perfect, but it works, and the resulting dictionary is usable, with more than 100,000 entries.

How to use:

Or download converted dictionary:

Miscellaneous

These tools may be useful on their own or as examples of how to work with their underlying libraries.

  • KanjiStats: lists kanji by frequency in a given text.

  • kanjistats_4Gb: kanji sorted by frequency of appearance in 21,000 Japanese books.

  • KanjiList: manipulates kanji lists (trim, merge, intersect, etc.).

  • AozoraTxt: strips Aozora ruby from a text or gives some statistical info about it.

  • MiscTxt: gives some common statistical info about a text (number of kana and kanji, character and line counts).

  • YarxiKanjiInfo: uses the Yarxi database parser to extract kanji information.
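The text-processing tasks above can be sketched in a few lines of Python. This is only an approximation of what KanjiStats, AozoraTxt and MiscTxt do, not their actual code: the function names are made up for the example, kanji are taken to be the CJK Unified Ideographs block plus Extension A, kana are taken to be the whole Hiragana and Katakana blocks, and ruby is assumed to use the usual Aozora Bunko `《…》` readings with an optional `｜` marking where the base text starts.

```python
import re
from collections import Counter


def is_kanji(ch):
    # Assumption: CJK Unified Ideographs plus Extension A only.
    return "\u4e00" <= ch <= "\u9fff" or "\u3400" <= ch <= "\u4dbf"


def is_kana(ch):
    # Assumption: the whole Hiragana and Katakana blocks count as kana.
    return "\u3040" <= ch <= "\u30ff"


def kanji_by_frequency(text):
    """Return (kanji, count) pairs, most frequent first."""
    return Counter(ch for ch in text if is_kanji(ch)).most_common()


RUBY_READING = re.compile(r"《[^》]*》")


def strip_aozora_ruby(text):
    """Drop ruby readings and the base-text start marker (simplified)."""
    return RUBY_READING.sub("", text).replace("｜", "")


def text_stats(text):
    """Count kana, kanji, characters (excluding line breaks) and lines."""
    lines = text.splitlines()
    return {
        "kana": sum(1 for ch in text if is_kana(ch)),
        "kanji": sum(1 for ch in text if is_kanji(ch)),
        "chars": sum(len(line) for line in lines),
        "lines": len(lines),
    }


if __name__ == "__main__":
    plain = strip_aozora_ruby("日本語《にほんご》の本\n日々")
    print(plain)
    print(kanji_by_frequency(plain))
    print(text_stats(plain))
```

Stripping ruby before counting matters: otherwise the kana inside the readings would inflate the kana count.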

Libraries

Delphi libraries for common CJK-related tasks.

  • JWBIO: fast stream reader/writer with encoding detection and a number of encodings out of the box, including JIS/Shift-JIS, GB, UTF-16/UTF-8 and other common Japanese ones.
  • KanjidicReader: KANJIDIC-style dictionary parser + basic in-memory representation ("load and use").
  • EdictReader: EDICT/CCEDICT dictionary format parser (very forgiving of deviations from the format) + in-memory representation.
  • EdictWriter: programmer-friendly EDICT1/EDICT2/JMDICT file generator.
  • AozoraTxt parser: parses text files in Aozora Bunko format.
  • UnihanReader: simple Unihan database parser.
  • KanaConv: romaji-katakana-hiragana conversions; supports common and custom romaji schemes, including several at once (not yet moved here from the Wakan project).
  • YarxiReader
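To give an idea of what parsing one of these formats involves, here is a small Python sketch that reads a single EDICT1-style line of the form `expression [reading] /gloss/gloss/`. It does not reflect the EdictReader API or its tolerance for format deviations: `parse_edict_line` and the returned dictionary layout are hypothetical and exist only for this example.

```python
import re

# EDICT1-style entry: "expression [reading] /gloss1/gloss2/".
# The bracketed reading is optional (kana-only entries have none).
EDICT_LINE = re.compile(
    r"^(?P<head>\S+)\s+(?:\[(?P<reading>[^\]]+)\]\s+)?/(?P<glosses>.*)/\s*$"
)


def parse_edict_line(line):
    """Parse one entry line; return None if it does not look like an entry."""
    match = EDICT_LINE.match(line.strip())
    if match is None:
        return None
    return {
        "expression": match.group("head"),
        "reading": match.group("reading"),  # None for kana-only entries
        "glosses": [g for g in match.group("glosses").split("/") if g],
    }


if __name__ == "__main__":
    print(parse_edict_line("漢字 [かんじ] /(n) kanji/Chinese character/"))
    print(parse_edict_line("かな /(n) kana/"))
```

A strict pattern like this rejects anything unexpected; a parser meant to be forgiving would need fallbacks for missing slashes, extra whitespace and similar deviations.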

Downloads

Latest jptools.zip (AnkiKanjiList/WordList, AozoraTxt, KanjiStats and more)

All downloads

Building

May be required for some projects:

  • Wakan
  • SQLite3.pas
  • SQLite3Dataset.pas

At runtime:

  • sqlite3.dll
  • EDICT2
  • kanjidic
  • radkfile
  • ewarodai.txt
  • yarxi.db
