Extracts the data from the Wenlin dictionary program.
Unfortunately, this great data is wrapped by a less-than-great UI. This code is intended to be useful to Chinese language students who wish to interact with the data on their own terms.
The tool ships as a Ruby gem, and the standard installation process applies. The code relies on Ruby 1.9 syntax and String encoding. It was tested to work with MRI 1.9.3.
gem install wenlin_db_scanner
The following commands assume that the current directory of your Command Prompt is the Wenlin application's main directory. If your current directory contains a W4DB directory, you're probably in the right place.
Parses a dictionary database into a file containing one JSON line per entry.
wenlin_dict W4DB/ en-zh > en_zh.json
wenlin_dict W4DB/ zh-en > zh_en.json
wenlin_dict W4DB/ hz-en > hz_en.json
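Because each output line is a standalone JSON document, the files can be streamed without loading everything into memory. The following is a minimal sketch, not part of the gem: the helper names `each_wenlin_entry` and `key_counts` are made up for illustration, and the code deliberately assumes nothing about the field names inside each entry. It only tallies whichever top-level keys show up, which is a quick way to discover the record shape.

```ruby
require 'json'

# Yields each non-blank line of a JSON-lines file as a parsed Ruby object.
# (Hypothetical helper; not part of the wenlin_db_scanner API.)
def each_wenlin_entry(path)
  return enum_for(:each_wenlin_entry, path) unless block_given?

  File.foreach(path, encoding: 'UTF-8') do |line|
    line = line.strip
    next if line.empty?
    yield JSON.parse(line)
  end
end

# Counts how often each top-level key appears, without assuming any key names.
def key_counts(path)
  counts = Hash.new(0)
  each_wenlin_entry(path) do |entry|
    next unless entry.is_a?(Hash)
    entry.each_key { |key| counts[key] += 1 }
  end
  counts
end

# Example (assumes zh_en.json was produced by the command above):
#   p key_counts('zh_en.json')
```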
Parses the database that breaks down hanzi (Chinese characters) into components.
wenlin_hanzi W4DB > hanzi.json
Parses a parts-of-speech database into a file containing one JSON line per part of speech.
The parts of speech are referenced by the word definition databases, which use their abbreviations.
wenlin_parts W4DB/ en > en_parts.json
wenlin_parts W4DB/ zh > zh_parts.json
Extracts the raw text entries in a .db file. Useful for debugging and understanding the record format.
The scripts in the bin directory are thin wrappers over the API. Read them if you want to use the Ruby API directly.
It is very likely that you'll get your job done faster by using the output of the CLI tools.
I test this code by running the tools inside bin against the Wenlin databases, and by spot-checking the output.
This tool works fairly well on the Wenlin 4 data files. Bugfixes and support for new .db file formats are welcome. Other features are most likely outside the project's scope.
Note that this tool is designed to help move the data into another program, so it only supports full table scans. Support for random access using the B-tree indexes is outside the scope of this project.
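If you need keyed lookups, one option is to build your own index from the exported JSON in a single pass. This is only a sketch under stated assumptions: `build_index` is a hypothetical helper, not part of the gem, and the `key` argument is whatever field name you find in your own output, since this README does not document the entry fields.

```ruby
require 'json'

# Builds an in-memory Hash from one pass over a JSON-lines file.
# `key` is a caller-supplied field name (hypothetical; inspect your own
# output to pick one). Entries sharing a key value are grouped in an Array.
def build_index(path, key)
  index = Hash.new { |hash, k| hash[k] = [] }
  File.foreach(path, encoding: 'UTF-8') do |line|
    next if line.strip.empty?
    entry = JSON.parse(line)
    # Skip records that are not objects or that lack the chosen field.
    next unless entry.is_a?(Hash) && entry.key?(key)
    index[entry[key]] << entry
  end
  index
end
```

The index lives entirely in memory, so this suits one-off scripts. For anything larger, loading the same lines into a real database is probably the better route.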
This code is licensed under the CC0 Public Domain license.