A project for converting the kanjidic file (and, in the future, the edict and Tanaka corpus files) from Jim Breen's wwwjdic project into a database format.


Database generator for Jim Breen's wwwjdic kanjidic file

This is a command line app that has been built alongside a
Rails app, so that there is easy access to things like
ActiveRecord and the various Rake tasks that Rails developers
are used to.


Note: this currently only works with Ruby 1.9.1,
tested against:

* ruby 1.9.1p129 (2009-05-12 revision 23412) [i686-linux]
* ruby 1.9.1p378


Current Status of this Project

At the moment, running the specs should run the tests and
generate an SQLite3 database from kanjidic.
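
Assuming the standard RSpec rake task is available (an assumption;
check the project's Rakefile for the actual task name), that is
something like:

    rake spec

which builds the database as a side effect of the test run.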

The basic structure of the database can be seen by 
examining schema.rb or looking at the models.

In summary:

The main model is Kanji, and various other models represent
the other kinds of data in kanjidic.

For an introduction to the kanjidic file structure please
see Jim Breen's page at http://www.csse.monash.edu.au/~jwb/kanjidic.html
or see http://www.csse.monash.edu.au/~jwb/kanjidic_doc.html for more
detailed information.

This application imports all the data from the kanjidic file. I perceive
the data to be of two basic types:

* Language related data
* Dictionary indexes


Language related data

The language related models are:

* Meaning (the English meaning of a Kanji)
* Korean (the Korean reading of a Kanji)
* Pinyin (the Chinese Pinyin reading of a Kanji)

Each of these models has its own table and a join table
linking it to the Kanji model, as sketched below.
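
As a rough sketch, the associations might look something like this in
ActiveRecord (the model names follow the list above, but the exact
table and join table names are assumptions; schema.rb is the
authoritative source):

    class Kanji < ActiveRecord::Base
      # Each kind of language related data joins to Kanji
      # through its own join table.
      has_and_belongs_to_many :meanings
      has_and_belongs_to_many :koreans
      has_and_belongs_to_many :pinyins
    end

    class Meaning < ActiveRecord::Base
      # The join table allows one meaning to be shared
      # by several kanji.
      has_and_belongs_to_many :kanjis
    end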

Dictionary indexes

The majority of the data refers to various dictionary indexes
and study book indexes such as James Heisig's "Remembering the Kanji".

These indexes have been moved into the kanji_lookups table,
where the dictionary_id identifies which dictionary or
index each entry refers to.
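
For example, finding a kanji's index number in a particular dictionary
might look like the query below (the Dictionary and KanjiLookup model
names and their column names are assumptions based on the kanji_lookups
table described above):

    # Hypothetical lookup of the Heisig index for a kanji record.
    dictionary = Dictionary.find_by_name("Remembering the Kanji")
    lookup = KanjiLookup.find_by_kanji_id_and_dictionary_id(kanji.id,
                                                            dictionary.id)
    puts lookup.index_number  # assumed column holding the index value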



Future plans for this project

* Using rsync, do an hourly check, download only the changes to the kanjidic file, and
update the database accordingly.
* Do periodic entire rebuilds of the database.
* Provide a copy of this database for the wwwjdic website to allow people to get up-to-date
versions of the database.
* Incorporate data from edict into its own tables.
* Incorporate example sentences from the Tanaka corpus (Tatoeba project).
* Create join tables from Tanaka corpus sentences to kanji, based on the reading and sense number.
* Possibly automate discovery of likely joins between edict words and kanjidic entries.
(This is possibly a research project in itself, but it may be worth finding some exact matches
and at least creating a 'possible matches' table. This would mean that any client applications
using the database may be able to match edict entries to kanji reading entries, and
also Tanaka corpus sentences to their corresponding kanji reading entries.)