Database generator for Jim Breen's wwwjdic kanjidic file
=======================================================

This is basically a command-line app that has been built alongside a Rails app, so that there is easy access to things like ActiveRecord and the various Rake tasks that Rails developers are used to.

_____________

Note: this currently works only with Ruby 1.9.1. Tested against:

* ruby 1.9.1p129 (2009-05-12 revision 23412) [i686-linux]
* ruby 1.9.1 p378

_____________

Current Status of this Project
==============================

At the moment, running the spec should run the tests and generate a SQLite3 database from kanjidic.

The basic structure of the database can be seen by examining schema.rb or looking at the models. In summary: the main model is Kanji, and there are various other models which refer to the other things in kanjidic.

For an introduction to the kanjidic file structure, please see Jim Breen's page at http://www.csse.monash.edu.au/~jwb/kanjidic.html, or see http://www.csse.monash.edu.au/~jwb/kanjidic_doc.html for more detailed information.

This application imports all the data from the kanjidic file. I perceive the data to be of two basic types:

* Language related data
* Dictionary indexes

_____________________

Language related data
=====================

The language related models are:

* Kunyomi
* Onyomi
* Nanori
* Meaning (the English meaning of a Kanji)
* Korean (the Korean reading of a Kanji)
* Pinyin (the Chinese Pinyin reading of a Kanji)

Each of these models has its own table and a join table that links it to the Kanji model.

Dictionary indexes
==================

The majority of the data refers to various dictionary indexes and study book indexes, such as James Heisig's "Remembering the Kanji". These indexes have been moved into the kanji_lookups table, where the dictionary_id can be used to find out which dictionary or index an entry refers to.

_________________________

Future
=======

Future plans for this project.

* Using rsync, do an hourly check, download only the changes to the kanjidic file, and update the database accordingly.
* Do periodic entire rebuilds of the database.
* Provide a copy of this database for the wwwjdic website to allow people to get up-to-date versions of the database.
* Incorporate data from edict into their own tables.
* Incorporate example sentences from the Tanaka Corpus (Tatoeba project).
* Create join tables to the kanji for the Tanaka Corpus based on the reading and sense number.
* Possible automated discovery of likely joins between edict words and kanjidic entries. This is possibly a research project in itself, but it may be worth finding some exact matches and at least creating a 'possible matches' table. Client applications using the database might then be able to match edict entries to kanji reading entries, and also Tanaka Corpus sentences to their corresponding kanji reading entries.
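
To make the split between language related data and dictionary indexes more concrete, here is a minimal sketch of how a single kanjidic line could be broken into those two groups. It is not the importer used by this project. The method name `parse_kanjidic_line`, the shape of the returned hash, and the abridged sample line are assumptions made for illustration. The field prefixes (`Y` for Pinyin, `W` for Korean, `T1` for nanori, braces for meanings) follow the kanjidic documentation linked above.

```ruby
# encoding: utf-8

# Abridged sample line in the kanjidic format (illustrative only).
SAMPLE_LINE = "亜 3021 U4e9c B1 G8 S7 F1509 N43 L1809 Yya4 Wa ア つ.ぐ " \
              "T1 や つぎ つぐ {Asia} {rank next} {come after} {-ous}"

# Hypothetical helper: splits one kanjidic line into language related data
# and dictionary index codes. The hash layout is an assumption of this sketch.
def parse_kanjidic_line(line)
  # Meanings sit inside braces and may contain spaces, so extract them first.
  meanings = line.scan(/\{([^}]*)\}/).flatten
  tokens = line.gsub(/\{[^}]*\}/, "").split

  entry = {
    kanji: tokens.shift, jis: tokens.shift,
    onyomi: [], kunyomi: [], nanori: [],
    korean: [], pinyin: [], meanings: meanings,
    lookups: {} # index code prefix => list of values, e.g. "L" => ["1809"]
  }

  section = :reading # switches when a T1 (nanori) or T2 (radical name) marker appears
  tokens.each do |token|
    case token
    when /\AT(\d+)\z/
      section = ($1 == "1" ? :nanori : :radical_name)
    when /\AY(.+)\z/ then entry[:pinyin] << $1
    when /\AW(.+)\z/ then entry[:korean] << $1
    when /\p{Katakana}/, /\p{Hiragana}/
      # Katakana readings are on'yomi, hiragana readings are kun'yomi.
      target = case section
               when :nanori then :nanori
               when :reading then (token =~ /\p{Katakana}/ ? :onyomi : :kunyomi)
               end
      entry[target] << token if target # radical names are skipped in this sketch
    when /\A([A-Z]+)(.+)\z/
      # Everything else is treated as a dictionary or study book index code.
      (entry[:lookups][$1] ||= []) << $2
    end
  end
  entry
end

entry = parse_kanjidic_line(SAMPLE_LINE)
entry[:nanori]       # => ["や", "つぎ", "つぐ"]
entry[:lookups]["L"] # => ["1809"]
```

In a database like the one described here, the reading and meaning arrays would correspond to the language related models, while each pair in `lookups` would roughly correspond to a row in kanji_lookups. How the real importer maps code prefixes to dictionary_id values is not shown here.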