A python parser of the JMdict file.
Only read the values back out as a test to see how it looks. I am lazily trying to match the original EDICT format. I don't like the original warehouse implementation because it relies on comma separated values, which *is* the point, but it smells bad. I'm not entirely sold on the other two attempts and the moment, but it does depend on the ultimate goal, which is searching for various text. Knowing the limitations on the data technically the original warehouse could be perfect.
|models||Add part of speach relationship.|
|.gitignore||Fix for parser being a built-in on Windows.|
|JMdictParser.py||Update to use the observer pattern.|
|__init__.py||Fix for parser being a built-in on Windows.|
|data.py||Add alternate warehouse implementation and list values.|
|main.py||Add alternate warehouse implementation and list values.|
|observer.py||Update to use the observer pattern.|
|readme.txt||Update to use the observer pattern.|
|test.py||Add unit tests for the parser.|
The ultimate goal of this project is to feed a list of japanese words into a program and get a list of translations back out. The first step was to read a Japanese translation dictionary into a format that could then be queried. Then create something to break up Japanese text, feed the words into this, and print out the results. My original approach was to insert the contents of the JMdict Japanese translation file into a sqlite database. I was hoping than I could then use sql syntax to make searching easier. Inserting the data into a sqlite database was relatively easy, despite initially running into issues using SqlAlchemy. It may eventually be useful but the insert queries it ran would take hours to complete. I've now managed to get it down to a couple minutes, but the join required on fully normalized data meant that it was slower than reading the file directly from xml. I was in the process of creating a single warehouse table before getting distracted by other things. That would still allow the sqlite file to be a cross platform data file, but it would loose I don't necessarily want to entirely scrap that idea, but for any data analysis I may try other databases backends instead. The current goal is still to read the dictionary file and convert it into some kind of non-xml store that can be quickly read in or queried. I haven't gotten into any other specifics yet. The current project setup is a little cluttered. At some point it will have to be cleaned up. Required packages: - lxml - SqlAlchemy - for databse setup (Need to eventually remove, or at least move, this requirement, as not all stores are going to need it). TODO: - Create a means of querying a data store. - Develop companion readers/writers for each existing type - ideally you will be able to read in anything that has been written out, and write out anything that has been written in. - Figure out why sqlite on windows isn't saving the data as unicode, or if it is just a console issue. - Perhaps move some of the sqlite normalized table infomration into the reader as it is currently an extra step. Although it may only be useful for sql stores.