A python parser of the JMdict file.

Add alternate warehouse implementation and list values.

Only read the values back out as a test to see how it looks. I am lazily
trying to match the original EDICT format.

I don't like the original warehouse implementation because it relies on comma
separated values, which *is* the point, but it smells bad.

I'm not entirely sold on the other two attempts at the moment, but it does
depend on the ultimate goal, which is searching for various text. Given the
limitations of the data, the original warehouse could technically be perfect.

Latest commit 71367889b3, gpeterson2, November 10, 2011.
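The comma-separated warehouse the commit describes might look something like the sketch below. The function and key names are illustrative only, not the project's actual API; the EDICT line shape (KANJI [KANA] /gloss/) follows the real EDICT format.

```python
# Sketch of a flat "warehouse" row that joins list values with commas,
# then reads them back out EDICT-style. Names here are made up for
# illustration; the real project's columns and separator may differ.

def to_row(entry):
    # entry: {"kanji": [...], "kana": [...], "glosses": [...]}
    return {
        "kanji": ",".join(entry["kanji"]),
        "kana": ",".join(entry["kana"]),
        "glosses": ",".join(entry["glosses"]),
    }

def to_edict_line(row):
    # EDICT format is roughly: KANJI [KANA] /gloss1/gloss2/
    kanji = row["kanji"].split(",")[0]
    kana = row["kana"].split(",")[0]
    glosses = "/".join(row["glosses"].split(","))
    return "%s [%s] /%s/" % (kanji, kana, glosses)

entry = {"kanji": ["食べる"], "kana": ["たべる"], "glosses": ["to eat"]}
print(to_edict_line(to_row(entry)))  # 食べる [たべる] /to eat/
```

The smell the commit mentions is visible here: any gloss containing a comma would silently corrupt the round trip.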
models          Add part of speech relationship (November 30, 2010)
.gitignore      Fix for parser being a built-in on Windows (August 15, 2011)
JMdictParser.py Update to use the observer pattern (September 14, 2011)
__init__.py     Fix for parser being a built-in on Windows (August 15, 2011)
data.py         Add alternate warehouse implementation and list values (November 10, 2011)
main.py         Add alternate warehouse implementation and list values (November 10, 2011)
observer.py     Update to use the observer pattern (September 14, 2011)
readme.txt      Update to use the observer pattern (September 14, 2011)
test.py         Add unit tests for the parser (August 03, 2011)
readme.txt
The ultimate goal of this project is to feed a list of Japanese words into a
program and get a list of translations back out.

The first step was to read a Japanese translation dictionary into a format
that could then be queried. The next step is to create something to break up
Japanese text, feed the words into this, and print out the results.
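The dictionary-reading step could be sketched like this. The project itself uses lxml, but the stdlib parser works the same way here; the element names (keb, reb, gloss) follow the actual JMdict DTD, while the sample data and function name are illustrative.

```python
# Minimal sketch of pulling entries out of JMdict-style XML using the
# standard library. The real project uses lxml, but the element names
# (keb = kanji, reb = reading/kana, gloss = translation) match JMdict.
import xml.etree.ElementTree as ET

SAMPLE = """<JMdict>
  <entry>
    <k_ele><keb>食べる</keb></k_ele>
    <r_ele><reb>たべる</reb></r_ele>
    <sense><gloss>to eat</gloss></sense>
  </entry>
</JMdict>"""

def parse_entries(xml_text):
    root = ET.fromstring(xml_text)
    for entry in root.iter("entry"):
        yield {
            "kanji": [k.text for k in entry.iter("keb")],
            "kana": [r.text for r in entry.iter("reb")],
            "glosses": [g.text for g in entry.iter("gloss")],
        }

entries = list(parse_entries(SAMPLE))
print(entries[0]["glosses"])  # ['to eat']
```

For the full multi-megabyte JMdict file an incremental parser (iterparse) would be preferable to loading the whole tree, which is presumably why lxml is a requirement.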

My original approach was to insert the contents of the JMdict
Japanese translation file into a sqlite database. I was hoping that I could
then use SQL syntax to make searching easier.
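The appeal of the sqlite approach can be sketched with the stdlib sqlite3 module (the project uses SqlAlchemy instead; the table and column names below are made up for illustration):

```python
# Sketch of the sqlite idea: once the dictionary is in a table,
# searching is a one-line query rather than a scan over the XML.
# Schema is illustrative, not the project's actual one.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entry (kana TEXT, gloss TEXT)")
conn.executemany(
    "INSERT INTO entry VALUES (?, ?)",
    [("たべる", "to eat"), ("のむ", "to drink")],
)

row = conn.execute(
    "SELECT gloss FROM entry WHERE kana = ?", ("たべる",)
).fetchone()
print(row[0])  # to eat
```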

Inserting the data into a sqlite database was relatively easy, despite
initially running into issues using SqlAlchemy. SqlAlchemy may eventually be
useful, but the insert queries it originally ran took hours to complete. I've
now managed to get that down to a couple of minutes, but the joins required on
fully normalized data meant that querying was slower than reading the file
directly from XML. I was in the process of creating a single warehouse table
before getting distracted by other things. That would still allow the sqlite
file to be a cross-platform data file, but it would lose the fully normalized
structure.
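The trade-off between the normalized schema and a single warehouse table can be sketched like this; the schemas are illustrative, not the project's:

```python
# Sketch of the trade-off: a normalized schema needs a join per lookup,
# while a single denormalized "warehouse" table answers the same query
# from one table. Both schemas here are made up for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalized: glosses live in their own table, keyed to the entry.
CREATE TABLE entry (id INTEGER PRIMARY KEY, kana TEXT);
CREATE TABLE gloss (entry_id INTEGER, text TEXT);
INSERT INTO entry VALUES (1, 'たべる');
INSERT INTO gloss VALUES (1, 'to eat');

-- Denormalized warehouse: one row per entry, glosses pre-joined.
CREATE TABLE warehouse (kana TEXT, glosses TEXT);
INSERT INTO warehouse VALUES ('たべる', 'to eat');
""")

joined = conn.execute(
    "SELECT g.text FROM entry e JOIN gloss g ON g.entry_id = e.id "
    "WHERE e.kana = ?", ("たべる",)
).fetchone()
flat = conn.execute(
    "SELECT glosses FROM warehouse WHERE kana = ?", ("たべる",)
).fetchone()
print(joined[0], flat[0])  # same answer, no join in the second case
```

The warehouse pays for its simpler reads with redundancy and the list-in-a-column smell described in the commit message above.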

I don't necessarily want to entirely scrap that idea, but for any data
analysis I may try other database backends instead.

The current goal is still to read the dictionary file and convert it into some
kind of non-XML store that can be quickly read in or queried. I haven't gotten
into any other specifics yet.
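One candidate for such a store, sketched here as an assumption rather than anything the project has committed to, is a plain JSON cache: parse the XML once, dump the entries, and reload them instantly on later runs.

```python
# Hypothetical non-XML store: dump parsed entries to JSON once, then
# reload them on later runs without touching the XML. The file name
# and entry shape are illustrative.
import json
import os
import tempfile

entries = [{"kana": ["たべる"], "glosses": ["to eat"]}]

path = os.path.join(tempfile.gettempdir(), "jmdict_cache.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(entries, f, ensure_ascii=False)

with open(path, "r", encoding="utf-8") as f:
    reloaded = json.load(f)
print(reloaded == entries)  # True
```

This gives fast sequential reads but no query language, which is the axis the sqlite approaches trade along.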

The current project setup is a little cluttered. At some point it will have to
be cleaned up.

Required packages:
- lxml
- SqlAlchemy - for database setup (need to eventually remove, or at least
    move, this requirement, as not all stores are going to need it).

TODO:
- Create a means of querying a data store.
- Develop companion readers/writers for each existing type - ideally you will
be able to read in anything that has been written out, and write out anything
that has been read in.
- Figure out why sqlite on windows isn't saving the data as unicode, or if it
is just a console issue.
- Perhaps move some of the sqlite normalized table information into the
reader, as it is currently an extra step, although it may only be useful for
SQL stores.
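For the unicode TODO item, one quick way to separate the two suspects is to round-trip Japanese text through sqlite and compare it in memory, bypassing the console entirely. This is a diagnostic sketch, not a fix:

```python
# If this comparison passes, sqlite is storing real unicode and the
# garbled output on Windows is a console (code page) issue, not a
# storage one.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (word TEXT)")
conn.execute("INSERT INTO t VALUES (?)", ("たべる",))
(stored,) = conn.execute("SELECT word FROM t").fetchone()
print(stored == "たべる")  # True if sqlite stored real unicode
```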
