Commits on Mar 1, 2015
  1. Update to support reading default gzip files.

    So I can get rid of the unzipped file from the repo.
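    A minimal sketch of what reading either form could look like, assuming
    the file is identified by the gzip magic bytes rather than by its
    extension. The helper name open_dictionary is hypothetical and not taken
    from the repo.

```python
import gzip

# The first two bytes of every gzip stream.
GZIP_MAGIC = b"\x1f\x8b"


def open_dictionary(path):
    """Open path for binary reading, decompressing it if it is gzipped.

    Hypothetical helper: the repo's actual function name is not known.
    """
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == GZIP_MAGIC
    # gzip.open returns a file-like object with the same read() interface.
    return gzip.open(path, "rb") if is_gzip else open(path, "rb")
```

    Either way the caller gets a binary file object, so the parser does not
    need to know which form is on disk.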
Commits on Jan 11, 2015
  1. Fix for erroring tests.

Commits on Apr 20, 2014
  1. Update to make the readme an rst file.

    Mostly so github, bitbucket, etc. format it to look better.
  2. Update to improve the project structure.

    I guess it could be argued whether this is actually better. The main reason
    behind it is that I wanted to add support for things like mongo, and revisit
    using sqlalchemy now that a few versions have come and gone.
    It still lacks some Python niceties, such as package information, but it
    should be an improvement moving forward.
  3. Add JMdict file to the repo.

    I got tired of finding and downloading it again on separate systems.
Commits on Nov 11, 2011
  1. Add alternate warehouse implementation and list values.

    Only read the values back out as a test to see how it looks. I am lazily
    trying to match the original EDICT format.
    I don't like the original warehouse implementation because it relies on comma
    separated values, which *is* the point, but it smells bad.
    I'm not entirely sold on the other two attempts at the moment, but it does
    depend on the ultimate goal, which is searching for various text. Knowing the
    limitations on the data, technically the original warehouse could be perfect.
  2. Fix for the gloss join table.

    Not sure how necessary it is, but at least now it works.
Commits on Sep 14, 2011
  1. Update to use the observer pattern.

    As print can only get you so far.
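    A rough sketch of the idea, not the repo's actual classes: the parser
    reports progress to whatever observers are attached instead of calling
    print directly. Subject and DemoParser are hypothetical names.

```python
class Subject:
    """Keeps a list of observers and forwards messages to each of them."""

    def __init__(self):
        self._observers = []

    def attach(self, observer):
        # An observer here is any callable that accepts one message.
        self._observers.append(observer)

    def notify(self, message):
        for observer in self._observers:
            observer(message)


class DemoParser(Subject):
    """Hypothetical stand-in for the real parser."""

    def parse(self, lines):
        entries = []
        for line in lines:
            entries.append(line.strip())
            # Report progress without knowing who is listening.
            self.notify("parsed entry %d" % len(entries))
        return entries
```

    Attaching print restores the old behaviour, while a test can attach
    list.append and inspect the messages afterwards.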
Commits on Sep 13, 2011
  1. Updates to get this running on Windows.

    Greg Peterson committed
    - Removed instances of sqlalchemy from main. May add it in again later, but
    not in the main program.
    - Removed some unnecessary sections from the parser.
    - Updated the data writer to handle the new entries list. Not 100% and running
    into issues with encodings, either writing to the db or in the windows shell.
Commits on Aug 15, 2011
  1. Fix for parser being a built-in on Windows.

    Greg Peterson committed
    It may be from some other libraries I installed, but on Windows 7
    running "from parser import Parser" led to a name collision. This
    actually makes sense to do in general, though.
    Also added a .gz ignore because I downloaded the JMdict file locally
Commits on Aug 4, 2011
  1. Add unit tests for the parser.

    Also required updating the parser to return a list of Entry objects, rather
    than the combination of items from before. Probably a better decision overall,
    but this means that the database insert no longer works.
    Also created a gloss class due to the annoyance of dealing with tuples.
    Finally moved the JMdict file to the main directory as it tends to be easier
    to type when testing. So added it to the ignore file.
Commits on Mar 13, 2011
  1. Add "warehouse" table.

    Basically a simple collection of all of the data, un-normalized,
    but faster to search and display.
    The sad fact of the matter is that it's probably exactly what I
    want, aside from the glosses not containing all of their information.
    Well I should say I do still want to have a normalized table to
    do random querying against, but for the other immediate tools I want to
    build this will probably work perfectly (i.e. it's fast).
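    A sketch of what such an un-normalized table might look like in sqlite,
    assuming one row per entry with the values joined by commas (as a later
    commit describes). The column names and the two sample rows are made up
    for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE warehouse ("
    "entry_id INTEGER PRIMARY KEY, kanji TEXT, kana TEXT, glosses TEXT)"
)


def flatten(entry_id, kanji, kana, glosses):
    """Collapse the per-entry lists into comma-separated strings."""
    return (entry_id, ",".join(kanji), ",".join(kana), ",".join(glosses))


# Illustrative sample entries, not taken from the JMdict file.
entries = [
    (1, ["犬"], ["いぬ"], ["dog"]),
    (2, ["猫"], ["ねこ"], ["cat"]),
]
conn.executemany(
    "INSERT INTO warehouse VALUES (?, ?, ?, ?)",
    [flatten(*entry) for entry in entries],
)


def search(term):
    """Substring search across every column, with no joins required."""
    pattern = "%" + term + "%"
    return conn.execute(
        "SELECT kanji, kana, glosses FROM warehouse "
        "WHERE kanji LIKE ? OR kana LIKE ? OR glosses LIKE ?",
        (pattern, pattern, pattern),
    ).fetchall()
```

    The trade-off is the one noted above: a single-table lookup is simple
    and quick, but anything stored only as joined text loses its structure.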
  2. Add comments and redid the message system to support unicode.

    Also added a to string to entries, for testing purposes. It
    doesn't handle the glosses very well yet.
  3. Add comments

  4. Changed how output messages are handled.

    As using print statements is nice for debugging, but not for much else.
  5. Split data parsing and data saving function.

    Should have done this a while ago. I'm kind of toying with the
    idea of going back and setting up the ORM again, but I'm still
    concerned with performance issues.
Commits on Feb 10, 2011
  1. Cleaned up code a bit more.

    Still fast, and seems to get all the correct pairings. But querying
    everything is still incredibly slow.
Commits on Jan 14, 2011
  1. Tried setting up a couple specific dictionaries.

    Using those instead to track joins. Not sure it's really helping.
    Things seem to be missing from the final listing, but I can't
    quite seem to track them down.
    I wonder if it might be easier to keep a list of joins, and read
    the ids back out from the table? Then again the big problem seems
    to be the lack of a kana/kanji link and the speed of reading the list
    back out more than anything else.
Commits on Dec 2, 2010
  1. Added kana/kanji ids.

    Still slow to query the entire thing.
  2. Add ids for join table.

    Incrementing my own ids as a solution.
    I don't like how I figured out how to do this, and I certainly
    don't like that it's valid sqlite syntax...but such is life.
    The sub selects weren't working, so I guess for now this works.
    Then again I was trying to cut out everything I could to increase
    the speed of the conversion.
  3. Add gloss entry join table.

    Thinking about it, I'm not entirely sure that the sense element
    was entirely necessary. I couldn't find a unique way to identify
    them aside from the entire element, so it made sense to just move
    the part of speech into the gloss and run with that.
    Who knows, I may be completely wrong about it.
    It is becoming increasingly obvious that I need to reference ids
    somehow, this join table is noticeably slow, so figuring that out
    will be the first step to try fixing it.
  4. Added kanji/kana to entry join tables.

    I really need a way to get ids at this point. It seems that querying
    the table as is right now is noticeably slower than before. Or maybe
    it's just my imagination.
    Working on the pos now. Simplified the error checking before
    thinking that it should be a separate commit.
  5. Fix to add lists directly to database.

    Rather than using intermediary files.
    Ran into an odd issue where it seemed the cursor was trying to
    iterate over the characters in the text rather than taking them
    as a single item. Actually that may be the correct way of handling
    these lists...either way, I just ran a list comprehension on them
    to get everything into the correct values.
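    The behaviour described above matches how Python's sqlite3 module
    treats parameters, assuming that is the driver in use: executemany
    expects a sequence of parameter sequences, and a bare string is itself
    a sequence of characters. A sketch of the list-comprehension fix, with
    a hypothetical table name:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kana (reading TEXT)")

readings = ["かな", "よみ"]

# Passing `readings` directly would make sqlite3 bind each character of a
# string as a separate parameter. Wrapping every value in a one-element
# tuple gives it one parameter per row instead.
rows = [(reading,) for reading in readings]
conn.executemany("INSERT INTO kana (reading) VALUES (?)", rows)

stored = [row[0] for row in conn.execute("SELECT reading FROM kana ORDER BY rowid")]
```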
  6. Redid the parser to not use sqlalchemy for inserts.

    The files (while not yet quite complete) insert all of this data
    in seconds rather than hours.
    Perhaps there are things that can be done with sqlalchemy to speed
    things up. Although I doubt it will ever match a bulk insert.
    That's not a huge priority; I started using it mostly to make
    querying easier, not necessarily insertions.
Commits on Nov 30, 2010
  1. Fix for missing pos in dictionary.

    Seems to be "none" although I'm not sure why that would be included
    in the xml file...
    Well I should remember to do further testing assuming I run it against
    the whole file.
  2. Add part of speech table.

    Hardcoded it into the parser because I don't think it's possible
    to easily read out of the xml file (at least not with lxml). I
    doubt these will change that much at this point, but you never know.
    The next step is to alter the parser so this information can be
    linked to the sense element.
Commits on Nov 29, 2010
  1. Add language attribute into the gloss table.

    It's being grabbed in a roundabout way simply because I didn't want
    to hardcode the whole namespace definition. Yeah, that's a little
    lazy, but for now it works.
    I also don't like how I'm displaying these, but then again using
    the __str__ method was only meant to be used for debugging purposes
    anyway. I'll have to write more robust print code again eventually,
    so this can wait for now.
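    For reference, a sketch of both ways of reading the attribute. It uses
    the standard library's ElementTree rather than lxml so that it runs
    without extra packages. The function names are hypothetical, and the
    "eng" fallback assumes a gloss without xml:lang should be treated as
    English.

```python
import xml.etree.ElementTree as ET

# The fixed namespace that the reserved "xml" prefix always maps to.
XML_NAMESPACE = "http://www.w3.org/XML/1998/namespace"


def gloss_language(gloss, default="eng"):
    """Look the attribute up by its fully qualified name."""
    return gloss.get("{%s}lang" % XML_NAMESPACE, default)


def gloss_language_by_suffix(gloss, default="eng"):
    """Roundabout variant: match on the local name, ignoring the namespace."""
    for key, value in gloss.attrib.items():
        if key.endswith("}lang"):
            return value
    return default


sense = ET.fromstring(
    '<sense><gloss>dog</gloss><gloss xml:lang="fre">chien</gloss></sense>'
)
languages = [gloss_language(gloss) for gloss in sense.findall("gloss")]
```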
  2. Rename model to models

    That's another thing which has been bothering me for a while.
    It makes more sense as a plural given that there are more than
    one model.
  3. Fix for dropping and creating tables before insert.

    That's been bugging me for a while.
    It probably shouldn't need to be called explicitly, although the
    option is nice.
    Also I'm not 100% sure if this is the best way to handle setting
    the sqlalchemy objects, but the globals weren't being set correctly
    the way I was importing and initializing them before.
  4. Add command line args.

    Default is nothing, a file may be imported, or contents listed.
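    One way those three behaviours could be wired up, sketched with
    argparse. The flag names --import-file and --list are hypothetical, as
    is the assumption that the two options are mutually exclusive.

```python
import argparse


def build_parser():
    """Build an illustrative command line interface; flag names are made up."""
    parser = argparse.ArgumentParser(description="JMdict import tool (sketch)")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--import-file", metavar="PATH",
                       help="parse the given file and store its entries")
    group.add_argument("--list", action="store_true",
                       help="list the stored contents")
    return parser


def describe(argv):
    """Return the action the given arguments select."""
    args = build_parser().parse_args(argv)
    if args.import_file:
        return "import " + args.import_file
    if args.list:
        return "list"
    # No arguments: do nothing, matching the default described above.
    return "nothing"
```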
Commits on Nov 18, 2010
  1. Revert to many-to-one relationship

    If I was smart I would have actually reverted...
    Anyway, it makes more sense the way it was, but I could come
    back to this in the future.