I guess it could be argued whether this is actually better. The main reason behind it is that I wanted to add support for things like Mongo, and to revisit using sqlalchemy now that a few versions have come and gone. It still lacks some Python niceties, such as package information, but it should be an improvement moving forward.
Only read the values back out as a test to see how it looks. I am lazily trying to match the original EDICT format. I don't like the original warehouse implementation because it relies on comma-separated values, which *is* the point, but it smells bad. I'm not entirely sold on the other two attempts at the moment, but it does depend on the ultimate goal, which is searching for various text. Knowing the limitations of the data, the original warehouse could technically be perfect.
- Removed instances of sqlalchemy from main. May add it in again later, but not in the main program.
- Removed some unnecessary sections from the parser.
- Updated the data writer to handle the new entries list. Not 100% yet, and running into issues with encodings, either writing to the db or in the Windows shell (see the sketch below).
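A minimal sketch of the shell-side workaround I have in mind; safe_print and the fallback behavior here are my own illustration, not code from the repo:

```python
import sys

def safe_print(text):
    # The Windows console often runs a legacy codepage (e.g. cp437) that
    # can't represent Japanese, so substitute replacement characters
    # instead of letting print() die with a UnicodeEncodeError.
    encoding = sys.stdout.encoding or "utf-8"
    sys.stdout.write(text.encode(encoding, errors="replace").decode(encoding))
    sys.stdout.write("\n")

safe_print("食べる [たべる] /to eat/")
```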
It may be from some other libraries I installed, but on Windows 7 running "from parser import Parser" led to a name collision. This actually makes sense to do in general, though, since Python already ships a built-in module named parser. Also added a .gz ignore because I downloaded the JMdict file locally here.
This also required updating the parser to return a list of Entry objects rather than the combination of items from before. Probably a better decision overall, but it means the database insert no longer works. Also created a Gloss class due to the annoyance of dealing with tuples. Finally, moved the JMdict file to the main directory, as it tends to be easier to type when testing, so added it to the ignore file.
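Roughly the shape of the two classes; the attribute names here are my sketch, not necessarily what ends up in the code:

```python
class Gloss:
    """A single translation, instead of an anonymous tuple."""
    def __init__(self, text, lang="eng", pos=None):
        self.text = text  # the gloss text itself
        self.lang = lang  # gloss language
        self.pos = pos    # part of speech carried along with the gloss

class Entry:
    """One JMdict entry, as the parser now returns them."""
    def __init__(self, kanji, kana, glosses):
        self.kanji = kanji      # list of kanji writings
        self.kana = kana        # list of kana readings
        self.glosses = glosses  # list of Gloss objects
```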
Basically a simple collection of all of the data, un-normalized but faster to search and display. The sad fact of the matter is that it's probably exactly what I want, aside from the glosses not containing all of their information. Well, I should say I do still want a normalized table to do random querying against, but for the other immediate tools I want to build this will probably work perfectly (i.e. it's fast).
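A sketch of what I mean by the warehouse table; the column names are illustrative rather than the real schema:

```python
import sqlite3

conn = sqlite3.connect("jmdict.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS warehouse (
        kanji   TEXT,  -- comma-separated writings
        kana    TEXT,  -- comma-separated readings
        glosses TEXT   -- comma-separated translations
    )
""")

# Searching is one LIKE scan over a single table, no joins:
rows = conn.execute(
    "SELECT kanji, kana, glosses FROM warehouse WHERE glosses LIKE ?",
    ("%eat%",),
).fetchall()
```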
Also added a __str__ to entries, for testing purposes. It doesn't handle the glosses very well yet.
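Something along these lines, aiming at the EDICT layout (KANJI [KANA] /gloss/gloss/), added to the Entry class sketched earlier:

```python
def __str__(self):
    kanji = ",".join(self.kanji)
    kana = ",".join(self.kana)
    # This flattens each gloss down to its text and drops the rest of its
    # information, which is why the glosses aren't handled well yet.
    glosses = "/".join(g.text for g in self.glosses)
    return "{0} [{1}] /{2}/".format(kanji, kana, glosses)
```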
Using those instead to track joins. Not sure it's really helping. Things seem to be missing from the final listing, but I can't quite seem to track them down. I wonder if it might be easier to keep a list of joins and read the ids back out from the table? Then again, the big problems seem to be the missing kana-kanji link and the speed of reading the list back out, more than anything else.
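For what it's worth, the "keep a list of joins" idea could look something like this sketch (the table and column names are made up):

```python
import sqlite3

conn = sqlite3.connect("jmdict.db")
cur = conn.cursor()

joins = []
for entry in entries:  # the Entry list from the parser
    kanji_ids = []
    for kanji in entry.kanji:
        cur.execute("INSERT INTO kanji (text) VALUES (?)", (kanji,))
        kanji_ids.append(cur.lastrowid)  # read the id straight back out
    for kana in entry.kana:
        cur.execute("INSERT INTO kana (text) VALUES (?)", (kana,))
        kana_id = cur.lastrowid
        joins.extend((k, kana_id) for k in kanji_ids)

# Write the kana-kanji links in one pass at the end.
cur.executemany("INSERT INTO kana_kanji (kanji_id, kana_id) VALUES (?, ?)", joins)
conn.commit()
```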
Incrementing my own ids as a solution. I don't like how I figured out how to do this, and I certainly don't like that it's valid sqlite syntax...but such is life. The sub-selects weren't working, so I guess for now this works. Then again, I was trying to cut out everything I could to increase the speed of the conversion.
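Roughly what I mean by incrementing my own ids, as a sketch with made-up table names: keep a counter in Python and hand sqlite explicit primary keys, instead of relying on subselects:

```python
import sqlite3

conn = sqlite3.connect("jmdict.db")
cur = conn.cursor()

rows = []
next_id = 1
for entry in entries:  # the parser's Entry list
    rows.append((next_id, ",".join(entry.kana)))  # illustrative row shape
    next_id += 1

cur.executemany("INSERT INTO entries (id, kana) VALUES (?, ?)", rows)
conn.commit()
```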
Thinking about it, I'm not entirely sure that the sense element was entirely necessary. I couldn't find a unique way to identify them aside from the entire element, so it made sense to just move the part of speech into the gloss and run with that. Who knows, I may be completely wrong about it. It is becoming increasingly obvious that I need to reference ids somehow; this join table is noticeably slow, so figuring that out will be the first step in trying to fix it.
I really need a way to get ids at this point. It seems that querying the table as it is right now is noticeably slower than before. Or maybe it's just my imagination. Working on the pos now. Simplified the error checking beforehand, thinking that it should be a separate commit.
Rather than using intermediary files. Ran into an odd issue where it seemed the cursor was trying to iterate over the characters in the text rather than taking them as a single item. Actually, that may be the correct way of handling these lists... either way, I just ran a list comprehension on them to get everything into the correct values.
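The gotcha, as far as I can tell: sqlite3 treats every parameter set as a sequence, so a bare string gets unpacked character by character. Wrapping each one in a tuple is the list-comprehension fix:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kana (text TEXT)")
words = ["たべる", "のむ"]

# conn.executemany("INSERT INTO kana VALUES (?)", words)
# ...fails: each string is iterated as its individual characters.

conn.executemany("INSERT INTO kana VALUES (?)", [(w,) for w in words])
```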
The files (while not yet quite complete) insert all of this data in seconds rather than hours. Perhaps there are things that can be done with sqlalchemy to speed things up, although I doubt it will ever match a bulk insert. That's not a huge priority; I started using it mostly to make querying easier, not necessarily insertions.
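The seconds-not-hours part comes down to batching: build all the rows first, then push them through executemany inside a single transaction (column names are illustrative, conn is the sqlite3 connection from the earlier sketches):

```python
rows = [(",".join(e.kana), str(e)) for e in entries]  # build everything first

with conn:  # one commit for the whole batch instead of one per INSERT
    conn.executemany("INSERT INTO warehouse (kana, edict) VALUES (?, ?)", rows)
```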
Hardcoded it into the parser because I don't think it's possible to easily read it out of the xml file (at least not with lxml). I doubt these will change much at this point, but you never know. The next step is to alter the parser so this information can be linked to the sense element.
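The hardcoding boils down to a dict from entity names to their expansions; a couple of the JMdict ones by way of example:

```python
# Only a few shown; the real list in JMdict is much longer.
ENTITIES = {
    "n": "noun (common) (futsuumeishi)",
    "adj-i": "adjective (keiyoushi)",
    "v5r": "Godan verb with 'ru' ending",
}
```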
It's being grabbed in a roundabout way simply because I didn't want to hardcode the whole namespace definition. Yeah, that's a little lazy, but for now it works. I also don't like how I'm displaying these, but then again the __str__ method was only meant to be used for debugging purposes. I'll have to write more robust print code eventually anyway, so this can wait for now.
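Assuming this is the xml:lang attribute on the gloss elements (my reading of the note), the non-roundabout version in lxml would be the fully qualified key:

```python
XML_NS = "{http://www.w3.org/XML/1998/namespace}"

# Glosses without an explicit xml:lang are English in JMdict;
# gloss_element here is an illustrative lxml Element.
lang = gloss_element.get(XML_NS + "lang", "eng")
```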
That's been bugging me for a while. It probably shouldn't need to be called explicitly, although the option is nice. Also, I'm not 100% sure this is the best way to handle setting the sqlalchemy objects, but the globals weren't being set correctly the way I was importing and initializing them before.
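The shape of the change, more or less: build the sqlalchemy objects inside an init function and return them, rather than relying on module-level globals being set at import time (the function name here is mine):

```python
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

def init_db(url="sqlite:///jmdict.db"):
    # Everything is created on demand, so nothing depends on import order.
    engine = create_engine(url)
    Session = sessionmaker(bind=engine)
    return engine, Session
```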