
General extractor structure #15

Closed
brianjgeiger opened this issue May 3, 2014 · 3 comments

@brianjgeiger (Contributor)

Now that there's a basic program structure, I have some proposals for changes. I know you're still trying to get things working, and I'm absolutely against premature optimization. What I'm going to suggest keeps a lot of the basic ideas, but it fixes some things that are already listed as bugs/enhancements and will also make things faster.

Right now we are reading/writing the really large file at least 10.5 times. Now that we have the YAML structure definition, and each line starts with a marker telling us which file it belongs to, it makes a lot more sense to go through the initial file once, look at each line, figure out which YAML definition goes with that line, and output the corresponding CSV line. (Eventually that will be "Put it in a data structure and export it in the format requested," but one step at a time.)

The type of parsing we're doing doesn't really require anything particularly fancy. It's "grab characters x through y and write them out to a file with a comma after / store them as item z in a dictionary," so I don't think we're gaining a lot by using optimized tools.
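For illustration, a minimal sketch of that single pass. The shape of structure.yaml, the file names, and the field offsets here are all made up for the example, not the project's actual schema:

```python
import csv
import yaml

# Assumed (hypothetical) structure.yaml shape: each record type keyed by
# the two-character prefix that marks which file a line belongs to, e.g.
#
#   "01":
#     file: corporations
#     fields:
#       - {name: id, start: 2, length: 7}
#       - {name: name, start: 9, length: 50}
with open("structure.yaml") as f:
    structure = yaml.safe_load(f)

# One output CSV (and writer) per record type, with a header row.
handles = {prefix: open(defn["file"] + ".csv", "w", newline="")
           for prefix, defn in structure.items()}
writers = {prefix: csv.writer(handle) for prefix, handle in handles.items()}
for prefix, defn in structure.items():
    writers[prefix].writerow(field["name"] for field in defn["fields"])

# The single pass: match each line to its YAML definition by prefix,
# slice out "characters x through y" for each field, and emit the row.
with open("master.txt") as master:
    for line in master:
        defn = structure.get(line[:2])
        if defn is None:
            continue  # no definition for this record type
        writers[line[:2]].writerow(
            line[f["start"]:f["start"] + f["length"]].strip()
            for f in defn["fields"])

for handle in handles.values():
    handle.close()
```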

The things that look like they will be handled or helped by this change:
#11 is only really necessary because we're creating a bunch of intermediate files.
#10 will be hard to step through as long as you are shelling out to a script; handling that in-program will make it easier to debug.
#9 is handled during this change.
#7 is pretty much the same situation as #10.

But it's a big change, so before I just dove in, I wanted to see what you thought. I'd hate to have two people put in a bunch of work on divergent paths if it's not what we're going for.

@waldoj (Member) commented May 3, 2014

I only used AWK as a quick port from the Bash script. My guess is that shelling out to AWK is way faster than doing the same work in Python, but I haven't benchmarked it to determine that. AWK will quite happily pipe out contents to other files based on the first 2 characters. That said, I quite dislike the sloppiness of invoking a shell command, and if there's not a serious I/O hit, I think it's worth moving the logic into Python in the name of simplicity. Also, moving that logic into Python could have been done in the time that it took me to write this paragraph. :)
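A minimal sketch of that in-Python split, assuming the split keys on the first two characters of each line (the file names here are placeholders):

```python
# Route each line of the master file to a per-record-type file, keyed
# on its first two characters: the same job the AWK one-liner does now.
handles = {}
with open("master.txt") as master:
    for line in master:
        prefix = line[:2]
        if prefix not in handles:
            handles[prefix] = open("records_{}.txt".format(prefix), "w")
        handles[prefix].write(line)

for handle in handles.values():
    handle.close()
```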

I'm not sure how to feel about not saving the nine files. On the one hand, as an open data practice, I favor emitting as many types of files as possible. One of those could reasonably be the big file, broken into nine pieces, which is what I was thinking. But, on reflection, I'm not sure that there's a constituency for that. :) Providing those same nine files as CSV, instead of as fixed-width, is probably more useful to people than the nine files as fixed-width data.

@brianjgeiger (Contributor, Author)

While I grant you that I also have not done the benchmarks, I find it terribly unlikely that any amount of code optimization is going to make up for the need to go through every line of those files 8.5 times, which is what we're doing now. What I'm saying is: $5 says that it will run much faster after a refactoring. :) (Or, you know, my honor as a developer, if money wagers are inappropriate or illegal.)

Hopefully it won't take too long to get it refactored, though probably not quite as fast as writing that first paragraph.

@waldoj (Member) commented May 3, 2014

Oh, I see—you mean eliminating entirely the interim step of emitting the nine fixed-width files, and instead going straight from the master file to nine CSV files. Well, yes, that would make a big difference, wouldn't it? :)
