
General extractor structure #15

Closed
brianjgeiger opened this issue May 3, 2014 · 3 comments

@brianjgeiger (Contributor)

Now that there's a basic program structure, I have some proposals for changes. I know you're still trying to get things working, and I'm absolutely against premature optimization. What I'm going to suggest keeps a lot of the basic ideas, but it fixes some things that are already listed as bugs/enhancements and will also make things faster.

Right now we are reading/writing the really large file at least 10.5 times. Now that we have the YAML structure definition, and each line starts with a marker telling us which file it belongs to, it makes a lot more sense to go through the initial file once, look at each line, figure out which YAML definition goes with that line, and output the corresponding CSV line. (Eventually that will be "Put it in a data structure and export it in the format requested," but one step at a time.)

The type of parsing we're doing doesn't really require anything particularly fancy. It's "grab characters x through y and write them out to a file with a comma after / store them as item z in a dictionary," so I don't think we're gaining a lot by using optimized tools.
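For illustration, a minimal sketch of that single pass. The shape of structure.yaml, the file names, and the field offsets here are all made up for the example, not the project's actual schema:

```python
import csv
import yaml

# Assumed (hypothetical) structure.yaml shape: each record type keyed by
# the two-character prefix that marks which file a line belongs to, e.g.
#
#   "01":
#     file: corporations
#     fields:
#       - {name: id, start: 2, length: 7}
#       - {name: name, start: 9, length: 50}
with open("structure.yaml") as f:
    structure = yaml.safe_load(f)

# One output CSV (and writer) per record type, with a header row.
handles = {prefix: open(defn["file"] + ".csv", "w", newline="")
           for prefix, defn in structure.items()}
writers = {prefix: csv.writer(handle) for prefix, handle in handles.items()}
for prefix, defn in structure.items():
    writers[prefix].writerow(field["name"] for field in defn["fields"])

# The single pass: match each line to its YAML definition by prefix,
# slice out "characters x through y" for each field, and emit the row.
with open("master.txt") as master:
    for line in master:
        defn = structure.get(line[:2])
        if defn is None:
            continue  # no definition for this record type
        writers[line[:2]].writerow(
            line[f["start"]:f["start"] + f["length"]].strip()
            for f in defn["fields"])

for handle in handles.values():
    handle.close()
```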

The things that look like they will be handled or helped by this change:
#11 is only really necessary because we're creating a bunch of intermediate files.
#10 will be hard to step through as long as you are shelling out to a script; handling that in-program will make it easier to debug.
#9 is handled during this change.
#7 is pretty much the same situation as #10.

But it's a big change, so before I just dove in, I wanted to see what you thought. I'd hate to have two people put in a bunch of work on divergent paths if it's not what we're going for.

@waldoj (Member) commented May 3, 2014

I only used AWK as a quick port from the Bash script. My guess is that shelling out to AWK is way faster than doing the same work in Python, but I haven't benchmarked it to determine that. AWK will quite happily pipe out contents to other files based on the first 2 characters. That said, I quite dislike the sloppiness of invoking a shell command, and if there's not a serious I/O hit, I think it's worth moving the logic into Python in the name of simplicity. Also, moving that logic into Python could have been done in the time that it took me to write this paragraph. :)
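A minimal sketch of that in-Python split, assuming the split keys on the first two characters of each line (the file names here are placeholders):

```python
# Route each line of the master file to a per-record-type file, keyed
# on its first two characters: the same job the AWK one-liner does now.
handles = {}
with open("master.txt") as master:
    for line in master:
        prefix = line[:2]
        if prefix not in handles:
            handles[prefix] = open("records_{}.txt".format(prefix), "w")
        handles[prefix].write(line)

for handle in handles.values():
    handle.close()
```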

I'm not sure how to feel about not saving the nine files. On the one hand, as an open data practice, I favor emitting as many types of files as possible. One of those could reasonably be the big file, broken into nine pieces, which is what I was thinking. But, on reflection, I'm not sure that there's a constituency for that. :) Providing those same nine files as CSV, instead of as fixed-width, is probably more useful to people than the nine files as fixed-width data.

@brianjgeiger (Contributor, Author)

While I grant you that I also have not done the benchmarks, I find it terribly unlikely that any amount of code optimization is going to make up for the need to go through every line of those files 8.5 times, which is what we're doing now. What I'm saying is: $5 says that it will run much faster after a refactoring. :) (Or, you know, my honor as a developer, if money wagers are inappropriate or illegal.)

Hopefully it won't take too long to get it refactored, though probably not quite as fast as writing that first paragraph.

@waldoj (Member) commented May 3, 2014

Oh, I see—you mean eliminating entirely the interim step of emitting the nine fixed-width files, and instead going straight from the master file to nine CSV files. Well, yes, that would make a big difference, wouldn't it? :)
