Consider using a parser combinator approach #90

Open
jamescasbon opened this Issue Jan 30, 2013 · 5 comments

Owner

jamescasbon commented Jan 30, 2013

I'm still unbelievably angry that VCF even exists, when they could have used hdf5 or something.

Anyway, I'm not convinced that hand-rolled parsers are really where it's at in the 21st century, so maybe we should investigate the performance of a parser combinator approach.

Owner

jamescasbon commented Feb 14, 2013

I have a prototype of this working for the header using funcparserlib. I think it is quite nice: you get a separation of concerns between a tokenizer, a grammar and an object model. It's clearly slower, but for the header we don't care about speed, we care about correctness. The main sample parsing code will still require hand-rolled loops for performance, but with the improved header model it should be easier to handle the type casting. I will share when I'm not posting from a phone.
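For readers who have not seen the pattern, here is a minimal sketch of that tokenizer / grammar / object model split for one structured header line. It is not the prototype described above and it does not use funcparserlib: the combinators (`tok`, `seq`, `alt`, `many`, `fmap`), the `HeaderLine` type and `parse_header_line` are all hypothetical names written from scratch with the standard library, and the grammar only covers `##KEY=<k=v,...>` lines.

```python
import re
from collections import namedtuple

Token = namedtuple("Token", "kind value")
HeaderLine = namedtuple("HeaderLine", "key fields")  # object model (hypothetical)

class ParseError(Exception):
    pass

# --- Tokenizer: turn the raw line into typed tokens -------------------------
TOKEN_RE = re.compile(r'''
    (?P<HASH>\#\#)
  | (?P<STRING>"(?:[^"\\]|\\.)*")
  | (?P<LT><) | (?P<GT>>) | (?P<EQ>=) | (?P<COMMA>,)
  | (?P<WORD>[^<>=,"\s]+)
''', re.VERBOSE)

def tokenize(line):
    tokens, pos = [], 0
    while pos < len(line):
        m = TOKEN_RE.match(line, pos)
        if m is None:
            raise ParseError("unexpected character at %d: %r" % (pos, line[pos]))
        tokens.append(Token(m.lastgroup, m.group()))
        pos = m.end()
    return tokens

# --- Combinators: a parser is a function (tokens, pos) -> (value, new_pos) --
def tok(kind):
    def parse(tokens, pos):
        if pos < len(tokens) and tokens[pos].kind == kind:
            return tokens[pos].value, pos + 1
        raise ParseError("expected %s at token %d" % (kind, pos))
    return parse

def seq(*parsers):
    def parse(tokens, pos):
        values = []
        for p in parsers:
            value, pos = p(tokens, pos)
            values.append(value)
        return values, pos
    return parse

def alt(*parsers):
    def parse(tokens, pos):
        for p in parsers:
            try:
                return p(tokens, pos)
            except ParseError:
                pass
        raise ParseError("no alternative matched at token %d" % pos)
    return parse

def many(parser):
    def parse(tokens, pos):
        values = []
        while True:
            try:
                value, pos = parser(tokens, pos)
            except ParseError:
                return values, pos
            values.append(value)
    return parse

def fmap(parser, func):
    def parse(tokens, pos):
        value, pos = parser(tokens, pos)
        return func(value), pos
    return parse

# --- Grammar: built by composing the combinators ----------------------------
string = fmap(tok("STRING"), lambda s: s[1:-1])  # strips quotes, keeps escapes
value = alt(string, tok("WORD"))
pair = fmap(seq(tok("WORD"), tok("EQ"), value), lambda v: (v[0], v[2]))
pairs = fmap(seq(pair, many(fmap(seq(tok("COMMA"), pair), lambda v: v[1]))),
             lambda v: [v[0]] + v[1])
structured = fmap(
    seq(tok("HASH"), tok("WORD"), tok("EQ"), tok("LT"), pairs, tok("GT")),
    lambda v: HeaderLine(v[1], dict(v[4])))

def parse_header_line(line):
    tokens = tokenize(line.rstrip("\n"))
    result, pos = structured(tokens, 0)
    if pos != len(tokens):
        raise ParseError("trailing tokens after position %d" % pos)
    return result
```

A quick usage example, under the same assumptions:

```python
header = parse_header_line(
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">')
print(header.key)                    # INFO
print(header.fields["Description"])  # Total Depth
```

The `Type` and `Number` fields come back as plain strings here; mapping them to casting functions for the sample columns would be a separate step on top of the object model.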

Owner

jamescasbon commented Feb 15, 2013

Have a look at this gist...
https://gist.github.com/jamescasbon/4960150

Owner

jamescasbon commented Feb 19, 2013

@martijnvermaat any thoughts?

Collaborator

martijnvermaat commented Feb 19, 2013

Yes, I like this a lot. It would be good to have an idea what kind of spec violations are out there and whether or not we can still parse them using this approach. I think parsing all headers in our test data is a first minimal requirement.

If we can still be flexible enough to parse what is being used in practice I'm all for it. I'd be happy to see the regexes go.
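One way to get that overview might be a small survey script that runs a candidate header parser over every `##` line in the test files and records what fails. The sketch below is only an illustration: `collect_header_failures` is a hypothetical helper, `parse_line` stands for whatever header parser is being evaluated, and the files are assumed to be uncompressed text.

```python
def collect_header_failures(paths, parse_line):
    """Run parse_line over each ## header line and collect the failures.

    Returns a list of (path, line_number, line, error_message) tuples.
    """
    failures = []
    for path in paths:
        with open(path) as handle:
            for lineno, line in enumerate(handle, 1):
                if not line.startswith("##"):
                    break  # meta-information lines end at the first non-## line
                text = line.rstrip("\n")
                try:
                    parse_line(text)
                except Exception as exc:  # broad on purpose: this is a survey
                    failures.append((path, lineno, text, str(exc)))
    return failures
```

Grouping the collected lines by error message would show which spec violations actually occur in practice and how often.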

By the way, do you know how funcparserlib compares to pyparsing? The latter is the only one I've used (and more popular, I think), and while it did everything I needed, I wasn't terribly impressed (especially by the documentation).

This is kind of connected with #89, but I was wondering if there is already a multiprocess solution to read the whole VCF into memory, so that it can be moved to a pandas DataFrame for operations.

Or is there some other strategy to read the bulk of a VCF quickly?
