I'm still unbelievably angry that VCF even exists, when they could have used HDF5 or something.
Anyway, I'm not convinced that hand-rolled parsers are really where it's at in the 21st century; maybe we should investigate the performance of a parser combinator approach.
I have a prototype of this working for the header using funcparserlib. I think it is quite nice: you get a separation of concerns between a tokenizer, a grammar, and an object model. It's clearly slower, but for the header we don't care about speed, we care about correctness. The main sample-parsing code will still require hand-rolled loops for performance, but with the improved header model it should be easier to handle the type casting. I will share when I'm not posting from a phone.
Have a look at this gist...
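The gist itself isn't reproduced here, but to make the tokenizer/grammar/object-model split concrete, here is a minimal stdlib-only sketch (plain `re` rather than funcparserlib, so the shapes are illustrative, not the prototype's actual API) for a structured VCF metadata line such as `##INFO=<ID=DP,...>`:

```python
import re

# Hypothetical sketch: tokenizer + tiny grammar + object model (a dict)
# for one VCF metadata line, e.g.
#   ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">

TOKEN_SPEC = [
    ("STRING", r'"[^"]*"'),     # quoted value (tried first, may contain commas)
    ("EQUALS", r"="),
    ("COMMA", r","),
    ("WORD", r'[^=,<>"]+'),     # bare key or value
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(body):
    """Yield (kind, text) pairs for the text between '<' and '>'."""
    for match in MASTER.finditer(body):
        yield match.lastgroup, match.group()

def parse_metadata(line):
    """Grammar: '##' key '=' ('<' pair (',' pair)* '>' | value)."""
    key, _, rest = line[2:].partition("=")
    rest = rest.strip()
    if not (rest.startswith("<") and rest.endswith(">")):
        return key, rest                  # unstructured, e.g. ##fileformat=VCFv4.1
    tokens = list(tokenize(rest[1:-1]))
    fields, i = {}, 0
    while i < len(tokens):
        (_, name), _eq, (kind, value) = tokens[i:i + 3]
        fields[name] = value[1:-1] if kind == "STRING" else value
        i += 4                            # step over the separating comma
    return key, fields
```

A real combinator version would express the same grammar declaratively, but even this sketch shows why the split helps: spec deviations become either new token patterns or new grammar alternatives, instead of ever-growing regexes.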
@martijnvermaat any thoughts?
Yes, I like this a lot. It would be good to have an idea of what kinds of spec violations are out there and whether we can still parse them using this approach. I think parsing all the headers in our test data is a first minimal requirement.
If we can still be flexible enough to parse what is being used in practice I'm all for it. I'd be happy to see the regexes go.
By the way, do you know how funcparserlib compares to pyparsing? The latter is the only one I've used (and the more popular, I think), and while it did everything I needed, I wasn't terribly impressed (especially by the documentation).
This is kind of connected to #89, but I was wondering if there is already a multiprocess solution for reading the whole VCF into memory, so that it can then be moved to a pandas DataFrame for operations.
Or some other strategy to read the bulk of the VCF quickly.
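For comparison, one common strategy that skips per-record parsing entirely is to strip the `##` metadata lines and hand the tab-delimited body straight to pandas' C parser. A hedged sketch (`vcf_to_dataframe` is a hypothetical helper, not part of any existing API):

```python
import gzip
import io

import pandas as pd

def vcf_to_dataframe(path):
    """Hypothetical helper: load the body of a VCF into a pandas DataFrame.

    Drops the ``##`` metadata lines, keeps the ``#CHROM`` line as the
    column header, and lets pandas' fast C parser handle the records.
    INFO and sample columns are left as raw strings.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        body = "".join(line for line in handle if not line.startswith("##"))
    df = pd.read_csv(io.StringIO(body), sep="\t")
    return df.rename(columns={"#CHROM": "CHROM"})
```

This buffers the whole file in memory and does no INFO/genotype decoding, so it is only a starting point; a multiprocess version would need to split the file on record boundaries and concatenate per-chunk frames.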