Description
As a bioinformatician, I frequently use Pandas to parse hideous scientific file formats. Often standard biological file formats include a section of header comments that provide useful information about the file, such as reference build, species, versioning, etc.
For example, a typical VCF file:
#VCF v1.1.4
#RefBuild: hg19
#Assay:Oncopanel
#CHROM POS REF ALT QUAL FILTER INFO
22 123845 A G . . STRAND=+
Above, we'd need RefBuild to determine which species and genome version we are working with, and Assay to tell us which panel of mutations we are looking for. All manner of strange things turn up in these headers, and they are often important to our analysis.
Currently, I need to read the file once without Pandas to grab the header information, and then read it again with Pandas, skipping the commented lines, to turn the content into a dataframe. I primarily build pipelines for processing very large datasets, and this little workaround is often the bane of my existence.
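To make the workaround concrete, here is a minimal sketch of doing it in a single pass over an open file handle. The helper name `read_with_header_comments` is hypothetical (it is not part of Pandas), and the sketch assumes that all comment lines come first, that the last one holds the column names (as in the example above), and that the handle is seekable.

```python
import io

import pandas as pd


def read_with_header_comments(handle, comment="#"):
    """Hypothetical helper: return (meta_lines, DataFrame) from one pass.

    Assumes every comment line is at the top of the file, the last comment
    line holds the column names, and `handle` supports tell()/seek().
    """
    comments = []
    while True:
        pos = handle.tell()          # remember where this line starts
        line = handle.readline()
        if not line.startswith(comment):
            break                    # first data line (or end of file)
        comments.append(line[len(comment):].strip())
    handle.seek(pos)                 # rewind to the start of the data
    if not comments:
        raise ValueError("no leading comment lines found")

    # Assumption: the final comment line is the column header.
    *meta, column_line = comments
    frame = pd.read_csv(
        handle, sep=r"\s+", names=column_line.split(), header=None
    )
    return meta, frame


sample = io.StringIO(
    "#VCF v1.1.4\n"
    "#RefBuild: hg19\n"
    "#Assay:Oncopanel\n"
    "#CHROM POS REF ALT QUAL FILTER INFO\n"
    "22 123845 A G . . STRAND=+\n"
)
meta, frame = read_with_header_comments(sample)
```

This still has to live outside `read_csv`, and it will not work on non-seekable streams such as pipes, which is part of why built-in support would be preferable.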
I would like to put in a feature request for the ability to capture these header comments while parsing the file, rather than discarding them. Similar questions and requests already exist on StackOverflow as well. Thank you.