DOC: Example on how to parse comments #22055

@summerela

As a bioinformatician, I frequently use pandas to parse hideous scientific file formats. Standard biological file formats often include a section of header comments that provide useful information about the file, such as reference build, species, versioning, etc.

For example, a typical VCF file:

#VCF v1.1.4
#RefBuild: hg19
#Assay:Oncopanel
#CHROM    POS    REF    ALT    QUAL    FILTER    INFO
22            123845    A      G          .             .           STRAND=+

Above, we'd need to know the RefBuild to determine which species and genome version we are working with, and Assay to tell us which set of mutations we are looking for. All manner of strange things turn up in these headers, and they are often important to our analysis.
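To make this concrete, here is a minimal sketch of turning header lines like the ones above into a dictionary. The function name `parse_header_metadata` is made up for illustration, and the sketch assumes metadata lines follow a `#Key: value` pattern; lines without a colon (such as `#VCF v1.1.4`) are simply skipped.

```python
def parse_header_metadata(comment_lines):
    """Collect '#Key: value' style comment lines into a dict.

    Hypothetical helper: assumes one key/value pair per line,
    separated by the first colon. Lines without a colon are ignored.
    """
    meta = {}
    for line in comment_lines:
        text = line.lstrip("#").strip()
        key, colon, value = text.partition(":")
        if colon:
            meta[key.strip()] = value.strip()
    return meta


header = ["#VCF v1.1.4", "#RefBuild: hg19", "#Assay:Oncopanel"]
meta = parse_header_metadata(header)
print(meta["RefBuild"], meta["Assay"])
```

Real headers are rarely this regular, so the colon rule would likely need adjusting per file format.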

Currently, I need to read the file once without pandas to grab the header information, and then read it again with pandas, skipping the commented lines, to turn the content into a dataframe. I primarily build pipelines for processing very large datasets, and this little workaround is often the bane of my existence.
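For reference, one way to avoid the second read is to consume the leading comment lines from an open handle and then pass that same handle to `pd.read_csv`. This is only a sketch under some assumptions: `read_with_header` is a made-up name, the comment block sits at the top of the file, its last line holds the column names, and the columns are whitespace-separated as in the example above (real VCF files are tab-delimited, so `sep="\t"` may be more appropriate).

```python
import io

import pandas as pd


def read_with_header(handle, sep=r"\s+"):
    """Return (comment_lines, DataFrame) from one pass over a text handle.

    Hypothetical helper: assumes all '#' lines come first and that the
    last of them names the columns.
    """
    comments = []
    while True:
        pos = handle.tell()
        line = handle.readline()
        if not line.startswith("#"):
            # First data line (or end of file): rewind so pandas sees it.
            handle.seek(pos)
            break
        comments.append(line[1:].strip())
    # Assumption: the final comment line is the column header.
    columns = comments.pop().split() if comments else None
    df = pd.read_csv(handle, sep=sep, names=columns, header=None)
    return comments, df


sample = io.StringIO(
    "#VCF v1.1.4\n"
    "#RefBuild: hg19\n"
    "#Assay:Oncopanel\n"
    "#CHROM POS REF ALT QUAL FILTER INFO\n"
    "22 123845 A G . . STRAND=+\n"
)
comments, df = read_with_header(sample)
print(comments)
print(df)
```

The same function should work with a file opened via `open(path)`, since `readline`, `tell`, and `seek` behave the same way on text files. It still means maintaining a helper outside pandas, which is the part I would like to avoid.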

I would like to put in a feature request to add the ability to capture these header comments in the same read that builds the dataframe. Similar questions and requests already exist on StackOverflow as well. Thank you.
