Using PyVCF with Hadoop Streaming: separate header from the data parsing #99

Closed
laserson opened this Issue Mar 11, 2013 · 4 comments


@laserson

I'd like to use PyVCF with Hadoop Streaming. This means a mapper may start consuming lines from anywhere in the VCF file. However, I can still get at the VCF header by reading the file separately first.

Would this be the correct way to initialize the Reader object with one file and consume VCF lines from another?

import os
import subprocess
import sys

import vcf

# get the name of the VCF file this mapper's split came from
input_file = os.environ['map_input_file']

# initialize the Reader with the header, which is read from the stdout of
# a subprocess that dumps the whole VCF file
p = subprocess.Popen("hadoop fs -cat %s" % input_file, shell=True, stdout=subprocess.PIPE)
vcf_reader = vcf.Reader(fsock=p.stdout, strict_whitespace=True)
p.kill()

# the actual data I want to parse comes in on stdin, so point the parser there
vcf_reader._reader = sys.stdin
vcf_reader.reader = (line.strip() for line in vcf_reader._reader if line.strip())

# iterate through VCF records
for record in vcf_reader:
    print record

Are there any other things I need to deal with when I change the input source?

Thanks!
Uri

@martijnvermaat
Collaborator

This probably works, but a slightly less hacky way would be to do something like this (not tested):

import itertools

# feed the subprocess output (header + records) and stdin to the Reader as one stream
vcf_reader = vcf.Reader(itertools.chain(p.stdout, sys.stdin))
p.kill()
@laserson

The only issue is that p.stdout and sys.stdin actually refer to the same very large file, except that p.stdout starts at the top of the file (including the header) while sys.stdin starts somewhere in the middle. With your proposal, wouldn't calling vcf_reader.next() iterate through the entire file and then also iterate through the chunk in sys.stdin?
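
(As a toy illustration of the concern, not PyVCF-specific: chain exhausts its first iterable completely before yielding anything from the second.)

import itertools

# chain('ABC', 'xy') yields A, B, C before x, y; with p.stdout as the first
# iterable, every record in the whole file would be read before the split on stdin
print list(itertools.chain('ABC', 'xy'))   # ['A', 'B', 'C', 'x', 'y']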

@martijnvermaat
Collaborator

Ah, I assumed it was just the header in input_file. You could change it like this:

# lazily take only the header lines (## metadata plus the #CHROM line) from the subprocess
header = itertools.takewhile(lambda l: l.startswith('#'), p.stdout)
# parse the header, then continue with the records arriving on stdin
vcf_reader = vcf.Reader(itertools.chain(header, sys.stdin))
p.kill()

I think this is a bit safer than your original approach, though both should work.
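
Putting the pieces together, a complete mapper could look roughly like this (again untested; it reuses the map_input_file environment variable and the hadoop fs -cat pipe from your first snippet):

import itertools
import os
import subprocess
import sys

import vcf

# locate the VCF file this mapper's split came from
input_file = os.environ['map_input_file']

# dump the file once more, only to recover its header lines
p = subprocess.Popen("hadoop fs -cat %s" % input_file, shell=True, stdout=subprocess.PIPE)
header = itertools.takewhile(lambda l: l.startswith('#'), p.stdout)

# parse the header, then continue with the records arriving on this mapper's stdin
vcf_reader = vcf.Reader(itertools.chain(header, sys.stdin), strict_whitespace=True)
p.kill()

for record in vcf_reader:
    print record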

@laserson

I like it. Thanks!

laserson closed this Mar 12, 2013