I'd like to use PyVCF with Hadoop Streaming. This means that I may start consuming lines from anywhere in the VCF file. However, I can get access to the VCF header by reading it first.
Would this be the correct way to initialize the Reader object with one file and consume VCF lines from another?
import os
import subprocess
import sys

import vcf

# get the VCF filename that Hadoop Streaming exposes to the mapper
input_file = os.environ['map_input_file']
# initialize the Reader with the full VCF file, streamed from HDFS via the
# stdout of a "hadoop fs -cat" subprocess, so the header can be parsed
p = subprocess.Popen("hadoop fs -cat %s" % input_file,
                     shell=True, stdout=subprocess.PIPE)
vcf_reader = vcf.Reader(fsock=p.stdout, strict_whitespace=True)
# the actual data I want to parse comes in on stdin, so point the parser there
vcf_reader._reader = sys.stdin
vcf_reader.reader = (line.strip() for line in vcf_reader._reader if line.strip())
# iterate through the VCF records
for record in vcf_reader:
    ...  # process each record here
Are there any other things I need to deal with when I change the input source?
This probably works, but a perhaps less hacky way would be to do something like this (not tested):
import itertools

vcf_reader = vcf.Reader(itertools.chain(p.stdout, sys.stdin))
The only issue is that p.stdout and sys.stdin are actually referring to the same very large file, except p.stdout starts at the top of the file (including header info) while sys.stdin starts somewhere in the middle of the file. With your proposal, wouldn't calling vcf_reader.next() iterate through the entire file, and then also iterate through the chunk in sys.stdin?
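To make the double-read concrete, here is a quick sketch using `io.StringIO` stand-ins for the two streams (hypothetical data; in the real job these would be `p.stdout` and `sys.stdin`):

```python
import io
import itertools

# Two views of the same file: the full file from the top, and a chunk
# that starts somewhere in the middle (what stdin would deliver).
full_file = io.StringIO("##fileformat=VCFv4.1\n#CHROM\nrec1\nrec2\nrec3\n")
stdin_chunk = io.StringIO("rec2\nrec3\n")

# A plain chain replays every line of the full file before the chunk,
# so rec2 and rec3 would each be parsed twice.
lines = list(itertools.chain(full_file, stdin_chunk))
print(len(lines))  # 7 lines: all 5 from the full file plus 2 from stdin
```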
Ah, I assumed it was just the header in input_file. You could change it like this:
header = itertools.takewhile(lambda l: l.startswith('#'), p.stdout)
vcf_reader = vcf.Reader(itertools.chain(header, sys.stdin))
I think this is a bit safer than your original approach, though both should work.
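For anyone following along, here is a small sketch of how the `takewhile`/`chain` stitching behaves, again with `io.StringIO` stand-ins for `p.stdout` and `sys.stdin` (hypothetical data, not run against real PyVCF):

```python
import io
import itertools

# Stand-ins: the full file (header + records) and this mapper's stdin chunk.
full_file = io.StringIO("##fileformat=VCFv4.1\n#CHROM\trec\nrecA\nrecB\n")
stdin_chunk = io.StringIO("recB\nrecC\n")

# takewhile stops at the first line that doesn't start with '#', so only
# the header lines are taken from the full file ...
header = itertools.takewhile(lambda l: l.startswith('#'), full_file)
# ... and every record comes exclusively from stdin.
stitched = list(itertools.chain(header, stdin_chunk))
print(stitched)
```

One subtlety: `takewhile` reads one line past the last header line (the first record of the full file) and discards it, which is harmless here since the records are read from stdin anyway.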
I like it. Thanks!