Add preserve_order option to toggle between using an OrderedDict or normal dict #127

Closed
wants to merge 2 commits into from

2 participants

@bruth

This addresses #96 by giving the option of preserving the order of INFO fields.
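
A minimal sketch of the toggle this PR proposes, assuming a hypothetical `TinyReader` class that stands in for `vcf.Reader`. Only the `dict_type` selection mirrors the diff; the INFO parsing is heavily simplified (no type conversion, flags stored as `True`) and is not PyVCF's actual implementation.

```python
from collections import OrderedDict


class TinyReader(object):
    """Hypothetical stand-in for vcf.Reader; models only the dict toggle."""

    def __init__(self, preserve_order=True):
        # Same idea as the line added in this PR: choose the mapping type once.
        self.dict_type = OrderedDict if preserve_order else dict

    def parse_info(self, info_str):
        # Simplified INFO parsing: values stay strings, bare flags become True.
        retdict = self.dict_type()
        for entry in info_str.split(';'):
            key, sep, value = entry.partition('=')
            retdict[key] = value if sep else True
        return retdict


ordered = TinyReader(preserve_order=True).parse_info('DP=14;AF=0.5;DB')
plain = TinyReader(preserve_order=False).parse_info('DP=14;AF=0.5;DB')
```

Both calls return the same key/value pairs; only the container type differs, which is why the default of `True` keeps existing behavior.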

bruth added some commits
@bruth bruth Add preserve_order option
This enables toggling between an OrderedDict vs. a normal dict
9e935a1
@bruth bruth Change preserve_order to default to True for backwards compat
Amend the docstring to describe what the parameter does.
51f0242
@martijnvermaat martijnvermaat commented on the diff
vcf/parser.py
@@ -350,7 +358,7 @@ def _parse_info(self, info_str):
return {}
entries = info_str.split(';')
- retdict = OrderedDict()
@martijnvermaat Collaborator

This is the only line that's really hitting performance, right?

@bruth
bruth added a note

Yep
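
A rough way to check the cost of that one line is to time filling each mapping type with the same entries. This is only a sketch: the keys are made up, `fill` is a hypothetical helper, and no timings are claimed here. On recent CPython versions `OrderedDict` is implemented in C and plain `dict` keeps insertion order, so the gap may be much smaller than it was on the interpreters in use when this PR was opened.

```python
import timeit
from collections import OrderedDict

# Made-up INFO-like entries; a real record would come from a VCF line.
ENTRIES = [('K%d' % i, str(i)) for i in range(20)]


def fill(dict_type):
    """Build one mapping of the given type, as _parse_info does per record."""
    d = dict_type()
    for key, value in ENTRIES:
        d[key] = value
    return d


for dict_type in (dict, OrderedDict):
    # The lambda runs inside this iteration, so it sees the current dict_type.
    seconds = timeit.timeit(lambda: fill(dict_type), number=20000)
    print('%-12s %.3f s' % (dict_type.__name__, seconds))
```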

@martijnvermaat
Collaborator

I'm not too happy with this approach. I think we really need the order in the header lines and it doesn't make sense to make that configurable.

The line I annotated is the thing you're really looking for and I do think it makes sense to change that to an ordinary dictionary while keeping the ordered dictionary for the header lines. We wouldn't need any configuration that way.

(That one line was also added later, following #46.)

I'll continue the discussion in #96 now, where we have some more context.
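
The alternative described above can be sketched as follows, with no configuration at all: header metadata always goes into an `OrderedDict`, and per-record INFO always goes into a plain `dict`. Both functions are hypothetical simplifications written for illustration, not the code merged in #128.

```python
from collections import OrderedDict


def parse_header_ids(lines):
    """Collect ##INFO IDs in file order (hypothetical, heavily simplified)."""
    prefix = '##INFO=<ID='
    infos = OrderedDict()  # header order is always preserved
    for line in lines:
        if line.startswith(prefix):
            info_id = line[len(prefix):].split(',', 1)[0]
            infos[info_id] = line
    return infos


def parse_record_info(info_str):
    """Parse one record's INFO column into a plain dict (simplified)."""
    retdict = {}  # the hot path: built once per record
    for entry in info_str.split(';'):
        key, sep, value = entry.partition('=')
        retdict[key] = value if sep else True
    return retdict


header = parse_header_ids([
    '##fileformat=VCFv4.0',
    '##INFO=<ID=NS,Number=1,Type=Integer,Description="Samples">',
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Depth">',
])
info = parse_record_info('NS=3;DP=14;DB')
```

The header is parsed once per file, so keeping it ordered costs little, while the INFO mapping is created for every record.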

@bruth

@martijnvermaat I would be thrilled to have that single-line change. I made it configurable merely so that the behavior would not change for others using the library.

@bruth bruth closed this
@martijnvermaat
Collaborator

Alternative solution in #128 was merged.

Commits on Nov 14, 2013
  1. @bruth

    Add preserve_order option

    bruth authored
    This enables toggling between an OrderedDict vs. a normal dict
  2. @bruth

    Change preserve_order to default to True for backwards compat

    bruth authored
    Amend the docstring to describe what the parameter does.
Showing with 18 additions and 9 deletions.
  1. +2 −1  .gitignore
  2. +16 −8 vcf/parser.py
3  .gitignore
@@ -1,7 +1,8 @@
PyVCF.egg-info
build
dist
-*.pyc
+*.sw?
+*.py?
docs/_build
.ropeproject
1kg.prof
24 vcf/parser.py
@@ -72,7 +72,8 @@
class _vcf_metadata_parser(object):
'''Parse the metadat in the header of a VCF file.'''
- def __init__(self):
+ def __init__(self, dict_type):
+ self.dict_type = dict_type
super(_vcf_metadata_parser, self).__init__()
self.info_pattern = re.compile(r'''\#\#INFO=<
ID=(?P<id>[^,]+),
@@ -159,7 +160,7 @@ def read_format(self, format_string):
match.group('type'), match.group('desc'))
return (match.group('id'), form)
-
+
def read_contig(self, contig_string):
'''Read a meta-contigrmation INFO line.'''
match = self.contig_pattern.match(contig_string)
@@ -179,7 +180,7 @@ def read_meta_hash(self, meta_string):
# Removing initial hash marks and final equal sign
key = items[0][2:-1]
# N.B., items can have quoted values, so cannot just split on comma
- val = OrderedDict()
+ val = self.dict_type()
state = 0
k = ''
v = ''
@@ -223,7 +224,7 @@ class Reader(object):
""" Reader for a VCF v 4.0 file, an iterator returning ``_Record objects`` """
def __init__(self, fsock=None, filename=None, compressed=False, prepend_chr=False,
- strict_whitespace=False):
+ strict_whitespace=False, preserve_order=True):
""" Create a new Reader for a VCF file.
You must specify either fsock (stream) or filename. Gzipped streams
@@ -235,9 +236,16 @@ def __init__(self, fsock=None, filename=None, compressed=False, prepend_chr=Fals
'strict_whitespace=True' will split records on tabs only (as with VCF
spec) which allows you to parse files with spaces in the sample names.
+
+ 'preserve_order=True' will use an OrderedDict instead of a regular
+ dict to preserve the order of the record's fields and INFO data.
+ Note, at large sizes there are performance implications to
+ preserving the order.
"""
super(Reader, self).__init__()
+ self.dict_type = OrderedDict if preserve_order else dict
+
if not (fsock or filename):
raise Exception('You must provide at least fsock or filename')
@@ -292,9 +300,9 @@ def _parse_metainfo(self):
The end user shouldn't have to use this. She can access the metainfo
directly with ``self.metadata``.'''
for attr in ('metadata', 'infos', 'filters', 'alts', 'contigs', 'formats'):
- setattr(self, attr, OrderedDict())
+ setattr(self, attr, self.dict_type())
- parser = _vcf_metadata_parser()
+ parser = _vcf_metadata_parser(self.dict_type)
line = self.reader.next()
while line.startswith('##'):
@@ -315,7 +323,7 @@ def _parse_metainfo(self):
elif line.startswith('##FORMAT'):
key, val = parser.read_format(line)
self.formats[key] = val
-
+
elif line.startswith('##contig'):
key, val = parser.read_contig(line)
self.contigs[key] = val
@@ -350,7 +358,7 @@ def _parse_info(self, info_str):
return {}
entries = info_str.split(';')
- retdict = OrderedDict()
+ retdict = self.dict_type()
for entry in entries:
entry = entry.split('=')