Skip to content
This repository

improvments to vcf_filter #40

Merged
merged 5 commits into from almost 2 years ago

2 participants

Libor Mořkovský James Casbon
Libor Mořkovský

I like the idea of vcf filtering framework, so I tried to use that for my project. To be able to use that fluently, i've added few changes.

Filter classes can be contained in local file

The file is given on the command line, so not all the possible arguments are known the first time command line is parsed. So it was necessary to slightly change the command syntax to

vcf_filter.py --optionals input filter1 [--filter1-optionals] 
  [filter2 [--filter2-optionals]]

This way the arguments can be parsed filter by filter and checked for correctness.

Output 'PASS' in filter field only if filtered data are not dropped

Small change to add_filter() to remove the '.' when adding first filter

James Casbon
Owner

This is super cool, thanks very much.

I just need to check this out and update the tests, before it gets merged. Unless they are already passing?

Libor Mořkovský
James Casbon jamescasbon merged commit 30c71b6 into from
James Casbon jamescasbon closed this
James Casbon
Owner

Hey @libor-m do you want to submit your filters for inclusion? Would be great to get them in before I cut the next release.

Libor Mořkovský

The filters I was using for my project are in https://github.com/libor-m/scrimer/blob/master/pyvcf_filters.py . Check them and if any of them seems sensible - and reuasble for the others - to you, you're welcome to include those in pyvcf.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
This page is out of date. Refresh to see the latest.
174  scripts/vcf_filter.py
@@ -6,67 +6,161 @@
6 6
 import vcf
7 7
 from vcf.parser import _Filter
8 8
 
9  
-parser = argparse.ArgumentParser(description='Filter a VCF file',
10  
-        formatter_class=argparse.RawDescriptionHelpFormatter,
11  
-        )
12  
-parser.add_argument('input', metavar='input', type=str, nargs=1,
13  
-        help='File to process (use - for STDIN)')
14  
-parser.add_argument('filters', metavar='filter', type=str, nargs='+',
15  
-        help='Filters to use')
16  
-parser.add_argument('--no-short-circuit', action='store_true',
17  
-        help='Do not stop filter processing on a site if a single filter fails.')
18  
-parser.add_argument('--output', action='store', default=sys.stdout,
19  
-        help='Filename to output (default stdout)')
20  
-parser.add_argument('--no-filtered', action='store_true',
21  
-        help='Remove failed sites')
22  
-
23  
-
24  
-if __name__ == '__main__':
  9
+def create_filt_parser(name):
  10
+    parser = argparse.ArgumentParser(description='Parser for %s' % name,
  11
+            add_help=False
  12
+            )
  13
+    parser.add_argument('rest', nargs=argparse.REMAINDER, help=argparse.SUPPRESS)
  14
+    
  15
+    return parser
  16
+
  17
+def create_core_parser():
  18
+    # we have to use custom formatted usage, because of the 
  19
+    # multi-stage argument parsing (otherwise the filter arguments
  20
+    # are grouped together with the other optionals)
  21
+    parser = argparse.ArgumentParser(description='Filter a VCF file',
  22
+            add_help=False,
  23
+            formatter_class=argparse.ArgumentDefaultsHelpFormatter,
  24
+            usage="""%(prog)s [-h] [--no-short-circuit] [--no-filtered] 
  25
+              [--output OUTPUT] [--local-script LOCAL_SCRIPT]
  26
+              input filter [filter_args] [filter [filter_args]] ...
  27
+            """
  28
+            )
  29
+    parser.add_argument('-h', '--help', action='store_true',
  30
+            help='Show this help message and exit.')
  31
+    parser.add_argument('input', metavar='input', type=argparse.FileType('rb'), nargs='?', default=None,
  32
+            help='File to process (use - for STDIN)')
  33
+#    parser.add_argument('filters', metavar='filter', type=str, nargs='*', default=None,
  34
+#            help='Filters to use')
  35
+    parser.add_argument('--no-short-circuit', action='store_true',
  36
+            help='Do not stop filter processing on a site if any filter is triggered')
  37
+    parser.add_argument('--output', action='store', default=sys.stdout,
  38
+            help='Filename to output [STDOUT]')
  39
+    parser.add_argument('--no-filtered', action='store_true',
  40
+            help='Output only sites passing the filters')
  41
+    parser.add_argument('--local-script', action='store', default=None,
  42
+            help='Python file in current working directory with the filter classes')
  43
+    parser.add_argument('rest', nargs=argparse.REMAINDER, help=argparse.SUPPRESS)
  44
+    
  45
+    return parser
  46
+
  47
+# argument parsing strategy
  48
+# loading a script given at the command line poses a difficulty 
  49
+# for using the argparse in a simple way -- the command line arguments
  50
+# are not completely known the first time command line is parsed
  51
+# requirements:
  52
+#  - display all filters with options grouped by the filters in help screen
  53
+#  - check if only arguments for currently used filters are given
  54
+#  - to increase legibility when using more filters, arguments should 
  55
+#    follow the filter name
  56
+#  - it is good to specify the filters explicitly by name, 
  57
+#    because the order of filtering can matter
  58
+# solution
  59
+# - change the command syntax to 
  60
+#   vcf_filter.py --core-options input filter1 --filter1-args filter2 filter3
  61
+# - parse the core program options with parse_known_args
  62
+# - use add_argument_group for filters (subparsers won't work, they require 
  63
+#   the second command in argv[1])
  64
+# - create all-filters parser when displaying the help
  65
+# - parse the arguments incrementally on argparse.REMAINDER of the previous
  66
+
25 67
     # TODO: allow filter specification by short name
26 68
     # TODO: flag that writes filter output into INFO column
27 69
     # TODO: argument use implies filter use
28 70
     # TODO: parallelize
29 71
     # TODO: prevent plugins raising an exception from crashing the script
30 72
 
31  
-
  73
+def main():
32 74
     # dynamically build the list of available filters
33 75
     filters = {}
34  
-    filter_help = '\n\navailable filters:'
35  
-
36  
-    for p in pkg_resources.iter_entry_points('vcf.filters'):
37  
-        filt = p.load()
38  
-        filters[filt.name] = filt
39  
-        filt.customize_parser(parser)
40  
-        filter_help += '\n  %s:\t%s' % (filt.name, filt.description)
41  
-
42  
-    parser.description += filter_help
43 76
 
44 77
     # parse command line args
45  
-    args = parser.parse_args()
  78
+    # (mainly because of local_script)
  79
+    parser = create_core_parser()
  80
+    (args, unknown_args) = parser.parse_known_args()
46 81
 
47  
-    inp = vcf.Reader(file(args.input[0]))
  82
+    # add filter to dictionary, extend help message
  83
+    # with help/arguments of each filter
  84
+    def addfilt(filt):
  85
+        filters[filt.name] = filt
  86
+        arg_group = parser.add_argument_group(filt.name, filt.description)
  87
+        filt.customize_parser(arg_group)
  88
+    
  89
+    # look for global extensions
  90
+    for p in pkg_resources.iter_entry_points('vcf.filters'):
  91
+        filt = p.load()
  92
+        addfilt(filt)
  93
+
  94
+    # add all classes from local script, if present
  95
+    if args.local_script != None:
  96
+        import inspect
  97
+        import os
  98
+        sys.path.insert(0, os.getcwd())
  99
+        module_name = args.local_script.replace('.py', '')
  100
+        mod = __import__(module_name)
  101
+        classes = inspect.getmembers(mod, inspect.isclass)
  102
+        for name, cls in classes:
  103
+            addfilt(cls)
  104
+
  105
+    # go through the filters on the command line 
  106
+    # one by one, trying to consume only the declared arguments
  107
+    used_filters = []
  108
+    while len(args.rest):
  109
+        filter_name = args.rest.pop(0)
  110
+        if filter_name not in filters:
  111
+            sys.exit("%s is not a known filter (%s)" % (filter_name, str(filters.keys())))
  112
+        
  113
+        # create a parser only for arguments of current filter
  114
+        filt_parser = create_filt_parser(filter_name)
  115
+        filters[filter_name].customize_parser(filt_parser)
  116
+        (known_filt_args, unknown_filt_args) = filt_parser.parse_known_args(args.rest)
  117
+        if len(unknown_filt_args):
  118
+            sys.exit("%s has no arguments like %s" % (filter_name, unknown_filt_args))
  119
+
  120
+        used_filters.append((filter_name, known_filt_args))
  121
+        args.rest = known_filt_args.rest
  122
+
  123
+    # print help using the 'help' parser, so it includes 
  124
+    # all possible filters and arguments
  125
+    if args.help or len(used_filters) == 0 or args.input == None:
  126
+        parser.print_help()
  127
+        parser.exit()
  128
+
  129
+    inp = vcf.Reader(args.input)
48 130
 
49 131
     # build filter chain
50 132
     chain = []
51  
-    for name in args.filters:
52  
-        f = filters[name](args)
  133
+    for (name, filter_args) in used_filters:
  134
+        f = filters[name](filter_args)
53 135
         chain.append(f)
  136
+        # add a filter record to the output
54 137
         inp.filters[f.filter_name()] = _Filter(f.filter_name(), f.description)
55 138
 
56  
-    oup = vcf.Writer(args.output, inp)
  139
+    # output must be created after all the filter records have been added
  140
+    output = vcf.Writer(args.output, inp)
57 141
 
58 142
     # apply filters
59 143
     short_circuit = not args.no_short_circuit
  144
+    drop_filtered = args.no_filtered
60 145
 
61 146
     for record in inp:
  147
+        output_record = True
62 148
         for filt in chain:
63 149
             result = filt(record)
64  
-            if result:
65  
-                record.add_filter(filt.filter_name())
66  
-                if short_circuit:
67  
-                    break
68  
-
69  
-        if (not args.no_filtered) or (record.FILTER == '.'):
70  
-            oup.write_record(record)
71  
-
72  
-
  150
+            if result == None: continue
  151
+
  152
+            # save some work by skipping the rest of the code
  153
+            if drop_filtered: 
  154
+                output_record = False
  155
+                break
  156
+            
  157
+            record.add_filter(filt.filter_name())
  158
+            if short_circuit: break
  159
+
  160
+        if output_record:
  161
+            # use PASS only if other filter names appear in the FILTER column 
  162
+            #FIXME: is this good idea?
  163
+            if record.FILTER == '.' and not drop_filtered: record.FILTER = 'PASS'
  164
+            output.write_record(record)
  165
+
  166
+if __name__ == '__main__': main()
4  vcf/parser.py
@@ -255,7 +255,9 @@ def add_format(self, fmt):
255 255
         self.FORMAT = self.FORMAT + ':' + fmt
256 256
 
257 257
     def add_filter(self, flt):
258  
-        if self.FILTER is None or self.FILTER == 'PASS':
  258
+        if self.FILTER is None \
  259
+        or self.FILTER == 'PASS'\
  260
+        or self.FILTER == '.':
259 261
             self.FILTER = ''
260 262
         else:
261 263
             self.FILTER = self.FILTER + ';'
Commit_comment_tip

Tip: You can add notes to lines in a file. Hover to the left of a line to make a note

Something went wrong with that request. Please try again.