Allow filtering by metadata only #679

huddlej · 2021-02-13T00:49:39Z

Description of proposed changes

Allows users to filter strains using only metadata information and output the strains that pass filters as either a metadata table or a list of names. Additionally, allows users to select only those sequences whose strain names are listed in one or more include files. To enable this functionality, this commit makes the following major changes to the filter interface:

sequence input and output are optional
non-sequence-based filters work without sequence input or index
the include argument accepts multiple inputs whose contents are unioned
a new --exclude-all flag excludes all strains by default, allowing only those in the include flags to be added back (this explicit supports functionality that I've often hacked with an all-encompassing exclusion filter like --exclude-where 'country!=fakecountry' plus the --include argument)
new metadata and strain output arguments are supported
the initial complete list of strain names to consider for filtering originates from the metadata instead of the sequences

Collectively, these changes allow multiple filter commands to be run in parallel without accessing the original sequence data and then the outputs of these separate commands can be joined by a single final filter command to output the sequences corresponding to all desired strain names. For workflows with large sequence inputs, like ncov, these changes should speed up analyses considerably.

A useful pattern (also shown in the functional tests) that is enabled by this new interface looks like this:

# Get Zika strains from Brazil.
augur filter \
  --sequence-index tests/functional/filter/sequence_index.tsv \
  --metadata tests/functional/filter/metadata.tsv \
  --query "country == 'Brazil'" \
  --output-strains brazil.txt

# Get Zika strains from Colombia.
augur filter \
  --sequence-index tests/functional/filter/sequence_index.tsv \
  --metadata tests/functional/filter/metadata.tsv \
  --query "country == 'Colombia'" \
  --output-strains colombia.txt

# Get sequences and metadata for all selected strains.
augur filter \
  --sequences tests/functional/filter/sequences.fasta \
  --sequence-index tests/functional/filter/sequence_index.tsv \
  --metadata tests/functional/filter/metadata.tsv \
  --exclude-all \
  --include brazil.txt colombia.txt \
  --output-sequences filtered.fasta \
  --output-metadata filtered_metadata.tsv

Testing

This PR adds functional tests for the proposed new filter interface in tests/functional/filter.t.

codecov · 2021-02-13T00:51:28Z

Codecov Report

Merging #679 (1665e1b) into master (698df3d) will increase coverage by 0.35%.
The diff coverage is 52.34%.

@@            Coverage Diff             @@
##           master     #679      +/-   ##
==========================================
+ Coverage   30.27%   30.62%   +0.35%     
==========================================
  Files          40       40              
  Lines        5563     5812     +249     
  Branches     1354     1463     +109     
==========================================
+ Hits         1684     1780      +96     
- Misses       3819     3946     +127     
- Partials       60       86      +26

Impacted Files	Coverage Δ
augur/filter.py	`45.84% <49.28%> (-1.20%)`	⬇️
augur/utils.py	`38.36% <100.00%> (+1.72%)`	⬆️
augur/index.py	`84.00% <0.00%> (+1.85%)`	⬆️
augur/titer_model.py	`21.81% <0.00%> (+2.90%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 698df3d...1665e1b. Read the comment docs.

huddlej · 2021-02-18T19:38:12Z

augur/filter.py

@@ -146,7 +188,8 @@ def run(args):
    #If VCF, open and get sequence names
    if is_vcf:
        seq_keep, all_seq = read_vcf(args.sequences)
-    else:
+        # TODO: How to best handle "source of truth" for VCF files vs. metadata?
+    elif args.sequences or args.sequence_index:


One issue I just ran into while mocking up a ncov workflow with this interface was that I had to include a sequence index in the final filter step that pulls in all of the subsampled strains and extracts their FASTA sequences. Even though the index isn't necessary at that stage because there are no sequence-based filters being applied, if I didn't include the sequence index, augur would try to build it for me.

So, this line here should probably also check for whether any sequence-based filters have been requested (as in the argument validation code above) and then skip the sequence index completely if it isn't needed.

huddlej · 2021-02-18T19:49:42Z

An alternative interface for merging the subsampled strains and extracting their sequences could be a combination of UNIX utilities and samtools. For example:

# Collect all unique strains from previous filter steps into a single file.
sort -u brazil.txt colombia.txt > strains.txt

# Extract sequences for the given strains.
# This builds a samtools FASTA index (a `sequences.fasta.fai` file) the first time it is run
# and then uses that index for subsequent commands.
samtools faidx -r strains.txt tests/functional/filter/sequences.fasta > filtered.fasta

This approach is much faster (1 second to extract sequences compared to 1-2 minutes with augur filter) because samtools is written in C and optimized for random reads in big sequence files by using an index. The downside here is that users can't easily get the metadata for the filtered strains, too. While having a filtered metadata file has been critical for my own flu analyses, that may not be a common use case and I've gotten by with my custom scripts to perform the same function.

If we used this approach, we could get rid of the --exclude-all and multiple input support for --include described in this PR.

Builds on a prototype augur interface for metadata-only filtering [1] to speed up the ncov subsampling steps. Specifically, subsampling writes out lists of selected strain names to text files instead of writing out sequences. Then the "combine samples" step extracts sequences from the unique list of strain names with samtools faidx. The resulting workflow requires substantially less I/O and speeds up subsampling from an average of 163 seconds per job (for 12 jobs in a Nextstrain global build) to 36 seconds per job. One downside to the current metadata-only approach in Augur that reveals itself here is that its possible to have mismatches between metadata and sequences such that a strain has metadata but does not have a sequence in the filtered FASTA. One way around this issue is to export the filtered metadata earlier in the pipeline. For now, we use samtools faidx with the `-c` flag to ignore missing sequences. [1] nextstrain/augur#679

huddlej · 2021-02-18T22:23:29Z

@jameshadfield I've mocked up a ncov workflow using with the proposed metadata-only interface from this PR. After upfront costs of indexing the sequences (6 minutes for augur index and 1-2 minutes for samtools faidx), each subsampling job only takes ~30 seconds and the final steps to collect strain names and extract their sequences with samtools takes ~30 seconds.

jameshadfield

This is really good @huddlej and the speed improvements for ncov will be immense -- we currently find ourselves in a situation where aligning is faster than filtering!

I tried to cover the different paths available for running augur filter - there are many, and I know I will have missed some, especially around VCF handling. The main confusion I had was the inconsistencies between different sources of truth. For instance, using the ncov pipeline's example_metadata.tsv

echo -e ">something\nATCG" > dummy.fasta
augur filter --metadata example_metadata.tsv --sequences dummy.fasta  \
    --max-date 2020-01-30 --output-strains result.txt 
 # results in 14 strains. I argue it should be 0.
augur filter --metadata example_metadata.tsv --sequences dummy.fasta  \
    --max-date 2020-01-30 --output-strains result.txt --output result.fasta
 # results in 0 strains in the FASTA -- which I think is correct
 # and result.txt is not modified (so it may still exist from a previous call)

I've made in-line comments which would lead us towards taking the intersection of inputs, but there may be other directions to pursue.

augur/filter.py

jameshadfield · 2021-02-19T09:12:31Z

augur/filter.py

@@ -91,15 +91,16 @@ def filter_by_query(sequences, metadata_file, query):
    return [seq for seq in sequences if seq in filtered_meta_dict]

 def register_arguments(parser):
-    parser.add_argument('--sequences', '-s', required=True, help="sequences in fasta or VCF format")
+    parser.add_argument('--sequences', '-s', help="sequences in FASTA or VCF format")


The augur filter interface is becoming rather complicated, but I'm not sure how much we can explain within augur --help. We could consider using argument groups to separate into "data input args", "parameters", and "outputs", each with a small preamble. This would help with errors such as ERROR: You need to select at least one output.

P.S. the arguement validation & error messages you've added in this PR are really helpful once a user starts to actually run things.

I like this idea a lot. I'm not sure what those groups would be exactly though, so I'm hesitant to commit any specifics here. Maybe we can discuss this at the next meeting, too?

augur/filter.py

huddlej · 2021-02-21T00:22:11Z

Thank you for this detailed review, @jameshadfield! Your example commands were helpful for mocking up end-to-end tests that we were missing.

I've tried to address most of the comments. The only one I couldn't determine a solution to was the grouping of arguments. The current state of the PR still treats the metadata as "the source of truth". How we choose to handle this could be discussed more. My main thought here is that metadata are the only required input, so they basically have to be prompted to "source of truth". If we don't want this behavior, then we should require args.sequences or args.sequence_index instead of allowing them to be optional.

We still get the speed benefit of the "metadata-only" approach, if we require a sequence index, so maybe that is a better direction. One benefit of the current approach is it allows us to skip creating a sequence index completely if we don't request any sequence filters.

augur/utils.py

jameshadfield

This looks good to me and works well in most situations, however I think there are still a couple of edge cases with strange behavior.

Firstly, given (e.g) a set of 4 sequences (FASTA), and 4 rows of metadata, where there is only 2 strains in common (i.e. there are 2 sequences without metadata, and two rows of metadata without sequences) then the output is:

2 sequences were dropped during filtering
        2 were excluded because they had no sequence data
        2 sequences were excluded because they did not have metadata

Ideally this would be:

4 samples were dropped during filtering
        2 were excluded because they had no sequence data
        2 were excluded because they did not have metadata

Secondly, there is inconsistent output stemming (I think) from the --includeed samples. E.g. using the multiple-inputs tutorial from nCoV we have a situation where we start with 92 seqs and 90 metadata rows,

$ augur filter  --sequences results/masked_aus.fasta    --metadata data/example_metadata_aus.tsv     --include defaults/include.txt     --output results/filtered_aus.fasta --output-strains results/filtered_aus.txt
-3 sequences were dropped during filtering
        1 sequences were excluded because they did not have metadata

        3 sequences were added back because they were in defaults/include.txt
94 strains have been passed all filters

and end up with a FASTA containing 91 sequences but a strains list containing 94! For consistency, to force-include strains I think they must be in the provided metadata and sequences (if supplied). Would also be nice to separate out "1 strain dropped during filtering" from "n samples added back in ..." (here n=3 but I think it should be 0).

huddlej · 2021-02-25T21:35:19Z

Thank you so much for the detailed inspection of these edge cases, @jameshadfield. This helped me write a more effective functional test that captures these nuances and then update the code to make the test pass in 32a1ff2. The output from your last example above now looks like this:

$ augur filter  --sequences results/masked_aus.fasta    --metadata data/example_metadata_aus.tsv     --include defaults/include.txt     --output results/filtered_aus.fasta --output-strains results/filtered_aus.txt
WARNING: A sequence index was not provided, so we are generating one. Generate your own index ahead of time with `augur index` and pass it with `augur filter --sequence-index`.

1 strains were dropped during filtering
	1 had no metadata

	0 strains were added back because they were requested by include files
	3 strains from include files were not added because they lacked sequence or metadata
91 strains passed all filters

$ grep "^>" results/filtered_aus.fasta | wc -l
91
$ wc -l results/filtered_aus.txt
91 results/filtered_aus.txt

For another example with more extensive reporting, you can change into the tests/functional/filter/ directory and run the following commands. In these test data, there are 12 strains represented including one strain missing metadata and one strain missing a sequence. The include file refers to a nonexistent strain and one Colombian strain.

$ echo -e "NotReal\nCOL/FLR_00008/2015"  > include.txt
$ augur filter \
  --sequence-index sequence_index.tsv \
  --metadata metadata.tsv \
  --query "country != 'Colombia'" \
  --include include.txt \
  --output-strains test.txt

4 strains were dropped during filtering
	1 had no sequence data
	1 had no metadata
	3 of these were filtered out by the query:
		"country != 'Colombia'"

	1 strains were added back because they were requested by include files
	1 strains from include files were not added because they lacked sequence or metadata
8 strains passed all filters

Based on today's meeting discussion, I've left the sequences as optional. I'll test this more completely with the ncov workflow and also open it up for general review.

Allows users to filter strains using only metadata information and output the strains that pass filters as either a metadata table or a list of names. Additionally, allows users to select only those sequences whose strain names are listed in one or more include files. To enable this functionality, this commit makes the following major changes to the filter interface: - sequence input and output are optional - non-sequence-based filters work without sequence input or index - the include argument accepts multiple inputs whose contents are unioned - a new `--exclude-all` flag excludes all strains by default, allowing only those in the include flags to be added back - new metadata and strain output arguments are supported - the initial complete list of strain names to consider for filtering originates from the metadata instead of the sequences Collectively, these changes allow multiple filter commands to be run in parallel without accessing the original sequence data and then the outputs of these separate commands can be joined by a single final filter command to output the sequences corresponding to all desired strain names. For workflows with large sequence inputs, like ncov, these changes should speed up analyses considerably.

Adds a `read_strains` utility function that can read strain names from one or more files and return the distinct set of all names. Updates the filter logic to use this function to read in names of strains to include.

Updates the `--exclude` argument to allow one or more values and the corresponding logic to read in strains from these files with the new `read_strains` utility function.

When sequences input (or a sequence index) is provided, intersect those strains with the strains in the metadata while also tracking the number of strains filtered by this operation. Note that this approach is not too different from the original filter implementation that required each sequence to have metadata or be filtered out. This approach has the additional effect of filtering out strains that have metadata and no sequences which is a filter that used to happen more implicitly before. This commit adds the number of strains filtered by this operation to the report at the end, so users have a better sense of which strains are missing data. As a side effect of getting the functional tests in this commit to work, this commit also reorders the report output such that the user always sees how many strains were filtered and why before the error message gets printed when all strains are filtered. This commit also adds a functional test for filtering VCFs. Fixes #681

Report the number of strains that had sequences but did not have any metadata.

Previously, augur filter treated comments differently for `include` and `exclude` files. This commit consistently allows the more permissive commenting associated with the `exclude` files such that users can make inline comments on the same line as a strain name. This commit also adds tests to confirm that we get the expected behavior from the current code. This change should be backwards compatible with existing `include` inputs, since the new comment standard is a superset of the older `include` format.

@jameshadfield

Fixes inconsistencies in numbers from filter reports associated with edge cases where strains are missing metadata or sequence data. Specific changes include: - calculate the intersection of strains with metadata and sequences (whenever sequences are provided) as the "source of truth" - track _strains_ as opposed to _sequences_ through sets instead of lists (enabling clear logic and testing of group membership) - report about strains dropped from either lack of metadata or sequence data - report how many strains could or could not be included from the include file inputs to clarify when requested strains do not have metadata or sequence data This commit also updates functional tests for augur filter to include a more complex example of how these filter behaviors should manifest, based on an excellent example from @jameshadfield.

Outputs how many "available strains" were initially dropped by the `--exclude-all` flag.

huddlej · 2021-02-27T00:14:04Z

I just hit another edge case where I wanted to filter a subsampled FASTA by date using a sequence index I prebuilt from the whole original set of sequences and the complete metadata. Given these inputs:

subsampled FASTA: 6827 strains
sequence index: 78133 strains
metadata: 78133 strains

I ran the following command:

augur filter \
  --sequences results/subsampled.fasta \
  --sequence-index results/sequence_index.tsv \
  --metadata results/metadata.tsv \
  --min-date 2017-01-01 \
  --max-date 2020-01-01 \
  --output-sequences results/filtered.fasta

Instead of writing out the 5437 of 6827 sequences from the input FASTA that passed the filter, the report output was:

46133 strains were dropped during filtering
	46133 of these were dropped because of their date (or lack of date)
32000 strains passed all filters

There are a couple of possible workarounds for this case:

Omit the sequence index and let augur rebuild it.
Export the subsampled metadata in the same command that output the subsampled sequences such that the metadata source of truth is limited.

But, the reporting behavior above is still a bug. Ideally, augur would use the number of strains that are actually written to disk as its final number of strains that passed the filter. One problem is that we don't know how many strains are in a sequence file without looping through the whole file again.

rneher · 2021-02-28T16:39:25Z

thanks so much @huddlej! This will be very useful (though we should definitely break up the filter function!!)

Regarding the edge case above, would a good solution be the following pseudo code?

if index & sequences:
    sequence_strains = index | sequences
elif index:
    sequence_strains = index
elif sequences:
    sequence_strains = sequencs
else:
    sequence_strains=[]

rneher

other than the edge case you flagged and the need to refactor, I think this is great. But refactor is probably best left for a later PR.

It is possible for the sequence index to be a superset of (or generally out of sync with) the given sequences output. To handle this case, we track the observed sequence strains when the user requests sequence output and update the statistics for which strains were filtered and why. We also move the handling of sequence outputs before the metadata and strain list outputs, so these outputs reflect any changes in the available sequences that occur when writing sequence output.

It's is a normal use case for the sequence index to contain a superset of strains present in the sequence inputs. Only warn users about a mismatch between the index and sequences when the superset relationship is not true.

huddlej · 2021-03-01T20:04:57Z

Thanks for checking this out, @rneher! I just pushed a fix for the edge case in e1b1153. The output from my example above now looks like this (the total number of strains has changed slightly due to randomness in my subsampling):

72682 strains were dropped during filtering
	71292 had no sequence data
	46133 of these were dropped because of their date (or lack of date)
5451 strains passed all filters

I ended up checking the available sequence strains as part of the single pass through the sequences input and updating the sequence_strains set based on those observed strains. If there is a difference between the observed strains and those in the sequence index, filter updates the set of strains to keep, etc. based on what's actually available. If the observed strains are not a subset of the expected strains from the index, filter warns the user with:

WARNING: The sequence index is out of sync with the provided sequences. Augur will only output strains with available sequences.

This update to sequence_strains and seq_keep occurs before the metadata and strains output, so we avoid the case where there are more strains in those outputs than in the sequence outputs.

The downside to this approach is that the number of strains dropped for missing sequence data can now overlap with the number of strains filtered for other reasons (like the date filter in the example above). We can continue to refine this reporting in the future, but I think this approach is worth the benefit of not reading through the sequences twice (at the beginning and end).

I agree this module is ripe for a refactor!

Provides a more logical presentation of the many filter arguments by grouping them into inputs, metadata filters, sequence filters, subsampling, and outputs. Reorders some of the arguments such that related arguments are close to each other and required arguments occur before optional arguments (e.g., metadata before sequences). Expands some of the help strings to clarify arguments, too.

huddlej · 2021-03-04T18:21:01Z

@jameshadfield 1665e1b adds clearer argument grouping for inputs, metadata filters, sequence filters, subsampling options, and outputs. If this looks reasonable, I'll plan to merge this PR so we can release this feature and start integrating it into the ncov workflow. If you have any recommendations for changes, feel free to modify and push new commits.

jameshadfield · 2021-03-08T02:08:44Z

This is really nice @huddlej. Let's get this merged 😄

huddlej requested review from jameshadfield, emmahodcroft, rneher and trvrb February 13, 2021 00:56

huddlej self-assigned this Feb 16, 2021

huddlej commented Feb 18, 2021

View reviewed changes

jameshadfield requested changes Feb 19, 2021

View reviewed changes

jameshadfield reviewed Feb 22, 2021

View reviewed changes

augur/utils.py Outdated Show resolved Hide resolved

jameshadfield requested changes Feb 22, 2021

View reviewed changes

huddlej added 8 commits February 25, 2021 13:36

Define utility function to read strains from files

8b71d09

Adds a `read_strains` utility function that can read strain names from one or more files and return the distinct set of all names. Updates the filter logic to use this function to read in names of strains to include.

Allow users to pass more than one exclude file to filter

008ea1f

Updates the `--exclude` argument to allow one or more values and the corresponding logic to read in strains from these files with the new `read_strains` utility function.

Track the number of strains with sequences and no metadata

e40243c

Report the number of strains that had sequences but did not have any metadata.

Report how many strains were excluded by the --exclude-all flag

26ec5b6

Outputs how many "available strains" were initially dropped by the `--exclude-all` flag.

huddlej force-pushed the metadata-only-filter branch from 2304dd7 to 26ec5b6 Compare February 25, 2021 21:41

huddlej marked this pull request as ready for review February 25, 2021 21:54

rneher approved these changes Feb 28, 2021

View reviewed changes

huddlej added 2 commits March 1, 2021 11:08

Don't warn user when sequences are a subset of the index

34d66f5

It's is a normal use case for the sequence index to contain a superset of strains present in the sequence inputs. Only warn users about a mismatch between the index and sequences when the superset relationship is not true.

huddlej mentioned this pull request Mar 3, 2021

WIP: Use metadata-only filtering in subsample jobs nextstrain/ncov#571

Closed

jameshadfield approved these changes Mar 8, 2021

View reviewed changes

huddlej added this to the Next release 11.X.X milestone Mar 8, 2021

huddlej merged commit 6ab87dd into master Mar 8, 2021

huddlej deleted the metadata-only-filter branch March 8, 2021 18:36

huddlej mentioned this pull request Apr 2, 2021

filter: Report number and names of sequences without metadata #675

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow filtering by metadata only #679

Allow filtering by metadata only #679

huddlej commented Feb 13, 2021

codecov bot commented Feb 13, 2021 •

edited

Loading

huddlej Feb 18, 2021

huddlej commented Feb 18, 2021 •

edited

Loading

huddlej commented Feb 18, 2021

jameshadfield left a comment

jameshadfield Feb 19, 2021

huddlej Feb 21, 2021

huddlej commented Feb 21, 2021

jameshadfield left a comment •

edited

Loading

huddlej commented Feb 25, 2021

huddlej commented Feb 27, 2021

rneher commented Feb 28, 2021

rneher left a comment

huddlej commented Mar 1, 2021

huddlej commented Mar 4, 2021

jameshadfield commented Mar 8, 2021

Allow filtering by metadata only #679

Allow filtering by metadata only #679

Conversation

huddlej commented Feb 13, 2021

Description of proposed changes

Testing

codecov bot commented Feb 13, 2021 • edited Loading

Codecov Report

huddlej Feb 18, 2021

Choose a reason for hiding this comment

huddlej commented Feb 18, 2021 • edited Loading

huddlej commented Feb 18, 2021

jameshadfield left a comment

Choose a reason for hiding this comment

jameshadfield Feb 19, 2021

Choose a reason for hiding this comment

huddlej Feb 21, 2021

Choose a reason for hiding this comment

huddlej commented Feb 21, 2021

jameshadfield left a comment • edited Loading

Choose a reason for hiding this comment

huddlej commented Feb 25, 2021

huddlej commented Feb 27, 2021

rneher commented Feb 28, 2021

rneher left a comment

Choose a reason for hiding this comment

huddlej commented Mar 1, 2021

huddlej commented Mar 4, 2021

jameshadfield commented Mar 8, 2021

codecov bot commented Feb 13, 2021 •

edited

Loading

huddlej commented Feb 18, 2021 •

edited

Loading

jameshadfield left a comment •

edited

Loading