Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow filtering by metadata only #679

Merged
merged 11 commits into from
Mar 8, 2021
Merged

Allow filtering by metadata only #679

merged 11 commits into from
Mar 8, 2021

Conversation

huddlej
Copy link
Contributor

@huddlej huddlej commented Feb 13, 2021

Description of proposed changes

Allows users to filter strains using only metadata information and output the strains that pass filters as either a metadata table or a list of names. Additionally, allows users to select only those sequences whose strain names are listed in one or more include files. To enable this functionality, this commit makes the following major changes to the filter interface:

  • sequence input and output are optional
  • non-sequence-based filters work without sequence input or index
  • the include argument accepts multiple inputs whose contents are unioned
  • a new --exclude-all flag excludes all strains by default, allowing only those in the include flags to be added back (this explicit supports functionality that I've often hacked with an all-encompassing exclusion filter like --exclude-where 'country!=fakecountry' plus the --include argument)
  • new metadata and strain output arguments are supported
  • the initial complete list of strain names to consider for filtering originates from the metadata instead of the sequences

Collectively, these changes allow multiple filter commands to be run in parallel without accessing the original sequence data and then the outputs of these separate commands can be joined by a single final filter command to output the sequences corresponding to all desired strain names. For workflows with large sequence inputs, like ncov, these changes should speed up analyses considerably.

A useful pattern (also shown in the functional tests) that is enabled by this new interface looks like this:

# Get Zika strains from Brazil.
augur filter \
  --sequence-index tests/functional/filter/sequence_index.tsv \
  --metadata tests/functional/filter/metadata.tsv \
  --query "country == 'Brazil'" \
  --output-strains brazil.txt

# Get Zika strains from Colombia.
augur filter \
  --sequence-index tests/functional/filter/sequence_index.tsv \
  --metadata tests/functional/filter/metadata.tsv \
  --query "country == 'Colombia'" \
  --output-strains colombia.txt

# Get sequences and metadata for all selected strains.
augur filter \
  --sequences tests/functional/filter/sequences.fasta \
  --sequence-index tests/functional/filter/sequence_index.tsv \
  --metadata tests/functional/filter/metadata.tsv \
  --exclude-all \
  --include brazil.txt colombia.txt \
  --output-sequences filtered.fasta \
  --output-metadata filtered_metadata.tsv

Testing

This PR adds functional tests for the proposed new filter interface in tests/functional/filter.t.

@codecov
Copy link

codecov bot commented Feb 13, 2021

Codecov Report

Merging #679 (1665e1b) into master (698df3d) will increase coverage by 0.35%.
The diff coverage is 52.34%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #679      +/-   ##
==========================================
+ Coverage   30.27%   30.62%   +0.35%     
==========================================
  Files          40       40              
  Lines        5563     5812     +249     
  Branches     1354     1463     +109     
==========================================
+ Hits         1684     1780      +96     
- Misses       3819     3946     +127     
- Partials       60       86      +26     
Impacted Files Coverage Δ
augur/filter.py 45.84% <49.28%> (-1.20%) ⬇️
augur/utils.py 38.36% <100.00%> (+1.72%) ⬆️
augur/index.py 84.00% <0.00%> (+1.85%) ⬆️
augur/titer_model.py 21.81% <0.00%> (+2.90%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 698df3d...1665e1b. Read the comment docs.

@@ -146,7 +188,8 @@ def run(args):
#If VCF, open and get sequence names
if is_vcf:
seq_keep, all_seq = read_vcf(args.sequences)
else:
# TODO: How to best handle "source of truth" for VCF files vs. metadata?
elif args.sequences or args.sequence_index:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One issue I just ran into while mocking up a ncov workflow with this interface was that I had to include a sequence index in the final filter step that pulls in all of the subsampled strains and extracts their FASTA sequences. Even though the index isn't necessary at that stage because there are no sequence-based filters being applied, if I didn't include the sequence index, augur would try to build it for me.

So, this line here should probably also check for whether any sequence-based filters have been requested (as in the argument validation code above) and then skip the sequence index completely if it isn't needed.

@huddlej
Copy link
Contributor Author

huddlej commented Feb 18, 2021

An alternative interface for merging the subsampled strains and extracting their sequences could be a combination of UNIX utilities and samtools. For example:

# Collect all unique strains from previous filter steps into a single file.
sort -u brazil.txt colombia.txt > strains.txt

# Extract sequences for the given strains.
# This builds a samtools FASTA index (a `sequences.fasta.fai` file) the first time it is run
# and then uses that index for subsequent commands.
samtools faidx -r strains.txt tests/functional/filter/sequences.fasta > filtered.fasta

This approach is much faster (1 second to extract sequences compared to 1-2 minutes with augur filter) because samtools is written in C and optimized for random reads in big sequence files by using an index. The downside here is that users can't easily get the metadata for the filtered strains, too. While having a filtered metadata file has been critical for my own flu analyses, that may not be a common use case and I've gotten by with my custom scripts to perform the same function.

If we used this approach, we could get rid of the --exclude-all and multiple input support for --include described in this PR.

huddlej added a commit to nextstrain/ncov that referenced this pull request Feb 18, 2021
Builds on a prototype augur interface for metadata-only filtering [1] to speed
up the ncov subsampling steps. Specifically, subsampling writes out lists of
selected strain names to text files instead of writing out sequences. Then the
"combine samples" step extracts sequences from the unique list of strain names
with samtools faidx. The resulting workflow requires substantially less I/O and
speeds up subsampling from an average of 163 seconds per job (for 12 jobs in a
Nextstrain global build) to 36 seconds per job.

One downside to the current metadata-only approach in Augur that reveals itself
here is that its possible to have mismatches between metadata and sequences such
that a strain has metadata but does not have a sequence in the filtered
FASTA. One way around this issue is to export the filtered metadata earlier in
the pipeline. For now, we use samtools faidx with the `-c` flag to ignore
missing sequences.

[1] nextstrain/augur#679
@huddlej
Copy link
Contributor Author

huddlej commented Feb 18, 2021

@jameshadfield I've mocked up a ncov workflow using with the proposed metadata-only interface from this PR. After upfront costs of indexing the sequences (6 minutes for augur index and 1-2 minutes for samtools faidx), each subsampling job only takes ~30 seconds and the final steps to collect strain names and extract their sequences with samtools takes ~30 seconds.

Copy link
Member

@jameshadfield jameshadfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really good @huddlej and the speed improvements for ncov will be immense -- we currently find ourselves in a situation where aligning is faster than filtering!

I tried to cover the different paths available for running augur filter - there are many, and I know I will have missed some, especially around VCF handling. The main confusion I had was the inconsistencies between different sources of truth. For instance, using the ncov pipeline's example_metadata.tsv

echo -e ">something\nATCG" > dummy.fasta
augur filter --metadata example_metadata.tsv --sequences dummy.fasta  \
    --max-date 2020-01-30 --output-strains result.txt 
 # results in 14 strains. I argue it should be 0.
augur filter --metadata example_metadata.tsv --sequences dummy.fasta  \
    --max-date 2020-01-30 --output-strains result.txt --output result.fasta
 # results in 0 strains in the FASTA -- which I think is correct
 # and result.txt is not modified (so it may still exist from a previous call)

I've made in-line comments which would lead us towards taking the intersection of inputs, but there may be other directions to pursue.

augur/filter.py Outdated Show resolved Hide resolved
augur/filter.py Outdated
@@ -91,15 +91,16 @@ def filter_by_query(sequences, metadata_file, query):
return [seq for seq in sequences if seq in filtered_meta_dict]

def register_arguments(parser):
parser.add_argument('--sequences', '-s', required=True, help="sequences in fasta or VCF format")
parser.add_argument('--sequences', '-s', help="sequences in FASTA or VCF format")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The augur filter interface is becoming rather complicated, but I'm not sure how much we can explain within augur --help. We could consider using argument groups to separate into "data input args", "parameters", and "outputs", each with a small preamble. This would help with errors such as ERROR: You need to select at least one output.

P.S. the arguement validation & error messages you've added in this PR are really helpful once a user starts to actually run things.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this idea a lot. I'm not sure what those groups would be exactly though, so I'm hesitant to commit any specifics here. Maybe we can discuss this at the next meeting, too?

augur/filter.py Outdated Show resolved Hide resolved
augur/filter.py Outdated Show resolved Hide resolved
augur/filter.py Show resolved Hide resolved
augur/filter.py Outdated Show resolved Hide resolved
augur/filter.py Show resolved Hide resolved
@huddlej
Copy link
Contributor Author

huddlej commented Feb 21, 2021

Thank you for this detailed review, @jameshadfield! Your example commands were helpful for mocking up end-to-end tests that we were missing.

I've tried to address most of the comments. The only one I couldn't determine a solution to was the grouping of arguments. The current state of the PR still treats the metadata as "the source of truth". How we choose to handle this could be discussed more. My main thought here is that metadata are the only required input, so they basically have to be prompted to "source of truth". If we don't want this behavior, then we should require args.sequences or args.sequence_index instead of allowing them to be optional.

We still get the speed benefit of the "metadata-only" approach, if we require a sequence index, so maybe that is a better direction. One benefit of the current approach is it allows us to skip creating a sequence index completely if we don't request any sequence filters.

augur/utils.py Outdated Show resolved Hide resolved
Copy link
Member

@jameshadfield jameshadfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me and works well in most situations, however I think there are still a couple of edge cases with strange behavior.

Firstly, given (e.g) a set of 4 sequences (FASTA), and 4 rows of metadata, where there is only 2 strains in common (i.e. there are 2 sequences without metadata, and two rows of metadata without sequences) then the output is:

2 sequences were dropped during filtering
        2 were excluded because they had no sequence data
        2 sequences were excluded because they did not have metadata

Ideally this would be:

4 samples were dropped during filtering
        2 were excluded because they had no sequence data
        2 were excluded because they did not have metadata

Secondly, there is inconsistent output stemming (I think) from the --includeed samples. E.g. using the multiple-inputs tutorial from nCoV we have a situation where we start with 92 seqs and 90 metadata rows,

$ augur filter  --sequences results/masked_aus.fasta    --metadata data/example_metadata_aus.tsv     --include defaults/include.txt     --output results/filtered_aus.fasta --output-strains results/filtered_aus.txt
-3 sequences were dropped during filtering
        1 sequences were excluded because they did not have metadata

        3 sequences were added back because they were in defaults/include.txt
94 strains have been passed all filters

and end up with a FASTA containing 91 sequences but a strains list containing 94! For consistency, to force-include strains I think they must be in the provided metadata and sequences (if supplied). Would also be nice to separate out "1 strain dropped during filtering" from "n samples added back in ..." (here n=3 but I think it should be 0).

@huddlej
Copy link
Contributor Author

huddlej commented Feb 25, 2021

Thank you so much for the detailed inspection of these edge cases, @jameshadfield. This helped me write a more effective functional test that captures these nuances and then update the code to make the test pass in 32a1ff2. The output from your last example above now looks like this:

$ augur filter  --sequences results/masked_aus.fasta    --metadata data/example_metadata_aus.tsv     --include defaults/include.txt     --output results/filtered_aus.fasta --output-strains results/filtered_aus.txt
WARNING: A sequence index was not provided, so we are generating one. Generate your own index ahead of time with `augur index` and pass it with `augur filter --sequence-index`.

1 strains were dropped during filtering
	1 had no metadata

	0 strains were added back because they were requested by include files
	3 strains from include files were not added because they lacked sequence or metadata
91 strains passed all filters

$ grep "^>" results/filtered_aus.fasta | wc -l
91
$ wc -l results/filtered_aus.txt
91 results/filtered_aus.txt

For another example with more extensive reporting, you can change into the tests/functional/filter/ directory and run the following commands. In these test data, there are 12 strains represented including one strain missing metadata and one strain missing a sequence. The include file refers to a nonexistent strain and one Colombian strain.

$ echo -e "NotReal\nCOL/FLR_00008/2015"  > include.txt
$ augur filter \
  --sequence-index sequence_index.tsv \
  --metadata metadata.tsv \
  --query "country != 'Colombia'" \
  --include include.txt \
  --output-strains test.txt

4 strains were dropped during filtering
	1 had no sequence data
	1 had no metadata
	3 of these were filtered out by the query:
		"country != 'Colombia'"

	1 strains were added back because they were requested by include files
	1 strains from include files were not added because they lacked sequence or metadata
8 strains passed all filters

Based on today's meeting discussion, I've left the sequences as optional. I'll test this more completely with the ncov workflow and also open it up for general review.

Allows users to filter strains using only metadata information and
output the strains that pass filters as either a metadata table or a
list of names. Additionally, allows users to select only those sequences
whose strain names are listed in one or more include files. To enable
this functionality, this commit makes the following major changes to the
filter interface:

 - sequence input and output are optional
 - non-sequence-based filters work without sequence input or index
 - the include argument accepts multiple inputs whose contents are unioned
 - a new `--exclude-all` flag excludes all strains by default, allowing
   only those in the include flags to be added back
 - new metadata and strain output arguments are supported
 - the initial complete list of strain names to consider for filtering
   originates from the metadata instead of the sequences

Collectively, these changes allow multiple filter commands to be run in
parallel without accessing the original sequence data and then the
outputs of these separate commands can be joined by a single final
filter command to output the sequences corresponding to all desired
strain names. For workflows with large sequence inputs, like ncov, these
changes should speed up analyses considerably.
Adds a `read_strains` utility function that can read strain names from
one or more files and return the distinct set of all names. Updates the
filter logic to use this function to read in names of strains to
include.
Updates the `--exclude` argument to allow one or more values and the
corresponding logic to read in strains from these files with the new
`read_strains` utility function.
When sequences input (or a sequence index) is provided, intersect those
strains with the strains in the metadata while also tracking the number
of strains filtered by this operation.

Note that this approach is not too different from the original filter
implementation that required each sequence to have metadata or be
filtered out. This approach has the additional effect of filtering out
strains that have metadata and no sequences which is a filter that used
to happen more implicitly before. This commit adds the number of strains
filtered by this operation to the report at the end, so users have a
better sense of which strains are missing data.

As a side effect of getting the functional tests in this commit to work,
this commit also reorders the report output such that the user always
sees how many strains were filtered and why before the error message
gets printed when all strains are filtered.

This commit also adds a functional test for filtering VCFs.

Fixes #681
Report the number of strains that had sequences but did not have any
metadata.
Previously, augur filter treated comments differently for `include` and
`exclude` files. This commit consistently allows the more permissive
commenting associated with the `exclude` files such that users can make
inline comments on the same line as a strain name. This commit also adds
tests to confirm that we get the expected behavior from the current
code. This change should be backwards compatible with existing `include`
inputs, since the new comment standard is a superset of the older
`include` format.
Fixes inconsistencies in numbers from filter reports associated with
edge cases where strains are missing metadata or sequence data. Specific
changes include:

 - calculate the intersection of strains with metadata and sequences
   (whenever sequences are provided) as the "source of truth"
 - track _strains_ as opposed to _sequences_ through sets instead
   of lists (enabling clear logic and testing of group membership)
 - report about strains dropped from either lack of metadata or sequence
   data
 - report how many strains could or could not be included from the
   include file inputs to clarify when requested strains do not have
   metadata or sequence data

This commit also updates functional tests for augur filter to include a
more complex example of how these filter behaviors should manifest,
based on an excellent example from @jameshadfield.
Outputs how many "available strains" were initially dropped by the
`--exclude-all` flag.
@huddlej huddlej marked this pull request as ready for review February 25, 2021 21:54
@huddlej
Copy link
Contributor Author

huddlej commented Feb 27, 2021

I just hit another edge case where I wanted to filter a subsampled FASTA by date using a sequence index I prebuilt from the whole original set of sequences and the complete metadata. Given these inputs:

subsampled FASTA: 6827 strains
sequence index: 78133 strains
metadata: 78133 strains

I ran the following command:

augur filter \
  --sequences results/subsampled.fasta \
  --sequence-index results/sequence_index.tsv \
  --metadata results/metadata.tsv \
  --min-date 2017-01-01 \
  --max-date 2020-01-01 \
  --output-sequences results/filtered.fasta

Instead of writing out the 5437 of 6827 sequences from the input FASTA that passed the filter, the report output was:

46133 strains were dropped during filtering
	46133 of these were dropped because of their date (or lack of date)
32000 strains passed all filters

There are a couple of possible workarounds for this case:

  1. Omit the sequence index and let augur rebuild it.
  2. Export the subsampled metadata in the same command that output the subsampled sequences such that the metadata source of truth is limited.

But, the reporting behavior above is still a bug. Ideally, augur would use the number of strains that are actually written to disk as its final number of strains that passed the filter. One problem is that we don't know how many strains are in a sequence file without looping through the whole file again.

@rneher
Copy link
Member

rneher commented Feb 28, 2021

thanks so much @huddlej! This will be very useful (though we should definitely break up the filter function!!)

Regarding the edge case above, would a good solution be the following pseudo code?

if index & sequences:
    sequence_strains = index | sequences
elif index:
    sequence_strains = index
elif sequences:
    sequence_strains = sequencs
else:
    sequence_strains=[]

Copy link
Member

@rneher rneher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

other than the edge case you flagged and the need to refactor, I think this is great. But refactor is probably best left for a later PR.

It is possible for the sequence index to be a superset of (or generally
out of sync with) the given sequences output. To handle this case, we
track the observed sequence strains when the user requests sequence
output and update the statistics for which strains were filtered and
why. We also move the handling of sequence outputs before the metadata
and strain list outputs, so these outputs reflect any changes in the
available sequences that occur when writing sequence output.
It's is a normal use case for the sequence index to contain a superset
of strains present in the sequence inputs. Only warn users about a
mismatch between the index and sequences when the superset relationship
is not true.
@huddlej
Copy link
Contributor Author

huddlej commented Mar 1, 2021

Thanks for checking this out, @rneher! I just pushed a fix for the edge case in e1b1153. The output from my example above now looks like this (the total number of strains has changed slightly due to randomness in my subsampling):

72682 strains were dropped during filtering
	71292 had no sequence data
	46133 of these were dropped because of their date (or lack of date)
5451 strains passed all filters

I ended up checking the available sequence strains as part of the single pass through the sequences input and updating the sequence_strains set based on those observed strains. If there is a difference between the observed strains and those in the sequence index, filter updates the set of strains to keep, etc. based on what's actually available. If the observed strains are not a subset of the expected strains from the index, filter warns the user with:

WARNING: The sequence index is out of sync with the provided sequences. Augur will only output strains with available sequences.

This update to sequence_strains and seq_keep occurs before the metadata and strains output, so we avoid the case where there are more strains in those outputs than in the sequence outputs.

The downside to this approach is that the number of strains dropped for missing sequence data can now overlap with the number of strains filtered for other reasons (like the date filter in the example above). We can continue to refine this reporting in the future, but I think this approach is worth the benefit of not reading through the sequences twice (at the beginning and end).

I agree this module is ripe for a refactor!

Provides a more logical presentation of the many filter arguments by
grouping them into inputs, metadata filters, sequence filters,
subsampling, and outputs. Reorders some of the arguments such that
related arguments are close to each other and required arguments occur
before optional arguments (e.g., metadata before sequences). Expands
some of the help strings to clarify arguments, too.
@huddlej
Copy link
Contributor Author

huddlej commented Mar 4, 2021

@jameshadfield 1665e1b adds clearer argument grouping for inputs, metadata filters, sequence filters, subsampling options, and outputs. If this looks reasonable, I'll plan to merge this PR so we can release this feature and start integrating it into the ncov workflow. If you have any recommendations for changes, feel free to modify and push new commits.

@jameshadfield
Copy link
Member

This is really nice @huddlej. Let's get this merged 😄

@huddlej huddlej added this to the Next release 11.X.X milestone Mar 8, 2021
@huddlej huddlej merged commit 6ab87dd into master Mar 8, 2021
@huddlej huddlej deleted the metadata-only-filter branch March 8, 2021 18:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants