Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

filter: Read all metadata columns when query parsing fails #1418

Merged
merged 3 commits into from Feb 16, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
5 changes: 5 additions & 0 deletions CHANGES.md
Expand Up @@ -2,6 +2,11 @@

## __NEXT__

### Bug Fixes

* filter: In versions 24.2.0 and 24.2.1, `--query` stopped working in cases where internal optimizations added in version 24.2.0 failed to parse the columns from the query. It now falls back to non-optimized behavior that allows queries to work. [#1418][] (@victorlin)

[#1418]: https://github.com/nextstrain/augur/pull/1418

## 24.2.1 (14 February 2024)

Expand Down
12 changes: 11 additions & 1 deletion augur/filter/io.py
Expand Up @@ -3,6 +3,7 @@
from argparse import Namespace
import os
import re
from textwrap import dedent
from typing import Sequence, Set
import numpy as np
from collections import defaultdict
Expand Down Expand Up @@ -65,7 +66,16 @@ def get_useful_metadata_columns(args: Namespace, id_column: str, all_columns: Se
# Attempt to automatically extract columns from the query.
variables = extract_variables(args.query)
if variables is None and not args.query_columns:
raise AugurError("Could not infer columns from the pandas query. If the query is valid, please specify columns using --query-columns.")
print_err(dedent(f"""\
WARNING: Could not infer columns from the pandas query. Reading all metadata columns,
which may impact execution time. If the query is valid, please open a new issue:

<https://github.com/nextstrain/augur/issues/new/choose>

and add the query to the description:

{args.query}"""))
columns.update(all_columns)
else:
columns.update(variables)

Expand Down
32 changes: 32 additions & 0 deletions tests/functional/filter/cram/filter-query-backtick-quoting.t
@@ -0,0 +1,32 @@
Setup

$ source "$TESTDIR"/_setup.sh

Create metadata file for testing.

$ cat >metadata.tsv <<~~
> strain region name
> SEQ_1 A
> SEQ_2 A
> SEQ_3 B
> SEQ_4
> ~~

The 'region name' column is query-able by backtick quoting.

$ ${AUGUR} filter \
> --metadata metadata.tsv \
> --query '(`region name` == "A")' \
> --output-strains filtered_strains.txt > /dev/null
WARNING: Could not infer columns from the pandas query. Reading all metadata columns,
which may impact execution time. If the query is valid, please open a new issue:

<https://github.com/nextstrain/augur/issues/new/choose>

and add the query to the description:

(`region name` == "A")

$ sort filtered_strains.txt
SEQ_1
SEQ_2
12 changes: 11 additions & 1 deletion tests/functional/filter/cram/filter-query-errors.t
Expand Up @@ -41,5 +41,15 @@ However, other Pandas errors are not so helpful, so a link is provided for users
> --metadata "$TESTDIR/../data/metadata.tsv" \
> --query "some bad syntax" \
> --output-strains filtered_strains.txt > /dev/null
ERROR: Could not infer columns from the pandas query. If the query is valid, please specify columns using --query-columns.
WARNING: Could not infer columns from the pandas query. Reading all metadata columns,
which may impact execution time. If the query is valid, please open a new issue:

<https://github.com/nextstrain/augur/issues/new/choose>

and add the query to the description:

some bad syntax
ERROR: Internal Pandas error when applying query:
invalid syntax (<unknown>, line 1)
Ensure the syntax is valid per <https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-query>.
[2]