Feature request: allow wildcard filtering based on assembly name #75

Open
jdwinkler-lanzatech opened this issue Aug 21, 2022 · 4 comments

Comments

@jdwinkler-lanzatech

Hi,

I was wondering if it would be possible to provide a filtering option based on assembly (species/assigned) name? I often want to pull a group of microbes that share a general metabolic capability (say methanogenesis), but currently I have to pick out the TaxIDs manually to do so. Not a major problem, but the feature might be useful for other people too!

@pirovc
Owner

pirovc commented Aug 22, 2022

Hi, thanks for the suggestion. genome_updater selects and filters data based on the assembly_summary.txt file provided by NCBI (more info: https://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt). Besides the filter parameters, the -F option allows custom filtering for data selection. However, I'm not sure the information you refer to is contained in that file.

@jdwinkler-lanzatech
Author

Column 8 would be the target, I think. I believe right now the -F option is an exact match though, so I am thinking of another flag that basically uses grep behind the scenes to implement the matching. I'd basically want to grab all the assemblies with an organism name matching "methano*", if that makes sense. Obviously would not be perfect, but could be handy if you have a specific enough search string.
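To make the difference between exact and wildcard matching concrete, here is a minimal Python sketch. It is not how genome_updater works internally; `name_matches` is a hypothetical helper, and the organism names are just an illustrative sample. It also shows the "not perfect" caveat: a prefix pattern like "methano*" misses names that start with "Candidatus".

```python
from fnmatch import fnmatchcase


def name_matches(organism_name: str, pattern: str) -> bool:
    """Case-insensitive shell-style wildcard match (hypothetical helper)."""
    return fnmatchcase(organism_name.lower(), pattern.lower())


# Illustrative sample of organism names, not taken from a real download.
names = [
    "Methanobrevibacter smithii",
    "Methanosarcina barkeri",
    "Candidatus Methanomethylophilus alvus",
    "Escherichia coli",
]

# "methano*" is anchored at the start of the name, so the Candidatus entry is missed.
prefix_hits = [n for n in names if name_matches(n, "methano*")]

# "*methano*" matches the substring anywhere in the name and also catches it.
anywhere_hits = [n for n in names if name_matches(n, "*methano*")]
```

A substring-style pattern is more forgiving, at the cost of also matching names where "methano" appears for unrelated reasons.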

@pirovc
Owner

pirovc commented Aug 24, 2022

Partial matching should be doable, will mark it as an enhancement. For now, one can download the full assembly_summary.txt from GenBank or RefSeq, apply the filter/grep manually, and use the resulting file as an external assembly_summary.txt (param. -e).
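The manual workaround could be sketched in Python roughly as follows. This is only an illustration under a few assumptions: the file is tab-separated with the organism name in column 8 (as suggested earlier in the thread), header lines start with "#", and the function and file names are hypothetical. Header lines are kept so the output has the same layout as the original; whether genome_updater needs them is not stated here.

```python
from fnmatch import fnmatchcase

# Zero-based index of column 8 (organism name) in assembly_summary.txt.
ORGANISM_NAME_COLUMN = 7


def filter_assembly_summary(lines, pattern):
    """Yield header lines plus rows whose organism name matches the wildcard pattern."""
    wanted = pattern.lower()
    for line in lines:
        if line.startswith("#"):
            # Keep header/comment lines unchanged.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= ORGANISM_NAME_COLUMN:
            # Skip malformed or truncated rows.
            continue
        if fnmatchcase(fields[ORGANISM_NAME_COLUMN].lower(), wanted):
            yield line


def filter_file(src_path, dst_path, pattern):
    """Write the filtered copy of src_path to dst_path."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(filter_assembly_summary(src, pattern))


# Example call (hypothetical file names):
# filter_file("assembly_summary.txt", "assembly_summary_methano.txt", "methano*")
```

The filtered file can then be supplied as the external assembly_summary.txt via -e, as described above.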

@jdwinkler-lanzatech
Author

Great, thanks! I figure it is a logical addition to the custom filtering offered by -F already.

2 participants