[BUG] Pangenome: missing genes with repeat motifs #1955

Closed
FlorianTrigodet opened this issue Jul 7, 2022 · 2 comments

@FlorianTrigodet
Contributor

Short description of the problem

In the anvi'o Google group, Emily St John noticed that a few genes were missing after running anvi-pan-genome. I observed the same in a different pangenome analysis.

The genes had a low-complexity composition with repeats and were skipped because of DIAMOND's default masking parameter.

Maybe we should consider changing the default DIAMOND blastp parameters and removing the default masking with --masking 0?

This issue echoes #1922; the masking option could still be made available to advanced users who want to use it.

anvi'o version

Development branch.

Detailed description of the issue

Here is a gene that was skipped:

>hash0b6b2fe9_1019
MKPYAFSEEEEEWDEEEFEEEEEEEFEDFGEEEEFLEEEEEEEEWEEEEEEEF

To reproduce the problem, save the sequence above as test.fasta.
The default DIAMOND command used by anvi-pan-genome produces no output:

diamond blastp -q test.fasta \
               -d test.fasta \
               -o diamond-search-results.txt \
               --outfmt 6 \
               --max-target-seqs 100000 \
               --evalue 1e-05

With the masking off:

diamond blastp -q test.fasta \
               -d test.fasta \
               -o diamond-search-results.txt \
               --outfmt 6 \
               --max-target-seqs 100000 \
               --evalue 1e-05 \
               --masking 0
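
To find every gene affected in a larger run, one option is to compare the query IDs in the FASTA file against the query column of the tabular results. The helper below is a minimal sketch, not part of anvi'o or DIAMOND: the function name `queries_without_hits` is hypothetical, and it assumes the results were written with `--outfmt 6`, where the first column is the query ID.

```shell
# Hypothetical helper: list FASTA query IDs that have no row in a
# DIAMOND tabular (--outfmt 6) result file.
# Usage: queries_without_hits <query.fasta> <results.txt>
queries_without_hits() {
    fasta="$1"
    results="$2"
    ids_tmp=$(mktemp)
    hits_tmp=$(mktemp)
    # FASTA headers: drop the leading '>' and keep the first word only.
    grep '^>' "$fasta" | sed 's/^>//' | awk '{print $1}' | sort -u > "$ids_tmp"
    # Column 1 of outfmt 6 is the query ID.
    cut -f 1 "$results" | sort -u > "$hits_tmp"
    # comm -23 prints the lines that occur only in the first file.
    comm -23 "$ids_tmp" "$hits_tmp"
    rm -f "$ids_tmp" "$hits_tmp"
}
```

Running `queries_without_hits test.fasta diamond-search-results.txt` after each of the two commands above should make the difference visible: a gene that is listed has no hit at all, not even against itself.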
@meren
Member

meren commented Jul 8, 2022

Great point. I will add --masking 0 immediately. We need a different design for additional DIAMOND parameters / user access to that. Perhaps we can follow the --additional-parameters logic Alon implemented for snakemake workflows.

@meren
Member

meren commented Jul 19, 2022

This is now solved :) --masking 0 is our default, and users can change it in any way they wish via the --additional-params-for-seq-search flag, which the help menu describes as follows:

anvi-pan-genome -h
(...)
  --additional-params-for-seq-search CMD LINE PARAMS
                        OK. This is very important. While anvi'o has some
                        defaults for whichever approach you choose to use for
                        your sequence search, you can assume full control over
                        what is passed to the search program. Put anything you
                        wish anvi'o to send your search program in double
                        quotes, and they will be passed to the program. If you
                        don't use this parameter, in addition to the
                        additional parameters anvi'o will use to call your
                        search algorithm of preference, anvi'o will pass to
                        DIAMOND the following parameters: "--masking 0", and
                        to the NCBI blast nothing, as the default additional
                        parameters for the NCBI BLAST is empty.. If you use
                        this parameter, it will completely overwrite what you
                        see above. This means, if you are about to use DIAMOND
                        and would like to enable sensitive mode for DIAMOND
                        along with the current anvi'o default additional
                        parameter for it, then you should set this parameter
                        like this manually: --additional-params-for-seq-search
                        "--masking 0 --sensitive". DO NOT EVER FORGET THE
                        DOUBLE QUOTES, unless you hate your computer and want
                        to see it melting beforey our eyes. (default: None)
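
The warning about double quotes comes down to ordinary shell word splitting. The sketch below uses a throwaway function, `show_args` (a hypothetical name, not an anvi'o tool), to show how many arguments a program receives in each case. How anvi-pan-genome reacts to the unquoted form is not shown here.

```shell
# Print the number of arguments received, then each one in brackets.
show_args() {
    echo "$#"
    for arg in "$@"; do
        printf '[%s]\n' "$arg"
    done
}

# Quoted: the flag plus ONE value, so the count printed is 2.
show_args --additional-params-for-seq-search "--masking 0 --sensitive"

# Unquoted: the shell splits the value into three words, so the count is 4
# and --masking and --sensitive arrive as separate arguments.
show_args --additional-params-for-seq-search --masking 0 --sensitive
```

With the quotes, the whole string reaches the program as a single value that can be handed on to the search program. Without them, the words arrive as independent arguments.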

The parameters used are shown in the output message during runtime:

anvi-pan-genome -g TEST-GENOMES.db -n TEST

(...)
DIAMOND BLASTP
===============================================
Additional params for blastp .................: --masking 0
(...)

The additional parameters used are also stored in the pan-db under additional_params_for_seq_search:

anvi-db-info TEST/TEST-PAN.db

DB Info (no touch)
===============================================
Database Path ................................: TEST/TEST-PAN.db
Description ..................................: _No description is provided_
db_type ......................................: pan (variant: None)
version ......................................: 16


DB Info (no touch also)
===============================================
internal_genome_names ........................:
external_genome_names ........................: E_faecalis_6240,E_faecalis_6255,E_faecalis_6512,E_faecalis_6557,E_faecalis_6563
num_genomes ..................................: 5
min_percent_identity .........................: 0.0
gene_cluster_min_occurrence ..................: 1
mcl_inflation ................................: 2.0
default_view .................................: gene_cluster_presence_absence
use_ncbi_blast ...............................: 0
additional_params_for_seq_search .............: --masking 0
minbit .......................................: 0.5
exclude_partial_gene_calls ...................: 0
gene_alignments_computed .....................: 1
genomes_storage_hash .........................: hash8e5b917a
project_name .................................: TEST
creation_date ................................: 1658212778.57778
num_gene_clusters ............................: 369
num_genes_in_gene_clusters ...................: 1255
default_item_order ...........................: presence-absence:euclidean:ward
available_item_orders ........................: frequency:euclidean:ward,presence-absence:euclidean:ward,Forced synteny <> E_faecalis_6240:NA:NA,Forced synteny <> E_faecalis_6255:NA:NA,Forced synteny <> E_faecalis_6512:NA:NA,Forced synteny <>
                                                E_faecalis_6557:NA:NA,Forced synteny <> E_faecalis_6563:NA:NA
items_ordered ................................: 1

* Please remember that it is never a good idea to change these values. But in some
cases it may be absolutely necessary to update something here, and a programmer
may ask you to run this program and do it. But even then, you should be
extremely careful.
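
To check the stored value without reading the whole report, the text above can be filtered. This is a rough sketch: `db_info_value` is a hypothetical helper, and it assumes that piped `anvi-db-info` output keeps the plain `key ....: value` layout shown above, with no color codes.

```shell
# Hypothetical helper: read `anvi-db-info` text on stdin and print the
# value of a single key.
# Usage: anvi-db-info TEST/TEST-PAN.db | db_info_value additional_params_for_seq_search
db_info_value() {
    key="$1"
    # Split on ': ' so that $1 is the key followed by its dotted padding.
    awk -F': ' -v key="$key" '
        $1 ~ ("^" key " ") {
            # Remove everything up to and including the first ": ".
            sub(/^[^:]*: /, "")
            print
            exit
        }'
}
```

For the database above, the pipeline in the usage comment would be expected to print the `--masking 0` value shown in the report.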

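With this change, the CLI no longer has a --sensitive flag. To run DIAMOND in sensitive mode, pass it through --additional-params-for-seq-search "--masking 0 --sensitive", as described in the help text above.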

@meren meren closed this as completed Jul 19, 2022