New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

anvi-get-sequences-for-gene-clusters improvements #668

Closed
rbartelme opened this Issue Dec 13, 2017 · 13 comments

Comments

Projects
None yet
4 participants
@rbartelme

rbartelme commented Dec 13, 2017

Could you add these flags, found in the hmm export function, to export pc alignments as well?

  --max-num-genes-missing-from-bin INTEGER
                        This filter removes bins (or genomes) from your
                        analysis. If you have a list of gene names, you can
                        use this parameter to omit any bin (or external
                        genome) that is missing more than a number of genes
                        you desire. For instance, if you have 100 genome bins,
                        and you are interested in working with 5 ribosomal
                        proteins, you can use '--max-num-genes-missing-from-
                        bin 4' to remove remove the bins that are missing more
                        than 4 of those 5 genes. This is especially useful for
                        phylogenomic analyses. Parameter 0 will remove any bin
                        that is missing any of the genes.
  --min-num-bins-gene-occurs INTEGER
                        This filter removes genes from your analysis. Let's
                        assume you have 100 bins to get sequences for HMM
                        hits. If you want to work only with genes among all
                        the hits that occur in at least X number of bins, and
                        discard the rest of them, you can use this flag. If
                        you say '--min-num-bins-gene-occurs 90', each gene in
                        the analysis will be required at least to appear in 90
                        genomes. If a gene occurs in less than that number of
                        genomes, it simply will not be reported. This is
                        especially useful for phylogenomic analyses, where you
                        may want to only focus on genes that are prevalent
                        across the set of genomes you wish to analyze.
@meren

This comment has been minimized.

Show comment
Hide comment
@meren

meren Dec 13, 2017

Member

I see. I think this is a brilliant idea. I feel like we should allow users to be able to select a subset of gene clusters in the interface by playing with these parameters, too. So they can create their own collections of "core-like" genes, etc.

Member

meren commented Dec 13, 2017

I see. I think this is a brilliant idea. I feel like we should allow users to be able to select a subset of gene clusters in the interface by playing with these parameters, too. So they can create their own collections of "core-like" genes, etc.

@meren

This comment has been minimized.

Show comment
Hide comment
@meren

meren Dec 13, 2017

Member

I will take a stab at the algorithm, and then will talk to @ozcan to see what he thinks.

Member

meren commented Dec 13, 2017

I will take a stab at the algorithm, and then will talk to @ozcan to see what he thinks.

@rbartelme

This comment has been minimized.

Show comment
Hide comment
@rbartelme

rbartelme Dec 14, 2017

I also briefly mentioned to @ozcan, since he's here today, that having the partition data from the concatenations would be very useful. It let's one tell a phylogenetic program that the concatenated alignment consists of multiple genes, and where those genes start and end.

So if I had a concatenation of geneA, geneB, and geneC, and all were 300 characters long the partition would look something like:
geneA: 1-300
geneB: 300-600
geneC: 600-900

This would significantly improve the interface with standard external phylogenetics programs like RAxML, IQ-TREE, MrBayes, BEAST, etc. Even if it was just output as a separate text file.

rbartelme commented Dec 14, 2017

I also briefly mentioned to @ozcan, since he's here today, that having the partition data from the concatenations would be very useful. It let's one tell a phylogenetic program that the concatenated alignment consists of multiple genes, and where those genes start and end.

So if I had a concatenation of geneA, geneB, and geneC, and all were 300 characters long the partition would look something like:
geneA: 1-300
geneB: 300-600
geneC: 600-900

This would significantly improve the interface with standard external phylogenetics programs like RAxML, IQ-TREE, MrBayes, BEAST, etc. Even if it was just output as a separate text file.

@meren

This comment has been minimized.

Show comment
Hide comment
@meren

meren Dec 14, 2017

Member

Would it be useful if we reported in the FASTA defline the beginning and end position of each concatenated gene? Or do you need a fixed number?

Member

meren commented Dec 14, 2017

Would it be useful if we reported in the FASTA defline the beginning and end position of each concatenated gene? Or do you need a fixed number?

@rbartelme

This comment has been minimized.

Show comment
Hide comment
@rbartelme

rbartelme Dec 14, 2017

I think it might make the deflines really messy, unless they were added in between the gene name pipes (if that's still a thing the export does?). If it was like this...

Bacteria_Rthebest GenA:1-300|GenB:300-600|GenC:600-900
MNFDKFJSOFIJSIDFHSIDFKSDFSKFSKHAFJASHFASH----JKNKNS...etc

That could work. Just thinking about how many non-FASTA formats phylogenetics software require, but I think what you're suggestion might be workable!

rbartelme commented Dec 14, 2017

I think it might make the deflines really messy, unless they were added in between the gene name pipes (if that's still a thing the export does?). If it was like this...

Bacteria_Rthebest GenA:1-300|GenB:300-600|GenC:600-900
MNFDKFJSOFIJSIDFHSIDFKSDFSKFSKHAFJASHFASH----JKNKNS...etc

That could work. Just thinking about how many non-FASTA formats phylogenetics software require, but I think what you're suggestion might be workable!

@meren

This comment has been minimized.

Show comment
Hide comment
@meren

meren Dec 14, 2017

Member

It would be more like this:

GENOME GenA,1:257;GenB,257:555;GenC,555:724

Or it would be a fixed distance (i.e., the longest gene with gaps, such as a length of 644, and the defline would look like this:

GENOME genes:GenA,GenB,GenC|gene_block:644

The second one is wasteful. The first one is complex. I don't know what is the solution to this.

Member

meren commented Dec 14, 2017

It would be more like this:

GENOME GenA,1:257;GenB,257:555;GenC,555:724

Or it would be a fixed distance (i.e., the longest gene with gaps, such as a length of 644, and the defline would look like this:

GENOME genes:GenA,GenB,GenC|gene_block:644

The second one is wasteful. The first one is complex. I don't know what is the solution to this.

@rbartelme

This comment has been minimized.

Show comment
Hide comment
@rbartelme

rbartelme Dec 14, 2017

Thinking more like the first one.

rbartelme commented Dec 14, 2017

Thinking more like the first one.

@meren

This comment has been minimized.

Show comment
Hide comment
@meren

meren Dec 14, 2017

Member

Yes, although it looks painful, it is extremely easy to parse with a Python script. I APPROVE YOUR ATTITUDE.

Member

meren commented Dec 14, 2017

Yes, although it looks painful, it is extremely easy to parse with a Python script. I APPROVE YOUR ATTITUDE.

@osvatic

This comment has been minimized.

Show comment
Hide comment
@osvatic

osvatic Dec 14, 2017

Adding to this issue, it would be really nice if there was a way to select all unique ortholog groups (belonging to a selection genome or group of genomes) and output them in some way.

osvatic commented Dec 14, 2017

Adding to this issue, it would be really nice if there was a way to select all unique ortholog groups (belonging to a selection genome or group of genomes) and output them in some way.

@meren

This comment has been minimized.

Show comment
Hide comment
@meren

meren Dec 14, 2017

Member

Thanks, Jay. I think we can use the collections infrastructure to store this kind of information.

Member

meren commented Dec 14, 2017

Thanks, Jay. I think we can use the collections infrastructure to store this kind of information.

meren added a commit that referenced this issue Dec 27, 2017

meren added a commit that referenced this issue Dec 27, 2017

-export-gene-cluster-alignments -> get-sequences-for-gene-clusters
a much more talented program to deal with gene clusters. re: #668
@meren

This comment has been minimized.

Show comment
Hide comment
@meren

meren Dec 27, 2017

Member

Dear @rbartelme and @osvatic,

During my flight from France to the US I had some time to improve anvi-export-pc-alignments. It is now caled anvi-get-sequences-for-gene-clusters (similar to anvi-get-sequences-for-hmm-hits.

The program now contains the following filters:

 $ anvi-get-sequences-for-gene-clusters -h

    Do cool stuff with gene clusters in anvi'o pan genomes

(...)

ADVANCED FILTERS:
  If you are here you must be looking for ways to specify exactly what you
  want from that database of gene clusters. These filters will be applied to
  what your previous selections reported.

  --min-num-genomes-gene-cluster-occurs INTEGER
                        This filter will remove gene clusters from your
                        report. Let's assume you have 100 genomes in your pan
                        genome analysis. You can use this parameter if you
                        want to work only with gene clusters that occur in at
                        least X number of genomes. If you say '--min-num-
                        genomes-gene-cluster-occurs 90', each gene cluster in
                        the analysis will be required at least to appear in 90
                        genomes. If a gene occurs in less than that number of
                        genomes, it simply will not be reported. This is
                        especially useful for phylogenomic analyses, where you
                        may want to only focus on gene clusters that are
                        prevalent across the set of genomes you wish to
                        analyze.
  --max-num-genomes-gene-cluster-occurs INTEGER
                        This filter will remove gene clusters from your
                        report. Let's assume you have 100 genomes in your pan
                        genome analysis. You can use this parameter if you
                        want to work only with gene clusters that occur in at
                        most X number of genomes. If you say '--min-num-
                        genomes-gene-cluster-occurs 1', you will get gene
                        clusters that are singletons. Combining this paramter
                        with --min-num-genomes-gene-cluster-occurs can give
                        you a very precise way to filter your gene clusters.
  --min-num-genes-from-each-genome INTEGER
                        This filter will remove gene clusters from your
                        report. If you say '--min-num-genes-from-each-genome
                        2', this filter will remove every gene cluster, to
                        which every genome in your analysis contributed less
                        than 2 genes. This can be useful to find out gene
                        clusters with many genes from many genomes (such as
                        conserved multi-copy genes within a clade).
  --max-num-genes-from-each-genome INTEGER
                        This filter will remove gene clusters from your
                        report. If you say '--max-num-genes-from-each-genome
                        1', every gene cluster that has more than one gene
                        from any genome that contributes to it will be removed
                        from your analysis. This could be useful to remove
                        gene clusters with paralogs from your report for
                        appropriate phylogenomic analyses. For instance, using
                        '--max-num-genes-from-each-genome 1' and 'min-num-
                        genomes-gene-cluster-occurs X' where X is the total
                        number of your genomes, would give you the single-copy
                        gene cluters in your pan genome.
  --max-num-gene-clusters-missing-from-genome INTEGER
                        This filter will remove genomes from your report. If
                        you have a list of gene cluster names, you can use
                        this parameter to omit any genome from your report if
                        it is missing more than a number of genes you desire.
                        For instance, if you have 100 genomes in your pan
                        genome, and you are interested in working only with
                        genomes that have all 5 specific gene clusters of your
                        choice, you can use '--max-num-gene-clusters-missing-
                        from-genome 4' to remove remove the bins that are
                        missing more than 4 of those 5 genes. This is
                        especially useful for phylogenomic analyses. Parameter
                        0 will remove any genome that is missing any of the
                        genes.
  --add-into-items-additional-data-table NAME
                        If you use any of the filters, and would like to add
                        the resulting item names into the items additional
                        data table of your database, you can use this
                        parameter. You will need to give a name for these
                        results to be saved. If the given name is already in
                        the items additoinal data table, its contents will be
                        replaced with the new one. Then you can run anvi-
                        interactive or anvi-display-pan to 'see' the results
                        of your filters.

(...)

As you can see, the program can also store resulting gene clusters after applying all filters into item additional data table in the pan database with the parameter --add-into-items-additional-data-table (which means they can be visualized as additional layers --see this post for more information on these new tables and learn how those tables can be manipulated).

The code is a bit ugly and will likely change as we move forward, but those changes will not affect the user experience. So you can help us by testing it using the master repo, and suggest more ways to improve your experience.

I will talk to @ozcan about improving the interface using this new functionality. Then it will be much more fun!

Thanks,

Member

meren commented Dec 27, 2017

Dear @rbartelme and @osvatic,

During my flight from France to the US I had some time to improve anvi-export-pc-alignments. It is now caled anvi-get-sequences-for-gene-clusters (similar to anvi-get-sequences-for-hmm-hits.

The program now contains the following filters:

 $ anvi-get-sequences-for-gene-clusters -h

    Do cool stuff with gene clusters in anvi'o pan genomes

(...)

ADVANCED FILTERS:
  If you are here you must be looking for ways to specify exactly what you
  want from that database of gene clusters. These filters will be applied to
  what your previous selections reported.

  --min-num-genomes-gene-cluster-occurs INTEGER
                        This filter will remove gene clusters from your
                        report. Let's assume you have 100 genomes in your pan
                        genome analysis. You can use this parameter if you
                        want to work only with gene clusters that occur in at
                        least X number of genomes. If you say '--min-num-
                        genomes-gene-cluster-occurs 90', each gene cluster in
                        the analysis will be required at least to appear in 90
                        genomes. If a gene occurs in less than that number of
                        genomes, it simply will not be reported. This is
                        especially useful for phylogenomic analyses, where you
                        may want to only focus on gene clusters that are
                        prevalent across the set of genomes you wish to
                        analyze.
  --max-num-genomes-gene-cluster-occurs INTEGER
                        This filter will remove gene clusters from your
                        report. Let's assume you have 100 genomes in your pan
                        genome analysis. You can use this parameter if you
                        want to work only with gene clusters that occur in at
                        most X number of genomes. If you say '--min-num-
                        genomes-gene-cluster-occurs 1', you will get gene
                        clusters that are singletons. Combining this paramter
                        with --min-num-genomes-gene-cluster-occurs can give
                        you a very precise way to filter your gene clusters.
  --min-num-genes-from-each-genome INTEGER
                        This filter will remove gene clusters from your
                        report. If you say '--min-num-genes-from-each-genome
                        2', this filter will remove every gene cluster, to
                        which every genome in your analysis contributed less
                        than 2 genes. This can be useful to find out gene
                        clusters with many genes from many genomes (such as
                        conserved multi-copy genes within a clade).
  --max-num-genes-from-each-genome INTEGER
                        This filter will remove gene clusters from your
                        report. If you say '--max-num-genes-from-each-genome
                        1', every gene cluster that has more than one gene
                        from any genome that contributes to it will be removed
                        from your analysis. This could be useful to remove
                        gene clusters with paralogs from your report for
                        appropriate phylogenomic analyses. For instance, using
                        '--max-num-genes-from-each-genome 1' and 'min-num-
                        genomes-gene-cluster-occurs X' where X is the total
                        number of your genomes, would give you the single-copy
                        gene cluters in your pan genome.
  --max-num-gene-clusters-missing-from-genome INTEGER
                        This filter will remove genomes from your report. If
                        you have a list of gene cluster names, you can use
                        this parameter to omit any genome from your report if
                        it is missing more than a number of genes you desire.
                        For instance, if you have 100 genomes in your pan
                        genome, and you are interested in working only with
                        genomes that have all 5 specific gene clusters of your
                        choice, you can use '--max-num-gene-clusters-missing-
                        from-genome 4' to remove remove the bins that are
                        missing more than 4 of those 5 genes. This is
                        especially useful for phylogenomic analyses. Parameter
                        0 will remove any genome that is missing any of the
                        genes.
  --add-into-items-additional-data-table NAME
                        If you use any of the filters, and would like to add
                        the resulting item names into the items additional
                        data table of your database, you can use this
                        parameter. You will need to give a name for these
                        results to be saved. If the given name is already in
                        the items additoinal data table, its contents will be
                        replaced with the new one. Then you can run anvi-
                        interactive or anvi-display-pan to 'see' the results
                        of your filters.

(...)

As you can see, the program can also store resulting gene clusters after applying all filters into item additional data table in the pan database with the parameter --add-into-items-additional-data-table (which means they can be visualized as additional layers --see this post for more information on these new tables and learn how those tables can be manipulated).

The code is a bit ugly and will likely change as we move forward, but those changes will not affect the user experience. So you can help us by testing it using the master repo, and suggest more ways to improve your experience.

I will talk to @ozcan about improving the interface using this new functionality. Then it will be much more fun!

Thanks,

@rbartelme

This comment has been minimized.

Show comment
Hide comment
@rbartelme

rbartelme Dec 28, 2017

So does " --add-into-items-additional-data-table NAME
If you use any of the filters, and would like to add
the resulting item names into the items additional
data table of your database, you can use this
parameter. You will need to give a name for these
results to be saved. If the given name is already in
the items additoinal data table, its contents will be
replaced with the new one. Then you can run anvi-
interactive or anvi-display-pan to 'see' the results
of your filters."

List the gene names and potentially the AA positions in a concatenated aligned output??

rbartelme commented Dec 28, 2017

So does " --add-into-items-additional-data-table NAME
If you use any of the filters, and would like to add
the resulting item names into the items additional
data table of your database, you can use this
parameter. You will need to give a name for these
results to be saved. If the given name is already in
the items additoinal data table, its contents will be
replaced with the new one. Then you can run anvi-
interactive or anvi-display-pan to 'see' the results
of your filters."

List the gene names and potentially the AA positions in a concatenated aligned output??

@meren meren changed the title from anvi-export-pc-alignments improvement to anvi-get-sequences-for-gene-clusters improvements Dec 31, 2017

@meren

This comment has been minimized.

Show comment
Hide comment
@meren

meren Dec 31, 2017

Member

@rbartelme, are you referring to this:

#668 (comment)

The answer is no, it is not yet in :)

Member

meren commented Dec 31, 2017

@rbartelme, are you referring to this:

#668 (comment)

The answer is no, it is not yet in :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment