
This issue was moved to a discussion.

You can continue the conversation there.


Exact gene locations #413

Closed
jmtsuji opened this issue Jul 26, 2021 · 18 comments

Comments

@jmtsuji
Contributor

jmtsuji commented Jul 26, 2021

Hello @SilasK ,

I'm now testing out ATLAS 2.6a2, using a server running Ubuntu 20.04.2.

I've noticed that setting the genecatalog to source: genomes in the config file seems to be broken now. (The default source: contigs works.) I am now getting a MissingInputException saying that genomes/annotations/orf2genome.tsv is missing.

I am guessing this is coming from localrules: concat_genes in genecatalog.smk (see the line below that I marked with an asterisk):

if config['genecatalog']['source']=='contigs':

    localrules: concat_genes
    rule concat_genes:
        input:
            faa= expand("{sample}/annotation/predicted_genes/{sample}.faa", sample=SAMPLES),
            fna= expand("{sample}/annotation/predicted_genes/{sample}.fna", sample=SAMPLES)
        output:
            faa=  temp("Genecatalog/all_genes_unfiltered.faa"),
            fna = temp("Genecatalog/all_genes_unfiltered.fna"),
        run:
            from utils.io import cat_files
            cat_files(input.faa,output.faa)
            cat_files(input.fna,output.fna)


else:

    localrules: concat_genes
    rule concat_genes:
        input:
*-->            "genomes/annotations/orf2genome.tsv",
            faa= lambda wc: get_all_genes(wc,".faa"),
            fna= lambda wc: get_all_genes(wc,".fna")
        output:
            faa=  temp("Genecatalog/all_genes_unfiltered.faa"),
            fna = temp("Genecatalog/all_genes_unfiltered.fna"),
        run:
            from utils.io import cat_files
            cat_files(input.faa,output.faa)
            cat_files(input.fna,output.fna)

Is this orf2genome.tsv file meant to be produced by this new version of ATLAS? I do not find any mention of it in the other atlas rules. It also does not appear to be required by the run block at the bottom of the rule, so I wonder if this rule would still work without genomes/annotations/orf2genome.tsv in the input. (I will test this if I get the time.)

I am okay to use source: contigs for now, but I thought I would let you know about this issue.

Thanks,
Jackson

@SilasK
Member

SilasK commented Jul 27, 2021

Dear Jackson

Thank you for highlighting this error.
What is your reason for using source: genomes?

I will probably remove the source: genomes option. The rationale is that by annotating all contigs once, one wouldn't need to rerun this part when, e.g., the binning method is changed.

In the future, I will probably prefer annotating the genomes with DRAM #360

@jmtsuji
Contributor Author

jmtsuji commented Jul 28, 2021

@SilasK My primary reason for using source: genomes is that it is very easy to relate the annotation data to the MAG data. You can easily relate the gene cluster ID to the ORF ID of the corresponding MAG via orf2gene.tsv.

When using source: contigs, I did not see a method to simply relate the contig-derived ORF IDs to the ORF IDs in the MAGs. However, I think it should be possible to relate these two IDs, if the ORF IDs of the MAGs prior to renaming are saved somewhere. This is the main barrier to me adopting source: contigs at the moment.

That is a nice idea to try DRAM. I have never used it, but it is neat that it appears to have many levels of annotation. Note that it is pretty heavy on disk usage and RAM, however (see https://github.com/shafferm/DRAM#system-requirements):

DRAM has a large memory burden and is designed to be run on high performance computers. DRAM annotates against a large variety of databases which must be processed and stored. Setting up DRAM with KEGG Genes and UniRef90 will take up ~500 GB of storage after processing and require ~512 GB of RAM while using KOfam and skipping UniRef90 will mean all processed databases will take up ~30 GB on disk and will only use ~128 GB of RAM while processing. DRAM annotation memory usage depends on the databases used. When annotating with UniRef90 around 220 GB of RAM is required. If the KEGG gene database has been provided and UniRef90 is not used then memory usage is around 100 GB of RAM. If KOfam is used to annotate KEGG and UniRef90 is not used then less than 50 GB of RAM is required. DRAM can be run with any number of processors on a single node.

It sounds like annotation using UniRef90 will take about 220 GB of RAM, which is roughly equivalent to the current GTDB classifier. This might not be a huge problem, but it does mean that DRAM will add one more RAM bottleneck to the pipeline.

@SilasK
Member

SilasK commented Jul 28, 2021

atlas run all with source: contigs gives you this file https://github.com/metagenome-atlas/atlas/blob/master/atlas/Snakefile#L244

This allows you to relate the gene annotations to the genome. But you are interested in the exact gene-to-MAG-ORF link?

You will be able to run DRAM without the UniRef annotation and focus on KEGG.

@SilasK SilasK closed this as completed Jul 28, 2021
@jmtsuji
Contributor Author

jmtsuji commented Jul 28, 2021

@SilasK Thanks for the advice to look for genomes/annotations/gene2genome.tsv.gz! I'll take a look for it when my ATLAS run finishes. It should be okay to remove the source: genomes option in that case.

Correct -- if relying on KOfam, then DRAM will use <50 GB RAM. That sounds like a reasonable option if RAM is an issue.

@jmtsuji
Contributor Author

jmtsuji commented Oct 11, 2021

@SilasK I took a look at genomes/annotations/gene2genome.tsv.gz. Unfortunately, as you mentioned, this file does not relate the exact Gene ID (from the Genecatalog) to the ORF ID in the MAG. Instead, it shows the copy number of the Gene in the MAG, but not the exact location, e.g.,

Gene	MAG	Ncopies
Gene0000001	MAG36	1
Gene0000002	MAG36	1
Gene0000003	MAG36	1
Gene0000004	MAG36	1
Gene0000005	MAG36	1
Gene0000006	MAG36	1
Gene0000007	MAG36	1
Gene0000008	MAG36	1
Gene0000009	MAG36	1

Ideally, I would like to use the Genecatalog data to annotate the MAGs. Because of this, it would be very helpful to map the Gene ID to the specific ORF ID of the MAG.

Mapping the Gene ID to the ORF ID of the MAG seems kind of tricky, though, because the Gene IDs are derived from the unbinned contigs, but the ORF IDs are from the MAGs, which are annotated after genome binning.

I wonder if the easiest solution might be to cluster the MAG ORFs with the Genecatalog/all_genes/predicted_genes.faa file at 100% amino acid identity. This would link the MAG ORFs to the unclustered ORFs from the contigs. Then, Genecatalog/clustering/orf2gene.tsv.gz could be used to link the MAG ORFs to the Gene IDs. What do you think?
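For the 100%-identity variant of this idea, exact amino-acid sequence matching is enough, so the link can be sketched in plain Python (for real datasets mmseqs would do the clustering; the FASTA records and IDs below are toy data, and the helper names are mine, not ATLAS's):

```python
# Sketch: link MAG ORF IDs to Genecatalog Gene IDs, assuming 100% amino acid
# identity, i.e. a MAG ORF and a contig ORF are "the same" iff their sequences
# are identical. Toy data stands in for the real .faa files.

def read_fasta(lines):
    """Parse FASTA lines into {id: sequence}; the id is the first word after '>'."""
    records, header, seq = {}, None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(seq)
            header, seq = line[1:].split()[0], []
        elif line:
            seq.append(line)
    if header is not None:
        records[header] = "".join(seq)
    return records

def link_mag_orfs(all_genes_faa, mag_faa, orf2gene):
    """Map each MAG ORF to a Gene ID via an identical contig ORF.

    all_genes_faa : lines of Genecatalog/all_genes/predicted_genes.faa
    mag_faa       : lines of the ORFs re-predicted from the MAGs
    orf2gene      : dict loaded from Genecatalog/clustering/orf2gene.tsv.gz
    """
    seq2orf = {seq: orf for orf, seq in read_fasta(all_genes_faa).items()}
    links = {}
    for mag_orf, seq in read_fasta(mag_faa).items():
        contig_orf = seq2orf.get(seq)       # identical ORF predicted on the contigs
        if contig_orf in orf2gene:
            links[mag_orf] = orf2gene[contig_orf]   # its cluster representative
    return links
```

A dictionary lookup on the full sequence is only valid in the 100%-identity case; at 95% you would need real clustering, which is why mmseqs is the practical tool here.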

@jmtsuji jmtsuji reopened this Oct 11, 2021
@LeeBergstrand

@jmtsuji My understanding is that the gene catalogue already contains genes clustered at 95% identity, so each gene in the gene catalogue is the representative of a cluster of multiple genes from either the contigs or the genomes.

For contigs: contigs --> predict genes --> cluster genes at 95% identity --> representative genes --> gene catalogue

For genomes: genomes --> predict genes --> cluster genes at 95% identity --> representative genes --> gene catalogue

I know you can get counts of the number of genes in each cluster, but I'm not sure whether there is a mapping file between each gene in the contigs/genomes and the cluster it belongs to.

@LeeBergstrand

@SilasK It might be beneficial to assign each gene in the entire pipeline a universally unique identifier (https://en.wikipedia.org/wiki/Universally_unique_identifier), if you are not already doing so. You could then build a mapping file / database recording which gene from which contig/genome belongs to which cluster and representative sequence.

@LeeBergstrand

@jmtsuji This could also be used to map ORF IDs.

@LeeBergstrand

@SilasK Within the gene2genome.tsv.gz file, does the Gene column refer to the gene catalogue ID, or to some gene ID that is unique to that file?

gene2genome.tsv.gz

Gene MAG Ncopies
Gene0000097 MAG26 1
Gene0000098 MAG26 1
Gene0000099 MAG51 1
Gene0000100 MAG51 1
Gene0000101 MAG51 1
Gene0000102 MAG51 1
Gene0000106 MAG51 1
Gene0000107 MAG51 1
Gene0000108 MAG51 1

orf2gene.tsv.gz

ORF Gene
S29_24558_1 Gene0000001
S29_24574_3 Gene0000002
S29_24591_1 Gene0000003
S29_24608_1 Gene0000004
S30_2454_2 Gene0000004
S29_24627_1 Gene0000005
S29_24644_1 Gene0000006
S29_24661_1 Gene0000007
S29_24675_1 Gene0000008

@LeeBergstrand

LeeBergstrand commented Oct 12, 2021

I think we need a file like the following:

gene_catalogue_rep_gene_id gene_catalogue_gene_cluster contig_gene_id contig genome_gene_id genome_id

The headers above are completely unnormalized, so this would be a very big file. But the underlying data could be represented fairly easily in a normalized SQLite3 database file.

@SilasK
Member

SilasK commented Oct 13, 2021

Ok guys. What I have in atlas is:

All predicted genes have a unique name, since they are based on the contig names, which are themselves unique, e.g. samplename_contignr_genenr. I refer to these genes as ORFs. All ORFs are clustered into a gene catalog (by default at 95%, but you can set 100% if you wish). I call each cluster a gene and give it a name like Gene00001; these genes are then annotated by eggNOG-mapper or your custom tool.

The file Genecatalog/clustering/orf2gene.tsv.gz gives you the link between the ORF and the gene (cluster representative). From the ORF name, you can easily get the sample and contig name.

Now, in the genome workflow, contigs are binned and dereplicated. The dereplicated MAGs are consistently named MAG001, and by default the contigs are also renamed, from sample_contignr to MAG0001_1.

The file genomes/annotations/gene2genome.tsv.gz relates the MAG to the Gene: it tells you whether a specific gene is in a genome.

@jmtsuji You would like to know which ORF of the MAG corresponds to a gene? Can you explain why you want this information?
I don't do it in atlas, but if you run prodigal again on the genomes, there is in fact some duplication, as the ORF MAG0001_1_1 might be identical to sample5_1_1.

You could turn off the renaming of MAG contigs by setting: rename_mags_contigs: false in the config file.
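The naming convention described above (samplename_contignr_genenr) means the sample and contig can be recovered from the ORF name alone; a minimal sketch (the example IDs are illustrative):

```python
# Sketch: decompose an ORF name of the form samplename_contignr_genenr.
# This mirrors the "_".join(name.split("_")[:-1]) trick ATLAS uses to get
# the contig name, and works even if the sample name itself contains "_".

def split_orf_name(orf):
    """Return (sample, contig, gene_nr) for an ORF named sample_contignr_genenr."""
    parts = orf.split("_")
    contig = "_".join(parts[:-1])   # drop the trailing gene number
    sample = "_".join(parts[:-2])   # additionally drop the contig number
    return sample, contig, parts[-1]
```

For example, split_orf_name("S29_24558_1") gives ("S29", "S29_24558", "1").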

@SilasK
Member

SilasK commented Oct 13, 2021

@LeeBergstrand So the mapping tables are quite large, and most of the information is actually encoded in the names of the ORFs.

The SQL database would consist mainly of the two files Genecatalog/clustering/orf2gene.tsv.gz and genomes/clustering/all_contigs2bins.tsv.gz.

I don't have much experience with mySQL and I have the impression that it closes the files for the average user.

Also, have a look at how I do the mapping.

rule gene2genome:
    input:
        contigs2bins="genomes/clustering/all_contigs2bins.tsv.gz",
        contigs2mags="genomes/clustering/contig2genome.tsv",
        old2newID="genomes/clustering/old2newID.tsv",
        orf2gene="Genecatalog/clustering/orf2gene.tsv.gz",
    params:
        remaned_contigs=config["rename_mags_contigs"] & (
            config["genecatalog"]["source"] == "contigs"
        ),
    output:
        "genomes/annotations/gene2genome.tsv.gz",
    run:
        import pandas as pd

        if params.remaned_contigs:
            contigs2bins = pd.read_csv(
                input.contigs2bins, index_col=0, squeeze=False, sep="\t", header=None
            )
            contigs2bins.columns = ["Bin"]
            old2newID = pd.read_csv(
                input.old2newID, index_col=0, squeeze=True, sep="\t"
            )
            contigs2genome = (
                contigs2bins.join(old2newID, on="Bin").dropna().drop("Bin", axis=1)
            )
        else:
            contigs2genome = pd.read_csv(
                input.contigs2mags, index_col=0, squeeze=False, sep="\t", header=None
            )
            contigs2genome.columns = ["MAG"]

        orf2gene = pd.read_csv(
            input.orf2gene, index_col=0, squeeze=False, sep="\t", header=0
        )
        orf2gene["Contig"] = orf2gene.index.map(lambda s: "_".join(s.split("_")[:-1]))
        orf2gene = orf2gene.join(contigs2genome, on="Contig")
        orf2gene = orf2gene.dropna(axis=0)

        gene2genome = orf2gene.groupby(["Gene", "MAG"]).size()
        gene2genome.name = "Ncopies"
        gene2genome.to_csv(output[0], sep="\t", header=True, compression="gzip")
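The final step of the rule above is just a per-(Gene, MAG) tally; the same count can be sketched without pandas using collections.Counter (the rows below are made-up toy data):

```python
from collections import Counter

# Sketch: count gene copies per MAG, mirroring groupby(["Gene", "MAG"]).size()
# in the rule above. Each row is (ORF, Gene, MAG) after the joins; toy data.
orf_rows = [
    ("S29_24558_1", "Gene0000001", "MAG36"),
    ("S29_24574_3", "Gene0000002", "MAG36"),
    ("S30_2454_2",  "Gene0000002", "MAG36"),  # second copy of the same gene
]
ncopies = Counter((gene, mag) for _orf, gene, mag in orf_rows)
# ncopies[("Gene0000002", "MAG36")] is 2 -- the Ncopies column of gene2genome.tsv.gz
```

This also makes explicit why the output records only copy numbers: the ORF identity is discarded at the grouping step.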

@SilasK
Member

SilasK commented Oct 13, 2021

In the future, I plan to make the gene catalog a little more independent from the genomes workflow, e.g. using PLASS to assemble genes directly from the raw reads for data where the assembly is not working optimally.

Therefore it will become necessary to map the genes predicted from the MAGs to the Genecatalog. I think this can be done with mmseqs, but without reclustering; there should be an option to simply map at 100% or so.

@jmtsuji
Contributor Author

jmtsuji commented Oct 19, 2021

@SilasK Thanks for the helpful information! That is good to know that you are hoping to make the Genecatalog more independent of the genomes workflow long-term. (Trying out PLASS would also be interesting.)

In my case, it's very helpful to relate the Genecatalog entries to the predicted ORFs of the genomes. Here are a couple workflows I commonly use:

  • Annotate predicted ORFs within the genomes using the EggNOG annotations of the Genecatalog.
  • Explore the Genecatalog for functional genes of interest and see if those genes are inside a MAG. If the genes are in a MAG, it enables deeper analysis of the genes. You can also look at the neighbouring genes on the contigs in the MAGs. Knowing the exact gene is a little more helpful than just knowing that the gene is somewhere in the MAG (which is what genomes/annotations/gene2genome.tsv.gz shows).

I agree that the easiest solution to relate the Genecatalog entries to the predicted ORFs in the genomes might be to map the ORFs predicted in the MAGs to the Genecatalog (e.g., using mmseqs cluster), without actually reclustering the Genecatalog. This way, the user can see which Genecatalog entries match their MAGs, but the two workflows can remain independent. Clustering at 100% should work if you map the MAG ORFs to the Genecatalog/all_genes/predicted_genes.faa file, because that file should contain the exact same ORFs as predicted for the MAGs. If you are mapping to the Genecatalog/gene_catalog.faa file, you might need to use a clustering threshold of something like 95%.

I considered using rename_mags_contigs: false, but I really like the naming scheme in the FastA headers for the genomes once renamed. I prefer to keep the MAG ORF naming scheme, if possible. :-)

I recently tested the "Genecatalog - genomes" mapping method I described above on my ATLAS run (via MMseqs2), and it gives me data I am satisfied with. mmseqs missed ~0.7% of the ORFs in the mapping, but this might be due to the sensitivity setting being too low. Here's the head of the final output file I generated:

$ head mag_orf2gene.tsv

ORF	Gene
MAG01_1_1	Gene0423526
MAG01_1_2	Gene0423527
MAG01_1_3	Gene0253845
MAG01_1_4	Gene0253846
MAG01_1_5	Gene0253847
MAG01_1_6	Gene0253848
MAG01_1_7	Gene0253849
MAG01_1_8	Gene0253850
MAG01_1_9	Gene0253851

@LeeBergstrand Good to know that you are also interested in this topic! Like Silas mentioned, I think a lot of the info you mentioned in your table is already encoded in the FastA headers and/or in Genecatalog/clustering/orf2gene.tsv.gz. The only thing missing is the link between the genomes and the Genecatalog. It sounds like Silas is trying to make these workflows kind of independent now, which makes the link a little more tricky.

@SilasK
Member

SilasK commented Oct 20, 2021

@jmtsuji I see that if you want to look at the neighborhoods of genes, you need to know the exact gene. However, keep in mind that the MAGs are only the best, representative genomes per species.

I would suggest that you look for neighboring genes in all bins, or even on all contigs. In that case, you can use the mapping files outlined above.

It would be cool to create a DB of all ORFs on all contigs. In that case, maybe an SQL database would be meaningful.

@SilasK SilasK changed the title from "MissingInputException: orf2genome.tsv" to "Exact gene locations" Oct 20, 2021
@SilasK
Member

SilasK commented Oct 21, 2021

@jmtsuji The discussion with you motivated me to do some coding.

Here is my example of how I would look at the neighborhood of a gene in all the contigs and bins.

https://gist.github.com/SilasK/5932a6887ee4d2520b5a59cec06d09b7

What do you think?

@LeeBergstrand

LeeBergstrand commented Oct 22, 2021

I don't have much experience with mySQL.

@SilasK I would use SQLite3 (a data file with an SQL library interface) rather than mySQL (a full-on database server) for this application. I have pretty extensive experience with SQL (I used it a lot in undergrad, at my company, and during my masters), so if you need my help with designing schemas etc., I can do that. Using an ORM like SQLAlchemy also helps, and SQLAlchemy can hook into pandas.

and I have the impression that it closes the files for the average user.

Yes, this is a concern. You would either have to provide a command-line tool that creates mapping files as needed for end users, or Python code for parsing the data found in the SQLite3 file.

If we plan to go the SQL route, I'm willing to chip in some development time. I would also be interested in building this as a complementary feature to the mapping files.

Here is an example of some SQLAlchemy code: https://github.com/Micromeda/pygenprop/blob/develop/pygenprop/assignment_database.py

You basically create a series of objects, and then the Python code syncs their data to the database file.
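As a rough sketch of the normalized-database idea, here is the same two-table layout using the stdlib sqlite3 module rather than SQLAlchemy, to keep it self-contained (the schema, table names, and toy rows are hypothetical, not ATLAS outputs):

```python
import sqlite3

# Sketch: load the two ATLAS mapping tables (orf -> gene, contig -> bin) into a
# normalized SQLite database; real rows would come from
# Genecatalog/clustering/orf2gene.tsv.gz and genomes/clustering/all_contigs2bins.tsv.gz.

def build_db(orf2gene_rows, contig2bin_rows):
    con = sqlite3.connect(":memory:")          # or a .sqlite file on disk
    con.executescript("""
        CREATE TABLE orf2gene   (orf TEXT PRIMARY KEY, gene TEXT, contig TEXT);
        CREATE TABLE contig2bin (contig TEXT PRIMARY KEY, bin TEXT);
    """)
    con.executemany(
        "INSERT INTO orf2gene VALUES (?, ?, ?)",
        # the contig is encoded in the ORF name: drop the trailing gene number
        [(orf, gene, "_".join(orf.split("_")[:-1])) for orf, gene in orf2gene_rows],
    )
    con.executemany("INSERT INTO contig2bin VALUES (?, ?)", contig2bin_rows)
    return con

con = build_db(
    [("S29_24558_1", "Gene0000001"), ("S29_24574_3", "Gene0000002")],  # toy rows
    [("S29_24558", "MAG36")],
)
# One SQL join answers "which bin does each catalog gene's ORF sit in?"
links = con.execute(
    "SELECT o.orf, o.gene, c.bin FROM orf2gene o "
    "JOIN contig2bin c ON c.contig = o.contig"
).fetchall()
```

A disk-backed file plus a couple of indexes would make this usable at Genecatalog scale, and it stays readable from pandas via read_sql.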

@github-actions

There has been no activity on this issue for some time. I hope your issue has been solved in the meantime.
This issue will automatically close soon if no further activity occurs.

Thank you for your contributions.

@github-actions github-actions bot added the stale label Mar 19, 2023
@metagenome-atlas metagenome-atlas locked and limited conversation to collaborators Mar 20, 2023
@SilasK SilasK converted this issue into discussion #620 Mar 20, 2023

