Contigs are disorganised in the EMBL file #70

Closed
ireneortega opened this issue Feb 11, 2022 · 10 comments

@ireneortega

My genome has 30 contigs named consecutively: contig1, contig2, contig3, ..., contig29, contig30. When I create the EMBL file, the contigs are ordered this way: contig1, contig10, contig11, ..., contig19, contig2, contig20, ... I don't like that order because my contigs are ordered against a reference genome, and the genome then ends up out of order in the EMBL file I want to submit. I want the EMBL file to keep the contigs in this order: contig1, contig2, contig3, ..., contig29, contig30. What can I do to keep the order of my contigs?
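
The order reported above is what plain alphabetical (string) sorting of the contig names produces. The sketch below only illustrates that effect and a "natural" sort key that compares the numeric part as an integer; `natural_key` is a hypothetical helper, not something taken from EMBLmyGFF3.

```python
import re

def natural_key(name):
    """Split a name into text and integer chunks so 'contig2' sorts before 'contig10'."""
    return [int(chunk) if chunk.isdigit() else chunk
            for chunk in re.split(r"(\d+)", name)]

# contig1 ... contig30, as in the report above
contigs = ["contig%d" % i for i in range(1, 31)]

alphabetical = sorted(contigs)               # plain string comparison
natural = sorted(contigs, key=natural_key)   # numeric chunks compared as integers

print(alphabetical[:4])  # ['contig1', 'contig10', 'contig11', 'contig12']
print(natural[:4])       # ['contig1', 'contig2', 'contig3', 'contig4']
```

A natural sort only helps when the wanted order happens to be the numeric one; it does not preserve an arbitrary order taken from the input files.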

@Juke34
Collaborator

Juke34 commented Feb 11, 2022

We might change this order, but I guess the change would only be aesthetic. I mean it would affect the order in the EMBL flat file, but I don't think it would change anything about the ordering in the ENA archive.
Well, we should contact the helpdesk to confirm how it is handled by their submission pipeline. So the question is: do they keep the order of the sequences as found in the flat file?
If there is a specific order to follow, you usually provide an AGP file that explains to ENA how to scaffold those contigs. (An AGP file is a sort of recipe that says how to concatenate the different contigs (order and direction) to create a long scaffold.)

Anyway, we could check whether changing the EMBLmyGFF3 behavior is an easy task. If so, we will change it.
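
To make the "recipe" idea concrete, here is a minimal sketch of how AGP rows can be read to build one scaffold. It assumes the tab-separated AGP 2.0 column layout (object, object_beg, object_end, part_number, component_type, then component or gap columns), rows already sorted by part number, and toy sequences; `build_scaffold` and `reverse_complement` are hypothetical helpers, and this is not how ENA processes the file.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string (A/C/G/T/N only)."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]

def build_scaffold(agp_rows, contigs):
    """Concatenate contigs following AGP rows that describe a single scaffold."""
    parts = []
    for row in agp_rows:
        cols = row.split("\t")
        component_type = cols[4]
        if component_type in ("N", "U"):
            # Gap row: column 6 holds the gap length
            parts.append("N" * int(cols[5]))
            continue
        comp_id, beg, end, orientation = cols[5], int(cols[6]), int(cols[7]), cols[8]
        piece = contigs[comp_id][beg - 1:end]  # AGP coordinates are 1-based, inclusive
        if orientation == "-":
            piece = reverse_complement(piece)
        parts.append(piece)
    return "".join(parts)

# Toy data (made up for illustration)
toy_contigs = {"contigA": "AAAC", "contigB": "GGT"}
toy_agp = [
    "scaffold1\t1\t4\t1\tW\tcontigA\t1\t4\t+",
    "scaffold1\t5\t9\t2\tN\t5\tscaffold\tyes\tpaired-ends",
    "scaffold1\t10\t12\t3\tW\tcontigB\t1\t3\t-",
]

scaffold = build_scaffold(toy_agp, toy_contigs)
print(scaffold)  # AAACNNNNNACC
```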

@ireneortega
Author

ireneortega commented Feb 11, 2022

Yes, it's just a question of aesthetics. I have no idea what order they keep in the ENA archive, but I received the .gz file after submitting the annotated sequences and the order is the same as in the flat file. Honestly, I don't know if that .gz file will be the one released to the public, as this is my first time submitting to ENA. I only uploaded the assembly (I'm only interested in the contig level, not scaffolds) and the sequence annotations. In fact, I didn't know about the AGP file. Should I upload this file too? Is this the correct format for that file? I haven't seen this file before...

contig1	1	102565	1	W	OV1234	1	102565	+
contig2	1	88529	1	W	OV1235	1	88529	+
contig3	1	18341	1	W	OV1236	1	18341	+

Thanks for your help!

@Juke34
Collaborator

Juke34 commented Feb 11, 2022

If order matters, yes, you should upload an AGP file. See the dedicated section in the ENA help:
https://ena-docs.readthedocs.io/en/latest/submit/fileprep/assembly.html#agp-file
and NCBI https://www.ncbi.nlm.nih.gov/assembly/agp/AGP_Specification/
and examples at the bottom as here: https://www.ncbi.nlm.nih.gov/assembly/agp/AGP_Specification/scaffold_from_contig_WGS.agp.v2.0/

Look at biostars.org; there are several questions related to AGP files.

You can then try the validator to check that the file is well formed: https://www.ncbi.nlm.nih.gov/projects/genome/assembly/agp/agp_validate.cgi

@Juke34
Collaborator

Juke34 commented Jun 14, 2022

I got an answer: order matters:

The accession numbers will be assigned to the sequences in the exact order it
was included in the flat file submission. 

So we need to update EMBLmyGFF3 to fix this problem.

Juke34 added the bug label Jun 14, 2022

@Juke34
Collaborator

Juke34 commented Sep 1, 2022

@ireneortega I would need some feedback from you to close this issue. I have tried on my side and EMBLmyGFF3 v2.1 seems to work as expected. Did you use an older version? Could you tell me your Python version, EMBLmyGFF3 version and Biopython version? Otherwise, could you try with EMBLmyGFF3 v2.1 to see if you get the same problem?
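
One possible way to collect the three versions asked for above is sketched below. It assumes Python 3.8 or newer (for `importlib.metadata`) and that the installed distributions are named `EMBLmyGFF3` and `biopython`; both names are assumptions, and `report_versions` is a hypothetical helper.

```python
import sys
from importlib import metadata  # available from Python 3.8

def report_versions(packages=("EMBLmyGFF3", "biopython")):
    """Return the Python version plus the installed version of each package."""
    versions = {"python": sys.version.split()[0]}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is absent or registered under another name
            versions[name] = "not installed"
    return versions

for name, version in report_versions().items():
    print(name, version)
```

On an older interpreter such as Python 2.7 this snippet will not run; `python --version` and the package manager's own listing give the same information there.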

@ireneortega
Author

@Juke34 My contig order is not kept in the EMBL file the way it is in the FASTA and GFF files, so the contigs are ordered the way I described at the beginning. After assembly, the contigs are named contig1, contig2, etc., and then they are reordered, for example: contig14, contig1, contig3, etc. I want the EMBL file to show the contigs in this same order. What I got is: contig10, contig11...contig19, contig1, contig20...

I am using EMBLmyGFF3 v2.1 with Python 2.7.18 and biopython 1.76.

@Juke34
Collaborator

Juke34 commented Sep 2, 2022

OK, then it would be fixed if you install Python >= 3.6.

Since Python 3.6, the default dict class maintains key order, meaning this dictionary will reflect the order of records given to it. As of Biopython 1.72, on older versions of Python we explicitly use an OrderedDict so that you can always assume the record order is preserved.

I tried with this order:

ERS324955|SC|contig000001
ERS324955|SC|contig000012
ERS324955|SC|contig000011
ERS324955|SC|contig000003

The order from the FASTA is kept by seq_dict = SeqIO.to_dict( SeqIO.parse(infasta, "fasta") ) (still 1, 12, 11, 3), but then when parsing the GFF with for record in GFF.parse(infile, base_dict=seq_dict): the order becomes:

ERS324955|SC|contig000001
ERS324955|SC|contig000003
ERS324955|SC|contig000011
ERS324955|SC|contig000012

I tried moving the GFF features around, and the final order is still respected. So updating Python would fix the problem.
In the next release I will require newer versions of Python and Biopython and add a test to check that the order is respected.
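
As an illustration of the ordering problem described above, the sketch below puts parsed records back into the order of the FASTA headers. It is not the change made in EMBLmyGFF3; `fasta_ids` and `restore_order` are hypothetical helpers, and the records are stand-ins that only carry an `id` attribute, as Biopython records do.

```python
from collections import namedtuple

def fasta_ids(fasta_lines):
    """Return sequence IDs in file order (first word after '>')."""
    return [line[1:].split()[0] for line in fasta_lines
            if line.startswith(">") and line[1:].strip()]

def restore_order(records, ordered_ids):
    """Sort records by the position of their ID in ordered_ids.

    Raises KeyError if a record ID is missing from ordered_ids.
    """
    rank = {seq_id: position for position, seq_id in enumerate(ordered_ids)}
    return sorted(records, key=lambda record: rank[record.id])

# Stand-in for parsed records; only the `id` attribute matters here
Record = namedtuple("Record", ["id"])

fasta_lines = [
    ">ERS324955|SC|contig000001", "ACGT",
    ">ERS324955|SC|contig000012", "ACGT",
    ">ERS324955|SC|contig000011", "ACGT",
    ">ERS324955|SC|contig000003", "ACGT",
]

# Order observed after GFF parsing in the comment above (1, 3, 11, 12)
parsed = [Record("ERS324955|SC|contig000001"), Record("ERS324955|SC|contig000003"),
          Record("ERS324955|SC|contig000011"), Record("ERS324955|SC|contig000012")]

restored = restore_order(parsed, fasta_ids(fasta_lines))
print([record.id for record in restored])
```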

@ireneortega
Author

I've just installed EMBLmyGFF3 through conda with python 3.6 and the same problem appeared (contig1, contig10, contig11...).

@Juke34
Collaborator

Juke34 commented Sep 2, 2022

Try with branch 2.2.

  1. Clone the repo with git
  2. Move into the 2.2 branch (git checkout 2.2)
  3. There is a conda file you can use to prepare the environment: conda env create -f conda_environment_AGAT.yml, then conda activate emblmygff3
  4. Install EMBLmyGFF3 with python setup.py install

Then it should work properly.
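
After installing, one way to confirm that the output keeps the input order is to compare the sequence names in the EMBL flat file with the FASTA order. The sketch below assumes each record carries its sequence name on a line starting with "AC * _"; that prefix is an assumption about the output layout and is exposed as a parameter, and both functions are hypothetical helpers.

```python
def embl_sequence_names(embl_lines, prefix="AC * _"):
    """Collect sequence names from lines starting with `prefix` (assumed layout)."""
    return [line[len(prefix):].strip() for line in embl_lines
            if line.startswith(prefix)]

def first_order_mismatch(expected, observed):
    """Return the first index where the two name lists differ, or None if identical."""
    for index, (exp, obs) in enumerate(zip(expected, observed)):
        if exp != obs:
            return index
    if len(expected) != len(observed):
        return min(len(expected), len(observed))
    return None

# Toy example: wanted order versus a made-up flat file excerpt
fasta_order = ["contig14", "contig1", "contig3"]
embl_lines = [
    "ID   XXX; XXX; linear; XXX; XXX; XXX; XXX.",
    "AC * _contig1",
    "//",
    "ID   XXX; XXX; linear; XXX; XXX; XXX; XXX.",
    "AC * _contig14",
    "//",
    "ID   XXX; XXX; linear; XXX; XXX; XXX; XXX.",
    "AC * _contig3",
    "//",
]

embl_order = embl_sequence_names(embl_lines)
print(embl_order)
print(first_order_mismatch(fasta_order, embl_order))
```

A result of None means the two orders match; any integer is the position of the first record that is out of place.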

@Juke34
Collaborator

Juke34 commented Nov 25, 2022

Please feel free to reopen the issue if you still encounter problems in v2.2 of EMBLmyGFF3.
