Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

contig_id in any of the files cannot be mapped back to original FASTA #99

Open
schorlton opened this issue Feb 2, 2022 · 6 comments
Open
Labels
enhancement New feature or request

Comments

@schorlton
Copy link

Thanks for the great tool and your help both on and offline.

I'm curious about the decision to fix_fasta_header before processing the FASTA. This makes it basically impossible (as far as I can tell) to combine MOB-Suite outputs with those of other tools dowstream, as it has irreversibly modified the actual contig IDs and then puts those out as new contig IDs in all of its outputs. Was this a conscious decision? Is it possible to leave the contig IDs as the FASTA standard which is the first word before a whitespace in the FASTA header?

It seems like another approach would be to parse the input FASTA file with biopython, maintaining separation between the contig ID and description. As far as I can tell, it seems that this is the only place that actually needs the sequence description:

if unicycler_contigs:
if 'circular=true' in id or '_circ' in id:
contig_info[id]['circularity_status'] = 'circular'
elif id not in circular_contigs:
contig_info[id]['circularity_status'] = 'incomplete'

I'd be happy to open a PR if that makes sense to you? Thanks again!

@jrober84 jrober84 added the enhancement New feature or request label Feb 3, 2022
@jrober84
Copy link
Collaborator

jrober84 commented Feb 3, 2022

So the reason for the fixing of fasta headers is that spaces and other characters can cause issues with tools like blast but there are a variety of tools that put information in the fasta headers delimited by spaces. Al of the blast calls requires mapping back to the contig identifiers and spaces cause issues. We were originally planning on stripping the headers entirely in favor of a sequential integer id but compromised on this since MOB-suite does use the circularity information and it would be harder to map things back. I can put this in as a feature request to have the original headers maintained in the outputs.

@schorlton
Copy link
Author

Sorry for my digging further, but how does it cause issues with BLAST? As far as I'm aware, BLAST outputs contig ID as per FASTA standard in many/most of its output formats.

Thanks again, would really appreciate this addition!

@jrober84
Copy link
Collaborator

The issue #87 shows the problem with super long headers. I have made a change to the code that all sequences will be renamed in the format {int}_{md5}_circular={status} for all blast and mash searches. Then the program will relabel all of the sequences to their original contig id's. This should solve your issue, as I agree there is a problem with mapping contigs back for integrating results from other tools. version 3.1.0 will implement this

@schorlton
Copy link
Author

Thanks @jrober84 for the effort! One suggestion: I'd suggest putting the sequence ID in the contig_id field, and not the whole FASTA header. Eg. with this fix it puts contig_1 circular=true in the contig_id, but I'd suggest just contig_1. There are a couple reasons for this:

  • The field is named ID, which is only the first field of a FASTA header by convention
  • The field may get excessively long in the TSV
  • More importantly, if someone puts a tab in their FASTA header, it will break mob-recon's output as there will varying numbers of columns

@kbessonov1984
Copy link
Collaborator

Thanks for sharing reflections on contig_id format. I think that both providing a full length fasta header as contig ID or just using identifier before the first space has its value because this way user might encode additional information that might be of use. On the other hand the contig_id might get long to read, but the output should not be affected by length of presence of tabs and correctly be written to the report file by the writeReport(data_list, header, outfile) function. All subsequent tools can just remove this information if it causes compatibility issues. It is easier to remove data than to add in my opinion. In addition, what if user decides to remove all spaces and substitute them by _ symbols. Then in this case we would not able to trim the contig_id. I think the best compromise is to create new output parameter --contig_id_standardize that would output standard contig_ids truncated to the first space (if present).

@schorlton
Copy link
Author

Thanks @kbessonov1984. While the output may be "correctly" written to TSV by python, it won't be parseable as it will have variable numbers of columns depending on if a FASTA sequence had tabs in its sequence header. Eg. I don't think you'll be able to read the mob-recon output back into pandas or us python's csv.DictReader, which is pretty problematic.

@jrober84 jrober84 reopened this Jun 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants