contig_id in any of the files cannot be mapped back to original FASTA #99

schorlton · 2022-02-02T23:30:28Z

Thanks for the great tool and your help both on and offline.

I'm curious about the decision to fix_fasta_header before processing the FASTA. This makes it basically impossible (as far as I can tell) to combine MOB-Suite outputs with those of other tools dowstream, as it has irreversibly modified the actual contig IDs and then puts those out as new contig IDs in all of its outputs. Was this a conscious decision? Is it possible to leave the contig IDs as the FASTA standard which is the first word before a whitespace in the FASTA header?

It seems like another approach would be to parse the input FASTA file with biopython, maintaining separation between the contig ID and description. As far as I can tell, it seems that this is the only place that actually needs the sequence description:

mob-suite/mob_suite/mob_recon.py

Lines 1138 to 1142 in 1d735b3

    
           if unicycler_contigs: 
        
               if 'circular=true' in id or '_circ' in id: 
        
                   contig_info[id]['circularity_status'] = 'circular' 
        
               elif id not in circular_contigs: 
        
                   contig_info[id]['circularity_status'] = 'incomplete'

I'd be happy to open a PR if that makes sense to you? Thanks again!

The text was updated successfully, but these errors were encountered:

jrober84 · 2022-02-03T20:18:12Z

So the reason for the fixing of fasta headers is that spaces and other characters can cause issues with tools like blast but there are a variety of tools that put information in the fasta headers delimited by spaces. Al of the blast calls requires mapping back to the contig identifiers and spaces cause issues. We were originally planning on stripping the headers entirely in favor of a sequential integer id but compromised on this since MOB-suite does use the circularity information and it would be harder to map things back. I can put this in as a feature request to have the original headers maintained in the outputs.

schorlton · 2022-02-03T21:11:33Z

Sorry for my digging further, but how does it cause issues with BLAST? As far as I'm aware, BLAST outputs contig ID as per FASTA standard in many/most of its output formats.

Thanks again, would really appreciate this addition!

jrober84 · 2022-05-26T13:57:56Z

The issue #87 shows the problem with super long headers. I have made a change to the code that all sequences will be renamed in the format {int}_{md5}_circular={status} for all blast and mash searches. Then the program will relabel all of the sequences to their original contig id's. This should solve your issue, as I agree there is a problem with mapping contigs back for integrating results from other tools. version 3.1.0 will implement this

schorlton · 2022-06-03T04:09:02Z

Thanks @jrober84 for the effort! One suggestion: I'd suggest putting the sequence ID in the contig_id field, and not the whole FASTA header. Eg. with this fix it puts contig_1 circular=true in the contig_id, but I'd suggest just contig_1. There are a couple reasons for this:

The field is named ID, which is only the first field of a FASTA header by convention
The field may get excessively long in the TSV
More importantly, if someone puts a tab in their FASTA header, it will break mob-recon's output as there will varying numbers of columns

kbessonov1984 · 2022-06-03T19:58:39Z

Thanks for sharing reflections on contig_id format. I think that both providing a full length fasta header as contig ID or just using identifier before the first space has its value because this way user might encode additional information that might be of use. On the other hand the contig_id might get long to read, but the output should not be affected by length of presence of tabs and correctly be written to the report file by the writeReport(data_list, header, outfile) function. All subsequent tools can just remove this information if it causes compatibility issues. It is easier to remove data than to add in my opinion. In addition, what if user decides to remove all spaces and substitute them by _ symbols. Then in this case we would not able to trim the contig_id. I think the best compromise is to create new output parameter --contig_id_standardize that would output standard contig_ids truncated to the first space (if present).

schorlton · 2022-06-03T23:57:17Z

Thanks @kbessonov1984. While the output may be "correctly" written to TSV by python, it won't be parseable as it will have variable numbers of columns depending on if a FASTA sequence had tabs in its sequence header. Eg. I don't think you'll be able to read the mob-recon output back into pandas or us python's csv.DictReader, which is pretty problematic.

jrober84 added the enhancement New feature or request label Feb 3, 2022

jrober84 closed this as completed May 26, 2022

jrober84 reopened this Jun 7, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

contig_id in any of the files cannot be mapped back to original FASTA #99

contig_id in any of the files cannot be mapped back to original FASTA #99

schorlton commented Feb 2, 2022

jrober84 commented Feb 3, 2022

schorlton commented Feb 3, 2022

jrober84 commented May 26, 2022

schorlton commented Jun 3, 2022

kbessonov1984 commented Jun 3, 2022

schorlton commented Jun 3, 2022

contig_id in any of the files cannot be mapped back to original FASTA #99

contig_id in any of the files cannot be mapped back to original FASTA #99

Comments

schorlton commented Feb 2, 2022

jrober84 commented Feb 3, 2022

schorlton commented Feb 3, 2022

jrober84 commented May 26, 2022

schorlton commented Jun 3, 2022

kbessonov1984 commented Jun 3, 2022

schorlton commented Jun 3, 2022