Using FCS-GX to split/mask internal adaptor sequences #67
Hello, Can you post the exact commands and input you used for FCS-adaptor screening and subsequent cleaning? By "output from FCS-adaptor as an input to FCS-GX", I assume you mean the adaptor report used on the uncleaned FASTA.
I should test this sometime, though I don't think it should be an issue. There is quite a bit of adaptor contamination here. What adaptor sequence is it hitting? Are these at contig boundaries? You might want to look at the reads mapped to the sequence in a viewer to see whether these sequences are making false joins in your assembly. If they are, you would want to split instead of hardmask. Eric |
Hi Eric, The cleaning and report were done when I submitted the genome to NCBI, which gave me a report (the modified version is the one I attached above) and a cleaned genome (with contamination removed at contig boundaries). As you can see in the report.txt file, all adaptor contamination is found at internal sequences within contigs, so my main objective is to split the contigs at these contamination sites and re-run FCS-adaptor to remove these sequences at contig boundaries. These adaptor sequences were mostly from a cDNA synthesis kit; you can see the original report attached. Here is the command I am using with the genome provided by NCBI cleaning (decontam_genome.fsa) and the modified adaptor contamination .txt file. |
Can you forward the email from the NCBI submissions team to eric.tvedte@nih.gov? |
This remains unresolved. I re-ran FCS-adaptor on my original genome and tried to use its output to TRIM internal adaptors, but the output from FCS-adaptor is not in the same format as what fcs.py uses. The wiki indicates that you can use the fcs_adaptor_report.txt for fcs.py, but I keep getting "Fatal error (St13runtime_error): util.cpp:212 in ConsumeMetalineHeader(...): Expected the first line of the input file to begin with header:
|
Hi Eric,
Thanks for the reply. I have attempted to do this twice now with fresh downloads of fcs.py and run_fcsadaptor.sh and am still getting the error: "Fatal error (St13runtime_error): util.cpp:212 in ConsumeMetalineHeader(...): Expected the first line of the input file to begin with header:
##[["FCS genome report",2,
found:
#accession length action range name"
Let me know what you think is the best next step!
Ben
On Apr 23, 2024, at 1:43 PM, Eric Tvedte wrote:
Hi Ben,
Sorry this isn't working as expected. Catching up with the previous communications...
If you are running FCS-adaptor on your own, the resulting fcs_adaptor_report.txt should have one row per sequence (not like the single action per row in the GX -style output). So scaffold_1 should look similar to this:
scaffold_1 3322941 ACTION_TRIM 139816..139841,140342..140366,338804..338829,339330..339353,433763..433788,546394..546418,889324..889348,953469..953496,953597..953623,1125514..1125536,1125637..1125661,1234579..1234604,1235105..1235129,1274509..1274533,1408151..1408175,1408676..1408701,1606116..1606140,1776689..1776714,1777215..1777241,1822912..1822936,1823437..1823460,1839569..1839595,2050714..2050739,2050840..2050864,2081692..2081717,2457137..2457161,2721378..2721403,2721504..2721529,2745521..2745546,2746147..2746172,2886862..2886886,2905448..2905474,3158039..3158057,3158558..3158583 CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00596.1:Evrogen Mint CDS-Gsu adapter polyT masked, contains PacBio ULI adapter subsequence; CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00577.1:CLONTECH 3'-RACE CDS Primer A polyT masked, contains PacBio ULI adapter subsequence; CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB02000.1:Oxford Nanopore Technologies Rapid Adapter (RA) Ligation Adapter top (LA) Native Adaptor top (NA) polyT masked
When you use fcs.py clean genome with FCS-adaptor style reports, make sure you are doing the following:
Make sure you are using the latest version of fcs.py
Use the original FASTA that you ran with run_fcsadaptor.sh, not the sequences in cleaned_sequences. See here <https://github.com/ncbi/fcs/wiki/FCS-adaptor-quickstart#clean-the-genome> for more details.
Don't change anything in the FCS-adaptor report if you want to split on internal contaminants. You do need to change ACTION_TRIM to FIX if you want to mask.
Run
cat input.fa | python3 ./fcs.py clean genome --action-report ./outputdir/fcs_adaptor_report.txt --output clean.fasta --contam-fasta-out contam.fasta
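For the masking case mentioned above (changing ACTION_TRIM to FIX), the edit to the report could be scripted roughly as follows. This is only a sketch: it assumes the FCS-adaptor report is tab-separated with the action in the third column, and the function name `trim_to_fix` is made up for illustration and is not part of fcs.py.

```python
def trim_to_fix(report_lines):
    """Yield report lines with the action column changed from ACTION_TRIM to FIX.

    Assumption: fields are tab-separated and the action is the third column.
    Header/comment lines (starting with '#') and blank lines pass through unchanged.
    """
    for line in report_lines:
        if line.startswith("#") or not line.strip():
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[2] == "ACTION_TRIM":
            fields[2] = "FIX"
        yield "\t".join(fields) + "\n"


# Small in-memory example; the "name" column is shortened here for readability.
example = [
    "#accession\tlength\taction\trange\tname\n",
    "scaffold_10002\t49608\tACTION_TRIM\t1..25\texample adaptor name\n",
]
for out_line in trim_to_fix(example):
    print(out_line, end="")
```

For splitting, no such edit is needed: the report is passed to fcs.py unchanged.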
When the clean run is completed, you should be able to see coordinates in the FASTA header corresponding to the split locations. So the first few splits produce the following...
>scaffold_1~433789..546393
>scaffold_1~339354..433762
>scaffold_1~338830..339329
>scaffold_1~140367..338803
>scaffold_1~139842..140341
>scaffold_1~1..139815
If you are still having this error after doing all of the above, let me know.
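As a sanity check on the coordinates, the split headers above are just the gaps between the ACTION_TRIM ranges in the scaffold_1 row. The sketch below shows that arithmetic only; it is not how fcs.py is implemented, and the helper names `parse_ranges` and `kept_segments` are invented for this example.

```python
def parse_ranges(range_field):
    """Parse a range column such as '139816..139841,140342..140366'
    into a sorted list of (start, end) tuples (1-based, inclusive)."""
    ranges = []
    for part in range_field.split(","):
        start, end = part.split("..")
        ranges.append((int(start), int(end)))
    return sorted(ranges)


def kept_segments(seq_length, trim_ranges):
    """Return the 1-based inclusive intervals that remain after
    removing every interval in trim_ranges from 1..seq_length."""
    kept = []
    next_start = 1
    for start, end in trim_ranges:
        if start > next_start:
            kept.append((next_start, start - 1))
        next_start = max(next_start, end + 1)
    if next_start <= seq_length:
        kept.append((next_start, seq_length))
    return kept


# First six trim ranges from the scaffold_1 row quoted above.
trims = parse_ranges(
    "139816..139841,140342..140366,338804..338829,"
    "339330..339353,433763..433788,546394..546418"
)
# Only the first six kept segments are shown, because the remaining
# trim ranges of scaffold_1 are not included in this short example.
for start, end in kept_segments(3322941, trims)[:6]:
    print(f">scaffold_1~{start}..{end}")
```

This prints the same six intervals as the headers listed above, in ascending rather than descending order.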
|
The output from run_fcsadaptor.sh looks like:
#accession length action range name
scaffold_10000 49619 ACTION_TRIM 49595..49619 CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00596.1:Evrogen Mint CDS-Gsu adapter polyT masked, contains PacBio ULI adapter subsequence
scaffold_10001 49611 ACTION_TRIM 49587..49611 CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00596.1:Evrogen Mint CDS-Gsu adapter polyT masked, contains PacBio ULI adapter subsequence
scaffold_10002 49608 ACTION_TRIM 1..25 CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00596.1:Evrogen Mint CDS-Gsu adapter polyT masked, contains PacBio ULI adapter subsequence
scaffold_10003 50100 ACTION_TRIM 17314..17339,17840..17864,50075..50100 CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00577.1:CLONTECH 3'-RACE CDS Primer A polyT masked, contains PacBio ULI adapter subsequence; CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00596.1:Evrogen Mint CDS-Gsu adapter polyT masked, contains PacBio ULI adapter subsequence
Which matches the expected output as described here https://github.com/ncbi/fcs/raw/main/examples/FCS_combo_test.fcs_adaptor_report.txt
Yet, fcs.py seems to want an input more similar to the FCS-GX output like https://github.com/ncbi/fcs/raw/main/examples/FCS_combo_test.fcs_gx_report.txt
Best, Ben
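For anyone hitting the same error, a quick way to see which of the two report styles a file uses is to look at its first line. The sketch below relies only on the two header prefixes quoted in this thread (`##[["FCS genome report"` for the GX-style report and `#accession` for the FCS-adaptor report); the function name and return labels are invented here, and this is not the check that fcs.py itself performs.

```python
import io

# Header prefixes taken from the error message and report quoted in this thread.
GX_HEADER_PREFIX = '##[["FCS genome report"'
ADAPTOR_HEADER_PREFIX = "#accession"


def detect_report_format(handle):
    """Classify a report by its first line: 'gx', 'adaptor', or 'unknown'."""
    first_line = handle.readline().rstrip("\n")
    if first_line.startswith(GX_HEADER_PREFIX):
        return "gx"
    if first_line.startswith(ADAPTOR_HEADER_PREFIX):
        return "adaptor"
    return "unknown"


# In-memory stand-in for an FCS-adaptor report (tab-separated is an assumption).
adaptor_example = io.StringIO(
    "#accession\tlength\taction\trange\tname\n"
    "scaffold_10002\t49608\tACTION_TRIM\t1..25\texample adaptor name\n"
)
print(detect_report_format(adaptor_example))
```

With a real file, the same function can be called on `open(path)` for whatever report path is being passed to `--action-report`.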
|
Hi Eric,
I had previous docker images from when I first used FCS that I needed to remove. I got fcs.py to run, but it seems to have converted all internal adapter contamination into N's rather than splitting (the contig number is the same). I checked the adaptor report and all actions are ACTION_TRIM, not FIX. Let me know if I am missing something. Thanks for all your help on this.
Best, Ben
On Apr 24, 2024, at 5:31 AM, Eric Tvedte wrote:
Are you using Docker or Singularity?
fcs.py should be able to handle both formats. I don't recognize this error message in the current code. Will continue to look.
In the meantime, can you try running fcs.py clean genome on that example from the wiki? You linked the adaptor report above, and the FASTA is retrievable from zenodo. I tested this on Docker just last week and got it to work.
If this works, it is suggesting there is something different/conflicting with your adaptor report. If this doesn't work, this is some kind of software/image issue.
|
OK, you're using Docker. Can you verify that the version in the docker image is up-to-date when running Also, please try using the example from the wiki. That has cases with internal ACTION_TRIMs called by FCS-adaptor and should default to splitting with |
I believe I am running into the same issue on my end. Did this ever get resolved? |
@Hannah1746 we are not aware of any issues with splitting vs. masking in the current release. Please verify that you are using the v0.5.0 release. It would be helpful if you could provide additional details:
|
There is a new FCS v0.5.4 release that can be tested. Make sure you are using the latest release when screening/cleaning genomes. There weren't any changes relevant to the content of this GitHub issue, but we haven't received any additional information that would help us to troubleshoot. If you're still having this problem with v0.5.4, feel free to re-open the issue. |
Hello,
I have been working on submitting a genome to NCBI, and along with many others have internal contamination by adaptors (as identified by FCS-adaptor). As recommended, I am trying to use the output from FCS-adaptor as an input to FCS-GX to split or mask the internal adaptors. I made a new report.txt file with action "FIX" or "SPLIT" and get "Applied 0 actions; 0 bps dropped; 0 bps hardmasked" back with no modification to the genome. I noticed someone is having a similar issue in issue #66.
I have attached the modified report.txt file - is there any issue with filling columns with "NA"? is there information I am missing? Thanks for the help. Looking forward to seeing a SPLIT or internal trim function in FCS-adaptor soon!
adaptor_contamination_fcsgx.txt