Using FCS-GX to split/mask internal adaptor sequences #67

Closed
bndaniel opened this issue Feb 7, 2024 · 13 comments

Comments

@bndaniel

bndaniel commented Feb 7, 2024

Hello,

I have been working on submitting a genome to NCBI and, like many others, have internal adaptor contamination (as identified by FCS-adaptor). As recommended, I am trying to use the output from FCS-adaptor as input to FCS-GX to split or mask the internal adaptors. I made a new report.txt file with the action "FIX" or "SPLIT", but I get "Applied 0 actions; 0 bps dropped; 0 bps hardmasked" back and the genome is not modified. I noticed someone is having a similar issue in issue #66.

I have attached the modified report.txt file. Is there any issue with filling columns with "NA"? Is there information I am missing? Thanks for the help. Looking forward to seeing a SPLIT or internal trim function in FCS-adaptor soon!

adaptor_contamination_fcsgx.txt

@etvedte
Contributor

etvedte commented Feb 7, 2024

Hello,

Can you post the exact commands and input you used for FCS-adaptor screening and subsequent cleaning? By "output from FCS-adaptor as an input to FCS-GX", I assume you mean the adaptor report used on the uncleaned FASTA.

is there any issue with filling columns with "NA"

I should test this sometime. Don't think it should be an issue, though.

There is quite a bit of adaptor contamination here. What adaptor sequence is it hitting? Are these at contig boundaries? You might want to look at the reads mapped to the sequence in a viewer to see whether these sequences are making false joins in your assembly. If they are, you would want to split instead of hardmask.

Eric

@bndaniel
Author

bndaniel commented Feb 7, 2024

Hi Eric,

The cleaning and report were done when I submitted the genome to NCBI, which gave me a report (the modified version is the one I attached above) and a cleaned genome (with contamination at contig boundaries removed). As you can see in the report.txt file, all of the adaptor contamination is internal to contigs, so my main objective is to split the contigs at these contamination sites and then re-run FCS-adaptor to remove these sequences at contig boundaries.

These adaptor sequences were mostly from a cDNA synthesis kit; you can see the original report attached.

Here is the command I am using, with the genome provided by NCBI cleaning (decontam_genome.fsa) and the modified adaptor contamination .txt file:
python3 ./fcs.py clean genome -i ./decontam_genome.fsa --action-report ./adaptor_contamination_fcsgx.txt --output ./clean_genome.fasta --contam-fasta-out ./contam.fasta

RemainingContamination.txt

@etvedte
Contributor

etvedte commented Feb 7, 2024

Can you forward the email from the NCBI submissions team to eric.tvedte@nih.gov?

@bndaniel
Author

This remains unresolved. I re-ran FCS-adaptor on my original genome and tried to use its output to TRIM the internal adaptors, but the output from FCS-adaptor is not in the format that fcs.py expects. The wiki indicates that you can use fcs_adaptor_report.txt with fcs.py, but I keep getting "Fatal error (St13runtime_error): util.cpp:212 in ConsumeMetalineHeader(...): Expected the first line of the input file to begin with header:
##[["FCS genome report",2,
found:
#accession length action range name"
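
The error above turns on the first line of the report file. A minimal sketch for checking which style a report looks like before handing it to fcs.py: the two prefixes are copied from the error message, while the function names and the returned labels are hypothetical and not part of fcs.py.

```python
def guess_report_style(first_line: str) -> str:
    """Classify an action report by its first line (illustrative only)."""
    line = first_line.strip()
    # Prefix quoted in the error message as the expected GX-style header.
    if line.startswith('##[["FCS genome report"'):
        return "gx"
    # Column header that the error message reports finding instead.
    if line.startswith("#accession"):
        return "adaptor"
    return "unknown"


def report_style(path: str) -> str:
    """Read only the first line of the file at `path` and classify it."""
    with open(path) as handle:
        return guess_report_style(handle.readline())
```

Calling `report_style("fcs_adaptor_report.txt")` on a report like the one described here would be expected to return `"adaptor"`, which is a quick way to confirm what is actually being passed to `--action-report`.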

@etvedte
Contributor

etvedte commented Apr 23, 2024

Hi Ben,

Sorry this isn't working as expected. Catching up with the previous communications...

If you are running FCS-adaptor on your own, the resulting fcs_adaptor_report.txt should have one row per sequence (unlike the GX-style output, which has a single action per row). So scaffold_1 should look similar to this:

scaffold_1      3322941 ACTION_TRIM     139816..139841,140342..140366,338804..338829,339330..339353,433763..433788,546394..546418,889324..889348,953469..953496,953597..953623,1125514..1125536,1125637..1125661,1234579..1234604,1235105..1235129,1274509..1274533,1408151..1408175,1408676..1408701,1606116..1606140,1776689..1776714,1777215..1777241,1822912..1822936,1823437..1823460,1839569..1839595,2050714..2050739,2050840..2050864,2081692..2081717,2457137..2457161,2721378..2721403,2721504..2721529,2745521..2745546,2746147..2746172,2886862..2886886,2905448..2905474,3158039..3158057,3158558..3158583      CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00596.1:Evrogen Mint CDS-Gsu adapter polyT masked, contains PacBio ULI adapter subsequence; CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00577.1:CLONTECH 3'-RACE CDS Primer A polyT masked, contains PacBio ULI adapter subsequence; CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB02000.1:Oxford Nanopore Technologies Rapid Adapter (RA) Ligation Adapter top (LA) Native Adaptor top (NA) polyT masked
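
To make the row layout concrete, here is a rough sketch of reading one such row in Python. It assumes the columns are tab-separated in the order accession, length, action, range, name (the order given in the report header quoted in the error message earlier in this thread); `AdaptorRow` and `parse_adaptor_row` are hypothetical names, not part of FCS.

```python
from typing import List, NamedTuple, Tuple


class AdaptorRow(NamedTuple):
    accession: str
    length: int
    action: str
    ranges: List[Tuple[int, int]]
    name: str


def parse_adaptor_row(line: str) -> AdaptorRow:
    """Parse one row of an FCS-adaptor style report (assumed tab-separated)."""
    accession, length, action, range_field, name = line.rstrip("\n").split("\t", 4)
    ranges = []
    # The range column is a comma-separated list of start..end pairs.
    for item in range_field.split(","):
        start, end = item.split("..")
        ranges.append((int(start), int(end)))
    return AdaptorRow(accession, int(length), action, ranges, name)


row = parse_adaptor_row(
    "scaffold_1\t3322941\tACTION_TRIM\t139816..139841,140342..140366\t"
    "CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00596.1"
)
print(row.action, row.ranges)
```

A single row therefore carries every range for that sequence, which is why a report with one range per row does not look like FCS-adaptor output.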

When you use fcs.py clean genome with FCS-adaptor style reports, make sure you are doing the following:

  1. Make sure you are using the latest version of fcs.py
  2. Use the original FASTA that you ran with run_fcsadaptor.sh, not the sequences in cleaned_sequences. See here for more details.
  3. Don't change anything in the FCS-adaptor report if you want to split on internal contaminants. You do need to change ACTION_TRIM to FIX if you want to mask.
  4. Run
cat input.fa | python3 ./fcs.py clean genome --action-report ./outputdir/fcs_adaptor_report.txt --output clean.fasta --contam-fasta-out contam.fasta
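
For step 3, if masking rather than splitting is wanted, the ACTION_TRIM to FIX change can be scripted instead of edited by hand. This is only a sketch: it assumes a tab-separated report with the action in the third column, and `trim_to_fix` is a hypothetical helper, not an fcs.py feature.

```python
def trim_to_fix(report_text: str) -> str:
    """Return the report text with ACTION_TRIM replaced by FIX in the action column."""
    out = []
    for line in report_text.splitlines():
        fields = line.split("\t")
        # Header/comment lines and rows with other actions are left untouched.
        if not line.startswith("#") and len(fields) >= 3 and fields[2] == "ACTION_TRIM":
            fields[2] = "FIX"
        out.append("\t".join(fields))
    return "".join(item + "\n" for item in out)
```

Write the result to a new file and pass that file to `--action-report`, keeping the unmodified report around in case you want to split later.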

When the clean run is completed, you should be able to see coordinates in the FASTA header corresponding to the split locations. So the first few splits produce the following...

>scaffold_1~433789..546393
>scaffold_1~339354..433762
>scaffold_1~338830..339329
>scaffold_1~140367..338803
>scaffold_1~139842..140341
>scaffold_1~1..139815  
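
The coordinates in these headers are simply the stretches of scaffold_1 between consecutive trimmed ranges. The sketch below reproduces that interval arithmetic from the first six ranges of the example row; it is not how fcs.py is implemented, and `kept_segments` is a hypothetical name.

```python
def kept_segments(length, trim_ranges):
    """Return 1-based inclusive (start, end) segments left after removing trim_ranges."""
    segments = []
    position = 1  # first base not yet assigned to a segment or a trimmed range
    for start, end in sorted(trim_ranges):
        if start > position:
            segments.append((position, start - 1))
        position = max(position, end + 1)
    if position <= length:
        segments.append((position, length))
    return segments


# First six ranges from the scaffold_1 row of the adaptor report above.
ranges = [
    (139816, 139841), (140342, 140366), (338804, 338829),
    (339330, 339353), (433763, 433788), (546394, 546418),
]
for start, end in kept_segments(3322941, ranges)[:6]:
    print(f">scaffold_1~{start}..{end}")
```

This prints the same six coordinate pairs as the headers listed above, in ascending order, so you can check your own cleaned FASTA headers against the ranges in your report.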

If you are still having this error after doing all of the above, let me know.

@bndaniel
Author

bndaniel commented Apr 23, 2024 via email

@bndaniel
Author

bndaniel commented Apr 23, 2024 via email

@etvedte
Contributor

etvedte commented Apr 24, 2024

Are you using Docker or Singularity?

fcs.py should be able to handle both formats. I don't recognize this error message in the current code. Will continue to look.

In the meantime, can you try running fcs.py clean genome on that example from the wiki? You linked the adaptor report above, and the FASTA is retrievable from zenodo. I tested this on Docker just last week and got it to work.

If this works, it suggests there is something different or conflicting in your adaptor report. If it doesn't work, there is some kind of software/image issue.

@bndaniel
Author

bndaniel commented May 1, 2024 via email

@etvedte
Contributor

etvedte commented May 1, 2024

OK, you're using Docker. Can you verify that the version in the Docker image is up to date when running fcs.py commands? It should be v0.5.0.

Also, please try using the example from the wiki. That has cases with internal ACTION_TRIMs called by FCS-adaptor and should default to splitting with fcs.py clean genome. See what happens.

@Hannah1746

I believe I am running into the same issue on my end. Did this ever get resolved?

@etvedte
Contributor

etvedte commented Jun 5, 2024

@Hannah1746 we are not aware of any issues with splitting vs. masking in the current release. Please verify that you are using the v0.5.0 release. It would be helpful if you could provide additional details:

  • Are you using Docker or Singularity?
  • Post your full fcs.py clean genome command
  • Post the header of the action report and the first few lines detailing contaminants that are to be cleaned
  • Post the error message

@etvedte
Contributor

etvedte commented Jun 26, 2024

There is a new FCS v0.5.4 release that can be tested. Make sure you are using the latest release when screening/cleaning genomes. There weren't any changes relevant to the content of this GitHub issue, but we haven't received any additional information that would help us to troubleshoot. If you're still having this problem with v0.5.4, feel free to re-open the issue.

etvedte closed this as completed Jun 26, 2024