CRISPR prediction assertion error #299

Closed
ohickl opened this issue Jul 1, 2024 · 1 comment

@ohickl
Contributor

ohickl commented Jul 1, 2024

Hi Oliver,
I have now performed quite a few runs on metagenomes of various sizes with the modified Bakta version from #289 and had barely any problems, except for one sample, and there specifically with one contig.
In that contig, a spacer seq predicted by PILER-CR doesn't match the seq in the fasta at the position parsed by Bakta.

Here is the stdout from a run with regular Bakta and just the one contig:

Bakta v1.9.3
Options and arguments:
        input: .../bakta_devel/test/data/piler_test_contig.fna
        db: .../databases/imp3/bakta/db, version 5.1, full
        output: .../bakta_devel/test/sanity_check_piler_base
        force: True
        tmp directory: .../bakta_devel/test/tmp/tmp5fici64f
        prefix: annotation
        threads: 128
        debug: True
        meta mode: True
        translation table: 11
        keep contig headers: True

Bakta runs in DEBUG mode! Temporary data will not be destroyed at: .../bakta_devel/test/tmp/tmp5fici64f

parse genome sequences...
        imported: 1
        filtered & revised: 1
        contigs: 1

start annotation...
predict tRNAs...
        found: 0
predict tmRNAs...
        found: 0
predict rRNAs...
        found: 0
predict ncRNAs...
        found: 0
predict ncRNA regions...
        found: 0
predict CRISPR arrays...
array_id=1, contig_id=contig_0001, position=19605, repeat_length=47, repeat_seq=...........................T..................., spacer_seq=TTCCGGTG, spacer_length=8, gap_count=0, spacer_genome_seq=TGCGGACT
Traceback (most recent call last):
  File ".../bin/bakta", line 10, in <module>
      sys.exit(main())
  File ".../lib/python3.10/site-packages/bakta/main.py", line 210, in main
    genome['features'][bc.FEATURE_CRISPR] = crispr.predict_crispr(genome, contigs_path)
  File ".../lib/python3.10/site-packages/bakta/features/crispr.py", line 108, in predict_crispr
    assert spacer_seq == spacer_genome_seq  # assure PILER-CR provided sequence equals sequence extracted from genome
AssertionError

I do find the PILER-CR spacer seq in the fasta at an offset of -6 from crispr_spacer['start'], but I'm not sure how or if it's relevant:

Start, Stop: 19652 19659
Spacer:  TGCGGAC
Spacer - 10:  TCCCTTCCGG
Spacer + 10:  TTCCGTTAAT
Spacer rev comp:  GTCCGCA
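
The offset check above can be reproduced with a short script. This is only a minimal sketch: `find_offsets` and `reverse_complement` are hypothetical helper names, not Bakta functions, and the fragment is stitched together from the snippets printed above (10 bases upstream, the extracted spacer, 10 bases downstream) rather than read from the full contig.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_offsets(contig: str, query: str, expected_start: int):
    """List (offset, strand) for every hit of query, relative to expected_start (0-based)."""
    hits = []
    for strand, pattern in (("+", query), ("-", reverse_complement(query))):
        pos = contig.find(pattern)
        while pos != -1:
            hits.append((pos - expected_start, strand))
            pos = contig.find(pattern, pos + 1)
    return sorted(hits)


# Toy fragment assembled from the values printed in this report (an assumption,
# not the real contig): upstream 10 bases + extracted spacer + downstream 10 bases.
fragment = "TCCCTTCCGG" + "TGCGGAC" + "TTCCGTTAAT"

# The spacer reported by PILER-CR; Bakta expected it to start at index 10 of the fragment.
hits = find_offsets(fragment, "TTCCGGTG", expected_start=10)
print(hits)  # [(-6, '+')]
```

On this fragment the PILER-CR spacer is found once, on the forward strand, 6 bases upstream of the parsed start.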

I'm not sure whether the problem is PILER-CR itself or the result parsing, but I get this with all Bakta versions.

I attached the fasta of the contig and a manual PILER run with the same params used in Bakta.

Let me know if I should get in touch with the PILER-CR people directly about this.

Best

Oskar
piler_test_out.txt
piler_test_contig.txt

@ohickl added the bug (Something isn't working) label Jul 1, 2024
oschwengers added a commit that referenced this issue Jul 16, 2024
The assumed minimal length of CRISPR spacers was reduced from at least 10 down to at least 1 in the regex.
@oschwengers
Owner

Hi @ohickl,
thank you very much for reporting! I just took a deeper look into this and found a wrong assumption that I made for the CRISPR parser of PILER-CR.

I wrongly assumed that all CRISPR spacers have a length of N>=10. In your example, however, the CRISPR spacer has a length of 8, so the regex didn't match the result file. I pushed a PR with a fix. I'll run some tests and publish a patch release soon, hopefully by the end of this week or next.
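
To illustrate the kind of mismatch described here, below is a minimal sketch. It does not reproduce Bakta's actual regex or the real PILER-CR report layout: the row format (position, repeat length, spacer length, spacer sequence), the pattern names, and `parse_row` are all simplified assumptions. The values come from the debug line in the report.

```python
import re

# Hypothetical, simplified row layout: position, repeat length, spacer length, spacer seq.
# STRICT assumes a two-digit spacer length (N >= 10); RELAXED accepts 1 or 2 digits.
STRICT = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\d{2})\s+([ACGT]+)\s*$")
RELAXED = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\d{1,2})\s+([ACGT]+)\s*$")


def parse_row(row: str, pattern: re.Pattern):
    """Return the parsed fields as a dict, or None if the row does not match."""
    m = pattern.match(row)
    if m is None:
        return None
    position, repeat_length, spacer_length, spacer_seq = m.groups()
    return {
        "position": int(position),
        "repeat_length": int(repeat_length),
        "spacer_length": int(spacer_length),
        "spacer_seq": spacer_seq,
    }


row = "19605  47  8  TTCCGGTG"      # spacer of length 8, as in this issue
print(parse_row(row, STRICT))       # None
print(parse_row(row, RELAXED)["spacer_length"])  # 8
```

With the two-digit assumption the row with the short spacer is not parsed at all, while the relaxed pattern accepts it.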

Thanks again!

@oschwengers self-assigned this Jul 16, 2024
@oschwengers added this to the v1.9.4 milestone Jul 16, 2024
oschwengers added a commit that referenced this issue Jul 16, 2024
* fix CRISPR parser #299

The assumed minimal length of CRISPR spacers was reduced from at least 10 down to at least 1 in the regex.
* polish code
* relax CRISPR spacer length regex to 1 or 2 digits
@ohickl ohickl closed this as completed Jul 17, 2024