Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pyfastx.read.raw mangling reads #75

Closed
BioWilko opened this issue Oct 30, 2023 · 8 comments
Closed

Pyfastx.read.raw mangling reads #75

BioWilko opened this issue Oct 30, 2023 · 8 comments

Comments

@BioWilko
Copy link

When iterating through a gzipped fastq file and dumping reads to various output filehandles using read.raw from records stored in a dict reads dumped to files are sometimes mangled. It seems that the way the raw string is generated is flawed in some way leading to weirdness like the below, this occurs predictably and the mangling can be changed by removing a read from the start of the file.

I managed to get around this by not indexing the fastq files in question and reconstructing the record from the name, seq, qual tuple.

An example of a mangled read record:

@SRR5382385.27877 27877 length=135
ATCCAACCGAGCAAGCCGTTCCCGGCTGGTACTCAAATGTAAAAATTTTTATCTGCCGTCTTGAAAATTTTTCCGGCGAACATAAATTTTTATCACGCGACGCCGAATGCGGATCGCGCACTTAATCACTTGGAT
+SRR5382385.27877 27877 length=135
AAF<FFFAFFFJFJFJJJ<7FFF<<<JJF<JJJJFJJFJJFJJJJJJJJJJ<FJFJ--7AF<F<F-FJ-JJJFF<A<-77JFJJF-FJJJJJFA<JFFAAA-<7A-FFAFFJ--7A7FFFJJJFJ77FA--FAFJ
@SRR5382385.27886 27886 length=135
AGTTTTGCCCGCGAA@SRR5382385.27886 27886 length=135
AGTTTTGCCCGCGAA@SRR5382385.27886 27886 length=135
AGTTTTGCCCGCGAA@SRR5382385.27886 27886 length=135
AGTTTTGCCCGCGAA@SRR5382385.27886 27886 length=135
AGTTTTGCCCGCGAA@SRR5382385.27886 27886 length=135
AGTTTTGCCCGCGAA@SRR5382385.27886 27886 length=135
AGTTTT@SRR5382385.27887 27887 length=135
ACTTTGCGCTCACCGAAAAACGCCGCCTGCAATGCCGGAATACCCAATTGCCGATAGGGGCTGATATTACGAAATTCGCGGGCAAAACTCGGCTCGCCAAACGGGCGGTGCATGGTGCGCGTCAGGTAGTTGCTC
+SRR5382385.27887 27887 length=135
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ-FJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJ-7FFJJJJJJJJJJJJJJJJJJJ

A script which can be used to reproduce this issue is available here happy to provide any more information if needed to trace the problem.

@lmdu
Copy link
Owner

lmdu commented Nov 5, 2023

Thank you for reporting this issue. I will check and fix it.

@lmdu
Copy link
Owner

lmdu commented Nov 25, 2023

I have download the SRR5382385 to test. But I could not reproduce the issue. Could you provide a simple script for me to reproduce this issue. Thanks.

@BioWilko
Copy link
Author

Sure no problem, I successfully replicated the issue with this script:

import pyfastx
import sys

input_fastq = pyfastx.Fastq(sys.argv[1])

for read in input_fastq:
    sys.stdout.write(read.raw)

Using version 2.0.2 installed from pip on an M2 arm64 macbook with a 697M gzipped fastq, whether or not the fastq is compressed seems to have no impact leading to identical mangling. I can provide the specific fastq if desired but I get the same issue with any large fastq.

@lmdu
Copy link
Owner

lmdu commented Nov 30, 2023

Thanks! I have no macbook with M2 arm64 to test. But i will try to do it. That may take a long time. You can try to use python setup.py install to install pyfastx from source code.

@BioWilko
Copy link
Author

BioWilko commented Nov 30, 2023

I just fully reproduced the problem in a fresh environment on an x86_64 cluster like:

mamba  env create -n pyfastx_test pip seqtk

wget {fastq_file}

seqtk size {fastq_file}

Returns 761975 379155685

pip install pyfastx

python pyfastx_minimal_example.py {fastq_file} > test.fastq

seqtk size test.fastq

Returns 646 378608

With the 647th read looking like (quality line for first read merged with read id line from next read):

@18569ca3-3f8b-4fff-b68b-9b2ab5e81f7d runid=ca30319f25b536efe90b79a721af63532e657a6c read=4884 ch=56 start_time=2023-09-22T09:48:04.386162+00:00 flow_cell_id=FAW72829
ATGTCCTGTTCTTGGTTCAGTTATGCGACCAAGCCATATTCACGGGAAAGTGGCCTACCGTGACTCGATTCCGTTTGTAGTCGTCTGTCGTTTTTCGTGCGCCGCTTCAACTATGTCCCCCGCTCACGCGTCACCGACTGCCAGCGACGGCCGGGTATCGGCCCGACGCTCCAGCGCCATCCATTTTCAGGGCTAGTTGATCCGGCAGGTGAGTTGTTACACACTCCTTAGCGGATTCCGACTTCCACGGCCACCGTCCTGCTGTCTATATCAACCAACACCTTTTCTGGGGTCTGATGAGCGTCGGCATCGGGCGCCTAACCCGGCGTTCGGTTCATCCCGCAGCGCCAGTTCTCTACCAAAAGTGGCCCGCTTCGGCACTCGCATTCCACGCCCGGCTCCACGCCAGCGAGCCGGGCTTCTTACCCATTTAAAGTTTGAGAATAGGTTGAGATCCCCATGTTGAAGACGGCGCACGAAAACAACAACGACTACAACAGAAATGAGTCAC
+
))**'#$%%&%%$#$%(''''''&''&$####$%&%%&**&''$%%%&'()'(++..166?ABBB><<=@GIEGFGQIFKKIQHHKJC1011/,5@18569ca3-3f8b-4fff-b68b-9b2ab5e81f7d runid=ca30319f25b536efe90b79a721af63532e657a6c read=4884 ch=56 start_time=2023-09-22T09:48:04.386162+00:00 flow_cell_id=FAW72829 protocol_group_id=Flying_Sharks_Metagenomics_20230922 sample_id=Flying_Sharks_Metagenomics_20230922 barcode=barcode02 barcode_alias=barcode02 parent_read_id=18569ca3-3f8b-4fff-b68b-9b2ab5e81f7d basecall_model_version_id=dna_r10.4.1_e8.2_5khz_400bps_@45a218bd-8cb7-4368-b5d9-a4574d98ab7e runid=ca30319f25b536efe90b79a721af63532e657a6c read=867 ch=237 start_time=2023-09-22T09:14:55.386162+00:00 flow_cell_id=FAW72829 protocol_group_id=Flying_Sharks_Metagenomics_20230922 sample_id=Flying_Sharks_Metagenomics_20230922 barcode=barcode02 barcode_alias=barcode02 parent_read_id=45a218bd-8cb7-4368-b5d9-a4574d98ab7e basecall_model_version_id=dna_r10.4.1_e8.2_5khz_400bps_hac@v4.2.0 barcode=barcode02
ATGTTTTGCATCTTACTTCGTTCAGTTGTGAACGAACCGATATTAGCCGCCTGTCCTACCGTGACTCGATTCCGTTTGTAGTCGTCTGTCGTTTTTCGTGCGCCGCTTCAACATGGGCGCCGTCGGCGGCCGGGCCGAGGTGGGATCCCGAGGCCTCTCCAGTCCGCCGAGGGCGCACCACCGGCCCGTCTCGCCCGCCGCGCCGGGGAGGTGGAGCACGAGCGCACGTGTTAGGACCCGAAAGATGGTGAACTATGCCTGGGCAGGGCGAAGCCAGAGGAAACTCTGGTGGAGGTCCGTAGCGGTCCTGACGTGCAAATCGGTCGTCCGACCTGGGTATAGGGGCGAAAGACTAATCGAACCATCTAGTAGCTGGTTCCCTCCGAAGTTTCCCTCAGGATAGCTGGCGCTCTCGCACGACCACCCCCGCCGTTTATCCGGTAAAGCGAATGATTAGAGGTCTTGGGGCCGAAAACGATCTCAACCTATTCTCAAACTTTAAATGGGTAAGAAGCCCGGCTCGCTCGCGTGGAGCCGGGCGTGGAGGTGGATGCCTAGTGGGCCACTTTTGGTAAGCAGAACTGGCGCTGCGGGATGAACCGAACGCCAGGGTACGGCGCCGATGCCGACGCTCATAGGACCCCCAGAAAAGGTGTTGTTGATATAGACAGCAGGACGGTGGCCATGGAAGTCGGAATCCGCTAAGGAGTGTGTAACAACTCACCTGCCGAATCAACTAGCCCTGAAAATGGATGGCGCTGGAGCGTCGGGCCCATACCCGGCCGTCGCTGGCAGTCGGTGACGCGTGAGAGGGGGACGGGTATGGGCCCGACGCTCCAGCGCCATCCCCTTTCAGGGCTGATTGATACGGCAGGTGAGTCTACACACTCCTTAGCGGATTCCGACTTCCATGGCCACCGTCCTGCTGTCTATATCAACCAACACCTTTTCTGGTCGTGAGCGTCAGCCATGCGGGCGCCTTAACCCGGCGTTGGTCCGATCCCGCAGCGCCGGTCTGCTTACCGAAAGTGGCCCGCTAGGCACTCGCATTCCCACACCCTCCGCGCCGGCCGAGAGCCGCTCTTACCCATTTAAAGTTTGAGAATAGGTTGAGATCGTTTCGGCCCCAAGACCTCTAATAGTTCGCTTTACCGGATAAAACTGCGGGGGGTGGTCGTGCGAGAGCGCAGCTAGCCTGAGGGAGACCAGAGGGGACCGTACTAGATAGTTCGATTAGTCTTTCGCCCCTACCCAGGTCGGACGACCGAGTTTTAGCATCAGTACGGACCTCCACCAGAGCGTCCTCGCGGCTTCGCCCTGCCCGGGCGTAGCCCAGCCGTCTTTCGGGTCCTAACACGTGCGCTCGTGCTGCCTCCCGGCGCGGCGCGAGACGGCGGGTCGTGCCTCGGCGGACTGGGAGCCGGGACCCCCGCCGCCGACAGCACCATGTTGAAGCGGCGCACGAAAAACGACAGACGACTACAAACGGAATGGAGTCG
(pyfastx) jovyan:~$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  48
  On-line CPU(s) list:   0-47
Vendor ID:               GenuineIntel
  Model name:            Intel Xeon Processor (Skylake, IBRS)
    CPU family:          6
    Model:               85
    Thread(s) per core:  2

@rmcolq
Copy link

rmcolq commented Dec 5, 2023

I'm also interested in this fix

@lmdu
Copy link
Owner

lmdu commented Feb 28, 2024

Fixed in v2.1.0

@BioWilko
Copy link
Author

Amazing, thanks @lmdu !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants