Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

seqtk sample not working as expected #193

Closed
antoine4ucsd opened this issue Jun 9, 2022 · 2 comments
Closed

seqtk sample not working as expected #193

antoine4ucsd opened this issue Jun 9, 2022 · 2 comments

Comments

@antoine4ucsd
Copy link

Hello
I am trying to subsample fastq.gz file but not sure if it really works as expected above a given limit.

my source file contains 150k reads

awk '{s++}END{print s/4}' ./BA922J_barcode16_run5_merged.fastq.gz
150626

but when trying to subset:

seqtk sample -s100 BA922J_barcode16_run5_merged.fastq.gz 10000 > BA922J_10000.gz
seqtk sample -s100 BA922Jl_barcode16_run5_merged.fastq.gz 40000 > BA922J_40000.gz
seqtk sample -s100 BA922J_barcode16_run5_merged.fastq.gz 60000 > BA922J_60000.gz
seqtk sample -s100 BA922J_barcode16_run5_merged.fastq.gz 80000 > BA922J_80000.gz
seqtk sample -s100 BA922J_barcode16_run5_merged.fastq.gz 100000 > BA922J_100000.gz

then the file size is plateauing...

-rwxrwxrwx  1  staff    82M Jun  9 08:36 BA922J_10000.gz
-rwxrwxrwx  1  staff   314M Jun  9 08:36 BA922J_40000.gz
-rwxrwxrwx  1  staff   314M Jun  9 08:36 BA922J_60000.gz
-rwxrwxrwx  1  staff   314M Jun  9 08:36 BA922J_80000.gz
-rwxrwxrwx  1  staff   314M Jun  9 08:37 BA922J_100000.gz

also need to make sure this is not resampling the same reads. can you confirm (for example if I set the sample to 200k)

not sure what I am doing wrong...
thank you!

@shenwei356
Copy link

What's the size of the original file? And also, check the number of reads with seqkit stats.

seqkit stats -j 10 *.gz

PS: The command below does not output gzip format.

seqtk sample -s100 BA922J_barcode16_run5_merged.fastq.gz 10000 > BA922J_10000.gz

# this does
seqtk sample -s100 BA922J_barcode16_run5_merged.fastq.gz 10000 | pigz -c > BA922J_10000.gz

@antoine4ucsd
Copy link
Author

thank you. good catch for the typo in the cmd line.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants