
requirement: splitting on the number of lines/reads #8

Closed
Toliman06 opened this issue Nov 22, 2017 · 6 comments

@Toliman06

Hi,

I'm wondering if it is possible to add a new split option: could files be split by a certain number of reads rather than into a certain number of sub-files?
This would be useful for parallelizing and standardizing the downstream alignments (estimating the execution time of each sub-sample) when you don't know the size of your input fastq.gz file...

@sfchen
Member

sfchen commented Nov 22, 2017

This will be implemented :)

sfchen added a commit that referenced this issue Nov 26, 2017
@sfchen
Member

sfchen commented Nov 26, 2017

Hi @Toliman06 , this feature is implemented.

You can use -S <lines> to limit the number of lines in each split file, instead of using -s to limit the total number of split files.

For example, -S 10000 means every output file will contain close to 10000 lines.

You can use either -s or -S for splitting output, but you cannot choose both.

Could you please try the latest code or binary (not the release) and update this issue?
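
A minimal command sketch of the above (the input/output file names here are placeholders, not from this thread):

```sh
# Split the output into files of roughly 10000 lines each.
# Each split file gets a numeric prefix, e.g. 0001.out.fastq, 0002.out.fastq, ...
fastp -i input.fastq.gz -o out.fastq -S 10000
```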

@Toliman06
Author

Nice! I will try this and keep you posted.

@Toliman06
Author

Toliman06 commented Nov 28, 2017

The program has a strange behavior:
```
root@seqonesandbox:/data/analysis/bbb9ba7-b4e062897804/input_files# fastp -i 4d64173b-71dc-4aef-a877-fda22c960a66_1.fastq.gz -S 10000 -o test.fastq
root@seqonesandbox:/data/analysis/bbb9ba7-b4e062897804/input_files# wc -l 0174.test.fastq
11996 0174.test.fastq
root@seqonesandbox:/data/analysis/bbb9ba7-b4e062897804/input_files# wc -l 0165.test.fastq
12000 0165.test.fastq
root@seqonesandbox:/data/analysis/bbb9ba7-b4e062897804/input_files# wc -l 0138.test.fastq
12000 0138.test.fastq
root@seqonesandbox:/data/analysis/bbb9ba7-b4e062897804/input_files# wc -l 0116.test.fastq
12000 0116.test.fastq
root@seqonesandbox:/data/analysis/bbb9ba7-b4e062897804/input_files# wc -l 0073.test.fastq
12000 0073.test.fastq
```
There are 12000 lines per file instead of 10000.

PS: perhaps the number of reads per file would be more useful than the number of lines (since a read is supposed to be 4 lines...)

@sfchen
Member

sfchen commented Nov 28, 2017

This is reasonable.

fastp reads data in blocks, and a block contains 1000 reads (4000 lines). fastp also writes by blocks, so the output may be a little more than 10000 lines.

Since the line limit setting is usually much greater than 1000 lines, this will not cause any problems.
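
A rough sanity check of the numbers reported above, assuming each split file is flushed in whole 4000-line blocks: the requested limit is effectively rounded up to the next block boundary.

```sh
# Round the requested line limit up to the next multiple of the 4000-line block.
limit=10000
block=4000
echo $(( (limit + block - 1) / block * block ))   # prints 12000, matching the observed file sizes
```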

@Toliman06
Author

Nice! It could be helpful to mention this in the README.
I will close this issue.
