
requirement: splitting on the number of lines/reads #8

Closed
Toliman06 opened this issue Nov 22, 2017 · 6 comments

@Toliman06

Hi,

I'm wondering if it is possible to add a new split option: could files be split by a certain number of reads rather than into a certain number of sub-files?
This would be useful for parallelizing and standardizing the downstream alignments (estimating the execution time of each sub-sample) when you don't know the size of your input fastq.gz file...

@sfchen
Member

sfchen commented Nov 22, 2017

This will be implemented :)

sfchen added a commit that referenced this issue Nov 26, 2017
@sfchen
Member

sfchen commented Nov 26, 2017

Hi @Toliman06 , this feature is implemented.

You can use -S <lines> to limit the number of lines in each split file, instead of using -s to limit the total number of split files.

For example, -S 10000 means every output file will contain close to 10000 lines.

You can use either -s or -S for splitting output, but you cannot choose both.

Could you please try the latest code or binary (not the release) and update this issue?
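
A minimal command sketch of the above (the input/output file names here are placeholders, not from this thread):

```sh
# Split the output into files of roughly 10000 lines each.
# Each split file gets a numeric prefix, e.g. 0001.out.fastq, 0002.out.fastq, ...
fastp -i input.fastq.gz -o out.fastq -S 10000
```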

@Toliman06
Author

Nice! I will try this and keep you posted.

@Toliman06
Author

Toliman06 commented Nov 28, 2017

The program has a strange behavior:
```
root@seqonesandbox:/data/analysis/bbb9ba7-b4e062897804/input_files# fastp -i 4d64173b-71dc-4aef-a877-fda22c960a66_1.fastq.gz -S 10000 -o test.fastq
root@seqonesandbox:/data/analysis/bbb9ba7-b4e062897804/input_files# wc -l 0174.test.fastq
11996 0174.test.fastq
root@seqonesandbox:/data/analysis/bbb9ba7-b4e062897804/input_files# wc -l 0165.test.fastq
12000 0165.test.fastq
root@seqonesandbox:/data/analysis/bbb9ba7-b4e062897804/input_files# wc -l 0138.test.fastq
12000 0138.test.fastq
root@seqonesandbox:/data/analysis/bbb9ba7-b4e062897804/input_files# wc -l 0116.test.fastq
12000 0116.test.fastq
root@seqonesandbox:/data/analysis/bbb9ba7-b4e062897804/input_files# wc -l 0073.test.fastq
12000 0073.test.fastq
```
There are 12000 lines per file instead of 10000.

PS: perhaps the number of reads per file would be more useful than the number of lines (since a read is supposed to be 4 lines...)

@sfchen
Member

sfchen commented Nov 28, 2017

This is reasonable.

fastp reads data in blocks, and a block contains 1000 reads (4000 lines). fastp also writes by blocks, so the output may be a little more than 10000 lines.

Since the line limit setting is usually much greater than 1000 lines, this will not cause any problems.
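
A rough sanity check of the numbers reported above, assuming each split file is flushed in whole 4000-line blocks: the requested limit is effectively rounded up to the next block boundary.

```sh
# Round the requested line limit up to the next multiple of the 4000-line block.
limit=10000
block=4000
echo $(( (limit + block - 1) / block * block ))   # prints 12000, matching the observed file sizes
```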

@Toliman06
Author

Nice! It could be helpful to mention this in the README.
I will close this issue.
