Switch single&paired API to single&multiple API. #82

rhpvorderman · 2022-06-01T08:32:21Z

Currently I am working a lot with UMI data that is stored in a separate FASTQ file, meaning I now have 3 files.

I needed to filter those files on average error rate, so I adapted the fastq-filter program to work with multiple files.

To keep the pipeline simple, I opted for a multiple-file reader. This yields 1-tuples for 1 file, 2-tuples for 2 files, 3-tuples for 3 files, etc.
This way I can write the filters to always handle a tuple of SequenceRecord objects and use the same filter in all cases.
Similarly, I wrote a multiple-file writer.
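A minimal sketch of this idea, not the fastq-filter or dnaio implementation. `Record`, `read_fastq`, `read_multiple` and `mean_error_rate` are hypothetical names made up for illustration, and Phred+33 quality encoding is assumed. It shows one reader yielding n-tuples and one filter that works for any n:

```python
import io
from itertools import islice, zip_longest
from typing import Iterator, NamedTuple, TextIO, Tuple


class Record(NamedTuple):
    # Hypothetical stand-in for dnaio's SequenceRecord.
    name: str
    sequence: str
    qualities: str


def read_fastq(handle: TextIO) -> Iterator[Record]:
    """Parse four-line FASTQ records from an open text handle."""
    while True:
        lines = [line.rstrip("\n") for line in islice(handle, 4)]
        if not lines:
            return
        if len(lines) < 4:
            raise ValueError("truncated FASTQ record")
        header, sequence, _, qualities = lines
        yield Record(header[1:], sequence, qualities)


def read_multiple(*handles: TextIO) -> Iterator[Tuple[Record, ...]]:
    """Yield one n-tuple of records per position across n synchronized files."""
    readers = [read_fastq(handle) for handle in handles]
    for records in zip_longest(*readers):
        # zip_longest pads exhausted readers with None, which exposes
        # files that hold different numbers of records.
        if any(record is None for record in records):
            raise ValueError("input files have different numbers of records")
        yield records


def mean_error_rate(records: Tuple[Record, ...]) -> float:
    """Average per-base error probability over every record in the tuple."""
    # Assumes Phred+33: error probability = 10 ** (-Q / 10).
    errors = [10 ** (-(ord(q) - 33) / 10) for r in records for q in r.qualities]
    return sum(errors) / len(errors) if errors else 0.0
```

The same filter call then serves one, two or three files:

```python
r1 = io.StringIO("@read1\nACGT\n+\nIIII\n")
r2 = io.StringIO("@read1\nTTGA\n+\nIIII\n")
umi = io.StringIO("@read1\nGGCA\n+\nIIII\n")
kept = [recs for recs in read_multiple(r1, r2, umi) if mean_error_rate(recs) < 0.01]
```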

I am wondering if we should do this in dnaio too. There are now two cases in dnaio:

Single file. Yield one SequenceRecord object.
Paired file. Yield a 2-tuple of SequenceRecord objects.

I propose replacing the latter with a multiple-file reader that reads from n files and yields n-tuples of SequenceRecord objects. The PairedEndReader and PairedEndWriter interfaces can still be maintained, but these can simply inherit from the multiple-file classes and provide a backwards-compatible interface. (This shouldn't be too hard, given that it is just the 2-case of the MultipleReader.)
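The inheritance idea could look roughly like the sketch below. These are toy classes for illustration only, not dnaio's actual `PairedEndReader`. Plain iterables of strings stand in for opened FASTQ files, and the constructor signatures are assumptions:

```python
from typing import Iterable, Iterator, Tuple


class MultipleReader:
    """Hypothetical n-input reader: yields n-tuples, one item per input."""

    def __init__(self, *sources: Iterable[str]) -> None:
        if not sources:
            raise ValueError("at least one input is required")
        self._sources = sources

    def __iter__(self) -> Iterator[Tuple[str, ...]]:
        # zip stops at the shortest input; a real reader would instead
        # report an error when the inputs are out of sync.
        return zip(*self._sources)


class PairedEndReader(MultipleReader):
    """The 2-case: same iteration logic, but exactly two inputs."""

    def __init__(self, file1: Iterable[str], file2: Iterable[str]) -> None:
        super().__init__(file1, file2)


for r1, r2 in PairedEndReader(["a1", "a2"], ["b1", "b2"]):
    print(r1, r2)
```

Existing code that unpacks 2-tuples keeps working, because the subclass only narrows the constructor.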

This way I do not have to reinvent the wheel across multiple projects. I also feel this is needed for cutadapt, which needs a sort of auxiliary file option, where the auxiliary file with the UMIs is kept in sync with the FASTQ files that are output from cutadapt. Currently I have to use biopet-fastqsync to sync the UMI FASTQ file afterwards. (This is not the correct place to raise that issue; I mention it here only to show why I think this will be a good move for the future.)

I have already implemented a multiple reader in my FASTQ filter project. At first it was written in a generic manner (everything is a list of multiple files), but I discovered that this severely harms the single-end and paired-end cases: LUMC/fastq-filter#16. I wonder what the best way to implement this in dnaio is. Alternatively, there could be separate 1-tuple, 2-tuple and n-tuple readers that all share the same interface through abstract classes.


marcelm · 2022-06-03T09:13:57Z

Generalizing the paired-end reader to multiple files sounds like a good idea. I think I’d implement this by accepting more than two input files in dnaio.open and then the function would work as before for n=1 and n=2 (so totally backwards compatible for the single end and paired-end cases). Then for n>2, it would return this new MultipleReader (not sure whether that is the best name, though). Is that what you meant?
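As a rough illustration of that dispatch, and not of how `dnaio.open` is actually written, here is a sketch with a hypothetical `open_sketch` function. Plain iterables stand in for input files:

```python
from typing import Iterable, Iterator


def open_sketch(*sources: Iterable[str]) -> Iterator:
    """Hypothetical dispatcher: the return shape depends on the input count."""
    if not sources:
        raise ValueError("at least one input is required")
    if len(sources) == 1:
        # n = 1: behave as before and yield bare records, not 1-tuples.
        return iter(sources[0])
    # n = 2 yields 2-tuples as the paired-end reader does today;
    # n > 2 is the new case and yields n-tuples.
    return zip(*sources)


print(list(open_sketch(["r1", "r2"])))
print(list(open_sketch(["a1"], ["b1"], ["umi1"])))
```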

This would indeed be a requirement for supporting records with more than two "ends" in Cutadapt.

rhpvorderman · 2022-06-03T10:16:43Z

Yes, that is what I meant. Generalizing dnaio.open does indeed seem the best path. MultipleReader is not intended to be the final name, but I am struggling to think of a better one.
One issue is that the current naming, "PairedEnd", does not really apply to N FASTQ files. "NEndReader" is not going to win the hearts and minds of anyone, I am afraid. Oh well, I am sure a better name will pop up in our minds at some point.

rhpvorderman changed the title ~~Switch single | paired API to single | multiple API.~~ Switch single&paired API to single&multiple API. Jun 1, 2022

marcelm mentioned this issue Aug 5, 2022

Multiple reader/writer api #87

Merged

rhpvorderman closed this as completed Oct 26, 2022
