Switch single&paired API to single&multiple API. #82

Closed
rhpvorderman opened this issue Jun 1, 2022 · 2 comments

Comments

@rhpvorderman
Collaborator

Currently I am working a lot with UMI data that is stored in a separate FASTQ file, meaning I now have 3 files.

I needed to filter those files on average error rate, so I adapted the fastq-filter program to work with multiple files.

To keep the pipeline simple, I opted for a multiple-file reader. This yields 1-tuples for 1 file, 2-tuples for 2 files, 3-tuples for 3 files, etc.
This way I can write the filters to always handle a tuple of SequenceRecord objects and use the same filter in all cases.
Similarly I wrote a multiple writer.
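
To make the idea concrete, here is a minimal sketch of such a reader and a tuple-based filter. It is not the fastq-filter or dnaio code: `Record`, `read_fastq`, `read_multiple`, `mean_error_rate` and `passes` are hypothetical names, the parser only handles plain four-line FASTQ, and Phred+33 qualities are assumed.

```python
import io
from typing import Iterator, NamedTuple, TextIO, Tuple


class Record(NamedTuple):
    # Hypothetical stand-in for dnaio's SequenceRecord.
    name: str
    sequence: str
    qualities: str


def read_fastq(handle: TextIO) -> Iterator[Record]:
    """Parse simple four-line FASTQ records from a text handle."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the "+" separator line
        qualities = handle.readline().rstrip("\n")
        yield Record(header[1:], sequence, qualities)


def read_multiple(*handles: TextIO) -> Iterator[Tuple[Record, ...]]:
    """Yield one n-tuple of records per position across n files."""
    # zip() stops at the shortest input; a real reader should raise instead.
    return zip(*(read_fastq(h) for h in handles))


def mean_error_rate(records: Tuple[Record, ...]) -> float:
    """Average per-base error probability over all records in the tuple."""
    probs = [10 ** (-(ord(q) - 33) / 10) for r in records for q in r.qualities]
    return sum(probs) / len(probs)


def passes(records: Tuple[Record, ...], max_error_rate: float = 0.01) -> bool:
    """The same filter works for 1-tuples, 2-tuples and n-tuples."""
    return mean_error_rate(records) <= max_error_rate
```

Because the filter only ever sees a tuple, adding a third (UMI) file does not change the filter code, only the number of handles passed to the reader.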

I am wondering if we should do this in dnaio too. There are now two cases in dnaio:

  1. Single file. Yield one SequenceRecord object.
  2. Paired file. Yield a 2-tuple of SequenceRecord objects.

I propose replacing the latter with a multiple-file reader that reads from n files and yields n-tuples of SequenceRecord objects. The PairedEndReader and PairedEndWriter interfaces can still be maintained, but these can simply inherit from the multiple reader and writer and provide a backwards-compatible interface. (This shouldn't be too hard, given that it is just the 2-case of the MultipleReader.)
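
A rough sketch of that inheritance relationship, assuming the readers wrap arbitrary iterables of records rather than files. The class bodies are illustrative only and do not reflect dnaio's actual implementation; `MultipleReader` is the placeholder name from this issue.

```python
from typing import Iterable, Iterator, Tuple


class MultipleReader:
    """Hypothetical n-input reader: walks n record streams in lockstep."""

    def __init__(self, *streams: Iterable):
        if not streams:
            raise ValueError("at least one input is required")
        self._streams = streams

    def __iter__(self) -> Iterator[Tuple]:
        iterators = [iter(s) for s in self._streams]
        sentinel = object()
        while True:
            records = tuple(next(it, sentinel) for it in iterators)
            finished = [r is sentinel for r in records]
            if all(finished):
                return
            if any(finished):
                # Inputs out of sync: one stream ended before the others.
                raise ValueError("inputs contain different numbers of records")
            yield records


class PairedEndReader(MultipleReader):
    """Backwards-compatible 2-case: same iteration, fixed at two inputs."""

    def __init__(self, stream1: Iterable, stream2: Iterable):
        super().__init__(stream1, stream2)
```

Existing code that iterates over a paired reader and unpacks `(r1, r2)` would keep working, since the subclass still yields 2-tuples.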

This way I do not have to reinvent the wheel across multiple projects. I also feel this is needed for cutadapt, which needs a sort of auxiliary-file option where the auxiliary file with the UMIs is kept in sync with the FASTQ files that cutadapt outputs. Currently I have to use biopet-fastqsync to sync the UMI FASTQ file afterwards. (This is not the correct place to raise that issue; I mention it only to show why I think this will be a good move for the future.)

I have already implemented a multiple reader in my FASTQ filter project. At first it was written in a generic manner (everything is a list of multiple files), but I discovered that this severely harms the single-end and paired-end cases: LUMC/fastq-filter#16. I wonder what the best way to implement this in dnaio is. Alternatively, there could be separate 1-tuple, 2-tuple and n-tuple readers that all share the same interface through abstract classes.

@rhpvorderman changed the title from "Switch single | paired API to single | multiple API." to "Switch single&paired API to single&multiple API." on Jun 1, 2022
@marcelm
Owner

marcelm commented Jun 3, 2022

Generalizing the paired-end reader to multiple files sounds like a good idea. I think I’d implement this by accepting more than two input files in dnaio.open; the function would then work as before for n=1 and n=2 (so it stays totally backwards compatible for the single-end and paired-end cases). For n>2, it would return this new MultipleReader (not sure whether that is the best name, though). Is that what you meant?
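
Roughly, the dispatch could look like the sketch below. This is not dnaio.open itself: `open_records` and the three toy reader classes are hypothetical and take in-memory iterables instead of file paths, just to show the n=1, n=2 and n>2 branches.

```python
class SingleReader:
    """n=1: yields bare records, as the single-file case does today."""

    def __init__(self, stream):
        self._stream = stream

    def __iter__(self):
        return iter(self._stream)


class PairedReader:
    """n=2: yields 2-tuples, as the paired-end case does today."""

    def __init__(self, stream1, stream2):
        self._streams = (stream1, stream2)

    def __iter__(self):
        return zip(*self._streams)


class MultipleReader:
    """n>2: yields n-tuples (placeholder name from this issue)."""

    def __init__(self, *streams):
        self._streams = streams

    def __iter__(self):
        return zip(*self._streams)


def open_records(*streams):
    """Hypothetical dispatcher: pick a reader based on the number of inputs."""
    if not streams:
        raise ValueError("at least one input is required")
    if len(streams) == 1:
        return SingleReader(streams[0])
    if len(streams) == 2:
        return PairedReader(*streams)
    return MultipleReader(*streams)
```

With this shape, callers passing one or two inputs get exactly the objects they got before, and only callers passing three or more see the new reader.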

This would indeed be a requirement for supporting records with more than two "ends" in Cutadapt.

@rhpvorderman
Collaborator Author

Yes, that is what I meant. Generalizing dnaio.open does indeed seem the best path. MultipleReader is not intended to be the final name, but I am struggling to think of a better one.
One issue is that the current "PairedEnd" naming does not really apply to N FASTQ files. "NEndReader" is not going to win anyone's heart or mind, I am afraid. Oh well, I am sure a better name will occur to us at some point.


2 participants