ignore headers #45
Comments
Possible, but impractical. What is the upstream processing doing? |
Labelling reads by barcode. |
The barcode information is preserved, so the labelling can be done afterwards. |
We get the data from our sequencing core in a slightly different format. Barcodes are not in read headers; they're in a separate file. |
Can you provide samples? |
Re #7, "Why some sequencing centres fail to do this is beyond my comprehension": the reason is that we use custom barcodes that are incompatible with the Illumina software, so we have to demultiplex ourselves. |
It looks easier to have PANDAseq read the barcodes and attach them to the sequences as they are read than to deal with manipulated headers. |
You lost me. PANDAseq can read barcode files? Where does it attach them to the sequence? |
Currently, PANDAseq reads and parses the headers of the input FASTQ files. The files you provided have valid headers; there is just no barcode information present (which is not a problem). You propose that you manipulate the headers and that PANDAseq then ignores them. This is difficult, since PANDAseq has many expectations about what it can do with the headers. I propose changing PANDAseq so it can take three files as input (forward, reverse, and index). It will then use the barcodes provided and include the barcode in the output header. You can also use the |
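To illustrate the three-file idea outside of PANDAseq, here is a minimal Python sketch that walks a read file and an index file in lockstep and writes each barcode into the last header field. It is not PANDAseq's implementation. It assumes CASAVA 1.8-style headers with a space-separated comment part, and the names `read_fastq` and `attach_barcodes` are made up for this example.

```python
import io
from itertools import zip_longest


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def attach_barcodes(reads, index):
    """Pair every read with the index read at the same position and yield
    (header_with_barcode, sequence, quality).

    Assumption: headers look like '@name 1:N:0:<index>', so the barcode
    replaces whatever follows the final ':'.
    """
    for read, idx in zip_longest(read_fastq(reads), read_fastq(index)):
        if read is None or idx is None:
            raise ValueError("read and index files differ in length")
        header, seq, qual = read
        idx_header, barcode, _ = idx
        # The part before the first space identifies the cluster; it must match.
        if header.split(" ")[0] != idx_header.split(" ")[0]:
            raise ValueError(f"name mismatch: {header} vs {idx_header}")
        if " " not in header:
            raise ValueError(f"no comment part in header: {header}")
        prefix, _, _old_index = header.rpartition(":")
        yield f"{prefix}:{barcode}", seq, qual


if __name__ == "__main__":
    fwd = io.StringIO("@M1:1:FC:1:1101:100:200 1:N:0:0\nACGT\n+\nIIII\n")
    idx = io.StringIO("@M1:1:FC:1:1101:100:200 2:N:0:0\nGATC\n+\nIIII\n")
    for header, seq, qual in attach_barcodes(fwd, idx):
        print(header, seq, qual, sep="\n")
```

The name check is the main reason to do this in lockstep: if the index file is out of order or truncated, the run fails instead of silently labelling reads with the wrong barcode.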
I wasn't aware that Pandaseq needed information in the header for anything except verifying that I'm not doing something stupid like aligning reads from two different sequencing runs. If that's the only reason then I'd suggest adding a flag to skip it as sequencing companies change file formats all the time. Can't I just attach the barcode sequence to the header myself? I can't find what a "good" header looks like in the docs. I'm testing with this, where I manually added the "GATC":
|
That's not the correct place for it.
I'm halfway finished with my proposal anyway. |
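For reference, in the CASAVA 1.8-style Illumina header layout the index sequence is the final colon-separated field of the comment part, after the read number, filter flag, and control number. The Python sketch below splits such a header into named fields. The function name `parse_casava_header` is invented for this example, and treating this as the exact position PANDAseq expects is an assumption rather than something stated in this thread.

```python
def parse_casava_header(header):
    """Split a CASAVA 1.8-style FASTQ header into named fields.

    Expected shape (assumption):
    @<instrument>:<run>:<flowcell>:<lane>:<tile>:<x>:<y> <read>:<filtered>:<control>:<index>
    """
    name, _, comment = header.lstrip("@").partition(" ")
    instrument, run, flowcell, lane, tile, x, y = name.split(":")
    read, filtered, control, index = comment.split(":")
    return {
        "instrument": instrument,
        "run": run,
        "flowcell": flowcell,
        "lane": lane,
        "tile": tile,
        "x": x,
        "y": y,
        "read": read,
        "filtered": filtered,
        "control": control,
        "index": index,  # the barcode lives here, at the very end
    }


if __name__ == "__main__":
    fields = parse_casava_header("@M1:1:FC:1:1101:100:200 1:N:0:GATC")
    print(fields["index"])
```

A header with the barcode placed anywhere else, for example as an extra leading field, fails the unpacking with a `ValueError`, which is a quick way to check a hand-edited file before running a merge.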
So can I fake the |
Ah this is weird. The sequencing core was previously sending us files with headers like They just recently switched to I'll ask them what kind of upstream processing they may have performed. I hope they don't have to re-generate the FASTQ files because this usually takes them a week. |
That's the same format, just a different sequencing platform. |
I've implemented something in dc65111. There is now a |
I will give this a try. Thanks. |
This seems to work. Thanks for the quick response. |
Hello, I want to include pandaseq in a paper comparing the accuracy of overlap-based read-merging programs, but my methodology requires custom read headers. Are you willing to add a header-parsing override flag for this purpose? |
This change is not practical to make. The parsed header structure is woven through the code. You can use the PANDAseq API to track reads directly if desired. You will need to populate a header structure, but it need not be valid. See |
OK, that's unfortunate, but thanks for your explanation. |
Hello again, would it be possible to get a new release that has the |
I have some bugs in queue, but I can do it after those issues are resolved. Should be under a month. |
Is it possible to add a flag to Pandaseq to completely ignore header format?
The reason is that I have some preprocessing upstream of Pandaseq where I need to add some info to headers (has to be first field too because bioinformatics).