ignore headers #45

Closed
audy opened this issue Mar 13, 2015 · 23 comments

@audy

audy commented Mar 13, 2015

Is it possible to add a flag to Pandaseq to completely ignore header format?

The reason is that I have some preprocessing upstream of Pandaseq where I need to add some info to headers (has to be first field too because bioinformatics).

@apmasell
Member

Possible, but impractical. What is the upstream processing doing?

@audy
Author

audy commented Mar 13, 2015

Labelling reads by barcode.

@apmasell
Member

The bar code information is preserved. It can be done after.

@audy
Author

audy commented Mar 13, 2015

We get the data from our sequencing core in a slightly different format. Barcodes are not in read headers; they're in a separate file.

@apmasell
Member

Can you provide samples?

@audy
Author

audy commented Mar 13, 2015

@audy
Author

audy commented Mar 13, 2015

Re #7 "Why some sequencing centres fail to do this is beyond my comprehension"

The reason is that we use custom barcodes that are incompatible with the Illumina software, so we have to demultiplex ourselves.

@apmasell
Member

It looks easier to have PANDAseq read the barcodes and attach them to the sequences as they are read than to deal with manipulated headers.

@audy
Author

audy commented Mar 13, 2015

You lost me. Pandaseq can read barcodes files? Where does it attach them to the sequence?

@apmasell
Member

Currently, PANDAseq reads and parses the headers of the input FASTQ files. The files you provided have valid headers; there is just no barcode information present (which is not a problem).

You propose that you manipulate the headers and then PANDAseq ignores the headers. This is difficult since PANDAseq has many expectations about what it can do with the headers.

I propose changing PANDAseq so it can take three files as input (forward, reverse, and index). It will then use the barcodes provided and include the barcode in the output header. You can also use -C validtag:AAAAA to select a subset of sequences (i.e., use PANDAseq to also do the demultiplexing).
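
As an illustration of the three-file idea (not PANDAseq's actual implementation), the Python sketch below walks forward, reverse, and index FASTQ streams in lockstep, takes the index read's sequence as the barcode, and optionally keeps only one barcode. The function names `fastq_records` and `attach_barcodes` are hypothetical, and the sketch assumes all three files list reads in the same order with matching IDs before the first whitespace.

```python
import io
from itertools import islice


def fastq_records(handle):
    """Yield (header, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        block = [line.rstrip("\n") for line in islice(handle, 4)]
        if len(block) < 4:
            return  # end of stream (or a truncated final record)
        header, sequence, _plus, quality = block
        yield header, sequence, quality


def attach_barcodes(forward, reverse, index, wanted=None):
    """Yield (read_id, barcode, forward_seq, reverse_seq) for each read.

    Assumption: the three streams are in the same order and share the same
    ID before the first whitespace. If `wanted` is given, only reads whose
    index sequence equals it are kept (a crude demultiplexing step).
    """
    streams = zip(fastq_records(forward), fastq_records(reverse), fastq_records(index))
    for fwd, rev, idx in streams:
        ids = {record[0].split()[0] for record in (fwd, rev, idx)}
        if len(ids) != 1:
            raise ValueError(f"records out of sync: {sorted(ids)}")
        barcode = idx[1]  # the index read's sequence is the barcode
        if wanted is None or barcode == wanted:
            yield fwd[0].split()[0], barcode, fwd[1], rev[1]


# Tiny in-memory example with made-up reads.
forward = io.StringIO("@r1 1:N:0\nACGT\n+\nIIII\n@r2 1:N:0\nTTTT\n+\nIIII\n")
reverse = io.StringIO("@r1 2:N:0\nCCCC\n+\nIIII\n@r2 2:N:0\nGGGG\n+\nIIII\n")
index = io.StringIO("@r1 3:N:0\nGATC\n+\nIIII\n@r2 3:N:0\nAAAA\n+\nIIII\n")
selected = list(attach_barcodes(forward, reverse, index, wanted="GATC"))
```

Here `selected` holds only read `@r1`, since it is the only one whose index read is `GATC`. The original read headers are never rewritten; the barcode travels alongside them.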

@audy
Author

audy commented Mar 13, 2015

I wasn't aware that Pandaseq needed information in the header for anything except verifying that I'm not doing something stupid like aligning reads from two different sequencing runs. If that's the only reason then I'd suggest adding a flag to skip it as sequencing companies change file formats all the time.

Can't I just attach the barcode sequence to the header myself? I can't find what a "good" header looks like in the docs.

I'm testing with this, where I manually added the "GATC":

pandaseq-checkid "@M02780:41:000000000-ADDJ7:1:1101:20524:1181:GATC"
@M02780:41:000000000-ADDJ7:1:1101:20524:1181:GATC
                                           ^
    BAD
    instrument = "M02780"
    run = 41
    flowcell = "000000000-ADDJ7"
    lane = 1
    tile = 1101
    x = 20524
    y = 1181
    tag = ""
    generator = CASAVA 1.7+

@apmasell
Member

That's not the correct place for it.
From the manual:

The name of the input read did not follow the known Illumina standard formats. Older versions of CASAVA produce sequences with IDs that look like HWUSI-EAS1661_9323_FC619KG:7:1:1190:15190#ATCACG/1, where the fields are instrument:lane:tile:x:y#tag/direction. Newer versions of CASAVA produce IDs that look like HWI-ST822:85:C05C3ACXX:1:1101:1171:2104 3:N:0:TAGACA, where the fields are instrument:run:flow‐cell:lane:tile:x:y direction:filtered:flags:tag. If your sequence headers do not look like either of these, either Illumina has created yet another header format or, more likely, your sequence headers have been manipulated by some upstream processing, possibly at your sequencing centre. PANDAseq needs the original Illumina probabilities; not ones manipulated by other programs. We're very picky about that. Sometimes, for mysterious reasons, the sequences lack the barcoding tag. The -B option will cause the lack of barcode to be ignored. This will obviously invalidate the use of validation modules that depend on the barcode.
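
To make the two layouts in the manual excerpt concrete, here is a minimal Python sketch that recognises them with regular expressions. It is an approximation written from the field lists quoted above, not PANDAseq's own parser; the names `OLD_ID`, `NEW_ID`, and `parse_id` are hypothetical, and treating the trailing `direction:filtered:flags:tag` group as optional is an assumption made here to accept headers that lack a barcode.

```python
import re

# Older CASAVA layout: instrument:lane:tile:x:y#tag/direction
OLD_ID = re.compile(
    r"^@?(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
    r"#(?P<tag>[^/]*)/(?P<direction>\d+)$"
)

# Newer CASAVA layout:
# instrument:run:flowcell:lane:tile:x:y direction:filtered:flags:tag
# Assumption: the part after the space is optional (no barcode present).
NEW_ID = re.compile(
    r"^@?(?P<instrument>[^:]+):(?P<run>\d+):(?P<flowcell>[^:]+):(?P<lane>\d+):"
    r"(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
    r"(?: (?P<direction>\d+):(?P<filtered>[YN]):(?P<flags>\d+):(?P<tag>\S*))?$"
)


def parse_id(header):
    """Return a dict of fields plus a 'layout' key, or None if neither layout fits."""
    text = header.strip()
    for layout, pattern in (("new", NEW_ID), ("old", OLD_ID)):
        match = pattern.match(text)
        if match:
            fields = match.groupdict()
            fields["layout"] = layout
            return fields
    return None


new_style = parse_id("HWI-ST822:85:C05C3ACXX:1:1101:1171:2104 3:N:0:TAGACA")
old_style = parse_id("HWUSI-EAS1661_9323_FC619KG:7:1:1190:15190#ATCACG/1")
appended = parse_id("@M02780:41:000000000-ADDJ7:1:1101:20524:1181:GATC")
```

Under these patterns both manual examples parse, while the header with `:GATC` appended after the y coordinate returns `None`, because in the newer layout the tag follows the space-separated `direction:filtered:flags` fields.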

I'm halfway finished with my proposal anyway.

@audy
Author

audy commented Mar 13, 2015

So can I fake the direction:filtered:flags:tag?

@audy
Author

audy commented Mar 13, 2015

Ah this is weird. The sequencing core was previously sending us files with headers like HWI-ST822:85:C05C3ACXX:1:1101:1171:2104

They just recently switched to @M02780:41:000000000-ADDJ7:1:1101:20524:1181:200

I'll ask them what kind of upstream processing they may have performed.

I hope they don't have to re-generate the FASTQ files because this usually takes them a week.

@apmasell
Member

That's the same format, just a different sequencing platform.

@apmasell
Member

I've implemented something in dc65111. There is now a -i flag where you can supply the index reads and PANDAseq will apply the barcodes. There's no need to preprocess the files.

@audy
Author

audy commented Mar 13, 2015

I will give this a try. Thanks.

@audy
Author

audy commented Mar 13, 2015

This seems to work. Thanks for the quick response.

@bbushnell

Hello,

I want to include pandaseq in a paper comparing the accuracy of overlap-based read-merging programs, but my methodology requires custom read headers. Are you willing to add a header-parsing override flag for this purpose?

@apmasell
Member

apmasell commented Apr 3, 2015

This change is not practical to make. The parsed header structure is woven through the code.

You can use the PANDAseq API to track reads directly if desired. You will need to populate a header structure, but it need not be valid. See the panda_assembler_assemble function for details. If C is bothersome, there are Vala bindings; Vala is a Java/C#-like language that can provide an object-oriented interface for working directly with PANDAseq. These bindings are used in the regression test.

@bbushnell

OK, that's unfortunate, but thanks for your explanation.

@audy
Author

audy commented Jul 1, 2015

Hello again. Would it be possible to get a new release that has the -i flag? I need to share my "Pandaseq analysis pipeline" with colleagues and could do without the extra "you must clone and compile this specific ref" step.

@apmasell
Member

apmasell commented Jul 1, 2015

I have some bugs in the queue, but I can do it after those issues are resolved. It should be under a month.
