ignore headers #45

audy · 2015-03-13T15:16:28Z

Is it possible to add a flag to Pandaseq to completely ignore header format?

The reason is that I have some preprocessing upstream of Pandaseq where I need to add some info to headers (has to be first field too because bioinformatics).

apmasell · 2015-03-13T15:29:26Z

Possible, but impractical. What is the upstream processing doing?

audy · 2015-03-13T15:39:59Z

Labelling reads by barcode.

apmasell · 2015-03-13T15:41:01Z

The barcode information is preserved, so the labelling can be done afterwards.

audy · 2015-03-13T15:41:43Z

We get the data from our sequencing core in a slightly different format. Barcodes are not in read headers; they're in a separate file.

apmasell · 2015-03-13T15:42:08Z

Can you provide samples?

audy · 2015-03-13T15:43:47Z

Sure: https://www.dropbox.com/s/cozbtkxbn3qymez/fastq-sample.tar.gz?dl=0

audy · 2015-03-13T15:49:54Z

Re #7 "Why some sequencing centres fail to do this is beyond my comprehension"

The reason is that we use custom barcodes that are incompatible with the Illumina software, so we have to demultiplex ourselves.

apmasell · 2015-03-13T15:51:43Z

It looks easier to have PANDAseq read the barcodes and attach them to the sequences as they are read than to deal with manipulated headers.

audy · 2015-03-13T15:52:43Z

You lost me. Pandaseq can read barcode files? Where does it attach them to the sequence?

apmasell · 2015-03-13T15:56:15Z

Currently, PANDAseq reads and parses the headers of the input FASTQ files. The files you provided have valid headers; there is just no barcode information present (which is not a problem).

You propose that you manipulate the headers and then PANDAseq ignores the headers. This is difficult since PANDAseq has many expectations about what it can do with the headers.

I propose changing PANDAseq so it can take three files as input (forward, reverse, and index). It will then use the barcodes provided and include the barcode in the output header. You can also use the -C validtag:AAAAA option to select a subset of sequences (i.e., use PANDAseq to also do the demultiplexing).
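
To make the idea concrete, here is a rough Python sketch of what pairing reads with a separate index file and selecting by barcode amounts to. It is not PANDAseq's implementation; the names `read_fastq`, `read_id`, and `select_by_barcode` are hypothetical, and it assumes the index reads carry the same read ID (the header text before the first space) as the forward reads.

```python
from typing import Dict, Iterable, Iterator, Set, Tuple

FastqRecord = Tuple[str, str, str]  # (header, sequence, qualities)


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield (header, sequence, qualities) from 4-line FASTQ records."""
    it = iter(lines)
    # zip over the same iterator four times groups the lines in fours;
    # a truncated final record is silently dropped.
    for header, seq, _plus, qual in zip(it, it, it, it):
        yield header.rstrip("\n"), seq.rstrip("\n"), qual.rstrip("\n")


def read_id(header: str) -> str:
    """Header text before the first space, without the leading '@'."""
    return header.lstrip("@").split(" ", 1)[0]


def select_by_barcode(
    reads: Iterable[FastqRecord],
    index_reads: Iterable[FastqRecord],
    wanted: Set[str],
) -> Iterator[Tuple[str, FastqRecord]]:
    """Yield (barcode, record) for reads whose index read is in `wanted`."""
    # Assumption: the sequence of each index read is the barcode itself.
    barcodes: Dict[str, str] = {read_id(h): seq for h, seq, _ in index_reads}
    for record in reads:
        barcode = barcodes.get(read_id(record[0]))
        if barcode in wanted:
            yield barcode, record
```

With file handles, this could be called as `select_by_barcode(read_fastq(fwd), read_fastq(idx), {"GATC"})`. The dictionary keeps the sketch independent of record order, at the cost of holding every barcode in memory.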

audy · 2015-03-13T15:59:24Z

I wasn't aware that Pandaseq needed information in the header for anything except verifying that I'm not doing something stupid like aligning reads from two different sequencing runs. If that's the only reason then I'd suggest adding a flag to skip it as sequencing companies change file formats all the time.

Can't I just attach the barcode sequence to the header myself? I can't find what a "good" header looks like in the docs.

I'm testing with this, where I manually added the "GATC":

pandaseq-checkid "@M02780:41:000000000-ADDJ7:1:1101:20524:1181:GATC"
@M02780:41:000000000-ADDJ7:1:1101:20524:1181:GATC
                                           ^
    BAD
    instrument = "M02780"
    run = 41
    flowcell = "000000000-ADDJ7"
    lane = 1
    tile = 1101
    x = 20524
    y = 1181
    tag = ""
    generator = CASAVA 1.7+

apmasell · 2015-03-13T16:02:29Z

That's not the correct place for it.
From the manual:

The name of the input read did not follow the known Illumina standard formats. Older versions of CASAVA produce sequences with IDs that look like HWUSI-EAS1661_9323_FC619KG:7:1:1190:15190#ATCACG/1, where the fields are instrument:lane:tile:x:y#tag/direction. Newer version of CASAVA produce IDs that look like HWI-ST822:85:C05C3ACXX:1:1101:1171:2104 3:N:0:TAGACA, where the fields are instrument:run:flow‐cell:lane:tile:x:y direction:filtered:flags:tag. If your sequence headers do not look like either of these, either Illumina has created yet-another header format or, more likely, your sequence headers have been manipulated by some upstream processing, possibly at your sequencing centre. PANDAseq needs the original Illumina probabilities; not ones manipulated by other programs. We're very picky about that. Sometimes, for mysterious reasons, the sequences lack the barcoding tag. The -B option will cause the lack of barcode to be ignored. This will obviously invalidate the use of validation modules that depend on the barcode.
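
For illustration only, the two layouts described in that passage can be written as regular expressions. This is a simplified Python sketch, not PANDAseq's parser: the function name `parse_header` and the exact patterns are assumptions, and it is stricter than the real tool (for example, it does not accept a newer-style header that lacks the second, space-separated part).

```python
import re
from typing import Dict, Optional

# Older CASAVA: instrument:lane:tile:x:y#tag/direction
OLD_HEADER = re.compile(
    r"^@?(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+)"
    r":(?P<x>\d+):(?P<y>\d+)#(?P<tag>[^/]*)/(?P<direction>\d)$"
)

# Newer CASAVA:
# instrument:run:flowcell:lane:tile:x:y direction:filtered:flags:tag
NEW_HEADER = re.compile(
    r"^@?(?P<instrument>[^:]+):(?P<run>\d+):(?P<flowcell>[^:]+)"
    r":(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
    r" (?P<direction>\d):(?P<filtered>[YN]):(?P<flags>\d+):(?P<tag>\S*)$"
)


def parse_header(header: str) -> Optional[Dict[str, str]]:
    """Return the named fields of a header, or None if neither layout fits."""
    text = header.strip()
    for name, pattern in (("old", OLD_HEADER), ("new", NEW_HEADER)):
        match = pattern.match(text)
        if match:
            # All field values are returned as strings.
            return {"format": name, **match.groupdict()}
    return None
```

Under these patterns, a header with the barcode appended as an extra colon-separated field, such as `@M02780:41:000000000-ADDJ7:1:1101:20524:1181:GATC`, matches neither layout, because the tag belongs after a space in the newer format.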

I'm halfway finished with my proposal anyway.

audy · 2015-03-13T16:03:42Z

So can I fake the direction:filtered:flags:tag?

audy · 2015-03-13T16:07:33Z

Ah this is weird. The sequencing core was previously sending us files with headers like HWI-ST822:85:C05C3ACXX:1:1101:1171:2104

They just recently switched to @M02780:41:000000000-ADDJ7:1:1101:20524:1181:200

I'll ask them what kind of upstream processing they may have performed.

I hope they don't have to re-generate the FASTQ files because this usually takes them a week.

apmasell · 2015-03-13T16:08:14Z

That's the same format, just a different sequencing platform.

apmasell · 2015-03-13T16:30:06Z

I've implemented something in dc65111. There is now a -i flag where you can supply the index reads and PANDAseq will apply the barcodes. There's no need to preprocess the files.
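
As a hedged sketch of how a pipeline script might assemble such a call: the helper name `pandaseq_command` and the file names are hypothetical, `-i` and `-C validtag:...` are taken from this thread, and `-f`, `-r`, and `-w` are assumed to be the usual forward, reverse, and output options. Check `pandaseq -h` for the build you are using before relying on it.

```python
import shlex
from typing import List, Optional


def pandaseq_command(
    forward: str,
    reverse: str,
    index: str,
    valid_tag: Optional[str] = None,
    output: Optional[str] = None,
) -> List[str]:
    """Build an argument list for a PANDAseq run with separate index reads."""
    cmd = ["pandaseq", "-f", forward, "-r", reverse, "-i", index]
    if valid_tag is not None:
        # Keep only sequences carrying this barcode (demultiplexing).
        cmd += ["-C", f"validtag:{valid_tag}"]
    if output is not None:
        # Assumed output-file option; without it, output goes to stdout.
        cmd += ["-w", output]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it; pass the list to
    # subprocess.run(cmd, check=True) to actually execute it.
    cmd = pandaseq_command("fwd.fastq", "rev.fastq", "idx.fastq", valid_tag="GATC")
    print(shlex.join(cmd))
```

Building a list rather than a shell string avoids quoting problems when file names contain spaces.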

audy · 2015-03-13T16:49:06Z

I will give this a try. Thanks.

audy · 2015-03-13T20:09:25Z

This seems to work. Thanks for the quick response.

bbushnell · 2015-04-03T02:01:34Z

Hello,

I want to include pandaseq in a paper comparing the accuracy of overlap-based read-merging programs, but my methodology requires custom read headers. Are you willing to add a header-parsing override flag for this purpose?

apmasell · 2015-04-03T02:19:49Z

This change is not practical to make. The parsed header structure is woven through the code.

You can use the PANDAseq API to track reads directly if desired. You will need to populate a header structure, but it need not be valid. See the panda_assembler_assemble function for details. If C is bothersome, there are bindings for Vala, a Java/C#-like language, which provide an object-oriented interface for working directly with PANDAseq. These are used in the regression test.

bbushnell · 2015-04-03T02:24:35Z

OK, that's unfortunate, but thanks for your explanation.

audy · 2015-07-01T15:43:07Z

Hello again. Would it be possible to get a new release that has the -i flag? I need to share my "Pandaseq analysis pipeline" with colleagues and could do without the extra "you must clone and compile this specific ref" step.

apmasell · 2015-07-01T15:47:13Z

I have some bugs in the queue, but I can do it after those issues are resolved. It should be under a month.

apmasell closed this as completed Mar 13, 2015
