# Demultiplex a Raw FASTQ

In [None]:
source demux2015_config.sh
mkdir -p $DEMUX

# Raw Data

## Description
This data is from an experiment testing the transcriptional response of *Escherichia coli* to growth in high-pH media. Samples were sequenced with 101 bp paired-end reads.

The samples were sequenced in 3 pools:

* dryrun
* groups_EG
* groups_KNP

We will just work with the *dryrun* data.

In [None]:
ls $CURDATA

### Using fastq-multx to demultiplex
`fastq-multx` comes from ea-utils. It is (somewhat) straightforward to use. We can get usage information by running `fastq-multx` with no arguments.

In [None]:
fastq-multx

Since we have barcodes in a separate file, we are constrained in how we run it. Here are the command line arguments that we will be using:

* -B BARCODE_FILE : a list of known barcodes, and the associated sample names
* -o OUTPUT_FILE(s) : fastq-multx will produce a separate file for each barcode (two files when paired-end reads are input). This option provides a template for naming the output files: the program will replace the "%" with the barcode name from the barcode file.
* -m : number of mismatches to allow in the barcode
* -d : minimum edit distance between the best and next best match
* -x : don't trim barcodes
* I1_FASTQ : the index read FASTQ, which will be used to demultiplex other reads
* R1_FASTQ : the R1 raw data to demultiplex
* R2_FASTQ : (optional) if data is paired-end, the R2 raw data to demultiplex
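
To make the `-o` template concrete, here is a minimal sketch of the "%" substitution using `sed`. It only imitates the naming rule described above and does not call `fastq-multx`; the names `sampleA` and `sampleB` are made up for illustration and are not taken from the real barcode file.

```shell
# Imitate how a "%" template expands into one file name per sample.
# sampleA and sampleB are hypothetical names, not from dryrun_barcodes.tab.
template='r1.%.fq.gz'
expanded=$(for sample_id in sampleA sampleB; do
    printf '%s\n' "$template" | sed "s/%/${sample_id}/"
done)
echo "$expanded"
```

Each name in the barcode file would yield one such file name per template given to `-o`.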

You already know what is in the FASTQ file, but the barcode file is new. Let's take a look . . .

In [None]:
cat ${CURDATA}/dryrun_barcodes.tab
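
With the barcodes in view, the `-m` and `-d` options become easier to picture. The sketch below counts the positions at which two equal-length barcodes differ. It is an illustration of the idea of a mismatch, not the matching code that `fastq-multx` itself uses, and the function name `mismatches` and both barcode sequences are hypothetical.

```shell
# Count mismatching positions between two equal-length barcodes.
# Illustration only: this is not fastq-multx's own matching code.
mismatches() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        if (length(a) != length(b)) { print "length mismatch"; exit 1 }
        d = 0
        for (i = 1; i <= length(a); i++)
            if (substr(a, i, 1) != substr(b, i, 1)) d++
        print d
    }'
}

# Hypothetical 6-base barcodes; the real ones are in dryrun_barcodes.tab.
mismatches ACGTAC ACGTTC   # differs at a single position
mismatches ACGTAC TGCATG   # differs at every position
```

Under this reading, `-m1` would accept an index read like the first pair, and `-d1` would additionally require that no other barcode matches the read equally well.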

OK, now we are ready to run the demuxing . . .

In [None]:
fastq-multx -m1 -d1 -x -B ${CURDATA}/dryrun_barcodes.tab \
    ${CURDATA}/dryrun_combined_I1.fastq.gz \
    ${CURDATA}/dryrun_combined_R1.fastq.gz \
    ${CURDATA}/dryrun_combined_R2.fastq.gz \
    -o ${DEMUX}/i1.%.fq.gz ${DEMUX}/r1.%.fq.gz ${DEMUX}/r2.%.fq.gz > ${DEMUX}/dryrun_demux.stdout

### Results
We redirected STDOUT to `${DEMUX}/dryrun_demux.stdout`, so we can look at that file to get some run statistics.  

Note that, while an error message such as `gzip: stdout: Broken pipe` would normally be reason for concern, we can safely ignore it in this case; it appears to be a side effect of running the command within Jupyter.
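
The `>` redirect captures only STDOUT. Messages that a program writes to STDERR, which is where error messages normally go, still appear in the notebook. A minimal sketch of that behavior, using a stand-in command group instead of `fastq-multx` and a hypothetical file name `report.txt` in a scratch directory:

```shell
# Stand-in for a program that writes results to STDOUT and messages to STDERR.
# report.txt is a hypothetical file name in a scratch directory.
workdir=$(mktemp -d)
{
    echo "run statistics"          # STDOUT: captured by the redirect
    echo "warning message" >&2     # STDERR: still shown on screen
} > "${workdir}/report.txt"

# Only the STDOUT line ends up in the file.
cat "${workdir}/report.txt"
```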

Let's take a look at the output . . .

In [None]:
cat ${DEMUX}/dryrun_demux.stdout

For each barcode in the barcode file, it tells us:

1. Id: the barcode name (as given in the barcode file)
2. Count: the number of reads corresponding to that barcode
3. File(s): the demultiplexed files generated for this barcode

Unfortunately, the output is hard to read because it shows the full path of each file. This is really our fault, because we used full paths when we ran the command. There are a couple of ways to make this file easier to view.

We can use the `cut` program to show only the first N characters of each line (here, the first 50).

In [None]:
cut -c1-50 ${DEMUX}/dryrun_demux.stdout

This lets us see the Id and count information, but the file names are truncated. Alternatively, we can use the `sed` program to strip the leading directory path from the file names.

In [None]:
line_old="${DEMUX}/"
line_new=''
sed "s%$line_old%$line_new%g" ${DEMUX}/dryrun_demux.stdout

Here we can see that three files are generated for each barcode: the "i1" file, which contains the index reads corresponding to the barcode; the "r1" file, which contains the first reads; and the "r2" file, which contains the second reads. We can confirm this by looking at the files in our "DEMUX" directory . . .

In [None]:
ls ${DEMUX}
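
One quick sanity check on paired-end output is that the "r1" and "r2" files for a sample hold the same number of reads. The sketch below shows the idea on two tiny, made-up gzipped FASTQ files in a scratch directory, assuming the usual four-line FASTQ records. The helper name `count_reads` and the file contents are hypothetical; for the real data the same function would be pointed at files in `${DEMUX}`.

```shell
# Build two tiny, hypothetical gzipped FASTQ files (2 reads each) in a
# scratch directory; the real files live in ${DEMUX}.
scratch=$(mktemp -d)
for mate in r1 r2; do
    printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' \
        | gzip -c > "${scratch}/${mate}.sampleA.fq.gz"
done

# Assuming four lines per FASTQ record, reads = lines / 4.
count_reads() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}

r1_reads=$(count_reads "${scratch}/r1.sampleA.fq.gz")
r2_reads=$(count_reads "${scratch}/r2.sampleA.fq.gz")
echo "r1: ${r1_reads}  r2: ${r2_reads}"
```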

Note that neither the `cut` nor the `sed` command that we used alters the original file. No changes are saved; the results of the commands are just displayed. See, the original is still the same . . .

In [None]:
cat ${DEMUX}/dryrun_demux.stdout
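
If a cleaned-up copy is worth keeping, the output of `sed` can be redirected to a new file, which still leaves the original untouched. A minimal sketch on a made-up one-line stats file in a scratch directory (the file names, the path `/long/path/`, and the numbers are all hypothetical):

```shell
# Hypothetical stand-in for the stats file, written to a scratch directory.
scratch=$(mktemp -d)
printf 'sampleA\t120\t/long/path/r1.sampleA.fq.gz\n' > "${scratch}/demux.stdout"
cp "${scratch}/demux.stdout" "${scratch}/before.copy"

# sed prints its result; redirecting that output saves a cleaned copy.
sed 's%/long/path/%%g' "${scratch}/demux.stdout" > "${scratch}/demux.short.txt"

# cmp -s exits 0 when the two files are identical.
cmp -s "${scratch}/demux.stdout" "${scratch}/before.copy" && echo "original unchanged"
cat "${scratch}/demux.short.txt"
```

Writing to a separate file rather than editing in place keeps the raw run statistics available if they are needed later.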