# Getting started
To be sure that we don't fill up our VMs with the full datasets, we will work on our scratch drives.

In [None]:
mkdir -p ~/scratch/analysis

In [None]:
cd ~/scratch/analysis

# Your Data
All of the data is in `/mnt/nfs/ngsworkshop/colab-sbx-16/raw_data/`. The sequencing is being done in pools of 18 (pool 1) or 12 (pool 2) samples. To get approximately similar coverage of all samples, pool 1 is being sequenced three times and pool 2 is being sequenced twice.

In [None]:
ls /mnt/nfs/ngsworkshop/colab-sbx-16/raw_data

Let's make a link to the raw data to save ourselves some typing:

In [None]:
%%bash
ln -s  /mnt/nfs/ngsworkshop/colab-sbx-16/raw_data/ ~/

## Concatenating Run Files
If you are interested in run-specific batch effects, you would want to treat samples from different sequencing runs as separate technical replicates and compare them. Otherwise it is simpler to combine the comparable files from different runs (i.e. R1 files together, R2 files together, and I1 files together). A simple way to do this is with `zcat`, `gzip`, a pipe, and a redirect:

`zcat pool1_run1_R1.fastq.gz pool1_run2_R1.fastq.gz | gzip > pool1_R1.fastq.gz`

Just be absolutely certain that you do not switch the order between reads (i.e. run1, then run2 for R1; run2, then run1 for R2): remember that reads need to be kept in the same order across files.  
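
One way to convince yourself that the order was preserved is to compare the read IDs of the combined R1 and R2 files. The sketch below is only an illustration: it builds two tiny made-up runs in a temporary directory (all file and read names are hypothetical), concatenates them in the same order for both reads, and checks that the IDs line up. It uses `gzip -dc`, which does the same job as `zcat` on Linux.

```shell
# Toy demonstration: all file and read names here are made up.
tmp=$(mktemp -d)

# Write a one-read gzipped FASTQ file. Usage: make_fastq FILE HEADER
make_fastq () {
    printf '@%s\nACGT\n+\nIIII\n' "$2" | gzip > "$1"
}

for RUN in 1 2; do
    make_fastq "$tmp/run${RUN}_R1.fastq.gz" "read${RUN} 1:N:0"
    make_fastq "$tmp/run${RUN}_R2.fastq.gz" "read${RUN} 2:N:0"
done

# Concatenate run1 then run2 for BOTH reads
for READ in R1 R2; do
    gzip -dc "$tmp/run1_${READ}.fastq.gz" "$tmp/run2_${READ}.fastq.gz" | \
        gzip > "$tmp/combined_${READ}.fastq.gz"
done

# Print read IDs: the header is line 1 of each 4-line record,
# and the ID is its first whitespace-separated field
read_ids () {
    gzip -dc "$1" | awk 'NR % 4 == 1 {print $1}'
}

if [ "$(read_ids "$tmp/combined_R1.fastq.gz")" = "$(read_ids "$tmp/combined_R2.fastq.gz")" ]; then
    order_check="same order"
else
    order_check="ORDER MISMATCH"
fi
echo "$order_check"
```

On real data the same `read_ids` comparison could be pointed at the combined R1, R2, and I1 files, although it will take a while on full-size files.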

Details on how to do this are below.  First, let's make a directory for the combined files.

In [None]:
mkdir -p ~/raw_data/combined

Now let's figure out which files we want to combine. We will work with Pool 1 first.

In [None]:
%%bash
ls ~/raw_data/groups_KNP_run?/????1_S1_L001_??_001.fastq.gz

Remember that we will want to combine the R1 files into one, the R2 files into one, and the I1 files into one, so we will use a for loop to do each group of files.

In [None]:
%%bash
for READ in I1 R1 R2 
    do
        zcat ~/raw_data/groups_KNP_run?/????1_S1_L001_${READ}_001.fastq.gz | \
            gzip > ~/raw_data/combined/groups_KNP_combined_${READ}.fastq.gz
    done

Next we will work with Pool 2.

In [None]:
ls ~/raw_data/groups_EG_run?/POOL2_S1_L001_I1_001.fastq.gz

In [None]:
%%bash
for READ in I1 R1 R2 
    do
        zcat ~/raw_data/groups_EG_run?/POOL2_S1_L001_${READ}_001.fastq.gz | \
            gzip > ~/raw_data/combined/groups_EG_combined_${READ}.fastq.gz
    done
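
After combining, it is worth checking that no reads were lost: the combined file should hold as many reads as the per-run files put together. Here is a minimal sketch of that check on toy files (hypothetical names, created in a temporary directory). It assumes the usual four lines per FASTQ record.

```shell
tmp=$(mktemp -d)

# Count reads in one or more gzipped FASTQ files (4 lines per read)
count_reads () {
    gzip -dc "$@" | awk 'END {print NR / 4}'
}

# Toy inputs: run1 has 2 reads, run2 has 1 read
printf '@a\nACGT\n+\nIIII\n@b\nTTTT\n+\nIIII\n' | gzip > "$tmp/run1_R1.fastq.gz"
printf '@c\nGGGG\n+\nIIII\n' | gzip > "$tmp/run2_R1.fastq.gz"

# Combine the same way as in the loops above
gzip -dc "$tmp"/run?_R1.fastq.gz | gzip > "$tmp/combined_R1.fastq.gz"

n_in=$(count_reads "$tmp"/run?_R1.fastq.gz)
n_out=$(count_reads "$tmp/combined_R1.fastq.gz")
echo "input reads: $n_in, combined reads: $n_out"
```

On the real data, `count_reads` could be given the `groups_KNP_run?` or `groups_EG_run?` files for one read type and then the matching combined file.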

## Demultiplexing
Unless we are interested in analyzing the sequencing runs separately, it is less work to demultiplex the runs after concatenating.

### Using fastq-multx to demultiplex
`fastq-multx` comes from ea-utils, the same package that provides `fastq-mcf`. It is pretty straightforward to use. Since we have barcodes in a separate file, we are constrained in how we run it:

* -B BARCODE_FILE : a list of known barcodes, and the associated sample names
* -o OUTPUT_FILE(s) : fastq-multx will produce a separate file for each barcode (two files when paired-end reads are input).  This option provides a template for naming the output file - the program will fill in the "%" with the barcode.
* -m : number of mismatches to allow in barcode 
* -d : minimum edit distance between the best and next best match
* -x : don't trim barcodes
* I1_FASTQ : the index read FASTQ, which will be used to demultiplex other reads
* R1_FASTQ : the R1 raw data to demultiplex
* R2_FASTQ : (optional) if data is paired-end, the R2 raw data to demultiplex

*Note:* You can ignore the error message "gzip: stdout: Broken pipe".
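
Because `-m` tolerates mismatches, it helps to know how different the barcodes in the table are from each other. The sketch below computes the smallest pairwise Hamming distance in a barcode table. It assumes a tab-separated layout of sample name then barcode, with all barcodes the same length; that layout is an assumption about the `.tab` files, and the sample names and barcodes here are made up.

```shell
tmp=$(mktemp -d)
# Hypothetical barcode table: sample name <TAB> barcode
printf 'sampleA\tACGTAC\nsampleB\tTGCATG\nsampleC\tACGTTT\n' > "$tmp/barcodes.tab"

min_dist=$(awk -F '\t' '
    { bc[NR] = $2 }
    END {
        min = -1
        for (i = 1; i <= NR; i++)
            for (j = i + 1; j <= NR; j++) {
                # Hamming distance: count positions where the two barcodes differ
                d = 0
                for (k = 1; k <= length(bc[i]); k++)
                    if (substr(bc[i], k, 1) != substr(bc[j], k, 1)) d++
                if (min < 0 || d < min) min = d
            }
        print min
    }' "$tmp/barcodes.tab")
echo "smallest pairwise barcode distance: $min_dist"
```

In this toy table, sampleA and sampleC differ at only two positions, so a single sequencing error can produce an index read that is one mismatch away from both. That kind of ambiguity is what the `-d` option is there to guard against.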

In [None]:
%%bash
GROUP=KNP
OUTDIR=${GROUP}_demux
mkdir -p $OUTDIR
fastq-multx -m1 -d1 -x -B ~/bioinf_nb_ngscourse2015/pool1_barcodes.tab \
    ~/raw_data/combined/groups_${GROUP}_combined_I1.fastq.gz \
    ~/raw_data/combined/groups_${GROUP}_combined_R1.fastq.gz \
    ~/raw_data/combined/groups_${GROUP}_combined_R2.fastq.gz \
    -o ${OUTDIR}/i1.%.fq.gz ${OUTDIR}/r1.%.fq.gz ${OUTDIR}/r2.%.fq.gz \
    > ${OUTDIR}/demux_summary.txt

Now we can run Pool 2. Remember to use `pool2_barcodes.tab`.

In [None]:
%%bash
GROUP=EG
OUTDIR=${GROUP}_demux
mkdir -p $OUTDIR
fastq-multx -m1 -d1 -x -B ~/bioinf_nb_ngscourse2015/pool2_barcodes.tab \
    ~/raw_data/combined/groups_${GROUP}_combined_I1.fastq.gz \
    ~/raw_data/combined/groups_${GROUP}_combined_R1.fastq.gz \
    ~/raw_data/combined/groups_${GROUP}_combined_R2.fastq.gz \
    -o ${OUTDIR}/i1.%.fq.gz ${OUTDIR}/r1.%.fq.gz ${OUTDIR}/r2.%.fq.gz \
    > ${OUTDIR}/demux_summary.txt
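
Once both runs finish, `demux_summary.txt` is the first place to look. As a rough sketch, the fraction of reads that could not be assigned to any sample can be pulled out with `awk`. The layout used here (tab-separated `Id` and `Count` columns, with `unmatched` and `total` rows) is an assumption about what `fastq-multx` prints, and the numbers are made up, so check it against your own summary file before relying on it.

```shell
tmp=$(mktemp -d)
# Made-up summary in the assumed layout
printf 'Id\tCount\tFile(s)\n'                      >  "$tmp/demux_summary.txt"
printf 'sampleA\t700\tr1.sampleA.fq.gz\n'          >> "$tmp/demux_summary.txt"
printf 'sampleB\t200\tr1.sampleB.fq.gz\n'          >> "$tmp/demux_summary.txt"
printf 'unmatched\t100\tr1.unmatched.fq.gz\n'      >> "$tmp/demux_summary.txt"
printf 'total\t1000\n'                             >> "$tmp/demux_summary.txt"

# Percentage of reads in the (assumed) "unmatched" row relative to "total"
pct_unmatched=$(awk -F '\t' '
    $1 == "unmatched" { un = $2 }
    $1 == "total"     { tot = $2 }
    END { if (tot > 0) printf "%.1f", 100 * un / tot }' "$tmp/demux_summary.txt")
echo "unmatched reads: ${pct_unmatched}%"
```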

Note that when you supply `fastq-multx` with an index read file, it automatically determines what the barcodes are, and sometimes it finds some that were not used in any of the libraries. We will just discard these bogus barcodes.
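
To see which output files correspond to such unexpected barcodes, one option is to compare the names embedded in the output files against the first column of the barcode table. The following is a sketch on empty stand-in files: the sample names are hypothetical, the two-column tab-separated table layout is assumed, and `unmatched` is assumed to be the name `fastq-multx` uses for unassigned reads. It only lists candidates and does not delete anything.

```shell
tmp=$(mktemp -d)
mkdir -p "$tmp/KNP_demux"
printf 'sampleA\tACGTAC\nsampleB\tTGCATG\n' > "$tmp/barcodes.tab"

# Empty stand-ins for demultiplexed files; "mystery" is not in the table
for NAME in sampleA sampleB mystery unmatched; do
    : > "$tmp/KNP_demux/r1.${NAME}.fq.gz"
done

unexpected=""
for FILE in "$tmp"/KNP_demux/r1.*.fq.gz; do
    NAME=$(basename "$FILE" .fq.gz)   # e.g. r1.sampleA
    NAME=${NAME#r1.}                  # e.g. sampleA
    # Skip the unmatched bin and any name listed in column 1 of the table
    if [ "$NAME" != "unmatched" ] && ! cut -f 1 "$tmp/barcodes.tab" | grep -qxF "$NAME"; then
        unexpected="$unexpected $NAME"
    fi
done
echo "not in barcode table:$unexpected"
```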

## Next Steps
Now we are ready to run our pipeline on the data. I recommend doing it the same way we did the analysis before: run just one sample, check that the results look OK, then run the rest of the samples.