In [None]:
mkdir -p ~/scratch/analysis

In [None]:
cd ~/scratch/analysis/

In [None]:
ls /mnt/nfs/ngsworkshop/colab-sbx-16/raw_data/groups_KNP_run1/

In [None]:
%%bash
fastq-multx -h

In [None]:
%%bash
OUTDIR=KNP_run1_demux_bm1
mkdir -p $OUTDIR
fastq-multx -m1 -x -B ~/bioinf_nb_ngscourse2015/pool1_barcodes.tab \
    ~/raw_data/groups_KNP_run1/pool1_S1_L001_I1_001.fastq.gz \
    ~/raw_data/groups_KNP_run1/pool1_S1_L001_R1_001.fastq.gz \
    ~/raw_data/groups_KNP_run1/pool1_S1_L001_R2_001.fastq.gz \
    -o ${OUTDIR}/i1.%.fq.gz ${OUTDIR}/r1.%.fq.gz ${OUTDIR}/r2.%.fq.gz \
    > ${OUTDIR}/demux_summary.txt

The unmatched file is pretty big. Let's see which barcodes are showing up there.

In [None]:
%%bash
zcat KNP_run1_demux_bm1/i1.unmatched.fq.gz \
    | sed -n '2~4p' | sort | uniq -c | sort -nr | head
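
For anyone more comfortable in Python than in shell pipelines, the same tally can be sketched as follows. This is a minimal Python 3 sketch, not part of the original workflow: the function name `count_index_reads` is made up for illustration, and it assumes a well-formed four-line-per-record FASTQ file.

```python
import gzip
import os
from collections import Counter
from itertools import islice


def count_index_reads(path):
    """Tally the sequence line (2nd of every 4 lines) of a gzipped FASTQ file."""
    with gzip.open(path, "rt") as handle:
        # islice(handle, 1, None, 4) yields lines 2, 6, 10, ... (1-based),
        # i.e. the sequence lines, just like sed -n '2~4p'.
        return Counter(line.strip() for line in islice(handle, 1, None, 4))


# Path taken from the demultiplexing cell above; skipped if the file is absent.
unmatched = "KNP_run1_demux_bm1/i1.unmatched.fq.gz"
if os.path.exists(unmatched):
    # most_common(10) plays the role of sort | uniq -c | sort -nr | head
    for barcode, count in count_index_reads(unmatched).most_common(10):
        print(count, barcode)
```

`Counter.most_common` sorts by count in descending order, so the first few rows are the over-represented barcodes.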

We see a large overrepresentation of "CCGATC"!  Why?  Did you notice the error message from `fastq-multx`?  It said "Skipped because of distance < 2 : 136452".  

What does this mean?  

Let's see if we can figure it out.  We will compare "CCGATC" to each of the barcodes to see how many mutations it takes to get from this sequence to each of the barcodes.  In computer science, this value is known as the "edit distance" or "Levenshtein distance".
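
To make the definition concrete, here is a minimal pure-Python sketch of the textbook dynamic-programming algorithm for Levenshtein distance. It is only an illustration of what is being computed; the function name `levenshtein` is our own, and it is not the code inside the `editdistance` package.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1]


print(levenshtein("CCGATC", "CAGATC"))   # 1: one substitution (C -> A)
print(levenshtein("kitten", "sitting"))  # 3: the classic textbook example
```

Only two rows of the table are kept at a time, which is plenty for six-base barcodes.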

In [None]:
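import os

import editdistance  # third-party package

barcode_file = "~/bioinf_nb_ngscourse2015/pool1_barcodes.tab"
observed = "CCGATC"  # the over-represented unmatched barcode

# Compute the edit distance from the observed barcode to each known barcode.
distance_list = []
with open(os.path.expanduser(barcode_file)) as handle:
    for line in handle:
        if not line.strip():
            continue  # skip blank lines
        label, barcode = line.split()
        distance_list.append((editdistance.eval(observed, barcode),
                              barcode, label))

# Sort by distance so the closest barcodes are listed first.
distance_list.sort()
for vals in distance_list:
    print("{0[0]} {0[1]} {0[2]}".format(vals))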

Aha! So "CAGATC" has an edit distance of 1 from "CCGATC", which makes it the probable source of most of the "CCGATC" reads.  But we also see that "CCGTCC" has an edit distance of 2.  This is what the message "Skipped because of distance < 2 : 136452" meant: `fastq-multx` observed barcodes in the data that do not exactly match any sequence in the "known barcode" file, and whose best and next-best candidates in that file are too close together to choose between (here the distances are 1 and 2, a gap of only 1).  We can tell fastq-multx to be less stringent using the "-d" option, which the help text describes as "Require a minimum distance of N between the best and next best".  The default is 2, but we can set it to 1.

This is a judgement call: some of those "CCGATC" reads are probably really "CCGTCC" with two sequencing errors.  With "-d 2", every read with a "CCGATC" barcode is thrown away; with "-d 1", they are all put in the "CAGATC" file.  There is no right answer!
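
The trade-off is easier to see with a toy model of the decision rule. The sketch below is a simplified illustration based only on the help text quoted above, not fastq-multx's actual implementation. It assumes equal-length barcodes and counts mismatched positions (Hamming distance) rather than full edit distance, and the sample labels and the `assign_barcode` function are hypothetical.

```python
def hamming(a, b):
    """Count mismatched positions between two equal-length barcodes."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def assign_barcode(observed, known, max_mismatch=1, min_gap=2):
    """Return the label of the chosen barcode, or None if the read is skipped.

    known maps label -> barcode sequence.  Illustrative model only:
    max_mismatch stands in for -m and min_gap stands in for -d.
    """
    ranked = sorted((hamming(observed, seq), label)
                    for label, seq in known.items())
    best_dist, best_label = ranked[0]
    if best_dist > max_mismatch:
        return None  # no known barcode is close enough
    if len(ranked) > 1 and ranked[1][0] - best_dist < min_gap:
        return None  # best and next-best are too close to call
    return best_label


# Hypothetical labels; the sequences are the two barcodes discussed above.
known = {"sample_a": "CAGATC", "sample_b": "CCGTCC"}
print(assign_barcode("CCGATC", known, min_gap=2))  # None: gap is 2 - 1 = 1
print(assign_barcode("CCGATC", known, min_gap=1))  # sample_a
```

With `min_gap=2` the read is dropped because the runner-up is only one mismatch further away than the best match. With `min_gap=1` the same read is assigned to the closest barcode, at the risk of misassigning reads that really came from the runner-up.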

For the moment, we will rerun fastq-multx with "-d1".

In [None]:
%%bash
OUTDIR=KNP_run1_demux_bd1m1
mkdir -p $OUTDIR
fastq-multx -m1 -d1 -x -B ~/bioinf_nb_ngscourse2015/pool1_barcodes.tab \
    ~/raw_data/groups_KNP_run1/pool1_S1_L001_I1_001.fastq.gz \
    ~/raw_data/groups_KNP_run1/pool1_S1_L001_R1_001.fastq.gz \
    ~/raw_data/groups_KNP_run1/pool1_S1_L001_R2_001.fastq.gz \
    -o ${OUTDIR}/i1.%.fq.gz ${OUTDIR}/r1.%.fq.gz ${OUTDIR}/r2.%.fq.gz \
    > ${OUTDIR}/demux_summary.txt