### First step in the pipeline: Extracting barcodes  
We use the "umi-tools" package to parse out our barcodes from the fastq files.  
https://umi-tools.readthedocs.io/en/latest/QUICK_START.html  

UMI-tools was originally designed to parse barcodes out of large scRNA-seq datasets,  
so we will end up using its naming conventions (e.g. "cell_barcode") for Light-Seq.

In [1]:
# If you run "umi_tools" in your terminal, you should get a list of the available tools.
# We will be using the "extract" tool.
!umi_tools

For full UMI-tools documentation, see: https://umi-tools.readthedocs.io/en/latest/


umi_tools.py - Tools for UMI analyses

:Author: Tom Smith & Ian Sudbury, CGAT
:Release: $Id$
:Date: |today|
:Tags: Genomics UMI

There are 6 tools:

  - whitelist
  - extract
  - group
  - dedup
  - count
  - count_tab

To get help on a specific tool, type:

    umi_tools <tool> --help

To use a specific tool, type::

    umi_tools <tool> [tool options] [tool arguments]



### Regex patterns
The barcode extraction method that we use with umi_tools is the "regex" method, which uses the third-party "regex" Python module (not the built-in "re" module).  
Regex is almost its own language and does have a slight learning curve.  

Below I will give the regex pattern that we use for Light-Seq and introduce the regex101.com website for a more visual representation of what it's doing.  
Note: regex101 doesn't seem to handle the regex fuzzy search parameter very well, so we will just ignore that for now.  
https://regex101.com/r/R3AlWj/1

In [2]:
import os
import regex

In [3]:
barcodePattern = '^(?P<umi_1>.{12})' \
               '(?P<discard_1>.*)' \
               '(?P<cell_1>GTTAGG|AGGGTA)' \
               '(?P<discard_2>TGAGTTAT){s<=2}' \
               '(?P<discard_3>.{8})' \
               '.{15,40}+' \
               '(?P<discard_4>AAAAAAAAAAA.*|CTGTCTCTTAT.*|.*$)'

In [4]:
barcodePattern

'^(?P<umi_1>.{12})(?P<discard_1>.*)(?P<cell_1>GTTAGG|AGGGTA)(?P<discard_2>TGAGTTAT){s<=2}(?P<discard_3>.{8}).{15,40}+(?P<discard_4>AAAAAAAAAAA.*|CTGTCTCTTAT.*|.*$)'

In [5]:
# Let's run a regex search on an example sequence
seq = 'GAAGGGAGTGTAAGGGTANGAGTTATAGTCTGTCTCTTATACACATCTCCGAGCCCACGAGACTAAGGCGAACCTCGTATGCCGTCTTCTGCTTGAAAAA'

In [6]:
match = regex.search(barcodePattern, seq)
match

<regex.Match object; span=(0, 100), match='GAAGGGAGTGTAAGGGTANGAGTTATAGTCTGTCTCTTATACACATCTCCGAGCCCACGAGACTAAGGCGAACCTCGTATGCCGTCTTCTGCTTGAAAAA', fuzzy_counts=(1, 0, 0)>

In [7]:
# UMI-tools parses the fastq file based on the named groups in the regex match, e.g. "umi_1", "cell_1", "discard_1".
# Therefore it's better to just use the UMI-tools naming convention to avoid headaches.
match.groupdict()

{'umi_1': 'GAAGGGAGTGTA',
 'discard_1': '',
 'cell_1': 'AGGGTA',
 'discard_2': 'NGAGTTAT',
 'discard_3': 'AGTCTGTC',
 'discard_4': 'TCGTATGCCGTCTTCTGCTTGAAAAA'}

In [8]:
match.groupdict().get('umi_1')

'GAAGGGAGTGTA'

In [9]:
# Let's parse a file then!
OUT_DIR = 'outFiles/'
inFile = 'inFiles/TLS23A_S1_L001_R1_001.fastq.gz'
filePrefix = '%s%s' % (OUT_DIR, inFile.split('/')[-1].split('_')[0])
trimmedR1File = '%s_R1_trimmed.fastq.gz' % filePrefix

In [10]:
def parseSequences(inFile, filePrefix, trimmedR1File):
    
    print('Analyzing %s...' % inFile.split('_R1')[0])
    barcodePattern = '^(?P<umi_1>.{12})' \
                   '(?P<discard_1>.*)' \
                   '(?P<cell_1>GTTAGG|AGGGTA)' \
                   '(?P<discard_2>TGAGTTAT){s<=2}' \
                   '(?P<discard_3>.{8})' \
                   '.{15,40}+' \
                   '(?P<discard_4>AAAAAAAAAAA.*|CTGTCTCTTAT.*|.*$)'

    # Extract UMIs and barcodes
    extractLogFile = '%s_extract.log' % filePrefix

    os.system(('umi_tools extract --stdin %s --extract-method=regex ' \
              '--bc-pattern="%s" -L %s --stdout %s') % \
              (inFile, barcodePattern, extractLogFile, trimmedR1File))
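
The same call can also be assembled as an argument list, which avoids having to shell-quote the regex pattern by hand. This is only a sketch: `build_extract_command` is a hypothetical helper, not part of umi_tools, and it reuses exactly the flags from the `os.system` call above. It builds and prints the command without running it.

```python
import shlex

def build_extract_command(in_file, barcode_pattern, log_file, out_file):
    """Return the umi_tools extract call as a list of arguments (hypothetical helper)."""
    return [
        'umi_tools', 'extract',
        '--stdin', in_file,
        '--extract-method=regex',
        '--bc-pattern=' + barcode_pattern,
        '-L', log_file,
        '--stdout', out_file,
    ]

# Shortened pattern, for illustration only
example_pattern = '^(?P<umi_1>.{12})(?P<cell_1>GTTAGG|AGGGTA)'

cmd = build_extract_command(
    'inFiles/TLS23A_S1_L001_R1_001.fastq.gz',
    example_pattern,
    'outFiles/TLS23A_extract.log',
    'outFiles/TLS23A_R1_trimmed.fastq.gz',
)

# Print a shell-safe rendering of the command for inspection
print(shlex.join(cmd))
```

To actually run it, the list could be passed to `subprocess.run(cmd, check=True)`, which raises an error if umi_tools exits with a non-zero status.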

In [11]:
# Note: this will take a few minutes on a local machine, so I won't run this and bore you.
parseSequences(inFile, filePrefix, trimmedR1File)

Analyzing inFiles/TLS23A_S1_L001...


In [11]:
# We can take a look at the first parsed read.
!zgrep '' outFiles/TLS23A_R1_trimmed.fastq.gz | head -n 4

@M01675:146:000000000-JNRMM:1:1101:15250:1474_AGGGTA_GAAGGGAGTGTA 1:N:0:TAAGGCGA+NTCTCTAT
TCTTATACACATCTCCGAGCCCACGAGACTAAGGCGAACC
+
GGGGGGGGGGGEFFCAF7FGGGFGDGCEGGGGGGGGGG+@
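
In the header above, the cell barcode and UMI appear appended to the read name, separated by underscores. A minimal sketch for pulling them back out, assuming every header follows the `<name>_<cell>_<umi>` layout seen in this one read; `split_read_name` is a hypothetical helper, not a umi_tools function.

```python
def split_read_name(header_line):
    """Split a fastq header into (read name, cell barcode, UMI).

    Assumes the '<name>_<cell>_<umi>' layout seen in the extracted read above.
    """
    # Drop the leading '@' and the comment after the first whitespace
    read_id = header_line.lstrip('@').split()[0]
    # Split from the right, in case the read name itself contains underscores
    name, cell_barcode, umi = read_id.rsplit('_', 2)
    return name, cell_barcode, umi

header = '@M01675:146:000000000-JNRMM:1:1101:15250:1474_AGGGTA_GAAGGGAGTGTA 1:N:0:TAAGGCGA+NTCTCTAT'
name, cell_barcode, umi = split_read_name(header)
print(cell_barcode, umi)
```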


In [12]:
# This is the original seq that was parsed out
print(seq)
match.groupdict()

GAAGGGAGTGTAAGGGTANGAGTTATAGTCTGTCTCTTATACACATCTCCGAGCCCACGAGACTAAGGCGAACCTCGTATGCCGTCTTCTGCTTGAAAAA


{'umi_1': 'GAAGGGAGTGTA',
 'discard_1': '',
 'cell_1': 'AGGGTA',
 'discard_2': 'NGAGTTAT',
 'discard_3': 'AGTCTGTC',
 'discard_4': 'TCGTATGCCGTCTTCTGCTTGAAAAA'}
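
Comparing the trimmed read with the original sequence suggests that the bases kept in the output are the stretch consumed by the unnamed `.{15,40}+` part of the pattern. A minimal sketch with plain slicing, assuming the segment lengths seen in this particular match (empty `discard_1`, 40 retained bases); `layout` and `segments` are illustrative names only.

```python
seq = 'GAAGGGAGTGTAAGGGTANGAGTTATAGTCTGTCTCTTATACACATCTCCGAGCCCACGAGACTAAGGCGAACCTCGTATGCCGTCTTCTGCTTGAAAAA'

# (segment name, length) in read order; lengths taken from the match shown above
layout = [
    ('umi_1', 12),
    ('cell_1', 6),
    ('discard_2', 8),
    ('discard_3', 8),
    ('kept_read', 40),  # the unnamed .{15,40}+ portion
]

segments = {}
pos = 0
for name, length in layout:
    segments[name] = seq[pos:pos + length]
    pos += length
# Everything after the kept read falls into discard_4
segments['discard_4'] = seq[pos:]

for name, value in segments.items():
    print('%-10s %s' % (name, value))
```

The `kept_read` slice is the same 40 bases as the sequence line of the trimmed fastq record.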

### Regex Fuzzy Search
The regex module also allows for a "fuzzy search" parameter, which means a sequence can still match if it is, for example, one base off.  
This would be very useful in the future if you want to incorporate error-corrected barcodes.  

In [13]:
# Let's take our original sequence and add a 1-base error to the barcode
seq_1baseBCerr = 'GAAGGGAGTGTAAGGCTANGAGTTATAGTCTGTCTCTTATACACATCTCCGAGCCCACGAGACTAAGGCGAACCTCGTATGCCGTCTTCTGCTTGAAAAA'

In [14]:
# Now the normal barcodePattern will not be able to match the barcode (the search returns None).
match = regex.search(barcodePattern, seq_1baseBCerr)
match

In [15]:
# Let's introduce a 1-base tolerance to the barcode pattern.
barcodePattern_1hamming = '^(?P<umi_1>.{12})' \
               '(?P<discard_1>.*)' \
               '(?P<cell_1>(GTTAGG){s<=1}|(AGGGTA){s<=1})' \
               '(?P<discard_2>TGAGTTAT){s<=2}' \
               '(?P<discard_3>.{8})' \
               '.{15,40}+' \
               '(?P<discard_4>AAAAAAAAAAA.*|CTGTCTCTTAT.*|.*$)'

In [16]:
match = regex.search(barcodePattern_1hamming, seq_1baseBCerr)
match

<regex.Match object; span=(0, 100), match='GAAGGGAGTGTAAGGCTANGAGTTATAGTCTGTCTCTTATACACATCTCCGAGCCCACGAGACTAAGGCGAACCTCGTATGCCGTCTTCTGCTTGAAAAA', fuzzy_counts=(2, 0, 0)>

In [17]:
match.groupdict()

{'umi_1': 'GAAGGGAGTGTA',
 'discard_1': '',
 'cell_1': 'AGGCTA',
 'discard_2': 'NGAGTTAT',
 'discard_3': 'AGTCTGTC',
 'discard_4': 'TCGTATGCCGTCTTCTGCTTGAAAAA'}
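
Note that the fuzzy match reports the barcode as observed (`AGGCTA`), not the barcode it was meant to be. If corrected barcodes are wanted, one option is to map each observed barcode back to the nearest known one. The sketch below assumes the two barcodes from the pattern above and a tolerance of one substitution, mirroring `{s<=1}`; `hamming` and `correct_barcode` are hypothetical helpers, not part of umi_tools or the regex module.

```python
KNOWN_BARCODES = ('GTTAGG', 'AGGGTA')

def hamming(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError('sequences must have the same length')
    return sum(x != y for x, y in zip(a, b))

def correct_barcode(observed, known=KNOWN_BARCODES, max_mismatches=1):
    """Return the unique known barcode within max_mismatches, else None."""
    hits = [bc for bc in known if hamming(observed, bc) <= max_mismatches]
    # Refuse to guess when zero or several known barcodes are equally plausible
    return hits[0] if len(hits) == 1 else None

print(correct_barcode('AGGCTA'))  # one substitution away from AGGGTA
print(correct_barcode('CCCCCC'))  # too far from both known barcodes
```

Returning `None` for ambiguous or distant barcodes is a design choice for this sketch; a real pipeline might log or count those reads instead.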

### Now you are ready!
Now you can modify the regex pattern however you want!  
In the future you may want to use more barcodes, add more sequences to the match, and so on.  
Everything is done with the regex string.