# Toy example with `IlluminaBarcodeParser` containing an `upstream2` sequence
This notebook illustrates the use of an [IlluminaBarcodeParser](https://jbloomlab.github.io/dms_variants/dms_variants.illuminabarcodeparser.html#dms_variants.illuminabarcodeparser.IlluminaBarcodeParser) on a toy example.

It is written primarily as a test for that class.

Import required modules:

In [1]:
import tempfile

from dms_variants.illuminabarcodeparser import IlluminaBarcodeParser

Initialize an `IlluminaBarcodeParser` for a barcode arrangement that looks like this:

    5'-[R2 binding site]-GCA-ACATGA-NNNN-[R1 binding site]-3'

In [2]:
parser = IlluminaBarcodeParser(bclen=4, upstream="ACATGA", upstream2="GCA")
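
To see how this arrangement relates to the R1 reads written below, here is a minimal sketch of the orientation. It assumes, judging from the parsed output in this notebook, that barcodes are reported as they appear in R1, and that R1 reads the flanking sequences as reverse complements. The helper `reverse_complement` is defined here only for illustration and is not imported from `dms_variants`:

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq))


upstream2 = "GCA"
upstream = "ACATGA"
barcode_in_r1 = "CGTA"  # barcode as it appears at the start of the R1 read

# R1 starts next to the barcode, so it sees the barcode first, then the
# reverse complement of upstream, then the reverse complement of upstream2.
expected_r1 = (
    barcode_in_r1 + reverse_complement(upstream) + reverse_complement(upstream2)
)
print(expected_r1)  # CGTATCATGTTGC
```

This is the sequence used for the first test read in the next cell.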

Create a temporary file holding the FASTQ reads.
We write some valid test reads and some invalid reads.
The header and comment for each read indicate what is being tested.
We use quality scores of `?` (30) or `+` (10) for high- and low-quality bases:

In [3]:
r1file = tempfile.NamedTemporaryFile(mode="w")

# valid CGTA barcode, full flanking regions
_ = r1file.write("@valid_CGTA_barcode\n" "CGTATCATGTTGC\n" "+\n" "?????????????\n")

# CGTA barcode, truncated flanking regions (despite the header, this read is shorter than bclen + upstream + upstream2)
_ = r1file.write(
    "@valid_CGTA_barcode_partial_flanking_region\n"
    "CGTATCATTGC\n"
    "+\n"
    "???????????\n"
)

# valid GCCG barcode, extended flanking regions
_ = r1file.write(
    "@valid_GCCG_barcode_extended_flanking_region\n"
    "GCCGTCATGTTGCCAA\n"
    "+\n"
    "????????????????\n"
)

# some sites low quality
_ = r1file.write("@low_quality_site\n" "CGTATCATGTTGC\n" "+\n" "???+?????????\n")

# N in barcode
_ = r1file.write("@N_in_barcode\n" "CGTNTCATGTTGC\n" "+\n" "?????????????\n")

# GGAG barcode, one mismatch in flanking region
_ = r1file.write(
    "@GGAG_barcode_one_mismatch_in_upstream\n" "GGAGTCATGATGC\n" "+\n" "?????????????\n"
)

# GGTG barcode, mismatch in both upstream regions
_ = r1file.write(
    "@GGTG_barcode_two_mismatch_in_upstream_and_upstream2\n"
    "GGTGTCATGATGG\n"
    "+\n"
    "?????????????\n"
)

r1file.flush()
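
The quality characters in these reads can be decoded with the standard Phred+33 convention. The sketch below checks which barcode positions of the `low_quality_site` read fall below a threshold. The threshold `minq = 20` is an assumption made for this illustration, and `phred33` is a hypothetical helper, not part of `dms_variants`:

```python
def phred33(qual_string):
    """Decode a FASTQ quality string into integer Phred scores (offset 33)."""
    return [ord(char) - 33 for char in qual_string]


bclen = 4
minq = 20  # assumed threshold for this illustration

scores = phred33("???+?????????")  # quality line of the low_quality_site read
low_positions = [i for i, q in enumerate(scores[:bclen]) if q < minq]
print(scores[:bclen], low_positions)  # [30, 30, 30, 10] [3]
```

The single low-quality base falls inside the first four bases of the read, which is where the barcode sits.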

Parse the barcodes using only the R1 reads:

In [4]:
barcodes, fates = parser.parse(r1file.name)
print(barcodes)
print(fates)

  barcode  count
0    CGTA      1
1    GCCG      1
                     fate  count
0     unparseable barcode      3
1           valid barcode      2
2     low quality barcode      1
3          read too short      1
4  failed chastity filter      0
5         invalid barcode      0
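
The fate counts above can be reasoned through with a small stand-alone classifier. This is only a sketch under stated assumptions, not the library's implementation: the function `classify`, the order of its checks, and the quality threshold `minq=20` are all hypothetical choices made so that each toy read can be traced by hand:

```python
from collections import Counter

RC_UPSTREAM = "TCATGT"  # reverse complement of upstream "ACATGA"
RC_UPSTREAM2 = "TGC"  # reverse complement of upstream2 "GCA"


def mismatches(observed, expected):
    """Count positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(observed, expected))


def classify(read, qual, bclen=4, up_mm=0, up2_mm=0, minq=20):
    """Toy fate assignment for one R1 read (illustrative assumption only)."""
    end_up = bclen + len(RC_UPSTREAM)
    end_up2 = end_up + len(RC_UPSTREAM2)
    if len(read) < end_up2:
        return "read too short"
    barcode = read[:bclen]
    if (
        set(barcode) - set("ACGT")
        or mismatches(read[bclen:end_up], RC_UPSTREAM) > up_mm
        or mismatches(read[end_up:end_up2], RC_UPSTREAM2) > up2_mm
    ):
        return "unparseable barcode"
    if min(ord(char) - 33 for char in qual[:bclen]) < minq:
        return "low quality barcode"
    return "valid barcode"


# The seven toy reads written to the temporary file, in order.
reads = [
    ("CGTATCATGTTGC", "?????????????"),
    ("CGTATCATTGC", "???????????"),
    ("GCCGTCATGTTGCCAA", "????????????????"),
    ("CGTATCATGTTGC", "???+?????????"),
    ("CGTNTCATGTTGC", "?????????????"),
    ("GGAGTCATGATGC", "?????????????"),
    ("GGTGTCATGATGG", "?????????????"),
]
toy_fates = Counter(classify(seq, qual) for seq, qual in reads)
print(toy_fates)
```

Under these assumptions the sketch gives 3 unparseable, 2 valid, 1 low-quality, and 1 too-short read, in agreement with the table printed by the parser.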


Now create a parser that allows one mismatch in `upstream`, and check that we additionally recover the `GGAG` barcode:

In [5]:
parser_mismatch = IlluminaBarcodeParser(
    bclen=4,
    upstream="ACATGA",
    upstream2="GCA",
    upstream_mismatch=1,
)
barcodes_mismatch, fates_mismatch = parser_mismatch.parse(r1file.name)
print(barcodes_mismatch)
print(fates_mismatch)

  barcode  count
0    CGTA      1
1    GCCG      1
2    GGAG      1
                     fate  count
0           valid barcode      3
1     unparseable barcode      2
2     low quality barcode      1
3          read too short      1
4  failed chastity filter      0
5         invalid barcode      0
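
One way to picture a single allowed mismatch is as a regular expression in which any one position of the flanking sequence may be substituted. The sketch below uses only the standard `re` module. The helper `one_mismatch_pattern` is hypothetical, and this is not necessarily how `IlluminaBarcodeParser` builds its patterns internally:

```python
import re


def one_mismatch_pattern(seq):
    """Regex matching seq with at most one substituted base (no indels)."""
    variants = [seq[:i] + "[ACGTN]" + seq[i + 1 :] for i in range(len(seq))]
    return "(?:" + "|".join(variants) + ")"


# Flanks are written as R1 sees them: reverse complements of ACATGA and GCA.
strict = re.compile("(?P<bc>[ACGT]{4})TCATGTTGC")
relaxed = re.compile("(?P<bc>[ACGT]{4})" + one_mismatch_pattern("TCATGT") + "TGC")

ggag_read = "GGAGTCATGATGC"  # one mismatch in upstream
ggtg_read = "GGTGTCATGATGG"  # mismatches in upstream and upstream2

print(strict.match(ggag_read))  # None
print(relaxed.match(ggag_read).group("bc"))  # GGAG
print(relaxed.match(ggtg_read))  # None
```

The `GGTG` read still fails in this sketch because its `upstream2` region also differs, which the relaxed pattern does not tolerate.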


Now pass `outer_flank_fates=True` so that reads failing only in the outer flank (`upstream2`) are reported as `invalid outer flank` rather than `unparseable barcode`:

In [6]:
parser_mismatch = IlluminaBarcodeParser(
    bclen=4,
    upstream="ACATGA",
    upstream2="GCA",
    upstream_mismatch=1,
)
barcodes_mismatch, fates_mismatch = parser_mismatch.parse(
    r1file.name, outer_flank_fates=True
)
print(barcodes_mismatch)
print(fates_mismatch)

  barcode  count
0    CGTA      1
1    GCCG      1
2    GGAG      1
                     fate  count
0           valid barcode      3
1     invalid outer flank      1
2     low quality barcode      1
3          read too short      1
4     unparseable barcode      1
5  failed chastity filter      0
6         invalid barcode      0
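
The split between `invalid outer flank` and `unparseable barcode` can be sketched by checking the inner flank (`upstream`) and the outer flank (`upstream2`) separately. The function `flank_fate` and the order of its checks are assumptions made for illustration; the library's actual decision logic may differ:

```python
def mismatches(observed, expected):
    """Count positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(observed, expected))


def flank_fate(read, bclen=4, rc_up="TCATGT", rc_up2="TGC", up_mm=1, up2_mm=0):
    """Toy fate for a full-length, high-quality R1 read (assumed logic)."""
    barcode = read[:bclen]
    inner = read[bclen : bclen + len(rc_up)]
    outer = read[bclen + len(rc_up) : bclen + len(rc_up) + len(rc_up2)]
    # Problems in the barcode or inner flank count as unparseable.
    if set(barcode) - set("ACGT") or mismatches(inner, rc_up) > up_mm:
        return "unparseable barcode"
    # Otherwise, a failure confined to the outer flank gets its own fate.
    if mismatches(outer, rc_up2) > up2_mm:
        return "invalid outer flank"
    return "valid barcode"


for name, read in [
    ("N_in_barcode", "CGTNTCATGTTGC"),
    ("GGAG_one_mismatch", "GGAGTCATGATGC"),
    ("GGTG_two_mismatch", "GGTGTCATGATGG"),
]:
    print(name, "->", flank_fate(read))
```

In this sketch the `GGTG` read is the one that moves from `unparseable barcode` to `invalid outer flank`, since its `upstream` mismatch is within the allowance and only `upstream2` fails.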


Now create a parser that allows one mismatch in each of `upstream` and `upstream2`, and check that we additionally recover the `GGTG` barcode:

In [7]:
parser_mismatch = IlluminaBarcodeParser(
    bclen=4,
    upstream="ACATGA",
    upstream2="GCA",
    upstream_mismatch=1,
    upstream2_mismatch=1,
)
barcodes_mismatch, fates_mismatch = parser_mismatch.parse(r1file.name)
print(barcodes_mismatch)
print(fates_mismatch)

  barcode  count
0    CGTA      1
1    GCCG      1
2    GGAG      1
3    GGTG      1
                     fate  count
0           valid barcode      4
1     low quality barcode      1
2          read too short      1
3     unparseable barcode      1
4  failed chastity filter      0
5         invalid barcode      0


Close the temporary file:

In [8]:
r1file.close()
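
As an alternative to closing the file explicitly, the temporary file can be managed with a `with` block so that it is removed even if parsing raises an error. This is a minimal sketch using only the standard library. Reopening the file by name while it is still open works on POSIX systems but may fail on Windows:

```python
import os
import tempfile

with tempfile.NamedTemporaryFile(mode="w", suffix=".fastq") as handle:
    handle.write("@valid_CGTA_barcode\nCGTATCATGTTGC\n+\n?????????????\n")
    handle.flush()  # make the contents visible to readers that open by name
    path = handle.name
    with open(path) as reader:
        n_lines = sum(1 for _ in reader)

print(n_lines)  # 4
print(os.path.exists(path))  # False: the file is deleted on exit
```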