# Toy example with `IlluminaBarcodeParser`
This notebook illustrates the use of an [IlluminaBarcodeParser](https://jbloomlab.github.io/dms_variants/dms_variants.illuminabarcodeparser.html#dms_variants.illuminabarcodeparser.IlluminaBarcodeParser) on a toy data set.

It is written primarily as a test for that class.

Import required modules:

In [1]:
import tempfile

from dms_variants.illuminabarcodeparser import IlluminaBarcodeParser

Initialize an `IlluminaBarcodeParser` for a barcode arrangement that looks like this:

    5'-[R2 binding site]-ACATGA-NNNN-GACT-[R1 binding site]-3'

In [2]:
parser = IlluminaBarcodeParser(bclen=4, upstream="ACATGA", downstream="GACT")
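
The arrangement above is written 5' to 3', which is the direction R2 reads it; R1 reads the opposite strand, so it sees the reverse complement. The sketch below shows that relationship for one barcode. The `reverse_complement` helper is written here only for illustration and is not part of `dms_variants`.

```python
# Illustrative helper (not part of dms_variants): reverse-complement a DNA string.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


upstream, downstream = "ACATGA", "GACT"
barcode_r2 = "TACG"  # barcode as it appears in the R2 read

r2_read = upstream + barcode_r2 + downstream
r1_read = reverse_complement(r2_read)

print(r2_read)  # ACATGATACGGACT
print(r1_read)  # AGTCCGTATCATGT
print(reverse_complement(barcode_r2))  # CGTA
```

The parsing results in this example report barcodes as they appear in R1, so this barcode is listed as `CGTA`.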

Create temporary files holding the FASTQ reads.
We write some valid test reads and some invalid reads.
The header for each read explains why it is valid or invalid.
We use quality characters of ``?`` (Q30) or ``+`` (Q10) for high- and low-quality bases:

In [3]:
r1file = tempfile.NamedTemporaryFile(mode="w")
r2file = tempfile.NamedTemporaryFile(mode="w")

# valid CGTA barcode (TACG as read in R2), full flanking regions
_ = r1file.write(
    "@valid_CGTA_barcode_full_flanking_region\n"
    "AGTCCGTATCATGT\n"
    "+\n"
    "??????????????\n"
)
_ = r2file.write(
    "@valid_CGTA_barcode_full_flanking_region\n"
    "ACATGATACGGACT\n"
    "+\n"
    "??????????????\n"
)

# valid CGTA barcode, partial flanking regions
_ = r1file.write(
    "@valid_CGTA_barcode_partial_flanking_region\n"
    "AGTCCGTATCAT\n"
    "+\n"
    "????????????\n"
)
_ = r2file.write(
    "@valid_CGTA_barcode_partial_flanking_region\n" "ACATGATACG\n" "+\n" "??????????\n"
)

# valid GCCG barcode, extended flanking regions
_ = r1file.write(
    "@valid_GCCG_barcode_extended_flanking_region\n"
    "AGTCGCCGTCATGTTAC\n"
    "+\n"
    "?????????????????\n"
)
_ = r2file.write(
    "@valid_GCCG_barcode_extended_flanking_region\n"
    "ACATGACGGCGACTGAC\n"
    "+\n"
    "?????????????????\n"
)

# AAGT barcode in R1 but R2 differs
_ = r1file.write(
    "@AAGT_R1_barcode_but_R2_differs\n" "AGTCAAGTTCATGT\n" "+\n" "??????????????\n"
)
_ = r2file.write(
    "@AAGT_R1_barcode_but_R2_differs\n" "ACATGAACTAGACT\n" "+\n" "??????????????\n"
)

# same site low quality in R1 and R2
_ = r1file.write(
    "@low_quality_site_in_R1_and_R2\n" "AGTCCGTATCATGT\n" "+\n" "?????+????????\n"
)
_ = r2file.write(
    "@low_quality_site_in_R1_and_R2\n" "ACATGATACGGACT\n" "+\n" "????????+?????\n"
)

# different site low quality in R1 and R2
_ = r1file.write(
    "@AGTA_with_low_quality_site_in_R1\n" "AGTCAGTATCATGT\n" "+\n" "?????+????????\n"
)
_ = r2file.write(
    "@AGTA_with_low_quality_site_in_R1\n" "ACATGATACTGACT\n" "+\n" "?????????+????\n"
)

# N in barcode
_ = r1file.write("@N_in_barcode\n" "AGTCCGTNTCATGT\n" "+\n" "??????????????\n")
_ = r2file.write("@N_in_barcode\n" "ACATGATACGGACT\n" "+\n" "??????????????\n")

# GGAG barcode, one mismatch in each flanking region
_ = r1file.write(
    "@GGAG_barcode_one_mismatch_per_flank\n" "GGTCGGAGTCATGA\n" "+\n" "??????????????\n"
)
_ = r2file.write(
    "@GGAG_barcode_one_mismatch_per_flank\n" "TCATGACTCCGACG\n" "+\n" "??????????????\n"
)

# GGAG barcode, two mismatches in a flanking region
_ = r1file.write(
    "@GGAG_barcode_two_mismatch_in_a_flank\n"
    "GGTCGGAGTCATAA\n"
    "+\n"
    "??????????????\n"
)
_ = r2file.write(
    "@GGAG_barcode_two_mismatch_in_a_flank\n"
    "TCATGACTCCGACG\n"
    "+\n"
    "??????????????\n"
)

r1file.flush()
r2file.flush()
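
The quality characters used in the reads above can be decoded by hand. This minimal sketch assumes the standard Phred+33 FASTQ encoding; `phred33` is a helper defined here for illustration only.

```python
def phred33(qual):
    """Decode a FASTQ quality string into a list of Phred scores (Phred+33)."""
    return [ord(ch) - 33 for ch in qual]


print(phred33("?+"))  # [30, 10]

# Quality string of the R1 read "low_quality_site_in_R1_and_R2":
# only the sixth base (index 5) is low quality.
scores = phred33("?????+????????")
low_sites = [i for i, q in enumerate(scores) if q < 30]
print(low_sites)  # [5]
```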

Parse the barcodes using both R1 and R2 reads:

In [4]:
barcodes, fates = parser.parse(r1file.name, r2files=r2file.name)
print(barcodes)
print(fates)

  barcode  count
0    CGTA      2
1    AGTA      1
2    GCCG      1
                     fate  count
0           valid barcode      4
1     unparseable barcode      3
2        R1 / R2 disagree      1
3     low quality barcode      1
4  failed chastity filter      0
5         invalid barcode      0


Now parse just using R1.
We gain the `AAGT` barcode, for which R1 and R2 disagree, but lose the `AGTA` barcode, for which R1 is low quality at a position where R2 is OK:

In [5]:
barcodes, fates = parser.parse(r1file.name)
print(barcodes)
print(fates)

  barcode  count
0    CGTA      2
1    AAGT      1
2    GCCG      1
                     fate  count
0           valid barcode      4
1     unparseable barcode      3
2     low quality barcode      2
3  failed chastity filter      0
4         invalid barcode      0
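
The change in the low-quality count can be reasoned through by hand. The sketch below is not the library's implementation; it assumes a site passes in paired mode when either read has a high-quality call there, and it uses a hypothetical cutoff `MINQ = 20` that lies between the two scores used in this example (10 and 30).

```python
MINQ = 20  # hypothetical cutoff, chosen only to separate Q10 from Q30


def phred33(qual):
    """Decode a FASTQ quality string into Phred scores (Phred+33)."""
    return [ord(ch) - 33 for ch in qual]


def passes_r1_only(q1):
    """True if every barcode site in R1 meets the cutoff."""
    return all(q >= MINQ for q in phred33(q1))


def passes_paired(q1, q2):
    """True if, at every site, the better of the R1 and R2 scores meets the cutoff.

    q2 is given in R2 read order, so it is reversed to line up with R1 sites.
    """
    best = [max(a, b) for a, b in zip(phred33(q1), phred33(q2[::-1]))]
    return all(q >= MINQ for q in best)


# The barcode occupies indices 4-7 of each R1 read and 6-9 of each R2 read.
same_q1, same_q2 = "?????+????????"[4:8], "????????+?????"[6:10]
diff_q1, diff_q2 = "?????+????????"[4:8], "?????????+????"[6:10]

print(passes_paired(same_q1, same_q2))  # False: both reads are low at the same site
print(passes_paired(diff_q1, diff_q2))  # True: R2 covers the site that is low in R1
print(passes_r1_only(diff_q1))          # False: R1 alone has a low-quality site
```

Under these assumptions, the read with low-quality calls at different sites passes only when R2 is supplied, which matches the rise from one to two low-quality barcodes when parsing R1 alone.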


We can also add extra columns to the output data frames:

In [6]:
barcodes, fates = parser.parse(
    r1file.name, add_cols={"library": "lib-1", "sample": "s1"}
)
print(barcodes)
print(fates)

  barcode  count library sample
0    CGTA      2   lib-1     s1
1    AAGT      1   lib-1     s1
2    GCCG      1   lib-1     s1
                     fate  count library sample
0           valid barcode      4   lib-1     s1
1     unparseable barcode      3   lib-1     s1
2     low quality barcode      2   lib-1     s1
3  failed chastity filter      0   lib-1     s1
4         invalid barcode      0   lib-1     s1


Now create a parser that allows a mismatch in each flanking region, and check that we recover a "GGAG" barcode:

In [7]:
parser_mismatch = IlluminaBarcodeParser(
    bclen=4,
    upstream="ACATGA",
    downstream="GACT",
    upstream_mismatch=1,
    downstream_mismatch=1,
)
barcodes_mismatch, fates_mismatch = parser_mismatch.parse(
    r1file.name, r2files=r2file.name
)
print(barcodes_mismatch)
print(fates_mismatch)

  barcode  count
0    CGTA      2
1    AGTA      1
2    GCCG      1
3    GGAG      1
                     fate  count
0           valid barcode      5
1     unparseable barcode      2
2        R1 / R2 disagree      1
3     low quality barcode      1
4  failed chastity filter      0
5         invalid barcode      0
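
To check by hand why one `GGAG` read is recovered and the other is not, we can count mismatches in each flank of the R1 reads. This is only an illustrative sketch using a plain Hamming distance, not the parser's own matching code; the expected R1 flanks are the reverse complements of `downstream` and `upstream`.

```python
def hamming(a, b):
    """Count positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must be the same length")
    return sum(x != y for x, y in zip(a, b))


# Flanks as they appear in R1: reverse complements of "GACT" and "ACATGA".
r1_left, r1_right = "AGTC", "TCATGT"

one_mismatch_read = "GGTCGGAGTCATGA"
two_mismatch_read = "GGTCGGAGTCATAA"

mismatches = {}
for name, read in [("one", one_mismatch_read), ("two", two_mismatch_read)]:
    left, right = read[:4], read[8:]  # the barcode occupies indices 4-7
    mismatches[name] = (hamming(left, r1_left), hamming(right, r1_right))

print(mismatches)  # {'one': (1, 1), 'two': (1, 2)}
```

The second read has two mismatches in one flank, so it remains unparseable even when one mismatch per flank is allowed.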


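Now parse the barcodes using `valid_barcodes` to set a barcode whitelist.
The `GCCG` barcode is not in the whitelist, so it is now counted as an invalid barcode, and the unobserved `TAAT` barcode is listed with a count of zero: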

In [8]:
parser_wl = IlluminaBarcodeParser(
    upstream="ACATGA", downstream="GACT", valid_barcodes={"CGTA", "AGTA", "TAAT"}
)
barcodes_wl, fates_wl = parser_wl.parse(r1file.name, r2files=r2file.name)
print(barcodes_wl)
print(fates_wl)

  barcode  count
0    CGTA      2
1    AGTA      1
2    TAAT      0
                     fate  count
0     unparseable barcode      3
1           valid barcode      3
2        R1 / R2 disagree      1
3         invalid barcode      1
4     low quality barcode      1
5  failed chastity filter      0
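
The effect of the whitelist on these counts can be reproduced with a few lines of plain Python. This is a sketch of the bookkeeping only, not of how the parser is implemented; the `parsed` list simply restates the four barcodes that were valid without a whitelist.

```python
from collections import Counter

valid_barcodes = {"CGTA", "AGTA", "TAAT"}

# Barcodes called as valid when no whitelist was used (R1 and R2 together).
parsed = ["CGTA", "CGTA", "GCCG", "AGTA"]

counts = Counter(bc for bc in parsed if bc in valid_barcodes)

# List every whitelisted barcode, including ones never observed,
# sorted by descending count and then alphabetically.
table = {
    bc: counts[bc]
    for bc in sorted(valid_barcodes, key=lambda bc: (-counts[bc], bc))
}
n_invalid = sum(bc not in valid_barcodes for bc in parsed)

print(table)      # {'CGTA': 2, 'AGTA': 1, 'TAAT': 0}
print(n_invalid)  # 1
```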


Close the temporary files:

In [9]:
r1file.close()
r2file.close()
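
As an alternative to closing the files by hand, the temporary file can be managed with a `with` block, and the repeated record formatting can be factored into a small function. The `fastq_record` helper below is hypothetical and written only for this sketch. As in the cells above, the file is reopened by name while still open, which works on POSIX systems.

```python
import tempfile


def fastq_record(name, seq, qual=None):
    """Format one FASTQ record, defaulting to all high-quality ('?') scores."""
    if qual is None:
        qual = "?" * len(seq)
    if len(qual) != len(seq):
        raise ValueError("sequence and quality strings must be the same length")
    return f"@{name}\n{seq}\n+\n{qual}\n"


# The file is deleted automatically when the `with` block exits.
with tempfile.NamedTemporaryFile(mode="w", suffix=".fastq") as r1file:
    r1file.write(
        fastq_record("valid_CGTA_barcode_full_flanking_region", "AGTCCGTATCATGT")
    )
    r1file.flush()  # make sure the record is on disk before reading by name
    with open(r1file.name) as f:
        contents = f.read()

print(contents)
```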