# Extract viral barcodes from viral barcode sequencing data
This Python Jupyter notebook parses the viral barcode from each read in the raw viral barcode sequencing data.

Import Python modules:

In [1]:
import pandas as pd

from pymodules.tags_and_barcodes import parse_barcodes_supernatant

import pysam

Temporarily hardcode inputs for development:

In [2]:
expt = 'scProgenyProduction_trial1'
fastqs = [('supernatant','wildtype','fluHA','replicate_1','/shared/ngs/illumina/bloom_lab/201013_M03100_0625_000000000-JB2KP/Data/Intensities/BaseCalls/Trial1-WT-Sup-fluHA-A_S1_L001_R1_001.fastq.gz'),
          ('supernatant','wildtype','fluHA','replicate_2','/shared/ngs/illumina/bloom_lab/201013_M03100_0625_000000000-JB2KP/Data/Intensities/BaseCalls/Trial1-WT-Sup-fluHA-B_S2_L001_R1_001.fastq.gz')]
fastqs_df = pd.DataFrame.from_records(fastqs,
                                     columns=['source', 'tag', 'gene', 'replicate', 'fastq_path'])
viral_bc_locs = [('fluHA',0,16),
                 ('fluNA',0,16)]
bc_locs_df = pd.DataFrame.from_records(viral_bc_locs,
                                     columns=['gene', 'start', 'end'])
viral_bc_counts_csv = 'results/viral_fastq10x/scProgenyProduction_trial1_viral_bc_counts.csv.gz'

Get `snakemake` variables [as described here](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#jupyter-notebook-integration):

In [3]:
# bam = snakemake.input.bam
# bai = snakemake.input.bai
# cell_barcodes = snakemake.input.cell_barcodes
# viral_bc_locs = snakemake.input.viral_bc_locs
# viral_bc_by_cell_umi_csv = snakemake.output.viral_bc_by_cell_umi_csv
# expt = snakemake.wildcards.expt

Read the viral barcode locations:

In [4]:
# print(f"Reading viral barcode locations from {viral_bc_locs}")
# #bc_locs_df = pd.read_csv(viral_bc_locs)
# bc_locs_df = pd.DataFrame()
display(bc_locs_df)

if len(bc_locs_df) != bc_locs_df['gene'].nunique():
    raise ValueError('code assumes at most one barcode per gene')

Unnamed: 0,gene,start,end
0,fluHA,0,16
1,fluNA,0,16


Now we get the viral barcodes.
Specifically, we parse each FASTQ file and, for each read, extract the viral barcode based on its position in the sequence.

In [21]:
viral_barcodes_by_read = pd.DataFrame({}, columns=['source', 'tag', 'gene', 'replicate', 'read_id',  'viral_barcode'])

for i, row in fastqs_df.iterrows():
    print(f"Parsing viral barcodes for \n",
          f"\t{row['tag']} {row['gene']} {row['source']} {row['replicate']}\n",
          f"\tfrom {row['fastq_path'].split('/')[-1]}:")

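    with pysam.FastxFile(row['fastq_path']) as fastq_iterator:
        # Use `.iloc[0]` (positional) rather than `[0]` (label-based) so the lookup
        # also works for genes that are not in the first row of `bc_locs_df`.
        viral_barcodes = parse_barcodes_supernatant(readiterator=fastq_iterator,
                                                    start=bc_locs_df.query(f'gene == "{row["gene"]}"')['start'].iloc[0],
                                                    end=bc_locs_df.query(f'gene == "{row["gene"]}"')['end'].iloc[0]). \
                                                    assign(source = row['source'],
                                                           tag = row['tag'],
                                                           gene = row['gene'],
                                                           replicate = row['replicate'])
        # `DataFrame.append` was removed in pandas 2.0; `pd.concat` is its replacement.
        viral_barcodes_by_read = pd.concat([viral_barcodes_by_read, viral_barcodes])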
        print(f"Parsed viral barcodes {len(viral_barcodes)} reads.\n")

Parsing viral barcodes for 
 	wildtype fluHA supernatant replicate_1
 	from Trial1-WT-Sup-fluHA-A_S1_L001_R1_001.fastq.gz:
Parsed viral barcodes 760633 reads.

Parsing viral barcodes for 
 	wildtype fluHA supernatant replicate_2
 	from Trial1-WT-Sup-fluHA-B_S2_L001_R1_001.fastq.gz:
Parsed viral barcodes 691137 reads.
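
The position-based extraction described above can be sketched in plain Python. This is only an illustration: the body of `parse_barcodes_supernatant` is not shown in this notebook, so the function `slice_barcodes`, the `Read` tuple, and the decision to skip reads shorter than `end` are all assumptions rather than the module's actual behavior.

```python
from collections import namedtuple

# Hypothetical stand-in for a FASTQ record (pysam records expose .name and .sequence).
Read = namedtuple("Read", ["name", "sequence"])


def slice_barcodes(reads, start, end):
    """Return (read_id, barcode) pairs, taking bases [start, end) of each read.

    Reads too short to contain the full barcode are skipped (an assumption).
    """
    barcodes = []
    for read in reads:
        if len(read.sequence) < end:
            continue
        barcodes.append((read.name, read.sequence[start:end]))
    return barcodes


example_reads = [
    Read("read_1", "ACGTACGTACGTACGTTTTT"),
    Read("read_2", "ACGT"),  # too short for a 16-nt barcode
]
example_barcodes = slice_barcodes(example_reads, start=0, end=16)
```

With `start=0` and `end=16`, as in `bc_locs_df`, the barcode is simply the first 16 nucleotides of each read.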



Next, I count how many reads contain each viral barcode within each sequencing sample. To do this, I group by source, tag, gene, and replicate (which together define a sequencing sample) and count the number of times each barcode appears.

In [22]:
viral_bc_counts = (viral_barcodes_by_read
                   .groupby(['source', 'tag', 'gene', 'replicate'])['viral_barcode']
                   .value_counts()
                   .rename('count')  # name the counts column explicitly
                   .reset_index())
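
To make the grouping logic concrete, here is a minimal plain-Python sketch of the same idea using `collections.Counter` instead of pandas. The toy records and the names `toy_reads` and `toy_counts` are made up for illustration; keying the counter on the full (source, tag, gene, replicate, viral_barcode) tuple means a barcode is tallied separately in each sequencing sample.

```python
from collections import Counter

# Toy per-read records: (source, tag, gene, replicate, viral_barcode).
toy_reads = [
    ("supernatant", "wildtype", "fluHA", "replicate_1", "AAAA"),
    ("supernatant", "wildtype", "fluHA", "replicate_1", "AAAA"),
    ("supernatant", "wildtype", "fluHA", "replicate_1", "CCCC"),
    ("supernatant", "wildtype", "fluHA", "replicate_2", "AAAA"),
]

# Keying on the full tuple counts each barcode separately within each sample.
toy_counts = Counter(toy_reads)
```

The same barcode seen in two replicates produces two separate entries, which is the behavior the grouped `value_counts` call relies on.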

Write the viral barcode counts to the output CSV file:

In [6]:
print(f"Writing viral barcodes to {viral_bc_counts_csv}")

viral_bc_counts.to_csv(viral_bc_counts_csv,
                       index=False,
                       compression='gzip')
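
One way to sanity-check a gzip-compressed CSV is to read it back and compare it with what was written. The sketch below does this with only the standard library on a toy row; the file name `toy_viral_bc_counts.csv.gz` and the row values are hypothetical, and it writes to a temporary directory rather than to `viral_bc_counts_csv`.

```python
import csv
import gzip
import os
import tempfile

# Toy data: one header row and one counts row (values are made up).
header = ["source", "tag", "gene", "replicate", "viral_barcode", "count"]
rows = [["supernatant", "wildtype", "fluHA", "replicate_1", "ACGTACGTACGTACGT", "2"]]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "toy_viral_bc_counts.csv.gz")  # hypothetical file name

    # Write gzip-compressed CSV text.
    with gzip.open(path, "wt", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)

    # Read it back to confirm the header and row survive the round trip.
    with gzip.open(path, "rt", newline="") as handle:
        read_back = list(csv.reader(handle))
```

Note that `csv.reader` returns every field as a string, so counts would need converting back to integers before any numeric comparison.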

