# Illumina 10X transcript coverage for selected transcripts
This Python Jupyter notebook plots the 10X Illumina coverage for a handful of selected transcripts.
These are all the transcripts in the viral GTF plus a few additional housekeeping host transcripts whose names are hardcoded below.

## Parameters for notebook
First, set the parameters for the notebook.
That should be done in the next cell, which is tagged as a `parameters` cell to enable [papermill parameterization](https://papermill.readthedocs.io/en/latest/usage-parameterize.html):

In [None]:
# parameters cell; in order for the notebook to run this cell must define:
#  - samples_10x: list of 10X samples
#  - input_fastq10x_bams: list of BAM files with alignments of 10X reads for each sample
#  - input_fastq10x_bais: BAM indices for each file in `input_fastq10x_bams`
#  - input_viral_gtf: GTF file with viral genes
#  - input_gtf: GTF with all cellular and viral genes

Check that the input lists all have the same length:

In [None]:
assert len(samples_10x) == len(input_fastq10x_bams) == len(input_fastq10x_bais)

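One additional parameter is hardcoded for the analysis. Coverage is computed for **transcript** entries with these names, in addition to the entries in `input_viral_gtf`. Note that although the variable is called `gene_names`, the values are transcript names: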

In [None]:
gene_names = ['ACTB-201',
              'RPL32-201',
              ]

## Import Python modules

In [None]:
import collections
import itertools
import re

import pandas as pd

from plotnine import *

import pysam

## Get entries of interest from the GTF
We go through `input_gtf` and get all **transcripts** that are in `gene_names` as well as any entries in `input_viral_gtf`:

In [None]:
with open(input_viral_gtf) as f:
    # keep every non-comment line; iterate the file directly rather than via readlines()
    viral_gtf_lines = set(line for line in f if not line.startswith('#'))

print(f"Read {len(viral_gtf_lines)} entries from {input_viral_gtf}")

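# match any of the names as a complete quoted attribute value; escape regex metacharacters
gene_name_regex = re.compile('"(' + '|'.join(map(re.escape, gene_names)) + ')"')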

print(f"Searching for transcripts of interest in {input_gtf}")
gtf_entries = []
with open(input_gtf) as f:
    for line in f:
        if line in viral_gtf_lines:
            gtf_entries.append(line)
        elif gene_name_regex.search(line):
            feature_type = line.split('\t')[2]
            if feature_type == 'transcript':
                gtf_entries.append(line)
print(f"Overall, found {len(gtf_entries)} GTF lines of interest")
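
The filtering logic above can be illustrated on a few toy lines. This is only a sketch: the GTF lines, the `toy` source field, and the names `transcript_names`, `name_regex`, and `kept` are made up for illustration and are not part of the notebook's inputs. It shows that a host line is kept only when one of the names appears as a complete quoted value and the feature type in the third column is `transcript`.

```python
import re

# Hypothetical names and toy GTF lines, for illustration only
transcript_names = ['ACTB-201', 'RPL32-201']
name_regex = re.compile('"(' + '|'.join(re.escape(n) for n in transcript_names) + ')"')

toy_gtf_lines = [
    'chr7\ttoy\tgene\t100\t500\t.\t-\t.\tgene_id "G1"; gene_name "ACTB";',
    'chr7\ttoy\ttranscript\t100\t500\t.\t-\t.\tgene_id "G1"; gene_name "ACTB"; transcript_name "ACTB-201";',
    'chr7\ttoy\texon\t100\t200\t.\t-\t.\tgene_id "G1"; gene_name "ACTB"; transcript_name "ACTB-201";',
    'chr7\ttoy\ttranscript\t100\t450\t.\t-\t.\tgene_id "G1"; gene_name "ACTB"; transcript_name "ACTB-202";',
]

# Keep lines that mention a requested name AND are `transcript` features
kept = [line for line in toy_gtf_lines
        if name_regex.search(line) and line.split('\t')[2] == 'transcript']
```

Here the `gene` line is skipped because `"ACTB"` is not one of the quoted names, the `exon` line is skipped because of its feature type, and the `ACTB-202` transcript is skipped because it was not requested.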

For each GTF entry, get:
 - name of the transcript: first try `gene_name`, and if that is not unique then try `transcript_name`
 - contig (chromosome) on which the transcript is found
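 - start of the transcript on contig, converted from the 1-based GTF coordinate to 0-based Python indexing
 - end of the transcript on contig in Python indexing, i.e. an exclusive end (the inclusive 1-based GTF end is used unchanged)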
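
The coordinate convention in the last two bullets can be sketched as follows. The helper name `gtf_to_python_interval` and the example coordinates are hypothetical. GTF coordinates are 1-based with an inclusive end, so subtracting 1 from the start and leaving the end unchanged gives a 0-based half-open interval.

```python
def gtf_to_python_interval(gtf_start, gtf_end):
    """Convert 1-based inclusive GTF coordinates to a 0-based half-open interval."""
    return gtf_start - 1, gtf_end


# A feature covering GTF positions 101..200 (inclusive) spans 100 bases
start, end = gtf_to_python_interval(101, 200)
length = end - start
```

With this convention `end - start` is the number of bases covered, and the pair can be passed directly as `start` and `stop` to slicing-style APIs.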

In [None]:
contigs = [gtf_entry.split('\t')[0] for gtf_entry in gtf_entries]
starts = [int(gtf_entry.split('\t')[3]) - 1 for gtf_entry in gtf_entries]
ends = [int(gtf_entry.split('\t')[4]) for gtf_entry in gtf_entries]

names = [re.search(r'gene_name "([\w\-]+)"', gtf_entry).group(1)
         for gtf_entry in gtf_entries]
name_counts = collections.Counter(names)
for i, (name, gtf_entry) in enumerate(zip(names, gtf_entries)):
    if name_counts[name] > 1:
        # gene name is shared by several entries, so fall back to the transcript name
        names[i] = re.search(r'transcript_name "([\w\-]+)"', gtf_entry).group(1)
if len(names) != len(set(names)):
    raise ValueError('transcript names are not unique')

entries_df = pd.DataFrame({'name': names,
                           'contig': contigs,
                           'start': starts,
                           'end': ends})
entries_df
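
As a small illustration of the naming rule used above, the sketch below applies it to made-up attribute strings (`toy_entries` and the identifiers in them are hypothetical). Entries whose `gene_name` occurs more than once fall back to `transcript_name`, while entries with a unique `gene_name` keep it.

```python
import collections
import re

# Toy GTF attribute columns, for illustration only
toy_entries = [
    'gene_id "G1"; gene_name "ACTB"; transcript_name "ACTB-201";',
    'gene_id "G1"; gene_name "ACTB"; transcript_name "ACTB-202";',
    'gene_id "G2"; gene_name "RPL32"; transcript_name "RPL32-201";',
]

toy_names = [re.search(r'gene_name "([\w\-]+)"', e).group(1) for e in toy_entries]
counts = collections.Counter(toy_names)

# Fall back to the transcript name only where the gene name is ambiguous
toy_names = [
    re.search(r'transcript_name "([\w\-]+)"', e).group(1) if counts[n] > 1 else n
    for n, e in zip(toy_names, toy_entries)
]
```

The two `ACTB` entries end up named by transcript, and `RPL32` keeps its gene name.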

## Get coverage from BAM file
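Now we get the coverage over each transcript of interest for each sample into `coverage_df`, and also compute the relative coverage (the coverage at each site divided by the coverage at the site with the highest coverage for that transcript and sample). Sites are numbered from 1 along the annotated interval, and `dist_from_3end` is the distance from the last site of that interval, so it corresponds to the 3' end only for transcripts annotated on the + strand: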

In [None]:
cols = ['sample', 'gene', 'site', 'dist_from_3end', 'coverage', 'rel_coverage']
coverage_list = []

for sample, bam, bai in zip(samples_10x,
                            input_fastq10x_bams,
                            input_fastq10x_bais):
    with pysam.AlignmentFile(bam, mode='rb', index_filename=bai) as bamfile:
        for tup in entries_df.itertuples():
            coverage_list.append(
                pd.DataFrame(dict(zip('ACGT',
                                      bamfile.count_coverage(contig=tup.contig,
                                                             start=tup.start,
                                                             stop=tup.end)
                                      )
                                  )
                             )
                .assign(coverage=lambda x: x.sum(axis=1),
                        rel_coverage=lambda x: x['coverage'] / x['coverage'].max(),
                        site=lambda x: x.index + 1,
                        dist_from_3end=lambda x: x['site'].max() - x['site'],
                        gene=tup.name,
                        sample=sample)
                [cols],
                )
            
coverage_df = pd.concat(coverage_list, sort=False, ignore_index=True)
coverage_df
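
To make the per-site calculations concrete, here is a minimal sketch on toy counts that needs no BAM file. The tuple `toy_counts` is made up and stands in for the four per-base arrays (in A, C, G, T order) that `count_coverage` returns for one transcript.

```python
import pandas as pd

# Toy per-site counts for A, C, G, T over a 4-site interval (illustrative values)
toy_counts = ([0, 2, 0, 1], [1, 0, 0, 0], [0, 0, 4, 0], [0, 0, 0, 1])

toy_cov = (
    pd.DataFrame(dict(zip('ACGT', toy_counts)))
    .assign(
        # total depth at each site is the sum over the four nucleotides
        coverage=lambda x: x[list('ACGT')].sum(axis=1),
        # scale so that the deepest site has relative coverage 1
        rel_coverage=lambda x: x['coverage'] / x['coverage'].max(),
        # 1-based site number along the interval
        site=lambda x: x.index + 1,
        # distance from the last site of the interval
        dist_from_3end=lambda x: x['site'].max() - x['site'],
    )
)
```

If a transcript has no coverage at all, the division yields `NaN` for every site, which is worth keeping in mind when reading the plots.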

## Plot max coverage for each transcript
For each sample and transcript, plot the coverage at the site with the greatest depth:

In [None]:
p = (ggplot(coverage_df.groupby(['sample', 'gene'])
                       .aggregate(max_coverage=pd.NamedAgg('coverage', 'max'))
                       .reset_index(),
            aes('gene', 'max_coverage')) +
     geom_bar(stat='identity') +
     facet_wrap('~ sample', nrow=1) +
     theme(figure_size=(0.25 * coverage_df['gene'].nunique() * coverage_df['sample'].nunique(), 2),
           axis_text_x=element_text(angle=90),
           )
     )
_ = p.draw()
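
The aggregation that feeds this plot can be checked on a toy data frame. The sample and gene labels below are made up; the sketch only shows that grouping by `sample` and `gene` and taking the maximum of `coverage` gives one row per sample and transcript.

```python
import pandas as pd

# Toy coverage values, for illustration only
toy_df = pd.DataFrame({
    'sample': ['s1', 's1', 's1', 's2', 's2', 's2'],
    'gene': ['geneA', 'geneA', 'geneB', 'geneA', 'geneA', 'geneB'],
    'coverage': [3, 7, 2, 5, 1, 9],
})

max_cov = (toy_df.groupby(['sample', 'gene'])
           .aggregate(max_coverage=pd.NamedAgg('coverage', 'max'))
           .reset_index())
```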

## Plot relative coverage over whole transcript

In [None]:
p = (ggplot(coverage_df, aes('site', 'rel_coverage')) +
     geom_area() +
     facet_grid('sample ~ gene', scales='free_x') +
     theme(figure_size=(1.5 * coverage_df['gene'].nunique(),
                        2 * coverage_df['sample'].nunique()),
           axis_text_x=element_text(angle=90))
     )
_ = p.draw()

## Plot relative coverage over just the last 600 nucleotides of each transcript

In [None]:
p = (ggplot(coverage_df.query('dist_from_3end < 600'),
            aes('dist_from_3end', 'rel_coverage')) +
     geom_area() +
     scale_x_reverse() +
     facet_grid('sample ~ gene', scales='free') +
     theme(figure_size=(1.5 * coverage_df['gene'].nunique(),
                        2 * coverage_df['sample'].nunique()),
           axis_text_x=element_text(angle=90))
     )
_ = p.draw()
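
As a quick sanity check of the `dist_from_3end < 600` filter, the sketch below uses a made-up 1000-site interval. The strict inequality keeps distances 0 through 599, i.e. exactly the last 600 sites.

```python
import pandas as pd

# Toy 1000-site interval with the same site and distance definitions as above
toy = (pd.DataFrame({'site': range(1, 1001)})
       .assign(dist_from_3end=lambda x: x['site'].max() - x['site']))

tail = toy.query('dist_from_3end < 600')
```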
