# SAM Alignment Quality

This notebook calculates and plots features relevant to alignment quality, such as the number of mapped and unmapped reads and the edit distance (the smallest edit distance, if secondary alignments are present), for a set of SAM files.

## Requirements
* A directory containing one or more .sam or .bam files. Records must contain the optional field "NM".
* Docker
* Python packages: matplotlib, ipyplot, pandas, scipy, numpy, itables
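
To check the "NM" requirement on a text SAM file, a minimal sketch along the following lines may help. `get_nm_tag` is a hypothetical helper written for illustration (it is not part of the PHG plugin), and the record is made up. It relies only on the SAM convention that optional `TAG:TYPE:VALUE` fields follow the 11 mandatory columns. BAM files are binary, so they would first need to be converted to text.

```python
from typing import Optional


def get_nm_tag(sam_line: str) -> Optional[int]:
    """Return the NM value from one SAM alignment line, or None if absent.

    Hypothetical helper for illustration; not part of the PHG plugin.
    """
    fields = sam_line.rstrip("\n").split("\t")
    # The first 11 columns of a SAM record are mandatory; optional
    # TAG:TYPE:VALUE fields start at column 12.
    for field in fields[11:]:
        if field.startswith("NM:i:"):
            return int(field[len("NM:i:"):])
    return None


# Made-up record, for illustration only
record = "read1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNM:i:2\tAS:i:16"
print(get_nm_tag(record))  # 2
```

A record for which this returns `None` lacks the field that the edit-distance plots rely on.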

## Configuration and User Variables

Run the following three cells to set up your environment. In the cell labeled "Edit Me", change the directories to match your filesystem. Note that paths inside the Docker container differ from the absolute paths on your machine.
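
The difference between host paths and container paths can be sketched as follows, assuming (as in the plugin cell of this notebook) that the working directory is mounted at `/phg/` inside the container. `to_container_path` is a hypothetical helper used only for illustration.

```python
import posixpath


def to_container_path(relative_dir: str, mount_point: str = "/phg/") -> str:
    """Map a directory relative to the working directory to its in-container path.

    Hypothetical helper; assumes the working directory is mounted at mount_point.
    """
    # Strip leading slashes so that join does not treat the path as absolute
    return posixpath.join(mount_point, relative_dir.lstrip("/"))


print(to_container_path("/sam/"))    # /phg/sam/
print(to_container_path("metrics"))  # /phg/metrics
```

On the host, the same folders live under `working_dir`, which is why the subfolders are given relative to it.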

In [None]:
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import ipyplot
import os
import pandas
import scipy.stats as st
from scipy.stats import poisson
from scipy.optimize import curve_fit
from collections import Counter
import numpy as np
import math
import random
from itables import init_notebook_mode
tmp = init_notebook_mode(all_interactive=True)

In [None]:
###########
# EDIT ME #
###########

# Directories

# Main directory
working_dir = "/path/to/working/directory/"

# Note: give the following directories relative to the working directory,
# because the absolute path is different inside the Docker container

# SAM subfolder
sam_dir="/sam/"
# Metrics subfolder
out_dir="/metrics/"
# Figures subfolder
fig_dir="/figs/"

# Docker-specific parameters

# Docker image and version
DOCKER_VERSION="maizegenetics/phg:latest"
# Docker path - location where Docker is installed
# Note: on CBSU machines this is docker1
DOCKER_PATH="docker"

# Docker-specific directories - Do not edit directly
PHG_SAM_DIR="/phg/" + sam_dir
PHG_OUT_DIR="/phg/" + out_dir


In [None]:
# Make metrics and figure directories, if they don't already exist

os.makedirs(working_dir + "/" + out_dir, exist_ok=True)
os.makedirs(working_dir + "/" + fig_dir, exist_ok=True)

# Run Plugin

First, we run the PHG plugin SAMMetricsPlugin to extract information from the SAM files. This step may take a few minutes per file, depending on file size.
Alternatively, you may run the plugin outside this notebook: just make sure that the `-outDir` parameter points to the same directory as `out_dir` in the "Edit Me" cell above.
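
If the plugin was run outside the notebook, a quick check that the metrics directory holds the files the later cells read might look like the sketch below. `missing_metric_files` is a hypothetical helper, the sample names are made up, and the file names are inferred from how the later cells of this notebook load the plugin output.

```python
import os
import tempfile


def missing_metric_files(metrics_dir, sample_names):
    """Return the expected metric files that are not present in metrics_dir.

    Hypothetical helper. File names follow the way later cells in this
    notebook read the plugin output.
    """
    expected = ["SAM_summary_statistics.txt"]
    expected += [name + "_editDistances.txt" for name in sample_names]
    return [f for f in expected if not os.path.isfile(os.path.join(metrics_dir, f))]


# Demonstration in a throwaway directory with made-up sample names
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "SAM_summary_statistics.txt"), "w").close()
    open(os.path.join(demo_dir, "sampleA_editDistances.txt"), "w").close()
    demo_missing = missing_metric_files(demo_dir, ["sampleA", "sampleB"])

print(demo_missing)  # ['sampleB_editDistances.txt']
```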

In [None]:
# Run SAMMetricsPlugin

! {DOCKER_PATH} run --name sam_metrics --rm \
    -v {working_dir}/:/phg/ \
    -t {DOCKER_VERSION} \
    /tassel-5-standalone/run_pipeline.pl -Xmx50G -debug \
    -SAMMetricsPlugin \
        -samDir {PHG_SAM_DIR} \
        -outDir {PHG_OUT_DIR} \
    -endPlugin

# Summary Table and Plots

Run the cells below to view a summary of the number of mapped and unmapped reads in each SAM file.

In [None]:
# Load table from file and print

summary_table = pandas.read_csv(working_dir + "/" + out_dir + "/SAM_summary_statistics.txt", sep="\t")
samples = [str(x) for x in summary_table["fileName"]]

# Calculate proportion of mapped reads

summary_table["percentReadsMapped"] = summary_table["mappedReads"]/(summary_table["mappedReads"] + summary_table["unmappedReads"])
summary_table = summary_table[["fileName", "mappedReads", "unmappedReads", "percentReadsMapped", "meanMinEditDistance", "medianMinEditDistance"]]

summary_table

In [None]:
# Generate a bar plot of the percentage of reads mapped for each SAM file

fig, ax = plt.subplots()

ax.bar(samples, summary_table["percentReadsMapped"])

ax.set_ylim([0, 1])

ax.set_yticks([0, 0.2, 0.4, 0.6, 0.8, 1])
ax.set_yticklabels(["0%", "20%", "40%", "60%", "80%", "100%"])
ax.set_xticks(ax.get_xticks())
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right")
ax.set_axisbelow(True)
plt.grid()

plt.title("Percentage of Reads Mapped")

plt.show()


# Distribution of Alignment Edit Distances

For each fasta/fastq read that is successfully aligned to the reference sequence, minimap2 reports an edit distance: the number of single-base insertions, deletions, or substitutions required to transform the query sequence into the reference sequence. An edit distance of 0 means that the query sequence matched the reference sequence perfectly.
The following cells generate a histogram of the edit distances of the mapped reads in each SAM file (using the smallest edit distance if a read has multiple alignments). The plots are saved to the `fig_dir` directory for later display or reference. In the "Edit Me" cell below, set parameters to change the look of the plots.
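
As a toy illustration of the two ideas above, the sketch below computes a plain Levenshtein edit distance between two short sequences and then keeps the smallest value per read. Both functions are hypothetical helpers and the sequences are made up. This is not how the aligner or the plugin computes NM: an aligner's NM value typically covers only the aligned portion of a read, so treat this as a conceptual sketch.

```python
def edit_distance(query: str, reference: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    # previous[j] = distance between the query prefix seen so far
    # and the first j bases of the reference
    previous = list(range(len(reference) + 1))
    for i, q_base in enumerate(query, start=1):
        current = [i]
        for j, r_base in enumerate(reference, start=1):
            substitution = previous[j - 1] + (q_base != r_base)
            deletion = previous[j] + 1       # drop a query base
            insertion = current[j - 1] + 1   # add a reference base
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def smallest_per_read(records):
    """Keep the smallest edit distance seen for each read name."""
    best = {}
    for read_name, nm in records:
        best[read_name] = min(nm, best.get(read_name, nm))
    return best


print(edit_distance("ACGTACGT", "ACGTACGT"))  # 0 (perfect match)
print(edit_distance("ACGTACGT", "ACGAACGT"))  # 1 (one substitution)
print(edit_distance("ACGTACGT", "ACGACGT"))   # 1 (one deletion)
print(smallest_per_read([("read1", 3), ("read1", 1), ("read2", 0)]))
# {'read1': 1, 'read2': 0}
```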

In [None]:
###########
# EDIT ME #
###########

# figure prefix - this is prepended to the figure file name when it is saved
# use this if you want to generate multiple versions of the figures and keep them in the same folder
fig_prefix = ""

# whether to display the y-axis on a logarithmic scale
is_log_y = False

# histogram plot parameter: how to display the histogram. bar is the default
# Allowed types are: bar, barstacked, step, stepfilled
hist_type = "bar"

# show the best-fit Poisson distributions on the plots
show_fit_curve = True

# image file type
fig_ext = ".png"

# if you would like to generate graphs for only a subset of your SAM files, list the subset here
# format the names as they appear in the fileName column in summary_table
# to generate graphs for all SAM files, use subsample = samples
subsample = samples


In [None]:
# Generate histogram for each SAM file
# Note: this cell does not display the figures - for that, see the cells below

def fit_func(k, lamb):
    return poisson.pmf(k, lamb)    

# keeping track of the poisson distribution parameters
parameters_list = []

plt.ioff()

for sample in subsample:
    
    # read in data
    file_name = working_dir + "/" + out_dir + "/" + sample + "_editDistances.txt"
    with open(file_name, "r") as file:
        lines = [int(x.strip("\n")) for x in file.readlines()]
    
    
    # poisson fit  
    # count editDistances
    distance_counter = Counter(lines)

    # extract the distinct edit distances (bins) and the number of reads at each
    bins = [x for x in distance_counter.keys()]
    counts = [distance_counter[x] for x in bins]
    sum_counts = sum(counts)

    # fit poisson curve to counts
    # need to scale things so counts sum to 1
    counts_scaled = [y/sum_counts for y in counts]
    # start the fit from the sample mean, the maximum-likelihood estimate of lambda
    parameters, cov_matrix = curve_fit(fit_func, bins, counts_scaled, p0=[np.mean(lines)])

    # get the curve that we fit and scale it so it makes sense on the axis with the histogram
    linspace = np.linspace(0, max(lines), max(lines) + 1)
    expected_curve = poisson.pmf(linspace, parameters) * sum_counts
    parameters_list.extend(parameters)
    
    # plot
    fig = plt.figure()
    ax = plt.subplot(111)
    
    #ax.hist(lines, max(lines)+1, log=is_log_y, histtype=hist_type)
    ax.bar(list(distance_counter.keys()), list(distance_counter.values()), width=1.0)
    ax.set_xlabel("edit distance (NM)")
    ax.set_ylabel("number of mapped reads")
    
    if show_fit_curve:
        ax.plot(linspace, expected_curve, color="red")
    
    if is_log_y:
        # honor is_log_y: switch the y-axis to a logarithmic scale
        ax.set_yscale("log")
        ax.set_ylabel("number of mapped reads (log)")
    else:
        ax.set_ylabel("number of mapped reads")
        ax.get_yaxis().set_major_formatter(matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
    
    plt.title(sample + " alignment edit distances")
    
    fig.savefig(working_dir + "/" + fig_dir + "/" + fig_prefix + sample + fig_ext)
    plt.close()

tmp = plt.ion()

In [None]:
# For Poisson curves fit to the edit distance distributions, show the distribution of the lambda parameter

fig, ax = plt.subplots(2, 1)

ax[0].boxplot(parameters_list, vert=False)
ax[1].scatter(parameters_list, [0.95 + random.random()/10.0 for x in parameters_list])
ax[1].set_ylim([0.9, 1.1])

ax[1].set_xlabel("lambda")

ax[0].yaxis.set_major_locator(ticker.NullLocator())
ax[1].yaxis.set_major_locator(ticker.NullLocator())

ax[0].set_title("Distribution of lambda parameters of best-fit Poisson distributions")
plt.show()


In [None]:
# Display figures side-by-side

fig_names = [(working_dir + "/" + fig_dir + "/" + fig_prefix + x + fig_ext) for x in subsample]

figs = [plt.imread(f) for f in fig_names]

ipyplot.plot_images(figs, img_width=400)


# Comparative Histograms

Sometimes, you may wish to overlay two histograms in order to compare them directly. For example, you might have aligned your fastqs to two different reference sequences. Use the cells below to display pairs of overlaid histograms.

Make sure you have run the "Run Plugin" cells on all of the SAM files you wish to use. If you are comparing alignments of the same fastq file across multiple references or settings, use a separate `out_dir` for each to prevent files from being overwritten.

Next, create a config file with the pairs you would like to graph. This file should be a plain-text table with the following columns (separated by tabs, and including the header):
* **fig_title**: the title of the figure
* **fig_name**: the file name for the generated figure (not including extension)
* **file_1**: the path to the first metrics file. It should end with `_editDistances.txt`
* **label_1**: the label to use in the legend for the first file
* **file_2**: same as **file_1**, but for the second file
* **label_2**: same as **label_1**, but for the second file

| fig_title | fig_name | file_1 | label_1 | file_2 | label_2 |
| :-------- | :------- | :----- | :------ | :----- | :------ |
| Example figure title | example_fig_name | /path/to/file_1_editDistances.txt | method1 | /path/to/file_2_editDistances.txt | method2 |


Set the parameters in the "Edit Me" cell below and run the following cells to see your figures.
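
The config file described above could be produced with Python's `csv` module, as in the sketch below. The row values are the placeholder values from the example table, `write_config` is a hypothetical helper, and the output is written to an in-memory buffer here so that nothing on disk is touched.

```python
import csv
import io

COLUMNS = ["fig_title", "fig_name", "file_1", "label_1", "file_2", "label_2"]

# Placeholder values copied from the example table; replace with your own
rows = [
    {
        "fig_title": "Example figure title",
        "fig_name": "example_fig_name",
        "file_1": "/path/to/file_1_editDistances.txt",
        "label_1": "method1",
        "file_2": "/path/to/file_2_editDistances.txt",
        "label_2": "method2",
    }
]


def write_config(handle, config_rows):
    """Write a tab-separated config table, header included (hypothetical helper)."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(config_rows)


buffer = io.StringIO()
write_config(buffer, rows)
config_text = buffer.getvalue()
print(config_text)
```

To write a real file, pass a handle opened with `open(config_file, "w", newline="")` instead of the buffer.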

In [None]:
###########
# EDIT ME #
###########

# Path to the config file (see paragraph above for details)
config_file = working_dir + "/sam_config.txt"

# figure prefix - this is prepended to the figure file name when it is saved
# use this if you want to generate multiple versions of the same figure and keep them in the same folder
fig_prefix = ""

# whether or not to display the y-axis on a logarithmic scale
is_log_y = False

# Note: for overlaid plots, I recommend one of the following settings

# hist_type="step", hist_alpha=1 (Only the outlines of the histogram will show)
# hist_type="stepfilled", hist_alpha=0.4 (Histograms will be transparent)

# histogram plot parameter: how to display the histogram. bar is the default
# Allowed types are: bar, barstacked, step, stepfilled
hist_type = "step"

# fill opacity
hist_alpha = 1.0

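# image file type
# use a raster format such as .png: the display cell below loads the figures with plt.imread, which cannot read .svg
fig_ext = ".png"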


In [None]:
# Load config file
config = pandas.read_csv(config_file, sep="\t")
config

In [None]:
plt.ioff()

for index, row in config.iterrows():
    fig=plt.figure()
    ax=plt.subplot(111)
    
    with open(row["file_1"], "r") as file:
        lines_1 = [int(x.strip("\n")) for x in file.readlines()]
    with open(row["file_2"], "r") as file:
        lines_2 = [int(x.strip("\n")) for x in file.readlines()]
    
    # use the same unit-width bins, centered on integer edit distances, for both files
    shared_bins = np.arange(max(max(lines_1), max(lines_2)) + 2) - 0.5
    ax.hist(lines_1, shared_bins, histtype=hist_type, alpha=hist_alpha, log=is_log_y, label=row["label_1"])
    ax.hist(lines_2, shared_bins, histtype=hist_type, alpha=hist_alpha, log=is_log_y, label=row["label_2"])
    
    ax.set_xlabel("edit distance (NM)")
    
    if(is_log_y):
        ax.set_ylabel("number of mapped reads (log)")
    else:
        ax.set_ylabel("number of mapped reads")
        ax.get_yaxis().set_major_formatter(matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
    
    plt.title(row["fig_title"])
    plt.legend()
    
    fig.savefig(working_dir + "/" + fig_dir + "/" + fig_prefix + row["fig_name"] + fig_ext)
    plt.close()

tmp = plt.ion()

In [None]:
# Display figures side-by-side

fig_names = [(working_dir + "/" + fig_dir + "/" + fig_prefix + x + fig_ext) for x in config["fig_name"]]

figs = [plt.imread(f) for f in fig_names]

ipyplot.plot_images(figs, img_width=400)