# Variant Calling with ONT data

Here we will consider the use of Clair3 for variant calling with Oxford Nanopore data.

Clair3 is available from https://github.com/HKU-BAL/Clair3

The data we will use are derived from the 1000 Genomes collection and have been sequenced in different ways.


First we need to set up our notebooks to make them look pretty!

In [None]:
# Widen the notebook cells to the full width of the browser window.
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
# Set up IGV support inside the notebook.
import igv_notebook
igv_notebook.init()
# This hides some warnings that we might want to look at one day if our code doesn't work!
import warnings
warnings.filterwarnings('ignore', category=DeprecationWarning)

In [None]:
#These are various graph plotting and data processing tools we may use.
from bokeh.io import output_notebook
from bokeh.plotting import figure, show
import numpy as np
import pandas as pd


#This is a nice plotting library that will also do some pretty graphics for us.
import aplanat
from aplanat import points
from aplanat import graphics
from aplanat.hist import histogram
from aplanat.lines import steps
from bokeh.layouts import gridplot


#A library to manipulate SAM/BAM files
import pysam



Here are three samples; choose one of them to look at as you work through the notebook.

In [None]:
sampleA = "../student_projects_2022/data/malaysia_23/HG01280_P1_hg38.bam"
sampleAindex = "../student_projects_2022/data/malaysia_23/HG01280_P1_hg38.bam.bai"
sampleB = "../student_projects_2022/data/malaysia_23/GM18871.subset.bam"
sampleBindex = "../student_projects_2022/data/malaysia_23/GM18871.subset.bam.bai"
sampleC = "../student_projects_2022/data/malaysia_23/HG01280_P2_hg38.bam"
sampleCindex = "../student_projects_2022/data/malaysia_23/HG01280_P2_hg38.bam.bai"
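
Each BAM needs its index alongside it, so it can be worth checking that all six paths resolve before going further. The following is a minimal sketch, assuming the index is always the BAM path plus .bai (as in the paths above); missing_inputs is a hypothetical helper, not part of any library.

```python
import os

def missing_inputs(bam_paths):
    """Return the BAM or index paths that do not exist on disk."""
    missing = []
    for bam in bam_paths:
        # Assumption: the index sits next to the BAM and is named <bam>.bai
        for path in (bam, bam + ".bai"):
            if not os.path.exists(path):
                missing.append(path)
    return missing
```

Calling missing_inputs([sampleA, sampleB, sampleC]) should return an empty list when everything is in place.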

First let's download these files and have a look at them in IGV.


In [None]:
from IPython.display import FileLink
FileLink(sampleA)



In [None]:
FileLink(sampleAindex)

In [None]:
FileLink(sampleB)


In [None]:
FileLink(sampleBindex)

In [None]:
FileLink(sampleC)

In [None]:
FileLink(sampleCindex)

To look at these, download all six files and then upload them to the IGV web app:

https://igv.org/app/

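We are going to look at the region of the BAM file given by the coordinates in the BED file below. (If quick_demo.bed does not exist yet, first run the cell further down that creates it.)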

In [None]:
!cat quick_demo.bed

Choose the sample you want to look at by setting sample1 equal to sampleA or sampleB or sampleC.

In [None]:
sample1 = sampleB

First let's look at the header of the BAM file to work out what it is telling us.

In [None]:
!samtools view -H {sample1}
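
The header is plain text. Each line starts with a record type (@HD for the file itself, @SQ for each reference sequence, @RG for read groups, @PG for the programs that produced the file, @CO for comments) followed by tab-separated TAG:value fields. As a rough sketch, the helper below groups those lines so they are easier to scan. The name parse_sam_header is made up for this example, and the header text is invented rather than taken from our samples.

```python
from collections import defaultdict

def parse_sam_header(text):
    """Group SAM header lines by record type (HD, SQ, RG, PG, CO)."""
    records = defaultdict(list)
    for line in text.splitlines():
        if not line.startswith("@"):
            continue  # not a header line
        record_type, *fields = line.split("\t")
        if record_type == "@CO":
            # Comment lines are free text, not TAG:value pairs
            records["CO"].append("\t".join(fields))
            continue
        # Split each field on the first colon only; values may contain colons
        records[record_type[1:]].append(
            dict(field.split(":", 1) for field in fields if ":" in field)
        )
    return dict(records)

# Invented example header
example = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:chr7\tLN:159345973\n"
    "@PG\tID:minimap2\tPN:minimap2\n"
)
header = parse_sam_header(example)
print(header["SQ"][0]["SN"], header["SQ"][0]["LN"])
print(header["HD"][0]["SO"])
```

The same function could be fed the text printed by the samtools command above.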

Let's generate some information about our data.

In [None]:
# run the alignment summarizer program
!stats_from_bam {sample1} > {sample1}.bam.stats


df = pd.read_csv(f"{sample1}.bam.stats", sep="\t")

p1 = histogram(
    [df['read_length']], title="Read lengths",
    x_axis_label="read length / bases", y_axis_label="count")
p1.xaxis.formatter.use_scientific = False
p2 = histogram(
    [df['acc']], title="Read accuracy",
    x_axis_label="% accuracy", y_axis_label="count")
aplanat.show(gridplot((p1, p2), ncols=2))

In [None]:
summary = graphics.InfoGraphItems()
summary.append(label='Total reads', value=len(df.name.unique()), icon='angle-up', unit='')
summary.append('Total yield', df.read_length.sum(), 'signal', 'b')
summary.append('Mean read length', df.read_length.sum()/len(df.name.unique()), 'align-center', 'b')
summary.append('Mean read identity', df.iden.mean(), 'thumbs-up')
summary.append('Mean read accuracy', df.acc.mean(), 'thumbs-up')
plot = graphics.infographic(summary.values())
aplanat.show(plot, background='#f4f4f4')
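
The numbers in the infographic are simple to reproduce by hand. Below is a small sketch working from a plain list of read lengths; it also reports the read N50, which the infographic does not show. read_length_summary is a hypothetical helper and the lengths are toy values.

```python
def read_length_summary(lengths):
    """Return read count, total yield, mean length and N50 for a list of read lengths."""
    if not lengths:
        raise ValueError("no reads supplied")
    total = sum(lengths)
    # N50: the length L such that reads of length >= L hold at least half of all bases
    running = 0
    n50 = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return {
        "reads": len(lengths),
        "yield": total,
        "mean_length": total / len(lengths),
        "n50": n50,
    }

summary_toy = read_length_summary([1000, 2000, 3000, 4000])
print(summary_toy)
```

With the table loaded above, read_length_summary(df['read_length'].tolist()) would give comparable figures, bearing in mind that the table may contain more than one row per read.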

As you can see we have a small subset of data and not a full human genome!

Now we need to set up some parameters and configure Clair3 for analysis. Because Clair3 is a neural network, it has been trained to expect specific characteristics in the data. One of these is how the data were basecalled, so we need to tell it which "model" to use.

For R9 data the model is r941_prom_sup_g5014; for R10 data the model is r1041_e82_400bps_sup_v420.
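
One way to avoid typing the model name by hand is a small lookup keyed on the flow cell chemistry. This sketch contains only the two models named above; model_for_chemistry is an illustrative helper, not part of Clair3.

```python
# The two model names quoted in the text, keyed by pore chemistry
CLAIR3_MODELS = {
    "R9": "r941_prom_sup_g5014",
    "R10": "r1041_e82_400bps_sup_v420",
}

def model_for_chemistry(chemistry):
    """Return the model name for 'R9' or 'R10' (case-insensitive)."""
    key = chemistry.strip().upper()
    if key not in CLAIR3_MODELS:
        raise ValueError(
            f"unknown chemistry {chemistry!r}; expected one of {sorted(CLAIR3_MODELS)}"
        )
    return CLAIR3_MODELS[key]

print(model_for_chemistry("r10"))
```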

In [None]:
MODEL_NAME="r1041_e82_400bps_sup_v420"

In [None]:
PLATFORM="ont"

In [None]:
BAM=sample1

In [None]:
REF="../student_projects_2022/data/refs/hg38_chr2_7/chr7.hg38.fasta.gz"

In [None]:
CONTIGS="chr7"

In [None]:
START_POS=99500000

In [None]:
END_POS=100000000

In [None]:
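# Write the region as a one-line, tab-separated BED file (printf behaves the same in every shell)
!printf "%s\t%s\t%s\n" $CONTIGS $START_POS $END_POS > quick_demo.bed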
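
BED coordinates are 0-based and half-open, so the region written above covers END_POS - START_POS = 500,000 bases of chr7. A minimal sketch of building and checking such a line in Python, with bed_line and bed_length as made-up helper names:

```python
def bed_line(contig, start, end):
    """Format one BED record; start is 0-based inclusive, end is exclusive."""
    if start < 0 or end <= start:
        raise ValueError(f"invalid interval {start}-{end}")
    return f"{contig}\t{start}\t{end}\n"

def bed_length(line):
    """Number of bases covered by a BED record."""
    _, start, end = line.rstrip("\n").split("\t")[:3]
    return int(end) - int(start)

line = bed_line("chr7", 99500000, 100000000)
print(bed_length(line))  # 500000
```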

In [None]:
THREADS=8

In [None]:
!run_clair3.sh \
  --bam_fn=$BAM \
  --ref_fn=$REF \
  --threads=$THREADS \
  --platform=$PLATFORM \
  --model_path=/opt/tljh/user/bin/models/$MODEL_NAME \
  --output=. \
  --bed_fn=quick_demo.bed
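
The same call can be assembled in Python, which makes it easy to inspect the arguments before a long run starts. This is only a sketch: it uses the same run_clair3.sh options as the cell above and nothing else, clair3_command is a hypothetical helper, and the paths in the example are placeholders.

```python
import shlex

def clair3_command(bam, ref, model_dir, bed, threads=8, platform="ont", output="."):
    """Build the run_clair3.sh argument list without running it."""
    return [
        "run_clair3.sh",
        f"--bam_fn={bam}",
        f"--ref_fn={ref}",
        f"--threads={threads}",
        f"--platform={platform}",
        f"--model_path={model_dir}",
        f"--output={output}",
        f"--bed_fn={bed}",
    ]

# Placeholder paths, for illustration only
cmd = clair3_command("sample.bam", "ref.fasta.gz", "/path/to/models/MODEL", "quick_demo.bed")
print(shlex.join(cmd))
```

Passing the list to subprocess.run(cmd, check=True) would then start the job.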

In [None]:
!zcat merge_output.vcf.gz
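
Scrolling through the raw VCF is hard work. The sketch below tallies the calls instead, using only the fixed VCF columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER). summarise_vcf is an illustrative helper and the three records are invented; for the real file the lines could come from gzip.open("merge_output.vcf.gz", "rt").

```python
from collections import Counter

def summarise_vcf(lines):
    """Count records as 'snv', 'other' (indels etc.) or 'filtered' (FILTER not PASS)."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, pos, vid, ref, alt, qual, flt = line.rstrip("\n").split("\t")[:7]
        if flt not in ("PASS", "."):
            counts["filtered"] += 1
            continue
        # A record is an SNV only if REF and every ALT allele are a single base
        is_snv = len(ref) == 1 and all(len(a) == 1 for a in alt.split(","))
        counts["snv" if is_snv else "other"] += 1
    return counts

# Invented records for illustration
records = [
    "##fileformat=VCFv4.2",
    "chr7\t99501000\t.\tA\tG\t20.5\tPASS\t.\tGT:GQ\t0/1:20",
    "chr7\t99502000\t.\tCT\tC\t5.1\tLowQual\t.\tGT:GQ\t0/1:5",
    "chr7\t99503000\t.\tG\tGA\t18.0\tPASS\t.\tGT\t1/1",
]
vcf_counts = summarise_vcf(records)
print(vcf_counts)
```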


Now we wish to determine if any of these variants are clinically significant.


In [None]:
from IPython.display import FileLink
filename="merge_output.vcf.gz"
FileLink(filename)

We can attempt to annotate this file using the Ensembl Variant Effect Predictor (VEP): download the VCF and upload it to https://www.ensembl.org/Homo_sapiens/Tools/VEP

We will attempt to run some local annotation as well!

In [None]:
!snpEff

In [None]:
import os
dblocation = os.path.abspath("../student_projects_2022/data/snpdbs/")


In [None]:
!java -Xmx8g -jar /opt/tljh/user/share/snpeff-5.2-0/snpEff.jar -nodownload -dataDir $dblocation hg38kg merge_output.vcf.gz > merge_output.ann.vcf

In [None]:
!cat merge_output.ann.vcf
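
snpEff writes its predictions into an ANN entry in the INFO column: a comma-separated list of annotations, each a "|"-separated record whose first four fields are the allele, the effect, the predicted impact and the gene name. As a rough sketch of pulling those out (ann_effects is a made-up helper and the INFO string is invented):

```python
def ann_effects(info):
    """Return (gene, effect, impact) for each snpEff ANN entry in an INFO string."""
    results = []
    for entry in info.split(";"):
        if not entry.startswith("ANN="):
            continue
        for annotation in entry[len("ANN="):].split(","):
            fields = annotation.split("|")
            if len(fields) < 4:
                continue  # skip malformed annotations
            allele, effect, impact, gene = fields[:4]
            results.append((gene, effect, impact))
    return results

# Invented INFO string; GENE_ID is a placeholder, not a real identifier
info = (
    "DP=30;ANN=G|missense_variant|MODERATE|CYP3A5|GENE_ID|transcript,"
    "G|intron_variant|MODIFIER|CYP3A5|GENE_ID|transcript"
)
effects = ann_effects(info)
print(effects)
```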

In [None]:
!java -Xmx8g -jar /opt/tljh/user/share/snpsift-5.2-0/SnpSift.jar annotate $dblocation/clinvar.vcf.gz merge_output.ann.vcf > merge_output.ann.clinvar.vcf


In [None]:
!grep CYP merge_output.ann.clinvar.vcf

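To investigate this file we need to filter down to the variants that ClinVar has an entry for, that is, those carrying a CLNSIG (clinical significance) annotation. Note that this keeps benign as well as pathogenic assertions. We can do that as follows: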

In [None]:
!cat merge_output.ann.clinvar.vcf \
    | java -jar /opt/tljh/user/share/snpsift-5.2-0/SnpSift.jar filter \
    "(exists CLNSIG)" \
    > merge_output.ann.clinvar.filtered.vcf

In [None]:
!cat merge_output.ann.clinvar.filtered.vcf
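
The CLNSIG field can hold anything from Benign to Pathogenic, so it is worth tallying its values rather than reading the file line by line. A minimal sketch, assuming CLNSIG appears as a plain KEY=value entry in the INFO column; clnsig_counts is a hypothetical helper and the records are invented.

```python
from collections import Counter

def clnsig_counts(lines):
    """Count the CLNSIG values found in the INFO column of VCF records."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):
            continue  # header line
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 8:
            continue  # not a complete VCF record
        for entry in columns[7].split(";"):
            key, _, value = entry.partition("=")
            if key == "CLNSIG":
                counts[value] += 1
    return counts

# Invented records for illustration
records = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr7\t99501000\t.\tA\tG\t20.5\tPASS\tDP=30;CLNSIG=Benign",
    "chr7\t99600000\t.\tT\tC\t22.0\tPASS\tCLNSIG=drug_response;DP=28",
    "chr7\t99700000\t.\tG\tA\t19.0\tPASS\tDP=25",
]
significance = clnsig_counts(records)
print(significance)
```

For the real file, the lines could be read with open("merge_output.ann.clinvar.filtered.vcf").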

This pipeline is essentially what is run in EPI2ME Labs, which we will also look at.