# Introduction

As before, we need to set up our notebook with the code required to analyse our data.

In [None]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

In [None]:
#This hides some warnings that we might want to look at one day if our code doesn't work!
import warnings
warnings.filterwarnings('ignore', category=DeprecationWarning)

#These are various graph plotting and data processing tools we may use.
from bokeh.io import output_notebook
from bokeh.plotting import figure, show
import numpy as np
import pandas as pd


#This is a nice plotting library that will also do some pretty graphics for us.
import aplanat
from aplanat import points
from aplanat import graphics
from aplanat.hist import histogram
from aplanat.lines import steps
from bokeh.layouts import gridplot


#A library to manipulate sam files
import pysam

## Required Files

We are going to need some reference files to map our data to. We will use three references. The first (called "reference") is the Hmed and Hvol genomes combined. The second (called "reference2") is just the Hmed genome sequence. The third (called "reference3") is just the Hvol genome sequence. These are set up in the cell below:

In [None]:
reference = "~/student_projects_2022/data/refs/merged_refs.fasta"
reference_index = "~/student_projects_2022/data/refs/merged_refs.fasta.fai"

reference2 = "~/student_projects_2022/data/refs/Hmed_Chr_CP001868.2.fasta"
reference2_index = "~/student_projects_2022/data/refs/Hmed_Chr_CP001868.2.fasta.fai"

reference3 = "~/student_projects_2022/data/refs/Hvol_Chr_NC_013967.1.fasta"
reference3_index = "~/student_projects_2022/data/refs/Hvol_Chr_NC_013967.1.fasta.fai"

## Data Summaries and Analysis

We need to know a little bit about the data set that you have obtained for your sample.

First you need to tell the computer which barcode you want to look at. We will store this in a variable called "barcode".

Fill in the value of your barcode in the cell below. This needs to be a two-digit number written as text, including any leading zero - e.g. one of:

01
02
03
04
05
06
07
08
09
10
11
12


In [None]:
barcode="07"
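
A mistyped barcode (for example "7" instead of "07") would only show up later as an empty set of reads. The following is a minimal sketch of a check you could run first. The function name `check_barcode` is made up for this example, and the valid range is assumed to be the twelve values listed above.

```python
def check_barcode(barcode):
    """Return the barcode unchanged if it is one of the strings "01" to "12"."""
    # Build the allowed values "01", "02", ... "12" (assumed from the list above).
    valid = [f"{n:02d}" for n in range(1, 13)]
    if barcode not in valid:
        raise ValueError(
            f"barcode must be one of {', '.join(valid)}; got {barcode!r}"
        )
    return barcode


barcode = check_barcode("07")
print(barcode)
```

Because the barcode is pasted into file names, it has to stay a string: the number 7 and the text "7" are both rejected.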

We have pregrouped the data in the folder so we can now find your reads by creating a new variable called reads like this.

In [None]:
reads = f"~/student_projects_2022/data/ic_131/Haloferax_clean_RBK004ori/20221104_1529_X2_FAV40358_9bd50a0f/fastq_*/barcode{barcode}/*.fastq.gz"
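
The `reads` variable is a pattern containing `~` and `*` wildcards, not a list of files. The shell expands it when the aligner runs. If you want to see which files it matches from Python, something like the sketch below should work. `find_read_files` is a hypothetical helper name, not part of the course code.

```python
import glob
import os


def find_read_files(pattern):
    """Expand ~ and wildcards in pattern and return the matching paths, sorted."""
    return sorted(glob.glob(os.path.expanduser(pattern)))
```

In the notebook, `len(find_read_files(reads))` would then tell you how many fastq files were found for your barcode. An answer of 0 suggests the barcode or the path is wrong.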

We also need to create a few output file names that we can use in the rest of the code.
We are going to map our reads in a few different ways. Firstly we will map them to both reference genomes and then each genome individually.

In [None]:
sort_barcode_bam = f"sorted_{barcode}_out.bam"
sort_barcode_bam_hmed = f"sorted_{barcode}_out_hmed.bam"
sort_barcode_bam_hvol = f"sorted_{barcode}_out_hvol.bam"

### Using LAST

As we saw last time, LAST is a more useful mapper for us than minimap2 for this experiment. We therefore need to make a LAST reference database for each of our genomes.

This step is quite slow, so we have precomputed the databases and will just point to them instead. The first three lines, which are commented out, are the commands that would usually be used to generate them.

In [None]:
#!lastdb -uNEAR ~/student_projects_2022/data/refs/halodb $reference
#!lastdb -uNEAR ~/student_projects_2022/data/refs/hmeddb $reference2
#!lastdb -uNEAR ~/student_projects_2022/data/refs/hvoldb $reference3
halodb="~/student_projects_2022/data/refs/halodb"
hmeddb="~/student_projects_2022/data/refs/hmeddb"
hvoldb="~/student_projects_2022/data/refs/hvoldb"


### What are our databases?

halodb is both reference genomes.

hmeddb is just the H med reference.

hvoldb is just the H vol reference.

### Training Last
The last aligner can be run with lots of different parameters - choosing the correct ones is challenging. So the authors of the aligner have provided a way for us to be able to work out the parameters with training.

The cell below will train the last aligner using your reads for each of the three databases we need to look at. This bit is slow if you have a lot of data.

In [None]:
!last-train -P 8 -Q1 $halodb $reads > train.out
!last-train -P 8 -Q1 $hmeddb $reads > train_hmed.out
!last-train -P 8 -Q1 $hvoldb $reads > train_hvol.out

A problem with the data we are looking at here is that we expect some of our reads to map to both genomes - these are the reads we are really interested in! These reads are the "recombinants". To find these we are going to use a tool called last-split. last-split finds the optimal mapping for each section of a read.

First off we will map our reads to both genomes using last and have a look at the output.

This command does a lot of things!

lastal - aligns the reads to the halodb using the training information we generated earlier. It passes its output to the next program through a 'pipe' - the 'pipe' is the "|" character.
last-split - takes the alignments it gets from the aligner, keeps the best one for each section of a read and then pipes its output to:
maf-convert - this program converts the maf output from last into sam format, which is then piped into:
samtools view - this is tricky. This samtools command adds essential information about the reference and writes a bam file, which is then piped into:
samtools sort - we've used this before - we want to sort our alignments along the genome.
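
To make the chain of programs easier to see, the sketch below builds the same kind of command as a Python string, one stage per list entry. It is only an illustration: `build_last_pipeline` is a made-up helper, the file names in the example call are placeholders, and it follows the form of the single-genome commands used later in this notebook (without the extra `--split` option).

```python
def build_last_pipeline(database, training, reads, reference_index, out_bam, threads=8):
    """Return the alignment pipeline as a single shell command string."""
    stages = [
        # Align the reads using the parameters learned by last-train.
        f"lastal -P {threads} -p {training} {database} {reads}",
        # Keep the best alignment for each section of each read.
        "last-split",
        # Convert the MAF alignments arriving on standard input to SAM.
        "maf-convert sam -",
        # Add the reference information and write BAM.
        f"samtools view -bt {reference_index}",
        # Sort the alignments along the genome and write the output file.
        f"samtools sort -@16 -o {out_bam}",
    ]
    # The pipe character passes each program's output to the next program.
    return " | ".join(stages)


print(build_last_pipeline("hmeddb", "train_hmed.out", "reads.fastq.gz",
                          "ref.fasta.fai", "out.bam"))
```

Reading the printed command from left to right gives the same five steps as the description above.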



## Aligning to both genomes.

In [None]:
!lastal -P 8 --split -p train.out $halodb $reads | last-split | maf-convert sam - | samtools view -bt $reference_index | samtools sort -@16 -o $sort_barcode_bam

Finally we need to index our bam file so we can analyse it further.

In [None]:
!samtools index $sort_barcode_bam
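
Once the reads are mapped to both genomes, one quick way to look for candidate recombinants is to list the reads that have alignments on more than one reference sequence. The sketch below works on simple (read name, reference name) pairs. The function name and the example pairs are invented for illustration, and the pysam lines in the comment are an untested suggestion for how the pairs could be collected in this notebook.

```python
from collections import defaultdict


def reads_on_multiple_references(pairs):
    """Given (read_name, reference_name) pairs, return reads seen on more than one reference."""
    refs_by_read = defaultdict(set)
    for read_name, reference_name in pairs:
        refs_by_read[read_name].add(reference_name)
    return sorted(name for name, refs in refs_by_read.items() if len(refs) > 1)


# Made-up example pairs. In the notebook they could come from pysam, e.g.
#   with pysam.AlignmentFile(sort_barcode_bam) as bam:
#       pairs = [(a.query_name, a.reference_name)
#                for a in bam.fetch(until_eof=True) if not a.is_unmapped]
example = [
    ("read1", "CP001868.2"),
    ("read1", "NC_013967.1"),
    ("read2", "CP001868.2"),
    ("read2", "CP001868.2"),
    ("read3", "NC_013967.1"),
]
print(reads_on_multiple_references(example))  # prints ['read1']
```

This only counts distinct reference names, so it assumes each genome is a single sequence in the merged reference. A read that appears here is a candidate, not proof of recombination; it still needs to be inspected in the alignments.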

## Aligning to just the H. med genome.

In [None]:
!lastal -P 8 -p train_hmed.out $hmeddb $reads | last-split | maf-convert sam - | samtools view -bt $reference2_index | samtools sort -@16 -o $sort_barcode_bam_hmed

Finally we need to index our bam file so we can analyse it further.

In [None]:
!samtools index $sort_barcode_bam_hmed

## Aligning to just the H. vol genome.

In [None]:
!lastal -P 8 -p train_hvol.out $hvoldb $reads | last-split | maf-convert sam - | samtools view -bt $reference3_index | samtools sort -@16 -o $sort_barcode_bam_hvol

Finally we need to index our bam file so we can analyse it further.

In [None]:
!samtools index $sort_barcode_bam_hvol

Having done all of this, we should be able to generate some stats about our data!

### Statistics for mapping to both genomes.

In [None]:
# run the alignment summarizer program
!stats_from_bam $sort_barcode_bam > sorted.reads_reference.bam.stats


df = pd.read_csv("sorted.reads_reference.bam.stats", sep="\t")

p1 = histogram(
    [df['read_length']], title="Read lengths",
    x_axis_label="read length / bases", y_axis_label="count",bins=100)
p1.xaxis.formatter.use_scientific = False
p2 = histogram(
    [df['acc']], title="Read accuracy",
    x_axis_label="% accuracy", y_axis_label="count",bins=100)
aplanat.show(gridplot((p1, p2), ncols=2))


summary = graphics.InfoGraphItems()
summary.append(label='Total reads', value=len(df.name.unique()), icon='angle-up', unit='')
summary.append('Total yield', df.drop_duplicates(subset=["name"], keep='first').read_length.sum(), 'signal', 'b')
summary.append('Mean read length', df.drop_duplicates(subset=["name"], keep='first').read_length.sum()/len(df.name.unique()), 'align-center', 'b')
summary.append('Mean read identity', df.iden.mean(), 'check')
summary.append('Mean read accuracy', df.acc.mean(), 'check')
plot = graphics.infographic(summary.values())
aplanat.show(plot, background='#f4f4f4')

### Statistics for mapping just to H Vol.

In [None]:
# run the alignment summarizer program
!stats_from_bam $sort_barcode_bam_hvol > sorted.reads_reference_hvol.bam.stats


df = pd.read_csv("sorted.reads_reference_hvol.bam.stats", sep="\t")

p1 = histogram(
    [df['read_length']], title="Read lengths",
    x_axis_label="read length / bases", y_axis_label="count",bins=100)
p1.xaxis.formatter.use_scientific = False
p2 = histogram(
    [df['acc']], title="Read accuracy",
    x_axis_label="% accuracy", y_axis_label="count",bins=100)
aplanat.show(gridplot((p1, p2), ncols=2))


summary = graphics.InfoGraphItems()
summary.append(label='Total reads', value=len(df.name.unique()), icon='angle-up', unit='')
summary.append('Total yield', df.drop_duplicates(subset=["name"], keep='first').read_length.sum(), 'signal', 'b')
summary.append('Mean read length', df.drop_duplicates(subset=["name"], keep='first').read_length.sum()/len(df.name.unique()), 'align-center', 'b')
summary.append('Mean read identity', df.iden.mean(), 'check')
summary.append('Mean read accuracy', df.acc.mean(), 'check')
plot = graphics.infographic(summary.values())
aplanat.show(plot, background='#f4f4f4')

Why has this changed?

### Statistics for mapping just to H Med.

In [None]:
# run the alignment summarizer program
!stats_from_bam $sort_barcode_bam_hmed > sorted.reads_reference_hmed.bam.stats


df = pd.read_csv("sorted.reads_reference_hmed.bam.stats", sep="\t")

p1 = histogram(
    [df['read_length']], title="Read lengths",
    x_axis_label="read length / bases", y_axis_label="count",bins=100)
p1.xaxis.formatter.use_scientific = False
p2 = histogram(
    [df['acc']], title="Read accuracy",
    x_axis_label="% accuracy", y_axis_label="count",bins=100)
aplanat.show(gridplot((p1, p2), ncols=2))


summary = graphics.InfoGraphItems()
summary.append(label='Total reads', value=len(df.name.unique()), icon='angle-up', unit='')
summary.append('Total yield', df.drop_duplicates(subset=["name"], keep='first').read_length.sum(), 'signal', 'b')
summary.append('Mean read length', df.drop_duplicates(subset=["name"], keep='first').read_length.sum()/len(df.name.unique()), 'align-center', 'b')
summary.append('Mean read identity', df.iden.mean(), 'check')
summary.append('Mean read accuracy', df.acc.mean(), 'check')
plot = graphics.infographic(summary.values())
aplanat.show(plot, background='#f4f4f4')

## What now?

Again - why has this changed?
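
To help answer this, it can be useful to put the headline numbers from the three stats files side by side rather than reading them off three separate infographics. The sketch below recomputes the same quantities as the cells above using only the Python standard library. `summarise_stats` is a hypothetical helper, and it assumes the stats file is tab-separated with the `name`, `read_length`, `iden` and `acc` columns used above.

```python
import csv


def summarise_stats(path):
    """Summarise a stats_from_bam file: reads, yield, mean length, identity and accuracy."""
    length_by_read = {}
    idens, accs = [], []
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            # A read split into several alignments appears on several rows,
            # so its length is only recorded the first time it is seen.
            length_by_read.setdefault(row["name"], int(row["read_length"]))
            idens.append(float(row["iden"]))
            accs.append(float(row["acc"]))
    if not length_by_read:
        raise ValueError(f"no alignments found in {path}")
    n_reads = len(length_by_read)
    total = sum(length_by_read.values())
    return {
        "reads": n_reads,
        "yield": total,
        "mean_length": total / n_reads,
        # Identity and accuracy are averaged over alignments, not over reads.
        "mean_identity": sum(idens) / len(idens),
        "mean_accuracy": sum(accs) / len(accs),
    }
```

Calling `summarise_stats` on `sorted.reads_reference.bam.stats`, `sorted.reads_reference_hvol.bam.stats` and `sorted.reads_reference_hmed.bam.stats` in turn gives three dictionaries that can be compared directly. Look in particular at how the number of reads and the mean identity differ between the mappings.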

# Do we have recombinants?
Now we are going to look at coverage with the tool mosdepth to see if we can see our recombinants.

In [None]:
barcode_cov = f"last_barcode{barcode}_cov"

In [None]:
!mosdepth -n --fast-mode --by 100 $barcode_cov $sort_barcode_bam

In [None]:
cumulative_depth = pd.read_csv(
    f'{barcode_cov}.mosdepth.region.dist.txt', sep='\t',
    names=['ref', 'depth', 'proportion'])

binned_depth = pd.read_csv(
    f'{barcode_cov}.regions.bed.gz', sep='\t',
    names=['ref', 'start', 'end', 'depth'])

def make_coverage_plot(cumulative_depth, binned_depth):
    # Plot the proportion of the genome at coverage levels
    p1 = steps(
        [cumulative_depth[cumulative_depth['ref'].eq(binned_depth['ref'].unique()[0])]['depth']],
        [cumulative_depth[cumulative_depth['ref'].eq(binned_depth['ref'].unique()[0])]['proportion']],
        colors=['darkolivegreen'],
        x_axis_label='Depth of coverage',
        y_axis_label='Proportion of genome at coverage',
        title=binned_depth['ref'].unique()[0])
    
    # Plot the binned coverage levels across the genome
    
    p2 = steps(
        [binned_depth[binned_depth['ref'].eq(binned_depth['ref'].unique()[0])]['start']],
        [binned_depth[binned_depth['ref'].eq(binned_depth['ref'].unique()[0])]['depth']],
        colors=['darkolivegreen'],
        x_axis_label='Position along reference',
        y_axis_label='sequencing depth / bases',
        title=binned_depth['ref'].unique()[0])
    p2.xaxis.formatter.use_scientific = False
    
    p3 = steps(
        [cumulative_depth[cumulative_depth['ref'].eq(binned_depth['ref'].unique()[1])]['depth']],
        [cumulative_depth[cumulative_depth['ref'].eq(binned_depth['ref'].unique()[1])]['proportion']],
        colors=['darkblue'],
        x_axis_label='Depth of coverage',
        y_axis_label='Proportion of genome at coverage',
        title=binned_depth['ref'].unique()[1])

    
    # Plot the binned coverage levels across the genome
    
    p4 = steps(
        [binned_depth[binned_depth['ref'].eq(binned_depth['ref'].unique()[1])]['start']],
        [binned_depth[binned_depth['ref'].eq(binned_depth['ref'].unique()[1])]['depth']],
        colors=['darkblue'],
        x_axis_label='Position along reference',
        y_axis_label='sequencing depth / bases',
        title=binned_depth['ref'].unique()[1])
    p4.xaxis.formatter.use_scientific = False
    return gridplot((p1, p2,p3,p4), ncols=2)

aplanat.show(make_coverage_plot(cumulative_depth, binned_depth), background="#ffffff")

# Think

Why does our coverage look different here? What are the spikes in coverage? Why are the results not as good as our simulated data?
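
If you want the coordinates of the spikes rather than estimating them from the plot, one simple approach is to flag every 100-base bin whose depth is well above the typical depth of its reference. The sketch below does this with a threshold of three times the median. The factor of three and the function name are assumptions chosen for illustration, not part of the course code.

```python
from statistics import median


def find_coverage_spikes(bins, factor=3.0):
    """Return the bins whose depth is more than factor times their reference's median depth.

    bins is a list of (ref, start, end, depth) tuples, matching the columns
    of the mosdepth regions file read into binned_depth above.
    """
    depths_by_ref = {}
    for ref, start, end, depth in bins:
        depths_by_ref.setdefault(ref, []).append(depth)
    medians = {ref: median(depths) for ref, depths in depths_by_ref.items()}
    return [b for b in bins if b[3] > factor * medians[b[0]]]


# Made-up example: one bin on reference "a" stands out.
# In the notebook the tuples could be taken from the data frame with
#   list(binned_depth.itertuples(index=False, name=None))
example_bins = [
    ("a", 0, 100, 10.0),
    ("a", 100, 200, 12.0),
    ("a", 200, 300, 50.0),
    ("b", 0, 100, 1.0),
    ("b", 100, 200, 1.0),
]
print(find_coverage_spikes(example_bins))  # prints [('a', 200, 300, 50.0)]
```

The median is used rather than the mean so that the spikes themselves do not pull the baseline upwards. The flagged coordinates can then be pasted into IGV in the next section.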

## Viewing the alignments

Now we will use IGV to look at the alignments in more detail.

Run the code below and use it to highlight regions of the genome and - specifically - the genes where the recombination events are occurring.

You can view two regions of a genome at once by entering the coordinates like this:

"CP001868.2:480,000-490,000 CP001868.2:850,000-860,000" 

or if you want to be really flash:

"CP001868.2:405,000-415,000 NC_013967.1:420,000-430,000 NC_013967.1:785000-795000 CP001868.2:800,000-805,000"


In [None]:
from igv_jupyterlab import IGV
import os
user = os.getenv('JUPYTERHUB_USER')

url=f"http://10.157.200.14/user/{user}/tree/UnderGradProjectTest/"
bams={'results':sort_barcode_bam}
track_list=[
            {"name": "HMerge",
                "url": url+"data/refs/merge.gff3",
                "format": "gff3",
                "type": "annotation",
                "displayMode": "expanded",
                "height":120,
                "indexed": False }
]
colors=['orange','green','gray']
i=0
for b in bams:
    d = {"name": b,
        "url":url+bams[b],
        "type": "alignment",
         #"displayMode":"SQUISHED",
         "height":800,
         "removable":True,
         #"color":colors[i],
        "indexed": True }
    track_list.append(d)
    i+=1
    
genome = IGV.create_genome(
    name="merged_refs",   
    fasta_url=url+'data/refs/merged_refs.fasta',
    index_url=url+ 'data/refs/merged_refs.fasta.fai',
    tracks=track_list
)

#create the widget
igv = IGV(genome=genome)


display(igv)
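
The region strings described above are easy to mistype. As a small convenience, they can be built from plain numbers as in the sketch below (`igv_locus` is a made-up helper name).

```python
def igv_locus(regions):
    """Join (reference, start, end) tuples into one space-separated locus string."""
    return " ".join(f"{ref}:{start:,}-{end:,}" for ref, start, end in regions)


print(igv_locus([("CP001868.2", 480000, 490000), ("CP001868.2", 850000, 860000)]))
# prints: CP001868.2:480,000-490,000 CP001868.2:850,000-860,000
```

The printed text can be pasted into the IGV search box to show both regions next to each other.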

## List Your Identified Genes

You should find four genes - one at each end of the recombination event for each genome. We will call these CP_Left, CP_Right, NC_Left and NC_Right.

Complete the cell below with the relevant information. You must leave in the quotes and copy the gene names exactly up to the first full stop. So HFX_0896.mRNA.0 will become "HFX_0896"
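
If you prefer not to trim the names by hand, the rule "everything up to the first full stop" can be written as a one-line function. `gene_id` is a hypothetical name used only for this example.

```python
def gene_id(feature_name):
    """Return the part of a feature name before the first full stop."""
    return feature_name.split(".", 1)[0]


print(gene_id("HFX_0896.mRNA.0"))  # prints HFX_0896
```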

In [None]:
CP_Left="HFX_0440"
CP_Right="HFX_0844"
NC_Left="HVO_0476"
NC_Right="HVO_0869"

## What are these genes?

To find what these genes are, we need to look for them in the annotation files for the genomes. We can find this information from the annotation file used to label the IGV plot above.


In [None]:
!grep $CP_Left ~/student_projects_2022/data/refs/merge.gff3

In [None]:
!grep $CP_Right ~/student_projects_2022/data/refs/merge.gff3

In [None]:
!grep $NC_Left ~/student_projects_2022/data/refs/merge.gff3

In [None]:
!grep $NC_Right ~/student_projects_2022/data/refs/merge.gff3
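
The grep commands print whole annotation lines, which can be hard to read. In a GFF3 file each line has nine tab-separated columns, and the ninth holds `key=value` attributes separated by semicolons. The sketch below pulls out just the product names for one gene. It assumes the annotation stores the description under a `product` attribute, and the example line is invented rather than copied from merge.gff3.

```python
from urllib.parse import unquote


def gff3_products(lines, gene):
    """Return the distinct 'product' values from GFF3 lines that mention gene."""
    products = []
    for line in lines:
        if line.startswith("#") or gene not in line:
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            continue  # not a complete feature line
        for field in columns[8].split(";"):
            key, _, value = field.partition("=")
            # Attribute values in GFF3 are URL-escaped, so decode them.
            if key.strip() == "product" and unquote(value) not in products:
                products.append(unquote(value))
    return products


# Invented example line, for illustration only.
example_line = ("CP001868.2\tGenbank\tCDS\t100\t400\t.\t+\t0\t"
                "ID=HFX_0001.CDS.0;Parent=HFX_0001.mRNA.0;product=hypothetical protein")
print(gff3_products([example_line], "HFX_0001"))  # prints ['hypothetical protein']
```

To use it on the real annotation, open the file (after expanding `~` with `os.path.expanduser`) and pass the file handle as `lines` together with one of your gene names.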

For some of these genes we can see what they are by looking at the product name. Others are hypothetical proteins. To identify the hypothetical proteins, we need to get their sequences so we can analyse them.

To do this, we will use a tool called GFF3Toolkit.


In [None]:
!gff3_to_fasta -g ~/student_projects_2022/data/refs/merge.gff3 -f $reference -st cds -o test_genes -d complex

In [None]:
!grep -A1 $CP_Right ~/haloferax_2022/test_genes_cds.fa

In [None]:
!grep -A1 $NC_Right ~/haloferax_2022/test_genes_cds.fa
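
Note that `grep -A1` prints only the one line after each matching header, so it returns the whole sequence only if each sequence is written on a single line. As a hedged alternative, the sketch below reads a FASTA file and joins all the lines of the wanted record. `read_fasta_record` is a made-up helper, not part of GFF3Toolkit.

```python
def read_fasta_record(path, wanted):
    """Return the full sequence of the first record whose header contains wanted, or None."""
    sequence = []
    found = False
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if found:
                    break  # reached the next record, so the wanted one is complete
                found = wanted in line
            elif found:
                sequence.append(line)
    return "".join(sequence) if found else None
```

In the notebook this could be called with the path to `test_genes_cds.fa` (expanded with `os.path.expanduser` if it starts with `~`) and one of your gene names. The returned sequence can then be copied into a search tool to find out what a hypothetical protein might be.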

## Now let's look at the alignment with respect to just one genome.

Again we will use IGV but we will just look at the reads with respect to one genome.

### Hmed

In [None]:
"Hmed_Chr_CP001868.2:405,000-415,000 Hmed_Chr_CP001868.2:800,000-805,000"

In [None]:
from igv_jupyterlab import IGV
import os
user = os.getenv('JUPYTERHUB_USER')

url=f"http://10.157.200.14/user/{user}/tree/UnderGradProjectTest/"
bams={'results':sort_barcode_bam_hmed}
track_list=[
            {"name": "Hmed",
                "url": url+"data/refs/Hmed.gff3",
                "format": "gff3",
                "type": "annotation",
                "displayMode": "expanded",
                "height":120,
                "indexed": False }
]
colors=['orange','green','gray']
i=0
for b in bams:
    d = {"name": b,
        "url":url+bams[b],
        "type": "alignment",
         #"displayMode":"SQUISHED",
         "height":800,
         "removable":True,
         #"color":colors[i],
        "indexed": True }
    track_list.append(d)
    i+=1
    
genome = IGV.create_genome(
    name="Hmed",   
    fasta_url=url+'data/refs/Hmed_Chr_CP001868.2.fasta',
    index_url=url+ 'data/refs/Hmed_Chr_CP001868.2.fasta.fai',
    tracks=track_list
)

#create the widget
igv = IGV(genome=genome)


display(igv)

### Hvol

In [None]:
"Hvol_Chr_NC_013967.1:420,000-430,000 Hvol_Chr_NC_013967.1:785000-795000"

In [None]:
from igv_jupyterlab import IGV
import os
user = os.getenv('JUPYTERHUB_USER')

url=f"http://10.157.200.14/user/{user}/tree/UnderGradProjectTest/"
bams={'results':sort_barcode_bam_hvol}
track_list=[
            {"name": "Hvol",
                "url": url+"data/refs/Hvol.gff3",
                "format": "gff3",
                "type": "annotation",
                "displayMode": "expanded",
                "height":120,
                "indexed": False }
]
colors=['orange','green','gray']
i=0
for b in bams:
    d = {"name": b,
        "url":url+bams[b],
        "type": "alignment",
         #"displayMode":"SQUISHED",
         "height":800,
         "removable":True,
         #"color":colors[i],
        "indexed": True }
    track_list.append(d)
    i+=1
    
genome = IGV.create_genome(
    name="Hvol",
    fasta_url=url+'data/refs/Hvol_Chr_NC_013967.1.fasta',
    index_url=url+ 'data/refs/Hvol_Chr_NC_013967.1.fasta.fai',
    tracks=track_list
)

#create the widget
igv = IGV(genome=genome)


display(igv)

### What have you found?

At this point, you should have identified two regions in each genome where recombination has occurred. There are lots of questions to ask here. For example - what is shared between these genes? What is the function of the genes? Are these genes still functional? 

You need to have a think about this going forwards.