# Introduction

As before we need to set up our notebook with the relevant code for analysis of our data.

This first cell is just to make it look pretty!

In [None]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
import igv_notebook
igv_notebook.init()

This cell is setting up the various libraries we will need to analyse our data.

In [None]:
#This hides some warnings that we might want to look at one day if our code doesn't work!
import warnings
warnings.filterwarnings('ignore', category=DeprecationWarning)
import os
user = os.getenv('JUPYTERHUB_USER')

#These are various graph plotting and data processing tools we may use.
from bokeh.io import output_notebook
from bokeh.plotting import figure, show
import numpy as np
import pandas as pd


#This is a nice plotting library that will also do some pretty graphics for us.
import aplanat
from aplanat import points
from aplanat import graphics
from aplanat.hist import histogram
from aplanat.lines import steps
from bokeh.layouts import gridplot


#A library to manipulate sam files
import pysam

#compute cores to use - should not be greater than 8
cores=8

from IPython.display import Image

# Reminder

This slide is taken from Thorsten's introduction - it shows what we expected to see.

In [None]:
Image(f"/home/jupyter-{user}/student_projects_2022/data/images/KeynoteSlide.jpg")

## Time Saving

Some steps in this workbook have been precomputed. All the code is here, but we have saved time on some of the steps!

## Required Files

We are going to need some reference files to map our data to. We will use four references.

The first (called "reference") is the Hmed and Hvol genomes combined.

The second (called "reference2") is just the Hmed genome sequence.

The third is the Hvol chromosome sequence. 

The final is the Hmed and Hvol sequences including all plasmids. 

These are set up in the cell below:

In [None]:
reference = "~/student_projects_2022/data/refs/merged_refs.fasta"
reference_index = "~/student_projects_2022/data/refs/merged_refs.fasta.fai"

reference2 = "~/student_projects_2022/data/refs/Hmed_Chr_CP001868.2.fasta"
reference2_index = "~/student_projects_2022/data/refs/Hmed_Chr_CP001868.2.fasta.fai"

reference3 = "~/student_projects_2022/data/refs/Hvol_Chr_NC_013967.1.fasta"
reference3_index = "~/student_projects_2022/data/refs/Hvol_Chr_NC_013967.1.fasta.fai"

reference4 = "~/student_projects_2022/data/refs/HBoth.fasta"
reference4_index = "~/student_projects_2022/data/refs/HBoth.fasta.fai"

## Data Summaries and Analysis

We need to know a little bit about the data set that you have obtained for your sample.

First you need to tell the computer which barcode you want to look at. We will store this in a variable called "barcode".

Fill in the value of your barcode in the cell below. This needs to be a two digit number - e.g. one of:

01
02
03
04
05
06
07
08
09
10

As a reminder, the samples were assigned as follows:

JK Barcode01

JR Barcode02

AM	Barcode03

LH	Barcode04

JO	Barcode05

EH	Barcode06

IT	Barcode07

GT	Barcode08

Parent01	Barcode09

Parent02	Barcode10



In [None]:
barcode="09"
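The barcode has to be a two digit string, so "9" or 9 will not find any reads. As a minimal sketch (the helper name `normalise_barcode` and the 1 to 10 range check are assumptions for this workbook, not part of any library), you could tidy up the value like this:

```python
def normalise_barcode(value, lowest=1, highest=10):
    """Return the barcode as a two digit string such as "09"."""
    text = str(value).strip()
    if not text.isdigit():
        raise ValueError(f"barcode must be a number, got {value!r}")
    number = int(text)
    # The 1-10 range is an assumption based on the sample list above.
    if not lowest <= number <= highest:
        raise ValueError(f"barcode must be between {lowest} and {highest}")
    return f"{number:02d}"


print(normalise_barcode(9))     # 09
print(normalise_barcode("10"))  # 10
```

If you use it, `barcode = normalise_barcode(9)` gives the same value as typing `barcode="09"` by hand.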

We have pregrouped the read data in the folders below so we can find your reads by creating a new variable called reads like this.

In [None]:
reads = f"~/student_projects_2022/data/IC_199/*/*/fastq_pass/barcode{barcode}/*.fastq.gz"
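The `*` and `~` characters in this pattern are expanded by the shell when we pass `$reads` to a command. If you want to check from Python how many files the pattern matches, a sketch like the one below should work. Note that Python's `glob` does not expand `~` by itself, so we expand it first. The helper name `find_read_files` is an assumption for illustration.

```python
import glob
import os


def find_read_files(pattern):
    """Return the sorted list of files matching a shell-style pattern."""
    # expanduser turns "~" into the home directory; glob expands the "*" parts.
    return sorted(glob.glob(os.path.expanduser(pattern)))


barcode = "09"  # example value, use your own barcode
pattern = f"~/student_projects_2022/data/IC_199/*/*/fastq_pass/barcode{barcode}/*.fastq.gz"
files = find_read_files(pattern)
print(f"{len(files)} read files found")
```

If this reports 0 files, check the barcode value before going any further.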

We also need to create a few output file names that we can use in the rest of the code.

We are going to map our reads in a few different ways. Firstly we will map them to both reference genomes and then each genome individually.

For each mapping we need to create a bam file that describes the reads - to do this we name the file with the barcode number you are using for your analysis.

In [None]:
sort_barcode_bam = f"~/student_projects_2022/data/precomputed/bams/sorted_{barcode}_out.bam"
sort_barcode_bam_hmed = f"~/student_projects_2022/data/precomputed/bams/sorted_{barcode}_out_hmed.bam"
sort_barcode_bam_hvol = f"~/student_projects_2022/data/precomputed/bams/sorted_{barcode}_out_hvol.bam"

sort_barcode_bam_igv = f"student_projects_2022/data/precomputed/bams/sorted_{barcode}_out.bam"
sort_barcode_bam_hmed_igv = f"student_projects_2022/data/precomputed/bams/sorted_{barcode}_out_hmed.bam"
sort_barcode_bam_hvol_igv = f"student_projects_2022/data/precomputed/bams/sorted_{barcode}_out_hvol.bam"

### Using LAST

As we saw last time, LAST is a more useful mapper for us than minimap2 for this experiment. We therefore need to make our last reference database for each of our genomes.

This step is quite slow. Therefore we have precomputed these and instead we will just load the references. The first four lines would have been used to generate the references usually.

In [None]:
#!lastdb -uNEAR ~/student_projects_2022/data/refs/halodb $reference
#!lastdb -uNEAR ~/student_projects_2022/data/refs/hmeddb $reference2
#!lastdb -uNEAR ~/student_projects_2022/data/refs/hvoldb $reference3
#!lastdb -uNEAR ~/student_projects_2022/data/refs/hmedalldb $reference4
halodb="~/student_projects_2022/data/refs/halodb"
hmeddb="~/student_projects_2022/data/refs/hmeddb"
hvoldb="~/student_projects_2022/data/refs/hvoldb"
halldb="~/student_projects_2022/data/refs/hmedalldb"


### What are our databases?

halodb is both reference genomes.

hmeddb is just the H med reference.

hvoldb is just the H vol reference.

halldb is the complete reference genomes combined including all plasmids.

### Training Last
The last aligner can be run with lots of different parameters - choosing the correct ones is challenging. So the authors of the aligner have provided a way for us to be able to work out the parameters with training.

The cell below will train the last aligner using your reads for each of the databases we need to look at. This bit is slow if you have a lot of data. Therefore we have precomputed these again.

In [None]:
#!last-train -P $cores -Q1 $halodb $reads > train.out
#!last-train -P $cores -Q1 $hmeddb $reads > train_hmed.out
#!last-train -P $cores -Q1 $hvoldb $reads > train_hvol.out
#!last-train -P $cores -Q1 $halldb $reads > train_hmedall.out
train = "~/student_projects_2022/data/precomputed/train/train.out"
trainmed = "~/student_projects_2022/data/precomputed/train/train_hmed.out"
trainvol = "~/student_projects_2022/data/precomputed/train/train_hvol.out"
trainall = "~/student_projects_2022/data/precomputed/train/train_hmedall.out"


### Mapping

A problem with the data we are looking at here is that we expect some of our reads to map to both genomes - these are the reads we are really interested in! These reads are the "recombinants". To find these we are going to use a tool called last-split. last-split finds the optimal mapping for each section of a read.
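To make the idea of a "recombinant" read concrete, here is a minimal sketch. It assumes we already have a list of (read name, reference name) pairs, one per aligned section of a read. The function name `reads_spanning_both` and the example data are made up for illustration.

```python
from collections import defaultdict


def reads_spanning_both(alignments, ref_a, ref_b):
    """Return names of reads with sections aligned to both references."""
    refs_by_read = defaultdict(set)
    for read_name, ref_name in alignments:
        refs_by_read[read_name].add(ref_name)
    return sorted(
        name for name, refs in refs_by_read.items()
        if ref_a in refs and ref_b in refs
    )


# Made-up example: read1 has one section on each chromosome.
example = [
    ("read1", "CP001868.2"),
    ("read1", "NC_013967.1"),
    ("read2", "CP001868.2"),
    ("read3", "NC_013967.1"),
    ("read3", "NC_013967.1"),
]
print(reads_spanning_both(example, "CP001868.2", "NC_013967.1"))  # ['read1']
```

A read that maps twice to the same genome (like read3) is split, but it is not a candidate recombinant in this simple picture.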

First off we will map our reads to the genomes using last and have a look at the output.

This command does a lot of things!

lastal - aligns the reads to the halodb using the training information we generated earlier. It passes its output to the next program through a 'pipe' - the 'pipe' is the "|" character.
last-split - splits the outputs it gets from the aligner into the best ones for each genome and then pipes its output to:
maf-convert - this program converts the maf output from last into sam format, which is then piped into:
samtools view - this is tricky. This samtools command adds essential information about the reference (taken from the .fai index) and writes a bam file, which is then piped into:
samtools sort - we've used this before - we want to sort our alignments along the genome.
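The steps above can be sketched as a list of stages joined by the pipe character. This is only an illustration of how the command is put together. The file names (`train.out`, `halodb`, `reads.fastq.gz`, `reference.fasta.fai`, `sorted.bam`) are placeholders, and `build_pipeline` is a made-up helper.

```python
import shlex


def build_pipeline(stages):
    """Join a list of commands (each a list of words) with the shell pipe."""
    return " | ".join(shlex.join(stage) for stage in stages)


# Each inner list is one program and its options; all file names are placeholders.
stages = [
    ["lastal", "-P", "8", "--split", "-p", "train.out", "halodb", "reads.fastq.gz"],
    ["last-split"],
    ["maf-convert", "sam", "-"],
    ["samtools", "view", "-bt", "reference.fasta.fai"],
    ["samtools", "sort", "-@16", "-o", "sorted.bam"],
]
print(build_pipeline(stages))
```

Note that `shlex.join` would put quotes around a pattern containing `*`, so this sketch is for reading the pipeline, not for running it on the real reads pattern.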



#### Aligning to both genomes.

Again, this step is computationally expensive, so it has been precomputed for you.

In [None]:
#!lastal -P $cores --split -p $trainall $halldb $reads | last-split | maf-convert sam - | samtools view -bt $reference4_index | samtools sort -@16 -o $sort_barcode_bam

Finally we need to index our bam file so we can analyse it further.

In [None]:
#!samtools index $sort_barcode_bam

#### Aligning to just the H. med genome.

In [None]:
#!lastal -P $cores -p $trainmed $hmeddb $reads | last-split | maf-convert sam - | samtools view -bt $reference2_index | samtools sort -@16 -o $sort_barcode_bam_hmed

Finally we need to index our bam file so we can analyse it further.

In [None]:
#!samtools index $sort_barcode_bam_hmed

#### Aligning to just the H. vol genome.

In [None]:
#!lastal -P $cores -p $trainvol $hvoldb $reads | last-split | maf-convert sam - | samtools view -bt $reference3_index | samtools sort -@16 -o $sort_barcode_bam_hvol

Finally we need to index our bam file so we can analyse it further.

In [None]:
#!samtools index $sort_barcode_bam_hvol

Having done all of this, we should be able to generate some stats about our data!

### Statistics for mapping to both genomes.

These statistics are not expensive to calculate - so we will calculate them now!

In [None]:
# run the alignment summarizer program
!stats_from_bam $sort_barcode_bam > sorted.reads_reference.bam.stats


df = pd.read_csv("sorted.reads_reference.bam.stats", sep="\t")

p1 = histogram(
    [df['read_length']], title="Read lengths",
    x_axis_label="read length / bases", y_axis_label="count",bins=100)
p1.xaxis.formatter.use_scientific = False
p2 = histogram(
    [df['acc']], title="Read accuracy",
    x_axis_label="% accuracy", y_axis_label="count",bins=100)
aplanat.show(gridplot((p1, p2), ncols=2))


summary = graphics.InfoGraphItems()
summary.append(label='Total reads', value=len(df.name.unique()), icon='angle-up', unit='')
summary.append('Total yield', df.drop_duplicates(subset=["name"], keep='first').read_length.sum(), 'signal', 'b')
summary.append('Mean read length', df.drop_duplicates(subset=["name"], keep='first').read_length.sum()/len(df.name.unique()), 'align-center', 'b')
summary.append('Mean read identity', df.iden.mean(), 'check')
summary.append('Mean read accuracy', df.acc.mean(), 'check')
plot = graphics.infographic(summary.values())
aplanat.show(plot, background='#f4f4f4')
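The summary code counts each read once, even if the stats file has several rows for it, but averages accuracy over every row. A minimal sketch of the same arithmetic without pandas is below. It assumes each row is a dictionary with `name`, `read_length` and `acc` keys (the column names used above), and the example rows are made up.

```python
def summarise_alignments(rows):
    """Summarise alignment rows, counting each read name only once."""
    first_length = {}
    for row in rows:
        # Keep the length from the first row seen for each read.
        first_length.setdefault(row["name"], row["read_length"])
    total_reads = len(first_length)
    total_yield = sum(first_length.values())
    mean_length = total_yield / total_reads if total_reads else 0.0
    # Accuracy is averaged over every row, not over unique reads.
    mean_acc = sum(row["acc"] for row in rows) / len(rows) if rows else 0.0
    return {
        "total_reads": total_reads,
        "total_yield": total_yield,
        "mean_read_length": mean_length,
        "mean_accuracy": mean_acc,
    }


rows = [
    {"name": "read1", "read_length": 1000, "acc": 95.0},
    {"name": "read1", "read_length": 1000, "acc": 97.0},
    {"name": "read2", "read_length": 500, "acc": 90.0},
]
print(summarise_alignments(rows))
```

Here read1 appears twice but only contributes 1000 bases to the yield, so the total is 1500 bases from 2 reads.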

### Statistics for mapping just to H Vol.

In [None]:
# run the alignment summarizer program
!stats_from_bam $sort_barcode_bam_hvol > sorted.reads_reference_hvol.bam.stats


df = pd.read_csv("sorted.reads_reference_hvol.bam.stats", sep="\t")

p1 = histogram(
    [df['read_length']], title="Read lengths",
    x_axis_label="read length / bases", y_axis_label="count",bins=100)
p1.xaxis.formatter.use_scientific = False
p2 = histogram(
    [df['acc']], title="Read accuracy",
    x_axis_label="% accuracy", y_axis_label="count",bins=100)
aplanat.show(gridplot((p1, p2), ncols=2))


summary = graphics.InfoGraphItems()
summary.append(label='Total reads', value=len(df.name.unique()), icon='angle-up', unit='')
summary.append('Total yield', df.drop_duplicates(subset=["name"], keep='first').read_length.sum(), 'signal', 'b')
summary.append('Mean read length', df.drop_duplicates(subset=["name"], keep='first').read_length.sum()/len(df.name.unique()), 'align-center', 'b')
summary.append('Mean read identity', df.iden.mean(), 'check')
summary.append('Mean read accuracy', df.acc.mean(), 'check')
plot = graphics.infographic(summary.values())
aplanat.show(plot, background='#f4f4f4')

Why has this changed?

### Statistics for mapping just to H Med.

In [None]:
# run the alignment summarizer program
!stats_from_bam $sort_barcode_bam_hmed > sorted.reads_reference_hmed.bam.stats


df = pd.read_csv("sorted.reads_reference_hmed.bam.stats", sep="\t")

p1 = histogram(
    [df['read_length']], title="Read lengths",
    x_axis_label="read length / bases", y_axis_label="count",bins=100)
p1.xaxis.formatter.use_scientific = False
p2 = histogram(
    [df['acc']], title="Read accuracy",
    x_axis_label="% accuracy", y_axis_label="count",bins=100)
aplanat.show(gridplot((p1, p2), ncols=2))


summary = graphics.InfoGraphItems()
summary.append(label='Total reads', value=len(df.name.unique()), icon='angle-up', unit='')
summary.append('Total yield', df.drop_duplicates(subset=["name"], keep='first').read_length.sum(), 'signal', 'b')
summary.append('Mean read length', df.drop_duplicates(subset=["name"], keep='first').read_length.sum()/len(df.name.unique()), 'align-center', 'b')
summary.append('Mean read identity', df.iden.mean(), 'check')
summary.append('Mean read accuracy', df.acc.mean(), 'check')
plot = graphics.infographic(summary.values())
aplanat.show(plot, background='#f4f4f4')



## What now?

Again - why has this changed?

# Do we have recombinants?
Now we are going to look at coverage with the tool mosdepth to see if we can see our recombinants.

Firstly we need to define a file name for our data.

In [None]:
barcode_cov = f"last_barcode{barcode}_cov"

Now we use a program called mosdepth to calculate how many reads map to each position of each genome.

In [None]:
!mosdepth -n --fast-mode --by 10 $barcode_cov $sort_barcode_bam
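The `--by 10` option asks for the depth in windows of 10 bases. As a rough picture of what a binned depth means (this is a simplified sketch, not how mosdepth itself is implemented), we can count how many alignments cover each base and then average within each window. Intervals here are assumed to be 0-based and half-open, like a bed file.

```python
def binned_depth(intervals, ref_length, window=10):
    """Return (start, end, mean depth) for each window along a reference."""
    per_base = [0] * ref_length
    for start, end in intervals:
        for pos in range(max(start, 0), min(end, ref_length)):
            per_base[pos] += 1
    bins = []
    for start in range(0, ref_length, window):
        end = min(start + window, ref_length)
        mean = sum(per_base[start:end]) / (end - start)
        bins.append((start, end, mean))
    return bins


# Two made-up alignments on a 30 base reference.
print(binned_depth([(0, 15), (5, 30)], ref_length=30, window=10))
# [(0, 10, 1.5), (10, 20, 1.5), (20, 30, 1.0)]
```

This per-base approach is far too slow for a real genome, which is why we use mosdepth.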

Finally we use some code to plot the coverage for the large chromosomes.

In [None]:
cumulative_depth = pd.read_csv(
    f'{barcode_cov}.mosdepth.region.dist.txt', sep='\t',
    names=['ref', 'depth', 'proportion'])

binned_depth = pd.read_csv(
    f'{barcode_cov}.regions.bed.gz', sep='\t',
    names=['ref', 'start', 'end', 'depth'])

def make_coverage_plot(cumulative_depth, binned_depth):
    # Plot the proportion of the genome at coverage levels
    p1 = steps(
        [cumulative_depth[cumulative_depth['ref'].eq('NC_013967.1')]['depth']],
        [cumulative_depth[cumulative_depth['ref'].eq('NC_013967.1')]['proportion']],
        colors=['darkolivegreen'],
        x_axis_label='Depth of coverage',
        y_axis_label='Proportion of genome at coverage',
        title="NC_013967.1 Haloferax volcanii DS2, complete sequence")
    
    # Plot the binned coverage levels across the genome
    
    p2 = steps(
        [binned_depth[binned_depth['ref'].eq('NC_013967.1')]['start']],
        [binned_depth[binned_depth['ref'].eq('NC_013967.1')]['depth']],
        colors=['darkolivegreen'],
        x_axis_label='Position along reference',
        y_axis_label='sequencing depth / bases',
        title="NC_013967.1 Haloferax volcanii DS2, complete sequence")
    p2.xaxis.formatter.use_scientific = False
    
    p3 = steps(
        [cumulative_depth[cumulative_depth['ref'].eq("CP001868.2")]['depth']],
        [cumulative_depth[cumulative_depth['ref'].eq("CP001868.2")]['proportion']],
        colors=['darkblue'],
        x_axis_label='Depth of coverage',
        y_axis_label='Proportion of genome at coverage',
        title="CP001868.2 Haloferax mediterranei ATCC 33500, complete sequence")

    
    # Plot the binned coverage levels across the genome
    
    p4 = steps(
        [binned_depth[binned_depth['ref'].eq("CP001868.2")]['start']],
        [binned_depth[binned_depth['ref'].eq("CP001868.2")]['depth']],
        colors=['darkblue'],
        x_axis_label='Position along reference',
        y_axis_label='sequencing depth / bases',
        title="CP001868.2 Haloferax mediterranei ATCC 33500, complete sequence")
    p4.xaxis.formatter.use_scientific = False
    return gridplot((p1, p2,p3,p4), ncols=2)

aplanat.show(make_coverage_plot(cumulative_depth, binned_depth), background="#ffffff")

# Think

Why does our coverage look different here? What are the spikes in coverage? Why are the results different to our simulated data?
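If you want to list where the spikes are rather than reading them off the plot, one simple approach is to flag windows whose depth is well above the median. This is a sketch under assumptions: the threshold of 3 times the median is arbitrary, and `find_spikes` is a made-up helper that takes (start, end, depth) tuples.

```python
from statistics import median


def find_spikes(bins, fold=3.0):
    """Return the windows whose depth is more than `fold` times the median."""
    depths = [depth for _, _, depth in bins]
    if not depths:
        return []
    threshold = median(depths) * fold
    return [(start, end, depth) for start, end, depth in bins if depth > threshold]


# Made-up windows: the third one is much deeper than the rest.
example_bins = [(0, 10, 20), (10, 20, 22), (20, 30, 200), (30, 40, 19), (40, 50, 21)]
print(find_spikes(example_bins))  # [(20, 30, 200)]
```

With real data you could build the tuples from the `start`, `end` and `depth` columns of `binned_depth` for one reference at a time.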

## Viewing the alignments

Now we will use IGV to look at the alignments in more detail.

Run the code below and use it to highlight regions of the genome and - specifically - the genes where the recombination events are occurring.

You can view two regions of a genome at once by entering the coordinates like this:

"CP001868.2:480,000-490,000 CP001868.2:850,000-860,000" 

or if you want to be really flash:

"CP001868.2:405,000-415,000 NC_013967.1:420,000-430,000 NC_013967.1:785000-795000 CP001868.2:800,000-805,000"



So you need to find the coordinates on each genome where it switches from one to another. You can enter them below:

In [None]:
CP1=278000
CP2=280000
CP3=770000
CP4=780000
NC1=278000
NC2=280000
NC3=760000
NC4=780000

In [None]:
print (f"CP001868.2:{CP1}-{CP2} NC_013967.1:{NC1}-{NC2} NC_013967.1:{NC3}-{NC4} CP001868.2:{CP3}-{CP4}")
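A locus string is just "name:start-end", with several loci separated by spaces. A small sketch for building one safely is below. The helper names are made up, and the check that start is less than end is an assumption about what you want, not something IGV requires of this code.

```python
def make_locus(ref, start, end):
    """Format one region as "ref:start-end"."""
    if start >= end:
        raise ValueError(f"start ({start}) must be less than end ({end})")
    return f"{ref}:{start}-{end}"


def make_multi_locus(regions):
    """Join several (ref, start, end) regions into one locus string."""
    return " ".join(make_locus(ref, start, end) for ref, start, end in regions)


print(make_multi_locus([
    ("CP001868.2", 278000, 280000),
    ("NC_013967.1", 278000, 280000),
]))
# CP001868.2:278000-280000 NC_013967.1:278000-280000
```

This catches the easy mistake of typing the two coordinates the wrong way round.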

In [None]:
import os
user = os.getenv('JUPYTERHUB_USER')

url=f"http://10.157.200.14/user/{user}/tree/"
bams={'results':sort_barcode_bam_igv}
track_list=[
                  {
                    "name": "HMerge",
                    "url": url+"student_projects_2022/data/refs/merge.gff3",
                    "format": "gff3",
                    "type": "annotation",
                    "displayMode": "expanded",
                    "height":120,
                    "indexed": False
                  },
                
            ]

colors=['orange','green','gray']
i=0
for b in bams:
    d = {"name": b,
        "url":url+bams[b],
        "indexURL":url+bams[b]+".bai",
        "type": "alignment",
         "displayMode":"SQUISHED",
         "height":800,
         "showInsertions":False,
         #"removable":True,
         #"color":colors[i],
        #"indexed": True 
        }
    track_list.append(d)
    i+=1

igv_browser= igv_notebook.Browser(
    {
        "reference": {
                "name": "merged_refs",   
                "fastaURL": url+'student_projects_2022/data/refs/merged_refs.fasta',
                "indexURL": url+ 'student_projects_2022/data/refs/merged_refs.fasta.fai'
        },
        "tracks": track_list,
        #"locus":f"CP001868.2:{CP1}-{CP2} NC_013967.1:{NC1}-{NC2} NC_013967.1:{NC3}-{NC4} CP001868.2:{CP3}-{CP4}",
    }
)

## List Your Identified Genes

You should find four genes - one at each end of the recombination event for each genome. We will call these CP_Left, CP_Right, NC_Left and NC_Right.

Complete the cell below with the relevant information. You must leave in the quotes and copy the gene names exactly up to the first full stop. So HFX_0896.mRNA.0 will become "HFX_0896"

In [None]:
CP_Left="HFX_0494"
CP_Right="HFX_0828"
NC_Left="HVO_0522"
NC_Right="HVO_0859"
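If you would rather not trim the names by hand, taking everything up to the first full stop is a one-line job in Python. The helper name below is made up for illustration.

```python
def gene_id_from_feature(feature_name):
    """Return the part of a feature name before the first full stop."""
    return feature_name.split(".", 1)[0]


print(gene_id_from_feature("HFX_0896.mRNA.0"))  # HFX_0896
```

A name with no full stop in it is returned unchanged.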

## What are these genes?

To find what these genes are, we need to look for them in the annotation files for the genomes. We can find this information from the annotation file used to label the IGV plot above.


In [None]:
!grep $CP_Left ~/student_projects_2022/data/refs/merge.gff3

In [None]:
!grep $CP_Right ~/student_projects_2022/data/refs/merge.gff3

In [None]:
!grep $NC_Left ~/student_projects_2022/data/refs/merge.gff3

In [None]:
!grep $NC_Right ~/student_projects_2022/data/refs/merge.gff3
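Each line that grep prints is a GFF3 record: nine tab-separated columns, with the last column holding `key=value` pairs separated by semicolons. As a minimal sketch, the function below pulls those pairs into a dictionary. The example line is made up to show the layout and is not taken from merge.gff3.

```python
from urllib.parse import unquote


def parse_gff3_attributes(line):
    """Return the attributes (ninth column) of a GFF3 line as a dictionary."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(columns)}")
    attributes = {}
    for field in columns[8].split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            # GFF3 escapes special characters as %XX, so decode them.
            attributes[key.strip()] = unquote(value.strip())
    return attributes


# A made-up record showing the layout.
example_line = (
    "CP001868.2\tannotation\tCDS\t100\t400\t.\t+\t0\t"
    "ID=example_0001.CDS.0;product=hypothetical protein"
)
print(parse_gff3_attributes(example_line).get("product"))  # hypothetical protein
```

Whether a `product` attribute is present depends on how the annotation file was made, so `.get("product")` may return `None` for some records.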

For some of these genes we can see what they are by looking at the product name. Others are hypothetical proteins. To identify the hypothetical proteins, we need to get the sequence so we can analyse them.

To do this, we will use a tool called GFF3Toolkit.


In [None]:
!gff3_to_fasta -g ~/student_projects_2022/data/refs/merge.gff3 -f $reference -st cds -o test_genes -d complex

In [None]:
!grep pyrE2 ~/student_projects_2022/data/refs/merge.gff3

In [None]:
!grep -A1 $CP_Right ~/haloferax_2022/test_genes_cds.fa

In [None]:
!grep -A1 $NC_Right ~/haloferax_2022/test_genes_cds.fa

## Now let's look at the alignment with respect to just one genome.

Again we will use IGV but we will just look at the reads with respect to one genome.

### Hmed

In [None]:
print (f"Hmed_Chr_CP001868.2:{CP1}-{CP2} Hmed_Chr_CP001868.2:{CP3}-{CP4}")

In [None]:
import os
user = os.getenv('JUPYTERHUB_USER')

url=f"http://10.157.200.14/user/{user}/tree/"
bams={'results':sort_barcode_bam_hmed_igv}
track_list=[
                  {
                    "name": "Hmed",
                    "url": url+"student_projects_2022/data/refs/Hmed.gff3",
                    "format": "gff3",
                    "type": "annotation",
                    "displayMode": "expanded",
                    "height":120,
                    "indexed": False
                  },
                
            ]

colors=['orange','green','gray']
i=0
for b in bams:
    d = {"name": b,
        "url":url+bams[b],
        "indexURL":url+bams[b]+".bai",
        "type": "alignment",
         "displayMode":"SQUISHED",
         "height":800,
         "showInsertions":False,
         #"removable":True,
         #"color":colors[i],
        #"indexed": True 
        }
    track_list.append(d)
    i+=1

igv_browser= igv_notebook.Browser(
    {
        "reference": {
                "name": "Hmed",   
                "fastaURL": url+'student_projects_2022/data/refs/Hmed_Chr_CP001868.2.fasta',
                "indexURL": url+ 'student_projects_2022/data/refs/Hmed_Chr_CP001868.2.fasta.fai'
        },
        "tracks": track_list,
        #"locus":f"CP001868.2:{CP1}-{CP2} NC_013967.1:{NC1}-{NC2} NC_013967.1:{NC3}-{NC4} CP001868.2:{CP3}-{CP4}",
    }
)



### Hvol

In [None]:
print (f"Hvol_Chr_NC_013967.1:{NC1}-{NC2} Hvol_Chr_NC_013967.1:{NC3}-{NC4}")

In [None]:
import os
user = os.getenv('JUPYTERHUB_USER')

url=f"http://10.157.200.14/user/{user}/tree/"
bams={'results':sort_barcode_bam_hvol_igv}
track_list=[
                  {
                    "name": "Hvol",
                    "url": url+"student_projects_2022/data/refs/Hvol.gff3",
                    "format": "gff3",
                    "type": "annotation",
                    "displayMode": "expanded",
                    "height":120,
                    "indexed": False
                  },
                
            ]

colors=['orange','green','gray']
i=0
for b in bams:
    d = {"name": b,
        "url":url+bams[b],
        "indexURL":url+bams[b]+".bai",
        "type": "alignment",
         "displayMode":"SQUISHED",
         "height":800,
         "showInsertions":False,
         #"removable":True,
         #"color":colors[i],
        #"indexed": True 
        }
    track_list.append(d)
    i+=1

igv_browser= igv_notebook.Browser(
    {
        "reference": {
                "name": "Hvol",
                "fastaURL": url+'student_projects_2022/data/refs/Hvol_Chr_NC_013967.1.fasta',
                "indexURL": url+ 'student_projects_2022/data/refs/Hvol_Chr_NC_013967.1.fasta.fai'
        },
        "tracks": track_list,
        #"locus":f"CP001868.2:{CP1}-{CP2} NC_013967.1:{NC1}-{NC2} NC_013967.1:{NC3}-{NC4} CP001868.2:{CP3}-{CP4}",
    }
)


### What have you found?

At this point, you should have identified two regions in each genome where recombination has occurred. There are lots of questions to ask here. For example - what is shared between these genes? What is the function of the genes? Are these genes still functional? 

You need to have a think about this going forwards.

### Other options

Let's try an assembly.

In [None]:
reads = f"~/student_projects_2022/data/IC_199/*/*/fastq_pass/barcode{barcode}/*.fastq.gz"

In [None]:
assembly_dir=f"~/student_projects_2022/data/precomputed/assemblies/{barcode}_assembly"
assembly_image_dir=f"student_projects_2022/data/precomputed/assemblies/{barcode}_assembly"

Assembly is - again - computationally expensive. Therefore we have precomputed the assembly for you.

In [None]:
#!flye --threads $cores --out-dir $assembly_dir --nano-hq $reads

In [None]:
maffile = f"{assembly_dir}/barcode{barcode}.maf"

!lastal --split -P $cores $halldb $assembly_dir/assembly.fasta > $maffile


In [None]:
print (maffile)

In [None]:
!grep -B 1 contig_3 $maffile | tr -s ' ' | cut -f1-6 -d' '


In [None]:
mafimage = f"/home/jupyter-{user}/{assembly_image_dir}/barcode{barcode}.png"

!last-dotplot -v $maffile $mafimage

In [None]:
print (mafimage)

In [None]:
Image(f"{mafimage}")

From this, it should be possible to think and work out what may have happened in your experiment.

### What if my result isn't simple?

OK - so you've got this far and things look odd? Maybe it all looks fine... If it doesn't what can you do?

First - this is research. We had a prediction - it may or may not be correct.

Second - we might be able to do some additional analysis. But to do this we might need to get some sequence data from our assembly.

Nothing from here has been precomputed. So it will be a lot slower.

In [None]:
### What sequence do you want to get from your assembly?

### Be careful - you do not want to print long sequences to the screen!

mysequence = "contig_1"

In [None]:
import pyfastx

In [None]:
from pathlib import Path

In [None]:
assembly_path = str(Path(f'{assembly_dir}/assembly.fasta').expanduser())

In [None]:
fa=pyfastx.Fasta(assembly_path)

In [None]:
fa[mysequence]

In [None]:
print (f">{fa[mysequence].name}")
print (f"{fa[mysequence].seq[3400000:3401000]}")

Using the code above you can grab any bit of sequence from your assembly you are interested in and analyse it using a tool such as blast to work out what it might be.
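Tools such as blast are happiest with sequence in FASTA format: a `>` header line followed by the sequence. As a small sketch (the helper name `format_fasta` and the 60 character line width are assumptions, any sensible width is fine), you could format a slice like this before copying it:

```python
def format_fasta(name, sequence, width=60):
    """Return a FASTA record with the sequence wrapped to `width` characters."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([f">{name}"] + lines)


# A made-up 20 base sequence wrapped at 8 characters per line.
print(format_fasta("example_region", "ACGT" * 5, width=8))
```

With the assembly loaded you could pass in something like `fa[mysequence].seq[3400000:3401000]` as the sequence.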

## Getting sequence from the parental strains.

We can look at the two parental strains and try to identify the sequences where recombination has occurred and see if anything interesting has happened in those regions.

The sequences can be obtained from the reference.

In [None]:
reference_path = str(Path(f'{reference4}').expanduser())

In [None]:
print (reference_path)

In [None]:
referencefa=pyfastx.Fasta(reference_path)

Let's get the sequences from the left junction first.

In [None]:
print (f"CP001868.2:{CP1}-{CP2} NC_013967.1:{NC1}-{NC2} NC_013967.1:{NC3}-{NC4} CP001868.2:{CP3}-{CP4}")

In [None]:
seq1=referencefa["CP001868.2"].seq[CP1:CP2]

In [None]:
seq2=referencefa["NC_013967.1"].seq[NC1:NC2]

Do these sequences have any similarity to one another? To answer this we can use blast2seqs to check - go to https://blast.ncbi.nlm.nih.gov/Blast.cgi?BLAST_SPEC=blast2seq&LINK_LOC=align2seq&PAGE_TYPE=BlastSearch and compare the two sequences.

In [None]:
print(">Sequence_1")
print(seq1)

In [None]:
print(">Sequence_2")
print(seq2)
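Before going to blast, a very rough local check is to ask what fraction of the short words (k-mers) in one sequence also appear in the other, on either strand. This is only a sketch and not a replacement for a real alignment. The function names and the default word length of 12 are assumptions.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def kmers(seq, k):
    """Return the set of all words of length k in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_fraction(seq_a, seq_b, k=12):
    """Fraction of k-mers in seq_a that also occur in seq_b (either strand)."""
    words_a = kmers(seq_a, k)
    if not words_a:
        return 0.0
    words_b = kmers(seq_b, k) | kmers(reverse_complement(seq_b), k)
    return len(words_a & words_b) / len(words_a)


# Made-up sequences: identical sequences share every k-mer.
print(shared_kmer_fraction("ACGTTGCAAGGCTTAACG", "ACGTTGCAAGGCTTAACG", k=6))  # 1.0
```

For your data you would call `shared_kmer_fraction(seq1, seq2)`. A value near 0 suggests little exact similarity, but blast is the better judge because it tolerates mismatches and gaps.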

Now we will get the sequence from the right junction:

In [None]:
referencefa["CP001868.2"].seq[275994-1000:275994+1000]

In [None]:
seq3=referencefa["NC_013967.1"].seq[NC3:NC4]
seq4=referencefa["CP001868.2"].seq[CP3:CP4]

print(">Sequence_3")
print(seq3)

print(">Sequence_4")
print(seq4)

## Looking at the assembly.

We might want to understand the coverage depth on the various elements in the assembly. We can do this by mapping our reads back to our assembly.

In [None]:
!samtools faidx $assembly_dir/assembly.fasta

In [None]:
assembly_index = f"{assembly_dir}/assembly.fasta.fai"

In [None]:
#!lastdb -uNEAR $assembly_dir $assembly_dir/assembly.fasta
assembly_db = f"{assembly_dir}"

In [None]:
#!last-train -P $cores -Q1 $assembly_db $reads > train_assembly.out

In [None]:
#!lastal -P $cores --split -p train_assembly.out $assembly_db $reads | last-split | maf-convert sam - | samtools view -bt $assembly_index | samtools sort -@16 -o assembly_mapping.bam
#!samtools index assembly_mapping.bam

In [None]:
#!mosdepth -n --fast-mode --by 10 barcode_mapping assembly_mapping.bam

We can look at a summary file of this which will allow us to estimate the abundance of different contigs.

In [None]:
assembly_df=pd.read_csv("barcode_mapping.mosdepth.summary.txt", sep="\t")


In [None]:
assembly_df[~(assembly_df['chrom'].str.endswith('_region'))]
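The `mean` column of the summary is the mean depth for each contig. Dividing each contig's mean depth by the mean depth of a contig you trust (for example the main chromosome) gives a rough relative abundance. A minimal sketch is below. The contig names and depths are made up, and `relative_depth` is a hypothetical helper.

```python
def relative_depth(mean_depths, baseline):
    """Divide every contig's mean depth by the depth of the baseline contig."""
    base = mean_depths[baseline]
    if base <= 0:
        raise ValueError("baseline contig has no coverage")
    return {name: depth / base for name, depth in mean_depths.items()}


# Made-up mean depths for three contigs.
example_depths = {"contig_1": 40.0, "contig_2": 80.0, "contig_3": 10.0}
print(relative_depth(example_depths, "contig_1"))
# {'contig_1': 1.0, 'contig_2': 2.0, 'contig_3': 0.25}
```

With the real table you could build the dictionary with `dict(zip(assembly_df['chrom'], assembly_df['mean']))`.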

We can also look at coverage to see if there are any unusual patterns.

In [None]:
cumulative_depth = pd.read_csv(
    f'barcode_mapping.mosdepth.region.dist.txt', sep='\t',
    names=['ref', 'depth', 'proportion'])

binned_depth = pd.read_csv(
    f'barcode_mapping.regions.bed.gz', sep='\t',
    names=['ref', 'start', 'end', 'depth'])

def make_coverage_plot_contig(contig,cumulative_depth, binned_depth):
    
    
    # Plot the binned coverage levels across the genome
    
    p2 = steps(
        [binned_depth[binned_depth['ref'].eq(contig)]['start']],
        [binned_depth[binned_depth['ref'].eq(contig)]['depth']],
        colors=['darkolivegreen'],
        x_axis_label='Position along reference',
        y_axis_label='sequencing depth / bases',
        title=contig)
    p2.xaxis.formatter.use_scientific = False
    
    
    return p2

#return gridplot((p1, p2,p3,p4), ncols=2)

plotlist=[]
for val in binned_depth['ref'].unique():
    print(val)
    #plotlist.append(make_coverage_plot_contig(val,cumulative_depth,binned_depth))
    p2=make_coverage_plot_contig(val,cumulative_depth,binned_depth)
    aplanat.show(gridplot((p2,), ncols=1), background="#ffffff")




We can also look at the assemblies in IGV to see if we can learn anything about them!

In [None]:
import os
user = os.getenv('JUPYTERHUB_USER')

url=f"http://10.157.200.14/user/{user}/tree/haloferax_2022/"
bams={'results':'assembly_mapping.bam'}

track_list=[
                  {
                    "name": "HMerge",
                    "url": url+"student_projects_2022/data/refs/Hvol.gff3",
                    "format": "gff3",
                    "type": "annotation",
                    "displayMode": "expanded",
                    "height":120,
                    "indexed": False
                  },
                
            ]

colors=['orange','green','gray']
i=0
for b in bams:
    d = {"name": b,
        "url":url+bams[b],
        "indexURL":url+bams[b]+".bai",
        "type": "alignment",
         "displayMode":"SQUISHED",
         "height":800,
         "showInsertions":False,
         #"removable":True,
         #"color":colors[i],
        #"indexed": True 
        }
    track_list.append(d)
    i+=1

igv_browser= igv_notebook.Browser(
    {
        "reference": {
                "name": "Assembly",   
                "fastaURL": f"http://10.157.200.14/user/{user}/tree/{assembly_image_dir}/assembly.fasta",
                "indexURL": f"http://10.157.200.14/user/{user}/tree/{assembly_image_dir}/assembly.fasta.fai",
        },
        "tracks": track_list,
        #"locus":f"CP001868.2:{CP1}-{CP2} NC_013967.1:{NC1}-{NC2} NC_013967.1:{NC3}-{NC4} CP001868.2:{CP3}-{CP4}",
    }
)

