# Pangenomics
--------------------------------------------

# Variant Calling with vg

## Overview

Variants can be called from the structure of the pangenomic graph itself or from reads aligned to the graph. You will learn how to call variants both ways in this submodule.

## Learning Objectives
+ Understand different types of variants
+ Understand our ability to call variants with different types of reads and pangenomic graphs
+ Learn how to call and interpret variants with vg

## Get Started

First, we will learn how to identify variants that are supported by the graph. Then we'll look at identifying novel variants that are not in the graph.

### Call Variants

We will look for variants in the aligned reads that are supported by the graph as well as for variants that are novel (not in the graph but supported by the reads aligned to the graph).

We will call variants against the graph, though you could also call variants using the surjected BAM file and traditional variant calling methods.

### Calling Graph Supported Variants

Compute read support for variation already in the graph using `vg pack`.

The parameters:

-x  the graph  
-g  alignments in gam format  
-Q  ignore reads with mapping quality < N and bases with base quality < N  
-s  ignore the first and last N nucleotides of each read  
-o  the output pack file  
-t  use N threads

In [None]:
!vg pack -x yprp.chrVIII.pggb.giraffe.gbz -g SK1xyprp.chrVIII.pggb.mapped.gam -Q 5 -s 5 -t 4 -o yprp.chrVIII.pggb.mapped.pack
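
To make the effect of `-Q` and `-s` more concrete, here is a toy sketch in Python. It is not how vg is implemented. The function name `counted_positions` and the simplified read model (a list of per-base qualities plus a single mapping quality) are assumptions made only for illustration.

```python
def counted_positions(base_quals, mapq, min_qual=5, trim=5):
    """Return the read offsets that would count toward support in this toy model.

    Assumed model (not vg's actual code): a read whose mapping quality is
    below min_qual contributes nothing. Otherwise the first and last `trim`
    bases are skipped, and each remaining base must have quality >= min_qual.
    """
    if mapq < min_qual:
        return []
    start, end = trim, len(base_quals) - trim
    return [i for i in range(start, end) if base_quals[i] >= min_qual]


# A hypothetical 12-base read with one low-quality base at offset 6.
quals = [30, 30, 30, 30, 30, 30, 2, 30, 30, 30, 30, 30]

# With a trim of 2, offsets 0-1 and 10-11 are skipped, and offset 6 is
# dropped because its base quality is below the threshold.
print(counted_positions(quals, mapq=60, min_qual=5, trim=2))

# A read with mapping quality below the threshold contributes nothing.
print(counted_positions(quals, mapq=3, min_qual=5, trim=2))
```

In this model, raising either threshold can only shrink the set of bases that count toward read support.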

Generate a VCF from the read support using `vg call`.

The parameters:

-k  the read support (pack) file to read in  
-t  the number of threads  
The graph

In [None]:
!vg call -k yprp.chrVIII.pggb.mapped.pack -t 4 yprp.chrVIII.pggb.giraffe.gbz > SK1xyprp.chrVIII.pggb.graph_calls.vcf
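
A quick sanity check on the resulting VCF is to count its records. The sketch below uses only the Python standard library. The helper name `count_records_by_chrom` and the three example records are invented for illustration, so the contig names and counts will differ from your real output.

```python
import io
from collections import Counter


def count_records_by_chrom(handle):
    """Count VCF data records per CHROM, skipping header lines that start with '#'."""
    counts = Counter()
    for line in handle:
        if not line.strip() or line.startswith("#"):
            continue
        chrom = line.split("\t", 1)[0]
        counts[chrom] += 1
    return counts


# Made-up records for illustration; real contig names depend on the graph's paths.
example_vcf = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE\n"
    "chrA\t101\t.\tA\tG\t50\tPASS\t.\tGT\t1/1\n"
    "chrA\t250\t.\tAT\tA\t42\tPASS\t.\tGT\t0/1\n"
    "chrB\t12\t.\tC\tT\t37\tPASS\t.\tGT\t1/1\n"
)

counts = count_records_by_chrom(io.StringIO(example_vcf))
print(dict(counts))
```

To apply it to the file written above, open `SK1xyprp.chrVIII.pggb.graph_calls.vcf` with `open(...)` and pass the file handle instead of the `io.StringIO` object.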

<div class="alert alert-block alert-info"> <b>Try this:</b>  
    <ul>
        <li>Create a blank code cell below.</li>
        <li>Call variants in the full genome graph by computing read support and then generating a VCF file.</li>
    </ul>

<details>
<summary>Click for help</summary>
<br>
!vg pack -x yprp.fullgenome.pggb.giraffe.gbz -g SK1xyprp.fullgenome.pggb.mapped.gam -Q 5 -s 5 -o yprp.fullgenome.pggb.mapped.pack -t 4   


!vg call -k yprp.fullgenome.pggb.mapped.pack -t 4 yprp.fullgenome.pggb.giraffe.gbz > yprp.fullgenome.pggb.graph_calls.vcf
</details>

### Including Novel Variant Calls

To call novel variants (variants that are supported by the aligned reads but are not already in the graph), we need to embed the variation from the aligned reads back into the graph. To do this, we first need to convert the graph into a form that we can change. We will use `vg convert` to convert the .gbz file to a .vg file.

In [None]:
!vg convert yprp.chrVIII.pggb.giraffe.gbz > yprp.chrVIII.pggb.giraffe.vg

Now, we can augment the graph with the mapped reads using `vg augment`. This will embed the variation from the alignments back into the graph.

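The parameters:

-A  output file for the alignments, rewritten against the augmented graph (gam)  
-t  the number of threads to use  
The graph  
The input alignment (gam) file

The augmented graph itself is written to standard output, which we redirect to a .vg file.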


In [None]:
!vg augment yprp.chrVIII.pggb.giraffe.vg SK1xyprp.chrVIII.pggb.mapped.gam -A SK1xyprp.chrVIII.pggb.mapped.aug.gam -t 4 > SK1xyprp.chrVIII.pggb.aug.vg 

Index the augmented graph using `vg index`. We will make a .xg index.

The parameters:

-x  output file  
-t  the number of threads  
The input graph

In [None]:
!vg index -t 4 -x SK1xyprp.chrVIII.pggb.aug.xg SK1xyprp.chrVIII.pggb.aug.vg

Now that the variation from the reads is embedded into the graph along with the original variants, we can proceed to call variants as we did above.

Compute read support.

In [None]:
!vg pack -x SK1xyprp.chrVIII.pggb.aug.xg -g SK1xyprp.chrVIII.pggb.mapped.aug.gam -Q 5 -s 5 -o SK1xyprp.chrVIII.pggb.mapped.aug.pack -t 4

Generate a VCF from the support.

In [None]:
!vg call SK1xyprp.chrVIII.pggb.aug.xg -k SK1xyprp.chrVIII.pggb.mapped.aug.pack -t 4 > SK1xyprp.chrVIII.pggb.aug_calls.vcf
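
The five steps above (convert, augment, index, pack, call) follow a fixed file-naming pattern. The sketch below only assembles the command strings and does not run vg. The function `novel_variant_commands` and its arguments are hypothetical helpers, while the flags and file-name patterns are copied from the commands shown in this section.

```python
def novel_variant_commands(graph, sample, threads=4):
    """Build the convert -> augment -> index -> pack -> call commands as strings.

    graph  : graph prefix, e.g. "yprp.chrVIII.pggb"
    sample : sample-plus-graph prefix, e.g. "SK1xyprp.chrVIII.pggb"
    """
    gbz = f"{graph}.giraffe.gbz"
    vg_graph = f"{graph}.giraffe.vg"
    gam = f"{sample}.mapped.gam"
    aug_gam = f"{sample}.mapped.aug.gam"
    aug_vg = f"{sample}.aug.vg"
    aug_xg = f"{sample}.aug.xg"
    pack = f"{sample}.mapped.aug.pack"
    vcf = f"{sample}.aug_calls.vcf"
    return [
        f"vg convert {gbz} > {vg_graph}",
        f"vg augment {vg_graph} {gam} -A {aug_gam} -t {threads} > {aug_vg}",
        f"vg index -t {threads} -x {aug_xg} {aug_vg}",
        f"vg pack -x {aug_xg} -g {aug_gam} -Q 5 -s 5 -o {pack} -t {threads}",
        f"vg call {aug_xg} -k {pack} -t {threads} > {vcf}",
    ]


commands = novel_variant_commands("yprp.chrVIII.pggb", "SK1xyprp.chrVIII.pggb")
for cmd in commands:
    print(cmd)
```

Printing the commands before running them makes it easy to check that each step consumes the file produced by the step before it.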

Generate stats on this VCF file. We will use `grep` to pull out the rows that start with SN (summary numbers).

In [None]:
!bcftools stats SK1xyprp.chrVIII.pggb.aug_calls.vcf | grep "^SN"
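
The SN rows are tab-separated: the tag `SN`, a set id, a key ending in a colon, and a value. If you want the numbers in Python rather than on screen, a minimal parser could look like the sketch below. The helper name `parse_summary_numbers` is hypothetical and the counts in `example` are invented.

```python
def parse_summary_numbers(stats_text):
    """Turn the 'SN' rows of `bcftools stats` output into a {key: int} dict."""
    summary = {}
    for line in stats_text.splitlines():
        if not line.startswith("SN\t"):
            continue
        _tag, _set_id, key, value = line.split("\t")[:4]
        summary[key.rstrip(":")] = int(value)
    return summary


# Invented counts, for illustration only.
example = (
    "SN\t0\tnumber of records:\t120\n"
    "SN\t0\tnumber of SNPs:\t90\n"
    "SN\t0\tnumber of indels:\t25\n"
)

summary = parse_summary_numbers(example)
print(summary["number of SNPs"])
```

In a notebook, the command output can be captured with IPython's `out = !command` syntax and joined with `"\n".join(out)` before it is passed to the parser.

The categories reported in these rows are: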

SNPs = single nucleotide polymorphisms (a single nucleotide change; reference and alternate alleles are all of length 1)  
MNPs = multi-nucleotide polymorphisms (reference and alternate alleles are all of the same length and that length is >1)  
indels = insertions/deletions (reference and alternate alleles are of different lengths)  
others = more complex variants  
multiallelic sites = sites with more than one alternate allele  
multiallelic SNP sites = SNP sites with more than one alternate allele
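
The length-based definitions above can be written as a small classifier. This is a simplified sketch of those definitions, not bcftools' own logic, which handles more cases. The function names are hypothetical. Treating symbolic or breakend alleles as "other" is an assumption made here to give the "others" category a concrete example.

```python
def classify_allele(ref, alt):
    """Classify one REF/ALT pair using the length-based definitions above."""
    # Assumption: symbolic alleles (e.g. <DEL>), breakends, and '*' count as "other".
    if alt.startswith("<") or "[" in alt or "]" in alt or alt == "*":
        return "other"
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) == len(alt):
        return "MNP"
    return "indel"


def classify_site(ref, alts):
    """Summarize a site: the allele types present and whether it is multiallelic."""
    types = sorted({classify_allele(ref, alt) for alt in alts})
    return {"types": types, "multiallelic": len(alts) > 1}


print(classify_allele("A", "G"))       # lengths 1 and 1
print(classify_allele("AT", "GC"))     # equal lengths greater than 1
print(classify_allele("AT", "A"))      # different lengths
print(classify_site("A", ["G", "T"]))  # two alternate alleles at one site
```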

<div class="alert alert-block alert-info"> <b>Try this:</b>  
    <ul>
        <li>Create a blank code cell below.</li>
        <li>Call novel variants for the yprp.fullgenome.pggb.giraffe.gbz graph.</li>
        <li>+ Convert the graph to vg format.</li>
        <li>+ Augment the graph to embed the read alignments into it.</li>
        <li>+ Create an index (xg).</li>
        <li>+ Compute read support.</li>
        <li>+ Generate a VCF.</li>
        <li>+ Generate statistics.</li>
    </ul>

<details>
<summary>Click for help</summary>
<br>
    
!vg convert yprp.fullgenome.pggb.giraffe.gbz > yprp.fullgenome.pggb.giraffe.vg 

!vg augment yprp.fullgenome.pggb.giraffe.vg SK1xyprp.fullgenome.pggb.mapped.gam -A SK1xyprp.fullgenome.pggb.mapped.aug.gam -t 4 > SK1xyprp.fullgenome.pggb.aug.vg 

!vg index -t 4 -x SK1xyprp.fullgenome.pggb.aug.xg SK1xyprp.fullgenome.pggb.aug.vg

!vg pack -x SK1xyprp.fullgenome.pggb.aug.xg -g SK1xyprp.fullgenome.pggb.mapped.aug.gam -Q 5 -s 5 -o SK1xyprp.fullgenome.pggb.mapped.aug.pack -t 4

!vg call SK1xyprp.fullgenome.pggb.aug.xg -k SK1xyprp.fullgenome.pggb.mapped.aug.pack -t 4 > SK1xyprp.fullgenome.pggb.aug_calls.vcf

!bcftools stats SK1xyprp.fullgenome.pggb.aug_calls.vcf | grep "^SN"

</details>

## Variants used to construct the graph

We can also pull out variants that are in the structure of the graph, which does not require aligning reads from a sample.

First, we will create an .xg-formatted graph with `vg index`.

In [None]:
!vg index -t 4 -x yprp.chrVIII.pggb.xg yprp.chrVIII.pggb.vg

Now, grab the variants from the graph using `vg deconstruct`.

The parameters:

The graph (xg)  
-P  path prefix (variants will be called relative to all paths whose names start with this prefix)  
-t  the number of threads

NOTE: vg takes some liberties in how it represents variants when constructing the graph, so this VCF might not be identical to VCFs produced by other methods.

In [None]:
!vg deconstruct yprp.chrVIII.pggb.xg -P S288C -t 20 > yprp.chrVIII.pggb.S288Cpaths.deconstruct.vcf
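
The prefix match behind `-P` can be pictured as a simple string filter over the path names in the graph. The sketch below is only an illustration: `reference_paths` is a hypothetical helper and the path names are invented, so check the real names in your graph (for example with `vg paths -L -x yprp.chrVIII.pggb.xg`) before relying on a prefix.

```python
def reference_paths(path_names, prefix):
    """Return the path names that start with the given prefix."""
    return [name for name in path_names if name.startswith(prefix)]


# Hypothetical path names; the naming scheme in your graph may differ.
paths = ["S288C.chrVIII", "SK1.chrVIII", "Y12.chrVIII"]

print(reference_paths(paths, "S288C"))
print(reference_paths(paths, "S"))  # a shorter prefix matches more paths
```

A prefix that is too short can select more paths than intended, which is why the full strain name is used in the command above.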

<div class="alert alert-block alert-info"> <b>Try this:</b>  
    <ul>
        <li>Create a blank code cell below.</li>
        <li>Deconstruct variants from the full genome graph (yprp.fullgenome.pggb.giraffe.vg)</li>
        <li>Hint: first you will need to create an .xg version using `vg index`</li>
    </ul>

<details>
<summary>Click for help</summary>
<br>
!vg index -t 4 -x yprp.fullgenome.pggb.xg yprp.fullgenome.pggb.vg   


!vg deconstruct yprp.fullgenome.pggb.xg -P S288C -t 20 > yprp.fullgenome.pggb.S288Cpaths.deconstruct.vcf
</details>

### Visualize the original and augmented graphs

Compare the CUP1 region of the original graph (yprp.chrVIII.pggb.gfa) and the augmented graph (SK1xyprp.chrVIII.pggb.aug.vg, which we will convert to SK1xyprp.chrVIII.pggb.aug.gfa) in Bandage. First, convert the augmented graph to GFA format.

The parameters:

-f  output in GFA format

In [None]:
!vg convert -f SK1xyprp.chrVIII.pggb.aug.vg > SK1xyprp.chrVIII.pggb.aug.gfa

Now visualize the CUP1 region of each .gfa file in Bandage.

NOTE: The augmented graph is much bigger, and Bandage will have difficulty loading the entire chrVIII graph. So, before drawing the graph, run a BLAST search for the genes. Then change "Scope" to "Around query hits", change "Distance" to 200 for the augmented graph, and click "Draw Graph".  


<div class="alert alert-block alert-info"> <b>Questions:</b>  
    <ul>
        <li>What differences do you see between the CUP1 regions of the two graphs?</li>
        <li>How do the number of nodes, number of edges, and the total length differ between the original and the augmented graph?</li>
    </ul>

<details>
<summary>Click for possible answers</summary>
<br>
CUP1 region from the original graph.  

![CUP1 region](./Figures/CUP1region.png)

CUP1 region from the augmented graph.

![CUP1 region](./Figures/auggraph.png)

These graphs are actually fairly similar, except that the augmented graph is much more broken up, making the graph look like it has more "dotted" lines. There are small differences in the loops, especially the double loop near the 2 straight ends.


|        | Original | Augmented |
|--------|----------|-----------|
| Nodes  | 18,856   | 590,574   |
| Edges  | 25,371   | 886,845   |
| Length | 623,013  | 938,699   |


The augmented graph is broken into many more pieces. This is expected given that adding in variation breaks nodes. It is also much longer, capturing more genetic variation.
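
To put numbers on this, the values in the table can be turned into fold changes and mean node lengths. This is a small arithmetic sketch on the table values only, and the variable names are illustrative.

```python
# (original, augmented) values copied from the table above.
stats = {
    "Nodes": (18_856, 590_574),
    "Edges": (25_371, 886_845),
    "Length": (623_013, 938_699),
}

# Fold change of each quantity: augmented divided by original.
fold = {name: aug / orig for name, (orig, aug) in stats.items()}

# Mean node length: total length divided by number of nodes.
mean_node_length = {
    "original": stats["Length"][0] / stats["Nodes"][0],
    "augmented": stats["Length"][1] / stats["Nodes"][1],
}

for name, value in fold.items():
    print(f"{name}: {value:.1f}x")
for graph, value in mean_node_length.items():
    print(f"mean node length ({graph}): {value:.1f}")
```

The node and edge counts grow by more than 30-fold while the total length grows by only about 1.5-fold, so the mean node length drops from roughly 33 to under 2, which is what "broken into many more pieces" looks like numerically.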
</details>

## Conclusion

In this submodule, you learned different ways to call and characterize variants from the graph, including variants supported within the graph and variants supported by reads mapped to the graph.

## Clean up
No cleanup is necessary for this submodule. Don't forget to shut down your Workbench when you are done working through this module!