# Pangenomics
--------------------------------------------

# Variant Calling with VG

## Overview

Variants can be called both within the pangenomic graph and by aligning reads to the graph. You will learn how to call variants both ways in this submodule.

## Learning Objectives
+ Specify different types of variants
+ Discuss our ability to call variants with different types of reads and pangenomic graphs
+ Call and interpret variants with VG

## Getting Started

When calling variants, we use aligned reads to find support for variants contained in the graph. With the original pangenome graph, this finds only variants that came from the assemblies used to build the graph. You can also augment the pangenome graph with novel variants found in the reads, creating an augmented pangenome graph that can be used to call those variants as well.

First we will learn how to identify variants that are supported by the graph. Then we'll look at identifying novel variants that are not in the graph.

#### Variant Calling
- Variants supported by the graph
- Novel variants

----------------------

## Call Variants

We will look for two variant types:
- Variants that are supported by the graph.
- Variants that are novel (i.e. not in the graph but supported by the reads aligned to the graph).

We will call variants against the graph, though you could also call variants using the surjected BAM file and traditional variant calling methods.

----------------------

## Calling Graph Supported Variants

1. Create a directory for the variants

In [None]:
!mkdir variants


2. Compute read support for variation already in the graph using `vg pack`.

The parameters:

`-x`  the graph  
`-g`  alignments in GAM format  
`-Q`  ignore mapping and base qualities < N  
`-s`  ignore the first and last N nucleotides of each read  
`-o`  the output PACK file  
`-t`  use N threads

In [None]:
!vg pack -x graphs/yprp.chrVIII.pggb.giraffe.gbz -g alignments/SK1xyprp.chrVIII.pggb.mapped.gam -Q 5 -s 5 -t 4 -o alignments/yprp.chrVIII.pggb.mapped.pack
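To build intuition for what `-Q` and `-s` do, here is a toy sketch on a linear reference rather than a graph. It is not how `vg pack` is implemented: the `toy_coverage` function, the read tuples, and the numbers are all made up for illustration, and only the mapping-quality half of `-Q` is modeled.

```python
def toy_coverage(reads, ref_length, min_mapq=5, trim_ends=5):
    """Per-position read depth on a linear toy reference.

    reads: iterable of (start, length, mapq) tuples with 0-based starts.
    """
    depth = [0] * ref_length
    for start, length, mapq in reads:
        if mapq < min_mapq:
            # Analogous to -Q: skip reads whose mapping quality is too low.
            continue
        # Analogous to -s: ignore the first and last `trim_ends` bases.
        first = start + trim_ends
        last = start + length - trim_ends  # exclusive end
        for pos in range(max(first, 0), min(last, ref_length)):
            depth[pos] += 1
    return depth


# Three hypothetical reads; the third has MAPQ 3 and is filtered out.
reads = [(0, 20, 60), (5, 20, 60), (8, 20, 3)]
depth = toy_coverage(reads, ref_length=30)
# depth[12] is 2: both retained reads still cover it after trimming.
```

Trimming read ends removes the bases where alignments are least reliable, at the cost of slightly lower depth everywhere.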

3. Generate a VCF from the read support using `vg call`.

The parameters:

`-k`  The read support file to read in  
`-t`  The number of threads  
`-z`  Restrict the search to GBZ haplotypes (can improve speed and accuracy); we won't use this here

Also, feed in the graph as a positional argument: *graphs/yprp.chrVIII.pggb.giraffe.gbz*

In [None]:
!vg call -k alignments/yprp.chrVIII.pggb.mapped.pack -t 4 graphs/yprp.chrVIII.pggb.giraffe.gbz > variants/SK1xyprp.chrVIII.pggb.graph_calls.vcf
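To start interpreting the calls, it can help to tally variant types directly from the REF and ALT columns. The sketch below is a simple length-based heuristic, not the classification that any particular tool uses; the function names and the example records (including the contig name `contig1`) are hypothetical, and symbolic alleles such as `<DEL>` are not handled.

```python
from collections import Counter


def classify(ref, alt):
    """Classify one REF/ALT pair by comparing allele lengths."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) == len(alt):
        return "MNP"
    if len(ref) < len(alt):
        return "insertion"
    return "deletion"


def count_variant_types(vcf_lines):
    """Tally variant types over every ALT allele in an iterable of VCF lines."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        ref, alts = fields[3], fields[4]
        for alt in alts.split(","):
            if alt in (".", "*"):
                continue  # no alternate allele, or a spanning deletion
            counts[classify(ref, alt)] += 1
    return counts


# Made-up records for illustration only.
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "contig1\t100\t.\tA\tG\t50\tPASS\t.",
    "contig1\t200\t.\tAT\tA\t50\tPASS\t.",
    "contig1\t300\t.\tC\tCGG,T\t50\tPASS\t.",
    "contig1\t400\t.\tAC\tGT\t50\tPASS\t.",
]
type_counts = count_variant_types(example_vcf)
```

To run it on the real output, pass an open file handle for `variants/SK1xyprp.chrVIII.pggb.graph_calls.vcf` in place of `example_vcf`.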

<div class="alert alert-block alert-success"> <b>Try this in the cell below:</b>  
    <ul>
        <li>Call variants in the full genome graph by computing read support and then generating a VCF file (variants/yprp.fullgenome.pggb.graph_calls.vcf).</li>
    </ul>
</div>

In [None]:
# Call variants in the full genome graph

<details>
<summary>Click for help</summary>
<br>
!vg pack -x graphs/yprp.fullgenome.pggb.giraffe.gbz -g alignments/SK1xyprp.fullgenome.pggb.mapped.gam -Q 5 -s 5 -o alignments/yprp.fullgenome.pggb.mapped.pack -t 4   


!vg call -k alignments/yprp.fullgenome.pggb.mapped.pack -t 4 graphs/yprp.fullgenome.pggb.giraffe.gbz > variants/yprp.fullgenome.pggb.graph_calls.vcf
</details>

----------------------

## Variant Statistics

1. We will use `bcftools stats` to get statistics on the graph-supported variants, and `grep` to keep only the summary number (`SN`) lines.

In [None]:
!bcftools stats variants/SK1xyprp.chrVIII.pggb.graph_calls.vcf | grep "^SN"
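If you want to work with the summary numbers in Python rather than read them off the screen, the tab-separated `SN` lines can be parsed into a dictionary. This is a minimal sketch: the function name is hypothetical and the counts in `example_stats` are invented, not results from this dataset.

```python
def parse_summary_numbers(stats_text):
    """Turn the SN lines of `bcftools stats` output into {key: int}."""
    summary = {}
    for line in stats_text.splitlines():
        if not line.startswith("SN\t"):
            continue  # ignore comments and other sections
        _tag, _set_id, key, value = line.split("\t")[:4]
        summary[key.rstrip(":")] = int(value)
    return summary


# Invented numbers in the SN layout: tag, set id, key, value.
example_stats = (
    "# SN\t[2]id\t[3]key\t[4]value\n"
    "SN\t0\tnumber of samples:\t1\n"
    "SN\t0\tnumber of records:\t120\n"
    "SN\t0\tnumber of SNPs:\t100\n"
    "SN\t0\tnumber of indels:\t20\n"
)
summary = parse_summary_numbers(example_stats)
```

In a notebook, one way to get the text is to capture the command with `lines = !bcftools stats <file>` and pass `"\n".join(lines)` to the function.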

Run the code below to see the flashcards.

In [None]:
from IPython.display import IFrame
IFrame('../html/flashcard_variants.html', width=800, height=400)

<div class="alert alert-block alert-success"> <b>Try this in the cell below:</b>  
    <ul>
        <li>Get statistics for variants supported by the full genome graph (variants/yprp.fullgenome.pggb.graph_calls.vcf).</li>
    </ul>
</div>

In [None]:
# Statistics for variants supported by the full genome graph

<details>
<summary>Click for help</summary>
<br>
!bcftools stats variants/yprp.fullgenome.pggb.graph_calls.vcf | grep "^SN"
</details>

----------------------

## Including Novel Variant Calls

1. To call novel variants (i.e. variants supported by the aligned reads but not already in the graph), we need to embed the variation from the aligned reads back into the graph. To do this, we first need to convert the graph into a format that can be modified. We will use `vg convert` to convert the .gbz file to a .vg file.

In [None]:
!vg convert graphs/yprp.chrVIII.pggb.giraffe.gbz > graphs/yprp.chrVIII.pggb.giraffe.vg

2. Now, we can augment the graph with the mapped reads using `vg augment`. This will embed the variation from the alignments back into the graph.

The parameters:

`-A`  output file for the alignments, rewritten to match the augmented graph (GAM)  
`-t`  the number of threads to use  

Also, feed in the graph and the input alignment (GAM) file as positional arguments:  
*graphs/yprp.chrVIII.pggb.giraffe.vg*  
*alignments/SK1xyprp.chrVIII.pggb.mapped.gam*

In [None]:
!vg augment graphs/yprp.chrVIII.pggb.giraffe.vg alignments/SK1xyprp.chrVIII.pggb.mapped.gam -A alignments/SK1xyprp.chrVIII.pggb.mapped.aug.gam -t 4 > graphs/SK1xyprp.chrVIII.pggb.aug.vg 

3. Next, index the augmented graph using `vg index`. We will make a .xg index.

The parameters:

`-x`  output file  
`-t`  the number of threads  

Also, feed in the graph as a positional argument:  
*graphs/SK1xyprp.chrVIII.pggb.aug.vg*

In [None]:
!vg index -t 4 -x graphs/SK1xyprp.chrVIII.pggb.aug.xg graphs/SK1xyprp.chrVIII.pggb.aug.vg

4. Now that the variation from the reads is embedded in the graph along with the original variants, we can proceed to call variants as we did above, starting by computing read support.

In [None]:
!vg pack -x graphs/SK1xyprp.chrVIII.pggb.aug.xg -g alignments/SK1xyprp.chrVIII.pggb.mapped.aug.gam -Q 5 -s 5 -o alignments/SK1xyprp.chrVIII.pggb.mapped.aug.pack -t 4

5. Then, we generate a VCF from the support.

In [None]:
!vg call graphs/SK1xyprp.chrVIII.pggb.aug.xg -k alignments/SK1xyprp.chrVIII.pggb.mapped.aug.pack -t 4 > variants/SK1xyprp.chrVIII.pggb.aug_calls.vcf

6. Finally, we generate stats on this VCF file as we did above.

In [None]:
!bcftools stats variants/SK1xyprp.chrVIII.pggb.aug_calls.vcf | grep "^SN"
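One rough way to see which calls appear only after augmentation is to compare the two VCFs record by record. The sketch below matches on exact (CHROM, POS, REF, ALT) tuples, which is an assumption: the same variant can be written differently in two files, so treat the result as an approximation. The function name and the example records are hypothetical.

```python
def variant_keys(vcf_lines):
    """Collect (CHROM, POS, REF, ALT) for every ALT allele in a VCF."""
    keys = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
        for alt in alts.split(","):
            if alt != ".":
                keys.add((chrom, int(pos), ref, alt))
    return keys


# Made-up records standing in for the graph and augmented call sets.
graph_calls = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "contig1\t100\t.\tA\tG\t50\tPASS\t.",
    "contig1\t200\t.\tAT\tA\t50\tPASS\t.",
]
aug_calls = graph_calls + ["contig1\t300\t.\tC\tT\t50\tPASS\t."]

# Calls present in the augmented set but absent from the graph-only set.
novel = variant_keys(aug_calls) - variant_keys(graph_calls)
```

With the real files, open `variants/SK1xyprp.chrVIII.pggb.graph_calls.vcf` and `variants/SK1xyprp.chrVIII.pggb.aug_calls.vcf` and pass the file handles instead of the example lists.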

<div class="alert alert-block alert-success"> <b>Try this in the cells below:</b><br/>
Call novel variants for the full genome graph (yprp.fullgenome.pggb.giraffe.gbz) by performing the following steps:
    <ul>
        <li>Convert the graph to .vg format.</li>
        <li>Augment the graph to embed the read alignments.</li>
        <li>Create an index (.xg).</li>
        <li>Compute read support.</li>
        <li>Generate a VCF.</li>
        <li>Generate statistics.</li>
    </ul>
</div>

In [None]:
# Convert the graph to vg format

In [None]:
# Augment the graph to embed the read alignments

In [None]:
# Create an index (xg)

In [None]:
# Compute read support

In [None]:
# Generate a VCF

In [None]:
# Generate statistics

<details>
<summary>Click for help</summary>
<br>
    
!vg convert graphs/yprp.fullgenome.pggb.giraffe.gbz > graphs/yprp.fullgenome.pggb.giraffe.vg 
<br><br>
!vg augment graphs/yprp.fullgenome.pggb.giraffe.vg alignments/SK1xyprp.fullgenome.pggb.mapped.gam -A alignments/SK1xyprp.fullgenome.pggb.mapped.aug.gam -t 4 > graphs/SK1xyprp.fullgenome.pggb.aug.vg 
<br><br>
!vg index -t 4 -x graphs/SK1xyprp.fullgenome.pggb.aug.xg graphs/SK1xyprp.fullgenome.pggb.aug.vg
<br><br>
!vg pack -x graphs/SK1xyprp.fullgenome.pggb.aug.xg -g alignments/SK1xyprp.fullgenome.pggb.mapped.aug.gam -Q 5 -s 5 -o alignments/SK1xyprp.fullgenome.pggb.mapped.aug.pack -t 4
<br><br>
!vg call graphs/SK1xyprp.fullgenome.pggb.aug.xg -k alignments/SK1xyprp.fullgenome.pggb.mapped.aug.pack -t 4 > variants/SK1xyprp.fullgenome.pggb.aug_calls.vcf
<br><br>
!bcftools stats variants/SK1xyprp.fullgenome.pggb.aug_calls.vcf | grep "^SN"

</details>

----------------------

## Variants used to construct the graph

We can also pull out variants that are in the structure of the graph, which does not require aligning reads from a sample.

1. First, we will create an .xg-formatted graph with `vg index`.

In [None]:
!vg index -t 4 -x graphs/yprp.chrVIII.pggb.xg graphs/yprp.chrVIII.pggb.vg

2. Now, grab the variants from the graph using `vg deconstruct`.

The parameters:
  
`-P` path prefix (variants will be called for all paths that start with this prefix)  
`-t` number of threads

Also, feed in the graph as a positional argument:  
*Graph (xg)*

<div class="alert alert-block alert-info"> <b>NOTE:</b> VG takes liberties with variants when constructing the graph so this VCF might not be identical to those from other methods.</div>

We will use the original chrVIII graph and export variants onto the S288C paths.

In [None]:
!vg deconstruct graphs/yprp.chrVIII.pggb.xg -P S288C -t 20 > variants/yprp.chrVIII.pggb.S288Cpaths.deconstruct.vcf

3. Get variant statistics.

In [None]:
!bcftools stats variants/yprp.chrVIII.pggb.S288Cpaths.deconstruct.vcf | grep "^SN"
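Beyond the summary numbers, a VCF with sample columns can show which samples carry non-reference alleles. The sketch below relies only on the standard VCF layout (sample columns start at the tenth column, with GT located via the FORMAT column); it assumes the deconstructed VCF has such columns. The function name, sample names, and records are hypothetical.

```python
from collections import Counter


def nonref_sites_per_sample(vcf_lines):
    """Count, per sample, the sites with at least one non-reference allele."""
    samples = []
    counts = Counter()
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        if len(fields) < 10:
            continue  # record without genotype columns
        fmt = fields[8].split(":")
        if "GT" not in fmt:
            continue
        gt_index = fmt.index("GT")
        for sample, column in zip(samples, fields[9:]):
            genotype = column.split(":")[gt_index]
            alleles = genotype.replace("|", "/").split("/")
            if any(allele not in ("0", ".") for allele in alleles):
                counts[sample] += 1
    return counts


# Made-up records with two hypothetical samples.
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsampleA\tsampleB",
    "ref1\t10\t.\tA\tG\t60\t.\t.\tGT\t1\t0",
    "ref1\t20\t.\tAT\tA\t60\t.\t.\tGT\t1\t1",
    "ref1\t30\t.\tC\tT\t60\t.\t.\tGT\t.\t0|1",
]
per_sample = nonref_sites_per_sample(example_vcf)
```

Missing genotypes (`.`) are treated as uninformative rather than as reference or alternate calls.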

<div class="alert alert-block alert-success"> <b>Try this for the full genome graph in the cell below:</b>  
    <ul>
        <li>Create an .xg version of graphs/yprp.fullgenome.pggb.vg using `vg index`</li>
        <li>Deconstruct variants</li>
        <li>Get the variant statistics.</li>
    </ul>
</div>

In [None]:
# Index to create an .xg version

In [None]:
# Deconstruct variants from the full genome graph

In [None]:
# Get variant statistics

<details>
<summary>Click for help</summary>
<br>
!vg index -t 4 -x graphs/yprp.fullgenome.pggb.xg graphs/yprp.fullgenome.pggb.vg   
<br><br>

!vg deconstruct graphs/yprp.fullgenome.pggb.xg -P S288C -t 20 > variants/yprp.fullgenome.pggb.S288Cpaths.deconstruct.vcf
<br><br>

!bcftools stats variants/yprp.fullgenome.pggb.S288Cpaths.deconstruct.vcf | grep "^SN"

</details>

----------------------

## Visualize the original and augmented graphs

1. Compare the CUP1 region of the original graph (yprp.chrVIII.pggb.gfa) and the augmented graph (SK1xyprp.chrVIII.pggb.aug.vg, which we will convert to SK1xyprp.chrVIII.pggb.aug.gfa) in Bandage. First, convert the augmented graph to GFA format.

The parameters:

`-f`  output in GFA format

Also, feed in the graph as a positional argument:  
*graphs/SK1xyprp.chrVIII.pggb.aug.vg*

In [None]:
!vg convert -f graphs/SK1xyprp.chrVIII.pggb.aug.vg > graphs/SK1xyprp.chrVIII.pggb.aug.gfa

2. Then visualize the CUP1 region in Bandage for the original graph (yprp.chrVIII.pggb.gfa) and the augmented graph (SK1xyprp.chrVIII.pggb.aug.gfa). In Bandage, under graph drawing, 100 is a good distance to use for the original graph and 200 for the augmented graph. These distances will ensure that you extend far enough out from the BLAST hits so that the region is fully connected.

<div class="alert alert-block alert-info"> <b>NOTE:</b> The augmented graph is much bigger, and Bandage will have difficulty loading the entire chrVIII graph. So, before drawing the graph, BLAST the genes, then change the "Scope" to "Around query hits," and change "Distance" to 200 for the augmented graph. Finally, click "Draw Graph" to apply these changes.</div>
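To put a number on how much bigger the augmented graph is, you can count records in the two GFA files. The sketch below reads GFA as plain tab-separated text, counting `S` (segment) and `L` (link) lines and treating both `P` and `W` lines as paths. The function name and the tiny example graph are hypothetical.

```python
def summarize_gfa(gfa_lines):
    """Count segments, links, paths, and total segment bases in GFA lines."""
    summary = {"segments": 0, "links": 0, "paths": 0, "bases": 0}
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        record_type = fields[0]
        if record_type == "S":
            summary["segments"] += 1
            # Field 3 holds the sequence; "*" means it is not stored.
            if len(fields) > 2 and fields[2] != "*":
                summary["bases"] += len(fields[2])
        elif record_type == "L":
            summary["links"] += 1
        elif record_type in ("P", "W"):
            summary["paths"] += 1
    return summary


# A made-up three-node graph for illustration.
example_gfa = [
    "H\tVN:Z:1.0",
    "S\t1\tACGT",
    "S\t2\tG",
    "S\t3\tTT",
    "L\t1\t+\t2\t+\t0M",
    "L\t2\t+\t3\t+\t0M",
    "P\tref\t1+,2+,3+\t*",
]
gfa_summary = summarize_gfa(example_gfa)
```

Running it on an open handle for `graphs/SK1xyprp.chrVIII.pggb.aug.gfa` and on the original GFA gives two dictionaries that can be compared side by side.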


Run the code below to see the flashcard questions.

In [None]:
from IPython.display import IFrame
IFrame('../html/flashcard_viz.html', width=800, height=550)

<details>
<summary>Click for a copy of visualizations of the two graphs.</summary>
<br>
CUP1 region from the original graph.  

<figure>
  <img
    src="./Figures/cup1only.png"
    alt="CUP1" />
  <figcaption></figcaption>
</figure>

<br>
CUP1 region from the augmented graph.  

<figure>
  <img
    src="./Figures/auggraph.png"
    alt="CUP1aug" />
  <figcaption></figcaption>
</figure>

</details>

----------------------

## Conclusion

In this submodule, you learned different ways to call and characterize variants from the graph, including variants supported within the graph and novel variants supported by reads that were used to augment the graph.

----------------------

## Module Review

Congratulations, you have completed the pangenomics module!

In this module, you learned about pangenomics and used a yeast dataset to build pangenome graphs using PGGB. You learned how to search these graphs for regions that match DNA sequence queries using BLAST and how to interactively visualize these graphs using Bandage. In addition, you learned how to use VG to index the graphs, map reads to the graphs, and call variants. Well done!

----------------------

## Clean up

<div class="alert alert-warning">No cleanup is necessary for this submodule. Don't forget to shut down your Workbench when you are done working through this module!  

----------------------
    
If you do not plan to come back to review this module or to try to run your own data in this environment, delete your Workbench so you do not continue to incur charges.

</div>