# Pangenomics
--------------------------------------------

# Indexing Graphs with vg

## Overview

The Variation Graph Toolkit (VG) allows us to perform different operations on pangenomic graphs. You will learn about VG's capabilities and use VG to index PGGB graphs and get some stats.

## Learning Objectives
+ Understand VG and the different things it can be used for
+ Learn how to index graphs 
+ Get stats on the graphs

## Get Started

We will use the Variation Graph Toolkit (VG) to index our PGGB graphs, map sequences to them, and call variants.

However, VG can also create graphs and do many other steps in pangenomic analysis.
If you would like to learn about how to construct and manipulate graphs using VG and other pangenomic tools, please see [our virtual workshops](https://inbre.ncgr.org/ncgr-workshops/upcoming-ncgr-workshops.html).


### Variation Graph Toolkit (VG)

While we will not use VG to create pangenomics graphs in this module, it is important to understand the kinds of graphs that VG understands.

VG graphs can be cyclic, meaning that a path through the graph can visit the same node more than once.
This is important for capturing, for example, duplicated genomic regions.
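
To make the idea of a cyclic graph concrete, here is a minimal sketch of a toy sequence graph in Python. The node sequences, the edge set, and the helper name `spell` are made up for illustration. The sketch also ignores strand orientation, which real VG graphs track, so treat it as a picture of the concept rather than of VG's data model.

```python
# Toy sequence graph: node IDs map to sequences, edges are directed (from, to).
nodes = {1: "ACG", 2: "TTA", 3: "GGC"}
edges = {(1, 2), (2, 2), (2, 3)}  # (2, 2) is a self-loop, which makes the graph cyclic


def spell(path):
    """Return the sequence spelled by a path, checking that every step follows an edge."""
    for a, b in zip(path, path[1:]):
        if (a, b) not in edges:
            raise ValueError(f"no edge {a} -> {b}")
    return "".join(nodes[n] for n in path)


single_copy = spell([1, 2, 3])    # haplotype with one copy of node 2
duplicated = spell([1, 2, 2, 3])  # haplotype that revisits node 2 (a tandem duplication)
print(single_copy)  # ACGTTAGGC
print(duplicated)   # ACGTTATTAGGC
```

Both haplotypes share the same three nodes. The duplication is stored as one extra edge rather than as a second copy of the sequence.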

VG graphs are otherwise general.
They are considered reference graphs, iterative, and reference-free.

VG has tools for each of the following pangenomic steps:

+ Constructing graphs
+ Manipulating graphs
+ Indexing graphs
+ Mapping sequences to graphs
+ Calling variants on mapped sequences
+ Visualizing graphs

VG can also do:

+ [Transcriptomic analysis](https://github.com/vgteam/vg#transcriptomic-analysis)
+ Assembly-based pipelines
+ So much more

Citation:

![Garrison, E., Sirén, J., Novak, A. et al.](./Figures/VGref.png)

![vg Graph Genomics Pipeline: https://github.com/vgteam/vg](./Figures/VGpipe.png)


A reference genome "decorated" with variants:

![GRAF™ Pan Genome Reference: https://www.sevenbridges.com/graf/](./Figures/GRAF.png)
 

### VG Index Formats

VG has several different index formats.

XG (lightweight graph / path representation)

+ Binary file containing the graph structure (nodes, edges, paths) along with the node sequences
+ Complex data structure that answers graph queries efficiently

GCSA (Generalized Compressed Suffix Array)

+ Equivalent to .sa file created by bwa index
+ Binary file containing a suffix array that efficiently looks up where sequences occur in the graph

For more information, visit https://github.com/vgteam/vg/wiki/File-Formats
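
To give a feel for what a suffix array lookup does, here is a minimal sketch on an ordinary linear string. It is not GCSA, which indexes paths through a graph and is compressed. The function names `build_suffix_array` and `find` and the example sequence are illustrative assumptions.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text, in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find(text, sa, query):
    """Return the sorted start positions of query in text, using binary search over sa."""
    k = len(query)

    # Lower bound: first suffix whose length-k prefix is >= query.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] < query:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose length-k prefix is > query.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] <= query:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


text = "GATTACATTAG"
sa = build_suffix_array(text)
print(find(text, sa, "TTA"))  # [2, 7]
```

Because the suffixes are sorted, all occurrences of a query sit next to each other in the array, so a lookup needs only two binary searches instead of a scan of the whole sequence.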

### Converting our graphs from GFA to VG format

Previously, you created these graphs using PGGB:  
yprp.chrVIII.pggb.gfa (a graph for yeast chromosome VIII)  
output_allchrs/*/*gfa (16 subgraphs representing the 16 yeast chromosomes)  
yprp.fullgenome.pggb.gfa (a graph for the entire yeast genome)

Before indexing, we need to convert the graphs from GFA to VG format.

NOTE: You can index a GFA file rather than a VG file but this may have implications for mapping reads.
There’s also an [autoindex](https://github.com/vgteam/vg/wiki/Automatic-indexing-for-read-mapping-and-downstream-inference) command.

In [None]:
!vg convert -g -v yprp.chrVIII.pggb.gfa > yprp.chrVIII.pggb.vg

<div class="alert alert-block alert-info"> <b>Try this for the full graph (yprp.fullgenome.pggb.gfa):</b>
    <ul>
        <li>Create a blank code cell below.</li>
        <li>Convert yprp.fullgenome.pggb.gfa from GFA format to VG, calling the result "yprp.fullgenome.pggb.vg".</li>
    </ul>
</div>


<details>
<summary>Click for help</summary>
<br>
!vg convert -g -v yprp.fullgenome.pggb.gfa > yprp.fullgenome.pggb.vg
</details>

Now practice on the 16 chromosome subgraphs. Convert each one to vg format. You could do each one individually but a for loop will make it easier. The gfa files are labelled 0-15.

In [None]:
!for i in {0..15}; \
do \
    vg convert -g -v output_allchrs/*/*community.${i}.fa*gfa > yprp.allchrs.${i}.vg; \
done
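
The loop relies on each glob pattern matching exactly one file. If you want to confirm that before converting, a small Python check along these lines may help. The function name `community_gfas` is hypothetical. The directory layout and file pattern are taken from the loop above, and the count of 16 is the number of yeast chromosomes.

```python
import glob
import os


def community_gfas(root="output_allchrs", n=16):
    """Map each community index 0..n-1 to its GFA file; raise unless exactly one file matches."""
    found = {}
    for i in range(n):
        # Same pattern as the shell loop: output_allchrs/*/*community.${i}.fa*gfa
        pattern = os.path.join(root, "*", f"*community.{i}.fa*gfa")
        matches = sorted(glob.glob(pattern))
        if len(matches) != 1:
            raise ValueError(f"community {i}: expected 1 GFA file, found {len(matches)}")
        found[i] = matches[0]
    return found


# Only run the check when the PGGB output directory is present.
if os.path.isdir("output_allchrs"):
    for index, path in community_gfas().items():
        print(index, path)
```

A missing or duplicated subgraph then raises an error immediately, rather than surfacing later as an empty or unexpected .vg file.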

At this point, you can continue to keep them separate but let's merge them into a single graph using 'vg combine'.

In [None]:
!vg combine yprp.allchrs.*.vg > yprp.allchrs.pggb.vg

### Indexing with VG

Generate .xg and .gcsa index files for the yprp.chrVIII.pggb.vg file that you converted from the PGGB graph above.

The parameters:

-x Name of the .xg index file  
-g Name of the .gcsa index file (we will use this later)  
-p Show progress

First, we'll make the .xg index:

In [None]:
!vg index -p -x yprp.chrVIII.pggb.xg yprp.chrVIII.pggb.vg

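We need to modify our graph before building the .gcsa index because vg index cannot handle nodes longer than 1024 bp when making a .gcsa file. The -X parameter of vg mod chops long nodes into pieces of at most the given length, here 256 bp.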

In [None]:
!vg mod -X 256 yprp.chrVIII.pggb.vg > yprp.chrVIII.pggb.mod.vg
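
As a rough picture of what chopping to a maximum node length means, the sketch below splits a single sequence into consecutive pieces. The helper name `chop` and the 600 bp example node are assumptions. This is not how VG implements it: VG also assigns IDs to the new nodes and rewires the edges and paths that pass through them.

```python
def chop(sequence, max_len=256):
    """Split one node sequence into consecutive pieces no longer than max_len."""
    if max_len < 1:
        raise ValueError("max_len must be positive")
    return [sequence[i:i + max_len] for i in range(0, len(sequence), max_len)]


node = "ACGT" * 150  # a hypothetical 600 bp node
pieces = chop(node)
print([len(p) for p in pieces])  # [256, 256, 88]
```

Joining the pieces back together gives the original sequence, so chopping changes how the graph is divided into nodes but not the sequence it spells.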

Now we need to prune the graph because it is too complex to make a .gcsa index.

In [None]:
!vg prune yprp.chrVIII.pggb.mod.vg > yprp.chrVIII.pggb.pruned.vg

Now make the .gcsa index of the pruned graph. Use the -g parameter for .gcsa.

In [None]:
!vg index -p -g yprp.chrVIII.pggb.gcsa yprp.chrVIII.pggb.pruned.vg

And, finally, remove the pruned graph.

In [None]:
!rm -f yprp.chrVIII.pggb.pruned.vg
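
Since the same five commands are repeated for every graph, they can be collected in one place. The sketch below does that in Python. The helper names `index_steps` and `run_steps` are hypothetical, while the commands and flags are the ones used in the cells above. Printing the steps does not require vg, but running them does.

```python
import shutil
import subprocess


def index_steps(prefix, chop=256):
    """Return (argv, stdout_file) pairs for the indexing steps; stdout_file is None if unused."""
    graph = f"{prefix}.vg"
    mod = f"{prefix}.mod.vg"
    pruned = f"{prefix}.pruned.vg"
    return [
        (["vg", "index", "-p", "-x", f"{prefix}.xg", graph], None),      # .xg index
        (["vg", "mod", "-X", str(chop), graph], mod),                    # chop long nodes
        (["vg", "prune", mod], pruned),                                  # prune complex regions
        (["vg", "index", "-p", "-g", f"{prefix}.gcsa", pruned], None),   # .gcsa index
        (["rm", "-f", pruned], None),                                    # remove the pruned graph
    ]


def run_steps(steps):
    """Run each step in order, redirecting stdout to a file where one is given."""
    if shutil.which("vg") is None:
        raise RuntimeError("vg was not found on PATH")
    for argv, out in steps:
        if out is None:
            subprocess.run(argv, check=True)
        else:
            with open(out, "wb") as handle:
                subprocess.run(argv, check=True, stdout=handle)


# Print the commands for the chrVIII graph without running them.
for argv, out in index_steps("yprp.chrVIII.pggb"):
    print(" ".join(argv) + (f" > {out}" if out else ""))
```

Calling `run_steps(index_steps("yprp.fullgenome.pggb"))` would then run the same steps for another graph, assuming the matching .vg file exists.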

<div class="alert alert-block alert-info"> <b>Try this for the full genome graph (yprp.fullgenome.pggb.vg):</b>  
    <ul>
        <li>Create a blank code cell below.</li>
        <li>Create a .xg index.</li>
        <li>Modify the graph to reduce the kmer offset.</li>
        <li>Prune the modified graph.</li>
        <li>Create a .gcsa index on the pruned graph using the prefix "yprp.fullgenome.pggb" for the result.</li>
        <li>Remove the pruned graph.</li>        
    </ul>
</div>

<details>
<summary>Click for help</summary>
<br>

!vg index -p -x yprp.fullgenome.pggb.xg yprp.fullgenome.pggb.vg

!vg mod -X 256 yprp.fullgenome.pggb.vg > yprp.fullgenome.pggb.mod.vg

!vg prune yprp.fullgenome.pggb.mod.vg > yprp.fullgenome.pggb.pruned.vg

!vg index -p -g yprp.fullgenome.pggb.gcsa yprp.fullgenome.pggb.pruned.vg

!rm -f yprp.fullgenome.pggb.pruned.vg

</details>

<div class="alert alert-block alert-info"> <b>Now, try this for the combined graph, originally run as individual chromosomes (yprp.allchrs.pggb.vg):</b>  
    <ul>
        <li>Create a blank code cell below.</li>
        <li>Create a .xg index.</li>
        <li>Modify the graph to reduce the kmer offset.</li>
        <li>Prune the modified graph.</li>
        <li>Create a .gcsa index on the pruned graph using the prefix "yprp.allchrs.pggb" for the result.</li>
        <li>Remove the pruned graph.</li>        
    </ul>
</div>

<details>
<summary>Click for help</summary>
<br>

!vg index -p -x yprp.allchrs.pggb.xg yprp.allchrs.pggb.vg

!vg mod -X 256 yprp.allchrs.pggb.vg > yprp.allchrs.pggb.mod.vg

!vg prune yprp.allchrs.pggb.mod.vg > yprp.allchrs.pggb.pruned.vg

!vg index -p -g yprp.allchrs.pggb.gcsa yprp.allchrs.pggb.pruned.vg

!rm -f yprp.allchrs.pggb.pruned.vg

</details>

### Graph statistics

Let's get some graph statistics for all 3 graphs, starting with the chrVIII graph.

**vg stats**

+ -z, --size             size of graph
+ -N, --node-count       number of nodes in graph
+ -E, --edge-count       number of edges in graph
+ -l, --length           length of sequences in graph

NOTE: the number of nodes and edges is given by default but calling them explicitly means they will be labeled.

In [None]:
!vg stats -z -N -E -l yprp.chrVIII.pggb.vg

The length of chromosome VIII is ~550-580k. So a total length of just over 600k makes sense: it covers the full chromosome length plus a little more to account for regions that have diverged between our 3 accessions and, therefore, have more than one version in the graph.
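
To see why divergent regions add to the graph length, it can help to count the same quantities by hand on a tiny GFA. The sketch below is an illustration only. The function name `gfa_stats` and the toy graph are assumptions, and vg stats remains the authoritative tool.

```python
def gfa_stats(lines):
    """Count segments (nodes), links (edges), and total segment length in GFA lines."""
    nodes = edges = length = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":
            nodes += 1
            if fields[2] != "*":  # "*" means the sequence is not stored in the file
                length += len(fields[2])
        elif fields[0] == "L":
            edges += 1
    return {"nodes": nodes, "edges": edges, "length": length}


# Toy graph: an 8 bp region where two haplotypes differ by one SNP (G vs. T).
toy_gfa = [
    "H\tVN:Z:1.0",
    "S\t1\tACGT",
    "S\t2\tG",
    "S\t3\tT",
    "S\t4\tCCA",
    "L\t1\t+\t2\t+\t0M",
    "L\t1\t+\t3\t+\t0M",
    "L\t2\t+\t4\t+\t0M",
    "L\t3\t+\t4\t+\t0M",
]
print(gfa_stats(toy_gfa))  # {'nodes': 4, 'edges': 4, 'length': 9}
```

Each haplotype in the toy graph is 8 bp long, but the graph length is 9 because both versions of the SNP are stored. To try it on a real file, pass an open file handle, for example `gfa_stats(open("yprp.chrVIII.pggb.gfa"))`.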

<div class="alert alert-block alert-info"> <b>Try this:</b>  
    <ul>
        <li>Create a blank code cell below.</li>
        <li>Get stats for the graph created directly from the full genome (yprp.fullgenome.pggb.vg).</li>
        <li>Get stats for the graph created by combining the 16 chromosomal subgraphs (yprp.allchrs.pggb.vg).</li>
        <li>What differences do you notice between the stats for these two graphs?</li> 
    </ul>
</div>

<details>
<summary>Click for help</summary>
<br>

!vg stats -z -N -E -l yprp.fullgenome.pggb.vg

!vg stats -z -N -E -l yprp.allchrs.pggb.vg

The graph made directly from the full genome has more nodes and edges. This likely reflects extra connections between chromosomes, which can form when all chromosomes are aligned at once but not when each chromosome is built in isolation. Where such a connection joins nearly identical sequences from different chromosomes, a node that would otherwise hold a single stretch of identical sequence may be split into several nodes to account for the differences.

The graph made from individual chromosome subgraphs has a slightly longer sequence length, possibly reflecting some duplication between chromosomes that wasn't fully merged.

</details>

## Conclusion
In this submodule, you learned about the VG toolkit and what it can be used for. You indexed the PGGB graphs that you had previously made and obtained statistics for them. In the next submodule, you will learn how to map reads to the indexed graph.

## Clean up
No cleanup is necessary for this submodule. Don't forget to shut down your Workbench when you are done working through this module!