# Pangenomics
--------------------------------------------

# Indexing Graphs with VG

## Overview

The [Variation Graph Toolkit (VG)](https://pubmed.ncbi.nlm.nih.gov/30125266/) provides tools for working with pangenomic graphs. You will learn about VG's capabilities, then use VG to index PGGB graphs and compute basic graph statistics.

## Learning Objectives
+ Understand VG and its uses
+ Index graphs 
+ Calculate stats on the graphs

## Getting Started

In this submodule you will be introduced to the Variation Graph Toolkit (VG) and learn how to use it to index graphs and get graph statistics.


#### VG
- Overview
- Index PGGB Graphs
- Graph Statistics



----------------------

## Variation Graph Toolkit (VG)

We will use the [Variation Graph Toolkit (VG)](https://www.nature.com/articles/nbt.4227) to index our PGGB graphs, map sequences to them, and call variants.

However, VG can also create graphs and perform many other steps in pangenomic analysis.
If you would like to learn about how to construct and manipulate graphs using VG and other pangenomic tools, please see [our virtual workshops](https://inbre.ncgr.org/ncgr-workshops/upcoming-ncgr-workshops.html).


While we will not use VG to create pangenomic graphs in this module, it is important to understand the kinds of graphs that VG works with.

VG graphs can be cyclic, meaning that a path through the graph can visit the same node more than once.
This is important for capturing, for example, duplicated genomic regions.
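
As a toy illustration of a cyclic graph, the sketch below stores a tiny graph in plain Python dictionaries and sets. This is not VG's internal data structure or file format, and the node IDs, sequences, and the `spell_path` helper are all made up for this example. The self-loop on node 2 lets one path walk through the same node twice, which is one way a tandem duplication can be represented.

```python
# Toy sequence graph: node id -> sequence (hypothetical data, not a VG format).
nodes = {1: "ACGT", 2: "TTG", 3: "CCA"}

# Directed edges; the edge (2, 2) is a self-loop that makes the graph cyclic.
edges = {(1, 2), (2, 2), (2, 3)}


def spell_path(path):
    """Return the sequence spelled by a walk, checking every step uses an edge."""
    for a, b in zip(path, path[1:]):
        if (a, b) not in edges:
            raise ValueError(f"no edge from {a} to {b}")
    return "".join(nodes[n] for n in path)


# A genome with one copy of the region stored in node 2 ...
single_copy = spell_path([1, 2, 3])
# ... and a genome with a tandem duplication, which revisits node 2.
duplicated = spell_path([1, 2, 2, 3])

print(single_copy)
print(duplicated)
```

Both genomes are stored using the same three nodes. Only the path through the graph differs.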

VG graphs are otherwise general.
VG can work with reference-based graphs, iteratively built graphs, and reference-free graphs.

VG has tools that can perform the following pangenomic steps:

+ Construct graphs
+ Manipulate graphs
+ Index graphs
+ Map sequences to graphs
+ Call variants on mapped sequences
+ Visualize graphs

In addition to creating pangenomic graphs from genomic data, VG can also construct graphs and do analyses from [transcriptomic data](https://github.com/vgteam/vg#transcriptomic-analysis).

And VG can do so much more!

The figure below has more information about [the capabilities of VG](https://github.com/vgteam/vg), including the commands and the relationships between the commands/capabilities.

<figure>
  <img
    src="./Figures/VGpipe.png"
    alt="Variation Graph pipeline" />
  <figcaption><a href="https://github.com/vgteam/vg">https://github.com/vgteam/vg</a></figcaption>
</figure>


----------------------

## Converting our graphs from GFA to VG format

Previously, you created the following graphs using PGGB:  
A. *yprp.chrVIII.pggb.gfa* (a graph for yeast chromosome VIII)  
B. *output_allchrs/\*gfa* (16 subgraphs representing the 16 yeast chromosomes)  
C. *yprp.fullgenome.pggb.gfa* (a graph for the entire yeast genome)

1. Before indexing, we need to convert the graphs from GFA to VG format.

<div class="alert alert-block alert-info"> <b>NOTE:</b> You can index a GFA file rather than a VG file, but this may have implications for mapping reads. There is also an <a href="https://github.com/vgteam/vg/wiki/Automatic-indexing-for-read-mapping-and-downstream-inference">autoindex</a> command.</div>

In [None]:
!vg convert -f graphs/yprp.chrVIII.pggb.gfa > graphs/yprp.chrVIII.pggb.vg

<div class="alert alert-block alert-success"> <b>Try this in the cell below:</b>
    <ul>
        <li>Use the code cell below.</li>
        <li>Convert the full graph (yprp.fullgenome.pggb.gfa) from GFA format to VG, calling the result yprp.fullgenome.pggb.vg</li>
    </ul>
</div>

In [None]:
# Convert the full graph from GFA to VG

<details>
<summary>Click for help</summary>
<br>
!vg convert -f graphs/yprp.fullgenome.pggb.gfa > graphs/yprp.fullgenome.pggb.vg
</details>

2. Now let's practice on the 16 chromosome subgraphs. Convert each one to VG format. You could convert each one individually, but a `for` loop makes it easier. The GFA files are numbered 0-15.

In [None]:
!for i in {0..15}; \
do \
    vg convert -f graphs/output_allchrs/*/*community.${i}.fa*gfa > graphs/yprp.allchrs.${i}.vg; \
done
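
The loop above relies on each glob pattern matching exactly one GFA file per subgraph number. As a hedged sketch, the Python helper below checks that assumption before anything is converted. `plan_conversions` is a hypothetical name, not part of VG. It is a dry run that only builds the command strings and does not call `vg`, and it assumes the same directory layout as the shell loop.

```python
import glob
import os


def plan_conversions(graph_dir="graphs", n_chroms=16):
    """Build one `vg convert` command per chromosome subgraph (dry run only)."""
    commands = []
    for i in range(n_chroms):
        # Same glob pattern as the shell loop above.
        pattern = os.path.join(
            graph_dir, "output_allchrs", "*", f"*community.{i}.fa*gfa"
        )
        matches = sorted(glob.glob(pattern))
        # Stop early if a subgraph is missing or the pattern is ambiguous.
        if len(matches) != 1:
            raise FileNotFoundError(
                f"expected exactly 1 GFA for subgraph {i}, found {len(matches)}"
            )
        out = os.path.join(graph_dir, f"yprp.allchrs.{i}.vg")
        commands.append(f"vg convert -f {matches[0]} > {out}")
    return commands
```

Calling `plan_conversions()` in a code cell returns the 16 command strings, which you can print and inspect before running the real loop.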

3. You could keep the subgraphs separate, but let's merge them into a single graph using `vg combine`.

In [None]:
!vg combine graphs/yprp.allchrs.{0..15}.vg > graphs/yprp.allchrs.pggb.vg

----------------------

## Indexing with VG

We need to index the graph before we do read mapping. In this case, we will create the necessary indexes for `vg giraffe`, which we will use for read mapping. We will index with `vg autoindex`.

The parameters:

`--workflow`  The name of the downstream workflow you are preparing for    
`-g`  Name of the graph  
`-p`  Output prefix

1. Index the chrVIII graph.

In [None]:
!vg autoindex --workflow giraffe -g graphs/yprp.chrVIII.pggb.vg -p graphs/yprp.chrVIII.pggb

This will create all the necessary files for `vg giraffe`:

A. *yprp.chrVIII.pggb.giraffe.gbz* - a [GBZ](https://academic.oup.com/bioinformatics/article/38/22/5012/6731924?login=false) format graph that includes a [GBWT](https://github.com/jltsiren/gbwt) index and the corresponding GBWTGraph.

B. *yprp.chrVIII.pggb.min* - minimizer index annotated with positions in the distance index.

C. *yprp.chrVIII.pggb.dist* - minimum distance index.


2. Make sure the files are all there by listing the files. The `-l` parameter will list files in long form allowing us to see the size of the file.

In [None]:
#This will list some of our original graphs as well.

!ls -l graphs/yprp.chrVIII.pggb.*
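
If you prefer an explicit check to reading the `ls` output, the sketch below looks for the three index files listed above. `missing_indexes` and `GIRAFFE_SUFFIXES` are hypothetical names for this example. The suffixes are the ones given in this submodule, and other VG versions may name their index files differently.

```python
import os

# Suffixes of the three giraffe index files listed above (an assumption taken
# from this submodule; adjust them if your VG version uses other names).
GIRAFFE_SUFFIXES = (".giraffe.gbz", ".min", ".dist")


def missing_indexes(prefix):
    """Return the expected index files for `prefix` that are absent or empty."""
    missing = []
    for suffix in GIRAFFE_SUFFIXES:
        path = prefix + suffix
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(path)
    return missing
```

For example, `missing_indexes("graphs/yprp.chrVIII.pggb")` returns an empty list when all three files exist and are not empty.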

<div class="alert alert-block alert-success"> <b>Try this in the cells below:</b>
    <ul>
        <li>Index the graph we made from the full genome (yprp.fullgenome.pggb.vg).</li>
        <li>Index the combined graph, originally run as individual chromosomes (yprp.allchrs.pggb.vg).</li>
    </ul>
</div>

In [None]:
# Index the full genome graph

In [None]:
# Index the combined graph

<details>
<summary>Click for help</summary>
<br>

!vg autoindex --workflow giraffe -g graphs/yprp.fullgenome.pggb.vg -p graphs/yprp.fullgenome.pggb  

!vg autoindex --workflow giraffe -g graphs/yprp.allchrs.pggb.vg -p graphs/yprp.allchrs.pggb


</details>

----------------------

## Graph statistics

1. Let's get some graph statistics using `vg stats` for all 3 graphs, starting with the chrVIII graph.

The parameters:

`-z` size of graph  
`-N` number of nodes in graph  
`-E` number of edges in graph  
`-l` length of sequences in graph

<div class="alert alert-block alert-info"> <b>NOTE:</b> The number of nodes and edges is given by default but calling them explicitly means they will be labeled. There will also be two unlabeled rows that match the nodes and edges.</div>

In [None]:
!vg stats -z -N -E -l graphs/yprp.chrVIII.pggb.vg

The length of chromosome VIII is ~550-580 kb. So a total graph length of just over 600 kb makes sense: it covers the full chromosome length plus a little more to account for regions that have diverged between our 3 accessions and, therefore, have more than one version in the graph.
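
To see why a graph is a little longer than any single sequence in it, here is a toy sketch that counts nodes, edges, and total sequence length in a tiny made-up GFA. The GFA text, the path names, and the `toy_stats` helper are all hypothetical, and this is not how `vg stats` is implemented. It only mirrors the same three quantities.

```python
# Toy GFA (hypothetical): two genomes that differ at one SNP (node 2 vs node 3).
TOY_GFA = """\
H\tVN:Z:1.0
S\t1\tACGT
S\t2\tA
S\t3\tG
S\t4\tTTCA
L\t1\t+\t2\t+\t0M
L\t1\t+\t3\t+\t0M
L\t2\t+\t4\t+\t0M
L\t3\t+\t4\t+\t0M
P\tgenomeA\t1+,2+,4+\t*
P\tgenomeB\t1+,3+,4+\t*
"""


def toy_stats(gfa_text):
    """Count nodes and edges, and sum node and path lengths, in GFA 1 text."""
    seqs, n_edges, paths = {}, 0, {}
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S":  # segment line: one node and its sequence
            seqs[fields[1]] = fields[2]
        elif fields[0] == "L":  # link line: one edge
            n_edges += 1
        elif fields[0] == "P":  # path line: one genome's walk through the graph
            # Drop the trailing +/- orientation from each step.
            paths[fields[1]] = [step[:-1] for step in fields[2].split(",")]
    path_lengths = {
        name: sum(len(seqs[s]) for s in steps) for name, steps in paths.items()
    }
    return {
        "nodes": len(seqs),
        "edges": n_edges,
        "length": sum(len(s) for s in seqs.values()),
        "path_lengths": path_lengths,
    }


stats = toy_stats(TOY_GFA)
print(stats)
```

Each toy genome is 9 bases long, but the graph length is 10 because the variable position is stored once per allele. The same effect, summed over every diverged region, makes the chrVIII graph longer than the chromosome itself.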

<div class="alert alert-block alert-success"> <b>Try this in the cells below:</b>  
    <ul>
        <li>Get stats for the graph created directly from the full genome (yprp.fullgenome.pggb.vg).</li>
        <li>Get stats for the graph created by combining the 16 chromosomal subgraphs (yprp.allchrs.pggb.vg).</li>
    </ul>
</div>

In [None]:
# Get stats for the full genome graph

In [None]:
# Get stats for the combined graph

<details>
<summary>Click for help</summary>
<br>

!vg stats -z -N -E -l graphs/yprp.fullgenome.pggb.vg

!vg stats -z -N -E -l graphs/yprp.allchrs.pggb.vg

</details>

Run the code below to see the flashcards.

In [None]:
from IPython.display import IFrame
IFrame('../html/flashcard_graphstats.html', width=800, height=400)

----------------------

## Conclusion
In this submodule, you learned about the VG toolkit and what it can be used for. You also learned how to index graphs with VG and indexed the PGGB graphs that you had previously made. Finally, you learned how to obtain graph statistics. In the next submodule, you will learn how to map reads to the indexed graph.

----------------------

## Clean up

<div class="alert alert-warning">No cleanup is necessary for this submodule. Don't forget to shut down your Workbench when you are done working through this module!</div>