# Pangenomics
--------------------------------------------

# Building Graphs with PGGB


## Overview
The PanGenome Graph Builder (PGGB) creates reference-free pangenomic graphs (https://github.com/pangenome/pggb). You will learn about the algorithm and its graphical output, its strengths and weaknesses, and you will build a yeast pangenomic graph.

## Learning Objectives
+ Understand what types of graphs PGGB builds and their pros/cons
+ Learn how to build graphs with PGGB

## Get Started
In this submodule you will learn how to build pangenomic graphs with PGGB.

PGGB lecture:
- Reference-Free Graphs with PGGB

PGGB hands-on tutorials:
- Yeast Dataset
- PGGB graph generation
- Graph inspection


## Reference-Free Graphs with PGGB

### PanGenome Graph Builder (PGGB)

The PGGB algorithm creates *reference-free graphs* in two stages:
+ It computes all-pairwise whole-genome alignments
+ It induces a graph from those alignments

PGGB is built on the idea that a pangenome graph represents an alignment of the genomes in the graph, but infers the graph from all pairwise alignments instead of a multiple alignment.

PGGB computes all pairwise alignments efficiently by focusing on long, colinear homologies, instead of using the more traditional k-mer matching alignment approach.

Critically, pggb performs graph *normalization* to ensure that paths through the graph (e.g. chromosomes) have a linear structure while allowing for cyclic graph structures that capture structural variation.

![PGGB Flow Diagram](./Figures/pggbFlowDiagram.png)

### Reference-Free Graphs

https://academic.oup.com/bioinformatics/article/30/24/3476/2422268

![Input Genomes](./Figures/InputGenomes.png)

### PGGB Algorithm

1. Perform all-pairwise genome alignments using [wfmash](https://github.com/waveygang/wfmash)
2. Convert alignments into a graph using [seqwish](https://github.com/ekg/seqwish)
3. Progressively normalize graph with [smoothxg](https://github.com/pangenome/smoothxg) and [gfaffix](https://github.com/marschall-lab/GFAffix)
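
The idea behind step 2 can be illustrated with a toy example. The sketch below is not how seqwish is implemented. It assumes alignments arrive as exact, gap-free matches and merges matched characters with a union-find structure, so each resulting set plays the role of one graph position. The function name `induce_nodes` and the match tuple layout are made up for illustration.

```python
def induce_nodes(sequences, matches):
    """Toy graph induction: merge aligned characters with union-find.

    sequences: dict mapping a sequence name to its string
    matches:   list of (name_a, start_a, name_b, start_b, length) exact matches
    Returns the number of merged positions (toy "graph nodes").
    """
    parent = {}

    def find(x):
        # Register unseen items as their own root, then follow parent links.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Every character of every input sequence starts as its own set.
    for name, seq in sequences.items():
        for i in range(len(seq)):
            find((name, i))

    # Each matched pair of characters is merged into one set.
    for name_a, start_a, name_b, start_b, length in matches:
        for k in range(length):
            if sequences[name_a][start_a + k] != sequences[name_b][start_b + k]:
                raise ValueError("matches must be exact in this toy model")
            union((name_a, start_a + k), (name_b, start_b + k))

    return len({find(x) for x in list(parent)})


toy_sequences = {"A": "ACGT", "B": "ACTT", "C": "ACGT"}
toy_matches = [("A", 0, "B", 0, 2), ("A", 3, "B", 3, 1), ("A", 0, "C", 0, 4)]
print(induce_nodes(toy_sequences, toy_matches))
```

In this toy input the 12 characters collapse into 5 positions: the mismatching third base of `B` stays separate from the `G` shared by `A` and `C`, which is the kind of variation the graph is meant to expose.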



## Yeast Genome Assemblies and Reads

The [Yeast Population Reference Panel (YPRP)](https://yjx1217.github.io/Yeast_PacBio_2016/welcome/) includes 12 yeast genome assemblies:

  + 7 *Saccharomyces cerevisiae* (brewer’s yeast), including the S288C reference
  + 5 *Saccharomyces paradoxus* (wild yeast)

More information is available in the [YPRP manuscript](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2659681/).

![Yeast Genomes: https://yjx1217.github.io/Yeast_PacBio_2016/welcome/](./Figures/Yeast.png)

Yeast genomes are ~12 Mb and have 16 chromosomes.

These yeast genomes were assembled with [LRSDAY](https://github.com/yjx1217/LRSDAY) (Long-read Sequencing Data Analysis for Yeasts).

+ [YPRP: 12 Yeast PacBio Assemblies (Chromosome level)](https://yjx1217.github.io/Yeast_PacBio_2016/data/)
  + ~100-200x PacBio sequencing reads
  + HGAP + Quiver polishing
  + ~200-500x Illumina (Pilon correction)
  + Manual curation
  + Annotation



### SK1 Illumina Reads

SK1 is the most distant from S288C.

We will use SK1 reads later on to call variants.

![Yeast Genomes: https://yjx1217.github.io/Yeast_PacBio_2016/welcome/](./Figures/YeastB.png)



### CUP1 Gene

![](./Figures/StructuralRearrangements.png)
[Structural Rearrangements](https://www.nature.com/articles/ng.3847)
+ [CUP1](https://www.yeastgenome.org/locus/S000001095) - A gene involved in heavy metal (copper) tolerance that shows copy-number variation (CNV) in the population.
+ [YHR054C](https://www.yeastgenome.org/locus/S000001096) - Putative protein of unknown function.



### Preparing the Yeast Input Assemblies

1. Get the three yeast genome assembly files (FASTA).
 + `curl` transfers data from a URL
 + `--location` tells curl to follow any redirects
 + `--output` names the file the download is written to


In [None]:
!curl --location --output S288C.genome.fa.gz http://yjx1217.github.io/Yeast_PacBio_2016/data/Nuclear_Genome/S288C.genome.fa.gz
!curl --location --output Y12.genome.fa.gz http://yjx1217.github.io/Yeast_PacBio_2016/data/Nuclear_Genome/Y12.genome.fa.gz
!curl --location --output SK1.genome.fa.gz http://yjx1217.github.io/Yeast_PacBio_2016/data/Nuclear_Genome/SK1.genome.fa.gz

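2. Change the FASTA headers to include the yeast accession name, so that every sequence can be traced back to its assembly once the files are combined. For background on naming conventions, see the [Pangenome Sequence Naming Specification](https://github.com/pangenome/PanSN-spec).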

 + The for loop will work through each of the genome fasta files.
 + It will strip off the file suffix to get the yeast accession name.
 + It will then use sed to substitute the accession name in after the ">" of the header line.
 + Finally, we will rename the file.


In [None]:
!for file in *.genome.fa.gz; \
do \
    accession=$(basename "$file" .genome.fa.gz); \
    zcat "${file}" | sed "s/^>/>${accession}_/" | gzip > "prepend_${file}"; \
    mv "prepend_${file}" "${file}"; \
done
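
For readers who prefer to stay in Python, the same renaming can be sketched with the standard `gzip` module. This is an illustrative equivalent of the shell loop, not part of the original workflow; the helper names are made up, and the `.genome.fa.gz` suffix is taken from the file names downloaded above.

```python
import gzip
import os


def accession_from_filename(path, suffix=".genome.fa.gz"):
    """Strip the directory and the suffix, e.g. 'SK1.genome.fa.gz' -> 'SK1'."""
    name = os.path.basename(path)
    return name[: -len(suffix)] if name.endswith(suffix) else name


def prefix_fasta_headers(lines, accession):
    """Yield FASTA lines with '<accession>_' inserted right after the '>'."""
    for line in lines:
        if line.startswith(">"):
            yield f">{accession}_{line[1:]}"
        else:
            yield line


def rewrite_gzipped_fasta(src, dst, accession):
    """Read a gzipped FASTA and write a gzipped copy with renamed headers."""
    with gzip.open(src, "rt") as fin, gzip.open(dst, "wt") as fout:
        fout.writelines(prefix_fasta_headers(fin, accession))


# Small in-memory demonstration on two FASTA lines.
print(list(prefix_fasta_headers([">chrI\n", "ACGT\n"], "SK1")))
```

Only lines that start with `>` are changed, which mirrors anchoring the `sed` pattern to the start of the line.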

3. Create a FASTA file containing all three YPRP assemblies. Call it `yprp.all.fa`.
+ zcat uncompresses the files (we will compress the files later using a different compression algorithm).

In [None]:
!zcat *genome.fa.gz > yprp.all.fa

To confirm that your file looks correct:  
+ Count the number of sequences  
+ Look at the sequence headers

Hint: add a new code cell below to run your code in. You can do this by clicking the "insert a cell below" (B) icon in the upper right of this block.

<details>
<summary>Click for help</summary>

!# Count the number of sequences

!grep -c '>' yprp.all.fa

!# Look at the sequence headers

!grep '>' yprp.all.fa
</details>
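
The same check can be sketched in Python, with the added benefit of flagging duplicate sequence names, which would indicate that the renaming step did not run as intended. The function name is hypothetical and the demo lines are toy data, not real YPRP content.

```python
from collections import Counter


def summarize_headers(fasta_lines):
    """Return (number of sequences, list of names, sorted duplicate names)."""
    names = []
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            names.append(fields[0] if fields else "")
    duplicates = sorted(name for name, n in Counter(names).items() if n > 1)
    return len(names), names, duplicates


toy_lines = [">S288C_chrI\n", "ACGT\n", ">SK1_chrI\n", "ACGA\n"]
count, names, duplicates = summarize_headers(toy_lines)
print(count, names, duplicates)
```

To run it on the real file, pass an open file handle, for example `summarize_headers(open("yprp.all.fa"))`. If each of the three nuclear assemblies holds exactly 16 chromosome sequences, the count should be 48 with no duplicates.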

4. Create a FASTA file containing chromosome VIII from every assembly. Call it `yprp.chrVIII.fa` (it will be compressed in the next step).
+ The awk command changes the record separator (RS) to ">"; in other words, it makes each sequence a record.
+ For each record (sequence) it checks to see if it matches chrVIII; if so, it prints it.

In [None]:
!awk 'BEGIN{RS=">"} $1~/chrVIII/ {printf ">%s", $0}' yprp.all.fa > yprp.chrVIII.fa
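
The record-splitting trick used by awk can be mirrored in a few lines of Python. This is a sketch for illustration only: it reads the whole file into memory, which is acceptable for ~12 Mb yeast genomes, and it uses a plain substring test where awk uses a regular expression. The function name is made up.

```python
def select_records(fasta_text, pattern="chrVIII"):
    """Return the FASTA records whose sequence name contains `pattern`."""
    selected = []
    # Splitting on ">" makes each sequence one record; the text before the
    # first ">" is empty and is skipped.
    for record in fasta_text.split(">")[1:]:
        fields = record.split()
        name = fields[0] if fields else ""
        if pattern in name:
            selected.append(">" + record)
    return "".join(selected)


toy_fasta = (
    ">S288C_chrVII\nAAAA\n"
    ">S288C_chrVIII\nCCCC\n"
    ">SK1_chrVIII\nGG\nTT\n"
    ">SK1_chrIX\nTTTT\n"
)
print(select_records(toy_fasta), end="")
```

Note that `chrVII` is not selected, because the pattern requires three `I` characters.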

To confirm that your file looks correct:
+ Count the number of sequences
+ Look at the sequence headers

<details>
<summary>Click for help</summary>

!# Count the number of sequences

!grep -c '>' yprp.chrVIII.fa

!# Look at the sequence headers

!grep '>' yprp.chrVIII.fa
</details>

5. Compress the FASTA files

We will compress the files with [bgzip](https://www.htslib.org/doc/bgzip.html). Its output is gzip-compatible but is written as a series of independent blocks, which allows much faster random access at the cost of larger files than gzip.  
+ The -c parameter writes the bgzipped data to standard output  
+ The ">" redirects standard output into a file


In [None]:
!bgzip -c yprp.all.fa > yprp.all.fa.gz
!bgzip -c yprp.chrVIII.fa > yprp.chrVIII.fa.gz
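
Because bgzipped and gzipped files share the `.gz` extension, it can be useful to confirm which one you have before indexing. The sketch below inspects the gzip header for the `BC` extra subfield that marks a BGZF block. It is a minimal check on the first block only, and the function name is made up for this example.

```python
import gzip
import struct


def is_bgzf(header: bytes) -> bool:
    """Return True if the given leading bytes look like a BGZF block header."""
    if len(header) < 18:
        return False
    # gzip magic bytes, deflate method, and the FEXTRA flag (bit 2) must be set.
    if header[0] != 0x1F or header[1] != 0x8B or header[2] != 8 or not header[3] & 4:
        return False
    xlen = struct.unpack_from("<H", header, 10)[0]
    pos, end = 12, 12 + xlen
    # Walk the extra subfields looking for SI1='B', SI2='C' with a 2-byte payload.
    while pos + 4 <= min(end, len(header)):
        si1, si2 = header[pos], header[pos + 1]
        slen = struct.unpack_from("<H", header, pos + 2)[0]
        if (si1, si2, slen) == (66, 67, 2):
            return True
        pos += 4 + slen
    return False


# Bytes laid out like an empty BGZF block, and a plain gzip stream for contrast.
bgzf_like = bytes.fromhex(
    "1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00 1b 00 "
    "03 00 00 00 00 00 00 00 00 00"
)
plain_gzip = gzip.compress(b"ACGT")
print(is_bgzf(bgzf_like), is_bgzf(plain_gzip))
```

To test a file on disk, pass its first bytes, for example `is_bgzf(open("yprp.all.fa.gz", "rb").read(18))`.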


6. Index the bgzipped files with [samtools](http://www.htslib.org/doc/samtools.html) [faidx](http://www.htslib.org/doc/samtools-faidx.html):


In [None]:
!samtools faidx yprp.all.fa.gz
!samtools faidx yprp.chrVIII.fa.gz
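
The `.fai` index that `samtools faidx` writes is a plain tab-separated text file whose first two columns are the sequence name and its length, followed by the byte offset, bases per line, and bytes per line. A small sketch for reading sequence lengths out of it follows; the function name and the toy index content are invented for illustration.

```python
def parse_fai_lengths(fai_text):
    """Map sequence name -> length from the text of a .fai index."""
    lengths = {}
    for line in fai_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        lengths[fields[0]] = int(fields[1])
    return lengths


# Toy index with made-up values: NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH
toy_fai = "toy_chrA\t8\t10\t4\t5\ntoy_chrB\t6\t30\t4\t5\n"
print(parse_fai_lengths(toy_fai))
```

On the real data this could be called as `parse_fai_lengths(open("yprp.chrVIII.fa.gz.fai").read())` to compare chromosome VIII lengths across assemblies.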

## Running pggb on Chromosome VIII

Build a chromosome VIII graph from the YPRP assemblies using the following parameters:

+ **-i yprp.chrVIII.fa.gz**
    + the bgzipped, indexed input FASTA containing all sequences
+ **-o output_chrVIII**
    + the directory where all output files should be placed
+ **-n 3**
    + the number of haplotypes (assemblies) in the input file; three assemblies were downloaded above
+ **-t 20**
    + the number of threads to use
+ **-p 95**
    + minimum sequence identity of alignment segments
+ **-s 5000**
    + nucleotide segment length when scaffolding the graph
    
NOTE: The -p and -s values were taken from the [pggb paper](https://github.com/pangenome/pggb-paper/blob/main/workflows/AllSpecies.md).
Refer to the paper for parameter suggestions for other species.



In [None]:
!pggb -i yprp.chrVIII.fa.gz -o output_chrVIII -n 3 -t 20 -p 95 -s 5000
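
One way to cross-check the value passed to `-n` is to count the distinct accession prefixes among the sequence names. The sketch below assumes the `<accession>_<chromosome>` naming produced by the header-renaming step and would miscount if an accession name itself contained an underscore. The function name is hypothetical.

```python
def count_haplotypes(headers, delimiter="_"):
    """Count distinct accession prefixes among FASTA sequence names."""
    accessions = set()
    for header in headers:
        name = header.strip().lstrip(">")
        if name:
            # Everything before the first delimiter is taken as the accession.
            accessions.add(name.split(delimiter, 1)[0])
    return len(accessions)


chrVIII_headers = [">S288C_chrVIII", ">Y12_chrVIII", ">SK1_chrVIII"]
print(count_haplotypes(chrVIII_headers))
```

The header list here is written out by hand; in practice it could come from the first column of the `.fai` index or from `grep '>'` on the FASTA file.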

Create a copy of the output graph with a simpler name.


In [None]:
!cp output_chrVIII/yprp.chrVIII.fa.gz.*.smooth.final.gfa yprp.chrVIII.pggb.gfa

You now have a graph file called yprp.chrVIII.pggb.gfa that is in GFA format. You will learn more about GFA format in the next submodule.
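
As a quick sanity check before moving on, you can tally the record types in the GFA file. Each GFA line starts with a one-letter record type, such as `S` for segments, `L` for links, and `P` for paths. The sketch below counts them; the function name and the toy graph are invented for illustration.

```python
from collections import Counter


def count_gfa_records(lines):
    """Count GFA lines by their record type (the first tab-separated field)."""
    counts = Counter()
    for line in lines:
        if line.strip():
            counts[line.split("\t", 1)[0].strip()] += 1
    return counts


toy_gfa = [
    "H\tVN:Z:1.0\n",
    "S\t1\tACG\n",
    "S\t2\tT\n",
    "L\t1\t+\t2\t+\t0M\n",
    "P\ttoy_path\t1+,2+\t*\n",
]
print(count_gfa_records(toy_gfa))
```

For the real graph, `count_gfa_records(open("yprp.chrVIII.pggb.gfa"))` should report one `P` line per input sequence if every chromosome VIII copy was embedded as a path.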

## Running pggb on all Chromosomes

While you can run all the chromosomes the same way you ran chromosome VIII, partitioning the sequences before building the graph allows us to parallelize the graph building.
The partition-before-pggb command partitions the input FASTA into smaller FASTA "communities" containing sequences that should be in the same subgraph. It takes the same parameters as pggb.

+ The communities will likely correspond to chromosomes if you have complete assemblies
+ Partitioning may improve the run-time of the normalization step and make downstream analysis easier
+ Consider skipping partitioning if your assemblies or organism have complex structure that you want represented in the graph, e.g. polyploidy or translocations

The partition-before-pggb command will print a `pggb` command for every partition to the command line and to a log file: `output_all/yprp.all.fa.gz.*.log`



In [None]:
!partition-before-pggb -i yprp.all.fa.gz -o output_all -n 3 -t 20 -p 95 -s 5000

Now get all of the partition commands from the log file into a bash script called run-pggb-partitions.sh.

In [None]:
!sed -n '/pggb -i output_all/,$p' output_all/*.log > ./run-pggb-partitions.sh
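
The `sed -n '/pattern/,$p'` idiom prints everything from the first matching line to the end of the file. A Python sketch of the same behavior follows; the function name and the log lines are hypothetical and do not reproduce real partition-before-pggb output.

```python
def lines_from_first_match(lines, marker="pggb -i output_all"):
    """Return all lines starting at the first one that contains `marker`."""
    selected = []
    found = False
    for line in lines:
        if not found and marker in line:
            found = True
        if found:
            selected.append(line)
    return selected


# Hypothetical log content, for illustration only.
toy_log = [
    "some earlier log message\n",
    "pggb -i output_all/toy.community.0.fa -o output_all/toy.community.0.out\n",
    "pggb -i output_all/toy.community.1.fa -o output_all/toy.community.1.out\n",
]
print("".join(lines_from_first_match(toy_log)), end="")
```

Like the `sed` command, this keeps any non-command lines that follow the first match, so it is worth looking over the resulting script before running it.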

And now run the bash script, which will run all the partition commands. They will run sequentially, each using 20 threads.

In [None]:
!bash ./run-pggb-partitions.sh

You now have 16 subgraphs, each in GFA format. Later, we will combine them into a single file.

### Quiz

In [None]:
#Install jupyterquiz library
%pip install jupyterquiz

In [None]:
#Load jupyterquiz library
from jupyterquiz import display_quiz

In [None]:
#Display quiz as html
#Instructions for creating quiz .json files and converting to html provided in the links below
from IPython.display import IFrame
IFrame('module_notebooks/html/quiz_building_graphs.html', width=800, height=400)

## Conclusion
This submodule explained PGGB's graph-building algorithm, its output, and its strengths and weaknesses.
You obtained the yeast genomes, prepared the input data, and created a yeast pangenomic graph of chromosome VIII and one of the entire genome.
In the next submodule you will learn how to visualize and explore these graphs.


## Cleanup
No cleanup is necessary for this submodule. Don't forget to shut down your Workbench when you are done working through this module!