# Pangenomics
--------------------------------------------

# Building Graphs with PGGB


## Overview
The PanGenome Graph Builder (PGGB) creates reference-free pangenomic graphs (https://github.com/pangenome/pggb). You will learn about the algorithm and its graphical output, its strengths and weaknesses, and you will build a yeast pangenomic graph.

## Learning Objectives
+ Understand what types of graphs PGGB builds and their pros/cons
+ Learn how to build graphs with PGGB

## Get Started
In this submodule you will learn how to build pangenomic graphs with PGGB.

PGGB lecture:
- Reference-Free Graphs with PGGB

PGGB hands-on tutorials:
- Yeast Dataset
- PGGB graph generation
- Graph inspection


## Reference-Free Graphs with PGGB

### PanGenome Graph Builder (PGGB)

The PGGB algorithm creates *reference-free graphs* in two broad steps:
+ It computes all-pairwise whole-genome alignments
+ It induces a graph from the alignments

PGGB is built on the idea that a pangenome graph represents an alignment of the genomes in the graph, but infers the graph from all pairwise alignments instead of a multiple alignment.

PGGB computes all pairwise alignments efficiently by focusing on long, colinear homologies, instead of using the more traditional k-mer matching alignment approach.

Critically, pggb performs graph *normalization* to ensure that paths through the graph (e.g. chromosomes) have a linear structure while allowing for cyclic graph structures that capture structural variation.

![PGGB Flow Diagram](./Figures/pggbFlowDiagram.png)

### Reference-Free Graphs

https://academic.oup.com/bioinformatics/article/30/24/3476/2422268

![Input Genomes](./Figures/InputGenomes.png)

### PGGB Algorithm

1. Perform all-pairwise genome alignments using [wfmash](https://github.com/waveygang/wfmash)
2. Convert alignments into a graph using [seqwish](https://github.com/ekg/seqwish)
3. Progressively normalize graph with [smoothxg](https://github.com/pangenome/smoothxg) and [gfaffix](https://github.com/marschall-lab/GFAffix)
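
The idea behind step 2 (inducing a graph from pairwise alignments) can be illustrated with a toy sketch. This is not how seqwish is implemented; it is a minimal union-find example under simplifying assumptions: alignments are given as exact, gap-free matches, and the function name `induce_nodes`, the `(seq_a, start_a, seq_b, start_b, length)` tuple layout, and the toy sequences are all made up for illustration. Bases that are aligned to each other, directly or transitively, end up in the same node.

```python
from collections import defaultdict

def induce_nodes(sequences, matches):
    """Group aligned bases into shared nodes (toy transitive closure)."""
    # Every base of every sequence starts as its own node.
    parent = {(name, i): (name, i)
              for name, seq in sequences.items()
              for i in range(len(seq))}

    def find(x):
        # Follow parent pointers to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Each match merges the aligned bases position by position.
    for name_a, start_a, name_b, start_b, length in matches:
        for k in range(length):
            union((name_a, start_a + k), (name_b, start_b + k))

    classes = defaultdict(list)
    for position in parent:
        classes[find(position)].append(position)
    return list(classes.values())

# Three toy "genomes"; g2 differs from g1 and g3 by one SNP at position 3.
sequences = {"g1": "ACGTAC", "g2": "ACGAAC", "g3": "ACGTAC"}
# Hypothetical pairwise matches (exact, gap-free).
matches = [
    ("g1", 0, "g2", 0, 3),  # ACG shared by g1 and g2
    ("g1", 4, "g2", 4, 2),  # AC shared by g1 and g2
    ("g1", 0, "g3", 0, 6),  # g1 and g3 are identical
]
nodes = induce_nodes(sequences, matches)
print(len(nodes))  # 6 shared positions plus 1 private base in g2
```

The single unmerged base in `g2` corresponds to the alternative branch of a SNP bubble. No sequence plays the role of a reference here, which is the sense in which the graph is reference-free.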



## Yeast Genome Assemblies and Reads

The [Yeast Population Reference Panel (YPRP)](https://yjx1217.github.io/Yeast_PacBio_2016/welcome/) is a panel that includes 12 yeast genome assemblies.
More information is available in the [YPRP manuscript](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2659681/).

  + 7 *Saccharomyces cerevisiae* (brewer’s yeast), including the S288C reference
  + 5 *Saccharomyces paradoxus* (wild yeast)

![Yeast Genomes: https://yjx1217.github.io/Yeast_PacBio_2016/welcome/](./Figures/Yeast.png)

Yeast genomes are ~12 Mb and have 16 chromosomes.

These yeast genomes were assembled with [LRSDAY](https://github.com/yjx1217/LRSDAY) (Long-read Sequencing Data Analysis for Yeasts).

+ [YPRP: 12 Yeast PacBio Assemblies (Chromosome level)](https://yjx1217.github.io/Yeast_PacBio_2016/data/)
  + ~100-200x PacBio sequencing reads
  + HGAP + Quiver polishing
  + ~200-500x Illumina (Pilon correction)
  + Manual curation
  + Annotation



### SK1 Illumina Reads

SK1 is the *S. cerevisiae* strain most distant from S288C.

We will use the SK1 reads later on to call variants.

![Yeast Genomes: https://yjx1217.github.io/Yeast_PacBio_2016/welcome/](./Figures/YeastB.png)



### CUP1 Gene

![](./Figures/StructuralRearrangements.png)
[Structural Rearrangements](https://www.nature.com/articles/ng.3847)
+ [CUP1](https://www.yeastgenome.org/locus/S000001095) - A gene involved in heavy metal (copper) tolerance that shows copy-number variation (CNV) in the population.
+ [YHR054C](https://www.yeastgenome.org/locus/S000001096) - Putative protein of unknown function.



### Preparing the Yeast Input Assemblies

1. Download the three yeast genome assembly files (gzipped FASTA).
 + `curl` downloads the file at a URL
 + `--location` tells curl to follow any redirects
 + `--output` names the file the download is written to


In [None]:
!curl --location --output S288C.genome.fa.gz http://yjx1217.github.io/Yeast_PacBio_2016/data/Nuclear_Genome/S288C.genome.fa.gz
!curl --location --output Y12.genome.fa.gz http://yjx1217.github.io/Yeast_PacBio_2016/data/Nuclear_Genome/Y12.genome.fa.gz
!curl --location --output SK1.genome.fa.gz http://yjx1217.github.io/Yeast_PacBio_2016/data/Nuclear_Genome/SK1.genome.fa.gz


2. Change the FASTA headers to include the yeast accession name. See the [Pangenome Sequence Naming Specification](https://github.com/pangenome/PanSN-spec) for the naming convention commonly used with pangenome tools.

 + The for loop works through each of the genome FASTA files.
 + It strips off the file suffix to get the yeast accession name.
 + It then uses sed to insert the accession name after the ">" of each header line.
 + Finally, it replaces the original file with the renamed version.


In [None]:
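%%bash
# Prefix every FASTA header with the accession name taken from the file name
for i in *.genome.fa.gz; do
    accession=${i%.genome.fa.gz}
    zcat "$i" | sed "s/^>/>${accession}_/" | gzip > "prepend_$i"
    mv "prepend_$i" "$i"
done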
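
For readers more comfortable with Python than sed, the same header rewrite can be sketched as follows. This is an illustrative equivalent, not part of the original workflow; the function name `prefix_headers` and the in-memory toy FASTA are assumptions. Passing `sep="#1#"` would produce names in the `sample#haplotype#contig` style described by the PanSN specification (with haplotype `1` assumed here), whereas the shell loop above uses an underscore.

```python
def prefix_headers(fasta_text, accession, sep="_"):
    """Insert the accession name after '>' on every header line."""
    renamed = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            line = f">{accession}{sep}{line[1:]}"
        renamed.append(line)
    return "\n".join(renamed) + "\n"

# Toy input standing in for one assembly file.
example = ">chrI\nACGT\n>chrII\nGGCC\n"

underscore_style = prefix_headers(example, "S288C")
pansn_style = prefix_headers(example, "S288C", sep="#1#")
print(underscore_style)
print(pansn_style)
```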


3. Create an uncompressed FASTA file containing all three YPRP assemblies. Call it `yprp.all.fa`.


In [None]:
!zcat *.genome.fa.gz > yprp.all.fa


4. Create a FASTA file containing chromosome VIII from every assembly. Call it `yprp.chrVIII.fa`.


In [None]:
!awk 'BEGIN{RS=">";FS="\n"} NR>1{fnme=$1".fa"; print ">" $0 > fnme; close(fnme);}' yprp.all.fa && cat *_chrVIII.fa > yprp.chrVIII.fa && rm *_chr*.fa


5. How can you count the sequences in `yprp.all.fa` to confirm the file looks correct?



<details>
<summary>Click for help</summary>
<br>
grep -c '>' yprp.all.fa
</details>
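
The same check can be done from Python, which is convenient once the files are compressed. The sketch below is an assumption-laden illustration (the function name `count_fasta_records` and the toy file are invented here); it counts lines that start with ">" and reads either plain or gzip-compressed FASTA. bgzip output is gzip-compatible, so it should be readable the same way.

```python
import gzip
import os
import tempfile

def count_fasta_records(path):
    """Count header lines (those starting with '>') in a plain or gzipped FASTA."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))

# Tiny stand-in file so the sketch runs without the real data.
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy.fa.gz")
    with gzip.open(toy_path, "wt") as handle:
        handle.write(">S288C_chrI\nACGT\n>SK1_chrI\nACGA\n")
    n_records = count_fasta_records(toy_path)

print(n_records)
```

On the real data you would call `count_fasta_records("yprp.all.fa")` and compare the result with the number of assemblies multiplied by the number of chromosomes per assembly.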

6. Compress the FASTA files

We will compress the files with [bgzip](https://www.htslib.org/doc/bgzip.html). It is similar to gzip and produces gzip-compatible files, but it compresses the data in independent blocks, which allows much faster random access at the cost of slightly larger files.

+ The -c parameter writes the bgzipped data to standard output.
+ The ">" redirects standard output into a file.


In [None]:
!bgzip -c yprp.all.fa > yprp.all.fa.gz
!bgzip -c yprp.chrVIII.fa > yprp.chrVIII.fa.gz



7. Index the bgzipped files with [samtools](http://www.htslib.org/doc/samtools.html) [faidx](http://www.htslib.org/doc/samtools-faidx.html):


In [None]:
!samtools faidx yprp.all.fa.gz
!samtools faidx yprp.chrVIII.fa.gz
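
To make the purpose of the index concrete, the sketch below computes the five columns of a `.fai` file (name, length, offset, bases per line, bytes per line) for a toy FASTA and uses them to jump straight to a base. It is a simplified illustration rather than samtools' own code: the function names `build_fai` and `byte_offset` are made up, and the sketch works on uncompressed bytes held in memory. For bgzipped input, samtools additionally writes a `.gzi` file so that these uncompressed offsets can be located inside the compressed blocks.

```python
def build_fai(fasta_bytes):
    """Return (name, length, offset, linebases, linewidth) for each record."""
    records = []
    current = None
    position = 0  # byte position of the start of the current line
    for raw in fasta_bytes.splitlines(keepends=True):
        if raw.startswith(b">"):
            if current is not None:
                records.append(tuple(current))
            name = raw[1:].split()[0].decode()
            # The sequence starts right after the header line.
            current = [name, 0, position + len(raw), 0, 0]
        elif current is not None:
            bases = len(raw.rstrip(b"\r\n"))
            if current[3] == 0:
                # The first sequence line defines the line geometry.
                current[3], current[4] = bases, len(raw)
            current[1] += bases
        position += len(raw)
    if current is not None:
        records.append(tuple(current))
    return records

def byte_offset(record, base_index):
    """Byte position of a 0-based base within the uncompressed file."""
    _, _, offset, linebases, linewidth = record
    return offset + (base_index // linebases) * linewidth + base_index % linebases

toy_fasta = b">S288C_chrI\nACGTAC\nGT\n>SK1_chrI\nACG\n"
index = build_fai(toy_fasta)
for row in index:
    print("\t".join(str(field) for field in row))

# Fetch base 6 (0-based) of the first record without scanning the sequence.
where = byte_offset(index[0], 6)
print(toy_fasta[where:where + 1].decode())
```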


## Running pggb on Chromosome VIII

Build a graph of chromosome VIII from the YPRP assemblies using the following parameters:

+ **-i yprp.chrVIII.fa.gz**
    + an input FASTA containing all sequences
+ **-o output_chrVIII**
    + the directory where all output files should be placed
+ **-n 12**
    + the number of haplotypes (assemblies) in the input file; this should match the number of assemblies you actually combined (for example, 3 if you downloaded only the three assemblies above)
+ **-t 20**
    + the number of threads to use
+ **-p 95**
    + minimum sequence identity of alignment segments
+ **-s 5000**
    + nucleotide segment length when scaffolding the graph
    
NOTE: These arguments were taken from the [pggb paper](https://github.com/pangenome/pggb-paper/blob/main/workflows/AllSpecies.md).
Refer to the paper for parameter suggestions for other species.



In [None]:
!pggb -i yprp.chrVIII.fa.gz -o output_chrVIII -n 12 -t 20 -p 95 -s 5000
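
If you want to build several graphs with the same settings (for example, one per chromosome), it can help to assemble the command in Python. The following is a small sketch, assuming only the six flags described above; the helper name `pggb_command` and its keyword names are invented for illustration, and the command is printed rather than executed.

```python
import shlex

def pggb_command(fasta, out_dir, n_haplotypes,
                 threads=20, identity=95, segment_length=5000):
    """Assemble a pggb argument list from the parameters described above."""
    return [
        "pggb",
        "-i", fasta,
        "-o", out_dir,
        "-n", str(n_haplotypes),
        "-t", str(threads),
        "-p", str(identity),
        "-s", str(segment_length),
    ]

argv = pggb_command("yprp.chrVIII.fa.gz", "output_chrVIII", n_haplotypes=12)
command_line = shlex.join(argv)
print(command_line)
# To run it from Python instead of a shell cell:
#   import subprocess; subprocess.run(argv, check=True)
```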


Create a copy of the output graph with a simpler name.


In [None]:
!cp output_chrVIII/yprp.chrVIII.fa.gz.*.smooth.final.gfa yprp.chrVIII.pggb.gfa
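
The output is a GFA text file in which the first tab-separated column gives the record type: `S` for segments (nodes), `L` for links (edges) and `P` for paths (the input sequences threaded through the graph). As a quick sanity check, a sketch like the one below can tally these records. The function name `summarize_gfa` and the toy graph are assumptions for illustration; the toy graph encodes a single SNP bubble shared by two paths.

```python
from collections import Counter

def summarize_gfa(lines):
    """Count GFA record types and the total length of all segment sequences."""
    counts = Counter()
    total_bp = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if not fields[0]:
            continue  # skip blank lines
        counts[fields[0]] += 1
        if fields[0] == "S":
            total_bp += len(fields[2])  # column 3 holds the node sequence
    return counts, total_bp

# Toy GFA: two paths that differ at one position (nodes 2 and 3).
toy_gfa = [
    "H\tVN:Z:1.0",
    "S\t1\tACG",
    "S\t2\tT",
    "S\t3\tA",
    "S\t4\tAC",
    "L\t1\t+\t2\t+\t0M",
    "L\t1\t+\t3\t+\t0M",
    "L\t2\t+\t4\t+\t0M",
    "L\t3\t+\t4\t+\t0M",
    "P\tg1\t1+,2+,4+\t*",
    "P\tg2\t1+,3+,4+\t*",
]
counts, total_bp = summarize_gfa(toy_gfa)
print(dict(counts), total_bp)
```

For the real graph you could pass an open file handle, for example `summarize_gfa(open("yprp.chrVIII.pggb.gfa"))`. The number of `P` lines should equal the number of input sequences.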


## Running pggb on all Chromosomes

While you could build the whole-genome graph the same way you built chromosome VIII, partitioning the sequences before building the graph lets us parallelize the graph building.
The partition-before-pggb command partitions the input FASTA into smaller FASTA "communities" containing sequences that should be in the same subgraph. This command takes the same parameters as pggb.

+ Communities will likely correspond to chromosomes if you have complete assemblies
+ Partitioning may improve the run time of the normalization step and make downstream analysis easier
+ Consider skipping partitioning if your assemblies/organism have complex structure you want represented in the graph, e.g. polyploidy, translocations, etc.

The partition-before-pggb command will print a `pggb` command for every partition to the command line and to a log file: `output_all/yprp.all.fa.gz.*.log`



In [None]:
!partition-before-pggb -i yprp.all.fa.gz -o output_all -n 12 -t 20 -p 95 -s 5000


Now use the commands printed on the screen or saved in the log file to build each subgraph (one `pggb` command per community).


<details>
<summary>Click for help</summary>
<br>
Add commands from log here.
</details>
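
If there are many communities, copying commands by hand gets tedious. The sketch below pulls the `pggb` commands out of a log and shows how they could be run one by one. It rests on an assumption that should be checked against your own log: that each command sits on its own line beginning with `pggb`. The log excerpt, its file names and the helper name `extract_pggb_commands` are hypothetical.

```python
import shlex

def extract_pggb_commands(log_text):
    """Return the lines of a log that look like pggb invocations."""
    commands = []
    for line in log_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("pggb "):
            commands.append(stripped)
    return commands

# Hypothetical log excerpt; real log lines and file names will differ.
example_log = """\
[partition] wrote 2 communities
pggb -i output_all/community.0.fa -o output_all/community.0.out -n 12 -t 20 -p 95 -s 5000
pggb -i output_all/community.1.fa -o output_all/community.1.out -n 12 -t 20 -p 95 -s 5000
"""

commands = extract_pggb_commands(example_log)
for command in commands:
    argv = shlex.split(command)
    print(argv[0], "input:", argv[argv.index("-i") + 1])
    # To actually run it: import subprocess; subprocess.run(argv, check=True)
```

Because the subgraphs are independent, the commands can also be submitted as separate jobs, which is where the parallel speed-up from partitioning comes from.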

### Insert quizzes

In [None]:
#Install jupyterquiz library
%pip install jupyterquiz

In [None]:
#Load jupyterquiz library
from jupyterquiz import display_quiz

In [None]:
#Display quiz as html
#Instructions for creating quiz .json files and converting to html provided in the links below
from IPython.display import IFrame
IFrame('module_notebooks/html/quiz_building_graphs.html', width=800, height=400)

## Conclusion
This module explained PGGB's graph building algorithm and output and its strengths and weaknesses.
You obtained the yeast genomes, prepared the input data, and created a yeast pangenomic graph of chromosome VIII and one of the entire genome.
In the next module you will learn how to visualize and explore these graphs.


## Cleanup
Don't forget to shut down the VM and delete any relevant resources. <br><br>

<br>

## Additional Notebook Options & Functionalities

---------------------------------

### Use alert cells to communicate important messages or information

<div class="alert alert-block alert-danger"> <b>Warning:</b> Here is a warning. Please take appropriate action. </div>
<div class="alert alert-block alert-warning"> <b>Attention:</b> Please take note. </div>
<div class="alert alert-block alert-success"> <b>Success:</b> Your action was successful. </div>
<div class="alert alert-block alert-info"> <b>Tip:</b> Try this. </div>

### Stylize markdown cells and text

<p style="background:blue;color:white;font-family:times new roman"> Change cell background, text color, and/or font. </p>
<code style="background:black;color:white">>Make text look similar to command line. </code>

**This is bold text.** <br>
Another way to <b>bold</b> text. <br><br>
*This is italicized text.* <br>
Another way to <i>italicize</i> text. <br><br>
<mark>Emphasize</mark> a section of text. <br><br>
Insert LaTeX equations: $\sqrt{n}$

### Code syntax highlighting

```python
def my_python_function():
  print("Hello from a function")
```

### Create tables

##### Using markdown syntax
|Name|Address|Salary| 
|-----|-------|------| 
|Hanna|Brisbane|5000| 
|Adam|Sydney|4000|

##### Using HTML syntax
<table>
<thead>
<tr><th>Name</th><th>Address</th><th>Salary</th></tr>
</thead>
<tbody>
<tr><td>Hanna</td><td>Brisbane</td><td>5000</td></tr>
<tr><td>Adam</td><td>Sydney</td><td>4000</td></tr>
</tbody>
</table>

### Provide additional details through dropdowns or hover text

<details>
<summary>Click for help</summary>
<br>
Put your detailed instructions, command, or helpful hint(s) here.
</details>

<span title="Here is where you should put more detailed instructions.">Hover mouse over this text for further instructions</span>

### Display links

[Link To Nextflow Intro Video](https://www.youtube.com/watch?v=wbtMbJTo1xo)

<br>

## Notebook Embeddings
---------------------------------

### Embed images

![myimage](images/OIP.jpeg)

### Embed videos

In [None]:
from IPython.display import YouTubeVideo

# Youtube
YouTubeVideo(id='T9fbAkgINf0', height=200, width=400)

In [None]:
from IPython.display import Video

#Sample local video file
Video("videos/sample-mp4-file.mp4",width=400, height=200)

In [None]:
from IPython.display import VimeoVideo
VimeoVideo(id='281123163', width=400, height=200)

### Embed html files

In [None]:
from IPython.display import IFrame
IFrame(src='html/gut_1_fastqc.html', width=900, height=600)

### Embed interactive IGV browser

In [None]:
%pip install --user igv-notebook

In [None]:
import igv_notebook

igv_notebook.init()

In [None]:
b1 = igv_notebook.Browser(
    {
        "genome": "hg19",
        "locus": "chr22:24,376,166-24,376,456",
    }
)

Visit the [igvteam](https://github.com/igvteam/igv-notebook) Github page for additional information. <br><br>
Or refer to our past workshop on [using igv-notebook in jupyter](reference_notebooks/igv_template.ipynb).

<br>

## Quizzes & Flashcards
---------------------------------

### Insert quizzes

In [None]:
#Install jupyterquiz library
%pip install jupyterquiz

In [None]:
#Load jupyterquiz library
from jupyterquiz import display_quiz

In [None]:
#Display quiz as html
#Instructions for creating quiz .json files and converting to html provided in the links below
from IPython.display import IFrame
IFrame('html/quiz_example.html', width=800, height=400)

### Insert flashcards

In [None]:
#Install jupytercards library
%pip install jupytercards

In [None]:
#Display flashcard as html
#Instructions for creating flashcard .json files and converting to html provided in the links below
from IPython.display import IFrame
IFrame('html/flashcard_example.html', width=600, height=600)

For more details on constructing and embedding quizzes and flashcards using Python refer to this [notebook](reference_notebooks/python_quiz_template.ipynb). <br><br>
For details on constructing and embedding quizzes and flashcards within a notebook using R refer to this [notebook](reference_notebooks/r_quiz_template.ipynb).

<br>

# Data Figures & Graphics

---------------------------------------------------

### Data visualization with python

In [None]:
#Install libraries
%pip install matplotlib
%pip install pandas
%pip install seaborn
%pip install numpy
%pip install pyvis
%pip install ipycytoscape

In [None]:
#Load libraries
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np

#### Make volcano plots

In [None]:
toptable = pd.read_csv('viz_data/Toptable_VolcanoPlot.txt', sep='\t')

# Declare significance thresholds
sig = 0.05
FC = 0.6

toptable['Significance'] = np.where((toptable['logFC'] > FC) & 
                                    (toptable['P.Value'] < sig), 'Up', 
                           np.where((toptable['logFC'] < -FC) &
                                    (toptable['P.Value'] < sig), 'Down','Not_Sig'))
# Count of Significance level
toptable['Significance'].value_counts()

# Add color to the plot based on the values above with the hue parameter
sns.scatterplot(x='logFC', y=-np.log10(toptable['adj.P.Val']), hue='Significance', data=toptable)
plt.show()

#### Make heatmaps

In [None]:
# generate a matrix, red pill or blue pill?
np.random.seed(100)

nr1 = 4
nr2 = 8
nr3 = 6
nr = nr1 + nr2 + nr3
nc1 = 6
nc2 = 8
nc3 = 10
nc = nc1 + nc2 + nc3

mat1 = np.random.normal(1, 0.5, (nr1, nc1))
mat2 = np.random.normal(0, 0.5, (nr2, nc1))
mat3 = np.random.normal(0, 0.5, (nr3, nc1))
mat4 = np.random.normal(0, 0.5, (nr1, nc2))
mat5 = np.random.normal(1, 0.5, (nr2, nc2))
mat6 = np.random.normal(0, 0.5, (nr3, nc2))
mat7 = np.random.normal(0.5, 0.5, (nr1, nc3))
mat8 = np.random.normal(0.5, 0.5, (nr2, nc3))
mat9 = np.random.normal(1, 0.5, (nr3, nc3))

mat = np.concatenate((np.concatenate((mat1, mat2, mat3), axis=0),
                      np.concatenate((mat4, mat5, mat6), axis=0),
                      np.concatenate((mat7, mat8, mat9), axis=0)), axis=1)

mat = np.array(mat)
np.random.shuffle(mat)
np.random.shuffle(mat.T)

row_names = ["row"+str(i) for i in range(1,nr+1)]
col_names = ["column"+str(i) for i in range(1,nc+1)]

mat = pd.DataFrame(mat, index=row_names, columns=col_names)

# hate the default colors, so let's match it with the colors from the R notebook
from matplotlib.colors import LinearSegmentedColormap
red_white_blue = LinearSegmentedColormap.from_list("rbw",["red","white","blue"])

sns.heatmap(mat,annot=False,cmap=red_white_blue, linewidth=.5)
plt.show()

#### Make gene networks

In [None]:
#Load additional network visualization libraries
from pyvis.network import Network
#from ipycytoscape import CytoscapeWidget #only needed if want to produce widget below
import networkx as nx
import requests

In [None]:
# Create a Network object
net = Network(notebook=True, cdn_resources='remote')

# Add nodes to the network
net.add_node("A", label = "Gene A")
net.add_node("B", label = "Gene B")
net.add_node("C", label = "Gene C")

# Add edges to the network
net.add_edge("A", "B")
net.add_edge("B", "C")
net.add_edge("C", "A")

# Show the network
net.show("network_example.html")

In [None]:
#create a list of proteins or genes of interest
protein_list = ['TPH1','COMT','SLC18A2','HTR1B','HTR2C','HTR2A','MAOA',
            'TPH2','HTR1A','HTR7','SLC6A4','GABBR2','POMC','GNAI3',
            'NPY','ADCY1','PDYN','GRM2','GRM3','GABBR1']
proteins = '%0d'.join(protein_list)

url = 'https://string-db.org/api/tsv/network?identifiers=' + proteins + '&species=9606'
r = requests.get(url)

lines = r.text.split('\n') # pull the text from the response object and split based on new lines
data = [l.split('\t') for l in lines] # split each line into its components based on tabs
# convert to dataframe using the first row as the column names; drop empty, final row
df = pd.DataFrame(data[1:-1], columns = data[0]) 


In [None]:
df.head()

In [None]:
# dataframe with the preferred names of the two proteins and the score of the interaction
interactions = df[['preferredName_A', 'preferredName_B', 'score']] 

G=nx.Graph(name='Protein Interaction Graph')
interactions = np.array(interactions)
for i in range(len(interactions)):
    interaction = interactions[i]
    a = interaction[0] # protein a node
    b = interaction[1] # protein b node
    w = float(interaction[2]) # interaction score, used directly as the edge weight
    G.add_weighted_edges_from([(a,b,w)]) # add weighted edge to graph

pos = nx.spring_layout(G) # position the nodes using the spring layout
plt.figure(figsize=(11,11),facecolor=[0.7,0.7,0.7,0.4])
nx.draw_networkx(G, pos=pos) # draw using the computed layout

#cyto = CytoscapeWidget()
#cyto.graph.add_graph_from_networkx(G)

#display(cyto)
plt.axis('off')
plt.show()

### Data visualization with R
[Open R Notebook](./reference_notebooks/r_viz_template.ipynb)