# Pangenomics
--------------------------------------------

# Intro to Graphical Pangenomics


## Overview

Pangenome graphs are representations of related genomes that enable exploration of the relationships of the genomes to one another, their commonalities and novelties, and their collective genetic variation. You will learn about different types of pangenomic graphs and their strengths and weaknesses.

A brief overview of pangenomes and this module is available in this video XXX.

## Learning Objectives
+ Describe pangenomic data
+ Describe pangenomic graphs
+ Specify the different types of pangenomic graphs and identify their pros and cons
+ Detail the pipeline that will be used in this module

## Get Started

In this submodule you will learn about pangenomic data and different types of graphs, including their pros and cons. In addition, you will learn about the pangenomics pipeline that will be used in this module.

#### Pangenomics pipeline
- Build graphs
- Map reads
- Call variants
- Visualize graphs and mapped reads

----------------

## What is a "pangenome"?

The term “pangenome” was first coined by [Sigaux et al. (2000)](https://pubmed.ncbi.nlm.nih.gov/11261250/) and was used to describe a public database containing an assessment of genome and transcriptome alterations in major types of tumors, tissues, and experimental models. The figure below shows the use of the term "pan genome" in the abstract of the [Sigaux et al. (2000)](https://pubmed.ncbi.nlm.nih.gov/11261250/) paper.

<figure>
  <img
    src="./Figures/Abstract1.png"
    alt="Sigaux et al. abstract" />
  <figcaption><a href="https://pubmed.ncbi.nlm.nih.gov/11261250/">https://pubmed.ncbi.nlm.nih.gov/11261250/</a></figcaption>
</figure>

The term was later revived by [Tettelin et al. (2005)](https://pubmed.ncbi.nlm.nih.gov/16172379/) to describe a microbial genome in terms of which genes were in the core (present in all strains) and which genes were dispensable (missing from one or more of the strains). The figure below shows the use of the term "pan-genome" in the abstract of the [Tettelin et al. (2005)](https://pubmed.ncbi.nlm.nih.gov/16172379/) paper. More precisely, the paper defines a "core genome", consisting of genomic regions or genes that are present in all of the strains analyzed, and a "dispensable genome", consisting of regions or genes present in only one or some of the strains. "Core genome" sequences are presumed to be critical to the species because they have been retained by all of the strains.

<figure>
  <img
    src="./Figures/Abstract2.png"
    alt="Tettelin et al. abstract" />
  <figcaption><a href="https://pubmed.ncbi.nlm.nih.gov/16172379/">https://pubmed.ncbi.nlm.nih.gov/16172379/</a></figcaption>
</figure>


While the terms "core" and "dispensable" (or "accessory") genome are still used, some people partition the pangenome into additional sections. For instance, the pangenome figure below ([EVCB Mx, 2021a](https://en.wikipedia.org/wiki/Pan-genome)) shows a pangenome partitioned into core (in all genomes), shell (in more than one but not all genomes), and cloud (in only a single genome) designations.

<figure>
  <img
    src="./Figures/Pangenome.png"
    alt="Pangenome as Venn diagram" />
  <figcaption><a href="https://en.wikipedia.org/wiki/Pan-genome">https://en.wikipedia.org/wiki/Pan-genome</a></figcaption>
</figure>
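
The core, shell, and cloud designations described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the module's pipeline: the gene names, genome names, and the `partition_pangenome` helper are all hypothetical, and it assumes you already have a presence/absence table of genes across genomes.

```python
# Hypothetical presence/absence data: gene -> set of genomes that contain it.
presence = {
    "geneA": {"g1", "g2", "g3"},
    "geneB": {"g1", "g2", "g3"},
    "geneC": {"g1", "g2"},
    "geneD": {"g3"},
}
genomes = {"g1", "g2", "g3"}


def partition_pangenome(presence, genomes):
    """Assign each gene to core, shell, or cloud by how many genomes carry it."""
    parts = {"core": set(), "shell": set(), "cloud": set()}
    for gene, carriers in presence.items():
        n = len(carriers & genomes)
        if n == len(genomes):
            parts["core"].add(gene)    # present in every genome
        elif n > 1:
            parts["shell"].add(gene)   # present in more than one, but not all
        elif n == 1:
            parts["cloud"].add(gene)   # present in a single genome only
    return parts


parts = partition_pangenome(presence, genomes)
for name in ("core", "shell", "cloud"):
    print(name, sorted(parts[name]))
```

The sizes of these three sets give rough answers to questions such as "how big is the core?" for the toy data.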

### Open vs. Closed Pangenomes

Pangenomes can be designated as open or closed. As shown in [EVCB Mx (2021a)](https://en.wikipedia.org/wiki/Pan-genome), closed pangenomes have a relatively large ratio of core to accessory regions. In these pangenomes, a small number of genomes is enough to capture all or nearly all of the genes or genomic regions present across the species. On the other hand, open pangenomes have a small ratio of core to accessory regions. As more genomes are sequenced and added to the pangenome, additional genes or genomic regions are found that had not been previously seen.

<figure>
  <img
    src="./Figures/ClosedvOpen.png"
    alt="Open and Closed Pangenomes" />
  <figcaption><a href="https://en.wikipedia.org/wiki/Pan-genome">https://en.wikipedia.org/wiki/Pan-genome</a></figcaption>
</figure>
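
One way to get a feel for the difference is to track how the pangenome and core sizes change as genomes are added one at a time. The sketch below uses made-up gene sets, and the `accumulation_curve` helper is a hypothetical name. The result depends on the order in which genomes are added, so treat it as an illustration rather than a rigorous test of openness.

```python
def accumulation_curve(gene_sets):
    """Return pangenome and core sizes after each genome is added, in order."""
    pan, core = set(), None
    pan_sizes, core_sizes = [], []
    for genes in gene_sets:
        pan |= genes                                   # union: every gene seen so far
        core = set(genes) if core is None else core & genes  # intersection: genes in all
        pan_sizes.append(len(pan))
        core_sizes.append(len(core))
    return pan_sizes, core_sizes


# Hypothetical gene content for four genomes of two imaginary species.
closed_like = [{"a", "b", "c", "d"}, {"a", "b", "c", "d"},
               {"a", "b", "c", "e"}, {"a", "b", "c", "d"}]
open_like = [{"a", "b", "c"}, {"a", "b", "d", "e"},
             {"a", "f", "g", "h"}, {"a", "i", "j", "k"}]

closed_pan, closed_core = accumulation_curve(closed_like)
open_pan, open_core = accumulation_curve(open_like)
print("closed-like: pangenome", closed_pan, "core", closed_core)
print("open-like:   pangenome", open_pan, "core", open_core)
```

In the closed-like example the pangenome size levels off quickly and the core stays large; in the open-like example each new genome keeps contributing unseen genes while the core shrinks.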

### Then vs. Now

Pangenomes are becoming increasingly feasible because sequencing costs have dropped precipitously while throughput has increased rapidly. The figure below ([Wetterstrand, 2023](https://www.genome.gov/about-genomics/fact-sheets/DNA-Sequencing-Costs-Data)) shows that the cost of sequencing a human genome tracked Moore's law until around 2007/2008, when the Next Generation Sequencing revolution introduced several new sequencing technologies. Since then, the cost of sequencing a human genome has dropped approximately 10,000 fold.

<figure>
  <img
    src="./Figures/CostGenome.png"
    alt="Cost per Genome" />
  <figcaption><a href="https://www.genome.gov/about-genomics/fact-sheets/DNA-Sequencing-Costs-Data">https://www.genome.gov/about-genomics/fact-sheets/DNA-Sequencing-Costs-Data</a></figcaption>
</figure>


These cost reductions, along with improvements in the generation of high-quality long sequencing reads such as [PacBio HiFi](https://www.pacb.com/technology/hifi-sequencing/) reads, have enabled the sequencing of multiple high-quality genomes within many different species. This has facilitated the construction of pangenomes and resulted in a steady increase in the number of publications mentioning pangenomes [(Bayer et al., 2020)](https://www.nature.com/articles/s41477-020-0733-0).

<figure>
  <img
    src="./Figures/GenomePubs.png"
    alt="Pangenome Publications" />
  <figcaption><a href="https://www.nature.com/articles/s41477-020-0733-0">https://www.nature.com/articles/s41477-020-0733-0</a></figcaption>
</figure>

### "Pangenome" Today

Today, the definition of a pangenome has evolved:

“Any collection of genomic sequences to be analyzed jointly or to be used as a reference. These sequences can be linked in a graph-like structure, or simply constitute sets of (aligned or unaligned) sequences.” – [Computational Pangenomics Consortium](https://academic.oup.com/bib/article/19/1/118/2566735)

### The Benefit of Pangenomes

+ Removes the bias introduced by a single reference genome, which:
  + May only represent one organism
  + Could be a “mosaic” of individuals, i.e. doesn’t represent a coherent haplotype
  + May contain allele bias
  + Doesn’t include common variation
+ Allows multiple assemblies to be analyzed together, which is more efficient than analyzing them one at a time

### What are pangenomes good for?

+ Core vs dispensable genes:
  + How big is the core?
  + How big is the dispensable?
  + How big is the pangenome?
  + What traits are associated with the core/dispensable?
+ Unbiased read mapping and variant calling
+ More robust variation-trait association
+ Visual exploration of the genomic structure of a population

----------------

## Computational Pangenomics

“Questions about efficient data structures, algorithms and statistical methods to perform bioinformatic analyses of pan-genomes give rise to the discipline of ‘computational pan-genomics’.” – [Computational Pangenomics Consortium](https://academic.oup.com/bib/article/19/1/118/2566735)

<figure>
  <img
    src="./Figures/Computational.png"
    alt="Computational Pangenomics" />
  <figcaption><a href="https://academic.oup.com/bib/article/19/1/118/2566735">https://academic.oup.com/bib/article/19/1/118/2566735</a></figcaption>
</figure>

### Pangenome Representations

Pangenomes can be represented in many different formats:

+ Gene sets
+ Multiple sequence alignments
+ K-mer sets
+ Graphs
  + De Bruijn graphs
  + Haplotype graphs
  + Variation graphs


### Variation Graphs

We will focus on variation graphs, which most pangenomic tools use. These graphs consist of nodes that contain sequence data, edges that connect the nodes, and paths that thread genomes, chromosomes, haplotypes, genes, or other sequence information through the graph, thereby connecting original sequence inputs directly with walks through the graph. Variation graphs retain all variation present in the original input sequences and balance construction efficiency with ease of visualization and interpretation.

+ Variation forms bubbles
+ Nodes represent sequences
+ Chains of nodes represent contiguous sequence in one or more assemblies
+ The sequences of nodes connected by an edge may overlap
+ Graphs can be acyclic or cyclic
+ Haplotypes are “threaded” through graph as paths

<figure>
  <img
    src="./Figures/VariationGraph.jpeg"
    alt="Pangenome Representations" />
  <figcaption><a href="https://academic.oup.com/bib/article/19/1/118/2566735">https://academic.oup.com/bib/article/19/1/118/2566735</a></figcaption>
</figure>
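
The node, edge, and path structure described above can be sketched as a small Python data structure. This is a simplified illustration only: the `VariationGraph` class and its methods are hypothetical, node sequences are assumed not to overlap, and nodes are only traversed in the forward orientation.

```python
from dataclasses import dataclass, field


@dataclass
class VariationGraph:
    nodes: dict = field(default_factory=dict)   # node id -> sequence
    edges: set = field(default_factory=set)     # (from node id, to node id)
    paths: dict = field(default_factory=dict)   # path name -> ordered node ids

    def add_path(self, name, node_ids):
        """Thread a path through the graph, adding the edges it walks."""
        for a, b in zip(node_ids, node_ids[1:]):
            self.edges.add((a, b))
        self.paths[name] = list(node_ids)

    def path_sequence(self, name):
        """Recover the original sequence by concatenating a path's nodes."""
        return "".join(self.nodes[n] for n in self.paths[name])


# Nodes 2 and 3 are the two branches of a bubble (an A/G difference).
graph = VariationGraph(nodes={1: "ACGT", 2: "A", 3: "G", 4: "TTCA"})
graph.add_path("hap1", [1, 2, 4])
graph.add_path("hap2", [1, 3, 4])

print(graph.path_sequence("hap1"))
print(graph.path_sequence("hap2"))
print(sorted(graph.edges))
```

Because each input sequence is stored as a path, walking that path reproduces the input exactly, which is how the graph retains all of the variation in the original sequences.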

### Types of Variation Graphs

1. Reference Graph (vg)
      + A reference with variants
      + e.g., [Human reference now includes VCF with common variation](https://www.ncbi.nlm.nih.gov/genome/guide/human/)
2. Reference Backbone; “iterative” (minigraph)
      + Graph starts as reference and other sequences are layered on, i.e. variants can be relative to sequences other than the reference
3. Reference-Free (Cactus and pggb)
      + Graph is built using non-reference techniques, such as multiple sequence alignment

These are all methods used by the [Human Pangenome Reference Consortium](https://humanpangenome.org). In this module, you will learn how to use pggb, which aims to capture all the variation present in the input sequences and does not designate a genome as the reference, thereby avoiding reference bias.
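
To make the first approach in the list (a reference with variants) more concrete, here is a toy sketch that turns a reference string and a set of single-base variants into nodes and edges. It is not how vg or any other tool builds graphs: the `build_reference_graph` function is hypothetical, positions are assumed to be 0-based, and only single-nucleotide substitutions are handled.

```python
def build_reference_graph(reference, snvs):
    """Build nodes and edges from a reference string and {position: alt base}."""
    nodes, edges = {}, set()
    next_id = 1
    prev_ends = []   # node ids that the next node must be connected from
    start = 0        # start of the reference stretch not yet placed in a node

    def add_node(seq, preds):
        nonlocal next_id
        node_id = next_id
        next_id += 1
        nodes[node_id] = seq
        for p in preds:
            edges.add((p, node_id))
        return node_id

    for pos in sorted(snvs):
        if pos > start:
            # Shared reference sequence leading up to the variant site.
            prev_ends = [add_node(reference[start:pos], prev_ends)]
        # The variant site becomes a bubble: one node per allele.
        ref_id = add_node(reference[pos], prev_ends)
        alt_id = add_node(snvs[pos], prev_ends)
        prev_ends = [ref_id, alt_id]
        start = pos + 1
    if start < len(reference):
        add_node(reference[start:], prev_ends)
    return nodes, edges


nodes, edges = build_reference_graph("ACGTATTCA", {4: "G"})
print(nodes)
print(sorted(edges))
```

Every variant in this sketch is defined relative to the reference, which is exactly the limitation that the reference-backbone and reference-free approaches try to relax.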

### Mapping Reads to Variation Graphs

You can map sequencing reads from an individual to variation graphs, splitting the reads across matching nodes as shown below [(Hickey et al., 2020)](https://link.springer.com/article/10.1186/s13059-020-1941-7). This allows you to identify which of the nodes (and, therefore, which variants) are supported by the reads. This read support can then be converted into genotypes for the individual you sequenced and aligned to the variation graph.

<figure>
  <img
    src="./Figures/ReadsToVariation.png"
    alt="Genotyping Variation" />
  <figcaption><a href="https://link.springer.com/article/10.1186/s13059-020-1941-7">https://link.springer.com/article/10.1186/s13059-020-1941-7</a></figcaption>
</figure>
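
The idea of converting read support into a genotype can be illustrated with a deliberately simple sketch. Everything here is an assumption made for illustration: each read alignment is reduced to the list of node ids it touches, nodes 2 and 3 are taken to be the reference and alternate branches of one bubble, and the `min_reads` and `min_fraction` cutoffs are arbitrary. Real genotypers use statistical models rather than fixed cutoffs like these.

```python
from collections import Counter


def node_support(read_alignments):
    """Count how many reads touch each node (each read counted once per node)."""
    support = Counter()
    for node_ids in read_alignments:
        support.update(set(node_ids))
    return support


def call_genotype(ref_support, alt_support, min_reads=4, min_fraction=0.2):
    """Call a diploid genotype at one bubble from read counts on each branch."""
    total = ref_support + alt_support
    if total < min_reads:
        return "./."   # too little evidence to make a call
    alt_fraction = alt_support / total
    if alt_fraction < min_fraction:
        return "0/0"   # reads support the reference branch only
    if alt_fraction > 1 - min_fraction:
        return "1/1"   # reads support the alternate branch only
    return "0/1"       # both branches are supported


# Hypothetical alignments: each read is the list of node ids it maps across.
reads = [[1, 2, 4], [1, 2], [2, 4], [1, 3, 4], [3, 4], [1, 3]]
support = node_support(reads)
genotype = call_genotype(support[2], support[3])
print(dict(support), genotype)
```

In this toy example both branches of the bubble are supported by the same number of reads, so the individual is called heterozygous.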

----------------

## Pangenome Data Sets

Here are links to some pangenome data sets you can explore.

+ [Human Reference + Variation VCF](https://www.ncbi.nlm.nih.gov/genome/guide/human/)
+ [Human Pangenome Reference Consortium](https://humanpangenome.org)
+ [Zoonomia (200 mammals) Project](https://zoonomiaproject.org/the-data/)
+ [Maize NAM founder genomes](https://www.science.org/doi/10.1126/science.abg5289)
+ [Yeast Population Reference Panel (YPRP)](https://yjx1217.github.io/Yeast_PacBio_2016/welcome/)


----------------------

### Quiz

Run the code below to take the quiz.

```python
from IPython.display import IFrame
IFrame('../html/quiz_pangenomes.html', width=800, height=400)
```

----------------------

## Conclusion
This submodule gave an overview of pangenomes and the different types of pangenome representations, focusing on variation graphs. In the next submodule, you will build some pangenome graphs.

----------------

## Clean up

<div class="alert alert-warning">No cleanup is necessary for this submodule. Don't forget to shut down your Workbench when you are done working through this module!</div>