# Pangenomics
--------------------------------------------

# Building Graphs with PGGB


## Overview
The PanGenome Graph Builder (PGGB) creates reference-free pangenomic graphs (https://github.com/pangenome/pggb). In this submodule, you will learn about the algorithm and its graphical output, its strengths and weaknesses, and you will build a yeast pangenomic graph.

## Learning Objectives
+ Describe the types of graphs PGGB builds
+ List their pros/cons
+ Build graphs with PGGB

## Getting Started
In this submodule you will learn how to build pangenomic graphs with PGGB.

#### PGGB lecture:
- Reference-Free Graphs with PGGB

#### PGGB hands-on tutorials:
- Yeast Dataset
- PGGB graph generation
- Graph inspection


----------------------
## Reference-Free Graphs with PGGB

### PanGenome Graph Builder (PGGB)
+ PGGB is built on the idea that a pangenome graph represents an alignment of the genomes in the graph. PGGB therefore builds graphs from pairwise alignments between all genomes in the pangenome.
+ PGGB computes all pairwise alignments efficiently by focusing on long, colinear homologies, instead of using the more traditional k-mer matching alignment approach.
+ Critically, PGGB performs graph *normalization* to ensure that paths through the graph (e.g. chromosomes) have a linear structure while allowing for cyclic graph structures that capture structural variation.
+ The main advantage of PGGB is that it is truly free of reference bias.
+ The two main drawbacks of PGGB are that the graphs are computationally expensive to build and that they can be very complex compared to other types of pangenome graphs, making them more difficult to analyse, both computationally and visually.

#### The PGGB algorithm creates *[reference-free graphs](https://academic.oup.com/bioinformatics/article/30/24/3476/2422268)* by:
+ Computing all-pairwise whole-genome alignments
+ Inducing a graph from the alignments

####  PGGB Algorithm
1. Perform all-pairwise genome alignments using [wfmash](https://github.com/waveygang/wfmash)
2. Convert alignments into a graph using [seqwish](https://github.com/ekg/seqwish)
3. Progressively normalize the graph with [smoothxg](https://github.com/pangenome/smoothxg) and [gfaffix](https://github.com/marschall-lab/GFAffix)
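
To get a feel for why the all-pairwise alignment step becomes expensive as genomes are added, here is a minimal Python sketch that enumerates unordered genome pairs. The function name `all_pairs` is illustrative only, and the count is a simplification: it says nothing about how wfmash actually schedules or filters its mappings.

```python
from itertools import combinations


def all_pairs(genomes):
    """Return every unordered pair of genomes (one comparison per pair)."""
    return list(combinations(genomes, 2))


# The three accessions used in this submodule.
pairs = all_pairs(["S288C", "SK1", "Y12"])
print(pairs)

# The number of pairs grows quadratically: n * (n - 1) / 2.
for n in (3, 12, 100):
    print(n, "genomes ->", len(all_pairs(range(n))), "pairs")
```

With 3 genomes there are only 3 pairs, but all 12 YPRP assemblies would already give 66.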

The figure below shows a [flow diagram for PGGB](https://github.com/pangenome/pggb).

<figure>
  <img
    src="./Figures/pggbFlowDiagram.png"
    alt="PGGB pipeline" />
  <figcaption><a href="https://github.com/pangenome/pggb">https://github.com/pangenome/pggb</a></figcaption>
</figure>


The figure below shows a small example of how a graph may be built from aligned blocks ([Marcus, et al. 2014](https://academic.oup.com/bioinformatics/article/30/24/3476/2422268)). There are 4 small genomes split into shared and unique sequence blocks. These blocks become the nodes of the graph, and edges connect blocks that are adjacent in at least one genome.

<figure>
  <img
    src="./Figures/InputGenomes.png"
    alt="Input genomes as abstract graph" />
  <figcaption><a href="https://academic.oup.com/bioinformatics/article/30/24/3476/2422268">https://academic.oup.com/bioinformatics/article/30/24/3476/2422268</a></figcaption>
</figure>
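
The block-to-graph idea can be sketched in a few lines of Python. The toy genomes below are hypothetical (they are not the ones in the figure), and `build_block_graph` is an illustrative helper, not part of PGGB: each genome is written as an ordered list of block names, blocks become nodes, and consecutive blocks become directed edges.

```python
def build_block_graph(genomes):
    """Build (nodes, edges) from genomes given as ordered lists of blocks."""
    nodes = set()
    edges = set()
    for blocks in genomes.values():
        nodes.update(blocks)
        # Each pair of neighbouring blocks contributes one edge.
        edges.update(zip(blocks, blocks[1:]))
    return nodes, edges


# Hypothetical toy genomes: A and C are shared, B and D are variable.
toy_genomes = {
    "genome1": ["A", "B", "C"],
    "genome2": ["A", "D", "C"],
    "genome3": ["A", "B", "C"],
    "genome4": ["A", "C"],
}

nodes, edges = build_block_graph(toy_genomes)
print(sorted(nodes))
print(sorted(edges))
```

Shared blocks appear once in the graph no matter how many genomes contain them, which is what makes the representation compact.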



----------------------

## Yeast Data Description

### Yeast Population Reference Panel (YPRP)

We will use some yeast genome assemblies from the [Yeast Population Reference Panel (YPRP)](https://yjx1217.github.io/Yeast_PacBio_2016/welcome/). YPRP is a panel that includes 12 yeast genome assemblies from two different species of yeast. 

  + 7 *Saccharomyces cerevisiae* (brewer’s yeast), including the S288C reference
  + 5 *Saccharomyces paradoxus* (wild yeast)

The figure below shows a [phylogenetic tree](https://yjx1217.github.io/Yeast_PacBio_2016/welcome/) of these genomes (highlighted in red and blue) as well as some more distant relatives. We will focus on genomes from 3 different yeast accessions (S288C, SK1, and Y12) to learn how to run the pangenomics pipeline but encourage you to download more yeast genomes for additional practice after you work through the module. The 3 yeast accessions we chose have interesting structural differences surrounding the CUP1 gene for copper resistance, which we will use as an example (see below). Also of note, the yeast reference genome is from S288C.


<figure>
  <img
    src="./Figures/Yeast.png"
    alt="Yeast genomes" />
  <figcaption><a href="https://yjx1217.github.io/Yeast_PacBio_2016/welcome/">https://yjx1217.github.io/Yeast_PacBio_2016/welcome/</a></figcaption>
</figure>

### Yeast Genome Sequencing and Assembly Strategy

Yeast genomes are ~12 Mb and have 16 chromosomes. The yeast genome assemblies we will use are chromosome level, high quality assemblies. Sequence data, assemblies, and additional information about this population can be accessed [here](https://yjx1217.github.io/Yeast_PacBio_2016/data/). We briefly describe the data below.

The following sequence data were used:
  + ~100-200x PacBio sequencing reads
  + ~200-500x Illumina (for correction)

The PacBio reads were assembled with [LRSDAY](https://github.com/yjx1217/LRSDAY) (Long-Read Sequencing Data Analysis for Yeasts). Briefly, these are the steps taken for sequencing, assembly, and gene annotation:
  + *de novo* assembly of PacBio reads using [HGAP](https://www.nature.com/articles/nmeth.2474)
  + Polishing of the assembly using [Quiver](https://www.nature.com/articles/nmeth.2474)
  + Additional polishing using Illumina reads in [Pilon](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0112963)
  + Manual curation
  + Gene annotation lift overs using [RATT](https://academic.oup.com/nar/article/39/9/e57/1236534?login=false) to pull across high confidence genes from the *S. cerevisiae* reference genome
  + Evidence-based and *de novo* gene annotation using the [Yeast Genome Annotation Pipeline (YGAP)](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-13-237), [Maker](https://pmc.ncbi.nlm.nih.gov/articles/PMC4286374/), [tRNAscan-SE (v1.3.1)](https://pmc.ncbi.nlm.nih.gov/articles/PMC146525/), and [EVidenceModeler (EVM)](https://genomebiology.biomedcentral.com/articles/10.1186/gb-2008-9-1-r7)

More information about these yeast accessions and YPRP's research is available in the [YPRP manuscript](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2659681/).

### Illumina Reads

We will use Illumina reads from SK1 to align to the pangenome graph and call variants. We are using SK1 because it is fairly distant from the reference (S288C), as shown in the phylogeny below.

<figure>
  <img
    src="./Figures/YeastB.png"
    alt="Yeast genomes highlighted" />
  <figcaption><a href="https://yjx1217.github.io/Yeast_PacBio_2016/welcome/">https://yjx1217.github.io/Yeast_PacBio_2016/welcome/</a></figcaption>
</figure>



### CUP1 Gene

We will focus on a region that shows [structural variation](https://www.nature.com/articles/ng.3847) among some *S. cerevisiae* yeast genomes. The region contains two genes with copy number variation.

+ [CUP1](https://www.yeastgenome.org/locus/S000001095) - A gene involved in heavy metal (copper) tolerance with copy-number variation (CNV). In general, the more copies of CUP1, the better the copper tolerance.
+ [YHR054C](https://www.yeastgenome.org/locus/S000001096) - Putative protein of unknown function.

The figure below shows [a schematic of genes in the CUP1 region](https://www.yeastgenome.org/locus/S000001095). All three of the genomes we will use are different in this region.

<figure>
  <img
    src="./Figures/StructuralRearrangements.png"
    alt="Yeast CUP1 structure" />
  <figcaption><a href="https://www.nature.com/articles/ng.3847">https://www.nature.com/articles/ng.3847</a></figcaption>
</figure>




----------------------

## Downloading and Preparing Yeast Data

### Creating Directories

First, create some directories to keep things organized.

<div class="alert alert-info"><b>Important:</b> When you run the code blocks, pay attention to the square brackets to the left of the code block. If there is an asterisk in these brackets, the code is still running and you should wait before moving on.</div>

In [1]:
!mkdir -p assemblies
!mkdir -p graphs
!mkdir -p genes
!mkdir -p reads
!mkdir -p alignments
!mkdir -p variants

### Preparing the Yeast Input Assemblies

1. Get the three yeast genome assembly files (FASTA).
     + `curl` transfers a URL
     + `--location` tells curl to follow any redirects
     + `--output` gives it an output file


In [2]:
!curl --location --output assemblies/S288C.genome.fa.gz http://yjx1217.github.io/Yeast_PacBio_2016/data/Nuclear_Genome/S288C.genome.fa.gz
!curl --location --output assemblies/Y12.genome.fa.gz http://yjx1217.github.io/Yeast_PacBio_2016/data/Nuclear_Genome/Y12.genome.fa.gz
!curl --location --output assemblies/SK1.genome.fa.gz http://yjx1217.github.io/Yeast_PacBio_2016/data/Nuclear_Genome/SK1.genome.fa.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   162  100   162    0     0   9529      0 --:--:-- --:--:-- --:--:--  9529
100 3687k  100 3687k    0     0  16.5M      0 --:--:-- --:--:-- --:--:-- 16.5M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   162  100   162    0     0  18000      0 --:--:-- --:--:-- --:--:-- 18000
100 3357k  100 3357k    0     0  23.2M      0 --:--:-- --:--:-- --:--:-- 23.2M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   162  100   162    0     0  11571      0 --:--:-- --:--:-- --:--:-- 11571
100 3406k  100 3406k    0     0  24.1M      0 --:--:-- --:--:-- --:--:-- 24.1M


2. Change the FASTA headers to include the yeast accession name (see
[Pangenome Sequence Naming Specification](https://github.com/pangenome/PanSN-spec) for more about naming). The code below works as follows:

     + The `for` loop will work through each of the genome fasta files.
     + It will strip off the file suffix to get the yeast accession name.
     + It will then use `sed` to substitute the accession name in after the `>` of the header line.
     + Finally, we will rename the file.

In [3]:
%cd assemblies
    
!for file in *.genome.fa.gz; \
do \
    accession=$(basename "$file" .genome.fa.gz); \
    zcat "${file}" | sed "s/^>/>${accession}_/" | gzip > "prepend_${file}"; \
    mv "prepend_${file}" "${file}"; \
done

%cd ..

/home/jupyter/NIGMS-Sandbox-Pangenomics-Module/module_notebooks/assemblies
/home/jupyter/NIGMS-Sandbox-Pangenomics-Module/module_notebooks
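
For readers who prefer Python to `sed`, the header renaming can be sketched as below. This is an illustrative equivalent working on in-memory lines, not the code the notebook runs; `prefix_headers` and the toy records are made up for the example.

```python
def prefix_headers(fasta_lines, accession):
    """Yield FASTA lines with the accession inserted after '>' in headers."""
    for line in fasta_lines:
        if line.startswith(">"):
            # Same effect as: sed "s/^>/>ACCESSION_/"
            yield ">" + accession + "_" + line[1:]
        else:
            yield line


# Toy input; real assemblies have much longer sequence lines.
example = [">chrI\n", "ACGT\n", ">chrII\n", "GGCC\n"]
renamed = list(prefix_headers(example, "SK1"))
print("".join(renamed))
```

Only header lines change; sequence lines pass through untouched.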


3. Create a FASTA file containing all three YPRP assemblies. Call it *yprp.all.fa*.
    + `zcat` uncompresses the files (we will compress the files later using a different compression algorithm).

In [4]:
!zcat assemblies/*genome.fa.gz > assemblies/yprp.all.fa

4. Exercise: Confirm that your file looks correct by adding code to the two code cells below that:  
    + Counts the number of sequences  
    + Looks at the sequence headers

In [None]:
# Count the number of sequences

In [None]:
# Look at the sequence headers

<details>
<summary>Click for help</summary>

**Count the number of sequences**

!grep -c '>' assemblies/yprp.all.fa

**Look at the sequence headers**

!grep '>' assemblies/yprp.all.fa
</details>

You should have 48 sequences, representing the 16 chromosomes for each of the 3 yeast accessions.

The sequence headers should have both the accession and the chromosome (example: S288C_chrI).
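
A slightly more thorough check than `grep -c` is to count sequences per accession. The sketch below assumes the `ACCESSION_chromosome` naming used in this submodule and runs on a small toy input; the helper names are illustrative.

```python
from collections import Counter


def fasta_headers(lines):
    """Return sequence names (text after '>' up to the first whitespace)."""
    return [line[1:].split()[0] for line in lines if line.startswith(">")]


def per_accession_counts(headers):
    """Count sequences per accession, assuming ACCESSION_chromosome names."""
    return Counter(name.split("_", 1)[0] for name in headers)


# Toy FASTA with two accessions and two chromosomes each.
toy_fasta = [
    ">S288C_chrI\n", "ACGT\n",
    ">S288C_chrII\n", "AC\n",
    ">SK1_chrI\n", "ACGA\n",
    ">SK1_chrII\n", "TT\n",
]

headers = fasta_headers(toy_fasta)
counts = per_accession_counts(headers)
print(len(headers), dict(counts))
```

On the real file you would open `assemblies/yprp.all.fa` and expect 16 sequences for each of the 3 accessions.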

5. Create a FASTA file containing chromosome VIII from every assembly. Call it *yprp.chrVIII.fa* (we will compress it in a later step).
    + The `awk` command changes the record separator (RS) to `>`; in other words, it makes each sequence a record.
    + For each record (sequence) it checks to see if it matches chrVIII; if so, it prints it.

In [6]:
!awk 'BEGIN{RS=">"}$1~/chrVIII/{print ">" $0}' assemblies/yprp.all.fa > assemblies/yprp.chrVIII.fa
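
The record-based filtering that `awk` performs with `RS=">"` can be sketched in Python as follows. This is only an illustration on a toy string; `fasta_records` and `select_records` are hypothetical helper names.

```python
def fasta_records(text):
    """Split FASTA text into (header, sequence) pairs, like awk with RS='>'."""
    # The chunk before the first '>' is empty, so skip it.
    for chunk in text.split(">")[1:]:
        header, _, body = chunk.partition("\n")
        yield header, body.replace("\n", "")


def select_records(text, keyword):
    """Keep records whose sequence name (first header field) contains keyword."""
    selected = []
    for header, sequence in fasta_records(text):
        fields = header.split()
        name = fields[0] if fields else ""
        if keyword in name:
            selected.append((name, sequence))
    return selected


toy = ">S288C_chrVII\nAAAA\nCC\n>S288C_chrVIII\nGGGG\nTT\n>SK1_chrVIII\nACGT\n"
chr8 = select_records(toy, "chrVIII")
print(chr8)
```

Note that `chrVII` does not match the pattern `chrVIII`, so chromosome VII is correctly excluded.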

6. Confirm that your file looks correct by adding code to the two code cells below that:  
    + Counts the number of sequences  
    + Looks at the sequence headers

In [None]:
# Count the number of sequences

In [None]:
# Look at the sequence headers

<details>
<summary>Click for help</summary>

**Count the number of sequences**

!grep -c '>' assemblies/yprp.chrVIII.fa

**Look at the sequence headers**

!grep '>' assemblies/yprp.chrVIII.fa
</details>

7. Compress the FASTA files with [bgzip](https://www.htslib.org/doc/bgzip.html)
    + We will compress the files with `bgzip`. It produces gzip-compatible files made of independently compressed blocks, which allows much faster random access than plain `gzip`, at the cost of slightly bigger files.
    + The `-c` parameter outputs the bgzipped file to standard output  
    + The `>` redirects the standard output into a file

In [7]:
!bgzip -c assemblies/yprp.all.fa > assemblies/yprp.all.fa.gz
!bgzip -c assemblies/yprp.chrVIII.fa > assemblies/yprp.chrVIII.fa.gz


8. Index the bgzip files with [samtools](http://www.htslib.org/doc/samtools.html) [faidx](http://www.htslib.org/doc/samtools-faidx.html). It will create a text (.fai) and compressed (.gzi) index.


In [8]:
!samtools faidx assemblies/yprp.all.fa.gz
!samtools faidx assemblies/yprp.chrVIII.fa.gz
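
The text `.fai` index is a small tab-separated table with five columns per sequence: name, length, byte offset of the first base, bases per line, and bytes per line. The Python sketch below derives those values from uncompressed FASTA bytes to show what the columns mean. It is an illustration under simplifying assumptions (well-formed input, no blank lines), not how samtools is implemented, and `fai_entries` is a made-up name.

```python
def fai_entries(fasta_bytes):
    """Return (name, length, offset, linebases, linewidth) for each sequence."""
    entries = []
    current = None
    offset = 0  # Byte position of the start of the current line.
    for line in fasta_bytes.splitlines(keepends=True):
        if line.startswith(b">"):
            if current is not None:
                entries.append(tuple(current))
            name = line[1:].split()[0].decode()
            # The sequence starts right after the header line.
            current = [name, 0, offset + len(line), 0, 0]
        elif current is not None:
            bases = len(line.rstrip(b"\r\n"))
            if current[3] == 0:
                current[3] = bases      # Bases on the first sequence line.
                current[4] = len(line)  # Same line including the newline.
            current[1] += bases
        offset += len(line)
    if current is not None:
        entries.append(tuple(current))
    return entries


toy = b">a\nACGT\nAC\n>b\nGG\n"
index = fai_entries(toy)
for entry in index:
    print("\t".join(str(value) for value in entry))
```

With these five numbers a tool can jump straight to any position in a sequence without reading the file from the start.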

----------------------

## Running pggb on Chromosome VIII

1. Build a graph of chromosome VIII from the three YPRP assemblies using `pggb`.

The parameters:

-i  input FASTA containing all sequences  
-o  the directory where all output files should be placed  
-n  the number of haplotypes (assemblies) in the input file (we have 3 assemblies)  
-t  the number of threads to use  
-p  minimum sequence identity of alignment segments  
-s  nucleotide segment length when scaffolding the graph (we use 5000)  

<div class="alert alert-block alert-info"> <b>NOTE:</b> The %%capture command in the code block below suppresses the large amount of output. Make sure you wait until the asterisk in the square bracket to the left of the code block is replaced with a number before moving on. At that point the command has finished.</div>


<div class="alert alert-block alert-info"> <b>NOTE:</b> These arguments were taken from the pggb paper (https://github.com/pangenome/pggb-paper/blob/main/workflows/AllSpecies.md). Refer to the paper for parameter suggestions for other species.</div>

In [9]:
%%capture

!pggb -i assemblies/yprp.chrVIII.fa.gz -o graphs/output_chrVIII -n 3 -t 4 -p 95 -s 5000

<div class="alert alert-block alert-info"> <b>NOTE:</b> The warning that some of the sequence names do not match the Pangenome Sequence Naming (PanSN) specification can be ignored. We have chosen to name our sequences in a slightly simpler way than what is suggested in the <a href="https://github.com/pangenome/PanSN-spec">PanSN-spec: Pangenome Sequence Naming</a>.</div>

2. Create a copy of the output graph with a simpler name.


In [10]:
!cp graphs/output_chrVIII/yprp.chrVIII.fa.gz.*.smooth.final.gfa graphs/yprp.chrVIII.pggb.gfa

----------------------

## Graphical Fragment Assembly (GFA) format

You now have a graph file called yprp.chrVIII.pggb.gfa that is in GFA format.

The visualization software we use in this module, Bandage, reads the Graphical Fragment Assembly (GFA) format, which was originally developed for representing genomes during assembly and is now used for pangenomics applications. More information on GFA formats is available [here](https://github.com/GFA-spec/GFA-spec). More information about the particular flavor (GFA1.0) that PGGB uses can be found [here](https://github.com/GFA-spec/GFA-spec/blob/master/GFA1.md).

The PGGB GFA contains four different types of lines, each marked by its corresponding initial. Run the code below to see flashcards explaining each line type.

In [11]:
from IPython.display import IFrame
IFrame('../html/flashcard_gfa_line_types.html', width=800, height=400)

Let's explore the GFA file and the line types.

1. Let's find out how many of each type of line there are in the GFA file. We will grab the first field or column using `cut`. Then we will `sort` it in preparation for finding and counting the unique instances using `uniq -c`.

In [12]:
!cut -f 1 graphs/yprp.chrVIII.pggb.gfa | sort | uniq -c

      1 H
  25908 L
      3 P
  19252 S


Run the code below to see the flashcards.

In [13]:
from IPython.display import IFrame
IFrame('../html/flashcard_gfalines.html', width=800, height=400)

2. Take a look at the header line. The "^" tells `grep` to limit its search to the beginning of each line.

In [14]:
!grep "^H" graphs/yprp.chrVIII.pggb.gfa

H	VN:Z:1.0


The header line has a tag. Tags are formatted in GFA as TAG:TYPE:VALUE.

`TAG`    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; VN (version)  
`TYPE`   &nbsp;&nbsp;&nbsp;&nbsp; Z (a string that can include a space)  
`VALUE`  &nbsp;&nbsp; 1.0 (this is our GFA version)
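
Splitting a tag into its three parts is a one-liner in Python. In this small sketch (`parse_tag` is an illustrative name), the split is limited to two colons so that a value containing a colon stays intact.

```python
def parse_tag(field):
    """Split a GFA optional field of the form TAG:TYPE:VALUE."""
    tag, type_code, value = field.split(":", 2)
    return tag, type_code, value


print(parse_tag("VN:Z:1.0"))
```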

3. Take a look at the segment lines. There are a lot of them so we'll use `head` to limit it to the first ten segment lines.

In [15]:
!grep "^S" graphs/yprp.chrVIII.pggb.gfa | head

S	1	CA
S	2	CCACAC
S	3	A
S	4	CCACACCCA
S	5	CAC
S	6	CACACCACAC
S	7	C
S	8	A
S	9	C
S	10	A
grep: write error: Broken pipe


The first field or column indicates it is a segment line. The second indicates the segment or node number. The third indicates the sequence content of that segment. All of these segments are pretty short.

4. Let's find the length of the longest segment or node by using `awk` to find the lengths of the third column, `sort -n` to order those lengths numerically, and `tail -n 1` to give us the last line (i.e. the longest length).

In [16]:
!grep "^S" graphs/yprp.chrVIII.pggb.gfa | awk '{print length($3)}' | sort -n | tail -n 1

11291
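
If you also want to know which segment is the longest, not just its length, a short Python sketch can track both. The toy lines below are the first six segment lines shown earlier; `longest_segment` is an illustrative helper.

```python
def longest_segment(gfa_lines):
    """Return (segment_id, length) of the longest S line, or (None, 0)."""
    best = (None, 0)
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields[2]) > best[1]:
            best = (fields[1], len(fields[2]))
    return best


toy_segments = [
    "S\t1\tCA",
    "S\t2\tCCACAC",
    "S\t3\tA",
    "S\t4\tCCACACCCA",
    "S\t5\tCAC",
    "S\t6\tCACACCACAC",
]

longest = longest_segment(toy_segments)
print(longest)
```

Among these six toy segments, segment 6 is the longest at 10 bases.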


Run the code below to take the quiz.

In [19]:
from IPython.display import IFrame
IFrame('../html/quiz_length.html', width=800, height=400)

5. Take a look at the link lines. Again, we will use `head` to limit the output to the first ten lines.

In [20]:
!grep "^L" graphs/yprp.chrVIII.pggb.gfa | head

L	1	+	2	+	0M
L	2	+	3	+	0M
L	2	+	4	+	0M
L	3	+	4	+	0M
L	4	+	6	+	0M
L	4	+	5	+	0M
L	5	+	6	+	0M
L	6	+	7	+	0M
L	6	+	8	+	0M
L	7	+	9	+	0M
grep: write error: Broken pipe


The first field or column indicates that it is a link line. The second and fourth columns are the segments that the link connects. The third and fifth columns are the orientations of each of the segments. The sixth column indicates the overlap between the segments (in this case, "0M" means zero matches or no overlap).
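
Naming the six columns makes link lines easier to work with. The sketch below parses one link line into a named tuple; the `Link` class and its field names are illustrative choices, not taken from any library.

```python
from typing import NamedTuple


class Link(NamedTuple):
    from_segment: str
    from_orient: str
    to_segment: str
    to_orient: str
    overlap: str


def parse_link(line):
    """Parse a GFA L line into a Link; raise ValueError for other lines."""
    fields = line.rstrip("\n").split("\t")
    if fields[0] != "L" or len(fields) < 6:
        raise ValueError("not a complete link line: " + line)
    return Link(*fields[1:6])


link = parse_link("L\t2\t+\t4\t+\t0M")
print(link)
```

Here `link.from_segment` is `"2"`, `link.to_segment` is `"4"`, and `link.overlap` is `"0M"`.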

6. And, finally, let's take a look at the path lines. There are only 3 lines but each line is really long. To avoid the long output, we will limit the output to the first ten segments or nodes using the `cut` command.

In [21]:
!grep "^P" graphs/yprp.chrVIII.pggb.gfa | cut -f1-10 -d,

P	S288C_chrVIII	2+,3+,4+,6+,7+,9+,10+,11+,12+,13+
P	SK1_chrVIII	1+,2+,4+,5+,6+,8+,9+,11+,13+,15+
P	Y12_chrVIII	305+,613+,614+,616+,617+,619+,620+,622+,623+,625+


We are seeing the order and orientation of the first 10 segments or nodes in the S288C, SK1, and Y12 chrVIII sequences.
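
Because every link in this graph has no overlap, a path can be turned back into sequence by concatenating its segments, reverse-complementing any segment visited in the `-` orientation. The Python sketch below does this for the first four steps of the SK1 path, using the segment sequences printed earlier; the function names are illustrative.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def parse_path_steps(step_field):
    """Turn '1+,2+,4-' into [('1', '+'), ('2', '+'), ('4', '-')]."""
    return [(step[:-1], step[-1]) for step in step_field.split(",")]


def spell_path(step_field, segments):
    """Concatenate oriented segments (valid when all overlaps are 0M)."""
    parts = []
    for segment_id, orientation in parse_path_steps(step_field):
        sequence = segments[segment_id]
        if orientation == "-":
            sequence = reverse_complement(sequence)
        parts.append(sequence)
    return "".join(parts)


# Segment sequences from the S lines shown above.
segments = {"1": "CA", "2": "CCACAC", "4": "CCACACCCA", "5": "CAC"}

sk1_start = spell_path("1+,2+,4+,5+", segments)
print(sk1_start)
```

Walking a path this way should reproduce the original input sequence for that chromosome, which is a useful sanity check on any graph.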

Run the code below to see the flashcards.

In [22]:
from IPython.display import IFrame
IFrame('../html/flashcard_alike.html', width=800, height=400)

----------------------

## Running pggb on all Chromosomes

While you can build a graph for the entire genome the same way you did for chromosome VIII, partitioning the sequences before building the graph allows us to parallelize the graph building.
The `partition-before-pggb` command partitions the input FASTA into smaller FASTA "communities" containing sequences that should be in the same subgraph. This command uses the same parameters as `pggb`.

+ The communities will likely correspond to chromosomes if you have complete assemblies.
+ Partitioning may improve the run time of the normalization step and make downstream analysis easier.
+ Partitioning will avoid the often repetitive connections between chromosomes that complicate the graph and increase run time.
+ Consider skipping partitioning if your assemblies or organism have complex structure that you want represented in the graph, e.g. polyploidy, translocations, etc.
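
To build intuition for what a "community" is, the sketch below groups sequences that are linked by mappings, using connected components as a deliberately simplified stand-in for whatever method the tool uses internally. The sequence names and mapping pairs are hypothetical toy data, and `connected_components` is an illustrative helper.

```python
def connected_components(names, mappings):
    """Group names into components linked by (a, b) mapping pairs."""
    parent = {name: name for name in names}

    def find(x):
        # Follow parent pointers to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in mappings:
        parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical toy data: two chromosomes from three accessions.
names = [
    "S288C_chrI", "SK1_chrI", "Y12_chrI",
    "S288C_chrII", "SK1_chrII", "Y12_chrII",
]
mappings = [
    ("S288C_chrI", "SK1_chrI"),
    ("SK1_chrI", "Y12_chrI"),
    ("S288C_chrII", "Y12_chrII"),
    ("SK1_chrII", "Y12_chrII"),
]

communities = connected_components(names, mappings)
for community in communities:
    print(community)
```

With complete assemblies and no mappings between different chromosomes, each group ends up holding one chromosome from every accession. A single spurious mapping between chromosomes would merge two groups under this simple scheme, which is one reason real tools may use more robust grouping.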

The `partition-before-pggb` command will print a `pggb` command for every partition to the command line and to a log file: `graphs/output_allchrs/yprp.all.fa.gz.*.log`

1. Partition the input sequences.

In [24]:
!partition-before-pggb -i assemblies/yprp.all.fa.gz -o graphs/output_allchrs -n 3 -t 4 -p 95 -s 5000

[mashmap] Skipping self mappings for single file all-vs-all mapping.
[mashmap] MashMap v3.1.1
[mashmap] Reference = [assemblies/yprp.all.fa.gz]
[mashmap] Query = [assemblies/yprp.all.fa.gz]
[mashmap] Kmer size = 19
[mashmap] Sketch size = 199
[mashmap] Segment length = 5000 (read split allowed)
[mashmap] Block length min = 25000
[mashmap] Chaining gap max = 20000
[mashmap] Mappings per segment = 1
[mashmap] Percentage identity threshold = 95%
[mashmap] Skip self mappings
[mashmap] Skipping sequences containing the same prefix based on the delimiter "#"
[mashmap] Hypergeometric filter w/ delta = 0.3 and confidence 0.999
[mashmap] Mapping output file = /dev/stdout
[mashmap] Filter mode = 1 (1 = map, 2 = one-to-one, 3 = none)
[mashmap] Execution threads  = 4
[mashmap::skch::Sketch::build] minmer windows picked from reference = 2825837
[mashmap::skch::Sketch::index] unique minmers = 590131
[mashmap::skch::Sketch::computeFreqHist] Frequency histogram of minmer interval points = (2, 82849) .

2. Now get all of the partition commands from the log file into a bash script called run-pggb-partitions.sh. Also, make the file executable.

In [26]:
!sed -n '\|pggb -i graphs/output_allchrs|,$p' graphs/output_allchrs/*.log > graphs/run-pggb-partitions.sh
!chmod +x graphs/run-pggb-partitions.sh


3. And now run the bash script, which will run all the partition commands. They will run sequentially, each using the number of threads set with `-t` in the partitioning step (4 here).

<div class="alert alert-block alert-info"> <b>NOTE:</b> It will take about 30 minutes to run all 16 subgraphs. Make sure you wait until the command finishes before moving on (the asterisk to the left of the code block below changes to a number).</div>

In [None]:
%%capture

!./graphs/run-pggb-partitions.sh

You now have 16 subgraphs in the graphs/output_allchrs/ directory, each in GFA format. You can look at them individually or you can combine them into a single graph (which you will learn how to do in the indexing submodule).

If you have reason to believe that there are important translocations between chromosomes, or if you want to see connections between haplotypes in a polyploid assembly, consider creating a graph directly from the entire genome assembly. Try it below.

<div class="alert alert-block alert-info"> <b>NOTE:</b> Combining the 16 subgraphs will give you a slightly different graph than if you had created a graph from the entire genome directly because there will be no connections between chromosomes.</div>

<div class="alert alert-block alert-success"> <b>Try this in the cells below:</b>
    <ul>
        <li>Create a graph from the entire genome assembly (<i>yprp.all.fa.gz</i>) in an output directory called <i>output_full_genome</i></li>
        <li>Copy the graph into a file called <i>yprp.fullgenome.pggb.gfa</i></li>
        <li>Count the number of each type of line</li>
    </ul>
</div>

In [None]:
%%capture

# Create a graph from the entire genome assembly (with suppressed output)


In [None]:
# Copy the graph into a file called *yprp.fullgenome.pggb.gfa*

In [None]:
# Count the number of each type of line

<div class="alert alert-block alert-info"> <b>NOTE:</b> It will actually take a little less time to build a graph for the full genome than to build the 16 subgraphs that correspond to the chromosomes. That might not be the case for other datasets. The relative timing of building a graph for the full genome or subgraphs for the chromosomes (or chromosome fragments) will depend on many factors, including the size and number of chromosomes (or chromosome fragments), the number of assemblies, the number of haplotypes per assembly, and the number of repeats and how they are distributed across chromosomes.</div>

<details>
<summary>Click for help</summary>
<br>

%%capture

!pggb -i assemblies/yprp.all.fa.gz -o graphs/output_full_genome -n 3 -t 4 -p 95 -s 5000

!cp graphs/output_full_genome/yprp.all.fa.gz.*.smooth.final.gfa graphs/yprp.fullgenome.pggb.gfa

!cut -f 1 graphs/yprp.fullgenome.pggb.gfa | sort | uniq -c

</details>

----------------------

### Quiz

Run the code below to take the quiz.

In [27]:
from IPython.display import IFrame
IFrame('../html/quiz_building_graphs.html', width=800, height=400)

----------------------

## Conclusion

This submodule explained the strengths and weaknesses of PGGB's graph building algorithm, and described its output.
As an example, we took you through obtaining yeast genomes, preparing input data, and creating a yeast pangenomic graph both for chromosome VIII and for the entire genome.
In the next module you will learn how to visualize and explore these graphs.

----------------------

## Cleanup

<div class="alert alert-warning">No cleanup is necessary for this submodule. Don't forget to shut down your Workbench when you are done working through this module!</div>