# Pangenomics
--------------------------------------------

# Read Mapping with vg

## Overview

vg lets you map reads to pangenomic graphs. In this submodule, you will map reads from SK1 to the yeast pangenomic graph that you built with PGGB.

## Learning Objectives
+ Understand the difference between mapping reads to a single reference genome and mapping them to a pangenome
+ Learn how to map reads to a pangenomic graph using vg

## Get Started

We will align reads to the indexed pangenomic graph that you created.

Traditionally, reads are aligned to a single reference genome. A reference genome represents one individual of the species, so it can be missing some of the genetic variation found in the species. Reads that contain this missing variation might not align to the reference. Alignment can also be biased: reads from individuals closely related to the reference align better than reads from more divergent individuals. This reference bias carries through to variant calling, where it can cause missed variant calls in more evolutionarily divergent individuals.

Pangenomic graphs capture more of the genetic variation within a species. Using one as the reference therefore reduces both missing variation and reference bias.

### Graph Alignment/Map (GAM) Format

[GAM](https://github.com/vgteam/vg/wiki/File-Formats) is an alignment format analogous to [BAM](https://samtools.github.io/hts-specs/SAMv1.pdf), but for graphs.  
+ Binary file describing where reads mapped in the graph structure  
+ Uncompressed has one read per line  
+ Can be converted to JSON for manual parsing (very inefficient!)
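
To make the JSON view concrete, here is a minimal Python sketch that reads alignment records of the kind `vg view -a` prints (one JSON object per line) and lists the graph nodes each read touches. The two records are hypothetical and heavily simplified; real records carry many more fields. Treating a record with no `path` as unmapped is an assumption made for this illustration.

```python
import json

# Hypothetical, heavily simplified records in the style of `vg view -a` output
# (one JSON object per line). Real records carry more fields.
gam_json_lines = [
    '{"name": "read1", "mapping_quality": 60, "path": {"mapping": ['
    '{"position": {"node_id": "12"}}, {"position": {"node_id": "13"}}]}}',
    '{"name": "read2"}',  # no path: treated here as unmapped (assumption)
]

def nodes_visited(line):
    """Return (read name, list of node IDs) for one JSON alignment record."""
    record = json.loads(line)
    mappings = record.get("path", {}).get("mapping", [])
    # Node IDs are converted with int() so string or numeric values both work.
    node_ids = [int(m["position"]["node_id"]) for m in mappings]
    return record.get("name", ""), node_ids

summary = [nodes_visited(line) for line in gam_json_lines]
for name, node_ids in summary:
    status = "mapped" if node_ids else "unmapped"
    print(name, status, node_ids)
```

Parsing text this way is fine for spot checks on a few reads, but it is far too slow for a whole GAM file.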

### Get Reads for Mapping

We will use paired-end Illumina reads from SK1, which is also included in our graph. (You could also align reads from an accession that is not in the graph.) The reads are in run accession SRR4074258 in the Sequence Read Archive (SRA). Download them with the SRA Toolkit's `prefetch` and `fasterq-dump` commands.

First, `prefetch` the accession, which makes getting the fastq data faster.

The parameters:

--output-directory specify output directory

In [None]:
!prefetch SRR4074258 --output-directory .

Then get the fastq files. `fasterq-dump` allows us to multi-thread the download and will automatically put the read1 and read2 sequence in different files. Point it to the prefetched data (SRR4074258/SRR4074258.sra).

The parameters:

--outdir  output directory ("." indicates the current directory)  
--outfile  output file name (fasterq-dump will add _1 and _2 before the fastq as it separates out read1 and read2)  
--threads  number of threads  
--progress  show progress

In [None]:
!fasterq-dump --outdir . --outfile SK1.illumina.fastq --threads 40 --progress ./SRR4074258/SRR4074258.sra

Now compress the files. Rather than using `gzip`, we will use `pigz` ("parallel implementation of gzip"), which can use multiple threads to finish more quickly. By default it uses all available threads; you can adjust this with the --processes parameter.

**pigz**

+ -v verbose

NOTE: The verbose setting is not very verbose so make sure you wait until the asterisk in the square brackets to the left of the code block is replaced by a number to know that it is done.

In [None]:
!pigz -v SK1.illumina_1.fastq
!pigz -v SK1.illumina_2.fastq
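
Before deleting anything, it can be worth confirming that the two compressed files hold the same number of reads. The sketch below is one way to do that in a Python cell. It assumes standard four-line FASTQ records and runs on tiny made-up files written to a temporary directory; `count_fastq_reads` and the `demo_*` file names are hypothetical names introduced for this example.

```python
import gzip
import os
import tempfile

def count_fastq_reads(path):
    """Count reads in a gzipped FASTQ file, assuming four lines per record."""
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4

# Tiny made-up read pair written to a temporary directory for the demo.
record = "@{name}\nACGT\n+\nIIII\n"
tmpdir = tempfile.mkdtemp()
counts = []
for mate in ("1", "2"):
    path = os.path.join(tmpdir, f"demo_{mate}.fastq.gz")
    with gzip.open(path, "wt") as handle:
        for i in range(3):
            handle.write(record.format(name=f"read{i}/{mate}"))
    counts.append(count_fastq_reads(path))

# Both mates should report the same number of reads.
print(counts)
```

Files written by `pigz` are gzip-compatible, so the same function could be pointed at `SK1.illumina_1.fastq.gz` and `SK1.illumina_2.fastq.gz`. Expect it to take a while on full-size files.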

Now, remove the prefetch directory.

In [None]:
!rm -r SRR4074258

### Read Mapping

We will use `vg map` to map paired-end Illumina reads from the SK1 yeast accession to the chrVIII graph (yprp.chrVIII.pggb.vg).

The parameters:  

-d Graph prefix (yprp.chrVIII.pggb)  
-f Reads in fastq format  
-t Number of threads

We will redirect the output into a gam file.

In [None]:
!vg map -d yprp.chrVIII.pggb -f SK1.illumina_1.fastq.gz -f SK1.illumina_2.fastq.gz -t 20 > yprp.chrVIII.x.SK1.illumina.gam

### Mapping statistics

Now we can compute some mapping statistics using vg stats.

The parameters:

-a alignment (GAM) file

In [None]:
!vg stats -a yprp.chrVIII.x.SK1.illumina.gam

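When reading the output, compare the total number of alignments with the number that were aligned to estimate the mapping rate. The counts of perfect and gapless alignments, together with the insertion, deletion, substitution, and softclip totals, indicate how closely the reads match paths in the graph.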

### Bringing Alignments Back to Individual Genomes

Our reads are now mapped to the pangenomic graph. If we need the alignments in the coordinates of an individual genome, we can "surject" them onto a genome of our choice using `vg surject`.

The parameters:

-x The VG graph or xg to use  
-b Write the output in BAM format (the input GAM file is given as a positional argument)  
-t The number of threads to use

We will redirect the output into a bam file.

In [None]:
!vg surject -x S288C.xg -b S288C.SK1.illumina.gam -t 20 > S288C.SK1.illumina.BAM

The Integrative Genomics Viewer (IGV) is a powerful, user-friendly, open source genome viewer maintained by teams at UC San Diego and the Broad Institute of MIT and Harvard (https://igv.org/).

Let's prepare the BAM file for viewing in IGV by converting the chromosome names to UCSC-style names that the viewer can recognize.

### Preparing the BAM for IGV (or other genome viewer)

Convert the BAM (compressed) alignment file to SAM (uncompressed).

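SAM is a tab-delimited text format for read alignments, and BAM is its compressed binary equivalent. Converting to SAM lets us edit the sequence names as plain text.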

The parameters:

-h print the header for the SAM output  
-o output file name (default: stdout)  
* stdout prints the output to the screen

In [None]:
!samtools view -h -o S288C.SK1.illumina.sam S288C.SK1.illumina.BAM
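
A SAM file is plain text: after the header lines (which start with `@`), each alignment occupies one tab-separated line with 11 mandatory columns, optionally followed by tags. As a minimal sketch of that layout, the Python below splits one made-up alignment line into named fields. The column names follow the SAM specification; `parse_sam_line` and the example record are hypothetical and exist only for this illustration.

```python
# The 11 mandatory SAM columns, in order (names follow the SAM specification).
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

def parse_sam_line(line):
    """Split one SAM alignment line into a dict of its mandatory fields."""
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    # These five columns are integers in the SAM specification.
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    record["tags"] = columns[11:]  # optional TAG:TYPE:VALUE fields
    return record

# Made-up alignment line for illustration only.
example = "read1\t0\tS288C.chrVIII\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tAS:i:4"
rec = parse_sam_line(example)
print(rec["RNAME"], rec["POS"], rec["CIGAR"])
```

The `RNAME` column holds the sequence name that IGV must recognize, which is why the next step edits it.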

Next, we'll remove the assembly name from the sequence names using a substitution in sed:

sed 's/thing_you_want_to_replace/thing_to_replace_it_with/' file

In our case, we will replace `S288C.` with nothing and redirect the output into a new SAM file.

In [None]:
!sed 's/S288C\.//' S288C.SK1.illumina.sam > S288C.SK1.illumina.renamed.sam
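
One detail of the substitution is worth seeing in isolation: in a regular expression, a bare `.` matches any character, while `\.` matches only a literal dot. The Python sketch below illustrates this with `re.sub` on made-up SAM lines. It is an analogy to the sed command, not a replacement for it, and `count=1` is used to mimic sed changing only the first match on each line when no `g` flag is given.

```python
import re

# Made-up SAM header and alignment lines for illustration only.
lines = [
    "@SQ\tSN:S288C.chrVIII\tLN:1000",
    "read1\t0\tS288C.chrVIII\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
]

# count=1 mirrors sed without the g flag: only the first match per line changes.
renamed = [re.sub(r"S288C\.", "", line, count=1) for line in lines]

# An unescaped dot matches any character, so it also alters unrelated text.
loose = re.sub(r"S288C.", "", "S288C_chrVIII", count=1)
strict = re.sub(r"S288C\.", "", "S288C_chrVIII", count=1)

print(renamed[0])
print(loose, strict)
```

With the escaped pattern, a name such as `S288C_chrVIII` is left untouched, whereas the unescaped pattern would strip its prefix as well.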

Convert the renamed SAM file back into BAM format and redirect it into a new BAM file.

The parameters:

-b output BAM  
-S input is SAM  

In [None]:
!samtools view -bS S288C.SK1.illumina.renamed.sam > S288C.SK1.illumina.renamed.bam

Sort the bam file.

The parameters:

-o output file name

In [None]:
!samtools sort -o S288C.SK1.illumina.renamed.sorted.bam S288C.SK1.illumina.renamed.bam

Index the sorted BAM file.

In [None]:
!samtools index S288C.SK1.illumina.renamed.sorted.bam

Now, copy the sorted BAM file and its index (`.bai`) to your personal computer and load it into IGV.

How does CUP1 look?

What other interesting observations do you see?

## Pack (pileup support) Format

https://github.com/vgteam/vg/wiki/File-Formats  
+ Binary file  
+ Stores pileup (coverage) support computed from the reads in a mapping  
+ The format is not formally documented

## Conclusion

In this submodule, you learned how to align reads directly to a pangenomic graph, how to surject those alignments into coordinates in individual genomes, and how to view the surjected alignments in IGV. You also learned about the SAM, BAM, and GAM alignment file formats.

## Clean up
No cleanup is necessary for this submodule. Don't forget to shut down your Workbench when you are done working through this module!