# Create Pangenome Fasta File

This notebook walks through the steps required to create a fasta file from a set of haplotypes in a Practical Haplotype Graph. This file is required to impute haplotypes from sequencing data.

This notebook assumes that you have a Practical Haplotype Graph (PHG) database and have loaded haplotypes for your desired taxa (see steps 1 and 2 for constructing and populating the database).

## How It Works

WriteFastaFromGraphPlugin takes a haplotype graph as input and writes the sequences for each haplotype in the graph to a fasta file. The name of each sequence is an integer equal to its haplotype_id within the database. This "pangenome" fasta file will likely be several times larger than your reference genome fasta.
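
Because every sequence name should be an integer haplotype_id, a quick sanity check on the output is to scan the header lines and flag any name that is not a plain integer. The following is a minimal sketch, not part of the PHG: the function names are hypothetical, and it assumes the sequence name is the first whitespace-delimited token on each `>` line.

```python
import io

def read_fasta_ids(handle):
    """Yield the first whitespace-delimited token of every FASTA header line."""
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            yield fields[0] if fields else ""

def non_integer_ids(handle):
    """Return the header names that are not plain integers."""
    return [name for name in read_fasta_ids(handle) if not name.isdigit()]

# Tiny in-memory example standing in for a real pangenome fasta.
example = io.StringIO(">101\nACGT\n>102\nGGCC\n>chr1\nTTAA\n")
print(non_integer_ids(example))  # prints ['chr1']
```

For a real file, pass an open file handle (for example `open(path)`) in place of the in-memory example; an empty list means every name looks like a haplotype_id.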

To provide the input for WriteFastaFromGraphPlugin, another plugin, HaplotypeGraphBuilderPlugin, must be run first. HaplotypeGraphBuilderPlugin constructs a graph from user-specified methods, taxa, and chromosomes. Methods include haplotypes generated directly from assemblies or WGS data, as well as consensus haplotypes generated by RunHapConsensusPipelinePlugin (see that plugin's notebook for details).

## Requirements

- A Practical Haplotype Graph database, with taxa loaded and (optionally) consensus haplotypes generated.
- A config file with database connection information and, optionally, Tassel plugin parameters set.

### Required Parameters

#### HaplotypeGraphBuilderPlugin

- configFile: the config file containing database connection information
- methods: Pairs of methods (haplotype method name and range group method name). Within a pair, the two names are separated by a comma; pairs are separated by a colon. The range group is optional. Usage: `<haplotype method name1>,<range group name1>:<haplotype method name2>,<range group name2>:<haplotype method name3>`
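
The methods format described above can be assembled programmatically. This is a small illustrative sketch: `build_methods_string` is a hypothetical helper, not a PHG function, and it only applies the comma and colon rules stated in the parameter description.

```python
def build_methods_string(pairs):
    """Join (haplotype_method, range_group) pairs into a -methods value.

    A pair whose range group is None or empty contributes only the
    haplotype method name, since the range group is optional.
    """
    parts = []
    for haplotype_method, range_group in pairs:
        if range_group:
            parts.append(f"{haplotype_method},{range_group}")
        else:
            parts.append(haplotype_method)
    return ":".join(parts)

# Two haplotype methods without range groups, as used later in this notebook.
print(build_methods_string([("HudsonAlpha_assembly", None),
                            ("public_assembly", None)]))
# prints HudsonAlpha_assembly:public_assembly
```

With a range group (the name `example_range_group` here is made up), `build_methods_string([("HudsonAlpha_assembly", "example_range_group")])` gives `HudsonAlpha_assembly,example_range_group`.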

#### WriteFastaFromGraphPlugin

- outputFile: path to the output pangenome fasta file

### Optional Parameters (default in parentheses)

#### HaplotypeGraphBuilderPlugin

- includeSequences (true): whether to include sequences in haplotype nodes. For creating the pangenome, this should be left as the default true
- includeVariantContexts (false): whether to include variant contexts in haplotype nodes. For creating the pangenome, this should be left as the default false
- haplotypeIDs: A list of haplotype ids to include in the graph. If not specified, all IDs are included
- chromosomes: A list of chromosomes to include in the graph. If not specified, all chromosomes are included
- taxa: A list of taxa to include in the graph. This can be a comma-separated list of taxa (no spaces unless surrounded by quotes), a text file (.txt) listing the taxa names to include, or a taxa list file (.json or .json.gz). By default, all taxa are included.
- localGVCFFolder: folder where the reference and assembly gvcfs are stored. Only required if includeVariantContexts is true.

In [1]:
###########
# EDIT ME #
###########

# working directory
working_dir = "/workdir/ahb232/phg_sorghum_apr2023/"

# location of the config file, relative to working_dir
config_file = "config.txt"

# path to the output pangenome fasta file, relative to working_dir
pangenome_file = "/outputDir/pangenome/pangenome.fa"

# location of the log file. To write the log to this notebook, use log_file = ""
log_file = "/logs/create_pangenome_log.txt"

# list of methods to use to create the pangenome
METHODS = "HudsonAlpha_assembly:public_assembly"

# docker command. Usually "docker", but on bioHPC should be "docker1"
DOCKER = "docker1"

# docker version
DOCKER_VERSION = "maizegenetics/phg:1.4"

In [None]:
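## RUN THIS CODE BLOCK BUT DO NOT EDIT ##

# path to the config file inside the container (working_dir is mounted at /phg/)
CONFIG = "/phg/" + config_file

# path to the output pangenome fasta inside the container
FASTA_FILE = "/phg/" + pangenome_file

# shell redirect appended to the docker command; empty means log to this notebook
TO_LOG = ""

if log_file != "":
    TO_LOG = " > " + working_dir + "/" + log_file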
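
The cell above builds container paths by plain string concatenation, so a value with a leading slash such as "/outputDir/pangenome/pangenome.fa" becomes "/phg//outputDir/pangenome/pangenome.fa". Repeated slashes are harmless on Linux, but if you prefer normalized paths, something like the following sketch could be used instead. The helper `to_container_path` is hypothetical and assumes the working directory is mounted at /phg, as in the docker commands in this notebook.

```python
import posixpath

def to_container_path(relative_path, mount_point="/phg"):
    """Map a path relative to working_dir onto the container mount point.

    Leading slashes are stripped first, because posixpath.join would
    otherwise treat the value as absolute and discard the mount point.
    """
    return posixpath.join(mount_point, relative_path.lstrip("/"))

print(to_container_path("config.txt"))
# prints /phg/config.txt
print(to_container_path("/outputDir/pangenome/pangenome.fa"))
# prints /phg/outputDir/pangenome/pangenome.fa
```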

In [5]:
! {DOCKER} run --name create_pangenome --rm \
    -v {working_dir}/:/phg/ \
    -t {DOCKER_VERSION} \
    /tassel-5-standalone/run_pipeline.pl -Xmx100G -debug -configParameters {CONFIG} \
    -HaplotypeGraphBuilderPlugin \
    -configFile {CONFIG} \
    -methods {METHODS} \
    -endPlugin \
    -WriteFastaFromGraphPlugin \
    -outputFile {FASTA_FILE} \
    -endPlugin {TO_LOG}

## Indexing the Pangenome

The next plugin in the imputation pipeline, FastqToMappingPlugin, expects a minimap2 index (.mmi) file. Run the code blocks below to generate it. This step is optional: a .mmi file may not be necessary if you align your reads outside of the PHG.

In [1]:
# EDIT ME #

# path to minimap index file, relative to working_dir
index_path = "/outputDir/pangenome/pangenome.mmi"

# Minimap2 index parameter k, the kmer length of the minimizers, 
# which is used to index the pangenome.
KMER_LENGTH = 21

# Minimap2 index parameter I, the maximum number of bases loaded into memory, 
# which is used to index the pangenome.
# This must be large enough to hold the entire pangenome in memory.
NUM_BASES = "90G"

# Minimap2 index parameter w, the minimizer window size, which is used to index the pangenome.
WINDOW_SIZE = 11

INDEX_PATH = "/phg/" + index_path
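
Since NUM_BASES must be large enough to hold the entire pangenome, it can help to count the bases in the pangenome fasta before indexing. The sketch below is an illustration only: both function names are hypothetical, and it assumes that the K, M, and G suffixes stand for 10^3, 10^6, and 10^9 bases.

```python
import io

# Assumed meaning of the size suffixes accepted for the -I value.
SUFFIXES = {"K": 10**3, "M": 10**6, "G": 10**9}

def parse_num_bases(text):
    """Convert a value such as '90G' into a number of bases."""
    text = text.strip().upper()
    if text and text[-1] in SUFFIXES:
        return int(float(text[:-1]) * SUFFIXES[text[-1]])
    return int(float(text))

def count_fasta_bases(handle):
    """Sum the lengths of all sequence (non-header) lines in a fasta."""
    return sum(len(line.strip()) for line in handle if not line.startswith(">"))

# Tiny in-memory example standing in for the real pangenome fasta.
example = io.StringIO(">101\nACGT\nAC\n>102\nGGCC\n")
total = count_fasta_bases(example)
print(total, total <= parse_num_bases("90G"))  # prints 10 True
```

If the total exceeds NUM_BASES, minimap2 splits the index into multiple parts, so raising NUM_BASES above the base count keeps the pangenome in a single index.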

In [None]:
! {DOCKER} run --name index_pangenome --rm \
    -v {working_dir}/:/phg/ \
    -t {DOCKER_VERSION} \
    minimap2 -d {INDEX_PATH} -k {KMER_LENGTH} -I {NUM_BASES} -w {WINDOW_SIZE} {FASTA_FILE}

## Next Steps

Want to continue with the imputation pipeline? The next step is to align your low-coverage sequence data to the fasta or .mmi file created with this notebook and map reads to haplotypes. The PHG has a plugin for this, FastqToMappingPlugin (see FastaToMapping), which uses minimap2 to align reads. Alignment can also be done outside the PHG, producing SAM files that are then converted to read mappings by SAMToReadMappingPlugin. This may be useful when you would like to perform the resource-intensive alignment step on a machine that cannot easily connect to your PHG database.