# FASTQ To Read Mapping

This notebook describes the procedure required to run FastqToMappingPlugin with a Practical Haplotype Graph. It uses minimap2 to align reads to the pangenome, then assigns reads to haplotypes in the reference range with the most hits. 

This notebook assumes you are running the plugin within Docker using the image maizegenetics/phg version 1.4 or later.

## Requirements

- A Practical Haplotype Graph database with the desired haplotypes loaded (see steps 1 and 2 for setting up the database)
- Haplotype method(s) to map reads to. This can be either a set of haplotypes loaded directly from assemblies or WGS data, or a set of consensus haplotypes (see CreateConsensiNotebook for details)
- A config file containing database connection information and (optionally) plugin parameters
- A keyfile. See below for details.

## Keyfile

The keyfile for this plugin describes the sets of fastq reads to be mapped. It is a tab-separated table with a required header row containing the following columns:
- cultivar: (required) Name of the taxon to be processed
- flowcell_lane: (required) Name of the flow cell this sample came from
- filename: (required) Name of the fastq file to be processed. Do not include paths
- filename2: (required if using paired end reads) Second fastq file to be processed
- PlateID: (optional) The ID of the plate

The combination of cultivar, flowcell_lane, and PlateID should be unique for each sample in the keyfile.
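
As an illustration of the format described above, the following sketch writes a small keyfile as tab-separated text and checks that each sample combination is unique. The sample names, flow cell labels, and fastq file names are hypothetical placeholders, and the helpers `keyfile_text` and `find_duplicate_samples` are not part of the PHG.

```python
import csv
import io
from collections import Counter

# Column names taken from the keyfile description above.
COLUMNS = ["cultivar", "flowcell_lane", "filename", "filename2", "PlateID"]

# Hypothetical samples: these names and file names are placeholders, not real data.
rows = [
    {"cultivar": "SampleA", "flowcell_lane": "FC1_L1", "filename": "SampleA_R1.fastq",
     "filename2": "SampleA_R2.fastq", "PlateID": "P1"},
    {"cultivar": "SampleB", "flowcell_lane": "FC1_L1", "filename": "SampleB_R1.fastq",
     "filename2": "SampleB_R2.fastq", "PlateID": "P1"},
]


def keyfile_text(rows):
    """Return the rows as tab-separated text, starting with the required header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def find_duplicate_samples(rows):
    """Return (cultivar, flowcell_lane, PlateID) combinations that occur more than once."""
    counts = Counter((r["cultivar"], r["flowcell_lane"], r.get("PlateID", "")) for r in rows)
    return [key for key, n in counts.items() if n > 1]


print(keyfile_text(rows))
print("duplicates:", find_duplicate_samples(rows))
```

Writing the returned text to a file in the working directory would give a keyfile of the shape the plugin expects, assuming the fastq file names match files in the fastq directory.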

## Parameters

### HaplotypeGraphBuilderPlugin Parameters

### Required

- configFile: Database configuration file
- methods: Pairs of methods (haplotype method name and range group method name). The two names in a pair are separated by a comma, and pairs are separated by colons. The range group is optional. Usage: `<haplotype method name 1>,<range group name 1>:<haplotype method name 2>,<range group name 2>:<haplotype method name 3>`
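
The format of the methods string can be sketched with two small helpers. Both `build_methods_arg` and `parse_methods_arg` are hypothetical names used only for illustration, and the range group name `gene_ranges` is a placeholder; the plugin itself only receives the final string.

```python
def build_methods_arg(pairs):
    """Join (haplotype_method, range_group) pairs into a methods string.

    range_group may be None, in which case only the haplotype method name is written.
    """
    parts = []
    for haplotype_method, range_group in pairs:
        if range_group:
            parts.append(f"{haplotype_method},{range_group}")
        else:
            parts.append(haplotype_method)
    return ":".join(parts)


def parse_methods_arg(methods):
    """Split a methods string back into (haplotype_method, range_group) pairs."""
    pairs = []
    for part in methods.split(":"):
        name, _, group = part.partition(",")
        pairs.append((name, group or None))
    return pairs


# "gene_ranges" is a placeholder range group name.
print(build_methods_arg([("HudsonAlpha_assembly", "gene_ranges"), ("public_assembly", None)]))

# A string with no commas is read as haplotype methods without range groups.
print(parse_methods_arg("HudsonAlpha_assembly:public_assembly"))
```

Under this reading, a value such as `HudsonAlpha_assembly:public_assembly` names two haplotype methods and no range groups.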

### Optional (default in parentheses)

- includeSequences: (true) Whether to include sequences in haplotype nodes
- includeVariantContexts: (false) Whether to include variant contexts in haplotype nodes. Variant contexts are not required for FastqToMappingPlugin.
- haplotypIds: List of haplotype IDs to include in the graph. If not specified, all IDs are included.
- chromosomes: List of chromosomes to include in the graph. Default is to include all chromosomes.
- taxa: List of taxa to include. Default is to include all taxa.
- localGVCFFolder: Folder where reference/assembly GVCFs are stored. Only required if includeVariantContexts is true.


### FastqToMappingPlugin Parameters


### Required

- minimap2IndexFile: minimap2 index file for the pangenome
- keyFile: keyFile with the list of files to process. See above for formatting.
- fastqDir: directory containing the fastq files
- methodName: a unique name for this read mapping method, to be stored in the database

### Optional (default in parentheses)

- maxRefRangeErr: (.25) Maximum allowed error when choosing the best reference range to count. Error is defined as 1-(mostHitRefRangeCount / totalHitCount)
- lowMemMode: (true) Run in low memory mode.
- maxSecondary: (20) Maximum number of secondary alignments to be returned by minimap2. This will be the value of the -N parameter in minimap2 command line.
- fParameter: (f1000,5000) The f parameter used by minimap2. If the sr preset (-x sr) is used, this parameter takes the form `f<int1,int2>`. From the minimap2 man page: "If integer, ignore minimizers occuring more than INT1 times. INT2 is only effective in the --sr or -xsr mode, which sets the threshold for a second round of seeding."
- minimapLocation: (minimap2) location of minimap2 executable
- methodDescription: Method description to be stored in the database
- debugDir: Directory to write out the read mapping files. This is optional for debug purposes.
- outputSecondaryStats: (false) Output secondary mapping statistics such as total AS for each haplotype ID
- isTestMethod: (false) Indicates whether the data is to be loaded against a test method. Data loaded with test methods are not cached with the PHG ktor server.
- updateDB: (true) If set to true, the read mappings will be written to the database. Otherwise nothing will be written.
- runWithoutGraph: (false) If true, will require the input of a JSON file created by CreateHapIdMapsPlugin and will not require an input Graph object from HaplotypeGraphBuilderPlugin. Useful when running alignments on a machine that cannot connect to the database.
- hapIDMapFile: Location of the HapIdMapFile where the graph information can be found. Required if runWithoutGraph is true.
- inputFastqFile: Input of the first fastq file to be run. If the keyfile has a pair it will be detected. If this is not set, everything in the input directory will be run. 
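
The error used by maxRefRangeErr, 1-(mostHitRefRangeCount / totalHitCount), can be worked through with a small example. This is a sketch of the formula only, not the plugin's code: treating each alignment of a read as one hit, and counting an error equal to the threshold as passing, are both assumptions, and the reference range IDs are made up.

```python
from collections import Counter


def ref_range_error(hit_ranges):
    """Return (best_range, error) for the reference ranges hit by one read.

    error = 1 - (mostHitRefRangeCount / totalHitCount)
    """
    counts = Counter(hit_ranges)
    best_range, best_count = counts.most_common(1)[0]
    return best_range, 1 - best_count / len(hit_ranges)


def passes_filter(hit_ranges, max_ref_range_err=0.25):
    """Assumption: a read is kept when its error does not exceed the threshold."""
    _, error = ref_range_error(hit_ranges)
    return error <= max_ref_range_err


# Made-up reference range IDs: four of five hits fall in range 12.
concentrated = [12, 12, 12, 12, 40]
# Hits spread over three ranges, with at most two in any one range.
scattered = [12, 40, 40, 7]

for hits in (concentrated, scattered):
    best, error = ref_range_error(hits)
    print(best, round(error, 2), passes_filter(hits))
```

In the first case the error is 1 - 4/5 = 0.2, below the default of 0.25; in the second it is 1 - 2/4 = 0.5, so a read like this would be excluded under the default setting.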

In [None]:
###########
# EDIT ME #
###########

# Path to the working directory
working_dir = "/workdir/ahb232/phg_sorghum_apr2023/"

# config file path (relative to working_dir)
config = "config.txt"

# path to key file (relative to working_dir)
key_file = "IS36143_keyfile.txt"

# methods to pass to HaplotypeGraphBuilderPlugin (see above for details)
haplotype_methods = "HudsonAlpha_assembly:public_assembly"

# location of the pangenome minimap index file (relative to working_dir)
minimap_index_file = "/outputDir/pangenome/pangenome.mmi"

# folder (relative to working_dir) containing the fastq reads to map to the pangenome 
fastq_dir = "/WGS/sorghumbase/downsampled/"

# name and (optional) description for the mapping method
mapping_method = "IS36143_downsampled_WGS_readmapping"
mapping_method_description = ""

# path to the log file (relative to working_dir)
# Note: You may set log_file to "", in which case the log will print to the notebook cell. 
# This may make the notebook slow to open if the log is very long.
log_file = "/logs/readmapping_IS36143_log.txt"
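
Before starting a long run, it can help to confirm that every fastq file named in the keyfile is present in the fastq directory. The helper below is a hypothetical pre-flight check, not part of the PHG; it assumes the keyfile columns `filename` and `filename2` described earlier and takes host paths as arguments.

```python
import csv
import os


def missing_fastq_files(keyfile_path, fastq_dir_path):
    """Return fastq file names listed in the keyfile that are not found in fastq_dir_path."""
    missing = []
    with open(keyfile_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            # filename2 is only present for paired-end reads, so empty values are skipped.
            for column in ("filename", "filename2"):
                name = (row.get(column) or "").strip()
                if name and not os.path.isfile(os.path.join(fastq_dir_path, name)):
                    missing.append(name)
    return missing


# Example call with the settings above (string concatenation, as in the notebook):
# missing_fastq_files(working_dir + key_file, working_dir + fastq_dir)
```

The example call concatenates strings rather than using `os.path.join(working_dir, fastq_dir)` because the values above begin with `/`, and `os.path.join` discards earlier components when a later one is absolute.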

In [None]:
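# RUN BUT DO NOT EDIT #

# Paths as seen inside the Docker container, where working_dir is mounted at /phg/
CONFIG = "/phg/" + config
KEYFILE = "/phg/" + key_file
MMI_FILE = "/phg/" + minimap_index_file
FASTQ_DIR = "/phg/" + fastq_dir

# The output redirect in the next cell is handled by the host shell, not the container,
# so the log path is built from working_dir (assumed location of the log folder).
LOG_FILE = working_dir + log_file

TO_LOG = ""

if log_file != "":
    TO_LOG = " > " + LOG_FILE

# DOCKER (docker executable) and DOCKER_VERSION (image name and tag) are used in the
# next cell and are assumed to be defined in an earlier cell.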

In [3]:
! {DOCKER} run --name create_pangenome --rm \
    -v {working_dir}/:/phg/ \
    -t {DOCKER_VERSION} \
    /tassel-5-standalone/run_pipeline.pl -Xmx100G -debug -configParameters {CONFIG} \
    -HaplotypeGraphBuilderPlugin \
    -configFile {CONFIG} \
    -methods {haplotype_methods} \
    -endPlugin \
    -FastqToMappingPlugin \
    -minimap2IndexFile {MMI_FILE} \
    -keyFile {KEYFILE} \
    -fastqDir {FASTQ_DIR} \
    -methodName {mapping_method} \
    -methodDescription {mapping_method_description} \
    -endPlugin {TO_LOG}
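
When `mapping_method_description` is left as an empty string, `{mapping_method_description}` expands to nothing and `-methodDescription` is followed directly by `-endPlugin`. One way to sidestep this is to assemble the plugin arguments in Python and add optional flags only when they have a value. The function `fastq_to_mapping_args` below is a hypothetical helper written for illustration; it uses only the flags shown in the command above, and the paths in the example call are placeholders.

```python
import shlex


def fastq_to_mapping_args(index_file, key_file, fastq_dir, method_name, method_description=""):
    """Return the FastqToMappingPlugin portion of the command as a list of arguments."""
    args = [
        "-FastqToMappingPlugin",
        "-minimap2IndexFile", index_file,
        "-keyFile", key_file,
        "-fastqDir", fastq_dir,
        "-methodName", method_name,
    ]
    # Only add the optional flag when there is a value to pass with it.
    if method_description:
        args += ["-methodDescription", method_description]
    args.append("-endPlugin")
    return args


# Placeholder container paths and method name, for illustration only.
args = fastq_to_mapping_args(
    "/phg/pangenome.mmi", "/phg/keyfile.txt", "/phg/fastq/", "example_readmapping"
)
# shlex.join quotes each argument so the result can be pasted into a shell command.
print(shlex.join(args))
```

The resulting string could be stored in a notebook variable and substituted into the `run_pipeline.pl` command in place of the hand-written FastqToMappingPlugin lines.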