# Input of Mapping Pipeline

## Input Files

To run the pipeline for a single-cell library, you need two things:

1. FASTQ files generated by bcl2fastq
    - If the SampleSheet used by bcl2fastq was made by yap, you don't need to build a FASTQ dataframe; just provide the path to the FASTQ files.
    - If the SampleSheet was not made by yap, you need to build a FASTQ dataframe, as described in the next section.
2. mapping_config.ini for mapping parameters


## FASTQ File Name Requirements
- The FASTQ files should be generated from the SampleSheet made in the previous step, because the pipeline relies heavily on predefined FASTQ file name patterns to automatically parse the uid, lane, read_type, etc.
- If the SampleSheet was not generated in the previous step, the FASTQ names may not support automatic parsing. In that case, you need to provide the FASTQ dataframe yourself (see the documentation on making a FASTQ dataframe), and the mapping summary also has to be done manually.
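
To illustrate what parsing uid, lane, and read_type from a file name can look like, here is a minimal sketch based on the standard bcl2fastq output naming (`<sample>_S<n>_L<lane>_<read>_001.fastq.gz`). It is not yap's own parser: treating the sample name as the uid, the `parse_fastq_name` helper, the `fastq_path` field, and the example file names are all assumptions made for illustration.

```python
import re
from pathlib import Path

# Standard bcl2fastq output name: <sample>_S<n>_L<lane>_<read>_001.fastq.gz
# Treating <sample> as the uid is an assumption made for this sketch.
FASTQ_NAME = re.compile(
    r"^(?P<uid>.+)_S(?P<sample_number>\d+)_L(?P<lane>\d{3})"
    r"_(?P<read_type>[RI][12])_001\.fastq\.gz$"
)


def parse_fastq_name(path):
    """Return uid, lane, and read_type parsed from a FASTQ path, or None."""
    match = FASTQ_NAME.match(Path(path).name)
    if match is None:
        # The name does not follow the expected pattern, so it cannot be
        # parsed automatically and would need a hand-made record.
        return None
    return {
        "uid": match["uid"],
        "lane": "L" + match["lane"],
        "read_type": match["read_type"],
        "fastq_path": str(path),
    }


# Hypothetical file names: two follow the pattern, one was renamed.
records = [
    parse_fastq_name(p)
    for p in [
        "fastq/plateA-1-A1_S1_L001_R1_001.fastq.gz",
        "fastq/plateA-1-A1_S1_L001_R2_001.fastq.gz",
        "fastq/renamed_sample.fq.gz",
    ]
]
print(records)
```

Names that return `None` here are the kind of files for which a manually built FASTQ dataframe would be needed.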


## mapping_config.ini

### What is the mapping_config.ini file?

- It gathers all adjustable parameters of the mapping pipeline into a single file in [INI format](https://en.wikipedia.org/wiki/INI_file), so you don't need to pass 100 parameters in a shell command.

- INI format is super simple:
    ```
    ; comments start with a semicolon
    [section1]
    key1=value1
    key2=value2
    
    [section2]
    key1=value1
    key2=value2
    ```
- Currently, the pipeline does not allow changing the sections and keys, so only change the values according to your needs.
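
If you want to inspect an INI file programmatically, Python's standard `configparser` module reads this format, including `;` comments. The sketch below parses the toy example from above; how yap itself reads the file is not described here, so treat this only as a way to check your own edits.

```python
import configparser

# The toy INI example from the text above.
EXAMPLE = """\
; comments start with a semicolon
[section1]
key1=value1
key2=value2

[section2]
key1=value1
key2=value2
"""

parser = configparser.ConfigParser()
parser.read_string(EXAMPLE)

# Sections keep their order; values are always read as strings.
sections = parser.sections()
value = parser["section1"]["key2"]
print(sections, value)
```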

### How to prepare the mapping_config.ini file?
You can print the default config file, save it to a location of your own, and modify the values.

```shell
# MODE is one of mc, mct, nome, mct-nome, depending on the library type (see the mapping modes below)
yap default-mapping-config --mode MODE
```

Here is an example of getting the default mapping_config.ini file for snmC-seq. You need to replace the placeholders with correct values, such as the correct barcode version (V1 or V2) or the path to the reference genome.

    yap default-mapping-config --mode mc

    # Executing default-mapping-config...
    ; Mapping configurations
    ;
    ; INI format
    ; [Section1]
    ; KEY1 = VALUE1
    ; KEY2 = VALUE2
    ;
    ; [Section2]
    ; KEY1 = VALUE1
    ; KEY2 = VALUE2
    ;
    ; lines start with ";" is comment.
    ;
    ; NOTE: Don't change any section or key names.
    ; Custom keys won't work, only change value when adjust parameters.
    ;

    [mode]
    mode = mc


    [multiplexIndex]
    ; This section is for demultiplex step
    ; V1: 8 random index version
    ; V2: 384 random index version
    barcode_version = USE_CORRECT_BARCODE_VERSION_HERE


    [fastqTrim]
    r1_adapter = AGATCGGAAGAGCACACGTCTGAAC
    r2_adapter = AGATCGGAAGAGCGTCGTGTAGGGA
    ; Universal illumina adapter

    overlap = 6
    ; least overlap of base and illumina adapter

    r1_left_cut = 10
    ; constant length to trim at 5 prime end, apply before quality trim.
    ; Aim to cut random primer part, determined by random primer length.
    ; Random primer can impact results, see bellow:
    ; https://sequencing.qcfail.com/articles/mispriming-in-pbat-libraries-causes-methylation-bias-and-poor-m
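
Before running the pipeline, it can help to check that no placeholder is left in your copy of the config. The sketch below uses Python's standard `configparser` on a shortened excerpt of the default config. The `find_placeholders` helper is hypothetical, and matching on the `_HERE` suffix is an assumption based on the single placeholder visible in the excerpt; other placeholders may be named differently.

```python
import configparser

# Shortened excerpt of the default config shown above.
CONFIG_TEXT = """\
[mode]
mode = mc

[multiplexIndex]
; V1: 8 random index version
; V2: 384 random index version
barcode_version = USE_CORRECT_BARCODE_VERSION_HERE

[fastqTrim]
overlap = 6
r1_left_cut = 10
"""


def find_placeholders(config_text, marker="_HERE"):
    """List (section, key) pairs whose value still looks like a placeholder.

    Matching on the "_HERE" suffix is an assumption, not a documented rule.
    """
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    return [
        (section, key)
        for section in parser.sections()
        for key, value in parser[section].items()
        if value.endswith(marker)
    ]


unfilled = find_placeholders(CONFIG_TEXT)
print(unfilled)

# After replacing the placeholder with a real value, nothing is reported.
edited = CONFIG_TEXT.replace("USE_CORRECT_BARCODE_VERSION_HERE", "V2")
filled = find_placeholders(edited)
print(filled)
```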

### Mapping Modes of the config file
yap supports several mapping modes for different experiments. The mode is controlled by mapping_config.ini, as described below.

#### snmC-seq2 (mc)
- Normal snmC-seq2 library

#### snmCT-seq (mct)
- snmCT-seq library; each cell contains mixed reads from DNA and RNA
- The major differences from the mc mode are:
    - STAR mapping is required
    - The bismark BAM file is filtered to get DNA reads
    - The STAR BAM file is filtered to get RNA reads

#### snmC-seq + NOMe treatment (nome)
- snmC-seq with NOMe treatment, where GCH sites carry open chromatin information and HCN sites carry normal methylation information.
- The major difference from the mc mode is:
    - One additional base is added to the context column of ALLC, so that GpC sites can be distinguished from HpC sites
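
The extra base matters because a cytosine can only be assigned to GCH or HCN if the base before it is known. The sketch below classifies a cytosine from its flanking bases, using the IUPAC code H (A, C, or T). The `classify_nome_site` helper and the "other" label are hypothetical, and the exact layout of the ALLC context column is not shown here, so the upstream and downstream bases are passed in separately.

```python
H_BASES = {"A", "C", "T"}  # IUPAC code H: any base except G


def classify_nome_site(upstream, downstream):
    """Classify a cytosine by the bases immediately before and after it."""
    upstream, downstream = upstream.upper(), downstream.upper()
    if upstream in H_BASES:
        # Not preceded by G: reports normal methylation (HCN).
        return "HCN"
    if upstream == "G" and downstream in H_BASES:
        # GpC followed by a non-G base: reports open chromatin (GCH).
        return "GCH"
    # Anything else, e.g. GCG, matches neither pattern named in the text.
    return "other"


print(classify_nome_site("G", "A"))  # upstream G, downstream A
print(classify_nome_site("A", "G"))  # upstream A, downstream G
print(classify_nome_site("G", "G"))  # upstream G, downstream G
```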

#### snmCT-seq + NOMe treatment (mct-nome)
- snmCT-seq with NOMe treatment; each cell contains mixed reads from DNA and RNA, and in the DNA reads, GCH sites carry open chromatin information and HCN sites carry normal methylation information.
- The major differences from the mc mode are:
    - STAR mapping is required
    - The bismark BAM file is filtered to get DNA reads
    - The STAR BAM file is filtered to get RNA reads
    - One additional base is added to the context column of ALLC, so that GpC sites can be distinguished from HpC sites
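
The four modes differ along two independent axes, which the following sketch summarizes as a lookup table. It is only a restatement of the descriptions above: the dictionary layout, the field names, and the `describe_mode` helper are hypothetical and do not reflect how yap stores or dispatches modes internally.

```python
# Extra steps per mode, transcribed from the mode descriptions.
# The layout and field names are hypothetical.
MODE_STEPS = {
    "mc": {"star_mapping": False, "extra_context_base": False},
    "mct": {"star_mapping": True, "extra_context_base": False},
    "nome": {"star_mapping": False, "extra_context_base": True},
    "mct-nome": {"star_mapping": True, "extra_context_base": True},
}


def describe_mode(mode):
    """Return the extra steps a mode needs compared with the mc mode."""
    steps = MODE_STEPS[mode]
    notes = []
    if steps["star_mapping"]:
        notes.append(
            "run STAR mapping; filter the bismark BAM for DNA reads "
            "and the STAR BAM for RNA reads"
        )
    if steps["extra_context_base"]:
        notes.append(
            "add one base to the ALLC context column to separate "
            "GpC sites from HpC sites"
        )
    return notes or ["no extra steps"]


for mode in MODE_STEPS:
    print(mode, "->", describe_mode(mode))
```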