# Prepare Input Files for Pipeline

## FASTQ Requirements
- FASTQ files should be generated with the SampleSheet made in the previous step. This ensures the FASTQ names can be parsed automatically to get the uid, lane, and read_type information.
- If the SampleSheet was not generated by the previous step, the names may not support automatic parsing, and you need to provide the FASTQ dataframe yourself. See the FASTQ dataframe section below.
- The default supported FASTQ name pattern is parsed as follows:
```python
*_, plate1, plate2, multi_field = path.name.split('-')
plate_pos, _, lane, read_type, _ = multi_field.split('_')
```
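To make the pattern concrete, here is a minimal sketch that applies the same two splits to one file name. The file name is made up for illustration, and joining the two plate names and the well position into a uid is an assumption, not something this document confirms.

```python
from pathlib import Path

# Hypothetical file name, made up for illustration only.
path = Path("/data/Pool-PlateA-PlateB-A1_S1_L001_R1_001.fastq.gz")

# Only the last three '-'-separated fields are used; earlier fields are ignored.
*_, plate1, plate2, multi_field = path.name.split('-')
# multi_field must contain exactly five '_'-separated fields.
plate_pos, _, lane, read_type, _ = multi_field.split('_')

# Assumption: the uid joins the two plate names and the well position.
uid = f"{plate1}-{plate2}-{plate_pos}"
print(uid, lane, read_type)
```

A name with fewer than three `-`-separated fields, or with a last field that does not have exactly five `_`-separated parts, raises a `ValueError` during unpacking. That is a quick way to spot names that do not follow the default pattern.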

## Input Files
To run the pipeline for a single-cell library, you need the following files ready (the first two are required):
1. FASTQ files for the whole library, as described above
2. mapping_config.ini for mapping parameters
3. (optional) fastq_dataframe.csv for FASTQ information. yap can generate this file automatically if the FASTQ files were demultiplexed with the SampleSheet made by `yap make-sample-sheet`

## Mapping Modes
yap supports several mapping modes for different experiments. The mode is controlled by mapping_config.ini, as described below.


### snmC-seq2 (mc)
- Normal snmC-seq2 library

### snmCT-seq (mct)
- snmCT-seq library; each cell contains mixed reads from DNA and RNA
- The major differences are:
    - STAR mapping is required
    - The bismark BAM file is filtered to get DNA reads
    - The STAR BAM file is filtered to get RNA reads

### snmC-seq + NOMe treatment (nome)
- snmC-seq with NOMe treatment, where GCH sites contain open chromatin information and HCN sites contain normal methylation information.
- The major differences are:
    - One additional base is added to the context column of ALLC

### snmCT-seq + NOMe treatment (mct-nome)
- snmCT-seq with NOMe treatment; each cell contains mixed reads from DNA and RNA, and in the DNA reads, GCH sites contain open chromatin information and HCN sites contain normal methylation information.
- The major differences are:
    - STAR mapping is required
    - The bismark BAM file is filtered to get DNA reads
    - The STAR BAM file is filtered to get RNA reads
    - One additional base is added to the context column of ALLC
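
The four modes differ along two axes, which can be summarized in a small lookup table. This is only a restatement of the lists above; the key names `star_mapping` and `extra_context_base` are hypothetical labels chosen for this sketch, not yap configuration options.

```python
# Summary of the four modes as described in this section.
# The key names are hypothetical labels, not yap configuration options.
MODE_FEATURES = {
    "mc": {"star_mapping": False, "extra_context_base": False},
    "mct": {"star_mapping": True, "extra_context_base": False},
    "nome": {"star_mapping": False, "extra_context_base": True},
    "mct-nome": {"star_mapping": True, "extra_context_base": True},
}


def describe_mode(mode):
    """Return a one-line description of how a mode differs from mc."""
    features = MODE_FEATURES[mode]
    parts = []
    if features["star_mapping"]:
        parts.append("STAR mapping plus DNA/RNA read filtering")
    if features["extra_context_base"]:
        parts.append("one extra base in the ALLC context column")
    return f"{mode}: " + ("; ".join(parts) if parts else "no extra steps")


for mode in MODE_FEATURES:
    print(describe_mode(mode))
```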

## mapping_config.ini

### What is mapping_config.ini file?

- It gathers all adjustable parameters of the mapping pipeline into a single file in [INI format](https://en.wikipedia.org/wiki/INI_file), so you don't need to pass 100 parameters in a shell command...

- INI format is super simple:
    ```
    ; comments start with a semicolon
    [section1]
    key1=value1
    key2=value2
    
    [section2]
    key1=value1
    key2=value2
    ```
- Currently, the pipeline does not allow changing the sections and keys, so just change the values according to your needs.
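
If you want to inspect or sanity-check an INI file from Python, the standard-library `configparser` module is one option. The sketch below parses the generic example above and changes one value while leaving sections and keys alone. It does not use real yap keys and is not necessarily how yap reads the file.

```python
import configparser

# The generic INI example from above (not real yap keys).
ini_text = """\
; comments start with a semicolon
[section1]
key1=value1
key2=value2

[section2]
key1=value1
key2=value2
"""

config = configparser.ConfigParser()
config.read_string(ini_text)

# Change a value only; sections and keys stay as they are.
config["section1"]["key1"] = "new_value"

for section in config.sections():
    for key, value in config[section].items():
        print(f"[{section}] {key} = {value}")
```

Note that `configparser` discards comments when it writes a file back out, so editing the file by hand may be preferable if you want to keep them.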

### How to prepare mapping_config.ini file?
You can print out the default config file, save it to your own location, and modify the values.
```shell
# MODE should be one of mc, nome, mct, mct-nome, depending on the library type. Default is mc
# see the mapping modes described above
yap default-mapping-config --mode MODE
```
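
If you prefer to script this step, the following sketch wraps the same command from Python. It assumes that `yap` is on your `PATH` and that the command prints the config to standard output; `build_command` and `save_default_config` are hypothetical helper names introduced for this example.

```python
import subprocess

# Mode names listed in this document.
VALID_MODES = ("mc", "nome", "mct", "mct-nome")


def build_command(mode="mc"):
    """Build the argument list for `yap default-mapping-config --mode MODE`."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    return ["yap", "default-mapping-config", "--mode", mode]


def save_default_config(mode, out_path):
    """Run yap and write whatever it prints to out_path.

    Assumption: yap is installed and prints the config to stdout.
    """
    result = subprocess.run(
        build_command(mode), check=True, capture_output=True, text=True
    )
    with open(out_path, "w") as handle:
        handle.write(result.stdout)


# Example call (not executed here because it needs yap installed):
# save_default_config("mct", "mapping_config.ini")
```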

## FASTQ dataframe (optional)
**NOTE: yap can generate this file automatically if the FASTQ files were demultiplexed with the SampleSheet made by `yap make-sample-sheet`.**


### What is FASTQ dataframe?
- A tab-separated table containing the information for all FASTQ files of a library.
- The first line is the header; each following line describes one FASTQ file.
- IMPORTANT: it must have these 4 columns (column order doesn't matter, but the column names must match exactly):
    - lane: sequencing lane the FASTQ file comes from, e.g. L001, L002, L003, L004...
    - uid: a unique id corresponding to each Illumina i5/i7 index combination.
    - read_type: read 1 or read 2, e.g. R1, R2
    - fastq_path: absolute path of the FASTQ file.
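
The column requirement above can be checked before running the pipeline. This is a minimal standard-library sketch; `missing_columns` is a hypothetical helper name and the two in-memory tables contain made-up values.

```python
import csv
import io

REQUIRED_COLUMNS = {"lane", "uid", "read_type", "fastq_path"}


def missing_columns(handle):
    """Return the required column names absent from a tab-separated table."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, [])
    return REQUIRED_COLUMNS - set(header)


# Made-up tables: the first has all columns in a different order,
# the second lacks read_type and fastq_path.
good = io.StringIO("uid\tlane\tfastq_path\tread_type\nU1\tL001\t/abs/a.fastq.gz\tR1\n")
bad = io.StringIO("uid\tlane\tpath\n")

print(missing_columns(good))         # set()
print(sorted(missing_columns(bad)))  # ['fastq_path', 'read_type']
```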

### How to prepare FASTQ dataframe?
- Because different projects may use different naming patterns, you should figure out which parts of the file name are the lane, uid, and read_type, then put this information for each FASTQ file into a table like the one below.
- This may seem trivial, but once it is done, you can reuse the code for all subsequent experiments that follow the same name pattern.
- So it is HIGHLY RECOMMENDED THAT YOU DO NOT CHANGE THE FASTQ NAME PATTERN THROUGHOUT YOUR PROJECT.
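
As a sketch of what such reusable code might look like, the example below parses file names with a regular expression and writes a tab-separated table with the four required columns. The naming pattern `<uid>_<lane>_<read_type>.fastq.gz`, the paths, and the helper name `fastq_record` are all hypothetical; replace the pattern with the one your project actually uses.

```python
import csv
import io
import re
from pathlib import PurePosixPath

# Hypothetical naming pattern: <uid>_<lane>_<read_type>.fastq.gz
NAME_PATTERN = re.compile(
    r"^(?P<uid>.+)_(?P<lane>L\d{3})_(?P<read_type>R[12])\.fastq\.gz$"
)


def fastq_record(path):
    """Parse one FASTQ path into a row with the four required columns."""
    match = NAME_PATTERN.match(PurePosixPath(path).name)
    if match is None:
        raise ValueError(f"unexpected FASTQ name: {path}")
    return {**match.groupdict(), "fastq_path": str(path)}


# Made-up absolute paths for illustration.
paths = ["/abs/dir/cellA_L001_R1.fastq.gz", "/abs/dir/cellA_L001_R2.fastq.gz"]
records = [fastq_record(p) for p in paths]

# Write the table to an in-memory buffer; use open(...) for a real file.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=["lane", "uid", "read_type", "fastq_path"],
    delimiter="\t",
    lineterminator="\n",
)
writer.writeheader()
writer.writerows(records)
table = buffer.getvalue()
print(table)
```

Raising an error on names that do not match, rather than skipping them, makes an accidental change in naming pattern visible right away.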

### FASTQ dataframe example

    import pandas as pd
    df = pd.read_table('fastq_dataframe.tsv.gz', index_col=None)[['lane', 'uid', 'read_type', 'fastq_path']]
    print(f'This FASTQ dataframe has {df.shape[0]} rows (not including the header row) for {df.shape[0]} FASTQ files, from a standard CEMBA snmC-seq experiment using 8 384-well plates.')
    df.head()

This FASTQ dataframe has 3072 rows (not including the header row) for 3072 FASTQ files, from a standard CEMBA snmC-seq experiment using 8 384-well plates.


| | lane | uid | read_type | fastq_path |
|---|---|---|---|---|
| 0 | L001 | CEMBA181022-6B-1-CEMBA181022-6B-2-A1 | R1 | /gale/oberon/data11/181119_A00280_0033_BHCHYKD... |
| 1 | L001 | CEMBA181022-6B-1-CEMBA181022-6B-2-A1 | R2 | /gale/oberon/data11/181119_A00280_0033_BHCHYKD... |
| 2 | L001 | CEMBA181022-6B-1-CEMBA181022-6B-2-A2 | R1 | /gale/oberon/data11/181119_A00280_0033_BHCHYKD... |
| 3 | L001 | CEMBA181022-6B-1-CEMBA181022-6B-2-A2 | R2 | /gale/oberon/data11/181119_A00280_0033_BHCHYKD... |
| 4 | L001 | CEMBA181022-6B-1-CEMBA181022-6B-2-A3 | R1 | /gale/oberon/data11/181119_A00280_0033_BHCHYKD... |
