# Prepare FASTQ dataframe (Optional)

NOTICE: You can skip this page if you start by preparing the sample sheet with yap.

## FASTQ Name Pattern
### V1
- 8-random-primer
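- Two 384-well plates are multiplexed together
```python
# `path` is a pathlib.Path pointing to one FASTQ file
*_, plate1, plate2, multi_field = path.name.split('-')
# the 2nd and 5th underscore-separated fields are not used
plate_pos, _, lane, read_type, _ = multi_field.split('_')
```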
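As an illustration, the V1 split can be wrapped in a small function and tried on a single name. This is only a sketch: the file name below is hypothetical (it is not taken from a real run), and `parse_v1` is a helper name introduced here. Because the name is split on `-`, plate names that themselves contain hyphens would be split into more pieces than this example shows.

```python
from pathlib import Path


def parse_v1(path):
    """Split a V1-style FASTQ file name into its fields (illustrative helper)."""
    *_, plate1, plate2, multi_field = path.name.split('-')
    plate_pos, _, lane, read_type, _ = multi_field.split('_')
    return {
        'plate1': plate1,
        'plate2': plate2,
        'plate_pos': plate_pos,
        'lane': lane,
        'read_type': read_type,
    }


# Hypothetical file name, used only to show which piece ends up where
v1_fields = parse_v1(Path('Pool-PlateA-PlateB-A1_S1_L001_R1_001.fastq.gz'))
print(v1_fields)
```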

### V2 
- 384-random-primer
- Each plate has 6 multiplex groups. Cells within a multiplex group are multiplexed together with one SetB primer.
```python
# `path` is a pathlib.Path pointing to one FASTQ file
*_, plate, multiplex_group, multi_field = path.name.split('-')
primer_name, _, lane, read_type, _ = multi_field.split('_')
# primer_name is the Illumina i5/i7 index name (SetB Pos384)
```
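The same kind of check can be done for the V2 pattern. Again this is a sketch under assumptions: the file name is hypothetical and `parse_v2` is a helper name introduced here, not part of yap.

```python
from pathlib import Path


def parse_v2(path):
    """Split a V2-style FASTQ file name into its fields (illustrative helper)."""
    *_, plate, multiplex_group, multi_field = path.name.split('-')
    primer_name, _, lane, read_type, _ = multi_field.split('_')
    return {
        'plate': plate,
        'multiplex_group': multiplex_group,
        'primer_name': primer_name,
        'lane': lane,
        'read_type': read_type,
    }


# Hypothetical file name: plate "Plate1", multiplex group 3, primer position A12
v2_fields = parse_v2(Path('Pool-Plate1-3-A12_S5_L002_R2_001.fastq.gz'))
print(v2_fields)
```

Note that every field comes back as a string, so the multiplex group here is `'3'`, not the integer `3`.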

## Make FASTQ dataframe (optional)
**NOTE: yap can generate this file automatically if the FASTQ files were demultiplexed with a sample sheet made by `yap make-sample-sheet`.**


### What is FASTQ dataframe?
- A tab-separated table containing all the information about the FASTQ files of a library.
- The first line is the header; every other line describes one FASTQ file.
- IMPORTANT: the table must have these 4 columns (the column order doesn't matter, but the column names must match exactly):
    - lane: the sequencing lane the FASTQ file comes from, e.g. L001, L002, L003, L004...
    - uid: a unique ID corresponding to each Illumina i5/i7 index combination.
    - read_type: read 1 or read 2, e.g. R1, R2
    - fastq_path: the absolute path of the FASTQ file.
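A quick way to check the column requirement before running anything else is to compare the header against the four required names. The helper below is a minimal sketch; `REQUIRED_COLUMNS` and `missing_columns` are names introduced here for illustration and are not part of yap.

```python
REQUIRED_COLUMNS = ('lane', 'uid', 'read_type', 'fastq_path')


def missing_columns(columns):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]


# Column order does not matter, only the names do
complete_header = ['fastq_path', 'uid', 'read_type', 'lane']
incomplete_header = ['fastq_path', 'uid', 'lane']

print(missing_columns(complete_header))    # []
print(missing_columns(incomplete_header))  # ['read_type']
```

With a pandas dataframe, the same function can be called as `missing_columns(df.columns)`.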

### How to prepare FASTQ dataframe?
- Because different projects may use different naming patterns, you need to work out which part of the file name is the lane, the uid and the read_type, then collect this information for each FASTQ file in a table like the example below.
- This may seem trivial, but once you have done it, you can reuse the code for all subsequent experiments that follow the same name pattern.
- It is therefore HIGHLY RECOMMENDED THAT YOU DO NOT CHANGE THE FASTQ NAME PATTERN DURING YOUR PROJECT.
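The steps above could be sketched roughly as follows. Everything specific here is an assumption made for illustration: the file names and the `/data/run1` directory are hypothetical, `parse_name` and `make_fastq_dataframe` are helper names introduced here, and the uid is assumed to be the part of the file name before the first underscore. Adapt the parsing function to your own name pattern.

```python
from pathlib import Path

import pandas as pd


def parse_name(path):
    """Extract lane, uid and read_type from one FASTQ file name.

    Assumption: the name has five underscore-separated fields and the
    first one identifies the cell (uid).
    """
    uid, _, lane, read_type, _ = path.name.split('_')
    return {
        'lane': lane,
        'uid': uid,
        'read_type': read_type,
        'fastq_path': str(path.absolute()),
    }


def make_fastq_dataframe(fastq_paths, parse=parse_name):
    """Build a table with one row per FASTQ file."""
    records = [parse(Path(p)) for p in fastq_paths]
    return pd.DataFrame(records, columns=['lane', 'uid', 'read_type', 'fastq_path'])


# Hypothetical paths; in practice they could come from Path(fastq_dir).glob('*.fastq.gz')
example_paths = [
    '/data/run1/Pool-PlateA-PlateB-A1_S1_L001_R1_001.fastq.gz',
    '/data/run1/Pool-PlateA-PlateB-A1_S1_L001_R2_001.fastq.gz',
]
fastq_df = make_fastq_dataframe(example_paths)

# Save as a tab-separated file without the row index:
# fastq_df.to_csv('fastq_dataframe.tsv.gz', sep='\t', index=False)
```

Passing the parsing function as an argument keeps the table-building code unchanged when a later project uses a different name pattern.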

### FASTQ dataframe example (V1 example)

```python
import pandas as pd

# Load the table and keep only the four required columns
df = pd.read_table('fastq_dataframe.tsv.gz', index_col=None)[['lane', 'uid', 'read_type', 'fastq_path']]
print(f'This FASTQ dataframe has {df.shape[0]} rows (excluding the header row), one per FASTQ file, from a standard CEMBA snmC-seq experiment using 8 384-well plates.')
df.head()
```

This FASTQ dataframe has 3072 rows (excluding the header row), one per FASTQ file, from a standard CEMBA snmC-seq experiment using 8 384-well plates.


|   | lane | uid | read_type | fastq_path |
|---|------|-----|-----------|------------|
| 0 | L001 | CEMBA181022-6B-1-CEMBA181022-6B-2-A1 | R1 | /gale/oberon/data11/181119_A00280_0033_BHCHYKD... |
| 1 | L001 | CEMBA181022-6B-1-CEMBA181022-6B-2-A1 | R2 | /gale/oberon/data11/181119_A00280_0033_BHCHYKD... |
| 2 | L001 | CEMBA181022-6B-1-CEMBA181022-6B-2-A2 | R1 | /gale/oberon/data11/181119_A00280_0033_BHCHYKD... |
| 3 | L001 | CEMBA181022-6B-1-CEMBA181022-6B-2-A2 | R2 | /gale/oberon/data11/181119_A00280_0033_BHCHYKD... |
| 4 | L001 | CEMBA181022-6B-1-CEMBA181022-6B-2-A3 | R1 | /gale/oberon/data11/181119_A00280_0033_BHCHYKD... |
