# How to run the processing pipeline on raw data

This documentation contains instructions on how to preprocess and QC raw data from studies that use cadaveric islets generated and/or provided by the Human Pancreas Analysis Program (HPAP), the Integrated Islet Distribution Program (IIDP), and Prodo Labs (Prodo).

## Genome reference build

`references_Homo_sapiens_assembly38_noALT_noHLA_noDecoy.fasta` was downloaded from https://console.cloud.google.com/storage/browser/gtex-resources/references on Oct 28, 2024, following the TOPMed pipeline at https://github.com/broadinstitute/gtex-pipeline/blob/master/TOPMed_RNAseq_pipeline.md

`gencode.v39.annotation.gtf.gz` was downloaded from the GENCODE v39 website https://www.gencodegenes.org/human/release_39.html on Oct 28, 2024, and then decompressed with `gunzip gencode.v39.annotation.gtf.gz`.

To build the STAR index, submit `scripts/buildSTARindex.slurm` with `sbatch scripts/buildSTARindex.slurm`.
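
The contents of `scripts/buildSTARindex.slurm` are not reproduced here. As a rough sketch, a STAR index build for the two reference files above usually comes down to a single `STAR --runMode genomeGenerate` call. The helper name `star_index_cmd`, the output directory, the `--sjdbOverhang` value of 100 (STAR's default), and the thread count are assumptions for illustration, not values taken from the actual script.

```shell
# Print a STAR genomeGenerate command (nothing is executed).
# Arguments: index directory, genome FASTA, annotation GTF,
#            optional sjdbOverhang (assumed default 100), optional threads (assumed default 8).
star_index_cmd() {
    local genome_dir="$1" fasta="$2" gtf="$3" overhang="${4:-100}" threads="${5:-8}"
    printf '%s ' STAR --runMode genomeGenerate \
        --genomeDir "$genome_dir" \
        --genomeFastaFiles "$fasta" \
        --sjdbGTFfile "$gtf" \
        --sjdbOverhang "$overhang" \
        --runThreadN "$threads"
    printf '\n'
}
```

For example, `star_index_cmd star_index references_Homo_sapiens_assembly38_noALT_noHLA_noDecoy.fasta gencode.v39.annotation.gtf` prints the command so it can be reviewed before being placed in a SLURM script. The printed command is not shell-quoted, so paths containing spaces would need quoting.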

## Prerequisites for the preprocessing pipeline

The preprocessing steps rely heavily on the Nextflow pipeline developed at `https://github.com/ParkerLab/snRNAseq-NextFlow`.
To get started, clone the repository with `git clone https://github.com/porchard/snRNAseq-NextFlow.git`.

## Processing samples using islets generated and/or provided by HPAP

Because the pipeline assumes that all libraries processed in the same run share the same chemistry, we split the processing into two directories, `snRNAseq-NextFlow_v2` and `snRNAseq-NextFlow_v3`, to avoid conflicts.

### Step 1: create library.config file (to run the Nextflow pipeline)

This step utilizes the script `scripts/hpap_makeRNAconfig.bash`. It is important to note that this script assumes data is in a directory named `/nfs/turbo/umms-scjp-pank/1_HPAP/data/`; this path needs to be changed if the script is to be reused. For this particular dataset, one can simply symlink all data to `/nfs/turbo/umms-scjp-pank/1_HPAP/data/`.

The following example creates a config file for V3 chemistry data:

```
cd /nfs/turbo/umms-scjp-pank/1_HPAP/scripts/rna-library-config
# Get the list of V3 samples (first column of the metadata table)
samples=$(grep 10X-Chromium-GEX-3p-v3 /scratch/scjp_root/scjp99/PancDB/metadata/PancDB_scRNA-seq_metadata_10x.txt | cut -f 1)

for s in $samples
do
    # List the R1 FASTQ files of this sample and count them
    d=$(cd /nfs/turbo/umms-scjp-pank/1_HPAP/data/"$s"/data && ls *RNA*R1*)
    d1=($d)
    n="${#d1[@]}"
    # Append this sample's entry to the config file
    bash makeRNAconfig.bash -s "$s" -d "$d" -i "$n" -o library-config_v3.json
done
```

After this, we need to manually edit the resulting config file (`library-config_v3.json` in the example above) to add details such as `"libraries": {` and the enclosing `{` and `}` at the beginning and end of the file, and to double-check library names. In hindsight, there should be quicker and smarter ways to create this file, which we will work on for future releases.
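
Because the manual edits are easy to get wrong, a quick structural check before submitting the job can save a failed run. The following is a minimal sketch only: the function name `check_library_config` is hypothetical, and it checks just the items mentioned above (enclosing braces, a `"libraries": {` entry, balanced braces), not the full schema expected by the pipeline.

```shell
# Basic structural check of a library config file after manual editing.
# Returns 0 if the file passes all checks, 1 otherwise (messages go to stderr).
check_library_config() {
    local config="$1" status=0 compact opens closes
    # Strip all whitespace so the first and last characters can be inspected.
    compact=$(tr -d '[:space:]' < "$config")
    if [[ "$compact" != \{* || "$compact" != *\} ]]; then
        echo "missing enclosing { or } in $config" >&2
        status=1
    fi
    if [[ "$compact" != *'"libraries":{'* ]]; then
        echo "missing \"libraries\": { in $config" >&2
        status=1
    fi
    # Count opening and closing braces; the two counts must match.
    opens=$(tr -cd '{' < "$config" | wc -c)
    closes=$(tr -cd '}' < "$config" | wc -c)
    if (( opens != closes )); then
        echo "unbalanced braces in $config: $opens opening vs $closes closing" >&2
        status=1
    fi
    return "$status"
}
```

Usage would be along the lines of `check_library_config library-config_v3.json && echo OK`. Where Python is available, `python3 -m json.tool library-config_v3.json > /dev/null` is a stricter alternative that validates the full JSON syntax.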

### Step 2: submit SLURM job to run Nextflow pipeline

```
cd /nfs/turbo/umms-scjp-pank/1_HPAP/scripts/snRNAseq-NextFlow_v3/ # Nextflow creates work/ in the directory it is launched from
sbatch --job-name=rnaQC_v3 --mem=500M --time=24:00:00 --account=scjp99 --mail-user=vthihong@umich.edu --mail-type=END,FAIL --signal=B:TERM@60 --wrap="exec ~/tools/nextflow run -resume -params-file /nfs/turbo/umms-scjp-pank/1_HPAP/scripts/rna-library-config/library-config_v3.json --barcode-whitelist /nfs/turbo/umms-scjp-pank/1_HPAP/scripts/snRNAseq-NextFlow_v3/3M-february-2018.txt --chemistry V3 --results /nfs/turbo/umms-scjp-pank/1_HPAP/results/v3_chemistry /nfs/turbo/umms-scjp-pank/1_HPAP/scripts/snRNAseq-NextFlow_v3/main.nf"
```

It is *critical* to note that all results in `/nfs/turbo/umms-scjp-pank/1_HPAP/results/v3_chemistry` are *symlinks* by default. The actual files are stored in `work/`, which is created automatically in the directory from which the Nextflow pipeline is launched, so deleting or moving `work/` breaks the results. In our case, the launch directory is `/nfs/turbo/umms-scjp-pank/1_HPAP/scripts/snRNAseq-NextFlow_v3/`.
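
Since the results depend on `work/`, it can be useful to check for dangling links, or to make a standalone copy of the results before cleaning up. The two helpers below are a sketch using only standard `find` and `cp`; the function names are hypothetical.

```shell
# List symlinks under a results directory whose targets no longer exist.
find_broken_links() {
    local results_dir="$1"
    # "test -e" follows the link, so it fails only when the target is missing.
    find "$results_dir" -type l ! -exec test -e {} \; -print
}

# Copy a results directory, replacing every symlink with the file it points to.
# The destination should not exist yet; cp reports an error for broken links.
materialize_results() {
    local results_dir="$1" dest="$2"
    cp -RL "$results_dir" "$dest"
}
```

For example, `find_broken_links /nfs/turbo/umms-scjp-pank/1_HPAP/results/v3_chemistry` prints nothing as long as `work/` is intact.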

## Processing samples using islets generated and/or provided by IIDP and Prodo

Because the pipeline assumes that all libraries processed in the same run share the same chemistry, we need to separate the processing into different directories. In this case, we split the runs by study to avoid conflicts and to speed up processing. Where one study used multiple chemistries for different samples, we split the runs by chemistry as well.

### Step 1: create library.config file (to run the Nextflow pipeline)

This step is very similar to the one for HPAP, but uses the script `scripts/iidp_makeRNAconfig.bash`. It is important to note that this script assumes data is in a directory named `/nfs/turbo/umms-scjp-pank/2_IIDP/0_rawData/`; this path needs to be changed if the script is to be reused.

Of note, studies differ in how they deposit their sequencing data in GEO, so it is crucial to check read lengths beforehand to make sure each file is used in its proper place in the config file.
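
One way to check read lengths is to tabulate them over the first reads of each FASTQ file. The helper below is a sketch; the name `fastq_read_lengths` and the default of 1000 reads are assumptions. For 10x 3' libraries the barcode/UMI read is expected to be 26 bp for V2 (16 bp barcode + 10 bp UMI) and 28 bp for V3 (16 bp barcode + 12 bp UMI), while index reads are much shorter and cDNA reads are longer.

```shell
# Print the distinct read lengths among the first N reads of a gzipped FASTQ,
# one "length<TAB>count" pair per line, sorted by length.
fastq_read_lengths() {
    local fastq="$1" n="${2:-1000}"
    gzip -cd "$fastq" \
        | head -n "$((n * 4))" \
        | awk 'NR % 4 == 2 { print length($0) }' \
        | sort -n \
        | uniq -c \
        | awk '{ print $2 "\t" $1 }'
}
```

Running it on each of `SRR10751483_1.fastq.gz`, `SRR10751483_2.fastq.gz`, and so on shows which file holds the barcode/UMI read and which holds the cDNA read before the files are assigned in the config.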

Example for the study `GSE142465`:

```
cd /nfs/turbo/umms-scjp-pank/2_IIDP/scripts/rna-library-config
for i in `cat /nfs/turbo/umms-scjp-pank/2_IIDP/0_rawData/GSE142465/sampleList.txt`; 
do bash makeRNAconfig.bash -t "GSE142465" -s "$i" -1 "$i"_2.fastq.gz -2 "$i"_3.fastq.gz -o GSE142465_library-config.json;
done
```

The `sampleList.txt` file is a plain text file that lists, one per line, all SRR runs that we need to process.

```
cat /nfs/turbo/umms-scjp-pank/2_IIDP/0_rawData/GSE142465/sampleList.txt
SRR10751483
SRR10751484
SRR10751485
SRR10751486
SRR10751487
SRR10751488
SRR10751489
SRR10751490
SRR10751491
SRR10751492
SRR10751493
SRR10751494
SRR10751495
SRR10751496
SRR10751497
SRR10751498
SRR10751499
SRR10751500
SRR10751501
```

After this, we need to manually edit `GSE142465_library-config.json` to add details such as `"libraries": {` and the enclosing `{` and `}` at the beginning and end of the file, and to double-check library names. In hindsight, there should be quicker and smarter ways to create this file, which we will work on for future releases.

### Step 2: submit SLURM job to run Nextflow pipeline

Example commands:

```
cd /nfs/turbo/umms-scjp-pank/2_IIDP/scripts/GSE142465_snRNAseq-NextFlow # Nextflow creates work/ in the directory it is launched from
sbatch --job-name=GSE142465 --mem=500M --time=72:00:00 --account=scjp99 --mail-user=vthihong@umich.edu --mail-type=END,FAIL --signal=B:TERM@60 --wrap="exec ~/tools/nextflow run -resume -params-file /nfs/turbo/umms-scjp-pank/2_IIDP/scripts/rna-library-config/GSE142465_library-config.json --barcode-whitelist /nfs/turbo/umms-scjp-pank/2_IIDP/scripts/GSE142465_snRNAseq-NextFlow/737K-august-2016.txt --chemistry V2 --results /nfs/turbo/umms-scjp-pank/2_IIDP/results/GSE142465 /nfs/turbo/umms-scjp-pank/2_IIDP/scripts/GSE142465_snRNAseq-NextFlow/main.nf"
```
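
With many studies, the long `--wrap` string is easy to mistype. As a sketch, the command can be assembled from a few arguments. The helper names are hypothetical; the chemistry-to-whitelist mapping simply mirrors the two examples in this document (`737K-august-2016.txt` for V2, `3M-february-2018.txt` for V3) and assumes the whitelist file sits in the run directory next to `main.nf`.

```shell
# Map a chemistry label to the barcode whitelist file used in the examples above.
whitelist_for_chemistry() {
    case "$1" in
        V2) echo "737K-august-2016.txt" ;;
        V3) echo "3M-february-2018.txt" ;;
        *)  echo "unknown chemistry: $1" >&2; return 1 ;;
    esac
}

# Print the Nextflow command for one run directory (nothing is submitted).
# Arguments: run directory, config file, chemistry (V2 or V3), results directory.
nextflow_cmd() {
    local run_dir="$1" config="$2" chemistry="$3" results="$4" whitelist
    whitelist=$(whitelist_for_chemistry "$chemistry") || return 1
    echo "exec ~/tools/nextflow run -resume -params-file $config" \
         "--barcode-whitelist $run_dir/$whitelist --chemistry $chemistry" \
         "--results $results $run_dir/main.nf"
}
```

After `cd`-ing into the run directory, the output can be passed to `sbatch` as `--wrap="$(nextflow_cmd "$PWD" /path/to/config.json V2 /path/to/results)"`, keeping the remaining `sbatch` options as in the example above.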

## Explanation of result files obtained from Nextflow pipeline `snRNAseq-NextFlow`

1. `cellbender/*`: CellBender results
File names are in the following format:
	`{library}-{genome}.cellbender_cell_barcodes.csv` <br>
	`{library}-{genome}.cellbender_FPR_*_filtered.h5` <br>
	`{library}-{genome}.cellbender_FPR_*.h5` <br>
	`{library}-{genome}.cellbender_FPR_*_metrics.csv` <br>
	`{library}-{genome}.cellbender_FPR_*_report.html` <br>
	`{library}-{genome}.cellbender.log` <br>
	`{library}-{genome}.log` <br>
	`{library}-{genome}.cellbender.pdf` <br>
	`{library}-{genome}.cellbender_posterior.h5` <br>
A detailed description of each file is available at https://cellbender.readthedocs.io/en/latest/usage/index.html

2. `multiqc/fastq/*`: MultiQC summaries of FastQC results
`multiqc/fastq/multiqc_data`: data reported using MultiQC <br>
`multiqc/fastq/multiqc_report.html`: html report using MultiQC <br>

Reports from this step are information-rich. We not only inspected all of these reports manually, but also attempted to quantify the per-tile sequence quality reported by FastQC in `2_fastQC_inspection.R.ipynb`.

3. `multiqc/star/*`: MultiQC summaries of STAR logs
Structure is similar to `multiqc/fastq/*`

4. `prune/*`: filtered BAM files (duplicates NOT removed)
Files are named `{library}-{genome}.before-dedup.bam`; `*bai`: index files

5. `qc/*`: Per-barcode QC metrics and QC metric plots
`{library}-{genome}.metrics.png`: metric plot <br>
`{library}-{genome}.qc.txt`: text file with all metrics <br>
`{library}-{genome}.suggested-thresholds.tsv`: QC thresholds suggested by the multi-Otsu method. Using these thresholds is optional, but they are a good starting point for selecting high-quality cells.

6. `starsolo/*`: STARsolo output
`starsolo/{library}-{genome}/{library}-{genome}.Solo.out/*`: Count matrices derived using a variety of counting methods (see more detail at https://github.com/alexdobin/STAR/blob/master/docs/STARsolo.md) <br>
`{library}-{genome}.Aligned.sortedByCoord.out.bam`: BAM output sorted by coordinates from STARsolo <br>
`{library}-{genome}*.out*`: reports and logs written while STARsolo runs

7. `emptyDrops/results/*`: metrics obtained using EmptyDrops
`*knee.txt`: locations of knee point, inflection point, and end of cliff point <br>
`*pass.txt`: barcodes detected using EmptyDrops at FDR < 0.005 <br>

Under the directory `pctMTusingBelowEndCliff_pctMtless30_FDR0.005/`: <br>
`*metrics.csv`: per-barcode metrics file. Columns `filter*` and `pass_all_filters` indicate whether a barcode satisfies the QC thresholds <br>
`*_passQC_barcodes.csv`: list of barcodes that satisfy the QC thresholds <br>

More details on how we obtain the barcodes that satisfy the QC thresholds are in `3_barcode_qc.ipynb`.
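
For a quick per-library summary without opening the notebook, the `pass_all_filters` column of a `*metrics.csv` file can be tabulated from the command line. This is a sketch under stated assumptions: the file is comma-separated with a header row, the header names are unquoted, and no field contains a comma. The function name `tabulate_pass_all_filters` is hypothetical, and how pass/fail values are encoded in the column is not assumed; the function simply counts whatever values it finds.

```shell
# Count how many barcodes carry each value of a column (default: pass_all_filters)
# in a per-barcode metrics CSV. Prints "value<TAB>count", sorted by value.
tabulate_pass_all_filters() {
    local metrics_csv="$1" column="${2:-pass_all_filters}"
    awk -F',' -v col="$column" '
        NR == 1 {
            # Locate the requested column by its header name.
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
            next
        }
        { counts[$idx]++ }
        END { for (v in counts) print v "\t" counts[v] }
    ' "$metrics_csv" | sort
}
```

A loop such as `for f in pctMTusingBelowEndCliff_pctMtless30_FDR0.005/*metrics.csv; do echo "$f"; tabulate_pass_all_filters "$f"; done` would then give one small table per library.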