# LongTREC Summer School - Day 4
## Single-cell pipeline overview

In this hands-on session, we will go over PacBio's commercial analysis pipeline for single-cell data, check the commands it uses and explore some of the intermediate files it generates. We will also go through the report and discuss some of the values and plots that can help us identify potential problems.

### Data overview
For this exercise, we are going to use Peripheral Blood Mononuclear Cell (PBMC) data (a subset of blood cells!) sequenced using PacBio Kinnex chemistry on a Revio machine. The 10x 3' kit was used for barcoding in this case.

<img src="./Cellular-Compsition-of-Whole-Blood.gif" alt="Overview of PBMCs" />

*ImageRef: https://bioscience.lonza.com/medias/Cellular-Compsition-of-Whole-Blood?context=bWFzdGVyfHBpY3R1cmVQYXJrTWVkaWF8MzAwOTd8aW1hZ2UvZ2lmfGFHUXdMMmcxWWk4NU1qTTBOakU1TXpBeE9URTRMME5sYkd4MWJHRnlJRU52YlhCemFYUnBiMjRnYjJZZ1YyaHZiR1VnUW14dmIyUXwwZDczODY3MjM0Mjg5NWZkNjgzYjU5NDg2NzM0Y2U3MjY2NTM4ZjgwN2JhZmE0Mzk4M2YxOWE2M2JkNTQ2ZWY1*

The data and report we are going to use in this exercise are publicly available from PacBio's cloud website and can be found here: https://downloads.pacbcloud.com/public/dataset/Kinnex-single-cell-RNA/DATA-Revio-Kinnex-PBMC-10x3p/.

### The Iso-Seq pipeline
As we have learnt before, the Iso-Seq pipeline (https://isoseq.how) contains the PacBio tools to analyse bulk and single-cell RNA-seq data. For this exercise, we will focus on the single-cell analysis pipeline. All the commands, steps and terminology we are going to discuss here can be found at the link above!

The files we are going to use are BAM files, but they do not contain mapping information! These files are actually uBAM (unmapped BAM) files, and their main advantage over FASTQ files is that they can store much more metadata. If you want to find more information about them, you can check this website: https://gatk.broadinstitute.org/hc/en-us/articles/360035532132-uBAM-Unmapped-BAM-Format
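To make the idea of an unmapped record concrete, here is a minimal Python sketch that parses one SAM-formatted line. The record itself is a toy example with made-up values (the read name, sequence and tag values are assumptions for illustration); only the field layout and the meaning of flag 4 come from the SAM format.

```python
# Toy SAM record (tab-separated). All values are made up for illustration.
record = "movie/1/ccs\t4\t*\t0\t255\t*\t*\t0\t0\tACGT\t!!!!\trq:f:0.999\tnp:i:12"

UNMAPPED_FLAG = 0x4  # SAM flag bit meaning "segment unmapped"

def parse_sam_line(line):
    """Split a SAM line into a few mandatory fields plus its optional tags."""
    fields = line.rstrip("\n").split("\t")
    tags = {}
    for tag in fields[11:]:
        # Optional fields have the form TAG:TYPE:VALUE
        key, _type, value = tag.split(":", 2)
        tags[key] = value
    return {
        "name": fields[0],
        "flag": int(fields[1]),
        "ref": fields[2],
        "pos": int(fields[3]),
        "seq": fields[9],
        "tags": tags,
    }

parsed = parse_sam_line(record)
is_unmapped = bool(parsed["flag"] & UNMAPPED_FLAG)
print(parsed["name"], "unmapped:", is_unmapped, "tags:", parsed["tags"])
```

The reference name `*` and position `0` go together with flag 4: there is no alignment, but the optional tags can still carry per-read metadata that a FASTQ record has no dedicated place for.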

### 1) Obtaining HiFi reads
The first file we get from the sequencer contains the subreads. These are reads that contain multiple passes of the same molecule, which can be collapsed into a CCS (circular consensus sequence) read in order to correct sequencing errors introduced by the polymerase.

To get the HiFi reads (CCS reads with high accuracy), we use the CCS tool (https://ccs.how/).

<img src="./CCS_logo.png" alt="CCS logo" />

This tool combines the subreads into a CCS and keeps some additional information such as base quality values. To run it, we can use the following command:

In [None]:
### DO NOT RUN
ccs movie.subreads.bam m84014_240128_083549_s3.hifi_reads.bcM0003.bam

After this, we will have a file containing the HiFi reads that we are going to use in the next steps. Before we continue, let's check the contents of this file and some important tags. For all these examples, we will be using a subset of the original files.

In [1]:
%%bash
samtools view ../../../data/Single_cell/sub_hifi_reads.bam | head -n 1

m84014_240128_083549_s3/227086241/ccs	4	*	0	255	*	*	0	0	AAGCTTACTTGTGAAGATCTACACGACGCTCTTCCGATCTTGCCGAGTCCACGTAAGGACGTTCTAAGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGGTTTAAAAATTCCTGTTGGGGACAGGGAATCCCTGAAGGGAATACATTAAAAAATATACAGTGGAAATGGGGAAGGAGAACGAGAGCTGTTGTCCAAGAGCTTCTACGTAAAAATAAAAATTAAAAAAAAAAAGGGAGAAAGAGAAAAATAAAGAGTCTCTTAAATGGTTCTGAGGGCTCTACCAACGAACTGAAAGGAGGATGAGACGATGGAGACAGACAGTTTTTATTGCTGGCCTGCCCCGAGTGGCTTAGCTGGCCTCCATCCGGAGCTTCTTCCTGCTTTGGGGCTCTTCAGGCTCAGACTCCGAGCTGGAGTCCGAGCCCCCATCGAAGGCAGAGATGGTGCTGCCCTTGGCGTTGGTGTAGAGGCTGTTGTCTGAGGAGGGGTAGTTGGTCTGCAGTTGGGCACTTGACCTCGCCTTCTCCAGTGCACGGACTAAAAGGCAACCAAGGGAGTGTGTTACTGCCTTCTGGAGACTTGGGGAGTAACCGAGTCTCAGACTCAGGGTCCAGCCTGTTCTTCCTAAGGGCTTGCTTGTTGGCCCCCGAGGCTCAAGATGCTCAGAAGTAGCTCCCTTCGGGTGGCTGTTCACTCACACACTTCATTTCCCTCCCGCCTTTCCCAGCTGGACCTGGGAGCCAGCAGACACTCTGCAATCCCACCCAACTCTGAAGAGAAGGCTTGTGATGTCACCGTCCTGCCGTGGAAGCCCCGTGGAGAGACTGGAGCGTGAGCTCCCTGTGGGATTCAGCAGTGCAATAACAGAGGAGAAGCTGGCCCAGGAGCATGAGGCTCCACAAGGTGTGAGGCCACACAACAGCAGGAAAGGATGACATAAG

We can also check the HiFi report and some interesting plots of the sequencing results.

<img src="HiFi_stats.png" alt="HiFi reads " />

In this table there are three main stats to look at. First, we got 8,889,073 HiFi reads from this run; these are the reads we are going to use in the next steps. The other important values are the median HiFi read length (13,912 bp) and the median read quality (Q30, i.e. ~99.9% accuracy). Normally, we should expect a high percentage of reads with quality values above Q20. If most of your data falls below this value, it is worth checking whether the sequencing or the library prep went well.
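The relation between a Q value and accuracy is the standard Phred formula, accuracy = 1 - 10^(-Q/10). A short Python sketch (the function name is ours, not part of any PacBio tool) shows where the Q20 and Q30 figures come from:

```python
def phred_to_accuracy(q):
    """Convert a Phred quality value Q into the expected accuracy."""
    return 1.0 - 10 ** (-q / 10)

for q in (20, 30, 40):
    print(f"Q{q}: {phred_to_accuracy(q):.4%}")
```

Q20 corresponds to one expected error in 100 bases and Q30 to one in 1,000, which is why Q20 is a common lower bound for HiFi reads.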

We can also check the read length distribution to identify any potential issue.

<img src="HiFi_length_dist.png" alt="HiFi reads length distribution" />

### 2) Segmenting HiFi reads
As we learnt before, the Kinnex protocol improves sequencing throughput by concatenating cDNA molecules into arrays. In the case of single-cell RNA-seq, each HiFi read can contain up to 16 molecules. Therefore, after selecting the HiFi reads, the next step is to segment them into individual molecules.

To do this, we use Skera (https://skera.how/).

<img src="Skera_logo.png" alt="Skera logo" />

Skera will split the HiFi reads into S-reads (segmented reads). To do this, a list of the adapters used for the concatenation must be provided. The adapter file will look like this:

<img src="Adapter_example.png" alt="Example of adapters file for concatenation" />

Once you have this file, you can run Skera using the following command:

In [None]:
### DO NOT RUN
skera m84014_240128_083549_s3.hifi_reads.bcM0003.bam <adapters>.fasta segmented.bam

Let's have a look at the resulting file and some important tags in it.

In [5]:
%%bash
samtools view ../../../data/Single_cell/sub_segmented_reads.bam | head -n 2

m84014_240128_083549_s3/227086241/ccs/18_2165	4	*	0	255	*	*	0	0	CTACACGACGCTCTTCCGATCTTGCCGAGTCCACGTAAGGACGTTCTAAGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGGTTTAAAAATTCCTGTTGGGGACAGGGAATCCCTGAAGGGAATACATTAAAAAATATACAGTGGAAATGGGGAAGGAGAACGAGAGCTGTTGTCCAAGAGCTTCTACGTAAAAATAAAAATTAAAAAAAAAAAGGGAGAAAGAGAAAAATAAAGAGTCTCTTAAATGGTTCTGAGGGCTCTACCAACGAACTGAAAGGAGGATGAGACGATGGAGACAGACAGTTTTTATTGCTGGCCTGCCCCGAGTGGCTTAGCTGGCCTCCATCCGGAGCTTCTTCCTGCTTTGGGGCTCTTCAGGCTCAGACTCCGAGCTGGAGTCCGAGCCCCCATCGAAGGCAGAGATGGTGCTGCCCTTGGCGTTGGTGTAGAGGCTGTTGTCTGAGGAGGGGTAGTTGGTCTGCAGTTGGGCACTTGACCTCGCCTTCTCCAGTGCACGGACTAAAAGGCAACCAAGGGAGTGTGTTACTGCCTTCTGGAGACTTGGGGAGTAACCGAGTCTCAGACTCAGGGTCCAGCCTGTTCTTCCTAAGGGCTTGCTTGTTGGCCCCCGAGGCTCAAGATGCTCAGAAGTAGCTCCCTTCGGGTGGCTGTTCACTCACACACTTCATTTCCCTCCCGCCTTTCCCAGCTGGACCTGGGAGCCAGCAGACACTCTGCAATCCCACCCAACTCTGAAGAGAAGGCTTGTGATGTCACCGTCCTGCCGTGGAAGCCCCGTGGAGAGACTGGAGCGTGAGCTCCCTGTGGGATTCAGCAGTGCAATAACAGAGGAGAAGCTGGCCCAGGAGCATGAGGCTCCACAAGGTGTGAGGCCACACAACAGCAGGAAAGGATGACATAAGACGTATCAGC
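
Note that the S-read name is the HiFi read name plus a suffix, here `18_2165`. The sketch below reads that suffix as start and end coordinates within the parent HiFi read. This interpretation is an assumption, although it is consistent with the example: the S-read sequence starts 18 bases into the HiFi read we printed earlier. The function name is hypothetical.

```python
def parse_sread_name(name):
    """Split an S-read name into movie, ZMW and (assumed) start/end coordinates."""
    movie, zmw, kind, span = name.split("/")
    start, end = (int(x) for x in span.split("_"))
    return {
        "movie": movie,
        "zmw": int(zmw),
        "kind": kind,
        "start": start,
        "end": end,
        "length": end - start,  # assumes start/end delimit the segment
    }

info = parse_sread_name("m84014_240128_083549_s3/227086241/ccs/18_2165")
print(info)
```

Grouping S-reads by movie and ZMW in this way recovers which segments came from the same array.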

Now, let's check the S-reads report and some important plots.

<img src="Sreads_stats.png" alt="S-reads metrics" />

Here, we can see that the first value corresponds to the number of HiFi reads we selected in the previous step. After segmentation, we got \~141M S-reads with a mean length of 860 bp. As we can see below, most of the arrays (\~96%) contain 16 molecules.
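As a quick sanity check, we can compare the number of S-reads with the upper bound we would get if every HiFi read contained a full 16-molecule array. The sketch below only uses the figures quoted in this section; the S-read count is the rounded \~141M value, so the ratio is approximate.

```python
hifi_reads = 8_889_073     # HiFi reads reported in the previous step
array_size = 16            # maximum number of molecules per array
observed_sreads = 141e6    # approximate S-read count from the report (~141M)

max_sreads = hifi_reads * array_size
yield_fraction = observed_sreads / max_sreads

print(f"Upper bound: {max_sreads:,} S-reads")
print(f"Observed / upper bound: {yield_fraction:.1%}")
```

A ratio close to 1 indicates that nearly all arrays were fully formed and segmented, which agrees with most arrays containing 16 molecules.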

There are a few plots where we can check the S-read length distribution and the concatenation results:

<img src="Sreads_length_dist.png" alt="S-reads metrics" />

<img src="Concatenation_perc.png" alt="S-reads metrics" />

<img src="Concatenation_dist.png" alt="S-reads metrics" />

### 3) Processing of S-reads
The S-reads we obtained after segmentation still contain the primers, cell barcodes, UMIs, etc., so we need to process them further. Following the Iso-Seq pipeline, the next step is to remove the primers and arrange the reads in the right orientation (5'->3'). To do this, we use lima (https://lima.how).

<img src="Lima_logo.png" alt="Lima logo" />

To run lima, we will need a file containing the sequences of the primers used. Then, we can run it with the following command:

In [None]:
### DO NOT RUN
lima segmented.bam primers.fasta scisoseq.5p--3p.bam --isoseq

After this, we will need to run a few modules of isoseq (the tool this time, not the pipeline). The next step is to extract the cell barcode and the UMI and add them as tags in the output BAM file. To perform this step, we need to indicate the barcode design (the order of the elements and how many bases each one has). You can find more information about this at https://isoseq.how/umi/umi-barcode-design.html.
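To illustrate what a design string encodes, here is a simplified Python sketch for a design such as T-12U-16B (transcript, then a 12-base UMI, then a 16-base cell barcode). It is not how isoseq tag is implemented: it assumes the read is already oriented and free of primers, and the function names and the toy read are made up for this example.

```python
import re

def parse_design(design):
    """Turn a design string such as 'T-12U-16B' into (kind, length) pairs.

    'T' is the transcript (no fixed length), 'U' the UMI, 'B' the cell barcode.
    """
    parts = []
    for token in design.split("-"):
        match = re.fullmatch(r"(\d*)([TUB])", token)
        if match is None:
            raise ValueError(f"Unrecognised design token: {token!r}")
        length = int(match.group(1)) if match.group(1) else None
        parts.append((match.group(2), length))
    return parts

def split_read(seq, design):
    """Toy extraction of UMI, barcode and transcript from an oriented read."""
    parts = parse_design(design)
    t_index = [kind for kind, _ in parts].index("T")
    out = {}
    # Fixed-length segments before the transcript are read from the left...
    for kind, length in parts[:t_index]:
        out[kind], seq = seq[:length], seq[length:]
    # ...and those after it are peeled off the right-hand end.
    for kind, length in reversed(parts[t_index + 1:]):
        out[kind], seq = seq[-length:], seq[:-length]
    out["T"] = seq
    return out

toy_read = "ACGTACGT" + "A" * 12 + "C" * 16  # transcript + UMI + barcode
print(split_read(toy_read, "T-12U-16B"))
```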

Once you have this information, we can use this command (you need to adapt it depending on your case):

In [None]:
### DO NOT RUN
isoseq tag scisoseq.5p--3p.bam scisoseq.5p--3p.tagged.bam --design T-12U-16B

Next, we need to refine our reads. This step trims the poly(A) tail: with --require-polya, as used in the command below, only reads with a detectable poly(A) tail are kept and the tail is removed. Depending on your experiment, you may want to keep it for downstream processing; if so, you will need to adjust the options when running the tool. Additionally, concatemers are removed at this step, although the Kinnex arrays themselves were already split when we ran Skera.

In [None]:
### DO NOT RUN
isoseq refine scisoseq.5p--3p.tagged.bam primers.fasta scisoseq.5p--3p.tagged.refined.bam --require-polya

Now we are going to correct the cell barcodes we extracted previously. The cell barcode sequences are known in advance (the whitelist), so we can use them to correct sequencing errors. The Iso-Seq pipeline uses the Hamming distance to correct them.

- If the query cell barcode is the same as a cell barcode from the whitelist it remains intact.

- If a query cell barcode differs from a cell barcode from the whitelist by fewer than 2 nucleotides, it will be corrected to match the cell barcode from the whitelist (note: this does not take into account insertions and deletions).

- In case there are multiple options, the query cell barcode will be corrected to match the whitelist barcode that requires the fewest edits.

- If the distance is \>2 or no candidates are found, the reads are marked as failing.
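
The rules above can be sketched in a few lines of Python. This is a minimal illustration, not the isoseq correct implementation: the `max_distance` parameter and its default of 1 are assumptions, as is the choice to mark a read as failing when two whitelist barcodes are equally close. The whitelist is a toy set built from barcodes shown later in this notebook.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))

def correct_barcode(query, whitelist, max_distance=1):
    """Return the corrected barcode, or None if the read is marked as failing."""
    if query in whitelist:
        return query  # exact match: barcode stays intact
    best, best_dist, tie = None, max_distance + 1, False
    for candidate in whitelist:
        if len(candidate) != len(query):
            continue  # substitutions only, no insertions or deletions
        d = hamming(query, candidate)
        if d < best_dist:
            best, best_dist, tie = candidate, d, False
        elif d == best_dist and d <= max_distance:
            tie = True  # assumption: ambiguous matches are not corrected
    if best is None or tie:
        return None
    return best

whitelist = {"AGCAAGCGACATCTCC", "ATCTCTAACTCATGTG", "CAATCCCACGCATGGA"}
print(correct_barcode("AGCAAGCGACATCTCC", whitelist))  # exact match
print(correct_barcode("AGCAAGCGACATCTCA", whitelist))  # one substitution
print(correct_barcode("TTTTTTTTTTTTTTTT", whitelist))  # no candidate
```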

An additional step of this module is to call "real cells" using a barcode rank plot. Here, cell barcodes are ranked by their UMI count in decreasing order. Two approaches can be used to set the cutoff: by default, the Iso-Seq pipeline uses the knee-finding method, but the percentile method, which applies a hard cutoff, can also be used.
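The ranking behind a barcode rank plot is simple to reproduce. The Python sketch below sorts barcodes by UMI count and applies a user-chosen hard cutoff. It implements neither the knee-finding nor the percentile method of isoseq correct; the barcode names, counts and the `min_umis` threshold are toy values chosen for illustration.

```python
def barcode_ranks(umi_counts):
    """Sort barcodes by decreasing UMI count, as on a barcode rank plot."""
    ordered = sorted(umi_counts.items(), key=lambda item: item[1], reverse=True)
    return [(rank, bc, n) for rank, (bc, n) in enumerate(ordered, start=1)]

def call_cells(umi_counts, min_umis):
    """Keep barcodes whose UMI count reaches a hard cutoff (hypothetical value)."""
    return {bc for bc, n in umi_counts.items() if n >= min_umis}

toy_counts = {"BC1": 5200, "BC2": 4800, "BC3": 35, "BC4": 12, "BC5": 4100}

for rank, bc, n in barcode_ranks(toy_counts):
    print(rank, bc, n)
print("Called cells:", sorted(call_cells(toy_counts, min_umis=1000)))
```

In real data the drop between cell-containing barcodes and background is the "knee" of the curve, and the cutoff is placed around it rather than chosen by hand.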

In [None]:
### DO NOT RUN
isoseq correct --barcodes barcodes.txt[.gz] scisoseq.5p--3p.tagged.refined.bam scisoseq.5p--3p.tagged.refined.corrected.bam

<img src="Knee_plot.png" alt="Knee plot" />

After running isoseq correct, we can run isoseq bcstats to get cell barcodes stats.

In [None]:
### DO NOT RUN
isoseq bcstats --json sample.bcstats.json -o sample.bcstats.tsv scisoseq.5p--3p.tagged.refined.corrected.bam

<img src="Cell_stats.png" alt="Cells statistics" />

Finally, we need to sort the corrected BAM file by cell barcode using samtools before running isoseq groupdedup. In the deduplication step, UMIs are used to remove PCR duplicates.
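The core idea of UMI deduplication can be sketched as follows: reads sharing the same cell barcode and UMI are assumed to come from the same original molecule, so only one is kept. This is a deliberately simplified illustration; the actual criteria used by isoseq groupdedup are not reproduced here, and keeping the longest read per group is an arbitrary choice for this sketch. The read tuples are toy data.

```python
from collections import defaultdict

def dedup_by_umi(reads):
    """Keep one representative read per (cell barcode, UMI) pair.

    `reads` is an iterable of (name, cell_barcode, umi, sequence) tuples.
    """
    groups = defaultdict(list)
    for read in reads:
        _name, cb, umi, _seq = read
        groups[(cb, umi)].append(read)
    # Arbitrary choice for this sketch: keep the longest read of each group.
    return [max(group, key=lambda r: len(r[3])) for group in groups.values()]

toy_reads = [
    ("read1", "CELL_A", "UMI_1", "ACGTACGT"),
    ("read2", "CELL_A", "UMI_1", "ACGTACGTAC"),  # PCR duplicate of read1
    ("read3", "CELL_A", "UMI_2", "TTTTGGGG"),
    ("read4", "CELL_B", "UMI_1", "ACGTACGT"),    # same UMI, different cell
]
kept = dedup_by_umi(toy_reads)
print([r[0] for r in kept])
```

Because the grouping key includes the cell barcode, it makes sense that the input is sorted by that tag before deduplication.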

In [None]:
### DO NOT RUN
samtools sort -t CB scisoseq.5p--3p.tagged.refined.corrected.bam -o scisoseq.5p--3p.tagged.refined.corrected.sorted.bam

In [None]:
### DO NOT RUN
isoseq groupdedup scisoseq.5p--3p.tagged.refined.corrected.sorted.bam scisoseq.5p--3p.tagged.refined.corrected.sorted.dedup.bam

We can now check the report:

<img src="dedup_reads.png" alt="S-reads stats after dedup" />

We can see that, from the \~141M reads we had initially, we ended up with \~81.6M reads (\~42% lost). These are the reads we are going to use in the next steps.

Let's check the resulting file!

In [7]:
%%bash
samtools view ../../../data/Single_cell/sub_dedup_reads.bam | head -n 2

molecule/0	4	*	0	255	*	*	0	0	TAGGGAAACCAGTGAGTCATAGGTTTGGTTTCTACATAATCCCATATTTCTTTTAGGTTTTTTTGGTTCCTTTTCATTCTTTCTTTCTAGTCTTGTCTGACTGTCTTATTTCAAAAAGCCAATCTTCAAGCTCTTAGACTCTTTCCTCCACTTGGTCTATTCTGCTTTTAATACTCGTGATTGCATTATAAAATTATTTCAGTGTGTTCTTCAGGTCTATCATGTCAGTTCTTGGAGGGTTTTTTTTGGTTCCTTTTCATTCTTTTTTTTTTCTATTCTTGTCTGACTGTCTTATTTTAAAAAGCCGGTCTTCAAGCTCTTAGATTCTTCCCTCCACTTGGTCTATTCTGCTATTAATACTTGTGATTGAATTATAGAATTATTTCAGTGTGTTCTTCCACTCTATCGTTACAGTTACATAGTTTTCTATGCCGGCTATTTTGTCTGCAGCTACTGTGTCATTTTATTGTGATTCTTATCTTCCTTGGATTGGGTTTCAATGTAGTCCTGCATCCCAATAATCTTTATTTGTATCTATATTTTGAATTCTATTTTTGTCATTTCAGTCATCTTAGCCCAGTTCAGAGCCCTTGCTAGAGAGGTGATATGTTCATTTGGAGGTAAGAAGGCACTCTGGCTTTTTTAGTTGTCAGGGTTATTGTGCTCATTCTTTCTCATCTTTGTGGGCTCATGTTTCTTTAATCTTCAAAGTTACTGACCATTGGGTGAGGTTTTTTTTTCTTTTATCCTATTTGATGACCTTGAGGATTCTATTGAGTTATAAGGTGCATTCAGCCAACTGGCTTCATTTCTGGAAGATTTTAGGAAGCCAGTGCTCAGCTCCCAATTCCTGGGTTGCATGCTGTAACTTTGAGGGAATTGCATTGAGCTCCATCCTTGTTTTCTGGCTCCTCGAAGTTGGAAATCTACTGCACTGGGTGTGGGAGGGGCATGAGGTGCTCCCAGACTGT

### 4) Aligning your reads and reconstructing your transcriptome
In the next step, we will map our reads to the reference genome. For this purpose, the Iso-Seq pipeline uses pbmm2 (PacBio's wrapper around minimap2).

In [None]:
### DO NOT RUN
pbmm2 align --preset ISOSEQ --sort scisoseq.5p--3p.tagged.refined.corrected.sorted.dedup.bam <ref.fa> mapped.bam

The last module from isoseq we are going to use is isoseq collapse. Once our reads are mapped to the reference genome, we can collapse them into transcript models stored in a GFF file that will be our preliminary transcriptome.

In [None]:
### DO NOT RUN
isoseq collapse mapped.bam collapsed.gff

Next we will run pigeon, PacBio's SQANTI-like tool, to classify the isoforms against the reference annotation.

<img src="Pigeon_logo.png" alt="Pigeon logo" />

The first step is to run pigeon prepare to sort the GFF. Then, we can run pigeon classify to obtain our classification.

In [None]:
### DO NOT RUN
pigeon prepare collapsed.gff

In [None]:
### DO NOT RUN
pigeon classify collapsed.sorted.gff <annotations.gtf> <reference.fa> --fl abundance.txt

<img src="prelim_stats.png" alt="Preliminary stats" />

<img src="prelim_plot.png" alt="Preliminary plot" />

However, this transcriptome needs to be filtered to remove unsupported isoforms. We can do this with pigeon filter, providing the classification file produced by pigeon classify.

In [None]:
### DO NOT RUN
pigeon filter classification.txt --isoforms collapsed.sorted.gff

<img src="filtered_stats.png" alt="Filtered stats" />

<img src="filtered_plot.png" alt="Filtered plot" />

### 5) Quantification
The last step is to generate files compatible with tertiary analysis tools (such as Seurat or Scanpy). Two folders with three files each will be generated, one for gene-level quantification and one for transcript-level quantification.

We need to use pigeon make-seurat to generate these files. It requires the deduplicated reads in FASTA format and a few files produced in the previous steps.

In [None]:
### DO NOT RUN
pigeon make-seurat --dedup scisoseq.5p--3p.tagged.refined.corrected.sorted.dedup.fasta --group collapse.group.txt -d <output_dir> classification.filtered_lite_classification.txt

<img src="count_folders.png" alt="Folders" />

<img src="count_files.png" alt="files" />

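There are multiple ways to load this data into these tools. The Iso-Seq pipeline writes it in the Matrix Market exchange (MEX) format. Briefly:

- First 3 lines are headers
    - First line is the MatrixMarket banner and second line is a comment
    - Third line gives the number of rows (genes or isoforms), the number of columns (cells) and the number of non-zero entries
- First column -> row index (1-based line number in genes.tsv)
- Second column -> column index (1-based line number in barcodes.tsv)
- Third column -> UMI count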
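
To make the layout concrete, here is a minimal pure-Python reader for this format, applied to the first lines of the matrix.mtx file used in this exercise. The function name is ours; it reads the size line (rows, columns and number of stored entries, following the Matrix Market coordinate format) and the triplets that follow.

```python
mex_text = """%%MatrixMarket matrix coordinate real general
%
871616 12852 13412354
1 1 1
822 1 1
820 1 1
885 1 1
898 1 1
962 1 2
1189 1 1
"""

def parse_mex(text):
    """Read the size line and the (row, column, value) triplets of a MEX file."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines[0].startswith("%%MatrixMarket"):
        raise ValueError("Missing MatrixMarket banner")
    # Drop the banner and any comment lines (they start with '%').
    body = [ln for ln in lines[1:] if not ln.startswith("%")]
    n_rows, n_cols, n_entries = (int(x) for x in body[0].split())
    triplets = []
    for ln in body[1:]:
        row, col, value = ln.split()
        triplets.append((int(row), int(col), float(value)))
    return (n_rows, n_cols, n_entries), triplets

shape, triplets = parse_mex(mex_text)
print("rows, columns, stored entries:", shape)
print("first entries:", triplets[:3])
```

For real files there is no need to parse by hand: scipy.io.mmread reads this format directly into a sparse matrix, and Seurat and Scanpy provide their own loaders.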

Let's finish the exercise by checking the contents of the files.

In [9]:
%%bash
head -n 10 ../../../data/Single_cell/matrix.mtx

%%MatrixMarket matrix coordinate real general
%
871616 12852 13412354
1 1 1
822 1 1
820 1 1
885 1 1
898 1 1
962 1 2
1189 1 1


In [10]:
%%bash
head -n 10 ../../../data/Single_cell/barcodes.tsv

AGCAAGCGACATCTCC-1
ATCTCTAACTCATGTG-1
CAATCCCACGCATGGA-1
CAGCGTCGAGTGTTAC-1
AACCTAGTGCGCTGAC-1
AAGAAACACGGGAACT-1
ACCCTGACTCTCCTCA-1
ACGTTGAACCCAAAGG-1
ACTGCGGACCGGATAG-1
ATATCCACTTAGGCAG-1


In [11]:
%%bash
head -n 10 ../../../data/Single_cell/genes.tsv

PB.10001.1:GPA33	GPA33
PB.10001.10:GPA33	GPA33
PB.10001.11:GPA33	GPA33
PB.10001.13:GPA33	GPA33
PB.10001.23:GPA33	GPA33
PB.10001.27:GPA33	GPA33
PB.10001.28:GPA33	GPA33
PB.10001.29:GPA33	GPA33
PB.10001.30:GPA33	GPA33
PB.10001.36:GPA33	GPA33
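
Putting the three files together, each matrix entry can be translated into an isoform, a gene and a cell. The sketch below assumes that the three files shown above belong to the same output folder and that the first column of genes.tsv has the form isoform:gene, as in the lines printed here; the function name is hypothetical and only the first two lines of each file are included.

```python
# First lines of genes.tsv and barcodes.tsv, as printed above.
features = ["PB.10001.1:GPA33\tGPA33", "PB.10001.10:GPA33\tGPA33"]
barcodes = ["AGCAAGCGACATCTCC-1", "ATCTCTAACTCATGTG-1"]

def label_entry(row, col, value, features, barcodes):
    """Translate a 1-based (row, column, value) triplet into readable labels."""
    feature_id, gene_name = features[row - 1].split("\t")
    isoform_id = feature_id.split(":")[0]
    return {
        "isoform": isoform_id,
        "gene": gene_name,
        "cell": barcodes[col - 1],
        "count": value,
    }

# First data line of matrix.mtx: "1 1 1"
entry = label_entry(1, 1, 1, features, barcodes)
print(entry)
```

The isoform identifiers in the first column suggest that this is the transcript-level matrix rather than the gene-level one.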
