<a href="https://colab.research.google.com/github/nunososorio/SingleCellGenomics2024/blob/main/2_Tuesday_April9th/cellranger2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://github.com/nunososorio/SingleCellGenomics2024/blob/main/logo.png?raw=true" alt="AnnData" style="width:600px; height:auto;"/>

# CellRanger Software: Preprocessing scRNA-seq Data and Quality Evaluation 🧬
### Learning objectives of this session:
1. Understand how CellRanger preprocesses raw scRNA-seq data and generates a count matrix;
2. Evaluate the quality of a scRNA-seq experiment using the output of CellRanger.

## The scRNA-seq workflow:

1. **Sample Preparation and Processing**: The first step in the scRNA-seq workflow involves preparing the sample. This includes isolating single cells from the tissue of interest, capturing individual cells in separate partitions (such as droplets or wells), and lysing the cells to release their RNA. The RNA is then reverse transcribed to create complementary DNA (cDNA), which serves as the template for subsequent amplification and library preparation. The end result of this step is a library of cDNA fragments, each tagged with a unique barcode that identifies the cell of origin.

2. **Sequencing**: The prepared library is then sequenced using next-generation sequencing technology. This process reads the cDNA fragments and generates a massive amount of raw sequencing data in the form of FASTQ files. Most single-cell RNA sequencing methods employ a strategy of pooled sequencing. This approach enhances data throughput by amplifying and sequencing numerous cells simultaneously within the same ‘pool’. Each read in a FASTQ file corresponds to a cDNA fragment and includes the sequence of the fragment along with the cell barcode and a unique molecular identifier (UMI) that tags each original RNA molecule.

3. **Conversion from FASTQ to Count Matrix**: The raw sequencing data is then processed to generate a count matrix, which is a table listing the number of times each gene (represented by rows) was detected in each cell (represented by columns). This involves aligning the sequencing reads to a reference genome, correcting for sequencing errors, and counting the number of UMIs associated with each gene in each cell. The count matrix serves as the starting point for most downstream analyses of scRNA-seq data, such as identifying cell types and states, detecting differentially expressed genes, and inferring developmental trajectories.
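The counting idea in step 3 can be sketched in a few lines of Python. This is only an illustration, not what CellRanger actually runs: it assumes the 10x Genomics Chromium v3 Read 1 layout (16-base cell barcode followed by a 12-base UMI), it skips alignment and barcode/UMI error correction, and the gene labels (`GENE_A`, `GENE_B`) and the third toy read are made up for the example.

```python
from collections import defaultdict

# Assumed Read 1 layout (10x Chromium v3): 16-base cell barcode + 12-base UMI.
BARCODE_LEN = 16
UMI_LEN = 12


def split_read1(read1):
    """Split a Read 1 sequence into (cell_barcode, umi)."""
    return read1[:BARCODE_LEN], read1[BARCODE_LEN:BARCODE_LEN + UMI_LEN]


def count_umis(assigned_reads):
    """Count distinct UMIs per (cell barcode, gene).

    assigned_reads: iterable of (read1_sequence, gene) pairs, where `gene`
    stands for the gene the matching cDNA read aligned to (hypothetical here).
    """
    umis = defaultdict(set)
    for read1, gene in assigned_reads:
        barcode, umi = split_read1(read1)
        umis[(barcode, gene)].add(umi)  # the set collapses duplicate UMIs
    return {key: len(value) for key, value in umis.items()}


# Toy input: two identical reads (a PCR duplicate), one read from the same
# cell with a different UMI, and one read from another cell.
toy_reads = [
    ("TGGGCTGGTCGCGGTTCATGGACATTCG", "GENE_A"),
    ("TGGGCTGGTCGCGGTTCATGGACATTCG", "GENE_A"),
    ("TGGGCTGGTCGCGGTTAAAAAAAAAAAA", "GENE_A"),
    ("CATGCTCGTCTCTCACACTTTTTGGCAA", "GENE_B"),
]
counts = count_umis(toy_reads)
for (barcode, gene), n in counts.items():
    print(barcode, gene, n)
```

Counting distinct UMIs rather than reads is what keeps a molecule that was amplified many times from being counted more than once.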

## FASTQ files

FASTQ files are the raw data output from sequencing. They contain the nucleotide sequences (reads) and corresponding quality scores.

In [None]:
# Download an example of a FASTQ file
!wget https://zenodo.org/record/3457880/files/subset_pbmc_1k_v3_S1_L001_R1_001.fastq.gz

In [5]:
# Unzip the file
!gunzip subset_pbmc_1k_v3_S1_L001_R1_001.fastq.gz

# Print the first 50 lines
!head -n 50 subset_pbmc_1k_v3_S1_L001_R1_001.fastq


@A00228:279:HFWFVDMXX:1:1101:4110:1063 1:N:0:ACATTACT
TGGGCTGGTCGCGGTTCATGGACATTCG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00228:279:HFWFVDMXX:1:1101:7509:1063 1:N:0:ACATTACT
CATGCTCGTCTCTCACACTTTTTGGCAA
+
FFFFFFFFFFFFFFFFFFFF:FFFFFFF
@A00228:279:HFWFVDMXX:1:1101:15845:1063 1:N:0:ACATTACT
CACTGGGAGTTACTCGTTTTCTGTGGTT
+
F:FFFFFFFFFFFFFFFFFFFFFFFFFF
@A00228:279:HFWFVDMXX:1:1101:16477:1063 1:N:0:ACATTACT
AAGCGTTTCGCTATTTCTATGTTCGCTT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00228:279:HFWFVDMXX:1:1101:3513:1094 1:N:0:ACATTACT
CTCCTTTTCTACACTTTTCTTGTCTATT
+
FFF,FFFFFFF,F:FFFFFFFFFFFFFF
@A00228:279:HFWFVDMXX:1:1101:23213:1172 1:N:0:ACATTACT
AATGACCGTGTCATCACTTATAGTAAGG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00228:279:HFWFVDMXX:1:1101:7943:1251 1:N:0:ACATTACT
TTTCCTCTCTCTTGCGCAATACGTGCGG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00228:279:HFWFVDMXX:1:1101:24849:1251 1:N:0:ACATTACT
AGTAGCTAGGAGTACCCGGAAACTTGCG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00228:279:HFWFVDMXX:1:1101:18005:1297 1:N:0:ACATTACT
CCCTAACCAGGTTCGCCCATTTTGTGCC


The output above shows the beginning of the FASTQ file. Each read is stored as a record of four lines.
Here’s a breakdown of the components:

*   Sequence Identifier: Each record starts with an identifier line beginning with ‘@’. For example, @A00228:279:HFWFVDMXX:1:1101:4110:1063 1:N:0:ACATTACT is an identifier. This line contains information about the sequencing run and the specific read.
*   Sequence: The line after the identifier is the actual sequence of bases (A, T, C, G). For example, TGGGCTGGTCGCGGTTCATGGACATTCG is a sequence.
*   Separator: The ‘+’ character on the third line separates the sequence from its quality scores.
*   Quality Scores: The line following the ‘+’ character holds one quality character per base of the sequence. Each ASCII character encodes a Phred quality score (the character's ASCII code minus 33), which expresses the probability that the corresponding base call is incorrect. For example, FFFFFFFFFFFFFFFFFFFFFFFFFFFF are the quality scores for the sequence TGGGCTGGTCGCGGTTCATGGACATTCG. The character ‘F’ corresponds to a Phred quality score of 37, which indicates very high confidence in the accuracy of the base call. The characters ‘:’ and ‘,’ represent lower confidence: ‘:’ corresponds to a Phred quality score of 25, and ‘,’ corresponds to a Phred quality score of 11.
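The record structure and quality encoding described above can be checked with a short Python sketch. It assumes the standard Phred+33 encoding used by current Illumina instruments; the helper names (`parse_fastq`, `phred_scores`, `error_probability`) are made up for this example, and the record is copied from the output shown earlier.

```python
def phred_scores(quality_line, offset=33):
    """Decode an ASCII quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_line]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong: 10^(-q/10)."""
    return 10 ** (-q / 10)


def parse_fastq(lines):
    """Yield (identifier, sequence, quality) for each 4-line FASTQ record."""
    stripped = (line.rstrip("\n") for line in lines)
    # zip over the same iterator four times groups the lines in fours.
    for header, seq, _plus, qual in zip(*[stripped] * 4):
        yield header[1:], seq, qual


# One record taken from the output above.
record_lines = [
    "@A00228:279:HFWFVDMXX:1:1101:3513:1094 1:N:0:ACATTACT",
    "CTCCTTTTCTACACTTTTCTTGTCTATT",
    "+",
    "FFF,FFFFFFF,F:FFFFFFFFFFFFFF",
]

for identifier, sequence, quality in parse_fastq(record_lines):
    scores = phred_scores(quality)
    print(identifier)
    print("lowest Phred score:", min(scores))
    print("error probability at that base:", error_probability(min(scores)))
```

To run it on the downloaded file, the list could be replaced by an open file handle, for example `open("subset_pbmc_1k_v3_S1_L001_R1_001.fastq")`.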



In the convention used by Illumina sequencing platforms, the first sequence identifier breaks down as follows:

* A00228: This is the unique identifier of the sequencing instrument.
* 279: This is the run number, an identifier for the specific run on the sequencer.
* HFWFVDMXX: This is the unique identifier of the flow cell used in the sequencing run.
* 1: This represents the lane number on the flow cell.
* 1101: This is the tile number on the flow cell.
* 4110: This is the ‘x’ coordinate of the cluster on the tile.
* 1063: This is the ‘y’ coordinate of the cluster on the tile.
* 1: This indicates the member of a pair (1 or 2) in paired-end sequencing.
* N: This is the filter flag. ‘Y’ means the read was filtered out (it did not pass the quality filter); ‘N’ means it was not filtered, i.e., it passed.
* 0: This is the control number. A ‘0’ means that no control bits are set. Control bits can be used to flag specific types of reads, but in many applications, including most RNA-seq experiments, this field is not used and is simply set to ‘0’.
* ACATTACT: This is the index sequence (sample index), a short sequence added to every DNA fragment of a library during library preparation. It allows multiple samples to be mixed together and sequenced in the same run, a process known as multiplexing. After sequencing, the index sequence is used to identify which reads came from which sample. Note that this is not the cell barcode: all the reads above share the same index because they come from the same library. In 10x Genomics single-cell RNA sequencing with the v3 chemistry used in this example, the cell barcode is the first 16 bases of Read 1 and the UMI is the following 12 bases, which is why each Read 1 sequence above is 28 bases long. Each cell receives its own barcode, allowing the RNA from thousands of individual cells to be sequenced together while still keeping track of which reads came from which cells.
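The field-by-field breakdown above maps directly onto a small parser. The sketch below assumes the two-part Illumina header layout seen in this file (seven colon-separated fields, a space, then four more); the function name `parse_illumina_header` and the dictionary keys are made up for illustration.

```python
def parse_illumina_header(header):
    """Split an Illumina FASTQ identifier line into named fields."""
    name, description = header.lstrip("@").split(" ")
    instrument, run, flowcell, lane, tile, x, y = name.split(":")
    read, filter_flag, control, index = description.split(":")
    return {
        "instrument": instrument,
        "run": int(run),
        "flowcell": flowcell,
        "lane": int(lane),
        "tile": int(tile),
        "x": int(x),
        "y": int(y),
        "read": int(read),
        "filter_flag": filter_flag,  # raw flag from the header, 'Y' or 'N'
        "control": int(control),
        "index": index,
    }


fields = parse_illumina_header(
    "@A00228:279:HFWFVDMXX:1:1101:4110:1063 1:N:0:ACATTACT"
)
for key, value in fields.items():
    print(f"{key}: {value}")
```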


 Questions:
 - Do the reads shown in the output above come from a single cell or from separate cells?
 - Which of the reads contain bases with lower quality scores?

## Count Matrix
A count matrix is a table listing the number of times each gene (represented by rows) was detected in each cell (represented by columns).
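A tiny hand-made example may help make the layout concrete. The gene names, cell names, and counts below are invented for illustration and do not come from the course data. The sketch also computes two simple per-cell summaries from the matrix.

```python
# Toy count matrix: one row per gene, one column per cell (values are made up).
genes = ["GENE_A", "GENE_B", "GENE_C"]
cells = ["CELL_1", "CELL_2"]
matrix = [
    [3, 0],  # GENE_A
    [1, 2],  # GENE_B
    [0, 5],  # GENE_C
]


def umis_per_cell(matrix):
    """Total counts in each column (cell)."""
    return [sum(column) for column in zip(*matrix)]


def genes_per_cell(matrix):
    """Number of genes with a non-zero count in each column (cell)."""
    return [sum(1 for value in column if value > 0) for column in zip(*matrix)]


# Print the matrix as a small table, then the per-cell summaries.
print("gene", *cells, sep="\t")
for gene, row in zip(genes, matrix):
    print(gene, *row, sep="\t")
print("total counts per cell:", dict(zip(cells, umis_per_cell(matrix))))
print("genes detected per cell:", dict(zip(cells, genes_per_cell(matrix))))
```

Real matrices have tens of thousands of genes and thousands of cells, and most entries are zero, which is why they are usually stored in a sparse format.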

## FASTQ to Count Matrix using CellRanger

10x Genomics created a proprietary processing pipeline, called CellRanger, to handle the data generated by its scRNA-seq platform. Alternatives that perform the same task include STARsolo and UniverSC. In this course, we will focus on CellRanger since it is widely used and supported.
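As a rough illustration of how a run is typically launched, the sketch below assembles a `cellranger count` command line in Python without executing it. The options `--id`, `--transcriptome`, `--fastqs` and `--sample` belong to the `cellranger count` interface. The run name, the two paths, and the helper `build_cellranger_count_cmd` are hypothetical placeholders; the sample name is taken from the prefix of the example FASTQ file. Some CellRanger releases require further options (for example `--create-bam`), so the documentation of the installed version should be checked before running anything.

```python
import shlex


def build_cellranger_count_cmd(run_id, transcriptome, fastq_dir, sample):
    """Assemble (but do not run) a `cellranger count` command as a list."""
    return [
        "cellranger", "count",
        f"--id={run_id}",                 # name of the output folder
        f"--transcriptome={transcriptome}",  # path to the reference package
        f"--fastqs={fastq_dir}",          # folder containing the FASTQ files
        f"--sample={sample}",             # FASTQ file-name prefix to process
    ]


# All values except the sample prefix are hypothetical placeholders.
cmd = build_cellranger_count_cmd(
    run_id="pbmc_1k_run",
    transcriptome="/path/to/reference",
    fastq_dir="/path/to/fastqs",
    sample="subset_pbmc_1k_v3",
)
print(shlex.join(cmd))
```

Building the command as a list keeps each argument separate, so it could later be passed to `subprocess.run(cmd)` on a machine where CellRanger is installed.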

## Interpreting CellRanger's Output