<a href="https://colab.research.google.com/github/nunososorio/SingleCellGenomics2024/blob/main/2_Tuesday_April9th/cellranger2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://github.com/nunososorio/SingleCellGenomics2024/blob/main/logo.png?raw=true" alt="AnnData" style="width:600px; height:auto;"/>

# CellRanger Software: Preprocessing scRNA-seq Data and Quality Evaluation 🧬
### Learning objectives of this session:
1. Understand how the CellRanger software preprocesses raw scRNA-seq data and generates a count matrix;
2. Evaluate the quality of a scRNA-seq experiment using the output Web Summary from CellRanger.

## The scRNA-seq workflow:

1. **Sample Preparation and Processing**: The first step in the scRNA-seq workflow involves preparing the sample. This includes isolating single cells from the tissue of interest, capturing individual cells in separate partitions (such as droplets or wells), and lysing the cells to release their RNA. The RNA is then reverse transcribed to create complementary DNA (cDNA), which serves as the template for subsequent amplification and library preparation. The end result of this step is a library of cDNA fragments, each tagged with a unique barcode that identifies the cell of origin.

2. **Sequencing**: The prepared library is then sequenced using next-generation sequencing technology. This process reads the cDNA fragments and generates a massive amount of raw sequencing data in the form of FASTQ files. Most single-cell RNA sequencing methods employ a strategy of pooled sequencing. This approach enhances data throughput by amplifying and sequencing numerous cells simultaneously within the same ‘pool’. Each sequenced fragment yields reads that carry the cDNA sequence together with the cell barcode and a unique molecular identifier (UMI) that tags each original RNA molecule; in 10x Genomics libraries, the barcode and UMI are in Read 1 and the cDNA sequence is in Read 2.

3. **Conversion from FASTQ to Count Matrix**: The raw sequencing data is then processed to generate a count matrix, which is a table listing the number of times each gene (represented by rows) was detected in each cell (represented by columns). This involves aligning the sequencing reads to a reference genome, correcting for sequencing errors, and counting the number of UMIs associated with each gene in each cell. The count matrix serves as the starting point for most downstream analyses of scRNA-seq data, such as identifying cell types and states, detecting differentially expressed genes, and inferring developmental trajectories.
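
As a minimal sketch of the counting step, assume the reads have already been aligned and reduced to (cell barcode, UMI, gene) records. The snippet below collapses duplicate reads of the same molecule and tallies the unique UMIs per gene and cell. The barcodes, UMIs, and gene assignments are made-up illustrative values, and real pipelines additionally correct sequencing errors in barcodes and UMIs.

```python
from collections import defaultdict

# Hypothetical aligned reads: (cell barcode, UMI, gene). Values are illustrative only.
reads = [
    ("AAAC", "GGTA", "CD3E"),
    ("AAAC", "GGTA", "CD3E"),   # duplicate read of the same molecule
    ("AAAC", "CTTG", "CD3E"),
    ("AAAC", "TACC", "MS4A1"),
    ("TTTG", "GGTA", "MS4A1"),
]

def count_umis(reads):
    """Return {gene: {cell: n_unique_umis}} from (cell, umi, gene) records."""
    molecules = set(reads)  # identical (cell, UMI, gene) triples collapse into one molecule
    counts = defaultdict(lambda: defaultdict(int))
    for cell, _umi, gene in molecules:
        counts[gene][cell] += 1
    return counts

counts = count_umis(reads)
cells = sorted({cell for cell, _, _ in reads})
genes = sorted(counts)

# Print the matrix with genes as rows and cells as columns
print("gene\t" + "\t".join(cells))
for gene in genes:
    print(gene + "\t" + "\t".join(str(counts[gene].get(cell, 0)) for cell in cells))
```

The two identical `CD3E` reads in cell `AAAC` share a UMI, so they count as a single molecule.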

## FASTQ files

FASTQ files are the raw data output from sequencing. They contain the nucleotide sequences (reads) and corresponding quality scores.

In [None]:
# Download an example of a FASTQ file
!wget https://zenodo.org/record/3457880/files/subset_pbmc_1k_v3_S1_L001_R1_001.fastq.gz

# Unzip the file
!gunzip subset_pbmc_1k_v3_S1_L001_R1_001.fastq.gz


In [None]:
# Print the first 50 lines
!head -n 50 subset_pbmc_1k_v3_S1_L001_R1_001.fastq


In [None]:
# Print the last 50 lines
!tail -n 50 subset_pbmc_1k_v3_S1_L001_R1_001.fastq

## Questions and further information:

1. Do the reads from the initial 50 lines originate from a single cell or multiple cells?
2. Do the reads from the final 50 lines come from the same cell as the first 50 lines, or do they come from different cells?
3. Among the reads displayed in the notebook, which ones had a lower quality score?



The output above has the first 50 lines of a FASTQ file.
Here’s a breakdown of the components:

*   Sequence Identifier: Each record starts with a unique identifier line that begins with ‘@’. For example, @A00228:279:HFWFVDMXX:1:1101:4110:1063 1:N:0:ACATTACT is an identifier. This line contains information about the sequencing run and the specific read.
*   Sequence: The line after the identifier is the actual sequence of bases (A, T, C, G). For example, TGGGCTGGTCGCGGTTCATGGACATTCG is a sequence.
*   Separator: The ‘+’ character is a separator line between the sequence and its quality scores.
*   Quality Scores: The line following the ‘+’ character holds the quality scores for the sequence. These scores are encoded as ASCII characters, one per base; the Phred score is the character's ASCII code minus 33, and a higher score means a lower probability that the corresponding base call is incorrect. For example, FFFFFFFFFFFFFFFFFFFFFFFFFFFF are the quality scores for the sequence TGGGCTGGTCGCGGTTCATGGACATTCG. The character ‘F’ corresponds to a Phred quality score of 37, which indicates very high confidence in the base call. The characters ‘:’ and ‘,’ indicate lower confidence: ‘:’ corresponds to a Phred quality score of 25, and ‘,’ corresponds to a Phred quality score of 11.
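
The decoding of quality characters can be sketched in a few lines of Python. This assumes the Phred+33 encoding used by current Illumina instruments; the function names are chosen here for illustration and do not come from any library.

```python
def phred_scores(quality_line, offset=33):
    """Decode an ASCII quality string into Phred scores (Phred+33 by default)."""
    return [ord(ch) - offset for ch in quality_line]

def error_probability(q):
    """Probability that a base call with Phred score q is wrong: 10^(-q/10)."""
    return 10 ** (-q / 10)

# Decode the three quality characters discussed in the text
for ch in "F:,":
    q = phred_scores(ch)[0]
    print(f"{ch!r}: Q{q}, error probability ~ {error_probability(q):.4f}")
```

Applying `phred_scores` to a whole quality line from the FASTQ file gives one score per base, which makes it easy to spot the low-quality reads asked about in question 3.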



In the convention used by Illumina sequencing platforms, the fields of the first sequence identifier mean:
* A00228: This is the unique identifier of the sequencing instrument.
* 279: This is the run number, an identifier for the specific run on the sequencer.
* HFWFVDMXX: This is the unique identifier of the flow cell used in the sequencing run.
* 1: This represents the lane number on the flow cell.
* 1101: This is the tile number on the flow cell.
* 4110: This is the ‘x’ coordinate of the cluster on the tile.
* 1063: This is the ‘y’ coordinate of the cluster on the tile.
* 1: This indicates the member of a pair (1 or 2) in paired-end sequencing.
* N: This indicates whether the read was filtered out. ‘Y’ means the read was filtered (it did not pass the quality filter), ‘N’ means it was not filtered (it passed).
* 0: Control bits are used in some sequencing applications for specific purposes, often related to quality control. In the Illumina sequencing header, a ‘0’ typically means that no control bits are set. Control bits might be used, for example, to flag or identify specific types of reads. However, in many applications, including most RNA-seq experiments, this field may not be used and will just be set to ‘0’.
* ACATTACT: The index sequence (sample index) is a short sequence added to every DNA fragment of a library during library preparation. This allows multiple samples to be mixed together and sequenced in the same run, a process known as multiplexing. After sequencing, the index sequence is used to identify which reads came from which sample. In 10x Genomics single-cell RNA sequencing, the sample index is not the cell barcode: it identifies the library, whereas the cell barcode and the UMI are part of Read 1. In the v3 chemistry used for this dataset, the 28 bases of Read 1 consist of a 16-base cell barcode followed by a 12-base UMI, so in TGGGCTGGTCGCGGTTCATGGACATTCG the cell barcode is TGGGCTGGTCGCGGTT and the UMI is CATGGACATTCG. Each cell is given its own barcode, allowing the RNA from thousands of individual cells to be sequenced together while still keeping track of which reads came from which cells.
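
The field-by-field breakdown above can be reproduced programmatically. The following is a small sketch, assuming the identifier follows the two-part Illumina layout shown in this file (seven colon-separated fields, a space, then four colon-separated fields); `parse_illumina_header` is a hypothetical helper name, not a library function.

```python
def parse_illumina_header(header):
    """Split an Illumina FASTQ identifier line into named fields."""
    location, description = header.lstrip("@").split(" ")
    instrument, run, flowcell, lane, tile, x, y = location.split(":")
    read, is_filtered, control, index = description.split(":")
    return {
        "instrument": instrument,
        "run": int(run),
        "flowcell": flowcell,
        "lane": int(lane),
        "tile": int(tile),
        "x": int(x),
        "y": int(y),
        "read": int(read),
        "is_filtered": is_filtered,
        "control": int(control),
        "index": index,
    }

fields = parse_illumina_header("@A00228:279:HFWFVDMXX:1:1101:4110:1063 1:N:0:ACATTACT")
for name, value in fields.items():
    print(f"{name}: {value}")
```

Headers written by other instruments or tools may use a different layout, in which case the unpacking would raise a `ValueError` rather than return wrong fields.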


## FASTQ to Count Matrix using CellRanger

10x Genomics developed a proprietary processing pipeline, called CellRanger, to process the raw data generated by its scRNA-seq assays. Alternatives for the same task include STARsolo and UniverSC. In this course, we will focus on CellRanger since it is widely used and supported.

## Key Aspects of Running CellRanger

### Minimal Input Files
The minimal input files required to run CellRanger depend on the specific pipeline being used. Here are the key inputs for a typical run:

1. **FASTQ files**: These are text files that contain nucleotide sequence information along with quality scores for each nucleotide sequenced.
2. **Reference transcriptome**: This is a collection of all known transcript sequences from a given organism. 10x Genomics provides downloadable pre-built reference transcriptomes for human, mouse, and some other organisms.

### Command to Run CellRanger
The command to run CellRanger is as follows:

```cellranger count --id=<ID> --transcriptome=<PATH> --fastqs=<PATH>```

Here, `<ID>` is a unique run ID string, `--transcriptome` takes the path of the folder containing the 10x-compatible transcriptome reference, and `--fastqs` takes the path of the folder containing the FASTQ files.

The ID may contain only alphanumeric characters, underscores, and dashes (no spaces) and must be shorter than 64 characters. CellRanger creates an output directory named after this ID. This directory is called a "pipeline instance", or pipestance for short.

For example, if you have a run ID of my_run, a transcriptome reference at "/opt/refdata-gex-GRCh38-2020-A", and FASTQ files in "/opt/fastq_files", the command would be:

```cellranger count --id=my_run --transcriptome=/opt/refdata-gex-GRCh38-2020-A --fastqs=/opt/fastq_files```
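
From a notebook, the same command can be assembled in Python, which makes it easy to check the run ID before launching a long job. This is only a sketch: `build_count_command` is a hypothetical helper, and it uses just the three options described above.

```python
import re
import shlex

def build_count_command(run_id, transcriptome, fastqs):
    """Return the `cellranger count` argument list after checking the run ID."""
    # Run ID rule from the text: letters, digits, underscores, dashes; fewer than 64 characters
    if not re.fullmatch(r"[A-Za-z0-9_-]{1,63}", run_id):
        raise ValueError(f"Invalid run ID: {run_id!r}")
    return [
        "cellranger", "count",
        f"--id={run_id}",
        f"--transcriptome={transcriptome}",
        f"--fastqs={fastqs}",
    ]

cmd = build_count_command("my_run", "/opt/refdata-gex-GRCh38-2020-A", "/opt/fastq_files")
print(shlex.join(cmd))

# To launch the run (requires CellRanger to be installed), one could use:
# import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell-quoting problems if a path contains spaces.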

### Output Files
The CellRanger pipelines output several types of files, including:

1. Web summary (HTML): A summary of the run in HTML format.
2. Metrics summary (CSV): A CSV file containing summary metrics of the run.
3. BAM file: A binary version of a SAM file that contains aligned sequence data.
4. Raw and filtered feature-barcode matrices (MEX, H5): These matrices contain the number of UMIs associated with a feature (row) and a barcode (column).
5. Secondary analysis files (CSV): These files contain results of secondary analysis.
6. Molecule info (H5): This file contains per-molecule information for all molecules that contain a valid barcode, valid UMI, and were assigned with high confidence to a gene or Feature Barcode.
7. Loupe files (cloupe and vloupe): These are visualization and analysis files for Loupe Browser.
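
The metrics summary CSV is convenient for comparing runs programmatically. The sketch below assumes the file holds one header row and one value row, with numbers formatted using thousands separators and percent signs; the column names and values shown are illustrative stand-ins, not results from a real run.

```python
import csv
import io

# Illustrative stand-in for a metrics summary CSV; the values are made up.
example_csv = io.StringIO(
    "Estimated Number of Cells,Mean Reads per Cell,Median Genes per Cell,Sequencing Saturation\n"
    '"1,222","54,547","1,944",70.8%\n'
)

def parse_metric(text):
    """Convert strings such as '1,222' or '70.8%' to numbers."""
    cleaned = text.replace(",", "")
    if cleaned.endswith("%"):
        return float(cleaned[:-1]) / 100
    return float(cleaned) if "." in cleaned else int(cleaned)

# The file has a single data row, so take the first record
row = next(csv.DictReader(example_csv))
metrics = {name: parse_metric(value) for name, value in row.items()}
for name, value in metrics.items():
    print(f"{name}: {value}")
```

For a real run, the `io.StringIO` object would be replaced by the opened CSV file from the run's output folder, for example `open(path, newline="")`.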


## Understanding the Web Summary

Executing the CellRanger pipeline usually requires several hours. The pipeline generates a summary HTML file, named `web_summary.html`, which includes summary metrics and results from automated secondary analyses.

You can view the run summary from `cellranger count` by selecting "Summary" in the top left corner. The summary metrics provide information about the sequencing quality and various attributes of the identified cells. The `cellranger reanalyze` and `cellranger aggr` pipelines also produce similar web summaries.

This report acts as an initial feedback mechanism on the experiment's outcome. It offers a readily available summary for evaluating the experiment's success.

View this example of a web_summary: <a href="https://nunososorio.github.io/cellranger/file1.html" target="_blank">https://nunososorio.github.io/cellranger/file1.html</a>

Answer these questions:

- What is the run's quality?
- How many cells were identified?
- Is the estimated cell count reliable?
- Was the sequencing depth sufficient? Were the cells intact and in good condition?
- Is the quality of the cells consistent?

Please note that these questions will be answered more conclusively during a subsequent, more detailed part of the Quality Control (QC) process. Consider this as just the start of your evaluation.
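
One metric that bears on the sequencing-depth question is sequencing saturation. As a rough sketch, assume the definition from the 10x Genomics knowledge base: one minus the ratio of deduplicated reads to total reads, counting only confidently mapped reads with a valid barcode and UMI. The read counts below are hypothetical.

```python
def sequencing_saturation(n_reads, n_deduped_reads):
    """Fraction of reads that are duplicates of an already observed molecule."""
    if n_reads <= 0:
        raise ValueError("n_reads must be positive")
    return 1 - n_deduped_reads / n_reads

# Hypothetical counts, for illustration only
saturation = sequencing_saturation(n_reads=1_000_000, n_deduped_reads=300_000)
print(f"Sequencing saturation: {saturation:.1%}")
```

A high value suggests that additional sequencing would recover few new molecules, while a low value suggests that deeper sequencing could still detect new transcripts.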





## Roleplay Exercise 🎭

Team up with the colleague closest to you. One of you will play the role of the **illuminated experienced supervisor** 👩‍🔬, and the other will play the role of the **rising global star PhD student** 🌟.

### Task:
1. **Evaluate the files** and choose one or more of the web summaries below:
 - <a href="https://nunososorio.github.io/cellranger/file2.html" target="_blank">https://nunososorio.github.io/cellranger/file2.html</a>
 - <a href="https://nunososorio.github.io/cellranger/file3.html" target="_blank">https://nunososorio.github.io/cellranger/file3.html</a>
 - <a href="https://nunososorio.github.io/cellranger/file4.html" target="_blank">https://nunososorio.github.io/cellranger/file4.html</a>
 - <a href="https://nunososorio.github.io/cellranger/file5.html" target="_blank">https://nunososorio.github.io/cellranger/file5.html</a>
 - <a href="https://nunososorio.github.io/cellranger/file6.html" target="_blank">https://nunososorio.github.io/cellranger/file6.html</a>
 - <a href="https://nunososorio.github.io/cellranger/file7.html" target="_blank">https://nunososorio.github.io/cellranger/file7.html</a>
 - <a href="https://nunososorio.github.io/cellranger/file8.html" target="_blank">https://nunososorio.github.io/cellranger/file8.html</a>
 - <a href="https://nunososorio.github.io/cellranger/file9.html" target="_blank">https://nunososorio.github.io/cellranger/file9.html</a>
 - <a href="https://nunososorio.github.io/cellranger/file10.html" target="_blank">https://nunososorio.github.io/cellranger/file10.html</a>
 - <a href="https://nunososorio.github.io/cellranger/file11.html" target="_blank">https://nunososorio.github.io/cellranger/file11.html</a>
 - <a href="https://nunososorio.github.io/cellranger/file12.html" target="_blank">https://nunososorio.github.io/cellranger/file12.html</a>
 - <a href="https://nunososorio.github.io/cellranger/file13.html" target="_blank">https://nunososorio.github.io/cellranger/file13.html</a>
 - <a href="https://nunososorio.github.io/cellranger/file14.html" target="_blank">https://nunososorio.github.io/cellranger/file14.html</a>
 - <a href="https://nunososorio.github.io/cellranger/file15.html" target="_blank">https://nunososorio.github.io/cellranger/file15.html</a>
 - <a href="https://nunososorio.github.io/cellranger/file16.html" target="_blank">https://nunososorio.github.io/cellranger/file16.html</a>
2. Create a script that involves a conversation between the supervisor and the student.
3. The student is reporting the latest single-cell results.
4. The brilliant supervisor will ask all the right questions, while the rising star student will impress the supervisor with their wisdom and the amount of information they can provide about the experiment based on the web summary.

### Time Constraints:
- You have **1 hour and 30 minutes** to study the **CellRanger manuals and guidelines** and write the script.
- During the roleplay, you will have **8 minutes** to perform the conversation, and you can project the web summary during that time.

*Remember, the success of this roleplay depends on your preparation and collaboration!* 🚀


Selected Sources of Information to Study:

1. <a href="https://multiqc.info/modules/cellranger/" target="_blank">Basic QC metrics</a>
2. <a href="https://kb.10xgenomics.com/hc/en-us/articles/115005062366-What-is-sequencing-saturation" target="_blank">Diagnostic QC metrics – Sequencing</a>
3. <a href="https://www.10xgenomics.com/support/software/cell-ranger/latest" target="_blank">Diagnostic QC metrics – Mapping</a>
4. <a href="https://kb.10xgenomics.com/hc/en-us/articles/115002022743-What-is-the-recommended-sequencing-depth-for-Single-Cell-3-and-5-Gene-Expression-libraries" target="_blank">Ranked Barcode Plot</a>
5. <a href="https://kb.10xgenomics.com/hc/en-us/articles/4414460285197-What-metrics-should-I-be-looking-at-to-optimize-my-sample-prep-in-my-pilot-experiment" target="_blank">Saturation</a>
6. <a href="https://www.10xgenomics.com/analysis-guides/quality-assessment-using-the-cell-ranger-web-summary" target="_blank">Analysis view</a>
7. <a href="https://www.10xgenomics.com/support/software/cell-ranger/latest/getting-started/cr-what-is-cell-ranger" target="_blank">Troubleshooting</a>
