CD8_clustering

Background and Rationale

Single-cell RNA-seq (scRNA-seq) experiments are becoming increasingly common in biological research. An advantage of scRNA-seq over bulk sequencing is the ability to measure gene expression in individual cells. This granularity enables analyses that are not possible with bulk sequencing experiments.

A classic example of a single-cell-specific analysis is cell clustering. In a bulk sequencing experiment, the gene expression data is a mixture of various cell types that cannot be separated, so it is extremely difficult to get an accurate picture of the cellular composition of a sample. In scRNA-seq experiments, individual cells within a sample can be grouped into sub-type clusters based on their gene expression profiles using standard clustering methods, which makes it possible to infer the composition of cellular sub-types in a sample.

In this CD8_clustering workflow, we look specifically at CD8+ T cells, also known as "killer T cells", which have a cytotoxic function within adaptive immunity. CD8+ T cells are typically categorized into specific subtypes: naive, memory (stem cell, central, and effector), effector, and exhausted. These subtypes have different gene expression profiles as well as different functions in the immune system.

By clustering scRNA-seq data of CD8+ T cells, one can infer the subtype of individual cells from the characteristic gene expression of each cluster. This gives valuable information on the cellular composition of the given sample(s) and therefore on the immune function of the cells.

Clustering Method

This workflow makes use of the K Nearest Neighbours (KNN) clustering algorithm, where the number of nearest neighbours can be specified by the user. Since scRNA-seq data is high-dimensional and sparse, dimensionality reduction is necessary in order to perform clustering. The most common dimensionality reduction technique is Principal Component Analysis (PCA), but other methods (t-SNE and UMAP) are also available in this workflow.
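
To make the idea of neighbour-based clustering concrete, here is a minimal, illustrative sketch in pure Python that operates on already-reduced coordinates. It is not the workflow's implementation. The helper names `knn_graph` and `connected_clusters`, the toy coordinates, and the use of connected components of the mutual-neighbour graph as a simple stand-in for the cluster assignment step are all assumptions made for illustration.

```python
import math

def knn_graph(points, k):
    """Return, for each point index, the set of indices of its k nearest neighbours."""
    neighbours = []
    for i, p in enumerate(points):
        # Rank every other point by Euclidean distance to p
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        neighbours.append(set(others[:k]))
    return neighbours

def connected_clusters(neighbours):
    """Label points by connected components of the mutual-kNN graph.

    Hypothetical stand-in for the cluster assignment step.
    """
    labels = [None] * len(neighbours)
    current = 0
    for start in range(len(neighbours)):
        if labels[start] is not None:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in neighbours[i]:
                # Keep an edge only if i and j are in each other's neighbour sets
                if i in neighbours[j] and labels[j] is None:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

# Toy 2-D "reduced" coordinates: two well-separated groups of three cells
cells = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
labels = connected_clusters(knn_graph(cells, k=2))
print(labels)
```

In this toy setting a larger `k` links more cells together and tends to merge groups, which is the kind of trade-off the user-specified number of neighbours controls.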

Workflow

The main steps of this workflow are:

Download the raw gene expression matrix files for all relevant samples from the NCBI GEO over FTP (tar format).
Extract the individual sample-specific data files (in this DAG, there are 7 individual samples).
Aggregate the individual sample data into a single gene expression matrix.
Filter features and cells based on user-determined thresholds.
Cluster cells using KNN on the dimension-reduced gene expression data.
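
As a rough illustration of the aggregation step, the sketch below merges per-sample count tables into one matrix in plain Python. The function name `aggregate_samples`, the nested-dictionary input layout, the sample-name prefix on barcodes, and the toy values are assumptions for illustration, not the workflow's actual data structures.

```python
def aggregate_samples(samples):
    """Merge per-sample count tables into one gene-by-cell table.

    `samples` maps a sample name to {barcode: {gene: count}}.
    Barcodes are prefixed with the sample name so they stay unique.
    """
    genes = sorted(
        {g for cells in samples.values() for counts in cells.values() for g in counts}
    )
    merged = {}
    for sample, cells in samples.items():
        for barcode, counts in cells.items():
            # Genes not detected in a cell get an explicit zero count
            merged[f"{sample}_{barcode}"] = [counts.get(g, 0) for g in genes]
    return genes, merged

# Toy input: the same barcode appears in two different samples
toy = {
    "sampleA": {"AAAC": {"Cd8a": 3, "Gzmb": 1}},
    "sampleB": {"AAAC": {"Cd8a": 2, "Pdcd1": 4}},
}
genes, merged = aggregate_samples(toy)
print(genes)
print(merged)
```

Prefixing barcodes matters because the same barcode sequence can occur in more than one sample, and the merged matrix needs one unique column per cell.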

Usage

First, clone the git repository to your local machine and enter the project directory:

git clone git@github.com:pattiey/CD8_clustering.git
cd CD8_clustering

Ensure that necessary channels are available:

conda config --add channels 'bioconda'
conda config --add channels 'r'
conda config --add channels 'conda-forge'

Then create an environment with the required packages:

conda create --name cd8_clustering --file env.txt

Activate the environment:

conda activate cd8_clustering

Due to package conflicts, the workflow uses your local R installation, and R packages must be installed outside of the Conda environment. Please ensure that your R version is up to date. To install the required R packages, run:

Rscript /path/to/CD8_clustering/scripts/init.R

Enter yes at any prompts that appear.

Run the Snakemake workflow with relevant parameters. Be sure to set the appropriate number of cores. Ensure that the config.yaml file is updated with the relevant fields before running.

snakemake --snakefile /path/to/CD8_clustering/Snakemake/Snakefile --configfile /path/to/CD8_clustering/Snakemake/config.yaml --cores 4

Input

The input for this workflow is controlled by the config.yaml file. Here is a description of the fields of the config.yaml file.

Field	Description
`FTP_URL`	FTP URL of the raw gene expression data file to download from NCBI GEO
`DATA_DIR`	Directory where data is to be stored
`SCRIPTS_DIR`	Directory where project scripts are stored
`OUTPUT_DIR`	Directory where output files are to be stored
`PROJECT`	Name of experiment/project
`SAMPLES`	Sample names from GEO
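
Before launching the workflow, it can help to confirm that every field in the table above is filled in. The following is a small sketch in Python, operating on a config that has already been loaded into a dictionary. The helper `missing_fields` is hypothetical, every value in `example_config` is a placeholder, and the assumption that `SAMPLES` is a list is mine; check the example config in the repository for the actual layout.

```python
REQUIRED_FIELDS = (
    "FTP_URL", "DATA_DIR", "SCRIPTS_DIR", "OUTPUT_DIR", "PROJECT", "SAMPLES",
)

def missing_fields(config):
    """Return the required config fields that are absent or empty."""
    return [field for field in REQUIRED_FIELDS if not config.get(field)]

# Placeholder values only; replace them with your own URL, paths, and sample names
example_config = {
    "FTP_URL": "ftp://example.org/path/to/archive.tar",
    "DATA_DIR": "data",
    "SCRIPTS_DIR": "scripts",
    "OUTPUT_DIR": "output",
    "PROJECT": "my_project",
    "SAMPLES": ["sample_1", "sample_2"],
}

print(missing_fields(example_config))
print(missing_fields({"PROJECT": "my_project", "SAMPLES": []}))
```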

Other user-specified parameters can be adjusted in the Snakefile.

Snakemake rule	Parameter	Description
`filter_cells`	`mito`	Maximum percentage of mitochondrial expression allowed in a cell
`filter_cells`	`ribo`	Maximum percentage of ribosomal expression allowed in a cell
`filter_cells`	`nFeature_lo`	Minimum number of features a cell must have to be kept
`filter_cells`	`nFeature_hi`	Maximum number of features a cell may have to be kept
`filter_cells`	`nCount_lo`	Minimum number of counts a cell must have to be kept
`filter_cells`	`nCount_hi`	Maximum number of counts a cell may have to be kept
`cluster_cells`	`reduction`	Method of dimensionality reduction for clustering: `pca`, `tsne`, or `umap`
`cluster_cells`	`k`	Number of neighbours to use for K nearest neighbours clustering
`cluster_cells`	`num_features`	Number of features to use for SCTransform
`cluster_cells`	`resolution`	Value of the resolution parameter; use a value above 1.0 to obtain a larger number of communities, or below 1.0 to obtain a smaller number
`cluster_cells`	`dims`	Number of dimensions to use for clustering
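
The `filter_cells` thresholds can be read as a set of lower and upper bounds applied to each cell. The sketch below shows one way that logic might look in Python. The function `keep_cell`, the per-cell field names (`percent_mito`, `percent_ribo`, `nFeature`, `nCount`), the inclusive bounds, and all numeric values are assumptions for illustration and are not taken from the Snakefile.

```python
def keep_cell(cell, mito, ribo, nFeature_lo, nFeature_hi, nCount_lo, nCount_hi):
    """Return True if a cell passes every threshold (bounds treated as inclusive)."""
    return (
        cell["percent_mito"] <= mito
        and cell["percent_ribo"] <= ribo
        and nFeature_lo <= cell["nFeature"] <= nFeature_hi
        and nCount_lo <= cell["nCount"] <= nCount_hi
    )

# Arbitrary example thresholds, not the workflow's defaults
thresholds = dict(
    mito=10, ribo=50, nFeature_lo=500, nFeature_hi=6000, nCount_lo=1000, nCount_hi=40000
)

# Toy per-cell quality metrics
cells = [
    {"barcode": "c1", "percent_mito": 2.0, "percent_ribo": 20.0, "nFeature": 1500, "nCount": 5000},
    {"barcode": "c2", "percent_mito": 25.0, "percent_ribo": 20.0, "nFeature": 1500, "nCount": 5000},
    {"barcode": "c3", "percent_mito": 2.0, "percent_ribo": 20.0, "nFeature": 200, "nCount": 5000},
]

kept = [c["barcode"] for c in cells if keep_cell(c, **thresholds)]
print(kept)
```

Here `c2` fails the mitochondrial threshold and `c3` falls below the minimum feature count, so only `c1` is kept.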

The example config file in this repository contains the samples and URL data from the NCBI Gene Expression Omnibus (GEO) series GSE116390. This is scRNA-seq data of CD8+ T cells from B16 melanoma tumours in mice, from the experiments by S. Carmona et al.

Output

The workflow produces the result of KNN clustering on the scRNA-seq samples.

Output File	Description
`cellClusters.csv`	A CSV file containing the cluster label of each cell, with cells identified by barcode and sample
`PCA_plot.png`	A plot of the first two principal components of the gene expression data, coloured by cluster
`TSNE_plot.png`	A t-SNE plot of the gene expression data, coloured by cluster
`UMAP_plot.png`	A UMAP plot of the gene expression data, coloured by cluster
`cell_comp.png`	A plot of cluster proportions, total and by sample
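
For downstream analysis, the cluster labels can be summarised into proportions, overall and by sample, similar in spirit to what `cell_comp.png` displays. The sketch below uses only the Python standard library on an in-memory toy table. The column names `barcode`, `sample`, and `cluster` are assumptions; check the header of your own `cellClusters.csv` and adjust them as needed.

```python
import csv
import io
from collections import Counter

def cluster_proportions(rows):
    """Return {cluster label: fraction of cells} for rows with a 'cluster' field."""
    counts = Counter(row["cluster"] for row in rows)
    total = sum(counts.values())
    return {cluster: n / total for cluster, n in counts.items()}

# Toy stand-in for cellClusters.csv (column names are assumed)
toy_csv = io.StringIO(
    "barcode,sample,cluster\n"
    "AAAC,sampleA,0\n"
    "AAAG,sampleA,1\n"
    "AAAT,sampleB,0\n"
    "AACA,sampleB,0\n"
)
rows = list(csv.DictReader(toy_csv))

overall = cluster_proportions(rows)
by_sample = {
    sample: cluster_proportions([r for r in rows if r["sample"] == sample])
    for sample in sorted({r["sample"] for r in rows})
}
print(overall)
print(by_sample)
```

To use a real file, replace `toy_csv` with an open file handle, for example `open("cellClusters.csv", newline="")`.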

Using the sample data and the parameters specified in the Snakefile, the workflow produces four plots: a principal component plot, a t-SNE plot, a UMAP plot, and a cluster composition plot.

The K Nearest Neighbours clustering found four distinct clusters. Using this clustering, one could then perform further analysis to find the corresponding CD8+ cell subtype for each cluster in order to extract information about the cellular composition and immuno-landscape of the experimental samples.
