CROP-seq

Repository for the processing and analysis of CROP-seq data. This repository contains code to do the following steps for CROP-seq analysis:

Reference preparation
Guide quantification from scRNA-seq data
Guide quantificationf from enriched RNAseq data

Reference preparation

The reference requires the following information:

Primary assembly of the organism (FASTA and GTF)
Spreadsheet with gRNA sequence and target genes
Spreadsheet with CROP-seq vector elements

Use the following script: buildReference.py to build the guide reference.

python buildReference.py -i example_data/GWAS_sgRNAs.csv -c example_data/CROP\-seq.csv -o Test

Concatenate the generated files to your primary assembly files, to create a new set of reference files.

> cat Homo_sapiens.GRCh38.dna.primary_assembly.fa CROPseq.fa > Homo_sapiens.CROPseq.fa
> cat Homo_sapiens.GRCh38.98.chr.gtf CROPseq.gtf > Homo_sapiens.CROPseq.gtf

Convert the files to Cell Ranger reference files using cellranger mkref.

# Path to input files
GENOME_NAME=Homo_sapiens.CROPseq
INPUT_FASTA=Homo_sapiens.CROPseq.fa
INPUT_GTF=Homo_sapiens.CROPseq.gtf
INPUT_FILTERED_GTF=Homo_sapiens.CROPseq.Filtered.gtf

# Filter GTF
cellranger mkgtf ${INPUT_GTF} ${INPUT_FILTERED_GTF}  --attribute=gene_biotype:protein_coding --attribute=gene_biotype:lincRNA --attribute=gene_biotype:antisense || exit 1

# Create STAR reference
cellranger mkref --genome=${GENOME_NAME} --fasta=${INPUT_FASTA} --genes=${INPUT_FILTERED_GTF} || exit 1

scRNA-seq Processing and Analysis

Chromium Processing

This step is usually done by the service provider. Samples need to be mapped to the reference prepared in the previous step.

Data for all arrays need to be combined via the cellranger aggr function, which not only concatenates the count matrices, but also performs a read depth equalization between all samples.

Guide molecule identification

If the reference has worked, you will receive a count table with the guide and control RNA as additional genes in the expression matrix. These can be retrieved using the following steps.

MiSeq Enrichment

FASTQ Preparation

This uses merged paired-end reads as generated by the Galaxy pipeline.

Label reads with combined cell barcode and UMI sequence, located at the first 28 nucleotides of the merged sequences. This can be done with fastp.

fastp -i Galaxy112-[ARRAY1_R1_R2_merge].fastqsanger  -o ARRAY1.fastq -U --umi_loc read1 --umi_len 28  -w 4

Strip polyT tail from the reads using cutadapt.

cutadapt -j 4 -g "T{200}" -o ARRAY1_Processed.fastq ARRAY1.fastq

This can be run for all samples using the preprocessFastq.sh script.

Mapping

Map guides to the reference generated in the first section using STAR aligner.

STAR --genomeDir $REF_DIR --readFilesCommand cat \
--readFilesIn $INPUT_FILE \
--outSAMtype BAM Unsorted \
--quantMode GeneCounts \
--runThreadN 16 \
--outFileNamePrefix $SAMPLE_NAME \
--outReadsUnmapped Fastx \
--outFilterScoreMinOverLread 0 \
--outFilterMatchNminOverLread 0 \
--outFilterMatchNmin 0 \
--outFilterMismatchNmax 2

This will output a BAM file and a count table that we will use to trace back to the original cells.

Quantification via reads

Run quantifyEnrichment.py. This will process the BAM files (eg. ARRAY1_Aligned.sortedByCoord.out.bam) generated by STAR. These files will need to be indexed and position-sorted to work.

The script can be invoked as follows:

python quantifyEnrichment.py -i ARRAY1_Aligned.sortedByCoord.out.bam -o ARRAY1_EnrichmentReads.tsv

This will generate a spreadsheet that contains the number of reads mapped to each UMI-barcoded molecule containing a gRNA, and the cell it came from.

Example output

gRNA	cell_barcode	umi	reads
GUIDES_sg001_ABCA1	AAACCCATCACTGTTT	GTCCAGGGGGT	1
GUIDES_sg001_ABCA1	AAACGAAGTCCAGCGT	TATTGAGCATT	1
GUIDES_sg001_ABCA1	AAAGAACTCGGCGATC	TTATTTGCGTC	1
GUIDES_sg001_ABCA1	AAAGGGCGTCCCGTGA	GGCCGCCCTGC	2
GUIDES_sg001_ABCA1	AAATGGAAGCAGCCAG	ACATCAATCAG	1
GUIDES_sg001_ABCA1	AAATGGAGTAGACAGC	TACTTATAATG	1
GUIDES_sg001_ABCA1	AACAACCCAATTTCGG	AGTTGCTGCGG	1
GUIDES_sg001_ABCA1	AACAACCGTTTATGCG	TATATTATCAG	3
GUIDES_sg001_ABCA1	AACAAGAGTATGCTAC	GGCGAGATCAT	1

The output must then be processed via the R script [quantifyEnrichment.R][quantifyEnrichment.R], which will parse the output and identify the guides present in each cell based on the MiSeq enrichment data.

Script is invoked as follows:

Rscript quantifyTranscriptome.R ARRAY1_EnrichmentReads.tsv ARRAY1

Guides are assigned to a cell if:

They are the only guides found in a cell
They have more reads than 3x the sum other reads assigned to other guides in the cell AND the guides target a different gene
The guides for a single gene have 3x the sum of reads assigned to other guides in the cell

Example output

cell_barcode	gRNA	type	target
AAAAAAAAGGAGGGGG	sg097	GUIDES	TIMP3
AAAAAAAAGGGTACGT	sg103	GUIDES	TRIOBP
AAAAAAGGACGGGGGA	sg020	GUIDES	ANTXR1
AAAAACTGTCCTTAAG	sg045	GUIDES	EMID1
AAAAAGATCCAAGGAA	NA	NA	NA
AAAAAGTGAAAGAGAA	sg112	GUIDES	TEX41

Quantification via UMIs

Run quantifyTranscriptome.R to retrieve gRNA-mapped UMI counts from the Chromium data. Input should be the filtered_feature_bc_matrix found in the Cell Ranger output folder.

You will need to invoke the script as follows:

Rscript quantifyTranscriptome.R AH_array-1/filtered_feature_bc_matrix ARRAY1

This will generate a file with the following information:

cell_barcode	gRNA
AAACCCAAGAGTGTGC	NA
AAACCCAAGCGCTGAA	NA
AAACCCAAGGCACAAC	NA
AAACCCAAGTCGTTAC	NA
AAACCCAAGTCTGGAG	NA
AAACCCACAGAGACTG	NA
AAACCCACATAGGAGC	NA
AAACCCAGTACTCGTA	NA
AAACCCAGTATAGCTC	NA
AAACCCAGTTGTTTGG	NA
AAACCCATCACTGTTT	GUIDES-sg097-TIMP3-gene

Final guide assignments

The processTranscriptome.R script performs two functions:

Integration of assignments from read and transcriptome information
Filter cells based on QC quality and adds gRNA assignments to gene-count information

It is recommended you run this script manually as you will need to review the QC reports to make decisions on which cells to retain.

Analysis

Incorporation of scRNA-seq data and assignments

The aggregated scRNA-seq data was loaded into R via Seurat, and final guide assignments were added to the object metadata with the Prepare_SingleCellData.R script. This step also normalises the dataset using SCTransform.

Differential expression analysis

The script CombinedLRT.R loads the Seurat object generated in the previous step and runs differential expression analysis on it. User will need to run analysis on each gRNA target to review transcriptome-wide effects on test group vs controls.

Name		Name	Last commit message	Last commit date
Latest commit History 32 Commits
QC		QC
R		R
example_data		example_data
py		py
shell		shell
summary		summary
.gitignore		.gitignore
CROP-seq.Rproj		CROP-seq.Rproj
README.md		README.md

powellgenomicslab/CROP-seq

Folders and files

Latest commit

History

Repository files navigation

CROP-seq

Reference preparation

scRNA-seq Processing and Analysis

Chromium Processing

Guide molecule identification

MiSeq Enrichment

FASTQ Preparation

Mapping

Quantification via reads

Example output

Example output

Quantification via UMIs

Final guide assignments

Analysis

Incorporation of scRNA-seq data and assignments

Differential expression analysis

About

Resources

Stars

Watchers

Forks

Languages