Skip to content

Repository for the processing and analysis of CROP-seq data

Notifications You must be signed in to change notification settings

powellgenomicslab/CROP-seq

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

32 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

CROP-seq

Repository for the processing and analysis of CROP-seq data. This repository contains code to do the following steps for CROP-seq analysis:

  1. Reference preparation
  2. Guide quantification from scRNA-seq data
  3. Guide quantificationf from enriched RNAseq data

Reference preparation

The reference requires the following information:

  1. Use the following script: buildReference.py to build the guide reference.
python buildReference.py -i example_data/GWAS_sgRNAs.csv -c example_data/CROP\-seq.csv -o Test
  1. Concatenate the generated files to your primary assembly files, to create a new set of reference files.
> cat Homo_sapiens.GRCh38.dna.primary_assembly.fa CROPseq.fa > Homo_sapiens.CROPseq.fa
> cat Homo_sapiens.GRCh38.98.chr.gtf CROPseq.gtf > Homo_sapiens.CROPseq.gtf
  1. Convert the files to Cell Ranger reference files using cellranger mkref.
# Path to input files
GENOME_NAME=Homo_sapiens.CROPseq
INPUT_FASTA=Homo_sapiens.CROPseq.fa
INPUT_GTF=Homo_sapiens.CROPseq.gtf
INPUT_FILTERED_GTF=Homo_sapiens.CROPseq.Filtered.gtf

# Filter GTF
cellranger mkgtf ${INPUT_GTF} ${INPUT_FILTERED_GTF}  --attribute=gene_biotype:protein_coding --attribute=gene_biotype:lincRNA --attribute=gene_biotype:antisense || exit 1

# Create STAR reference
cellranger mkref --genome=${GENOME_NAME} --fasta=${INPUT_FASTA} --genes=${INPUT_FILTERED_GTF} || exit 1

scRNA-seq Processing and Analysis

Chromium Processing

This step is usually done by the service provider. Samples need to be mapped to the reference prepared in the previous step.

Data for all arrays need to be combined via the cellranger aggr function, which not only concatenates the count matrices, but also performs a read depth equalization between all samples.

Guide molecule identification

If the reference has worked, you will receive a count table with the guide and control RNA as additional genes in the expression matrix. These can be retrieved using the following steps.

MiSeq Enrichment

FASTQ Preparation

This uses merged paired-end reads as generated by the Galaxy pipeline.

  1. Label reads with combined cell barcode and UMI sequence, located at the first 28 nucleotides of the merged sequences. This can be done with fastp.
fastp -i Galaxy112-[ARRAY1_R1_R2_merge].fastqsanger  -o ARRAY1.fastq -U --umi_loc read1 --umi_len 28  -w 4
  1. Strip polyT tail from the reads using cutadapt.
cutadapt -j 4 -g "T{200}" -o ARRAY1_Processed.fastq ARRAY1.fastq 

This can be run for all samples using the preprocessFastq.sh script.

Mapping

Map guides to the reference generated in the first section using STAR aligner.

STAR --genomeDir $REF_DIR --readFilesCommand cat \
--readFilesIn $INPUT_FILE \
--outSAMtype BAM Unsorted \
--quantMode GeneCounts \
--runThreadN 16 \
--outFileNamePrefix $SAMPLE_NAME \
--outReadsUnmapped Fastx \
--outFilterScoreMinOverLread 0 \
--outFilterMatchNminOverLread 0 \
--outFilterMatchNmin 0 \
--outFilterMismatchNmax 2

This will output a BAM file and a count table that we will use to trace back to the original cells.

Quantification via reads

Run quantifyEnrichment.py. This will process the BAM files (eg. ARRAY1_Aligned.sortedByCoord.out.bam) generated by STAR. These files will need to be indexed and position-sorted to work.

The script can be invoked as follows:

python quantifyEnrichment.py -i ARRAY1_Aligned.sortedByCoord.out.bam -o ARRAY1_EnrichmentReads.tsv

This will generate a spreadsheet that contains the number of reads mapped to each UMI-barcoded molecule containing a gRNA, and the cell it came from.

Example output

gRNA cell_barcode umi reads
GUIDES_sg001_ABCA1 AAACCCATCACTGTTT GTCCAGGGGGT 1
GUIDES_sg001_ABCA1 AAACGAAGTCCAGCGT TATTGAGCATT 1
GUIDES_sg001_ABCA1 AAAGAACTCGGCGATC TTATTTGCGTC 1
GUIDES_sg001_ABCA1 AAAGGGCGTCCCGTGA GGCCGCCCTGC 2
GUIDES_sg001_ABCA1 AAATGGAAGCAGCCAG ACATCAATCAG 1
GUIDES_sg001_ABCA1 AAATGGAGTAGACAGC TACTTATAATG 1
GUIDES_sg001_ABCA1 AACAACCCAATTTCGG AGTTGCTGCGG 1
GUIDES_sg001_ABCA1 AACAACCGTTTATGCG TATATTATCAG 3
GUIDES_sg001_ABCA1 AACAAGAGTATGCTAC GGCGAGATCAT 1

The output must then be processed via the R script [quantifyEnrichment.R][quantifyEnrichment.R], which will parse the output and identify the guides present in each cell based on the MiSeq enrichment data.

Script is invoked as follows:

Rscript quantifyTranscriptome.R ARRAY1_EnrichmentReads.tsv ARRAY1

Guides are assigned to a cell if:

  1. They are the only guides found in a cell
  2. They have more reads than 3x the sum other reads assigned to other guides in the cell AND the guides target a different gene
  3. The guides for a single gene have 3x the sum of reads assigned to other guides in the cell

Example output

cell_barcode gRNA type target
AAAAAAAAGGAGGGGG sg097 GUIDES TIMP3
AAAAAAAAGGGTACGT sg103 GUIDES TRIOBP
AAAAAAGGACGGGGGA sg020 GUIDES ANTXR1
AAAAACTGTCCTTAAG sg045 GUIDES EMID1
AAAAAGATCCAAGGAA NA NA NA
AAAAAGTGAAAGAGAA sg112 GUIDES TEX41

Quantification via UMIs

Run quantifyTranscriptome.R to retrieve gRNA-mapped UMI counts from the Chromium data. Input should be the filtered_feature_bc_matrix found in the Cell Ranger output folder.

You will need to invoke the script as follows:

Rscript quantifyTranscriptome.R AH_array-1/filtered_feature_bc_matrix ARRAY1

This will generate a file with the following information:

cell_barcode gRNA
AAACCCAAGAGTGTGC NA
AAACCCAAGCGCTGAA NA
AAACCCAAGGCACAAC NA
AAACCCAAGTCGTTAC NA
AAACCCAAGTCTGGAG NA
AAACCCACAGAGACTG NA
AAACCCACATAGGAGC NA
AAACCCAGTACTCGTA NA
AAACCCAGTATAGCTC NA
AAACCCAGTTGTTTGG NA
AAACCCATCACTGTTT GUIDES-sg097-TIMP3-gene

Final guide assignments

The processTranscriptome.R script performs two functions:

  1. Integration of assignments from read and transcriptome information
  2. Filter cells based on QC quality and adds gRNA assignments to gene-count information

It is recommended you run this script manually as you will need to review the QC reports to make decisions on which cells to retain.

Analysis

Incorporation of scRNA-seq data and assignments

The aggregated scRNA-seq data was loaded into R via Seurat, and final guide assignments were added to the object metadata with the Prepare_SingleCellData.R script. This step also normalises the dataset using SCTransform.

Differential expression analysis

The script CombinedLRT.R loads the Seurat object generated in the previous step and runs differential expression analysis on it. User will need to run analysis on each gRNA target to review transcriptome-wide effects on test group vs controls.

About

Repository for the processing and analysis of CROP-seq data

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published