GS-Preprocess

Introduction

GS-Preprocess is a simple, 5-argument pipeline that generates input data for the GUIDEseq Bioconductor package (https://doi.org/doi:10.18129/B9.bioc.GUIDEseq) from raw Illumina sequencer output. For off-target profiling, Bioconductor GUIDEseq only requires a 2-line guideRNA fasta, demultiplexed BAM files of "plus"- and "minus"-strands, and Unique Molecular Index (UMI) references for each read. The latter two are produced by GS-Preprocess.

Compatible libraries are constructed according to GUIDE-seq enables genome-wide profiling of off-target cleavage by CRISPR-Cas nucleases (https://doi.org/10.1038/nbt.3117).

Getting Started

Set Up Sequencing Run

This pipeline is compatible with ANY SEQUENCER and requires NO PRE-CONFIGURATION of the illumina machine. This represents a flexible alternative to https://github.com/aryeelab/guideseq#miseq which requires a pre-configured MiSeq and sample manifest YAML.

Note: Paired-end sequencing should include 8 Index1 (i7) cycles and 16 Index2 (i5) cycles:

R1 | 8 | 16 | R2

adapted from Tsai et al. 2014

Prerequisites

Intended for use on computing clusters

≥32G of RAM allocated to the GS-Preprocess pipeline
Illumina output folder: Download from BaseSpace or directly from any Illumina sequencer after run completion. No demultiplexing or fastq generation necessary!
```
 Run_output_dir_Example
 |-- Config
 |-- Data
 |-- Images
 |-- InterOp
 |-- Logs
 |-- RTAComplete.txt
 |-- RTAConfiguration.xml
 |-- RTALogs
 |-- RTARead1Complete.txt
 |-- RTARead2Complete.txt
 |-- RTARead3Complete.txt
 |-- RTARead4Complete.txt
 |-- Recipe
 |-- RunCompletionStatus.xml
 |-- RunInfo.xml
 |-- RunParameters.xml
 |-- Thumbnail_Images
```
- Illumina-format SampleSheet: https://help.basespace.illumina.com/articles/descriptive/sample-sheet/ This sheet is in .csv format and is commonly used to demultiplex illumina .bcl files (raw sequencer output)
- RunInfo.xml: Contains high-level run information,such as the number of Reads and cycles in the sequencing run. This file is standard output from any illumina sequencer and will automatically populate in any run output folder. RunInfo.xml is found in the top-level output folder of any sequencing run
BWA Index Download: https://support.illumina.com/sequencing/sequencing_software/igenome.html

Dependencies

Add the below dependencies:

bcl2fastq2/2.20.0
python2
R/3.6.0
bwa/0.7.5a
cutadapt/1.9
gcc/8.1.0

Example:

module add bcl2fastq2/2.20.0

Download GS-Preprocess

In cluster working directory

git clone https://github.com/umasstr/GS-Preprocess.git

Prepare the Pipeline

Move into src directory

cd GS-Preprocess/src

Make all files executable

chmod +x *

Workflow

Run the Pipeline

./gs_preprocess.sh -t <number_of_threads> -o </absolute/path/to/output_directory> -r <directory_containing_RunInfo.xml> -s </path/to/SampleSheet.csv> -b </path/to/BWAIndex/genome.fa>

To avoid errors, use absolute paths.

Completion of gs_preprocess.sh generates 2 of 3 inputs needed for Bioconductor GUIDEseq and stores them in working directory (GS-Preprocess/src/)

plus- and minus-strand BAMs
UMIs.txt
guideRNA.fa

Expected Runtime & Resource Usage

Benchmarks for a 10 million read run with 40 uniquely barcoded samples (20 plus and minus strand):

8 cores, 48G RAM
```
  Ruin
```
24 cores, 48G RAM
```
  s
```
24 cores, 48G RAM
```
  s
```

Post GS-Preprocess Notes

guideRNA fasta

Bioconductor GUIDEseq accepts a standard 20bp gRNA sequence in the fasta format.

Open any text editor
Enter in your gRNA name and sequence
Save this 2-line text file with a .fa extension
```
 >gRNA_or_gene_name
 GAGTCCGAGCAGAAGAAGAA
```

Merging BAMs

Certain situations will require user to merge BAM files:

A sequencer with multiple lanes (NEXTseq e.g.) will generate 4 fastq.gz files per sample labeled L001-L004.
Replicate samples with distinct i5 and/or i7 barcodes. Different UMIs do not count as distinct barcodes for this purpose.

If 1 or 2 apply to your output:

	samtools merge -@ <threads> <merged_sample_name.bam> <sample_1.bam> <sample_2.bam> ... <sample_n.bam>

Sample Bioconductor GUIDEseq Input

	library(hash)
	library(GUIDEseq)
	library(BSgenome.Hsapiens.UCSC.hg38)
	library(TxDb.Hsapiens.UCSC.hg38.knownGene)
	library(org.Hs.eg.db)
	
	guideSeqResults <- GUIDEseqAnalysis(
	alignment.inputfile = c(POS_STRAND.bam,NEG_STRAND.BAM),
	umi.inputfile=c("UMIs.txt","UMIs.txt"),
	gRNA.file = gRNA_FILE.fa,
	max.mismatch = 10,
	BSgenomeName = Hsapiens,txdb = TxDb.Hsapiens.UCSC.hg38.knownGene,
	orgAnn = org.Hs.egSYMBOL,
	outputDir= SAMPLE_NAME,
	n.cores.max = NUMBER_THREADS)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

GS-Preprocess

Introduction

Getting Started

Set Up Sequencing Run

adapted from Tsai et al. 2014

Prerequisites

Dependencies

Download GS-Preprocess

Prepare the Pipeline

Workflow

Run the Pipeline

Expected Runtime & Resource Usage

Post GS-Preprocess Notes

guideRNA fasta

Merging BAMs

Sample Bioconductor GUIDEseq Input

Files

README.md

Latest commit

History

README.md

File metadata and controls

GS-Preprocess

Introduction

Getting Started

Set Up Sequencing Run

adapted from Tsai et al. 2014

Prerequisites

Dependencies

Download GS-Preprocess

Prepare the Pipeline

Workflow

Run the Pipeline

Expected Runtime & Resource Usage

Post GS-Preprocess Notes

guideRNA fasta

Merging BAMs

Sample Bioconductor GUIDEseq Input