CRISPRi-lib-pipe

Michael Jahn, Kiyan Shabestary

Description

Pipeline to process CRISPRi library sequencing data.

Related publications

Yao et al., Pooled CRISPRi screening of the cyanobacterium Synechocystis sp PCC 6803 for enhanced industrial phenotypes, Nature Communications, 2020. Preprint is available at BioRxiv.org.

Prerequisites

bs-cp tool from Illumina (optional)
bcl2fastq for NGS file conversion (optional)
sickle, bowtie2, and samtools >= 1.14
R >= 4.0, packages DESeq2, DescTools, tidyverse, limma,scales
sequencing data in fastq.gz format (gzip compressed)
sgRNA library reference file in fasta format to assign reads

On Ubuntu flavored linux you can install the main tools from the official repositories:

sudo apt install sickle
sudo apt install bowtie2
sudo apt install samtools

Important note: A bug in the outdated samtools version 1.10 available through apt leads to false calculation of read numbers. It is recommended to install the LATEST versions of bowtie2 and samtools from their respective repositories.

Usage

Step 1: Retrieving data from Illumina basespace via command line (optional)

Data in form of *.fastq files can be manually downloaded from the basespace website on MacOS or Windows. For Linux systems, only the command line option is available via Illumina's basespace client bs-cp. Files are in Illumina's proprietary format. Execute the following line in a terminal and replace <your-run-ID> with the number you will find in the URL of your browser. For example, log in to basespace, navigate to runs, select a sequencing run and copy the ID you find in the URL: https://basespace.illumina.com/run/200872678/details.

bs-cp -v https://basespace.illumina.com/Run/<your-run-ID> /your/target/directory/

The data must then be converted to *.fastq (plain text) files using Illumina's bcl2fastq tool. It is recommended to run it with option --no-lane-splitting in order to obtain one file per sample, instead of several files subdivided by lane. If it complains about indices being too similar to demultiplex, the option --barcode-mismatches 0 can be added.

cd /your/target/directory/
bcl2fastq --no-lane-splitting

The gzipped *.fastq.gz files will be stored in ./Data/Intensities/BaseCalls/. To merge replicates of the same sample into a new *.fastq.gz file (usually not required!), run the following script. The script merges files matching a certain pattern into a single new file. Input and output folder can be specified with optional parameters (the default is current directory ./). New file names are truncated to the part preceding the variable pattern (all characters trailing the pattern are ignored).

input_dir - input directory
output_dir - output directory
file_ext - file extension of the target files (default: fastq.gz)
pattern - all files matching this pattern will be merged (default: _L00[1-4]_)

source/merge_fastq_files.sh --input_dir data/fastq/ --output_dir data/fastq/ --pattern _R.

fastq.gz files can be manually inspected in the terminal by using the following command. This can be useful to determine read length and format of the downloaded files.

zcat file.fastq.gz | head -n 1

Step 2: Pipeline for read trimming and mapping to reference

This script filters reads using sickle, and maps them to the sgRNA library reference. The script takes the following (optional) input parameters:

input_dir - input directory (default ./)
output_dir - output directory (default ./)
pattern - the file name pattern to look for (default .fastq.gz)
read_length - expected read length for sickle (default: 51)
ref_file - reference library file for reads assignment (default: ./ref/Synechocystis_v2.fasta)

The following example processes fastq.gz files from the data/fastq/) directory. Output are filtered fastq.gz files, .bam alignment files, and counts.tsv summary tables, one for each input file.

source/map_reads.sh --input_dir data/fastq/ --output_dir data/output/

Step 3: Quantify differential abundance (DESeq2) and calculate fitness score

This step summarizes count tables per sample to one main table, adds statistical metrics for a pairwise sample comparison using DESeq2, and calculates fitness scores per gene and condition. Mandatory inputs are 1) the counts.tsv files from the previous step, and a tab-separated sample annotation table, metadata.tsv. The structure of this table is shown here:

file_name                   condition replicate date        time group reference_group
Example_S1_L001_R1.fastq.gz cond_A            1 2021-03-26     0     1               1
Example_S1_L001_R2.fastq.gz cond_A            2 2021-03-26     0     1               1
Example_S2_L001_R1.fastq.gz cond_B            1 2021-03-26     0     2               1
Example_S2_L001_R2.fastq.gz cond_B            2 2021-03-26     0     2               1

The entries in the file_name column correspond to the names of the fastq.gz/counts.tsv files. Two more columns are particularly important, group and reference_group. These columns define the condition to which each sample belongs, and to which this condition should be compared to, in order to obtain statistics like log2 fold change and p-value. The time column is used to calculate fitness score over a number of generations within one condition. Other time units are possible too as long as the input is numeric. If only one time point is supplied for all samples, fitness calculation is omitted. The zero time point should normally be identical with the reference condition. Fitness F of a gene g is calculated as the area under the curve of log2 fold change over time t, normalized by total duration of the experiment.

$F_g = \frac{AUC(t, log_2 FC)}{max(t)/2}$

For an ideal gene depleting or enriching at a constant rate, the fitness score equals the log2 fold change at the final time point.

The main output from the pipeline is a table with statistics for each sgRNA. By default, the Rdata format is chosen for its memory efficiency (DESeq2_result.Rdata). Additionally, summary plots for number of reads per sample, per sgRNA, total quantified sgRNAs per sample, and a volcano plot for log2 FC versus negative log10 p-value are exported in .png and .pdf format. All files are saved to the folder specified with --counts_dir. The script takes the following (optional) input parameters:

metadata_dir - metadata directory (default ./)
counts_dir - counts directory (default ./)
normalization - optional argument to specify if count matrix should be normalized between conditions, but separated by time points. Possible values are none (default), quantile or cyclicloess. These are passed down to normalizeBetweenArrays() from package limma
gene_fitness - if sgRNA and gene fitness should be calculated or not (default: False). If True more than one time point must be provided (fitness is AUC of log2FC over time)
gene_sep - separator in sgRNA strings between sgRNA name and position (example: abc|22 is separated by the pipe operator \\|, the default). Optional, only important in combination with --gene_fitness True
output_format - File format for the final result table, can be one of rdata (the default), csv or tsv

source/calculate_gene_fitness.sh --metadata_dir data/fastq/ --counts_dir data/output/ --gene_fitness True

Expected output

A table with the following columns.

Column	Type	Example	Comment
sgRNA	`chr`	`aat	111`
sgRNA_target	`chr`	aat	name of sgRNA target gene
sgRNA_position	`numeric`	111	position of sgRNA relative to target start
condition	`chr`	example	experimental condition
date	`chr`	2021_01_09	experiment date
time	`numeric`	0	time / n generations, important for fitness calculation
group	`numeric`	1	group number for sample
reference_group	`numeric`	1	group number of reference for comparison
baseMean	`numeric`	`NA`	DESeq2 average number of reads for sgRNA
log2FoldChange	`numeric`	0	DESeq2 log2 FC for sgRNA
lfcSE	`numeric`	0	DESeq2 log2 FC error for sgRNA
stat	`numeric`	`NA`	DESeq2 t statistic for sgRNA
pvalue	`numeric`	1	DESeq2 p-value for sgRNA
padj	`numeric`	1	DESeq2 adjusted p-value for sgRNA
fitness	`numeric`	2.020183	fitness for sgRNA
sgRNA_index	`numeric`	4	relative position of sgRNA
sgRNA_correlation	`numeric`	0.6247412	correlation of sgRNA with others
sgRNA_efficiency	`numeric`	0.9893041	relative repression efficiency of sgRNA
wmean_log2FoldChange	`numeric`	0	weighted mean log2 FC for gene
sd_log2FoldChange	`numeric`	0	standard deviation of log2 FC for gene
wmean_fitness	`numeric`	1.777574	weighted mean fitness for gene
sd_fitness	`numeric`	0.9558989	standard dev of fitness for gene

Example of graphical summary

Total number of mapped reads	Distribution of log10 number of reads, per sgRNA

Top 10 sgRNA mutants by read number	Volcano plot, log2 FC vs -log10 p-value

Name		Name	Last commit message	Last commit date
Latest commit History 42 Commits
data		data
ref		ref
source		source
.gitignore		.gitignore
CRISPRi_lib_principle.png		CRISPRi_lib_principle.png
LICENSE		LICENSE
README.html		README.html
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CRISPRi-lib-pipe

Description

Related publications

Prerequisites

Usage

Step 1: Retrieving data from Illumina basespace via command line (optional)

Step 2: Pipeline for read trimming and mapping to reference

Step 3: Quantify differential abundance (DESeq2) and calculate fitness score

Expected output

Example of graphical summary

About

Releases

Packages

Contributors 2

Languages

License

m-jahn/CRISPRi-lib-pipe

Folders and files

Latest commit

History

Repository files navigation

CRISPRi-lib-pipe

Description

Related publications

Prerequisites

Usage

Step 1: Retrieving data from Illumina basespace via command line (optional)

Step 2: Pipeline for read trimming and mapping to reference

Step 3: Quantify differential abundance (DESeq2) and calculate fitness score

Expected output

Example of graphical summary

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages