SHARE-seq-alignment v1.0

Pipeline for demultiplexing and aligning both ATAC and RNA data generated in SHARE-seq
The pipeline is updated to a newer version. Please refer to V2 for future usage.
https://github.com/masai1116/SHARE-seq-alignmentV2/
Author: Sai Ma. sai@broadinstitute.org

Important note

The ATAC and RNA barcodes would in the format of R1.xx,R2.xx,R3.xx,P1.xx, where xx is in the range of 01-96.
The P1.xx indicates the primers that were used to amplify each ATAC and RNA sub-libraries. It is expected to be different between RNA and ATAC assay. The rest part of barcode (R1.xx,R2.xx,R3.xx) would be the same for ATAC and RNA library.
Depending on the downstream analysis pipeline, sometimes the comma in the barcode would be converted to period (R1.xx,R2.xx,R3.xx,P1.xx --> R1.xx.R2.xx.R3.xx.P1.xx).
A barcode translation table indicating the P1.xx used in the assay would be beneficial, when submitting data to GEO.

Installation

This pipeline requires following packages to be properly installed and added to system path: GNU parallel, Bcl2fastq, fastp, zcat, STAR, bowtie2, python2, umi_tools, samtools, picard (2.14.1, newer version may result in error), R, featureCounts, read_distribution.py from RSeQC, bedtools

The SHARE-seq-alignment scripts can be directly downloaded from the github website.
https://github.com/masai1116/SHARE-seq-alignment/

After downloading all scripts, update the general configuration section in main script "Split_seq_example.sh":

myPATH # where the SHARE-seq scripts are installed. e.g. myPATH='/mnt/users/Script/share-seq-github-v1/'
pythohPATH # where python2 is installed e.g. pythohPATH='/usr/bin/python'
picardPATH # where picard is installed e.g. picardPATH='/mnt/bin/picard/picard.jar'

The pipeline also requres gtf files and aligner index files to be download and unziped into the right location.
GTF files can be downloaded here.
Bowtie2 index files (Hg19 and mm10) can be downloaded here.
Assuming SHARE-seq aligment scripts are installed to "/home/SHARE-seq-alignment/", the gtf files should be placed in the "/home/SHARE-seq-alignment/gtf" folder.
The bowtie2 index files should be placed in the "/home/SHARE-seq-alignment/refGenome/bowtie2" folder.
Three sets of index files (hg19, mm10 and hg19-mm10 combined genome) for star aligner should be prepared according to star aligner manual, or downloded from here: hg19, mm10, combined genome.
The unziped index files should be placed in the "/home/SHARE-seq-alignment/refGenome/star/hg19", "/home/SHARE-seq-alignment/refGenome/star/mm10", and "/home/SHARE-seq-alignment/refGenome/star/both", respectively.
The index file for hg19-mm10 combined genome can be downloaded from 10x Genomics website.

How to run the script?

A small set of fastq files for testing are in the test_fastq_nova/ folder Before running, three sections in the main script "Split_seq_example.sh" need to be updated for each run, inlcuding A) paths B) sample configuration C) fastq configuration. After update all specific information in Split_seq_example.sh and config_example.ymal, run the script by ./Split_seq_example.sh

A) paths

rawdir=./test_fastq_nova/ # where the raw data is
dir=~/test/ # where output data will be stored
ymal=./config_example.ymal # where the ymal configuration file is

B) sample configuration

Project=(sp.rna sp.atac.first) # use differnt name for each sample
Type=(RNA ATAC) # ATAC or RNA
Genomes=(hg19 both) # both mm10 hg19
RawReadsPerBarcode and ReadsPerBarcode options are designed to remove barcodes with too few reads and speed up processing.
RawReadsPerBarcode=(10 10) # reads cutoff for the unfiltered bam file. Recommend to use 100 for full run; 10 for QC run
ReadsPerBarcode=(1 1) # reads cutoff for the filtered bam file. Recommend to use 100 for full run, 1 for QC run
keepMultiMapping=(F F) # T or F; default is F. Keep or discard multi-mapping reads

C) fastq configuration

Indexed=F # T or F; defaul is F. Indicate if the index reads are already attached to biological reads. Use F, when started with BCL file.
Start=Fastq_Merge # Bcl or Fastq_Merge (when fastq were generated per run) or Fastq_SplitLane (when fastq were generated per sequencing lane)
Runtype=full # QC or full; QC only analyze 12M reads to get a quick sense of data
Sequencer=Novaseq # Novaseq or Nextseq; miseq or nova-seq has the same sequencing direction, use "Novaseq" for either

RNA-seq options

The pipeline also offers flexible RNA-seq specific options for advanced users.

removeSingelReadUMI=F # T or F; default is F. If T, UMIs with single read will be removed.
keepIntron=T # T or F; default is T. If F, intronic RNA reads will be discarded.
matchPolyT=F # T or F; default is F. If T, will try to find TTTTTT (allowing 1 mis-match) in 11-16 bp position of biological read2. If TTTTTT is not identified, read will be disgarded. Only works if Read2 is longer than 16 bp.
SkipPolyGumi=F # T or F; default is F, pipeline will remove polyG UMIs. If T, pipeline will keep polyG UMIs.
genename=gene_name # gene_name (official gene symbol) or gene_id (ensemble gene name), gene_name is default
refgene=gencode # gencode or genes; gencode is default; genes is UCSC refseq genes; gencode also indcludes nc-RNA

Sample barcode table

SHARE-seq allows mutiplexing samples in one run. We use ymal file to store two levels of sample barcode information, including Round1 hybridization barcode (R1.xx), and PCR barcode (P1.xx, refers to the Ad1.xx primers used in the PCR step). See config_example.ymal as an example. This file needs to be updated for each sample and each sequencing run. When all the 96 barcodes in the Round1 plate are used during the experiment, see config_example2.yaml as an example.

---
Project1:
    Name: sp.atac.first
    Primer:
        - P1.01
        - P1.02
    Round1:
        - R1.05
        - R1.13
        - R1.21
        - R1.29
...

R1.xx can be R1.01, R1.02, ..., R1.96.
P1.xx can be P1.01, P1.02, ..., P1.96.
The detialed information about these barcode can be found in SHARE-seq manuscript.

Pipeline output

After successfully running the pipeline, data for each project will be kept in the folder with the same name as the project. Several QC plots will be generated.
For ATAC, the most important files are the fragment file after removing duplicates and mito reads (.rmdup.cutoff.bed.gz) and a summarize report (.counts.csv.gz).
For RNA, the most important files are UMIxCell matrix (.UMIcounts.csv.gz) and a summarize report (.counts.csv.gz).
This pipeline currently keeps many intermedia files. If preferred, they can be manually removed afterwards.

Read data example

Public SHARE-seq datasets are available at GEO.
A set of fastq files for species mixing experiment can be downloaded here.

Cite us

For more details, please refer to Ma et al. Chromatin Potential Identified by Shared Single-Cell Profiling of RNA and Chromatin, Cell 2020

Name		Name	Last commit message	Last commit date
Latest commit History 72 Commits
JDB_tools		JDB_tools
TSSfiles		TSSfiles
genomeBed		genomeBed
gtf		gtf
refGenome		refGenome
test_fastq_nova		test_fastq_nova
ATAC_species_mixing_plot.R		ATAC_species_mixing_plot.R
README.md		README.md
Read_distribution.R		Read_distribution.R
SampleSheet.csv		SampleSheet.csv
Split_seq_example.sh		Split_seq_example.sh
UMI_gene_perCell_plot.R		UMI_gene_perCell_plot.R
config_example.ymal		config_example.ymal
config_example2.yaml		config_example2.yaml
googleb00e943a6bcb14ed.html		googleb00e943a6bcb14ed.html
hg19.UCSC_RefSeq.bed		hg19.UCSC_RefSeq.bed
index.html		index.html
lib_size_sc_V4_single_species.R		lib_size_sc_V4_single_species.R
lib_size_sc_V4_species_mixing.R		lib_size_sc_V4_species_mixing.R
matchPolyT.py		matchPolyT.py
mm10.UCSC_RefSeq.bed		mm10.UCSC_RefSeq.bed
pfReads_Sai.py		pfReads_Sai.py
pySinglesGetAlnStats_atac.py		pySinglesGetAlnStats_atac.py
pyadapter_trim.py		pyadapter_trim.py
splitProjects_V4_nextseq_V2.py		splitProjects_V4_nextseq_V2.py
splitProjects_V4_novaseq_V2.py		splitProjects_V4_novaseq_V2.py
sum_reads.R		sum_reads.R
updateIndex_Next_core_V2.py		updateIndex_Next_core_V2.py
updateIndex_Next_mid_V2.py		updateIndex_Next_mid_V2.py
updateRGID_Singles_nextseq_V4.py		updateRGID_Singles_nextseq_V4.py
updateRGID_Singles_novaseq_V4.py		updateRGID_Singles_novaseq_V4.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SHARE-seq-alignment v1.0

Important note

Installation

How to run the script?

A) paths

B) sample configuration

C) fastq configuration

RNA-seq options

Sample barcode table

Pipeline output

Read data example

Cite us

About

Releases

Packages

Languages

masai1116/SHARE-seq-alignment

Folders and files

Latest commit

History

Repository files navigation

SHARE-seq-alignment v1.0

Important note

Installation

How to run the script?

A) paths

B) sample configuration

C) fastq configuration

RNA-seq options

Sample barcode table

Pipeline output

Read data example

Cite us

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages