KuPID: Kmer-based Upstream Preprocessing for Isoform Discovery

Introduction

Eukaryotic genes can encode multiple protein isoforms based on alternative splicing of exonic and intronic regions. More than 95% of human genes have been found to undergo alternative splicing. Alternative isoforms have been implicated in a wide-range of biological processes, such cell differentiation, stress response, and tumorigenesis. Most modern novel isoform discovery methods function by identifying and assembling exon splice junctions from an RNAseq sample. However, splice junctions can only be accurately annotated with time-intensive dynamic programming alignment.

KuPID is a preprocessing method designed for RNAseq analysis of long transcript reads. When given an RNAseq sample, KuPID will filter out the reads expressed from reference (annotated) isoforms. The remaining reads can be submitted to the downstream alignment and isoform discovery software of the user's choice. KuPID operates by applying kmer sketching methods to quickly pseudo-align RNAseq reads to a reference transcriptome. The KuPID-processed samples require less time for downstream alignment, and routinely discover more true novel isoforms.

Install

KuPID is currently available to build from source as a rust crate.

Requirements:

rust programming language and associated tools such as cargo are required and assumed to be in PATH.
A c compiler (e.g. GCC)
make

git clone https://github.com/mboro2000/KuPID.git
cd KuPID

# If default rust install directory is ~/.cargo
cargo install --path . --root ~/.cargo

# If ~/.cargo doesn't exist use below commands instead
#cargo build --release

(back to top)

Input

To function, KuPID requires the following inputs:

RNAseq sample (-i); should be formatted as a .fasta of the sequenced reads
Reference transcriptome (-r); should be formatted of a .fasta of the reference isoforms

Sample Tutorial

To demonstrate KuPID's abilities, we've provided an RNAseq sample of PacBio HiFi reads sequenced from chr1 of the human genome. In addition, we've provided an annotation file of the novel isoforms present in the sample.

To complete the downstream analysis, users should have access to the following software:

minimap2
stringtie2
gffcompare

Download reference data

cd KuPID/reference_data
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_48/gencode.v48.transcripts.fa.gz
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_48/gencode.v48.annotation.gtf.gz
wget https://ftp.ensembl.org/pub/release-114/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.1.fa.gz
gzip -d gencode.v48.transcripts.fa.gz
gzip -d gencode.v48.annotation.gtf.gz
gzip -d Homo_sapiens.GRCh38.dna.chromosome.1.fa.gz
python edit_heading.py Homo_sapiens.GRCh38.dna.chromosome.1.fa "chr1"

Access sample data

cd KuPID/sample
tar -xvzf small_sample.ccs.fasta.tar.gz

Apply KuPID

cd KuPID/src
for mode in discovery quantify;do
cargo run -- -o "small_sample" -r "~/KuPID/reference_data/gencode.v48.transcripts.fa" -i "~/KuPID/sample/small_sample.ccs.fasta" -k 22 -s 0.1 -e 0.002 -n 30 -b 16 -m 3 -B 0.98 -t 1 -c 1.5 -l 5 -z 100 --mode $mode
done

Downstream Alignment and Isoform Discovery

cd KuPID/sample
mkdir minimap2_output
#Align non-processed reads
./minimap2 -ax splice:hq -uf --MD ~/KuPID/reference_data/Homo_sapiens.GRCh38.dna.chromosome.1.fa ~/KuPID/sample/small_sample.ccs.fasta -t 3 -o ~/KuPID/sample/minimap2_output/small_sample.sam
samtools view -S -b ~/KuPID/sample/minimap2_output/small_sample.sam > ~/KuPID/sample/minimap2_output/small_sample.bam
samtools sort ~/KuPID/sample/minimap2_output/small_sample.bam -o ~/KuPID/sample/minimap2_output/small_sample.sorted.bam
samtools index ~/KuPID/sample/minimap2_output/small_sample.sorted.bam

#Align KuPID-processed reads
for mode in discovery quantify;do
./minimap2 -ax splice:hq -uf --MD ~/KuPID/reference_data/Homo_sapiens.GRCh38.dna.chromosome.1.fa ~/KuPID/src/small_sample.$mode.fa -t 1 -o ~/KuPID/sample/minimap2_output/small_sample.$mode.sam
samtools view -S -b ~/KuPID/sample/minimap2_output/small_sample.$mode.sam > ~/KuPID/sample/minimap2_output/small_sample.$mode.bam
samtools sort ~/KuPID/sample/minimap2_output/small_sample.$mode.bam -o ~/KuPID/sample/minimap2_output/small_sample.$mode.sorted.bam
samtools index ~/KuPID/sample/minimap2_output/small_sample.$mode.sorted.bam
done

#Apply stringtie2
./stringtie -L ~/KuPID/sample/minimap2_output/small_sample.sorted.bam -G ~/KuPID/reference_data/gencode.v48.annotation.gtf -o ~/KuPID/sample/stringtie2.small_sample.gtf
for mode in discovery quantify;do
./stringtie -L ~/KuPID/sample/minimap2_output/small_sample.$mode.sorted.bam -G ~/KuPID/reference_data/gencode.v48.annotation.gtf -o ~/KuPID/sample/stringtie2.small_sample.$mode.gtf
done

Scale Quantification Results (Required for KuPID-quantify)

### scale quantification results for KuPID reads
python ~/scale_quantification.py -a ~/KuPID/sample/stringtie2.small_sample.quantify.gtf --method stringtie2 --output ~/KuPID/sample/stringtie2.small_sample.KuPID.scaled_tpm.csv --scale ~/KuPID/src/small_sample.scale_factors.csv -l 5 -p KuPID
### adjust quantifcation results of non-processed reads to only show abundances of known reference transcripts
python ~/scale_quantification.py -a ~/KuPID/sample/stringtie2.small_sample.quantify.gtf --method stringtie2 --output ~/KuPID/sample/stringtie2.small_sample.tpm.csv --scale ~/KuPID/src/small_sample.scale_factors.csv -l 5 -p None

# -a: initial abundance results reported by the chosen quantification method
# -m: ID method used to quantify transcripts. KuPID can currently be paired with IsoQuant, flair, or stringtie2
# -l: number of reads sampled from each annotated isoform during KuPID-quantify

Analyze Discovery Results

cd KuPID/sample
mkdir gffcompare
for trial in stringtie2.small_sample stringtie2.small_sample.discovery stringtie2.small_sample.quantify;do
./gffcompare -r ~/KuPID/reference_data/gencode.v48.annotation.gtf ~/KuPID/sample/$trial.gtf -o ~/KuPID/sample/gffcompare/$trial.comp_to_ref
## Extract the transcript models that are predicted to be novel
python ~/KuPID/sample/predicted_novel.py "/KuPID/sample/gffcompare/$trial.comp_to_ref.tracking" "~/KuPID/sample/$trial.gtf" "~/KuPID/sample/$trial.predicted_novel.gtf"
./gffcompare -r ~/KuPID/sample/small_sample.novel.gtf ~/KuPID/sample/$trial.predicted_novel.gtf -o ~/KuPID/sample/gffcompare/$trial.comp_to_novel
done

### gffcompare will report the precision and accuracy of novel transcript models assembled by stringtie2. To view this output, use:
for trial in stringtie2.small_sample stringtie2.small_sample.discovery stringtie2.small_sample.quantify;do
cat ~/KuPID/sample/gffcompare/$trial.comp_to_novel
done

Analyze Quantification Results

python ~/KuPID/sample/analyze_quantification.py -t ~/KuPID/sample/small_sample.true_tpm.csv -e ~/KuPID/sample/stringtie2.small_sample.tpm.csv -k ~/KuPID/sample/stringtie2.small_sample.KuPID.scaled_tpm.csv

Name		Name	Last commit message	Last commit date
Latest commit History 151 Commits
.github/workflows		.github/workflows
reference_data		reference_data
sample		sample
src		src
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
README.md		README.md
predicted_novel.py		predicted_novel.py
scale_quantification.py		scale_quantification.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

KuPID: Kmer-based Upstream Preprocessing for Isoform Discovery

Introduction

Install

Input

Sample Tutorial

Download reference data

Access sample data

Apply KuPID

Downstream Alignment and Isoform Discovery

Scale Quantification Results (Required for KuPID-quantify)

Analyze Discovery Results

Analyze Quantification Results

About

Uh oh!

Releases

Packages

Languages

mboro2000/KuPID

Folders and files

Latest commit

History

Repository files navigation

KuPID: Kmer-based Upstream Preprocessing for Isoform Discovery

Introduction

Install

Input

Sample Tutorial

Download reference data

Access sample data

Apply KuPID

Downstream Alignment and Isoform Discovery

Scale Quantification Results (Required for KuPID-quantify)

Analyze Discovery Results

Analyze Quantification Results

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages