GitHub - gstarrett/oncovirus_tools: Suite of tools for the analysis of virus integration in cancer.

Suite of tools for the analysis of viruses involved in cancer with a specific focus on understanding the attributes of integrated viruses.

Workflow

Align reads to reference genome containing host and/or viral sequences using software such as bowtie2 or bwa that will generate a bam file
Input bam file and reference fasta into get_int_loc.pl
- get_int_loc.pl will also output a fastq file of all viral sequence containing pairs and will assemble these using SPAdes
- normVirReads.py can be used at this step to normalize the viral coverage or "copy number" relative to the remaining viral genome (in case of large deletions) and the number of human reads
Prepare SPAdes fastg file for plotting with R using draw_fastg.pl
Annotate SPAdes fastg using annotate_fastg.pl
Plot fastg file using R and ggraph with plot_annotated_fastg.pl
Based on these values and the output of get_int_loc.pl refine integration sites
Get host and virus sequences using the aforemented refined integration sites with get_break_seqs.pl

Requirements

SPAdes
samtools
bedtools

Perl modules (Perl v5.18)

Bio::DB::FASTA
Text::Levenshtein
Bio::Perl
Algorithm::BloomFilter

R packages

ggplot2
ggraph
tidygraph
reshape2

Script descriptions and usage

bootstrap_locations.pl

Randomly get 1000 human chromosomal locations.

bootstrap_microhomology.pl

Randomly get 1000 human and virus integration pairs and calculate the degree of microhomology

annotate_fastg.pl

perl annotate_fastg.pl input.fastg.txt input.blastn.txt > input.fastg.ann.txt

Annotate the fastg data in a tabular format based on given reference sequences using BLAST (blast output is generated from get_int_loc.pl).

draw_fastg.pl

perl draw_fastg.pl input.fastg > input.fastg.txt

Make a tabular text file that can be further annotated or plotted using R.

get_break_seqs.pl

perl get_break_seqs.pl input.int.txt

Get the viral and host sequences overlapping "break" or viral integration sites to assess microhomology and other features.

get_int_loc.pl

perl get_int_loc.pl input.bam "virus_contig_name" ref.fa

Find the likely integration sites of a particular virus in a host genome while extracting all read pairs containing viral sequences identified by a bloom filter built on 32bp k-mers based on the viral reference sequence.

integration_cnvs.pl

Coming soon.

mutual_exclusivity.R

R script that calculated mutual exclusivity/co-occurence from an input matrix, such as for somatic alterations in genes (rows) by patient/sample (columns)

normVirReads.py

python normVirReads.py input.virus.bedgraph input.stats.txt

Example commands to generate the needed input

java -jar picard.jar BamIndexStats I=input.bam > input.stats.txt
samtools view -b input.bam 'virus' | bedtools genomecov -bg -split -ibam stdin > input.virus.bedgraph`

Calculate the fraction of viral genome bases covered and normalize the coverage based on these values as well as number of human reads.

plot_annotated_fastg.pl

perl plot_annotated_fastg.pl input.fastg.ann.txt

Make the pdf file containing the ggraph network plot of fastg data

Specific example for MCPyV

This assumes that the user has already generated a deduplicated and indexed bam alignment against a reference containing the human genome hg38 and MCPyV named hg38_MCPyV.fa. This command needs 4 threads and will need at least 60GB (up to 120GB) of RAM for the BLAST and SPAdes steps.

perl get_int_loc.pl MCC000.bam MCPyV hg38_MCPyV.fa

This command will generate the following files and directories:

MCC000.bam.MCPyV.TIMESTAMP.fastq : unpaired fastq file of all virus-aligned, unaligned, and 25bp virus k-mer containing reads
MCC000.bam.MCPyV.TIMESTAMP.bedpe : paired bed file containing virus-host discordant pair positions and strands to determine location and orientation of integration event
MCC000.bam.MCPyV.TIMESTAMP.int.txt : Integration summary file with the following columns:
hostChrom : Host chromosome with the integration event
hostStart : Start position of the integration event on the host chromosome
hostEnd : Start position of the integration event on the host chromosome
virChrom : Virus "chromosome" name
virStart : Start position of the integration event on the virus genome
virEnd : End position of the integration event on the virus genome
numPairs : Number of read pairs supporting this integration event
hostSkew : Skewdness of read coverage on the host chromosome over the integration event (only relevant for hybrid capture approaches like ViroPanel). Negative skew indicates that the junction is at the hostStart position, positive skew indicates its at the hostEnd position
virSkew : Skewdness of read coverage on the viral genome over the integration event (only relevant for hybrid capture approaches like ViroPanel). Negative skew indicates that the junction is at the virStart position, positive skew indicates its at the virEnd position
MCC000_spades : Default SPAdes output directory generated from MCC000.bam.MCPyV.TIMESTAMP.fastq
MCC000_spades/assembly_graph.fastg.blastn.txt : BLASTn output of assembly graph
MCC000_spades/mh.10.txt : Counts of microhomology events up in a 10bp window around integration junctions
MCC000_spades/mh.10.seq.txt : Summary of microhomology sequences in a 10bp window around integration junctions

Annotate the MCPyV integration assemblies

perl draw_fastg.pl MCC000_spades/assembly_graph.fastg > MCC000_spades/assembly_graph.fastg.txt
perl annotate_fastg.pl MCC000_spades/assembly_graph.fastg.txt MCC000_spades/assembly_graph.blastn.txt > MCC000_spades/assembly_graph.fastg.ann.txt
perl plot_annotated_fastg.pl MCC000_spades/assembly_graph.fastg.ann.txt

The final command will generate a pdf file with the annotated assembly graph of the integrated virus structure. This can be used to determine number of copies and copy structure.

Normalize viral read abundance by host reads and percent viral genome coverage. This is useful in estimating the number of copies per tumor cell.

java -jar picard.jar BamIndexStats I=MCC000.bam > MCC000.stats.txt
samtools view -b MCC000.bam 'MCPyV' | bedtools genomecov -bg -split -ibam stdin > MCC000.MCPyV.bedgraph
python normVirReads.py MCC000.MCPyV.bedgraph MCC000.stats.txt

Summary file contains the following columns in this order:

Virus name
File name
normCp : Normalized viral genome copies by percent coverage per 1000 human reads
huReads : Total number of human reads
bases : Number viral genome bases covered by reads
avgCov : Number of viral reads over number of viral bases covered
cov : Fraction of the viral genome covered by reads
length : Length of the complete viral genome

Name		Name	Last commit message	Last commit date
Latest commit History 28 Commits
.ropeproject		.ropeproject
.DS_Store		.DS_Store
.gitattributes		.gitattributes
LICENSE		LICENSE
README.md		README.md
annotate_fastg.pl		annotate_fastg.pl
bootstrap_locations.pl		bootstrap_locations.pl
bootstrap_microhomolgy.pl		bootstrap_microhomolgy.pl
draw_fastg.pl		draw_fastg.pl
get_break_seqs.pl		get_break_seqs.pl
get_int_loc.pl		get_int_loc.pl
integration_cnvs.pl		integration_cnvs.pl
mutualExclusivity_functionR.txt		mutualExclusivity_functionR.txt
mutual_exclusivity.R		mutual_exclusivity.R
normVirReads.py		normVirReads.py
oncovirus_tools.png		oncovirus_tools.png
plot_annotated_fastg.pl		plot_annotated_fastg.pl
plot_vir_var_cov.Rscript		plot_vir_var_cov.Rscript

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Workflow

Requirements

Perl modules (Perl v5.18)

R packages

Script descriptions and usage

bootstrap_locations.pl

bootstrap_microhomology.pl

annotate_fastg.pl

draw_fastg.pl

get_break_seqs.pl

get_int_loc.pl

integration_cnvs.pl

mutual_exclusivity.R

normVirReads.py

plot_annotated_fastg.pl

Specific example for MCPyV

About

Releases 1

Packages

Languages

License

gstarrett/oncovirus_tools

Folders and files

Latest commit

History

Repository files navigation

Workflow

Requirements

Perl modules (Perl v5.18)

R packages

Script descriptions and usage

bootstrap_locations.pl

bootstrap_microhomology.pl

annotate_fastg.pl

draw_fastg.pl

get_break_seqs.pl

get_int_loc.pl

integration_cnvs.pl

mutual_exclusivity.R

normVirReads.py

plot_annotated_fastg.pl

Specific example for MCPyV

About

Resources

License

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

Packages