Suite of tools for the analysis of viruses involved in cancer with a specific focus on understanding the attributes of integrated viruses.
- Align reads to reference genome containing host and/or viral sequences using software such as bowtie2 or bwa that will generate a bam file
- Input bam file and reference fasta into
get_int_loc.pl
get_int_loc.pl
will also output a fastq file of all viral sequence containing pairs and will assemble these using SPAdesnormVirReads.py
can be used at this step to normalize the viral coverage or "copy number" relative to the remaining viral genome (in case of large deletions) and the number of human reads
- Prepare SPAdes fastg file for plotting with R using
draw_fastg.pl
- Annotate SPAdes fastg using
annotate_fastg.pl
- Plot fastg file using R and ggraph with
plot_annotated_fastg.pl
- Based on these values and the output of
get_int_loc.pl
refine integration sites - Get host and virus sequences using the aforemented refined integration sites with
get_break_seqs.pl
- SPAdes
- samtools
- bedtools
- Bio::DB::FASTA
- Text::Levenshtein
- Bio::Perl
- Algorithm::BloomFilter
- ggplot2
- ggraph
- tidygraph
- reshape2
Randomly get 1000 human chromosomal locations.
Randomly get 1000 human and virus integration pairs and calculate the degree of microhomology
perl annotate_fastg.pl input.fastg.txt input.blastn.txt > input.fastg.ann.txt
Annotate the fastg data in a tabular format based on given reference sequences using BLAST (blast output is generated from get_int_loc.pl).
perl draw_fastg.pl input.fastg > input.fastg.txt
Make a tabular text file that can be further annotated or plotted using R.
perl get_break_seqs.pl input.int.txt
Get the viral and host sequences overlapping "break" or viral integration sites to assess microhomology and other features.
perl get_int_loc.pl input.bam "virus_contig_name" ref.fa
Find the likely integration sites of a particular virus in a host genome while extracting all read pairs containing viral sequences identified by a bloom filter built on 32bp k-mers based on the viral reference sequence.
Coming soon.
R script that calculated mutual exclusivity/co-occurence from an input matrix, such as for somatic alterations in genes (rows) by patient/sample (columns)
python normVirReads.py input.virus.bedgraph input.stats.txt
Example commands to generate the needed input
java -jar picard.jar BamIndexStats I=input.bam > input.stats.txt
samtools view -b input.bam 'virus' | bedtools genomecov -bg -split -ibam stdin > input.virus.bedgraph`
Calculate the fraction of viral genome bases covered and normalize the coverage based on these values as well as number of human reads.
perl plot_annotated_fastg.pl input.fastg.ann.txt
Make the pdf file containing the ggraph network plot of fastg data
This assumes that the user has already generated a deduplicated and indexed bam alignment against a reference containing the human genome hg38 and MCPyV named hg38_MCPyV.fa. This command needs 4 threads and will need at least 60GB (up to 120GB) of RAM for the BLAST and SPAdes steps.
perl get_int_loc.pl MCC000.bam MCPyV hg38_MCPyV.fa
This command will generate the following files and directories:
- MCC000.bam.MCPyV.TIMESTAMP.fastq : unpaired fastq file of all virus-aligned, unaligned, and 25bp virus k-mer containing reads
- MCC000.bam.MCPyV.TIMESTAMP.bedpe : paired bed file containing virus-host discordant pair positions and strands to determine location and orientation of integration event
- MCC000.bam.MCPyV.TIMESTAMP.int.txt : Integration summary file with the following columns:
- hostChrom : Host chromosome with the integration event
- hostStart : Start position of the integration event on the host chromosome
- hostEnd : Start position of the integration event on the host chromosome
- virChrom : Virus "chromosome" name
- virStart : Start position of the integration event on the virus genome
- virEnd : End position of the integration event on the virus genome
- numPairs : Number of read pairs supporting this integration event
- hostSkew : Skewdness of read coverage on the host chromosome over the integration event (only relevant for hybrid capture approaches like ViroPanel). Negative skew indicates that the junction is at the hostStart position, positive skew indicates its at the hostEnd position
- virSkew : Skewdness of read coverage on the viral genome over the integration event (only relevant for hybrid capture approaches like ViroPanel). Negative skew indicates that the junction is at the virStart position, positive skew indicates its at the virEnd position
- MCC000_spades : Default SPAdes output directory generated from MCC000.bam.MCPyV.TIMESTAMP.fastq
- MCC000_spades/assembly_graph.fastg.blastn.txt : BLASTn output of assembly graph
- MCC000_spades/mh.10.txt : Counts of microhomology events up in a 10bp window around integration junctions
- MCC000_spades/mh.10.seq.txt : Summary of microhomology sequences in a 10bp window around integration junctions
Annotate the MCPyV integration assemblies
perl draw_fastg.pl MCC000_spades/assembly_graph.fastg > MCC000_spades/assembly_graph.fastg.txt
perl annotate_fastg.pl MCC000_spades/assembly_graph.fastg.txt MCC000_spades/assembly_graph.blastn.txt > MCC000_spades/assembly_graph.fastg.ann.txt
perl plot_annotated_fastg.pl MCC000_spades/assembly_graph.fastg.ann.txt
The final command will generate a pdf file with the annotated assembly graph of the integrated virus structure. This can be used to determine number of copies and copy structure.
Normalize viral read abundance by host reads and percent viral genome coverage. This is useful in estimating the number of copies per tumor cell.
java -jar picard.jar BamIndexStats I=MCC000.bam > MCC000.stats.txt
samtools view -b MCC000.bam 'MCPyV' | bedtools genomecov -bg -split -ibam stdin > MCC000.MCPyV.bedgraph
python normVirReads.py MCC000.MCPyV.bedgraph MCC000.stats.txt
Summary file contains the following columns in this order:
- Virus name
- File name
- normCp : Normalized viral genome copies by percent coverage per 1000 human reads
- huReads : Total number of human reads
- bases : Number viral genome bases covered by reads
- avgCov : Number of viral reads over number of viral bases covered
- cov : Fraction of the viral genome covered by reads
- length : Length of the complete viral genome