Visualize distribution of HERV-K repeats in the 1000 genomes dataset at http://www.personal.psu.edu/wul135/visualization/
codes can be downloaded from visualization/
folder.
Note: [i]: input; [o]: output. Remove "[i]", "[o]" when running scripts.
(The final output file [unique.withrc.50.fa] can also be downloaded from https://www.dropbox.com/s/hobrefa3l4hza76/unique.withrc.50.fa?dl=0)
$ perl generate.pl [i]HERVK_alleles/ [o]kmer50pervirus/ 50
$ perl labels.pl [o]total_withlabel_50.fa [i]kmer50pervirus/
$ perl unique.pl [i]total_withlabel_50.fa [o]unique.50.fa
$ perl uniquewithrc.pl [i]unique.50.fa [o]unique.withrc.50.fa
where kmer50pervirus/ is a temporary directory.
(The final output [T.50] is in the Github.)
$ perl t.pl HERVK_alleles/ [o]tscript.50.pbs 50
where tscript.50.pbs
is a pbs script can be run in a server.
--- (Note: To use this tool, you can start from here.) ----
--- (1. download k-mer (k=50) reference [unique.withrc.50.fa] from Step 1.) ---
--- (2. downlad T values [T.50] for all HERV-K from Step 2. ---
--- (3. download and convert your data into 50-mer fasta file (e.g, 3.1), then do exact match as in 3.2) ---
3.1.1 download bam files from 1000 Genomes Project.
3.1.2 use the bed file to extract mapped reads based on the corresponding coordinates.
samtools view -b -L *.bed completed.bam > extracted.bam
"*.bed" is the input bed file containing HERV-K coordinates. They can be downloaded in 'bed files/' for hg19 and hg38:
HERVK_int.all.25oct2016.hg19.bed (hg19 build)
HERVK_hg38_sort_01apr2018.bed (hg38 build)
Note to GRCh37 and GRCh38 users: GRCh38 and GRCh37 have different naming scheme of chromosomes and the ALT contigs and the bed files should be changed to adapt.
3.1.3 convert bam files to fasta files, then kmerize to *.50.fa with k=50
(e.g, using dsk tool: http://minia.genouest.org/dsk/ ).
$ ./dsk -file sample.fa -kmer-size 50 -abundance-min 0
$ ./dsk2ascii -file sample.h5 -out sample.txt
$ perl tofasta.pl sample.txt sample.50.fa
Example files [dataset_samples/] can be downloaded as follows:
https://www.dropbox.com/sh/z1uhuavavywjpz3/AADgbiJyN2zeBYOq1wlR2Tsoa?dl=0
$ perl findmatch.pl [i]unique.withrc.50.fa [i]dataset_samples/ [o]hashlabels_1000g_50/ 50 HG00096 HG00097 HG00099 HG00100
$ perl labelcount.pl [i]hashlabels_1000g_50/HG00096.dat [i]sortedSites [o]hashlabels_1000g_50/HG00096.label
$ perl concate_tomatrix.pl [i]T.50 [i]peopleIDs_sample [i]hashlabels_1000g_50/ [o]mat.1kg.50.dat
// peopleIDs is the full list for KGP dataset
where hashlabels_1000g_50/ is a temporary directory, and HG00096 HG00097 HG00099 HG00100 are sample names.
Note: only need to implement this step when the sequencing depth of the input data is low. (i.e, < 20x).
- Mixture model: clustering/analysis.R
- To assign a label (0:absence, 1:presence, 2:solo-LTR):
5.1 retrieve the mapped k-mers.
perl checksolo.pl [i]chr8_735_cluster1.txt [i]unique.withrc.50.fa chr8_7355397_7364859 [i]dataset_samples [i]hashlabels_1000g_50 [o]chr8_7355397_7364859_cluster1
where chr8_735_cluster1.txt includes all individual's names in this cluster, and chr8_7355397_7364859 is the HERV-K coordinate.
5.2 align the mapped unique k-mers to the corresponding alleles (with the help of alignment tools).
An example can be checked in the supplementary in our paper. (S3 Fig. Alignment of unique k-mers to HERV-K at chr12: 55727215.)