Skip to content

lwl1112/polymorphicHERV

Repository files navigation

Visualization tool

Visualize distribution of HERV-K repeats in the 1000 genomes dataset at http://www.personal.psu.edu/wul135/visualization/

codes can be downloaded from visualization/ folder.

K-mer base approach to mine patterns and leverage the repetitive nature of sequencing data

Note: [i]: input; [o]: output. Remove "[i]", "[o]" when running scripts.

Generate n/T Matrix

Step 1: Generate reference k-mers (k=50)

(The final output file [unique.withrc.50.fa] can also be downloaded from https://www.dropbox.com/s/hobrefa3l4hza76/unique.withrc.50.fa?dl=0)

$ perl generate.pl [i]HERVK_alleles/  [o]kmer50pervirus/ 50  
$ perl labels.pl [o]total_withlabel_50.fa [i]kmer50pervirus/ 
$ perl unique.pl [i]total_withlabel_50.fa [o]unique.50.fa 
$ perl uniquewithrc.pl [i]unique.50.fa [o]unique.withrc.50.fa 

where kmer50pervirus/ is a temporary directory.

Step 2: Calculate T

(The final output [T.50] is in the Github.)

$ perl t.pl HERVK_alleles/ [o]tscript.50.pbs 50 

where tscript.50.pbs is a pbs script can be run in a server.

Step 3: Experiments: mapping to the k-mer references

--- (Note: To use this tool, you can start from here.) ----

--- (1. download k-mer (k=50) reference [unique.withrc.50.fa] from Step 1.) ---

--- (2. downlad T values [T.50] for all HERV-K from Step 2. ---

--- (3. download and convert your data into 50-mer fasta file (e.g, 3.1), then do exact match as in 3.2) ---

3.1 donwload and convert data

3.1.1 download bam files from 1000 Genomes Project.

3.1.2 use the bed file to extract mapped reads based on the corresponding coordinates.

samtools view -b -L *.bed completed.bam > extracted.bam

"*.bed" is the input bed file containing HERV-K coordinates. They can be downloaded in 'bed files/' for hg19 and hg38:

HERVK_int.all.25oct2016.hg19.bed (hg19 build)

HERVK_hg38_sort_01apr2018.bed (hg38 build)

Note to GRCh37 and GRCh38 users: GRCh38 and GRCh37 have different naming scheme of chromosomes and the ALT contigs and the bed files should be changed to adapt.

3.1.3 convert bam files to fasta files, then kmerize to *.50.fa with k=50

(e.g, using dsk tool: http://minia.genouest.org/dsk/ ).

$ ./dsk -file sample.fa -kmer-size 50 -abundance-min 0
$ ./dsk2ascii -file sample.h5 -out sample.txt
$ perl tofasta.pl sample.txt sample.50.fa   

Example files [dataset_samples/] can be downloaded as follows:

https://www.dropbox.com/sh/z1uhuavavywjpz3/AADgbiJyN2zeBYOq1wlR2Tsoa?dl=0

3.2 exact match data to the reference, then genereate the n/T matrix

$ perl findmatch.pl [i]unique.withrc.50.fa [i]dataset_samples/ [o]hashlabels_1000g_50/ 50 HG00096 HG00097 HG00099 HG00100  
$ perl labelcount.pl [i]hashlabels_1000g_50/HG00096.dat [i]sortedSites [o]hashlabels_1000g_50/HG00096.label
$ perl concate_tomatrix.pl [i]T.50 [i]peopleIDs_sample [i]hashlabels_1000g_50/ [o]mat.1kg.50.dat  
// peopleIDs is the full list for KGP dataset

where hashlabels_1000g_50/ is a temporary directory, and HG00096 HG00097 HG00099 HG00100 are sample names.

For Low Depth Data

Note: only need to implement this step when the sequencing depth of the input data is low. (i.e, < 20x).

  1. Mixture model: clustering/analysis.R
  1. To assign a label (0:absence, 1:presence, 2:solo-LTR):

5.1 retrieve the mapped k-mers.

perl checksolo.pl [i]chr8_735_cluster1.txt [i]unique.withrc.50.fa chr8_7355397_7364859 [i]dataset_samples [i]hashlabels_1000g_50 [o]chr8_7355397_7364859_cluster1

where chr8_735_cluster1.txt includes all individual's names in this cluster, and chr8_7355397_7364859 is the HERV-K coordinate.

5.2 align the mapped unique k-mers to the corresponding alleles (with the help of alignment tools).

An example can be checked in the supplementary in our paper. (S3 Fig. Alignment of unique k-mers to HERV-K at chr12: 55727215.)

About

Mining mapped reads for accurately predicting polymorphic human endogenous retroviruses

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published