Visualization tool

Visualize distribution of HERV-K repeats in the 1000 genomes dataset at http://www.personal.psu.edu/wul135/visualization/

codes can be downloaded from visualization/ folder.

K-mer base approach to mine patterns and leverage the repetitive nature of sequencing data

Note: [i]: input; [o]: output. Remove "[i]", "[o]" when running scripts.

Generate n/T Matrix

Step 1: Generate reference k-mers (k=50)

(The final output file [unique.withrc.50.fa] can also be downloaded from https://www.dropbox.com/s/hobrefa3l4hza76/unique.withrc.50.fa?dl=0)

$ perl generate.pl [i]HERVK_alleles/  [o]kmer50pervirus/ 50  
$ perl labels.pl [o]total_withlabel_50.fa [i]kmer50pervirus/ 
$ perl unique.pl [i]total_withlabel_50.fa [o]unique.50.fa 
$ perl uniquewithrc.pl [i]unique.50.fa [o]unique.withrc.50.fa

where kmer50pervirus/ is a temporary directory.

Step 2: Calculate T

(The final output [T.50] is in the Github.)

$ perl t.pl HERVK_alleles/ [o]tscript.50.pbs 50

where tscript.50.pbs is a pbs script can be run in a server.

Step 3: Experiments: mapping to the k-mer references

--- (Note: To use this tool, you can start from here.) ----

--- (1. download k-mer (k=50) reference [unique.withrc.50.fa] from Step 1.) ---

--- (2. downlad T values [T.50] for all HERV-K from Step 2. ---

--- (3. download and convert your data into 50-mer fasta file (e.g, 3.1), then do exact match as in 3.2) ---

3.1 donwload and convert data

3.1.1 download bam files from 1000 Genomes Project.

3.1.2 use the bed file to extract mapped reads based on the corresponding coordinates.

samtools view -b -L *.bed completed.bam > extracted.bam

"*.bed" is the input bed file containing HERV-K coordinates. They can be downloaded in 'bed files/' for hg19 and hg38:

HERVK_int.all.25oct2016.hg19.bed (hg19 build)

HERVK_hg38_sort_01apr2018.bed (hg38 build)

Note to GRCh37 and GRCh38 users: GRCh38 and GRCh37 have different naming scheme of chromosomes and the ALT contigs and the bed files should be changed to adapt.

3.1.3 convert bam files to fasta files, then kmerize to *.50.fa with k=50

(e.g, using dsk tool: http://minia.genouest.org/dsk/ ).

$ ./dsk -file sample.fa -kmer-size 50 -abundance-min 0
$ ./dsk2ascii -file sample.h5 -out sample.txt
$ perl tofasta.pl sample.txt sample.50.fa

Example files [dataset_samples/] can be downloaded as follows:

https://www.dropbox.com/sh/z1uhuavavywjpz3/AADgbiJyN2zeBYOq1wlR2Tsoa?dl=0

3.2 exact match data to the reference, then genereate the n/T matrix

$ perl findmatch.pl [i]unique.withrc.50.fa [i]dataset_samples/ [o]hashlabels_1000g_50/ 50 HG00096 HG00097 HG00099 HG00100  
$ perl labelcount.pl [i]hashlabels_1000g_50/HG00096.dat [i]sortedSites [o]hashlabels_1000g_50/HG00096.label
$ perl concate_tomatrix.pl [i]T.50 [i]peopleIDs_sample [i]hashlabels_1000g_50/ [o]mat.1kg.50.dat  
// peopleIDs is the full list for KGP dataset

where hashlabels_1000g_50/ is a temporary directory, and HG00096 HG00097 HG00099 HG00100 are sample names.

For Low Depth Data

Note: only need to implement this step when the sequencing depth of the input data is low. (i.e, < 20x).

Mixture model: clustering/analysis.R

To assign a label (0:absence, 1:presence, 2:solo-LTR):

5.1 retrieve the mapped k-mers.

perl checksolo.pl [i]chr8_735_cluster1.txt [i]unique.withrc.50.fa chr8_7355397_7364859 [i]dataset_samples [i]hashlabels_1000g_50 [o]chr8_7355397_7364859_cluster1

where chr8_735_cluster1.txt includes all individual's names in this cluster, and chr8_7355397_7364859 is the HERV-K coordinate.

5.2 align the mapped unique k-mers to the corresponding alleles (with the help of alignment tools).

An example can be checked in the supplementary in our paper. (S3 Fig. Alignment of unique k-mers to HERV-K at chr12: 55727215.)

Name		Name	Last commit message	Last commit date
Latest commit History 153 Commits
HERVK_alleles		HERVK_alleles
bed files		bed files
clustering		clustering
figures		figures
finalunique50_may		finalunique50_may
processT.50.2017may		processT.50.2017may
visualization		visualization
README.md		README.md
T.50		T.50
checksolo.pl		checksolo.pl
concate_tomatrix.pl		concate_tomatrix.pl
findmatch.pl		findmatch.pl
generate.pl		generate.pl
labelcount.pl		labelcount.pl
labels.pl		labels.pl
peopleIDs		peopleIDs
peopleIDs_example		peopleIDs_example
rewriteheader.pl		rewriteheader.pl
sortedSites		sortedSites
t.pl		t.pl
tofasta.pl		tofasta.pl
tscript.50.pbs		tscript.50.pbs
unique.pl		unique.pl
uniquewithrc.pl		uniquewithrc.pl

lwl1112/polymorphicHERV

Folders and files

Latest commit

History

Repository files navigation

Visualization tool

K-mer base approach to mine patterns and leverage the repetitive nature of sequencing data

Generate n/T Matrix

Step 1: Generate reference k-mers (k=50)

Step 2: Calculate T

Step 3: Experiments: mapping to the k-mer references

3.1 donwload and convert data

3.2 exact match data to the reference, then genereate the n/T matrix

For Low Depth Data

About

Resources

Stars

Watchers

Forks

Languages