Identification of tagging single nucleotide polymorphisms


This repo contains scripts that automate the identification of tagging single nucleotide polymorphisms (tagSNPs) with Haploview 4.2, Clustag v2, the gpart R package 1.2.0, and Tagster 1.0. The scripts were tested under Ubuntu 22.04.4 LTS. The paths to the software used by the scripts and to other resources are expected in the res.cfg file located in the folder with the scripts. Each row of the file contains a software name and the path to that software, separated by a comma. Please use the links to the software websites to find installation instructions.
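The res.cfg layout described above (one `name,path` pair per row) can be read with a few lines of Python. This is a minimal sketch, not the repo's own loader: the function name `read_res_cfg` and the tool names in the usage note are hypothetical, and the sketch assumes the file has no header row.

```python
def read_res_cfg(path):
    """Return a {software_name: software_path} dict from a res.cfg-style file."""
    tools = {}
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            # Skip blank rows and rows without a comma separator.
            if not line or "," not in line:
                continue
            # Split on the first comma only, so the path may itself contain commas.
            name, location = line.split(",", 1)
            tools[name.strip()] = location.strip()
    return tools
```

For a file with the rows `haploview,/opt/Haploview.jar` and `clustag,/opt/clustag.jar` (made-up entries), `read_res_cfg("res.cfg")["clustag"]` would give `/opt/clustag.jar`.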

Subset SNPs

  • subsets SNPs from a vcf file, e.g., one downloaded from the 1000 Genomes Project, by genomic coordinates and MAF.
  • subsets SNPs from Plink 1.9 binary files by genomic coordinates and MAF.
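The filtering idea behind these scripts can be sketched in plain Python for an uncompressed vcf. This is an illustration only, not the repo's implementation: the function names are hypothetical, MAF is computed directly from the GT fields, multiallelic records are skipped, and no check is made that a record is a SNP rather than an indel.

```python
def allele_counts(genotypes):
    """Count reference (0) and alternate (1) alleles in VCF sample fields."""
    ref = alt = 0
    for field in genotypes:
        gt = field.split(":", 1)[0]  # GT is the first sub-field when present
        for allele in gt.replace("|", "/").split("/"):
            if allele == "0":
                ref += 1
            elif allele == "1":
                alt += 1  # missing (".") and any other alleles are ignored
    return ref, alt


def subset_snps(vcf_lines, chrom, start, end, min_maf):
    """Yield header lines plus biallelic records in [start, end] with MAF >= min_maf."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        pos = int(fields[1])
        if fields[0] != chrom or not start <= pos <= end:
            continue
        if "," in fields[4]:  # more than one ALT allele
            continue
        ref, alt = allele_counts(fields[9:])
        total = ref + alt
        if total and min(ref, alt) / total >= min_maf:
            yield line
```

A call such as `subset_snps(open("mydata.vcf"), "1", 1000000, 2000000, 0.05)` (file name and coordinates are placeholders) yields the lines of the reduced vcf.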

tagSNP identification

Haploview 4.2

Haploview 4.2 runs haplotype analysis. It also implements Paul de Bakker's Tagger tagSNP selection algorithm. Its command-line options allow it to be used in scripts.

  • vcf2haploview.R converts vcf files into input files for Haploview.
  • a wrapper script to identify tagSNPs with the Tagger algorithm from Haploview.
  • post-processes the TAGS file created by Haploview.
  • defines the indexes of SNPs tagged by the tagSNPs given in the ped.TAGS file created by Haploview.

Clustag 2

Clustag v2 is software that applies hierarchical clustering and graph methods for selecting tagSNPs. It is implemented in Java.

  • 2clustag.R creates input files for Clustag from a vcf file. bcftools should be in the $PATH environment variable.
  • is a wrapper script for running Clustag. It should be launched from the same folder as the input data.
  • post-processes the *.members.txt file created by Clustag and gets the sizes of clusters.
  • post-processes the *.out file created by Clustag and gets the distribution of cluster sizes and the mean values of r2 between the tagSNP and the other SNPs in a cluster.
  • defines the indexes of tagged SNPs given in the *.out file.

Tagster 1.0

Xu et al., 2007

Download Tagster

  • converts a vcf file with phased genotypes into an input file for the Tagster software.
python3 -i path/to/filename.vcf -o path/to/filename -g gene_name



Qin et al., 2006

Other scripts and files

  • computes Bonacich centrality for weighted undirected graphs formed by SNPs in blocks.
  • computes the mean, median, rho and determinant of LD submatrices composed of consecutive SNPs.
  • computes the mean, median, rho and determinant of LD submatrices composed of SNPs that are not necessarily consecutive.
  • contains functions to assist data processing with scripts in Python 3.8.
  • creates an LD matrix in text and HDF5 formats from a hap.ld file with r2 values obtained with vcftools.
python3 path/to/prefix

prefix corresponds to prefix.hap.ld and prefix.vcf files.

The files prefix.ld, prefix.ld.h5 and will be created in the same folder as the input file. contains the list of rs IDs with genomic coordinates extracted from the prefix.vcf file.
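The conversion from a pairwise LD table to a square matrix can be sketched as follows. This is not the repo's script: it assumes the tab-separated layout that vcftools writes for `--hap-r2` (a header row with `POS1`, `POS2` and `R^2` columns), uses a hypothetical function name, keeps the matrix as nested Python lists, and leaves out the HDF5 output.

```python
import csv
import math


def ld_matrix(hap_ld_lines):
    """Return (positions, matrix) with a symmetric r2 matrix built from pairwise rows."""
    rows = list(csv.DictReader(hap_ld_lines, delimiter="\t"))
    # Every position seen in either column gets one row/column of the matrix.
    positions = sorted({int(r[key]) for r in rows for key in ("POS1", "POS2")})
    index = {pos: i for i, pos in enumerate(positions)}
    n = len(positions)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        matrix[i][i] = 1.0  # a SNP is in perfect LD with itself
    for r in rows:
        r2 = float(r["R^2"])
        if math.isnan(r2):
            continue  # undefined pairs stay at 0.0, like pairs absent from the file
        i, j = index[int(r["POS1"])], index[int(r["POS2"])]
        matrix[i][j] = matrix[j][i] = r2
    return positions, matrix
```

Calling `ld_matrix(open("prefix.hap.ld"))` returns the sorted SNP positions together with the matrix; writing it out as text is then a matter of joining each row with tabs.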

  • plot-hist.R plots the histogram of MAF (minor allele frequencies) computed with the Plink 1.9 --freq argument.

  • outputs haplotypes from a phased vcf file. The minor allele is coded as 1 and the major allele as 0.

python3 -i path/to/filename.vcf -o path/to/filename
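The minor/major recoding described above can be illustrated with a short sketch. It is not the repo's script: the function name is hypothetical, and it assumes biallelic sites, fully phased genotypes (`|` separator) and no missing calls. When both alleles are equally frequent, the alternate allele is treated as minor here, which is an arbitrary choice.

```python
def haplotypes(vcf_lines):
    """Return one 0/1 string per haplotype; 1 marks the minor allele at each SNP."""
    columns = []  # one list of recoded alleles per SNP
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        alleles = []
        for sample in fields[9:]:
            gt = sample.split(":", 1)[0]  # GT is the first sub-field
            alleles.extend(gt.split("|"))  # two haplotypes per diploid sample
        # The alternate allele is minor when it covers at most half of the haplotypes.
        minor = "1" if alleles.count("1") * 2 <= len(alleles) else "0"
        columns.append(["1" if a == minor else "0" for a in alleles])
    # Transpose: SNP-wise columns become haplotype-wise strings.
    return ["".join(hap) for hap in zip(*columns)]
```

Each returned string has one character per SNP, and each sample contributes two consecutive strings.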

Other software

gpart R package 1.2.0

The gpart R package is an implementation of the BIG-LD method (Kim et al., 2018), a block partition method based on interval graph modeling of LD bins, which are clusters of SNPs in strong pairwise LD that are not necessarily physically consecutive.

  • vcf2gpart.R converts a vcf file into geno/info files for gpart. The R package VariantAnnotation is required to process the vcf file.
vcf2gpart.R path/to/mydata.vcf path/to/output_folder
  • run-gpart.R is a wrapper script to apply BIG-LD with gpart.
Rscript run-gpart.R <input>

input: path/to/prefix of prefix.{info,geno} files

Use cases

The scripts in this repo were applied in the following studies:

Khrunin, A.V.; Khvorykh, G.V.; Arapova, A.S.; Kulinskaya, A.E.; Koltsova, E.A.; Petrova, E.A.; Kimelfeld, E.I.; Limborska, S.A. The Study of the Association of Polymorphisms in LSP1, GPNMB, PDPN, TAGLN, TSPO, and TUBB6 Genes with the Risk and Outcome of Ischemic Stroke in the Russian Population. Int. J. Mol. Sci. 2023, 24, 6831.

Khrunin, A.V.; Khvorykh, G.V.; Rozhkova, A.V.; Koltsova, E.A.; Petrova, E.A.; Kimelfeld, E.I.; Limborska, S.A. Examination of Genetic Variants Revealed from a Rat Model of Brain Ischemia in Patients with Ischemic Stroke: A Pilot Study. Genes 2021, 12, 1938.

Khvorykh, G., Khrunin, A., Filippenkov, I., Stavchansky, V., Dergunova, L., Limborska, S. A Workflow for Selection of Single Nucleotide Polymorphic Markers for Studying of Genetics of Ischemic Stroke Outcomes. Genes 2021, 12, 328.

Khrunin A.V., Khvorykh G.V., Gnatko E.D., Filippenkov I.B., Stavchansky V.V., Dergunova L.V., Limborska S.A. Study of polymorphism of human genes, orthologues of which are functionally involved in the response to experimental brain ischemia in model systems. Medical Genetics. 2020;19(5):83-85. (In Russ.)