## INDEL notation converter from "-" notation to proper VCF (including leading base).

#### For info please contact me: or.yaacov@mail.huji.ac.il

##### Dependencies:
R (3+), 
Bioconductor (packages: BSgenome, BSgenome.Hsapiens.UCSC.hg19, Biostrings)

Takes a 5-column TSV file with no header (chr, pos, name, ref, alt):
>chr1	20996757	NULL	T	- <br>
>chr1	20996257	NULL	TT	- <br>
>chr1	20996457	NULL	-	TT <br>
>chr1	20996457	NULL	-	T <br>
>chr1	20996457	NULL	A	G<br>

Converts to:
>chr1	20996756	NULL	AT	A<br>
>chr1	20996255	NULL	AGTT	AG<br>
>chr1	20996455	NULL	GA	GATT<br>
>chr1	20996456	NULL	A	AT<br>
>chr1	20996457	NULL	A	G<br>
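
To try the notebook on the example above, the five input rows can be written to a file first. This is a minimal sketch in base R; `example_rows`, `example_path` and `example_input` are hypothetical names, and the file is placed in a temporary directory rather than at the path used later in the notebook. It reads the file back with the same `read.table` arguments the notebook uses.

```r
# Write the five example rows to a tab-separated file (no header),
# then read them back the same way the notebook does.
example_rows <- c(
  "chr1\t20996757\tNULL\tT\t-",
  "chr1\t20996257\tNULL\tTT\t-",
  "chr1\t20996457\tNULL\t-\tTT",
  "chr1\t20996457\tNULL\t-\tT",
  "chr1\t20996457\tNULL\tA\tG"
)
example_path <- file.path(tempdir(), "hyphen.vcf")  # hypothetical location
writeLines(example_rows, example_path)

example_input <- read.table(
  file = example_path,
  col.names = c("chr", "pos", "name", "ref", "alt"),
  # Force ref/alt to character so a lone "T" is not parsed as logical TRUE
  colClasses = c("ref" = "character", "alt" = "character")
)
example_input
```

Forcing `ref` and `alt` to character matters here: without it, a column containing only `T` and `F` values could be read as logical.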

In [1]:
#install.packages("BiocManager")
#BiocManager::install(c("BSgenome", "BSgenome.Hsapiens.UCSC.hg19", "Biostrings" ))
library("Biostrings")
library("BSgenome")
library("BSgenome.Hsapiens.UCSC.hg19")

Set the input and output file paths below, then run all cells (Ctrl+Enter).

In [2]:
filePath = "hyphen.vcf"
outPut = "leading.vcf"

This function looks up bases in hg19 given:
chr (e.g. "chr1"), position, and len, the number of additional bases to return after the first (default is a single base, meaning len=0)

In [3]:
findbase <- function(chr, pos, len=0) {
  letter <- toString(Hsapiens[[chr]][pos:(pos+len)])
  return(letter)
}
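
The indexing used by `findbase` (1-based, inclusive on both ends, with `len` counting the extra bases) can be illustrated without Bioconductor. The sketch below is a base-R analogue on a made-up sequence; `toy_seq` and `find_base_toy` are hypothetical names and are not part of the notebook.

```r
# Toy stand-in for a chromosome; in the notebook the sequence comes from hg19.
toy_seq <- "ACGTTGCA"

# Hypothetical base-R analogue of findbase(): return the bases from
# pos to pos + len (1-based, inclusive on both ends).
find_base_toy <- function(seq, pos, len = 0) {
  substr(seq, pos, pos + len)
}

find_base_toy(toy_seq, 3)           # one base, at position 3
find_base_toy(toy_seq, 3, len = 2)  # three bases, positions 3 to 5
```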

In [4]:
#Read the table and print it
read1 <- read.table(file = filePath, col.names = c("chr", "pos","name", "ref", "alt")
                    , colClasses=c("ref"="character", "alt"="character"))
read1

chr,pos,name,ref,alt
chr1,20996757,,T,-
chr1,20996257,,TT,-
chr1,20996457,,-,TT
chr1,20996457,,-,T
chr1,20996457,,A,G


In [5]:
#Convert the format and print it
for (i in seq_len(nrow(read1))) {
    alt <- read1[i,5]
    ref <- read1[i,4]
    pos <- read1[i,2]
    chr <- read1[i,1]
    if (alt == "-") {
        #Deletion: prepend the reference bases that precede the deleted sequence
        Len <- nchar(ref)
        leadingPos <- pos - Len
        leading <- findbase(toString(chr), leadingPos, len=(Len-1))
        read1[i,2] <- leadingPos
        read1[i,4] <- paste(leading, ref, sep="")
        read1[i,5] <- leading
    } else if (ref == "-") {
        #Insertion: prepend the reference bases that precede the insertion point
        Len <- nchar(alt)
        leadingPos <- pos - Len
        leading <- findbase(toString(chr), leadingPos, len=(Len-1))
        read1[i,2] <- leadingPos
        read1[i,4] <- leading
        read1[i,5] <- paste(leading, alt, sep="")
    }
    #Rows without "-" (e.g. substitutions) are left unchanged
}
read1

chr,pos,name,ref,alt
chr1,20996756,,AT,A
chr1,20996255,,AGTT,AG
chr1,20996455,,GA,GATT
chr1,20996456,,A,AT
chr1,20996457,,A,G
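
The rule applied by the loop above can be sketched as a small function on a toy sequence, which makes it easy to check by hand. This is an illustration only: `toy_seq` and `convert_indel_toy` are hypothetical names, and base-R `substr` stands in for the hg19 lookup. As in the loop, the number of leading bases equals the length of the inserted or deleted sequence.

```r
# Toy stand-in for a chromosome (positions 1..8).
toy_seq <- "ACGTTGCA"

# Hypothetical re-statement of the loop body for a single variant.
convert_indel_toy <- function(seq, pos, ref, alt) {
  if (alt == "-") {
    # Deletion: take nchar(ref) bases immediately before pos as the anchor
    n <- nchar(ref)
    lead <- substr(seq, pos - n, pos - 1)
    list(pos = pos - n, ref = paste0(lead, ref), alt = lead)
  } else if (ref == "-") {
    # Insertion: take nchar(alt) bases immediately before pos as the anchor
    n <- nchar(alt)
    lead <- substr(seq, pos - n, pos - 1)
    list(pos = pos - n, ref = lead, alt = paste0(lead, alt))
  } else {
    # Substitution: nothing to convert
    list(pos = pos, ref = ref, alt = alt)
  }
}

del <- convert_indel_toy(toy_seq, 4, "TT", "-")  # delete the TT at positions 4-5
ins <- convert_indel_toy(toy_seq, 3, "-", "A")   # insert A before position 3
snv <- convert_indel_toy(toy_seq, 3, "G", "T")   # plain substitution
```

In this toy case the deletion becomes position 2 with ref `CGTT` and alt `CG`, and the insertion becomes position 2 with ref `C` and alt `CA`.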


In [14]:
#Save output (tab-separated, no header or row names):
write.table(read1, file = outPut, sep = "\t", quote = FALSE, col.names = FALSE, row.names = FALSE)