## Illumina catalog to BED file conversion
Hope Tanudisastro | Jan 9, 2022

This notebook converts Illumina's 174k-loci catalog into a BED file that can be passed to Tandem Repeats Finder to extract pure repeats. Complex repeats and off-target regions are left out of the BED file.

### Importing variant catalog file from Illumina

In [1]:
import json, csv, re
f = open('Illuminavariant_catalog.json')

In [2]:
data = json.load(f)
f.close()  # the file handle is no longer needed once the JSON is parsed

### Extracting coordinates and motif

In [3]:
coordinates = []
motif = []
for i in data:
    coordinates.append(i['ReferenceRegion'])
    motif.append(i['LocusStructure'])   
print(len(data))

174293


Check how many loci have complex (multiple) repeat structures, and print their coordinates and motifs.

In [4]:
for i in coordinates:
    if isinstance(i, list):  # complex repeats store a list of regions instead of a single string
        print(i)

['chr3:63912684-63912714', 'chr3:63912714-63912726']
['chr13:70139353-70139383', 'chr13:70139383-70139428']
['chr3:129172576-129172656', 'chr3:129172656-129172696', 'chr3:129172696-129172732']
['chr9:69037261-69037286', 'chr9:69037286-69037304']
['chr4:3074876-3074933', 'chr4:3074939-3074966']
['chr20:2652733-2652757', 'chr20:2652757-2652775']


In [5]:
for j in motif: 
    if j.count("(")>=2:
        print(j)

(GCA)*(GCC)+
(CTA)*(CTG)*
(CAGG)*(CAGA)*(CA)*
(A)*(GAA)*
(CAG)*CAACAG(CCG)*
(GGCCTG)*(CGCCTG)*
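
The two checks above (a list-valued `ReferenceRegion`, and two or more parenthesised groups in `LocusStructure`) flag the same six loci. A minimal sketch of comparing both criteria entry by entry is shown below. The first toy entry is hypothetical; the other two reuse values printed above, and the helper names are illustrative only.

```python
# Toy entries: the first is hypothetical, the other two reuse values printed above.
toy_catalog = [
    {"LocusStructure": "(AC)*", "ReferenceRegion": "chr1:1000-1020"},
    {"LocusStructure": "(GCA)*(GCC)+",
     "ReferenceRegion": ["chr3:63912684-63912714", "chr3:63912714-63912726"]},
    {"LocusStructure": "(CAG)*CAACAG(CCG)*",
     "ReferenceRegion": ["chr4:3074876-3074933", "chr4:3074939-3074966"]},
]

def is_complex_by_region(entry):
    """Complex loci list several reference regions."""
    return isinstance(entry["ReferenceRegion"], list)

def is_complex_by_motif(entry):
    """Complex loci contain two or more parenthesised repeat units."""
    return entry["LocusStructure"].count("(") >= 2

# One (by_region, by_motif) pair per entry; the two values should agree.
flags = [(is_complex_by_region(e), is_complex_by_motif(e)) for e in toy_catalog]
print(flags)
```

If the two criteria ever disagreed for an entry, that locus would deserve a manual look before being kept or dropped.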


Extract indices of complex repeats

In [6]:
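complex_repeats_index = []
# enumerate gives each position directly; list.index() rescans the list and returns only the first match
for idx, region in enumerate(coordinates):
    if isinstance(region, list):
        complex_repeats_index.append(idx)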

In [7]:
print(complex_repeats_index)

[7, 8, 12, 17, 20, 23]


### Remove complex repeats 

In [8]:
#create a new motif array without the complex repeats
motif_without_complex_repeats = []
for i in range(len(motif)):
    if i not in complex_repeats_index: 
        motif_without_complex_repeats.append(motif[i])

In [9]:
print(len(motif_without_complex_repeats))

174287


### Clean up and prepare for GangSTR catalog format 

#### Remove regular expression characters from motif definition

In [10]:
for i in range(len(motif_without_complex_repeats)):
    line = motif_without_complex_repeats[i]
    # strip the regex characters ( ) * + so that only the motif sequence remains
    motif_without_complex_repeats[i] = re.sub('[()*+]', '', line)
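
As a small illustration of what this substitution does, the sketch below wraps it in a helper (the name `strip_regex_chars` is hypothetical) and applies it to structures that appear in this notebook's output. It also shows why complex repeats have to be removed first: their units would be concatenated into a single misleading motif.

```python
import re

def strip_regex_chars(locus_structure):
    """Remove the characters ( ) * + from a LocusStructure expression."""
    return re.sub(r"[()*+]", "", locus_structure)

simple = strip_regex_chars("(CAG)*")            # a single repeat unit
merged = strip_regex_chars("(GCA)*(GCC)+")      # a complex structure
print(simple)  # CAG
print(merged)  # GCAGCC, two units fused into one string
```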

#### Create motif length attribute

In [11]:
motif_length = []
for i in range(len(motif_without_complex_repeats)):
    motif_length.append(len(motif_without_complex_repeats[i]))   

#### Create separate chromosome and coordinate attributes

In [12]:
chromosome = []
coordinate_1 = []  # start coordinate
coordinate_2 = []  # end coordinate

for i in range(len(coordinates)):
    if i not in complex_repeats_index:
        # simple loci store the region as a single "chr:start-end" string
        chrom, coordinate_pair = coordinates[i].split(":", 1)
        start, end = coordinate_pair.split("-")
        chromosome.append(chrom)
        coordinate_1.append(start)
        coordinate_2.append(end)
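
The splitting above assumes every region string has the form `chr:start-end`. A minimal sketch of a stricter parser is given below; the helper name `parse_region` and the regular expression are assumptions made for illustration, and unlike the cell above it returns the coordinates as integers rather than strings.

```python
import re

# Hypothetical pattern: contig name, a colon, then start-end as digits.
REGION_RE = re.compile(r"^(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)$")

def parse_region(region):
    """Split 'chr:start-end' into (chrom, start, end), raising on malformed input."""
    match = REGION_RE.match(region)
    if match is None:
        raise ValueError(f"unrecognised region string: {region!r}")
    return match.group("chrom"), int(match.group("start")), int(match.group("end"))

parsed = parse_region("chr3:63912684-63912714")
print(parsed)
```

Failing loudly on an unexpected string can be preferable to silently writing a malformed row into the BED file.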


#### Sanity check

In [16]:
print(len(coordinates))  # full catalog, complex repeats included
print(len(chromosome))
print(len(coordinate_1))
print(len(coordinate_2))
print(len(motif_length))
print(len(motif_without_complex_repeats))

174293
174287
174287
174287
174287
174287


All filtered arrays have the expected length of 174,287, that is, the 174,293 catalog loci minus the 6 complex repeats.
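
The same check can be automated so that a mismatch stops the notebook instead of relying on reading printed numbers. The sketch below is one possible way to do it; the helper name `check_equal_lengths` is hypothetical and the toy columns stand in for the real arrays.

```python
def check_equal_lengths(expected, **columns):
    """Raise AssertionError naming every column whose length differs from `expected`."""
    bad = {name: len(col) for name, col in columns.items() if len(col) != expected}
    assert not bad, f"unexpected lengths: {bad}"

# Toy columns (hypothetical values) standing in for the notebook's arrays.
check_equal_lengths(
    2,
    chromosome=["chr1", "chr2"],
    coordinate_1=["1000", "5000"],
    coordinate_2=["1020", "5030"],
)
```

In the notebook itself, the call would presumably pass `len(coordinates) - len(complex_repeats_index)` as the expected length together with the five filtered arrays.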

### Create a BED file using the attributes

In [14]:
# columns: chromosome, start, end, motif length, motif
with open("bed_catalog_without_complex_repeats.bed", "w") as bed_catalog:
    for i in range(len(chromosome)):
        bed_catalog.write(chromosome[i] + "\t" + coordinate_1[i] + "\t" + coordinate_2[i] + "\t" + str(motif_length[i]) + "\t" + motif_without_complex_repeats[i] + "\n")
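
An alternative to concatenating strings by hand is the standard-library `csv` module (imported at the top of this notebook) with a tab delimiter. The sketch below writes to an in-memory buffer so it can be run without touching the real output file; the row values are hypothetical.

```python
import csv
import io

# Hypothetical parallel columns standing in for the notebook's arrays.
chromosome = ["chr1", "chr2"]
coordinate_1 = ["1000", "5000"]
coordinate_2 = ["1020", "5030"]
motifs = ["AC", "CAG"]
motif_length = [len(m) for m in motifs]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
# zip walks the parallel columns row by row; csv.writer converts ints to text.
writer.writerows(zip(chromosome, coordinate_1, coordinate_2, motif_length, motifs))

bed_text = buffer.getvalue()
print(bed_text, end="")
```

Replacing the `io.StringIO()` buffer with a file opened in write mode (with `newline=""`, as the `csv` documentation recommends) would produce the same rows on disk.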