# DSL1 Database Notebook

This notebook presents the functions used to build a DSL1 database from the JSON output of the [crawler](https://github.com/R2-P2/wf-crawler). A JSON file ('wf_crawl_nextflow.json') is already present in the same directory as this notebook.

This notebook also produces two new JSON files:

- 'json_nextflow_DSL.json' contains the same information as 'wf_crawl_nextflow.json', plus the DSL version (DSL1 or DSL2) used in each repository
- 'json_nextflow_DSL1.json' contains exclusively the repositories written in DSL1

To form the database, the Nextflow files (.nf extension) at the root of each GitHub repository (project) are retrieved. Since we are only interested in DSL1 workflows, which are written in a single file, only the root is searched to keep the search simple (the main file is generally placed at the root anyway).

The next step is to determine whether the repository is written in DSL1 or DSL2:
- if all the files are written in DSL1, the repository is considered to be written in DSL1 and each Nextflow file from the repository is added to the database as a single workflow
- if a single file is written in DSL2, the repository is considered to be written in DSL2 and none of its files are added to the DSL1 database (reminder: a DSL2 workflow is split across multiple files, but the DSL version is only declared in the main file of the workflow, which is placed at the root)
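
The rule above can be sketched on in-memory file contents, without any download. This is a minimal illustration, not the notebook's final implementation: the helper name `classify_repository` and the two sample file contents are hypothetical, while the regular expression is the DSL2 marker searched for later in this notebook.

```python
import re

# Marker of an explicit DSL2 declaration (same expression as used later in the notebook).
DSL2_MARKER = re.compile(r"nextflow\.(preview|enable)\.dsl\s*=\s*2")


def classify_repository(file_texts):
    """Return True (DSL1), False (DSL2) or None (no root Nextflow file)."""
    if not file_texts:
        return None
    # A single DSL2 declaration is enough to classify the whole repository as DSL2.
    return not any(DSL2_MARKER.search(text) for text in file_texts)


# Hypothetical file contents, for illustration only.
dsl1_main = "process foo {\n    script:\n    'echo hello'\n}\n"
dsl2_main = "nextflow.enable.dsl = 2\nworkflow { foo() }\n"

print(classify_repository([dsl1_main]))             # True
print(classify_repository([dsl1_main, dsl2_main]))  # False
print(classify_repository([]))                      # None
```

Note that a file is treated as DSL1 simply because no DSL2 declaration is found in it; the sketch does not look for any positive DSL1 evidence.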

In [None]:
import json
import urllib
import re
from urllib.request import urlopen
import glob
from pathlib import Path
import pandas as pd
import os

Define the directories in which the database is saved at the end of the notebook's execution:

- __address_saved_DSL1__ will contain only the workflows written in DSL1
- __address_saved_DSL2__ will contain the remaining workflows, whose DSL version is uncertain because a file written in DSL2 was found in their repository

In [None]:
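# Placeholder paths: replace them with two distinct directories
address_saved_DSL1 = 'some/addresse'
address_saved_DSL2 = 'some/addresse'
# Create both directories if they do not exist yet
os.makedirs(address_saved_DSL1, exist_ok=True)
os.makedirs(address_saved_DSL2, exist_ok=True)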

Read the initial json file

In [None]:
json_data = {}
with open('wf_crawl_nextflow.json') as json_file:
    # Read the file
    json_data = json.load(json_file)
    # Drop the 'last_date' entry so that only repository entries remain
    json_data.pop('last_date', None)

Function that retrieves the links of the Nextflow files at the root of the project.

DSL1 workflows are written in a single file that is generally placed at the root of the project, so only the root is searched.

This is restrictive, but it ensures that we only retrieve workflows which we are sure are written in DSL1.

In [None]:

def get_links_to_download(repo_id="joshua-d-campbell/nf-GATK_Exome_Preprocess"):
    try:
        # Step 1: retrieve the HTML of the repository's root page
        with urllib.request.urlopen("https://github.com/" + repo_id) as fp:
            html = fp.read().decode("utf8")
        # Step 2: collect the links of the Nextflow files listed at the root
        # ([^"]+ keeps each match inside a single href attribute)
        links = re.findall(r'href="([^"]+\.nf)"', html)
        # Step 3: open each file page and collect its raw address for the later download
        links_to_download = []
        for l in links:
            # The extracted links already start with "/"
            with urllib.request.urlopen("https://github.com" + l) as fp:
                html = fp.read().decode("utf8")
            links_to_download.extend(re.findall(r' href="([^"]+\.nf)" id="raw-url"', html))
        return links_to_download
    except Exception as inst:
        # Best effort: report the error and treat the repository as having no files
        print(inst)
        return []
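
The link extraction step relies on a regular expression over the page's HTML. The sketch below illustrates one pitfall of that approach: a greedy `.+` can span several `href` attributes when they sit on the same line, whereas the character class `[^"]+` keeps each match inside a single attribute. The HTML fragment is hypothetical and only serves the illustration; real GitHub markup differs and changes over time.

```python
import re

# Hypothetical HTML fragment on a single line (not real GitHub markup).
sample_html = (
    '<a href="/user/repo/blob/master/main.nf">main.nf</a>'
    '<a href="/user/repo/blob/master/README.md">README.md</a>'
    '<a href="/user/repo/blob/master/other.nf">other.nf</a>'
)

# Greedy: matches from the first href up to the last '.nf"' on the line.
greedy = re.findall(r'href="(.+\.nf)"', sample_html)

# Character class: a match cannot cross a double quote, so one link per attribute.
strict = re.findall(r'href="([^"]+\.nf)"', sample_html)

print(len(greedy))  # 1
print(strict)       # ['/user/repo/blob/master/main.nf', '/user/repo/blob/master/other.nf']
```

Since `.` does not match a newline by default, the greedy form only misbehaves when several links share a line.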


Function that downloads the files

In [None]:
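def download_files(address, data):
    # Download every root Nextflow file of each repository into `address`
    for i, repo_id in enumerate(data, start=1):
        print(repo_id, f'{i}/{len(data)}')
        links = get_links_to_download(repo_id)
        for nb, l in enumerate(links, start=1):
            link = f'https://github.com{l}'
            try:
                with urlopen(link) as f:
                    myfile = f.read().decode('utf-8')
                # Files are saved as <owner>__<repository>_<index>.nf
                with open(f'{address}/{repo_id.replace("/", "__")}_{nb}.nf', 'w') as myText:
                    myText.write(myfile)
            except Exception as inst:
                # Report the failure and continue with the next file
                print(inst)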
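
The downloaded files are named after their repository ("/" replaced by "__") followed by an index, and the later steps find them again with a glob on that prefix. The sketch below reproduces this naming scheme in a temporary directory; the helper `local_name` and the repository identifiers are hypothetical and used for illustration only.

```python
import glob
import os
import tempfile
from pathlib import Path


def local_name(repo_id, index):
    """File name used for the index-th Nextflow file of a repository."""
    return f'{repo_id.replace("/", "__")}_{index}.nf'


with tempfile.TemporaryDirectory() as tmp:
    # Two files for one repository, one file for a repository with a longer name.
    Path(tmp, local_name("some-user/some-repo", 1)).write_text("// placeholder\n")
    Path(tmp, local_name("some-user/some-repo", 2)).write_text("// placeholder\n")
    Path(tmp, local_name("some-user/some-repo_v2", 1)).write_text("// placeholder\n")

    # Same kind of glob as the one used to find a repository's files again.
    pattern = os.path.join(glob.escape(tmp), "some-user__some-repo_*.nf")
    found = sorted(Path(p).name for p in glob.glob(pattern))

print(found)
```

In this sketch the glob for `some-user/some-repo` also returns the file of `some-user/some-repo_v2`, because the second name starts with the first one followed by an underscore. If the crawled data contains such pairs of repositories, a stricter pattern (for example, matching only digits after the last underscore) may be worth considering.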

Function that checks whether the DSL2 indicator is present in a repository's files and determines whether the repository is considered to be written in DSL1

In [None]:
def check_DSL1(address, data):
    # Explicit DSL2 declaration, e.g. "nextflow.enable.dsl=2"
    dsl2_pattern = re.compile(r"(nextflow\.(preview|enable)\.dsl\s*=\s*2)")
    for j in data:
        files = glob.glob(address+f'/{j.replace("/", "__")}_*.nf')
        if(len(files)!=0):
            # The repository is DSL1 only if none of its files declares DSL2
            DSL1 = True
            for f in files:
                txt = Path(f).read_text()
                if dsl2_pattern.search(txt):
                    DSL1 = False
            data[j]['DSL1'] = DSL1
        else:
            # No Nextflow file was downloaded for this repository
            data[j]['DSL1'] = None

    with open('json_nextflow_DSL.json', 'w') as fp:
        json.dump(data, fp,  indent=4)

    return data

Function that extracts the dictionary containing only the DSL1 repositories

In [None]:
def extract_DSL1(data_dsl):
    # One row per repository, then keep only the rows flagged as DSL1
    df = pd.DataFrame.from_dict(data_dsl).T
    data = df[df['DSL1']==True].T.to_dict()
    with open('json_nextflow_DSL1.json', 'w') as fp:
        json.dump(data, fp,  indent=4)
    return data
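
For comparison, the same selection can be sketched without pandas, using a dictionary comprehension. This is an alternative illustration rather than the notebook's method: the function name `extract_dsl1_plain` and the sample entries are hypothetical, and the real entries carry more fields than the single 'DSL1' flag shown here.

```python
def extract_dsl1_plain(data):
    """Keep only the repositories whose 'DSL1' flag is exactly True."""
    return {repo: info for repo, info in data.items() if info.get("DSL1") is True}


# Hypothetical entries covering the three possible values of the flag.
sample = {
    "user-a/wf-one": {"DSL1": True},
    "user-b/wf-two": {"DSL1": False},
    "user-c/wf-three": {"DSL1": None},
}

print(extract_dsl1_plain(sample))  # {'user-a/wf-one': {'DSL1': True}}
```

Repositories flagged `False` (a DSL2 declaration was found) and `None` (no Nextflow file at the root) are both left out, and the nested values are returned unchanged.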

Function that moves the DSL1 workflows to their corresponding directory

In [None]:
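import shutil

def move_DSL1(old_address, new_address, dict_DSL1):
    for j in dict_DSL1:
        for file in glob.glob(old_address+f'/{j.replace("/", "__")}_*.nf'):
            # shutil.move avoids the shell, so unusual characters in a path are safe
            shutil.move(file, os.path.join(new_address, os.path.basename(file)))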

Main which links all the functions together

In [None]:
download_files(address_saved_DSL2, json_data)
json_data_DSL = check_DSL1(address_saved_DSL2, json_data)
data = extract_DSL1(json_data_DSL)
move_DSL1(address_saved_DSL2, address_saved_DSL1, data)