# GalaxyToolsFetcher

### Table of Contents

1. [Introduction](#introduction)
2. [Getting tools metadata through ToolShed API](#ToolShedAPI)
    1. [Getting all repositories in the ToolShed](#repos)
    2. [Getting the metadata of all the tools](#meta)
3. [Getting the XML files](#xml) 
    1. [Downloading the zips containing the XMLs](#subxml)

## Introduction <a name="introduction"></a>

This code:
* Downloads the repository metadata from the Galaxy Tool Shed
* Downloads the tool XML files

Useful data available through **ToolShed**:

* Whether it is installable
* Whether there is a development repository
* Webpage
* Whether there is a dependencies statement
* Whether there is test data
* Descriptions
* Whether the version is downloadable

Useful data available in the **.xml** files:

* version
* requirements -- Not a good source for requirements
* description
* input and output formats
* command
* language
* examples / tests
* help -- can contain citations


## Getting tools metadata through ToolShed API <a name="ToolShedAPI"></a>

The API does not return the repositories' metadata in batch; as far as I know, it has to be requested one id at a time. Thus, we first collect the ids of the tools and then request the metadata of each one.

### 1. Getting all repositories in the ToolShed <a name="repos"></a>

In [2]:
import requests
import json
import sys
import zipfile
import io
from bs4 import BeautifulSoup

# Necessary through the whole code:

session = requests.Session()
## It is required to have an account at the ToolShed (https://toolshed.g2.bx.psu.edu/) to use the API.
headers = {'key': '13bf98724eff7d9cbf6ae1de77bf8826'} #The API key of my ToolShed user. 
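
Since the API key is a credential, one option is to read it from the environment instead of writing it into the notebook. The sketch below assumes an environment variable called `TOOLSHED_API_KEY`; that name and the `build_headers` helper are chosen here for illustration and are not defined by the ToolShed.

```python
import os


def build_headers(env_var="TOOLSHED_API_KEY"):
    """Return the request headers, reading the API key from the environment.

    `env_var` is a hypothetical variable name, not a ToolShed convention.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set {env_var} to your ToolShed API key")
    # Same header layout as the `headers` dictionary used in this notebook.
    return {"key": api_key}
```

The result could then be passed wherever `headers` is used, for example `session.get(reps_url, headers=build_headers())`.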

In [3]:
# url for querying the repositories:
reps_url = "https://toolshed.g2.bx.psu.edu/api/repositories?" 

repositories = session.get(reps_url, headers = headers)
repos =  json.loads(repositories.text) 

In [None]:
repo_ids = []
# Getting the ids of tools:
for rep in repos:
    # we are interested in tools (type:unrestricted), not in tool_dependencies (type:tool_dependency) 
    if rep['type']=='unrestricted':
        repo_ids.append(rep['id'])

In [29]:
len(repo_ids)

3003

At the time of writing, the ToolShed holds 3003 repositories of type `unrestricted` (tools).

### 2. Getting the metadata of all the tools <a name="meta"></a>

In [58]:
# url for querying the metadata:
u = "https://toolshed.g2.bx.psu.edu/api/repositories/{id_}/metadata?"

# List to store all the metadata 
rep_metadatas = [] 

# Iterating through all tools in the ToolShed:
for rep_id in repo_ids:
    req = u.format(id_ = rep_id)
    try:
        resp = session.get(req, headers = headers)
        # turn 4xx/5xx responses into exceptions
        resp.raise_for_status()
    except requests.RequestException as err:
        print(err)
        print("problematic id:" + rep_id)
    else:    
        meta = resp.json()
        rep_metadatas.append(meta)

Since I am not processing this information yet, for now I store it on disk for exploration.

In [67]:
with open('rep_metadatas.json', 'w') as outfile:
    json.dump(rep_metadatas, outfile)

In [70]:
len(rep_metadatas)

3003

In [69]:
len(repo_ids)

3003

The download was not completed in a single run because the metadata of two ids took a long time to be returned. These were:
* 593
* 2639

I did not investigate the possible reason for this.
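
One way to cope with slow ids is to make the download resumable, so that a second run only retries what is missing. The following is a minimal sketch under that assumption; `fetch_all` and its `fetch_one` argument are hypothetical names, and `fetch_one` stands for any callable that returns the parsed metadata of one id or raises on failure.

```python
def fetch_all(ids, fetch_one, done=None):
    """Fetch metadata for every id, skipping the ids already present in `done`.

    Returns a tuple (results, failed_ids). `results` maps id -> metadata.
    """
    results = dict(done or {})
    failed = []
    for id_ in ids:
        if id_ in results:
            continue  # already downloaded in a previous run
        try:
            results[id_] = fetch_one(id_)
        except Exception as exc:  # deliberately broad: timeouts, HTTP errors, bad JSON
            print(f"problematic id: {id_} ({exc})")
            failed.append(id_)
    return results, failed
```

With `requests`, `fetch_one` could be something like `lambda i: session.get(u.format(id_=i), headers=headers, timeout=30).json()`, where `timeout` bounds how long a single request may hang; the value of 30 seconds is only an example.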

The metadata of a given tool consists of all the revisions of that tool, all of which follow the same schema.

#### Example of a revision of a tool. 
The id of the repository is 790743498728befc.

Its repository can be downloaded from: https://toolshed.g2.bx.psu.edu/repository/download?repository_id=790743498728befc&changeset_revision=96909b9d1df1&file_type=zip

The first revision looks as follows:
```
{'0:96909b9d1df1': {'changeset_revision': '96909b9d1df1',
   'downloadable': True,
   'has_repository_dependencies': False,
   'id': '94781fe549dd60ca',
   'includes_datatypes': False,
   'includes_tool_dependencies': False,
   'includes_tools': True,
   'includes_tools_for_display_in_tool_panel': True,
   'includes_workflows': False,
   'malicious': False,
   'missing_test_components': False,
   'model_class': 'RepositoryMetadata',
   'repository': {'deleted': False,
    'deprecated': False,
    'description': '2D feature extraction',
    'homepage_url': 'https://github.com/bmcv',
    'id': '790743498728befc',
    'model_class': 'Repository',
    'name': '2d_feature_extraction',
    'owner': 'imgteam',
    'private': False,
    'remote_repository_url': 'https://github.com/BMCV/galaxy-image-analysis/tools/2d_feature_extraction/',
    'times_downloaded': 17,
    'type': 'unrestricted',
    'user_id': 3075},
   'repository_dependencies': [],
   'repository_id': '790743498728befc',
   'tool_dependencies': {},
   'tools': [{'add_to_tool_panel': True,
     'description': 'Feature Extraction',
     'guid': 'toolshed.g2.bx.psu.edu/repos/imgteam/2d_feature_extraction/ip_2d_feature_extraction/0.0.8',
     'id': 'ip_2d_feature_extraction',
     'name': '2D Feature Extraction',
     'profile': 16.01,
     'requirements': [{'name': 'pandas',
       'type': 'package',
       'version': '0.23.4'},
      {'name': 'scikit-image', 'type': 'package', 'version': '0.14.2'},
      {'name': 'numpy', 'type': 'package', 'version': '1.15.4'},
      {'name': 'tifffile', 'type': 'package', 'version': '0.15.1'}],
     'tests': [{'inputs': [['input_label', 'input.tiff'],
        ['feature_options|features', 'select'],
        ['feature_options|selected_features', '--area']],
       'name': 'Test-1',
       'outputs': [['attributes', 'name']],
       'required_files': ['input.tiff', 'name']}],
     'tool_config': '/srv/toolshed/main/var/data/repos/004/repo_4359/2d_feature_extraction.xml',
     'tool_type': 'default',
     'version': '0.0.8',
     'version_string_cmd': None}]}}
```
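
Because every revision follows this schema, picking the latest revision and flattening it takes only a few lines. The sketch below assumes that the number before the colon in each key is the revision counter; the helper names are hypothetical, and the `2:` and `10:` entries are invented purely to show why the keys should be compared numerically rather than as strings.

```python
def latest_revision(repo_metadata):
    """Return the revision whose key has the highest numeric prefix, or None."""
    if not repo_metadata:
        return None
    key = max(repo_metadata, key=lambda k: int(k.split(":", 1)[0]))
    return repo_metadata[key]


def summarize(revision):
    """Keep only a handful of fields from one revision."""
    repo = revision["repository"]
    return {
        "owner": repo["owner"],
        "name": repo["name"],
        "changeset_revision": revision["changeset_revision"],
        "downloadable": revision["downloadable"],
        "tool_ids": [tool["id"] for tool in revision.get("tools", [])],
    }


# Trimmed copy of the example revision shown above.
revision = {
    "changeset_revision": "96909b9d1df1",
    "downloadable": True,
    "repository": {"name": "2d_feature_extraction", "owner": "imgteam"},
    "tools": [{"id": "ip_2d_feature_extraction", "version": "0.0.8"}],
}
# Hypothetical later revisions, invented for illustration only.
repo_metadata = {
    "0:96909b9d1df1": revision,
    "2:aaaaaaaaaaaa": dict(revision, changeset_revision="aaaaaaaaaaaa"),
    "10:bbbbbbbbbbbb": dict(revision, changeset_revision="bbbbbbbbbbbb"),
}

summary = summarize(latest_revision(repo_metadata))
```

A plain `max(repo_metadata)` compares the keys as strings and would select the `2:` entry here, whereas the numeric comparison selects the `10:` one.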

## Getting the XML files <a name="xml"></a>

To get the XML files, we need to either:
* download the zip containing the files from the repository: 

https://toolshed.g2.bx.psu.edu/repository/download?repository_id={'repository_id'}&changeset_revision={'changeset_revision'}&file_type=zip

For the example revision above:

https://toolshed.g2.bx.psu.edu/repository/download?repository_id=790743498728befc&changeset_revision=96909b9d1df1&file_type=zip

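The file of interest is the one named in 'tools.tool_config'; in this example, 2d_feature_extraction.xml. Note that it does not necessarily match 'tools.id', which here is ip_2d_feature_extraction.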

* clone the Mercurial ToolShed repository, using the repository path in 'tools.guid' and the XML file name:

https://toolshed.g2.bx.psu.edu/repos/imgteam/2d_feature_extraction/2d_feature_extraction.xml

This does not seem like the best idea, since we would need to clone the repositories one by one because their structure is not systematic.
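
For the zip option, the download URL and the expected XML file names can be derived from a revision's metadata. This is a small sketch using only the fields shown in the example revision; `download_url` and `tool_xml_names` are hypothetical helper names.

```python
import posixpath
from urllib.parse import urlencode

DOWNLOAD_URL = "https://toolshed.g2.bx.psu.edu/repository/download"


def download_url(revision, file_type="zip"):
    """Build the download URL of one revision from its metadata."""
    query = urlencode({
        "repository_id": revision["repository_id"],
        "changeset_revision": revision["changeset_revision"],
        "file_type": file_type,
    })
    return f"{DOWNLOAD_URL}?{query}"


def tool_xml_names(revision):
    """Return the file name of each tool's XML, taken from 'tool_config'."""
    return [posixpath.basename(tool["tool_config"])
            for tool in revision.get("tools", [])]


# Fields copied from the example revision.
revision = {
    "repository_id": "790743498728befc",
    "changeset_revision": "96909b9d1df1",
    "tools": [{
        "id": "ip_2d_feature_extraction",
        "tool_config": "/srv/toolshed/main/var/data/repos/004/repo_4359/2d_feature_extraction.xml",
    }],
}
url = download_url(revision)
names = tool_xml_names(revision)
```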


### Downloading the zips containing the XMLs  <a name="subxml"></a>

In [2]:
import re 

In [3]:
class reposit(object):
    def __init__(self, r):
        self.res = r
        # self repository 
        self.z = zipfile.ZipFile(io.BytesIO(r.content), 'r')
        self.contentList = self.z.namelist() 
        self.meta = ''
        self.validXML()
        
    def enrichFromShed(self, repo):
        # TODO: merge the ToolShed metadata in `repo` into self.meta (not implemented yet)
        pass

    def validXML(self):
        validFs = [] # some zips contain more than one tool.
        for f in self.contentList:
            if '.xml' in f and not any(word in f for word in exclude):
                fil = self.z.open(f)
                BS = BeautifulSoup(fil, features="xml")
                if thisIsATool(BS):
                    # keep the parsed XML together with its directory inside the zip
                    valid = [BS, '/'.join(f.split('/')[:-1])]
                    validFs.append(valid)
        self.validXMLs = validFs
         
    def get_macros(self, macrosList, baseUrl):
        '''
        This function takes a list of paths for macros and returns a 
        dictionary with the tokens inside them
        '''
        macros = {}
        tokens = {}
        requirements = []                 
        for imp in macrosList:
            filepath = baseUrl + '/' + imp
            Import = self.z.open(filepath)
            BSmacros = BeautifulSoup(Import, features="xml")
            ##--- tokens -------------------------------------
            tokens = parse_tokens(BSmacros, tokens)
        '''    
        for imp in macrosList:
            filepath = baseUrl + '/' + imp
            Import = self.z.open(filepath)
            BSmacros = BeautifulSoup(Import, features="xml")
            ##--- requirements -------------------------------
            requirements = parse_requirements(BSmacros, requirements, tokens)
        '''
        macros['tokens'] = tokens
        #macros['requirements'] = requirements
        return(macros)
              

    def parse(self):
        self.meta = self.parse_xmls()
        
    def parse_xmls(self):
        TOOLS = []
        for Tool in self.validXMLs:
            # build dictionary with macros
            tool = Tool[0]
            base_url = Tool[1]
            if tool.tool.macros:
                imports = [a.get_text() for a in tool.tool.macros.findAll("import")]
                macros = self.get_macros(imports, base_url)
            else:
                macros = {'tokens': {}, 'requirements': []}
                    
            t = {}
            #---- identity ---------------------------------------------------

            t['id'] = rMacros(macros['tokens'], tool.tool['id']) if 'id' in tool.tool.attrs.keys() else None
            t['name'] = rMacros(macros['tokens'], tool.tool['name']) if 'name' in tool.tool.attrs.keys() else None

            t['version'] = rMacros(macros['tokens'], tool.tool['version']) if 'version' in tool.tool.attrs.keys() else None
            t['description'] = rMacros(macros['tokens'], tool.tool.description.get_text()) if tool.tool.description else None
            #---- technical ---------------------------------------------------
            #t['requirements'] = macros['requirements']
            t['code_file'] = rMacros(macros['tokens'], tool.tool.code['file']) if tool.tool.code else None
            t['language'] = t['code_file'].split('.')[-1] if tool.tool.code else None
            if tool.tool.command:
                t['command'] = rMacros(macros['tokens'], tool.tool.command.get_text())
                if  'interpreter' in tool.tool.command.attrs.keys():
                    t['interpreter'] = rMacros(macros['tokens'], tool.tool.command['interpreter']) 
                else:
                    t['interpreter'] = None
            else:
                # keep the same keys for every tool, even without a <command>
                t['command'] = None
                t['interpreter'] = None
            #---- formats -----------------------------------------------------
            t['dataFormats'] =  parse_in_out(tool)
            #---- usability ---------------------------------------------------
            t['help'] = rMacros(macros['tokens'], str(tool.tool.help.get_text())) if tool.tool.help else None
            t['tests'] = existTest(tool)
            #---- credit -------------------------------------------------------
            t['citation'] = get_citations(tool.tool.citations, macros) if tool.tool.citations else None
            # this does not come from XML:
            t['readme'] = self.existREADME()
            TOOLS.append(t)
        return(TOOLS)
    

    
    def existREADME(self):
        # True if any non-excluded file in the zip looks like a README
        for f in self.contentList:
            if 'README' in f and not any(word in f for word in exclude):
                return(True)
        return(False)

    
def get_citations(citations, macros):

    cits = []
    for child in citations.findAll("citation"):
        cits.append({'citation' : rMacros(macros['tokens'], str(child.get_text())), 'type':  rMacros(macros['tokens'], str(child['type']))})
    return(cits)
    
    

def parse_tokens(BSmacros, tokens):
    fields = [a for a in BSmacros.findAll("token")]
    
    for e in fields:
        if '\\$' not in e.get_text().lstrip():
            tokens[e['name']] = e.get_text().lstrip()
    
    return(tokens)

'''
def parse_requirements(BSmacros, requirements, tokens):
    fields = [a for a in BSmacros.findAll("requirement")]
    
    for req in fields:
        type_ = rMacros(tokens, req['type']) if 'type' in req.attrs.keys() else None
        version = rMacros(tokens, req['version']) if 'version' in req.attrs.keys() else None
        name = rMacros(tokens, req.get_text())
        REQ = {'type': type_, 'version': version, 'name': name}
        requirements.append(REQ)    
    
    print(requirements)
    return(requirements)
'''

def parse_in_out(tool):
    inOut = {}
    inFormats = []
    outFormats = []
    
    for inp in [a for a in tool.findAll("inputs")]:
        for f in  [a for a in inp.findAll(["param","data"])]:
            inFormats.append(f['format']) if 'format' in f.attrs.keys() else None
            
    for outp in [a for a in tool.findAll("outputs")]:
        for f in  [a for a in outp.findAll(["param", "data"])]:
            outFormats.append(f['format']) if 'format' in f.attrs.keys() else None
    
    inOut['inputs'] = inFormats
    inOut['outputs'] = outFormats
    
    return(inOut)
        
    
def existTest(tool):
    if tool.findAll("test"):
        return(True)
    else:
        return(False)


def thisIsATool(BS):
    # a tool XML contains at least one <tool> element
    return(bool(BS.findAll('tool')))

  

exclude = ['dependencies','dependency','macros.xml','build.xml']


def rMacros(macroTokens, string):
    for key in macroTokens.keys():
        if key in string:
            string = string.replace(key, macroTokens[key])
    return(string)



def zip2Repo(ltst_rev):
    
    url = fileurl.format(repository_id=ltst_rev['repository_id'], changeset_revision=ltst_rev['changeset_revision'] )
            
    r = session.get(url, headers = headers) # TODO: handle exceptions here
            
    repository = reposit(r = r) # We build a reposit object here
           
    repository.parse()  # We execute the parsing here
            
    ## We merge metadata from both sources
            
    return(repository)
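
The core of the class above, scanning a zip held in memory for tool XMLs and reading their data formats, can be tried without network access. The sketch below is not the implementation above: it swaps BeautifulSoup for the standard-library `xml.etree.ElementTree` so that it has no dependencies, it requires `<tool>` to be the root element, and the archive contents are invented for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

EXCLUDE = ("dependencies", "dependency", "macros.xml", "build.xml")


def find_tool_xmls(zip_bytes):
    """Return {member name: root element} for each XML whose root is <tool>."""
    tools = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if not name.endswith(".xml") or any(word in name for word in EXCLUDE):
                continue
            try:
                root = ET.fromstring(archive.read(name))
            except ET.ParseError:
                continue  # not well-formed XML, skip it
            if root.tag == "tool":
                tools[name] = root
    return tools


def data_formats(root):
    """Collect the 'format' attributes under <inputs> and <outputs>."""
    def formats(section):
        return [el.get("format")
                for parent in root.findall(section)
                for el in parent.iter()
                if el.tag in ("param", "data") and el.get("format")]
    return {"inputs": formats("inputs"), "outputs": formats("outputs")}


# Build a tiny in-memory archive (hypothetical contents) to exercise the code.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "repo/demo.xml",
        '<tool id="demo" name="Demo" version="0.1">'
        '<inputs><param name="i" type="data" format="tiff"/></inputs>'
        '<outputs><data name="o" format="tabular"/></outputs>'
        "</tool>",
    )
    archive.writestr("repo/macros.xml", "<macros/>")
    archive.writestr("repo/tool_dependencies.xml", "<tool_dependency/>")

tools = find_tool_xmls(buffer.getvalue())
formats = data_formats(tools["repo/demo.xml"])
```

Only `repo/demo.xml` passes the filter: the other two members are dropped by the exclusion list before they are parsed.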


In [4]:

### the url is of the form:
fileurl = "https://toolshed.g2.bx.psu.edu/repository/download?repository_id={repository_id}&changeset_revision={changeset_revision}&file_type=zip"


def main():
    # The repos metadata from galaxy contain the needed identifiers:
    with open('rep_metadatas.json', 'r') as infile:
        rep_metadatas = json.load(infile)

    all_metas = []
    #iterating through tools (metadata of tools contain the ids needed for the query)
    for repo in rep_metadatas: 
        if repo == {}: #some repos are empty
            continue
        else:
            ## XML downloading and parsing
            # Keys look like '0:96909b9d1df1'; compare the numeric prefix so that '10:...' beats '9:...'
            key = max(repo.keys(), key=lambda k: int(k.split(':')[0])) # I only want the latest revision
            ltst_rev = repo[key] 
            
            tool = zip2Repo(ltst_rev)
            
            # Add the metadata to repo
            
            tool.enrichFromShed(repo)
            
            all_metas.append(tool.meta)

    # save to disk
    with open('xml_metadatas.json', 'w') as outputfile:
        json.dump(all_metas, outputfile)


In [5]:
main()

In [128]:
imports = [a.get_text() for a in BSa.macros.findAll("import")]
imports

['macros.xml']

In [91]:
BSa.macros

<macros>
<import>macros.xml</import>
</macros>
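
The imported `macros.xml` is where the `<token>` elements live, and their values are substituted into the tool XML by `rMacros`. The following sketch illustrates that substitution with the standard-library `xml.etree.ElementTree` instead of BeautifulSoup; the token names and values are invented, following the `@NAME@` convention commonly used in Galaxy macros, and `read_tokens` and `expand` are hypothetical helpers.

```python
import xml.etree.ElementTree as ET

# Hypothetical macros file, for illustration only.
MACROS_XML = """
<macros>
    <token name="@TOOL_VERSION@">0.0.8</token>
    <token name="@WRAPPER@">galaxy0</token>
</macros>
"""


def read_tokens(macros_xml):
    """Return {token name: token text} for every <token> in a macros file."""
    root = ET.fromstring(macros_xml)
    return {tok.get("name"): (tok.text or "").strip()
            for tok in root.iter("token")}


def expand(text, tokens):
    """Replace each token name found in `text` with its value."""
    # Longest names first, so a name contained in another one is not replaced early.
    for name in sorted(tokens, key=len, reverse=True):
        text = text.replace(name, tokens[name])
    return text


tokens = read_tokens(MACROS_XML)
version = expand("@TOOL_VERSION@+@WRAPPER@", tokens)
```

Here `version` ends up as `0.0.8+galaxy0`. Sorting by length is a small addition over a plain loop and only matters when one token name contains another.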