This example notebook shows how to use Ines' MTI_api_functions module to annotate ChEMBL assay descriptions with MeSH terms. It uses the Medical Text Indexer (MTI) provided by the National Library of Medicine.

More info here: https://ii.nlm.nih.gov/MTI/

The steps in the example are as follows:

1. Select assay descriptions from ChEMBL to create input files for the MeSH API (part A, this script)
2. Submit assays using bsub to the EBI cluster in a job array (part A, this script)
3. Redo any jobs that failed (part B, next script)
4. Insert results from the output text files into Oracle tables (part C)

There are three scripts (A, B and C) to cover these steps because the API jobs need to finish before the next part can start. I tested that these scripts work when executed from the cluster.

In [42]:
import sqlalchemy as al
import logging
import MTI_api_functions as maf
import os

In [43]:
# Personal login details in correct format for sqlalchemy

with open('/homes/ines/alchemy_ines_login.txt', 'r') as f:
    engine = al.create_engine(f.read().strip())  # strip any trailing newline from the connection URL

In [34]:
# Set up logging

logging.basicConfig(format='%(asctime)s %(message)s', filename='./logs/heart_assays.log', level=logging.DEBUG)

In [44]:
# STEP 1 - Select assay descriptions from ChEMBL 
# For the purpose of this example, let's only do the assays that mention 'heart'

my_assays = maf.get_assay_descriptions(engine, condition="lower(description) like '%heart%'")
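
The internals of `maf.get_assay_descriptions` are not shown in this notebook. As a rough sketch of how such a `condition` string could be applied, the toy example below uses an in-memory SQLite database instead of ChEMBL. The function name `get_assay_descriptions_sketch`, the table layout (`assays` with `assay_id` and `description`) and the sample rows are all assumptions made for illustration.

```python
import sqlite3


def get_assay_descriptions_sketch(conn, condition=None):
    """Hypothetical stand-in for maf.get_assay_descriptions.

    Assumes a table `assays` with `assay_id` and `description` columns and
    appends the optional SQL condition as a WHERE clause.
    """
    query = "SELECT assay_id, description FROM assays"
    if condition:
        # The condition is pasted in verbatim, so it must come from trusted code
        query += " WHERE " + condition
    query += " ORDER BY assay_id"
    return conn.execute(query).fetchall()


# Toy in-memory database standing in for ChEMBL (made-up rows)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE assays (assay_id INTEGER, description TEXT)")
conn.executemany(
    "INSERT INTO assays VALUES (?, ?)",
    [
        (1, "Inhibition of hERG channel"),
        (2, "Effect on Heart rate in rat"),
        (3, "Binding affinity to kinase"),
    ],
)

heart_rows = get_assay_descriptions_sketch(conn, "lower(description) like '%heart%'")
print(heart_rows)
```

Lower-casing the description before matching means 'Heart' and 'heart' are both picked up, which is why the condition in the cell above wraps the column in `lower()`.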

In [36]:
# STEP 2 - Create input files for the MeSH API. The output directory contains subdirectories, each holding a batch of input files
# When I ran the whole of ChEMBL, I used 4000 assays per input file and 100 files per subdirectory
# The maximum number of files per subdirectory is 1000 because that is the maximum number of jobs per array

maf.make_input_files(my_assays, nr_assays=400, nr_files=10, path_to_inputfiles='./heart_assays')
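
To make the effect of `nr_assays` and `nr_files` concrete, here is a minimal sketch of the batching arithmetic. It only plans the layout in memory and writes nothing to disk. The helper `plan_input_files` is hypothetical and is not taken from MTI_api_functions, whose actual implementation may differ.

```python
def plan_input_files(assay_ids, nr_assays, nr_files):
    """Group assays into files, and files into subdirectories.

    Returns a list of subdirectories; each subdirectory is a list of files;
    each file is a list of at most `nr_assays` assay IDs.
    """
    # Slice the assays into files of at most nr_assays each
    files = [assay_ids[i:i + nr_assays] for i in range(0, len(assay_ids), nr_assays)]
    # Slice the files into subdirectories of at most nr_files each
    return [files[i:i + nr_files] for i in range(0, len(files), nr_files)]


# 9000 made-up assay IDs, batched with the same settings as the cell above
layout = plan_input_files(list(range(9000)), nr_assays=400, nr_files=10)
print(len(layout))                    # number of subdirectories: 3
print([len(sub) for sub in layout])   # files per subdirectory: [10, 10, 3]
```

With these settings each subdirectory becomes one job array of at most 10 jobs, well under the limit of 1000 jobs per array mentioned above.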


In [38]:
# Set up directories
to_api_dir = '/nfs/research2/jpo/shared/projects/HeCaToS/mesh_api/SKR_Web_API_V2_1'
to_example_dir = '/nfs/research2/jpo/shared/projects/HeCaToS/mesh_api/SKR_Web_API_V2_1/examples'

In [None]:
# For each subdirectory, submit a job array to the EBI cluster using bsub
# Run only 2-3 subdirectories at a time, on advice from NLM, as otherwise the server slows down

subdirs = [item for item in os.listdir('./heart_assays') if '.DS_Store' not in item]  # exclude the hidden macOS file
for subdir in subdirs[:2]:  # just doing two subdirectories here
    maf.submit_job_array_for_inputdir(inputfiles_dir='heart_assays', inputfiles_subdir=subdir,
                                      path_to_MTI_dir=to_api_dir, path_to_files_dir=to_example_dir,
                                      email='ines@ebi.ac.uk')
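
For readers unfamiliar with LSF job arrays, the sketch below builds (but does not run) a `bsub` command for one subdirectory. It is not the command that `maf.submit_job_array_for_inputdir` issues, which is not shown here. The helper name, the wrapper script `run_mti.sh`, the file naming and the e-mail address are hypothetical; the `-J "name[1-N]"` array syntax, the `%J` and `%I` placeholders and the `LSB_JOBINDEX` variable are standard LSF features.

```python
import shlex


def build_bsub_array_command(job_name, n_files, log_dir, email, command):
    """Return an LSF bsub argument list for a job array with n_files elements."""
    # The limit of 1000 jobs per array is the one quoted earlier in this notebook
    if not 1 <= n_files <= 1000:
        raise ValueError("expected between 1 and 1000 files per subdirectory")
    return [
        "bsub",
        "-J", f"{job_name}[1-{n_files}]",   # one array element per input file
        "-o", f"{log_dir}/%J_%I.out",       # %J = job ID, %I = array index
        "-u", email,                        # address for the job report
        command,
    ]


# Hypothetical wrapper script and file naming; $LSB_JOBINDEX is expanded by LSF at run time
cmd = build_bsub_array_command(
    job_name="heart_0",
    n_files=10,
    log_dir="logs",
    email="user@example.org",
    command="sh run_mti.sh heart_assays/0/input_$LSB_JOBINDEX.txt",
)
print(shlex.join(cmd))  # inspect the command before handing it to subprocess.run
```

Printing the command first makes it easy to check the array range against the number of input files before anything is sent to the cluster.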
    

In [None]:
# The jobs now need to finish before continuing. I submitted each subdirectory manually after the previous one had finished.
# See the next script, 'part B'
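
Before moving on, it can help to check which input files have no usable output yet. The sketch below is one possible way to do that and is not the logic of part B. It assumes, purely for illustration, that each input file `name` produces an output file called `name.out` in a separate directory, and it treats an empty output file as a failed job.

```python
import os
import tempfile


def find_unfinished(input_dir, output_dir, suffix=".out"):
    """Return the names of input files that have no non-empty output file."""
    unfinished = []
    for name in sorted(os.listdir(input_dir)):
        if name.startswith("."):
            continue  # skip hidden files such as .DS_Store
        out_path = os.path.join(output_dir, name + suffix)
        if not os.path.isfile(out_path) or os.path.getsize(out_path) == 0:
            unfinished.append(name)
    return unfinished


# Demo on a throwaway directory with three made-up input files
with tempfile.TemporaryDirectory() as tmp:
    in_dir = os.path.join(tmp, "inputs")
    out_dir = os.path.join(tmp, "outputs")
    os.makedirs(in_dir)
    os.makedirs(out_dir)
    for i in (1, 2, 3):
        with open(os.path.join(in_dir, f"input_{i}.txt"), "w") as fh:
            fh.write("assay text\n")
    with open(os.path.join(out_dir, "input_1.txt.out"), "w") as fh:
        fh.write("result\n")                                      # finished job
    open(os.path.join(out_dir, "input_2.txt.out"), "w").close()   # empty output, treated as failed
    missing = find_unfinished(in_dir, out_dir)

print(missing)  # ['input_2.txt', 'input_3.txt']
```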