# Metaspace annotation pipeline on IBM Cloud

Experimental code to integrate Metaspace [engine](https://github.com/metaspace2020/metaspace/tree/master/metaspace/engine)
with [PyWren](https://github.com/pywren/pywren-ibm-cloud) for IBM Cloud.

## Table of Contents
1. [Setup](#setup)
2. [Upload Dataset into IBM COS](#dataset)
3. [Generate Isotopic Peaks from Molecular Databases](#database)
4. [Run Annotation Pipeline](#annotation)
5. [Clean Temp Data in IBM COS](#clean)

# <a name="setup"></a> 1. Setup

This notebook requires IBM Cloud Object Storage and IBM Cloud Functions.
Create both services from the IBM Cloud dashboard before continuing.


In [None]:
# Raise the open-file limit to avoid "too many open files" errors in the notebook
import resource
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print('Before:', soft, hard)

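# Raise the soft limit, capped at the hard limit (raising the hard limit requires root privileges)
new_soft = 10000 if hard == resource.RLIM_INFINITY else min(10000, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))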
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print('After:', soft, hard)

In [None]:
%config Completer.use_jedi = False
%matplotlib inline
%load_ext autoreload
%autoreload 2

In [None]:
# Run this for DEBUG mode
import logging
logging.basicConfig(level=logging.DEBUG)

### IBM COS Setup

Copy the file `config.json.template` to `config.json` and fill in the missing values for API keys, buckets and endpoints per these instructions:

Set up a bucket in IBM Cloud Object Storage

You need an IBM COS bucket to store the input data. To find an existing bucket or create a new one, navigate to your cloud resource list, then find and select your storage instance. From there you can view all your buckets and create a new bucket in the region you prefer. Make sure you copy the correct endpoint for the bucket from the Endpoint tab of the COS service dashboard. Note: bucket names must be unique.

Obtain the API key and endpoint for the IBM Cloud Functions service. Navigate to Getting Started > API Key from the side menu and copy the values for "Current Namespace", "Host" and "Key" into `config.json`. Make sure to prefix the host with "https://" when adding it as the endpoint.
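
Before creating any clients, it can help to check that the bucket entries were actually filled in. The sketch below is only an illustration: `missing_storage_keys` is a hypothetical helper, and it checks only the three `config['storage']` keys that this notebook reads directly. The template may contain other required fields that are not covered here.

```python
import json

# Bucket keys that this notebook reads from config['storage'].
REQUIRED_STORAGE_KEYS = ('ds_bucket', 'db_bucket', 'output_bucket')


def missing_storage_keys(config):
    """Return the storage keys that are absent or left empty in the config."""
    storage = config.get('storage', {})
    return [key for key in REQUIRED_STORAGE_KEYS if not storage.get(key)]


# Usage (assumes config.json exists in the working directory):
# with open('config.json') as f:
#     print(missing_storage_keys(json.load(f)))
```

An empty list means all three bucket names are present; anything else names the entries that still need a value.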

In [None]:
import json
with open('config.json') as f:
    config = json.load(f)

In [None]:
from annotation_pipeline_v2.utils import get_ibm_cos_client
cos_client = get_ibm_cos_client(config)

### IBM PyWren Setup

In [None]:
# The Python executable and library prefix that this notebook will use
import sys
sys.executable, sys.prefix

In [None]:
# Install PyWren for IBM Cloud if it is not already available
try:
    import pywren_ibm_cloud as pywren
except ModuleNotFoundError:
    !{sys.executable} -m pip install -U pywren-ibm-cloud==1.0.13
    import pywren_ibm_cloud as pywren

pywren.__version__

### Input Files Setup

Choose one of the three provided input config files (small, big or huge), or copy `input_config.json.template` to `input_config.json` and fill in the missing values.

In [None]:
import json
input_config_path = 'metabolomics/input_config_small.json'
# input_config_path = 'metabolomics/input_config_big.json'
# input_config_path = 'metabolomics/input_config_huge.json'
with open(input_config_path) as f:
    input_config = json.load(f)
input_data = input_config['dataset']
input_db = input_config['molecular_db']
output = input_config['output']

In [None]:
input_data

In [None]:
input_db

In [None]:
output
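
If you start from `input_config.json.template`, a quick check that the expected entries exist can save a failed run later. This is a minimal sketch under stated assumptions: `missing_input_keys` is a hypothetical helper, and the key list covers only the entries this notebook accesses directly. Functions such as `build_database` may read further keys that are not listed.

```python
# Keys that this notebook reads directly from each section of the input config.
EXPECTED_INPUT_KEYS = {
    'dataset': ('path', 'polarity', 'isocalc_sigma', 'ds_chunks', 'ds_segments'),
    'molecular_db': ('centroids_pandas', 'centroids_segments'),
    'output': ('formula_images',),
}


def missing_input_keys(input_config):
    """Return 'section.key' strings for every expected key that is absent."""
    missing = []
    for section, keys in EXPECTED_INPUT_KEYS.items():
        section_values = input_config.get(section, {})
        missing.extend(f'{section}.{key}' for key in keys if key not in section_values)
    return missing


# Usage: print(missing_input_keys(input_config))
```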

# <a name="dataset"></a> 2. Upload Dataset into IBM COS

The "small", "big" and "huge" datasets can be downloaded [here](https://s3.eu-de.cloud-object-storage.appdomain.cloud/pywren-annotation-pipeline-public/metabolomics.tar.gz).<br>
Extract the archive and place the `metabolomics` folder in the root directory of the project.
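
The download and extraction can also be scripted with the standard library. The sketch below is one possible way to do it, not part of the project: `download_archive` and `extract_archive` are hypothetical helpers, and it assumes the archive unpacks to a top-level `metabolomics` folder, which should be verified after extraction.

```python
import tarfile
import urllib.request
from pathlib import Path

DATASETS_URL = ('https://s3.eu-de.cloud-object-storage.appdomain.cloud/'
                'pywren-annotation-pipeline-public/metabolomics.tar.gz')


def download_archive(url, archive_path):
    """Download the archive unless a local copy already exists."""
    archive_path = Path(archive_path)
    if not archive_path.exists():
        urllib.request.urlretrieve(url, str(archive_path))
    return archive_path


def extract_archive(archive_path, dest_dir='.'):
    """Extract a .tar.gz archive into dest_dir and return its top-level entries."""
    with tarfile.open(str(archive_path), 'r:gz') as tar:
        names = tar.getnames()
        tar.extractall(str(dest_dir))
    return sorted({name.split('/')[0] for name in names})


# Usage, run from the project root:
# archive = download_archive(DATASETS_URL, 'metabolomics.tar.gz')
# print(extract_archive(archive))
```

The returned list of top-level entries makes it easy to confirm where the `metabolomics` folder ended up.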

In [None]:
import os
from annotation_pipeline_v2.utils import upload_to_cos

In [None]:
for root, dirnames, filenames in os.walk(input_data['path']):
    for fn in filenames:
        f_path = f'{root}/{fn}'
        print(f_path)
        upload_to_cos(cos_client, f_path, config['storage']['ds_bucket'], f_path)

In [None]:
# Dataset in IBM COS
cos_client.list_objects_v2(Bucket=config['storage']['ds_bucket'], Prefix=input_data['path']).get('Contents', [])

# <a name="database"></a> 3. Generate Isotopic Peaks from Molecular Databases

In [None]:
from annotation_pipeline.molecular_db import dump_mol_db, build_database, \
    calculate_centroids, get_formula_id_dfs, clean_formula_chunks

In [None]:
# Download commonly used mol DBs from METASPACE (add force=True to redownload if needed)
dump_mol_db(config, config['storage']['db_bucket'], 'metabolomics/db/mol_db1.pickle', 22)  # HMDB-v4
dump_mol_db(config, config['storage']['db_bucket'], 'metabolomics/db/mol_db2.pickle', 19)  # ChEBI-2018-01
dump_mol_db(config, config['storage']['db_bucket'], 'metabolomics/db/mol_db3.pickle', 24)  # LipidMaps-2017-12-12
dump_mol_db(config, config['storage']['db_bucket'], 'metabolomics/db/mol_db4.pickle', 26)  # SwissLipids-2018-02-02

In [None]:
num_formulas, formula_chunk_keys = build_database(config, input_db)

In [None]:
num_formulas, len(formula_chunk_keys), formula_chunk_keys[:3]

In [None]:
formula_to_id, id_to_formula = get_formula_id_dfs(config, input_db)

In [None]:
# Both values affect the results, so prefer the dataset's actual settings.
# If they are missing from the config, use '+' for polarity and 0.001238 for isocalc_sigma.
polarity = input_data['polarity']
isocalc_sigma = input_data['isocalc_sigma']
centroids_shape, centroids_head = calculate_centroids(config, input_db, formula_chunk_keys, polarity, isocalc_sigma)
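
When a config lacks `polarity` or `isocalc_sigma`, the fallbacks mentioned in the cell above can be applied explicitly rather than silently. The following is a minimal sketch: `get_isotope_settings` is a hypothetical helper, and the defaults are the two values quoted in the notebook comments. Because these values affect the results, the sketch warns whenever a fallback is used.

```python
import warnings

# Fallback values quoted in the notebook comments.
DEFAULT_POLARITY = '+'
DEFAULT_ISOCALC_SIGMA = 0.001238


def get_isotope_settings(dataset_config):
    """Return (polarity, isocalc_sigma), warning whenever a fallback is used."""
    settings = []
    for key, default in (('polarity', DEFAULT_POLARITY),
                         ('isocalc_sigma', DEFAULT_ISOCALC_SIGMA)):
        if key not in dataset_config:
            warnings.warn(f"'{key}' missing from dataset config; using default {default!r}")
        settings.append(dataset_config.get(key, default))
    return tuple(settings)


# Usage: polarity, isocalc_sigma = get_isotope_settings(input_data)
```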

In [None]:
centroids_shape

In [None]:
centroids_head

In [None]:
clean_formula_chunks(config, input_db, formula_chunk_keys)

In [None]:
# Centroids database in IBM COS
cos_client.list_objects_v2(Bucket=config['storage']['db_bucket'], Prefix=input_db['centroids_pandas']).get('Contents', [])

# <a name="annotation"></a> 4. Run Annotation Pipeline

In [None]:
from annotation_pipeline_v2.pipeline import Pipeline

In [None]:
pipeline = Pipeline(config, input_config)

In [None]:
%time pipeline.load_ds()

In [None]:
%time pipeline.split_ds()

In [None]:
%time pipeline.segment_ds()

In [None]:
%time pipeline.segment_centroids()

In [None]:
%time pipeline.annotate()

In [None]:
%time pipeline.run_fdr()

In [None]:
%time pipeline.get_results()

In [None]:
%time pipeline.get_images()

# <a name="clean"></a> 5. Clean Temp Data in IBM COS

In [None]:
from annotation_pipeline_v2.utils import clean_from_cos

In [None]:
# Clean dataset chunks
clean_from_cos(config, config["storage"]["ds_bucket"], input_data["ds_chunks"])

In [None]:
# Clean dataset segments
clean_from_cos(config, config["storage"]["ds_bucket"], input_data["ds_segments"])

In [None]:
# Clean centroids database segments
clean_from_cos(config, config["storage"]["db_bucket"], input_db["centroids_segments"])

In [None]:
# Clean formula output images
clean_from_cos(config, config["storage"]["output_bucket"], output["formula_images"])
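
The implementation of `clean_from_cos` is not shown in this notebook. For reference, a prefix cleanup against an S3-compatible client could look like the sketch below. `delete_prefix` is a hypothetical helper, not the project's code; it takes a client such as `cos_client` rather than the config, and relies only on the `list_objects_v2` and `delete_objects` calls. It re-lists after each batch instead of using continuation tokens, and stops with an error if the service reports failed deletions.

```python
def delete_prefix(cos_client, bucket, prefix):
    """Delete every object under prefix, one listing page at a time; return the count."""
    deleted = 0
    while True:
        page = cos_client.list_objects_v2(Bucket=bucket, Prefix=prefix)
        keys = [{'Key': obj['Key']} for obj in page.get('Contents', [])]
        if not keys:
            return deleted
        response = cos_client.delete_objects(Bucket=bucket, Delete={'Objects': keys})
        if response.get('Errors'):
            # Stop instead of looping forever over objects that cannot be deleted.
            raise RuntimeError(f"Failed to delete some objects: {response['Errors']}")
        deleted += len(keys)


# Usage: delete_prefix(cos_client, config['storage']['ds_bucket'], input_data['ds_chunks'])
```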