# Metaspace annotation pipeline on IBM Cloud

Experimental code to integrate Metaspace [engine](https://github.com/metaspace2020/metaspace/tree/master/metaspace/engine)
with [PyWren](https://github.com/pywren/pywren-ibm-cloud) for IBM Cloud.

## Table of Contents
1. [Setup](#setup)
2. [Generate Isotopic Peaks from Molecular Databases](#database)
3. [Run Annotation Pipeline](#annotation)
4. [Display results](#display)
5. [Validate results](#validate)
6. [Clean Temp Data in IBM COS](#clean)

# <a name="setup"></a> 1. Setup

This notebook requires IBM Cloud Object Storage and IBM Cloud Functions.
Create both services from the IBM Cloud dashboard before continuing.


In [None]:
# We need this to overcome Python notebooks limitations of too many open files
import resource
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print('Before:', soft, hard)

# Raise the soft limit. It cannot exceed the hard limit, which only privileged users can raise
target = 10000 if hard == resource.RLIM_INFINITY else min(10000, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print('After:', soft, hard)

In [None]:
%config Completer.use_jedi = False
%matplotlib inline
%load_ext autoreload
%autoreload 2

In [None]:
# Run this for DEBUG mode
import logging
logging.basicConfig(level=logging.DEBUG)

### IBM COS Setup

Copy the file `config.json.template` to `config.json` and fill in the missing values for API keys, buckets and endpoints per these instructions:

Set up a bucket in IBM Cloud Object Storage

You need an IBM COS bucket to store the input data. If you don't know of an existing bucket or would like to create a new one, navigate to your cloud resource list, then find and select your storage instance. From there you can view all your buckets and create a new bucket in the region you prefer. Copy the correct endpoint for the bucket from the Endpoint tab of the COS service dashboard. Note: bucket names must be unique.

Obtain the API key and endpoint for the IBM Cloud Functions service. Navigate to Getting Started > API Key from the side menu and copy the values for "Current Namespace", "Host" and "Key" into `config.json`. Make sure to prefix the host with "https://" when adding it as the endpoint.
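
Before loading the config, it can help to check that no value was left unfilled and that the host carries a scheme. The following is a minimal sketch, not part of `annotation_pipeline`: the helper names `ensure_https` and `find_unfilled`, the `functions` section with its `host` and `key` entries, and the `<...>` placeholder convention are all assumptions made for illustration. Only `storage.ds_bucket` and `storage.db_bucket` are key names that this notebook itself uses.

```python
def ensure_https(host):
    """Return the host as an endpoint URL, adding 'https://' if no scheme is present."""
    host = host.strip()
    if host.startswith(('http://', 'https://')):
        return host
    return 'https://' + host


def is_unfilled(value):
    """True for None, blank strings, and strings that look like '<PLACEHOLDER>' (assumed convention)."""
    if value is None:
        return True
    if isinstance(value, str):
        stripped = value.strip()
        return not stripped or stripped.startswith('<')
    return False


def find_unfilled(cfg, prefix=''):
    """Recursively collect the dotted paths of values that still need to be filled in."""
    missing = []
    for key, value in cfg.items():
        path = f'{prefix}.{key}' if prefix else key
        if isinstance(value, dict):
            missing.extend(find_unfilled(value, path))
        elif is_unfilled(value):
            missing.append(path)
    return missing


# Hypothetical config fragment, for illustration only
example = {
    'storage': {'ds_bucket': 'my-datasets', 'db_bucket': ''},
    'functions': {'host': 'us-south.functions.cloud.ibm.com', 'key': '<API_KEY>'},
}
print(find_unfilled(example))
print(ensure_https(example['functions']['host']))
```

With a real `config.json`, the same check would be applied to the dictionary returned by `json.load`.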

In [None]:
import json
with open('config.json') as f:
    config = json.load(f)

In [None]:
from annotation_pipeline.utils import get_ibm_cos_client
cos_client = get_ibm_cos_client(config)

### IBM PyWren Setup

In [None]:
# If pywren_ibm_cloud isn't installed, please run `pip install -e .` in this directory to install all dependencies
import pywren_ibm_cloud as pywren

pywren.__version__

### Input Files Setup

Choose between the input config files to select how much processing will be done. See the `README.md` for more information on each dataset.

In [None]:
import json
# Uncomment exactly one of the following lines to choose the dataset
#input_config_file = 'metabolomics/input_config_small.json'
input_config_file = 'metabolomics/input_config_big.json'
#input_config_file = 'metabolomics/input_config_huge.json'
#input_config_file = 'metabolomics/input_config_huge2.json'
#input_config_file = 'metabolomics/input_config_huge3.json'
with open(input_config_file) as f:
    input_config = json.load(f)
input_data = input_config['dataset']
input_db = input_config['molecular_db']
output = input_config['output']
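
As an alternative to commenting lines in and out, the dataset could be selected by name. This is only a sketch: `resolve_input_config` is a hypothetical helper, and the list of sizes is taken from the file names in the cell above rather than from any API of the pipeline.

```python
from pathlib import Path

# Dataset sizes that appear in this notebook's input config file names
KNOWN_SIZES = ('small', 'big', 'huge', 'huge2', 'huge3')


def resolve_input_config(size, base_dir='metabolomics'):
    """Build the path of the input config for a named dataset size."""
    if size not in KNOWN_SIZES:
        raise ValueError(f'Unknown dataset size {size!r}; expected one of {KNOWN_SIZES}')
    return Path(base_dir) / f'input_config_{size}.json'


print(resolve_input_config('big').as_posix())
```

The returned path can then be opened and passed to `json.load` as in the cell above.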

In [None]:
# Please note that some input_configs specify a `mol_db6`, which is not yet publicly available.
# This will remove it from the config to prevent later errors. Results will still be generated for other databases.
input_db['databases'] = [db for db in input_db['databases'] if 'mol_db6' not in db]

In [None]:
input_data

In [None]:
input_db

In [None]:
output

### Validate that dataset is present

Download links for example datasets can be found in [README.md](https://github.com/metaspace2020/pywren-annotation-pipeline/blob/master/README.md).

In [None]:
from annotation_pipeline.utils import ds_imzml_path
import sys
try:
    assert ds_imzml_path(input_config['dataset']['path'])
except Exception:
    print(f"No imzML file was found in {input_config['dataset']['path']}. "
          "Please follow the instructions in README.md to download and extract "
          "the dataset required by this input_config.json file.",
          file=sys.stderr)
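
For readers who want to run a similar check without the pipeline package, the sketch below looks for an `.imzML` file in a directory. It is an assumption about what such a check might look like, not the implementation of `ds_imzml_path`; the helper name `find_imzml` is hypothetical.

```python
import tempfile
from pathlib import Path


def find_imzml(ds_dir):
    """Return the first file with an .imzML extension (case-insensitive) in ds_dir, or None."""
    ds_dir = Path(ds_dir)
    if not ds_dir.is_dir():
        return None
    matches = sorted(
        p for p in ds_dir.iterdir()
        if p.is_file() and p.suffix.lower() == '.imzml'
    )
    return matches[0] if matches else None


# Demonstration on a temporary directory instead of a real dataset
with tempfile.TemporaryDirectory() as tmp:
    print(find_imzml(tmp))            # nothing extracted yet
    (Path(tmp) / 'sample.imzML').touch()
    print(find_imzml(tmp).name)       # file is now found
```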

# <a name="database"></a> 2. Generate Isotopic Peaks from Molecular Databases

In [None]:
from annotation_pipeline.molecular_db import build_database, calculate_centroids, upload_mol_dbs_from_dir

In [None]:
# Upload databases
upload_mol_dbs_from_dir(config, config['storage']['db_bucket'], 'metabolomics/db', 'metabolomics/db')

In [None]:
num_formulas, num_formulas_chunks = build_database(config, input_db)

In [None]:
num_formulas, num_formulas_chunks

In [None]:
polarity = input_data['polarity'] # Use '+' if missing from the config, but it's better to get the actual value as it affects the results
isocalc_sigma = input_data['isocalc_sigma'] # Use 0.001238 if missing from the config, but it's better to get the actual value as it affects the results
num_centroids, num_centroids_chunks = calculate_centroids(config, input_db, polarity, isocalc_sigma)
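
The fallbacks mentioned in the comments above can be made explicit. The sketch below reads both values with the stated defaults (`'+'` and `0.001238`) and reports which ones were defaulted, since the defaults affect the results. The helper name `centroid_params` is hypothetical, and accepting `'-'` as the only other polarity is an assumption.

```python
DEFAULT_POLARITY = '+'
DEFAULT_ISOCALC_SIGMA = 0.001238


def centroid_params(dataset_cfg):
    """Return (polarity, isocalc_sigma, defaulted_keys) for a dataset config dict."""
    defaulted = [k for k in ('polarity', 'isocalc_sigma') if k not in dataset_cfg]
    polarity = dataset_cfg.get('polarity', DEFAULT_POLARITY)
    sigma = float(dataset_cfg.get('isocalc_sigma', DEFAULT_ISOCALC_SIGMA))
    # Assumption: polarity is either '+' or '-'
    if polarity not in ('+', '-'):
        raise ValueError(f'Unexpected polarity {polarity!r}')
    return polarity, sigma, defaulted


polarity, isocalc_sigma, defaulted = centroid_params({'polarity': '-'})
if defaulted:
    print('Using default values for:', ', '.join(defaulted))
```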

In [None]:
num_centroids, num_centroids_chunks

# <a name="annotation"></a> 3. Run Annotation Pipeline

In [None]:
from annotation_pipeline.pipeline import Pipeline

In [None]:
pipeline = Pipeline(config, input_config)

In [None]:
%time pipeline.load_ds()

In [None]:
# Here we upload spectra chunks to COS
%time pipeline.split_ds()

In [None]:
# Sort the input dataset into segments
%time pipeline.segment_ds()

In [None]:
# Sort the input databases into segments
%time pipeline.segment_centroids()

In [None]:
# Annotation engine: generate images in COS
%time pipeline.annotate()

In [None]:
%time pipeline.run_fdr()

In [None]:
%time results_df = pipeline.get_results()
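
The `%time` magic only works inside IPython. When running the same steps from a plain Python script, a small timing helper could be used instead. This is a sketch with a hypothetical helper `run_timed`; the lambdas are stand-ins for the pipeline methods called above.

```python
import time


def run_timed(steps):
    """Run (name, callable) pairs in order and return a list of (name, seconds, result)."""
    report = []
    for name, func in steps:
        start = time.perf_counter()
        result = func()
        elapsed = time.perf_counter() - start
        print(f'{name}: {elapsed:.2f} s')
        report.append((name, elapsed, result))
    return report


# Stand-in callables; with a real pipeline object these would be bound methods
# such as pipeline.load_ds or pipeline.split_ds
report = run_timed([
    ('load_ds', lambda: 'loaded'),
    ('split_ds', lambda: 'split'),
])
```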

# <a name="display"></a> 4. Display results

In [None]:
# Here we download pickled output images from COS
%time formula_images = pipeline.get_images()

In [None]:
top_mols = (results_df
               .sort_values('msm', ascending=False)
               .drop('database_path', axis=1)
               .drop_duplicates(['mol','modifier','adduct']))
top_mols.head()

In [None]:
# Display images
import matplotlib.pyplot as plt
for i, (formula_i, row) in enumerate(top_mols.head().iterrows()):
    plt.figure(i)
    plt.title(f'{row.mol}{row.modifier}{row.adduct} - MSM {row.msm:.3f}')
    plt.imshow(formula_images[formula_i][0].toarray())

# <a name="validate"></a> 5. Validate results

Note that due to minor changes in the METASPACE algorithm since these datasets were originally uploaded, only the "big" dataset currently passes this validation step.

In [None]:
checked_results = pipeline.check_results()

# <a name="clean"></a> 6. Clean Temp Data in IBM COS

In [None]:
from annotation_pipeline.utils import clean_from_cos
clean_from_cos(config, config["storage"]["ds_bucket"], "metabolomics/tmp")
clean_from_cos(config, config["storage"]["db_bucket"], "metabolomics/tmp")