# Metaspace Annotation Pipeline on IBM Cloud
Experimental code that integrates the Metaspace [engine](https://github.com/metaspace2020/metaspace/tree/master/metaspace/engine) with [PyWren](https://github.com/pywren/pywren-ibm-cloud) for IBM Cloud.

## Table of contents
1. [Follow Setup Instructions](#setup)
2. [Upload Data Files into IBM Cloud Object Storage](#upload)
3. [Split Dataset into Segments](#split)
4. [Apply Annotation to each Segment in Parallel](#annotation)
5. [Get Results](#results)
6. [Clean Segments Data in IBM Cloud Object Storage](#clean)

# <a name="setup"></a> 1. Follow Setup Instructions

This notebook requires IBM Cloud Object Storage and IBM Cloud Functions.
Create both services from the IBM Cloud dashboard before continuing.


In [None]:
# Raise the open-file limit to avoid "too many open files" errors in the notebook
import resource
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print('Before:', soft, hard)

# Raise the soft limit only. Raising the hard limit requires a privileged (root) process
resource.setrlimit(resource.RLIMIT_NOFILE, (10000, hard))
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print('After:', soft, hard)
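
The cell above requests a soft limit of 10000 unconditionally, which fails when the hard limit is lower. A minimal sketch of a more defensive variant follows; the helper name `raise_nofile_limit` is hypothetical and not part of this repository.

```python
import resource


def raise_nofile_limit(target=10000):
    """Raise the soft RLIMIT_NOFILE towards `target` without exceeding the hard limit.

    Hypothetical helper for illustration; returns the (soft, hard) pair in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft, hard  # already unlimited, nothing to do
    # The soft limit can never exceed the hard limit for an unprivileged process
    new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
    if new_soft > soft:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
        except (ValueError, OSError):
            pass  # the platform rejected the request; keep the current limits
    return resource.getrlimit(resource.RLIMIT_NOFILE)


print('After:', raise_nofile_limit(10000))
```

The helper never lowers an existing limit, so it is safe to re-run the cell.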

In [None]:
%config Completer.use_jedi = False
%matplotlib inline
%load_ext autoreload
%autoreload 2
import json

In [None]:
# Run this for DEBUG mode
import logging
logging.basicConfig(level=logging.DEBUG)

### IBM Cloud PyWren Setup

In [None]:
# Install PyWren-IBM if needed
import sys
try:
    import pywren_ibm_cloud as pywren
except ModuleNotFoundError:
    !{sys.executable} -m pip install -U pywren-ibm-cloud==1.0.10
    import pywren_ibm_cloud as pywren

pywren.__version__

Copy the file `config.json.template` to `config.json` and fill in the missing values for API keys, buckets and endpoints, following the instructions below:

In [None]:
# Use a context manager so the file handle is closed right after loading
with open('config.json') as f:
    config = json.load(f)
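
Before moving on, it can help to confirm that no template value was left unfilled. The sketch below walks a nested config and reports suspicious entries. It assumes that unfilled values are empty, `None`, or written as `<PLACEHOLDER>`; that convention, the helper name `find_unfilled`, and the key names in the example are assumptions, so check them against your own `config.json.template`.

```python
def find_unfilled(config, path=""):
    """Return dotted paths of values that are empty or still look like <PLACEHOLDER>.

    Hypothetical helper; the <...> placeholder convention is an assumption.
    """
    missing = []
    for key, value in config.items():
        here = f"{path}.{key}" if path else key
        if isinstance(value, dict):
            # Recurse into nested sections
            missing.extend(find_unfilled(value, here))
        elif value in ("", None):
            missing.append(here)
        elif isinstance(value, str) and value.startswith("<") and value.endswith(">"):
            missing.append(here)
    return missing


# Illustrative config; the section and key names are assumptions
example = {
    "pywren": {"storage_bucket": "my-bucket"},
    "ibm_cf": {"endpoint": "<HOST>", "namespace": "", "api_key": "abc"},
}
print(find_unfilled(example))  # ['ibm_cf.endpoint', 'ibm_cf.namespace']
```

Calling `find_unfilled(config)` on the loaded configuration should return an empty list once every value is filled in.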

#### IBM Cloud Object Storage:

Set up a bucket in IBM Cloud Object Storage

You need an IBM COS bucket to store the input data. If you don't know the names of your existing buckets or would like to create a new one, navigate to your cloud resource list, then find and select your storage instance. From there, you can view all your buckets and create a new bucket in the region you prefer. Make sure you copy the correct endpoint for the bucket from the Endpoint tab of the COS service dashboard. Note: bucket names must be unique.

#### IBM Cloud Functions:

Obtain the API key and endpoint for the IBM Cloud Functions service. Navigate to Getting Started > API Key from the side menu and copy the values for "Current Namespace", "Host" and "Key" into `config.json`. Make sure to prefix the host with "https://" when adding it as the endpoint.
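
The "https://" requirement is easy to miss when pasting the host value. A minimal sketch of a normalizing helper follows; the name `with_https` and the host `functions.example.com` are placeholders chosen for illustration, not real service values.

```python
def with_https(host):
    """Return `host` as an endpoint URL, adding "https://" when no scheme is present."""
    host = host.strip()
    if host.startswith(("http://", "https://")):
        return host  # a scheme is already there; leave it untouched
    return "https://" + host


print(with_https("functions.example.com"))          # https://functions.example.com
print(with_https("https://functions.example.com"))  # https://functions.example.com
```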

In [None]:
from annotation_pipeline.utils import get_ibm_cos_client
cos_client = get_ibm_cos_client(config)

### Data Files Setup

Copy the file `input_config.json.template` to `input_config.json` and fill in the missing values for buckets.<br>
Change the `"ds_id"` value to use a different dataset (specify one of the dataset options).<br>
Change the `"modifiers"` value to use different databases.

In [None]:
input_config = json.load(open('input_config.json'))
input_data = input_config['dataset']
input_db = input_config['molecular_db']

# <a name='upload'></a> 2. Upload Data Files into IBM Cloud Object Storage

### Input Dataset

This part uploads input data from a URL or a local path into IBM Cloud Object Storage.<br>
To upload a dataset from a local path, set `"ds_id": "local"` and specify the file paths in the input config file.
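
The sketch below shows how the selected dataset entry can be looked up and validated. The keys `ds_id`, `datasets`, `ds` and `ds_coord` are the ones this notebook reads from `input_data`; the helper name `selected_dataset` and all values are placeholders, and the keys that hold local file paths are not shown because they are defined by the template.

```python
def selected_dataset(input_data):
    """Return the dataset entry chosen by "ds_id", failing early on an unknown id.

    Hypothetical helper for illustration.
    """
    ds_id = input_data["ds_id"]
    datasets = input_data["datasets"]
    if ds_id not in datasets:
        raise KeyError(f"ds_id {ds_id!r} is not one of {sorted(datasets)}")
    return datasets[ds_id]


# Placeholder values; only the key names mirror what the notebook reads
example_input_data = {
    "ds_id": "local",
    "datasets": {
        "local": {"ds": "placeholder/ds.txt", "ds_coord": "placeholder/ds_coord.txt"},
    },
}
print(selected_dataset(example_input_data)["ds"])  # placeholder/ds.txt
```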

In [None]:
# Specified dataset to be uploaded:
ds_id = input_data['ds_id']
ds_id

In [None]:
from annotation_pipeline.dataset import dumb_input_dataset

In [None]:
%%time
# Downloads the dataset from a URL (or loads it from a local path) and uploads it (add force=True to re-upload if needed)
dumb_input_dataset(config, input_data)

In [None]:
# Prints details of ds.txt to ensure that everything is correct
key = input_data['datasets'][ds_id]['ds']
cos_client.list_objects_v2(Bucket=input_db['bucket'], Prefix=key).get('Contents', [])

In [None]:
# Prints details of ds_coords.txt to ensure that everything is correct
key = input_data['datasets'][ds_id]['ds_coord']
cos_client.list_objects_v2(Bucket=input_db['bucket'], Prefix=key).get('Contents', [])

### Generate Isotopic Peaks from Molecular Databases

This part creates the formulas in IBM Cloud Object Storage, then generates and uploads the centroids database.
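
As a rough mental model of the formula step, the sketch below combines base formulas with modifiers, removes duplicates, and splits the result into chunks that could be processed in parallel. This is an assumption made for illustration and not the implementation of `build_database`; the function name, the modifier strings and the chunk size are all hypothetical.

```python
from itertools import product


def build_formula_chunks(formulas, modifiers, chunk_size):
    """Combine every formula with every modifier, deduplicate, and split into chunks.

    Illustrative only; the real pipeline may combine and store formulas differently.
    """
    ions = sorted({formula + modifier for formula, modifier in product(formulas, modifiers)})
    return [ions[i:i + chunk_size] for i in range(0, len(ions), chunk_size)]


chunks = build_formula_chunks(["C6H12O6", "H2O"], ["+H", "+Na"], chunk_size=3)
print(len(chunks), sum(len(chunk) for chunk in chunks))  # 2 4
```

Chunking keeps each unit of work small enough for one serverless invocation, which is presumably why the notebook tracks `formula_chunk_keys`.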

In [None]:
# Specified modifiers to be used:
input_db['modifiers']

In [None]:
from annotation_pipeline.molecular_db import dump_mol_db, build_database, calculate_centroids, clean_formula_chunks

In [None]:
# Download commonly used mol DBs from METASPACE (add force=True to redownload if needed)
dump_mol_db(config, input_db['bucket'], 'metabolomics/db/mol_db1.pickle', 22) #HMDB-v4
dump_mol_db(config, input_db['bucket'], 'metabolomics/db/mol_db2.pickle', 19) #ChEBI-2018-01
dump_mol_db(config, input_db['bucket'], 'metabolomics/db/mol_db3.pickle', 24) #LipidMaps-2017-12-12
dump_mol_db(config, input_db['bucket'], 'metabolomics/db/mol_db4.pickle', 26) #SwissLipids-2018-02-02

In [None]:
%%time
num_formulas, formula_chunk_keys = build_database(config, input_db)

In [None]:
num_formulas, len(formula_chunk_keys), formula_chunk_keys[:3]

In [None]:
%%time
centroids_shape, centroids_head = calculate_centroids(config, input_db, formula_chunk_keys)

In [None]:
centroids_shape

In [None]:
centroids_head

In [None]:
clean_formula_chunks(config, input_db, formula_chunk_keys)

In [None]:
# Prints details of centroids.pickle to ensure that everything is correct
cos_client.list_objects_v2(Bucket=input_db['bucket'], Prefix=input_db['centroids_pandas']).get('Contents', [])

# <a name='split'></a> 3. Split Dataset into Segments

In [None]:
from annotation_pipeline.dataset_segmentation import generate_segm_intervals, split_spectra_into_segments

In [None]:
segm_n = 256

In [None]:
segm_intervals = generate_segm_intervals(config, input_db, segm_n)

In [None]:
segm_intervals[:5]
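
One way to picture segment intervals is as contiguous m/z ranges that each cover a similar number of values. The sketch below derives such ranges from a list of m/z values. It is an assumption for illustration and does not reproduce `generate_segm_intervals`; the function name and the sample values are hypothetical.

```python
def segment_intervals(mz_values, segm_n):
    """Split m/z values into `segm_n` contiguous (lower, upper) intervals of near-equal counts."""
    mzs = sorted(mz_values)
    if segm_n < 1 or not mzs:
        raise ValueError("need at least one segment and one m/z value")
    # Lower bound of segment i is the value at the i-th equal-count position
    bounds = [mzs[min(len(mzs) - 1, (i * len(mzs)) // segm_n)] for i in range(segm_n)]
    bounds.append(mzs[-1])  # upper bound of the last segment
    return list(zip(bounds[:-1], bounds[1:]))


example_mzs = [100.0, 101.0, 102.0, 103.0, 104.0, 105.0, 106.0, 107.0]
print(segment_intervals(example_mzs, 4))
# [(100.0, 102.0), (102.0, 104.0), (104.0, 106.0), (106.0, 107.0)]
```

Equal-count intervals balance the work across segments better than equal-width ones when peaks cluster in part of the m/z range.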

In [None]:
%%time
split_spectra_into_segments(config, input_data, segm_n, segm_intervals)

In [None]:
# Prints details of the first three pickled segments
cos_client.list_objects_v2(Bucket=input_data['bucket'], Prefix=input_data['segments']).get('Contents', [])[:3]

In [None]:
# Prints the number of segments in COS
cos_client.list_objects_v2(Bucket=input_data['bucket'], Prefix=input_data['segments'])['KeyCount']
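
A single `list_objects_v2` call returns at most one page of keys (1000 by default in S3-compatible APIs), so `KeyCount` undercounts once there are more objects than fit in a page. A minimal sketch of a paginated count follows. `count_keys` is a hypothetical helper, and `FakeClient` is a stand-in that makes the example runnable without credentials; with the real client you would pass `cos_client` instead.

```python
def count_keys(client, bucket, prefix):
    """Count all keys under `prefix`, following continuation tokens across pages."""
    total = 0
    kwargs = {"Bucket": bucket, "Prefix": prefix}
    while True:
        page = client.list_objects_v2(**kwargs)
        total += page.get("KeyCount", 0)
        if not page.get("IsTruncated"):
            return total
        kwargs["ContinuationToken"] = page["NextContinuationToken"]


class FakeClient:
    """Stand-in for the COS client that serves a few keys per page; for demonstration only."""

    def __init__(self, keys, page_size=2):
        self.keys, self.page_size = keys, page_size

    def list_objects_v2(self, Bucket, Prefix, ContinuationToken=None):
        matching = [k for k in self.keys if k.startswith(Prefix)]
        start = int(ContinuationToken or 0)
        chunk = matching[start:start + self.page_size]
        end = start + len(chunk)
        page = {"KeyCount": len(chunk), "IsTruncated": end < len(matching)}
        if page["IsTruncated"]:
            page["NextContinuationToken"] = str(end)
        return page


fake = FakeClient(["seg/0", "seg/1", "seg/2", "seg/3", "seg/4", "other"])
print(count_keys(fake, "bucket", "seg/"))  # 5
```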

# <a name='annotation'></a> 4. Apply Annotation to each Segment in Parallel

In [None]:
from annotation_pipeline.annotation import annotate_spectra

In [None]:
%%time
results = annotate_spectra(config, input_data, input_db, segm_n)

In [None]:
len(results)

# <a name='results'></a> 5. Get Results

In [None]:
from annotation_pipeline.annotation import merge_annotation_results
formula_scores_df, formula_images = merge_annotation_results(results)

In [None]:
formula_scores_df.shape, len(formula_images)

### Example

In [None]:
from matplotlib import pyplot as plt
# Pick one image from the first formula's image set and display it
for key, image_set in formula_images.items():
    img = image_set[0][1]
    break
plt.imshow(img.toarray())

# <a name='clean'></a> 6. Clean Segments Data in IBM Cloud Object Storage

In [None]:
from annotation_pipeline.dataset_segmentation import clean_segments
clean_segments(config, input_data)