# Metaspace annotation pipeline on IBM Cloud

Experimental code to integrate Metaspace [engine](https://github.com/metaspace2020/metaspace/tree/master/metaspace/engine)
with [PyWren](https://github.com/pywren/pywren-ibm-cloud) for IBM Cloud.

## Table of Contents
1. [Setup](#setup)
2. [Generate Isotopic Peaks from Molecular Databases](#database)
3. [Run Annotation Pipeline](#annotation)
4. [Display Annotations](#display)
5. [Clean Temp Data in IBM COS](#clean)

# <a name="setup"></a> 1. Setup

This notebook requires IBM Cloud Object Storage and IBM Cloud Functions.
Create an instance of each service from the IBM Cloud dashboard before continuing.


In [None]:
%config Completer.use_jedi = False
%matplotlib inline
%load_ext autoreload
%autoreload 2

In [None]:
# Raise the open-file limit; the notebook can otherwise fail with "too many open files"
import resource
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print('Before:', soft, hard)

# Raise only the soft limit, capped at the hard limit (raising the hard limit requires root privileges)
new_soft = 10000 if hard == resource.RLIM_INFINITY else min(10000, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print('After:', soft, hard)

In [None]:
import logging
logging.basicConfig(level=logging.INFO)

### Project Setup

Install `pywren-annotation-pipeline` project and all its dependencies.

In [None]:
!pip install -e .

In [None]:
# Display the IBM PyWren version
import pywren_ibm_cloud as pywren
pywren.__version__

### IBM PyWren Setup

Copy the file `config.json.template` to `config.json` and fill in the missing values for API keys, buckets and endpoints per these instructions:

1. IBM Cloud Object Storage - you need an IBM COS bucket to store the input data. If you don't know which buckets you already have, or would like to create a new one, navigate to your cloud resource list, then find and select your storage instance. From there you can view all your buckets and create a new bucket in the region you prefer. Make sure you copy the correct endpoint for the bucket from the Endpoint tab of the COS service dashboard. Note: Bucket names must be unique.

2. IBM Cloud Functions - obtain the API key and endpoint of the IBM Cloud Functions service. Navigate to Getting Started > API Key from the side menu and copy the values for "Current Namespace", "Host" and "Key" into `config.json`. Prefix the host with "https://" when entering it as the endpoint.
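
Before loading the file, it can help to check that no value was left empty and that the Cloud Functions host carries the `https://` prefix. The sketch below is illustrative only: the helper names `ensure_https` and `find_empty_values` are hypothetical, and the layout of `sample_config` is an assumption rather than the actual contents of `config.json.template` (only `storage.db_bucket` is known from later cells of this notebook).

```python
def ensure_https(host):
    """Return the host as an endpoint URL, adding 'https://' if no scheme is present."""
    host = host.strip()
    if host.startswith(('http://', 'https://')):
        return host
    return 'https://' + host


def find_empty_values(cfg, prefix=''):
    """Recursively collect dotted paths of config entries that are empty strings or None."""
    empty = []
    for key, value in cfg.items():
        path = f'{prefix}{key}'
        if isinstance(value, dict):
            empty.extend(find_empty_values(value, prefix=path + '.'))
        elif value is None or value == '':
            empty.append(path)
    return empty


# Hypothetical sample; the real template may use different section and key names.
sample_config = {
    'functions': {'endpoint': ensure_https('example-host.invalid'), 'api_key': ''},
    'storage': {'db_bucket': 'my-db-bucket'},
}
print(find_empty_values(sample_config))
```

Running `find_empty_values` on the dictionary loaded from `config.json` lists every entry that still needs a value, which is usually easier to act on than an authentication error raised later in the pipeline.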

In [None]:
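import json
# Load the PyWren configuration; the context manager closes the file handle afterwards
with open('config.json') as f:
    config = json.load(f)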

### Input Files Setup

Choose one of the input config files to control how much processing is done. See the `README.md` for more information on each dataset.
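
The dataset and database configs used in this notebook share a numeric suffix, so one way to keep the pair in sync is to build both paths from a single number. This is a minimal sketch: the helper name `input_config_paths` is hypothetical, and which numbers actually exist depends on the files shipped in the `metabolomics` directory.

```python
from pathlib import Path


def input_config_paths(size, base_dir='metabolomics'):
    """Return the (dataset, database) config paths that share the given numeric suffix."""
    base = Path(base_dir)
    return base / f'ds_config{size}.json', base / f'db_config{size}.json'


ds_path, db_path = input_config_paths(2)
print(ds_path.as_posix(), db_path.as_posix())
```

Both paths can then be passed to `open()` in place of the hard-coded file names.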

In [None]:
import json

# Input dataset and database (a higher config number means a larger job, a lower one a smaller job)
with open('metabolomics/ds_config2.json') as f:
    input_ds = json.load(f)
with open('metabolomics/db_config2.json') as f:
    input_db = json.load(f)

In [None]:
input_ds

In [None]:
input_db

# <a name="database"></a> 2. Generate Isotopic Peaks from Molecular Databases

In [None]:
from annotation_pipeline.molecular_db import upload_mol_dbs_from_dir

In [None]:
# Upload molecular databases into IBM COS
upload_mol_dbs_from_dir(config, config['storage']['db_bucket'], 'metabolomics/db', 'metabolomics/db')

# <a name="annotation"></a> 3. Run Annotation Pipeline

In [None]:
from annotation_pipeline.pipeline import Pipeline
pipeline = Pipeline(config, input_ds, input_db, use_cache=True)

### Database preprocessing

In [None]:
# Parse the database
%time pipeline.build_database()

In [None]:
# Calculate theoretical centroids for each formula
%time pipeline.calculate_centroids()

### Dataset preprocessing

In [None]:
# Load the dataset's parser
%time pipeline.load_ds()

In [None]:
# Parse the dataset into chunks stored in IBM COS
%time pipeline.split_ds()

In [None]:
# Sort dataset chunks into ordered dataset segments
%time pipeline.segment_ds()

In [None]:
# Sort database chunks into ordered database segments
%time pipeline.segment_centroids()
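
The two segmentation steps turn unordered chunks into ordered segments. The sketch below illustrates the general idea under the assumption that the ordering key is m/z and that segments are defined by ascending m/z boundaries; it is not the code behind `pipeline.segment_ds()` or `pipeline.segment_centroids()`, and the function name `segment_by_mz` and the sample values are hypothetical.

```python
from bisect import bisect_right


def segment_by_mz(chunks, bounds):
    """Route m/z values from unordered chunks into m/z-ordered segments.

    `bounds` holds the ascending boundaries between segments, so
    len(bounds) + 1 segments are produced. A value equal to a boundary
    goes into the higher segment.
    """
    segments = [[] for _ in range(len(bounds) + 1)]
    for chunk in chunks:
        for mz in chunk:
            segments[bisect_right(bounds, mz)].append(mz)
    # Sort within each segment so the concatenation of all segments is fully ordered
    return [sorted(segment) for segment in segments]


# Made-up m/z values, for illustration only
chunks = [[310.2, 105.1, 220.7], [150.0, 405.9, 199.9]]
segments = segment_by_mz(chunks, bounds=[200.0, 300.0])
print(segments)
```

Because every segment covers a disjoint m/z range, a dataset segment only needs to be compared against the database segments whose ranges overlap it, which is what makes this kind of layout convenient for parallel processing.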

### Engine

In [None]:
# Annotate the dataset against the molecular database, writing the resulting images to IBM COS
%time pipeline.annotate()

In [None]:
# Estimate the expected share of false annotations via the false discovery rate (FDR)
%time pipeline.run_fdr()
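
As background on the FDR step, the sketch below shows a generic target-decoy estimate: at a given score threshold, the number of decoy hits is taken as an estimate of the number of false target hits. It assumes equally sized target and decoy sets and made-up scores, and it is not necessarily the exact procedure implemented by `pipeline.run_fdr()`; the function name `estimate_fdr` is hypothetical.

```python
def estimate_fdr(target_scores, decoy_scores, threshold):
    """Estimate the FDR at a score threshold as (#decoys passing) / (#targets passing)."""
    n_targets = sum(score >= threshold for score in target_scores)
    n_decoys = sum(score >= threshold for score in decoy_scores)
    if n_targets == 0:
        # No target passes the threshold, so there is nothing to be falsely discovered
        return 0.0
    # The ratio can exceed 1 when decoys outnumber targets; cap it at 100 %
    return min(1.0, n_decoys / n_targets)


# Made-up scores, for illustration only
target_msm = [0.95, 0.90, 0.80, 0.40, 0.10]
decoy_msm = [0.85, 0.30, 0.05, 0.02, 0.01]
print(estimate_fdr(target_msm, decoy_msm, threshold=0.5))
```

Raising the threshold typically lowers the estimated FDR at the cost of fewer accepted annotations.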

In [None]:
# Display the result statistics
%time results_df = pipeline.get_results()
results_df

### PyWren Summary

In [None]:
# Display PyWren statistics file
from annotation_pipeline.utils import get_pywren_stats
get_pywren_stats()

# <a name="display"></a> 4. Display Annotations

In [None]:
# Download the images of the annotated molecules
%time formula_images = pipeline.get_images()

In [None]:
# Display statistics for the top-scoring annotated molecules (highest MSM first)
top_mols = (results_df
               .sort_values('msm', ascending=False)
               .drop('database_path', axis=1)
               .drop_duplicates(['mol','modifier','adduct']))
top_mols.head()
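
The chain above relies on the order of operations: sorting by `msm` in descending order first means that `drop_duplicates`, which keeps the first occurrence by default, retains the highest-scoring row for each `(mol, modifier, adduct)` combination. The toy frame below demonstrates this; its values are invented purely for illustration and do not come from any dataset.

```python
import pandas as pd

# Invented rows: the same ion found in two databases with different MSM scores
toy_results = pd.DataFrame({
    'mol': ['C6H12O6', 'C6H12O6', 'C5H5N5'],
    'modifier': ['', '', ''],
    'adduct': ['+H', '+H', '+Na'],
    'msm': [0.41, 0.87, 0.63],
    'database_path': ['db_a.csv', 'db_b.csv', 'db_a.csv'],
})

top = (toy_results
       .sort_values('msm', ascending=False)
       .drop('database_path', axis=1)
       .drop_duplicates(['mol', 'modifier', 'adduct']))
print(top)
```

Reversing the two steps would keep whichever duplicate happens to come first in the unsorted frame, which is not necessarily the best-scoring one.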

In [None]:
# Display images of the top-scoring annotated molecules
import matplotlib.pyplot as plt
for i, (formula_i, row) in enumerate(top_mols.head().iterrows()):
    plt.figure(i)
    plt.title(f'{row.mol}{row.modifier}{row.adduct} - MSM {row.msm:.3f}')
    plt.imshow(formula_images[formula_i][0].toarray())

# <a name="clean"></a> 5. Clean Temp Data in IBM COS

In [None]:
pipeline.clean()