# METASPACE annotation pipeline on Lithops

Experimental code to integrate the [METASPACE annotation engine](https://github.com/metaspace2020/metaspace/tree/master/metaspace/engine)
with [Lithops](https://github.com/lithops-cloud/lithops).

## Table of Contents
1. [Setup](#setup)
2. [Run Annotation Pipeline](#annotation)
3. [Display Annotations](#display)
4. [Clean Temp Data](#clean)

# <a name="setup"></a> Setup

Follow the instructions in [README.md](./README.md) to configure Lithops to use a cloud platform. If not configured, Lithops will execute code on the local computer using `multiprocessing`.

In [1]:
%load_ext autoreload
%autoreload 2
%config Completer.use_jedi = False
%matplotlib inline

In [2]:
# Raise the open-file limit: notebook sessions can otherwise fail with "too many open files"
import resource
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print('Before:', soft, hard)

# Raise the soft limit. The hard limit can only be raised by a privileged (root) user
resource.setrlimit(resource.RLIMIT_NOFILE, (10000, hard))
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print('After:', soft, hard)

Before: 4096 1048576
After: 10000 1048576
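
The cell above assumes the hard limit is at least 10000. A minimal sketch of a more defensive variant follows; the helper name `raise_nofile_soft_limit` is hypothetical and not part of this project. It caps the requested value at the hard limit and never lowers a soft limit that is already higher.

```python
import resource

def raise_nofile_soft_limit(target):
    """Raise the soft open-file limit towards `target`, never above the hard limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # RLIM_INFINITY means "no hard cap", so the target can be used directly
    capped = target if hard == resource.RLIM_INFINITY else min(target, hard)
    # Only ever raise the soft limit; leave it alone if it is already high enough
    if soft != resource.RLIM_INFINITY and capped > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (capped, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)

soft, hard = raise_nofile_soft_limit(10000)
print('Soft limit:', soft, 'Hard limit:', hard)
```

The `resource` module is only available on Unix-like systems, and some platforms impose additional caps, so the resulting soft limit may still be lower than the requested value.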


In [3]:
import logging
logging.basicConfig(level=logging.INFO)

## Project Setup

Check that the `Lithops-METASPACE` project is installed.

In [4]:
try:
    from annotation_pipeline import Pipeline
    import lithops
    print('Lithops version: ' + lithops.__version__)
except ImportError:
    print('Failed to import Lithops-METASPACE. Please run "pip install -e ." in this directory.')

Lithops version: 3.0.0


In [5]:
# Display Lithops version
import lithops
lithops.__version__

'3.0.0'

## Input Files Setup

Choose between the input config files to select how much processing will be done. See the `README.md` for more information on each dataset.

In [6]:
import json

# Input dataset and database (use a higher or lower config number for a larger or smaller job)
with open('metabolomics/ds_config1.json') as f:
    input_ds = json.load(f)
with open('metabolomics/db_config1.json') as f:
    input_db = json.load(f)

In [7]:
input_ds

{'name': 'Brain02_Bregma1-42_02',
 'imzml_path': 'cos://embl-datasets/small/ds.imzML',
 'ibd_path': 'cos://embl-datasets/small/ds.ibd',
 'num_decoys': 20,
 'polarity': '+',
 'isocalc_sigma': 0.000693,
 'metaspace_id': '2016-09-22_11h16m11s'}

In [8]:
input_db

{'name': 'db_config1',
 'databases': ['metabolomics/db/mol_db1.csv'],
 'adducts': ['', '+H', '+Na', '+K'],
 'modifiers': ['']}
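
Before starting a long run it can help to check that a config file has the expected shape. The sketch below is an illustration only: the helper `missing_keys` is hypothetical, and treating every key shown in the two outputs above as expected is an assumption, since the pipeline may accept configs with fewer keys.

```python
# Keys copied from the two config dictionaries displayed above.
# Assumption: all of them are expected; the pipeline itself may be more lenient.
EXPECTED_DS_KEYS = {'name', 'imzml_path', 'ibd_path', 'num_decoys',
                    'polarity', 'isocalc_sigma', 'metaspace_id'}
EXPECTED_DB_KEYS = {'name', 'databases', 'adducts', 'modifiers'}

def missing_keys(config, expected):
    """Return the expected keys that are absent from `config`, sorted for stable output."""
    return sorted(expected - set(config))

# Same values as the `input_db` output shown above
example_db = {'name': 'db_config1',
              'databases': ['metabolomics/db/mol_db1.csv'],
              'adducts': ['', '+H', '+Na', '+K'],
              'modifiers': ['']}

print(missing_keys(example_db, EXPECTED_DB_KEYS))
print(missing_keys({'name': 'incomplete'}, EXPECTED_DB_KEYS))
```

In the notebook, the same helper could be called on `input_ds` and `input_db` directly after loading them.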

# <a name="annotation"></a> Run Annotation Pipeline

In [11]:
from annotation_pipeline.pipeline import Pipeline
pipeline = Pipeline(
    # Input dataset & metabolite database
    input_ds, input_db, 
    # Whether to use the pipeline cache to accelerate repeated runs with the same database or dataset
    use_ds_cache=True, use_db_cache=True,
    # Set to 'auto' to use the hybrid Serverless+VM implementation when available,
    # True to force Hybrid mode, or False to force pure Serverless mode.
    hybrid_impl='auto'
)

INFO:lithops.config:Lithops v3.0.0 - Python3.8
INFO:annotation-pipeline:Using the pure Serverless implementation
INFO:lithops.config:Lithops v3.0.0 - Python3.8
INFO:lithops.storage.backends.aws_s3.aws_s3:S3 client created - Region: us-east-1
INFO:lithops.serverless.backends.aws_lambda.aws_lambda:AWS Lambda client created - Region: us-east-1
INFO:lithops.storage.backends.aws_s3.aws_s3:S3 client created - Region: us-east-1
INFO:annotation-pipeline:Using cached logs/2023-10-11_13:36:13.csv for statistics
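
The rest of this section calls the pipeline stages one cell at a time. For a scripted run, the same sequence could be driven by a small loop that also records how long each stage takes. This is a sketch under assumptions: `run_stages` and `_RecordingPipeline` are hypothetical names, the stand-in class replaces the real `Pipeline` so the example runs without any cloud backend, and the stage names are simply the method names used in the cells of this notebook.

```python
import time

# Stage names as they are called, in order, in the cells of this notebook
PIPELINE_STAGES = [
    'upload_molecular_databases', 'build_database', 'calculate_centroids',
    'upload_dataset', 'load_ds', 'split_ds', 'segment_ds', 'segment_centroids',
    'annotate', 'run_fdr',
]

def run_stages(pipeline, stages=PIPELINE_STAGES):
    """Call each named stage on `pipeline` in order and return (stage, seconds) pairs."""
    timings = []
    for stage in stages:
        start = time.perf_counter()
        getattr(pipeline, stage)()
        timings.append((stage, time.perf_counter() - start))
    return timings

class _RecordingPipeline:
    """Hypothetical stand-in for Pipeline that only records which stages were called."""
    def __init__(self):
        self.called = []

    def __getattr__(self, name):
        # Any unknown attribute resolves to a no-op stage that logs its own name
        return lambda: self.called.append(name)

demo = _RecordingPipeline()
timings = run_stages(demo)
print([stage for stage, _ in timings])
```

With the real object, `run_stages(pipeline)` would stop at the first stage that raises, which keeps the failing stage easy to identify.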


### Database preprocessing

In [12]:
# Upload the required molecular databases from the local machine
pipeline.upload_molecular_databases()

INFO:annotation-pipeline:Uploaded 1 molecular databases


In [13]:
# Parse the database
pipeline.build_database()

INFO:lithops.invokers:ExecutorID e782ae-1 | JobID M000 - Selected Runtime: metaspace:01 - 512MB
INFO:lithops.invokers:ExecutorID e782ae-1 | JobID M000 - Starting function invocation: generate_formulas() - Total: 84 activations
INFO:lithops.invokers:ExecutorID e782ae-1 | JobID M000 - View execution logs at /tmp/lithops-pau/logs/e782ae-1-M000.log
INFO:lithops.wait:ExecutorID e782ae-1 - Getting results from 84 function activations
INFO:lithops.executors:ExecutorID e782ae-1 - Cleaning temporary data
INFO:lithops.invokers:ExecutorID e782ae-1 | JobID M001 - Selected Runtime: metaspace:01 - 512MB
INFO:lithops.invokers:ExecutorID e782ae-1 | JobID M001 - Starting function invocation: get_formulas_number_per_chunk() - Total: 32 activations
INFO:lithops.invokers:ExecutorID e782ae-1 | JobID M001 - View execution logs at /tmp/lithops-pau/logs/e782ae-1-M001.log
INFO:lithops.wait:ExecutorID e782ae-1 - Getting results from 32 function activations
INFO:

In [14]:
# Calculate theoretical centroids for each formula
pipeline.calculate_centroids()

INFO:lithops.invokers:ExecutorID e782ae-1 | JobID M004 - Selected Runtime: metaspace:01 - 2048MB
INFO:lithops.invokers:ExecutorID e782ae-1 | JobID M004 - Starting function invocation: calculate_peaks_chunk() - Total: 256 activations
INFO:lithops.invokers:ExecutorID e782ae-1 | JobID M004 - View execution logs at /tmp/lithops-pau/logs/e782ae-1-M004.log
INFO:lithops.wait:ExecutorID e782ae-1 - Getting results from 256 function activations
INFO:lithops.executors:ExecutorID e782ae-1 - Cleaning temporary data
INFO:annotation-pipeline:Calculated 3841612 centroids in 256 chunks
INFO:annotation-pipeline:Calculated 256 centroid chunks


### Dataset preprocessing

In [15]:
# Upload the dataset if needed
pipeline.upload_dataset()

INFO:annotation-pipeline:Translating IBM COS path to public HTTPS path for example file "cos://embl-datasets/small/ds.imzML"
INFO:annotation-pipeline:Translating IBM COS path to public HTTPS path for example file "cos://embl-datasets/small/ds.ibd"


In [16]:
# Load the dataset's parser
pipeline.load_ds()

INFO:lithops.invokers:ExecutorID e782ae-1 | JobID A005 - Selected Runtime: metaspace:01 - 2048MB
INFO:lithops.invokers:ExecutorID e782ae-1 | JobID A005 - Starting function invocation: get_portable_imzml_reader() - Total: 1 activations
INFO:lithops.invokers:ExecutorID e782ae-1 | JobID A005 - View execution logs at /tmp/lithops-pau/logs/e782ae-1-A005.log
INFO:lithops.wait:ExecutorID e782ae-1 - Getting results from 1 function activations
INFO:lithops.executors:ExecutorID e782ae-1 - Cleaning temporary data
INFO:annotation-pipeline:Parsed imzml: 12088 spectra found


In [17]:
# Split the dataset into chunks and upload them to cloud object storage
pipeline.split_ds()

INFO:lithops.invokers:ExecutorID e782ae-1 | JobID M006 - Selected Runtime: metaspace:01 - 3072MB
INFO:lithops.invokers:ExecutorID e782ae-1 | JobID M006 - Starting function invocation: upload_chunk() - Total: 1 activations
INFO:lithops.invokers:ExecutorID e782ae-1 | JobID M006 - View execution logs at /tmp/lithops-pau/logs/e782ae-1-M006.log
INFO:lithops.wait:ExecutorID e782ae-1 - Getting results from 1 function activations
INFO:lithops.executors:ExecutorID e782ae-1 - Cleaning temporary data
INFO:annotation-pipeline:Uploaded 1 dataset chunks


In [18]:
# Sort dataset chunks into ordered dataset segments
pipeline.segment_ds()

INFO:annotation-pipeline:Defining dataset segments bounds
INFO:lithops.invokers:ExecutorID e782ae-1 | JobID A007 - Selected Runtime: metaspace:01 - 1024MB
INFO:lithops.invokers:ExecutorID e782ae-1 | JobID A007 - Starting function invocation: get_segm_bounds() - Total: 1 activations
INFO:lithops.invokers:ExecutorID e782ae-1 | JobID A007 - View execution logs at /tmp/lithops-pau/logs/e782ae-1-A007.log
INFO:lithops.wait:ExecutorID e782ae-1 - Getting results from 1 function activations
INFO:lithops.executors:ExecutorID e782ae-1 - Cleaning temporary data
INFO:lithops.invokers:ExecutorID e782ae-1 | JobID M008 - Selected Runtime: metaspace:01 - 2560MB
INFO:lithops.invokers:ExecutorID e782ae-1 | JobID M008 - Starting function invocation: segment_spectra_chunk() - Total: 1 activations
INFO:lithops.invokers:ExecutorID e782ae-1 | JobID M008 - View execution logs at /tmp/lithops-pau/logs/e782ae-1-M008.log
INFO:lithops.wait:ExecutorID e782ae-1 - Getting results from 1 function activations
INFO:lith

In [19]:
# Sort database chunks into ordered database segments
pipeline.segment_centroids()

INFO:lithops.invokers:ExecutorID e782ae-1 | JobID M010 - Selected Runtime: metaspace:01 - 512MB
INFO:lithops.invokers:ExecutorID e782ae-1 | JobID M010 - Starting function invocation: clip_centr_df_chunk() - Total: 256 activations
INFO:lithops.invokers:ExecutorID e782ae-1 | JobID M010 - View execution logs at /tmp/lithops-pau/logs/e782ae-1-M010.log
INFO:lithops.wait:ExecutorID e782ae-1 - Getting results from 256 function activations
INFO:lithops.executors:ExecutorID e782ae-1 - Cleaning temporary data
INFO:annotation-pipeline:Prepared 3509505 centroids
INFO:annotation-pipeline:Defining centroids segments bounds
INFO:lithops.invokers:ExecutorID e782ae-1 | JobID M011 - Selected Runtime: metaspace:01 - 512MB
INFO:lithops.invokers:ExecutorID e782ae-1 | JobID M011 - Starting function invocation: get_first_peak_mz() - Total: 256 activations
INFO:lithops.invokers:ExecutorID e782ae-1 | JobID M011 - View execution logs at /tmp/lithops-pau/logs/e782ae-1-M011.log
INFO:lithops.wait:ExecutorID e782ae

### Engine

In [20]:
# Annotate the dataset against the molecular database, writing the resulting images to cloud object storage
pipeline.annotate()

INFO:annotation-pipeline:Annotating...
INFO:lithops.invokers:ExecutorID e782ae-1 | JobID M014 - Selected Runtime: metaspace:01 - 2048MB
INFO:lithops.invokers:ExecutorID e782ae-1 | JobID M014 - Starting function invocation: process_centr_segment() - Total: 350 activations
INFO:lithops.invokers:ExecutorID e782ae-1 | JobID M014 - View execution logs at /tmp/lithops-pau/logs/e782ae-1-M014.log
INFO:lithops.wait:ExecutorID e782ae-1 - Getting results from 350 function activations
INFO:lithops.executors:ExecutorID e782ae-1 - Cleaning temporary data
INFO:annotation-pipeline:Metrics calculated: 2336


In [23]:
# Estimate the false discovery rate (FDR) of the annotations
pipeline.run_fdr()

INFO:lithops.invokers:ExecutorID e782ae-1 | JobID M019 - Selected Runtime: metaspace:01 - 1536MB
INFO:lithops.job.serialize:module_name: pickle, origin: /home/pau/.pyenv/versions/3.8.5/lib/python3.8/pickle.py
INFO:lithops.job.serialize:module_name: _io, origin: built-in
INFO:lithops.job.serialize:module_name: annotation_pipeline.formula_parser, origin: /home/pau/CLOUDLAB/serverless_benchmarks/Lithops-METASPACE/annotation_pipeline/formula_parser.py
INFO:lithops.job.serialize:module_name: pandas, origin: /home/pau/.pyenv/versions/3.8.5/lib/python3.8/site-packages/pandas/__init__.py
INFO:lithops.job.serialize:module_name: collections, origin: /home/pau/.pyenv/versions/3.8.5/lib/python3.8/collections/__init__.py
INFO:lithops.job.serialize:module_name: annotation_pipeline.utils, origin: /home/pau/CLOUDLAB/serverless_benchmarks/Lithops-METASPACE/annotation_pipeline/utils.py
INFO:lithops.job.serialize:module_name: itertools, origin: built-in
INFO:lithops.job.serialize:module_name: logging, or

TypeError: sequence item 7: expected str instance, NoneType found
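
The traceback is truncated, so the origin of the `None` value inside the pipeline is not shown. The message itself has the form Python produces when `str.join` encounters a non-string item. The sketch below reproduces that message on invented data and shows one way to locate such items; `join_error_message` and `non_string_positions` are hypothetical helper names.

```python
def join_error_message(items, sep=', '):
    """Return the TypeError message raised by str.join, or None if joining succeeds."""
    try:
        sep.join(items)
    except TypeError as exc:
        return str(exc)
    return None

def non_string_positions(items):
    """Return the indices of all items that str.join would reject."""
    return [i for i, item in enumerate(items) if not isinstance(item, str)]

# Invented data: seven strings followed by a None at index 7
items = ['a'] * 7 + [None]
message = join_error_message(items)
print(message)
print(non_string_positions(items))
```

Applying a check like `non_string_positions` to the sequence being joined is one way to narrow down which value is missing.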

### Lithops Summary

In [None]:
# Display statistics file
from annotation_pipeline.utils import PipelineStats
PipelineStats.get()

# <a name="display"></a> Display Annotations

In [None]:
# Display statistics for the top-scoring annotated molecules (highest MSM first)
results_df = pipeline.get_results()
top_mols = (results_df
               .sort_values('msm', ascending=False)
               .drop('database_path', axis=1)
               .drop_duplicates(['mol','modifier','adduct']))
top_mols.head()
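
The chain above sorts by descending `msm` and then drops duplicates, and since `drop_duplicates` keeps the first occurrence by default, the row that survives for each `(mol, modifier, adduct)` combination is the one with the highest MSM. The plain-Python sketch below mirrors that logic on invented rows; `best_per_annotation` is a hypothetical helper, not part of the pipeline.

```python
def best_per_annotation(rows, key_fields=('mol', 'modifier', 'adduct')):
    """Sort rows by descending 'msm', then keep the first row seen for each key."""
    seen = set()
    kept = []
    for row in sorted(rows, key=lambda r: r['msm'], reverse=True):
        key = tuple(row[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

# Toy rows invented for illustration; they are not pipeline results
toy_rows = [
    {'mol': 'A', 'modifier': '', 'adduct': '+H', 'msm': 0.40},
    {'mol': 'A', 'modifier': '', 'adduct': '+H', 'msm': 0.90},
    {'mol': 'B', 'modifier': '', 'adduct': '+Na', 'msm': 0.70},
]
top = best_per_annotation(toy_rows)
print([(r['mol'], r['adduct'], r['msm']) for r in top])
```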

In [None]:
# Download the images of the annotated molecules
formula_images = pipeline.get_images(as_png=False)

In [None]:
# Display images for the top-scoring annotated molecules
import matplotlib.pyplot as plt
for i, (formula_i, row) in enumerate(top_mols.head().iterrows()):
    plt.figure(i)
    plt.title(f'{row.mol}{row.modifier}{row.adduct} - MSM {row.msm:.3f} FDR {row.fdr*100:.0f}%')
    plt.imshow(formula_images[formula_i][0].toarray())

# <a name="clean"></a> Clean Temp Data

In [None]:
pipeline.clean()