# METASPACE annotation pipeline on Lithops

Experimental code to integrate the [METASPACE annotation engine](https://github.com/metaspace2020/metaspace/tree/master/metaspace/engine)
with [Lithops](https://github.com/lithops-cloud/lithops).

## Table of Contents
1. [Setup](#setup)
2. [Run Annotation Pipeline](#annotation)
3. [Display Annotations](#display)
4. [Clean Temp Data](#clean)

# <a name="setup"></a> Setup

Follow the instructions in [README.md](./README.md) to configure Lithops to use a cloud platform. If not configured, Lithops will execute code on the local computer using `multiprocessing`.

In [1]:
%load_ext autoreload
%autoreload 2
%config Completer.use_jedi = False
%matplotlib inline

In [2]:
# Raise the open-file limit to avoid "too many open files" errors in the notebook
import resource
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print('Before:', soft, hard)

# Raise the soft limit. The hard limit can be raised only by a privileged (root) user
resource.setrlimit(resource.RLIMIT_NOFILE, (10000, hard))
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print('After:', soft, hard)

Before: 4096 1048576
After: 10000 1048576


In [3]:
import logging
logging.basicConfig(level=logging.INFO)

## Project Setup

Check that the `Lithops-METASPACE` project is installed.

In [4]:
try:
    from annotation_pipeline import Pipeline
    import lithops
    print('Lithops version: ' + lithops.__version__)
except ImportError:
    print('Failed to import Lithops-METASPACE. Please run "pip install -e ." in this directory.')

Lithops version: 3.0.0


In [5]:
# Display Lithops version
import lithops
lithops.__version__

'3.0.0'

## Input Files Setup

Choose between the input config files to select how much processing will be done. See the `README.md` for more information on each dataset.

In [6]:
import json

# Input dataset and database (raise or lower the config number to change the job size)
with open('metabolomics/ds_config1.json') as ds_file:
    input_ds = json.load(ds_file)
with open('metabolomics/db_config1.json') as db_file:
    input_db = json.load(db_file)
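
Switching job sizes by editing two file names by hand is easy to get wrong, so a small helper can build both paths from one number. This is only a sketch: `config_paths` and `load_configs` are hypothetical helpers, not part of `annotation_pipeline`, and they assume the dataset and database configs always share the same number and live in the `metabolomics` directory used above. Which config numbers exist is described in `README.md`, not here.

```python
import json
from pathlib import Path


def config_paths(size, base_dir="metabolomics"):
    """Return the (dataset, database) config paths for one config number.

    Hypothetical helper: assumes both files use the same number.
    """
    base = Path(base_dir)
    return base / f"ds_config{size}.json", base / f"db_config{size}.json"


def load_configs(size, base_dir="metabolomics"):
    """Load both JSON configs, closing the files once they are read."""
    ds_path, db_path = config_paths(size, base_dir)
    with open(ds_path) as ds_file, open(db_path) as db_file:
        return json.load(ds_file), json.load(db_file)


ds_path, db_path = config_paths(1)
print(ds_path.as_posix(), db_path.as_posix())
```

With this in place, `input_ds, input_db = load_configs(1)` would load the same pair of files as the cell above.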

In [7]:
input_ds

{'name': 'Brain02_Bregma1-42_02',
 'imzml_path': 'cos://embl-datasets/small/ds.imzML',
 'ibd_path': 'cos://embl-datasets/small/ds.ibd',
 'num_decoys': 20,
 'polarity': '+',
 'isocalc_sigma': 0.000693,
 'metaspace_id': '2016-09-22_11h16m11s'}

In [8]:
input_db

{'name': 'db_config1',
 'databases': ['metabolomics/db/mol_db1.csv'],
 'adducts': ['', '+H', '+Na', '+K'],
 'modifiers': ['']}

# <a name="annotation"></a> Run Annotation Pipeline

In [9]:
from annotation_pipeline.pipeline import Pipeline
pipeline = Pipeline(
    # Input dataset & metabolite database
    input_ds, input_db,
    # Whether to use the pipeline cache to accelerate repeated runs with the same database or dataset
    use_ds_cache=True, use_db_cache=True,
    # Set to 'auto' to use the hybrid Serverless+VM implementation when available,
    # True to force Hybrid mode, or False to force pure Serverless mode.
    hybrid_impl='auto'
)

INFO:lithops.config:Lithops v3.0.0 - Python3.8
INFO:annotation-pipeline:Using the pure Serverless implementation
INFO:lithops.config:Lithops v3.0.0 - Python3.8
INFO:lithops.storage.backends.aws_s3.aws_s3:S3 client created - Region: us-east-1
INFO:lithops.serverless.backends.aws_lambda.aws_lambda:AWS Lambda client created - Region: us-east-1
INFO:lithops.storage.backends.aws_s3.aws_s3:S3 client created - Region: us-east-1
INFO:annotation-pipeline:Using cached logs/2023-10-09_10:40:53.csv for statistics


### Database preprocessing

In [10]:
# Upload required molecular databases from local machine
pipeline.upload_molecular_databases()

INFO:annotation-pipeline:Loaded 1 molecular databases from cache


In [11]:
# Parse the database
pipeline.build_database()

INFO:annotation-pipeline:Loaded 256 formula segments and 1 formula-to-id chunks from cache


In [12]:
# Calculate theoretical centroids for each formula
pipeline.calculate_centroids()

INFO:annotation-pipeline:Loaded 256 centroid chunks from cache


### Dataset preprocessing

In [13]:
# Upload the dataset if needed
pipeline.upload_dataset()

INFO:annotation-pipeline:Translating IBM COS path to public HTTPS path for example file "cos://embl-datasets/small/ds.imzML"
INFO:annotation-pipeline:Translating IBM COS path to public HTTPS path for example file "cos://embl-datasets/small/ds.ibd"


In [14]:
# Load the dataset's parser
pipeline.load_ds()

INFO:annotation-pipeline:Loaded imzml from cache, 12088 spectra found


In [16]:
# Split the dataset into chunks and store them in IBM COS
pipeline.split_ds()

INFO:lithops.invokers:ExecutorID bc8a09-0 | JobID M001 - Selected Runtime: metaspace:01 - 3072MB
INFO:lithops.job.serialize:added modules {'concurrent.futures.thread', 'annotation_pipeline.utils', 'annotation_pipeline.segment', 'sys', 'pickle', 'lithops.storage.utils', 'numpy', 'logging', '_io'}
INFO:lithops.job.serialize:added modules set()
INFO:lithops.job.serialize:module_name: concurrent.futures.thread, origin: /home/pau/.pyenv/versions/3.8.5/lib/python3.8/concurrent/futures/thread.py
INFO:lithops.job.serialize:module_name: annotation_pipeline.utils, origin: /home/pau/CLOUDLAB/serverless_pipelines/Lithops-METASPACE/annotation_pipeline/utils.py
INFO:lithops.job.serialize:module_name: annotation_pipeline.segment, origin: /home/pau/CLOUDLAB/serverless_pipelines/Lithops-METASPACE/annotation_pipeline/segment.py
INFO:lithops.job.serialize:module_name: sys, origin: None
INFO:lithops.job.serialize:module_name: pickle, origin: /home/pau/.pyenv/versions/3.8.5/lib/python3.8/pickle.py
INFO:lit

TypeError: sequence item 0: expected str instance, NoneType found
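
The traceback above is truncated, so the failing call is not visible. The message itself, though, is the one Python raises when `str.join` receives a `None` item, and the log does show `origin: None` for the `sys` module. The following is a minimal reproduction of that error message only. It is not the Lithops code, and the values in `origins` are made up for illustration.

```python
# Hypothetical values: one module origin is None, as logged for `sys` above.
origins = [None, "/usr/lib/python3/pickle.py"]

try:
    ", ".join(origins)
except TypeError as exc:
    message = str(exc)
    print(message)

# Dropping the None entries before joining avoids the error.
joined = ", ".join(origin for origin in origins if origin is not None)
print(joined)
```

If the failure really does come from a join over module origins, it would point to the job serialization step rather than to the dataset itself, but that remains an assumption until the full traceback is available.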

In [None]:
# Sort dataset chunks into ordered dataset segments
pipeline.segment_ds()
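
As a rough illustration of what sorting chunks into ordered segments can look like, the sketch below orders peaks by m/z and cuts them into contiguous, similarly sized segments. It assumes the ordering key is m/z, which this notebook does not state, and `segment_by_mz` is a hypothetical function, not the implementation behind `pipeline.segment_ds()`.

```python
def segment_by_mz(peaks, n_segments):
    """Sort (mz, intensity) pairs by m/z and split them into contiguous segments.

    Illustrative only: the real pipeline works on chunks in cloud storage.
    """
    ordered = sorted(peaks, key=lambda peak: peak[0])
    # Ceiling division, with a floor of 1 so an empty input does not break range().
    size = max(1, -(-len(ordered) // n_segments))
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


peaks = [(300.2, 5.0), (100.1, 1.0), (200.5, 3.0), (150.0, 2.0)]
segments = segment_by_mz(peaks, 2)
for index, segment in enumerate(segments):
    print(index, segment)
```

Because each segment covers a distinct, increasing m/z range, a later step can process only the segments that overlap the m/z values it needs.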

In [None]:
# Sort database chunks into ordered database segments
pipeline.segment_centroids()

### Engine

In [None]:
# Annotate the dataset against the molecular database, writing the resulting images to IBM COS
pipeline.annotate()

In [None]:
# Estimate the expected rate of false annotations via FDR (false discovery rate)
pipeline.run_fdr()
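
To make the idea of an FDR estimate concrete, here is a simplified target-decoy sketch: for each target score, the estimate is the number of decoys scoring at least as high divided by the number of targets scoring at least as high. This is a generic textbook form, not the code behind `pipeline.run_fdr()`. It assumes a single decoy set of the same size as the target set, whereas the dataset config above uses `num_decoys: 20`, and `estimate_fdr` is a hypothetical name.

```python
def estimate_fdr(target_scores, decoy_scores):
    """Return a target-decoy FDR estimate for each target score (illustrative)."""
    fdrs = []
    for score in target_scores:
        # The current target always counts itself, so n_targets >= 1.
        n_targets = sum(1 for t in target_scores if t >= score)
        n_decoys = sum(1 for d in decoy_scores if d >= score)
        fdrs.append(min(1.0, n_decoys / n_targets))
    return fdrs


targets = [0.9, 0.8, 0.5, 0.2]   # made-up MSM-like scores for real formulas
decoys = [0.6, 0.3, 0.1, 0.05]   # made-up scores for implausible decoys
fdr_estimates = estimate_fdr(targets, decoys)
print(fdr_estimates)
```

In this toy input, no decoy scores as high as the two best targets, so their estimate is 0, while the weakest target is matched or beaten by two decoys out of four targets, giving 0.5.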

### Lithops Summary

In [None]:
# Display statistics file
from annotation_pipeline.utils import PipelineStats
PipelineStats.get()

# <a name="display"></a> Display Annotations

In [None]:
# Display statistics for the top-scoring (highest MSM) annotated molecules
results_df = pipeline.get_results()
top_mols = (results_df
               .sort_values('msm', ascending=False)
               .drop('database_path', axis=1)
               .drop_duplicates(['mol','modifier','adduct']))
top_mols.head()

In [None]:
# Download the images of the annotated molecules
formula_images = pipeline.get_images(as_png=False)

In [None]:
# Display the images of the top-scoring annotated molecules
import matplotlib.pyplot as plt
for i, (formula_i, row) in enumerate(top_mols.head().iterrows()):
    plt.figure(i)
    plt.title(f'{row.mol}{row.modifier}{row.adduct} - MSM {row.msm:.3f} FDR {row.fdr*100:.0f}%')
    plt.imshow(formula_images[formula_i][0].toarray())

# <a name="clean"></a> Clean Temp Data

In [None]:
pipeline.clean()