# Federated Protein Embeddings and Task Model Fitting with BioNeMo

This example notebook shows how to obtain learned protein representations, in the form of embeddings, using the pre-trained ESM-1nv model. The model was trained with NVIDIA's BioNeMo framework for large language model training and inference. For more details, visit the NVIDIA BioNeMo Service at https://www.nvidia.com/en-us/gpu-cloud/bionemo.

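This notebook walks through the task-fitting workflow in the following sections:

* Dataset sourcing and data splitting
* Federated embedding extraction
* Training an MLP to predict subcellular location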

### Install requirements
Let's start by installing and importing the library dependencies. We'll use requests to download the example dataset, BioPython to parse FASTA sequences into SeqRecord objects, scikit-learn for classification tasks, and matplotlib for visualization.

In [11]:
#!pip install -r requirements.txt
!pip install -e /media/hroth/NVIDIA/home_old/hroth/Code2/nvflare/bionemo_nvflare

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Obtaining file:///media/hroth/NVIDIA/home_old/hroth/Code2/nvflare/bionemo_nvflare

### Obtaining the protein embeddings using the BioNeMo ESM-1nv model
Using BioNeMo, users can obtain numerical vector representations of protein sequences called embeddings. Protein embeddings can then be used for visualization or making downstream predictions.

Here we are interested in training a neural network to predict subcellular location from an embedding.

The data we will be using comes from the paper [Light attention predicts protein location from the language of life](https://academic.oup.com/bioinformaticsadvances/article/1/1/vbab035/6432029) by Stärk et al. In this paper, the authors developed a machine learning algorithm that predicts the subcellular location of proteins from their sequence, using protein language models similar to those hosted by BioNeMo. Protein subcellular location refers to where the protein localizes in the cell; for example, a protein may be found in the nucleus or in the cytoplasm. Knowing where proteins localize can provide insights into the underlying mechanisms of cellular processes and help identify potential targets for drug development. The following image includes a few examples of subcellular locations in an animal cell:


(Image freely available at https://pixabay.com/images/id-48542)

### Dataset sourcing
For our target input sequences, we will point to FASTA sequences in a benchmark dataset called Fitness Landscape Inference for Proteins (FLIP). FLIP encompasses experimental data across adeno-associated virus stability for gene therapy, protein domain B1 stability and immunoglobulin binding, and thermostability from multiple protein families.

In [2]:
# Example protein dataset location
fasta_url = "http://data.bioembeddings.com/public/FLIP/fasta/scl/mixed_soft.fasta"

First, we define the source of the example protein dataset containing the FASTA sequences. This data follows the [biotrainer](https://github.com/sacdallago/biotrainer/blob/main/docs/data_standardization.md) standard, so each record includes class information in the FASTA header, followed by the protein sequence. Here are two example sequences from this file:

```
>Sequence1 TARGET=Cell_membrane SET=train VALIDATION=False
MMKTLSSGNCTLNVPAKNSYRMVVLGASRVGKSSIVSRFLNGRFEDQYTPTIEDFHRKVYNIHGDMYQLDILDTSGNHPFPAM
RRLSILTGDVFILVFSLDSRESFDEVKRLQKQILEVKSCLKNKTKEAAELPMVICGNKNDHSELCRQVPAMEAELLVSGDENC
AYFEVSAKKNTNVNEMFYVLFSMAKLPHEMSPALHHKISVQYGDAFHPRPFCMRRTKVAGAYGMVSPFARRPSVNSDLKYIKA
KVLREGQARERDKCSIQ
>Sequence4833 TARGET=Nucleus SET=train VALIDATION=False
MARTKQTARKSTGGKAPRKQLATKAARKSAPATGGVKKPHRFRPGTVALREIRKYQKSTELLIRKLPFQRLVREIAQDFKTDL
RFQSSAVAALQEAAEAYLVGLFEDTNLCAIHAKRVTIMPKDIQLARRIRGERA
```

Note the following attributes in the FASTA header:

* The `TARGET` attribute holds the subcellular location class for the sequence, for instance `Cell_membrane` or `Nucleus`. This dataset includes a total of ten subcellular location classes, which are listed further below.
* The `SET` attribute defines whether the sequence should be used for training (`train`) or testing (`test`).
* The `VALIDATION` attribute defines whether the sequence should be used for validation (all sequences where this is `True` also have `SET=train`).
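
As a small illustration of the header layout described above, the following sketch pulls the `KEY=value` attributes out of a single header line with a regular expression. The helper name `parse_header` and the exact pattern are assumptions made for this example; the notebook itself uses a similar expression in the data splitting cell.

```python
import re

# KEY=value pairs, where KEY is upper-case (e.g. TARGET, SET, VALIDATION)
HEADER_ATTRIBUTE = re.compile(r"([A-Z_]+)=(\S+)")


def parse_header(header: str) -> dict:
    """Split a biotrainer-style FASTA header into its id and attributes."""
    header = header.lstrip(">")
    # The first whitespace-separated token is the sequence identifier
    seq_id, _, rest = header.partition(" ")
    attributes = dict(HEADER_ATTRIBUTE.findall(rest))
    return {"id": seq_id, **attributes}


example = parse_header(">Sequence1 TARGET=Cell_membrane SET=train VALIDATION=False")
# Attribute values are plain strings, so booleans need an explicit comparison
is_validation = example["VALIDATION"] == "True"
print(example, is_validation)
```

Because every attribute value is read as a string, a flag such as `VALIDATION` has to be compared against `"True"` explicitly rather than being used as a boolean directly.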

### Downloading the protein sequences and subcellular location annotations
In this step we download the FASTA file defined above and parse the sequences into a list of BioPython SeqRecord objects.



In [3]:
import io
import requests
from Bio import SeqIO

# Download the FASTA file from FLIP: https://github.com/J-SNACKKB/FLIP/tree/main/splits/scl
fasta_content = requests.get(fasta_url, headers={
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x86)'
}).content.decode('utf-8')
fasta_stream = io.StringIO(fasta_content)

# Obtain a list of SeqRecords/proteins which contain sequence and attributes
# from the FASTA header
proteins = list(SeqIO.parse(fasta_stream, "fasta"))
print(f"Downloaded {len(proteins)} sequences")

Downloaded 13949 sequences


### Data splitting
Next, we prepare the data for simulating federated learning by filtering out long sequences and partitioning the remaining proteins across `n_clients` simulated sites.

In [4]:
import os
import re
import uuid
from importlib import reload

import numpy as np
import pandas as pd

import split_data
reload(split_data)  # pick up local edits to split_data.py without restarting the kernel
from split_data import split, list_to_dataframe

n_clients = 3
# Keep only proteins with sequence length <= 512 for embedding queries
MAX_SEQUENCE_LEN = 512
seed = 0
out_dir = "/tmp/fasta/mixed_soft"
split_alpha = 1.0  # moderate label heterogeneity of alpha=1.0

np.random.seed(seed)

# Extract metadata from each FASTA header, skipping sequences that are too long
data = []
for i, x in enumerate(proteins):
    if len(str(x.seq)) > MAX_SEQUENCE_LEN:
        continue

    # Header attributes look like KEY=value; values may contain letters,
    # digits, and underscores (e.g. TARGET=Cell_membrane)
    entry = dict(re.findall(r"([A-Z_]+)=(-?[A-Za-z0-9_]+[.0-9]*)", x.description))
    entry["sequence"] = str(x.seq)
    entry["id"] = str(i)

    data.append(entry)
print(f"Read {len(data)} valid sequences.")

# Split the data and save for each client
# Note, test_data is kept the same on each client and is not split
split(proteins=data, num_sites=n_clients, split_dir=out_dir, alpha=split_alpha)

Read 8619 valid sequences.
Partition protein dataset with 10 classes into 3 sites with Dirichlet sampling under alpha 1.0
{'site-1': {'Cell_membrane': 226,
            'Cytoplasm': 762,
            'Endoplasmic_reticulum': 108,
            'Extracellular': 239,
            'Mitochondrion': 572,
            'Nucleus': 117,
            'Peroxisome': 10,
            'Plastid': 325},
 'site-2': {'Cell_membrane': 152,
            'Cytoplasm': 269,
            'Endoplasmic_reticulum': 54,
            'Extracellular': 728,
            'Mitochondrion': 81,
            'Nucleus': 1019,
            'Peroxisome': 25,
            'Plastid': 71},
 'site-3': {'Cell_membrane': 136,
            'Cytoplasm': 132,
            'Endoplasmic_reticulum': 300,
            'Extracellular': 365,
            'Golgi_apparatus': 164,
            'Lysosome': 149,
            'Mitochondrion': 285,
            'Nucleus': 560,
            'Peroxisome': 54,
            'Plastid': 16}}
Saved 2358 training and 1700 test
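
The log line above mentions Dirichlet sampling under `alpha`. The `split` function from `split_data` is not shown in this notebook, so the following is only a sketch of how such a label-heterogeneous partition is commonly built: for each class, site proportions are drawn from a Dirichlet distribution and the class members are divided accordingly. The function names are hypothetical, and the Dirichlet draw uses normalized Gamma samples from the standard library instead of NumPy so that the example stays dependency-free.

```python
import random
from collections import defaultdict


def dirichlet_proportions(n_sites, alpha, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) via normalized Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n_sites)]
    total = sum(draws)
    return [d / total for d in draws]


def dirichlet_split(labels, n_sites, alpha, seed=0):
    """Assign sample indices to sites, class by class, with Dirichlet proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    sites = [[] for _ in range(n_sites)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        proportions = dirichlet_proportions(n_sites, alpha, rng)
        start, cumulative = 0, 0.0
        for site, p in enumerate(proportions):
            cumulative += p
            # The last site takes whatever remains so that no sample is dropped
            end = len(indices) if site == n_sites - 1 else round(cumulative * len(indices))
            sites[site].extend(indices[start:end])
            start = end
    return sites


toy_labels = ["Nucleus"] * 50 + ["Cytoplasm"] * 30 + ["Plastid"] * 20
toy_sites = dirichlet_split(toy_labels, n_sites=3, alpha=1.0)
print([len(s) for s in toy_sites])
```

Smaller values of `alpha` concentrate each class on fewer sites, while larger values approach an even split. This is consistent with some classes being absent from `site-1` and `site-2` in the output above.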

### Federated embedding extraction
Running inference of the ESM-1nv model to extract embeddings requires a GPU with at least 12 GB memory. Here we run inference on each client sequentially using one thread to preserve GPU memory.

In [5]:
from nvflare import SimulatorRunner    

simulator = SimulatorRunner(
    job_folder="jobs/embeddings",
    workspace="/tmp/nvflare/bionemo/embeddings",
    n_clients=n_clients,
    threads=1  # due to memory constraints, we run the client execution sequentially in one thread
)
run_status = simulator.run()
print("Simulator finished with run_status", run_status)

2023-12-02 03:27:50,632 - SimulatorRunner - INFO - Create the Simulator Server.
2023-12-02 03:27:50,636 - CoreCell - INFO - server: creating listener on tcp://0:41615
2023-12-02 03:27:50,652 - CoreCell - INFO - server: created backbone external listener for tcp://0:41615
2023-12-02 03:27:50,653 - ConnectorManager - INFO - 737: Try start_listener Listener resources: {'secure': False, 'host': 'localhost'}
2023-12-02 03:27:50,654 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00002 PASSIVE tcp://0:63829] is starting
2023-12-02 03:27:51,157 - CoreCell - INFO - server: created backbone internal listener for tcp://localhost:63829
2023-12-02 03:27:51,159 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00001 PASSIVE tcp://0:41615] is starting
2023-12-02 03:27:51,233 - nvflare.fuel.hci.server.hci - INFO - Starting Admin Server localhost on Port 50975
2023-12-02 03:27:51,234 - SimulatorRunner - INFO - Deploy the Apps.
2023-12-02 03:27:51,826 - SimulatorRunner - INFO - Create t

      rank_zero_deprecation(
    
I1202 03:28:07.507468 140070230435648 rank_zero.py:53] GPU available: True (cuda), used: True
I1202 03:28:07.507681 140070230435648 rank_zero.py:53] TPU available: False, using: 0 TPU cores
I1202 03:28:07.507756 140070230435648 rank_zero.py:53] IPU available: False, using: 0 IPUs
I1202 03:28:07.507817 140070230435648 rank_zero.py:53] HPU available: False, using: 0 HPUs


[NeMo I 2023-12-02 03:28:07 megatron_init:234] Rank 0 has data parallel group: [0]
[NeMo I 2023-12-02 03:28:07 megatron_init:237] All data parallel group ranks: [[0]]
[NeMo I 2023-12-02 03:28:07 megatron_init:238] Ranks 0 has data parallel rank: 0
[NeMo I 2023-12-02 03:28:07 megatron_init:246] Rank 0 has model parallel group: [0]
[NeMo I 2023-12-02 03:28:07 megatron_init:247] All model parallel group ranks: [[0]]
[NeMo I 2023-12-02 03:28:07 megatron_init:257] Rank 0 has tensor model parallel group: [0]
[NeMo I 2023-12-02 03:28:07 megatron_init:261] All tensor model parallel group ranks: [[0]]
[NeMo I 2023-12-02 03:28:07 megatron_init:262] Rank 0 has tensor model parallel rank: 0
[NeMo I 2023-12-02 03:28:07 megatron_init:276] Rank 0 has pipeline model parallel group: [0]
[NeMo I 2023-12-02 03:28:07 megatron_init:288] Rank 0 has embedding group: [0]
[NeMo I 2023-12-02 03:28:07 megatron_init:294] All pipeline model parallel group ranks: [[0]]
[NeMo I 2023-12-02 03:28:07 megatron_init:295]

[NeMo W 2023-12-02 03:28:07 modelPT:244] You tried to register an artifact under config key=tokenizer.vocab_file but an artifact for it has already been registered.


[NeMo I 2023-12-02 03:28:07 nlp_overrides:401] Model ESM1nvModel was successfully restored from /tmp/nvflare/bionemo/embeddings/simulate_job/app_site-1/models/esm1nv.nemo.
[NeMo I 2023-12-02 03:28:07 utils:340] DDP is not initialized. Initializing...
2023-12-02 03:28:07,963 - lightning_fabric.utilities.distributed - INFO - Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
2023-12-02 03:28:07,965 - pytorch_lightning.utilities.rank_zero - INFO - ----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

[NeMo I 2023-12-02 03:28:07 text_memmap_dataset:116] Building data files
[NeMo I 2023-12-02 03:28:07 text_memmap_dataset:462] Processing 1 data files using 18 workers



    


[NeMo I 2023-12-02 03:28:08 text_memmap_dataset:432] Building indexing for fn = /tmp/fasta/mixed_soft/data_site-1.csv
[NeMo I 2023-12-02 03:28:08 text_memmap_dataset:444] Saving idx file = /tmp/fasta/mixed_soft/data_site-1.csv.idx.npy
[NeMo I 2023-12-02 03:28:08 text_memmap_dataset:446] Saving metadata file = /tmp/fasta/mixed_soft/data_site-1.csv.idx.info
[NeMo I 2023-12-02 03:28:08 text_memmap_dataset:471] Time building 1 / 1 mem-mapped files: 0:00:00.350398
[NeMo I 2023-12-02 03:28:09 text_memmap_dataset:462] Processing 1 data files using 18 workers
[NeMo I 2023-12-02 03:28:10 text_memmap_dataset:471] Time building 0 / 1 mem-mapped files: 0:00:00.426964
[NeMo I 2023-12-02 03:28:10 text_memmap_dataset:158] Loading data files
[NeMo I 2023-12-02 03:28:10 text_memmap_dataset:249] Loading /tmp/fasta/mixed_soft/data_site-1.csv
[NeMo I 2023-12-02 03:28:10 text_memmap_dataset:161] Time loading 1 mem-mapped files: 0:00:00.000728
[NeMo I 2023-12-02 03:28:10 text_memmap_dataset:165] Computing g

[NeMo W 2023-12-02 03:28:10 memmap_csv_fields_dataset:61] CSVFieldsMemmapDataset will be available in NeMo 1.21


2023-12-02 03:28:11,212 - pytorch_lightning.accelerators.cuda - INFO - LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
Predicting DataLoader 0:   0%|          | 0/32 [00:00<?, ?it/s]



Predicting DataLoader 0: 100%|██████████| 32/32 [00:11<00:00,  2.86it/s]
2023-12-02 03:28:27,543 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=bionemo_inference, peer=site-1, peer_run=simulate_job]: got result from client site-1 for task: name=bionemo_inference, id=d75e36e7-344b-4983-a2c5-9fb0d1b226c5
2023-12-02 03:28:27,553 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=bionemo_inference, peer=site-1, peer_run=simulate_job, peer_rc=OK, task_name=bionemo_inference, task_id=d75e36e7-344b-4983-a2c5-9fb0d1b226c5]: finished processing client result by bionemo_inference
2023-12-02 03:28:27,554 - SubmitUpdateCommand - INFO - submit_update process. client_name:site-1   task_id:d75e36e7-344b-4983-a2c5-9fb0d1b226c5
2023-12-02 03:28:27,557 - SimulatorClientRunner - INFO - Simulate Run client: site-2 on GPU group: None
2023-12-02 03:28:27,560 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connection [CN00006 Not Connected] is closed PID: 737
2023-1

      rank_zero_deprecation(
    
I1202 03:28:40.212412 140307082291008 rank_zero.py:53] GPU available: True (cuda), used: True
I1202 03:28:40.212627 140307082291008 rank_zero.py:53] TPU available: False, using: 0 TPU cores
I1202 03:28:40.212706 140307082291008 rank_zero.py:53] IPU available: False, using: 0 IPUs
I1202 03:28:40.212767 140307082291008 rank_zero.py:53] HPU available: False, using: 0 HPUs


[NeMo I 2023-12-02 03:28:40 megatron_init:234] Rank 0 has data parallel group: [0]
[NeMo I 2023-12-02 03:28:40 megatron_init:237] All data parallel group ranks: [[0]]
[NeMo I 2023-12-02 03:28:40 megatron_init:238] Ranks 0 has data parallel rank: 0
[NeMo I 2023-12-02 03:28:40 megatron_init:246] Rank 0 has model parallel group: [0]
[NeMo I 2023-12-02 03:28:40 megatron_init:247] All model parallel group ranks: [[0]]
[NeMo I 2023-12-02 03:28:40 megatron_init:257] Rank 0 has tensor model parallel group: [0]
[NeMo I 2023-12-02 03:28:40 megatron_init:261] All tensor model parallel group ranks: [[0]]
[NeMo I 2023-12-02 03:28:40 megatron_init:262] Rank 0 has tensor model parallel rank: 0
[NeMo I 2023-12-02 03:28:40 megatron_init:276] Rank 0 has pipeline model parallel group: [0]
[NeMo I 2023-12-02 03:28:40 megatron_init:288] Rank 0 has embedding group: [0]
[NeMo I 2023-12-02 03:28:40 megatron_init:294] All pipeline model parallel group ranks: [[0]]
[NeMo I 2023-12-02 03:28:40 megatron_init:295]

[NeMo W 2023-12-02 03:28:40 modelPT:244] You tried to register an artifact under config key=tokenizer.vocab_file but an artifact for it has already been registered.


[NeMo I 2023-12-02 03:28:40 nlp_overrides:401] Model ESM1nvModel was successfully restored from /tmp/nvflare/bionemo/embeddings/simulate_job/app_site-2/models/esm1nv.nemo.
[NeMo I 2023-12-02 03:28:40 utils:340] DDP is not initialized. Initializing...
2023-12-02 03:28:40,740 - lightning_fabric.utilities.distributed - INFO - Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
2023-12-02 03:28:40,741 - pytorch_lightning.utilities.rank_zero - INFO - ----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

[NeMo I 2023-12-02 03:28:40 text_memmap_dataset:116] Building data files
[NeMo I 2023-12-02 03:28:40 text_memmap_dataset:462] Processing 1 data files using 18 workers



    


[NeMo I 2023-12-02 03:28:41 text_memmap_dataset:432] Building indexing for fn = /tmp/fasta/mixed_soft/data_site-2.csv
[NeMo I 2023-12-02 03:28:41 text_memmap_dataset:444] Saving idx file = /tmp/fasta/mixed_soft/data_site-2.csv.idx.npy
[NeMo I 2023-12-02 03:28:41 text_memmap_dataset:446] Saving metadata file = /tmp/fasta/mixed_soft/data_site-2.csv.idx.info
[NeMo I 2023-12-02 03:28:41 text_memmap_dataset:471] Time building 1 / 1 mem-mapped files: 0:00:00.384671
[NeMo I 2023-12-02 03:28:42 text_memmap_dataset:462] Processing 1 data files using 18 workers
[NeMo I 2023-12-02 03:28:42 text_memmap_dataset:471] Time building 0 / 1 mem-mapped files: 0:00:00.447844
[NeMo I 2023-12-02 03:28:43 text_memmap_dataset:158] Loading data files
[NeMo I 2023-12-02 03:28:43 text_memmap_dataset:249] Loading /tmp/fasta/mixed_soft/data_site-2.csv
[NeMo I 2023-12-02 03:28:43 text_memmap_dataset:161] Time loading 1 mem-mapped files: 0:00:00.000768
[NeMo I 2023-12-02 03:28:43 text_memmap_dataset:165] Computing g

[NeMo W 2023-12-02 03:28:43 memmap_csv_fields_dataset:61] CSVFieldsMemmapDataset will be available in NeMo 1.21


2023-12-02 03:28:44,082 - pytorch_lightning.accelerators.cuda - INFO - LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
Predicting DataLoader 0:   0%|          | 0/32 [00:00<?, ?it/s]



Predicting DataLoader 0: 100%|██████████| 32/32 [00:11<00:00,  2.91it/s]
2023-12-02 03:29:00,115 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=bionemo_inference, peer=site-2, peer_run=simulate_job]: got result from client site-2 for task: name=bionemo_inference, id=64027b69-464b-44c1-ada0-cd5476dbe0d2
2023-12-02 03:29:00,118 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=bionemo_inference, peer=site-2, peer_run=simulate_job, peer_rc=OK, task_name=bionemo_inference, task_id=64027b69-464b-44c1-ada0-cd5476dbe0d2]: finished processing client result by bionemo_inference
2023-12-02 03:29:00,120 - SubmitUpdateCommand - INFO - submit_update process. client_name:site-2   task_id:64027b69-464b-44c1-ada0-cd5476dbe0d2
2023-12-02 03:29:00,124 - SimulatorClientRunner - INFO - Simulate Run client: site-3 on GPU group: None
2023-12-02 03:29:00,129 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connection [CN00007 Not Connected] is closed PID: 737
2023-1

      rank_zero_deprecation(
    
I1202 03:29:12.616693 140649102886720 rank_zero.py:53] GPU available: True (cuda), used: True
I1202 03:29:12.616901 140649102886720 rank_zero.py:53] TPU available: False, using: 0 TPU cores
I1202 03:29:12.616976 140649102886720 rank_zero.py:53] IPU available: False, using: 0 IPUs
I1202 03:29:12.617035 140649102886720 rank_zero.py:53] HPU available: False, using: 0 HPUs


[NeMo I 2023-12-02 03:29:12 megatron_init:234] Rank 0 has data parallel group: [0]
[NeMo I 2023-12-02 03:29:12 megatron_init:237] All data parallel group ranks: [[0]]
[NeMo I 2023-12-02 03:29:12 megatron_init:238] Ranks 0 has data parallel rank: 0
[NeMo I 2023-12-02 03:29:12 megatron_init:246] Rank 0 has model parallel group: [0]
[NeMo I 2023-12-02 03:29:12 megatron_init:247] All model parallel group ranks: [[0]]
[NeMo I 2023-12-02 03:29:12 megatron_init:257] Rank 0 has tensor model parallel group: [0]
[NeMo I 2023-12-02 03:29:12 megatron_init:261] All tensor model parallel group ranks: [[0]]
[NeMo I 2023-12-02 03:29:12 megatron_init:262] Rank 0 has tensor model parallel rank: 0
[NeMo I 2023-12-02 03:29:12 megatron_init:276] Rank 0 has pipeline model parallel group: [0]
[NeMo I 2023-12-02 03:29:12 megatron_init:288] Rank 0 has embedding group: [0]
[NeMo I 2023-12-02 03:29:12 megatron_init:294] All pipeline model parallel group ranks: [[0]]
[NeMo I 2023-12-02 03:29:12 megatron_init:295]

[NeMo W 2023-12-02 03:29:12 modelPT:244] You tried to register an artifact under config key=tokenizer.vocab_file but an artifact for it has already been registered.


[NeMo I 2023-12-02 03:29:13 nlp_overrides:401] Model ESM1nvModel was successfully restored from /tmp/nvflare/bionemo/embeddings/simulate_job/app_site-3/models/esm1nv.nemo.
[NeMo I 2023-12-02 03:29:13 utils:340] DDP is not initialized. Initializing...
2023-12-02 03:29:13,144 - lightning_fabric.utilities.distributed - INFO - Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
2023-12-02 03:29:13,145 - pytorch_lightning.utilities.rank_zero - INFO - ----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

[NeMo I 2023-12-02 03:29:13 text_memmap_dataset:116] Building data files
[NeMo I 2023-12-02 03:29:13 text_memmap_dataset:462] Processing 1 data files using 18 workers



    


[NeMo I 2023-12-02 03:29:13 text_memmap_dataset:432] Building indexing for fn = /tmp/fasta/mixed_soft/data_site-3.csv
[NeMo I 2023-12-02 03:29:13 text_memmap_dataset:444] Saving idx file = /tmp/fasta/mixed_soft/data_site-3.csv.idx.npy
[NeMo I 2023-12-02 03:29:13 text_memmap_dataset:446] Saving metadata file = /tmp/fasta/mixed_soft/data_site-3.csv.idx.info
[NeMo I 2023-12-02 03:29:13 text_memmap_dataset:471] Time building 1 / 1 mem-mapped files: 0:00:00.373930
[NeMo I 2023-12-02 03:29:14 text_memmap_dataset:462] Processing 1 data files using 18 workers
[NeMo I 2023-12-02 03:29:15 text_memmap_dataset:471] Time building 0 / 1 mem-mapped files: 0:00:00.472219
[NeMo I 2023-12-02 03:29:15 text_memmap_dataset:158] Loading data files
[NeMo I 2023-12-02 03:29:15 text_memmap_dataset:249] Loading /tmp/fasta/mixed_soft/data_site-3.csv
[NeMo I 2023-12-02 03:29:15 text_memmap_dataset:161] Time loading 1 mem-mapped files: 0:00:00.000845
[NeMo I 2023-12-02 03:29:15 text_memmap_dataset:165] Computing g

[NeMo W 2023-12-02 03:29:15 memmap_csv_fields_dataset:61] CSVFieldsMemmapDataset will be available in NeMo 1.21


2023-12-02 03:29:16,487 - pytorch_lightning.accelerators.cuda - INFO - LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
Predicting DataLoader 0:   0%|          | 0/31 [00:00<?, ?it/s]



Predicting DataLoader 0: 100%|██████████| 31/31 [00:11<00:00,  2.75it/s]
2023-12-02 03:29:32,668 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=bionemo_inference, peer=site-3, peer_run=simulate_job]: got result from client site-3 for task: name=bionemo_inference, id=440f18ae-60cd-4229-a141-d89935e985d3
2023-12-02 03:29:32,671 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=bionemo_inference, peer=site-3, peer_run=simulate_job, peer_rc=OK, task_name=bionemo_inference, task_id=440f18ae-60cd-4229-a141-d89935e985d3]: finished processing client result by bionemo_inference
2023-12-02 03:29:32,672 - SubmitUpdateCommand - INFO - submit_update process. client_name:site-3   task_id:440f18ae-60cd-4229-a141-d89935e985d3
2023-12-02 03:29:32,674 - SimulatorClientRunner - INFO - Simulate Run client: site-1 on GPU group: None
2023-12-02 03:29:32,677 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connection [CN00008 Not Connected] is closed PID: 737
2023-1

### Inspecting the embeddings and labels
The embeddings produced by ESM-1nv inference are fixed-size vectors, one per input sequence. In other words, if we input 10 sequences, we obtain a `10 x D` matrix, where `D` is the embedding size (`D=768` for ESM-1nv). At a glance, these real-valued vectors don't show any obvious features (see the printout of the next cell). However, they do contain information that downstream models can use to reveal properties of the protein, such as the subcellular location we explore below.

In [6]:
# Load embeddings from site-1
import pickle

# Use a context manager so the file handle is closed after loading
with open(os.path.join(out_dir, "data_site-1.pkl"), "rb") as f:
    protein_embeddings = pickle.load(f)
print(f"Loaded {len(protein_embeddings)} embeddings from site-1.")

# Show the available keys and summary statistics for the first few results
for i in range(4):
    protein_embedding = protein_embeddings[i]
    print(f"Inference result contains {list(protein_embedding.keys())}")
    x = protein_embedding["embeddings"]
    print(f"{protein_embedding['id']}: range {np.min(x)}-{np.max(x)}, mean={np.mean(x)}, shape={x.shape}")

Loaded 4040 embeddings from site-1.
Inference result contains ['embeddings', 'hiddens', 'sequence', 'id']
7796: range -0.81787109375-1.1162109375, mean=-0.000674092210829258, shape=(768,)
Inference result contains ['embeddings', 'hiddens', 'sequence', 'id']
5822: range -0.962890625-1.2626953125, mean=-0.004092104267328978, shape=(768,)
Inference result contains ['embeddings', 'hiddens', 'sequence', 'id']
8012: range -0.7548828125-1.033203125, mean=-0.0030728678684681654, shape=(768,)
Inference result contains ['embeddings', 'hiddens', 'sequence', 'id']
4582: range -1.2197265625-1.30078125, mean=-0.000614077493082732, shape=(768,)


Let's enumerate the labels corresponding to potential subcellular locations.

In [7]:
# Let's also print all the labels

labels = set([entry['TARGET'] for entry in data])

for i, label in enumerate(labels):
    print(f"{i+1}. {label.replace('_', ' ')}")

1. Mitochondrion
2. Cytoplasm
3. Endoplasmic reticulum
4. Cell membrane
5. Plastid
6. Peroxisome
7. Extracellular
8. Nucleus
9. Golgi apparatus
10. Lysosome
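
Before a classifier can be fitted, the embeddings have to be paired with their labels. The following sketch shows one way this could be done, assuming each inference result carries an `id` and an `embeddings` vector (as printed above) and each metadata entry carries an `id` and a `TARGET`. The helper `build_dataset` and the toy records are hypothetical and are not taken from the job code.

```python
def build_dataset(embedding_records, metadata_records, class_names):
    """Join embeddings with labels by id and return (features, targets)."""
    # Normalize ids to strings, since they may be stored as int or str
    label_by_id = {str(m["id"]): m["TARGET"] for m in metadata_records}
    class_index = {name: i for i, name in enumerate(class_names)}

    features, targets = [], []
    for record in embedding_records:
        label = label_by_id.get(str(record["id"]))
        if label is None:
            continue  # no metadata available for this embedding
        features.append(list(record["embeddings"]))
        targets.append(class_index[label])
    return features, targets


# Toy stand-ins for the real inference results (3 dimensions instead of 768)
toy_embeddings = [
    {"id": 0, "embeddings": [0.1, -0.2, 0.3]},
    {"id": 1, "embeddings": [0.0, 0.5, -0.1]},
    {"id": 2, "embeddings": [0.4, 0.4, 0.4]},
]
toy_metadata = [
    {"id": "0", "TARGET": "Nucleus"},
    {"id": "1", "TARGET": "Cytoplasm"},
]
# Sorting gives a stable class order; iterating over a plain set does not
class_names = sorted({m["TARGET"] for m in toy_metadata})
X, y = build_dataset(toy_embeddings, toy_metadata, class_names)
print(X, y)
```

Sorting the class names matters in a federated setting, because every site must map the same label to the same class index.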


### Training an MLP to predict subcellular location
To classify proteins by subcellular location, we train a simple scikit-learn Multi-layer Perceptron (MLP) classifier. The MLP uses a network of hidden layers to map the input embedding vectors to the model classes (the cellular locations above). In the jobs below, the MLP uses the Adam optimizer and a random state (or seed) for reproducibility, and is trained for a maximum of 500 iterations. The coefficient shapes reported in the training log below, `(768, 512), (512, 256), (256, 128), (128, 10)`, correspond to three hidden layers of 512, 256, and 128 units.
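
As a rough illustration of the classifier described above, the following sketch builds a scikit-learn `MLPClassifier` with the Adam solver, a fixed random state, and `max_iter=500`. The default hidden layer sizes `(512, 256, 128)` are inferred from the coefficient shapes in the training log of this notebook. The helper name `make_classifier`, the seed value, and the synthetic data are assumptions for illustration and are not taken from the job code.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier


def make_classifier(hidden_layer_sizes=(512, 256, 128), max_iter=500, random_state=0):
    """Create an MLP with the Adam optimizer and a fixed seed for reproducibility."""
    return MLPClassifier(
        solver="adam",
        hidden_layer_sizes=hidden_layer_sizes,
        max_iter=max_iter,
        random_state=random_state,
    )


# Synthetic stand-in for embeddings: 3 well-separated clusters in 16 dimensions
rng = np.random.default_rng(0)
centers = rng.normal(scale=5.0, size=(3, 16))
y_toy = np.repeat(np.arange(3), 40)
X_toy = centers[y_toy] + rng.normal(size=(120, 16))

# A smaller network keeps the toy example fast
clf = make_classifier(hidden_layer_sizes=(32, 16))
clf.fit(X_toy, y_toy)
train_accuracy = clf.score(X_toy, y_toy)
print(f"Training accuracy on toy data: {train_accuracy:.2f}")
print("Coefficient shapes:", [c.shape for c in clf.coefs_])
```

With the real data, the input would be the 768-dimensional ESM-1nv embeddings and the targets would be the ten subcellular location classes.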

### Local training

In [8]:
os.environ["SIM_LOCAL"] = "True"

simulator = SimulatorRunner(
    job_folder="jobs/fedavg",
    workspace=f"/tmp/nvflare/bionemo/local_alpha{split_alpha}",
    n_clients=n_clients,
    threads=n_clients
)
run_status = simulator.run()
print("Simulator finished with run_status", run_status)

2023-12-02 03:29:50,997 - SimulatorRunner - INFO - Create the Simulator Server.
2023-12-02 03:29:51,008 - CoreCell - INFO - server: creating listener on tcp://0:52167
2023-12-02 03:29:51,061 - CoreCell - INFO - server: created backbone external listener for tcp://0:52167
2023-12-02 03:29:51,063 - ConnectorManager - INFO - 2034: Try start_listener Listener resources: {'secure': False, 'host': 'localhost'}
2023-12-02 03:29:51,064 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00002 PASSIVE tcp://0:21824] is starting
2023-12-02 03:29:51,568 - CoreCell - INFO - server: created backbone internal listener for tcp://localhost:21824
2023-12-02 03:29:51,571 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00001 PASSIVE tcp://0:52167] is starting
2023-12-02 03:29:51,662 - nvflare.fuel.hci.server.hci - INFO - Starting Admin Server localhost on Port 34393
2023-12-02 03:29:51,663 - SimulatorRunner - INFO - Deploy the Apps.
2023-12-02 03:29:51,669 - SimulatorRunner - INFO - Create 



2023-12-02 03:29:55,001 - BioNeMoMLPModelPersistor - INFO - [identity=simulator_server, run=simulate_job]: MLPClassifier coefficients [(768, 512), (512, 256), (256, 128), (128, 10)], intercepts [(512,), (256,), (128,), (10,)]
2023-12-02 03:29:55,003 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job]: starting workflow scatter_gather_ctl (<class 'nvflare.app_common.workflows.scatter_and_gather.ScatterAndGather'>) ...
2023-12-02 03:29:55,004 - ScatterAndGather - INFO - [identity=simulator_server, run=simulate_job, wf=scatter_gather_ctl]: Initializing ScatterAndGather workflow.
2023-12-02 03:29:55,006 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=scatter_gather_ctl]: Workflow scatter_gather_ctl (<class 'nvflare.app_common.workflows.scatter_and_gather.ScatterAndGather'>) started
2023-12-02 03:29:55,007 - ScatterAndGather - INFO - [identity=simulator_server, run=simulate_job, wf=scatter_gather_ctl]: Beginning ScatterAndGather training phase.
202



2023-12-02 03:30:06,736 - BioNeMoMLPLearner - INFO - [identity=site-3, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=27d352df-9d02-42ef-854e-52c494d152df]: There are 2148 training samples and 1693 testing samples.
Iteration 1, loss = 2.44906789
2023-12-02 03:30:06,871 - BioNeMoMLPLearner - INFO - [identity=site-3, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=27d352df-9d02-42ef-854e-52c494d152df]: Client identity: site-3




2023-12-02 03:30:06,991 - BioNeMoMLPLearner - INFO - [identity=site-3, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=27d352df-9d02-42ef-854e-52c494d152df]: Model (owner=None) has an accuracy of 3.54%
2023-12-02 03:30:06,995 - BioNeMoMLPLearner - INFO - [identity=site-3, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=27d352df-9d02-42ef-854e-52c494d152df]: Evaluation finished. Returning result
2023-12-02 03:30:07,072 - BioNeMoMLPLearner - INFO - [identity=site-3, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=27d352df-9d02-42ef-854e-52c494d152df]: Current/Total Round: 1/30 (epoch_len=17)
2023-12-02 03:30:07,073 - BioNeMoMLPLearner - INFO - [identity=site-3, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=27d352df-9d02-42ef-854e-52c494d152df]: Client identity: site-3
2023-12-02 03:30:08,005 - BioNeMoMLPLearner - INFO - [ide
Iteration 1, loss = 2.41007208
2023-12-02 03:30:08,236 - BioNeMoMLPLearner - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=f8826277-d5ad-4af9-afcf-28f9d8bece26]: Client identity: site-2
2023-12-02 03:30:08,486 - BioNeMoMLPLearner - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=f8826277-d5ad-4af9-afcf-28f9d8bece26]: Model (owner=None) has an accuracy of 10.63%
2023-12-02 03:30:08,487 - BioNeMoMLPLearner - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=f8826277-d5ad-4af9-afcf-28f9d8bece26]: Evaluation finished. Returning result
2023-12-02 03:30:08,657 - BioNeMoMLPLearner - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=f8826277-d5ad-4af9-afcf-28f9d8bece26]: Current/Total Round: 1/30 (epoch_len=19)
2023-12-02 03:30:08,658 -
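
The per-site accuracies printed by `BioNeMoMLPLearner` in the log above can be collected programmatically for later comparison. The following is a minimal sketch, assuming the log text has been captured as a string; the helper name `collect_accuracies` and the regular expression are illustrative and not part of NVFlare or BioNeMo.

```python
import re

# Matches learner lines such as
# "... [identity=site-3, ...]: Model (owner=None) has an accuracy of 3.54%"
ACC_RE = re.compile(
    r"\[identity=(?P<site>[^,\]]+).*?has an accuracy of (?P<acc>\d+(?:\.\d+)?)%"
)


def collect_accuracies(log_text):
    """Return {site: [accuracy, ...]} in the order the lines appear (hypothetical helper)."""
    results = {}
    for line in log_text.splitlines():
        match = ACC_RE.search(line)
        if match:
            results.setdefault(match.group("site"), []).append(float(match.group("acc")))
    return results


# Two lines copied from the log output above.
sample_log = """\
2023-12-02 03:30:06,991 - BioNeMoMLPLearner - INFO - [identity=site-3, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=27d352df-9d02-42ef-854e-52c494d152df]: Model (owner=None) has an accuracy of 3.54%
2023-12-02 03:30:08,486 - BioNeMoMLPLearner - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=f8826277-d5ad-4af9-afcf-28f9d8bece26]: Model (owner=None) has an accuracy of 10.63%
"""

accuracies = collect_accuracies(sample_log)
print(accuracies)
```

Each site maps to a list because the learner reports one accuracy per round, so the same pattern can be reused to plot accuracy over rounds.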

### Federated learning

In [9]:
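import os

from nvflare.private.fed.app.simulator.simulator_runner import SimulatorRunner

# Disable the local-only simulation mode so that this run performs federated training.
os.environ["SIM_LOCAL"] = "False"

# n_clients and split_alpha are defined earlier in the notebook.
simulator = SimulatorRunner(
    job_folder="jobs/fedavg",
    workspace=f"/tmp/nvflare/bionemo/fedavg_alpha{split_alpha}",
    n_clients=n_clients,
    threads=n_clients,  # one thread per simulated client
)
run_status = simulator.run()
print("Simulator finished with run_status", run_status)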

2023-12-02 03:35:56,880 - SimulatorRunner - INFO - Create the Simulator Server.
2023-12-02 03:35:56,893 - CoreCell - INFO - server: creating listener on tcp://0:39181
2023-12-02 03:35:56,918 - CoreCell - INFO - server: created backbone external listener for tcp://0:39181
2023-12-02 03:35:56,920 - ConnectorManager - INFO - 5509: Try start_listener Listener resources: {'secure': False, 'host': 'localhost'}
2023-12-02 03:35:56,921 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00002 PASSIVE tcp://0:30356] is starting
2023-12-02 03:35:57,423 - CoreCell - INFO - server: created backbone internal listener for tcp://localhost:30356
2023-12-02 03:35:57,426 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00001 PASSIVE tcp://0:39181] is starting
2023-12-02 03:35:57,509 - nvflare.fuel.hci.server.hci - INFO - Starting Admin Server localhost on Port 41311
2023-12-02 03:35:57,510 - SimulatorRunner - INFO - Deploy the Apps.
2023-12-02 03:35:57,515 - SimulatorRunner - INFO - Create
2023-12-02 03:36:01,301 - BioNeMoMLPModelPersistor - INFO - [identity=simulator_server, run=simulate_job]: MLPClassifier coefficients [(768, 512), (512, 256), (256, 128), (128, 10)], intercepts [(512,), (256,), (128,), (10,)]
2023-12-02 03:36:01,304 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job]: starting workflow scatter_gather_ctl (<class 'nvflare.app_common.workflows.scatter_and_gather.ScatterAndGather'>) ...
2023-12-02 03:36:01,310 - ScatterAndGather - INFO - [identity=simulator_server, run=simulate_job, wf=scatter_gather_ctl]: Initializing ScatterAndGather workflow.
2023-12-02 03:36:01,318 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=scatter_gather_ctl]: Workflow scatter_gather_ctl (<class 'nvflare.app_common.workflows.scatter_and_gather.ScatterAndGather'>) started
2023-12-02 03:36:01,321 - ScatterAndGather - INFO - [identity=simulator_server, run=simulate_job, wf=scatter_gather_ctl]: Beginning ScatterAndGather training phase.
2023-12-02 03:36:13,211 - BioNeMoMLPLearner - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=2d60b763-bcea-4a5c-afb9-28652bfead88]: Loaded 4040 embeddings
2023-12-02 03:36:13,401 - BioNeMoMLPLearner - INFO - [identity=site-3, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=035ee9fd-6d60-4cc6-befc-340dab54be63]: There are 2148 training samples and 1693 testing samples.
Iteration 1, loss = 2.36159545
2023-12-02 03:36:13,576 - BioNeMoMLPLearner - INFO - [identity=site-3, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=035ee9fd-6d60-4cc6-befc-340dab54be63]: Client identity: site-3
2023-12-02 03:36:13,751 - BioNeMoMLPLearner - INFO - [identity=site-3, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=035ee9fd-6d60-4cc6-befc-340dab54be63]: Model (owner=None) has an accuracy of 4.43%
2023-12-02 03:36:13,757 - BioNeMoMLPLearner - INFO - [identity=site-3, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=035ee9fd-6d60-4cc6-befc-340dab54be63]: Evaluation finished. Returning result
2023-12-02 03:36:13,858 - BioNeMoMLPLearner - INFO - [identity=site-3, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=035ee9fd-6d60-4cc6-befc-340dab54be63]: Current/Total Round: 1/30 (epoch_len=17)
2023-12-02 03:36:13,859 -
Iteration 1, loss = 2.41613049
2023-12-02 03:36:19,311 - BioNeMoMLPLearner - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=2d60b763-bcea-4a5c-afb9-28652bfead88]: Client identity: site-1
2023-12-02 03:36:19,684 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=scatter_gather_ctl, peer=site-2, peer_run=simulate_job]: got result from client site-2 for task: name=train, id=bd6d3075-f1b4-422e-8fd4-55d97a9066f6
2023-12-02 03:36:19,710 - ScatterAndGather - INFO - [identity=simulator_server, run=simulate_job, wf=scatter_gather_ctl, peer=site-2, peer_run=simulate_job, peer_rc=OK, task_name=train, task_id=bd6d3075-f1b4-422e-8fd4-55d97a9066f6]: Contribution from site-2 ACCEPTED by the aggregator at round 0.
2023-12-02 03:36:19,714 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=scatter_gather_ctl, peer=site-2, peer_run=simulate_job, peer_rc=OK, task_name=train, task_id=bd6d3075-f1b4
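
The log above shows the server accepting each client's contribution into the aggregator for the `fedavg` job. As a rough illustration of what federated averaging does with those contributions, here is a minimal NumPy sketch. It is not NVFlare's aggregator code: the function name `fedavg`, the random weights, and every client size except the 2148 training samples logged for site-3 are assumptions made for the example. The layer shapes are the ones reported by `BioNeMoMLPModelPersistor`.

```python
import numpy as np


def fedavg(client_weights, client_sizes):
    """Sample-size-weighted average of per-client parameter lists (illustrative only)."""
    total = float(sum(client_sizes))
    averaged = []
    for layer in range(len(client_weights[0])):
        acc = np.zeros_like(client_weights[0][layer], dtype=np.float64)
        for weights, size in zip(client_weights, client_sizes):
            # Each client contributes in proportion to its share of the data.
            acc += (size / total) * weights[layer]
        averaged.append(acc)
    return averaged


# Coefficient shapes reported in the server log for the MLPClassifier.
shapes = [(768, 512), (512, 256), (256, 128), (128, 10)]

rng = np.random.default_rng(0)
# Hypothetical stand-ins for three clients' locally trained coefficients.
clients = [[rng.normal(size=shape) for shape in shapes] for _ in range(3)]
# 2148 is site-3's training set size from the log; the other two are made up.
sizes = [2148, 1000, 500]

global_weights = fedavg(clients, sizes)
print([w.shape for w in global_weights])
```

Weighting by sample count means a site with more data pulls the global model further toward its local solution, which matters when the data split across sites is uneven.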

## Fine-tuning ESM-2nv 650M
#### Federated learning

In [13]:
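import os

# Disable the local-only simulation mode so that this run performs federated training.
os.environ["SIM_LOCAL"] = "False"

# SimulatorRunner, n_clients, and split_alpha come from earlier cells in the notebook.
simulator = SimulatorRunner(
    job_folder="jobs/fedavg_finetune",
    workspace=f"/tmp/nvflare/bionemo/fedavg_finetune_alpha{split_alpha}",
    n_clients=n_clients,
    threads=n_clients,  # one thread per simulated client
)
run_status = simulator.run()
print("Simulator finished with run_status", run_status)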

2023-12-02 04:00:52,175 - SimulatorRunner - INFO - Create the Simulator Server.
2023-12-02 04:00:52,179 - CoreCell - INFO - server: creating listener on tcp://0:55279
2023-12-02 04:00:52,200 - CoreCell - INFO - server: created backbone external listener for tcp://0:55279
2023-12-02 04:00:52,202 - ConnectorManager - INFO - 9901: Try start_listener Listener resources: {'secure': False, 'host': 'localhost'}
2023-12-02 04:00:52,203 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00002 PASSIVE tcp://0:56294] is starting
2023-12-02 04:00:52,705 - CoreCell - INFO - server: created backbone internal listener for tcp://localhost:56294
2023-12-02 04:00:52,708 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00001 PASSIVE tcp://0:55279] is starting
2023-12-02 04:00:52,790 - nvflare.fuel.hci.server.hci - INFO - Starting Admin Server localhost on Port 45811
2023-12-02 04:00:52,791 - SimulatorRunner - INFO - Deploy the Apps.
2023-12-02 04:00:52,799 - SimulatorRunner - INFO - Create 