# Federated Protein Embeddings and Task Model Fitting with BioNeMo

This example notebook shows how to obtain learned protein representations, in the form of embeddings, using the pre-trained ESM-1nv model. The model was trained with NVIDIA's BioNeMo framework for Large Language Model training and inference. For more details, visit the NVIDIA BioNeMo Service at https://www.nvidia.com/en-us/gpu-cloud/bionemo.

This notebook walks you through the task fitting workflow in the sections below.

### Obtaining the protein embeddings using the BioNeMo ESM-1nv model
With BioNeMo, users can obtain embeddings: numerical vector representations of protein sequences. These embeddings can then be used for visualization or for downstream predictions.

Here we are interested in training a neural network to predict subcellular location from an embedding.

The data we will be using comes from the paper [Light attention predicts protein location from the language of life](https://academic.oup.com/bioinformaticsadvances/article/1/1/vbab035/6432029) by Stärk et al. In this paper, the authors developed a machine learning algorithm that predicts the subcellular location of proteins from their sequence, using protein language models similar to those hosted by BioNeMo. Subcellular location refers to where a protein localizes in the cell; for example, a protein may be expressed in the nucleus or in the cytoplasm. Knowing where proteins localize can provide insights into the underlying mechanisms of cellular processes and help identify potential targets for drug development. The following image includes a few examples of subcellular locations in an animal cell:


(Image freely available at https://pixabay.com/images/id-48542)

### Dataset sourcing
For our target input sequences, we will point to FASTA sequences in a benchmark dataset called Fitness Landscape Inference for Proteins (FLIP). FLIP encompasses experimental data across adeno-associated virus stability for gene therapy, protein domain B1 stability and immunoglobulin binding, and thermostability from multiple protein families.

In [None]:
# Example protein dataset location
fasta_url = "http://data.bioembeddings.com/public/FLIP/fasta/scl/mixed_soft.fasta"

The cell above defines the source of the example protein dataset, a FASTA file. The data follows the [biotrainer](https://github.com/sacdallago/biotrainer/blob/main/docs/data_standardization.md) standard, so each FASTA header carries the class label alongside the protein sequence. Here are two example sequences from this file:

```
>Sequence1 TARGET=Cell_membrane SET=train VALIDATION=False
MMKTLSSGNCTLNVPAKNSYRMVVLGASRVGKSSIVSRFLNGRFEDQYTPTIEDFHRKVYNIHGDMYQLDILDTSGNHPFPAM
RRLSILTGDVFILVFSLDSRESFDEVKRLQKQILEVKSCLKNKTKEAAELPMVICGNKNDHSELCRQVPAMEAELLVSGDENC
AYFEVSAKKNTNVNEMFYVLFSMAKLPHEMSPALHHKISVQYGDAFHPRPFCMRRTKVAGAYGMVSPFARRPSVNSDLKYIKA
KVLREGQARERDKCSIQ
>Sequence4833 TARGET=Nucleus SET=train VALIDATION=False
MARTKQTARKSTGGKAPRKQLATKAARKSAPATGGVKKPHRFRPGTVALREIRKYQKSTELLIRKLPFQRLVREIAQDFKTDL
RFQSSAVAALQEAAEAYLVGLFEDTNLCAIHAKRVTIMPKDIQLARRIRGERA
```

Note the following attributes in the FASTA header:

* The `TARGET` attribute holds the subcellular location class of the sequence, for instance `Cell_membrane` or `Nucleus`. The dataset includes ten subcellular location classes in total; more on that below.
* The `SET` attribute defines whether the sequence is used for training (`train`) or testing (`test`).
* The `VALIDATION` attribute defines whether the sequence is used for validation. Every sequence where this is `True` also has `SET=train`.
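
To make the header layout concrete, here is a minimal sketch of how such a header could be split into its attributes. The helper name `parse_fasta_header` is hypothetical and not part of biotrainer or BioNeMo; it only assumes the whitespace-separated `KEY=VALUE` layout shown in the two examples above.

```python
def parse_fasta_header(header: str) -> dict:
    """Split a biotrainer-style FASTA header into an id and its KEY=VALUE attributes.

    Hypothetical helper for illustration; assumes whitespace-separated fields.
    """
    header = header.lstrip(">").strip()
    seq_id, *fields = header.split()
    attrs = {"id": seq_id}
    for field in fields:
        key, sep, value = field.partition("=")
        if sep:  # keep only fields that really are KEY=VALUE pairs
            attrs[key] = value
    # VALIDATION is stored as text in the header; convert it to a boolean
    attrs["VALIDATION"] = attrs.get("VALIDATION", "False") == "True"
    return attrs


example = parse_fasta_header(
    ">Sequence1 TARGET=Cell_membrane SET=train VALIDATION=False"
)
print(example)
# {'id': 'Sequence1', 'TARGET': 'Cell_membrane', 'SET': 'train', 'VALIDATION': False}
```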

### Downloading the protein sequences and subcellular location annotations
In this step we download the FASTA file defined above and parse the sequences into a list of Biopython `SeqRecord` objects.



In [None]:
import io
import requests
from Bio import SeqIO

# Download the FASTA file from FLIP: https://github.com/J-SNACKKB/FLIP/tree/main/splits/scl
fasta_content = requests.get(fasta_url, headers={
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x86)'
}).content.decode('utf-8')
fasta_stream = io.StringIO(fasta_content)

# Obtain a list of SeqRecords/proteins which contain sequence and attributes
# from the FASTA header
proteins = list(SeqIO.parse(fasta_stream, "fasta"))
print(f"Downloaded {len(proteins)} sequences")
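
Once the records are loaded, a quick sanity check is to count how many sequences fall into each `TARGET` class. The sketch below works on plain header strings so that it runs without a download; the function name `count_targets` is hypothetical. In the notebook one would presumably pass `[p.description for p in proteins]`, assuming `description` holds the full header line as it does for Biopython's FASTA parser.

```python
import re
from collections import Counter


def count_targets(descriptions):
    """Count how often each TARGET value occurs in a list of header strings."""
    counts = Counter()
    for description in descriptions:
        match = re.search(r"TARGET=(\S+)", description)
        if match:
            counts[match.group(1)] += 1
    return counts


# Stand-in headers copied from the two example sequences shown earlier
sample_descriptions = [
    "Sequence1 TARGET=Cell_membrane SET=train VALIDATION=False",
    "Sequence4833 TARGET=Nucleus SET=train VALIDATION=False",
]
target_counts = count_targets(sample_descriptions)
for label, n in target_counts.most_common():
    print(f"{label}: {n}")
```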

### Data splitting
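The `SET` and `VALIDATION` header attributes described earlier suggest a three-way split into training, validation, and test data. The following is only a sketch of that idea under stated assumptions: the function name `split_by_header` is hypothetical, the last two sample headers are made up for illustration, and how the data is then distributed to federated clients is not covered here.

```python
import re


def split_by_header(descriptions):
    """Group header strings into train / validation / test by their attributes."""
    splits = {"train": [], "validation": [], "test": []}
    for description in descriptions:
        attrs = dict(re.findall(r"(\w+)=(\S+)", description))
        if attrs.get("SET") == "test":
            splits["test"].append(description)
        elif attrs.get("VALIDATION") == "True":
            # validation sequences are marked SET=train in the header
            splits["validation"].append(description)
        elif attrs.get("SET") == "train":
            splits["train"].append(description)
    return splits


sample_descriptions = [
    "Sequence1 TARGET=Cell_membrane SET=train VALIDATION=False",
    "SequenceA TARGET=Nucleus SET=train VALIDATION=True",    # hypothetical header
    "SequenceB TARGET=Cytoplasm SET=test VALIDATION=False",  # hypothetical header
]
splits = split_by_header(sample_descriptions)
print({name: len(items) for name, items in splits.items()})
# {'train': 1, 'validation': 1, 'test': 1}
```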

### Federated embedding extraction

In [13]:
from nvflare import SimulatorRunner    

simulator = SimulatorRunner(
    job_folder="jobs/embeddings",
    workspace="/tmp/nvflare/bionemo/embeddings",
    n_clients=1,
    threads=1
)
run_status = simulator.run()
print("Simulator finished with run_status", run_status)

2023-07-26 22:02:13,465 - SimulatorRunner - INFO - Create the Simulator Server.
2023-07-26 22:02:13,469 - Cell - INFO - server: creating listener on tcp://0:41509
2023-07-26 22:02:13,502 - Cell - INFO - server: created backbone external listener for tcp://0:41509
2023-07-26 22:02:13,504 - ConnectorManager - INFO - 9053: Try start_listener Listener resources: {'secure': False, 'host': 'localhost'}
2023-07-26 22:02:13,506 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00002 PASSIVE tcp://0:16524] is starting
2023-07-26 22:02:14,008 - Cell - INFO - server: created backbone internal listener for tcp://localhost:16524
2023-07-26 22:02:14,012 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00001 PASSIVE tcp://0:41509] is starting
2023-07-26 22:02:14,103 - nvflare.fuel.hci.server.hci - INFO - Starting Admin Server localhost on Port 45799
2023-07-26 22:02:14,104 - SimulatorRunner - INFO - Deploy the Apps.
2023-07-26 22:02:14,623 - SimulatorRunner - INFO - Create the simulate

[NeMo W 2023-07-26 22:02:29 experimental:27] Module <class 'nemo.collections.nlp.models.text_normalization_as_tagging.thutmose_tagger.ThutmoseTaggerModel'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2023-07-26 22:02:30 experimental:27] Module <class 'nemo.collections.asr.modules.audio_modules.SpectrogramToMultichannelFeatures'> is experimental, not ready for production and is not fully supported. Use at your own risk.




W0726 22:02:31.139726 140064913278784 cell.py:179] site-1.simulate_job: no connection to child site-1.simulate_job.0
W0726 22:02:32.142046 140064913278784 cell.py:179] site-1.simulate_job: no connection to child site-1.simulate_job.0
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************


2023-07-26 22:02:32,844 - Cell - INFO - site-1.simulate_job.0: created backbone internal connector to tcp://localhost:28396 on parent
2023-07-26 22:02:32,844 - ConnectorManager - INFO - 9180: Try start_listener Listener resources: {'secure': False, 'host': 'localhost'}
2023-07-26 22:02:32,845 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00002 PASSIVE tcp://0:17247] is starting
2023-07-26 22:02:32,848 - Cell - INFO - site-1.simulate_job.1: created backbone internal connector to tcp://localhost:28396 on parent
2023-07-26 22:02:32,848 - ConnectorManager - INFO - 9181: Try start_listener Listener resources: {'secure': False, 'host': 'localhost'}
2023-07-26 22:02:32,849 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00002 PASSIVE tcp://0:21932] is starting


W0726 22:02:33.143659 140064913278784 cell.py:179] site-1.simulate_job: no connection to child site-1.simulate_job.0


2023-07-26 22:02:33,345 - Cell - INFO - site-1.simulate_job.0: created backbone internal listener for tcp://localhost:17247
2023-07-26 22:02:33,346 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00001 ACTIVE tcp://localhost:28396] is starting
2023-07-26 22:02:33,348 - SubWorkerExecutor - INFO - SubWorkerExecutor process started.
2023-07-26 22:02:33,350 - Cell - INFO - site-1.simulate_job.1: created backbone internal listener for tcp://localhost:21932
2023-07-26 22:02:33,350 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00001 ACTIVE tcp://localhost:28396] is starting
2023-07-26 22:02:33,353 - SubWorkerExecutor - INFO - SubWorkerExecutor process started.
2023-07-26 22:02:37,019 - torch.distributed.nn.jit.instantiator - INFO - Created a temporary directory at /tmp/tmpjflm4dzo
2023-07-26 22:02:37,019 - torch.distributed.nn.jit.instantiator - INFO - Writing /tmp/tmpjflm4dzo/_remote_module_non_scriptable.py
2023-07-26 22:02:37,027 - torch.distributed.nn.jit.instantiator 

[NeMo W 2023-07-26 22:02:43 experimental:27] Module <class 'nemo.collections.nlp.models.text_normalization_as_tagging.thutmose_tagger.ThutmoseTaggerModel'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2023-07-26 22:02:44 experimental:27] Module <class 'nemo.collections.asr.modules.audio_modules.SpectrogramToMultichannelFeatures'> is experimental, not ready for production and is not fully supported. Use at your own risk.


2023-07-26 22:02:44,718 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=share_config, peer=site-1, peer_run=simulate_job, task_name=bionemo_inference, task_id=0e14097a-d4cd-4400-9c36-fff8ca30daf7]: assigned task to client site-1: name=bionemo_inference, id=0e14097a-d4cd-4400-9c36-fff8ca30daf7
2023-07-26 22:02:44,724 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=share_config, peer=site-1, peer_run=simulate_job, task_name=bionemo_inference, task_id=0e14097a-d4cd-4400-9c36-fff8ca30daf7]: sent task assignment to client. client_name:site-1 task_id:0e14097a-d4cd-4400-9c36-fff8ca30daf7
2023-07-26 22:02:44,726 - GetTaskCommand - INFO - return task to client.  client_name: site-1  task_name: bionemo_inference   task_id: 0e14097a-d4cd-4400-9c36-fff8ca30daf7  sharable_header_task_id: 0e14097a-d4cd-4400-9c36-fff8ca30daf7
[NeMo I 2023-07-26 22:02:45 utils:250] Restoring model from /tmp/nvflare/bionemo/embeddings/simulate_job/app_site-1/models/esm1

      rank_zero_deprecation(
    
I0726 22:02:45.099255 140246922557184 setup.py:163] GPU available: True (cuda), used: True
I0726 22:02:45.099840 140246922557184 setup.py:166] TPU available: False, using: 0 TPU cores
I0726 22:02:45.100004 140246922557184 setup.py:169] IPU available: False, using: 0 IPUs
I0726 22:02:45.100112 140246922557184 setup.py:172] HPU available: False, using: 0 HPUs
[NeMo E 2023-07-26 22:02:45 exp_manager:306] exp_manager did not receive a cfg argument. It will be disabled.


[NeMo I 2023-07-26 22:02:45 megatron_init:231] Rank 0 has data parallel group: [0, 1]
[NeMo I 2023-07-26 22:02:45 megatron_init:234] All data parallel group ranks: [[0, 1]]
[NeMo I 2023-07-26 22:02:45 megatron_init:235] Ranks 0 has data parallel rank: 0
[NeMo I 2023-07-26 22:02:45 megatron_init:243] Rank 0 has model parallel group: [0]
[NeMo I 2023-07-26 22:02:45 megatron_init:244] All model parallel group ranks: [[0], [1]]
[NeMo I 2023-07-26 22:02:45 megatron_init:254] Rank 0 has tensor model parallel group: [0]
[NeMo I 2023-07-26 22:02:45 megatron_init:258] All tensor model parallel group ranks: [[0], [1]]
[NeMo I 2023-07-26 22:02:45 megatron_init:259] Rank 0 has tensor model parallel rank: 0
[NeMo I 2023-07-26 22:02:45 megatron_init:273] Rank 0 has pipeline model parallel group: [0]
[NeMo I 2023-07-26 22:02:45 megatron_init:285] Rank 0 has embedding group: [0]
[NeMo I 2023-07-26 22:02:45 megatron_init:291] All pipeline model parallel group ranks: [[0], [1]]
[NeMo I 2023-07-26 22:02:

[NeMo W 2023-07-26 22:02:45 modelPT:245] You tried to register an artifact under config key=tokenizer.vocab_file but an artifact for it has already been registered.
I0726 22:02:45.631522 140246922557184 distributed.py:244] Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
I0726 22:02:45.638794 140174512092928 distributed.py:244] Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
I0726 22:02:45.643024 140246922557184 distributed.py:248] ----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------

    


[NeMo I 2023-07-26 22:02:45 text_memmap_dataset:104] Building data files
[NeMo I 2023-07-26 22:02:45 text_memmap_dataset:343] Processing 1 data files using 6 workers
[NeMo I 2023-07-26 22:02:45 text_memmap_dataset:349] Time building 0 / 1 mem-mapped files: 0:00:00.165732
[NeMo I 2023-07-26 22:02:46 text_memmap_dataset:114] Loading data files
[NeMo I 2023-07-26 22:02:46 text_memmap_dataset:205] Loading /tmp/data/FLIP/secondary_structure/test/x000.csv
[NeMo I 2023-07-26 22:02:46 text_memmap_dataset:117] Time loading 1 mem-mapped files: 0:00:00.002323
[NeMo I 2023-07-26 22:02:46 text_memmap_dataset:121] Computing global indices
[NeMo I 2023-07-26 22:02:46 mapped_dataset:206] Filtered out (ignored) 24 samples ( 340 / 364 )
2023-07-26 22:02:46,773 - pytorch_lightning.accelerators.cuda - INFO - LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
2023-07-26 22:02:46,773 - pytorch_lightning.accelerators.cuda - INFO - LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
Predicting: 0it [00:00, ?it/s]

I0726 22:02:46.773807 140246922557184 cuda.py:58] LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
I0726 22:02:46.773926 140174512092928 cuda.py:58] LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]


Predicting DataLoader 0: 100%|██████████| 3/3 [00:04<00:00,  1.40s/it]
%%%%%% Saving 680 samples to output_fname = /tmp/data/FLIP/secondary_structure/test/x000.pkl
2023-07-26 22:02:54,157 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=share_config, peer=site-1, peer_run=simulate_job]: got result from client site-1 for task: name=bionemo_inference, id=0e14097a-d4cd-4400-9c36-fff8ca30daf7
2023-07-26 22:02:54,160 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=share_config, peer=site-1, peer_run=simulate_job, peer_rc=OK, task_name=bionemo_inference, task_id=0e14097a-d4cd-4400-9c36-fff8ca30daf7]: finished processing client result by share_config
2023-07-26 22:02:54,163 - SubmitUpdateCommand - INFO - submit_update process. client_name:site-1   task_id:0e14097a-d4cd-4400-9c36-fff8ca30daf7
2023-07-26 22:02:54,344 - BioNeMoInference - INFO - [identity=simulator_server, run=simulate_job, wf=share_config]: task bionemo_inference exit with status

Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 783, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 779, in main
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 770, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 241, in launc

2023-07-26 22:02:59,627 - MPM - INFO - MPM: Good Bye!
Simulator finished with run_status 0
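
The log above reports that the samples were saved to a `.pkl` file. Below is a minimal sketch for taking a first look at such a file, assuming it is a standard Python pickle. The source does not state how the contents are structured, so the hypothetical helper `describe_pickle` only reports the top-level type and length, and the demo uses a stand-in file rather than real embeddings.

```python
import pickle
import tempfile
from pathlib import Path


def describe_pickle(path):
    """Return the top-level type (and length, if any) of a pickled object.

    Only unpickle files from a trusted source: pickle can execute code on load.
    """
    with Path(path).open("rb") as handle:
        payload = pickle.load(handle)
    summary = {"type": type(payload).__name__}
    if hasattr(payload, "__len__"):
        summary["length"] = len(payload)
    return summary


# Demo on a stand-in pickle; the real output file may be structured differently
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo.pkl"
    demo_path.write_bytes(pickle.dumps([{"id": "Sequence1"}]))
    demo_summary = describe_pickle(demo_path)
print(demo_summary)
# {'type': 'list', 'length': 1}
```

In the notebook, the same helper could be pointed at the path printed after `output_fname` in the log.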
