# Running Cellmaps VNN with multiple training datasets

This notebook demonstrates how to use cellmaps_vnn with five distinct training datasets. Each trained model is used to generate predictions, and the resulting system importance scores are aggregated and visualized on the hierarchy.

### Installation

It is highly recommended to create a conda virtual environment and run Jupyter from it.

`conda create -n vnn_env python=3.11`

`conda activate vnn_env`

To install cellmaps_vnn, run:

`pip install cellmaps_vnn`

Exit the notebook and reopen it in the `vnn_env` environment.

### Drug response data

First, provide the training and test datasets you want to run cellmaps_vnn with. Training and test data should be in separate files.

- `training_data.txt`:
    A tab-delimited file containing all data points used to train the model. The 1st column identifies the cell line (genotype), the 2nd column is the SMILES string of the drug, the 3rd column is the observed drug response as a floating-point number, and the 4th column is the source the data was obtained from.
    ```
    HS633T_SOFT_TISSUE              CC1=C(C(=O)N(C2=NC(=NC=C12)NC3=NC=C(C=C3)N4CCNCC4)C5CCCC5)C(=O)C     0.67   GDSC2
    KINGS1_CENTRAL_NERVOUS_SYSTEM   CC1=C(C(=O)N(C2=NC(=NC=C12)NC3=NC=C(C=C3)N4CCNCC4)C5CCCC5)C(=O)C     0.64   GDSC1
    ```
    <br/>
- `test_data.txt`:
    A tab-delimited file containing all data points for which you want to estimate drug response. The 1st column identifies the cell line (genotype), the 2nd column is the SMILES string of the drug, the 3rd column is the observed drug response as a floating-point number, and the 4th column is the source the data was obtained from.
  
    ```
    EW24_BONE       CC1=C(C(=O)N(C2=NC(=NC=C12)NC3=NC=C(C=C3)N4CCNCC4)C5CCCC5)C(=O)C    0.99    GDSC1
    ES7_BONE	    CC1=C(C(=O)N(C2=NC(=NC=C12)NC3=NC=C(C=C3)N4CCNCC4)C5CCCC5)C(=O)C	0.65	GDSC2
    ```
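
The four-column layout described above can be checked before training with a small reader. This is a minimal sketch, not part of cellmaps_vnn: `DrugResponse` and `read_drug_response` are hypothetical names, and the code assumes only what the description states (tab-delimited rows with exactly four columns, the third being a float).

```python
import csv
import io
from typing import NamedTuple


class DrugResponse(NamedTuple):
    """One row of a training or test file (hypothetical helper type)."""
    cell_line: str
    smiles: str
    response: float
    source: str


def read_drug_response(handle):
    """Parse tab-delimited rows: cell line, SMILES, response, source."""
    records = []
    reader = csv.reader(handle, delimiter="\t")
    for line_number, row in enumerate(reader, start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != 4:
            raise ValueError(
                f"line {line_number}: expected 4 columns, got {len(row)}"
            )
        cell_line, smiles, response, source = (field.strip() for field in row)
        # float() raises ValueError if the response is not numeric
        records.append(DrugResponse(cell_line, smiles, float(response), source))
    return records


# Two rows taken from the training example above, joined with real tabs
smiles = "CC1=C(C(=O)N(C2=NC(=NC=C12)NC3=NC=C(C=C3)N4CCNCC4)C5CCCC5)C(=O)C"
sample = (
    f"HS633T_SOFT_TISSUE\t{smiles}\t0.67\tGDSC2\n"
    f"KINGS1_CENTRAL_NERVOUS_SYSTEM\t{smiles}\t0.64\tGDSC1\n"
)
records = read_drug_response(io.StringIO(sample))
print(len(records), records[0].cell_line, records[0].response)
```

To check a real file, pass an open file handle instead of the `io.StringIO` object, for example `open('training_data.txt', newline='')`.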

**Optional**: If you do not have your own data but want to run this example notebook,
you can use the training data from the `examples` directory. Here we create 5 different training datasets by randomly sampling 60% of the training data. Run the following code to do so.

In [59]:
# YOU DO NOT NEED TO RUN THIS STEP IF YOU PROVIDE YOUR OWN DATA

import os
import random

current_path = os.getcwd()
training_data_dir = os.path.join(current_path, 'training_data_dir')
training_data_path = '../examples/training_data.txt'

# Ensure the output directory exists
os.makedirs(training_data_dir, exist_ok=True)

# Read all lines from the training data file
with open(training_data_path, 'r') as f:
    lines = f.readlines()

# Determine sample size (60%)
sample_size = int(0.6 * len(lines))

# Generate and write 5 files with random 60% samples
for i in range(1, 6):
    sampled_lines = random.sample(lines, sample_size)
    output_path = os.path.join(training_data_dir, f'training_data_{i}.txt')
    with open(output_path, 'w') as out_f:
        out_f.writelines(sampled_lines)
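
The sampling above draws a different 60% each time the cell is run. If repeatable subsets are wanted, one option is a seeded generator. The sketch below is an illustration under that assumption; `sample_subsets` and its `seed` parameter are hypothetical and not part of the notebook.

```python
import random


def sample_subsets(lines, n_subsets=5, fraction=0.6, seed=0):
    """Return n_subsets random samples (without replacement) of the lines."""
    # A private generator keeps the global random state untouched
    rng = random.Random(seed)
    sample_size = int(fraction * len(lines))
    return [rng.sample(lines, sample_size) for _ in range(n_subsets)]


lines = [f"row_{i}\n" for i in range(10)]
subsets = sample_subsets(lines, seed=42)
print([len(subset) for subset in subsets])
```

Calling `sample_subsets` twice with the same `seed` returns the same subsets, so the five training files can be regenerated exactly.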

In [60]:
# Replace these paths with the paths to your own training data

training_data_paths = [
    os.path.join(training_data_dir, f'training_data_{1}.txt'),
    os.path.join(training_data_dir, f'training_data_{2}.txt'),
    os.path.join(training_data_dir, f'training_data_{3}.txt'),
    os.path.join(training_data_dir, f'training_data_{4}.txt'),
    os.path.join(training_data_dir, f'training_data_{5}.txt'),
]

### Cell Feature Data (required for both training and prediction steps)

The feature data can be found in the `examples` directory. The structure and content of each file are described [here](https://cellmaps-vnn.readthedocs.io/en/latest/inputs_nestvnn.html#cell-feature-files) and in another notebook (notebooks/step-by-step-guide-run-vnn.ipynb).

In [61]:
gene2id = '../examples/gene2ind.txt'
cell2id = '../examples/cell2ind.txt'
mutations = '../examples/cell2mutation.txt'
cn_deletions = '../examples/cell2cndeletion.txt'
cn_amplifications = '../examples/cell2cnamplification.txt'

## Step 1: <span style="color:red">Training</span>

### Input Data for Training
In addition to the training data and feature data, the training process requires a hierarchy in CX2 format to build the visible neural network. This example uses the hierarchy from the `examples` directory.

For required and optional arguments, refer to the [documentation](https://cellmaps-vnn.readthedocs.io/en/latest/usage_command_line.html#training-mode-and-prediction-and-interpretation-mode).

In [62]:
hierarchy_path = '../examples/hierarchy.cx2'

### Training command

Run training for each of the training datasets. Specify a separate output directory (RO-Crate) for each dataset, where the model and other training output files will be saved.

In [64]:
import subprocess

training_out_paths = []

for i, training_data_path in enumerate(training_data_paths):
    outdir = './model_dir' + str(i)
    training_out_paths.append(outdir)
    command = (
        f"cellmaps_vnncmd.py train {outdir} --hierarchy {hierarchy_path} "
        f"--gene2id {gene2id} --cell2id {cell2id} --training_data {training_data_path} "
        f"--mutations {mutations} --cn_deletions {cn_deletions} "
        f"--cn_amplifications {cn_amplifications}"
    )
    subprocess.run(command, shell=True, check=True)
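
The training cell builds one shell string per dataset. An alternative, sketched here, is to pass the arguments as a list so that paths containing spaces need no quoting and no shell is involved. `build_train_command` is a hypothetical helper; it uses only the subcommand and flags that appear in the cell above.

```python
import shlex


def build_train_command(outdir, hierarchy, gene2id, cell2id, training_data,
                        mutations, cn_deletions, cn_amplifications):
    """Build the training command as a list of arguments (no shell parsing)."""
    return [
        "cellmaps_vnncmd.py", "train", outdir,
        "--hierarchy", hierarchy,
        "--gene2id", gene2id,
        "--cell2id", cell2id,
        "--training_data", training_data,
        "--mutations", mutations,
        "--cn_deletions", cn_deletions,
        "--cn_amplifications", cn_amplifications,
    ]


cmd = build_train_command(
    "./model_dir0",
    "../examples/hierarchy.cx2",
    "../examples/gene2ind.txt",
    "../examples/cell2ind.txt",
    "./training_data_dir/training_data_1.txt",
    "../examples/cell2mutation.txt",
    "../examples/cell2cndeletion.txt",
    "../examples/cell2cnamplification.txt",
)
# shlex.join shows the command as it would be typed in a shell
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, check=True)
```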

## Step 2: <span style="color:red">Prediction and Interpretation</span>

In this step, we will test the models generated in the previous step. Each node in the hierarchy will then be assigned a score to indicate the importance of the subsystem in each model's decision-making process.

### Input Data for Prediction

The input for this step is the output from training (step 1) and the test data. You can use one test dataset for all the models or a separate test dataset for each model.

In [68]:
# In this example, we use test data from examples directory. You can use 
# the same test data for all models or use separate test data files
test_data_paths = [ '../examples/test_data.txt' ]
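
The pairing between models and test files can be sketched as a small helper. It follows the rule used in this notebook: one test file per model when the counts match, otherwise the first test file for every model. `select_test_data` is a hypothetical name, and the explicit error on an empty list is an addition for illustration.

```python
def select_test_data(test_data_paths, n_models):
    """Return a list with one test file per model."""
    if not test_data_paths:
        raise ValueError("at least one test data file is required")
    if len(test_data_paths) == n_models:
        # One dedicated test file per model, in the same order
        return list(test_data_paths)
    # Otherwise every model is evaluated on the first test file
    return [test_data_paths[0]] * n_models


print(select_test_data(["../examples/test_data.txt"], 5))
```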

### Prediction and Interpretation command

In [75]:
predict_out_paths = []

for i, train_out in enumerate(training_out_paths):
    outdir = './test_dir' + str(i)
    predict_out_paths.append(outdir)
    test_data = test_data_paths[0]
    if len(test_data_paths) == len(training_out_paths):
        test_data = test_data_paths[i]
    command = (
        f"cellmaps_vnncmd.py predict {outdir} --inputdir {train_out} "
        f"--gene2id {gene2id} --cell2id {cell2id} --predict_data {test_data} "
        f"--mutations {mutations} --cn_deletions {cn_deletions} "
        f"--cn_amplifications {cn_amplifications}"
    )
    subprocess.run(command, shell=True, check=True)

Starting prediction process
Starting score calculation
Prediction and interpretation executed successfully

Starting prediction process
Starting score calculation
Prediction and interpretation executed successfully

Starting prediction process
Starting score calculation
Prediction and interpretation executed successfully

Starting prediction process
Starting score calculation
Prediction and interpretation executed successfully

Starting prediction process
Starting score calculation
Prediction and interpretation executed successfully

## Step 3: <span style="color:red">Annotation and Visualization</span>

In this step, the hierarchy used to build the neural network (VNN) can be annotated with system importance scores from step 2, which will aid in interpreting the results.

### Input Data for Annotation

The input for the annotation is the list of output directories where the results from prediction and interpretation were saved. The results will be aggregated and the hierarchy will be annotated.
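
To make the idea of aggregation concrete, the sketch below averages per-system scores across models. This is an assumption made for illustration only: the aggregation that `cellmaps_vnncmd.py annotate` actually performs is not described here and may differ. The function name, the dictionary layout, and the score values are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def aggregate_scores(per_model_scores):
    """Average each system's score over the models that report it."""
    collected = defaultdict(list)
    for scores in per_model_scores:
        for system, score in scores.items():
            collected[system].append(score)
    return {system: mean(values) for system, values in collected.items()}


# Made-up scores for two models and two systems
per_model_scores = [
    {"system_A": 0.50, "system_B": 0.25},
    {"system_A": 0.75, "system_B": 0.25},
]
print(aggregate_scores(per_model_scores))
```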

### Annotation command (without visualization in Cytoscape)

In [81]:
outdir = './annotate_dir'

command = (f"cellmaps_vnncmd.py annotate {outdir} "
           f"--model_predictions " + (" ").join(predict_out_paths)
          )
subprocess.run(command, shell=True, check=True)

CompletedProcess(args='cellmaps_vnncmd.py annotate ./annotate_dir --model_predictions ./test_dir0 ./test_dir1 ./test_dir2 ./test_dir3 ./test_dir4', returncode=0)

### Visualization

The visualization is powered by [Cytoscape Web](http://ndexbio.org/cytoscape). 

To take full advantage of styling and analysis features, upload your network to NDEx. Start by creating a free [NDEx account](https://www.ndexbio.org). Once registered, you can upload your network along with the VNN hierarchy annotations.

Note: To upload the hierarchy correctly, you must have access to the parent network (also known as the interactome). If the interactome is publicly available on NDEx, you can use its UUID. Otherwise, you may provide a local file path. For the hierarchy from the `examples` directory, the parent network can be found [here](https://www.ndexbio.org/viewer/networks/0b7b8aee-332f-11ef-9621-005056ae23aa).


### Annotation and Cytoscape Visualization command

Enter your NDEx username and password when prompted below.

In [83]:
import getpass

ndexuser = getpass.getpass()

 ········


In [85]:
ndexpassword = getpass.getpass()

 ········


In [89]:
outdir = './annotate_visualization_dir'


command = (f"cellmaps_vnncmd.py annotate {outdir} "
           f"--model_predictions " + (" ").join(predict_out_paths) +
           f" --parent_network 0b7b8aee-332f-11ef-9621-005056ae23aa --ndexuser {ndexuser} "
           f"--ndexpassword {ndexpassword} --visibility"
          )
subprocess.run(command, shell=True, check=True)

Hierarchy uploaded. To view hierarchy on NDEx please paste this URL in your browser https://www.ndexbio.org/viewer/networks/753109be-577a-11f0-a218-005056ae3c32. To view Hierarchy on new experimental Cytoscape on the Web, go to https://ndexbio.org/cytoscape/0/networks/753109be-577a-11f0-a218-005056ae3c32


CompletedProcess(args='cellmaps_vnncmd.py annotate ./annotate_visualization_dir --model_predictions ./test_dir0 ./test_dir1 ./test_dir2 ./test_dir3 ./test_dir4 --parent_network 0b7b8aee-332f-11ef-9621-005056ae23aa --ndexuser <ndexuser> --ndexpassword <ndexpassword> --visibility', returncode=0)
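
A notebook displays the `CompletedProcess` object returned by `subprocess.run` when it is the last expression in a cell, and its repr includes the full command line. Values passed to `--ndexuser` and `--ndexpassword` can therefore end up in saved cell output. The sketch below shows one way to mask such values before printing a command; `redact_command` is a hypothetical helper and the credentials are placeholders.

```python
def redact_command(args, secret_flags=("--ndexpassword",)):
    """Return a copy of args with the value after each secret flag masked."""
    redacted = list(args)
    for i, arg in enumerate(args[:-1]):
        if arg in secret_flags:
            redacted[i + 1] = "***"
    return redacted


cmd = [
    "cellmaps_vnncmd.py", "annotate", "./annotate_visualization_dir",
    "--ndexuser", "user@example.org",
    "--ndexpassword", "not-a-real-password",
]
print(" ".join(redact_command(cmd)))
```

Assigning the return value, as in `result = subprocess.run(...)`, and printing only `result.returncode` also keeps the command line out of the cell output.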

### Viewing the Hierarchy Locally with Cytoscape Web

You can also view the annotated hierarchy locally without an NDEx account. While local visualization may have limited styling and analysis capabilities, it still offers a useful overview of the network structure and the important systems for prediction.

In [102]:
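import subprocess
import re
import os
from ndex2.cx2 import RawCX2NetworkFactory

def get_jupyter_server_info():
    # Read the token and root directory from `jupyter server list`
    output = subprocess.check_output(["jupyter", "server", "list"]).decode('utf-8')
    token = re.search(r"(?<=token=)[a-f\d]+", output)
    jupyter_path = output.split()[-1].strip()
    return token.group(0) if token else None, jupyter_path

def get_current_directory(jupyter_path):
    # Working directory relative to the Jupyter root directory
    return os.getcwd().replace(jupyter_path, "")

def update_hcx_annotations(annotate_dir, ndex_host, uuid_interactome):
    # ASSUMPTION: this helper is called but not defined in the original cell.
    # This version records the parent network (interactome) location in the
    # hierarchy's HCX network attributes and rewrites the CX2 file in place.
    path = os.path.join(annotate_dir, 'hierarchy.cx2')
    hierarchy = RawCX2NetworkFactory().get_cx2network(path)
    hierarchy.add_network_attribute('HCX::interactionNetworkHost', ndex_host)
    hierarchy.add_network_attribute('HCX::interactionNetworkUUID', uuid_interactome)
    hierarchy.write_as_raw_cx2(path)
    return path

token, jupyter_path = get_jupyter_server_info()
files_path = get_current_directory(jupyter_path)

# Adjust to your network specifics
annotate_dir = "annotate_dir"
ndex_host = "ndexbio.org"
# UUID of the parent network (interactome) for the example hierarchy
uuid_interactome = "0b7b8aee-332f-11ef-9621-005056ae23aa"

updated_hierarchy_path = update_hcx_annotations(annotate_dir, ndex_host, uuid_interactome)
# The URL assumes the Jupyter server listens on localhost:8888
print(f"https://ndexbio.org/cytoscape/?import=http://localhost:8888/files{files_path}/{updated_hierarchy_path}?token={token}")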

https://ndexbio.org/cytoscape/?import=http://localhost:8888/files/notebooks/annotate_dir/hierarchy.cx2?token=2f7e10791fa59bf375dfee47c9cd1d69534127f0c71b9058
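
The URL above hard-codes `localhost:8888`. If the server runs on another port, the address can be taken from the `jupyter server list` output instead. The sketch below assumes each server is listed on one line in the form `URL :: root_dir`; the helper names and the sample line are hypothetical, and the import URL pattern is copied from the output above.

```python
from urllib.parse import parse_qs, urlsplit


def parse_server_line(line):
    """Split one 'URL :: root_dir' line into (base_url, token, root_dir)."""
    url, _, root_dir = line.partition(" :: ")
    parts = urlsplit(url.strip())
    # parse_qs returns a list per key; take the first token if present
    token = parse_qs(parts.query).get("token", [None])[0]
    base_url = f"{parts.scheme}://{parts.netloc}"
    return base_url, token, root_dir.strip()


def build_import_url(base_url, token, relative_path):
    """Build the Cytoscape Web import URL for a file served by Jupyter."""
    return (
        "https://ndexbio.org/cytoscape/?import="
        f"{base_url}/files/{relative_path}?token={token}"
    )


line = "http://localhost:8889/?token=0123abcd :: /home/user/project"
base_url, token, root_dir = parse_server_line(line)
print(build_import_url(base_url, token, "notebooks/annotate_dir/hierarchy.cx2"))
```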
