# Step-by-step Guide: Running Cellmaps Pipeline

### Installation

We strongly recommend creating a conda virtual environment and running Jupyter from within it.

`conda create -n cm4ai python=3.8`

`conda activate cm4ai`

To install Cellmaps Pipeline, run:

`pip install cellmaps_pipeline`
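
After installing, it can help to confirm that the modules imported in the steps below are visible from the active environment. The following is a minimal sketch, not part of the pipeline: the helper name `missing_modules` is hypothetical, and the module list is simply collected from the import statements in this guide. Whether `pip install cellmaps_pipeline` installs every one of them is not stated here, so treat this as a check rather than a guarantee.

```python
from importlib.util import find_spec

# Top-level modules imported by the code cells in this guide.
PIPELINE_MODULES = [
    'cellmaps_imagedownloader',
    'cellmaps_ppidownloader',
    'cellmaps_image_embedding',
    'cellmaps_ppi_embedding',
    'cellmaps_coembedding',
    'cellmaps_generate_hierarchy',
    'cellmaps_hierarchyeval',
    'networkx',
    'ndex2',
]


def missing_modules(names):
    """Return the top-level module names that cannot be found (hypothetical helper)."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(PIPELINE_MODULES)
print('Missing modules:', missing if missing else 'none')
```

An empty result only means the modules can be located; it does not verify their versions.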

### Input Data

The cell maps pipeline requires the following input files for building MuSIC maps by integrating IF images with an AP-MS interaction network:

- samples file: CSV file listing the IF images to download (see the sample `samples.csv` in the examples folder)

- unique file: CSV file of unique samples (see the sample `unique.csv` in the examples folder)

- bait list file: TSV file of baits used in the AP-MS experiments

- edge list file: TSV file of edges for the protein interaction network

- provenance: JSON file containing provenance information about the input files (see the sample provenance file in the examples folder, or create one yourself)
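
Before starting, it may be worth checking that all five inputs are in place. The sketch below assumes the file names used by the code cells in this guide (`samples.csv`, `unique.csv`, `baitlist.tsv`, `edgelist.tsv`, `provenance.json`) and the `../examples` folder they read from; the helper `missing_inputs` is hypothetical and only tests for existence, not for valid content.

```python
from pathlib import Path

# File names as they appear in the code cells of this guide.
REQUIRED_INPUTS = [
    'samples.csv',
    'unique.csv',
    'baitlist.tsv',
    'edgelist.tsv',
    'provenance.json',
]


def missing_inputs(folder, names=REQUIRED_INPUTS):
    """Return the required file names that are not present in `folder`."""
    folder = Path(folder)
    return [name for name in names if not (folder / name).is_file()]


# '../examples' is the folder the later cells read their inputs from.
print('Missing inputs:', missing_inputs('../examples'))
```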

## Step 1: Download ImmunoFluorescent image data

Detailed documentation available [here](https://cellmaps-imagedownloader.readthedocs.io/).

```python
from cellmaps_imagedownloader.runner import CellmapsImageDownloader
from cellmaps_imagedownloader.runner import MultiProcessImageDownloader
from cellmaps_imagedownloader.gene import ImageGeneNodeAttributeGenerator as IGen
from cellmaps_imagedownloader.proteinatlas import ProteinAtlasReader, ProteinAtlasImageUrlReader, ImageDownloadTupleGenerator
import json

# Load the unique list, the samples list, and the provenance information
u_list = IGen.get_unique_list_from_csvfile('../examples/unique.csv')
s_list = IGen.get_samples_from_csvfile('../examples/samples.csv')
with open('../examples/provenance.json', 'r') as f:
    json_prov = json.load(f)

imagegen = IGen(unique_list=u_list, samples_list=s_list)

# Set up the downloader and the generator of image URLs
outdir = '1.image_download'
dloader = MultiProcessImageDownloader(poolsize=4)
proteinatlas_reader = ProteinAtlasReader(outdir)
proteinatlas_urlreader = ProteinAtlasImageUrlReader(reader=proteinatlas_reader)
imageurlgen = ImageDownloadTupleGenerator(reader=proteinatlas_urlreader,
                                          samples_list=imagegen.get_samples_list(),
                                          valid_image_ids=imagegen.get_samples_list_image_ids())

x = CellmapsImageDownloader(outdir=outdir, imagedownloader=dloader, imgsuffix='.jpg', imagegen=imagegen,
                            imageurlgen=imageurlgen, provenance=json_prov)
x.run()
```

#### Main Outputs

* `1_image_gene_node_attributes.tsv`:
A TSV file containing attributes for image genes generated during the first fold of execution; `2_image_gene_node_attributes.tsv` corresponds to the second fold, and so on.

```
name        represents      ambiguous       antibody        filename        imageurl
UHRF2       ensembl:ENSG00000147854         HPA026633       B2AI_1_untreated_D2_R5_ no image url found
TET3        ensembl:ENSG00000187605         HPA050845       B2AI_1_untreated_E5_R5_ no image url found
HDAC6       ensembl:ENSG00000094631         HPA003714       B2AI_1_untreated_G3_R5_ no image url found
HDAC3       ensembl:ENSG00000171720         HPA052052       B2AI_1_untreated_D3_R7_ no image url found
```

* `blue`, `red`, `green`, `yellow`:
Directories containing the downloaded images, one directory per color channel.
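
As a quick sanity check on the node attributes file, one can list the genes whose `imageurl` column reads `no image url found`. This is a minimal sketch using only the standard library; the helper name is hypothetical, and the sample text reuses two rows from the excerpt above, written with explicit tabs and an empty `ambiguous` column.

```python
import csv
import io

# Two rows from the sample above, with explicit tab separators.
SAMPLE_TSV = (
    'name\trepresents\tambiguous\tantibody\tfilename\timageurl\n'
    'UHRF2\tensembl:ENSG00000147854\t\tHPA026633\tB2AI_1_untreated_D2_R5_\tno image url found\n'
    'TET3\tensembl:ENSG00000187605\t\tHPA050845\tB2AI_1_untreated_E5_R5_\tno image url found\n'
)


def genes_without_image_url(handle):
    """Return gene names whose imageurl column is 'no image url found'."""
    reader = csv.DictReader(handle, delimiter='\t')
    return [row['name'] for row in reader
            if row['imageurl'] == 'no image url found']


print(genes_without_image_url(io.StringIO(SAMPLE_TSV)))
# ['UHRF2', 'TET3']
```

To run it on real output, pass an open file handle for the TSV file in the Step 1 output directory instead of the `io.StringIO` object.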

## Step 2: Download Affinity-Purification mass spectrometry (AP-MS) data as a Protein-Protein Interaction network

Detailed documentation available [here](https://cellmaps-ppidownloader.readthedocs.io/).

```python
import json

from cellmaps_ppidownloader.runner import CellmapsPPIDownloader
from cellmaps_ppidownloader.gene import APMSGeneNodeAttributeGenerator

with open('../examples/provenance.json', 'r') as f:
    json_prov = json.load(f)

# Build the gene node attribute generator from the AP-MS edge list and bait list
apmsgen = APMSGeneNodeAttributeGenerator(
    apms_edgelist=APMSGeneNodeAttributeGenerator.get_apms_edgelist_from_tsvfile('../examples/edgelist.tsv'),
    apms_baitlist=APMSGeneNodeAttributeGenerator.get_apms_baitlist_from_tsvfile('../examples/baitlist.tsv'))

x = CellmapsPPIDownloader(outdir='1.ppi_download', apmsgen=apmsgen, provenance=json_prov, input_data_dict={})
x.run()
```

#### Main Outputs

* `ppi_edgelist.tsv`:
A processed edge list file which represents protein-protein interactions, where proteins are identified by their symbols.

```
geneA       geneB
DNMT3A      SAP18
DNMT3A      DDX3X
DNMT3A      SEC16A
DNMT3A      U2SURP
DNMT3A      SYNJ2
```

* `ppi_gene_node_attributes.tsv`:
Contains attributes for each gene node in the protein-protein interaction network, including the gene name, Ensembl ID, and whether the gene was used as a bait.

```
name        represents      ambiguous       bait
DNMT3A      ensembl:ENSG00000119772         TRUE
HDAC2       ensembl:ENSG00000196591         TRUE
KDM6A       ensembl:ENSG00000147050         TRUE
SMARCA4     ensembl:ENSG00000127616         TRUE
```
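
To get a feel for the resulting network, one can count how many interactions each protein takes part in. The sketch below assumes `ppi_edgelist.tsv` is tab-separated with the `geneA` and `geneB` header shown above; the helper `degree_counts` is hypothetical and the sample rows are copied from the excerpt.

```python
import csv
import io
from collections import Counter

# Rows from the ppi_edgelist.tsv sample above, with explicit tabs.
SAMPLE_EDGELIST = (
    'geneA\tgeneB\n'
    'DNMT3A\tSAP18\n'
    'DNMT3A\tDDX3X\n'
    'DNMT3A\tSEC16A\n'
    'DNMT3A\tU2SURP\n'
    'DNMT3A\tSYNJ2\n'
)


def degree_counts(handle):
    """Count the number of edges touching each gene in an edge list."""
    reader = csv.DictReader(handle, delimiter='\t')
    degrees = Counter()
    for row in reader:
        degrees[row['geneA']] += 1
        degrees[row['geneB']] += 1
    return degrees


degrees = degree_counts(io.StringIO(SAMPLE_EDGELIST))
print(degrees.most_common(1))
# [('DNMT3A', 5)]
```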

## Step 3: Generate embeddings from ImmunoFluorescent image data

Detailed documentation available [here](https://cellmaps-image-embedding.readthedocs.io/).

```python
from cellmaps_image_embedding.runner import CellmapsImageEmbedder
from cellmaps_image_embedding.runner import DensenetEmbeddingGenerator
import os

model_path = 'https://github.com/CellProfiling/densenet/releases/download/v0.1.0/external_crop512_focal_slov_hardlog_class_densenet121_dropout_i768_aug2_5folds_fold0_final.pth'
outdir = '2.image_embedding'
inputdir = '1.image_download'
gen = DensenetEmbeddingGenerator(os.path.abspath(inputdir),
                                 outdir=os.path.abspath(outdir),
                                 model_path=model_path,
                                 fold=1)
x = CellmapsImageEmbedder(outdir=outdir,
                          inputdir=inputdir,
                          embedding_generator=gen)
x.run()
```

#### Main Outputs

* `image_emd.tsv`:
A tab-separated file containing the generated embeddings for each image. Each row corresponds to an image and the subsequent columns contain the embedding vector.

```
        1   2       3       4
BPTF        -0.037030112    -0.139459819    0.417184144     0.386600941
KAT2B       0.02969132      -0.139459819    -0.038685802    0.136547908
PARP1       -0.037030112    -0.139459819    0.540370524     0.119614214
MSL1        0.18169874      -0.139459819    -0.038685802    0.152157351
KAT6B       -0.037030112    -0.139459819    0.308141887     0.257056117
```

* `labels_prob.tsv`:
This tab-separated file contains probability scores for each of the 28 possible protein labels (e.g., Nucleoplasm, N. membrane, etc.) for each image.

```
    Nucleoplasm     N. membrane     Nucleoli        N. fibrillar c.
BPTF        0.740698278     0.270941526     0.147179633     0.149313971
KAT2B       0.38626197      0.092356719     0.36738047      0.238842875
PARP1       0.596435964     0.100168504     0.382214785     0.179471999
MSL1        0.195862561     0.01370267      0.101418771     0.038516384
KAT6B       0.606423676     0.101763181     0.337655455     0.201311186
```

* `model.pth`:
The pre-trained Densenet model used for image embedding.

* `blue_resize`, `green_resize`, `red_resize`, `yellow_resize`:
Directories containing the processed images for the corresponding channel.
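
A simple way to read `labels_prob.tsv` is to pick the highest-scoring label for each row. The following sketch assumes the file is tab-separated and that the header row starts with an empty cell above the gene names, as the excerpt suggests; `top_label_per_gene` is a hypothetical helper, and the sample shows only four of the 28 label columns.

```python
import csv
import io

# Three rows from the labels_prob.tsv sample above, with explicit tabs.
SAMPLE_LABELS = (
    '\tNucleoplasm\tN. membrane\tNucleoli\tN. fibrillar c.\n'
    'BPTF\t0.740698278\t0.270941526\t0.147179633\t0.149313971\n'
    'KAT2B\t0.38626197\t0.092356719\t0.36738047\t0.238842875\n'
    'MSL1\t0.195862561\t0.01370267\t0.101418771\t0.038516384\n'
)


def top_label_per_gene(handle):
    """Map each row name to its (label, probability) pair with the highest score."""
    reader = csv.reader(handle, delimiter='\t')
    labels = next(reader)[1:]  # skip the empty first header cell
    result = {}
    for row in reader:
        if not row:
            continue
        probs = [float(value) for value in row[1:]]
        best = max(range(len(probs)), key=probs.__getitem__)
        result[row[0]] = (labels[best], probs[best])
    return result


for gene, (label, prob) in top_label_per_gene(io.StringIO(SAMPLE_LABELS)).items():
    print(gene, label, prob)
```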

## Step 4: Generate embeddings from Protein-Protein interaction networks

Detailed documentation available [here](https://cellmaps-ppi-embedding.readthedocs.io/).

```python
from cellmaps_ppi_embedding.runner import Node2VecEmbeddingGenerator
from cellmaps_ppi_embedding.runner import CellMapsPPIEmbedder
import networkx as nx

inputdir = '1.ppi_download'
outdir = '2.ppi_embedding'
gen = Node2VecEmbeddingGenerator(nx_network=nx.read_edgelist(CellMapsPPIEmbedder.get_apms_edgelist_file(inputdir),
                                                             delimiter='\t'))

x = CellMapsPPIEmbedder(outdir=outdir,
                        embedding_generator=gen,
                        inputdir=inputdir)
x.run()
```

#### Main Outputs

* `ppi_emd.tsv`:
A TSV file that contains the embeddings for the protein-protein interactions (PPIs). The first column consists of gene names, followed by the embedding vectors in subsequent columns.

```
        1   2       3       4
HDAC2       0.00322267      0.068772331     0.087871492     0.074549779
SMARCA4     0.014913903     -0.025018152    -0.01334604     -0.050020121
DNMT3A      0.030166976     0.082494646     0.083659336     -0.005459526
KDM6A       0.058055822     0.151974067     0.122265264     0.057505969
RPS4X       0.016731756     0.046027087     0.041698962     0.010518731
```
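
The embedding files share a simple layout: a header row of dimension indices, then one row per gene. A minimal loader, assuming tab-separated input, might look like the sketch below; `load_embeddings` is a hypothetical helper and the sample reuses two rows from the excerpt above, which shows only the first four dimensions.

```python
import csv
import io

# Two rows from the ppi_emd.tsv sample above, with explicit tabs.
SAMPLE_EMD = (
    '\t1\t2\t3\t4\n'
    'HDAC2\t0.00322267\t0.068772331\t0.087871492\t0.074549779\n'
    'SMARCA4\t0.014913903\t-0.025018152\t-0.01334604\t-0.050020121\n'
)


def load_embeddings(handle):
    """Read an embedding TSV into a dict of gene name -> list of floats."""
    reader = csv.reader(handle, delimiter='\t')
    next(reader)  # skip the header row of dimension indices
    embeddings = {row[0]: [float(value) for value in row[1:]]
                  for row in reader if row}
    # Every vector should have the same number of dimensions.
    if len({len(vector) for vector in embeddings.values()}) > 1:
        raise ValueError('embedding rows have inconsistent lengths')
    return embeddings


embeddings = load_embeddings(io.StringIO(SAMPLE_EMD))
print(sorted(embeddings), len(embeddings['HDAC2']))
# ['HDAC2', 'SMARCA4'] 4
```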

## Step 5: Generate co-embedding from image and Protein-Protein Interaction (PPI) embeddings

Detailed documentation available [here](https://cellmaps-coembedding.readthedocs.io/).

```python
import os

from cellmaps_coembedding.runner import MuseCoEmbeddingGenerator
from cellmaps_coembedding.runner import CellmapsCoEmbedder

ppi_embeddingdir = '2.ppi_embedding'
image_embeddingdir = '2.image_embedding'
outdir = '3.coembedding'
gen = MuseCoEmbeddingGenerator(ppi_embeddingdir=ppi_embeddingdir,
                               image_embeddingdir=image_embeddingdir,
                               outdir=os.path.abspath(outdir))

x = CellmapsCoEmbedder(outdir=outdir,
                       inputdirs=[ppi_embeddingdir, image_embeddingdir],
                       embedding_generator=gen)
x.run()
```

#### Main Outputs

* `coembedding_emd.tsv`:
This file represents the co-embedding of Protein-Protein Interaction (PPI) and image embeddings. The first column contains identifiers (either gene symbols or sample IDs) while the subsequent columns contain embedding values.

```
        1   2       3       4
AURKB       -0.06713819     -0.027032608    -0.117943764    -0.14860943
BAZ1B       0.100407355     0.1299548       -0.011916596    0.02393107
BRD7        0.07245989      0.12707146      -0.000744308    0.023155764
CBX3        -0.115645304    -0.1549612      -0.08860879     -0.038656197
CHD1        0.016580202     0.11743456      -0.009839832    -0.008252605
```
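
One common way to compare two rows of an embedding table is cosine similarity. The sketch below is illustrative only and is not taken from the pipeline's code: `cosine_similarity` is a hypothetical helper, and the vectors are the first four dimensions of two rows from the sample above, so the resulting value says nothing about the full embeddings.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError('cosine similarity is undefined for a zero vector')
    return dot / (norm_u * norm_v)


# First four dimensions of two rows from the coembedding_emd.tsv sample.
baz1b = [0.100407355, 0.1299548, -0.011916596, 0.02393107]
brd7 = [0.07245989, 0.12707146, -0.000744308, 0.023155764]
print(round(cosine_similarity(baz1b, brd7), 3))
```

Values close to 1 indicate vectors pointing in nearly the same direction; values near 0 indicate little directional similarity.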

## Step 6: Generate hierarchy from co-embeddings using HiDeF

Detailed documentation available [here](https://cellmaps-generate-hierarchy.readthedocs.io/).

```python
from cellmaps_generate_hierarchy.ppi import CosineSimilarityPPIGenerator
from cellmaps_generate_hierarchy.hierarchy import CDAPSHiDeFHierarchyGenerator
from cellmaps_generate_hierarchy.maturehierarchy import HiDeFHierarchyRefiner
from cellmaps_generate_hierarchy.hcx import HCXFromCDAPSCXHierarchy
from cellmaps_generate_hierarchy.runner import CellmapsGenerateHierarchy

inputdir = '3.coembedding'
outdir = '4.hierarchy'
ppigen = CosineSimilarityPPIGenerator(embeddingdirs=[inputdir])

refiner = HiDeFHierarchyRefiner()

converter = HCXFromCDAPSCXHierarchy()

hiergen = CDAPSHiDeFHierarchyGenerator(refiner=refiner,
                                       hcxconverter=converter)

x = CellmapsGenerateHierarchy(outdir=outdir,
                              inputdirs=inputdir,
                              ppigen=ppigen,
                              hiergen=hiergen)
x.run()
```

#### Main Outputs

* `hierarchy.cx2`:
The main output file containing the generated hierarchy in HCX format.

* `hierarchy_parent.cx2`:
The parent (primary) network, in CX2 format, used as the reference for generating the hierarchy.

* `ppi_cutoff_*.cx`:
Protein-Protein Interaction networks in CX format. Can be omitted.

* `ppi_cutoff_*.id.edgelist.tsv`:
Edgelist representation of the Protein-Protein Interaction networks.

* `hidef_output.edges`:
Contains the edges or interactions in the HiDeF generated hierarchy.

```
Cluster0-0  Cluster1-0      default
Cluster0-0  Cluster1-1      default
```

* `hidef_output.nodes`:
Contains the nodes or entities in the HiDeF generated hierarchy.

```
Cluster0-0  23      0 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8 9      0
Cluster1-0  7       0 1 10 20 4 5 6      119
```

* `hidef_output.pruned.edges`:
Contains pruned edges after certain filtering (maturing) processes on the original hierarchy.

```
Cluster0-0  Cluster1-0      default
Cluster0-0  Cluster1-1      default
```

* `hidef_output.pruned.nodes`:
Contains pruned nodes after certain filtering (maturing) processes on the original hierarchy.

```
Cluster0-0  23      3 17 21 4 20 1 10 12 9 14 8 2 15 19 5 11 7 16 18 0 13 22 6      0
Cluster1-0  7       20 1 5 4 10 0 6      119
```

* `hidef_output.weaver`:
Information related to the weaving process used in generating the hierarchy.
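
The `.edges` files can be turned into a parent-to-children mapping with a few lines of Python. This sketch assumes each line holds whitespace-separated fields where the first is the parent cluster and the second is the child, which is how the sample above reads; `children_by_parent` is a hypothetical helper.

```python
from collections import defaultdict

# Lines from the hidef_output.edges sample above.
SAMPLE_EDGES = (
    'Cluster0-0\tCluster1-0\tdefault\n'
    'Cluster0-0\tCluster1-1\tdefault\n'
)


def children_by_parent(lines):
    """Group child cluster names under their parent cluster name."""
    children = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        parent, child = fields[0], fields[1]
        children[parent].append(child)
    return dict(children)


print(children_by_parent(SAMPLE_EDGES.splitlines()))
# {'Cluster0-0': ['Cluster1-0', 'Cluster1-1']}
```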

## Step 7: Annotate a hierarchy by performing enrichment against three NDEx networks HPA, CORUM, and GO-CC

Detailed documentation available [here](https://cellmaps-hierarchyeval.readthedocs.io/).

```python
from cellmaps_hierarchyeval.runner import CellmapshierarchyevalRunner

inputdir = '4.hierarchy'
outdir = '5.hierarchyeval'

x = CellmapshierarchyevalRunner(outdir=outdir,
                                hierarchy_dir=inputdir)
x.run()
```

#### Main Outputs

* `hierarchy.cx2`:
This is the enriched hierarchy network file that integrates the results of the enrichment analysis into the hierarchy, formatted in CX2.

* `hierarchy_parent.cx2`:
The reference parent network from which the hierarchy was generated, formatted in CX2. This file is copied from the input directory.

* `hierarchy_node_attributes.tsv`:
A TSV file containing attributes for each node, which includes information such as enriched terms, their descriptions, and related statistical data.

## (Optional) Step 8: Upload hierarchy to NDEx

Detailed documentation available [here](https://cellmaps-generate-hierarchy.readthedocs.io/en/latest/usage.html#uploading-hierarchy-to-ndex).

```python
import os
import ndex2
from ndex2.cx2 import RawCX2NetworkFactory
from cellmaps_generate_hierarchy.ndexupload import NDExHierarchyUploader

# Specify NDEx server
ndexserver = 'idekerlab.ndexbio.org'
ndexuser = '<USER>'
ndexpassword = '<PASSWORD>'

# Load the hierarchy and parent network CX2 files into network objects
factory = RawCX2NetworkFactory()
hierarchy_network = factory.get_cx2network('5.hierarchyeval/hierarchy.cx2')
parent_network = factory.get_cx2network('5.hierarchyeval/hierarchy_parent.cx2')

# Initialize NDExHierarchyUploader with the specified NDEx server and credentials
uploader = NDExHierarchyUploader(ndexserver, ndexuser, ndexpassword, visibility=True)

# Upload the hierarchy and parent network to NDEx
parent_uuid, parenturl, hierarchy_uuid, hierarchyurl = uploader.save_hierarchy_and_parent_network(hierarchy_network, parent_network)

print(f"Parent network UUID is {parent_uuid} and its URL in NDEx is {parenturl}")
print(f"Hierarchy network UUID is {hierarchy_uuid} and its URL in NDEx is {hierarchyurl}")
```
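
Rather than typing the NDEx user name and password into the notebook, the credentials can be read from environment variables. This is a minimal sketch: the variable names `NDEX_USER` and `NDEX_PASSWORD` and the helper `ndex_credentials` are assumptions made for illustration, not something the NDEx client or the pipeline defines.

```python
import os


def ndex_credentials(env=None):
    """Return (user, password) from the given mapping, defaulting to os.environ.

    NDEX_USER and NDEX_PASSWORD are hypothetical variable names.
    """
    env = os.environ if env is None else env
    try:
        return env['NDEX_USER'], env['NDEX_PASSWORD']
    except KeyError as err:
        raise RuntimeError(f'Environment variable {err.args[0]} is not set') from err


# Demonstration with an explicit mapping; call ndex_credentials() with no
# argument to read from the real environment instead.
user, password = ndex_credentials({'NDEX_USER': 'alice', 'NDEX_PASSWORD': 'secret'})
print(user)
# alice
```

The returned values can then be used in place of the `'<USER>'` and `'<PASSWORD>'` placeholders.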