# Step-by-step guide to build MuSIC v1

Before starting, please install the required packages by following the guidelines in the **Dependency** section of the GitHub README.

## Step 1. Data embeddings

Here we provide the 1024-dimensional embeddings for the 1,451 images and the 661 proteins used in MuSIC v1. The MuSIC pipeline can handle any number of data types (e.g., IF, APMS) and any protein embedding dimension (i.e., length of the feature vector). However, all proteins must have the same number of dimensions within each individual data type, and we recommend keeping the number of dimensions consistent across data types.

For custom embedding files, please format each file to match the style of the example embedding files:
- First column: embedding index
- Second column: gene name
- Following columns: each column is one entry in the embedding vector
- All columns are comma-separated
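
The layout above can be produced with a few lines of Python. This is a minimal sketch, not part of the MuSIC code base: the helper name `write_embedding_csv` is hypothetical, the `IF_1`-style index only mirrors what the example files show, and the absence of a header row is inferred from the example files being read with `header=None`.

```python
import csv
import os
import tempfile


def write_embedding_csv(path, prefix, gene_vectors):
    """Write (gene, vector) pairs as: index, gene name, then one column per dimension.

    Hypothetical helper. No header row is written, because the example files
    are loaded with header=None.
    """
    dims = {len(vec) for _, vec in gene_vectors}
    if len(dims) != 1:
        # Every protein in one data type must share the same embedding length.
        raise ValueError(f"inconsistent embedding dimensions: {sorted(dims)}")
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for i, (gene, vec) in enumerate(gene_vectors, start=1):
            writer.writerow([f"{prefix}_{i}", gene, *vec])


# Toy data: two genes with 4-dimensional vectors (real embeddings here use 1024).
toy = [("RRS1", [0.1, 0.2, 0.3, 0.4]), ("SNRNP70", [0.5, 0.6, 0.7, 0.8])]
write_embedding_csv(os.path.join(tempfile.gettempdir(), "toy_embedding.csv"), "APMS", toy)
```

Whether the pipeline depends on the exact index naming is not stated here, so copying the convention of the example files is the conservative choice.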

In [1]:
import pandas as pd
IF_emd = pd.read_csv('./Examples/IF_image_embedding.csv', header=None)
APMS_emd = pd.read_csv('./Examples/APMS_embedding.MuSIC.csv', header=None)

In [2]:
IF_emd.head(2)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1016,1017,1018,1019,1020,1021,1022,1023,1024,1025
0,IF_1,HLA-DPA1,0.022881,-0.060293,-0.140246,0.037648,0.009863,0.292151,-0.042231,0.020537,...,-0.036327,0.166945,0.023639,0.60142,-0.078425,0.19218,0.011603,0.031387,-0.029221,0.23038
1,IF_2,HLA-DPA1,0.022881,-0.11374,-0.140246,0.032609,0.191921,0.401565,-0.042231,0.027777,...,-0.129697,0.103786,0.023639,0.654706,-0.078425,0.156273,0.011603,0.031387,-0.001136,0.20808


In [3]:
APMS_emd.head(2)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1016,1017,1018,1019,1020,1021,1022,1023,1024,1025
0,APMS_1,RRS1,0.07591,0.161315,-0.025731,0.071347,-0.175898,0.041408,-0.061304,-0.136247,...,-0.052199,-0.018137,0.042997,0.246699,-0.043538,0.016049,-0.147477,-0.049635,-0.001222,-0.172205
1,APMS_2,SNRNP70,-0.019872,0.083736,0.151332,0.080374,-0.053558,0.067913,-0.057474,-0.114813,...,-0.014049,-0.154981,-0.187242,-0.082924,-0.089952,-0.09161,-0.051024,-0.005062,-0.170704,0.042429
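
Before moving on, it can help to confirm that an embedding file has the expected shape. The sketch below uses a hypothetical helper, `summarize_embedding_file`, which assumes the headerless, comma-separated layout described in Step 1. It counts rows and distinct gene names and checks that every row has the same number of embedding dimensions.

```python
import csv


def summarize_embedding_file(path):
    """Return (n_rows, n_unique_genes, n_dims) for a headerless embedding CSV.

    Hypothetical helper; assumes column 0 is the index, column 1 the gene name,
    and all remaining columns are embedding entries.
    """
    n_rows = 0
    genes = set()
    dims = set()
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            if not row:  # skip blank lines
                continue
            n_rows += 1
            genes.add(row[1])
            dims.add(len(row) - 2)  # exclude the index and gene-name columns
    if len(dims) > 1:
        raise ValueError(f"rows have differing dimensions: {sorted(dims)}")
    return n_rows, len(genes), (dims.pop() if dims else 0)


# Example call (requires the example file to be present):
# print(summarize_embedding_file("./Examples/IF_image_embedding.csv"))
```

A gene can appear in more than one row (for instance, HLA-DPA1 occurs in both rows shown for the IF file), so the row count and the number of distinct genes may differ.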


## Step 2. Calibrate protein-protein distance and proximity from Gene Ontology

In [None]:
%%bash
python calibrate_pairwise_distance.py \
--protein_file ./Examples/MuSIC_proteins.txt \
--outprefix ./Examples/output/test

## Step 3. Random forest prediction of protein distances

Each random forest regressor in the original MuSIC study was trained on ~1M samples, each consisting of 2060 input features, which required ~1 day and >100 GB of memory with 24 threads. As a demo, the code below generates two sets of image embeddings (MuSIC used six) and creates 1000 samples for training.
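
To get a feel for the scale difference between the full study and this demo, the size of the training feature matrix alone can be estimated. The sketch below assumes dense 64-bit floats, which is an assumption and not something the MuSIC scripts are known to guarantee. It does not model the memory used by the forest itself or by parallel workers, so it gives only a lower bound and does not account for the full reported footprint.

```python
def feature_matrix_bytes(n_samples, n_features, bytes_per_value=8):
    """Size in bytes of a dense n_samples x n_features matrix.

    bytes_per_value=8 assumes float64 storage (an assumption).
    """
    return n_samples * n_features * bytes_per_value


full = feature_matrix_bytes(1_000_000, 2060)  # ~1M samples, as in the original study
demo = feature_matrix_bytes(1_000, 2060)      # 1000 samples, as in this demo

print(f"full study: {full / 1024**3:.2f} GiB")
print(f"demo:       {demo / 1024**2:.2f} MiB")
```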

In [None]:
%%bash
python random_forest_samples.py \
--outprefix ./Examples/output/test \
--protein_file ./Examples/MuSIC_proteins.txt \
--emd_files ./Examples/IF_image_embedding.csv ./Examples/APMS_embedding.MuSIC.csv \
--emd_label IF_emd APMS_emd \
--num_set 2 auto \
--n_samples 1000 

In [None]:
%%bash
for ((fold = 1; fold <= 5; fold++))
do
    for ((IF_set = 1; IF_set <= 2; IF_set++))
    do
        python run_random_forest.py \
        --outprefix ./Examples/output/test \
        --fold $fold \
        --emd_label IF_emd APMS_emd \
        --train_set $IF_set 1 \
        --n_jobs 60;
    done
done
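
The nested loop above launches one run per (fold, image-embedding set) pair, i.e. 5 × 2 = 10 runs in this demo. As a sketch of the same enumeration in Python, the hypothetical helper `build_rf_commands` below only builds the argument lists, using the flags shown in the cell above, and does not execute anything. This makes it easy to inspect the runs or hand them to a job scheduler.

```python
from itertools import product


def build_rf_commands(outprefix, n_folds=5, n_if_sets=2, n_jobs=60):
    """Build one run_random_forest.py command per (fold, IF set) pair.

    Hypothetical helper; the defaults mirror the values used in the bash cell.
    """
    commands = []
    # product() iterates folds in the outer loop and IF sets in the inner loop,
    # matching the order of the nested bash loops.
    for fold, if_set in product(range(1, n_folds + 1), range(1, n_if_sets + 1)):
        commands.append([
            "python", "run_random_forest.py",
            "--outprefix", outprefix,
            "--fold", str(fold),
            "--emd_label", "IF_emd", "APMS_emd",
            "--train_set", str(if_set), "1",
            "--n_jobs", str(n_jobs),
        ])
    return commands


commands = build_rf_commands("./Examples/output/test")
print(len(commands), "runs")
print(" ".join(commands[0]))
```

Each list can be passed to `subprocess.run` as-is if running from Python is preferred over the bash loop.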

In [None]:
%%bash
python random_forest_output.py --outprefix ./Examples/output/test

The predicted protein-protein proximity for all pairs of the given proteins is saved to the file **$outprefix_predicted_proximity.txt**.

## Step 4. Pan-resolution community detection

To perform pan-resolution community detection as in MuSIC, please install:
- [CliXO v1.0](https://github.com/fanzheng10/CliXO-1.0)
- [DDOT](https://github.com/michaelkyu/ddot)
	- **Note:** the dependencies are already satisfied, but users still need to **follow the instructions in the section *Install the ddot Python package*** to complete the installation.
- [alignOntology](https://github.com/mhk7/alignOntology)
	- **Note:** DDOT includes alignOntology in its `/ddot/alignOntology` folder. If you have trouble installing alignOntology from GitHub, you can pass the path to DDOT's copy as the `--path_to_alignOntology` parameter in the community detection section.
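
The fallback described in the note on alignOntology can be sketched as a small path check. The helper name `resolve_align_ontology` and both directory arguments are hypothetical. The sketch assumes the DDOT checkout contains a `ddot/alignOntology` subfolder, as the note describes.

```python
from pathlib import Path


def resolve_align_ontology(standalone_dir, ddot_dir):
    """Pick a directory to pass as --path_to_alignOntology.

    Hypothetical helper: prefers a standalone alignOntology checkout and falls
    back to the copy bundled inside a DDOT checkout.
    """
    candidates = [
        Path(standalone_dir),
        Path(ddot_dir) / "ddot" / "alignOntology",  # bundled copy (assumed layout)
    ]
    for candidate in candidates:
        if candidate.is_dir():
            return candidate
    raise FileNotFoundError(
        "alignOntology not found in: " + ", ".join(str(c) for c in candidates)
    )


# Example call with placeholder locations (adjust to your own checkouts):
# print(resolve_align_ontology("./alignOntology", "./ddot"))
```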


In [None]:
%%bash
# The Step 3 results are only a demo, so fully trained outputs are provided here to reproduce MuSIC in Step 4.
cp ./Examples/MuSIC_predicted_proximity.txt ./Examples/output/test_predicted_proximity.txt
cp ./Examples/MuSIC_avgPred_ytrue.csv ./Examples/output/test_avgPred_ytrue.csv

In [None]:
%%bash
python community_detection.py \
--outprefix ./Examples/output/test \
--path_to_clixo /cellar/users/y8qin/Modules/CliXO \
--clixo_i ./Examples/output/test_predicted_proximity.txt \
--clixo_a 0.01 --clixo_b 0.5 --clixo_m 0.008 --clixo_z 0.05 --min_diff 2 \
--path_to_alignOntology /cellar/users/y8qin/Modules/alignOntology-master \
--predict_nm_size --keep_all_files

# Note: CliXO can take ~7 hours for MuSIC, so we recommend running the command below in the background.
bash /cellar/users/y8qin/Data2/deepLoc/MuSIC/Examples/output/test.sh
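
One way to run a long job in the background from Python is sketched below. The helper `launch_in_background` is hypothetical and is not part of MuSIC. It starts the command in its own session (POSIX only), so the job is detached from the launching terminal session, and sends all output to a log file. The script path and log name in the commented example are assumptions and should be replaced with your own.

```python
import subprocess


def launch_in_background(command, log_path):
    """Start `command` detached, writing stdout and stderr to `log_path`.

    Hypothetical helper. Returns the Popen handle so the caller can poll or wait.
    """
    log = open(log_path, "w")
    proc = subprocess.Popen(
        command,
        stdout=log,
        stderr=subprocess.STDOUT,   # merge stderr into the same log
        start_new_session=True,     # detach from the current session (POSIX)
    )
    log.close()  # the child process keeps its own copy of the file descriptor
    return proc


# Example (paths are placeholders):
# proc = launch_in_background(["bash", "./Examples/output/test.sh"],
#                             "./Examples/output/test_clixo.log")
# print("started with PID", proc.pid)
```

`proc.poll()` returns `None` while the job is still running and the exit code once it has finished.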