# Setup

This section sets up the Jupyter Notebook and initializes the experimental settings.

### Give Jupyter Notebook access to relative import

In [1]:
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

### Create GTMDecon object

For ease of use, we work with the GTMDecon Python wrapper, which is built around the gtm-decon C executables.

In [2]:
from PythonWrapper.GTM_decon import GTM_decon

Initialize GTMDecon wrapper object:

Basic Constructor Arguments:
- **experiment_name** : str [optional]
- **n_topics** : int [optional, default=1]
    - number of topics we wish to set per celltype
- **engine_path** : str
    - path to GTM-decon C executable

Here we set only the experiment name and the engine path; the n_topics parameter defaults to 1.

In [55]:
GTM = GTM_decon(
    experiment_name = "gtm-tutorial",
    engine_path = "/home/mcb/users/slaksh1/projects/revision_gb/gtm-decon-phinorm/gtm-decon-plus-noupd-ab-phinorm"
)

Printing the wrapper shows its parameters, including the number of topics per celltype and the engine_path (path to the C executable).

The **experiment_name**, **n_topics**, and **engine_path** attributes have been set as we intended, while **genes**, **celltypes**, and **bulk_samples** are still empty. These will be populated once we provide our input reference and bulk data.

In [56]:
print(GTM)

GTM-decon wrapper object with attributes:
  - experiment_name: gtm-tutorial
  - n_topics: 1
  - engine_path: /home/mcb/users/slaksh1/projects/revision_gb/gtm-decon-phinorm/gtm-decon-plus-noupd-ab-phinorm
  - genes: []
  - celltypes: []
  - bulk_samples: []
  - verbose: True
  - output_intermediates: False
  - override_geneset: False



# Example Deconvolution Pipeline

To infer cell-type proportions for a bulk dataset given a single-cell reference matrix, we can use the **GTMDecon.pipeline** function. It processes the input data, runs it through the gtm-decon C executables, and outputs the predicted cell-type proportions of our bulk samples.

### Loading DataFrames

In [57]:
import pandas as pd
import anndata as ad

Load the example reference and bulk DataFrames from the example CSV files.

The **reference_DataFrame** should be a pandas DataFrame in which the rows are cells and the columns are genes, plus one additional column named *Celltype* containing the cell-type label associated with each row.

The **bulk_DataFrame** should be a pandas DataFrame, where the rows represent genes, with the genes stored as the index, and the columns represent the bulk batches.
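
The two layouts described above can be illustrated with toy data. The following is a minimal sketch and not part of the GTM-decon wrapper: the gene names, counts, sample names, and the `check_inputs` helper are all hypothetical, and the wrapper may perform its own validation.

```python
import pandas as pd

# Toy reference: rows are cells, columns are genes, plus a 'Celltype' label column.
toy_reference = pd.DataFrame(
    {
        "GENE_A": [5, 0, 2],
        "GENE_B": [0, 7, 1],
        "Celltype": ["alpha cell", "beta cell", "alpha cell"],
    }
)

# Toy bulk: rows are genes (stored as the index), columns are bulk samples.
toy_bulk = pd.DataFrame(
    {"S1": [120, 30], "S2": [80, 45]},
    index=["GENE_A", "GENE_B"],
)


def check_inputs(reference, bulk, label_col="Celltype"):
    """Hypothetical helper: check the two layouts and return the shared genes."""
    if label_col not in reference.columns:
        raise ValueError(f"reference is missing the '{label_col}' column")
    reference_genes = [c for c in reference.columns if c != label_col]
    shared = [g for g in reference_genes if g in bulk.index]
    if not shared:
        raise ValueError("reference and bulk share no genes")
    return shared


shared_genes = check_inputs(toy_reference, toy_bulk)
print(shared_genes)
```

A check like this catches a transposed bulk matrix or a missing label column before any files are written for the C executable.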

First, we need to uncompress the tutorial data.

In [65]:
!ls ../data

Fig3a.png
Fig3c.png
genes.txt
GEO_Fadista_5topics_metagene_normalized.csv
Human_cell_markers_geneset.txt
temp
trainData_5topics_phi_normalized.csv
tutorial_data.tar.gz


In [68]:
!tar -xzvf ../data/tutorial_data.tar.gz -C ../data

bulk_data.csv
reference_data.csv
example_proportions.csv


In [32]:
bulk_DataFrame = pd.read_csv("../data/bulk_data.csv", index_col=0)
reference_DataFrame = pd.read_csv("../data/reference_data.csv")

### Single Leave-One-Out CV fold

Since we have paired single-cell reference and bulk data for this example, we remove one individual's batch from the reference data and deconvolve the bulk data from that same individual. Holding the individual out of the reference prevents data leakage.

Here we leave out H2.

In [47]:
reference_df = reference_DataFrame[reference_DataFrame['Batch'] != 'H2']

We first need to remove the **Batch** column, as GTM_decon expects the reference DataFrame's columns to contain only the genes and the cell-type labels.

In [49]:
reference_df = reference_df.drop(columns=['Batch'])

In [50]:
reference_df.head()

Unnamed: 0,CLIC4,TGFBR3,DBT,LIN9,CSMD2,TRABD2B,DAB1,USP24,LEPR,ST6GALNAC3,...,PRR34,PKDREJ,GTSE1,GTSE1-AS1,LOC642757,LOC90834,TUBGCP6,TYMP,MIOX,Celltype
0,127,0,1,0,0,0,0,8,0,0,...,0,0,0,0,0,0,162,13,0,acinar cell
1,0,0,57,0,0,0,0,2,0,0,...,0,0,0,0,0,0,0,0,0,alpha cell
2,0,0,56,0,0,0,0,2,0,0,...,0,0,0,0,0,0,6,0,0,gamma cell
3,33,0,102,0,0,0,0,56,0,0,...,3,0,0,0,0,0,0,0,0,gamma cell
4,0,0,0,0,0,0,2,17,0,0,...,0,0,0,0,0,0,0,0,0,gamma cell


For the bulk data we do the inverse of the above: we keep only batch H2, so that we can infer the cell-type proportions of this sample.

In [41]:
bulk_df = bulk_DataFrame[['H2']]

In [42]:
bulk_df.head()

Unnamed: 0,H2
SGIP1,616
AZIN2,344
CLIC4,2753
AGBL4,155
NECAP2,1737
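
The leave-one-out steps above can be wrapped in a small function so that other folds are easy to produce. This is a sketch under stated assumptions: `leave_one_out` is a hypothetical helper, not part of the wrapper, and the toy DataFrames stand in for **reference_DataFrame** and **bulk_DataFrame**, assuming the bulk column names match the values in the **Batch** column.

```python
import pandas as pd


def leave_one_out(reference, bulk, held_out, batch_col="Batch"):
    """Hypothetical helper: split paired data for one leave-one-out fold.

    Returns the reference without the held-out batch (and without the
    batch column) and the bulk data restricted to the held-out sample.
    """
    train_reference = reference[reference[batch_col] != held_out]
    train_reference = train_reference.drop(columns=[batch_col])
    test_bulk = bulk[[held_out]]  # double brackets keep a DataFrame, not a Series
    return train_reference, test_bulk


# Toy data standing in for reference_DataFrame and bulk_DataFrame.
toy_reference = pd.DataFrame(
    {
        "GENE_A": [5, 0, 2, 9],
        "Celltype": ["alpha cell", "beta cell", "alpha cell", "beta cell"],
        "Batch": ["H1", "H1", "H2", "H3"],
    }
)
toy_bulk = pd.DataFrame({"H1": [100], "H2": [250], "H3": [75]}, index=["GENE_A"])

fold_reference, fold_bulk = leave_one_out(toy_reference, toy_bulk, "H2")
print(fold_reference.shape, list(fold_bulk.columns))
```

Calling the helper once per batch name would yield the full set of leave-one-out folds.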


### Running our Pipeline

GTMDecon.pipeline arguments:
- **bulk_data** : pd.DataFrame
- **reference_data** : pd.DataFrame
- **directory** : str
    - directory where we want to save the model parameters and inferred cell-type proportions
    - we expect the inferred proportions to end up here: **/tutorial_results/gatheredResults.csv**


We create a directory to store the results for this vignette.

In [53]:
!mkdir tutorial_results
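
Re-running the `mkdir` cell above reports an error once the directory exists. As an alternative, the directory can be created from Python in a way that is safe to repeat. The `ensure_results_dir` helper below is a hypothetical convenience function, not part of the wrapper, and it is demonstrated in a temporary directory rather than in the vignette folder.

```python
import os
import tempfile


def ensure_results_dir(base_dir, name="tutorial_results"):
    """Hypothetical helper: create base_dir/name if it does not exist yet."""
    path = os.path.join(base_dir, name)
    os.makedirs(path, exist_ok=True)  # no error if the directory already exists
    return path


# Demonstrated in a temporary directory; in the notebook, base_dir would be os.getcwd().
with tempfile.TemporaryDirectory() as tmp:
    first = ensure_results_dir(tmp)
    second = ensure_results_dir(tmp)  # safe to call a second time
    created = os.path.isdir(first)
    same_path = first == second

print(created, same_path)
```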

Here we run our pipeline, including processing data to GTM-decon format, training, and cell-type proportion inference.

To suppress the print statements, set GTM.verbose = False before calling the pipeline.

In [58]:
GTM.pipeline(
    bulk_data = bulk_df,
    reference_data = reference_df,
    directory = os.path.join(os.getcwd(), 'tutorial_results'),
)

Running GTM Deconvolution Pipeline
Writing results to /home/mcb/users/zhuang35/projects/gtm-decon/vignettes/tutorial_results
**********************************

Saving genes file to /home/mcb/users/zhuang35/projects/gtm-decon/vignettes/tutorial_results/genes.txt ...
Successfully wrote genes file to /home/mcb/users/zhuang35/projects/gtm-decon/vignettes/tutorial_results/genes.txt
Saving meta file to /home/mcb/users/zhuang35/projects/gtm-decon/vignettes/tutorial_results/meta.txt ...
Successfully wrote meta file to /home/mcb/users/zhuang35/projects/gtm-decon/vignettes/tutorial_results/meta.txt
Saving training file to /home/mcb/users/zhuang35/projects/gtm-decon/vignettes/tutorial_results/trainData.txt ...
Successfully wrote training file to /home/mcb/users/zhuang35/projects/gtm-decon/vignettes/tutorial_results/trainData.txt
Saving prior file to /home/mcb/users/zhuang35/projects/gtm-decon/vignettes/tutorial_results/priorData.txt ...
Successfully wrote prior file to /home/mcb/users/zhuang35/p



Successfully wrote bulk file to /home/mcb/users/zhuang35/projects/gtm-decon/vignettes/tutorial_results/bulkData.txt
Saving attributes dict_keys(['experiment_name', 'n_topics', 'engine_path', 'genes', 'celltypes', 'bulk_samples', 'verbose', 'output_intermediates', 'override_geneset']) to path: /home/mcb/users/zhuang35/projects/gtm-decon/vignettes/tutorial_results/GTMParameters.json
Successfully saved GTMWrapper parameters to /home/mcb/users/zhuang35/projects/gtm-decon/vignettes/tutorial_results/GTMParameters.json
Running Inference Engine ...
--------------------
Input arguments: 
trainDataFile: /home/mcb/users/zhuang35/projects/gtm-decon/vignettes/tutorial_results/trainData.txt
testDataFile: 
trainPriorFile: /home/mcb/users/zhuang35/projects/gtm-decon/vignettes/tutorial_results/priorData.txt
testPriorFile: 
metaFile: /home/mcb/users/zhuang35/projects/gtm-decon/vignettes/tutorial_results/meta.txt
numTopics#: 14
iter#: 5
convergenceThreshold: 1e-07
inference method: JCVB0
NMAR inference e

Upon completion, the predicted proportions should be available in **/tutorial_results/gatheredResults.csv**.

This file contains the inferred cell-type proportions of our bulk data given the provided reference data. The sample names are the index and the celltypes are the columns of this file.

In [59]:
predicted_props = pd.read_csv("../vignettes/tutorial_results/gatheredResults.csv", index_col=0)

In [60]:
predicted_props.head()

Unnamed: 0,MHC class II cell,PSC cell,acinar cell,alpha cell,beta cell,co-expression cell,delta cell,ductal cell,endothelial cell,epsilon cell,gamma cell,mast cell,unclassified cell,unclassified endocrine cell
H2,0.007957,0.014288,0.15615,0.042907,0.330283,0.066652,0.074174,0.118783,0.011726,0.029721,0.043195,0.029012,0.035782,0.039373
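
As a quick sanity check on the output above, we can confirm that the inferred proportions for a sample sum to approximately one and rank the cell types by proportion. The sketch below rebuilds the H2 row by hand from the values printed above so that it runs on its own; in the notebook, **predicted_props** could be used directly. The tolerance of 1e-3 is an arbitrary choice to absorb rounding in the printed values.

```python
import pandas as pd

# Values copied from the gatheredResults.csv output shown above.
props = pd.DataFrame(
    [[0.007957, 0.014288, 0.15615, 0.042907, 0.330283, 0.066652, 0.074174,
      0.118783, 0.011726, 0.029721, 0.043195, 0.029012, 0.035782, 0.039373]],
    index=["H2"],
    columns=[
        "MHC class II cell", "PSC cell", "acinar cell", "alpha cell",
        "beta cell", "co-expression cell", "delta cell", "ductal cell",
        "endothelial cell", "epsilon cell", "gamma cell", "mast cell",
        "unclassified cell", "unclassified endocrine cell",
    ],
)

# Each row (sample) should sum to roughly 1, up to rounding.
row_sums = props.sum(axis=1)
sums_to_one = bool(((row_sums - 1.0).abs() < 1e-3).all())

# Rank the cell types of sample H2 from most to least abundant.
ranked = props.loc["H2"].sort_values(ascending=False)
top_celltype = ranked.index[0]

print(sums_to_one)
print(ranked.head(3))
```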


In [69]:
# C make