# Tutorial: using GRAPE with the GSE10245 dataset

#### Contributors: Antonio Mora, Chengshu Xie
#### Date of first version: 2018-11-20
#### Date of last review: 2020-05-26 
#### Summary:

This tutorial shows how to use the R package `GRAPE`. The example data, [part of GSE10245](https://github.com/mora-lab/benchmarks/blob/master/single-sample/workflows/data/GSE10245.RDS), is available on GitHub. The dataset is [a microarray dataset on non-small cell lung cancer (NSCLC) in the GEO database](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE10245).

#### Contents:
* [1. Data Preparation](#link1)    <br>
    * [1.1 Prerequisites](#link2)     <br>
    * [1.2 Import data](#link3)   <br>
* [2. Method application](#link4)  <br>
    * [2.1 Calculate pathway scores using `makeGRAPE_psMat()`](#link5)     <br>
    * [2.2 `getPathwayScores()` to get pathway scores of new samples](#link6)   <br>
    * [2.3 `makeBinaryTemplateAndProbabilityTemplate()`](#link7)     <br>
    * [2.4 GRAPE Classification](#link8)   <br>
    
    
    
## <a id=link1>1. Data Preparation</a>

### <a id=link2>1.1 Prerequisites</a>

The R package `GRAPE` needs to be installed and loaded in the R session. This can be done with the following chunk of code:

In [1]:
install.packages("GRAPE")
suppressPackageStartupMessages(library(GRAPE))

### <a id=link3>1.2 Import data</a>

The main functions in `GRAPE` require expression data, phenotype data, and reference pathways (gene sets). <br>
All the data can be obtained from [GitHub](https://github.com/mora-lab/benchmarks/blob/master/single-sample/workflows/data). For the reference pathways, download the `.GMT` file from [GSEA|MSigDB](http://software.broadinstitute.org/gsea/msigdb/download_file.jsp?filePath=/resources/msigdb/7.0/c2.cp.kegg.v7.0.symbols.gmt), which contains KEGG pathway information and can be read into R via [read_gmt()](https://github.com/mora-lab/benchmarks/blob/master/single-sample/R/read_gmt.R). Download these files before loading the example data into the R session. <br>
Then use the following R code to load them:

In [2]:
# Not run as written: download the example files first, then replace each
# path below with the local path of the corresponding downloaded file.
GSE10245 = readRDS("git@github.com:mora-lab/benchmarks/tree/master/single-sample/workflows/data/GSE10245.RDS")

source("git@github.com:mora-lab/benchmarks/blob/master/single-sample/R/read_gmt.R")
pathwaylist = read_gmt("git@github.com:mora-lab/benchmarks/tree/master/single-sample/data/example_pathway.gmt")
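
For readers who want to see what reading a `.GMT` file involves, the sketch below parses the format with base R only. It assumes the usual GMT layout (one gene set per line, tab-separated: set name, description, then gene symbols). The helper `parse_gmt_sketch()` and the two toy gene sets are hypothetical and are not taken from the repository; the actual `read_gmt()` may behave differently.

```r
# Hypothetical helper (not the repository's read_gmt()): parse a GMT file
# into a named list of character vectors, one vector of genes per gene set.
parse_gmt_sketch <- function(path) {
  lines <- readLines(path)
  lines <- lines[nzchar(lines)]                    # drop empty lines
  fields <- strsplit(lines, "\t", fixed = TRUE)    # split each line on tabs
  gene_sets <- lapply(fields, function(f) f[-(1:2)])  # drop name + description
  names(gene_sets) <- vapply(fields, function(f) f[1], character(1))
  gene_sets
}

# Toy GMT file with two made-up gene sets, written to a temporary file
tmp <- tempfile(fileext = ".gmt")
writeLines(c("SET_A\tdescription A\tEGF\tTGFA\tEGFR",
             "SET_B\tdescription B\tATM\tCHEK2"), tmp)

toy_pathways <- parse_gmt_sketch(tmp)
str(toy_pathways)
```

The result has the same shape as the `pathwaylist` object used in this tutorial: a named list in which each element is a character vector of gene symbols.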

To see the contents of the `GSE10245` object and the `pathwaylist` object, use the commands:

In [3]:
head(GSE10245)

,GSM258551,GSM258552,GSM258553,GSM258554,GSM258555,GSM258556,GSM258557,GSM258558,GSM258559,GSM258560,...,GSM258599,GSM258600,GSM258601,GSM258602,GSM258603,GSM258604,GSM258605,GSM258606,GSM258607,GSM258608
,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,...,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
normal,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,...,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0
COX1,15.31219,15.33513,15.14794,15.23526,15.40252,15.23544,15.26664,15.23615,15.53066,15.3233,...,15.35491,15.49565,15.39272,15.48079,15.37211,15.47678,15.1251,14.82045,15.40969,15.14544
EEF1A1,15.1805,15.18023,15.26209,15.24665,15.27287,15.18183,15.1627,15.12682,15.31581,15.40652,...,15.16511,15.18418,15.13071,15.19139,15.17459,15.1924,15.16712,15.29834,15.11026,15.08846
IGK,15.39448,15.37585,15.25943,15.25016,11.99687,15.05328,15.4507,15.39105,14.65891,14.70683,...,15.17709,15.37075,14.98571,14.7719,15.04847,15.2363,14.67816,15.12034,15.34982,15.37227
TMSB4X,15.16448,15.33779,14.88632,14.72336,14.73436,14.75568,15.09072,14.74747,15.3573,14.56345,...,14.86133,15.18403,14.71638,14.88559,14.95012,14.88124,15.2363,14.60038,15.15858,15.05968
ATP6,14.76594,14.93945,14.53067,14.95695,14.96051,14.76712,14.64807,14.84704,14.89766,15.09347,...,14.96365,15.08632,14.99458,15.07931,14.95273,15.04735,14.12465,14.31763,14.95427,14.73761


In [4]:
str(pathwaylist)

List of 8
 $ KEGG_PI3K-AKT_signaling_pathway: chr [1:354] "EGF" "TGFA" "EREG" "AREG" ...
 $ KEGG_MAPK_signaling_pathway    : chr [1:295] "CACNA1A" "CACNA1B" "CACNA1C" "CACNA1D" ...
 $ KEGG_RAS_signaling_pathway     : chr [1:232] "EGF" "TGFA" "FGF1" "FGF2" ...
 $ KEGG_calcium_signaling_pathway : chr [1:193] "SLC8A1" "SLC8A2" "SLC8A3" "ATP2B1" ...
 $ KEGG_cell_cycle                : chr [1:124] "CCND1" "CCND2" "CCND3" "CDK4" ...
 $ KEGG_Erbb_signaling_pathway    : chr [1:85] "EGF" "TGFA" "AREG" "EGFR" ...
 $ KEGG_P53_signaling_pathway     : chr [1:72] "ATM" "CHEK2" "ATR" "CHEK1" ...
 $ KEGG_NON_small_cell_lung_cancer: chr [1:66] "FHIT" "RARB" "RXRA" "RXRB" ...




Next, split the data into the input format that `GRAPE` expects: a reference data matrix and a tumor data matrix.

In [5]:
# Row 1 ("normal") holds the phenotype labels (1 = normal, 0 = tumor);
# the remaining rows hold the gene expression values.
refdata = as.matrix(GSE10245[-1,][,which(GSE10245[1,] == 1)])   # reference (normal) samples
tumordata = as.matrix(GSE10245[-1,][,which(GSE10245[1,] == 0)]) # tumor samples
alldata = as.matrix(GSE10245[-1,])                              # all samples
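
The same split can be tried on a small made-up table before running it on the real data. This is only an illustrative sketch: the object `toy`, its sample names, and its values are invented, and it assumes the same layout as `GSE10245` (a first row named `normal` holding 0/1 labels, followed by one row per gene).

```r
# Toy data frame: the first row "normal" holds labels, the other rows are genes
toy <- data.frame(
  S1 = c(1, 5.1, 7.2),
  S2 = c(0, 6.3, 7.0),
  S3 = c(1, 5.4, 6.8),
  S4 = c(0, 6.9, 7.5),
  row.names = c("normal", "geneA", "geneB")
)

is_normal <- toy["normal", ] == 1   # logical flag per sample

# drop = FALSE keeps the matrix shape even if only one sample is selected
toy_ref   <- as.matrix(toy[-1, which(is_normal), drop = FALSE])
toy_tumor <- as.matrix(toy[-1, which(!is_normal), drop = FALSE])

colnames(toy_ref)    # "S1" "S3"
colnames(toy_tumor)  # "S2" "S4"
```

Using `drop = FALSE` is a defensive choice for small subsets; the tutorial code does not need it because both groups contain many samples.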

## <a id=link4>2. Method application</a>

#### <a id=link5>2.1 Calculate pathway scores using  `makeGRAPE_psMat()`</a>

`makeGRAPE_psMat()` calculates the pathway space matrix, which represents every sample as a vector of pathway scores relative to the reference samples.

INPUT: a reference expression matrix and a matrix of new samples, where columns are samples and rows are genes, plus a pathway list (a named list of gene sets);<br>
OUTPUT: pathway score matrix.<br>

In [6]:
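# Quadratic weight function: 0 at x = 0.5, rising to 1 at x = 0 and x = 1
w_quad = function(x){return(4*abs(x-0.5)^2)}
psmat = makeGRAPE_psMat(refge = refdata, newge = alldata, pathway_list = pathwaylist, w = w_quad)
colnames(psmat) = colnames(alldata) # label the columns with the sample IDs
head(psmat)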

,GSM258551,GSM258552,GSM258553,GSM258554,GSM258555,GSM258556,GSM258557,GSM258558,GSM258559,GSM258560,...,GSM258599,GSM258600,GSM258601,GSM258602,GSM258603,GSM258604,GSM258605,GSM258606,GSM258607,GSM258608
KEGG_PI3K-AKT_signaling_pathway,0.0,0,2.652522,0,2.5970198,0.8167896,0.0,0.0,0.03967546,0.07098296,...,0.5537931,0.0,0.0,0.05857567,0.8350235,0.8830876,1.0294441,0,0.7532087,0
KEGG_MAPK_signaling_pathway,0.0,0,2.110329,0,2.5574811,1.901566,0.0,0.0,0.0,0.51136397,...,1.3824401,0.0,0.363308,0.21030052,0.6464503,1.8132251,1.5601387,0,0.5098099,0
KEGG_RAS_signaling_pathway,0.0,0,3.373391,0,2.9036664,2.1056241,0.0,0.0,0.0,0.0,...,1.585649,0.2002694,0.0,0.16999962,1.7045144,1.7318463,1.4178284,0,0.5216111,0
KEGG_calcium_signaling_pathway,0.3846309,0,4.69687,0,4.5368333,1.5387432,0.1999915,0.1387785,0.0,0.85390543,...,2.2186262,0.7313576,0.0,0.0,1.2932813,0.0,0.4520188,0,1.3653798,0
KEGG_cell_cycle,0.0,0,2.796953,0,3.9962786,2.3463046,0.0,0.7320846,0.08363554,0.31500142,...,2.6236984,0.7060789,2.0642699,0.0,0.3970463,0.4952607,2.6010493,0,1.1843966,0
KEGG_Erbb_signaling_pathway,0.0,0,1.441405,0,0.8545216,1.927097,0.0,0.0,0.7093163,0.70123416,...,0.6020642,0.0,0.8050459,0.95188947,0.23673,0.0,0.5216252,0,0.0,0
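
To make the role of the weight function `w_quad` more concrete, the sketch below evaluates it on a few values. It assumes that the argument is a probability between 0 and 1, such as an entry of the probability template described in section 2.3; how `GRAPE` applies the weights internally is not shown here.

```r
# Same definition as in the tutorial cell above
w_quad <- function(x) 4 * abs(x - 0.5)^2

# Evaluate the weight at a few probabilities
p <- c(0, 0.25, 0.5, 0.75, 1)
data.frame(p = p, weight = w_quad(p))
```

Under that assumption, values near 0.5 receive a weight close to 0, while values near 0 or 1 receive a weight close to 1.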


#### <a id=link6>2.2 `getPathwayScores()` to get pathway scores of new samples</a>

`getPathwayScores()` calculates the scores of a single pathway for a set of samples, relative to a reference set of samples.

INPUT: matrix, where columns are samples and rows are pathway genes;<br>
OUTPUT: scores of a single pathway.<br>

In [7]:
# Note: this step can take a long time when there are many samples.
# For illustration, the first 10 genes are treated as a single pathway.
ps_new = getPathwayScores(refmat = refdata[1:10,], newmat = tumordata[1:10,]) # pathway scores of the tumor samples
ps_ref = getPathwayScores(refmat = refdata[1:10,], newmat = refdata[1:10,])   # pathway scores of the reference samples
ps_both = getPathwayScores(refmat = refdata[1:10,], newmat = alldata[1:10,])  # pathway scores of all samples

In [8]:
ps_new
ps_ref
ps_both

#### <a id=link7>2.3 `makeBinaryTemplateAndProbabilityTemplate()`</a> 

INPUT: matrix, where columns are samples and rows are pathway genes;<br>
OUTPUT: a list containing the binary template vector and the probability template vector.<br>

Each template entry corresponds to one gene pair, such as `COX1 < EEF1A1` in the output below.

In [9]:
temp = makeBinaryTemplateAndProbabilityTemplate(refdata[1:10,1:5])
bt = temp$binary_template
pt = temp$probability_template
cbind(bt,pt)

,bt,pt
COX1 < EEF1A1,0,0.2
COX1 < IGK,1,0.8
COX1 < TMSB4X,0,0.2
COX1 < ATP6,0,0.0
COX1 < RPLP1,0,0.0
COX1 < RPL37,0,0.0
COX1 < RPL37A,0,0.0
COX1 < IGLC1,1,0.6
COX1 < RPL41,0,0.0
EEF1A1 < IGK,1,0.8
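
The output above suggests a simple reading of the two templates: with five reference samples, each `pt` value is a multiple of 0.2, and `bt` is 1 exactly where `pt` exceeds 0.5. The base-R sketch below reproduces that pattern on made-up data. It is an interpretation inferred from the output, not the package's code: the helper `pairwise_templates_sketch()` is hypothetical, and details such as the handling of ties or of `pt` exactly equal to 0.5 are assumptions.

```r
# Hypothetical helper, not part of GRAPE: for every gene pair (i, j) with i < j,
# pt = fraction of reference samples in which gene i is expressed below gene j,
# bt = 1 if that fraction is above 0.5 (assumed threshold), otherwise 0.
pairwise_templates_sketch <- function(refmat) {
  genes <- rownames(refmat)
  pairs <- t(combn(length(genes), 2))   # one row per pair, in order (1,2), (1,3), ...
  pt <- apply(pairs, 1, function(ij) mean(refmat[ij[1], ] < refmat[ij[2], ]))
  names(pt) <- paste(genes[pairs[, 1]], "<", genes[pairs[, 2]])
  bt <- as.integer(pt > 0.5)
  names(bt) <- names(pt)
  list(binary_template = bt, probability_template = pt)
}

# Toy reference matrix: 3 genes (rows) x 5 samples (columns), invented values
toy_ref <- matrix(c(1, 2, 3, 4, 5,    # gA
                    2, 3, 1, 5, 6,    # gB
                    0, 9, 0, 0, 0),   # gC
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("gA", "gB", "gC"), paste0("S", 1:5)))

tmpl <- pairwise_templates_sketch(toy_ref)
cbind(bt = tmpl$binary_template, pt = tmpl$probability_template)
```

In this toy case `gA < gB` holds in four of the five samples, so its `pt` is 0.8 and its `bt` is 1, while the two pairs involving `gC` hold in only one sample each, giving `pt` 0.2 and `bt` 0.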


#### <a id=link8>2.4 GRAPE Classification</a> 


Use the `predictClassGRAPE()` function to get predicted class labels for a test set.<br>
It classifies samples according to their GRAPE distances from templates, and is usually applied to the gene expression values of a single pathway.

INPUT: matrices of gene expression for a set of genes across the training and test samples, where each column is a sample, plus a vector of class labels for the samples in the training set.<br>
OUTPUT: predicted class labels for the test set.


In [10]:
# Toy example of two classes
set.seed(10)
path_genes = c("gA","gB","gC","gD"); nsamps = 50 # Four genes, 50 samples per class
class_one_samps = matrix(NA,nrow=length(path_genes),ncol=nsamps) # Class 1
rownames(class_one_samps) = path_genes
class_one_samps[1,] = rnorm(ncol(class_one_samps),4,2)
class_one_samps[2,] = rnorm(ncol(class_one_samps),5,4)
class_one_samps[3,] = rnorm(ncol(class_one_samps),1,1)
class_one_samps[4,] = rnorm(ncol(class_one_samps),2,1)
class_two_samps = matrix(NA,nrow=length(path_genes),ncol=nsamps) # Class 2
rownames(class_two_samps) = path_genes
class_two_samps[1,] = rnorm(ncol(class_two_samps),2,3)
class_two_samps[2,] = rnorm(ncol(class_two_samps),5,2)
class_two_samps[3,] = rnorm(ncol(class_two_samps),1,1)
class_two_samps[4,] = rnorm(ncol(class_two_samps),0,1)
all_samps = cbind(class_one_samps,class_two_samps)
labels = c(rep(1,nsamps),rep(2,nsamps))
testid = sample.int(100,20)
trainmat = all_samps[,-testid]
head(trainmat)
train_labels = labels[-testid]
testmat = all_samps[,testid]
head(testmat)
test_labels = labels[testid]
train_labels

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21
gA,4.0374923,3.631495,1.2573389,4.58909,4.779589,1.583848,3.272648,0.7466546,6.20355901,5.511563,...,-0.5220323,-4.6164153,-1.384168,-2.0239299,6.8153421,4.233271,1.63234962,2.2777539,0.9210343,5.08571216
gB,3.3974498,3.661774,10.47181581,7.023277,8.14537,1.391152,7.131588,2.416423,0.04962212,3.175295,...,5.1463916,4.4853508,5.533613,7.7754487,5.3861593,6.184633,7.10208933,7.311595,4.4910639,7.54736853
gC,0.2381957,1.419375,-0.03994336,0.366787,1.563175,1.660987,-0.6580509,2.028168,-0.2801546,2.128868,...,0.9475872,1.0961239,1.266407,1.5547935,2.2354459,1.285133,2.6194187,0.3833266,3.1145213,-0.02856737
gD,1.6088958,1.750133,3.15510475,1.133322,-0.321017,2.60883,3.150006,0.8004023,2.65316619,1.450592,...,-0.6828386,-0.6129412,1.89078,-0.1454105,-0.6394237,-0.411176,0.03799977,0.9210647,-0.5998269,1.24444646


0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20
gA,5.9371327,1.1635295,3.5273561,3.796478,0.705499,3.5235329,4.5862466,2.696874,3.1312182,2.064696,5.7851452,3.4870432,6.1731028,2.8073787,3.185467,2.801665,1.0771787,1.1515351,0.9263603,5.646541
gB,7.8515771,5.6550203,5.7697352,0.7934454,0.3402409,1.6787094,3.3400509,4.704177,5.082777,7.3798294,5.9554809,6.16395,3.3345813,6.5236889,5.785147,13.551068,5.4454709,4.2882951,2.9311241,7.155785
gC,-0.9968156,1.4807212,0.2760049,0.6360178,3.3929129,0.5358655,0.5412071,1.954786,0.8693548,1.3724723,2.3167653,2.1279536,-0.6753322,0.5186344,1.623478,1.711574,0.8307244,1.9579449,1.3845411,2.502545
gD,0.6396939,-0.5280637,0.0660703,2.9598291,-1.3996571,2.5210545,0.3427919,2.046361,-1.4497605,0.5313023,0.7264094,0.4199992,2.1130294,0.8626001,-1.133247,1.135273,-0.4291438,0.4003617,2.1349656,-0.886788


In [11]:
yhat = predictClassGRAPE(trainmat,testmat,train_labels,w_quad)
yhat
sum(diag(table(test_labels,yhat)))/length(test_labels) # accuracy
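
The accuracy line above relies on the confusion table being square with the classes in the same order on both axes. If a classifier never predicts one of the classes, `table()` returns a non-square table and `diag()` no longer picks the correct cells. The sketch below shows two more defensive ways to compute the same quantity; the vectors `truth` and `pred` are made-up stand-ins for `test_labels` and `yhat`.

```r
# Made-up labels standing in for test_labels and yhat
truth <- c(1, 1, 2, 2, 2)
pred  <- c(1, 2, 2, 2, 2)

# Pin both factors to the same levels so the table is always square
lv <- sort(unique(c(truth, pred)))
conf_mat <- table(truth = factor(truth, levels = lv),
                  pred  = factor(pred,  levels = lv))
conf_mat

# Accuracy from the diagonal of the square confusion table
acc_table <- sum(diag(conf_mat)) / length(truth)

# Equivalent and simpler: share of predictions that match the truth
acc_direct <- mean(pred == truth)

c(acc_table = acc_table, acc_direct = acc_direct)
```

Both values are 0.8 here, because four of the five predictions match the true labels.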

In [12]:
sessionInfo()

R version 3.6.3 (2020-02-29)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=Chinese (Simplified)_China.936 
[2] LC_CTYPE=Chinese (Simplified)_China.936   
[3] LC_MONETARY=Chinese (Simplified)_China.936
[4] LC_NUMERIC=C                              
[5] LC_TIME=Chinese (Simplified)_China.936    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] GRAPE_0.1.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.6    digest_0.6.25   crayon_1.3.4    IRdisplay_0.7.0
 [5] repr_1.1.0      jsonlite_1.6.1  evaluate_0.14   pillar_1.4.4   
 [9] rlang_0.4.5     uuid_0.1-4      vctrs_0.2.4     IRkernel_1.1   
[13] tools_3.6.3     compiler_3.6.3  base64enc_0.1-3 htmltools_0.4.0
[17] pbdZMQ_0.3-3   