## GRAPE tutorial using data GSE10245

This example uses two sample groups ("NSCLC_AC" and "NSCLC_SCC") from [GSE10245](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE10245).

## 1. Install and load packages

In [1]:
### install.packages("GRAPE")   ### uncomment to install GRAPE if it is missing
library(GRAPE)

## 2. Import Data 

You can use the `get_input_data()` function to obtain the data, following the steps in the [workflow](https://nbviewer.jupyter.org/github/Chengshu21/GSE36221/blob/master/Workflow/GSE36221_analysis%281%29.ipynb).

### 2.1 Expression Data

In [2]:
### Read data into R.
### readRDS() needs a local file path or a connection, so download GSE10245.RDS from
### mora-lab/benchmarks (single-sample/workflows/data/) and adjust the path below.
GSE10245 <- readRDS("GSE10245.RDS")

GSE10245_exp <- GSE10245$exprdata
GSE10245_p <- GSE10245$pdata

In [3]:
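### reference samples: adenocarcinoma (NSCLC_AC)
refdata = GSE10245_exp[,which(colnames(GSE10245_exp) %in% rownames(GSE10245_p[GSE10245_p$subtype == "NSCLC_AC",]))]
### new samples: squamous cell carcinoma (NSCLC_SCC)
tumordata = GSE10245_exp[,which(colnames(GSE10245_exp) %in% rownames(GSE10245_p[GSE10245_p$subtype == "NSCLC_SCC",]))]
### all samples from both groups
alldata = GSE10245_exp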
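
The subsetting above relies on the column names of the expression matrix matching the row names of the phenotype table. A minimal sketch on toy data (all object names and values below are made up for illustration) performs the same split and checks that the two groups do not overlap:

```r
# Toy expression matrix: 3 genes x 4 samples (values are arbitrary)
toy_exp <- matrix(1:12, nrow = 3,
                  dimnames = list(c("g1", "g2", "g3"),
                                  c("S1", "S2", "S3", "S4")))

# Toy phenotype table, one row per sample, row names = sample IDs
toy_p <- data.frame(subtype = c("NSCLC_AC", "NSCLC_SCC", "NSCLC_AC", "NSCLC_SCC"),
                    row.names = c("S1", "S2", "S3", "S4"),
                    stringsAsFactors = FALSE)

# Sample IDs per group, taken from the phenotype table
ac_ids  <- rownames(toy_p)[toy_p$subtype == "NSCLC_AC"]
scc_ids <- rownames(toy_p)[toy_p$subtype == "NSCLC_SCC"]

# drop = FALSE keeps a matrix even if a group has a single sample
toy_ref   <- toy_exp[, colnames(toy_exp) %in% ac_ids,  drop = FALSE]
toy_tumor <- toy_exp[, colnames(toy_exp) %in% scc_ids, drop = FALSE]

# Sanity checks: every sample is annotated, and the groups are disjoint
stopifnot(all(colnames(toy_exp) %in% rownames(toy_p)))
stopifnot(length(intersect(colnames(toy_ref), colnames(toy_tumor))) == 0)
dim(toy_ref)
```

Running the two `stopifnot()` checks on the real `GSE10245_exp` and `GSE10245_p` objects before scoring is a cheap way to catch mismatched sample IDs.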

### 2.2 Pathway list

Download the pathway file from [GSEA | MSigDB](http://software.broadinstitute.org/gsea/msigdb/download_file.jsp?filePath=/resources/msigdb/7.0/c2.cp.kegg.v7.0.symbols.gmt) (make sure you are logged in).

In [4]:
read_gmt = function(file){
  if(!grepl("\\.gmt$",file)[1]){stop("Pathway information must be a .gmt file")}
  geneSetDB = readLines(file)                                ##read in the gmt file as a vector of lines
  geneSetDB = strsplit(geneSetDB,"\t")                       ##convert from vector of strings to a list
  names(geneSetDB) = sapply(geneSetDB,"[",1)                 ##move the names column as the names of the list
  geneSetDB = lapply(geneSetDB, "[",-1:-2)                   ##remove name and description columns
  geneSetDB = lapply(geneSetDB, function(x){x[which(x!="")]})##remove empty strings
  return(geneSetDB)
}
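
To check the parsing logic without downloading anything, the sketch below writes two made-up gene sets to a temporary `.gmt` file and parses them with the same steps that `read_gmt()` uses. The set names and gene lists are illustrative only.

```r
# Two made-up gene sets in GMT layout: name <TAB> description <TAB> genes...
gmt_lines <- c("TOY_SET_A\tna\tEGFR\tKRAS\tTP53",
               "TOY_SET_B\tna\tCCND1\tCDK4")
gmt_file <- tempfile(fileext = ".gmt")
writeLines(gmt_lines, gmt_file)

# Same parsing steps as read_gmt(): split each line on tabs,
# name each set by its first field, then drop name and description
fields <- strsplit(readLines(gmt_file), "\t")
toy_sets <- lapply(fields, function(x) x[-(1:2)])
names(toy_sets) <- vapply(fields, `[`, character(1), 1)

str(toy_sets)
unlink(gmt_file)  # remove the temporary file
```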

In [5]:
### read_gmt() reads a local file: download NSCLC.target.pathway.symbols.gmt from
### mora-lab/benchmarks (single-sample/data/) and adjust the path below.
pathwaylist = read_gmt("NSCLC.target.pathway.symbols.gmt")
str(pathwaylist)

List of 8
 $ KEGG_MAPK_signaling_pathway    : chr [1:295] "CACNA1A" "CACNA1B" "CACNA1C" "CACNA1D" ...
 $ KEGG_Erbb_signaling_pathway    : chr [1:85] "EGF" "TGFA" "AREG" "EGFR" ...
 $ KEGG_cell_cycle                : chr [1:124] "CCND1" "CCND2" "CCND3" "CDK4" ...
 $ KEGG_P53_signaling_pathway     : chr [1:72] "ATM" "CHEK2" "ATR" "CHEK1" ...
 $ KEGG_PI3K-AKT_signaling_pathway: chr [1:354] "EGF" "TGFA" "EREG" "AREG" ...
 $ KEGG_RAS_signaling_pathway     : chr [1:232] "EGF" "TGFA" "FGF1" "FGF2" ...
 $ KEGG_calcium_signaling_pathway : chr [1:193] "SLC8A1" "SLC8A2" "SLC8A3" "ATP2B1" ...
 $ KEGG_NON_small_cell_lung_cancer: chr [1:66] "FHIT" "RARB" "RXRA" "RXRB" ...
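
Before scoring, it can help to see how many genes of each pathway are actually present in the expression matrix. The sketch below uses a toy pathway list and a toy matrix (all names are illustrative). Whether GRAPE filters missing genes internally is not covered here, so treat this as an optional pre-check.

```r
# Toy pathway list and toy expression matrix (names are illustrative only)
toy_pathways <- list(PATH_A = c("EGFR", "KRAS", "TP53"),
                     PATH_B = c("CCND1", "CDK4", "MISSING1"))
toy_exp <- matrix(rnorm(8), nrow = 4,
                  dimnames = list(c("EGFR", "KRAS", "CCND1", "CDK4"),
                                  c("S1", "S2")))

# Fraction of each pathway's genes that have a row in the expression matrix
coverage <- vapply(toy_pathways,
                   function(genes) mean(genes %in% rownames(toy_exp)),
                   numeric(1))
coverage

# Restrict every pathway to the genes that were measured
toy_filtered <- lapply(toy_pathways, intersect, rownames(toy_exp))
toy_filtered
```

On the real data, the same two calls would take `pathwaylist` and `rownames(GSE10245_exp)` in place of the toy objects.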


## 3. GRAPE application

### 3.1 makeGRAPE_psMat() to calculate pathway scores 

`makeGRAPE_psMat()` represents each new sample as a vector of pathway scores computed relative to the reference samples.<br>

Arguments:<br>

**refge**: Gene expression matrix of reference samples. Rows are genes, columns are samples.<br>

**newge**: Gene expression matrix of new samples. Rows are genes, columns are samples.<br>

**pathway_list**: List of pathways. Each pathway is a character vector consisting of gene names.<br>

**w**: Weight function. The default is the quadratic weight function (`w_quad`).

In [6]:
psmat = makeGRAPE_psMat(refdata,alldata, pathwaylist)
colnames(psmat) = colnames(alldata)
head(psmat)

| Pathway | GSM258551 | GSM258552 | GSM258553 | GSM258554 | GSM258555 | GSM258556 | GSM258557 | GSM258558 | GSM258559 | GSM258560 | ... | GSM258599 | GSM258600 | GSM258601 | GSM258602 | GSM258603 | GSM258604 | GSM258605 | GSM258606 | GSM258607 | GSM258608 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| KEGG_MAPK_signaling_pathway | 0 | 0.01116157 | 2.022845 | 0 | 2.387516 | 1.7095345 | 0.0 | 0.0 | 0.0 | 0.5682507 | ... | 1.3347691 | 0.0 | 0.3171894 | 0.13541091 | 0.6519819 | 1.6654334 | 1.3298231 | 0 | 0.5293371 | 0 |
| KEGG_Erbb_signaling_pathway | 0 | 0.0 | 2.575049 | 0 | 1.229818 | 1.959601 | 0.0 | 0.0 | 0.7188533 | 0.718632 | ... | 0.6168597 | 0.0 | 0.815811 | 0.96910583 | 0.2578619 | 0.0 | 0.5352869 | 0 | 0.0 | 0 |
| KEGG_cell_cycle | 0 | 0.0 | 2.882081 | 0 | 4.097809 | 2.4185404 | 0.0 | 0.7990439 | 0.1512868 | 0.3601488 | ... | 2.7190356 | 0.7884731 | 2.1593188 | 0.0 | 0.4353913 | 0.572663 | 2.6874123 | 0 | 1.2535576 | 0 |
| KEGG_P53_signaling_pathway | 0 | 0.0 | 2.834376 | 0 | 2.268225 | 0.7840213 | 0.4402001 | 0.231085 | 0.0 | 0.0 | ... | 1.8063761 | 0.4010712 | 0.3462908 | 0.0 | 0.1013823 | 0.692357 | 1.816896 | 0 | 0.0 | 0 |
| KEGG_PI3K-AKT_signaling_pathway | 0 | 0.0 | 2.572929 | 0 | 2.382305 | 0.7054254 | 0.0 | 0.0 | 0.1491285 | 0.2397564 | ... | 0.6741338 | 0.0 | 0.0 | 0.14138109 | 0.937963 | 0.9543236 | 1.043099 | 0 | 0.9039479 | 0 |
| KEGG_RAS_signaling_pathway | 0 | 0.0 | 4.136042 | 0 | 3.242856 | 2.0247114 | 0.0 | 0.028948 | 0.0 | 0.0 | ... | 1.7971396 | 0.1980065 | 0.0 | 0.07012191 | 1.7917947 | 1.7363397 | 1.1713943 | 0 | 0.5183494 | 0 |


### 3.2 getPathwayScores() to get pathway scores of new samples 

In [7]:
### Note: this can take a long time when there are many samples.
### Rows 1:10 (the first ten genes) serve as a small example gene set.
ps_new <- getPathwayScores(refdata[1:10,],tumordata[1:10,]) ### pathway scores of the tumor samples
ps_ref <- getPathwayScores(refdata[1:10,],refdata[1:10,]) ### pathway scores of the reference samples
ps_both <- getPathwayScores(refdata[1:10,],alldata[1:10,]) ### pathway scores of all samples

In [8]:
ps_new
ps_ref
ps_both

### 3.3 makeBinaryTemplateAndProbabilityTemplate() 

Input: a matrix whose columns are samples and whose rows are pathway genes.<br>
Output: a list containing the binary template vector and the probability template vector.

In [9]:
temp = makeBinaryTemplateAndProbabilityTemplate(refdata[1:10,1:5])
bt = temp$binary_template
pt = temp$probability_template
cbind(bt,pt)

| Gene pair | bt | pt |
|---|---|---|
| COX1 < EEF1A1 | 0 | 0.2 |
| COX1 < TMSB4X | 0 | 0.2 |
| COX1 < ATP6 | 0 | 0.0 |
| COX1 < RPLP1 | 0 | 0.0 |
| COX1 < RPL37 | 0 | 0.0 |
| COX1 < RPL37A | 0 | 0.0 |
| COX1 < RPL41 | 0 | 0.0 |
| COX1 < RPS27 | 0 | 0.0 |
| COX1 < ND4 | 0 | 0.0 |
| EEF1A1 < TMSB4X | 0 | 0.4 |
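
The output above suggests how the two templates relate: `pt` looks like the fraction of reference samples in which the first gene's expression is lower than the second's (with five samples, every value is a multiple of 0.2), and `bt` is 0 wherever `pt` is below 0.5. A minimal base-R sketch under that reading follows. `pairwise_templates()` is a hypothetical helper written for this tutorial, not a GRAPE function, and the 0.5 cut-off is an assumption rather than the package's documented rule.

```r
# Hypothetical helper (not part of GRAPE): pairwise-order templates
pairwise_templates <- function(mat) {
  genes <- rownames(mat)
  idx <- combn(length(genes), 2)           # each column is one gene pair (i, j), i < j
  first  <- mat[idx[1, ], , drop = FALSE]  # expression of gene i, one row per pair
  second <- mat[idx[2, ], , drop = FALSE]  # expression of gene j, one row per pair
  pt <- rowMeans(first < second)           # fraction of samples with gene i < gene j
  names(pt) <- paste(genes[idx[1, ]], "<", genes[idx[2, ]])
  bt <- as.numeric(pt > 0.5)               # assumed cut-off: majority ordering
  names(bt) <- names(pt)
  list(binary_template = bt, probability_template = pt)
}

# Toy matrix: 3 genes x 4 samples (values are arbitrary)
toy <- rbind(gA = c(1, 2, 3, 4),
             gB = c(2, 3, 1, 5),
             gC = c(0, 0, 0, 0))
tmpl <- pairwise_templates(toy)
cbind(bt = tmpl$binary_template, pt = tmpl$probability_template)
```

In the toy matrix, `gA < gB` holds in three of four samples, so its probability entry is 0.75 and its binary entry is 1. The other two pairs never hold and get 0 in both vectors.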


### 3.4 GRAPE Classification

Use the `predictClassGRAPE()` function to get predicted class labels for a test set.<br>
It classifies samples according to their GRAPE distances from the templates and is usually applied to the gene expression values of a single pathway.

Arguments:<br>

**trainmat**: Matrix of gene expression for a set of genes across the training set samples. Each column is a sample.

**testmat**: Matrix of gene expression for a set of genes across the test set samples. Each column is a sample.

**train_labels**: Vector of class labels, one for each sample in the training set.

**w**: Weight function. The default is the quadratic weight function.

In [10]:
# Toy example of two classes
set.seed(10)
path_genes <- c("gA","gB","gC","gD"); nsamps <- 50 # Four genes, 50 samples per class
class_one_samps <- matrix(NA,nrow=length(path_genes),ncol=nsamps) # Class 1
rownames(class_one_samps) <- path_genes
class_one_samps[1,] <- rnorm(ncol(class_one_samps),4,2)
class_one_samps[2,] <- rnorm(ncol(class_one_samps),5,4)
class_one_samps[3,] <- rnorm(ncol(class_one_samps),1,1)
class_one_samps[4,] <- rnorm(ncol(class_one_samps),2,1)
class_two_samps <- matrix(NA,nrow=length(path_genes),ncol=nsamps) # Class 2
rownames(class_two_samps) <- path_genes
class_two_samps[1,] <- rnorm(ncol(class_two_samps),2,3)
class_two_samps[2,] <- rnorm(ncol(class_two_samps),5,2)
class_two_samps[3,] <- rnorm(ncol(class_two_samps),1,1)
class_two_samps[4,] <- rnorm(ncol(class_two_samps),0,1)
all_samps <- cbind(class_one_samps,class_two_samps)
labels <- c(rep(1,nsamps),rep(2,nsamps))
testid <- sample.int(100,20)
trainmat <- all_samps[,-testid]
head(trainmat)
train_labels <- labels[-testid]
testmat <- all_samps[,testid]
head(testmat)
test_labels <- labels[testid]
train_labels

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21
gA,4.0374923,3.631495,1.2573389,4.58909,4.779589,1.583848,3.272648,0.7466546,6.20355901,5.511563,...,-0.5220323,-4.6164153,-1.384168,-2.0239299,6.8153421,4.233271,1.63234962,2.2777539,0.9210343,5.08571216
gB,3.3974498,3.661774,10.47181581,7.023277,8.14537,1.391152,7.131588,2.416423,0.04962212,3.175295,...,5.1463916,4.4853508,5.533613,7.7754487,5.3861593,6.184633,7.10208933,7.311595,4.4910639,7.54736853
gC,0.2381957,1.419375,-0.03994336,0.366787,1.563175,1.660987,-0.6580509,2.028168,-0.2801546,2.128868,...,0.9475872,1.0961239,1.266407,1.5547935,2.2354459,1.285133,2.6194187,0.3833266,3.1145213,-0.02856737
gD,1.6088958,1.750133,3.15510475,1.133322,-0.321017,2.60883,3.150006,0.8004023,2.65316619,1.450592,...,-0.6828386,-0.6129412,1.89078,-0.1454105,-0.6394237,-0.411176,0.03799977,0.9210647,-0.5998269,1.24444646


0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20
gA,5.9371327,1.1635295,3.5273561,3.796478,0.705499,3.5235329,4.5862466,2.696874,3.1312182,2.064696,5.7851452,3.4870432,6.1731028,2.8073787,3.185467,2.801665,1.0771787,1.1515351,0.9263603,5.646541
gB,7.8515771,5.6550203,5.7697352,0.7934454,0.3402409,1.6787094,3.3400509,4.704177,5.082777,7.3798294,5.9554809,6.16395,3.3345813,6.5236889,5.785147,13.551068,5.4454709,4.2882951,2.9311241,7.155785
gC,-0.9968156,1.4807212,0.2760049,0.6360178,3.3929129,0.5358655,0.5412071,1.954786,0.8693548,1.3724723,2.3167653,2.1279536,-0.6753322,0.5186344,1.623478,1.711574,0.8307244,1.9579449,1.3845411,2.502545
gD,0.6396939,-0.5280637,0.0660703,2.9598291,-1.3996571,2.5210545,0.3427919,2.046361,-1.4497605,0.5313023,0.7264094,0.4199992,2.1130294,0.8626001,-1.133247,1.135273,-0.4291438,0.4003617,2.1349656,-0.886788


In [11]:
yhat <- predictClassGRAPE(trainmat,testmat,train_labels,w_quad)
sum(diag(table(test_labels,yhat)))/length(test_labels) # accuracy
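
One caveat with the accuracy line above: `table()` only creates columns for classes that actually occur in `yhat`, so if the classifier never predicts one of the classes the table is not square and `diag()` no longer picks out the correct predictions. A small sketch with made-up labels (not GRAPE output) illustrates the issue and two safer alternatives:

```r
# Made-up labels for illustration; the toy predictions contain only class 2
test_toy <- c(1, 1, 2, 2, 2)
yhat_toy <- c(2, 2, 2, 2, 2)

# table() drops the unseen class, giving a 2 x 1 table, so diag() reads one cell
tab_raw  <- table(test_toy, yhat_toy)
acc_diag <- sum(diag(tab_raw)) / length(test_toy)

# Fixing the factor levels keeps the table square
lv <- sort(unique(c(test_toy, yhat_toy)))
tab_sq <- table(factor(test_toy, levels = lv), factor(yhat_toy, levels = lv))
acc_sq <- sum(diag(tab_sq)) / length(test_toy)

# Comparing the vectors directly needs no table at all
acc_direct <- mean(yhat_toy == test_toy)

c(diag_raw = acc_diag, diag_square = acc_sq, direct = acc_direct)
```

Here the unadjusted table reports 0.4 even though three of the five predictions are correct (0.6). On the toy example of this section, `mean(yhat == test_labels)` is the shortest equivalent.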

### 3.5 w_quad: quadratic weight function to get the weight of each element

`w_quad(x)` calculates the weight of each input entry. All entries should take values in [0,1].


**x**: A number, vector, or matrix.

> `w_quad <- function(x){return(4*abs(x-0.5)^2)}`

In [12]:
w_quad(0.95)
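
To make the shape of the weight function concrete, the sketch below evaluates the formula shown above at a few points. `w_quad_local` is a local copy defined here only so that the block runs without GRAPE loaded. It is not a separate GRAPE function.

```r
# Local copy of the formula shown above, so the block runs without GRAPE
w_quad_local <- function(x) 4 * abs(x - 0.5)^2

w_quad_local(0.95)           # 4 * 0.45^2, approximately 0.81
w_quad_local(c(0, 0.5, 1))   # weight is 0 at 0.5 and reaches 1 at 0 and 1

# The function is applied elementwise, so a matrix input returns a matrix
w_mat <- w_quad_local(matrix(c(0.25, 0.75, 0.5, 1), nrow = 2))
w_mat
```

Entries near 0.5 get almost no weight, while entries near 0 or 1 get weights close to 1.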