---
title: "Using Lasso to Find Predictors for Breast Cancer"
author: "Michael Kesling"
date: "8/23/2019"
output: rmarkdown::github_document # html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, fig.height=6, fig.width=8,
fig.align = "center", dpi=300)
require(gridExtra)
######## FUNCTIONS NOT TO BE PRINTED IN MARKDOWN DOC:
scaleData <- function(DM, MEAN=TRUE, ROW=FALSE){
# this function takes a data matrix, DM, and scales the data by each row's
# standard deviation (if ROW=TRUE). If ROW=FALSE, then all scaling will be
# performed by column. It may also first subtract off the (row)'s mean if
# MEAN=TRUE. For (rows) whose SD=0, they are set aside and added back once
# the normalization is finished
# We assume that counts have already been normalized across samples!!
rowcol <- ifelse(ROW==TRUE, 1, 2) # normalization by row or column
rowcolInv <- ifelse(rowcol==1, 2, 1) # inverse val needed for norm
SD <- apply(DM, rowcol, sd) # calc std deviations
numSDzeros <- sum(SD==0)
# remove row / col if SD == 0
if(numSDzeros>0){ # rm row/col if SD == 0
zeroSD <- which(SD == 0) # id examples with zero SD
DM <- ifelse(rowcol==2, list(DM[, -zeroSD]), list(DM[-zeroSD,]))[[1]]
SD <- SD[-zeroSD]
}
means <- apply(DM, rowcol, mean)
# apply normalization with or without mean subtraction:
if(MEAN==FALSE){
DM <- t(apply(DM, rowcolInv, function(x){x / SD}))
}
else{
DM <- t(apply(DM, rowcolInv, function(x){(x - means) / SD}))
}
if(rowcol == 1){ # transpose matrix if normalization is across rows
DM <- t(DM)
}
# add back all-zero row / column taken out earlier (if done at all)
if(numSDzeros>0){
if(rowcol ==1){
zeros <- matrix(rep(0, length(zeroSD)*dim(DM)[2]), ncol=dim(DM)[2])
rownames(zeros) <- names(zeroSD)
DM <- rbind(DM, zeros)
}
else{
zeros <- matrix(rep(0, length(zeroSD)*dim(DM)[1]), nrow=dim(DM)[1])
colnames(zeros) <- names(zeroSD)
DM <- cbind(DM, zeros)
}
}
return(DM)
}
#### TEST
if(FALSE){
x <- c(1,0,9,4,0,33,0,0,0,3,0,9)
X <- matrix(x, ncol=4)
rownames(X) <- c("one", "two", "three")
colnames(X) <- c("S1", "S2", "S3", "S4")
Y <- scaleData(X, TRUE, FALSE)
}
```
###### (Document in Progress)
### Objective:
We are going to use the cleaned, batch-normalized RNASeq dataset of 199 healthy breast samples and 199 breast tumors to find RNASeq signatures that predict the cancerous state.
###### NOTE: Some functions are not included in the markdown / html file. To view those, see the [.RMD file](https://github.com/keslingmj/MachineLearningRNASeq/blob/master/Lasso_on_BRCA_RNASeq.Rmd)
### Finish Dataframe Cleanup
We start by importing the data as a dataframe and taking its transpose, so that the genes act as the predictors (columns).
We also need to combine the Hugo symbol and Entrez ID rows into the column names and then remove those 2 rows from the dataframe.
```{r message=FALSE}
# transpose causes data to be saved as matrix rather than df. However, that helps manipulation
# of text in first 2 rows to create meaningful column names for the df.
wangMatrix <- t(read.table("../data/wangBreastFPKM398_Attrib.txt", header=TRUE)) # rownames okay now
numRows <- dim(wangMatrix)[1]
# convert to dataframe while removing first 2 rows and name the columns from the Matrix data:
wangWithAttrib <- data.frame(wangMatrix[3:numRows,], stringsAsFactors = FALSE)
colnames(wangWithAttrib) <- gsub("-NA", "",(gsub(" ", "", paste0(wangMatrix[1,], "-", wangMatrix[2,]))))
```
Next, we'll need to decide which sample attributes to use. As the only consistently populated attribute is Age, I'll remove all others (perhaps we could impute them at another time). I'll also filter out the 3 male samples, and I'll need to create an output vector. I'm going to start with *primary_diagnosis*, which will be recoded as 0 for healthy and 1 for cancer. However, *tumor_stage* describes the cancer in greater detail, so I'll save it as an alternative output vector.
```{r message=FALSE}
require(dplyr);
require(tibble);
# convert rownames to column so that they are not lost through filter step
wangWithAttrib <- wangWithAttrib %>% rownames_to_column('sample')
# wangWithAttrib %>% dplyr::filter(gender=="female") %>% dim() #shows correctly removes 3 rows
wangWithAttrib <- wangWithAttrib %>% dplyr::filter(gender=="female")
primary_diagnosis_archive <- wangWithAttrib$primary_diagnosis # archive
tumor_stage_archive <- wangWithAttrib$tumor_stage # archive
# before removing the output variable, I'm going to split wangWithAttrib into training and test sets
# first, we'll convert primary_diagnosis into 0's and 1's.
wangWithAttrib$primary_diagnosis <- ifelse(wangWithAttrib$primary_diagnosis=="healthy", 0, 1)
require(caTools);
set.seed(233992812)
idxTrain <- sample.split(wangWithAttrib$primary_diagnosis, SplitRatio = 0.75)
# confirm randomness of selection:
qa <- cbind(idxTrain, wangWithAttrib$primary_diagnosis)
paste(sum(qa[qa[,2]==0,][,1]), " and ", sum(qa[qa[,2]==1,][,1]), " training sizes amongst healthy and cancer samples shows equal partitioning")
# next, we grab the output vectors, both train and test
diagTrain <- subset(wangWithAttrib$primary_diagnosis, idxTrain==TRUE)
diagTest <- subset(wangWithAttrib$primary_diagnosis, idxTrain==FALSE)
# next, we remove unused columns
wangWithAge <- wangWithAttrib %>% select(-gender, -race, -ethnicity, -prior_malignancy,
-vital_status,
-primary_diagnosis, -tumor_stage) # correctly removes 7 columns
# then we use indices to separate what remains into train and test sets:
wangTrain <- wangWithAge %>% filter(idxTrain==TRUE) # 296 training obs
wangTest <- wangWithAge %>% filter(idxTrain==FALSE) # 99 test obs
print(paste(dim(wangTrain), dim(wangTest)))
```
### Normalizing Data for Lasso
In order to perform the Lasso, each gene must be normalized so that the L1 penalty shrinks coefficients on a comparable scale across genes; otherwise, genes measured in larger units get systematically smaller (and therefore less-penalized) betas than genes measured in smaller units.
###### TODO: rigorously test whether my normalization yields results identical to glmnet's own internal normalization (see the hedged sketch after the next chunk).
```{r, fig.height=8, fig.width=8, message=FALSE}
# start by converting data-frames to matrices, and create row names:
wangTrainMatrix <- data.matrix(wangTrain[,2:dim(wangTrain)[2]])
rownames(wangTrainMatrix) <- wangTrain[,1]
wangTestMatrix <- data.matrix(wangTest[,2:dim(wangTest)[2]])
rownames(wangTestMatrix) <- wangTest[,1]
# Normalizing Training Data with and without subtracting off mean:
wangTrainNormMean <- scaleData(wangTrainMatrix, MEAN=TRUE, ROW=TRUE)
wangTrainNorm <- scaleData(wangTrainMatrix, MEAN=FALSE, ROW=TRUE)
# Normalizing Test Data with and without subtracting off mean:
wangTestNormMean <- scaleData(wangTestMatrix, MEAN=TRUE, ROW=TRUE)
wangTestNorm <- scaleData(wangTestMatrix, MEAN=FALSE, ROW=TRUE)
# Must re-order Test Data columns to be the same as for Training Data:
wangTestNormMean <- wangTestNormMean[,colnames(wangTrainNormMean)]
wangTestNorm <- wangTestNorm[,colnames(wangTrainNorm)]
# test consistency with wangTrainNorm
print(c(all(colnames(wangTestNorm) == colnames(wangTrainNorm)),
all(colnames(wangTestNormMean) == colnames(wangTrainNormMean))))
```
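As a first pass at the note above, here is a minimal, hedged sketch of how one might compare an explicit per-gene (column-wise) standardization against glmnet's internal `standardize = TRUE`. This is an illustration only: glmnet standardizes each predictor (gene) internally, whereas `scaleData()` above was called row-wise, so the sketch standardizes the columns explicitly for an apples-to-apples comparison. It reuses `wangTrainMatrix` and `diagTrain` from above, the lambda value is arbitrary, and small numerical differences are expected because glmnet scales with a 1/n variance.
```{r eval=FALSE}
# Hedged sketch only: compare genes selected with glmnet's internal
# standardization vs. an explicit column-wise (per-gene) standardization.
require(glmnet)
set.seed(1011)
fit_internal <- glmnet(wangTrainMatrix, diagTrain, family = "binomial",
                       alpha = 1, standardize = TRUE)

xColScaled <- scale(wangTrainMatrix)                         # center + scale each gene
xColScaled <- xColScaled[, colSums(is.na(xColScaled)) == 0]  # drop zero-SD genes
set.seed(1011)
fit_manual <- glmnet(xColScaled, diagTrain, family = "binomial",
                     alpha = 1, standardize = FALSE)

lam <- 0.05   # illustrative value only; in practice use a cv.glmnet-selected lambda
nonzeroGenes <- function(fit, s){
  cf <- coef(fit, s = s)
  setdiff(rownames(cf)[as.vector(cf) != 0], "(Intercept)")
}
length(intersect(nonzeroGenes(fit_internal, lam), nonzeroGenes(fit_manual, lam)))
```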
### Viewing Partitioning of Cancer/Healthy Samples in Data Space using t-SNE
T-distributed stochastic neighbor embedding (t-SNE) is a popular technique for projecting a high-dimensional dataset onto a 2-D surface. We use it here to view how the cancer/healthy sample attribute partitions across the dataset.
```{r, fig.height=6, fig.width=10, message=FALSE}
require(glmnet);
require(ggplot2);
# install_github("ririzarr/rafalib")
require(rafalib);
###############
### Create t-SNE Plot on:
### (A) non-scaled data
### (B) data scaled by SD only
### (C) data scaled by SD and centered by Mean
xTrainSDnorm <- as.matrix(wangTrainNorm)
xTrainSDMean <- as.matrix(wangTrainNormMean)
# Let's look at map of data after normalizing using t-SNE
require(Rtsne);
require(ggplot2);
set.seed(31234)
tSNEout_full_noNorm <- Rtsne(wangTrainMatrix, dims=2)
set.seed(31234)
tSNEout_full_SD <- Rtsne(xTrainSDnorm, dims=2)
set.seed(31234)
tSNEout_full_SDmean <- Rtsne(xTrainSDMean, dims=2)
tsne_plot_full_noNorm <- data.frame(x=tSNEout_full_noNorm$Y[,1],
y = tSNEout_full_noNorm$Y[,2], col=diagTrain)
tsne_plot_full_SD <- data.frame(x=tSNEout_full_SD$Y[,1],
y = tSNEout_full_SD$Y[,2], col=diagTrain)
tsne_plot_full_SDmean <- data.frame(x=tSNEout_full_SDmean$Y[,1],
y = tSNEout_full_SDmean$Y[,2], col=diagTrain)
palette <- c("#000000", "#56B4E9")
mypar(1,3)
ggplot(tsne_plot_full_noNorm) +
geom_point(aes(x=x, y=y, color=as.factor(col))) +
ggtitle("t-SNE Plot of Training Data Using All Predictors -- no Normalization") +
scale_color_manual(name="Cancer",
breaks = c("0","1"),
values = palette,
labels = c("No", "Yes"))
ggplot(tsne_plot_full_SD) +
geom_point(aes(x=x, y=y, color=as.factor(col))) +
ggtitle("t-SNE Plot of Training Data Using All Predictors -- SD scaling only") +
scale_color_manual(name="Cancer",
breaks = c("0","1"),
values = palette,
labels = c("No", "Yes"))
ggplot(tsne_plot_full_SDmean) +
geom_point(aes(x=x, y=y, color=as.factor(col))) +
ggtitle("t-SNE Plot of Training Data Using All Predictors -- SD and mean norm") +
scale_color_manual(name="Cancer",
breaks = c("0","1"),
values = palette,
labels = c("No", "Yes"))
```
While the t-SNE plots show good separation between the healthy samples and the cancerous samples, the separation is not as clean as when I normalized across genes rather than across samples, as was done here.
I'm going to continue the analysis using all 3 normalization schemes.
### Perform Logistic Regression with Lasso Coefficient Shrinkage
We will perform Logistic Regression with a Lasso shrinkage term on the dataset that is:
(a) unscaled by gene standard deviation across samples,
(b) scaled by gene standard deviation across samples, and
(c) scaled by gene standard deviation and offset by gene mean across samples.
```{r, fig.height=10}
# creating fits
set.seed(1011)
fit.NONORM.lasso <- glmnet(wangTrainMatrix, diagTrain, family="binomial",
alpha = 1)
set.seed(1011)
fit.SD.lasso <- glmnet(xTrainSDnorm, diagTrain, family="binomial",
alpha = 1)
set.seed(1011)
fit.SDmean.lasso <- glmnet(xTrainSDMean, diagTrain, family="binomial",
alpha = 1)
frame()
par(mfrow=c(2,2), mar=c(4.5,4.5,4,1))
plot(fit.NONORM.lasso, xvar="lambda", label=TRUE)
title("Unscaled Data", line=2.5)
plot(fit.SD.lasso, xvar="lambda", label=TRUE)
title("Scaled by SD", line=2.5)
plot(fit.SDmean.lasso, xvar="lambda", label=TRUE)
title("SD- and Mean-Norm", line=2.5)
```
Normalizing by both subtracting off the mean and scaling by the standard deviation (right panel) gives the greatest number of genes with coefficients not close to 0 over the greatest range of lambda values.
Also note that the magnitudes of the coefficients in these 3 plots vary greatly; a quick numerical check follows.
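A hedged sketch of that scale difference: compare the largest absolute (non-intercept) coefficient in each fit at the smallest lambda on each path, which is an arbitrary but comparable reference point.
```{r eval=FALSE}
# Hedged sketch: largest absolute non-intercept coefficient per fit at the
# end of each lambda path, to make the scale differences concrete.
maxAbsCoef <- function(fit){
  cf <- as.vector(coef(fit, s = min(fit$lambda)))
  max(abs(cf[-1]))            # drop the intercept
}
c(noNorm = maxAbsCoef(fit.NONORM.lasso),
  SD     = maxAbsCoef(fit.SD.lasso),
  SDmean = maxAbsCoef(fit.SDmean.lasso))
```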
### Fit Quality as Function of Lambda
```{r}
set.seed(1011)
cv.NONORM.lasso <- cv.glmnet(wangTrainMatrix, diagTrain, family="binomial", alpha=1,
type.measure = "deviance")
set.seed(1011)
cv.SD.lasso <- cv.glmnet(xTrainSDnorm, diagTrain, family="binomial", alpha=1,
type.measure = "deviance")
set.seed(1011)
cv.SDmean.lasso <- cv.glmnet(xTrainSDMean, diagTrain, family="binomial", alpha=1,
type.measure = "deviance")
mypar(1,3)
plot(cv.NONORM.lasso) # gives results close to earlier results
plot(cv.SD.lasso)
plot(cv.SDmean.lasso)
#coefsNN <- data.frame(as.matrix(coef(fit.NONORM.lasso, s=cv.NONORM.lasso$lambda.1se)))
#coefsNN <- cbind(rownames(coefsNN), coefsNN) %>% filter(X1!=0) %>% arrange(X1)
#cNNnames <- coefsNN$`rownames(coefsNN)` %>% sort()
###############
#xTrain <- as.matrix(wangTrainNorm)
#fit.lasso <- glmnet(xTrain, diagTrain, family="binomial",
# alpha = 1) # 'binomial' needed for logistic regression
# alpha = 1 -> Lasso; alpha = 0 -> Ridge
#mypar(1,2) # doesn't work in RMD, but does on R command line
#plot(fit.lasso, xvar="dev", label=TRUE)
#plot(fit.lasso, xvar="lambda", label=TRUE) + abline(v=-3.77)
###########
#xTrain2 <- as.matrix(wangTrainNorm2)
#fit.lasso2 <- glmnet(xTrain2, diagTrain, family="binomial",
# alpha = 1) # no subtraction of mean
```
### Choosing Simplest Model with Near-Minimum Error
To determine the simplest model that still gives low error, we'll plot the cross-validated binomial deviance against log(lambda).
```{r message=FALSE}
#set.seed(1011)
#cv.lasso <- cv.glmnet(xTrain, diagTrain, family="binomial", alpha=1,
# type.measure = "deviance")
# misclassification error use type.measure = "class"
# misclassif. gives much smaller model (8 predictors)
######
#cv.lasso2 <- cv.glmnet(xTrain2, diagTrain, family="binomial", alpha=1,
# type.measure = "deviance")
#all(coef(cv.lasso) == coef(cv.lasso2)) # not identical
which(coef(cv.NONORM.lasso)!= 0) # 37 coefs
which(coef(cv.SD.lasso)!= 0) # 44 coefs
which(coef(cv.SDmean.lasso)!=0) #46 coefs
which(coef(cv.NONORM.lasso)!= 0) %in% which(coef(cv.SDmean.lasso)!=0) # models diverge significantly
# might look at those only having abs value > 0.1 or something, but can't compare
# across datasets, as units will vary
#plot(cv.lasso)
```
We can see that the genes selected by the three logistic regression models diverge substantially, which isn't ideal.
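Because `scaleData()` can reorder columns (zero-SD genes are moved to the end), comparing coefficient *indices* across the three models can be misleading. A hedged sketch of a name-based comparison, assuming the three `cv.glmnet` fits from above:
```{r eval=FALSE}
# Hedged sketch: compare selected genes by name rather than by coefficient index.
selGenes <- function(cvfit){
  cf <- coef(cvfit, s = "lambda.1se")
  setdiff(rownames(cf)[as.vector(cf) != 0], "(Intercept)")
}
gNoNorm <- selGenes(cv.NONORM.lasso)
gSD     <- selGenes(cv.SD.lasso)
gSDmean <- selGenes(cv.SDmean.lasso)
length(intersect(gNoNorm, gSDmean))   # overlap: unscaled vs SD/mean-normalized
length(intersect(gSD,     gSDmean))   # overlap: SD-only vs SD/mean-normalized
```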
Let's see what the value of lambda at 1 SE above the minimum is for each model:
```{r, include=FALSE}
print(paste("The best value of lambda for un-normalized data is ",
cv.NONORM.lasso$lambda.1se, " and the best lambda for SD-normalized data is ",
cv.SD.lasso$lambda.1se, " and the best lambda for SD/mean-normalized data is ",
cv.SDmean.lasso$lambda.1se, "."))
```
Let's look at the coefficient values for the selected model:
## LATER
```{r}
# grabbing model coefficients
# (if we substitute "fit.lasso" with "cv.lasso", it yields the identical coefs)
#coefs <- data.frame(as.matrix(coef(fit.lasso, s=cv.lasso$lambda.1se)))
#coefs <- cbind(rownames(coefs), coefs) %>% filter(X1!=0) %>% arrange(X1)
#print(coefs)
#cNames <- coefs$`rownames(coefs)` %>% sort()
#as.character(cNames) == as.character(cNNnames)
```
### Predictions on Test Data
Now that we've identified a lasso-shrunken model, we'll see how well it performs on the test data, which were not involved in model training.
```{r message=FALSE}
# While the intercept is not needed for the training data (glmnet takes care
# of it), we need to add it to the TEST data, as the first coefficient in the
# model will be the intercept term.
xTest_noNorm <- cbind(1, wangTestMatrix)
colnames(xTest_noNorm)[1] <- "(Intercept)"
xTest_noNorm <- as.matrix(xTest_noNorm)
xTest_SD <- cbind(1, wangTestNorm)
colnames(xTest_SD)[1] <- "(Intercept)"
xTest_SD <- as.matrix(xTest_SD)
xTest_SDmean <- cbind(1, wangTestNormMean)
colnames(xTest_SDmean)[1] <- "(Intercept)"
xTest_SDmean <- as.matrix(xTest_SDmean)
######
# TO DO: change setup so that we're only using the reduced model.
######
# we then extract the fitted "lambda.1se" coefficient
nBeta_noNorm <- coef(cv.NONORM.lasso, s="lambda.1se")
nBeta_SD <- coef(cv.SD.lasso, s="lambda.1se")
nBeta_SDmean <- coef(cv.SDmean.lasso, s="lambda.1se")
# we then perform the class prediction:
testPredictions_noNorm <- ifelse(xTest_noNorm %*% nBeta_noNorm > 0, 1, 0)
testPredictions_SD <- ifelse(xTest_SD %*% nBeta_SD > 0, 1, 0)
testPredictions_SD_mean <- ifelse(xTest_SDmean %*% nBeta_SDmean > 0, 1, 0)
# confusion matrix
tbl1 <- table(diagTest, testPredictions_noNorm)
tbl2 <- table(diagTest, testPredictions_SD)
tbl3 <- table(diagTest, testPredictions_SD_mean)
print(tbl1)
print(tbl2)
print(tbl3)
# I had difficulty getting the 'predict()' function to work on this model
# Many people online say the same thing, as well as predict's own error
# message. So I simply performed matrix multiplication.
```
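For reference, a hedged sketch of the standard `predict()` call on a `cv.glmnet` object is shown below. It takes the un-augmented test matrix via `newx` (no intercept column, with the same columns in the same order as the training matrix); whether it works here as-is is untested, which is why matrix multiplication was used above.
```{r eval=FALSE}
# Hedged sketch of the usual predict() interface for a cv.glmnet fit.
predClass_SDmean <- predict(cv.SDmean.lasso, newx = wangTestNormMean,
                            s = "lambda.1se", type = "class")
predProb_SDmean  <- predict(cv.SDmean.lasso, newx = wangTestNormMean,
                            s = "lambda.1se", type = "response")
table(diagTest, predClass_SDmean)
```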
```{r, include=FALSE}
#print(paste0("The model is yielding great results, with just ", tbl[2,1]," false positives and ", tbl[1,2], " false negatives out of ", length(diagTest), " test predictions."))
```
```{r, include=FALSE}
#print(paste0("This yields a sensitivity of ", tbl[2,2]/sum(tbl[,2]), " and a specificity of ", round(tbl[1,1]/sum(tbl[,1]),2), "."))
```
### Plotting Predictors and Non-Predictors Across Samples
We've seen that these models perform extremely well. I'd like to ensure that this isn't due to some artifact that would prevent the model from generalizing well.
We'll start by plotting a number of predictor and non-predictor genes, grouped by cancer versus healthy samples.
```{r}
# grab predictors, random genes, and output and place into df
# create jitter plot for each gene, colored by output
# we expect to see a clear difference in the chosen predictors and the random genes.
coefs_SDmean <- data.frame(as.matrix(coef(fit.SDmean.lasso, s=cv.SDmean.lasso$lambda.1se)))
coefs_SDmean2 <- cbind(rownames(coefs_SDmean), coefs_SDmean) %>% filter(X1!=0) %>% arrange(X1)
predictorCols <- as.character(coefs_SDmean2$`rownames(coefs_SDmean)`)
predictorCols <- predictorCols[grep("(Intercept)", predictorCols, invert=TRUE)]
predictors_SDmean <- xTrainSDMean[, predictorCols] # predictors
predictors_SDmean <- data.frame(cbind(predictors_SDmean, diagTrain))
nonPredictors_SDmean <- cbind(rownames(coefs_SDmean), coefs_SDmean) %>% filter(X1==0)
nonPredSample <- as.character(sample(nonPredictors_SDmean$`rownames(coefs_SDmean)`, 10))
#tmp <- cbind(diagTrain, xTrainSDMean)
non_predictors_SDmean <- data.frame(cbind(xTrainSDMean[, nonPredSample], diagTrain))
# need to convert into a taller df to plot in ggplot
palette <- c("#000000", "#56B4E9")
jitterPlot <- function(df, var1, var2, titl, pal){
obj <- ggplot(df, aes(df[,var1], df[,var2]), colour=as.factor(diagTrain))+ #as.factor()
geom_jitter(aes(col=as.factor(diagTrain)),size=0.5, height=0.3) + #as.factor() needed
scale_color_manual(name="Cancer",
breaks = c("0","1"),
values = pal,
labels = c("No", "Yes")) +
ggtitle(titl) + xlab(var1) + ylab(var2) +
theme(plot.title = element_text(size = 18, hjust=0.5)) +
theme(axis.title = element_text(size = 12, hjust=0.5));
return(obj)
}
p1 <- jitterPlot(predictors_SDmean, var1 = "diagTrain", var2="PAMR1.25891",
"Jitter Plot of Predictor Genes",
pal = palette)
grid.arrange(p1)
```
This plot is very problematic: the PAMR1 gene, which has the largest coefficient in the prediction model, shows within-class noise that is much larger than the difference in signal between the two classes.
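To view all of the selected predictors at once rather than PAMR1 alone, the chunk above could be extended with the "taller" data frame it mentions. A hedged sketch, assuming `tidyr` is available:
```{r eval=FALSE}
# Hedged sketch: reshape the predictor data frame to long form and facet by gene.
require(tidyr)
predictorsLong <- predictors_SDmean %>%
  pivot_longer(cols = -diagTrain, names_to = "gene", values_to = "expr")
ggplot(predictorsLong, aes(x = as.factor(diagTrain), y = expr,
                           color = as.factor(diagTrain))) +
  geom_jitter(size = 0.3, width = 0.25) +
  facet_wrap(~ gene, scales = "free_y") +
  scale_color_manual(name = "Cancer", breaks = c("0", "1"),
                     values = palette, labels = c("No", "Yes")) +
  xlab("Cancer status") + ylab("Normalized expression")
```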
#### Next, I'm going to need to do an MA plot of the whole dataset, coloring the predictor genes.
We'll place the expression data and the sample info (Age, Cancer/Healthy) into an ExpressionSet object in order to create an MA-plot and a Volcano Plot. This should help tell us what types of predictors the lasso-regularized logistic regression models have selected.
```{r}
require(Biobase)
require(viper)
# transpose xTrainSDMean matrix and take out Age variable
xTrainSDMeanGenes <- t(xTrainSDMean)
age <- xTrainSDMeanGenes["age",]
xTrainSDMeanGenes <- xTrainSDMeanGenes[-which(rownames(xTrainSDMeanGenes)=="age"),]
ageDF <- data.frame(age)
diagnosisDF <- data.frame(diagTrain)
rownames(diagnosisDF) <- colnames(xTrainSDMeanGenes)
colnames(diagnosisDF) <- "diagnosis"
phenoDataDF <- cbind(diagnosisDF, ageDF)
# place data matrix and sample attributes in ExpressionSet object
es <- ExpressionSet(xTrainSDMeanGenes)
#pData(es) <- ageDF
pData(es) <- phenoDataDF
# create MA Plot
d <- exprs(es)
# compare healthy vs cancer columns using the diagnosis vector directly, rather
# than assuming the first 148 columns are healthy and the last 148 are cancer
ttests <- rowTtest(d[, diagTrain == 0], d[, diagTrain == 1])
hist(ttests$statistic)
```
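The volcano plot mentioned above can be sketched directly from the `rowTtest()` output. This is a hedged illustration that assumes `ttests$statistic` and `ttests$p.value` are aligned with `rownames(d)` and reuses `predictorCols` from the jitter-plot chunk.
```{r eval=FALSE}
# Hedged sketch: volcano plot with the lasso-selected genes highlighted.
volcanoDF <- data.frame(stat      = as.vector(ttests$statistic),
                        negLogP   = -log10(as.vector(ttests$p.value)),
                        predictor = rownames(d) %in% predictorCols)
ggplot(volcanoDF, aes(stat, negLogP, color = predictor)) +
  geom_point(size = 0.4) +
  xlab("t-statistic (healthy vs cancer)") + ylab("-log10(p-value)") +
  ggtitle("Volcano Plot with Lasso-Selected Genes Highlighted")
```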
### Visualizing Model on Training Data
We're going to visualize the reduced training dataset using only the non-zero coefficients from the Lasso model.
```{r message=FALSE}
# start by reducing training dataset
# coefs <- data.frame(as.matrix(nBeta))
# coefs <- cbind(rownames(coefs), coefs) %>% filter(X1!=0)
# xTrainReduced <- xTrain[,coefs$`rownames(coefs)`] # (matrix)
#
# # now create the t-SNE plot
# require(Rtsne);
# require(ggplot2);
# set.seed(31234)
# tSNEout_reduced <- Rtsne(xTrainReduced)
# tsne_plot_reduced <- data.frame(x=tSNEout_reduced$Y[,1], y = tSNEout_reduced$Y[,2],
# col=diagTrain)
# ggplot(tsne_plot_reduced) +
# geom_point(aes(x=x, y=y, color=col)) +
# ggtitle("t-SNE Plot of Training Data Using Only Non-Zero Lasso Coefficients")
```
Now let's try the t-SNE Plot of the non-reduced Training Dataset:
```{r message=FALSE}
#require(devtools)
#install_github("ropensci/plotly")
# require(plotly);
# require(Rtsne);
# require(ggplot2);
# set.seed(31234)
# tSNEout_full <- Rtsne(xTrain, dims=2)
# tsne_plot_full <- data.frame(x=tSNEout_full$Y[,1], y = tSNEout_full$Y[,2], col=diagTrain)
# ggplot(tsne_plot_full) +
# geom_point(aes(x=x, y=y, color=col)) +
# ggtitle("t-SNE Plot of Training Data Using All Predictors")
#ggplotly(g) # better if sample named appeared in mouse-over
```
## Using t-SNE to separate out genes in full dataset
```{r message=FALSE}
# require(Rtsne);
# require(ggplot2);
# require(RColorBrewer);
# palette <- c("#000000", "#56B4E9")
# spctrlPal <- c("#9E0142", "#D53E4F", "#F46D43", "#FDAE61", "#FEE08B", "#FFFFBF", "#E6F598", "#ABDDA4", "#66C2A5","#3288BD", "#5E4FA2")
# # I need a vector for coloring genes that were important in Lasso model
# # 3 COLORS MIGHT BE INTERESTED DEPENDING ON COEF VALUES IN MODEL
# geneImportance = c()
# for(gene in colnames(xTrain)){
# if(gene %in% coefs$`rownames(coefs)`){
# if(coefs[coefs$`rownames(coefs)`==gene,][,2] > 0){
# geneImportance <- c(geneImportance, 1)
# }
# else{
# geneImportance <- c(geneImportance, -1)
# }
# }
# else{
# geneImportance <- c(geneImportance, 0)
# }
# }
# # run t-SNE and plot:
# set.seed(31234)
# tSNEout_genes <- Rtsne(t(xTrain), dims=2, check_duplicates = FALSE)
# tsne_plot_genes <- data.frame(x=tSNEout_genes$Y[,1], y = tSNEout_genes$Y[,2],
# col=geneImportance)
# tsne_Important <- tsne_plot_genes %>% filter(col!=0)
# tsne_plot_genes <- rbind(tsne_plot_genes, tsne_Important)
# ggplot(tsne_plot_genes) +
# geom_point(aes(x=x, y=y, color=as.factor(col)), size=.3) +
# ggtitle("t-SNE Plot of Genes using Training Data Using All Predictors") +
# scale_color_manual(name="Corr With",
# breaks = c("-1", "0", "1"),
# values = c(spctrlPal[1], spctrlPal[8], spctrlPal[11]),
# labels = c("Healthy", "None", "Cancer"))
#ggplotly(g)
```
We can see that there is a clean separation between genes whose expression is positively correlated with cancer (blue) and genes whose expression is negatively correlated with cancer (red). This isn't necessarily surprising, as these are the genes that give the greatest predictive value for separating the disease state (cancer / healthy).
## Ruling Out Artifacts
One possible reason that the Cancer / Healthy breast samples segregate so cleanly in the t-SNE graph on the full data is a batch effect between these samples that isn't the "cancer"/"healthy" state but something else, such as repository/project (TCGA/GTEx), date, or another effect.
## A. Plotting TCGA / GTEx Batch Effects
All GTEx samples are from healthy subjects. Luckily, TCGA has a good number of both healthy and cancer samples, which should let us see whether there is a major TCGA-vs-GTEx batch effect among the healthy samples.
We're going to do this using the full set of predictors.
```{r message=FALSE}
# we start by creating a vector of the 296 training examples according to their class
# classes3 <- subset(wangWithAttrib, idxTrain==TRUE)
# sampleNames <- classes3$gene # actually the sample name and not gene name
# rm(classes3) # cleanup
#
# # classes GTEX-healthy = 1, TCGA-healthy = 2, TCGA-cancer = 3
# c3vect <- c(rep(0, length(sampleNames)))
# c3vect <- grepl("^GTEX", sampleNames) + c3vect
# c3vect <- c3vect + 2*(grepl("^TCGA", sampleNames) &
# grepl("\\.11[AB]\\.", sampleNames, perl=TRUE))
# c3vect <- c3vect + 3*(grepl("^TCGA", sampleNames) &
# grepl("\\.01[AB]\\.", sampleNames, perl=TRUE))
#
# # define colors
# colors3pal <- c("#FDAE6B", "#E6550D", "#56B4E9")
#
# tsne_plot_3class <- data.frame(x=tSNEout_full$Y[,1], y = tSNEout_full$Y[,2], col=c3vect)
# ggplot(tsne_plot_3class) +
# geom_point(aes(x=x, y=y, color=as.factor(col))) +
# ggtitle("t-SNE Plot of Training Data Using All Predictors") +
# scale_color_manual(name="Category",
# breaks = c("1", "2", "3"),
# values = c(colors3pal[1], colors3pal[2], colors3pal[3]),
# labels = c("Healthy-GTEX", "Healthy-TCGA", "Cancer-TCGA"))
```
Overall, healthy/cancer still separate much better in the data than do healthy-TCGA and healthy-GTEX. However, a smaller effect is visible: the healthy-TCGA samples tend to cluster more tightly within the healthy group than the healthy-GTEX samples do.
## Ruling Out Batch Effect of Date
## Plotting Disease State WRT 1st 2 Principal Components
The above plots have looked at t-SNE projections of the data; here we would instead view the disease state with respect to the first two principal components.
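A hedged sketch of that PCA view, assuming the SD/mean-normalized training matrix (`xTrainSDMean`), `diagTrain`, and `palette` are still in scope:
```{r eval=FALSE}
# Hedged sketch: project training samples onto the first two principal components.
pca <- prcomp(xTrainSDMean, center = TRUE, scale. = FALSE)  # center each gene before PCA
pcaDF <- data.frame(PC1 = pca$x[, 1], PC2 = pca$x[, 2], col = diagTrain)
ggplot(pcaDF) +
  geom_point(aes(PC1, PC2, color = as.factor(col))) +
  scale_color_manual(name = "Cancer", breaks = c("0", "1"),
                     values = palette, labels = c("No", "Yes")) +
  ggtitle("Training Samples Projected onto the First Two Principal Components")
```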
## Plotting Informative Genes across samples and on MA Plot (along with other genes)
## Functional grouping of genes
Though we have only a small set of gene predictors (26), and most of them are correlated with healthy tissue rather than breast cancer, I'd like to see if there are enrichments in Panther / GO classes.
```{r message=FALSE}
# # grab coef gene names and just use GeneID:
# geneIDs <- gsub("-[A-Z,0-9]+", "", as.character(coefs$`rownames(coefs)`))
# geneIDs <- geneIDs[2:length(geneIDs)]
# write.csv(geneIDs, file="predictorGenesBRCA.txt", row.names = FALSE)
#
# # reference gene list:
# allGeneIDs <- gsub("-[A-Z,0-9]+", "", colnames(xTrain))
# write.csv(allGeneIDs, file="allGenesRNASeq.txt", row.names = FALSE)
```
These genes showed no significant enrichment for GO categories when analyzed on the Panther website. (I needed to strip out all quotes (") around the gene names.)
Unsupervised clustering may still reveal structure in the gene patterns across the dataset.
## Unsupervised clustering
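As a starting point, a hedged sketch of hierarchical clustering of the training samples on the SD/mean-normalized data, compared against the known diagnosis:
```{r eval=FALSE}
# Hedged sketch: hierarchical clustering of training samples.
sampleDist <- dist(xTrainSDMean)                 # Euclidean distance between samples
hc <- hclust(sampleDist, method = "ward.D2")
plot(hc, labels = FALSE, main = "Hierarchical Clustering of Training Samples")
table(cutree(hc, k = 2), diagTrain)              # 2-group cut vs. known diagnosis
```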