Training on different biological contexts #45

jaclyn-taroni · 2018-11-29T15:02:10Z

Related to: #39

We want to train models on different biological contexts and see what pathways are recovered. Here, I modify 26-describe_recount2 such that we can identify samples that are predicted to be from specific contexts in MetaSRA and convert from the identifiers used by MetaSRA (e.g., SRSxxxxx) to the sample/column names used in recount2 (e.g., SRPxxxxx.SRRxxxxx).
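As a rough illustration of the identifier conversion described above (not the code in this PR), the sketch below assumes a hypothetical mapping table with columns `sample`, `project`, and `run`, and uses made-up accessions. The helper name `SketchConvertToRecountName` is also hypothetical; the actual `ConvertToRecountSampleName` and `conversion.df` in 26-describe_recount2 may be structured differently.

```r
# Hypothetical mapping between SRA sample (SRS), project (SRP), and run (SRR)
# accessions. Column names and accessions are assumptions for this sketch.
mapping.df <- data.frame(
  sample = c("SRS000001", "SRS000002", "SRS000002"),
  project = c("SRP000010", "SRP000010", "SRP000010"),
  run = c("SRR000100", "SRR000101", "SRR000102"),
  stringsAsFactors = FALSE
)

# Illustrative helper: return recount2-style column names (SRP.SRR) for every
# run that belongs to one of the supplied SRS accessions.
SketchConvertToRecountName <- function(sample.ids, mapping.df) {
  hits <- mapping.df[mapping.df$sample %in% sample.ids, ]
  paste(hits$project, hits$run, sep = ".")
}

recount.names <- SketchConvertToRecountName("SRS000002", mapping.df)
print(recount.names)
```

Note that one sample accession can map to several runs, so the output may be longer than the input.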

Contexts are:

- Only cancer samples
- Only blood samples (recall that a lot of what the models learn is related to leukocytes)
- Everything but blood
- Cell line samples
- Tissue samples

I've added scripts/subsampling_PLIER.R, which lets us run the subsampling experiments in two ways:

- Randomly selecting n samples; this is repeated r times (default is 5) using different random seeds (related PR forthcoming)
- Using a list of sample ids to subset the input data; no repeats are performed (models tend to be pretty stable in our experience)
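The two modes can be sketched on a toy matrix as follows. This is a minimal illustration only: the function names, the toy data, and the choice of seeds `1:5` are assumptions, and it does not reproduce how scripts/subsampling_PLIER.R derives its seeds or calls PLIER.

```r
# Toy expression matrix: 5 genes (rows) by 10 samples (columns).
exprs.mat <- matrix(seq_len(50), nrow = 5,
                    dimnames = list(paste0("gene", 1:5),
                                    paste0("sample", 1:10)))

# Mode 1 (hypothetical helper): draw n random samples once per seed, so each
# repeat is reproducible but uses a different subset.
SubsampleRandom <- function(exprs, n, seeds) {
  lapply(seeds, function(seed) {
    set.seed(seed)
    exprs[, sample(ncol(exprs), n), drop = FALSE]
  })
}

# Mode 2 (hypothetical helper): subset by supplied sample ids; a single pass
# with no repeats.
SubsampleByList <- function(exprs, sample.ids) {
  exprs[, colnames(exprs) %in% sample.ids, drop = FALSE]
}

# Five assumed seeds, mirroring the default of 5 repeats.
random.subsets <- SubsampleRandom(exprs.mat, n = 4, seeds = 1:5)
list.subset <- SubsampleByList(exprs.mat, c("sample2", "sample7"))
```

Each element of `random.subsets` would then be passed to model training separately, while `list.subset` would be trained on once.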

Finally, I'm adding 28-train_different_biological_contexts.sh -- the shell script for training all the models.

gwaybio

Nice PR - LGTM. A couple of minor comments

gwaybio · 2018-11-29T15:10:44Z

26-describe_recount2.Rmd

+tissue.accessions <- ConvertToRecountSampleName(tissue.samples,
+                                                   conversion.df)
+tissue.file <- file.path("data", "sample_info", 
+                            "recount2_tissue_accessions.tsv")


indentation a bit off

gwaybio · 2018-11-29T15:10:53Z

26-describe_recount2.Rmd

+  ConvertToRecountSampleName(cancer.samples, 
+                             conversion.df)
+cancer.file <- file.path("data", "sample_info", 
+                            "recount2_cancer_accessions.tsv")


indentation here too

gwaybio · 2018-11-29T15:11:58Z

26-describe_recount2.Rmd

+blood.samples <- names(blood.samples[which(unlist(blood.samples))])
+blood.accessions <- ConvertToRecountSampleName(blood.samples, conversion.df)
+blood.file <- file.path("data", "sample_info", 
+                            "recount2_blood_accessions.tsv")


indentation

gwaybio · 2018-11-29T15:15:45Z

scripts/subsampling_PLIER.R

+              help = "Number of repeats to perform"),
+  make_option(c("-s", "--seed"), type = "integer", default = 123,
+              help = "Number of repeats to perform"),
+  make_option(c("-u", "--use_sample_list"), type = "logical", default = FALSE,


For a logical option, you can use the argument action = "store_true".

This way, when it's called, all that is needed is --use_sample_list instead of --use_sample_list TRUE.
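A minimal sketch of the suggestion, using the optparse package. The help text here is an assumption, and the `args` vectors stand in for the real command line.

```r
library(optparse)

option.list <- list(
  # action = "store_true" makes this a bare flag: present means TRUE,
  # absent falls back to the default (FALSE).
  make_option(c("-u", "--use_sample_list"), action = "store_true",
              default = FALSE,
              help = "Subset the input data using a list of sample ids")
)
parser <- OptionParser(option_list = option.list)

# Passing args explicitly stands in for command-line arguments.
with.flag <- parse_args(parser, args = c("--use_sample_list"))
without.flag <- parse_args(parser, args = character(0))

print(with.flag$use_sample_list)
print(without.flag$use_sample_list)
```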

gwaybio · 2018-11-29T15:16:59Z

scripts/subsampling_PLIER.R

+  smpl.exprs <- prepped.data[[1]][, sample.index]
+
+  plier.results <- PLIERWrapper(exprs = smpl.exprs,
+                              pathway.mat = prepped.data[[2]],


indentation

jaclyn-taroni · 2018-11-29T16:26:39Z

Can confirm that changes introduced with dd93688 gave me the same md5 checksum for one of the models.

jaclyn-taroni added 10 commits November 16, 2018 11:34

WIP: Rscript for training PLIER on different sample sizes

791abb8

Don't want to use seed as index

57c2593

Add documentation

dac14fc

We don't use the k value

75b6b66

Add ability to subset training data based on supplied sample list

d98e65d

WIP: add accession code conversion function

07b7624

Update: get lists of recount2 samples from different biological contexts

309b192

Add shell script for training models in different bio contexts

6133445

Fix tissue accessions bug

f9be30c

Newline

6b91b90

jaclyn-taroni requested a review from gwaybio November 29, 2018 15:02

gwaybio approved these changes Nov 29, 2018

jaclyn-taroni added 2 commits November 29, 2018 10:47

Indentation fixes

81f3654

Use action = "store_true"

dd93688

jaclyn-taroni merged commit e01b123 into greenelab:master Nov 29, 2018

jaclyn-taroni deleted the 39-bio-context branch November 29, 2018 16:27

jaclyn-taroni mentioned this pull request Nov 29, 2018

Training on different subsets of recount2 #39

Closed

5 tasks

jaclyn-taroni mentioned this pull request Dec 16, 2018

Add repeats for training on different biological contexts #51

Merged
