Training on different biological contexts #45

Merged: 12 commits, Nov 29, 2018

jaclyn-taroni
Collaborator

Related to: #39

We want to train models on different biological contexts and see which pathways are recovered. Here, I modify 26-describe_recount2 so that we can identify samples that MetaSRA predicts to be from specific contexts and convert the identifiers used by MetaSRA (e.g., SRSxxxxx) to the sample/column names used in recount2 (e.g., SRPxxxxx.SRRxxxxx).
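To make the identifier conversion concrete, here is a minimal sketch of mapping sample accessions to recount2-style column names. It is not the PR's ConvertToRecountSampleName; the function name ConvertSketch, the column names (sample, project, run), and the accession values are all made up for illustration, and the real conversion.df may be structured differently.

```r
# Hypothetical mapping table. The column names and accession values are
# assumptions for illustration, not the actual contents of conversion.df.
conversion.df <- data.frame(
  sample = c("SRS000001", "SRS000002", "SRS000003"),
  project = c("SRP000010", "SRP000010", "SRP000020"),
  run = c("SRR000100", "SRR000101", "SRR000200"),
  stringsAsFactors = FALSE
)

ConvertSketch <- function(sample.ids, conversion.df) {
  # Keep only the rows whose sample accession was requested
  matched <- conversion.df[conversion.df$sample %in% sample.ids, ]
  # recount2 column names take the form <project>.<run>
  paste(matched$project, matched$run, sep = ".")
}

recount.names <- ConvertSketch(c("SRS000001", "SRS000003"), conversion.df)
```

A sample with several runs would yield several column names under this scheme, which is one reason to do the lookup by row rather than assume a one-to-one mapping.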

Contexts are:

  • Only cancer samples
  • Only blood samples (recall that a lot of what the models learn is related to leukocytes)
  • Everything but blood
  • Cell line samples
  • Tissue samples

I've added scripts/subsampling_PLIER.R which allows us to do the subsampling experiments two ways:

  • Randomly selecting n samples; this is repeated r times (default: 5) using different random seeds (related PR forthcoming)
  • Using a list of sample ids to subset the input data; no repeats are performed (models tend to be pretty stable in our experience)
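The two subsampling modes could be sketched roughly as follows. This is not the code in scripts/subsampling_PLIER.R: the function SelectSampleIndices, the toy sample names, and the way the per-repeat seeds are derived from the initial seed are assumptions, since the PR description does not spell them out.

```r
SelectSampleIndices <- function(sample.names, n = NULL, seed = 123,
                                sample.list = NULL) {
  if (!is.null(sample.list)) {
    # Mode 2: subset using a supplied list of sample ids (no repeats)
    return(which(sample.names %in% sample.list))
  }
  # Mode 1: random subsample of size n, made reproducible by the seed
  set.seed(seed)
  sample(seq_along(sample.names), size = n)
}

# Toy column names standing in for recount2 sample names
sample.names <- paste0("SRP0.SRR", 1:20)

# Five repeats with different seeds; offsetting an initial seed is an
# assumption, not necessarily how the script picks its seeds
seeds <- 123 + 0:4
random.indices <- lapply(seeds, function(s) {
  SelectSampleIndices(sample.names, n = 5, seed = s)
})

# Sample-list mode returns the positions of the requested samples
list.indices <- SelectSampleIndices(sample.names,
                                    sample.list = c("SRP0.SRR3", "SRP0.SRR7"))
```

Either set of indices can then be used to subset the columns of the expression matrix before training.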

Finally, I'm adding 28-train_different_biological_contexts.sh -- the shell script for training all the models.

@gwaybio gwaybio left a comment

Nice PR - LGTM. A couple of minor comments

tissue.accessions <- ConvertToRecountSampleName(tissue.samples,
conversion.df)
tissue.file <- file.path("data", "sample_info",
"recount2_tissue_accessions.tsv")

indentation a bit off

ConvertToRecountSampleName(cancer.samples,
conversion.df)
cancer.file <- file.path("data", "sample_info",
"recount2_cancer_accessions.tsv")

indentation here too

blood.samples <- names(blood.samples[which(unlist(blood.samples))])
blood.accessions <- ConvertToRecountSampleName(blood.samples, conversion.df)
blood.file <- file.path("data", "sample_info",
"recount2_blood_accessions.tsv")

indentation

help = "Number of repeats to perform"),
make_option(c("-s", "--seed"), type = "integer", default = 123,
help = "Number of repeats to perform"),
make_option(c("-u", "--use_sample_list"), type = "logical", default = FALSE,

For a logical option, you can use the argument action='store_true'.

This way, when it's called here, all that is needed is --use_sample_list instead of --use_sample_list TRUE
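As a small illustration of the suggestion, the flag could be declared with optparse along these lines. The help text is an assumption, and the args vectors are passed explicitly only so the sketch runs outside of a command-line call.

```r
library(optparse)

option.list <- list(
  # With action = "store_true", the flag takes no value on the command line.
  # The help string here is illustrative, not taken from the PR.
  make_option(c("-u", "--use_sample_list"), action = "store_true",
              default = FALSE,
              help = "Subset the input data using a list of sample ids")
)

# Flag present: the option is set to TRUE
opt.with <- parse_args(OptionParser(option_list = option.list),
                       args = c("--use_sample_list"))

# Flag absent: the option keeps its default of FALSE
opt.without <- parse_args(OptionParser(option_list = option.list),
                          args = character(0))
```

In the shell script, the call would then only need to include --use_sample_list when a sample list is supplied.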

smpl.exprs <- prepped.data[[1]][, sample.index]

plier.results <- PLIERWrapper(exprs = smpl.exprs,
pathway.mat = prepped.data[[2]],

indentation

@jaclyn-taroni
Collaborator Author

Can confirm that changes introduced with dd93688 gave me the same md5 checksum for one of the models.
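For anyone wanting to repeat this kind of check, a comparison might look like the sketch below. The temporary files and the toy matrix are stand-ins; in practice the two paths would point at the model file written before and after the change.

```r
# Write the same toy matrix to two temporary files; these stand in for the
# model outputs produced before and after a refactoring commit.
toy.model <- matrix(1:6, nrow = 2)
file.before <- tempfile(fileext = ".tsv")
file.after <- tempfile(fileext = ".tsv")
write.table(toy.model, file.before, sep = "\t", quote = FALSE)
write.table(toy.model, file.after, sep = "\t", quote = FALSE)

# tools::md5sum returns a character vector named by file path, so drop the
# names before comparing the checksums themselves
same.checksum <- identical(unname(tools::md5sum(file.before)),
                           unname(tools::md5sum(file.after)))
```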
