Predict mutation status #51

Closed
envest opened this issue Aug 18, 2021 · 2 comments

Comments

@envest
Contributor

envest commented Aug 18, 2021

I'm putting some thoughts down here as a working document and starting point for synchronous/asynchronous discussion 😺

Currently, this project does subtype classification as the supervised learning task. We want to expand to include prediction of mutation status in three genes: PIK3CA, PTEN, and TP53.

I see two main avenues to get this done.

  1. Copy existing scripts into new scripts, which we edit to fit the mutation paradigm
  2. Modify existing scripts (and structure of clinical data inputs) to make scripts work for subtype or mutation prediction.

I think option 1 might be easier at first, but option 2 is better: any change that applies to all supervised learning tasks would only need to be made in one place. Option 2 may even turn out to be easier.

Steps to accomplish modification of existing scripts (option 2):

  • Combine clinical info (with subtype) and mutation data such that each sample has one row with all of its information. This way, one clinical file is read in for all prediction tasks, and the relevant column can then be selected.
    • Mutations are currently encoded as 0/1, but could be "Has Mutation"/"No Mutation", or "TP53 Mutation"/"No Mutation"
  • Add option to scripts for what is being predicted ("subtype" or gene name corresponding to column of input clinical data)
    • Use check_options() functions to make sure the given option is correct
    • Script option would apply to steps 0, 1, 2, 3
  • Each script creates output read by the next script. Use the prediction task ("subtype" or gene name) in the output file name so the next script knows what to read as input. Alternatively, create a subdirectory for each task and keep file names the same.
  • In general, replace "subtype" with "category" in variable names, etc. to make it clear the prediction task is not limited to subtype prediction
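
The steps above could be sketched roughly as follows. This is a minimal illustration in Python using only the standard library, not the project's actual code: the column names, sample IDs, file-name pattern, and the helpers `combine`, `check_predictor`, and `output_path` are all hypothetical, and `check_predictor` only stands in for whatever the existing check_options() functions do.

```python
import csv
import io

# Hypothetical inputs: column names and sample IDs are made up for illustration.
CLINICAL_CSV = """sample_id,subtype
S1,LumA
S2,Basal
S3,Her2
"""
MUTATION_CSV = """sample_id,TP53,PIK3CA
S1,0,1
S2,1,0
"""

def read_rows(text):
    """Parse CSV text into a dict of rows keyed by sample_id."""
    return {row["sample_id"]: row for row in csv.DictReader(io.StringIO(text))}

def combine(clinical, mutations):
    """Build one row per sample; samples without mutation calls get None (NA)."""
    gene_columns = sorted(
        {col for row in mutations.values() for col in row} - {"sample_id"}
    )
    combined = {}
    for sample_id, clin_row in clinical.items():
        row = dict(clin_row)
        mut_row = mutations.get(sample_id)
        for gene in gene_columns:
            row[gene] = int(mut_row[gene]) if mut_row else None
        combined[sample_id] = row
    return combined

def check_predictor(predictor, combined):
    """Fail early if the requested task is not a column of the combined table."""
    valid = set(next(iter(combined.values()))) - {"sample_id"}
    if predictor not in valid:
        raise ValueError(
            f"prediction task must be one of {sorted(valid)}, got {predictor!r}"
        )
    return predictor

def output_path(step, predictor):
    """Embed the task in the file name so the next script knows what to read."""
    return f"{step}.{predictor}.tsv"

combined = combine(read_rows(CLINICAL_CSV), read_rows(MUTATION_CSV))
task = check_predictor("TP53", combined)
labels = {sample_id: row[task] for sample_id, row in combined.items()}
print(labels)
print(output_path("train_test_split", task))
```

Passing `"subtype"` instead of `"TP53"` selects the subtype column from the same combined table, which is the point of option 2: the task name is the only thing that changes between runs.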

⚠️ The overlap of samples present in MC3 and having gene expression data is not complete (i.e. there are a few samples with gene expression but no mutation calls whatsoever in MC3). Theoretically, a sample could have been analyzed by MC3 but had 0 mutations to report and thus is not present in the MAF. However, I might regard such cases as highly suspect especially in BRCA and GBM. So for mutation prediction, we need to reduce our set of samples to only those with actual mutation calls in MC3 before splitting into testing/training (must have 0 or 1 mutation status, not NA). This was not a problem previously because all samples in our data have a subtype associated with them.
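
As a rough sketch of that filtering step (Python, standard library only): the sample IDs, the `None`-for-NA encoding, the 25% test fraction, and the helper names are assumptions made for illustration, not the project's actual splitting procedure.

```python
import random

# Hypothetical mutation status per sample: 1 = mutated, 0 = not mutated,
# None = sample absent from the MAF (no mutation calls at all, i.e. NA).
status = {
    "S1": 0, "S2": 1, "S3": None, "S4": 1,
    "S5": 0, "S6": None, "S7": 0, "S8": 1,
}

def usable_samples(status_by_sample):
    """Keep only samples with an actual 0/1 call; drop NA samples."""
    return sorted(s for s, v in status_by_sample.items() if v in (0, 1))

def split_train_test(samples, test_fraction=0.25, seed=0):
    """Shuffle reproducibly, then split into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Filter first, then split, so NA samples never reach either set.
kept = usable_samples(status)
train, test = split_train_test(kept)
print(len(kept), len(train), len(test))
```

Filtering before the split matters because dropping NA samples afterwards would change the train/test proportions differently for each gene.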

Beyond this, I want to know

  • What are the known and observed associations between subtypes and these mutations?
  • Which pathway genes are up- or down-regulated in mutated samples, and is this picked up by the model?
  • For samples that appear to be misclassified (especially samples with no mutation that are predicted to have one), what other aberrations might be causing a "mutation-like" gene expression profile?
@jaclyn-taroni
Collaborator

I agree that option 2 is better. It seems like it will be easier in the long term, so the additional up-front time cost may be worth it, but let's discuss this afternoon. Specifically, I'm interested in how much more work you think option 2 would be than option 1, to see if our perceptions are the same.

@envest
Contributor Author

envest commented Aug 19, 2021

At our virtual meeting, we discussed the following:

  • Option 2 is preferred
  • Okay to drop samples with no observed mutations in MC3
  • Drop PTEN due to potential signaling from copy number change (1/3 have deep deletion of PTEN in GBM, 1/2 in BRCA)
  • Follow up on associations between subtype and mutation. If there are associations, include subtype as a model covariate.
  • Pathway questions go beyond scope of this paper. Try to avoid misclassification problem by dropping PTEN.
  • For point 2 above (the pathway question), look at the TCGA DNA damage repair paper with respect to TP53 pathway genes
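
One way to start the subtype/mutation follow-up is a contingency table with a chi-square statistic. The sketch below is Python with only the standard library; the sample data and the helpers `contingency` and `chi_square` are hypothetical. It computes the statistic only (no p-value) and assumes every subtype and both mutation states occur at least once, so no expected count is zero.

```python
from collections import Counter

# Hypothetical (subtype, mutation status) pairs, one per sample.
samples = [
    ("Basal", 1), ("Basal", 1), ("Basal", 0),
    ("LumA", 0), ("LumA", 0), ("LumA", 1),
]

def contingency(pairs):
    """Map each subtype to (count not mutated, count mutated)."""
    counts = Counter(pairs)
    subtypes = sorted({subtype for subtype, _ in pairs})
    return {s: (counts[(s, 0)], counts[(s, 1)]) for s in subtypes}

def chi_square(table):
    """Pearson chi-square statistic for a subtype x {0, 1} table."""
    rows = list(table.values())
    col_totals = [sum(row[j] for row in rows) for j in range(2)]
    total = sum(col_totals)
    stat = 0.0
    for row in rows:
        row_total = sum(row)
        for j in range(2):
            # Expected count under independence of subtype and mutation.
            expected = row_total * col_totals[j] / total
            stat += (row[j] - expected) ** 2 / expected
    return stat

table = contingency(samples)
stat = chi_square(table)
print(table, stat)
```

A large statistic relative to the table's degrees of freedom would suggest an association, which is the case where including subtype as a model covariate was proposed.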
