I'm putting some thoughts down here as a working document and starting point for synchronous/asynchronous discussion 😺
Currently, this project does subtype classification as the supervised learning task. We want to expand to include prediction of mutation status in three genes: PIK3CA, PTEN, and TP53.
I see two main avenues to get this done.
1. Copy existing scripts into new scripts, which we edit to fit the mutation paradigm.
2. Modify existing scripts (and the structure of clinical data inputs) to make scripts work for subtype or mutation prediction.
I think option 1 might be easier up front, but option 2 is better because any change that applies to all supervised learning tasks would only need to be made in one place. Option 2 may even turn out to be easier.
Steps to accomplish modification of existing scripts (option 2):
- Combine clinical info (with subtype) and mutation data such that each sample has one row with all of its information. This way, one clinical file is read in for all prediction tasks and then the relevant column can be selected.
  - Mutations are currently encoded as 0/1, but could be "Has Mutation"/"No Mutation", or "TP53 Mutation"/"No Mutation".
- Add an option to the scripts for what is being predicted ("subtype" or a gene name corresponding to a column of the input clinical data).
  - Use the check_options() functions to make sure the given option is valid.
  - The script option would apply to steps 0, 1, 2, and 3.
- Each script creates output read by the next script. Use the prediction task ("subtype" or gene name) in the output file name so the next script knows what to read as input. Alternatively, create subdirectories for each task and keep file names the same.
- In general, replace "subtype" with "category" in variable names, etc. to make it clear the prediction task is not limited to subtype prediction.
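To make the steps above concrete, here is a minimal Python sketch of the idea. It is not the project's code. The names `check_predictor`, `combine_clinical`, and `output_path`, the column names, and the file naming scheme are all hypothetical. It assumes a sample absent from the mutation data gets a missing value (`None`), and that a sample present in the mutation data with no call for a gene counts as 0.

```python
from pathlib import Path

# Hypothetical task names; each must match a column of the combined clinical data.
VALID_TASKS = ("subtype", "PIK3CA", "PTEN", "TP53")


def check_predictor(task, columns):
    """Validate the requested prediction task against known tasks and available columns."""
    if task not in VALID_TASKS:
        raise ValueError(f"Unknown prediction task {task!r}; expected one of {VALID_TASKS}")
    if task not in columns:
        raise ValueError(f"Column {task!r} not found in clinical data")
    return task


def combine_clinical(subtypes, mutations):
    """Build one row per sample with subtype plus 0/1 mutation status per gene.

    subtypes:  {sample_id: subtype}
    mutations: {sample_id: {gene: 0 or 1}}
    Samples missing from `mutations` get None for every gene (assumption: NA).
    """
    genes = [task for task in VALID_TASKS if task != "subtype"]
    combined = {}
    for sample_id, subtype in subtypes.items():
        row = {"subtype": subtype}
        calls = mutations.get(sample_id)
        for gene in genes:
            # Assumption: a sample with calls but none for this gene is unmutated (0).
            row[gene] = None if calls is None else calls.get(gene, 0)
        combined[sample_id] = row
    return combined


def output_path(out_dir, stem, task, use_subdir=False):
    """Put the task in the file name, or in a subdirectory with a fixed file name."""
    if use_subdir:
        return Path(out_dir) / task / f"{stem}.tsv"
    return Path(out_dir) / f"{stem}.{task}.tsv"
```

With this shape, each script would take the task as its one extra option, validate it once, select that column as the label, and derive its input and output paths from the same value.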
⚠️ The overlap of samples present in MC3 and having gene expression data is not complete (i.e., there are a few samples with gene expression but no mutation calls whatsoever in MC3). Theoretically, a sample could have been analyzed by MC3 but had 0 mutations to report and thus is not present in the MAF. However, I would regard such cases as highly suspect, especially in BRCA and GBM. So for mutation prediction, we need to reduce our set of samples to only those with actual mutation calls in MC3 before splitting into testing/training (each sample must have a mutation status of 0 or 1, not NA). This was not a problem previously because all samples in our data have a subtype associated with them.
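The filter-then-split order described above can be sketched as follows. This is a minimal illustration, not the project's implementation. `filter_labeled` and `split_train_test` are hypothetical names, `None` stands in for NA, and the split is a plain seeded random split rather than whatever the existing scripts do.

```python
import random


def filter_labeled(labels):
    """Keep only samples whose label is present (None stands in for NA).

    Uses `is not None` so that a valid mutation status of 0 is kept.
    """
    return {sample: y for sample, y in labels.items() if y is not None}


def split_train_test(sample_ids, test_fraction=0.3, seed=0):
    """Shuffle sample IDs reproducibly and split them into (train, test) lists."""
    ids = sorted(sample_ids)  # sort first so the result depends only on the seed
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_test = round(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]


# Example: S3 has expression data but no mutation calls, so it is dropped first.
tp53_status = {"S1": 1, "S2": 0, "S3": None, "S4": 0}
labeled = filter_labeled(tp53_status)
train_ids, test_ids = split_train_test(labeled, test_fraction=1 / 3)
```

Filtering before the split keeps the requested train/test proportions, since unlabeled samples are never counted toward either set.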
Beyond this, I want to know:
- What are the known and observed associations between subtypes and these mutations?
- Which pathway genes are up- or down-regulated in mutated samples, and is this picked up by the model?
- For samples that are seemingly misclassified (especially samples without a mutation that are predicted to have one), what other aberrations might be causing a "mutation-like" gene expression profile?
I agree that option 2 is better. To me, it seems like that will be easier long term such that the additional time cost up front may be worth it, but let's discuss this afternoon. Specifically, I'm interested in how much more work you think option 2 would be over option 1 to see if our perceptions are the same.