-
Notifications
You must be signed in to change notification settings - Fork 58
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
First pass at parameterizing script inputs used in tp53_analysis.sh #98
Conversation
…and to specify custom --drop_x_genes (similar to --drop_rasopathy)
Thanks for the contribution @blankenberg ! My PhD defense is one week from today, so there will be some delays in giving this a thorough look |
No worries, there is no rush. Please focus on your defense. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again for your interest and contributions @blankenberg !
I made several comments that require addressing before we merge. My apologies for the delay!
scripts/apply_weights.py
Outdated
|
||
parser.add_argument('-x', '--x_matrix', default=None, | ||
help='Filename of features to use in model') | ||
parser.add_argument( '--filename_mut', default=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you remove leading space character in lines 37 - 48? (between (
and '
)
copy_loss_file = os.path.join('data', 'copy_number_loss_status.tsv') | ||
copy_gain_file = os.path.join('data', 'copy_number_gain_status.tsv') | ||
mutation_burden_file = os.path.join('data', 'mutation_burden_freeze.tsv') | ||
rnaseq_file = args.x_matrix or os.path.join('data', 'pancan_rnaseq_freeze.tsv') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this addition 👍
@@ -65,7 +81,11 @@ | |||
if line[0] == 'Diseases:': | |||
diseases = line[1:] | |||
if line[0] == 'Coefficients:': | |||
coef_df = pd.read_table(line[1]) | |||
coef_df = line[1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way the summary file was constructed (e.g. here) implies that the coefficients for the specific classifier (of which to apply the weights using) was already generated.
I guess I am confused about the specific scenario where the classifier file doesn't exist, but the summary file does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If e.g. I use a job staging submission process on a cluster, the Job that writes classifier_summary.txt
might do so in a job/node specific directory /a/b/c/
, which would set the value for Coefficients
to /a/b/c/classifier_coefficients.tsv
in the classifier summary file. The written directories and files, can then be staged back out to a persistent directory, maybe /x/y/z/
.
On a subsequent job, that makes use of the classifier, it may stage files in at /q/r/s/
(or use /x/y/z/
directly), so /q/r/s/classifier_summary.txt
(loaded by user provided file path) has a Coefficients
value that points to /a/b/c/classifier_coefficients.tsv
, but /a/b/c
does not exist, the file is actually at /q/r/s/classifier_coefficients.tsv
.
scripts/copy_burden_figures.R
Outdated
) | ||
) | ||
|
||
opt <-parse_args(OptionParser(option_list = option_list)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add space between <-
and parse_args(
?
snaptron_file <- file.path("scripts", "snaptron", | ||
"junctions_with_mutations.csv.gz") | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove extra space
scripts/util/tcga_util.py
Outdated
@@ -61,10 +63,24 @@ def get_args(): | |||
help='Remove mutation data from y matrix') | |||
parser.add_argument('-z', '--drop_rasopathy', action='store_true', | |||
help='Decision to drop rasopathy genes from X matrix') | |||
parser.add_argument( '--drop_x_genes', default=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also remove extra space
@@ -41,8 +41,31 @@ | |||
parser.add_argument('-f', '--alt_folder', default='Auto', | |||
help='location to save') | |||
|
|||
parser.add_argument('-x', '--x_matrix', default=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can be added starting in line 43
scripts/within_tissue_analysis.py
Outdated
@@ -41,8 +41,31 @@ | |||
parser.add_argument('-f', '--alt_folder', default='Auto', | |||
help='location to save') | |||
|
|||
parser.add_argument('-x', '--x_matrix', default=None, | |||
help='Filename of features to use in model') | |||
parser.add_argument( '--filename_mut', default=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove extra spaces
scripts/within_tissue_analysis.py
Outdated
help='Filename of features to use in model') | ||
parser.add_argument( '--filename_mut', default=None, | ||
help='Filename of sample/gene mutations to use in model') | ||
parser.add_argument( '--filename_mut_burden', default=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it looks like these filename flags are repeated often. Can we make a separate file like tcga_util.get_args
?
@@ -74,4 +97,9 @@ | |||
'--alt_folder', alt_folder, '--shuffled', '--keep_intermediate'] | |||
if remove_hyper: | |||
command += ['--remove_hyper'] | |||
# Only set filename if it has been set |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cool
I am just closing out old PRs. Thanks again for your previous efforts. |
The primary goal of this PR is to allow control of input and output paths and to e.g. enable execution on a file staging cluster setup.
All changes should be 100% backwards compatible, but I did not find a test suite that I could run to doubly confirm.
Also addresses some quirks when e.g. running on a file staging cluster, where files are created in temporary directories (so full path to
classifier_coefficients.tsv
declared in files may no longer exist). If the file DNE, it looks for a copy local to the current classifier folder structure.I added an argument
--x_as_raw
to pancancer_classifier.py to handle the special casing that was keyed based upon using default value for--x_matrix
; which israw
and causespancan_rnaseq_freeze.tsv.gz
to be used. if you had specifiedpancan_rnaseq_freeze.tsv.gz
directly as input, then the 'raw' actions were never performed; now you can do both. Probably a better, more descriptive, terminology thanraw
could be proposed.Add
--drop_x_genes
to pancancer_classifier.py to enable custom list of genes to be dropped, similar to rasopathy_genes. Also explicitly setmatplotlib.use('agg')
to work on headless systems.An upgrade to pandas 0.24.0+ will give 'infer' option for compression in
to_csv()
which is now currently being worked around for gzip outputs. Pandas currently pinned at 0.23.0 inenvironment.yml
.In cases of jupyter notebooks that are run as scripts, I parameterize the execution with environmental variables.