First pass at parameterizing script inputs used in tp53_analysis.sh #98

blankenberg · 2019-02-01T21:49:12Z

The primary goal of this PR is to allow control of input and output paths and to e.g. enable execution on a file staging cluster setup.

All changes should be 100% backwards compatible, but I did not find a test suite that I could run to doubly confirm.

This also addresses some quirks of running on a file staging cluster, where files are created in temporary directories, so the full path to classifier_coefficients.tsv declared in the classifier summary file may no longer exist. If the file does not exist, the script looks for a copy local to the current classifier folder structure.

I added an argument --x_as_raw to pancancer_classifier.py to handle the special casing that was keyed on using the default value for --x_matrix, which is raw and causes pancan_rnaseq_freeze.tsv.gz to be used. Previously, if you specified pancan_rnaseq_freeze.tsv.gz directly as input, the 'raw' actions were never performed; now you can do both. A better, more descriptive term than raw could probably be proposed.
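To illustrate the intended behaviour, here is a minimal sketch of how the two options might interact. The helper name `resolve_x_matrix`, the `data` directory, and the help strings are assumptions for illustration, not the code in this PR.

```python
import argparse
import os


def build_parser():
    parser = argparse.ArgumentParser()
    # Default of 'raw' follows the description above.
    parser.add_argument('-x', '--x_matrix', default='raw',
                        help='Filename of features to use in model')
    parser.add_argument('--x_as_raw', action='store_true',
                        help='Treat the supplied x_matrix as "raw"')
    return parser


def resolve_x_matrix(args):
    """Return (path, treat_as_raw) for the parsed arguments (hypothetical helper)."""
    if args.x_matrix == 'raw':
        # Assumed location of the default expression matrix.
        return os.path.join('data', 'pancan_rnaseq_freeze.tsv.gz'), True
    # An explicit file is only treated as raw when the flag is given.
    return args.x_matrix, args.x_as_raw
```

With this shape, omitting `--x_matrix` keeps the old behaviour, while passing a file plus `--x_as_raw` applies the 'raw' handling to a custom path.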

Added --drop_x_genes to pancancer_classifier.py to enable a custom list of genes to be dropped, similar to rasopathy_genes. Also explicitly set matplotlib.use('agg') so that plotting works on headless systems.

Upgrading to pandas 0.24.0+ would provide the 'infer' option for compression in to_csv(), which is currently being worked around for gzip outputs. Pandas is currently pinned at 0.23.0 in environment.yml.
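One way such a workaround might look is to pick the compression argument from the file extension. This is a sketch only; the helper name `compression_for` is hypothetical and the PR may do this differently.

```python
def compression_for(path):
    """Return a value for the `compression` argument of DataFrame.to_csv().

    Only gzip is handled here, matching the case described above.
    """
    return 'gzip' if str(path).endswith('.gz') else None


# Usage with pandas (not imported, to keep the sketch dependency-free):
#   df.to_csv(path, sep='\t', compression=compression_for(path))
```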

For Jupyter notebooks that are run as scripts, I parameterize execution with environment variables.
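A minimal sketch of that pattern, assuming a small helper that falls back to the original hard-coded value when the variable is unset. The helper name `get_param` and the variable name in the usage comment are made up for illustration.

```python
import os


def get_param(name, default):
    """Read a notebook parameter from the environment, else use the default."""
    value = os.environ.get(name)
    # Treat an empty string the same as an unset variable.
    return value if value else default


# Hypothetical usage inside a notebook cell:
#   rnaseq_file = get_param('PANCAN_RNASEQ_FILE',
#                           os.path.join('data', 'pancan_rnaseq_freeze.tsv'))
```

Because the default is the original value, running the notebook without any variables set behaves as before.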


blankenberg · 2019-02-01T21:50:51Z

Another benefit/goal is to enable usage inside of Galaxy, example TP53 workflow:

gwaybio · 2019-02-04T15:46:49Z

Thanks for the contribution @blankenberg ! My PhD defense is one week from today, so there will be some delays in giving this a thorough look

blankenberg · 2019-02-04T18:30:26Z

No worries, there is no rush. Please focus on your defense.

gwaybio

Thanks again for your interest and contributions @blankenberg !

I made several comments that require addressing before we merge. My apologies for the delay!

gwaybio · 2019-04-04T16:05:02Z

scripts/apply_weights.py

+
+parser.add_argument('-x', '--x_matrix', default=None,
+                    help='Filename of features to use in model')
+parser.add_argument( '--filename_mut', default=None,


Can you remove the leading space character in lines 37 - 48 (between ( and ')?

gwaybio · 2019-04-04T16:06:20Z

scripts/apply_weights.py

-copy_loss_file = os.path.join('data', 'copy_number_loss_status.tsv')
-copy_gain_file = os.path.join('data', 'copy_number_gain_status.tsv')
-mutation_burden_file = os.path.join('data', 'mutation_burden_freeze.tsv')
+rnaseq_file = args.x_matrix or os.path.join('data', 'pancan_rnaseq_freeze.tsv')


I like this addition 👍

gwaybio · 2019-04-04T16:09:55Z

scripts/apply_weights.py

@@ -65,7 +81,11 @@
        if line[0] == 'Diseases:':
            diseases = line[1:]
        if line[0] == 'Coefficients:':
-            coef_df = pd.read_table(line[1])
+            coef_df = line[1]


The way the summary file was constructed (e.g. here) implies that the coefficients for the specific classifier (the one whose weights are being applied) were already generated.

I guess I am confused about the specific scenario where the classifier file doesn't exist, but the summary file does.

blankenberg · 2019-04-30

If, e.g., I use a job staging submission process on a cluster, the job that writes classifier_summary.txt might do so in a job/node-specific directory /a/b/c/, which would set the value for Coefficients to /a/b/c/classifier_coefficients.tsv in the classifier summary file. The written directories and files can then be staged back out to a persistent directory, maybe /x/y/z/.

A subsequent job that makes use of the classifier may stage files in at /q/r/s/ (or use /x/y/z/ directly), so /q/r/s/classifier_summary.txt (loaded from the user-provided file path) has a Coefficients value that points to /a/b/c/classifier_coefficients.tsv. But /a/b/c does not exist; the file is actually at /q/r/s/classifier_coefficients.tsv.
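The fallback described in this scenario could be sketched roughly as follows. The function name `resolve_coefficients_path` is hypothetical, and this is an illustration of the idea rather than the exact code in the PR.

```python
import os


def resolve_coefficients_path(recorded_path, summary_file):
    """Return a usable path to the coefficients file.

    recorded_path: value of the Coefficients entry in the summary file.
    summary_file: path to the summary file as supplied by the user.
    """
    if os.path.exists(recorded_path):
        return recorded_path
    # Fall back to a file with the same name next to the summary file.
    summary_dir = os.path.dirname(os.path.abspath(summary_file))
    local_copy = os.path.join(summary_dir, os.path.basename(recorded_path))
    if os.path.exists(local_copy):
        return local_copy
    # Nothing found: return the recorded path so any later error names it.
    return recorded_path
```

The recorded path still wins whenever it exists, so runs that do not involve file staging are unaffected.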

gwaybio · 2019-04-04T16:10:34Z

scripts/copy_burden_figures.R

+  )
+)
+
+opt <-parse_args(OptionParser(option_list = option_list))


Can you add a space between <- and parse_args(?

gwaybio · 2019-04-04T16:11:02Z

scripts/copy_burden_figures.R

+  snaptron_file <- file.path("scripts", "snaptron",
+                             "junctions_with_mutations.csv.gz")
+}
+


remove extra space

gwaybio · 2019-04-04T16:24:16Z

scripts/util/tcga_util.py

@@ -61,10 +63,24 @@ def get_args():
                        help='Remove mutation data from y matrix')
    parser.add_argument('-z', '--drop_rasopathy', action='store_true',
                        help='Decision to drop rasopathy genes from X matrix')
+    parser.add_argument( '--drop_x_genes', default=None,


Let's use nargs='+'.

This will load a list instead of a comma-separated string. This will help here

Also remove the extra space.
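For reference, a small standalone example of what nargs='+' does with argparse. The gene names and help text are illustrative only.

```python
import argparse

parser = argparse.ArgumentParser()
# nargs='+' collects one or more space-separated values into a list.
parser.add_argument('--drop_x_genes', nargs='+', default=None,
                    help='Gene names to drop from the X matrix')

args = parser.parse_args(['--drop_x_genes', 'TP53', 'KRAS'])
print(args.drop_x_genes)  # ['TP53', 'KRAS']
```

When the flag is omitted, the value stays None, so no string splitting is needed downstream.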

gwaybio · 2019-04-04T16:24:57Z

scripts/within_tissue_analysis.py

@@ -41,8 +41,31 @@
 parser.add_argument('-f', '--alt_folder', default='Auto',
                    help='location to save')

+parser.add_argument('-x', '--x_matrix', default=None,


this can be added starting in line 43

gwaybio · 2019-04-04T16:25:06Z

scripts/within_tissue_analysis.py

@@ -41,8 +41,31 @@
 parser.add_argument('-f', '--alt_folder', default='Auto',
                    help='location to save')

+parser.add_argument('-x', '--x_matrix', default=None,
+                    help='Filename of features to use in model')
+parser.add_argument( '--filename_mut', default=None,


remove extra spaces

gwaybio · 2019-04-04T16:26:08Z

scripts/within_tissue_analysis.py

+                    help='Filename of features to use in model')
+parser.add_argument( '--filename_mut', default=None,
+                    help='Filename of sample/gene mutations to use in model')
+parser.add_argument( '--filename_mut_burden', default=None,


It looks like these filename flags are repeated often. Can we make a separate file like tcga_util.get_args?
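One possible shape for this, sketched as a shared helper that each script calls on its own parser. The function name `add_filename_args` and the help text for --filename_mut_burden are assumptions; only the flag names come from the diff above.

```python
import argparse


def add_filename_args(parser):
    """Attach the repeated input-filename flags to an existing parser."""
    parser.add_argument('-x', '--x_matrix', default=None,
                        help='Filename of features to use in model')
    parser.add_argument('--filename_mut', default=None,
                        help='Filename of sample/gene mutations to use in model')
    # Help text below is assumed; it is not shown in the diff.
    parser.add_argument('--filename_mut_burden', default=None,
                        help='Filename of sample mutation burden to use in model')
    return parser


# Each script would then build its parser as usual and call the helper once.
parser = add_filename_args(argparse.ArgumentParser())
args = parser.parse_args(['--filename_mut', 'mutations.tsv'])
```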

gwaybio · 2019-04-04T16:26:49Z

scripts/within_tissue_analysis.py

@@ -74,4 +97,9 @@
               '--alt_folder', alt_folder, '--shuffled', '--keep_intermediate']
    if remove_hyper:
        command += ['--remove_hyper']
+    # Only set filename if it has been set


blankenberg · 2021-03-24T14:43:14Z

I am just closing out old PRs. Thanks again for your previous efforts.

blankenberg added 5 commits January 30, 2019 16:58

Allow setting all input filenames in scripts/pancancer_classifier.py …

f2e9f2f

…and to specify custom --drop_x_genes (similar to --drop_rasopathy)

Allow forcing x-matrix to be treated as "raw".

e1dafc9

add --x_as_raw flag to Treat x_matrix as "raw"

30f237e

set matplotlib.use(agg)

c2c4546

First pass at parameterizing inputs and files to run through tp53_ana…

1c7e6c6

…lysis.sh

matplotlib.use(agg) in scripts/visualize_decisions.py

1162ed3

gwaybio self-requested a review February 4, 2019 15:47

gwaybio mentioned this pull request Apr 4, 2019

Upgrade Pandas to 0.24 #100

Open

gwaybio requested changes Apr 4, 2019


blankenberg added 4 commits April 30, 2019 12:25

Whitespace fixes.

efba673

remove extra newline

aabe59b

--drop_x_genes to use nargs=+

b026a75

Add source for RASOPATHY_GENES and whitespace fixes.

cf7d774

blankenberg closed this Mar 24, 2021
