Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First pass at parameterizing script inputs used in tp53_analysis.sh #98

Closed
wants to merge 10 commits into from

Conversation

blankenberg
Copy link
Contributor

The primary goal of this PR is to allow control of input and output paths and to e.g. enable execution on a file staging cluster setup.

All changes should be 100% backwards compatible, but I did not find a test suite that I could run to doubly confirm.

Also addresses some quirks when e.g. running on a file staging cluster, where files are created in temporary directories (so full path to classifier_coefficients.tsv declared in files may no longer exist). If the file DNE, it looks for a copy local to the current classifier folder structure.

I added an argument --x_as_raw to pancancer_classifier.py to handle the special casing that was keyed based upon using default value for --x_matrix; which is raw and causes pancan_rnaseq_freeze.tsv.gz to be used. if you had specified pancan_rnaseq_freeze.tsv.gz directly as input, then the 'raw' actions were never performed; now you can do both. Probably a better, more descriptive, terminology than raw could be proposed.

Add --drop_x_genes to pancancer_classifier.py to enable custom list of genes to be dropped, similar to rasopathy_genes. Also explicitly set matplotlib.use('agg') to work on headless systems.

An upgrade to pandas 0.24.0+ will give 'infer' option for compression in to_csv() which is now currently being worked around for gzip outputs. Pandas currently pinned at 0.23.0 in environment.yml.

In cases of jupyter notebooks that are run as scripts, I parameterize the execution with environmental variables.

@blankenberg
Copy link
Contributor Author

Another benefit/goal is to enable usage inside of Galaxy, example TP53 workflow:
image

@gwaybio
Copy link
Collaborator

gwaybio commented Feb 4, 2019

Thanks for the contribution @blankenberg ! My PhD defense is one week from today, so there will be some delays in giving this a thorough look

@gwaybio gwaybio self-requested a review February 4, 2019 15:47
@blankenberg
Copy link
Contributor Author

No worries, there is no rush. Please focus on your defense.

@gwaybio gwaybio mentioned this pull request Apr 4, 2019
Copy link
Collaborator

@gwaybio gwaybio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again for your interest and contributions @blankenberg !

I made several comments that require addressing before we merge. My apologies for the delay!


parser.add_argument('-x', '--x_matrix', default=None,
help='Filename of features to use in model')
parser.add_argument( '--filename_mut', default=None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you remove leading space character in lines 37 - 48? (between ( and ')

copy_loss_file = os.path.join('data', 'copy_number_loss_status.tsv')
copy_gain_file = os.path.join('data', 'copy_number_gain_status.tsv')
mutation_burden_file = os.path.join('data', 'mutation_burden_freeze.tsv')
rnaseq_file = args.x_matrix or os.path.join('data', 'pancan_rnaseq_freeze.tsv')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this addition 👍

@@ -65,7 +81,11 @@
if line[0] == 'Diseases:':
diseases = line[1:]
if line[0] == 'Coefficients:':
coef_df = pd.read_table(line[1])
coef_df = line[1]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way the summary file was constructed (e.g. here) implies that the coefficients for the specific classifier (of which to apply the weights using) was already generated.

I guess I am confused about the specific scenario where the classifier file doesn't exist, but the summary file does.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If e.g. I use a job staging submission process on a cluster, the Job that writes classifier_summary.txt might do so in a job/node specific directory /a/b/c/, which would set the value for Coefficients to /a/b/c/classifier_coefficients.tsv in the classifier summary file. The written directories and files, can then be staged back out to a persistent directory, maybe /x/y/z/.

On a subsequent job, that makes use of the classifier, it may stage files in at /q/r/s/ (or use /x/y/z/ directly), so /q/r/s/classifier_summary.txt (loaded by user provided file path) has a Coefficients value that points to /a/b/c/classifier_coefficients.tsv, but /a/b/c does not exist, the file is actually at /q/r/s/classifier_coefficients.tsv.

)
)

opt <-parse_args(OptionParser(option_list = option_list))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add space between <- and parse_args(?

snaptron_file <- file.path("scripts", "snaptron",
"junctions_with_mutations.csv.gz")
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove extra space

@@ -61,10 +63,24 @@ def get_args():
help='Remove mutation data from y matrix')
parser.add_argument('-z', '--drop_rasopathy', action='store_true',
help='Decision to drop rasopathy genes from X matrix')
parser.add_argument( '--drop_x_genes', default=None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets use nargs='+'.

This will load a list instead of a comma separated string. This will help here

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also remove extra space

@@ -41,8 +41,31 @@
parser.add_argument('-f', '--alt_folder', default='Auto',
help='location to save')

parser.add_argument('-x', '--x_matrix', default=None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be added starting in line 43

@@ -41,8 +41,31 @@
parser.add_argument('-f', '--alt_folder', default='Auto',
help='location to save')

parser.add_argument('-x', '--x_matrix', default=None,
help='Filename of features to use in model')
parser.add_argument( '--filename_mut', default=None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove extra spaces

help='Filename of features to use in model')
parser.add_argument( '--filename_mut', default=None,
help='Filename of sample/gene mutations to use in model')
parser.add_argument( '--filename_mut_burden', default=None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks like these filename flags are repeated often. Can we make a separate file like tcga_util.get_args?

@@ -74,4 +97,9 @@
'--alt_folder', alt_folder, '--shuffled', '--keep_intermediate']
if remove_hyper:
command += ['--remove_hyper']
# Only set filename if it has been set
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool

@blankenberg
Copy link
Contributor Author

I am just closing out old PRs. Thanks again for your previous efforts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants