Report on but do not error when patient has no RNA data #245

jburos · 2017-08-18T01:03:04Z

Here I'm trying to achieve a goal of allowing an analysis on a cohort where some but not all records have RNA sequencing data. Currently, if one is using a summary function like cohorts.functions.expressed_missense_snv_count, you will see an exception if some but not all records have RNA expression data.

Instead, we want to print a warning for these patients and populate the resulting df with np.NaN for any patient that does not contain RNA sequencing data where the filter function depends on it. This behavior would then be consistent with (for example) that of other summary functions like missense_snv_count for patients that do not have VCF files.

Aside: The correct way to handle this in most situations is to require that the cohort be filtered to those Patients having RNA data prior to analysis. This would make the exception desirable. However there are scenarios (like when using as_dataframe) where one might desire a different behavior. This however, runs the risk of a user ignoring the INFO messages & proceeding with analysis in error. This might be something we want to allow the user to configure, ie whether to allow NaN values in the cohort or not.

At any rate, for now, this PR does a few things:

Create new error classes for missing BAM files of various types
Catch those errors if they are used to filter variants or effects (not sure if this should apply to polyphen?)
Report those errors as info using the logger.

Also, for the sake of consistency, this PR:

Modifies earlier print statements when VCFs are not found for a patient to also use these error classes & report as INFO using the logger (previously these were print statements).

Note that I refactored some parts of the Cohort._load_single_patient_variants and Cohort._load_single_patient_effects in order to throw & report on these scenarios.

…t scenario

This reverts commit 164c106.

coveralls · 2017-08-18T14:34:40Z

Coverage decreased (-0.2%) to 52.347% when pulling 7114c6e on fix-expression-errors into ea31881 on master.

jburos · 2017-08-18T21:27:44Z

Just noticed that these are all reported using logger.info() ; I wonder if they should instead be warnings? Edited the text above to reflect this.

coveralls · 2017-08-18T21:52:38Z

Coverage decreased (-0.2%) to 52.347% when pulling cd9e2ff on fix-expression-errors into ea31881 on master.

tavinathanson · 2017-08-20T20:55:22Z

@jburos I generally agree with your aside. In lung, I just have load_cohort(only_rna=True) and proceed from there. I'm not sure what the downsides are of that?

tavinathanson

See comment above

jburos · 2017-08-21T10:53:45Z

@tavinathanson Thx. First this is a question of consistency, since this PR extends the logic currently in place for many functions to another class of functions.

But to elaborate on my aside, above, there are two primary/immediate reasons why this is undesirable for my use case - mainly when using as_dataframe to export data for a cohort.

In some cases, I want to compare attributes (e.g. survival) for patients with & without missing data as part of an exploratory analysis. Without the ability to export data for the full cohort, including NaN values for missing data, I am unable to do this type of analysis.
Many of my analysis make use of data where values are missing, either to impute data for these values or include the records to (e.g.) stabilize survival estimates.

In general, when for example working in R, I often want to export my data once & then move forward with analysis using a data frame in which I can see the full cohort's data. Re-exporting data for each analysis is not only inefficient, but also runs the risk of inconsistencies in computed values across exported instances.

Separately, I can imagine a use case where you might want to initialize a cohort once (which can take a non-trivial amt of time, even with caching) & then loop through various analyses irrespective of the fact that some data are missing for some patients in some analyses. This is where an unsuspecting user might ignore or forget that the sample sizes are different for each analysis. Maybe we want to go through each of our analyses & report on the sample size for each?

This is not always a problem by the way - for example, in our bladder-cohort analyses we didn't restrict all analyses to the same subset of patients as has all data available.

At any rate in general I propose we move this discussion to an issue (or set of issues), since they are somewhat beyond the scope of this PR. For now, I see two open questions:

At what level to report these messages? As INFO or WARNING (warning actually seems more appropriate to me)?
Do we want consistency across the various functions in how these are handled? If so, is this the right approach to achieve this?

coveralls · 2017-08-22T10:53:41Z

Coverage decreased (-0.2%) to 52.314% when pulling 1eb7199 on fix-expression-errors into ea31881 on master.

tavinathanson · 2017-08-22T17:55:11Z

@jburos fair points. Since we have different use cases, and I want to make sure people don't inadvertently have missing data because they misspelled a BAM filename, how about we gate this on a new Cohort.fail_on_missing_bams that defaults to True?

jburos · 2017-08-22T20:01:38Z

@tavinathanson then, why not do the same for VCFs, etc? what about an allow_nan which defaults to False? If True, then NaN values in any of these fields would be "OK"

tavinathanson · 2017-08-22T20:03:09Z

@jburos I was just worried about confusion with clinical data, since we do allow that to be NaN.

jburos · 2017-08-22T20:15:35Z

@tavinathanson First, we should probably separate the issue of allowing NaN for variants but not expressed variants, vs the name of the parameter ;) my fault for confounding this.

I care less about the parameter name & more about consistency across summary-functions. Even if the flag were fail_on_missing_bams, shouldn't we also fail on missing vcfs? Or would this be a separate flag?

Regardless, was thinking when I wrote the aside that this would be nice to customize at the level of the analysis (ie within plot_benefit) as well as a cohort-wide default.

I was also thinking (one could argue) that this should apply also to clinical data -- ie fine for age to be missing for some patients until we are using it in an analysis.

Finally, I should probably separate the issues of a misspelled bamfile name from that of a patient not having a bamfile (i should rename the exception to reflect this). I'm pretty sure the misspelled-bam-file case will result in a different error at the moment.

… hla, and variants

jburos · 2017-08-23T00:03:41Z

@tavinathanson also FYI (no rush) I'm done making changes to this one, for whenever you are available to take a look. We should really add some test cases to this -- going to create an issue for that.

coveralls · 2017-08-23T00:06:32Z

Coverage increased (+0.2%) to 52.704% when pulling de9301d on fix-expression-errors into ea31881 on master.

coveralls · 2017-08-23T00:52:55Z

Coverage increased (+0.2%) to 52.704% when pulling b7c8139 on fix-expression-errors into ea31881 on master.

tavinathanson · 2017-08-23T14:59:34Z

cohorts/cohort.py

            no_variants = True

        # Note that this is the number of variant collections and not the number of
        # variants. 0 variants will lead to 0 neoantigens, for example, but 0 variant
        # collections will lead to NaN variants and neoantigens.
        if no_variants:
-            print("Variants did not exist for patient %s" % patient.id)
-            merged_variants = None
+            raise MissingVariantFile(patient_id=patient.id)


There might be some repeated logic here; so many checks. Maybe file an issue to clean up this function? Don't want to delay the PR.

Agreed - this function does need to be cleaned up. But, yep that would belong in a different PR

coveralls · 2017-08-30T11:47:12Z

Coverage increased (+0.1%) to 52.856% when pulling 3c5e86a on fix-expression-errors into f9f4af2 on master.

coveralls · 2017-09-07T02:17:19Z

Coverage increased (+0.1%) to 53.019% when pulling 4a26b14 on fix-expression-errors into 5ed28a4 on master.

jburos added 6 commits August 17, 2017 14:15

add error classes for bamfiles not found (so they can be caught)

08edb48

catch RNABamFileNotFound exceptions when processing expressed variants

290e4a2

modify print->logger.info statements; also catch & report on no-effec…

5985204

…t scenario

catch error in filter_effects, etc to distinguish None from 0

7cab29f

remove unused imports

164c106

Revert "remove unused imports"

7114c6e

This reverts commit 164c106.

jburos changed the title ~~Warn but do not error when patient has no RNA data~~ Report on but do not error when patient has no RNA data Aug 18, 2017

fix quote chars

cd9e2ff

jburos requested a review from tavinathanson August 18, 2017 21:43

tavinathanson reviewed Aug 20, 2017

View reviewed changes

catch missing bamfile error for expressed neoantigens

1eb7199

jburos added 4 commits August 22, 2017 18:16

rename error classes; add cohort config options fail_on_missing_bams,…

5068cf4

… hla, and variants

catch errors in expressed_variant_set as well

c0d83dc

implement fail_on_missing_bams in filter_variants

a670f57

fix lingering BamFileNotFound errors

de9301d

jburos mentioned this pull request Aug 23, 2017

Test new parameters - fail_on_missing_bams, etc #250

Open

change repr to str?

b7c8139

tavinathanson reviewed Aug 23, 2017

View reviewed changes

jburos mentioned this pull request Aug 23, 2017

Refactor _load_single_patient_variants function #251

Open

Merge branch 'master' into fix-expression-errors

3c5e86a

Merge branch 'master' into fix-expression-errors

4a26b14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Report on but do not error when patient has no RNA data #245

Report on but do not error when patient has no RNA data #245

jburos commented Aug 18, 2017 •

edited

coveralls commented Aug 18, 2017 •

edited

jburos commented Aug 18, 2017

coveralls commented Aug 18, 2017 •

edited

tavinathanson commented Aug 20, 2017

tavinathanson left a comment

jburos commented Aug 21, 2017

coveralls commented Aug 22, 2017 •

edited

tavinathanson commented Aug 22, 2017

jburos commented Aug 22, 2017 •

edited

tavinathanson commented Aug 22, 2017

jburos commented Aug 22, 2017

jburos commented Aug 23, 2017

coveralls commented Aug 23, 2017 •

edited

coveralls commented Aug 23, 2017 •

edited

tavinathanson Aug 23, 2017

jburos Aug 23, 2017

coveralls commented Aug 30, 2017 •

edited

coveralls commented Sep 7, 2017 •

edited

Report on but do not error when patient has no RNA data #245

Are you sure you want to change the base?

Report on but do not error when patient has no RNA data #245

Conversation

jburos commented Aug 18, 2017 • edited

coveralls commented Aug 18, 2017 • edited

jburos commented Aug 18, 2017

coveralls commented Aug 18, 2017 • edited

tavinathanson commented Aug 20, 2017

tavinathanson left a comment

Choose a reason for hiding this comment

jburos commented Aug 21, 2017

coveralls commented Aug 22, 2017 • edited

tavinathanson commented Aug 22, 2017

jburos commented Aug 22, 2017 • edited

tavinathanson commented Aug 22, 2017

jburos commented Aug 22, 2017

jburos commented Aug 23, 2017

coveralls commented Aug 23, 2017 • edited

coveralls commented Aug 23, 2017 • edited

tavinathanson Aug 23, 2017

Choose a reason for hiding this comment

jburos Aug 23, 2017

Choose a reason for hiding this comment

coveralls commented Aug 30, 2017 • edited

coveralls commented Sep 7, 2017 • edited

jburos commented Aug 18, 2017 •

edited

coveralls commented Aug 18, 2017 •

edited

coveralls commented Aug 18, 2017 •

edited

coveralls commented Aug 22, 2017 •

edited

jburos commented Aug 22, 2017 •

edited

coveralls commented Aug 23, 2017 •

edited

coveralls commented Aug 23, 2017 •

edited

coveralls commented Aug 30, 2017 •

edited

coveralls commented Sep 7, 2017 •

edited