Refactor cohorts #14

tavinathanson · 2016-05-01T02:07:27Z

Highlights:

Instead of passing in a million arguments to a single Cohort object, split that out into Patient, Sample, etc. This also makes me less worried about the ordering of sample_ids, tumor_bam_ids, etc.; before, it would have been easier to accidentally re-order those lists IMO.
Since a VCF path didn't make sense to be specified as a parameter of a Sample (where the Sample could be tumor or normal), I created PairedSample.
A Cohort is a Collection of Patients.
Move away from passing path functions around. Paths are now passed in at the time of Sample or PairedSample creation.
I added logic for sending extra dataframes to the Cohort for joining; that's the join_with (which dataframe to join) and join_how (how to join). Dataframes get stored in a Cohort and can be joined with the rest of the data on demand. For example: cohort = data.init_cohort(join_with=["pdl1", "tcr"], join_how="inner") or cohort = data.init_cohort(join_with="cibersort").
I tried, wherever I could, to make the logic work for situations where patient ID != paired sample ID != sample ID, though I don't yet have tests for this. It's not really my priority right now; I just got roped into it as a result of the new objects.
Move cufflinks loading to cohorts.
The isovar stuff hasn't changed since Expressed neoantigens and other minor changes #11.
Add a mannwhitneyu test, and set two-sided to be the default.
Plotting functions return the plot for further manipulation.
Fixes Possible issue: sample_id as int vs str #1 by forcing and asserting string IDs.

I tried to keep the consumption part of the API (vs. loading in the data) mostly the same; things this breaks:

self.clinical_dataframe is now self.as_dataframe()
Anything that touches the Cohort object.

Also note that some of the CohortDataFrame and group_by logic is a bit confusing. I'd like to simplify it and better document it at some point, but would prefer to defer to another PR on that.

coveralls · 2016-05-01T02:19:10Z

Coverage decreased (-3.0%) to 52.93% when pulling 41dec1a on isovar into 2a06c6b on master.

coveralls · 2016-05-01T04:13:18Z

Coverage decreased (-1.2%) to 54.699% when pulling 4d39cac on isovar into 2a06c6b on master.

arahuja · 2016-05-01T18:53:05Z

cohorts/load.py

+    if group_by not in ["patient", "paired_sample"]:
+        raise ValueError("Invalid group_by: %s" % group_by)
+
+def first_not_none(params, default):


Didn't understand this usage above

Yeah, sorry for the lack of comments. Wanted to get the PR out there but will update.

tavinathanson · 2016-05-01T19:31:37Z

@arahuja curious to get your thoughts on Sample, PairedSample and Patient: can these be more clearly named/organized? I think Sample vs. PairedSample is pretty confusing.

P.S. I may just get rid of a bunch of confusing logic and assume, for now, that every patient has just a single associated tumor and normal sample.

arahuja · 2016-05-01T19:35:59Z

curious to get your thoughts on Sample, PairedSample and Patient: can these be more clearly named/organized? I think Sample vs. PairedSample is pretty confusing.

Yea, it's hard to comment on the API without some experience using it. It did seem confusing at first, but I thought I'd wait to experience it first-hand a little.

I may just get rid of a bunch of confusing logic and assume, for now, that every patient has just a single associated tumor and normal sample.

I think that makes sense. I think that should always be the case. The time it might be confusing is we are likely to have an RNA BAM for tumor samples, but not normal (though we could). Similarly, if we are including TCRSeq data we have that on many samples (many normals and tumors possibly). I haven't thought about it much, so whatever you think will be simplest for is probably right

tavinathanson · 2016-05-01T19:42:48Z

@arahuja Why would it always be the case that patients would just have a single sample vs. samples at multiple timepoints/locations?

coveralls · 2016-05-03T18:41:01Z

Coverage decreased (-1.0%) to 54.945% when pulling 672e58b on isovar into 2a06c6b on master.

tavinathanson · 2016-05-03T19:07:43Z

Simplified by:

Getting rid of PairedSample and moving fields into Patient.
Getting rid of all group_by logic.
All IDs are Patient IDs.

coveralls · 2016-05-03T19:22:37Z

Coverage decreased (-0.5%) to 55.385% when pulling feea216 on isovar into 2a06c6b on master.

arahuja · 2016-05-03T19:35:19Z

test/data_generate.py

@@ -36,7 +36,7 @@ def generate_vcfs(id_to_mutation_count, file_format_func, template_name):
        template_path = data_path(template_name)
        vcf_reader = vcf.Reader(filename=template_path)
        file_path = generated_data_path(
-            path.join("vcfs", file_format_func(sample_id, None, None)))
+            path.join("vcfs", file_format % sample_id))


worth documenting that file_format has a to be a format string? Why did you prefer this over the function?

This is just for testing; shouldn't matter much? In general, I moved away from complex format functions to just setting the paths when creating the relevant objects.

Ah ok sorry, didn't notice where it was.

tavinathanson · 2016-05-03T19:56:32Z

@arahuja One annoyance is that sample_id to patient_id renders all existing cached variants/effects/neoantigens invalid. Is re-generating the cache more annoying than is tolerable?

arahuja · 2016-05-03T19:58:04Z

No, I think that is fine, probably worth doing anyways with an update on all related tools?

tavinathanson · 2016-05-03T20:00:38Z

@arahuja Cool, that was my original thinking.

coveralls · 2016-05-03T20:19:13Z

Coverage decreased (-0.8%) to 55.165% when pulling 96117d5 on isovar into 2a06c6b on master.

tavinathanson added 14 commits April 30, 2016 21:48

Add expressed neoantigens via isovar

3d6cd68

Add isovar as requirement

476d25f

Don't use PyPi for isovar

1c5dbb3

Add dependency links in setup.py

6f4f516

Return plots

a015578

Make two-sided the default for mannwhitney

052b028

Imitate a display in Travis

95e1287

Refactor cohorts; make dataframes less central

03a12e6

Add DataFrame joining

064d6c8

Add a simple collection class

f79bd59

Pass with_extra and how_extra around

cbb1f5a

Fix collection bug

265585e

Update README

42b6ad4

Add cufflinks loading

41dec1a

tavinathanson assigned arahuja May 1, 2016

tavinathanson added the enhancement label May 1, 2016

tavinathanson added 2 commits April 30, 2016 23:53

Add cufflinks as a joinable dataframe

426b355

Clean up None fallback

4d39cac

arahuja reviewed May 1, 2016
View reviewed changes

Simplify by assuming 1 paired sample per patient

672e58b

Minor error-handling tweaks

feea216

arahuja reviewed May 3, 2016
View reviewed changes

Minor fixes

96117d5

This was referenced May 3, 2016

Handle null counts better #16

Closed

Handle pandas dtypes better #17

Open

tavinathanson merged commit a22ad80 into master May 3, 2016

tavinathanson deleted the isovar branch May 3, 2016 20:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor cohorts #14

Refactor cohorts #14

tavinathanson commented May 1, 2016 •

edited

coveralls commented May 1, 2016 •

edited

coveralls commented May 1, 2016 •

edited

arahuja May 1, 2016

tavinathanson May 1, 2016

tavinathanson commented May 1, 2016

arahuja commented May 1, 2016

tavinathanson commented May 1, 2016

coveralls commented May 3, 2016 •

edited

tavinathanson commented May 3, 2016

coveralls commented May 3, 2016 •

edited

arahuja May 3, 2016

tavinathanson May 3, 2016

arahuja May 3, 2016

tavinathanson commented May 3, 2016

arahuja commented May 3, 2016

tavinathanson commented May 3, 2016

coveralls commented May 3, 2016 •

edited

Refactor cohorts #14

Refactor cohorts #14

Conversation

tavinathanson commented May 1, 2016 • edited

coveralls commented May 1, 2016 • edited

coveralls commented May 1, 2016 • edited

arahuja May 1, 2016

Choose a reason for hiding this comment

tavinathanson May 1, 2016

Choose a reason for hiding this comment

tavinathanson commented May 1, 2016

arahuja commented May 1, 2016

tavinathanson commented May 1, 2016

coveralls commented May 3, 2016 • edited

tavinathanson commented May 3, 2016

coveralls commented May 3, 2016 • edited

arahuja May 3, 2016

Choose a reason for hiding this comment

tavinathanson May 3, 2016

Choose a reason for hiding this comment

arahuja May 3, 2016

Choose a reason for hiding this comment

tavinathanson commented May 3, 2016

arahuja commented May 3, 2016

tavinathanson commented May 3, 2016

coveralls commented May 3, 2016 • edited

tavinathanson commented May 1, 2016 •

edited

coveralls commented May 1, 2016 •

edited

coveralls commented May 1, 2016 •

edited

coveralls commented May 3, 2016 •

edited

coveralls commented May 3, 2016 •

edited

coveralls commented May 3, 2016 •

edited