Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor cohorts #14

Merged
merged 19 commits into from May 3, 2016
Merged

Refactor cohorts #14

merged 19 commits into from May 3, 2016

Conversation

tavinathanson
Copy link
Member

@tavinathanson tavinathanson commented May 1, 2016

Highlights:

  • Instead of passing in a million arguments to a single Cohort object, split that out into Patient, Sample, etc. This also makes me less worried about the ordering of sample_ids, tumor_bam_ids, etc.; before, it would have been easier to accidentally re-order those lists IMO.
  • Since a VCF path didn't make sense to be specified as a parameter of a Sample (where the Sample could be tumor or normal), I created PairedSample.
  • A Cohort is a Collection of Patients.
  • Move away from passing path functions around. Paths are now passed in at the time of Sample or PairedSample creation.
  • I added logic for sending extra dataframes to the Cohort for joining; that's the join_with (which dataframe to join) and join_how (how to join). Dataframes get stored in a Cohort and can be joined with the rest of the data on demand. For example: cohort = data.init_cohort(join_with=["pdl1", "tcr"], join_how="inner") or cohort = data.init_cohort(join_with="cibersort").
  • I tried, wherever I could, to make the logic work for situations where patient ID != paired sample ID != sample ID, though I don't yet have tests for this. It's not really my priority right now; I just got roped into it as a result of the new objects.
  • Move cufflinks loading to cohorts.
  • The isovar stuff hasn't changed since Expressed neoantigens and other minor changes #11.
  • Add a mannwhitneyu test, and set two-sided to be the default.
  • Plotting functions return the plot for further manipulation.
  • Fixes Possible issue: sample_id as int vs str #1 by forcing and asserting string IDs.

I tried to keep the consumption part of the API (vs. loading in the data) mostly the same; things this breaks:

  • self.clinical_dataframe is now self.as_dataframe()
  • Anything that touches the Cohort object.

Also note that some of the CohortDataFrame and group_by logic is a bit confusing. I'd like to simplify it and better document it at some point, but would prefer to defer to another PR on that.

@coveralls
Copy link

coveralls commented May 1, 2016

Coverage Status

Coverage decreased (-3.0%) to 52.93% when pulling 41dec1a on isovar into 2a06c6b on master.

@coveralls
Copy link

coveralls commented May 1, 2016

Coverage Status

Coverage decreased (-1.2%) to 54.699% when pulling 4d39cac on isovar into 2a06c6b on master.

if group_by not in ["patient", "paired_sample"]:
raise ValueError("Invalid group_by: %s" % group_by)

def first_not_none(params, default):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't understand this usage above

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, sorry for the lack of comments. Wanted to get the PR out there but will update.

@tavinathanson
Copy link
Member Author

@arahuja curious to get your thoughts on Sample, PairedSample and Patient: can these be more clearly named/organized? I think Sample vs. PairedSample is pretty confusing.

P.S. I may just get rid of a bunch of confusing logic and assume, for now, that every patient has just a single associated tumor and normal sample.

@arahuja
Copy link
Contributor

arahuja commented May 1, 2016

curious to get your thoughts on Sample, PairedSample and Patient: can these be more clearly named/organized? I think Sample vs. PairedSample is pretty confusing.

Yea, it's hard to comment on the API without some experience using it. It did seem confusing at first, but I thought I'd wait to experience it first-hand a little.

I may just get rid of a bunch of confusing logic and assume, for now, that every patient has just a single associated tumor and normal sample.

I think that makes sense. I think that should always be the case. The time it might be confusing is we are likely to have an RNA BAM for tumor samples, but not normal (though we could). Similarly, if we are including TCRSeq data we have that on many samples (many normals and tumors possibly). I haven't thought about it much, so whatever you think will be simplest for is probably right

@tavinathanson
Copy link
Member Author

@arahuja Why would it always be the case that patients would just have a single sample vs. samples at multiple timepoints/locations?

@coveralls
Copy link

coveralls commented May 3, 2016

Coverage Status

Coverage decreased (-1.0%) to 54.945% when pulling 672e58b on isovar into 2a06c6b on master.

@tavinathanson
Copy link
Member Author

Simplified by:

  • Getting rid of PairedSample and moving fields into Patient.
  • Getting rid of all group_by logic.
  • All IDs are Patient IDs.

@coveralls
Copy link

coveralls commented May 3, 2016

Coverage Status

Coverage decreased (-0.5%) to 55.385% when pulling feea216 on isovar into 2a06c6b on master.

@@ -36,7 +36,7 @@ def generate_vcfs(id_to_mutation_count, file_format_func, template_name):
template_path = data_path(template_name)
vcf_reader = vcf.Reader(filename=template_path)
file_path = generated_data_path(
path.join("vcfs", file_format_func(sample_id, None, None)))
path.join("vcfs", file_format % sample_id))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

worth documenting that file_format has a to be a format string? Why did you prefer this over the function?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just for testing; shouldn't matter much? In general, I moved away from complex format functions to just setting the paths when creating the relevant objects.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok sorry, didn't notice where it was.

@tavinathanson
Copy link
Member Author

@arahuja One annoyance is that sample_id to patient_id renders all existing cached variants/effects/neoantigens invalid. Is re-generating the cache more annoying than is tolerable?

@arahuja
Copy link
Contributor

arahuja commented May 3, 2016

No, I think that is fine, probably worth doing anyways with an update on all related tools?

@tavinathanson
Copy link
Member Author

@arahuja Cool, that was my original thinking.

@coveralls
Copy link

coveralls commented May 3, 2016

Coverage Status

Coverage decreased (-0.8%) to 55.165% when pulling 96117d5 on isovar into 2a06c6b on master.

This was referenced May 3, 2016
@tavinathanson tavinathanson merged commit a22ad80 into master May 3, 2016
@tavinathanson tavinathanson deleted the isovar branch May 3, 2016 20:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants