Refactor cohorts #14
    if group_by not in ["patient", "paired_sample"]:
        raise ValueError("Invalid group_by: %s" % group_by)

    def first_not_none(params, default):
Didn't understand this usage above
Yeah, sorry for the lack of comments. Wanted to get the PR out there but will update.
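For readers following the thread, here is a minimal sketch of what a helper with this signature might do. The body is an assumption based only on the name and arguments in the diff, and `resolve_group_by` is a hypothetical caller added for illustration; neither is taken from the PR.

```python
def first_not_none(params, default):
    """Return the first entry of params that is not None, else default.

    Assumed behavior, inferred from the function name only.
    """
    for param in params:
        if param is not None:
            return param
    return default


def resolve_group_by(call_group_by, instance_group_by):
    # Hypothetical caller: a per-call argument wins over an instance-level
    # setting, and "patient" is used when neither is given.
    group_by = first_not_none([call_group_by, instance_group_by], "patient")
    if group_by not in ["patient", "paired_sample"]:
        raise ValueError("Invalid group_by: %s" % group_by)
    return group_by
```

Checking `is not None` rather than truthiness means falsy values such as `0` or `""` are still treated as explicitly set.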
@arahuja curious to get your thoughts on P.S. I may just get rid of a bunch of confusing logic and assume, for now, that every patient has just a single associated tumor and normal sample.
Yea, it's hard to comment on the API without some experience using it. It did seem confusing at first, but I thought I'd wait to experience it first-hand a little.
I think that makes sense. I think that should always be the case. The time it might be confusing is that we are likely to have an RNA BAM for tumor samples, but not normal (though we could). Similarly, if we are including TCRSeq data, we have that on many samples (possibly many normals and tumors). I haven't thought about it much, so whatever you think will be simplest is probably right.
@arahuja Why would it always be the case that patients would just have a single sample vs. samples at multiple timepoints/locations?
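To make the simplifying assumption under discussion concrete, here is a rough sketch of a model in which each patient has at most one normal and one tumor sample. The class and field names (`Sample`, `Patient`, `is_tumor`, `bam_path`, `normal_sample`, `tumor_sample`) are illustrative assumptions, not the definitions in this PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    # Hypothetical fields for illustration.
    is_tumor: bool
    bam_path: Optional[str] = None


@dataclass
class Patient:
    id: str
    # One slot each, which encodes the "single tumor and normal" assumption.
    normal_sample: Optional[Sample] = None
    tumor_sample: Optional[Sample] = None

    def __post_init__(self):
        # Guard against a sample being placed in the wrong slot.
        if self.normal_sample is not None and self.normal_sample.is_tumor:
            raise ValueError("normal_sample is marked as tumor")
        if self.tumor_sample is not None and not self.tumor_sample.is_tumor:
            raise ValueError("tumor_sample is not marked as tumor")
```

Supporting multiple timepoints or locations would mean replacing the two single slots with lists of samples, which is the extra complexity the question above is pointing at.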
Simplified by:
    @@ -36,7 +36,7 @@ def generate_vcfs(id_to_mutation_count, file_format_func, template_name):
             template_path = data_path(template_name)
             vcf_reader = vcf.Reader(filename=template_path)
             file_path = generated_data_path(
    -            path.join("vcfs", file_format_func(sample_id, None, None)))
    +            path.join("vcfs", file_format % sample_id))
Worth documenting that `file_format` has to be a format string? Why did you prefer this over the function?
This is just for testing; shouldn't matter much? In general, I moved away from complex format functions to just setting the paths when creating the relevant objects.
Ah ok sorry, didn't notice where it was.
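For reference, the two styles in the hunk above can be compared side by side. This is only a sketch: the wrapper names `vcf_path_from_func` and `vcf_path_from_format` and the `"%s.vcf"` pattern are assumptions for illustration, not code from the PR.

```python
from os import path


def vcf_path_from_func(sample_id, file_format_func):
    # Old style: the caller passes a function; the two extra positional
    # arguments mirror the (sample_id, None, None) call in the diff.
    return path.join("vcfs", file_format_func(sample_id, None, None))


def vcf_path_from_format(sample_id, file_format):
    # New style: file_format must be a %-style format string with exactly
    # one placeholder, e.g. "%s.vcf" (a hypothetical pattern).
    return path.join("vcfs", file_format % sample_id)
```

The format string is simpler to pass around in tests, at the cost of an implicit contract: a string without a placeholder raises `TypeError` at formatting time.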
@arahuja One annoyance is that the move from sample_id to patient_id renders all existing cached variants/effects/neoantigens invalid. Is re-generating the cache more annoying than is tolerable?
No, I think that is fine, probably worth doing anyways with an update on all related tools?
@arahuja Cool, that was my original thinking.
Highlights:
- `Cohort` object, split that out into `Patient`, `Sample`, etc. This also makes me less worried about the ordering of `sample_ids`, `tumor_bam_ids`, etc.; before, it would have been easier to accidentally re-order those lists IMO.
- `Sample` (where the `Sample` could be tumor or normal), I created `PairedSample`.
- `Cohort` is a `Collection` of `Patients`.
- `Sample` or `PairedSample` creation.
- `Cohort` for joining; that's the `join_with` (which dataframe to join) and `join_how` (how to join). Dataframes get stored in a `Cohort` and can be joined with the rest of the data on demand. For example: `cohort = data.init_cohort(join_with=["pdl1", "tcr"], join_how="inner")` or `cohort = data.init_cohort(join_with="cibersort")`.
- `isovar` stuff hasn't changed since Expressed neoantigens and other minor changes #11.

I tried to keep the consumption part of the API (vs. loading in the data) mostly the same; things this breaks:
- `self.clinical_dataframe` is now `self.as_dataframe()`
- `Cohort` object.

Also note that some of the `CohortDataFrame` and `group_by` logic is a bit confusing. I'd like to simplify it and better document it at some point, but would prefer to defer to another PR on that.
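As a rough illustration of the `join_with` / `join_how` idea described in the highlights, here is a minimal pandas sketch. `MiniCohort`, its constructor arguments, and the `patient_id` join key are all assumptions made for this example; the real `Cohort` and `data.init_cohort` in this PR may work differently.

```python
import pandas as pd


class MiniCohort:
    """Toy stand-in for a cohort that stores extra dataframes by name."""

    def __init__(self, clinical_df, extra_dfs, join_with=None, join_how="inner"):
        self.clinical_df = clinical_df
        self.extra_dfs = extra_dfs  # hypothetical: dict of name -> DataFrame
        self.join_with = join_with
        self.join_how = join_how

    def as_dataframe(self, join_with=None, join_how=None):
        # Per-call arguments override the defaults set at construction.
        join_with = join_with if join_with is not None else self.join_with
        join_how = join_how if join_how is not None else self.join_how
        # Accept a single name or a list of names.
        if join_with is None:
            names = []
        elif isinstance(join_with, str):
            names = [join_with]
        else:
            names = list(join_with)
        df = self.clinical_df
        for name in names:
            # Join on demand; "patient_id" is an assumed key column.
            df = df.merge(self.extra_dfs[name], on="patient_id", how=join_how)
        return df
```

With an inner join, patients missing from a joined dataframe drop out of the result; a left join keeps them with missing values instead, which is the trade-off `join_how` exposes.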