-
Notifications
You must be signed in to change notification settings - Fork 59
Moves alternate bases schema mutation to ProcessedVariantFactory. #118
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Pull Request Test Coverage Report for Build 335
💛 - Coveralls |
|
Folks, please note that this needs some more unit-tests, which I will add tomorrow. No new functionality is added so the unit-test coverage is not decreased but some functionality that before were buried as module "private"s are now exposed as "public" interfaces so it is better to write tests for them in isolation (e.g., all of bigquery_util and the new schema method of ProcessedVariantFactory). Also we probably need a better representation of the schema as an encapsulated class but that is left for the schema validation work that Nima is going to do as he can better design that class knowing the current status and what is needed for validation. So this does not "fix" Issue #59, just make it better. |
arostamianfar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Just a few nits, otherwise LGTM.
| # type: (str) -> bool | ||
| if info_field_name not in self._header_fields.infos: | ||
| raise ValueError('INFO field {} not found'.format(info_field_name)) | ||
| if ((self._split_alternate_allele_info_fields and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: you can just do return (...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI, I changed this a little because pylint was suggesting to do it like this (check the bottom of the page) and it did not seem to be very readable. So instead I do it the way it is now.
| description='')) | ||
| alternate_bases_record.fields.append(annotation_record) | ||
| schema.fields.append(alternate_bases_record) | ||
| proc_var_factory.add_alt_record_to_schema(schema) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: instead of adding alt to a mutable schema, consider returning the alt schema and appending it to the schema object here (e.g. schema.fields.append(proc_var_factory.get_alt_record_schema()). This makes is more clear how the schema is being mutated in this function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| The latter functionality is what is needed here. | ||
| append: If true, existing records in output_table will not be | ||
| overwritten. New records will be appended to those that already exist. | ||
| omit_empty_sample_calls (bool): If true, samples that don't have a given |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can you also remove bool from here? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
|
||
| @beam.typehints.with_input_types(processed_variant.ProcessedVariant) | ||
| class VariantToBigQuery(beam.PTransform): | ||
| """Writes PCollection of ``Variant`` records to BigQuery.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please also change this to be ProcessedVariant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
651a6fb to
bf8448a
Compare
bashir2
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PTAL
And please ignore the "Closed PR" comment GitHub injected there :-)
| # type: (str) -> bool | ||
| if info_field_name not in self._header_fields.infos: | ||
| raise ValueError('INFO field {} not found'.format(info_field_name)) | ||
| if ((self._split_alternate_allele_info_fields and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
|
||
| @beam.typehints.with_input_types(processed_variant.ProcessedVariant) | ||
| class VariantToBigQuery(beam.PTransform): | ||
| """Writes PCollection of ``Variant`` records to BigQuery.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| The latter functionality is what is needed here. | ||
| append: If true, existing records in output_table will not be | ||
| overwritten. New records will be appended to those that already exist. | ||
| omit_empty_sample_calls (bool): If true, samples that don't have a given |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| description='')) | ||
| alternate_bases_record.fields.append(annotation_record) | ||
| schema.fields.append(alternate_bases_record) | ||
| proc_var_factory.add_alt_record_to_schema(schema) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
c4b6b15 to
1003fd9
Compare
|
BTW, if you want to see the changes since last review, check this commit |
arostamianfar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Please wait for Nima's review as well and add a few unit tests to bigquery_util.py.
a102cbf to
5ed7e7e
Compare
|
bigquery_util_test added plus a TODO for the new create_alt_record_for_schema method of ProcessedVariantFactory. |
| annotation_dict[name] = annotations[index + 1] | ||
| return annotation_dict | ||
|
|
||
| def create_alt_record_for_schema(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe 'create_alt_bases_schema'?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed it to create_alt_bases_field_schema because I want to make it clear it is not a full schema.
| split_alternate_allele_info_fields=True, | ||
| annotation_fields=None): | ||
| def generate_schema_from_header_fields( | ||
| header_fields, # type: vcf_header_parser.HeaderFields |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This type annotation is cool. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed :-)
| annotation_fields=None): | ||
| def generate_schema_from_header_fields( | ||
| header_fields, # type: vcf_header_parser.HeaderFields | ||
| proc_var_factory, # type: processed_variant.ProcessedVariantFactory |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
processed_variant_factory?
both 'proc' and 'var' can be confused by other more popular names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed it to proc_variant_factory; agreed that proc is not great but processed_variant_factory is also too long (for example I need to break the # type comment to next line which is doable but ugly). There is a docstring below which clarifies it too. Still if you feel strongly about this, please let me know.
| VCF file(s). | ||
| proc_var_factory: The factory class that knows how to convert Variant | ||
| instances to ProcessedVariant. As a side effect it also knows how to | ||
| modify BigQuery schema based on the ProcessedVariants that it generates. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line
"... As a side effect it also knows how to modify BigQuery schema based on the ProcessedVariants that it generates. "
calls for a refactoring. In another PR though, this one is already big for review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed this is not great but I can't think of a much better alternative design without too much code change either. If you have a specific suggestion, I can add a TODO.
bashir2
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Nima for the review. PTAL.
| annotation_dict[name] = annotations[index + 1] | ||
| return annotation_dict | ||
|
|
||
| def create_alt_record_for_schema(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed it to create_alt_bases_field_schema because I want to make it clear it is not a full schema.
| split_alternate_allele_info_fields=True, | ||
| annotation_fields=None): | ||
| def generate_schema_from_header_fields( | ||
| header_fields, # type: vcf_header_parser.HeaderFields |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed :-)
| annotation_fields=None): | ||
| def generate_schema_from_header_fields( | ||
| header_fields, # type: vcf_header_parser.HeaderFields | ||
| proc_var_factory, # type: processed_variant.ProcessedVariantFactory |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed it to proc_variant_factory; agreed that proc is not great but processed_variant_factory is also too long (for example I need to break the # type comment to next line which is doable but ugly). There is a docstring below which clarifies it too. Still if you feel strongly about this, please let me know.
| VCF file(s). | ||
| proc_var_factory: The factory class that knows how to convert Variant | ||
| instances to ProcessedVariant. As a side effect it also knows how to | ||
| modify BigQuery schema based on the ProcessedVariants that it generates. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed this is not great but I can't think of a much better alternative design without too much code change either. If you have a specific suggestion, I can add a TODO.
5ed7e7e to
d65b158
Compare
|
BTW, if you want to see only the changes since last time, check this commit. |
nmousavi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this PR. LGTM!
d65b158 to
79694a0
Compare
Tested: Ran integration tests and updated unit-tests. Also ran the code manually for an annotated VCF and confirmed the generated schema is the same as before.
Issue: #59, #81.