Skip to content

Conversation

@bashir2
Copy link
Member

@bashir2 bashir2 commented Feb 22, 2018

Tested: Ran integration tests and updated unit-tests. Also ran the code manually for an annotated VCF and confirmed the generated schema is the same as before.

Issue: #59, #81.

@coveralls
Copy link

coveralls commented Feb 22, 2018

Pull Request Test Coverage Report for Build 335

  • 160 of 174 (91.95%) changed or added relevant lines in 10 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+0.2%) to 90.716%

Changes Missing Coverage Covered Lines Changed/Added Lines %
gcp_variant_transforms/transforms/variant_to_bigquery.py 3 4 75.0%
gcp_variant_transforms/libs/variant_merge/move_to_calls_strategy.py 6 8 75.0%
gcp_variant_transforms/libs/bigquery_util.py 85 88 96.59%
gcp_variant_transforms/libs/processed_variant.py 19 27 70.37%
Files with Coverage Reduction New Missed Lines %
gcp_variant_transforms/transforms/variant_to_bigquery.py 1 73.08%
Totals Coverage Status
Change from base Build 333: 0.2%
Covered Lines: 3068
Relevant Lines: 3382

💛 - Coveralls

@bashir2
Copy link
Member Author

bashir2 commented Feb 22, 2018

Folks, please note that this needs some more unit-tests, which I will add tomorrow. No new functionality is added so the unit-test coverage is not decreased but some functionality that before were buried as module "private"s are now exposed as "public" interfaces so it is better to write tests for them in isolation (e.g., all of bigquery_util and the new schema method of ProcessedVariantFactory).

Also we probably need a better representation of the schema as an encapsulated class but that is left for the schema validation work that Nima is going to do as he can better design that class knowing the current status and what is needed for validation. So this does not "fix" Issue #59, just make it better.

Copy link
Contributor

@arostamianfar arostamianfar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Just a few nits, otherwise LGTM.

# type: (str) -> bool
if info_field_name not in self._header_fields.infos:
raise ValueError('INFO field {} not found'.format(info_field_name))
if ((self._split_alternate_allele_info_fields and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: you can just do return (...)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, I changed this a little because pylint was suggesting to do it like this (check the bottom of the page) and it did not seem to be very readable. So instead I do it the way it is now.

description=''))
alternate_bases_record.fields.append(annotation_record)
schema.fields.append(alternate_bases_record)
proc_var_factory.add_alt_record_to_schema(schema)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: instead of adding alt to a mutable schema, consider returning the alt schema and appending it to the schema object here (e.g. schema.fields.append(proc_var_factory.get_alt_record_schema()). This makes is more clear how the schema is being mutated in this function.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

The latter functionality is what is needed here.
append: If true, existing records in output_table will not be
overwritten. New records will be appended to those that already exist.
omit_empty_sample_calls (bool): If true, samples that don't have a given
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can you also remove bool from here? :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


@beam.typehints.with_input_types(processed_variant.ProcessedVariant)
class VariantToBigQuery(beam.PTransform):
"""Writes PCollection of ``Variant`` records to BigQuery."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also change this to be ProcessedVariant.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@bashir2 bashir2 closed this Feb 22, 2018
@bashir2 bashir2 force-pushed the refactor_schema_review branch from 651a6fb to bf8448a Compare February 22, 2018 20:53
@bashir2 bashir2 reopened this Feb 22, 2018
Copy link
Member Author

@bashir2 bashir2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PTAL

And please ignore the "Closed PR" comment GitHub injected there :-)

# type: (str) -> bool
if info_field_name not in self._header_fields.infos:
raise ValueError('INFO field {} not found'.format(info_field_name))
if ((self._split_alternate_allele_info_fields and
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


@beam.typehints.with_input_types(processed_variant.ProcessedVariant)
class VariantToBigQuery(beam.PTransform):
"""Writes PCollection of ``Variant`` records to BigQuery."""
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

The latter functionality is what is needed here.
append: If true, existing records in output_table will not be
overwritten. New records will be appended to those that already exist.
omit_empty_sample_calls (bool): If true, samples that don't have a given
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

description=''))
alternate_bases_record.fields.append(annotation_record)
schema.fields.append(alternate_bases_record)
proc_var_factory.add_alt_record_to_schema(schema)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@bashir2 bashir2 force-pushed the refactor_schema_review branch from c4b6b15 to 1003fd9 Compare February 22, 2018 21:05
@bashir2
Copy link
Member Author

bashir2 commented Feb 22, 2018

BTW, if you want to see the changes since last review, check this commit

Copy link
Contributor

@arostamianfar arostamianfar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Please wait for Nima's review as well and add a few unit tests to bigquery_util.py.

@bashir2 bashir2 force-pushed the refactor_schema_review branch 2 times, most recently from a102cbf to 5ed7e7e Compare February 23, 2018 00:07
@bashir2
Copy link
Member Author

bashir2 commented Feb 23, 2018

bigquery_util_test added plus a TODO for the new create_alt_record_for_schema method of ProcessedVariantFactory.

annotation_dict[name] = annotations[index + 1]
return annotation_dict

def create_alt_record_for_schema(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe 'create_alt_bases_schema'?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it to create_alt_bases_field_schema because I want to make it clear it is not a full schema.

split_alternate_allele_info_fields=True,
annotation_fields=None):
def generate_schema_from_header_fields(
header_fields, # type: vcf_header_parser.HeaderFields
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This type annotation is cool. :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed :-)

annotation_fields=None):
def generate_schema_from_header_fields(
header_fields, # type: vcf_header_parser.HeaderFields
proc_var_factory, # type: processed_variant.ProcessedVariantFactory
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

processed_variant_factory?

both 'proc' and 'var' can be confused by other more popular names.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it to proc_variant_factory; agreed that proc is not great but processed_variant_factory is also too long (for example I need to break the # type comment to next line which is doable but ugly). There is a docstring below which clarifies it too. Still if you feel strongly about this, please let me know.

VCF file(s).
proc_var_factory: The factory class that knows how to convert Variant
instances to ProcessedVariant. As a side effect it also knows how to
modify BigQuery schema based on the ProcessedVariants that it generates.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line
"... As a side effect it also knows how to modify BigQuery schema based on the ProcessedVariants that it generates. "

calls for a refactoring. In another PR though, this one is already big for review.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed this is not great but I can't think of a much better alternative design without too much code change either. If you have a specific suggestion, I can add a TODO.

Copy link
Member Author

@bashir2 bashir2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Nima for the review. PTAL.

annotation_dict[name] = annotations[index + 1]
return annotation_dict

def create_alt_record_for_schema(self):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it to create_alt_bases_field_schema because I want to make it clear it is not a full schema.

split_alternate_allele_info_fields=True,
annotation_fields=None):
def generate_schema_from_header_fields(
header_fields, # type: vcf_header_parser.HeaderFields
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed :-)

annotation_fields=None):
def generate_schema_from_header_fields(
header_fields, # type: vcf_header_parser.HeaderFields
proc_var_factory, # type: processed_variant.ProcessedVariantFactory
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it to proc_variant_factory; agreed that proc is not great but processed_variant_factory is also too long (for example I need to break the # type comment to next line which is doable but ugly). There is a docstring below which clarifies it too. Still if you feel strongly about this, please let me know.

VCF file(s).
proc_var_factory: The factory class that knows how to convert Variant
instances to ProcessedVariant. As a side effect it also knows how to
modify BigQuery schema based on the ProcessedVariants that it generates.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed this is not great but I can't think of a much better alternative design without too much code change either. If you have a specific suggestion, I can add a TODO.

@bashir2 bashir2 force-pushed the refactor_schema_review branch from 5ed7e7e to d65b158 Compare February 23, 2018 19:22
@bashir2
Copy link
Member Author

bashir2 commented Feb 23, 2018

BTW, if you want to see only the changes since last time, check this commit.

nmousavi
nmousavi previously approved these changes Feb 23, 2018
Copy link
Contributor

@nmousavi nmousavi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this PR. LGTM!

@bashir2 bashir2 merged commit d37ff53 into googlegenomics:master Feb 23, 2018
@bashir2 bashir2 deleted the refactor_schema_review branch March 9, 2018 18:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants