Conversation

@bashir2 (Member) commented Nov 6, 2018

Issue: #404

Tested: Unit tests added for the core Avro schema generation (and deploy_and_run_tests.sh passed). Also generated Avro output files for some sample VCFs, uploaded the output to BigQuery using the bq tool, and compared the results against a BigQuery table generated directly by Variant Transforms.

@coveralls commented Nov 6, 2018

Pull Request Test Coverage Report for Build 1448

  • 140 of 165 (84.85%) changed or added relevant lines in 9 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.07%) to 87.614%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| gcp_variant_transforms/libs/bigquery_util.py | 16 | 17 | 94.12% |
| gcp_variant_transforms/libs/schema_converter.py | 35 | 37 | 94.59% |
| gcp_variant_transforms/options/variant_transform_options.py | 4 | 8 | 50.0% |
| gcp_variant_transforms/transforms/variant_to_avro.py | 15 | 24 | 62.5% |
| gcp_variant_transforms/vcf_to_bq.py | 1 | 10 | 10.0% |

Totals (Coverage Status):
  • Change from base Build 1435: -0.07%
  • Covered Lines: 6423
  • Relevant Lines: 7331

💛 - Coveralls

@arostamianfar (Contributor) left a comment

Thanks, Bashir! Looks great! Just a few nits...

        bigquery_type))
  t = _BIG_QUERY_TYPE_TO_AVRO_TYPE_MAP[bigquery_type]
  if bigquery_mode == TableFieldConstants.MODE_NULLABLE:
    return [t, AvroConstants.NULL]
Contributor

Please add a brief comment about why nullable types become a list rather than a string. I realize you've explained it in the other module, but it's difficult to connect the dots here.

Member Author

Done.
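
For readers following along, a minimal sketch of the idea behind the comment that was added: in Avro, an optional value is a union of the value type and "null", and a union is written in a JSON Avro schema as a list of types, so a NULLABLE BigQuery field maps to a list rather than a single type name. The map and helper below are illustrative stand-ins, not the module's actual `_BIG_QUERY_TYPE_TO_AVRO_TYPE_MAP` or function.

```python
# Illustrative BigQuery-type -> Avro-type map (placeholder values).
_ILLUSTRATIVE_BQ_TO_AVRO = {'INTEGER': 'long', 'STRING': 'string'}

def _avro_type_for(bigquery_type, bigquery_mode):
  avro_type = _ILLUSTRATIVE_BQ_TO_AVRO[bigquery_type]
  if bigquery_mode == 'NULLABLE':
    return [avro_type, 'null']  # a union, e.g. ["long", "null"]
  return avro_type  # a plain type name, e.g. "long"

print(_avro_type_for('INTEGER', 'NULLABLE'))  # ['long', 'null']
```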

  # records in the array is f.name. Make sure this is according to Avro
  # spec then remove this TODO.
  field_dict[bigquery_util.AvroConstants.NAME] = f.name
  field_dict[bigquery_util.AvroConstants.TYPE] = \
Contributor

nit: please use brackets instead of \ here and everywhere.

Member Author

Done, but is this a style requirement?
I am asking because () is also used to define tuples in Python, and I find it a little confusing when it is used for line breaks (although I have done it myself before). My preference is to break inside brackets that already exist (e.g., before bigquery_util on this line), but in this case that adds an extra line.

Contributor

Yes, it's required by the style guide: https://github.com/google/styleguide/blob/gh-pages/pyguide.md#32-line-length
"Do not use backslash line continuation except for with statements requiring three or more context managers."

Member Author

I see, thanks for the pointer.
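
As an aside for readers, a small illustration of the two continuation styles being discussed; the helper name and arguments are placeholders, not this PR's code.

```python
def get_avro_type(bigquery_type, mode):  # placeholder stand-in for the PR's helper
  return [bigquery_type.lower(), 'null'] if mode == 'NULLABLE' else bigquery_type.lower()

# Backslash continuation (discouraged by the style guide).
field_type = \
    get_avro_type('INTEGER', 'NULLABLE')

# Wrapping the right-hand side in parentheses (preferred).
field_type = (
    get_avro_type('INTEGER', 'NULLABLE'))

# Or breaking inside brackets that already exist in the expression.
field_type = get_avro_type(
    'INTEGER', 'NULLABLE')
```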

    # type: (argparse.Namespace, bigquery.BigqueryV2) -> None
    if not parsed_args.output_table and not parsed_args.output_avro_path:
      raise ValueError('At least one of --output_table or --output_avro_path '
                       'options should be provided')
Contributor

Suggested change:
- 'options should be provided')
+ 'options should be provided.')

Member Author

Done (for some reason it does not let me "Apply suggestion", but I think in general I have to make the change in my local version anyway, otherwise it will mess up my git flow).

Contributor

Ah, good to know... I figured it's easier for these kinds of nits to just make the change as a reviewer rather than going back and forth. Anyhow, this is a beta feature from GitHub, so hopefully they'll address these problems.

  if not bigquery_type in _BIG_QUERY_TYPE_TO_AVRO_TYPE_MAP:
    raise ValueError('Unknown Avro equivalent for type {}'.format(
        bigquery_type))
  t = _BIG_QUERY_TYPE_TO_AVRO_TYPE_MAP[bigquery_type]
Contributor

nit: in general, we've tried to avoid short-form variable names (e.g., t, f) in our code as they make it harder to read (the only exception has been one-line for loops). Consider renaming these to avro_type etc.

Member Author

Done (this was a local variable with a small scope, hence the original short name).

class GenerateSchemaFromHeaderFieldsTest(unittest.TestCase):
  """Test cases for the ``generate_schema_from_header_fields`` function."""

  def _validation_hook(self, expected_fields, actual_schema):
Contributor

nit: why not just name this _validate_schema, as it's more obvious what it's validating (unless you're trying to validate more stuff, but the comments say otherwise).

Member Author

Done.
The term hook was there to make it clear that sub-classes are intended to override this function; so yes, the sub-class implementations may choose to do more, as ConvertTableSchemaToJsonAvroSchemaTest does. Changed it to what you suggested, since apparently the name did not convey that message.
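
For context, a rough sketch of the hook/override pattern being described; the class and method names here are hypothetical, not the PR's actual tests.

```python
import unittest

class BaseSchemaTest(unittest.TestCase):
  """Defines the test methods once; subclasses override _validate_schema."""

  def _validate_schema(self, expected_fields, actual_schema):
    self.assertEqual(sorted(expected_fields), sorted(actual_schema))

  def test_simple_schema(self):
    self._validate_schema(['end', 'start'], {'start': 'long', 'end': 'long'})

class AvroSchemaTest(BaseSchemaTest):
  """Reuses the inherited test methods but adds extra validation."""

  def _validate_schema(self, expected_fields, actual_schema):
    super(AvroSchemaTest, self)._validate_schema(expected_fields, actual_schema)
    # A subclass may validate more here, e.g. an Avro rendering of the schema.

if __name__ == '__main__':
  unittest.main()
```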


class ConvertTableSchemaToJsonAvroSchemaTest(
    GenerateSchemaFromHeaderFieldsTest):
  """ Test cases for `convert_table_schema_to_json_avro_schema`.
Contributor

Suggested change:
- """ Test cases for `convert_table_schema_to_json_avro_schema`.
+ """Test cases for `convert_table_schema_to_json_avro_schema`.

def _convert_schema_to_avro_dict(schema):
  # type: (bigquery.TableSchema) -> Dict
  fields_dict = {}
  # TODO(bashir2): Check if we need `namespace` and `name` at the top level.
Contributor

I assume this means the TBD field will also be removed? :)

Member Author

Not really; without this name the Avro schema parser fails, but I am not clear what the implications of this name (and namespace) are, hence the TODO.
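
For context, a hedged sketch of the kind of top-level wrapper under discussion: per the Avro spec a record schema requires a `name` and may carry a `namespace`. The field names and values below are illustrative placeholders, not this PR's actual schema.

```python
import json

avro_schema = {
    'type': 'record',
    'name': 'variants',           # required for record schemas
    'namespace': 'example.avro',  # optional; hypothetical value
    'fields': [
        {'name': 'reference_name', 'type': ['string', 'null']},
        {'name': 'start_position', 'type': ['long', 'null']},
    ],
}
print(json.dumps(avro_schema, indent=2))
```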

@allieychen (Contributor) left a comment

Thank you, Bashir! It looks very neat. I added a few comments. :)

  return schema


def _convert_repeated_field_to_avro_array(f, fields_list):
Contributor

nit: please add the type hints.

Member Author

Done.

        help='The output path to write Avro files under.')

  def validate(self, parsed_args):
    # type: (argparse.Namespace, bigquery.BigqueryV2) -> None
Contributor

Please remove `, bigquery.BigqueryV2`.

Member Author

Done.

"""Initializes the transform.
Args:
output_table: The path under which output Avro files are generated.
Contributor

s/output_table/output_path

Member Author

Done; thanks for catching.

      output_table: The path under which output Avro files are generated.
      header_fields: Representative header fields for all variants. This is
        needed for dynamically generating the schema.
      variant_merger: The strategy used for merging variants (if any). Some
Contributor

nit: Please switch the order of variant_merger and proc_var_factory.

Member Author

Done.
As a side note, proc_var_factory is a required argument both here and in the VariantToBigQuery class; that's why I dropped the default None. I added a TODO in the other class too, to fix it in a future PR.

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
Contributor

I think it is better to rename the script (maybe something like schema_converter), or move the conversions between BigQuery schema and Avro schema to another file.

Member Author

Done.
Yeah, I was thinking about this too when I made changes to this file. I decided against separate modules (because I think the new functionality is part of the various schema conversions: VCF->BQ, BQ->VCF, BQ->Avro, ...), but I agree the name should be more generic, which is what I just changed it to.

          bigquery_util.get_avro_type_from_bigquery_type_mode(f.type, f.mode)
  }
  # All repeated fields are nullable.
  return [bigquery_util.AvroConstants.NULL, array_dict]
Contributor

I don't know whether the order matters, but from the example shown below, I feel array_dict should come before null.

Member Author

I don't think the order matters.
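
For readers, a small sketch of the repeated-field case in question, with illustrative values only. As far as I know, union branch order in Avro only matters when a field declares a default value (the default must match the first branch); otherwise the two orderings are equivalent.

```python
# A REPEATED BigQuery field becomes an Avro array wrapped in a union with
# "null"; the array items here are themselves nullable (illustrative types).
array_dict = {
    'type': 'array',
    'items': ['long', 'null'],
}
nullable_array = ['null', array_dict]   # the order used in the PR
equivalent_form = [array_dict, 'null']  # also a valid union
```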


from collections import OrderedDict
from typing import Any, Dict, Union # pylint: disable=unused-import
import json
Contributor

nit: change the order of `from collections import OrderedDict` and `import json`, `import logging`.

Member Author

Hmm, is this a style requirement? My reading of the import formatting section tells me that this is the right order. The importorder tool also produces the same order as here, even if I move collections after logging. I think the correct fix/change here is to use collections directly instead of importing OrderedDict from it (style guide section), but that import predates this PR and I prefer not to touch it in this PR (it would also require changing several lines below).

Contributor

My guideline is as simple as "Imports should be grouped with the order being most generic to least generic." But I agree with you about the correct fix/change.
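
For completeness, a tiny sketch of the "correct fix" mentioned above, i.e. importing the module rather than the class so the standard-library block stays plainly alphabetized; the usage below is illustrative only.

```python
import collections
import json
import logging

# Using the module directly avoids the from-import ordering question.
counts = collections.OrderedDict([('snv', 3), ('indel', 1)])
logging.info('Variant counts: %s', json.dumps(counts))
```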


class ConvertTableSchemaToJsonAvroSchemaTest(
    GenerateSchemaFromHeaderFieldsTest):
  """ Test cases for `convert_table_schema_to_json_avro_schema`.
Contributor

nit: remove the leading space.

Member Author

Done.

    self._omit_empty_sample_calls = omit_empty_sample_calls

  def expand(self, pcoll):
    avro_records = pcoll | 'ConvertToAvroRecords' >> beam.ParDo(
Contributor

Is it still actually variant records?

Member Author

Yes, I mean it is the variant record represented in Avro format (or did I misunderstand your question?).

Contributor

Well, if I understand correctly, it is not something specific to Avro; we basically had the same code here and in variant_to_bigquery (the rows are the same, only the schema changes). I would suggest using a more general name instead of avro_records. But we can leave it for when we refactor the common parts of VariantToAvroFiles and VariantToBigQuery, as you mentioned in one of the TODOs. :)

Member Author

Right, let's leave it for after the refactoring TODO.

  # TODO(bashir2): Add an integration test that outputs to Avro files and
  # also imports to BigQuery. Then import those Avro outputs using the bq
  # tool and verify that the two tables are identical.
  _ = variants | 'FlattenPartitions' >> beam.Flatten() \
Contributor

nit: please use () instead of \.

Contributor

The FlattenPartitions step may already have been done; see lines 266-267.

Member Author

It seems I forgot to address this second comment of yours in my previous pass: yes, the Flatten may have happened before, but IIUC variants is still a list of PCollections, although it may have only one element. Or, to put it differently, do you see any problem with this code right now?

Contributor

If both are called, it may raise an error (something like "Transform does not have a stable unique label").

Member Author

Oh, I see what you mean now; you are talking about the step name, not actually doing the Flatten, right? Good point; I renamed this step.
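
For context, a minimal Beam sketch (a hypothetical pipeline, not this PR's code) of the stable-unique-label issue: each step label may appear only once per pipeline, so a second flatten-like step needs its own name, which is what the rename above addresses.

```python
import apache_beam as beam

with beam.Pipeline() as p:
  shard_a = p | 'CreateA' >> beam.Create([1, 2])
  shard_b = p | 'CreateB' >> beam.Create([3, 4])
  # If a 'FlattenPartitions' step already exists elsewhere in the pipeline,
  # reusing that label fails; hence giving this step a distinct name.
  merged = (shard_a, shard_b) | 'FlattenForAvro' >> beam.Flatten()
  _ = merged | 'FormatElements' >> beam.Map(str)
```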

@bashir2 (Member Author) left a comment

Thanks both for your review.

@arostamianfar (Contributor) left a comment

LGTM!

@bashir2 (Member Author) left a comment

I am submitting this soon as I think all comments have been addressed.

@bashir2 merged commit fd5a2cb into googlegenomics:master on Nov 9, 2018
@bashir2 deleted the avro branch on November 13, 2018