Conversation

@bashir2 (Member) commented Nov 6, 2018

Issue: #404

Tested: Unit tests added for the core Avro schema generation (and deploy_and_run_tests.sh passed). Also generated Avro output files for some sample VCFs, uploaded the output to BigQuery using the bq tool, and compared the results against a BigQuery table generated directly by Variant Transforms.

@coveralls commented Nov 6, 2018

Pull Request Test Coverage Report for Build 1448

  • 140 of 165 (84.85%) changed or added relevant lines in 9 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.07%) to 87.614%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| gcp_variant_transforms/libs/bigquery_util.py | 16 | 17 | 94.12% |
| gcp_variant_transforms/libs/schema_converter.py | 35 | 37 | 94.59% |
| gcp_variant_transforms/options/variant_transform_options.py | 4 | 8 | 50.0% |
| gcp_variant_transforms/transforms/variant_to_avro.py | 15 | 24 | 62.5% |
| gcp_variant_transforms/vcf_to_bq.py | 1 | 10 | 10.0% |

Totals (Coverage Status):
  • Change from base Build 1435: -0.07%
  • Covered Lines: 6423
  • Relevant Lines: 7331

💛 - Coveralls

@arostamianfar (Contributor) left a comment

Thanks, Bashir! Looks great! Just a few nits...

        bigquery_type))
  t = _BIG_QUERY_TYPE_TO_AVRO_TYPE_MAP[bigquery_type]
  if bigquery_mode == TableFieldConstants.MODE_NULLABLE:
    return [t, AvroConstants.NULL]
Contributor

Please add a brief comment about why nullable types become a list rather than a string. I realize you've explained it in the other module, but it's difficult to connect the dots here.

Member Author

Done.
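
For readers following along, a minimal sketch of the idea behind the comment that was added: in Avro, an optional value is a union of the value type and "null", and a union is written in a JSON Avro schema as a list of types, so a NULLABLE BigQuery field maps to a list rather than a single type name. The map and helper below are illustrative stand-ins, not the module's actual `_BIG_QUERY_TYPE_TO_AVRO_TYPE_MAP` or function.

```python
# Illustrative BigQuery-type -> Avro-type map (placeholder values).
_ILLUSTRATIVE_BQ_TO_AVRO = {'INTEGER': 'long', 'STRING': 'string'}

def _avro_type_for(bigquery_type, bigquery_mode):
  avro_type = _ILLUSTRATIVE_BQ_TO_AVRO[bigquery_type]
  if bigquery_mode == 'NULLABLE':
    return [avro_type, 'null']  # a union, e.g. ["long", "null"]
  return avro_type  # a plain type name, e.g. "long"

print(_avro_type_for('INTEGER', 'NULLABLE'))  # ['long', 'null']
```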

  # records in the array is f.name. Make sure this is according to Avro
  # spec then remove this TODO.
  field_dict[bigquery_util.AvroConstants.NAME] = f.name
  field_dict[bigquery_util.AvroConstants.TYPE] = \
Contributor

nit: please use brackets instead of \ here and everywhere.

Member Author

Done, but is this a style requirement?
I am asking because () is also used to define tuples in Python, and I find it a little confusing when it is used for line breaks (although I have done it myself before). My preference is to break inside brackets that already exist (e.g., before bigquery_util on this line), but in this case that adds an extra line.

Contributor

Yes, it's required by the style guide: https://github.com/google/styleguide/blob/gh-pages/pyguide.md#32-line-length
"Do not use backslash line continuation except for with statements requiring three or more context managers."

Member Author

I see, thanks for the pointer.
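
As an aside for readers, a small illustration of the two continuation styles being discussed; the helper name and arguments are placeholders, not this PR's code.

```python
def get_avro_type(bigquery_type, mode):  # placeholder stand-in for the PR's helper
  return [bigquery_type.lower(), 'null'] if mode == 'NULLABLE' else bigquery_type.lower()

# Backslash continuation (discouraged by the style guide).
field_type = \
    get_avro_type('INTEGER', 'NULLABLE')

# Wrapping the right-hand side in parentheses (preferred).
field_type = (
    get_avro_type('INTEGER', 'NULLABLE'))

# Or breaking inside brackets that already exist in the expression.
field_type = get_avro_type(
    'INTEGER', 'NULLABLE')
```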

    # type: (argparse.Namespace, bigquery.BigqueryV2) -> None
    if not parsed_args.output_table and not parsed_args.output_avro_path:
      raise ValueError('At least one of --output_table or --output_avro_path '
                       'options should be provided')
Contributor

Suggested change:
- 'options should be provided')
+ 'options should be provided.')

Member Author

Done (for some reason it does not let me "Apply suggestion", but I think in general I have to make the change in my local version anyway, otherwise it will mess up my git flow).

Contributor

Ah, good to know... I figured it's easier for these kinds of nits to just make the change as a reviewer rather than going back and forth. Anyhow, this is a beta feature from GitHub, so hopefully they'll address these problems.

  if not bigquery_type in _BIG_QUERY_TYPE_TO_AVRO_TYPE_MAP:
    raise ValueError('Unknown Avro equivalent for type {}'.format(
        bigquery_type))
  t = _BIG_QUERY_TYPE_TO_AVRO_TYPE_MAP[bigquery_type]
Contributor

nit: in general, we've tried to avoid short-form variable names (e.g., t, f) in our code as they make it harder to read (the only exception has been one-line for loops). Consider renaming these to avro_type etc.

Member Author

Done (this was a local variable with a small scope, hence the original short name).

class GenerateSchemaFromHeaderFieldsTest(unittest.TestCase):
  """Test cases for the ``generate_schema_from_header_fields`` function."""

  def _validation_hook(self, expected_fields, actual_schema):
Contributor

nit: why not just name this _validate_schema, as it's more obvious what it's validating (unless you're trying to validate more stuff, but the comments say otherwise).

Member Author

Done.
The term hook was there to make it clear that sub-classes are intended to override this function; so yes, the sub-class implementations may choose to do more, as ConvertTableSchemaToJsonAvroSchemaTest does. Changed it to what you suggested, since apparently the name did not convey that message.
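
For context, a rough sketch of the hook/override pattern being described; the class and method names here are hypothetical, not the PR's actual tests.

```python
import unittest

class BaseSchemaTest(unittest.TestCase):
  """Defines the test methods once; subclasses override _validate_schema."""

  def _validate_schema(self, expected_fields, actual_schema):
    self.assertEqual(sorted(expected_fields), sorted(actual_schema))

  def test_simple_schema(self):
    self._validate_schema(['end', 'start'], {'start': 'long', 'end': 'long'})

class AvroSchemaTest(BaseSchemaTest):
  """Reuses the inherited test methods but adds extra validation."""

  def _validate_schema(self, expected_fields, actual_schema):
    super(AvroSchemaTest, self)._validate_schema(expected_fields, actual_schema)
    # A subclass may validate more here, e.g. an Avro rendering of the schema.

if __name__ == '__main__':
  unittest.main()
```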


class ConvertTableSchemaToJsonAvroSchemaTest(
    GenerateSchemaFromHeaderFieldsTest):
  """ Test cases for `convert_table_schema_to_json_avro_schema`.
Contributor

Suggested change:
- """ Test cases for `convert_table_schema_to_json_avro_schema`.
+ """Test cases for `convert_table_schema_to_json_avro_schema`.

def _convert_schema_to_avro_dict(schema):
  # type: (bigquery.TableSchema) -> Dict
  fields_dict = {}
  # TODO(bashir2): Check if we need `namespace` and `name` at the top level.
Contributor

I assume this means the TBD field will also be removed? :)

Member Author

Not really; without this name the Avro schema parser fails, but I am not clear what the implications of this name (and namespace) are, hence the TODO.
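
For context, a hedged sketch of the kind of top-level wrapper under discussion: per the Avro spec a record schema requires a `name` and may carry a `namespace`. The field names and values below are illustrative placeholders, not this PR's actual schema.

```python
import json

avro_schema = {
    'type': 'record',
    'name': 'variants',           # required for record schemas
    'namespace': 'example.avro',  # optional; hypothetical value
    'fields': [
        {'name': 'reference_name', 'type': ['string', 'null']},
        {'name': 'start_position', 'type': ['long', 'null']},
    ],
}
print(json.dumps(avro_schema, indent=2))
```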

@allieychen (Contributor) left a comment

Thank you, Bashir! It looks very neat. I added a few comments. :)

  return schema


def _convert_repeated_field_to_avro_array(f, fields_list):
Contributor

nit: please add the type hints.

Member Author

Done.

        help='The output path to write Avro files under.')

  def validate(self, parsed_args):
    # type: (argparse.Namespace, bigquery.BigqueryV2) -> None
Contributor

Please remove `, bigquery.BigqueryV2`.

Member Author

Done.

"""Initializes the transform.
Args:
output_table: The path under which output Avro files are generated.
Contributor

s/output_table/output_path

Member Author

Done; thanks for catching.

      output_table: The path under which output Avro files are generated.
      header_fields: Representative header fields for all variants. This is
        needed for dynamically generating the schema.
      variant_merger: The strategy used for merging variants (if any). Some
Contributor

nit: Please switch the order of variant_merger and proc_var_factory.

Member Author

Done.
As a side note, proc_var_factory is a required argument both here and in the VariantToBigQuery class; that's why I dropped the default None. I added a TODO in the other class too, to fix it in a future PR.

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
Contributor

I think it is better to rename the script (maybe something like schema_converter), or move the conversions between BigQuery schema and Avro schema to another file.

Member Author

Done.
Yeah, I was thinking about this too when I made changes to this file. I decided against separate modules (because I think the new functionality is part of the various schema conversions: VCF->BQ, BQ->VCF, BQ->Avro, ...), but I agree the name should be more generic, which is what I just changed it to.

          bigquery_util.get_avro_type_from_bigquery_type_mode(f.type, f.mode)
  }
  # All repeated fields are nullable.
  return [bigquery_util.AvroConstants.NULL, array_dict]
Contributor

I don't know whether the order matters, but from the example shown below, I feel array_dict should come before null.

Member Author

I don't think the order matters.
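
For readers, a small sketch of the repeated-field case in question, with illustrative values only. As far as I know, union branch order in Avro only matters when a field declares a default value (the default must match the first branch); otherwise the two orderings are equivalent.

```python
# A REPEATED BigQuery field becomes an Avro array wrapped in a union with
# "null"; the array items here are themselves nullable (illustrative types).
array_dict = {
    'type': 'array',
    'items': ['long', 'null'],
}
nullable_array = ['null', array_dict]   # the order used in the PR
equivalent_form = [array_dict, 'null']  # also a valid union
```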


from collections import OrderedDict
from typing import Any, Dict, Union # pylint: disable=unused-import
import json
Contributor

nit: change the order of `from collections import OrderedDict` and `import json`, `import logging`.

Member Author

Hmm, is this a style requirement? My reading of the import formatting section tells me that this is the right order. The importorder tool also produces the same order as here, even if I move collections after logging. I think the correct fix/change here is to use collections directly instead of importing OrderedDict from it (style guide section), but that import predates this PR and I prefer not to touch it in this PR (it would also require changing several lines below).

Contributor

My guideline is as simple as "Imports should be grouped with the order being most generic to least generic." But I agree with you about the correct fix/change.
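
For completeness, a tiny sketch of the "correct fix" mentioned above, i.e. importing the module rather than the class so the standard-library block stays plainly alphabetized; the usage below is illustrative only.

```python
import collections
import json
import logging

# Using the module directly avoids the from-import ordering question.
counts = collections.OrderedDict([('snv', 3), ('indel', 1)])
logging.info('Variant counts: %s', json.dumps(counts))
```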


class ConvertTableSchemaToJsonAvroSchemaTest(
    GenerateSchemaFromHeaderFieldsTest):
  """ Test cases for `convert_table_schema_to_json_avro_schema`.
Contributor

nit: remove the leading space.

Member Author

Done.

    self._omit_empty_sample_calls = omit_empty_sample_calls

  def expand(self, pcoll):
    avro_records = pcoll | 'ConvertToAvroRecords' >> beam.ParDo(
Contributor

Is it still actually variant records?

Member Author

Yes, I mean it is the variant record represented in Avro format (or did I misunderstand your question?).

Contributor

Well, if I understand correctly, it is not something specific to Avro; we basically had the same code here and in variant_to_bigquery (the rows are the same, only the schema changes). I would suggest using a more general name instead of avro_records. But we can leave it for when we refactor the common parts of VariantToAvroFiles and VariantToBigQuery, as you mentioned in one of the TODOs. :)

Member Author

Right, let's leave it for after the refactoring TODO.

  # TODO(bashir2): Add an integration test that outputs to Avro files and
  # also imports to BigQuery. Then import those Avro outputs using the bq
  # tool and verify that the two tables are identical.
  _ = variants | 'FlattenPartitions' >> beam.Flatten() \
Contributor

nit: please use () instead of \.

Contributor

The FlattenPartitions step may already have been done; see lines 266-267.

Member Author

It seems I forgot to address this second comment of yours in my previous pass: yes, the Flatten may have happened before, but IIUC variants is still a list of PCollections, although it may have only one element. Or, to put it differently, do you see any problem with this code right now?

Contributor

If both are called, it may raise an error (something like "Transform does not have a stable unique label").

Member Author

Oh, I see what you mean now; you are talking about the step name, not actually doing the Flatten, right? Good point; I renamed this step.
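
For context, a minimal Beam sketch (a hypothetical pipeline, not this PR's code) of the stable-unique-label issue: each step label may appear only once per pipeline, so a second flatten-like step needs its own name, which is what the rename above addresses.

```python
import apache_beam as beam

with beam.Pipeline() as p:
  shard_a = p | 'CreateA' >> beam.Create([1, 2])
  shard_b = p | 'CreateB' >> beam.Create([3, 4])
  # If a 'FlattenPartitions' step already exists elsewhere in the pipeline,
  # reusing that label fails; hence giving this step a distinct name.
  merged = (shard_a, shard_b) | 'FlattenForAvro' >> beam.Flatten()
  _ = merged | 'FormatElements' >> beam.Map(str)
```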

@bashir2 (Member Author) left a comment

Thanks both for your review.

@arostamianfar (Contributor) left a comment

LGTM!

@bashir2 (Member Author) left a comment

I am submitting this soon as I think all comments have been addressed.

@bashir2 merged commit fd5a2cb into googlegenomics:master on Nov 9, 2018
@bashir2 deleted the avro branch on November 13, 2018