SchemaDescriptor class #134

nmousavi · 2018-03-08T19:43:49Z

Define SchemaDescriptor class.

Provide serialization and lookup API for BigQuery schema.

Design Doc: https://goo.gl/NXe27p

Tested:
unit test

Provide serialization and lookup API for BQ schema fields. Will be used in bigquery_vcf_schema.py.

Provides serialization and lookup API for type/mode of schema fields. Tested: unit test

coveralls · 2018-03-08T19:48:45Z

Pull Request Test Coverage Report for Build 455

66 of 68 (97.06%) changed or added relevant lines in 2 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage increased (+0.1%) to 90.378%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
gcp_variant_transforms/libs/bigquery_schema_descriptor_test.py	43	45	95.56%

Totals
Change from base Build 451:	0.1%
Covered Lines:	3325
Relevant Lines:	3679

💛 - Coveralls

* Add multiple query support to integration tests googlegenomics#120.
  Update all tests in integration small_tests to support multiple queries. Update validate_table in run_tests and add a loop over all test cases in the test file. Change the required keys for the .json file: remove "validation_query" and "expected_query_result", and add "test_cases". Ran ./deploy_and_run_tests.sh and all integration tests passed.
* Update the development guide doc (googlegenomics#124)
  Update the development guide doc. Add IntelliJ IDE setup. Add more details.
* Added an INFO message for the full command.
  Tested: Ran manually and checked the new log message.
* Uses the macros to replace the common queries. (googlegenomics#127)
  Define NUM_ROWS, SUM_START, SUM_END in QueryFormatter, and replace them in the query to avoid duplicate code. Tested: deploy_and_run_tests.
* Define SchemaDescriptor class.
  Provides serialization and lookup API for type/mode of schema fields. Tested: unit test

bashir2

Thanks Nima for creating this wrapper class. Sorry, I have only glanced through the design doc, but is it worth clarifying why this wrapper is needed, maybe in the new module or the SchemaDescriptor documentation? I suppose you are going to add more functionality to resolve conflicts (e.g., types) and make a TableSchema at the end again, correct?

bashir2 · 2018-03-14T01:37:35Z

gcp_variant_transforms/libs/bigquery_schema_descriptor.py

+from collections import namedtuple
+from apache_beam.io.gcp.internal.clients import bigquery  # pylint: disable=unused-import
+
+__all__ = ['SchemaDescriptor']


Out of curiosity, why do you add this? Isn't it better to simply have anything that we don't want imported in other modules start with '_' and have no __all__? Or are there other reasons I don't know of?

I guess this is a convention adopted for VT (all classes have it). Otherwise, it's not relevant here (at least for now). I have added it here for consistency.

I would prefer to drop this if we don't really need it. For example, I did not add it in the processed_variant module because I did not see any reason for it, although I realized it is in other modules. Adding @arostamianfar to make sure there is no specific reason we are missing.

Talked offline with Asha. No particular reason to keep it, removed it.
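
For reference, the two conventions discussed in this thread give the same result for wildcard imports. The sketch below uses throwaway in-memory modules; `with_all`, `with_underscore`, `Helper`, `make_module`, and `star_import` are hypothetical names for illustration and are not part of this PR.

```python
import sys
import types


def make_module(name, source):
  """Registers a throwaway in-memory module (illustration only)."""
  module = types.ModuleType(name)
  exec(source, module.__dict__)
  sys.modules[name] = module
  return module


def star_import(name):
  """Returns the names that `from <name> import *` would bind."""
  namespace = {}
  exec('from %s import *' % name, namespace)
  return sorted(key for key in namespace if key != '__builtins__')


# Convention 1: an explicit __all__ lists the public names.
make_module('with_all', (
    "__all__ = ['SchemaDescriptor']\n"
    "class SchemaDescriptor(object): pass\n"
    "class Helper(object): pass\n"))

# Convention 2: no __all__; private names start with '_'.
make_module('with_underscore', (
    "class SchemaDescriptor(object): pass\n"
    "class _Helper(object): pass\n"))

names_with_all = star_import('with_all')
names_with_underscore = star_import('with_underscore')
```

In both cases only `SchemaDescriptor` is exported by a wildcard import, which is consistent with the conclusion here that `__all__` adds nothing when private names already carry a leading underscore.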

bashir2 · 2018-03-14T01:38:48Z

gcp_variant_transforms/libs/bigquery_schema_descriptor.py

+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""A dict based description for BigQuery's schema."""


nit: "BigQuery schema" (please make it consistent with line 23).

bashir2 · 2018-03-14T01:42:28Z

gcp_variant_transforms/libs/bigquery_schema_descriptor.py

+__all__ = ['SchemaDescriptor']
+
+# Stores data about a simple field (not a record) in BigQuery Schema.
+FieldDescriptor = namedtuple('FieldDescriptor', ['type', 'mode'])


Can you please check how you can use typing.NamedTuple to declare the types of type and mode (preferred)? If it is not doable in Python 2.7 (i.e., using comments), consider documenting the expected types (not preferred).

bashir2 · 2018-03-14T01:57:51Z

gcp_variant_transforms/libs/bigquery_schema_descriptor.py

+  def __init__(self, table_schema):
+    # type: (bigquery.TableSchema) -> None
+
+    # Dict of (field_name, :class:`FieldDescriptor`).


Is this for documentation only, or do you intend to declare the type of this field as well? For typing, it should be defined on the same line or the line after, with a # type: prefix in the comment, to have any typing effect. For Dict there is a particular typing format (here is one example).

If you don't want to declare the type of these formally, then please ignore this comment (I think it is useful but up to you).

Yeah it was just for documentation.
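
For reference, a minimal sketch of what formally typed attributes could look like in the Python 2.7 compatible comment syntax. The constructor is simplified (it takes no `table_schema`), and the `str` key type is an assumption; the attribute names come from the diff above.

```python
from collections import namedtuple
from typing import Dict  # Only used inside the type comments below.

FieldDescriptor = namedtuple('FieldDescriptor', ['type', 'mode'])


class SchemaDescriptor(object):

  def __init__(self):
    # type: () -> None
    # Maps a simple field name to its type/mode.
    self.field_descriptor_dict = {}  # type: Dict[str, FieldDescriptor]
    # Maps a record field name to the descriptor of its sub-fields.
    self.schema_descriptor_dict = {}  # type: Dict[str, SchemaDescriptor]


descriptor = SchemaDescriptor()
descriptor.field_descriptor_dict['field_1'] = FieldDescriptor(
    type='STRING', mode='NULLABLE')
```

The comments have no effect at runtime; they are only read by a static type checker.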

bashir2 · 2018-03-14T02:03:52Z

gcp_variant_transforms/libs/bigquery_schema_descriptor.py

+
+    # Dict of (field_name, :class:`FieldDescriptor`).
+    self.field_descriptor_dict = {}
+    # Dict of (record_name, :class:`SchemaDescriptor`).


bashir2 · 2018-03-14T02:23:43Z

gcp_variant_transforms/libs/bigquery_schema_descriptor_test.py

+    self._string = bigquery_util.TableFieldConstants.TYPE_STRING
+
+    self._nullable = bigquery_util.TableFieldConstants.MODE_NULLABLE
+    self._repeated = bigquery_util.TableFieldConstants.MODE_REPEATED


nit: Another option instead of these assignments is to import bigquery_util.TableFieldConstants as Consts or something like that. I know the style guide does not approve individual class imports but we already do that in tests for exactly this reason of long names.

bashir2 · 2018-03-14T02:28:16Z

gcp_variant_transforms/libs/bigquery_schema_descriptor_test.py

+        name='record_1', type=self._record, mode=self._repeated,
+        description='foo desc')
+    record_field.fields.append(bigquery.TableFieldSchema(
+        name='record_1-field_1', type=self._boolean, mode=self._nullable,


nit: I think in BigQuery, column names should not contain "-" according to this. I know it does not matter here since you are testing some other logic, but I thought I'd mention it since we have specific functions to convert to BQ-compliant field names.
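
To illustrate the naming point, here is a minimal sketch that assumes the classic BigQuery rule: column names contain only letters, digits, and underscores, and start with a letter or underscore. `is_valid_column_name` and `sanitize_column_name` are hypothetical helpers written for this example; they are not the project's own conversion functions.

```python
import re

# Assumed rule: letters, digits, underscores; first char is not a digit.
_VALID_NAME = re.compile(r'^[A-Za-z_][A-Za-z0-9_]*$')


def is_valid_column_name(name):
  """Returns True if `name` satisfies the assumed naming rule."""
  return bool(_VALID_NAME.match(name))


def sanitize_column_name(name):
  """Naively replaces disallowed characters with underscores."""
  sanitized = re.sub(r'[^A-Za-z0-9_]', '_', name)
  if not re.match(r'[A-Za-z_]', sanitized):
    # Covers names that are empty or start with a digit.
    sanitized = '_' + sanitized
  return sanitized


fixed_name = sanitize_column_name('record_1-field_1')
```

Under that assumption, the test's `record_1-field_1` would be rejected, and a name such as `record_1_field_1` would pass.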

bashir2 · 2018-03-14T02:32:09Z

gcp_variant_transforms/libs/bigquery_schema_descriptor.py

+      else:
+        # Simple field.
+        self.field_descriptor_dict[field.name] = FieldDescriptor(
+            type=field.type, mode=field.mode)


Is it okay that you drop field.description in the SchemaDescriptor representation? I mean wouldn't you need the description later on? (I am not sure how you are going to use this class later, hence asking.)

Correct, we don't need desc. FieldDescriptor is used to check if a value for a BigQuery field matches the definition of the field, so we only need to know the type and mode of the field.

I see, okay, I thought maybe you are going to use this class to create a unified (e.g., after resolving conflicts) TableSchema in next PRs as well, hence wondering about descriptions. I guess once you add a little more detail in class documentation that would be clear.
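
To make the intended use concrete, here is a minimal sketch of the idea described in this thread: walk the schema once, keep only the type and mode of simple fields, and recurse into record fields. `FakeField` is a stand-in for the `bigquery.TableSchema` and `bigquery.TableFieldSchema` objects so that the example runs without Apache Beam, and the method names `get_field_descriptor` and `get_record_schema_descriptor`, as well as the `ValueError` type, are assumptions rather than the PR's confirmed API.

```python
from collections import namedtuple

FieldDescriptor = namedtuple('FieldDescriptor', ['type', 'mode'])


class FakeField(object):
  """Stand-in for a BigQuery schema or field object (illustration only)."""

  def __init__(self, name=None, type=None, mode=None, description=None,
               fields=None):
    self.name = name
    self.type = type
    self.mode = mode
    self.description = description
    self.fields = fields or []


class SchemaDescriptor(object):
  """Searchable type/mode view of a (possibly nested) schema."""

  def __init__(self, table_schema):
    self.field_descriptor_dict = {}
    self.schema_descriptor_dict = {}
    for field in table_schema.fields:
      if field.fields:
        # Record field: describe its sub-fields recursively.
        self.schema_descriptor_dict[field.name] = SchemaDescriptor(field)
      else:
        # Simple field: the description is intentionally dropped.
        self.field_descriptor_dict[field.name] = FieldDescriptor(
            type=field.type, mode=field.mode)

  def get_field_descriptor(self, field_name):
    if field_name not in self.field_descriptor_dict:
      raise ValueError('Field descriptor not found: %s' % field_name)
    return self.field_descriptor_dict[field_name]

  def get_record_schema_descriptor(self, record_name):
    if record_name not in self.schema_descriptor_dict:
      raise ValueError('Record descriptor not found: %s' % record_name)
    return self.schema_descriptor_dict[record_name]


schema = FakeField(fields=[
    FakeField(name='field_1', type='STRING', mode='NULLABLE',
              description='foo desc'),
    FakeField(name='record_1', type='RECORD', mode='REPEATED', fields=[
        FakeField(name='record_1_field_1', type='BOOLEAN', mode='NULLABLE'),
    ]),
])
descriptor = SchemaDescriptor(schema)
nested = descriptor.get_record_schema_descriptor('record_1')
```

Because only type and mode are stored, a descriptor like this can check whether a value fits a field definition, but it cannot be turned back into a full TableSchema with descriptions.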

bashir2 · 2018-03-14T02:32:51Z

gcp_variant_transforms/libs/bigquery_schema_descriptor_test.py

+    schema = bigquery.TableSchema()
+    schema.fields.append(bigquery.TableFieldSchema(
+        name='field_1', type=self._string, mode=self._nullable,
+        description='foo desc'))


If you don't care about descriptions, why bother setting them in the test?

bashir2 · 2018-03-14T02:34:45Z

gcp_variant_transforms/libs/bigquery_schema_descriptor_test.py

+      self.fail('Non existence field should throw an exceprion')
+
+  def test_field_descriptor_at_first_level(self):
+    print self._get_table_schema()


Is this print needed?

Oops, deleted!
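
As an aside on the `self.fail(...)` line in this hunk: a missing-field check of this kind can also be written with `assertRaises`, which fails on its own when no exception is raised. A minimal sketch follows; `lookup` and the `ValueError` type are assumptions standing in for the descriptor's lookup method.

```python
import unittest


def lookup(fields, name):
  """Hypothetical lookup that raises for unknown field names."""
  if name not in fields:
    raise ValueError('Field not found: %s' % name)
  return fields[name]


class LookupTest(unittest.TestCase):

  def test_non_existent_field_raises(self):
    # The test fails automatically if no ValueError is raised.
    with self.assertRaises(ValueError):
      lookup({'field_1': 'STRING'}, 'no_such_field')

  def test_existing_field(self):
    self.assertEqual(lookup({'field_1': 'STRING'}, 'field_1'), 'STRING')


suite = unittest.TestLoader().loadTestsFromTestCase(LookupTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```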

bashir2 · 2018-03-14T05:05:17Z

gcp_variant_transforms/libs/bigquery_schema_descriptor.py

+    for field in table_schema.fields:
+      if field.fields:
+        # Record field.
+        self.schema_descriptor_dict[field.name] = SchemaDescriptor(field)


Thinking a little more about this: Would it make more sense to call this class RecordDescriptor instead of SchemaDescriptor?

I like SchemaDescriptor better, as this class is built from a bigquery.TableSchema object and is basically a searchable representation of it. I also found it a bit more helpful in my next PR, as it reminds me that it is related to the schema. The word 'record' is already a bit overloaded in VT; for example, we use Record to refer to a variant line/record in a VCF file.

nmousavi

PTAL

bashir2

Remaining comments are minor, please feel free to submit once they are addressed; well, I guess you need to ping again once you change something, thanks GitHub :-)

bashir2 · 2018-03-14T21:11:40Z

gcp_variant_transforms/libs/bigquery_schema_descriptor.py

+
+
+# Stores data about a simple field (not a record) in BigQuery Schema.
+FieldDescriptor = NamedTuple('FieldDescriptor', [('type', str), ('mode', str)])


Sweet! I was not expecting this format for types to work in Python 2.7 :-)
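
For readers unfamiliar with this form: the functional `typing.NamedTuple` syntax takes a list of `(name, type)` pairs, so it does not need Python 3 class-body annotations. A minimal sketch of how the resulting type behaves (the field values are illustrative):

```python
from typing import NamedTuple

FieldDescriptor = NamedTuple('FieldDescriptor', [('type', str), ('mode', str)])

# Instances behave like ordinary named tuples.
descriptor = FieldDescriptor(type='STRING', mode='NULLABLE')

# _replace returns a new tuple; the original is unchanged.
repeated = descriptor._replace(mode='REPEATED')
```

The declared types are not enforced at runtime; they only matter to a static type checker.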

bashir2 · 2018-03-14T21:12:50Z

gcp_variant_transforms/libs/bigquery_schema_descriptor.py

+
+
+class SchemaDescriptor(object):
+  """A dict based description for :class:`bigquery.TableSchema` object."""


Do you mind extending this a little bit (please see my question/comment on the whole PR in my last review)?

bashir2 · 2018-03-14T21:14:16Z

gcp_variant_transforms/libs/bigquery_schema_descriptor.py

+      else:
+        # Simple field.
+        self.field_descriptor_dict[field.name] = FieldDescriptor(
+            type=field.type, mode=field.mode)


I see, okay, I thought maybe you are going to use this class to create a unified (e.g., after resolving conflicts) TableSchema in next PRs as well, hence wondering about descriptions. I guess once you add a little more detail in class documentation that would be clear.

bashir2 · 2018-03-14T21:15:32Z

gcp_variant_transforms/libs/bigquery_schema_descriptor.py

+    # type: (bigquery.TableSchema) -> None
+
+    # Dict of (field_name, :class:`FieldDescriptor`).
+    self.field_descriptor_dict = {}


Should this be "private", i.e., start with "_"? ditto next field.

Tested: unit test

nmousavi

PTAL

nmousavi · 2018-03-14T22:01:42Z

Thanks!

* Define class SchemaDescriptor. Tested: unit test

nmousavi added 3 commits March 8, 2018 14:27

Define class SchemaDescriptor.

935be77

Tested: unit test

Define SchemaDescriptor class.

d7c7851

Provide serialization and lookup API for BQ schema fields. Will be used in bigquery_vcf_schema.py.

Define SchemaDescriptor class.

57f5408

Provides serialization and lookup API for type/mode of schema fields. Tested: unit test

nmousavi requested a review from arostamianfar March 8, 2018 19:44

allieychen and others added 2 commits March 9, 2018 13:18

Merge branch 'master' into schema-desc

89f968a

nmousavi requested review from bashir2 and removed request for arostamianfar March 12, 2018 19:00

bashir2 reviewed Mar 14, 2018

nmousavi force-pushed the schema-desc branch from 1ae91e7 to f4f993d Compare March 14, 2018 19:14

nmousavi commented Mar 14, 2018

Merge branch 'schema-desc' of https://github.com/nmousavi/gcp-variant…

646fce8

…-transforms into schema-desc

nmousavi force-pushed the schema-desc branch from f4f993d to 646fce8 Compare March 14, 2018 19:23

bashir2 previously approved these changes Mar 14, 2018

SchemaDescriptor class

eed9e71

Tested: unit test

nmousavi commented Mar 14, 2018

nmousavi dismissed bashir2’s stale review via eed9e71 March 14, 2018 21:55

bashir2 approved these changes Mar 14, 2018

Merge branch 'master' into schema-desc

f2d4634

nmousavi merged commit 69dd930 into googlegenomics:master Mar 14, 2018

mhsaul pushed a commit to mhsaul/gcp-variant-transforms that referenced this pull request Mar 29, 2018

SchemaDescriptor class (googlegenomics#134)

59f2d0e

* Define class SchemaDescriptor. Tested: unit test

mhsaul pushed a commit to mhsaul/gcp-variant-transforms that referenced this pull request Mar 29, 2018

SchemaDescriptor class (googlegenomics#134)

476bde5

* Define class SchemaDescriptor. Tested: unit test

nmousavi deleted the schema-desc branch April 26, 2018 18:10


