Skip to content

Conversation

@nmousavi
Copy link
Contributor

@nmousavi nmousavi commented Mar 8, 2018

Define SchemaDescriptor class.

Provide serialization and lookup API for Bigquery schema.

Design Doc: https://goo.gl/NXe27p

Tested:
unit test

nmousavi added 3 commits March 8, 2018 14:27
Provide serialization and lookup API for BQ schema fields.
Will be used in bigquery_vcf_schema.py.
Provides serialization and lookup API for type/mode of schema fields.

Tested:
  unit test
@nmousavi nmousavi requested a review from arostamianfar March 8, 2018 19:44
@coveralls
Copy link

coveralls commented Mar 8, 2018

Pull Request Test Coverage Report for Build 455

  • 66 of 68 (97.06%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.1%) to 90.378%

Changes Missing Coverage Covered Lines Changed/Added Lines %
gcp_variant_transforms/libs/bigquery_schema_descriptor_test.py 43 45 95.56%
Totals Coverage Status
Change from base Build 451: 0.1%
Covered Lines: 3325
Relevant Lines: 3679

💛 - Coveralls

allieychen and others added 2 commits March 9, 2018 13:18
* Add multiple query support to integration tests googlegenomics#120.
update all tests in integration small_tests to support multiple query.
update the validate_table in run_tests, add a loop for all test cases in the test file.
changed the required keys for the .json file. Remove "validation_query" and "expected_query_result", and add "test_cases".
Ran ./deploy_and_run_tests.sh and all integration tests passed.

Update the development guide doc (googlegenomics#124)

Update the development guide doc.

Add IntelliJ IDE setup.
Add more details.

Added an INFO message for the full command.

Tested: Ran manually and checked the new log message.

Uses the macros to replace the common queries. (googlegenomics#127)

Define NUM_ROWS, SUM_START, SUM_END in QueryFormatter, and replaces them in the query to avoid duplicate code.

TESTED:
deploy_and_run_tests.

Define SchemaDescriptor class.

Provides serialization and lookup API for type/mode of schema fields.

Tested:
  unit test
@nmousavi nmousavi requested review from bashir2 and removed request for arostamianfar March 12, 2018 19:00
Copy link
Member

@bashir2 bashir2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Nima for creating this wrapper class. Sorry I have only glanced through the design doc but does it worth clarifying why this wrapper is needed. Maybe in the new module or SchemaDescriptor documentation? I suppose you are going to add more functionality to resolve conflicts (e.g., types) and make a TableSchema at the end again, correct?

from collections import namedtuple
from apache_beam.io.gcp.internal.clients import bigquery # pylint: disable=unused-import

__all__ = ['SchemaDescriptor']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, why do you add this? Isn't that better to simply have anything that we don't want to be imported in other modules, start with '_' and have no __all__? Or is there other reasons I don't know?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is a convention adopted for VT (all classes have it). Otherwise, it's not relevant here (at least for now). I have added it here for consistency.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer if we drop this if we don't really need it. For example, I did not add it in processed_variant module because I did not see any reason for it, although I realized it is in other modules. Adding @arostamianfar to make sure there is no specific reason we are missing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Talked offline with Asha. No particular reason to keep it, removed it.

# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""A dict based description for BigQuery's schema."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "BigQuery schema" (please make it consistent with line 23).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

__all__ = ['SchemaDescriptor']

# Stores data about a simple field (not a record) in BigQuery Schema.
FieldDescriptor = namedtuple('FieldDescriptor', ['type', 'mode'])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please check how you can use typing.NamedTuple to declare the type of type and mode (preferred)? If it is not doable in Python 2.7 (i.e., using comments) consider documenting the expected types (not preferred).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

def __init__(self, table_schema):
# type: (bigquery.TableSchema) -> None

# Dict of (field_name, :class:`FieldDescriptor`).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this for documentation only or do you intent to declare the type of this field as well? For typing, it should be defined on the same line or the line after, with # type: prefix in the comment to have some typing effect. And for Dict there is a particular typing format (here is one example).

If you don't want to declare the type of these formally, then please ignore this comment (I think it is useful but up to you).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it was just for documentation.


# Dict of (field_name, :class:`FieldDescriptor`).
self.field_descriptor_dict = {}
# Dict of (record_name, :class:`SchemaDescriptor`).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto :-)

self._string = bigquery_util.TableFieldConstants.TYPE_STRING

self._nullable = bigquery_util.TableFieldConstants.MODE_NULLABLE
self._repeated = bigquery_util.TableFieldConstants.MODE_REPEATED
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Another option instead of these assignments is to import bigquery_util.TableFieldConstants as Consts or something like that. I know the style guide does not approve individual class imports but we already do that in tests for exactly this reason of long names.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

name='record_1', type=self._record, mode=self._repeated,
description='foo desc')
record_field.fields.append(bigquery.TableFieldSchema(
name='record_1-field_1', type=self._boolean, mode=self._nullable,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think in BigQuery, column names should not have "-" according to this. I know it does not matter here since you are testing some other logic but I thought I mention since we have specific functions to convert to BQ compliant field names.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

else:
# Simple field.
self.field_descriptor_dict[field.name] = FieldDescriptor(
type=field.type, mode=field.mode)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it okay that you drop field.description in the SchemaDescriptor representation? I mean wouldn't you need the description later on? (I am not sure how you are going to use this class later, hence asking.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, we don't need desc. FieldDescriptor is used to check if a value for a BigQuery field matches the definition of the field, so we only need to know the type and mode of the field.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, okay, I thought maybe you are going to use this class to create a unified (e.g., after resolving conflicts) TableSchema in next PRs as well, hence wondering about descriptions. I guess once you add a little more detail in class documentation that would be clear.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

schema = bigquery.TableSchema()
schema.fields.append(bigquery.TableFieldSchema(
name='field_1', type=self._string, mode=self._nullable,
description='foo desc'))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you don't care about descriptions, why bother setting them in the test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

self.fail('Non existence field should throw an exceprion')

def test_field_descriptor_at_first_level(self):
print self._get_table_schema()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this print needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ops, deleted!

for field in table_schema.fields:
if field.fields:
# Record field.
self.schema_descriptor_dict[field.name] = SchemaDescriptor(field)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking a little more about this: Would it make more sense to call this class RecordDescriptor instead of SchemaDescriptor?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like SchemaDescriptor better as this class is built from bigquery.TableSchema obj, and basically is a searchable representation of it. I found it also a bit more helpful in my next PR as it's reminds me it's related to schema. Word 'record' is a bit more abused in VT, for example we use Record to refer to a variant line/record in VCF file.

Copy link
Contributor Author

@nmousavi nmousavi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PTAL

bashir2
bashir2 previously approved these changes Mar 14, 2018
Copy link
Member

@bashir2 bashir2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remaining comments are minor, please feel free to submit once they are addressed; well, I guess you need to ping again once you change something, thanks GitHub :-)



# Stores data about a simple field (not a record) in BigQuery Schema.
FieldDescriptor = NamedTuple('FieldDescriptor', [('type', str), ('mode', str)])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sweet! I was not expecting this format for types to work in Python 2.7 :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack!



class SchemaDescriptor(object):
"""A dict based description for :class:`bigquery.TableSchema` object."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mind extending this a little bit (please see my question/comment on the whole PR in my last review)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

else:
# Simple field.
self.field_descriptor_dict[field.name] = FieldDescriptor(
type=field.type, mode=field.mode)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, okay, I thought maybe you are going to use this class to create a unified (e.g., after resolving conflicts) TableSchema in next PRs as well, hence wondering about descriptions. I guess once you add a little more detail in class documentation that would be clear.

# type: (bigquery.TableSchema) -> None

# Dict of (field_name, :class:`FieldDescriptor`).
self.field_descriptor_dict = {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be "private", i.e., start with "_"? ditto next field.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

Tested:
  unit test
Copy link
Contributor Author

@nmousavi nmousavi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PTAL

@nmousavi
Copy link
Contributor Author

Thanks!

@nmousavi nmousavi merged commit 69dd930 into googlegenomics:master Mar 14, 2018
mhsaul pushed a commit to mhsaul/gcp-variant-transforms that referenced this pull request Mar 29, 2018
* Define class SchemaDescriptor.

Tested:
  unit test
mhsaul pushed a commit to mhsaul/gcp-variant-transforms that referenced this pull request Mar 29, 2018
* Define class SchemaDescriptor.

Tested:
  unit test
@nmousavi nmousavi deleted the schema-desc branch April 26, 2018 18:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants