-
Notifications
You must be signed in to change notification settings - Fork 59
SchemaDescriptor class #134
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Tested: unit test
Provide serialization and lookup API for BQ schema fields. Will be used in bigquery_vcf_schema.py.
Provides serialization and lookup API for type/mode of schema fields. Tested: unit test
Pull Request Test Coverage Report for Build 455
💛 - Coveralls |
* Add multiple query support to integration tests googlegenomics#120. update all tests in integration small_tests to support multiple query. update the validate_table in run_tests, add a loop for all test cases in the test file. changed the required keys for the .json file. Remove "validation_query" and "expected_query_result", and add "test_cases". Ran ./deploy_and_run_tests.sh and all integration tests passed. Update the development guide doc (googlegenomics#124) Update the development guide doc. Add IntelliJ IDE setup. Add more details. Added an INFO message for the full command. Tested: Ran manually and checked the new log message. Uses the macros to replace the common queries. (googlegenomics#127) Define NUM_ROWS, SUM_START, SUM_END in QueryFormatter, and replaces them in the query to avoid duplicate code. TESTED: deploy_and_run_tests. Define SchemaDescriptor class. Provides serialization and lookup API for type/mode of schema fields. Tested: unit test
bashir2
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Nima for creating this wrapper class. Sorry I have only glanced through the design doc but does it worth clarifying why this wrapper is needed. Maybe in the new module or SchemaDescriptor documentation? I suppose you are going to add more functionality to resolve conflicts (e.g., types) and make a TableSchema at the end again, correct?
| from collections import namedtuple | ||
| from apache_beam.io.gcp.internal.clients import bigquery # pylint: disable=unused-import | ||
|
|
||
| __all__ = ['SchemaDescriptor'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity, why do you add this? Isn't that better to simply have anything that we don't want to be imported in other modules, start with '_' and have no __all__? Or is there other reasons I don't know?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this is a convention adopted for VT (all classes have it). Otherwise, it's not relevant here (at least for now). I have added it here for consistency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer if we drop this if we don't really need it. For example, I did not add it in processed_variant module because I did not see any reason for it, although I realized it is in other modules. Adding @arostamianfar to make sure there is no specific reason we are missing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Talked offline with Asha. No particular reason to keep it, removed it.
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
| """A dict based description for BigQuery's schema.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: "BigQuery schema" (please make it consistent with line 23).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
| __all__ = ['SchemaDescriptor'] | ||
|
|
||
| # Stores data about a simple field (not a record) in BigQuery Schema. | ||
| FieldDescriptor = namedtuple('FieldDescriptor', ['type', 'mode']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please check how you can use typing.NamedTuple to declare the type of type and mode (preferred)? If it is not doable in Python 2.7 (i.e., using comments) consider documenting the expected types (not preferred).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
| def __init__(self, table_schema): | ||
| # type: (bigquery.TableSchema) -> None | ||
|
|
||
| # Dict of (field_name, :class:`FieldDescriptor`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this for documentation only or do you intent to declare the type of this field as well? For typing, it should be defined on the same line or the line after, with # type: prefix in the comment to have some typing effect. And for Dict there is a particular typing format (here is one example).
If you don't want to declare the type of these formally, then please ignore this comment (I think it is useful but up to you).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah it was just for documentation.
|
|
||
| # Dict of (field_name, :class:`FieldDescriptor`). | ||
| self.field_descriptor_dict = {} | ||
| # Dict of (record_name, :class:`SchemaDescriptor`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto :-)
| self._string = bigquery_util.TableFieldConstants.TYPE_STRING | ||
|
|
||
| self._nullable = bigquery_util.TableFieldConstants.MODE_NULLABLE | ||
| self._repeated = bigquery_util.TableFieldConstants.MODE_REPEATED |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Another option instead of these assignments is to import bigquery_util.TableFieldConstants as Consts or something like that. I know the style guide does not approve individual class imports but we already do that in tests for exactly this reason of long names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
| name='record_1', type=self._record, mode=self._repeated, | ||
| description='foo desc') | ||
| record_field.fields.append(bigquery.TableFieldSchema( | ||
| name='record_1-field_1', type=self._boolean, mode=self._nullable, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think in BigQuery, column names should not have "-" according to this. I know it does not matter here since you are testing some other logic but I thought I mention since we have specific functions to convert to BQ compliant field names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
| else: | ||
| # Simple field. | ||
| self.field_descriptor_dict[field.name] = FieldDescriptor( | ||
| type=field.type, mode=field.mode) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it okay that you drop field.description in the SchemaDescriptor representation? I mean wouldn't you need the description later on? (I am not sure how you are going to use this class later, hence asking.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct, we don't need desc. FieldDescriptor is used to check if a value for a BigQuery field matches the definition of the field, so we only need to know the type and mode of the field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, okay, I thought maybe you are going to use this class to create a unified (e.g., after resolving conflicts) TableSchema in next PRs as well, hence wondering about descriptions. I guess once you add a little more detail in class documentation that would be clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
| schema = bigquery.TableSchema() | ||
| schema.fields.append(bigquery.TableFieldSchema( | ||
| name='field_1', type=self._string, mode=self._nullable, | ||
| description='foo desc')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you don't care about descriptions, why bother setting them in the test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
| self.fail('Non existence field should throw an exceprion') | ||
|
|
||
| def test_field_descriptor_at_first_level(self): | ||
| print self._get_table_schema() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this print needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ops, deleted!
| for field in table_schema.fields: | ||
| if field.fields: | ||
| # Record field. | ||
| self.schema_descriptor_dict[field.name] = SchemaDescriptor(field) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking a little more about this: Would it make more sense to call this class RecordDescriptor instead of SchemaDescriptor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like SchemaDescriptor better as this class is built from bigquery.TableSchema obj, and basically is a searchable representation of it. I found it also a bit more helpful in my next PR as it's reminds me it's related to schema. Word 'record' is a bit more abused in VT, for example we use Record to refer to a variant line/record in VCF file.
nmousavi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PTAL
bashir2
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remaining comments are minor, please feel free to submit once they are addressed; well, I guess you need to ping again once you change something, thanks GitHub :-)
|
|
||
|
|
||
| # Stores data about a simple field (not a record) in BigQuery Schema. | ||
| FieldDescriptor = NamedTuple('FieldDescriptor', [('type', str), ('mode', str)]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sweet! I was not expecting this format for types to work in Python 2.7 :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack!
|
|
||
|
|
||
| class SchemaDescriptor(object): | ||
| """A dict based description for :class:`bigquery.TableSchema` object.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mind extending this a little bit (please see my question/comment on the whole PR in my last review)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
| else: | ||
| # Simple field. | ||
| self.field_descriptor_dict[field.name] = FieldDescriptor( | ||
| type=field.type, mode=field.mode) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, okay, I thought maybe you are going to use this class to create a unified (e.g., after resolving conflicts) TableSchema in next PRs as well, hence wondering about descriptions. I guess once you add a little more detail in class documentation that would be clear.
| # type: (bigquery.TableSchema) -> None | ||
|
|
||
| # Dict of (field_name, :class:`FieldDescriptor`). | ||
| self.field_descriptor_dict = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be "private", i.e., start with "_"? ditto next field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
Tested: unit test
nmousavi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PTAL
|
Thanks! |
* Define class SchemaDescriptor. Tested: unit test
* Define class SchemaDescriptor. Tested: unit test
Define SchemaDescriptor class.
Provide serialization and lookup API for Bigquery schema.
Design Doc: https://goo.gl/NXe27p
Tested:
unit test