BigQuery backend #1170
Conversation
ibis/bigquery/tests/conftest.py

```
@@ -0,0 +1,80 @@
import uuid
import pytest
```
need a bq = pytest.importorskip('google.datalab.bigquery')
addressed in TST: use pytest.importorskip for bigquery dependency
```
@@ -40,6 +40,11 @@
    pass
```
need a small section in setup.py to add additional requires, something like: https://github.com/ibis-project/ibis/pull/1167/files#diff-2eeaed663bd0d25b7e608891384b7298
addressed in DOC: add bigquery_requires to setup.py
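For reference, the `extras_require` pattern being discussed usually looks something like the sketch below. The extra names and requirement lists here are illustrative assumptions, not the PR's exact contents (the PR initially pulled pydatalab from GitHub, and the discussion later moves toward google-cloud-bigquery):

```python
# Hypothetical sketch of declaring an optional backend dependency in setup.py.
# The requirement name 'google-cloud-bigquery' is an assumption for illustration.
bigquery_requires = ['google-cloud-bigquery']

extras_require = {
    'bigquery': bigquery_requires,
    # 'all' would aggregate every backend's requires in the real setup.py
    'all': bigquery_requires,
}

# setup(..., extras_require=extras_require) would then let users run:
#     pip install -e ".[bigquery]"
```

With this in place, listing just the extra name (e.g. `bigquery`) in CI install commands lets pip resolve the appropriate requirements.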
```
@@ -40,6 +40,11 @@
    pass

try:
    import ibis.bigquery.api as bigquery
except ImportError:  # pip install ibis-framework[bigquery]
    pass
```
need to add the pre-reqs to ci/requirements-2.7/3.6 (leave 3.4/4.5 so that they skip). need to mod appveyor.yml (see #1167 for what you need to add).
addressed in TST: add "git+https://github.com/googledatalab/..." as requirement
I don't think there's a pip package for pydatalab yet, so only have the github.com url
How should the tests be getting BigQuery credentials?
appveyor.yml

```diff
@@ -30,7 +30,7 @@ test_script:
   - "%CONDA% install conda=4.3.22 --channel conda-forge"
   - "%CONDA% create --name \"ibis_%PYTHON_VERSION%\" python=%PYTHON_VERSION% --channel conda-forge"
   - "%ACTIVATE% \"ibis_%PYTHON_VERSION%\""
-  - "pip install -e .\"[sqlite, postgres, visualization, pandas]\""
+  - "pip install -e .\"[sqlite, postgres, visualization, pandas, git+https://github.com/googledatalab/pydatalab@v1.1]\""
```
you should just be able to list this as: bigquery (pip should then call the appropriate requires)
I guess we need a test account? @cpcloud

@tsdlovell You're failing the
ibis/bigquery/client.py

```python
def __init__(self, project_id, dataset_id):
    self.__project_id = project_id
    self.__dataset_id = dataset_id
```
The `__` prefixes seem unnecessary. Instance attributes starting with a single `_` are conventionally private; mutating them is out of the question, and anyone who decides to do that is on their own. Besides, `__` doesn't guarantee that you can't mutate these attributes, it just makes it extremely cumbersome.
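A quick illustration of the point about `__` name mangling (the class and attribute names here are made up for the example):

```python
class Client(object):
    def __init__(self, project_id):
        self.__project_id = project_id  # mangled to _Client__project_id

c = Client('my-project')

# The attribute is not truly private -- it is just cumbersome to reach:
assert not hasattr(c, '__project_id')
assert c._Client__project_id == 'my-project'

# And nothing stops mutation through the mangled name:
c._Client__project_id = 'other-project'
```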
ibis/bigquery/client.py

```python
@toolz.memoize
def _bq_get_context(project_id):
```
Do we need both _bq_get_context and _bq_make_context?
Making a context seems to be very costly in terms of time, so I expose a way to make one if you really want to, but the default should probably be to just use `get`.
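The get-vs-make split can be sketched like this, using `functools.lru_cache` in place of `toolz.memoize` so the snippet is dependency-free (function names and the context dict are illustrative, not the PR's actual code):

```python
from functools import lru_cache

def make_context(project_id):
    # Stand-in for the expensive context construction.
    return {'project_id': project_id}

@lru_cache(maxsize=None)
def get_context(project_id):
    # Memoized: repeated calls for the same project reuse one context.
    return make_context(project_id)

assert get_context('p') is get_context('p')        # cached
assert make_context('p') is not make_context('p')  # built fresh each time
```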
ibis/bigquery/client.py

```python
(table_id, dataset_id) = _ensure_split(name, dataset)
if dataset_id and dataset:
    raise ValueError(
        'Can\'t pass a fully qualified table name *AND* a dataset'
```
Use double quotes when embedding a single quote.
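That is, the two spellings produce the same string, but the double-quoted form avoids the backslash:

```python
escaped = 'Can\'t pass a fully qualified table name *AND* a dataset'
double_quoted = "Can't pass a fully qualified table name *AND* a dataset"
assert escaped == double_quoted
```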
ibis/bigquery/client.py

```python
'TIMESTAMP': dt.timestamp,
# 'BYTES': None,
# 'ARRAY': None,
# 'STRUCT': None,
```
Do you need any of these now that you have an implementation that can handle arrays and structs? What about BYTES?
```python
def extract_field_formatter(translator, expr):
    op = expr.op()
    arg = translator.translate(op.args[0])
    return "extract({0!s} from {1!s})".format(sql_attr, arg)
```
The format argument indexes and !s are unnecessary.
I copied the format from the impala compiler. Can rewrite if desired.
No need, we can leave as is.
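For the record, the shorter spelling is equivalent here: `str.format` fills positional fields in order, and for string arguments the default conversion matches `!s` (the values below are made-up examples, not the compiler's actual inputs):

```python
sql_attr, arg = 'year', 't0.`timestamp_col`'

explicit = "extract({0!s} from {1!s})".format(sql_attr, arg)
implicit = "extract({} from {})".format(sql_attr, arg)

assert explicit == implicit == 'extract(year from t0.`timestamp_col`)'
```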
ibis/bigquery/tests/conftest.py

```python
@pytest.fixture(scope='session')
def bq_context():
```
No need to prefix everything with _bq, we're already under bigquery/tests.
ibis/bigquery/tests/conftest.py

```python
def client(bq_dataset):
    bq_name = bq_dataset.name
    client = ibc.BigQueryClient(bq_name.project_id, bq_name.dataset_id)
    ibis.options.default_backend = client
```
Don't set this for unit tests unless you are explicitly testing this, because it modifies global state.
ibis/bigquery/tests/test_client.py

```python
assert str(result) == expected


@pytest.mark.xfail()
```
xfail doesn't need parens
ibis/bigquery/client.py

```python
if schema:
    raise NotImplementedError()
else:
    schema = dict()
```
You don't need this line.
ibis/bigquery/client.py

```python
return bq.Table(dataset_id + '.' + table_id, _bq_get_context(project_id))


def _bq_get_schema(table_id, dataset_id, project_id):
```
Why not keep a _context on the client (I see that you have a property based on the project_id) and use the bq.Table/bq.Dataset(s) objects directly instead of having these extra 1-2 line functions?
ibis/bigquery/compiler.py

```python
def _get_query(expr, context, params=None):
    ast = build_ast(expr, context, params=params)
    (query, *rest) = ast.queries
```
This isn't Python 2 compatible syntax.
fixed
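The fix can be as simple as indexing instead of starred unpacking (the list contents here are made up for illustration):

```python
ast_queries = ['SELECT 1', 'SELECT 2', 'SELECT 3']

# Python 3 only:
#     (query, *rest) = ast_queries

# Works on both Python 2 and 3:
query, rest = ast_queries[0], ast_queries[1:]

assert query == 'SELECT 1'
assert rest == ['SELECT 2', 'SELECT 3']
```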
ci/requirements-dev-3.6.yml

```
@@ -23,3 +23,4 @@ dependencies:
  - hdfs>=2.0.0
  - clickhouse-driver
  - clickhouse-sqlalchemy
  - git+https://github.com/googledatalab/pydatalab@v1.1
```
I'm curious why you are using the Datalab libraries instead of google-cloud-python?
@tswast, it's what I was exposed to in some demos of the tooling we can use to access GCP.
Is google-cloud-python the right thing to use?
Yeah, it's what we recommend at https://cloud.google.com/bigquery/docs/reference/libraries#client-libraries-install-python Datalab is great, but it's not really built to be used outside of the context of their notebook environment.
Disable your tests on Windows for now. Once we get everything passing on CircleCI, I'll add a Windows google-cloud-sdk installation to our appveyor CI. I can deal with adding testing on Windows once this is merged in.
ibis/bigquery/tests/conftest.py

```python
return ibc.BigQueryClient(PROJECT_ID, DATASET_ID)


@pytest.fixture(scope='session')
```
This needs to be yield_fixture.
For consistency or correctness?
From https://docs.pytest.org/en/latest/yieldfixture.html:

> Since pytest-3.0, fixtures using the normal fixture decorator can use a yield statement to provide fixture values and execute teardown code, exactly like yield_fixture in previous versions.
> Marking functions as yield_fixture is still supported, but deprecated and should not be used in new code.
Oh sweet! Thanks for the link.
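The yield-based pattern boils down to a generator: code before the `yield` is setup, code after it is teardown. Here's a dependency-free sketch; in the real conftest this function would carry `@pytest.fixture(scope='session')` and the setup/teardown would create and delete the BigQuery dataset:

```python
teardown_calls = []

def dataset():
    ds = 'testing_dataset'      # setup, e.g. bq_dataset.create()
    yield ds
    teardown_calls.append(ds)   # teardown, e.g. bq_dataset.delete()

# pytest drives the generator for you; by hand it looks like:
gen = dataset()
value = next(gen)               # run setup, receive the fixture value
assert value == 'testing_dataset'
assert teardown_calls == []     # teardown hasn't run yet

for _ in gen:                   # exhaust the generator: teardown runs
    pass
assert teardown_calls == ['testing_dataset']
```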
ibis/bigquery/tests/conftest.py

```python
assert client.project_id == bq_dataset.name.project_id
assert bq_dataset.name.project_id == bq_table._name_parts.project_id
assert client.dataset_id == bq_dataset.name.dataset_id
assert bq_dataset.name.dataset_id == bq_table._name_parts.dataset_id
```
Break these assertions out into a test. Either assume the fixture is generated correctly if you want to use it in an expression/compilation/execution test, or create two clients: one which you assume is constructed correctly, and one whose properties you want to confirm are in fact correct.
ibis/bigquery/tests/conftest.py

```python
assert bq_table.exists()
_df = bq_table.to_dataframe()
assert _df.equals(df)
yield client
```
This needs to be:

```python
try:
    yield client
finally:
    bq_table.delete()
    bq_dataset.delete()
```

otherwise you won't execute the delete calls when an exception is raised in the yielded-to function.
ibis/bigquery/tests/test_client.py

```
@@ -0,0 +1,97 @@
import uuid
```
You shouldn't need this if you use ibis.util.guid().
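For context, `ibis.util.guid()` produces a random hex identifier much like this standard-library sketch (shown here as an approximation, not ibis's actual source):

```python
import uuid

def guid():
    # Roughly what ibis.util.guid() provides: a random 32-char hex id.
    return uuid.uuid4().hex

name = 'table_' + guid()
assert len(name) == len('table_') + 32
int(guid(), 16)  # every guid is valid hexadecimal
```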
ibis/bigquery/tests/test_client.py

```python
def test_array_execute(table, df):
    col_name = df.select_dtypes(['float']).columns[0]
```
Why not just use the column name directly?
ibis/bigquery/tests/test_client.py

```python
def test_simple_aggregate_execute(table, df):
    col_name = df.select_dtypes(['float']).columns[0]
```
Same as above, just hard code the column name.
ibis/bigquery/tests/test_client.py

```python
                        client_with_table.dataset_id)
actual = client_with_table.list_tables()
expected = [el.name.table_id for el in bq_dataset.tables()]
assert sorted(actual) == sorted(expected)
```
I would hard code the exact names you expect to be there and assert that they are in fact there, otherwise this is just copying the implementation in a test.
I copied this from the postgres tests:

```python
assert db.list_tables() == self.con.list_tables()
```
I'll need to revisit that then :)
Hm, so what I'm actually referring to here is the use of the _proxy API, which is internal to ibis. You should do what the postgres tests do which is construct a database and then confirm that the database has the same tables as the client.
I think I'm doing the same thing here:

```python
def get_dataset(self, dataset_id):
    return bq.Dataset(dataset_id, self.context)
```
ibis/bigquery/tests/test_client.py

```python
assert not t.exists()


@pytest.mark.xfail()
```
No parens needed here.
ibis/bigquery/tests/test_client.py

```python
assert bq_table.exists()

df_tuples = list(tuple(v.tolist()) for (k, v) in df.iterrows())
table_tuples = list(bq_table.fetch_data())
```
This test might be a bit flaky due to the use of table.insert_data(), which uses the streaming buffer. Rows in the streaming buffer don't appear in the list of rows from a table right away. (In fact, it could be as much as 90 minutes to get out of the streaming buffer.)
ibis/bigquery/tests/conftest.py

```python
bq_table = bq_dataset.table(TABLE_ID, schema)
bq_table.create()
bq_table.reload()
bq_table.insert_data(tuples)
```
You'll have more reliable tests if you upload from a file and wait for the load job to complete.
Note: line-delimited JSON or Avro rather than CSV will be more robust in general, but CSV is probably fine for a test fixture like this.
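A sketch of the load-from-file pattern described above: the payload construction uses only the standard library, while the upload-and-wait step is left in comments because it needs credentials, and its client method names are assumptions about the google-cloud-bigquery API:

```python
import io
import json

rows = [{'id': 1, 'name': 'a'}, {'id': 2, 'name': 'b'}]

# Serialize to newline-delimited JSON, one object per line.
ndjson = '\n'.join(json.dumps(row) for row in rows).encode('utf-8')
source = io.BytesIO(ndjson)

assert len(source.getvalue().splitlines()) == len(rows)

# Hedged sketch of the load step (not run here; requires credentials):
#     job = client.load_table_from_file(source, table_ref, job_config=config)
#     job.result()  # block until the load job completes before asserting
```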
ibis/bigquery/tests/conftest.py

```python
pytestmark = pytest.mark.bigquery


PROJECT_ID = 'bq-for-ibis'
```
This should be 'ibis-gbq'.
ibis/bigquery/tests/conftest.py

```python
@pytest.fixture(scope='session')
def client():
    return ibc.BigQueryClient(PROJECT_ID, DATASET_ID)
```
Call ibis.bigquery.connect(PROJECT_ID, DATASET_ID) here.
ibis/bigquery/tests/test_client.py

```python
from ibis.bigquery.tests import conftest as conftest


pytestmark = pytest.mark.bigquery
```
This is only for test files, not fixtures.
Sorry, I meant this for the conftest file.
ibis/bigquery/tests/test_client.py

```python
import ibis.util
import ibis.expr.types as ir
from ibis.bigquery import client as ibc
from ibis.bigquery.tests import conftest as conftest
```
Don't import conftest here. Just make TABLE_ID a fixture and use it as a parameter to your other fixtures and tests.
ibis/bigquery/tests/conftest.py

```python
json_file = six.BytesIO(
    '\n'.join((v.to_json() for (k, v) in df.iterrows())).encode('UTF-8')
)
schema = infer_schema_from_df(df)
```
I uploaded functional_alltypes to the ibis-gbq project. Run your tests against that so we don't have to upload anything during tests. Then you don't need to import this function, and you'll be able to pass conda build CI.
ibis/bigquery/tests/conftest.py

```python
import pandas as pd
import ibis.util
import ibis.bigquery
```
Just import ibis. ibis.bigquery may not exist if someone doesn't have google-cloud-bigquery installed. This module should be importable without the required dependencies so that pytest can still run.
ibis/bigquery/tests/test_client.py

```python
@pytest.mark.xfail
def test_df_upload(client):
    expected = pd.DataFrame(dict(a=[1], b=[2.], c=['a'], d=[True]))
    schema = ibc.infer_schema_from_df(expected)
```
This should be done inside the insert method on the table (you may need to define it if it doesn't exist), so that users are not exposed to this.
I would leave this functionality for another PR so that you don't have to deal with this stuff in this one.
ibis/bigquery/tests/test_client.py

```python
t = client.table('rando', schema)
t.upload(expected)
result = t.execute()
t.delete()
```
DELETE means something different than DROP, but it looks like you're dropping the table here. If that's the case then you should implement this via the drop method so the API is consistent with the rest of ibis.
Same as insert, I would leave this for the next PR and remove it in this one.
ibis/bigquery/tests/test_client.py

```python
expected = pd.DataFrame(dict(a=[1], b=[2.], c=['a'], d=[True]))
schema = ibc.infer_schema_from_df(expected)
t = client.table('rando', schema)
t.upload(expected)
```
Similar to my other comments in this test, this functionality should be exposed through the insert method.
Merging.

Thanks @tsdlovell, great to get this going!

Huzzah! Thanks @tsdlovell!
closes #1139