
BigQuery backend #1170

Closed · wants to merge 40 commits

Conversation


@tsdlovell commented Oct 15, 2017

closes #1139

@jreback added the feature (Features or general enhancements) label Oct 16, 2017
@@ -0,0 +1,80 @@
import uuid
import pytest

Contributor:

need a bq = pytest.importorskip('google.datalab.bigquery')
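A minimal sketch of that guard at the top of the tests' conftest, assuming the module path named above:

import pytest

bq = pytest.importorskip('google.datalab.bigquery')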


@@ -40,6 +40,11 @@
pass

Contributor:

need a small section in setup.py to add additional requires, something like: https://github.com/ibis-project/ibis/pull/1167/files#diff-2eeaed663bd0d25b7e608891384b7298
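A rough sketch of the kind of setup.py addition being requested; the requirement name here is an assumption, not what the PR ultimately used:

from setuptools import setup, find_packages

setup(
    name='ibis-framework',
    packages=find_packages(),
    extras_require={
        # hypothetical requirement for the BigQuery extra
        'bigquery': ['datalab'],
    },
)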


@@ -40,6 +40,11 @@
pass

try:
import ibis.bigquery.api as bigquery
except ImportError: # pip install ibis-framework[bigquery]
pass
Contributor:

need to add the pre-reqs to ci/requirements-2.7/3.6 (leave 3.4/3.5 so that they skip). need to mod appveyor.yml (see #1167 for what you need to add).

Author:

addressed in TST: add "git+https://github.com/googledatalab/..." as requirement

I don't think there's a pip package for pydatalab yet, so I only have the github.com URL

@tsdlovell (Author) commented:

How should the tests be getting bigquery credentials?

appveyor.yml
@@ -30,7 +30,7 @@ test_script:
- "%CONDA% install conda=4.3.22 --channel conda-forge"
- "%CONDA% create --name \"ibis_%PYTHON_VERSION%\" python=%PYTHON_VERSION% --channel conda-forge"
- "%ACTIVATE% \"ibis_%PYTHON_VERSION%\""
- "pip install -e .\"[sqlite, postgres, visualization, pandas]\""
- "pip install -e .\"[sqlite, postgres, visualization, pandas, git+https://github.com/googledatalab/pydatalab@v1.1]\""
Contributor:

you should just be able to list this as: bigquery (pip should then call the appropriate requires)


jreback commented Oct 16, 2017

How should the tests be getting bigquery credentials?

I guess we need a test account? @cpcloud


cpcloud commented Oct 16, 2017

@tsdlovell You're failing the flake8 command:

./ibis/bigquery/api.py:1:1: F401 'ibis.bigquery.client.BigQueryClient' imported but unused
./ibis/bigquery/api.py:2:1: F401 'ibis.config.options' imported but unused
./ibis/bigquery/tests/conftest.py:7:1: E402 module level import not at top of file
./ibis/bigquery/tests/conftest.py:16:1: E303 too many blank lines (3)
Exited with code 1


def __init__(self, project_id, dataset_id):
self.__project_id = project_id
self.__dataset_id = dataset_id
Member:

The __s seem unnecessary. Instance attributes starting with _ are considered private. Mutating them is out of the question and if someone decides to do that they are on their own. Besides, __ doesn't guarantee that you can't mutate these attributes, it just makes it extremely cumbersome.
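A short sketch of the single-underscore convention being suggested, with read-only properties like the ones the tests rely on (a sketch, not the PR's full client):

class BigQueryClient:
    def __init__(self, project_id, dataset_id):
        # private by convention only; no name mangling
        self._project_id = project_id
        self._dataset_id = dataset_id

    @property
    def project_id(self):
        return self._project_id

    @property
    def dataset_id(self):
        return self._dataset_id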



@toolz.memoize
def _bq_get_context(project_id):
Member:

Do we need both _bq_get_context and _bq_make_context?

Author:

making a context seems to be very costly in terms of time, so I expose a way to make one if you really want to, but the default should probably be to just use get
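A sketch of that get/make split; the datalab Context calls shown here are assumptions about how the context gets built:

import toolz
import google.datalab as datalab


def _bq_make_context(project_id):
    # building a context is the expensive part (credential/config lookup)
    context = datalab.Context.default()
    context.set_project_id(project_id)
    return context


@toolz.memoize
def _bq_get_context(project_id):
    # cached per project_id, so repeated lookups reuse one context
    return _bq_make_context(project_id)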

(table_id, dataset_id) = _ensure_split(name, dataset)
if dataset_id and dataset:
raise ValueError(
'Can\'t pass a fully qualified table name *AND* a dataset'
Member:

Use double quotes when embedding a single quote.
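That is, the message above can simply read:

raise ValueError(
    "Can't pass a fully qualified table name *AND* a dataset"
)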

'TIMESTAMP': dt.timestamp,
# 'BYTES': None,
# 'ARRAY': None,
# 'STRUCT': None,
Member:

Do you need any of these now that you have an implementation that can handle arrays and structs? What about BYTES?

def extract_field_formatter(translator, expr):
op = expr.op()
arg = translator.translate(op.args[0])
return "extract({0!s} from {1!s})".format(sql_attr, arg)
Member:

The format argument indexes and !s are unnecessary.
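In other words, the return above can be shortened to:

return "extract({} from {})".format(sql_attr, arg)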

Author:

I copied the format from the impala compiler. Can rewrite if desired.

Member:

No need, we can leave as is.



@pytest.fixture(scope='session')
def bq_context():
Member:

No need to prefix everything with _bq, we're already under bigquery/tests.

def client(bq_dataset):
bq_name = bq_dataset.name
client = ibc.BigQueryClient(bq_name.project_id, bq_name.dataset_id)
ibis.options.default_backend = client
Member:

Don't set this for unit tests unless you are explicitly testing this, because it modifies global state.

assert str(result) == expected


@pytest.mark.xfail()
Member:

xfail doesn't need parens

if schema:
raise NotImplementedError()
else:
schema = dict()
Member:

You don't need this line.

return bq.Table(dataset_id + '.' + table_id, _bq_get_context(project_id))


def _bq_get_schema(table_id, dataset_id, project_id):
Member:

Why not keep a _context on the client (I see that you have a property based on the project_id) and use the bq.Table/bq.Dataset(s) objects directly instead of having these extra 1-2 line functions?

@cpcloud self-assigned this Oct 16, 2017

def _get_query(expr, context, params=None):
ast = build_ast(expr, context, params=params)
(query, *rest) = ast.queries
Member:

This isn't Python 2 compatible syntax.

Author:

fixed

@tsdlovell force-pushed the bigquery-backend branch 2 times, most recently from a799d6a to 903951e on October 16, 2017 20:31
@@ -23,3 +23,4 @@ dependencies:
- hdfs>=2.0.0
- clickhouse-driver
- clickhouse-sqlalchemy
- git+https://github.com/googledatalab/pydatalab@v1.1
Collaborator:

I'm curious why you are using the Datalab libraries instead of google-cloud-python?

@tsdlovell (Author) commented Oct 16, 2017:

@tswast, it's what I was exposed to in some demos of the tooling we can use to access GCP.

Is google-cloud-python the right thing to use?

Collaborator:

Yeah, it's what we recommend at https://cloud.google.com/bigquery/docs/reference/libraries#client-libraries-install-python. Datalab is great, but it's not really built to be used outside the context of their notebook environment.

@cpcloud added this to the 0.13 milestone Oct 17, 2017

cpcloud commented Oct 17, 2017

Disable your tests on Windows for now. Once we get everything passing on CircleCI then I'll add a Windows google-cloud-sdk installation to our appveyor CI. I can deal with adding testing on Windows once this is merged in.

return ibc.BigQueryClient(PROJECT_ID, DATASET_ID)


@pytest.fixture(scope='session')
Member:

This needs to be yield_fixture.

Author:

For consistency or correctness?

https://docs.pytest.org/en/latest/yieldfixture.html

Since pytest-3.0, fixtures using the normal fixture decorator can use a yield statement to provide fixture values and execute teardown code, exactly like yield_fixture in previous versions.

Marking functions as yield_fixture is still supported, but deprecated and should not be used in new code.

Member:

Oh sweet! Thanks for the link.
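A minimal sketch of the pytest >= 3.0 style referenced above, a plain fixture with yield-based teardown (the setup helper here is hypothetical):

import pytest


@pytest.fixture(scope='session')
def bq_dataset():
    dataset = make_dataset()  # hypothetical setup helper
    try:
        yield dataset
    finally:
        # teardown runs even if a test using the fixture raises
        dataset.delete()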

assert client.project_id == bq_dataset.name.project_id
assert bq_dataset.name.project_id == bq_table._name_parts.project_id
assert client.dataset_id == bq_dataset.name.dataset_id
assert bq_dataset.name.dataset_id == bq_table._name_parts.dataset_id
Member:

Break these assertions out into a test. Assume the fixture is generated correctly if you want to use it in an expression/compilation/execution test, or create two clients: one which you assume is constructed correctly and one whose properties you want to confirm are in fact correct.

assert bq_table.exists()
_df = bq_table.to_dataframe()
assert _df.equals(df)
yield client
Member:

This needs to be:

try:
    yield client
finally:
    bq_table.delete()
    bq_dataset.delete()

otherwise you won't execute the delete calls when an exception is raised in the yielded-to function.

@@ -0,0 +1,97 @@
import uuid
Member:

You shouldn't need this if you use ibis.util.guid().
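For example (a sketch; the prefix is arbitrary):

import ibis.util

dataset_id = 'testing_{}'.format(ibis.util.guid())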



def test_array_execute(table, df):
col_name = df.select_dtypes(['float']).columns[0]
Member:

Why not just use the column name directly?



def test_simple_aggregate_execute(table, df):
col_name = df.select_dtypes(['float']).columns[0]
Member:

Same as above, just hard code the column name.

client_with_table.dataset_id)
actual = client_with_table.list_tables()
expected = [el.name.table_id for el in bq_dataset.tables()]
assert sorted(actual) == sorted(expected)
Member:

I would hard code the exact names you expect to be there and assert that they are in fact there, otherwise this is just copying the implementation in a test.

Author:

I copied this from the postgres tests

assert db.list_tables() == self.con.list_tables()

Member:

I'll need to revisit that then :)

Member:

Hm, so what I'm actually referring to here is the use of the _proxy API, which is internal to ibis. You should do what the postgres tests do which is construct a database and then confirm that the database has the same tables as the client.

Author:

I think I'm doing the same thing here

    def get_dataset(self, dataset_id):
        return bq.Dataset(dataset_id, self.context)

assert not t.exists()


@pytest.mark.xfail()
Member:

No parens needed here.

assert bq_table.exists()

df_tuples = list(tuple(v.tolist()) for (k, v) in df.iterrows())
table_tuples = list(bq_table.fetch_data())
Collaborator:

This test might be a bit flaky due to the use of table.insert_data(), which uses the streaming buffer. Rows in the streaming buffer don't appear in the list of rows from a table right away. (In fact, it could be as much as 90 minutes to get out of the streaming buffer.)

bq_table = bq_dataset.table(TABLE_ID, schema)
bq_table.create()
bq_table.reload()
bq_table.insert_data(tuples)
Collaborator:

You'll have more reliable tests if you upload from a file and wait for the load job to complete. Example:

https://github.com/GoogleCloudPlatform/google-cloud-python/blob/6345c06ceb151d07761abb36f577909a3b76d7d4/docs/bigquery/snippets.py#L401-L408

Note: line-delimited JSON or Avro rather than CSV will be more robust in general, but CSV is probably fine for a test fixture like this.

pytestmark = pytest.mark.bigquery


PROJECT_ID = 'bq-for-ibis'
Member:

This should be 'ibis-gbq'.


@pytest.fixture(scope='session')
def client():
return ibc.BigQueryClient(PROJECT_ID, DATASET_ID)
Member:

Call ibis.bigquery.connect(PROJECT_ID, DATASET_ID) here.
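That is, something along these lines (a sketch assuming the connect entry point named above, with PROJECT_ID and DATASET_ID as the conftest's module constants):

import pytest
import ibis


@pytest.fixture(scope='session')
def client():
    return ibis.bigquery.connect(PROJECT_ID, DATASET_ID)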

@tsdlovell (Author) commented:

@jreback, @cpcloud: conda build tests are failing

Tests seem to be failing due to a bad dependency specification. What needs to be changed?

E   ImportError: No module named 'google'

from ibis.bigquery.tests import conftest as conftest


pytestmark = pytest.mark.bigquery
Member:

This is only for test files, not fixtures.

Member:

Sorry, I meant this for the conftest file.

import ibis.util
import ibis.expr.types as ir
from ibis.bigquery import client as ibc
from ibis.bigquery.tests import conftest as conftest
Member:

Don't import conftest here. Just make TABLE_ID a fixture and use it as a parameter to your other fixtures and tests.
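One way to express that suggestion (a sketch; fixture names are illustrative, and the table name assumes the functional_alltypes data mentioned below):

import pytest


@pytest.fixture(scope='session')
def table_id():
    return 'functional_alltypes'


@pytest.fixture(scope='session')
def table(client, table_id):
    return client.table(table_id)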

json_file = six.BytesIO(
'\n'.join((v.to_json() for (k, v) in df.iterrows())).encode('UTF-8')
)
schema = infer_schema_from_df(df)
Member:

I uploaded functional_alltypes to the ibis-gbq project. Run your tests against that so we don't have to upload anything during tests. Then you don't need to import this function, and you'll be able to pass conda build CI.


import pandas as pd
import ibis.util
import ibis.bigquery
Member:

Just import ibis. ibis.bigquery may not exist if someone doesn't have google-cloud-bigquery installed. This module should be importable without the required dependencies so that pytest can still run.
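A sketch of that module header: import only ibis, and let pytest skip when the backend dependency is missing (the dependency name here is an assumption):

import pytest
import ibis

pytest.importorskip('google.datalab.bigquery')

pytestmark = pytest.mark.bigquery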

@pytest.mark.xfail
def test_df_upload(client):
expected = pd.DataFrame(dict(a=[1], b=[2.], c=['a'], d=[True]))
schema = ibc.infer_schema_from_df(expected)
Member:

This should be done inside the insert method on the table (you may need to define it if it doesn't exist), so that users are not exposed to this.

Member:

I would leave this functionality for another PR so that you don't have to deal with this stuff in this one.

t = client.table('rando', schema)
t.upload(expected)
result = t.execute()
t.delete()
Member:

DELETE means something different than DROP, but it looks like you're dropping the table here. If that's the case then you should implement this via the drop method so the API is consistent with the rest of ibis.
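A hypothetical sketch of routing this through a drop method on the client, reusing the datalab table handle the PR already builds; the helpers and attributes referenced here are assumptions about the client's internals:

def drop_table(self, name, database=None):
    # a sketch, not the PR's implementation
    table_id, dataset_id = _ensure_split(name, database)
    bq.Table(
        '{}.{}'.format(dataset_id or self.dataset_id, table_id),
        self.context,
    ).delete()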

Member:

Same as insert, I would leave this for the next PR and remove it in this one.

expected = pd.DataFrame(dict(a=[1], b=[2.], c=['a'], d=[True]))
schema = ibc.infer_schema_from_df(expected)
t = client.table('rando', schema)
t.upload(expected)
Member:

Similar to my other comments in this test, this functionality should be exposed through the insert method.


cpcloud commented Oct 22, 2017

Merging.

@cpcloud closed this in 120738e Oct 22, 2017
@cpcloud changed the title from "WIP: bigquery backend" to "BigQuery backend" Oct 22, 2017

cpcloud commented Oct 22, 2017

Thanks @tsdlovell, great to get this going!

@tsdlovell deleted the bigquery-backend branch October 22, 2017 16:35

wesm commented Oct 23, 2017

Huzzah! Thanks @tsdlovell!

Labels: feature (Features or general enhancements)
Linked issue: BigQuery backend for Ibis?
5 participants