
BigQuery Covariance operator compilation includes string representation of table instead of table ID #2367

Closed
tswast opened this issue Sep 9, 2020 · 1 comment · Fixed by #2368


tswast (Collaborator) commented Sep 9, 2020

Failing test

param(
lambda t, where: t.double_col.cov(t.float_col),
lambda t, where: t.double_col.cov(t.float_col),
id='covar',
),
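
The same bad SQL can be reproduced outside the test suite by compiling a covariance expression directly. A minimal sketch, assuming a BigQuery connection to the ibis functional_alltypes test table (project and dataset names are illustrative):

import ibis

# Hypothetical connection parameters; any BigQuery project that has the
# ibis `functional_alltypes` test table will do.
con = ibis.bigquery.connect(project_id='my-project', dataset_id='testing')
t = con.table('functional_alltypes')

expr = t.double_col.cov(t.float_col)
# Printing the compiled SQL shows the table's repr pasted into
# COVAR_SAMP(...) instead of translated column references.
print(expr.compile())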

Test output

$ pytest ibis/tests/all/test_aggregation.py::test_reduction_ops[BigQuery-no_cond-covar] \
    ibis/tests/all/test_aggregation.py::test_reduction_ops[BigQuery-is_in-covar]

Output:

======================================================= test session starts =======================================================
platform darwin -- Python 3.7.8, pytest-5.4.3, py-1.9.0, pluggy-0.13.1
rootdir: /Users/swast/src/ibis, inifile: setup.cfg
plugins: forked-1.2.0, mock-3.1.1, cov-2.10.0, xdist-1.34.0
collected 2 items                                                                                                                 

ibis/tests/all/test_aggregation.py FF                                                                                       [100%]

============================================================ FAILURES =============================================================
___________________________________________ test_reduction_ops[BigQuery-no_cond-covar] ____________________________________________

backend = <ibis.tests.backends.BigQuery object at 0x7fc25b7935d0>
alltypes = BigQueryTable[table]
  name: swast-scratch.testing.functional_alltypes
  schema:
    index : int64
    Unnamed_0 : int...4
    date_string_col : string
    string_col : string
    timestamp_col : timestamp
    year : int64
    month : int64
df =       index  Unnamed_0    id  bool_col  tinyint_col  ...  date_string_col  string_col           timestamp_col  year  m...     True            6  ...         01/31/10           6 2010-01-31 05:06:13.650  2010      1

[7300 rows x 15 columns]
result_fn = <function <lambda> at 0x7fc25b7c08c0>, expected_fn = <function <lambda> at 0x7fc25b7c0950>
ibis_cond = <function <lambda> at 0x7fc25b7c0e60>, pandas_cond = <function <lambda> at 0x7fc25b7c0ef0>

    @pytest.mark.parametrize(
        ('result_fn', 'expected_fn'),
        [
            param(
                lambda t, where: t.bool_col.count(where=where),
                lambda t, where: len(t.bool_col[where].dropna()),
                id='count',
            ),
            param(
                lambda t, where: t.bool_col.any(),
                lambda t, where: t.bool_col.any(),
                id='any',
            ),
            param(
                lambda t, where: t.bool_col.notany(),
                lambda t, where: ~t.bool_col.any(),
                id='notany',
            ),
            param(
                lambda t, where: -t.bool_col.any(),
                lambda t, where: ~t.bool_col.any(),
                id='any_negate',
            ),
            param(
                lambda t, where: t.bool_col.all(),
                lambda t, where: t.bool_col.all(),
                id='all',
            ),
            param(
                lambda t, where: t.bool_col.notall(),
                lambda t, where: ~t.bool_col.all(),
                id='notall',
            ),
            param(
                lambda t, where: -t.bool_col.all(),
                lambda t, where: ~t.bool_col.all(),
                id='all_negate',
            ),
            param(
                lambda t, where: t.double_col.sum(),
                lambda t, where: t.double_col.sum(),
                id='sum',
            ),
            param(
                lambda t, where: t.double_col.mean(),
                lambda t, where: t.double_col.mean(),
                id='mean',
            ),
            param(
                lambda t, where: t.double_col.min(),
                lambda t, where: t.double_col.min(),
                id='min',
            ),
            param(
                lambda t, where: t.double_col.max(),
                lambda t, where: t.double_col.max(),
                id='max',
            ),
            param(
                lambda t, where: t.double_col.approx_median(),
                lambda t, where: t.double_col.median(),
                id='approx_median',
                marks=pytest.mark.xpass_backends([Clickhouse]),
            ),
            param(
                lambda t, where: t.double_col.std(how='sample'),
                lambda t, where: t.double_col.std(ddof=1),
                id='std',
            ),
            param(
                lambda t, where: t.double_col.var(how='sample'),
                lambda t, where: t.double_col.var(ddof=1),
                id='var',
            ),
            param(
                lambda t, where: t.double_col.std(how='pop'),
                lambda t, where: t.double_col.std(ddof=0),
                id='std_pop',
            ),
            param(
                lambda t, where: t.double_col.var(how='pop'),
                lambda t, where: t.double_col.var(ddof=0),
                id='var_pop',
            ),
            param(
                lambda t, where: t.double_col.cov(t.float_col),
                lambda t, where: t.double_col.cov(t.float_col),
                id='covar',
            ),
            param(
                lambda t, where: t.double_col.corr(t.float_col),
                lambda t, where: t.double_col.corr(t.float_col),
                id='corr',
            ),
            param(
                lambda t, where: t.string_col.approx_nunique(),
                lambda t, where: t.string_col.nunique(),
                id='approx_nunique',
                marks=pytest.mark.xfail_backends([MySQL, SQLite]),
            ),
            param(
                lambda t, where: t.double_col.arbitrary(how='first'),
                lambda t, where: t.double_col.iloc[0],
                id='arbitrary_first',
            ),
            param(
                lambda t, where: t.double_col.arbitrary(how='last'),
                lambda t, where: t.double_col.iloc[-1],
                id='arbitrary_last',
            ),
        ],
    )
    @pytest.mark.parametrize(
        ('ibis_cond', 'pandas_cond'),
        [
            param(lambda t: None, lambda t: slice(None), id='no_cond'),
            param(
                lambda t: t.string_col.isin(['1', '7']),
                lambda t: t.string_col.isin(['1', '7']),
                id='is_in',
            ),
        ],
    )
    @pytest.mark.xfail_unsupported
    def test_reduction_ops(
        backend, alltypes, df, result_fn, expected_fn, ibis_cond, pandas_cond
    ):
        expr = result_fn(alltypes, ibis_cond(alltypes))
>       result = expr.execute()

ibis/tests/all/test_aggregation.py:209: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
ibis/expr/types.py:219: in execute
    self, limit=limit, timecontext=timecontext, params=params, **kwargs
ibis/client.py:368: in execute
    return backend.execute(expr, limit=limit, params=params, **kwargs)
ibis/client.py:221: in execute
    result = self._execute_query(query, **kwargs)
ibis/client.py:228: in _execute_query
    return query.execute()
ibis/bigquery/client.py:194: in execute
    query_parameters=self.query_parameters,
ibis/bigquery/client.py:475: in _execute
    query.result()  # blocks until finished
../../miniconda3/envs/ibis-dev/lib/python3.7/site-packages/google/cloud/bigquery/job.py:3207: in result
    super(QueryJob, self).result(retry=retry, timeout=timeout)
../../miniconda3/envs/ibis-dev/lib/python3.7/site-packages/google/cloud/bigquery/job.py:812: in result
    return super(_AsyncJob, self).result(timeout=timeout)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <google.cloud.bigquery.job.QueryJob object at 0x7fc25b890c10>, timeout = None

    def result(self, timeout=None):
        """Get the result of the operation, blocking if necessary.
    
        Args:
            timeout (int):
                How long (in seconds) to wait for the operation to complete.
                If None, wait indefinitely.
    
        Returns:
            google.protobuf.Message: The Operation's result.
    
        Raises:
            google.api_core.GoogleAPICallError: If the operation errors or if
                the timeout is reached before the operation completes.
        """
        self._blocking_poll(timeout=timeout)
    
        if self._exception is not None:
            # pylint: disable=raising-bad-type
            # Pylint doesn't recognize that this is valid in this case.
>           raise self._exception
E           google.api_core.exceptions.BadRequest: 400 Syntax error: Expected ")" but got identifier "BigQueryTable" at [3:3]
E           
E           (job ID: fabc9d5c-9c79-482e-9320-acffbd787de1)
E           
E                         -----Query Job SQL Follows-----               
E           
E               |    .    |    .    |    .    |    .    |    .    |
E              1:SELECT
E              2:  COVAR_SAMP(ref_0
E              3:  BigQueryTable[table]
E              4:    name: swast-scratch.testing.functional_alltypes
E              5:    schema:
E              6:      index : int64
E              7:      Unnamed_0 : int64
E              8:      id : int64
E              9:      bool_col : boolean
E             10:      tinyint_col : int64
E             11:      smallint_col : int64
E             12:      int_col : int64
E             13:      bigint_col : int64
E             14:      float_col : float64
E             15:      double_col : float64
E             16:      date_string_col : string
E             17:      string_col : string
E             18:      timestamp_col : timestamp
E             19:      year : int64
E             20:      month : int64
E             21:  
E             22:  double_col = Column[float64*] 'double_col' from table
E             23:    ref_0, ref_0
E             24:  BigQueryTable[table]
E             25:    name: swast-scratch.testing.functional_alltypes
E             26:    schema:
E             27:      index : int64
E             28:      Unnamed_0 : int64
E             29:      id : int64
E             30:      bool_col : boolean
E             31:      tinyint_col : int64
E             32:      smallint_col : int64
E             33:      int_col : int64
E             34:      bigint_col : int64
E             35:      float_col : float64
E             36:      double_col : float64
E             37:      date_string_col : string
E             38:      string_col : string
E             39:      timestamp_col : timestamp
E             40:      year : int64
E             41:      month : int64
E             42:  
E             43:  float_col = Column[float64*] 'float_col' from table
E             44:    ref_0) AS `tmp`
E             45:FROM `swast-scratch.testing.functional_alltypes`
E               |    .    |    .    |    .    |    .    |    .    |

../../miniconda3/envs/ibis-dev/lib/python3.7/site-packages/google/api_core/future/polling.py:130: BadRequest
____________________________________________ test_reduction_ops[BigQuery-is_in-covar] _____________________________________________

backend = <ibis.tests.backends.BigQuery object at 0x7fc25b7935d0>
alltypes = BigQueryTable[table]
  name: swast-scratch.testing.functional_alltypes
  schema:
    index : int64
    Unnamed_0 : int...4
    date_string_col : string
    string_col : string
    timestamp_col : timestamp
    year : int64
    month : int64
df =       index  Unnamed_0    id  bool_col  tinyint_col  ...  date_string_col  string_col           timestamp_col  year  m...     True            6  ...         01/31/10           6 2010-01-31 05:06:13.650  2010      1

[7300 rows x 15 columns]
result_fn = <function <lambda> at 0x7fc25b7c08c0>, expected_fn = <function <lambda> at 0x7fc25b7c0950>
ibis_cond = <function <lambda> at 0x7fc25b7c0f80>, pandas_cond = <function <lambda> at 0x7fc25b7c3050>

    @pytest.mark.parametrize(
        ('result_fn', 'expected_fn'),
        [
            param(
                lambda t, where: t.bool_col.count(where=where),
                lambda t, where: len(t.bool_col[where].dropna()),
                id='count',
            ),
            param(
                lambda t, where: t.bool_col.any(),
                lambda t, where: t.bool_col.any(),
                id='any',
            ),
            param(
                lambda t, where: t.bool_col.notany(),
                lambda t, where: ~t.bool_col.any(),
                id='notany',
            ),
            param(
                lambda t, where: -t.bool_col.any(),
                lambda t, where: ~t.bool_col.any(),
                id='any_negate',
            ),
            param(
                lambda t, where: t.bool_col.all(),
                lambda t, where: t.bool_col.all(),
                id='all',
            ),
            param(
                lambda t, where: t.bool_col.notall(),
                lambda t, where: ~t.bool_col.all(),
                id='notall',
            ),
            param(
                lambda t, where: -t.bool_col.all(),
                lambda t, where: ~t.bool_col.all(),
                id='all_negate',
            ),
            param(
                lambda t, where: t.double_col.sum(),
                lambda t, where: t.double_col.sum(),
                id='sum',
            ),
            param(
                lambda t, where: t.double_col.mean(),
                lambda t, where: t.double_col.mean(),
                id='mean',
            ),
            param(
                lambda t, where: t.double_col.min(),
                lambda t, where: t.double_col.min(),
                id='min',
            ),
            param(
                lambda t, where: t.double_col.max(),
                lambda t, where: t.double_col.max(),
                id='max',
            ),
            param(
                lambda t, where: t.double_col.approx_median(),
                lambda t, where: t.double_col.median(),
                id='approx_median',
                marks=pytest.mark.xpass_backends([Clickhouse]),
            ),
            param(
                lambda t, where: t.double_col.std(how='sample'),
                lambda t, where: t.double_col.std(ddof=1),
                id='std',
            ),
            param(
                lambda t, where: t.double_col.var(how='sample'),
                lambda t, where: t.double_col.var(ddof=1),
                id='var',
            ),
            param(
                lambda t, where: t.double_col.std(how='pop'),
                lambda t, where: t.double_col.std(ddof=0),
                id='std_pop',
            ),
            param(
                lambda t, where: t.double_col.var(how='pop'),
                lambda t, where: t.double_col.var(ddof=0),
                id='var_pop',
            ),
            param(
                lambda t, where: t.double_col.cov(t.float_col),
                lambda t, where: t.double_col.cov(t.float_col),
                id='covar',
            ),
            param(
                lambda t, where: t.double_col.corr(t.float_col),
                lambda t, where: t.double_col.corr(t.float_col),
                id='corr',
            ),
            param(
                lambda t, where: t.string_col.approx_nunique(),
                lambda t, where: t.string_col.nunique(),
                id='approx_nunique',
                marks=pytest.mark.xfail_backends([MySQL, SQLite]),
            ),
            param(
                lambda t, where: t.double_col.arbitrary(how='first'),
                lambda t, where: t.double_col.iloc[0],
                id='arbitrary_first',
            ),
            param(
                lambda t, where: t.double_col.arbitrary(how='last'),
                lambda t, where: t.double_col.iloc[-1],
                id='arbitrary_last',
            ),
        ],
    )
    @pytest.mark.parametrize(
        ('ibis_cond', 'pandas_cond'),
        [
            param(lambda t: None, lambda t: slice(None), id='no_cond'),
            param(
                lambda t: t.string_col.isin(['1', '7']),
                lambda t: t.string_col.isin(['1', '7']),
                id='is_in',
            ),
        ],
    )
    @pytest.mark.xfail_unsupported
    def test_reduction_ops(
        backend, alltypes, df, result_fn, expected_fn, ibis_cond, pandas_cond
    ):
        expr = result_fn(alltypes, ibis_cond(alltypes))
>       result = expr.execute()

ibis/tests/all/test_aggregation.py:209: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
ibis/expr/types.py:219: in execute
    self, limit=limit, timecontext=timecontext, params=params, **kwargs
ibis/client.py:368: in execute
    return backend.execute(expr, limit=limit, params=params, **kwargs)
ibis/client.py:221: in execute
    result = self._execute_query(query, **kwargs)
ibis/client.py:228: in _execute_query
    return query.execute()
ibis/bigquery/client.py:194: in execute
    query_parameters=self.query_parameters,
ibis/bigquery/client.py:475: in _execute
    query.result()  # blocks until finished
../../miniconda3/envs/ibis-dev/lib/python3.7/site-packages/google/cloud/bigquery/job.py:3207: in result
    super(QueryJob, self).result(retry=retry, timeout=timeout)
../../miniconda3/envs/ibis-dev/lib/python3.7/site-packages/google/cloud/bigquery/job.py:812: in result
    return super(_AsyncJob, self).result(timeout=timeout)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <google.cloud.bigquery.job.QueryJob object at 0x7fc25c1aa7d0>, timeout = None

    def result(self, timeout=None):
        """Get the result of the operation, blocking if necessary.
    
        Args:
            timeout (int):
                How long (in seconds) to wait for the operation to complete.
                If None, wait indefinitely.
    
        Returns:
            google.protobuf.Message: The Operation's result.
    
        Raises:
            google.api_core.GoogleAPICallError: If the operation errors or if
                the timeout is reached before the operation completes.
        """
        self._blocking_poll(timeout=timeout)
    
        if self._exception is not None:
            # pylint: disable=raising-bad-type
            # Pylint doesn't recognize that this is valid in this case.
>           raise self._exception
E           google.api_core.exceptions.BadRequest: 400 Syntax error: Expected ")" but got identifier "BigQueryTable" at [3:3]
E           
E           (job ID: 23caa920-3cf1-4d62-9523-e81e1d58e9a9)
E           
E                         -----Query Job SQL Follows-----               
E           
E               |    .    |    .    |    .    |    .    |    .    |
E              1:SELECT
E              2:  COVAR_SAMP(ref_0
E              3:  BigQueryTable[table]
E              4:    name: swast-scratch.testing.functional_alltypes
E              5:    schema:
E              6:      index : int64
E              7:      Unnamed_0 : int64
E              8:      id : int64
E              9:      bool_col : boolean
E             10:      tinyint_col : int64
E             11:      smallint_col : int64
E             12:      int_col : int64
E             13:      bigint_col : int64
E             14:      float_col : float64
E             15:      double_col : float64
E             16:      date_string_col : string
E             17:      string_col : string
E             18:      timestamp_col : timestamp
E             19:      year : int64
E             20:      month : int64
E             21:  
E             22:  double_col = Column[float64*] 'double_col' from table
E             23:    ref_0, ref_0
E             24:  BigQueryTable[table]
E             25:    name: swast-scratch.testing.functional_alltypes
E             26:    schema:
E             27:      index : int64
E             28:      Unnamed_0 : int64
E             29:      id : int64
E             30:      bool_col : boolean
E             31:      tinyint_col : int64
E             32:      smallint_col : int64
E             33:      int_col : int64
E             34:      bigint_col : int64
E             35:      float_col : float64
E             36:      double_col : float64
E             37:      date_string_col : string
E             38:      string_col : string
E             39:      timestamp_col : timestamp
E             40:      year : int64
E             41:      month : int64
E             42:  
E             43:  float_col = Column[float64*] 'float_col' from table
E             44:    ref_0) AS `tmp`
E             45:FROM `swast-scratch.testing.functional_alltypes`
E               |    .    |    .    |    .    |    .    |    .    |

../../miniconda3/envs/ibis-dev/lib/python3.7/site-packages/google/api_core/future/polling.py:130: BadRequest
======================================================== warnings summary =========================================================
ibis/tests/all/test_aggregation.py::test_reduction_ops[BigQuery-no_cond-covar]
  /Users/swast/src/ibis/ibis/bigquery/client.py:545: PendingDeprecationWarning: Client.dataset is deprecated and will be removed in a future version. Use a string like 'my_project.my_dataset' or a cloud.google.bigquery.DatasetReference object, instead.
    table_ref = self.client.dataset(dataset, project=project).table(name)

ibis/tests/all/test_aggregation.py::test_reduction_ops[BigQuery-no_cond-covar]
  /Users/swast/src/ibis/ibis/bigquery/client.py:432: PendingDeprecationWarning: Client.dataset is deprecated and will be removed in a future version. Use a string like 'my_project.my_dataset' or a cloud.google.bigquery.DatasetReference object, instead.
    dataset_ref = self.client.dataset(dataset, project=project)

-- Docs: https://docs.pytest.org/en/latest/warnings.html
===================================================== short test summary info =====================================================
FAILED ibis/tests/all/test_aggregation.py::test_reduction_ops[BigQuery-no_cond-covar] - google.api_core.exceptions.BadRequest: 4...
FAILED ibis/tests/all/test_aggregation.py::test_reduction_ops[BigQuery-is_in-covar] - google.api_core.exceptions.BadRequest: 400...
================================================== 2 failed, 2 warnings in 4.17s ==================================================

Thoughts on fix

I believe this is the relevant source:

@compiles(ops.Covariance)
def compiles_covar(translator, expr):
    expr = expr.op()
    left = expr.left
    right = expr.right
    where = expr.where
    if expr.how == 'sample':
        how = 'SAMP'
    elif expr.how == 'pop':
        how = 'POP'
    else:
        raise ValueError(
            "Covariance with how={!r} is not supported.".format(how)
        )
    if where is not None:
        left = where.ifelse(left, ibis.NA)
        right = where.ifelse(right, ibis.NA)
    return "COVAR_{}({}, {})".format(how, left, right)

Perhaps it is just missing some calls to translator.translate?
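
A rough sketch of what that might look like (untested, and not necessarily the fix that lands in #2368): the only substantive change is running the left and right operands through translator.translate before interpolating them, plus using expr.how in the error message, since how is unbound on that branch.

@compiles(ops.Covariance)
def compiles_covar(translator, expr):
    expr = expr.op()
    left = expr.left
    right = expr.right
    where = expr.where

    if expr.how == 'sample':
        how = 'SAMP'
    elif expr.how == 'pop':
        how = 'POP'
    else:
        raise ValueError(
            "Covariance with how={!r} is not supported.".format(expr.how)
        )

    if where is not None:
        left = where.ifelse(left, ibis.NA)
        right = where.ifelse(right, ibis.NA)

    # Translate the operand expressions into SQL fragments rather than
    # interpolating their Python reprs into the query text.
    return "COVAR_{}({}, {})".format(
        how, translator.translate(left), translator.translate(right)
    )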

tswast (Collaborator, Author) commented Sep 9, 2020

Found in #2353
