Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: create_table(temp=True) timing out due to slow table existence check #9216

Closed
1 task done
asiunov opened this issue May 21, 2024 · 2 comments · Fixed by #9425
Closed
1 task done

bug: create_table(temp=True) timing out due to slow table existence check #9216

asiunov opened this issue May 21, 2024 · 2 comments · Fixed by #9425
Labels
bug Incorrect behavior inside of ibis

Comments

@asiunov
Copy link

asiunov commented May 21, 2024

What happened?

create_table for BigQuery backend may become inappropriately inefficient.

bigquery.Backend.create_table(name, df, temp=True) calls _register_in_memory_tables, which calls _register_in_memory_table, which checks if the table exists using:

if raw_name not in self.list_tables(database=(project, dataset)):

The problem for BigQuery is caused by the fact that BQ uses hidden datasets to store temp data (see _make_session function). These datasets are reused between sessions and may collect more and more tables within table expiration window. I've got 10s of thousands of temp tables. The listing for such large dataset takes minutes (hours, etc) and even causes request throttling from google side.

I have to use this monkey-patching to bypass the issue:

import google.api_core.exceptions as google_exceptions
import google.cloud.bigquery as bq_api
import ibis.expr.operations as ibis_ops
import sqlglot as sg
from ibis.backends.bigquery import Backend as BigQueryBackend
from ibis.backends.bigquery.datatypes import BigQuerySchema


def table_exists(backend: BigQueryBackend, table_name: str, database: tuple[str, str] | str | None = None) -> bool:
  try:
    _ = backend.table(table_name, database=database)
    return True
  except google_exceptions.NotFound as e:
    if "Not found: Table" in e.message:
      return False
    else:
      raise


def _register_in_memory_table(self: BigQueryBackend, op: ibis_ops.InMemoryTable) -> None:
  raw_name = op.name

  project = self._session_dataset.project
  dataset = self._session_dataset.dataset_id

  if not table_exists(self, raw_name, database=(project, dataset)):
    table_id = sg.table(raw_name, db=dataset, catalog=project, quoted=False).sql(dialect=self.name)

    bq_schema = BigQuerySchema.from_ibis(op.schema)
    load_job = self.client.load_table_from_dataframe(
      op.data.to_frame(),
      table_id,
      job_config=bq_api.LoadJobConfig(
        # fail if the table already exists and contains data
        write_disposition=bq_api.WriteDisposition.WRITE_EMPTY,
        schema=bq_schema,
      ),
    )
    load_job.result()


BigQueryBackend._register_in_memory_table = _register_in_memory_table

What version of ibis are you using?

9.0.0.

What backend(s) are you using, if any?

BigQuery

Relevant log output

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@asiunov asiunov added the bug Incorrect behavior inside of ibis label May 21, 2024
@asiunov
Copy link
Author

asiunov commented May 21, 2024

Related feature request: add a method for table existence check #9215

@jcrist
Copy link
Member

jcrist commented May 22, 2024

Thanks for opening this! I can see how this could be an issue. Moving this existence check to be cached in the backend (maybe keeping a set of already registered memtables by name or something) would help to alleviate this, beyond improving the efficiency of checking for a table's existence (as asked for in #9215).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Incorrect behavior inside of ibis
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

2 participants