Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: several example datasets broken on postgres #8792

Closed
1 task done
gforsyth opened this issue Mar 27, 2024 · 4 comments
Closed
1 task done

bug: several example datasets broken on postgres #8792

gforsyth opened this issue Mar 27, 2024 · 4 comments
Assignees
Labels
bug Incorrect behavior inside of ibis examples Issues/PRs related to `ibis.examples` postgres The PostgreSQL backend

Comments

@gforsyth
Copy link
Member

What happened?

[ins] In [1]: import ibis

[ins] In [2]: con = ibis.postgres.connect(
         ...:     user="postgres",
         ...:     password="postgres",
         ...:     host="0.0.0.0",
         ...:     port=5432,
         ...:     database="ibis_testing",
         ...: )

[ins] In [3]: ibis.set_backend(con)

[ins] In [4]: ibis.examples.starwars.fetch()
---------------------------------------------------------------------------
NumericValueOutOfRange                    Traceback (most recent call last)
Cell In[4], line 1
----> 1 ibis.examples.starwars.fetch()

File ~/github.com/ibis-project/ibis/ibis/examples/__init__.py:85, in Example.fetch(self, table_name, backend)
     81 # TODO: It should be possible to avoid this memtable call, once all
     82 # backends support passing a `pyarrow.Table` to `create_table`
     83 # directly.
     84 obj = ibis.memtable(table)
---> 85 return backend.create_table(table_name, obj, temp=True, overwrite=True)

File ~/github.com/ibis-project/ibis/ibis/backends/postgres/__init__.py:692, in Backend.create_table(self, name, obj, schema, database, temp, overwrite)
    689     else:
    690         table = obj
--> 692     self._run_pre_execute_hooks(table)
    694     query = self._to_sqlglot(table)
    695 else:

File ~/github.com/ibis-project/ibis/ibis/backends/__init__.py:1030, in BaseBackend._run_pre_execute_hooks(self, expr)
   1028 self._define_udf_translation_rules(expr)
   1029 self._register_udfs(expr)
-> 1030 self._register_in_memory_tables(expr)

File ~/github.com/ibis-project/ibis/ibis/backends/sql/__init__.py:257, in SQLBackend._register_in_memory_tables(self, expr)
    255 def _register_in_memory_tables(self, expr: ir.Expr) -> None:
    256     for memtable in expr.op().find(ops.InMemoryTable):
--> 257         self._register_in_memory_table(memtable)

File ~/github.com/ibis-project/ibis/ibis/backends/postgres/__init__.py:155, in Backend._register_in_memory_table(self, op)
    153 with self.begin() as cur:
    154     cur.execute(create_stmt_sql)
--> 155     execute_batch(cur, sql, data, 128)

File /nix/store/xglf45k8zzyfsk3a18c0f8j3kv77gkh7-python3-3.12.2-env/lib/python3.12/site-packages/psycopg2/extras.py:1216, in execute_batch(cur, sql, argslist, page_size)
   1214 for page in _paginate(argslist, page_size=page_size):
   1215     sqls = [cur.mogrify(sql, args) for args in page]
-> 1216     cur.execute(b";".join(sqls))

NumericValueOutOfRange: bigint out of range

penguins is also broken. For starwars, at least, the issue is that the height column is defined as bigint when it should definitely be a float.

What version of ibis are you using?

main @ b3e27eb

What backend(s) are you using, if any?

postgres

Relevant log output

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@gforsyth gforsyth added bug Incorrect behavior inside of ibis examples Issues/PRs related to `ibis.examples` postgres The PostgreSQL backend labels Mar 27, 2024
@gforsyth
Copy link
Member Author

I tried this out on 8.0.0 and it's also broken there, so not a new regression related to the sqlglot shift.

@gforsyth
Copy link
Member Author

I was poking at this a little bit, just trying to narrow down the problem.

The immediate cause appears to be the pyarrow CSV reader -- if the backend isn't one of duckdb or polars, then we load the csv using pyarrow.csv, then make it into a memtable, then pass it to create_table.

pyarrow.csv is either not sniffing enough lines of the csv (probably this), but it could be something else.

I will note that this impacts most SQL backends that have strong opinions about types (so all of them except sqlite).
We could probably get around this by converting all of the example datasets to parquet instead using duckdb (and just replacing those on the pinboard)

@chloeh13q
Copy link
Contributor

chloeh13q commented Apr 2, 2024

@gforsyth Thanks for the pointer!

I dug a little deeper into what you had said above and I believe the issue arises when there are null values in a column. In the starwars example, the column height is breaking the import. Specifically, we convert the data into a pandas df and use .itertuples() to generate the SQL, and depending on how the nan value gets executed it may throw an error. In Postgres nan gets turned into 'NaN'::float and that throws and error in a column that is supposed to be integers. In mysql it throws an exception like ProgrammingError: nan can not be used with MySQL. In Pyspark it throws PySparkTypeError: [CANNOT_ACCEPT_OBJECT_IN_TYPE] `LongType()` can not accept object `172.0` in type `float`.

Can we replace these NaN values with None/NULL?

I wasn't able to test all of the backends (because I have trouble setting some of them up) but this is working for bigquery, clickhouse, dask, datafusion, pandas, sqlite, trino. Broken in mysql, pyspark.

I can raise a PR to fix this if we're on the same page.

@gforsyth
Copy link
Member Author

I think it's definitely fine to convert NaNs to NULL -- the important thing here is for the backends to load some example data, so anything in service of that seems good to me!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Incorrect behavior inside of ibis examples Issues/PRs related to `ibis.examples` postgres The PostgreSQL backend
Projects
Archived in project
Development

No branches or pull requests

2 participants