Parquet FLOAT vs. DOUBLE type issue #2215

Closed
prrao87 opened this issue Oct 14, 2023 · 3 comments · Fixed by #3887
Labels
bug: Something isn't working
data-import-export: Issues related to data importing or exporting, such as copy to/from statements
feature: New features or missing components of existing features

Comments

@prrao87
Member

prrao87 commented Oct 14, 2023

Kùzu version: 0.0.10

I upgraded to the just-released version 0.0.10 and noticed an issue when reading Parquet files whose columns are DOUBLE into properties declared as FLOAT, which wasn't present in the previous version. I think the default behaviour should accept both data types, since Parquet data can come from many different sources. The Parquet file attached here was written by Polars with snappy compression.

MRE

import duckdb

conn_1 = duckdb.connect()
tbl = conn_1.sql("SELECT * FROM 'cities.parquet'")

print(tbl)

import kuzu

db = kuzu.Database(f"./test")
conn = kuzu.Connection(db)

def create_city_node_table(conn) -> None:
    conn.execute(
        """
        CREATE NODE TABLE
            City(
                id INT64,
                city STRING,
                state STRING,
                country STRING,
                lat FLOAT,
                lon FLOAT,
                population INT64,
                PRIMARY KEY (id)
            )
        """
    )

create_city_node_table(conn)
conn.execute(f"COPY City FROM 'cities.parquet';")
$ python test.py
┌───────┬────────────┬─────────┬─────────┬─────────┬───────────┬────────────┐
│  id   │    city    │  state  │ country │   lat   │    lng    │ population │
│ int64 │  varchar   │ varchar │ varchar │ double  │  double   │   int32    │
├───────┼────────────┼─────────┼─────────┼─────────┼───────────┼────────────┤
│     1 │ Airdrie    │ Alberta │ Canada  │ 51.2917 │ -114.0144 │      61581 │
│     2 │ Beaumont   │ Alberta │ Canada  │ 53.3572 │ -113.4147 │      17396 │
│     3 │ Blackfalds │ Alberta │ Canada  │ 52.3833 │    -113.8 │       9328 │
│     4 │ Brooks     │ Alberta │ Canada  │ 50.5642 │ -111.8989 │      14451 │
│     5 │ Calgary    │ Alberta │ Canada  │   51.05 │ -114.0667 │    1239220 │
└───────┴────────────┴─────────┴─────────┴─────────┴───────────┴────────────┘

Traceback (most recent call last):
  File "/code/kuzudb-study/test.py", line 31, in <module>
    conn.execute(f"COPY City FROM 'cities.parquet';")
  File "/code/kuzudb-study/.venv/lib/python3.11/site-packages/kuzu/connection.py", line 90, in execute
    self._connection.execute(
RuntimeError: Binder exception: Column `lat` type mismatch. Expected FLOAT but got DOUBLE.

Example file

cities.parquet.zip

As can be seen, DuckDB reads the Parquet file without any problem and reports the `lat` and `lng` columns as DOUBLE. I think the correct behaviour here should be that Kùzu coerces the data type to FLOAT without the user having to worry about it?
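One thing worth keeping in mind about such a coercion: DOUBLE to FLOAT is a narrowing cast, so each value gets rounded to roughly 7 significant digits. Below is a minimal sketch of that rounding using only Python's standard `struct` module. The helper name `to_float32` is made up for illustration and is not part of Kùzu or DuckDB.

```python
import struct


def to_float32(value: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float and widen it back.

    Hypothetical helper for illustration only; it mimics what a
    DOUBLE -> FLOAT cast does to a single value.
    """
    return struct.unpack("<f", struct.pack("<f", value))[0]


lat = 51.2917            # stored as DOUBLE in cities.parquet
lat32 = to_float32(lat)  # what a FLOAT column would hold

# The difference is tiny but non-zero for most decimal values.
print(lat, lat32, abs(lat - lat32))
```

For city coordinates this loss is harmless, which is why an implicit cast seems like a reasonable default here.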

@ray6080
Contributor

ray6080 commented Oct 15, 2023

Hi @prrao87, yeah this is a known issue, and I totally agree with you. We introduced stricter data type checks during the binding phase to fix some type mismatch bugs, but were not able to finish the casting part yet. Before the next major release, we will add back more flexible data type casting. Sorry for the inconvenience caused on your side.
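To make the difference between the strict check and the planned flexible casting concrete, here is a rough sketch. It is not Kùzu's actual binder code: `bind_column`, `BinderError`, and the `IMPLICIT_CASTS` table are hypothetical names invented for this illustration, and the set of allowed casts is an assumption.

```python
# Hypothetical sketch of a bind-time type check; NOT Kùzu's implementation.

# Assumed set of (source type, target type) pairs that may be cast implicitly.
IMPLICIT_CASTS = {
    ("DOUBLE", "FLOAT"),
    ("FLOAT", "DOUBLE"),
    ("INT32", "INT64"),
}


class BinderError(Exception):
    """Raised when a file column cannot be bound to a table property."""


def bind_column(name: str, expected: str, actual: str, allow_casts: bool):
    """Return the cast expression to apply, or None if the types already match.

    With allow_casts=False this behaves like the strict check described
    above: any mismatch is an error.
    """
    if expected == actual:
        return None
    if allow_casts and (actual, expected) in IMPLICIT_CASTS:
        return f"CAST({name} AS {expected})"
    raise BinderError(
        f"Column `{name}` type mismatch. Expected {expected} but got {actual}."
    )


# Flexible mode: the DOUBLE column is cast to the declared FLOAT type.
print(bind_column("lat", "FLOAT", "DOUBLE", allow_casts=True))
```

In strict mode (`allow_casts=False`) the same call raises `BinderError` with a message shaped like the one in the traceback above.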

semihsalihoglu-uw added the data-import-export, bug, and feature labels Jan 8, 2024
@semihsalihoglu-uw
Contributor

Labeling with both bug and feature as it's not very clear to me which one is more accurate.

@andyfengHKU
Contributor

Solved in #3887
