This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Pyarrow and Arrow2 don't agree on Timestamp resolution #700

Closed
ritchie46 opened this issue Dec 22, 2021 · 2 comments · Fixed by #701

@ritchie46
Collaborator

ritchie46 commented Dec 22, 2021

A simple dataset reproduces the problem:

    import io

    import polars as pl

    data = {
        "datetime": [  # unix timestamp in ms
            1618354800000,
            1618354740000,
            1618354680000,
            1618354620000,
            1618354560000,
        ],
        "laf_max": [73.1999969482, 71.0999984741, 74.5, 69.5999984741, 69.6999969482],
        "laf_eq": [59.5999984741, 61.0, 62.2999992371, 56.9000015259, 60.0],
    }
    df = pl.DataFrame(data)
    df = df.with_column(df["datetime"].cast(pl.Datetime))
    df
shape: (5, 3)
┌────────────────────────────┬───────────────┬───────────────┐
│ datetime                   ┆ laf_max       ┆ laf_eq        │
│ ---                        ┆ ---           ┆ ---           │
│ datetime                   ┆ f64           ┆ f64           │
╞════════════════════════════╪═══════════════╪═══════════════╡
│ 1970-01-01 00:26:58.354800 ┆ 73.1999969482 ┆ 59.5999984741 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1970-01-01 00:26:58.354740 ┆ 71.0999984741 ┆ 61            │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1970-01-01 00:26:58.354680 ┆ 74.5          ┆ 62.2999992371 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1970-01-01 00:26:58.354620 ┆ 69.5999984741 ┆ 56.9000015259 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1970-01-01 00:26:58.354560 ┆ 69.6999969482 ┆ 60            │
└────────────────────────────┴───────────────┴───────────────┘

    f = io.BytesIO()
    # write with pyarrow
    df.to_parquet(f, use_pyarrow=True)
    f.seek(0)
    # read back with the native (arrow2) reader
    read = pl.read_parquet(f)
    read
shape: (5, 3)
┌───────────────────────────────┬───────────────┬───────────────┐
│ datetime                      ┆ laf_max       ┆ laf_eq        │
│ ---                           ┆ ---           ┆ ---           │
│ datetime                      ┆ f64           ┆ f64           │
╞═══════════════════════════════╪═══════════════╪═══════════════╡
│ 1970-01-01 00:00:01.618354800 ┆ 73.1999969482 ┆ 59.5999984741 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1970-01-01 00:00:01.618354740 ┆ 71.0999984741 ┆ 61            │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1970-01-01 00:00:01.618354680 ┆ 74.5          ┆ 62.2999992371 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1970-01-01 00:00:01.618354620 ┆ 69.5999984741 ┆ 56.9000015259 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1970-01-01 00:00:01.618354560 ┆ 69.6999969482 ┆ 60            │
└───────────────────────────────┴───────────────┴───────────────┘

If we write and read with pyarrow, the timestamp is correct. If we write and read with arrow2, the timestamp is also correct. Only the combination above (write with pyarrow, read with arrow2) gives the wrong result.

@jorgecarleitao jorgecarleitao added the bug Something isn't working label Dec 22, 2021
@jorgecarleitao jorgecarleitao self-assigned this Dec 22, 2021
@jorgecarleitao
Owner

This is an interesting case: pyarrow writes arrow's nanosecond precision as parquet's logical type microseconds, and divides the values by 1000 accordingly. So, the file ends up with

  • parquet's logical type: microseconds
  • arrow's logical type (in the schema's metadata): nanoseconds

The bug on our end is that we ignore parquet's logical type when deserializing, which causes us to read parquet's microseconds into arrow's nanoseconds without converting them.
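To make the arithmetic concrete, here is a minimal sketch in plain Python using the first value from the report. It is only an illustration of the unit mismatch, not arrow2's or pyarrow's actual code; the names `TICKS_PER_SECOND`, `rescale`, and `as_datetime` are hypothetical helpers introduced for this example.

```python
from datetime import datetime, timedelta, timezone

# Ticks per second for each time unit (illustrative table, not a library API).
TICKS_PER_SECOND = {"ms": 1_000, "us": 1_000_000, "ns": 1_000_000_000}
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def rescale(value: int, from_unit: str, to_unit: str) -> int:
    """Convert an integer timestamp between units (floor division when coarsening)."""
    src = TICKS_PER_SECOND[from_unit]
    dst = TICKS_PER_SECOND[to_unit]
    if dst >= src:
        return value * (dst // src)
    return value // (src // dst)


def as_datetime(value: int, unit: str) -> datetime:
    """Interpret an integer timestamp in `unit`; datetime keeps microseconds at best."""
    return EPOCH + timedelta(microseconds=rescale(value, unit, "us"))


# The in-memory column holds this integer as arrow nanoseconds.
original_ns = 1618354800000

# What ends up in the file: parquet logical type microseconds, value divided by 1000.
stored_us = rescale(original_ns, "ns", "us")

# Buggy read: the parquet unit is ignored and the stored integer is taken as nanoseconds.
buggy_ns = stored_us

# Correct read: convert from the parquet unit (us) to the arrow schema unit (ns).
fixed_ns = rescale(stored_us, "us", "ns")

print(as_datetime(original_ns, "ns"))  # before the round trip
print(as_datetime(buggy_ns, "ns"))     # after the buggy round trip
print(as_datetime(fixed_ns, "ns"))     # after a unit-aware round trip
```

The buggy value is off by exactly a factor of 1000, which matches the jump from `00:26:58.354800` to `00:00:01.618354800` in the two tables above.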

@ritchie46
Collaborator Author

I had to read it twice, but I think I understand. :D So there are two logical types in a parquet file? The one written and the destination's logical type.
