binaryData: Discrepancy between documentation and serialized page for VARIABLE_WIDTH offsets #22601

Open
jagill opened this issue Apr 24, 2024 · 0 comments

jagill commented Apr 24, 2024

The documentation for serialized pages (https://prestodb.io/docs/current/develop/serialized-page.html#variable-width-encoding) states that the offsets for VARIABLE_WIDTH columns are the starting offsets of each entry. In the pages actually returned, I find that they are the ending offsets of each entry.

cc @arhimondr @mbasmanova

Your Environment

  • Presto version used: 0.287-edge18.1
  • Storage (HDFS/S3/GCS..): None, literal query
  • Data source and connector used: None, literal query
  • Deployment (Cloud or On-prem): Meta internal

Expected Behavior

For the following query (using binaryData=true):

SELECT
  s
FROM (
  VALUES
    'cat',
    'bird'
) AS t(s)

The offsets of the VARIABLE_WIDTH column should be [0, 3], the starting offsets according to https://prestodb.io/docs/current/develop/serialized-page.html#variable-width-encoding .

Current Behavior

The offsets are [3, 7], the ending offsets.
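
To make the two conventions concrete, here is a minimal Python sketch that derives both offset arrays from the byte lengths of the two values in the query. The function names are hypothetical and only illustrate the arithmetic; they are not part of Presto.

```python
from itertools import accumulate


def ending_offsets(values):
    """Cumulative byte lengths: offset i is where entry i ends."""
    return list(accumulate(len(v.encode("utf-8")) for v in values))


def starting_offsets(values):
    """Offset i is where entry i begins (its end minus its own length)."""
    lengths = [len(v.encode("utf-8")) for v in values]
    ends = list(accumulate(lengths))
    return [end - length for end, length in zip(ends, lengths)]


values = ["cat", "bird"]
print(starting_offsets(values))  # what the documentation describes
print(ending_offsets(values))    # what the serialized page contains
```

For 'cat' and 'bird' this gives [0, 3] for the starting convention and [3, 7] for the ending convention, matching the expected and observed values above.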

Possible Solution

Change either the documentation or the implementation so that the two are consistent.

Steps to Reproduce

Issue the following query (with binaryData=true):

SELECT
  s
FROM (
  VALUES
    'cat',
    'bird'
) AS t(s)

It should return a base64-encoded slice similar to:

"AgAAAAQuAAAALgAAAJki1XAAAAAAAQAAAA4AAABWQVJJQUJMRV9XSURUSAIAAAADAAAABwAAAAAHAAAAY2F0YmlyZAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="

Notice that bytes 47..55 (zero-based, end exclusive) encode the offsets as two little-endian int32s, and they are 3 and 7. The documentation suggests that they should be 0 and 3.
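
A short Python sketch for checking this by hand is below. It assumes the layout described in the serialized-page documentation: a 21-byte page header (row count, codec marker, two sizes, checksum), then the column count, the encoding name, and the VARIABLE_WIDTH column. `PREFIX` is only the first 92 base64 characters of the slice above; the remainder appears to be zero padding. `parse_variable_width` is a hypothetical helper, and it handles only a single column without nulls.

```python
import base64
import struct

# First 92 base64 characters of the slice quoted above (69 bytes).
PREFIX = (
    "AgAAAAQuAAAALgAAAJki1XAAAAAAAQAAAA4AAABWQVJJQUJMRV9XSURUSAIAAAAD"
    "AAAABwAAAAAHAAAAY2F0YmlyZAAA"
)

# Assumed page header: int32 rows, 1-byte codec marker, int32 uncompressed
# size, int32 size, int64 checksum = 21 bytes.
HEADER_SIZE = 4 + 1 + 4 + 4 + 8


def parse_variable_width(page: bytes):
    pos = HEADER_SIZE
    (num_columns,) = struct.unpack_from("<i", page, pos)
    pos += 4
    if num_columns != 1:
        raise ValueError("sketch handles exactly one column")
    (name_len,) = struct.unpack_from("<i", page, pos)
    pos += 4
    encoding = page[pos:pos + name_len].decode("ascii")
    pos += name_len
    (rows,) = struct.unpack_from("<i", page, pos)
    pos += 4
    offsets_start = pos  # byte index of the first offset
    offsets = list(struct.unpack_from(f"<{rows}i", page, pos))
    pos += 4 * rows
    has_nulls = page[pos]
    pos += 1
    if has_nulls:
        raise ValueError("sketch does not handle null flags")
    (total_len,) = struct.unpack_from("<i", page, pos)
    pos += 4
    data = page[pos:pos + total_len]
    # Treat the offsets as ENDING offsets, as observed in this issue.
    starts = [0] + offsets[:-1]
    values = [data[s:e].decode("utf-8") for s, e in zip(starts, offsets)]
    return encoding, offsets_start, offsets, values


print(parse_variable_width(base64.b64decode(PREFIX)))
```

On this prefix the sketch reports the encoding VARIABLE_WIDTH, the first offset at byte 47, the offsets [3, 7], and the values 'cat' and 'bird', which only come out correctly when the offsets are read as ending offsets.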

Context

I have worked around this, but the documentation and implementation should be consistent.
