Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add tf.data APIs for reading batches #1488

Merged
merged 4 commits into from
Nov 2, 2023
Merged

Conversation

wjones127
Copy link
Contributor

@wjones127 wjones127 commented Oct 31, 2023

Closes #1499

Copy link

ACTION NEEDED

Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@wjones127 wjones127 changed the title wip: try to handle a take stream feat: add tf.data APIs for reading batches Nov 1, 2023
@wjones127 wjones127 marked this pull request as ready for review November 1, 2023 18:23
@wjones127 wjones127 requested a review from eddyxu November 1, 2023 18:24
Copy link
Contributor

@eddyxu eddyxu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor naming issue. Rest seems good
Curious what performance this is . Also they don't require shuffling within Batch?

dataset = lance.dataset(dataset)
num_rows = dataset.count_rows()
num_batches = (num_rows + batch_size - 1) // batch_size
indices = tf.data.Dataset.range(num_batches, dtype=tf.int64)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

say a 1B dataset, this one can be 1-million batches, so a few MB batch ids?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah the user sets the batch size, so they control the memory use here.

return (start, end)


def lance_batches(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does tf.data has a from_batches, can we name this to from_batch to be more consist with the tf style

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

python/python/lance/tf/data.py Show resolved Hide resolved
@wjones127
Copy link
Contributor Author

Also they don't require shuffling within Batch?

Users can make the batches as small as they want, and can always call the shuffle method to shuffle multiple batches together. I don't think we need to handle that ourselves.

@wjones127 wjones127 merged commit 92f0f8d into main Nov 2, 2023
10 checks passed
@wjones127 wjones127 deleted the wjones127/tf-take branch November 2, 2023 18:06
wjones127 added a commit that referenced this pull request Nov 2, 2023
Closes #1500.

This is a follow up to #1488.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

tf.data API to read data in batches
2 participants