feat: add tf.data APIs for reading batches #1488
ACTION NEEDED: Lance follows the Conventional Commits specification for release automation. The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error, please inspect the "PR Title Check" action.
83d614c to 3cacb62
3cacb62 to 34dea38
Minor naming issue; the rest looks good.
Curious what the performance of this is. Also, is shuffling within a batch not required?
dataset = lance.dataset(dataset)
num_rows = dataset.count_rows()
num_batches = (num_rows + batch_size - 1) // batch_size
indices = tf.data.Dataset.range(num_batches, dtype=tf.int64)
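For readers following the diff, here is a minimal sketch of how the ceiling division above turns a row count into per-batch row ranges. The helper name `batch_ranges` is hypothetical and not part of the PR, and the half-open `(start, end)` convention is an assumption based only on the `return (start, end)` line visible elsewhere in the diff.

```python
from __future__ import annotations


def batch_ranges(num_rows: int, batch_size: int) -> list[tuple[int, int]]:
    """Return half-open (start, end) row ranges, one per batch.

    Hypothetical helper for illustration; not the PR's implementation.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Ceiling division, the same formula used in the diff.
    num_batches = (num_rows + batch_size - 1) // batch_size
    ranges = []
    for batch_id in range(num_batches):
        start = batch_id * batch_size
        # The last batch may be shorter than batch_size.
        end = min(start + batch_size, num_rows)
        ranges.append((start, end))
    return ranges


print(batch_ranges(10, 4))  # [(0, 4), (4, 8), (8, 10)]
```

With 10 rows and a batch size of 4, the formula gives 3 batches, and the final one covers only the remaining 2 rows.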
For, say, a 1B-row dataset, this could be 1 million batches, so a few MB of batch IDs?
Yeah, the user sets the batch size, so they control the memory use here.
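As a rough check of the estimate in this exchange, the sketch below computes how many bytes the batch IDs would take if they were all held in memory at once as 8-byte integers (matching the `tf.int64` dtype in the diff). The function name `batch_id_bytes` is hypothetical, the batch size of 1,000 is an assumption implied by "1B rows, 1 million batches", and the thread does not confirm whether the IDs are ever fully materialized.

```python
def batch_id_bytes(num_rows: int, batch_size: int, bytes_per_id: int = 8) -> int:
    """Bytes needed to hold every batch ID at once (hypothetical estimate)."""
    # Same ceiling division as in the diff.
    num_batches = (num_rows + batch_size - 1) // batch_size
    return num_batches * bytes_per_id


# 1B rows with an assumed batch size of 1,000 gives 1,000,000 batch IDs.
print(batch_id_bytes(1_000_000_000, 1_000))  # 8000000
```

Under these assumptions the IDs come to about 8 MB, and a larger batch size shrinks that figure proportionally.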
python/python/lance/tf/data.py
Outdated
return (start, end)


def lance_batches(
Does tf.data have a from_batches? Can we rename this to from_batch to be more consistent with the tf style?
Sure
Users can make the batches as small as they want, and can always call the
Closes #1499