Adding support for raw python `generator` in addition to `Dataset` for pipelines #14352

Narsil · 2021-11-10T09:50:22Z

The main goal is to make it easier to stream data into the pipeline.

`Dataset` is more involved and PyTorch-specific.

This PR provides a way to use a plain Python iterator as well.
This enables #14250 but can be proposed as a standalone PR.

from transformers import pipeline

def read_data(filename):
    # Lazily yield one line at a time so the whole file never sits in memory.
    with open(filename, "r") as f:
        for line in f:
            yield line

pipe = pipeline("text-classification")
for classified in pipe(read_data("large_file.txt")):
    print("Success ! ", classified)

The main caveat is the interaction with `DataLoader` when
`num_workers>1`. With multiple workers, each one receives its own copy
of the generator (as with `IterableDataset`). A naive iterator
therefore fails, because every worker iterates over all items of the
generator.
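
To make the duplication concrete, here is a minimal pure-Python simulation. It does not use `DataLoader` at all; `make_generator` and `simulate_workers` are hypothetical names introduced only for illustration, and the loop merely imitates what happens when each worker gets its own copy of the generator.

```python
def make_generator():
    # Stand-in for read_data(...): yields three items.
    for i in range(3):
        yield f"line {i}"


def simulate_workers(generator_factory, num_workers):
    # Each simulated worker builds its own independent copy of the generator,
    # so it sees every item, not just a share of them.
    outputs = []
    for worker_id in range(num_workers):
        for item in generator_factory():
            outputs.append((worker_id, item))
    return outputs


results = simulate_workers(make_generator, num_workers=2)
print(len(results))  # 6: every line appears once per worker
```

With two simulated workers, each of the three lines is produced twice, which is the duplication described above.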

There are ways to do clever "skipping", but they can still be costly:
every worker has to pass through all items of the generator and simply
ignores the ones it does not handle.
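
One possible form of such skipping, sketched with `itertools.islice`. The helper `shard` and the toy generator `numbers` are assumptions made for this example and are not part of this PR. Inside a real `DataLoader` worker, the worker id and worker count could presumably be read from `torch.utils.data.get_worker_info()`; here they are passed in by hand.

```python
from itertools import islice


def shard(generator, worker_id, num_workers):
    # Keep every num_workers-th item, starting at offset worker_id.
    # islice still advances the underlying generator through the skipped
    # items, so each worker pays the cost of producing all of them.
    return islice(generator, worker_id, None, num_workers)


def numbers():
    # Toy generator standing in for a streaming data source.
    for i in range(10):
        yield i


worker_0 = list(shard(numbers(), worker_id=0, num_workers=2))
worker_1 = list(shard(numbers(), worker_id=1, num_workers=2))
```

Each worker ends up with a disjoint share of the items, but both still consumed the full generator to get there.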

Using `num_workers=1` is the simplest fix, and it should be good enough
if loading your data is cheap. In the example above, for instance,
tricks to skip some lines are unlikely to be a net positive.

If the data supports efficient "jumps", using `Dataset` is the better
option, since the different workers can then jump to their own items
directly.
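
As a rough sketch of what "jumping" could look like for a line-oriented text file, the class below records the byte offset of each line once and then seeks straight to the requested line. `LineOffsetDataset` is a hypothetical name, not something added by this PR, and the class is kept in plain Python; to pass it to a pipeline it would presumably need to subclass `torch.utils.data.Dataset`.

```python
import os
import tempfile


class LineOffsetDataset:
    """Map-style access to the lines of a text file via precomputed byte offsets."""

    def __init__(self, filename):
        self.filename = filename
        self.offsets = []
        # One pass over the file to record where each line starts.
        with open(filename, "rb") as f:
            offset = 0
            for line in f:
                self.offsets.append(offset)
                offset += len(line)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, index):
        # Seek directly to the requested line instead of reading everything before it.
        with open(self.filename, "rb") as f:
            f.seek(self.offsets[index])
            return f.readline().decode("utf-8").rstrip("\r\n")


# Small demo on a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first\nsecond\nthird\n")
    path = tmp.name

dataset = LineOffsetDataset(path)
third = dataset[2]
os.remove(path)
```

The trade-off is one indexing pass up front, after which any worker can fetch any item without walking through the ones before it.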



sgugger

Nice improvement, thanks for adding this!

src/transformers/pipelines/base.py

LysandreJik

Ok nice! Looks cool and clean. Thanks for working on that, @Narsil.


Narsil requested review from LysandreJik and sgugger November 10, 2021 09:53

Narsil changed the title ~~# What does this PR do?~~ Adding support for raw python generator in addition to Dataset for pipelines Nov 10, 2021

Narsil changed the title ~~Adding support for raw python generator in addition to Dataset for pipelines~~ Adding support for raw python generator in addition to Dataset for pipelines Nov 10, 2021

sgugger approved these changes Nov 10, 2021


Adding iterator support for tf too.

9ef09a7

LysandreJik approved these changes Nov 11, 2021


Narsil merged commit ed5d155 into huggingface:master Nov 12, 2021

Narsil deleted the qol_iterator_pipeline branch November 12, 2021 08:20
