
Adding support for raw python generator in addition to Dataset for pipelines #14352

Merged · Narsil merged 2 commits into huggingface:master from qol_iterator_pipeline on Nov 12, 2021

Conversation

@Narsil (Contributor) commented Nov 10, 2021

The main goal is to ease the creation of streaming data for the pipeline.

`Dataset` is more involved and PyTorch-specific.

This PR provides a way to use a plain Python iterator too.
This enables #14250 but can be proposed as a standalone PR.

```python
from transformers import pipeline

def read_data(filename):
    # Lazily yield one line at a time so the whole file never sits in memory.
    with open(filename, "r") as f:
        for line in f:
            yield line

pipe = pipeline("text-classification")
for classified in pipe(read_data("large_file.txt")):
    print("Success!", classified)
```

The main caveat is the interaction with `DataLoader` when
`num_workers > 1`. With multiple workers, each worker receives a copy
of the generator (as with `IterableDataset`), so a naive iterator
fails: every worker iterates over every item of the generator.

There are ways to do clever "skipping", but this can still be costly:
every worker must still pass over every item of the generator (each
simply ignores the items it doesn't handle), which depending on the
case may be expensive.
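
For illustration, here is a minimal sketch of that "skipping" approach (not part of this PR): each worker keeps only the items whose index matches its worker id, using PyTorch's `get_worker_info`.

```python
from torch.utils.data import IterableDataset, get_worker_info

class ShardedIterable(IterableDataset):
    """Round-robin sharding of a generator across DataLoader workers."""

    def __init__(self, generator_fn):
        # A callable returning a fresh generator, since every worker
        # calls __iter__ on its own copy of this dataset.
        self.generator_fn = generator_fn

    def __iter__(self):
        info = get_worker_info()  # None when num_workers=0
        num_workers = info.num_workers if info is not None else 1
        worker_id = info.id if info is not None else 0
        for i, item in enumerate(self.generator_fn()):
            # Every worker still consumes the whole stream; it merely
            # drops the items assigned to the other workers.
            if i % num_workers == worker_id:
                yield item

# Usage sketch:
# loader = DataLoader(ShardedIterable(lambda: read_data("large_file.txt")), num_workers=4)
```

Note that each worker still reads (and discards) the full stream, which is exactly the cost described above.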

Using `num_workers=1` is the simplest fix, and if the cost of loading
your data is small enough, it should be good enough. In the example
above, for instance, smart tricks to skip some lines are unlikely to
be a net positive.
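
A sketch of that fix (assuming, as with pipeline batching, that the call forwards `num_workers` to the underlying `DataLoader`):

```python
pipe = pipeline("text-classification")
# A single worker consumes the generator exactly once, in order.
for classified in pipe(read_data("large_file.txt"), num_workers=1):
    print(classified)
```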

If there is a cheap way to "jump" within your data, then using a
`Dataset` is preferable, since different workers can then jump
directly to their own items.
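
For example, a minimal map-style `Dataset` sketch (assuming a plain text file where a simple byte-offset index over lines is feasible): the `DataLoader` sampler hands each worker only its own indices, so nothing is read twice.

```python
from torch.utils.data import Dataset

class LineDataset(Dataset):
    """Map-style dataset over the lines of a text file.

    Builds a byte-offset index once, then seeks straight to the
    requested line, so workers never touch each other's items.
    """

    def __init__(self, filename):
        self.filename = filename
        self.offsets = []
        with open(filename, "rb") as f:
            offset = 0
            for line in f:
                self.offsets.append(offset)
                offset += len(line)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        with open(self.filename, "rb") as f:
            f.seek(self.offsets[idx])
            return f.readline().decode("utf-8").rstrip("\n")
```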

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@Narsil changed the title from "# What does this PR do?" to "Adding support for raw python generator in addition to Dataset for pipelines" on Nov 10, 2021
@sgugger (Collaborator) left a comment

Nice improvement, thanks for adding this!

src/transformers/pipelines/base.py — 3 review threads (outdated, resolved)
@LysandreJik (Member) left a comment

Ok nice! Looks cool and clean. Thanks for working on that, @Narsil.

@Narsil merged commit ed5d155 into huggingface:master on Nov 12, 2021
@Narsil deleted the qol_iterator_pipeline branch on November 12, 2021 08:20
Albertobegue pushed a commit to Albertobegue/transformers that referenced this pull request Jan 27, 2022
Adding support for raw python generator in addition to Dataset for pipelines (huggingface#14352)

* Adding support for raw python `generator` in addition to `Dataset`


* Adding iterator support for `tf` too.