Add workflow streams #461

Closed
davidmezzetti opened this issue Apr 17, 2023 · 0 comments
@davidmezzetti
Member

Currently, when a workflow is run through the API, the input data must be passed in through the API call. This is fine for most situations but inefficient for large data processing operations.

When a workflow is run through Python, a generator can be passed in to efficiently process a large dataset. This issue will add a new workflow YAML argument named stream. A stream effectively acts as a data generator that feeds data to the workflow, which enables full server-side processing of content.

The following is an example workflow that indexes the cnn_dailymail Hugging Face dataset.

writable: True
embeddings:
  path: sentence-transformers/all-MiniLM-L6-v2
  content: True

tabular:
  idcolumn: id
  textcolumns:
    - highlights

workflow:
  index:
    stream:
      action: datasets.load_dataset
      args:
        path: cnn_dailymail
        name: 3.0.0
        split: test
    tasks:
      - tabular
      - index
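
To illustrate the idea behind the stream argument, here is a minimal Python sketch of how a configured action could be resolved to a callable and consumed lazily in batches. This is not the txtai implementation: the helper names resolve, stream and run are hypothetical, and builtins.range stands in for datasets.load_dataset so the example runs without external dependencies.

```python
import importlib
from itertools import islice


def resolve(action):
    """Resolve a dotted path such as 'module.function' to a callable (hypothetical helper)."""
    module, _, name = action.rpartition(".")
    return getattr(importlib.import_module(module), name)


def stream(action, args):
    """Call the configured action and lazily yield its elements (hypothetical helper)."""
    function = resolve(action)
    # Assumption: dict args map to keyword arguments, list args to positional arguments
    iterable = function(**args) if isinstance(args, dict) else function(*args)
    yield from iterable


def run(tasks, elements, batch=2):
    """Pass elements through each task in small batches without loading the full input."""
    iterator = iter(elements)
    while chunk := list(islice(iterator, batch)):
        for task in tasks:
            chunk = task(chunk)
        yield from chunk


# Stand-in configuration: builtins.range replaces datasets.load_dataset
config = {"action": "builtins.range", "args": [5]}

# A single stand-in task that multiplies each element by 10
tasks = [lambda rows: [x * 10 for x in rows]]

results = list(run(tasks, stream(config["action"], config["args"])))
print(results)
```

Because the data source is created on the server from the configuration, the caller only needs to trigger the workflow rather than upload the dataset.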
@davidmezzetti davidmezzetti added this to the v5.5.0 milestone Apr 17, 2023
@davidmezzetti davidmezzetti self-assigned this Apr 17, 2023