Add workflow streams #461

davidmezzetti · 2023-04-17T19:28:58Z

Currently, when a workflow is run through the API, its input data must be passed in through the API as well. This is fine for most situations but inefficient for a large data processing operation.

When a workflow is run through Python, a generator can be passed in to efficiently process a large dataset. This issue adds a new YAML argument to workflows named stream. A stream acts as a data generator that feeds data to the workflow, which enables full server-side processing of content.
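As a rough illustration of why a generator helps with large datasets, here is a minimal standard-library sketch. It is not txtai code: `records`, `batches` and `run` are hypothetical names made up for this example, and `run` only stands in for a workflow that consumes its input in batches.

```python
from itertools import islice

def records(count):
    """Hypothetical data source: lazily yields one record at a time."""
    for i in range(count):
        yield {"id": i, "text": f"document {i}"}

def batches(iterable, size):
    """Groups any iterable into lists of at most `size` items."""
    iterator = iter(iterable)
    while batch := list(islice(iterator, size)):
        yield batch

def run(stream, task, batch_size=100):
    """Hypothetical stand-in for a workflow: applies `task` to each batch."""
    for batch in batches(stream, batch_size):
        yield from task(batch)

# Records are pulled from the generator one batch at a time, never all at once
results = run(records(250), lambda batch: [row["text"].upper() for row in batch])
first = next(results)
```

Because both the source and the runner are generators, nothing is read until a result is requested, and at most one batch of input is held in memory at a time.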

The following is an example workflow that indexes the cnn_dailymail Hugging Face dataset.

writable: True
embeddings:
  path: sentence-transformers/all-MiniLM-L6-v2
  content: True

tabular:
  idcolumn: id
  textcolumns:
    - highlights

workflow:
  index:
    stream:
      action: datasets.load_dataset
      args:
        path: cnn_dailymail
        name: 3.0.0
        split: test
    tasks:
      - tabular
      - index
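One way to picture how a stream definition like the one above could be turned into data is sketched below. This is an assumption for illustration only, not the implementation from this issue: `resolve` and `stream` are hypothetical helpers, and `textwrap.wrap` stands in for `datasets.load_dataset` so that the sketch runs with the standard library alone.

```python
from importlib import import_module

def resolve(action):
    """Resolves a dotted path such as "module.function" to the callable it names."""
    module, _, name = action.rpartition(".")
    return getattr(import_module(module), name)

def stream(config):
    """Hypothetical helper: calls the configured action with its args and yields each item."""
    function = resolve(config["action"])
    yield from function(**config.get("args", {}))

# Stand-in for the YAML stream section, using a standard library function as the action
config = {
    "action": "textwrap.wrap",
    "args": {"text": "alpha beta gamma", "width": 5},
}

rows = list(stream(config))
```

With the YAML above, the same idea would resolve `datasets.load_dataset` and call it with the keyword arguments listed under `args`, then pass the resulting rows to the `tabular` and `index` tasks.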

davidmezzetti added this to the v5.5.0 milestone Apr 17, 2023

davidmezzetti self-assigned this Apr 17, 2023

davidmezzetti closed this as completed in 178f4c2 Apr 17, 2023
