
Support combining multiple columns into a single text feature #3426

Merged (11 commits) on Jun 1, 2023

Conversation

@tgaddair (Collaborator) commented May 30, 2023

For LLMs and other text models, it's common to treat multiple inputs as a single string of text to feed into the model as input, rather than construct separate "towers" for each feature. This PR provides a standard way to do this via the prompt config.

For ECD:

input_features:
    - name: context
      type: text
      preprocessing:
          prompt:
            template: The {color} {animal} jumped over the {size} {object}

For LLM:

model_type: llm
prompt:
    template: The {color} {animal} jumped over the {size} {object}
input_features:
    - name: context
      type: text

Here, context is not required to be a column of the input dataset, as it will be generated from the prompt. However, each of color, animal, size, and object is assumed to be a column of the input dataset.
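Conceptually, rendering the template against a dataset row works like Python string formatting keyed on column names. A minimal sketch of the idea (illustrative only, not Ludwig's actual implementation; `render_prompt` is a hypothetical helper):

```python
import re

def render_prompt(template: str, row: dict) -> str:
    """Fill a template's {field} placeholders from a dataset row."""
    # Extract placeholder names such as "color" or "animal" from the template.
    fields = re.findall(r"{(\w+)}", template)
    missing = [f for f in fields if f not in row]
    if missing:
        raise KeyError(f"Dataset is missing template columns: {missing}")
    return template.format(**{f: row[f] for f in fields})

template = "The {color} {animal} jumped over the {size} {object}"
row = {"color": "brown", "animal": "fox", "size": "small", "object": "fence"}
print(render_prompt(template, row))
# The brown fox jumped over the small fence
```

Applied row by row, this produces the generated `context` text feature described above.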

github-actions bot commented May 31, 2023

Unit Test Results

6 files ±0, 6 suites ±0, duration 1h 22m 39s (+5m 23s)
33 tests ±0: 29 passed ±0, 4 skipped ±0, 0 failed ±0
99 runs ±0: 87 passed ±0, 12 skipped ±0, 0 failed ±0

Results for commit 16ebf37. Comparison against base commit 2bada30.


@justinxzhao (Collaborator) left a comment

LGTM! Just 2 small comments.

ludwig/data/prompt.py (outdated, resolved)
tests/integration_tests/test_preprocessing.py (outdated, resolved)
@@ -169,44 +174,64 @@ def format_input_with_prompt(
template = DEFAULT_ZERO_SHOT_PROMPT_TEMPLATE
Collaborator commented:

Out of curiosity: would a user be able to use the DEFAULT_ZERO_SHOT_PROMPT_TEMPLATE for the prompt template within an input feature definition, like so?

input_features:
    - name: context
      type: text
      preprocessing:
          prompt:
            template: None  # Use default.

@tgaddair (Author) replied:

Yes, though I believe in this case they need to provide a task.

Contributor replied:

Yeah, that's right, they'd need to provide a task.


  preprocessing = input_feature_config["preprocessing"]
- if "prompt" in preprocessing and preprocessing["prompt"]["task"] is not None:
+ if _has_prompt_section(preprocessing):
Collaborator commented:

How does this work? Are we accepting prompt keys at both the local and global level?

@tgaddair (Author) replied:

The LLM schema requires the prompt at the top level and does not allow feature-level prompts.

The ECD schema allows prompts at the feature level, but not at the top level.

if is_few_shot:
    df["context"] = retrieval_model.search(df, backend, k=k, return_data=True)

def generate_prompt_for_row(row):
    kwargs = {col: field_to_dtype[col](row[col]) for col in template_fields}
Collaborator commented:

Why do we need to cast the values?

@tgaddair (Author) replied:

If you have a number like 0.1234 in the DataFrame and you want to format it like {number:.2f}, it will fail if the number is represented as a string in the DataFrame. So we need to cast it to a float first. I can add a comment to this effect.
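A minimal illustration of the failure mode (not Ludwig code, just plain Python string formatting):

```python
value = "0.1234"  # numbers read from a CSV often arrive as strings

try:
    # The :.2f format spec only applies to numeric types, not str.
    print("{number:.2f}".format(number=value))
except ValueError as e:
    print(f"string fails: {e}")

# Casting to float first makes the format spec work.
print("{number:.2f}".format(number=float(value)))
# 0.12
```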

@tgaddair merged commit d2f71c5 into master on Jun 1, 2023
16 checks passed
@tgaddair deleted the multi-col-prompt branch June 1, 2023 04:33
4 participants