
Support combining multiple columns into a single text feature #3426

Merged (11 commits) on Jun 1, 2023

Conversation

@tgaddair (Collaborator) commented May 30, 2023

For LLMs and other text models, it's common to treat multiple inputs as a single string of text to feed into the model as input, rather than construct separate "towers" for each feature. This PR provides a standard way to do this via the prompt config.

For ECD:

input_features:
    - name: context
      type: text
      preprocessing:
          prompt:
            template: The {color} {animal} jumped over the {size} {object}

For LLM:

model_type: llm
prompt:
    template: The {color} {animal} jumped over the {size} {object}
input_features:
    - name: context
      type: text

Here, context is not required to be a column of the input dataset, as it will be generated from the prompt. However, each of color, animal, size, and object is assumed to be a column of the input dataset.
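Conceptually, rendering the template against a dataset row works like Python string formatting keyed on column names. A minimal sketch of the idea (illustrative only, not Ludwig's actual implementation; `render_prompt` is a hypothetical helper):

```python
import re

def render_prompt(template: str, row: dict) -> str:
    """Fill a template's {field} placeholders from a dataset row."""
    # Extract placeholder names such as "color" or "animal" from the template.
    fields = re.findall(r"{(\w+)}", template)
    missing = [f for f in fields if f not in row]
    if missing:
        raise KeyError(f"Dataset is missing template columns: {missing}")
    return template.format(**{f: row[f] for f in fields})

template = "The {color} {animal} jumped over the {size} {object}"
row = {"color": "brown", "animal": "fox", "size": "small", "object": "fence"}
print(render_prompt(template, row))
# The brown fox jumped over the small fence
```

Applied row by row, this produces the generated `context` text feature described above.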

github-actions bot commented May 31, 2023

Unit Test Results

6 files ±0, 6 suites ±0, duration 1h 22m 39s (+5m 23s)
33 tests ±0: 29 passed ±0, 4 skipped ±0, 0 failed ±0
99 runs ±0: 87 passed ±0, 12 skipped ±0, 0 failed ±0

Results for commit 16ebf37. Comparison against base commit 2bada30.


@justinxzhao (Collaborator) left a comment

LGTM! Just 2 small comments.

ludwig/data/prompt.py (outdated, resolved)
tests/integration_tests/test_preprocessing.py (outdated, resolved)
@@ -169,44 +174,64 @@ def format_input_with_prompt(
template = DEFAULT_ZERO_SHOT_PROMPT_TEMPLATE
Collaborator commented:

Out of curiosity: would a user be able to use the DEFAULT_ZERO_SHOT_PROMPT_TEMPLATE for the prompt template within an input feature definition, like so?

input_features:
    - name: context
      type: text
      preprocessing:
          prompt:
            template: None  # Use default.

@tgaddair (Author) replied:

Yes, though I believe in this case they need to provide a task.

Contributor replied:

Yeah, that's right, they'd need to provide a task.


  preprocessing = input_feature_config["preprocessing"]
- if "prompt" in preprocessing and preprocessing["prompt"]["task"] is not None:
+ if _has_prompt_section(preprocessing):
Collaborator commented:

How does this work? Are we accepting prompt keys at both the local and global level?

@tgaddair (Author) replied:

The LLM schema requires the prompt at the top level and does not allow feature-level prompts.

The ECD schema allows prompts at the feature level, but not at the top level.

if is_few_shot:
    df["context"] = retrieval_model.search(df, backend, k=k, return_data=True)

def generate_prompt_for_row(row):
    kwargs = {col: field_to_dtype[col](row[col]) for col in template_fields}
Collaborator commented:

Why do we need to cast the values?

@tgaddair (Author) replied:

If you have a number like 0.1234 in the DataFrame and you want to format it like {number:.2f}, it will fail if the number is represented as a string in the DataFrame. So we need to cast it to a float first. I can add a comment to this effect.
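A minimal illustration of the failure mode (not Ludwig code, just plain Python string formatting):

```python
value = "0.1234"  # numbers read from a CSV often arrive as strings

try:
    # The :.2f format spec only applies to numeric types, not str.
    print("{number:.2f}".format(number=value))
except ValueError as e:
    print(f"string fails: {e}")

# Casting to float first makes the format spec work.
print("{number:.2f}".format(number=float(value)))
# 0.12
```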

@tgaddair merged commit d2f71c5 into master on Jun 1, 2023
16 checks passed
@tgaddair deleted the multi-col-prompt branch June 1, 2023 04:33
4 participants