
Evaluation Methods: Similarity Check ? #6

Closed
MentalGear opened this issue May 22, 2023 · 6 comments

@MentalGear
Contributor

MentalGear commented May 22, 2023

First off: thank you for providing a Node FOSS prompt-testing framework! Also, the web view is really handy!

However, when it comes to explaining how evaluation is done, the documentation is lacking in detail: how exactly are outputs scored? Simply by keyword matching or exact overlap, or are more advanced measures, such as distance-based similarity on embeddings, built in?

An excellent library to draw inspiration from that does semantic similarity testing (Python) is squidgy-testy.

EDIT: Corrected Link.

@MentalGear
Contributor Author

MentalGear commented May 22, 2023

Correction: link is https://github.com/squidgyai/squidgy-testy

@typpo
Collaborator

typpo commented May 22, 2023

I appreciate the feedback! Right now, evaluation is done in one of four ways:

  1. Direct string comparison: this is the default behavior for anything you put in the __expected column.

  2. Basic JavaScript logic: using the eval: prefix, you can run string checks and keyword matches on the output. For example:

    eval: output.includes('foo')
    

    The test runner expects a piece of JavaScript code that returns a pass/fail boolean.

  3. Self-grading with LLM: using the grade: prefix, you can ask an LLM to evaluate the output against your criteria. For example:

    grade: output contains a reference to a movie
    

    The test runner uses the provider specified in the --grader option.

  4. Human evaluation: the web UI facilitates thumbs-up/thumbs-down ratings. You can aggregate these ratings and pick the "best" prompt accordingly.

In short, keyword matching and exact overlap should be handled by case 1. Semantic similarity testing is a great suggestion. I'll look into this and see if I can get it added :)

See also: https://www.promptfoo.dev/docs/configuration/expected-outputs
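To make the eval: mechanism concrete, here is a minimal sketch of how such an assertion can be checked: the expected value is treated as a JavaScript expression with the model output in scope, and it must return a pass/fail boolean. The function name runEvalAssertion and the use of new Function are illustrative assumptions, not promptfoo's actual internals.

    // Illustrative sketch only; promptfoo's real implementation may differ.
    // The assertion string after `eval:` is evaluated as a JS expression with
    // the model output bound to the `output` variable.
    function runEvalAssertion(expression: string, output: string): boolean {
      const fn = new Function('output', `return (${expression});`) as (output: string) => unknown;
      return Boolean(fn(output));
    }

    // Example using the directive from the comment above:
    console.log(runEvalAssertion("output.includes('foo')", 'foo bar'));   // true
    console.log(runEvalAssertion("output.includes('foo')", 'no match'));  // false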

@typpo
Collaborator

typpo commented May 22, 2023

Support for semantic similarity is added in #7. When it lands, I'll deploy a new version of the library, 0.5.0.

It works like this:

Semantic similarity: using the similar prefix, you can compare the semantic similarity of the expected value vs. the output using OpenAI embeddings.

For example, the directive similar(0.8): hello world will test that the cosine similarity between the embeddings of the expected text and the test output is >= 0.8.
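For illustration, here is a minimal sketch of that check, assuming the expected text and the model output have already been turned into embedding vectors (e.g. via the OpenAI embeddings API). The vectors and the threshold below are placeholders; promptfoo's actual implementation may differ.

    // Cosine similarity between two embedding vectors.
    function cosineSimilarity(a: number[], b: number[]): number {
      let dot = 0, normA = 0, normB = 0;
      for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
      }
      return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // A similar(0.8) directive would then pass when similarity >= 0.8.
    const expectedEmbedding = [0.12, 0.98, 0.05]; // placeholder values for illustration
    const outputEmbedding = [0.10, 0.95, 0.07];   // placeholder values for illustration
    console.log(cosineSimilarity(expectedEmbedding, outputEmbedding) >= 0.8);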

Hope this helps!

@MentalGear
Contributor Author

MentalGear commented May 22, 2023

Thank you for the super-swift reply, @typpo, and for planning #7!

With semantic similarity added, promptfoo should be among the very best open-source prompt-testing frameworks, if not the best! (It's even on par with what commercial platforms like Vellum offer for testing.)

Just one more quick suggestion: you might want to consolidate your documentation into one place. It's excellent at https://www.promptfoo.dev/docs/intro, which is also where, after a bit of digging, I found the evaluation methods you mentioned. But if you also keep a version with different content in the readme.md, it can be confusing, as there's no single source of truth (SSOT).

I would suggest keeping the intro and promo GIFs along with the icon grid in the readme, and adding a prominent link directly to the documentation at https://www.promptfoo.dev/docs/intro. :)

@MentalGear
Contributor Author

MentalGear commented May 22, 2023

Thank you, and darn, that was quick! I just finished writing a blog post about open-source prompt-testing frameworks and already had to update it. 😅

@typpo
Collaborator

typpo commented May 23, 2023

Thanks for the suggestions, @MentalGear! I've simplified the GitHub readme and pointed users toward the docs website, which is definitely easier to navigate.

typpo closed this as completed May 23, 2023