
Evaluation Methods: Similarity Check ? #6

Closed
MentalGear opened this issue May 22, 2023 · 6 comments

@MentalGear
Contributor

MentalGear commented May 22, 2023

First off: thank you for providing a Node FOSS prompt-testing framework! Also, the web view is really handy!

However, when it comes to explaining how evaluation is done, the documentation is lacking in detail: how exactly are outputs scored? Simply by keyword matching or exact overlap, or are more advanced measures, such as distance-based similarity on embeddings, built in?

An excellent library to draw inspiration from that does semantic similarity testing (Python) is squidgy-testy.

EDIT: Corrected Link.

@MentalGear
Contributor Author

MentalGear commented May 22, 2023

Correction: link is https://github.com/squidgyai/squidgy-testy

@typpo
Collaborator

typpo commented May 22, 2023

I appreciate the feedback! Right now, evaluation is done in one of four ways:

  1. Direct string comparison: this is the default behavior for anything you put in the __expected column.

  2. Basic JavaScript logic: using the eval: prefix, you can run string checks and keyword matches on the output. For example:

    eval: output.includes('foo')
    

    The test runner expects a piece of JavaScript code that returns a pass/fail boolean.

  3. Self-grading with LLM: using the grade: prefix, you can ask an LLM to evaluate the output against your criteria. For example:

    grade: output contains a reference to a movie
    

    The test runner uses the provider specified in the --grader option.

  4. Human evaluation: the web UI facilitates thumbs-up/thumbs-down ratings. You can aggregate these ratings and pick the "best" prompt accordingly.

In short, keyword matching and exact overlap should be handled by case 1. Semantic similarity testing is a great suggestion. I'll look into this and see if I can get it added :)

See also: https://www.promptfoo.dev/docs/configuration/expected-outputs
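To make the eval: mechanism concrete, here is a minimal sketch of how such an assertion can be checked: the expected value is treated as a JavaScript expression with the model output in scope, and it must return a pass/fail boolean. The function name runEvalAssertion and the use of new Function are illustrative assumptions, not promptfoo's actual internals.

    // Illustrative sketch only; promptfoo's real implementation may differ.
    // The assertion string after `eval:` is evaluated as a JS expression with
    // the model output bound to the `output` variable.
    function runEvalAssertion(expression: string, output: string): boolean {
      const fn = new Function('output', `return (${expression});`) as (output: string) => unknown;
      return Boolean(fn(output));
    }

    // Example using the directive from the comment above:
    console.log(runEvalAssertion("output.includes('foo')", 'foo bar'));   // true
    console.log(runEvalAssertion("output.includes('foo')", 'no match'));  // false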

@typpo
Collaborator

typpo commented May 22, 2023

Support for semantic similarity is added in #7. When it lands, I'll deploy a new version of the library, 0.5.0.

It works like this:

Semantic similarity: using the similar prefix, you can compare the semantic similarity of the expected value vs. the output using OpenAI embeddings.

For example, the directive similar(0.8): hello world will test that the cosine similarity between the embeddings of the expected text and the test output is >= 0.8.
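For illustration, here is a minimal sketch of that check, assuming the expected text and the model output have already been turned into embedding vectors (e.g. via the OpenAI embeddings API). The vectors and the threshold below are placeholders; promptfoo's actual implementation may differ.

    // Cosine similarity between two embedding vectors.
    function cosineSimilarity(a: number[], b: number[]): number {
      let dot = 0, normA = 0, normB = 0;
      for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
      }
      return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // A similar(0.8) directive would then pass when similarity >= 0.8.
    const expectedEmbedding = [0.12, 0.98, 0.05]; // placeholder values for illustration
    const outputEmbedding = [0.10, 0.95, 0.07];   // placeholder values for illustration
    console.log(cosineSimilarity(expectedEmbedding, outputEmbedding) >= 0.8);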

Hope this helps!

@MentalGear
Contributor Author

MentalGear commented May 22, 2023

Thank you for the super-swift reply, @typpo, and for planning #7!

With semantic similarity added, promptfoo should be among the very best open-source prompt-testing frameworks, if not the best! (It's even on par with what commercial platforms like Vellum offer for testing.)

Just one more quick suggestion: you might want to consolidate your documentation into one place. It's excellent at https://www.promptfoo.dev/docs/intro, which is also where, after a bit of digging, I found the evaluation methods you mentioned. But if you also keep a version with different content in the readme.md, it can be confusing, as there's no single source of truth (SSOT).

I would suggest keeping the intro and promo GIFs along with the icon grid in the readme, and adding a prominent link directly to the documentation at https://www.promptfoo.dev/docs/intro. :)

@MentalGear
Contributor Author

MentalGear commented May 22, 2023

Thank you, and darn, that was quick! I just finished writing a blog post about open-source prompt-testing frameworks and already had to update it. 😅

@typpo
Collaborator

typpo commented May 23, 2023

Thanks for the suggestions, @MentalGear! I've simplified the GitHub readme and pointed users toward the docs website, which is definitely easier to navigate.

typpo closed this as completed May 23, 2023