
Releases: ianarawjo/ChainForge

v0.2.5: Chat Turns, LLM Scorers

26 Jul 16:21
f7fab1d

We're excited to release two new nodes: Chat Turns and LLM Scorers. These nodes came from feedback during user sessions:

  • Some users wanted to first tell chat models 'how to act', and then wanted to put their real prompt in the second turn.
  • Some users wanted a quicker, cheaper way to 'evaluate' responses and visualize results.

We describe these new nodes below, as well as a few quality-of-life improvements.

🗣️ Chat Turn nodes

Chat models are all the rage (in fact, they are so important that OpenAI announced it would no longer support plain-old text generation models going forward). Yet strikingly, very few prompt engineering tools let you evaluate LLM outputs beyond a single prompt.

Now with Chat Turn nodes, you can continue conversations beyond a single prompt. In fact, you can:

Continue multiple conversations simultaneously across multiple LLMs

Just connect the Chat Turn to your initial Prompt Node, and voilà:

Screen Shot 2023-07-25 at 6 39 45 PM

Here, I've first prompted four chat models (GPT-3.5, GPT-4, Claude-2, and PaLM) with the question "What was the first {game} game?". Then I ask a follow-up question, "What was the second?" By default, Chat Turns continue the conversation with all LLMs that were used before, allowing you to follow up on LLM responses in parallel. (You can also toggle that off if you want to query different models -- more details below.)

Template chat messages, just like prompts

You can do everything you can with Chat Turns that you could with Prompt Nodes, including prompt templating and adding input variables. For instance, here's a prompt template as a follow-up message:

Screen Shot 2023-07-25 at 1 22 15 PM

Note
In fact, Chat Turns are merely modified Prompt Nodes, and use the underlying PromptNode class.

Start a conversation with one LLM, and continue it with a different LLM

Chat Turns include a toggle for whether you'd like to continue chatting with the same LLMs or query different ones, passing the chat context to the new models. With this, you can start a conversation with one LLM and continue it with another (or several):

Screen Shot 2023-07-25 at 12 46 52 PM

Supported chat models

Simple in concept, chat turns were the result of 2 weeks' work, revising many parts of the ChainForge backend to store and carry chat context. Chat history is automatically translated to the appropriate format for a number of providers:

  • OpenAI chat models
  • Anthropic models (Claude)
  • Google PaLM2 chat
  • HuggingFace (you need to set 'Model Type' in Settings to 'chat', and choose a Conversation model or custom endpoint. Currently there's only one chat model listed in the ChainForge dropdown, microsoft/DialoGPT. Go to the HuggingFace site to find more!)

Warning
If you use a non-chat, text completions model like GPT-2, chat turns will still function, but the chat context won't be passed into the text completions model.
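To give a concrete sense of what that translation involves, here is a rough sketch in Python. This is illustrative only -- not ChainForge's internal code -- and the helper names are hypothetical; it just shows how the same neutral chat history maps onto OpenAI's messages format versus Anthropic's 2023-era Human/Assistant prompt format.

# Illustrative only: hypothetical helpers showing the kind of per-provider
# translation of chat history that a Chat Turn performs behind the scenes.
from typing import List, Tuple

ChatHistory = List[Tuple[str, str]]  # (speaker, text), with speaker "user" or "assistant"

def to_openai_messages(history: ChatHistory, next_prompt: str) -> list:
    # OpenAI chat models take a list of {"role", "content"} dicts.
    msgs = [{"role": speaker, "content": text} for speaker, text in history]
    msgs.append({"role": "user", "content": next_prompt})
    return msgs

def to_anthropic_prompt(history: ChatHistory, next_prompt: str) -> str:
    # Anthropic's (2023) completions API expects alternating "\n\nHuman:" /
    # "\n\nAssistant:" turns, ending with an open Assistant turn.
    tag = {"user": "Human", "assistant": "Assistant"}
    parts = [f"\n\n{tag[speaker]}: {text}" for speaker, text in history]
    parts.append(f"\n\nHuman: {next_prompt}\n\nAssistant:")
    return "".join(parts)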

Let us know what you think!

🤖 LLM Scorer nodes

More commonly called "LLM evaluators", LLM scorer nodes allow you to use an LLM to 'grade'/score outputs of other LLMs:

Screen Shot 2023-07-25 at 6 44 01 PM

Although ChainForge supported this functionality before via prompt chaining, it was not straightforward and required an additional chain to a code evaluator node for postprocessing. You can now connect the output of the scorer directly to a Vis Node to plot outputs. For instance, here's GPT-4 scoring whether different LLM responses apologized for a mistake:

Screen Shot 2023-07-25 at 12 31 52 PM

Note that LLM scores are finicky -- if one score isn't in the right format (true/false), visualization nodes won't work properly, because they'll think the outputs are not of boolean type but categorical. We'll work on improving this, but, for now, enjoy LLM scorers!
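If you do hit that formatting hiccup, one workaround in the meantime (a minimal sketch, not a built-in feature) is to chain the scorer into a Python Evaluator node and coerce its text into a strict boolean before visualizing:

def evaluate(response):
  # Coerce the LLM scorer's free-text verdict into a strict boolean, so that
  # Vis Nodes see a true/false metric rather than a categorical one.
  verdict = response.text.strip().lower()
  return verdict.startswith("true") or verdict.startswith("yes")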

❗ Why we're not calling LLM scorers 'LLM evaluators'

We thought long and hard about what to call LLMs that score outputs of other LLMs. Ultimately, using LLMs to score outputs is helpful, and can save time when it's hard to write code to achieve the same effect. However, LLMs are imperfect. Although the AI community currently uses the term 'LLM evaluator,' we ultimately decided not to use that term, for a few reasons:

  1. LLM scores should not be blindly trusted. They are helpful if you already have a sense of what you're looking for, want to grade hundreds of responses, and don't need picture-perfect accuracy. After playing with LLM scorer nodes for a while, we found that small tweaks to the scoring prompt can result in vast differences in results.
  2. 'Evaluator,' like 'grader' or 'annotator,' is a term with human connotations (i.e., a human evaluator). We want to avoid anthropomorphizing LLMs, which contributes to people's over-trust in them. 'Scorer' still has human connotations, but arguably fewer, and less authoritative ones, than 'evaluator'.
  3. 'Evaluator' is also a term in ChainForge that refers to programs that score responses. Calling LLM scorers 'evaluators' loosely equates them with programmatic evaluators, suggesting they carry the same authority. Although code can be wrong, the scoring process for code is inspectable and auditable -- not so with LLMs.

Fundamentally, then, we disagree with the positions taken by projects like LangChain, which tend to emphasize LLM scorers as the go-to solution for evaluation. We believe this is a massive mistake that misleads people and causes them to over-trust AI outputs, including ML researchers at MIT. In choosing the term Scorers, we aim to -- at the very least -- distance ourselves from such positions.

Other changes

  • Inspecting true/false scored responses (in Evaluators or LLM scorers) will now show false in red, to easily eyeball failure cases:
Screen Shot 2023-07-25 at 6 33 00 PM
  • In Response Inspectors, the term "Hierarchy" has been replaced with "Grouped List". Grouped Lists are again the default.
  • In the table view of the response inspector, you can now choose which variable to use for columns. With this, you can compare across prompt templates or indeed anything of interest:
Screen Shot 2023-07-25 at 6 48 45 PM

Future Work

Chat Turns opened up a whole new can of worms, both for the UI, and for evaluation. Some open questions are:

  • How can we display Chat History in response inspectors? Right now, you'll only see the latest response from the LLM. There's more design work to do such that you can view the chat context of specific responses.
  • Should there be a Chat History node so you can predefine/preset chat histories to test on, without needing to query an LLM?

We hope to prioritize such features based on user feedback. If you use Chat Turns or LLM Scorers, let us know what you think -- open an Issue or start a Discussion! 👍

v0.2.1.2: Table view, Response Inspectors keep state

19 Jul 21:35
0388329

There are two minor but important quality-of-life improvements in this release.

Table view

Now in response inspectors, you can elect to see a table, rather than a hierarchical grouping of prompt variables:

Screen Shot 2023-07-19 at 5 03 55 PM

Columns are prompt variables, followed by LLMs. We might add the ability to change columns in the future, if there's interest.

Persistent state in response inspectors

Response inspectors' state will, to an extent, persist across runs. For instance, say you were inspecting a specific response grouping:

Screen Shot 2023-07-19 at 5 04 21 PM

Imagine you now close the inspector window, delete one of the models and then increase num generations per prompt to 2. You will now see:

Screen Shot 2023-07-19 at 5 04 41 PM

Right where you left off, with the updated responses. The inspector also keeps track of whether you've selected Table view, retaining the view you last selected.

Specify hostname and port (v0.2.1.3)

I've added --host and --port flags for when you're running ChainForge locally. You can specify the hostname and port to run it on like so:

chainforge serve --host 0.0.0.0 --port 3400

The front-end app also knows you're running it from Flask (locally), regardless of the hostname and port.

v0.2.1: Prompt previews, Toggleable prompt variables, Anthropic Claude2

12 Jul 21:33
3657609

We've made several quality-of-life improvements from 0.2 to this release.

Prompt previews

You can now inspect what generated prompts will be sent off to LLMs. For a quick glance, simply hover over the 'list' icon on Prompt Nodes:

hover-over-prompt-preview

For full inspection, just click the button to bring up a popup inspector.

Thanks to Issue #90 raised by @profplum700 !

Ability To Enable/Disable Prompt Variables in Text Fields Without Deleting Them

You can now enable/disable prompt variables selectively:

selective-field-visibility.mov

Thanks to Issue #93 raised by @profplum700 !

Anthropic model Claude-2

We've also added the newest Claude model, Claude-2. All prior models remain supported; however, strangely, Claude-1 and 100k context models have disappeared from the Anthropic API documentation. So, if you are using earlier Claude models, just know that they may stop working at some future point.

Bug fixes

There have also been numerous bug fixes, including:

  • braces { and } inside Tabular Data tables are now escaped by default when data is pulled from the nodes, so that they are never treated as prompt templates
  • escaping template braces { and } now removes the escape slash when generating prompts for models (a rough sketch of this escape/unescape round-trip follows this list)
  • outputs of Prompt Nodes, when chained into other Prompt Nodes, now escape the braces in LLM responses by default. Note that whenever prompts are generated, the escaped braces are cleaned up to just { and }. In response inspectors, input variables will appear with escaped braces, as input variables in ChainForge may themselves be templates.
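For a rough sense of the escaping behavior described above (a sketch only, not the actual implementation), the round-trip amounts to:

def escape_braces(text):
  # Add an escape slash so braces in data or LLM outputs are never
  # treated as template variables by downstream nodes.
  return text.replace("{", "\\{").replace("}", "\\}")

def unescape_braces(text):
  # Strip the escape slashes right before the final prompt is generated.
  return text.replace("\\{", "{").replace("\\}", "}")

# Round-trip: the model ultimately sees plain { and }.
assert unescape_braces(escape_braces('JSON like {"a": 1} stays literal')) == 'JSON like {"a": 1} stays literal'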

Future Goals

We've been running pilot studies internally at Harvard HCI and getting some informal feedback.

  • One point that keeps coming up echoes Issue #56 , raised by @jjordanbaird : the ability to keep chat context and evaluate multiple chatbot turns. We are thinking of implementing this as a Chat Turn Node where, optionally, one can provide "past conversation" context as input. The overall structure will be similar to Prompt Nodes, except that only chat models will be available. See #56 for more details.
  • Another issue we're aware of is the need for better documentation on what you can do with ChainForge, particularly on the rather unique feature of chaining prompt templates together.

As always, if you have any feedback or comments, open an Issue or start a Discussion.

v0.2: App logic now runs in browser, HuggingFace Models, JavaScript Evaluators, Comment Nodes

30 Jun 19:15

Note
This release includes a breaking change regarding caching responses. If you are working on a current flow, export your ChainForge flow to a cforge file before installing the new version.

We're closer than ever to hosting ChainForge on chainforge.ai, so that no installation is required to try it out. Latest changes below.

The entire backend has been rewritten in TypeScript 🥷🧑‍💻️

Thousands of lines of Python code, comprising nearly the entire backend, have been rewritten in TypeScript. The mechanism for generating prompt permutations, querying LLMs, and caching responses is now performed in the front-end (entirely in the browser). Tests were added in Jest to ensure the TypeScript functions produce the same outputs as their original Python versions. There are additional performance and maintainability benefits to adding static type checking. We've also added ample docstrings, which should help devs looking to get involved.

Functionally, you should not experience any difference (except maybe a slight speed boost).

Javascript Evaluator Nodes 🧩

Because the application logic has moved to the browser, we added JavaScript evaluator nodes. These let you write evaluation functions in JavaScript; they function the same as Python evaluators.

Here is a side-by-side comparison of JavaScript and Python evaluator nodes, showing semantically equivalent code and the in-node support for displaying console.log and print output:

Screen Shot 2023-06-30 at 12 08 27 PM

When you are running ChainForge on localhost, you can still use Python evaluator nodes, which will execute on your local Flask server (the Python backend) as before. JavaScript evaluators run entirely in the browser (specifically, eval'd in a sandboxed iframe).

HuggingFace Models 🤗

We added support for querying text generation models hosted on the HuggingFace Inference API. For instance, here is falcon.7b.instruct, an open-source model:

Screen Shot 2023-06-30 at 2 15 46 PM

For HF models, there is a 250-token limit. This can sometimes be rather limiting, so we've added a "number of continuations" setting to help with that. You can set it to > 0 to feed the response back into the API (for text completion models), which will generate longer completions of up to 1500 tokens.

We also support HF Inference Endpoints for text generation models. Simply put the API call URL in the custom_model field of the settings window.

Comment Nodes ✏️

You can write comments about your evaluation using a comment node:

Screen Shot 2023-06-30 at 2 18 03 PM

'Browser unsupported' error 💢

If you load ChainForge on a mobile device or unsupported browser, it will now display an error message:

Screen Shot 2023-06-30 at 2 28 32 PM

This helps for our public release. If you'd like ChainForge to support more browsers, open an Issue or (better yet) make a Pull Request.

Fun example

Finally, I wanted to share a fun practical example: an evaluation to check if the LLM reveals a secret key. This evaluation, including all API calls and JavaScript evaluation code, was run entirely in the browser:

Screen Shot 2023-06-30 at 2 47 39 PM

Questions, concerns?

Open an Issue or start a Discussion!

This was a major, serious change to ChainForge. Although we've written tests, it's possible we have missed something, and there's a bug somewhere. Note that unfortunately, Azure OpenAI 🔷 support is again untested following the rewrite, as we don’t have access to it. Someone in the community, let me know if it works for you! (Also, if you work at Microsoft and can give us access, let us know!)

A browser-based, hosted version of ChainForge will be publicly available July 5th (next Wednesday) on chainforge.ai 🌍🎉

v0.1.7.2: Autosaving + on the way to nicer plots

23 Jun 13:38

This minor release includes two features:

Autosaving

Now, ChainForge autosaves your work to localStorage every 60 seconds.

This helps tremendously in case you accidentally close the window without exporting the flow, your system crashes, or you encounter a bug.

To create a new flow now, just click the New Flow button to get a new canvas.

Plots now have clear y-axis, x-axis, and group-by selectors on Vis Nodes

We've added a header bar to the Vis Node, clarifying what is plotted on each axis / dimension:

Screen Shot 2023-06-23 at 9 32 58 AM

In addition, as you can see above, y-axis labels can now span up to two lines (~40 characters), making them easier to read.

Finally, when num of generations per prompt is 1, we now output bar charts by default:

Screen Shot 2023-06-23 at 9 35 06 AM

Box-and-whiskers plots are still used whenever num generations n > 1.

Note that improving the Vis Nodes is a work-in-progress, and functionally, everything is the same as before.

v0.1.7: UI improvements to response inspector

21 Jun 14:25
ea3d730

We've made a number of improvements to the inspector UI and beyond.

Side-by-side comparison across LLM responses

Responses now appear side-by-side for up to five LLMs queried:

Screen Shot 2023-06-21 at 9 27 45 AM

Collapsible response groups

You can also collapse LLM responses grouped by their prompt template variable, for easier selective inspection. Just click on a response group header to show/hide:

collapsable-groups.mov

Accuracy plots by default

Boolean (true/false) evaluation metrics now use accuracy plots by default. For instance, for ChainForge's prompt injection example:

Screen Shot 2023-06-21 at 9 27 58 AM

This makes it extremely easy to see differences across models for the specified evaluation. Stacked bar charts are still used when a prompt variable is selected. For instance, here is a plot of a meta-variable, 'Domain', across two LLMs, testing whether or not the code outputs had an import statement (another new feature):

Screen Shot 2023-06-21 at 10 22 51 AM

Added 'Inspect results' footer to both Prompt and Eval nodes

The tiny response-preview footer in the Prompt Node has been changed to an 'Inspect responses' button that brings up a fullscreen response inspector. In addition, evaluation results can be easily inspected by clicking 'Inspect results':

Screen Shot 2023-06-21 at 10 12 34 AM

Evaluation scores appear in bold at the top of each response block:

Screen Shot 2023-06-21 at 10 13 54 AM

In addition, both Prompt and Eval nodes now load cached results upon initialization. Simply load an example flow and click the respective Inspect button.

Added asMarkdownAST to response object in Evaluator node

Given how often developers wish to parse markdown, we've added a function asMarkdownAST() to the ResponseInfo class that uses the mistune library to parse markdown as an abstract syntax tree (AST).

For instance, here's code which detects if an 'import' statement appeared anywhere in the codeblocks of a chat response:

Screen Shot 2023-06-21 at 10 19 51 AM
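For readers who can't make out the screenshot, here is a rough sketch of that kind of evaluator. It assumes mistune's AST token format -- a list of dicts where code blocks have type 'block_code' and carry their source under a 'text' or 'raw' key -- so print() the AST first if your keys differ:

def evaluate(response):
  # Walk the markdown AST and check whether any code block contains an import.
  for node in response.asMarkdownAST():
    if node.get('type') == 'block_code':
      code = node.get('text') or node.get('raw') or ''
      if 'import' in code:
        return True
  return False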

v0.1.6: OpenAI evals

15 Jun 20:16

Added 188 OpenAI Evals to Example Flows

We've added 188 example flows generated directly from OpenAI evals benchmarks.
In Example Flows, navigate to the "OpenAI Evals" tab, and click the benchmark you wish to load:

Screen.Recording.2023-06-15.at.3.49.32.PM.mov

The code in each Evaluator is the appropriate code for each evaluation, as referenced from the OpenAI eval-templates doc.

Example: Tetris problems

For example, I was able to compare GPT-4's performance on tetris problems with GPT3.5, simply by loading the eval, adding GPT-4, and pressing run:

Screen Shot 2023-06-15 at 4 10 36 PM

I was curious whether the custom system message had any effect on GPT3.5's performance, so I added a version without it, and in 5 seconds found out that the system message had no effect:

Screen Shot 2023-06-15 at 4 13 38 PM

Supported OpenAI evals

A large subset of OpenAI evals are supported. We currently display OpenAI evals with:

  • a common system message
  • a single 'turn' (prompt)
  • evaluation types of 'includes', 'match', and 'fuzzy match',
  • and a reasonable number of prompts (e.g., spanish-lexicon, which is not included, has 53,000 prompts)

We hope to add those with model evaluations (e.g., Chain-of-thought prompting) in the near future.

The cforge flows were precompiled from the OpenAI evals registry. To save space, the files are not included in the PyPI chainforge package, but rather fetched from GitHub on an as-needed basis. We precompiled the evals to avoid forcing users to install OpenAI evals, which requires Git LFS, Python 3.9+, and a large number of dependencies.

Note, finally, that responses are not cached for these flows, unlike the other examples -- you will need to query OpenAI models yourself to run them.


Minor Notes

This release also:

  • Changed Textareas to contenteditable p tags inside Tabular Data Nodes. Though this compromises usability slightly, there is a huge gain in performance when loading large tables (e.g., 1000 rows or more), which is required for some OpenAI evals in the examples package.
  • Fixed a bug in VisNode where a plot was not displaying when a single LLM was present, the number of prompt variables >= 1, and no variables were selected

If you run into any problems using OpenAI evals examples, or with any other part of CF, please let us know.

We could not manually test all of the new example flows, due to how many API calls would be required. Happy ChainForging!

v0.1.5.3: OpenAI Function Calls, Azure Support

13 Jun 22:10
655e1e6

This is an emergency release to add basic support for the new OpenAI models and 'function call' ability. It also includes support for Azure OpenAI endpoints, closing Issue #53 .

OpenAI function calls

You can now specify the newest ChatGPT models, the 0613 versions:

Screen Shot 2023-06-13 at 5 36 50 PM

In addition, you can set the value of functions by passing a valid JSON schema object. This will be passed to the functions parameter of the OpenAI chat completions call:

Screen Shot 2023-06-13 at 5 36 45 PM

I've created a basic example flow to detect when a given prompt triggers a function call, using OpenAI's get_current_weather example in their press release:

Screen Shot 2023-06-13 at 5 39 50 PM
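For reference, OpenAI's published get_current_weather spec looks roughly like the following (shown here as a Python list of dicts; the functions field takes a JSON list of such objects, so double-check the exact schema against OpenAI's docs):

functions = [
  {
    "name": "get_current_weather",
    "description": "Get the current weather in a given location",
    "parameters": {
      "type": "object",
      "properties": {
        "location": {
          "type": "string",
          "description": "The city and state, e.g. San Francisco, CA"
        },
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
      },
      "required": ["location"]
    }
  }
]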

In the coming weeks, we will think about making this user experience more streamlined, but for now, enjoy being able to mess around!

Azure OpenAI API support

Thanks to community members @chuanqisun , @bhctest123 , and @levidehaan , we now have added Azure OpenAI support:

245616817-23e0fcb3-5cee-4d76-8eeb-eb83f5b5fabc

To use Azure OpenAI, you just need to set your keys in ChainForge Settings:

Screen Shot 2023-06-13 at 5 57 24 PM

And then make sure you set the right Deployment Name in the individual model settings. The settings also include OpenAI function calls (we're not sure if you can deploy 0613 models on Azure yet, but the option is there).

As always, let us know if you run into any issues.

Collapsing duplicate responses

As part of this release, duplicate LLM responses (when the number of generations n > 1) are now detected and automatically collapsed in Inspectors. The number of duplicates is indicated in the top-right corner:

Screen Shot 2023-06-13 at 12 03 54 PM

v0.1.5: Tabular Data Node, Evaluation Output

11 Jun 16:59
3bfee02

We've added Tabular Data to ChainForge, to help conduct ground truth evaluations. Full release notes below.

Tabular Data Nodes 🗂️

You can now input and import tabular data (spreadsheets) into ChainForge. Accepted formats are jsonl, xlsx, and csv. Excel and CSV files must have a header row with column names.

Tabular data provides an easy way to enter associated prompt parameters or import existing datasets and benchmarks. A typical use case is ground truth evaluation, where we have some inputs to a prompt, and an "ideal" or expected answer:

Screen Shot 2023-06-10 at 2 23 13 PM

Here, we see variables {first}, {last}, and {invention} "carry together" when filling the prompt template: ChainForge knows they are all associated with one another, connected via the row. Thus, it constructs 4 prompts from the input parameters.
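As a hypothetical illustration (the column names mirror the screenshot's variables, but these rows and the prompt template are made up), a table like the following, combined with a template such as "Did {first} {last} really invent {invention}?", produces exactly one prompt per row, with all three values drawn from the same row:

first,last,invention
Tim,Berners-Lee,the World Wide Web
Alexander,Graham Bell,the telephone
Johannes,Gutenberg,the printing press
Thomas,Edison,the phonograph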

Accessing tabular data, even if it's not input into the prompt directly

Alongside tabular data comes a new property of response objects in Evaluation nodes: the meta dict. This lets you access column data that is associated with the inputs to a prompt template but was not itself directly input into the template. For instance, in the new example flow for ground truth evaluation of math problems:

Screen Shot 2023-06-11 at 11 51 28 AM

Notice the evaluator uses meta to get "Expected", which is associated with the prompt input variable question by virtue of it being on the same row of the table.

def evaluate(response):
  # Compare the first four characters of the LLM's response to the
  # 'Expected' column from the same table row, accessed via the meta dict.
  return response.text[:4] == \
         response.meta['Expected']

Example flows

Tabular data allows us to run many more types of LLM evaluations. For instance, here is the ground truth evaluation multistep-word-problems from OpenAI evals, loaded into ChainForge:

Screen Shot 2023-06-10 at 9 08 05 PM

We've added an Example Flow for ground truth evaluation that provides a good starting point.


Evaluation Node output 📟

Curious what the format of a response object is like? You can now use print inside evaluate functions to write output directly to the browser:

Screen Shot 2023-06-10 at 8 26 48 PM
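For instance, a minimal sketch (using only the response fields shown elsewhere in these notes):

def evaluate(response):
  # Anything print()ed here appears in the node's output area in the browser,
  # which is handy for inspecting the shape of the response object.
  print(response.text)
  print(response.meta)
  return len(response.text)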

In addition, exceptions raised inside your evaluation function will also print to the node output:

Screen Shot 2023-06-10 at 8 29 38 PM

Slight styling improvements in Response Inspectors

We removed the blue Badges used to display unselected prompt variables and replaced them with text that blends into the background:

Screen Shot 2023-06-11 at 12 52 51 PM

The fullscreen inspector also uses a slightly larger font size for readability:

Screen Shot 2023-06-11 at 12 51 49 PM

Final thoughts / comments

  • Tabular Data was a major feature, as it enables many types of LLM evaluation. Our goal now is to illustrate what people can currently do in ChainForge through better documentation and connecting to existing datasets (e.g. OpenAI evals). We also will focus on quality-of-life improvements to the UI and adding more models/extensibility.
  • We know there is a minor layout issue with the table not autosizing to best fit the width of cell content. This happens because some browsers do not appear to autofit column widths properly when a <textarea> is an element of a table cell. We are working on a fix so columns are automatically sized based on their content.

Want to see a feature / have a comment? Start a Discussion or submit an Issue!

v0.1.4: Failure Progress, Inspect popup, Firefox support

08 Jun 00:55
000b612

This release includes the following features:

Selective Failure on API requests ♨️

ChainForge now has selective failure on PromptNodes: API calls that fail no longer stop the remaining requests, but instead collect as red error bars within the progress bars:

progress-errors.mov

An error message will display all errors once all API requests return (whether successfully or with errors). This saves $$ and time. (As always, ChainForge caches responses the moment it receives them, so you don't need to worry about re-running prompt nodes re-calling APIs.)

Inspector Pop-up 🔍

In addition, we've added an Inspector pop-up which you can access by clicking the response preview box on a PromptNode:

popup-inspector.mov

This makes it much easier to inspect responses without needing to attach a dedicated Inspect Node. We're going to build this out (and add it to the EvaluatorNode) soon, but for now I hope you find this feature useful.

LLM Color Consistency 🌈

Now, each LLM you create has a dedicated color that remains consistent across VisNode plots and Inspector responses.

Firefox Support 🦊

Due to demand for more browsers, we've added support for Firefox. This involved a minor change to how model settings forms work.
As well, other browsers should now work too (though the formatting isn't exactly right), as we removed a dependency on regex lookaheads/lookbehinds that was preventing some browsers, like Safari, from loading the app at all.

Website

As an aside, we've created a website at chainforge.ai. It's not much yet, but it's a start. We will add tutorials in the near future for new users.

Upcoming features

Major priorities right now are:

  • Tabular data nodes: Load tabular data and reference columns in EvaluatorNode code
  • Ground truth example flows: An example flow that evaluates responses against a 'ground truth' which differs per prompt parameter value
  • Azure support: Yes, we heard you! :) I am hoping to get this very soon.