Hi and thank you for FastChat and Chatbot Arena.
It is one of the few places where people get a realistic sense
of how models feel in actual interaction.
I am working on something that might be relevant for a different axis
of evaluation.
It is a text-only stress-test pack called
“WFGY 3.0 · Singularity Demo” in onestardao/WFGY.
The pack consists of 131 S-class open problems
(math, physics, alignment, social systems) written as a BlackHole-style TXT.
The idea is to use it as a long-horizon “tension crash test”
for conversational agents:
models read the TXT pack as context,
then they are driven through a scripted but open-ended demo flow,
and we observe where the reasoning chain collapses, loops,
or starts to hallucinate under high conceptual tension.
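A minimal sketch of this drive loop, assuming an OpenAI-style chat-message format. The names here (load_pack, PROBES, drive) and the probe wording are illustrative placeholders, not part of WFGY or FastChat; the ask callback could wrap any chat endpoint:

```python
# Hypothetical sketch of the "tension crash test" loop described above.
# All names and probe texts are illustrative, not from the WFGY pack itself.

def load_pack(text: str, max_chars: int = 8000) -> str:
    """Truncate the TXT pack to fit the model's context budget."""
    return text[:max_chars]

# Scripted but open-ended follow-ups, applied in order (example probes only).
PROBES = [
    "Pick one problem from the pack and outline an attack.",
    "Now push the argument one step further.",
    "Where does your own reasoning start to break down?",
]

def drive(ask, pack_text: str) -> list[str]:
    """Run the scripted demo flow.

    `ask(history) -> reply` is any callable that wraps a chat model,
    e.g. a FastChat OpenAI-compatible endpoint. Returns the model's
    reply at each probe so a human (or judge model) can inspect where
    the reasoning chain collapses, loops, or hallucinates.
    """
    history = [{"role": "system", "content": load_pack(pack_text)}]
    replies = []
    for probe in PROBES:
        history.append({"role": "user", "content": probe})
        reply = ask(history)
        history.append({"role": "assistant", "content": reply})
        replies.append(reply)
    return replies
```

Because the model under test is hidden behind the ask callback, the same script can be pointed at any backend that accepts a message list, which keeps the runs reproducible across setups.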
This is very different from a normal chat arena turn,
but it feels complementary.
Arena tells you how a model feels in short-to-medium interactions.
A long-horizon tension test tells you
how fragile or stable the model is when you push it into extreme territory.
My question for you is simple:
from your point of view,
would an optional “long-horizon tension” stress-test axis
make sense for Arena-style evaluations,
or is this too far outside the scope you want to maintain?
If it does make sense,
I would be happy to write a short technical note
on how to drive the TXT pack in a reproducible way,
so that others can plug it into FastChat-style setups.
Either way,
thank you for the work you already do to keep the community grounded
with real evaluations.