Hi and thank you for FastChat and Chatbot Arena.
It is one of the few places where people get a realistic sense
of how models feel in actual interaction.
I am working on something that might be relevant for a different axis
of evaluation.
It is a text-only stress-test pack called
“WFGY 3.0 · Singularity Demo” in onestardao/WFGY.
The pack consists of 131 S-class open problems
(math, physics, alignment, social systems) written as a BlackHole-style TXT.
The idea is to use it as a long-horizon “tension crash test”
for conversational agents:
models read the TXT pack as context,
then they are driven through a scripted but open-ended demo flow,
and we observe where the reasoning chain collapses, loops,
or starts to hallucinate under high conceptual tension.
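A minimal sketch of this drive loop, assuming an OpenAI-style chat-message format. The names here (load_pack, PROBES, drive) and the probe wording are illustrative placeholders, not part of WFGY or FastChat; the ask callback could wrap any chat endpoint:

```python
# Hypothetical sketch of the "tension crash test" loop described above.
# All names and probe texts are illustrative, not from the WFGY pack itself.

def load_pack(text: str, max_chars: int = 8000) -> str:
    """Truncate the TXT pack to fit the model's context budget."""
    return text[:max_chars]

# Scripted but open-ended follow-ups, applied in order (example probes only).
PROBES = [
    "Pick one problem from the pack and outline an attack.",
    "Now push the argument one step further.",
    "Where does your own reasoning start to break down?",
]

def drive(ask, pack_text: str) -> list[str]:
    """Run the scripted demo flow.

    `ask(history) -> reply` is any callable that wraps a chat model,
    e.g. a FastChat OpenAI-compatible endpoint. Returns the model's
    reply at each probe so a human (or judge model) can inspect where
    the reasoning chain collapses, loops, or hallucinates.
    """
    history = [{"role": "system", "content": load_pack(pack_text)}]
    replies = []
    for probe in PROBES:
        history.append({"role": "user", "content": probe})
        reply = ask(history)
        history.append({"role": "assistant", "content": reply})
        replies.append(reply)
    return replies
```

Because the model under test is hidden behind the ask callback, the same script can be pointed at any backend that accepts a message list, which keeps the runs reproducible across setups.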
This is very different from a normal chat arena turn,
but it feels complementary.
Arena tells you how a model feels in short-to-medium interactions.
A long-horizon tension test tells you
how fragile or stable the model is when you push it into extreme territory.
My question for you is simple:
from your point of view,
would an optional “long-horizon tension” stress-test axis
make sense for Arena-style evaluations,
or is this too far outside the scope you want to maintain?
If it does make sense,
I would be happy to write a short technical note
on how to drive the TXT pack in a reproducible way,
so that others can plug it into FastChat-style setups.
Either way,
thank you for the work you already do to keep the community grounded
with real evaluations.