v0.3.0

matee8 released this 19 Nov 00:29

c065391

Added

Full Reinforcement Learning pipeline for training custom backtracking policies (backtracking-llm-train-rl).
RlPolicyOperator for executing trained PPO policies during inference.
Gymnasium environment with LLM-as-a-Judge scoring and intermediate reward shaping.
GenerationSession class for fine-grained, stateful control over the generation loop.
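To illustrate what "fine-grained, stateful control over the generation loop" can mean in practice, here is a minimal sketch of a step-wise session that supports backtracking. It is not the library's GenerationSession: the class name ToySession, its methods step and backtrack, and the next_token callback are all hypothetical and exist only for this example.

```python
from typing import Callable, List


class ToySession:
    """Hypothetical illustration only; not the library's GenerationSession."""

    def __init__(self, next_token: Callable[[List[int]], int], prompt: List[int]):
        # next_token stands in for a model call: it maps the current
        # token sequence to the next token id (an assumption of this sketch).
        self._next_token = next_token
        self._prompt_len = len(prompt)
        self.tokens = list(prompt)

    def step(self) -> int:
        """Generate one token, append it to the state, and return it."""
        tok = self._next_token(self.tokens)
        self.tokens.append(tok)
        return tok

    def backtrack(self, n: int) -> int:
        """Remove up to n generated tokens; prompt tokens are never removed.

        Returns the number of tokens actually removed.
        """
        removable = len(self.tokens) - self._prompt_len
        n = max(0, min(n, removable))
        if n:
            del self.tokens[-n:]
        return n


# Usage: a dummy "model" that emits the current sequence length as the next token.
session = ToySession(lambda toks: len(toks), prompt=[0, 1])
session.step()
session.step()
removed = session.backtrack(5)  # asks for 5, but only 2 generated tokens exist
```

Because the caller drives each step, a backtracking operator or a trained policy could decide between calls whether to continue or rewind, which is the kind of control a stateful session makes possible.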

Fixed

Fixed critical state leakage in CLI chat sessions where operators retained history between turns.
Fixed "amnesia" bug in NGramOverlap where backtracking erased too much history.
Fixed greedy decoding to be truly deterministic when temperature is set to 0.
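For context on the greedy decoding fix: dividing logits by a temperature of 0 is undefined, so deterministic behavior at temperature 0 is usually obtained by skipping sampling and taking the argmax. The sketch below shows that general technique in plain Python. It is an assumption about the approach, not the project's actual code, and the function name pick_token is hypothetical.

```python
import math
import random
from typing import List, Optional


def pick_token(
    logits: List[float],
    temperature: float,
    rng: Optional[random.Random] = None,
) -> int:
    """Return a token index; hypothetical helper, not part of the library."""
    if temperature <= 0.0:
        # Greedy path: no randomness involved. max() returns the first
        # maximal element, so ties resolve to the lowest index.
        return max(range(len(logits)), key=lambda i: logits[i])

    rng = rng or random.Random()
    # Temperature-scaled softmax weights, shifted by the max for stability.
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    weights = [math.exp(s - top) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


# Usage: repeated calls at temperature 0 always pick the same index.
choices = {pick_token([0.1, 2.0, 2.0, -1.0], temperature=0.0) for _ in range(100)}
```

Branching on the temperature before any sampling call means the result at temperature 0 cannot depend on the random seed.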
