We're using the reward/loss as our proxy for RL correctness, which is understandable. But to check that the RL loop isn't reward hacking, it would help to also log a few prompt/response/target samples (per step?) to wandb and/or the logger.
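
Something along these lines might be enough. This is only a sketch: `pick_samples`, `log_samples`, and the `rollout_samples` key are hypothetical names, and I'm assuming the prompts, responses, and targets are already decoded into parallel lists of strings with one scalar reward each. The wandb branch is optional and only runs if a run object is passed in.

```python
import json
import logging
import random

logger = logging.getLogger("rl_samples")  # hypothetical logger name


def pick_samples(prompts, responses, targets, rewards, k=4, seed=0):
    """Return up to k rollouts as dicts, chosen reproducibly for a given seed."""
    n = len(prompts)
    if not (len(responses) == len(targets) == len(rewards) == n):
        raise ValueError("prompts, responses, targets, rewards must have equal length")
    rng = random.Random(seed)
    # Sample without replacement, then sort so rows keep batch order.
    indices = sorted(rng.sample(range(n), min(k, n)))
    return [
        {
            "prompt": prompts[i],
            "response": responses[i],
            "target": targets[i],
            "reward": float(rewards[i]),
        }
        for i in indices
    ]


def log_samples(step, rows, wandb_run=None):
    """Write sampled rollouts to the Python logger and, optionally, to wandb."""
    for row in rows:
        logger.info("step=%d sample=%s", step, json.dumps(row))
    if wandb_run is not None:
        import wandb  # imported lazily so the logger path works without wandb

        columns = ["step", "prompt", "response", "target", "reward"]
        data = [
            [step, r["prompt"], r["response"], r["target"], r["reward"]]
            for r in rows
        ]
        wandb_run.log(
            {"rollout_samples": wandb.Table(columns=columns, data=data)},
            step=step,
        )
```

In the training loop this would be called as `log_samples(step, pick_samples(prompts, responses, targets, rewards, seed=step))`, possibly only every few steps to keep the logs small. Seeding with the step keeps the selection reproducible across reruns while still varying which samples get shown.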