As part of my work for the Applied NLP course at the University of Arizona, I tested how effectively three lightweight language models handle commonsense reasoning tasks, comparing one-step and two-step ReAct agents using GPT-4o mini in the TextWorldExpress Commonsense environment.
For this course project, I conducted 50 test episodes. The results show that the two-step ReAct agent achieved the highest performance, though it scored approximately 17% lower than similar agents using the full GPT-4 model. This suggests that lightweight models, when combined with appropriate prompting strategies, can offer a viable, cost-efficient alternative in scenarios where moderate performance trade-offs are acceptable.
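
To make the one-step versus two-step distinction concrete, here is a minimal sketch under stated assumptions. It assumes that "one-step" means the model produces its thought and action in a single call, and that "two-step" means one call for the thought followed by a second call for the action; the project may define these differently. `ToyEnv`, `scripted_llm`, and the prompt wording are hypothetical stand-ins so the example runs without an API key. They are not the TextWorldExpress API or the prompts used in the project.

```python
from typing import Callable, Tuple

# Hypothetical stand-in for a language model call: prompt in, text out.
LLM = Callable[[str], str]
Agent = Callable[[LLM, str], str]


class ToyEnv:
    """Tiny illustrative environment (NOT TextWorldExpress): put the plate away."""

    def reset(self) -> str:
        return ("You see a dirty plate. "
                "Valid actions: put plate in dishwasher, look around")

    def step(self, action: str) -> Tuple[str, float, bool]:
        # Returns (observation, reward, done).
        if action == "put plate in dishwasher":
            return "The plate is in the dishwasher.", 1.0, True
        return "Nothing happens.", 0.0, False


def one_step_agent(llm: LLM, observation: str) -> str:
    """Assumed one-step variant: a single call yields thought and action."""
    reply = llm(f"Observation: {observation}\n"
                "Respond as 'Thought: ... Action: ...'")
    return reply.split("Action:")[-1].strip()


def two_step_agent(llm: LLM, observation: str) -> str:
    """Assumed two-step variant: ask for a thought, then for an action."""
    thought = llm(f"Observation: {observation}\nThought:")
    action = llm(f"Observation: {observation}\nThought: {thought}\nAction:")
    return action.strip()


def run_episode(agent: Agent, llm: LLM, env: ToyEnv, max_steps: int = 5) -> float:
    """Run one episode and return the total reward collected."""
    observation = env.reset()
    total = 0.0
    for _ in range(max_steps):
        action = agent(llm, observation)
        observation, reward, done = env.step(action)
        total += reward
        if done:
            break
    return total


def scripted_llm(prompt: str) -> str:
    """Deterministic fake model used only to make the sketch runnable."""
    text = prompt.rstrip()
    if text.endswith("Thought:"):
        return "Dirty dishes belong in the dishwasher."
    if text.endswith("Action:"):
        return "put plate in dishwasher"
    return ("Thought: Dirty dishes belong in the dishwasher. "
            "Action: put plate in dishwasher")


if __name__ == "__main__":
    for name, agent in [("one-step", one_step_agent), ("two-step", two_step_agent)]:
        score = run_episode(agent, scripted_llm, ToyEnv())
        print(f"{name} agent total reward: {score}")
```

Under this reading, the two-step agent costs two model calls per turn instead of one, which is the trade-off to weigh against any gain in score.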