Framework for Prompt Engineering

Context

With the rise in popularity of large language models (LLMs) like ChatGPT and GPT-4 comes a flood of advice on how best to interact with this form of artificial intelligence. My LinkedIn feed is full of posts about how to write prompts to get the desired output from ChatGPT and GPT-4. While these are likely well-intentioned and clearly topical, not a single one of them is actually data-driven; they are purely anecdotes drawn from individual users' experiences.

The purpose of this project is to provide a framework for how one can evaluate the effect of altering a prompt on the output of such models. The intention is not to provide an exhaustive list of all the ways to alter a prompt or offer context - it's purely to provide some guardrails that I think will become increasingly useful and important as usage of ChatGPT and GPT-4 rises.

Data and Task

I'm very interested in text summarization, so I chose this as my task. I used the publicly available SAMSum dataset, which is available for download here. It contains over 16,000 chat dialogues with manually annotated summaries.

In the interest of saving money, I limited my analysis to the first 1,000 dialogues, as API calls to OpenAI would get pricey across the entire dataset.
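To make the data setup concrete, here is a minimal sketch of loading the first 1,000 SAMSum dialogues with the Hugging Face `datasets` library. The dataset identifier, field names, and library choice are assumptions based on the public SAMSum release, not details taken from this repo's notebook.

```python
# Hedged sketch: load SAMSum and keep the first 1,000 dialogues.
from datasets import load_dataset

samsum = load_dataset("samsum", split="train")   # assumed public Hugging Face dataset id
subset = samsum.select(range(1000))              # limit to 1,000 dialogues to keep API costs down

print(subset[0]["dialogue"])   # chat dialogue
print(subset[0]["summary"])    # manually annotated reference summary
```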

I chose ROUGE-1, ROUGE-2, and ROUGE-L as my evaluation metrics. Here is a great article on these scores.
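As an illustration of how the metrics are computed, below is a hedged sketch that scores one generated summary against its reference using the `rouge-score` package; the notebook may use a different ROUGE implementation, and the example texts are placeholders.

```python
# Hedged sketch: ROUGE-1/2/L F1 for a single prediction/reference pair.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    target="Amanda baked cookies and will bring Jerry some tomorrow.",      # reference ("ground_truth")
    prediction="Amanda will bring Jerry some of the cookies she baked.",    # model output
)
print({name: round(score.fmeasure, 3) for name, score in scores.items()})
```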

Experiments

There are loads of ways to alter a prompt to produce the desired output from an LLM. I chose to focus on how providing examples within a prompt influences ROUGE scores, comparing zero-shot (no examples), one-shot (one example), and multi-shot (more than one example) prompting. I computed ROUGE-1, ROUGE-2, and ROUGE-L scores for the summaries generated by text-davinci-003 (a GPT-3.5 model) against a reference summary ("ground_truth" in my notebook). I limited the output summaries to 32 tokens and set the temperature to 0 to keep the model from getting too "creative" with its summaries.
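The sketch below shows one way the three conditions could be set up with the legacy OpenAI Completions API, using the settings described above (32-token limit, temperature 0). The prompt wording and the in-context example are placeholders, not the exact prompts from the notebook.

```python
# Hedged sketch: build zero-, one-, or multi-shot prompts and summarize a dialogue.
import openai  # legacy (pre-1.0) OpenAI Python client

EXAMPLE = (
    "Dialogue:\nAmanda: I baked cookies. Do you want some?\nJerry: Sure!\n"
    "Summary: Amanda baked cookies and will share them with Jerry.\n\n"
)

def build_prompt(dialogue: str, n_examples: int) -> str:
    shots = EXAMPLE * n_examples          # 0 = zero-shot, 1 = one-shot, >1 = multi-shot
    return f"{shots}Dialogue:\n{dialogue}\nSummary:"

def summarize(dialogue: str, n_examples: int) -> str:
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=build_prompt(dialogue, n_examples),
        max_tokens=32,                    # cap the summary length
        temperature=0,                    # deterministic, less "creative" output
    )
    return response["choices"][0]["text"].strip()
```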

Results

The results of my experiments show that, compared to providing no examples at all, providing this model with one example summary within the prompt produces significantly higher ROUGE-1, ROUGE-2, and ROUGE-L scores (all p < 0.05). The same holds when comparing zero-shot prompts against prompts with more than one example.

Interestingly, there is no significant difference in ROUGE-1, ROUGE-2, and ROUGE-L scores when comparing prompts with one example summary against those with more than one.
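For readers who want to reproduce the comparison, a paired test on per-dialogue ROUGE scores is one reasonable way to check significance. The sketch below uses scipy's `ttest_rel`; this is an assumption, not necessarily the test used in the notebook.

```python
# Hedged sketch: paired t-test between two prompting conditions' per-dialogue scores.
from scipy.stats import ttest_rel

def compare(scores_a: list[float], scores_b: list[float], alpha: float = 0.05) -> None:
    stat, p_value = ttest_rel(scores_a, scores_b)
    verdict = "significant" if p_value < alpha else "not significant"
    print(f"t = {stat:.3f}, p = {p_value:.4f} ({verdict} at alpha = {alpha})")

# e.g., compare(zero_shot_rouge1, one_shot_rouge1)
```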

Conclusions

It is advantageous to provide an LLM with at least one example of what you'd like it to return. But there appear to be diminishing returns to including additional examples.

Next Steps

Here is how you adapt this framework:

  1. Choose a task (e.g., summarization, question answering, named entity recognition)
  2. Pick an appropriate evaluation metric based on the task
  3. Either partition a subset of your dataset (if you have one) or find a freely available one that you can use to experiment
  4. Evaluate how different prompting techniques affect your evaluation metric (a generic sketch follows this list)
  5. Make a conclusion about which technique produces results closest to the desired output (and potentially evaluate costs if your dataset is large)
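As a generic illustration of steps 4 and 5, here is a hedged sketch of the evaluation loop. `generate`, `score`, and `eval_set` are placeholders you would supply for your own task and metric; none of them come from this repository.

```python
# Hedged sketch: score each prompting technique on an evaluation set and compare means.
from statistics import mean

def evaluate_techniques(techniques, eval_set, generate, score):
    """techniques: {name: prompt_builder}; eval_set: iterable of (example, reference) pairs."""
    results = {}
    for name, build_prompt in techniques.items():
        per_example = [
            score(reference, generate(build_prompt(example)))
            for example, reference in eval_set
        ]
        results[name] = mean(per_example)   # mean metric per technique
    return results
```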
