# Few-shot Prompting

Provide demonstrations to the model to steer it toward the desired results.

In [11]:
# Warning control
import warnings
warnings.filterwarnings("ignore")

In [12]:
# Importing libraries
import os
import random
import json

from dotenv import load_dotenv
from openai import OpenAI

# Loading environment variables (the OpenAI client reads OPENAI_API_KEY)
load_dotenv()

client = OpenAI()

In [13]:
def get_chat_completion(messages, model="gpt-4o", temperature=0.0):
    """Send the messages to the chat completions API and return the reply text."""
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature
    )
    return response.choices[0].message.content

It's typical to start with a zero-shot prompt on your task. 

### Zero-shot Prompting

In [14]:
messages=[
    {
      "role": "system",
      "content": [
        {
          "text": "Your task is to extract model names from machine learning paper abstracts. Your response is an array of the model names in the format [\"model_name\"]. If you don't find model names in the abstract or you are not sure, return [\"NA\"]",
          "type": "text"
        }
      ]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Abstract: Training large language models (LLM) with open-domain instruction following data brings colossal success. However, manually creating such instruction data is very time-consuming and labor-intensive. Moreover, humans may struggle to produce high-complexity instructions. In this paper, we show an avenue for creating large amounts of instruction data with varying levels of complexity using LLM instead of humans. Starting with an initial set of instructions, we use our proposed Evol-Instruct to rewrite them step by step into more complex instructions. Then, we mix all generated instruction data to fine-tune LLaMA. We call the resulting model WizardLM. Human evaluations on a complexity-balanced test bed show that instructions from Evol-Instruct are superior to human-created ones. By analyzing the human evaluation results of the high complexity part, we demonstrate that outputs from our WizardLM model are preferred to outputs from OpenAI ChatGPT. Even though WizardLM still lags behind ChatGPT in some aspects, our findings suggest that fine-tuning with AI-evolved instructions is a promising direction for enhancing large language models. Our codes and generated data are public at <link>"
        }
      ]
    }
  ]

response = get_chat_completion(messages)

print(response)


["WizardLM"]


### Few-shot Prompting


**Why few-shot prompting?** -- Providing examples helps steer the model toward the types of model names we want to extract.

**Structuring the few-shot prompt** -- For readability and reliability, we can use alternating user and assistant messages to structure the examples/demonstrations. The dialogue interface then sets expectations for both the kind of information and the style of output we want.

**How many demonstrations do you need?** -- As a rule of thumb, more demonstrations help. Consider starting with 5-10 examples and regularly evaluate whether adding more leads to improvements. If you can afford it, some of the more recent models can handle hundreds to thousands of examples (i.e., many-shot learning). In general, focus on performance first and then optimize for other things like cost and latency.

**Randomize the demonstrations for robustness** -- This checks that your overall prompt is robust to changes such as the order of the demonstrations. It matters because you will be revising this prompt regularly.
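The structuring and shuffling ideas above can be sketched as a small helper. This is a minimal sketch rather than the notebook's own code: `build_few_shot_messages` and its parameters are hypothetical names, the demonstration texts are placeholders, and it uses plain-string `content` instead of the list-of-parts form used in the cells of this notebook.

```python
import random


def build_few_shot_messages(system_prompt, demonstrations, query, seed=None):
    """Assemble a chat prompt: system message, user/assistant demo pairs, then the query.

    demonstrations is a list of (input_text, output_text) tuples.
    If seed is given, the demonstrations are shuffled reproducibly.
    """
    demos = list(demonstrations)  # copy so the caller's list is not mutated
    if seed is not None:
        random.Random(seed).shuffle(demos)

    messages = [{"role": "system", "content": system_prompt}]
    for input_text, output_text in demos:
        # Each demonstration becomes one user turn followed by one assistant turn
        messages.append({"role": "user", "content": input_text})
        messages.append({"role": "assistant", "content": output_text})
    messages.append({"role": "user", "content": query})
    return messages


# Placeholder demonstrations (hypothetical, for illustration only)
demos = [
    ("Abstract: We present FooNet, a model for ...", "['FooNet']"),
    ("Abstract: We study a general reconstruction approach ...", "['NA']"),
]
messages = build_few_shot_messages(
    "Extract model names.", demos, "Abstract: ...", seed=0
)
```

Passing a fixed `seed` keeps each shuffled ordering reproducible, which makes it easier to compare runs when the prompt changes.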

Now let's look at what a few-shot prompt looks like for the above task. 

In [15]:
messages=[
    {
      "role": "system",
      "content": [
        {
          "text": "Your task is to extract model names from machine learning paper abstracts. Your response is an array of the model names in the format [\"model_name\"]. If you don't find model names in the abstract or you are not sure, return [\"NA\"]",
          "type": "text"
        }
      ]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Abstract: Generating talking head videos through a face image and a piece of speech audio still contains many challenges. ie, unnatural head movement, distorted expression, and identity modification. We argue that these issues are mainly because of learning from the coupled 2D motion fields. On the other hand, explicitly using 3D information also suffers problems of stiff expression and incoherent video. We present SadTalker, which generates 3D motion coefficients (head pose, expression) of the 3DMM from audio and implicitly modulates a novel 3D-aware face render for talking head generation. To learn the realistic motion coefficients, we explicitly model the connections between audio and different types of motion coefficients individually. Precisely, we present ExpNet to learn the accurate facial expression from audio by distilling both coefficients and 3D-rendered faces. As for the head pose, we design PoseVAE via a conditional VAE to synthesize head motion in different styles. Finally, the generated 3D motion coefficients are mapped to the unsupervised 3D keypoints space of the proposed face render, and synthesize the final video. We conducted extensive experiments to demonstrate the superiority of our method in terms of motion and video quality. "
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {
          "type": "text",
          "text": "Tags: ['SadTalker', 'ExpNet', 'PoseVAE']"
        }
      ]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Abstract: We propose a visibility-aware online 3D scene reconstruction approach from posed monocular videos. In particular, we aim to reconstruct the scene from volumetric features. Unlike previous reconstruction methods which aggregate features for each voxel from input views without considering its visibility, we aim to improve the feature fusion by explicitly inferring its visibility from a similarity matrix, computed from its projected features in each image pair. Following previous works, our model is a coarse-to-fine pipeline including a volume sparsification process. Different from their works which sparsify voxels globally with a fixed occupancy threshold, we perform the sparsification on a local feature volume along each visual ray to preserve at least one voxel per ray for more fine details. The sparse local volume is then fused with a global one for online reconstruction. We further propose to predict TSDF in a coarse-to-fine manner by learning its residuals across scales leading to better TSDF predictions. Experimental results on benchmarks show that our method can achieve superior performance with more scene details. Code is available at: "
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {
          "type": "text",
          "text": "Tags: ['NA']"
        }
      ]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Abstract: Topological deep learning is a rapidly growing field that pertains to the development of deep learning models for data supported on topological domains such as simplicial complexes, cell complexes, and hypergraphs, which generalize many domains encountered in scientific computations. In this paper, we present a unifying deep learning framework built upon a richer data structure that includes widely adopted topological domains. Specifically, we first introduce combinatorial complexes, a novel type of topological domain. Combinatorial complexes can be seen as generalizations of graphs that maintain certain desirable properties. Similar to hypergraphs, combinatorial complexes impose no constraints on the set of relations. In addition, combinatorial complexes permit the construction of hierarchical higher-order relations, analogous to those found in simplicial and cell complexes. Thus, combinatorial complexes generalize and combine useful traits of both hypergraphs and cell complexes, which have emerged as two promising abstractions that facilitate the generalization of graph neural networks to topological spaces. Second, building upon combinatorial complexes and their rich combinatorial and algebraic structure, we develop a general class of message-passing combinatorial complex neural networks (CCNNs), focusing primarily on attention-based CCNNs. We characterize permutation and orientation equivariances of CCNNs, and discuss pooling and unpooling operations within CCNNs in detail. Third, we evaluate the performance of CCNNs on tasks related to mesh shape analysis and graph learning. Our experiments demonstrate that CCNNs have competitive performance as compared to state-of-the-art deep learning models specifically tailored to the same tasks. Our findings demonstrate the advantages of incorporating higher-order relations into deep learning models in different applications."
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {
          "type": "text",
          "text": "Tags: ['CCNNs']"
        }
      ]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Abstract: Training large language models (LLM) with open-domain instruction following data brings colossal success. However, manually creating such instruction data is very time-consuming and labor-intensive. Moreover, humans may struggle to produce high-complexity instructions. In this paper, we show an avenue for creating large amounts of instruction data with varying levels of complexity using LLM instead of humans. Starting with an initial set of instructions, we use our proposed Evol-Instruct to rewrite them step by step into more complex instructions. Then, we mix all generated instruction data to fine-tune LLaMA. We call the resulting model WizardLM. Human evaluations on a complexity-balanced test bed show that instructions from Evol-Instruct are superior to human-created ones. By analyzing the human evaluation results of the high complexity part, we demonstrate that outputs from our WizardLM model are preferred to outputs from OpenAI ChatGPT. Even though WizardLM still lags behind ChatGPT in some aspects, our findings suggest that fine-tuning with AI-evolved instructions is a promising direction for enhancing large language models. Our codes and generated data are public at"
        }
      ]
    },
    {
      "role": "assistant",
      "refusal": False,
      "content": [
        {
          "type": "text",
          "text": "Tags: ['Evol-Instruct', 'WizardLM', 'ChatGPT', 'LLaMa']"
        }
      ]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Abstract: Two of the most powerful models are LlaMa and ChatGPT. Similar to hypergraphs, combinatorial complexes impose no constraints on the set of relations. In addition, combinatorial complexes permit the construction of hierarchical higher-order relations, analogous to those found in simplicial and cell complexes. Thus, combinatorial complexes generalize and combine useful traits of both hypergraphs and cell complexes, which have emerged as two promising abstractions that facilitate the generalization of graph neural networks to topological spaces. Second, building upon combinatorial complexes and their rich combinatorial and algebraic structure, we develop a general class of message-passing combinatorial complex neural networks (CCNNs), focusing primarily on attention-based CCNNs. "
        }
      ]
    }
]

response = get_chat_completion(messages)

print(response)

Tags: ['LlaMa', 'ChatGPT', 'CCNNs']


#### What demonstrations to use in few-shot

A few tips for what to consider when preparing demonstrations for few-shot prompts:

- **Provide many input/output pairs (demonstrations) showing the behavior you want**  
  - Not every exemplar needs an input/output pair; it depends on the task

- **Make the demonstrations as relevant and diverse as possible**  
  - Using similar demonstrations could also work in some domains  
  - Experiment with different formats and styles but try to use common formats (e.g., Q:A)

- **Make sure to account for label distribution**  
  - Generally go for a balanced distribution or base it on your data distribution  
  - Aim for high-quality, properly labeled exemplars  

- **Experiment with roles and different formats and styles**
  - Leverage user + assistant roles to structure few-shot demonstrations and combine this with explicit indicators where possible
  - Use delimiters when adding demonstrations to the system prompt to structure them better

- **Pay attention to ordering of exemplars as it can affect the results**  

- **Cover failure cases/edge cases**
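The tip about label distribution can be checked with a quick count. This is a minimal sketch: `label_distribution` is a hypothetical helper, and it assumes each example is a dict with a `tags` list, where `['NA']` marks an abstract with no model names. The records below are toy data in that assumed shape.

```python
from collections import Counter


def label_distribution(examples):
    """Count how many demonstrations are 'NA' versus contain at least one model name."""
    counts = Counter()
    for example in examples:
        tags = example["tags"]
        counts["NA" if tags == ["NA"] else "has_models"] += 1
    return counts


# Toy records in the assumed {'abstract': ..., 'tags': [...]} shape
examples = [
    {"abstract": "...", "tags": ["SadTalker", "ExpNet", "PoseVAE"]},
    {"abstract": "...", "tags": ["NA"]},
    {"abstract": "...", "tags": ["CCNNs"]},
]
print(label_distribution(examples))
```

If one class dominates, consider adding demonstrations for the under-represented case or matching the proportions seen in your real data.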

Let's turn the few-shot examples into a template that is easy to reuse:

In [16]:
import json

with open('data/abstracts.json', 'r') as f:
    abstracts = json.load(f)

few_shot_examples = [] 

# build one [user, assistant] message pair per abstract and append it to few_shot_examples
for abstract in abstracts:
    # Add user message (abstract)
    few_shot_examples.append([{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Abstract: " + abstract['abstract']
            }
        ]
    }, 
    {
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": str(abstract['tags'])
            }
        ]
    }])

In [17]:
few_shot_examples

[[{'role': 'user',
   'content': [{'type': 'text',
     'text': 'Abstract: Generating talking head videos through a face image and a piece of speech audio still contains many challenges. ie, unnatural head movement, distorted expression, and identity modification. We argue that these issues are mainly because of learning from the coupled 2D motion fields. On the other hand, explicitly using 3D information also suffers problems of stiff expression and incoherent video. We present SadTalker, which generates 3D motion coefficients (head pose, expression) of the 3DMM from audio and implicitly modulates a novel 3D-aware face render for talking head generation. To learn the realistic motion coefficients, we explicitly model the connections between audio and different types of motion coefficients individually. Precisely, we present ExpNet to learn the accurate facial expression from audio by distilling both coefficients and 3D-rendered faces. As for the head pose, we design PoseVAE via a cond

In [18]:
system_message = [
    {
      "role": "system",
      "content": [
        {
          "text": "Your task is to extract model names from machine learning paper abstracts. Your response is an array of the model names in the format [\"model_name\"]. If you don't find model names in the abstract or you are not sure, return [\"NA\"]",
          "type": "text"
        }
      ]
    }
]

user_message = [{
    "role": "user",
    "content": [
        {
            "type": "text",
            "text": "Abstract: We introduce PokerBench - a benchmark for evaluating the poker-playing abilities of large language models (LLMs). As LLMs excel in traditional NLP tasks, their application to complex, strategic games like poker poses a new challenge. Poker, an incomplete information game, demands a multitude of skills such as mathematics, reasoning, planning, strategy, and a deep understanding of game theory and human psychology. This makes Poker the ideal next frontier for large language models. PokerBench consists of a comprehensive compilation of 11,000 most important scenarios, split between pre-flop and post-flop play, developed in collaboration with trained poker players. We evaluate prominent models including GPT-4, ChatGPT 3.5, and various Llama and Gemma series models, finding that all state-of-the-art LLMs underperform in playing optimal poker. However, after fine-tuning, these models show marked improvements. We validate PokerBench by having models with different scores compete with each other, demonstrating that higher scores on PokerBench lead to higher win rates in actual poker games. Through gameplay between our fine-tuned model and GPT-4, we also identify limitations of simple supervised fine-tuning for learning optimal playing strategy, suggesting the need for more advanced methodologies for effectively training language models to excel in games. PokerBench thus presents a unique benchmark for a quick and reliable evaluation of the poker-playing ability of LLMs as well as a comprehensive benchmark to study the progress of LLMs in complex game-playing scenarios. "
        }
    ]
}]

messages = system_message + [item for sublist in few_shot_examples for item in sublist] + user_message

print("Full Prompt: \n", messages, "\n\n")

response = get_chat_completion(messages)

print("Response: \n", response)

Full Prompt: 
 [{'role': 'system', 'content': [{'text': 'Your task is to extract model names from machine learning paper abstracts. Your response is an array of the model names in the format ["model_name"]. If you don\'t find model names in the abstract or you are not sure, return ["NA"]', 'type': 'text'}]}, {'role': 'user', 'content': [{'type': 'text', 'text': 'Abstract: Generating talking head videos through a face image and a piece of speech audio still contains many challenges. ie, unnatural head movement, distorted expression, and identity modification. We argue that these issues are mainly because of learning from the coupled 2D motion fields. On the other hand, explicitly using 3D information also suffers problems of stiff expression and incoherent video. We present SadTalker, which generates 3D motion coefficients (head pose, expression) of the 3DMM from audio and implicitly modulates a novel 3D-aware face render for talking head generation. To learn the realistic motion coeffi

To test the prompt's robustness, you can try randomizing the order of the few-shot examples:

In [19]:
random.shuffle(few_shot_examples)

In [20]:
few_shot_examples

[[{'role': 'user',
   'content': [{'type': 'text',
     'text': 'Abstract: Generating talking head videos through a face image and a piece of speech audio still contains many challenges. ie, unnatural head movement, distorted expression, and identity modification. We argue that these issues are mainly because of learning from the coupled 2D motion fields. On the other hand, explicitly using 3D information also suffers problems of stiff expression and incoherent video. We present SadTalker, which generates 3D motion coefficients (head pose, expression) of the 3DMM from audio and implicitly modulates a novel 3D-aware face render for talking head generation. To learn the realistic motion coefficients, we explicitly model the connections between audio and different types of motion coefficients individually. Precisely, we present ExpNet to learn the accurate facial expression from audio by distilling both coefficients and 3D-rendered faces. As for the head pose, we design PoseVAE via a cond

In [21]:
messages = system_message + [item for sublist in few_shot_examples for item in sublist] + user_message

response = get_chat_completion(messages)

print(response)

['PokerBench', 'GPT-4', 'ChatGPT 3.5', 'Llama', 'Gemma']
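A single shuffle only tests one ordering. The sketch below repeats the check over several orderings and collects the distinct outputs. It is an illustration under stated assumptions: `order_sensitivity` and `fake_predict` are hypothetical names, and `fake_predict` stands in for a real model call so that the example runs offline.

```python
import random


def order_sensitivity(example_pairs, predict_fn, n_trials=5, seed=0):
    """Run predict_fn on several shuffled orderings and collect the distinct outputs.

    example_pairs: list of [user_message, assistant_message] pairs.
    predict_fn: callable taking the flattened demonstration list and returning a string.
    """
    rng = random.Random(seed)
    outputs = set()
    for _ in range(n_trials):
        shuffled = list(example_pairs)  # copy so the original order is preserved
        rng.shuffle(shuffled)
        flat = [message for pair in shuffled for message in pair]
        outputs.add(predict_fn(flat))
    return outputs


# Stand-in for a real model call (hypothetical); always returns the same answer
def fake_predict(flat_examples):
    return "['PokerBench']"


pairs = [
    [{"role": "user", "content": f"Abstract {i}"},
     {"role": "assistant", "content": "['NA']"}]
    for i in range(3)
]
distinct = order_sensitivity(pairs, fake_predict)
print(len(distinct))  # a single distinct output means ordering did not change the answer
```

In this notebook, `predict_fn` could wrap the real call, for example `lambda flat: get_chat_completion(system_message + flat + user_message)`, at the cost of one API request per trial.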


### Evaluating Few-shot Prompting

The goal is to measure how much the new few-shot prompt improves over the zero-shot base prompt.

In [22]:
# import val_data.json
import json

with open('data/val_data.json', 'r') as f:
    val_data = json.load(f)

In [23]:
print(val_data[0])

{'paper': 'Grandmaster-Level Chess Without Search', 'abstract': "The recent breakthrough successes in machine learning are mainly attributed to scale: namely large-scale attention-based architectures and datasets of unprecedented scale. This paper investigates the impact of training at scale for chess. Unlike traditional chess engines that rely on complex heuristics, explicit search, or a combination of both, we train a 270M parameter transformer model with supervised learning on a dataset of 10 million chess games. We annotate each board in the dataset with action-values provided by the powerful Stockfish 16 engine, leading to roughly 15 billion data points. Our largest model reaches a Lichess blitz Elo of 2895 against humans, and successfully solves a series of challenging chess puzzles, without any domain-specific tweaks or explicit search algorithms. We also show that our model outperforms AlphaZero's policy and value networks (without MCTS) and GPT-3.5-turbo-instruct. A systematic

In [24]:
# helps to format the user message
def format_user_message(paper):
    return [
        {
            "role": "user",
            "content": [{"type": "text", "text": "Abstract: " + paper["abstract"]}]
        }
    ]

def get_zero_shot_predictions(val_data, system_message):
    """Calls the model with the system message and returns the predictions"""
    predictions = []
    for paper in val_data:
        messages = system_message + format_user_message(paper)
        response = get_chat_completion(messages)
        predictions.append(response)
    return predictions

# helps to get the few_shot_predictions
def get_few_shot_predictions(val_data, few_shot_examples, system_message):
    """Calls the model with the few shot examples and returns the predictions"""
    predictions = []
    for paper in val_data:
        messages = system_message + few_shot_examples + format_user_message(paper)
        response = get_chat_completion(messages)
        predictions.append(response)
    return predictions

final_few_shot_examples =[item for sublist in few_shot_examples for item in sublist]

# testing it with a paper
few_shot_predictions = get_few_shot_predictions([val_data[31]],  final_few_shot_examples, system_message)
zero_shot_predictions = get_zero_shot_predictions([val_data[31]], system_message)

print(few_shot_predictions)
print(zero_shot_predictions)

["['AlphaCodium', 'GPT-4']"]
['["AlphaCodium"]']


IMPROVEMENT: It might be good to validate the results, or even use structured outputs, for this use case.
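One way to validate the results is to parse each response defensively instead of trusting it to be a well-formed list. The sketch below is an assumption-laden illustration, not part of the original notebook: `parse_prediction` is a hypothetical helper that returns `None` for anything that is not a list of strings, and it tolerates the optional `Tags:` prefix seen in the hand-written demonstrations earlier.

```python
import ast


def parse_prediction(raw):
    """Parse a model response into a list of strings, or return None if it is malformed."""
    text = raw.strip()
    # Tolerate an optional "Tags:" prefix
    if text.lower().startswith("tags:"):
        text = text[len("tags:"):].strip()
    try:
        value = ast.literal_eval(text)  # parses literals only; never executes code
    except (ValueError, SyntaxError, TypeError):
        return None
    if isinstance(value, list) and all(isinstance(item, str) for item in value):
        return [item.strip() for item in value]
    return None


print(parse_prediction("['AlphaCodium', 'GPT-4']"))
print(parse_prediction('["AlphaCodium"]'))
print(parse_prediction("Sorry, I could not find any models."))
```

Responses that come back as `None` can be counted as failures or retried, rather than crashing the evaluation loop.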

Run the predictions using both zero-shot prompting and few-shot prompting (takes about 3 mins to run, ~$0.20 at the time of recording):

In [25]:
# run both the zero shot and few shot predictions for all the papers
zero_shot_predictions = get_zero_shot_predictions(val_data, system_message)
few_shot_predictions = get_few_shot_predictions(val_data, final_few_shot_examples, system_message)

In [26]:
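import ast

# parse each zero-shot response string into a Python list
# (ast.literal_eval only evaluates literals, so model output is never executed as code)
zero_shot_predictions = [ast.literal_eval(prediction) for prediction in zero_shot_predictions]
# parse each few-shot response string into a Python list
few_shot_predictions = [ast.literal_eval(prediction) for prediction in few_shot_predictions]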

Save the predictions for easy reuse later:

In [27]:
# Uncomment if you want to export the predictions as a JSON file for future use
#with open('data/predictions.json', 'w') as f:
#    json.dump({'zero_shot_predictions': zero_shot_predictions, 'few_shot_predictions': few_shot_predictions}, f)

Load the results the next time you run the notebook (uncomment if you need it):

In [28]:
# read in the predictions from the json file
#with open('data/predictions.json', 'r') as f:
#    predictions = json.load(f)
#zero_shot_predictions = predictions['zero_shot_predictions']
#few_shot_predictions = predictions['few_shot_predictions']

In [29]:
# Only needed if the saved predictions are raw strings rather than lists
# (predictions saved after the parsing step above are already lists)
#import ast
#zero_shot_predictions = [ast.literal_eval(prediction) for prediction in zero_shot_predictions]
#few_shot_predictions = [ast.literal_eval(prediction) for prediction in few_shot_predictions]

In [30]:
zero_shot_predictions

[['NA'],
 ['AnyTool', 'ToolLLM', 'GPT-4'],
 ['NA'],
 ['NA'],
 ['ALOHA 2', 'ALOHA'],
 ['NA'],
 ['SELF-DISCOVER'],
 ['DeepSeekMath 7B', 'DeepSeek-Coder-Base-v1.5 7B', 'Gemini-Ultra', 'GPT-4'],
 ['NA'],
 ['NA'],
 ['OLMo'],
 ['NA'],
 ['CRAG'],
 ['NA'],
 ['NA'],
 ['MoE-Tuning', 'MoE-LLaVA', 'LLaVA-1.5-7B', 'LLaVA-1.5-13B'],
 ['WRAP'],
 ['Retrieval-Augmented Generation (RAG)'],
 ['NA'],
 ['SliceGPT', 'LLAMA2-70B', 'OPT 66B', 'Phi-2'],
 ['Depth Anything'],
 ['Llama-2', 'MPT', 'OpenLLaMA'],
 ['MambaByte'],
 ['DreamPaint', 'Diffuse to Choose'],
 ['WARM'],
 ['NA'],
 ['RTVLM', 'LLaVA-v1.5'],
 ['Lumiere'],
 ['Medusa', 'Medusa-1', 'Medusa-2'],
 ['AgentBoard'],
 ['AlphaGeometry'],
 ['AlphaCodium'],
 ['Llama2-13B', 'GPT-3.5', 'GPT-4'],
 ['Self-Rewarding Language Models', 'Llama 2 70B'],
 ['proxy-tuning', 'Llama2-70B'],
 ['ReFT'],
 ['NA'],
 ['Patchscopes'],
 ['QLoRA'],
 ['Mamba', 'MoE-Mamba', 'Transformer-MoE'],
 ['InseRF'],
 ['NA'],
 ['ChatGPT'],
 ['MagicVideo-V2',
  'Runway',
  'Pika 1.0',
  'Morph'

In [31]:
import ast

# get actual tags (gold_labels is stored as a string, so parse it into a list)
actual_tags = [ast.literal_eval(paper['gold_labels']) for paper in val_data]

# clean up: remove white spaces from the items in the arrays for the actual tags
actual_tags = [[item.strip() for item in tag] for tag in actual_tags]

In [32]:
actual_tags[0:4]

[['AlphaZero', 'GPT-3.5-turbo-instruct'],
 ['AnyTool', 'GPT-4', 'ToolLLM'],
 ['NA'],
 ['GPT-3.5-turbo', 'Gemini-pro']]

### LLM-as-a-judge

In [33]:
# compare predictions with the actual tags using LLM-as-a-judge
def get_llm_as_a_judge_predictions(predictions, actual_tags):
    """Asks the judge model to compare each prediction with its actual tags and returns the verdicts"""

    final_assessment = []

    for prediction, actual_tag in zip(predictions, actual_tags):

        messages = [
            {
            "role": "system",
            "content": [{"type": "text", "text": "You are a judge that will compare the predictions with the actual tags and return either 'correct' or 'incorrect' for each prediction. The predictions are an array of model names and the actual tags are an array of model names. The predictions and actual tags don't need to be in the same order to be correct as long as the correct model names are present in both the predictions and the actual tags."}]
            },
            {
            "role": "user",
            "content": [{"type": "text", "text": f"Predictions: {prediction} Actual Tags: {actual_tag}"}]
            }
        ]

        response = get_chat_completion(messages)
        final_assessment.append(response)


    return final_assessment

# assess a few predictions
assessment = get_llm_as_a_judge_predictions(zero_shot_predictions[0:10], actual_tags[0:10])

Now run assessment for both types of predictions (takes less than a minute; ~$0.04):

In [34]:
# run assessment for zero shot predictions
zero_shot_assessment = get_llm_as_a_judge_predictions(zero_shot_predictions, actual_tags)
# run assessment for few shot predictions
few_shot_assessment = get_llm_as_a_judge_predictions(few_shot_predictions, actual_tags)

In [35]:
# count the number of correct predictions
zero_shot_correct = zero_shot_assessment.count('correct')
few_shot_correct = few_shot_assessment.count('correct')

print(f"Zero-shot accuracy: {zero_shot_correct/len(zero_shot_assessment)}")
print(f"Few-shot accuracy: {few_shot_correct/len(few_shot_assessment)}")


Zero-shot accuracy: 0.6
Few-shot accuracy: 0.6
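The counts above rely on the judge replying with exactly the string `correct`. A reply such as `Correct.` would silently be counted as wrong. The sketch below normalizes verdicts before counting. `normalize_verdict` and `accuracy` are hypothetical helpers, and the sample verdicts are made up for illustration.

```python
def normalize_verdict(raw):
    """Map a free-text judge response to 'correct', 'incorrect', or 'unknown'."""
    # Drop surrounding whitespace, then surrounding quotes and periods, then lowercase
    text = raw.strip().strip(".'\"").lower()
    if text in ("correct", "incorrect"):
        return text
    return "unknown"


def accuracy(assessments):
    """Fraction of assessments whose normalized verdict is 'correct'."""
    if not assessments:
        return 0.0
    verdicts = [normalize_verdict(a) for a in assessments]
    return verdicts.count("correct") / len(verdicts)


sample = ["correct", "Incorrect", "Correct.", "The prediction is mostly right"]
print(accuracy(sample))
```

Tracking the number of `unknown` verdicts separately is a cheap way to notice when the judge stops following its output format.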


Not a lot of improvement, but we can continue to optimize the system prompt, and take a closer look at the LLM-as-a-judge to make it less strict.
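A deterministic comparison can complement the LLM judge and show whether its strictness is hiding partial matches. The sketch below is a swapped-in alternative, not the method used in this notebook: it compares normalized name sets and reports per-example precision and recall. The function names are hypothetical, and the third example pair is made up for illustration.

```python
def _normalize(names):
    """Lowercase and strip each name so trivial formatting differences are ignored."""
    return {name.strip().lower() for name in names}


def exact_set_match(predicted, actual):
    """True when both lists contain the same names, ignoring order, case, and spacing."""
    return _normalize(predicted) == _normalize(actual)


def precision_recall(predicted, actual):
    """Per-example precision and recall over normalized name sets."""
    pred, gold = _normalize(predicted), _normalize(actual)
    overlap = len(pred & gold)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(gold) if gold else 0.0
    return precision, recall


# The first two pairs reuse values shown earlier in this notebook
print(exact_set_match(["AnyTool", "ToolLLM", "GPT-4"], ["AnyTool", "GPT-4", "ToolLLM"]))
print(precision_recall(["NA"], ["AlphaZero", "GPT-3.5-turbo-instruct"]))
# Made-up pair: one extra predicted name lowers precision but not recall
print(precision_recall(["ModelA", "ModelB"], ["ModelA"]))
```

Exact set matching is stricter than a lenient judge, while precision and recall give partial credit, so reporting both alongside the judge's verdicts helps explain where the 0.6 comes from.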

The good news is that we now have a way to systematically test any improvements on our few-shot prompt.

Other things to try:
- Further optimize the system prompt
- Use the o1-mini model for the LLM-as-a-judge evaluation
- Expand the few-shot examples to cover more edge cases, but perform an error analysis first