# Few-shot learning and in-context learning

In this notebook, we explore few-shot learning and in-context learning, two paradigms that allow LLMs to generalize and solve new tasks from minimal examples, often just a handful. These techniques are especially valuable when labeled data is scarce and traditional fine-tuning is impractical.

In [1]:
import os
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Fail early with a clear message if the key is missing from the environment or .env file
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set")

## Initialize the language model
We instantiate a lightweight GPT model from OpenAI using LangChain.

In [2]:
# Initialize the language model
llm = ChatOpenAI(model="gpt-4o-mini-2024-07-18")

## Basic few-shot learning
Few-shot learning is a powerful prompting technique in which a small number of labeled examples are included directly in the input to guide the language model in learning the desired task behavior. Unlike traditional supervised learning, the model is not fine-tuned — instead, it leverages its general knowledge and the in-prompt examples to make inferences.

In this section, we demonstrate few-shot learning for sentiment classification, where the model is given just three labeled examples (positive, negative, neutral) and is then asked to classify a new, unseen input.

Few-shot learning is highly effective when:
- We want to bootstrap new tasks quickly.
- We don’t have enough labeled data for fine-tuning.
- We are dealing with dynamic or domain-specific input.

In [3]:
def few_shot_sentiment_classification(input_text):
    # Define a few-shot prompt with 3 labeled examples
    few_shot_prompt = PromptTemplate(
        input_variables=["input_text"],
        template="""
        Classify the sentiment as Positive, Negative, or Neutral.

        Examples:
        Text: I love this product! It's amazing.
        Sentiment: Positive

        Text: This movie was terrible. I hated it.
        Sentiment: Negative

        Text: The weather today is okay.
        Sentiment: Neutral

        Now, classify the following:
        Text: {input_text}
        Sentiment:
        """
    )

    # Create a chain by combining the prompt with the language model
    chain = few_shot_prompt | llm
    # Invoke the chain, passing the input as a dict keyed by the template variable
    result = chain.invoke({"input_text": input_text}).content

    # Remove extra whitespace from the result
    result = result.strip()
    # If the model echoes a prefix such as "Sentiment: Positive", keep only the text after the last colon
    if ':' in result:
        result = result.rsplit(':', 1)[-1].strip()

    return result  # Expected to be "Positive", "Negative", or "Neutral"

test_text = "I can't believe how great this new restaurant is!"
result = few_shot_sentiment_classification(test_text)

print(f"Input: {test_text}")
print(f"Predicted Sentiment: {result}")

Input: I can't believe how great this new restaurant is!
Predicted Sentiment: Positive


1. Few-shot prompt construction:
   - We define a `PromptTemplate` that includes three labeled examples for sentiment classification. Each example contains a short sentence followed by its correct sentiment label.
   - The prompt ends with a new sentence (`{input_text}`) and a blank sentiment label, prompting the model to infer the label using the same pattern.
2. Prompt execution: The prompt is passed to the `llm` using LangChain’s `|` (pipe) syntax, which creates a `Runnable` chain — in this case, one that formats the prompt and sends it to the model.
3. Inference: The `invoke()` method runs the chain with the user's input text. The model analyzes the few-shot prompt and completes the missing sentiment label.
4. Post-processing: We clean up the model’s response by trimming whitespace. If the model includes a colon (e.g., `"Sentiment: Positive"`), we extract only the label ("Positive") for cleaner output.
5. Output: The final result is a simple string representing the sentiment category: `"Positive"`, `"Negative"`, or `"Neutral"`.

This approach works well because:
- The model is shown the pattern through examples. There is no need for training loops or labeled datasets — it generalizes from examples in the prompt.
- It uses its pretrained knowledge of sentiment cues and linguistic structure.
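
The post-processing step above splits on a colon, which assumes the model replies with either a bare label or `Sentiment: <label>`. A slightly more defensive variant could map whatever the model returns onto the three allowed labels. The sketch below is one way to do that; the helper name `normalize_sentiment` and the `"Unknown"` fallback are assumptions for illustration, not part of the notebook.

```python
import re

ALLOWED_LABELS = ("Positive", "Negative", "Neutral")

def normalize_sentiment(raw_response: str) -> str:
    """Map a raw model reply onto one of the allowed labels (hypothetical helper)."""
    # Scan the words in order and return the first one that is an allowed label,
    # ignoring case and surrounding punctuation
    for match in re.finditer(r"[A-Za-z]+", raw_response):
        word = match.group(0).capitalize()
        if word in ALLOWED_LABELS:
            return word
    # Assumed fallback for replies that contain none of the labels
    return "Unknown"

print(normalize_sentiment("Sentiment: Positive"))
print(normalize_sentiment("  negative.\n"))
print(normalize_sentiment("I am not sure"))
```

Returning an explicit fallback value keeps unexpected replies visible during evaluation instead of silently passing them through as labels.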


## Advanced few-shot techniques
We now expand our few-shot setup to support multi-task learning. While basic few-shot learning is effective for single-task setups, many real-world applications require language models to handle multiple distinct tasks, such as sentiment classification, language detection, and text summarization.

This is where multi-task learning comes into play. The idea is to use a single prompt format that supports multiple task types, allowing the model to dynamically adjust its behavior based on the task description provided in the prompt.

This setup forms the foundation of prompt-based multi-task learners, which are highly valuable in real-world systems like customer support agents, multilingual chatbots, and analytics pipelines.

In [4]:
def multi_task_few_shot(input_text, task):
    # Define a multi-task prompt with examples for sentiment and language detection
    few_shot_prompt = PromptTemplate(
        input_variables=["input_text", "task"],
        template="""
        Perform the specified task on the given text.

        Examples:
        Text: I love this product! It's amazing.
        Task: sentiment
        Result: Positive

        Text: Bonjour, comment allez-vous?
        Task: language
        Result: French

        Now, perform the following task:
        Text: {input_text}
        Task: {task}
        Result:
        """
    )

    # Create a chain that links the prompt to the language model
    chain = few_shot_prompt | llm
    # Invoke the chain with both the text and the task type
    return chain.invoke({"input_text": input_text, "task": task}).content.strip()

# Test case 1: Classify sentiment
print(multi_task_few_shot("I can't believe how great this is!", "sentiment"))
# Test case 2: Detect language
print(multi_task_few_shot("Guten Tag, wie geht es Ihnen?", "language"))

Positive
German


1. Prompt design:
   - We create a prompt that contains two task-specific examples:
     - One for sentiment classification.
     - One for language detection.
   - Each example includes a `Text`, a `Task`, and a `Result`. This shows the model how the task parameter influences the expected output.
   - The prompt ends with a new instruction: a fresh `Text` and a `Task`, for which the model must produce a `Result`.
2. Dynamic task switching:
   - The model is able to switch behavior depending on the task provided (e.g., "sentiment" vs "language").
   - This enables the reuse of a single model and prompt structure across multiple NLP tasks.
3. Execution pipeline:
   - We use LangChain’s `PromptTemplate` to format the prompt.
   - We pipe it into the `llm` object using `|` to create a chain.
   - The `invoke()` method sends the input variables to the model and returns the model’s prediction.
4. Task testing:
   - We test the function on two inputs:
     - A positive sentence for sentiment analysis.
     - A German phrase for language identification.


This approach works well because:
- Efficiency: We don’t need to fine-tune separate models for each task.
- Flexibility: New tasks can be added simply by extending the prompt with more examples.
- Generalization: The model can infer the rules for different tasks by analogy.
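
To make the flexibility point concrete, the sketch below builds the same multi-task prompt from a list of examples, so adding a task means appending one entry rather than editing the template. It uses plain string formatting only. The names `TASK_EXAMPLES` and `build_multi_task_prompt`, and the extra "date extraction" task, are hypothetical and not part of the notebook.

```python
# Hypothetical example store: one (text, task, result) entry per demonstration
TASK_EXAMPLES = [
    {"text": "I love this product! It's amazing.", "task": "sentiment", "result": "Positive"},
    {"text": "Bonjour, comment allez-vous?", "task": "language", "result": "French"},
]

def build_multi_task_prompt(examples, input_text, task):
    """Render the multi-task prompt as a plain string (illustrative helper)."""
    # Every demonstration uses the same Text / Task / Result layout
    blocks = [
        f"Text: {ex['text']}\nTask: {ex['task']}\nResult: {ex['result']}"
        for ex in examples
    ]
    return (
        "Perform the specified task on the given text.\n\n"
        "Examples:\n"
        + "\n\n".join(blocks)
        + "\n\nNow, perform the following task:\n"
        + f"Text: {input_text}\nTask: {task}\nResult:"
    )

# Adding a third task only requires one more entry in the list (made-up example)
extended = TASK_EXAMPLES + [
    {"text": "The meeting is on 5 May.", "task": "date extraction", "result": "5 May"},
]
print(build_multi_task_prompt(extended, "Invoice due 12 June.", "date extraction"))
```

The resulting string could be sent to the model directly, for example with `llm.invoke(prompt)`. Whether a single demonstration per task is enough for a new task is something to check empirically.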


## In-context learning (ICL)
In-context learning (ICL) is the mechanism behind few-shot prompting: the LLM learns how to perform a task from examples and instructions embedded directly in the prompt. Unlike traditional learning paradigms, ICL does not require any parameter updates or fine-tuning; the model adapts its behavior dynamically based on the given context.

In this section, we demonstrate in-context learning through a simple custom task: converting English words into Pig Latin. We will give the model a description of the task, a few input–output examples, and then ask it to perform the same transformation on a new input.

Use cases for in-context learning:
- Text transformations (e.g., translation, reformatting)
- Code generation with syntax patterns
- Structured data extraction
- Logical or arithmetic pattern recognition
- Any domain where patterns can be communicated through examples

In [5]:
def in_context_learning(task_description, examples, input_text):
    # Format the examples into a single string block
    example_text = "".join([f"Input: {e['input']}\nOutput: {e['output']}\n\n" for e in examples])

    # Construct the prompt using the task description, examples, and test input
    in_context_prompt = PromptTemplate(
        input_variables=["task_description", "examples", "input_text"],
        template="""
        Task: {task_description}

        Examples:
        {examples}

        Now, perform the task on the following input:
        Input: {input_text}
        Output:
        """
    )

    # Pipe the prompt to the language model to create a chain
    chain = in_context_prompt | llm
    # Invoke the chain with the specified inputs
    return chain.invoke({"task_description": task_description, "examples": example_text, "input_text": input_text}).content

# Task description for the model
task_desc = "Convert the given text to pig latin."
# Few-shot examples demonstrating the task pattern
examples = [
    {"input": "hello", "output": "ellohay"},
    {"input": "apple", "output": "appleay"}
]
# New input to test
test_input = "python"

# Run the model
result = in_context_learning(task_desc, examples, test_input)
print(f"Input: {test_input}")
print(f"Output: {result}")

Input: python
Output: Output: ythonpay


1. Task setup:
   - We define a *task description* ("Convert the given text to pig latin.") to inform the model what it’s expected to do.
   - We provide *few-shot examples* that illustrate the desired input-output behavior.
2. Prompt construction:
   - All examples are formatted into a single string and inserted into a structured prompt template along with the task and the new input.
   - The prompt clearly separates examples from the new task using phrases like: `"Now, perform the task on the following input:"`.
3. Model invocation:
   - The `PromptTemplate` is piped (`|`) to the language model (`llm`) to form a processing chain.
   - The `invoke()` method executes the chain with our variables and retrieves the model’s output.
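4. Inference: The model reads the pattern from the examples and generalizes it to the new input, `"python"`, producing `ythonpay`, which follows the same pattern as the examples. The function returns the raw response without post-processing, so the `Output:` label echoed by the model appears twice in the printed line.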


This approach works well because:
- No fine-tuning required: The model learns the task in real time from the examples — no retraining needed.
- Rapid prototyping: We can build and test new task behaviors in seconds by just changing the prompt.
- Model generalization: LLMs can often reproduce a transformation rule from only a few examples, as the Pig Latin run shows.
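
The printed result above contains the `Output:` label twice, because the model echoed the label and the function returns the raw reply. The sketch below shows one way to strip such an echoed label and to check the reply against a small reference implementation. The Pig Latin rule is an assumption inferred from the two examples (lowercase single words, `y` treated as a vowel except in first position), and both helper names are hypothetical.

```python
def strip_echoed_label(response: str, label: str = "Output:") -> str:
    """Remove a leading echoed label such as 'Output:' from a model reply."""
    text = response.strip()
    if text.lower().startswith(label.lower()):
        text = text[len(label):].strip()
    return text

def pig_latin(word: str) -> str:
    """Reference rule assumed from the two examples (one common Pig Latin variant)."""
    vowels = "aeiou"
    for i, ch in enumerate(word):
        # Treat 'y' as a vowel unless it is the first letter (assumption)
        if ch in vowels or (ch == "y" and i > 0):
            # Move the leading consonants to the end and append "ay"
            return word[i:] + word[:i] + "ay"
    # No vowel found: just append the suffix
    return word + "ay"

raw = "Output: ythonpay"  # raw model reply from the run above
cleaned = strip_echoed_label(raw)
print(cleaned, cleaned == pig_latin("python"))
```

A deterministic reference like this is only available for toy tasks, but it makes it easy to spot-check whether the model has picked up the intended rule.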

## Best practices and evaluation
Designing effective few-shot or in-context prompts requires more than just feeding the model a few examples. The quality, clarity, and structure of our prompts can significantly impact performance.

### Prompting best practices
To ensure few-shot and in-context learning yield reliable results:
- **Example selection**:
  - **Clarity** – Use clear, unambiguous examples that are easy for the model (and humans) to interpret. Avoid linguistic ambiguity or unclear task definitions.
  - **Diversity** – Cover different aspects and expressions of the task (e.g., positive, negative, and neutral sentiment).
  - **Relevance** – Choose examples that closely resemble the types of inputs your model will receive during inference.
  - **Balance** – Ensure that all relevant categories or classes are fairly represented to avoid model bias.
  - **Edge cases** – Include examples that are slightly unusual or potentially tricky. This helps the model generalize better.
- **Prompt engineering**:
  - **Explicit instructions** – Clearly state what task the model is being asked to perform. Ambiguity in task wording can lead to poor completions.
  - **Consistent format** – Maintain a uniform input-output structure across all examples. Repetition helps the model learn the pattern more effectively.
  - **Conciseness** – Avoid unnecessary context or verbose descriptions. Only include information essential to the task.
  - **Structural alignment** – Match the format and tone of your examples to the inputs the model will see during deployment.
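
The balance guideline can be checked mechanically before a prompt is used. The sketch below counts how often each label appears among the few-shot examples. The helper `label_balance` and the example set, which is deliberately skewed toward one class, are hypothetical.

```python
from collections import Counter

def label_balance(examples):
    """Return the share of each label among the few-shot examples."""
    counts = Counter(ex["label"] for ex in examples)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Hypothetical example set, skewed toward the Positive class
few_shot_examples = [
    {"text": "I love this product! It's amazing.", "label": "Positive"},
    {"text": "Great value for the price.", "label": "Positive"},
    {"text": "This movie was terrible. I hated it.", "label": "Negative"},
    {"text": "The weather today is okay.", "label": "Neutral"},
]

for label, share in label_balance(few_shot_examples).items():
    print(f"{label}: {share:.0%}")
```

A skew like this one does not necessarily hurt accuracy, but it is worth knowing about when misclassifications cluster around the over-represented class.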


### Evaluate model performance
Evaluating a language model’s performance is a critical step in ensuring it behaves reliably and accurately in real-world scenarios. In the context of few-shot and in-context learning, evaluation allows us to validate the effectiveness of our prompt design and understand how well the model generalizes to new, unseen inputs based on limited examples.

Evaluation helps us:
- **Validate prompt effectiveness**: It tells us whether the examples and structure used in our prompt are actually helping the model understand the task.
- **Identify weaknesses**: Repeated errors or consistent misclassifications (e.g., mistaking neutral statements for positive) highlight areas where the prompt might need refinement.
- **Benchmark progress**: As we iterate on our few-shot design, evaluation metrics help track whether changes lead to meaningful improvements.
- **Support decision-making**: Accurate evaluation gives us confidence before deploying models into production environments, where mistakes may have more serious consequences.


To assess how well our few-shot sentiment classifier performs, we define an evaluation function. This function takes a prediction function and a set of labeled test cases and computes the accuracy.

In [6]:
def evaluate_model(model_func, test_cases):
    '''
    Evaluate the model on a set of test cases.

    Args:
    model_func: The function that makes predictions.
    test_cases: A list of dictionaries, where each dictionary contains an "input" text and a "label" for the input.

    Returns:
    The accuracy of the model on the test cases.
    '''
    correct = 0
    total = len(test_cases)

    for case in test_cases:
        input_text = case['input']
        true_label = case['label']
        # Call the prediction function with the input
        prediction = model_func(input_text).strip()

        # Compare prediction to ground truth (case-insensitive)
        is_correct = prediction.lower() == true_label.lower()
        correct += int(is_correct)

        # Print detailed result
        print(f"Input: {input_text}")
        print(f"Predicted: {prediction}")
        print(f"Actual: {true_label}")
        print(f"Correct: {is_correct}\n")

    # Compute accuracy as correct predictions / total predictions (0.0 for an empty test set)
    accuracy = correct / total if total else 0.0
    return accuracy

# Define a small test set of labeled examples
test_cases = [
    {"input": "This product exceeded my expectations!", "label": "Positive"},
    {"input": "I'm utterly disappointed with the service.", "label": "Negative"},
    {"input": "The temperature today is 72 degrees.", "label": "Neutral"}
]

# Evaluate the few-shot sentiment classifier
accuracy = evaluate_model(few_shot_sentiment_classification, test_cases)
# Report model performance
print(f"Model Accuracy: {accuracy:.2f}")

Input: This product exceeded my expectations!
Predicted: Positive
Actual: Positive
Correct: True

Input: I'm utterly disappointed with the service.
Predicted: Negative
Actual: Negative
Correct: True

Input: The temperature today is 72 degrees.
Predicted: Neutral
Actual: Neutral
Correct: True

Model Accuracy: 1.00


1. We pass a list of test cases, where each case contains:
   - An input text (e.g., a customer review).
   - A correct label (e.g., `"Positive"`).
2. The evaluation function loops through each test case:
   - It calls the model function to generate a prediction.
   - It compares the predicted sentiment to the actual label.
   - It logs the results and counts correct predictions.
3. Accuracy is computed as the number of correct predictions divided by the total number of test cases.

We can extend this approach with:
- Larger and more diverse test sets. More data allows us to better estimate generalization performance across varied sentence styles and topics.
- F1 score or precision/recall metrics, especially for imbalanced classes.
- Error analysis to understand failure patterns.
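
As a minimal sketch of the precision/recall extension, the function below computes per-class precision, recall, and F1 from two parallel lists of labels, using only the standard library. The function name and the sample labels are made up for illustration; they are not results from the notebook.

```python
def per_class_metrics(true_labels, predicted_labels):
    """Compute precision, recall, and F1 for each class from two label lists."""
    pairs = list(zip(true_labels, predicted_labels))
    metrics = {}
    for cls in sorted(set(true_labels) | set(predicted_labels)):
        tp = sum(1 for t, p in pairs if t == cls and p == cls)  # true positives
        fp = sum(1 for t, p in pairs if t != cls and p == cls)  # false positives
        fn = sum(1 for t, p in pairs if t == cls and p != cls)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[cls] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics

# Hypothetical labels, for illustration only (not results from the notebook)
y_true = ["Positive", "Negative", "Neutral", "Neutral"]
y_pred = ["Positive", "Negative", "Positive", "Neutral"]
for cls, m in per_class_metrics(y_true, y_pred).items():
    print(cls, m)
```

In this made-up sample, one neutral sentence is predicted as positive, which lowers precision for `Positive` and recall for `Neutral` while overall accuracy alone would only report 3 out of 4 correct.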