[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jeljov/NAP2025/blob/main/LangChain_Zero_and_Few_Shot_Learning_for_Requirements_Tagging.ipynb)

# Zero shot and Few shot learning for text classification (tagging)

In this notebook we'll explore the use of LangChain for **few shot learning** in the context of text classification (tagging).

**Few shot learning** is a form of machine learning where a model is taught to perform a particular task, such as classification or named entity recognition, by being provided with a few examples only, instead of a large training set, as is the case in 'traditional' machine learning tasks.

Like any other task based on large language models (LLMs), few shot learning relies on the use of appropriate prompts. LangChain offers prompt template classes specifically built for few shot learning, which we will use to classify a set of software requirements into functional and non-functional.

Note: there is also **zero-shot learning**, where the model is asked to perform a specific task without being given any examples, only guidance / instructions in natural language (the prompt).


We start by installing the required Python packages...

In [None]:
!pip -q install langchain langchain_community langchain_groq

In [None]:
# adding packages for OpenAI's models in case we run out of free access to Groq
# !pip install -q openai langchain-openai

To select examples for few-shot learning, we will use [`faiss-cpu`](https://pypi.org/project/faiss-cpu/), the CPU-only Python package of Facebook's [FAISS](https://github.com/facebookresearch/faiss) library for efficient similarity search and clustering of dense vectors, such as text embeddings.

In [None]:
!pip -q install faiss-cpu

Finally, we need the libraries for the embedding model we will use to create vector representations of the examples for few-shot learning. Such an embedding-based representation is required for searching for relevant examples, as will be explained later on.


In [None]:
!pip -q install sentence_transformers FlagEmbedding langchain-huggingface

### Instantiate an LLM chat model

As before, we will use the Groq API to access Meta's Llama 3.1 8B model. More precisely, we will leverage [LangChain's integration with the Groq API](https://python.langchain.com/docs/integrations/chat/groq/), which further simplifies instantiation of the LLMs that Groq offers.

To use Groq services, we need to load the Groq API key into the current application environment (as shown below).

For more details about the Llama 3 model and access to it via the Groq API, see [the LangChain Intro notebook](https://github.com/jeljov/NAP2025/blob/main/LangChain_Intro.ipynb).

In [None]:
import os
from google.colab import userdata
from langchain_groq import ChatGroq

os.environ["GROQ_API_KEY"] = userdata.get('GROQ_API_KEY')

llm = ChatGroq(model="llama-3.1-8b-instant",
               temperature=0.0)

The code below instantiates an OpenAI chat model, in case we run out of free tokens on Groq:

In [None]:
# from langchain_openai import ChatOpenAI

# os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')

# llm = ChatOpenAI(model='gpt-4o-mini', temperature=0.0)

## Zero shot learning through prompting

Let's see an example where we will try to instruct an LLM to classify software requirements into functional and non-functional.

We start by checking if the model is 'aware' of these two main categories of software requirements:

In [None]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

system_message = """
You are an excellent software engineer with a lot of experience in gathering and managing software requirements.
You assist the user by answering their questions about software requirements in a clear and concise manner. Restrict your answer to one paragraph only.
"""

user_message = "What are {req_type} requirements?"

initial_prompt = ChatPromptTemplate.from_messages([
    ('system', system_message),
    ('human', user_message),
])

simple_chain = initial_prompt | llm | StrOutputParser()

In [None]:
resp_functional_req = simple_chain.invoke('functional')
print(resp_functional_req)

We will add a simple utility function that wraps the text so that it is easier to read the model's output:

In [None]:
def wrap_and_print(text, width=100):
  import textwrap
  wrapped_text = textwrap.fill(text, width=width)
  print(wrapped_text +'\n')

In [None]:
wrap_and_print(resp_functional_req)

In [None]:
resp_nonfunctional_req = simple_chain.invoke('non-functional')
wrap_and_print(resp_nonfunctional_req)

Since the LLM is quite knowledgeable about software requirements, we may first try classification of requirements into functional and non-functional via **zero-shot learning**:

In [None]:
from langchain_core.output_parsers import JsonOutputParser

zero_shot_system_msg = """
You are an experienced software engineer with expertise in requirements engineering. Please help the user classify software requirements into functional and non-functional.
The user will give you a list of software requirements, each requirement starting in a new line. Classify each requirement in the user's list into one of the following two categories "Functional" or "Non-functional".
Think through each requirement in the user's list before classifying it. If you don't know how to classify a requirement, assign to it the "None" label.
Return your response as a JSON list of tuples, where the first element in the tuple is the text of a requirement from the user's list, while the second element is the category assigned to that requirement (or "None" if you do not know how to classify it).
Your response should include only this JSON list and nothing else.
"""

zero_shot_user_msg="Software requirements to classify:\n{requirements}"

zero_shot_template = ChatPromptTemplate.from_messages([
    ('system', zero_shot_system_msg),
    ('human', zero_shot_user_msg),
])

zero_shot_simple_chain = zero_shot_template | llm | JsonOutputParser()

As an initial set of data for testing the zero-shot prompting, we'll use a small set of requirements, in the form of user stories, for an (imaginary) music recommender mobile app:

In [None]:
sample_requirements_list = [
    "I want to sign up with my email and password so I can access the app's features.",
    "I want to create, name, and modify a custom playlist so I can organize my favorite tracks.",
    "I want the music recommendation list to load in less than two seconds so I don't have to wait.",
    "I expect all my listening history and personal data to be securely encrypted so my information is protected.",
    "I want to be able to save a recommended song for offline playback so I can listen to it without an internet connection.",
    "I need the app to be available 99.9% of the time so I can reliably listen to music during my travels.",
    "I need the app's interface to look and behave consistently across all screens so it's intuitive and easy to use.",
    "I want to be able to search for music by typing an artist name, song title, or genre so I can quickly find what I'm looking for.",
    "I need the app to continue playing music if I briefly lose my cellular connection so my listening experience isn't interrupted.",
    "I want the app to use no more than 100MB of storage for cached data so my phone doesn't run out of space.",
    "I want to be able to 'Like' or 'Dislike' a recommended song so the system can improve future recommendations."
]

sample_requrements_str = "- " + "\n- ".join(sample_requirements_list)

In [None]:
print(zero_shot_template.format(requirements = sample_requrements_str))

In [None]:
zero_shot_classes = zero_shot_simple_chain.invoke(sample_requrements_str)

In [None]:
type(zero_shot_classes)

In [None]:
for i, classified_req in enumerate(zero_shot_classes):
  req, cls = classified_req
  print(f"{i+1}. {req}: {cls.upper()}")
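The loop above assumes that every item in the model's reply is a well-formed pair with a known label, which an LLM does not guarantee. The sketch below shows one way to check the reply before using it. It relies only on the standard `json` module; the function name `validate_classification` and the reply string `toy_reply` are invented for this illustration and are not part of LangChain.

```python
import json

ALLOWED_LABELS = {"Functional", "Non-functional", "None"}

def validate_classification(raw_json):
    """Parse a JSON reply and split its items into valid pairs and rejected items."""
    parsed = json.loads(raw_json)
    if not isinstance(parsed, list):
        raise ValueError("expected a JSON list at the top level")
    valid, rejected = [], []
    for item in parsed:
        # a valid item is a two-element list of strings whose second element is a known label
        is_pair = isinstance(item, list) and len(item) == 2
        if is_pair and all(isinstance(x, str) for x in item) and item[1] in ALLOWED_LABELS:
            valid.append((item[0], item[1]))
        else:
            rejected.append(item)
    return valid, rejected

# a made-up reply containing two good items and two malformed ones
toy_reply = (
    '[["Sign up with email", "Functional"], '
    '["Load in under 2 s", "Non-functional"], '
    '["Broken item"], ["Odd label", "Maybe"]]'
)
valid_pairs, rejected_items = validate_classification(toy_reply)
print(f"valid: {len(valid_pairs)}, rejected: {len(rejected_items)}")
```

Rejected items can then be inspected manually or sent back to the model, instead of causing an unpacking error in the middle of the loop.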

This worked well, but as few-shot generally performs better than zero-shot and this was just a tiny sample, we'll next explore few-shot learning.

## Few shot learning

The idea behind **few shot learning** is to provide an LLM with some additional knowledge and / or instructions on how to use the knowledge it already has, so that it can perform a particular task.

Note: knowledge that an LLM 'develops' through training (both pre-training and instruction and preference tuning) is often referred to as **parametric knowledge**, whereas any additional knowledge and / or instruction it receives later, during the actual use is known as **contextual knowledge**.

In the case of chat LLMs, this *contextual knowledge* comes in the form of a few pairs of interactions between the user and the chatbot, serving as examples of what we expect the model to do after receiving an input from the user. It has been shown that, in the large majority of cases, a few examples (3-4), and sometimes even a single one, suffice to instruct a model well for almost any specific task.
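To make this concrete, the sketch below builds the message list that such a few-shot prompt amounts to. It uses plain Python tuples rather than LangChain classes; the helper `build_few_shot_messages` and the two example requirements are made up for illustration only.

```python
def build_few_shot_messages(system_msg, examples, user_input):
    """Return a chat history: system message, example (user, ai) pairs, then the real input."""
    messages = [("system", system_msg)]
    for ex in examples:
        messages.append(("user", ex["requirement"]))  # what a user would send
        messages.append(("ai", ex["type"]))           # the reply we expect from the model
    messages.append(("user", user_input))             # the input we actually want classified
    return messages

# hypothetical examples, invented for this sketch
toy_examples = [
    {"requirement": "The system shall let users reset their password.", "type": "Functional"},
    {"requirement": "The system shall respond within 2 seconds.", "type": "Non-functional"},
]

toy_messages = build_few_shot_messages(
    "Classify the requirement as Functional or Non-functional.",
    toy_examples,
    "The app shall export reports as PDF.",
)
for role, text in toy_messages:
    print(f"{role}: {text}")
```

From the model's point of view, the example pairs look like earlier turns of the same conversation, which is why it tends to answer the final message in the same format.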

A nice quick intro to few shot learning (through prompting) is given in [this](https://youtu.be/JSBjj09xJeM?si=DJwLOeFHX7KOTGq2) short (5min) lecture from U. of Texas.

One of the main challenges in few-shot learning is the selection of examples for the model to learn from, as these should be sufficiently representative and distinct from one another. LangChain offers support for example selection, as we'll explore next.

### LangChain's support for few shot learning


LangChain supports few shot learning for chat models through a combination of two prompt template classes - [`ChatPromptTemplate`](https://reference.langchain.com/python/langchain_core/prompts/#langchain_core.prompts.chat.ChatPromptTemplate) and `FewShotChatMessagePromptTemplate`.

`FewShotChatMessagePromptTemplate` is particularly useful as it allows us to seamlessly select examples to be included based on different criteria such as the length of the prompt or mutual similarity of the examples. It also allows for selecting examples based on the user's input (how similar examples are to the input). We'll now examine how such dynamic inclusion/exclusion of examples works.
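As an illustration of the simplest of these criteria, prompt length, here is a plain-Python sketch of the idea: examples are added in order until a word budget is used up, so longer inputs leave room for fewer examples. This is not LangChain's implementation; the function `select_by_length`, the `max_words` budget, and the example texts are assumptions made for this sketch.

```python
def select_by_length(examples, user_input, max_words=25):
    """Add examples in order while the total word count stays within max_words."""
    budget = max_words - len(user_input.split())
    selected = []
    for ex in examples:
        ex_len = len(ex["requirement"].split()) + len(ex["type"].split())
        if ex_len > budget:
            break  # the next example would not fit, so stop here
        selected.append(ex)
        budget -= ex_len
    return selected

# hypothetical examples, invented for this sketch
length_examples = [
    {"requirement": "The system shall let users reset their password.", "type": "Functional"},
    {"requirement": "The system shall respond within 2 seconds.", "type": "Non-functional"},
    {"requirement": "The app shall run on Android and iOS.", "type": "Non-functional"},
]

short_input = "Users can log in."
long_input = "Users can log in with email and password or a passkey."

print(len(select_by_length(length_examples, short_input)))
print(len(select_by_length(length_examples, long_input)))
```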





To learn about dynamic few-shot prompting (i.e., few-shot prompting with dynamic example selection), we will use a dataset of software requirements with human-assigned functional / non-functional labels. The dataset originates from the "[Software requirements dataset](https://www.kaggle.com/datasets/iamvaibhav100/software-requirements-dataset)" available on Kaggle.com.

We will start by loading the dataset (stored locally in the `software_requirements_processed.csv` file) and selecting a small sample of requirements to serve as examples for few shot learning, whereas the remaining ones will be used for evaluation purposes.

In [None]:
from google.colab import files

data_file = files.upload()

The uploaded object is a dictionary with the file name as its key and the file content as its value. So, we can extract the file name and read in the file content using pandas (to have it in the pandas DataFrame format):

In [None]:
import pandas as pd

file_name = list(data_file.keys())[0]
req_data = pd.read_csv(file_name)
req_data.shape

In [None]:
req_data.head(10)

We can examine the distribution of functional / non-functional requirements in this dataset:

In [None]:
req_data.Category.value_counts(normalize=True)

Next, we take a small random sample - say 50 requirements - from which we will select (a much smaller subset of) examples for few shot learning:

In [None]:
rand_state = 9 # can be any number; used for initialising the random numbers generator
req_ex_sample = req_data.sample(50, random_state=rand_state)
req_ex_sample.Category.value_counts()

This random set of labelled requirements needs to be turned into a list of examples, where each example is represented as a dictionary of two items: a requirement and its corresponding type (functional / non-functional), as exemplified below:

```
examples = [
    {
      "requirement": "This is a placeholder for a real software requirement",
      "type": "Functional"
    },
    {
      "requirement": "This is a placeholder for a real software requirement",
      "type": "Non-functional"
    },
    #....
]
```
To create such a list, we change the name of the data frame columns and transform the data frame into a list of dictionaries:

In [None]:
req_ex_sample.rename(columns={'Requirement':'requirement', 'Category':'type'}, inplace=True)
req_ex_list = req_ex_sample[['requirement','type']].to_dict('records')
req_ex_list[:5]

To select a subset of examples to be used for few shot learning, we will use LangChain's `MaxMarginalRelevanceExampleSelector`, an example selector that selects examples so that they are highly similar to the input, while also optimizing for their mutual diversity. It does this by finding examples that have the highest similarity with the inputs, and then iteratively adding them while penalizing them for similarity to the already selected examples.

To compute the similarity between the user's input and the examples, as well as the similarity among the examples, we need to represent the inputs and the examples in a numerical format suitable for (similarity) computation. To that end, we will use **text embeddings**, currently the most widely used approach for representing textual content.
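The selection procedure described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not LangChain's code: the 2-dimensional vectors stand in for real embeddings, and the names `mmr_select` and `lambda_mult` (the weight that balances relevance against diversity) are chosen for this sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mmr_select(query_vec, candidate_vecs, k=2, lambda_mult=0.5):
    """Return indices of k candidates, trading off relevance against redundancy."""
    selected = []
    remaining = list(range(len(candidate_vecs)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query_vec, candidate_vecs[i])
            # redundancy: highest similarity to anything already selected (0 if nothing is)
            redundancy = max(
                (cosine(candidate_vecs[i], candidate_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

toy_query = [1.0, 0.0]
toy_candidates = [
    [0.9, 0.1],    # 0: very similar to the query
    [0.88, 0.12],  # 1: near-duplicate of candidate 0
    [0.6, -0.8],   # 2: less similar to the query, but different from candidate 0
]

print(mmr_select(toy_query, toy_candidates, k=2, lambda_mult=0.5))
print(mmr_select(toy_query, toy_candidates, k=2, lambda_mult=1.0))
```

With `lambda_mult=1.0` the penalty vanishes and the selection reduces to plain similarity ranking, so the near-duplicate is picked; with `lambda_mult=0.5` the more distinct candidate takes its place.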

Specifically, we will use the [bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5) open source model for creating text embeddings. It was selected due to its small size and good performance. It was developed by the Beijing Academy of AI and, like other open source language models, it is available from the [HuggingFace Hub](https://huggingface.co/models).

You may want to explore the embedding models leaderboard at HuggingFace: [https://huggingface.co/spaces/mteb/leaderboard](https://huggingface.co/spaces/mteb/leaderboard) to find alternative models you may want to try out.

Note that LangChain directly supports a large number of embedding models; for a list of supported models and instructions (code examples) on how to use them, go to [Integrations > Embedding models](https://python.langchain.com/docs/integrations/text_embedding/) in the LangChain documentation.

A reminder: To access a model hosted at HuggingFace Hub, you'll need a HuggingFace (HF) access token. To obtain such a token, you should first create an account at [HuggingFace.co](https://huggingface.co/). Then, click on your profile in the top-right corner, then *Settings*, then *Access Tokens*, then *New Token*, set *Role* to *write* and *Generate*.

In [None]:
from langchain_huggingface import HuggingFaceEmbeddings

os.environ["HF_TOKEN"] = userdata.get('HF_TOKEN')

model_name = "BAAI/bge-large-en-v1.5"

bge_embeddings = HuggingFaceEmbeddings(
  model_name=model_name,
  model_kwargs={'device': 'cpu'}, # tells it to use CPU; alternative is 'cuda' to run on GPU
  encode_kwargs={'normalize_embeddings': True} # normalising embeddings as cosine similarity will be used later
)

Next, we instantiate the `MaxMarginalRelevanceExampleSelector`, with the BGE embeddings model and [FAISS](https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/) as the vector store.

FAISS is an in-memory vector store, which allows for quick search of (embedded / vectorised) documents based on different similarity measures.
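To give an intuition of what a vector store does, here is a deliberately naive sketch: it keeps normalised vectors in a list and answers a query by scanning all of them. FAISS relies on far more efficient index structures; the class `ToyVectorStore`, its methods, and the 3-dimensional vectors are invented for this illustration. Since the vectors are normalised, their dot product equals their cosine similarity.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

class ToyVectorStore:
    """Keeps (vector, document) pairs in memory and searches them exhaustively."""

    def __init__(self):
        self._vectors = []
        self._docs = []

    def add(self, vector, doc):
        self._vectors.append(normalize(vector))
        self._docs.append(doc)

    def search(self, query_vector, k=2):
        """Return the k (score, document) pairs with the highest dot product."""
        q = normalize(query_vector)
        scored = [
            (sum(a * b for a, b in zip(q, v)), doc)
            for v, doc in zip(self._vectors, self._docs)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]

store = ToyVectorStore()
store.add([1.0, 0.0, 0.0], "requirement about login")
store.add([0.0, 1.0, 0.0], "requirement about speed")
store.add([1.0, 1.0, 0.0], "requirement about fast login")

top_hits = store.search([1.0, 0.2, 0.0], k=2)
for score, doc in top_hits:
    print(f"{score:.3f}  {doc}")
```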

In [None]:
from langchain_core.example_selectors.semantic_similarity import MaxMarginalRelevanceExampleSelector
from langchain_community.vectorstores.faiss import FAISS

mmr_example_selector = MaxMarginalRelevanceExampleSelector.from_examples(
    # the list of examples available to select from
    examples=req_ex_list,
    # the embeddings model to be used to represent the examples, so that their semantic similarity can be computed
    embeddings=bge_embeddings,
    # The VectorStore class that is used to store the embeddings and do a similarity search over
    vectorstore_cls=FAISS,
    # The number of examples to select
    k=5,
    # the key of the example element used for search / selection;
    # should not be used when this selector is used in FewShotChatMessagePromptTemplate (as we do further below)
    # input_keys=['requirement']
)

We can first try it out on a sample requirement:

In [None]:
sample_input = {"requirement": "A new user shall be able to navigate through the league and team pages within 30 seconds of reaching the start-up page."}

selected_examples = mmr_example_selector.select_examples(sample_input)

for ex in selected_examples:
  print(f"Requirement: \"{ex['requirement']}\": {ex['type']}")

Now we can create a few shot prompt with examples selected by the MMR example selector. This task is facilitated by LangChain's `FewShotChatMessagePromptTemplate` class, as shown below:

In [None]:
from langchain_core.prompts.few_shot import FewShotChatMessagePromptTemplate

few_shot_mmr_prompt = FewShotChatMessagePromptTemplate(
    # setting the example selector we've created above
    example_selector=mmr_example_selector,
    # this is how we want to format the examples when we insert them into the prompt
    example_prompt=ChatPromptTemplate.from_messages([
        ("user", "{requirement}"),
        ("ai", "{type}"),
    ]),
    # the variable(s) that will receive the user's input; it has to be named 'input'
    input_variables=["input"]
)

In [None]:
few_shot_system_msg = """
You are an experienced software engineer with expertise in requirements engineering. Please help the user classify software requirements into functional and non-functional.
Think through the requirement the user gives you before responding.
Limit your response to one term: "Functional" or "Non-functional".
If you don't know the answer, respond with "None".

Below are a couple of examples of classified requirements, to help you with the classification task.
EXAMPLES:"""

final_few_shot_mmr_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", few_shot_system_msg),
        few_shot_mmr_prompt,
        ("user", "Requirement to classify:\n{input}"),
    ]
)

In [None]:
test_req = "The current repair facility ratings shall be displayed to the user."

print(final_few_shot_mmr_prompt.format(input=test_req))

In [None]:
few_shot_mmr_chain = final_few_shot_mmr_prompt | llm | StrOutputParser()

In [None]:
few_shot_mmr_chain.invoke({"input":test_req})

## Evaluation with human labelled data

Let's now take the remaining part of the dataset we've loaded to examine how well the model performs the classification task. To that end, we will first drop the instances that were used for example selection (to avoid 'data leakage'):

In [None]:
example_indices = req_ex_sample.index.tolist()
req_data.drop(example_indices, inplace=True)

In [None]:
req_data.shape

For evaluation purposes, we will take a random sample of requirements that the model has not seen. We are taking a small sample, in order not to overstep the available free time / tokens on Groq.

In [None]:
eval_sample_size = 50

req_eval_sample = req_data.sample(eval_sample_size, random_state=rand_state)
req_eval_sample.reset_index(drop=True, inplace=True)
req_eval_sample.head()

Now, we can pass the test requirements to the previously built `few_shot_mmr_chain` and add a new column to the data frame based on the obtained results:

In [None]:
from time import sleep

def llm_tagging(requirement):

  try:
    llm_response = few_shot_mmr_chain.invoke(requirement)
    # print(llm_response)
  except Exception as e:
    print(f"The following error occurred when tagging requirement \"{requirement}\": {e}")
    llm_response = None

  # pause for 5 seconds between requests, to stay within the API rate limits
  sleep(5)

  return llm_response

req_eval_sample['Predicted_category'] = req_eval_sample.Requirement.apply(llm_tagging)
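A fixed pause works, but a failed call is simply recorded as `None`. An alternative is to retry failed calls with an increasing delay. The sketch below illustrates the idea with the standard library only; `call_with_retry`, its parameters, and the `flaky_tagger` stand-in for the LLM chain are hypothetical names introduced here. The `sleep_fn` argument is swapped for `waits.append` in the demo so that it records the delays instead of actually waiting.

```python
import time

def call_with_retry(func, arg, max_attempts=3, base_delay=1.0, sleep_fn=time.sleep):
    """Call func(arg); after a failure wait base_delay * 2**attempt and try again."""
    for attempt in range(max_attempts):
        try:
            return func(arg)
        except Exception as e:
            if attempt == max_attempts - 1:
                print(f"Giving up after {max_attempts} attempts: {e}")
                return None
            sleep_fn(base_delay * 2 ** attempt)

# a stand-in for the LLM chain that fails on its first two calls
calls = {"n": 0}

def flaky_tagger(requirement):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limit exceeded")
    return "Functional"

waits = []  # collects the delays instead of sleeping
result = call_with_retry(flaky_tagger, "The app shall export reports.", sleep_fn=waits.append)
print(result, waits)
```

In the notebook, `func` could be a small wrapper around `few_shot_mmr_chain.invoke`, with the default `sleep_fn` left in place.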

In [None]:
req_eval_sample[['Requirement', 'Category', 'Predicted_category']]

Check the level of matching between human labels and the labels assigned by the LLM:

In [None]:
from statistics import mean

def label_match(row):
  human_lbl = row['Category']
  llm_lbl = row['Predicted_category']
  return human_lbl == llm_lbl

req_type_match = req_eval_sample.apply(label_match, axis=1)
print(f"Number of matched labels: {sum(req_type_match)}")
print(f"Proportion of matched labels: {mean(req_type_match):.4f}")
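Note that the comparison above is an exact string match, so a reply such as `" functional."` would count as a mismatch even though the model chose the right class. Below is a minimal sketch of normalising the replies before comparing them. It assumes that the dataset labels are spelled `Functional` and `Non-functional`; the function name `normalize_label` is made up for this illustration.

```python
def normalize_label(label):
    """Map a raw model reply to 'Functional', 'Non-functional', or None."""
    if label is None:
        return None
    # drop surrounding whitespace, quotes and full stops; ignore case and spacing
    cleaned = label.strip().strip('."\'').casefold().replace(" ", "-")
    if cleaned == "functional":
        return "Functional"
    if cleaned in {"non-functional", "nonfunctional"}:
        return "Non-functional"
    return None  # anything else is treated as 'no usable answer'

raw_replies = [" functional.\n", '"Non-Functional"', "Non functional", "I am not sure", None]
for raw in raw_replies:
    print(repr(raw), "->", normalize_label(raw))
```

Applying such a function to the `Predicted_category` column before computing the match would make the evaluation less sensitive to formatting differences.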

We can also compute the regular classification evaluation metrics (confusion matrix, precision, recall, F1):

In [None]:
def compute_confusion_matrix_stats(positive_cls:str, labelled_req: pd.DataFrame) -> dict:
  TP = FP = TN = FN = 0
  for _, row in labelled_req.iterrows():
    true_val = row['Category']
    pred_val = row['Predicted_category']
    if true_val == pred_val:
      if true_val == positive_cls: TP += 1
      else: TN += 1
    else:
      if pred_val == positive_cls: FP += 1
      else: FN += 1
  return {
      'TP': TP, 'FP': FP, 'TN':TN, 'FN': FN
  }

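def compute_classification_metrics(tp:int, tn:int, fp:int, fn:int) -> dict:
  # guard against division by zero (e.g., when the positive class is never predicted)
  precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
  recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
  f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
  return {
      'precision': precision,
      'recall': recall,
      'F1': f1
  }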


In [None]:
positive_cls = 'Functional'

cm_stats = compute_confusion_matrix_stats(positive_cls, req_eval_sample)
print(cm_stats)

In [None]:
cls_metrics = compute_classification_metrics(tp=cm_stats['TP'],
                               tn=cm_stats['TN'],
                               fp=cm_stats['FP'],
                               fn=cm_stats['FN'])

for metric, val in cls_metrics.items():
  print(f"{metric}: {val:.4f}")
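For reference, these are the standard definitions of the three metrics in terms of the confusion matrix counts, followed by a worked example. The counts in the example ($TP = 20$, $FP = 5$, $FN = 10$) are made up for illustration and are not results from this notebook.

$$
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
$$

$$
\text{precision} = \frac{20}{25} = 0.8, \qquad
\text{recall} = \frac{20}{30} \approx 0.667, \qquad
F_1 = \frac{2 \cdot 0.8 \cdot 0.667}{0.8 + 0.667} \approx 0.727
$$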

Let's examine where the LLM made mistakes:

In [None]:
req_eval_sample['correct_cls'] = req_type_match
req_eval_sample.loc[~req_eval_sample.correct_cls, ['Requirement','Category','Predicted_category']]

In [None]:
fname = "sample_requirements_labelled.csv"
req_eval_sample.to_csv(fname, index=False)
files.download(fname)