# Select by Maximal Marginal Relevance example selector (MMR)
This section selects example questions that match the user's question using maximal marginal relevance (MMR).

According to the source below, MMR looks for similar examples while also trying to keep them diverse: the first example selected is the one most similar to the input, and each subsequent example should differ as much as possible from those already selected while still being similar to the input.

Source: https://medium.com/@larry_nguyen/langchain-101-lesson-2-example-selectors-37b891ca9268
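
The selection rule described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not LangChain's implementation: the function names `cosine` and `mmr_select`, the `lambda_mult` weighting, and the toy 2-D vectors are all hypothetical and chosen only to show the trade-off between relevance and diversity.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, candidate_vecs, k=2, lambda_mult=0.5):
    """Greedy MMR sketch: returns the indices of up to k candidates.

    lambda_mult (hypothetical name) weights relevance to the query against
    redundancy with the candidates that were already selected.
    """
    relevance = [cosine(query_vec, c) for c in candidate_vecs]
    selected = []
    remaining = list(range(len(candidate_vecs)))

    while remaining and len(selected) < k:

        def score(i):
            # Redundancy is the highest similarity to anything already picked.
            redundancy = max(
                (cosine(candidate_vecs[i], candidate_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)

    return selected


# Toy vectors (made up for illustration only).
query = [1.0, 0.0]
candidates = [
    [0.9, 0.1],    # very close to the query
    [0.89, 0.11],  # near-duplicate of the first candidate
    [0.6, -0.8],   # less similar to the query, but points elsewhere
]

picked = mmr_select(query, candidates, k=2)
print(picked)
```

With these toy vectors the near-duplicate is skipped in favour of the more distinct candidate, which is the behaviour the quoted description attributes to MMR.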

In [8]:
!pip install faiss-cpu



In [11]:
!pip install tiktoken






### Initial example test from online article

In [13]:
from langchain_community.vectorstores import FAISS
from langchain_core.example_selectors import MaxMarginalRelevanceExampleSelector
from langchain_core.prompts import FewShotPromptTemplate, PromptTemplate
from langchain_openai import OpenAIEmbeddings

example_prompt = PromptTemplate(
    input_variables=["input", "output"],
    template="Input: {input}\nOutput: {output}",
)

# Examples of a pretend task of creating antonyms.
examples = [
    {"input": "happy", "output": "sad"},
    {"input": "tall", "output": "short"},
    {"input": "energetic", "output": "lethargic"},
    {"input": "sunny", "output": "gloomy"},
    {"input": "windy", "output": "calm"},
]

example_selector = MaxMarginalRelevanceExampleSelector.from_examples(
    # The list of examples available to select from.
    examples,
    # The embedding class used to produce embeddings which are used to measure semantic similarity.
    OpenAIEmbeddings(),
    # The VectorStore class that is used to store the embeddings and do a similarity search over.
    FAISS,
    # The number of examples to produce.
    k=2,
)
mmr_prompt = FewShotPromptTemplate(
    # We provide an ExampleSelector instead of examples.
    example_selector=example_selector,
    example_prompt=example_prompt,
    prefix="Give the antonym of every input",
    suffix="Input: {adjective}\nOutput:",
    input_variables=["adjective"],
)
# Input is a feeling, so should select the happy/sad example as the first one
print(mmr_prompt.format(adjective="worried"))

Give the antonym of every input

Input: happy
Output: sad

Input: windy
Output: calm

Input: worried
Output:


### Applying the method to our use case (example questions) with MMR

In [29]:
from langchain_community.vectorstores import FAISS
from langchain_core.example_selectors import MaxMarginalRelevanceExampleSelector
from langchain_core.prompts import FewShotPromptTemplate, PromptTemplate
from langchain_openai import OpenAIEmbeddings
import json

example_prompt = PromptTemplate(
    input_variables=["question"],
    template="Similar Question: {question}",
)

with open('questions_and_queries.json', 'r', encoding='utf-8') as file:
    example_data = json.load(file)

# Change the key from "user_question" to "question"
example_questions = [{"question": item["user_question"]} for item in example_data]

example_selector = MaxMarginalRelevanceExampleSelector.from_examples(
    # The list of examples available to select from.
    example_questions,
    # The embedding class used to produce embeddings which are used to measure semantic similarity.
    OpenAIEmbeddings(),
    # The VectorStore class that is used to store the embeddings and do a similarity search over.
    FAISS,
    # The number of examples to produce.
    k=2,
)

mmr_prompt = FewShotPromptTemplate(
    # We provide an ExampleSelector instead of examples.
    example_selector=example_selector,
    example_prompt=example_prompt,
    prefix="Give the question similar to the input question",
    suffix="Your Question: {question}",
    input_variables=["question"],
)
# The input asks about decisions, so a decisions-related example should be selected first
print(mmr_prompt.format(question="What were the last decisions about high school?"))

Give the question similar to the input question

Similar Question: Wat zijn de laatste 10 besluiten en wanneer zijn deze gepubliceerd?

Similar Question: Where can I go swimming?

Your Question: What were the last decisions about high school?


# Select by Semantic Similarity example selector (SS)

Source: https://python.langchain.com/v0.1/docs/modules/model_io/prompts/example_selectors/similarity/

In [30]:
from langchain_community.vectorstores import FAISS
from langchain_core.example_selectors import SemanticSimilarityExampleSelector
from langchain_core.prompts import FewShotPromptTemplate, PromptTemplate
from langchain_openai import OpenAIEmbeddings

example_prompt = PromptTemplate(
    input_variables=["input", "output"],
    template="Input: {input}\nOutput: {output}",
)

# Examples of a pretend task of creating antonyms.
examples = [
    {"input": "happy", "output": "sad"},
    {"input": "tall", "output": "short"},
    {"input": "energetic", "output": "lethargic"},
    {"input": "sunny", "output": "gloomy"},
    {"input": "windy", "output": "calm"},
]

example_selector = SemanticSimilarityExampleSelector.from_examples(
    # The list of examples available to select from.
    examples,
    # The embedding class used to produce embeddings which are used to measure semantic similarity.
    OpenAIEmbeddings(),
    # The VectorStore class that is used to store the embeddings and do a similarity search over.
    FAISS,
    # The number of examples to produce.
    k=2,
)
ss_prompt = FewShotPromptTemplate(
    # We provide an ExampleSelector instead of examples.
    example_selector=example_selector,
    example_prompt=example_prompt,
    prefix="Give the antonym of every input",
    suffix="Input: {adjective}\nOutput:",
    input_variables=["adjective"],
)
print(ss_prompt.format(adjective="worried"))

Give the antonym of every input

Input: happy
Output: sad

Input: sunny
Output: gloomy

Input: worried
Output:
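
For contrast with MMR, selection by semantic similarity alone can be sketched as a plain top-k ranking by cosine similarity. This is a simplified illustration, not the code LangChain or FAISS actually runs: `cosine`, `top_k_similar`, and the toy 2-D vectors are hypothetical names and data used only to show the idea.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_similar(query_vec, candidate_vecs, k=2):
    """Return the indices of the k candidates most similar to the query."""
    scores = [cosine(query_vec, c) for c in candidate_vecs]
    ranked = sorted(range(len(candidate_vecs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy vectors (made up for illustration only).
query = [1.0, 0.0]
candidates = [
    [0.9, 0.1],    # very close to the query
    [0.89, 0.11],  # near-duplicate of the first candidate
    [0.6, -0.8],   # less similar to the query
]

top = top_k_similar(query, candidates, k=2)
print(top)
```

Because nothing penalises redundancy here, both near-duplicate candidates are returned, whereas a diversity-aware selector would tend to replace the second one.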


### Applying the method to our use case (example questions) with SS

In [4]:
from langchain_community.vectorstores import FAISS
from langchain_core.example_selectors import SemanticSimilarityExampleSelector
from langchain_core.prompts import FewShotPromptTemplate, PromptTemplate
from langchain_openai import OpenAIEmbeddings
import json

example_prompt = PromptTemplate(
    input_variables=["question"],
    template="Similar Question: {question}",
)

with open('questions_and_queries.json', 'r', encoding='utf-8') as file:
    example_data = json.load(file)

# Change the key from "user_question" to "question"
example_questions = [{"question": item["user_question"]} for item in example_data]

example_selector = SemanticSimilarityExampleSelector.from_examples(
    # The list of examples available to select from.
    example_questions,
    # The embedding class used to produce embeddings which are used to measure semantic similarity.
    OpenAIEmbeddings(),
    # The VectorStore class that is used to store the embeddings and do a similarity search over.
    FAISS,
    # The number of examples to produce.
    k=2,
)

ss_prompt = FewShotPromptTemplate(
    # We provide an ExampleSelector instead of examples.
    example_selector=example_selector,
    example_prompt=example_prompt,
    prefix="Give the question similar to the input question",
    suffix="Your Question: {question}",
    input_variables=["question"],
)
# The input asks where to do an activity, so the swimming example should be selected first
print(ss_prompt.format(question="Where can I bike?"))

Give the question similar to the input question

Similar Question: Where can I go swimming?

Similar Question: How can I do my recycling?

Your Question: Where can I bike?


In [5]:
print(ss_prompt.format(question="What are the recent decisions in Gent?"))

Give the question similar to the input question

Similar Question: Wat waren de laatste 10 beslissingen met betrekking tot het milieu in Gent?

Similar Question: Welke besluiten heeft de burgemeester genomen?

Your Question: What are the recent decisions in Gent?


# Semantic textual similarity with SentenceTransformers
Sources: 
- https://www.sbert.net/examples/applications/semantic-search/README.html
- https://sbert.net/

In [1]:
import torch
import json
from sentence_transformers import SentenceTransformer



In [2]:
embedder = SentenceTransformer("all-MiniLM-L6-v2")



In [10]:
# Load example data
with open('questions_and_queries.json', 'r', encoding='utf-8') as file:
    example_data = json.load(file)

# Extract example questions
example_questions = [item["user_question"] for item in example_data]

example_questions

['Wat waren de laatste 10 beslissingen met betrekking tot het milieu in Gent?',
 'Welke maatregelen worden er tijdens de bouw genomen voor stofbeheersing en -reductie?',
 'How can I do my recycling?',
 'Where can I go swimming?',
 'Wat zijn de laatste 10 besluiten en wanneer zijn deze gepubliceerd?',
 'Welke besluiten heeft de burgemeester genomen?']

In [7]:
question_embeddings = embedder.encode(example_questions, convert_to_tensor=True)

In [9]:
question = "Where can I go biking?"
top_k = min(2, len(example_questions))
question_embedding = embedder.encode(question, convert_to_tensor=True)
similarity_scores = embedder.similarity(question_embedding, question_embeddings)[0]
scores, indices = torch.topk(similarity_scores, k=top_k)

print("Question:", question)
print("Top most similar sentence in questions:")

for score, idx in zip(scores, indices):
    print(example_questions[idx], "(Score: {:.4f})".format(score))

Question: Where can I go biking?
Top most similar sentence in questions:
Where can I go swimming? (Score: 0.5896)
How can I do my recycling? (Score: 0.1440)
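
The second match above scores much lower than the first, so it may not be a useful example for the prompt. One possible follow-up, sketched here as an assumption rather than part of the original experiment, is to drop matches below a minimum score. The helper name `filter_by_score` and the threshold of 0.3 are hypothetical; the scores are the ones printed above.

```python
def filter_by_score(matches, min_score=0.3):
    """Keep only (text, score) pairs whose score reaches min_score.

    min_score is an arbitrary illustrative threshold, not a tuned value.
    """
    return [(text, score) for text, score in matches if score >= min_score]


# Scores copied from the output above.
matches = [
    ("Where can I go swimming?", 0.5896),
    ("How can I do my recycling?", 0.1440),
]

kept = filter_by_score(matches)
for text, score in kept:
    print(text, "(Score: {:.4f})".format(score))
```

A suitable threshold depends on the embedding model and the data, so it would need to be checked against more questions than the six in this example set.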
