**How to use Query transformation and Hypothetical answering to re-rank retrieved articles and enhance the performance of your RAG pipeline?**

[Hanane DUPOUY LinkedIn](https://www.linkedin.com/in/hanane-d-algo-trader/)

Ever wondered how to accurately answer the question: `'What impact did the global outage of CrowdStrike, extensively used by Microsoft, have on Microsoft's stock price?'`?

To tackle this, we'll employ the **hypothetical answer re-ranking** technique:


- **GOAL**: I'll evaluate whether the **hypothetical answer** can improve the re-ranking and the retrieved context, and subsequently enhance the LLM's response within our RAG pipeline. Alternatively, we will assess whether the original query alone is sufficient to retrieve the appropriate context and deliver accurate results.

- Different techniques will be employed: **Query Transformation, Hypothetical Answers, Embeddings, and Similarity Scoring** to retrieve the relevant context from news articles fetched from the NEWS API.


- In these techniques, we will compare the **capabilities of three LLMs**: **gpt-4o-mini** (the latest small model from OpenAI at the time of writing), **gpt-4o** (the most capable LLM from OpenAI at the time of writing), and **gpt-3.5-turbo**.


- I'll use **3 evaluation metrics** from **DeepEval** for RAG pipelines: **Faithfulness, Context Relevancy and Answer Relevancy**.
These metrics are explained in the notebook.

- We will compare the three LLMs using two techniques: **Hypothetical Answer Re-ranking vs. Original Query Retrieval**.

**Steps:**

For each of the three LLMs: **GPT-4o-mini, GPT-4o, and GPT-3.5-turbo**:

**1-** We will perform **Search Queries** (or **Query transformation**) using the LLM to generate various formulations with the same keywords from the original user query.

**2-** We will use an LLM to generate a **hypothetical answer**. This creative response will serve as a potential answer, using placeholders instead of actual facts.

**3-** Based on each query from the search queries (1-), we will retrieve news articles from NEWS API.

**4-** We will **embed** the user query, the hypothetical answer and the collected articles.

**5-** We compute the **similarity score** between 2 sets:

  5-1- Hypothetical answer (2-) vs retrieved context (3-)

  5-2- Original query vs retrieved context (3-)

**6-** Ask the LLM to give the final answer based on the user query and the retrieved context.

**7-** Use 3 evaluation metrics from DeepEval to evaluate the RAG pipeline: **Faithfulness, Context Relevancy and Answer Relevancy.**

**8-** Key Takeaways
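
To make steps 4 and 5 concrete before touching any API, here is a minimal sketch of re-ranking by cosine similarity. The 3-dimensional vectors, the titles and the helper names (`cosine_similarity`, `rerank`) are made up for illustration; the notebook itself uses OpenAI embeddings and SciPy further down.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rerank(reference_vec, docs):
    """Sort (title, vector) pairs by similarity to the reference vector, best first."""
    scored = [(title, cosine_similarity(reference_vec, vec)) for title, vec in docs]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy "embeddings", invented for this sketch only.
toy_docs = [
    ("Outage grounds flights", [0.9, 0.1, 0.0]),
    ("Microsoft shares slip after outage", [0.5, 0.5, 0.7]),
    ("Unrelated sports story", [0.0, 1.0, 0.0]),
]
query_vec = [1.0, 0.0, 0.2]   # stands in for the embedded user query
hypoth_vec = [0.4, 0.3, 0.9]  # stands in for the embedded hypothetical answer

ranking_by_query = [title for title, _ in rerank(query_vec, toy_docs)]
ranking_by_hypoth = [title for title, _ in rerank(hypoth_vec, toy_docs)]
print(ranking_by_query)
print(ranking_by_hypoth)
```

With these toy vectors the two reference vectors put a different article on top, which is exactly the effect the rest of the notebook tries to measure on real articles.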


# Install Lib

In [None]:
import json

In [None]:
!pip install openai
from google.colab import userdata
openai_api_key = userdata.get('OPENAI_API_KEY')
news_api_key = userdata.get('NEWS_API_KEY')

from openai import OpenAI
client = OpenAI(api_key=openai_api_key)

Chat method OpenAI

In [None]:
def get_completion_gpt(input, gpt_model = "gpt-3.5-turbo"):
  completion = client.chat.completions.create(
        model=gpt_model,
        messages=[
            {"role": "system", "content": "Output only valid JSON"},
            {"role": "user", "content": input},
        ],
        response_format={ "type": "json_object" }
    )

  text = completion.choices[0].message.content
  parsed = json.loads(text)
  return parsed

# Search NEWS API

In [None]:
import requests

def search_news(query, news_api_key= news_api_key,num_articles=5, from_datetime = "2024-07-18",to_datetime = "2024-07-21"):
    response = requests.get(
        "https://newsapi.org/v2/everything",
        params={
            "q": query,
            "apiKey": news_api_key,
            "pageSize": num_articles,
            "sortBy": "relevancy",
            "from": from_datetime,
            "to": to_datetime,
        },
    )

    return response.json()

# Generate Search Queries

In [None]:
user_query = "What impact did the global outage of CrowdStrike, which is used extensively by Microsoft, have on Microsoft's stock price?"

input = f"""
You have access to a NEWS API that returns recent news articles related to the user's question.

1. Make a list of search queries that match the topic described in the user's question.
2. Use different keywords related to the topic to create a variety of queries, making some general and others more specific.
3. Be imaginative and generate as many queries as possible. More queries will help you find better results.
4. Pick 10 of these queries.
For example, you can include queries like ['keyword_1 keyword_2', 'keyword_1', 'keyword_2'].

# User question: {user_query}

# Format: {{"queries": ["query_1", "query_2", "query_3"]}}
"""


llms = ["gpt-3.5-turbo", "gpt-4o-mini", "gpt-4o"]

dict_questions = {}
for llm in llms:
  print(llm)
  parsed = get_completion_gpt(input, gpt_model = llm)
  dict_questions[llm] = parsed

In [None]:
dict_questions
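
The prompt asks for a JSON object with a `queries` key, but nothing forces the model to respect that shape. Below is a small sketch of a defensive check that could be applied to each parsed response before calling the NEWS API. The helper `extract_queries` and its `max_queries` parameter are hypothetical and not part of the original notebook.

```python
def extract_queries(parsed, max_queries=10):
    """Return a clean list of query strings from a parsed LLM response.

    `parsed` is expected to look like {"queries": ["...", "..."]}.
    """
    queries = parsed.get("queries") if isinstance(parsed, dict) else None
    if not isinstance(queries, list):
        raise ValueError("Expected a JSON object with a 'queries' list")
    cleaned = []
    for q in queries:
        # Keep non-empty strings only, and drop exact duplicates.
        if isinstance(q, str) and q.strip() and q.strip() not in cleaned:
            cleaned.append(q.strip())
    return cleaned[:max_queries]

print(extract_queries({"queries": [" CrowdStrike outage ", "CrowdStrike outage", "", "Microsoft stock"]}))
```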

# Create a Hypothetical answer

In [None]:
hypoth_answer = f"""
Make up an answer to the user's question. We'll use this fabricated answer to sort the search results.
Imagine you have all the details to answer, but don't use real facts. Do not give any numbers.
Instead, use placeholders like 'EVENT affected something,' 'NAME mentioned something on DATE,' or 'EVENT has caused something.'

User question: {user_query}

Format: {{"hypotheticalAnswer": "hypothetical answer text"}}
"""

print(hypoth_answer)

In [None]:
#Trying different llms:

hypoth_answer_llms = {}
for llm in llms:
  # print(llm)
  parsed_hypothet_answer = get_completion_gpt(hypoth_answer, gpt_model = llm)
  hypoth_answer_llms[llm] = parsed_hypothet_answer['hypotheticalAnswer']
  print(f"{llm}\n {hypoth_answer_llms[llm]}")

# Fetch news articles from NEWS API for each query:

In [None]:
def get_articles_from_news_api(queries):
  articles = []
  for query in queries:
    result = search_news(query)
    if result['status'] == 'ok':
      articles = articles + result['articles']
    else:
      raise Exception(result["message"])
  return articles

In [None]:
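articles={}
for llm in llms:
  # Build a new list so re-running the cell does not keep appending user_query
  queries = dict_questions[llm]['queries'] + [user_query]
  articles[llm] =  get_articles_from_news_api(queries)
  if articles[llm] is not None:
    # De-duplicate by URL: the same article can come back for several queries
    articles[llm] = list({article["url"]: article for article in articles[llm]}.values())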
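
The de-duplication above relies on dict keys being unique. Here is a small standalone sketch of that trick on made-up data; the helper name `dedupe_by_url` and the example URLs are hypothetical.

```python
def dedupe_by_url(articles):
    """Keep one article per URL, mirroring the dict-comprehension trick above."""
    return list({article["url"]: article for article in articles}.values())

# Made-up sample with one duplicated URL.
sample = [
    {"url": "https://example.com/a", "title": "first copy of A"},
    {"url": "https://example.com/b", "title": "B"},
    {"url": "https://example.com/a", "title": "second copy of A"},
]
unique = dedupe_by_url(sample)
print([a["title"] for a in unique])
```

When a URL appears twice, the later article overwrites the earlier one but keeps the position of the first occurrence, because Python dicts preserve insertion order.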

In [None]:
for llm in llms:
  print(len(articles[llm]))

In [None]:
# #To save data locally
# for llm in llms:
#   pd.DataFrame(articles[llm]).to_csv("articles_"+llm+".csv")

In [None]:
#Display some articles:
print("Number of article sets:", len(articles)) #3 LLMs ==> 3 sets of articles
llm = llms[-1]
for article in articles[llm][0:5]:
    print("Title:", article["title"])
    print("Url:", article["url"])
    print("Description:", article["description"])
    # "content" may be None for some articles
    print("Content:", (article["content"] or "")[0:300] + "...")
    print()

# Embeddings and cosine similarity

## Methods

In [None]:
def get_embeddings(input):
    response = client.embeddings.create(model="text-embedding-ada-002", input=input)
    return [data.embedding for data in response.data]

In [None]:
def get_embeddings_articles(articles):
  # Fields may be None for some articles, so fall back to an empty string
  articles_prepare_embedd =  [
        f"{article['title'] or ''} {article['description'] or ''} {(article['content'] or '')[0:700]}"
        for article in articles
    ]

  print(f"Length of articles to embed: {len(articles_prepare_embedd)}")
  article_embeddings = get_embeddings(articles_prepare_embedd)
  return article_embeddings

In [None]:
from scipy import spatial

# Cosine similarity = 1 - cosine distance
similarity_score_func=lambda x, y: 1 - spatial.distance.cosine(x, y)

def calculate_cosine_distance(embedding_hypoth, article_embeddings):
    """Despite its name, returns cosine similarities (higher means more similar)."""
    cosine_similarities = []
    for article_embedding in article_embeddings:
        cosine_similarities.append(similarity_score_func(embedding_hypoth, article_embedding))
    return cosine_similarities

In [None]:
def sort_articles_by_cosine_similarity(articles, cosine_similarities):
    scored_articles = zip(articles, cosine_similarities)
    sorted_articles = sorted(scored_articles, key=lambda x: x[1], reverse=True)
    print(f"Top 5 articles scores: {[score for _,score in sorted_articles[0:5]]}\n")
    # for article, score in sorted_articles[0:5]:
    #     print("Title:", article["title"])
    #     # print("Url:", article["url"])
    #     # print("Date of publication:", article["publishedAt"])
    #     # print("Description:", article["description"])
    #     # print("Content:", article["content"][0:50] + "...")
    #     print("Score:", score)
    #     print()
    return sorted_articles

In [None]:
def context_retrieval(sorted_articles):
    """Get top 5 articles based on their similarity scores."""
    # Fields may be None for some articles, so fall back to an empty string
    formatted_top_results = [
        "\n".join((article[key] or "") for key in ("title", "description", "content"))
        for article, _score in sorted_articles[0:5]
    ]

    return formatted_top_results


def get_final_answer(user_query, formatted_top_results, llm):
  """Answer the user's question based on the retrieved context using a GPT model: gpt-4o, gpt-4o-mini, gpt-3.5-turbo."""
  final_input = f"""
  Generate an answer to the user's question based on the given search results.
  TOP_RESULTS: {formatted_top_results}
  USER_QUESTION: {user_query}

  Include as much information as possible in the answer. Reference the relevant search result urls as markdown links.
  """

  completion = client.chat.completions.create(
          model=llm,
          messages=[
              {"role": "user", "content": final_input},
          ],
      )

  return completion.choices[0].message.content


# Example 1 LLM:

## Similarities against the Hypothetical Answer

In [None]:
llm = llms[0]
embedding_hypoth = get_embeddings(hypoth_answer_llms[llm])[0]
article_embeddings = get_embeddings_articles(articles[llm]) #{list of embedded articles , there are 26 articles}
cosine_similarities_hypoth = calculate_cosine_distance(embedding_hypoth, article_embeddings)

print(f" len embedding vector={len(embedding_hypoth)}, len articles embedded={len(article_embeddings)}, len cosine_distance vector={len(cosine_similarities_hypoth)}")
print(cosine_similarities_hypoth[:5])
print("\n")

sorted_articles_hypoth = sort_articles_by_cosine_similarity(articles[llm], cosine_similarities_hypoth)

## Similarities against the original query

In [None]:
llm = llms[0]
embedding_original_query = get_embeddings(user_query)[0]
# article_embeddings = get_embeddings_articles(articles[llm]) #already embedded in the cell before
cosine_similarities_original= calculate_cosine_distance(embedding_original_query, article_embeddings)

print(f" len embedding vector={len(embedding_original_query)}, len articles embedded={len(article_embeddings)}, len cosine_distance vector={len(cosine_similarities_original)}")
print(cosine_similarities_original[:5])
print("\n")

sorted_articles_original = sort_articles_by_cosine_similarity(articles[llm], cosine_similarities_original)

## Final Answer: Calling LLM to answer the user query

### Against the hypothetical answer and the original user query

In [None]:
from IPython import display

#Using the retrieved context coming from the hypothetical answer
formatted_top_results_hypoth = context_retrieval(sorted_articles_hypoth)
final_answer_hypoth = get_final_answer(user_query, formatted_top_results_hypoth, llm)
print("Final answer against the Hypothetical query")
display.display(display.Markdown(final_answer_hypoth))

#Using the retrieved context coming from the original query
formatted_top_results_original = context_retrieval(sorted_articles_original)
final_answer_original = get_final_answer(user_query, formatted_top_results_original, llm)
print("Final answer against the original query")
display.display(display.Markdown(final_answer_original))

# All together: With the 3 LLMs

In [None]:
llms

In [None]:
embedding_original_query = get_embeddings(user_query)[0]

for llm in llms:
  print(llm)
  #Hypothetical answer
  embedding_hypoth = get_embeddings(hypoth_answer_llms[llm])[0]
  article_embeddings = get_embeddings_articles(articles[llm])
  cosine_similarities_hypoth = calculate_cosine_distance(embedding_hypoth, article_embeddings)
  print("Hypothetical Answer: Most relevant News\n")
  sorted_articles_hypoth = sort_articles_by_cosine_similarity(articles[llm], cosine_similarities_hypoth)
  print("-"*50)

  #Original Query
  cosine_similarities_original= calculate_cosine_distance(embedding_original_query, article_embeddings)
  print("Original Query: Most relevant News\n")
  sorted_articles_original = sort_articles_by_cosine_similarity(articles[llm], cosine_similarities_original)
  print("-"*50)

  formatted_top_results_hypoth = context_retrieval(sorted_articles_hypoth)
  final_answer_hypoth = get_final_answer(user_query, formatted_top_results_hypoth, llm)
  print("Final answer against the Hypothetical query")
  display.display(display.Markdown(final_answer_hypoth))
  print("-"*50)

  formatted_top_results_original = context_retrieval(sorted_articles_original)
  final_answer_original = get_final_answer(user_query, formatted_top_results_original, llm)
  print("Final answer against the original query")
  display.display(display.Markdown(final_answer_original))
  print("-"*50)
  print("-"*50)

In [None]:
len(sorted_articles_hypoth)

# Evaluation

We will be using DeepEval, to compute 3 metrics:

*   Faithfulness
*   Context Relevancy
*   Answer Relevancy



In [None]:
!pip install deepeval -q

You need to specify your OpenAI API key to use DeepEval, in our case.

To compute metrics, this library makes several calls to a given LLM; by default it uses GPT-4o.
You can use a custom LLM if you want.

However, note that the prompt templates used under the hood by the metrics ask the LLM to output JSON. If you are using a small LLM, this part may not work.
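
To illustrate why JSON output can be fragile with smaller models, here is a sketch of a tolerant parser that tries to recover a JSON object from text that may contain extra prose. This is not how DeepEval handles the problem internally; `try_parse_json` is a hypothetical helper shown only to make the failure mode concrete.

```python
import json

def try_parse_json(text):
    """Return the parsed object, or None if no valid JSON object can be recovered."""
    candidate = text.strip()
    # Some models wrap JSON in extra prose; keep only the outermost {...} span.
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(candidate[start:end + 1])
    except json.JSONDecodeError:
        return None

print(try_parse_json('Sure! Here is the result: {"verdict": "yes"}'))
print(try_parse_json("I could not decide."))
```

A `None` result is the signal that the model did not follow the format, which is the case where a metric built on templated JSON prompts can fail.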

In [None]:
from google.colab import userdata
OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
import os
os.environ[ "OPENAI_API_KEY" ] = OPENAI_API_KEY

## Faithfulness: Retrieved Context vs LLM's final answer

This evaluates the factual consistency of the **generated answer** relative to the **provided context**.

It outputs a **reason** for its **metric score**.

### Methodology

1- Use an LLM to break the generated answer into statements (claims)

2- Using an LLM, assert whether each statement can be inferred from the context ⇒ Verdict: yes, no or idk.

3- Compute Faithfulness Score:

Faithfulness= Number of Truthful Claims/Total Number of Claims

https://docs.confident-ai.com/docs/metrics-faithfulness
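
The ratio in step 3 can be worked through on a toy list of verdicts. This is a sketch of the arithmetic only: the helper `ratio_score` is hypothetical, the verdicts are made up, and how "idk" verdicts are counted is an assumption exposed here as a parameter (check the DeepEval documentation linked above for the library's actual rule).

```python
def ratio_score(verdicts, counted_as_positive=("yes",)):
    """Fraction of verdicts that count as positive; 0.0 for an empty list (an assumption)."""
    if not verdicts:
        return 0.0
    positives = sum(1 for v in verdicts if v.strip().lower() in counted_as_positive)
    return positives / len(verdicts)

# Made-up verdicts for four claims extracted from an answer.
verdicts = ["yes", "yes", "no", "idk"]
strict = ratio_score(verdicts)                                        # only "yes" counts
lenient = ratio_score(verdicts, counted_as_positive=("yes", "idk"))   # "idk" is not penalized
print(strict, lenient)
```

The same ratio shape applies to the Contextual Relevancy score, with statements from the retrieved context in place of claims from the answer.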

In [None]:
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

In [None]:
def get_faithfulness_metric(user_query,final_answer,formatted_top_results):
  metric = FaithfulnessMetric(
    threshold=0.7,
    model="gpt-4o",
    include_reason=True
  )
  test_case = LLMTestCase(
      input= user_query,
      actual_output=final_answer,
      retrieval_context=formatted_top_results
  )

  metric.measure(test_case)
  score = metric.score
  reason = metric.reason
  return score, reason

### Original query

In [None]:
metric = FaithfulnessMetric(
    threshold=0.7,
    model="gpt-4o",
    include_reason=True
)
test_case = LLMTestCase(
    input=user_query,
    actual_output=final_answer_original,
    retrieval_context=formatted_top_results_original
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)

### Hypothetical answer

In [None]:
import nest_asyncio
nest_asyncio.apply()

In [None]:
score, reason = get_faithfulness_metric(user_query,final_answer_hypoth,formatted_top_results_hypoth)
print(f"Score: {score}, Reason: {reason}")

In [None]:
score, reason = get_faithfulness_metric(user_query,final_answer_original,formatted_top_results_original)
print(f"Score: {score}, Reason: {reason}")

## Context Relevancy

This evaluates how relevant the **retrieved context** is to the **input query**.

It outputs a **reason** for its **metric score**.

### Methodology

1- Use an LLM to extract statements from the retrieved context

2- Using an LLM, assert if each statement is relevant to the input query ==>  yes or no.

3- Compute Contextual Relevancy Score:

Contextual Relevancy= Number of Relevant Statements/Total Number of Statements

https://docs.confident-ai.com/docs/metrics-contextual-relevancy

In [None]:
from deepeval.metrics import ContextualRelevancyMetric
# from deepeval.test_case import LLMTestCase

In [None]:
def get_context_relevancy_metric(user_query,final_answer_hypoth,formatted_top_results_hypoth):
  metric = ContextualRelevancyMetric(
    threshold=0.7,
    model="gpt-4o",
    include_reason=True
  )
  test_case = LLMTestCase(
      input= user_query,
      actual_output=final_answer_hypoth,
      retrieval_context=formatted_top_results_hypoth
  )

  metric.measure(test_case)
  score = metric.score
  reason = metric.reason
  return score, reason

### Original query

In [None]:
score, reason = get_context_relevancy_metric(user_query,final_answer_original,formatted_top_results_original)
print(f"Score: {score}, Reason: {reason}")

### Hypothetical answer

In [None]:
score, reason = get_context_relevancy_metric(user_query,final_answer_hypoth,formatted_top_results_hypoth)
print(f"Score: {score}, Reason: {reason}")

## Answer Relevancy

The answer relevancy metric measures the quality of your RAG pipeline's generator by evaluating how relevant the **actual_output** (final answer) of your LLM application is compared to the provided **input**.


deepeval's answer relevancy metric is a self-explaining LLM-Eval, meaning it outputs a **reason** for its **metric score**.

https://docs.confident-ai.com/docs/metrics-answer-relevancy

In [None]:
from deepeval.metrics import AnswerRelevancyMetric

In [None]:
def get_answer_relevancy_metric(user_query,final_answer_hypoth):
  metric = AnswerRelevancyMetric(
    threshold=0.7,
    model="gpt-4o",
    include_reason=True
  )
  test_case = LLMTestCase(
      input= user_query,
      actual_output=final_answer_hypoth,
      # retrieval_context=formatted_top_results_hypoth
  )

  metric.measure(test_case)
  score = metric.score
  reason = metric.reason
  return score, reason

### Original query

In [None]:
score, reason = get_answer_relevancy_metric(user_query,final_answer_original)
print(f"Score: {score}, Reason: {reason}")

### Hypothetical answer

In [None]:
score, reason = get_answer_relevancy_metric(user_query,final_answer_hypoth)
print(f"Score: {score}, Reason: {reason}")

# All together: LLMs + Evaluation

## Run All

In [None]:
def get_all_eval_metrics(user_query,final_answer,formatted_top_results):
  score_faithfulness, reason_faithfulness = get_faithfulness_metric(user_query,final_answer,formatted_top_results)
  score_cxt_relev, reason_cxt_relev = get_context_relevancy_metric(user_query,final_answer,formatted_top_results)
  score_answ_relev, reason_answ_relev = get_answer_relevancy_metric(user_query,final_answer)
  print("\nFaithfulness\n")
  print(score_faithfulness, reason_faithfulness)
  print("\nContext Relevancy\n")
  print(score_cxt_relev, reason_cxt_relev)
  print("\nAnswer Relevancy\n")
  print(score_answ_relev, reason_answ_relev)

  return score_faithfulness, reason_faithfulness, score_cxt_relev, reason_cxt_relev, score_answ_relev, reason_answ_relev
    # return (score_faithfulness, reason_faithfulness), (score_cxt_relev, reason_cxt_relev), (score_answ_relev, reason_answ_relev)

In [None]:
# llm = llms[0]

embedding_original_query = get_embeddings(user_query)[0]
scores_hypoth = {}
scores_original = {}

for llm in llms:
  print(llm)
  #Hypothetical answer
  embedding_hypoth = get_embeddings(hypoth_answer_llms[llm])[0]
  article_embeddings = get_embeddings_articles(articles[llm]) #{list of embedded articles , there are 26 articles}
  cosine_similarities_hypoth = calculate_cosine_distance(embedding_hypoth, article_embeddings)
  print("Hypothetical Answer: Most relevant News\n")
  sorted_articles_hypoth = sort_articles_by_cosine_similarity(articles[llm], cosine_similarities_hypoth)
  print("-"*50)

  #Original Query
  cosine_similarities_original= calculate_cosine_distance(embedding_original_query, article_embeddings)
  print("Original Query: Most relevant News\n")
  sorted_articles_original = sort_articles_by_cosine_similarity(articles[llm], cosine_similarities_original)
  print("-"*50)

  formatted_top_results_hypoth = context_retrieval(sorted_articles_hypoth)
  final_answer_hypoth = get_final_answer(user_query, formatted_top_results_hypoth, llm)
  print("#Final answer against the Hypothetical query")
  display.display(display.Markdown(final_answer_hypoth))
  print("-"*50)

  #Get Evaluations Metrics
  scores_hypoth[llm] = get_all_eval_metrics(user_query,final_answer_hypoth,formatted_top_results_hypoth)

  formatted_top_results_original = context_retrieval(sorted_articles_original)
  final_answer_original = get_final_answer(user_query, formatted_top_results_original, llm)
  print("#Final answer against the original query")
  display.display(display.Markdown(final_answer_original))

  #Get Evaluations Metrics
  scores_original[llm] = get_all_eval_metrics(user_query,final_answer_original,formatted_top_results_original)

## Final Results

In [None]:
scores_original

In [None]:
scores_hypoth

In [None]:
import pandas as pd
import numpy as np

### Against Original Query

In [None]:
scores_original_values = {}
for llm in llms:
  scores = [score for score in scores_original[llm] if type(score)!= str]
  scores_original_values[llm] = scores

pd.DataFrame(scores_original_values, index=['faithfulness','context_relevancy','answer_relevancy'])

In [None]:
scores_original_raisons = {}
for llm in llms:
  raisons = [score for score in scores_original[llm] if type(score)== str]
  scores_original_raisons[llm] = raisons

pd.DataFrame(scores_original_raisons, index=['faithfulness','context_relevancy','answer_relevancy'])
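
The two cells above separate scores from reasons by checking the type of each element. Since `get_all_eval_metrics` returns a flat tuple of alternating score/reason values, splitting by position is a possible alternative that still works if a reason is not a string (for example `None`). The helper `split_scores_and_reasons` is hypothetical, and the example values are made up rather than taken from this notebook's results.

```python
def split_scores_and_reasons(result):
    """Split (score, reason, score, reason, ...) into two parallel lists by position."""
    if len(result) % 2 != 0:
        raise ValueError("Expected alternating score/reason pairs")
    return list(result[0::2]), list(result[1::2])

# Made-up values for illustration only.
example = (0.8, "reason A", 0.0, None, 0.5, "reason C")
scores, reasons = split_scores_and_reasons(example)
print(scores)
print(reasons)
```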

In [None]:
index_metrics=['faithfulness','context_relevancy','answer_relevancy']
for llm in llms:
  print(f"#{llm}:")
  for i in range(len(index_metrics)):
    print(f"{index_metrics[i]}")
    print(scores_original_raisons[llm][i])
  print("-"*50)

In [None]:
for llm in llms:
  mean_score = np.mean([score for score in scores_original[llm] if type(score)!= str])
  print(f"{round(mean_score,3)} = Mean score for {llm}")

**Key Takeaway 1**: gpt-4o-mini achieves the best mean score of the three LLMs. Its answer relevancy score was even higher than gpt-4o's.

### Against Hypothetical Answer

In [None]:
scores_hypoth_values = {}
for llm in llms:
  scores = [score for score in scores_hypoth[llm] if type(score)!= str]
  scores_hypoth_values[llm] = scores

pd.DataFrame(scores_hypoth_values, index=['faithfulness','context_relevancy','answer_relevancy'])

In [None]:
scores_hypoth_raisons = {}
for llm in llms:
  raisons = [score for score in scores_hypoth[llm] if type(score)== str]
  scores_hypoth_raisons[llm] = raisons

pd.DataFrame(scores_hypoth_raisons, index=['faithfulness','context_relevancy','answer_relevancy'])

In [None]:
index_metrics=['faithfulness','context_relevancy','answer_relevancy']
for llm in llms:
  print(f"#{llm}:")
  for i in range(len(index_metrics)):
    print(f"{index_metrics[i]}")
    print(scores_hypoth_raisons[llm][i])
  print("-"*50)

In [None]:
for llm in llms:
  mean_score = np.mean([score for score in scores_hypoth[llm] if type(score)!= str])
  print(f"{round(mean_score,3)} = Mean score for {llm}")

**Key Takeaway 2**: Again, with the hypothetical answer, GPT-4o-mini achieves the best score of the three LLMs. Its answer relevancy score was significantly higher than GPT-4o's (0.85 vs. 0.5) and GPT-3.5-turbo's (0.3). Its faithfulness score was also higher than the others'.

**Key Takeaway 3**: Another important takeaway:

With gpt-4o-mini, the mean score when retrieval is based on the hypothetical answer (0.619) is higher than when it is based on the original query (0.583). This suggests that re-ranking with the hypothetical answer leads to better results.
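
To put the two mean scores quoted above side by side, here is the arithmetic as a short sketch. The two numbers are the ones reported in the text; the variable names are introduced only for this example.

```python
mean_hypoth = 0.619    # mean score, retrieval based on the hypothetical answer
mean_original = 0.583  # mean score, retrieval based on the original query

absolute_gain = round(mean_hypoth - mean_original, 3)
relative_gain_pct = round(100 * (mean_hypoth - mean_original) / mean_original, 1)
print(absolute_gain, relative_gain_pct)
```

The gap is 0.036 points, roughly a 6% relative improvement, measured on a single question.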

**Next:**

**Retrieval Context:**

Even though gpt-4o-mini performs well, the context relevancy metric is 0 for all LLMs. This part needs to be reworked.
In the retrieval part, I took the title, the description and the beginning of the content. That was not enough.
A better approach would be to parse the whole HTML of each article, gather all this information together, and chunk it into pieces of a given size.
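
As a starting point for the chunking idea, here is a minimal sketch of a fixed-size character chunker with overlap. It assumes the article text has already been extracted from the HTML; the function `chunk_text` and its default sizes are hypothetical choices, not something used in this notebook.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into chunks of at most chunk_size characters, with overlapping edges."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("Need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks

print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
```

Each chunk could then be embedded and scored on its own, instead of one embedding per truncated article.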

**DeepEval and gpt-4o-mini:**

It could be interesting to run the evaluation metrics with gpt-4o-mini instead of gpt-4o, because the scores are computed under the hood from templated prompts and depend on the LLM's ability to compare a given claim/statement in the retrieved context (for example) against the final answer.

I reach this conclusion because I was not expecting gpt-4o-mini to outperform gpt-4o!