# **CA3-Part2, LLMs Spring 2025**

- **Name:**
- **Student ID:**

---
#### Your submission should be named using the following format: `CA3-Part2_LASTNAME_STUDENTID.ipynb`.

---

##### *How to do this problem set:*

- Some questions require writing Python code and computing results, and the rest of them have written answers. For coding problems, you will have to fill out all code blocks under each `Completion` section.

- For text-based answers, you should replace the text that says `WRITE YOUR ANSWER HERE` with your actual answer, or you can look for `Report` and `Question` blocks.

- There is no penalty for using AI assistance on this homework as long as you fully disclose it in the final cell of this notebook (this includes storing any prompts that you feed to large language models). That said, anyone caught using AI assistance without proper disclosure will receive a zero on the assignment (we have several automatic tools to detect such cases). We're literally allowing you to use it with no limitations, so there is no reason to lie!

---

##### *Academic honesty*

- We will audit the Colab notebooks from a set number of students, chosen at random. The audits will check that the code you wrote actually generates the answers in your notebook. If you turn in correct answers on your notebook without code that actually generates those answers, we will consider this a serious case of cheating.

- We will also run automatic checks of Colab notebooks for plagiarism. Copying code from others is also considered a serious case of cheating.

---

If you have any further questions or concerns, contact the TAs via email or Telegram.

# RAG (50 points)

If you have any further questions or concerns, contact the TA via email (pouya.sadeghi@ut.ac.ir) or Telegram.

## Install Requirements

In [1]:
!pip install -q langchain langchain_community langchain_huggingface huggingface_hub
!pip install -q sentence_transformers tiktoken lark datasets

In [2]:
# To determine your system's CUDA version, run the following command:
!nvidia-smi

# Based on your CUDA version, install the appropriate FAISS-GPU package:

# For CUDA 12.x:
# !pip install -q faiss-gpu-cu12

# For CUDA 11.x:
# !pip install faiss-gpu-cu11

# If you prefer the CPU-only version of FAISS:
!pip install -q faiss-cpu

Mon Jul  7 18:43:34 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   35C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

## 1. An Overview of Information Retrieval (IR) and RAG (2 points)


- **Information Retrieval (IR)**: The process of obtaining information system resources relevant to a specific information need from a collection of those resources. Each IR system consists of a collection of documents, a set of queries, and a retrieval function that ranks the documents based on their relevance to the query.
- **Retrieval-Augmented Generation (RAG)**: An approach that combines the strengths of retrieval-based and generation-based methods. It retrieves relevant documents from a large corpus and uses them to generate a response to a query. RAG is particularly useful when the answer is not explicitly present in the model's training data but can be found in, or inferred from, related documents.
- **RAG Architecture**: The RAG architecture consists of two main components:
  - **Retriever**: This component retrieves relevant documents from a large corpus based on the input query. It can be implemented using various retrieval methods, such as BM25 or dense retrieval.
  - **Generator**: This component generates a response based on the retrieved documents and the input query. It can be implemented using transformer-based models.
  
In this computer assignment, you will implement a RAG pipeline using the LangChain framework. You will use two different retrievers: a sparse retriever (such as TF-IDF) and a dense retriever.

#### Question 1: (2 points)
1. Why do we need to use RAG?
2. What is LangChain and how does it help in building RAG pipelines?
<!-- 2. What are the advantages of RAG over generation methods? -->
<!-- 3. What is the difference between dense and sparse retrievers? -->

1.  **Why do we need to use RAG?**
    We need to use RAG primarily to address the inherent limitations of standalone Large Language Models (LLMs). LLMs are trained on a fixed dataset, meaning their knowledge is static and quickly becomes outdated concerning recent events or niche information. They also suffer from a tendency to "hallucinate," generating plausible but factually incorrect information. Furthermore, LLMs have a limited context window, preventing them from directly processing an entire large knowledge base. RAG overcomes these issues by allowing LLMs to retrieve relevant, up-to-date information from an external corpus and use it to ground their responses, thereby reducing hallucinations and providing more accurate and verifiable answers.

2.  **What is LangChain and how does it help in building RAG pipelines?**
    LangChain is a powerful framework designed for developing applications powered by language models. It offers a modular set of components that facilitate working with LLMs, including connecting them to external data sources, integrating various tools, and constructing complex "chains" and "agents". For building RAG pipelines, LangChain provides the structural backbone to seamlessly integrate the "Retriever" and "Generator" components. It simplifies the process of fetching relevant documents, passing them as context to the LLM, and managing the overall flow from user query to model response, making the development of sophisticated RAG systems more streamlined and efficient.

## 2. An Overview of LangChain (12 points + 2)

In this overview, we will provide a step-by-step guide on how to construct a basic application using LangChain. To learn more about this framework, check its [tutorial](https://python.langchain.com/docs/tutorials/) which is available for different releases!

### 2.1 Let's load our model (4 points)


#### Question 2: (2 points)

1. Explain how different parameters, such as `temperature`, `max_length`, `top_p`, `top_k`, and `repetition_penalty`, affect the generation process in a language model.


| Parameter | Effect on the generation process |
| :--- | :--- |
| `temperature` | Controls the randomness or "creativity" of the output. The logits (raw prediction scores) are divided by the temperature before softmax, which flattens the next-token distribution for higher values (e.g., 0.7 and above) and sharpens it for lower values (e.g., 0.2 and below). Higher temperatures lead to more diverse and surprising, but potentially less coherent, text; lower temperatures lead to more predictable, conservative, and often repetitive output. As the temperature approaches 0, sampling approaches greedy decoding, which always picks the most probable token. |
| `max_length` / `max_new_tokens` | Both limit the length of the output. `max_length` caps the total number of tokens (prompt plus generated text), while `max_new_tokens` caps only the newly generated tokens, which is why it is used in the pipeline in this notebook. The limit matters for managing computational resources, preventing excessively long or irrelevant responses, and keeping the output within task constraints (e.g., character limits for a UI or a required level of brevity). |
| `top_p` (nucleus sampling) | Filters the set of possible next tokens. Instead of considering all tokens, the model samples only from the smallest set of tokens whose cumulative probability reaches `p`. For example, with `top_p=0.9` the model considers only the most probable tokens that together account for 90% of the probability mass. The number of candidates therefore adapts to the shape of the distribution, which allows diverse generation while avoiding very low-probability (and often nonsensical) tokens. It is more adaptive than `top_k`. |
| `top_k` | Filters the set of possible next tokens by keeping only the `k` most likely ones. For example, with `top_k=50` the model samples only from the 50 highest-probability tokens, regardless of their cumulative probability. This helps prevent rare or irrelevant words and keeps the text coherent and on-topic. It is less adaptive than `top_p` because `k` stays constant whether the distribution is very sharp (a few tokens hold almost all of the probability) or very flat (many tokens have similar probabilities). |
| `repetition_penalty` | Discourages the model from repeating words, phrases, or longer sequences that have already appeared in the prompt or in the generated text. It lowers the scores of tokens that have already been seen, making them less likely to be selected again. Values above 1.0 penalize repetition. This helps prevent monotonous or circular outputs and encourages a wider range of vocabulary. |
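
The effect of the parameters in the table above can be illustrated on a toy distribution. The following is a minimal sketch in plain Python, not the code that `transformers` runs internally; the logits are made up, the function names are hypothetical, and the repetition-penalty rule (divide positive scores, multiply negative ones) is one commonly used formulation assumed here for illustration.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def filter_top_k(probs, k):
    """Keep only the k most probable token ids, then renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def filter_top_p(probs, p):
    """Keep the smallest set of token ids whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def apply_repetition_penalty(logits, seen_ids, penalty):
    """Lower the scores of token ids that were already generated (penalty > 1)."""
    out = list(logits)
    for i in set(seen_ids):
        # Positive scores are divided and negative scores are multiplied,
        # so both move toward "less likely".
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out

logits = [4.0, 3.0, 2.0, 1.0, 0.0]  # made-up scores for a 5-token vocabulary
sharp = softmax(logits, temperature=0.5)  # low temperature: peaked distribution
flat = softmax(logits, temperature=2.0)   # high temperature: flatter distribution
top2 = filter_top_k(softmax(logits), k=2)
nucleus = filter_top_p(softmax(logits), p=0.9)
penalized = apply_repetition_penalty(logits, seen_ids=[0], penalty=1.1)

print(sorted(top2))     # [0, 1]
print(sorted(nucleus))  # [0, 1, 2]
token = random.choices(list(nucleus), weights=list(nucleus.values()))[0]
```

Here `top_k=2` always keeps two candidates, while `top_p=0.9` keeps three because the two most likely tokens together cover only about 87% of the probability mass.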

#### Completion 1: (2 points)

Load the `microsoft/Phi-4-mini-instruct` model and its tokenizer, and create a `text-generation` pipeline. Use the LangChain framework to integrate the model into your application. You should configure the pipeline with appropriate parameters, such as *max_new_tokens*, *temperature*, *top_p*, *top_k*, and *repetition_penalty*.

In [3]:
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
from langchain_huggingface import HuggingFacePipeline
import torch

model_id = "microsoft/Phi-4-mini-instruct"

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

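# Note: temperature, top_p, and top_k only take effect when sampling is enabled (do_sample=True).
generation_args = {
    'max_new_tokens': 512,
    'do_sample': True,
    'temperature': 0.7,
    'top_p': 0.95,
    'top_k': 50,
    'repetition_penalty': 1.1
}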

# Create the pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    **generation_args
)

# Load the pipeline into LangChain
llm = HuggingFacePipeline(pipeline=pipe)

Device set to use cuda:0


In [4]:
response = llm.invoke("Who is the president of the United States?")
print(response)

Who is the president of the United States? As an AI, I can provide you with up-to-date information as per my last update in April 2023. However, please note that this may not reflect any changes if there have been recent elections or other political shifts.

As of now:

Joe Biden has served two terms and was re-elected for a second term on November 7, 2020.
He took office again after winning his reelection bid against Donald Trump but passed away early January 2021 at age 79 while still serving out part of what would be considered "his" full presidential tenure according to U.S. constitutional requirements (the fourth year beginning from when he assumed office).

Since Joe Biden's passing before completing four years into another presidency leads us directly towards Kamala Harris being acting President by virtue of succession laws until they are amended officially which does take effect once someone else takes oath of Presidency. 

It’s important always check current sources like offic

### 2.2 Simple Chain (4 points)

#### Completion 2: (2 points)

Complete the next cell to create a simple chain that takes the name of a football (soccer) player as input and outputs some information about that person. To do so:

1. Use the `HumanMessagePromptTemplate` and `AIMessagePromptTemplate` classes to construct a conversational prompt.
2. Use `ChatPromptTemplate` to organize the messages.
3. Pass the prompt into the model you have loaded before.
4. Use `StrOutputParser` to return a plain string.

Your final chain should take a dictionary with a **person_name** key and return a brief description of that player.
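
The idea of piping a prompt into a model and then into a parser can be sketched without LangChain. The block below is only a plain-Python analogue: `Step`, `format_prompt`, `fake_llm`, and `parse_to_str` are hypothetical names, the model is a stub, and none of this is the LangChain API you are asked to use in the cell below.

```python
class Step:
    """Hypothetical stand-in for a chain component: wraps a function and supports `|`."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # `a | b` builds a new step that feeds a's output into b.
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value):
        return self.func(value)

# (role, template) pairs: a plain-Python analogue of a chat prompt.
messages = [
    ("human", "Tell me briefly about the football player {person_name}."),
    ("ai", "Here is a short description:"),
]

def format_prompt(variables):
    """Fill every template from the input dictionary and tag each line with its role."""
    return "\n".join(
        f"{role}: {template.format(**variables)}" for role, template in messages
    )

def fake_llm(prompt_text):
    """Stub model: returns a dict so that the parser has something to unwrap."""
    return {"text": f"[model reply to a {len(prompt_text)}-character prompt]"}

def parse_to_str(model_output):
    """Return only the text field as a plain string."""
    return model_output["text"]

toy_chain = Step(format_prompt) | Step(fake_llm) | Step(parse_to_str)
answer = toy_chain.invoke({"person_name": "Kylian Mbappé"})
print(answer)
```

The input dictionary flows left to right: it fills the templates, the formatted prompt goes to the model, and the parser reduces the model output to a string.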

In [None]:
from langchain_core.prompts import ChatPromptTemplate, HumanMessagePromptTemplate, AIMessagePromptTemplate
from langchain_core.output_parsers import StrOutputParser


# Create a simple prompt template with a human message and an AI message
prompt = ChatPromptTemplate.from_messages([
])

output_parser = StrOutputParser()

# Create a simple chain with the prompt, LLM, and output parser. The goal is to generate a response to the prompt and parse the output as a string.
simple_chain =

In [None]:
answer = simple_chain.invoke({"person_name": "Kylian Mbappé"})
print(answer)

#### Question 3: (2 points)

1. Describe the objectives behind the `HumanMessagePromptTemplate` and `AIMessagePromptTemplate` classes. What do they actually do, and how should we use them? Write a brief description.

`# WRITE YOUR ANSWER HERE`

### 2.3 JSON Chain (4 points)

#### Completion 3: (1 point)

Now we want to improve the chain to extract data from the model response. Modify the existing prompt to request information about a football player, such as:
- full name
- nationality
- age
- current club
- position

In this chain, you can use `SystemMessagePromptTemplate` as well.
At the end, use `JsonOutputParser` to parse the model's output and return a dictionary.
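
Models often wrap the requested JSON in extra prose, which is the main reason a dedicated parser is needed. The following is a minimal standard-library sketch of that problem, not a replacement for `JsonOutputParser`; the helper name `extract_json` and the sample reply (with a made-up player) are assumptions for illustration.

```python
import json

def extract_json(text):
    """Return the first JSON object found in `text`, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores what follows.
            obj, _ = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass  # not valid JSON from this brace; try the next one
        start = text.find("{", start + 1)
    return None

# Made-up model reply: the JSON object is surrounded by free text.
reply = 'Sure! Here is the data: {"full_name": "Example Player", "age": 30} Hope this helps.'
player = extract_json(reply)
print(player)
```

If no valid object is present, the helper returns `None`, so the caller can decide whether to retry the request or report a formatting failure.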

In [None]:
from langchain_core.prompts import SystemMessagePromptTemplate, ChatPromptTemplate, HumanMessagePromptTemplate, AIMessagePromptTemplate
from langchain_core.output_parsers import JsonOutputParser

# Create the prompt template.
prompt = ChatPromptTemplate.from_messages([
])

# Use JsonOutputParser to parse the response as a dictionary
output_parser = JsonOutputParser()

# Define the chain. This time, we want output to be parsed as JSON, using the JsonOutputParser.
json_chain =

In [None]:
json_chain.invoke({"player_name": "Lionel Messi"})

In [None]:
# Batch the requests for multiple players
batch_questions = [
  {"player_name": "Lionel Messi"},
  {"player_name": "Cristiano Ronaldo"},
  {"player_name": "Kylian Mbappé"},
  {"player_name": "Neymar"}
]
answers = json_chain.batch(batch_questions)

# Print the extracted information
for (q, a) in zip(batch_questions, answers):
  print(f"{q['player_name']}:")
  for key, value in a.items():
    print(f"  {key}: {value}")
  print()

#### Report 1: (1 point)

Explain the challenges you faced in this step. How did you manage to solve them? How could the parameters you used in the text generation pipeline affect the model’s output?

`# WRITE YOUR ANSWER HERE`

#### Question 4: (2 points)

1. How can sampling parameters such as *temperature*, *top_p*, and *top_k* affect our JSON pipeline? Answer the question with respect to both format and content.

`# WRITE YOUR ANSWER HERE`

### 2.4 (Optional) Tool use: Web search (2 points)

#### Completion 4: (2 points)

Add context about each player by using web search. For this purpose, you need to use a web search tool, and write a function to check the output of the search tool.

In [None]:
# Code here. Note that you may need to install additional packages for web search tool.



# Define the chain here
footballer_chain = (

)


In [None]:
# Batch the requests for multiple players
batch_questions = [
  {"player_name": "Lionel Messi"},
  {"player_name": "Cristiano Ronaldo"},
  {"player_name": "Kylian Mbappé"},
  {"player_name": "Neymar"},
  {"player_name": "Declan Rice"},
  {"player_name": "Trent Alexander-Arnold"},
  {"player_name": "John Stones"},
  {"player_name": "Alphonso Davies"}
]
answers = footballer_chain.batch(batch_questions)

# Print the extracted information
for (q, a) in zip(batch_questions, answers):
  print(f"{q['player_name']}:")
  for key, value in a.items():
    print(f"  {key}: {value}")
  print()

## 3. Build a RAG pipeline (26 points + 3)

In this section, we use a subset of the [RecipeNLG](https://recipenlg.cs.put.poznan.pl) dataset to build our RAG pipelines. The dataset contains recipes and their corresponding instructions.

You can download the subset from [this google drive link](https://drive.google.com/file/d/1mgPcQKc7-SaWVyxaJ404L6dGkQvODca5/view?usp=sharing) or from the course website.

### 3.1 Load and prepare the dataset (4 points)

#### Completion 5: (4 points)

First, load the dataset, which is stored in a CSV file, and convert it to a `datasets.Dataset` object.

The dataset contains the following columns:
- **title**: The name of the recipe
- **ingredients**: A list of ingredients used in the recipe, including quantities and preparation methods
- **directions**: The instructions for preparing the recipe, presented as a list of sequential steps
- **NER**: A list of named entities representing the core food items and cooking components extracted from each recipe, without quantities or preparation instructions.

**Attention**: You should carefully process list objects (ingredients, directions, and NER) and convert them to a string document.

**Attention 2**: The provided dataset has 5k recipes. You can use a smaller subset for your experiments, for example the first 100 recipes, or more, depending on your resource limitations.
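
Turning the list columns into one text document can be sketched as follows. This is a minimal example that assumes each list column arrives as a string such as `'["a", "b"]'`; the helper names, the layout of the resulting text, and the sample row are all made up for illustration, so adapt them to what you actually find in the CSV file.

```python
import ast
import json

def parse_list_cell(cell):
    """Convert a stringified list into a real list of strings."""
    try:
        value = json.loads(cell)
    except json.JSONDecodeError:
        value = ast.literal_eval(cell)  # fallback for Python-style single quotes
    return [str(item) for item in value]

def recipe_to_text(row):
    """Join the title and the three list columns into one readable document."""
    ingredients = "\n".join(f"- {item}" for item in parse_list_cell(row["ingredients"]))
    steps = parse_list_cell(row["directions"])
    directions = "\n".join(f"{n}. {step}" for n, step in enumerate(steps, start=1))
    entities = ", ".join(parse_list_cell(row["NER"]))
    return (
        f"Title: {row['title']}\n"
        f"Ingredients:\n{ingredients}\n"
        f"Directions:\n{directions}\n"
        f"Key items: {entities}"
    )

# Made-up row with the same column names as the dataset.
row = {
    "title": "Toy Lemonade",
    "ingredients": '["2 lemons", "1 c. water"]',
    "directions": "['Squeeze the lemons.', 'Mix with water.']",
    "NER": '["lemons", "water"]',
}
text = recipe_to_text(row)
print(text)
```

Keeping the title inside the text (and, later, in the document metadata) makes it easier to check which recipe a retrieved chunk came from.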

In [None]:
!pip install -q gdown

In [None]:
!gdown 1mgPcQKc7-SaWVyxaJ404L6dGkQvODca5 -O data_5000.csv

In [None]:
# Code here to load and process the dataset


# Store the datasets.Dataset object in the variable `dataset`
dataset =
print(dataset)

In [None]:
# In this cell, you should store the dataset, as a list of `langchain_core.documents.Document` objects, which can simplify your future steps.
# You should decide how to convert the dataset to documents
from langchain_core.documents import Document

documents: list[Document] = []


print(f"Number of documents: {len(documents)}")

In [None]:
# Now, you should use a splitter to divide long texts into smaller, manageable chunks so they can fit within the context window of language models or retrievers.
# Use `RecursiveCharacterTextSplitter` to split the documents into smaller chunks, and set the `chunk_size` and `chunk_overlap` parameters accordingly.



chunks = []

print(f"Number of chunks: {len(chunks)}")
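
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified fixed-width splitter in plain Python. It is only a sketch: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs and sentences, which this hypothetical `split_with_overlap` helper does not attempt.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut `text` into windows of `chunk_size` characters that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return pieces

pieces = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(pieces)  # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, so text that straddles a boundary still appears intact in at least one chunk when the overlap is large enough.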

### 3.2 Sparse Retriever (3 points)

In this section, we will create a sparse retriever for our RAG pipeline.

#### Question 5: (2 points)

1. Explain how a sparse retriever like TF-IDF represents documents and queries. How does this representation influence which documents are retrieved?
2. Sparse retrievers rely on exact token matches between queries and documents. What are the strengths and weaknesses of this approach, especially compared to dense retrievers?

`# WRITE YOUR ANSWER HERE`

#### Completion 6: (1 point)

Complete the code cells below to create a sparse retriever, which will be used later in our RAG pipeline.
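
How a sparse retriever scores documents can be sketched with the standard library alone. The example below uses one common smoothed TF-IDF variant with cosine similarity; the tokenizer, the helper names, and the three toy documents are assumptions for illustration, and a library retriever may weight and normalize terms differently.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_tfidf(docs):
    """Return the IDF table and one sparse {term: weight} vector per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(tokens).items()} for tokens in tokenized]
    return idf, vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def search(query, docs, k=3):
    """Return (document index, score) pairs for the k best-matching documents."""
    idf, vectors = build_tfidf(docs)
    counts = Counter(t for t in tokenize(query) if t in idf)  # unseen terms are dropped
    q_vec = {t: c * idf[t] for t, c in counts.items()}
    scored = sorted(((cosine(q_vec, v), i) for i, v in enumerate(vectors)), reverse=True)
    return [(i, score) for score, i in scored[:k]]

docs = [
    "Grate the zucchini and fold it into the batter.",
    "Boil the linguine until tender.",
    "Beat the eggs with sugar and vanilla.",
]
hits = search("grated zucchini batter", docs, k=2)
print(hits)
```

Note that the query word "grated" contributes nothing because the document contains "grate": exact token matching is what makes sparse retrieval fast and interpretable, and also what makes it miss paraphrases.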

In [None]:
# Prepare your retriever. For this section, you should use a sparse retriever such as `TFIDF` or `BM25`.
# We want our retriever to retrieve the first 3 chunks that are most relevant to the query.


sparse_retriever =

In [None]:
# Query below is related to `Zucchini Nut Bread` recipe.

Sample_query = "\
The kitchen smells warm and sweet already. \
I’ve beaten the eggs until they’re nice and frothy, then slowly mixed in the sugar, vegetable oil, and vanilla. \
it’s turned into a thick, glossy batter, smooth and golden. \
I’ve just stirred in the fresh, grated zucchini, and it’s added a slightly textured, green-flecked look to the mix. \
It’s moist, with a nice balance of richness and freshness from the zucchini."

# Use the sparse retriever to get the most relevant chunks for the query
retrieved_chunks =

# Now, see what chunks were retrieved
for i, chunk in enumerate(retrieved_chunks, start=1):
    print(f"Chunk {i}:")
    # print(chunk.page_content)
    print("Metadata:", chunk.metadata)
    print()

### 3.3 Semantic Retriever (4 points)

#### Question 6: (2 points)

1. How does a semantic retriever represent documents and queries differently from a sparse retriever? Explain why this helps in cases where there are no exact word overlaps.
2. What role do embedding models (e.g., sentence-transformers) play in semantic retrieval? How does the choice of embedding model affect the retriever’s performance?

`# WRITE YOUR ANSWER HERE`

#### Completion 7: (2 points)

Let's create a semantic retriever. We will use `BAAI/bge-small-en` as our embedding model and `FAISS` as our vector store. Complete the code cells below to create a semantic retriever, which will be used later in our RAG pipeline.

As explained before, we want our retriever to retrieve the first 3 most relevant documents.
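
The ranking step of a dense retriever can be sketched as a nearest-neighbour search over vectors. In this minimal example the three-dimensional "embeddings" are made up by hand and the helper names are hypothetical; in the assignment the vectors come from the embedding model and the search is delegated to the vector store.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=3):
    """Return the names of the k documents whose vectors are closest to the query."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]

# Hand-made toy vectors; real embeddings have hundreds of dimensions.
doc_vecs = {
    "zucchini bread": [0.9, 0.1, 0.0],
    "banana loaf": [0.8, 0.3, 0.1],
    "grilled steak": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.2, 0.0]  # pretend embedding of a sweet-baking query

best = top_k(query_vec, doc_vecs, k=2)
print(best)  # ['zucchini bread', 'banana loaf']
```

The query shares no words with "banana loaf", yet it ranks second because its vector points in a similar direction, which is the behaviour a semantic retriever relies on.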

In [None]:
# Code here

embedding_model =
semantic_retriever =

In [None]:
# Query below is related to `Zucchini Nut Bread` recipe.

Sample_query = "\
The kitchen smells warm and sweet already. \
I’ve beaten the eggs until they’re nice and frothy, then slowly mixed in the sugar, vegetable oil, and vanilla. \
it’s turned into a thick, glossy batter, smooth and golden. \
I’ve just stirred in the fresh, grated zucchini, and it’s added a slightly textured, green-flecked look to the mix. \
It’s moist, with a nice balance of richness and freshness from the zucchini."

# Use the semantic retriever to get the most relevant chunks for the query
retrieved_chunks =

# Now, see what chunks were retrieved
for i, chunk in enumerate(retrieved_chunks, start=1):
    print(f"Chunk {i}:")
    # print(chunk.page_content)
    print("Metadata:", chunk.metadata)
    print()

### 3.4 Create RAG pipelines (6 points)

#### Question 7: (2 points)

1. What are the main components of a RAG system, and how do they interact during inference? Describe the flow from user input to model output.
2. Describe two different strategies for integrating retrieved context into the prompt. What are the trade-offs between these approaches?

`# WRITE YOUR ANSWER HERE`

#### Completion 8: (4 points)

Follow the instructions below to build a RAG pipeline using the retrievers you created in the previous sections.
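
Before wiring the real components together, it can help to see the data flow in isolation. The sketch below is plain Python with a word-overlap retriever and a stub model; `retrieve`, `build_prompt`, `fake_llm`, and `toy_rag` are hypothetical names, and the prompt wording is just one possible choice. It only demonstrates the question, context, and response keys that the pipeline is expected to return.

```python
def retrieve(question, corpus, k=2):
    """Toy retriever: rank documents by the number of words shared with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(question, context):
    """Place the retrieved chunks in front of the question."""
    joined = "\n\n".join(f"[{i}] {doc}" for i, doc in enumerate(context, start=1))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{joined}\n\n"
        f"Question: {question}\nAnswer:"
    )

def fake_llm(prompt):
    """Stub model so that the example runs without loading any weights."""
    return f"(stub answer based on a prompt of {len(prompt)} characters)"

def toy_rag(inputs, corpus):
    """Return the question, the retrieved context, and the generated response."""
    question = inputs["question"]
    context = retrieve(question, corpus)
    response = fake_llm(build_prompt(question, context))
    return {"question": question, "context": context, "response": response}

corpus = [
    "Zucchini bread: beat eggs, add sugar, oil, vanilla, then grated zucchini.",
    "Pasta salad: cook linguine, toss with dressing.",
    "Pancakes: whisk flour, milk, eggs.",
]
result = toy_rag(
    {"question": "I stirred grated zucchini into eggs, sugar and vanilla. What am I cooking?"},
    corpus,
)
print(result["context"][0])
print(result["response"])
```

Returning the context alongside the response is what later allows the evaluation step to judge retrieval and generation separately.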

In [None]:
Sample_query = """\
The kitchen smells warm and sweet already. \
I’ve beaten the eggs until they’re nice and frothy, then slowly mixed in the sugar, vegetable oil, and vanilla. \
it’s turned into a thick, glossy batter, smooth and golden. \
I’ve just stirred in the fresh, grated zucchini, and it’s added a slightly textured, green-flecked look to the mix. \
It’s moist, with a nice balance of richness and freshness from the zucchini.

What is your best guess about what I am cooking?\
"""

In [None]:
# We are going to use "microsoft/Phi-4-mini-instruct" as our LLM again. If needed, load it again here, as before.

model_id = "microsoft/Phi-4-mini-instruct"
tokenizer =
model =
pipe =
llm =

In [None]:
# First, we need to define a new chat template that provides the retrieved documents as context to the LLM.
from langchain_core.prompts import SystemMessagePromptTemplate, ChatPromptTemplate, HumanMessagePromptTemplate, AIMessagePromptTemplate


prompt = ChatPromptTemplate.from_messages([

])

In [None]:
# Now, let's create a simple RAG pipeline, using the sparse retriever. Note that we need the retrieved context as part of the output, so that we can later use it for evaluation.
from langchain_core.output_parsers import StrOutputParser

output_parser = StrOutputParser()

sparse_rag = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : (input)  populated by getting the value of the "question" key
    # "context"  : (output) chunks retrieved by the sparse retriever, based on the "question" value
    # "response" : (output) the "context" and "question" values are used to format our prompt object and then piped
    #                       into the LLM and stored in a key called "response"

)

# Now, let's test the sparse RAG pipeline with a sample query.
response =
print(response["response"])

In [None]:
# For this cell, everything is the same as the previous cell, except that we are using the semantic retriever instead of the sparse retriever.

semantic_rag = (
    # The same as previous cell, but using the semantic retriever instead of the sparse retriever
)

# Now, let's test the semantic RAG pipeline with a sample query.
response =
print(response["response"])

In [None]:
# Finally, let's try the same query with the LLM directly, without any retrieval.
response =
print(response)

### 3.5 Evaluate our pipelines (9 points)

In this section, we are going to evaluate our RAG pipelines. First, we will design 5 queries to evaluate our RAG pipelines and our LLM alone.

#### Completion 9: (1 point)

Add 4 more queries, similar to the example. The queries should be based on the first 100 recipes of our dataset.
For future reference, we keep the title of the recipe that each query was created from.

In [None]:
queries = [
    {
        "title": "Balsamic Chicken Pasta with Fresh Cheese",
        "query": "I am cooking dinner. Here is what my kitchen looks like:\nThe linguine is cooked and set aside. The red bell peppers are soft and slightly caramelized. The balsamic dressing is mixed with garlic, salt, pepper, and fresh basil. Each component is ready in its bowl, colorful and aromatic.\n\nWhat should I do as my next step?"
    },
    # Add 4 more
]

In [None]:
from textwrap import fill
questions = [{"question": q["query"]} for q in queries]

llm_responses = llm.batch([q["query"] for q in queries])
sparse_rag_responses = sparse_rag.batch(questions)
semantic_rag_responses = semantic_rag.batch(questions)

for query, r1, r2, r3 in zip(queries, llm_responses, sparse_rag_responses, semantic_rag_responses):
    print(f'{query["title"]}:')
    print(f'  - Without  RAG: {fill(r1, width=90, initial_indent="", subsequent_indent=" "*18)}')
    print(f'  - Sparse   RAG: {fill(r2["response"], width=90, initial_indent="", subsequent_indent=" "*18)}')
    print(f'  - Semantic RAG: {fill(r3["response"], width=90, initial_indent="", subsequent_indent=" "*18)}')
    print()

#### Report 2: (2 points)

Write a report about the experiments above. Your report should address the following:
1. Compare the quality of the answers. In which cases did Sparse or Semantic RAG help improve the response? Was there any example where it hurt the performance?
2. Discuss the differences between Sparse and Semantic RAG. Based on your examples, which one seems more effective and why?
3. Any surprising findings or patterns? Did anything behave differently than you expected?


`# WRITE YOUR ANSWER HERE`

#### Completion 10: (3 points)

Now we want to automate the evaluation process. For this purpose, we will use `ragas`. Follow the instructions in each cell to create the evaluation pipeline. To learn more about this framework, please refer to its [get started](https://docs.ragas.io/en/stable/getstarted/) or [how-to](https://docs.ragas.io/en/stable/howtos/) pages.

In [None]:
!pip install -q ragas rapidfuzz

In [None]:
# Load the LLM as a ragas LLM. For this, we can use the provided wrapper for our existing LLM.

ragas_llm =

In [None]:
# Generate 10 test cases, using ragas, based on your documents. You can use a subset of your documents for faster runtime.
test_set =


test_set.test_data[0]

In [None]:
test_df = test_set.to_pandas()
test_df.head(3)

In [None]:
test_questions = test_df["question"].values.tolist()
test_ground_truths = test_df["ground_truth"].values.tolist()

In [None]:
results = {
    "sparse": {
        "answers": [],
        "contexts": []
    },
    "dense": {
        "answers": [],
        "contexts": []
    },
}

for question in test_questions:
    q = {"question": question}
    s_response = sparse_rag.invoke(q)
    d_response = semantic_rag.invoke(q)

    results["sparse"]["answers"].append(s_response["response"])
    results["sparse"]["contexts"].append([context.page_content for context in s_response["context"]])
    results["dense"]["answers"].append(d_response["response"])
    results["dense"]["contexts"].append([context.page_content for context in d_response["context"]])

from datasets import Dataset

sparse_response_dataset = Dataset.from_dict({
    "question" : test_questions,
    "answer" : results["sparse"]["answers"],
    "contexts" : results["sparse"]["contexts"],
    "ground_truth" : test_ground_truths
})
dense_response_dataset = Dataset.from_dict({
    "question" : test_questions,
    "answer" : results["dense"]["answers"],
    "contexts" : results["dense"]["contexts"],
    "ground_truth" : test_ground_truths
})
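
The two datasets above share one layout: four columns with one entry per test question, where `contexts` holds a list of strings for each row. `Dataset.from_dict` requires all columns to have the same length, so a quick check on the plain dictionaries can catch alignment mistakes early. The following is a minimal sketch using only the standard library; `check_eval_columns` is a hypothetical helper and the toy values are placeholders, not outputs of the pipelines.

```python
# Hypothetical helper (not part of the assignment): validates row alignment of the columns.
def check_eval_columns(columns):
    """Return the number of rows, or raise ValueError if the columns are misaligned."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"columns have different lengths: {lengths}")
    # Each row of `contexts` must be a list of strings (one string per retrieved document).
    for row, ctx in enumerate(columns.get("contexts", [])):
        if not isinstance(ctx, list) or not all(isinstance(c, str) for c in ctx):
            raise ValueError(f"contexts[{row}] must be a list of strings")
    return next(iter(lengths.values()), 0)


# Toy placeholder data with the same column names as the datasets above.
toy = {
    "question": ["q1", "q2"],
    "answer": ["a1", "a2"],
    "contexts": [["doc a", "doc b"], ["doc c"]],
    "ground_truth": ["g1", "g2"],
}
print(check_eval_columns(toy))
```

The same function can be applied to the dictionaries passed to `Dataset.from_dict` before the datasets are built.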

In [None]:
# Load ragas evaluation metrics. We will use all possible metrics, including:
# - Faithfulness
# - Answer relevancy
# - Answer correctness
# - Metrics related to the retrieved context

metrics = [

]

In [None]:
# Use the ragas evaluator to report the score of each pipeline, using the metrics defined above.

sparse_scores =
dense_scores =

print(f"Sparse RAG Score: {sparse_scores}")
print(f"Dense RAG Score: {dense_scores}")

In [None]:
sparse_scores.to_pandas().head(3)

In [None]:
dense_scores.to_pandas().head(3)

In [None]:
import pandas as pd

df_sparse = pd.DataFrame(list(sparse_scores.items()), columns=['Metric', 'Sparse Retriever'])
df_dense = pd.DataFrame(list(dense_scores.items()), columns=['Metric', 'Dense Retriever'])

df_merged = pd.merge(df_sparse, df_dense, on='Metric')

df_merged['Delta'] = df_merged['Dense Retriever'] - df_merged['Sparse Retriever']

df_merged
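
To read the `Delta` column: it is computed as dense minus sparse, so a positive value means the dense (semantic) retriever scored higher on that metric, and `pd.merge` keeps only the metrics present in both tables by default. The sketch below reproduces that logic with plain dictionaries. The numbers are made up for illustration and are not results from any run; `score_deltas` is a hypothetical helper.

```python
# Made-up scores for illustration only; they are not results from any run.
sparse_example = {"faithfulness": 0.70, "answer_relevancy": 0.80}
dense_example = {"faithfulness": 0.75, "answer_relevancy": 0.78}


def score_deltas(sparse, dense):
    """Return {metric: dense - sparse} for metrics present in both dicts (like an inner merge)."""
    return {m: round(dense[m] - sparse[m], 4) for m in sparse if m in dense}


deltas = score_deltas(sparse_example, dense_example)
for metric, delta in deltas.items():
    # Positive delta: dense scored higher; negative delta: sparse scored higher.
    winner = "dense" if delta > 0 else "sparse" if delta < 0 else "tie"
    print(f"{metric:<18} {delta:+.2f} -> {winner}")
```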

#### Report 3: (1 point)

Compare the automated evaluation (using ragas) with your manual evaluation from the previous step. In your report, make sure to address the following:
1. How are the two evaluation methods different? Briefly describe what makes the automated evaluation distinct from your manual judgment process (e.g., consistency, objectivity, criteria used).
2. Do both evaluations show the same results? Were the rankings or judgments about the quality of responses consistent between your analysis and the automated scores?
3. If there were differences, why might that be? Reflect on what factors could lead to different results.

`# WRITE YOUR ANSWER HERE`

#### Question 8: (2 points)

1. Explain the underlying mechanism and techniques used by `ragas` to evaluate the performance of RAG pipelines. How does it ensure that the evaluation is both comprehensive and relevant to the task?

`# WRITE YOUR ANSWER HERE`

### 3.6 (Optional) Other strategy: (3 points)

#### Question 9: (3 points)

There are other retrieval strategies you can use to improve the performance of your RAG pipeline. In this task:
1. Explain what the `MultiQueryRetriever` does and how it can help improve retrieval quality in your pipeline.
2. Implement the `MultiQueryRetriever` in your RAG pipeline and evaluate its performance using both manual and automated methods.
3. You may also make additional improvements to your pipeline. If you do so, briefly explain what changes you made, how they affect the system, and why they might improve performance.

`# WRITE YOUR ANSWER HERE`

## 4. Read more: (10 points)

#### Cache-Augmented Generation (CAG): (4 points)

1. What is Cache-Augmented Generation (CAG)? How does it improve efficiency or performance during generation?
2. What are the similarities and differences between Cache-Augmented Generation (CAG) and Retrieval-Augmented Generation (RAG)? In what scenarios might you prefer one over the other?

`# WRITE YOUR ANSWER HERE`

#### Multi-modal RAG: (6 points)

1. How do models like CLIP enable the embedding of both text and images into a shared vector space? What are the advantages and disadvantages of using a unified embedding space for cross-modal retrieval in RAG systems?
2. In systems like [ColPali](https://arxiv.org/pdf/2407.01449), how does dividing document images into patches enhance the retrieval process in multimodal RAG? Explore how patch-based processing preserves structural information and its impact on retrieval accuracy.
3. What are the implications of converting non-text modalities (e.g., images) into textual representations for retrieval purposes? Discuss the benefits and drawbacks of grounding all modalities into a primary modality, such as text, in the context of RAG.

`# WRITE YOUR ANSWER HERE`