
<div align="center">
  <h1>Stylized Retrieval-Augmented Generation</h1>
  <h4 align="center">Assignment II</h4>
</div>

Welcome to Assignment II! In this notebook, you will build and implement a Retrieval-Augmented Generation (RAG) pipeline tailored for a text style transfer application.

**By the end of this assignment, you'll be able to:**

*   Build a Retrieval-Augmented Generation (RAG) pipeline to enhance text generation with external knowledge.
*   Retrieve relevant information from a dataset or knowledge base to support text generation.
*   Implement a neural style transfer model to transform text into a desired writing style.
*   Combine retrieved content and style transfer to create a coherent and stylistically customized output.





## Important Note on Submission


*   Do not use ChatGPT or any other AI tool to directly produce the code. If you need assistance, refer to Exercise 5 for guidance.
*   You are allowed to work in a group of up to 3 members.
*   Do not copy code or answers from other groups. Collaboration is encouraged only within your own group.
*   Ensure that your notebook is runnable without any errors. Submissions with errors will not be accepted.
*   Answers to open-ended questions must be original and not copied from other groups or AI tools like ChatGPT.
*   The submission should be a single .ipynb notebook that lists the group members' names (as on Openlat) and matriculation numbers.



## Group Members


1. First member:
  * Name: Hoang Long Nguyen
  * Matriculation no.: 428832
2. Second member:
  * Name: Mateen Mahmood
  * Matriculation no.: 426365
3. Third member:
  * Name: Vibha Kedigemane Trivikram
  * Matriculation no.: 429106

### Table of Contents
- [1. Access to Hugging Face](#1-access-to-hugging-face)
- [2. Packages](#2-packages)
- [3. Problem Statement](#3-problem-statement)
- [4. Fetch and Parse](#4-fetch-and-parse)
- [5. Calculate Word Stats](#5-calculate-word-stats)
- [6. Set Up LLM](#6-set-up-llm)
- [7. BM25 Retriever](#7-bm25-retriever)
- [8. Build Chroma](#8-build-chroma)
- [9. Ensemble Retriever](#9-ensemble-retriever)
- [10. Format Documents](#10-format-documents)
- [11. RAG Chain](#11-rag-chain)
- [12. Final Response](#12-final-response)


# 1. Access to Hugging Face
Execute the following cell to connect to your Hugging Face account.

In [1]:
import getpass
import os

# Prompt user for Hugging Face API token if not already set
if "HUGGINGFACEHUB_API_TOKEN" not in os.environ:
    os.environ["HUGGINGFACEHUB_API_TOKEN"] = getpass.getpass("Enter your Huggingfacehub API token: ")

# 2. Packages
Execute the following code cells for installing the packages needed for creating your Stylized RAG.

Note: If there are package conflicts, you can use pip-tools to automatically find and install compatible versions.

In [2]:
# !pip install langchain
# !pip install langchain-community
# !pip install langchain-huggingface
# !pip install bs4
# !pip install rank_bm25
# !pip install huggingface_hub
# !pip install requests
# !pip install langchain-chroma
# !pip install transformers==4.46.0

# 3. Problem Statement
In this assignment, you will implement **Text Style Transfer**, a technique that modifies text style while preserving its content. You will build an **ensemble retriever** combining **BM25** for keyword-based retrieval and **Chroma** for semantic search to retrieve relevant documents, which will be used as input for the style transfer process. This project integrates classical retrieval methods with modern neural embeddings for practical NLP applications.

**What is text style transfer?**

**Text Style Transfer** is a natural language processing (NLP) technique that modifies the style of a given text while preserving its original content. It allows for the transformation of linguistic expressions to convey different tones, emotions, or writing styles without altering the underlying meaning. For example, it can rephrase formal text into a casual tone, adapt neutral statements into an emotional tone, or convert modern language into a Shakespearean style. This technique has applications in personalized communication, creative writing, sentiment adjustment, and even domain adaptation, making it a powerful tool for generating diverse textual outputs tailored to specific needs.

### Example of Text Style Transfer:

#### **Input (Neutral Tone):**
"I am excited about the opportunity to work on this project."

#### **Output (Formal Tone):**
"I am genuinely enthusiastic about the prospect of contributing to this project."

#### **Output (Casual Tone):**
"I'm super pumped to get started on this project!"

#### **Output (Shakespearean Style):**
"Verily, I am thrilled by the chance to partake in this noble endeavor."


# 4. Fetch and Parse
In this part of the assignment, you are tasked with:

*    Fetching and parsing web content: Write a function that fetches the HTML content of a webpage and processes it to extract clean, readable text.
*    Splitting text into smaller chunks: Implement a function to split the text into overlapping chunks, ensuring that each chunk is manageable for downstream tasks.

In [3]:
import os
import requests
import numpy as np
from bs4 import BeautifulSoup
from langchain.schema import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

def fetch_and_parse(url: str) -> str:
    """
    Fetch the webpage content at `url` and return a cleaned string of text.

    Parameters:
    - url (str): The URL of the webpage to fetch.

    Returns:
    - str: Cleaned text content extracted from the webpage.
    """

    # Step 1: Fetch the webpage content using the requests library.
    # Fetch the content of the URL.
    # Ensure the request is successful.
    
    # Step 2: Parse the HTML content using BeautifulSoup.

    # Step 3: Extract the text content from the parsed HTML.

    # Step 4: Return the cleaned text.

    # Write your code here.
    try:
        # Step 1
        fetch = requests.get(url, timeout = 10)
        if fetch.status_code != 200:
            print(f"Failed to fetch URL: {url} with status code: {fetch.status_code}")
            return ""
        # Step 2
        soup = BeautifulSoup(fetch.content, 'html.parser')
        # Step 3
        cleaned_text = soup.get_text()
        # Step 4
        return cleaned_text
    except requests.exceptions.RequestException as e:
        print(f"An error occurred: {e}")
        return ""

def split_text_into_documents(text: str, chunk_size: int = 500, chunk_overlap: int = 100):
    """
    Split a long text into overlapping chunks and return them as a list of Documents.

    Parameters:
    - text (str): The long text to split.
    - chunk_size (int): The size of each chunk in characters (default is 500).
    - chunk_overlap (int): The number of overlapping characters between consecutive chunks (default is 100).

    Returns:
    - list: A list of Documents, each containing a chunk of text.
    """
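    # Initialize an empty list to store the chunks.
    docs = []
    # Advance by chunk_size - chunk_overlap so consecutive chunks share chunk_overlap characters.
    step = chunk_size - chunk_overlap
    if step <= 0:
        raise ValueError("chunk_overlap must be smaller than chunk_size")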
    for i in range(0, len(text), step):
        chunk = text[i:i+chunk_size]
        docs.append(Document(page_content=chunk))
    return docs

In [4]:
print("Step 1: Scraping content and splitting into documents...")
example_urls = [
    "https://en.wikipedia.org/wiki/Artificial_intelligence",
    "https://en.wikipedia.org/wiki/Machine_learning"
]

# Step 1A: Initialize an empty list to store all documents
all_docs = []

# Step 1B: Iterate through the URLs to fetch and process content
for url in example_urls:
    print(f"Scraping content from: {url}")

    # Step 1B.1: Fetch and parse the raw text from the URL
    raw_text = fetch_and_parse(url)

    # Step 1B.2: Split the raw text into chunks (documents)
    splits = split_text_into_documents(raw_text)

    # Step 1B.3: Add the chunks to the list of documents
    all_docs.extend(splits)
print(type(all_docs))

print(f"Total number of documents: {len(all_docs)}")

Step 1: Scraping content and splitting into documents...
Scraping content from: https://en.wikipedia.org/wiki/Artificial_intelligence
Scraping content from: https://en.wikipedia.org/wiki/Machine_learning
<class 'list'>
Total number of documents: 820


1. Why do we split the text into smaller chunks before storing or processing it?

Answer: We split the text into smaller chunks to avoid exceeding the LLM's context window. Chunking keeps each document small enough to embed, index, and pass to the LLM, and it lets the retriever return only the passages relevant to a query instead of a whole page. In short, we do it for contextual clarity, efficient indexing, and to respect the model's input size limits.
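
To make the overlap concrete, here is a minimal sketch of the stride arithmetic behind fixed-size chunking (`step = chunk_size - chunk_overlap`). The helper name `chunk_spans` is hypothetical and works on character offsets only, not on `Document` objects.

```python
def chunk_spans(text_length: int, chunk_size: int = 500, chunk_overlap: int = 100):
    """Return (start, end) character offsets of fixed-size overlapping chunks."""
    step = chunk_size - chunk_overlap
    if step <= 0:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Each chunk starts `step` characters after the previous one,
    # so neighbouring chunks share `chunk_overlap` characters.
    return [
        (start, min(start + chunk_size, text_length))
        for start in range(0, text_length, step)
    ]


spans = chunk_spans(1200)
print(spans)  # [(0, 500), (400, 900), (800, 1200)]
```

With the defaults, a 1200-character text yields three chunks, and each pair of neighbours shares 100 characters.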

2. What challenges might you face when fetching and parsing web content, and how would you handle them?

Answer: Websites differ in structure, which makes it difficult to extract information consistently; libraries such as BeautifulSoup help by parsing the HTML and coping with varied markup. Websites may also rate-limit or block scraping attempts; we can send browser-like headers and throttle our requests. Another challenge is broken or invalid URLs, which lead to error status codes or exceptions. To handle this, `fetch_and_parse` checks the status code and catches request exceptions.
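
The header and throttling ideas could be sketched roughly as follows. This is an illustration under assumptions, not part of the assignment code: `fetch_with_retries` is a hypothetical helper, the `get` argument is expected to behave like `requests.get` (accepting `headers` and `timeout` and returning an object with `status_code` and `text`), and the User-Agent string is made up.

```python
import time
from typing import Callable, Optional


def fetch_with_retries(
    url: str,
    get: Callable,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call `get(url)` up to `max_attempts` times with exponential backoff.

    Returns the response text on HTTP 200, or None if every attempt fails.
    """
    # Assumed, illustrative header; some sites reject requests without a User-Agent.
    headers = {"User-Agent": "Mozilla/5.0 (compatible; course-notebook)"}
    for attempt in range(max_attempts):
        try:
            response = get(url, headers=headers, timeout=10)
            if response.status_code == 200:
                return response.text
        except Exception as exc:  # with requests: requests.exceptions.RequestException
            print(f"Attempt {attempt + 1} failed: {exc}")
        if attempt < max_attempts - 1:
            # Throttle: wait 1s, 2s, 4s, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
    return None
```

In the notebook this would be called as `fetch_with_retries(url, requests.get)`; passing `get` and `sleep` as arguments keeps the function easy to test without network access.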

3. In the context of RAG, how would errors in the fetch_and_parse function affect the overall pipeline?

Answer: Errors in `fetch_and_parse` propagate through the whole RAG pipeline. If pages are missing or the extracted text is incomplete or noisy, the chunks and their embeddings will be of low quality, so the retriever cannot find the relevant information and the LLM generates incomplete, inaccurate, or irrelevant content.

# 5. Calculate Word Stats

In this task, you will implement a function to calculate basic word and character statistics for a list of documents. Each document is represented as a Document object with a page_content attribute that contains its text.

Your task is to:

1. Calculate the total number of words and characters across all documents.
2. Compute the average number of words and characters per document.
3. Print the average statistics in a human-readable format.

In [5]:
def calculate_word_stats(texts):
    """
    Calculate and display average word and character statistics for a list of documents.

    Parameters:
    - texts (list): A list of Document objects, where each Document contains a `page_content` attribute.

    Returns:
    - None: Prints the average word and character counts per document.
    """

    # Step 1: Initialize variables to keep track of total words and total characters.
    total_words, total_characters = 0, 0

    # Step 2: Iterate through each document in the `texts` list.
    for doc in texts:
        content = doc.page_content
        word_count = len(content.split())
        char_count = len(content)
        total_words += word_count
        total_characters += char_count

    # Step 3: Calculate the average words and characters per document.
    # - Avoid division by zero by checking if the `texts` list is not empty.
    num_docs = len(texts)
    avg_words = total_words / num_docs if num_docs > 0 else 0
    avg_characters = total_characters / num_docs if num_docs > 0 else 0

    # Step 4: Print the calculated averages in a readable format.
    # Example: "Average words per document: 123.45"
    print(f"Average words per document: {avg_words}")
    print(f"Average characters per document: {avg_characters}")

In [6]:
# Execute this cell to test your calculate_word_stats function.
# Create sample Document objects with text content for testing your code above.
sample_docs = [
    Document(page_content="This is the first test document."),
    Document(page_content="Here is another example document for testing."),
    Document(page_content="Short text."),
    Document(page_content="This document has more content. It's longer and has more words in it for testing purposes."),
]

# Call the function with the sample documents to calculate word statistics.
calculate_word_stats(sample_docs)


Average words per document: 7.75
Average characters per document: 44.5


1. What potential issues could arise if the texts list is empty or contains documents with no content, and how would you address them?

Answer: If the `texts` list is empty, `total_words / num_docs` and `total_characters / num_docs` would raise a division-by-zero error. If the list contains empty documents, the averages are pulled down and become misleading, especially in datasets with many empty documents. To address the first issue, we check that the list is not empty before dividing (`if num_docs > 0 else 0`). For the second, we can skip documents with empty content during the iteration (`if not content.strip(): continue`) and divide by the number of non-empty documents.
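
A minimal sketch of both safeguards is shown below. The `Doc` dataclass is a hypothetical stand-in for LangChain's `Document` so the example is self-contained, and `average_stats` is an illustrative name that returns the averages instead of printing them.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a Document with a `page_content` attribute."""
    page_content: str


def average_stats(texts):
    """Return (avg_words, avg_chars) over non-empty documents, or (0.0, 0.0) if there are none."""
    # Skip documents whose content is empty or whitespace only.
    non_empty = [d.page_content for d in texts if d.page_content.strip()]
    if not non_empty:
        return 0.0, 0.0  # avoids division by zero
    total_words = sum(len(t.split()) for t in non_empty)
    total_chars = sum(len(t) for t in non_empty)
    return total_words / len(non_empty), total_chars / len(non_empty)


print(average_stats([]))                                         # (0.0, 0.0)
print(average_stats([Doc("Short text."), Doc("   "), Doc("")]))  # (2.0, 11.0)
```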

2. Why is it beneficial to calculate both word count and character count instead of just one of them?

Answer: It is beneficial because language models work on tokenized inputs, and the character count works as a rough proxy for token usage. The word count alone might not capture the tokenization cost, especially when the text contains special characters. A high word count with a low character count indicates short, simple words, whereas a high character count with a low word count indicates long or complex words. The character count also reflects the full size of the text, including spaces and punctuation, and helps in spotting anomalies.
Finally, word counts are useful for readability and summarization tasks, while character counts are useful for estimating token usage.

# 6. Set Up LLM

In this part of the assignment, you will implement a function to set up a Large Language Model (LLM) using the Hugging Face Endpoint API. This function will:

1. Initialize and connect to a pre-trained model available on Hugging Face.
2. Allow customization of parameters like the model repository ID and generation temperature.
3. Return the configured LLM object, which will be used later for text generation tasks in the RAG pipeline.

In [7]:
from langchain_huggingface import HuggingFaceEndpoint

def setup_llm(repo_id="mistralai/Mistral-7B-Instruct-v0.3", temperature=1.0):
    """
    Set up and return a Hugging Face LLM using the specified model repository ID and generation parameters.

    Parameters:
    - repo_id (str): The repository ID of the Hugging Face model to use (default: "mistralai/Mistral-7B-Instruct-v0.3").
    - temperature (float): The generation temperature to control creativity in outputs (default: 1.0).

    Returns:
    - HuggingFaceEndpoint: A configured LLM object ready for text generation.
    """

    # Step 1: Import the HuggingFaceEndpoint class.
    # - This class allows you to connect to a Hugging Face model hosted on an endpoint.

    # Step 2: Configure the LLM connection.
    # - Use the HuggingFaceEndpoint class to set up the LLM.

    # Step 3: Return the configured LLM object.
    # - The returned LLM can be used for generating text based on input prompts.

    # Write your code here.
    llm = HuggingFaceEndpoint(
        repo_id=repo_id,
        temperature=temperature,
    )
    return llm

1. What would happen if the temperature is set to an extreme value (e.g., 0 or 10)? How would you prevent misuse?

Answer: At extreme temperatures the output becomes either almost deterministic (close to 0) or highly random (e.g., 10). Near 0, the model almost always picks the most probable token, so responses are predictable and repetitive, and creativity is low. At 10, the probability distribution is flattened, so less probable tokens are sampled much more often, leading to inconsistent and often incoherent responses. To prevent misuse, we can restrict the temperature to a sensible range, for example 0.5 to 1.2, and validate that the value passed in lies within these limits.
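
The effect of temperature can be illustrated with a small softmax sketch. The logits below are made-up toy values and `softmax_with_temperature` is an illustrative helper; the sketch only shows the standard scaling of logits by `1 / temperature` and does not reproduce how any particular endpoint samples tokens.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by `temperature`."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.0]  # assumed scores for three candidate tokens
for t in (0.1, 1.0, 10.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

At a low temperature almost all probability mass sits on the top token; at a high temperature the three probabilities are close to uniform, which is why the sampled text becomes erratic.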
 
2. If the LLM generates incorrect or irrelevant responses, what steps would you take to diagnose and fix the issue?

Answer: To diagnose the issue, we can test each stage of the pipeline separately, print the intermediate results, and analyze where the problem resides. In this assignment, we noticed that the `split_text_into_documents` function has a large impact on the output. With `RecursiveCharacterTextSplitter`, the results were good. As soon as we switched to a plain for loop for chunking, the generated text became nonsensical. We mitigated this by tuning `chunk_size` and `chunk_overlap` and settled on 500 and 100, respectively. The output still contains some nonsensical sentences, but far fewer than with other values. Our understanding is that a prebuilt splitter such as `RecursiveCharacterTextSplitter` produces more meaningful chunks with complete sentences, whereas a simple for loop cuts at fixed offsets and cannot take the structure of the text into account.
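
As a rough illustration of structure-aware chunking, the sketch below packs whole sentences into chunks. It is a simplified stand-in and not the algorithm used by `RecursiveCharacterTextSplitter`: the name `chunk_by_sentences` is hypothetical, sentence boundaries are detected with a naive regular expression, and no overlap is applied.

```python
import re


def chunk_by_sentences(text: str, chunk_size: int = 500):
    """Greedily pack whole sentences into chunks of at most `chunk_size` characters.

    A single sentence longer than `chunk_size` is kept intact as its own chunk.
    """
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > chunk_size:
            chunks.append(current)  # close the current chunk at a sentence boundary
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "RAG retrieves context. The LLM then rewrites it. Style is applied last."
print(chunk_by_sentences(sample, chunk_size=50))
# ['RAG retrieves context. The LLM then rewrites it.', 'Style is applied last.']
```

Unlike fixed-offset slicing, no chunk starts or ends in the middle of a sentence, at the cost of uneven chunk lengths.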

# 7. BM25 Retriever

In this task, you will implement a BM25 Retriever, a critical component of the RAG pipeline.

Your task is to:

1. Initialize the BM25 retriever with a set of documents.
2. Implement a method to retrieve the top k most relevant documents for a given query.
3. Use efficient tokenization and scoring to ensure accurate and fast results.

This component will enable the pipeline to fetch relevant information from a corpus, which is then passed to the LLM for further processing.

In [8]:
from rank_bm25 import BM25Okapi
from langchain_core.runnables import RunnablePassthrough

class BM25Retriever:
    """
    A class to implement BM25-based document retrieval.

    Attributes:
    - documents (list): A list of Document objects.
    - corpus (list): A list of strings representing the document contents.
    - tokenized_corpus (list): A list of tokenized documents (lists of words).
    - bm25 (BM25Okapi): The BM25 retriever initialized with the tokenized corpus.
    """

    def __init__(self, documents):
        """
        Initialize the BM25 retriever with the given documents.

        Parameters:
        - documents (list): A list of Document objects.
        """
        # Step 1: Store the input documents.
        # Hint: Use the `page_content` attribute of each Document object to extract text.
        # Step 2: Tokenize the corpus.
        # Hint: Use the `.split()` method to tokenize each document into words.
        # Step 3: Initialize the BM25 retriever with the tokenized corpus.
        self.documents = documents
        self.corpus = [doc.page_content for doc in documents]
        self.tokenized_corpus = [doc.lower().split() for doc in self.corpus]
        self.bm25 = BM25Okapi(self.tokenized_corpus)

    def retrieve(self, query, k=5):
        """
        Retrieve the top `k` most relevant documents for a given query.

        Parameters:
        - query (str): The input query as a string.
        - k (int): The number of top documents to return (default is 5).

        Returns:
        - list: A list of the top `k` relevant documents as strings.
        """
        # Step 1: Tokenize the input query.
        # Hint: Use `.split()` to tokenize the query into words.
        # Step 2: Use the BM25 retriever to score and rank documents.
        # Hint: Use the `bm25.get_top_n()` method to retrieve the top `k` documents.
        # Step 3: Return the top `k` relevant documents.
        tokenized_query = query.lower().split()
        bm25_k_results = self.bm25.get_top_n(tokenized_query, self.corpus, n=k)
        return bm25_k_results

Execute the following code to test your implementation.

In [9]:
from langchain.schema import Document

# Create sample Document objects.
sample_docs = [
    Document(page_content="Machine learning is a method of data analysis that automates analytical model building."),
    Document(page_content="Deep learning is a subset of machine learning that uses neural networks with three or more layers."),
    Document(page_content="Artificial intelligence encompasses a wide range of technologies, including machine learning and deep learning."),
    Document(page_content="Natural language processing is a field of AI focused on the interaction between computers and human language."),
]

# Initialize the retriever with the sample documents.
retriever = BM25Retriever(sample_docs)

# Test the retriever with a query.
query = "What is machine learning?"
top_docs = retriever.retrieve(query,k=2)

# Print the results.
print("Top Relevant Documents:")
for idx, doc in enumerate(top_docs, 1):
    print(f"{idx}. {doc}")

Top Relevant Documents:
1. Machine learning is a method of data analysis that automates analytical model building.
2. Deep learning is a subset of machine learning that uses neural networks with three or more layers.


Expected output:

Top Relevant Documents:
1. Machine learning is a method of data analysis that automates analytical model building.
2. Artificial intelligence encompasses a wide range of technologies, including machine learning and deep learning.


1. If two documents have identical content except for minor differences (e.g., synonyms or paraphrasing), how might BM25 handle this, and why?

Answer: BM25 is a purely lexical method: it scores a document by how often the query terms occur in it, weighted by term rarity and normalized by document length. If two documents are identical except for a paraphrase (e.g., "AI" vs. "artificial intelligence"), the document containing the exact query terms scores higher. For the query "what is artificial intelligence", BM25 matches the document that contains the words "artificial intelligence" and gives no credit for the abbreviation "AI". The paraphrased document is therefore ranked lower, because BM25 has no notion of synonyms. Hence, even small differences in word choice lead to differences in BM25 scores.
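
The synonym blind spot can be demonstrated with a small, self-contained BM25 scorer. This is a simplified sketch and not the `rank_bm25` implementation: it uses a non-negative IDF variant, `k1` and `b` are set to commonly used values, and the two toy documents are invented for illustration.

```python
import math


def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with a basic Okapi BM25 formula."""
    if not tokenized_docs:
        return []
    n_docs = len(tokenized_docs)
    avg_len = sum(len(d) for d in tokenized_docs) / n_docs
    scores = []
    for doc in tokenized_docs:
        score = 0.0
        for term in query_tokens:
            tf = doc.count(term)  # term frequency in this document
            if tf == 0:
                continue  # terms that do not occur contribute nothing
            df = sum(1 for d in tokenized_docs if term in d)  # document frequency
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)  # non-negative IDF variant
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


toy_docs = [
    "artificial intelligence studies intelligent agents",
    "ai studies intelligent agents",
]
tokenized = [d.split() for d in toy_docs]
scores = bm25_scores("what is artificial intelligence".split(), tokenized)
print(scores)  # the second score is 0.0: "ai" never matches "artificial intelligence"
```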

2. What challenges might arise if the corpus contains very short or very long documents? How would you address these challenges?

Answer: With very short documents, there is little context, so all documents tend to score low. For example, if the user asks a complex question, few of the query terms appear in any single document, so the corpus needs to contain enough information for the query to be matched against it. To handle this, we can use larger chunks or merge short documents with similar context into larger ones.

With very long documents, the signal can be "watered down": the significance of important words is reduced. For example, if a document contains many words like 'is', 'are', 'a', and 'the', the keywords lose weight. We handle this by breaking large documents into smaller chunks, using prebuilt splitters that find good break points and preserve meaningful context within each chunk.

# 8. Build Chroma
In this task, students will implement a function to build a Chroma vector store, a key component of the RAG pipeline. The Chroma vector store enables efficient semantic search by embedding documents into a high-dimensional vector space. Using these embeddings, the retriever can find documents that are semantically similar to a given query.

The task involves:

1. Initializing a vector store (Chroma) with Hugging Face embeddings.
2. Adding a list of documents to the vector store.
3. Returning the vector store for later use in the retrieval and generation pipeline.

This function sets up the semantic retrieval system, allowing for more meaningful and context-aware results than keyword-based retrieval.
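
To show what "semantically similar" means in vector terms, here is a minimal sketch that ranks toy vectors by cosine similarity. The 3-dimensional vectors are invented for illustration (real embedding models produce hundreds of dimensions), and cosine similarity is only one common choice; the distance metric a vector store actually uses depends on its configuration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy "embeddings" (assumed values, not produced by a real model).
doc_vectors = {
    "machine learning": [0.9, 0.1, 0.0],
    "deep learning": [0.8, 0.3, 0.1],
    "cooking recipes": [0.0, 0.2, 0.9],
}
query_vector = [1.0, 0.2, 0.0]

ranked = sorted(
    doc_vectors,
    key=lambda name: cosine_similarity(query_vector, doc_vectors[name]),
    reverse=True,
)
print(ranked)  # ['machine learning', 'deep learning', 'cooking recipes']
```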

In [10]:
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.schema import Document
def build_chroma(documents: list[Document]) -> Chroma:
    """
    Build a Chroma vector store using Hugging Face embeddings
    and add the documents to it.

    Parameters:
    - documents (list[Document]): A list of Document objects to add to the vector store.

    Returns:
    - Chroma: The Chroma vector store containing the embedded documents.
    """

    # Step 1: Initialize Hugging Face embeddings.
    # - Use a pre-trained embedding model (e.g., "sentence-transformers/all-mpnet-base-v2").
    # - HuggingFaceEmbeddings generates dense vector representations for text.
    model_name = "sentence-transformers/all-mpnet-base-v2"
    embeddings = HuggingFaceEmbeddings(model_name=model_name)

    # Step 2: Initialize the Chroma vector store.
    # - Set the collection name for the vector store (e.g., "EngGenAI").
    # - Pass the Hugging Face embeddings as the embedding function.
    vector_store = Chroma(
        collection_name="EngGenAI",
        embedding_function=embeddings,
    )

    # Step 3: Add the input documents to the Chroma vector store.
    # - Use the `add_documents` method to embed and store the documents.
    vector_store.add_documents(documents=documents)
    # Step 4: Return the Chroma vector store for later use.
    return vector_store

Execute the following code to test your implementation.

In [11]:
from langchain.schema import Document

# Create sample Document objects.
sample_docs = [
    Document(page_content="Machine learning is a method of data analysis that automates analytical model building."),
    Document(page_content="Deep learning is a subset of machine learning that uses neural networks with three or more layers."),
    Document(page_content="Artificial intelligence encompasses a wide range of technologies, including machine learning and deep learning."),
    Document(page_content="Natural language processing is a field of AI focused on the interaction between computers and human language."),
]

# Call the function to build the Chroma vector store.
vector_store = build_chroma(sample_docs)

# Test retrieval (optional, if supported).
print("Vector store built successfully!")
print(vector_store)  # Print the vector store object to verify.



Vector store built successfully!
<langchain_community.vectorstores.chroma.Chroma object at 0x11c973210>


Expected output:

Vector store built successfully!

<langchain.vectorstores.Chroma object at 0x7f8c1a4b3f10>

1. What happens if two documents have identical embeddings? How would you handle this in the retrieval process?

Answer: Two documents with identical embeddings are stored as the same vector in the database, so during retrieval both match a query equally well. They can still be told apart because they have different IDs and metadata. Alternatively, we can compare the content of the two documents and, if they are effectively the same, remove one and keep the other.
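
Since identical embeddings usually come from identical or near-identical text, one option is to remove duplicates before the documents are added to the vector store. The sketch below is an assumption about how this could be done, not part of the assignment code: `deduplicate_by_content` is a hypothetical helper that works on plain strings and treats texts as equal if they match after lowercasing and whitespace normalization.

```python
import hashlib


def deduplicate_by_content(texts):
    """Keep the first occurrence of each text, comparing case- and whitespace-normalized content."""
    seen, unique = set(), []
    for text in texts:
        normalized = " ".join(text.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


texts = [
    "Deep learning uses neural networks.",
    "deep  learning uses neural networks.",  # same content, different case and spacing
    "BM25 is a ranking function.",
]
print(deduplicate_by_content(texts))
# ['Deep learning uses neural networks.', 'BM25 is a ranking function.']
```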

# 9. Ensemble Retriever

In this task, students will implement an Ensemble Retriever that combines the strengths of Chroma (semantic similarity) and BM25 (keyword-based retrieval) to create a hybrid retriever. This ensemble approach ensures more robust and comprehensive retrieval results by leveraging both semantic and lexical search techniques.

You should:

1. Retrieve documents from both Chroma (semantic search) and BM25 (lexical search).
2. Combine the results from both retrievers while deduplicating overlapping results.
3. Return the top k most relevant and unique documents.

This function plays a vital role in the RAG pipeline by ensuring that the retrieved documents are relevant and diverse, combining semantic understanding with precise keyword matching.

In [12]:
from langchain.schema import Document

class EnsembleRetriever:
    """
    Merges results from Chroma similarity search and BM25 lexical search.
    """

    def __init__(self, chroma_store, bm25_retriever):
        """
        Initialize the EnsembleRetriever with Chroma and BM25 retrievers.

        Parameters:
        - chroma_store: The Chroma vector store for semantic retrieval.
        - bm25_retriever: The BM25 retriever for lexical retrieval.
        """
        # Step 1: Store the Chroma vector store and BM25 retriever.
        # Hint: Assign the inputs `chroma_store` and `bm25_retriever` to instance variables.
        self.chroma_store = chroma_store
        self.bm25_retriever = bm25_retriever

    def get_relevant_documents(self, query: str, k: int = 5):
        """
        Retrieve relevant documents by combining results from Chroma and BM25.

        Parameters:
        - query (str): The input search query.
        - k (int): The number of top unique documents to return (default: 5).

        Returns:
        - list[Document]: A list of unique relevant documents.
        """
        
        # Step 1: Retrieve top-k documents from Chroma (semantic similarity).
        chroma_docs = self.chroma_store.similarity_search(query, k=k)

        # Step 2: Retrieve top-k documents from BM25 (lexical matching).
        bm25_docs = self.bm25_retriever.retrieve(query, k=k)

        # Step 3: Combine results from both retrievers into a single list.
        # Chroma results come first, so they take priority when truncating to k.
        combined = chroma_docs + bm25_docs

        # Step 4: Deduplicate the combined results.
        # Hint: Use a `set` to track seen content based on document text.
        seen = set()
        unique_docs = []
        for doc in combined:
            # Chroma returns Document objects; the BM25 retriever returns plain strings.
            if isinstance(doc, Document):
                content = doc.page_content
            elif isinstance(doc, str):
                content = doc
            else:
                raise ValueError(f"Expected a Document or str, got {type(doc).__name__}.")
            # Use the first 60 characters of the document text as a key for deduplication.
            key = content[:60]

            if key not in seen:
                # Convert plain strings to Document objects if necessary.
                if isinstance(doc, str):
                    doc = Document(page_content=doc)
                unique_docs.append(doc)
                seen.add(key)

        # Step 5: Return the top-k unique documents.
        return unique_docs[:k]


Run the following code to test your implementation.

In [13]:
from langchain.schema import Document

# Sample documents
sample_docs = [
    Document(page_content="Machine learning automates model building using data."),
    Document(page_content="Deep learning is a type of machine learning using neural networks."),
    Document(page_content="AI includes technologies like machine learning and deep learning."),
    Document(page_content="Natural language processing focuses on human-computer language interaction."),
]

# Sample Chroma and BM25 retrievers (mock behavior)
class MockChroma:
    def similarity_search(self, query, k):
        return [Document(page_content="Machine learning automates model building using data.")]

class MockBM25:
    def retrieve(self, query, k):
        return ["Deep learning is a type of machine learning using neural networks."]

# Initialize mock retrievers
chroma = MockChroma()
bm25 = MockBM25()

# Initialize EnsembleRetriever
ensemble_retriever = EnsembleRetriever(chroma, bm25)

# Test the retriever with a query
query = "What is machine learning?"
results = ensemble_retriever.get_relevant_documents(query, k=5)

# Print the results
print("Ensemble Retrieval Results:")
for idx, doc in enumerate(results, 1):
    print(f"{idx}. {doc.page_content}")

Ensemble Retrieval Results:
1. Machine learning automates model building using data.
2. Deep learning is a type of machine learning using neural networks.



1. Why is it beneficial to combine semantic retrieval (Chroma) and lexical retrieval (BM25) in an Ensemble Retriever?

Answer: Combining semantic retrieval (Chroma) and lexical retrieval (BM25) in an Ensemble Retriever increases the chance of retrieving the relevant documents, because BM25 matches exact words while Chroma captures the context and synonyms of the words. Together they return more diverse results, which raises the likelihood of getting the desired output. A second benefit is robustness: if semantic retrieval fails to return a relevant result for a query, BM25 may still perform well on that same query. Both Chroma and BM25 have limitations, and combining them reduces these weaknesses and creates a balance. BM25 works best for keyword-based queries and Chroma works well for complex queries, so combining them gives good retrieval results across query types.

2. If the results from Chroma and BM25 are drastically different (little to no overlap), how might this impact the quality of the combined results?

Answer: It can have a positive or a negative impact.

On the positive side, combining both yields a more diverse set of results and broader coverage than BM25 or Chroma alone, and the two methods can compensate for each other's weaknesses. For example, if the query is "what is quantum mechanics?" and the documents abbreviate "quantum mechanics" to "QM", BM25 can perform very poorly. Chroma, which performs semantic search over embeddings, can still recognize semantically similar documents even when the exact keyword does not appear. This is particularly useful with synonyms or paraphrases.

On the negative side, both retrievers may return irrelevant documents, which can lead to nonsensical sentences in the generated text. One method may retrieve unrelated information that dilutes the overall quality of the result. Ranking can also become confused, because no shared ordering or agreement is defined between the two methods, so a highly relevant semantic match could rank lower than a less relevant lexical match. Finally, the combined results can lack coherence, so users might not be able to find the useful information.
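
The ranking concern raised in the answer above can be made concrete. Reciprocal rank fusion (RRF) is one common way to merge two ranked lists without comparing their raw scores. The sketch below is not part of the assignment's `EnsembleRetriever`; the document IDs and both input rankings are made-up placeholders, and `c=60` is a commonly used smoothing constant rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, c=60):
    """Fuse several ranked lists of document IDs.

    Each document receives sum(1 / (c + rank)) over the lists it appears in,
    where rank starts at 1 for the top result of each list.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (c + rank)
    # Highest fused score first; ties are broken by ID to keep the order deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


semantic_hits = ["doc_a", "doc_b", "doc_c"]  # hypothetical Chroma ranking
lexical_hits = ["doc_d", "doc_b", "doc_e"]   # hypothetical BM25 ranking

fused = reciprocal_rank_fusion([semantic_hits, lexical_hits])
print(fused)
```

With little overlap between the two lists, only the shared document (`doc_b` here) gains a clear advantage; the remaining documents are separated only by their rank within their own list, which illustrates why low overlap makes the merged ordering less informative.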


In [14]:
from langchain_core.output_parsers import BaseOutputParser

class StrOutputParser(BaseOutputParser):
    def parse(self, text: str):
        return text

# 10. Format Documents

In this task, you will implement two key components to enhance the formatting and styling of documents in the RAG pipeline:

**`format_docs(docs)`:**

This function takes a list of documents (`docs`) and formats them into a readable, numbered list. If no documents are provided, it returns a default message indicating the absence of context.

**`style_prompt`:**

This is a prompt template that prepares the input for a neural style transfer task. It asks an AI model to rewrite a given text (`original_text`) in a specified style, optionally using a contextual snippet (`context`) from the retrieved documents.

In [15]:
from langchain.prompts import PromptTemplate

def format_docs(docs):
    """
    Format a list of documents into a numbered, readable string.

    Parameters:
    - docs (list[Document]): A list of Document objects to format.

    Returns:
    - str: A string containing the formatted documents or a default message if no documents are provided.
    """

    # Step 1: Check if the list of documents is empty.
    # Hint: If `docs` is empty, return the string "No relevant context found."
    if not docs:
        return "No relevant context found."  # Replace with your implementation.

    # Step 2: Initialize an empty list to store formatted snippets.
    snippet_list = []

    # Step 3: Iterate over the documents and format each one.
    # - Use `enumerate` to get the index and document.
    # - Extract and clean the `page_content` of the document.
    # - Replace newlines with spaces and remove unnecessary whitespace.
    # - Add a formatted string to the `snippet_list` (e.g., "1. Cleaned content").
    for i, doc in enumerate(docs):
        text = doc.page_content
        formatted_string = text.replace('\n', ' ').strip()
        snippet_list.append(f"{i + 1}. {formatted_string}")

    # Step 4: Join the snippets with newline characters and return the result.
    snippet = "\n".join(snippet_list)
    return snippet  # Replace with your implementation.


# Define the style transfer prompt template
style_prompt = PromptTemplate(
    input_variables=["style", "context", "original_text"],
    template=(
            # Replace with your prompt for changing the style of the text. Avoid using complicated prompts.
            "rewrite the original text: {original_text} with {style} style using the context: {context}"
    )
)

Execute the following code to test your implementation.

In [16]:
from langchain.schema import Document
from langchain_huggingface import HuggingFaceEndpoint  # Or the specific LLM library you're using

# Example setup for LLM (ensure this is compatible with your LLM)
def setup_llm():
    return HuggingFaceEndpoint(
        repo_id="mistralai/Mistral-7B-Instruct-v0.3",  # Replace with the appropriate model
        temperature=1.0
    )

# Sample documents
sample_docs = [
    Document(page_content="Machine learning automates data analysis."),
    Document(page_content="Deep learning uses neural networks to learn patterns."),
    Document(page_content="Artificial intelligence includes various technologies."),
]

# Test the format_docs function
formatted_docs = format_docs(sample_docs)
print("Formatted Documents:\n")
print(formatted_docs)

# Test the style_prompt with sample inputs
style = "poetic"
context = formatted_docs
original_text = "Artificial intelligence is transforming the world."

styled_prompt = style_prompt.format(
    style=style,
    context=context,
    original_text=original_text,
)

print("\nGenerated Prompt for Style Transfer:\n")
print(styled_prompt)

# Pass the prompt to the LLM
llm = setup_llm()  # Initialize the LLM
styled_output = llm.invoke(styled_prompt)  # Generate the styled text

print("\n--- Rewritten (Styled) Text ---")
print(styled_output)

Formatted Documents:

1. Machine learning automates data analysis.
2. Deep learning uses neural networks to learn patterns.
3. Artificial intelligence includes various technologies.

Generated Prompt for Style Transfer:

rewrite the original text: Artificial intelligence is transforming the world. with poetic style using the context: 1. Machine learning automates data analysis.
2. Deep learning uses neural networks to learn patterns.
3. Artificial intelligence includes various technologies.





--- Rewritten (Styled) Text ---


In a realm of wonder and innovation, artificial intelligence takes flight,
A symphony of algorithms and patterns, working with unmatched insight.
Machine learning, our humble servant, sifting through data with ease,
Analyzing mountains of information, where others are beseeched.

Deep learning, the intellect's own child, harnesses the power of neural nets,
Learning from the patterns and complexities that often leave us in debt.
Neural networks, interwoven strands, work tirelessly to seek,
The unspoken patterns within chaos, providing a rich, intricate speak.

Artificial intelligence, an expansive tapestry, weaves technologies as one,
A merging of minds and machinery, a tale yet undone.
With every stitch, every connection, it redefines what we know,
Paving the path to a future where progress will continue to glow.


# 11. RAG chain

In this task, students will implement a RAG chain that integrates an ensemble retriever (Chroma and BM25), formats retrieved context, applies a prompt template, and generates styled output using a Language Model (LLM).

The goal is to:

1. Use the `EnsembleRetriever` to retrieve relevant documents from Chroma and BM25.
2. Format the retrieved documents into a readable context.
3. Generate a prompt for neural style transfer using the retrieved context and the input query.
4. Pass the prompt to the LLM and parse the model's response to return the final styled output.
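
The four steps can be tried end to end without a vector store or a hosted model. The sketch below is only an illustration under stated assumptions: `stub_retrieve`, `stub_format`, `stub_llm`, and `run_pipeline` are hypothetical names, the retriever returns canned snippets, and the "LLM" just reports the prompt length. The prompt wording is also made up and is not the notebook's `style_prompt`.

```python
def stub_retrieve(query, k=5):
    """Hypothetical retriever: returns canned snippets regardless of the query."""
    corpus = [
        "Machine learning automates model building using data.",
        "Deep learning is a type of machine learning using neural networks.",
    ]
    return corpus[:k]


def stub_format(snippets):
    """Number the snippets, or report that nothing was found."""
    if not snippets:
        return "No relevant context found."
    return "\n".join(f"{i}. {s.strip()}" for i, s in enumerate(snippets, 1))


def stub_llm(prompt):
    """Hypothetical LLM: describes the prompt instead of calling a model."""
    return f"  [styled text generated from a {len(prompt)}-character prompt]  "


def run_pipeline(inputs):
    """Retrieve, format, build the prompt, call the model, and clean the output."""
    context = stub_format(stub_retrieve(inputs["question"]))
    prompt = (
        f"Rewrite the text: {inputs['original_text']}\n"
        f"Style: {inputs['style']}\n"
        f"Context:\n{context}"
    )
    raw_output = stub_llm(prompt)
    return raw_output.strip()  # minimal "parsing": trim surrounding whitespace


demo_inputs = {
    "question": "What is machine learning?",
    "style": "poetic",
    "original_text": "Machine learning learns from data.",
}
print(run_pipeline(demo_inputs))
```

Swapping each stub for the real component (ensemble retriever, `format_docs`, prompt template, LLM) keeps the same data flow, which makes it easier to test one stage at a time.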

In [17]:
from langchain_core.runnables import RunnablePassthrough
from langchain.prompts import PromptTemplate

def build_rag_chain(llm, chroma_store, bm25_retriever):
    """
    Build a RAG chain using an ensemble retriever with Chroma and BM25,
    followed by formatting the context, applying the prompt, and parsing the output.

    Parameters:
    - llm: The language model for generating styled text.
    - chroma_store: Chroma vector store for semantic retrieval.
    - bm25_retriever: BM25 retriever for lexical retrieval.

    Returns:
    - rag_chain: A function that processes inputs through the RAG pipeline.
    """

    # Step 1: Define the Ensemble Retriever
    ensemble_retriever = EnsembleRetriever(chroma_store, bm25_retriever)  # Replace with your implementation.

    # Step 2: Define a function to retrieve and format context
    def retrieve_and_format_context(query, k=5):
        """
        Retrieve relevant documents and format them into a readable context.

        Parameters:
        - query (str): The input query.
        - k (int): The number of documents to retrieve (default: 5).

        Returns:
        - str: The formatted context string.
        """
        # Step 2.1: Retrieve relevant documents using the ensemble retriever.
        context_docs = ensemble_retriever.get_relevant_documents(query, k=k)  # Replace with your implementation.

        # Step 2.2: Format the retrieved documents.
        context = format_docs(context_docs)  # Replace with your implementation.

        return context

    # Step 3: Define the RAG chain
    def rag_chain(inputs):
        """
        Process inputs through the RAG pipeline to generate styled output.

        Parameters:
        - inputs (dict): A dictionary containing:
            - "question" (str): The query for retrieving context.
            - "style" (str): The desired writing style.
            - "original_text" (str): The text to be rewritten.

        Returns:
        - str: The final styled output.
        """

        # Step 3.1: Retrieve and format the context using the helper function.
        query = inputs["question"]
        context = retrieve_and_format_context(query, k=5)  # Replace with your implementation.

        # Step 3.2: Generate the prompt using the `style_prompt`.
        prompt = style_prompt.format(
            style=inputs["style"],
            context=context,
            original_text=inputs["original_text"]
        )  # Replace with your implementation.

        # Step 3.3: Pass the prompt through the LLM to generate the output.
        llm_output = llm.invoke(prompt)  # Replace with your implementation.

        # Step 3.4: Parse the LLM's output to extract the final styled text.
        # Use the StrOutputParser defined earlier in this notebook; it returns the text unchanged.
        parser = StrOutputParser()
        result = parser.parse(llm_output)  # Replace with your implementation.

        return result

    return rag_chain

# 12. Final response

In this task, students will implement the main script that integrates all components of the RAG pipeline into a complete application. The script will:

1. Scrape content from specified URLs, process the raw text, and split it into smaller, retrievable chunks.
2. Build the retrievers: Create a Chroma vector store and a BM25 retriever using the processed documents.
3. Build the RAG chain: Set up a pipeline that integrates the retrievers, context formatting, and an LLM to perform neural style transfer.
4. Run the application: Accept a user query and a target style, then process the input through the RAG chain to produce styled output.
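
Step 1 mentions splitting raw text into smaller, retrievable chunks. The notebook's own `split_text_into_documents` is defined in an earlier section and may work differently; the sketch below only illustrates the general idea with fixed-size character windows that overlap. The name `split_into_chunks` and the default `chunk_size` and `overlap` values are assumptions chosen for the example.

```python
def split_into_chunks(text, chunk_size=200, overlap=50):
    """Split `text` into character windows of at most `chunk_size`, where each
    window starts `chunk_size - overlap` characters after the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample_text = "".join(str(i % 10) for i in range(450))
chunks = split_into_chunks(sample_text)
print(len(chunks), [len(c) for c in chunks])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk when it is shorter than the overlap, at the cost of storing some text twice.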

In [18]:
if __name__ == "__main__":
    """
    Main script for scraping, building retrievers, setting up the RAG chain,
    and running a neural style transfer demo.
    """

    # Step 1: Scrape content and split into documents
    print("Step 1: Scraping content and splitting into documents...")
    example_urls = [
        "https://en.wikipedia.org/wiki/Artificial_intelligence",
        "https://en.wikipedia.org/wiki/Machine_learning"
    ]

    # Step 1A: Initialize an empty list to store all documents
    all_docs = []

    # Step 1B: Iterate through the URLs to fetch and process content
    for url in example_urls:
        print(f"Scraping content from: {url}")

        # Step 1B.1: Fetch and parse the raw text from the URL
        raw_text = fetch_and_parse(url)

        # Step 1B.2: Split the raw text into chunks (documents)
        splits = split_text_into_documents(raw_text)

        # Step 1B.3: Add the chunks to the list of documents
        all_docs.extend(splits)

    print(f"Total number of documents: {len(all_docs)}")

    # Step 2: Build Chroma and BM25 retrievers
    print("Step 2: Building Chroma vector store and BM25 retriever...")

    # Step 2A: Build the Chroma vector store
    chroma_store = build_chroma(all_docs)  # Replace with your implementation

    # Step 2B: Build the BM25 retriever
    bm25_retriever = BM25Retriever(all_docs)  # Replace with your implementation

    # Step 3: Build the RAG chain
    print("Step 3: Building RAG chain...")

    # Step 3A: Set up the LLM
    llm = setup_llm()  # Replace with your implementation

    # Step 3B: Build the RAG chain
    rag_chain = build_rag_chain(llm, chroma_store, bm25_retriever)  # Replace with your implementation

    # Step 4: Neural Style Transfer Demo
    print("\nStep 4: Neural Style Transfer Demo...")

    # Step 4A: Define the user query and target style
    user_text = "Explain machine learning."
    target_style = "as if it were a recipe for cooking"
    inputs = {"question": user_text, "style": target_style, "original_text": user_text}

    print("\n============================================")
    print("        Neural Style Transfer Demo          ")
    print("============================================")
    print(f"Original Text : {user_text}")
    print(f"Desired Style : {target_style}")

    # Step 5: Run the RAG chain
    print("\nStep 5: Running the RAG chain...")

    # Hint: Pass `inputs` through the RAG chain to generate styled output.
    styled_result = rag_chain(inputs)  # Replace with your implementation

    print("\n--- Styled Output ---")
    print(styled_result)

Step 1: Scraping content and splitting into documents...
Scraping content from: https://en.wikipedia.org/wiki/Artificial_intelligence
Scraping content from: https://en.wikipedia.org/wiki/Machine_learning
Total number of documents: 820
Step 2: Building Chroma vector store and BM25 retriever...
Step 3: Building RAG chain...

Step 4: Neural Style Transfer Demo...

============================================
        Neural Style Transfer Demo          
============================================
Original Text : Explain machine learning.
Desired Style : as if it were a recipe for cooking

Step 5: Running the RAG chain...

--- Styled Output ---
 usage of machine learning in data compression, in particular adaptive Lempel-Ziv-Welch coding, has been demonstrated to greatly improve upon the performance of simple algorithms like the Huffman code.[35]

Preparing the Ingredients for Machine Learning:

Gather your data and separate it into labeled and unlabeled sets. For supervised learning, ensure the labeled set is abundant and accurately annotated.

Ingredients:

* Datasets (labele

***What You Should Remember:***

1. RAG (Retrieval-Augmented Generation) combines the power of information retrieval and language models to generate accurate and context-aware responses.

2. Chroma Vector Store is used for semantic retrieval by embedding documents into high-dimensional vectors and finding semantically similar documents for a given query.

3. BM25 Retriever uses lexical matching to rank documents based on the occurrence of query terms, ensuring precision in keyword-based searches.

4. Ensemble Retriever merges results from Chroma (semantic similarity) and BM25 (lexical matching) to provide a balance of relevance and diversity in retrieved documents.

5. Formatting Context ensures that retrieved documents are clean, readable, and useful for the LLM, improving the quality of generated outputs.

6. Prompt Templates guide the LLM by structuring inputs, specifying the task (e.g., style transfer), and ensuring clarity and relevance.

7. Neural Style Transfer enables the LLM to rewrite text in a specified style (e.g., formal, poetic, conversational) using both the original input and retrieved context.

8. Building a RAG pipeline requires:

    **Data preparation:** Scraping and splitting raw text into smaller, retrievable chunks.

    **Retriever setup:** Combining Chroma and BM25 to maximize retrieval quality.

    **Chain integration:** Connecting the retrievers, context formatting, and LLM to form a cohesive workflow.

Congratulations! You've come to the end of this assignment.