
<div align="center">
  <h1>Stylized Retrieval-Augmented Generation</h1>
</div>

### Table of Contents
- [1. Access to Hugging Face](#1-access-to-hugging-face)
- [2. Packages](#2-packages)
- [3. Problem Statement](#3-problem-statement)
- [4. Fetch and Parse](#4-fetch-and-parse)
- [5. Calculate Word Stats](#5-calculate-word-stats)
- [6. Set Up LLM](#6-set-up-llm)
- [7. BM25 Retriever](#7-bm25-retriever)
- [8. Build Chroma](#8-build-chroma)
- [9. Ensemble Retriever](#9-ensemble-retriever)
- [10. Format Documents](#10-format-documents)
- [11. RAG Chain](#11-rag-chain)
- [12. Final Response](#12-final-response)


# 1. Access to Hugging Face

In [1]:
import getpass
import os

# Prompt user for Hugging Face API token if not already set
if "HUGGINGFACEHUB_API_TOKEN" not in os.environ:
    os.environ["HUGGINGFACEHUB_API_TOKEN"] = getpass.getpass("Enter your Huggingfacehub API token: ")

Enter your Huggingfacehub API token: ··········


# 2. Packages

In [2]:
!pip install -q langchain
!pip install -q langchain-community
!pip install -q langchain-chroma
!pip install -q langchain-huggingface
!pip install -q bs4
!pip install -q rank_bm25
!pip install -q huggingface_hub
!pip install -q requests

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.5/2.5 MB[0m [31m32.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m49.6/49.6 kB[0m [31m2.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m67.3/67.3 kB[0m [31m4.0 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m628.3/628.3 kB[0m [31m19.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.4/2.4 MB[0m [31m42.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m94.8/94.8 kB[0m [31m6.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m278.6/278.6 kB[0m [31m12.4 MB/s[0m eta [36m0:0

# 3. Problem Statement
This project implements **Text Style Transfer**, a technique that modifies the style of a text while preserving its content. It builds an **ensemble retriever** that combines **BM25** for keyword-based retrieval with **Chroma** for semantic search; the retrieved documents are then used as context for the style transfer step. The project integrates classical retrieval methods with modern neural embeddings for practical NLP applications.

**What is text style transfer?**

**Text Style Transfer** is a natural language processing (NLP) technique that modifies the style of a given text while preserving its original content. It allows for the transformation of linguistic expressions to convey different tones, emotions, or writing styles without altering the underlying meaning. For example, it can rephrase formal text into a casual tone, adapt neutral statements into an emotional tone, or convert modern language into a Shakespearean style. This technique has applications in personalized communication, creative writing, sentiment adjustment, and even domain adaptation, making it a powerful tool for generating diverse textual outputs tailored to specific needs.

### Example of Text Style Transfer:

#### **Input (Neutral Tone):**
"I am excited about the opportunity to work on this project."

#### **Output (Formal Tone):**
"I am genuinely enthusiastic about the prospect of contributing to this project."

#### **Output (Casual Tone):**
"I'm super pumped to get started on this project!"

#### **Output (Shakespearean Style):**
"Verily, I am thrilled by the chance to partake in this noble endeavor."


# 4. Fetch and Parse
In this part of the project:

*    Fetching and parsing web content: Write a function that fetches the HTML content of a webpage and processes it to extract clean, readable text.
*    Splitting text into smaller chunks: Implement a function to split the text into overlapping chunks, ensuring that each chunk is manageable for downstream tasks.

In [None]:
import requests
from bs4 import BeautifulSoup
from langchain.schema import Document


def fetch_and_parse(url: str) -> str:
    """
    Fetch the webpage content at `url` and return a cleaned string of text.

    Parameters:
    - url (str): The URL of the webpage to fetch.

    Returns:
    - str: Cleaned text content extracted from the webpage.
    """

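    # A timeout keeps the request from hanging indefinitely on a slow server.
    page = requests.get(url, timeout=30)
    if page.status_code != 200:
        print(f"Error: Unable to fetch the webpage (status {page.status_code}).")
        # Return an empty string so the declared `str` return type always holds.
        return ""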

    soup = BeautifulSoup(page.content, "html.parser")
    text = soup.get_text(separator=" ", strip=True)
    # print(text)
    return text


def split_text_into_documents(text: str, chunk_size: int = 1000, overlap: int = 100):
    """
    Split a long text into overlapping chunks and return them as a list of Documents.

    Parameters:
    - text (str): The long text to split.
    - chunk_size (int): The size of each chunk (default is 1000 characters).
    - overlap (int): The number of overlapping characters between consecutive chunks (default is 100).

    Returns:
    - list: A list of Documents, each containing a chunk of text.
    """

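    # The step between chunk starts must be positive, so require overlap < chunk_size.
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")

    # Initialize an empty list to store the chunks.
    docs = []

    # Advance by (chunk_size - overlap) so consecutive chunks share `overlap` characters.
    for start in range(0, len(text), chunk_size - overlap):
        chunk = text[start:start + chunk_size]
        docs.append(Document(page_content=chunk))

    return docs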
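
To make the overlap arithmetic concrete, here is a minimal sketch that computes only the character offsets of each chunk. The helper name `chunk_spans` is hypothetical and not part of the notebook; it assumes the same stepping rule as `split_text_into_documents` (each chunk starts `chunk_size - overlap` characters after the previous one).

```python
def chunk_spans(text_len: int, chunk_size: int = 1000, overlap: int = 100):
    """Return (start, end) character offsets for each overlapping chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    # The last chunk is clipped to the end of the text.
    return [(start, min(start + chunk_size, text_len))
            for start in range(0, text_len, step)]


spans = chunk_spans(2500)
print(spans)  # [(0, 1000), (900, 1900), (1800, 2500)]
```

With the default settings, a 2,500-character text yields three chunks, and each chunk repeats the last 100 characters of the one before it.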

# 5. Calculate Word Stats

In this section, I have implemented a function to calculate basic word and character statistics for a list of documents. Each document is represented as a Document object with a page_content attribute that contains its text.

In [None]:
def calculate_word_stats(texts):
    """
    Calculate and display average word and character statistics for a list of documents.

    Parameters:
    - texts (list): A list of Document objects, where each Document contains a `page_content` attribute.

    Returns:
    - None: Prints the average word and character counts per document.
    """

    # Step 1: Initialize variables to keep track of total words and total characters.
    total_words, total_characters = 0, 0

    # Step 2: Iterate through each document in the `texts` list.
    for doc in texts:
        content = doc.page_content
        total_words += len(content.split())
        total_characters += len(content)

    # Step 3: Calculate the average words and characters per document.
    # - Avoid division by zero by checking if the `texts` list is not empty.
    docs_num = len(texts)
    avg_words = total_words / docs_num if docs_num > 0 else 0
    avg_characters = total_characters / docs_num if docs_num > 0 else 0

    # Step 4: Print the calculated averages in a readable format.
    print(f"Average words per document: {avg_words}")
    print(f"Average characters per document: {avg_characters}")


In [None]:
sample_docs = [
    Document(page_content="This is the first test document."),
    Document(page_content="Here is another example document for testing."),
    Document(page_content="Short text."),
    Document(page_content="This document has more content. It's longer and has more words in it for testing purposes."),
]
calculate_word_stats(sample_docs)


Average words per document: 7.75
Average characters per document: 44.5
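
If the averages are needed programmatically (for example, to choose a chunk size), a variant that returns the values instead of printing them may be more convenient. The sketch below is an assumption, not part of the notebook: the name `word_stats` is hypothetical, and it works on plain strings rather than `Document` objects so that it runs without LangChain.

```python
def word_stats(texts: list[str]) -> tuple[float, float]:
    """Return (average words, average characters) per text."""
    if not texts:
        return 0.0, 0.0
    total_words = sum(len(t.split()) for t in texts)
    total_chars = sum(len(t) for t in texts)
    return total_words / len(texts), total_chars / len(texts)


samples = [
    "This is the first test document.",
    "Here is another example document for testing.",
    "Short text.",
    "This document has more content. It's longer and has more words in it for testing purposes.",
]
avg_words, avg_chars = word_stats(samples)
print(avg_words, avg_chars)  # 7.75 44.5
```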


# 6. Set Up LLM

In this part of the project, I have implemented a function to set up a Large Language Model (LLM) using the Hugging Face Endpoint API. This function will:

1. Initialize and connect to a pre-trained model available on Hugging Face.
2. Allow customization of parameters like the model repository ID and generation temperature.
3. Return the configured LLM object, which will be used later for text generation tasks in the RAG pipeline.
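
For intuition about the temperature parameter: in sampling-based generation, temperature is commonly applied by dividing the model's logits before the softmax, so lower values concentrate probability on the top token and higher values flatten the distribution. The sketch below illustrates that effect with made-up logits; the function name and numbers are hypothetical, and the hosted endpoint's actual decoding is not shown in this notebook.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after scaling by 1/temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the max for numerical stability before exponentiating.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```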

In [None]:
from langchain_huggingface import HuggingFaceEndpoint

def setup_llm(repo_id="mistralai/Mistral-7B-Instruct-v0.3", temperature=1.0):
    """
    Set up and return a Hugging Face LLM using the specified model repository ID and generation parameters.

    Parameters:
    - repo_id (str): The repository ID of the Hugging Face model to use (default: "mistralai/Mistral-7B-Instruct-v0.3").
    - temperature (float): The generation temperature to control creativity in outputs (default: 1.0).

    Returns:
    - HuggingFaceEndpoint: A configured LLM object ready for text generation.
    """

    # Configure the connection to the hosted model and return the LLM object.
    llm = HuggingFaceEndpoint(
        repo_id=repo_id,
        temperature=temperature,
    )

    return llm


# 7. BM25 Retriever

In this section, I implemented a BM25 retriever, a critical component of the RAG pipeline. The implementation:

1. Initializes the BM25 retriever with a set of documents.
2. Provides a method to retrieve the top k most relevant documents for a given query.
3. Tokenizes text by uppercasing it and splitting on whitespace, then scores each document against the query with BM25.

This component enables the pipeline to fetch relevant information from a corpus, which is then passed to the LLM for further processing.
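
The scoring step can be sketched in plain Python. This is a simplified illustration of the Okapi BM25 formula, not the `rank_bm25` implementation: the function name `bm25_scores` is hypothetical, `k1=1.5` and `b=0.75` are commonly used values, and the non-negative IDF variant shown here may differ from how the library computes IDF.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, tokenized_corpus, k1=1.5, b=0.75):
    """Score every document in `tokenized_corpus` against `query_tokens`."""
    n_docs = len(tokenized_corpus)
    if n_docs == 0:
        return []
    avgdl = sum(len(doc) for doc in tokenized_corpus) / n_docs

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc in tokenized_corpus:
        df.update(set(doc))

    scores = []
    for doc in tokenized_corpus:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1; b controls length normalization.
            denom = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


toy_corpus = [
    "machine learning automates models".split(),
    "deep learning uses neural networks".split(),
    "cooking recipes for dinner".split(),
]
scores = bm25_scores("machine learning".split(), toy_corpus)
print([round(s, 3) for s in scores])
```

The first document matches both query terms and scores highest; the third shares no terms with the query and scores zero.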

In [None]:
from rank_bm25 import BM25Okapi

class BM25Retriever:
    """
    A class to implement BM25-based document retrieval.

    Attributes:
    - documents (list): A list of Document objects.
    - corpus (list): A list of strings representing the document contents.
    - tokenized_corpus (list): A list of tokenized documents (lists of words).
    - bm25 (BM25Okapi): The BM25 retriever initialized with the tokenized corpus.
    """

    def __init__(self, documents):
        """
        Initialize the BM25 retriever with the given documents.

        Parameters:
        - documents (list): A list of Document objects.
        """
        # Step 1: Store the input documents and their uppercased contents.
        self.documents = documents
        self.corpus = [doc.page_content.upper() for doc in documents]

        # Step 2: Tokenize the corpus by splitting on whitespace.
        self.tokenized_corpus = [doc.split() for doc in self.corpus]

        # Step 3: Initialize the BM25 retriever with the tokenized corpus.
        self.bm25 = BM25Okapi(self.tokenized_corpus)

    def retrieve(self, query, k=5):
        """
        Retrieve the top `k` most relevant documents for a given query.

        Parameters:
        - query (str): The input query as a string.
        - k (int): The number of top documents to return (default is 5).

        Returns:
        - list: A list of the top `k` relevant documents as strings.
        """
        # Step 1: Tokenize the input query the same way as the corpus.
        query_tokens = query.upper().split()

        scores = self.bm25.get_scores(query_tokens)
        print("BM25 Scores for Query:")
        for idx, score in enumerate(scores):
            print(f"Document {idx + 1}: {score:.4f}")

        # Step 2: Use the BM25 retriever to score and rank documents.
        # The original page contents are passed as the candidate list so that the
        # returned text keeps its original casing (the uppercased corpus is used
        # only for matching).
        originals = [doc.page_content for doc in self.documents]
        top_docs = self.bm25.get_top_n(query_tokens, originals, n=k)

        # Step 3: Return the top `k` relevant documents.
        return top_docs

In [38]:
from langchain.schema import Document

# Create sample Document objects.
sample_docs = [
    Document(page_content="Machine learning is a method of data analysis that automates analytical model building."),
    Document(page_content="Deep learning is a subset of machine learning that uses neural networks with three or more layers."),
    Document(page_content="Artificial intelligence encompasses a wide range of technologies, including machine learning and deep learning."),
    Document(page_content="Natural language processing is a field of AI focused on the interaction between computers and human language."),
]

# Initialize the retriever with the sample documents.
retriever = BM25Retriever(sample_docs)

# Test the retriever with a query.
query = "What is machine learning?"
top_docs = retriever.retrieve(query, k=2)

# Print the results.
print("Top Relevant Documents:")
for idx, doc in enumerate(top_docs, 1):
    print(f"{idx}. {doc}")


BM25 Scores for Query:
Document 1: 0.2906
Document 2: 0.2579
Document 3: 0.1408
Document 4: 0.1290
Top Relevant Documents:
1. Machine learning is a method of data analysis that automates analytical model building.
2. Deep learning is a subset of machine learning that uses neural networks with three or more layers.
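
One detail worth noting about the scores above: splitting on whitespace keeps punctuation attached to words, so the query token `LEARNING?` does not match the corpus token `LEARNING`. A minimal sketch of a tokenizer that drops punctuation follows; `simple_tokenize` is a hypothetical helper, not part of the notebook, and its regex also splits contractions such as "It's" into two tokens.

```python
import re


def simple_tokenize(text: str) -> list[str]:
    """Lowercase the text and keep only runs of word characters."""
    return re.findall(r"\w+", text.lower())


print("What is machine learning?".upper().split())
# ['WHAT', 'IS', 'MACHINE', 'LEARNING?']
print(simple_tokenize("What is machine learning?"))
# ['what', 'is', 'machine', 'learning']
```

If a tokenizer like this were used, it would need to be applied to both the corpus and the query so that the two sides stay comparable.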


# 8. Build Chroma
In this section, I implemented a function to build a Chroma vector store, a key component of the RAG pipeline. The Chroma vector store enables efficient semantic search by embedding documents into a high-dimensional vector space. Using these embeddings, the retriever can find documents that are semantically similar to a given query.

The section involves:

1. Initializing a vector store (Chroma) with Hugging Face embeddings.
2. Adding a list of documents to the vector store.
3. Returning the vector store for later use in the retrieval and generation pipeline.
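
As a rough illustration of what "semantically similar" means in vector terms, the sketch below ranks a few hand-made 3-dimensional vectors by cosine similarity to a query vector. Everything here is an assumption for illustration: the vectors are invented (real embeddings from `all-mpnet-base-v2` have hundreds of dimensions), and the distance metric Chroma actually uses depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented toy "embeddings"; not produced by any real model.
toy_embeddings = {
    "machine learning": [0.9, 0.1, 0.0],
    "deep learning": [0.8, 0.3, 0.1],
    "cooking recipes": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.2, 0.0]

ranked = sorted(
    toy_embeddings,
    key=lambda name: cosine_similarity(query_vec, toy_embeddings[name]),
    reverse=True,
)
print(ranked)  # ['machine learning', 'deep learning', 'cooking recipes']
```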

In [None]:
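# Import from langchain_community directly; the `langchain.vectorstores` and
# `langchain.embeddings` paths are deprecated re-exports of these classes.
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.schema import Document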

def build_chroma(documents: list[Document]) -> Chroma:
    """
    Build a Chroma vector store using Hugging Face embeddings
    and add the documents to it.

    Parameters:
    - documents (list[Document]): A list of Document objects to add to the vector store.

    Returns:
    - Chroma: The Chroma vector store containing the embedded documents.
    """

    # Step 1: Initialize Hugging Face embeddings.
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

    # Step 2: Initialize the Chroma vector store.
    vector_store = Chroma(
        collection_name="EngGenAI",
        embedding_function=embeddings,
        )

    # Step 3: Add the input documents to the Chroma vector store.
    vector_store.add_documents(documents)

    # Step 4: Return the Chroma vector store for later use.
    return vector_store


In [10]:
from langchain.schema import Document

# Create sample Document objects.
sample_docs = [
    Document(page_content="Machine learning is a method of data analysis that automates analytical model building."),
    Document(page_content="Deep learning is a subset of machine learning that uses neural networks with three or more layers."),
    Document(page_content="Artificial intelligence encompasses a wide range of technologies, including machine learning and deep learning."),
    Document(page_content="Natural language processing is a field of AI focused on the interaction between computers and human language."),
]

# Call the function to build the Chroma vector store.
vector_store = build_chroma(sample_docs)

# Confirm that the vector store object was created.
print("Vector store built successfully!")
print(vector_store)  # Print the vector store object to verify.


Vector store built successfully!
<langchain_community.vectorstores.chroma.Chroma object at 0x7fe0b276d120>


# 9. Ensemble Retriever

I have implemented an Ensemble Retriever that combines the strengths of Chroma (semantic similarity) and BM25 (keyword-based retrieval) to create a hybrid retriever. This ensemble approach ensures more robust and comprehensive retrieval results by leveraging both semantic and lexical search techniques.

This class supplies the RAG pipeline with candidate context by merging the two result lists and dropping duplicates, combining semantic understanding with precise keyword matching.

In [None]:
from langchain.schema import Document

class EnsembleRetriever:
    """
    Merges results from Chroma similarity search and BM25 lexical search.
    """

    def __init__(self, chroma_store, bm25_retriever):
        """
        Initialize the EnsembleRetriever with Chroma and BM25 retrievers.

        Parameters:
        - chroma_store: The Chroma vector store for semantic retrieval.
        - bm25_retriever: The BM25 retriever for lexical retrieval.
        """
        # Step 1: Store the Chroma vector store and BM25 retriever.
        self.chroma_store = chroma_store 
        self.bm25_retriever = bm25_retriever

    def get_relevant_documents(self, query: str, k: int = 5):
        """
        Retrieve relevant documents by combining results from Chroma and BM25.

        Parameters:
        - query (str): The input search query.
        - k (int): The number of top unique documents to return (default: 5).

        Returns:
        - list[Document]: A list of unique relevant documents.
        """

        # Step 1: Retrieve top-k documents from Chroma (semantic similarity).
        chroma_docs = self.chroma_store.similarity_search(query, k=k) 

        # Step 2: Retrieve top-k documents from BM25 (lexical matching).
        bm25_docs = self.bm25_retriever.retrieve(query,k=k)

        # Step 3: Combine results from both retrievers into a single list.
        combined = chroma_docs + bm25_docs

        # Step 4: Deduplicate the combined results.
        seen = set()
        unique_docs = []
        for doc in combined:
            # Retrieve content for deduplication (check if `page_content` exists).
            content = doc.page_content if isinstance(doc, Document) else doc

            # Use the first 60 characters of the document text as a key for deduplication.
            key = content[:60]

            if key not in seen:
                # Convert plain strings to Document objects if necessary.
                if isinstance(doc, str):
                    doc = Document(page_content=doc)
                unique_docs.append(doc)
                seen.add(key)

        # Step 5: Return the top-k unique documents.
        return unique_docs[:k]


In [12]:
from langchain.schema import Document

# Sample documents
sample_docs = [
    Document(page_content="Machine learning automates model building using data."),
    Document(page_content="Deep learning is a type of machine learning using neural networks."),
    Document(page_content="AI includes technologies like machine learning and deep learning."),
    Document(page_content="Natural language processing focuses on human-computer language interaction."),
]

# Sample Chroma and BM25 retrievers (mock behavior)
class MockChroma:
    def similarity_search(self, query, k):
        return [Document(page_content="Machine learning automates model building using data.")]

class MockBM25:
    def retrieve(self, query, k):
        return ["Deep learning is a type of machine learning using neural networks."]

# Initialize mock retrievers
chroma = MockChroma()
bm25 = MockBM25()

# Initialize EnsembleRetriever
ensemble_retriever = EnsembleRetriever(chroma, bm25)

# Test the retriever with a query
query = "What is machine learning?"
results = ensemble_retriever.get_relevant_documents(query, k=3)

# Print the results
print("Ensemble Retrieval Results:")
for idx, doc in enumerate(results, 1):
    print(f"{idx}. {doc.page_content}")


Ensemble Retrieval Results:
1. Machine learning automates model building using data.
2. Deep learning is a type of machine learning using neural networks.
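
One property of the merge above is worth keeping in mind: because `combined = chroma_docs + bm25_docs` places the Chroma results first and the list is then truncated to `k`, BM25 results only appear when Chroma returns fewer than `k` unique documents. A possible alternative, sketched here as an assumption rather than as the notebook's method, is to interleave the two lists before truncating. The helper `interleave_unique` is hypothetical and operates on plain strings.

```python
from itertools import zip_longest


def interleave_unique(primary, secondary, k=5, key_len=60):
    """Alternate between two ranked lists, skipping duplicates, and keep the first k."""
    seen = set()
    merged = []
    for pair in zip_longest(primary, secondary):
        for text in pair:
            if text is None:  # one list is shorter than the other
                continue
            key = text[:key_len]  # same prefix-based key as the class above
            if key not in seen:
                seen.add(key)
                merged.append(text)
    return merged[:k]


semantic = ["doc A", "doc B", "doc C"]
lexical = ["doc X", "doc A", "doc Y"]
print(interleave_unique(semantic, lexical, k=3))  # ['doc A', 'doc X', 'doc B']
```

With simple concatenation and `k=3`, the same inputs would return only the three semantic results.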


In [13]:
from langchain_core.output_parsers import BaseOutputParser

class StrOutputParser(BaseOutputParser):
    def parse(self, text: str):
        return text

# 10. Format Documents

This section has two key components that handle the formatting and styling of documents in the RAG pipeline:

**`format_docs(docs)`**

This function takes a list of documents (`docs`) and formats them into a readable, numbered list. If no documents are provided, it returns a default message indicating the absence of context.

**`style_prompt`**

This is a prompt template that prepares the input for a neural style transfer task. It asks an AI model to rewrite a given text (`original_text`) in a specified style, optionally using a contextual snippet (`context`) from the retrieved documents.

In [None]:
from langchain.prompts import PromptTemplate

def format_docs(docs):
    """
    Format a list of documents into a numbered, readable string.

    Parameters:
    - docs (list[Document]): A list of Document objects to format.

    Returns:
    - str: A string containing the formatted documents or a default message if no documents are provided.
    """

    # Step 1: Check if the list of documents is empty.
    if not docs:
        return "No relevant context found."

    # Step 2: Initialize an empty list to store formatted snippets.
    snippet_list = []

    # Step 3: Iterate over the documents and format each one.
    for i, doc in enumerate(docs):
        cleaned_content = doc.page_content.replace("\n", " ").strip()
        snippet_list.append(f"{i+1}. {cleaned_content}")

    # Step 4: Join the snippets with newline characters and return the result.
    final_output = "\n".join(snippet_list)
    return final_output


# Define the style transfer prompt template
style_prompt = PromptTemplate(
    input_variables=["style", "context", "original_text"],
    template=(
        "Rewrite the given text in this {style} style."
        "Use the context coming from \n{context}\n"
        "This is the original text to be referred : \n{original_text}\n"
    )
)


In [None]:
from langchain.schema import Document
from langchain_huggingface import HuggingFaceEndpoint

def setup_llm():
    return HuggingFaceEndpoint(
        repo_id="mistralai/Mistral-7B-Instruct-v0.3", 
        temperature=0.7
    )

# Sample documents
sample_docs = [
    Document(page_content="Machine learning automates data analysis."),
    Document(page_content="Deep learning uses neural networks to learn patterns."),
    Document(page_content="Artificial intelligence includes various technologies."),
]

# Test the format_docs function
formatted_docs = format_docs(sample_docs)
print("Formatted Documents:\n")
print(formatted_docs)

# Test the style_prompt with sample inputs
style = "poetic"
context = formatted_docs
original_text = "Artificial intelligence is transforming the world."

styled_prompt = style_prompt.format(
    style=style,
    context=context,
    original_text=original_text,
)

print("\nGenerated Prompt for Style Transfer:\n")
print(styled_prompt)

# Pass the prompt to the LLM
llm = setup_llm()  # Initialize the LLM
styled_output = llm.invoke(styled_prompt)  # Generate the styled text

print("\n--- Rewritten (Styled) Text ---")
print(styled_output)


Formatted Documents:

1. Machine learning automates data analysis.
2. Deep learning uses neural networks to learn patterns.
3. Artificial intelligence includes various technologies.

Generated Prompt for Style Transfer:

Rewrite the given text in this poetic style.Use the context coming from 
1. Machine learning automates data analysis.
2. Deep learning uses neural networks to learn patterns.
3. Artificial intelligence includes various technologies.
This is the original text to be referred : 
Artificial intelligence is transforming the world.


--- Rewritten (Styled) Text ---
Deep learning, a subset of AI, is revolutionizing data analysis through the use of neural networks to learn patterns.
Machine learning, another subset of AI, automates the process of data analysis.

In the realm of the future,
Artificial Intelligence, a celestial light,
Shines upon the world, transforming its very essence.

A branch, the brilliant Deep Learning,
Dwells within the neural networks,
Learning patterns

# 11. RAG Chain

This section implements a RAG chain that integrates the ensemble retriever (Chroma and BM25), formats the retrieved context, applies a prompt template, and generates styled output using a large language model (LLM).

The goal is to:

1. Use the EnsembleRetriever to retrieve relevant documents from Chroma and BM25.
2. Format the retrieved documents into a readable context.
3. Generate a prompt for neural style transfer using the retrieved context and the input query.
4. Pass the prompt to the LLM and parse the model's response to return the final styled output.
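
The data flow through these steps can be sketched without any external services. In the sketch below every component is a stand-in: `fake_retrieve`, `format_context`, `build_prompt`, `fake_llm`, and `run_pipeline` are hypothetical names, the retriever returns canned snippets, and the "LLM" only reports the prompt length. It is meant to show how the pieces connect, not how the real chain behaves.

```python
def fake_retrieve(query, k=2):
    """Stand-in retriever: returns canned snippets regardless of the query."""
    corpus = [
        "Machine learning automates model building using data.",
        "Deep learning uses neural networks.",
        "Natural language processing studies human language.",
    ]
    return corpus[:k]


def format_context(snippets):
    """Number the snippets, or return a default message when there are none."""
    if not snippets:
        return "No relevant context found."
    return "\n".join(f"{i}. {s}" for i, s in enumerate(snippets, 1))


def build_prompt(style, context, original_text):
    """Assemble a style transfer prompt from its three parts."""
    return (
        f"Rewrite the given text in this {style} style.\n"
        f"Use the context coming from\n{context}\n"
        f"This is the original text to be referred:\n{original_text}\n"
    )


def fake_llm(prompt):
    """Stand-in LLM: reports the prompt length instead of generating text."""
    return f"[stub output for a {len(prompt)}-character prompt]"


def run_pipeline(inputs):
    snippets = fake_retrieve(inputs["question"])          # retrieve
    context = format_context(snippets)                    # format
    prompt = build_prompt(inputs["style"], context, inputs["original_text"])
    return fake_llm(prompt)                               # generate


demo_inputs = {
    "question": "Explain machine learning.",
    "style": "poetic",
    "original_text": "Explain machine learning.",
}
print(run_pipeline(demo_inputs))
```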

In [None]:
from langchain_core.runnables import RunnablePassthrough
from langchain.prompts import PromptTemplate

def build_rag_chain(llm, chroma_store, bm25_retriever):
    """
    Build a RAG chain using an ensemble retriever with Chroma and BM25,
    followed by formatting the context, applying the prompt, and parsing the output.

    Parameters:
    - llm: The language model for generating styled text.
    - chroma_store: Chroma vector store for semantic retrieval.
    - bm25_retriever: BM25 retriever for lexical retrieval.

    Returns:
    - rag_chain: A function that processes inputs through the RAG pipeline.
    """

    # Step 1: Define the Ensemble Retriever
    ensemble_retriever = EnsembleRetriever(chroma_store, bm25_retriever)

    # Step 2: Define a function to retrieve and format context
    def retrieve_and_format_context(query, k=5):
        """
        Retrieve relevant documents and format them into a readable context.

        Parameters:
        - query (str): The input query.
        - k (int): The number of documents to retrieve (default: 5).

        Returns:
        - str: The formatted context string.
        """
        # Step 2.1: Retrieve relevant documents using the ensemble retriever.
        context_docs = ensemble_retriever.get_relevant_documents(query,k=k) 

        # Step 2.2: Format the retrieved documents.
        context = format_docs(context_docs)

        return context

    # Step 3: Define the RAG chain
    def rag_chain(inputs):
        """
        Process inputs through the RAG pipeline to generate styled output.

        Parameters:
        - inputs (dict): A dictionary containing:
            - "question" (str): The query for retrieving context.
            - "style" (str): The desired writing style.
            - "original_text" (str): The text to be rewritten.

        Returns:
        - str: The final styled output.
        """

        # Step 3.1: Retrieve and format the context using the helper function.
        query = inputs["question"]
        context = retrieve_and_format_context(query)

        # Step 3.2: Generate the prompt using the `style_prompt`.
        prompt = style_prompt.format(
            style=inputs["style"],
            context=context,
            original_text=inputs["original_text"],
        )

        # Step 3.3: Pass the prompt through the LLM supplied to `build_rag_chain`.
        llm_output = llm.invoke(prompt)

        # Step 3.4: Parse the LLM's output to extract the final styled text.
        parser = StrOutputParser() 
        result = parser.parse(llm_output) 

        return result

    return rag_chain


# 12. Final Response

This section implements the main script, which integrates all components of the RAG pipeline into a complete application. The script will:

1. Scrape content from specified URLs, process the raw text, and split it into smaller, retrievable chunks.
2. Build the retrievers: Create a Chroma vector store and a BM25 retriever using the processed documents.
3. Build the RAG chain: Set up a pipeline that integrates the retrievers, context formatting, and an LLM to perform neural style transfer.
4. Run the application: Accept a user query and a target style, then process the input through the RAG chain to produce styled output.

In [None]:
if __name__ == "__main__":
    """
    Main script for scraping, building retrievers, setting up the RAG chain,
    and running a neural style transfer demo.
    """

    # Step 1: Scrape content and split into documents
    print("Step 1: Scraping content and splitting into documents...")
    example_urls = [
        "https://en.wikipedia.org/wiki/Artificial_intelligence",
        "https://en.wikipedia.org/wiki/Machine_learning"
    ]

    # Step 1A: Initialize an empty list to store all documents
    all_docs = []

    # Step 1B: Iterate through the URLs to fetch and process content
    for url in example_urls:
        print(f"Scraping content from: {url}")

        # Step 1B.1: Fetch and parse the raw text from the URL
        raw_text = fetch_and_parse(url)

        # Skip URLs that could not be fetched (no text was returned).
        if not raw_text:
            continue

        # Step 1B.2: Split the raw text into chunks (documents)
        splits = split_text_into_documents(raw_text)

        # Step 1B.3: Add the chunks to the list of documents
        all_docs.extend(splits)

    print(f"Total number of documents: {len(all_docs)}")

    # Step 2: Build Chroma and BM25 retrievers
    print("Step 2: Building Chroma vector store and BM25 retriever...")

    # Step 2A: Build the Chroma vector store
    chroma_store = build_chroma(all_docs) 

    # Step 2B: Build the BM25 retriever
    bm25_retriever = BM25Retriever(all_docs)

    # Step 3: Build the RAG chain
    print("Step 3: Building RAG chain...")

    # Step 3A: Set up the LLM
    llm = setup_llm() 

    # Step 3B: Build the RAG chain
    rag_chain = build_rag_chain(llm, chroma_store, bm25_retriever)

    # Step 4: Neural Style Transfer Demo
    print("\nStep 4: Neural Style Transfer Demo...")

    # Step 4A: Define the user query and target style
    user_text = "Explain machine learning."
    target_style = "as if it were a recipe for cooking"
    inputs = {"question": user_text, "style": target_style, "original_text": user_text}

    print("\n============================================")
    print("        Neural Style Transfer Demo          ")
    print("============================================")
    print(f"Original Text : {user_text}")
    print(f"Desired Style : {target_style}")

    # Step 5: Run the RAG chain
    print("\nStep 5: Running the RAG chain...")

    styled_result = rag_chain(inputs) 

    print("\n--- Styled Output ---")
    print(styled_result)


Step 1: Scraping content and splitting into documents...
Scraping content from: https://en.wikipedia.org/wiki/Artificial_intelligence
Scraping content from: https://en.wikipedia.org/wiki/Machine_learning
Total number of documents: 369
Step 2: Building Chroma vector store and BM25 retriever...
Step 3: Building RAG chain...

Step 4: Neural Style Transfer Demo...

============================================
        Neural Style Transfer Demo          
============================================
Original Text : Explain machine learning.
Desired Style : as if it were a recipe for cooking

Step 5: Running the RAG chain...

--- Styled Output ---
Machine learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence concerned with the creation of algorithms that can learn from and make decisions based on data. The process of automating the application of machine learning is called machine learning engineering. Big data, which refers to extremely large or complex datasets, is often used in machine learning. Deep learning, 