
<div align="center">
  <h1></h1>
  <h1>Stylized Retrieval-Augmented Generation</h1>
  <h4 align="center">Assignmnet II</h4>
</div>

In this notebook, we build and implement a Retrieval-Augmented Generation (RAG) pipeline tailored for a text style transfer application. Text style transfer is an NLP technique that modifies the style of the text while preserving the semantic meaning and content of it.

**Goals:**

*   Build a RAG pipeline to enhance text generation with external knowledge.
*   Retrieve relevant information from a dataset or knowledge base to support text generation.
*   Build an ensemble retriever to combine the benefits of sparse and dense search.
*   Implement a neural style transfer model to transform text into a desired writing style.


**Tools Used:**
*   **BM25** for keyword-based retrieval
*   **Chroma** for semantic search
*   **HuggingFace** to load pre-trained model




**1. Access to Hugging face**


In [1]:
import getpass
import os

# Prompt user for Hugging Face API token if not already set
if "HUGGINGFACEHUB_API_TOKEN" not in os.environ:
    os.environ["HUGGINGFACEHUB_API_TOKEN"] = getpass.getpass("Enter your Huggingfacehub API token: ")


Enter your Huggingfacehub API token: ··········


**2. Installing relevant packages**


In [2]:
!pip install -q langchain
!pip install -q langchain-community
!pip install -q langchain-chroma
!pip install -q langchain-huggingface
!pip install -q bs4
!pip install -q rank_bm25
!pip install -q huggingface_hub
!pip install -q requests

**3. Fetch, parse and chunk web content**

In [3]:
import os
import requests
import numpy as np
from bs4 import BeautifulSoup
from langchain.schema import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Fetching and parsing web content to extract clean, readable text
def fetch_and_parse(url: str) -> str:
    response = requests.get(url)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, 'html.parser')

    for element in soup(['script', 'style', 'noscript']):
      element.decompose()

    text = soup.get_text(separator=' ')
    clean_text = ' '.join(text.split())

    return clean_text

# Chunking text into overlapping chunks
def split_text_into_documents(text: str, chunk_size: int = 1000, overlap: int = 100):
    docs = []

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap
    )

    docs = text_splitter.create_documents([text])
    print("Eg document: ", docs[0])
    print("\n")
    return docs


**4. Calculate Word Stats**


In [4]:
# Calculate basic word and character statistics to understand the data

def calculate_word_stats(texts):
    total_words, total_characters = 0, 0

    for doc in texts:
      #`doc.page_content` contains the text of the document.
      content = doc.page_content
      word_count = len(content.split())  # Count words
      char_count = len(content)         # Count characters
      total_words += word_count
      total_characters += char_count

    avg_words = 0
    avg_characters = 0
    num_docs = len(texts)
    if len(texts) > 0:
      avg_words = total_words / num_docs
      avg_characters = total_characters / num_docs

    print(f"Average words per document: {avg_words}")
    print(f"Average characters per document: {avg_characters}")


In [5]:
# Creating sample Document objects for testing
sample_docs = [
    Document(page_content="This is the first test document."),
    Document(page_content="Here is another example document for testing."),
    Document(page_content="Short text."),
    Document(page_content="This document has more content. It's longer and has more words in it for testing purposes."),
]

calculate_word_stats(sample_docs)


Average words per document: 7.75
Average characters per document: 44.5


**5. Set Up LLM**



In [6]:
# Set up pre-trained LLM with Hugging Face.

from langchain_huggingface import HuggingFaceEndpoint

def setup_llm(repo_id="mistralai/Mistral-7B-Instruct-v0.3"):
    llm = HuggingFaceEndpoint(
        repo_id=repo_id,
        task="text-generation",
        temperature=1.0,
    )
    return llm


**6. BM25 Retriever**


In [7]:
# Class to set up BM25 retriever.
# init method initializes and tokenizes the document
# retrieve method returns the top k most relevant documents

from rank_bm25 import BM25Okapi
from langchain_core.runnables import RunnablePassthrough

class BM25Retriever:
    def __init__(self, documents):

        self.documents = documents
        self.corpus = [doc.page_content for doc in self.documents]
        print(self.corpus)
        print("\n")

        #Tokenize the corpus.
        self.tokenized_corpus = [doc.split(" ") for doc in self.corpus]
        print(self.tokenized_corpus)
        print("\n")

        self.bm25 = BM25Okapi(self.tokenized_corpus)

    def retrieve(self, query, k=5):
        # Tokenize the input query.
        tokenized_query = query.split(" ")

        k_docs = self.bm25.get_top_n(tokenized_query, self.corpus, n=k)
        print(self.bm25.get_scores(tokenized_query))

        return k_docs


In [8]:
# Testing BM25

from langchain.schema import Document

sample_docs = [
    Document(page_content="Machine learning is a method of data analysis that automates analytical model building."),
    Document(page_content="Deep learning is a subset of machine learning that uses neural networks with three or more layers."),
    Document(page_content="Artificial intelligence encompasses a wide range of technologies, including machine learning and deep learning."),
    Document(page_content="Natural language processing is a field of AI focused on the interaction between computers and human language."),
]

retriever = BM25Retriever(sample_docs)

query = "What is machine learning?"
top_docs = retriever.retrieve(query, k=2)

print("Top Relevant Documents:")
for idx, doc in enumerate(top_docs, 1):
    print(f"{idx}. {doc}")


['Machine learning is a method of data analysis that automates analytical model building.', 'Deep learning is a subset of machine learning that uses neural networks with three or more layers.', 'Artificial intelligence encompasses a wide range of technologies, including machine learning and deep learning.', 'Natural language processing is a field of AI focused on the interaction between computers and human language.']


[['Machine', 'learning', 'is', 'a', 'method', 'of', 'data', 'analysis', 'that', 'automates', 'analytical', 'model', 'building.'], ['Deep', 'learning', 'is', 'a', 'subset', 'of', 'machine', 'learning', 'that', 'uses', 'neural', 'networks', 'with', 'three', 'or', 'more', 'layers.'], ['Artificial', 'intelligence', 'encompasses', 'a', 'wide', 'range', 'of', 'technologies,', 'including', 'machine', 'learning', 'and', 'deep', 'learning.'], ['Natural', 'language', 'processing', 'is', 'a', 'field', 'of', 'AI', 'focused', 'on', 'the', 'interaction', 'between', 'computers', 'and'

**7. Build Chroma**


In [9]:
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.schema import Document

def build_chroma(documents: list[Document]) -> Chroma:

    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

    vector_store = Chroma(
        collection_name="EngGenAI",
        embedding_function=embeddings,
    )
    vector_store.add_documents(documents)

    return vector_store


In [10]:
# Testing Chroma

from langchain.schema import Document

sample_docs = [
    Document(page_content="Machine learning is a method of data analysis that automates analytical model building."),
    Document(page_content="Deep learning is a subset of machine learning that uses neural networks with three or more layers."),
    Document(page_content="Artificial intelligence encompasses a wide range of technologies, including machine learning and deep learning."),
    Document(page_content="Natural language processing is a field of AI focused on the interaction between computers and human language."),
]

vector_store = build_chroma(sample_docs)

print("Vector store built successfully!")
print(vector_store)


  embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
  vector_store = Chroma(


Vector store built successfully!
<langchain_community.vectorstores.chroma.Chroma object at 0x7d889a56bf10>


**8. Ensemble Retriever**

In [11]:
# Combine Chroma for semantic retrieval and BM25 for keyword retrival

from langchain.schema import Document

class EnsembleRetriever:

    def __init__(self, chroma_store, bm25_retriever):
        self.chroma_store = chroma_store
        self.bm25_retriever = bm25_retriever

    def get_relevant_documents(self, query: str, k: int = 5):
        chroma_docs = self.chroma_store.similarity_search(query, k)

        bm25_docs = self.bm25_retriever.retrieve(query, k)

        combined = chroma_docs + bm25_docs

        # Deduplicate the combined results.
        seen = set()
        unique_docs = []
        for doc in combined:
            content = doc.page_content if isinstance(doc, Document) else doc

            # Use the first 60 characters of the document text as a key for deduplication.
            key = content[:60]

            if key not in seen:
                if isinstance(doc, str):
                    doc = Document(page_content=doc)
                unique_docs.append(doc)
                seen.add(key)

        return unique_docs[:k]


In [12]:
# Testing Ensemble retriever
from langchain.schema import Document

sample_docs = [
    Document(page_content="Machine learning automates model building using data."),
    Document(page_content="Deep learning is a type of machine learning using neural networks."),
    Document(page_content="AI includes technologies like machine learning and deep learning."),
    Document(page_content="Natural language processing focuses on human-computer language interaction."),
]

# mock behaviour
class MockChroma:
    def similarity_search(self, query, k):
        return [Document(page_content="Machine learning automates model building using data.")]

class MockBM25:
    def retrieve(self, query, k):
        return ["Deep learning is a type of machine learning using neural networks."]

chroma = MockChroma()
bm25 = MockBM25()

ensemble_retriever = EnsembleRetriever(chroma, bm25)

query = "What is machine learning?"
results = ensemble_retriever.get_relevant_documents(query, k=3)

print("Ensemble Retrieval Results:")
for idx, doc in enumerate(results, 1):
    print(f"{idx}. {doc.page_content}")


Ensemble Retrieval Results:
1. Machine learning automates model building using data.
2. Deep learning is a type of machine learning using neural networks.


In [13]:
from langchain_core.output_parsers import BaseOutputParser

class StrOutputParser(BaseOutputParser):
    def parse(self, text: str):
        return text

**9. Formatting Documents and Prompt Styling**

In [14]:
from langchain.prompts import PromptTemplate

def format_docs(docs):
    if not docs:
        return "No relevant context found."

    snippet_list = []

    for i, doc in enumerate(docs):
        cleaned_content = doc.page_content.replace("\n", " ").strip()
        snippet_list.append(f"{i+1}. {cleaned_content}")

    return "\n".join(snippet_list)


# Define the style transfer prompt template
style_prompt = PromptTemplate(
    input_variables=["style", "context", "original_text"],
    template=(
        "Rewrite the following text in a {style} style:\n"
          "{original_text}\n\n"
          "Use the context below to guide the rewrite if needed:\n"
          "{context}"
    )
)


In [15]:
# Test above methods

from langchain.schema import Document
from langchain_huggingface import HuggingFaceEndpoint

def setup_llm():
    return HuggingFaceEndpoint(
        repo_id="mistralai/Mistral-7B-Instruct-v0.3",
        task="text-generation",
        temperature=0.7
    )

sample_docs = [
    Document(page_content="Machine learning automates data analysis."),
    Document(page_content="Deep learning uses neural networks to learn patterns."),
    Document(page_content="Artificial intelligence includes various technologies."),
]

formatted_docs = format_docs(sample_docs)
print("Formatted Documents:\n")
print(formatted_docs)

style = "poetic"
context = formatted_docs
original_text = "Artificial intelligence is transforming the world."

styled_prompt = style_prompt.format(
    style=style,
    context=context,
    original_text=original_text,
)

print("\nGenerated Prompt for Style Transfer:\n")
print(styled_prompt)

llm = setup_llm()
styled_output = llm(styled_prompt)

print("\n--- Rewritten (Styled) Text ---")
print(styled_output)


Formatted Documents:

1. Machine learning automates data analysis.
2. Deep learning uses neural networks to learn patterns.
3. Artificial intelligence includes various technologies.

Generated Prompt for Style Transfer:

Rewrite the following text in a poetic style:
Artificial intelligence is transforming the world.

Use the context below to guide the rewrite if needed:
1. Machine learning automates data analysis.
2. Deep learning uses neural networks to learn patterns.
3. Artificial intelligence includes various technologies.


  styled_output = llm(styled_prompt)



--- Rewritten (Styled) Text ---


In the realm where logic and reason intertwine,
A force of change, the world redefines,
Machine learning, a data analyst's friend,
Automating the tasks, the human mind can't bend.

Deep learning, a neural network's grace,
Learning patterns, in a timeless race,
A network of nodes, a web of thought,
In the vast sea of data, it's the silent boat.

Artificial intelligence, a multifaceted crown,
Encompassing technologies, from the ground up thrown,
A beacon of progress, a sign of the times,
Transforming the world, as it aligns.


**10. RAG chain**

The goal is to:

*   Use the EnsembleRetriever to retrieve relevant documents from Chroma and BM25.
*   Format the retrieved documents into a readable context.
*   Generate a prompt for neural style transfer using the retrieved context and the input query.
*   Pass the prompt to the LLM and parse the model's response to return the final styled output.

In [16]:
from langchain_core.runnables import RunnablePassthrough
from langchain.prompts import PromptTemplate

def build_rag_chain(llm, chroma_store, bm25_retriever):
    ensemble_retriever = EnsembleRetriever(chroma_store, bm25_retriever)

    def retrieve_and_format_context(query, k=5):

        context_docs = ensemble_retriever.get_relevant_documents(query, k)

        context = format_docs(context_docs)

        return context

    def rag_chain(inputs):

        query = inputs["question"]
        context = retrieve_and_format_context(query)

        prompt = style_prompt.format(
            style=inputs["style"],
            context=context,
            original_text=inputs["original_text"],
        )

        llm_output = llm(prompt)

        parser = StrOutputParser()
        result = parser.parse(llm_output)
        return result

    return rag_chain


**11. Final execution to create stylized resoinse**



In [17]:
if __name__ == "__main__":
    """
    Main script for scraping, building retrievers, setting up the RAG chain,
    and running a neural style transfer demo.
    """
    print("Step 1: Scraping content and splitting into documents...")
    example_urls = [
        "https://en.wikipedia.org/wiki/Artificial_intelligence",
        "https://en.wikipedia.org/wiki/Machine_learning"
    ]

    all_docs = []

    for url in example_urls:
        print(f"Scraping content from: {url}")

        raw_text = fetch_and_parse(url)
        print("Raw Text: ", raw_text)
        print("\n")

        splits = split_text_into_documents(raw_text)
        print("Document Splits: ", splits)
        print("\n")

        all_docs.extend(splits)

    print(f"Total number of documents: {len(all_docs)}")

    print("Step 2: Building Chroma vector store and BM25 retriever...")

    chroma_store = build_chroma(all_docs)

    bm25_retriever = BM25Retriever(all_docs)

    print("Step 3: Building RAG chain...")

    llm = setup_llm()

    rag_chain = build_rag_chain(llm, chroma_store, bm25_retriever)

    print("\nStep 4: Neural Style Transfer Demo...")

    user_text = "Explain machine learning."
    target_style = "as if it were a recipe for cooking"
    inputs = {"question": user_text, "style": target_style, "original_text": user_text}

    print("\n============================================")
    print("        Neural Style Transfer Demo          ")
    print("============================================")
    print(f"Original Text : {user_text}")
    print(f"Desired Style : {target_style}")

    print("\nStep 5: Running the RAG chain...")

    styled_result = rag_chain(inputs)

    print("\n--- Styled Output ---")
    print(styled_result)


Step 1: Scraping content and splitting into documents...
Scraping content from: https://en.wikipedia.org/wiki/Artificial_intelligence


Eg document:  page_content='Artificial intelligence - Wikipedia Jump to content Main menu Main menu move to sidebar hide Navigation Main page Contents Current events Random article About Wikipedia Contact us Contribute Help Learn to edit Community portal Recent changes Upload file Special pages Search Search Appearance Donate Create account Log in Personal tools Donate Create account Log in Pages for logged out editors learn more Contributions Talk Contents move to sidebar hide (Top) 1 Goals Toggle Goals subsection 1.1 Reasoning and problem-solving 1.2 Knowledge representation 1.3 Planning and decision-making 1.4 Learning 1.5 Natural language processing 1.6 Perception 1.7 Social intelligence 1.8 General intelligence 2 Techniques Toggle Techniques subsection 2.1 Search and optimization 2.1.1 State space search 2.1.2 Local search 2.2 Logic 2.3 Probabilis

