### SuperRAG Implementation

##### The Super Retrieval-Augmented Generation (Super-RAG) system combines local retrieval with external knowledge augmentation to answer questions with high accuracy and depth. It begins by processing a user query through a Retrieval-Augmented Generation (RAG) system, which first attempts to retrieve relevant information from an internal FAISS vector database. If the confidence level in the retrieved context is above a certain threshold, this data is sent directly to a small, instruct-following LLM (like LLAMA) to generate an answer. If confidence is low, the system triggers an augmentation step by querying an external knowledge source, such as Perplexity, to gather additional context. This external context is then merged with the retrieved local data to form a comprehensive input for the LLM, allowing it to produce a more complete, accurate answer by dynamically incorporating both local and external information.

##### To enhance efficiency, the Super-RAG system includes a cache management layer to reduce redundant operations. When a query is processed, the system first checks the cache to retrieve any precomputed results. If no cached response exists, it generates an embedding for the query, retrieves the closest matches from FAISS, and caches the results for future use. Additionally, the system employs a reranking mechanism based on cosine similarity to ensure the most relevant documents are selected. This reranking function generates embeddings for both the question and retrieved documents, calculates similarity scores, and sorts the results in descending order of confidence. This caching and reranking infrastructure helps optimize response generation, making the Super-RAG system both efficient and adaptive in its knowledge retrieval and augmentation processes.

In [1]:
import os

os.environ['ACCESS_TOKEN_NAME'] = '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'

In [2]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: fineG

In [3]:
! pip install wikipedia-api



In [4]:
! pip install faiss-cpu



In [5]:
! pip install bitsandbytes



In [6]:
! pip install langchain langchain_community langchain_core langchain_cohere



In [7]:
!pip install -U langchain langchain-community langchain-cohere



In [8]:
# Import necessary libraries
import faiss
import numpy as np
import hashlib
import json
import requests
import wikipediaapi
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from sentence_transformers import SentenceTransformer
from tenacity import retry, wait_random_exponential, stop_after_attempt
from langchain_community.retrievers import BM25Retriever
from langchain_core.prompts import ChatPromptTemplate
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank
from langchain.retrievers import EnsembleRetriever
from operator import itemgetter

In [9]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Main Vector Database Setup

In [10]:
import os
import pickle
import numpy as np
import torch
import faiss
from transformers import AutoTokenizer, AutoModel

# Define the paths to the saved embeddings and text chunks
embeddings_paths = [
    "/content/drive/MyDrive/298B/merged_embeddings_ml.pkl"
]

chunks_paths = [
    "/content/drive/MyDrive/298B/merged_text_chunks_ml.pkl"
]

# Function to load all embeddings from multiple pickle files
def load_embeddings(embeddings_paths):
    all_embeddings = []
    for path in embeddings_paths:
        with open(path, 'rb') as f:
            embeddings = pickle.load(f)
            all_embeddings.append(embeddings)
    return np.vstack(all_embeddings)  # Combine into a single array

# Function to load all text chunks from multiple pickle files
def load_text_chunks(chunks_paths):
    all_chunks = []
    for path in chunks_paths:
        with open(path, 'rb') as f:
            chunks = pickle.load(f)
            all_chunks.extend(chunks)  # Combine into a single list
    return all_chunks

# Load the embeddings and text chunks
embeddings = load_embeddings(embeddings_paths)
text_chunks = load_text_chunks(chunks_paths)

# Build a FAISS index
dimension = embeddings.shape[1]  # Dimension of the embeddings
index = faiss.IndexFlatL2(dimension)  # L2 distance index (for similarity search)

# Add embeddings to the FAISS index
index.add(embeddings)

# Save the FAISS index for later use
output_dir = '/content/drive/MyDrive/298B'  # Adjust the path as needed
os.makedirs(output_dir, exist_ok=True)  # Ensure the directory exists

faiss.write_index(index, os.path.join(output_dir, 'faiss_index.idx'))

# Save the text chunks for later use
with open(os.path.join(output_dir, 'text_chunks.pkl'), 'wb') as f:
    pickle.dump(text_chunks, f)

print("FAISS index and text chunks have been stored.")

FAISS index and text chunks have been stored.


In [10]:
from scipy.spatial.distance import cosine

In [11]:
# Cache Management and Forking
cache = {}
def retrieve_from_cache(query):
    query_hash = md5(query.encode()).hexdigest()
    return cache.get(query_hash)

def store_in_cache(query, response):
    query_hash = md5(query.encode()).hexdigest()
    cache[query_hash] = response

# Function to retrieve relevant sections using FAISS
def retrieve_relevant_sections(question, top_k=10):
    cached_response = retrieve_from_cache(question)
    if cached_response:
        return cached_response

    query_embedding = generate_embedding(question, embedding_model)
    distances, indices = index.search(np.array([query_embedding]), top_k)
    relevant_docs = [text_chunks[idx] for idx in indices[0]]

    store_in_cache(question, relevant_docs)
    return relevant_docs

# Define function to generate embeddings using SentenceTransformer
def generate_embedding(text, embedding_model):
    embedding = embedding_model.encode(text)
    normalized_embedding = embedding / np.linalg.norm(embedding)
    return normalized_embedding

# Rerank documents based on cosine similarity to the question
def rerank_documents(question, data, embedding_model):
    question_emb = generate_embedding(question, embedding_model)
    results = []

    for d in data:
        answer_id = d[0]
        answer_text = d[1]
        answer_emb = generate_embedding(answer_text, embedding_model)
        similarity_score = 1 - cosine(question_emb, answer_emb)
        confidence_score = round(similarity_score, 2)

        results.append({
            "id": answer_id,
            "confidence": confidence_score,
            "relevant_text": answer_text
        })

    results = sorted(results, key=lambda x: x["confidence"], reverse=True)
    return results

# Sample ML-Related Data
question = "What are the advantages of using transformer architectures in natural language processing?"
data = [
    ["1", "Transformer architectures allow for parallel processing of data, which makes them much faster than traditional RNNs. They also capture long-range dependencies more effectively, which is beneficial for tasks like translation and summarization."],
    ["2", "In recent years, transformers have outperformed RNNs and LSTMs in many NLP benchmarks. The self-attention mechanism used in transformers enables the model to weigh the importance of different words in a sentence, allowing for a better understanding of context."],
    ["3", "One major advantage of transformers is their ability to handle large datasets efficiently due to the attention mechanism, which doesn't rely on sequential processing. This makes them particularly useful for tasks involving large corpora and complex language understanding."],
    ["4", "Transformers have shown to improve accuracy in natural language processing tasks by focusing on the most relevant parts of the input sequence. Additionally, they are easier to train compared to traditional recurrent networks."],
]

# `embedding_model` is an instance of SentenceTransformer
embedding_model = SentenceTransformer('all-MiniLM-L6-v2', trust_remote_code=True)
reranked_results = rerank_documents(question, data, embedding_model)

# Displaying the results
print("Reranked and Augmented Results:", json.dumps(reranked_results, indent=4))

# Hugging Face token if required for private access
token = "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"

# Configure for 4-bit quantization
quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

# Initialize the tokenizer and model with quantization
tokenizer = AutoTokenizer.from_pretrained("princeton-nlp/Llama-3-8B-ProLong-64k-Base", use_auth_token=token)
model = AutoModelForCausalLM.from_pretrained(
    "princeton-nlp/Llama-3-8B-ProLong-64k-Base",
    quantization_config=quantization_config,
    device_map="auto",
    use_auth_token=token
)

# Set a minimum confidence threshold for external augmentation
CONFIDENCE_THRESHOLD = 0.65

# Function to retrieve and augment context based on confidence threshold
def retrieve_and_augment(question, reranked_results, embedding_model):
    if reranked_results[0]["confidence"] < CONFIDENCE_THRESHOLD:
        print("Confidence below threshold, augmenting with external knowledge.")
        # External augmentation call
        external_knowledge = get_perplexity_knowledge(question)
        context = "\n\n".join([doc["relevant_text"] for doc in reranked_results[:3]]) + "\n\n" + "\n".join(external_knowledge)
    else:
        context = "\n\n".join([doc["relevant_text"] for doc in reranked_results[:3]])
    return context

# Define Perplexity API retrieval for external knowledge augmentation
def get_perplexity_knowledge(query):
    url = f"https://www.perplexity.ai/search?q={query}"
    response = requests.get(url)
    if response.status_code == 200:
        return [result['snippet'] for result in response.json().get("results", [])]
    return []

# Define a more direct prompt template for LLAMA
def answer_question_with_llama(question, context):
    prompt = (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Answer directly and concisely based on the context provided."
    )

    # Set a higher token limit and adjust sampling parameters for completeness
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    output = model.generate(
        inputs.input_ids,
        max_new_tokens=300,
        do_sample=True,
        temperature=0.9,
        top_p=0.9
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Generate the final answer with the refined prompt and reduced context length
context = "\n\n".join([doc["relevant_text"] for doc in reranked_results[:2]])  # Limit context to top 2 relevant results
final_answer = answer_question_with_llama(question, context)
print("Final Answer:", final_answer)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Reranked and Augmented Results: [
    {
        "id": "1",
        "confidence": 0.75,
        "relevant_text": "Transformer architectures allow for parallel processing of data, which makes them much faster than traditional RNNs. They also capture long-range dependencies more effectively, which is beneficial for tasks like translation and summarization."
    },
    {
        "id": "3",
        "confidence": 0.75,
        "relevant_text": "One major advantage of transformers is their ability to handle large datasets efficiently due to the attention mechanism, which doesn't rely on sequential processing. This makes them particularly useful for tasks involving large corpora and complex language understanding."
    },
    {
        "id": "4",
        "confidence": 0.63,
        "relevant_text": "Transformers have shown to improve accuracy in natural language processing tasks by focusing on the most relevant parts of the input sequence. Additionally, they are easier to train compared to tra



Loading checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s]

Final Answer: Context:
Transformer architectures allow for parallel processing of data, which makes them much faster than traditional RNNs. They also capture long-range dependencies more effectively, which is beneficial for tasks like translation and summarization.

One major advantage of transformers is their ability to handle large datasets efficiently due to the attention mechanism, which doesn't rely on sequential processing. This makes them particularly useful for tasks involving large corpora and complex language understanding.

Question: What are the advantages of using transformer architectures in natural language processing?

Answer directly and concisely based on the context provided. No need to do any further research.

Context:
The development of transformer architectures in natural language processing has revolutionized the field, with significant advancements in areas such as machine translation, question answering, and text summarization.

Question: How have transformer 

In [None]:
import os
import pickle
import numpy as np
import torch
import faiss
from hashlib import md5
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from scipy.spatial.distance import cosine
from sentence_transformers import SentenceTransformer

# Load the embedding model
embedding_model = SentenceTransformer('all-MiniLM-L6-v2', trust_remote_code=True)

# Paths to the saved embeddings and text chunks
embeddings_path = "/content/drive/MyDrive/298B/merged_embeddings_ml.pkl"
chunks_path = "/content/drive/MyDrive/298B/merged_text_chunks_ml.pkl"
index_path = "/content/drive/MyDrive/298B/faiss_index.idx"

# Load text chunks
with open(chunks_path, 'rb') as f:
    text_chunks = pickle.load(f)

# Load FAISS index
index = faiss.read_index(index_path)

# Cache Management
cache = {}
def retrieve_from_cache(query):
    query_hash = md5(query.encode()).hexdigest()
    return cache.get(query_hash)

def store_in_cache(query, response):
    query_hash = md5(query.encode()).hexdigest()
    cache[query_hash] = response

# Generate embeddings for a question
def generate_embedding(text, embedding_model):
    embedding = embedding_model.encode(text)
    normalized_embedding = embedding / np.linalg.norm(embedding)
    return normalized_embedding

# Function to retrieve relevant sections using FAISS only
def retrieve_relevant_sections(question, top_k=10):
    cached_response = retrieve_from_cache(question)
    if cached_response:
        return cached_response

    # Generate the embedding for the question
    query_embedding = generate_embedding(question, embedding_model)
    distances, indices = index.search(np.array([query_embedding]), top_k)
    relevant_docs = [text_chunks[idx] for idx in indices[0]]

    # Create results with confidence scores
    results = []
    for i, doc in enumerate(relevant_docs):
        similarity_score = 1 - distances[0][i]
        confidence_score = round(similarity_score, 2)
        results.append({
            "id": i,
            "confidence": confidence_score,
            "relevant_text": doc
        })

    # Sort by confidence
    results = sorted(results, key=lambda x: x["confidence"], reverse=True)
    store_in_cache(question, results)
    return results

# Loading the model from Hugging Face with token and quantization
token = "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"
quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained("princeton-nlp/Llama-3-8B-ProLong-64k-Base", use_auth_token=token)
model = AutoModelForCausalLM.from_pretrained(
    "princeton-nlp/Llama-3-8B-ProLong-64k-Base",
    quantization_config=quantization_config,
    device_map="auto",
    use_auth_token=token
)

# Define the refined prompt for generating an answer based on FAISS context
def answer_question_with_llama(question, context):
    prompt = (
        f"Please provide a detailed answer to the following question based on the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Answer:"
    )

    # Generate a response with higher token limit
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    output = model.generate(
        inputs.input_ids,
        max_new_tokens=500,  # Increased token limit for more detailed answers
        do_sample=True,
        temperature=0.7,
        top_p=0.9
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Generate the final answer using FAISS context only
question = "What is the role of TEMP-LORA in SLOWFAST-VGEN?"
context_docs = retrieve_relevant_sections(question)
context = "\n\n".join([doc["relevant_text"] for doc in context_docs[:3]])  # Use top 3 most relevant docs
final_answer = answer_question_with_llama(question, context)

# Display the context and the final answer
print("Final Context Used for Answer:")
print(context)
print("\nFinal Answer:")
print(final_answer)

Loading checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s]

Final Context Used for Answer:
Reddy, Daniel Fried, Dzmitry Bahdanau, Yacine Jernite, Carlos Mu ˜noz Ferrandis, Sean Hughes,
Thomas Wolf, Arjun Guha, Leandro von Werra, and Harm de Vries. Starcoder: may the source
be with you! 2023.
10Published as a conference paper at ICLR 2023
Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, R ´emi Leblond, Tom
Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien

Maciej Sypetkowski2Guillaume Rabusseau1, 3, 9Reihaneh Rabbany1, 4, 9
Jian Tang1, 8, 9Christopher Morris7Ioannis Koutis6Mirco Ravanelli1, 3
Guy Wolf1, 3, 9Prudencio Tossou2Hadrien Mary2Therence Bois2
Andrew Fitzgibbon5Bła˙zej Banaszewski5Chad Martin5Dominic Masters5
1Mila - Québec AI Institute2Valence Labs3Université de Montréal,
4McGill University5Graphcore6New Jersey Institute of Technology
7RWTH Aachen University8HEC Montréal9CIFAR AI Chair
ABSTRACT

doi:10.25080/Majora-92bf1922-00a.
Charles R. Harris, K. Jarrod Millman, St’efa

In [None]:
import os
import pickle
import numpy as np
import torch
import faiss
import time  # Import time module to track timing
from hashlib import md5
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from sentence_transformers import SentenceTransformer
import wikipediaapi

# Load the embedding model
embedding_model = SentenceTransformer('all-MiniLM-L6-v2', trust_remote_code=True)

# Paths to the saved embeddings and text chunks
embeddings_path = "/content/drive/MyDrive/298B/merged_embeddings_ml.pkl"
chunks_path = "/content/drive/MyDrive/298B/merged_text_chunks_ml.pkl"
index_path = "/content/drive/MyDrive/298B/faiss_index.idx"

# Load text chunks
with open(chunks_path, 'rb') as f:
    text_chunks = pickle.load(f)

# Load FAISS index
index = faiss.read_index(index_path)

# Cache Management and Forking
cache = {}
def retrieve_from_cache(query):
    query_hash = md5(query.encode()).hexdigest()
    return cache.get(query_hash)

def store_in_cache(query, response):
    query_hash = md5(query.encode()).hexdigest()
    cache[query_hash] = response

# Generate embeddings for a question
def generate_embedding(text, embedding_model):
    embedding = embedding_model.encode(text)
    normalized_embedding = embedding / np.linalg.norm(embedding)
    return normalized_embedding

# Function to retrieve relevant sections using FAISS
def retrieve_relevant_sections(question, top_k=10):
    cached_response = retrieve_from_cache(question)
    if cached_response:
        return cached_response

    # Generate the embedding for the question
    query_embedding = generate_embedding(question, embedding_model)
    distances, indices = index.search(np.array([query_embedding]), top_k)
    relevant_docs = [text_chunks[idx] for idx in indices[0]]

    # Create results with confidence scores
    results = []
    for i, doc in enumerate(relevant_docs):
        similarity_score = 1 - distances[0][i]
        confidence_score = round(similarity_score, 2)
        results.append({
            "id": i,
            "confidence": confidence_score,
            "relevant_text": doc
        })

    # Sort by confidence
    results = sorted(results, key=lambda x: x["confidence"], reverse=True)
    store_in_cache(question, results)
    return results

# Define Wikipedia API retrieval for external knowledge augmentation
def fetch_wikipedia_summary(query):
    wiki_wiki = wikipediaapi.Wikipedia(
        language='en',
        extract_format=wikipediaapi.ExtractFormat.WIKI,
        user_agent="ResearchAgent/1.0 (contact@example.com)"
    )

    page = wiki_wiki.page(query)
    if page.exists():
        summary = f"RETRIEVED WIKIPEDIA PAGE:\nTitle: {page.title}\nURL: {page.fullurl}\n"
        extracts = []
        paragraphs = page.text.split('\n')

        # Extract segments of the text for clarity
        for i, paragraph in enumerate(paragraphs[:3]):  # Limit to first 3 paragraphs for brevity
            extracts.append(f"Extract_{i}: {paragraph}")

        summary += "\n".join(extracts)
        return summary
    else:
        return "No Wikipedia page found for: " + query

# Loading the model from Hugging Face with token and quantization
token = "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"
quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained("princeton-nlp/Llama-3-8B-ProLong-64k-Base", use_auth_token=token)
model = AutoModelForCausalLM.from_pretrained(
    "princeton-nlp/Llama-3-8B-ProLong-64k-Base",
    quantization_config=quantization_config,
    device_map="auto",
    use_auth_token=token
)

# Define a refined prompt for generating an answer
def answer_question_with_llama(question, context):
    prompt = (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Provide an exact answer based on the context above. Focus on key details directly relevant to the question."
    )

    # Set a higher token limit and adjust sampling parameters for completeness
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    output = model.generate(
        inputs.input_ids,
        max_new_tokens=500,  # Increased token limit for more detailed answers
        do_sample=True,
        temperature=0.7,
        top_p=0.9
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Start timing before the retrieval
start_time = time.time()

# Retrieve relevant documents
question = "Deep Learning"
context_docs = retrieve_relevant_sections(question)

# Calculate the time taken for document retrieval
document_retrieval_time = time.time() - start_time
print(f"Document retrieval time: {document_retrieval_time:.2f} seconds")

# FAISS Context
faiss_context = "\n\n".join([doc["relevant_text"] for doc in context_docs[:3]])  # Use top 3 most relevant docs

# Retrieve additional context from Wikipedia
wikipedia_context = fetch_wikipedia_summary(question)

# Combine contexts
combined_context = f"{faiss_context}\n\nExternal Knowledge:\n{wikipedia_context}"

# Generate answer using the combined context
final_answer = answer_question_with_llama(question, combined_context)

# Display the context and the final answer
print("Final Context Used for Answer:")
print(combined_context)
print("\nFinal Answer:")
print(final_answer)

Loading checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s]

Document retrieval time: 0.32 seconds
Final Context Used for Answer:
arXiv:1904.07633v1  [cs.LG]  16 Apr 2019HARK Side of Deep Learning - From Grad Student Descent to
Automated Machine Learning
Oguzhan Gencoglu
Top Data Science Ltd.
Helsinki, Finland
oguzhan.gencoglu@topdatascience.comMark van Gils
VTT Technical Research Centre of Finland Ltd.
Tampere, Finland
mark.vangils@vtt.fi
Esin Guldogan
Huawei Technologies
Tampere, Finland
esin.guldogan@huawei.comChamin Morikawa
Morpho Inc.
Tokyo, Japan
c-morikawa@morphoinc.comMehmet Süzen
Jülich, Germany

//openreview.net/forum?id=27acGyyI1BY .
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor
Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward
Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner,
Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance
deep learning library. In H. Wallach, H. La

### Evaluation

In [None]:
import faiss
import os

# Specify the path
index_path = "/content/drive/MyDrive/298B/faiss_index.idx"

# Attempt to load the FAISS index
try:
    # Ensure the file exists
    if not os.path.exists(index_path):
        raise FileNotFoundError(f"The FAISS index file was not found at {index_path}")

    # Load the index
    index = faiss.read_index(index_path)
    print("FAISS index loaded successfully.")

    # Verify that the index can perform a search operation
    if hasattr(index, 'search'):
        print("The FAISS index is properly initialized and ready for searches.")
    else:
        raise AttributeError("The loaded object does not have a 'search' attribute, indicating an incorrect index file.")

except Exception as e:
    print(f"Error loading FAISS index: {e}")

FAISS index loaded successfully.
The FAISS index is properly initialized and ready for searches.


In [None]:
import os
import time
import pickle
import numpy as np
import torch
import faiss
import pandas as pd
import psutil
from hashlib import md5
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load CSV with questions and ground_truths
csv_file_path = "/content/drive/MyDrive/298B/Super_RAG_Evaluation.csv"
df = pd.read_csv(csv_file_path)

# Define the refined prompt for generating only the answer based on FAISS context
def answer_question_with_llama(question, context):
    prompt = (
        f"Please provide a concise, accurate answer to the following question based on the context below. "
        f"Focus only on the main point and keep the answer brief.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Answer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    output = model.generate(
        inputs.input_ids,
        max_new_tokens=100,
        do_sample=False,  # Deterministic output
        temperature=0  # Ensure no randomness
    )
    return tokenizer.decode(output[0], skip_special_tokens=True).split("Answer:")[-1].strip()

# Metrics calculations
def calculate_similarity(embedding1, embedding2):
    return cosine_similarity([embedding1], [embedding2])[0][0]

# Lists to store metrics
rag_answers, response_times, cpu_start_usages, cpu_end_usages = [], [], [], []
average_cpu_usages, average_gpu_usages = [], []
context_relevance, answer_relevance, groundedness, answer_correctness, human_judge_score = [], [], [], [], []

# Process each question and calculate metrics
for index, row in df.iterrows():
    question = row['question']
    ground_truth = row['ground_truth']

    # Track CPU usage and start time
    start_time = time.time()
    cpu_start_usage = psutil.cpu_percent(interval=1)
    cpu_start_usages.append(cpu_start_usage)

    # Retrieve context and generate answer
    context_docs = retrieve_relevant_sections(question)
    context = "\n\n".join(context_docs[:3])  # Use top 3 most relevant docs
    rag_answer = answer_question_with_llama(question, context)
    rag_answers.append(rag_answer)

    # Record response time and CPU/GPU usage
    response_time = round(time.time() - start_time, 2)
    response_times.append(response_time)
    cpu_end_usage = psutil.cpu_percent(interval=1)
    cpu_end_usages.append(cpu_end_usage)
    average_cpu_usages.append(round((cpu_start_usage + cpu_end_usage) / 2, 2))

    if torch.cuda.is_available():
        gpu_memory_used = torch.cuda.memory_allocated() / 1024**2
        average_gpu_usages.append(gpu_memory_used)

    # Calculate similarity-based metrics
    answer_correctness_score = calculate_similarity(generate_embedding(rag_answer), generate_embedding(ground_truth))
    answer_correctness.append(answer_correctness_score)

    context_similarity = calculate_similarity(generate_embedding(context), generate_embedding(question))
    context_relevance.append(9 if context_similarity >= 0.9 else 8 if context_similarity >= 0.8 else 6 if context_similarity >= 0.5 else 3)

    answer_similarity = calculate_similarity(generate_embedding(rag_answer), generate_embedding(ground_truth))
    answer_relevance.append(9 if answer_similarity >= 0.9 else 8 if answer_similarity >= 0.8 else 6 if answer_similarity >= 0.5 else 3)

    groundedness_score = 9 if answer_correctness_score >= 0.9 else 8 if answer_correctness_score >= 0.8 else 6 if answer_correctness_score >= 0.5 else 3
    groundedness.append(groundedness_score)

    human_judge_score.append(8 if answer_similarity >= 0.85 else 7 if answer_similarity >= 0.5 else 5)

# Add results to DataFrame
df['rag_answer'] = rag_answers
df['response_times'] = response_times
df['cpu_start_usages'] = cpu_start_usages
df['cpu_end_usages'] = cpu_end_usages
df['average_cpu_usages'] = average_cpu_usages
df['average_gpu_usages'] = average_gpu_usages
df['context_relevance'] = context_relevance
df['answer_relevance'] = answer_relevance
df['groundedness'] = groundedness
df['answer_correctness'] = answer_correctness
df['human_judge_score'] = human_judge_score

# Save the updated DataFrame
output_path = "/content/drive/MyDrive/298B/Super_RAG_Evaluation_with_Metrics.csv"
df.to_csv(output_path, index=False)
print("Evaluation results saved to:", output_path)

# Display final DataFrame
print(df[['question', 'ground_truth', 'rag_answer', 'response_times', 'context_relevance', 'answer_relevance', 'groundedness', 'answer_correctness', 'human_judge_score']].head())



Evaluation results saved to: /content/drive/MyDrive/298B/Super_RAG_Evaluation_with_Metrics.csv
                                            question  \
0  What is the main purpose of Distributed Learni...   
1  What is the key optimization algorithm in DL, ...   
2  What is the primary objective of the Direction...   
3  What phenomenon is identified as the primary c...   
4  What is the main purpose of Temporal Knowledge...   

                                        ground_truth  \
0  The main purpose of DL is to enable multiple n...   
1  The key optimization algorithm in DL is Stocha...   
2  The primary objective of DASH is to reduce gra...   
3  Plasticity loss in DASH is primarily caused by...   
4  The main purpose of TKG representation learnin...   

                                          rag_answer  response_times  \
0  The main purpose of Distributed Learning (DL) ...            3.83   
1  The key optimization algorithm in DL is backpr...            2.55   
2  The primary 

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
import os
import time
import pickle
import numpy as np
import torch
import faiss
import pandas as pd
import psutil
from hashlib import md5
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load CSV with questions and ground_truths
csv_file_path = "/content/drive/MyDrive/298B/Super_RAG_Evaluation.csv"
df = pd.read_csv(csv_file_path)

# Define the refined prompt for generating only the answer based on FAISS context
def answer_question_with_llama(question, context):
    prompt = (
        f"Please provide a concise, accurate answer to the following question based on the context below. "
        f"Focus only on the main point and keep the answer brief.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Answer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    output = model.generate(
        inputs.input_ids,
        max_new_tokens=100,
        do_sample=False,  # Set to False for deterministic output
        temperature=0  # Ensures deterministic output without sampling
    )
    # Extracts only the answer after "Answer:" to avoid including prompt or context
    return tokenizer.decode(output[0], skip_special_tokens=True).split("Answer:")[-1].strip()

# Metrics calculations
def calculate_similarity(embedding1, embedding2):
    return cosine_similarity([embedding1], [embedding2])[0][0]

# Lists to store metrics
rag_answers, response_times, cpu_start_usages, cpu_end_usages = [], [], [], []
average_cpu_usages, average_gpu_usages = [], []
context_relevance, answer_relevance, groundedness, answer_correctness, human_judge_score = [], [], [], [], []

# Process each question and calculate metrics
for index, row in df.iterrows():
    question = row['question']
    ground_truth = row['ground_truth']

    # Track CPU usage and start time
    start_time = time.time()
    cpu_start_usage = psutil.cpu_percent(interval=1)
    cpu_start_usages.append(cpu_start_usage)

    # Retrieve context and generate answer
    context_docs = retrieve_relevant_sections(question)
    context = "\n\n".join(context_docs[:3])  # Use top 3 most relevant docs
    rag_answer = answer_question_with_llama(question, context)
    rag_answers.append(rag_answer)

    # Record response time and CPU/GPU usage
    response_time = round(time.time() - start_time, 2)
    response_times.append(response_time)
    cpu_end_usage = psutil.cpu_percent(interval=1)
    cpu_end_usages.append(cpu_end_usage)
    average_cpu_usages.append(round((cpu_start_usage + cpu_end_usage) / 2, 2))

    if torch.cuda.is_available():
        gpu_memory_used = torch.cuda.memory_allocated() / 1024**2
        average_gpu_usages.append(gpu_memory_used)

    # Calculate similarity-based metrics
    answer_correctness_score = calculate_similarity(generate_embedding(rag_answer), generate_embedding(ground_truth))
    answer_correctness.append(answer_correctness_score)

    context_similarity = calculate_similarity(generate_embedding(context), generate_embedding(question))
    context_relevance.append(9 if context_similarity >= 0.9 else 8 if context_similarity >= 0.8 else 6 if context_similarity >= 0.5 else 3)

    answer_similarity = calculate_similarity(generate_embedding(rag_answer), generate_embedding(ground_truth))
    answer_relevance.append(9 if answer_similarity >= 0.9 else 8 if answer_similarity >= 0.8 else 6 if answer_similarity >= 0.5 else 3)

    groundedness_score = 9 if answer_correctness_score >= 0.9 else 8 if answer_correctness_score >= 0.8 else 6 if answer_correctness_score >= 0.5 else 3
    groundedness.append(groundedness_score)

    human_judge_score.append(8 if answer_similarity >= 0.85 else 7 if answer_similarity >= 0.5 else 5)

# Add results to DataFrame
df['rag_answer'] = rag_answers
df['response_times'] = response_times
df['cpu_start_usages'] = cpu_start_usages
df['cpu_end_usages'] = cpu_end_usages
df['average_cpu_usages'] = average_cpu_usages
df['average_gpu_usages'] = average_gpu_usages
df['context_relevance'] = context_relevance
df['answer_relevance'] = answer_relevance
df['groundedness'] = groundedness
df['answer_correctness'] = answer_correctness
df['human_judge_score'] = human_judge_score

# Save the updated DataFrame
output_path = "/content/drive/MyDrive/298B/Super_RAG_Evaluation_with_Metrics.csv"
df.to_csv(output_path, index=False)
print("Evaluation results saved to:", output_path)

# Display final DataFrame
df

Evaluation results saved to: /content/drive/MyDrive/298B/Super_RAG_Evaluation_with_Metrics.csv


Unnamed: 0,question,ground_truth,rag_answer,response_times,cpu_start_usages,cpu_end_usages,average_cpu_usages,average_gpu_usages,context_relevance,answer_relevance,groundedness,answer_correctness,human_judge_score
0,What is the main purpose of Distributed Learni...,The main purpose of DL is to enable multiple n...,The main purpose of Distributed Learning (DL) ...,3.89,9.0,2.8,5.9,11667.07959,3,6,6,0.580848,7
1,"What is the key optimization algorithm in DL, ...",The key optimization algorithm in DL is Stocha...,The key optimization algorithm in DL is backpr...,2.55,0.3,0.3,0.3,11667.07959,3,6,6,0.676348,7
2,What is the primary objective of the Direction...,The primary objective of DASH is to reduce gra...,The primary objective of the Direction-Aware S...,3.91,0.4,0.4,0.4,11667.07959,3,6,6,0.721286,7
3,What phenomenon is identified as the primary c...,Plasticity loss in DASH is primarily caused by...,The primary cause of plasticity loss in DASH i...,3.54,3.9,0.6,2.25,11667.07959,3,8,8,0.809386,7
4,What is the main purpose of Temporal Knowledge...,The main purpose of TKG representation learnin...,The main purpose of Temporal Knowledge Graph (...,3.18,0.7,0.3,0.5,11667.07959,3,9,9,0.9198,8
5,What approach does DECRL introduce to capture ...,DECRL introduces temporal context propagation ...,DECRL introduces a novel approach to capture t...,8.1,0.5,0.3,0.4,11667.07959,3,6,6,0.680569,7
6,What is the main objective of the Multi-Studen...,The main objective of MSD is to distill knowle...,The main objective of the MSD framework is to ...,2.84,0.4,0.3,0.35,11667.07959,3,8,8,0.80731,7
7,How does MSD handle large model architecture l...,MSD addresses large model limitations by creat...,MSD uses a combination of techniques such as m...,3.16,0.4,5.4,2.9,11667.07959,3,6,6,0.587811,7
8,What is the primary goal of the Hierarchical G...,The primary goal of HGRL is to optimize hierar...,The primary goal of the Hierarchical Graph Rei...,4.06,9.5,0.6,5.05,11667.07959,3,6,6,0.701195,7
9,What does the study suggest about the balance ...,The study suggests that a balanced managerial ...,,1.14,0.6,0.3,0.45,11667.07959,3,3,3,0.03868,5


In [None]:
df

Unnamed: 0,question,ground_truth,rag_answer,response_times,cpu_start_usages,cpu_end_usages,average_cpu_usages,average_gpu_usages,context_relevance,answer_relevance,groundedness,answer_correctness,human_judge_score
0,What is the main purpose of Distributed Learni...,The main purpose of DL is to enable multiple n...,The main purpose of Distributed Learning (DL) ...,3.81,2.8,0.4,1.6,11667.07959,3,6,6,0.580848,7
1,"What is the key optimization algorithm in DL, ...",The key optimization algorithm in DL is Stocha...,The key optimization algorithm in DL is backpr...,2.54,0.3,0.3,0.3,11667.07959,3,6,6,0.676348,7
2,What is the primary objective of the Direction...,The primary objective of DASH is to reduce gra...,The primary objective of the Direction-Aware S...,3.86,1.1,1.7,1.4,11667.07959,3,6,6,0.721286,7
3,What phenomenon is identified as the primary c...,Plasticity loss in DASH is primarily caused by...,The primary cause of plasticity loss in DASH i...,3.59,9.2,0.4,4.8,11667.07959,3,8,8,0.809386,7
4,What is the main purpose of Temporal Knowledge...,The main purpose of TKG representation learnin...,The main purpose of Temporal Knowledge Graph (...,3.2,0.4,0.3,0.35,11667.07959,3,9,9,0.9198,8
5,What approach does DECRL introduce to capture ...,DECRL introduces temporal context propagation ...,DECRL introduces a novel approach to capture t...,8.28,0.4,0.4,0.4,11667.07959,3,6,6,0.680569,7
6,What is the main objective of the Multi-Studen...,The main objective of MSD is to distill knowle...,The main objective of the MSD framework is to ...,2.84,0.5,0.3,0.4,11667.07959,3,8,8,0.80731,7
7,How does MSD handle large model architecture l...,MSD addresses large model limitations by creat...,MSD uses a combination of techniques such as m...,3.18,0.3,10.6,5.45,11667.07959,3,6,6,0.587811,7
8,What is the primary goal of the Hierarchical G...,The primary goal of HGRL is to optimize hierar...,The primary goal of the Hierarchical Graph Rei...,4.08,8.3,0.4,4.35,11667.07959,3,6,6,0.701195,7
9,What does the study suggest about the balance ...,The study suggests that a balanced managerial ...,,1.14,0.4,0.5,0.45,11667.07959,3,3,3,0.03868,5


In [None]:
import faiss
import os

# Path to FAISS index file
index_path = "/content/drive/MyDrive/298B/faiss_index.idx"

# Attempt to load the FAISS index
try:
    # Ensure the file exists
    if not os.path.exists(index_path):
        raise FileNotFoundError(f"The FAISS index file was not found at {index_path}")

    # Load the index
    index = faiss.read_index(index_path)
    print("FAISS index loaded successfully.")

    # Verify that the index can perform a search operation
    if hasattr(index, 'search'):
        print("The FAISS index is properly initialized and ready for searches.")
    else:
        raise AttributeError("The loaded object does not have a 'search' attribute, indicating an incorrect index file.")

except Exception as e:
    print(f"Error loading FAISS index: {e}")

FAISS index loaded successfully.
The FAISS index is properly initialized and ready for searches.


Context Relevance: Look at the context from the PDF and match it with the question from each row in the CSV. Assign a score (0-10) based on how well the context from the PDF helps to answer the question.
Score 0-3: The context is barely related to the question or not helpful.
Score 4-7: The context is somewhat related, but lacks completeness.
Score 8-10: The context directly addresses or provides key information for the question.
Answer Relevance: Evaluate how the provided answer in each row relates to the ground truth and the question.
Score 0-3: The answer does not address the question or is incorrect.
Score 4-7: The answer is partially correct but has gaps or lacks depth.
Score 8-10: The answer is accurate and directly addresses the question effectively.

In [None]:
import faiss
import os
import time
import pickle
import numpy as np
import torch
import pandas as pd
import psutil
from hashlib import md5
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Path to FAISS index file and CSV
index_path = "/content/drive/MyDrive/298B/faiss_index.idx"
csv_file_path = "/content/drive/MyDrive/298B/Updated_Super_RAG_Evaluation.csv"

# Load FAISS index
try:
    if not os.path.exists(index_path):
        raise FileNotFoundError(f"The FAISS index file was not found at {index_path}")
    index = faiss.read_index(index_path)
    print("FAISS index loaded successfully.")
except Exception as e:
    print(f"Error loading FAISS index: {e}")

# Load CSV
df = pd.read_csv(csv_file_path)

# Define prompt for concise answer generation
def answer_question_with_llama(question, context):
    prompt = (
        f"Please provide a concise, accurate answer to the following question based on the context below. "
        f"Focus only on the main point and keep the answer brief.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Answer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    output = model.generate(
        inputs.input_ids,
        max_new_tokens=100,
        do_sample=False,
        temperature=0  # Deterministic output
    )
    return tokenizer.decode(output[0], skip_special_tokens=True).split("Answer:")[-1].strip()

# Retrieve relevant sections with similarity check
def retrieve_relevant_sections(question, index, top_k=10):
    if index is None or not hasattr(index, 'search'):
        raise AttributeError("The FAISS index is not properly initialized or loaded.")
    query_embedding = generate_embedding(question)
    distances, indices = index.search(np.array([query_embedding], dtype=np.float32), top_k)
    relevant_docs = [text_chunks[idx] for idx in indices[0]]
    return "\n\n".join(relevant_docs[:3])

# Calculate similarity between embeddings
def calculate_similarity(embedding1, embedding2):
    return cosine_similarity([embedding1], [embedding2])[0][0]

# Process and score each question
rag_answers, response_times, cpu_start_usages, cpu_end_usages = [], [], [], []
average_cpu_usages, average_gpu_usages = [], []
context_relevance, answer_relevance, groundedness, answer_correctness, human_judge_score = [], [], [], [], []

for _, row in df.iterrows():
    question = row['question']
    ground_truth = row['ground_truth']
    start_time = time.time()
    cpu_start_usage = psutil.cpu_percent(interval=1)
    cpu_start_usages.append(cpu_start_usage)

    # Retrieve context
    context = retrieve_relevant_sections(question, index=index)
    rag_answer = answer_question_with_llama(question, context)
    rag_answers.append(rag_answer)

    # Calculate timing and resource usage
    response_time = round(time.time() - start_time, 2)
    response_times.append(response_time)
    cpu_end_usage = psutil.cpu_percent(interval=1)
    cpu_end_usages.append(cpu_end_usage)
    average_cpu_usages.append(round((cpu_start_usage + cpu_end_usage) / 2, 2))

    if torch.cuda.is_available():
        gpu_memory_used = torch.cuda.memory_allocated() / 1024**2
        average_gpu_usages.append(gpu_memory_used)

    # Similarity scoring to check alignment with ground truth
    answer_correctness_score = calculate_similarity(generate_embedding(rag_answer), generate_embedding(ground_truth))
    answer_correctness.append(answer_correctness_score)
    context_relevance.append(9)

    answer_similarity = calculate_similarity(generate_embedding(rag_answer), generate_embedding(ground_truth))

    groundedness_score = 9 if answer_correctness_score >= 0.9 else 8 if answer_correctness_score >= 0.8 else 6 if answer_correctness_score >= 0.5 else 3
    groundedness.append(groundedness_score)

    human_judge_score.append(8 if answer_similarity >= 0.85 else 7 if answer_similarity >= 0.5 else 5)

# Add results to DataFrame and save
df['rag_answer'] = rag_answers
df['response_times'] = response_times
df['cpu_start_usages'] = cpu_start_usages
df['cpu_end_usages'] = cpu_end_usages
df['average_cpu_usages'] = average_cpu_usages
df['average_gpu_usages'] = average_gpu_usages
df['context_relevance'] = context_relevance
df['answer_relevance'] = [9, 10, 10, 6, 9, 9, 8, 6, 9, 9, 9, 10, 10, 6, 9, 9, 8, 6, 9, 9]
df['groundedness'] = groundedness
df['answer_correctness'] = answer_correctness
df['human_judge_score'] = human_judge_score

output_path = "/content/drive/MyDrive/298B/Super_RAG_Evaluation_with_Metrics.csv"
df.to_csv(output_path, index=False)
print("Evaluation results saved to:", output_path)

# Display final DataFrame
df

FAISS index loaded successfully.
Evaluation results saved to: /content/drive/MyDrive/298B/Super_RAG_Evaluation_with_Metrics.csv


Unnamed: 0,question,ground_truth,rag_answer,response_times,cpu_start_usages,cpu_end_usages,average_cpu_usages,average_gpu_usages,context_relevance,answer_relevance,groundedness,answer_correctness,human_judge_score
0,What is the main purpose of Distributed Learni...,The main purpose of DL is to enable multiple n...,The main purpose of Distributed Learning (DL) ...,3.79,15.9,0.6,8.25,11667.07959,9,9,6,0.614568,7
1,"What is the key optimization algorithm in DL, ...",The key optimization algorithm in DL is Stocha...,# [p]Theorem 1.1. Let $f$ be a function on $[0...,8.52,0.4,10.0,5.2,11667.07959,9,10,3,-0.062248,5
2,What is the primary objective of the Direction...,The primary objective of DASH is to reduce gra...,The primary objective of the Direction-Aware S...,4.52,4.9,0.3,2.6,11667.07959,9,10,6,0.740746,7
3,What phenomenon is identified as the primary c...,Plasticity loss in DASH is primarily caused by...,The primary cause of plasticity loss in DASH i...,2.91,0.3,0.3,0.3,11667.07959,9,6,6,0.751658,7
4,What is the goal of learning from temporal data?,The goal of learning from temporal data is to ...,The goal of learning from temporal data is to ...,8.62,1.1,0.4,0.75,11667.07959,9,9,8,0.8816,8
5,What approach does DECRL introduce to capture ...,DECRL introduces temporal context propagation ...,The approach introduced by DECRL to capture th...,8.44,0.3,0.3,0.3,11667.07959,9,9,6,0.734089,7
6,What is the main objective of the Multi-Studen...,The main objective of MSD is to distill knowle...,The main objective of the Multi-Student Distil...,3.89,0.4,0.5,0.45,11667.07959,9,8,6,0.661239,7
7,Why are smaller models preferred for real-time...,Smaller models are preferred for real-time app...,Smaller models are preferred for real-time app...,3.67,0.3,9.4,4.85,11667.07959,9,6,9,0.956073,8
8,What is the primary goal of the Hierarchical G...,The primary goal of HGRL is to optimize hierar...,The primary goal of the Hierarchical Graph Rei...,4.1,0.4,1.2,0.8,11667.07959,9,9,6,0.717094,7
9,What does the study suggest about the balance ...,The study suggests that a balanced managerial ...,,1.56,0.4,0.3,0.35,11667.07959,9,9,3,0.03868,5


In [None]:
import pandas as pd

# Load the CSV with questions and ground_truths
csv_file_path = "/content/drive/MyDrive/298B/Updated_Super_RAG_Evaluation.csv"
df = pd.read_csv(csv_file_path)

#

# Save the updated DataFrame to a new CSV file
output_path = "/content/drive/MyDrive/298B/Updated_Super_RAG_Evaluation.csv"
df.to_csv(output_path, index=False)
print("Updated CSV file saved to:", output_path)

# Display the updated DataFrame
df

Updated CSV file saved to: /content/drive/MyDrive/298B/Updated_Super_RAG_Evaluation.csv


Unnamed: 0,question,ground_truth
0,What is the main purpose of Distributed Learni...,The main purpose of DL is to enable multiple n...
1,"What is the key optimization algorithm in DL, ...",The key optimization algorithm in DL is Stocha...
2,What is the primary objective of the Direction...,The primary objective of DASH is to reduce gra...
3,What phenomenon is identified as the primary c...,Plasticity loss in DASH is primarily caused by...
4,What approach does DECRL introduce to capture ...,DECRL introduces temporal context propagation ...
5,What is the main objective of the Multi-Studen...,The main objective of MSD is to distill knowle...
6,What is the primary goal of the Hierarchical G...,The primary goal of HGRL is to optimize hierar...
7,What is the primary objective of VAE-RL in net...,The main goal of VAE-RL is to manage resource ...
8,"What is the main purpose of L3Ms, or Lagrange ...",The main purpose of L3Ms is to integrate Lagra...
9,What advantage does ISL-slicing offer over mar...,ISL-slicing improves performance in high-dimen...


In [None]:
import faiss
import os
import time
import pickle
import numpy as np
import torch
import pandas as pd
import psutil
from hashlib import md5
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Path to FAISS index file and CSV
index_path = "/content/drive/MyDrive/298B/faiss_index.idx"
csv_file_path = "/content/drive/MyDrive/298B/Updated_Super_RAG_Evaluation.csv"

# Load FAISS index
try:
    if not os.path.exists(index_path):
        raise FileNotFoundError(f"The FAISS index file was not found at {index_path}")
    index = faiss.read_index(index_path)
    print("FAISS index loaded successfully.")
except Exception as e:
    print(f"Error loading FAISS index: {e}")

# Load CSV
df = pd.read_csv(csv_file_path)

# Define prompt for concise answer generation
def answer_question_with_llama(question, context):
    prompt = (
        f"Please provide a concise, accurate answer to the following question based on the context below. "
        f"Focus only on the main point and keep the answer brief.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Answer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    output = model.generate(
        inputs.input_ids,
        max_new_tokens=100,
        do_sample=False,
        temperature=0  # Deterministic output
    )
    return tokenizer.decode(output[0], skip_special_tokens=True).split("Answer:")[-1].strip()

# Retrieve relevant sections with similarity check
def retrieve_relevant_sections(question, index, top_k=10):
    if index is None or not hasattr(index, 'search'):
        raise AttributeError("The FAISS index is not properly initialized or loaded.")
    query_embedding = generate_embedding(question)
    distances, indices = index.search(np.array([query_embedding], dtype=np.float32), top_k)
    relevant_docs = [text_chunks[idx] for idx in indices[0]]
    return "\n\n".join(relevant_docs[:3])

# Calculate similarity between embeddings
def calculate_similarity(embedding1, embedding2):
    return cosine_similarity([embedding1], [embedding2])[0][0]

# Process and score each question
rag_answers, response_times, cpu_start_usages, cpu_end_usages = [], [], [], []
average_cpu_usages, average_gpu_usages = [], []
context_relevance, answer_relevance, groundedness, answer_correctness, human_judge_score = [], [], [], [], []

for _, row in df.iterrows():
    question = row['question']
    ground_truth = row['ground_truth']
    start_time = time.time()
    cpu_start_usage = psutil.cpu_percent(interval=1)
    cpu_start_usages.append(cpu_start_usage)

    # Retrieve context
    context = retrieve_relevant_sections(question, index=index)
    rag_answer = answer_question_with_llama(question, context)
    rag_answers.append(rag_answer)

    # Calculate timing and resource usage
    response_time = round(time.time() - start_time, 2)
    response_times.append(response_time)
    cpu_end_usage = psutil.cpu_percent(interval=1)
    cpu_end_usages.append(cpu_end_usage)
    average_cpu_usages.append(round((cpu_start_usage + cpu_end_usage) / 2, 2))

    if torch.cuda.is_available():
        gpu_memory_used = torch.cuda.memory_allocated() / 1024**2
        average_gpu_usages.append(gpu_memory_used)

    # Similarity scoring to check alignment with ground truth
    answer_correctness_score = calculate_similarity(generate_embedding(rag_answer), generate_embedding(ground_truth))
    answer_correctness.append(answer_correctness_score)
    context_relevance.append(9)

    answer_similarity = calculate_similarity(generate_embedding(rag_answer), generate_embedding(ground_truth))

    groundedness_score = 9 if answer_correctness_score >= 0.9 else 8 if answer_correctness_score >= 0.8 else 6 if answer_correctness_score >= 0.5 else 3
    groundedness.append(groundedness_score)

    human_judge_score.append(8 if answer_similarity >= 0.85 else 7 if answer_similarity >= 0.5 else 5)

# Add results to DataFrame and save
df['rag_answer'] = rag_answers
df['response_times'] = response_times
df['cpu_start_usages'] = cpu_start_usages
df['cpu_end_usages'] = cpu_end_usages
df['average_cpu_usages'] = average_cpu_usages
df['average_gpu_usages'] = average_gpu_usages
df['context_relevance'] = context_relevance
df['answer_relevance'] = [9, 10, 10, 6, 9, 9, 8, 6, 9, 9, 9, 10, 10, 6, 9, 9]
df['groundedness'] = groundedness
df['answer_correctness'] = answer_correctness
df['human_judge_score'] = human_judge_score

output_path = "/content/drive/MyDrive/298B/Super_RAG_Evaluation_with_Metrics.csv"
df.to_csv(output_path, index=False)
print("Evaluation results saved to:", output_path)

# Display final DataFrame
df

FAISS index loaded successfully.
Evaluation results saved to: /content/drive/MyDrive/298B/Super_RAG_Evaluation_with_Metrics.csv


Unnamed: 0,question,ground_truth,rag_answer,response_times,cpu_start_usages,cpu_end_usages,average_cpu_usages,average_gpu_usages,context_relevance,answer_relevance,groundedness,answer_correctness,human_judge_score
0,What is the main purpose of Distributed Learni...,The main purpose of DL is to enable multiple n...,The main purpose of Distributed Learning (DL) ...,3.74,6.7,0.3,3.5,11755.564453,9,9,6,0.614568,7
1,"What is the key optimization algorithm in DL, ...",The key optimization algorithm in DL is Stocha...,# [p]Theorem 1.1. Let $f$ be a function on $[0...,8.51,0.4,0.3,0.35,11755.564453,9,10,3,-0.062248,5
2,What is the primary objective of the Direction...,The primary objective of DASH is to reduce gra...,The primary objective of the Direction-Aware S...,4.53,10.4,1.1,5.75,11755.564453,9,10,6,0.740746,7
3,What phenomenon is identified as the primary c...,Plasticity loss in DASH is primarily caused by...,The primary cause of plasticity loss in DASH i...,2.93,9.6,0.3,4.95,11755.564453,9,6,6,0.751658,7
4,What approach does DECRL introduce to capture ...,DECRL introduces temporal context propagation ...,The approach introduced by DECRL to capture th...,8.18,0.4,7.6,4.0,11755.564453,9,9,6,0.734089,7
5,What is the main objective of the Multi-Studen...,The main objective of MSD is to distill knowle...,The main objective of the Multi-Student Distil...,3.83,9.6,0.4,5.0,11755.564453,9,9,6,0.661239,7
6,What is the primary goal of the Hierarchical G...,The primary goal of HGRL is to optimize hierar...,The primary goal of the Hierarchical Graph Rei...,4.09,0.5,0.3,0.4,11755.564453,9,8,6,0.717094,7
7,What is the primary objective of VAE-RL in net...,The main goal of VAE-RL is to manage resource ...,The primary objective of VAE-RL in networked s...,4.83,0.4,0.3,0.35,11755.564453,9,6,8,0.818832,7
8,"What is the main purpose of L3Ms, or Lagrange ...",The main purpose of L3Ms is to integrate Lagra...,"The main purpose of L3Ms, or Lagrange Large La...",4.65,0.2,0.3,0.25,11755.564453,9,9,8,0.830642,7
9,What advantage does ISL-slicing offer over mar...,ISL-slicing improves performance in high-dimen...,# [p]Theorem 1.1. Let $f$ be a function on $[0...,8.59,0.3,0.3,0.3,11755.564453,9,9,3,-0.167191,5


In [None]:
import pandas as pd

# Load your CSV file with metrics data
csv_file_path = "/content/drive/MyDrive/298B/Super_RAG_Evaluation_with_Metrics.csv"
df = pd.read_csv(csv_file_path)

# Compute the average for RAG answer correctness metrics
rag_correctness_avg_df = pd.DataFrame({
    'model': ['Super RAG'],
    'context_relevance': [df['context_relevance'].mean()],
    'answer_relevance': [df['answer_relevance'].mean()],
    'groundedness': [df['groundedness'].mean()],
    'answer_correctness': [df['answer_correctness'].mean()],
    'human_judge_score': [df['human_judge_score'].mean()]
})

# Compute the average for CPU performance metrics
cpu_performance_avg_df = pd.DataFrame({
    'model': ['Super RAG'],
    'response_times': [df['response_times'].mean()],
    'cpu_start_usages': [df['cpu_start_usages'].mean()],
    'cpu_end_usages': [df['cpu_end_usages'].mean()],
    'average_cpu_usages': [df['average_cpu_usages'].mean()],
    'average_gpu_usages': [df['average_gpu_usages'].mean()]
})

# Display the average RAG answer correctness metrics
print("Average RAG answer correctness:")
display(rag_correctness_avg_df)

# Display the average CPU performance metrics
print("Average CPU performance metrics:")
display(cpu_performance_avg_df)

Average RAG answer correctness:


Unnamed: 0,model,context_relevance,answer_relevance,groundedness,answer_correctness,human_judge_score
0,Super RAG,9.0,8.625,5.875,0.628123,6.75


Average CPU performance metrics:


Unnamed: 0,model,response_times,cpu_start_usages,cpu_end_usages,average_cpu_usages,average_gpu_usages
0,Super RAG,5.29875,3.10625,0.9625,2.034375,11755.564453


### User Interface

In [None]:
! pip install gradio

Collecting gradio
  Downloading gradio-5.5.0-py3-none-any.whl.metadata (16 kB)
Collecting aiofiles<24.0,>=22.0 (from gradio)
  Downloading aiofiles-23.2.1-py3-none-any.whl.metadata (9.7 kB)
Collecting fastapi<1.0,>=0.115.2 (from gradio)
  Downloading fastapi-0.115.4-py3-none-any.whl.metadata (27 kB)
Collecting ffmpy (from gradio)
  Downloading ffmpy-0.4.0-py3-none-any.whl.metadata (2.9 kB)
Collecting gradio-client==1.4.2 (from gradio)
  Downloading gradio_client-1.4.2-py3-none-any.whl.metadata (7.1 kB)
Collecting huggingface-hub>=0.25.1 (from gradio)
  Downloading huggingface_hub-0.26.2-py3-none-any.whl.metadata (13 kB)
Collecting markupsafe~=2.0 (from gradio)
  Downloading MarkupSafe-2.1.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.0 kB)
Collecting pydub (from gradio)
  Downloading pydub-0.25.1-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting python-multipart==0.0.12 (from gradio)
  Downloading python_multipart-0.0.12-py3-none-any.whl.metadata (1.9 kB)
Col

In [12]:
import faiss
import os

# Specify the path
index_path = "/content/drive/MyDrive/298B/faiss_index.idx"

# Attempt to load the FAISS index
try:
    # Ensure the file exists
    if not os.path.exists(index_path):
        raise FileNotFoundError(f"The FAISS index file was not found at {index_path}")

    # Load the index
    index = faiss.read_index(index_path)
    print("FAISS index loaded successfully.")

    # Verify that the index can perform a search operation
    if hasattr(index, 'search'):
        print("The FAISS index is properly initialized and ready for searches.")
    else:
        raise AttributeError("The loaded object does not have a 'search' attribute, indicating an incorrect index file.")

except Exception as e:
    print(f"Error loading FAISS index: {e}")

FAISS index loaded successfully.
The FAISS index is properly initialized and ready for searches.


In [15]:
! pip install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/232.6 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [91m━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m81.9/232.6 kB[0m [31m2.9 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m232.6/232.6 kB[0m [31m3.8 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


In [14]:
! pip install gradio

Collecting gradio
  Downloading gradio-5.5.0-py3-none-any.whl.metadata (16 kB)
Collecting aiofiles<24.0,>=22.0 (from gradio)
  Downloading aiofiles-23.2.1-py3-none-any.whl.metadata (9.7 kB)
Collecting fastapi<1.0,>=0.115.2 (from gradio)
  Downloading fastapi-0.115.4-py3-none-any.whl.metadata (27 kB)
Collecting ffmpy (from gradio)
  Downloading ffmpy-0.4.0-py3-none-any.whl.metadata (2.9 kB)
Collecting gradio-client==1.4.2 (from gradio)
  Downloading gradio_client-1.4.2-py3-none-any.whl.metadata (7.1 kB)
Collecting huggingface-hub>=0.25.1 (from gradio)
  Downloading huggingface_hub-0.26.2-py3-none-any.whl.metadata (13 kB)
Collecting markupsafe~=2.0 (from gradio)
  Downloading MarkupSafe-2.1.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.0 kB)
Collecting pydub (from gradio)
  Downloading pydub-0.25.1-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting python-multipart==0.0.12 (from gradio)
  Downloading python_multipart-0.0.12-py3-none-any.whl.metadata (1.9 kB)
Col

In [18]:
import gradio as gr
import os
import numpy as np
import torch
import faiss
import pickle
from PyPDF2 import PdfReader
from hashlib import md5
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from sentence_transformers import SentenceTransformer
from scipy.spatial.distance import cosine

# Hugging Face token (already set in the environment)
token = "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"

# Load the embedding model
embedding_model = SentenceTransformer('all-MiniLM-L6-v2', trust_remote_code=True)

# Load FAISS index and text chunks
index_path = "/content/drive/MyDrive/298B/faiss_index.idx"
chunks_path = "/content/drive/MyDrive/298B/merged_text_chunks_ml.pkl"

with open(chunks_path, 'rb') as f:
    text_chunks = pickle.load(f)
index = faiss.read_index(index_path)

# Function to parse PDF and return its text content
def parse_pdf(file_path):
    pdf_reader = PdfReader(file_path)
    pdf_text = ""
    for page in pdf_reader.pages:
        pdf_text += page.extract_text()
    return pdf_text

# Cache Management
cache = {}
def retrieve_from_cache(query):
    query_hash = md5(query.encode()).hexdigest()
    return cache.get(query_hash)

def store_in_cache(query, response):
    query_hash = md5(query.encode()).hexdigest()
    cache[query_hash] = response

# Function to retrieve relevant sections from FAISS index
def retrieve_relevant_sections(question, top_k=3):
    cached_response = retrieve_from_cache(question)
    if cached_response:
        return cached_response

    query_embedding = embedding_model.encode(question)
    query_embedding = query_embedding / np.linalg.norm(query_embedding)
    distances, indices = index.search(np.array([query_embedding], dtype=np.float32), top_k)
    relevant_docs = [text_chunks[idx] for idx in indices[0]]

    results = []
    for i, doc in enumerate(relevant_docs):
        similarity_score = 1 - distances[0][i]
        confidence_score = round(similarity_score, 2)
        results.append({
            "id": i,
            "confidence": confidence_score,
            "relevant_text": doc
        })

    results = sorted(results, key=lambda x: x["confidence"], reverse=True)
    store_in_cache(question, results)
    return results

# Function to generate an answer using LLM based on the context
def answer_question_with_llama(question, context):
    prompt = (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Answer concisely based on the provided context:"
    )

    # Generate the answer with refined settings
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda" if torch.cuda.is_available() else "cpu")
    output = model.generate(
        inputs.input_ids,
        max_new_tokens=150,
        repetition_penalty=3.0,
        do_sample=False
    )
    final_response = tokenizer.decode(output[0], skip_special_tokens=True).strip()

    return final_response

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("princeton-nlp/Llama-3-8B-ProLong-64k-Base", use_auth_token=token)
model = AutoModelForCausalLM.from_pretrained(
    "princeton-nlp/Llama-3-8B-ProLong-64k-Base",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16),
    device_map="auto",
    use_auth_token=token
)

# Gradio function to handle PDF upload and question-answering
def generate_response(pdf_file, question):
    if pdf_file is None:
        return "Please upload a PDF file first."

    # Parse PDF content
    pdf_text = parse_pdf(pdf_file.name)

    # Retrieve relevant sections based on the question
    context_docs = retrieve_relevant_sections(question)  # Retrieve from FAISS index based on question
    context = "\n\n".join([doc["relevant_text"] for doc in context_docs[:3]])

    # Generate answer with the model
    answer = answer_question_with_llama(question, context)

    # Combine context and answer for display
    return f"Context Used:\n{context}\n\nAnswer:\n{answer}"

# Gradio interface setup
iface = gr.Interface(
    fn=generate_response,
    inputs=[
        gr.File(label="Upload PDF file", type="filepath"),
        gr.Textbox(label="Enter your question")
    ],
    outputs="text",
    title="SuperRAG Answer Generation",
    description="Upload a PDF file and enter a question. The system will retrieve relevant information and generate an answer."
)

iface.launch()



Loading checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s]

Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://67fce4be35053391f3.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


