<a href="https://colab.research.google.com/github/hirdeshkumar2407/NLP_Group_Assigment/blob/main/Training%20models/2_RAG_Retriever.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Imports and loading the dataset:

In [1]:
import pandas as pd
import numpy as np
import os
from sentence_transformers import SentenceTransformer, CrossEncoder, util
import torch
import hnswlib
from transformers import AutoModel

if os.path.isfile("rag_instruct.json"): 
    df = pd.read_json("rag_instruct.json")
else:
    df = pd.read_json("hf://datasets/FreedomIntelligence/RAG-Instruct/rag_instruct.json")

documents = df['documents']

## Models: a bi-encoder for the embeddings and a CrossEncoder for re-ranking

In [None]:
semb_model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
xenc_model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
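# Run the embedding model on the GPU when one is available, otherwise on the CPU
semb_model.to('cuda' if torch.cuda.is_available() else 'cpu')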

## Calculating the embeddings for the corpus:

In [3]:
corpus_embeddings = semb_model.encode(documents, convert_to_tensor=True, show_progress_bar=True)



## Indexing for faster access:

In [4]:
index = hnswlib.Index(space='cosine', dim=corpus_embeddings.size(1))

In [5]:
# Define hnswlib index path
index_path = "./hnswlib.index"

# Load index if available
if os.path.exists(index_path):
    print("Loading index...")
    index.load_index(index_path)
# Otherwise build the index from the corpus embeddings
else:
    # Initialise the index
    print("Start creating HNSWLIB index")
    index.init_index(max_elements=corpus_embeddings.size(0), ef_construction=400, M=64)
    # Add the embeddings to the index (this may take a while); hnswlib expects NumPy-compatible arrays
    index.add_items(corpus_embeddings.cpu().numpy(), list(range(len(corpus_embeddings))))
    # Save the index to a file for future loading
    print("Saving index to:", index_path)
    index.save_index(index_path)

Loading index...
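
To make concrete what the HNSW index approximates, here is a minimal brute-force sketch of the same lookup. It is not part of the notebook: `cosine_similarity`, `brute_force_knn`, and the toy 2-D vectors are hypothetical names and data chosen for illustration. The only library fact relied on is that hnswlib's `space='cosine'` reports distances as 1 minus the cosine similarity.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def brute_force_knn(query, corpus, k=3):
    """Return (ids, distances) of the k corpus vectors closest to the query.

    Distance is 1 - cosine similarity, the convention used for space='cosine'.
    """
    distances = [(1.0 - cosine_similarity(query, vec), idx) for idx, vec in enumerate(corpus)]
    distances.sort()  # smallest distance first
    top = distances[:k]
    return [idx for _, idx in top], [dist for dist, _ in top]

# Toy 2-D "embeddings" (made up for illustration, not taken from the dataset)
toy_corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
ids, dists = brute_force_knn([1.0, 0.1], toy_corpus, k=2)
print(ids, dists)
```

The exhaustive scan compares the query against every corpus vector, so its cost grows linearly with the corpus. The HNSW graph trades a small amount of recall for much faster queries, which is why the notebook builds and caches it.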


In [6]:
# Retrieve the k nearest documents, re-rank them with the cross-encoder,
# and format the selected ones as a context string for the LLM
def get_related_docs(query, k=3):
    query_embedding = semb_model.encode(query, convert_to_tensor=True)
    corpus_ids, _ = index.knn_query(query_embedding.cpu().numpy(), k=k)

    # Score each (query, document) pair; entries of `documents` are converted to strings
    candidates = [str(documents[idx]) for idx in corpus_ids[0]]
    cross_scores = xenc_model.predict([(query, doc) for doc in candidates])

    # Keep the documents with a positive score, best first
    ranked = np.argsort(-cross_scores)
    selected = [candidates[i] for i in ranked if cross_scores[i] > 0]
    if not selected:
        # If no score is positive, fall back to the two highest-scoring documents
        selected = [candidates[i] for i in ranked[:2]]

    # A single document is returned as-is; several documents are numbered
    if len(selected) == 1:
        return selected[0]
    return "".join(f"Document {i+1}:\n\n{doc}\n" for i, doc in enumerate(selected))
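
The selection rule inside `get_related_docs` (keep the documents whose cross-encoder score is positive, otherwise fall back to the two best) can be isolated and checked without loading any model. The sketch below assumes a hypothetical helper `select_docs` and made-up scores; it only mirrors the rule, not the retrieval itself.

```python
def select_docs(docs, scores, fallback=2):
    """Order docs by descending score and keep those scoring above 0.

    If no score is positive, keep the `fallback` highest-scoring docs instead.
    """
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    positive = [docs[i] for i in ranked if scores[i] > 0]
    return positive if positive else [docs[i] for i in ranked[:fallback]]

# Made-up scores for illustration
print(select_docs(["a", "b", "c"], [-1.2, 3.4, 0.5]))    # two positive scores
print(select_docs(["a", "b", "c"], [-1.2, -0.3, -4.0]))  # no positive score
```

With the fallback in place the LLM always receives some context, even when the cross-encoder judges every candidate a poor match.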



In [7]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # or bfloat16 if supported
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

# Load model with quantization
model = AutoModelForCausalLM.from_pretrained(
    "AITeamVN/Vi-Qwen2-3B-RAG",
    quantization_config=quant_config,
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained("AITeamVN/Vi-Qwen2-3B-RAG")

Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.
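
As a rough illustration of why the model is loaded in 4-bit, the sketch below estimates the storage needed for the weights alone. It assumes about 3 billion parameters, a figure read off the "3B" in the model name rather than measured, and `weight_memory_gb` is a hypothetical helper. Real usage will be higher: quantization metadata, any layers left unquantized, activations, and the KV cache are all ignored here.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in GB (10**9 bytes), ignoring quantization metadata."""
    return num_params * bits_per_param / 8 / 1e9

# Assumption: roughly 3 billion parameters, inferred from the "3B" in the model name
assumed_params = 3e9

fp16_gb = weight_memory_gb(assumed_params, 16)
nf4_gb = weight_memory_gb(assumed_params, 4)
print(f"fp16 weights: ~{fp16_gb:.1f} GB, 4-bit weights: ~{nf4_gb:.1f} GB")
```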



In [8]:
query = "Do all plants do photosynthesis?"

context_docs = get_related_docs(query)

# Build the prompt; the query already carries its own punctuation, so none is appended
prompt = f"Given this context:\n{context_docs}\n\nPlease answer the question: {query}\n\nAnswer:\n"

inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"].to(model.device),
        attention_mask=inputs["attention_mask"].to(model.device),
        max_new_tokens=200,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

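# Decode only the newly generated tokens, so the prompt is not echoed
# and an "Answer:" inside the generated text cannot truncate the output
prompt_length = inputs["input_ids"].shape[1]
answer = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
print("\n=== Generated Answer ===\n")
print(answer.strip())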


=== Generated Answer ===

To answer the question "Do all plants do photosynthesis?" based on the provided context, we can analyze the information as follows:

1. **Definition of Photosynthesis**: Photosynthesis is a process used by plants and other organisms to convert light energy into chemical energy that can later be released to fuel the organisms' activities. This process is crucial for the survival and growth of plants.

2. **Function of Chloroplasts**: Chloroplasts are organelles that conduct photosynthesis, where the photosynthetic pigment chlorophyll captures the energy from sunlight, converts it, and stores it in the energy-storage molecules ATP and NADPH. Chloroplasts also use the ATP and NADPH to make organic molecules from carbon dioxide in a process known as the Calvin cycle.

3. **Examples of Photosynthesis in Plants**: The text provides examples of how photosynthesis occurs in various types of plants, including green plants, algae, and some bacteria. It also mentions th
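
The `generate` call above samples with `temperature=0.7` and `top_p=0.9`. The following is a minimal pure-Python sketch of what those two settings do to a next-token distribution. It is an illustration under stated assumptions, not the `transformers` implementation: `top_p_filter`, `sample`, and the toy logits are hypothetical.

```python
import math
import random

def top_p_filter(logits, temperature=0.7, top_p=0.9):
    """Return {token_id: probability} for the nucleus after temperature scaling."""
    # Temperature below 1 sharpens the distribution, above 1 flattens it
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Walk tokens from most to least likely until the cumulative mass reaches top_p
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalise over the kept tokens
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

def sample(logits, rng, **kwargs):
    """Draw one token id from the nucleus."""
    nucleus = top_p_filter(logits, **kwargs)
    tokens = list(nucleus)
    return rng.choices(tokens, weights=[nucleus[t] for t in tokens], k=1)[0]

toy_logits = [2.0, 1.0, 0.5, -3.0]  # made-up scores for a 4-token vocabulary
nucleus = top_p_filter(toy_logits)
print(nucleus)
print(sample(toy_logits, random.Random(0)))
```

Because sampling is enabled, repeated runs of the notebook cell can produce different answers for the same query; setting `do_sample=False` would make generation deterministic.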