<a href="https://colab.research.google.com/github/hirdeshkumar2407/NLP_Group_Assigment/blob/main/Training%20models/2_RAG_Retriever.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Imports and loading the dataset:

In [1]:
import pandas as pd
import numpy as np
import os
from sentence_transformers import SentenceTransformer, CrossEncoder, util
import torch
import hnswlib
from transformers import AutoModel

if os.path.isfile("rag_instruct.json"): 
    df = pd.read_json("rag_instruct.json")
else:
    df = pd.read_json("hf://datasets/FreedomIntelligence/RAG-Instruct/rag_instruct.json")

documents = df['documents']

In [2]:
print(documents[:3])

0    [decided to make the story more straightforwar...
1    [the world with 68.5% of Taiwanese high school...
2    [Sparrho Sparrho combines human and artificial...
Name: documents, dtype: object


## Models for computing the embeddings and for cross-encoder re-ranking

In [2]:
semb_model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
xenc_model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
#semb_model.to('cuda')

## Calculating the embeddings for the corpus:

In [4]:
corpus_embeddings = semb_model.encode(documents, convert_to_tensor=True, show_progress_bar=True)



## Indexing for faster access:

In [5]:
index = hnswlib.Index(space='cosine', dim=corpus_embeddings.size(1))

In [6]:
# Define hnswlib index path
index_path = "./hnswlib.index"

# Load index if available
if os.path.exists(index_path):
    print("Loading index...")
    index.load_index(index_path)
# Otherwise build the index from the corpus embeddings
else:
    # Initialise the index
    print("Start creating HNSWLIB index")
    index.init_index(max_elements=corpus_embeddings.size(0), ef_construction=400, M=64)
    # Add the embeddings to the HNSWLIB index (this may take a while)
    index.add_items(corpus_embeddings.cpu(), list(range(len(corpus_embeddings))))
    # Save the index to a file for future loading
    print("Saving index to:", index_path)
    index.save_index(index_path)

Loading index...


## Testing Cosine Similarity Search
This part is commented out because it is not needed for the final run; it is only for testing. The results it returns are reasonably good.

In [None]:
# query = "How has the female literacy rate in tribal areas changed from 1995 to the present according to the latest statistics?"
# query_embedding = semb_model.encode(query, convert_to_tensor=True)

In [None]:
# corpus_ids, distances = index.knn_query(query_embedding.cpu(), k=3)
# scores = 1 - distances

# print("Cosine similarity model search results")
# print(f"Query: \"{query}\"")
# print("---------------------------------------")
# for idx, score in zip(corpus_ids[0], scores[0]):
#     print(f"Score: {score:.4f}\nDocument: \"{documents[idx]}\"\n\n")
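The `scores = 1 - distances` step works because an index built with `space='cosine'` reports cosine distance, which is one minus the cosine similarity. The following is a minimal pure-Python sketch of that relationship; the toy vectors and helper names are illustrative assumptions and are not part of the notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Distance in the convention of a 'cosine' index: 1 - similarity."""
    return 1.0 - cosine_similarity(a, b)

# Toy data (hypothetical): one query and three documents in 2-D.
query = [1.0, 0.0]
docs = [[2.0, 0.0], [1.0, 1.0], [0.0, 3.0]]

distances = [cosine_distance(query, d) for d in docs]
# Same conversion as in the cell above: turn distances back into similarities.
scores = [1.0 - d for d in distances]

for d, s in zip(distances, scores):
    print(f"distance={d:.4f}  score={s:.4f}")
```

A smaller distance therefore means a higher similarity score, and vector length does not matter, only direction.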

## Re-ranking the results:
We take the top 128 nearest-neighbour results from the index and re-rank them with a cross-encoder.
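The two-stage idea can be sketched without any model, assuming the first-stage similarities and the re-ranking scores are already available. The function name `retrieve_then_rerank`, the toy similarities, and the toy re-ranking scores below are hypothetical stand-ins for the index lookup and the cross-encoder.

```python
def retrieve_then_rerank(first_stage_scores, rerank_score, k=128, top_n=3):
    """Keep the k best candidates by the cheap score, then sort them by the expensive score."""
    # Stage 1: candidate retrieval (stands in for the nearest-neighbour query).
    order = sorted(range(len(first_stage_scores)),
                   key=lambda i: first_stage_scores[i], reverse=True)
    candidates = order[:k]
    # Stage 2: re-ranking (stands in for the cross-encoder), applied to candidates only.
    reranked = sorted(candidates, key=rerank_score, reverse=True)
    return reranked[:top_n]

# Toy data (hypothetical): similarity of four documents to the query ...
similarities = [1.00, 0.99, 0.00, 0.71]
# ... and the score a re-ranker would give each document.
toy_rerank_scores = {0: -1.0, 1: 2.5, 2: 9.0, 3: 0.3}

top_docs = retrieve_then_rerank(similarities, toy_rerank_scores.get, k=3, top_n=2)
print(top_docs)
```

Document 2 has the highest re-ranking score in this toy example, but it never reaches the second stage because the first stage did not retrieve it. This is the reason for choosing a generous `k` such as 128 before re-ranking.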

In [None]:
query = "How has the female literacy rate in tribal areas changed from 1995 to the present according to the latest statistics?"
query_embedding = semb_model.encode(query, convert_to_tensor=True)
corpus_ids, _ = index.knn_query(query_embedding.cpu(), k=128)

model_inputs = [(query, str(documents[idx])) for idx in corpus_ids[0]]
cross_scores = xenc_model.predict(model_inputs)

print("Cross-encoder model re-ranking results")
print(f"Query: \"{query}\"")
print("---------------------------------------")
for idx in np.argsort(-cross_scores)[:3]:
    print(f"Score: {cross_scores[idx]:.4f}\nDocument: \"{documents[corpus_ids[0][idx]]}\"\n\n")

Cross-encoder model re-ranking results
Query: "How has the female literacy rate in tribal areas changed from 1995 to the present according to the latest statistics?"
---------------------------------------
Score: 0.9397
Document: "['the National Sample Survey Data of 1997, only the states of Kerala and Mizoram have approached universal female literacy. According to scholars, the major factor behind improvements in the social and economic status of women in Kerala is literacy. Under the Non-Formal Education programme (NFE), about 40% of the NFE centres in states and 10% of the centres in UTs are exclusively reserved for women. As of 2000, about 300,000 NFE centres were catering to about 7.42 million children. About 120,000 NFE centres were exclusively for girls. According to a 1998 report by the U.S. Department of Commerce, the chief', 'school is 25% (CBS, 2008). This is reflected in the disparity in literacy rates, between women in rural areas, 36.5%, and those in urban areas, 61.5%. L

Another test with a different query:

In [None]:
query = "How does TAFE NSW support Indigenous students in accessing educational opportunities?"
query_embedding = semb_model.encode(query, convert_to_tensor=True)
corpus_ids, _ = index.knn_query(query_embedding.cpu(), k=128)

model_inputs = [(query, str(documents[idx])) for idx in corpus_ids[0]]
cross_scores = xenc_model.predict(model_inputs)

print("Cross-encoder model re-ranking results")
print(f"Query: \"{query}\"")
print("---------------------------------------")
for idx in np.argsort(-cross_scores)[:3]:
    print(f"Score: {cross_scores[idx]:.4f}\nDocument: \"{documents[corpus_ids[0][idx]]}\"\n\n")

Cross-encoder model re-ranking results
Query: "How does TAFE NSW support Indigenous students in accessing educational opportunities?"
---------------------------------------
Score: 7.2661
Document: "['Program aims to: TAFE Outreach TAFE NSW Outreach is an initiative in the Australian tertiary education sector to offer educational opportunities to people who would not otherwise gain access to appropriate courses. Outreach negotiates courses with potential students (hours, attendance, subjects and content, etc.). All Outreach courses are free, as they target disadvantaged groups in the Australian community. They can be held at colleges or off campus in community locations to cater for isolated communities, childcare needs, lack of transport, and other barriers. The TAFE NSW Outreach Program is designed to provide an access point by which adults can re-enter', 'students with strong links to the community through organisations such as Mission Australia and ICAN. The Wirreanda Adaptive Voca

The next cell prints only the results with positive scores. It can be commented out, since it is not needed for the final run.

In [None]:
print("Cross-encoder model re-ranking results")
print(f"Query: \"{query}\"")
print("---------------------------------------")
for idx in np.argsort(-cross_scores)[:3]:
    if cross_scores[idx] > 0:
        print(f"Score: {cross_scores[idx]:.4f}\nDocument: \"{documents[corpus_ids[0][idx]]}\"\n\n")


Cross-encoder model re-ranking results
Query: "How does TAFE NSW support Indigenous students in accessing educational opportunities?"
---------------------------------------
Score: 7.2661
Document: "['Program aims to: TAFE Outreach TAFE NSW Outreach is an initiative in the Australian tertiary education sector to offer educational opportunities to people who would not otherwise gain access to appropriate courses. Outreach negotiates courses with potential students (hours, attendance, subjects and content, etc.). All Outreach courses are free, as they target disadvantaged groups in the Australian community. They can be held at colleges or off campus in community locations to cater for isolated communities, childcare needs, lack of transport, and other barriers. The TAFE NSW Outreach Program is designed to provide an access point by which adults can re-enter', 'students with strong links to the community through organisations such as Mission Australia and ICAN. The Wirreanda Adaptive Voca

## Final output:
If there are results with positive scores, they are added to the `send_to_LLM` variable.
If there are no results with positive scores, the top 2 results with negative scores are added to `send_to_LLM` instead, in the hope of still sending something useful to the LLM.
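The selection rule described above can be condensed into one small helper. This is only a sketch operating on plain Python lists; the name `build_context` and the `fallback_k` parameter are assumptions introduced for illustration.

```python
def build_context(docs, scores, fallback_k=2):
    """Format the positively scored docs, or the top `fallback_k` docs if none is positive."""
    # Indices sorted from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Prefer positive scores; otherwise fall back to the best of the rest.
    selected = [i for i in ranked if scores[i] > 0] or ranked[:fallback_k]
    if len(selected) == 1:
        # A single document is sent without a "Document n:" header.
        return str(docs[selected[0]])
    return "".join(f"Document {n}:\n{docs[i]}\n\n"
                   for n, i in enumerate(selected, start=1))

# Toy usage (hypothetical documents and scores).
print(build_context(["a", "b", "c"], [0.5, -1.0, 2.0]))    # two positive docs
print(build_context(["a", "b", "c"], [-3.0, -1.0, -2.0]))  # fallback to top 2
```

The `or` expression covers both branches: an empty list of positive documents is falsy, so the fallback slice is used in that case.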

In [None]:
send_to_LLM = ""
positive_docs = []
for idx in np.argsort(-cross_scores):
    if cross_scores[idx] > 0:
        positive_docs.append(documents[corpus_ids[0][idx]])

if len(positive_docs) > 1:
    for i, doc in enumerate(positive_docs):
        send_to_LLM += f"Document {i+1}:\n"
        # Convert the list 'doc' to a string before concatenating
        send_to_LLM += str(doc) + "\n\n"
elif len(positive_docs) == 1:
    # Convert the list to a string if there's only one document
    send_to_LLM = str(positive_docs[0])

else:
    # If no positive scores, take the top 2 negative scores
    negative_docs = []
    for idx in np.argsort(-cross_scores)[:2]: # Take the top 2 indices based on sorted scores
        negative_docs.append(documents[corpus_ids[0][idx]])

    if len(negative_docs) > 1:
        for i, doc in enumerate(negative_docs):
            send_to_LLM += f"Document {i+1}:\n"
            send_to_LLM += str(doc) + "\n\n"
    elif len(negative_docs) == 1:
        send_to_LLM = str(negative_docs[0])


print(send_to_LLM)

Document 1:
['Program aims to: TAFE Outreach TAFE NSW Outreach is an initiative in the Australian tertiary education sector to offer educational opportunities to people who would not otherwise gain access to appropriate courses. Outreach negotiates courses with potential students (hours, attendance, subjects and content, etc.). All Outreach courses are free, as they target disadvantaged groups in the Australian community. They can be held at colleges or off campus in community locations to cater for isolated communities, childcare needs, lack of transport, and other barriers. The TAFE NSW Outreach Program is designed to provide an access point by which adults can re-enter', 'students with strong links to the community through organisations such as Mission Australia and ICAN. The Wirreanda Adaptive Vocational Education (WAVE) program assists students that wish to transition from school to employment. Wirreanda has developed a "Middle School" approach to students in Years 8 and 9. Studen

In [7]:
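# Retrieve the k nearest documents for a query, re-rank them with the cross-encoder,
# and format the selected ones as context for the LLM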
def get_related_docs(query, k=3):
    query_embedding = semb_model.encode(query, convert_to_tensor=True)
    corpus_ids, _ = index.knn_query(query_embedding.cpu(), k=k)

    model_inputs = [(query, str(documents[idx])) for idx in corpus_ids[0]]
    cross_scores = xenc_model.predict(model_inputs)
    send_to_LLM = ""
    positive_docs = [documents[corpus_ids[0][idx]] for idx in np.argsort(-cross_scores) if cross_scores[idx] > 0]

    if len(positive_docs) > 1:
        for i, doc in enumerate(positive_docs):
            send_to_LLM += f"Document {i+1}:\n"
            # Convert the list 'doc' to a string before concatenating
            send_to_LLM += str(doc) + "\n\n"
    elif len(positive_docs) == 1:
        # Convert the list to a string if there's only one document
        send_to_LLM = str(positive_docs[0])

    else:
        # If no positive scores, take the top 2 negative scores
        negative_docs = []
        for idx in np.argsort(-cross_scores)[:2]: # Take the top 2 indices based on sorted scores
            negative_docs.append(documents[corpus_ids[0][idx]])

        if len(negative_docs) > 1:
            for i, doc in enumerate(negative_docs):
                send_to_LLM += f"Document {i+1}:\n"
                send_to_LLM += str(doc) + "\n\n"
        elif len(negative_docs) == 1:
            send_to_LLM = str(negative_docs[0])

    return send_to_LLM



In [8]:
# using pre-trained and fine-tuned model
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("AITeamVN/Vi-Qwen2-3B-RAG")
model = AutoModelForCausalLM.from_pretrained("AITeamVN/Vi-Qwen2-3B-RAG")



Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.



In [9]:
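# model.to('cuda')  # optional: move the model to the GPU
# (the tokenizer has no .to(); the tokenized inputs are moved to model.device below)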

query = "Do all plants do photosynthesis?"

context_docs = get_related_docs(query)

prompt = f"Given this context: \n{context_docs} \n\nPlease answer the question: {query}.\n\nAnswer:\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

# Decode and print result
answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("\n=== Generated Answer ===\n")
print(answer.split("Answer:")[-1].strip())  # Keep only the text after the last "Answer:" marker


=== Generated Answer ===

All plants do photosynthesis. This is based on the information provided in Document 1, which states that "Green plants obtain most of their energy from sunlight via photosynthesis by primary chloroplasts that are derived from endosymbiosis with cyanobacteria. Their chloroplasts contain chlorophylls a and b, which gives them their green color." Additionally, Document 2 further reinforces this point by noting that "Chloroplasts and cyanobacteria contain the blue-green pigment chlorophyll 'a'." These statements clearly indicate that photosynthesis is a fundamental process for plants, as it is necessary for their energy acquisition and survival. 

However, it is worth noting that some plants may have variations or adaptations in their photosynthetic processes, such as the CAM (Crassulacean acid metabolism) and C4 pathways, which help them optimize their photosynthesis under specific environmental conditions. But fundamentally, all plants engage in the process of 
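
Splitting the decoded text on `"Answer:"` drops part of the output if the model itself generates that marker again. A possible alternative, sketched here with plain lists of token ids, is to cut off the prompt tokens before decoding, since a causal language model's `generate` output starts with the prompt. The helper name `strip_prompt_tokens` and the toy ids are hypothetical.

```python
def strip_prompt_tokens(output_ids, prompt_length):
    """Return only the token ids that were generated after the prompt."""
    return output_ids[prompt_length:]

# Toy ids (hypothetical): the generated sequence begins with the prompt ids.
prompt_ids = [101, 7, 8, 9]
output_ids = prompt_ids + [42, 43, 2]

new_ids = strip_prompt_tokens(output_ids, len(prompt_ids))
print(new_ids)

# With the objects from the cell above this would look roughly like the
# following (an untested assumption about how it fits this notebook):
#   prompt_length = inputs["input_ids"].shape[1]
#   answer = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
```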