Task 1: Generating Embeddings for Documents: Generate embeddings for a set of documents using a pre-trained language model such as Flan-T5.

In [21]:
from langchain.document_loaders import HuggingFaceDatasetLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, GenerationConfig, pipeline
from transformers import T5Tokenizer, T5ForConditionalGeneration
from langchain import HuggingFacePipeline
from langchain.chains import RetrievalQA
import torch

In [22]:
# Specify the dataset name and the column containing the content
dataset_name = "databricks/databricks-dolly-15k"
page_content_column = "context"  # or any other column you're interested in

# Create a loader instance
loader = HuggingFaceDatasetLoader(dataset_name, page_content_column)

# Load the data
data = loader.load()



In [23]:
# Drop entries whose page content is empty.
# The loaded content is wrapped in double quotes, so an empty context appears as '""'.
filtered_data = [entry for entry in data if entry.page_content != '""']
# Further filter to keep only entries with 'category' equal to 'closed_qa'
filtered_data = [entry for entry in filtered_data if entry.metadata['category'] == 'closed_qa']

# Keep the first 100 entries
data = filtered_data[:100]

for entry in data:
    print(entry)

page_content='"Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia\'s domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney."' metadata={'instruction': 'When did Virgin Australia start operating?', 'response': 'Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.', 'category': 'closed_qa'}
page_content='"Komorida was born in Kumamoto Prefecture on July 10, 1981. After graduating from high school, he joined the J1 League club Avispa Fukuoka in 2000. Although he debuted as a midfielder in 2001, he did not play much and the club was relegate

In [24]:
#Create an instance of the RecursiveCharacterTextSplitter class with specific parameters.
#It splits text into chunks of at most 10,000 characters each with a 400-character overlap.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000, chunk_overlap=400)

#'data' holds the text you want to split, split the text into documents using the text splitter.
docs = text_splitter.split_documents(data)
docs[0]
print(len(docs))
for entry in docs:
    print(entry)


100
page_content='"Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia\'s domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney."' metadata={'instruction': 'When did Virgin Australia start operating?', 'response': 'Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.', 'category': 'closed_qa'}
page_content='"Komorida was born in Kumamoto Prefecture on July 10, 1981. After graduating from high school, he joined the J1 League club Avispa Fukuoka in 2000. Although he debuted as a midfielder in 2001, he did not play much and the club was rele
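
To illustrate what `chunk_size` and `chunk_overlap` control, here is a minimal sliding-window sketch. The `chunk_text` helper is hypothetical and much simpler than `RecursiveCharacterTextSplitter`, which prefers to split on separators such as paragraph and line breaks. It also suggests why the 100 filtered entries yield exactly 100 documents: a text shorter than the chunk size comes back as a single chunk.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


long_text = "x" * 2500
print([len(c) for c in chunk_text(long_text, chunk_size=1000, chunk_overlap=150)])
# [1000, 1000, 800]

short_text = "x" * 500
print(len(chunk_text(short_text, chunk_size=10000, chunk_overlap=400)))
# 1
```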

In [25]:

#Define the path to the pre-trained model you want to use
modelPath = "sentence-transformers/all-MiniLM-l6-v2"

#Create a dictionary with model configuration options, specifying to use the CPU for computations
model_kwargs = {'device':'cpu'}

#Create a dictionary with encoding options, specifically setting 'normalize_embeddings' to False
encode_kwargs = {'normalize_embeddings': False}

#Initialize an instance of HuggingFaceEmbeddings with the specified parameters
embeddings = HuggingFaceEmbeddings(
    model_name=modelPath,     # Provide the pre-trained model's path
    model_kwargs=model_kwargs, # Pass the model configuration options
    encode_kwargs=encode_kwargs # Pass the encoding options
)

In [26]:
# Test embeddings 
text = "This is a test document."
query_result = embeddings.embed_query(text)
query_result[:3]

[-0.03833850845694542, 0.1234646588563919, -0.02864295430481434]
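
The `normalize_embeddings` option above decides whether each vector is scaled to unit length. As a rough illustration of what that scaling means, the sketch below L2-normalizes two hand-made vectors and shows that, once normalized, their dot product equals their cosine similarity. The helper names and the toy vectors are assumptions for illustration only.

```python
import math


def l2_normalize(vec):
    """Scale a vector so that its Euclidean length is 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


a = [3.0, 4.0]
b = [4.0, 3.0]

# Cosine similarity computed directly from the raw vectors
cosine = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Dot product of the normalized vectors gives the same value (approximately 0.96)
normalized_dot = dot(l2_normalize(a), l2_normalize(b))

print(cosine, normalized_dot)
```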

Task 2: Storing Embeddings in a Database (30 points): In this component, you will design and implement a vector database that efficiently stores the generated embeddings together with their corresponding texts.

In [27]:
# Vector database
db = FAISS.from_documents(docs, embeddings)
retriever = db.as_retriever()
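
Conceptually, a similarity search over this store embeds the query and returns the stored texts whose vectors lie closest to it. The brute-force sketch below shows that idea on hand-made 3-dimensional vectors. The `top_k` helper, the toy vectors, and the choice of Euclidean distance are illustrative assumptions, not a description of FAISS internals.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, index, k=1):
    """Return the texts of the k entries closest to query_vec.

    index is a list of (text, vector) pairs.
    """
    ranked = sorted(index, key=lambda item: l2_distance(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]


# Toy "vector store": made-up vectors standing in for real embeddings
toy_index = [
    ("Virgin Australia passage", [0.9, 0.1, 0.0]),
    ("Komorida passage", [0.1, 0.9, 0.0]),
    ("Hackathon passage", [0.0, 0.2, 0.9]),
]

query_vec = [0.8, 0.2, 0.1]  # pretend embedding of the question
print(top_k(query_vec, toy_index, k=2))
# ['Virgin Australia passage', 'Komorida passage']
```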

In [28]:
#Test retriever from vector database given a query
question = "When did Virgin Australia start operating?"
searchDocs = db.similarity_search(question)
print(searchDocs[0].page_content)
searchDocsRelevant = retriever.get_relevant_documents("When did Virgin Australia start operating?")
print(searchDocsRelevant[0].page_content)

"Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney."
"Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisban

In [29]:
# Grab the queries and ground-truth responses for later evaluation.
instructions = []
contexts = []
responses_gt=[]

for page in docs: 
    instructions.append(page.metadata["instruction"])
    # Decode Unicode escape sequences and convert to Unicode string
    #decoded_text = bytes(page.page_content, "utf-8").decode("unicode_escape").encode("utf-8").decode("utf-8")
    contexts.append(page.page_content)
    responses_gt.append(page.metadata["response"])
    
print(instructions[0])
print(contexts[0])
print(responses_gt[0])

When did Virgin Australia start operating?
"Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney."
Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.


Task 3: Semantic Search with RAG Model. In this component, you will leverage the stored embeddings and implement a semantic search system using the RAG model. 

In [30]:
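def generate_response(model, tokenizer, query, context=None):
    """Generate a text answer for `query`, optionally paired with `context`."""
    # Tokenize the prompt. No truncation is applied, so long prompts can exceed the model's 512-token limit.
    inputs = tokenizer(query, context, return_tensors="pt", padding=True)
    input_ids = inputs["input_ids"]
    # Generate at most 512 tokens and decode the first (only) output sequence
    output = model.generate(input_ids=input_ids, max_length=512)
    response = tokenizer.decode(output[0], skip_special_tokens=True)
    return response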

In [31]:

model_name= "google-t5/t5-base"
model = T5ForConditionalGeneration.from_pretrained(model_name)
tokenizer = T5Tokenizer.from_pretrained(model_name)

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on google-t5/t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [32]:
def test_train():
    max_source_length = 512
    max_target_length = 256

    question = "What did Albert Einstein contribute to physics in the early 20th century?"
    context = "In the early 20th century, Albert Einstein revolutionized physics with his theory of relativity. Born in Germany in 1879, Einstein's groundbreaking work included the famous equation E=mc^2. His theories fundamentally changed our understanding of space, time, and gravity. Einstein later fled to the United States to escape the rise of the Nazis and continued his scientific contributions at Princeton University."
    input_sequence_1 = f"question: {question} context: {context}"
    output_sequence_1 = "His theory of relativity, including the famous equation E=mc^2."

    question2 = "Name two of Jane Austen's famous works."
    context2 = "Jane Austen, an English novelist, lived in the late 18th and early 19th centuries. Her works, including 'Pride and Prejudice' and 'Sense and Sensibility,' are celebrated for their keen observations of social manners and relationships. Austen's novels continue to be widely read and adapted into various forms of media, maintaining their relevance in literature and popular culture."
    input_sequence_2 = f"question: {question2} context: {context2}"
    output_sequence_2 = "Pride and Prejudice and Sense and Sensibility."

    input_sequences = [input_sequence_1, input_sequence_2]

    encoding = tokenizer(
        [sequence for sequence in input_sequences],
        padding="longest",
        max_length=max_source_length,
        truncation=True,
        return_tensors="pt",
    )

    input_ids, attention_mask = encoding.input_ids, encoding.attention_mask

    target_encoding = tokenizer(
        [output_sequence_1, output_sequence_2],
        padding="longest",
        max_length=max_target_length,
        truncation=True,
        return_tensors="pt",
    )

    labels = target_encoding.input_ids

    # replace padding token id's of the labels by -100 so it's ignored by the loss
    labels[labels == tokenizer.pad_token_id] = -100

    # forward pass: compute the cross-entropy loss and return it as a Python float
    loss = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels).loss
    return loss.item()
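
The `-100` replacement in the function above relies on PyTorch's cross-entropy loss skipping targets equal to its default `ignore_index` of -100. The sketch below mimics that behaviour on plain Python lists so the effect is visible without loading a model. `mask_padding`, `masked_mean`, the token ids, and the per-token loss values are made up for illustration; 0 is used as the pad id because that is T5's pad token id.

```python
IGNORE_INDEX = -100  # target value that PyTorch's cross-entropy loss skips by default


def mask_padding(label_ids, pad_token_id):
    """Replace every padding id in a batch of label sequences with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if token == pad_token_id else token for token in row]
        for row in label_ids
    ]


def masked_mean(per_token_loss, labels):
    """Average per-token losses over positions whose label is not IGNORE_INDEX."""
    kept = [
        loss
        for loss_row, label_row in zip(per_token_loss, labels)
        for loss, label in zip(loss_row, label_row)
        if label != IGNORE_INDEX
    ]
    return sum(kept) / len(kept)


# Two target sequences padded to the same length with pad id 0
labels = mask_padding([[5, 7, 1, 0, 0], [9, 4, 6, 2, 1]], pad_token_id=0)
print(labels)
# [[5, 7, 1, -100, -100], [9, 4, 6, 2, 1]]

# Hypothetical per-token losses; the large values sit on padded positions
losses = [[1.0, 2.0, 3.0, 9.0, 9.0], [1.0, 1.0, 1.0, 1.0, 1.0]]
print(masked_mean(losses, labels))
# 1.375
```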

In [33]:
#Test no RAG
query= "When did Virgin Australia start operating?"
input_prompt = f"answer: {query}"
response_without_rag = generate_response(model, tokenizer, input_prompt)
print("Response without RAG: " + response_without_rag)

Response without RAG: False


In [34]:
#Test RAG
question = "When did Virgin Australia start operating?"
#Find context in RAG
searchDocs = db.similarity_search(question)
context = searchDocs[0].page_content
input_text = f"question: {question} context: {context}"
response_with_rag = generate_response(model, tokenizer, input_text)
print("Response with RAG: " + response_with_rag)


Response with RAG: 31 August 2000
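
The two tests above differ only in whether a retrieved passage is added to the prompt. The sketch below isolates that retrieve-then-generate flow using stand-in functions so it runs without a model or vector store. `build_prompt` and `answer` are hypothetical helpers, and `fake_retrieve` and `fake_generate` are stubs; only the `question: ... context: ...` prompt format is taken from the notebook.

```python
def build_prompt(question, context=None):
    """Format the model input, adding the context only when one is supplied."""
    if context is None:
        return f"question: {question}"
    return f"question: {question} context: {context}"


def answer(question, generate, retrieve=None):
    """Answer a question, optionally grounding the prompt in a retrieved passage."""
    context = retrieve(question) if retrieve is not None else None
    return generate(build_prompt(question, context))


# Stand-ins for db.similarity_search(...)[0].page_content and generate_response(...)
def fake_retrieve(question):
    return "It commenced services on 31 August 2000 as Virgin Blue."


def fake_generate(prompt):
    return prompt  # echo the prompt so the difference between the two modes is visible


question = "When did Virgin Australia start operating?"
print(answer(question, fake_generate))
print(answer(question, fake_generate, retrieve=fake_retrieve))
```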


In [35]:
from bert_score import score as bert_score
import evaluate
    
def evaluate_metrics(predictions, references, model):
    # Note: `model` is unused; it is kept so that existing calls keep working.
    # BLEU with smoothing
    bleu = evaluate.load("bleu")
    results = bleu.compute(predictions=predictions, references=references, smooth=True)

    # ROUGE-L (based on the longest common subsequence)
    rouge = evaluate.load('rouge')
    rouge = rouge.compute(predictions=predictions, references=references, rouge_types=['rougeL'])

    # BERTScore, computed twice: per-example lists via `evaluate`, and tensors via the `bert_score` package
    bertscore = evaluate.load("bertscore")
    bert = bertscore.compute(predictions=predictions, references=references, lang="en")
    bert_score_precision, bert_score_recall, bert_score_f1 = bert_score(predictions, references, lang="en", verbose=False)
    return results, rouge, bert, bert_score_precision, bert_score_recall, bert_score_f1

In [36]:

test_train()
# Initialize lists to store results
responses_without_rag = []
responses_with_rag = []
# Iterate through the queries, pairing each with its ground truth and context by index
for i, query in enumerate(instructions):
    print("QUERY: " + query)
    print("Ground Truth: " + responses_gt[i])
    print("Context: " + contexts[i])

    # Generate response without RAG
    input_prompt = f"question: {query}"
    response_without_rag = generate_response(model, tokenizer, input_prompt)
    responses_without_rag.append(response_without_rag)
    print("NO RAG: " + response_without_rag)

    # Generate response with RAG using the top retrieved document as context
    searchDocs = db.similarity_search(query)
    context = searchDocs[0].page_content

    input_text = f"question: {query} context: {context}"

    response_with_rag = generate_response(model, tokenizer, input_text)
    responses_with_rag.append(response_with_rag)
    print("RAG: " + response_with_rag)

print(len(responses_without_rag))
print(len(responses_with_rag))

QUERY: When did Virgin Australia start operating?
Ground Truth: Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.
Context: "Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney."
NO RAG: not_entailment
RAG: 31 August 2000
QUERY: When was Tomoaki Komorida born?
Ground Truth: Tomoaki Komorida was born on July 10,1981.
Context: "Komorida was born in Kumamoto Prefecture on July 10, 1981. After graduating from high school, he joined the J1 League club Avispa Fukuoka in 2000. Although he

Token indices sequence length is longer than the specified maximum sequence length for this model (828 > 512). Running this sequence through the model will result in indexing errors


NO RAG: who
RAG: Poul-Henning Kamp
QUERY: Given a reference text about Run Towards the Danger, tell me how many essays are part of the collection.
Ground Truth: Six essays are part of the Run Towards the Danger essay collection.
Context: "Run Towards the Danger is a 2022 Canadian essay collection by Sarah Polley, a former child star, director, and screenwriter.\n\nThe six essays in the collection examine aspects of Polley's career on stage, screen, and on film detailing her roles in a Stratford Festival production of Alice Through the Looking Glass, as well as her breakout roles in The Adventures of Baron Munchausen and the TV series Road to Avonlea. The book also revealed for the first time that Polley had been a victim of Jian Ghomeshi who sexually and physically assaulted her when she was 16 and he was 28."
NO RAG: how many essays are part of the collection
RAG: six
QUERY: Who tends to participates in hackathons?
Ground Truth: Computer programmers and others involved in software dev

In [37]:
# Evaluate metrics
bleu_without_rag,rouge_norag,bert_norag,bert_score_precision, bert_score_recall, bert_score_f1 = evaluate_metrics(responses_without_rag, responses_gt,model)
bleu_with_rag,rouge_rag,bert_rag,bert_score_precision_2, bert_score_recall_2, bert_score_f1_2 = evaluate_metrics(responses_with_rag, responses_gt,model)
print(str(bleu_without_rag))
print(str(bleu_with_rag))
print(str(rouge_norag))
print(str(rouge_rag))
print(str(bert_norag))
print(str(bert_rag))

mean_precision = sum(bert_norag['precision']) / len(bert_norag['precision'])
mean_recall = sum(bert_norag['recall']) / len(bert_norag['recall'])
mean_f1= sum(bert_norag['f1']) / len(bert_norag['f1'])
print(mean_precision)
print(mean_recall)
print(mean_f1)

mean_precision = sum(bert_rag['precision']) / len(bert_rag['precision'])
mean_recall = sum(bert_rag['recall']) / len(bert_rag['recall'])
mean_f1= sum(bert_rag['f1']) / len(bert_rag['f1'])

print(mean_precision)
print(mean_recall)
print(mean_f1)

print(bert_score_precision.mean().item(), bert_score_recall.mean().item(), bert_score_f1.mean().item())
print(bert_score_precision_2.mean().item(), bert_score_recall_2.mean().item(), bert_score_f1_2.mean().item())


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

{'bleu': 1.0639951911476182e-05, 'precisions': [0.347953216374269, 0.2163265306122449, 0.17777777777777778, 0.1721311475409836], 'brevity_penalty': 4.8567726100955855e-05, 'length_ratio': 0.09146995708154507, 'translation_length': 341, 'reference_length': 3728}
{'bleu': 0.0011783880618119163, 'precisions': [0.7032755298651252, 0.5775656324582339, 0.5348837209302325, 0.5167785234899329], 'brevity_penalty': 0.0020357088389880316, 'length_ratio': 0.13894849785407726, 'translation_length': 518, 'reference_length': 3728}
{'rougeL': 0.06551637471868516}
{'rougeL': 0.24263000355246}
{'precision': [0.8016784191131592, 0.8050881624221802, 0.8348761200904846, 0.0, 0.8394904732704163, 0.8603605031967163, 0.8954699635505676, 0.8811906576156616, 0.8256478309631348, 0.7827385663986206, 0.7985420823097229, 0.8397805690765381, 0.886365532875061, 0.7984955906867981, 0.8383229374885559, 0.8201546669006348, 0.918373167514801, 0.7975795865058899, 0.8795962333679199, 0.8370873928070068, 0.8003584146499634,

In [38]:
model_name_flan = "google/flan-t5-base"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name_flan)
tokenizer = AutoTokenizer.from_pretrained(model_name_flan)

In [39]:

test_train()
# Initialize lists to store results
responses_without_rag = []
responses_with_rag = []
# Iterate through the queries, pairing each with its ground truth and context by index
for i, query in enumerate(instructions):
    print("QUERY: " + query)
    print("Ground Truth: " + responses_gt[i])
    print("Context " + contexts[i])
    # Generate response without RAG
    input_prompt = "Answer this question: " + query
    response_without_rag = generate_response(model, tokenizer, input_prompt)
    responses_without_rag.append(response_without_rag)
    print("NO RAG: " + response_without_rag)

    # Generate response with RAG using the retrieved context
    searchDocs = db.similarity_search(query)
    context = searchDocs[0].page_content

    input_text = f"Question: {query} Context: {context}"
    response_with_rag = generate_response(model, tokenizer, input_text)
    responses_with_rag.append(response_with_rag)
    print("RAG: " + response_with_rag)

print(len(responses_without_rag))
print(len(responses_with_rag))

QUERY: When did Virgin Australia start operating?
Ground Truth: Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.
Context "Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney."
NO RAG: 18 November 2007
RAG: 31 August 2000
QUERY: When was Tomoaki Komorida born?
Ground Truth: Tomoaki Komorida was born on July 10,1981.
Context "Komorida was born in Kumamoto Prefecture on July 10, 1981. After graduating from high school, he joined the J1 League club Avispa Fukuoka in 2000. Although he

Token indices sequence length is longer than the specified maximum sequence length for this model (829 > 512). Running this sequence through the model will result in indexing errors


NO RAG: john d. scott
RAG: Parkinson's
QUERY: Given a reference text about Run Towards the Danger, tell me how many essays are part of the collection.
Ground Truth: Six essays are part of the Run Towards the Danger essay collection.
Context "Run Towards the Danger is a 2022 Canadian essay collection by Sarah Polley, a former child star, director, and screenwriter.\n\nThe six essays in the collection examine aspects of Polley's career on stage, screen, and on film detailing her roles in a Stratford Festival production of Alice Through the Looking Glass, as well as her breakout roles in The Adventures of Baron Munchausen and the TV series Road to Avonlea. The book also revealed for the first time that Polley had been a victim of Jian Ghomeshi who sexually and physically assaulted her when she was 16 and he was 28."
NO RAG: 3
RAG: six
QUERY: Who tends to participates in hackathons?
Ground Truth: Computer programmers and others involved in software development, including graphic designers,

In [40]:
# Evaluate metrics
bleu_without_rag,rouge_norag,bert_norag,bert_score_precision, bert_score_recall, bert_score_f1 = evaluate_metrics(responses_without_rag, responses_gt,model)
bleu_with_rag,rouge_rag,bert_rag,bert_score_precision_2, bert_score_recall_2, bert_score_f1_2 = evaluate_metrics(responses_with_rag, responses_gt,model)
print(str(bleu_without_rag))
print(str(bleu_with_rag))
print(str(rouge_norag))
print(str(rouge_rag))
print(str(bert_norag))
print(str(bert_rag))

mean_precision = sum(bert_norag['precision']) / len(bert_norag['precision'])
mean_recall = sum(bert_norag['recall']) / len(bert_norag['recall'])
mean_f1= sum(bert_norag['f1']) / len(bert_norag['f1'])
print(mean_precision)
print(mean_recall)
print(mean_f1)

mean_precision = sum(bert_rag['precision']) / len(bert_rag['precision'])
mean_recall = sum(bert_rag['recall']) / len(bert_rag['recall'])
mean_f1= sum(bert_rag['f1']) / len(bert_rag['f1'])

print(mean_precision)
print(mean_recall)
print(mean_f1)

print(bert_score_precision.mean().item(), bert_score_recall.mean().item(), bert_score_f1.mean().item())
print(bert_score_precision_2.mean().item(), bert_score_recall_2.mean().item(), bert_score_f1_2.mean().item())


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

{'bleu': 0.0018745985047354046, 'precisions': [0.14044350580781415, 0.04604486422668241, 0.022871664548919948, 0.010723860589812333], 'brevity_penalty': 0.052823274384199766, 'length_ratio': 0.25375536480686695, 'translation_length': 946, 'reference_length': 3728}
{'bleu': 0.0044826023400422335, 'precisions': [0.5669172932330827, 0.46017699115044247, 0.4146341463414634, 0.38738738738738737], 'brevity_penalty': 0.009907553516242729, 'length_ratio': 0.1781115879828326, 'translation_length': 664, 'reference_length': 3728}
{'rougeL': 0.05090768121490386}
{'rougeL': 0.2658182746440736}
{'precision': [0.8574485778808594, 0.8461105823516846, 0.7879549264907837, 0.8185395002365112, 0.826104998588562, 0.8522376418113708, 0.8204290866851807, 0.8811906576156616, 0.82569819688797, 0.7765249609947205, 0.8615630865097046, 0.8182584643363953, 0.8520282506942749, 0.9237302541732788, 0.8361082077026367, 0.8352677822113037, 0.8266233801841736, 0.844606339931488, 0.8869531154632568, 0.7584036588668823, 0

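1. The difference in outcome between the two models (Flan-T5 and T5)

Flan-T5 performed better than T5. With retrieved context, Flan-T5-base scored higher than T5-base on BLEU (0.0045 vs. 0.0012), ROUGE-L (0.266 vs. 0.243), and BERTScore F1 (0.851 vs. 0.849), and its answers were better overall.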

2. Why do the scores differ while using RAG vs. without using RAG?  

RAG improves the model's performance and the factuality of its responses because the retrieved passage supplies the facts needed to answer. The BLEU scores are very low even with RAG because the generated answers, even when correct, are often much shorter than the ground-truth responses in the dataset, as the short-answer examples below show. BLEU includes a brevity penalty ("brevity_penalty" in the output) that penalizes generated text shorter than the reference. With length ratios between about 0.09 and 0.25 here, the penalty drives BLEU toward zero.
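
To make the effect of the brevity penalty concrete, the sketch below recomputes it from the lengths reported in the BLEU output, using the standard formula exp(1 - reference_length / translation_length) for outputs shorter than the reference. It then combines the penalty with the geometric mean of the reported n-gram precisions. The function names are my own; the input numbers are copied from the T5-base outputs above.

```python
import math


def brevity_penalty(translation_length, reference_length):
    """BLEU brevity penalty: 1 for long enough outputs, exp(1 - r/c) otherwise."""
    if translation_length == 0:
        return 0.0
    if translation_length >= reference_length:
        return 1.0
    return math.exp(1 - reference_length / translation_length)


def bleu_from_parts(precisions, translation_length, reference_length):
    """Combine n-gram precisions and the brevity penalty into a BLEU score."""
    if min(precisions) <= 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
    return brevity_penalty(translation_length, reference_length) * geo_mean


# T5-base without RAG: 341 generated tokens against 3728 reference tokens
print(brevity_penalty(341, 3728))   # about 4.86e-05

# T5-base with RAG: precisions and lengths as reported in the BLEU output
rag_precisions = [0.7032755298651252, 0.5775656324582339,
                  0.5348837209302325, 0.5167785234899329]
print(bleu_from_parts(rag_precisions, 518, 3728))   # about 0.00118
```

Even with n-gram precisions above 0.5, the penalty of roughly 0.002 dominates the final score, which matches the reported BLEU of about 0.0012 for T5-base with RAG.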

ROUGE-L captured the improvement from RAG more clearly: it rose from about 0.07 to 0.24 for T5-base and from about 0.05 to 0.27 for Flan-T5.
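
As a rough illustration of why ROUGE-L still gives a short, correct answer partial credit, the sketch below computes an LCS-based F-measure for one prediction/reference pair from the outputs above. It is a simplified approximation that uses lowercased whitespace tokens, which is an assumption; the `rouge` metric loaded through `evaluate` applies its own tokenization and may give slightly different values.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction, reference):
    """LCS-based F-measure over lowercased whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


prediction = "31 August 2000"
reference = ("Virgin Australia commenced services on 31 August 2000 "
             "as Virgin Blue, with two aircraft on a single route.")
print(rouge_l_f1(prediction, reference))   # about 0.286 (precision 3/3, recall 3/18)
```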

Example short answers: 
A. QUERY: Given a reference text about Lollapalooza, where does it take place, who started it and what is it?
Ground Truth: Lollapalooza is an annual musical festival held in Grant Park in Chicago, Illinois. It was started in 1991 as a farewell tour by Perry Farrell, singe of the group Jane's Addiction. The festival includes an array of musical genres including alternative rock, heavy metal, punk rock, hip hop, and electronic dance music. The festivals welcomes an estimated 400,000 people each year and sells out annually. Some notable headliners include: the Red Hot Chili Peppers, Chance the Rapper, Metallica, and Lady Gage. Lollapalooza is one of the largest and most iconic festivals in the world and a staple of Chicago.
NO RAG: the United States
RAG: Grant Park in Chicago


B. QUERY: When was Tomoaki Komorida born?
Ground Truth: Tomoaki Komorida was born on July 10,1981.
NO RAG: November 11, 1939
RAG: July 10, 1981 


| T5-base     | BLEU      | ROUGE-L | BERTScore P | BERTScore R | BERTScore F1 |
|-------------|-----------|---------|-------------|-------------|--------------|
| Without RAG | 1.064e-05 | 0.0655  | 0.8090      | 0.7740      | 0.7906       |
| With RAG    | 1.178e-03 | 0.2426  | 0.8743      | 0.8256      | 0.8487       |


| Flan-T5-base | BLEU      | ROUGE-L | BERTScore P | BERTScore R | BERTScore F1 |
|--------------|-----------|---------|-------------|-------------|--------------|
| Without RAG  | 1.875e-03 | 0.0509  | 0.8283      | 0.8049      | 0.8157       |
| With RAG     | 4.483e-03 | 0.2658  | 0.8783      | 0.8272      | 0.8514       |

