# LLM and RAG testing with LangChain

In [10]:
from langchain.prompts import PromptTemplate
from langchain_openai import ChatOpenAI, OpenAI

# client = OpenAI()
client = ChatOpenAI(model="gpt-3.5-turbo")

message = "Who granted the team access for the photoshoot at Blenheim Palace?"

prompt = PromptTemplate(
    input_variables=["message"],
    template="""You are a helpful assistant.
    
    Question: {message}
    
    Answer:""",
)

chain = prompt | client
response = chain.invoke({"message": message})
print(response.content)


The team was granted access for the photoshoot at Blenheim Palace by the event coordinator.


In [2]:
from datasets import load_dataset

corpus = load_dataset("MarkrAI/AutoRAG-evaluation-2024-LLM-paper-v1", "corpus")
qa = load_dataset("MarkrAI/AutoRAG-evaluation-2024-LLM-paper-v1", "qa")

print(len(corpus["train"]), len(qa["train"]))
print(corpus["train"][0])
print(corpus["train"][0]["contents"])

8576 520
{'doc_id': '6f86094c-47fe-43de-a77a-e8c34c69c997', 'contents': "# Rag-Driver: Generalisable Driving Explanations With Retrieval-Augmented In-Context Learning In Multi-Modal Large Language Model\n\nJianhao Yuan1, Shuyang Sun1, Daniel Omeiza1, Bo Zhao2, Paul Newman1, Lars Kunze1, Matthew Gadd1\n1 University of Oxford 2 Beijing Academy of Artificial Intelligence\n{jianhaoyuan,kevinsun,daniel,pnewman,lars,mattgadd}@robots.ox.ac.uk  \nAbstract—Robots powered by 'blackbox' models need to provide\nhuman-understandable explanations which we can trust. Hence,\nexplainability plays a critical role in trustworthy autonomous\ndecision-making to foster transparency and acceptance among\nend users, especially in complex autonomous driving. Recent\nadvancements in Multi-Modal Large Language models (MLLMs)\nhave shown promising potential in enhancing the explainability\nas a driving agent by producing control predictions along with\nnatural language explanations. However, severe data scarcity

In [3]:
from langchain_openai import OpenAIEmbeddings

embedding_fn = OpenAIEmbeddings(model="text-embedding-3-small")
query = "Hello, how are you today?"

embedded_query = embedding_fn.embed_query(query)


In [4]:
from langchain_chroma import Chroma


docs = corpus["train"]["contents"]
ids = corpus["train"]["doc_id"]

chroma_client = Chroma(embedding_function=embedding_fn)
for i in range(0, len(docs), 1000):
    chroma_client.add_texts(texts=docs[i : i + 1000], ids=ids[i : i + 1000])
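
The loop above adds the corpus in slices of 1000 texts, presumably to keep each embedding request at a manageable size. A minimal sketch of the same slicing as a reusable generator follows; `batched`, `toy_docs`, and `toy_batches` are hypothetical names used for illustration and are not part of LangChain or Chroma.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items` holding at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start : start + size]


# Toy data standing in for the corpus (assumption: any sliceable sequence works).
toy_docs = ["a", "b", "c", "d", "e"]
toy_batches = [list(batch) for batch in batched(toy_docs, 2)]
print(toy_batches)
```

With such a helper, texts and ids could be sliced with the same batch size so that each id stays aligned with its text.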

In [5]:
retriever = chroma_client.as_retriever(search_kwargs={"k": 5})
retriever.invoke(
    "What is the performance percentage increase observed in the navigational prompt suffix attack scenario when using VELMA-FT?"
)

[Document(id='f06e8b7c-92db-4a33-a5f2-99d440aa0760', metadata={}, page_content="# How Secure Are Large Language Models (Llms) For Navigation In Urban Environments?\n## B. Navigational Prompt Suffix (Nps) Attack\n\naMa and VELMA-LLaMa2, specifically examining their performance before and after the attack. On the Touchdown dataset, VELMA-LLaMa and VELMA-LLaMa2 both experienced an increase in their SPD metrics, by 9.09% and 4.65% over their baseline values, respectively. At the same time, both models demonstrated marked declines in KPA from their baseline results, with VELMA-LLaMa experiencing a decrease of 69.13% and VELMA-LLaMa2 a decrease of 57.72%. On the Map2seq dataset, both VELMA-LLaMa and VELMA- LLaMa2 exhibited significant increases in SPD, with VELMA- LLaMa showing a 13.35% increase and VELMA-LLaMa2 experiencing a 14.44% increase. Furthermore, the KPA of VELMA-LLaMa decreased by 43.67% from its baseline, while VELMA-LLaMa2 showed a reduction of 64.35%. These outcomes underscore 
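
Conceptually, the retriever embeds the query and returns the `k` stored texts whose vectors are closest to it. The sketch below illustrates that idea with brute-force cosine similarity on toy 2-D vectors. It is an assumption-laden illustration, not Chroma's implementation: the real store may use a different distance metric and an approximate index, and `cosine_similarity` and `top_k` are hypothetical helper names.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float], doc_vecs: Sequence[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Return (index, similarity) pairs for the k most similar documents."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for real embeddings.
toy_query = [1.0, 0.0]
toy_doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_hits = top_k(toy_query, toy_doc_vecs, k=2)
print(toy_hits)
```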

In [6]:
for i, q in enumerate(qa["train"][:100]["query"]):
    print(i, q)

0 Does the Turing Test assess a machine's ability to exhibit intelligent behavior equivalent to that of a human?
1 Is the oa_temp or the zone_occ the most impactful feature according to the Shapley values?
2 What is the performance percentage increase observed in the navigational prompt suffix attack scenario when using VELMA-FT?
3 What are essential components of evaluating large language models (LLMs)?
4 How many errors were there in the Inference phase as per the document?
5 What is the meaning of PLP-former in the context of this text?
6 How does the Qmsum value vary between the first and the third entries?
7 What is the purpose of using the torch.einsum function in the provided text?
8 What is the result of applying the PAL attack against GPT-3.5-Turbo?
9 What is the process to deceive the test function as described?
10 What leads to a greater decrease in performance in deploying LLMs/VLMs in robotics according to the experiment?
11 Why has the accuracy of the models not reached 1

In [None]:
question = qa["train"][88]["query"]
answer = qa["train"][88]["generation_gt"][0]

# Retrieve the top-k passages for the question with the retriever built above
docs = retriever.invoke(question)
context = "\n\n".join([doc.page_content for doc in docs])

print(f"Question: {question}")
print(f"Dataset answer: {answer}")

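# OpenAI is a completion-style LLM wrapper: invoking it returns a plain string,
# so the responses below are printed directly rather than through `.content`.
client = OpenAI()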

prompt = PromptTemplate(
    input_variables=["question", "context"],
    template="""You are a helpful AI assistant, answer in one sentence.
    Context: {context}
    Question: {question}
    """,
)

response_rag = (prompt | client).invoke({"question": question, "context": context})

print(f"Model answer with context: {response_rag}")

# Same prompt without the retrieved context, used as the no-RAG baseline
prompt_norag = PromptTemplate(
    input_variables=["question"],
    template="""You are a helpful AI assistant, answer in one sentence.
    Question: {question}
    """,
)

response_norag = (prompt_norag | client).invoke({"question": question})
print(f"Model answer without context: {response_norag}")


Question: What makes the GRACE framework able to perform generative cross-modal retrieval?
Dataset answer: GRACE assigns unique identifiers to images and comprises two training steps: learning to memorize the associations between visual content and their identifiers, and learning to retrieve the identifier of a relevant image given a textual query.
Model answer with context: 
Answer: The use of unique identifiers for images and a training scheme focused on "learning to memorize" and "learning to retrieve" enable the GRACE framework to effectively perform generative cross-modal retrieval.
Model answer without context: 
The GRACE framework uses a unified graph-based representation and joint cross-modal learning to effectively perform generative cross-modal retrieval.


In [8]:
print(context)

# Generative Cross-Modal Retrieval: Memorizing Images In Multimodal Language Models For Retrieval And Beyond
## 4 Experiments 4.1 Datasets And Baselines

We evaluated our proposed generative cross-modal retrieval framework, GRACE, on two commonlyused datasets: Flickr30K (Young et al., 2014) and MS-COCO (Lin et al., 2014). Flickr30K contains 31,783 images sourced from Flickr. Each image is associated with five human-annotated sentences.  
We adopted the data split used by Li et al., comprising 29,783 images for training, 1,000 for validation, and 1,000 for testing. MS-COCO comprises
123,287 images, and each MS-COCO image comes with five sentences of annotations. We followed the dataset split proposed in (Lee et al., 2018), utilizing 113,287 images for training, 5,000 for validation, and 5,000 for testing. Consistent with prior studies (Young et al., 2014; Chen et al., 2021),  
| Paradigm              | Methods        |
|-----------------------|----------------|
| Flickr30K             |

In [9]:
from ragas import SingleTurnSample
from ragas.metrics import BleuScore, RougeScore

bleu = BleuScore()
rouge = RougeScore()
print("With RAG:")
test_data = SingleTurnSample(
    user_input=question,
    response=response_rag,
    reference=answer,
)
print("Bleu:", bleu.single_turn_score(test_data))
print("Rouge:", rouge.single_turn_score(test_data))

print("Without RAG:")
test_data = SingleTurnSample(
    user_input=question,
    response=response_norag,
    reference=answer,
)
print("Bleu:", bleu.single_turn_score(test_data))
print("Rouge:", rouge.single_turn_score(test_data))

With RAG:
Bleu: 0.08724449615067745
Rouge: 0.38235294117647056
Without RAG:
Bleu: 0.018597924362574465
Rouge: 0.17543859649122806
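
Both scores above measure surface overlap between the model response and the reference answer, which is why the context-grounded answer scores higher. To make that concrete, here is a minimal sketch of a ROUGE-L style F-measure based on the longest common subsequence of tokens. It assumes lowercase whitespace tokenization with no stemming, so it is not expected to reproduce the library's numbers exactly; `lcs_length` and `rouge_l_f1` are hypothetical helper names.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, response: str) -> float:
    """ROUGE-L style F1 using lowercase whitespace tokens (a simplification)."""
    ref_tokens = reference.lower().split()
    res_tokens = response.lower().split()
    if not ref_tokens or not res_tokens:
        return 0.0
    lcs = lcs_length(ref_tokens, res_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(res_tokens)  # share of response tokens in the LCS
    recall = lcs / len(ref_tokens)     # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


# Toy sentences, not taken from the dataset.
toy_score = rouge_l_f1("the cat sat on the mat", "the cat is on the mat")
print(toy_score)
```

Overlap metrics like this reward shared wording rather than factual correctness, so a paraphrased but correct answer can still score low.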
