# Lesson-5: Evaluating the LLM
Evaluate the chain's responses using an LLM itself, an approach known as LLM-assisted evaluation.

### 1. Create the QnA chain

In [2]:
# Load the LangSmith API key from the .env file; it is used to access the hub
from dotenv import load_dotenv
load_dotenv()

True

In [4]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_ollama import ChatOllama, OllamaLLM, OllamaEmbeddings
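# The loaders and vector stores live in langchain_community; the old langchain.* paths are deprecated
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import DocArrayInMemorySearch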
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain import hub

In [3]:
loader = PyPDFLoader(
    file_path="SuFIA.pdf",
    extract_images=True,
)
pages = loader.load()

embeddings = OllamaEmbeddings(model="llama3.2")

db = DocArrayInMemorySearch.from_documents(pages, embeddings)



In [10]:
llm = ChatOllama(model="llama3.2")

# See full prompt at https://smith.langchain.com/hub/rlm/rag-prompt
prompt = hub.pull("rlm/rag-prompt")

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

qa_chain = (
    {
        "context": db.as_retriever() | format_docs,
        "question": RunnablePassthrough(),
    }
    | prompt
    | llm
    | StrOutputParser()
)
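To make the `"context"` input of the chain concrete, here is a minimal sketch of what `format_docs` produces. `FakeDoc` is a hypothetical stand-in for the document objects returned by the retriever; it is an assumption for illustration and only carries the `page_content` attribute that `format_docs` reads.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Hypothetical stand-in exposing only the attribute format_docs reads."""
    page_content: str


def format_docs(docs):
    # Same helper as in the cell above: join page texts with a blank line.
    return "\n\n".join(doc.page_content for doc in docs)


context = format_docs([FakeDoc("First page text."), FakeDoc("Second page text.")])
print(repr(context))  # 'First page text.\n\nSecond page text.'
```

The retrieved pages therefore reach the prompt as one string, separated by blank lines.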




### 2. LLM-Assisted Evaluation

First, generate question-answer examples from the document using the LLM.

In [33]:
from langchain.evaluation.qa import QAGenerateChain

example_generator_chain = QAGenerateChain.from_llm(ChatOllama(model="llama3.2"))

qa_pairs = example_generator_chain.apply_and_parse(
    input_list=[{"doc": doc.page_content} for doc in pages[0:2]]  # test on the first two pages
)



In [34]:
print(qa_pairs)

[{'qa_pairs': {'query': 'What is the primary limitation of learning-based approaches to robotic surgical assistants, as mentioned in the introduction?', 'answer': 'The primary limitation of learning-based approaches, such as reinforcement and imitation learning, is that complex, long-horizon surgical sub-tasks are often computationally expensive, require extensive domain knowledge and reward engineering, and involve time-consuming dataset curation, which limits their generalizability in safety-critical applications.'}}, {'qa_pairs': {'query': 'What is the primary contribution of SUFIA, a framework for natural interaction between a human surgeon and a surgical robot?', 'answer': 'The primary contributions of SUFIA are (1) a general formulation for natural language interaction between a surgeon and a robot, (2) a language-based control approach to facilitate surgical sub-task implementations, and (3) a systematic evaluation of the generalization of its approach to various surgical sub-ta

In [37]:
qa_examples = [pair["qa_pairs"] for pair in qa_pairs]
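Because the examples are produced by an LLM, an item may occasionally come back without a usable question or answer. The following is a minimal sketch of a defensive version of the unwrapping step above. The helper name `keep_well_formed` and the sample list are hypothetical and not part of LangChain; the sketch only assumes the `{'qa_pairs': {'query': ..., 'answer': ...}}` shape shown in the printed output.

```python
def keep_well_formed(raw_pairs, wrapper_key="qa_pairs"):
    """Unwrap generated items, keeping only those with a non-empty query and answer."""
    examples = []
    for item in raw_pairs:
        pair = item.get(wrapper_key) if isinstance(item, dict) else None
        if not isinstance(pair, dict):
            continue  # skip items that do not have the expected wrapper
        query = str(pair.get("query", "")).strip()
        answer = str(pair.get("answer", "")).strip()
        if query and answer:
            examples.append({"query": query, "answer": answer})
    return examples


# Made-up items for illustration only, not output from the model.
sample_pairs = [
    {"qa_pairs": {"query": "What is SUFIA?", "answer": "A framework."}},
    {"qa_pairs": {"query": "", "answer": "Missing question."}},
    {"unexpected": "shape"},
]
print(keep_well_formed(sample_pairs))
```

Calling `keep_well_formed(qa_pairs)` instead of the list comprehension would yield the same `qa_examples` when every item is well formed.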


Next, we generate predictions for these queries by invoking the `qa_chain`, then use the LLM to grade each prediction against the reference answer generated in `qa_pairs`.

In [22]:
from langchain.evaluation.qa import QAEvalChain

In [None]:
input_queries = [qa_pair["query"] for qa_pair in qa_examples]
predictions = qa_chain.batch(inputs=input_queries)

In [31]:
print(predictions)

['The text describes a framework called SUFIA (Surgeon-in-the-Loop Framework for Augmented Dexterity in Robotics) that enables natural language-guided augmented dexterity for robotic surgical assistants. The framework combines the strengths of large language models (LLMs) with perception modules to implement high-level planning and low-level control of a robot for surgical sub-task execution.\n\nSUFIA receives commands from a surgeon in natural language, converts them into high-level planning and low-level control code, and queries a perception module for object state information when necessary. The framework can assist a surgeon with open-ended tasks, such as moving the robot in a desired motion to help complete a surgical task.\n\nIn times of insufficient information, SUFIA delegates full control back to the surgeon. This approach enables natural human-robot coordination and has the potential to develop general-purpose models for autonomous surgery beyond the capability of current ta

In [38]:
qa_predict = [
    {"query": query, "answer": answer}
    for query, answer in zip(input_queries, predictions)
]

In [40]:
eval_chain = QAEvalChain.from_llm(ChatOllama(model="llama3.2"))
graded_outputs = eval_chain.evaluate(
    examples=qa_examples,
    predictions=qa_predict,
    prediction_key="answer",
)

In [42]:
print(graded_outputs)

[{'results': "INCORRECT\n\nThe student's answer provides a detailed description of the SUFIA framework, its capabilities, and its evaluation, but it does not address the primary limitation of learning-based approaches as mentioned in the introduction. The true answer specifically highlights that complex surgical tasks are computationally expensive, require domain knowledge and reward engineering, and involve dataset curation issues, which limits their generalizability."}, {'results': 'INCORRECT'}]
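The grader returns free text, sometimes only a verdict and sometimes a verdict followed by an explanation. A minimal sketch for turning these strings into a score follows. The helper names `extract_verdict` and `accuracy` are hypothetical, and the sketch assumes the verdict appears as an uppercase `CORRECT` or `INCORRECT` under the `'results'` key, as in the output above.

```python
import re

# Match the first standalone verdict word. The word boundary keeps
# "CORRECT" from matching inside "INCORRECT".
_VERDICT = re.compile(r"\b(CORRECT|INCORRECT)\b")


def extract_verdict(graded_output, key="results"):
    """Return 'CORRECT', 'INCORRECT', or None if no verdict is found."""
    match = _VERDICT.search(graded_output.get(key, ""))
    return match.group(1) if match else None


def accuracy(graded_outputs):
    """Fraction graded CORRECT; grades that cannot be parsed count as wrong."""
    if not graded_outputs:
        return 0.0
    verdicts = [extract_verdict(g) for g in graded_outputs]
    return sum(v == "CORRECT" for v in verdicts) / len(verdicts)


# Made-up grades for illustration only, not output from the model.
sample_grades = [
    {"results": "INCORRECT\n\nThe answer does not address the question."},
    {"results": "CORRECT"},
    {"results": "no verdict here"},
]
print([extract_verdict(g) for g in sample_grades])
print(accuracy(sample_grades))
```

Applied to the notebook's `graded_outputs`, `accuracy(graded_outputs)` would summarize the run in a single number; treating unparseable grades as wrong is a conservative choice, not something the grader itself prescribes.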


In [50]:
for i, eg in enumerate(qa_examples):
    print(f"Example {i}:")
    print("Question: " + qa_predict[i]['query'])
    print("Real Answer: \n" + qa_examples[i]['answer'])
    print("Predicted Answer:\n" + qa_predict[i]['answer'])
    print("\nPredicted Grade: " + graded_outputs[i]['results'])
    print()

Example 0:
Question: What is the primary limitation of learning-based approaches to surgical robotic platforms, according to the document?
Real Answer: 
The primary limitation of learning-based approaches, such as reinforcement and imitation learning, is that complex, long-horizon surgical sub-tasks are often computationally expensive, require extensive domain knowledge and reward engineering, and involve time-consuming dataset curation, which limits their generalizability in safety-critical applications.
Predicted Answer:
The text describes a framework called SUFIA (Surgeon-in-the-Loop Framework for Augmented Dexterity in Robotics) that enables natural language-guided augmented dexterity for robotic surgical assistants. The framework combines the strengths of large language models (LLMs) with perception modules to implement high-level planning and low-level control of a robot for surgical sub-task execution.

SUFIA receives commands from a surgeon in natural language, converts them in