# <font color="#003660">Applied Machine Learning for Text Analysis (M.184.5331)</font>


# <font color="#003660">Session 10: Evaluate LLMs with (other) LLMs</font>

# <font color="#003660">Evaluate using (other) LLMs?</font>

<center><br><img width=256 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/dag.png"/><br></center>

<p>

<div>
    <font color="#085986"><b>By the end of this lesson, you ...</b><br><br>
        ... will know how to generate test data using Argilla and HuggingFace. <br>
        ... will know how to evaluate LLMs using <b>Larger LLMs.</b> <br>
        ... will know how to apply this using LangChain-Ollama.
    </font>
</div>
</p>

This content is heavily inspired by these excellent sources:

* [HuggingFace (2024): NLP Course](https://huggingface.co/learn/nlp-course/)
* [HuggingFace (2024): Open-Source AI Cookbook](https://huggingface.co/learn/cookbook/index)
* [LangChain API Reference (2024)](https://python.langchain.com/api_reference/reference.html)
* [LangChain Docs (2024)](https://python.langchain.com/docs/introduction/)
* [LangChain AI (2024) Cookbook](https://github.com/langchain-ai/langchain/blob/master/cookbook/rewrite.ipynb?ref=blog.langchain.dev)
* [Don't Touch the Power Line](https://github.com/skaltenp/dont_touch_the_powerline)
* [Replacing Judges with Juries](https://arxiv.org/pdf/2404.18796)
* [Argilla Dataset Generator](https://huggingface.co/spaces/argilla/synthetic-data-generator)

# RAG Extensions

![](imgs/Pipeline.png)

In [None]:
!pip install -U pymupdf4llm datasets transformers faiss-cpu sentence-transformers accelerate langchain langchain-community langchain-huggingface langchain-ollama



In [37]:
import os
import re
from tqdm.notebook import tqdm
import pymupdf4llm
import urllib

from IPython.display import display, Markdown

from transformers import AutoTokenizer
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain import hub

from langchain_ollama import ChatOllama
from langchain_ollama import OllamaEmbeddings

In [38]:
RETRIEVER_NAME = "jina/jina-embeddings-v2-base-de"
GENERATOR_NAME = "qwen2.5:7b"

# Loading Documents

In [39]:
markdown_documents_path = "markdown_documents"

In [40]:
def remove_markdown_links(text):
    """
    Removes Markdown links from the given text while keeping the link text.

    Args:
        text (str): The input Markdown text.

    Returns:
        str: The text with Markdown links removed.

    Yeah this was ChatGPT ;)
    """
    # Regex to match Markdown links [text](link)
    pattern = r'\[([^\]]+)\]\([^\)]+\)'
    # Replace the matched pattern with just the text inside the brackets
    cleaned_text = re.sub(pattern, r'\1', text)
    return cleaned_text
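A quick check on a made-up sentence shows what the helper strips. The sample string and its URLs are hypothetical, and the pattern is repeated here only so the snippet runs on its own. Note that an image link `![alt](src)` keeps its leading `!`, because the pattern only matches the bracket-parenthesis pair.

```python
import re

# Same pattern as in remove_markdown_links, repeated so this snippet is self-contained.
LINK_PATTERN = r'\[([^\]]+)\]\([^\)]+\)'

# Hypothetical sample text with one regular link and one image link.
sample = "Created by [David Benioff](https://example.org/benioff) for ![logo](imgs/hbo.png) HBO."

cleaned = re.sub(LINK_PATTERN, r'\1', sample)
print(cleaned)
```

If image markup should disappear completely, the pattern would need an optional leading `!`, which the original helper does not handle.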

In [41]:
markdown_file_path = "documents/Game_of_Thrones.md"

with open(markdown_file_path, encoding="utf-8") as file:
    md_files = [["Game_of_Thrones.md", remove_markdown_links(file.read())]]
md_files

[['Game_of_Thrones.md',
  '# Game of Thrones\n\n**_Game of Thrones is an American fantasy drama_**\ntelevision series created by David Benioff and\nD. B. Weiss for HBO. It is an adaptation of A Song of\n_Ice and Fire, a series of fantasy novels by_\nGeorge R. R. Martin, the first of which is _A Game of_\n_Thrones. The show premiered on HBO in the United_\nStates on April 17, 2011, and concluded on May 19,\n2019, with 73 episodes broadcast over eight seasons.\n\nSet on the fictional continents of Westeros and Essos,\n_Game of Thrones has a large ensemble cast and follows_\nseveral story arcs throughout the course of the show.\nThe first major arc concerns the Iron Throne of the)\nSeven Kingdoms of Westeros through a web of\npolitical conflicts among the noble families either\nvying to claim the throne or fighting for independence\nfrom whoever sits on it. The second major arc focuses\non the last descendant of the realm\'s deposed ruling\ndynasty, who has been exiled to Essos and is plo

In [46]:
embedding_tokenizer = AutoTokenizer.from_pretrained("jinaai/jina-embeddings-v2-base-en", use_fast=False)
text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(embedding_tokenizer, chunk_size=512, chunk_overlap=32)
all_splits = text_splitter.create_documents(texts=[x[1] for x in md_files], metadatas=[{"source": x[0]} for x in md_files])
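To build an intuition for `chunk_size` and `chunk_overlap`, here is a simplified sketch that slides a fixed window over whitespace tokens. This is an illustration only: `RecursiveCharacterTextSplitter` splits on separators and measures length with the tokenizer, so its chunk boundaries will differ. The function name `chunk_tokens` and the sample sentence are assumptions made for this example.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Whitespace "tokens" stand in for real tokenizer output here.
tokens = "the show premiered on HBO in the United States in 2011".split()
chunks = chunk_tokens(tokens, chunk_size=5, chunk_overlap=2)
for chunk in chunks:
    print(" ".join(chunk))
```

Each chunk repeats the last two tokens of the previous one, so a fact that straddles a chunk boundary still appears complete in at least one chunk.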

In [47]:
llm = ChatOllama(
    model = "llama3.3:70b",
    base_url = "131.234.154.21",
    seed=1,
)

In [48]:
SYSTEM_PROMPT = """
You are a question and reference answer generator that generates questions based on a given document.

### Instructions:
- You will be given a document to read.
- You will then be asked to generate questions based on the document.
- You will also be asked to generate reference answers for the questions.



### Format
Document:
<here goes the document>

Generate your questions in the format:

QUESTION: <your question here>
REFERENCE ANSWER: <your reference answer here>
"""

PROMPT_TEMPLATE = """
Document:
{document}

Generate your questions in the format:

QUESTION: <your question here>
REFERENCE ANSWER: <your reference answer here>"""

In [None]:
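l = []  # collects one (question, reference_answer) tuple per chunk
for split in tqdm(all_splits[:10]):
    messages = [
        ("system", SYSTEM_PROMPT),
        ("human", PROMPT_TEMPLATE.format(document=split.page_content)),
    ]
    # Keep only the text after the last "QUESTION:" marker, i.e. the final pair in the reply
    qa = llm.invoke(messages).content.split("QUESTION:")[-1]
    # Text before "REFERENCE ANSWER:" is the question, text after it is the reference answer
    question = qa.split("REFERENCE ANSWER:")[0].strip()
    reference_answer = qa.split("REFERENCE ANSWER:")[-1].strip()
    l.append((question, reference_answer))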

  0%|          | 0/10 [00:00<?, ?it/s]
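The split-based parsing above keeps only the last question of each reply. If the model returns several pairs, a regular expression can collect all of them. The following is a minimal sketch under that assumption; `parse_qa_pairs` is a hypothetical helper and the reply text is made up for illustration.

```python
import re

# One QUESTION / REFERENCE ANSWER pair, ending at the next "QUESTION:" or at the end of the text.
QA_PATTERN = re.compile(
    r"QUESTION:\s*(?P<question>.+?)\s*REFERENCE ANSWER:\s*(?P<answer>.+?)\s*(?=QUESTION:|\Z)",
    re.DOTALL,
)


def parse_qa_pairs(text):
    """Return all (question, reference_answer) pairs found in a model reply."""
    return [(m.group("question"), m.group("answer")) for m in QA_PATTERN.finditer(text)]


# Made-up reply in the format requested by the prompt.
reply = """Here are my questions.

QUESTION: In what year did the show conclude?
REFERENCE ANSWER: 2019, with the final episode airing on May 19.

QUESTION: How many seasons were broadcast?
REFERENCE ANSWER: Eight."""

pairs = parse_qa_pairs(reply)
for question, answer in pairs:
    print(question, "->", answer)
```

A reply that ignores the requested format yields an empty list, which is easier to detect than a silently wrong question.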

In [36]:
test_set = []
for i in range(len(all_splits[:10])):
    test_set.append(
        {
            "document": all_splits[i].page_content,
            "question": l[i][0],
            "reference_answer": l[i][1],
        }
    )
    print(f"Question: {l[i][0]}")
    print(f"Reference Answer: {l[i][1]}")

Question: In what year did the show conclude?
Reference Answer: 2019, with the final episode airing on May 19.
Question: What is the title of the opening theme song?
Reference Answer: "Main Title".
Question: In what year were several actors' contracts renegotiated to include a seventh-season option?
Reference Answer: 2014.
Question: What is the name of the healer who helps Robb Stark?
Reference Answer: Talisa Maegyr, played by Oona Chaplin.
Question: Who seeks vengeance against the Lannisters in Dorne?
Reference Answer: Warrior Ellaria Sand.
Question: What genre does George R. R. Martin aim to make the story feel like?
Reference Answer: Historical fiction, rather than contemporary fantasy.
Question: How does the series rank in terms of deaths per episode compared to other television drama shows?
Reference Answer: Game of Thrones ranked second in deaths per episode, averaging 14, out of 40 recent television drama shows.
Question: What historical novel series was a main inspiration for M

In [None]:
SYSTEM_PROMPT = """
You are an LLM that answers questions based on documents.
Answer only based on the document but not on your own knowledge.
"""

PROMPT_TEMPLATE = """
Answer the following question based on the document:
{document}

Here is the question:
{question}

Give your answer in the format:

ANSWER: <your answer>"""

small_llm = ChatOllama(
    model = "qwen2.5:7b",
    base_url = "131.234.154.21",
)

for i in tqdm(range(len(test_set))):
    messages = [
        ("system", SYSTEM_PROMPT),
        ("human", PROMPT_TEMPLATE.format(document=test_set[i]["document"], question=test_set[i]["question"])),
    ]
    response = small_llm.invoke(messages).content
    test_set[i]["answer"] = response

  0%|          | 0/10 [00:00<?, ?it/s]
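The small model is asked to prefix its reply with `ANSWER:`, so the stored answers carry that marker. A small helper can strip it before the answer is shown or judged. This is a sketch; `extract_answer` is a hypothetical name, not part of the notebook.

```python
import re


def extract_answer(response):
    """Drop an optional leading 'ANSWER:' marker and surrounding whitespace."""
    return re.sub(r"^\s*ANSWER:\s*", "", response).strip()


print(extract_answer(" ANSWER: 2019"))
print(extract_answer("Talisa Maegyr"))
```

Replies without the marker pass through unchanged, so the helper is safe to apply to every response.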

In [44]:
for i in range(len(test_set)):
    print(f"Question: {test_set[i]['question']}")
    print(f"Reference Answer: {test_set[i]['reference_answer']}")
    print(f"Answer: {test_set[i]['answer']}")

Question: In what year did the show conclude?
Reference Answer: 2019, with the final episode airing on May 19.
Answer:  ANSWER: 2019
Question: What is the title of the opening theme song?
Reference Answer: "Main Title".
Answer: ANSWER: Main Title
Question: In what year were several actors' contracts renegotiated to include a seventh-season option?
Reference Answer: 2014.
Answer: ANSWER: 2014
Question: What is the name of the healer who helps Robb Stark?
Reference Answer: Talisa Maegyr, played by Oona Chaplin.
Answer: ANSWER: Talisa Maegyr
Question: Who seeks vengeance against the Lannisters in Dorne?
Reference Answer: Warrior Ellaria Sand.
Answer: ANSWER: Ellaria Sand
Question: What genre does George R. R. Martin aim to make the story feel like?
Reference Answer: Historical fiction, rather than contemporary fantasy.
Answer: ANSWER: historical fiction
Question: How does the series rank in terms of deaths per episode compared to other television drama shows?
Reference Answer: Game of Thr

In [45]:
SYSTEM_PROMPT = """
You are an assistant that validates the answers given by another LLM.

### Instructions:
- You will be given a question and an answer.
- Additionally, you will be given the source document and the reference answer.
- You have to validate whether the answer is correct based on the document.
- If the answer is correct, reply with "CORRECT".
- If the answer is incorrect, reply with "INCORRECT".
"""

PROMPT_TEMPLATE = """
Here is the document:
{document}

Here is the question:
{question}

Here is the reference answer:
{reference_answer}

Here is the answer to validate:
{answer}

Now state only whether the answer is correct or incorrect.

Your answer:"""

for i in tqdm(range(len(test_set))):
    messages = [
        ("system", SYSTEM_PROMPT),
        ("human", PROMPT_TEMPLATE.format(document=test_set[i]["document"], question=test_set[i]["question"], reference_answer=test_set[i]["reference_answer"], answer=test_set[i]["answer"])),
    ]
    response = llm.invoke(messages).content
    print(test_set[i]["document"])
    print("-" * 25)
    print(test_set[i]["question"])
    print("-" * 25)
    print(test_set[i]["reference_answer"])
    print("-" * 25)
    print(response)
    print("\n" * 3)  # four blank lines between examples

  0%|          | 0/10 [00:00<?, ?it/s]

# Game of Thrones

**_Game of Thrones is an American fantasy drama_**
television series created by David Benioff and
D. B. Weiss for HBO. It is an adaptation of A Song of
_Ice and Fire, a series of fantasy novels by_
George R. R. Martin, the first of which is _A Game of_
_Thrones. The show premiered on HBO in the United_
States on April 17, 2011, and concluded on May 19,
2019, with 73 episodes broadcast over eight seasons.

Set on the fictional continents of Westeros and Essos,
_Game of Thrones has a large ensemble cast and follows_
several story arcs throughout the course of the show.
The first major arc concerns the Iron Throne of the)
Seven Kingdoms of Westeros through a web of
political conflicts among the noble families either
vying to claim the throne or fighting for independence
from whoever sits on it. The second major arc focuses
on the last descendant of the realm's deposed ruling
dynasty, who has been exiled to Essos and is plotting
to return and reclaim the throne. The thir

Is that really a reliable evaluation?
No! A single judge LLM only gives a rough indication of how good your model is.
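To turn the judge's free-text replies into a number, the verdicts can be parsed and averaged. The sketch below assumes the judge follows the instruction to reply with "CORRECT" or "INCORRECT"; `parse_verdict` and `accuracy` are hypothetical helpers and the replies are made up. Since "INCORRECT" contains "CORRECT", a plain substring check would count every verdict as correct, hence the word boundaries.

```python
import re


def parse_verdict(reply):
    """Map a judge reply to True (correct), False (incorrect), or None (unclear)."""
    # \b keeps "CORRECT" from matching inside "INCORRECT".
    match = re.search(r"\b(INCORRECT|CORRECT)\b", reply.upper())
    if match is None:
        return None
    return match.group(1) == "CORRECT"


def accuracy(replies):
    """Share of correct verdicts among the replies that could be parsed."""
    verdicts = [parse_verdict(r) for r in replies]
    decided = [v for v in verdicts if v is not None]
    return sum(decided) / len(decided) if decided else float("nan")


# Made-up judge replies for illustration.
replies = ["CORRECT", "INCORRECT", "The answer is correct.", "I cannot tell."]
print([parse_verdict(r) for r in replies])
print(accuracy(replies))
```

The parser is deliberately naive: a reply such as "not correct" would still count as correct, so unparsed or unusual replies are worth inspecting by hand.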

For a more reliable evaluation, combine several judge models ([Verga et al., 2024](https://arxiv.org/pdf/2404.18796)) or combine models with human reviewers.

![](imgs/judges.png)

([Verga et al., 2024](https://arxiv.org/pdf/2404.18796))
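One simple way to pool the verdicts of several judges is a majority vote. The sketch below is an assumption made for illustration, not necessarily the exact pooling used in the paper; the judge names and votes are hypothetical.

```python
from collections import Counter


def jury_verdict(votes):
    """Majority vote over per-judge verdicts; returns None for ties or no votes."""
    if not votes:
        return None
    ranked = Counter(votes).most_common()
    # A tie between the two most frequent verdicts is left undecided.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


# Hypothetical verdicts from three different judge models for one answer.
votes = {
    "judge_a": "CORRECT",
    "judge_b": "CORRECT",
    "judge_c": "INCORRECT",
}
print(jury_verdict(list(votes.values())))
```

Undecided cases (ties) are natural candidates to hand over to a human reviewer.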