In [1]:
import requests
from bs4 import BeautifulSoup
import re

In [2]:
uri = "https://arxiv.org/html/2306.05685v4"
paper = requests.get(uri)

In [3]:
print(paper.content)
print(paper.status_code)  # proceed only if this prints 200

b'<!DOCTYPE html>\n\n<html lang="en">\n<head>\n<meta content="text/html; charset=utf-8" http-equiv="content-type"/>\n<title>Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena</title>\n<!--Generated on Sun Dec 24 02:00:19 2023 by LaTeXML (version 0.8.7) http://dlmf.nist.gov/LaTeXML/.-->\n<meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/>\n<link href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/css/bootstrap.min.css" rel="stylesheet" type="text/css"/>\n<link href="/static/browse/0.3.4/css/ar5iv_0.7.4.min.css" rel="stylesheet" type="text/css"/>\n<link href="/static/browse/0.3.4/css/latexml_styles.css" rel="stylesheet" type="text/css"/>\n<script src="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/js/bootstrap.bundle.min.js"></script>\n<script src="https://cdnjs.cloudflare.com/ajax/libs/html2canvas/1.3.3/html2canvas.min.js"></script>\n<script src="/static/browse/0.3.4/js/addons.js"></script>\n<script src="/static/browse/0.3.4/js/feedbackOver

In [4]:
soup = BeautifulSoup(paper.content, "html.parser")

In [5]:
page_content = soup.select_one(".ltx_page_content")

In [6]:
title = page_content.select_one("h1.ltx_title").text.replace('\n', '')
title = "Title: "+title
print(title)

Title: Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena


In [7]:
authors = page_content.select_one(".ltx_personname").text
# Strip LaTeXML residue: superscript markers, then footnote digits, then leftover symbols
authors = authors.replace("start_FLOATSUPERSCRIPT", "").replace("end_FLOATSUPERSCRIPT", "")
authors = re.sub(r"\d", "", authors)
for symbol in ("*", "{}", "^", "\\And", "\n"):
    authors = authors.replace(symbol, "")
authors = "Authors: " + authors
print(authors)

Authors: Lianmin Zheng       Wei-Lin Chiang⁣        Ying Sheng⁣        Siyuan Zhuang  Zhanghao Wu       Yonghao Zhuang       Zi Lin       Zhuohan Li       Dacheng Li  Eric P. Xing       Hao Zhang    Joseph E. Gonzalez       Ion Stoica     UC Berkeley     UC San Diego     Carnegie Mellon University     Stanford     MBZUAI
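
The author line printed above still carries long runs of spaces and a few invisible format characters. Below is a minimal sketch of one way to split it into separate fields. It assumes that names and affiliations are always separated by two or more whitespace characters, which holds for the output above but is not guaranteed for other arXiv pages. `split_author_fields` and `sample` are hypothetical names, not part of the notebook.

```python
import re
import unicodedata


def split_author_fields(raw: str) -> list[str]:
    """Split a flattened author string into names/affiliations (hypothetical helper)."""
    # Drop invisible format characters (Unicode category "Cf"), e.g. U+2063.
    visible = "".join(ch for ch in raw if unicodedata.category(ch) != "Cf")
    # Assumption: two or more whitespace characters separate fields,
    # while a single space only appears inside a name.
    fields = re.split(r"\s{2,}", visible.strip())
    return [field for field in fields if field]


# Shortened sample modelled on the output above (U+2063 written explicitly).
sample = "Lianmin Zheng       Wei-Lin Chiang\u2063        Ying Sheng\u2063        Siyuan Zhuang  Zhanghao Wu"
print(split_author_fields(sample))
```

Run the helper on the cleaned string before the "Authors: " prefix is added, otherwise the prefix ends up glued to the first name.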


In [8]:
abstract = page_content.select_one(".ltx_abstract").text
print(abstract)


Abstract
Evaluating large language model (LLM) based chat assistants is challenging due to their broad capabilities and the inadequacy of existing benchmarks in measuring human preferences.
To address this, we explore using strong LLMs as judges to evaluate these models on more open-ended questions.
We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases, as well as limited reasoning ability, and propose solutions to mitigate some of them.
We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-bench, a multi-turn question set; and Chatbot Arena, a crowdsourced battle platform.
Our results reveal that strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement between humans.
Hence, LLM-as-a-judge is a scalable and explainable way to approximate human preferences, which are otherwise ver

In [9]:
# Text of every <p> inside a .ltx_para block within a .ltx_section
paragraphs = [i.text for i in page_content.select(".ltx_section .ltx_para p")]
print(paragraphs)

['There has been a proliferation of LLM-based chat assistants (chatbots) that leverage supervised instruction fine-tuning and reinforcement learning with human feedback (RLHF) to unlock new instruction following and conversational abilities\xa0[31, 2, 30, 8, 52, 48, 14].\nOnce aligned with humans, these chat models are strongly preferred by human users over the original, unaligned models on which they are built.\nHowever, the heightened user preference does not always correspond to improved scores on traditional LLM benchmarks – benchmarks like MMLU\xa0[19] and HELM\xa0[24] cannot effectively tell the difference between these aligned models and the base models.\nThis phenomenon suggests that there is a fundamental discrepancy between user perceptions of the usefulness of chatbots and the criteria adopted by conventional benchmarks.', 'We argue that this discrepancy primarily arises due to\nexisting evaluation that only measures LLMs’ core capability on a confined set of tasks (e.g., mu
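
The extracted paragraphs still contain non-breaking spaces (`\xa0`) and hard line breaks (`\n`) left over from the HTML. If that noise matters for chunking or embedding, one option is to collapse all whitespace before indexing. This is a sketch only; `normalize_paragraph` is a hypothetical helper name.

```python
import re


def normalize_paragraph(text: str) -> str:
    """Collapse every run of whitespace into a single space (hypothetical helper)."""
    # For str patterns, \s also matches Unicode whitespace such as \xa0.
    return re.sub(r"\s+", " ", text).strip()


# Fragment modelled on the first paragraph in the output above.
example = "conversational abilities\xa0[31, 2, 30].\nOnce aligned with humans, these chat models are preferred."
print(normalize_paragraph(example))
```

Applying it with `paragraphs = [normalize_paragraph(p) for p in paragraphs]` would clean the list before it is passed to the RAG pipeline.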

In [10]:
import sys
sys.path.insert(0, "C:\\Users\\jaide\\Documents\\CodingProjects\\NeuroSymbolicAI\\rag-pipeline")
from rag_pipeline import RAG




In [11]:
rag = RAG()
rag.add_documents("collection", [title, authors, abstract] + paragraphs)

llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from C:\Users\jaide\.cache\huggingface\hub\models--TheBloke--Mistral-7B-Instruct-v0.2-GGUF\snapshots\3a6fbf4a41a1d52e415a4958cde6856d34b2db93\mistral-7b-instruct-v0.2.Q2_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.2
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_mo

In [13]:
output = rag.rag_query("collection", "Can you tell me what chatbot arena is?", 5)


llama_print_timings:        load time =   52474.67 ms
llama_print_timings:      sample time =      25.38 ms /    98 runs   (    0.26 ms per token,  3861.92 tokens per second)
llama_print_timings: prompt eval time =   52473.53 ms /   506 tokens (  103.70 ms per token,     9.64 tokens per second)
llama_print_timings:        eval time =   12307.63 ms /    97 runs   (  126.88 ms per token,     7.88 tokens per second)
llama_print_timings:       total time =   65121.75 ms /   603 tokens


In [14]:
print(output)

        Chatbot Arena is a platform where users engage in conversations with two chatbots at the same time and rate their responses based on personal preferences. It is a crowdsourced evaluation environment used to study and compare different chatbot models, such as GPT-4, GPT-3.5, Claude, Vicuna, Koala, Alpaca, LLaMA, and Dolly, by having human judges evaluate their conversational abilities in real-world scenarios.


In [13]:
output = rag.rag_query("collection", "What is the title of the paper?", 5)


llama_print_timings:        load time =   50091.68 ms
llama_print_timings:      sample time =       7.59 ms /    30 runs   (    0.25 ms per token,  3954.13 tokens per second)
llama_print_timings: prompt eval time =   50091.00 ms /   452 tokens (  110.82 ms per token,     9.02 tokens per second)
llama_print_timings:        eval time =    3931.05 ms /    29 runs   (  135.55 ms per token,     7.38 tokens per second)
llama_print_timings:       total time =   54124.93 ms /   481 tokens


In [14]:
print(output)
# The retriever may be fragile here: a chunk that shares only the single word "title" with the query might not be scored as similar.

The title of the paper is "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena".


In [12]:
print(rag.retrieve("collection", "What is the title"))

['Title: Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena', 'We argue that this discrepancy primarily arises due to\nexisting evaluation that only measures LLMs’ core capability on a confined set of tasks (e.g., multi-choice knowledge or retrieval questions), without adequately assessing its alignment with human preference in open-ended tasks, such as the ability to accurately adhere to instructions in multi-turn dialogues.\nAs a demonstration, we show conversation histories with two models on an MMLU question in Figure\xa01.\nThe two models are LLaMA-13B\xa0[39], a pre-trained base model without fine-tuning, and Vicuna-13B, our fine-tuned model from LLaMA-13B on high-quality conversations (the training details are in Appendix\xa0E).\nDespite the base LLaMA models showing competitive performance on conventional benchmarks (Table\xa09), its answers to open-ended questions are often not preferred by humans.\nThis misalignment of conventional benchmarks underscores the core problem 
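
The internals of `rag.retrieve` are not shown in this notebook, so the following is only a toy illustration of the concern noted earlier, not a description of how the pipeline scores chunks. It uses a plain bag-of-words cosine similarity (an assumption; the pipeline most likely uses dense embeddings) to show how a chunk that shares only the word "title" with the query can lose to a shorter chunk that shares an unrelated common word. The second chunk is made up for the example, and `tokens` and `cosine` are hypothetical helpers.

```python
import math
import re
from collections import Counter


def tokens(text: str) -> Counter:
    """Lowercased bag of alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


query = "What is the title"
chunks = [
    "Title: Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena",
    "Chatbot Arena is a crowdsourced battle platform",  # illustrative chunk, not from the index
]

# Rank chunks by lexical similarity to the query, best first.
ranked = sorted(chunks, key=lambda c: cosine(tokens(query), tokens(c)), reverse=True)
for chunk in ranked:
    print(round(cosine(tokens(query), tokens(chunk)), 3), chunk)
```

With purely lexical scoring, the made-up chunk ranks above the title chunk because both share exactly one word with the query and the made-up chunk is shorter. The actual retriever returned the title first in the output above, so this only shows why a single-word match is a weak signal on its own.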