# WaterLLMarks Project using Langchain's API

The goal of the study is to review the use of text watermarks to track users throughout a LLM SaaS platform. The study implements the following pipelines:
- Question Answering without RAG.
- Question Answering with RAG (implemented as a naive RAG using Chroma).
- QA with RAG and a token-based watermark (`waterllmarks.watermarks.TokenWatermark`).
- QA with RAG and a watermark using character-embedding (`waterllmarks.watermarks.Rizzo2016`).
- With and without augmentation -> ideally see the impact of the augemntation on the llm response and/or the retrieved documents

The goal is to assess the response quality of the different pipelines using known metrics:
- BLEU, ROUGE, METEOR
- String similarity (Levenshtein distance)
- Context precision and recall (for RAG)
- Retrieved documents compared to reference.


## Setup dataset and corpus

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
def format_prompt(prompt: str) -> str:
    return "\n".join([l.lstrip() for l in prompt.split("\n")]).rstrip("\n")


In [3]:
from langchain_openai import ChatOpenAI
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_core.globals import set_llm_cache
from langchain_community.cache import SQLiteCache
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_core.runnables import ConfigurableField
from langchain_core.rate_limiters import InMemoryRateLimiter


# set_llm_cache(SQLiteCache(".cache/llm_cache.db"))

SEED = 1977

# llm_client = ChatOpenAI(
#     model="gpt-4o-mini",
#     seed=SEED,
# )


# llm_client = ChatOllama(
#     model="mistral:7b",
#     base_url="http://snt-precision-7920-rack-lama:11434",
#     seed=SEED,
# )

llm_client = ChatOpenAI(
    api_key="EMPTY",
    base_url="http://snt-precision-7920-rack-lama:8001/v1",
    model="mistralai/Mistral-7B-Instruct-v0.3",
    seed=SEED,
)


# embedding_fn = OpenAIEmbeddings(model="text-embedding-3-small")

embedding_fn = OpenAIEmbeddings(
    api_key="EMPTY",
    base_url="http://snt-precision-7920-rack-lama:8002/v1",
    model="intfloat/e5-mistral-7b-instruct",
    tiktoken_enabled=False,
    check_embedding_ctx_length=False,
)

# embedding_fn = OllamaEmbeddings(
#     base_url="http://snt-precision-7920-rack-lama:2",
#     model="intfloat/e5-mistral-7b-instruct",
# )
vector_store = Chroma(
    embedding_function=embedding_fn,
    persist_directory=".cache/chroma_mistral_db",
)

In [4]:
from waterllmarks.datasets import LLMPaperDataset, RAGBenchDataset

ds = LLMPaperDataset()
corpus = ds.corpus
print(len(corpus))


8576


In [5]:
ds.corpus[1].page_content


"## I. Introduction\n\nDriven by the emerging development of deep learning, autonomous driving has observed a paradigm shift from rulesbased decision systems [66, 21] to data-driven learning-based approaches [28, 6, 36]. However, this comes at the cost of transparency in decision-making, especially for end-to-end autonomous driving systems which are considered black-box in nature [13]. Thus, in addition to precision in action control, explanation provision is key in ensuring trustworthy decisionmaking to reconcile the system's decisions with end-user expectations to foster confidence and acceptance [79, 8, 57] in dynamic driving environments.  \nTraditional approaches have mainly relied on attention visualisation [5, 7, 55] as a proxy to rationalise the decisions of the black-box systems or auxiliary intermediate tasks such as semantic segmentation [25, 32], object detection [16, 31], and affordance prediction [68, 45] provide meaningful intermediate representation for decision-making.

In [6]:
len(embedding_fn.embed_documents([corpus[1].page_content]))

1

In [7]:
import time


existing_ids = vector_store.get(include=[])["ids"]
doc_to_add = [doc for doc in ds.corpus if doc.id not in existing_ids]
while doc_to_add:
    n = min(100, len(doc_to_add))
    try:
        vector_store.add_documents(doc_to_add[:n])
        doc_to_add = doc_to_add[n:]
    except Exception as e:
        print("Got exception", e)
        print("Sleeping for 3 seconds")
        time.sleep(3)

In [8]:
from waterllmarks.pipeline import DictParser
from langchain_core.runnables import ConfigurableField, RunnableLambda


retriever = vector_store.as_retriever(
    search_type="mmr", search_kwargs={"search_k": 20}
).configurable_alternatives(
    ConfigurableField(id="retriever"),
    default_key="chroma",
    empty=RunnableLambda(lambda _: []),
)

print(ds.qas[0]["reference_context_ids"])
(DictParser("user_input") | retriever).invoke(ds.qas[0])


['1626f35b-7ffc-4736-bd6e-2a2f20e1aa8b']


[Document(id='a09c2e9a-dd24-4e36-9b2e-7d4e15e58639', metadata={'creation_datetime': '2024-03-04', 'file_name': '2402.09193v2.md', 'file_path': 'paper_data/2402.09193v2.md', 'file_size': 69224, 'file_type': '', 'last_accessed_datetime': '2024-03-04', 'last_modified_datetime': '2024-02-22', 'title': '(Ir)Rationality And Cognitive Biases In Large Language Models'}, page_content='## 1 Introduction\n\n can be used to test the rationality of a model; this is a complex problem which requires a consensus what is deemed rational and irrational.  \nMirco Musolesi University College London University of Bologna m.musolesi@ucl.ac.uk In using methods designed to evaluate human reasoning, it is important to acknowledge the performance vs. competence debate [14]. This line of argument encourages *species-fair* comparisons between humans and machines, meaning that we should design tests specific to either humans or machines, as otherwise apparent failures may not reflect underlying capabilities but on

## Baseline setting

### QA without RAG

In [9]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.runnables import ConfigurableField

from waterllmarks.pipeline import DictParser

prompt = PromptTemplate(
    input_variables=["pipeline_input"],
    template=format_prompt("""
        Answer the question. Keep the answer short and concise.

        Question: {pipeline_input}
        
        Answer: 
        """),
)

set_input = RunnablePassthrough.assign(pipeline_input=DictParser("user_input"))
output_parser = StrOutputParser(name="content") | (lambda x: x.strip())

llm = RunnablePassthrough.assign(response=prompt | llm_client | output_parser)

norag_chain = set_input | llm.with_config(llm="ollama")

norag_chain.invoke(ds.qas[0])

{'id': '1c40f174-bda1-4c48-88fb-b7449e999067',
 'user_input': "Does the Turing Test assess a machine's ability to exhibit intelligent behavior equivalent to that of a human?",
 'reference': 'Yes',
 'reference_context_ids': ['1626f35b-7ffc-4736-bd6e-2a2f20e1aa8b'],
 'reference_contexts': ["Language represents a rigorously structured communicative system characterized by its grammar and vocabulary. It serves as the principal medium through which humans articulate and convey meaning. This conception of language as a structured communicative tool is pivotal in the realm of computational linguistics, particularly in the development and evaluation of natural language processing (NLP) algorithms. A seminal aspect in this field is the Turing Test, proposed by Alan Turing in 1950 [1], which evaluates a machine's ability to exhibit intelligent behavior equivalent to, or indistinguishable from, that of a human. In this context, the Turing Test primarily assesses the machine's capability to perfor

### QA with RAG

In [10]:
from textwrap import dedent
from langchain_core.prompts import PromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.runnables import RunnablePassthrough, chain, RunnableLambda
from langchain_core.output_parsers import StrOutputParser

from waterllmarks.pipeline import DictParser, DictWrapper

rag_prompt = PromptTemplate(
    input_variables=["pipeline_input", "context"],
    template=format_prompt("""
        Answer the question using the provided context. Keep the answer short and concise.
        
        Context:
        {context}
        
        Question: {pipeline_input}

        Answer: 
        """),
)

context_formatter = RunnableLambda(
    lambda docs: "\n\n".join([doc.page_content for doc in docs]),
)  # list[Document] -> str

ragllm = (
    RunnablePassthrough.assign(
        retrieved_contexts=DictParser("pipeline_input") | retriever,
    )  # {"pipeline_input": str, "retrieved_contexts": list[Document]}
    | RunnablePassthrough.assign(
        context=DictParser("retrieved_contexts") | context_formatter,
    )  # {"pipeline_input": str, "retrieved_contexts": list[Document], "context": str}
    | RunnablePassthrough.assign(
        response=rag_prompt | llm_client | output_parser,
    )  # {"pipeline_input": str, "retrieved_contexts": list[Document], "context": str, "response": str}
).with_config(llm="ollama")

rag_chain = set_input | ragllm

rag_chain.invoke(ds.qas[0])

{'id': '1c40f174-bda1-4c48-88fb-b7449e999067',
 'user_input': "Does the Turing Test assess a machine's ability to exhibit intelligent behavior equivalent to that of a human?",
 'reference': 'Yes',
 'reference_context_ids': ['1626f35b-7ffc-4736-bd6e-2a2f20e1aa8b'],
 'reference_contexts': ["Language represents a rigorously structured communicative system characterized by its grammar and vocabulary. It serves as the principal medium through which humans articulate and convey meaning. This conception of language as a structured communicative tool is pivotal in the realm of computational linguistics, particularly in the development and evaluation of natural language processing (NLP) algorithms. A seminal aspect in this field is the Turing Test, proposed by Alan Turing in 1950 [1], which evaluates a machine's ability to exhibit intelligent behavior equivalent to, or indistinguishable from, that of a human. In this context, the Turing Test primarily assesses the machine's capability to perfor

### QA with RAG but no context

In [11]:
empty_ragllm = ragllm.with_config(retriever="empty")
emptyrag_chain = set_input | empty_ragllm

# rag_results = rag_chain.batch(ds.qas)
# rag_results[0]

emptyrag_chain.invoke(ds.qas[0])


{'id': '1c40f174-bda1-4c48-88fb-b7449e999067',
 'user_input': "Does the Turing Test assess a machine's ability to exhibit intelligent behavior equivalent to that of a human?",
 'reference': 'Yes',
 'reference_context_ids': ['1626f35b-7ffc-4736-bd6e-2a2f20e1aa8b'],
 'reference_contexts': ["Language represents a rigorously structured communicative system characterized by its grammar and vocabulary. It serves as the principal medium through which humans articulate and convey meaning. This conception of language as a structured communicative tool is pivotal in the realm of computational linguistics, particularly in the development and evaluation of natural language processing (NLP) algorithms. A seminal aspect in this field is the Turing Test, proposed by Alan Turing in 1950 [1], which evaluates a machine's ability to exhibit intelligent behavior equivalent to, or indistinguishable from, that of a human. In this context, the Turing Test primarily assesses the machine's capability to perfor

### Evaluation

In [12]:
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import pickle
from functools import partial
from ragas.embeddings import LangchainEmbeddingsWrapper
from waterllmarks.evaluation import DEFAULT_ALL_METRICS, WLLMKResult
from ragas.llms import LangchainLLMWrapper
from waterllmarks.evaluation import evaluate
from ragas import RunConfig
from langchain_core.runnables import RunnableConfig, RunnableSequence
from waterllmarks.pipeline import ProgressBarCallback, ThreadedSequence
from collections import Counter

llm_wrapper = LangchainLLMWrapper(llm_client)
embedding_wrapper = LangchainEmbeddingsWrapper(embedding_fn)

eval_fn = partial(
    evaluate,
    metrics=DEFAULT_ALL_METRICS,
    llm=llm_wrapper,
    embeddings=embedding_wrapper,
    runconfig=RunConfig(seed=SEED, max_workers=2),
)

inputs = ds.qas[:]

path = f"results/{SEED}_baseline_results_noempty.pkl"
if Path(path).exists():
    baseline_evals = WLLMKResult.load(path)
    print("Loaded existing results")
else:
    rag_results = ThreadedSequence(rag_chain).batch(inputs)
    # empty_rag_results = ThreadedSequence(emptyrag_chain).batch(inputs)

    baseline_evals = WLLMKResult(
        rag=eval_fn(rag_results),
        #    empty_rag=eval_fn(empty_rag_results),
    )

    baseline_evals.save(path)

baseline_evals.synthesis


Loaded existing results


[nltk_data] Downloading package punkt_tab to .cache/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to .cache/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


{'bleu_score': 0.11060374876651043,
 'rouge_score': 0.23664912136291053,
 'meta_meteor_score': 0.31791078889900865,
 'non_llm_string_similarity': 0.2509601153278232,
 'semantic_similarity': 0.7608850498987082,
 'factual_correctness': 0.5308299595141701,
 'llm_context_precision_without_reference': 0.88214829307892,
 'context_recall': 0.6780805194805195,
 'faithfulness': 0.7854154653607934,
 'context_overlap': 0.06682692307692308,
 'retrieved_context_similarity': 0.7406435398799064}

### Seting up the new baseline

In [13]:
baseline_results = ThreadedSequence(rag_chain).batch(inputs)
baseline_qas = [
    {
        "id": res["id"],
        "user_input": res["user_input"],
        "reference": res["response"],
        "reference_contexts": [doc.page_content for doc in res["retrieved_contexts"]],
        "reference_context_ids": [doc.id for doc in res["retrieved_contexts"]],
    }
    for res in baseline_results
    if res["id"] != "3ff85d76-1baa-4a8b-9347-521a51aa22cd"
]

  0%|          | 0/520 [00:00<?, ?it/s]

In [None]:
# This question has been removed from the dataset
print(ds.qas["3ff85d76-1baa-4a8b-9347-521a51aa22cd"])


## Watermarks

### TokenWatermark

In [14]:
from langchain_core.runnables import RunnableParallel

from waterllmarks.watermarks import TokenWatermark

wtmk = TokenWatermark(key=b"0123456789ABCDEF")
apply_watermark = RunnablePassthrough.assign(
    pipeline_input=DictParser("pipeline_input") | wtmk.apply_as_runnable()
)
token_rag_chain = set_input | apply_watermark | ragllm


# token_rag_results = token_rag_chain.invoke(baseline_qas)
# token_rag_results[0]

token_rag_chain.invoke(baseline_qas[0])


{'id': '1c40f174-bda1-4c48-88fb-b7449e999067',
 'user_input': "Does the Turing Test assess a machine's ability to exhibit intelligent behavior equivalent to that of a human?",
 'reference': "No, the Turing Test does not directly assess a machine's ability to exhibit intelligent behavior equivalent to that of a human. Instead, it evaluates a machine's ability to mimic human conversation, without necessarily implying the machine possesses genuine understanding or rationality. The text mentions the embodied Turing test (Ortiz, 2016) and nonverbal Turing test (Pfeiffer et al., 2011) as examples, which focus on differentiating mind from machine in non-verbal interactions. However, these tests also do not directly measure a machine's capacity for intelligent behavior equivalent to that of a human.",
 'reference_contexts': ['## 1 Introduction\n\n can be used to test the rationality of a model; this is a complex problem which requires a consensus what is deemed rational and irrational.  \nMirc

### Rizzo2016 (Character Embedding)

In [15]:
from textwrap import dedent

from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

from waterllmarks.watermarks import Rizzo2016
from waterllmarks.pipeline import RunnableTryFix


wtmk = Rizzo2016(key=b"0123456789ABCDEF")
apply_watermark = RunnablePassthrough.assign(
    pipeline_input=DictParser("pipeline_input") | wtmk.apply_as_runnable()
)

augment_prompt = PromptTemplate(
    input_variables=["pipeline_input"],
    template=dedent("""Increase the query size to at least 105 characters.

    Query: {pipeline_input}

    Augmented Query: 
    """),
)

augmenter = RunnablePassthrough.assign(
    pipeline_input=DictParser("pipeline_input")
    | augment_prompt
    | llm_client
    | output_parser
)

apply_or_augment = RunnableTryFix(
    primary_step=apply_watermark, fix_step=augmenter, log_failures=True
)

embed_rag_chain = set_input | apply_or_augment | ragllm

# embed_rag_results = embed_rag_chain.batch(baseline_qas)
# embed_rag_results[0]


embed_rag_chain.invoke(baseline_qas[0])


{'id': '1c40f174-bda1-4c48-88fb-b7449e999067',
 'user_input': "Does the Turing Test assess a machine's ability to exhibit intelligent behavior equivalent to that of a human?",
 'reference': "No, the Turing Test does not directly assess a machine's ability to exhibit intelligent behavior equivalent to that of a human. Instead, it evaluates a machine's ability to mimic human conversation, without necessarily implying the machine possesses genuine understanding or rationality. The text mentions the embodied Turing test (Ortiz, 2016) and nonverbal Turing test (Pfeiffer et al., 2011) as examples, which focus on differentiating mind from machine in non-verbal interactions. However, these tests also do not directly measure a machine's capacity for intelligent behavior equivalent to that of a human.",
 'reference_contexts': ['## 1 Introduction\n\n can be used to test the rationality of a model; this is a complex problem which requires a consensus what is deemed rational and irrational.  \nMirc

## Evaluation

### Watermarks vs. baseline

In [16]:
from functools import partial

eval_fn = partial(
    evaluate,
    metrics=DEFAULT_ALL_METRICS,
    llm=llm_wrapper,
    embeddings=embedding_wrapper,
)

token_rag_results = ThreadedSequence(token_rag_chain).batch(baseline_qas)
embed_rag_results = ThreadedSequence(embed_rag_chain).batch(baseline_qas)

path = Path(f".cache/{SEED}_wllmk_intermediate/")
if not path.exists():
    path.mkdir()

if not (path / "token_res.pkl").exists():
    token_res = eval_fn(pipeline_results=token_rag_results)
    pickle.dump(token_res, open(path / "token_res.pkl", "wb"))
else:
    token_res = pickle.load(open(path / "token_res.pkl", "rb"))

if not (path / "embed_res.pkl").exists():
    embed_res = eval_fn(pipeline_results=embed_rag_results)
    pickle.dump(embed_res, open(path / "embed_res.pkl", "wb"))
else:
    embed_res = pickle.load(open(path / "embed_res.pkl", "rb"))

res = WLLMKResult(token=token_res, embed=embed_res)
res.save(f"results/{SEED}_wllmk_results.pkl")
res.synthesis

  0%|          | 0/519 [00:00<?, ?it/s]

  0%|          | 0/519 [00:00<?, ?it/s]

Unnamed: 0,bleu_score,rouge_score,meta_meteor_score,non_llm_string_similarity,semantic_similarity,factual_correctness,llm_context_precision_without_reference,context_recall,faithfulness,context_overlap,retrieved_context_similarity
token,0.453847,0.485661,0.50045,0.474112,0.884936,0.590248,0.870065,0.815762,0.788592,0.367593,0.739676
embed,0.209238,0.306499,0.348447,0.320105,0.818746,0.474313,0.850586,0.73575,0.694046,0.150087,0.718541


### Data Augmentation

#### Number of retries

In [17]:
n_retried_pipelines = 0
mean_retry = 0.0
for r in embed_rag_results:
    if "failures" in r:
        if r["failures"] > 0:
            n_retried_pipelines += 1
        mean_retry += r["failures"]

mean_retry /= len(embed_rag_results)

print(f"Number of retried pipelines: {n_retried_pipelines}")
print(f"Mean number of retries: {mean_retry}")

failures = Counter([r["failures"] for r in embed_rag_results if "failures" in r])

print(f"Failures: {failures}")


Number of retried pipelines: 357
Mean number of retries: 0.6994219653179191
Failures: Counter({1: 351, 0: 162, 2: 6})


#### Augmented prompts vs. original

In [18]:
augmented_qas = [
    res | {"user_input": res["last_input"]["pipeline_input"]}
    for res in embed_rag_results
    if res["failures"] > 0
]
augmented_qas = sorted(augmented_qas, key=lambda x: x["id"])
reference_qas = sorted(ds.qas, key=lambda x: x["id"])
for a, b in zip(augmented_qas, reference_qas):
    a["reference"] = b["reference"]
    a["reference_contexts"] = b["reference_contexts"]
    a["reference_context_ids"] = b["reference_context_ids"]


In [19]:
augmented_results = ThreadedSequence(rag_chain).batch(augmented_qas)

  0%|          | 0/357 [00:00<?, ?it/s]

In [20]:
if not (path / "augmented_res.pkl").exists():
    augmented_res = eval_fn(pipeline_results=augmented_results)
    pickle.dump(augmented_res, open(path / "augmented_res.pkl", "wb"))
else:
    augmented_res = pickle.load(open(path / "augmented_res.pkl", "rb"))

augmented_evals = WLLMKResult(augmented=augmented_res)


Evaluating:   0%|          | 0/3927 [00:00<?, ?it/s]

ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt n_l_i_statement_prompt failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.executor:Exception raised in Job[41]: RagasOutputParserException(The output parser failed to parse the output including retries.)
ERROR:ragas.executor:Exception raised in Job[104]: TimeoutError()
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_o

In [28]:
import pandas as pd

b_evals = baseline_evals.details[
    baseline_evals.details["reference"].isin(augmented_evals.details["reference"])
    & baseline_evals.details["reference_contexts"].isin(
        augmented_evals.details["reference_contexts"]
    )
]

b_synthesis = {k: b_evals[k].mean() for k in baseline_evals.synthesis.keys()}

augment_eval_res = WLLMKResult()
augment_eval_res.synthesis = pd.DataFrame(
    [
        augmented_evals.synthesis,
        b_synthesis,
    ],
    index=["baseline", "augmented"],
)
augment_eval_res.details = {
    "baseline": b_evals,
    "augmented": augmented_evals.details,
}
augment_eval_res.save(f"results/{SEED}_augmented_results.pkl")

In [29]:
augment_eval_res.synthesis


Unnamed: 0,bleu_score,rouge_score,meta_meteor_score,non_llm_string_similarity,semantic_similarity,factual_correctness,llm_context_precision_without_reference,context_recall,faithfulness,context_overlap,retrieved_context_similarity
baseline,0.015783,0.074993,0.106471,0.166407,0.561388,0.278787,0.868779,0.145836,0.728232,0.0007,0.731044
augmented,0.119863,0.248121,0.334363,0.257913,0.767473,0.539403,0.888889,0.696968,0.791642,0.072829,0.739267


In [30]:
augment_eval_res.details["augmented"]


Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,bleu_score,rouge_score,meta_meteor_score,non_llm_string_similarity,semantic_similarity,factual_correctness,llm_context_precision_without_reference,context_recall,faithfulness,context_overlap,retrieved_context_similarity
0,What does the acronym PSI stand for in the con...,[## 2.2 Fine-Tuning For Downstream Tasks\n\nAf...,[## Appendix\n\n | ...,"In the provided context, there is no mention o...",Product Substitute Identification,0.000000,0.000000,0.000000,0.084084,0.477198,0.75,1.000000,0.000000,1.000000,0.0,0.726919
1,"What does the term ""recall"" specifically denot...","[Recall, also known as Sensitivity or True Pos...",[## 1 Introduction\n\n Llama 2 [TMS+23] and Fa...,"In anomaly detection, recall, also known as Tr...",The method demonstrated to be highly performat...,0.033008,0.078431,0.127953,0.167614,0.622888,0.00,1.000000,0.666667,1.000000,0.0,0.691295
2,Was the conference on knowledge discovery and ...,[## C.2 Precision And Recall\n\niled: the most...,[## V. Limitations And Future Work\n\n ...,"No, the lines were spoken by different people.",I believe it holds significant promise for enh...,0.011150,0.060606,0.061475,0.170455,0.346355,0.00,0.000000,0.000000,0.000000,0.0,0.698202
3,What are your thoughts on the effectiveness of...,[## 1 Introduction\n\n for response generation...,[## E. Analysis On Base Models\n\nWe compare t...,The RS-DPO method is effective in aligning lar...,I believe instruction tuning is immensely cruc...,0.023587,0.111111,0.193322,0.263889,0.635889,0.14,1.000000,0.000000,,0.0,0.749425
4,What allows for the highly efficient generativ...,[Deploying large language models (LLMs) is alm...,"[Recall, also known as Sensitivity or True Pos...",The study suggests that the highly efficient g...,Recall measures the proportion of actual posit...,0.012091,0.085714,0.094787,0.189266,0.550496,0.00,1.000000,0.000000,0.833333,0.0,0.763315
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
352,What specifically influenced the specification...,[Reinforcement Learning from Human Feedback (R...,[## C.5 Kendall'S Tau Correlation Coefficient ...,The provided context does not explicitly menti...,I believe that Kendall's Tau correlation coeff...,0.032342,0.102190,0.134875,0.247649,0.593588,0.40,1.000000,0.000000,0.428571,0.0,0.747547
353,Is the recently implemented update significant...,[| | ...,[## 3.2 Experimental Setup\n\nWe start our exp...,"Yes, the update significantly enhances graphic...",The specific reasons for choosing the Open Ass...,0.002636,0.056338,0.042735,0.196581,0.553615,0.00,0.750000,0.000000,0.000000,0.0,0.706815
354,What is the detailed process to devise a strat...,[## System Answer The Following Questions As B...,[115\n150\n184\nDataset Size (Billion Tokens)\...,To devise a strategy that may potentially lead...,I don't know. Because the text does not explai...,0.000000,0.042553,0.060606,0.232198,0.562109,0.00,0.916667,0.000000,0.000000,0.0,0.728312
355,What specifically inspired the in-depth resear...,[## Abstract\n\nThis paper considers the chall...,[## C.5 Kendall'S Tau Correlation Coefficient ...,The in-depth research on probabilistic reasoni...,I don't know. Because the text only provides s...,0.015394,0.142857,0.179104,0.205556,0.579088,0.86,0.805556,0.000000,1.000000,0.0,0.718475
