# WaterLLMarks Project using LangChain's API

This study reviews the use of text watermarks to track users across an LLM SaaS platform. It implements the following pipelines:
- Question Answering without RAG.
- Question Answering with RAG (implemented as a naive RAG using Chroma).
- QA with RAG and a token-based watermark (`waterllmarks.watermarks.TokenWatermark`).
- QA with RAG and a watermark using character-embedding (`waterllmarks.watermarks.Rizzo2016`).
- Pipelines with and without query augmentation, ideally to show the impact of the augmentation on the LLM response and/or the retrieved documents.
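
To illustrate why a character-level watermark can require query augmentation, here is a minimal sketch of bit embedding through homoglyph substitution. It is not the `waterllmarks.watermarks.Rizzo2016` implementation: the `HOMOGLYPHS` table, `embed_bits`, and `extract_bits` are hypothetical names chosen for illustration. The point it shows is that each payload bit consumes one substitutable character, so a short query may not have enough capacity.

```python
# Toy homoglyph watermark for illustration only (NOT the waterllmarks code).
# Latin letters mapped to visually similar Cyrillic code points (small subset).
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e", "c": "\u0441", "p": "\u0440"}
REVERSE = {v: k for k, v in HOMOGLYPHS.items()}


def embed_bits(text: str, bits: list[int]) -> str:
    """Encode one bit per carrier character: 1 -> homoglyph, 0 -> original."""
    capacity = sum(ch in HOMOGLYPHS for ch in text)
    if capacity < len(bits):
        raise ValueError(f"need {len(bits)} carrier characters, found {capacity}")
    out, remaining = [], list(bits)
    for ch in text:
        if ch in HOMOGLYPHS and remaining:
            bit = remaining.pop(0)
            out.append(HOMOGLYPHS[ch] if bit else ch)
        else:
            out.append(ch)
    return "".join(out)


def extract_bits(text: str, n_bits: int) -> list[int]:
    """Read the first n_bits carrier characters back as bits."""
    bits: list[int] = []
    for ch in text:
        if len(bits) == n_bits:
            break
        if ch in REVERSE:
            bits.append(1)
        elif ch in HOMOGLYPHS:
            bits.append(0)
    return bits


original = "what are cognitive biases?"
payload = [1, 0, 1, 1]
marked = embed_bits(original, payload)
recovered = extract_bits(marked, len(payload))
```

In this sketch, a query without enough carrier characters makes `embed_bits` raise a `ValueError`, which is the kind of failure an augmentation step could repair by lengthening the query.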

The goal is to assess the response quality of the different pipelines using known metrics:
- BLEU, ROUGE, METEOR
- String similarity (Levenshtein distance)
- Context precision and recall (for RAG)
- Retrieved documents compared to reference.
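
As a reference point for the string-similarity metric, the following sketch computes a Levenshtein distance and normalizes it to a similarity in [0, 1]. The function names are hypothetical, and the normalization by the longer string is an assumption; the `non_llm_string_similarity` metric used later may be computed differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with insertions, deletions and substitutions (cost 1 each)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def string_similarity(a: str, b: str) -> float:
    """Normalize the distance to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


distance = levenshtein("kitten", "sitting")
similarity = string_similarity("Yes", "No")
```

A purely lexical score like this penalizes a correct but verbose answer against a one-word reference such as "Yes", which is one reason to read it alongside the semantic metrics.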


## Setup dataset and corpus

In [1]:
%load_ext autoreload
%autoreload 2

In [26]:
def format_prompt(prompt: str) -> str:
    return "\n".join([l.lstrip() for l in prompt.split("\n")]).rstrip("\n")
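
A quick check of what `format_prompt` produces on an indented triple-quoted template may help when reading the prompts below. The helper is repeated so the example runs on its own; the template text is illustrative.

```python
def format_prompt(prompt: str) -> str:
    # Same helper as in the cell above, repeated to keep the example self-contained.
    return "\n".join([l.lstrip() for l in prompt.split("\n")]).rstrip("\n")


raw = """
    Answer the question.

    Question: {pipeline_input}

    Answer:
    """
formatted = format_prompt(raw)
lines = formatted.split("\n")
```

Leading indentation is removed from every line and trailing newlines are dropped, but the newline that follows the opening triple quote is kept, so the formatted prompt starts with an empty line.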


In [3]:
from langchain_openai import ChatOpenAI
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_core.globals import set_llm_cache
from langchain_community.cache import SQLiteCache
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_core.runnables import ConfigurableField
from langchain_core.rate_limiters import InMemoryRateLimiter


# set_llm_cache(SQLiteCache("llm_cache.db"))

SEED = 1138

# llm_client = ChatOpenAI(
#     model="gpt-4o-mini",
#     seed=SEED,
# )


# llm_client = ChatOllama(
#     model="mistral:7b",
#     base_url="http://snt-precision-7920-rack-lama:11434",
#     seed=SEED,
# )

llm_client = ChatOpenAI(
    api_key="EMPTY",
    base_url="http://snt-precision-7920-rack-lama:8001/v1",
    model="mistralai/Mistral-7B-Instruct-v0.3",
)


# embedding_fn = OpenAIEmbeddings(model="text-embedding-3-small")

embedding_fn = OpenAIEmbeddings(
    api_key="EMPTY",
    base_url="http://snt-precision-7920-rack-lama:8002/v1",
    model="intfloat/e5-mistral-7b-instruct",
    tiktoken_enabled=False,
    check_embedding_ctx_length=False,
)

# embedding_fn = OllamaEmbeddings(
#     base_url="http://snt-precision-7920-rack-lama:2",
#     model="intfloat/e5-mistral-7b-instruct",
# )
vector_store = Chroma(
    embedding_function=embedding_fn,
    persist_directory="./chroma_mistral_db",
)

In [4]:
from waterllmarks.datasets import LLMPaperDataset, RAGBenchDataset

ds = LLMPaperDataset()
corpus = ds.corpus
print(len(corpus))


8576


In [5]:
ds.corpus[1].page_content


"## I. Introduction\n\nDriven by the emerging development of deep learning, autonomous driving has observed a paradigm shift from rulesbased decision systems [66, 21] to data-driven learning-based approaches [28, 6, 36]. However, this comes at the cost of transparency in decision-making, especially for end-to-end autonomous driving systems which are considered black-box in nature [13]. Thus, in addition to precision in action control, explanation provision is key in ensuring trustworthy decisionmaking to reconcile the system's decisions with end-user expectations to foster confidence and acceptance [79, 8, 57] in dynamic driving environments.  \nTraditional approaches have mainly relied on attention visualisation [5, 7, 55] as a proxy to rationalise the decisions of the black-box systems or auxiliary intermediate tasks such as semantic segmentation [25, 32], object detection [16, 31], and affordance prediction [68, 45] provide meaningful intermediate representation for decision-making.

In [6]:
len(embedding_fn.embed_documents([corpus[1].page_content]))

1

In [7]:
import time


# Use a set so the membership test below is O(1) per document.
existing_ids = set(vector_store.get(include=[])["ids"])
doc_to_add = [doc for doc in ds.corpus if doc.id not in existing_ids]
while doc_to_add:
    n = min(100, len(doc_to_add))
    try:
        vector_store.add_documents(doc_to_add[:n])
        doc_to_add = doc_to_add[n:]
    except Exception as e:
        print("Got exception", e)
        print("Sleeping for 3 seconds")
        time.sleep(3)
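
The ingestion loop retries a failing batch indefinitely. A bounded variant with exponential backoff could look like the sketch below. `add_in_batches` and `flaky_add` are hypothetical names, and `flaky_add` only stands in for a call such as `vector_store.add_documents`; the delay is set to zero so the example runs instantly.

```python
import time
from typing import Callable, Sequence


def add_in_batches(
    items: Sequence[str],
    add_fn: Callable[[Sequence[str]], None],
    batch_size: int = 100,
    max_retries: int = 5,
    base_delay: float = 0.0,
) -> int:
    """Send items to add_fn in batches, retrying a failed batch with backoff.

    Returns the number of items added. Re-raises the last error once a batch
    has failed max_retries times in a row.
    """
    added = 0
    for start in range(0, len(items), batch_size):
        batch = items[start : start + batch_size]
        for attempt in range(max_retries):
            try:
                add_fn(batch)
                added += len(batch)
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise
                time.sleep(base_delay * 2**attempt)
    return added


# Hypothetical flaky backend: fails exactly once, on the second call.
calls = {"n": 0}
store: list[str] = []


def flaky_add(batch: Sequence[str]) -> None:
    calls["n"] += 1
    if calls["n"] == 2:
        raise ConnectionError("temporary failure")
    store.extend(batch)


total = add_in_batches([f"doc-{i}" for i in range(250)], flaky_add)
```

Capping the retries turns a persistent failure, such as an unreachable embedding server, into an error instead of an endless loop.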

In [8]:
from waterllmarks.pipeline import DictParser
from langchain_core.runnables import ConfigurableField, RunnableLambda


retriever = vector_store.as_retriever(
    search_type="mmr", search_kwargs={"search_k": 20}
).configurable_alternatives(
    ConfigurableField(id="retriever"),
    default_key="chroma",
    empty=RunnableLambda(lambda _: []),
)

print(ds.qas[0]["reference_context_ids"])
(DictParser("user_input") | retriever).invoke(ds.qas[0])


['1626f35b-7ffc-4736-bd6e-2a2f20e1aa8b']


[Document(id='a09c2e9a-dd24-4e36-9b2e-7d4e15e58639', metadata={'creation_datetime': '2024-03-04', 'file_name': '2402.09193v2.md', 'file_path': 'paper_data/2402.09193v2.md', 'file_size': 69224, 'file_type': '', 'last_accessed_datetime': '2024-03-04', 'last_modified_datetime': '2024-02-22', 'title': '(Ir)Rationality And Cognitive Biases In Large Language Models'}, page_content='## 1 Introduction\n\n can be used to test the rationality of a model; this is a complex problem which requires a consensus what is deemed rational and irrational.  \nMirco Musolesi University College London University of Bologna m.musolesi@ucl.ac.uk In using methods designed to evaluate human reasoning, it is important to acknowledge the performance vs. competence debate [14]. This line of argument encourages *species-fair* comparisons between humans and machines, meaning that we should design tests specific to either humans or machines, as otherwise apparent failures may not reflect underlying capabilities but on
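
The retriever uses `search_type="mmr"` (maximal marginal relevance), which trades relevance to the query against redundancy among the selected documents. The sketch below shows the selection rule on toy 2-D vectors. It is an illustrative pure-Python version, not LangChain's or Chroma's implementation; only the `lambda_mult` name mirrors the LangChain parameter, and the vectors are made up.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr(query, candidates, k=2, lambda_mult=0.5):
    """Return indices of k candidates balancing relevance and diversity."""
    selected: list[int] = []
    remaining = list(range(len(candidates)))

    def score(i: int) -> float:
        relevance = cosine(query, candidates[i])
        # Redundancy is the highest similarity to anything already selected.
        redundancy = max(
            (cosine(candidates[i], candidates[j]) for j in selected), default=0.0
        )
        return lambda_mult * relevance - (1 - lambda_mult) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.5]
docs = [[1.0, 0.4], [1.0, 0.35], [0.2, 1.0]]  # docs 0 and 1 are near-duplicates
diverse = mmr(query, docs, k=2, lambda_mult=0.5)
relevance_only = mmr(query, docs, k=2, lambda_mult=1.0)
```

With `lambda_mult=0.5` the near-duplicate of the first pick is skipped in favor of the less similar document, while `lambda_mult=1.0` reduces to plain relevance ranking.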

## Baseline setting

### QA without RAG

In [36]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.runnables import ConfigurableField

from waterllmarks.pipeline import DictParser

prompt = PromptTemplate(
    input_variables=["pipeline_input"],
    template=format_prompt("""
        Answer the question. Keep the answer short and concise.

        Question: {pipeline_input}
        
        Answer: 
        """),
)

set_input = RunnablePassthrough.assign(pipeline_input=DictParser("user_input"))
output_parser = StrOutputParser(name="content") | (lambda x: x.strip())

llm = RunnablePassthrough.assign(response=prompt | llm_client | output_parser)

norag_chain = set_input | llm.with_config(llm="ollama")

norag_chain.invoke(ds.qas[0])

{'id': '1c40f174-bda1-4c48-88fb-b7449e999067',
 'user_input': "Does the Turing Test assess a machine's ability to exhibit intelligent behavior equivalent to that of a human?",
 'reference': 'Yes',
 'reference_context_ids': ['1626f35b-7ffc-4736-bd6e-2a2f20e1aa8b'],
 'reference_contexts': ["Language represents a rigorously structured communicative system characterized by its grammar and vocabulary. It serves as the principal medium through which humans articulate and convey meaning. This conception of language as a structured communicative tool is pivotal in the realm of computational linguistics, particularly in the development and evaluation of natural language processing (NLP) algorithms. A seminal aspect in this field is the Turing Test, proposed by Alan Turing in 1950 [1], which evaluates a machine's ability to exhibit intelligent behavior equivalent to, or indistinguishable from, that of a human. In this context, the Turing Test primarily assesses the machine's capability to perfor

### QA with RAG

In [29]:
from textwrap import dedent
from langchain_core.prompts import PromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.runnables import RunnablePassthrough, chain, RunnableLambda
from langchain_core.output_parsers import StrOutputParser

from waterllmarks.pipeline import DictParser, DictWrapper

rag_prompt = PromptTemplate(
    input_variables=["pipeline_input", "context"],
    template=format_prompt("""
        Answer the question using the provided context. Keep the answer short and concise.
        
        Context:
        {context}
        
        Question: {pipeline_input}

        Answer: 
        """),
)

context_formatter = RunnableLambda(
    lambda docs: "\n\n".join([doc.page_content for doc in docs]),
)  # list[Document] -> str

ragllm = (
    RunnablePassthrough.assign(
        retrieved_contexts=DictParser("pipeline_input") | retriever,
    )  # {"pipeline_input": str, "retrieved_contexts": list[Document]}
    | RunnablePassthrough.assign(
        context=DictParser("retrieved_contexts") | context_formatter,
    )  # {"pipeline_input": str, "retrieved_contexts": list[Document], "context": str}
    | RunnablePassthrough.assign(
        response=rag_prompt | llm_client | output_parser,
    )  # {"pipeline_input": str, "retrieved_contexts": list[Document], "context": str, "response": str}
).with_config(llm="ollama")

rag_chain = set_input | ragllm

rag_chain.invoke(ds.qas[0])

{'id': '1c40f174-bda1-4c48-88fb-b7449e999067',
 'user_input': "Does the Turing Test assess a machine's ability to exhibit intelligent behavior equivalent to that of a human?",
 'reference': 'Yes',
 'reference_context_ids': ['1626f35b-7ffc-4736-bd6e-2a2f20e1aa8b'],
 'reference_contexts': ["Language represents a rigorously structured communicative system characterized by its grammar and vocabulary. It serves as the principal medium through which humans articulate and convey meaning. This conception of language as a structured communicative tool is pivotal in the realm of computational linguistics, particularly in the development and evaluation of natural language processing (NLP) algorithms. A seminal aspect in this field is the Turing Test, proposed by Alan Turing in 1950 [1], which evaluates a machine's ability to exhibit intelligent behavior equivalent to, or indistinguishable from, that of a human. In this context, the Turing Test primarily assesses the machine's capability to perfor

### QA with RAG but no context

In [11]:
empty_ragllm = ragllm.with_config(retriever="empty")
emptyrag_chain = set_input | empty_ragllm

# rag_results = rag_chain.batch(ds.qas)
# rag_results[0]

emptyrag_chain.invoke(ds.qas[0])


{'id': '1c40f174-bda1-4c48-88fb-b7449e999067',
 'user_input': "Does the Turing Test assess a machine's ability to exhibit intelligent behavior equivalent to that of a human?",
 'reference': 'Yes',
 'reference_context_ids': ['1626f35b-7ffc-4736-bd6e-2a2f20e1aa8b'],
 'reference_contexts': ["Language represents a rigorously structured communicative system characterized by its grammar and vocabulary. It serves as the principal medium through which humans articulate and convey meaning. This conception of language as a structured communicative tool is pivotal in the realm of computational linguistics, particularly in the development and evaluation of natural language processing (NLP) algorithms. A seminal aspect in this field is the Turing Test, proposed by Alan Turing in 1950 [1], which evaluates a machine's ability to exhibit intelligent behavior equivalent to, or indistinguishable from, that of a human. In this context, the Turing Test primarily assesses the machine's capability to perfor

### Evaluation

In [None]:
from concurrent.futures import ThreadPoolExecutor
import pickle
from functools import partial
from ragas.embeddings import LangchainEmbeddingsWrapper
from waterllmarks.evaluation import DEFAULT_ALL_METRICS, WLLMKResult
from ragas.llms import LangchainLLMWrapper
from waterllmarks.evaluation import evaluate
from ragas import RunConfig
from langchain_core.runnables import RunnableConfig, RunnableSequence
from waterllmarks.pipeline import ProgressBarCallback, ThreadedSequence
from collections import Counter

llm_wrapper = LangchainLLMWrapper(llm_client)
embedding_wrapper = LangchainEmbeddingsWrapper(embedding_fn)


eval_fn = partial(
    evaluate,
    metrics=DEFAULT_ALL_METRICS,
    llm=llm_wrapper,
    embeddings=embedding_wrapper,
    runconfig=RunConfig(seed=SEED, max_workers=2),
)

inputs = ds.qas[:100]

rag_results = ThreadedSequence(rag_chain).batch(inputs)
# empty_rag_results = ThreadedSequence(emptyrag_chain).batch(inputs)

baseline_evals = WLLMKResult(
    rag=eval_fn(rag_results),
    #    empty_rag=eval_fn(empty_rag_results),
)

baseline_evals.save(f"results/{SEED}_baseline_results.synthesis")
baseline_evals.save(f"results/{SEED}_baseline_results_noempty.synthesis")

baseline_evals.synthesis


[nltk_data] Downloading package punkt_tab to ./nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to ./nltk_data...
[nltk_data]   Package wordnet is already up-to-date!



{'bleu_score': 0.031939173411512777,
 'rouge_score': 0.11000000000000001,
 'meta_meteor_score': 0.14127721080986377,
 'non_llm_string_similarity': 0.13839285714285715,
 'semantic_similarity': 0.6719361752292586,
 'factual_correctness': 0.335,
 'llm_context_precision_without_reference': 0.49999999998333333,
 'context_recall': 0.7,
 'faithfulness': 0.7333333333333334,
 'context_overlap': 0.0,
 'retrieved_context_similarity': 0.6967715587971113}
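
One way to compare retrieved documents with the reference is a set overlap on document IDs. The sketch below uses a Jaccard index; this is an assumption made for illustration, and the `context_overlap` metric in `waterllmarks.evaluation` may be defined differently. `id_overlap` is a hypothetical name.

```python
def id_overlap(retrieved_ids: list[str], reference_ids: list[str]) -> float:
    """Jaccard overlap between retrieved and reference document IDs."""
    retrieved, reference = set(retrieved_ids), set(reference_ids)
    union = retrieved | reference
    return len(retrieved & reference) / len(union) if union else 1.0


# IDs taken from the retriever example earlier in the notebook.
no_match = id_overlap(
    ["a09c2e9a-dd24-4e36-9b2e-7d4e15e58639"],
    ["1626f35b-7ffc-4736-bd6e-2a2f20e1aa8b"],
)
# Four retrieved and four reference documents sharing exactly one ID.
one_shared = id_overlap(["a", "b", "c", "d"], ["d", "e", "f", "g"])
```

Under this definition, two sets of four documents that share a single ID score 1/7.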

### Setting up the new baseline

In [13]:
baseline_results = rag_chain.batch(inputs)
baseline_qas = [
    {
        "id": res["id"],
        "user_input": res["user_input"],
        "reference": res["response"],
        "reference_contexts": [doc.page_content for doc in res["retrieved_contexts"]],
        "reference_context_ids": [doc.id for doc in res["retrieved_contexts"]],
    }
    for res in baseline_results
]

## Watermarks

### TokenWatermark

In [14]:
from langchain_core.runnables import RunnableParallel

from waterllmarks.watermarks import TokenWatermark

wtmk = TokenWatermark(key=b"0123456789ABCDEF")
apply_watermark = RunnablePassthrough.assign(
    pipeline_input=DictParser("pipeline_input") | wtmk.apply_as_runnable()
)
token_rag_chain = set_input | apply_watermark | ragllm


# token_rag_results = token_rag_chain.invoke(baseline_qas)
# token_rag_results[0]

token_rag_chain.invoke(baseline_qas[0])


{'id': '1c40f174-bda1-4c48-88fb-b7449e999067',
 'user_input': "Does the Turing Test assess a machine's ability to exhibit intelligent behavior equivalent to that of a human?",
 'reference': " No, the Turing Test does not directly assess a machine's ability to exhibit intelligent behavior equivalent to that of a human. Instead, it evaluates a machine's ability to mimic human-like conversation.",
 'reference_contexts': ['## 1 Introduction\n\n can be used to test the rationality of a model; this is a complex problem which requires a consensus what is deemed rational and irrational.  \nMirco Musolesi University College London University of Bologna m.musolesi@ucl.ac.uk In using methods designed to evaluate human reasoning, it is important to acknowledge the performance vs. competence debate [14]. This line of argument encourages *species-fair* comparisons between humans and machines, meaning that we should design tests specific to either humans or machines, as otherwise apparent failures ma

### Rizzo2016 (Character Embedding)

In [39]:
from textwrap import dedent

from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

from waterllmarks.watermarks import Rizzo2016
from waterllmarks.pipeline import RunnableTryFix


wtmk = Rizzo2016(key=b"0123456789ABCDEF")
apply_watermark = RunnablePassthrough.assign(
    pipeline_input=DictParser("pipeline_input") | wtmk.apply_as_runnable()
)

augment_prompt = PromptTemplate(
    input_variables=["pipeline_input"],
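    # format_prompt strips the per-line indentation; dedent would not, because
    # the first line of the template is not indented.
    template=format_prompt("""Increase the query size to at least 105 characters.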

    Query: {pipeline_input}

    Augmented Query: 
    """),
)

augmenter = RunnablePassthrough.assign(
    pipeline_input=DictParser("pipeline_input")
    | augment_prompt
    | llm_client
    | output_parser
)

apply_or_augment = RunnableTryFix(
    primary_step=apply_watermark, fix_step=augmenter, log_failures=True
)

embed_rag_chain = set_input | apply_or_augment | ragllm

# embed_rag_results = embed_rag_chain.batch(baseline_qas)
# embed_rag_results[0]


embed_rag_chain.invoke(baseline_qas[0])


{'id': '1c40f174-bda1-4c48-88fb-b7449e999067',
 'user_input': "Does the Turing Test assess a machine's ability to exhibit intelligent behavior equivalent to that of a human?",
 'reference': " No, the Turing Test does not directly assess a machine's ability to exhibit intelligent behavior equivalent to that of a human. Instead, it evaluates a machine's ability to mimic human-like conversation.",
 'reference_contexts': ['## 1 Introduction\n\n can be used to test the rationality of a model; this is a complex problem which requires a consensus what is deemed rational and irrational.  \nMirco Musolesi University College London University of Bologna m.musolesi@ucl.ac.uk In using methods designed to evaluate human reasoning, it is important to acknowledge the performance vs. competence debate [14]. This line of argument encourages *species-fair* comparisons between humans and machines, meaning that we should design tests specific to either humans or machines, as otherwise apparent failures ma
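
The try-then-fix pattern behind `apply_or_augment` can be sketched in plain Python as follows. This is not the `RunnableTryFix` implementation: `try_fix`, `watermark`, and `augment` are hypothetical stand-ins, the augmenter pads the query deterministically instead of calling an LLM, and the retry limit is an assumption.

```python
from typing import Callable


def try_fix(
    primary: Callable[[str], str],
    fix: Callable[[str], str],
    value: str,
    max_attempts: int = 3,
) -> tuple[str, int]:
    """Run primary; on failure, repair the input with fix and try again.

    Returns the primary step's output and the number of failures seen.
    """
    failures = 0
    while True:
        try:
            return primary(value), failures
        except ValueError:
            failures += 1
            if failures >= max_attempts:
                raise
            value = fix(value)


MIN_LENGTH = 105  # same threshold as the augmentation prompt


def watermark(query: str) -> str:
    # Stand-in for a watermark that rejects inputs that are too short.
    if len(query) < MIN_LENGTH:
        raise ValueError("query too short to carry the watermark")
    return query.upper()


def augment(query: str) -> str:
    # Stand-in for the LLM-based augmenter: pads the query deterministically.
    return query + " Please answer with details and cite the relevant context."


result, failures = try_fix(watermark, augment, "What is the Turing Test?")
```

Returning the failure count alongside the result makes it possible to report how often augmentation was needed, which is what the retry statistics later in the notebook aggregate.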

## Evaluation

### Watermarks vs. baseline

In [16]:
from functools import partial

eval_fn = partial(
    evaluate,
    metrics=DEFAULT_ALL_METRICS,
    llm=llm_wrapper,
    embeddings=embedding_wrapper,
)

token_rag_results = ThreadedSequence(token_rag_chain).batch(baseline_qas)
embed_rag_results = ThreadedSequence(embed_rag_chain).batch(baseline_qas)

token_res = eval_fn(pipeline_results=token_rag_results)
embed_res = eval_fn(pipeline_results=embed_rag_results)

res = WLLMKResult(token=token_res, embed=embed_res)
res.save(f"results/{SEED}_wllmk_results.pkl")
res.synthesis


ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt context_recall_classification_prompt failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.executor:Exception raised in Job[7]: RagasOutputParserException(The output parser failed to parse the output including retries.)



| experiment | bleu_score | rouge_score | meta_meteor_score | non_llm_string_similarity | semantic_similarity | factual_correctness | llm_context_precision_without_reference | context_recall | faithfulness | context_overlap | retrieved_context_similarity |
|---|---|---|---|---|---|---|---|---|---|---|---|
| token | 0.102812 | 0.124725 | 0.136375 | 0.117364 | 0.628211 | 0.2 | 0.0 | 1.0 | 0.0 | 0.142857 | 0.686901 |
| embed | 0.100084 | 0.148571 | 0.113743 | 0.114324 | 0.677962 | 0.0 | 0.666667 | 0.75 | 0.5 | 0.166667 | 0.680833 |


### Number of retries

In [20]:
n_retried_pipelines = 0
mean_retry = 0.0
for r in embed_rag_results:
    if "failures" in r:
        n_retried_pipelines += 1
        mean_retry += r["failures"]

mean_retry /= len(embed_rag_results)

n_retried_pipelines, mean_retry

KeyboardInterrupt: 

### Result analysis

In [18]:
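from pathlib import Path
import re

# Collect the seeds that have saved results, assuming files are named
# "<seed>_<name>.<ext>" as in the save calls earlier in the notebook.
seeds = set()
for f in Path("./results").iterdir():
    match = re.match(r"(\d+)_", f.name)
    if match:
        seeds.add(int(match.group(1)))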

In [19]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# NOTE: `eval_results` is not defined in the cells shown here. It is expected to
# map an experiment name to an evaluation result that exposes `to_pandas()`.
df_full = pd.concat(
    [v.to_pandas().assign(experiment=k) for k, v in eval_results.items()], axis=0
)

# Keep only the metric columns, then reshape to long format for plotting.
df_full = df_full.drop(
    columns=["user_input", "response", "reference", "retrieved_contexts"]
)
df_full = df_full.melt(id_vars=["experiment"], var_name="metric", value_name="value")

plt.figure(figsize=(12, 6))
sns.boxplot(data=df_full, x="metric", y="value", hue="experiment")
# Rotate the metric labels after plotting so the rotation applies to the final ticks.
plt.xticks(rotation=45, ha="right")