<a href="https://colab.research.google.com/github/its-emile/watermarks/blob/main/Logit_Watermark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction
In this notebook, I demonstrate how an LLM can pick each next token according to a deterministic pattern of ranks among its likeliest candidates, rather than sampling at random or always taking the single most likely token.

Note: GPT2 is used here purely for demonstration. Its outputs are often poorly aligned or incoherent; any 7B or 13B model would be more coherent.

In [6]:
import tensorflow as tf
from transformers import GPT2Tokenizer, TFGPT2LMHeadModel
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = TFGPT2LMHeadModel.from_pretrained('gpt2')



# Generating with deterministic logit ranks:
Here, I use a deterministic pattern to determine the rank of the next token to output, among the likeliest potential tokens.

In [7]:
# Use a unique signature. The model reuses this pattern cyclically as a "bias" across its top candidate tokens.
# Keep the ranks low: each must be smaller than len(watermark), because only the top len(watermark) candidates are considered.
watermark = [0,1,1,0,2]
text = "When you walk your dog, stay safe by" # Plain sentence completion for now, not an instruct-model format (TODO)
while True:
  encoded_input = tokenizer(text, return_tensors='tf')
  output = model(encoded_input)
  logits = output.logits[0, -1, :]
  softmax = tf.math.softmax(logits, axis=-1)
  result = tf.math.top_k(softmax, len(watermark))
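  # NOTE: `result` is a (values, indices) pair, so this loop runs twice: the probabilities
  # decode to the "!!!!!" seen in the output, and the indices decode to the candidate tokens.
  candidates = ""
  for r in result:
    candidates += ", "+tokenizer.decode(r)
  print("candidates:"+candidates)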


  # Rank dictated by the watermark at this position (TODO option: end greedily as soon as any candidate is a full stop)
  rank = watermark[len(encoded_input.input_ids[0]) % len(watermark)]
  print("selected rank",rank)

  select = result.indices[rank]
  token = tokenizer.decode(select)
  print(text, "[", token, "]")
  text += token

  # Stop generating when appropriate for the demo: past 50 tokens, or at the first full stop.
  if len(encoded_input.input_ids[0]) > 50 or "." in token:
    break

candidates:, !!!!!,  your keeping the using following
selected rank 2
When you walk your dog, stay safe by [  the ]
candidates:, !!!!!,  side door leash front dog
selected rank 0
When you walk your dog, stay safe by the [  side ]
candidates:, !!!!!,  of or., and
selected rank 1
When you walk your dog, stay safe by the side [  or ]
candidates:, !!!!!,  the behind in side on
selected rank 1
When you walk your dog, stay safe by the side or [  behind ]
candidates:, !!!!!,  the you a your him
selected rank 0
When you walk your dog, stay safe by the side or behind [  the ]
candidates:, !!!!!,  fence dog door wheel car
selected rank 2
When you walk your dog, stay safe by the side or behind the [  door ]
candidates:, !!!!!, ., and or when
selected rank 0
When you walk your dog, stay safe by the side or behind the door [ . ]
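
The selection rule used above can be isolated from the model. The following is a minimal sketch, assuming a hypothetical six-token vocabulary with made-up logits (`toy_logits`) and a helper named `pick_by_rank` that is not part of the notebook. It only illustrates how the watermark entry for a position picks one of the top candidates.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pick_by_rank(logits, position, watermark):
    """Return the token id whose rank equals the watermark entry for this position."""
    probs = softmax(logits)
    # Token ids sorted from most to least likely, truncated to the watermark length.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:len(watermark)]
    rank = watermark[position % len(watermark)]  # the pattern repeats cyclically
    return ranked[rank]

watermark = [0, 1, 1, 0, 2]
toy_logits = [0.1, 2.0, 1.5, -1.0, 0.7, 0.3]  # hypothetical scores, not from GPT2

for position in range(5):
    print(position, pick_by_rank(toy_logits, position, watermark))
```

In the notebook, the position is the number of tokens already in the context, so the phase of the pattern depends on the prompt length.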


# Catching on
Now we analyze the "supplied" text: by recomputing the candidate tokens one position at a time, we can recover the "rank pattern" that the message carries.

In [8]:
encoded_input = tokenizer(text, return_tensors='tf')
ranks=[]
for i in range(encoded_input.input_ids[0].shape[0]-1):
  output = model({'input_ids':encoded_input.input_ids[:,0:i+1],'attention_mask':encoded_input.attention_mask[:,0:i+1]})
  logits = output.logits[0, -1, :]
  softmax = tf.math.softmax(logits, axis=-1)
  # What does the model propose as the likeliest next tokens?
  # (We could widen this list, but we assume the watermark length is the best range to consider.)
  candidates = tf.math.top_k(softmax, len(watermark)).indices.numpy().tolist()

  # Which token is actually next?
  test = int(encoded_input.input_ids[0][i+1].numpy())

  # Detect the rank of the next token, -1 if it is outside the top predictions.
  rank = candidates.index(test) if test in candidates else -1
  ranks.append(rank)

# Ranks of the sample's tokens among the predictions. This is the "hidden pattern" in the text, if any!
print(ranks)


[3, -1, -1, 0, 0, -1, -1, -1, 2, 0, 1, 1, 0, 2, 0]
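
The round trip (generate with a rank pattern, then recover it) can also be checked without GPT2. This is a minimal sketch under a loud assumption: `toy_ranked` is a hypothetical stand-in for the model that returns a deterministic top-k list for any context. The `generate` and `recover_ranks` helpers are illustrative names, not code from the notebook.

```python
def toy_ranked(context, k):
    """Hypothetical stand-in for a language model: top-k token ids for a context."""
    shift = sum(context) % 10
    return [(shift + j) % 10 for j in range(k)]

def generate(prompt, watermark, steps):
    """Append `steps` tokens, each chosen at the rank the watermark dictates."""
    tokens = list(prompt)
    for _ in range(steps):
        rank = watermark[len(tokens) % len(watermark)]
        tokens.append(toy_ranked(tokens, len(watermark))[rank])
    return tokens

def recover_ranks(tokens, k):
    """Replay the text one token at a time and record each token's rank (-1 if outside the top k)."""
    ranks = []
    for i in range(1, len(tokens)):
        candidates = toy_ranked(tokens[:i], k)
        ranks.append(candidates.index(tokens[i]) if tokens[i] in candidates else -1)
    return ranks

watermark = [0, 1, 1, 0, 2]
tokens = generate([3, 8], watermark, steps=6)
print(recover_ranks(tokens, len(watermark)))
```

As in the GPT2 run, the ranks for the prompt tokens are arbitrary, while the ranks for the generated tokens follow the watermark cyclically.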


# Find the longest run matching the watermark pattern
This is still a fairly basic demo, but a long run would be very unlikely in any particular sample unless the text was generated with this watermark. Only short runs (or medium-length runs made of very low ranks) have a realistic chance of being human-written.

Improvements here will include incorporating Levenshtein distance to tolerate point replacements that would otherwise break up very long runs. Such robustness should come with a remark to the user about likely human edits (which may be a *desirable* behavior).

In [16]:

import math

maxrun = 0
for k in range(len(ranks)): # start comparing from ranks[k]
  for start in range(len(watermark)): # try every phase of the watermark
    run = 0
    # count as long as the pattern matches the watermark, wrapping around it cyclically
    while k+run < len(ranks) and ranks[k+run] == watermark[(start+run) % len(watermark)]:
      run += 1
    maxrun = max(maxrun, run)

# heuristic confidence that grows with the length of the matched run
est_likelihood = 1 - math.exp(-maxrun/2)

print("detected a sequence as long as",maxrun,"tokens matching the watermark pattern (watermark confidence (not original text): ",f"{int(est_likelihood*100)}/100)")

detected a sequence as long as 7 tokens matching the watermark pattern (watermark confidence (not original text):  96/100)
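
The Levenshtein-based improvement mentioned above could look roughly like this. It is a minimal sketch, not the notebook's method: `levenshtein` is the standard dynamic-programming edit distance, while `expected_cycle` and the `observed` ranks are hypothetical and chosen only for illustration.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two sequences."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (x != y),  # substitution (free when equal)
            ))
        previous = current
    return previous[-1]

def expected_cycle(watermark, start, length):
    """The rank sequence the watermark would produce from a given phase."""
    return [watermark[(start + i) % len(watermark)] for i in range(length)]

watermark = [0, 1, 1, 0, 2]
observed = [2, 0, 1, 4, 0, 2, 0]  # hypothetical: the fourth rank was altered by an edit
expected = expected_cycle(watermark, start=4, length=len(observed))
print(levenshtein(observed, expected))
```

A small distance relative to the run length would suggest watermarked text with a few edits. In practice an edited token also changes the context for later predictions, so more than one rank may shift after an edit.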


Since short runs or very low ranks carry some chance of being human, a proper confidence evaluation would require expanding the loop above to extract the softmax probabilities of these particular tokens. That said, the matched length is a quick indicator that an output was likely generated with this watermark algorithm.
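
One way such a probability-based check might work is sketched below. It assumes (loudly) that an unwatermarked writer behaves like independent sampling from the model's own distribution, so the chance of reproducing a matched run is the product of the matched tokens' softmax probabilities. The `matched_probs` values are hypothetical, not extracted from GPT2, and `chance_of_run` is an illustrative helper.

```python
import math

def chance_of_run(matched_probs):
    """Probability that independent sampling from the model reproduces every matched token."""
    # Summing logs avoids underflow on long runs.
    return math.exp(sum(math.log(p) for p in matched_probs))

# Hypothetical softmax probabilities of the tokens found at the watermark-designated ranks.
matched_probs = [0.10, 0.35, 0.08, 0.12, 0.30, 0.05, 0.40]
p_chance = chance_of_run(matched_probs)
print(f"chance of an unwatermarked sampler producing this run: {p_chance:.2e}")
```

Under this assumption, a run made mostly of rank-0 tokens with high probabilities yields a large chance value, which matches the remark that very low ranks are weak evidence on their own.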