# Inference with CLAVE

<a target="_blank" href="https://colab.research.google.com/github/davidaf3/CLAVE/blob/master/src/run_clave.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

This notebook shows how to run inference with CLAVE and creates a Gradio UI that lets you experiment with the model.

## Setup

Install the necessary dependencies. This only installs the packages that are not available in Colab by default. If you are not using Colab, you might also need to install `torch`, `requests`, and `tqdm`.

In [1]:
%pip install rarfile gradio

Collecting rarfile
  Downloading rarfile-4.2-py3-none-any.whl.metadata (4.4 kB)
Collecting gradio
  Downloading gradio-5.25.2-py3-none-any.whl.metadata (16 kB)
Collecting aiofiles<25.0,>=22.0 (from gradio)
  Downloading aiofiles-24.1.0-py3-none-any.whl.metadata (10 kB)
Collecting fastapi<1.0,>=0.115.2 (from gradio)
  Downloading fastapi-0.115.12-py3-none-any.whl.metadata (27 kB)
Collecting ffmpy (from gradio)
  Downloading ffmpy-0.5.0-py3-none-any.whl.metadata (3.0 kB)
Collecting gradio-client==1.8.0 (from gradio)
  Downloading gradio_client-1.8.0-py3-none-any.whl.metadata (7.1 kB)
Collecting groovy~=0.1 (from gradio)
  Downloading groovy-0.1.2-py3-none-any.whl.metadata (6.1 kB)
Collecting pydub (from gradio)
  Downloading pydub-0.25.1-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting python-multipart>=0.0.18 (from gradio)
  Downloading python_multipart-0.0.20-py3-none-any.whl.metadata (1.8 kB)
Collecting ruff>=0.9.3 (from gradio)
  Downloading ruff-0.11.6-py3-none-manylinux_2_17_x86_6

In [9]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Clone CLAVE's repo and move into it. If you are running this notebook locally and have already cloned the repo, this step is not necessary.

In [2]:
!git clone https://github.com/davidaf3/CLAVE.git
%cd CLAVE/src

Cloning into 'CLAVE'...
remote: Enumerating objects: 101, done.
remote: Counting objects: 100% (101/101), done.
remote: Compressing objects: 100% (63/63), done.
remote: Total 101 (delta 47), reused 88 (delta 34), pack-reused 0 (from 0)
Receiving objects: 100% (101/101), 183.62 KiB | 3.91 MiB/s, done.
Resolving deltas: 100% (47/47), done.
/content/CLAVE/src


## Download the model weights
First, download the model weights and SentencePiece parameters from the provided URLs:

In [3]:
from tqdm import tqdm
import requests


res = requests.get(
    "https://www.reflection.uniovi.es/bigcode/download/2024/CLAVE/model.rar",
    stream=True,
)

with tqdm(
    total=int(res.headers.get("content-length", 0)), unit="B", unit_scale=True
) as progress_bar:
    with open("model.rar", "wb") as f:
        for data in res.iter_content(1024):
            progress_bar.update(len(data))
            f.write(data)

res = requests.get(
    "https://www.reflection.uniovi.es/bigcode/download/2024/CLAVE/tokenizer_data.zip",
    stream=True,
)

with tqdm(
    total=int(res.headers.get("content-length", 0)), unit="B", unit_scale=True
) as progress_bar:
    with open("tokenizer_data.zip", "wb") as f:
        for data in res.iter_content(1024):
            progress_bar.update(len(data))
            f.write(data)

100%|██████████| 277M/277M [00:23<00:00, 11.6MB/s]
100%|██████████| 1.03M/1.03M [00:00<00:00, 1.73MB/s]
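
The two download blocks above share the same streaming logic, so they could be factored into a single helper. The following is a minimal sketch rather than part of CLAVE: `download_file` is a hypothetical name, and the sketch swaps `requests` for the standard library's `urllib.request` and drops the `tqdm` progress bar so that it has no third-party dependencies.

```python
import urllib.request


def download_file(url: str, dest: str, chunk_size: int = 1024 * 1024) -> int:
    """Stream `url` into the file `dest` and return the number of bytes written.

    Hypothetical helper, not part of CLAVE. `urlopen` raises an exception on
    HTTP error responses, and it runs before `dest` is opened, so a failed
    request does not leave an empty output file behind.
    """
    written = 0
    with urllib.request.urlopen(url) as response, open(dest, "wb") as f:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:  # an empty read means the stream is exhausted
                break
            f.write(chunk)
            written += len(chunk)
    return written
```

With this helper, each download reduces to one call, for example `download_file("https://www.reflection.uniovi.es/bigcode/download/2024/CLAVE/model.rar", "model.rar")`.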


Extract the downloaded `model.rar` and `tokenizer_data.zip` files:

In [4]:
import rarfile
import zipfile


with rarfile.RarFile("model.rar") as f:
    f.extractall(path=".")

with zipfile.ZipFile("tokenizer_data.zip") as f:
    f.extractall(path=".")
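
Before loading the weights, it can be worth confirming that extraction produced the expected files. The sketch below is an illustration only: `missing_files` is a hypothetical helper, and `CLAVE.pt` is the only file name taken from this notebook.

```python
from pathlib import Path


def missing_files(expected, base_dir="."):
    """Return the names in `expected` that do not exist under `base_dir`."""
    base = Path(base_dir)
    return [name for name in expected if not (base / name).exists()]


# After a successful extraction this should print an empty list.
print(missing_files(["CLAVE.pt"]))
```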

## Load the weights
Create a new model (`FineTunedModel` class) and load the weights from the extracted file (`CLAVE.pt`):

In [5]:
import torch
from model import FineTunedModel
from tokenizer import SpTokenizer


device = "cuda" if torch.cuda.is_available() else "cpu"

model = FineTunedModel(
    SpTokenizer.get_vocab_size(), 512, 512, 8, 2048, 6, use_layer_norm=True
).to(device)
model_checkpoint = torch.load("CLAVE.pt", map_location=device)
weights = {
    k[10:] if k.startswith("_orig_mod") else k: v
    for k, v in model_checkpoint["model_state_dict"].items()
}
model.load_state_dict(weights)
model.eval()

FineTunedModel(
  (encoder): Encoder(
    (transformer_encoder): TransformerEncoder(
      (layers): ModuleList(
        (0-5): 6 x TransformerEncoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
          )
          (linear1): Linear(in_features=512, out_features=2048, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (linear2): Linear(in_features=2048, out_features=512, bias=True)
          (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (dropout1): Dropout(p=0.1, inplace=False)
          (dropout2): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (embedding): Embedding(16000, 512)
    (pos_embedding): Embedding(2048, 512)
    (embedding_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    (embedding_dropout): Dropout(p=0.1, inplace=
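
The dictionary comprehension in the cell above drops the first 10 characters of every key that starts with `_orig_mod`, which is the length of the prefix `_orig_mod.`. PyTorch adds that prefix to state-dict keys when a model has been wrapped with `torch.compile`; that the checkpoint was saved from a compiled model is an assumption here. The renaming step can be sketched on a plain dictionary, where `strip_compile_prefix` is a hypothetical name and the example keys are illustrative only:

```python
PREFIX = "_orig_mod."  # 10 characters, matching the k[10:] slice above


def strip_compile_prefix(state_dict):
    """Return a copy of `state_dict` with the `_orig_mod.` prefix removed from keys."""
    return {
        (key[len(PREFIX):] if key.startswith(PREFIX) else key): value
        for key, value in state_dict.items()
    }


# Illustrative keys and placeholder values, not taken from the real checkpoint.
example = {"_orig_mod.encoder.embedding.weight": 1, "head.bias": 2}
print(strip_compile_prefix(example))
# → {'encoder.embedding.weight': 1, 'head.bias': 2}
```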

## Start the UI
Start the Gradio UI configured to run the `verify_authorship` function. This function tokenizes the inputs, processes the tokens with CLAVE to obtain an embedding for each input, and computes the distance between the embeddings.

In [6]:
import gradio as gr
import torch.nn.functional as F
from utils import pad_and_split_tokens


tokenizer = SpTokenizer()
threshold = 0.1050


def verify_authorship(source_code_1, source_code_2):
    with torch.inference_mode():
        tokens_1 = pad_and_split_tokens(tokenizer.tokenizes(source_code_1))[0]
        tokens_2 = pad_and_split_tokens(tokenizer.tokenizes(source_code_2))[0]
        embedding_1 = model(torch.tensor([tokens_1], device=device))
        embedding_2 = model(torch.tensor([tokens_2], device=device))
        distance = (1 - F.cosine_similarity(embedding_1, embedding_2)).item()
        return [
            distance,
            "Yes" if distance <= threshold else "No",
        ]


ui = gr.Interface(
    fn=verify_authorship,
    inputs=[
        gr.Code(language="python", label="Source code 1"),
        gr.Code(language="python", label="Source code 2"),
    ],
    outputs=[gr.Number(label="Distance"), gr.Text(label="Same author?")],
    allow_flagging="never",
)
ui.launch()



It looks like you are running Gradio on a hosted a Jupyter notebook. For the Gradio app to work, sharing must be enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://c474ce4118723edd2a.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
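
The decision rule inside `verify_authorship` can be illustrated without the model: the cosine distance is one minus the cosine similarity of the two embeddings, and the pair is attributed to the same author when that distance is at most the threshold. The sketch below uses plain Python lists in place of CLAVE embeddings; `cosine_distance` and `same_author` are hypothetical names and the vectors are made up.

```python
import math

THRESHOLD = 0.1050  # same value as `threshold` in the cell above


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def same_author(a, b, threshold=THRESHOLD):
    """Apply the notebook's rule: same author if the distance is within the threshold."""
    return cosine_distance(a, b) <= threshold


print(same_author([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # parallel vectors
print(same_author([1.0, 0.0], [0.0, 1.0]))            # orthogonal vectors
```

Parallel vectors have a distance close to 0 and fall within the threshold, while orthogonal vectors have a distance of 1 and do not.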




## Novelty 1: Image as Input for Authorship Verification

In [7]:
# Install Tesseract OCR and its Python wrapper so code can be read from images
!sudo apt update
!sudo apt install tesseract-ocr -y
!pip install pytesseract

Get:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,632 B]
Hit:2 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:3 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Get:4 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Get:5 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease [1,581 B]
Get:6 https://r2u.stat.illinois.edu/ubuntu jammy InRelease [6,555 B]
Hit:7 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Get:9 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages [1,604 kB]
Hit:

In [8]:
import gradio as gr
import torch
import torch.nn.functional as F
import cv2

from PIL import Image
import numpy as np
from utils import pad_and_split_tokens
from tokenizer import SpTokenizer
import pytesseract
pytesseract.pytesseract.tesseract_cmd = "/usr/bin/tesseract"

# --- Create the CLAVE tokenizer; `model` is the CLAVE model loaded above ---
tokenizer = SpTokenizer()
threshold = 0.1050
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def process_image_and_verify(image, source_code_2):
    try:
        img = np.array(image)
        gray = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
        extracted_code = pytesseract.image_to_string(gray)

        tokens_1 = pad_and_split_tokens(tokenizer.tokenizes(extracted_code))[0]
        tokens_2 = pad_and_split_tokens(tokenizer.tokenizes(source_code_2))[0]

        with torch.inference_mode():  # no gradients are needed for inference
            embedding_1 = model(torch.tensor([tokens_1], device=device))
            embedding_2 = model(torch.tensor([tokens_2], device=device))
            distance = (1 - F.cosine_similarity(embedding_1, embedding_2)).item()

        result = "Yes" if distance <= threshold else "No"
        return extracted_code, distance, result

    except Exception as e:
        return f"[ERROR] {str(e)}", None, "Error"

# Gradio UI
ui = gr.Interface(
    fn=process_image_and_verify,
    inputs=[
        gr.Image(type="pil", label="Upload Code Image"),
        gr.Code(language="python", label="Enter Code (Text)")
    ],
    outputs=[
        gr.Textbox(label="Extracted Code from Image"),
        gr.Number(label="Embedding Distance"),
        gr.Text(label="Same Author?")
    ],
    title="CLAVE Authorship Verification from Image + Code",
    allow_flagging="never"
)

ui.launch()



It looks like you are running Gradio on a hosted a Jupyter notebook. For the Gradio app to work, sharing must be enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://58996c4e12ed7ad00d.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
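
OCR output is rarely an exact copy of the original source: Tesseract may append a form-feed page separator, and OCR engines can emit typographic quotes where the code has ASCII quotes. Because the tokenizer sees whatever text the OCR step returns, a light normalization pass before tokenization might help. The sketch below is an illustration under those assumptions: `clean_ocr_text` is a hypothetical helper, not part of CLAVE or pytesseract, and its effect on verification accuracy has not been measured here.

```python
# Map typographic quotes that OCR engines sometimes emit to ASCII quotes.
_QUOTE_MAP = str.maketrans({
    "\u2018": "'",
    "\u2019": "'",
    "\u201c": '"',
    "\u201d": '"',
})


def clean_ocr_text(text: str) -> str:
    """Normalize OCR output while keeping leading indentation intact."""
    text = text.replace("\x0c", "")     # drop form-feed page separators
    text = text.translate(_QUOTE_MAP)   # typographic quotes -> ASCII quotes
    lines = [line.rstrip() for line in text.splitlines()]  # trim trailing spaces
    return "\n".join(lines).strip("\n")  # drop blank lines at both ends
```

In `process_image_and_verify`, this could be applied as `extracted_code = clean_ocr_text(pytesseract.image_to_string(gray))`.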




## Novelty 2: Authorship Attribution

In [10]:
# Predict the author of a code snippet using CodeBERT + Logistic Regression

import gradio as gr
import torch
import joblib
import pandas as pd
from transformers import AutoTokenizer, AutoModel

# Load device and models
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load CodeBERT under its own names so that the CLAVE `tokenizer` and `model`
# used by the earlier UIs are not overwritten
codebert_tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
codebert_model = AutoModel.from_pretrained("microsoft/codebert-base").to(device)
codebert_model.eval()

# Load classifier
clf = joblib.load("/content/drive/MyDrive/MAIN_PROJECT/author_classifier.joblib")

# Load label → author name mapping
mapping_df = pd.read_csv("/content/drive/MyDrive/MAIN_PROJECT/author_label_mapping.csv")  # Must contain 'label' and 'Author Name'
label_to_author = dict(zip(mapping_df["label"], mapping_df["Author Name"]))

# Encoder class
class CodeBERTCodeEncoder:
    def encode(self, code: str):
        with torch.no_grad():
            inputs = codebert_tokenizer(code, return_tensors="pt", padding="max_length", truncation=True, max_length=512)
            inputs = {key: val.to(device) for key, val in inputs.items()}
            outputs = codebert_model(**inputs)
            # Mean-pool the token embeddings into a single vector
            embedding = outputs.last_hidden_state.mean(dim=1).squeeze().cpu().numpy()
            return embedding

encoder = CodeBERTCodeEncoder()

# Prediction function
def predict_author(code_snippet):
    embedding = encoder.encode(code_snippet)
    pred_label = clf.predict([embedding])[0]
    author_name = label_to_author.get(pred_label, "Unknown Author")
    return f"Predicted Author: {author_name}"

# Gradio interface
iface = gr.Interface(
    fn=predict_author,
    inputs=gr.Textbox(lines=10, label="Enter Code"),
    outputs=gr.Text(label="Predicted Author"),
    title="Code Authorship Classifier",
    description="Paste a code snippet to predict its author using CodeBERT + Logistic Regression."
)

iface.launch()


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/498 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/150 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/499M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/499M [00:00<?, ?B/s]

https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations


It looks like you are running Gradio on a hosted a Jupyter notebook. For the Gradio app to work, sharing must be enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://dc0e8e41477d4ccee5.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
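
The attribution cell expects `author_label_mapping.csv` to contain `label` and `Author Name` columns. A small check before building the lookup can turn a confusing `KeyError` into a clear message. The sketch below uses only the standard library's `csv` module instead of pandas; `load_label_mapping` is a hypothetical helper, the sample rows are made up, and treating labels as integers is an assumption, since the classifier's label type is not shown in this notebook.

```python
import csv
import io

REQUIRED_COLUMNS = ("label", "Author Name")


def load_label_mapping(csv_file):
    """Read a label -> author mapping from an open CSV file object."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"Mapping file is missing columns: {missing}")
    # Assumption: labels are integers, as pandas would infer for a numeric column.
    return {int(row["label"]): row["Author Name"] for row in reader}


# Made-up sample data standing in for the real mapping file.
sample = io.StringIO("label,Author Name\n0,alice\n1,bob\n")
mapping = load_label_mapping(sample)
print(mapping.get(5, "Unknown Author"))  # same fallback as predict_author
# → Unknown Author
```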


