# Inference with CLAVE

<a target="_blank" href="https://colab.research.google.com/github/davidaf3/CLAVE/blob/master/src/run_clave.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

This notebook shows how to run inference with CLAVE and builds a Gradio UI for experimenting with the model.

## Setup

Install the necessary dependencies. This only installs the packages that are not available in Colab. If you are not using Colab, you might also need to install `torch`, `requests`, and `tqdm`.

In [1]:
%pip install rarfile gradio

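
Outside Colab, it can help to check which of the packages mentioned above are missing before running the rest of the notebook. The following is a minimal sketch; `missing_packages` is a hypothetical helper name, not part of CLAVE, and it only checks whether a top-level module can be found, not its version.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names that cannot be imported here."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Packages the notebook relies on but does not install itself.
print(missing_packages(["torch", "requests", "tqdm"]))
```

Any name printed by the last line would need a `%pip install` before continuing.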

Clone CLAVE's repo and move into it. If you are running this notebook locally and have already cloned the repo, this step is not necessary.

In [2]:
!git clone https://github.com/davidaf3/CLAVE.git
%cd CLAVE/src

Cloning into 'CLAVE'...
remote: Enumerating objects: 101, done.
remote: Counting objects: 100% (101/101), done.
remote: Compressing objects: 100% (63/63), done.
remote: Total 101 (delta 47), reused 88 (delta 34), pack-reused 0 (from 0)
Receiving objects: 100% (101/101), 183.62 KiB | 944.00 KiB/s, done.
Resolving deltas: 100% (47/47), done.
/content/CLAVE/src


## Download the model weights
First, download the model weights and SentencePiece parameters from the provided URLs:

In [3]:
from tqdm import tqdm
import requests


res = requests.get(
    "https://www.reflection.uniovi.es/bigcode/download/2024/CLAVE/model.rar",
    stream=True,
)

with tqdm(
    total=int(res.headers.get("content-length", 0)), unit="B", unit_scale=True
) as progress_bar:
    with open("model.rar", "wb") as f:
        for data in res.iter_content(1024):
            progress_bar.update(len(data))
            f.write(data)

res = requests.get(
    "https://www.reflection.uniovi.es/bigcode/download/2024/CLAVE/tokenizer_data.zip",
    stream=True,
)

with tqdm(
    total=int(res.headers.get("content-length", 0)), unit="B", unit_scale=True
) as progress_bar:
    with open("tokenizer_data.zip", "wb") as f:
        for data in res.iter_content(1024):
            progress_bar.update(len(data))
            f.write(data)

100%|██████████| 277M/277M [00:24<00:00, 11.4MB/s]
100%|██████████| 1.03M/1.03M [00:00<00:00, 1.18MB/s]
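
The two downloads in the cell above share the same write loop. One possible way to factor it out is sketched below; `save_chunks` is a hypothetical helper, not part of CLAVE, and it accepts any iterable of byte chunks so that it can be tried without network access.

```python
import os
import tempfile
from typing import Callable, Iterable, Optional


def save_chunks(
    chunks: Iterable[bytes],
    path: str,
    on_progress: Optional[Callable[[int], None]] = None,
) -> int:
    """Write an iterable of byte chunks to `path` and return the byte count."""
    written = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
            written += len(chunk)
            if on_progress is not None:
                # Report how many bytes this chunk added.
                on_progress(len(chunk))
    return written


# Stand-in for `res.iter_content(1024)`: any iterable of bytes works.
fake_stream = [b"abc", b"defg", b"hi"]
seen = []
target = os.path.join(tempfile.mkdtemp(), "demo.bin")
total = save_chunks(fake_stream, target, on_progress=seen.append)
print(total, seen)
```

With the objects from the download cell, the call would look like `save_chunks(res.iter_content(1024), "model.rar", progress_bar.update)`.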


Extract the downloaded `model.rar` and `tokenizer_data.zip` files:

In [4]:
import rarfile
import zipfile


with rarfile.RarFile("model.rar") as f:
    f.extractall(path=".")

with zipfile.ZipFile("tokenizer_data.zip") as f:
    f.extractall(path=".")

## Load the weights
Create a new model (`FineTunedModel` class) and load the weights from the extracted file (`CLAVE.pt`):

In [5]:
# Mount Google Drive (the dataset and saved outputs used later live there)
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
import torch
from model import FineTunedModel
from tokenizer import SpTokenizer


device = "cuda" if torch.cuda.is_available() else "cpu"

model = FineTunedModel(
    SpTokenizer.get_vocab_size(), 512, 512, 8, 2048, 6, use_layer_norm=True
).to(device)
model_checkpoint = torch.load("/content/CLAVE/src/CLAVE.pt", map_location=device)
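# Some checkpoint keys start with "_orig_mod." (10 characters); strip that
# prefix so the keys match the parameter names of FineTunedModel.
weights = {
    k[10:] if k.startswith("_orig_mod") else k: v
    for k, v in model_checkpoint["model_state_dict"].items()
}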
model.load_state_dict(weights)
model.eval()

FineTunedModel(
  (encoder): Encoder(
    (transformer_encoder): TransformerEncoder(
      (layers): ModuleList(
        (0-5): 6 x TransformerEncoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
          )
          (linear1): Linear(in_features=512, out_features=2048, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (linear2): Linear(in_features=2048, out_features=512, bias=True)
          (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (dropout1): Dropout(p=0.1, inplace=False)
          (dropout2): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (embedding): Embedding(16000, 512)
    (pos_embedding): Embedding(2048, 512)
    (embedding_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    (embedding_dropout): Dropout(p=0.1, inplace=
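
The loading cell renames checkpoint keys before calling `load_state_dict`. The sketch below shows that renaming on a plain dictionary. It assumes the `_orig_mod.` prefix comes from saving a model wrapped by `torch.compile`, which is a common source of that prefix but is not confirmed for this checkpoint; the key names in `demo` are made up for illustration.

```python
PREFIX = "_orig_mod."  # assumed to come from a torch.compile wrapper


def strip_prefix(state_dict: dict, prefix: str = PREFIX) -> dict:
    """Return a copy of `state_dict` with `prefix` removed from matching keys."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


# Hypothetical keys; real checkpoints map names to tensors.
demo = {"_orig_mod.encoder.embedding.weight": 1, "encoder.bias": 2}
cleaned = strip_prefix(demo)
print(cleaned)
```

Using `len(prefix)` instead of a literal `10` keeps the slice correct if the prefix ever changes.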

In [8]:
import joblib
import numpy as np
import pandas as pd
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

from tokenizer import SpTokenizer

# Train a logistic-regression author classifier on CLAVE embeddings.
# `model` and `device` come from the previous cell.
print("device:", device)

# 1. Tokenizer and padding settings
tokenizer = SpTokenizer()
PADDING_TOK = 0
MAX_LEN = 512


# 2. Wrapper that turns a code string into a CLAVE embedding
class CustomCodeEncoder:
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    def encode(self, code: str):
        with torch.no_grad():
            # Tokenize the input code
            token_ids = self.tokenizer.tokenizes(code)

            # Pad or truncate to MAX_LEN tokens
            if len(token_ids) > MAX_LEN:
                token_ids = token_ids[:MAX_LEN]
            else:
                token_ids += [PADDING_TOK] * (MAX_LEN - len(token_ids))

            # Build a batch containing one sequence
            input_tensor = torch.tensor([token_ids], device=device)

            # The model returns one embedding per input: (1, embedding_dim)
            output = self.model(input_tensor)

            # Drop only the batch dimension. Averaging over dim=1 here would
            # collapse the whole embedding into a single scalar.
            return output.squeeze(0).cpu().numpy()


code_encoder = CustomCodeEncoder(model, tokenizer)

# 3. Load the labeled dataset (needs 'Code Content' and 'label' columns)
df = pd.read_csv("/content/drive/MyDrive/MAIN_PROJECT/labeled_code_data.csv")
print("CSV file loaded")
codes = df["Code Content"].tolist()
labels = df["label"].tolist()

# 4. Generate one embedding per code snippet
print("Encoding code snippets with CLAVE...")
embeddings = [code_encoder.encode(code) for code in codes]
embeddings_array = np.vstack(embeddings)  # Shape: (num_samples, embedding_dim)
labels_array = np.array(labels)

# 5. Cache the embeddings and labels on Drive
np.save("/content/drive/MyDrive/MAIN_PROJECT/embeddings.npy", embeddings_array)
np.save("/content/drive/MyDrive/MAIN_PROJECT/labels.npy", labels_array)
print("Embeddings and labels saved!")

# To skip re-encoding on a later run, load the cached arrays instead:
# embeddings_array = np.load("/content/drive/MyDrive/MAIN_PROJECT/embeddings.npy")
# labels_array = np.load("/content/drive/MyDrive/MAIN_PROJECT/labels.npy")

# 6. Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    embeddings_array, labels_array, test_size=0.2, random_state=42
)

# 7. Train the classifier
print("Training the logistic regression classifier...")
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# 8. Evaluate
y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print("Test Accuracy:", acc)

# 9. Save the classifier
joblib.dump(clf, "/content/drive/MyDrive/MAIN_PROJECT/author_classifier.joblib")

# 10. Load the label-to-author mapping (needs 'label' and 'Author Name' columns)
mapping_df = pd.read_csv("/content/drive/MyDrive/MAIN_PROJECT/author_label_mapping.csv")
label_to_author = dict(zip(mapping_df["label"], mapping_df["Author Name"]))

# 11. Predict the author of a new code snippet
new_code = input("Enter the code: ")
new_embedding = code_encoder.encode(new_code)
# scikit-learn expects a 2D array of shape (n_samples, n_features)
predicted_label = clf.predict(new_embedding.reshape(1, -1))[0]
predicted_author = label_to_author[predicted_label]

print("Predicted Author:", predicted_author)
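
The `encode` method in the cell above pads or truncates every token sequence to 512 tokens before building the input tensor. That step can be sketched in isolation as follows. `pad_or_truncate` is a hypothetical helper name; the defaults `max_len=512` and `pad_id=0` are taken from the cell, and whether `0` is the padding id CLAVE was trained with is an assumption.

```python
def pad_or_truncate(token_ids, max_len=512, pad_id=0):
    """Return a new list of exactly `max_len` ids, cut or right-padded."""
    ids = list(token_ids)[:max_len]  # copy, then drop anything past max_len
    return ids + [pad_id] * (max_len - len(ids))


short = pad_or_truncate([5, 6, 7], max_len=5)
long = pad_or_truncate(range(10), max_len=4)
print(short, long)
```

Unlike the in-place `+=` in the cell, this version returns a new list and leaves the tokenizer's output untouched.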

## Start the UI
Start the Gradio UI, which runs the `verify_authorship` function. The function tokenizes both inputs, passes the tokens through CLAVE to obtain one embedding per input, and computes the cosine distance between the embeddings. The two snippets are attributed to the same author when that distance is at or below `threshold`.

In [None]:
import gradio as gr
import torch.nn.functional as F
from utils import pad_and_split_tokens


tokenizer = SpTokenizer()
threshold = 0.1050  # maximum cosine distance for a "same author" verdict


def verify_authorship(source_code_1, source_code_2):
    with torch.inference_mode():
        tokens_1 = pad_and_split_tokens(tokenizer.tokenizes(source_code_1))[0]
        tokens_2 = pad_and_split_tokens(tokenizer.tokenizes(source_code_2))[0]
        embedding_1 = model(torch.tensor([tokens_1], device=device))
        embedding_2 = model(torch.tensor([tokens_2], device=device))
        distance = (1 - F.cosine_similarity(embedding_1, embedding_2)).item()
        return [
            distance,
            "Yes" if distance <= threshold else "No",
        ]


ui = gr.Interface(
    fn=verify_authorship,
    inputs=[
        gr.Code(language="python", label="Source code 1"),
        gr.Code(language="python", label="Source code 2"),
    ],
    outputs=[gr.Number(label="Distance"), gr.Text(label="Same author?")],
    allow_flagging="never",
)
ui.launch()
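
The decision rule inside `verify_authorship` is a cosine distance compared against a threshold. The sketch below reproduces that rule on plain Python lists so it can be followed without the model. The helper names are hypothetical, the vectors are toy values rather than CLAVE embeddings, and zero vectors are not handled.

```python
import math

THRESHOLD = 0.1050  # same value as `threshold` in the UI cell


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def same_author(u, v, threshold=THRESHOLD):
    """Apply the UI's rule: same author when the distance is <= threshold."""
    return cosine_distance(u, v) <= threshold


# Toy vectors: parallel vectors have distance 0, orthogonal ones distance 1.
print(same_author([1.0, 0.0], [2.0, 0.0]))
print(same_author([1.0, 0.0], [0.0, 1.0]))
```

Because cosine distance ignores vector length, only the direction of the embeddings affects the verdict.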