This notebook fine-tunes a SentenceTransformer with a triplet loss so that the embedding of each text moves toward the embedding of a short “violation” or “non-violation” label description. It uses two one-sentence label descriptions, triplets of (text, matching description, opposite description), and a Euclidean triplet loss to shape the embedding space.
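
As a rough illustration of what a Euclidean triplet loss computes, the sketch below evaluates `max(d(a, p) - d(a, n) + margin, 0)` on toy 2-D vectors. The function name `triplet_euclidean_loss` and the vectors are made up for illustration; the notebook itself relies on `losses.TripletLoss` from sentence-transformers rather than this code.

```python
import math

def triplet_euclidean_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss with Euclidean distance (illustrative only)."""
    d_pos = math.dist(anchor, positive)   # distance to the matching description
    d_neg = math.dist(anchor, negative)   # distance to the opposite description
    return max(d_pos - d_neg + margin, 0.0)

# Toy 2-D "embeddings" (hypothetical values, not real model output).
anchor = [0.0, 0.0]
near = [0.0, 1.0]   # distance 1 from the anchor
far = [3.0, 4.0]    # distance 5 from the anchor

# Positive already much closer than negative: the margin is satisfied.
loss_satisfied = triplet_euclidean_loss(anchor, near, far)
# Positive farther than negative: the loss is positive and training would push back.
loss_violated = triplet_euclidean_loss(anchor, far, near)
```

The loss is zero once the matching description is closer than the opposite one by at least the margin, so only triplets that still violate the margin contribute gradient.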

# Imports

### Load libraries

In [1]:
import os
import pickle
import platform
import re

import numpy as np
import pandas as pd
import torch
from sentence_transformers import InputExample, SentenceTransformer, losses
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC
from torch.utils.data import DataLoader
from tqdm import tqdm
from xgboost import XGBClassifier

### Load the data

In [2]:
train_df = pd.read_csv("data/train.csv")
test_df = pd.read_csv("data/test.csv")
solution_df = pd.read_csv("data/solution.csv")

### Device / encoder setup

In [3]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

Using device: cuda


In [4]:
model = SentenceTransformer("intfloat/e5-large-v2", device=device)

train_texts = train_df["body"].tolist()
train_labels = train_df["rule_violation"].astype(int).tolist()

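# Indexed by label, assuming rule_violation == 1 marks a violation:
# pos_descriptions[label] must be the description that matches the text.
pos_descriptions = [
    "This text does not violate the rule.",  # label 0
    "This text violates the rule."           # label 1
]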
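
The E5 model card recommends that every input start with a `query: ` or `passage: ` prefix, which the cell above does not add. A minimal sketch of how such a prefix could be applied before building triplets follows; the helper name `with_e5_prefix` is hypothetical, and the choice of `query: ` for this task is an assumption rather than something this notebook establishes.

```python
def with_e5_prefix(texts, prefix="query: "):
    """Prepend an E5-style prefix to each text unless it is already present."""
    return [t if t.startswith(prefix) else prefix + t for t in texts]

# Toy inputs (made up for illustration).
examples = ["This text violates the rule.", "query: already prefixed"]
prefixed = with_e5_prefix(examples)
```

If prefixes are used during fine-tuning, the same prefix would need to be applied at inference time so that training and encoding see the same input format.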

### Triplet Generation

In [7]:
triplets = []

for text, label in zip(train_texts, train_labels):
    anchor = text
    positive = pos_descriptions[label]
    negative = pos_descriptions[1 - label]

    triplets.append(
        InputExample(
            texts=[anchor, positive, negative]
        )
    )

print(f"Built {len(triplets)} triplets.")


Built 2029 triplets.
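
Because the positive and negative descriptions are selected by indexing with the label, the order of the description list decides which sentence each class is pulled toward. The sketch below makes that mapping explicit with a dictionary and checks it on two toy rows. It uses plain tuples instead of `InputExample`, a hypothetical `build_triplets` helper, and assumes `rule_violation == 1` means the text violates the rule.

```python
# Explicit label -> description mapping (assumes 1 = violation, 0 = no violation).
DESCRIPTIONS = {
    1: "This text violates the rule.",
    0: "This text does not violate the rule.",
}

def build_triplets(texts, labels):
    """Return (anchor, positive, negative) tuples for each labelled text."""
    return [
        (text, DESCRIPTIONS[label], DESCRIPTIONS[1 - label])
        for text, label in zip(texts, labels)
    ]

# Toy rows (made up for illustration).
toy = build_triplets(["buy cheap pills here", "thanks for the help"], [1, 0])
```

Spot-checking a few triplets this way before training is cheap and catches an inverted mapping, which would otherwise train the model to do the opposite of what is intended.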


### Main Training Loop
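
For orientation, the approximate number of optimizer steps implied by the settings in this section can be worked out from the triplet count printed earlier (2029), the batch size (16), and the epoch count (3). This is plain arithmetic, assuming the default `DataLoader` behaviour of keeping the final partial batch.

```python
import math

num_triplets = 2029   # from the "Built 2029 triplets." output
batch_size = 16
epochs = 3
warmup_steps = 100

steps_per_epoch = math.ceil(num_triplets / batch_size)   # last partial batch is kept
total_steps = steps_per_epoch * epochs
warmup_fraction = warmup_steps / total_steps

print(steps_per_epoch, total_steps, round(warmup_fraction, 2))
# prints: 127 381 0.26
```

With these values the warmup covers roughly a quarter of training, which is worth keeping in mind when changing the epoch count or batch size.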

In [12]:
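# Euclidean triplet loss: push each text at least `triplet_margin` closer
# to its matching description than to the opposite one.
train_loss = losses.TripletLoss(
    model,
    distance_metric=losses.TripletDistanceMetric.EUCLIDEAN,
    triplet_margin=1.0,
)
train_loader = DataLoader(triplets, shuffle=True, batch_size=16)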

model.fit(
    train_objectives=[(train_loader, train_loss)],
    epochs=3,
    warmup_steps=100,
    show_progress_bar=True,
    output_path="./e5-large-v2-triplet"
)
print("Triplet fine-tuning complete.")
print("Model saved to ./e5-large-v2-triplet")



Triplet fine-tuning complete.
Model saved to ./e5-large-v2-triplet
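
One way the fine-tuned space could then be used, sketched here as an assumption rather than something this notebook does, is to assign each text the label whose description embedding is closest in Euclidean distance. The helper `nearest_label` and the toy vectors are hypothetical; in practice the vectors would come from `model.encode(...)` on the texts and on the two label descriptions.

```python
import math

def nearest_label(text_vec, description_vecs):
    """Return the label whose description vector is closest to text_vec."""
    return min(
        description_vecs,
        key=lambda label: math.dist(text_vec, description_vecs[label]),
    )

# Toy 2-D vectors standing in for real embeddings (hypothetical values).
description_vecs = {0: [1.0, 0.0], 1: [0.0, 1.0]}
pred_a = nearest_label([0.9, 0.2], description_vecs)   # closer to label 0
pred_b = nearest_label([0.1, 0.8], description_vecs)   # closer to label 1
```

The embeddings could equally be passed to a downstream classifier as features; the nearest-description rule is only the most direct reading of what the triplet objective optimizes.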
