Below is a template for encoding sentences/chunks with SecureBERT:

NOTE: Please remember that we are using the SecureBERT model from this huggingface link: https://huggingface.co/ehsanaghaei/SecureBERT. This model embeds in a 768 dimensional space.

**Also** note that I DO NOT normalize the embeddings, since Anya's clustering code already normalizes them. I think it is best that we normalize only once, in one place, to ensure consistency and reproducibility.
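As a rough illustration of what "normalizing once" means, here is a minimal sketch of L2 normalization in plain Python. It assumes the clustering code uses L2 (unit-length) normalization, which this notebook does not show; `l2_normalize` and its `eps` parameter are hypothetical names used only for this example.

```python
import math

def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit L2 norm; eps guards against division by zero."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]

raw = [3.0, 4.0]
once = l2_normalize(raw)     # unit-length version of raw
twice = l2_normalize(once)   # normalizing again leaves it (numerically) unchanged
print(once)                  # [0.6, 0.8]
```

Applying the same L2 normalization twice is harmless, but keeping it in one place means the epsilon, norm type, and dtype are defined only once.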

In [None]:
## You should only need to download these three packages!
import torch
import transformers
from transformers import RobertaTokenizerFast, RobertaModel
# RobertaTokenizerFast is the faster implementation. According to the documentation, it should be equivalent to RobertaTokenizer

Note: This code DOES NOT mask or pad. I also did not add a pooling layer initially because pooling is defined later on. The pooling this model usually does is simply to average all encoded tokens (which carry context/spatial awareness) and output one tensor as the embedding. I did this directly so that it is clear.

One other important note: I did not use padding, so if you want to use this function, the batch size is 1. That means you **HAVE TO** use a loop or equivalent to encode each chunk separately. This function does not accept multiple sentences passed together in something like an array. Again, **ONLY** pass one chunk at a time through the function.
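The two notes above can be pictured with a small sketch. The token vectors are made-up numbers standing in for encoder outputs, and `mean_pool` is a hypothetical helper, not a function from this notebook. It only shows the shape of the idea: one chunk in, one averaged vector out, with a loop over chunks.

```python
def mean_pool(token_vectors):
    """Average a list of per-token vectors (all the same length) into one vector."""
    n_tokens = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(tok[d] for tok in token_vectors) / n_tokens for d in range(dim)]

# Hypothetical encoder outputs for two chunks of different lengths (no padding).
chunk_outputs = [
    [[1.0, 2.0], [3.0, 4.0]],               # chunk with 2 tokens
    [[0.0, 0.0], [2.0, 2.0], [4.0, 4.0]],   # chunk with 3 tokens
]

# Batch size 1: each chunk is pooled on its own inside a loop.
embeddings = [mean_pool(tokens) for tokens in chunk_outputs]
print(embeddings)  # [[2.0, 3.0], [2.0, 2.0]]
```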

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import pandas as pd
import numpy as np
dfExcel = pd.read_excel("full_clean.xlsx")
dfExcel["date"] = pd.to_datetime(dfExcel["date"], errors="coerce", utc=True)
dfExcel.head()

Unnamed: 0,date,title,source,article
0,2024-05-01 12:03:00+00:00,Everyone's an Expert: How to Empower Your Empl...,thehackernews,\n\n\n\nThere's a natural human desire to avoi...
1,2025-08-21 13:53:34+00:00,Why Certified VMware Pros Are Driving the Futu...,bleepingcomputer,"\n\nBy Brenda Emerson, VMUG President\n\nIT is..."
2,2024-10-03 12:59:00+00:00,Defunct' DOJ ransomware task force raises ques...,techtarget,\n\nListen to this article. This audio was gen...
3,2025-03-26 11:25:00+00:00,Sparring in the Cyber Ring: Using Automated Pe...,thehackernews,"""A boxer derives the greatest advantage from h..."
4,2025-05-14 15:42:00+00:00,Open source project curl is sick of users subm...,arstechnica,"""A threshold has been reached. We are effectiv..."


# Data Preprocessing

In [None]:
import hashlib

#creating a function that formats our data in a suitable way to be inputted into functions below
#converting each row in the original dataframe to a dictionary with 5 keys: article id, text, source, date, and title
def df_to_articles(df, include_title=True):
    records = []
    for _, row in df.iterrows():
        text = f"{row['title']}\n\n{row['article']}" if include_title else row['article']
        a_id = hashlib.md5(text.encode("utf-8", "ignore")).hexdigest()

        records.append({
            "article_id": a_id,
            "text": text,
            "source": row.get("source", ""),
            "date": row["date"],                    # now uniform datetime
            "title": row.get("title", ""),
        })
    return records

# build the list that the chunking cells below expect
articles = df_to_articles(dfExcel)


**`articles` is now a list of dictionaries, one per article, each holding the article text and its metadata under the keys specified above**
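A small sketch of how the MD5-based `article_id` behaves, using made-up titles and bodies. The helper `make_article_id` is hypothetical and simply mirrors the hashing line in `df_to_articles`: identical text always maps to the same ID, so exact duplicate articles would share an `article_id`.

```python
import hashlib

def make_article_id(title, article, include_title=True):
    """Hash the same text that df_to_articles would hash for one row."""
    text = f"{title}\n\n{article}" if include_title else article
    return hashlib.md5(text.encode("utf-8", "ignore")).hexdigest()

id_a = make_article_id("Title A", "Body text")
id_b = make_article_id("Title A", "Body text")   # exact duplicate
id_c = make_article_id("Title B", "Body text")   # same body, different title

print(id_a == id_b, id_a == id_c, len(id_a))  # True False 32
```

If duplicates exist in the spreadsheet, their chunks would later be grouped under one `article_id`, which may or may not be what you want.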



# Chunking Function

In [None]:
#chunk func that is meant to be implemented per article
def chunk_article_by_tokens(text: str, tokenizer, max_len: int = 512, stride: int = 96):
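    """
    Return a list of dictionaries with tokenized windows for a SINGLE article.
    `stride` is the number of tokens shared by consecutive windows (the overlap).
    Windows never span two articles because each article is chunked on its own.
    """
    if not text:
        return []
    if stride >= max_len:
        # otherwise the start pointer would never advance and the loop would not end
        raise ValueError("stride (overlap) must be smaller than max_len")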
    enc = tokenizer(
        text,
        return_tensors=None, #returns regular lists as opposed to tensors or smth else
        truncation=False,      # we'll window manually
        padding=False,
        add_special_tokens=False # we'll add special tokens later maybe
    )
    input_ids = enc["input_ids"]
    attn = [1] * len(input_ids) #create attention mask and keep track of real (1) vs padded tokens (0); doesn't cost anything rn, could come in use later
    # padded tokens are necessary for if we embed in batches

    windows = [] #initialize a list
    # this will eventually be a list of dictionaries, each dictionary will represent a chunk
    start = 0
    chunk_id = 0
    while start < len(input_ids):
        #walk through this article's tokens one window at a time
        #each article is chunked separately, so windows never cross article boundaries
        end = min(start + max_len, len(input_ids))
        #end is the end of the current chunk, clipped to the length of the article
        win_ids = input_ids[start:end]
        win_attn = attn[start:end] #not strictly necessary, but stores the corresponding attention mask for the chunk
        windows.append({
            "chunk_id": chunk_id,
            "input_ids": win_ids,
            "attention_mask": win_attn,
            "start_token": start,
            "end_token": end
        }) #append the current chunk along with its position metadata
        if end == len(input_ids): #exit condition
            break
        start = end - stride  # move up our start pointer just enough to ensure overlap
        chunk_id += 1
    return windows
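To make the windowing arithmetic concrete, here is a minimal sketch that reproduces only the start/end bookkeeping of the loop above, without a tokenizer. `window_bounds` is a hypothetical helper written for this example; 2186 is used as an illustrative article length.

```python
def window_bounds(n_tokens, max_len=512, stride=96):
    """Return (start, end) token offsets for overlapping windows over one article."""
    if stride >= max_len:
        raise ValueError("stride (overlap) must be smaller than max_len")
    bounds = []
    start = 0
    while start < n_tokens:
        end = min(start + max_len, n_tokens)
        bounds.append((start, end))
        if end == n_tokens:       # last window reached the end of the article
            break
        start = end - stride      # step back by the overlap
    return bounds

print(window_bounds(2186))
# [(0, 512), (416, 928), (832, 1344), (1248, 1760), (1664, 2176), (2080, 2186)]
```

With `max_len=512` and `stride=96`, each new window starts 416 tokens after the previous one, and the final window is simply shorter.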

# Looping Chunk Func over all articles

In [None]:
def chunk_corpus(articles, tokenizer, max_len=512, stride=96):
    all_chunks = []
    for art in articles:
        #get article_id and text
        a_id = art["article_id"]
        text = art["text"]

        #chunk this single article
        windows = chunk_article_by_tokens(text, tokenizer, max_len, stride)

        #adding metadata to each chunk
        for w in windows:
            w["article_id"] = a_id
            w["title"] = art.get("title", "")
            w["source"] = art.get("source", "")
            w["date"] = art.get("date", "")
            all_chunks.append(w)
    return all_chunks


In [None]:
from transformers import AutoTokenizer, AutoModel

model_name = "ehsanaghaei/SecureBERT"
#Tokenizer
tok = AutoTokenizer.from_pretrained(model_name)

all_chunks = chunk_corpus(articles, tokenizer=tok, max_len=512, stride=96)
print(len(all_chunks))
print(all_chunks[0].keys())

Token indices sequence length is longer than the specified maximum sequence length for this model (2186 > 512). Running this sequence through the model will result in indexing errors


29298
dict_keys(['chunk_id', 'input_ids', 'attention_mask', 'start_token', 'end_token', 'article_id', 'title', 'source', 'date'])


**all_chunks is now a list of dictionaries, one per chunk. Each dictionary holds chunk_id and input_ids (the tokenized numerical representation of the chunk text), along with the other metadata keys printed above**


In [None]:
all_chunks[0]

{'chunk_id': 0,
 'input_ids': [7682, 1264, 18, 41, 26721, 35, 1336, 7, 3676, 11017, 2486, 13479, 13, 12098, 14361,
  50118, 50118, 50118, 50118, 50118, 50118, 970, 18, 10, 1632, 1050, 4724, 7, 1877, 5608,
  12593, 4, 20, 21490, 6, 9, 768, 6, 16, 114, 47, 1034, 7, 23229, 143,
  9031, 3662, 2389, 9, 573, 6, 47, 348, 300, 7, 1091, 2460, 7, 10749, 167,
  182, 276, 3455, 4, 50118, 50118, 1620, 10, 568, 12, 5406, 13, 110, 1651, 6,
  47, 216, 42, 157, 4, 125, 117, 948, 141, 171, 2320, 50, 10128, 13468, 3270,
  110, 1651, 34, 10, 2934, 2510, 6, 47, 214, 129, 25, 2823, 25, 110, 19261,
  3104, 4, 345, 18, 202, 65, 333, 14, 64, 25074, 490, 5, 14213, 7, 15067,
  1856, 5552, 116, 16625, 308, 82, 4, 50118, 50118, 36090, 531, 28, 200, 2574, 13,
  110,
  ... (output truncated)

# Embedding chunked dictionary:


In [None]:
import torch
import torch.nn.functional as F
import numpy as np
import pandas as pd
from transformers import AutoTokenizer, AutoModel

model_name = "ehsanaghaei/SecureBERT"
tok = AutoTokenizer.from_pretrained(model_name)
enc_model = AutoModel.from_pretrained(model_name, add_pooling_layer=False)
enc_model.eval()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
enc_model.to(device)

def collate_windows_for_batch(tokenizer, windows, pad_to_multiple_of=8):
    ids = [torch.tensor(w["input_ids"], dtype=torch.long) for w in windows]
    # Each pre-built window holds only real tokens, so its mask is all ones; pad() will extend the mask with zeros
    att = [torch.ones(len(w["input_ids"]), dtype=torch.long) for w in windows]
    batch = tokenizer.pad(
        {"input_ids": ids, "attention_mask": att},
        padding=True,
        return_tensors="pt",
        pad_to_multiple_of=pad_to_multiple_of
    )
    return batch
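For intuition about what the padding step produces, here is a plain-Python sketch of right-padding a batch to a multiple of 8. This is not the `tokenizer.pad` implementation; `pad_batch` is a hypothetical helper, and both `pad_id=1` (the usual RoBERTa pad token id) and right-side padding are assumptions you should check against `tok.pad_token_id` and `tok.padding_side`.

```python
def pad_batch(sequences, pad_id=1, multiple=8):
    """Right-pad token id lists to a shared length that is a multiple of `multiple`."""
    longest = max(len(seq) for seq in sequences)
    target = -(-longest // multiple) * multiple   # round up to the next multiple
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = target - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)   # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[7682, 1264, 18], [970, 18, 10, 1632, 1050]])
print(batch["input_ids"][0])       # [7682, 1264, 18, 1, 1, 1, 1, 1]
print(batch["attention_mask"][1])  # [1, 1, 1, 1, 1, 0, 0, 0]
```

The zeros in the attention mask are what the pooling functions rely on to ignore padded positions.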




# Mean Strategies

In [None]:
@torch.no_grad()
def masked_mean_pool(last_hidden_state, attention_mask):
    # last_hidden_state: [B, T, H], attention_mask: [B, T]
    mask = attention_mask.unsqueeze(-1).type_as(last_hidden_state)     # [B, T, 1]
    summed = (last_hidden_state * mask).sum(dim=1)                     # [B, H]
    counts = mask.sum(dim=1).clamp(min=1e-9)                           # [B, 1]
    return summed / counts                                             # [B, H]

@torch.no_grad()
def last4_layers_masked_mean(out_hidden_states, attention_mask):
    """
    out_hidden_states: tuple(len=L+1) of [B, T, H] (includes embeddings at index 0)
    We take the top 4 layers, average them, then masked-mean over tokens.
    """
    # Take last 4 layers: [-4], [-3], [-2], [-1]
    layers = out_hidden_states[-4:]
    # Stack to [4, B, T, H] and average across layer dim -> [B, T, H]
    stacked = torch.stack(layers, dim=0).mean(dim=0)
    return masked_mean_pool(stacked, attention_mask)
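A toy walk-through of the two pooling steps may help. It uses plain Python lists instead of tensors, handles a single sequence rather than a batch, and uses invented values; `average_layers` and `masked_mean` are hypothetical stand-ins for the torch functions above.

```python
def average_layers(layers):
    """Element-wise mean over a list of layer outputs, each shaped [T][H]."""
    n_layers = len(layers)
    n_tokens = len(layers[0])
    dim = len(layers[0][0])
    return [
        [sum(layer[t][h] for layer in layers) / n_layers for h in range(dim)]
        for t in range(n_tokens)
    ]

def masked_mean(token_vectors, mask):
    """Average token vectors where mask == 1; padded positions (mask == 0) are ignored."""
    dim = len(token_vectors[0])
    count = max(sum(mask), 1)   # avoid dividing by zero for an all-padding row
    return [
        sum(vec[h] * m for vec, m in zip(token_vectors, mask)) / count
        for h in range(dim)
    ]

# Toy hidden states: 6 "layers", 3 token positions, hidden size 2.
# Layer k holds the value k at the two real tokens and 100.0 at the padded position.
hidden_states = [
    [[float(k), float(k)], [float(k), float(k)], [100.0, 100.0]] for k in range(6)
]
mask = [1, 1, 0]

top4 = average_layers(hidden_states[-4:])   # layers 2, 3, 4, 5 average to 3.5 at real tokens
print(masked_mean(top4, mask))              # [3.5, 3.5]
print(masked_mean(top4, [1, 1, 1]))         # padding leaks in when the mask is ignored
```

The second print shows why the mask matters once batches are padded: without it, the padded position drags the average far away from the real tokens.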


In [None]:
@torch.no_grad()
def embed_chunks_to_parquet(
    all_chunks,
    tokenizer,
    model,
    batch_size=64,
    pooling="last4_masked_mean",   # options: "last4_masked_mean" | "masked_mean"
    normalize=False,
    parquet_path="securebert_chunks.parquet"
):
    rows = []
    vecs = []


    use_last4 = (pooling == "last4_masked_mean")

    if use_last4 and not getattr(model.config, "output_hidden_states", False):
        model.config.output_hidden_states = True

    for i in range(0, len(all_chunks), batch_size):
        batch_windows = all_chunks[i:i+batch_size]
        batch = collate_windows_for_batch(tokenizer, batch_windows)
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)

        out = model(input_ids=input_ids, attention_mask=attention_mask, output_hidden_states=use_last4)

        if use_last4:
            pooled = last4_layers_masked_mean(out.hidden_states, attention_mask)
        else:
            pooled = masked_mean_pool(out.last_hidden_state, attention_mask)

        if normalize:
            pooled = F.normalize(pooled, p=2, dim=1)

        pooled = pooled.cpu().to(torch.float32).numpy()  # [B, H]

        # collect metadata rows and decoded text
        for j, w in enumerate(batch_windows):
            rows.append({
                "article_id": w["article_id"],
                "chunk_id":   w["chunk_id"],
                "start_token": w["start_token"],
                "end_token":   w["end_token"],
                "title":      w.get("title", ""),
                "source":     w.get("source", ""),
                "date":       w.get("date", ""),
                # Decode a human-readable text preview for the chunk
                "chunk_text": tokenizer.decode(w["input_ids"], skip_special_tokens=True)
            })
        vecs.append(pooled)

    # Stack all embeddings
    X = np.vstack(vecs).astype("float32")  # [N, H]

    # Build DF. Store embedding as list<float32> (Parquet handles this as a list column).
    df = pd.DataFrame(rows)
    df["embedding"] = [x.tolist() for x in X]

    # Write Parquet (pyarrow backend handles list columns nicely)
    df.to_parquet(parquet_path, index=False)
    return df


# Write to Parquet


In [None]:
df_parquet = embed_chunks_to_parquet(
    all_chunks,
    tokenizer=tok,
    model=enc_model,
    batch_size=64,
    pooling="last4_masked_mean",
    normalize=False,
    parquet_path="securebert_chunks.parquet"
)

print("Wrote Parquet with", len(df_parquet), "rows.")
print(df_parquet.head(2))


You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Wrote Parquet with 29298 rows.
                         article_id  chunk_id  start_token  end_token  \
0  a46f04c55a3a2c0bae3a7103fe82a6f4         0            0        512   
1  a46f04c55a3a2c0bae3a7103fe82a6f4         1          416        928   

                                               title         source  \
0  Everyone's an Expert: How to Empower Your Empl...  thehackernews   
1  Everyone's an Expert: How to Empower Your Empl...  thehackernews   

                       date  \
0 2024-05-01 12:03:00+00:00   
1 2024-05-01 12:03:00+00:00   

                                          chunk_text  \
0  Everyone's an Expert: How to Empower Your Empl...   
1  : they don't work.\n\nAge-Old Challenges of Ol...   

                                           embedding  
0  [0.193899005651474, 0.06388619542121887, 0.061...  
1  [0.1835463047027588, 0.08975280076265335, 0.05...  


In [None]:
from google.colab import files
files.download("/content/securebert_chunks.parquet")


In [None]:
import pandas as pd
import numpy as np

# Load chunk-level data
df = pd.read_parquet("/content/securebert_chunks_with_cluster_id.parquet")

def weighted_embed(group):
    vecs = np.stack(group["embedding"].to_numpy())            # [n_chunks, dim]
    w = (group["end_token"] - group["start_token"]).to_numpy(dtype=float)
    # guard against zeros / negatives
    w = np.clip(w, 1.0, None)
    w = w / w.sum()
    return (vecs * w[:, None]).sum(axis=0)                    # [dim]

# Collapse: one embedding per article (weighted)
article_df = (
    df.groupby("article_id")
      .apply(weighted_embed)
      .reset_index(name="embedding")
)

# Bring back representative metadata (pick what you prefer)
meta = (df.sort_values("date")  # earliest date/title/source seen
          .groupby("article_id")
          .agg(title=("title","first"),
               source=("source","first"),
               date=("date","first"))
          .reset_index())

article_df = article_df.merge(meta, on="article_id", how="left")
article_df.head()
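The weighting in `weighted_embed` can be checked by hand with a tiny example. The sketch below uses plain Python and made-up two-dimensional vectors; `weighted_article_embedding` is a hypothetical helper that mirrors the same arithmetic (token-span lengths as weights, clipped to at least 1, then normalized to sum to 1).

```python
def weighted_article_embedding(chunk_vectors, start_tokens, end_tokens):
    """Average chunk vectors, weighting each chunk by its token span length."""
    weights = [max(float(e - s), 1.0) for s, e in zip(start_tokens, end_tokens)]
    total = sum(weights)
    dim = len(chunk_vectors[0])
    return [
        sum(vec[d] * w for vec, w in zip(chunk_vectors, weights)) / total
        for d in range(dim)
    ]

# Two chunks: a full 512-token window and a 128-token tail window.
vecs = [[1.0, 0.0], [0.0, 1.0]]
emb = weighted_article_embedding(vecs, start_tokens=[0, 416], end_tokens=[512, 544])
print(emb)  # [0.8, 0.2]
```

Because consecutive windows overlap, the shared tokens contribute to two chunks, so this is a length-weighted average of chunk vectors rather than an exact per-token average over the article.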


Unnamed: 0,article_id,embedding,title,source,date
0,0000c0e764e81d384669caad4dacdd53,"[0.1624595820903778, 0.08895283937454224, -0.0...",Verizon Subsidiary Settles With FCC for $16M O...,securityweek,2024-07-24 13:10:12+00:00
1,0003ff0e883e9a34ffc48c99808200e4,"[0.22192189428541392, -0.01754539046022627, -0...",AI-as-a-Service Providers Vulnerable to PrivEs...,thehackernews,2024-04-05 15:08:00+00:00
2,000c43a5d1d395bafd83092595528b7e,"[0.14464730024337769, -0.007398000452667475, -...",Russian pro basketball player arrested for all...,bleepingcomputer,2025-07-10 16:26:35+00:00
3,00188ea00d5ddf2288e25c3269536f20,"[0.16467630765579305, -0.04355095658152776, 0....",NATO Draws a Cyber Red Line in Tensions With R...,securityweek,2024-05-13 15:12:34+00:00
4,0020a6c8470c6cb9116c49ce3c21923a,"[0.22905868742201063, 0.07023979283869267, 0.0...","Apple pulls Advanced Data Protection in UK, sp...",techtarget,2025-02-24 14:54:00+00:00
