# **SET-UP AND DEPENDENCIES**

In [None]:
lda_2k_bert["bert_embedding"].iloc[0]

'[0.08504268527030945, -0.08991982787847519, -0.3310895264148712, 0.17475174367427826, 0.17374038696289062, 0.010502063669264317, 0.09113357216119766, -0.009091060608625412, 0.1093989685177803, 0.24470943212509155, 0.1621207296848297, 0.29537060856819153, 0.15000250935554504, 0.31608647108078003, 0.07465407997369766, -0.1422363966703415, 0.017795894294977188, -0.01801280304789543, -0.2586177885532379, -0.23759788274765015, 0.029648559167981148, -0.18469396233558655, -0.3168199062347412, 0.027158569544553757, -0.3880928158760071, -0.11783148348331451, 0.26131683588027954, 0.9825981855392456, 0.11006905138492584, 0.3417786955833435, -0.1494264155626297, 0.23733049631118774, -0.29542532563209534, 0.16874052584171295, -0.0782364159822464, 0.1246759444475174, 0.008889998309314251, 0.19892458617687225, 0.09358085691928864, 0.31177976727485657, 0.06423398852348328, -0.07339493185281754, 0.1532508283853531, 0.4257239103317261, 0.2785787582397461, 0.03671146184206009, 0.053239498287439346, -0.3
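
The quotes around the output above suggest that `bert_embedding` holds a string rather than a list here, which is what usually happens when a column of lists is written to CSV and read back. Below is a minimal sketch of turning such a string back into a list of floats. It assumes the text is a valid JSON-style list of numbers; the helper name `parse_embedding` is hypothetical and the sample string is shortened to three values.

```python
import json

def parse_embedding(text):
    """Convert a string such as '[0.1, -0.2]' into a list of floats."""
    return [float(x) for x in json.loads(text)]

# Shortened sample in the same format as the cell output above.
sample = "[0.08504268527030945, -0.08991982787847519, -0.3310895264148712]"
vector = parse_embedding(sample)
print(type(vector).__name__, len(vector))
```

On the full column, the same helper could be applied with `lda_2k_bert["bert_embedding"].apply(parse_embedding)`, provided every row is stored in this format.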

Getting contextual embeddings: call the tokenizer with its full set of options so that it:


*   Converts text to input IDs
*   Generates attention masks
*   Adds special tokens ([CLS], [SEP])
*   Pads and truncates each sequence
*   Returns the result in a format the model can use

All of this is required before the text can be run through the Bio_ClinicalBERT model.
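
The steps in the list above can be illustrated with a toy encoder in plain Python. This is only a sketch of the idea, not the Hugging Face implementation: the vocabulary, the whitespace splitting, and the function name `toy_encode` are all assumptions made for illustration (real BERT tokenizers use WordPiece with a vocabulary of roughly 30,000 entries).

```python
# Toy vocabulary (hypothetical); the real tokenizer ships its own.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "patient": 4, "denies": 5, "chest": 6, "pain": 7, "fever": 8}

def toy_encode(texts, max_length=8):
    """Mimic the tokenizer steps: IDs, special tokens, truncation, padding, masks."""
    encoded = []
    for text in texts:
        # Convert text to input IDs (unknown words map to [UNK]).
        ids = [VOCAB.get(word, VOCAB["[UNK]"]) for word in text.lower().split()]
        ids = ids[:max_length - 2]                        # truncate, leaving room for [CLS]/[SEP]
        ids = [VOCAB["[CLS]"]] + ids + [VOCAB["[SEP]"]]   # add special tokens
        encoded.append(ids)

    longest = max(len(ids) for ids in encoded)
    input_ids, attention_mask = [], []
    for ids in encoded:
        pad = longest - len(ids)
        input_ids.append(ids + [VOCAB["[PAD]"]] * pad)      # pad to the longest text in the batch
        attention_mask.append([1] * len(ids) + [0] * pad)   # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = toy_encode(["Patient denies chest pain", "Fever"])
print(batch["input_ids"])       # [[2, 4, 5, 6, 7, 3], [2, 8, 3, 0, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1, 1, 1], [1, 1, 1, 0, 0, 0]]
```

The attention mask is what later lets the pooling step ignore the padded positions of the shorter text.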







In transformer-based models like BERT, positional embeddings help the model understand the order of words in a sentence. Transformers don't process text step by step the way RNNs or LSTMs do (those naturally move left to right). Instead, a transformer processes the entire sentence at once, with no built-in sense of sequence. Positional embeddings fill that gap: they add information about word order to each token's representation. Each token in a sentence gets two types of embeddings that matter here:

*   Token embedding: the meaning of the token itself (e.g., “dog”, “cat”)
*   Positional embedding: the position of that token in the sequence (e.g., 1st, 2nd, 3rd…)

These are added together (BERT also adds a segment embedding that marks which sentence a token belongs to) to form the final input vector for each token. So BERT doesn’t just know that “cat” is in the sentence; it knows where it is.
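
The addition of token and positional embeddings can be sketched with NumPy. This is a toy illustration under stated assumptions: the tables are random rather than learned, the sizes are tiny (BERT-base uses a hidden size of 768), and the segment embedding and layer normalization that real BERT also applies are left out. The name `embed` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_positions, hidden = 10, 6, 4  # toy sizes, chosen for illustration

token_table = rng.normal(size=(vocab_size, hidden))        # one row per token ID
position_table = rng.normal(size=(max_positions, hidden))  # one row per position

def embed(token_ids):
    """Input vector for each token = token embedding + positional embedding."""
    positions = np.arange(len(token_ids))
    return token_table[token_ids] + position_table[positions]

a = embed([2, 7, 5])  # token 5 sits at position 2
b = embed([5, 7, 2])  # token 5 sits at position 0

# Token 7 is at position 1 in both sequences, so its input vector is identical.
print(np.allclose(a[1], b[1]))
# Token 5 appears at different positions, so its input vector differs.
print(np.allclose(a[2], b[0]))
```

The same token therefore enters the model with a different vector depending on where it appears, which is how word order becomes visible to the attention layers.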



In [None]:
# BERT models are heavy (this run took over an hour). Processing all rows at once
# would crash Colab, so we work in chunks of 100.
batch_size = 100
all_embeddings = [] # list to store embeddings
texts = lda_2k["exp_text"].tolist()

for i in range(0, len(texts), batch_size):
    batch_texts = texts[i:i + batch_size]
    tokenized = tokenizer(
        batch_texts,
        padding=True,
        truncation=True,
        max_length=512,
        return_tensors="pt"
    )
    with torch.no_grad():  # Disable gradient tracking: this is inference, not training, so it saves memory and compute
        outputs = model(**tokenized)  # Forward pass of the tokenized batch through the BERT model

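    # Mean pooling: average the token vectors of each text, ignoring padding positions
    attention_mask = tokenized["attention_mask"].unsqueeze(-1)  # (batch, seq_len, 1) so it broadcasts over the hidden dimension
    masked = outputs.last_hidden_state * attention_mask  # zero out the vectors of padding tokens
    embeddings = masked.sum(dim=1) / attention_mask.sum(dim=1)  # sum of real-token vectors / number of real tokens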

    all_embeddings.extend(embeddings.cpu().numpy().tolist())

lda_2k["bert_embedding"] = all_embeddings
# This column will contain:
# One 768-dimensional vector per clinical note
# Stored as a list of floats.
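
The mean-pooling step in the cell above can be checked by hand on a tiny example. The sketch below uses NumPy instead of PyTorch and made-up numbers, so it only illustrates the arithmetic and is not the notebook's actual code path. The padding position is given an extreme value on purpose to show that the mask removes it from the average.

```python
import numpy as np

# Two sequences, three token positions, hidden size 2 (toy numbers).
hidden_states = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last position is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],      # no padding
])
attention_mask = np.array([[1, 1, 0],
                           [1, 1, 1]])

mask = attention_mask[..., None]                 # shape (2, 3, 1), like unsqueeze(-1)
masked = hidden_states * mask                    # zero out padding vectors
pooled = masked.sum(axis=1) / mask.sum(axis=1)   # divide by the number of real tokens

naive = hidden_states.mean(axis=1)               # plain mean, which lets padding distort the result

print(pooled)  # [[2. 3.] [4. 2.]]
print(naive)
```

For the first sequence the masked mean is the average of the two real tokens, while the plain mean is pulled far off by the padding vector. For the second sequence, which has no padding, both give the same result.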
