# MMContextEncoder — quick‑start & usage tour

This notebook uses the **`OmicsCaptionSimulator`** to generate toy data and walks through three ways of running the `MMContextEncoder` inside the Sentence‑Transformers framework:

1. **Text‑only** (no numeric data)
2. **Pre‑computed numeric embeddings**
   - 2 a. feature‑level tokens
   - 2 b. sample‑level tokens
3. **Random‑initialised numeric embeddings** (baseline)

> *Training* will be covered in a follow‑up notebook. Here we focus on end‑to‑end **`encode`** calls and what comes out.

---

## 0  Setup

In [1]:
%load_ext autoreload
%autoreload 2
from mmcontext.utils import setup_logging

setup_logging()



<RootLogger root (INFO)>

In [3]:
import numpy as np
import pandas as pd
import torch
from sentence_transformers import SentenceTransformer

from mmcontext.models.mmcontextencoder import MMContextEncoder
from mmcontext.simulator import OmicsCaptionSimulator

sim = OmicsCaptionSimulator(n_samples=100, n_genes=10).simulate()
token_df = sim.get_dataframe()
raw_ds = sim.get_hf_dataset()["train"]
raw_ds

Filter: 100%|██████████| 200/200 [00:00<00:00, 167504.15 examples/s]
Filter: 100%|██████████| 200/200 [00:00<00:00, 153975.92 examples/s]


Dataset({
    features: ['sentence1', 'sentence2', 'label', 'sample_idx'],
    num_rows: 160
})

In [8]:
# Each row of the token dataframe stores one embedding vector, so the column itself is 1-D:
print(f"Sample embeddings shape: {token_df['embedding'].shape}")

Sample embeddings shape: (100,)


The Hugging Face dataset has the columns `sentence1`, `sentence2`, `label` and `sample_idx`.

## 1  MMContextEncoder as a **pure text** model

In [10]:
text_enc = MMContextEncoder(text_encoder_name="prajjwal1/bert-tiny")  # any HF model works
st_text = SentenceTransformer(modules=[text_enc])

example = [raw_ds["sentence1"][0], raw_ds["sentence2"][0]]
print("input →", example)
# encode() returns one row per input; [:5] slices rows, so both full vectors are printed
print("embedding →", st_text.encode(example)[:5], "…")

2025-05-20 10:23:09,714 - sentence_transformers.SentenceTransformer - INFO - Use pytorch device_name: mps


input → ['sample_idx:S1', 'Neuron']


Batches: 100%|██████████| 1/1 [00:00<00:00,  4.01it/s]

embedding → [[-9.99999166e-01  9.19028595e-02 -9.96874273e-01 -6.12561703e-01
  -9.63874042e-01  4.67113942e-01 -8.47504914e-01 -9.84450459e-01
   1.09639160e-01 -1.06203169e-01 -7.82011151e-01 -1.68492451e-01
   5.84925874e-05  9.99998629e-01  1.00110903e-01 -9.75917816e-01
   7.74460971e-01  1.05135463e-01 -8.92424941e-01  4.47784990e-01
   8.49705040e-01 -1.46661596e-02  6.78773463e-01  1.48798982e-02
  -9.99661326e-01 -2.09409930e-02 -9.99751985e-01  2.98347682e-01
   9.98913705e-01 -4.92799431e-02 -1.47239253e-01 -9.91691053e-02
  -9.98491526e-01 -7.11548686e-01  5.00017703e-01  9.99980211e-01
  -9.95522738e-01  2.97337230e-02  9.65869248e-01 -9.97810125e-01
   9.97708559e-01  9.47759271e-01 -9.99164104e-01  8.97138715e-01
  -9.99734938e-01 -9.18340236e-02 -9.85926330e-01  9.99412119e-01
   9.35510337e-01  9.99569476e-01  3.51746738e-01 -8.14474225e-01
  -1.71417326e-01  5.83500862e-01  9.58849430e-01  9.96164680e-01
  -9.34911549e-01 -8.49859059e-01  9.76384103e-01 -1.83745846e-0




The `sample_idx:S1` string in `sentence1` is **treated like ordinary words**, because we never registered numeric embeddings.

If you initialise with `output_token_embeddings=True` you can retrieve the per‑token vectors:

In [11]:
text_enc_tokens = MMContextEncoder("prajjwal1/bert-tiny", output_token_embeddings=True)
st_tokens = SentenceTransformer(modules=[text_enc_tokens])

res = st_tokens.encode(example, output_value="token_embeddings")
print(len(res))  # a list with length of batch size (2)
res[0].shape  # the first element is a tensor of shape (n_tokens, n_features)

2025-05-20 10:23:29,511 - sentence_transformers.SentenceTransformer - INFO - Use pytorch device_name: mps
Batches: 100%|██████████| 1/1 [00:00<00:00, 50.33it/s]

2





torch.Size([9, 128])

## 2  Using **pre‑computed** numeric embeddings
### 2 a  Feature‑level (gene) tokens

In [None]:
sim = OmicsCaptionSimulator(n_samples=2000, n_genes=20, use_gene_level=True).simulate()
token_df = sim.get_dataframe()
raw_ds = sim.get_hf_dataset()["train"]

In [18]:
enc_feat = MMContextEncoder(
    "prajjwal1/bert-tiny", adapter_hidden_dim=32, adapter_output_dim=64, output_token_embeddings=True
)
enc_feat.register_initial_embeddings(token_df, data_origin="geneformer")

# prefix the dataset so the processor knows which column is omics
pref_ds = enc_feat.prepare_ds(raw_ds, cell_sentences_cols="sentence1", caption_col="sentence2")

st_feat = SentenceTransformer(modules=[enc_feat])
row = pref_ds[0]
print("input →", row["sentence_1"])
encoding = st_feat.encode(row["sentence_1"], output_value="sentence_embedding")
print("Pooled Embedding shape:", encoding.shape)
token_encoding = st_feat.encode(row["sentence_1"], output_value="token_embeddings")
print("Token Embedding shape:", token_encoding.shape)

Filter: 100%|██████████| 4000/4000 [00:00<00:00, 628854.75 examples/s]
Filter: 100%|██████████| 4000/4000 [00:00<00:00, 650910.42 examples/s]
2025-05-20 10:26:26,965 - mmcontext.models.omicsencoder - INFO - Loaded embedding matrix with shape (21, 16)
2025-05-20 10:26:26,966 - mmcontext.models.mmcontextencoder - INFO - Registered 21 new numeric samples (total 21). ≈0.000 GiB added. (Assuming float32 precision.)
Prefixing sentence1: 100%|██████████| 3200/3200 [00:00<00:00, 169943.18 examples/s]
2025-05-20 10:26:27,038 - sentence_transformers.SentenceTransformer - INFO - Use pytorch device_name: mps


input → sample_idx:g10 g2 g11 g20 g15 g8 g19 g4 g5 g12


Batches: 100%|██████████| 1/1 [00:00<00:00, 18.36it/s]


Pooled Embedding shape: (64,)


Batches: 100%|██████████| 1/1 [00:00<00:00, 155.52it/s]

Token Embedding shape: torch.Size([10, 64])
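
The shapes above follow from the input format: the string after the `sample_idx:` prefix holds ten space-separated gene tokens, and each token contributes one vector. A dependency-free sketch of that parsing and lookup step is shown below. The helper `split_prefixed` and the toy lookup table are hypothetical and not part of `mmcontext`; the real processor may work differently.

```python
PREFIX = "sample_idx:"  # prefix seen in the notebook output above


def split_prefixed(sentence: str, prefix: str = PREFIX) -> list[str]:
    """Return the tokens of a prefixed cell sentence (hypothetical helper)."""
    if not sentence.startswith(prefix):
        raise ValueError(f"expected a sentence starting with {prefix!r}")
    return sentence[len(prefix):].split()


# Toy lookup table: 20 genes with 16 values each (the width logged for the gene matrix).
lookup = {f"g{i}": [float(i)] * 16 for i in range(1, 21)}

sentence = "sample_idx:g10 g2 g11 g20 g15 g8 g19 g4 g5 g12"
tokens = split_prefixed(sentence)
token_vectors = [lookup[t] for t in tokens]  # one vector per gene token

print(len(tokens), len(token_vectors[0]))  # prints: 10 16
```

In the notebook the adapter then maps each of these vectors to 64 values, which is why the token output has shape `[10, 64]`.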





### 2 b  Sample‑level tokens

In [19]:
sim = OmicsCaptionSimulator(n_samples=2000, n_genes=20).simulate()
token_df = sim.get_dataframe()
raw_ds = sim.get_hf_dataset()["train"]

Filter: 100%|██████████| 4000/4000 [00:00<00:00, 623734.70 examples/s]
Filter: 100%|██████████| 4000/4000 [00:00<00:00, 620344.46 examples/s]


In [20]:
enc_samp = MMContextEncoder(
    "prajjwal1/bert-tiny", adapter_hidden_dim=32, adapter_output_dim=64, output_token_embeddings=True
)
enc_samp.register_initial_embeddings(token_df, data_origin="pca")

pref_ds2 = enc_samp.prepare_ds(raw_ds, cell_sentences_cols="sentence1", caption_col="sentence2")
st_samp = SentenceTransformer(modules=[enc_samp])
print("input →", pref_ds2[0]["sentence_1"])
encoding = st_samp.encode(pref_ds2[0]["sentence_1"])
print("Pooled Embedding shape:", encoding.shape)
token_encoding = st_samp.encode(pref_ds2[0]["sentence_1"], output_value="token_embeddings")
print("Token Embedding shape:", token_encoding.shape)

2025-05-20 10:27:30,090 - mmcontext.models.omicsencoder - INFO - Loaded embedding matrix with shape (2001, 32)
2025-05-20 10:27:30,091 - mmcontext.models.mmcontextencoder - INFO - Registered 2001 new numeric samples (total 2001). ≈0.000 GiB added. (Assuming float32 precision.)
Prefixing sentence1: 100%|██████████| 3200/3200 [00:00<00:00, 164734.86 examples/s]
2025-05-20 10:27:30,169 - sentence_transformers.SentenceTransformer - INFO - Use pytorch device_name: mps


input → sample_idx:S1


Batches: 100%|██████████| 1/1 [00:00<00:00, 36.97it/s]


Pooled Embedding shape: (64,)


Batches: 100%|██████████| 1/1 [00:00<00:00, 190.11it/s]

Token Embedding shape: torch.Size([1, 64])





The numeric vectors from `token_df` are returned **unmodified** by the omics branch and then projected by the adapter.
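
As a rough picture of that two-step path, the sketch below looks up a fixed 32-value vector and passes it through a small two-layer adapter to reach 64 values, the sizes used in this section. It is a dependency-free illustration with random weights. The layer layout (linear, ReLU, linear) and all helper names are assumptions, not the actual `mmcontext` adapter.

```python
import random

rng = random.Random(0)
IN_DIM, HIDDEN_DIM, OUT_DIM = 32, 32, 64  # PCA width, adapter_hidden_dim, adapter_output_dim


def random_matrix(rows: int, cols: int) -> list[list[float]]:
    """Small random weights standing in for trained adapter parameters."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear(x: list[float], weight: list[list[float]]) -> list[float]:
    """Multiply vector x (length n) by weight (n rows, m columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, weight)) for j in range(len(weight[0]))]


def adapter(x, w1, w2):
    """Assumed layout: linear -> ReLU -> linear."""
    hidden = [max(0.0, h) for h in linear(x, w1)]
    return linear(hidden, w2)


# Frozen lookup table: the stored vector is handed back as-is.
lookup = {"S1": [rng.gauss(0.0, 1.0) for _ in range(IN_DIM)]}
w1 = random_matrix(IN_DIM, HIDDEN_DIM)
w2 = random_matrix(HIDDEN_DIM, OUT_DIM)

raw = lookup["S1"]                # unmodified numeric vector
projected = adapter(raw, w1, w2)  # adapter output
print(len(raw), len(projected))   # prints: 32 64
```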

> **Note**  Embedding weights are *not* saved with the model; only the adapter weights are. When you reload the model you must call `register_initial_embeddings` again with a compatible matrix.
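
One way to guard against a mismatched table after reloading is a quick check before re-registering. The sketch below assumes that "compatible" means the same vector width and coverage of every token you plan to encode. The function name and the plain-dict table format are hypothetical, and the library may enforce different rules.

```python
def check_compatible(
    table: dict[str, list[float]], expected_dim: int, needed_tokens: list[str]
) -> list[str]:
    """Return a list of problems; an empty list means the table passes this check."""
    problems = []
    wrong_dim = [tok for tok, vec in table.items() if len(vec) != expected_dim]
    if wrong_dim:
        problems.append(f"{len(wrong_dim)} vectors do not have width {expected_dim}")
    missing = [tok for tok in needed_tokens if tok not in table]
    if missing:
        problems.append(f"missing tokens: {missing}")
    return problems


table = {"S1": [0.0] * 32, "S2": [0.0] * 32}
ok = check_compatible(table, expected_dim=32, needed_tokens=["S1", "S2"])
bad = check_compatible(table, expected_dim=64, needed_tokens=["S1", "S3"])
print(ok)   # no problems found
print(bad)  # reports a width mismatch and a missing token
```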

## 3  Random‑initialised embeddings (baseline)

In [21]:
sim = OmicsCaptionSimulator(n_samples=2000, n_genes=20).simulate()
token_df = sim.get_dataframe()
raw_ds = sim.get_hf_dataset()["train"]

Filter: 100%|██████████| 4000/4000 [00:00<00:00, 622369.55 examples/s]
Filter: 100%|██████████| 4000/4000 [00:00<00:00, 662058.17 examples/s]


In [23]:
enc_rand = MMContextEncoder("prajjwal1/bert-tiny", adapter_hidden_dim=32)
enc_rand.random_initial_embeddings(list(token_df["token"]))
pref_ds3 = enc_rand.prepare_ds(raw_ds, cell_sentences_cols="sentence1", caption_col="sentence2")

st_rand = SentenceTransformer(modules=[enc_rand])
print(st_rand.encode(pref_ds3[0]["sentence_1"]))

2025-05-20 10:28:25,078 - mmcontext.models.omicsencoder - INFO - Loaded embedding matrix with shape (2001, 64)
2025-05-20 10:28:25,079 - mmcontext.models.mmcontextencoder - INFO - Registered 2001 new numeric samples (total 2001). ≈0.000 GiB added. (Assuming float32 precision.)
Prefixing sentence1: 100%|██████████| 3200/3200 [00:00<00:00, 162676.32 examples/s]
2025-05-20 10:28:25,159 - sentence_transformers.SentenceTransformer - INFO - Use pytorch device_name: mps
Batches: 100%|██████████| 1/1 [00:00<00:00, 58.20it/s]

[ 0.10948668  0.1453564   0.18222867  0.01778192  0.16418293  0.18799107
  0.23551662  0.02295085  0.14415075  0.51319635 -0.21753936 -0.08540704
 -0.22772713  0.33696035  0.17125972 -0.06174558  0.05924015  0.06253229
 -0.255522    0.19456369 -0.30784404  0.12620766 -0.15672968 -0.182354
 -0.12227616 -0.615613    0.24411213 -0.30043757  0.12713297 -0.29219785
 -0.49842006 -0.00670506  0.1654552  -0.01335027  0.19999892 -0.02134908
  0.08071972 -0.05063622 -0.576203   -0.07939567  0.27898422  0.22024229
 -0.01880814 -0.19075714  0.30060425  0.31629696 -0.08036962  0.25729987
 -0.32603797 -0.01327852 -0.14412963  0.11517966  0.0690413   0.03479644
  0.18441519  0.03432512 -0.19326882 -0.07836887  0.11872623 -0.31066564
  0.23714355 -0.1661951   0.10479903  0.06035277 -0.04181308  0.03456398
  0.48263597  0.01516395  0.21020971  0.4676326  -0.29020986 -0.0757796
  0.04970912 -0.18281977  0.23352966  0.01416691 -0.14536797 -0.37176612
  0.10221446 -0.0665024   0.34153995  0.05214313  0.24




Random vectors let you benchmark how much pre‑computed representations help compared with an uninformed baseline (same dimension, same adapters).
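
The idea behind such a baseline can be sketched without the library: draw one random vector per token, with a fixed seed so runs are repeatable. The function name, the Gaussian distribution and the 64-value width (taken from the log line above) are illustrative assumptions; `random_initial_embeddings` may initialise differently.

```python
import random


def make_random_table(tokens: list[str], dim: int = 64, seed: int = 0) -> dict[str, list[float]]:
    """Map each token to a random vector; the same seed gives the same table."""
    rng = random.Random(seed)
    return {tok: [rng.gauss(0.0, 1.0) for _ in range(dim)] for tok in tokens}


tokens = [f"S{i}" for i in range(1, 6)]
table_a = make_random_table(tokens, seed=0)
table_b = make_random_table(tokens, seed=0)
table_c = make_random_table(tokens, seed=1)

print(len(table_a), len(table_a["S1"]))            # prints: 5 64
print(table_a == table_b, table_a == table_c)      # prints: True False
```

Fixing the seed matters for a fair comparison: the baseline stays identical across runs, so any difference in downstream scores comes from the pre‑computed vectors rather than from a new random draw.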

## 4  What’s next?
* **Training** → use `SentenceTransformerTrainer` with `pref_ds`. Give the model a pair dataset (`label` = 1/0) and a suitable loss, e.g. `CosineSimilarityLoss`.
* **Saving / loading** → `st_rand.save(path)`, then `SentenceTransformer(path)`. Numeric lookup tables are *not* stored, so re‑register them before inference.
* **Hub upload** → after training, `.push_to_hub()` works as it does for any Sentence‑Transformers model.
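
For orientation, `CosineSimilarityLoss` compares the cosine similarity of the two sentence embeddings with the label and, by default, penalises the squared difference. A dependency-free sketch of that computation for a single pair follows; the toy vectors are made up for illustration.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_mse_loss(u: list[float], v: list[float], label: float) -> float:
    """Squared error between the cosine similarity and the target label."""
    return (cosine_similarity(u, v) - label) ** 2


omics_vec = [1.0, 0.0]
caption_vec = [1.0, 0.0]  # points the same way as omics_vec
other_vec = [0.0, 1.0]    # orthogonal to omics_vec

print(cosine_mse_loss(omics_vec, caption_vec, label=1.0))  # prints: 0.0
print(cosine_mse_loss(omics_vec, other_vec, label=1.0))    # prints: 1.0
```

With `label` = 1 the loss pulls a pair together, and with `label` = 0 it pushes the cosine similarity towards zero.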

A dedicated training notebook will cover these steps in detail.