This example demonstrates how to use the `VectorizedMultinomialBinaryOnlineMongo` model.

It illustrates how to connect the model to a MongoDB backend, configure the connection parameters, and perform online updates on sparse data.
The example is intended to help users understand the workflow of the OnlineMongo models, including data representation, initialization, and incremental updates of worker reliability.


`VectorizedMultinomialBinaryOnlineMongo` is an online aggregation algorithm for crowdsourced labeling tasks, based on Expectation-Maximization (EM).
It is designed to handle sparse data stored in a MongoDB backend, enabling incremental updates.

Requirements

A running MongoDB instance is required to use any OnlineMongo-based model.
You can specify a custom MongoDB URI when initializing the model:
```python
from pymongo import MongoClient
from peerannot.models.aggregation.multinomial_binary_online import (
    VectorizedMultinomialBinaryOnlineMongo,
)

model = VectorizedMultinomialBinaryOnlineMongo(
    mongo_client=MongoClient("mongodb://mongo_instance:27017/")
)
```

If no client is provided, the model defaults to:
```python
mongo_client = MongoClient("mongodb://localhost:27017/")
```

Running MongoDB Locally

To run MongoDB locally using Docker:
```bash
docker run --name mongodb -p 27017:27017 -d mongodb/mongodb-community-server:latest
```

Data Representation

The model operates on sparse task x worker x class matrices represented as `sparse.COO` tensors.
This format ensures that only non-zero label assignments are stored, minimizing memory usage and improving computational efficiency.
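
As a rough illustration of what a coordinate (COO) layout stores, the sketch below builds the coordinate arrays by hand with NumPy for a toy batch. The helper name `batch_to_coo` and the index order (task, worker, class) are assumptions made for this example, not the library's internals. With the `sparse` package installed, `sparse.COO(coords, data, shape=shape)` would wrap the same arrays.

```python
import numpy as np


def batch_to_coo(batch):
    """Return (coords, data, shape) for a {task: {worker: label}} batch.

    Hypothetical helper: indices are assigned in order of first appearance.
    """
    tasks, workers, classes = {}, {}, {}
    coords = []
    for task, votes in batch.items():
        t = tasks.setdefault(task, len(tasks))
        for worker, label in votes.items():
            w = workers.setdefault(worker, len(workers))
            c = classes.setdefault(label, len(classes))
            coords.append((t, w, c))  # one stored entry per vote
    coords = np.array(coords, dtype=np.int64).T  # shape (3, nnz)
    data = np.ones(coords.shape[1], dtype=np.float64)
    shape = (len(tasks), len(workers), len(classes))
    return coords, data, shape


toy_batch = {
    "task_A": {"user_001": "Quercus robur", "user_002": "Betula pendula"},
    "task_B": {"user_003": "Pinus sylvestris"},
}
coords, data, shape = batch_to_coo(toy_batch)

# Only the votes are stored; a dense array would hold every cell.
dense_cells = int(np.prod(shape))
print(f"stored entries: {data.size}, dense cells: {dense_cells}")
```

Even in this tiny case the dense array has six times as many cells as there are votes, and the gap grows with the number of workers and classes.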

Online Updates

`VectorizedMultinomialBinaryOnlineMongo` supports online learning: it incrementally updates model parameters, in particular the worker reliability estimates (`pi`), without reprocessing the entire dataset.

Each worker is represented by a single scalar value corresponding to the current estimate of their reliability.

In [1]:

import numpy as np

from peerannot.models.aggregation.multinomial_binary_online import (
    VectorizedMultinomialBinaryOnlineMongo,
)

# Batch 1: 4 tasks
batch1 = {
    "task_A": {
        "user_001": "Quercus robur",
        "user_002": "Betula pendula",
    },
    "task_B": {
        "user_003": "Pinus sylvestris",
    },
    "task_C": {
        "user_001": "Fagus sylvatica",
        "user_004": "Quercus robur",
    },
    "task_D": {
        "user_002": "Betula pendula",
        "user_005": "Acer platanoides",
    },
}



# Batch 2: 5 tasks
batch2 = {
    "task_A": {
        "user_003": "Pinus sylvestris",
        "user_004": "Quercus robur",
        "user_005": "Quercus robur",
    },
    "task_B": {
        "user_002": "Pinus sylvestris",
        "user_005": "Pinus sylvestris",
    },
    "task_E": {
        "user_001": "Fagus sylvatica",
        "user_002": "Fagus sylvatica",
    },
    "task_F": {
        "user_004": "Tilia cordata",
    },
    "task_G": {
        "user_003": "Pinus sylvestris",
        "user_001": "Acer platanoides",
        "user_005": "Fagus sylvatica",
    },
}







In [2]:

model = VectorizedMultinomialBinaryOnlineMongo()
model.drop()  # clears all stored state for this model

model.process_batch(batch1)
model.get_answers()




2025-11-05 12:15:17.025 | DEBUG    | peerannot.helpers.logging:log_em_iter:26 - [EM] Iter 000 | L=-1.034318 | delta=inf | 10.276s
2025-11-05 12:15:17.034 | DEBUG    | peerannot.helpers.logging:log_em_iter:26 - [EM] Iter 001 | L=-0.786676 | delta=2.39e-01 | 0.009s
2025-11-05 12:15:17.043 | DEBUG    | peerannot.helpers.logging:log_em_iter:26 - [EM] Iter 002 | L=-0.613312 | delta=2.20e-01 | 0.008s
2025-11-05 12:15:17.052 | DEBUG    | peerannot.helpers.logging:log_em_iter:26 - [EM] Iter 003 | L=-0.517546 | delta=1.56e-01 | 0.009s
2025-11-05 12:15:17.061 | DEBUG    | peerannot.helpers.logging:log_em_iter:26 - [EM] Iter 004 | L=-0.487210 | delta=5.86e-02 | 0.009s
2025-11-05 12:15:17

array(['Betula pendula', 'Pinus sylvestris', 'Betula pendula',
       'Betula pendula'], dtype='<U16')

1. Batch preparation:
`process_batch` first prepares `batch_matrix`, a three-dimensional **one-hot encoded** array of shape n_tasks x n_workers x n_classes, together with batch-specific mappings that record the position of each task, worker, and class in the array.
In the example above, the array has shape 4 x 5 x 5, with the following mappings:

```python
task_mapping={'task_A': 0, 'task_B': 1, 'task_C': 2, 'task_D': 3}
worker_mapping={'user_001': 0, 'user_002': 1, 'user_003': 2, 'user_004': 3, 'user_005': 4}
class_mapping={'Quercus robur': 0, 'Betula pendula': 1, 'Pinus sylvestris': 2, 'Fagus sylvatica': 3, 'Acer platanoides': 4}
```
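
The mappings above are consistent with assigning indices in order of first appearance. The sketch below reproduces them on `batch1` and fills the dense one-hot array. `build_mappings` and `one_hot` are hypothetical names used only for this illustration; the model's own preparation methods may work differently.

```python
import numpy as np

batch1 = {
    "task_A": {"user_001": "Quercus robur", "user_002": "Betula pendula"},
    "task_B": {"user_003": "Pinus sylvestris"},
    "task_C": {"user_001": "Fagus sylvatica", "user_004": "Quercus robur"},
    "task_D": {"user_002": "Betula pendula", "user_005": "Acer platanoides"},
}


def build_mappings(batch):
    """Assign each task, worker and class the next free index when first seen."""
    task_mapping, worker_mapping, class_mapping = {}, {}, {}
    for task, votes in batch.items():
        task_mapping.setdefault(task, len(task_mapping))
        for worker, label in votes.items():
            worker_mapping.setdefault(worker, len(worker_mapping))
            class_mapping.setdefault(label, len(class_mapping))
    return task_mapping, worker_mapping, class_mapping


def one_hot(batch, task_mapping, worker_mapping, class_mapping):
    """Dense n_tasks x n_workers x n_classes array with a 1 for every vote."""
    matrix = np.zeros(
        (len(task_mapping), len(worker_mapping), len(class_mapping))
    )
    for task, votes in batch.items():
        for worker, label in votes.items():
            matrix[
                task_mapping[task], worker_mapping[worker], class_mapping[label]
            ] = 1.0
    return matrix


task_mapping, worker_mapping, class_mapping = build_mappings(batch1)
matrix = one_hot(batch1, task_mapping, worker_mapping, class_mapping)
print(matrix.shape, int(matrix.sum()))
```

The array has 100 cells but only seven of them are non-zero, one per vote in `batch1`.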


2. Soft vote calculation:
The `batch_matrix` input array (n_tasks x n_workers x n_classes) is collapsed across workers to compute `batch_T`, a 2D array (n_tasks x n_classes) containing per-task class probability distributions.

Each row of `batch_T` is the normalized frequency of worker votes for that task, i.e. a soft vote (label proportions).
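
A minimal NumPy sketch of this soft vote, on a small hand-built tensor. The function name `soft_votes` is an assumption for illustration, and the guard against tasks without votes is a choice made here rather than documented behavior.

```python
import numpy as np

# Toy one-hot tensor: 2 tasks x 3 workers x 3 classes.
batch_matrix = np.zeros((2, 3, 3))
batch_matrix[0, 0, 0] = 1  # task 0: worker 0 votes class 0
batch_matrix[0, 1, 1] = 1  # task 0: worker 1 votes class 1
batch_matrix[0, 2, 0] = 1  # task 0: worker 2 votes class 0
batch_matrix[1, 2, 2] = 1  # task 1: worker 2 votes class 2


def soft_votes(batch_matrix):
    """Collapse the worker axis and normalize each task row to sum to 1."""
    counts = batch_matrix.sum(axis=1)            # (n_tasks, n_classes)
    totals = counts.sum(axis=1, keepdims=True)   # votes received per task
    # Tasks without any vote keep an all-zero row instead of dividing by zero.
    return np.divide(
        counts, totals, out=np.zeros_like(counts), where=totals > 0
    )


batch_T = soft_votes(batch_matrix)
print(batch_T)
```

Task 0 receives two votes for class 0 and one for class 1, so its row is (2/3, 1/3, 0); task 1 has a single vote and its row is (0, 0, 1).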


3. Iterative EM updates:
The prepared `batch_matrix` and `batch_T` feed an iterative EM loop that runs until convergence or until `maxiter` iterations are reached.
Each iteration performs:
   1. M-step: Given the current posteriors (`batch_T`), estimate `batch_rho` (class prior distribution) and `batch_pi` (worker reliability).
   2. E-step: Given the new parameters (`batch_pi`, `batch_rho`), recompute the posteriors (`batch_T`).
   3. Log-likelihood calculation: the log of the total probability of the observed labels under the current parameters.
   4. Convergence check.
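
The four steps can be sketched in plain NumPy. This is not the library's implementation: it assumes a one-coin model in which worker j answers correctly with probability `pi[j]` and otherwise picks one of the remaining classes uniformly, it sums per-task log evidence for the log-likelihood, and it assumes every task has at least one vote. The function names are hypothetical.

```python
import numpy as np


def m_step(X, T):
    """X: (n_tasks, n_workers, n_classes) one-hot votes; T: (n_tasks, n_classes)."""
    rho = T.mean(axis=0)                      # class priors
    agree = np.einsum("ijk,ik->j", X, T)      # expected correct votes per worker
    n_votes = X.sum(axis=(0, 2))              # votes cast per worker
    pi = np.divide(agree, n_votes, out=np.full_like(agree, 0.5), where=n_votes > 0)
    return rho, np.clip(pi, 1e-6, 1 - 1e-6)   # keep the logs finite


def e_step(X, pi, rho):
    n_classes = X.shape[2]
    voted = X.sum(axis=2, keepdims=True)      # 1 where worker j labeled task i
    log_hit = np.log(pi)[None, :, None]
    log_miss = np.log((1 - pi) / (n_classes - 1))[None, :, None]
    # log P(votes on task i | true class k), summed over the workers who voted
    log_lik = (X * log_hit + (voted - X) * log_miss).sum(axis=1)
    joint = rho[None, :] * np.exp(log_lik)
    denom = joint.sum(axis=1, keepdims=True)
    return joint / denom, denom.ravel()


def run_em(X, maxiter=50, epsilon=1e-6):
    counts = X.sum(axis=1)
    T = counts / counts.sum(axis=1, keepdims=True)  # soft-vote initialization
    ll, eps, i = [], np.inf, 0
    while i < maxiter and eps > epsilon:
        rho, pi = m_step(X, T)
        T, denom = e_step(X, pi, rho)
        ll.append(np.sum(np.log(denom)))
        if i > 0:
            eps = abs((ll[-1] - ll[-2]) / (abs(ll[-2]) + 1e-12))
        i += 1
    return T, rho, pi, ll


# Toy data: workers 0 and 1 always agree, worker 2 always disagrees.
votes = [(0, 0, 0), (0, 1, 0), (0, 2, 1),
         (1, 0, 1), (1, 1, 1), (1, 2, 0),
         (2, 0, 0), (2, 1, 0), (2, 2, 1)]
X = np.zeros((3, 3, 2))
for t, w, c in votes:
    X[t, w, c] = 1.0

T, rho, pi, ll = run_em(X)
print(T.argmax(axis=1), np.round(pi, 3))
```

On this toy input the two agreeing workers end up with a higher reliability estimate than the dissenting one, and the posteriors follow the majority.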


4. Online update: After the loop, the updated estimates (`batch_T`, `batch_rho`, and `batch_pi`) are pushed into the store.

The model is ready to process another batch:


In [3]:

model.process_batch(batch2)
model.get_answers()



2025-11-05 12:15:17.246 | DEBUG    | peerannot.helpers.logging:log_em_iter:26 - [EM] Iter 000 | L=-0.644996 | delta=inf | 0.009s
2025-11-05 12:15:17.256 | DEBUG    | peerannot.helpers.logging:log_em_iter:26 - [EM] Iter 001 | L=-0.470418 | delta=2.71e-01 | 0.009s
2025-11-05 12:15:17.266 | DEBUG    | peerannot.helpers.logging:log_em_iter:26 - [EM] Iter 002 | L=-0.458180 | delta=2.60e-02 | 0.009s
2025-11-05 12:15:17.276 | DEBUG    | peerannot.helpers.logging:log_em_iter:26 - [EM] Iter 003 | L=-0.452681 | delta=1.20e-02 | 0.009s
2025-11-05 12:15:17.285 | DEBUG    | peerannot.helpers.logging:log_em_iter:26 - [EM] Iter 004 | L=-0.445831 | delta=1.51e-02 | 0.009s
2025-11-05 12:15:17.

array(['Betula pendula', 'Pinus sylvestris', 'Betula pendula',
       'Betula pendula', 'Fagus sylvatica', 'Tilia cordata',
       'Fagus sylvatica'], dtype='<U16')

When `model.process_batch(batch2)` is called above, the following happens:

1. Batch preparation:
Just like with `batch1`, the new batch (`batch2`) is transformed into a 3D one-hot encoded array `batch_matrix` (n_tasks x n_workers x n_classes). Mappings for tasks, workers, and classes are created or updated to reflect the current batch.

2. Soft vote calculation:
`batch_matrix` is collapsed along the worker dimension to compute `batch_T`, a 2D array (n_tasks x n_classes). This gives the soft label distribution for the new tasks.

2a. Blending with previous estimates:
If previous task-class probabilities exist (from `batch1`), the new estimates are blended with the stored ones using a convex combination with weight `gamma`. This ensures continuity and prevents overwriting prior knowledge, allowing the model to update incrementally.
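
A convex combination with weight `gamma` can be sketched as follows. Which side receives `gamma` is an assumption here (the new batch estimate), the probability values are illustrative rather than taken from a model run, and both vectors are assumed to be aligned on the same class indices.

```python
import numpy as np


def blend(stored, batch, gamma):
    """Convex combination: (1 - gamma) * stored + gamma * batch.

    Assumption: gamma weights the new batch estimate.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    return (1.0 - gamma) * stored + gamma * batch


stored_T = np.array([0.5, 0.5, 0.0])        # illustrative stored distribution
new_T = np.array([2 / 3, 0.0, 1 / 3])       # illustrative soft vote from a new batch
blended = blend(stored_T, new_T, gamma=0.25)
print(blended, blended.sum())
```

Because both inputs are probability vectors and the weights sum to one, the blended row is again a probability vector. A small `gamma` keeps the result close to the stored estimate, while `gamma = 1` would replace it outright.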


3. Iterative EM updates:
As previously, the EM loop runs on the new batch:

   1. M-step: Given current posteriors (`batch_T`), estimate batch-specific parameters `batch_rho` (class priors) and `batch_pi` (worker reliabilities).
   2. E-step: Recompute task posteriors (`batch_T`) using updated parameters.
   3. Log-likelihood: Calculate the probability of observed labels under current parameters to monitor convergence.
   4. Convergence check: Stop if the maximum number of iterations is reached or the change in log-likelihood is below a threshold.
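
The stopping quantity can be checked by hand. The sketch below applies the relative-change formula used in the `em_trace` cell of this notebook to the log-likelihood values logged for `batch2`. The function name `relative_change` is introduced only for this example.

```python
def relative_change(previous, current, tiny=1e-12):
    """Relative change between two consecutive log-likelihood values."""
    return abs((current - previous) / (abs(previous) + tiny))


# L values copied from the [EM] log lines printed for batch2.
log_likelihoods = [-0.644996, -0.470418, -0.458180, -0.452681, -0.445831]

deltas = [
    relative_change(previous, current)
    for previous, current in zip(log_likelihoods, log_likelihoods[1:])
]
formatted = [f"{delta:.2e}" for delta in deltas]
print(formatted)
```

The printed values line up with the `delta` column of iterations 001 to 004 in that log, which suggests that the logged `delta` is this relative change. The loop stops once it falls below `epsilon` or `maxiter` is reached.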

4. Online update:
Updated probabilities (`batch_T`, `batch_rho`, `batch_pi`) are merged with the stored model state in the database using a convex combination with weight `gamma`.
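
To make the merge concrete, the sketch below uses a plain dictionary as a stand-in for the database and blends per-worker reliabilities. The function `merge_into_store`, the handling of unseen workers, and all numeric values are assumptions for illustration; the model's actual MongoDB update logic is not shown in this notebook.

```python
def merge_into_store(store, batch_values, gamma):
    """Blend batch estimates into a dict-based store keyed by worker id."""
    for key, value in batch_values.items():
        if key in store:
            # Known worker: convex combination of stored and batch estimate.
            store[key] = (1.0 - gamma) * store[key] + gamma * value
        else:
            # New worker: nothing to blend with, keep the batch estimate.
            store[key] = value
    return store


stored_pi = {"user_001": 0.8, "user_002": 0.6}   # illustrative stored state
batch_pi = {"user_002": 1.0, "user_006": 0.7}    # illustrative batch estimates

merge_into_store(stored_pi, batch_pi, gamma=0.5)
print(stored_pi)
```

Workers absent from the batch (`user_001`) keep their stored value, workers seen in both are pulled toward the batch estimate, and new workers are added as-is.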

Answer retrieval:
`model.get_answers()` returns the current task-class predictions, reflecting both the new batch (`batch2`) and prior knowledge from `batch1`.

The following code cell shows how the model works from the inside out.

We start by initializing the model and preparing mappings for tasks, workers, and classes. The batch is then transformed into a one-hot encoded matrix, which serves as input to the EM algorithm.

A custom EM loop (`em_trace`) is implemented to iteratively estimate:
- `batch_rho`: class priors
- `batch_pi`: worker reliabilities
- `batch_T`: posterior probabilities for each task

The loop tracks the log-likelihood at each iteration to test convergence. After convergence, an **online update** merges the batch estimates into the model.

In [4]:



def em_trace(model, batch_matrix, task_mapping, worker_mapping, class_mapping,
             maxiter=50, epsilon=1e-6, prev_globals=None):
    """Run EM for one batch, tracking the log-likelihood, then apply the online update.

    `prev_globals` is currently unused.
    """
    # Initial posteriors for the tasks in this batch.
    batch_T = model._init_T(batch_matrix, task_mapping, class_mapping)

    i, eps, ll = 0, np.inf, []

    while i < maxiter and eps > epsilon:
        # M-step: class priors and worker reliabilities from the current posteriors.
        batch_rho, batch_pi = model._m_step(batch_matrix, batch_T)
        # E-step: recompute the posteriors under the new parameters.
        batch_T, batch_denom = model._e_step(batch_matrix, batch_pi, batch_rho)
        likeli = np.log(np.sum(batch_denom))
        ll.append(likeli)

        # Relative change in log-likelihood between consecutive iterations.
        if i > 0:
            eps = np.abs((ll[-1] - ll[-2]) / (np.abs(ll[-2]) + 1e-12))
        i += 1

    # Online update after convergence
    model._online_update(task_mapping, worker_mapping, class_mapping, batch_T, batch_rho, batch_pi)



def prepare_batch(model, batch):
    """Prepares a single batch and returns (matrix, mappings)."""
    task_mapping, worker_mapping, class_mapping = {}, {}, {}
    model._prepare_mapping(batch, task_mapping, worker_mapping, class_mapping)
    
    # ensure indices exist in global mappings
    model.get_or_create_indices(model.task_mapping, list(task_mapping))
    model.get_or_create_indices(model.worker_mapping, list(worker_mapping))
    model.get_or_create_indices(model.class_mapping, list(class_mapping))

    batch_matrix = model._process_batch_to_matrix(batch, task_mapping, worker_mapping, class_mapping)
    mappings = {
        "task_mapping": task_mapping.copy(),
        "worker_mapping": worker_mapping.copy(),
        "class_mapping": class_mapping.copy(),
    }
    return batch_matrix, mappings


def run_em_for_batches(model, batches, maxiter=50):
    """Runs EM sequentially for multiple batches, carrying over global parameters."""
    model.drop()
    model.t = 1
    prev_globals = {}

    for idx, batch in enumerate(batches, 1):
        batch_matrix, mappings = prepare_batch(model, batch)
        em_trace(
            model,
            batch_matrix,
            mappings["task_mapping"],
            mappings["worker_mapping"],
            mappings["class_mapping"],
            maxiter=maxiter,
            prev_globals=prev_globals,
        )
        
        print(f"Finished batch {idx}")

model = VectorizedMultinomialBinaryOnlineMongo()
run_em_for_batches(model, [batch1, batch2], maxiter=50)

model.get_answers()

2025-11-05 12:15:17.707 | DEBUG    | peerannot.helpers.logging:mongo_timer:16 - [Mongo] online update class probs took 0.009s
2025-11-05 12:15:17.929 | DEBUG    | peerannot.helpers.logging:mongo_timer:16 - [Mongo] online update class probs took 0.002s


Finished batch 1
Finished batch 2


array(['Betula pendula', 'Pinus sylvestris', 'Betula pendula',
       'Betula pendula', 'Fagus sylvatica', 'Pinus sylvestris',
       'Pinus sylvestris'], dtype='<U16')

The following cell visualizes the behavior of the `VectorizedMultinomialBinaryOnlineMongo` model during online updates.
It shows how the estimated worker reliability evolves over time as new labeling data is incorporated.


In [5]:
from peerannot.helpers.visualization import visualize_model


visualize_model(model=VectorizedMultinomialBinaryOnlineMongo, maxiter=50, batches=[batch1, batch2])

2025-11-05 12:15:18.484 | DEBUG    | peerannot.helpers.logging:mongo_timer:16 - [Mongo] online update class probs took 0.017s


Finished batch 1 (iterations: 16)


2025-11-05 12:15:19.020 | DEBUG    | peerannot.helpers.logging:mongo_timer:16 - [Mongo] online update class probs took 0.001s


Finished batch 2 (iterations: 20)
