DIGER: Differentiable Semantic ID for Generative Recommendation

This work has been accepted as a full paper at SIGIR 2026 (the International ACM SIGIR Conference on Research and Development in Information Retrieval).

Overview

This repository contains the core implementation of our recommendation model with two training strategies:

Frequency-based Uncertainty Decay: Dynamically switches between Gumbel sampling and deterministic indexing based on code usage frequency
Standard Deviation Uncertainty Decay: Uses a learnable uncertainty parameter (σ) to weight the task loss automatically

Roadmap (TODO)

Current status: Due to time constraints, the code here is an illustrative reference implementation. It sketches the model and training flow but is not yet the full end-to-end release we intend to ship.

Target: Before SIGIR 2026 (conference begins July 20, 2026), we plan to publish:

Runnable code — including configs and pre-trained checkpoints needed to reproduce the paper
Dataset — processed data and embeddings, with setup instructions in the README

Repository Structure

DIGER/
├── main.py                              # Main training entry point
├── vq.py                                # Vector Quantization (RQ-VAE) implementation
├── trainer.py                           # Training loop and loss computation
├── model.py                             # Recommender model architecture
├── data.py                              # Data loading utilities
├── utils.py                             # Helper functions
├── metrics.py                           # Evaluation metrics
├── layers.py                            # Neural network layers
├── config/
│   └── beauty_jo.yaml                  # Configuration file for Beauty dataset
├── accelerate_config.yaml              # Accelerate configuration
├── run_FrqUD.sh                        # Training script: Frequency-based Uncertainty Decay
└── run_SDUD.sh                         # Training script: Standard Deviation Uncertainty Decay

Requirements

Dependencies

pip install torch transformers accelerate pyyaml numpy faiss-cpu scikit-learn colorama tqdm

Python and PyTorch Versions

Python 3.12.11
PyTorch 2.5.1

Data Preparation

1. Dataset Structure

Organize your dataset in the following structure:

dataset/
└── beauty/
    ├── beauty.train.inter
    ├── beauty.valid.inter
    ├── beauty.test.inter
    └── Beauty.emb-llama.npy    # Semantic embeddings

2. Data Format

Interaction files (.inter): Tab-separated values with columns user_id:token, item_id:token, timestamp:float
Semantic embeddings (.npy): NumPy array of shape [num_items, embedding_dim]
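
As a quick sanity check on the interaction format described above, the following sketch parses .inter content using only the Python standard library. The helper name read_inter and the sample rows are illustrative assumptions and are not part of this repository; the actual loader in data.py may work differently.

```python
import csv
import io


def read_inter(handle):
    """Parse tab-separated .inter content into (user_id, item_id, timestamp) tuples.

    Hypothetical helper for illustration. It assumes only the header
    user_id:token, item_id:token, timestamp:float described above.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    rows = []
    for record in reader:
        rows.append((
            record["user_id:token"],
            record["item_id:token"],
            float(record["timestamp:float"]),
        ))
    return rows


# Made-up sample content in the documented format.
sample = (
    "user_id:token\titem_id:token\ttimestamp:float\n"
    "u1\ti10\t1400000000.0\n"
    "u1\ti42\t1400000100.0\n"
    "u2\ti10\t1400000200.0\n"
)

interactions = read_inter(io.StringIO(sample))
num_users = len({user for user, _, _ in interactions})
num_items = len({item for _, item, _ in interactions})
print(num_users, num_items)
```

To read a real file, pass a handle opened with open(path, newline="", encoding="utf-8") instead of the in-memory sample.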

3. Pre-trained RQ-VAE Checkpoint

You need a pre-trained RQ-VAE checkpoint. The checkpoint should contain:

Encoder weights
Residual Quantization (RQ) codebooks
Decoder weights (optional, can be frozen)

Configuration

Update Paths

Before running, update the placeholder paths in the following files:

Shell scripts (run_FrqUD.sh, run_SDUD.sh):

RQVAE_INIT="<PATH_TO_RQVAE_CHECKPOINT>"  # Update this

Config file (config/beauty_jo.yaml):

semantic_emb_path: <PATH_TO_DATASET>/beauty/Beauty.emb-llama.npy  # Update this
rqvae_path: <PATH_TO_RQVAE_CHECKPOINT>  # Update this
data_path: ./dataset  # Update if needed
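
Because a leftover placeholder is easy to miss, a small check like the one below can flag any unresolved path before training starts. This is a minimal sketch: the function find_placeholders is a hypothetical helper, not part of the repository, and the regular expression simply assumes placeholders follow the <PATH_TO_...> pattern shown above.

```python
import re

# Matches tokens such as <PATH_TO_DATASET> or <PATH_TO_RQVAE_CHECKPOINT>.
PLACEHOLDER = re.compile(r"<PATH_TO_[A-Z_]+>")


def find_placeholders(text):
    """Return (line_number, placeholder) pairs for every unresolved token."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for match in PLACEHOLDER.finditer(line):
            hits.append((number, match.group(0)))
    return hits


# Example input mirroring the config fragment above.
config_text = (
    "semantic_emb_path: <PATH_TO_DATASET>/beauty/Beauty.emb-llama.npy\n"
    "rqvae_path: <PATH_TO_RQVAE_CHECKPOINT>\n"
    "data_path: ./dataset\n"
)

remaining = find_placeholders(config_text)
for number, token in remaining:
    print(f"line {number}: {token} still needs a real path")
```

The same function can be applied to the text of run_FrqUD.sh, run_SDUD.sh, and config/beauty_jo.yaml, for example via pathlib.Path(...).read_text().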

Usage

Training Script 1: Frequency-based Uncertainty Decay

This script uses adaptive selection to dynamically choose between Gumbel sampling (for popular codes) and deterministic indexing (for rare codes).

bash run_FrqUD.sh
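
The switching rule described above can be sketched in plain Python. This is only an illustration under stated assumptions: the threshold min_count, the function names, and the choice to test the usage count of the highest-scoring code are all hypothetical, and the Gumbel-max trick stands in for whatever Gumbel sampling vq.py actually performs.

```python
import math
import random


def gumbel_noise(rng):
    """Draw one sample from the standard Gumbel(0, 1) distribution."""
    u = rng.random()
    # Clamp away from 0 and 1 so both logarithms stay finite.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def select_code(logits, usage_counts, min_count, rng):
    """Pick a codebook index for one item.

    logits: one score per code (e.g. negative distance to each code vector).
    usage_counts: how often each code has been assigned so far.
    min_count: assumed threshold separating popular codes from rare ones.
    """
    greedy = max(range(len(logits)), key=lambda k: logits[k])
    if usage_counts[greedy] < min_count:
        # Rare code: keep the deterministic nearest-code assignment.
        return greedy, "deterministic"
    # Popular code: Gumbel-max sampling, i.e. argmax of logits plus Gumbel noise,
    # which draws an index from softmax(logits).
    noisy = [score + gumbel_noise(rng) for score in logits]
    sampled = max(range(len(noisy)), key=lambda k: noisy[k])
    return sampled, "gumbel"


rng = random.Random(0)
rare_choice = select_code([0.1, 2.0, -1.0], [50, 3, 7], min_count=10, rng=rng)
popular_choice = select_code([0.1, 2.0, -1.0], [50, 30, 7], min_count=10, rng=rng)
print(rare_choice, popular_choice)
```

In a differentiable setting, the hard argmax in the Gumbel branch would typically be replaced by a relaxed or straight-through estimator; that part is omitted here.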

Training Script 2: Standard Deviation Uncertainty Decay

This script uses a learnable uncertainty parameter σ to weight the task loss automatically.

bash run_SDUD.sh

Loss formula:

L = L_task / (2*(σ+λ)²) + log(σ+λ)

At equilibrium (setting ∂L/∂σ = 0 with L_task held fixed, which gives (σ+λ)² = L_task): σ = sqrt(L_task) - λ
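
The equilibrium can be checked numerically. The sketch below implements the loss formula above with the standard library, treats L_task as a constant, and runs plain gradient descent on σ. The numeric values, the step size, and the function names are illustrative assumptions, not settings from the paper or from trainer.py.

```python
import math


def sdud_loss(task_loss, sigma, lam):
    """L = L_task / (2 * (sigma + lam)^2) + log(sigma + lam)."""
    s = sigma + lam
    return task_loss / (2.0 * s * s) + math.log(s)


def sdud_grad_sigma(task_loss, sigma, lam):
    """Analytic derivative dL/dsigma, treating task_loss as a constant."""
    s = sigma + lam
    return -task_loss / s ** 3 + 1.0 / s


# Illustrative values only.
task_loss, lam = 4.0, 0.5
sigma = 1.0
for _ in range(5000):
    sigma -= 0.05 * sdud_grad_sigma(task_loss, sigma, lam)

# Closed-form equilibrium from the formula above.
expected = math.sqrt(task_loss) - lam
print(sigma, expected)
```

In actual training L_task changes from step to step, so σ tracks a moving target rather than settling at a single value.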

Output

Logs

Training logs are saved to ./logs/<dataset>/ with timestamps.

Checkpoints

Model checkpoints are saved to ./myckpt/<dataset>/ and include:

best_model.pth: Best model based on validation metric
Training statistics and metrics

Metrics

The model is evaluated on:

Recall@5, Recall@10
NDCG@5, NDCG@10

Validation metric: NDCG@10
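
For reference, the two metric families can be computed as in the sketch below. It assumes a single held-out target item per user, which is common in next-item recommendation but is an assumption here; the function names and sample rankings are illustrative, and metrics.py may be implemented differently.

```python
import math


def recall_at_k(ranked_items, target, k):
    """1.0 if the target appears in the top-k ranked items, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0


def ndcg_at_k(ranked_items, target, k):
    """NDCG@k for a single relevant item.

    With one relevant item the ideal DCG is 1, so NDCG reduces to
    1 / log2(rank + 2), where rank is the 0-based position of the target.
    """
    top_k = ranked_items[:k]
    if target not in top_k:
        return 0.0
    rank = top_k.index(target)
    return 1.0 / math.log2(rank + 2)


# Made-up rankings for two users, each paired with its held-out target item.
predictions = [
    (["i7", "i3", "i9", "i1", "i5"], "i9"),  # target at 0-based rank 2
    (["i2", "i8", "i4", "i6", "i0"], "i5"),  # target not retrieved
]

mean_recall = sum(recall_at_k(r, t, 5) for r, t in predictions) / len(predictions)
mean_ndcg = sum(ndcg_at_k(r, t, 5) for r, t in predictions) / len(predictions)
print(mean_recall, mean_ndcg)
```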
