matstech/dmf-benchmarks

DMF Benchmarks

This repository contains the benchmark evaluation suite for dmf, a specialized memory system for conversational AI. It compares dmf's performance and resource footprint against Mem0.

It implements isolated end-to-end evaluation pipelines for two standard long-term memory benchmarks: LoCoMo (Long-Context Memory Benchmark) and LongMemEval.

Native Approach

The pipelines in this repository default to a native approach. Unlike strict benchmark setups that reconstruct rigid session contexts or historical dialogue transcripts, the native approach directly queries the memory framework's natural storage surface.

Specifically, during question answering, the framework-native context is retrieved and formatted into a minimal context-injection prompt for the answerer model (following the official guidelines and examples of each benchmark framework). This gives a realistic assessment of each framework's retrieval quality.
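As a rough illustration of what a minimal context-injection prompt can look like, the sketch below formats a list of retrieved memory entries and a question into a single prompt string. The function name `build_native_prompt` and the prompt wording are assumptions made for this example; the actual templates in this repository follow each benchmark's guidelines and may differ.

```python
def build_native_prompt(memories: list[str], question: str) -> str:
    """Format retrieved memory entries and a question into a minimal QA prompt.

    Hypothetical helper: the wording below is illustrative, not the
    template used by the benchmark pipelines.
    """
    if memories:
        # One bullet per memory entry, in the order the backend returned them.
        context = "\n".join(f"- {m}" for m in memories)
    else:
        context = "(no memories retrieved)"
    return (
        "Use the memories below to answer the question.\n\n"
        f"Memories:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


if __name__ == "__main__":
    print(build_native_prompt(["Alice moved to Berlin in 2023."],
                              "Where does Alice live?"))
```

The prompt carries only the memory surface and the question, with no reconstructed session transcript, which is the point of the native approach.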

Dependencies

This benchmark requires the following key memory frameworks:

  • dmf: The benchmark targets the official version of dmf-memory.
  • mem0: The benchmark uses a custom fork of Mem0. The fork adds internal telemetry instrumentation that tracks prompt and completion tokens, API call counts, and execution metrics for the LLM and embedding operations inside Mem0. This data is required for rigorous resource-comparative reporting.
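To make the kind of telemetry described above concrete, here is a minimal sketch of a counter that accumulates token usage and API call counts. The class `UsageTelemetry` and its fields are hypothetical and are not taken from the Mem0 fork; they only show the bookkeeping that such instrumentation needs.

```python
from dataclasses import dataclass


@dataclass
class UsageTelemetry:
    """Hypothetical accumulator for LLM and embedding usage."""

    prompt_tokens: int = 0
    completion_tokens: int = 0
    api_calls: int = 0

    def record(self, prompt_tokens: int, completion_tokens: int = 0) -> None:
        # Each call to record() represents one API request.
        self.prompt_tokens += prompt_tokens
        self.completion_tokens += completion_tokens
        self.api_calls += 1

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens


if __name__ == "__main__":
    telemetry = UsageTelemetry()
    telemetry.record(prompt_tokens=120, completion_tokens=30)  # an LLM call
    telemetry.record(prompt_tokens=8)  # an embedding call has no completion
    print(telemetry)
```

Embedding requests produce no completion tokens, so the second argument defaults to zero.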

Environment

The evaluation pipelines rely on LLM providers for generating answers and executing LLM-as-a-judge scoring. By default, the benchmark is configured to use OpenAI (defaulting to gpt-4.1-mini for the answerer and gpt-5-mini for the judge).

You can instead use OpenRouter or a local Ollama instance by overriding the provider variables (see Configuration Variables).

To configure your environment, create a .env file in the root of the project (which will be loaded automatically) or export the variables in your shell:

1. OpenAI (Default)

To use the default OpenAI models:

OPENAI_API_KEY=your_openai_api_key_here
# Optional:
# OPENAI_BASE_URL=https://api.openai.com/v1

2. OpenRouter (Alternative)

To route calls through OpenRouter:

OPENROUTER_API_KEY=your_openrouter_api_key_here
OPENROUTER_BASE_URL=https://openrouter.ai/api/v1

3. Ollama (Alternative)

For local, offline open-source models:

OLLAMA_BASE_URL=http://localhost:11434
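The sketch below shows one way such variables could be resolved into connection settings. The environment variable names and URLs come from the examples above; the function `resolve_provider`, the mapping, and the provider identifiers `openrouter` and `ollama` are assumptions for illustration (only `openai` is confirmed as a provider value in this README).

```python
import os

# (API key variable, base URL variable, fallback base URL).
# Variable names and URLs mirror the .env examples; the mapping is illustrative.
PROVIDER_ENV = {
    "openai": ("OPENAI_API_KEY", "OPENAI_BASE_URL", "https://api.openai.com/v1"),
    "openrouter": ("OPENROUTER_API_KEY", "OPENROUTER_BASE_URL",
                   "https://openrouter.ai/api/v1"),
    "ollama": (None, "OLLAMA_BASE_URL", "http://localhost:11434"),
}


def resolve_provider(provider: str, env=None) -> dict:
    """Return API key and base URL for a provider (hypothetical helper)."""
    env = os.environ if env is None else env
    if provider not in PROVIDER_ENV:
        raise ValueError(f"unknown provider: {provider}")
    key_var, url_var, default_url = PROVIDER_ENV[provider]
    api_key = env.get(key_var) if key_var else None
    if key_var and not api_key:
        # Fail early rather than at the first API request.
        raise RuntimeError(f"{key_var} is not set")
    return {
        "provider": provider,
        "api_key": api_key,
        "base_url": env.get(url_var, default_url),
    }
```

Passing a plain dictionary as `env` makes the helper easy to check without touching the real environment, for example `resolve_provider("ollama", {})`.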

Implemented Pipeline

The evaluation pipeline runs through four major stages:

           +---------------------------------------------+
           |              Benchmark Dataset              |
           +----------------------+----------------------+
                                  |
                                  | (sequential turns)
                                  v
           +---------------------------------------------+
           |                 Ingestion                   |
           |  - DMF: TemporalMemory engine direct writes |
           |  - Mem0: sequential profile additions       |
           +----------------------+----------------------+
                                  |
                                  | (memory state)
                                  v
           +---------------------------------------------+
           |               Native Context                |
           |  - Extract native memory surface from store |
           |  - Build minimal context-injection prompt   |
           +----------------------+----------------------+
                                  |
                                  | (context & question)
                                  v
           +---------------------------------------------+
           |                Answer Phase                 |
           |  - Query LLM (gpt-4.1-mini) for predictions |
           |  - Tracks inference token usage & latency   |
           +----------------------+----------------------+
                                  |
                                  | (predictions)
                                  v
           +---------------------------------------------+
           |                 Evaluation                  |
           |  - LLM Judge (gpt-5-mini) grades predictions|
           |  - Rigorous resource & quality report logs  |
           +---------------------------------------------+
  1. Ingestion: Dialog turns or interaction logs are loaded sequentially into the memory backend. For dmf, ingestion writes directly to the TemporalMemory storage layer. For Mem0, turns are written sequentially through its standard user memory interface.
  2. Native Context Construction: The pipeline queries the memory backend to fetch the native memory context surface relevant to the current user state. This context is embedded directly in a minimal, natural QA prompt.
  3. Answer Generation: The prompt is sent to the answerer model (defaulting to gpt-4.1-mini) to produce candidate answers. Prompt tokens and completion tokens are recorded.
  4. Evaluation: A stronger LLM judge (defaulting to gpt-5-mini) compares the generated answers against ground-truth solutions. In addition to accuracy, a rigorous evaluation module computes system metrics and generates structured quality and resource reports.
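The four stages above can be sketched as a single loop. Everything in this example is a stand-in: `KeywordMemory` is a toy backend used in place of dmf or Mem0, and the `answerer` and `judge` callables replace the real LLM calls. It shows only the control flow, not the repository's implementation.

```python
from typing import Callable


def _tokens(text: str) -> set[str]:
    # Lowercased words with surrounding punctuation stripped.
    return {w.strip("?.,!").lower() for w in text.split()}


class KeywordMemory:
    """Toy in-memory backend; a stand-in for dmf or Mem0 in this sketch."""

    def __init__(self) -> None:
        self.turns: list[str] = []

    def add(self, turn: str) -> None:
        self.turns.append(turn)

    def native_context(self, question: str) -> list[str]:
        # Return stored turns that share at least one word with the question.
        q = _tokens(question)
        return [t for t in self.turns if q & _tokens(t)]


def run_pipeline(
    backend: KeywordMemory,
    turns: list[str],
    qa_pairs: list[tuple[str, str]],
    answerer: Callable[[str], str],
    judge: Callable[[str, str], bool],
) -> float:
    for turn in turns:  # 1. Ingestion
        backend.add(turn)
    correct = 0
    for question, gold in qa_pairs:
        context = backend.native_context(question)  # 2. Native context
        prompt = "\n".join(context) + f"\nQuestion: {question}\nAnswer:"
        prediction = answerer(prompt)  # 3. Answer generation
        correct += bool(judge(prediction, gold))  # 4. Evaluation
    return correct / len(qa_pairs) if qa_pairs else 0.0
```

Keeping the answerer and judge as plain callables lets the same loop run against either memory framework, which is what makes the comparison between dmf and Mem0 like-for-like.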

Makefile Commands

A Makefile is provided to run different stages of the benchmarks. The commands use Poetry for environment and dependency isolation.

Project Setup

  • make lock: Regenerate the poetry.lock file to update dependencies.
  • make install: Install the project and all framework dependencies in the local virtual environment.

Running LoCoMo

  • make locomo: Run the entire end-to-end native LoCoMo benchmark including ingestion, prediction, and judge evaluation.
  • make locomo-judge: Run only the LLM judge scoring on already saved predictions, bypassing ingestion and answer generation.
  • make locomo-rigorous: Run only the rigorous secondary metric reports on existing outputs to compile final comparison summaries.

Running LongMemEval

  • make longmemeval: Run the entire end-to-end native LongMemEval benchmark.
  • make longmemeval-judge: Run only the LLM judge evaluation on existing LongMemEval predictions.
  • make longmemeval-rigorous: Run only the rigorous evaluation and reporting scripts on existing LongMemEval outputs.

Configuration Variables

You can customize the execution by overriding variables on the command line when running any make target.

General Variables

  • POETRY: Command to invoke poetry. Defaults to poetry.
  • BENCHMARKS_DIR: Root path of the benchmarks. Defaults to . (the current directory).
  • ANSWERER_PROVIDER: LLM API provider for generating candidate answers. Defaults to openai.
  • ANSWERER_MODEL: LLM model name for generating answers. Defaults to gpt-4.1-mini.
  • JUDGE_PROVIDER: LLM API provider for scoring. Defaults to openai.
  • JUDGE_MODEL: LLM model name for scoring candidate answers. Defaults to gpt-5-mini.

LoCoMo Variables

  • LOCOMO_PROJECT: Isolated project directory name for saved results. Defaults to locomo_dmf.
  • LOCOMO_FRAMEWORK: Memory framework under evaluation (dmf or mem0). Defaults to dmf.
  • LOCOMO_CONFIG: Settings configuration path. Defaults to config/locomo_dmf_settings.toml.
  • LOCOMO_CONVERSATION_IDS: Comma-separated list of conversation indices to process. Defaults to all conversations.
  • LOCOMO_CATEGORIES: Question categories to evaluate. Defaults to 1,2,3,4.
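For readers scripting around these variables, the sketch below shows how comma-separated values such as `LOCOMO_CONVERSATION_IDS` and `LOCOMO_CATEGORIES` could be parsed and applied as filters. The helper names and the `category` key on each question are assumptions for this example and may not match the repository's data structures.

```python
from typing import Optional


def parse_int_list(raw: Optional[str]) -> Optional[list[int]]:
    """Parse '0, 2,5' into [0, 2, 5]; empty or missing means no filter."""
    if raw is None or not raw.strip():
        return None
    return [int(part) for part in raw.split(",") if part.strip()]


def filter_by_category(questions: list[dict],
                       categories: Optional[list[int]]) -> list[dict]:
    """Keep questions whose (assumed) 'category' field is in the list."""
    if categories is None:
        return list(questions)
    allowed = set(categories)
    return [q for q in questions if q["category"] in allowed]


if __name__ == "__main__":
    cats = parse_int_list("1,2,3,4")
    sample = [{"id": "q1", "category": 2}, {"id": "q2", "category": 5}]
    print(filter_by_category(sample, cats))
```

Returning `None` for an empty value keeps "no filter" distinct from "an empty selection".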

LongMemEval Variables

  • LONGMEMEVAL_PROJECT: Isolated project directory name for saved results. Defaults to longmemeval_dmf.
  • LONGMEMEVAL_FRAMEWORK: Memory framework under evaluation (dmf or mem0). Defaults to dmf.
  • LONGMEMEVAL_CONFIG: Settings configuration path. Defaults to config/longmemeval_dmf_settings.toml.
  • LONGMEMEVAL_PER_TYPE: Maximum questions to run per question type. Defaults to 10.
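A per-type cap like `LONGMEMEVAL_PER_TYPE` can be sketched as follows. This example keeps the first N questions of each type in their original order. Whether the repository takes the first N or samples them is not stated here, and the `question_type` key is an assumption about the record layout.

```python
from collections import defaultdict


def cap_per_type(questions: list[dict], per_type: int = 10) -> list[dict]:
    """Keep at most `per_type` questions for each question type.

    Assumes each record has a 'question_type' key; order is preserved.
    """
    counts: dict[str, int] = defaultdict(int)
    kept = []
    for q in questions:
        qtype = q["question_type"]
        if counts[qtype] < per_type:
            counts[qtype] += 1
            kept.append(q)
    return kept
```

With the default of 10, a dataset with six question types would yield at most 60 questions per run.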

Execution Examples

Run LoCoMo with Mem0:

make locomo LOCOMO_FRAMEWORK=mem0 LOCOMO_CONFIG=config/locomo_mem0_settings.toml

Run LongMemEval with DMF using custom models:

make longmemeval LONGMEMEVAL_FRAMEWORK=dmf ANSWERER_MODEL=gpt-4o-mini JUDGE_MODEL=gpt-4o

Only rerun the judge on existing saved outputs:

make locomo-judge LOCOMO_PROJECT=locomo_dmf
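When running several configurations in a row, it can help to build these command lines programmatically. The sketch below assembles a `make` invocation from a target and variable overrides. The helper `make_command` is hypothetical, while the target and variable names are the ones documented above.

```python
import shlex


def make_command(target: str, **overrides: str) -> list[str]:
    """Build an argv list such as ['make', 'locomo', 'LOCOMO_FRAMEWORK=mem0']."""
    return ["make", target] + [f"{name}={value}" for name, value in overrides.items()]


if __name__ == "__main__":
    cmd = make_command(
        "locomo",
        LOCOMO_FRAMEWORK="mem0",
        LOCOMO_CONFIG="config/locomo_mem0_settings.toml",
    )
    # Print the shell-quoted command; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` runs the target without going through a shell, so values with spaces need no extra quoting.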

About

Benchmarks for dmf-memory framework
