A tool for evaluating policy server performance by computing Mean Squared Error (MSE) against ground-truth actions from Bridge V2 trajectories. Forked from zhouzypaul/mse-check.
This tool provides a framework for:
- Evaluating policy server actions against ground truth trajectories
- Computing performance metrics including MSE (see the sketch after this list)
- Supporting both sequential and parallel action collection
- Handling historical context with configurable history lengths and sampling strategies
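The MSE referenced above is the standard mean squared error between predicted and ground-truth action vectors, averaged over steps and action dimensions. A minimal sketch (the helper name `action_mse` and the array shapes are illustrative, not part of the tool):

```python
import numpy as np

def action_mse(predicted: np.ndarray, ground_truth: np.ndarray) -> float:
    """MSE between predicted and ground-truth actions of shape (num_steps, action_dim)."""
    predicted = np.asarray(predicted, dtype=np.float64)
    ground_truth = np.asarray(ground_truth, dtype=np.float64)
    return float(np.mean((predicted - ground_truth) ** 2))
```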
- First, ensure your policy server is running at your desired `<server_ip>:<server_port>` using the `mallet` toolkit.
- Run the evaluation script:

```bash
python eval.py --ip <server_ip> --port <server_port>
```
The tool supports several configuration options through the `DeployConfig` class:

```python
config = DeployConfig(
    host="localhost",              # Policy server IP address
    port=8000,                     # Policy server port
    subsample=10,                  # Subsample rate for trajectory steps
    save_dir="results",            # Directory to save results
    analyze_only=False,            # Only analyze previously saved results
    sequential=False,              # Process observations sequentially vs. in parallel
    model_name="",                 # Name of the model being evaluated
    external_history_length=None,  # Number of historical steps to include
    external_history_choice=None   # How to sample historical steps
)
```
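For instance, to re-analyze previously saved results without contacting the server, or to submit externally constructed history, configurations along these lines could be used (field values are illustrative; omitted fields are assumed to keep their defaults):

```python
# Re-analyze previously saved results only; no server queries are made.
config = DeployConfig(save_dir="results", analyze_only=True)

# Query a local server, providing the last 5 historical steps
# sampled at every other step ("alternate").
config = DeployConfig(
    host="localhost",
    port=8000,
    external_history_length=5,
    external_history_choice="alternate",
)
```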
The `external` prefix indicates that the evaluation code constructs the history and submits it to the server, rather than the server maintaining its own history. This allows for asynchronous evaluation instead of time-synchronous evaluation.
When using historical context, the following sampling strategies are available (a sketch of the index selection follows this list):

- `"all"`: Include all historical steps
- `"last"`: Include only the last (most recent) step
- `"first"`: Include only the first (oldest) step
- `"alternate"`: Include every other step
- `"third"`: Include every third step
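How these strategies map onto step indices is not spelled out above, so the following is a minimal sketch of one plausible interpretation; the helper name `select_history_indices` and the ordering convention (index 0 = oldest step) are assumptions:

```python
def select_history_indices(num_steps: int, choice: str) -> list[int]:
    """Pick which of the available history steps (0 = oldest) to keep."""
    indices = list(range(num_steps))
    if choice == "all":
        return indices
    if choice == "last":
        return indices[-1:]   # most recent step only
    if choice == "first":
        return indices[:1]    # oldest step only
    if choice == "alternate":
        return indices[::2]   # every other step
    if choice == "third":
        return indices[::3]   # every third step
    raise ValueError(f"Unknown history choice: {choice}")
```

For example, `select_history_indices(6, "alternate")` returns `[0, 2, 4]` under this interpretation.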
The tool generates two main output files in the specified `save_dir`:

- `actions_{timestamp}.pkl`: Raw collected actions
- `metrics_{timestamp}.pkl`: Computed performance metrics
Each file includes:
- Trajectory index
- Language instructions (if available)
- Predicted actions
- Ground truth actions
- Number of actions
- VLM responses
- Timestamp
- Model name
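Both files are standard pickles and can be inspected directly. The sketch below assumes each file holds a list of per-trajectory dictionaries; the key names used here (`trajectory_index`, `language_instruction`, `mse`) are illustrative, not taken from the code:

```python
import glob
import pickle

# Pick the most recent metrics file (timestamps sort lexicographically).
latest = sorted(glob.glob("results/metrics_*.pkl"))[-1]
with open(latest, "rb") as f:
    metrics = pickle.load(f)

# Illustrative field access; actual key names may differ.
for entry in metrics:
    print(entry["trajectory_index"], entry.get("language_instruction"), entry["mse"])
```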