This is a minor release focused on bug fixes and improvements related to memory management.
The following changes are included:
- updated OpenVINO Runtime to version 2026.2.1
- updated the GPU driver in the Docker image based on Ubuntu 24.04
- updated the NPU driver in the Docker image based on Ubuntu 24.04
- improved input validation for the KServe and TFS APIs
- added new configuration parameters for LLM and VLM models
--cache_interval_multiplier
The new parameter cache_interval_multiplier is relevant only for models with linear attention, such as Qwen3.6-35B-A3B. It defines how the prefix caching algorithm manages KV cache allocations. The default value of 8 is optimized for short contexts. For long prompts (for example, over 20k tokens), it is recommended to increase the value to reduce memory consumption. Here is an example of deploying the Qwen3.6-35B-A3B model:
ovms --model_repository_path c:\models --source_model OpenVINO/Qwen3.6-35B-A3B-int4-ov ^
--task text_generation --target_device GPU --tool_parser qwen3coder --reasoning_parser qwen3 ^
--rest_port 8000 --cache_interval_multiplier 64
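The guidance above can be captured in a small launcher sketch. This is a minimal illustration, not part of OpenVINO Model Server: the helper build_ovms_args is hypothetical, the 20,000-token threshold is an assumption derived from the "over 20k tokens" wording, and the long-prompt value of 64 simply reuses the example command rather than being a tuned recommendation. The flags are copied from the command above.

```python
import shlex

# Assumed threshold, taken from the "over 20k tokens" guidance in the release note.
LONG_PROMPT_TOKENS = 20_000
# Documented default of cache_interval_multiplier.
DEFAULT_MULTIPLIER = 8
# Value reused from the example command; an assumption, not a tuned optimum.
LONG_PROMPT_MULTIPLIER = 64


def build_ovms_args(model_repo: str, source_model: str, expected_prompt_tokens: int) -> list[str]:
    """Hypothetical helper returning an ovms argument list that mirrors the example."""
    # Raise the multiplier only when prompts are expected to be long.
    multiplier = (
        LONG_PROMPT_MULTIPLIER
        if expected_prompt_tokens > LONG_PROMPT_TOKENS
        else DEFAULT_MULTIPLIER
    )
    return [
        "ovms",
        "--model_repository_path", model_repo,
        "--source_model", source_model,
        "--task", "text_generation",
        "--target_device", "GPU",
        "--tool_parser", "qwen3coder",
        "--reasoning_parser", "qwen3",
        "--rest_port", "8000",
        "--cache_interval_multiplier", str(multiplier),
    ]


if __name__ == "__main__":
    args = build_ovms_args("/models", "OpenVINO/Qwen3.6-35B-A3B-int4-ov", 32_000)
    # Print a POSIX-shell command line; pass `args` to subprocess.run() to launch ovms instead.
    print(shlex.join(args))
```

Passing the returned list directly to subprocess.run avoids shell quoting and the cmd-specific ^ line continuations used in the example.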
You can use the OpenVINO Model Server public Docker images based on Ubuntu via the following commands:
docker pull openvino/model_server:2026.2.1 (CPU device support, image based on Ubuntu 24.04)
docker pull openvino/model_server:2026.2.1-gpu (GPU, NPU and CPU device support, image based on Ubuntu 24.04)
or use the provided binary packages. Only packages with the _python_on suffix include Python support.
An additional distribution channel is available at https://storage.openvinotoolkit.org/repositories/openvino_model_server/packages/2026.2.1/
Check the instructions on how to install the binary package. The prebuilt image is also available on the Red Hat Ecosystem Catalog.