Bug Fix & monitoring / events + map optimization

@johnnyh1975 johnnyh1975 released this 24 Jun 20:26
ae37e49

Roomba+ v2.9.0 — Release Notes

Based on: v2.8.7 (published). Test status: 2750 passing, 0
failures, 49 test files (45 in v2.8.7 → 49).


Overview

This release closes out the v2.9.0 monitoring/events milestone with five
features, an unrelated and more urgent fix for a real production bug
found in a field report from Thonno on v2.8.7, and two mqtt_watchdog
fixes from community reports.


Fixed: EVENT_ROOM_COMPLETED crashed the local MQTT thread

Thonno reported what looked like a regression in the v2.8.7 stuck-mission
fix: a multi-room mission (Bagno principale → Corridoio → Soggiorno,
Two Passes) completed correctly in the iRobot app, but Home Assistant
stayed stuck on Returning to base / docking – end of mission, with
elapsed_run_min still climbing 45+ minutes after the robot had docked.

It wasn't the stuck-mission fix. The v2.8.7 periodic 30-second
safety-net recheck was running exactly as designed — it just never had
anything new to look at. His log showed the real root cause: at the very
first room transition (Bagno principale → Corridoio), _on_mission_message
fired hass.bus.async_fire(EVENT_ROOM_COMPLETED, ...) directly from
roombapy's paho-mqtt background thread, not from the event loop. On his
HA core (a notably stricter/newer build than ours), Home Assistant's
thread-safety guard raised a hard RuntimeError for this — which
propagated all the way up through paho-mqtt's on_message callback and
killed the entire MQTT message-processing thread for the rest of the
mission. Every subsequent symptom — frozen state, the climbing timer, the
stuck-mission recheck endlessly re-evaluating the same cached message —
followed directly from no further MQTT traffic ever being processed
again, not from any logic bug in the end-of-mission classifier itself.

EVENT_ROOM_COMPLETED was introduced in v2.8.6's EVENT-BUS feature. The
sibling events added in the same release,
EVENT_MAP_RETRAIN_STARTED/COMPLETED, already bridge correctly via
asyncio.run_coroutine_threadsafe; this one event type didn't get the
same treatment when it was added.

Fix: hass.loop.call_soon_threadsafe(hass.bus.async_fire, ...)
instead of a direct call — the same bridging pattern already used
elsewhere in this file for async_check_map_retrain_workflow.
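
The bridging pattern can be sketched without Home Assistant. This is a minimal illustration, not the integration's code: FakeBus, on_mission_message as written here, and the "room_completed" event name are hypothetical stand-ins, and the guard inside FakeBus only imitates a strict thread-safety check. The one real API used is asyncio's loop.call_soon_threadsafe.

```python
import asyncio
import threading


class FakeBus:
    """Hypothetical stand-in for an event bus that is only safe on the loop thread."""

    def __init__(self, loop_thread_id: int) -> None:
        self._loop_thread_id = loop_thread_id
        self.fired: list[tuple[str, dict]] = []

    def async_fire(self, event_type: str, data: dict) -> None:
        # Imitate a strict thread-safety guard: reject cross-thread calls.
        if threading.get_ident() != self._loop_thread_id:
            raise RuntimeError("async_fire called from outside the event loop thread")
        self.fired.append((event_type, data))


def on_mission_message(loop: asyncio.AbstractEventLoop, bus: FakeBus, room: str) -> None:
    """Runs on the MQTT worker thread; schedules the fire on the loop instead of calling it."""
    loop.call_soon_threadsafe(bus.async_fire, "room_completed", {"room": room})


async def demo() -> list[tuple[str, dict]]:
    loop = asyncio.get_running_loop()
    bus = FakeBus(threading.get_ident())
    worker = threading.Thread(target=on_mission_message, args=(loop, bus, "Corridoio"))
    worker.start()
    worker.join()  # the worker only schedules a callback, so this returns quickly
    await asyncio.sleep(0.05)  # give the loop a chance to run the scheduled callback
    return bus.fired


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Calling bus.async_fire directly from the worker thread would raise inside the worker, which is the failure mode described above; scheduling it keeps the exception-prone call on the loop thread.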

Audit: checked every other direct hass.bus.async_fire call site in
the integration. blocking_manager.py and presence_manager.py both
fire exclusively from service calls and state-change listeners — already
running on the event loop, no change needed.

Likely impact beyond this report: this affects any multi-room mission
on any HA core build strict enough to enforce (rather than just warn
about) cross-thread event-bus calls — not unique to Thonno's setup.


Fixed: mqtt_watchdog cloud-connectivity hint was hardcoded in German

Community member boutXIII pasted what looked like a confusing, broken
Repair Issue message — a French sentence with a German clause stitched
into the middle of it ("Connectivité cloud du robot : verbunden — spricht
für..."). Not a robot/network issue at all — a real localization bug.

Root cause: the cloud-connectivity hint inserted into the
{cloud_hint} placeholder was built as a hardcoded German string in
binary_sensor.py, regardless of the user's configured HA locale. The
surrounding sentence was correctly localized via the normal translation
system; the substituted value itself never went through it. Present
since the same v2.9.0 session that added the enriched mqtt_watchdog
message — the existing tests for this code only ever asserted on German
substrings, so the locale gap was invisible in CI from the start.

Fix: _async_watchdog_tick is a @callback (synchronous, can't await
a translation lookup), so server-side string substitution isn't a clean
fit here. Replaced the single mqtt_watchdog translation_key + hardcoded
hint with three fully-localized translation_keys —
mqtt_watchdog_cloud_connected, _disconnected, _unknown — one full,
natural sentence per language per status, resolved automatically by Home
Assistant's existing per-locale translation_key mechanism, the same way
{minutes}/{last_phase} already work. No more placeholder substitution
for this field at all.
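
A rough sketch of the key-selection idea, under stated assumptions: the full key names expand the "_disconnected"/"_unknown" shorthand above, the cloud status is modelled as an optional bool, and watchdog_issue_args is a hypothetical helper rather than a function in binary_sensor.py. The point is that picking a key is a plain synchronous lookup, so it fits inside a @callback.

```python
from typing import Optional

# Key names expand the "_disconnected"/"_unknown" shorthand from the notes (assumed).
_KEY_BY_CLOUD_STATUS = {
    True: "mqtt_watchdog_cloud_connected",
    False: "mqtt_watchdog_cloud_disconnected",
    None: "mqtt_watchdog_cloud_unknown",
}


def watchdog_issue_args(cloud_connected: Optional[bool], minutes: int, last_phase: str) -> dict:
    """Build repair-issue arguments synchronously, with no translation lookup."""
    return {
        "translation_key": _KEY_BY_CLOUD_STATUS[cloud_connected],
        # No "cloud_hint" placeholder: the status is carried by the key itself.
        "translation_placeholders": {"minutes": str(minutes), "last_phase": last_phase},
    }
```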


Fixed: mqtt_watchdog false positive right after undocking

Both boutXIII and Jean-Christoph reported the watchdog firing within
minutes of starting a mission, on robots that were perfectly fine. Both
reports showed the issue firing with minutes≈5 — right at the existing
silence threshold, not far past it.

Root cause: RoombaMqttStale.is_on only checked phase=="run" and
5 minutes of silence — with no awareness of how long the mission had
actually been running. A genuine, benign Wi-Fi gap right after undocking
(reassociation while the robot physically moves away from the router;
motor startup interference) is common, especially on older robots like
the 980 OG with an aftermarket NiMH battery. The last message received
before the gap already showed phase=="run", so the watchdog fired the
instant the 5-minute threshold was crossed, regardless of how fresh the
mission was.

Fix: new MQTT_WATCHDOG_START_GRACE_SECONDS (420s/7min, chosen with
margin above the ~5min observed in both reports; the exact gap duration
wasn't measured in either case). The watchdog is now suppressed
entirely for this long after mssnStrtTm, regardless of silence
duration. A genuine outage starting early in the mission is still caught
once both this grace window and the normal silence threshold have
elapsed. This only suppresses the very start of a mission, not any
arbitrary mid-mission silence.
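
The combined check might look roughly like this. It is a sketch, not the actual RoombaMqttStale.is_on: the function name, its parameters, the silence-threshold constant name, and the choice to skip the grace check when the mission start time is unknown are all assumptions. Only the 420-second grace value and the 5-minute silence threshold come from the notes above.

```python
from typing import Optional

MQTT_WATCHDOG_START_GRACE_SECONDS = 420  # value from the release notes
MQTT_SILENCE_THRESHOLD_SECONDS = 300  # "5 minutes of silence"; constant name is assumed


def watchdog_is_stale(
    phase: str,
    last_message_ts: float,
    mission_start_ts: Optional[float],
    now: float,
) -> bool:
    """Return True only when silence is long enough AND the mission is past its grace window."""
    if phase != "run":
        return False
    # Suppress entirely during the first minutes after mission start (mssnStrtTm).
    if mission_start_ts is not None and now - mission_start_ts < MQTT_WATCHDOG_START_GRACE_SECONDS:
        return False
    return now - last_message_ts >= MQTT_SILENCE_THRESHOLD_SECONDS
```

With a mission started at t=0 and the last message at t=30, the check stays quiet at t=335 (305s of silence, still inside the grace window) and fires at t=421.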


MAP-FONT — embedded font for map rendering

map_renderer.py rendered all zone/wall/door/obstacle labels with PIL's
tiny, non-anti-aliased bitmap default font. Now uses a bundled DejaVu Sans
TTF (Bitstream Vera License — freely redistributable) at two preloaded
sizes. No new config option — there's no scenario where the old default
would be preferable.

ROOM-PALETTE — distinct fill colour per room

_render_rooms_png() filled every Smart Map room with the same uniform
colour. Now rotates through an 8-colour palette by region index, so
adjacent rooms are visually distinguishable even without the
xiaomi-vacuum-map-card's own label overlay. Doesn't touch the v2.7.3
decision to omit room-name labels from this PNG (colour ≠ text — no
duplicate-label risk reintroduced).
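
Rotating by region index amounts to a modulo lookup. In this sketch the eight RGBA values and the helper name room_fill_colour are invented for illustration; the real palette in map_renderer.py is not shown in these notes.

```python
# Hypothetical 8-colour RGBA palette; the shipped colours may differ.
ROOM_PALETTE = [
    (102, 194, 165, 160),
    (252, 141, 98, 160),
    (141, 160, 203, 160),
    (231, 138, 195, 160),
    (166, 216, 84, 160),
    (255, 217, 47, 160),
    (229, 196, 148, 160),
    (179, 179, 179, 160),
]


def room_fill_colour(region_index: int) -> tuple[int, int, int, int]:
    """Pick a fill colour by region index, wrapping around after the last entry."""
    return ROOM_PALETTE[region_index % len(ROOM_PALETTE)]
```

A rotation like this distinguishes rooms with consecutive indices; it does not by itself guarantee that two physically adjacent rooms never share a colour once a map has more rooms than palette entries.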

CLEAN-ROOM-PER-ROOM-PASSES — individual two-pass setting per room

clean_room gains an optional room_passes field — a list of
{name, two_pass} entries — for setting two-pass cleaning independently
per room within a single multi-room sequence (e.g. Kitchen twice,
Hallway once, same job). Backward compatible: the existing room_name +
global two_pass fields are unchanged for the simple case.
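
One plausible way to combine the two inputs is sketched below. How the real handler merges room_passes with the global two_pass is not stated here, so the rule that a per-room entry overrides the global value is an assumption, as is the helper name resolve_two_pass.

```python
from typing import Optional


def resolve_two_pass(
    room_names: list[str],
    two_pass: bool = False,
    room_passes: Optional[list[dict]] = None,
) -> list[tuple[str, bool]]:
    """Return (room, two_pass) pairs in cleaning order.

    Assumed rule: a {name, two_pass} entry overrides the global flag for that room.
    """
    overrides = {entry["name"]: bool(entry["two_pass"]) for entry in room_passes or []}
    return [(name, overrides.get(name, two_pass)) for name in room_names]
```

For the example above, resolve_two_pass(["Kitchen", "Hallway"], room_passes=[{"name": "Kitchen", "two_pass": True}]) yields Kitchen with two passes and Hallway with one.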

Bug found and fixed in the same area: two_pass was documented in
services.yaml and read by the handler, but missing entirely from the
registered voluptuous schema — any call going through real schema
validation (a YAML automation, or the Developer Tools UI) was rejected
with extra keys not allowed @ data['two_pass']. This had apparently
never worked over that path. Not caught earlier because the existing
test_clean_room_action.py only exercises a hand-copied reference
implementation of the room-resolution logic, never the real registered
service end-to-end.

REST980-MIGRATE — import room names from roomba_rest980

A new options-flow step, shown only when a Smart Map robot is configured
and an existing roomba_rest980 installation is detected
(hass.config_entries.async_entries("roomba_rest980")). Reads room names
straight from that integration's own select.* entities via the state
machine (their room_data attribute) and pre-fills smart_zone_labels
for anything not already named here — read-only, never writes to or
calls services on the foreign integration, and never overwrites a name
already assigned through our own naming workflow. Closing note in the
flow suggests the old roomba_rest980 setup (plus its external rest980
relay container) can be removed once done.
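
The "pre-fill without overwriting" rule can be illustrated as a small merge. The data shapes here are assumptions: both mappings are taken to be keyed by a region identifier, and prefill_zone_labels is a hypothetical helper; the real structure of room_data and smart_zone_labels is not described in these notes.

```python
def prefill_zone_labels(existing: dict[str, str], imported: dict[str, str]) -> dict[str, str]:
    """Merge imported room names into existing labels without overwriting any name.

    Both arguments are left untouched; a new dict is returned.
    """
    merged = dict(existing)
    for region_id, name in imported.items():
        # Only fill regions that have no (or an empty) label yet.
        if not merged.get(region_id):
            merged[region_id] = name
    return merged
```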

Side discovery: this is the first test ever to import config_flow.py
directly rather than test a hand-copied reference implementation. In
our Python 3.12 test environment (capped at HA core 2025.1.4; newer
cores require Python ≥3.13), helpers.service_info.dhcp/.zeroconf don't
exist yet, so they were added as Shim 4 in conftest.py, aliased to the
equivalent classes under homeassistant.components.dhcp/.zeroconf. Test
infrastructure only; no production code affected.

ZONE-LAYER-CACHE — cached room-polygon rendering

RoombaRoomsImage._render_rooms_png() re-rendered the full PIL room
layer on every async_image() call (every frontend poll/refresh) even
though room polygons only change on map retrain. Now cached, keyed by
(pmap_version_id, aligned). Each cache entry also stores the transform
parameters it was computed with, so the calibration_points/_to_px_last
attributes stay consistent with the PNG on a cache hit.
Known, documented limitation: assumes the pose-space transform is stable
for a given pmap_version_id once aligned=True is reached — not
expected to drift in practice, but not structurally prevented either.
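
The caching scheme and its limitation can be sketched as a single-entry cache. RoomLayer, RoomLayerCache, and the render callable are hypothetical names for illustration; the real RoombaRoomsImage internals are not shown in these notes.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class RoomLayer:
    png: bytes
    transform: tuple[float, ...]  # transform the PNG was rendered with


class RoomLayerCache:
    """Single-entry cache keyed by (pmap_version_id, aligned)."""

    def __init__(self, render: Callable[[tuple[float, ...]], bytes]) -> None:
        self._render = render
        self._key: Optional[tuple[str, bool]] = None
        self._entry: Optional[RoomLayer] = None

    def get(self, pmap_version_id: str, aligned: bool, transform: tuple[float, ...]) -> RoomLayer:
        key = (pmap_version_id, aligned)
        if key != self._key or self._entry is None:
            # Cache miss: render once and remember the transform alongside the PNG.
            self._entry = RoomLayer(self._render(transform), transform)
            self._key = key
        # On a hit, the stored transform is returned even if the caller's differs.
        return self._entry
```

Because the transform is not part of the key, a hit returns the PNG together with the transform it was actually rendered with. That keeps derived attributes consistent with the image, and it is also where the stated limitation lives: a drifting transform under an unchanged key would not trigger a re-render.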


Tests

2750 passing, 49 test files (45 in v2.8.7 → 49: test_rest980_migrate.py
new). +38 net: MAP-FONT +3, ROOM-PALETTE +2, CLEAN-ROOM-PER-ROOM-PASSES +8,
REST980-MIGRATE +16, ZONE-LAYER-CACHE +5, THREAD-SAFETY-FIX +1,
MQTT-WATCHDOG-START-GRACE +3. Four pre-existing tests in
test_mission_timer_store.py needed hass.loop changed from None to
MagicMock() — they exercise the AUTO-ADVANCE-ROOM event-firing path for
the first time now that it actually touches hass.loop. Three existing
TestMqttWatchdogRepairIssue tests updated to assert on the selected
translation_key instead of hardcoded German substrings in
translation_placeholders["cloud_hint"] — itself a symptom of the
localization bug fixed above.

The new thread-safety regression test calls the mission callback from a
genuine second OS thread and asserts hass.bus.async_fire has not been
invoked immediately after that thread returns — only after the event loop
is subsequently drained. Verified to fail against the original direct-call
code before confirming it passes against the fix.