
index: PostingList.removeDocId is O(n) linear scan on a sorted list — should be O(log n) #248

@justrach

Description


Problem

PostingList maintains items sorted by doc_id. getByDocId and containsDocId already use std.sort.binarySearch for O(log n) lookup. But removeDocId uses an O(n) linear scan:

// src/index.zig — PostingList.removeDocId
pub fn removeDocId(self: *PostingList, doc_id: u32) void {
    var i: usize = 0;
    while (i < self.items.items.len) { // ← O(n), no early break
        if (self.items.items[i].doc_id == doc_id) {
            _ = self.items.orderedRemove(i);
        } else { i += 1; }
    }
}

For a posting list with n entries, removeDocId scans all n entries. Removing a file pays that cost once per trigram the file contains, so total removal cost is roughly O(n × trigrams per file). With 15k files at ~300 trigrams each, file removal is measurably slow.

Correctness test (passes today; serves as a regression guard)

const std = @import("std");
const testing = std.testing;

test "issue-248: PostingList.removeDocId removes target and preserves sorted order" {
    const PostingList = @import("index.zig").PostingList;
    var list = PostingList{};
    defer list.items.deinit(testing.allocator);

    var id: u32 = 0;
    while (id < 100) : (id += 1) {
        const p = try list.getOrAddPosting(testing.allocator, id * 2);
        p.loc_mask = 0xFF;
    }
    list.removeDocId(50);
    try testing.expectEqual(@as(usize, 99), list.items.items.len);
    try testing.expect(list.getByDocId(50) == null);
    for (1..list.items.items.len) |k| {
        try testing.expect(list.items.items[k].doc_id > list.items.items[k - 1].doc_id);
    }
}

Fix

Replace the linear scan with std.sort.binarySearch to find the index, then a single orderedRemove. Since doc_ids are unique per list, there is at most one match.
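A minimal sketch of the fix, assuming a Zig 0.11/0.12-era `std.sort.binarySearch` signature and a `Posting` element type with a `doc_id: u32` field (the comparator name is illustrative; reuse whatever comparator `getByDocId` already has in src/index.zig):

```zig
const std = @import("std");

fn orderByDocId(_: void, key: u32, item: Posting) std.math.Order {
    return std.math.order(key, item.doc_id);
}

// src/index.zig — PostingList.removeDocId, proposed
pub fn removeDocId(self: *PostingList, doc_id: u32) void {
    // O(log n) locate; doc_ids are unique per list, so at most one match.
    const idx = std.sort.binarySearch(
        Posting,
        doc_id,
        self.items.items,
        {},
        orderByDocId,
    ) orelse return;
    // Single orderedRemove preserves sorted order.
    _ = self.items.orderedRemove(idx);
}
```

Note `orderedRemove` itself still shifts trailing elements, so the worst case remains O(n) per call, but the find step drops from O(n) to O(log n) and the no-op case (doc_id absent) becomes O(log n) instead of a full scan.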
