Skip to content

Performance improvements to chess/polyglot.py (5 micro-optimisations) #1191

@shawnbale

Description

@shawnbale

All changes preserve the public API exactly. Happy to submit a PR if any or all of these look worthwhile — raising as an issue first since some involve trade-offs worth discussing.


1. POLYGLOT_RANDOM_ARRAY — list → tuple

The 781-element constant array is changed from a list literal to a tuple literal and annotated as Tuple[int, ...].

Why it matters:
CPython list indexing requires a bounds check and a pointer indirection through an overallocated backing array. Tuple indexing uses a fixed-size C array with no overallocation, so reads are slightly faster and the object is ~64 bytes smaller. hash_board() indexes into this array once per occupied square (up to 32 times per call), so the saving compounds across every position evaluation.

The ZobristHasher.__init__ signature is widened from List[int] to Union[List[int], Tuple[int, ...]] so existing callers passing a list are unaffected.


2. ZobristHasher.hash_board — typed bitboard iteration

Original approach:

for pivot, squares in enumerate(board.occupied_co):
    for square in chess.scan_reversed(squares):
        piece_index = (typing.cast(chess.PieceType, board.piece_type_at(square)) - 1) * 2 + pivot
        zobrist_hash ^= self.array[64 * piece_index + square]

Each square requires a BB_SQUARES[square] lookup plus up to 6 bitboard membership tests inside piece_type_at() (expected ~3.5 on an average board).

New approach:
Iterate the six typed piece bitboards (board.pawns, board.knights, etc.) directly, masking with occupied_co[WHITE] and occupied_co[BLACK]. Use the LSB trick (lsb = bb & -bb; sq = lsb.bit_length() - 1) instead of scan_reversed to avoid Python generator overhead. Pre-compute array base offsets once per piece type.

What is eliminated per occupied square:

  • One BB_SQUARES[square] lookup
  • One piece_type_at() call (up to 6 bitboard membership tests)
  • The color branch is hoisted outside the inner loop

A module-level _PIECE_BB_OFFSET tuple pre-computes the six (piece_type - 1) * 2 values to eliminate constant multiplication from the inner loop.


3. MemoryMappedReader.iter — struct.iter_unpack over memoryview

Original:

for i in range(len(self)):
    yield self[i]

Each self[i] call computes index * ENTRY_STRUCT.size, calls unpack_from(), and decodes move fields. Index arithmetic and __getitem__'s negative-index handling and IndexError guard are all unnecessary during full sequential iteration.

New:

mv = memoryview(self.mmap)
for key, raw_move, weight, learn in ENTRY_STRUCT.iter_unpack(mv):
    ...

struct.iter_unpack() advances through the buffer internally in C with no Python-level index arithmetic. The memoryview avoids copying bytes out of the mmap. Move decoding is inlined, eliminating the __getitem__ call frame per entry.

Note: this duplicates the move-decoding logic from __getitem__. A shared _decode_entry() helper could consolidate them at a small per-entry call cost if preferred.


4. MemoryMappedReader.choice — single collect + rng.choice

Original reservoir sampling:

chosen_entry = None
for i, entry in enumerate(self.find_all(...)):
    if chosen_entry is None or _randint(rng, 0, i) == i:
        chosen_entry = entry

_randint() is called once per entry after the first, each time allocating a Python frame. Uniform selection over N entries requires N−1 random draws.

New:

entries = list(self.find_all(...))
return rng.choice(entries)

random.Random.choice() is implemented in C and performs a single uniform draw. For the common case of a small number of book entries (typically 1–20) the list allocation is negligible and the code is simpler.

Trade-off: reservoir sampling preserves O(1) memory if find_all ever returns a very large number of entries. For opening books this is practically irrelevant, but worth noting.

The module-level _randint() helper is removed. A module-level _random_module instance (seeded from OS entropy) is added so callers passing random=None get a properly seeded RNG rather than relying on the implicit global state of the random module.


5. MemoryMappedReader.weighted_choice — single-pass collect

Original:

total_weights = sum(entry.weight for entry in self.find_all(board, ...))
...
for entry in self.find_all(board, ...):   # second full scan
    ...

find_all() scans the sorted key range twice — one bisect_key_left binary search and sequential read per scan.

New:

entries = list(self.find_all(board, ...))
total_weights = sum(entry.weight for entry in entries)
...
for entry in entries:
    ...

One scan, one bisect. Memory cost is the same small list of Entry namedtuples that was being iterated twice anyway.


A patch against current master is available if useful.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions