
game compression v3 #4129

Open
niklasf opened this issue Mar 13, 2018 · 19 comments
Labels
foundational, prospect (For long-term consideration)

Comments

@niklasf
Member

niklasf commented Mar 13, 2018

@niklasf niklasf added the prospect For long-term consideration label Mar 13, 2018
@cyanfish
Member

Faster alternative to arithmetic coding:
https://en.wikipedia.org/wiki/Asymmetric_numeral_systems

@ErdoganSeref
Contributor

I'd like to work on this issue, but could you explain the approach for the arithmetic coding in detail? Is our alphabet the codes of the Huffman tree?

@niklasf
Member Author

niklasf commented Feb 25, 2023

There's no predetermined approach. This issue is basically a research project, trying to improve upon the compression we have now, and everything above is just a collection of rough ideas.

There are theoretical limits on what can be achieved with better entropy coding alone (I believe the maximum is saving 0.5 more bits per position, about 10%). Major gains would have to come from better modelling. That is, to find a better way to assign probabilities to each move in a given position, minimizing cross entropy.
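Minimizing cross entropy is the key quantity here: it measures the average bits per move actually paid when coding data from a true distribution using code lengths derived from a model. A minimal sketch with toy probabilities (the numbers are illustrative, not Lichess data):

```python
import math

def cross_entropy(true_probs, model_probs):
    """Average bits per symbol when data distributed as true_probs is
    coded with lengths -log2(p) derived from model_probs."""
    return -sum(t * math.log2(m) for t, m in zip(true_probs, model_probs) if t > 0)

# Toy distribution over 4 move indices (hypothetical numbers):
true_probs = [0.5, 0.25, 0.15, 0.10]

# A model matching the true distribution pays exactly its entropy...
h = cross_entropy(true_probs, true_probs)
# ...while a mismatched model always pays extra bits (the KL divergence):
h_uniform = cross_entropy(true_probs, [0.25, 0.25, 0.25, 0.25])
assert h_uniform >= h
```

The gap between the two numbers is exactly what better modelling would recover, independent of which entropy coder is used.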

@ErdoganSeref
Contributor

Which script and data did you use to generate the Huffman codes?

Using different Huffman codes depending on the game phase would be easy to implement: just pass a move parameter to the read and write methods of the Huffman class, and the correct tree is then selected according to the game phase.

But we'd have to divide each game into opening, middlegame and endgame to generate them.

I suggest moves 0 to 20 as the opening, moves 20 to 40 as the middlegame, and move 40 upwards as the endgame.
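The proposed table selection could be sketched as follows; the table names are placeholders, standing in for Huffman codes that would be trained separately on each phase:

```python
# Hypothetical sketch of selecting a Huffman table by game phase, using
# the proposed cutoffs: moves 0-20 opening, 20-40 middlegame, 40+ endgame.

OPENING, MIDDLEGAME, ENDGAME = range(3)

def game_phase(ply):
    """Map a 0-based ply to a phase (two plies per full move)."""
    move_number = ply // 2
    if move_number < 20:
        return OPENING
    if move_number < 40:
        return MIDDLEGAME
    return ENDGAME

# One code table per phase; in a real encoder these would be the Huffman
# codes generated from games in that phase (names are illustrative).
tables = {OPENING: "opening_codes", MIDDLEGAME: "middlegame_codes", ENDGAME: "endgame_codes"}

def table_for(ply):
    return tables[game_phase(ply)]
```

Since the decoder also knows the current ply, it can select the same table without any extra bits being stored.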

@niklasf
Member Author

niklasf commented Feb 26, 2023

Awesome, sounds good!

The data was a sample from https://database.lichess.org/. The tools and scripts are linked in the "Ordering moves" section of https://lichess.org/blog/Wqa7GiAAAOIpBLoY/developer-update-275-improved-game-compression.

@ErdoganSeref
Contributor

Based on the number of games, I guess you used December 2017 as the sample.

I couldn't find a script that performs the exact task. I guess you built a private script based on rust-pgn-reader.

Still, I don't understand how you translate the probabilities of thousands of legal moves from the PGN to 256 Huffman codes.

@niklasf
Member Author

niklasf commented Mar 6, 2023

The legal moves in any position are first sorted based on an ordering heuristic. Then the word to encode is the index in the list of ordered moves. If the move ordering heuristic is good, index 0 is most likely to have been played, followed by 1, etc. And no chess position has more than 256 moves. See https://lichess.org/blog/Wqa7GiAAAOIpBLoY/developer-update-275-improved-game-compression for a bit more detail.
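The index-based scheme can be sketched in a few lines. The heuristic below is a toy stand-in, not Lichess's real move-ordering heuristic; the important property is that it is deterministic, so encoder and decoder reproduce the same ordering:

```python
# Illustrative sketch: legal moves are sorted by a heuristic score, and
# only the index of the move actually played is encoded.

def encode_move(legal_moves, played, score):
    """Return the index of `played` in the heuristically ordered move list."""
    ordered = sorted(legal_moves, key=score, reverse=True)
    return ordered.index(played)

def decode_move(legal_moves, index, score):
    """Recover the move from its index; the decoder can rebuild the same
    ordering because it knows the position and the deterministic heuristic."""
    ordered = sorted(legal_moves, key=score, reverse=True)
    return ordered[index]

# Toy example: moves scored by a fake "capture value" heuristic.
moves = ["Qxd5", "Nf3", "e4", "O-O"]
toy_score = {"Qxd5": 9, "e4": 2, "Nf3": 1, "O-O": 0}.get
idx = encode_move(moves, "e4", toy_score)  # "e4" ranks second, so index 1
assert decode_move(moves, idx, toy_score) == "e4"
```

The better the heuristic, the more the index distribution is skewed toward 0, and the fewer bits the entropy coder spends per move.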

One weakness of this approach is that the probability distribution is fixed across all positions, without context. For example index 7 still has some implied probability, even if we know there are only 4 legal moves. And another bit of unused context is the game phase, that you've prepared a branch for.
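Using the legal-move count as context could be as simple as truncating the global index distribution and renormalizing, so no probability mass is wasted on impossible indices. A minimal sketch with illustrative probabilities:

```python
# Sketch of conditioning on the number of legal moves: restrict the global
# distribution over move indices to the possible indices and renormalize.

def conditioned(probs, num_legal):
    """Keep only the first num_legal entries and rescale them to sum to 1."""
    head = probs[:num_legal]
    total = sum(head)
    return [p / total for p in head]

# Hypothetical global distribution over 8 move indices:
global_probs = [0.4, 0.2, 0.1, 0.1, 0.08, 0.07, 0.03, 0.02]

p = conditioned(global_probs, 4)  # position with only 4 legal moves
assert abs(sum(p) - 1.0) < 1e-9
assert p[0] > global_probs[0]  # index 0 absorbs the mass freed from indices 4..7
```

This only helps with an entropy coder that can use per-position probabilities (arithmetic coding or ANS); a single static Huffman table cannot exploit it.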

The linked Rust and Python scripts are not the cleanest, but they should be complete.

@niklasf
Member Author

niklasf commented Mar 29, 2023

Nice work! And sorry, looks like a major part was missing from the Rust version after all.

What were the overall results in terms of average bytes per game or move?

@ErdoganSeref
Contributor

Opening: 21.52 bytes per game
Middlegame: 12.45 bytes per game
Endgame: 3.74 bytes per game

How much did the modification improve the algorithm?

@niklasf
Member Author

niklasf commented Mar 29, 2023

It's a slight regression, with the status quo being 37.03 bytes per game. There must be a subtle mistake somewhere. I'll review it more closely soon.

@ErdoganSeref
Contributor

> Faster alternative to arithmetic coding: https://en.wikipedia.org/wiki/Asymmetric_numeral_systems

Should we implement rANS or tANS? Which of the two variants is better for our use case?

@niklasf
Member Author

niklasf commented May 2, 2023

Or arithmetic coding. Not sure in advance which will perform best. I would expect any of these to gain about half a bit per move.

@niklasf
Member Author

niklasf commented Jun 3, 2023

We might need a separate one, then, since we're not touching clock compression here.

@ErdoganSeref
Contributor

> * use padding instead of storing ply count

Could you elaborate on that further?

@niklasf
Member Author

niklasf commented Sep 21, 2023

When the compressor writes bits to a byte array, the final byte may not be fully used, but the database can only store whole bytes. Without doing something about that, the decoder cannot tell which bits of a byte array actually belong to the game.

       Example bits: 00001111 01001111 000011
Currently stored as: 00001111 01001111 00001100 (n moves)

       Example bits: 00001111 0
Currently stored as: 00001111 00000000 (m moves)

       Example bits: 00001111 00000001
Currently stored as: 00001111 00000001 (k moves)

The current solution is to separately store the number of moves and pass them to https://github.com/lichess-org/compression/blob/e675db1936eaf6dc0c5f718366760f00a5b7b47d/src/main/java/game/Encoder.java#L117.

Bit padding would be more efficient. For example, one scheme would be to append the inverse of the final data bit, and fill the remaining bits of the final byte with that same value.

       Example bits: 00001111 01001111 000011
Currently stored as: 00001111 01001111 00001100 (n moves)
        Bit padding: 00001111 01001111 00001100

       Example bits: 00001111 0
Currently stored as: 00001111 00000000 (m moves)
        Bit padding: 00001111 01111111

       Example bits: 00001111 00000001
Currently stored as: 00001111 00000001 (k moves)
        Bit padding: 00001111 00000001 00000000

Most of the time this would be free; only in cases like the third example is an additional byte required.
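The padding scheme described above could be sketched like this, representing bits as a list of 0/1 ints for clarity (a real encoder would work on a byte buffer):

```python
# Sketch of the proposed padding: append the inverse of the final data bit,
# then fill to a byte boundary with that same inverse bit. A full extra byte
# is needed only when the data already ends on a byte boundary.

def pad(bits):
    """Pad a non-empty bit sequence to a whole number of bytes."""
    inverse = 1 - bits[-1]
    remainder = len(bits) % 8
    fill = 8 - remainder if remainder else 8  # whole padding byte if aligned
    return bits + [inverse] * fill

# Second example from above: 00001111 0  ->  00001111 01111111
assert pad([0, 0, 0, 0, 1, 1, 1, 1, 0]) == [0, 0, 0, 0, 1, 1, 1, 1,
                                            0, 1, 1, 1, 1, 1, 1, 1]
```

Because the padding bits all differ from the final data bit, the boundary between data and padding stays unambiguous.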

@ErdoganSeref
Contributor

How does the decoder detect the end of the game then?

@niklasf
Member Author

niklasf commented Oct 8, 2023

For this particular scheme, the decoder would look at the last bit of the last byte. It would then pop bits off the end, until finding a different bit value.
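Continuing the bit-list sketch from above, the matching decoder side would drop the trailing run of identical bits; the run is exactly the padding, because the final data bit differs from it by construction:

```python
# Sketch of unpadding: pop bits off the end until a bit differing from the
# final bit is found. That differing bit is the last real data bit and is kept.

def unpad(padded):
    """Strip the trailing run of bits equal to the final (padding) bit."""
    pad_bit = padded[-1]
    i = len(padded)
    while i > 0 and padded[i - 1] == pad_bit:
        i -= 1
    return padded[:i]

# Round-trip of the second example: 00001111 01111111  ->  00001111 0
assert unpad([0, 0, 0, 0, 1, 1, 1, 1,
              0, 1, 1, 1, 1, 1, 1, 1]) == [0, 0, 0, 0, 1, 1, 1, 1, 0]
```

With this in place, the separately stored ply count would no longer be needed to delimit the game data.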

@ErdoganSeref
Contributor

> Or arithmetic coding. Not sure in advance which will perform best. I would expect any of these to gain about half a bit per move.

I don't understand how you arrived at that estimate, so I calculated the entropy from the move index frequencies. The entropy is around 4.38 bits per move, while the weighted path length of the Huffman code is 4.40 bits per move. So a better entropy coder alone would only lead to marginal improvements.

Below is the script I used to calculate those numbers; you can check whether they are correct.

import math

def entropy(probabilities):
    return -sum([probability * math.log2(probability) for probability in probabilities])

def weightedPathLength(weights, codewordLengths):
    return sum([weight * codeWordLength for weight, codeWordLength in zip(weights, codewordLengths)])
    
moveIndexFrequencies = [225883932, 134956126, 89041269, 69386238, 57040790, 44974559, 36547155, 31624920, 28432772, 26540493, 24484873, 23058034, 23535272, 20482457, 20450172, 18316057, 17214833, 16964761, 16530028, 15369510, 14178440, 14275714, 13353306, 12829602, 13102592, 11932647, 10608657, 10142459, 8294594, 7337490, 6337744, 5380717, 4560556, 3913313, 3038767, 2480514, 1951026, 1521451, 1183252, 938708, 673339, 513153, 377299, 276996, 199682, 144602, 103313, 73046, 52339, 36779, 26341, 18719, 13225, 9392, 6945, 4893, 3698, 2763, 2114, 1631, 1380, 1090, 887, 715, 590, 549, 477, 388, 351, 319, 262, 236, 200, 210, 153, 117, 121, 121, 115, 95, 75, 67, 55, 50, 55, 33, 33, 30, 32, 28, 29, 27, 21, 15, 9, 10, 12, 12, 8, 7, 2, 4, 5, 5, 1, 5, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
moveIndexFrequenciesTotal = sum(moveIndexFrequencies)
moveIndexProbabilities = [moveIndexFrequency / moveIndexFrequenciesTotal for moveIndexFrequency in moveIndexFrequencies]

codeWordLengthOccurrences = [1, 1, 3, 5, 12, 9, 3, 4, 2, 2, 2, 2, 2, 2, 2, 1, 3, 2, 3, 5, 4, 5, 7, 6, 5, 6, 5, 5, 49, 98]
codeWordLengths = []
codeWordLength = 2
for codeWordLengthOccurrence in codeWordLengthOccurrences:
    for _ in range(codeWordLengthOccurrence):
        codeWordLengths.append(codeWordLength)
    codeWordLength += 1
    
bits_per_move_theoretical = entropy(moveIndexProbabilities)
bits_per_move_huffman = weightedPathLength(moveIndexProbabilities, codeWordLengths)

print(f"Entropy: {bits_per_move_theoretical:.2f} bits per move")
print(f"Huffman weighted path length: {bits_per_move_huffman:.2f} bits per move")
