
game compression v3 #4129

Open
niklasf opened this issue Mar 13, 2018 · 19 comments
Labels
foundational, prospect (For long-term consideration)

Comments

@niklasf
Member

niklasf commented Mar 13, 2018

@niklasf niklasf added the prospect For long-term consideration label Mar 13, 2018
@cyanfish
Member

Faster alternative to arithmetic coding:
https://en.wikipedia.org/wiki/Asymmetric_numeral_systems

@ErdoganSeref
Contributor

I'd like to work on this issue, but could you explain the approach for the arithmetic coding in detail? Is our alphabet the codes of the Huffman tree?

@niklasf
Member Author

niklasf commented Feb 25, 2023

There's no predetermined approach. This issue is basically a research project, trying to improve upon the compression we have now, and everything above is just a collection of rough ideas.

There are theoretical limits on what can be achieved with better entropy coding alone (I believe the maximum is saving 0.5 more bits per position, about 10%). Major gains would have to come from better modelling. That is, to find a better way to assign probabilities to each move in a given position, minimizing cross entropy.
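Minimizing cross entropy is the key quantity here: it measures the average bits per move actually paid when coding data from a true distribution using code lengths derived from a model. A minimal sketch with toy probabilities (the numbers are illustrative, not Lichess data):

```python
import math

def cross_entropy(true_probs, model_probs):
    """Average bits per symbol when data distributed as true_probs is
    coded with lengths -log2(p) derived from model_probs."""
    return -sum(t * math.log2(m) for t, m in zip(true_probs, model_probs) if t > 0)

# Toy distribution over 4 move indices (hypothetical numbers):
true_probs = [0.5, 0.25, 0.15, 0.10]

# A model matching the true distribution pays exactly its entropy...
h = cross_entropy(true_probs, true_probs)
# ...while a mismatched model always pays extra bits (the KL divergence):
h_uniform = cross_entropy(true_probs, [0.25, 0.25, 0.25, 0.25])
assert h_uniform >= h
```

The gap between the two numbers is exactly what better modelling would recover, independent of which entropy coder is used.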

@ErdoganSeref
Contributor

Which script and data did you use to generate the Huffman codes?

Using different Huffman codes depending on the game phase would be easy to implement: just pass a move parameter to the read and write methods of the Huffman class, and the correct tree is then selected according to the game phase.

But we'd have to divide each game into opening, middlegame and endgame to generate them.

I suggest moves 0 to 20 as the opening, moves 20 to 40 as the middlegame, and move 40 upwards as the endgame.
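The proposed table selection could be sketched as follows; the table names are placeholders, standing in for Huffman codes that would be trained separately on each phase:

```python
# Hypothetical sketch of selecting a Huffman table by game phase, using
# the proposed cutoffs: moves 0-20 opening, 20-40 middlegame, 40+ endgame.

OPENING, MIDDLEGAME, ENDGAME = range(3)

def game_phase(ply):
    """Map a 0-based ply to a phase (two plies per full move)."""
    move_number = ply // 2
    if move_number < 20:
        return OPENING
    if move_number < 40:
        return MIDDLEGAME
    return ENDGAME

# One code table per phase; in a real encoder these would be the Huffman
# codes generated from games in that phase (names are illustrative).
tables = {OPENING: "opening_codes", MIDDLEGAME: "middlegame_codes", ENDGAME: "endgame_codes"}

def table_for(ply):
    return tables[game_phase(ply)]
```

Since the decoder also knows the current ply, it can select the same table without any extra bits being stored.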

@niklasf
Member Author

niklasf commented Feb 26, 2023

Awesome, sounds good!

The data was a sample from https://database.lichess.org/. The tools and scripts are linked in the "Ordering moves" section of https://lichess.org/blog/Wqa7GiAAAOIpBLoY/developer-update-275-improved-game-compression.

@ErdoganSeref
Contributor

Based on the number of games, I guess you used December 2017 as the sample.

I couldn't find a script that performs the exact task. I guess you built a private script based on rust-pgn-reader.

Still, I don't understand how you translate the probabilities of thousands of legal moves from the PGN to 256 Huffman codes.

@niklasf
Member Author

niklasf commented Mar 6, 2023

The legal moves in any position are first sorted based on an ordering heuristic. Then the word to encode is the index in the list of ordered moves. If the move ordering heuristic is good, index 0 is most likely to have been played, followed by 1, etc. And no chess position has more than 256 moves. See https://lichess.org/blog/Wqa7GiAAAOIpBLoY/developer-update-275-improved-game-compression for a bit more detail.
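The index-based scheme can be sketched in a few lines. The heuristic below is a toy stand-in, not Lichess's real move-ordering heuristic; the important property is that it is deterministic, so encoder and decoder reproduce the same ordering:

```python
# Illustrative sketch: legal moves are sorted by a heuristic score, and
# only the index of the move actually played is encoded.

def encode_move(legal_moves, played, score):
    """Return the index of `played` in the heuristically ordered move list."""
    ordered = sorted(legal_moves, key=score, reverse=True)
    return ordered.index(played)

def decode_move(legal_moves, index, score):
    """Recover the move from its index; the decoder can rebuild the same
    ordering because it knows the position and the deterministic heuristic."""
    ordered = sorted(legal_moves, key=score, reverse=True)
    return ordered[index]

# Toy example: moves scored by a fake "capture value" heuristic.
moves = ["Qxd5", "Nf3", "e4", "O-O"]
toy_score = {"Qxd5": 9, "e4": 2, "Nf3": 1, "O-O": 0}.get
idx = encode_move(moves, "e4", toy_score)  # "e4" ranks second, so index 1
assert decode_move(moves, idx, toy_score) == "e4"
```

The better the heuristic, the more the index distribution is skewed toward 0, and the fewer bits the entropy coder spends per move.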

One weakness of this approach is that the probability distribution is fixed across all positions, without context. For example index 7 still has some implied probability, even if we know there are only 4 legal moves. And another bit of unused context is the game phase, that you've prepared a branch for.
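Using the legal-move count as context could be as simple as truncating the global index distribution and renormalizing, so no probability mass is wasted on impossible indices. A minimal sketch with illustrative probabilities:

```python
# Sketch of conditioning on the number of legal moves: restrict the global
# distribution over move indices to the possible indices and renormalize.

def conditioned(probs, num_legal):
    """Keep only the first num_legal entries and rescale them to sum to 1."""
    head = probs[:num_legal]
    total = sum(head)
    return [p / total for p in head]

# Hypothetical global distribution over 8 move indices:
global_probs = [0.4, 0.2, 0.1, 0.1, 0.08, 0.07, 0.03, 0.02]

p = conditioned(global_probs, 4)  # position with only 4 legal moves
assert abs(sum(p) - 1.0) < 1e-9
assert p[0] > global_probs[0]  # index 0 absorbs the mass freed from indices 4..7
```

This only helps with an entropy coder that can use per-position probabilities (arithmetic coding or ANS); a single static Huffman table cannot exploit it.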

The linked Rust and Python scripts are not the cleanest, but they should be complete.

@niklasf
Member Author

niklasf commented Mar 29, 2023

Nice work! And sorry, looks like a major part was missing from the Rust version after all.

What were the overall results in terms of average bytes per game or move?

@ErdoganSeref
Contributor

Opening: 21.52 bytes per game
Middlegame: 12.45 bytes per game
Endgame: 3.74 bytes per game

How much did the modification improve the algorithm?

@niklasf
Member Author

niklasf commented Mar 29, 2023

It's a slight regression, with the status quo being 37.03 bytes per game. There must be a subtle mistake somewhere. I'll review it more closely soon.

@ErdoganSeref
Contributor

> Faster alternative to arithmetic coding: https://en.wikipedia.org/wiki/Asymmetric_numeral_systems

Should we implement rANS or tANS? Which of the two variants is better for our use case?

@niklasf
Member Author

niklasf commented May 2, 2023

Or arithmetic coding. Not sure in advance which will perform best. I would expect any of these to gain about half a bit per move.

@niklasf
Member Author

niklasf commented Jun 3, 2023

We might need a separate one, then, since we're not touching clock compression here.

@ErdoganSeref
Contributor

> * use padding instead of storing ply count

Could you elaborate on that further?

@niklasf
Member Author

niklasf commented Sep 21, 2023

When the compressor writes bits to a byte array, the final byte may not be fully used, but the database can only store whole bytes. Without doing something about that, the decoder cannot tell which bits of a byte array actually belong to the game.

       Example bits: 00001111 01001111 000011
Currently stored as: 00001111 01001111 00001100 (n moves)

       Example bits: 00001111 0
Currently stored as: 00001111 00000000 (m moves)

       Example bits: 00001111 00000001
Currently stored as: 00001111 00000001 (k moves)

The current solution is to separately store the number of moves and pass them to https://github.com/lichess-org/compression/blob/e675db1936eaf6dc0c5f718366760f00a5b7b47d/src/main/java/game/Encoder.java#L117.

Bit padding would be more efficient. For example, one scheme would be to append the inverse of the final data bit, and fill the remaining bits of the final byte with that same value.

       Example bits: 00001111 01001111 000011
Currently stored as: 00001111 01001111 00001100 (n moves)
        Bit padding: 00001111 01001111 00001100

       Example bits: 00001111 0
Currently stored as: 00001111 00000000 (m moves)
        Bit padding: 00001111 01111111

       Example bits: 00001111 00000001
Currently stored as: 00001111 00000001 (k moves)
        Bit padding: 00001111 00000001 00000000

Most of the time this would be free; only in cases like the third example is an additional byte required.
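The padding scheme described above could be sketched like this, representing bits as a list of 0/1 ints for clarity (a real encoder would work on a byte buffer):

```python
# Sketch of the proposed padding: append the inverse of the final data bit,
# then fill to a byte boundary with that same inverse bit. A full extra byte
# is needed only when the data already ends on a byte boundary.

def pad(bits):
    """Pad a non-empty bit sequence to a whole number of bytes."""
    inverse = 1 - bits[-1]
    remainder = len(bits) % 8
    fill = 8 - remainder if remainder else 8  # whole padding byte if aligned
    return bits + [inverse] * fill

# Second example from above: 00001111 0  ->  00001111 01111111
assert pad([0, 0, 0, 0, 1, 1, 1, 1, 0]) == [0, 0, 0, 0, 1, 1, 1, 1,
                                            0, 1, 1, 1, 1, 1, 1, 1]
```

Because the padding bits all differ from the final data bit, the boundary between data and padding stays unambiguous.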

@ErdoganSeref
Contributor

How does the decoder detect the end of the game then?

@niklasf
Member Author

niklasf commented Oct 8, 2023

For this particular scheme, the decoder would look at the last bit of the last byte. It would then pop bits off the end, until finding a different bit value.
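Continuing the bit-list sketch from above, the matching decoder side would drop the trailing run of identical bits; the run is exactly the padding, because the final data bit differs from it by construction:

```python
# Sketch of unpadding: pop bits off the end until a bit differing from the
# final bit is found. That differing bit is the last real data bit and is kept.

def unpad(padded):
    """Strip the trailing run of bits equal to the final (padding) bit."""
    pad_bit = padded[-1]
    i = len(padded)
    while i > 0 and padded[i - 1] == pad_bit:
        i -= 1
    return padded[:i]

# Round-trip of the second example: 00001111 01111111  ->  00001111 0
assert unpad([0, 0, 0, 0, 1, 1, 1, 1,
              0, 1, 1, 1, 1, 1, 1, 1]) == [0, 0, 0, 0, 1, 1, 1, 1, 0]
```

With this in place, the separately stored ply count would no longer be needed to delimit the game data.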

@ErdoganSeref
Contributor

> Or arithmetic coding. Not sure in advance which will perform best. I would expect any of these to gain about half a bit per move.

I don't understand how you arrived at that estimate, so I calculated the entropy from the move index frequencies. The entropy is around 4.38 bits per move, while the weighted path length of the Huffman code is 4.40 bits per move. So a better entropy coder alone would only lead to marginal improvements.

Below is the script I used to calculate those numbers; you can check whether they are correct.

import math

def entropy(probabilities):
    return -sum([probability * math.log2(probability) for probability in probabilities])

def weightedPathLength(weights, codewordLengths):
    return sum([weight * codeWordLength for weight, codeWordLength in zip(weights, codewordLengths)])
    
moveIndexFrequencies = [225883932, 134956126, 89041269, 69386238, 57040790, 44974559, 36547155, 31624920, 28432772, 26540493, 24484873, 23058034, 23535272, 20482457, 20450172, 18316057, 17214833, 16964761, 16530028, 15369510, 14178440, 14275714, 13353306, 12829602, 13102592, 11932647, 10608657, 10142459, 8294594, 7337490, 6337744, 5380717, 4560556, 3913313, 3038767, 2480514, 1951026, 1521451, 1183252, 938708, 673339, 513153, 377299, 276996, 199682, 144602, 103313, 73046, 52339, 36779, 26341, 18719, 13225, 9392, 6945, 4893, 3698, 2763, 2114, 1631, 1380, 1090, 887, 715, 590, 549, 477, 388, 351, 319, 262, 236, 200, 210, 153, 117, 121, 121, 115, 95, 75, 67, 55, 50, 55, 33, 33, 30, 32, 28, 29, 27, 21, 15, 9, 10, 12, 12, 8, 7, 2, 4, 5, 5, 1, 5, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
moveIndexFrequenciesTotal = sum(moveIndexFrequencies)
moveIndexProbabilities = [moveIndexFrequency / moveIndexFrequenciesTotal for moveIndexFrequency in moveIndexFrequencies]

codeWordLengthOccurrences = [1, 1, 3, 5, 12, 9, 3, 4, 2, 2, 2, 2, 2, 2, 2, 1, 3, 2, 3, 5, 4, 5, 7, 6, 5, 6, 5, 5, 49, 98]
codeWordLengths = []
codeWordLength = 2
for codeWordLengthOccurrence in codeWordLengthOccurrences:
    for _ in range(codeWordLengthOccurrence):
        codeWordLengths.append(codeWordLength)
    codeWordLength += 1
    
bits_per_move_theoretical = entropy(moveIndexProbabilities)
bits_per_move_huffman = weightedPathLength(moveIndexProbabilities, codeWordLengths)

print(f"Entropy: {bits_per_move_theoretical:.2f} bits per move")
print(f"Huffman weighted path length: {bits_per_move_huffman:.2f} bits per move")
