game compression v3 #4129
Faster alternative to arithmetic coding: |
I'd like to work on the issue but could you explain the approach for the arithmetic coding in detail? Is our alphabet the codes of the Huffman Tree? |
There's no predetermined approach. This issue is basically a research project, trying to improve upon the compression we have now, and everything above is just a collection of rough ideas. There are theoretical limits on what can be achieved with better entropy coding alone (I believe the maximum saving is 0.5 more bits per position, about 10%). Major gains would have to come from better modelling, that is, finding a better way to assign probabilities to each move in a given position, minimizing cross entropy. |
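The modelling goal can be stated concretely: pick model probabilities q that minimize cross entropy against the true move distribution p. A minimal sketch with toy numbers (not Lichess data):

```python
import math

def cross_entropy(p, q):
    # Average bits per symbol when data distributed as p is coded with
    # an entropy coder tuned for q. It equals the entropy of p exactly
    # when q == p, and is strictly larger for any other model q.
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

# Toy distributions: p is the "true" move distribution, q a model guess.
p = [0.5, 0.25, 0.25]
q = [0.25, 0.5, 0.25]
```

With these numbers, coding p-distributed moves with the model q costs 1.75 bits per move against an entropy of 1.5, so a better model saves 0.25 bits per move here.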
Which script and data did you use to generate the Huffman codes? Using different Huffman codes depending on the game phase would be easy to implement: just pass the phase to the encoder. But we'd have to divide each game into opening, middlegame and endgame to generate them. I suggest moves 0 to 20 are the opening, moves 20 to 40 the middlegame, and move 40 upwards the endgame. |
Awesome, sounds good! The data was a sample from https://database.lichess.org/. The tools and scripts are linked in the "Ordering moves" section of https://lichess.org/blog/Wqa7GiAAAOIpBLoY/developer-update-275-improved-game-compression. |
Based on the number of games, I guess you used December 2017 as the sample. I couldn't find a script that performs the exact task; I guess you built a private script based on rust-pgn-reader. Still, I don't understand how you translate the probabilities of thousands of legal moves from the PGN into 256 Huffman codes. |
The legal moves in any position are first sorted based on an ordering heuristic. Then the word to encode is the index into the list of ordered moves. If the move ordering heuristic is good, index 0 is most likely to have been played, followed by index 1, and so on. And no chess position has more than 256 moves. See https://lichess.org/blog/Wqa7GiAAAOIpBLoY/developer-update-275-improved-game-compression for a bit more detail. One weakness of this approach is that the probability distribution is fixed across all positions, without context. For example, index 7 still has some implied probability even if we know there are only 4 legal moves. Another bit of unused context is the game phase, which you've prepared a branch for. The linked Rust and Python scripts are not the cleanest, but they should be complete. |
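The index-based scheme can be sketched as follows. The `score` heuristic here is a hypothetical stand-in for the real move-ordering heuristic described in the blog post:

```python
def encode_move(played, legal_moves, score):
    # Sort legal moves so the heuristic's favourite gets index 0;
    # the word handed to the entropy coder is the played move's
    # position in that ordered list.
    ordered = sorted(legal_moves, key=score, reverse=True)
    return ordered.index(played)

def decode_move(index, legal_moves, score):
    # The decoder knows the same legal moves and the same heuristic,
    # so it can rebuild the identical ordering and look the move up.
    ordered = sorted(legal_moves, key=score, reverse=True)
    return ordered[index]
```

Because both sides reconstruct the same ordering deterministically, only the index needs to be stored, and small indices dominate whenever the heuristic is good.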
Nice work! And sorry, looks like a major part was missing from the Rust version after all. What were the overall results in terms of average bytes per game or move? |
Opening: 21.518720819986676 bytes per game | How much did the modification improve the algorithm? |
It's a slight regression, with the status quo being 37.03 bytes per game. There must be a subtle mistake somewhere. I'll review more closely soon. |
Should we implement rANS or tANS? Which of the two variants is better for our use case? |
Or arithmetic coding. Not sure in advance which will perform best. I would expect any of these to gain about half a bit per move. |
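For illustration, the core of rANS fits in a few lines. This toy version uses Python big integers and skips renormalization entirely, so it only demonstrates the idea and is not a production coder; `freqs` must be all positive and sum to a power of two:

```python
def build_cum(freqs):
    # Cumulative frequency table: cum[s] is the start of symbol s's
    # slot range; total is the table size M (a power of two).
    cum, total = [], 0
    for f in freqs:
        cum.append(total)
        total += f
    return cum, total

def rans_encode(symbols, freqs):
    cum, M = build_cum(freqs)
    x = 1
    # Encode in reverse so the decoder emits symbols in forward order.
    for s in reversed(symbols):
        x = (x // freqs[s]) * M + cum[s] + (x % freqs[s])
    return x

def rans_decode(x, freqs, n):
    cum, M = build_cum(freqs)
    out = []
    for _ in range(n):
        slot = x % M
        # Find the symbol whose slot range contains this slot.
        s = max(i for i in range(len(freqs)) if cum[i] <= slot)
        x = freqs[s] * (x // M) + slot - cum[s]
        out.append(s)
    return out
```

Because the state x advances by a factor of roughly M/freqs[s] per symbol, each symbol costs about log2(M/freqs[s]) bits, i.e. fractional bits are possible, unlike with whole-bit Huffman codes.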
We might need a separate one, then, since we're not touching clock compression here. |
Could you elaborate that further? |
When the compressor writes bits to a byte array, the final byte may not be fully used, but the database can only store whole bytes. Without doing something about that, the decoder cannot tell which bits of a byte array actually belong to the game.
The current solution is to separately store the number of moves and pass it to https://github.com/lichess-org/compression/blob/e675db1936eaf6dc0c5f718366760f00a5b7b47d/src/main/java/game/Encoder.java#L117. Bit padding would be more efficient. For example, one scheme would be to append the inverse of the final bit, and fill the remaining bits of the final byte with the same.
Most of the time this would be free. Only in cases like the third example is an additional byte required. |
How does the decoder detect the end of the game then? |
For this particular scheme, the decoder would look at the last bit of the last byte. It would then pop bits off the end, until finding a different bit value. |
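The padding scheme sketched above, with the bit stream as a Python list of 0/1 values:

```python
def pad(bits):
    # Append the inverse of the final bit, then repeat that inverted
    # bit until the length is a whole number of bytes. Only when the
    # input already fills its last byte (or is empty) does this cost
    # an extra full byte.
    marker = 1 - bits[-1] if bits else 1
    padded = bits + [marker]
    while len(padded) % 8 != 0:
        padded.append(marker)
    return padded

def unpad(padded):
    # The decoder looks at the last bit of the last byte and pops bits
    # off the end until it finds a different value; everything up to
    # and including that differing bit is real data.
    marker = padded[-1]
    end = len(padded)
    while end > 0 and padded[end - 1] == marker:
        end -= 1
    return padded[:end]
```

The round trip works because the first padding bit is guaranteed to differ from the last data bit, so the decoder's strip-trailing-run rule stops at exactly the right place.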
I don't understand how you came up with that estimate, so I calculated the entropy from the move index frequencies. The entropy is around 4.38 bits per move, while the weighted path length of the Huffman code is 4.40 bits per move. So using a better entropy coder would only lead to marginal improvements. Below is the script I used to calculate those numbers; you can check whether they are correct.

```python
import math

def entropy(probabilities):
    # Shannon entropy in bits per symbol: the lower bound for any entropy coder.
    return -sum(p * math.log2(p) for p in probabilities)

def weightedPathLength(weights, codewordLengths):
    # Average Huffman code length in bits per symbol.
    return sum(w * l for w, l in zip(weights, codewordLengths))

# How often each move index occurred in the sample.
moveIndexFrequencies = [225883932, 134956126, 89041269, 69386238, 57040790, 44974559, 36547155, 31624920, 28432772, 26540493, 24484873, 23058034, 23535272, 20482457, 20450172, 18316057, 17214833, 16964761, 16530028, 15369510, 14178440, 14275714, 13353306, 12829602, 13102592, 11932647, 10608657, 10142459, 8294594, 7337490, 6337744, 5380717, 4560556, 3913313, 3038767, 2480514, 1951026, 1521451, 1183252, 938708, 673339, 513153, 377299, 276996, 199682, 144602, 103313, 73046, 52339, 36779, 26341, 18719, 13225, 9392, 6945, 4893, 3698, 2763, 2114, 1631, 1380, 1090, 887, 715, 590, 549, 477, 388, 351, 319, 262, 236, 200, 210, 153, 117, 121, 121, 115, 95, 75, 67, 55, 50, 55, 33, 33, 30, 32, 28, 29, 27, 21, 15, 9, 10, 12, 12, 8, 7, 2, 4, 5, 5, 1, 5, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
moveIndexFrequenciesTotal = sum(moveIndexFrequencies)
moveIndexProbabilities = [f / moveIndexFrequenciesTotal for f in moveIndexFrequencies]

# Code word lengths of the current Huffman code, run-length encoded:
# one 2-bit code, one 3-bit code, three 4-bit codes, and so on.
codeWordLengthOccurrences = [1, 1, 3, 5, 12, 9, 3, 4, 2, 2, 2, 2, 2, 2, 2, 1, 3, 2, 3, 5, 4, 5, 7, 6, 5, 6, 5, 5, 49, 98]
codeWordLengths = []
codeWordLength = 2
for occurrences in codeWordLengthOccurrences:
    codeWordLengths.extend([codeWordLength] * occurrences)
    codeWordLength += 1

bits_per_move_theoretical = entropy(moveIndexProbabilities)
bits_per_move_huffman = weightedPathLength(moveIndexProbabilities, codeWordLengths)
print(f"entropy: {bits_per_move_theoretical:.4f} bits per move")
print(f"Huffman: {bits_per_move_huffman:.4f} bits per move")
```
|
tune relative importance of criteria