Discussion needed for the next steps: User-solutions for approximate-position-searching? #12

nguyenpham · 2022-01-06T23:58:12Z

nguyenpham
Jan 6, 2022
Maintainer

In this post, we summarize what we have done so far and the issues we are facing. We would like to hear your opinions and suggestions, and to get help from experts.

A. Current status

1. Building databases
With the latest attempt (Attempt 9th), we store game contents in SQLite databases in a simple, straightforward way. Each game has two important fields: a FEN string (empty if the game started from the standard start position) and a string of all moves. The move string is not yet split or parsed into individual moves.

This makes the building process very fast. Converting a PGN file of 3.45 million games into a SQL database stored on the hard disk took under a minute (the program used only one thread, running on my old computer: Quad-Core i7, 3.6 GHz, 16 GB RAM).

2. Extract moves/positions
Because we don’t store any information about individual moves or their positions, we must create all that information on the fly whenever we need it. Basically, to do position matching or searching, we need to read all games from the database, parse the move text into moves, and make those moves on a chessboard to get all the information about their positions.

In our tests, the time to read all games from the SQLite database is about the same as the time to read them from the PGN file. The total time to read and parse all games of a 3.45-million-game database is under 4 minutes on my computer (IMHO, that is fast, somewhat comparable to SCID, the fastest chess database program so far).

The code to read all games and get their FEN/moves string pairs, without splitting or parsing, is as follows:

     for (auto cnt = 0; statement.executeStep(); cnt++) {
            auto gameID = statement.getColumn("ID").getInt64(); assert(gameID > 0);
            auto fenText = statement.getColumn("FEN").getText();
            auto moveText = statement.getColumn("Moves").getText();
     }

Below is the code to parse each string pair into a chessboard:

     for (auto cnt = 0; statement.executeStep(); cnt++) {
            auto gameID = statement.getColumn("ID").getInt64();
            auto fenText = statement.getColumn("FEN").getText();
            auto moveText = statement.getColumn("Moves").getText();
            threadParsePGNGame(gameID, fenText, moveText);
     }
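Since the Moves column holds one unsplit string, the first step of parsing is to cut it into individual SAN tokens. The following is only a minimal sketch of that step, assuming the text contains plain SAN without comments or variations; `splitMoveText` is a hypothetical helper and not the project's actual parser.

```cpp
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of the project): splits a move string such as
// "1. e4 e5 2. Nf3 Nc6 1-0" into SAN tokens, dropping move numbers and the
// game result. Assumes the text has no comments or variations.
std::vector<std::string> splitMoveText(const std::string& moveText) {
    std::vector<std::string> sanMoves;
    std::istringstream stream(moveText);
    std::string token;

    while (stream >> token) {
        // Skip game-result markers
        if (token == "1-0" || token == "0-1" || token == "1/2-1/2" || token == "*") {
            continue;
        }

        // Strip a leading move number such as "12." or "12...", also when it
        // is glued to the move ("3.Bb5")
        std::size_t i = 0;
        while (i < token.size() && std::isdigit(static_cast<unsigned char>(token[i]))) {
            i++;
        }
        if (i > 0 && i < token.size() && token[i] == '.') {
            while (i < token.size() && token[i] == '.') {
                i++;
            }
            token = token.substr(i);
        }

        if (!token.empty()) {
            sanMoves.push_back(token);
        }
    }
    return sanMoves;
}
```

Each resulting token still has to be converted into a real move by making it on a board, which is where most of the parsing time goes.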

3. Exact-position-matching and approximate-position-searching
Both matching and searching need the information of all positions, which is created on the fly as described in the previous part. After parsing and making each move, the program calls a callback function and passes it a generous set of parameters, including a full set of bitboards, the king squares, and the hash key of the position. We can add code to that callback function to do any task we need.

For example, to find all positions with three White Queens we write the callback as follows. The code takes the bitboard of all Queens, ANDs it with the bitboard of the White side, and then counts the 1-bits using the function popCount:

checkToStop = [=](int64_t gameId, const std::vector<uint64_t>& bitboardVec, const bslib::BoardCore* board) -> bool {
    // Bitboards of all White pieces and of all Queens (both sides)
    auto White = bitboardVec[static_cast<int>(bslib::BBIdx::white)];
    auto Queens = bitboardVec[static_cast<int>(bslib::BBIdx::queens)];

    // White & Queens keeps only the White Queens; popCount counts them
    if (popCount(White & Queens) == 3) {
        succCount++;
        std::cout << succCount << ". gameId: " << gameId << ", fen: " << board->getFen() << std::endl;

        return true;
    }

    return false;
};

It all works well. The callback function runs very fast. It adds only a little extra time, so the total time is quite close to the time needed to read and parse the games (about 4 minutes).
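To make the bitboard test in the callback concrete, here is a small stand-alone sketch. It is not the project's code: `popCount` below is a stand-in for the project's function of the same name, `squareBit` and `hasThreeWhiteQueens` are hypothetical helpers, and the square numbering (a1 = 0, ..., h8 = 63) is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the project's popCount: counts the 1-bits of a bitboard.
// Each loop iteration clears the lowest set bit (Kernighan's method).
int popCount(uint64_t bb) {
    int count = 0;
    while (bb) {
        bb &= bb - 1;
        count++;
    }
    return count;
}

// Assumed square numbering for this sketch: a1 = 0, b1 = 1, ..., h8 = 63.
// file and rank are 0-based (file 0 = column a, rank 0 = row 1).
uint64_t squareBit(int file, int rank) {
    assert(file >= 0 && file < 8 && rank >= 0 && rank < 8);
    return 1ULL << (rank * 8 + file);
}

// The same test as in the callback: exactly three Queens belong to White.
bool hasThreeWhiteQueens(uint64_t white, uint64_t queens) {
    return popCount(white & queens) == 3;
}
```

For instance, with White pieces on d1, a8, h8 and e1 and Queens on d1, a8, h8 and d8, the AND leaves d1, a8 and h8, so the test succeeds.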

B. The issue and considering solutions
Although matching/searching as described above works, it requires hard-coding: we need to write some code, compile it and run it. That is impossible for average users.

We need a solution for this issue, one that lets users query what they want without coding.

One of the simplest solutions is to use dialog boxes. The program creates some dialog boxes; users tick some boxes and select items from dropdowns. The program then runs the callback function with parameters taken from the dialog box (using some if/switch statements). However, dialog boxes have a big limitation: they can’t cover all cases, and any effort to extend their coverage comes at the price of heavy complexity. Dialog boxes are also suitable for programs with a GUI but not for console/command-line programs.

We think the solution should be text-based: users can query what they want just by entering some simple text, without compiling and rerunning the program.

At the moment we are considering the two solutions below.

1. Using CQL (Chess Query Language)
Frankly speaking, even though I have known about CQL for a while, I didn’t rate it highly or plan to use it, since it looked so complicated to me. I didn’t learn to use it and was confused about its license. Clearly, I didn’t have enough information and avoided it as much as possible. However, some experts have given me simple examples of its usage, enough for me to understand how powerful it is and how simple it can be to use (at least there are some simple ways to use it).

If we could integrate CQL with SQL databases, everything would be done quickly. CQL has been developed over a long time by a good team, so it should be good at its tasks.

We have started looking for more information and have got the latest code as well as some confirmations from the CQL authors (many thanks to them; we highly appreciate it). The authors have no objection to our using/integrating their code and may support us too.

The good news is that the code written by those authors is under an MIT license, which is compatible with the license of this project and allows use in any other project.

However, they warned us that CQL uses some code from SCID under a GPL license, which conflicts with our license. They can’t contact Shane Hudson (the original author of SCID) to change the license. That code is mostly for PGN parsing and may take a lot of effort to replace. Yes, we have our own code to parse PGN, but replacing it does not look like an easy task.

2. Developing a simple expression parser

Instead of coding, users can enter an expression as text. The program will parse it, convert it automatically into parameters, and run it with the callback function. For example, users can enter expression strings such as the ones below:

// find all positions having 3 White Queens
“popCount(White & Queens) == 3”

// Find all positions having two Black Rooks in the middle squares
“popCount(Black & Rooks & bb(e4, e5, f4, f5)) == 2”

// White Pawns in d4, e5, f4, g4, Black King in b7
“(White & Pawns & bb(d4, e5, f4, g4)) == bb(d4, e5, f4, g4) && blackKingSquare == b7”

Since these are just simple expressions, i.e. short strings that focus on working with bitboards and some chess-specific constants (such as d4), this should be much easier to implement than CQL. And of course, since it is our own code, we won’t have any problem with license conflicts.
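As an illustration of what such expressions would evaluate to, the sketch below hand-codes two of the example strings in C++. Everything in it is an assumption for illustration: the `bb` helper, the `Position` struct, the function names and the square numbering (a1 = 0, ..., h8 = 63) are hypothetical and not taken from the project.

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>
#include <string>

// Assumed square numbering: a1 = 0, b1 = 1, ..., h8 = 63.
inline int squareIndex(const std::string& name) {
    assert(name.size() == 2 && name[0] >= 'a' && name[0] <= 'h'
           && name[1] >= '1' && name[1] <= '8');
    return (name[1] - '1') * 8 + (name[0] - 'a');
}

// Hypothetical counterpart of bb(...) in the expressions: a bitboard with
// exactly the listed squares set.
inline uint64_t bb(std::initializer_list<const char*> squares) {
    uint64_t mask = 0;
    for (const char* sq : squares) {
        mask |= 1ULL << squareIndex(sq);
    }
    return mask;
}

inline int popCount(uint64_t v) {
    int n = 0;
    while (v) { v &= v - 1; n++; }
    return n;
}

// Minimal stand-in for the data the callback receives.
struct Position {
    uint64_t white, black, pawns, rooks;
    int blackKingSquare;
};

// "popCount(Black & Rooks & bb(e4, e5, f4, f5)) == 2"
inline bool twoBlackRooksInMiddle(const Position& p) {
    return popCount(p.black & p.rooks & bb({"e4", "e5", "f4", "f5"})) == 2;
}

// "(White & Pawns & bb(d4, e5, f4, g4)) == bb(d4, e5, f4, g4) && blackKingSquare == b7"
inline bool pawnsAndBlackKing(const Position& p) {
    const uint64_t mask = bb({"d4", "e5", "f4", "g4"});
    return (p.white & p.pawns & mask) == mask
        && p.blackKingSquare == squareIndex("b7");
}
```

An expression parser would build the same computations at run time from the user's text instead of from compiled functions.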

C. Discussion

We would love to hear your opinions, especially from anyone with experience using and/or implementing CQL. Any suggestions or help with the implementation are highly appreciated.

haydoooke · 2022-01-07T03:14:01Z

haydoooke
Jan 7, 2022

> However, they warned us that CQL is using some code from SCID with a GPL license which conflicts with our license. They can’t contact Shane Hudson (the first author of SCID) to change the license. That code is mostly for PGN parsing and may take a lot of effort to replace. Yes, we have code to parse PGN but looks like not an easy task (to replace).

The SCID code in the orig directory does a lot more than just parse. Much, much more.

Keep in mind what I mentioned in an earlier e-mail. There are at least a couple of fresh CQL implementations on the near horizon, and their licensing might be more amenable to your needs. At least one of those has been written from scratch (partly so that it can handle variants) and would not carry any Scid baggage with it. The formal announcement should be just around the corner.

Also, Gregor began implementing his own CQL for Scidb.

https://sourceforge.net/projects/scidb/

I'm not sure how far along he got but you might want to take a look at it. His language has a different syntax and might give you some ideas for whatever you decide on for your own expressive chess query language.

> // find all positions having 3 White Queens
> “popCount(White & Queens) == 3”
> 
> // Find all positions having two Black Rooks in the middle squares
> “popCount(Black & Rooks & bb(e4, e5, f4, f5)) == 2”
> 
> // White Pawns in d4, e5, f4, g4, Black King in b7
> “(White & Pawns & bb(d4, e5, f4, g4)) == bb(d4, e5, f4, g4) && blackKingSquare == b7

Just to let others compare the above with the CQL equivalent syntax:

'Q == 3'
'r[d4,d5,e4,e5] == 2'
'P[d4,e5,f4,g4] == 3 and kb7'

5 replies

nguyenpham Jan 7, 2022
Maintainer Author

> The SCID code in the orig directory does a lot more than just parse. Much, much more.
>
> Keep in mind what I mentioned in an earlier e-mail. There are at least a couple of fresh CQL implementations on the near horizon, and their licensing might be more amenable to your needs. At least one of those has been written from scratch (partly so that it can handle variants) and would not carry any Scid baggage with it. The formal announcement should be just around the corner.

That is a terrible headache for me, since I can’t use the SCID code (due to license conflicts). CQL has a huge code base, so it is surely not an easy task to understand and modify it. That is why I keep solution 2 as a backup, or at least as a temporary solution.

> // find all positions having 3 White Queens
> “popCount(White & Queens) == 3”
>
> // Find all positions having two Black Rooks in the middle squares
> “popCount(Black & Rooks & bb(e4, e5, f4, f5)) == 2”
>
> // White Pawns in d4, e5, f4, g4, Black King in b7
> “(White & Pawns & bb(d4, e5, f4, g4)) == bb(d4, e5, f4, g4) && blackKingSquare == b7
>
> Just to let others compare the above with the CQL equivalent syntax:
>
> 'Q == 3'
> 'r[d4,d5,e4,e5] == 2'
> 'P[d4,e5,f4,g4] == 3 and kb7'

That looks good! I may use similar short forms.

However, my expression is simpler, clearer, and more straightforward, without redefining any meaning. For example, in my expression, Queens == 3 means a comparison between two 64-bit integers: one is the bitboard Queens, the other is the number 3 (which is actually a constant bitboard), not the number of 1-bits. That is why I consider it an expression, not a new query language.

haydoooke Jan 7, 2022

It's contextual semantics. The explicit syntax for acquiring the cardinality of a "set" is, e.g., #Q. In the context of the relational operator, the pound sign is not necessary. But, yes, that's a language thing... defining the semantics.

I should probably point out that Q == Q is a test for equivalence between two "sets", and that particular expression would always be true. Of course, that begs the question of its practical use.... one can assign a bitboard to a variable in CQL.

haydoooke Jan 7, 2022

I forgot to mention... if you choose to use a syntax similar to CQL, you might want to borrow parts of the CQL parser from the source.

nguyenpham Jan 7, 2022
Maintainer Author

Is there a full description/definition of the CQL language syntax somewhere? I have found some descriptions, but none is complete. For example, you explained that the number 3 is a cardinality (in Q == 3), but I wonder whether numbers have any other usage/meaning.

I am also a bit confused about the expression 'P[d4,e5,f4,g4] == 3 and kb7'. IMHO, it looks like a mix of C/C++ syntax and the syntax of other languages (the use of == and AND).

> I forgot to mention... if you choose to use a syntax similar to CQL, you might want to borrow parts of the CQL parser from the source.

Thanks for the suggestion. Sure I will take a look.

Yes, I have been considering implementing an expression parser that is compatible with CQL. I focus on expressions only, since that may be enough for approximate-position-searching. SQL is strong at querying everything else anyway. It is a backup plan in case integrating CQL is delayed or does not work out for us.

BTW, implementing an expression parser is much easier than a full language parser anyway :)

haydoooke Jan 7, 2022

> Are there any full language syntax descriptions/definitions for CQL somewhere?

I'm not aware of anything like that. I guess the ultimate authority on such things is the lexer/parser.

> IMHO, it looks like a mix between syntaxes of C/C++ and other languages (the usage of == and AND).

You're free to change it. :)

nguyenpham · 2022-01-07T11:07:53Z

nguyenpham
Jan 7, 2022
Maintainer Author

I have started writing a BNF (Backus-Naur Form) grammar for the syntax of the expression/condition parser used for position searching. I have tried to make it as compatible as possible with CQL.

clause = condition { ("and" | "or" | "&&" | "||") condition }
condition = expression { ( "=" | "==" | "<" | "<=" | ">" | ">=" | "!=" | "<>" ) expression }
expression = [ "+" | "-" ] term { ( "+" | "-" ) term }
term = factor { ( "*" | "/" ) factor }

factor = number | piece | "(" expression ")"
piece = piecename ( <empty> | square | squareset )

piecename = "K" | "Q" | "R" | "B" | "N" | "P" | "k" | "q" | "r" | "b" | "n" | "p" | "white" | "black"

squareset = column | row | "[" ( square | squarerange | columnrange | rowrange ) { "," ( square | squarerange | columnrange | rowrange ) } "]"

squarerange = square "-" square
columnrange = column "-" column
rowrange = row "-" row
square = column row
column = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h"
row = "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8"

A condition/expression may contain chess piece names. When evaluated, they stand for the cardinalities (total numbers) of those pieces on the chessboard. For example:

R                the total number of White Rooks
qb3              the total number of Black Queens on square b3
B3               the total number of White Bishops on row 3
n[b-e]           the total number of Black Knights from column b to e
P[a4, c5, d5]    the total number of White Pawns on squares a4, c5, and d5
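The cardinality rule above can be sketched as a small counting function. This is only an illustration under several assumptions, not the parser from the project: the board is a plain array of piece letters ('.' for empty, index = rank * 8 + file with a1 = 0), a square range such as a1-c2 is taken to mean the rectangle between the two squares, and `rectMask`, `itemMask` and `countPieces` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cctype>
#include <cstdint>
#include <string>

using Board = std::array<char, 64>;   // assumed layout: a1 = 0, ..., h8 = 63

// Mask of all squares with file in [f0, f1] and rank in [r0, r1] (0-based).
uint64_t rectMask(int f0, int f1, int r0, int r1) {
    uint64_t m = 0;
    for (int r = r0; r <= r1; r++)
        for (int f = f0; f <= f1; f++) m |= 1ULL << (r * 8 + f);
    return m;
}

// Parses one item: "b3", "b", "3", "a1-c2", "b-e" or "5-7".
uint64_t itemMask(const std::string& s) {
    int f0 = 0, f1 = 7, r0 = 0, r1 = 7;   // default: the whole board
    std::size_t i = 0;
    auto isFile = [](char c) { return c >= 'a' && c <= 'h'; };
    auto isRank = [](char c) { return c >= '1' && c <= '8'; };
    if (i < s.size() && isFile(s[i])) f0 = f1 = s[i++] - 'a';
    if (i < s.size() && isRank(s[i])) r0 = r1 = s[i++] - '1';
    if (i < s.size() && s[i] == '-') {    // a range: read its upper end
        i++;
        if (i < s.size() && isFile(s[i])) f1 = s[i++] - 'a';
        if (i < s.size() && isRank(s[i])) r1 = s[i++] - '1';
    }
    assert(!s.empty() && i == s.size());  // the whole item must be consumed
    return rectMask(f0, f1, r0, r1);
}

// Counts pieces for a spec such as "R", "qb3", "B3", "n[b-e]", "P[a4, c5, d5]".
int countPieces(const Board& board, const std::string& spec) {
    assert(!spec.empty());
    const char piece = spec[0];
    std::string sel;                      // selector part with spaces removed
    for (std::size_t i = 1; i < spec.size(); i++)
        if (!std::isspace(static_cast<unsigned char>(spec[i]))) sel += spec[i];

    uint64_t mask = 0;
    if (sel.empty()) {
        mask = ~0ULL;                     // no selector: anywhere on the board
    } else if (sel.front() == '[') {
        assert(sel.back() == ']');
        std::size_t start = 1;
        while (start < sel.size() - 1) {  // split the list on commas
            std::size_t end = sel.find(',', start);
            if (end == std::string::npos) end = sel.size() - 1;
            mask |= itemMask(sel.substr(start, end - start));
            start = end + 1;
        }
    } else {
        mask = itemMask(sel);
    }

    int count = 0;
    for (int sq = 0; sq < 64; sq++)
        if (board[sq] == piece && ((mask >> sq) & 1)) count++;
    return count;
}
```

The names "white" and "black" from the grammar are left out here to keep the sketch short.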

The condition may be implicit or explicit:

R                the implicit form of the comparison R != 0
R == 3           the total number of White Rooks must be 3
q[5-7] >= 2      the total number of Black Queens from row 5 to row 7 must be at least 2

Some other examples:

// find all positions having 3 White Queens
Q = 3

// Find all positions having two Black Rooks in the middle squares
r[e4, e5, f4, f5] = 2

// White Pawns in d4, e5, f4, g4, Black King in b7
P[d4, e5, f4, g4] = 4 and kb7

// More than one Black Pawn in column c
pc > 1

// At least two White Pawns in row 3
P3 >= 2

// Two Bishops (of either color) in columns c, d, e, f
B[c-f] + b[c-f] = 2

0 replies

haydoooke · 2022-01-07T14:48:32Z

haydoooke
Jan 7, 2022

> On 1/7/22 04:08, nguyenpham wrote:
> I have started creating a BNF (Backus Naur Form) for the syntax of the expression/condition parser, used for position searching. I have tried to make it to be compatible as much as possible with CQL.
>
> condition = expression | expression (“=“|“<”|”<=“|”>”|”>=“ | ”and” | ”or”) expression
> expression = [“+”|”-“] term {(“+”|”-“) term }
> term = factor {(“*”|”/“]) factor}
> factor = piececardinality | number | “(“ expression “)”
> piececardinality = piece { <empty> | square|squareset }
> piece = “K” | “Q” | “R” | “B” | “N” | “P” | “k” | “q” | “r” | “b” | “n” | “p” | “W” | “B”

Is this valid BNF? "B" for Bishop and for Black?

…

> square = column row | column | row
> column = “a” | “b” | “c” | “d” | “e” | “f” | “g” | “h”
> row = “1” | “2” | “3” | “4” | “5” | “6” | “7” | “8”
> squareset = “[“ (square {“,” square} | square "-" square | column “-“ column | row “-“ row) “]”
>
> Below are examples:
>
> // find all positions having 3 White Queens
> Q = 3
>
> // Find all positions having two Black Rooks in the middle squares
> r[e4, e5, f4, f5] = 2
>
> // White Pawns in d4, e5, f4, g4, Black King in b7
> P[d4, e5, f4, g4] = 4 and kb7

2 replies

haydoooke Jan 7, 2022

Re mangled BNF: I tried replying by e-mail. That didn't work out so well...

nguyenpham Jan 9, 2022
Maintainer Author

Thanks, fixed with some improvements.

asdfjkl · 2022-01-08T12:18:42Z

asdfjkl
Jan 8, 2022

> A. Current status
>
> 1. Building databases
> With the latest Attempt 9th, we store game contents into SQLite databases in a simple, straightforward way. Each game contains two important fields, one is a string of FEN (empty if the game started from the start position), the other is a string of all moves. The last string is not yet split nor parsed into individual moves.
>
> This way makes the building process be very fast. A PGN file of 3.45 million games required under a minute to be converted into a SQL database stored on the hard disk (the program used only one thread, running on my old computer Quad-Core i7, 3.6 GHz, 16 GB RAM).

If you go in that direction, it might be really worthwhile, though, to convert the PGN into a binary format on the fly. It should not be much slower during conversion, it certainly won't be slower when searching, and it saves a lot of space!

Just as an example, I took one game out of millionbase 2.22, which has 33 moves. It has no comments or variations, which is probably true for most games stored in large databases. The string of moves has 409 chars, i.e. 409 bytes.

The PGN is approx 1.4 GB

Using a two-byte (source destination) encoding of moves, we require 66 bytes for all 33 moves. That is a reduction down to 16 percent of the original size.
Using a one-byte encoding (even though this is somewhat nasty to implement/parse, especially w.r.t. interop), we are down to 33 bytes, which is roughly 8 percent of the original size.

In other words instead of requiring 1.4 GB to store all games, a user would only require space roughly in the 150 to 250 MB range.

That is a huge(!) difference, especially if we consider that Mega has over 9 million games. It might even be possible to further accelerate the search by using SQLite's memory-mapped I/O ( https://www.sqlite.org/mmap.html ), as the whole database would then fit in memory on most current hardware.

It also should not be too difficult to design a (simple) binary format. Basically something like this:

one byte for source, one byte for destination
store promotions in the destination byte
slightly limit the range of the first byte in order to include a TLV structure

i.e. something like this:

B1 B2 B3
if B1 == 0x00: Start of comment, next two bytes (B2 B3) are length, then UTF8 comment string starts
if B1 == 0x01: move annotation
...
if B1 > 0x10: it's a move, source square, B2 has destination square + promotion

(this should be further refined, but it basically corresponds more or less directly to PGN features, to make conversion simple and without losing information)

The great advantage of SQLite, by the way, is that everything is in one file. A program author could create additional "indices" into byte structures / extra tables which are not required for interop, i.e. purely optional/proprietary, but which accelerate certain search queries at the cost of space. I think ChessBase did something similar, which they dubbed "search booster".

An alternative to SQL might be HDF5, but it is certainly not as widespread as SQLite.

PS: Have a look at your 'typical' club player. They will never be able to type in a CQL query. They will always need a GUI, no matter if the backend is CQL or something else.

3 replies

haydoooke Jan 8, 2022

They can understand SAN but not Q == 3? Really?

nguyenpham Jan 9, 2022
Maintainer Author

SQL is not good for binary data when the number of records and/or the size of each record is huge. BTW, I will try to make that clear.

I have been creating the foundation/library. Other chess GUIs/tools can build their own GUI on top of that library, say, some dialog boxes for average users to search. But I guess advanced users will prefer to use a query language. SQL itself is a query language too, and it is not easy for average users either.

asdfjkl Jan 15, 2022

Are you really sure about the binary data thing? Granted, I do not have experience with SQL (SQLite), but think about the potential size reductions. Also note what that implies for RAM usage if you want to search in-memory.

To make things more clear w.r.t. the binary encoding: Unless you just want to store moves, an encoding needs to be close to PGN, be simple to parse/export, and be able to handle variations and comments. Something like this might work:

Game = [ GameLength [FenMarker or FenMarker | FenLen | Fen] Moves ]

Moves = [   Move or 
            BeginOfVariation or 
            EndOfVariation or 
            [ StartofComment CommentLength Comment] or 
            [ AnnotationsFollow AnnotationLength Annotations ] or
            NullMove
        ]

FenMarker is just 0 if there is no fen, or 1 if a Fen follows.

Annotations would be just a sequence of bytes encoding NAGs.

Comment should be just the byte sequence of the UTF8-encoded string. Extra comments (like arrows defined with % and the like) should go into that string. It's not elegant, but very close to how PGN works.

A length value can be encoded as BER-TLV, defined in ISO 7816-4. Essentially this tries to use as few bytes as required:

if a length value is 0 up to 127, just encode it directly as one byte
if a length value is >= 128, set the first byte to 0x81, and the second byte to 0x00 to 0xFF. We can encode values 0-255 like this
if a length value is >= 256, set the first byte to 0x82, and use the second and third byte to encode the value. We encode values in the range 0 to 65535 like that
... and so on, to encode arbitrary length, even though four bytes (16777215) will be probably always enough.

As most length values will likely be < 128 (e.g. for annotations or comments), the length will usually be just one byte.
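The length rules listed above can be sketched as follows. This is a minimal illustration rather than a full BER-TLV implementation: the function names are hypothetical, lengths are limited to 32 bits, and the value bytes are written most significant byte first, as BER does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Encodes a length in BER form:
//   0..127  -> one byte holding the value itself
//   >= 128  -> first byte 0x80 | n, followed by the n value bytes (big-endian)
std::vector<uint8_t> encodeLength(uint32_t length) {
    if (length < 0x80) {
        return { static_cast<uint8_t>(length) };
    }
    std::vector<uint8_t> value;
    for (uint32_t v = length; v != 0; v >>= 8) {
        value.insert(value.begin(), static_cast<uint8_t>(v & 0xFF));
    }
    std::vector<uint8_t> out{ static_cast<uint8_t>(0x80 | value.size()) };
    out.insert(out.end(), value.begin(), value.end());
    return out;
}

// Decodes a length that starts at data[pos] and advances pos past it.
uint32_t decodeLength(const std::vector<uint8_t>& data, std::size_t& pos) {
    assert(pos < data.size());
    const uint8_t first = data[pos++];
    if (first < 0x80) {
        return first;                       // short form: the byte is the length
    }
    const int n = first & 0x7F;             // long form: number of value bytes
    assert(n >= 1 && n <= 4 && pos + n <= data.size());
    uint32_t length = 0;
    for (int i = 0; i < n; i++) {
        length = (length << 8) | data[pos++];
    }
    return length;
}
```

With this scheme a 409-byte comment would carry a three-byte length field (0x82 0x01 0x99), while anything shorter than 128 bytes needs a single byte.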

If we use two bytes to encode a move, you get some extra bits. Take the highest bit for example. Set it to 0 for moves, and 1 for tags. Then it is easy to distinguish moves from tags (BeginOfVariation, EndOfVariation,StartofComment,AnnotationsFollow,NullMove, potentially more tags).

It just remains to define byte values for the tags, and specify the move encoding.
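One possible way to fill in that remaining step is sketched below. The thread does not define the bit layout or the tag values, so everything here is an assumption made for illustration: 6 bits each for source and destination square, 3 bits for the promotion piece, the highest bit as the move/tag flag, and two made-up tag constants.

```cpp
#include <cassert>
#include <cstdint>

// Assumed bit layout for this sketch (not defined in the thread):
//   bit 15      : 0 = move, 1 = tag
//   bits 12..14 : promotion (0 = none, 1 = knight, 2 = bishop, 3 = rook, 4 = queen)
//   bits 6..11  : destination square (0..63)
//   bits 0..5   : source square (0..63)
constexpr uint16_t kTagFlag = 0x8000;

// Hypothetical tag values; the thread leaves them undefined.
constexpr uint16_t kBeginOfVariation = kTagFlag | 1;
constexpr uint16_t kEndOfVariation   = kTagFlag | 2;

struct Move {
    int from;
    int to;
    int promotion;
};

inline uint16_t encodeMove(const Move& m) {
    assert(m.from >= 0 && m.from < 64 && m.to >= 0 && m.to < 64
           && m.promotion >= 0 && m.promotion <= 4);
    return static_cast<uint16_t>(m.from | (m.to << 6) | (m.promotion << 12));
}

// True if the 16-bit word is a tag rather than a move.
inline bool isTag(uint16_t word) {
    return (word & kTagFlag) != 0;
}

inline Move decodeMove(uint16_t word) {
    assert(!isTag(word));
    return Move{ word & 0x3F, (word >> 6) & 0x3F, (word >> 12) & 0x7 };
}
```

Because the promotion field never exceeds 4, an encoded move always has bit 15 clear, so a reader can tell moves from tags by looking at that single bit.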

This might not be the most efficient encoding, but it is pretty straightforward to implement. The overhead for games with variations and comments should be pretty low. A game with no variations or comments is then just a sequence of two-byte moves, i.e. no overhead.
In any event, annotated games usually make up only a fraction of an overall database. For example Mega has 9.2 million games, but just about 100,000 games annotated.

It is possible to optimize this of course. For example one could use an (almost) one byte encoding for moves (nasty to implement, bad for interop). Then it gets tricky to store variations though... One could also put comments into a different table, and only encode references in the binary encoding, but I think that would not gain much when searching for specific comments....

As for CQL: I am not saying this approach does not work or is a bad foundation. But please observe chess players without an affinity for technology, especially the older generation, when they use chess programs... As programmers, we tend to live in a tech bubble.... Just one example: Q==3 is simple, of course, but even here, non-English speakers will use a different symbol than Q. CQL then actually needs a translation! Or they need to learn English SAN notation....

haydoooke · 2022-01-08T19:44:07Z

haydoooke
Jan 8, 2022

From a post by Fulvio on TalkChess:

Maybe an SQL-like syntax:
COUNT(WQ%) = 3
COUNT(BRe4, BRe5, BRf4, BRf5) = 2
COUNT(WPd4, WPe5, WPf4, WPg4, BKb6) = 5
COUNT(_P_4) -> count pawns (both colors) on the 4th rank

Here's the CQL syntax for comparison:
Q == 3
r[e4,e5,f4,f5] == 2
P[d4,e5,f4,g4] == 4 and kb6
[Pp]a-h4 > 4 (more than four pawns of either color on 4th rank)

0 replies

nguyenpham · 2022-01-10T10:24:23Z

nguyenpham
Jan 10, 2022
Maintainer Author

I have implemented the parser for that language (Attempt 10th). All works so well!

0 replies
