-- encoding: utf-8; indent-tabs-mode: nil --
This project is an complete exploration of the game Black Box (with 4 atoms). This is not yet another implementation of the game. When I just want to play the game, I use Emacs or Simon Tatham's Puzzles Collection. This project feeds a database which can be used to produce statistics about Black Box.
Black Box is a puzzle game in which the hider player sets up 4 or 5 atoms in a black box and in which the seeker player shoots rays from the sides of the black box to locate the hidden atoms. As all deduction games, there is a luck factor because some answers from the hider player can be more interesting than others and thus lead to a shorter solution. Or the answers are more mundane and give fewer informations to the seeker player. Another luck factor is that there are some configurations where the seeker player cannot guess all the atom positions, even if he has shot all possible rays from the outside of the box. A well-known example is this 5-atom situation:
- - - - - - - -
- - O - - O - -
- - - ? ? - - -
- - - ? ? - - -
- - O - - O - -
- - - - - - - -
- - - - - - - -
- - - - - - - -
The O-atoms are easily located. But they form a "no-ray's land" represented by question marks. The 5th atom can be in any question mark position, where no ray can interact with it. The seeker has to choose one of these 4 positions, with a 1-in-4 chance of finding the good position and a 3-in-4 chance of getting a 5-point penalty.
There are other ambiguous 5-atom situations, a bit more convoluted, like this one from a forum discussion on Board Game Geek. The general opinion is that ambiguous situations exist in the 5-atom variant, but not in the 4-atom variant. This opinion is wrong. Ambiguous 4-atom positions exists, just read my answer to the forum discussion.
For the moment (before writing the programmes), in the 4-atom game, I had found only situations where the ambiguous situations are grouped by two. The purpose of this project is to determine whether situations exist in groups of 3 or 4. This project examines all the ways 4 atoms can be arrayed inside the box, computes the results for all possible 32 rays and, thus, determines which atom configurations leads to the same ray results.
Post-scriptum. While writing the exploration program and before running it on a configuration with 4 atoms on a 8×8 square, or even on any configuration with a square bigger than 4×4, I have found by pure thought this game:
6 7 5 H 4 H 3 H
1 - - - - - - - - 1
2 - - - - - - - - 2
3 - - - - - - - - R
H - - - - - - - - H
4 - - - - - - - - R
H - - - - - - - - H
5 - - - - - - - - R
H - - - - - - - - H
6 7 R H R H R H
There are 4 balls in the box
(H = absorbed, R = reflected) Try to decode it...
Here are 4 games I played within Emacs.
3
1 - - - - - - - - 1
R - - - - - - - -
H O - - O - - - -
- - - - - - - -
3 - - - - - - - -
- - - - - - O -
R - - - - - - - - 2
H O - - - - - - -
4 4 2
There are 4 balls in the box
6 2 1
H O - - - - - - -
- - - - - - - - 4
3 - - - - - - - - 3
- - - - - - - - 6
- - - O - - - -
4 - - - - - - - - 5
- - - - - - - - 5
H O - - O - - - - H
H H R 2 1
There are 4 balls in the box
4 3 2
H - - - - - - O -
5 - - - - - - - - 6
- - - - - - - - 4
O - - - - - - -
- - - - - - - - 5
- - - - - - - - 6
O - - - - - O -
R - - - - - - - - 1
3 2 1
There are 4 balls in the box
1 7 5 4 6 7
1 - - - - - - - -
- O - - - - - O
- - - - - - - -
H - O - - - - - -
3 - - - - - - - -
R - - - - - - - -
H - - - - - - - O
2 - - - - - - - -
3 5 4 6 2
There are 4 balls in the box
As you can see, in each game a ray has been deflected several times before exiting the board (in the first 3 games) or before being reflected (in the 4th game).
The secondary purpose is to gather statistics on the rays: path length, number of deflections, to know if other convoluted configurations exist.
We can easily find that a ray can be deflected up to 6 times and cross up to 23 squares, as in the following situation:
1
- + - - - - - O
- + + + + + + -
O - - - - - + -
- + + + + + + -
- + - - - - - O
- + + + + + + -
O - - - - - + -
- - - - - - + -
1
There are 4 balls in the box
Yet, the exploration programme will compute statistics for each situation, to confirm or to refine this statement.
Up to now, my mark-up language of choice was Perl's POD. I have decided to try another possibility. So this text is written in Markdown.
Supposing everything is installed (Raku, modules, database), here is how you explore all possible games with 4 atoms within a 8×8 box.
First, edit and update the content of the lib/db-conf-sql.rakumod
file.
This file holds the fully qualified pathname of the database in
my computer. The directory structure of your computer is most
certainly a bit different. So you have to replace the fully qualified
pathname of the database by another qualified pathname compatible
with your computer.
Then please run:
raku init-conf-sql.raku --config=A4_B8
raku explore-conf-sql.raku --config=A4_B8
For the last command, do not redirect the standard output. There will be about 100 lines which will keep you a bit entertained while waiting for the programme to finish.
Go on with
raku init-spectrum-sql.raku --config=A4_B8
raku upd-spectrum-sql.raku --config=A4_B8
Now the database is ready and you can query it with sqlite3
.
If you want to display the records in a user-friendly format,
run commands like these:
raku display-sql.raku --config=A4_B8 5 6 7 8 9 10
raku display-sql.raku --config=A4_B8 --from=5 --to=10
Please type
./add-index-mongo.sh
raku init-conf-mongo.raku --config=A4_B8
raku explore-conf-mongo.raku --config=A4_B8
For the last command, do not redirect the standard output. There will be about 100 lines which will keep you a bit entertained while waiting for the programme to finish.
Go on with
raku init-spectrum-mongo.raku --config=A4_B8
raku upd-spectrum-mongo.raku --config=A4_B8
Now the database is ready and you can query it with the mongo
shell.
If you want to display the records in a user-friendly format,
run commands like these:
raku display-mongo.raku --config=A4_B8 5 6 7 8 9 10
raku display-mongo.raku --config=A4_B8 --from=5 --to=10
The contents of Black Box consists of several atoms in a precise position. So we will call each configuration a molecule.
Likewise, all 32 rays are grouped together in what is called a spectrum.
The metaphor has its limits. In the real world, a spectrum is a bunch of rays over a wide range of wavelengths. Here, a spectrum is a bunch of rays over a wide range of geometric positions.
With these definitions, the main purpose of this project is to find different molecules with the same spectrum.
In our 3-D real world, for a given molecule, its enantiomer is its mirror image, provided the first molecule is asymmetrical. In the 2-D abstract Black Box world, I will use the word enantiomer to designate a molecule obtained from the first one by a rotation around the square's center or a symmetry with respect to one of the square's diagonals or medians.
For example, for the following molecule:
- O O - - - - -
- O - - - - - -
O - - - - - - -
- - - - - - - -
- - - - - - - -
- - - - - - - -
- - - - - - - -
- - - - - - - -
the 7 enantiomers are:
- - - - - O - - - - - - - - - - - - - - - - - -
- - - - - - O O - - - - - - - - - - - - - - - -
- - - - - - - O - - - - - - - - - - - - - - - -
- - - - - - - - - - - - - - - - - - - - - - - -
- - - - - - - - - - - - - - - - - - - - - - - -
- - - - - - - - - - - - - - - O O - - - - - - -
- - - - - - - - - - - - - - O - O O - - - - - -
- - - - - - - - - - - - - O O - - - O - - - - -
- - O - - - - - - - - - - O O - - - - - - - - - - - - - - - - -
O O - - - - - - - - - - - - O - - - - - - - - - - - - - - - - -
O - - - - - - - - - - - - - - O - - - - - - - - - - - - - - - -
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
- - - - - - - - - - - - - - - - - - - - - - - O O - - - - - - -
- - - - - - - - - - - - - - - - - - - - - - O O - O - - - - - -
- - - - - - - - - - - - - - - - - - - - - O - - - O O - - - - -
On the other hand, the following molecules are not enantiomers:
- - - O O - - - - O O - - - - -
- - - O - - - - - - O - - - - -
- - O - - - - - - - - O - - - -
- - - - - - - - - - - - - - - -
- - - - - - - - - - - - - - - -
- - - - - - - - - - - - - - - -
- - - - - - - - - - - - - - - -
- - - - - - - - - - - - - - - -
because we discard translations and we discard rotations and symmetries which do not apply to the 8×8 square.
In an enantiomer group, I define the "canonical molecule" as the molecule with the smallest number when stored in the database, that is, the first molecule of the group that is processed by the exploration programme. This phrase has no equivalent in 3-D chemicals, but I need to give a name to this concept. So it will be "canonical molecule".
The programmes will explore the Black Box games in a 8×8 square with 4 atoms. This gives 635376 molecules. Yet, the number of atoms and the size of the square are parameters, not hard-coded values. So the same programmes will be able to explore Black Box games with 2 atoms in a 4×4 square. There are only 120 molecules, so I will be able to thoroughly check the programmes and their results.
In addition, some interesting results can appear when exploring molecules with 4 atoms in a 6×6 square, or even in a 5×5 square.
A configuration will be identified by the An_Bp code, where n is the number of atoms and p is the size of the box. So the "normal" configuration will have the key "A4_B8".
The project includes several programmes.
-
The first programme is initialisation, which cleans up the contents of the database for a given configuration.
-
The exploration programme, which generates all possible molecules for a given configuration, computes the spectrum for each molecule and store them into the database.
-
A programme summarising the findings of the exploration programme, extracting the spectrums which appear in two molecules or more.
-
A programme updating the spectrums created by the previous programme, filling the attributes that could not be computed before.
-
A programme extracting some molecules from the database and displaying in a human-friendly fashion.
About the exploration programme: the programme will not process all 635376 molecules in one go. It will be able to stop at any moment and restart later from the point where it stopped.
The 8×8 square with the atoms is implemented as a 2-D array, with a line coordinate and a column coordinate. Lines and columns are numbered 1-to-8. Why not start at 0? Because line 0 and column 0 are dedicated to the ray entry points and exit points, as well as the additional line 9 and column 9. An atom is represented by a "heavy" character, such as "O", "X" or "*", while an empty box is represented by a "light" character, such as a space, a dot or a hyphen. Actually, I have used the Emacs notation, with a "O" and an hyphen.
For the peripheral area (lines and columns 0 and 9), I will not use Emacs' notation. Firstly, the rays which come out will be shown as lower-caps letters, not 1-to-9 digits. A competitive player seldom shoots all possible rays, so 1-to-9 is sufficient for nearly all competitive games. But here, the exploration is based on shooting all possible rays. The digits 1 to 9, or even 0 to 9, are not sufficient. Here is an example (without "0"):
8 2 1 H
1 - - - - - - - -
2 - - - O - - - -
H - - O - O - - - H
3 - - - O - - - -
4 - - - - - - - - 9
5 - - - - - - - - 5
6 - - - - - - - - 6
7 - - - - - - - - 7
8 3 4 H 9
There are 4 balls in the box
The game is rewritten as:
h b a H n m k l
a - - - - - - - - n
b - - - O - - - - m
H - - O - O - - - H
c - - - O - - - - j
d - - - - - - - - i
e - - - - - - - - e
f - - - - - - - - f
g - - - - - - - - g
h c d H i j k l
There are 4 balls in the box
As you can see, the upper-case "H" (for a "hit" ray) is not standing out of the lower-case letters for coming-out rays. Same thing, for other spectrums, for the "R" letter representing a reflected ray. I will use instead "@" for hitting rays and "&" for reflected rays.
h b a @ n m k l
a - - - - - - - - n
b - - - O - - - - m
@ - - O - O - - - @
c - - - O - - - - j
d - - - - - - - - i
e - - - - - - - - e
f - - - - - - - - f
g - - - - - - - - g
h c d @ i j k l
There are 4 balls in the box
When storing the data in the database and for some processes, the 8×8 square will be "linearised" as a 64-char line. Line 1 from the original square will be stored in chars 0-to-7 of the 64-char line, line 2 will be stored in chars 8-to-15 and so on, until line 8 which will be stored in chars 56-to-63.
In the same manner, the peripheral area (lines and columns 0 and 9) showing the in- and out-positions of rays, are stored linearly as a 32-char string. The traditional Black-Box game uses this 1-to-32 numbering scheme:
3 3 3 2 2 2 2 2
2 1 0 9 8 7 6 5
1 - - - - - - - - 24
2 - - - O - - - - 23
3 - - O - O - - - 22
4 - - - O - - - - 21
5 - - - - - - - - 20
6 - - - - - - - - 19
7 - - - - - - - - 18
8 - - - - - - - - 17
9 1 1 1 1 1 1 1
0 1 2 3 4 5 6
The 32-char string will store the peripheral positions in the
same order. There is just a 1-bias for indexing, because the
substr
function begins at 0.
Here is a simplified example in the A2_B4 configuration, that is with 2 atoms in a 4×4 box:
@ & @ &
@ O - O - @
& - - - - c
a - - - - a
b - - - - b
@ & @ c
There are 2 balls in the box
The linearised version will give the following two strings:
O-O-------------
@&ab@&@cbac@&@&@
In this example, we see:
-
3 rays which are absorbed as soon as they enter the box
-
1 ray absorbed after crossing an empty box
-
2 rays absorbed after crossing 3 boxes each
-
3 rays which are reflected as soon as they enter the box
-
1 ray reflected after crossing 3 boxes
-
2 rays which exit the box after crossing 4 boxes and without being deflected
-
1 ray which exits the box after crossing 3 boxes and being deflected once.
For reasons that will be explained later, this example is the 2nd tested configuration in the A2_B4 configuration. Using JSON syntax, the database record will look like:
{ config: "A2_B4",
number: 2,
molecule: 'O-O-------------',
spectrum: '@&ab@&@cbac@&@&@',
absorbed-number: 6,
absorbed-max-length: 3,
absorbed-max-turns: 0,
absorbed-tot-length: 7,
absorbed-tot-turns: 0,
reflected-number: 4,
reflected-edge: 3,
reflected-deep: 1,
reflected-max-length: 3,
reflected-max-turns: 0,
reflected-tot-length: 3,
reflected-tot-turns: 0,
out-number: 3,
out-max-length: 4,
out-max-turns: 1,
out-tot-length: 11,
out-tot-turns: 1
}
When I write that the programme will shoot all 32 possible rays, this is not quite true. The rays follow the "invariant return path of light". If the ray shot from position 1 comes out in position 24, then the ray shot from position 24 will come out in position 1. For configuration A4_B8, you need to shoot only 18 to 28 rays. See the examples below: either 14 coming-out rays and 4 absorbed rays, or 4 coming-out rays, 16 absorbed rays and 8 reflected rays. These examples show a situation with the maximum number of out-coming rays and a situation with the minimum number.
h b a @ n m k l & @ & c d @ @ @
a - - - - - - - - n @ - O - - - - - - &
b - - - O - - - - m @ - - - - - - - O @
@ - - O - O - - - @ @ - - - - - - - - &
c - - - O - - - - j a - - - - - - - - a
d - - - - - - - - i b - - - - - - - - b
e - - - - - - - - e & - - - - - - - - @
f - - - - - - - - f @ O - - - - - - - @
g - - - - - - - - g & - - - - - - O - @
h c d @ i j k l @ @ @ c d & @ &
There are 4 balls in the box
Using the symmetries and rotations of the 8×8 square, we see that each molecule belongs to a group of usually 8 enantiomers. Sometimes, the group contains only 4 enantiomers, or just 2. And very rarely, the molecule is alone in its group (and so does not deserve the name "enantiomer").
Anyhow, once we have achieved the lengthy calculation of the spectrum of a molecule, it is rather easy and fast to compute the spectrum of its 7 enantiomers (or 3 enantiomers, or its only 1 enantiomer). This allows us to write 8 records (or 4, or 2) at once in the database instead of just one.
Let us use the previous example from the A2_B4 configuration. So we computed:
@ & @ &
@ O - O - @
& - - - - c
a - - - - a
b - - - - b
@ & @ c
There are 2 balls in the box
After a 180-degree rotation, we obtain:
c @ & @
b - - - - b
a - - - - a
c - - - - &
@ - O - O @
& @ & @
There are 2 balls in the box
In a second step, we rename the out-coming rays, so when examined in the usual order, they are alphabetically sorted. In this case, the renaming is this:
a → b
b → a
c → c
c @ & @
a - - - - a
b - - - - b
c - - - - &
@ - O - O @
& @ & @
There are 2 balls in the box
Why this renaming? Because the letters "a" to "c" are just meaningless labels, whose only purpose is to distinguish each out-coming ray from the others. The three cases below are basically the same spectrum:
c @ & @ c @ & @ b @ & @
b - - - - b a - - - - a c - - - - c
a - - - - a b - - - - b a - - - - a
c - - - - & c - - - - & b - - - - &
@ - O - O @ @ - O - O @ @ - O - O @
& @ & @ & @ & @ & @ & @
There are 2 balls in the box
These are in essence the same spectrum. But the programmes will see three different linearised strings:
bac@&@&@@&ab@&@c
abc@&@&@@&ba@&@c
cab@&@&@@&ac@&@b
This is why we rename the out-coming rays, so they appear in alphabetical order along the peripheral area of the blackbox.
Instead of storing just one record into the database, we will store simultaneously 8 of them. To tell a long story short, here are two of them. In addition, I have stripped the statistics.
{ config: "A2_B4",
number: 2,
canonical-number: 2,
molecule: 'O-O-------------',
spectrum: '@&ab@&@cbac@&@&@',
transform: 'id',
[...]
}
{ config: "A2_B4",
number: 0,
canonical-number: 2,
molecule: '-------------O-O',
spectrum: 'abc@&@&@@&ba@&@c',
transform: 'rot180',
[...]
}
The canonical-number
property is a link to the
first molecule of the enantiomers group. The transform
property
is a reminder of which rotation or which symmetry gave the
current molecule, when applied to the canonical enantiomer. The number
property is initialised with value zero, but is filled
later when the exhaustive search comes upon the molecule.
At this time, the record will be updated to:
{ config: "A2_B4",
number: 119,
canonical-number: 2,
molecule: '-------------O-O',
spectrum: 'abc@&@&@@&ba@&@c',
transform: 'rot180',
[...]
}
In the exploration programme, the main loop extracts all 635376
molecules. So why bother about rotations and symmetries?
Actually, on each iteration, the programme begins by checking
whether the molecule is already in the database. If so, it just
fills the number
property with the right value, updates the record
in the database and skips to the next iteration. If not, it computes
the full spectrum of the current molecule, looks for enantiomers,
computes the spectrum of each enantiomer and stores all that into
the database.
How do we find exhaustively the molecules?
I will illustrate this with a A6_B4 configuration and by naming the atoms "A" to "F", instead of the anonymous "O". We begin with a molecule where all the atoms are packed on the left (using the linearised string instead of the square array):
1 ABCDEF----------
For the next 10 iterations, the rightmost atom, or atom F, shifts to the right step by step:
2 ABCDE-F---------
3 ABCDE--F--------
4 ABCDE---F-------
(...)
10 ABCDE---------F-
11 ABCDE----------F
At iteration 12, atom F can no longer move. So atom E shifts one step to the right and atom F comes back all the way to atom E.
12 ABCD-EF---------
The atom F again walks to the right, this time for only 9 iterations.
13 ABCD-E-F--------
(...)
21 ABCD-E---------F
Again, atom E moves one step to the right and atom F leaps all the way leftward to atom E.
22 ABCD--EF--------
Then 8 steps rightward
23 ABCD--E-F-------
(...)
30 ABCD--E--------F
At some moment, both atoms E and F are stuck to the right.
66 ABCD----------EF
When this happens, this is atom D's turn to move one step rightward and both atoms E and F leap leftward to atom D.
67 ABC-DEF---------
It would appear that this could be implemented as embedded loops, the outer one on atom A's position, the next one on atom B's position and so on until the inner loop on atom F's position. This does not scale up. You must allow for configurations with different numbers of atoms. And you can end with code eligible for the Daily WTF website.
But the description above shows rather a kind of transformation algorithm, which takes one molecule and finds the next one. The algorithm is as follows:
-
Look for the rightmost
O-
substring. -
If not found, rejoice! we have the very last molecule! the exhaustive search is over!
-
Split the molecule around the found
O-
substring. -
Replace the
O-
substring with-O
. -
On the right of this substring, shift all
O
's to the left.
Example with -O--O-O------OOO
111111
0123456789012345
-O--O-O------OOO
The O-
substrings are at 1, 4 et 6. The interesting one is the substring
at 6. The sting is split thus:
111111
012345 // 67 // 89012345
-O--O- // O- // -----OOO
The substrings are modified in this way:
111111
012345 // 67 // 89012345
-O--O- // -O // OOO-----
And the pieces are gathered together:
111111
0123456789012345
-O--O--OOOO-----
A counterintuitive step is how the 3 atoms are shifted to the left.
We do this with a flip
. When using different names for the atoms,
the flip
converts -----DEF
to FED-----
while we would expect
a transformation to DEF-----
. But since the programmes will only use anonymous O
's,
while the A
to F
are used only in this descriptive text,
the flip
function is the function to use.
When the Molecules table is fully populated (for a given configuration), we extract from this table all the spectrums that appear in two or more records and we store them in a new table, Spectrums.
Example with a A4_B6 configuration. We have these two molecules, with the same spectrum:
@ & @ & c d @ & @ & c d
@ O - O - - - @ @ - - O - - - @
& - - - - - - & & - - - - - - &
@ - - O - - - @ @ O - O - - - @
& - - - - - - b & - - - - - - b
@ O - - - - - @ @ O - - - - - @
& - - - - - - a & - - - - - - a
@ a @ b c d @ a @ b c d
There are 4 balls in the box
The Molecules table will contain:
{ config: "A4_B6",
number: 868,
canonical-number: 868,
molecule: 'O-O-----------O---------O-----------',
spectrum: '@&@&@&@a@bcda@b@&@dc&@&@',
transform: 'id',
[...]
}
{ config: "A4_B6",
number: 15993,
canonical-number: 7361,
molecule: '--O---------O-O---------O-----------',
spectrum: '@&@&@&@a@bcda@b@&@dc&@&@',
transform: 'id',
[...]
}
and the Spectrums table will contain:
{ config: "A4_B6",
spectrum: '@&@&@&@a@bcda@b@&@dc&@&@',
nb-mol: 2,
transform: 'id',
canonical-number: 868
}
Which means that for this spectrum, there are two molecules.
Why the transform
and canonical-number
attributes? When two asymmetrical molecules
form a spectral group, the same happens with their respective enantiomers.
So we have 8 spectrums which are rotations / symmetries of each other.
We may want to study one of them, but then we would be bored studying
the 7 other groups. By tagging one group with transform: 'id',
and the others with transform: 'rot180',
transform: 'sym-h',
and the like, we know which spectrums we need to study.
So with:
@ & @ & c d @ & @ & c d c d b @ a @ c d b @ a @
@ O - O - - - @ @ - - O - - - @ a - - - - - - & a - - - - - - 1
& - - - - - - & & - - - - - - & @ - - - - - O @ @ - - - - - O 2
@ - - O - - - @ @ O - O - - - @ b - - - - - - & b - - - - - - &
& - - - - - - b & - - - - - - b @ - - - O - O @ @ - - - O - - @
@ O - - - - - @ @ O - - - - - @ & - - - - - - & & - - - - - - &
& - - - - - - a & - - - - - - a @ - - - O - - @ @ - - - O - O @
@ a @ b c d @ a @ b c d c d & @ & @ c d & @ & @
There are 4 balls in the box
we will have these two records in the Spectrums table:
{ config: "A4_B6",
spectrum: '@&@&@&@a@bcda@b@&@dc&@&@',
nb-mol: 2,
transform: 'id',
canonical-number: 868
}
{ config: "A4_B6",
spectrum: 'a@b@&@cd&@&@@&@&@&@a@bdc',
nb-mol: 2,
transform: 'rot180',
canonical-number: 868
}
How is the Spectrums table populated? First, we insert all spectrum records
without bothering with the transform
column, which will be initialised with
a dummy value. This will be acheived by harnessing the power of the group by
SQL clause or of the MongoDB function aggregate
.
Then, we fix the transform
and canonical-number
attributes. For this, the programme scans all
canonical molecules (transform: 'id'
) whose spectrum exists in the
Spectrums table. Then the programme reads this spectrum record. If
the spectrum record is not already updated, the programmes reads the
whole enantiomer group of the current molecule, and the corresponding
records from the Spectrums table, copies the transform
and canonical-number
values from the
molecule to the spectrum and updates the spectrum record.
Which kind of database? SQL or MongoDB? I began this work as a MongoDB
project. Then, because of two problems with the existing MongoDB
module, I switched to SQLite, while ensuring the MongoDB code would
still work. And after the SQLite project was fully functional
(including the update of the Spectrums
table), I finally found a way
to bypass the MongoDB module limitations and I completed the MongoDB
version.
The first problem is that the Raku module for MongoDB does not
implement the full set of functionalities available for MongoDB.
For example, I do not know how to code in Raku a ≥
selection:
{ key: { '$gte': min-value }}
I can only code =
selections. Likewise, I do not know how to ask
MongoDB to sort the retrieved documents. If I need to sort the
documents, I must do it within the Raku programme. These problems did
not hinder me much when writing the exploration programme. On the
other hand, it was impossible to write a pure Raku programme filling
the Spectrums
collection.
The other problem is performances. Here are the times for the exploration programme using a MongoDB backend, with various Black Box configurations:
nb_mol real user
A4_B4 1820 2 min 2.201 s 2 min 46.945 s
A4_B5 12650 16 min 45.232 s 19 min 29.010 s
A4_B6 58905 132 min 22.944 s 94 min 55.280 s
With a SQLite backend and before implementing begin
/ commit
, here
are a few timing results:
nb_mol real user
A4_B4 1820 0 min 47.436 s 0 min 21.424 s
A4_B5 12650 6 min 15.180 s 2 min 50.238 s
The results with a commit
every 50 updates:
nb_mol real user
A4_B4 1820 0 min 15.031 s 0 min 15.463 s
A4_B5 12650 2 min 16.277 s 2 min 0.199 s
The results with a commit
every 500 updates:
nb_mol real user
A4_B4 1820 0 min 19.017 s 0 min 18.053 s
A4_B5 12650 2 min 4.514 s 1 min 56.635 s
A4_B6 58905 26 min 39.339 s 21 min 40.528 s
This is not my first Raku project, so I did not need to install Raku. If you need to install it, maybe you should take a look at the documentation of this previous project, especially the chapter about compiling Raku and the chapter about using Raku.
I had a few problems when developing the SQLite version.
First, kebab case. In Raku, kebab case is usually preferred
to snake case and to camel case, and it is compatible with
JSON/BSON and MongoDB. But you cannot use kebab case for SQLite
column names. So there was no immediate link between the
canonical_number
column name and the canonical-number
hash
key. Of course, you can use the translitteration of underscores
to dashes, but at first I did not think of that when I wrote the
first version of the select
statements.
The problem did not appear at first, because it was hidden behind
another problem the syntax of insert
statements. The first version
of the main insert
statement looked like:
$dbh.execute(q:to/SQL/
insert into Molecules
( config
, number
, canonical_number
[...]
, dh2)
values (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
SQL
, $molecule<config >
, $molecule<number >
, $molecule<canonical-number >
[...]
, $molecule<dh2 >
);
with all 23 column names, with 23 question marks for bind values and the 23 Raku values used for the binding. You may notice that underscores in the column names are replaced by dashes in the Raku values, among other edits. The problem is that you have to specify all 23 column names and all 23 Raku values, without the convenient way of using a hash variable. Maybe there is a trick in the DBIish module which I have not seen and which would allow me to use a hash table.
In the second version of the insert
statement, I generated the list of
database columns and the list of question marks and I interpolated them
into the SQL statement. This is not a
Bobby Tables
issue.
The interpolated strings are fully controlled by the programme maintainer.
They do not come from user input or from an external source.
And this helps with the DRY principle (Don't Repeat Yourself).
The last problem is a bit ironical. There is a missing feature in MongoDB
which is implemented in SQL: consistent updates, aka database transactions. This is done with begin transaction
and commit transaction
in SQLite or similar statements in
other SQL dialects. So I coded a way to bypass this problem, which would
be necessary in MongoDB and superfluous in SQLite. On one hand, this
bypass measure was unnecessary in MongoDB, on the other hand it was buggy
with SQLite.
When I run the exploration program a second or a third time, why do I need
to delete and recreate the last enantiomer group created in the previous run?
Here is my reasoning before writing the programmes.
It may have happened, in the previous run, that Ctrl-C
took effect
when some molecules documents of the group, but not all, have been inserted
into the database. Yet, it is necessary that each enantiomer group is stored
in the database as a whole. This prerequisite is easy to implement in SQLite,
with begin transaction
/ commit transaction
, but there is no consistency
mechanism in MongoDB. So at the start of the second or subsequent run, I extract
the maximum value of number
from the database. If this is the number of a
canonical molecule, the program deletes the whole enantiomer group (sometimes
a full group, some times an incomplete group) and rebuilds it in full.
On one hand, this precaution was not required for MongoDB. In MongoDB,
I execute a bulk insert, by giving an array of 8 documents to the insert
statement. During my tests, I have never seen a partial enantiomer group
in which the programme would have had time to insert, for example, only
two documents before being interrupted by Ctrl-C
. Each time, the
last created enantiomer group was a complete one.
On the other hand, the first SQLite version did not include
begin transaction
/ commit transaction
statements. And the programme
would run insert
statements on single molecules, I did not know that SQLite could do
bulk insert
.
There was one test in which the group identified by canonical-number = 2
was fully created and the group identified by canonical-number = 3
was incomplete, with two molecules in the database and the six other dropped.
Especially, the canonical molecule identified by number = 3
was missing.
When I relaunched the exploration programme, the search for the maximum value
of number
gave 2, not 3. So the enantiomer group identified by canonical-number = 2
was deleted and created again, then the enantiomer group identified by canonical-number = 3
was created, without considering the two already existing molecules.
Therefore, an inconsistent enantiomer group with 10 molecules, including 2 duplicates.
How can I fix it? There are at least three ways:
-
Retrieve not only the maximum value of
number
, but also the maximum value ofcanonical-number
. In the example above, the program would have obtained a 2 and a 3 and it would have removed the enantiomer group identified bycanonical-number = 3
while leaving alone the enantiomer group identified bycanonical-number = 2
. -
Ensure that the canonical molecule of the enantiomer group is the first one to be inserted in the database. In this case, the extraction of the maximum value of
number
would have given a 3 and the deleted group would have been the one identified bycanonical-number = 3
. -
Add
begin transaction
/commit transaction
to the exploration programme and ensure there will be nocommit transaction
before an enantiomer group is complete.
Since I already had the intention to add begin transaction
/ commit transaction
,
I did not tried the other fixes.
As stated above, this is not my first Raku+MongoDB project, so the software was already installed. If you need to install them, please see the documentation of said project. Be sure to read the paragraphs on prerequisite installation (do not bother with Bailador and Templace::Anti), on installing the MongoDB software and installing the MongoDB Raku module.
I have explained how the features missing from the MongoDB module
made me develop the SQLite version. Yet, I still tried to find a solution
to nevertheless use these advanced MongoDB features. If the aggregate
function
is available in the Mongo shell, but not in the MongoDB
Raku module, I build
a Mongo shell (or JavaScript) source which extracts the data and aggregates them,
I run this Mongo shell programme as an external process, I gather the standard output
and I store it into a Raku variable.
You should notice that, when building the Mongo shell source, I need to mention
the An_Bp
value of the configuration in this source. Unlike what I explained in
the SQLite chapter, this is really a
"Bobby Tables"
problem,
because the An_Bp
value is specified by the user.
So the programme filters, examines, checks, sanitises and validates the value
with a very restrictive regex: an anchor at both ends, to ensure the whole value
is checked, only 3 literal chars A_B
and the digits 0
to 9
, with a limited
number of occurrences (no more than 99 atoms and the box not larger than 9×9).
Once, when running the programme building the Spectrums
collection, I encountered
an error, because of insufficient memory to process and aggregate all the
Molecules
documents. As explained by the error message, overcoming this limit
is rather simple, you just have to add a proper index to the Molecules
collection.
Therefore the add-index-mongo.sh
script.
The programmes are split in a fashion similar to the MVC principle (Model, View, Controller). For example, to display a molecule or a list of molecules, you have:
-
The
display-sql.raku
programme, which contains the model (invoking theDBIish
Raku module and the SQLite database access functions) and the parameter parsing (in theMAIN
multi-sub). -
The
display-mongo.raku
programme, which contains too the model (invoking the variousMongoDB::xx
Raku modules and the MongoDB database access functions) and the parameter parsing (in theMAIN
multi-sub). -
The
lib/display-common.rakumod
module, which contains the controller (the internal logic of the process) and the view (to format and display the data).
When initialising the Spectrums
table / collection, the process logic is very
basic: first deleting any existing records / documents, then creating all
the records / documents in a single database access. So, there is no module
for this step, only the two programmes.
When a programme calls a function defined in the module, this is very basic.
The necessary module functions are flagged with is export
, calling these
functions is straightforward. When the module calls a function defined in the
programme, it is a bit more difficult. The main programme has initialised a
dispatch table and this dispatch table is transmitted as a parameter to each
function in the module which needs it.
I write this conclusion after running my programmes on the A4_B6 configuration. I have not yet run the programmes on the A4_B8 standard configuration, but the findings on the A4_B6 configuration are interesting and in most cases can be extended to the A4_B8 configuration.
Before starting to develop this project, I knew of two ambiguous spectrums (actually 16 spectrums, because of symmetries and rotations). Each one would lead to two molecules. So I was expecting to find several other spectrums with 2 molecules, a few ones with 3 molecules and possibly, but hardly probably, one spectrum with 4 molecules (actually 8 spectrums because of symmetries and rotations).
Here is what I found with configuration A4_B6:
number of molecules spectrums spectrums molecules
for each spectrum w/o rot / sym with rot / sym with rot / sym
2 89 696 1392
3 6 36 108
4 4 24 96
5 1 8 40
total 100 764 1636
I checked the spectral groups with 4 or 5 molecules and I found they can easily be extrapolated to the A4_B8 configuration, by inserting two empty lines and two empty columns at the right places. I think this is the case also for most spectral groups with 2 or 3 molecules. It may also happen that the A4_B8 will give a few more ambiguous cases, with no corresponding case in the A4_B6 configuration, but in my opinion, this will be very rare.
Anyhow, we can suppose that the A4_B8 configuration will provide at least 750 spectrums and 1600 molecules leading to ambiguity. I did not expect such a big number.
Also, up to now, I had only found spectral groups where the molecules would differ by only one atom, that is, which would give a 5-point penalty to the player. With the programmes, I have found a few spectral groups where two molecules differ by two atoms, for a 10-point penalty.
About statistics on the lengths of paths and the number of turns, we cannot extrapolate the A4_B6 findings to the A4_B8 configuration. But this is a secondary topic. The main purpose of the project is reached with the A4_B6 configuration.
This text is licensed under the terms of Creative Commons, with attribution and share-alike (CC-BY-SA).