A wordsquare generator in Python: from a given dictionary, the program will recursively search for all possible squares in which all columns and all rows (and diagonals, depending on the version) are words.
The program runs on Python 3 (developed on 3.6.6).
Install the DAWG package either through pip or conda:
$ pip install DAWG
$ conda install -c conda-forge dawg
Source dictionaries have to be placed in the
Run like so:
$ python wordsquares.py [Source Dict] [Square size] [Number of Processors]
The results will be generated in the data/squares folder. The program creates as many temporary files as there are processors involved; in the case of a keyboard interruption, those temporary files will not be erased.
In its current state, despite major speed improvements, generation can take a long time! Only the smallest size, 3, yields near-immediate results. The vast numbers of 4- and 5-squares (hundreds of thousands with the diagonal constraints, and an estimated dozens to hundreds of millions without) mean a computation time, on my i5 quad-core laptop, of 12-48h (or more, for very large dictionaries).
A Processing sketch called 'squareDisplay' is available. Selected wordsquares should be copied into the file squareDisplay/data/wordsquares, one per line; that is also the folder in which the results will be generated. Once the sketch is running, it is possible to scroll through the available squares with the left and right arrows, and to save a square in regular and large formats (small changes in the sketch, where lines are commented in the saveSquare() function, can make it produce a PDF instead). Here are some results:
More in the squareDisplay/data folder and on Recursus.
The very first steps of this project go back to December 2017, when I implemented a first, non-recursive program, now available on the 'ur' branch (cf. below).
I then proceeded to implement recursion and other technical improvements on this project in openFrameworks. The full power of recursion was not harnessed at this point, and I did not generate any full database for more than 3-squares. I went on to develop this other project, which attempts to find aesthetically salient squares in the database using machine learning and visualisations.
The 'master' branch builds squares letter by letter, horizontally first (go to row 1, check each letter from left to right, move to row 2, etc.). Given the plethora of results for 4, 5, etc., letters, I decided to implement a constrained version that also checks for diagonals: only squares that contain words in the two diagonals will be selected.
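As a minimal illustration of that diagonal constraint, a finished square can be validated like so (a hypothetical helper for illustration, not the program's actual code):

```python
def is_valid_square(square, words):
    """Check that every row, every column, and both diagonals are words.

    square: list of equal-length strings (the rows); words: set of valid words.
    """
    n = len(square)
    # Columns, main diagonal, and anti-diagonal, read out of the rows.
    cols = ["".join(square[r][c] for r in range(n)) for c in range(n)]
    diag = "".join(square[i][i] for i in range(n))
    anti = "".join(square[i][n - 1 - i] for i in range(n))
    return all(w in words for w in square + cols + [diag, anti])
```

Dropping the `diag`/`anti` checks gives the selection rule of the unconstrained version.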
(For a while I had a 'ggl-20k' branch, working in the same way as 'master' but using a different dictionary: the 20k most frequent words in the Google corpus, found in this repo. I later implemented the dictionary choice in the arguments given to the program, and thus imported the results from this branch into 'master' before deleting it. In this project as well as in others, the quality of my input corpus is becoming a more central concern.)
The 'word-based' branch ports the paradigm of the former (openFrameworks) version, which was based on words rather than letters (for each possible word in column 0, row 0 being the given word, search for possible words in row 1; then for each of these move to column 1, and repeat the process).
The 'unconstrained' branch produces squares letter by letter but without the constraint of having words in the diagonals.
'Multicore' branch: I experimented with a multiprocessing/multicore implementation, which would allow large databases to be crunched quite fast. That branch is currently optimized for heavy crunching (in the hope that I could run the program on some cluster some day). It searches for large squares (6-21 letters) one after the other, each time spawning as many subprocesses as requested (i.e. as there are cores available). This branch, yet to be run to completion, would calculate squares for the largest list of words I found, found here, containing 479k words. I made an attempt to run it on a Google Cloud engine with 8 cores, searching for 6-squares. It ran for 5 days or so, without any of the 'workers' (cores) signalling that they had even reached the 10th word in their list... In this case, the program is invoked with the number of cores added, e.g.:
$ python wordsquares.py [Source Dict] [Square size] 4
The very first, primitive version of this project can be found on the branch 'ur'. It is written in Processing, and works manually: the user can input a first word (first row), then chooses words for the outer columns, and then is presented with the words available for the remaining rows. Only the outer columns are taken into account.
The program works in three steps: first, split the dictionary into lists of words of the same length, and select the appropriate one; then go through all the words in that list (or distribute parts of the list to separate cores/workers, which do this in parallel); finally, for each word, taken as the first word (first row), prepare the data (reduce the superset of possible words for the square) and build the squares recursively (once a square is found, it is written to a file on a new line):
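The first step might be sketched as follows (illustrative names, not the program's actual code):

```python
from collections import defaultdict

def split_by_length(words):
    """Group dictionary words into lists keyed by word length."""
    by_len = defaultdict(list)
    for w in words:
        by_len[len(w)].append(w)
    return by_len

# Select the appropriate list for the requested square size, e.g. 4:
# candidates = split_by_length(all_words)[4]
```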
First, given a word in row 0, save all words starting with the letters composing it: this is the superset of possible column words. All of these, as well as the global dictionary containing all x-letter words, are stored in a special data structure which makes prefix look-ups fast: a trie, and more specifically a DAWG, to speed up prefix search. Then, from these words, retrieve all possible letters for each slot in the square (significantly fewer than the total). Finally, using these letters, build the equivalent word superset for the rows, and store them in DAWGs as well.
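This preparation step can be sketched as follows. The real program stores these sets in DAWGs (via the DAWG package); here a plain set of prefixes stands in to keep the sketch dependency-free, and all names are illustrative:

```python
def make_prefix_set(words):
    """Stand-in for the DAWG: the set of every prefix of every word,
    giving O(1) has-prefix look-ups (a real DAWG is far more compact)."""
    prefixes = set()
    for w in words:
        for i in range(1, len(w) + 1):
            prefixes.add(w[:i])
    return prefixes

def column_supersets(first_word, words):
    """Given the word in row 0, keep only the words starting with each of
    its letters: the supersets of possible column words."""
    return [[w for w in words if w[0] == ch] for ch in first_word]

def letter_sets(col_words, size):
    """From the column supersets, collect the possible letters per slot."""
    sets = [[set() for _ in range(size)] for _ in range(size)]
    for col, candidates in enumerate(col_words):
        for w in candidates:
            for row in range(size):
                sets[row][col].add(w[row])
    return sets
```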
b l a h
x y z b
x y z b
x y z b
Store all 'bxxx', 'lyyy' words. Knowing those, store the possible letters for each slot.
b l a h
1 2 3 4
5 6 7 8
...
Each slot will have fewer letters than the complete alphabet. Knowing those, recursively search for possible row words.
At slot 1, fetch the letter set and, for each letter, check whether there are words starting with that prefix. If yes, move to the next slot, 2, and recurse: fetch the letter set, check whether there are words with prefix '12', and so on. If you reach the bottom (slot 4), save that word.
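The recursion just described can be sketched like this (a minimal, hypothetical version: `letter_sets` is a list of per-slot letter sets, and `has_prefix` is any prefix-lookup callable standing in for the DAWG):

```python
def find_row_words(letter_sets, has_prefix, prefix="", out=None):
    """Recursively assemble row words: at each slot, try every allowed
    letter and descend only while the partial word is a valid prefix."""
    if out is None:
        out = []
    depth = len(prefix)
    if depth == len(letter_sets):
        out.append(prefix)          # reached the bottom: save the word
        return out
    for ch in sorted(letter_sets[depth]):
        candidate = prefix + ch
        if has_prefix(candidate):   # any words starting with this prefix?
            find_row_words(letter_sets, has_prefix, candidate, out)
    return out
```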
Then, using this data, each square can be built much faster than before. For each letter in the set of possible ones, I perform two checks: whether the letters in the column form a prefix of a word in my dictionary (the DAWG for the appropriate column), and whether the letters in the row do the same (in the row DAWG). If both checks pass, I save the letter and move on to the next slot; if not, I backtrack and try another letter.
b l a h
1 x x x
y
y
At slot 1, check whether there are row words with that letter as a prefix in this row's word set. Perform the same check for the column. The diagonal check is similar, using 'b2' as a prefix in this dummy example:
b l a h
1 2 x x
y z
y z
Just as with the other recursive processes above, once the bottom of the recursion is reached, the program writes the full square to a file and moves on: remove the last letter (bottom-right corner), try the next one, go back one more letter, etc.
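Putting the pieces together, the backtracking builder might look like the sketch below. All names are illustrative; for simplicity it tries the full alphabet at each slot instead of the precomputed letter sets, and relies on a `has_prefix` callable standing in for the DAWGs:

```python
def build_squares(size, has_prefix, check_diagonals=True):
    """Fill the square slot by slot (row-major). A letter is kept only if
    the partial row, column, and (optionally) diagonals all remain valid
    prefixes; then recurse deeper, otherwise backtrack and try the next."""
    grid = [[""] * size for _ in range(size)]
    results = []

    def rec(pos):
        if pos == size * size:
            # Bottom of the recursion: record the finished square.
            results.append(["".join(row) for row in grid])
            return
        r, c = divmod(pos, size)
        for ch in "abcdefghijklmnopqrstuvwxyz":
            grid[r][c] = ch
            row = "".join(grid[r][:c + 1])
            col = "".join(grid[i][c] for i in range(r + 1))
            ok = has_prefix(row) and has_prefix(col)
            if ok and check_diagonals:
                if r == c:  # slot lies on the main diagonal
                    diag = "".join(grid[i][i] for i in range(r + 1))
                    ok = has_prefix(diag)
                if ok and c == size - 1 - r:  # slot lies on the anti-diagonal
                    anti = "".join(grid[i][size - 1 - i] for i in range(r + 1))
                    ok = has_prefix(anti)
            if ok:
                rec(pos + 1)
        grid[r][c] = ""  # backtrack: clear the slot for the caller

    rec(0)
    return results
```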
This version is far faster than the previous (openFrameworks) one (I had made the mistake of using neither the letter sets nor the column/row word supersets in the implementation of my database builder...), but the sheer number of possible squares is staggering. Given the (admittedly mammoth) dictionary I'm using (from litscape), already 4-letter words are barely manageable: 4000+ words, with, for each word, roughly between 80'000 and 200'000+ solutions, leading to an estimated total of (hopefully) at most 800'000'000 squares... I made an attempt at building the database, which is possible in roughly a day on my laptop, and stopped at around 30'000'000 squares (the text file was already 1.3 GB, and had only reached 'dove' as a first word). For now I won't push too far into building such huge result sets, as it is unclear whether I'll even be able to mine them with my current hardware.
Still, the question remains whether the process can be sped up even further (I saved other big word lists from litscape, and it seems the lengths with the most words lie around 10-13 letters, which remains out of reach in terms of computing power).
(Update: something I did not think of at first is that the hurdle of the gazillions, my current name for the mountain of possibilities encountered with 4-letter words (and probably also 5, 6, etc.), can be avoided to some extent by approaching the problem from the other end, namely by searching for very large squares. It may seem counterintuitive, but the number of possible squares decreases massively for longer words, simply because the number of possible letter combinations explodes whilst the number of actual words decreases (lists in the middle range, around 10 letters, have around 30k words, whereas the list of 14-letter words only contains 8k or so). Because of how the program works, it does not take more time to calculate possibilities for these long words: so far no square could be found for 15+ letters (the 15-letter search takes around 10h to calculate, the 14- and 13-letter ones around 24h).)
- The multiprocessing issue has been resolved using a Queue. Instead of having each worker write to files (which apparently carries risks, as the write() call does not happen immediately), I store the squares as strings in lists, collect all the lists at the end, and write to a single file at the very end. A global counter class has also been implemented. My gratitude to my good friend Clément, who took a few hours with me to plough through this issue.
- In order to gain speed, either move back to C++ (openFrameworks), this time using all the data structures/methods implemented here, or use Cython to speed things up. See this video for instance.
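The Queue-based fix described in the first point can be sketched as follows (the worker body is a stand-in for the actual square search; all names are illustrative):

```python
from multiprocessing import Process, Queue

def worker(words, queue):
    """Each worker accumulates its results in a local list (no direct
    file writes), then puts the whole list on the queue at the end."""
    found = [w.upper() for w in words]  # stand-in for the square search
    queue.put(found)

def crunch(word_list, n_procs):
    """Split the word list across workers, collect every result list
    from the queue, and return them merged, to be written in one go."""
    queue = Queue()
    chunks = [word_list[i::n_procs] for i in range(n_procs)]
    procs = [Process(target=worker, args=(chunk, queue)) for chunk in chunks]
    for p in procs:
        p.start()
    results = []
    for _ in procs:
        results.extend(queue.get())  # drain the queue before joining
    for p in procs:
        p.join()
    return results

if __name__ == "__main__":
    print(sorted(crunch(["bit", "ice", "ten"], 2)))
```

Draining the queue before `join()` matters: a worker blocked on a full queue never exits, so joining first can deadlock.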