A 'siamese words', aka 'werewords', database generator
An inquiry into words sharing letters, and letter patterns, in Python 3.
The program currently runs as follows:
- It imports words from a given dictionary (one word per line; an example is provided in the
data folder, but another file can be specified by the
- The process is by and large a brute-force triple loop, going through all the words in the dictionary, trying each padding position, then comparing that to each word in the dictionary, summarised like so:
    for each word in the dict:
        for each padding position:
            for each other word in the dict:
                check if conditions are met & save
- Note that the program creates as many copies of the dictionary as there are padding positions needed, then retrieves the appropriate one in the loop, as follows:
    b a n a n a     | (padding: 0)
    a v a t a r     | (no common letter)

      b   n   n a   | (padding: 1)
        a   a
    a v   t   r     | (two common letters)

        b a n a n a | (padding: 2)
    a v a t a r     | (no common letter)

    (And when avatar is the first word:)

    a v a t a r     | (padding: 0)
    b a n a n a     | (no common letter)

        v   t   r   | (padding: 1)
      a   a   a
    b   n   n       | (three common letters)

    etc.
- The process uses recursion: given one word and a padding, the siamesor moves forward through the word, position by position, and for each position searches for words that have the same letter in that position. The matching words are saved and reused as a reduced dictionary for the next step. Once the process is complete (by reaching the end of the word, the final allowed position given the padding, or the constraint on the minimal number of differing letters or maximal number of common letters), the machine checks whether there are common letters outside the given positions (we want exactly those positions to have common letters, not elsewhere), and then saves the result in what will be a dictionary.
- The dictionary comprises keys describing the positions of the common letters, as a string. To continue with our above example, the first result would be classified under the key
'1,3;2,4', as a tuple:
(('banana', '1,3'),('avatar','2,4')), and the second under the key
- The dictionary is then saved as a JSON file, the name of which reflects the chosen options (see below for more detail), in the results folder (created if not yet there).
- I attempted to implement a multiprocessing pipeline that parallelises the first step (the first loop through all the words), but so far it is unclear whether this improves performance significantly. Given how long it currently takes to build a database, this is not essential unless one wants to build the entire thing (all word lengths, all paddings).
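The triple loop and the position keys described above can be sketched as follows. This is a minimal illustration and not the code in siamesor.py: the function names `common_positions` and `build_database`, the `max_padding` parameter, and the choice of ';' as key separator (taken from the banana/avatar example) are assumptions, and the sketch leaves out the recursion, the reduced dictionaries and most of the constraints.

```python
from collections import defaultdict


def common_positions(first, second, padding):
    """Compare `first`, shifted right by `padding`, against `second`.

    Returns two lists of 0-based indices: where the shared letters sit
    in `first`, and where they sit in `second`.
    """
    in_first, in_second = [], []
    for i, letter in enumerate(first):
        j = i + padding  # index of the letter of `second` facing first[i]
        if j < len(second) and second[j] == letter:
            in_first.append(i)
            in_second.append(j)
    return in_first, in_second


def build_database(words, max_padding=2, min_common=2):
    """Brute-force triple loop: each word, each padding, each other word."""
    database = defaultdict(list)
    for first in words:
        for padding in range(max_padding + 1):
            for second in words:
                if second == first:
                    continue
                in_first, in_second = common_positions(first, second, padding)
                if len(in_first) < min_common:
                    continue  # not enough shared letters to be of interest
                key_first = ",".join(map(str, in_first))
                key_second = ",".join(map(str, in_second))
                # Hypothetical key format: positions in each word, joined by ';'
                database[f"{key_first};{key_second}"].append(
                    ((first, key_first), (second, key_second))
                )
    return dict(database)


db = build_database(["banana", "avatar"])
print(db["1,3;2,4"])  # [(('banana', '1,3'), ('avatar', '2,4'))]
```

Because `common_positions` collects every shared letter for a given padding, a pair is only ever stored under the key listing exactly its common positions, which mirrors the "no common letters elsewhere" check.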
A few example cases:
$ python siamesor.py --equal_length --no_padding (equivalent to
$ python siamesor.py -qg) will produce all possibilities for words of equal lengths, the shifting/padding mechanism being disabled;
$ python siamesor.py --min_length 6 --max_length 7 (equivalent to
$ python siamesor.py -m 6 -M 7) will only browse through word lengths 6 to 7;
$ python siamesor.py --min_length 6 --intersect --max_length 7 --word supranational (equivalent to
$ python siamesor.py -i -m 6 -M 7 -w supranational) will search for possibilities matching the word "supranational" within words of length 6 to 7, with the constraint that the intersect letters together should form a word from the given dictionary;
$ python siamesor.py --compact --processors 18 (equivalent to
$ python siamesor.py -c -p 18) will search through all possibilities using your mighty 18 cores for speedy parallel computation, and store them in a compact format (the final output dictionary keys will be produced regardless of the position within the word. This is done by aligning all positions leftward to 0: hence positions '2,3' will be equivalent to, and stored under, '0,1', as will be '3,4', '4,5', etc.);
$ python siamesor.py --allowed_letters i --min_common 3 --verbose (equivalent to
$ python siamesor.py -v -a i -C 3) will search for possibilities within equally long, unpadded words for intersect letters comprising at least three 'i's, and print every single result out to the console.
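The leftward alignment used by the compact format can be illustrated with a short sketch. The helper name `compact_key` is hypothetical and the code is only an assumption about how such a normalisation might be written, not the program's own implementation.

```python
def compact_key(positions):
    """Shift a comma-separated position string leftward so it starts at 0."""
    numbers = [int(p) for p in positions.split(",")]
    offset = min(numbers)  # the leftmost position becomes 0
    return ",".join(str(n - offset) for n in numbers)


print(compact_key("2,3"))    # 0,1
print(compact_key("4,5"))    # 0,1
print(compact_key("1,4,5"))  # 0,3,4
```

All keys that share the same spacing between positions collapse onto one entry, which is why '2,3', '3,4' and '4,5' end up stored together under '0,1'.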
Output of the
    usage: siamesor.py [-h] [-m MIN_LENGTH] [-M MAX_LENGTH] [-g] [-q] [-i]
                       [-I INTERSECT_WORD] [-a ALLOWED_LETTERS]
                       [-A ALLOWED_REMAINDER] [-l] [-G MIN_PADDING]
                       [-D MIN_DIFFERENT] [-C MIN_COMMON] [-w WORD] [-c]
                       [-k STRUCTURE] [-f FILE] [-d DICT_LIMIT] [-v] [-e]
                       [-P PRINT_LIMIT] [-s SAMPLES] [-r] [-p PROCESSORS] [-t]

    Find siamese words, aka werewords

    optional arguments:
      -h, --help            show this help message and exit
      -m MIN_LENGTH, --min_length MIN_LENGTH
                            Minimum word length for siamese database. Defaults
                            to 4.
      -M MAX_LENGTH, --max_length MAX_LENGTH
                            Max word length for siamese database. Defaults to
                            none.
      -g, --no_padding      Disable the padding mechanism shifting one out of
                            the words left/right. Defaults to False.
      -q, --equal_length    Only produces siamese using two words of equal
                            lengths. Defaults to False.
      -i, --intersect       Adds the constraint that the intersect letters must
                            form a word in the dictionary.
      -I INTERSECT_WORD, --intersect_word INTERSECT_WORD
                            Specify the word that the intersect letters must
                            form.
      -a ALLOWED_LETTERS, --allowed_letters ALLOWED_LETTERS
                            Adds the constraint that the intersect letters must
                            only be taken from the given input. Must be
                            comma-separated, e.g. -a a,e,i,o,u,y, for vowels
                            only.
      -A ALLOWED_REMAINDER, --allowed_remainder ALLOWED_REMAINDER
                            Adds the constraint that the remaining letters (not
                            the intersect ones, different for each word, must
                            only be taken from the given input. Must be
                            comma-separated, e.g. -a a,e,i,o,u,y, for vowels
                            only.
      -l, --single_intersect
                            Adds the constraint that the intersect must only be
                            composed of one letter (any permitted).
      -G MIN_PADDING, --min_padding MIN_PADDING
                            Minumum overlap allowed between the two considered
                            words when shifting one left/right. Defaults to 3.
      -D MIN_DIFFERENT, --min_different MIN_DIFFERENT
                            Minumum number of differing letters between the two
                            considered words. Defaults to 2.
      -C MIN_COMMON, --min_common MIN_COMMON
                            Minumum number of common letters between the two
                            considered words. Defaults to 2.
      -w WORD, --word WORD  Search for siamese containing the specified word.
      -c, --compact         Store results in the dictionary by structure, that
                            is, take the letter positions, e.g. '1,4,5', shift
                            them leftward to zero '0,3,4'. The other option
                            stores both position data for both siamese:
                            '1,3:3,5', except when both positions are identical,
                            in which case only one will be used, e.g. '3,6'.
      -k STRUCTURE, --structure STRUCTURE
                            Search for siamese with positions equal to given
                            structure. Format: numbers separated by commas,
                            positions separated by a colon. Example: '1,2' will
                            mean that positions 1 and 2 will have to be met in
                            both words. '1,2:3,4', positions 1 and 2 for one, 3
                            and 4 for the other.
      -f FILE, --file FILE  The source dictionary file (one word per line)
      -d DICT_LIMIT, --dict_limit DICT_LIMIT
                            Maximum word length allowed when importing
                            dictionary.
      -v, --verbose         Setting verbose to True will make the script print
                            all results to the console. Defaults to false.
      -e, --quiet           Limiting the printing to the total found. Defaults
                            to false.
      -P PRINT_LIMIT, --print_limit PRINT_LIMIT
                            Number of siamese to be printed before stopping.
                            Independent of verbose argument. Defaults to 0.
      -s SAMPLES, --samples SAMPLES
                            Number of random samples from results once database
                            is built (each 'structure' key from the final
                            dictionary will be sampled in turn). Defaults to 1.
      -r, --no_recap        Disables the recap section (printing the number of
                            siamese for each structure.
      -p PROCESSORS, --processors PROCESSORS
                            Number of cores used for parallel processing.
                            Default: number of cores detected by the
                            multiprocessing module.
      -t, --time            Calculate the total time of the program.