Literary Tools to Work with Subsequences of Strings
Branch: master
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
data
subDisplay
.gitattributes
.gitignore
README.md
subwords.py

README.md

Subwords

One out of three modules of Recursus (with Wordsquares and AIT)

The current Python program, using a given list of words (each on a single line), creates list of words of equal length and decomposes words into subwords of a given length, with no remainder (we are looking for 'perfect' decompositions, where no letter is left at the end of the operation). The program also implements one more constraint: the subwords have to be 'interwoven', namely, they must be split by at least the letter of at least one other subword. The results are then printed as a text file in the data folder.

Instructions

The program runs on Python 3 (3.66 when developed).

Install the DAWG package either through pip or conda:

  • $ pip install DAWG
  • $ conda install -c conda-forge dawg

Run like so (the dictionary is a list of words, one per line; the program will create lists of words of equal length):

  • $ python subwords.py [path/to/dictionary] [number of processors]

The results will be generated in the data/results/ folder. (Source dictionary are ready for use in the data/lists folder.)

Display

I also developed a Processing sketch called 'subDisplay' that allows one to load already found decompositions from a text file, and browse through them in a GUI manner (left/right arrows to switch between decompositions, up/down to switch between various letter distribution possibilities if there are any, and tab/q to browse through the possible positions of the words on the various lines (yet another problem soluble into recursive acids). It so happened that a similar recursive process was required to reconstruct subwords from a given decomposition, which was good practice.

Results need to be copied into the 'subwords' file in data, one decomposition per line. Then the sketch allows you to go through the various possibilities:

  • left and right to switch decompositions;
  • up and down to switch possibilities (letters being distributed between the subwords);
  • tab and 'q' to browse through the different layouts (which subword on which line);
  • 's' to save.

Weaver, war, eve Vulnerabilities, uni, veil, labs, rite Missionaries, Isis, mine, soar

Many more results in subDisplay/ and on Recursus.

Mechanism

The program will go through a list of superwords (possibly containing subwords), and treat each word one after another. It takes the length of the word, and produces all possible integer partitions (e.g. for a 9-lettered words we want (6,3), (5,4), (3,3,3)...). Then, after a test to make sure that there are words containing the number of letters indicated by said partitions (useful for huge superwords or custom string inputs), it will set itself to search for subwords of these lengths: our 9-letter word will be examined first with words of 6 and 3 letters, then 5 and 4, etc. Then, for each subword length, it will generate a dictionary of regexes (for 'blah' -> (.*)b(.*)l(.*)a(.*)h(.*)) that will be used to search if the word is present in the superword. The last length of subwords will only be there as a look-up (using a Trie, and more specifically a Dawg for speed). The decomposition follows two strict rules: the subwords must be divided themselves (so no division like 'super' + 'man' or 'high' + light'); and I eliminated duplicates (where two solutions just find subword1 first and subword2 afterwards, and vice versal). For each word, once the regex check indicates a match, a(nother) recursive function calculates the possible positions in the word where the letters composing the subword are located, stores them, and then for each of those, substracts them from the superword, and performs the same operation, recursively, on the remainder letter, this time using the next subword length in our partition. Once at the bottom of the recursive process, we simply check if the remainder is a word using our Dawg, and if it is, we collect it. The delicate matter is to handle repetitions. For that, sets are useful, as they automatically get rid of repetitions. At each level in the recursion, the program creates new lists of decompositions using results from below, and once these lists are formed, they are themselves treated as elements and collected in a set, before being returned to the level above. At the very end, a check against the partition lengths allows us to reconstitute the full decomposition in the same order as the partition. Thus, for 'vulnerabilities', it will find 'uni', which leaves 'vlerabilites', which itself contains 'labs', hence the remainder 'verilite', which containts 'veil', leaving 'rite'. The function calculating the positions is also nice and recursive:

We go forward into the two words checking for identical letters, and when we do, trim both words by the appropriate amount and keep searching on the reduced strings, e.g. 'abdee' with 'be':

 a b
   b -> trim 'abdee' to 'dee', 'be' to 'e'.

 d e
   e -> one solution found, save result, carry on.

 e
 e -> other solution, carry on.

The function will then go back up in the level of recursion, keep on looping to the end of the word and find nothing more.

The multiprocessing part divides the possible word lengths (words of 6, 7, ... letters) into chunks, and gives to each processor one of the chunk to work on.

Branches

  • The master branch generates all decompositions for all words in a given dictionary, from the smaller superwords upward (6 letters, decomposable into 2 subwords of length 3, moving on to 7, into 4 and 3, then 8, which has more partitions, 5 and 3, 4 and 4, etc.). The program calculates the desired partitions recursively (no redundancy, no word shorter than 3 letters).

    In the hope of speeding up the whole process, I adopted a multiprocessing framework, whereby the user can choose how many processors to use when invoking the program.

  • The stringinput branch generates all subwords for a given string, e.g.

    $ python subwords.py [given string] [dictionary] [number of cores]

    The following command uses all the dictionaries available in the data/lists folder:

    $ for dict in data/lists/*; do python subwords.py [given string] $dict 4; done

  • The arguments branch containts a previous version where one must give the desired partition as argument to the program, like so:

    $ python subwords.py 8 4 4

  • The irrecursive branch contains a previous version without recursion.

  • The 'ur' branch containts the original idea for this whole affair. Here is the original Readme:

    This repository contains my research in the field of subwords, defined as a word that is contained within another one, with the order of letter respected: 'automatonophobic', for instance, contains words such as 'atman', 'tatoo', 'tonic' or 'topoi' within it, all in the right order, just selecting the letters needed.

    My research has been focussed on this first aspect using regex, working on lists of words extracted from litscape: it is now possible for me to find all subwords for a given long word (superword), or find all superwords for a given subword (see 'Subwords_3').

    The other tricky aspect has been to calculate all possible letter positions for one specific subword. Given 'automatonophobic' and 'tatoo', for instance, there are three possible solutions (see 'Subwords_2_7_recurse4'):

    au om   n phobic
      t  ato o      
    
    au om   noph bic
      t  ato    o   
    
    au om  on ph bic
      t  at  o  o  
    

    I am also developing a program to write constrained poems, see 'Subwords_4_wordtanglements', whereby the first line is a long word (e.g. 16 letters), the second one any subword from the first, the third any superword from the previous subword, etc.

    All these functions were developed in the context of a project using a 16x2 LCD screen, which could display these results.