A word square database generator

Word Squares

A foray into word squares as a poetic form
Jérémie Wenger, March 2018

Machine Learning Proposal
Rebecca Fiebrink, Goldsmiths College, University of London
IS71074A: Data and Machine Learning for Artistic Practice (2017-18)


Word Squares is a step towards a literary practice combined with what could be called machinic imagination: the use of computing power to handle vast fields of possibilities and to interact with constraints in a fruitful way. In this project, I will use the computer to generate a large amount of data that would be very difficult to produce by hand (given the nature of the constraint), then use machine learning algorithms to explore that space, singling out either specific solutions (unique, remarkable texts) or series (texts that resemble each other, modulo certain properties, in a way not too dissimilar to the practice of painters and artists). The ability to actually generate the entire field of possibilities, instead of it existing 'somewhere out there', could allow for a reconceptualization of one's relationship to what can be called either 'imagination' or the 'unconscious', as well as to 'method' in literary practice: instead of only 'stumbling across' interesting solutions while writing under constraint, one can study the entire space, its singularities, its groups or tendencies, and redefine artistic intent as the choice to show this or that part of the space, or to approach it with this or that particular tool.

The original inspiration for exploring word squares came from Georges Perec, a distinguished member of Oulipo, whose book of poetry Alphabets (Paris: Galilée, 1976) constrains each text in the following way: every poem is a square of 11 lines by 11 letters, and each line must use the ten most frequent letters of the French alphabet once, plus one additional letter, the same for every line of the square, which, according to Perec, gives the square its 'tonality' or 'flavour'. Another example, closer to the current attempt, is the famous Sator Square, an anonymous word square from antiquity that is both symmetrical and palindromic.

A first version of the work was presented for Lior Ben-Gai's course 'Programming for Artists I'. That version, however, did not use any generative method: the computing part consisted only of a rapid look-up of possible words, together with a very old-fashioned console interface. At that stage, the squares only had words in the lines, plus one word each in the first and last columns.


Data generation: recursion, (OpenFrameworks | Python)

The first step in this project was to develop a program that, given a word for the first line, computes all possible squares in which every line and every column is a word (all taken from a dictionary, in this case the 'Mammoth uncensored word list' on litscape). This part is already working, and I am looking at ways to optimize its performance. The ideal scenario would be a database of all squares of a given size, which for now looks possible for three letters, barely achievable for four, and out of reach for five. The good news, however, is that a single four-letter word (out of the 4000+ present in the dictionary) can produce around 40,000 possible squares, which is a good start for machine learning.
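A minimal Python sketch of the recursive search described above, not the Processing sketch from this repository. It assumes the dictionary is available as a plain list of lowercase words; the names `build_prefixes` and `squares_from`, and the toy word list, are hypothetical. Each candidate row is kept only if every partial column is still the beginning of some dictionary word, which prunes dead ends early.

```python
def build_prefixes(words):
    """Return every prefix of every word (empty string and full words included)."""
    return {word[:i] for word in words for i in range(len(word) + 1)}


def squares_from(first_word, words):
    """Yield each square (a tuple of rows) that starts with first_word and
    whose rows and columns are all taken from words."""
    n = len(first_word)
    candidates = sorted({w for w in words if len(w) == n})
    prefixes = build_prefixes(candidates)

    def extend(rows):
        if len(rows) == n:
            # Every column is now a full-length prefix, i.e. a dictionary word.
            yield tuple(rows)
            return
        for word in candidates:
            # Keep the row only if each column can still grow into a word.
            if all(
                "".join(row[c] for row in rows) + word[c] in prefixes
                for c in range(n)
            ):
                yield from extend(rows + [word])

    if first_word in candidates:
        yield from extend([first_word])


# Hypothetical toy dictionary, only to show the call.
toy_words = ["bit", "ice", "ten"]
for square in squares_from("bit", toy_words):
    print("\n".join(square))
```

The column strings are rebuilt at every step here for readability; carrying them along as an extra argument of the recursion would avoid that repeated work, at the cost of slightly more bookkeeping.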

The more technical (questions | tasks) in that part of the project are:

  • Port the current Processing sketch to (OpenFrameworks | Python), probably the former, as I have heard that C++ could be better for performance, although Python is the ideal language for anything dealing with text.
  • Improve the recursive function, e.g. store the prefixes in a flexible array, as I do with the column words, instead of recalculating everything once I have a solution (if that improves performance).
  • Use a set instead of an array for my lists of possible column words (if that improves performance)
  • Find a simpler algorithm that achieves the same result, if there is one
  • Think about the data format in which to save the squares: currently they are saved in a text file, with a new line after each line word; however, it might be more sensible, if less human-readable, to save everything in a table format, with one row per square and each line word in its own column (whether I should also save the column words, e.g. in additional table columns, remains an open question).
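One possible shape for the table format mentioned in the last point, sketched with Python's standard `csv` module: one row per square, one cell per line word, and optionally the column words appended in further cells. The function names and the `include_columns` flag are illustrative assumptions, not part of the existing code.

```python
import csv
import io


def column_words(square):
    """Read the columns of a square (a sequence of equal-length row words)."""
    return ["".join(row[c] for row in square) for c in range(len(square[0]))]


def write_squares(squares, handle, include_columns=False):
    """Write one CSV record per square to an open text handle.

    Returns the number of squares written.
    """
    writer = csv.writer(handle, lineterminator="\n")
    written = 0
    for square in squares:
        record = list(square)
        if include_columns:
            record += column_words(square)
        writer.writerow(record)
        written += 1
    return written


# In-memory buffer for demonstration; a file opened with
# open(path, "w", newline="") would be used the same way.
buffer = io.StringIO()
write_squares([("bit", "ice", "ten")], buffer, include_columns=True)
print(buffer.getvalue(), end="")
```

Since the column words can always be recomputed from the rows, storing them is a trade of disk space for faster look-ups later on.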

Machine Learning

The machine learning part of the project has unfortunately yet to see the light of day. The plan is to start with something rather simple, and build on that as I explore more properties of the square space. Given my current understanding, even a simple function looking for a particular property, which would go through the data and return squares displaying it, would be a very legitimate decision stump I can start working with. If this is the case, then various ideas come to mind:

  • Given a list of squares, which ones have words on one diagonal, or on both? Are there many? If not, are there, among these specimens, some that could be of literary significance?
  • Two further species could be looked at: the symmetric square (for which row words and column words are the same), as well as the palindromic one (the Sator Square, quoted above, is a rare occurrence of a square that is both symmetric and palindromic at the same time).
  • It could be fairly straightforward to work with letter patterns, e.g.:
    • Are there squares with the same letter on each corner?
    • The same, with that letter also in the centre? (For squares of odd size only.)
    • Or the opposite scenario: patterns of letters located on the edges rather than the vertices, e.g. 4-squares that have a double letter on the edges ('e_gg_s'), or perhaps a 4-letter square that has the same letter four times at its centre? (The three algorithms above work as decision stump classifiers.)
    • We could assign a score for letter frequency: which squares use the (smallest | largest) number of distinct letters? (This would be similar to linear regression, except with discrete values.)
  • Expanding on the idea of letter frequency, one could think about the question of 'distance' and 'closeness' in that space, at various levels: the letters, for instance (two squares using exactly the same letter set would be very close), and more obviously the words (which is non-trivial knowledge, as words can occur in various positions in columns and lines); furthermore, if one has access to 'metadata' such as definitions, synonyms, etc., the concept of neighbourhood could be refined and expanded (more on this below).
  • Another question, which is perhaps more scientific than artistic, would be to know how many squares there are for each word of a certain length, and thus know if there are words that only have very few squares, which ones have the most, etc.
  • An advanced problem, which would require having entire databases of squares at hand and a way of crunching large amounts of data, but could be a fine twist on convolution, would be to find out whether there are squares containing other, smaller squares (e.g. a 7-letter square containing a 3-letter one). One could think of interesting and artistically beautiful squares that contain another square at their core. The methodology for testing this seems to be textbook convolution: select a smaller window, e.g. 3 x 3, compare that window against the existing database of 3 x 3 squares, and repeat the process for all possible windows.
  • One could also imagine a more empirical method for selecting squares: start with one word, for instance, and search for all the squares containing it, in any line or column, then select a second word, and search the first subset in the same fashion, until the number of squares is reduced to a small enough set that it is possible to apply a human, aesthetic choice to it.
  • An avenue of exploration that I intend to pursue, if time allows, is to open up the project to a higher-level approach to the words: with word databases such as WordNet in NLTK, or perhaps a dictionary API from Wordnik or Oxford Dictionaries, one could start working with semantics, looking up definitions, synonyms, and other features (for instance, are there squares that maximize semantic similarity between their words, or that have many of their words contained in the (definitions | synonyms | antonyms) of the other ones, and what do such squares look like?). The main question here is whether it will be possible to make that many requests when looking up thousands of words.
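Several of the questions above reduce to small predicates or scores over a square stored as a tuple of equal-length row strings. The sketch below is one hypothetical way to write them in Python: all function names are assumptions, the 'distance' is a plain Jaccard distance on letter sets (one choice among many), and the window scan is a brute-force look-up rather than a true convolution.

```python
def columns(square):
    """Column words of a square, read top to bottom."""
    return tuple("".join(row[c] for row in square) for c in range(len(square)))


def diagonal_word_count(square, vocabulary):
    """Number of diagonals (0, 1 or 2) that spell a word from vocabulary."""
    n = len(square)
    main = "".join(square[i][i] for i in range(n))
    anti = "".join(square[i][n - 1 - i] for i in range(n))
    return sum(diagonal in vocabulary for diagonal in (main, anti))


def is_symmetric(square):
    """True if the row words and the column words are the same."""
    return columns(square) == tuple(square)


def is_palindromic(square):
    """True if the square reads the same when all its letters are reversed."""
    text = "".join(square)
    return text == text[::-1]


def has_equal_corners(square):
    """True if the four corner letters are identical."""
    corners = {square[0][0], square[0][-1], square[-1][0], square[-1][-1]}
    return len(corners) == 1


def distinct_letters(square):
    """Score: how many different letters the square uses."""
    return len(set("".join(square)))


def letter_distance(a, b):
    """Jaccard distance between letter sets: 0.0 means identical sets."""
    letters_a, letters_b = set("".join(a)), set("".join(b))
    return 1 - len(letters_a & letters_b) / len(letters_a | letters_b)


def contains_word(square, word):
    """True if word appears as a full row or a full column."""
    return word in square or word in columns(square)


def inner_squares(square, size, known):
    """Yield (row, col, window) for each size x size window found in known,
    where known is a set of squares stored as tuples of row strings."""
    n = len(square)
    for r in range(n - size + 1):
        for c in range(n - size + 1):
            window = tuple(square[r + k][c:c + size] for k in range(size))
            if window in known:
                yield r, c, window


# The Sator Square, lowercased, as a test specimen.
sator = ("sator", "arepo", "tenet", "opera", "rotas")
specimens = [s for s in [sator] if is_symmetric(s) and is_palindromic(s)]
print(specimens)
```

Each predicate can serve as a filter over the generated data, and filters can be chained, as in the last lines, to narrow a large set of squares down to a handful worth reading closely.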

The end result of these investigations remains singular, artistically inspiring squares, with a resolute focus on their literary value, that is, on what remains constant as one ports any of these texts from one medium to another (for instance, from the screen to a print, or even a manuscript copy). What will be exhibited are the squares themselves, either in isolation or in a series, in the same gesture as when an 'immachinic' artist or writer 'selects' the good ideas from the virtual dross in order to fix them into a 'final' form.


I started working on this project in Processing. Drafts and attempts, including the first version referred to above, can be found in this repository.