Circular poetic word paths using the Levenshtein distance
Jérémie Wenger, March 2018
Helen Pritchard, Goldsmiths College, University of London
IS71076A Computational Arts-based Research (2017-18)
Wordlaces are a highly constrained literary form based on the Levenshtein distance, a type of edit distance. From one initial word or 'urword' and a given dictionary, in this case a list of words of the same length, one finds all the words that differ from the urword by only one character (yielding a Levenshtein distance of 1). From this first generation, one chooses one word and repeats the operation, making sure to increase the distance from the urword by one at each iteration. This algorithm allows us to reach an 'alien word', one with no character in common with the urword, as fast as possible. Once this is achieved, the same method is applied to find a path back to the origin. No word repetition is allowed throughout the lace.
This paper offers insights into the techniques used and developed to experiment with this literary form. The initial technical impulse for this research was the discovery of Natural Language Processing (NLP), the field of computer science devoted to natural languages, and in particular the Natural Language Toolkit (NLTK) library in Python. After that, the actual wordlaces are described in more detail, while the final part is focussed on the implications for literary practice of the use of constraints, the systematic generation of result corpora, and the intermingling of literary writing with data mining.
The Natural Language Toolkit (NLTK) Library
NLP is concerned with the intersection between computation and natural languages, both written and spoken, many areas of which remain a major challenge. Some tasks, such as translating from one natural language to another, are deemed 'AI-complete' problems, that is, impossible to solve satisfactorily unless we succeed in creating a full-blown AI. Other tasks, however, are far simpler, yet allow computer scientists, and now people in various industries, as well as artists, to explore new ways of automating their work with text and speech. Given the scope of this field, it will be more productive to focus directly on the tools we looked at rather than to give a complete overview of it.
We approached NLP through a specific set of tools called the Natural Language Toolkit (NLTK), a library written in Python and designed to provide open-source resources for NLP study and practice. NLTK includes scores of functions, as well as corpora that can be used to test or train algorithms. Examples of these include the complete works of Shakespeare, hundreds of novels, parliamentary proceedings, the US State of the Union speeches since 1945, entire online forums or large Twitter samples. These also include more technical sets of words, such as 'stopwords', which have a syntactic but not straightforwardly semantic function (pronouns, conjunctions, for instance), and that can be removed when analysing the semantic content of texts. WordNet, a lexical database of English, "helps find conceptual relationships between words such as hypernyms, hyponyms, synonyms, antonyms etc." (the tool was developed at Princeton University and adapted to the library). Such a dictionary provides users with many useful functionalities, such as calculating the 'semantic similarity' between two words by attempting to find paths between them using the hypernym and hyponym relationships. Built on that, SentiWordNet assigns a score to each word in WordNet, positive, negative, or objective, so that it is possible to calculate the overall 'sentiment' of a given text by looking up the score of each individual word. Among the very useful functions NLTK provides, one should not fail to mention those that allow one to divide texts into paragraphs, sentences, and words, as a preparation for more processing later; the labelling of words as parts of speech (verb, noun, adjective, etc.), which increases the precision with which one can operate on the texts; and, thanks to this functionality, the ability to construct syntactic trees from given sentences, like so:
Illustration of a syntactic tree from the NLTK book.
The NLTK library also allows you to write your own grammar (called 'context-free', as it is hard-coded and independent from any text input). These structures are an important paradigm to deal with the meaning of sentences, built on Chomsky's theory of language and cognition, and are a first step toward a computational implementation of 'understanding'. (A more recent paradigm, which builds meaning out of context, is being implemented using vector spaces and deep learning. See Word2Vec and this introductory lecture.)
For a comprehensive walkthrough of the functionalities of the library, and of how to use Python to work with them, see the freely available NLTK book.
For my purpose here, which was to find a way of applying some of these tools for literary purposes, and to think about the relationship between the tools and my practice, I decided to narrow the focus down to one specific function I had discovered when sketching a project for the application to the Computational Arts MA at Goldsmiths, and which has a direct implementation in NLTK: the Levenshtein distance, or edit distance. It measures the minimum number of steps required to transition from one word to another using only the operations of substitution, deletion and insertion. A frequent example, cited on Wikipedia and elsewhere, illustrates how it works:
For example, the Levenshtein distance between "kitten" and "sitting" is 3, since the following three edits change one into the other, and there is no way to do it with fewer than three edits:
- kitten → sitten (substitution of "s" for "k");
- sitten → sittin (substitution of "i" for "e");
- sittin → sitting (insertion of "g" at the end).
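The three edits above can be verified with a short dynamic-programming implementation of the distance (a minimal, self-contained sketch; NLTK exposes an equivalent function as nltk.edit_distance):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn a into b."""
    # prev[j] holds the distance between a[:i-1] and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```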
The concept of the Levenshtein distance, especially a distance of 1, can be used to transition from one word to another in a smooth, almost imperceptible manner: two words at a distance of 1 can be seen as 'next of kin' according to that measure, and as more distant relatives as the measure increases. Sliding from one word to the next leads to the formation of 'word paths' that can connect very distant words. Meditating on the possibilities opened by this simple idea, I wondered if it would be possible to invent an overall constraint for a type of path, thus creating a 'form', that would be free enough to give rise to variety, but also sufficiently constrained that I could work with the result space. The form that came to me was the wordlace: a path of words which takes an 'urword', or origin, at its core, goes through a number of kin words before returning to where it started, and is circumscribed by the following constraints:
- each step requires a Levenshtein distance of 1 (we change only one character at a time);
- the final word must be the same as the first one (hence the idea of a 'lace');
- no word repetition is allowed throughout.
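The first constraint can be sketched as a neighbour look-up: since the dictionary contains only words of the urword's length, a Levenshtein distance of 1 reduces to a single substitution, i.e. a Hamming distance of 1. (A minimal sketch; neighbours is a hypothetical helper name and the toy dictionary is illustrative.)

```python
def neighbours(word, dictionary):
    """All dictionary words at Levenshtein distance 1 from `word`.
    With a same-length dictionary this means exactly one
    differing character (one substitution)."""
    return [cand for cand in dictionary
            if cand != word
            and sum(a != b for a, b in zip(word, cand)) == 1]

words = ["cat", "cot", "cog", "dog", "bat"]  # toy dictionary
print(neighbours("cat", words))  # ['cot', 'bat']
```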
(The inspiration comes from the study of rhythm necklaces; cf. Toussaint's paper, p. 5 sq., where 'Euclidean rhythms' are described, a family of circular, evenly distributed rhythmic patterns used in many cultures around the world.)
These three rules are still quite loose, and allow for an enormous variety of laces. Two additional constraints can be devised to restrict the field of possibilities even further:
- one can strive to reach an 'alien word' as fast as possible, with two possible definitions of what the alien word is:
- either a word containing none of the letters of the urword in the same spot (weak constraint);
- or a word containing none of the letters present in the urword in any spot (strong constraint);
- once the alien word is reached, one must find a path (the shortest path) back to the urword, always without repeating any word throughout.
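The two variants of the alien-word constraint can be written as two small predicates (an illustrative sketch; the function names are my own):

```python
def weak_alien(urword, word):
    """Weak constraint: no letter of the urword survives
    in the same position."""
    return all(a != b for a, b in zip(urword, word))

def strong_alien(urword, word):
    """Strong constraint: no letter of the urword appears
    anywhere in the word."""
    return not set(urword) & set(word)

print(weak_alien("ate", "tea"))    # True: every position has changed
print(strong_alien("ate", "tea"))  # False: all letters are shared
print(strong_alien("ate", "fog"))  # True: no letter in common
```

Every strongly alien word is also weakly alien, but not the other way round, which is why the strong constraint prunes the result space so much more aggressively.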
Given the enormous number of possibilities these constraints generate, the strong constraint forms the basis of the present work. A detailed walkthrough of the construction of the algorithm can be found in the Jupyter Notebook 'Wordlace Generation'.
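Not the notebook's actual code, but a compact sketch of the search these constraints describe, under the strong constraint and assuming a same-length dictionary (find_lace, hamming and step are illustrative names, and the toy dictionary is mine):

```python
from collections import deque

def hamming(a, b):
    """Number of positions at which two same-length words differ."""
    return sum(x != y for x, y in zip(a, b))

def find_lace(urword, words):
    """One wordlace: out to a word sharing no letter with the urword,
    gaining one position of difference per step, then back along a
    shortest repetition-free path."""
    def step(w):
        # words one substitution away from w
        return [c for c in words if hamming(w, c) == 1]

    def outward(path):
        cur = path[-1]
        if not set(cur) & set(urword):  # strong alien word reached
            return path
        for nxt in step(cur):
            if nxt not in path and hamming(nxt, urword) == len(path):
                found = outward(path + [nxt])
                if found:
                    return found
        return None

    out = outward([urword])
    if out is None:
        return None
    # breadth-first search back to the urword, avoiding all words used so far
    queue = deque([[out[-1]]])
    while queue:
        back = queue.popleft()
        for nxt in step(back[-1]):
            if nxt == urword:
                return out + back[1:] + [urword]
            if nxt not in back and nxt not in out:
                queue.append(back + [nxt])
    return None

words = ["cat", "cot", "cog", "dog", "bog", "bag", "bat"]
print(find_lace("cat", words))
# ['cat', 'cot', 'cog', 'dog', 'bog', 'bag', 'bat', 'cat']
```

In a real dictionary, each choice point branches into many candidates, which is where the hundreds of laces per urword come from.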
Once one or more wordlaces are complete and deemed worthy of presentation, they can be arranged graphically or dynamically, so as to make their cyclical nature more perceptible. A Processing sketch that writes text on a circle, calculating the orientation of the letters and the spacing between words appropriately, can be used to that effect.
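The geometry involved is simple enough to sketch here (a Python sketch of the same calculation the Processing code performs; circular_layout is an illustrative name):

```python
import math

def circular_layout(text, radius):
    """(char, x, y, rotation) for each character, evenly spaced
    around a circle, starting at the top, each character rotated
    to follow the circle's tangent."""
    n = len(text)
    layout = []
    for i, ch in enumerate(text):
        theta = 2 * math.pi * i / n - math.pi / 2  # angular position
        x = radius * math.cos(theta)
        y = radius * math.sin(theta)
        layout.append((ch, x, y, theta + math.pi / 2))  # tangent rotation
    return layout

for ch, x, y, rot in circular_layout("lace", 100):
    print(f"{ch}: ({x:6.1f}, {y:6.1f}) rotated {math.degrees(rot):5.1f} deg")
```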
Corpora & constraint in literature
After hours of toil, the wordlace generator is ready. For any word in the dictionary at hand, the program will produce a long list of circular laces, hundreds of them for each word. How does one navigate such a space? Does it make sense to look for just one lace, that could stand alone as did poems of old, unique and cherished among the virtual pile of unwritten variants? Even if one were to stick to the 'old' ideal, and wish to find, in these possibilities, the ones that speak to us, how would we go about doing that?
The irruption of generative algorithms into the arena of literature raises questions for the practice of this ancient art that are related, if not entirely identical, to some of the issues composers and artists grappled with in the past century, especially around serial and repetitive procedures (the Schoenbergian series and its consequences as a prime example of musical algorithmics, and the question of the series in art as well). However, it seems that direct generation and production by the computer raises the stakes, as it were, in at least two ways: quantity and closeness. The amount of results is far greater than before, and the distance between each piece and its closest 'neighbours', however one wishes to define neighbourhood in this context, is much smaller. In this context, it seems to us difficult to find serious interest in exhibiting them all (even if this is a solution that more than a few e-scribblers indulge in), unless one lets go of the idea of the piece entirely and relies solely on the idea of the solution space, exhibited as such, more often than not at the cost of the reading process as well (the visitor enters a room whose walls are covered with thousands of nearly indistinguishable texts, and remembers the overall setting of the installation more than anything they read during the viewing).
Facing these questions, we make two choices: first, reading will be maintained as the primary mode of reception (avoiding the common 'slippage' from literature into art); second, we choose not to abandon the singular, but to interpret this plethora as a challenge to singularity waiting to be taken on. What emerges from our confrontation with these floods of multiplicity is an injunction relentlessly to seek out, to hunt down the singular, the one beautiful lace lost in the multitude.
It is noteworthy that we find again, as if echoing the wordlaces themselves, a cyclic movement in the process of creation: to the 'voyage out', where the writer imagines the constraints and builds the algorithm producing data from them, follows the 'nostos', in which singular, remarkable texts, or groups of texts, are extracted from the mass of results. A reading of the 'Wordlace Generation' notebook will reveal that the quest for the singular is present at the very beginning, in the choice of constraints — which can lead to tens of thousands, thousands, or only a few hundred possibilities in the 'brute heap'...
Conversely the 'nostos' process, the 'return', is also fertile ground for more constraints: specific, intimate ones, that may only be thought of once the heap has become ever so slightly more familiar. At each corner, it seems that a core choice of 'systematic literature' is lurking: should I choose the systematic, or algorithmic, path as a solution to my current predicament, or the 'intuitive' one? As seen in the 'Wordlaces Mining' notebook, it is perfectly possible to whittle down the number of possible laces using only individual words, choosing words that have an appeal, a literary appeal, and seeing the number of possible laces fall drastically in only a few steps. However it is also possible to think of other, more statistical methods, as is attempted in the other parts of the notebook. Thinking about the frequency of certain words, it is possible to come up with an overall 'rarity score' for laces, computing the frequency of all words in all laces, and adding up these numbers for each lace so as to have an idea of the composition of each one.
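Such a score is trivial to compute once a frequency table is at hand (a minimal sketch; rarity_score is a hypothetical name, and the toy counts stand in for frequencies one would in practice derive from a corpus, for instance an NLTK one):

```python
from collections import Counter

# toy frequency table; real counts would come from a corpus
freq = Counter({"cat": 50, "cot": 5, "cog": 3, "dog": 40, "bat": 12})

def rarity_score(lace, freq):
    """Sum of word frequencies over a lace: the lower the total,
    the rarer the lace's vocabulary."""
    return sum(freq[w] for w in lace)

laces = [["cat", "cot", "cog", "cat"], ["cat", "bat", "dog", "cat"]]
print(sorted(laces, key=lambda l: rarity_score(l, freq))[0])
# the lace built from the rarest words comes first
```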
Even at this stage, which seems like the dipping of one's toes in an ocean of possibilities, the process at hand brings rather uneasy associations: instead of the writing process coming from within, as it were, from 'imagination', 'inspiration' or 'memory', it is as if writing could become an activity close to the study of corpora — with the only, but not insignificant, difference that those corpora have been constructed instead of already being present out there, written by others (although one can argue they have been written by an 'other', albeit one that more than any scribe follows our instructions to the letter: the machine). There is a disturbing proximity between the uncovering of features (as one would when practicing close or distant reading, see Franco Moretti's work) and the selection of admirable, loveable singularities — as if the very gesture of selection became indistinguishable from writing the desired wordlace.
These experiments can only be seen as inchoate, tentative steps. Another crucial line of thought, not even evoked so far, yet one that is bound to stay with us, is the nature of the constraints themselves. In the present case, the constraints we came up with are purely formal, and formal in very specific ways: they are letter-based and frequency-based. The algorithm does not have any powers of distinction beyond these. The issue is that this cannot be enough. Even if one were to remain in the formal domain, it would be possible to work on phonemes, for instance, instead of letters. Moreover, the NLTK library allows for definition look-up, as well as other types of distances between words (e.g. how related two words are semantically): one could easily imagine other ways of 'mining' a set of wordlaces using these tools, for instance searching for laces that have maximal or minimal semantic coherence.
The overarching question around formal constraints remains: does this lead to an impoverishment of the literary gesture and the resulting works? Or, better phrased, how can one make use of formal constraints so as to make them an enrichment, and not an impoverishment? A paradox I often encounter in computational arts is that, in terms of meaning and, because of that, in terms of the perceived 'difficulty' of the work, the more formal the constraints, as in mid-20th-century art, the more semantically complex and difficult the resulting works become, whilst on the contrary works produced by the latest AIs tend to be closer to mainstream-compatible copies of old art (deep-dreamt Van Gogh, etc.), or very popculturally flavoured, often gamified products, that are extremely easy to access, enjoy widespread popularity, and more often than not have next to nothing to do with what was understood as 'avant-garde' by the previous century.
It seems that at this juncture I am left with one of these cumbersome conundrums of literary and artistic life. Either stay outside the realm of the computational, in order to preserve an idea of 'literary integrity' in one's practice, or jump into the junk, betting on some irruption or other, on the possibility that at some point something, possibly something unpredictable yet artistically worthy, may emerge from the unsound grafting of new technological branches onto one's practice.
Before offering a succinct list of sources, I must admit to an unfortunate yet irrepressible distaste for nearly everything that relates to the field of 'e-literature' and its studies, a strong aversion the root of which I have yet fully to understand. Some hints, not of why this might be the case, but that this orientation was already present in my assessment of earlier literature, can be found in my relationship with past 'techno-oriented authors', such as H. G. Wells, Aldous Huxley, Jules Verne or Villiers de l'Isle-Adam, all of whom I hold, if not in complete disinterest, certainly in rather low regard compared to the many literary figures who revolutionised their field through novelty within literature itself, instead of focussing, or speculating, excessively in my view, on novelty outside it. Thus, great, and too easy, is the temptation to invoke Wittgenstein's preface to the Tractatus Logico-Philosophicus (1921):
How far my efforts agree with those of other philosophers I will not decide. Indeed what I have here written makes no claim to novelty in points of detail; and therefore I give no sources, because it is indifferent to me whether what I have thought has already been thought before me by another.
The following sources will show, I hope, that I have tried, even if rather unsuccessfully, to resist it.
Electronic literature, new media studies, other
Aarseth, Espen J. Cybertext: Perspectives on Ergodic Literature. Baltimore and London: Johns Hopkins University Press, 1997.
Badiou, Alain. L'Être et l'événement. Paris: Seuil, 1988.
Bachleitner, Norbert. "The Virtual Muse: Forms and Theory of Digital Poetry". Theory into Poetry: New Approaches to the Lyric. ed. Eva Müller-Zettelmann and Margarete Rubik, Amsterdam and New York: Rodopi, 2005: 303-344.
Brown Jr, James J. "The Literary and the Computational: A Conversation with Nick Montfort". Journal of Electronic Publishing, vol. 14, no. 2, 2011.
Flores, Leonardo. I ♥ E-Poetry. 2011-present, www.iloveepoetry.com. Accessed 26 March 2018.
Hayles, N. Katherine. Writing Machines. Cambridge and London: MIT Press, 2002.
—. "Electronic Literature: What is it?". The Electronic Literature Foundation, January 2007, http://eliterature.org/pad/elp.html#note109. Accessed 26 March 2018.
—. Electronic Literature. Notre Dame, Indiana: University of Notre Dame Press, 2008.
Montfort, Nick, and Noah Wardrip-Fruin. The New Media Reader. Cambridge, MA: MIT Press, 2003.
—. "Acid-free Bits: Recommendations for Long-lasting Electronic Literature". The Electronic Literature Foundation, June 2014, http://eliterature.org/pad/afb.html#sec0. Accessed 26 March 2018.
Moretti, Franco. Graphs, Maps, Trees: Abstract Models for a Literary History. London and New York: Verso, 2005.
Toussaint, Godfried T. "The Euclidean algorithm generates traditional musical rhythms". Proceedings of BRIDGES: Mathematical Connections in Art, Music and Science. Alberta, Canada: Banff, July 31-August 3, 2005, pp. 47-56, http://cgm.cs.mcgill.ca/~godfried/publications/banff-extended.pdf. Accessed 26 March 2018.
West, Martin Litchfield. Introduction to Greek Metre. Oxford: Clarendon Press, 1987.
Carson, Anne. "A Fragment of Ibykos Translated Six Ways". London Review of Books. vol. 34, no. 21, 8 Nov. 2012, pp. 42-43, https://www.lrb.co.uk/v34/n21/anne-carson/a-fragment-of-ibykos-translated-six-ways. Accessed 26 March 2018.
Su Hui. 璇玑图 (Star Gauge). 4th c. A.D., reproduced in Hinton, David. Classical Chinese Poetry. New York: Farrar, Straus and Giroux, 2008, and www.sites.google.com/a/stu.norwich.edu/suhuistargauge/home. Accessed 26 March 2018.
Perec, Georges. Alphabets. Paris: Galilée, 2001.
—. Beaux Présents, Belles Absentes. Paris: Seuil, 1994.
Pound, Ezra. Poems & Translations. New York: Library of America, 2003.
Les Troubadours. ed. Jacques Roubaud, Paris: Seghers, 1980. See also: www.poets.org/poetsorg/text/sestina-poetic-form. Accessed 26 March 2018.