A foray into word squares as a poetic form
Jérémie Wenger, March 2018
Machine Learning Project
Rebecca Fiebrink, Goldsmiths College, University of London
IS71074A: Data and Machine Learning for Artistic Practice (2017-18)
Requirements and walkthrough
This project was realised using Python and Jupyter Notebooks. The environment is Anaconda 3 together with the following libraries: pandas for dataframes, scikit-learn for the t-SNE algorithm, gensim for Word2Vec, bokeh for visualisation, and pickle for saving and loading data. All these dependencies are either included in Anaconda or can be obtained using the following command line prompts:
$ conda install -c anaconda gensim
$ conda install -c bokeh bokeh
The recommended order is the following:
- finally, 'wordsquares_visualisation_letters', to have a look at a first attempt (developed in the first draft 'wordsquares_ai_draft').
Wordsquares intends to be one more step toward a literary practice combined with what could be called the 'machinic imagination': the use of computer power to handle vast fields of possibilities, and interact with constraints in a fruitful way. In this project, I used the computer to generate a large amount of data that would be very difficult to come up with by hand (given the nature of the constraint), and used machine learning algorithms to explore the space at hand, either to single out specific solutions (unique, remarkable texts), or series (texts that would be similar to each other, modulo certain properties, in a way that is not too dissimilar to the practice of painters and artists). The ability to actually generate the entire field of possibilities, instead of it existing 'somewhere out there', allows for a reconceptualization of the relationship one has to what can be called either 'imagination' or the 'unconscious', as well as 'method' in literary practice: instead of only 'stumbling across' interesting solutions while writing under constraint, one can study the entire space, its singularities, its groups or tendencies, and redefine artistic intent as the choice to show this or that part of the space, or approach said space with this or that particular tool.
The first part of that methodology, the generation of a database, was achieved using openFrameworks, and is now available under the name ofxWordSquares. In the current state, the only complete database is the one containing 3-lettered words.
The overall gesture and method, the interaction between constraint, imagination and data, completes and extends a piece of work started in another course, which led to an essay and pieces called Wordlaces.
The original inspiration for exploring word squares came from Georges Perec, a distinguished member of Oulipo, who wrote a book of poetry called Alphabets (Paris: Galilée, 1976), in which each text is constrained in the following way: they are 11 (lines) x 11 (letters) squares, each line having to use the ten most frequent letters of the French alphabet once, plus one additional letter, the same for each line of the square, which, according to Perec, gives the square its 'tonality' or 'flavour'. Another example, closer to the current attempt, can be found in the famous Sator Square, an anonymous square of words from Antiquity that is both symmetrical and palindromic.
My methodology in this project was to extract features from my squares, and then attempt to use the vector thus created as material for a machine learning algorithm, t-SNE in this case, to cluster the data and produce a meaningful visualisation. The end result would be a 2D or 3D space that I could explore, looking for interesting, unique squares, more easily and effectively than through a 'manual' selection using code.
The plan was to divide the work into two parts:
- first, a formal endeavour, focussing on features related to the 'shape' of squares, mostly through letter patterns and word redundancies;
- in a second step, I would move toward semantics, using Word2Vec first, as it is one of the most exciting advances in natural language processing in recent years, and possibly, if time allows, the NLTK library, to analyse squares at a semantic level.
The formal features that were extracted are the following:
- Inner redundancy: do either the rows or columns contain the same word more than once?
- Rows/Columns common words: do the rows and columns share identical words, and if so how many (three means the square is symmetrical, and can be 'flipped' without change);
- Letter patterns: does the square exhibit special letter patterns? The ones retained were:
- the same letter in the middle of each edge;
- the same letter in the left diagonal;
- the same letter in the right diagonal;
- the same letter in all four corners;
- the same letter in both diagonals;
- A score for letter diversity: the lowest if the square is composed only of two letters, the highest if nine letters are used (no letter repetition);
- An indicator of whether the diagonals (one, the other, or two) also form 3-letter words;
- In the first models I tried (the results of which can be seen in the notebook 'wordsquares_visualisation_letters'), I also used the presence of each letter as a feature. This did lead squares composed of a very similar letter set, but not necessarily the same words, to appear close to each other on the map, but it weighed quite heavily on the overall distribution, and was abandoned in the final version, to leave more space for the other formal features.
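The formal features above can be sketched as follows, assuming each square is stored as three 3-letter strings (the rows); function and feature names are illustrative, not the project's own (the diagonal-word check is omitted, as it requires the full dictionary):

```python
# Sketch of the formal feature extraction for a 3x3 word square.
# A square is given as three 3-letter row strings.
def extract_features(rows):
    cols = ["".join(r[i] for r in rows) for i in range(3)]
    letters = "".join(rows)
    return {
        # does any word appear more than once among rows or columns?
        "inner_redundancy": int(len(set(rows)) < 3 or len(set(cols)) < 3),
        # 3 means the square is symmetrical ('flippable')
        "common_words": len(set(rows) & set(cols)),
        # same letter in the middle of each edge?
        "edge_centres_same": int(len({rows[0][1], rows[2][1],
                                      rows[1][0], rows[1][2]}) == 1),
        "left_diag_same": int(len({rows[0][0], rows[1][1], rows[2][2]}) == 1),
        "right_diag_same": int(len({rows[0][2], rows[1][1], rows[2][0]}) == 1),
        "corners_same": int(len({rows[0][0], rows[0][2],
                                 rows[2][0], rows[2][2]}) == 1),
        # letter diversity: from 2 (lowest) to 9 (no repetition)
        "letter_diversity": len(set(letters)),
    }

print(extract_features(["ten", "eye", "new"]))  # a symmetrical square
```

On this example the rows and columns are identical, so `common_words` is 3 and the square can be flipped without change.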
That enquiry first raised the question of how to produce meaningful plots, especially when one wishes to introduce sets of features that do not have the same weight (for instance, one could imagine a plot where the formal features above are dominant globally, but where locally the squares are still clustered by letter composition).
In most of what follows, I ended up relying heavily on a talk, 'Modern NLP in Python', and notebook by Patrick Harrison, which I had discovered during my Wordlaces enquiry but not used or studied in detail. I adapted it as a reference for the implementation of Word2Vec, t-SNE and Bokeh, which was useful, especially as a guide for these first steps, but might have brought more issues than expected, as a few of these tools, as discussed below, proved less reliable than I had hoped.
In the semantic part, I started with Word2Vec, as I had done some preliminary work with NLTK in the already mentioned Wordlaces. I had made some trials with Memo Akten's ofxMSAword2Vec, and thus used one of the pre-trained models he recommends. Unfortunately, 20% of the words in my dictionary, and, strangely, not the most bizarre or rare ones, were not present in the model I used, which halved the number of usable squares. I also tried to work with Google's trained model, which is big (3.6 GB), but I could not make it run on my computer, despite 8 GB of RAM. A better scenario might have been to train my own model, but with time running out, the fear that this would take a very long time (like the t-SNE algorithm) made me opt for the easier, pre-trained solution.
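The vocabulary-coverage check that revealed the 20% gap can be sketched like this; in practice `vocab` would be the key set of a gensim KeyedVectors model (e.g. its `key_to_index` mapping in gensim 4), but a plain set keeps the sketch self-contained, and all names are illustrative:

```python
# Sketch: measure how much of the word-square dictionary is covered
# by a pre-trained embedding model's vocabulary.
def vocab_coverage(dictionary, vocab):
    missing = [w for w in dictionary if w not in vocab]
    covered = 1 - len(missing) / len(dictionary)
    return covered, missing

# Toy stand-in for the model's vocabulary.
words = ["cat", "dog", "sat", "xyz"]
coverage, missing = vocab_coverage(words, {"cat", "dog", "sat"})
print(coverage, missing)  # 0.75 ['xyz']
```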
The work on the semantic part was, for the most part, not particularly fruitful. It was possible to test the model for word similarity, but I soon realised that what would be needed was the tool called Doc2Vec, which could have embedded each square in a vector space, treating each square as a bag of words, and which could have led to calculating distances between squares. Another solution, which was glimpsed but not implemented, would have been to calculate manually a 'mean' vector for each square using cosine similarity, but it was beyond me to come up with the right equation (especially one I could reliably say was performing the right calculation). As a (meagre) substitute for these two, I came up with a simple enumeration of all pairwise similarities between words in any square (including diagonals when present), which at least allows for a rough view of the 'similarity score' of the square (if the maximum value is low, the square is likely to lack semantic homogeneity; if the minimum value is high, the opposite). A few of these results can be seen as a list in the 'wordsquares_semantics' notebook, but given their overall vagueness it did not seem worthwhile to plot them.
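The pairwise enumeration can be sketched as follows; `similarity` stands in for gensim's `model.similarity(w1, w2)`, replaced here by a toy lookup so the example is self-contained, and the names are illustrative:

```python
# Sketch: enumerate all pairwise word similarities in a square and
# report the extremes as a rough 'semantic homogeneity' indicator.
from itertools import combinations

def square_similarities(words, similarity):
    scores = [similarity(a, b) for a, b in combinations(words, 2)]
    return min(scores), max(scores)

# Toy similarity table standing in for a trained Word2Vec model.
toy = {frozenset({"cat", "dog"}): 0.8,
       frozenset({"cat", "car"}): 0.3,
       frozenset({"dog", "car"}): 0.2}

low, high = square_similarities(["cat", "dog", "car"],
                                lambda a, b: toy[frozenset({a, b})])
print(low, high)  # 0.2 0.8
```

A low maximum suggests no semantic homogeneity at all; a high minimum suggests a tightly related word set.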
The algorithm used to reduce dimensionality was t-SNE, which, as this useful page shows, can be both flexible and difficult to interpret. The core issue here is plain computation time: unlike on said page, where one can see the evolution of data distribution in gorgeous animations, in my case, given my lack of mastery of both the visualisation library and the algorithm itself, as well as, to a large extent, the nature of my dataset, it proved impossible truly to experiment with the algorithm. Thus, the page recommends tweaking the perplexity as well as the number of iterations to see how the algorithm behaves, but as one run of the function takes from 2 to 6 hours on my machine, it was simply too slow for real trial and error in this context. Gene Kogan's openFrameworks app ofxTSNE, Rebecca Fiebrink's recommendation, does have an option to visualise the evolution of the dataset over iterations, but using it would have meant more (potentially time-consuming) porting from Python to C++.
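The reduction step itself can be sketched with scikit-learn, assuming the feature vectors are stacked into a NumPy array with one row per square; the data shape, perplexity and other settings here are illustrative, not the project's actual values:

```python
# Sketch: embed the per-square feature vectors into 2D with t-SNE.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
features = rng.random((100, 8))  # e.g. 100 squares, 8 formal features

# Perplexity must stay below the number of samples; it is the main
# knob to experiment with (along with the iteration count).
tsne = TSNE(n_components=2, perplexity=20, random_state=0)
embedding = tsne.fit_transform(features)  # one 2D point per square
print(embedding.shape)  # (100, 2)
```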
From the two attempts I successfully carried out (technically three, as can be seen in the 'tsne' folder, but 'tsne_model' and 'tsne_model2', as well as their corresponding vectors.npy files, yield a very similar picture), it can be concluded that some clustering does happen:
- when the letter values are present in the feature vector, squares with a nearly identical letter set do appear close together;
- when those are left out, and only the formal features are used, the model shows very distinct areas, which seem to gather elements with similar features (e.g. one common word between rows and columns), but the visualisation is so slow that it is very inconvenient to try and explore it.
Bokeh plot with letter composition:
Bokeh plot with only formal features (and a few unsystematic checks):
A few other plain technical issues arose with Bokeh:
- first and foremost, it is simply too slow; strangely enough, it is even slower with the 'formal features only' model, perhaps because squares are more concentrated; there is probably a way of controlling the number of squares on the screen, but for now it is not really usable;
- given my lack of knowledge, I could not include more data in the descriptions of the points (e.g. the letter diversity score), which would be essential for spotting quickly whether this or that feature is the cause of a cluster. The current features also produce points gathering a high number of squares, often more than the screen can display, which means that any lengthy individual description would make the plot even more unwieldy.
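Attaching extra per-square data to the points is in principle a matter of adding columns to a Bokeh ColumnDataSource and referencing them in hover tooltips; a minimal sketch, with illustrative column names and toy coordinates:

```python
# Sketch: attach per-square metadata (e.g. letter diversity) to the
# plotted points and expose it via hover tooltips.
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure

source = ColumnDataSource(data=dict(
    x=[0.1, 0.7], y=[0.4, 0.2],        # t-SNE coordinates (toy values)
    words=["ten eye new", "sat ore top"],  # the square's rows
    diversity=[5, 8],                  # letter diversity score
))

p = figure(width=600, height=600, title="Word squares (t-SNE)")
p.scatter("x", "y", source=source, size=6)
# '@column' in a tooltip pulls the value from the hovered point's row.
p.add_tools(HoverTool(tooltips=[("square", "@words"),
                                ("diversity", "@diversity")]))
```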
A clear objective for future investigation is not only to reach a better mastery of the whole algorithmic machinery, but also to expand the database I am using: 3-lettered wordsquares feel fairly restrictive, especially as the list I am using contains so many words that do not mean much at first sight (another improvement could have been either to reduce the size of my dictionary, or to find a way to rank the rarity of words, something that unfortunately I was not able to produce using the Word2Vec model at hand). Future developments could include a functional machine learning pipeline, improved features, especially in the semantic domain, as well as fully-fledged 4-lettered, 5-lettered (& above) wordsquare databases (the latter only achievable with a better understanding of look-up algorithms such as tries, which could help speed things up).
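The kind of trie-based pruning alluded to can be sketched briefly: while filling a larger square column by column, a quick 'does any word start with this prefix?' check abandons dead ends early. This is an illustrative sketch, not the ofxWordSquares implementation:

```python
# Sketch: a dict-based trie with a prefix look-up, the building block
# for pruning partial word squares during generation.
def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-word marker
    return root

def has_prefix(trie, prefix):
    node = trie
    for ch in prefix:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_trie(["sat", "sit", "top"])
print(has_prefix(trie, "si"), has_prefix(trie, "sx"))  # True False
```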
One of the rather sad conclusions of this first foray is that the sway of the 'empirical', 'manual' aspect of the search was not pushed back very far, and given the relative failure of the machine learning pipeline, mostly due to my inexperience in nearly every single one of the tools used (including Python itself), searching this database on a case-by-case basis remains the only way to approach singular squares. This can be done in one of two ways (or the two combined):
- starting from one or more formal features, until the size of the subset is small enough that it can be perused;
- starting with one word, and searching for all the squares containing it, in any line, column or diagonal, then selecting a second word and searching the first subset in the same fashion, thus gradually reducing the subset of squares.
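The second strategy can be sketched as a simple filter, assuming squares are stored as tuples of row strings; function names are illustrative:

```python
# Sketch: narrow the database to squares containing a given word in
# any row, column or diagonal, to be applied word after word.
def lines_of(square):
    rows = list(square)
    cols = ["".join(r[i] for r in rows) for i in range(3)]
    diags = [rows[0][0] + rows[1][1] + rows[2][2],
             rows[0][2] + rows[1][1] + rows[2][0]]
    return rows + cols + diags

def containing(squares, word):
    return [s for s in squares if word in lines_of(s)]

squares = [("ten", "eye", "new"), ("sat", "ore", "top")]
subset = containing(squares, "eye")
print(subset)  # [('ten', 'eye', 'new')]
```

Chaining calls to `containing` on successive words reproduces the gradual narrowing described above.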
Machine learning study notes & references
Patrick Harrison's talk Modern NLP in Python, and the accompanying Jupyter Notebook. In his presentation, Harrison mentions a paper called 'Word2Vec parameter learning explained', which could prove useful in the future: the paper and live demo website (called 'wevi').
T-SNE and dimensionality reduction affairs:
- A most useful and well-illustrated post by Wattenberg et al., 2016, 'How to Use t-SNE Effectively';
- Thanks to Rebecca Fiebrink's recommendation, Gene Kogan's ofxTSNE and ofxPCA for openFrameworks (as well as Parag Mital's course on Kadenze & [GitHub repo], which unfortunately I am yet to complete);
- A t-SNE live demo by Andrej Karpathy, hosted by Stanford University;
- The Scikit-learn t-SNE page;
- Memo Akten's openFrameworks Word2Vec implementation.
- Yet another tutorial recommended in Memo's repo.
- How to find pre-trained word vectors for NLP? A good list by 3Top on Github.
- A lecture on Word2Vec as part of the Stanford NLP course by Chris Manning and Richard Socher
- Bokeh, a library for interactive data visualisation; for future reference, a video introduction by Sarah Bird and some readily available notebooks;
- The SciPy Python library, including NumPy (which will be required if I ever get to the bottom of that cosine similarity, and is hugely useful anyway), Matplotlib, for 2D plotting, which might replace Bokeh, and Pandas, for data analysis;
- Gensim, a library for NLP, topic modelling, and Word2Vec, as well as a Gensim Word2Vec tutorial and the accompanying notebook; more tutorials and references here;