# Reading Machines
## Exploring the Linguistic Unconscious of AI

### Introduction: Two ways of thinking about computation

The history of computing revolves around efforts to automate the human labor of computation. And in many narratives of this history, the algorithm plays a central role. By _algorithm_, I refer to methods of reducing complex calculations and other operations to explicit formal rules, rules that can be implemented with rigor and precision by purely mechanical or electronic means.



But as a means of understanding ChatGPT and other forms of [generative AI](https://en.wikipedia.org/wiki/Generative_artificial_intelligence), a consideration of algorithms only gets us so far. In fact, when it comes to the [large language models](https://en.wikipedia.org/wiki/Large_language_model) that have captivated the public imagination, in order to make sense of their "unreasonable effectiveness," we must attend to another strand of computing, one which, though bound up with the first, manifests distinct pressures and concerns. Instead of formal logic and mathematical proof, this strand draws on traditions of thinking about data, randomness, and probability. And instead of the prescription of (computational) actions, it aims at the description and prediction of (non-computational) aspects of the world.


A key moment in this tradition, in light of later developments, remains Claude Shannon's* work on modeling the statistical structure of printed English ({cite}`shannon_mathematical_1948`). In this interactive document, we will use the [Python programming language](https://www.python.org) to reproduce a couple of the experiments that Shannon* reported in his famous article, in the hopes of pulling back the curtain a bit on what seems to many (and not unreasonably) to be evidence of a ghost in the machine. I, for one, do find many of these experiences haunting. But maybe the haunting doesn't happen where we at first assume.


The material that follows draws on and is inspired by my reading of Lydia Liu's _The Freudian Robot_, one of the few works in the humanities that I'm aware of to deal with Shannon's work in depth. See {cite}`liu_freudian_2010`.

### Two kinds of coding

Before we delve into our experiments, let's clarify some terminology. In particular, what do we mean by _code_? 

The demonstration below goes into a little more explicit detail, as far as the mechanics of Python are concerned, than the rest of this document. That's intended to motivate the contrast to follow, between the kind of code we write in Python, and the kind of coding that Shannon's* work deals with. 

#### Programs as code(s)

We imagine computers as machines that operate on 1's and 0's. In fact, the 1's and 0's are themselves an abstraction for human convenience: digital computation happens as a series of electronic pulses: switches that are either "on" or "off." (Think of counting to 10 by flipping a light switch on and off 10 times.)

Every digital representation -- everything that can be computed by a digital computer -- must be encoded, ultimately, in this binary form. 

But to make computers efficient for human use, many additional layers of abstraction have been developed on top of the basic binary layer. By virtue of using computers and smartphones, we are all familiar with the concept of an interface, which instantiates a set of rules prescribing how we are to interact with the device in order to accomplish well-defined tasks. These interactions get encoded down to the level of electronic pulses (and the results of the computation are translated back into the encoding of the interface). 

A programming language is also an interface: a text-based one. It represents a code into which we can translate our instructions for computation, in order for those instructions to be encoded further for processing. 

#### Baby steps in Python


Let's start with a single instruction. Run the following line of Python code by clicking the button. You won't see any output -- that's okay.

In [None]:
answer_to_everything = 42

In the encoding specified by the Python language, the equals sign (`=`) is an instruction that loosely translates to: "Store this value (on the right side) somewhere in memory, and give that location in memory the provided name (on the left side)." The following image presents one way of imagining what happens in response to this code (with the caveat that, ultimately, the letters and numbers are represented by their binary encoding).  

By running the previous line of code, we have created a _variable_, which maps the name `answer_to_everything` to the value `42`. We can use the variable to retrieve its value (for use in other parts of our program). Run the code below to see some output.

In [None]:
print(answer_to_everything)

The `print()` _function_ is a command in Python syntax that displays a value on the screen. Python's syntax picks out the following elements:
  - the name `print`
  - the parentheses that follow it, which enclose the _argument_
  - the argument itself, which in this case is a variable name (previously defined)

These elements are perfectly arbitrary (in the Saussurean sense). This syntax was invented by the designers of the Python language, though they drew on conventions found in other programming languages. The point is that nothing about the Python command `print(answer_to_everything)` makes its operation transparent; to know what it does, you have to know the language (or, at least, be familiar with the conventions of programming languages more generally) -- just as when learning to speak a foreign language, you can't deduce much about the meaning of the words from the way they look or sound.

However, unlike in so-called _natural languages_, in a programming language even minor deviations in syntax will usually cause errors, and errors will usually bring the whole program to a crashing halt.


Run the code below -- you should see an error message.

In [None]:
print(answer_to_everythin)

A misspelled variable name causes Python to abort its computation. Imagine if conversation ground to a halt whenever one of the parties mispronounced a word or used a malapropism!

I tend to say that Python is extremely literal. But of course, this is merely an analogy, and a loose one. There is no room for metaphor in programming languages, at least, not as far as the computation itself is concerned. The operation of a language like Python is determined by the algorithms used to implement it. Given the same input and the same conditions of operation, a given Python program should produce the same output every time. (If it does not, that's usually considered a bug.)

#### Encoding text

While _programming languages_ are ways of encoding algorithms, the operation of the resulting _programs_ does depend, in most cases, on more than just the algorithm itself. Programs depend on data. And in order to be used in computation, data must be encoded, too.

As an engineer at Bell Labs, Claude Shannon* wanted to find -- mathematically -- the most efficient means of encoding data for electronic transmission. Note that this task involves a rather different set of factors from those that influence the design of a programming language.

The designer of the language has the luxury of insisting on a programmer's fidelity to the specified syntax. In working in Python, we have to write `print(42)`, exactly as written, in order to display the number `42` on the screen. If we forget the parentheses, for instance, the command won't work. But when we talk on the phone (or via Zoom, etc.), it would certainly be a hassle if we had to first translate our words into a strict, fault-intolerant code like that of Python.

All the same, there is no digital (electronic) representation without encoding. To refer to the difference between these two types of codes, I am drawing a distinction between _algorithms_ and _data_. Shannon's* work illustrates the importance of this distinction, which remains relevant to any consideration of machine learning and generative AI.

#### Representing text in Python

Before we turn to Shannon's* experiments with English text, let's look briefly at how Python represents text as data.

In [None]:
a_text = "Most noble and illustrious drinkers, and you thrice precious pockified blades (for to you, and none else, do I dedicate my writings), Alcibiades, in that dialogue of Plato's, which is entitled The Banquet, whilst he was setting forth the praises of his schoolmaster Socrates (without all question the prince of philosophers), amongst other discourses to that purpose, said that he resembled the Silenes."

Running the code above creates a new variable, `a_text`, and assigns to it a _string_ representing the first sentence from François Rabelais' early Modern novel, _Gargantua and Pantagruel_. A string is the most basic way in Python of representing text, where "text" means anything that is not to be treated purely as a numeric value.

Anything between quotation marks (either double `""` or single `''`) is a string.

One problem with strings in Python (and other programming languages) is that they have very little structure. A Python string is a sequence of characters, where a _character_ is a letter of a recognized alphabet, a punctuation mark, a space, etc. Each character is stored in the computer's memory as a numeric code, and from that perspective, all characters are essentially equal. We can access a single character in a string by supplying its position. (Python counts characters in strings from left to right, starting with 0, not 1, for the first character.)

In [None]:
a_text[5]
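To make the "numeric code" behind each character a little more visible, here is a minimal sketch using Python's built-in `ord()`, `chr()`, and `format()` functions. The names `sample`, `code`, `bits`, and `first_letter` are illustrative only and are not used elsewhere in this document.

```python
sample = "Most"

for character in sample:
    code = ord(character)        # the number Python uses for this character
    bits = format(code, "08b")   # the same number written in binary, padded to 8 digits
    print(character, code, bits)

# chr() goes in the other direction: from a numeric code back to a character.
first_letter = chr(77)
print(first_letter)
```

From the computer's point of view, an `M` is just one more number; nothing in the code itself marks it as a capital letter or as the start of a sentence.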

We can access a sequence of characters -- here, the 11th through 50th characters (positions 10 through 49, counting from 0).

In [None]:
a_text[10:50]

We can even divide the string into pieces, using the occurrences of particular characters. The code below divides our text on the white space, returning a _list_ (another Python construct) of smaller strings.

In [None]:
a_text.split()

The strings in the list above correspond, loosely, to the individual words in the sentence from Rabelais' text. But Python really has no concept of "word," neither in English, nor any other (natural) language. 
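A quick way to see that these pieces are not really "words" is to look at what happens to punctuation. The sketch below uses a shortened phrase from the same sentence; the cleanup step with `string.punctuation` is one rough workaround among many, offered only as an illustration.

```python
import string

phrase = "Most noble and illustrious drinkers, and you thrice precious pockified blades"

# Splitting on white space leaves the comma attached to its neighbor.
pieces = phrase.split()
print(pieces[4])

# A rough cleanup: strip punctuation marks from the edges of each piece.
words = [piece.strip(string.punctuation) for piece in pieces]
print(words[4])
```

Whatever counts as a word here is something we decide and impose on the string; Python only ever sees characters.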

### Language & chance

It's probably fair to say that when Shannon* was developing his mathematical approach to encoding information, the algorithmic ideal  dominated computational research in Western Europe and the United States. In previous decades, philosophers like Bertrand Russell and mathematicians like David Hilbert had sought to develop a formal approach to mathematical proof, an approach that, they hoped, would ultimately unify the scientific disciplines. The goal of such research was to identify a core set of axioms, or logical rules, in terms of which all other "rigorous" methods of thought could be expressed. In other words, to reduce to zero the uncertainty and ambiguity plaguing natural language as a tool for expression: to make language algorithmic.

Working within this tradition, Alan Turing had developed his model of what would become the digital computer. 

But can language as humans use it be reduced to such formal rules? On the face of it, it's easy to think not. However, that conclusion presents a problem for computation involving natural language, since the computer is, at bottom, a formal-rule-following machine. Shannon's* work implicitly challenges the assumption that we need to resort to formal rules in order to deal with the uncertainty in language. Instead, he sought mathematical means for _quantifying_ that uncertainty.  And as Lydia Liu points out, that effort began with a set of observations about patterns in printed English texts.



#### The long history of code

Of course, Shannon's* insights do not begin with Shannon*. He was preceded by a long history of speculation on what we might call the statistical features of language: speculation of some practical urgency, given the even longer history of cryptographic communication in political, military, and other contexts.

In the 9th Century CE, the Arab mathematician and philosopher Al-Kindi composed a work on cryptography in which he applied the relative frequency of letters in Arabic to a method for decrypting coded text ({cite}`broemeling_account_2011`). Al-Kindi, alongside his many other accomplishments, composed the earliest surviving analysis of this kind, which is a direct precursor of methods popular in the digital humanities (word frequency analysis), among many other domains.

Closer yet to the hearts of digital humanists, the Russian mathematician Andrei Markov, in a 1913 address to the Russian Academy of Sciences, reported on the results of his experiment with Aleksandr Pushkin's _Evgenii Onegin_: a statistical analysis of the occurrences of consonants and vowels in the first two chapters of Pushkin's novel in verse ({cite}`markov_example_2006`). From the perspective of today's large language models, Markov improved on Al-Kindi's methods by counting not just isolated occurrences of vowels or consonants, but co-occurrences: that is, where a vowel follows a consonant, a consonant a vowel, etc. As a means of articulating the structure of a sequential process, Markov's method generalizes into a powerful mathematical tool, to which he lends his name. We will see how Shannon* used [Markov chains](https://en.wikipedia.org/wiki/Markov_chain) shortly.
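To give a feel for the kind of tally Markov made, here is a minimal sketch in Python. It is not a reconstruction of his procedure: the sample is a scrap of English rather than Pushkin's Russian, and treating only `a`, `e`, `i`, `o`, `u` as vowels is a simplifying assumption.

```python
from collections import Counter

# Illustrative sample only; Markov worked with the Russian text of Pushkin's poem.
sample = "most noble and illustrious drinkers"
vowels = "aeiou"  # simplifying assumption about what counts as a vowel

# Reduce the text to a sequence of classes: "V" for vowel, "C" for consonant.
classes = ["V" if ch in vowels else "C" for ch in sample if ch.isalpha()]

# Count each pair of neighboring classes: vowel after consonant, and so on.
pair_counts = Counter(zip(classes, classes[1:]))
for (first, second), count in sorted(pair_counts.items()):
    print(first, "->", second, count)
```

The counts of pairs, rather than of single letters, are what carry information about sequence: how likely a vowel is given that a consonant has just occurred.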

#### A spate of tedious counting

First, however, let's illustrate the more basic method, just to get a feel for its effectiveness.

We'll take a text of sufficient length. Urquhart's English translation of _Gargantua and Pantagruel_, in the Everyman's Library edition, clocks in at 823 pages; that's a decent sample. If we were following the methods used by Al-Kindi, Markov, or even Shannon* himself, we would proceed as follows:
  1. Make a list of the letters of the alphabet on a sheet of paper.
  2. Go through the text, letter by letter.
  3. Beside each letter on your paper, make one mark each time you encounter that letter in the text.

Fortunately for us, we can avail ourselves of a computer to do this work. 

In the following sections of Python code, we download the Project Gutenberg edition of Rabelais' novel, saving it to the computer as a text file. We can read the whole file into the computer's memory as a single Python string. Then using a property of Python strings that allows us to _iterate_ over them, we can automate the process of counting up the occurrences of each character.

In [None]:
from urllib.request import urlretrieve
urlretrieve("https://www.gutenberg.org/cache/epub/1200/pg1200.txt", "gargantua.txt")

In [None]:
with open('gargantua.txt', encoding='utf-8') as f:
    g_text = f.read()

The code below uses the `len()` function to display the length -- in characters -- of a string.

In [None]:
len(g_text)

The Project Gutenberg version of _Gargantua and Pantagruel_ has close to 2 million characters.

As an initial exercise, we can count the frequency with which each character appears. Run the following section of code to create a structure mapping each character to its frequency.

In [None]:
g_characters = {}
for character in g_text:
    if character in g_characters:
        g_characters[character] += 1
    else:
        g_characters[character] = 1

Run the code below to reveal the frequencies.

In [None]:
g_characters
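For comparison, Python's standard library offers a ready-made tool for this kind of tally, `collections.Counter`. The sketch below applies it to a short stand-in string (`sample`) rather than to `g_text`, so that it can be run on its own; the loop above and the `Counter` should produce the same counts for the same text.

```python
from collections import Counter

sample = "most noble and illustrious drinkers"  # stand-in for the full text

# Counter walks through the string once and tallies each character.
sample_counts = Counter(sample)

print(sample_counts["s"])            # how many times "s" occurs
print(sample_counts.most_common(3))  # the three most frequent characters
```

Writing the loop out by hand, as we did above, has the advantage of showing that nothing more mysterious than counting is going on.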

Looking at the contents of `g_characters`, we can see that it consists of more than just the letters in standard [Latin script](https://en.wikipedia.org/wiki/Latin_script). There are punctuation marks, numerals, and other symbols, like `\n`, which represents a line break. 

But if we look at the 10 most commonly occurring characters, with one exception, they align well with the [relative frequency of letters in English](https://en.wikipedia.org/wiki/Letter_frequency) as reported from studying large textual corpora.

In [None]:
sorted(g_characters.items(), key=lambda x: x[1], reverse=True)[:10]
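Published letter frequencies are usually given as proportions rather than raw counts, so a comparison is easier if we divide each count by the total. The sketch below shows the arithmetic on a handful of made-up counts (the numbers in `toy_counts` are hypothetical, chosen only to keep the example small); the same steps would apply to `g_characters`.

```python
# Hypothetical counts, for illustration only.
toy_counts = {" ": 6, "e": 5, "t": 3, "z": 1}

total = sum(toy_counts.values())

# Divide each count by the total to get a relative frequency.
relative = {ch: n / total for ch, n in toy_counts.items()}

for ch, share in sorted(relative.items(), key=lambda x: x[1], reverse=True):
    print(repr(ch), f"{share:.1%}")
```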

#### Random writing

At the heart of Shannon's* method lies the notion of _random sampling_. It's perhaps easiest to illustrate this concept before defining it.

Using more Python code, let's compare what happens when we construct two random samples of the letters of the Latin script, one in which we select each letter with equal probability, and the other in which we weight our selections according to the frequency we have computed above.

In [None]:
from random import choices
alphabet = "abcdefghijklmnopqrstuvwxyz"
print("".join(choices(alphabet, k=50)))

The code above uses the `choices()` function to create a sample of 50 letters, where each letter is equally likely to appear in our sample. Imagine rolling a 26-sided die, with a different letter on each face, 50 times, writing down the letter that comes up on top each time.

Now let's run this trial again, this time supplying the observed frequency of the letters in _Gargantua and Pantagruel_ as weights to the sampling. (For simplicity's sake, we first remove everything but the 26 lowercase letters of the Latin script, discarding numbers, punctuation marks, spaces, letters with accent marks, etc.)

In [None]:
g_alpha_chars = {}
for c, n in g_characters.items():
    if c in alphabet:
        g_alpha_chars[c] = n
letters = list(g_alpha_chars.keys())
weights = g_alpha_chars.values()
print(''.join(choices(letters, weights, k=50)))

Do you notice any difference between the two results? It depends to some extent on the roll of the dice, since both selections are still random. But you might see _more_ runs of letters in the second that resemble sequences you could expect in English, maybe even a word or two hiding in there.
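Because the results depend on the roll of the dice, they will differ every time the code runs. If you want to compare notes with someone else, one option is to fix the starting point of Python's random number generator with `random.seed()`. The seed value below (1948) is arbitrary; this is a minimal sketch, not something the experiments in this document require.

```python
import random

alphabet = "abcdefghijklmnopqrstuvwxyz"

random.seed(1948)  # arbitrary seed: fixes the sequence of "random" draws
first_run = "".join(random.choices(alphabet, k=50))

random.seed(1948)  # resetting the same seed replays the same draws
second_run = "".join(random.choices(alphabet, k=50))

print(first_run)
print(first_run == second_run)
```

The randomness here is itself the product of an algorithm, which is why the same seed yields the same "random" letters.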

#### The difference a space makes

On Liu's telling, one of Shannon's* key innovations was his realization that in analyzing _printed_ English, the _space between words_ counts as a character. It's the spaces that delimit words in printed text; without them, our analysis fails to account for word boundaries. 

Let's see what happens when we include the space character in our frequencies.

In [None]:
g_shannon_chars = {}
for c, n in g_characters.items():
    if c in alphabet or c == " ":
        g_shannon_chars[c] = n
letters = list(g_shannon_chars.keys())
weights = g_shannon_chars.values()
print(''.join(choices(letters, weights, k=50)))

It may not seem like much improvement, but now we're starting to see sequences of recognizable "word length," considering the average lengths of words in English. 

But note that we haven't so far actually tallied anything that would count as a word: we're still operating exclusively at the level of individual characters or letters.

#### Law-abiding numbers

To unpack what we're doing a little more: when we make a _weighted_ selection from the letters of the alphabet, using the frequencies we've observed, it's equivalent to drawing letters out of a bag of Scrabble tiles, where different tiles appear in different amounts. If there are 5 `e`'s in the bag but only 1 `z`, you might draw a `z`, but over time, you're more likely to draw an `e`. And if you make repeated draws, recording the letter you draw each time before putting it back in the bag, your final tally of letters will usually have more `e`'s than `z`'s.

In probability theory, this expectation is called [the law of large numbers](https://en.wikipedia.org/wiki/Law_of_large_numbers). It describes the fundamental intuition behind the utility of averages, as well as their limitation: sampling better approximates the mathematical average as the samples get larger, but in every case, we're talking about behavior in the aggregate, not the individual case. 
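We can watch this happen with the bag described above (5 `e`'s and 1 `z`). The sketch below draws from the bag with replacement, in samples of increasing size, and reports the share of `e`'s each time. The sample sizes and the seed are arbitrary choices made for the sake of the illustration.

```python
import random

random.seed(0)  # arbitrary seed, so that reruns are comparable

tiles = ["e"] * 5 + ["z"]  # the bag: 5 e's and 1 z
expected_share = 5 / 6     # the proportion of e's in the bag

for draws in (10, 100, 10_000):
    sample = random.choices(tiles, k=draws)  # draw with replacement
    share = sample.count("e") / draws
    print(draws, round(share, 3), "expected:", round(expected_share, 3))
```

With only 10 draws, the share of `e`'s can stray a fair distance from 5/6; with 10,000 draws, it should sit very close. No single draw, however, is any more predictable than before.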

### Language as a drunken walk

How effectively can we model natural language using statistical means? It's worth dwelling on the assumptions latent in this question. Parts of speech, word order, syntactic dependencies, etc: none of these classically linguistic entities come up for discussion in Shannon's* article. Nor are there any claims therein about underlying structures of thought that might map onto grammatical or syntactic structures, such as we find in the Chomskian theory of [generative grammar](https://en.wikipedia.org/wiki/Generative_grammar). The latter theory remains squarely within the algorithmic paradigm: the search for formal rules or laws of thought. 

Language, in Shannon's* treatment, resembles a different kind of phenomenon: biological populations, financial markets, or the weather. In each of these systems, it is taken as a given that there are simply too many variables at play to arrive at the kind of description that would even remotely resemble the steps of a formally logical proof. Rather, the systems are described, and attempts are made to predict their behavior over time, drawing on observable patterns held to be valid in the aggregate.

Whether the human linguistic faculty is best described in terms of formal, algorithmic rules, or as something else (emotional weather, perhaps), was not a question germane to Shannon's* analysis. In the introduction to his 1948 article, he claims that the "semantic aspects of communication are irrelevant to the engineering problem" (i.e., the problem of devising efficient means of encoding messages, linguistic or otherwise). These "semantic aspects," excluded from "the engineering problem," return to haunt the scene of generative AI with a vengeance. But in order to set this scene, let's return to Shannon's* experiments.

Following Andrei Markov, Shannon* modeled printed English as a Markov chain: as a special kind of weighted selection where the weights of the current selection depend _only_ on the immediately previous selection. A Markov chain is often illustrated as a _random walk_, and the conventional picture is of a person who has had a bit too much to drink stumbling about. Observing such a situation, you might not be able to determine where the person is trying to go; all you can predict is that their next position will fall within stumbling distance of where they're standing right now. Or if you prefer a less Rabelaisian metaphor, imagine threading your way among a host of puddles. With each step, you try to keep to dry land, but your path is likely to be anything but linear.
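The stumbling walk can be sketched in a few lines of Python. This is the simplest version I can think of, assuming a walker on a straight line who takes one step left or right with equal probability; the number of steps and the seed are arbitrary.

```python
import random

random.seed(7)  # arbitrary seed, for repeatable stumbling

position = 0
path = [position]

for _ in range(20):
    step = random.choice([-1, 1])  # one step left or right, equally likely
    position = path[-1] + step     # the next position depends only on the current one
    path.append(position)

print(path)
```

Notice that the code never consults the earlier parts of `path` when deciding the next step: only the most recent position matters. That is the defining property of a Markov chain.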

It turns out that Markov chains can be used to model lots of processes in the physical world. And they can be used to model language, too, as Claude Shannon* showed.

#### More tedious counting

One way to construct such an analysis is as follows: represent your sample of text as a continuous string of characters. (As we've seen, that's easy to do in Python.) Then "glue" it to another string, representing the same text, but with every character shifted to the left by one position. For example, the first several characters of the first sentence from _Gargantua and Pantagruel_ would look like this:

![The text "Most noble and illust" is shown twice, on two consecutive lines, with each letter surrounded by a box. The second line is shifted to the left one character, so that the "M" of the first line appears above the "o" of the second line, etc.
](https://gwu-libraries.github.io/engl-6130-dugan/_images/rabelais-1.png)
With the exception of the dangling left-most and right-most characters, you now have a pair of strings that yield, for each position, a pair of characters. In the image below, the first few successive pairs are shown, along with the position of each pair of characters with respect to the "glued" strings.

![A table with the letters "h," a space, "o," "e," and "i" along the top (column headers), and "t," space, "c," "w," "s," and "g" along the left-hand side (row labels), and numbers in the cells of the table. 
](https://gwu-libraries.github.io/engl-6130-dugan/_images/rabelais-2.png)
These pairs are called bigrams. But in order to construct a Markov chain, we're not just counting bigrams. Rather, we want to create what's called a _transition table_: a table where we can look up a given character -- the letter `e`, say -- and then for any other character that can follow `e`, find the frequency with which it occurs in that position (i.e., following an `e`). If a given character never follows another character, its bigram doesn't exist in the table. 
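Before building the full table, it may help to see the "gluing" step on a toy string. The sketch below pairs a short sample with a copy of itself shifted one position to the left, tallies the resulting bigrams, and then looks up which characters follow the letter `o`. The names `sample`, `pairs`, `bigram_counts`, and `followers_of_o` are illustrative and are not used in the rest of this document.

```python
from collections import Counter

sample = "most noble and illustrious drinkers"

# "Glue" the text to a copy of itself shifted one position to the left.
pairs = list(zip(sample, sample[1:]))
print(pairs[:4])

# Tally how often each pair of neighboring characters occurs.
bigram_counts = Counter(pairs)

# One "row" of a transition table: everything that follows the letter "o".
followers_of_o = {
    second: count
    for (first, second), count in bigram_counts.items()
    if first == "o"
}
print(followers_of_o)
```

In a sample this small, every follower of `o` occurs exactly once; with a whole novel to count from, the rows of the table become much more lopsided, and that lopsidedness is what makes them useful.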

Below are shown some of the most common bigrams in such a transition table created on the basis of _Gargantua and Pantagruel_.


#### Preparing the text

To simplify our analysis, first we'll standardize the source text a bit. Removing punctuation and non-alphabetic characters, removing extra runs of white space and line breaks, and converting everything to lowercase will make patterns in the results easier to see (though it's really sort of an aesthetic choice, and as I've suggested, Shannon's* method doesn't presuppose any essential difference between the letters of words and the punctuation marks that accompany them). 

Run the two code sections below to clean the text of _Gargantua and Pantagruel_.

In [None]:
def normalize_text(text):
    '''
    Reduces the provided string to a string consisting of just alphabetic, lowercase characters from the Latin script and non-contiguous spaces.
    '''
    text_lower = text.lower()
    text_lower = text_lower.replace("\n", " ").replace("\t", " ")
    text_norm = ""
    for char in text_lower:
        if (char in "abcdefghijklmnopqrstuvwxyz") or (char == " " and not text_norm.endswith(" ")):
            text_norm += char
    return text_norm

In [None]:
g_text_norm = normalize_text(g_text)
g_text_norm[:1000]

This method isn't perfect, but we'll trust that any errors -- like the disappearance of accented characters from French proper nouns, etc. -- will get smoothed over in the aggregate. 

#### Setting the table

To create our transition table of bigrams, we'll define two new functions in Python. The first function, `create_ngrams`, generalizes a bit from our immediate use case; by setting the parameter called `n` in the function call to a number higher than 2, we can create combinations of three or more successive characters (trigrams, quadgrams, etc.). This feature will be useful a little later.

Run the code below to define the function.

In [None]:
def create_ngrams(text, n=2):
    '''
    Creates a series of ngrams out of the provided text argument. The argument n determines the size of each ngram; n must be greater than or equal to 2. 
    Returns a list of ngrams, where each ngram is a Python tuple consisting of n characters.
    '''
    text_arrays = []
    for i in range(n):
        last_index = len(text) - (n - i - 1)
        text_arrays.append(text[i:last_index])
    return list(zip(*text_arrays))

Let's illustrate our function with a small text first. The output is a Python list, which contains a series of additional collections (called tuples) nested within it. Each subcollection corresponds to a 2-character window, and the window is moved one character to the right each time. 

This structure will allow us to create our transition table, showing which characters follow which other characters most often. 

In [None]:
text = 'abcdefghijklmnopqrstuvwxyz'
create_ngrams(text, 2)

Run the code section below to define another function, `create_transition_table`, which does what its name suggests.

In [None]:
from collections import Counter
def create_transition_table(ngrams):
    '''
    Expects as input a list of tuples corresponding to ngrams.
    Returns a dictionary of dictionaries, where the keys to the outer dictionary consist of strings corresponding to the first n-1 elements of each ngram.
    The values of the outer dictionary are themselves dictionaries, where the keys are the nth elements of each ngram, and the values are their frequencies of occurrence.
    '''
    n = len(ngrams[0])
    ttable = {}
    for ngram in ngrams:
        key = "".join(ngram[:n-1])
        if key not in ttable:
            ttable[key] = Counter()
        ttable[key][ngram[-1]] += 1
    return ttable

Now run the code below to create the transition table for the bigrams in the alphabet.

In [None]:
create_transition_table(create_ngrams(text, 2))

Here our transition table consists of frequencies that are all 1, because (by definition) each letter occurs only once in the alphabet. The way to read the table, however, is as follows:
> The letter `b` occurs after the letter `a` 1 time in our (alphabet) sample.
> 
> The letter `c` occurs after the letter `b` 1 time in our sample.
> 
> ...

Now let's use these functions to create the transition table with bigrams from _Gargantua and Pantagruel_.

In [None]:
g_ttable = create_transition_table(create_ngrams(g_text_norm, 2))

Our table will now be significantly bigger. But let's use it to see how frequently the letter `e` follows the letter `h` in our text:

In [None]:
g_ttable['h']['e']
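A raw count like this becomes easier to interpret when expressed as a share of everything that follows `h`. The sketch below shows the arithmetic on a toy row with hypothetical counts (these numbers are not taken from the novel); the same calculation could be applied to `g_ttable['h']`.

```python
from collections import Counter

# A toy row of a transition table, with hypothetical counts.
toy_ttable = {"h": Counter({"e": 6, "a": 3, "i": 1})}

row = toy_ttable["h"]
row_total = sum(row.values())  # how many times "h" was followed by anything

# The share of those occasions on which the follower was "e".
share_e_after_h = row["e"] / row_total
print(share_e_after_h)
```

This share is an estimate of a conditional probability: how likely an `e` is, given that we have just seen an `h`.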

We can visualize our table fairly easily by using a Python library called [pandas](https://pandas.pydata.org/).

Run the code below, which may take a moment to finish.

In [None]:
import pandas as pd
pd.set_option("display.precision", 0)
pd.DataFrame.from_dict(g_ttable, orient='index')

To read the table, select a row for the first letter, and then a column to find the frequency of the column letter appearing after the letter in the row. (In other words, read across then down.)

The space character appears as the empty column/row label in this table. 

### Automatic writing

In Shannon's* article, these kinds of transition tables are used to demonstrate the idea that English text can be effectively represented as a Markov chain. And to effect the demonstration, Shannon* presents the results of _generating_ text by weighted random sampling from the transition tables.  

To visualize how the weighted sampling works, imagine the following:
  1. You choose a row at random on the transition table above, writing its character down on paper.
  2. The numbers in that row correspond to the observed frequencies of characters following the character corresponding to that row.
  3. You fill a bag with Scrabble tiles, using as many tiles for each character as indicated by the corresponding cell in the selected row. If a cell has `NaN` in it -- the null value -- you don't put any tiles of that character in the bag.
  4. You draw one tile from the bag. You write down the character you just selected. This character indicates the next row on the table.
  5. Using that row, you repeat steps 2 through 4. And so on, for however many characters you want to include in your sample.

Run the code below to define a function that will do this sampling for us.

In [None]:
def create_sample(ttable, length=100):
    '''
    Using a transition table of ngrams, creates a random sample of the provided length (default is 100 characters).
    '''
    starting_chars = list(ttable.keys())
    first_char = last_char = choices(starting_chars, k=1)[0]
    l = len(first_char)
    generated_text = first_char
    for _ in range(length):
        chars = list(ttable[last_char].keys())
        weights = list(ttable[last_char].values())
        next_char = choices(chars, weights, k=1)[0]
        generated_text += next_char
        last_char = generated_text[-l:]
    return generated_text

In [None]:
create_sample(g_ttable)

Run the code above a few times for the full effect. It's still nonsense, but maybe it seems more like recognizable nonsense -- meaning nonsense that a human being who speaks English might make up -- compared with our previous randomly generated examples. If you agree that it's more recognizable, can you pinpoint features or moments that make it so?

Personally, it reminds me of the outcome of using a Ouija board: recognizable words almost emerging from some sort of pooled subconscious, then sinking back into the murk before we can make any sense out of them. 

#### More silly walks

More adept Ouija-board users can be simulated by increasing the size of our n-grams. As Shannon's* article demonstrates, the approximation to the English lexicon improves by moving from bigrams to trigrams -- such that frequencies are calculated in terms of the occurrence of a given letter immediately after a pair of letters.

So instead of a table like this:

![A table with the letters "h," a space, "o," "e," and "i" along the top (column headers), and "t," space, "c," "w," "s," and "g" along the left-hand side (row labels), and numbers in the cells of the table. 
](https://gwu-libraries.github.io/engl-6130-dugan/_images/bigram-table.png)

we have this (where the `h`, `b`, and `w` in the row labels are all preceded by the space character):

![A table with the letters "e," "a," space, "i," "o" along the top (column headers), and "th," space "h", space "b"," "er," and space "w" along the left-hand side (row labels), and numbers in the cells of the table. 
](https://gwu-libraries.github.io/engl-6130-dugan/_images/bigram-table-2.png)
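
To make the difference concrete, here is a small sketch on a toy string. The helper `toy_transition_table` is hypothetical, written only for this illustration; it is not the function used elsewhere in this notebook, though it is meant to build the same kind of table. Notice how the single bigram row for `e` pools together contexts that the trigram table keeps apart:

```python
from collections import Counter

def toy_transition_table(text, n):
    '''
    Hypothetical helper, for illustration only: maps each run of n-1
    characters in text to a Counter of the characters that follow it.
    '''
    table = {}
    for i in range(len(text) - n + 1):
        context, following = text[i:i + n - 1], text[i + n - 1]
        table.setdefault(context, Counter())[following] += 1
    return table

toy_text = 'the theme then thaws'

bigram_rows = toy_transition_table(toy_text, 2)   # rows labeled by one character
trigram_rows = toy_transition_table(toy_text, 3)  # rows labeled by two characters

print(bigram_rows['e'])    # everything that follows "e", whatever came before it
print(trigram_rows['he'])  # only what follows "e" when it is preceded by "h"
print(trigram_rows['me'])  # only what follows "e" when it is preceded by "m"
```

With longer row labels, each row records fewer but more specific continuations, which is why samples drawn from a trigram table tend to look more like the source.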

Note, however, that throughout these experiments, the level of approximation to any particular understanding of "the English lexicon" depends on the nature of the data from which we derive our frequencies. Urquhart's translation of Rabelais, dating from the 17th century, has a rather distinctive vocabulary, as you might expect, even with the modernized spelling and grammar of the Project Gutenberg edition. 

The code below defines some interactive controls to make our experiments easier to manipulate. Run both sections of code to create the controls.

In [None]:
import ipywidgets as widgets
from IPython.display import display

def create_slider(min_value=2, max_value=5):
    return widgets.IntSlider(
            value=2,
            min=min_value,
            max=max_value,
            description='Set value of n:')
    
def create_update_function(text, transition_function, slider):
    '''
    returns a callback function for use in updating the provided transition table with ngrams from text, given slider.value, as well as an output widget
    for displaying the output of the callback
    '''
    output = widgets.Output()
    def on_update(change):
        with output:
            global ttable
            ttable = transition_function(create_ngrams(text, slider.value))
            print(f'Updated! Value of n is now {slider.value}.')
    return on_update, output

def create_generate_function(sample_function, slider):
    '''
    returns a callback function for use in generating new random samples from the provided transition table.
    '''
    output = widgets.Output()
    def on_generate(change):
        with output:
            print(f'(n={slider.value}) {sample_function(ttable)}')
    return on_generate, output
    
def create_button(label, callback):
    '''
    Creates a new button with the provided label, and sets its click handler to the provided callback function
    '''
    button = widgets.Button(description=label)
    button.on_click(callback)
    return button

In [None]:
ttable = g_ttable
ngram_slider = create_slider()
update_callback, update_output = create_update_function(g_text_norm, create_transition_table, ngram_slider)
update_button = create_button("Update table", update_callback)
generate_callback, generate_output = create_generate_function(create_sample, ngram_slider)
generate_button = create_button("New sample", generate_callback)
display(ngram_slider, update_button, update_output, generate_button, generate_output)


Use the slider above to change the value of `n`. Click `Update table` to recreate the transition table using the new value of `n`. Then use the `New sample` button to generate a new, random sample of text from the transition table. You can generate as many samples as you like, and you can update the size of the ngrams in between in order to compare samples of different sizes.

What do you notice about the effect of higher values of `n` on the nature of the random samples produced? 

### A Rabelaisian chatbot

Following Shannon's* article, we can observe the same phenomena using whole words to create our n-grams. I find such examples more compelling, perhaps because I find it easier or more fun to look for the glimmers of sense in random strings of words than in random strings of letters, which may or may not be recognizable words. 

But the underlying procedure is the same. We first create a list of "words" out of our normalized text by splitting the latter on the occurrences of white space. As a result, instead of a single string containing the entire text, we'll have a Python list of strings, each of which is a word from the original text.

Note that this process is not a rigorous way of tokenizing a text. If that is your goal -- to split a text into words, in order to employ word-frequency analysis or similar techniques -- there are very useful [Python libraries](https://spacy.io/) for this task, which use sophisticated tokenizing techniques.

For purposes of our experiment, however, splitting on white space will suffice.

In [None]:
g_text_words = g_text_norm.split()
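
As a quick illustration of what splitting on white space does and does not do, here is a sketch using an invented sentence (not a line from our source text). Called with no arguments, `str.split()` treats any run of spaces, tabs, or line breaks as a single separator, but it leaves punctuation attached to the neighboring word. How much that matters for us depends on how much punctuation survives the normalization step.

```python
# An invented sentence with a double space, a line break, and punctuation.
raw = "Grandgousier  was a good fellow,\nand loved to drink."

# With no arguments, split() breaks on any run of white space.
words = raw.split()
print(words)

# Punctuation stays attached, so 'fellow,' and 'fellow' are different "words."
print('fellow,' in words, 'fellow' in words)
```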

From here, we can create our ngrams and transition table as before. First, we just need to modify our previous code to put the spaces back (since we took them out in order to create our list of words). 

Run the code sections below to create some new functions, and then to create some more HTML controls for these functions.

In [None]:
def create_ttable_words(ngrams):
    '''
    Expects as input a list of tuples corresponding to ngrams.
    Returns a dictionary of dictionaries, where the keys to the outer dictionary are tuples holding the first n-1 elements of each ngram.
    The values of the outer dictionary are themselves dictionaries, where the keys are one-element tuples holding the nth element of each ngram, and the values are the frequency of occurrence.
    '''
    n = len(ngrams[0])
    ttable = {}
    for ngram in ngrams:
        key = ngram[:n-1]
        if key not in ttable:
            ttable[key] = Counter()
        ttable[key][(ngram[-1],)] += 1
    return ttable
    
def create_sample_words(ttable, length=100):
    '''
    Using a transition table of ngrams, creates a random sample of the provided length (default is 100 words).
    '''
    starting_words = list(ttable.keys())
    first_words = last_words = tuple(choices(starting_words, k=1)[0])
    n = len(first_words)
    text = list(first_words)
    for _ in range(length):
        words = list(ttable[last_words].keys())
        weights = list(ttable[last_words].values())
        next_word = choices(words, weights, k=1)[0]
        text.append(next_word[0])
        last_words = tuple(text[-n:])
    return " ".join(text)

In [None]:
ttable = create_ttable_words(create_ngrams(g_text_words))
ngram_slider_w = create_slider()
update_callback_w, update_output_w = create_update_function(g_text_words, create_ttable_words, ngram_slider_w)
update_button_w = create_button("Update table", update_callback_w)
generate_callback_w, generate_output_w = create_generate_function(create_sample_words, ngram_slider_w)
generate_button_w = create_button("New sample", generate_callback_w)
display(ngram_slider_w, update_button_w, update_output_w, generate_button_w, generate_output_w)


Use the slider and buttons above to generate sample text for various values of `n`. Samples are based on n-grams of words from the source text.

### How drunken was our walk?

In his article, Shannon* reports various results of these experiments, using different values for `n` with both letter- and word-frequencies. He includes the following sample, apparently produced at random with word bigrams, though he does not disclose the particular textual sources from which he derived his transition tables:

>THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHOEVER TOLD THE PROBLEM FOR AN UNEXPECTED.

I've always thought that Shannon's* example seems suspiciously fortuitous, given its mention of attacks on English writers and methods for letters, etc. Who knows how many trials he made before he got this result (assuming he didn't fudge anything). All the same, one of the enduring charms of the "Markov text generator" is its propensity to produce uncanny stretches of text that, as Shannon* writes, sound "not at all unreasonable." 

A question does arise: how novel are these stretches? In other words, what proportion of the generated sample is unique relative to the source? One approach to the question is to think in terms of unique n-grams. When using a value of 3 for `n`, by definition every three-word sequence in our generated sample will match some sequence in the source text. But what about sequences of 4 words? Just looking at the samples we've created, it's clear that at least some of these are novel, since some are plainly nonsense and not likely to appear in Rabelais' text. 

We might measure their novelty by creating a lot of samples and then, for each sample, calculating the percentage of 4-word n-grams that are _not_ in the source text. Running this procedure over 1,000 samples, I arrive at an average of 40% -- so a little less than half of all the 4-word sequences across all the samples are sequences that do _not_ appear in Rabelais' text. 
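
A minimal sketch of such a measurement follows. The function names `word_ngrams` and `novelty_rate` are hypothetical, and the data is a toy corpus invented for the purpose, so this is not necessarily the code that produced the figure above. The second sample is one that a bigram model could generate from the toy source, since every pair of adjacent words in it occurs there, yet it contains a four-word sequence that the source does not.

```python
def word_ngrams(words, k):
    '''Hypothetical helper: all runs of k consecutive words, as tuples.'''
    return [tuple(words[i:i + k]) for i in range(len(words) - k + 1)]

def novelty_rate(sample_words, source_words, k=4):
    '''
    Fraction of the k-word sequences in sample_words that never occur in
    source_words. Returns 0.0 if the sample is shorter than k words.
    '''
    source_set = set(word_ngrams(source_words, k))
    sample_seqs = word_ngrams(sample_words, k)
    if not sample_seqs:
        return 0.0
    novel = sum(1 for seq in sample_seqs if seq not in source_set)
    return novel / len(sample_seqs)

# Toy data, invented for illustration.
source = 'the cat sat on the mat and the dog sat on the log'.split()
copied = 'the cat sat on the mat'.split()       # lifted straight from the source
remixed = 'sat on the mat and the log'.split()  # stitches two passages together

print(novelty_rate(copied, source))   # no novel 4-word sequences
print(novelty_rate(remixed, source))  # one of its four 4-word sequences is novel
```

In the notebook, one could pass `create_sample_words(ttable).split()` as the sample and `g_text_words` as the source, then average the result over many samples.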

As for what percentage of those constitute phrases that are not "unreasonable" as spontaneous English utterances, that's a question that's hard to answer computationally. Obviously, it depends in part on your definition of "not unreasonable." But it's kind of fun to pick out phrases of length `n+1` (or `n+2`, etc.) from your sample and see if they appear in the original. You can do so by running code like the following. Just edit the part between the quotation marks so that it contains a phrase from your sample. If Python returns `True`, the phrase appears in the source; if it returns `False`, the phrase is _not_ in the source.

In [None]:
'to do a little untruss' in g_text_norm
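
One caveat: applied to a string, the `in` operator tests for a run of characters, not a run of whole words, so a phrase can match in the middle of longer words. As a sketch of a stricter check, we can compare lists of words instead. The helper name `phrase_in_words` is hypothetical and the toy text is invented:

```python
def phrase_in_words(phrase, words):
    '''
    Hypothetical helper: True if the words of phrase occur consecutively,
    as whole words, somewhere in the list words.
    '''
    target = phrase.split()
    k = len(target)
    return any(words[i:i + k] == target for i in range(len(words) - k + 1))

toy_text = 'the art of drinking'
toy_words = toy_text.split()

print('he art' in toy_text)                   # matches inside "the art"
print(phrase_in_words('he art', toy_words))   # "he" is not a word in the text
print(phrase_in_words('the art', toy_words))  # a genuine two-word match
```

In the notebook, `g_text_words` would play the role of `toy_words`.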

### Where lies the labor?

The code in this notebook implements a kind of algorithm, albeit a simple one. A great many procedures, now standard parts of computer applications -- e.g., efficiently sorting a list -- involve more logical complexity. Our Markovian model of Rabelais' novel seems almost _too_ simple to produce the results it does, which is perhaps partly why the results can feel uncanny. 

And while it needs a gargantuan leap to get from our rudimentary text machine to Chat GPT, the large language model behind the latter is, like ours, a statistical representation of patterns occurring in the textual data on which it is based. The novelty of the latest models derives from their capacity to encode overlapping contexts, that is, to represent how the units that make up text occur in multiple relations to each other: e.g., to capture, mathematically, the fact that a certain word frequently follows another word but often appears in the same sentence or paragraph as a third word, and so on. This complexity of representation, coupled with the sheer size of the data used to train the model, leads to Chat GPT's uncanny ability to mimic textual genres with a high degree of stylistic fidelity.



But perhaps we do Rabelais' text a disservice by calling it the "data" behind our model. We could, just as reasonably, speak of the text itself as the model -- likewise for the tera- or petabytes of text used to train Chat GPT and its ilk. On Shannon's* theory, language encodes information. The ultimate aim of the theory is to find the most _efficient_ means of encoding (in order to solve "the engineering problem" of modern telecommunications networks); nonetheless, the success of the theory implies that any use of language (any use recognizable as such by users of the language) _already_ encodes information. In other words, the transition probabilities we generated from Rabelais' text are already expressed by Rabelais' text; our transition matrix just encodes that information in a more computationally tractable form. Every text encodes its producer's "knowledge of the statistics of the language" (in Shannon's* words). And one might argue that every text encodes its readers' knowledge, too. It's on the basis of such knowledge that we can "decode" Rabelais' novel, as well as the stochastic quasi-nonsense we can generate on its basis, which feels, relative to the former, like an excess of sense (an excess over and above Rabelais' already excessive text), spilling over the top.

#### Further experiments

To see how differences in the source of the model impact the result, try running our code on different texts. As written, our code only works on plain text format (files with a `.txt` extension). [Project Gutenberg](https://www.gutenberg.org/) is a good source for these -- just make sure that you choose the `Plain Text UTF-8` option for displaying a given text. You can copy the URL for the plain text either from your browser's address bar (if the text opens as a separate tab or page), or by right-clicking on the `Plain Text UTF-8` link and selecting the option `Copy Link`. 

The following code creates a text box into which you can paste the link.


In [None]:
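# Create a text box for the URL of a plain-text file
url_box = widgets.Text(
    value='',
    placeholder='Type something',
    description='URL:',
    disabled=False
)
# Show the text box; without this call, the widget is created but never rendered
display(url_box)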

The big block of code below re-uses code from above to download and normalize the text at the provided URL, create a transition table of words from the text, and present options for changing the value of `n` and for generating new samples. 

In [None]:
if url_box.value:
    text_file, _ = urlretrieve(url_box.value)
    # Project Gutenberg's plain-text files are UTF-8, so say so explicitly
    with open(text_file, encoding='utf-8') as f:
        text = f.read()
    norm_text_words = normalize_text(text).split()
    ttable = create_ttable_words(create_ngrams(norm_text_words))
    ngram_slider_new = create_slider()
    update_callback_new, update_output_new = create_update_function(norm_text_words, create_ttable_words, ngram_slider_new)
    update_button_new = create_button("Update table", update_callback_new)
    generate_callback_new, generate_output_new = create_generate_function(create_sample_words, ngram_slider_new)
    generate_button_new = create_button("New sample", generate_callback_new)
    display(ngram_slider_new, update_button_new, update_output_new, generate_button_new, generate_output_new)
    

### Carnival intelligence?

Lacan famously said that the unconscious is structured like a language. Whether that's an apt description of the human psyche is at least debatable. But might we say that these models manifest the unconscious structures of language itself? We can catch glimpses of this manifestation in the relatively humble outcome of Shannon's* experiments: in the Markovian leaps that lead us to make _sense_ out of patterned randomness, leaps which, at the same time, reveal the nonsense that riots on the other side of sense. These experiments allow us to wander through spaces of grammatical, lexical, and stylistic possibility -- and the pleasure they offer, for me, lies in their letting us stumble into places where our rule-observant habits might not otherwise let us go. 

What if we were to approach generative AI in the same spirit? Not as the _deus ex machina_ that will save the world (which it almost certainly is not), and not only as a technology that will further alienate and oppress our labor (which it very probably is). But to borrow from Bakhtin, as a carnivalesque mirror of our collective linguistic unconscious: like carnival, offering a sense of freedom from restraint that is, at the same time, the affirmation, by momentary inversion, of the prevailing order of things. But also a reminder that language is the repository of an intelligence neither of the human (considered as an isolated being), nor of the machine, but of the collective, and that making sense is always a political act ({cite}`bakhtin_rabelais_1984`).

In [None]:
````{bibliography}
````