# Reading Machines: Exploring the Linguistic Unconscious of AI

The early history of computing revolves around efforts to automate human computation -- human labor. And from Lovelace to Turing and beyond, a key concern lay in the specification and refinement of _algorithms_: methods of reducing complex calculations and other operations to explicit formal rules, rules that could, in principle, be implemented with rigor and precision by purely mechanical (and later, of course, electronic) means. 

But as a means of understanding ChatGPT and other forms of [generative AI](https://en.wikipedia.org/wiki/Generative_artificial_intelligence), a consideration of algorithms only gets us so far. In fact, when it comes to the [large language models](https://en.wikipedia.org/wiki/Large_language_model) that have captivated the public imagination, I would argue that their "unreasonable effectiveness" is less a triumph of the algorithm than the manifestation of another strand of computation, bound up with the former, but motivated by distinct pressures and concerns. Instead of formal logic and mathematical proof, this strand draws on traditions of thinking about data, randomness, and probability. And instead of the prescription of (computational) actions, it aims at the description and prediction of (non-computational) aspects of the world. 

A rather neglected moment in this tradition, in light of later developments, remains Claude Shannon's [work](https://people.math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf) on modeling the structure of printed English. In this interactive document, we will use the [Python programming language](https://www.python.org) to reproduce a couple of Shannon's experiments, in the hopes of pulling back the curtain a bit on what seems to many (and not unreasonably) to be evidence of a ghost in the machine. But the aim is not necessarily to demystify experiences of generative AI. I, for one, do find many of these experiences haunting. But maybe the haunting doesn't happen where we at first assume.

The material that follows draws on and is inspired by my reading of Lydia Liu's _The Freudian Robot_, one of the few works in the humanities that I'm aware of to deal with Shannon's work in depth.

## Two kinds of coding

[introduction]

### Programs as code(s)

We imagine computers as machines that operate on 1's and 0's. In fact, the 1's and 0's are themselves an abstraction for human convenience: digital computation happens as a series of electronic pulses, basically switches that are either "on" or "off." (Think of counting to 10 by flipping a light switch on and off 10 times.)

Every digital representation -- everything that can be computed by a digital computer -- must be encoded, ultimately, in this binary form. 

But to make computers efficient for human use, many additional layers of abstraction have been developed on top of the basic binary layer. By virtue of using computers and smartphones, we are all familiar with the concept of an interface, which instantiates a set of rules prescribing how we are to interact with the device in order to accomplish well-defined tasks. These interactions get encoded down to the level of electronic pulses (and the results of the computation are translated back into the encoding of the interface). 

A programming language is also an interface: a text-based one. It represents a code into which we can translate our instructions for computation, in order for those instructions to be encoded further for processing. 

Let's start with a single instruction. Run the following line of Python code. You won't see any output -- that's okay.


In [1]:
answer_to_everything = 42

In the encoding specified by the Python language, the equals sign (`=`) is an instruction that loosely translates to: "Store this value (right side) somewhere in memory, and give that location in memory the provided label (left side)." The following image presents one way of imagining what happens in response to this code (with the caveat that, ultimately, the letters and numbers are represented by their binary encoding).  

[image here]

By running the previous line of code, we have created a _variable_ called `answer_to_everything`. We can use the variable to retrieve its value (for use in other parts of our program).

In [2]:
print(answer_to_everything)

42


The `print()` _function_ is a Python command that displays a value on the screen. The syntax of Python -- e.g., the name "print," as well as the parentheses that follow the function name and that enclose the _argument_, which is the thing we want to print -- is perfectly arbitrary (in the Saussurean sense). This syntax was invented by the designers of the Python language, though they drew on conventions found in other programming languages. The point is that nothing about the Python command `print(answer_to_everything)` makes its operation transparent; to know what it does, you have to know the language (or, at least, be familiar with the conventions of programming languages more generally) -- just as when learning to speak a foreign language, you can't deduce much about the meaning of the words from the way they look or sound.

However, unlike so-called "natural" languages, programming languages are, generally speaking, fully determinate. In other words, even minor deviations in syntax will usually cause errors, and errors will usually bring the whole program to a crashing halt.

In [3]:
print(answer_to_everythin)

NameError: name 'answer_to_everythin' is not defined

A misspelled variable causes Python to abort its computation. Imagine if conversation ground to a halt whenever one of the parties mispronounced a word or used a malapropism!
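The word "usually" matters here. As a minimal sketch (the `try`/`except` wrapper and the `message` variable are my additions, not part of the cells above), the same misspelling still raises a `NameError`, but the program has been told in advance how to respond, so it keeps running:

```python
# Illustrative only: the misspelled name still raises NameError,
# but the error is caught and reported instead of halting the program.
answer_to_everything = 42

try:
    print(answer_to_everythin)  # deliberate misspelling
except NameError as error:
    message = str(error)
    print("Python could not find that name:", message)
```

Note that Python never guesses what we meant; the tolerance has to be spelled out, explicitly, by the programmer.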

I like to say that Python is extremely literal. (But of course, this is merely an analogy, and a loose one. There is no room for metaphor in programming languages, at least, not as far as the computation is concerned.)

### Encoding text

As an engineer at Bell Labs, Claude Shannon wanted to find -- mathematically -- the most efficient means of encoding data for electronic transmission. Note that this task involves a rather different set of factors from those that influence the design of a programming language.

The designer of the language has the luxury of insisting on a programmer's fidelity to the specified syntax. In working in Python, we have to write `print(42)`, exactly as written, in order to display the number `42` on the screen. If we forget the parentheses, for instance, the command won't work. But when we talk on the phone (or via Zoom, etc.), it would certainly be a hassle if we had to first translate our words into a strict, fault-intolerant code like that of Python. 

All the same, there is no digital (electronic) representation without encoding. To refer to the difference between these two types of codes, I am drawing a distinction between _algorithms_ and _data_. Shannon's work was among the first to illuminate this distinction, which remains relevant to any consideration of machine learning and generative AI.

Before we turn to Shannon's experiments with English text, let's look briefly at how Python represents text as data.

In [4]:
a_text = "Most noble and illustrious drinkers, and you thrice precious pockified blades (for to you, and none else, do I dedicate my writings), Alcibiades, in that dialogue of Plato's, which is entitled The Banquet, whilst he was setting forth the praises of his schoolmaster Socrates (without all question the prince of philosophers), amongst other discourses to that purpose, said that he resembled the Silenes."

Running the line above creates a new variable, `a_text`, and assigns it a _string_ representing the first sentence from François Rabelais' early modern novel, _Gargantua and Pantagruel_. A string is the most basic way in Python of representing text, where "text" means anything that is not to be treated purely as a numeric value. 

Anything between quotation marks, in Python, is a string.

One problem with strings in Python (and other programming languages) is that they have very little structure.

A Python string is a sequence of characters, where a _character_ is a letter of a recognized alphabet, a punctuation mark, a space, etc. Each character is stored in the computer's memory as a numeric code, and from that perspective, all characters are essentially equal.
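To make the idea of a numeric code concrete, here is a small sketch using Python's built-in `ord()` and `chr()` functions. The sample string and variable names are my own, chosen only for illustration.

```python
# Each character corresponds to a numeric code; ord() looks that code up.
sample = "Most"
codes = [ord(character) for character in sample]
print(codes)

# The same codes written out as 8 binary digits each.
bits = [format(code, "08b") for code in codes]
print(bits)

# chr() reverses the mapping, turning the numbers back into text.
restored = "".join(chr(code) for code in codes)
print(restored)
```

Nothing in these numbers marks a capital letter as different in kind from a lowercase letter, a space, or a comma: to the machine, each is just another code.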

We can access a single character in a string by supplying its position. (Python counts characters in strings from left to right, starting with 0, not 1, for the first character.)

In [5]:
a_text[5]

'n'

We can access a sequence of characters -- here, the characters at indices 10 through 49 (the slice `10:50` stops just before index 50).

In [6]:
a_text[10:50]

' and illustrious drinkers, and you thric'

We can even divide the string into pieces, using the occurrences of particular characters. The code below divides our text on the white space, returning a _list_ (another Python construct) of smaller strings.

In [7]:
a_text.split()

['Most',
 'noble',
 'and',
 'illustrious',
 'drinkers,',
 'and',
 'you',
 'thrice',
 'precious',
 'pockified',
 'blades',
 '(for',
 'to',
 'you,',
 'and',
 'none',
 'else,',
 'do',
 'I',
 'dedicate',
 'my',
 'writings),',
 'Alcibiades,',
 'in',
 'that',
 'dialogue',
 'of',
 "Plato's,",
 'which',
 'is',
 'entitled',
 'The',
 'Banquet,',
 'whilst',
 'he',
 'was',
 'setting',
 'forth',
 'the',
 'praises',
 'of',
 'his',
 'schoolmaster',
 'Socrates',
 '(without',
 'all',
 'question',
 'the',
 'prince',
 'of',
 'philosophers),',
 'amongst',
 'other',
 'discourses',
 'to',
 'that',
 'purpose,',
 'said',
 'that',
 'he',
 'resembled',
 'the',
 'Silenes.']

The strings in the list above correspond, loosely, to the individual words in the sentence from Rabelais' text. But Python really has no concept of "word," neither in English, nor any other (natural) language. 
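One way to see the gap between these strings and "words" is to look at the punctuation that stays attached after splitting. The sketch below uses a fragment of the sentence and a rough heuristic of my own (stripping punctuation from both ends of each piece); it is not a general solution to finding words.

```python
import string

pieces = "do I dedicate my writings), Alcibiades, in that dialogue".split()
print(pieces)

# Strip leading and trailing punctuation from each piece; a rough heuristic only.
words = [piece.strip(string.punctuation) for piece in pieces]
print(words)
```

Even this small cleanup encodes a decision about what counts as part of a word. An apostrophe inside a piece (as in `Plato's`) would survive, for instance, because `strip` only touches the ends of a string.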

## Language & chance

It's probably fair to say that when Shannon was developing his mathematical approach to encoding information, the algorithmic ideal  dominated computational research in Western Europe and the United States. In previous decades, philosophers like Bertrand Russell and mathematicians like David Hilbert had sought to develop a formal approach to mathematical proof, an approach that, they hoped, would ultimately unify the scientific disciplines. The goal of such research was to identify a core set of axioms, or logical rules, in terms of which all other "rigorous" methods of thought could be expressed. In other words, to reduce to zero the uncertainty and ambiguity plaguing natural language as a tool for expression: to make language algorithmic.

Working within this tradition, Alan Turing had developed his model of what would become the digital computer. 

But can language as humans use it be reduced to such formal rules? On the face of it, it's easy to think not. However, that conclusion presents a problem for computation involving human (or "natural") language, since the computer is, at bottom, a formal-rule-following machine.

Shannon's work implicitly challenges the assumption that we need to resort to formal rules in order to deal with the uncertainty in language. Instead, he sought mathematical means for _quantifying_ that uncertainty. And as Lydia Liu points out, that effort began with a set of observations about patterns in printed English texts. 

### The long history of code

Of course, Shannon's insights do not begin with Shannon. A long history predates him of speculation on what we might call the statistical features of language, speculation of some practical urgency, given the even longer history of cryptographic communication in political, military, and other contexts.

In the 9th Century CE, the Arab mathematician and philosopher Al-Kindi [composed a work on cryptography](https://www.tandfonline.com/doi/abs/10.1198/tas.2011.10191) in which he described the relative frequency of letters in [...] as a method for [...]. Al-Kindi, alongside his many other accomplishments, is typically credited with the first surviving analysis of this kind, which is a direct precursor of methods popular in the digital humanities (word frequency analysis), among many other domains. 

Closer to the hearts of digital humanists, the Russian mathematician Andrei Markov, in [a 1913 address to the Russian Academy of Sciences](https://www-cambridge-org.proxygw.wrlc.org/core/journals/science-in-context/article/an-example-of-statistical-investigation-of-the-text-eugene-onegin-concerning-the-connection-of-samples-in-chains/EA1E005FA0BC4522399A4E9DA0304862), reported on the results of his experiment with Aleksandr Pushkin's _Evgenii Onegin_: a statistical analysis of the occurrences of consonants and vowels in the first two chapters of Pushkin's novel in verse. From the perspective of today's large language models, Markov improved on Al-Kindi's methods by counting not just isolated occurrences of vowels or consonants, but co-occurrences: that is, where a vowel follows a consonant, a consonant a vowel, etc. As a means of articulating the structure of a sequential process, Markov's method generalizes into a powerful mathematical tool, to which he lends his name. We will see how Shannon used [Markov chains](https://en.wikipedia.org/wiki/Markov_chain) shortly. 
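To give a feel for this kind of co-occurrence counting, here is a minimal sketch on a very short English sample. It is not a reconstruction of Markov's procedure: the sample, the decision to ignore spaces, the treatment of `y` as a consonant, and all of the names are assumptions of mine.

```python
from collections import Counter

VOWELS = set("aeiou")  # assumption: English vowels, with "y" treated as a consonant

def letter_class(character):
    """Label a letter as a vowel ("V") or a consonant ("C")."""
    return "V" if character in VOWELS else "C"

sample = "gargantua and pantagruel"
letters = [c for c in sample if c.isalpha()]   # drop the spaces
classes = [letter_class(c) for c in letters]

# Count each pair of neighboring classes: CV, VC, CC, VV.
transitions = Counter(a + b for a, b in zip(classes, classes[1:]))
print(transitions)
```

Even in a sample this small, the four kinds of pairs do not occur equally often, which is the sort of regularity that such counts are meant to expose.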

### A spate of tedious counting

First, however, let's illustrate the more basic method, just to get a feel for its effectiveness.

We'll take a text of sufficient length. Urquhart's English translation of _Gargantua and Pantagruel_, in the Everyman's Library edition, clocks in at a respectable 823 pages, so that's a decent sample. If we were following the methods used by Al-Kindi, Markov, or even Shannon himself, we would proceed as follows:
  1. Make a list of the letters of the alphabet on a sheet of paper.
  2. Go through the text, letter by letter.
  3. Beside each letter on your paper, make one mark each time you encounter that letter in the text.

Fortunately for us, we can avail ourselves of a computer to do this work. 

In the following lines of Python code, we download the Project Gutenberg edition of Rabelais' novel, saving it to the computer as a text file. We can read the whole file into the computer's memory as a single Python string. Then using a property of Python strings that allows us to _iterate_ over them, we can automate the process of counting up the occurrences of each character.

In [1]:
from urllib.request import urlretrieve
urlretrieve('https://www.gutenberg.org/cache/epub/1200/pg1200.txt', 'gargantua.txt')

('gargantua.txt', <http.client.HTTPMessage at 0x103d29990>)

In [2]:
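# Decode explicitly as UTF-8 rather than relying on the platform's default encoding.
with open('gargantua.txt', encoding='utf-8') as f:
    g_text = f.read()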

In [3]:
len(g_text)

1855085

The Gutenberg version has close to 2 million characters. 

In [5]:
g_characters = {}
for character in g_text:
    if character in g_characters:
        g_characters[character] += 1
    else:
        g_characters[character] = 1

The code above implements the logic we described in manual terms:
   - We create an empty _dictionary_, called `g_characters`, which allows us to associate pairs of data points.
   - We loop through the text (`g_text`), which is a string, i.e., a sequence of characters.
   - For each character, if we have encountered it already, we assume that it's associated with a number, and we increment that number (just as if we were making another hash mark on a sheet of paper).
   - Otherwise, we add this character to our collection and set the tally to 1 (since this is the first occurrence of that character).
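As an aside, Python's standard library includes a `Counter` class that performs the same tally in one step. The sketch below compares the two approaches on a short sample string of my own choosing (not the full novel), so that it can be run on its own.

```python
from collections import Counter

sample = "most noble and illustrious drinkers"

# Manual tally, following the same logic as the loop above.
manual = {}
for character in sample:
    if character in manual:
        manual[character] += 1
    else:
        manual[character] = 1

# The same tally using the standard library's Counter class.
tally = Counter(sample)
print(tally.most_common(3))
```

Both approaches produce the same counts; the explicit loop has the advantage of showing each step of the bookkeeping.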

In [6]:
g_characters

{'\ufeff': 1,
 'T': 3347,
 'h': 91306,
 'e': 174505,
 ' ': 319443,
 'P': 3050,
 'r': 86558,
 'o': 109023,
 'j': 1312,
 'c': 33883,
 't': 130222,
 'G': 1546,
 'u': 44252,
 'n': 95086,
 'b': 21433,
 'g': 28016,
 'B': 1614,
 'k': 10361,
 'f': 33708,
 'a': 113620,
 'd': 60137,
 'l': 57764,
 '\n': 31090,
 'i': 92999,
 's': 88844,
 'y': 27084,
 'w': 29773,
 'U': 202,
 'S': 1725,
 'm': 32998,
 'p': 24007,
 'v': 13734,
 '.': 13850,
 'Y': 382,
 ',': 35809,
 '-': 4525,
 'L': 1085,
 'I': 5048,
 ':': 390,
 'A': 2526,
 'F': 1470,
 'ç': 1,
 'R': 767,
 'D': 774,
 'é': 1,
 'M': 1210,
 'x': 1971,
 'q': 1769,
 '8': 54,
 '2': 137,
 '0': 60,
 '4': 186,
 '[': 1,
 '#': 2,
 '1': 343,
 ']': 1,
 '3': 188,
 'E': 856,
 'C': 2605,
 'W': 1341,
 'N': 764,
 "'": 2194,
 'O': 848,
 '9': 34,
 '"': 12,
 'V': 542,
 'K': 154,
 'H': 1875,
 '(': 404,
 '6': 60,
 '5': 174,
 ')': 404,
 ';': 3385,
 '7': 50,
 'z': 781,
 '/': 9,
 '?': 1083,
 '=': 8,
 '_': 1,
 'J': 616,
 '&': 33,
 'X': 773,
 'Z': 21,
 'Q': 126,
 '*': 12,
 '!': 632}

Looking at the contents of `g_characters`, we can see that it consists of more than just the standard Roman alphabet. There are punctuation marks, numerals, and other symbols, like `\n`, which represents a line break. 

But if we look at the 10 most commonly occurring characters, with one exception, they align well with the [relative frequency of letters in English](https://en.wikipedia.org/wiki/Letter_frequency) as reported from studying large textual corpora.  

In [9]:
sorted(g_characters.items(), key=lambda x: x[1], reverse=True)[:10]

[(' ', 319443),
 ('e', 174505),
 ('t', 130222),
 ('a', 113620),
 ('o', 109023),
 ('n', 95086),
 ('i', 92999),
 ('h', 91306),
 ('s', 88844),
 ('r', 86558)]
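Raw counts are easier to compare with published letter frequencies once they are expressed as proportions. The sketch below copies a few of the counts above by hand, along with the total length of the text reported earlier, and divides one by the other; the variable names are mine.

```python
# A few of the counts reported above, copied by hand for illustration.
counts = {" ": 319443, "e": 174505, "t": 130222, "a": 113620}
total_characters = 1855085  # len(g_text), as reported earlier

# Divide each count by the total to get that character's share of the text.
proportions = {c: n / total_characters for c, n in counts.items()}
for character, share in proportions.items():
    print(repr(character), f"{share:.1%}")
```

Keep in mind that these shares are fractions of _all_ characters in the file, including spaces, punctuation, and line breaks, so they will run lower than figures computed over letters alone.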

### Automatic writing

Using a little Python code, let's compare what happens when we construct two random samples of the letters of the Roman alphabet, one in which we select each letter with equal probability, and the other in which we weight our selections according to the frequency we have computed above.

In [69]:
from random import choices
alphabet = 'abcdefghijklmnopqrstuvwxyz'
print(''.join(choices(alphabet, k=50)))

hjknlsdoumplgkshcuuafpvyjjloueenvodzykxwxznpevgtpa


In [15]:
g_alpha_chars = {c: n for c, n in g_characters.items() if c in alphabet}

In [17]:
print(''.join(choices(list(g_alpha_chars.keys()), weights=g_alpha_chars.values(), k=50)))

sogocbanermnttanedlvneotltsmtrpssfgalsareywdcvssaz


Do you notice any difference between the two? It depends to some extent on the roll of the dice, since both selections are still random. But you might see _more_ runs of letters in the second that resemble sequences you could expect in English, maybe even a word or two hiding in there.

### The difference a space makes

On Liu's telling, one of Shannon's key innovations was his realization that in analyzing _printed_ English, the _space between words_ counts as a character. It's the spaces that delimit words in printed text; without them, our analysis fails to account for word boundaries. 

Let's see what happens when we include the space character in our frequencies.

In [23]:
g_shannon_chars = {c: n for c, n in g_characters.items() if c in alphabet or c == ' '}
print(''.join(choices(list(g_shannon_chars.keys()), weights=g_shannon_chars.values(), k=50)))

d nec rynao e lhssio  bfo emrvcmut opodoyrdn  eeh 


It may not seem like much improvement, but now we're starting to see sequences of recognizable "word length," considering the average lengths of words in English. (But note that we haven't so far actually tallied anything that would count as a word: we're still operating exclusively at the level of individual characters or letters.)

### Law-abiding numbers

To unpack what we're doing a little more: when we make a _weighted_ selection from the letters of the alphabet, using the frequencies we've observed, it's equivalent to drawing letters out of a bag of Scrabble tiles, where different tiles appear in different amounts. If there are 5 `e`'s in the bag but only 1 `z`, you might draw a `z`, but over time, you're more likely to draw an `e`. And if you make repeated draws, recording the letter you draw each time before putting it back in the bag, your final tally of letters will usually have more `e`'s than `z`'s. In probability theory, this expectation is called [the law of large numbers](https://en.wikipedia.org/wiki/Law_of_large_numbers). It describes the fundamental intuition behind the utility of averages (as well as their limitation: sampling better approximates the mathematical average as the samples get larger, but in every case, we're talking about behavior in the aggregate, not the individual case). 
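The bag of tiles is easy to simulate. In the sketch below, the bag holds 5 `e`'s and 1 `z`, so the expected share of `e`'s is 5/6 (about 0.833). The fixed seed, the sample sizes, and the helper function `share_of_e` are arbitrary choices of mine, made so that the example can be repeated.

```python
from random import Random

rng = Random(0)            # fixed seed (an arbitrary choice) so runs are repeatable
bag = ["e"] * 5 + ["z"]    # 5 e's and 1 z, as in the example above

def share_of_e(draws):
    """Draw from the bag with replacement and return the fraction of e's."""
    sample = rng.choices(bag, k=draws)
    return sample.count("e") / draws

for draws in (10, 100, 10_000):
    print(draws, round(share_of_e(draws), 3))
```

With only 10 draws, the share can land some distance from 5/6; with 10,000, it will typically sit very close. No individual draw becomes any more predictable in the process.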

### Language as a drunken walk

To return to Shannon's experiments and the question motivating them: How effectively can we model human or natural language using statistical means? It's worth dwelling on the assumptions latent in this question. Parts of speech, word order, syntactic dependencies, etc: none of these classically linguistic entities come up for discussion in Shannon's article. Nor are there any claims therein about underlying structures of thought that might map onto grammatical or syntactic structures, such as we find in the Chomskian theory of [generative grammar](https://en.wikipedia.org/wiki/Generative_grammar). The latter theory remains squarely within the algorithmic paradigm: the search for formal rules or laws of thought. 

Language, in Shannon's treatment, resembles a different kind of phenomenon: biological populations, financial markets, or the weather. In each of these systems, it is taken as a given that there are simply too many variables at play to arrive at the kind of description that would even remotely resemble the steps of a formally logical proof. Rather, the systems are described, and attempts are made to predict their behavior over time, drawing on observable patterns held to be valid in the aggregate. 

Whether the human linguistic faculty is best described in terms of formal, algorithmic rules, or as something else (emotional weather, perhaps), was not a question germane to Shannon's analysis. He famously wrote in the introduction to his 1948 article that the "semantic aspects of communication are irrelevant to the engineering problem" (i.e., the problem of devising efficient means of encoding messages, linguistic or otherwise). These "semantic aspects" -- and, I would argue, the "syntactic aspects" of communication, too -- excluded from "the engineering problem," return to haunt the scene of generative AI with a vengeance. But in order to set this scene, let's return to Shannon's experiments.

Following Andrei Markov, Shannon modeled printed English as a Markov chain: a special kind of weighted selection where the weights of the current selection depend _only_ on the immediately previous selection. A Markov chain is often introduced by way of the random walk, conventionally pictured as a person who has had a bit too much to drink stumbling about. Observing such a situation, you might not be able to determine where the person is trying to go; all you can predict is that their next step will fall within stepping distance of where they're standing right now. 

It turns out that Markov chains can be used to model lots of processes in the physical world. And they can be used to model language, too, as Claude Shannon showed.
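The stumbling walker can be sketched in a few lines. This is a toy of my own devising, not Shannon's model of English: a walker on a line who steps left or right with equal chance, so that each new position depends only on the current one. The seed and the function name `random_walk` are arbitrary.

```python
from random import Random

rng = Random(1)  # arbitrary seed, for repeatability

def random_walk(steps, start=0):
    """Take `steps` steps of +1 or -1; each depends only on the current position."""
    positions = [start]
    for _ in range(steps):
        step = rng.choice([-1, 1])              # stumble left or right with equal chance
        positions.append(positions[-1] + step)  # the new position builds on the last one
    return positions

walk = random_walk(20)
print(walk)
```

We cannot say where the walker will end up, but we can say with certainty that each position lies exactly one step from the one before it. For language, the "positions" become characters, and the chances of each next step come from counting.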

### More tedious counting

One way to construct such an analysis is as follows: represent your sample of text as a continuous string of characters. (As we've seen, that's easy to do in Python.) Then "glue" it to another string, representing the same text, but with every character shifted to the left by one position. For example, the first several characters of the first sentence from _Gargantua and Pantagruel_ would look like this:

[image]

With the exception of the dangling left-most and right-most characters, you now have a pair of strings that yield, for each position, a pair of characters.

[image with highlighting]

These pairs are called bigrams. But in order to construct a Markov chain, we're not just counting bigrams. Rather, we want to create what's called a _transition table_: a table where we can look up a given character -- the letter `e`, say -- and then for any other character that can follow `e`, find the frequency with which it occurs in that position (i.e., following an `e`). If a given character never follows another character, its bigram doesn't exist in the table. 
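The "gluing" step described above can be sketched with Python's built-in `zip` function, which pairs up the items of two sequences position by position. The short sample string is my own; the lookup at the end shows, in miniature, what a single row of a transition table records.

```python
from collections import Counter

sample = "most noble and illustrious drinkers"

# Pair each character with its right-hand neighbor by zipping the string
# against a copy of itself that starts one position later.
bigrams = list(zip(sample, sample[1:]))
print(bigrams[:5])

# Tally which characters follow the letter "s" in this sample.
after_s = Counter(second for first, second in bigrams if first == "s")
print(after_s)
```

Because `zip` stops when the shorter sequence runs out, the dangling characters at either end are dropped without any extra work.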

Below are shown the most common bigrams in our transition table.

[image]

To simplify our analysis, first we'll standardize our text a bit. Removing punctuation and non-alphabetic characters, removing extra runs of white space and line breaks, and converting everything to lowercase will make patterns in the results easier to see (though it's really sort of an aesthetic choice, and as I've suggested, Shannon's method doesn't presuppose any essential difference between the letters of words and the punctuation marks that accompany them). 

In [18]:
g_text_lower = g_text.lower()
g_text_lower = g_text_lower.replace("\n", " ").replace("\t", " ")
g_text_norm = ""
for char in g_text_lower:
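    # Keep letters; keep a space only if the previous kept character was not a space.
    # (The check on g_text_norm avoids an IndexError if the text begins with white space.)
    if (char in "abcdefghijklmnopqrstuvwxyz") or (char == " " and g_text_norm and g_text_norm[-1] != " "):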
        g_text_norm += char

In [19]:
g_text_norm[:1000]

'the project gutenberg ebook of gargantua and pantagruel this ebook is for the use of anyone anywhere in the united states and most other parts of the world at no cost and with almost no restrictions whatsoever you may copy it give it away or reuse it under the terms of the project gutenberg license included with this ebook or online at wwwgutenbergorg if you are not located in the united states you will have to check the laws of the country where you are located before using this ebook title gargantua and pantagruel author franois rabelais illustrator gustave dor translator peter anthony motteux sir thomas urquhart release date august ebook most recently updated december language english credits produced by sue asscher and david widger transcribers note the original project gutenberg edition of this ebook was a text file prepared by sue asscher in from master francis rabelais five books of the lives heroic deeds and sayings of gargantua and his son pantagruel translated into english b'

This method isn't perfect, but we'll trust that any errors -- like the disappearance of accented characters from French proper nouns, etc. -- will get smoothed over in the aggregate. 

To create our transition table of bigrams, we'll use the following code, which defines two functions in Python. The first function, `create_ngrams`, generalizes a bit from our immediate use case; by setting the parameter called `n` in the function call to a number higher than 2, we can create combinations of three or more successive characters (trigrams, quadgrams, etc.). This feature will be useful a little later.

In [53]:
def create_ngrams(text, n=2):
    '''
    Creates a series of ngrams out of the provided text argument. The argument n determines the size of each ngram; n must be greater than or equal to 2. 
    Returns a list of ngrams, where each ngram is a Python tuple consisting of n characters.
    '''
    text_arrays = []
    for i in range(n):
        last_index = len(text) - (n - i - 1)
        text_arrays.append(text[i:last_index])
    return list(zip(*text_arrays))

Let's illustrate our function with a small text first. The output is a Python list, which contains a series of additional collections (called tuples) nested within it. Each subcollection corresponds to a 2-character window, and the window is moved one character to the right each time. 

This structure will allow us to create our transition table, showing which characters follow which other characters most often. 

In [56]:
text = 'abcdefghijklmnopqrstuvwxyz'
create_ngrams(text, 2)

[('a', 'b'),
 ('b', 'c'),
 ('c', 'd'),
 ('d', 'e'),
 ('e', 'f'),
 ('f', 'g'),
 ('g', 'h'),
 ('h', 'i'),
 ('i', 'j'),
 ('j', 'k'),
 ('k', 'l'),
 ('l', 'm'),
 ('m', 'n'),
 ('n', 'o'),
 ('o', 'p'),
 ('p', 'q'),
 ('q', 'r'),
 ('r', 's'),
 ('s', 't'),
 ('t', 'u'),
 ('u', 'v'),
 ('v', 'w'),
 ('w', 'x'),
 ('x', 'y'),
 ('y', 'z')]

In [65]:
from collections import Counter
def create_transition_table(ngrams):
    '''
    Expects as input a list of tuples corresponding to ngrams.
    Returns a dictionary of dictionaries, where the keys to the outer dictionary consist of strings corresponding to the first n-1 elements of each ngram.
    The values of the outer dictionary are themselves dictionaries, where the keys are the nth elements of each ngram, and the values are the frequency of occurrence.
    '''
    n = len(ngrams[0])
    ttable = {}
    for ngram in ngrams:
        key = "".join(ngram[:n-1])
        if key not in ttable:
            ttable[key] = Counter()
        ttable[key][ngram[-1]] += 1
    return ttable
    

In [66]:
create_transition_table(create_ngrams(text, 2))

{'a': Counter({'b': 1}),
 'b': Counter({'c': 1}),
 'c': Counter({'d': 1}),
 'd': Counter({'e': 1}),
 'e': Counter({'f': 1}),
 'f': Counter({'g': 1}),
 'g': Counter({'h': 1}),
 'h': Counter({'i': 1}),
 'i': Counter({'j': 1}),
 'j': Counter({'k': 1}),
 'k': Counter({'l': 1}),
 'l': Counter({'m': 1}),
 'm': Counter({'n': 1}),
 'n': Counter({'o': 1}),
 'o': Counter({'p': 1}),
 'p': Counter({'q': 1}),
 'q': Counter({'r': 1}),
 'r': Counter({'s': 1}),
 's': Counter({'t': 1}),
 't': Counter({'u': 1}),
 'u': Counter({'v': 1}),
 'v': Counter({'w': 1}),
 'w': Counter({'x': 1}),
 'x': Counter({'y': 1}),
 'y': Counter({'z': 1})}

For the alphabet, our transition table consists entirely of 1's, because (by definition) each letter occurs only once. The way to read the table, however, is as follows:
> The letter `b` occurs after the letter `a` 1 time in our (alphabet) sample.
> 
> The letter `c` occurs after the letter `b` 1 time in our sample.
> 
> ...

Now let's use these functions to create the transition table with bigrams from _Gargantua and Pantagruel_.

In [67]:
g_ttable = create_transition_table(create_ngrams(g_text_norm, 2))

Our table will now be significantly bigger. But let's use it to see how frequently the letter `e` follows the letter `h` in our text:

In [68]:
g_ttable['h']['e']

40196
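A count like this becomes more telling once it is compared with the other counts in the same row. The sketch below uses a toy row with invented numbers (it is _not_ drawn from our table) to show how dividing by the row total turns raw counts into the probability of each character coming next.

```python
from collections import Counter

# A toy row of a transition table; these counts are invented for illustration.
toy_row = Counter({"e": 6, "a": 3, "i": 1})

# Dividing by the row total turns raw counts into the probability of each
# character appearing next, given the current one.
row_total = sum(toy_row.values())
probabilities = {c: n / row_total for c, n in toy_row.items()}

print(toy_row.most_common(1))
print(probabilities)
```

When we sample from the table, we can pass the raw counts as weights and get the same effect, since only the proportions between them matter for a weighted selection.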

We can visualize our table fairly easily by using a Python library called [pandas](https://pandas.pydata.org/).

To read the table, select a row for the first letter, and then a column to find the frequency of the column letter appearing after the letter in the row. (In other words, read across then down.)

The space character appears as the empty column/row label in this table. 

In [93]:
import pandas as pd
pd.set_option("display.precision", 0)
pd.DataFrame.from_dict(g_ttable, orient='index')

| | h | | e | u | a | s | r | i | o | l | ... | f | g | z | q | b | j | v | d | k | x |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| t | 48893 | 32489 | 10626 | 2540 | 4924 | 2831 | 3794 | 7459 | 12307 | 1671 | ... | 78 | 16 | 17 | 4 | 27 | 3 | 2 | 15 | 3 | NaN |
| h | 31 | 10970 | 40196 | 1339 | 14764 | 232 | 1060 | 12833 | 8200 | 144 | ... | 92 | NaN | NaN | 2 | 65 | 2 | 1 | 51 | 2 | NaN |
| e | 266 | 61543 | 4873 | 444 | 8741 | 12330 | 24406 | 3527 | 799 | 6798 | ... | 1600 | 987 | 68 | 288 | 494 | 56 | 2897 | 12428 | 143 | 1224 |
| | 20185 | NaN | 5985 | 4086 | 37773 | 21143 | 5827 | 18698 | 21971 | 8038 | ... | 13280 | 7118 | 54 | 1007 | 15401 | 1481 | 2669 | 9991 | 1609 | 427 |
| p | 1094 | 1659 | 4253 | 1079 | 4510 | 530 | 3315 | 2220 | 3415 | 2257 | ... | 24 | 28 | NaN | 1 | 16 | NaN | 1 | 15 | 15 | NaN |
| r | 165 | 20393 | 19212 | 2350 | 5594 | 4951 | 1552 | 7668 | 6303 | 943 | ... | 311 | 1676 | 4 | 47 | 431 | 13 | 571 | 2864 | 680 | 6 |
| o | 389 | 14494 | 412 | 15121 | 718 | 3255 | 13494 | 1062 | 3434 | 3589 | ... | 13377 | 793 | 59 | 31 | 766 | 117 | 1502 | 2686 | 964 | 170 |
| c | 7651 | 1440 | 5385 | 1414 | 4450 | 98 | 1766 | 1642 | 6605 | 1214 | ... | 1 | 4 | 1 | 55 | 33 | NaN | NaN | 1 | 1947 | NaN |
| g | 2826 | 8690 | 4074 | 1064 | 2308 | 1063 | 2979 | 1448 | 2950 | 674 | ... | 11 | 342 | NaN | NaN | 22 | NaN | NaN | 74 | 5 | NaN |
| u | 26 | 3254 | 2256 | 52 | 1089 | 6190 | 6811 | 1065 | 490 | 4098 | ... | 267 | 1518 | 64 | 11 | 768 | NaN | 21 | 882 | 24 | 109 |


### Automatic writing (2)

The code below creates one more function, which uses our transition table as a basis for making random, weighted samples, much as we did with our single-letter frequencies. But this time, our random sample behaves as a Markov chain: that is, the weights for each selection will depend on the previous character. The logic is as follows:
  - Assume the first character in our random sample is `a`.
  - The next character must be a character that immediately follows an `a` in Rabelais' text. It will be selected randomly from all the characters following the `a`s in Rabelais' text, using their relative frequencies (i.e., of appearance in a post-`a` position) as the weights.
  - Let's assume that _this_ character, randomly selected, is a space.
  - The third character we select must be one that immediately follows a space in Rabelais' text, using the same sampling technique as above.
  - And so on, for as many characters as we choose to sample.

To start our Markov chain, we must either provide or randomly select an initial character. Our function makes an (unweighted) random selection from the keys of our table (for bigrams, the set of characters that can begin a pair).
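
A single step of this process can be sketched in isolation. The example below assumes a tiny hand-made table (`toy_ttable`, a hypothetical stand-in for `g_ttable`) and a hypothetical helper, `next_char`. It relies on `random.choices`, which accepts relative weights, to pick the character that follows an `a`.

```python
from collections import Counter
from random import choices, seed

# Hypothetical counts of the characters observed immediately after "a".
toy_ttable = {"a": {"n": 6, "t": 3, " ": 1}}

def next_char(ttable, current):
    """Draw one successor of `current`, weighted by its observed counts."""
    candidates = list(ttable[current].keys())
    weights = list(ttable[current].values())
    return choices(candidates, weights, k=1)[0]

seed(0)  # fixed seed only so that repeated runs of this sketch agree
draws = Counter(next_char(toy_ttable, "a") for _ in range(10_000))
# The tallies should fall roughly in the 6:3:1 ratio of the counts above.
print(draws.most_common())
```

Repeating this step, each time using the character just drawn as the new `current`, yields the chain described in the list above.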

In [85]:
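from random import choices

def create_sample(ttable, length=100):
    '''
    Using a transition table of ngrams, creates a random sample by appending `length` characters (default is 100) to a randomly chosen starting key.
    '''
    # Pick a starting key uniformly at random (unweighted).
    starting_chars = list(ttable.keys())
    first_char = last_char = choices(starting_chars, k=1)[0]
    # n is the length of a key in the table, i.e., of the context we condition on.
    n = len(first_char)
    text = first_char
    for _ in range(length):
        # Candidate next characters and their counts, given the current context.
        chars = list(ttable[last_char].keys())
        weights = list(ttable[last_char].values())
        next_char = choices(chars, weights, k=1)[0]
        text += next_char
        # Slide the context forward to the last n characters generated.
        last_char = text[-n:]
    return text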

In [88]:
create_sample(g_ttable)

'r beede bu ase endelly geve tathatoorou m pake waself r terisonanghig joutathof ty baly h avitheriofr'

Run the code above a few times for the full effect. It's still nonsense, but maybe it seems more like recognizable nonsense -- meaning nonsense that a human being who speaks English might make up -- compared with our previous randomly generated examples. If you agree that it's more recognizable, can you pinpoint features or moments that make it so?

Personally, it reminds me of the outcome whenever I've used a Ouija board with anyone else: recognizable words almost emerging from some sort of pooled subconscious, then sinking back into the murk before we can make any sense out of them. 

### Running trials

More adept Ouija board users can be simulated by increasing the size of our ngrams. As Shannon's article demonstrates, the approximation to the English lexicon improves when we move from bigrams to trigrams -- such that frequencies are calculated in terms of the occurrence of a given letter immediately after a pair of letters. 

So instead of a table like this:

[image]

we have this:

[image] 
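
The difference between the two tables can also be seen in code. The sketch below builds both from a short string. The helpers `char_ngrams` and `transition_table` are hypothetical stand-ins for the functions defined earlier in this document, written out here so that the example runs on its own, and `sample` is an arbitrary piece of text.

```python
from collections import Counter

def char_ngrams(text, n):
    """Return every run of n consecutive characters in `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def transition_table(ngrams):
    """Map the first n-1 characters of each ngram to counts of the nth."""
    table = {}
    for ngram in ngrams:
        context, following = ngram[:-1], ngram[-1]
        table.setdefault(context, Counter())[following] += 1
    return table

sample = "the theme then"  # arbitrary stand-in text
bigram_table = transition_table(char_ngrams(sample, 2))
trigram_table = transition_table(char_ngrams(sample, 3))

# Bigram rows are keyed by one character, trigram rows by a pair.
print(dict(bigram_table["e"]))    # everything that follows "e"
print(dict(trigram_table["he"]))  # what follows "e" when preceded by "h"
print(dict(trigram_table["me"]))  # what follows "e" when preceded by "m"
```

In this small example, the single row for `e` in the bigram table is split across the rows for `he` and `me` in the trigram table, so each row offers the sampler fewer, more specific options.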

Note, however, that throughout these experiments, the level of approximation to anyone's understanding of the "English lexicon" depends both on the nature of that understanding, and the nature of the data from which we derive our frequencies. Urquhart's translation of Rabelais, dating from the 17th century, has a rather distinctive vocabulary, as you might expect, even in Project Gutenberg's electronic edition, which adheres to standardized spelling and punctuation.

The code below defines some interactive controls to make our experiments easier to manipulate. 

In [99]:
import ipywidgets as widgets
from IPython.display import display

Use the slider below this cell to change the size of the ngrams and update our transition table. Click `Update` to apply the changes.

In [114]:
ngram_slider = widgets.IntSlider(
    value=2,
    min=2,
    max=5,
    step=1,
    description='Set value of n:')
button = widgets.Button(description="Update")
output = widgets.Output()

def on_update(change):
    with output:
        global g_ttable
        g_ttable = create_transition_table(create_ngrams(g_text_norm, ngram_slider.value))
        print(f'Updated! Value of n is now {ngram_slider.value}.')
        
button.on_click(on_update)
display(ngram_slider, button, output)

Use the button below this cell to generate a new sample based on the current transition table. You can generate as many samples as you like, and you can update the size of the ngrams in between in order to compare samples of different sizes.

In [120]:
button2 = widgets.Button(description="Generate sample")
output2 = widgets.Output()

def on_generate(change):
    with output2:
        print(f'(n={ngram_slider.value}) {create_sample(g_ttable)}')

button2.on_click(on_generate)
display(button2, output2)

What do you notice about the effect of higher values of `n` on the nature of the random samples produced? 
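
If the widgets are unavailable (for instance, when reading a static copy of this document), a rough equivalent is to loop over the values of `n` directly. The sketch below is self-contained: `build_table`, `sample_text`, and `corpus` are hypothetical stand-ins for the functions and text used elsewhere in this document. Unlike `create_sample`, `sample_text` stops early if it reaches a context with no recorded successor.

```python
from collections import Counter
from random import choices, seed

def build_table(text, n):
    """Count which character follows each run of n-1 characters."""
    table = {}
    for i in range(len(text) - n + 1):
        context, following = text[i:i + n - 1], text[i + n - 1]
        table.setdefault(context, Counter())[following] += 1
    return table

def sample_text(table, length=60):
    """Generate up to `length` further characters from a random starting key."""
    text = choices(list(table), k=1)[0]
    n = len(text)  # length of the context, i.e., of each key
    for _ in range(length):
        row = table.get(text[-n:])
        if row is None:  # this context was never followed by anything
            break
        text += choices(list(row), list(row.values()), k=1)[0]
    return text

# Stand-in corpus; substitute the normalized text used elsewhere here.
corpus = "the quick brown fox jumps over the lazy dog and then the dog naps " * 3

seed(1)  # fixed only so that repeated runs of this sketch agree
for n in range(2, 6):
    print(f"(n={n}) {sample_text(build_table(corpus, n))}")
```

With so small a corpus, the higher values of `n` will mostly reproduce the source verbatim; a longer text leaves the sampler more room to wander.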

### A Rabelaisian chatbot

I find the phenomena easier to observe when we use whole words as the inputs to our transition table as opposed to individual letters. The underlying principle is the same. We can create a list of "words" out of our normalized text by splitting the latter on the occurrences of white space. As a result, instead of a single string containing the entire text, we'll have a Python list of strings, each of which is a word from the original text.

In [123]:
g_text_words = g_text_norm.split()
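
To see what this does on a small scale, consider the sketch below. The string `normalized` is a made-up stand-in for `g_text_norm`. Calling `split()` with no arguments splits on any run of whitespace, including line breaks.

```python
# Hypothetical snippet with a doubled space and a line break in it.
normalized = "the  very horrific\nlife of gargantua"
words = normalized.split()
print(words)

# Joining with single spaces restores a readable string, though the
# original runs of whitespace and line breaks are not recovered.
print(" ".join(words))
```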

From here, we can create our ngrams and transition table as before. First, we just need to modify our previous code to put the spaces back (since we took them out in order to create our list of words). 

In [138]:
from collections import Counter
from random import choices

def create_ttable_words(ngrams):
    '''
    Expects as input a list of tuples corresponding to ngrams.
    Returns a dictionary of dictionaries, where the keys to the outer dictionary are tuples holding the first n-1 elements of each ngram.
    The values of the outer dictionary are themselves dictionaries, where the keys are 1-tuples holding the nth element of each ngram, and the values are the frequency of occurrence.
    '''
    n = len(ngrams[0])
    ttable = {}
    for ngram in ngrams:
        key = ngram[:n-1]
        if key not in ttable:
            ttable[key] = Counter()
        ttable[key][(ngram[-1],)] += 1
    return ttable
    
def create_sample(ttable, length=100):
    '''
    Using a transition table of ngrams, creates a random sample by appending `length` words (default is 100) to a randomly chosen starting key.
    '''
    starting_words = list(ttable.keys())
    first_words = last_words = tuple(choices(starting_words, k=1)[0])
    # n is the number of words in a key, i.e., in the context we condition on.
    n = len(first_words)
    text = list(first_words)
    for _ in range(length):
        words = list(ttable[last_words].keys())
        weights = list(ttable[last_words].values())
        next_word = choices(words, weights, k=1)[0]
        text.append(next_word[0])
        last_words = tuple(text[-n:])
    # Put the spaces back between the words.
    return " ".join(text)

In [142]:
create_sample(create_ttable_words(create_ngrams(g_text_words, n=3)))

'his pugnative choler in the lord of basche said panurge that quoth rondibilis i know handsomely and featly how to discern the least hurt but will wade over acheron styx and cocytus drink whole bumpers of lethes waterthough i mortally hate it as dangerous and pernicious shame whereof he assured us again to study it is that it is in the condition wherein they are busied about nothing but to those who are of a roasted coney all that we saw aristotle holding a parley the horse but you are lodged a cornucopia that amalthaean horn which is able to go about'
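
To inspect the shape of a word-level table, it can help to build one from a single short sentence. The sketch below assumes that the ngrams are tuples of words, as the docstring of `create_ttable_words` expects. The helper `word_ngrams` and the sentence itself are hypothetical.

```python
from collections import Counter

def word_ngrams(words, n):
    """Return each run of n consecutive words as a tuple (hypothetical helper)."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

words = "i know how to drink and i know how to eat".split()
ttable = {}
for ngram in word_ngrams(words, 3):
    # Same shape as create_ttable_words: tuple context -> Counter of 1-tuples.
    ttable.setdefault(ngram[:-1], Counter())[(ngram[-1],)] += 1

print(dict(ttable[("know", "how")]))  # always followed by "to"
print(dict(ttable[("how", "to")]))    # followed by "drink" once, "eat" once

# The final pair of words is never followed by anything in this sentence,
# so it has no row of its own; a sampler that reached it would fail on lookup.
print(tuple(words[-2:]) in ttable)
```

In a text as long as Rabelais', a dead end of this kind is unlikely to be reached, but it is worth keeping in mind when experimenting with shorter texts.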


Chat GPT and its ilk have made the [Turing test](https://en.wikipedia.org/wiki/Turing_test) -- long a trope of science fiction and a topic of serious interest chiefly to computer scientists -- into something of a ubiquitous pastime. Certainly, those of us who regularly use the Internet as a source of information or participate in its discourse communities now face the disconcerting question: has what we're reading, seeing, listening to, etc., been produced by a human being or a computer program? How can we tell? Alan Turing proposed his test as a phenomenological benchmark: any machine that could successfully and reliably fool its human interlocutors into granting it the presumption of human intelligence could, in fact, be considered intelligent (in all relevant respects).

There's a lot to unpack in Turing's philosophical exercise. But as a tool for understanding how [generative AI](https://en.wikipedia.org/wiki/Generative_artificial_intelligence) works, or at least, for approaching the ground from which it springs, Turing's work is arguably less useful than that of his less celebrated contemporary, Claude Shannon.

Working at Bell Labs in the 1940s, Claude Shannon developed the [mathematical theory of communication](https://people.math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf). Although it is often referred to as a "theory of _information_," it is noteworthy that Shannon framed his work as a theory of _communication_. Regardless, the practical significance of Shannon's work is [immense](https://www.quantamagazine.org/how-claude-shannons-information-theory-invented-the-future-20201222/): the mathematical modeling introduced there underpins great swaths of modern telecommunications infrastructure, and it paved the way to our data-saturated digital mediascape. 

Shannon's model is motivated by a practical question: how can we determine the _most efficient means of encoding_ a given message? And although his model has proven relevant to any medium that can be represented digitally, his work was grounded, as Lydia Liu shows, in questions about language, specifically, _printed English_.
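
One standard way to make this question precise is Shannon's entropy: the sum, over every distinct symbol in a message, of `p * log2(1 / p)`, where `p` is the symbol's relative frequency. The sketch below is only an illustration under a strong simplifying assumption, namely that each symbol is drawn independently of its neighbors. Under that assumption, the result is the average number of bits per symbol that an ideal encoding would need. The function name `entropy_bits` is hypothetical.

```python
from collections import Counter
from math import log2

def entropy_bits(message):
    """Entropy of the symbol frequencies in `message`, in bits per symbol."""
    counts = Counter(message)
    total = len(message)
    return sum((c / total) * log2(total / c) for c in counts.values())

print(entropy_bits("aaaa"))  # a single repeated symbol
print(entropy_bits("abab"))  # two equally frequent symbols
print(entropy_bits("abcd"))  # four equally frequent symbols
```

The more evenly the symbols are distributed, the more bits each one costs to encode. Printed English is far from even, and its letters are far from independent, which is what makes the ngram experiments in this document possible.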

### From Claude Shannon to Chat GPT (and back again)

If Turing's guiding concern was to know what kinds of intellectual activity could be automated, Shannon's was quite different: to know whether human language, in its panoply of uses, could be modeled as a [stochastic (random) process](https://en.wikipedia.org/wiki/Stochastic_process). Turing's point of departure lay in the resources of formal logic and mathematical proof; Shannon drew on data and probability. 

Although forms of popular (and even scholarly) imagining about AI continue to draw on a Turing-esque framework, wherein the primary concern is with the meaning of intelligence, the "unreasonable effectiveness" of [large language models](https://en.wikipedia.org/wiki/Large_language_model) hearkens back to Shannon's experiments on the probabilistic modeling of English prose. And while we certainly couldn't build Siri or Chat GPT using just Shannon's insights, his methods might be regarded as an early exercise in machine learning. Could we also say that the digital humanities treads this same ground?

In this interactive document, we'll use the [Python programming language](https://www.python.org) to reproduce a couple of Shannon's experiments, in the hopes of pulling back the curtain a bit on what seems to many (and not unreasonably) like evidence of a ghost in the machine. But the aim is not necessarily to demystify experiences of generative AI. I, for one, do find many of these experiences haunting, but I'm not sure the haunting happens where AI's prominent boosters claim that it does.

The material that follows draws on and is inspired by my reading of Lydia Liu's _The Freudian Robot_.