In [None]:
%load_ext jupyter_black

In [None]:
%load_ext rich

In [None]:
%load_ext memory_profiler

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
from rich.jupyter import print
from rich.pretty import pprint
from typing import List, Set
import spacy

In [None]:
!jupyter --version

In [None]:
print("Spacy version:", spacy.__version__)

# Finding "Tom Swifties": Part of Speech Puns in Sci-Fi Novellas of the 1910s

*Using computational linguistics to list pun candidates from book texts*

Starting from 1910, Howard Garis wrote a series of popular Sci-Fi Adventure novellas under the pen name of Victor Appleton. The main character was [Tom Swift](https://en.wikipedia.org/wiki/Tom_Swift), a young inventor living in upstate New York. 

<img src="TomSwiftMotorcycleSmallCropped.jpg" width="200" align="right" style="margin-left: 1em;">

Litovkina (2014) analyzed a pun construction which Garis employed prolifically in the series. She classified these puns as a type of "[Wellerism](https://en.wikipedia.org/wiki/Wellerism)", a comedic device in discourse. Garis' frequent use of his specific sub-type of Wellerism helped his puns to become known as "Tom Swifties". Litovkina defines a Tom Swifty as:

> a wellerism conventionally based on the punning relationship between the way an adverb describes a speaker and simultaneously refers to the meaning of the speaker’s statement. The speaker is traditionally Tom, his statement is usually placed at the beginning of the Tom Swifty, and the adverb at the end of it, e.g. “I see,” said Tom icily (icily vs. I see).

---

T. Litovkina, A. (2014). “I see,” said Tom icily: Tom Swifties at the beginning of the 21st century. *The European Journal of Humour Research*, 2(2), 54-67. https://doi.org/10.7592/EJHR2014.2.2.tlitovkina

## Goal

Litovkina doesn't claim to have compiled the data she analyzed herself. Instead, she consulted compilations from the internet, which she lists on the last page of her paper. Many of the links no longer work. http://www.fun-with-words.com/tom_swifties_a-e.html is a decent but seemingly small compilation of the puns, alphabetized by adverbial phrase. But it's unclear who compiled them, how they were found, or whether the list is definitive or merely a selection.

To answer that question, we will conduct our own search for *Tom Swifties* with tools from computational linguistics. 

<!-- We'll ttempt to compile a full set of  from a corpus of the novellas themselves.  -->

<!-- We'll eventually produce a list of Tom Swifty candidates from a selection portion of the text after training a model to classify sentences as either containing a Tom Swifty or not containing one.  -->

## Start the Search

To begin the search, we'll build a corpus of documents and preprocess the raw text to remove Gutenberg boilerplate, blank lines and tables of contents, since TOCs usually don't contain puns.

### Building the Corpus

We'll download the first 5 novellas from Project Gutenberg; all were published in 1910:

- Tom Swift and His Motor Cycle (Gutenberg id = 4230)
- Tom Swift and His Motor Boat (id = 2273)
- Tom Swift and His Airship (id = 3005)
- Tom Swift and His Submarine Boat (id = 949)
- Tom Swift and His Electric Runabout (id = 950)

Let's begin with our custom class, `GutenbergTextProvider`. There are some existing packages which clean texts from Project Gutenberg, but one requires installing a database and the other uses NLTK, so let's just keep it simple for today.

The class's `fetch()` method preprocesses the text internally before storing it in a class property.
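
The internals of `GutenbergTextProvider` aren't shown here, so the following is only a minimal sketch of what such preprocessing might look like, not the class's actual code. It assumes the `*** START OF ... ***` and `*** END OF ... ***` marker lines commonly found in Project Gutenberg plain-text files; the function name `strip_gutenberg_boilerplate` is hypothetical, and table-of-contents removal is left out.

```python
import re

# Marker lines as they commonly appear in Project Gutenberg plain-text files.
# Assumption: some files say "THIS" instead of "THE"; other variants may exist.
_START = re.compile(
    r"^\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.I | re.M
)
_END = re.compile(
    r"^\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.I | re.M
)


def strip_gutenberg_boilerplate(raw: str) -> str:
    """Return the text between the START and END markers, minus blank lines."""
    start = _START.search(raw)
    begin = start.end() if start else 0
    end = _END.search(raw, begin)
    finish = end.start() if end else len(raw)

    # Drop surrounding whitespace and blank lines from the body.
    lines = (line.strip() for line in raw[begin:finish].splitlines())
    return "\n".join(line for line in lines if line)
```

If a marker is missing, the sketch falls back to the start or end of the file rather than raising, which seemed the gentler choice for a first pass.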

In [None]:
from src.tomswiftr.text_provider import GutenbergTextProvider

In [None]:
text_source_ids = [
    4230,
    2273,
    3005,
    949,
    950,
]

provider = GutenbergTextProvider()

# Let's start with just the first novella to test things out.
text = provider.fetch(4230)

### Dipping our toes into the waters

Let's examine the first paragraph or so:

In [None]:
provider.get_text(to_char=1000)

That looks fairly good; looks like we got part of the chapter title included, but I judge that this won't impede our pursuit of Tom Swifties, so let's move on. 

Now, let's convert this text into an array of sentences with the help of SpaCy and inspect just the first 10 for sanity's sake.

In [None]:
from spacy.language import Language
from spacy.tokens import Doc, Token

nlp: Language = spacy.load("en_core_web_sm")
# Rule-based sentence splitter. Sentence splitting performs far better with a
# statistical model like en_core_web_sm loaded, whose `parser` component also
# supplies the dependency parse used below.
nlp.add_pipe("sentencizer")

doc: Doc = nlp(provider.get_text(to_char=1000))

# doc.sents  # it's a generator, yeah!
for idx, s in enumerate(doc.sents):
    print(f"[{idx}] {s}", soft_wrap=True)
    if idx == 9:  # indices 0-9 are the first 10 sentences
        break

Sentence **7** has the shape we might expect for a Tom Swifty candidate. We notice that SpaCy will separate sentences within quotes, but we don't expect many candidates to fit that shape, if any at all. Let's restrict our search to only sentences starting with a quote.
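
The quote test can also be factored into a small helper that works on plain strings. This is a sketch under assumptions: the helper names are hypothetical, and the curly opening quote is included only in case an edition uses typographic quotes, which I haven't checked for these texts.

```python
from itertools import islice
from typing import Iterable, Iterator, List

# Straight and curly opening double quotes; which ones occur depends on the edition.
OPENING_QUOTES = ('"', "\u201c")


def starts_with_quote(sentence: str) -> bool:
    """True if the sentence opens with a double quote, ignoring leading whitespace."""
    return sentence.lstrip().startswith(OPENING_QUOTES)


def quoted_sentences(sentences: Iterable[str]) -> Iterator[str]:
    """Lazily yield only the sentences that open with a quotation mark."""
    return (s for s in sentences if starts_with_quote(s))


def first_n(items: Iterable[str], n: int) -> List[str]:
    """Take at most n items without consuming the rest of the iterable."""
    return list(islice(items, n))
```

With a parsed `doc`, something like `first_n(quoted_sentences(s.text for s in doc.sents), 10)` would return the first ten quoted sentences without walking the whole book.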

In [None]:
# Get the entire book's text
doc: Doc = nlp(provider.get_text())

In [None]:
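# doc.sents: Iterator[Span]
# s: spacy.tokens.span.Span
shown = 0
for s in doc.sents:
    if s.text.startswith('"'):
        print(f"[{s.start}-{s.end}] {s}", soft_wrap=True)
        shown += 1

    # Stop after the first 10 quoted sentences; checking outside the `if`
    # guarantees the loop ends even when a given sentence is not quoted.
    if shown == 10:
        break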

OK, that's great and all, but let's see what named entities we can find in the same span. We're still treading gingerly for now.

In [None]:
shown = 0
for s in doc.sents:
    if s.text.startswith('"'):
        print(f"[{s.start}-{s.end}] {s}", soft_wrap=True)

        # Span.ents = The named entities that fall completely within the span.
        #  Returns a tuple of Span objects.
        print(s.ents)
        shown += 1

    # Stop after the first 10 quoted sentences.
    if shown == 10:
        break

`Whoop` as a named entity? That's a little embarrassing. Well, we can restrict ourselves to only sentences where one of the named entities is Tom, at least this time, so let's do that.

### Candidates where Tom is the speaker

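But how can we ensure we're targeting the clause outside the quote? We need to rely on dependency parsing and find only sentences where "Tom" is the nsubj of the clause's speech verb. Luckily, the `en_core_web_sm` pipeline we loaded includes a dependency parser (the Sentencizer itself only splits sentences), so that parse is already attached to every token. Let's start by narrowing down to the sentences that mention Tom: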

In [None]:
# TODO use filter() instead?
for idx, s in enumerate(doc.sents):
    if s.text.startswith('"'):
        for ne in s.ents:
            if "Tom" in ne.text:
                print(f"[{s.start}-{s.end}] {s}", soft_wrap=True)
                pprint(s.ents)

Upon visual inspection, I'd say the Span covering tokens 1178:1202 presents a great construction for us to look at more closely. It's clearly not a pun, but we should be able to code against this pattern: a verbal root with Tom as its subject and an adverb dependent on that speech verb.
> "He thinks he can run over everything since he got his new auto," commented Tom aloud as he rode on.

In [None]:
for token in doc[1178:1202]:
    print(
        token.text,
        token.pos_,
        token.tag_,
        token.dep_,
        token.shape_,  # orthographic shape, e.g. "Xxx" for "Tom"
    )

Well, would you look at that! `commented VERB VBD ROOT xxxx`: it even parsed the dependencies and singled out the speech verb we want to target as **ROOT**. "Ain't that grand," said Michael largely.

Perfect! Almost... We can see for ourselves that it's a speech verb, but our code doesn't know that. Let's change that.

Do note however, that there are some oddities which we'll need to deal with soon enough:

- Tom PROPN NNP compound Xxx
- aloud NOUN NN nsubj xxxx

`aloud` should be an adverb and we dearly need it to be in this exploration. We also need `Tom` to be the nsubj. Yet, SpaCy thinks `Tom` is a compound and categorizes `aloud` as the nsubj!

Let's examine a diagram of this in two forms; one is a traditional dependency arc graph and the other merely highlights the named entities in the sentence.

In [None]:
from spacy import displacy

In [None]:
displacy.render(doc[1178:1202], style="dep")

In [None]:
displacy.render(doc[1178:1202], style="ent")

### Narrowing the search

We're getting closer, even though the above example is not comforting. Let's start implementing code to only search for the pattern we really want. Let's start with the root verb.

> the most frequently used verbs introducing the text of the statement in Tom Swifties is said. (57)

> Nevertheless, there are a number
of other verbs employed for this purpose as well, e.g.: admitted, agreed, asked, bemoaned,
considered, consented, cried, debated, decided, discovered, guessed, implied, mused, nagged,
pleaded, pretended, professed, queried, recounted, remarked, replied, revealed, reviewed,
sang, and yelled. (57)

Our first pattern will be to find sentences with a quotation followed by the named entity "Tom" (his last name seems rarely used in the text) and a verb denoting a speech act linked to an adverb. Note that the verb does not always denote speech as explicitly as *said* does; it may be slightly parenthetical as well, like *guessed*.

As such, the verb for which "Tom Swift" is an *nsubj*, and which the adverb modifies, will be one of those verbs listed above.
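
As a parser-free cross-check, the surface pattern just described can be approximated with a regular expression. This is a minimal sketch, not the method used in the rest of this notebook: the function names are hypothetical, it only recognizes straight double quotes, and it treats any word ending in *-ly* as an adverb, so it will miss adverbs like *aloud*.

```python
import re
from typing import Iterable, Optional, Tuple


def build_swifty_pattern(verbs: Iterable[str], speaker: str = "Tom") -> re.Pattern:
    """Compile a regex for: "<statement>" <verb> <speaker> <word ending in -ly>."""
    verb_alt = "|".join(re.escape(v) for v in sorted(verbs))
    name = re.escape(speaker)
    return re.compile(
        r'^"(?P<quote>[^"]+)"\s+'  # the quoted statement
        # Either "said Tom" or "Tom said":
        rf"(?:(?:{verb_alt})\s+{name}|{name}\s+(?:{verb_alt}))\s+"
        r"(?P<adverb>[A-Za-z]+ly)\b"  # crude adverb test: a word ending in -ly
    )


def match_candidate(sentence: str, pattern: re.Pattern) -> Optional[Tuple[str, str]]:
    """Return (quote, adverb) if the sentence fits the surface pattern, else None."""
    m = pattern.match(sentence.strip())
    return (m.group("quote"), m.group("adverb")) if m else None


pattern = build_swifty_pattern({"said", "replied", "remarked"})
print(match_candidate('"Oh, it\'s just a knack," replied Tom modestly.', pattern))
```

A regex like this knows nothing about grammar, but it gives a baseline count to compare the parser-based search against.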

In [None]:
speech_verbs: Set[str] = {
    "admitted",
    "agreed",
    "asked",
    "bemoaned",
    "considered",
    "consented",
    "cried",
    "debated",
    "decided",
    "discovered",
    "guessed",
    "implied",
    "mused",
    "nagged",
    "pleaded",
    "pretended",
    "professed",
    "queried",
    "recounted",
    "remarked",
    "replied",
    "retorted",
    "revealed",
    "reviewed",
    "said",
    "sang",
    "yelled",
}

In [None]:
candidates = []

for s in doc.sents:
    if not s.text.startswith('"'):
        continue

    # Keep only sentences in which some named entity mentions Tom.
    if not any("Tom" in ne.text for ne in s.ents):
        continue

    # Now search for a speech verb which is the root of the dependency parse
    for token in s:
        if (
            token.pos_ == "VERB"
            and token.tag_ == "VBD"  # should be past tense
            and token.dep_ == "ROOT"  # must be root
            and token.text in speech_verbs
        ):
            # sentence qualifies; add it once and move on to the next sentence
            candidates.append(s)
            break

print(len(candidates))
pprint(candidates)

99 candidates! Not a bad list, but it's clear as we restrict our list more tightly, that number is likely to go way down.

So then, let's finally return candidate sentences which have that adverb modifying the root speech verb which we've mentioned a few times already.

In [None]:
for i, s in enumerate(doc.sents):
    if s.text.startswith('"'):
        root_token = s.root

        if (
            root_token.pos_ == "VERB"
            and root_token.tag_ == "VBD"
            and root_token.text in speech_verbs
        ):
            subj, adv = None, None
            for k, child in enumerate(root_token.children):
                # Does the root have a subj and is that subj Tom?
                if (
                    child.dep_ == "nsubj"
                    and child.pos_ == "PROPN"
                    and child.text == "Tom"
                ):
                    # tag_ is NNP
                    subj = child

                # Furthermore, Does the root have a child adverbial?
                if child.dep_ == "advmod":
                    adv = child

            if subj and adv:
                print(f"---------{i} is a candidate---------")
                print(subj, adv)
                print(s)

I find it hard to believe that there are only 10 instances in the whole book. Maybe we should revisit the above findings regarding `compound` and check those. Notably, a sentence from our Great List of 99 above that should qualify is missing from our list of candidates:

> "Oh, it's just a knack," replied Tom modestly.

Why?

Let's parse out this exact sentence from scratch and maybe we will find a clue leading us to the villain of the story!

In [None]:
doc2: Doc = nlp('"Oh, it\'s just a knack," replied Tom modestly.')

# raw text
print(doc2)
displacy.render(doc2, style="dep")

Oh, it thinks Tom is npadvmod, a "noun phrase as adverbial modifier"...ummmm...what are the POS tags, though?

In [None]:
print(doc2.to_json())

for token in doc2:
    print(
        token.text,
        token.pos_,
        token.tag_,
        token.dep_,
        token.head,
    )

The POS looks right for the verb, for "Tom" and "modestly":

- replied VERB VBD ROOT replied
- Tom PROPN NNP npadvmod replied
- modestly ADV RB advmod replied

Maybe I should rely less on the dependency parse? Perhaps some complicated code relying on a combination of dependencies and POS and/or tags? But, shouldn't code be elegantly simple and not headache-inducing? 
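
One way that combination might look, as a hedged sketch rather than a tested solution: accept `npadvmod` alongside `nsubj` for the speaker, and trust the POS tag for the adverb instead of requiring an `advmod` dependency. The function name is hypothetical, and the code is written against stand-in objects that mimic spaCy's attribute names (`text`, `pos_`, `dep_`, `children`). The earlier `compound` case is not covered, since there Tom may not be attached to the root at all.

```python
from types import SimpleNamespace
from typing import Any, Optional, Tuple

# Assumption: besides nsubj, accept the npadvmod label the parser gave "Tom" above.
SUBJECT_LIKE_DEPS = {"nsubj", "npadvmod"}


def find_speaker_and_adverb(
    root: Any, speaker: str = "Tom"
) -> Optional[Tuple[Any, Any]]:
    """Return (speaker, adverb) from the root verb's children, or None."""
    subj = adv = None
    for child in root.children:
        if (
            child.text == speaker
            and child.pos_ == "PROPN"
            and child.dep_ in SUBJECT_LIKE_DEPS
        ):
            subj = child
        elif child.pos_ == "ADV":
            # Trust the POS tag rather than requiring an advmod dependency.
            adv = child
    if subj is None or adv is None:
        return None
    return subj, adv


# Stand-ins mirroring the parse of '... replied Tom modestly.' reported above.
tom = SimpleNamespace(text="Tom", pos_="PROPN", tag_="NNP", dep_="npadvmod")
modestly = SimpleNamespace(text="modestly", pos_="ADV", tag_="RB", dep_="advmod")
replied = SimpleNamespace(
    text="replied", pos_="VERB", tag_="VBD", children=[tom, modestly]
)

pair = find_speaker_and_adverb(replied)
print(pair[0].text, pair[1].text)  # Tom modestly
```

Loosening the rule this way will likely admit false positives, so the resulting list would still need a visual pass.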

## A Premature Ending or, "Evaluating the Model"

So far, we've found no puns in the first book, even from visual inspection. I'm reluctant to continue further since it's clear that SpaCy's dependency parse is lacking for this kind of sentence. The GitHub issues are filled with tickets mentioning this. These reports state that the deficiencies remain no matter which size of the model is used (`en_core_web_sm`, `en_core_web_md` or `en_core_web_lg`).

I can only imagine Tom Swift dusting himself off and staying in his lab or taking to the skies to experiment ceaselessly, so in that spirit, let's discuss my thoughts on future directions for this project.

### Future Follow-ons

#### Evolution over time

Garis ghost-wrote the Tom Swift series until 1935, producing 35 novellas. I'd like to expand this exploration to analyze his output of Tom Swifties over that long timespan. 

- Does the frequency of Tom Swifty usage change over that 25-year span?
- Did the grammatical shape of these puns change over time?

So far, I wonder whether he only started this trend in later books. One might test that by repeating the above process with some of the other Gutenberg IDs listed near the start of this exploration.
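
To compare books of different lengths, raw candidate counts would need normalizing. A minimal sketch, assuming we record a candidate count and a sentence count per publication year; both function names are hypothetical, and no real counts are filled in here.

```python
from typing import Dict, Tuple


def candidates_per_thousand(candidates: int, sentences: int) -> float:
    """Candidate sentences per 1,000 sentences of text."""
    if sentences <= 0:
        raise ValueError("sentences must be positive")
    return 1000 * candidates / sentences


def rate_by_year(counts: Dict[int, Tuple[int, int]]) -> Dict[int, float]:
    """Map year -> rate, given year -> (candidate_count, sentence_count)."""
    return {
        year: candidates_per_thousand(found, total)
        for year, (found, total) in sorted(counts.items())
    }
```

The resulting dictionary is ordered by year, so it could be plotted directly to look for a trend.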

#### Candidates where someone else is the speaker

> Traditionally Tom is the speaker, but this is by no means necessary for the pun to classify as a Tom Swifty. Sometimes the pun lies in the name, in which case it will usually not be Tom speaking

#### NLP to explain the puns (when found!)

Litovkina states that most adverbs in Tom Swifties have either paronyms or homonyms (56). A paronym describes "words which are linked by a similarity of form" (Oxford Concise Dictionary of Linguistics, p. 288).

She also says that,

> either explicitly used in the statement of it or only implicitly implied. (56)
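
Taking "similarity of form" literally, a first rough pass at finding paronym candidates could rank the words of the quoted statement by how closely their spelling resembles the adverb. This is only a sketch using the standard library's `difflib`; the function names are hypothetical, and spelling similarity is a crude proxy. A sound-based pun like *icily* vs. *I see* would need phonetic transcriptions instead.

```python
from difflib import SequenceMatcher
from typing import Iterable, List, Tuple


def form_similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] based on shared character runs, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def rank_paronym_candidates(
    adverb: str, quote_words: Iterable[str]
) -> List[Tuple[str, float]]:
    """Rank the words of the quoted statement by similarity to the adverb."""
    scored = [(word, form_similarity(adverb, word)) for word in quote_words]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

A threshold on the score would be needed to turn the ranking into a yes/no decision, and choosing it would take some experimentation.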




<!-- #### Finding Paronyms -->
<!-- A paronym is <???> -->

<!-- #### Finding Homonyms -->
<!-- A homonym is <???> -->

<!-- #### Finding Homophones -->
<!-- A homophone is <???> -->

---

## Notes

1. Litovkina mistakenly names Stratemeyer as the ghostwriter.