# Analyzing Franz Kafka's _The Metamorphosis_ with NLP

* Kafka's _The Metamorphosis_ is commonly read as offering insight into the meaning of personal identity
* The conclusions we can draw from a text are limited by our human preconceptions and rhetorical weaknesses
* Guiding question: What happens when computers read philosophy? That is, if we apply machine learning and linguistic approaches from natural language processing (NLP) to analyzing _The Metamorphosis_, how will the results support or contrast with our human reading?
* Read on to see what computers think about Kafka...

In [34]:
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize
import random
import numpy as np
import regex as re
import pandas as pd
from matplotlib import pyplot as plt

The text is sourced from Project Gutenberg (David Wyllie translation): http://www.gutenberg.org/files/5200/5200-h/5200-h.htm

In [2]:
# Read the whole text into one string; the context manager closes the file afterwards.
with open("5200-h.txt", "r") as file:
    content = file.read()

Split the messy raw string into individual word tokens with nltk's `word_tokenize`. The tokens include punctuation marks and contraction fragments such as `'s`.

In [50]:
words = word_tokenize(content)

In [51]:
random.sample(words, 15)

['a',
 'room',
 'Ah',
 'his',
 'the',
 'out',
 'darkness',
 'but',
 'his',
 'not',
 'Gregor',
 'the',
 'her',
 'foot',
 'how']

Lots of words!

In [52]:
len(words)

24973

Count how many times each word occurs. Words are converted to lowercase to simplify things.

In [53]:
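def get_word_freqs(words):
    """Count occurrences of each word, ignoring case."""
    freqs = {}
    for word in words:
        # Lowercase before the lookup so the membership test and the key agree.
        key = word.lower()
        if key in freqs:
            freqs[key] += 1
        else:
            freqs[key] = 1
    return freqs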

In [54]:
word_freqs = get_word_freqs(words)
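The same case-insensitive count can also be sketched with the standard library's `collections.Counter`. This is a minimal illustration rather than the notebook's own method; `count_words_lower` and the sample token list are hypothetical names introduced here.

```python
from collections import Counter

def count_words_lower(tokens):
    """Count tokens case-insensitively by lowercasing each one before counting."""
    return Counter(token.lower() for token in tokens)

# Hypothetical sample tokens, not taken from the real tokenizer output.
sample_tokens = ["The", "door", "the", "Door", "room"]
sample_counts = count_words_lower(sample_tokens)
print(sample_counts.most_common(2))
```

A `Counter` is a `dict` subclass, so it can be passed to `pd.Series` the same way as a plain dictionary.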

Convert the dictionary of frequencies to a Pandas Series (nicer for lots of data). Note that there are far fewer unique than total words.

In [55]:
word_freqs = pd.Series(word_freqs)
len(word_freqs)

2591

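The 20 most frequently occurring tokens. Punctuation aside, the most common word is "to", which isn't surprising given its grammatical importance. In fact, most of these words aren't specific to _The Metamorphosis_. Interestingly, "room" appears more often even than "could" or "have", which suggests its significance in the story. Note that "the", "and" and "he" are missing from the list: in the recorded run, each capitalized occurrence reset that word's tally, so the counts shown undercount any word that also appears capitalized and should be read as approximate.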

In [57]:
word_freqs.nlargest(20)

,         1293
to         753
.          700
of         429
's         194
him        188
a          186
not        177
had        173
;          170
``         147
room       131
could      127
''         115
have       109
would      105
been       101
sister     101
be          89
door        87
dtype: int64

Since many of these words aren't very interesting, let's forget about them for now. These words are called <a href="https://en.wikipedia.org/wiki/Stop_words" target="_blank">stop words</a>, and the NLP library nltk conveniently includes a list of them so we can exclude them. Additionally, we'll filter out the punctuation and some contraction fragments that weren't included.

Some example stop words. Some may be word fragments due to how nltk parses words.

In [69]:
stop_words = set(stopwords.words('english'))
punctuation = {',','.',';','``',"''",'?','!',"'",'(',')','-','...',':'}
contraction_fragments = {"'ave","'d","'ll","'m","'re","'s","'ve"}
stop_words.update(punctuation)
stop_words.update(contraction_fragments)
# random.sample needs a sequence; passing a set raises TypeError on Python 3.11+.
random.sample(sorted(stop_words), 10)

['she', 'only', 'can', 'about', 'will', 'that', 'all', 'why', 'from', "didn't"]

Yay, that annoying "to" will be gone!

In [70]:
'to' in stop_words

True

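Exclude the stop words to create a new series with only the salient words. Now most of the top words are actually meaningful! A few less informative tokens remain, such as "could", "would" and the fragment "n't", because they aren't in the stop list.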

In [76]:
to_drop = [w for w in word_freqs.index if w in stop_words]
word_freqs_salient = word_freqs.drop(to_drop)
word_freqs_salient.nlargest(20)

room       131
could      127
would      105
sister     101
door        87
back        82
n't         62
way         62
time        59
said        51
little      49
get         44
still       44
even        43
go          40
made        40
see         39
without     39
head        38
much        38
dtype: int64
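The filtering step itself doesn't depend on pandas. As a rough sketch, assuming the counts live in a plain dictionary, the same exclusion can be written as a dict comprehension. `drop_stop_words` and the toy inputs are hypothetical and only illustrate the idea.

```python
def drop_stop_words(freqs, stop_words):
    """Return a new dict with every key found in stop_words removed."""
    return {word: count for word, count in freqs.items() if word not in stop_words}

# Hypothetical miniature inputs for illustration only.
toy_freqs = {"to": 5, "room": 3, ",": 7, "sister": 2}
toy_stop_words = {"to", ","}
toy_salient = drop_stop_words(toy_freqs, toy_stop_words)

# Sort the remaining words by count, largest first.
top_words = sorted(toy_salient.items(), key=lambda item: item[1], reverse=True)
print(top_words)
```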

In [None]:
gregor_indices = [i for i,x in enumerate(words) if x == 'Gregor']

In [None]:
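# Count occurrences of "Gregor" in consecutive windows of 100 tokens.
# Each key is the index of the first token in its window.
window = 100
gregor_over_time = {}
j = 0
for start in range(0, len(words), window):
    gregor_over_time[start] = 0
    # gregor_indices is sorted, so one pass with a moving pointer is enough.
    while j < len(gregor_indices) and gregor_indices[j] < start + window:
        gregor_over_time[start] += 1
        j += 1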
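The windowed count can also be sketched with list slicing, which trades a little speed for readability. This is an illustrative variant, not the notebook's own code; `count_per_window` and the toy token list are hypothetical.

```python
def count_per_window(tokens, target, window=100):
    """Map each window's start index to the number of times target occurs in it."""
    counts = {}
    for start in range(0, len(tokens), window):
        # Slicing past the end of the list is safe, so the last window may be shorter.
        counts[start] = tokens[start:start + window].count(target)
    return counts

# Hypothetical toy tokens with a window of 2 to keep the example small.
toy_tokens = ["Gregor", "woke", "Gregor", "slept", "sister"]
toy_counts = count_per_window(toy_tokens, "Gregor", window=2)
print(toy_counts)
```

On the real data this would be called as `count_per_window(words, 'Gregor')`, assuming `words` is the token list produced earlier.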

In [None]:
gregor_over_time

In [None]:
plt.plot(list(gregor_over_time.keys()), list(gregor_over_time.values()))
plt.xlabel("Token position")
plt.ylabel('Occurrences of "Gregor" per 100 tokens')
plt.show()