# Analyzing Franz Kafka's _The Metamorphosis_ with NLP

* Kafka's _Metamorphosis_ is commonly interpreted by humans as giving insights into the meaning of personal identity
* The conclusions that we can draw from texts are limited by our uniquely human preconceptions and rhetorical weaknesses
* Guiding question: What happens when computers read philosophy? That is, if we apply machine learning and linguistic approaches from natural language processing (NLP) to analyzing _The Metamorphosis_, how will the results support or contrast with our human reading?
* Read on to see what computers think about Kafka...

In [2]:
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize
import random
import numpy as np
import regex as re
import pandas as pd
from matplotlib import pyplot as plt

The text is sourced from Project Gutenberg (David Wyllie translation): http://www.gutenberg.org/files/5200/5200-h/5200-h.htm

In [3]:
# Read the whole text; the with-block closes the file automatically.
with open("5200-h.txt", "r") as file:
    content = file.read()

Split the messy string into individual word tokens with some regex magic. The pattern `\w+` drops punctuation, so contractions break into fragments (e.g. the stray 't' in the sample below).

In [4]:
words = re.findall(r'\w+', content)

In [5]:
random.sample(words, 15)

['it',
 'he',
 'flowed',
 'as',
 'but',
 'the',
 'violin',
 'her',
 't',
 'not',
 'and',
 'and',
 'know',
 'seemed',
 'one']

Lots of words!

In [6]:
len(words)

22383
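
To see why fragments like 't' and 's' show up, here is a minimal sketch of the same `\w+` tokenization on a made-up sentence (the sentence is an illustration, not a quote from the text). It uses the standard-library `re` module rather than `regex`, and the apostrophe-aware pattern is only one possible alternative. It assumes straight apostrophes, which may not match the typography of the source file.

```python
import re

# Hypothetical sample sentence, not taken from the novella.
sample = "Gregor couldn't open the door; it wasn't his sister's."

# Same pattern as the notebook: runs of word characters only.
tokens = re.findall(r"\w+", sample)
print(tokens)

# Assumed alternative: keep straight-apostrophe contractions together.
tokens_keep = re.findall(r"\w+(?:'\w+)*", sample)
print(tokens_keep)
```

With the second pattern, "couldn't" stays one token, so the 't' and 's' fragments would no longer be counted as words.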

Count how many times each word occurs. Words are converted to lowercase to simplify things.

In [7]:
def get_word_freqs(words):
    freqs = {}
    for word in words:
        lower = word.lower()
        if lower in freqs:
            freqs[lower] += 1
        else:
            freqs[lower] = 1
    return freqs

In [8]:
word_freqs_dict = get_word_freqs(words)
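
As an aside, the same lowercase counting can probably be written more compactly with `collections.Counter` from the standard library. This is a sketch of an equivalent approach, not a replacement for the function above. The name `get_word_freqs_counter` and the toy word list are made up for illustration.

```python
from collections import Counter

def get_word_freqs_counter(words):
    """Count lowercase occurrences of each word (hypothetical helper)."""
    return Counter(word.lower() for word in words)

# Toy input, not taken from the novella.
sample_words = ["The", "room", "the", "Room", "door"]
freqs = get_word_freqs_counter(sample_words)
print(freqs)
```

A `Counter` is a dict subclass, so it can be passed to `pd.Series` the same way as the plain dictionary.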

Convert the dictionary of frequencies to a Pandas Series (nicer for lots of data). Note that there are far fewer unique than total words.

In [9]:
word_freqs = pd.Series(word_freqs_dict)
len(word_freqs)

2578
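
One way to put a number on "far fewer unique than total words" is the ratio of unique words to total words. For the counts above that is 2578 / 22383, or roughly 0.115. Below is a small sketch of that calculation on toy data; the helper name `type_token_ratio` is an assumption for illustration.

```python
def type_token_ratio(tokens):
    """Share of distinct lowercase tokens among all tokens (hypothetical helper)."""
    if not tokens:
        return 0.0  # avoid dividing by zero on empty input
    lowered = [t.lower() for t in tokens]
    return len(set(lowered)) / len(lowered)

# Toy input: 3 distinct words among 6 tokens.
toy_tokens = ["The", "door", "the", "room", "the", "door"]
ratio = type_token_ratio(toy_tokens)
print(ratio)  # 0.5
```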

The 30 most frequently occurring words. By far the most common word is "the", which isn't surprising given its grammatical importance. In fact, most of these words aren't specific to _The Metamorphosis_. Interestingly, "room" appears more often even than generally common words "be" and "could", which suggests its significance in the story.

In [16]:
word_freqs['room']

131

In [33]:
word_freqs.nlargest(30)

the       1148
to         753
and        642
he         590
his        550
of         429
was        409
it         370
had        352
in         348
that       345
gregor     298
a          285
as         242
she        200
with       199
s          194
him        188
her        187
would      187
not        176
but        171
at         169
for        166
they       160
on         152
all        135
room       131
be         119
could      119
dtype: int64

Since many of these words aren't very interesting, let's forget about them for now. These words are called <a href="https://en.wikipedia.org/wiki/Stop_words" target="_blank">stop words</a>, and the NLP library nltk conveniently includes a list of them so we can exclude them.

Some example stop words. Some may be word fragments due to how nltk parses words.

In [19]:
stop_words = set(stopwords.words('english'))
# random.sample needs a sequence; passing a set fails on Python 3.11+
random.sample(sorted(stop_words), 10)

['off', 'in', 'whom', "needn't", 'all', 'same', 'than', 'between', 'they', 'd']

Yay, that annoying "the" will be gone!

In [22]:
'the' in stop_words

True

Exclude the stop words to create a new series with only the salient words. Now most of the common words are actually meaningful, although a few generic ones such as "would" and "could" survive because they aren't on nltk's list. We see the main characters at the top: Gregor and his father, sister, and mother. "Grete" doesn't appear on the list at all, reflecting Kafka's tendency to describe her in relation to Gregor (e.g. "his sister") rather than as her own character. That "room" appears at the top along with the important people suggests that it could be considered a character in itself: it plays an essential role in Gregor's development as he lives out his bug life, and by raw count it is mentioned more often than any single family member. "Time" appearing frequently is indicative of 

In [24]:
to_drop = [w for w in word_freqs.index if w in stop_words]
word_freqs_salient = word_freqs.drop(to_drop)
word_freqs_salient.nlargest(30)

gregor     298
would      187
room       131
could      119
father     102
sister     101
mother      89
door        87
back        82
even        80
one         72
way         62
time        59
said        51
little      49
first       44
get         44
still       44
go          40
made        40
see         39
without     39
head        38
like        38
much        38
chief       37
clerk       37
open        35
samsa       34
away        32
dtype: int64
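
The filtering step boils down to "drop every key that is in the stop list, then rank what is left". Here is a minimal pure-Python sketch of that idea. The counts are copied from the tables above, but the three-word stop list and the helper names `drop_stop_words` and `top_n` are made up for illustration and stand in for nltk's much longer list.

```python
def drop_stop_words(freqs, stop_words):
    """Return a copy of freqs without any stop-word keys (hypothetical helper)."""
    return {word: count for word, count in freqs.items() if word not in stop_words}

def top_n(freqs, n):
    """Return the n most frequent (word, count) pairs, ties broken alphabetically."""
    return sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

# Counts taken from the frequency table above; stop list is a toy subset.
toy_freqs = {"the": 1148, "to": 753, "gregor": 298, "room": 131, "be": 119}
toy_stop_words = {"the", "to", "be"}

salient = drop_stop_words(toy_freqs, toy_stop_words)
print(top_n(salient, 2))  # [('gregor', 298), ('room', 131)]
```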

In [None]:
# Positions of every "Gregor" token (case-insensitive, like the lowercased counts above)
gregor_indices = [i for i, x in enumerate(words) if x.lower() == 'gregor']

In [None]:
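# Count mentions of Gregor in consecutive 100-word bins, keyed by each bin's start index.
bin_size = 100
gregor_over_time = {}
j = 0
for start in range(0, len(words), bin_size):
    gregor_over_time[start] = 0
    # gregor_indices is sorted, so one pass assigns each mention to its own bin
    # (including the final partial bin).
    while j < len(gregor_indices) and gregor_indices[j] < start + bin_size:
        gregor_over_time[start] += 1
        j += 1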

In [None]:
gregor_over_time

In [None]:
plt.plot(list(gregor_over_time.keys()), list(gregor_over_time.values()))
plt.xlabel("Position in text (word index)")
plt.ylabel('Mentions of "Gregor" per 100 words')
plt.show()
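
The binning behind this plot can also be sketched with integer division, which may be easier to check by hand than the two-pointer loop. This is an illustrative alternative under stated assumptions, not the notebook's method: the helper name `count_per_bin` and the toy positions are made up, and every position is assumed to be smaller than `total_length`.

```python
def count_per_bin(positions, total_length, bin_size=100):
    """Count positions per fixed-size bin, keyed by bin start (hypothetical helper)."""
    # Create every bin up front so that empty bins show up as zeros.
    counts = {start: 0 for start in range(0, total_length, bin_size)}
    for pos in positions:
        # Integer division maps a position to the start of its bin.
        counts[(pos // bin_size) * bin_size] += 1
    return counts

# Toy positions in a 300-token text.
toy_positions = [3, 50, 99, 100, 250]
toy_counts = count_per_bin(toy_positions, total_length=300, bin_size=100)
print(toy_counts)  # {0: 3, 100: 1, 200: 1}
```

Because each bin is keyed by its start index, the result can be passed to `plt.plot` in the same way as `gregor_over_time`.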