# Analyzing Franz Kafka's _The Metamorphosis_ with NLP

* Kafka's _Metamorphosis_ is commonly interpreted by humans as giving insights into the meaning of personal identity
* The conclusions that we can draw from texts are limited by our uniquely human preconceptions and rhetorical weaknesses
* Guiding question: What happens when computers read philosophy? That is, if we apply machine learning and linguistic approaches from natural language processing (NLP) to analyzing _The Metamorphosis_, how will the results support or contrast with our human reading?
* Read on to see what computers think about Kafka...

In [1]:
import nltk
from nltk.corpus import stopwords
import random
import numpy as np
import regex as re
import pandas as pd
from matplotlib import pyplot as plt

The text is sourced from Project Gutenberg (David Wyllie translation): http://www.gutenberg.org/files/5200/5200-h/5200-h.htm

In [2]:
# Read the whole text; the context manager closes the file afterwards
with open("5200-h.txt", "r") as file:
    content = file.read()

Split the raw string into individual words with a regular expression that matches runs of word characters and apostrophes.

In [3]:
words = re.findall(r"[\w']+", content)
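
To see what the pattern does, here is a minimal sketch on a made-up sentence (not taken from the text). It uses the standard-library `re` module, which is assumed to behave the same as `regex` for this simple pattern.

```python
import re

# Hypothetical sample sentence, used only for illustration
sample = "Gregor's room wasn't large; it was, however, quiet."

# [\w']+ matches runs of letters, digits, underscores, and straight apostrophes,
# so punctuation and whitespace act as separators while contractions stay whole
tokens = re.findall(r"[\w']+", sample)
print(tokens)
```

If the source file uses typographic apostrophes (’) instead of straight ones, contractions would be split at that character, so it may be worth checking a few tokens by hand.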

Lots of words!

In [4]:
len(words)

22055

Count how many times each word occurs. Words are converted to lowercase to simplify things.

In [5]:
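def get_word_freqs(words):
    freqs = {}
    for word in words:
        # Use the lowercased form for both the lookup and the update;
        # testing the raw word would restart the count at every capitalized occurrence
        key = word.lower()
        if key in freqs:
            freqs[key] += 1
        else:
            freqs[key] = 1
    return freqs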

In [6]:
word_freqs = get_word_freqs(words)
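
The same case-insensitive tally can also be sketched with `collections.Counter` from the standard library. The helper name `count_words` and the demo tokens are hypothetical and only illustrate the idea; this is not the notebook's own function.

```python
from collections import Counter

def count_words(tokens):
    """Return a Counter of frequencies keyed on the lowercased token."""
    return Counter(token.lower() for token in tokens)

# Hypothetical tokens, used only for illustration
demo = ["The", "door", "the", "Door", "room"]
demo_freqs = count_words(demo)
print(demo_freqs.most_common(3))
```

Because every token is lowercased before it is counted, "The" and "the" land in the same bucket. A `Counter` can be passed to `pd.Series` in the same way as a plain dictionary.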

Convert the dictionary of frequencies to a pandas Series, which is more convenient for sorting and filtering. Note that there are far fewer unique words than total words.

In [7]:
word_freqs = pd.Series(word_freqs)
len(word_freqs)

2622

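The 15 most frequent words in the recorded output. "to" leads by a wide margin, which isn't surprising given its grammatical role, and most of these words aren't specific to _The Metamorphosis_. Words that often start a sentence or are always capitalized, such as "the", "he", and "Gregor", are missing from the list: a tally keyed on lowercased words but tested against the raw word restarts at 1 on every capitalized occurrence, so the figures below are best read as lower bounds. In these counts, "room" appears more often than "that" or "have", which hints at its significance in the story, though the comparison should be rechecked against case-insensitive counts.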

In [8]:
word_freqs.nlargest(15)

to        753
of        429
him       188
a         186
not       176
had       172
room      131
could     119
have      107
that      107
would     102
been      101
sister     96
be         89
door       87
dtype: int64

Since many of these words aren't very interesting, let's forget about them for now. These words are called <a href="https://en.wikipedia.org/wiki/Stop_words" target="_blank">stop words</a>, and the NLP library nltk conveniently includes a list of them so we can exclude them.

A random sample of stop words. Some entries, such as 'wouldn' or 'hasn', are fragments of contractions, which nltk lists alongside the full forms like "didn't".

In [9]:
stop_words = set(stopwords.words('english'))
# random.sample needs a sequence, so sort the set into a list first
random.sample(sorted(stop_words), 10)

['now',
 'that',
 'wouldn',
 'not',
 "should've",
 'at',
 "didn't",
 'hasn',
 'what',
 'shouldn']

Yay, that annoying "to" will be gone!

In [10]:
'to' in stop_words

True

In [18]:
stop_words

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [16]:
word_freqs.index not in stop_words

TypeError: unhashable type: 'Index'
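
The `TypeError` arises because `in` tries to hash the whole `Index` object to look it up in the set, instead of testing each word separately. One way to filter element-wise is `Index.isin`, sketched here on a tiny hand-made Series and stop-word set (both hypothetical stand-ins for `word_freqs` and `stop_words`).

```python
import pandas as pd

# Hypothetical stand-ins for word_freqs and stop_words
demo_freqs = pd.Series({"to": 5, "room": 3, "of": 4, "sister": 2})
demo_stop_words = {"to", "of", "the"}

# isin() tests each index label separately and returns a boolean array;
# ~ inverts it so that only the non-stop words are kept
is_stop = demo_freqs.index.isin(demo_stop_words)
content_freqs = demo_freqs[~is_stop]
print(content_freqs.nlargest(2))
```

Applied to the real data, the same two lines with `word_freqs` and `stop_words` substituted should leave only the content words for ranking.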

In [None]:
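# Lowercase the tokens so they match nltk's lowercase stop-word list
word_tokens = [w.lower() for w in words]

# Keep only the tokens that are not stop words
filtered_sentence = [w for w in word_tokens if w not in stop_words]

# Show a short sample rather than the full lists
print(word_tokens[:20])
print(filtered_sentence[:20])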

In [None]:
set(stopwords.words('english'))


In [None]:
gregor_indices = [i for i,x in enumerate(words) if x == 'Gregor']

In [None]:
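# Count occurrences of "Gregor" in consecutive windows of 100 words,
# keyed by the word position at which each window starts
window = 100
gregor_over_time = {}
j = 0
for start in range(0, len(words), window):
    gregor_over_time[start] = 0
    # gregor_indices is sorted, so one pass with a moving pointer is enough
    while j < len(gregor_indices) and gregor_indices[j] < start + window:
        gregor_over_time[start] += 1
        j += 1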

In [None]:
gregor_over_time
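
Counting occurrences in fixed windows of 100 words is a histogram over word positions, so it can also be sketched with `np.histogram`. The positions and text length below are made up for illustration; with the real data, `gregor_indices` and `len(words)` would take their place.

```python
import numpy as np

# Hypothetical positions of a word in a 350-word text
positions = [3, 40, 120, 199, 200, 349]
total_words = 350
window = 100

# Bin edges at 0, 100, 200, 300, 400; the last edge covers the final partial window
edges = np.arange(0, total_words + window, window)
counts, _ = np.histogram(positions, bins=edges)

# Label each bin by its starting word position
per_window = dict(zip(edges[:-1].tolist(), counts.tolist()))
print(per_window)
```

In this sketch each bin is labelled by its starting word position, and the final partial window is included rather than dropped.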

In [None]:
plt.plot(list(gregor_over_time.keys()), list(gregor_over_time.values()))
plt.xlabel("Word position in text")
plt.ylabel('Occurrences of "Gregor" per 100 words')
plt.show()