# Analyzing Franz Kafka's _The Metamorphosis_ with NLP

* Kafka's _Metamorphosis_ is commonly interpreted by humans as giving insights into the meaning of personal identity
* The conclusions that we can draw from texts are limited by our uniquely human preconceptions and rhetorical weaknesses
* Guiding question: What happens when computers read philosophy? That is, if we apply machine learning and linguistic approaches from NLP to analyzing _The Metamorphosis_, how will the results support or contrast with our human reading?
* Read on to see what computers think about Kafka...

In [1]:
import nltk
import numpy as np
import regex as re
import pandas as pd
from matplotlib import pyplot as plt

The text is sourced from Project Gutenberg (David Wyllie
 translation): http://www.gutenberg.org/files/5200/5200-h/5200-h.htm

In [2]:
file = open("5200-h.txt", "r")
content = file.read()

Clean messy string into individual words with some regex magic

In [3]:
words = re.findall(r"[\w']+", content)

Lots of words!

In [12]:
len(words)

22055

Count how many times each word occurs. Words are converted to lowercase to simplify things.

In [17]:
def word_freqs(words):
    freqs = {}
    for word in words:
        if word in freqs:
            freqs[word.lower()] += 1
        else:
            freqs[word.lower()] = 1
    return freqs

In [18]:
word_freqs = word_freqs(words)

Convert the dictionary of frequencies to a pandas dataframe (nicer for lots of data)

In [24]:
word_freqs_df = pd.DataFrame([{'word':word, 'count':count} for (word, count) in word_freqs.items()])

In [27]:
word_freqs_df.nlargest(15, 'count')

Unnamed: 0,count,word
53,753,to
69,429,of
71,188,him
25,186,a
372,176,not
108,172,had
83,131,room
38,119,could
107,107,that
184,107,have


In [8]:
word_freqs_df.sort_values(by='count', ascending=False, inplace=True)

In [9]:
word_freqs_df.reset_index(drop=True).nlargest(50, 'count')

Unnamed: 0,count,word
0,1097,the
1,753,to
2,612,and
3,524,his
4,495,he
5,429,of
6,406,was
7,350,had
8,339,in
9,324,that


In [None]:
gregor_indices = [i for i,x in enumerate(words) if x == 'Gregor']

In [None]:
gregor_over_time = {}
j=0
for i in range(0, len(words), 100):
    gregor_over_time[i] = 0
    while j<len(gregor_indices) and gregor_indices[j] < i:
        gregor_over_time[i] += 1
        j+=1

In [None]:
gregor_over_time

In [None]:
plt.plot(list(gregor_over_time.keys()), list(gregor_over_time.values()))
plt.show()