# Dissertation Analytics
## 1. Preprocessing

This notebook preprocesses the plain-text version of my dissertation, _Emotions as functional kinds: a meta-theoretical approach to constructing scientific theories of emotions_ (HU Berlin), using NLTK. It tokenizes, tags, filters, and lemmatizes the text and writes the results to files that can be analyzed later on.

In [2]:
import nltk

In [3]:
with open('diss.txt') as f:
    raw = f.read()

Let's clean up the raw text first with a few replacements:
* Replacing line breaks (`\n`) with spaces.
* Rejoining words that were hyphenated across line breaks (e.g. "oc- curred" becomes "occurred").
* Removing the abbreviations "e.g." and "i.e."

In [4]:
raw = raw.replace('\n', ' ')
raw = raw.replace('- ', '')
raw = raw.replace('e.g.', '')
raw = raw.replace('i.e.', '')
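One thing to keep in mind: replacing every `'- '` also removes ordinary spaced dashes, not only end-of-line hyphenation. The sketch below is a possible stricter variant, not what the cell above does. The helper name `normalize` and the sample sentence are made up for illustration; the regular expression only rejoins a word when the hyphen sits directly between a letter and a lowercase letter.

```python
import re

def normalize(text):
    """Hypothetical helper: same three clean-up steps, with a stricter hyphen rule."""
    text = text.replace('\n', ' ')
    # Rejoin split words: letter, hyphen, whitespace, lowercase letter.
    text = re.sub(r'(?<=[A-Za-z])-\s+(?=[a-z])', '', text)
    # Drop the two abbreviations, as in the cell above.
    for abbreviation in ('e.g.', 'i.e.'):
        text = text.replace(abbreviation, '')
    return text

sample = "The change oc-\ncurred late - e.g. in chapter two."
cleaned = normalize(sample)
print(cleaned)
```

With the plain `replace('- ', '')`, the spaced dash in "late - e.g." would disappear as well; here it survives because it is preceded by a space rather than a letter.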

## Tokenizing

In [5]:
# Keep only runs of ASCII letters; digits, punctuation, and accented characters are dropped.
word_tokenizer = nltk.RegexpTokenizer(r'[A-Za-z]+')
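To get a feeling for what this pattern keeps and drops, here is a small sketch with the standard library's `re.findall`. As far as I can tell this matches what `RegexpTokenizer` does with its default settings for a pattern like this one, but treat that as an assumption. The sample phrase is invented.

```python
import re

# Same pattern as the word tokenizer; the sample phrase is made up.
pattern = re.compile(r'[A-Za-z]+')
sample = u"Scherer's (2005) na\u00efve view"
tokens = pattern.findall(sample)
print(tokens)
```

The possessive is split into two tokens, the year disappears, and the accented character cuts "naïve" in two. For an English text that is mostly acceptable, but it does inflate the word count a little.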

In [6]:
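# Decode byte strings (Python 2); text strings (Python 3) are passed through unchanged.
raw_sentences = nltk.sent_tokenize(raw.decode('utf8') if isinstance(raw, bytes) else raw)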

How many sentences does the dissertation have? How many words?

In [7]:
print("# of sentences: %s" % len(raw_sentences))
print("# of words: %s" % len(word_tokenizer.tokenize(raw)))

# of sentences: 4606
# of words: 96588
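From these two counts we can get a rough average sentence length. This is only a quick calculation with the numbers copied from the output above; since the word tokenizer counts letter runs, the figure is an approximation.

```python
# Counts copied from the output above.
n_sentences = 4606
n_words = 96588

# float() keeps the division exact under both Python 2 and Python 3.
avg_words_per_sentence = n_words / float(n_sentences)
print("average words per sentence: %.2f" % avg_words_per_sentence)
```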


In [8]:
# One list of word tokens per sentence.
sentences_by_word = [word_tokenizer.tokenize(sentence) for sentence in raw_sentences]

## Part-of-Speech Tagging

In [9]:
# Each sentence becomes a list of (word, tag) tuples.
sent_tagged = [nltk.pos_tag(sentence) for sentence in sentences_by_word]
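After this step every sentence is a list of `(word, tag)` tuples. The sketch below shows that shape and how one could count tags with it. The two sentences and their tags are written by hand for illustration; they are not actual tagger output.

```python
from collections import Counter

# Hand-written data in the shape of `sent_tagged`; the tags are illustrative only.
example_tagged = [
    [('Emotions', 'NNS'), ('are', 'VBP'), ('functional', 'JJ'), ('kinds', 'NNS')],
    [('Theories', 'NNS'), ('explain', 'VBP'), ('them', 'PRP')],
]

# Flatten the sentences and count how often each tag occurs.
tag_counts = Counter(tag for sentence in example_tagged for _, tag in sentence)
print(tag_counts.most_common(2))
```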

## Removing stopwords

In [10]:
from nltk.corpus import stopwords
# A set makes the membership test in the next cell much faster than a list.
stop_words = set(stopwords.words('english'))

In [11]:
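sent_clean = []
for sentence in sent_tagged:
    # Each item is a (word, tag) tuple; compare the lowercased word against the stopword list.
    sent_clean.append([(word, tag) for word, tag in sentence if word.lower() not in stop_words])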
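Note that the stopwords are removed after tagging, so the tags were assigned with the full sentence as context. Here is a minimal sketch of the filter on one hand-written sentence. The tiny stopword set is a hypothetical stand-in for NLTK's English list, used so the example runs without downloading the corpus.

```python
# Hypothetical stand-in for NLTK's English stopword list.
demo_stop_words = {'the', 'is', 'a', 'of'}

# Hand-written (word, tag) tuples; the tags are illustrative only.
demo_sentence = [('The', 'DT'), ('theory', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('sketch', 'NN')]

# Lowercase before comparing so that sentence-initial "The" is caught too.
demo_clean = [(word, tag) for word, tag in demo_sentence
              if word.lower() not in demo_stop_words]
print(demo_clean)
```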

In [12]:
with open('diss_nostopwords.txt', 'w') as f:
    f.write(str(sent_clean))

## Lemmatization

In [13]:
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [14]:
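def customLemmatizer(tagged_word):
    """Lemmatize a (word, tag) pair; return None unless the tag is a noun, verb, or adjective."""
    word, pos = tagged_word

    if pos in ['NN', 'NNS', 'NNP', 'NNPS']:
        # Nouns: the lemmatizer treats words as nouns by default.
        lemma = lemmatizer.lemmatize(word)

    elif pos in ['VB', 'VBN', 'VBZ', 'VBD', 'VBP', 'VBG']:
        lemma = lemmatizer.lemmatize(word, pos='v')

    elif pos in ['JJ', 'JJR', 'JJS']:
        lemma = lemmatizer.lemmatize(word, pos='a')

    else:
        # Every other part of speech (adverbs, determiners, ...) is skipped.
        lemma = None

    return lemma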
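The three branches boil down to a mapping from Penn Treebank tags to the single-letter part-of-speech codes the lemmatizer expects. The sketch below spells out that mapping on its own, without calling WordNet. The function name `penn_to_wordnet` is hypothetical, and the prefix matching is an alternative to the explicit tag lists above that I believe covers the same tags.

```python
def penn_to_wordnet(tag):
    """Hypothetical helper: map a Penn Treebank tag to 'n', 'v', 'a', or None."""
    if tag.startswith('NN'):
        return 'n'  # NN, NNS, NNP, NNPS
    if tag.startswith('VB'):
        return 'v'  # VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith('JJ'):
        return 'a'  # JJ, JJR, JJS
    return None     # everything else is skipped

for tag in ['NNS', 'VBG', 'JJR', 'RB']:
    print(tag, penn_to_wordnet(tag))
```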

In [15]:
lemmatized_text = []
for sentence in sent_clean:
    for word in sentence:
        lemma = customLemmatizer(word)
        if lemma:
            lemmatized_text.append(lemma)

In [16]:
with open('diss_lemmatized.txt', 'w') as f:
    f.write(str(lemmatized_text))
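Both output files contain the `str()` representation of a Python list, so a later notebook has to parse that representation again. One way that should work is `ast.literal_eval`, sketched here as a round trip with a small made-up list and a temporary file instead of the real output.

```python
import ast
import os
import tempfile

# Made-up stand-in for `lemmatized_text`.
demo_lemmas = ['emotion', 'functional', 'kind']

# Write the list the same way as the cell above, but to a temporary file.
path = os.path.join(tempfile.mkdtemp(), 'demo_lemmatized.txt')
with open(path, 'w') as f:
    f.write(str(demo_lemmas))

# Parse the text back into a Python list without executing arbitrary code.
with open(path) as f:
    restored = ast.literal_eval(f.read())

print(restored == demo_lemmas)
```

Writing JSON instead would make the files easier to read from other tools, but the approach above keeps the existing output format unchanged.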