<h2>Preprocessing</h2>

Packages used:

In [1]:
import os, nltk, re
from nltk import word_tokenize

In [2]:
import pickle, numpy, gensim

<h4>Text selection</h4>

The first step is to randomly select a fraction of the corpus to train the topic model on. 

The basis of this process is the list of file names in alphabetical order. 

In [3]:
dirname = 'intro_dh_projekt/Dream_All_Texts_Plain'

In [4]:
filenames = sorted(os.listdir(dirname))

In [5]:
len(filenames)

34777

In [6]:
filenames[34773:34777]

['unknown pubDate - unknown author -  -                        -  1665.xml.txt',
 'unknown pubDate - unknown author -  L        -  1679.xml.txt',
 'unknown pubDate - unknown author -  L     L              -  1687.xml.txt',
 'unknown pubDate - unknown author -  L  L  1 1679 -  1679.xml.txt']

The function `randrange()` generates the indices of the files to be picked. Collecting them in a set ensures that no file is selected twice, and the loop stops once the set holds 3477 indices, a tenth of the corpus.

In [7]:
import math
from random import random, randrange

In [8]:
nums = set()
while len(nums) < 3477:
    nums.add(randrange(34777))

In [9]:
nums_sorted = sorted(nums)

In [10]:
len(nums_sorted)

3477

The indices of the first five texts to be selected are:

In [15]:
nums_sorted[:5]

[11, 12, 20, 29, 36]
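
The same draw can also be sketched with `random.sample`, which picks distinct indices in a single call. The seed and the names `corpus_size`, `sample_size` and `picked` are assumptions made for illustration; the notebook itself does not fix a seed, so its exact selection cannot be reproduced this way.

```python
import random

corpus_size = 34777               # number of files, as reported by len(filenames)
sample_size = corpus_size // 10   # one tenth of the corpus, i.e. 3477

rng = random.Random(42)           # hypothetical seed, only to make the draw repeatable
# sample() draws without replacement, so no index can occur twice
picked = sorted(rng.sample(range(corpus_size), sample_size))

print(len(picked), picked[:5])
```

Fixing a seed like this would make it possible to rebuild the same sub-corpus in a later session.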

<h4>Reading in the data</h4>

Once we have the indices ready and sorted, we iterate over them, using each to open the respective file in the directory, tokenize it, and append it to the initially empty list `texts`. The result will be a list of lists of words (as well as punctuation, numbers etc.).

In [18]:
texts = []
for num in nums_sorted:
    filename = filenames[num]
    with open(os.path.join(dirname, filename)) as text:
        texts.append(word_tokenize(text.read()))

In [19]:
len(texts)

3477

In [29]:
texts[0][:30]

['<',
 '?',
 'xml',
 'version=',
 "''",
 '1.0',
 "''",
 'encoding=',
 "''",
 'UTF-8',
 "''",
 '?',
 '>',
 'The',
 'Knavish',
 'MERCHANT',
 '(',
 'Now',
 'turn',
 "'d",
 'Warehouseman',
 ')',
 'CHARACTARIZED',
 'OR',
 'A',
 'severe',
 'Scourge',
 ',',
 'for',
 'an']

Next, we transform all the words in the corpus to lowercase and clear the corpus of numbers and punctuation except for full stops, as these will be needed for the chunking. It would have been an option to also leave in question marks and exclamation marks, but the full stop approach seemed to suffice for our purposes.

In [24]:
texts_clear = [[w.lower() for w in text if w.isalpha() or w == '.'] for text in texts]

In [25]:
len(texts_clear)

3477

In [30]:
texts_clear[0][:30]

['xml',
 'the',
 'knavish',
 'merchant',
 'now',
 'turn',
 'warehouseman',
 'charactarized',
 'or',
 'a',
 'severe',
 'scourge',
 'for',
 'an',
 'unjust',
 'cruel',
 'and',
 'unconscionable',
 'adversary',
 'by',
 'philadelphus',
 'verax',
 'a',
 'cordial',
 'friend',
 'to',
 'his',
 'honest',
 'though',
 'injuriously']
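
To see what the filter keeps and drops, here is a minimal sketch on a toy token list. The tokens are modelled on the output above but are an assumption, not real corpus data.

```python
# toy token list modelled on the tokenizer output above (not real corpus data)
tokens = ['The', 'Knavish', 'MERCHANT', '(', 'Now', 'turn', "'d", '1665', '.', 'OR']

# keep alphabetic tokens and full stops, lowercasing everything that is kept
cleaned = [w.lower() for w in tokens if w.isalpha() or w == '.']

print(cleaned)
# ['the', 'knavish', 'merchant', 'now', 'turn', '.', 'or']
```

Tokens such as `'d` or `version=` fail `isalpha()` and disappear, while a purely alphabetic token like `xml` survives, which is why it shows up at the start of `texts_clear[0]`.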

An auxiliary dictionary records the year of origin of each file used in the corpus by extracting the first occurrence of a four-digit sequence in the file name. This method is certainly not perfect but presumably good enough given the size of the corpus and the format of the file names. If no year is found, the entry will be `None`.

In [44]:
getyear = {}
for num in nums_sorted:
    filename = filenames[num]
    years = re.findall('[0-9]{4}', filename)
    year = next((y for y in years), None)
    getyear[num] = year

In [46]:
len(getyear)

3477

In [47]:
filenames[nums_sorted[0]]

'              1661 -   -                           - .xml.txt'

In [48]:
getyear[nums_sorted[0]]

'1661'
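
The extraction rule can be sketched as a small helper built on `re.search`. The name `first_year` and the second and third file names are hypothetical and only serve to show the edge cases.

```python
import re

def first_year(filename):
    """Return the first run of four digits in filename, or None if there is none."""
    match = re.search(r'[0-9]{4}', filename)
    return match.group(0) if match else None

print(first_year('unknown pubDate - unknown author -  L        -  1679.xml.txt'))  # 1679
print(first_year('no digits here.xml.txt'))  # None
print(first_year('vol 16655.xml.txt'))       # 1665, the first four digits of a longer number
```

The last call illustrates one way the heuristic can misfire: any longer digit sequence is cut down to its first four digits.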

<h4>Chunking</h4>

Now to the chunking function. This function takes as input a word-tokenized text, that is, a list of words and full stops. With `n` being the chunk size and `i` the starting position (initially 0), it will jump to the end position of the desired chunk (`n-1`) and check whether or not this list element is a full stop. `n` will then be incremented until a full stop is found, and the slice with end position `n` exclusive will be added to the initially empty list. The start position of the next chunk will be `i+n`, and the chunk size will be reset to its initial value. This goes on so long as the calculated end position is within the range of the text.
The last chunk, which is bound to be shorter than the desired chunk size, will be added to the chunk list directly or, if it is shorter than a predetermined minimum size, to the last chunk in the list.

In [49]:
def chunkthis(txt, chunksize, minsize):
    texts = []
    n = chunksize
    i = 0
    while i+n <= len(txt) and n != 0:
        while txt[i+n-1] != '.' and i+n<len(txt):
            n+=1
        chunk = txt[i:i+n]
        texts.append(chunk)
        i = i+n
        n = chunksize
    #if last chunk is shorter than minsize, append to last txt in chunked array (if there is one)
    if len(txt) - i < minsize and len(texts)>0:
        texts[-1] += (txt[i:])
    #otherwise (remainder has at least minsize words, or there is no chunk yet) append it as its own chunk
    else:
        texts.append(txt[i:])
    return texts
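
A small trace may make the behaviour easier to follow. The block below repeats the function with the same logic so that it runs on its own, and applies it to a toy text with a chunk size of 3 and a minimum size of 2; these values are assumptions chosen to keep the output short.

```python
def chunkthis(txt, chunksize, minsize):
    # same logic as the cell above, repeated so that this block runs on its own
    texts = []
    n = chunksize
    i = 0
    while i + n <= len(txt) and n != 0:
        # extend the chunk until it ends on a full stop or the text runs out
        while txt[i + n - 1] != '.' and i + n < len(txt):
            n += 1
        texts.append(txt[i:i + n])
        i = i + n
        n = chunksize
    # a short remainder is merged into the previous chunk, a longer one stands alone
    if len(txt) - i < minsize and len(texts) > 0:
        texts[-1] += txt[i:]
    else:
        texts.append(txt[i:])
    return texts

toy = 'a b c . d e . f g h i . j'.split()
for chunk in chunkthis(toy, 3, 2):
    print(chunk)
# ['a', 'b', 'c', '.']
# ['d', 'e', '.']
# ['f', 'g', 'h', 'i', '.', 'j']
```

The first and third chunks grow beyond three tokens because they are extended to the next full stop, and the single leftover token `j` is shorter than the minimum size, so it is attached to the last chunk.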

Using this chunking function, we can iterate over the cleared corpus and do the following: First extract the file index of the text from `nums_sorted` (the random numbers list), then chunk the text, resulting in a list of lists of approximately 400 words each, add these chunks to an initially empty list `texts_chunked`, and add the file index to a separate list `chunk_index` according to the number of chunks created. Thus, if a text was split into 100 chunks, the file index of that text will be noted down 100 times, so that calling the list index of the chunk will return the index of the file it was extracted from. This will be important to get the year of origin of the chunks later.

In [53]:
chunk_index = []
texts_chunked = []
for i in range(len(texts_clear)):
    file_index = nums_sorted[i] #let's assume first random number is 11, therefore file_index in first loop is 11   
    chunks = chunkthis(texts_clear[i], 400, 100) #let's assume this creates 5 chunks
    texts_chunked += chunks #add chunks to chunk list
    chunk_index += [file_index]*len(chunks) #add file index 11 5 times --> chunk_index[4] will then return file index of 4th (0-based) chunk
    if i < 5:
        print('file index:', file_index)
        print(len(chunks), 'chunks')       

file index: 11
5 chunks
file index: 12
67 chunks
file index: 20
437 chunks
file index: 29
5 chunks
file index: 36
107 chunks


In [54]:
len(texts_chunked)

200577

In [55]:
chunk_index[4]

11
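
Combined with `getyear`, this index gives the year of origin of any chunk. The following is a minimal sketch with toy stand-ins for both structures; the values and the helper name `year_of_chunk` are assumptions.

```python
# toy stand-ins for the notebook's structures (values are assumptions)
getyear = {11: '1661', 12: None}      # file index -> year string, or None if no year was found
chunk_index = [11, 11, 11, 12, 12]    # position = chunk number, value = file index

def year_of_chunk(k):
    """Year of origin for chunk number k, looked up via the file it was cut from."""
    return getyear[chunk_index[k]]

print(year_of_chunk(2))  # 1661
print(year_of_chunk(4))  # None
```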

In [70]:
for txt in texts_chunked[:10]:
    print(len(txt))

510
410
429
412
442
412
412
403
404
429


<h4>Removal of stop words</h4>

Next, we clear the chunks of stop words and full stops, so that only words that are not among the 500 most frequent terms remain. Since the stop word list also contains punctuation and numbers, which we drop from it, we end up removing 475 different terms. `'xml'` was added to the stop word list as each file starts with an XML declaration.

In [57]:
#some spaghetti code to read in the stop words: str() on the raw bytes gives "b'term ...", so [2:] strips the leading b'
stop_words=[]
i = 0
with open('stopwords.txt', 'rb') as f:
    while i < 500:
        stop_words.append(str(f.readline()).split()[0][2:])
        i += 1

In [58]:
len(stop_words)

500

In [64]:
stop_words[:20]

['the',
 'of',
 'and',
 'to',
 'in',
 'that',
 'a',
 'is',
 'it',
 'for',
 'his',
 'as',
 'be',
 'he',
 'not',
 'by',
 'but',
 'they',
 'which',
 'with']

In [61]:
stop_words = [w for w in stop_words if w.isalpha()] #remove numbers and punctuation from stop words
stop_words.append('xml')

In [62]:
len(stop_words)

475

In [65]:
stop_set = set(stop_words) #set membership tests are much faster than list lookups
texts_final = [[w for w in text if w.isalpha() and w not in stop_set] for text in texts_chunked]

In [66]:
len(texts_final)

200577

The resulting corpus `texts_final` now contains 200577 text chunks extracted from 3477 texts in word-tokenized form, cleared of stop words and punctuation. As we can see by comparing the lengths of the chunks before and after weeding out the stop words, the chunks have shrunk considerably.

In [71]:
for txt in texts_final[:10]:
    print(len(txt))

206
134
137
140
141
165
155
135
170
252


<h4>Saving data</h4>

Using pickle, we save the relevant data structures in a serialized format: the final corpus, the chunk index allowing us to retrace what text a chunk was extracted from, and the file indices of the used texts.

In [None]:
import pickle
with open("texts_final_new.p", "wb") as f: #the with block closes the file after writing
    pickle.dump(texts_final, f)

In [None]:
with open("chunk_index.p", "wb") as f:
    pickle.dump(chunk_index, f)

In [None]:
with open("file_index.p", "wb") as f:
    pickle.dump(nums_sorted, f)

In [None]:
with open("texts_final_400w.p", "rb") as f:
    tst = pickle.load(f)

<h4>Preparing the data for Topan</h4>

This next step creates a two-dimensional array with each row corresponding to a text chunk. The two columns represent chunk identifier (a combination of file index and the number of the chunk extracted from that file, like so: `11:1`, `11:2`, ...) and the words of that chunk in simple string format. Saving this table as a CSV file allowed us to read the texts into Topan as well. 

In [None]:
anarray = []
fileindex_prev = 0
n = 0
for i in range(len(texts_final)):
    fileindex = chunk_index[i]
    #restart the chunk counter whenever a new file begins
    if fileindex != fileindex_prev:
        n = 1
    else:
        n += 1
    chunkindex = str(fileindex) + ':' + str(n)
    #join the words of the chunk into one space-separated string
    astring = " ".join(texts_final[i])
    anarray.append([chunkindex, astring])
    fileindex_prev = fileindex

In [None]:
import numpy
numpy.savetxt("all_the_chunks.csv", anarray, delimiter=",", fmt='%s')
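
As an alternative sketch, the same two-column table could be written with the standard library's `csv` module, which would also quote any field that happened to contain a comma. The rows below are toy values, and an in-memory buffer stands in for the output file; both are assumptions for illustration.

```python
import csv
import io

# toy rows in the same [chunk identifier, text] layout (values are assumptions)
rows = [['11:1', 'knavish merchant warehouseman'],
        ['11:2', 'severe scourge adversary']]

buffer = io.StringIO()        # stands in for a file opened with newline=''
writer = csv.writer(buffer)
writer.writerows(rows)        # one CSV line per chunk

print(buffer.getvalue())
```

Since the chunks contain only alphabetic words at this point, both approaches should produce equivalent files.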

<h2>Topic model</h2>

Now for the actual LDA topic modelling. We use Python's gensim library to create a dictionary mapping each remaining term in the corpus to a unique id. The class `Dictionary` stores this mapping as `token2id` and also provides the reverse mapping `id2token`, which is only filled in on demand. This `dictionary` is then used to convert the corpus to a numeric format on the basis of the bag-of-words assumption. Each text is thus treated as a bag of words, where only word frequencies matter and word order is ignored. As a result, each text in the corpus is represented as a list of (word id, word frequency) tuples. The resulting data structure of `corpus` is therefore a list of lists of numeric tuples.

In [72]:
dictionary = gensim.corpora.Dictionary(texts_final)

In [None]:
corpus = [dictionary.doc2bow(text) for text in texts_final]

In [None]:
corpus_reduced = [corpus[i] for i in range(0, len(corpus), 5)]
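
The bag-of-words representation described above can be illustrated in plain Python. This is a sketch of the idea on toy chunks, not gensim's implementation, and the ids gensim assigns to the same tokens may differ.

```python
from collections import Counter

docs = [['merchant', 'scourge', 'merchant'], ['scourge', 'friend']]  # toy chunks

# assign ids in order of first appearance (gensim's own id assignment may differ)
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

# each document becomes a sorted list of (word id, word frequency) pairs
bow = [sorted(Counter(token2id[t] for t in doc).items()) for doc in docs]

print(token2id)  # {'merchant': 0, 'scourge': 1, 'friend': 2}
print(bow)       # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Word order is lost in `bow`: the first toy chunk would get the same representation if its words were shuffled.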

LDA was chosen as a model since it works relatively autonomously: the per-document topic distributions and the per-topic word distributions are learned from the corpus during training. The priors alpha (affecting per-document topic distribution) and beta (affecting per-topic word distribution, called `eta` in gensim) can be learned as well, but gensim only does so when they are set to `'auto'`; the calls below leave them at their default values. The main difficulty was in determining the optimal number of topics. A sheer lack of computing power settled this question for us, so that we ended up producing a relatively small number of topics. On top of that, the training data for the topic model had to be further reduced to one fifth of the generated corpus, thus using 40116 text chunks instead of 200577. Since the resulting topics appeared meaningful enough to us, we left it at that.
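
The size of the reduced corpus follows directly from the step of 5 used to build `corpus_reduced`, as this short check shows. The names `n_chunks`, `step` and `kept` are introduced here for illustration only.

```python
n_chunks = 200577   # number of chunks in texts_final
step = 5            # every fifth chunk is kept

# range(0, n, step) yields ceil(n / step) indices: 0, 5, 10, ...
kept = range(0, n_chunks, step)

print(len(kept))   # 40116
print(kept[-1])    # 200575, the last chunk index that is kept
```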

In [None]:
tm29 = gensim.models.LdaModel(corpus_reduced, id2word=dictionary, num_topics=29, passes=5)

In [None]:
tm59 = gensim.models.LdaModel(corpus_reduced, id2word=dictionary, num_topics=59, passes=5)

In [None]:
tm29.save('tm29')

In [None]:
tm59.save('tm59')

In [None]:
tm29 = gensim.models.LdaModel.load('tm29')

The last step was to visualize the topic model as applied to the corpus using Python's pyLDAvis package. Although many of the smaller topics are crammed into one corner of the coordinate system, we can see that each quadrant contains a number of topics and that the bigger ones do not overlap heavily, which suggests that topic coherence might be reasonably good.

In [None]:
import pyLDAvis.gensim as gensimvis #in newer pyLDAvis releases this module is named pyLDAvis.gensim_models
import pyLDAvis

In [None]:
lda_display = gensimvis.prepare(tm29, corpus_reduced, dictionary, sort_topics=False)
#save first: show() starts a local web server and blocks until it is stopped
pyLDAvis.save_html(lda_display, 'tm29_reduced.html')
pyLDAvis.show(lda_display)