<h2>Preprocessing</h2>

Packages used:

In [1]:
import os, nltk, re
from nltk import word_tokenize

In [2]:
import pickle, gensim

<h4>Text selection</h4>

The first step is to randomly select a fraction of the corpus on which to train the topic model.

The basis of this process is the list of file names in alphabetical order. 

In [3]:
dirname = 'intro_dh_projekt/Dream_All_Texts_Plain'

In [4]:
filenames = sorted(os.listdir(dirname))

In [5]:
len(filenames)

34777

In [6]:
filenames[34773:34777]

['unknown pubDate - unknown author -  -                        -  1665.xml.txt',
 'unknown pubDate - unknown author -  L        -  1679.xml.txt',
 'unknown pubDate - unknown author -  L     L              -  1687.xml.txt',
 'unknown pubDate - unknown author -  L  L  1 1679 -  1679.xml.txt']

We let the function `randrange()` create the indices of the files to be picked. Using a set ensures that no file will be selected twice. The size of the set is such that a tenth of the corpus will be selected.

In [7]:
import math
from random import random, randrange

In [8]:
nums = set()
while len(nums) < 3477:
    nums.add(randrange(34777))

In [9]:
nums_sorted = sorted(list(nums))

In [10]:
len(nums_sorted)

3477
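
The same draw can also be sketched with `random.sample`, which selects without replacement and so makes the set-based loop unnecessary. This is an alternative to the cells above, not the code that produced the indices shown in this notebook; the seed value and the name `nums_sorted_alt` are assumptions added only to make the example repeatable.

```python
import random

corpus_size = 34777               # number of files, as reported by len(filenames)
sample_size = corpus_size // 10   # a tenth of the corpus: 3477

rng = random.Random(42)           # hypothetical seed, only for reproducibility
nums_sorted_alt = sorted(rng.sample(range(corpus_size), sample_size))

print(len(nums_sorted_alt))       # 3477
```

Fixing a seed would also make the selection reproducible across notebook runs, which the `randrange()` loop is not.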

In [12]:
nums_sorted

[0,
 3,
 8,
 13,
 21,
 ...]

<h4>Reading in the data</h4>

Once we have the indices ready and sorted, we iterate over them, using each to open the respective file in the directory, tokenize it, and append it to the initially empty list `texts`. The result will be a list of lists of words (as well as punctuation, numbers etc.). 

In [13]:
texts = []
for num in nums_sorted:
    filename = filenames[num]
    with open(os.path.join(dirname, filename), encoding='utf-8') as text:  # the files declare UTF-8 in their XML header
        texts.append(word_tokenize(text.read()))
    print(num, len(texts))  # progress check: file index and number of texts read so far

0 1
3 2
8 3
13 4
21 5
...

In [14]:
len(texts)

3477

In [15]:
texts[99]

['<',
 '?',
 'xml',
 'version=',
 "''",
 '1.0',
 "''",
 'encoding=',
 "''",
 'UTF-8',
 "''",
 '?',
 '>',
 'GOOD',
 'NEVVS',
 ':',
 'OR',
 ',',
 'Wine',
 'and',
 'Oyle',
 ',',
 'Poured',
 'into',
 'the',
 'Wounds',
 'of',
 'SINNING',
 'and',
 'DISTRESSED',
 'JACOB',
 '.',
 'In',
 'some',
 'Meditations',
 'on',
 'Isa',
 '.',
 '27.6',
 ',',
 '7',
 ',',
 '8',
 ',',
 '&',
 '9',
 ',',
 'verses',
 '.',
 'Directing',
 'to',
 'the',
 'Cause',
 'wherefore',
 'and',
 'the',
 'End',
 'for',
 'which',
 'The',
 'present',
 'Affliction',
 'is',
 'come',
 'upon',
 'him',
 '.',
 'Hinting',
 'at',
 'the',
 'Means',
 'by',
 'which',
 'his',
 'Deliverance',
 'will',
 'be',
 'wrought',
 '.',
 'And',
 'Comforting',
 'him',
 'against',
 'the',
 'Extremity',
 'of',
 'Affliction',
 ',',
 'come',
 'and',
 'coming',
 'upon',
 'him',
 '.',
 'By',
 'PAIN',
 'LuMLE',
 'A',
 'WELCH',
 'christian',
 '.',
 'Jer',
 '.',
 '30.7',
 '.',
 'Alas',
 '!',
 'for',
 'that',
 'day',
 'is',
 'great',
 ',',
 'so',
 'that',
 'none',
 ...]

Next, we transform all the words in the corpus to lowercase and clear it of numbers and punctuation except for full stops, as these will be needed for the chunking. It would have been an option to also keep question marks and exclamation marks, but the full stop approach seemed to suffice for our purposes.

In [16]:
texts_clear = [[w.lower() for w in text if w.isalpha() or w == '.'] for text in texts]
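
To make the effect of this filter concrete, here is a minimal sketch on a hand-made token list (the tokens are taken from the sample output above; the variable names are only illustrative). `str.isalpha()` rejects any token containing a digit or a punctuation character, so `'27.6'` and `'version='` are dropped along with `','`, while the full stop is kept by the explicit `w == '.'` test.

```python
tokens = ['GOOD', 'NEVVS', ':', 'OR', ',', 'Wine', 'Isa', '.', '27.6', ',', '7', 'version=']

# keep purely alphabetic tokens (lowercased) and full stops
cleaned = [w.lower() for w in tokens if w.isalpha() or w == '.']

print(cleaned)
# ['good', 'nevvs', 'or', 'wine', 'isa', '.']
```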

In [17]:
len(texts_clear)

3477

In [18]:
texts_clear[99]

['xml',
 'good',
 'nevvs',
 'or',
 'wine',
 'and',
 'oyle',
 'poured',
 'into',
 'the',
 'wounds',
 'of',
 'sinning',
 'and',
 'distressed',
 'jacob',
 '.',
 'in',
 'some',
 'meditations',
 'on',
 'isa',
 '.',
 'verses',
 '.',
 'directing',
 'to',
 'the',
 'cause',
 'wherefore',
 'and',
 'the',
 'end',
 'for',
 'which',
 'the',
 'present',
 'affliction',
 'is',
 'come',
 'upon',
 'him',
 '.',
 'hinting',
 'at',
 'the',
 'means',
 'by',
 'which',
 'his',
 'deliverance',
 'will',
 'be',
 'wrought',
 '.',
 'and',
 'comforting',
 'him',
 'against',
 'the',
 'extremity',
 'of',
 'affliction',
 'come',
 'and',
 'coming',
 'upon',
 'him',
 '.',
 'by',
 'pain',
 'lumle',
 'a',
 'welch',
 'christian',
 '.',
 'jer',
 '.',
 '.',
 'alas',
 'for',
 'that',
 'day',
 'is',
 'great',
 'so',
 'that',
 'none',
 'is',
 'like',
 'it',
 'it',
 'is',
 'even',
 'the',
 'time',
 'of',
 'jacob',
 'trouble',
 'but',
 'he',
 'shall',
 'be',
 'saved',
 'out',
 'of',
 'it',
 '.',
 'lam',
 '.',
 '.',
 '.',
 'how',
 ...]


An auxiliary dictionary records the year of origin of each file used in the corpus by extracting the first occurrence of a four-digit sequence in the file name. This method is certainly not perfect but probably good enough given the size of the corpus and the format of the titles. If no year is found, the entry will be `None`. 

In [19]:
getyear = {}
for num in nums_sorted:
    filename = filenames[num]
    years = re.findall(r'[0-9]{4}', filename)
    year = years[0] if years else None  # first four-digit sequence, or None if there is none
    getyear[filename] = year
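
The extraction rule can be tried out in isolation. The sketch below wraps it in a helper called `first_year` (a hypothetical name, not part of the notebook) and uses `re.search`, which stops at the first match. The first file name comes from the directory listing above; the other two are invented to show the failure modes.

```python
import re

def first_year(filename):
    """Return the first four-digit sequence in filename, or None."""
    match = re.search(r'[0-9]{4}', filename)
    return match.group(0) if match else None

print(first_year('unknown pubDate - unknown author -  L  L  1 1679 -  1679.xml.txt'))  # 1679
print(first_year('no digits here.txt'))  # None
print(first_year('id 16790.txt'))        # 1679 (a longer number is cut to its first four digits)
```

The last call illustrates why the method is "not perfect": any four consecutive digits count as a year, whether or not they are one.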

In [20]:
len(getyear)

3477

In [22]:
getyear[filenames[nums_sorted[99]]]

'1661'

<h4>Chunking</h4>

Now to the chunking function. It takes as input a word-tokenized text, that is, a list of words and full stops. With `n` being the chunk size and `i` the starting position (initially 0), it looks at the last element of the desired chunk (position `i+n-1`) and checks whether it is a full stop. If not, `n` is incremented until a full stop is found or the end of the text is reached, and the slice from `i` to `i+n` (exclusive) is appended to the initially empty list of chunks. The next chunk starts at `i+n`, and the chunk size is reset to its initial value. This continues as long as the calculated end position is within the range of the text.
The remainder, which is bound to be shorter than the desired chunk size, is added to the chunk list as a chunk of its own or, if it is shorter than a predetermined minimum size, merged into the last chunk in the list.

In [23]:
def chunkthis(txt, chunksize, minsize):
    """Split a token list into chunks of at least `chunksize` tokens, each ending on a full stop."""
    chunks = []
    n = chunksize
    i = 0
    while i + n <= len(txt) and n != 0:
        # extend the chunk until it ends on a full stop or the text runs out
        while txt[i+n-1] != '.' and i + n < len(txt):
            n += 1
        chunks.append(txt[i:i+n])
        i += n
        n = chunksize
    # if the remainder is shorter than minsize, merge it into the previous chunk (if there is one)
    if len(txt) - i < minsize and len(chunks) > 0:
        chunks[-1] += txt[i:]
    # otherwise (remainder of at least minsize, or no previous chunk) keep it as a chunk of its own
    else:
        chunks.append(txt[i:])
    return chunks
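
A toy run makes the behaviour easier to follow. The sketch below restates the function so that the block runs on its own, then chunks an eleven-token list with a chunk size of 3 and a minimum size of 2; the token list and the small parameter values are made up for illustration.

```python
def chunkthis(txt, chunksize, minsize):
    chunks = []
    n = chunksize
    i = 0
    while i + n <= len(txt) and n != 0:
        # grow the chunk until it ends on a full stop or the text runs out
        while txt[i + n - 1] != '.' and i + n < len(txt):
            n += 1
        chunks.append(txt[i:i + n])
        i += n
        n = chunksize
    # short remainder: merge into the previous chunk; otherwise keep it separately
    if len(txt) - i < minsize and len(chunks) > 0:
        chunks[-1] += txt[i:]
    else:
        chunks.append(txt[i:])
    return chunks

toy = ['a', 'b', '.', 'c', 'd', 'e', '.', 'f', 'g', '.', 'h']
result = chunkthis(toy, chunksize=3, minsize=2)
for chunk in result:
    print(chunk)
# ['a', 'b', '.']
# ['c', 'd', 'e', '.']
# ['f', 'g', '.', 'h']
```

The second chunk is stretched to four tokens to reach the next full stop, and the one-token remainder `'h'` is merged into the last chunk because it is shorter than `minsize`. A text shorter than `chunksize` comes back as a single chunk regardless of `minsize`, which explains the "1 times" entries in the output further down.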

Using this chunking function, we iterate over the cleaned corpus. For each text we look up its file index in `nums_sorted` (the list of random numbers), chunk the text into lists of approximately 400 tokens each, append these chunks to the initially empty list `texts_chunked`, and append the file index to a separate list `chunk_index` once per chunk created. Thus, if a text was split into 100 chunks, its file index is recorded 100 times, so that looking up a chunk's position in `chunk_index` returns the index of the file it was extracted from. This will be needed later to get the year of origin of each chunk.

In [24]:
chunk_index = []
texts_chunked = []
for i in range(len(texts_clear)):
    file_index = nums_sorted[i] #let's assume first random number is 12, therefore file_index in first loop is 12
    print('file index:', file_index)
    chunks = chunkthis(texts_clear[i], 400, 100) #let's assume this creates 100 chunks
    texts_chunked += chunks #add chunks to chunk list
    chunk_index += [file_index]*len(chunks) #add file index 12 100 times --> chunk_index[99] will then return file index of 99th chunk
    print(len(chunks), 'times')

file index: 0
8 times
file index: 3
8 times
file index: 8
2 times
...

In [25]:
len(texts_chunked)

214115

In [26]:
chunk_index[214114]

34776
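
The lookup chain from a chunk back to its year of origin can be sketched as follows. All values are toy stand-ins for the notebook's data structures, and `year_of_chunk` is a hypothetical helper name; only the chain itself (chunk position, then file index, then file name, then year) follows the description above.

```python
# toy stand-ins for the notebook's data structures (all values hypothetical)
filenames = ['1641 - A.txt', '1650 - B.txt', 'undated - C.txt', '1661 - D.txt']
getyear = {'1641 - A.txt': '1641', '1661 - D.txt': '1661'}
chunk_index = [0, 0, 0, 3, 3]   # file 0 yielded three chunks, file 3 yielded two

def year_of_chunk(chunk_pos):
    """Map a position in the chunk list to the year of its source file."""
    file_index = chunk_index[chunk_pos]
    return getyear[filenames[file_index]]

print(year_of_chunk(1))  # 1641
print(year_of_chunk(4))  # 1661
```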

In [33]:
len(texts_chunked[999])

419

<h4>Removal of stop words</h4>

Next, we clear the resulting chunks of stop words and full stops, so that only words not appearing among the 500 most frequent terms remain. Since the stop word list also contained punctuation and numbers, we ended up removing a little fewer than 500 different terms. 'xml' was added to the stop word list as it appeared in every single file.

In [40]:
# read the first 500 stop words (the first whitespace-separated field of each line)
# note: the file is assumed to be UTF-8 encoded
stop_words = []
with open('stopwords.txt', encoding='utf-8') as f:
    for _ in range(500):
        stop_words.append(f.readline().split()[0])

In [41]:
len(stop_words)

500

In [42]:
stop_words

['the',
 'of',
 'and',
 'to',
 'in',
 'that',
 'a',
 'is',
 'it',
 'for',
 'his',
 'as',
 'be',
 'he',
 'not',
 'by',
 'but',
 'they',
 'which',
 'with',
 'this',
 'i',
 'or',
 'all',
 'their',
 'so',
 'was',
 'are',
 'them',
 'god',
 'him',
 'from',
 'you',
 'we',
 'have',
 'our',
 'if',
 'at',
 'were',
 'will',
 'had',
 'no',
 'd',
 'then',
 'may',
 'there',
 'one',
 'an',
 'my',
 'any',
 'who',
 'shall',
 'what',
 'when',
 'more',
 'such',
 'upon',
 'other',
 'these',
 'on',
 'your',
 'man',
 'some',
 'yet',
 'hath',
 'christ',
 'should',
 's',
 'being',
 'into',
 'her',
 '1',
 'men',
 'those',
 'great',
 'out',
 'would',
 'do',
 'us',
 'me',
 'good',
 '2',
 'made',
 'now',
 'first',
 'did',
 'before',
 'lord',
 'time',
 'much',
 'against',
 'must',
 'also',
 'thou',
 'many',
 'make',
 'can',
 'king',
 'thy',
 'church',
 'things',
 'after',
 'most',
 'same',
 'said',
 'how',
 'very',
 'c',
 'therefore',
 'because',
 '3',
 'without',
 'haue',
 'where',
 'nor',
 'unto',
 'like',
 'say',
 ...]

In [43]:
stop_words = [w for w in stop_words if w.isalpha()] #remove numbers and punctuation from stop words
stop_words.append('xml')

In [44]:
len(stop_words)

475

In [45]:
stop_words

['the',
 'of',
 'and',
 'to',
 'in',
 'that',
 'a',
 'is',
 'it',
 'for',
 'his',
 'as',
 'be',
 'he',
 'not',
 'by',
 'but',
 'they',
 'which',
 'with',
 'this',
 'i',
 'or',
 'all',
 'their',
 'so',
 'was',
 'are',
 'them',
 'god',
 'him',
 'from',
 'you',
 'we',
 'have',
 'our',
 'if',
 'at',
 'were',
 'will',
 'had',
 'no',
 'd',
 'then',
 'may',
 'there',
 'one',
 'an',
 'my',
 'any',
 'who',
 'shall',
 'what',
 'when',
 'more',
 'such',
 'upon',
 'other',
 'these',
 'on',
 'your',
 'man',
 'some',
 'yet',
 'hath',
 'christ',
 'should',
 's',
 'being',
 'into',
 'her',
 'men',
 'those',
 'great',
 'out',
 'would',
 'do',
 'us',
 'me',
 'good',
 'made',
 'now',
 'first',
 'did',
 'before',
 'lord',
 'time',
 'much',
 'against',
 'must',
 'also',
 'thou',
 'many',
 'make',
 'can',
 'king',
 'thy',
 'church',
 'things',
 'after',
 'most',
 'same',
 'said',
 'how',
 'very',
 'c',
 'therefore',
 'because',
 'without',
 'haue',
 'where',
 'nor',
 'unto',
 'like',
 'say',
 'well',
 'see',
 ...]

In [46]:
len(texts_chunked[999])

419

In [47]:
texts_final = [[w for w in text if w.isalpha() and w not in stop_words] for text in texts_chunked]
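
Because `w not in stop_words` scans the whole list for every token, converting the stop words to a set first should speed this step up noticeably on a corpus of this size. The sketch below shows the idea on toy data; the name `stop_set` is an assumption, and the result is the same as with the list.

```python
stop_words = ['the', 'of', 'and', 'xml']              # toy stop word list
texts_chunked = [['the', 'wounds', 'of', 'jacob', '.'],
                 ['xml', 'wine', 'and', 'oyle', '.']]

stop_set = set(stop_words)   # set membership is a hash lookup rather than a list scan
texts_final = [[w for w in text if w.isalpha() and w not in stop_set]
               for text in texts_chunked]

print(texts_final)
# [['wounds', 'jacob'], ['wine', 'oyle']]
```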

In [48]:
len(texts_final)

214115

In [50]:
len(texts_final[999])

125

The resulting corpus `texts_final` now contains 214115 text chunks extracted from 3477 texts in word-tokenized form, cleared of stop words and punctuation. As we can see by comparing the lengths of the chunks before and after weeding out the stop words (chunk 999, for instance, went from 419 to 125 tokens), the chunks have shrunk considerably, which also goes to show how much of a text is just function words.

In [51]:
for txt in texts_final[10000:10050]:
    print(len(txt))

137
132
130
117
126
149
154
155
141
166
144
148
132
119
143
127
122
128
147
144
142
144
150
162
136
162
132
131
145
141
142
110
123
138
132
153
155
123
144
162
142
151
141
127
135
141
162
135
116
117


<h4>Saving data</h4>

Using pickle, we save the relevant data structures in serialized form: the final corpus, the chunk index allowing us to trace which text a chunk was extracted from, and the file indices of the texts used.

In [None]:
import pickle
with open("texts_final_new.p", "wb") as f:
    pickle.dump(texts_final, f)

In [None]:
with open("chunk_index.p", "wb") as f:
    pickle.dump(chunk_index, f)

In [None]:
with open("file_index.p", "wb") as f:
    pickle.dump(nums_sorted, f)

In [None]:
with open("texts_final_400w.p", "rb") as f:
    tst = pickle.load(f)

In [None]:
print(texts_final[10][:100])
print(tst[10][:100])
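
A quick way to check that a pickle file restores exactly what was saved is a round trip on a small object. The sketch below writes to a temporary directory; the file name `texts_final_demo.p` and the toy data are assumptions made for the example.

```python
import os
import pickle
import tempfile

data = [['wounds', 'jacob'], ['wine', 'oyle']]   # toy stand-in for texts_final

path = os.path.join(tempfile.mkdtemp(), 'texts_final_demo.p')  # hypothetical file name
with open(path, 'wb') as f:
    pickle.dump(data, f)
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == data)  # True
```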

<h4>Preparing the data for Topan</h4>

This next step creates a two-dimensional array with each row corresponding to a text chunk. The two columns are the chunk index (a combination of the file index and the number of the chunk within that file) and the remaining words of that chunk. Saving this table as a CSV file allowed us to read the texts into Topan as well. The second array is for looking up context information (file index, file name and year).

In [None]:
anarray = []
fileindex_prev = None
n = 0
for i, chunk in enumerate(texts_final):
    fileindex = chunk_index[i]
    # restart the chunk counter whenever a new file begins
    if fileindex != fileindex_prev:
        n = 1
    else:
        n += 1
    chunkindex = str(fileindex) + ':' + str(n)  # e.g. '12:3' is the third chunk of file 12
    anarray.append([chunkindex, " ".join(chunk)])
    fileindex_prev = fileindex

In [None]:
anotherarray = []
for num in nums_sorted:
    filename = filenames[num]
    anotherarray.append([num, filename, getyear[filename]])

In [None]:
anarray[:2]

In [None]:
import numpy
numpy.savetxt("all_the_chunks.csv", anarray, delimiter=",", fmt='%s')
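
As an alternative to `numpy.savetxt`, the table could also be written with the standard `csv` module, which quotes a field whenever it contains the delimiter. With purely alphabetic tokens this should not matter for the corpus above, so the sketch below is only a safer variant under that assumption; the rows, the header names and the in-memory buffer (standing in for a real file) are made up for illustration.

```python
import csv
import io

rows = [['0:1', 'wounds jacob'], ['0:2', 'wine oyle'], ['3:1', 'day trouble, saved']]

buffer = io.StringIO()   # stands in for open("all_the_chunks.csv", "w", newline="")
writer = csv.writer(buffer)
writer.writerow(['chunk_index', 'text'])   # hypothetical header row
writer.writerows(rows)

print(buffer.getvalue())

# reading the data back recovers the original fields, comma included
read_back = list(csv.reader(io.StringIO(buffer.getvalue())))
```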

<h2>Topic model</h2>

Now for the actual LDA topic modelling. We use Python's gensim library to create a dictionary mapping each remaining term in the corpus to a unique id. The `Dictionary` class stores this mapping as `token2id` and provides the reverse mapping `id2token` as well. This dictionary is then used to convert the corpus to a numeric format on the basis of the bag-of-words assumption: each text is treated as a bag of words, where only word frequencies matter and word order is ignored. Each text in the corpus thus becomes a list of (word id, word frequency) tuples, and `corpus` as a whole is a list of lists of numeric tuples.

In [None]:
import gensim
dictionary = gensim.corpora.Dictionary(texts_final)

In [None]:
corpus = [dictionary.doc2bow(text) for text in texts_final]
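
What the bag-of-words conversion produces can be illustrated without gensim. The sketch below is a hand-written stand-in, not gensim's implementation: `to_bow` is a hypothetical helper, and ids are assigned in order of first appearance, which may differ from the ids gensim assigns.

```python
from collections import Counter

docs = [['wine', 'oyle', 'wine'], ['jacob', 'wine']]   # toy stand-in for texts_final

# assign each distinct token an integer id (order of first appearance; an assumption)
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return sorted (token id, count) pairs for one document."""
    counts = Counter(token2id[t] for t in doc)
    return sorted(counts.items())

toy_corpus = [to_bow(doc) for doc in docs]
print(token2id)    # {'wine': 0, 'oyle': 1, 'jacob': 2}
print(toy_corpus)  # [[(0, 2), (1, 1)], [(0, 1), (2, 1)]]
```

Word order is lost in the process: `['wine', 'oyle', 'wine']` and `['oyle', 'wine', 'wine']` map to the same list of tuples.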

LDA was chosen as a model since it works relatively autonomously: the per-document topic distributions and the per-topic word distributions are learned from the corpus during training, guided by the Dirichlet priors alpha and beta (called `eta` in gensim). Note that gensim's `LdaModel` uses a symmetric alpha by default; an asymmetric prior, the recommended setting for topic modelling, has to be requested explicitly with `alpha='asymmetric'` or learned with `alpha='auto'`. The main difficulty lies in determining the optimal number of topics. In our case the choice was constrained by a sheer lack of computing power, so that we ended up producing a relatively small number of topics in relation to the size of the corpus. Since the resulting topics appeared meaningful enough to us, we left it at that.

In [None]:
tm50 = gensim.models.LdaModel(corpus, id2word=dictionary, num_topics=50, passes=10)

The last step was to visualize the topic model as applied to the corpus using Python's pyLDAvis package. Although many of the smaller topics are crammed into one corner of the coordinate system, each quadrant contains a number of topics and the bigger ones are relatively evenly spaced, which suggests that the topics are reasonably distinct from one another.

In [None]:
import pyLDAvis.gensim as gensimvis  # note: recent pyLDAvis releases rename this module to pyLDAvis.gensim_models
import pyLDAvis

In [None]:
lda_display = gensimvis.prepare(tm50, corpus, dictionary, sort_topics=False)
pyLDAvis.show(lda_display)