# LDA for Wikipedia
### Michael Frasco

In [1]:
import json
import numpy  # used later for loadtxt and argsort
from bs4 import BeautifulSoup
import string

In [2]:
import onlinewikipedia
import wikirandom
import printtopics
import onlineldavb

I downloaded the scripts necessary for LDA and saved them as .py files. Here I import them for use in this script.

In [3]:
# the unseen test articles
articles = sc.textFile("s3n://stat-37601/wiki.json").map(lambda x: json.loads(x))

In [4]:
def parseXML(xmlString):
    # Input: a string of XML obtained from the body of each document
    # Output: a string of ascii text that is normalized
        # lower case and no punctuation
    textString = BeautifulSoup(xmlString).get_text()
    textString = textString.encode('ascii', 'ignore')
    textString = string.replace(textString, '\n', ' ')
    textString = textString.lower()
    textString = textString.translate(string.maketrans("",""), string.punctuation)
    return textString
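The function above relies on Python 2 string helpers (`string.replace` and the two-argument form of `translate` with `string.maketrans`). Below is a minimal sketch of the same normalization under Python 3, assuming the markup has already been stripped to plain text. The name `normalize_text` is hypothetical and is not part of my original notebook.

```python
import string

def normalize_text(text):
    # Drop non-ASCII characters, mirroring encode('ascii', 'ignore').
    text = text.encode('ascii', 'ignore').decode('ascii')
    # Replace newlines with spaces and lower-case everything.
    text = text.replace('\n', ' ').lower()
    # Delete every ASCII punctuation character.
    return text.translate(str.maketrans('', '', string.punctuation))

print(normalize_text("Hello,\nWorld!"))
```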

In [5]:
testArticles = articles.map(lambda x: parseXML(x['body'])).collect()

In [6]:
# the with-block closes the file, so every line is flushed to disk
with open('testArticles.txt', 'w') as f:
    for article in testArticles:
        f.write("%s\n" % article)

In the code chunk above, I wrote my list of document strings to a .txt file, one article per line. I altered the code in onlinewikipedia.py so that it reads this .txt file and converts it back to a list of strings. After the code has run for all 200 batches of random Wikipedia articles, thereby training the model, I call the function "olda.update_lambda(testArticles)", where testArticles is that list of unseen articles. The output of this function includes the matrix of topic proportions (gamma) for each of the test documents. I saved this matrix as a .dat file.
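The round trip through the .txt file can be sketched as follows. This only illustrates reading one article per line. The helper name `read_articles` is hypothetical, and the actual change I made inside onlinewikipedia.py is not shown in this notebook.

```python
import os
import tempfile

def read_articles(path):
    # One normalized article per line; strip the trailing newline.
    with open(path) as f:
        return [line.rstrip('\n') for line in f]

# Demonstration with a throwaway file rather than the real testArticles.txt.
demo_articles = ["first article text", "second article text"]
demo_path = os.path.join(tempfile.mkdtemp(), "demoArticles.txt")
with open(demo_path, "w") as f:
    for article in demo_articles:
        f.write("%s\n" % article)

recovered = read_articles(demo_path)
print(recovered == demo_articles)
```

The one-article-per-line layout works because parseXML replaces every newline inside an article with a space.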

In [7]:
# runs online LDA over 200 batches of random Wikipedia articles
%run onlinewikipedia 201

downloaded 0/64 articles...
downloaded Alan_Gardner,_1st_Baron_Gardner. parsing...
downloaded Carl_Steinfort. parsing...
downloaded Andrei_Kuznetsov_(volleyball). parsing...
downloaded Einsatzgruppen_trial. parsing...
downloaded Lampson,_Wisconsin. parsing...
downloaded Imperial_Service_College. parsing...
downloaded Henry_Pendleton. parsing...
downloaded Sverdrup_Gold_Medal_Award. parsing...
downloaded 8/64 articles...
downloaded Art_of_Anarchy. parsing...
downloaded Operating_income_before_depreciation_and_amortization. parsing...
downloaded World_Poker_Tour_bracelet. parsing...
downloaded Patrick_Meehan. parsing...
downloaded Khenemetneferhedjet_III. parsing...
downloaded In_the_Cool,_Cool,_Cool_of_the_Evening. parsing...
downloaded David_Tikolo. parsing...
downloaded Giandomenico_Basso. parsing...
downloaded 16/64 articles...
downloaded Fethiye. parsing...
downloaded List_of_religious_slurs. parsing...
downloaded EGL_(programming_language). parsing...
downloaded Githambo. parsing..

In [8]:
%run printtopics.py dictnostops.txt lambda-200.dat

topic 0:
               class  	---	  0.1005
            canadian  	---	  0.0815
                iron  	---	  0.0595
               cross  	---	  0.0508
            squadron  	---	  0.0414
                 nov  	---	  0.0330
              flight  	---	  0.0314
                 jan  	---	  0.0299
               pilot  	---	  0.0291
             mission  	---	  0.0204
             knights  	---	  0.0174
                 oct  	---	  0.0171
              flying  	---	  0.0164
               first  	---	  0.0145
                 dec  	---	  0.0136
                 aug  	---	  0.0120
           christmas  	---	  0.0096
                 feb  	---	  0.0089
              cotton  	---	  0.0087
             october  	---	  0.0079
              cannon  	---	  0.0078
               april  	---	  0.0077
                 air  	---	  0.0072
                june  	---	  0.0061
               world  	---	  0.0060
                 oak  	---	  0.0058
              flames  	---	  0.0057
             sailin

Above we see the distribution over words for each of the 100 topics that the model generates. The first thing that jumps out at me is that even the most common word in each topic occurs infrequently. There were a lot of topics where no word had more than a 5% chance of occurring. The highest value I saw was for a topic about politics, where the word "state" had a 31% probability of occurring. The words in each topic seemed to make sense, and I was satisfied with the results.
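Here is a minimal sketch of how per-topic word probabilities like these can be derived, assuming each row of the lambda matrix holds unnormalized word weights for one topic. The toy vocabulary and numbers are made up for illustration and are not taken from lambda-200.dat.

```python
import numpy

vocab = ["game", "league", "state", "october"]   # hypothetical vocabulary
lam = numpy.array([[6.0, 3.0, 0.5, 0.5],         # hypothetical topic 0
                   [1.0, 1.0, 7.0, 1.0]])        # hypothetical topic 1

# Normalize each row so it becomes a probability distribution over words.
word_probs = lam / lam.sum(axis=1, keepdims=True)

for k in range(word_probs.shape[0]):
    order = numpy.argsort(word_probs[k])[::-1]   # most probable word first
    print("topic %d:" % k)
    for w in order[:2]:
        print("%20s  \t---\t  %.4f" % (vocab[w], word_probs[k, w]))
```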

I noticed that a lot of the documents in the testing set had to do with basketball. As a huge basketball fan, this made me happy. It also made me want to examine the topics to see whether any of them were sports-related. I did find one: for me, it was topic 10. For you, it will probably be different. My hope is that the topic distribution for the sports articles will put a large amount of mass on this topic.

In [10]:
def normalizeGamma(gamma):
    # input: the topic proportions for a given document
        # gamma must be a numpy array (a plain list cannot be
        # divided by a float)
    # output: normalized proportions that sum to one
    
    return gamma / float(sum(gamma))

This is my normalizing function for each gamma vector. I figured that it was the simplest possible function. I couldn't really think of a more complex function that would be meaningful.
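Written out, with $\gamma_{dk}$ denoting the unnormalized weight of topic $k$ in document $d$ and $K$ the number of topics (notation introduced here only for illustration), the function computes

$$\theta_{dk} = \frac{\gamma_{dk}}{\sum_{j=1}^{K} \gamma_{dj}}, \qquad \sum_{k=1}^{K} \theta_{dk} = 1.$$

For example, a made-up vector $(2, 1, 1)$ becomes $(0.5, 0.25, 0.25)$.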

In [9]:
# these are the gamma values that I obtained by adding some code
# to the onlinewikipedia.py script.
gamma = numpy.loadtxt('gamma-unseen200.dat')

In [12]:
for i in range(gamma.shape[0]):
    gamma[i,:] = normalizeGamma(gamma[i,:])

Below, I analyze the first row of the gamma matrix, which represents the topic distribution for the first document. This document is an article about Michael Jordan.

In [68]:
sortedGamma = sorted(gamma[0,:], reverse=True)
dominantTopics = numpy.argsort(gamma[0,:])[::-1]

In [69]:
sortedGamma[:10]

[0.23068537132657579,
 0.091873950538532076,
 0.056990522724928841,
 0.050868676025854279,
 0.050082087707721046,
 0.047801955218698532,
 0.045680918762440383,
 0.024890296822106178,
 0.024553989524844957,
 0.024459034928935294]

In [70]:
dominantTopics[:10]

array([13, 70, 29, 21, 10, 65, 58, 63, 22, 48])

These numbers will probably be different if you run this code yourself, but when I run it, the most dominant topic is topic 13, which accounts for 23% of the document. When I looked at the most frequent words in topic 13, I found words like "game", "league", "games", "played", and "career". That makes a lot of sense for MJ. The second most dominant topic was topic 70, at 9%. This topic consisted of words like "October", "March", and "December". Since basketball is played throughout the year and the article on MJ is going to use dates, this also makes a lot of sense. This algorithm is pretty cool.
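The cells above sort the proportions and the topic indices separately. Here is a small sketch, on made-up numbers rather than the real gamma matrix, of reading both off a single `argsort` so that each topic index stays paired with its share of the document.

```python
import numpy

# Made-up normalized proportions for one document over five topics.
doc_gamma = numpy.array([0.05, 0.60, 0.10, 0.20, 0.05])

order = numpy.argsort(doc_gamma)[::-1]   # topic indices, largest share first
top = [(int(k), float(doc_gamma[k])) for k in order[:3]]
print(top)
```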

In [23]:
def dissimilarityMeasure1(gamma1, gamma2):
    # input: the topic proportions for two different documents
        # these are both vectors with number of elements equal
        # to the number of topics. These are also normalized
        # so that each vector sums to one
    # output: the mean squared difference between the two
        # vectors. Similar documents will have a small value
    
    difference = numpy.array(gamma1 - gamma2)
    return numpy.mean(difference ** 2)

In [24]:
def similarityMeasure2(gamma1, gamma2, numTopics):
    # input: the normalized topic proportions for two different
        # documents and the number of topics that will be used
        # to generate the similarity measure
    # output: a similarity measure between the two documents
        # as a number between zero and one. Specifically, we
        # rank the topics of each document from most to least
        # frequent and compute the fraction of the first
        # numTopics ranks at which both documents have the
        # same topic
        
    # We use arg sort to find which topics are most frequent
    topic_orders_1 = numpy.argsort(gamma1)[::-1] # reverses the array
    topic_orders_2 = numpy.argsort(gamma2)[::-1]
    
    count = 0
    for i in range(numTopics):
        if topic_orders_1[i] == topic_orders_2[i]:
            count += 1
    similarity = float(count) / numTopics
    return similarity

In [38]:
def similarityMeasure3(gamma1, gamma2, numTopics):
    # input: the normalized topic proportions for two different
        # documents and the number of topics that will be used
        # to generate the similarity measure
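    # output: a similarity measure between the two documents
        # as a number between zero and one. Specifically, the
        # fraction of document 1's numTopics most frequent
        # topics that also appear among document 2's numTopics
        # most frequent topics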
        
    # We use arg sort to find which topics are most frequent
    topic_orders_1 = numpy.argsort(gamma1)[::-1] # reverses the array
    topic_orders_2 = numpy.argsort(gamma2)[::-1]
    
    topics1 = topic_orders_1[:numTopics]
    topics2 = topic_orders_2[:numTopics]
    
    count = 0
    for i in range(numTopics):
        if topics1[i] in topics2:
            count += 1
    similarity = float(count) / numTopics
    return similarity

I thought of three reasonable measures for comparing documents. The first is the mean squared difference between the two gamma vectors, which is proportional to the squared Euclidean distance between them, so smaller values mean more similar documents. However, I did not like this metric because it is not tailored to the situation at hand. For topic modelling, we would say that two articles are similar if their most dominant topics are the same. We really only care about the first few topics and whether they are the same.

So I created a second function that finds the most frequent topics for each document and counts the number of rank positions at which the two documents have the same topic. This function takes as an input the number of topics to consider. A possible limitation of this function is that one document could be about topics (x, y, z, w) and another document could be about topics (y, x, w, z). In this case, since the ordering is not the same, the similarity would be zero, even though the documents are actually pretty similar. So I implemented a third function, which counts the number of shared elements among the first numTopics topics of each document.
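The (x, y, z, w) versus (y, x, w, z) situation can be checked on a toy example. The sketch below uses compact re-implementations of the three measures under hypothetical names (`mean_squared_difference`, `ordered_match`, `set_overlap`) and two made-up documents. It is meant as an illustration, not a replacement for the functions above.

```python
import numpy

def mean_squared_difference(g1, g2):
    # Smaller values mean more similar documents.
    return float(numpy.mean((g1 - g2) ** 2))

def ordered_match(g1, g2, num_topics):
    # Fraction of the top num_topics ranks where both documents have the same topic.
    o1 = numpy.argsort(g1)[::-1][:num_topics]
    o2 = numpy.argsort(g2)[::-1][:num_topics]
    return float(numpy.sum(o1 == o2)) / num_topics

def set_overlap(g1, g2, num_topics):
    # Fraction of document 1's top topics that also appear in document 2's top topics.
    o1 = numpy.argsort(g1)[::-1][:num_topics]
    o2 = numpy.argsort(g2)[::-1][:num_topics]
    return float(len(set(o1.tolist()) & set(o2.tolist()))) / num_topics

# Topics are indexed x=0, y=1, z=2, w=3, plus two background topics.
doc_a = numpy.array([0.40, 0.30, 0.15, 0.10, 0.03, 0.02])  # ranks x, y, z, w
doc_b = numpy.array([0.30, 0.40, 0.10, 0.15, 0.02, 0.03])  # ranks y, x, w, z

print(mean_squared_difference(doc_a, doc_b))
print(ordered_match(doc_a, doc_b, 4))   # 0.0: no rank position agrees
print(set_overlap(doc_a, doc_b, 4))     # 1.0: the same four topics dominate
```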

In [92]:
sims1 = list()
sims2 = list()
sims3 = list()
numTopics = 5
for i in range(gamma.shape[0] - 1):
    for j in range(i + 1, gamma.shape[0]):
        dissimilarity1 = dissimilarityMeasure1(gamma[i,:], gamma[j,:])
        sims1.append(((i, j), dissimilarity1))
        
        similarity2 = similarityMeasure2(gamma[i,:], gamma[j,:], numTopics)
        sims2.append(((i, j), similarity2))
        
        similarity3 = similarityMeasure3(gamma[i,:], gamma[j,:], numTopics)
        sims3.append(((i, j), similarity3))

In [93]:
sortedSims1 = sorted(sims1, key = lambda x: x[1])

In [94]:
sortedSims1[:10]

[((21, 41), 1.42169254200624e-05),
 ((45, 54), 5.1155307506474481e-05),
 ((38, 41), 6.7973755100412069e-05),
 ((13, 18), 7.2102855477063085e-05),
 ((34, 43), 7.3031609083422031e-05),
 ((5, 6), 7.7191820126762807e-05),
 ((4, 7), 7.9513372284489541e-05),
 ((21, 38), 8.2410789157424635e-05),
 ((35, 36), 8.2975321870201632e-05),
 ((25, 26), 8.5836526889308682e-05)]

In [95]:
sortedSims2 = sorted(sims2, key = lambda x: x[1], reverse=True)

In [96]:
sortedSims2[:10]

[((12, 16), 0.8),
 ((21, 41), 0.8),
 ((33, 43), 0.8),
 ((47, 55), 0.8),
 ((5, 6), 0.6),
 ((9, 11), 0.6),
 ((12, 13), 0.6),
 ((12, 14), 0.6),
 ((12, 18), 0.6),
 ((13, 14), 0.6)]

In [97]:
sortedSims3 = sorted(sims3, key = lambda x: x[1], reverse=True)

In [98]:
sortedSims3[:10]

[((32, 34), 1.0),
 ((38, 41), 1.0),
 ((45, 54), 1.0),
 ((51, 55), 1.0),
 ((0, 1), 0.8),
 ((1, 9), 0.8),
 ((4, 7), 0.8),
 ((5, 6), 0.8),
 ((7, 8), 0.8),
 ((9, 10), 0.8)]

I would expect the first two metrics to behave fairly similarly. If two documents share the same ordering of their most frequent topics, the values of those topic proportions are probably also pretty similar. This intuition is supported by the pairs (21, 41) and (5, 6), which appear in the top ten for both metrics. However, I have a feeling that the third metric will do the best job of finding the documents that are most similar. I chose to look at the first 5 topics because it was the smallest number for which the top ten pairs did not all have perfect scores.
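One rough way to check how much the metrics agree is to intersect their top-ten lists. The sketch below copies the document pairs from the three outputs above and reports which pairs each pair of metrics has in common.

```python
# Document pairs copied from sortedSims1[:10], sortedSims2[:10], sortedSims3[:10].
top1 = [(21, 41), (45, 54), (38, 41), (13, 18), (34, 43),
        (5, 6), (4, 7), (21, 38), (35, 36), (25, 26)]
top2 = [(12, 16), (21, 41), (33, 43), (47, 55), (5, 6),
        (9, 11), (12, 13), (12, 14), (12, 18), (13, 14)]
top3 = [(32, 34), (38, 41), (45, 54), (51, 55), (0, 1),
        (1, 9), (4, 7), (5, 6), (7, 8), (9, 10)]

shared_12 = sorted(set(top1) & set(top2))
shared_13 = sorted(set(top1) & set(top3))
shared_23 = sorted(set(top2) & set(top3))

print(shared_12)   # [(5, 6), (21, 41)]
print(shared_13)   # [(4, 7), (5, 6), (38, 41), (45, 54)]
print(shared_23)   # [(5, 6)]
```

Many pairs tie at 0.8 or 0.6 under the second and third metrics, so which tied pairs land in the top ten depends on the sort order. These counts are only a rough indication.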

I compared articles 21 and 41 and found that both were nonsense little blurbs. I don't know whether I failed to process the text correctly, but I am glad that these were found to be similar. I also compared articles 5 and 6. These articles are both about swimming and the Olympics, so it makes sense that they would have high similarity scores. Lastly, I compared articles 32 and 34. These articles were both about British pop music. It's hard to tell which of my metrics is the best without a very thorough examination of each of the articles, but it seems like they all did a good job.