# Latent Dirichlet Allocation

We try out scikit learn's LDA method and look at some best practices. 

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import LatentDirichletAllocation
from operator import itemgetter
import random
import time

random.seed(42)  # for better reproducibility

We fetch a sample corpus that consists of a news group posts (basically, twitter with no character limit and no pictures). We hold back about 10% of documents for evaluation. We fix the number of topics relatively arbitrarily to 12. Moreover, we get ourselves an LDA object. 

In [2]:
corpus, _ = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'),return_X_y=True)
random.shuffle(corpus)
test_size=len(corpus)//10
corpus,held_corpus=corpus[test_size:],corpus[:test_size]

print("# documents in corpus: {}".format(len(corpus)))
print("# documents held back: {}".format(len(held_corpus)))
num_topics=12
lda=LatentDirichletAllocation(n_components=num_topics,learning_method='online')

# documents in corpus: 10183
# documents held back: 1131


It's always good to look at some samples. (Here, we only look at one but please poke around a bit.)

In [3]:
print(corpus[8567])

What resources and services are available on Internet/BITNET which
would be of interest to hospitals and other medical care providers?
I'm interested in anything relelvant, including institutions and
businesses of interest to the medical profession on Internet,
special services such as online access to libraries or diagnostic
information, etc. etc.


## First attempt: Run scikit learn's LDA out of the box

We first try what happens when we use LDA without any fancy processing and so on. In the first step, we need to turn each text document into a vector. This is done by <code>CountVectorizer</code> which sets up a bag-of-words model.

In [4]:
vectorizer=CountVectorizer()
vector_data=vectorizer.fit_transform(corpus)
print("size of vocabulary: {}".format(len(vectorizer.vocabulary_.keys())))

size of vocabulary: 91960


The vocabulary seems rather large for only about 10000 documents. Next, we fit the LDA model on the vectorised data. 

In [5]:
def run_fit(data):
    start=time.time()
    lda.fit(data)
    end=time.time()
    print("Fitting took {:2.1f}s".format(end-start))

run_fit(vector_data)

Fitting took 107.2s


We can access the topic-term probabilities with <code>lda.components_</code>. However, this is just a list of probabilities. 

In [6]:
lda.components_[0]

array([0.08334693, 0.08333764, 1.94665496, ..., 0.08333334, 0.08333334,
       0.08333334])

To better understand what the topics actually means we pick out the terms with largest probability for each topic. Hopefully, we'll see then that the topics really capture a natural concept that is present in the corpus. 

How prevalent a topic is in the corpus is obviously of interest, too. We compute that by summing the topic weight over all documents in the corpus. We then normalise to get a percentage.

In [7]:
def show_topic_stats(lda,corpus,num_top_words=15):
    topic_mixes=lda.transform(vectorizer.transform(corpus))
    total_topic_weights=topic_mixes.sum(axis=0)
    rel_topic_weights=total_topic_weights/sum(total_topic_weights)    
    topics_by_weight=sorted([(topic,topic_weight) for topic,topic_weight in enumerate(rel_topic_weights)],key=itemgetter(1),reverse=True)
    feature_names = vectorizer.get_feature_names()
    for topic,topic_weight in topics_by_weight:
        topic_dist=lda.components_[topic]
        top_terms=topic_dist.argsort()[:-num_top_words - 1:-1]
        message=" ".join([feature_names[i] for i in top_terms])
        print("Topic {:2}: {:4.1f}% -> ".format(topic,topic_weight*100)+message)

show_topic_stats(lda,corpus)

Topic 10: 41.1% -> the to and is it of that you for in this have be with on
Topic  7: 22.9% -> the of to and that is in it you not are as be this for
Topic 11: 19.1% -> the and to in of that was he they on it for at you but
Topic  5:  8.5% -> the and of for to in is on file from edu by or are program
Topic  1:  2.2% -> 00 10 25 15 20 16 11 12 14 13 18 17 55 30 24
Topic  4:  1.7% -> of university 1993 national by health april research center states united and in medical information
Topic  8:  1.1% -> gm at de coli candida infections infection mydisplay ci son cd moncton rochester een loser
Topic  0:  0.9% -> db entry output entries rules oname mov contest fprintf excellent cs year check_io uuencode ___
Topic  6:  0.7% -> pl 1t 3t bh qq 0d m3 mq sl ah 34 d9 5u 7u m_
Topic  2:  0.7% -> maine en op_cols op_rows da kotl heh turk ul ob iv reuss salonica bedouin ermeni
Topic  9:  0.6% -> cx w7 c_ hz uw t7 ck chz lk w1 17 mv k8 sk a7
Topic  3:  0.5% -> ax max g9v b8f a86 145 1d9 0t bhj giz wm 

This is obviously garbage. In particular we see many common short words, such as "the", "to", "and", and a good number of terms that are just rubbish ("g9v", "__" etc). 

While it is clear that these topics are not useful, we still try to evaluate them numerically. This is not easy to do well. Here, we look at two measures: the log likelihood of a whole corpus and the [perplexity](https://en.wikipedia.org/wiki/Perplexity) per word. Both are evaluated on the held back corpus. Larger values are better for the log likelihood; smaller values are better for perplexity.

In [8]:
def print_validation(test_corpus,old_scores=None):
    test_data=vectorizer.transform(held_corpus)
    score=lda.score(test_data)
    num_words=sum([len(doc.split(" ")) for doc in test_corpus])
    perplexity=lda.perplexity(test_data)/num_words
    old_like_str=""
    old_perp_str=""
    if old_scores is not None:
        old_like,old_perp=old_scores
        old_like_str=" (was: {:.2f})".format(old_like)
        old_perp_str=" (was: {:.4f})".format(old_perp)
    print("log likelihood of test corpus : {:.2f}".format(score)+old_like_str)
    print("word perplexity of test corpus:  {:.4f}".format(perplexity)+old_perp_str)
    return score,perplexity
    
first_val=print_validation(held_corpus)

log likelihood of test corpus : -1927517.24
word perplexity of test corpus:  0.0279


It's hard to interpret these numbers at this moment. They will become useful below, when we can compare different models. (This is also the reason why the code is a little bit more complicated then we'd need right now.)

## Second attempt: Filter very frequent words

Very frequent words do not tell us anything about the topic of a document. The word "the" will occur in almost every document but will stil be accomodated by the LDA model. So, it's better to filter out such words. 

First let us have a look at the terms with largest document frequency, ie, at the words that appear in the largest number of documents. (Note, that we do not count here how often the word appears in total, but in how many document it appears.)

In [9]:
from collections import defaultdict

def document_frequency(corpus):
    doc_freq=defaultdict(int)
    for document in corpus:
        for token in set(document.split()):
            doc_freq[token]+=1
    return doc_freq

def print_top_df_words(corpus,num=15):
    df=document_frequency(corpus)
    top_df_words=sorted([(key,count) for key,count in df.items()],key=itemgetter(1),reverse=True)
    print("words occuring most frequently in different documents:")
    for key,count in top_df_words[:num]:
        print(" {:6}: {}".format(key,count))

print_top_df_words(corpus)

words occuring most frequently in different documents:
 the   : 8357
 to    : 7533
 a     : 7342
 of    : 6816
 and   : 6774
 I     : 6313
 in    : 6110
 is    : 6020
 that  : 5615
 for   : 5439
 it    : 4726
 have  : 4574
 on    : 4411
 be    : 4338
 with  : 4261


This list is not a surprise and clearly none of these terms helps to understand what a given document is about. We filter them out. <code>CountVectorizer</code> can do that quite conveniently. We only have to pass <code>max_df=0.1</code> and then it will suppress all terms that appear in more than 10% of the documents. We also set <code>min_df=10</code> to filter out all those terms that appear in only up to 10 documents. (A fractional value, e.g. <code>min_df=0.01</code>, will filter out all terms that appear in fewer than 1% of the documents.)

By the way, it seems it's good to be aggressive when filtering out words that are too common. Smaller fractions than 10% document frequency (perhaps 5%?) may lead to even better results. A large maximum document frequency (ie, 50%) will cull only a few extremely common words; see the list above. What is best here likely depends on the corpus and the number of topics.

In [10]:
vectorizer=CountVectorizer(max_df=0.1, min_df=10)
vector_data=vectorizer.fit_transform(corpus)
print("size of vocabulary: {}".format(len(vectorizer.vocabulary_.keys())))
run_fit(vector_data)

size of vocabulary: 9736
Fitting took 43.7s


As expected the vocabulary is much smaller than previously, which also results in much faster fitting.

In [11]:
show_topic_stats(lda,corpus)

Topic  3: 22.2% -> re better really sure since something might problem work ll doesn probably still thing point
Topic  0: 13.3% -> god jesus believe us him our christian bible true did question must life church law
Topic  1: 11.6% -> file edu windows program com files window ftp available version db graphics server image using
Topic  6: 10.5% -> she car him her said back off went go didn down got over us left
Topic 11:  9.3% -> drive card system dos disk scsi hard video pc memory mac drives speed monitor apple
Topic  4:  8.4% -> gun right israel us against children israeli rights anti our police state said human guns
Topic 10:  6.6% -> space university edu information 00 nasa 1993 research data 10 center available 20 list contact
Topic  2:  6.0% -> 10 00 game team 25 15 17 11 12 16 20 games 14 13 year
Topic  8:  5.1% -> mr president government states state national our american bill united public health congress groups years
Topic  5:  4.1% -> key encryption chip keys clipper security 

While still not good, we see that now at least some of the topics make sense. Still there are some garbage terms that pollute the topics. 

In [12]:
second_val=print_validation(held_corpus,old_scores=first_val)

log likelihood of test corpus : -897250.40 (was: -1927517.24)
word perplexity of test corpus:  0.0146 (was: 0.0279)


As expected, log likelihood and word perplexity have improved.

## Third attempt: More professional preprocessing

We use the <code>gensim</code> library to improve our preprocessing. See https://radimrehurek.com/gensim/ for the documentation.

Let us take a sample and let us see how <code>gensim</code> transforms it. 

In [13]:
sample_doc=corpus[5678]
print(sample_doc)



Since running any GUI over a network is going to slow it down by a
fair amount, I expect Windows NT will be multiuser only in the sense
of sharing filesystems. Someone will likely write a telnetd for it so
one could run character-based apps, but graphics-based apps will have
to be shared by running the executables on the local CPU. This is how
things are shaping up everywhere: client-server architectures are
taking over from the old cpu-terminal setups. 

Note that the NeXT does this: you can always telnet into a NeXT and
run character-based apps but you can't run the GUI. (Yeah, I know
about X-Windows, just haven't been too impressed by it...)..






-- 


<code>gensim</code> has a number of preprocessing routines. We use <code>gensim.parsing.preprocessing.preprocess_string</code> that takes as argument a number of filters. We lower case everything, we strip punctuation marks (,.1?_ etc), we throw away numbers and other special characters, we remove very short words (up to two characters) and stop words. Let's look at the result.

In [14]:
import gensim
from gensim.parsing.preprocessing import *
filters=[lambda x: x.lower(),strip_tags,strip_punctuation,strip_multiple_whitespaces,strip_numeric,remove_stopwords,strip_short]

processed=gensim.parsing.preprocessing.preprocess_string(sample_doc,filters=filters)
print(" ".join(processed))

running gui network going slow fair expect windows multiuser sense sharing filesystems likely write telnetd run character based apps graphics based apps shared running executables local cpu things shaping client server architectures taking old cpu terminal setups note telnet run character based apps run gui yeah know windows haven impressed


Clearly, the sample document has become much shorter and many non-interesting terms are gone. 

Let us look at the stop words that were removed.

In [15]:
" ".join(sorted(gensim.parsing.preprocessing.STOPWORDS))

'a about above across after afterwards again against all almost alone along already also although always am among amongst amoungst amount an and another any anyhow anyone anything anyway anywhere are around as at back be became because become becomes becoming been before beforehand behind being below beside besides between beyond bill both bottom but by call can cannot cant co computer con could couldnt cry de describe detail did didn do does doesn doing don done down due during each eg eight either eleven else elsewhere empty enough etc even ever every everyone everything everywhere except few fifteen fifty fill find fire first five for former formerly forty found four from front full further get give go had has hasnt have he hence her here hereafter hereby herein hereupon hers herself him himself his how however hundred i ie if in inc indeed interest into is it its itself just keep kg km last latter latterly least less ltd made make many may me meanwhile might mill mine more moreover

Now we process the whole corpus (and also the held back corpus).

In [16]:
filters=[lambda x: x.lower(),strip_tags,strip_punctuation,strip_multiple_whitespaces,strip_numeric,remove_stopwords,strip_short]

def preprocess_corpus(corpus,filters=filters):
    preprocessed_corpus=[]
    for document in corpus:
        preprocessed=gensim.parsing.preprocessing.preprocess_string(document,filters=filters)
        preprocessed_document=" ".join(preprocessed)
        preprocessed_corpus.append(preprocessed_document)
    return preprocessed_corpus

preprocessed_corpus=preprocess_corpus(corpus)
preprocessed_held_corpus=preprocess_corpus(held_corpus)

We set up the bag-of-words model and we fit the LDA model.

In [17]:
vectorizer=CountVectorizer(max_df=0.1, min_df=10)
vector_data=vectorizer.fit_transform(preprocessed_corpus)
print("size of vocabulary: {}".format(len(vectorizer.vocabulary_.keys())))
run_fit(vector_data)

size of vocabulary: 8520
Fitting took 38.2s


Not surprisingly, vocabulary and running time have improved again.

In [18]:
show_topic_stats(lda,corpus)

Topic  9: 19.9% -> going better right said got little things years lot look come let thing maybe sure
Topic  0: 12.4% -> problem help mail looking work email windows read info post advance sound send sure appreciated
Topic 11: 11.1% -> state government right gun law israel jews rights states case fact question israeli issue point
Topic  1:  8.9% -> max file edu available window program ftp files image graphics version server windows software code
Topic 10:  8.2% -> game team year edu play games season win hockey players league san period com teams
Topic  3:  7.9% -> drive card dos disk scsi hard memory video mac drives monitor bit apple board controller
Topic  5:  7.0% -> god jesus believe bible christian church life christians faith christ religion true man book truth
Topic  7:  6.1% -> space president information research nasa national university program data center april technology earth science available
Topic  2:  5.7% -> car power bike cars engine price sale goal shipping sell ex

This looks much better! A lot of the garbage is gone and many of the topics are recognisable. There are still some issues, though. We may see singular and plural form of some nouns ("car" and "cars") -- surely, if some document talks about "cars" and some other about a "car" then both documents are about "motor vehicles". 

In [19]:
third_val=print_validation(preprocessed_held_corpus,old_scores=second_val)

log likelihood of test corpus : -682352.84 (was: -897250.40)
word perplexity of test corpus:  0.0897 (was: 0.0146)


Again log likelihood has improved. Word perplexity has not though. This is most likely because many common words are now suppressed.

## Fourth attempt: Stemming

"Car" and "cars" designate the same topic, "run" and "running" too. *Stemming* is a technique that reduces words to their root, so that both "runs" and "running" become "run". We use here a simple stemming algorithm, *Porter stemming*. Other libraries implement more sophisticated stemming methods that try to identify whether the term is a noun, a verb and so on to do a better stemming.

In [20]:
from gensim.parsing.porter import PorterStemmer
porter_stemmer=PorterStemmer()

document="I run running she runs lexicographic lexicographically"
porter_stemmer.stem_sentence(document)

'i run run she run lexicograph lexicograph'

Let's look at a sample document.

In [21]:
filters=[lambda x: x.lower(),strip_tags,strip_punctuation,strip_multiple_whitespaces,strip_numeric,remove_stopwords,strip_short,stem_text]

processed=gensim.parsing.preprocessing.preprocess_string(sample_doc,filters=filters)
print(" ".join(processed))

run gui network go slow fair expect window multius sens share filesystem like write telnetd run charact base app graphic base app share run execut local cpu thing shape client server architectur take old cpu termin setup note telnet run charact base app run gui yeah know window haven impress


We immediately identify a problem: Some of the terms are no longer recognisable as English words ("multius", "charact"). To be able to go back from the stemmed terms to the original terms we compute a dictionary <code>lemma_mapping</code>. (My excuses: the code is perhaps a bit more complicated than necessary.)

In [22]:
porter_stemmer=PorterStemmer()
filters=[lambda x: x.lower(),strip_tags,strip_punctuation,strip_multiple_whitespaces,strip_numeric,remove_stopwords,strip_short]

def preprocess_document(document,lemma_mapping,filters=filters):
    result = []
    processed=gensim.parsing.preprocessing.preprocess_string(document,filters=filters)
    for token in processed:
        lemma = porter_stemmer.stem(token)
        result.append(lemma)
        # map lemma to token
        # ...if lemma not yet in dictionary
        # ...or with small probability (thus, more likely that lemma is mapped to a common term)
        if (lemma not in lemma_mapping.keys()) or (random.random()<0.05):
            lemma_mapping[lemma]=token
    return result

def preprocess_corpus_stemming(corpus,filters=filters):
    lemma_mapping={}
    result=[]
    for document in corpus:
        result.append(" ".join(preprocess_document(document,lemma_mapping,filters=filters)))
    return result,lemma_mapping

preprocessed_corpus,lemma_mapping=preprocess_corpus_stemming(corpus)
preprocessed_held_corpus,_=preprocess_corpus_stemming(held_corpus)
lemma="multius"
print('The stemmed term "{}" came from "{}"'.format(lemma,lemma_mapping[lemma]))

The stemmed term "multius" came from "multiuser"


Wonderful, we can do perform stemming but also recover the original term. (Or rather some common original term.) We vectorise and fit the model.

In [23]:
vectorizer=CountVectorizer(max_df=0.1, min_df=10)
vector_data=vectorizer.fit_transform(preprocessed_corpus)
print("size of vocabulary: {}".format(len(vectorizer.vocabulary_.keys())))
run_fit(vector_data)

size of vocabulary: 6207
Fitting took 37.1s


Note how the size of the vocabulary has decreased further due to stemming. Because of stemming, we need to adapt the method <code>show_topic_stats</code> to take advantage of our inverse stemming-map.

In [24]:
def show_topic_stats(lda,corpus,lemma_mapping,num_top_words=15):
    topic_mixes=lda.transform(vectorizer.transform(corpus))
    total_topic_weights=topic_mixes.sum(axis=0)
    rel_topic_weights=total_topic_weights/sum(total_topic_weights)    
    topics_by_weight=sorted([(topic,topic_weight) for topic,topic_weight in enumerate(rel_topic_weights)],key=itemgetter(1),reverse=True)
    feature_names = vectorizer.get_feature_names()
    for topic,topic_weight in topics_by_weight:
        topic_dist=lda.components_[topic]
        top_terms=topic_dist.argsort()[:-num_top_words - 1:-1]
        message=" ".join([lemma_mapping[feature_names[i]] for i in top_terms])
        print("Topic {:2}: {:4.1f}% -> ".format(topic,topic_weight*100)+message)
 
show_topic_stats(lda,preprocessed_corpus,lemma_mapping)

Topic  5: 18.5% -> going car day started let said happen surely lots tells says ask got gets went
Topic  6: 12.3% -> god believe christian jesus means questions existing point word reasonably claim argument true bible personally
Topic  2: 11.2% -> max file windows programs version running dos images set display server graphics available colors software
Topic 11:  8.9% -> drive cards guns disk controlled driver scsi price hard monitor board video mac speed sale
Topic  4:  7.8% -> games play team players season win scoring league hockey goal period running second fan point
Topic  3:  7.4% -> space development data research nasa programs orbiter centers science designation images including model information universal
Topic  8:  7.1% -> bikes rides dod appears little left got black cause motorcycles turned flame picture better man
Topic  7:  6.4% -> state government president national american law israel public united weapons country israeli health report politic
Topic 10:  5.8% -> armenia

The topics appear to make more and more sense. There is clearly a tech topic, there is a religion topic, there is a topic about guns and so on.

In [25]:
fourth_val=print_validation(preprocessed_held_corpus,old_scores=third_val)

log likelihood of test corpus : -621667.91 (was: -682352.84)
word perplexity of test corpus:  0.0825 (was: 0.0897)


Log likelihood and word perplexity have improved somewhat.

## Fifth attempt: n-grams

How can we improve the model further? One issue with our approach: We implicitly assume that spaces separate the terms. This is true in a sentence such as "I am eating cake". In the sentence "Ulm University is awesome", however, the terms "Ulm" and "University" became separate, which seems wrong. We do not talk about Ulm and also not about universities in general but about a specific university, namely "Ulm university". To identify such terms we look at all 2-grams, pairs of words that appear consecutively. 

In [26]:
def twograms(sentence):
    word_list=sentence.split()
    if len(word_list)<=1:
        return word_list
    result=[]
    for i,word in enumerate(word_list[1:]):
        result.append(word_list[i]+" "+word)
    return result

twograms("my cat is eating cake")

['my cat', 'cat is', 'is eating', 'eating cake']

Now we identify the most frequent 2-grams.

In [27]:
def count_occurences(twograms_list):
    count=defaultdict(int)
    for twogram in twograms_list:
        count[twogram]+=1
    return count

def top_twograms(corpus,num=15):
    twograms_list=[]
    for document in corpus:
        twograms_list.extend(twograms(document.lower()))
    count_dict=count_occurences(twograms_list)
    top=sorted([(key,count) for key,count in count_dict.items()],key=itemgetter(1),reverse=True)
    return top[:num]

top_twograms(corpus)

[('of the', 10394),
 ('in the', 7120),
 ('to the', 4042),
 ('on the', 3763),
 ('to be', 2911),
 ('it is', 2887),
 ('for the', 2729),
 ('that the', 2641),
 ('is a', 2497),
 ('and the', 2433),
 ('i have', 2225),
 ("max>'ax>'ax>'ax>'ax>'ax>'ax>'ax>'ax>'ax>'ax>'ax>'ax>'ax>'ax>' max>'ax>'ax>'ax>'ax>'ax>'ax>'ax>'ax>'ax>'ax>'ax>'ax>'ax>'ax>'",
  2203),
 ('if you', 2126),
 ('with the', 2030),
 ('this is', 1845)]

Okay, that did not work. Obviously, we first have to filter out the garbage and *then* look for frequent 2-grams. Note that we don't do stemming yet. This is mostly, so that we can still recognise the terms. 

In [28]:
preprocessed_corpus=preprocess_corpus(corpus)
preprocessed_held_corpus=preprocess_corpus(held_corpus)
top_twos=top_twograms(preprocessed_corpus,num=20)
top_twos

[('max max', 2838),
 ('united states', 289),
 ('new york', 264),
 ('thanks advance', 222),
 ('years ago', 210),
 ('law enforcement', 207),
 ('anonymous ftp', 194),
 ('mit edu', 184),
 ('clipper chip', 172),
 ('los angeles', 148),
 ('hard disk', 141),
 ('let know', 136),
 ('looks like', 133),
 ('power play', 131),
 ('hard drive', 129),
 ('edu pub', 128),
 ('public key', 118),
 ('nasa gov', 114),
 ('bhj bhj', 114),
 ('year old', 111)]

Much better. "United States" and "New York" are certainly terms we want to keep. (Instead of "United" and "States".) 

Let's set up a function that turns a 2-gram "United States" into "United_States", so that when we split by spaces, the 2-grams stay together.

In [29]:
def mark_twograms(corpus,twogram_list):
    processed_corpus=[]
    for document in corpus:
        document=document.lower()
        for twogram in twograms(document):
            if twogram in twogram_list:
                first_word,second_word=twogram.split()
                document=document.replace(twogram,first_word+"_"+second_word)
        processed_corpus.append(document)
    return processed_corpus

### let's test this
toy_corpus=["my cat is moving from new york to los angeles and back to new york","my new cat is born in york"]
mark_twograms(toy_corpus,["new york","los angeles","hard drive"])

['my cat is moving from new_york to los_angeles and back to new_york',
 'my new cat is born in york']

Now, we mark the most frequent 2-grams and then we perform stemming.

In [30]:
top_twos=[twogram for twogram,_ in top_twos]
preprocessed_corpus=mark_twograms(preprocessed_corpus,top_twos)
preprocessed_held_corpus=mark_twograms(preprocessed_held_corpus,top_twos)
preprocessed_corpus,lemma_mapping=preprocess_corpus_stemming(preprocessed_corpus,filters=[])  # only do stemming
preprocessed_held_corpus,_=preprocess_corpus_stemming(preprocessed_held_corpus,filters=[])

Finally, vectorisation and fitting. Note that the size of the vocabulary will go up a bit due to the 2-grams.

In [31]:
vectorizer=CountVectorizer(max_df=0.1, min_df=10)
vector_data=vectorizer.fit_transform(preprocessed_corpus)
print("size of vocabulary: {}".format(len(vectorizer.vocabulary_.keys())))
run_fit(vector_data)

size of vocabulary: 6224
Fitting took 37.9s


In [32]:
show_topic_stats(lda,preprocessed_corpus,lemma_mapping)

Topic  2: 12.1% -> key bit card encryption chips number data prices disk phone mac monitor control include devices
Topic 11: 11.4% -> questions point different book case discussion general example reason mean person caused effect group idea
Topic  7: 11.4% -> drives help tell got going read gets happen sure maybe heard started scsi set lot
Topic  8: 10.8% -> filed windows program images available version running dos server include set applications lists graphics users
Topic  3:  9.8% -> god christian believe jesus said jews lived israel religion bible word church says life mean
Topic  4:  8.9% -> team games playing players win season goal hockey league city said pick flame san great
Topic  6:  8.3% -> car engine power bike ground wires water rides light oil miles tired dod braking auto
Topic  1:  8.0% -> government gun states law president american publication countries going issues weapon crime court person control
Topic 10:  5.8% -> games max running went hit got second left reds sco

The topics don't look too different and no 2-gram is among the top words. Here I had more hopes. 

I should point out, though, that identifying phrases that belong together can be done a lot better. By restricting ourselves to 2-grams, for instance, we will never keep "United States of America". 

In [33]:
fifth_val=print_validation(preprocessed_held_corpus,old_scores=fourth_val)

log likelihood of test corpus : -622211.22 (was: -621667.91)
word perplexity of test corpus:  0.0842 (was: 0.0825)


Again, not much change here. 

## Exploration

Let's look at some of the documents.

In [34]:
def show_doc_and_topic(doc_num,snippet_length=600):
    if len(corpus[doc_num])>snippet_length:
        print(corpus[doc_num][:snippet_length]+"...\n")
    else:
        print(corpus[doc_num]+"\n")
    topic_mix=lda.transform(vectorizer.transform([preprocessed_corpus[doc_num]]))[0]
    rel_topics=sorted([(i,proportion) for i,proportion in enumerate(topic_mix) if proportion>0.005],key=itemgetter(1),reverse=True)
    for topic,proportion in rel_topics:
        print("Topic {:2}: {:.2f}%".format(topic,proportion*100))        

show_doc_and_topic(1234)





How about Kevin Hatcher? Scored roughly 35 goals, plays 30 minutes a game.


That's really sad when two second-rate goalies (Barasso and Belfour)
are the main contenders for the Vezina. Call me crazy, but how about
Tommy Soderstrom - five shutouts for a 6th place team that doesn't
really play defense. It's really unfortunate that the better goalies
in the league (McLean, Essensa, Vernon) had unspectacular years. BTW,
if you are going to award the Norris on the basis of the last 30 days,
why not give the Vezina to Moog? He has been the best goalie over the
past month.





Arbour or King. B...

Topic  4: 68.87%
Topic 10: 20.07%
Topic  7: 5.15%
Topic  6: 4.68%


In [35]:
show_doc_and_topic(5364)

In <1993Apr15.143320.8618@desire.wright.edu>, demon@desire.wright.edu sez:

There's this minor thing called "interest of finality/repose".  What
it means is that parties aren't dragged into court over and over again
because the losing side "discovers" some "new" evidence.  I don't know
about you, Brett, but I suspect GM had the resources to find just
about as many expert and fact witnesses as it wanted before the trial
started.  Letting them re-open the case now is practically an
invitation to every civil litigant on earth to keep an ace in the hole
in case the verdict goes against him.

BTW, ...

Topic  1: 41.95%
Topic 11: 24.84%
Topic  3: 10.88%
Topic  4: 7.88%
Topic  0: 6.24%
Topic  5: 5.59%
Topic 10: 2.10%


In [36]:
show_doc_and_topic(7654)


Something to bear in mind is what the V in VLB stands for!

V for Video - the origional intention of the bus was to speed up
the bus so that large memory to memory transfers would be faster.
This is espically useful in transfering data from main memory to
video memory.

Since there are usually 3 VLB slots card makers have been making 
cards to fit in the other two. 

How about an VLB ethernet card? Move the data into the card at
130 odd MB/s and then wait for it to tickle onto the net at
just over 1Mb/s.

[ Do do however free the local bus for other cards ]

Some times you need fast busses an...

Topic  2: 47.23%
Topic  5: 27.74%
Topic  7: 18.00%
Topic  1: 5.52%


In [37]:
show_doc_and_topic(8737)

Hello
	HELP!!! please
		I am a student of turbo c++ and graphics programming
	and I am having some problems finding algorithms and code
	to teach me how to do some stuff..

	1) Where is there a book or code that will teach me how
	to read and write pcx,dbf,and gif files?

	2) How do I access the extra ram on my paradise video board
	so I can do paging in the higher vga modes ie: 320x200x256
	800x600x256
	3) anybody got a line on a good book to help answer these question?

Thanks very much !

send reply's to : Palm@snycanva.bitnet

Topic  8: 30.12%
Topic  2: 28.63%
Topic 11: 21.97%
Topic  7: 15.07%
Topic  1: 2.82%


Let's now pick out some documents that score highly for a chosen topic.

In [38]:
topic_mixes=lda.transform(vectorizer.transform(preprocessed_corpus))

def show_top_doc_for_topic(topic_num):
    for doc_num,topic_mix in enumerate(topic_mixes):
        if topic_mix[topic_num]>0.7:
            break
    show_doc_and_topic(doc_num)
    
show_top_doc_for_topic(8)

Has anybody generated an X server for Windows NT?  If so, are you willing
to share your config file and other tricks necessary to make it work?

Thanks for any information.

Topic  8: 92.36%
Topic  1: 0.69%
Topic  7: 0.69%
Topic 11: 0.69%
Topic  2: 0.69%
Topic  3: 0.69%
Topic  0: 0.69%
Topic  6: 0.69%
Topic  4: 0.69%
Topic 10: 0.69%
Topic  9: 0.69%
Topic  5: 0.69%


In [39]:
show_top_doc_for_topic(4)


Actually, fired-coach George Kingston was a third of the GM
triumvirate.  Now that the trio is now duo (Dean Lombardi and Chuck
Grillo), the Sharks are already on their 3rd "office of the GM". And a
4th is likely to happen before September; they'll either add the new
coach to the OofGM, or name a single GM. So your wager should be
amended to read that Sharks are likely to have their 5th GM before the
Panther's get their 2nd. Can't wait to see how the next season's NHL
Guide and Record Book lists the GM history of the Sharks.

Given the depth of next year's draft, the expansion draft rules, an...

Topic  4: 91.40%
Topic  5: 2.61%
Topic  1: 2.51%
Topic 11: 2.30%


In [40]:
show_top_doc_for_topic(10)


wondering
                                                      -------------

Do you mean Juan Berenguer?  He was traded for Mark Davis in the middle
of last season.  Exchanged one stiff for another, as Berenguer hadn't
come back from his injury in 91.  I think he's retired now.

Anyhow, as middle relief, Marvin ain't that bad.  He at least can
pitch a couple of innings or do mop-up work.  I don't know much
about McMichael (was he the Mexican League guy?), but
everybody else in the pen is a 1 inning man, except maybe
Mercker.


-------------------------------------------------------
Eric Rou...

Topic 10: 70.83%
Topic  4: 13.19%
Topic  7: 7.63%
Topic  5: 6.50%


Version: 09/09/22