# Latent Dirichlet Allocation

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import LatentDirichletAllocation

In [2]:
# Load the default training split; headers, footers and quoted replies are stripped
corpus, _ = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'), return_X_y=True)

In [3]:
print(corpus[8567])



As you point out, the experiments would be difficult. But we know
enough about the physics of the situation to do some calculations.
There are in fact three effects contributing to leaning the bike over
to begin a turn.

	1. Gyro effect causing a torque which twists the bike over.

	2. Contact patch having shifted to one side, causing bike to fall over.

	3. Contact patch being accelerated to the side, causing a
	torque which twists the bike over.

Take an average bike/rider, average bike wheel, and at speeds of 5,
15, and 50 mph (say) calculate how much twist of the bars would be
needed to produce (say) 20 degrees of lean in (say) 2 seconds by each
effect alone. My guess is that at slow speeds 2 is dominant, and at
high speeds 3 is dominant, and at all speeds 1 contributes not far off
bugger all, relatively speaking.

By the way, a similar problem is this: how does a runner who wants to
run round a corner get leaned into the corner fast? Is there a running
group where we could start

In [4]:
vectorizer=CountVectorizer()
vector_data=vectorizer.fit_transform(corpus)
print("size of vocabulary: {}".format(len(vectorizer.vocabulary_.keys())))

size of vocabulary: 101631


This is quite large, so let's build the vocabulary again with some filtering. First we set max_df, the maximum document frequency of a word: with max_df=0.9, words that occur in more than 90% of the documents are filtered out. We also set a minimum document count, min_df=10, so that a word needs to appear in at least ten documents to be included in the dictionary. Finally, we filter out English stop words.

A more professional approach would also use a lemmatiser, for instance the lemmatiser of spaCy, so that inflected forms such as "game" and "games" count as the same word. Let's keep things simple, though.

In [5]:
vectorizer=CountVectorizer(max_df=0.9, min_df=10,stop_words='english')
vector_data=vectorizer.fit_transform(corpus)
print("size of vocabulary: {}".format(len(vectorizer.vocabulary_.keys())))

size of vocabulary: 10441
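
To see how the two thresholds act, here is a minimal sketch on a made-up four-document corpus (toy_docs and toy_vectorizer are hypothetical names, not part of the notebook above). The word "the" appears in every document and is removed by max_df; words that appear in only one document are removed by min_df, which is set to 2 here because the toy corpus is so small.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: "the" occurs in all four documents,
# "cat" and "sat" in two each, every other word in just one.
toy_docs = [
    "the cat sat",
    "the cat ran",
    "the dog sat",
    "the rare bird",
]

# max_df=0.9 drops words found in more than 90% of the documents ("the").
# min_df=2 keeps only words found in at least two documents.
toy_vectorizer = CountVectorizer(max_df=0.9, min_df=2)
toy_vectorizer.fit(toy_docs)

toy_vocab = sorted(toy_vectorizer.vocabulary_)
print(toy_vocab)  # ['cat', 'sat']
```

Note that a float value of max_df or min_df is read as a fraction of documents, while an integer is read as an absolute document count.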


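We fit Latent Dirichlet Allocation with ten topics (n_components=10). Because it's less memory intensive, we use online (i.e. minibatch) learning, which updates the model from small batches of documents rather than from the whole corpus at once. Since we don't set random_state, the topics will differ somewhat from run to run.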

In [6]:
lda=LatentDirichletAllocation(n_components=10,learning_method='online')
lda.fit(vector_data)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='online', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=10,
                          mean_change_tol=0.001, n_components=10, n_jobs=None,
                          perp_tol=0.1, random_state=None,
                          topic_word_prior=None, total_samples=1000000.0,
                          verbose=0)
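
With learning_method='online', fit splits the data into minibatches internally (batch_size=128 in the output above). If the document-term matrix itself does not fit in memory, the same kind of update can be driven by hand with partial_fit. The following is only a sketch on a hypothetical toy corpus; toy_docs, toy_lda and the batch size of 2 are illustrative choices.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, just large enough for two minibatches.
toy_docs = [
    "goalie scores in hockey game",
    "team wins hockey season",
    "nasa launches space probe",
    "space probe orbits earth",
]
counts = CountVectorizer().fit_transform(toy_docs)

# total_samples tells partial_fit how many documents exist overall,
# so that each minibatch update is scaled to the full corpus size.
toy_lda = LatentDirichletAllocation(
    n_components=2,
    learning_method='online',
    total_samples=counts.shape[0],
    random_state=0,
)

# Feed the documents two at a time; each call performs one online update.
batch_size = 2
for start in range(0, counts.shape[0], batch_size):
    toy_lda.partial_fit(counts[start:start + batch_size])

# One row of word weights per topic.
print(toy_lda.components_.shape)
```

In a real out-of-core setting the batches would come from disk or a generator, and the vocabulary would have to be fixed beforehand so that every batch has the same columns.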

Let's print the top words for each topic.

In [7]:
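# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = vectorizer.get_feature_names_out()
n_top_words = 10
for topic_idx, topic in enumerate(lda.components_):
    # argsort is ascending, so the reversed slice gives the highest-weight words first
    top_indices = topic.argsort()[:-n_top_words - 1:-1]
    message = " ".join(feature_names[j] for j in top_indices)
    print("Topic {}: ".format(topic_idx) + message)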


Topic 0: don know just like think time ve people good want
Topic 1: key public use information government encryption law chip number security
Topic 2: god people jesus believe does bible evidence say life true
Topic 3: year game team games good play season hockey think players
Topic 4: edu com new car drive price power hard used sale
Topic 5: ax max g9v b8f a86 pl 145 1d9 34u 1t
Topic 6: space nasa research center 1993 earth health university years water
Topic 7: file use windows edu software available program using version files
Topic 8: 00 10 25 11 15 17 16 12 14 20
Topic 9: people said gun mr government israel armenian war state armenians


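There's obviously still a bit of garbage in the data: topic 5 is made up of junk tokens such as ax, g9v and b8f, and topic 8 consists entirely of numbers.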
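
One possible way to reduce this garbage, not used in this notebook, is to restrict tokens to purely alphabetic words of at least three letters via the token_pattern argument of CountVectorizer. The sketch below uses a made-up two-line corpus built from some of the junk tokens above; the pattern and the names toy_docs and letters_only are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents mixing real words with tokens like those in topics 5 and 8.
toy_docs = [
    "price 00 10 25 g9v ax max",
    "nasa 1993 earth b8f 34u",
]

# Keep only tokens consisting of three or more letters. Numbers ("00", "1993"),
# mixed tokens ("g9v", "b8f", "34u") and two-letter tokens ("ax") no longer match.
letters_only = CountVectorizer(token_pattern=r"(?u)\b[a-zA-Z]{3,}\b")
letters_only.fit(toy_docs)

kept = sorted(letters_only.vocabulary_)
print(kept)  # ['earth', 'max', 'nasa', 'price']
```

This is a blunt filter: "max" survives because it looks like an ordinary word, and dropping every number also discards years such as 1993 that may carry meaning.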

In [8]:
doc_num=1234
print(corpus[doc_num])



No. The issue is reducing crime, not guns. If gun control doesn't
lower crime overall, then is doesn't address the issue.


Does that matter if assaults with a baseball bat become much
more common? Muggers using a gun rely primarily on the
threat of the gun, and rarely shoot their victim. A mugger
using a knife is much more likely to start by stabbing his victim 
in an effort incapacitate him. So, while a knif may not
be as deadly as a gun, criminals are more likely to actually
_use_ the knife (as opposed to threatening the victim with it.)
It isn't at all clear that replacing the criminal's gun with a
knife would reduce murders. Stabbings might just become more
common. That's why it is important to look at the overall
(not the with-gun) homicide rate. It avoids the issue of
substitution, different criminal techinques of using different
weapons, etc... and measures what we want to prevent: Murders.


"Face"? Possibly. However, facing knife-welding attackers isn't
usual tactic. Very f

In [9]:
topic_mix=lda.transform(vectorizer.transform([corpus[doc_num]]))[0]
for i,proportion in enumerate(topic_mix):
    if proportion > 0.005:
        print("Topic {}: {:.2f}%".format(i,proportion*100))

Topic 0: 19.51%
Topic 1: 7.07%
Topic 2: 7.91%
Topic 3: 7.53%
Topic 6: 4.73%
Topic 9: 52.97%


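Seems about right: the dominant topic 9 (people, gun, government, ...) fits a post about gun control, and the generic topic 0 accounts for most of the rest.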
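
The percentages printed above add up to slightly less than 100% only because topics below the 0.5% threshold are hidden; transform returns a normalised topic distribution for each document. A small sketch on a hypothetical toy corpus (toy_docs, toy_lda and doc_topics are illustrative names) that checks this and picks the dominant topic of each document:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus with two loosely separated themes.
toy_docs = [
    "hockey team wins game",
    "players score in hockey game",
    "nasa sends probe into space",
    "space probe reaches earth orbit",
]
counts = CountVectorizer().fit_transform(toy_docs)

toy_lda = LatentDirichletAllocation(
    n_components=2, learning_method='online', random_state=0
).fit(counts)

# One row per document, one column per topic; each row sums to 1.
doc_topics = toy_lda.transform(counts)
print(doc_topics.sum(axis=1))

# Index of the topic with the largest share in each document.
dominant = doc_topics.argmax(axis=1)
print(dominant)
```

Which topic index ends up corresponding to which theme is arbitrary and depends on the random initialisation.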