# Vectors and Matrices: Basic Text Analytics

## Bonus: A (Very) Advanced Example: Topic Modeling

This example is a sneak peek at the more advanced features of data mining. We will try to find topics in Marx and Engels' Communist Manifesto. Many of the features shown below will be covered in more detail later in this course.

In [34]:
import requests
text = requests.get('http://www.gutenberg.org/cache/epub/61/pg61.txt').text.lower()
print(text[:100])

﻿the project gutenberg ebook of the communist manifesto
by karl marx and friedrich engels

this e


We study the book roughly paragraph by paragraph. A slice of the raw text shows that paragraphs are separated by blank lines (`\r\n\r\n`):

In [35]:
text[1000:1500]

'party in opposition that has not been decried as\r\ncommunistic by its opponents in power?  where is the opposition\r\nthat has not hurled back the branding reproach of communism,\r\nagainst the more advanced opposition parties, as well as against\r\nits reactionary adversaries?\r\n\r\ntwo things result from this fact.\r\n\r\ni.  communism is already acknowledged by all european powers\r\nto be itself a power.\r\n\r\nii.  it is high time that communists should openly, in the\r\nface of the whole world, publish their vi'

In [65]:
# we split by two hard returns
paras = text.split('\r\n\r\n')
print(len(paras))
print(paras[190])

294
by this, the long wished-for opportunity was offered to "true"
socialism of confronting the political movement with the
socialist demands, of hurling the traditional anathemas
against liberalism, against representative government, against
bourgeois competition, bourgeois freedom of the press, bourgeois
legislation, bourgeois liberty and equality, and of preaching to
the masses that they had nothing to gain, and everything to lose,
by this bourgeois movement.  german socialism forgot, in the nick
of time, that the french criticism, whose silly echo it was,
presupposed the existence of modern bourgeois society, with its
corresponding economic conditions of existence, and the political
constitution adapted thereto, the very things whose attainment
was the object of the pending struggle in germany.
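
The split above relies on the file using `\r\n` line endings. As a minimal sketch (the helper name `split_paragraphs` is hypothetical, not part of the notebook), a regular expression can split on blank lines regardless of the line-ending style:

```python
import re

def split_paragraphs(text):
    """Split text on one or more blank lines, accepting \\n or \\r\\n endings."""
    # \r?\n matches one line break; [ \t]* tolerates whitespace-only blank lines.
    parts = re.split(r'\r?\n(?:[ \t]*\r?\n)+', text)
    # Drop empty fragments and surrounding whitespace.
    return [p.strip() for p in parts if p.strip()]

sample = "first paragraph,\r\nstill first.\r\n\r\nsecond paragraph.\n\nthird paragraph."
paragraphs = split_paragraphs(sample)
print(len(paragraphs))
print(paragraphs[1])
```

Unlike `text.split('\r\n\r\n')`, this variant discards empty fragments, so the paragraph count and the indices used in the following cells would have to be rechecked if it were swapped in.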


In [66]:
# the manifesto starts after this header line
print(paras[6])


*** start of this project gutenberg ebook the communist manifesto ***


In [67]:
# the manifesto ends before this footer line
print(paras[-55])

*** end of this project gutenberg ebook the communist manifesto ***


Then we tokenize each paragraph properly, i.e. separate words from punctuation (for more details, see the next lecture).

In [46]:
from nltk.tokenize import word_tokenize
sentence = 'This example, will be: properly tokenized!'
tokens = word_tokenize(sentence)
print(tokens)

['This', 'example', ',', 'will', 'be', ':', 'properly', 'tokenized', '!']


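In the code below we do several things:
    - Skip the Project Gutenberg header and footer
    - Tokenize each paragraph
    - Select the longer paragraphs (at least 20 tokens) and store them in the variable `long_paras`
    - Keep punctuation tokens for now; non-alphabetic items are removed later, when the vocabulary is built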

In [107]:
# code is rather hacky, but hey it works!
long_paras = []

for para in paras[7:-55]: # ignore the gutenberg header and footer
    
    tokens = word_tokenize(para) # separate words from punctuation
    if len(tokens) >= 20: # keep paragraphs with at least 20 tokens
        long_paras.append(tokens) # add the tokenized paragraph to long_paras
            
print('We selected ',len(long_paras),' paragraphs of the total of ', len(paras)) # print the result

We selected  178  paragraphs of the total of  294


In [None]:
# demonstrate what happens in the loop 
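
A small sketch of what one pass through the loop does, on two toy paragraphs. The function `simple_tokenize` is a hypothetical stand-in for `word_tokenize` so that the example runs without NLTK, and the threshold is lowered from 20 to 10 to suit the toy data:

```python
import re

def simple_tokenize(text):
    # Stand-in for nltk's word_tokenize (an assumption, not the real tokenizer):
    # words and punctuation marks become separate tokens.
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)

toy_paras = [
    "a spectre is haunting europe.",
    "the history of all hitherto existing society is the history of class struggles.",
]
min_tokens = 10  # the notebook uses 20; lowered here for the toy data

kept = []
for para in toy_paras:
    tokens = simple_tokenize(para)
    print(len(tokens), tokens[:4])   # token count and the first few tokens
    if len(tokens) >= min_tokens:    # same test as in the loop above
        kept.append(tokens)

print('kept', len(kept), 'of', len(toy_paras))
```

The first paragraph yields 6 tokens and is dropped, the second yields 14 and is kept; note that the final full stop counts as a token.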

To feed our data to a topic modeling algorithm, we have to build a [document-term matrix](https://en.wikipedia.org/wiki/Document-term_matrix). For this we need a list of all types (distinct words), often called a vocabulary.

In [108]:
# tokenize the whole text at once
all_tokens = word_tokenize(text)
print(len(all_tokens))
print(len(set(all_tokens)))
vocab = set(all_tokens)

16568
2733


`set()` transforms a list into an unordered set and thereby removes all duplicates, as in the example below.

In [50]:
l = [1,1,1,2,3,4,4]
print(l)
s = set(l)
print(s)
l = list(s)
print(l)

[1, 1, 1, 2, 3, 4, 4]
{1, 2, 3, 4}
[1, 2, 3, 4]


In [None]:
# same as
list(set(l))

In [109]:
# print the first hundred items of all_tokens (duplicates included)
print(list(all_tokens)[:100])

['\ufeffthe', 'project', 'gutenberg', 'ebook', 'of', 'the', 'communist', 'manifesto', 'by', 'karl', 'marx', 'and', 'friedrich', 'engels', 'this', 'ebook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoever', '.', 'you', 'may', 'copy', 'it', ',', 'give', 'it', 'away', 'or', 're-use', 'it', 'under', 'the', 'terms', 'of', 'the', 'project', 'gutenberg', 'license', 'included', 'with', 'this', 'ebook', 'or', 'online', 'at', 'www.gutenberg.net', 'title', ':', 'the', 'communist', 'manifesto', 'author', ':', 'karl', 'marx', 'and', 'friedrich', 'engels', 'release', 'date', ':', 'january', '25', ',', '2005', '[', 'ebook', '#', '61', ']', 'language', ':', 'english', '***', 'start', 'of', 'this', 'project', 'gutenberg', 'ebook', 'the', 'communist', 'manifesto', '***', 'transcribed', 'by']


Function words (stop words) are often discarded. This is easily done with a membership condition.

In [51]:
from nltk.corpus import stopwords
stopw = stopwords.words('english') # get a list of stop words from an external library
print(stopw)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [110]:
vocab_wo_punct = []
print(len(vocab))
for w in vocab: # iterate over all the types in vocab
    if w.isalpha() and w not in stopw: # keep alphabetic items that are not stop words
        vocab_wo_punct.append(w)
print(len(vocab_wo_punct))

2733
2452


`vocab_wo_punct` now contains all alphabetic words that are not stop words.
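
Because `vocab` is a set, the order in which its words are visited can differ between Python sessions, and that order later determines the column order of the document-term matrix. One possible way to make it reproducible, shown here as a sketch on toy data rather than as part of the original notebook, is to sort the filtered vocabulary:

```python
tokens = ['the', 'workers', ',', 'of', 'the', 'world', 'unite', '!']
stop_words = {'the', 'of'}  # tiny hypothetical stop list for illustration

# The set comprehension removes duplicates, punctuation and stop words;
# sorted() then fixes the order so every run yields the same list.
vocab_sorted = sorted({w for w in tokens if w.isalpha() and w not in stop_words})
print(vocab_sorted)
```

Storing the stop words in a set rather than a list also makes the `not in` test faster, which matters little here but helps on larger vocabularies.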

Now let's filter out the words that appear five times or fewer.

In [111]:
from nltk import FreqDist
vocab_filtered = []
fd = FreqDist(all_tokens)
for v in vocab_wo_punct:
    if fd[v] > 5:
        vocab_filtered.append(v)
print(len(vocab_filtered))

261


Now we will transform a paragraph into a **count vector**: a list where each value indicates how often a vocabulary word appears in the paragraph (0 if it does not appear at all):

In [112]:
# Let's take a random example
example = long_paras[30]
print(example)
fd_s = FreqDist(example)

['modern', 'industry', 'has', 'converted', 'the', 'little', 'workshop', 'of', 'the', 'patriarchal', 'master', 'into', 'the', 'great', 'factory', 'of', 'the', 'industrial', 'capitalist', '.', 'masses', 'of', 'labourers', ',', 'crowded', 'into', 'the', 'factory', ',', 'are', 'organised', 'like', 'soldiers', '.', 'as', 'privates', 'of', 'the', 'industrial', 'army', 'they', 'are', 'placed', 'under', 'the', 'command', 'of', 'a', 'perfect', 'hierarchy', 'of', 'officers', 'and', 'sergeants', '.', 'not', 'only', 'are', 'they', 'slaves', 'of', 'the', 'bourgeois', 'class', ',', 'and', 'of', 'the', 'bourgeois', 'state', ';', 'they', 'are', 'daily', 'and', 'hourly', 'enslaved', 'by', 'the', 'machine', ',', 'by', 'the', 'over-looker', ',', 'and', ',', 'above', 'all', ',', 'by', 'the', 'individual', 'bourgeois', 'manufacturer', 'himself', '.', 'the', 'more', 'openly', 'this', 'despotism', 'proclaims', 'gain', 'to', 'be', 'its', 'end', 'and', 'aim', ',', 'the', 'more', 'petty', ',', 'the', 'more', 'h

In [113]:
vector = []
for w in vocab_filtered:
    count = fd_s.get(w,0)
    vector.append(count)
print(vector)

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]


Each row in the document-term matrix is such a vector, representing one paragraph.

Now we transform our corpus into a document-term matrix: a matrix where the rows represent paragraphs and the columns hold the counts of each vocabulary word.

In [114]:
vectors = []
for tokens in long_paras:
    vector = []
    freqdist = FreqDist(tokens)
    for w in vocab_filtered:
        
        vector.append(freqdist.get(w,0))
    vectors.append(vector)
            
print(len(vectors) == len(long_paras))
print(len(vectors[0]) == len(vocab_filtered))

True
True
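
The real matrix is large and mostly zeros, so here is the same construction on a made-up corpus of three tiny documents. This is a sketch using `collections.Counter` (the class that NLTK's `FreqDist` extends); the documents and vocabulary are invented for illustration:

```python
from collections import Counter

docs = [
    ['bourgeois', 'society', 'bourgeois', 'property'],
    ['class', 'society'],
    ['property', 'class', 'class', 'labour'],
]
# 'labour' is deliberately left out of the vocabulary, so it gets no column.
vocab = ['bourgeois', 'class', 'property', 'society']

matrix = []
for doc in docs:
    counts = Counter(doc)                      # word -> number of occurrences
    matrix.append([counts[w] for w in vocab])  # missing words count as 0

for row in matrix:
    print(row)
```

Each row has one entry per vocabulary word, in vocabulary order, so all rows have the same length even though the documents do not.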


Now we can train a topic model on the document-term matrix.

In [115]:
from sklearn.decomposition import LatentDirichletAllocation
# older scikit-learn releases called this parameter n_topics (as in the recorded output below)
lda = LatentDirichletAllocation(n_components=5, max_iter=20,
                                learning_method='online',
                                random_state=0,
                                verbose=0,
                                n_jobs=1)
lda.fit(vectors)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='online', learning_offset=10.0,
             max_doc_update_iter=100, max_iter=20, mean_change_tol=0.001,
             n_jobs=1, n_topics=5, perp_tol=0.1, random_state=0,
             topic_word_prior=None, total_samples=1000000.0, verbose=0)

And print the different topics:

In [116]:
# Example taken from: http://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()
    
print_top_words(lda, list(vocab_filtered), 10)

Topic #0: german socialism french ideas social society true one man class
Topic #1: production selling buying industries common women free national country mere
Topic #2: french family criticism philosophical money foundation state bourgeois everywhere german
Topic #3: bourgeois property production society existence relations conditions old means labour
Topic #4: class proletariat bourgeoisie society social bourgeois conditions political working revolutionary
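
The slice `topic.argsort()[:-n_top_words - 1:-1]` in `print_top_words` is compact but cryptic: it takes the indices sorted by ascending weight and walks them backwards for `n_top_words` steps. A pure-Python sketch of the same idea, with made-up weights and `sorted` standing in for NumPy's `argsort`:

```python
weights = [0.1, 2.5, 0.7, 1.3]           # hypothetical topic-word weights
feature_names = ['state', 'class', 'labour', 'property']
n_top_words = 2

# Indices ordered by ascending weight: a stand-in for numpy's argsort().
order = sorted(range(len(weights)), key=lambda i: weights[i])

# Start at the end, step backwards, stop after n_top_words items.
top = order[:-n_top_words - 1:-1]
top_words = [feature_names[i] for i in top]

print(order)
print(top)
print(top_words)
```

The result lists the highest-weighted words first, which is why each topic line above begins with its most characteristic word.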



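We can improve the topics by removing words based on their document frequency, i.e. the fraction of paragraphs in which a word appears. Below we keep only the words that occur in more than 1% and in fewer than 50% of the paragraphs.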

In [117]:
flattened = []
for p in long_paras:
    flattened.extend(list(set(p)))

fd_df = FreqDist(flattened)

total_docs = len(long_paras)

vocab_df = []
for w in vocab_wo_punct:
    if 0.01  < fd_df.get(w,0) / total_docs < 0.5:
        vocab_df.append(w)

In [118]:
print(len(vocab_df))

756
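
Document frequency differs from plain frequency in that a word is counted at most once per document. A sketch on a made-up corpus of four documents, with bounds (`low`, `high`) chosen only to suit this toy data:

```python
from collections import Counter

docs = [
    ['class', 'class', 'struggle'],
    ['class', 'property'],
    ['class', 'property', 'labour'],
    ['class', 'labour', 'labour'],
]

# set(doc) removes repeats inside a document, so each word counts once per document.
df = Counter(w for doc in docs for w in set(doc))
n_docs = len(docs)

low, high = 0.25, 1.0  # hypothetical bounds for this toy corpus
kept = sorted(w for w, n in df.items() if low < n / n_docs < high)

print(sorted(df.items()))
print(kept)
```

Here `class` is dropped because it occurs in every document and `struggle` because it occurs in only one, which mirrors how the filter above removes both ubiquitous and very rare words.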


In [119]:
vocab_df

['individuality',
 'law',
 'destroy',
 'aims',
 'overcome',
 'chartists',
 'continued',
 'organised',
 'revolutionary',
 'openly',
 'gospel',
 'measures',
 'members',
 'masters',
 'philosophy',
 'war',
 'live',
 'distinctive',
 'discovery',
 'lose',
 'raised',
 'universal',
 'opposition',
 'forms',
 'writings',
 'element',
 'mass',
 'literate',
 'accusation',
 'improvement',
 'could',
 'strata',
 'position',
 'epoch',
 'created',
 'alone',
 'prejudices',
 'possible',
 'attainment',
 'gone',
 'requiring',
 'monopolised',
 'globe',
 'increased',
 'absolute',
 'marriage',
 'revolutionising',
 'naturally',
 'appropriation',
 'battles',
 'requires',
 'feet',
 'socialists',
 'rule',
 'assumed',
 'absolutism',
 'others',
 'half',
 'product',
 'compels',
 'oppose',
 'distinctions',
 'future',
 'wealth',
 'things',
 'end',
 'market',
 'practical',
 'find',
 'contrary',
 'open',
 'circumstances',
 'struggles',
 'aim',
 'patriarchal',
 'interest',
 'producing',
 'generally',
 'nay',
 'rest',
 're

In [120]:
vectors = []
for tokens in long_paras:
    vector = []
    freqdist = FreqDist(tokens)
    for w in vocab_df:
        
        vector.append(freqdist.get(w,0))
    vectors.append(vector)
            
print(len(vectors) == len(long_paras))
print(len(vectors[0]) == len(vocab_df))

True
True


In [121]:
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=5, max_iter=20, # called n_topics in older releases
                                learning_method='online',
                                random_state=0,
                                verbose=0,
                                n_jobs=1)
lda.fit(vectors)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='online', learning_offset=10.0,
             max_doc_update_iter=100, max_iter=20, mean_change_tol=0.001,
             n_jobs=1, n_topics=5, perp_tol=0.1, random_state=0,
             topic_word_prior=None, total_samples=1000000.0, verbose=0)

In [122]:
print_top_words(lda, list(vocab_df), 10)

Topic #0: property bourgeois french ideas production german abolition ancient national capital
Topic #1: action party socialism historical philosophical proletariat french society support everywhere
Topic #2: class bourgeois bourgeoisie society proletariat conditions social production modern political
Topic #3: conditions french revolution literature bourgeoisie german social demands appropriation nothing
Topic #4: bourgeoisie socialism power german form section country property existence socialist
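
Besides the top words per topic, a fitted model can also report how strongly each document belongs to each topic. A minimal sketch on a made-up 4x4 count matrix, assuming a scikit-learn release in which the parameter is named `n_components`; the variable names `X`, `lda_toy` and `doc_topics` are illustrative only:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy document-term matrix: 4 documents, 4 vocabulary words (made-up counts).
X = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 3, 2],
    [0, 0, 2, 3],
])

lda_toy = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda_toy.fit_transform(X)  # one row per document, one column per topic

print(doc_topics.shape)
print(doc_topics.sum(axis=1))     # each row is a distribution over topics
print(doc_topics.argmax(axis=1))  # index of the dominant topic per document
```

Applied to the notebook's data, `lda.transform(vectors)` would give the analogous table for the Manifesto paragraphs, which makes it possible to read the paragraphs that load most heavily on a given topic.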

