# Vectors and Matrices: Basic Text Mining Concepts

## Bonus: A (Very) Advanced Example: Topic Modeling

This example is a sneak peek at the more advanced features of data mining. We will try to find topics in Shakespeare's sonnets. Many of the features shown below will be covered in more detail later in this course.

In [None]:
import requests
#text = requests.get('http://www.gutenberg.org/files/1081/1081-0.txt').text.lower()
text = requests.get('http://www.gutenberg.org/cache/epub/1041/pg1041.txt').text.lower()
print(text[:100])

We study the book in units that roughly correspond to paragraphs:

In [None]:
text[1000:1500]

In [None]:
# we split by two hard returns
poems = text.split('\r\n\r\n')
print(len(poems))
print(poems[301])
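To see why splitting on two consecutive line breaks yields paragraph-like chunks, here is a minimal sketch on a made-up string. The `sample` text is an illustration only, and it assumes Windows-style `\r\n` line endings, as in the split above.

```python
# A made-up three-paragraph text with Windows-style line endings (assumption).
sample = "first line\r\nsecond line\r\n\r\nnew paragraph\r\n\r\nlast one"

# A blank line between paragraphs shows up as two consecutive '\r\n'.
chunks = sample.split('\r\n\r\n')

print(len(chunks))
for chunk in chunks:
    print(repr(chunk))  # repr() makes the remaining '\r\n' inside a chunk visible
```

Single line breaks inside a paragraph survive the split; only the blank lines act as separators.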

In [None]:
# sonnets start here
print(poems[17])

In [None]:
# sonnets end here
print(poems[-55])

Then we properly tokenize each sonnet, i.e. separate words from punctuation (for more details, see the next lecture).

In [None]:
from nltk.tokenize import word_tokenize
sentence = 'This example, will be: properly tokenized!'
tokens = word_tokenize(sentence)
print(tokens)

In the code below we do several things:
    - Tokenize each sonnet
    - Retain only alphabetic tokens
    - Keep the sonnets with at least 5 alphabetic tokens and store them in the variable `long_poems`

In [None]:
# code is rather hacky, but hey it works!
long_poems = []

for poem in poems[17:-55]: # ignore the gutenberg appendices
    
    tokens = word_tokenize(poem) # separate words from punctuation

    alpha_tokens = [] # here we will store all alphabetic items
        
    for token in tokens: # iterate over the tokenized version of the sonnet
        if token.isalpha(): # if the token is alphabetic
            alpha_tokens.append(token) # append it to alpha_tokens
        
    if len(alpha_tokens) >= 5: # if we have at least 5 alphabetic tokens
        long_poems.append(alpha_tokens) # add the sonnet to long poems
            
print('We selected', len(long_poems), 'poems of the total of', len(poems)) # print the result

In [None]:
# demonstrate what happens in the loop 
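One way to see what a single pass through the loop does is to run the same steps on one short text. This is a minimal sketch, not the notebook's own demonstration: `simple_tokenize` is a hypothetical regex-based stand-in for `word_tokenize`, and the sample line is made up.

```python
import re

# Stand-in tokenizer (assumption): words and punctuation marks become separate tokens.
def simple_tokenize(text):
    return re.findall(r"\w+|[^\w\s]", text)

# A short made-up line, used only for illustration.
sample_poem = "roses are red, violets are blue; sugar is sweet!"

sample_tokens = simple_tokenize(sample_poem)
print(sample_tokens)

# Keep only purely alphabetic tokens, which drops the punctuation.
sample_alpha = [token for token in sample_tokens if token.isalpha()]
print(sample_alpha)

# The text is kept only if at least 5 alphabetic tokens remain.
print(len(sample_alpha) >= 5)
```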

To feed our data to a topic modeling algorithm, we have to build a [document-term matrix](https://en.wikipedia.org/wiki/Document-term_matrix). For this we need a list of all types (distinct words), often called a vocabulary.
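As a rough illustration of the idea, the sketch below builds a vocabulary and a binary document-term matrix for two tiny, made-up documents. All names (`toy_docs`, `toy_vocab`, `toy_matrix`) are hypothetical and chosen for this example only.

```python
# Two made-up, already tokenized documents (illustration only).
toy_docs = [
    ["love", "is", "blind"],
    ["time", "is", "short", "love"],
]

# The vocabulary: every distinct word (type), sorted to get a fixed column order.
toy_vocab = sorted(set(word for doc in toy_docs for word in doc))
print(toy_vocab)

# One row per document, one column per vocabulary word: 1 if present, else 0.
toy_matrix = [[1 if word in doc else 0 for word in toy_vocab] for doc in toy_docs]
for row in toy_matrix:
    print(row)
```

Sorting the vocabulary is a design choice made here so that the column order is reproducible; a plain `set` has no guaranteed order.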

In [None]:
# tokenize the whole text at once and keep only the distinct tokens
all_tokens = set(word_tokenize(text))
len(all_tokens)

`set()` transforms a list into an unordered set and thereby removes all duplicates, as in the example below.

In [None]:
l = [1,1,1,2,3,4,4]
print(l)
s = set(l)
print(s)
l = list(s)
print(l)

In [None]:
# same as
list(set(l))

In [None]:
# print the first hundred items of list(all_tokens)
print(list(all_tokens)[:100])

Function words are often discarded. This can be easily done using a membership condition.

In [None]:
from nltk.corpus import stopwords
stopw = stopwords.words('english') # get a list of stop words from an external library
print(stopw)

In [None]:
vocab = []
for w in all_tokens: # iterate over all the tokens in all_tokens
    if w.isalpha() and w not in stopw: # if an item is alphabetic and not a stop word
        vocab.append(w)
print(len(vocab))

`vocab` now contains all alphabetic words that are not stop words.
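A membership test against a list scans the list item by item, whereas a test against a set is a hash lookup on average. The sketch below shows the same filter with the stop words converted to a set first. The stop word list and tokens are made up for illustration; in the notebook the real list comes from NLTK.

```python
# A tiny hand-made stop word list (assumption; not the NLTK list).
toy_stopwords = ["the", "and", "of", "to"]
toy_tokens = {"the", "rose", "and", "thorn", "42", "of"}

# Converting the list to a set turns each `in` test into a hash lookup.
stop_set = set(toy_stopwords)

# Same condition as above: alphabetic and not a stop word.
toy_result = sorted(w for w in toy_tokens if w.isalpha() and w not in stop_set)
print(toy_result)
```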

Now we will transform each sonnet into a **binary vector**: a list in which each value indicates whether a vocabulary word appears (1) or not (0):

In [None]:
# Let's take an arbitrary example
example = long_poems[30]
print(example)

In [None]:
vector = []
for w in vocab:
    if w in example:
        vector.append(1)
    else:
        vector.append(0)

len(vector) == len(vocab)

In [None]:
print(vector[:100])

Each row in the document-term matrix is a vector representing one sonnet.

Now we transform our corpus into a document-term matrix: a matrix in which the rows represent sonnets and the columns indicate the presence of a word.

In [None]:
vectors = []
for tokens in long_poems:
    vector = []
    set_tokens = set(tokens)
    for w in vocab:
        if w in set_tokens:
            vector.append(1)
        else:
            vector.append(0)
    vectors.append(vector)
            
print(len(vectors) == len(long_poems))
print(len(vectors[0]) == len(vocab))
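Because rows stand for documents and columns for words, summing along each direction answers two different questions. The following sketch uses a small made-up matrix (not the sonnet data) to show both sums in plain Python.

```python
# A made-up 3 x 4 binary document-term matrix (illustration only).
demo_vocab = ["love", "time", "beauty", "death"]
demo_vectors = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
]

# Row sums: how many distinct vocabulary words each document contains.
words_per_doc = [sum(row) for row in demo_vectors]

# Column sums: in how many documents each word occurs (document frequency).
# zip(*demo_vectors) transposes the matrix, so each item is one column.
doc_freq = {word: sum(col) for word, col in zip(demo_vocab, zip(*demo_vectors))}

print(words_per_doc)
print(doc_freq)
```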

Now we can train a topic model on the document-term matrix.

In [None]:
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=10, max_iter=10, # called n_topics in older scikit-learn versions
                                learning_method='online',
                                random_state=0,
                                verbose=1,
                                n_jobs=1)
lda.fit(vectors)

And print the different topics:

In [None]:
# Example taken from: http://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()
    
print_top_words(lda, list(vocab), 10)
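The slice `[:-n_top_words - 1:-1]` in `print_top_words` can look cryptic. The sketch below reproduces the idea in plain Python on made-up topic weights: sort the indices by ascending weight, then take the last few in reverse order. The names `toy_words`, `toy_topic` and `n_top` are hypothetical.

```python
# Made-up weights of one topic over a five-word vocabulary (illustration only).
toy_words = ["love", "time", "beauty", "death", "summer"]
toy_topic = [0.1, 2.5, 0.7, 1.9, 0.3]
n_top = 3

# Indices sorted by ascending weight, which is what argsort() returns.
ascending = sorted(range(len(toy_topic)), key=lambda i: toy_topic[i])

# Same slice as in print_top_words: the last n_top indices, in reverse order.
top_indices = ascending[:-n_top - 1:-1]

print(top_indices)
print([toy_words[i] for i in top_indices])
```

The result lists the highest-weighted words first, which is the order in which each topic is printed.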

## Exercises DIY: Loops, Conditions and Collections

Take a list, say for example this one:

  `a = [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]`

and write a program that prints out all the elements of the list that are less than 5.

Extras:

- Instead of printing the elements one by one, make a new list that has all the elements less than 5 from this list in it and print out this new list.

- Ex. 6: Write code that multiplies all items in the list 
    
`a = [1,2,3,4,5]`

In [None]:
# check: this evaluates to True once `result` holds your answer
1*2*3*4*5 == result

- Ex. 7: Write a Python program to get the smallest number from a list of integers. 
    
`a = [6,9,4,2,7,8,9]`

- Ex. 8: Write a Python program to get the highest number from a list of integers. 
    
`a = [6,9,4,2,7,8,9]`

- Ex. 9: Write a Python program that prints the longest word in the following sentence:

`sentence = "Zunächst ein etwas abgenutzter Koffer aus weißem Leder, dem man es ansah, daß er nicht zum erstenmal eine Reise machte."`