<h1 style="background-color:#0071BD;color:white;text-align:center;padding-top:0.8em;padding-bottom: 0.8em">
  LDA on Returns
</h1>

This notebook illustrates the core ideas of Latent Dirichlet Allocation (LDA) on a minimal corpus. After working through it, you should understand the following:
  * A __corpus__ is a list of documents.
  * The __vocabulary__ is the union of the words in the documents that we consider relevant.
  * Each document is represented by the __word counts__ of the words in the vocabulary.
  * A __topic__ is a probability distribution over the vocabulary.
  * The __topic distribution__ of a document gives the share of each topic in that document.
  * Topic distribution times topics approximates the relative word frequencies of each document; scaled by the document length, it approximates the word counts.
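
The last point can be made concrete with a small NumPy sketch. All numbers below are made up by hand for illustration (they are not output of this notebook): two hypothetical topics over a four-word vocabulary and three hypothetical documents. The names `topic_distribution`, `document_lengths` and `approx_word_counts` are introduced here only for this example.

```python
import numpy as np

# Hypothetical toy values, chosen by hand for illustration only.
# Each row of `topics` is a probability distribution over 4 words.
topics = np.array([
    [0.5, 0.4, 0.1, 0.0],   # topic 0
    [0.0, 0.1, 0.4, 0.5],   # topic 1
])

# Each row gives the share of each topic in one document.
topic_distribution = np.array([
    [0.9, 0.1],
    [0.5, 0.5],
    [0.2, 0.8],
])

# Hypothetical number of words in each document.
document_lengths = np.array([5, 4, 6])

# (documents x topics) @ (topics x words) = (documents x words)
word_frequencies = topic_distribution @ topics

# Scale each row by the document length to get approximate word counts.
approx_word_counts = document_lengths[:, None] * word_frequencies

print(word_frequencies.sum(axis=1))   # each row is again a distribution
print(approx_word_counts.round(2))
```

Because every row of both factors sums to one, every row of the product does too, which is why it can be read as relative word frequencies.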
  
<p style="background-color:#66A5D1;padding-top:0.2em;padding-bottom: 0.2em" />

In [1]:
import re

import numpy as np

from sklearn.preprocessing import normalize
from sklearn.decomposition import LatentDirichletAllocation

In [2]:
# Comments in this notebook are meant as invitations to explore alternatives.
# On the first read, you should just ignore all the comments. On a second read
# you might want to add more sentences to the corpus (see cells below).
# So if this is your first read, you should start ignoring comments now.

# If you enlarge the corpus, you might want to enlarge the width of the notebook
# on the screen, to see the tables without line breaks. The two lines below make
# the cells as wide as possible:

# from IPython.display import display, HTML
# display(HTML("<style>.container { width:100% !important; }</style>"))

In [3]:
def print1d(template, values):
    for value in values: print(template.format(value), end = '')
    print()
    
def print2d(template, valuess, blank = '', threshold = None):
    for values in valuess:
        for value in values:
            print(template.format(value) if (threshold is None) or (value > threshold) else blank, end = '')
        print()

## Corpus definition and vocabulary creation

Below you find a few sentences that obviously cover two quite distinct topics. They share a common word that has two different meanings. We treat each sentence as a separate document. Let's see whether Latent Dirichlet Allocation is able to detect that we are looking at two different topics. Notice that the process is unsupervised, i.e. we never tell the algorithm which topic any document (sentence) covers. The only hint we give the algorithm is that it should look for exactly 2 topics.

In [4]:
corpus = [
    'Investment in Deutsche Bank yields low return.',
    'My investment may return nothing.', 
    'Federer’s return was good, his volley was not.',
    'Return volley, return volley; tennis is boring.',
    'Return on investment is on a ten year high.',
#    'Tennis is for Federer!',
#    'Deutsche Bank may be an investment bank.'
]

In [5]:
bags = []

for document in corpus:
    
    tokens = re.split('[ .!,;’]', document)
    bag    = [token.lower() for token in tokens if len(token) > 3]
    
#    stop_words = ['', 'in', 'my', 'may', 's', 'was', 'his', 'not', 'is', 'on', 'a', 'ten', 'for', 'be', 'an']
#    bag        = [token.lower() for token in tokens if token.lower() not in stop_words]

#    bag        = ['hi/lo' if word in {'high', 'low'} else word for word in bag]
    
    bags.append(bag)

print('WORDS PER DOCUMENT:')
print2d('{:12s}', bags)

WORDS PER DOCUMENT:
investment  deutsche    bank        yields      return      
investment  return      nothing     
federer     return      good        volley      
return      volley      return      volley      tennis      boring      
return      investment  year        high        


In [6]:
vocabulary = dict.fromkeys([word for bag in bags for word in bag])

# vocabulary = dict.fromkeys(['investment', 'return', 'federer', 'volley'])
# bags = [[word for word in bag if word in vocabulary] for bag in bags]

words = list(vocabulary)

print('COMBINED VOCABULARY OF ALL DOCUMENTS:')
print1d('{}  ', words)

COMBINED VOCABULARY OF ALL DOCUMENTS:
investment  deutsche  bank  yields  return  nothing  federer  good  volley  tennis  boring  year  high  


In [7]:
for key in vocabulary.keys(): vocabulary[key] = 0
word_counts = np.zeros((len(corpus), len(vocabulary)), dtype=int)

for d, bag in enumerate(bags):
    for w, word in enumerate(words):
        
        count = bag.count(word)
        
        vocabulary[word] += count
        word_counts[d, w] = count

LINE = '-' + len(vocabulary) * 9 * '-'
print('WORD COUNTS IN THE DOCUMENTS:'); print(LINE)
print1d('{:>9}', vocabulary.keys()             ); print(LINE)
print2d('{:9d}', word_counts,        9 * ' ', 0); print(LINE)
print1d('{:9d}', vocabulary.values()           )

WORD COUNTS IN THE DOCUMENTS:
----------------------------------------------------------------------------------------------------------------------
investment deutsche     bank   yields   return  nothing  federer     good   volley   tennis   boring     year     high
----------------------------------------------------------------------------------------------------------------------
        1        1        1        1        1                                                                        
        1                                   1        1                                                               
                                            1                 1        1        1                                    
                                            2                                   2        1        1                  
        1                                   1                                                              1        1
-----------------------
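
The counting step above can also be sketched with `collections.Counter` from the standard library. This is an alternative formulation, not the notebook's own code. The `bags` are copied from the output of the tokenization cell, and the names `counters`, `count_matrix` and `totals` are introduced here for illustration.

```python
from collections import Counter

# Bags of words, copied from the "WORDS PER DOCUMENT" output above.
bags = [
    ['investment', 'deutsche', 'bank', 'yields', 'return'],
    ['investment', 'return', 'nothing'],
    ['federer', 'return', 'good', 'volley'],
    ['return', 'volley', 'return', 'volley', 'tennis', 'boring'],
    ['return', 'investment', 'year', 'high'],
]

# dict.fromkeys keeps the order of first occurrence, as in the notebook.
words = list(dict.fromkeys(word for bag in bags for word in bag))

# One Counter per document; a Counter returns 0 for words it has not seen.
counters = [Counter(bag) for bag in bags]
count_matrix = [[counter[word] for word in words] for counter in counters]

# Total number of occurrences of each word in the whole corpus.
totals = Counter(word for bag in bags for word in bag)

for row in count_matrix:
    print(row)
print(totals.most_common(3))
```

The rows of `count_matrix` correspond to the rows of `word_counts`, with zeros where the table above shows blanks.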

In [8]:
n_topics = 2

lda = LatentDirichletAllocation(n_components = n_topics, learning_method='batch', max_iter=50, n_jobs = -1)

lda.fit(word_counts)

words_in_topics = normalize(lda.components_, norm='l1')

print1d('{:>9}',   vocabulary.keys()); print(LINE)
print2d('{:9.1f}', lda.components_  ); print(LINE)
print2d('{:9.1%}', words_in_topics  ); print(LINE)

investment deutsche     bank   yields   return  nothing  federer     good   volley   tennis   boring     year     high
----------------------------------------------------------------------------------------------------------------------
      2.4      0.5      0.5      0.5      5.5      1.5      1.5      1.5      3.5      1.5      1.5      1.5      1.5
      1.6      1.5      1.5      1.5      1.5      0.5      0.5      0.5      0.5      0.5      0.5      0.5      0.5
----------------------------------------------------------------------------------------------------------------------
    10.1%     2.2%     2.2%     2.2%    23.7%     6.3%     6.4%     6.4%    15.0%     6.4%     6.4%     6.4%     6.4%
    14.0%    12.7%    12.7%    12.7%    12.6%     4.5%     4.4%     4.4%     4.3%     4.3%     4.3%     4.4%     4.4%
----------------------------------------------------------------------------------------------------------------------
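
The rows of `lda.components_` are pseudo-counts, not probabilities, which is why the cell above normalizes them. For non-negative entries, L1 normalization is simply division by the row sum. The sketch below shows this with plain NumPy on the rounded values printed above; because the inputs are rounded to one decimal, the resulting percentages differ slightly from the printed ones.

```python
import numpy as np

# Rounded pseudo-counts copied from the two rows printed above.
components = np.array([
    [2.4, 0.5, 0.5, 0.5, 5.5, 1.5, 1.5, 1.5, 3.5, 1.5, 1.5, 1.5, 1.5],
    [1.6, 1.5, 1.5, 1.5, 1.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
])

# For non-negative entries, L1 normalization is division by the row sum.
words_in_topics = components / components.sum(axis=1, keepdims=True)

print(words_in_topics.sum(axis=1))     # every row is now a distribution
print(words_in_topics.argmax(axis=1))  # column index of the most probable word per topic
```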


In [9]:
topics_in_corpus = lda.transform(word_counts)

print1d('Topic{:2d}  ', range(n_topics)               )
print2d('{:7.0%}  ',    topics_in_corpus, 9 * ' ', 0.5)

Topic 0  Topic 1  
             89%  
    83%           
    89%           
    92%           
    87%           


In [10]:
words_in_corpus  = topics_in_corpus.dot(words_in_topics)
length_in_corpus = [len(bag) for bag in bags]
word_counts_in_corpus = np.diag(length_in_corpus).dot(words_in_corpus)

print1d('{:>9}',   vocabulary.keys()                    ); print(LINE)
print2d('{:9d}',   word_counts,           9 * ' ', 0    ); print(LINE)
print2d('{:9.1f}', word_counts_in_corpus, 9 * ' ', 0.334)

investment deutsche     bank   yields   return  nothing  federer     good   volley   tennis   boring     year     high
----------------------------------------------------------------------------------------------------------------------
        1        1        1        1        1                                                                        
        1                                   1        1                                                               
                                            1                 1        1        1                                    
                                            2                                   2        1        1                  
        1                                   1                                                              1        1
----------------------------------------------------------------------------------------------------------------------
      0.7      0.6      0.6      0.6      0.7        

In [11]:
def topic_description(words, probabilities):

    cumulated = 0
    description = ''
    
    for w in np.argsort(probabilities)[::-1]:

        probability = probabilities[w]
        description += words[w]  + ','
        
        if (cumulated < 1/3 <= cumulated + probability) or (cumulated < 4/5 <= cumulated + probability):
            description += '  '
        
        cumulated += probability
    
    return description.rstrip(' ').rstrip(',')

descriptions = []

for probabilities in words_in_topics:
    description = topic_description(words, probabilities)
    print(description)
    descriptions.append(description)

return,volley,  investment,boring,tennis,good,federer,high,  year,nothing,yields,bank,deutsche
investment,yields,bank,  deutsche,return,nothing,high,year,good,  federer,volley,boring,tennis
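
The double spaces in these descriptions mark the words at which the cumulative probability reaches 1/3 and 4/5. In the first line, for example, "return" (23.7%) and "volley" (15.0%) together exceed 1/3, so the first gap follows "volley". The same positions can be found with a cumulative sum; the sketch below uses made-up probabilities, and the names `head_end` and `body_end` are hypothetical.

```python
import numpy as np

# Hypothetical word probabilities of one topic, sorted in descending order.
sorted_probabilities = np.array([0.4, 0.3, 0.2, 0.1])

# Running total of the probabilities.
cumulative = np.cumsum(sorted_probabilities)

# Index of the first word at which the running total reaches each threshold.
head_end = int(np.searchsorted(cumulative, 1/3))
body_end = int(np.searchsorted(cumulative, 4/5))

print(head_end, body_end)
```

In `topic_description`, the gap is appended right after the word at each of these indices.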


In [12]:
for document, probabilities in zip(corpus, topics_in_corpus):

    print('\n"{}"'.format(document))
    
    for probability, description in zip(probabilities, descriptions):
        print('{} {:.0%} {:}'.format('X ' if probability > 0.5 else '- ', probability, description))


"Investment in Deutsche Bank yields low return."
-  11% return,volley,  investment,boring,tennis,good,federer,high,  year,nothing,yields,bank,deutsche
X  89% investment,yields,bank,  deutsche,return,nothing,high,year,good,  federer,volley,boring,tennis

"My investment may return nothing."
X  83% return,volley,  investment,boring,tennis,good,federer,high,  year,nothing,yields,bank,deutsche
-  17% investment,yields,bank,  deutsche,return,nothing,high,year,good,  federer,volley,boring,tennis

"Federer’s return was good, his volley was not."
X  89% return,volley,  investment,boring,tennis,good,federer,high,  year,nothing,yields,bank,deutsche
-  11% investment,yields,bank,  deutsche,return,nothing,high,year,good,  federer,volley,boring,tennis

"Return volley, return volley; tennis is boring."
X  92% return,volley,  investment,boring,tennis,good,federer,high,  year,nothing,yields,bank,deutsche
-  8% investment,yields,bank,  deutsche,return,nothing,high,year,good,  federer,volley,boring,tennis


## Lattice of the "Sentence uses word" relation

Did you wonder whether we actually need a probabilistic approach to distinguish these few documents? If so, you were right to be suspicious. The lattice below shows which document contains which word. (A document contains a word if you can reach the word from the document by following lines upwards.) As you can see, we could simply ignore "return", since every document contains it. The presence of "investment" or "volley" then separates the corpus into two groups. The remaining words are specific to individual documents.

<img src='images/lda-on-returns-word-use-in-5-sentences.PNG' style='width:60%'/>
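
The set-based argument can be sketched in a few lines of plain Python. The `bags` are copied from the tokenization output earlier in the notebook; the labels `'finance'` and `'tennis'` are names chosen here for illustration and do not appear in the original analysis.

```python
# Bags of words, copied from the "WORDS PER DOCUMENT" output.
bags = [
    ['investment', 'deutsche', 'bank', 'yields', 'return'],
    ['investment', 'return', 'nothing'],
    ['federer', 'return', 'good', 'volley'],
    ['return', 'volley', 'return', 'volley', 'tennis', 'boring'],
    ['return', 'investment', 'year', 'high'],
]

word_sets = [set(bag) for bag in bags]

# Words contained in every document cannot help to tell documents apart.
common = set.intersection(*word_sets)

labels = []
for words_in_doc in word_sets:
    if 'investment' in words_in_doc:
        labels.append('finance')   # hypothetical label
    elif 'volley' in words_in_doc:
        labels.append('tennis')    # hypothetical label
    else:
        labels.append('unknown')

print(common)
for label, words_in_doc in zip(labels, word_sets):
    print(label, sorted(words_in_doc - common))
```

For this small corpus, two marker words are enough to separate the documents, so no probabilistic model is needed.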

So it is time to enlarge the corpus a bit. Scroll back to the corpus definition and uncomment the two additional sentences. The lattice below shows that an analysis based on set theory then becomes harder.

## Lattice of the "Sentence uses word" relation, given two more sentences
<img src='images/lda-on-returns-word-use-in-7-sentences.PNG' style='width:60%'/>

<table style="width:100%">
  <tr>
      <td colspan="1" style="text-align:left;background-color:#0071BD;color:white">
        <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/">
            <img alt="Creative Commons License" style="border-width:0;float:left;padding-right:10pt"
                 src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png" />
        </a>
        &copy; D. Speicher<br/>
        Licensed under a 
        <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/" style="color:white">
            CC BY-NC 4.0
        </a>.
      </td>
      <td colspan="2" style="text-align:left;background-color:#66A5D1">
          <b>Acknowledgments:</b>
          This material was prepared within the project
          <a href="http://www.b-it-center.de/b-it-programmes/teaching-material/p3ml/" style="color:black">
              P3ML
          </a> 
          which is funded by the Ministry of Education and Research of Germany (BMBF)
          under grant number 01/S17064. The authors gratefully acknowledge this support.
      </td>
  </tr>
</table>