# LDA - Latent Dirichlet Allocation.

### LDA. Quick facts:
- A well-tested and well-known approach.
- Automatically extracts topics from documents.
- Can be applied directly to a new document.

### LDA. The way it works:
- Each document is seen as a mixture of various topics.
- Each word in the document is assigned to the topics with certain probabilities.
- Each topic is seen as a set of words with assigned probabilities.
- A person has to name the topics manually, but only once.
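
The generative view behind these points can be sketched in a few lines. This is a toy illustration using only the standard library, not the inference procedure that scikit-learn runs; the two topics, their word probabilities, and the helper names `sample_dirichlet` and `generate_document` are made up for the example.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

# Hypothetical topics: each topic is a probability distribution over words.
TOPICS = {
    "risk": {"risiko": 0.6, "kredit": 0.3, "bank": 0.1},
    "results": {"million": 0.5, "euro": 0.4, "bank": 0.1},
}

def generate_document(n_words, alpha, rng):
    """Pick a topic mixture for the document, then a topic and a word per position."""
    topic_names = list(TOPICS)
    mixture = sample_dirichlet(alpha, rng)  # document-level topic proportions
    words = []
    for _ in range(n_words):
        topic = rng.choices(topic_names, weights=mixture)[0]
        vocab = TOPICS[topic]
        word = rng.choices(list(vocab), weights=list(vocab.values()))[0]
        words.append(word)
    return mixture, words

rng = random.Random(0)
mixture, words = generate_document(10, alpha=[0.5, 0.5], rng=rng)
print(mixture, words)
```

Fitting LDA goes in the opposite direction: given only the words, it estimates the topic mixtures per document and the word distributions per topic.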

### Reference
- Blei, D.M., Ng, A.Y. and Jordan, M.I., 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), pp.993-1022.



In [31]:
import spacy
import json
import pandas as pd
import time
import numpy as np
from os import listdir
from os.path import isfile, join
from itertools import groupby

from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

import plotly as py
import plotly.graph_objs as go
py.offline.init_notebook_mode(connected=True)

from spacy.lang.de.stop_words import STOP_WORDS
nlp = spacy.load("de")

In [32]:
%run src/file_utils.py
%run src/configuration.py
%run 'load_and_prepro_document.ipynb'

## In this notebook, we will show two use cases of LDA:
1. [Find the distribution of topic "Risk Management" in one document of a given bank throughout paragraphs of this document.](#case_1)
2. [Show the fluctuation of topic "Risk Management" for different banks throughout several years.](#case_2)

### Collect the names of all bank-related reports:

In [33]:
banks = [f for f in listdir(FILE_PATH) if isfile(join(FILE_PATH, f)) and 'bank' in f.lower()]

### Load the cleaned bank-related documents

In [34]:
start_time = time.time()
lemm_docs_prep, names = get_clean_documents(banks,get_paragraph=True)

print ('Time to load paragraphs of {0:d} documents took {1:.2f} seconds'.format(
        len(names), 
        time.time() - start_time))

Time to load paragraphs of 176 documents took 0.37 seconds


In [35]:
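def read_page_and_par_Info(file_name):
    """Return page number and paragraph counter for every paragraph in a report file."""
    contents = []

    try:
        data = json.loads(FileUtils.fix_json(file_name))
        for item in data:
            typeDoc = item['type']
            if typeDoc == 'paragraph':
                contents.append({
                    'page':item['pagenumber'],
                    'paragraph':item['counter']
                })
    except Exception:
        # Best effort: on unreadable or malformed input, return what was collected so far
        pass
    return contents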

### Flatten the list of per-document paragraph lists into one list of paragraphs

In [36]:
lem_pars = []
for sublist in lemm_docs_prep:
    for item in sublist:
        lem_pars.append(item)

<a id='case_1'></a>
### Let us start with the first use case [1]:
- Train an LDA model on the paragraphs of the 176 bank documents loaded above and extract 9 topics.
- Input a new document.
- For each paragraph in this document, calculate its probability of belonging to each topic.
- Specify one topic, output the most related paragraphs, and group them by page.

In [37]:
tf_vectorizer = CountVectorizer()
start_time = time.time()
tf = tf_vectorizer.fit_transform(lem_pars)

print ('Fit and transform of CountVectorizer on {0:d} paragraphs took {1:.2f} seconds'.format(
        len(lem_pars), 
        time.time() - start_time))

Fit and transform of CountVectorizer on 91042 paragraphs took 2.54 seconds


In [38]:
lda = LatentDirichletAllocation(n_components=9,
                                learning_method='batch',
                                random_state=0)
start_time = time.time()
lda.fit(tf)

print ('Fit phase of LDA took {0:.2f} seconds'.format(time.time() - start_time))

Fit phase of LDA took 206.24 seconds


In [39]:
# Prints the given number of top words (with their probabilities) for each topic of an LDA model
def print_top_words(model, feature_names, n_top_words):
    matr = model.components_ / model.components_.sum(axis=1)[:, np.newaxis]
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        
        message += " ".join([str(feature_names[i]) + ": " + "{:.5f}".format(matr[topic_idx, i])
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
        print()
    print()
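
What `print_top_words` computes for a single topic can be sketched in plain Python. This is a simplified illustration, not the notebook's implementation: the helper name `top_words` and the toy weights are assumptions, and ties may be ordered differently than with `argsort`.

```python
def top_words(topic_weights, feature_names, n_top_words):
    """Return the n most heavily weighted words of one topic with normalized probabilities."""
    total = sum(topic_weights)  # normalize so the topic's word weights sum to 1
    ranked = sorted(range(len(topic_weights)),
                    key=lambda i: topic_weights[i],
                    reverse=True)
    return [(feature_names[i], topic_weights[i] / total)
            for i in ranked[:n_top_words]]

# Hypothetical pseudo-counts for one topic over a three-word vocabulary
example = top_words([2.0, 6.0, 2.0], ["bank", "risiko", "euro"], 2)
print(example)
```

The slice `[:-n_top_words - 1:-1]` in the notebook does the same job as `ranked[:n_top_words]` here: it walks the ascending `argsort` result backwards and keeps the last `n_top_words` indices.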

### List the topics found by LDA.

In [41]:
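# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() (available since 1.0) replaces it
tf_feature_names = tf_vectorizer.get_feature_names_out()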
print_top_words(lda, tf_feature_names, 10)

Topic #0: bank: 0.01543 kunde: 0.01322 segment: 0.00979 bereich: 0.00842 management: 0.00764 mitarbeiter: 0.00619 mein: 0.00586 geschäft: 0.00542 produkt: 0.00458 unternehmen: 0.00429

Topic #1: fair: 0.01497 value: 0.01424 beizulegenden: 0.01303 vermögenswerte: 0.01132 zeitwert: 0.01001 bewerten: 0.00998 bewertung: 0.00879 finanziell: 0.00871 verbindlichkeit: 0.00853 finanzinstrumente: 0.00783

Topic #2: million: 0.06423 mio: 0.03448 euro: 0.02996 vorjahr: 0.02245 höhe: 0.02056 ergebnis: 0.01952 ertrag: 0.01590 steuer: 0.01199 mrd: 0.01139 betragen: 0.01007

Topic #3: konzern: 0.01752 unternehmen: 0.01719 bank: 0.01617 deutschen: 0.00852 anteil: 0.00840 wesentlich: 0.00608 ifrs: 0.00596 verfahren: 0.00521 gesellschaft: 0.00518 konzernabschluss: 0.00494

Topic #4: euro: 0.02013 deutlich: 0.00996 prozent: 0.00918 hoch: 0.00835 entwicklung: 0.00823 weiterhin: 0.00787 erwarten: 0.00633 insbesondere: 0.00623 positiv: 0.00599 liegen: 0.00596

Topic #5: risiko: 0.01869 konzern: 0.00796 risk:

### Topic #5 resembles "Risk Management". Let us collect information about this topic in the Commerzbank annual report for 2016.

In [42]:
COMMERZBANK_FILE = 'Commerzbank-AnnualReport-2016.json'
#COMMERZBANK_FILE = 'Aareal-AnnualReport-2010.json'

In [43]:
commerz_paragraphs, names = get_clean_documents([COMMERZBANK_FILE],get_paragraph=True)

In [44]:
commerz_paragraphs_numbers = read_page_and_par_Info(FILE_PATH + COMMERZBANK_FILE)

In [45]:
tf_commerz = tf_vectorizer.transform(commerz_paragraphs[0])

start_time = time.time()
topic_model = lda.transform(tf_commerz)

print ('Transform phase of LDA took {0:.2f} seconds'.format(time.time() - start_time))

Transform phase of LDA took 0.17 seconds


### Collect information about risk management for all paragraphs of the document

In [46]:
bank_risk_management = []
# Iterate over the paragraph indices of the Commerzbank report
for paragraph_idx in range(len(commerz_paragraphs[0])):
    # Indices of the five most probable topics, most probable first
    most_probable = np.argsort(topic_model[paragraph_idx, :])[:-6:-1]

    cumulated = 0
    for topic in most_probable:
        probability = topic_model[paragraph_idx, topic]
        # Topic #5 is the one labeled "Risk Management"
        if int(topic) == 5:
            bank_risk_management.append({
                'paragraph': paragraph_idx,
                'value': str(probability)
            })
        cumulated = cumulated + probability
        # Stop once the inspected topics cover more than 95% of the probability mass
        if cumulated > 0.95:
            break
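
The selection rule used in this cell (inspect at most five topics, stop once they cover more than 95% of the probability mass) can be isolated into a small helper. The sketch below is plain Python under that reading of the loop; the function name `dominant_topics` and its parameters are illustrative, not part of the notebook.

```python
def dominant_topics(probabilities, top_k=5, mass=0.95):
    """Return (topic, probability) pairs, most probable first, until `mass` is covered."""
    ranked = sorted(enumerate(probabilities), key=lambda p: p[1], reverse=True)[:top_k]
    selected = []
    cumulative = 0.0
    for topic, prob in ranked:
        selected.append((topic, prob))
        cumulative += prob
        if cumulative > mass:
            break  # remaining topics contribute less than 5% in total
    return selected

# Hypothetical topic distribution of one paragraph
picked = dominant_topics([0.02, 0.9, 0.07, 0.01])
print(picked)
```

A consequence of this rule is that a paragraph contributes a "Risk Management" value only if that topic is among the ones inspected before the loop stops.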

### Add the page number and the paragraph position on that page for each paragraph of interest.

In [47]:
sorted_risk_management = sorted(bank_risk_management, key=lambda k: -1 * float(k['value']))
top_result = [paragraph for paragraph in sorted_risk_management if float(paragraph['value']) > 0.5]

value_sum = 0.0
top_result_page_number = []
for result in top_result:
    item = {}
    value_sum += float(result['value'])
    item['paragraph'] = result['paragraph']
    item['value'] = result['value']
    item['page'] = commerz_paragraphs_numbers[result['paragraph']]['page']
    item['page_par'] = commerz_paragraphs_numbers[result['paragraph']]['paragraph']
    top_result_page_number.append(item)
top_result_page_number = sorted(top_result_page_number, key=lambda k:int(k['page']) * 100 + int(k['page_par']))

### Group paragraphs by page.

In [48]:
page_topic_dict = dict()
for k, v in groupby(top_result_page_number, lambda x: x['page']):
    page_topic_dict[k] = list(v)

### For each page, use the highest paragraph probability on that page as the page value.

In [49]:
values = []
pages = []
texts = []
for k,v in page_topic_dict.items():
    # Values are stored as strings, so convert to float before taking the maximum
    value = max(float(item['value']) for item in v)
    values.append(value)
    pages.append(k)
    texts.append('Page: ' + str(k))
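
Grouping by page and taking the per-page maximum can be combined in one step. The sketch below uses made-up records with numeric values (the notebook stores them as strings) and a hypothetical helper `max_value_per_page`. It also makes explicit that `itertools.groupby` only merges adjacent items, so the input must be sorted by the grouping key first.

```python
from itertools import groupby

def max_value_per_page(records):
    """Map each page to the highest paragraph probability found on it."""
    by_page = sorted(records, key=lambda r: r["page"])  # groupby needs sorted input
    return {
        page: max(r["value"] for r in group)
        for page, group in groupby(by_page, key=lambda r: r["page"])
    }

# Hypothetical paragraph records, deliberately out of page order
records = [
    {"page": 3, "value": 0.7},
    {"page": 1, "value": 0.9},
    {"page": 3, "value": 0.8},
]
page_values = max_value_per_page(records)
print(page_values)
```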

### Plot topic distribution throughout pages.

In [50]:
trace = go.Bar(
    x=pages,
    y=values,
    text=texts
)

data = [trace]
layout = go.Layout(
    title='Commerzbank 2016 Report',
    xaxis=dict(
        title='Pages'
    ),
    yaxis=dict(
        title='Probability'
    )
)

fig = go.Figure(data=data, layout=layout)
py.offline.iplot(fig, filename='text-hover-bar')

[Link to this image](https://github.com/michaelborisov/text-analysis-lab/blob/master/img/commerzbank_report.png)

As a result, a person analyzing the document can use LDA to find the pages that cover a topic of interest without reading the whole document, even if the topic is not listed in the table of contents.

<a id='case_2'></a>
### Let us now move to the second use case [2]: topic fluctuation.

### Repeat the experiment, but instead of the paragraphs of one document, take the complete documents of different banks.

In [51]:
banks_documents, names = get_clean_documents(banks,get_paragraph=False)

In [52]:
tf_banks = tf_vectorizer.transform(banks_documents)
topic_model_banks = lda.transform(tf_banks)

In [53]:
#print()
#print(' Dominant topics per document ')
#print('------------------------------')

bank_risks = {}
for doc, document_name in enumerate(banks):
    if 'Annual' not in document_name:
        continue
    company_name = document_name[:document_name.find('-')]
    
    if company_name not in bank_risks:
        bank_risks[company_name] = []
        
    
    #print('\n{:40.40}: '.format(document_name), end ='')
    most_probable = np.argsort(topic_model_banks[doc, :])[:-6:-1]

    cumulated = 0
    for topic in most_probable:

        probability = topic_model_banks[doc, topic]
        # Topic #5 is the one labeled "Risk Management"
        if int(topic) == 5:
            # The four characters after the last '-' in the file name are the year
            year = document_name[document_name.rfind('-') + 1:document_name.rfind('-') + 5]
            bank_risks[company_name].append({
                'year': year,
                'value': str(probability)
            })
        #print('{:6.2%} {:3} '.format(probability, topic), end = '')
        cumulated = cumulated + probability
        if cumulated > 0.95: break
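
The company name and year are taken from the file name by position. A regular expression is one possible alternative that fails visibly on unexpected names. The sketch assumes names shaped like `Commerzbank-AnnualReport-2016.json`, as in the file name used earlier; the helper `parse_report_name` is hypothetical.

```python
import re

# Company: everything before the first '-'; year: four digits after the last '-'
REPORT_NAME = re.compile(r"^([^-]+)-.*-(\d{4})")

def parse_report_name(file_name):
    """Return (company, year) for a report file name, or None if it does not match."""
    match = REPORT_NAME.match(file_name)
    if match is None:
        return None
    return match.group(1), int(match.group(2))

print(parse_report_name("Commerzbank-AnnualReport-2016.json"))
```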

In [54]:
bank_risks['DeutscheBank']
#bank_risks['Commerzbank']

[{'year': '2015', 'value': '0.1975249094182455'},
 {'year': '2013', 'value': '0.23285458584931415'},
 {'year': '2010', 'value': '0.1318258658106828'},
 {'year': '2014', 'value': '0.2462806818210243'},
 {'year': '2016', 'value': '0.18816099096332956'},
 {'year': '2012', 'value': '0.20531318018187397'},
 {'year': '2011', 'value': '0.15644785693728225'}]

### Plot topic fluctuation throughout years for different banks.

In [55]:
years = [2011,2012, 2013, 2014, 2015, 2016]
values = []
traces = []
bank_of_interest = ['DeutscheBank', 'Commerzbank', 'BayerischeLandesbank', 'DzBank']
for key, value in bank_risks.items():
    if key not in bank_of_interest:
        continue
    newlist = sorted(bank_risks[key], key=lambda k: int(k['year']))
    for l in newlist:
        values.append(round(float(l['value']) * 100,2))
    if key != 'Commerzbank':
        values = values[1:]
    print (key)    
    print (values)
    trace = go.Scatter(
        x = years,
        y = values,
        mode = 'lines+markers',
        name = key
    )
    traces.append(trace)
    values = []
py.offline.iplot(traces, filename='scatter-mode.html')

BayerischeLandesbank
[24.02, 23.86, 27.72, 16.18, 16.89, 17.65]
DeutscheBank
[15.64, 20.53, 23.29, 24.63, 19.75, 18.82]
DzBank
[16.49, 15.87, 17.92, 20.98, 20.9, 21.07]
Commerzbank
[10.59, 12.22, 13.0, 12.81, 14.91, 15.07, 16.34]


[Link to this image](https://github.com/michaelborisov/text-analysis-lab/blob/master/img/topic_fluctuation.png)

Here we can see how the share of the "Risk Management" topic fluctuated in the annual reports of different banks over several years.
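
The plotting cell pairs every bank with one fixed `years` list, which only lines up if each bank has exactly one value per listed year. A possible alternative, sketched here without plotting, derives the x values from each bank's own records. The helper name `build_series` and the `first_year` parameter are assumptions; the sample values are three of the `DeutscheBank` entries shown above.

```python
def build_series(bank_risks, banks_of_interest, first_year=2011):
    """Return {bank: (years, percentages)} sorted by year, using each bank's own years."""
    series = {}
    for bank in banks_of_interest:
        records = sorted(bank_risks.get(bank, []), key=lambda r: int(r["year"]))
        points = [
            (int(r["year"]), round(float(r["value"]) * 100, 2))
            for r in records
            if int(r["year"]) >= first_year
        ]
        series[bank] = ([y for y, _ in points], [v for _, v in points])
    return series

sample = {
    "DeutscheBank": [
        {"year": "2010", "value": "0.1318258658106828"},
        {"year": "2012", "value": "0.20531318018187397"},
        {"year": "2011", "value": "0.15644785693728225"},
    ]
}
series = build_series(sample, ["DeutscheBank"])
print(series)
```

Each `(years, percentages)` pair can then be passed as `x` and `y` to `go.Scatter`, so a bank with a missing or extra year no longer shifts its curve.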

## Summary:
- LDA is quite complicated, but allows meaningful topic detection. 
- Topic labeling should be done manually, but only once per set of documents. 