# LDA - Latent Dirichlet Allocation.

### LDA. Quick facts:
- A well-tested and well-known approach.
- Automatically extracts topics from documents.
- Can be applied directly to a new document.

### LDA. The way it works:
- Each document is seen as a mixture of various topics.
- Each word in the document is assigned to topics with certain probabilities.
- Each topic is seen as a set of words with assigned probabilities.
- Topic names must be assigned manually by a person, but only once.
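
The points above can be illustrated with a minimal sketch on a toy corpus. The four documents and all `toy_*` names are hypothetical and serve only as an illustration; the scikit-learn classes are the same ones used in the rest of this notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy corpus, used only for illustration
toy_docs = [
    "risiko kapital steuerung risiko",
    "kunde bank segment kunde",
    "risiko steuerung kapital bank",
    "kunde segment bank mitarbeiter",
]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
# Each row is one document's mixture over the 2 topics
doc_topic = toy_lda.fit_transform(toy_counts)

# Normalising components_ row-wise gives per-topic word probabilities
topic_word = toy_lda.components_ / toy_lda.components_.sum(axis=1)[:, np.newaxis]

# A fitted model can be applied directly to an unseen document
new_doc_topic = toy_lda.transform(toy_vectorizer.transform(["risiko kapital bank"]))

print(doc_topic.round(2))
print(new_doc_topic.round(2))
```

Each row of `doc_topic` and of `topic_word` sums to 1. The topics themselves carry no names; which index corresponds to which theme has to be read off the top words.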


In [2]:
import spacy
import json
import pandas as pd
import time
import numpy as np

from sklearn.cluster import KMeans
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

import plotly as py
import plotly.graph_objs as go
py.offline.init_notebook_mode(connected=True)

from os import listdir
from os.path import isfile, join
from itertools import groupby

from spacy.lang.de.stop_words import STOP_WORDS
nlp = spacy.load("de")

In [3]:
%run src/file_utils.py
%run src/configuration.py
%run 'load_and_prepro_document.ipynb'

## In this notebook, we will show two use cases of LDA:
1. Find the distribution of the topic "Risk Management" across the paragraphs of one document of a given bank. [1]
2. Show how the share of the topic "Risk Management" fluctuates for different banks over several years. [2]

### Collect the names of all reports related to banks:

In [4]:
banks = [f for f in listdir(FILE_PATH) if isfile(join(FILE_PATH, f)) and 'bank' in f.lower()]

### Load the cleaned documents related to banks

In [5]:
start_time = time.time()
lemm_docs_prep, names = get_clean_data(banks,get_paragraph=True)

print ('Loading paragraphs of {0:d} documents took {1:.2f} seconds'.format(
        len(names), 
        time.time() - start_time))

Loading paragraphs of 176 documents took 0.38 seconds


In [6]:
def readPageAndParInfo(file_name):
    contents = []
   
    try:
        data = json.loads(FileUtils.fix_json(file_name))
        for item in data:
            typeDoc = item[TYPE]
            if typeDoc == PARAGRAPH:
                contents.append({
                    'page':item['pagenumber'],
                    'paragraph':item['counter']
                })
    except Exception:
        # Unreadable or malformed files yield an empty list
        pass
    return contents

### List of lists of paragraphs -> list of paragraphs

In [7]:
lem_pars = []
for sublist in lemm_docs_prep:
    for item in sublist:
        lem_pars.append(item)
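
The same flattening can be written with `itertools.chain.from_iterable`. This is a minimal sketch; `example_docs` is a hypothetical stand-in for `lemm_docs_prep`.

```python
from itertools import chain

# Hypothetical stand-in for lemm_docs_prep: one list of paragraphs per document
example_docs = [["par a1", "par a2"], ["par b1"], []]

# chain.from_iterable concatenates the inner lists in their original order
flat_pars = list(chain.from_iterable(example_docs))
print(flat_pars)
```

Documents without paragraphs contribute nothing to the flat list, and the paragraph order within each document is preserved.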

### Let us start with the first use case [1]:
- Train an LDA model on the paragraphs of the bank documents and extract 9 topics.
- Input a new document.
- For each paragraph in this document, calculate its probability of belonging to each topic. 
- Specify one topic, output the most related paragraphs, and group them by page.

In [8]:
tf_vectorizer = CountVectorizer()
start_time = time.time()
tf = tf_vectorizer.fit_transform(lem_pars)

print ('Fit and transform of CountVectorizer on {0:d} paragraphs took {1:.2f} seconds'.format(
        len(lem_pars), 
        time.time() - start_time))

Fit and transform of CountVectorizer on 91042 paragraphs took 2.54 seconds


In [9]:
lda = LatentDirichletAllocation(n_components=9,
                                learning_method='batch',
                                random_state=0)
start_time = time.time()
lda.fit(tf)

print ('Fit phase of LDA took {0:.2f} seconds'.format(time.time() - start_time))

Fit phase of LDA took 252.48 seconds


In [10]:
# Prints the given number of top words for each topic of an LDA model
def print_top_words(model, feature_names, n_top_words):
    matr = model.components_ / model.components_.sum(axis=1)[:, np.newaxis]
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        
        message += " ".join([str(feature_names[i]) + ": " + "{:.5f}".format(matr[topic_idx, i])
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
        print()
    print()
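
The two less obvious steps in `print_top_words`, the row-wise normalisation and the reversed `argsort` slice, can be checked on a small example. This is only a sketch: the `components` matrix and the four-word vocabulary are hypothetical values, not output of the model trained above.

```python
import numpy as np

# Hypothetical pseudo-counts for 2 topics over a 4-word vocabulary
components = np.array([[6.0, 1.0, 2.0, 1.0],
                       [1.0, 3.0, 1.0, 5.0]])
vocab = ["risiko", "kunde", "kapital", "bank"]

# Divide each row by its sum so that every topic becomes a probability distribution
probs = components / components.sum(axis=1)[:, np.newaxis]

n_top_words = 2
top_words = []
for topic_idx, topic in enumerate(components):
    # argsort sorts ascending; [:-n_top_words - 1:-1] reads the last n entries backwards
    top = topic.argsort()[:-n_top_words - 1:-1]
    top_words.append([vocab[i] for i in top])
    print(topic_idx, [(vocab[i], round(float(probs[topic_idx, i]), 2)) for i in top])
```

Because the ranking uses the raw `components` rows and the printed numbers use the normalised rows, the order of words is the same in both; normalisation only rescales each topic.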

### List the topics retrieved by LDA.

In [11]:
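# On scikit-learn >= 1.0, use get_feature_names_out() instead (get_feature_names() was removed in 1.2)
tf_feature_names = tf_vectorizer.get_feature_names()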
print_top_words(lda, tf_feature_names, 10)

Topic #0: euro: 0.03303 milliarde: 0.01997 deutlich: 0.00992 prozent: 0.00923 ergebnis: 0.00920 hoch: 0.00916 vorjahr: 0.00906 million: 0.00784 liegen: 0.00776 positiv: 0.00718

Topic #1: mio: 0.04605 mrd: 0.03011 ertrag: 0.01223 höhe: 0.01143 aktie: 0.00991 quartal: 0.00960 dezember: 0.00955 betragen: 0.00945 aufwendung: 0.00909 anstieg: 0.00748

Topic #2: risiko: 0.02038 konzern: 0.00869 intern: 0.00769 wesentlich: 0.00667 rahmen: 0.00654 basis: 0.00587 steuerung: 0.00556 kapital: 0.00522 risk: 0.00511 erfolgen: 0.00485

Topic #3: bank: 0.01819 kunde: 0.00927 segment: 0.00715 bereich: 0.00683 mein: 0.00618 deutschen: 0.00533 mitarbeiter: 0.00526 konzern: 0.00510 management: 0.00465 unternehmen: 0.00407

Topic #4: milliarde: 0.01189 kunde: 0.00722 geschäft: 0.00619 markt: 0.00595 position: 0.00525 transaktion: 0.00503 dezember: 0.00491 segment: 0.00486 helaba: 0.00481 finance: 0.00466

Topic #5: million: 0.07593 euro: 0.02452 höhe: 0.01731 eur: 0.01609 vorjahr: 0.01517 verlust: 0.0099

### Topic #2 resembles "Risk Management". Let us collect information about this topic in the annual report of Commerzbank for 2016

In [12]:
COMMERZBANK_FILE = 'Commerzbank-AnnualReport-2016.json'

In [13]:
commerz_paragraphs, names = get_clean_data([COMMERZBANK_FILE],get_paragraph=True)

In [14]:
commerz_paragraphs_numbers = readPageAndParInfo(FILE_PATH + COMMERZBANK_FILE)

In [15]:
tf_commerz = tf_vectorizer.transform(commerz_paragraphs[0])

start_time = time.time()
topic_model = lda.transform(tf_commerz)

print ('Transform phase of LDA took {0:.2f} seconds'.format(time.time() - start_time))

Transform phase of LDA took 0.21 seconds
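
The new document is vectorised with `transform`, not `fit_transform`, so it is mapped onto the vocabulary learned from the training paragraphs. A minimal sketch of what that implies, with hypothetical paragraphs and variable names:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training paragraphs and one unseen paragraph
train_pars = ["risiko kapital steuerung", "kunde bank segment"]
new_pars = ["risiko bank unbekannteswort"]

vec = CountVectorizer()
vec.fit(train_pars)

# transform keeps the fitted vocabulary; words never seen during fit are ignored
new_counts = vec.transform(new_pars)

print(new_counts.shape)
print(int(new_counts.sum()))
```

The column count equals the size of the training vocabulary, which is what the fitted LDA model expects as input. Words that occur only in the new document do not influence its topic distribution.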


### Collect information about risk management for all paragraphs of the document

In [16]:
bank_risk_management = []
for par_idx in range(len(commerz_paragraphs[0])):
    # Indices of the five most probable topics, most probable first
    most_probable = np.argsort(topic_model[par_idx, :])[:-6:-1]

    cumulated = 0
    for topic in most_probable:

        probability = topic_model[par_idx, topic]
        if int(topic) == 2:  # topic #2 resembles "Risk Management"
            bank_risk_management.append({
                'paragraph': par_idx,
                'value': str(probability)
            })
        cumulated = cumulated + probability
        # Stop once the visited topics cover more than 95% of the probability mass
        if cumulated > 0.95: break
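
The loop above visits at most the five most probable topics of a paragraph and stops once they cover more than 95% of the probability mass. A small sketch of this cutoff on one hypothetical topic distribution (the numbers are made up for illustration):

```python
import numpy as np

# Hypothetical topic distribution of a single paragraph (sums to 1)
paragraph_topics = np.array([0.02, 0.01, 0.70, 0.26, 0.01])

# Up to five most probable topics, most probable first
most_probable = np.argsort(paragraph_topics)[:-6:-1]

kept = []
cumulated = 0.0
for topic in most_probable:
    kept.append(int(topic))
    cumulated += paragraph_topics[topic]
    # Stop as soon as the kept topics cover more than 95% of the mass
    if cumulated > 0.95:
        break

print(kept)
```

One consequence of this design is that a paragraph is recorded for the topic of interest only if that topic is visited before the cutoff is reached; paragraphs where it sits in the low-probability tail are skipped.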

### Add extra information about the page that contains each paragraph of interest.

In [17]:
sorted_risk_management = sorted(bank_risk_management, key=lambda k: -1 * float(k['value']))
top_result = [paragraph for paragraph in sorted_risk_management if float(paragraph['value']) > 0.5]

value_sum = 0.0
top_result_page_number = []
for result in top_result:
    item = {}
    value_sum += float(result['value'])
    item['paragraph'] = result['paragraph']
    item['value'] = result['value']
    item['page'] = commerz_paragraphs_numbers[result['paragraph']]['page']
    item['page_par'] = commerz_paragraphs_numbers[result['paragraph']]['paragraph']
    top_result_page_number.append(item)
top_result_page_number = sorted(top_result_page_number, key=lambda k:int(k['page']) * 100 + int(k['page_par']))

### Group paragraphs by page.

In [18]:
page_topic_dict = dict()
for k, v in groupby(top_result_page_number, lambda x: x['page']):
    page_topic_dict[k] = list(v)

### Assign to each page the maximum value among its paragraphs.

In [19]:
values = []
pages = []
texts = []
for k,v in page_topic_dict.items():
    # Values are stored as strings, so convert to float before taking the maximum
    value = max(float(item['value']) for item in v)
    values.append(value)
    pages.append(k)
    texts.append('Page: ' + str(k))
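
Grouping by page and taking the per-page maximum can be condensed into one step. The sketch below uses hypothetical records in the same shape as `top_result_page_number`. Note that `itertools.groupby` only merges adjacent items, which is why the list has to be sorted by page first.

```python
from itertools import groupby

# Hypothetical records in the same shape as top_result_page_number
records = [
    {'page': 12, 'page_par': 3, 'value': '0.81'},
    {'page': 7, 'page_par': 1, 'value': '0.64'},
    {'page': 12, 'page_par': 1, 'value': '0.92'},
]

# groupby only merges adjacent items, so sort by the grouping key first
records = sorted(records, key=lambda r: (r['page'], r['page_par']))

# Keep the highest probability per page, comparing as floats rather than strings
page_max = {page: max(float(r['value']) for r in group)
            for page, group in groupby(records, key=lambda r: r['page'])}
print(page_max)
```

The tuple sort key is an alternative to `page * 100 + page_par`; it gives the same order without assuming that a page has fewer than 100 paragraphs.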

### Plot topic distribution throughout pages.

In [20]:
trace = go.Bar(
    x=pages,
    y=values,
    text=texts
)

data = [trace]
layout = go.Layout(
    title='Commerzbank 2016 Report',
    xaxis=dict(
        title='Pages'
    ),
    yaxis=dict(
        title='Probability'
    )
)

fig = go.Figure(data=data, layout=layout)
py.offline.iplot(fig, filename='text-hover-bar')

As a result, a person who has to analyze a document can use LDA to find the pages that cover a topic of interest without reading the whole document, even if the topic is not listed in the table of contents.

# Let us now move to the second use case [2]: Topic fluctuation.

### Repeat the experiment, but instead of the paragraphs of one document, take complete documents of different banks.

In [21]:
banks_documents, names = get_clean_data(banks,get_paragraph=False)

In [22]:
tf_banks = tf_vectorizer.transform(banks_documents)
topic_model_banks = lda.transform(tf_banks)

In [23]:
#print()
#print(' Dominant topics per document ')
#print('------------------------------')

bank_risks = {}
for doc, document_name in enumerate(banks):
    if 'Annual' not in document_name:
        continue
    company_name = document_name[:document_name.find('-')]
    
    if company_name not in bank_risks:
        bank_risks[company_name] = []
        
    
    #print('\n{:40.40}: '.format(document_name), end ='')
    most_probable = np.argsort(topic_model_banks[doc, :])[:-6:-1]

    cumulated = 0
    for topic in most_probable:

        probability = topic_model_banks[doc, topic]
        if int(topic) == 2:  # topic #2 resembles "Risk Management"
            # The year is the four characters after the last hyphen in the file name
            year = document_name[document_name.rfind('-') + 1:document_name.rfind('-') + 5]
            bank_risks[company_name].append({
                'year': year,
                'value': str(probability)
            })
        #print('{:6.2%} {:3} '.format(probability, topic), end = '')
        cumulated = cumulated + probability
        if cumulated > 0.95: break
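
The company name and the year are both derived from the file name by slicing around hyphens. A minimal sketch of that parsing, using the Commerzbank file name from earlier in this notebook; the helper name `parse_report_name` is hypothetical.

```python
# Hypothetical helper that mirrors the slicing used in the loop above
def parse_report_name(document_name):
    # The company is everything before the first hyphen
    company = document_name[:document_name.find('-')]
    # The year is the four characters after the last hyphen
    last = document_name.rfind('-')
    year = document_name[last + 1:last + 5]
    return company, year

parsed = parse_report_name('Commerzbank-AnnualReport-2016.json')
print(parsed)
```

This relies on the naming pattern `<Company>-...-<Year>.json`; a file name that does not end with a hyphen followed by a four-digit year would produce a wrong `year` string without raising an error.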

In [24]:
bank_risks['DeutscheBank']

[{'year': '2015', 'value': '0.16415967578501264'},
 {'year': '2013', 'value': '0.19845716278419687'},
 {'year': '2010', 'value': '0.11417129827446894'},
 {'year': '2014', 'value': '0.20251953613237136'},
 {'year': '2016', 'value': '0.16164756455642815'},
 {'year': '2012', 'value': '0.17401214059577957'},
 {'year': '2011', 'value': '0.12702646388376182'}]

### Plot topic fluctuation throughout years for different banks.

In [28]:
years = [2011, 2012, 2013, 2014, 2015, 2016]
traces = []
bank_of_interest = ['DeutscheBank', 'Commerzbank', 'BayerischeLandesbank', 'DzBank']
for key, entries in bank_risks.items():
    if key not in bank_of_interest:
        continue
    # Keep only the years to be plotted, in chronological order
    selected = sorted((e for e in entries if int(e['year']) in years),
                      key=lambda e: int(e['year']))
    # Express the topic share as a percentage
    values = [round(float(e['value']) * 100, 2) for e in selected]
    print (key)    
    print (values)
    trace = go.Scatter(
        x = [int(e['year']) for e in selected],
        y = values,
        mode = 'lines+markers',
        name = key
    )
    traces.append(trace)
py.offline.iplot(traces, filename='scatter-mode.html')

BayerischeLandesbank
[25.99, 25.18, 28.49, 17.39, 17.33, 17.89]
DeutscheBank
[12.7, 17.4, 19.85, 20.25, 16.42, 16.16]
DzBank
[17.09, 16.62, 18.14, 19.71, 22.04, 21.79]
Commerzbank
[9.7, 11.52, 11.88, 13.47, 13.13, 15.41]


Here we can see how the share of the topic "Risk Management" fluctuated in the annual reports of different banks over several years.

## Summary:
- LDA is quite complicated, but it allows meaningful topic detection.
- Topic labeling has to be done manually, but only once per set of documents.