## Applying Hierarchical LDA Model

Hierarchical Latent Dirichlet Allocation (hLDA) addresses the problem of learning topic hierarchies from data. The model relies on a non-parametric prior called the nested Chinese restaurant process, which allows for arbitrarily large branching factors and readily accommodates growing data collections. The hLDA model combines this prior with a likelihood that is based on a hierarchical variant of latent Dirichlet allocation [1].

The idea is inspired by [Hierarchical Topic Models and the Nested Chinese Restaurant Process](http://www.cs.columbia.edu/~blei/papers/BleiGriffithsJordanTenenbaum2003.pdf).
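
To give some intuition for the nested Chinese restaurant process, here is a minimal sketch in plain Python that samples one root-to-leaf path per document from the prior. The function names, the parameter name `gamma` (the concentration parameter), and the fixed tree depth are assumptions made for illustration. It only draws paths from the prior and is not the hLDA inference procedure from [1].

```python
import random


def sample_crp_table(counts, gamma, rng):
    """Chinese restaurant process step: pick an existing table with probability
    proportional to its customer count, or a new table with probability
    proportional to gamma."""
    total = sum(counts) + gamma
    r = rng.random() * total
    acc = 0.0
    for table, n in enumerate(counts):
        acc += n
        if r < acc:
            return table
    return len(counts)  # index of a newly opened table


def sample_ncrp_paths(num_docs, depth, gamma, seed=0):
    """Draw one path per document through a tree with `depth` levels.
    The root is shared, so each path lists depth - 1 child choices."""
    rng = random.Random(seed)
    tree = {}  # path prefix (tuple) -> visit counts of its children
    paths = []
    for _ in range(num_docs):
        path = ()
        for _ in range(depth - 1):
            counts = tree.setdefault(path, [])
            child = sample_crp_table(counts, gamma, rng)
            if child == len(counts):
                counts.append(0)  # a new branch is created on demand
            counts[child] += 1
            path = path + (child,)
        paths.append(path)
    return paths


paths = sample_ncrp_paths(num_docs=10, depth=3, gamma=1.0)
print(paths)
```

Because new branches are opened on demand, the number of children per node is not fixed in advance, which is the property that lets the prior accommodate growing collections.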

**Reference:**

[1] David M. Blei, Michael I. Jordan, Thomas L. Griffiths, Joshua B. Tenenbaum. Hierarchical topic models and the nested Chinese restaurant process. Proceedings of the 16th International Conference on Neural Information Processing Systems, pp. 17-24, December 9-11, 2003, Whistler, British Columbia, Canada.

#### Hierarchical LDA Model example
<img src="https://i.loli.net/2018/09/21/5ba4c638ee55a.png" alt="Hierarchical_LDA_Model_example.png" title="Hierarchical_LDA_Model_example.png" width="60%" height="60%"/>

In [1]:
from gensim import corpora, models, similarities
from gensim.corpora.dictionary import Dictionary
from gensim.models.ldamodel import LdaModel
from gensim.models.ldamulticore import LdaMulticore

import pyLDAvis.gensim

import os
import pickle
import pandas as pd
import numpy as np
import warnings

pyLDAvis.enable_notebook()

#### 1. Generate dictionary tokens for training the LDA model later on
- params:
    - processed_doc_path: path of the folder with the cleaned, preprocessed data
    - non_german_file_path: path of a text file that lists all non-German (English) files, one name per line
    - topic_related_documents: documents that belong to the same topic
    - document_limit: restrict the number of documents to process
- variables:
    - dictionary_tokens: a list of token lists, one per loaded processed file
    - filenames: a list of processed file names

In [2]:
def generate_dictionary_tokens(processed_doc_path, non_german_file_path, topic_related_documents, document_limit):
    # read the names of the non-German files, one per line
    with open(non_german_file_path, 'r') as fr:
        non_german_files = {line.strip() for line in fr}

    dictionary_tokens = []
    filenames = []

    for root, dirs, files in os.walk(processed_doc_path):
        for f in sorted(files):
            # stop the whole directory walk once the document limit is reached
            if document_limit <= 0:
                return dictionary_tokens, filenames
            # f[:-11] drops the 11-character suffix of the processed file name
            if f in topic_related_documents and f[:-11] not in non_german_files:
                document_limit -= 1
                try:
                    with open(os.path.join(root, f), 'rb') as fr:
                        document_tokens = pickle.load(fr)
                    # record the file name only after loading succeeded,
                    # so that filenames and dictionary_tokens stay aligned
                    filenames.append(f)
                    dictionary_tokens.append(document_tokens)
                except Exception as e:
                    print('Error while processing: ', f, e)

    return dictionary_tokens, filenames
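
The loader above expects one pickled token list per document and skips files whose name, minus an 11-character suffix, appears in the non-German list. The following round trip sketches that layout in a temporary folder. The file names, the suffix `_tokens.pkl`, and the tokens are hypothetical and chosen only so that the suffix is 11 characters long.

```python
import os
import pickle
import tempfile

# Hypothetical suffix and documents (assumptions, not taken from the real corpus)
SUFFIX = '_tokens.pkl'
docs = {
    'CompanyA-AnnualReport-2015' + SUFFIX: ['umsatz', 'gewinn', 'konzern'],
    'CompanyB-QuarterlyReport-2016-Q1' + SUFFIX: ['revenue', 'profit', 'group'],
}
non_german = {'CompanyB-QuarterlyReport-2016-Q1'}

with tempfile.TemporaryDirectory() as folder:
    # write one pickle file per document
    for name, tokens in docs.items():
        with open(os.path.join(folder, name), 'wb') as fw:
            pickle.dump(tokens, fw)

    # read the files back, skipping those listed as non-German
    kept = {}
    for name in sorted(os.listdir(folder)):
        if name[:-len(SUFFIX)] in non_german:
            continue
        with open(os.path.join(folder, name), 'rb') as fr:
            kept[name] = pickle.load(fr)

print(kept)
```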

#### 2. Apply the hierarchical LDA model to group documents that seem to belong to the same topic
- params:
    - processed_doc_path: path of the folder with the cleaned, preprocessed data
    - non_german_file_path: path of a text file that lists all non-German (English) files
    - topic_related_documents: documents that belong to the same topic
    - topic_num: number of topics to use when training the model
    - document_limit: restrict the number of documents to process
- variables:
    - gensim_dictionary: [Dictionary](https://radimrehurek.com/gensim/corpora/dictionary.html) type, defined by the gensim library
    - corpus: list of bag-of-words vectors, one per document, produced by doc2bow
    - lda: LDA model after training with the given dataset

In [3]:
def hierachicalLDA(processed_doc_path, non_german_file_path, topic_related_documents, topic_num=50, document_limit=1000):

    # load the token lists of the selected documents
    dictionary_tokens, filenames = generate_dictionary_tokens(
        processed_doc_path, non_german_file_path, topic_related_documents, document_limit)

    # map tokens to ids and convert each document to a bag-of-words vector
    gensim_dictionary = Dictionary(dictionary_tokens)
    corpus = [gensim_dictionary.doc2bow(text) for text in dictionary_tokens]
    # train an LDA model on this group of documents
    lda = LdaModel(corpus, num_topics=topic_num,
                   id2word=gensim_dictionary, iterations=200)

    return lda, gensim_dictionary, corpus, filenames
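
The `corpus` passed to the model is a list of bag-of-words vectors, where each document becomes a list of `(token_id, count)` pairs. The plain-Python imitation below illustrates that representation without gensim. The helper names `build_token_ids` and `to_bow` are hypothetical, and the ids they assign need not match the ids gensim's `Dictionary` would assign.

```python
from collections import Counter


def build_token_ids(documents):
    """Assign an integer id to every distinct token (order of first appearance)."""
    token2id = {}
    for tokens in documents:
        for token in tokens:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(tokens, token2id):
    """Convert one token list to sorted (token_id, count) pairs,
    ignoring tokens that are not in the vocabulary."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


documents = [['umsatz', 'gewinn', 'umsatz'], ['gewinn', 'konzern']]
token2id = build_token_ids(documents)
corpus = [to_bow(doc, token2id) for doc in documents]
print(token2id)
print(corpus)
```

Word order is discarded in this representation; only the token counts per document reach the topic model.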

### Demonstration:

**Steps**:
1. Load 200 preprocessed files to generate the list of dictionary tokens
2. Train an initial LDA model on these tokens to generate the desired number of topics, e.g. 15 topics derived from 200 files
3. Visualize the initial LDA model with 15 topics
4. Assign documents to topics (by topic probability)
5. Apply the hierarchical LDA model to the documents grouped under one topic (generating 4 further topics)
6. Visualize the second LDA model with 4 topics

In [4]:
%%time
# relative path
non_german_file_path = './non_german_files.txt'
processed_doc_path = '../spacy_corpus/'
topic_num_level_1 = 15
# at the first level, treat all documents as belonging to the same topic
topic_related_documents = os.listdir(processed_doc_path)
init_lda, init_gensim_dictionary, init_corpus, init_filenames = hierachicalLDA(
    processed_doc_path, non_german_file_path, topic_related_documents, topic_num_level_1, 200)

CPU times: user 24.5 s, sys: 228 ms, total: 24.8 s
Wall time: 8.45 s


#### Visualize Initial LDA model with 15 topics

In [5]:
pyLDAvis.gensim.prepare(init_lda, init_corpus, init_gensim_dictionary)



#### Group topic-related documents: a document is assigned to a topic when its probability for that topic is larger than 0.5
- params:
    - lda: trained LDA model
    - corpus: bag-of-words vectors of the documents
    - filenames: names of the documents, in the same order as corpus
    - topic_num: number of topics of the model
- variables:
    - df: pandas DataFrame, each column represents one topic, and each row holds the probabilities that one document belongs to each topic
    - topic_related_documents: list, contains one group of document names per topic

In [6]:
def get_topic_related_doc(lda, corpus, filenames, topic_num):

    # document-topic matrix: one row per document, one column per topic
    mat = np.zeros((len(filenames), topic_num))

    for doc_pos, doc in enumerate(corpus):
        # lda[doc] yields the topic distribution as (topic_id, probability) pairs
        for topic_id, probability in lda[doc]:
            mat[doc_pos][topic_id] = probability

    df = pd.DataFrame(mat, index=filenames, columns=range(0, topic_num))

    topic_related_documents = []

    for column in range(df.shape[1]):
        # number of documents whose probability for this topic exceeds 0.5
        row_count = int((df[column] > 0.5).sum())
        # names of these documents, ordered by decreasing probability
        topic_related_documents.append(df.nlargest(row_count, column).index)

    return topic_related_documents
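
The thresholding rule can be illustrated without a trained model. The sketch below uses made-up topic distributions in the `(topic_id, probability)` format and a hypothetical helper `group_by_dominant_topic`. Because the threshold is 0.5, a document can fall into at most one group, and documents without a dominant topic are not assigned at all.

```python
def group_by_dominant_topic(doc_topics, topic_num, threshold=0.5):
    """Return one list of document names per topic, ordered by decreasing
    probability, keeping only probabilities above the threshold."""
    groups = [[] for _ in range(topic_num)]
    for name, distribution in doc_topics.items():
        for topic_id, probability in distribution:
            if probability > threshold:
                groups[topic_id].append((probability, name))
    return [[name for _, name in sorted(group, reverse=True)] for group in groups]


# Made-up distributions for four hypothetical documents
doc_topics = {
    'doc_a': [(0, 0.9), (1, 0.1)],
    'doc_b': [(0, 0.6), (2, 0.4)],
    'doc_c': [(1, 0.45), (2, 0.55)],
    'doc_d': [(0, 0.4), (1, 0.3), (2, 0.3)],  # no topic above 0.5
}
groups = group_by_dominant_topic(doc_topics, topic_num=3)
print(groups)
```

Here topic 1 ends up with no documents and `doc_d` is left unassigned, which is also why a topic column can be empty after grouping.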



In [7]:
topic_related_documents = get_topic_related_doc(init_lda, init_corpus, init_filenames, topic_num_level_1)
df_topics = pd.DataFrame(topic_related_documents).transpose()
df_topics = df_topics.rename(columns=lambda x: 'Topic '+ str(x))

**Designate documents to topics (each topic consists of a list of documents)**

In [8]:
df_topics

|  | Topic 0 | Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 | Topic 6 | Topic 7 | Topic 8 | Topic 9 | Topic 10 | Topic 11 | Topic 12 | Topic 13 | Topic 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ADVA_Optical-QuarterlyReport-2012-Q3 | DeutscheWohnen-AnnualReport-2011 | FuchsPetrolub-AnnualReport-2011 | BMW-QuarterlyReport-2014-Q3 | DIC-Asset-AnnualReport-2013 | Daimler-AnnualReport-2015 | Diebold_Nixdorf-QuarterlyReport-2012-Q3 |  | Cancom-AnnualReport-2011 | Bilfinger-QuarterlyReport-2010-Q1 | DeutscheBank-AnnualReport-2014 | EVOTEC-QuarterlyReport-2012-Q2 | BVB-QuarterlyReport-2011-Q1 | Biotest-AnnualReport-2016 |  |
| 1 | ADVA_Optical-QuarterlyReport-2016-Q3 | DeutscheWohnen-AnnualReport-2010 | Allianz-QuarterlyReport-2015-Q1 | EnBW-QuarterlyReport-2011-Q3 | DialogSemiconductor-QuarterlyReport-2015-Q1 | Daimler-AnnualReport-2014 | Diebold_Nixdorf-QuarterlyReport-2012-Q2 |  | Cancom-AnnualReport-2010 | BayerischeLandesbank-QuarterlyReport-2017-Q1 | DeutscheBank-AnnualReport-2012 | Elring_Klinger-QuarterlyReport-2013-Q3 | EVOTEC-QuarterlyReport-2016-Q2 | Biotest-AnnualReport-2011 |  |
| 2 | Brenntag-AnnualReport-2015 | Deutsche_Euroshop-QuarterlyReport-2010-Q2 | Bayer-QuarterlyReport-2014-Q3 | EnBW-QuarterlyReport-2011-Q2 | Drillisch-QuarterlyReport-2013-Q1 | Daimler-AnnualReport-2013 | Bilfinger-QuarterlyReport-2014-Q1 |  | Cancom-QuarterlyReport-2015-Q2 | BetAtHome-QuarterlyReport-2017-Q1 | DeutscheBank-AnnualReport-2013 | DialogSemiconductor-QuarterlyReport-2011-Q1 | ADVA_Optical-QuarterlyReport-2012-Q2 | Biotest-AnnualReport-2010 |  |
| 3 | Commerzbank-QuarterlyReport-2014-Q2 | Deutsche_Euroshop-QuarterlyReport-2010-Q3 | Fresenius-QuarterlyReport-2013-Q1 | DIC-Asset-QuarterlyReport-2014-Q3 | Drillisch-QuarterlyReport-2017-Q1 | Daimler-AnnualReport-2012 | Bertrandt-QuarterlyReport-2014-Q2 |  | Cancom-QuarterlyReport-2015-Q3 |  | DeutscheBank-AnnualReport-2015 | Elring_Klinger-QuarterlyReport-2013-Q2 | Diebold_Nixdorf-QuarterlyReport-2016-Q3 | Deutsche_Euroshop-AnnualReport-2015 |  |
| 4 | Beiersdorf-QuarterlyReport-2014-Q1 | DeutscheWohnen-AnnualReport-2016 | Bayer-QuarterlyReport-2014-Q2 | CarlZeissMeditec-QuarterlyReport-2012-Q1 | DeutscheBoerse-QuarterlyReport-2017-Q1 | AmadeusFiRe-AnnualReport-2016 | EvonikIndustries-QuarterlyReport-2015-Q2 |  | Cancom-QuarterlyReport-2011-Q2 |  | DeutscheBank-QuarterlyReport-2013-Q3 | Aareal-QuarterlyReport-2013-Q1 |  | Deutsche_Euroshop-AnnualReport-2014 |  |
| 5 | ADVA_Optical-QuarterlyReport-2016-Q2 | Continental-AnnualReport-2012 | Durr-QuarterlyReport-2013-Q3 | Fraport-QuarterlyReport-2012-Q1 | Freenet-QuarterlyReport-2011-Q1 | AmadeusFiRe-AnnualReport-2010 | Bertrandt-QuarterlyReport-2014-Q3 |  | Cancom-QuarterlyReport-2011-Q3 |  | DeutscheBank-QuarterlyReport-2013-Q2 | Commerzbank-QuarterlyReport-2010-Q2 |  | Fresenius_Medical_Care-QuarterlyReport-2016-Q3 |  |
| 6 | Commerzbank-QuarterlyReport-2014-Q3 | Deutsche_Euroshop-QuarterlyReport-2014-Q2 | Capital_Stage-QuarterlyReport-2013-Q3 | Fielmann-QuarterlyReport-2012-Q1 | DIC-Asset-AnnualReport-2014 | Daimler-QuarterlyReport-2010-Q3 | eon-AnnualReport-2010 |  | EvonikIndustries-QuarterlyReport-2011-Q3 |  | Deutsche_Post-AnnualReport-2015 | Elring_Klinger-QuarterlyReport-2017-Q2 |  | Fresenius_Medical_Care-QuarterlyReport-2016-Q2 |  |
| 7 |  | DeutscheWohnen-QuarterlyReport-2017-Q1 | Fresenius-QuarterlyReport-2017-Q1 | DeutscheWohnen-QuarterlyReport-2013-Q1 | DeutscheBoerse-QuarterlyReport-2013-Q1 | Daimler-QuarterlyReport-2010-Q2 | DIC-Asset-QuarterlyReport-2014-Q2 |  |  |  | Deutsche_Post-AnnualReport-2014 | Beiersdorf-QuarterlyReport-2010-Q1 |  | EnBW-QuarterlyReport-2015-Q2 |  |
| 8 |  | Deutsche_Euroshop-AnnualReport-2012 | Bayer-QuarterlyReport-2010-Q2 | Bechtle-QuarterlyReport-2011-Q3 | DIC-Asset-AnnualReport-2015 | Daimler-QuarterlyReport-2014-Q3 |  |  |  |  | DzBank-QuarterlyReport-2015-Q1 | Aareal-QuarterlyReport-2017-Q1 |  | Eventim-AnnualReport-2015 |  |
| 9 |  | Continental-QuarterlyReport-2013-Q2 | Bayer-QuarterlyReport-2010-Q3 | eon-QuarterlyReport-2015-Q1 | CompuGroupMedical-QuarterlyReport-2010-Q1 | Daimler-QuarterlyReport-2014-Q2 |  |  |  |  | Deutsche_Post-AnnualReport-2012 | Commerzbank-QuarterlyReport-2010-Q3 |  | Eventim-AnnualReport-2014 |  |


#### A simple example: apply the hierarchical LDA model to the topic with the most related documents (e.g. Topic 2) to generate 4 further topics

In [9]:
topic_num_level_2 = 4
topic_docs = max(topic_related_documents, key=len)
lda, gensim_dictionary, corpus, filenames = hierachicalLDA(
    processed_doc_path, non_german_file_path, topic_docs.tolist(), topic_num_level_2)
pyLDAvis.gensim.prepare(lda, corpus, gensim_dictionary)



First, we apply an initial LDA model to derive topics from the whole document collection; in the case above, we derive 15 topics from 200 files. Then we assign each document to the topic for which its topic probability exceeds 0.5.

Next, for demonstration, we take the topic with the most related documents and apply the LDA model again to derive 4 topics. This assumes that more detailed topics can be derived within a previously generated topic.

To sum up, an LDA model can be applied hierarchically to obtain more fine-grained topics.
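
The demonstration refines only the largest topic, but the same step could in principle be repeated for every topic and for several levels. The skeleton below sketches that recursion in plain Python. The function `build_topic_tree`, its parameters, and the `toy_split` stand-in are hypothetical; in a real run, `split` would train an LDA model on the group and return the thresholded document groups.

```python
def build_topic_tree(documents, split, topics_per_level, min_docs=2):
    """Recursively split a group of documents into sub-topics.
    `split(documents, num_topics)` must return one document list per topic."""
    if not topics_per_level or len(documents) < min_docs:
        return {'documents': documents, 'children': []}
    groups = split(documents, topics_per_level[0])
    children = [
        build_topic_tree(group, split, topics_per_level[1:], min_docs)
        for group in groups
        if group  # topics without documents produce no subtree
    ]
    return {'documents': documents, 'children': children}


def toy_split(documents, num_topics):
    """Stand-in for the LDA step (assumption): deal documents round-robin."""
    return [documents[i::num_topics] for i in range(num_topics)]


tree = build_topic_tree(list('abcdefgh'), toy_split, topics_per_level=[2, 2])
print(tree['children'][0]['children'][0]['documents'])
```

With `topics_per_level=[15, 4]` and an LDA-based `split`, the first level would correspond to the 15 initial topics and the second to the 4 sub-topics derived per topic.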