## Applying a Hierarchical LDA Model

Hierarchical Latent Dirichlet Allocation (hLDA) addresses the problem of learning topic hierarchies from data. The model relies on a non-parametric prior called the nested Chinese restaurant process, which allows for arbitrarily large branching factors and readily accommodates growing data collections. The hLDA model combines this prior with a likelihood that is based on a hierarchical variant of latent Dirichlet allocation [1].

The idea is inspired by [Hierarchical Topic Models and the Nested Chinese Restaurant Process](http://www.cs.columbia.edu/~blei/papers/BleiGriffithsJordanTenenbaum2003.pdf).
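To make the nested Chinese restaurant process more concrete, here is a minimal sketch of how the prior could assign each document a path through a topic tree. It is not the inference procedure from the paper and it is not used by the code in this notebook. The function name `sample_ncrp_paths` and the parameters `depth`, `gamma` and `seed` are assumptions made for illustration.

```python
import random


def sample_ncrp_paths(num_docs, depth, gamma, seed=0):
    """Draw one root-to-leaf path per document from a nested CRP prior.

    At every node, a document follows an existing child with probability
    proportional to the number of earlier documents that chose it, or opens
    a new child with probability proportional to gamma.
    """
    rng = random.Random(seed)
    child_counts = {}  # node (tuple of child indices) -> visit counts per child
    paths = []
    for _ in range(num_docs):
        node = ()
        for _ in range(depth):
            counts = child_counts.setdefault(node, [])
            total = sum(counts)
            r = rng.uniform(0, total + gamma)
            choice = len(counts)  # default: open a new child
            acc = 0.0
            for k, c in enumerate(counts):
                acc += c
                if r < acc:
                    choice = k
                    break
            if choice == len(counts):
                counts.append(0)
            counts[choice] += 1
            node = node + (choice,)
        paths.append(node)
    return paths
```

Each returned path is a tuple of child indices, one per level. A larger `gamma` makes new branches more likely, which is how the prior allows the tree to grow with the data.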

**Reference:**

[1] David M. Blei, Michael I. Jordan, Thomas L. Griffiths, Joshua B. Tenenbaum. Hierarchical topic models and the nested Chinese restaurant process. Proceedings of the 16th International Conference on Neural Information Processing Systems, pp. 17-24, December 9-11, 2003, Whistler, British Columbia, Canada.

<font color="blue"/>

### dsp:
  * &#x1f642; Wow! That is something extra. Great!
  * To reference a scientific publication, you should give more complete data, e.g. for the 2010 version of the document: "David M. Blei, Thomas L. Griffiths, and Michael I. Jordan. 2010. The nested chinese restaurant process and bayesian nonparametric inference of topic hierarchies. J. ACM 57, 2, Article 7 (February 2010), 30 pages. DOI: https://doi.org/10.1145/1667053.1667056"

#### Hierarchical LDA Model example
![Hierarchical LDA Model example](Hierarchical_LDA_Model_example.png)

In [1]:
from gensim import corpora, models, similarities
from gensim.corpora.dictionary import Dictionary
from gensim.models.ldamodel import LdaModel
from gensim.models.ldamulticore import LdaMulticore

import pyLDAvis.gensim

import os
import pickle
import pandas as pd
import numpy as np
import warnings

pyLDAvis.enable_notebook()

<font color="blue"/>

### dsp:
  * Sorting and adding empty lines might make this import list more accessible.

#### 1. Apply the initial LDA model to group documents that seem to belong to the same topic
- params:
    - processed_doc_path: folder containing the cleaned, preprocessed data
    - non_german_file_path: path to a text file listing all non-German (English) files, one name per line
    - topic_num: number of topics to use when training the model
    - document_limit: maximum number of documents to process
- variables:
    - gensim_dictionary: [Dictionary](https://radimrehurek.com/gensim/corpora/dictionary.html) object from the gensim library
    - corpus: list of bag-of-words vectors, one per document, produced by `doc2bow` from the tokenized and normalized text
    - lda: LDA model trained on the given dataset

<font color="blue"/>

### dsp:
  * "are seemed" ~> "seem"
  * What you call a "path" here you have been calling a "folder" before.

In [2]:
def initialLDAModel(processed_doc_path, non_german_file_path, topic_num=50, document_limit=1000):
    
    #get a list of non_german files
    non_german_files = []
    with open(non_german_file_path, 'r') as fr:
        for line in fr:
            non_german_files.append(line.strip())
    
    dictionary = []
    filenames = []
    
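    for root, dirs, files in os.walk(processed_doc_path):
        for f in files:
            if f[:-11] not in non_german_files:
                document_limit -= 1
                try:
                    with open(os.path.join(root, f), 'rb') as fr:
                        document_tokens = pickle.load(fr)
                    # append only after a successful load so that
                    # filenames and dictionary stay aligned
                    filenames.append(f)
                    dictionary.append(document_tokens)
                except Exception as err:
                    print('Error while processing: ', f, err)
            
            if document_limit <= 0:
                break
        
        # the inner break leaves only the file loop; stop walking as well
        if document_limit <= 0:
            break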
    
    gensim_dictionary = Dictionary(dictionary)
    corpus = [gensim_dictionary.doc2bow(text) for text in dictionary]
    lda = LdaModel(corpus, num_topics=topic_num, id2word=gensim_dictionary, iterations=200)
    topics = lda.show_topics(num_topics=-1, num_words=20)
    
    doc_pos = 0
    mat = np.zeros((len(filenames), topic_num))

    for doc in corpus:
        vector = lda[doc] # get topic probability distribution for a document
        for element in vector:
            mat[doc_pos][element[0]] = element[1]
        doc_pos += 1
        
    df = pd.DataFrame(mat, index=filenames, columns=range(0,topic_num))
    
    return df
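
The loop that fills `mat` turns a sparse list of `(topic_id, probability)` pairs per document into one dense row per document. The sketch below isolates that step so it can be checked without training a model. The helper name `sparse_topics_to_matrix` is an assumption. As far as I know, gensim omits topics whose probability falls below a minimum threshold, which is why missing entries are left at zero here.

```python
import numpy as np


def sparse_topics_to_matrix(doc_topics, num_topics):
    """Convert per-document lists of (topic_id, probability) into a dense matrix.

    Topics that are missing from a document's list keep probability 0.
    """
    mat = np.zeros((len(doc_topics), num_topics))
    for row, topics in enumerate(doc_topics):
        for topic_id, prob in topics:
            mat[row, topic_id] = prob
    return mat
```

With the names used above, `sparse_topics_to_matrix([lda[doc] for doc in corpus], topic_num)` should give the same matrix as the explicit loop.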

<font color="blue"/>

### dsp:
  * The first half of this function could be a function in its own right. I think I have seen similar code in other notebooks as well.
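
One way to follow this suggestion is to move the file loading into a helper that both `initialLDAModel` and `hierachicalLDA` could call. The sketch below is only an illustration. The name `load_tokenized_documents` and the optional `allowed_files` parameter are assumptions, and the `f[:-11]` suffix stripping is copied from the notebook as-is. Unlike the original loop, the limit here counts successfully loaded documents rather than attempts.

```python
import os
import pickle


def load_tokenized_documents(doc_dir, non_german_file_path,
                             document_limit=1000, allowed_files=None):
    """Load pickled token lists, skipping files listed as non-German.

    Returns (filenames, documents) with matching order and length.
    """
    with open(non_german_file_path, 'r') as fr:
        non_german_files = {line.strip() for line in fr}

    filenames, documents = [], []
    for root, _dirs, files in os.walk(doc_dir):
        for f in sorted(files):
            if len(filenames) >= document_limit:
                return filenames, documents
            if allowed_files is not None and f not in allowed_files:
                continue
            if f[:-11] in non_german_files:
                continue
            try:
                with open(os.path.join(root, f), 'rb') as fr:
                    tokens = pickle.load(fr)
            except Exception as err:
                print('Error while processing: ', f, err)
                continue
            filenames.append(f)
            documents.append(tokens)
    return filenames, documents
```

The first model would call it without `allowed_files`; the second model would pass the list of topic-related file names.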

In [3]:
%%time
warnings.filterwarnings("ignore")
#please modify the path
# non_german_file_path = '/home/bit/ma0/LabShare/data/non_german_files.txt'
# processed_doc_path = '/home/bit/ma0/LabShare/data/chui_ma/spacy_corpus/'

#relative path
non_german_file_path = './non_german_files.txt'
processed_doc_path = '../spacy_corpus/'
df = initialLDAModel(processed_doc_path, non_german_file_path, 20, 200)

CPU times: user 33.7 s, sys: 348 ms, total: 34.1 s
Wall time: 10.4 s


<font color="blue"/>

### dsp:
  * Do you know which warnings you suppress? Did you check whether it is OK to ignore them? Are there no other warnings?
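
To answer these questions, the warnings could be recorded once instead of silenced globally. The sketch below uses only the standard `warnings` module. The helper name `run_and_collect_warnings` is an assumption.

```python
import warnings


def run_and_collect_warnings(func, *args, **kwargs):
    """Run func and return (result, warnings seen), without printing them."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, even repeats
        result = func(*args, **kwargs)
    seen = [(w.category.__name__, str(w.message)) for w in caught]
    return result, seen
```

For example, `df, seen = run_and_collect_warnings(initialLDAModel, processed_doc_path, non_german_file_path, 20, 200)` would list what is currently hidden. Once the categories are known, a call such as `warnings.filterwarnings("ignore", category=DeprecationWarning)` limits the suppression to that category.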

#### Group topic-related documents: a document is assigned to a topic if its probability for that topic is larger than 0.3
- params: 
    - df: pandas DataFrame; each column represents one topic, each row represents one document, and each cell holds the probability that the document belongs to the topic
- variables: 
    - topic_related_documents: list with one entry per topic; each entry contains the names of the documents that seem to belong to that topic

In [4]:
def get_Topic_Related_Doc(df):  
    
    topic_related_documents = []
    
    for column in range(df.shape[1]):
        row_count = 0
        for row in range(df.shape[0]):
            if df.iat[row, column] > 0.3:
                row_count += 1
        topic_related_documents.append(df.nlargest(row_count, column).index)
    
    return topic_related_documents
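
The same grouping can be written with a boolean mask per column instead of counting rows and calling `nlargest`. This is a sketch of an alternative, not the notebook's method. The function name `group_documents_by_topic` and the `threshold` parameter are assumptions, and it returns plain lists instead of pandas `Index` objects.

```python
import pandas as pd


def group_documents_by_topic(df, threshold=0.3):
    """Return, for each topic column, the document names above the threshold.

    Names within a group are ordered by decreasing topic probability.
    """
    groups = []
    for column in df.columns:
        selected = df.loc[df[column] > threshold, column]
        groups.append(selected.sort_values(ascending=False).index.tolist())
    return groups
```

Making the threshold a parameter also makes it easier to see how sensitive the groups are to the value 0.3.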

#### 2. Apply the Hierarchical LDA Model
- params:
    - folder_path: folder containing the cleaned, preprocessed data
    - non_german_file_path: path to a text file listing all non-German (English) files, one name per line
    - topic_num: number of topics to use when training the model
    - document_limit: maximum number of documents to process (currently not used by the function)
    - **topic_related_documents**: list of document names that seem to belong to the same topic (grouped by the first model because of their high probability for that topic)
- variables:
    - gensim_dictionary: [Dictionary](https://radimrehurek.com/gensim/corpora/dictionary.html) object from the gensim library
    - corpus: list of bag-of-words vectors, one per document, produced by `doc2bow` from the tokenized and normalized text
    - lda: LDA model trained on the given dataset

In [5]:
def hierachicalLDA(folder_path, non_german_file_path, topic_related_documents, topic_num=5, document_limit=1000):
    #get a list of non_german files
    non_german_files = []
    with open(non_german_file_path, 'r') as fr:
        for line in fr:
            non_german_files.append(line.strip())
            
    dictionary = []
    filenames = []
    for root, dirs, files in os.walk(folder_path):
        for f in sorted(files):
            if f in topic_related_documents:
                if f[:-11] not in non_german_files:
                    with open(root+'/'+f, 'rb') as fr:
                        filenames.append(f)
                        document_tokens = pickle.load(fr)
                        dictionary.append(document_tokens)

    gensim_dictionary = Dictionary(dictionary)
    corpus = [gensim_dictionary.doc2bow(text) for text in dictionary]
    lda = models.ldamodel.LdaModel(corpus, num_topics=topic_num, id2word=gensim_dictionary, iterations=200)
    topics = lda.show_topics(num_topics=-1, num_words=20)
    
    return lda, corpus, gensim_dictionary

<font color="blue"/>

### dsp:
  * This is probably the special part of this notebook.
  * You did not document the parameter `topic_related_documents`.
  * Does this parameter realize the "hierarchical" aspect of your "hierarchical LDA"?
  * You do not seem to pass a document list containing only "topic related" documents, do you?

In [6]:
docs = get_Topic_Related_Doc(df)

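#### Visualize the topic model with the pyLDAvis library
- params: 
    - data_path: folder containing the cleaned, preprocessed data
    - docs: names of the documents in one group produced by the initial LDA model
    - topic_num: number of topics to use
- note: the function reads `non_german_file_path` from the notebook's global scope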

In [7]:
def visualization(data_path, docs, topic_num):
    
    #only visualize those topics with more than 5 documents
    if len(docs) > 5:
        lda, corpus, dictionary = hierachicalLDA(data_path, non_german_file_path, docs, topic_num)
        return pyLDAvis.gensim.prepare(lda, corpus, dictionary)
    
    else:
        print('Cannot visualize, Too few documents')

<font color="blue"/>

### dsp:
  * "Too less documents" ~> "Too few documents"

As a simple example, apply the hierarchical LDA to the topic that has the most related documents.

In [8]:
# visualization(processed_doc_path, docs[6].tolist(), 3)
doc_example = max(docs, key=len)
visualization(processed_doc_path, doc_example.tolist(), 3)
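
The example above handles a single group. To get a second level for every first-level topic, the same step could be repeated per group, roughly as sketched below. The function name `build_topic_hierarchy` and the `fit_submodel` callable are assumptions; `fit_submodel` stands in for a call to `hierachicalLDA`. The `min_docs` default mirrors the "more than 5 documents" rule in `visualization`.

```python
def build_topic_hierarchy(topic_groups, fit_submodel, min_docs=5):
    """Fit one sub-model per first-level topic that has enough documents.

    Returns a dict mapping the first-level topic index to whatever
    fit_submodel returns for that topic's document names.
    """
    hierarchy = {}
    for topic_id, group in enumerate(topic_groups):
        names = list(group)
        if len(names) > min_docs:
            hierarchy[topic_id] = fit_submodel(names)
    return hierarchy
```

With the notebook's names, a call might look like `build_topic_hierarchy(docs, lambda names: hierachicalLDA(processed_doc_path, non_german_file_path, names, 3))`. Note that this yields a fixed two-level structure, which differs from the tree learned by the nested Chinese restaurant process in [1].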

<font color="blue"/>

### dsp:
  * Interesting experiment. 
  * I still have to study the literature.
  * From the code I cannot understand how the "hierarchical" aspect is meant to work and I suspect you do not understand it either, or do you?