## Get started
1. Open the URL https://open-science.humlab.umu.se
2. Enter your credentials: one of the usernames **phd_user_01, phd_user_02, ..., phd_user_11** and the password **phd_course_2018**.
3. Open folder **phd_course** (just click on it).
4. Open notebook **intro_to_topic_modelling**.

## What is Topic Modelling?

- Topic modelling can be seen as a method for finding "groups of words" (i.e. themes/topics) in a collection of documents that in some way capture the information in the collection.
- It can also be thought of as a form of text mining – a way to obtain recurring patterns of words in textual material.

## What is an LDA Topic Model?
- LDA is a so-called "generative probabilistic" model.
- LDA, like all topic models, assumes that a document consists of **underlying "topics"**.
- This means that certain **"groups of words" (i.e. topics) occur more frequently** in a specific document.
- Each topic is hence simply **a word frequency distribution**.
- The premise of LDA is the assumption that the **documents have been generated via a statistical model/process**.
- A simplified view of how a document is generated:
  - Select the document's (mix of) topics => a topic distribution
  - Generate the document's words by repeating:
    - Draw a topic from the topic distribution
    - Draw a word from that topic

Given this imaginary generative process, the corpus at hand is treated as its observed outcome.<br>
Standard inference algorithms can then be used to fit the statistical model to the corpus, which gives the topic distributions.<br>
See *Blei et al., 2003: Latent Dirichlet Allocation* [PDF](http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf) for a description of LDA.
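
The simplified generative process described above can be sketched in a few lines of Python. This is only a toy illustration, not code used elsewhere in this notebook: the vocabulary, the two topics, the symmetric Dirichlet parameter `alpha`, and the helper name `generate_document` are all made-up assumptions.

```python
import numpy as np

def generate_document(topic_word, alpha, n_words, rng):
    """Generate one synthetic document (a list of word ids) the way LDA imagines it."""
    n_topics, n_vocab = topic_word.shape
    # 1. Select the document's mix of topics: a topic distribution.
    #    (Assumption: a symmetric Dirichlet prior with parameter alpha.)
    doc_topics = rng.dirichlet([alpha] * n_topics)
    words = []
    for _ in range(n_words):
        # 2a. Draw a topic from the document's topic distribution.
        topic = rng.choice(n_topics, p=doc_topics)
        # 2b. Draw a word from that topic's word distribution.
        word = rng.choice(n_vocab, p=topic_word[topic])
        words.append(int(word))
    return doc_topics, words

# Toy setup (all values are made up): 2 topics over a 4-word vocabulary.
vocabulary = ['gene', 'cell', 'vote', 'party']
topic_word = np.array([
    [0.5, 0.5, 0.0, 0.0],   # a "biology"-like topic
    [0.0, 0.0, 0.5, 0.5],   # a "politics"-like topic
])

rng = np.random.default_rng(seed=1)
doc_topics, word_ids = generate_document(topic_word, alpha=0.5, n_words=10, rng=rng)
print('Topic mix:', doc_topics)
print('Document :', ' '.join(vocabulary[i] for i in word_ids))
```

Fitting an LDA model runs this reasoning in reverse: only the documents are observed, and the topic mixes and the topics' word distributions are estimated from them.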

<span style="float:left">
    <img src="./images/blei_lda.jpg" style="width: 600px;padding: 0; margin: 0;">
    <br>
    <img src="./images/blei_2012b.png" style="width: 600px;padding: 0; margin: 0;">
</span>
<br>




## What is a Jupyter Notebook?
<span style="float:left"><img src="./images/narrative_new.svg" style="width: 300px;padding: 0; margin: 0;"></span><br>
> - [Jupyter](http://jupyter.org/) is open-source software for **interactive and reproducible computing**.<br>
> - The **open science movement** is a driving force for Jupyter's popularity.<br>
> - This movement is in part a response to the **reproducibility crisis in science** and the **statistical crisis in science**.<br>
> - Jupyter Notebooks contain **executable code, equations, visualizations and narrative text**.<br>
> - It is a **web application** with a simple and easy-to-use web interface.
> - Supports a large number of programming languages (50+, e.g. Python, R, JavaScript).
> - Sponsored by large companies such as Google and Microsoft.

#### Brief Instructions on How to Use Notebooks
- **Menu Help -> User Interface Tour** gives an overview of the user interface.
- **Code cells** contain the script code and have **In [x]** in the left margin.
  - **In []** indicates that the code cell hasn't been executed yet.
  - **In [n]** indicates that the code has been executed (n is an integer).
  - **In [\*]** indicates that the code is executing, or waiting to be executed (i.e. other cells are executing).
- **The current cell** is highlighted with a blue border - you make it current by clicking on it.
- **SHIFT+ENTER** or **Play button** executes the current cell. Code cells aren't executed automatically.
- **Out[n]** indicates the output (or result) of a cell's execution and is directly below the executed cell.
- **SHIFT+ENTER** automatically selects the next code cell.
- **SHIFT+ENTER** can hence be used repeatedly to execute the code cells in sequence.
- **Menu Cell -> Run All** executes the entire notebook in a single step (can take some time to finish, notice how "In [\*]" indicators change to "In [n]" ).
- **Double-Click** on a cell to edit its content.
- **ESC key** leaves edit mode (or just click on any other cell).
- **Kernel -> Restart** restarts the server-side kernel (use it if the notebook seems stuck).


### Risks
- The risk of using tools and methods **without fully understanding** them
- The risk of using tools and methods **for non-intended purposes or in new contexts**
- How to verify **performance** (correctness of result)
- Risk of **data dredging**, p-hacking, "the statistical crisis".
- The risk that **the engineer makes micro-decisions** that the researcher doesn't know about, or doesn't fully understand.
- The risk of **reading too much into visualizations** (networks, layouts, clusters).

### Challenges
- **What’s easy for humans can be extremely hard for computers**
- **Human-in-the-loop or supervised learning can be very expensive**
- Ambiguity and fuzziness of terms and phrases
- Poor data quality: errors in data, wrong data, missing data, ambiguous data
- Understanding domain contexts, metadata, domain-specific data
- Data size (too much, too little)
- How to understand internal representations of data used by computational methods
- Internal representations are simplified views of the actual data (e.g. the "bag-of-words" model)
- etc...
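
To make the "bag-of-words" simplification mentioned in the list above concrete, here is a minimal sketch using only the Python standard library. The helper name `bag_of_words` and the two example sentences are made up for illustration, and splitting on whitespace is a deliberately naive tokenizer.

```python
from collections import Counter

def bag_of_words(text):
    """Return word counts for a text, ignoring word order and letter case."""
    tokens = text.lower().split()  # naive whitespace tokenization (assumption)
    return Counter(tokens)

doc_a = "the dog chased the cat"
doc_b = "the cat chased the dog"

bow_a = bag_of_words(doc_a)
bow_b = bag_of_words(doc_b)

print(bow_a)
# The two sentences mean different things, but their bags of words are identical,
# because the representation keeps only the counts and discards the word order.
print(bow_a == bow_b)
```

This is the kind of information loss to keep in mind when interpreting results computed from such a representation.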

### A sample high-level workflow

<img src="./images/text_analysis_workflow.svg" alt="" width="1200"/>

## <span style='color: green'>COLLECT</span> Extract Text From PDFs <span style='color: blue; float: right'>SKIP</span>  
The source data consists of 25 English academic articles downloaded as PDFs from Zotero. The first step is to extract the text from the PDF files. This can also be done manually.

In [1]:
import glob
import os
import zipfile

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFTextExtractionNotAllowed, PDFPage
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from io import StringIO

def extract_pdf_text(filename):
    text_lines = []
    with open(filename, 'rb') as fp:
        
        parser = PDFParser(fp)
        document = PDFDocument(parser)

        if not document.is_extractable:
            raise PDFTextExtractionNotAllowed

        resource_manager = PDFResourceManager()

        result_buffer = StringIO()

        device = TextConverter(resource_manager, result_buffer, codec='utf-8', laparams=LAParams())

        interpreter = PDFPageInterpreter(resource_manager, device)

        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)

        lines = result_buffer.getvalue().splitlines()
        for line in lines:
            text_lines.append(line)

    return text_lines


def extract_pdf_texts(source_folder, target_zip_filename):
    with zipfile.ZipFile(target_zip_filename, 'w', zipfile.ZIP_DEFLATED) as target_zip:

        for filename in glob.glob(os.path.join(source_folder,'*.pdf')):

            print('Processing: ' + filename)

            text_lines = extract_pdf_text(filename)

            target_filename = os.path.splitext(os.path.split(filename)[1])[0] + '.txt'
            target_filename = target_filename.lower().replace(' ', '_').replace(',','')

            target_zip.writestr(target_filename, '\n'.join(text_lines))
            
#source_folder = './data/pdf'
#target_zip_filename = 'data/paper_extracted_texts.zip'
#extract_pdf_texts(source_folder, target_zip_filename)

## <span style='color: green'>INITIALIZE </span> Set up and initialize the required packages <span style='color: red; float: right'>MANDATORY RUN!</span>  
The following code cell must be run once to set up the runtime environment. Please select the cell and press **SHIFT+ENTER** or click the **RUN** button in the toolbar.

In [20]:
# folded code
%load_ext autoreload
%autoreload 2
import os, warnings, types
import numpy as np, pandas as pd
import bokeh, bokeh.plotting, bokeh.models, matplotlib.pyplot as plt
import ipywidgets as widgets
import re, string, zipfile
import nltk, spacy, textacy, textacy.extract, textacy.preprocess

import common.utility as utility
import common.widgets_utility as widgets_utility
from gensim import corpora, models, matutils
from IPython.display import display, HTML #, clear_output, IFrame
from pivottablejs import pivot_ui
from spacy import displacy

logger = utility.getLogger(format="%(levelname)s;%(message)s")

warnings.filterwarnings("ignore", category=DeprecationWarning) 
warnings.filterwarnings("ignore", category=FutureWarning) 

pd.set_option('display.precision', 10)

def get_filenames(zip_filename, extension='.txt'):
    with zipfile.ZipFile(zip_filename, mode='r') as zf:
        return [ x for x in zf.namelist() if x.endswith(extension) ]
    
def get_text(zip_filename, filename):
    with zipfile.ZipFile(zip_filename, mode='r') as zf:
        return zf.read(filename).decode(encoding='utf-8')

DEFAULT_TERM_PARAMS = dict(
    args=dict(ngrams=1, named_entities=True, normalize='lemma', as_strings=True),
    kwargs=dict(filter_stops=True, filter_punct=True, filter_nums=True, min_freq=1, drop_determiners=True, include_pos=('NOUN', 'PROPN', ))
)
    
def filter_terms(doc, term_args, chunk_size=None, min_length=2):
    kwargs = utility.extend({}, DEFAULT_TERM_PARAMS['kwargs'], term_args['kwargs'])
    args = utility.extend({}, DEFAULT_TERM_PARAMS['args'], term_args['args'])
    terms = (x for x in doc.to_terms_list(
        args['ngrams'],
        args['named_entities'],
        args['normalize'],
        args['as_strings'],
        **kwargs
    ) if len(x) >= min_length)
    return terms
        
def slim_title(x):
    # Return the text inside a trailing "(...)" if present, else the first three words.
    try:
        m = re.match(r'.*\((.*)\)$', x)
        if m is not None:
            return m.group(1)
        return ' '.join(x.split(' ')[:3]) + '...'
    except Exception:
        return x
            
LANGUAGE = 'en'
SOURCE_FOLDER = './data'

EXTRACTED_TEXT_FILENAME = os.path.join(SOURCE_FOLDER, 'paper_extracted_texts.zip')
EDITED_TEXT_FILENAME = os.path.join(SOURCE_FOLDER, 'paper_edited_texts.zip')
PREPROCESSED_TEXT_FILENAME = os.path.join(SOURCE_FOLDER, 'paper_preprocessed_text.zip')

SOURCE_FILES = {
    'source_text_raw': { 'filename': EXTRACTED_TEXT_FILENAME, 'description': 'Raw text from PDF: Automatic text extraction using pdfminer Python package. ' },
    'source_text_edited': { 'filename': EDITED_TEXT_FILENAME, 'description': 'Manually edited text: List of references, index, notes and page headers etc. removed.' },
    'source_text_preprocessed': { 'filename': PREPROCESSED_TEXT_FILENAME, 'description': 'Preprocessed text: Normalized whitespaces. Unicode fixes. Urls, emails and phonenumbers removed. Accents removed.' }
}

HYPHEN_REGEXP = re.compile(r'\b(\w+)-\s*\r?\n\s*(\w+)\b', re.UNICODE)
DF_TAGSET = pd.read_csv('./data/tagset.csv', sep='\t').fillna('')
TOOLS = "pan,wheel_zoom,box_zoom,reset,previewsave"
AGGREGATES = { 'mean': np.mean, 'sum': np.sum, 'max': np.max, 'std': np.std }

logger.info('POS tag set: ' + ' '.join(list(DF_TAGSET.POS.unique())))

%matplotlib inline
bokeh.plotting.output_notebook()

class TopicModelNotComputed(Exception):
    @staticmethod
    def check():
        if 'TM_GUI_MODEL' in globals():
            gui =  globals()['TM_GUI_MODEL']
            if None not in (gui, gui.model):
                return True
        msg = 'A topic model must be computed using step "MODEL Compute an LDA Topic Model"'
        raise TopicModelNotComputed(msg)

def get_current_model():
    TopicModelNotComputed.check()
    return globals()['TM_GUI_MODEL'].model


INFO : POS tag set: PUNCT SYM X ADJ VERB CONJ NUM DET ADV ADP  NOUN PROPN PART PRON SPACE INTJ




## <span style='color: green'>PREPARE: </span> Load and Prepare the Text Corpus <span style='color: red; float: right'>MANDATORY RUN</span>

In [3]:
from spacy.language import Language
from textacy.spacier.utils import merge_spans

def preprocess_text(source_filename, target_filename=None):
    filenames = get_filenames(source_filename)
    basename, extension = os.path.splitext(source_filename)
    target_filename = target_filename or basename + '_preprocessed' + extension
    texts = ( (filename, get_text(source_filename, filename)) for filename in filenames )
    with zipfile.ZipFile(target_filename, 'w', zipfile.ZIP_DEFLATED) as zf:
        for filename, text in texts:
            logger.info('Processing ' + filename)
            text = re.sub(HYPHEN_REGEXP, r"\1\2\n", text)
            text = textacy.preprocess.normalize_whitespace(text)   
            text = textacy.preprocess.fix_bad_unicode(text)   
            text = textacy.preprocess.replace_currency_symbols(text)
            text = textacy.preprocess.unpack_contractions(text)
            text = textacy.preprocess.replace_urls(text)
            text = textacy.preprocess.replace_emails(text)
            text = textacy.preprocess.replace_phone_numbers(text)
            text = textacy.preprocess.remove_accents(text)
            zf.writestr(filename, text)
            
def create_textacy_corpus(source_filename, language, preprocess_args):
    make_title = lambda filename: filename.replace('_', ' ').replace('.txt', '').title()
    filenames = get_filenames(source_filename)
    corpus = textacy.Corpus(language)
    text_stream = ( (filename, get_text(source_filename, filename)) for filename in filenames )
    for filename, text in text_stream:
        logger.info('Processing ' + filename)
        text = re.sub(HYPHEN_REGEXP, r"\1\2\n", text)
        text = textacy.preprocess.preprocess_text(text, **preprocess_args)
        corpus.add_text(text, dict(filename=filename, title=make_title(filename)))
    for doc in corpus:
        doc.spacy_doc.user_data['title'] = doc.metadata['title']
    return corpus

def remove_whitespace_entities(doc):
    doc.ents = [ e for e in doc.ents if not e.text.isspace() ]
    return doc

def generate_textacy_corpus(source_filename, language, corpus_args, preprocess_args, merge_named_entities=True, force=False):
    
    corpus_tag = '_'.join([ k for k in preprocess_args if preprocess_args[k] ]) + \
        '_disable(' + ','.join(corpus_args.get('disable', [])) +')'
    
    textacy_corpus_filename = os.path.join(SOURCE_FOLDER, 'corpus_{}_{}.pkl'.format(language, corpus_tag))
    
    Language.factories['remove_whitespace_entities'] = lambda nlp, **cfg: remove_whitespace_entities
    
    logger.info('Loading model: english...')
    nlp = textacy.load_spacy('en_core_web_sm', **corpus_args)
    pipeline = lambda: [ x[0] for x in nlp.pipeline ]
        
    logger.info('Using pipeline: ' + ' '.join(pipeline()))

    if force or not os.path.isfile(textacy_corpus_filename):
        logger.info('Working: Computing new corpus ' + textacy_corpus_filename + '...')
        corpus = create_textacy_corpus(source_filename, nlp, preprocess_args)
        corpus.save(textacy_corpus_filename)
    else:
        logger.info('Working: Loading corpus ' + textacy_corpus_filename + '...')
        corpus = textacy.Corpus.load(textacy_corpus_filename)
        
    if merge_named_entities:
        logger.info('Working: Merging named entities...')
        for doc in corpus:
            named_entities = textacy.extract.named_entities(doc)
            merge_spans(named_entities, doc.spacy_doc)
    else:
        logger.info('Note: named entities not merged')
        
    logger.info('Done!')
    return textacy_corpus_filename, corpus

def assign_document_titles(corpus):
    for doc in corpus:
        doc.spacy_doc.user_data['title'] = doc.metadata['title']
    
def get_corpus_documents(corpus):
    df = pd.DataFrame([
        (document_id, doc.metadata['title'], doc.metadata['filename'])
                for document_id, doc in enumerate(corpus) ], columns=['document_id', 'title', 'filename']
    ).set_index('document_id')
    return df

if not os.path.isfile(PREPROCESSED_TEXT_FILENAME):
    logger.info("Preprocessing text archive...")
    preprocess_text(EDITED_TEXT_FILENAME, PREPROCESSED_TEXT_FILENAME)
    
TEXTACY_CORPUS_FILENAME, CORPUS = generate_textacy_corpus(PREPROCESSED_TEXT_FILENAME, LANGUAGE, corpus_args=dict(), preprocess_args=dict(), merge_named_entities=True, force=False)


INFO : Loading model: english...
INFO : Using pipeline: tagger parser ner
INFO : Working: Loading corpus ./data/corpus_en__disable().pkl...
INFO : Working: Merging named entities...
INFO : Done!


## <span style='color: green'>PREPARE/DESCRIBE </span> Clean Up the Text <span style='float: right; color: green'>TRY IT</span>

In [21]:
def display_cleanup_text_gui(corpus, callback):
    
    documents = get_corpus_documents(corpus)
    document_options = {v: k for k, v in documents['title'].to_dict().items()}
    
    #pos_options = [ x for x in DF_TAGSET.POS.unique() if x not in ['PUNCT', '', 'DET', 'X', 'SPACE', 'PART', 'CONJ', 'SYM', 'INTJ', 'PRON']]  # groupby(['POS'])['DESCRIPTION'].apply(list).apply(lambda x: ', '.join(x)).to_dict()
    pos_tags = DF_TAGSET.groupby(['POS'])['DESCRIPTION'].apply(list).apply(lambda x: ', '.join(x[:1])).to_dict()
    pos_options = { k + ' (' + v + ')': k for k,v in pos_tags.items() }
    display_options = {
        'Source text (raw)': 'source_text_raw',
        'Source text (edited)': 'source_text_edited',
        'Source text (processed)': 'source_text_preprocessed',
        'Sanitized text': 'sanitized_text',
        'Statistics': 'statistics'
    }

    gui = types.SimpleNamespace(
        document_id=widgets.Dropdown(description='Paper', options=document_options, value=0, layout=widgets.Layout(width='400px')),
        progress=widgets.IntProgress(value=0, min=0, max=5, step=1, description='', layout=widgets.Layout(width='90%')),
        min_freq=widgets.FloatSlider(value=0, min=0, max=1.0, step=0.01, description='Min frequency', layout=widgets.Layout(width='400px')),
        ngrams=widgets.Dropdown(description='n-grams', options=[1,2,3], value=1, layout=widgets.Layout(width='180px')),
        min_word=widgets.Dropdown(description='Min length', options=[1,2,3,4], value=1, layout=widgets.Layout(width='180px')),
        normalize=widgets.Dropdown(description='Normalize', options=[ False, 'lemma', 'lower' ], value=False, layout=widgets.Layout(width='180px')),
        filter_stops=widgets.ToggleButton(value=False, description='Filter stops',  tooltip='Filter out stopwords', icon='check'),
        filter_nums=widgets.ToggleButton(value=False, description='Filter nums',  tooltip='Filter out numbers', icon='check'),
        filter_punct=widgets.ToggleButton(value=False, description='Filter punct',  tooltip='Filter out punctuations', icon='check'),
        named_entities=widgets.ToggleButton(value=False, description='Merge entities',  tooltip='Merge entities', icon='check'),
        drop_determiners=widgets.ToggleButton(value=False, description='Drop determiners',  tooltip='Drop determiners', icon='check'),
        include_pos=widgets.SelectMultiple(description='POS', options=pos_options, value=list(), rows=10, layout=widgets.Layout(width='400px')),
        display_type=widgets.Dropdown(description='Show', value='statistics', options=display_options, layout=widgets.Layout(width='180px')),
        output_text=widgets.Output(layout={'height': '500px'}),
        output_statistics = widgets.Output(),
        boxes=None
    )
    
    uix = widgets.interactive(

        callback,

        corpus=widgets.fixed(corpus),
        gui=widgets.fixed(gui),
        display_type=gui.display_type,
        document_id=gui.document_id,
        
        ngrams=gui.ngrams,
        named_entities=gui.named_entities,
        normalize=gui.normalize,
        filter_stops=gui.filter_stops,
        filter_punct=gui.filter_punct,
        filter_nums=gui.filter_nums,
        include_pos=gui.include_pos,
        min_freq=gui.min_freq,
        drop_determiners=gui.drop_determiners
    )
    
    gui.boxes = widgets.VBox([
        gui.progress,
        widgets.HBox([
            widgets.VBox([
                gui.document_id,
                widgets.HBox([gui.display_type, gui.normalize]),
                widgets.HBox([gui.ngrams, gui.min_word]),
                gui.min_freq
            ]),
            widgets.VBox([
                gui.include_pos
            ]),
            widgets.VBox([
                gui.filter_stops,
                gui.filter_nums,
                gui.filter_punct,
                gui.named_entities,
                gui.drop_determiners
            ])
        ]),
        widgets.HBox([
            gui.output_text, gui.output_statistics
        ]),
        uix.children[-1]
    ])
    
    display(gui.boxes)
                                  
    uix.update()
    return gui, uix

def plot_xy_data(data, title='', xlabel='', ylabel='', **kwargs):
    x, y = list(data[0]), list(data[1])
    labels = x
    plt.figure(figsize=(10, 10 / 1.618))
    plt.plot(x, y, 'ro', **kwargs)
    plt.xticks(x, labels, rotation=45)
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.show()
    
def display_cleaned_up_text(corpus, gui, display_type, document_id, **kwargs): # ngrams, named_entities, normalize, include_pos):
    
    gui.output_text.clear_output()
    gui.output_statistics.clear_output()
    
    #Additional candidates;
    #is_alpha	bool	Does the token consist of alphabetic characters? Equivalent to token.text.isalpha().
    #is_ascii	bool	Does the token consist of ASCII characters? Equivalent to all(ord(c) < 128 for c in token.text).
    #like_url	bool	Does the token resemble a URL?
    #like_email	bool	Does the token resemble an email address?

    doc = corpus[document_id]
    
    terms = [ x for x in doc.to_terms_list(as_strings=True, **kwargs) ]
    
    if display_type.startswith('source_text'):
        
        source_filename = SOURCE_FILES[display_type]['filename']
        description =  SOURCE_FILES[display_type]['description']
        text = get_text(source_filename, doc.metadata['filename'])
        with gui.output_text:
            #print('{}\n.................\n(NOT SHOWN TEXT)\n.................\n{}'.format(document[:2500], document[-250:]))
            #print(doc)
            print('[ ' + description.upper() + ' ]')
            print(text)
        return

    if len(terms) == 0:
        with gui.output_text:
            print("No text. Please change selection.")
        return
    
    if display_type in ['sanitized_text', 'statistics']:

        if display_type == 'sanitized_text':
            with gui.output_text:
                #display('{}\n.................\n(NOT SHOWN TEXT)\n.................\n{}'.format(
                #    ' '.join(tokens[:word_count]),
                #    ' '.join(tokens[-word_count:])
                #))
                print(' '.join(list(terms)))
                return

        if display_type == 'statistics':

            wf = nltk.FreqDist(terms)

            with gui.output_text:

                print('Word count (number of terms): {}'.format(wf.N()))
                print('Unique word count (vocabulary): {}'.format(wf.B()))
                print(' ')

                df = pd.DataFrame(wf.most_common(25), columns=['token','count'])
                display(df)

            with gui.output_statistics:

                data = list(zip(*wf.most_common(25)))
                plot_xy_data(data, title='Word distribution', xlabel='Word', ylabel='Word count')

                wf = nltk.FreqDist([len(x) for x in terms])
                data = list(zip(*wf.most_common(25)))
                plot_xy_data(data, title='Word length distribution', xlabel='Word length', ylabel='Word count')

xgui, xuix = display_cleanup_text_gui(CORPUS, display_cleaned_up_text)




## <span style='color: green;'>MODEL</span> Compute an LDA Topic Model<span style='color: red; float: right'>MANDATORY RUN</span>


In [22]:
import types

class LdaDataCompiler():
    
    @staticmethod
    def compile_dictionary(model):
        logger.info('Compiling dictionary...')
        token_ids, tokens = list(zip(*model.id2word.items()))
        dfs = model.id2word.dfs.values() if model.id2word.dfs is not None else [0] * len(tokens)
        dictionary = pd.DataFrame({
            'token_id': token_ids,
            'token': tokens,
            'dfs': list(dfs)
        }).set_index('token_id')[['token', 'dfs']]
        return dictionary

    @staticmethod
    def compile_topic_token_weights(tm, dictionary, num_words=200):
        logger.info('Compiling topic-tokens weights...')

        df_topic_weights = pd.DataFrame(
            [ (topic_id, token, weight)
                for topic_id, tokens in (tm.show_topics(tm.num_topics, num_words=num_words, formatted=False))
                    for token, weight in tokens if weight > 0.0 ],
            columns=['topic_id', 'token', 'weight']
        )

        df = pd.merge(
            df_topic_weights.set_index('token'),
            dictionary.reset_index().set_index('token'),
            how='inner',
            left_index=True,
            right_index=True
        )
        return df.reset_index()[['topic_id', 'token_id', 'token', 'weight']]

    @staticmethod
    def compile_topic_token_overview(topic_token_weights, alpha, n_words=200):
        """
        Group by topic_id and concatenate n_words words within group sorted by weight descending.
        There must be a better way of doing this...
        """
        logger.info('Compiling topic-tokens overview...')

        df = topic_token_weights.groupby('topic_id')\
            .apply(lambda x: sorted(list(zip(x["token"], x["weight"])), key=lambda z: z[1], reverse=True))\
            .apply(lambda x: ' '.join([z[0] for z in x][:n_words])).reset_index()
        df['alpha'] = df.topic_id.apply(lambda topic_id: alpha[topic_id])
        df.columns = ['topic_id', 'tokens', 'alpha']

        return df.set_index('topic_id')

    @staticmethod
    def compile_document_topics(model, corpus, documents, minimum_probability=0.001):

        def document_topics_iter(model, corpus, minimum_probability):

            data_iter = model.get_document_topics(corpus, minimum_probability=minimum_probability)\
                if hasattr(model, 'get_document_topics')\
                else model.load_document_topics()

            for i, topics in enumerate(data_iter):
                for x in topics:
                    yield (i, x[0], x[1])
        '''
        Get document topic weights for all documents in corpus
        Note!  minimum_probability=None filters less probable topics, set to 0 to retrieve all topics

        If gensim model then use 'get_document_topics', else 'load_document_topics' for mallet model
        '''
        logger.info('Compiling document topics...')
        logger.info('  Creating data iterator...')
        data = document_topics_iter(model, corpus, minimum_probability)
        logger.info('  Creating frame from iterator...')
        df_doc_topics = pd.DataFrame(data, columns=[ 'document_id', 'topic_id', 'weight' ]).set_index('document_id')
        logger.info('  Merging data...')
        df = pd.merge(documents, df_doc_topics, how='inner', left_index=True, right_index=True)
        return df

    @staticmethod
    def compute_compiled_data(model, corpus, id2term, documents):

        dictionary = LdaDataCompiler.compile_dictionary(model)
        topic_token_weights = LdaDataCompiler.compile_topic_token_weights(model, dictionary, num_words=200)
        topic_token_overview = LdaDataCompiler.compile_topic_token_overview(topic_token_weights, model.alpha)
        document_topic_weights = LdaDataCompiler.compile_document_topics(model, corpus, documents, minimum_probability=0.001)

        return types.SimpleNamespace(
            dictionary=dictionary,
            documents=documents,
            topic_token_weights=topic_token_weights,
            topic_token_overview=topic_token_overview,
            document_topic_weights=document_topic_weights
        )
    
    @staticmethod
    def get_topic_titles(topic_token_weights, topic_id=None, n_words=100):
        df_temp = topic_token_weights if topic_id is None else topic_token_weights[(topic_token_weights.topic_id==topic_id)]
        df = df_temp\
                .sort_values('weight', ascending=False)\
                .groupby('topic_id')\
                .apply(lambda x: ' '.join(x.token[:n_words].str.title()))
        return df

    @staticmethod
    def get_topic_title(topic_token_weights, topic_id, n_words=100):
        return LdaDataCompiler.get_topic_titles(topic_token_weights, topic_id, n_words=n_words).iloc[0]

    #get_topics_tokens_as_text = get_topic_titles
    #get_topic_tokens_as_text = get_topic_title

    @staticmethod
    def get_topic_tokens(topic_token_weights, topic_id=None, n_words=100):
        df_temp = topic_token_weights if topic_id is None else topic_token_weights[(topic_token_weights.topic_id == topic_id)]
        df = df_temp.sort_values('weight', ascending=False)[:n_words]
        return df
    
    @staticmethod
    def get_lda_topics(model, n_tokens=20):
        return pd.DataFrame({
            'Topic#{:02d}'.format(topic_id+1) : [ word[0] for word in model.show_topic(topic_id, topn=n_tokens) ]
                for topic_id in range(model.num_topics)
        })

# OBS OBS! https://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html
DEFAULT_VECTORIZE_PARAMS = dict(tf_type='linear', apply_idf=False, idf_type='smooth', norm='l2', min_df=1, max_df=0.95)

def compute_topic_model(corpus, tick=utility.noop, method='sklearn_lda', vec_args=None, term_args=None, tm_args=None, **args):
    
    tick()
    vec_args = utility.extend({}, DEFAULT_VECTORIZE_PARAMS, vec_args)
    
    terms_iter = lambda: (filter_terms(doc, term_args) for doc in corpus)
    tick()
    
    vectorizer = textacy.Vectorizer(**vec_args)
    doc_term_matrix = vectorizer.fit_transform(terms_iter())

    if method.startswith('sklearn'):
        tm_model = textacy.TopicModel(method.split('_')[1], **tm_args)
        tm_model.fit(doc_term_matrix)
        tick()
        doc_topic_matrix = tm_model.transform(doc_term_matrix)
        tick()
        tm_id2word = vectorizer.id_to_term
        tm_corpus = matutils.Sparse2Corpus(doc_term_matrix, documents_columns=False)
        compiled_data = None # FIXME
    else:
        doc_topic_matrix = None # ?
        tm_id2word = corpora.Dictionary(terms_iter())
        tm_corpus = [ tm_id2word.doc2bow(text) for text in terms_iter() ]
        #tm_id2word = vectorizer.id_to_term
        #tm_corpus = matutils.Sparse2Corpus(doc_term_matrix, documents_columns=False)
        tm_model = models.LdaModel(
            tm_corpus, 
            num_topics  =  tm_args.get('n_topics', 0),
            id2word     =  tm_id2word,
            iterations  =  tm_args.get('max_iter', 0),
            passes      =  20,
            alpha       = 'asymmetric'
        )
        documents = get_corpus_documents(corpus)
        compiled_data = LdaDataCompiler.compute_compiled_data(tm_model, tm_corpus, tm_id2word, documents)
    
    tm_data = types.SimpleNamespace(
        tm_model=tm_model,
        tm_id2term=tm_id2word,
        tm_corpus=tm_corpus,
        doc_term_matrix=doc_term_matrix,
        doc_topic_matrix=doc_topic_matrix,
        vectorizer=vectorizer,
        compiled_data=compiled_data
    )
    
    tick(0)
    
    return tm_data

def get_doc_topic_weights(doc_topic_matrix, threshold=0.05):
    topic_ids = range(0,doc_topic_matrix.shape[1])
    for document_id in range(0,doc_topic_matrix.shape[0]):
        topic_weights = doc_topic_matrix[document_id, :]
        for topic_id in topic_ids:
            if topic_weights[topic_id] >= threshold:
                yield (document_id, topic_id, topic_weights[topic_id])

def get_df_doc_topic_weights(doc_topic_matrix, threshold=0.05):
    it = get_doc_topic_weights(doc_topic_matrix, threshold)
    df = pd.DataFrame(list(it), columns=['document_id', 'topic_id', 'weight']).set_index('document_id')
    return df

def display_topic_model_gui(corpus, compute_callback):
    
    pos_options = [ x for x in DF_TAGSET.POS.unique() if x not in ['PUNCT', '', 'DET', 'X', 'SPACE', 'PART', 'CONJ', 'SYM', 'INTJ', 'PRON']]
    # groupby(['POS'])['DESCRIPTION'].apply(list).apply(lambda x: ', '.join(x)).to_dict()
    engine_options = { 'gensim': 'gensim' } #, 'sklearn_lda': 'sklearn_lda'}
    normalize_options = { 'None': False, 'Use lemma': 'lemma', 'Lowercase': 'lower'}
    ngrams_options = { '1': [1], '1, 2': [1, 2], '1,2,3': [1, 2, 3] }
    gui = types.SimpleNamespace(
        progress=widgets.IntProgress(value=0, min=0, max=5, step=1, description='', layout=widgets.Layout(width='90%')),
        n_topics=widgets.IntSlider(description='#topics', min=5, max=50, value=20, step=1),
        min_freq=widgets.IntSlider(description='Min word freq', min=0, max=10, value=2, step=1),
        max_iter=widgets.IntSlider(description='Max iterations', min=100, max=1000, value=100, step=10),
        ngrams=widgets.Dropdown(description='n-grams', options=ngrams_options, value=[1], layout=widgets.Layout(width='200px')),
        normalize=widgets.Dropdown(description='Normalize', options=normalize_options, value='lemma', layout=widgets.Layout(width='200px')),
        filter_stops=widgets.ToggleButton(value=True, description='Remove stopword',  tooltip='Filter out stopwords', icon='check'),
        filter_nums=widgets.ToggleButton(value=True, description='Remove nums',  tooltip='Filter out numbers', icon='check'),
        named_entities=widgets.ToggleButton(value=False, description='Merge entities',  tooltip='Merge entities', icon='check'),
        drop_determiners=widgets.ToggleButton(value=True, description='Drop determiners',  tooltip='Drop determiners', icon='check'),
        apply_idf=widgets.ToggleButton(value=False, description='Apply IDF',  tooltip='Apply TF-IDF', icon='check'),
        include_pos=widgets.SelectMultiple(description='POS', options=pos_options, value=['NOUN', 'PROPN'], rows=7, layout=widgets.Layout(width='200px')),
        method=widgets.Dropdown(description='Engine', options=engine_options, value='gensim', layout=widgets.Layout(width='200px')),
        compute=widgets.Button(description='Compute'),
        boxes=None,
        output = widgets.Output(layout={'height': '500px'}),
        model=None
    )
    gui.boxes = widgets.VBox([
        gui.progress,
        widgets.HBox([
            widgets.VBox([
                gui.n_topics,
                gui.min_freq,
                gui.max_iter
            ]),
            widgets.VBox([
                gui.filter_stops,
                gui.filter_nums,
                gui.named_entities,
                gui.drop_determiners,
                gui.apply_idf
            ]),
            widgets.VBox([
                gui.normalize,
                gui.ngrams,
                gui.method
            ]),
            gui.include_pos,
            widgets.VBox([
                gui.compute
            ])
        ]),
        widgets.VBox([gui.output]), # ,layout=widgets.Layout(top='20px', height='500px',width='100%'))
    ])
    fx = lambda *args: compute_callback(corpus, gui, *args)
    gui.compute.on_click(fx)
    return gui
    

def compute_callback(corpus, gui, *args):
    
    def tick(x=None):
        gui.progress.value = gui.progress.value + 1 if x is None else x
    
    tick(1)
    gui.output.clear_output()
    with gui.output:
        vec_args = dict(apply_idf=gui.apply_idf.value)
        term_args = dict(
            args=dict(
                ngrams=gui.ngrams.value,
                named_entities=gui.named_entities.value,
                normalize=gui.normalize.value,
                as_strings=True
            ),
            kwargs=dict(
                filter_nums=gui.filter_nums.value,
                drop_determiners=gui.drop_determiners.value,
                min_freq=gui.min_freq.value,
                include_pos=gui.include_pos.value,
                filter_stops=gui.filter_stops.value,
                filter_punct=True
            )
        )
        tm_args = dict(
            n_topics=gui.n_topics.value,
            max_iter=gui.max_iter.value,
            learning_method='online', 
            n_jobs=1
        )
        method = gui.method.value
        gui.model = compute_topic_model(
            corpus=corpus,
            tick=tick,
            method=method,
            vec_args=vec_args,
            term_args=term_args,
            tm_args=tm_args
        )
    gui.output.clear_output()
    with gui.output:
        #display(gui.model.compiled_data.topic_token_overview)
        display(LdaDataCompiler.get_lda_topics(gui.model.tm_model, n_tokens=20))
        
TM_GUI_MODEL = display_topic_model_gui(CORPUS, compute_callback)
display(TM_GUI_MODEL.boxes)



## <span style='color: green;'>MODEL</span> Display Named Entities<span style='color: green; float: right'>SKIP</span>

In [96]:
def display_document_entities_gui(corpus):
    
    def display_document_entities(document_id, corpus):
        displacy.render(corpus[document_id].spacy_doc, style='ent', jupyter=True)
    
    df_documents = get_corpus_documents(corpus)

    document_widget = widgets.Dropdown(description='Paper', options={v: k for k, v in df_documents['title'].to_dict().items()}, value=0, layout=widgets.Layout(width='80%'))

    itw = widgets.interactive(display_document_entities,document_id=document_widget, corpus=widgets.fixed(corpus))

    display(widgets.VBox([document_widget, widgets.VBox([itw.children[-1]],layout=widgets.Layout(margin_top='20px', height='500px',width='100%'))]))

    itw.update()
    
try:
    display_document_entities_gui(CORPUS)
except Exception as ex:
    logger.error(ex)



## <span style='color: green;'>VISUALIZE</span> Display Topic's Word Distribution as a Wordcloud<span style='color: red; float: right'>TRY IT</span>

In [8]:
# Display LDA topic's token wordcloud
opts = { 'max_font_size': 100, 'background_color': 'white', 'width': 900, 'height': 600 }
import wordcloud
import matplotlib.pyplot as plt
import common.widgets_utility as widgets_utility

def display_wordcloud_gui(callback, tm_data, text_id, output_options=None, word_count=(1, 100, 50)):
    model = tm_data.tm_model
    output_options = output_options or []
    wf = widgets_utility.wf
    wc = widgets_utility.WidgetUtility(
        n_topics=model.num_topics,
        text_id=text_id,
        text=wf.create_text_widget(text_id),
        topic_id=widgets.IntSlider(
            description='Topic ID', min=0, max=model.num_topics - 1, step=1, value=0, continuous_update=False),
        word_count=widgets.IntSlider(
            description='#Words', min=word_count[0], max=word_count[1], step=1, value=word_count[2], continuous_update=False),
        output_format=wf.create_select_widget('Format', output_options, default=output_options[0], layout=widgets.Layout(width="200px")),
        progress = widgets.IntProgress(min=0, max=4, step=1, value=0, layout=widgets.Layout(width="95%"))
    )

    wc.prev_topic_id = wc.create_prev_id_button('topic_id', model.num_topics)
    wc.next_topic_id = wc.create_next_id_button('topic_id', model.num_topics)

    iw = widgets.interactive(
        callback,
        tm_data=widgets.fixed(tm_data),
        topic_id=wc.topic_id,
        n_words=wc.word_count,
        output_format=wc.output_format,
        widget_container=widgets.fixed(wc)
    )

    display(widgets.VBox([
        wc.text,
        widgets.HBox([wc.prev_topic_id, wc.next_topic_id, wc.topic_id, wc.word_count, wc.output_format]),
        wc.progress,
        iw.children[-1]
    ]))

    iw.update()

def plot_wordcloud(df_data, token='token', weight='weight', figsize=(14, 14/1.618), **args):
    token_weights = dict({ tuple(x) for x in df_data[[token, weight]].values })
    image = wordcloud.WordCloud(**args,)
    image.fit_words(token_weights)
    plt.figure(figsize=figsize) #, dpi=100)
    plt.imshow(image, interpolation='bilinear')
    plt.axis("off")
    plt.show()
    
def display_wordcloud(
    tm_data,
    topic_id=0,
    n_words=100,
    output_format='Wordcloud',
    widget_container=None
):
    container = tm_data.compiled_data
    widget_container.progress.value = 1
    df_temp = container.topic_token_weights.loc[(container.topic_token_weights.topic_id == topic_id)]
    tokens = LdaDataCompiler.get_topic_title(container.topic_token_weights, topic_id, n_words=n_words)
    widget_container.progress.value = 2
    widget_container.text.value = 'ID {}: {}'.format(topic_id, tokens)
    if output_format == 'Wordcloud':
        plot_wordcloud(df_temp, 'token', 'weight', max_words=n_words, **opts)
    elif output_format == 'Table':
        widget_container.progress.value = 3
        df_temp = LdaDataCompiler.get_topic_tokens(container.topic_token_weights, topic_id=topic_id, n_words=n_words)
        widget_container.progress.value = 4
        display(HTML(df_temp.to_html()))
    else:
        display(pivot_ui(LdaDataCompiler.get_topic_tokens(container.topic_token_weights, topic_id=topic_id, n_words=n_words)))
    widget_container.progress.value = 0

try:
    tm_data = get_current_model()
    display_wordcloud_gui(display_wordcloud, tm_data, 'tx02', ['Wordcloud', 'Table', 'Pivot'])
except TopicModelNotComputed as ex:
    logger.info(ex)
    


## <span style='color: green;'>VISUALIZE</span> Display Topic's Word Distribution as a Chart<span style='color: red; float: right'>TRY IT</span>
The following chart shows the word distribution for each selected topic. You can zoom in on the left chart. The distribution seems to follow [Zipf's law](https://en.wikipedia.org/wiki/Zipf%27s_law) as (perhaps) expected.
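One rough way to test the Zipf's law impression is to fit a straight line to log(weight) against log(rank); a slope close to -1 is the classic Zipf signature. The sketch below uses only the standard library. The helper `zipf_slope` and the sample weights are hypothetical and not part of the notebook.

```python
import math

def zipf_slope(weights):
    """Least-squares slope of log(weight) against log(rank).

    Hypothetical helper (not part of the notebook): weights are sorted in
    descending order and ranked from 1. Non-positive weights are dropped.
    """
    ranked = sorted((w for w in weights if w > 0), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two positive weights")
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(w) for w in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Synthetic weights that follow weight = 0.2 / rank
sample_weights = [0.2 / rank for rank in range(1, 51)]
slope = zipf_slope(sample_weights)
print(round(slope, 3))
```

With real topic weights the slope will rarely be exactly -1; how far it deviates gives a feel for how "Zipfian" a topic's word distribution is.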

In [9]:
# Display topic's word distribution
if False:
    from common.model_utility import ModelUtility
    from common.plot_utility import layout_algorithms, PlotNetworkUtility
    from common.network_utility import NetworkUtility, DISTANCE_METRICS, NetworkMetricHelper

    import math

    from itertools import product
    
    import bokeh.models as bm
    import bokeh.palettes
    from bokeh.io import output_file, push_notebook
    from bokeh.core.properties import value, expr
    from bokeh.transform import transform, jitter
    from bokeh.layouts import row, column, widgetbox
    from bokeh.models.widgets import DataTable, DateFormatter, TableColumn
    from bokeh.models import ColumnDataSource, CustomJS
    
def plot_topic_word_distribution(tokens, **args):

    source = bokeh.models.ColumnDataSource(tokens)

    p = bokeh.plotting.figure(toolbar_location="right", **args)

    cr = p.circle(x='xs', y='ys', source=source)

    label_style = dict(level='overlay', text_font_size='8pt', angle=np.pi/6.0)

    text_aligns = ['left', 'right']
    for i in [0, 1]:
        label_source = bokeh.models.ColumnDataSource(tokens.iloc[i::2])
        labels = bokeh.models.LabelSet(x='xs', y='ys', text_align=text_aligns[i], text='token', text_baseline='middle',
                          y_offset=5*(1 if i == 0 else -1),
                          x_offset=5*(1 if i == 0 else -1),
                          source=label_source, **label_style)
        p.add_layout(labels)

    p.xaxis[0].axis_label = 'Token #'
    p.yaxis[0].axis_label = 'Probability%'
    p.ygrid.grid_line_color = None
    p.xgrid.grid_line_color = None
    p.axis.axis_line_color = None
    p.axis.major_tick_line_color = None
    p.axis.major_label_text_font_size = "6pt"
    p.axis.major_label_standoff = 0
    return p

def display_topic_tokens(tm_data, topic_id=0, n_words=100, output_format='Chart', widget_container=None):
    widget_container.forward()
    container = tm_data.compiled_data
    tokens = LdaDataCompiler.get_topic_tokens(container.topic_token_weights, topic_id=topic_id).\
        copy()\
        .drop('topic_id', axis=1)\
        .assign(weight=lambda x: 100.0 * x.weight)\
        .sort_values('weight', axis=0, ascending=False)\
        .reset_index()\
        .head(n_words)
    if output_format == 'Chart':
        widget_container.forward()
        tokens = tokens.assign(xs=tokens.index, ys=tokens.weight)
        p = plot_topic_word_distribution(tokens, plot_width=1000, plot_height=500, title='', tools='box_zoom,wheel_zoom,pan,reset')
        bokeh.plotting.show(p)
        widget_container.forward()
    elif output_format == 'Table':
        #display(tokens)
        display(HTML(tokens.to_html()))
    else:
        display(pivot_ui(tokens))
    widget_container.reset()
    
def display_topic_distribution_widgets(callback, tm_data, text_id, output_options=None, word_count=(1, 100, 50)):
    
    output_options = output_options or []
    model = tm_data.tm_model
    wf = widgets_utility.wf
    wc = widgets_utility.WidgetUtility(
        n_topics=model.num_topics,
        text_id=text_id,
        text=wf.create_text_widget(text_id),
        topic_id=widgets.IntSlider(
            description='Topic ID', min=0, max=model.num_topics - 1, step=1, value=0, continuous_update=False),
        word_count=widgets.IntSlider(
            description='#Words', min=word_count[0], max=word_count[1], step=1, value=word_count[2], continuous_update=False),
        output_format=wf.create_select_widget('Format', output_options, default=output_options[0], layout=widgets.Layout(width="200px")),
        progress = widgets.IntProgress(min=0, max=4, step=1, value=0, layout=widgets.Layout(width="95%"))
    )

    wc.prev_topic_id = wc.create_prev_id_button('topic_id', model.num_topics)
    wc.next_topic_id = wc.create_next_id_button('topic_id', model.num_topics)

    iw = widgets.interactive(
        callback,
        tm_data=widgets.fixed(tm_data),
        topic_id=wc.topic_id,
        n_words=wc.word_count,
        output_format=wc.output_format,
        widget_container=widgets.fixed(wc)
    )

    display(widgets.VBox([
        wc.text,
        widgets.HBox([wc.prev_topic_id, wc.next_topic_id, wc.topic_id, wc.word_count, wc.output_format]),
        wc.progress,
        iw.children[-1]
    ]))

    iw.update()
TM_DATA = TM_GUI_MODEL.model

display_topic_distribution_widgets(display_topic_tokens, TM_DATA, 'wc01', ['Chart', 'Table'])



## <span style='color: green;'>VISUALIZE</span> Display Topic's Trend Over Time or Documents<span style='color: red; float: right'>RUN</span>
- Displays a topic's share across documents.
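As a minimal sketch of the filtering behind this view: for one topic, keep only the documents where the topic's weight exceeds a threshold. The rows and the `topic_share` helper are hypothetical stand-ins; the notebook reads the real values from `document_topic_weights`.

```python
def topic_share(rows, topic_id, threshold=0.01):
    """Return (title, weight) pairs for one topic, keeping weights above threshold.

    `rows` is a hypothetical stand-in for the notebook's document-topic
    weights: an iterable of (title, topic_id, weight) tuples.
    """
    return [
        (title, weight)
        for title, tid, weight in rows
        if tid == topic_id and weight > threshold
    ]

rows = [
    ("Paper A", 0, 0.62), ("Paper A", 1, 0.38),
    ("Paper B", 0, 0.005), ("Paper B", 1, 0.995),
    ("Paper C", 0, 0.30), ("Paper C", 1, 0.70),
]

for title, weight in topic_share(rows, topic_id=0):
    # Crude text bar: roughly one '#' per 5 percentage points
    print(f"{title:8s} {weight:5.2f} {'#' * int(weight * 20)}")
```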

In [57]:
# Plot a topic's yearly weight over time in selected LDA topic model
#import numpy as np
#import math
#import bokeh.plotting
#from bokeh.models import ColumnDataSource, DataRange1d, Plot, LinearAxis, Grid
#from bokeh.models.glyphs import VBar
#from bokeh.io import curdoc, show

import math

def plot_topic_trend(df, pivot_column, value_column, x_label=None, y_label=None):
    tools = "pan,wheel_zoom,box_zoom,reset,previewsave"

    xs = df[pivot_column].astype(str)  # np.str was removed from recent NumPy releases
    p = bokeh.plotting.figure(x_range=xs, plot_width=1000, plot_height=700, title='', tools=tools, toolbar_location="right")

    glyph = p.vbar(x=xs, top=df[value_column], width=0.5, fill_color="#b3de69")
    p.xaxis.major_label_orientation = math.pi/4
    p.xgrid.grid_line_color = None
    p.xaxis[0].axis_label = (x_label or '').title()
    p.yaxis[0].axis_label = (y_label or '').title()
    p.y_range.start = 0.0
    #p.y_range.end = 1.0
    p.x_range.range_padding = 0.01
    return p

def display_topic_trend(topic_id, widgets_container, output_format='Chart', tm_data=None, threshold=0.01):
    container = tm_data.compiled_data
    tokens = LdaDataCompiler.get_topic_title(container.topic_token_weights, topic_id, n_words=200)
    widgets_container.text.value = 'ID {}: {}'.format(topic_id, tokens)
    value_column = 'weight'
    category_column = 'author'
    df = container.document_topic_weights[(container.document_topic_weights.topic_id==topic_id)]
    df = df[(df.weight > threshold)].reset_index()
    df[category_column] = df.title.apply(slim_title)

    if output_format == 'Table':
        display(df)
    else:
        x_label = category_column.title()
        y_label = value_column.title()
        p = plot_topic_trend(df, category_column, value_column, x_label=x_label, y_label=y_label)
        bokeh.plotting.show(p)

def create_topic_trend_widgets(tm_data):
    
    model = tm_data.tm_model
    wf = widgets_utility.wf
    wc = widgets_utility.WidgetUtility(
        n_topics=model.num_topics,
        text_id='topic_share_plot',
        text=wf.create_text_widget('topic_share_plot'),
        threshold=widgets.FloatSlider(description='Threshold', min=0.0, max=0.25, step=0.01, value=0.10, continuous_update=False),
        topic_id=widgets.IntSlider(description='Topic ID', min=0, max=model.num_topics - 1, step=1, value=0, continuous_update=False),
        output_format=wf.create_select_widget('Format', ['Chart', 'Table'], default='Chart'),
        progress=widgets.IntProgress(min=0, max=4, step=1, value=0, layout=widgets.Layout(width="50%")),
    )

    wc.prev_topic_id = wc.create_prev_id_button('topic_id', model.num_topics)
    wc.next_topic_id = wc.create_next_id_button('topic_id', model.num_topics)

    iw = widgets.interactive(
        display_topic_trend,
        topic_id=wc.topic_id,
        widgets_container=widgets.fixed(wc),
        output_format=wc.output_format,
        tm_data=widgets.fixed(tm_data),
        threshold=wc.threshold
    )
    display(widgets.VBox([
        wc.text,
        widgets.HBox([wc.prev_topic_id, wc.next_topic_id, wc.output_format]),
        widgets.HBox([wc.topic_id, wc.threshold, wc.progress]),
        iw.children[-1]
    ]))
    
    iw.update()

tm_data = get_current_model()
create_topic_trend_widgets(tm_data)


## <span style='color: green;'>VISUALIZE</span> Display Topic to Document Network<span style='color: red; float: right'>TRY IT</span>
The green nodes are documents and the blue nodes are topics. Each edge (line) indicates the strength of a topic in the connected document, and the width of the edge is proportional to that strength. Note that only edges with a strength above the selected threshold are displayed.
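The edge selection described above can be sketched as follows. This is an illustration under assumptions, not the notebook's `NetworkUtility` code: `build_topic_edges`, the input dictionary and the `max_width` scaling factor are all hypothetical.

```python
def build_topic_edges(doc_topic_weights, threshold=0.5, max_width=6.0):
    """Build (document, topic, width) edges for weights above threshold.

    Hypothetical helper: `doc_topic_weights` maps document title ->
    {topic_id: weight}. Edge width is the weight scaled by `max_width`.
    """
    edges = []
    for document, topics in doc_topic_weights.items():
        for topic_id, weight in topics.items():
            if weight > threshold:
                edges.append((document, f"topic {topic_id}", weight * max_width))
    return edges

weights = {
    "Paper A": {0: 0.75, 1: 0.25},
    "Paper B": {0: 0.25, 1: 0.75},
    "Paper C": {0: 0.5, 1: 0.5},
}
edges = build_topic_edges(weights, threshold=0.5)
for document, topic, width in edges:
    print(document, "--", topic, f"(width {width:.1f})")
```

Note that "Paper C" gets no edges here: neither of its weights is strictly above the threshold, so it would not appear in the network at all.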

In [58]:
# Visualize document-to-topic network by means of topic-document-weights
from common.plot_utility import layout_algorithms, PlotNetworkUtility
from common.network_utility import NetworkUtility, DISTANCE_METRICS, NetworkMetricHelper

def plot_document_topic_network(network, layout, scale=1.0, titles=None):

    year_nodes, topic_nodes = NetworkUtility.get_bipartite_node_set(network, bipartite=0)  
    
    year_source = NetworkUtility.get_node_subset_source(network, layout, year_nodes)
    topic_source = NetworkUtility.get_node_subset_source(network, layout, topic_nodes)
    lines_source = NetworkUtility.get_edges_source(network, layout, scale=6.0, normalize=False)
    
    edges_alphas = NetworkMetricHelper.compute_alpha_vector(lines_source.data['weights'])
    
    lines_source.add(edges_alphas, 'alphas')
    
    p = bokeh.plotting.figure(plot_width=1000, plot_height=600, x_axis_type=None, y_axis_type=None, tools=TOOLS)
    
    r_lines = p.multi_line(
        'xs', 'ys', line_width='weights', alpha='alphas', color='black', source=lines_source
    )
    r_years = p.circle(
        'x','y', size=40, source=year_source, color='lightgreen', level='overlay', line_width=1,alpha=1.0
    )
    
    r_topics = p.circle('x','y', size=25, source=topic_source, color='skyblue', level='overlay', alpha=1.00)
    
    p.add_tools(bokeh.models.HoverTool(renderers=[r_topics], tooltips=None, callback=widgets_utility.wf.\
        glyph_hover_callback(topic_source, 'node_id', text_ids=titles.index, text=titles, element_id='nx_id1'))
    )

    text_opts = dict(x='x', y='y', text='name', level='overlay', x_offset=0, y_offset=0, text_font_size='8pt')
    
    p.add_layout(
        bokeh.models.LabelSet(
            source=year_source, text_color='black', text_align='center', text_baseline='middle', **text_opts
        )
    )
    p.add_layout(
        bokeh.models.LabelSet(
            source=topic_source, text_color='black', text_align='center', text_baseline='middle', **text_opts
        )
    )
    
    return p

def main_topic_network(tm_data):
    
    model = tm_data.tm_model
    text_id = 'nx_id1'
    layout_options = [ 'Circular', 'Kamada-Kawai', 'Fruchterman-Reingold']
    text_widget = widgets_utility.wf.create_text_widget(text_id)  # style="display: inline; height='400px'"),
    scale_widget = widgets.FloatSlider(description='Scale', min=0.0, max=1.0, step=0.01, value=0.1, continuous_update=False)
    threshold_widget = widgets.FloatSlider(description='Threshold', min=0.0, max=1.0, step=0.01, value=0.50, continuous_update=False)
    output_format_widget = widgets_utility.dropdown('Output', { 'Network': 'network', 'Table': 'table' }, 'network')
    layout_widget = widgets_utility.dropdown('Layout', layout_options, 'Fruchterman-Reingold')
    progress_widget = widgets.IntProgress(min=0, max=4, step=1, value=0, layout=widgets.Layout(width="40%"))
    
    def tick(x=None):
        progress_widget.value = progress_widget.value + 1 if x is None else x
        
    def display_topic_network(layout_algorithm, tm_data, threshold=0.10, scale=1.0, output_format='network'):
            
        tick(1)
        container = tm_data.compiled_data
        titles = LdaDataCompiler.get_topic_titles(container.topic_token_weights)

        df = container.document_topic_weights[container.document_topic_weights.weight > threshold].reset_index()
        
        df['slim_title'] = df.title.apply(slim_title)
        network = NetworkUtility.create_bipartite_network(df, 'slim_title', 'topic_id')
        
        tick()

        if output_format == 'network':
            
            args = PlotNetworkUtility.layout_args(layout_algorithm, network, scale)
            layout = (layout_algorithms[layout_algorithm])(network, **args)
            
            tick()
            
            p = plot_document_topic_network(network, layout, scale=scale, titles=titles)
            bokeh.plotting.show(p)

        elif output_format == 'table':
            display(df)
        else:
            display(pivot_ui(df))

        tick(0)

    iw = widgets.interactive(
        display_topic_network,
        layout_algorithm=layout_widget,
        tm_data=widgets.fixed(tm_data),
        threshold=threshold_widget,
        scale=scale_widget,
        output_format=output_format_widget
    )

    display(widgets.VBox([
        text_widget,
        widgets.HBox([layout_widget, threshold_widget]), 
        widgets.HBox([output_format_widget, scale_widget, progress_widget]),
        iw.children[-1]
    ]))
    iw.update()

tm_data = get_current_model()
main_topic_network(tm_data)



### Topic Trends - Heatmap
- The topic shares are displayed as a scattered heatmap plot, using a color gradient based on each topic's weight in the document.
- [Stanford’s Termite software](http://vis.stanford.edu/papers/termite) uses a similar visualization.
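To make the color gradient concrete, the sketch below maps a weight in [0, 1] to one of ten palette colors by equal-width binning. It only approximates what a linear color mapper does and is not Bokeh's implementation; `weight_to_color` and the sample grid are hypothetical.

```python
PALETTE = ['#ffffff', '#f7fcf5', '#e5f5e0', '#c7e9c0', '#a1d99b',
           '#74c476', '#41ab5d', '#238b45', '#006d2c', '#00441b']

def weight_to_color(weight, palette=PALETTE, low=0.0, high=1.0):
    """Map a weight in [low, high] to a palette color by equal-width binning.

    Illustrative only: this approximates a linear color mapper and is
    not Bokeh's implementation. Out-of-range values are clamped.
    """
    fraction = (weight - low) / (high - low)
    index = int(fraction * len(palette))
    index = max(0, min(index, len(palette) - 1))
    return palette[index]

# Hypothetical document-topic weights laid out as a grid (rows: documents)
grid = {
    "Paper A": [0.05, 0.95],
    "Paper B": [0.55, 0.45],
}
for title, row in grid.items():
    print(title, [weight_to_color(w) for w in row])
```

Fixing the range to 0.0 and 1.0, rather than the observed minimum and maximum, keeps colors comparable between models at the cost of leaving the darkest shades unused when no topic dominates a document.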

In [96]:
# plot_topic_relevance_by_year
import bokeh.transform

def setup_glyph_coloring(df):
    max_weight = df.weight.max()
    #colors = list(reversed(bokeh.palettes.Greens[9]))
    colors = ['#ffffff', '#f7fcf5', '#e5f5e0', '#c7e9c0', '#a1d99b', '#74c476', '#41ab5d', '#238b45', '#006d2c', '#00441b']
    mapper = bokeh.models.LinearColorMapper(palette=colors, low=0.0, high=1.0) # low=df.weight.min(), high=max_weight)
    color_transform = bokeh.transform.transform('weight', mapper)
    color_bar = bokeh.models.ColorBar(color_mapper=mapper, location=(0, 0),
                         ticker=bokeh.models.BasicTicker(desired_num_ticks=len(colors)),
                         formatter=bokeh.models.PrintfTickFormatter(format=" %5.2f"))
    return color_transform, color_bar

def plot_topic_relevance_by_year(df, xs, ys, flip_axis, glyph, titles, text_id):

    line_height = 7
    if flip_axis is True:
        xs, ys = ys, xs
        line_height = 10
    
    ''' Setup axis categories '''
    x_range = list(map(str, df[xs].unique()))
    y_range = list(map(str, df[ys].unique()))
    
    ''' Setup coloring and color bar '''
    color_transform, color_bar = setup_glyph_coloring(df)
    
    source = bokeh.models.ColumnDataSource(df)

    plot_height = max(len(y_range) * line_height, 500)
    
    p = bokeh.plotting.figure(title="Topic heatmap", tools=TOOLS, toolbar_location="right", x_range=x_range,
           y_range=y_range, x_axis_location="above", plot_width=1000, plot_height=plot_height)

    args = dict(x=xs, y=ys, source=source, alpha=1.0, hover_color='red')
    
    if glyph == 'Circle':
        cr = p.circle(color=color_transform, **args)
    else:
        cr = p.rect(width=1, height=1, line_color=None, fill_color=color_transform, **args)

    p.x_range.range_padding = 0
    p.ygrid.grid_line_color = None
    p.xgrid.grid_line_color = None
    p.axis.axis_line_color = None
    p.axis.major_tick_line_color = None
    p.axis.major_label_text_font_size = "8pt"
    p.axis.major_label_standoff = 0
    p.xaxis.major_label_orientation = 1.0
    p.add_layout(color_bar, 'right')
    
    p.add_tools(bokeh.models.HoverTool(tooltips=None, callback=widgets_utility.WidgetUtility.glyph_hover_callback(
        source, 'topic_id', titles.index, titles, text_id), renderers=[cr]))
    
    return p
    
def display_doc_topic_heatmap(tm_data, key='max', flip_axis=False, glyph='Circle'):
    try:
        container = tm_data.compiled_data
        titles = LdaDataCompiler.get_topic_titles(container.topic_token_weights, n_words=100)
        df = container.document_topic_weights.copy().reset_index()
        df['document_id'] = df.document_id.astype(str)
        df['topic_id'] = df.topic_id.astype(str)
        df['author'] = df.title.apply(slim_title)
        p = plot_topic_relevance_by_year(df, xs='author', ys='topic_id', flip_axis=flip_axis, glyph=glyph, titles=titles, text_id='topic_relevance')
        bokeh.plotting.show(p)
    except Exception as ex:
        logger.error(ex)
        raise
            
def doc_topic_heatmap_gui(tm_data):
    
    def text_widget(element_id=None, default_value='', style='', line_height='20px'):
        value = "<span class='{}' style='line-height: {};{}'>{}</span>".format(element_id, line_height, style, default_value) if element_id is not None else ''
        return widgets.HTML(value=value, placeholder='', description='', layout=widgets.Layout(height='150px'))

    text_id = 'topic_relevance'
    #text_widget = widgets_utility.wf.create_text_widget(text_id)
    text_widget = text_widget(text_id)
    glyph = widgets.Dropdown(options=['Circle', 'Square'], value='Square', description='Glyph', layout=widgets.Layout(width="180px"))
    flip_axis = widgets.ToggleButton(value=True, description='Flip XY', tooltip='Flip X and Y axis', icon='', layout=widgets.Layout(width="80px"))

    iw = widgets.interactive(display_doc_topic_heatmap, tm_data=widgets.fixed(tm_data), glyph=glyph, flip_axis=flip_axis)

    display(widgets.VBox([widgets.HBox([flip_axis, glyph ]), text_widget, iw.children[-1]]))

    iw.update()

doc_topic_heatmap_gui(get_current_model())


## Document Key Terms 
- [TextRank] Mihalcea, R., & Tarau, P. (2004, July). TextRank: Bringing order into texts. Association for Computational Linguistics.
- [SingleRank] Hasan, K. S., & Ng, V. (2010, August). Conundrums in unsupervised keyphrase extraction: making sense of the state-of-the-art. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters (pp. 365-373). Association for Computational Linguistics.
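The idea behind TextRank-style key term extraction can be sketched in a few lines: link tokens that occur close to each other, then score every token with a PageRank-style iteration over that graph. The code below is a simplified illustration, not textacy's implementation; the function name, the window size, the damping factor and the sample tokens are all assumptions.

```python
from collections import defaultdict

def textrank_keywords(tokens, window=2, damping=0.85, iterations=50, top_n=3):
    """Rank tokens with a PageRank-style score over a co-occurrence graph.

    A simplified sketch of the TextRank idea, not textacy's implementation:
    tokens that appear within `window` positions of each other are linked
    by an unweighted, undirected edge.
    """
    neighbors = defaultdict(set)
    for i, token in enumerate(tokens):
        for other in tokens[i + 1:i + window]:
            if other != token:
                neighbors[token].add(other)
                neighbors[other].add(token)

    scores = {token: 1.0 for token in neighbors}
    for _ in range(iterations):
        # Each token receives an equal share of every neighbor's score
        scores = {
            token: (1 - damping) + damping * sum(
                scores[n] / len(neighbors[n]) for n in neighbors[token]
            )
            for token in neighbors
        }
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:top_n]

# Hypothetical, already filtered tokens (e.g. lemmatized nouns)
tokens = ["topic", "model", "word", "topic", "document", "topic", "word", "model"]
keywords = textrank_keywords(tokens)
print(keywords)
```

Tokens with many distinct neighbors, such as "topic" in the sample, end up with the highest scores.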


In [83]:
import textacy.keyterms

def display_document_key_terms_gui(corpus):
    
    df_documents = get_corpus_documents(corpus)
    methods = { 'SingleRank': textacy.keyterms.singlerank, 'TextRank': textacy.keyterms.textrank }
    document_options = {v: k for k, v in df_documents['title'].to_dict().items()}
    
    gui = types.SimpleNamespace(
        output=widgets.Output(layout={'border': '1px solid black'}),
        n_keyterms=widgets.IntSlider(description='#words', min=10, max=500, value=100, step=1, layout=widgets.Layout(width='240px')),
        document_id=widgets.Dropdown(description='Paper', options=document_options, value=0, layout=widgets.Layout(width='40%')),
        method=widgets.Dropdown(description='Algorithm', options=[ 'TextRank', 'SingleRank' ], value='TextRank', layout=widgets.Layout(width='180px')),
        normalize=widgets.Dropdown(description='Normalize', options=[ 'lemma', 'lower' ], value='lemma', layout=widgets.Layout(width='160px'))
    )
    
    def display_document_key_terms(corpus, method='TextRank', document_id=0, normalize='lemma', n_keyterms=10):
        keyterms = methods[method](corpus[document_id], normalize=normalize, n_keyterms=n_keyterms)
        terms = ' '.join([ x for x, y in keyterms ])
        gui.output.clear_output()
        with gui.output:
            display(terms)

    itw = widgets.interactive(
        display_document_key_terms,
        corpus=widgets.fixed(corpus),
        method=gui.method,
        document_id=gui.document_id,
        normalize=gui.normalize,
        n_keyterms=gui.n_keyterms,
    )

    display(widgets.VBox([
        widgets.HBox([gui.document_id, gui.method, gui.normalize, gui.n_keyterms]),
        gui.output
    ]))

    itw.update()

display_document_key_terms_gui(CORPUS)

