### LDA Topic modelling

See *Blei et al., 2003: Latent Dirichlet Allocation* [PDF](http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf) for a description of LDA.

### Suggested readings on how to evaluate topic models
- Reading Tea Leaves: How Humans Interpret Topic Models, Chang et al. (2009)
- Machine Reading Tea Leaves: Automatically Evaluating Topic Coherence and Topic Model Quality, Lau et al. (2014)
- [Evaluation Methods for Topic Models](http://dirichlet.net/pdf/wallach09evaluation.pdf), Wallach et al. (2009)
- Many interesting blog posts by Benjamin Schmidt, Ted Underwood, and others.

### Tools for investigating TMs

> - [Termite](http://vis.stanford.edu/papers/termite) (Stanford)
> - [Hiérarchie](https://nlp.stanford.edu/events/illvi2014/papers/smith-illvi2014b.pdf)
> - [Word Embedding Visual Explorer](http://residue3.sci.utah.edu/?) [source](https://ronxin.github.io/wevi/), [paper](https://arxiv.org/abs/1411.2738)
> - LAMVI, Sentiview, Lexos and many more.
> - [TensorBoard](https://www.tensorflow.org/guide/summaries_and_tensorboard) (Google, PCA, VSM)

### How to create a topic model
- [MALLET](http://mallet.cs.umass.edu/) McCallum, Andrew Kachites.  "MALLET: A Machine Learning for Language Toolkit." http://mallet.cs.umass.edu. 2002.
- [gensim](https://radimrehurek.com/gensim/index.html) Radim Rehurek and Petr Sojka, "Software Framework for Topic Modelling with Large Corpora", 2010
- [Stanford Topic Modeling Toolbox](https://nlp.stanford.edu/software/tmt/tmt-0.4/)

### <span style='color:blue'>MANDATORY STEP</span> Setup and Initialize the Notebook
Use the **play** button, or press **Shift-Enter** to execute a code cell (select it first). The code imports Python libraries and frameworks, and initializes the notebook.

In [1]:
# Folded Code
%load_ext autoreload
%autoreload 2

import common.utility
from common.model_utility import ModelUtility
from common.plot_utility import layout_algorithms, PlotNetworkUtility
import common.widgets_utility as wf
from common.network_utility import NetworkUtility, DISTANCE_METRICS, NetworkMetricHelper
#import common.vectorspace_utility

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 
warnings.filterwarnings("ignore", category=FutureWarning) 

import os
import glob
import math
import types
import ipywidgets as widgets
import logging
import bokeh.models as bm
import bokeh.palettes
import pandas as pd
import numpy as np

from pivottablejs import pivot_ui
from IPython.display import display, HTML, clear_output, IFrame
from itertools import product
from bokeh.io import output_file, push_notebook
from bokeh.core.properties import value, expr
from bokeh.transform import transform, jitter
from bokeh.layouts import row, column, widgetbox
from bokeh.plotting import figure, show, output_notebook, output_file
from bokeh.models.widgets import DataTable, DateFormatter, TableColumn
from bokeh.models import ColumnDataSource, CustomJS

logger = logging.getLogger('explore-topic-models')
TOOLS = "pan,wheel_zoom,box_zoom,reset,previewsave"
AGGREGATES = { 'mean': np.mean, 'sum': np.sum, 'max': np.max, 'std': np.std }

output_notebook()

pd.set_option('display.precision', 10)  # full option key; the short form 'precision' is rejected by newer pandas

### <span style='color:blue'>MANDATORY STEP</span> Select LDA Topic Model
- Select one of the previously computed and prepared topic models that you want to use in subsequent steps.
- Models are computed in batch in accordance with the
<a href="./images/workflow-prepare.svg">process flow</a> used in the *Digitala modeller* project.
- Note that subsequent code cells are NOT updated (executed) automatically when a new model is selected.
- Use the **play** button, or press **Shift-Enter** to execute the selected cell.

In [2]:
# Hidden code: Select current model state
class ModelState:
    
    def __init__(self, data_folder):
        
        self.data_folder = data_folder
        self.basenames = ModelUtility.get_model_names(data_folder)
        self.basename = self.basenames[0]
        self.on_set_model_callback = None
        
    def set_model(self, basename=None):

        basename = basename or self.basename
        
        self.basename = basename
        self.topic_keys = ModelUtility.get_topic_keys(self.data_folder, basename)
        self.max_alpha = self.topic_keys.alpha.max()
        self.topic_overview = ModelUtility\
            .get_result_model_sheet(self.data_folder, basename, 'topic_tokens')
        self.document_topic_weights = ModelUtility\
            .get_result_model_sheet(self.data_folder, basename, 'doc_topic_weights')\
            .drop('Unnamed: 0', axis=1, errors='ignore')
        self.topic_token_weights = ModelUtility\
            .get_result_model_sheet(self.data_folder, basename, 'topic_token_weights')\
            .drop('Unnamed: 0', axis=1, errors='ignore')\
            .dropna(subset=['token'])
        self._years = list(range(
            self.document_topic_weights.year.min(), self.document_topic_weights.year.max() + 1))
        self.min_year = min(self._years)
        self.max_year = max(self._years)
        self.years = [None] + self._years
        self.n_topics = self.topic_overview.topic_id.max() + 1
        # https://stackoverflow.com/questions/44561609/how-does-mallet-set-its-default-hyperparameters-for-lda-i-e-alpha-and-beta
        self.initial_alpha = 0.0  # 5.0 / self.n_topics if 'mallet' in state.basename else 1.0 / self.n_topics
        self.initial_beta = 0.0  # 0.01 if 'mallet' in basename else 1.0 / self.n_topics
        self._lda = None
        self._topic_titles = None
        self.corpus_documents = ModelUtility.get_corpus_documents(self.data_folder, self.basename).set_index('document_id')
        print("Current model: " + self.basename.upper())
        
        if self.on_set_model_callback is not None:
            self.on_set_model_callback(self)
            
        # _fix_topictokens()
        return self
    
    #def get_document_topic_weights(self, year=None, topic_id=None):
    #    df = self.document_topic_weights
    #    if year is None and topic_id is None:
    #        return df
    #    if topic_id is None:
    #        return df[(df.year == year)]
    #    if year is None:
    #        return df[(df.topic_id == topic_id)]
    #    return df[(df.year == year)&(df.topic_id == topic_id)]
    
    def get_unique_topic_ids(self):
        return self.document_topic_weights['topic_id'].unique()
    
    #def get_topic_weight_by_year_or_document(self, key='mean', pivot_column=None):
    #    
    #    if pivot_column is None:
    #        pivot_column = 'year' if year is None else 'document_id'    
    #        
    #    df = self.document_topic_weights(year) \
    #        .groupby([pivot_column,'topic_id']) \
    #        .agg(AGGREGATES[key])[['weight']].reset_index()
    #    return df, pivot_column
    
    #return self.get_document_topic_weight_by_pivot_column(pivot_column, key, filter={'column': 'year', 'values': [year]})
    
    def get_document_topic_weight_by_filter(self, filters=None):
        df = self.document_topic_weights.query('weight > 0')
        for f in (filters or []):
            if 'query' in f:
                # Free-form pandas query string, e.g. 'weight >= 0.5'
                df = df.query(f['query'])
            elif isinstance(f['value'], (list, tuple)):
                df = df[df[f['column']].isin(f['value'])]
            else:
                # Scalar value (str, int, ...): plain equality match,
                # so that e.g. an integer year filter is not silently ignored
                df = df[df[f['column']] == f['value']]
        return df
    
    def get_document_topic_weight_by_pivot_column(self, pivot_column, key='mean', filters=None):
        df = self.get_document_topic_weight_by_filter(filters)
        df = df.groupby([pivot_column, 'topic_id'])\
               .agg(AGGREGATES[key])[['weight']].reset_index()
        return df[df.weight > 0]
    
    def get_topic_tokens_dict(self, topic_id, n_top=200):
        return self.get_topic_tokens(topic_id)\
            .sort_values(['weight'], ascending=False)\
            .head(n_top)[['token', 'weight']]\
            .set_index('token').to_dict()['weight']

    def compute_topic_terms_vector_space(self, n_words=100):
        '''
        Create an aligned topic-term vector space from the top n_words of each topic
        '''
        unaligned_vector_dicts = ( self.get_topic_tokens_dict(topic_id, n_words) for topic_id in range(0, self.n_topics) )
        X, feature_names = ModelUtility.compute_and_align_vector_space(unaligned_vector_dicts)
        return X, feature_names

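    def get_lda(self):
        '''
        Get gensim model. Only used for pyLDAvis display
        '''
        # Loading the gensim model is disabled in this notebook,
        # so the code after this line is intentionally unreachable.
        raise Exception("Use of LDA model disabled in this Notebook")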
        if self._lda is None:
            filename = os.path.join(self.data_folder, self.basename, 'gensim_model_{}.gensim.gz'.format(self.basename))
            if os.path.isfile(filename):
                self._lda = LdaModel.load(filename)
                print('LDA model loaded...')
            else:
                print('LDA not found on disk...')
        return self._lda 
    
    def get_topic_titles(self, n_words=100, cache=True):
        if cache and self._topic_titles is not None:
            return self._topic_titles
        _topic_titles = ModelUtility.get_topic_titles(self.topic_token_weights, n_words=n_words)
        self._topic_titles = _topic_titles if cache else None
        return _topic_titles
    
    def get_topic_tokens(self, topic_id, max_n_words=500):
        tokens = self.topic_token_weights\
            .loc[lambda x: x.topic_id == topic_id]\
            .sort_values('weight',ascending=False)[:max_n_words]
        return tokens
    
    def get_topic_alphas(self):
        # Per-topic alpha values, taken from the topic keys loaded in set_model
        return self.topic_keys.alpha
    
    def get_topic_year_aggregate_weights(self, fn, threshold):
        df = self.document_topic_weights[(self.document_topic_weights.weight > 0.001)]
        df = df.groupby(['year', 'topic_id']).agg(fn)['weight'].reset_index()
        df = df[(df.weight>=threshold)]
        return df
    
    def get_topic_proportions(self):
        corpus_documents = self.get_corpus_documents()
        document_topic_weights = self.document_topic_weights
        topic_proportion = ModelUtility.compute_topic_proportions(document_topic_weights, corpus_documents)
        return topic_proportion
    
    def get_corpus_documents(self):
        #if self.corpus_documents is None:
        #    self.corpus_documents = ModelUtility.get_corpus_documents(self.data_folder, self.basename)
        return self.corpus_documents

    def on_set_model(self, callback):
        self.on_set_model_callback = callback
        return self
        
def on_set_model_handler(state):

    if 'report_name' in state.corpus_documents:
        return
    
    state.source_documents = pd.read_csv('data/SOU_1990_index.csv', sep='\t', header=None, names=['year', 'report_id', 'report_name'])
    state.corpus_documents['report_id'] = state.corpus_documents.document.str.split('_').apply(lambda x: x[1]).astype(np.int64)
    state.corpus_documents['report_name'] = pd.merge(state.corpus_documents, state.source_documents, how='inner', on=['year', 'report_id']).report_name
    state.corpus_documents['report_name'] = state.corpus_documents.apply(lambda x: '{}-{} {}'.format(x['year'], x['report_id'], x['report_name'])[:50], axis=1)
    state.document_topic_weights['report_name'] = pd.merge(state.document_topic_weights, state.corpus_documents, left_on='document_id', right_index=True).report_name

def select_model_main(state):
    
    basename_widget = widgets.Dropdown(
        options=state.basenames,
        value=state.basename,
        description='Topic model',
        disabled=False,
        layout=widgets.Layout(width='75%')
    )
    
    # set_model is a bound method that only takes `basename`, so no extra `state` argument is passed
    w = widgets.interactive(state.set_model, basename=basename_widget)
    display(widgets.VBox((basename_widget,) + (w.children[-1],)))
    w.update()

state = ModelState('./data').on_set_model(on_set_model_handler)

select_model_main(state)




### The Alpha Hyperparameter

- See [Probabilistic Topic Models](http://psiexp.ss.uci.edu/research/papers/SteyversGriffithsLSABookFormatted.pdf) for a description of LDA hyperparameters.
- The **alpha** hyperparameter affects the document-topic distribution.
- The LDA model is said to be *symmetric* if the same alpha value is used for all topics, and *asymmetric* if alpha can vary per topic.
- In an asymmetric model, high alphas can indicate a "stopwords" topic (frequent words), and low alphas can indicate bogus topics.
- This chart is of no value for symmetric models. 
- See also: [stackexchange what-exactly-is-the-alpha-in-the-dirichlet-distribution](https://stats.stackexchange.com/questions/244917/what-exactly-is-the-alpha-in-the-dirichlet-distribution)
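
The difference between a symmetric and an asymmetric alpha can be illustrated with a few Dirichlet draws. The following is a minimal sketch, not part of the notebook's own code: the number of topics, the alpha values and the variable names are all made-up assumptions chosen for illustration. It simulates document-topic distributions and compares the average share each topic receives.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is reproducible

n_topics = 5        # hypothetical number of topics
n_documents = 2000  # hypothetical number of simulated documents

# Symmetric prior: every topic shares one alpha value (illustrative value).
symmetric_alpha = np.full(n_topics, 0.5)

# Asymmetric prior: topic 0 gets a much larger alpha than the others (illustrative values).
asymmetric_alpha = np.array([5.0, 0.5, 0.5, 0.5, 0.5])

# Each row is one simulated document-topic distribution and sums to 1.
symmetric_docs = rng.dirichlet(symmetric_alpha, size=n_documents)
asymmetric_docs = rng.dirichlet(asymmetric_alpha, size=n_documents)

# The expected share of topic k under Dir(alpha) is alpha_k / sum(alpha).
print("symmetric mean shares: ", symmetric_docs.mean(axis=0).round(2))
print("asymmetric mean shares:", asymmetric_docs.mean(axis=0).round(2))
```

Under the symmetric prior all topics end up with roughly equal average shares, while under the asymmetric prior the high-alpha topic is present in most documents, which is the pattern one would expect from a "stopwords" topic.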


In [5]:
# Alpha / Lambda Plot

_topic_keys = ModelUtility.get_topic_keys(state.data_folder, state.basename)

def plot_alpha(df):

    source = ColumnDataSource(df)
    p = figure(x_range=df.topic.values,plot_width=900, plot_height=400, title='',
               tools=TOOLS, toolbar_location="above")
    p.xaxis[0].axis_label = 'Topic'
    p.yaxis[0].axis_label = 'Alpha'
    p.xaxis.major_label_orientation = 1.0
    p.y_range.start = 0.0
    x_axis_type = 'enum'
    p.xgrid.visible = False

    glyph = bm.glyphs.VBar(x='topic', top='alpha', bottom=0, width=0.5, fill_color='color')
    cr = p.add_glyph(source, glyph)

    titles = ModelUtility.get_topic_titles(state.topic_token_weights, n_words=100)
    p.add_tools(bm.HoverTool(tooltips=None, callback=wf.WidgetUtility.glyph_hover_callback(
        source, 'topic_id', titles.index, titles, 'alpha_plot'), renderers=[cr]))
        
    return p

def display_alpha(output_format, sort_by, window):
    global state
    palette = bokeh.palettes.PiYG[4]
    topic_keys = ModelUtility.get_topic_keys(state.data_folder, state.basename).reset_index()
    topic_keys = topic_keys[((topic_keys.alpha >= window[0]) & (topic_keys.alpha <= window[1]))]
    topic_keys['topic'] = topic_keys.topic_id.apply(lambda x: str(x))
    topic_keys['color'] = palette[1]  # topic_keys.alpha.apply(lambda x: palette[1] if x >= state.initial_alpha else palette[2])
    if sort_by.lower() == 'alpha':
        topic_keys = topic_keys.sort_values('alpha', axis=0)
    if output_format == 'Chart':
        p = plot_alpha(topic_keys)
        show(p)
    else:
        source = bm.ColumnDataSource(topic_keys)
        columns = [
            TableColumn(field="topic_id", title="ID"),
            TableColumn(field="alpha", title="Alpha"),
            TableColumn(field="tokens", title="Tokens"),
        ]
        data_table = DataTable(source=source, columns=columns, width=950, height=600)
        show(widgetbox(data_table))

def plot_alpha_main():
    
    za = wf.BaseWidgetUtility(
        text_id='alpha_plot',
        text=wf.create_text_widget('alpha_plot',default_value='Hover topics to display words!'),
        output_format=wf.create_select_widget('Format', ['Chart', 'Table'], default='Chart'),
        sort_by=wf.create_select_widget('Sort by', ['Topic', 'Alpha'], default='Alpha'),
        window=widgets.FloatRangeSlider(
            description='Window',
            min=0, max=state.max_alpha + 0.1,
            step=0.01,
            value=(0, state.max_alpha + 0.1),  # (state.initial_alpha, state.max_alpha + 0.1),
            continuous_update=False
        )
    )
    za.next_topic_id = za.create_next_id_button('topic_id', state.n_topics)

    wa = widgets.interactive(
        display_alpha,
        output_format=za.output_format,
        sort_by=za.sort_by,
        window=za.window
    )
    za.text.layout = widgets.Layout(width='95%') #  , height='120px')
    wa.children[-1].layout = widgets.Layout(width='98%')

    display(widgets.VBox([
        za.text,
        widgets.HBox([za.output_format, za.window, za.sort_by]),
        widgets.HBox([wa.children[-1]])
    ]))
    wa.update()
    
plot_alpha_main()



In [6]:
# Dir(alpha) test sample
def plot_dirichlet_alpha_sample(df):

    source = ColumnDataSource(df)
    p = figure(x_range=df.topic.values,plot_width=900, plot_height=400, title='',
               tools=TOOLS, toolbar_location="above")
    p.xaxis[0].axis_label = 'Topic'
    p.yaxis[0].axis_label = 'Value'
    p.xaxis.major_label_orientation = 1.0
    p.y_range.start = 0.0
    x_axis_type = 'enum'
    p.xgrid.visible = False

    glyph = bm.glyphs.VBar(x='topic', top='value', bottom=0, width=0.5, fill_color='color')
    cr = p.add_glyph(source, glyph)

    titles = ModelUtility.get_topic_titles(state.topic_token_weights, n_words=100)
    p.add_tools(bm.HoverTool(tooltips=None, callback=wf.WidgetUtility.glyph_hover_callback(
        source, 'topic_id', titles.index, titles, 'dirichlet_alpha_plot'), renderers=[cr]))
        
    return p

def display_dirichlet_alpha_draw():
    global state
    palette = bokeh.palettes.PiYG[4]
    topic_keys = ModelUtility.get_topic_keys(state.data_folder, state.basename).reset_index()
    topic_keys['topic'] = topic_keys.topic_id.apply(lambda x: str(x))
    topic_keys['color'] = palette[1] # topic_keys.alpha.apply(lambda x: palette[1] if x >= state.initial_alpha else palette[2])
    topic_keys['value'] = np.random.dirichlet(topic_keys.alpha)

    p = plot_dirichlet_alpha_sample(topic_keys)
    show(p)

def draw_dirichlet_alpha_main():
    
    zd = wf.BaseWidgetUtility(
        text_id='dirichlet_alpha_plot',
        text=wf.create_text_widget('dirichlet_alpha_plot',default_value='Hover topics to display words!'),
    )
    zd.refresh_button = widgets.Button(
        description='Draw',
        disabled=False,
        button_style='', # 'success', 'info', 'warning', 'danger' or ''
        tooltip='Click me',
        icon='check'
    )

    def on_refresh_button_clicked(b):
        wd.update()

    zd.refresh_button.on_click(on_refresh_button_clicked)

    wd = widgets.interactive(display_dirichlet_alpha_draw)
    zd.text.layout = widgets.Layout(width='95%')
    wd.children[-1].layout = widgets.Layout(width='98%')

    display(widgets.VBox([
        zd.text,
        widgets.HBox([zd.refresh_button]),
        widgets.HBox([wd.children[-1]])
    ]))
    wd.update()
    
draw_dirichlet_alpha_main()



### Topic-Word Distribution - Wordcloud and Table

In [7]:
# Display LDA topic's token wordcloud
opts = { 'max_font_size': 100, 'background_color': 'white', 'width': 900, 'height': 600 }

import wordcloud
import matplotlib.pyplot as plt

def display_topic_distribution_widgets(callback, state, text_id, output_options=None, word_count=(1, 100, 50)):
    
    output_options = output_options or []
    wc = wf.BaseWidgetUtility(
        n_topics=state.n_topics,
        text_id=text_id,
        text=wf.create_text_widget(text_id),
        topic_id=widgets.IntSlider(
            description='Topic ID', min=0, max=state.n_topics - 1, step=1, value=0, continuous_update=False),
        word_count=widgets.IntSlider(
            description='#Words', min=word_count[0], max=word_count[1], step=1, value=word_count[2], continuous_update=False),
        output_format=wf.create_select_widget('Format', output_options, default=output_options[0], layout=widgets.Layout(width="200px")),
        progress = widgets.IntProgress(min=0, max=4, step=1, value=0, layout=widgets.Layout(width="95%"))
    )

    wc.prev_topic_id = wc.create_prev_id_button('topic_id', state.n_topics)
    wc.next_topic_id = wc.create_next_id_button('topic_id', state.n_topics)

    iw = widgets.interactive(
        callback,
        topic_id=wc.topic_id,
        n_words=wc.word_count,
        output_format=wc.output_format,
        widget_container=widgets.fixed(wc)
    )

    display(widgets.VBox([
        wc.text,
        widgets.HBox([wc.prev_topic_id, wc.next_topic_id, wc.topic_id, wc.word_count, wc.output_format]),
        wc.progress,
        iw.children[-1]
    ]))

    iw.update()

def plot_wordcloud(df_data, token='token', weight='weight', figsize=(14, 14/1.618), **args):
    token_weights = dict({ tuple(x) for x in df_data[[token, weight]].values })
    image = wordcloud.WordCloud(**args,)
    image.fit_words(token_weights)
    plt.figure(figsize=figsize) #, dpi=100)
    plt.imshow(image, interpolation='bilinear')
    plt.axis("off")
    plt.show()
    
def display_wordcloud(topic_id=0, n_words=100, output_format='Wordcloud', widget_container=None):
    widget_container.progress.value = 1
    df_temp = state.topic_token_weights.loc[(state.topic_token_weights.topic_id == topic_id)]
    tokens = state.get_topic_titles(n_words=n_words, cache=True).iloc[topic_id]
    widget_container.progress.value = 2
    widget_container.text.value = 'ID {}: {}'.format(topic_id, tokens)
    if output_format == 'Wordcloud':
        plot_wordcloud(df_temp, 'token', 'weight', max_words=n_words, **opts)
    elif output_format == 'Table':
        widget_container.progress.value = 3
        df_temp = state.get_topic_tokens(topic_id, n_words)
        widget_container.progress.value = 4
        display(HTML(df_temp.to_html()))
    else:
        display(pivot_ui(state.get_topic_tokens(topic_id, n_words)))
    widget_container.progress.value = 0

display_topic_distribution_widgets(display_wordcloud, state, 'tx02', ['Wordcloud', 'Table', 'Pivot'])



### Topic-Word Distribution - Chart
The following chart shows the word distribution for each selected topic. You can zoom in on the left chart. The distribution seems to follow [Zipf's law](https://en.wikipedia.org/wiki/Zipf%27s_law) as (perhaps) expected.
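
One way to check the Zipf claim numerically is to fit a straight line to rank versus weight in log-log space: under Zipf's law the weight is roughly proportional to `1 / rank**s`, so the fitted slope is close to `-s`. The sketch below is an illustration only; `zipf_exponent` is a hypothetical helper and the weights are synthetic rather than taken from a real topic.

```python
import numpy as np

def zipf_exponent(weights):
    """Estimate s in weight ~ C / rank**s with a least-squares fit in log-log space."""
    sorted_weights = np.sort(np.asarray(weights, dtype=float))[::-1]
    sorted_weights = sorted_weights[sorted_weights > 0]  # log is undefined for zero weights
    ranks = np.arange(1, len(sorted_weights) + 1)
    slope, _intercept = np.polyfit(np.log(ranks), np.log(sorted_weights), deg=1)
    return -slope

# Synthetic stand-in for one topic's token weights: an exact Zipf distribution with s = 1.
ranks = np.arange(1, 101)
synthetic_weights = (1.0 / ranks) / np.sum(1.0 / ranks)

estimated_s = zipf_exponent(synthetic_weights)
print(round(estimated_s, 3))
```

In the notebook the same helper could presumably be applied to the `weight` column returned by `state.get_topic_tokens(topic_id)`; an exponent near 1 would support the Zipf interpretation, while a clearly different value would not.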

In [8]:
# Display topic's word distribution

def plot_topic_word_distribution(tokens, **args):

    source = ColumnDataSource(tokens)

    p = figure(toolbar_location="right", **args)

    cr = p.circle(x='xs', y='ys', source=source)

    label_style = dict(level='overlay', text_font_size='8pt', angle=np.pi/6.0)

    text_aligns = ['left', 'right']
    for i in [0, 1]:
        label_source = ColumnDataSource(tokens.iloc[i::2])
        labels = bm.LabelSet(x='xs', y='ys', text_align=text_aligns[i], text='token', text_baseline='middle',
                          y_offset=5*(1 if i == 0 else -1),
                          x_offset=5*(1 if i == 0 else -1),
                          source=label_source, **label_style)
        p.add_layout(labels)

    p.xaxis[0].axis_label = 'Token #'
    p.yaxis[0].axis_label = 'Probability%'
    p.ygrid.grid_line_color = None
    p.xgrid.grid_line_color = None
    p.axis.axis_line_color = None
    p.axis.major_tick_line_color = None
    p.axis.major_label_text_font_size = "6pt"
    p.axis.major_label_standoff = 0
    return p

def plot_topic_tokens_charts(tokens, flag=True):

    if flag:
        left = plot_topic_word_distribution(tokens, plot_width=1000, plot_height=500, title='', tools='box_zoom,wheel_zoom,pan,reset')
        show(left)
        return

    left = plot_topic_word_distribution(tokens, plot_width=450, plot_height=500, title='', tools='box_zoom,wheel_zoom,pan,reset')
    right = plot_topic_word_distribution(tokens, plot_width=450, plot_height=500, title='', tools='pan')

    source = ColumnDataSource({'x':[], 'y':[], 'width':[], 'height':[]})
    left.x_range.callback = create_js_callback('x', 'width', source)
    left.y_range.callback = create_js_callback('y', 'height', source)

    rect = bm.Rect(x='x', y='y', width='width', height='height', fill_alpha=0.0, line_color='blue', line_alpha=0.4)
    right.add_glyph(source, rect)

    show(row(left, right))

def display_topic_tokens(topic_id=0, n_words=100, output_format='Chart', widget_container=None):
    widget_container.forward()
    tokens = state.get_topic_tokens(topic_id=topic_id).\
        copy()\
        .drop('topic_id', axis=1)\
        .assign(weight=lambda x: 100.0 * x.weight)\
        .sort_values('weight', axis=0, ascending=False)\
        .reset_index()\
        .head(n_words)
    if output_format == 'Chart':
        widget_container.forward()
        tokens = tokens.assign(xs=tokens.index, ys=tokens.weight)
        plot_topic_tokens_charts(tokens)
        widget_container.forward()
    elif output_format == 'Table':
        #display(tokens)
        display(HTML(tokens.to_html()))
    else:
        display(pivot_ui(tokens))
    widget_container.reset()
        
display_topic_distribution_widgets(display_topic_tokens, state, 'wc01', ['Chart', 'Table'])



### Topic's Trend Over Time or Documents
- Displays topic's share over documents or time.
- Note that source documents (i.e. SOU reports) are split into 1000-word chunks (LDA documents) by the topic modelling process.
- If "SOU Report" or "Year" is selected then the **max** or **mean** weight is selected from corresponding LDA documents
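
The aggregation described above can be sketched with a small pandas example. The data frame below is entirely made up, and `topic_trend` is a hypothetical helper rather than a function from this notebook; only the column names (`document_id`, `report_name`, `year`, `topic_id`, `weight`) follow the ones used in the code cells.

```python
import pandas as pd

# Hypothetical document-topic weights: each row is one 1000-word chunk (LDA document).
document_topic_weights = pd.DataFrame({
    'document_id': [0, 1, 2, 3, 4, 5],
    'report_name': ['1990-1', '1990-1', '1990-1', '1990-2', '1990-2', '1991-1'],
    'year':        [1990, 1990, 1990, 1990, 1990, 1991],
    'topic_id':    [7, 7, 7, 7, 7, 7],
    'weight':      [0.10, 0.40, 0.10, 0.05, 0.15, 0.30],
})

def topic_trend(df, topic_id, pivot_column, value_column='max', threshold=0.0):
    """Mean and max weight of one topic per pivot value (report or year)."""
    topic_rows = df[df.topic_id == topic_id]
    trend = topic_rows.groupby(pivot_column)['weight'].agg(['mean', 'max']).reset_index()
    # Keep only groups whose selected aggregate exceeds the threshold.
    return trend[trend[value_column] > threshold]

by_report = topic_trend(document_topic_weights, topic_id=7, pivot_column='report_name')
by_year = topic_trend(document_topic_weights, topic_id=7, pivot_column='year')
print(by_report)
print(by_year)
```

Choosing **max** highlights reports in which at least one chunk is strongly about the topic, whereas **mean** reflects how much of the whole report the topic covers.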

In [9]:
# Plot a topic's yearly weight over time in selected LDA topic model
import numpy as np
import math
import bokeh.plotting
from bokeh.models import ColumnDataSource, DataRange1d, Plot, LinearAxis, Grid
from bokeh.models.glyphs import VBar
from bokeh.io import curdoc, show

def plot_topic_trend(df, pivot_column, value_column, x_label=None, y_label=None):

    xs = df[pivot_column].astype(str)  # np.str was removed from NumPy; the builtin str is equivalent
    p = bokeh.plotting.figure(x_range=xs, plot_width=1000, plot_height=700, title='', tools=TOOLS, toolbar_location="right")

    glyph = p.vbar(x=xs, top=df[value_column], width=0.5, fill_color="#b3de69")
    p.xaxis.major_label_orientation = math.pi/4
    p.xgrid.grid_line_color = None
    p.xaxis[0].axis_label = (x_label or '').title()
    p.yaxis[0].axis_label = (y_label or '').title()
    p.y_range.start = 0.0
    #p.y_range.end = 1.0
    p.x_range.range_padding = 0.01
    return p

def display_topic_trend(topic_id, pivot_config, value_column, widgets_container, output_format='Chart', state=None, threshold=0.01):
    
    pivot_column = pivot_config['pivot_column']
    tokens = state.get_topic_titles(n_words=200, cache=True).iloc[topic_id]
    widgets_container.text.value = 'ID {}: {}'.format(topic_id, tokens)
    value_column = value_column if pivot_column is not None else 'weight'
    
    df = state.document_topic_weights[(state.document_topic_weights.topic_id==topic_id)]
    
    if pivot_column is not None:
        df = df.groupby([pivot_column]).agg([np.mean, np.max])['weight'].reset_index()
        df.columns = [pivot_column, 'mean', 'max' ]
        df = df[(df[value_column] > threshold)]
        
    if output_format == 'Table':
        display(df)
    else:
        # Without grouping (pivot_column is None), plot the raw weights per LDA document
        x_column = pivot_column or 'document_id'
        x_label = x_column.title()
        y_label = value_column.title() + (' weight' if value_column != 'weight' else '')
        p = plot_topic_trend(df, x_column, value_column, x_label=x_label, y_label=y_label)
        show(p)

def create_topic_trend_widgets(state):
    pivot_options = {
        '': { 'pivot_column': None, 'filter': None },
        'SOU Report': { 'pivot_column': 'report_name', 'filter': None },
        'Year': { 'pivot_column': 'year', 'filter': None },
        'LDA Document': { 'pivot_column': 'document_id', 'filter': None }
    } 
    wc = wf.BaseWidgetUtility(
        n_topics=state.n_topics,
        text_id='topic_share_plot',
        text=wf.create_text_widget('topic_share_plot'),
        #year=wf.create_select_widget('Year', options=state.years, value=state.years[-1]),
        pivot_config=widgets.Dropdown(
            options=pivot_options,
            value=pivot_options['SOU Report'],
            description='Group by'
        ),
        threshold=widgets.FloatSlider(description='Threshold', min=0.0, max=0.25, step=0.01, value=0.10, continuous_update=False),
        topic_id=widgets.IntSlider(description='Topic ID', min=0, max=state.n_topics - 1, step=1, value=0, continuous_update=False),
        output_format=wf.create_select_widget('Format', ['Chart', 'Table'], default='Chart'),
        progress=widgets.IntProgress(min=0, max=4, step=1, value=0, layout=widgets.Layout(width="50%")),
        aggregate=widgets.Dropdown(options=['max', 'mean'], value='max', description='Aggregate')
    )

    wc.prev_topic_id = wc.create_prev_id_button('topic_id', state.n_topics)
    wc.next_topic_id = wc.create_next_id_button('topic_id', state.n_topics)

    iw = widgets.interactive(
        display_topic_trend,
        topic_id=wc.topic_id,
        pivot_config=wc.pivot_config,
        value_column=wc.aggregate,
        widgets_container=widgets.fixed(wc),
        output_format=wc.output_format,
        state=widgets.fixed(state),
        threshold=wc.threshold
    )
    display(widgets.VBox([
        wc.text,
        widgets.HBox([wc.prev_topic_id, wc.next_topic_id, wc.pivot_config, wc.aggregate, wc.output_format]),
        widgets.HBox([wc.topic_id, wc.threshold, wc.progress]),
        iw.children[-1]
    ]))
    
    iw.update()
    
create_topic_trend_widgets(state)


### Topic to Document Network
The green nodes are documents, and the blue nodes are topics. The edges (lines) indicate the strength of a topic in the connected document. The width of an edge is proportional to the strength of the connection. Note that only edges with a strength above the selected threshold are displayed.
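
The thresholding and edge-width rule can be sketched in plain Python. This is an illustration under stated assumptions, not the notebook's implementation: the notebook relies on its own `NetworkUtility`, whose internals are not shown here, and the triples, the helper `build_bipartite_edges` and the `max_width` parameter below are all hypothetical.

```python
# Hypothetical (document, topic_id, weight) triples standing in for document-topic weights.
weights = [
    ('1990-1', 3, 0.62),
    ('1990-1', 8, 0.21),
    ('1990-2', 3, 0.55),
    ('1990-2', 5, 0.48),
    ('1990-3', 5, 0.71),
]

def build_bipartite_edges(triples, threshold, max_width=6.0):
    """Keep edges above the threshold and scale each line width with the edge weight."""
    kept = [(doc, topic, w) for doc, topic, w in triples if w > threshold]
    if not kept:
        return set(), set(), []
    strongest = max(w for _, _, w in kept)
    document_nodes = {doc for doc, _, _ in kept}   # one side of the bipartite graph
    topic_nodes = {topic for _, topic, _ in kept}  # the other side
    edges = [
        {'document': doc, 'topic_id': topic, 'weight': w, 'width': max_width * w / strongest}
        for doc, topic, w in kept
    ]
    return document_nodes, topic_nodes, edges

document_nodes, topic_nodes, edges = build_bipartite_edges(weights, threshold=0.5)
for edge in edges:
    print(edge)
```

Raising the threshold removes weak edges, and with them any document or topic node that no longer has a connection, which keeps the plotted network readable.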

In [10]:
# Visualize year-to-topic network by means of topic-document-weights
     
def plot_topic_year_network(network, layout, scale=1.0, titles=None):

    year_nodes, topic_nodes = NetworkUtility.get_bipartite_node_set(network, bipartite=0)  
    
    year_source = NetworkUtility.get_node_subset_source(network, layout, year_nodes)
    topic_source = NetworkUtility.get_node_subset_source(network, layout, topic_nodes)
    lines_source = NetworkUtility.get_edges_source(network, layout, scale=6.0, normalize=False)
    
    edges_alphas = NetworkMetricHelper.compute_alpha_vector(lines_source.data['weights'])
    
    lines_source.add(edges_alphas, 'alphas')
    
    p = figure(plot_width=1000, plot_height=600, x_axis_type=None, y_axis_type=None, tools=TOOLS)
    
    r_lines = p.multi_line(
        'xs', 'ys', line_width='weights', alpha='alphas', color='black', source=lines_source
    )
    r_years = p.circle(
        'x','y', size=40, source=year_source, color='lightgreen', level='overlay', line_width=1,alpha=1.0
    )
    
    r_topics = p.circle('x','y', size=25, source=topic_source, color='skyblue', level='overlay', alpha=1.00)
    
    p.add_tools(bm.HoverTool(renderers=[r_topics], tooltips=None, callback=wf.WidgetUtility.\
        glyph_hover_callback(topic_source, 'node_id', text_ids=titles.index, text=titles, element_id='nx_id1'))
    )

    text_opts = dict(
        x='x', y='y', text='name', level='overlay',
        x_offset=0, y_offset=0, text_font_size='8pt'
    )
    
    p.add_layout(
        bm.LabelSet(
            source=year_source, text_color='black', text_align='center', text_baseline='middle', **text_opts
        )
    )
    p.add_layout(
        bm.LabelSet(
            source=topic_source, text_color='black', text_align='center', text_baseline='middle', **text_opts
        )
    )
    
    return p

def main_topic_year_network(state):
    
    wc = wf.BaseWidgetUtility(
        n_topics=state.n_topics,
        text_id='nx_id1',
        text=wf.create_text_widget('nx_id1', style="display: inline; height='400px'"),
        year=widgets.IntSlider(description='Year', min=state.min_year, max=state.max_year, step=1, value=state.min_year, continuous_update=False),
        pivot_column=widgets.Dropdown(
            options={
                'SOU report': 'report_name',
                'Year': 'year'
            },
            value='report_name',
            description='Pivot'
        ),
        scale=widgets.FloatSlider(description='Scale', min=0.0, max=1.0, step=0.01, value=0.1, continuous_update=False),
        threshold=widgets.FloatSlider(description='Threshold', min=0.0, max=1.0, step=0.01, value=0.50, continuous_update=False),
        output_format=widgets.Dropdown(
            options={'Network': 'network', 'Table': 'table'},
            value='network',
            description='Output'
        ),
        layout=widgets.Dropdown(
            options=list(layout_algorithms.keys()),
            value='Fruchterman-Reingold',
            description='Layout'
        ),
        progress=widgets.IntProgress(min=0, max=4, step=1, value=0, layout=widgets.Layout(width="40%"))
    ) 
    
    wc.previous = wc.create_prev_id_button('year', 10000)
    wc.next = wc.create_next_id_button('year', 10000)    
    
    def display_topic_year_network(
        layout_algorithm,
        threshold=0.50,
        scale=1.0,
        pivot_column='report_name',
        year=None,
        output_format='network'
    ):
        wc.progress.value = 1
        
        titles = state.get_topic_titles()
        filters = []
        if year is not None:
            filters = [ { 'column': 'year', 'value': year }]
        filters = filters + [ { 'query': 'weight >= {}'.format(threshold) } ]
        df = state.get_document_topic_weight_by_pivot_column(pivot_column, key='max', filters=filters)
        df = df[df.weight > threshold]
        
        wc.progress.value = 2

        network = NetworkUtility.create_bipartite_network(df, pivot_column, 'topic_id')
        
        wc.progress.value = 3

        if output_format == 'network':
            
            args = PlotNetworkUtility.layout_args(layout_algorithm, network, scale)
            layout = (layout_algorithms[layout_algorithm])(network, **args)
            
            wc.progress.value = 4
            
            p = plot_topic_year_network(network, layout, scale=scale, titles=titles)
            show(p)

        elif output_format == 'table':
            print(df.shape)
            display(df)
        else:
            display(pivot_ui(df))

        wc.progress.value = 0

    iw = widgets.interactive(
        display_topic_year_network,
        layout_algorithm=wc.layout,
        threshold=wc.threshold,
        scale=wc.scale,
        pivot_column=wc.pivot_column,
        year=wc.year,
        output_format=wc.output_format
    )

    display(widgets.VBox([
        wc.text,
        widgets.HBox([wc.layout, wc.year, wc.previous, wc.next]),
        widgets.HBox([wc.pivot_column, wc.scale]),
        widgets.HBox([wc.output_format, wc.threshold, wc.progress]),
        iw.children[-1]
    ]))
    iw.update()
    
main_topic_year_network(state)



### Topic Trends - Heatmap
- Topic shares are displayed as a scatter-style heatmap, with a color gradient based on each topic's weight in the document.
- [Stanford’s Termite software](http://vis.stanford.edu/papers/termite) uses a similar visualization.
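As a rough illustration of the data behind the heatmap, the sketch below aggregates document-topic weights into one value per pivot value and topic, using `max` or `mean` as in the Aggregate dropdown. The toy values, the column names, and the helper `aggregate_by_pivot` are assumptions made for this example; the notebook's own `get_document_topic_weight_by_pivot_column` may work differently.

```python
import pandas as pd

# Toy document-topic weights (hypothetical values and column names).
document_topic_weights = pd.DataFrame({
    'year':     [1990, 1990, 1990, 1991, 1991],
    'topic_id': [0, 0, 1, 0, 1],
    'weight':   [0.2, 0.6, 0.3, 0.5, 0.1],
})

def aggregate_by_pivot(df, pivot_column, key='max'):
    """Return one weight per (pivot value, topic), aggregated with `key`."""
    return (
        df.groupby([pivot_column, 'topic_id'])['weight']
          .agg(key)
          .reset_index()
    )

# Each row of the result corresponds to one cell (glyph) in the heatmap.
heatmap_data = aggregate_by_pivot(document_topic_weights, 'year', key='max')
```

Each row of `heatmap_data` then maps to one glyph, with `weight` driving the color.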

In [11]:
# plot_topic_relevance_by_year

def setup_glyph_coloring(df):
    max_weight = df.weight.max()
    #colors = list(reversed(bokeh.palettes.Greens[9]))
    colors = ["#efefef", "#75968f", "#a5bab7", "#c9d9d3", "#e2e2e2", "#dfccce", "#ddb7b1", "#cc7878",
              "#933b41", "#550b1d"]
    mapper = bm.LinearColorMapper(palette=colors, low=df.weight.min(), high=max_weight)
    color_transform = transform('weight', mapper)
    color_bar = bm.ColorBar(color_mapper=mapper, location=(0, 0),
                         ticker=bm.BasicTicker(desired_num_ticks=len(colors)),
                         formatter=bm.PrintfTickFormatter(format=" %5.2f"))
    return color_transform, color_bar

def plot_topic_relevance_by_year(df, xs, ys, flip_axis, glyph, titles, text_id):

    line_height = 7
    if flip_axis is True:
        xs, ys = ys, xs
        line_height = 10
    
    # Setup axis categories
    x_range = list(map(str, df[xs].unique()))
    y_range = list(map(str, df[ys].unique()))
    
    # Setup coloring and color bar
    color_transform, color_bar = setup_glyph_coloring(df)
    
    source = ColumnDataSource(df)

    plot_height = max(len(y_range) * line_height, 500)
    
    p = figure(title="Topic heatmap", tools=TOOLS, toolbar_location="right", x_range=x_range,
           y_range=y_range, x_axis_location="above", plot_width=1000, plot_height=plot_height)

    args = dict(x=xs, y=ys, source=source, alpha=1.0, hover_color='red')
    
    if glyph == 'Circle':
        cr = p.circle(color=color_transform, **args)
    else:
        cr = p.rect(width=1, height=1, line_color=None, fill_color=color_transform, **args)

    p.x_range.range_padding = 0
    p.ygrid.grid_line_color = None
    p.xgrid.grid_line_color = None
    p.axis.axis_line_color = None
    p.axis.major_tick_line_color = None
    p.axis.major_label_text_font_size = "5pt"
    p.axis.major_label_standoff = 0
    p.xaxis.major_label_orientation = 1.0
    p.add_layout(color_bar, 'right')
    
    p.add_tools(bm.HoverTool(tooltips=None, callback=wf.WidgetUtility.glyph_hover_callback(
        source, 'topic_id', titles.index, titles, text_id), renderers=[cr]))
    
    return p

def topic_heatmap_main(state):
    
    def display_topic_relevance_by_year(state, key='max', pivot_column=None, year=None, flip_axis=False, glyph='Circle', wdgs=None):
        
        try:
            wdgs.reset()
            wdgs.forward()
            
            titles = ModelUtility.get_topic_titles(state.topic_token_weights, n_words=100)
            wdgs.forward()

            year = (year or 0)
            
            pivot_column = 'year' if year > 0 else (pivot_column or 'report_name')
            filters = [{'column': 'year', 'values': [year]}] if year > 0 else []
            
            df = state.get_document_topic_weight_by_pivot_column(pivot_column, key, filters=filters)
            
            wdgs.forward()
            
            df[pivot_column] = df[pivot_column].astype(str)
            df['topic_id'] = df.topic_id.astype(str)
            
            wdgs.forward()
            
            p = plot_topic_relevance_by_year(df, xs=pivot_column, ys='topic_id', flip_axis=flip_axis, glyph=glyph, titles=titles, text_id='topic_relevance')
            
            show(p)
            wdgs.reset()
        except Exception as ex:
            # Log the error before re-raising; the original order made the log call unreachable.
            logger.error(ex)
            raise
        finally:
            wdgs.reset()

    wc = wf.BaseWidgetUtility(
        text_id='topic_relevance',
        text=wf.create_text_widget('topic_relevance'),
        year=widgets.Dropdown(options=state.years, value=None, description='Year', layout=widgets.Layout(width="140px")),
        pivot_column=widgets.Dropdown(
            options={
                'SOU report': 'report_name',
                # 'LDA document': 'document_id',
                'Year': 'year'
            },
            value='report_name',
            description='Pivot',
            layout=widgets.Layout(width="200px")
        ),
        aggregate=widgets.Dropdown(options=['max', 'mean'], value='max', description='Aggregate', layout=widgets.Layout(width="180px")),
        progress=widgets.IntProgress(min=0, max=4, step=1, value=0, layout=widgets.Layout(width="35%")),
        glyph=widgets.Dropdown(options=['Circle', 'Square'], value='Square', description='Glyph', layout=widgets.Layout(width="180px")),
        flip_axis=widgets.ToggleButton(value=True, description='Flip XY', tooltip='Flip X and Y axis', icon='', layout=widgets.Layout(width="80px"))
    )

    iw = widgets.interactive(
        display_topic_relevance_by_year,
        state=widgets.fixed(state),
        key=wc.aggregate,
        pivot_column=wc.pivot_column,
        year=wc.year,
        glyph=wc.glyph,
        flip_axis=wc.flip_axis,
        wdgs=widgets.fixed(wc)
    )

    display(widgets.VBox([
        widgets.HBox([wc.pivot_column, wc.year, wc.aggregate, wc.flip_axis, wc.glyph, wc.progress ]),
        wc.text,
        iw.children[-1]
    ]))

    iw.update()
            
topic_heatmap_main(state)


### Topic Co-Occurrence
- Computes a weighted graph of topics that co-occur in the same document.
- Two topics co-occur if both are present in the same document with weights inside the threshold range.
- An edge's weight is the number of documents in which the pair co-occurs (each document counts as a binary yes or no).
- Node size reflects each topic's proportion of the entire corpus (normalized by document length).
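The counting described above can be sketched with a pandas self-join on toy data. The values and the helper name `topic_co_occurrence` are assumptions made for this example; the sketch covers only the threshold filtering and pair counting, not the network plot.

```python
import pandas as pd

# Toy document-topic weights (hypothetical values).
weights = pd.DataFrame({
    'document_id': [1, 1, 1, 2, 2, 3],
    'topic_id':    [0, 1, 2, 0, 1, 2],
    'weight':      [0.4, 0.3, 0.05, 0.5, 0.45, 0.9],
})

def topic_co_occurrence(df, low, high, context='document_id'):
    """Count the documents in which each topic pair co-occurs."""
    # Keep only topics whose weight lies inside the open threshold range.
    df = df[(df.weight > low) & (df.weight < high)]
    # Self-join on the context column to enumerate topic pairs per document.
    pairs = pd.merge(df, df, how='inner', on=context)
    # Keep each unordered pair once and drop self-pairs.
    pairs = pairs[pairs.topic_id_x < pairs.topic_id_y]
    counts = pairs.groupby(['topic_id_x', 'topic_id_y']).size().reset_index()
    counts.columns = ['source', 'target', 'weight']
    return counts

edges = topic_co_occurrence(weights, low=0.25, high=1.0)
```

In this toy data, topics 0 and 1 pass the threshold together in documents 1 and 2, so they form the only edge.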

In [12]:
# Visualize topic co-occurrence

def display_topic_co_occurrence_network(layout, threshold, min_count, scale, output_format, context, state=None, text_id=''):

    try:
        titles = state.get_topic_titles()

        df = state.document_topic_weights
        df = df[(df.weight > threshold[0])&(df.weight < threshold[1])]
        
        df = pd.merge(df, df, how='inner', left_on=context, right_on=context)
        df = df.loc[(df.topic_id_x < df.topic_id_y)]

        if output_format in ['network', 'gephi']:
            if output_format == 'network':
                df = df.groupby([df.topic_id_x, df.topic_id_y]).size().reset_index()
                df.columns = ['source', 'target', 'weight']
                df = df[df.weight >= min_count]
                network = NetworkUtility.create_network(df, source_field='source', target_field='target', weight='weight')
                p = PlotNetworkUtility.plot_network(
                    network=network,
                    layout_algorithm=layout,
                    scale=scale,
                    threshold=0.0,
                    node_description=titles,
                    node_proportions=state.get_topic_proportions(),
                    weight_scale=10.0,
                    normalize_weights=True,
                    element_id=text_id,
                    figsize=(900,500)
                )
                show(p)
            elif output_format == 'gephi':
                display(df)
        elif output_format == 'table':
            pivots = ([context] if context == 'report_name' else []) + [df.topic_id_x, df.topic_id_y]
            df = df.groupby(pivots).size().reset_index().rename(columns={0: 'count'})
            df = df[df['count'] >= min_count]
            titles = pd.DataFrame(state.get_topic_titles(n_words=5, cache=False))
            df = pd.merge(df, titles, left_on='topic_id_x', right_index=True)
            df = pd.merge(df, titles, left_on='topic_id_y', right_index=True)
            df = df.rename(columns={'0_x': 'Topic#1', '0_y': 'Topic#2'})
            df.columns = [c.title() for c in df.columns]
            display(df)
        else:
            display(pivot_ui(df))
    except Exception as x:
        # Print the hint before re-raising; the original order made the print unreachable.
        print("No data: please adjust filters")
        raise

def topic_co_occurance_main(state):

    text_id = 'cooc_id'
    wc = wf.BaseWidgetUtility(
        n_topics=state.n_topics,
        text_id=text_id,
        text=wf.create_text_widget(text_id),
        scale=widgets.FloatSlider(description='Scale', min=0.0, max=1.0, step=0.01, value=0.1, continuous_update=False),
        min_count=widgets.IntSlider(description='Min count', min=1, max=500, step=1, value=10, continuous_update=False),
        threshold=widgets.FloatRangeSlider(value=[0.25, 1.0], min=0, max=1.0, step=0.01, description='Threshold:', readout_format='.2f'),
        context=widgets.Dropdown(
            options={'SOU report': 'report_name', 'LDA document': 'document_id'},
            value='report_name', # ['year', 'report_id']
            description='Context',
            layout=widgets.Layout(width='200px')
        ),
        output_format=widgets.Dropdown(
            options={'Network': 'network', 'Gephi': 'gephi', 'Table': 'table'},
            value='network',
            description='Output',
            layout=widgets.Layout(width='200px')
        ),
        layout=widgets.Dropdown(
            options=list(layout_algorithms.keys()),
            value='Fruchterman-Reingold',
            description='Layout',
            layout=widgets.Layout(width='200px')
        ),
        progress=widgets.IntProgress(min=0, max=4, step=1, value=0, layout=widgets.Layout(width="40%"))
    ) 

    iw = widgets.interactive(
        display_topic_co_occurrence_network,
        layout=wc.layout,
        threshold=wc.threshold,
        min_count=wc.min_count,
        scale=wc.scale,
        output_format=wc.output_format,
        context=wc.context,
        state=widgets.fixed(state),
        text_id=widgets.fixed(text_id)
    )

    display(widgets.VBox([
        wc.text,
        widgets.HBox([wc.threshold, wc.context, wc.layout, wc.output_format]),
        widgets.HBox([wc.min_count, wc.scale, wc.progress]),
        iw.children[-1]
    ]))
    iw.update()
    
topic_co_occurance_main(state)


### Topic Similarity Network
- Displays topic similarity based on **Euclidean or cosine distances** between the **topic-to-word vectors**.
- Please note that the computations can take some time to execute, especially for larger LDA models.

> * Compute a multidimensional topic vector space based on the top n words of each topic. Because both the subset of words and their positions differ between topics, the topic vectors must be aligned in a common space (using sklearn DictVectorizer) so that 1) each vector has the same dimension and 2) each token has the same position within that space. The resulting space has as many dimensions as there are unique top n words over all topics.
> * Reduce the topic vector space into a 2D space (using sklearn PCA)
> * Normalize the 2D space (sklearn Normalizer)

Note: The three steps above (the most time-consuming) are executed whenever an option marked with an asterisk is changed.
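The alignment step can be illustrated with a small plain-Python sketch. It is a stand-in for sklearn's DictVectorizer and scipy's distance functions, not the notebook's implementation, and it leaves out the PCA and normalization steps. The topic words, their weights, and the helper names are hypothetical.

```python
import math

# Hypothetical top-n words per topic with their weights.
topic_top_words = {
    0: {'tax': 0.5, 'income': 0.3},
    1: {'tax': 0.4, 'school': 0.6},
    2: {'school': 0.7, 'teacher': 0.2},
}

def align_topics(topic_words):
    """Place every topic in one space spanned by all top-n words."""
    vocabulary = sorted({w for words in topic_words.values() for w in words})
    # Words missing from a topic's top-n list get weight 0.0.
    vectors = {
        topic_id: [words.get(w, 0.0) for w in vocabulary]
        for topic_id, words in topic_words.items()
    }
    return vocabulary, vectors

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

vocabulary, vectors = align_topics(topic_top_words)

# Lower-triangle pairs only, since the distance matrix is symmetric.
distances = {
    (i, j): cosine_distance(vectors[i], vectors[j])
    for i in vectors for j in vectors if i > j
}
```

Topics 0 and 2 share no top words, so their cosine distance is the maximum for non-negative weights, while topics that share words end up closer.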

In [13]:
# Visualization
import scipy.stats  # needed for the scipy.stats.entropy metric below
from scipy.spatial import distance
from common.vectorspace_utility import VectorSpaceHelper

# if 'zy_data' not in globals():
correlation_network_state_data = types.SimpleNamespace(
    basename=None,
    network=None,
    X_n_space=None,
    X_n_space_feature_names=None,
    distance_matrix=None,
    metric=None,
    topic_proportions=None,
    n_words = 0
)

def plot_clustering_dendogram(clustering):
    plt.figure(figsize=(16,6))
    # https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.cluster.hierarchy.dendrogram.html
    R = dendrogram(clustering)
    plt.show()
    plt.close()

def VectorSpaceHelper_compute_distance_matrix(X_n_space, metric='euclidean'):
    # https://se.mathworks.com/help/stats/pdist.html
    metric = metric.lower()
    if metric == 'kullback–leibler': metric = VectorSpaceHelper.kullback_leibler_divergence
    if metric == 'scipy.stats.entropy': metric = scipy.stats.entropy
    #print(metric)
    X = X_n_space.toarray() if hasattr(X_n_space, 'toarray') else X_n_space
    #X_n_space += 0.00001
    distances = distance.pdist(X, metric=metric)
    #print(distances)
    distance_matrix = distance.squareform(distances)
    #print(distance_matrix)    
    return distance_matrix

def main_correlation_network(state, zy_data):

    zy = wf.BaseWidgetUtility(
        n_topics=state.n_topics,
        text_id='nx_id3',
        text=wf.create_text_widget('nx_id3'),
        scale=wf.create_float_slider('Scale', min=0.0, max=1.0, step=0.01, value=0.1),
        year=wf.create_int_slider(
            description='Year', min=state.min_year, max=state.max_year, step=1, value=state.min_year
        ),
        n_words=wf.create_int_slider(description='#words*', min=10, max=500, step=1, value=20),
        metric=wf.create_select_widget(label='Metric*', values=list(DISTANCE_METRICS.keys()), default='Euclidean'),
        threshold=wf.create_float_slider('Threshold', min=0.0, max=1.0, step=0.01, value=0.01),
        output_format=widgets.Dropdown(
            options={'Network': 'network', 'Table': 'table'},
            value='network',
            description='Output',
            layout=widgets.Layout(width='200px')
        ),
        layout=wf.create_select_widget('Layout', list(layout_algorithms.keys()), default='Fruchterman-Reingold'),
        progress=wf.create_int_progress_widget(min=0, max=7, step=1, value=0, layout=widgets.Layout(width="90%"))
    ) 
    
    def display_correlation_network(
        layout_algorithm,
        threshold=0.10,
        scale=1.0,
        metric='Euclidean',
        n_words=200,
        output_format='network'
    ):

        try:

            zy.progress.value = 1
            metric = DISTANCE_METRICS[metric]

            node_description = state.get_topic_titles()
            node_proportions = state.get_topic_proportions()

            zy.progress.value = 2
            if zy_data.network is None or state.basename != zy_data.basename or zy_data.metric != metric or zy_data.n_words != n_words:

                zy_data.basename = state.basename
                zy_data.metric = metric  # remember the metric so the cache check above can succeed
                zy_data.n_words = n_words
                zy_data.X_n_space, zy_data.X_n_space_feature_names = state.compute_topic_terms_vector_space(n_words=n_words)

                #print(zy_data.X_n_space.shape)
                #print(zy_data.X_n_space_feature_names)
                zy.progress.value = 3
                zy_data.distance_matrix = VectorSpaceHelper_compute_distance_matrix(zy_data.X_n_space, metric=metric)
                zy_data.network = None

            edges_data = VectorSpaceHelper.lower_triangle_iterator(zy_data.distance_matrix, threshold)
            
            #df = pd.DataFrame(edges_data, columns=['x', 'y', 'weight']).groupby(['weight']).size()
            #display(df.head())
            #df.plot()
            zy.progress.value = 4
            if output_format == 'table':
                df = pd.DataFrame(edges_data, columns=['x', 'y', 'weight'])
                zy.progress.value = 5
                display(df)
            else:
                zy.progress.value = 5
                if zy_data.network is None:
                    zy_data.network = NetworkUtility.create_network_from_xyw_list(edges_data) # zy_data.distance_matrix)
                zy.progress.value = 6
                p = PlotNetworkUtility.plot_network(
                    network=zy_data.network,
                    layout_algorithm=layout_algorithm,
                    scale=scale,
                    threshold=threshold,
                    node_description=node_description,
                    node_proportions=node_proportions,
                    element_id='nx_id3',
                    figsize=(1000,600)
                )
                zy.progress.value = 6
                show(p)

            zy.progress.value = 7
            zy.progress.value = 0
        except Exception as ex:
            logger.error(ex)
            print('Empty set: please change filters')
            zy.progress.value = 0


    wy = widgets.interactive(
        display_correlation_network,
        layout_algorithm=zy.layout,
        threshold=zy.threshold,
        scale=zy.scale,
        metric=zy.metric,
        n_words=zy.n_words,
        output_format=zy.output_format
    )

    display(widgets.VBox(
        (zy.text, ) +
        (widgets.HBox((zy.threshold,) + (zy.metric,) + (zy.output_format,)),) +
        (widgets.HBox((zy.n_words,) + (zy.layout,) + (zy.scale,)),) +
        (zy.progress,) +
        (wy.children[-1],)))

    wy.update()

    
main_correlation_network(state, correlation_network_state_data)



### Some (assorted) references

> Blei: https://scholar.google.com/citations?user=8OYE6iEAAAAJ

- Blei, 2003: Latent dirichlet allocation [PDF](http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf)
- Blei, 2012: Probabilistic topic models [PDF](https://pdfs.semanticscholar.org/01f3/290d6f3dee5978a53d9d2362f44daebc4008.pdf)
- Blei, 2006: Dynamic topic models [PDF](http://repository.cmu.edu/cgi/viewcontent.cgi?article=2036&context=compsci)
- Introduction to Probabilistic Topic Models: [PDF](http://menome.com/wp/wp-content/uploads/2014/12/Blei2011.pdf)
- Mcauliffe, Blei, 2008: Supervised topic models [PDF](http://papers.nips.cc/paper/3328-supervised-topic-models.pdf)
- Grimmer, 2013: Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts
[PDF](http://www.jstor.org/stable/pdf/24572662.pdf?casa_token=PnEPVj2gkkwAAAAA:_Vg_oSs-p6gtYvjJ3eEDUQB7UsakHQtBOdFIdeJxRpuGH5-7tq09fkUGxQ0-Bek5X2uOSya35-MoEo-cPo-K5DM1W-z1R0UppL6OqP53y6SNS7alAl8)
- Chuang, 2013: Topic Model Diagnostics: Assessing Domain Relevance via Topical Alignment
[PDF](http://vis.stanford.edu/files/2013-TopicModelDiagnostics-ICML.pdf)
[Sup](http://vis.stanford.edu/files/2013-TopicalModelDiagnostics-SuppMaterial.pdf)
- Lecture, Blei, 2009: [Video](http://videolectures.net/mlss09uk_blei_tm/) 
- Prof. David Blei - Probabilistic Topic Models and User Behavior [YouTube](https://www.youtube.com/watch?v=FkckgwMHP2s)
- PyData Berlin 2017 (Matti Lyra) [YouTube](https://www.youtube.com/watch?v=FkckgwMHP2s) [YouTube](https://www.youtube.com/watch?v=Dv86zdWjJKQ)
[NB](https://github.com/mattilyra/pydataberlin-2017/blob/master/notebook/EvaluatingUnsupervisedModels.ipynb)
- Probabilistic Topic Models: [PDF](https://pdfs.semanticscholar.org/01f3/290d6f3dee5978a53d9d2362f44daebc4008.pdf) [PDF](https://mimno.infosci.cornell.edu/info6150/readings/Blei2012.pdf)
- Visualizing Topic Models: [PDF](http://ajbc.io/projects/papers/ChaneyBlei2012.pdf)
- Topic models: [PDF](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.463.1205&rep=rep1&type=pdf#page=96)
- Sievert, LDAvis: A method for visualizing and interpreting topics [PDF](http://www.aclweb.org/anthology/W14-3110)
- Ted Underwood: [Blog](https://tedunderwood.com/category/methodology/topic-modeling/bayesian-topic-modeling/)
- Stanford Topic Modeling Toolbox: [Link](https://nlp.stanford.edu/software/tmt/tmt-0.4/)
- Blog, Naomi Saphra: Understanding Latent Dirichlet Allocation [Link](https://nsaphra.github.io/2012/07/09/LDA/)
- blog.bogatron.net: Visualizing Dirichlet Distributions with Matplotlib [Link](http://blog.bogatron.net/blog/2014/02/02/visualizing-dirichlet-distributions/)
- Wikipedia: [Topic Model](https://en.wikipedia.org/wiki/Topic_model) [LDA](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)
[Dirichlet distribution](https://en.wikipedia.org/wiki/Dirichlet_distribution)
- Visualization using dimensionality reduction (e.g. T-SNE, PCA) [Shusen Liu, 2016], (pitfalls)
http://qpleple.com/bib/#Chuang12
http://qpleple.com/word-relevance/
- Finding scientific topics: [PDF](http://psiexp.ss.uci.edu/research/papers/sciencetopics.pdf)

### Powered by
<img src="./images/powered_by.svg">