## Some sample LDA topic model visualizations

### Introduction

This notebook contains a number of **visualizations** that may help the researcher evaluate and analyse a topic model. Note that this is in all respects still **work in progress**. Evaluating a topic model is a challenge and must ultimately be done by a human, but there are some methods and computed metrics that can give the researcher hints. One goal is to find and test some of those metrics. Another goal is of course to give the researcher the **means to simply browse and explore** the model.

Some purposes (not specific to this page or to Jupyter):
- Support for testing different parameter settings
- Provide visualizations that help evaluate the model
- Test metrics that can indicate anomalies
- Support for exploring the model and its topics
- Make methods accessible
- Transparency of data and workflow (part of Open Science)
- Support for researchers who lack programming skills
- Shift decisions and interpretation from the technician to the researcher

There are many other tools and applications, both online and as downloadable packages, that can be used to visualize various perspectives of topic models. Some of these are Termite (Stanford), TensorBoard (Google), Hiérarchie, Word Embedding Visual Explorer ([online](http://residue3.sci.utah.edu/?)), wevi: word embedding visual inspector ([source](https://ronxin.github.io/wevi/), [paper](https://arxiv.org/abs/1411.2738)), LAMVI, Sentiview, Lexos and many more.

Topic modelling can be used for common NLP tasks such as language detection, language translation, sentiment analysis, PoS-tagging and dependency analysis. Its primary usage is perhaps document ranking, classification and querying. Another interesting use case is recommendation systems (such as the one Spotify uses, and as described in [link](http://arno.uvt.nl/show.cgi?fid=136352) and [link](https://github.com/mattdennewitz/playlist-to-vec)).

This is not a pipeline - more a way to share ideas and work between different projects that have similar technical needs. Humlab has a "textanalytic team" that meets once a month where researchers from various projects participate.

### Some References

- Blei, 2003: Latent dirichlet allocation [PDF](http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf)
- Blei, 2012: Probabilistic topic models [PDF](https://pdfs.semanticscholar.org/01f3/290d6f3dee5978a53d9d2362f44daebc4008.pdf)
- Blei, 2006: Dynamic topic models [PDF](http://repository.cmu.edu/cgi/viewcontent.cgi?article=2036&context=compsci)
- Mcauliffe, Blei, 2008: Supervised topic models [PDF](http://papers.nips.cc/paper/3328-supervised-topic-models.pdf)
- Grimmer, 2013: Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Politica
[PDF](http://www.jstor.org/stable/pdf/24572662.pdf?casa_token=PnEPVj2gkkwAAAAA:_Vg_oSs-p6gtYvjJ3eEDUQB7UsakHQtBOdFIdeJxRpuGH5-7tq09fkUGxQ0-Bek5X2uOSya35-MoEo-cPo-K5DM1W-z1R0UppL6OqP53y6SNS7alAl8)
- Chuang, 2013: Topic Model Diagnostics: Assessing Domain Relevance via Topical Alignment
[PDF](http://vis.stanford.edu/files/2013-TopicModelDiagnostics-ICML.pdf)
[Sup](http://vis.stanford.edu/files/2013-TopicalModelDiagnostics-SuppMaterial.pdf)
- Lecture, Blei, 2009: [Video](http://videolectures.net/mlss09uk_blei_tm/) 
- Prof. David Blei - Probabilistic Topic Models and User Behavior [YouTube](https://www.youtube.com/watch?v=FkckgwMHP2s)
- PyData Berlin 2017 (Matti Lyra) [YouTube](https://www.youtube.com/watch?v=FkckgwMHP2s) [YouTube](https://www.youtube.com/watch?v=Dv86zdWjJKQ)
[NB](https://github.com/mattilyra/pydataberlin-2017/blob/master/notebook/EvaluatingUnsupervisedModels.ipynb)
- Sievert, LDAvis: A method for visualizing and interpreting topics [PDF](http://www.aclweb.org/anthology/W14-3110)
- Ted Underwood: [Blog](https://tedunderwood.com/category/methodology/topic-modeling/bayesian-topic-modeling/)
- Stanford Topic Modeling Toolbox: [Link](https://nlp.stanford.edu/software/tmt/tmt-0.4/)
- Blog, Naomi Saphra: Understanding Latent Dirichlet Allocation [Link](https://nsaphra.github.io/2012/07/09/LDA/)
- blog.bogatron.net: Visualizing Dirichlet Distributions with Matplotlib [Link](http://blog.bogatron.net/blog/2014/02/02/visualizing-dirichlet-distributions/)
- Wikipedia: [Topic Model](https://en.wikipedia.org/wiki/Topic_model) [LDA](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)
[Dirichlet distribution](https://en.wikipedia.org/wiki/Dirichlet_distribution)
- Visualization using dimensionality reduction (e.g. T-SNE, PCA) [Shusen Liu, 2016], (pitfalls)
http://qpleple.com/bib/#Chuang12
http://qpleple.com/word-relevance/

### Jupyter Notebooks

#### About The Jupyter Project
On the project's web site at [jupyter.org](http://jupyter.org/) it is stated that
>“Project Jupyter exists to develop open-source software, open standards, and services for **interactive and reproducible computing**”.<br/><br/>
>“The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text.” <br/><br/>
><img src="./tm-data/images/narrative_new.svg" style="width: 300px;padding: 0; margin: 0;"><br/><br/>
>“Computational Narratives as the Engine of Collaborative Data Science”

The project is sponsored by large companies such as Google and Microsoft, and by funders such as the Alfred P. Sloan Foundation. See [jupyter.org/about](http://jupyter.org/about) for all sponsors.
Jupyter is a tool for **data science**, defined in Wikipedia as
> "...is an interdisciplinary field of scientific methods, processes, algorithms and systems to extract knowledge or insights from data in various forms, either structured or unstructured..."<br/>

Data science is a big “buzzword”, with an estimated 1.5 million positions outside academia in the US alone. It lies at the intersection of several advanced fields, and it is costly to acquire and maintain these skills. Much of the necessary skill and knowledge is built into (more or less) ready-to-use software libraries, but it is risky to use such libraries entirely as black boxes.
<img src="./tm-data/images/data-science_new.svg" style="width: 40%; padding: 0; margin: 0;">

#### Jupyter Overview

Jupyter Notebook is a **web application** that can either be run on a local computer or hosted in a multi-user environment. The main advantage of the latter is that the end user only needs a web browser to open and execute a notebook. When used locally, the user must install Python and Jupyter, as well as all the frameworks needed for the project (in the correct version of each). Server-hosted installations are therefore preferable in most cases.
<img src="./tm-data/images/jupyter_stack.svg" style="width: 30%;padding: 0; margin: 0;">

<i>**Fig**. Jupyter server</i>

#### Why use Jupyter Notebooks?

- Easy to learn, use and deploy, lots of people know Python.
- Ready to use online platform - trivial to create simple interactivity.
- Much faster (and cheaper) development - immediate feedback - agile, collaborative.
- Can defer decisions from developer to researcher. 
- The ability to combine data, narrative and code into an interactive user interface.
- Fits users with different tech skills; some researchers want to understand and be able to tune the logic.
- Supports Python, which has a big ecosystem of (open source) software libraries.
- Very popular; millions of notebooks exist on GitHub.


#### Methods and Technologies

The figure below shows some of the methods, frameworks and tools used in this topic modelling workflow. The so-called *technology stack* spans a large variety of domains and fields, where each single method or technology can have a rather steep learning curve. There are also many similar frameworks to choose from, many with overlapping or similar functionality. Each piece of functionality in turn has a number of elements that need to be configured for proper use. This requires both specific knowledge and usage experience in order to safely apply the methods and tools to the problem at hand. What's missing in the figure, and is important to note, is the *problem domain specific knowledge* required to interpret and validate the results of the topic models.

<img src="./tm-data/images/concept_tools_new.png" style="width: 50%; padding: 0; margin: 0;">
<center><i>**Fig**. High level tech stack</i></center>

This method and tool chain unavoidably requires a multitude of both major and minor decisions. It can even be difficult to determine whether a decision is minor or major. To some extent, black-box use is unavoidable, but this emphasizes the need for a proper validation process. The Jupyter Notebook helps transfer some of these decisions from the tech specialist to the end user, i.e. the researcher. It is a great advantage, and a hallmark of the Python ecosystem, that so many proven and battle-tested open source frameworks and tools exist.


#### Open Science as the New Normal

An important driving force behind the increased use of Jupyter Notebooks is the current "open science" movement. This is in part caused by the so called *reproducibility crisis*, and the *statistical crisis* (aka data dredging) in science. See
- [Presentation by Deevy Bishop](https://www.slideshare.net/deevybishop/what-is-the-reproducibility-crisis-in-science-and-what-can-we-do-about-it)
- [Replication Crisis](https://en.wikipedia.org/wiki/Replication_crisis)
- [An EU initiative for open science e-Learning](https://www.fosteropenscience.eu/)
- Simmons, J., L. Nelson, and U. Simonsohn. 2011. False-positive psychology
- [An article on the statistical crisis](https://www.americanscientist.org/article/the-statistical-crisis-in-science)

### Text Analysis Sample Flow

The figure below gives a view of a generic text analysis workflow, together with a few sample tasks for each step in the flow. Note that the tasks for a specific project depend on, for instance, the state and quality of the text at hand, the specific research question and the kind of text. (Note that this is not the researcher's workflow.)

<center><img src="./tm-data/images/text-analysis_sample_tasks.svg" style="width: 50%;padding: 0; margin: 0;"></center>

<center><i>**Fig**. Sample text analysis tasks</i></center>

This notebook focuses mostly on parts of the "Evaluate & Interpret" step, but also on visualisations that can be used in the "Narrate & Disseminate" step. Assessing the quality of a topic model is a qualitative process that requires a "human in the loop". The system can assist the researcher in a number of ways, with features such as:

* Easy way of browsing through topic-word distributions
* Easy way of browsing through document-topic distributions
* Intuitive ways of finding conceptual interpretations of topics

* Use of metrics to highlight suspect data
    * Display similarity of topics to known distributions (uniform distribution, mean corpus distribution etc.)
    * Display similar or overlapping topics and topic clusters (for some metric)
    * Display the ubiquity of topics
    * Display document clusters
    * Display topic-topic co-occurrence (within the same document)
    * Use reference documents that should have some expected topic?

This notebook contains sample implementations of some of these features. See also:
> - Reading tea leaves: how humans interpret topic models, Chang et al. (2009) <br/>
> - Machine Reading Tea Leaves: Automatically Evaluating Topic Coherence and Topic Model Quality, Lau et al. <br/>
> - http://dirichlet.net/pdf/wallach09evaluation.pdf
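
As a rough illustration of the metric-based checks listed above, the sketch below compares each topic's word distribution with the uniform distribution (using the Jensen-Shannon divergence) and computes the pairwise cosine similarity between topics. The matrix `topic_word` and both helper functions are hypothetical and are not part of this notebook's utility code; a real model would supply one row per topic over the full vocabulary.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    return unit @ unit.T

# Hypothetical toy model: 3 topics over a 4-word vocabulary (rows sum to 1).
topic_word = np.array([
    [0.25, 0.25, 0.25, 0.25],  # flat topic, identical to the uniform distribution
    [0.70, 0.20, 0.05, 0.05],  # peaked topic
    [0.65, 0.25, 0.05, 0.05],  # near-duplicate of the previous topic
])

uniform = np.full(topic_word.shape[1], 1.0 / topic_word.shape[1])
distance_to_uniform = np.array([js_divergence(row, uniform) for row in topic_word])
similarity = cosine_similarity_matrix(topic_word)

print(distance_to_uniform.round(3))
print(similarity.round(3))
```

A topic whose distance to the uniform distribution is close to zero carries little information, and a pair of topics with cosine similarity close to one is a candidate for being overlapping. Where to draw the thresholds is a judgment call for the researcher.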

#### Brief Instructions on How to Use Notebooks
Please see [add link] for an introduction to what Jupyter notebooks are and how to use them. In short, a notebook is a document with embedded executable code presented in a simple, easy-to-use web interface. The most important things to note are:
- Click on the menu Help -> User Interface Tour for an overview of the Jupyter Notebook App user interface.
- The **code cells** contain the script code (Python in this case, but other languages are also supported) and are the sections marked by **In [x]** in the left margin. A cell is marked as **In []** if it hasn't been executed, and as **In [n]** when it has been executed (n is an integer). A cell marked as **In [\*]** is either executing or waiting to be executed (i.e. other cells are executing).
- The **current cell** is highlighted with a blue (or green if in "edit" mode) border. You make a cell current by clicking on it.
- Code cells aren't executed automatically. Instead you execute the current cell by either pressing **shift+enter** or the **play** button in the toolbar. The output (or result) of a cell's execution is presented directly below the cell prefixed by **Out[n]**.
- The next cell will automatically be selected (made current) after a cell has been executed. Repeatedly pressing **shift+enter** or the play button hence executes the cells in sequence.
- You can run the entire notebook in a single step by clicking on the menu Cell -> Run All. Note that this can take some time to finish. You can see how cells are executed in sequence via the indicator in the margin (i.e. "In [\*]" changes to "In [n]" where n is an integer).
- The cells can be edited if they are double-clicked, in which case the cell border turns green. Use the ESC key to escape edit mode (or click on any other cell).

To restart the kernel (i.e. the computational engine assigned to your session), click on the menu Kernel -> Restart. 

In [1]:
#Code
from IPython.lib.display import YouTubeVideo
YouTubeVideo("h9S4kN4l5Is", width=400, height=300)

### Step 0: Setup and Initialize the Notebook (<span style='color:blue'>mandatory step</span>)
Import Python libraries and frameworks, and initialize the notebook.

In [2]:
# MANDATORY! Import dependencies and setup notebook
%run ./common/utility
%run ./common/model_utility
%run ./common/plot_utility
%run ./common/widgets_utility
%run ./common/network_utility
%run ./common/vectorspace_utility

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 
warnings.filterwarnings("ignore", category=FutureWarning) 

import os
import glob
import math
import types
import ipywidgets as widgets
import logging

logger = logging.getLogger('explore-topic-models')

import IPython
from pivottablejs import pivot_ui
from IPython.display import display, HTML, clear_output, IFrame
from itertools import product
# from IPython.core.interactiveshell import InteractiveShell
# InteractiveShell.ast_node_interactivity = "all"

%config IPCompleter.greedy=True
# %autosave 120

import bokeh.models as bm
import bokeh.palettes

from bokeh.io import output_file, push_notebook
from bokeh.core.properties import value, expr
from bokeh.transform import transform, jitter
from bokeh.layouts import row, column, widgetbox
from bokeh.plotting import figure, show, output_notebook, output_file
from bokeh.models.widgets import DataTable, DateFormatter, TableColumn

import numpy as np
import pandas as pd

#pd.set_option('display.height', 1000)
#pd.set_option('display.max_rows', 500)
#pd.set_option('display.max_columns', 500)
#pd.set_option('display.width', 1000)

TOOLS = "pan,wheel_zoom,box_zoom,reset,previewsave"
AGGREGATES = { 'mean': np.mean, 'sum': np.sum, 'max': np.max, 'std': np.std }

output_notebook()

pd.set_option('display.precision', 10)

### Step 1: Select LDA Model (<span style='color:blue'>mandatory step</span>)
Select one of the previously computed and prepared topic models that you want to use in subsequent steps. New models are computed in batch according to the following flow:
<img src="./tm-data/images/workflow-prepare.svg" style="width: 800px;">
The resulting model files, marked by the red box in the diagram, are made available for selection simply by uploading them into separate folders in ./data. The upload can be done with JupyterLab's upload feature.
 
Note that it can take some time (20-30 seconds) to load a model for the first time because of the large file sizes. Subsequent loads are much faster since the system extracts the data to CSV files. Also note that subsequent cells are NOT updated automatically when a new model is selected. Instead you must use the **play** button, or press **Shift-Enter**, to execute the current cell.
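
The "slow first load, fast subsequent loads" behaviour described above can be pictured with a generic cache-on-first-read pattern. The sketch below is only an illustration of that idea and makes no claim about how `ModelUtility` actually does it; the function names, file names and the toy file format are all assumptions.

```python
import csv
import os
import tempfile

def load_rows_with_csv_cache(source_path, cache_path, slow_loader):
    """Return (rows, from_cache). Rebuild the CSV cache when it is missing or stale."""
    if (os.path.isfile(cache_path)
            and os.path.getmtime(cache_path) >= os.path.getmtime(source_path)):
        with open(cache_path, newline='', encoding='utf-8') as f:
            return [row for row in csv.reader(f)], True
    rows = slow_loader(source_path)
    with open(cache_path, 'w', newline='', encoding='utf-8') as f:
        csv.writer(f).writerows(rows)
    return rows, False

def parse_model_file(path):
    # Stand-in for an expensive model load: one "topic_id weight" pair per line.
    with open(path, encoding='utf-8') as f:
        return [line.split() for line in f if line.strip()]

with tempfile.TemporaryDirectory() as folder:
    source = os.path.join(folder, 'model.txt')   # hypothetical model file
    cache = os.path.join(folder, 'model.csv')    # hypothetical CSV extract
    with open(source, 'w', encoding='utf-8') as f:
        f.write('0 0.75\n1 0.25\n')
    first_rows, first_cached = load_rows_with_csv_cache(source, cache, parse_model_file)
    second_rows, second_cached = load_rows_with_csv_cache(source, cache, parse_model_file)

print(first_cached, second_cached)
```

The first call parses the source file and writes the extract; the second call finds an up-to-date extract and reads it instead.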

In [11]:
# Hidden code: Select current model state
class ModelState:
    
    def __init__(self, data_folder):
        
        self.data_folder = data_folder
        self.basenames = ModelUtility.get_model_names(data_folder)
        self.basename = self.basenames[0]
        
    def set_model(self, basename=None):

        basename = basename or self.basename
        
        self.basename = basename
        self.topic_keys = ModelUtility.get_topic_keys(self.data_folder, basename)
        self.max_alpha = self.topic_keys.alpha.max()
        
        self.topic_overview = ModelUtility\
            .get_result_model_sheet(self.data_folder, basename, 'topic_tokens')
        
        self.document_topic_weights = ModelUtility\
            .get_result_model_sheet(self.data_folder, basename, 'doc_topic_weights')
        
        if 'Unnamed: 0' in self.document_topic_weights.columns:
            self.document_topic_weights = self.document_topic_weights.drop('Unnamed: 0', axis=1)
            
        self.topic_token_weights = ModelUtility\
            .get_result_model_sheet(self.data_folder, basename, 'topic_token_weights')
        
        if 'Unnamed: 0' in self.topic_token_weights.columns:
            self.topic_token_weights = self.topic_token_weights.drop('Unnamed: 0', axis=1)
            
        self._years = list(range(
            self.document_topic_weights.year.min(), self.document_topic_weights.year.max() + 1))
        self.min_year = min(self._years)
        self.max_year = max(self._years)
        self.years = [None] + self._years
        self.n_topics = self.document_topic_weights.topic_id.max() + 1
        # https://stackoverflow.com/questions/44561609/how-does-mallet-set-its-default-hyperparameters-for-lda-i-e-alpha-and-beta
        self.initial_alpha = 0.0  # 5.0 / self.n_topics if 'mallet' in state.basename else 1.0 / self.n_topics
        self.initial_beta = 0.0  # 0.01 if 'mallet' in basename else 1.0 / self.n_topics
        self._lda = None
        self.topic_tokens_as_text = None
        self.corpus_documents = None
        print("Current model: " + self.basename.upper())
        # _fix_topictokens()
        return self
    
    def get_document_topic_weights(self, year=None, topic_id=None):
        df = self.document_topic_weights
        if year is None and topic_id is None:
            return df
        if topic_id is None:
            return df[(df.year == year)]
        if year is None:
            return df[(df.topic_id == topic_id)]
        return df[(df.year == year)&(df.topic_id == topic_id)]
    
    def get_unique_topic_ids(self):
        return self.document_topic_weights['topic_id'].unique()
    
    def get_topic_weight_by_year_or_document(self, key='mean', year=None):
        pivot_column = 'year' if year is None else 'document_id'    
        df = self.get_document_topic_weights(year) \
            .groupby([pivot_column,'topic_id']) \
            .agg(AGGREGATES[key])[['weight']].reset_index()
        return df, pivot_column
    
    def get_topic_tokens_dict(self, topic_id, n_top=200):
        return self.get_topic_tokens(topic_id)\
            .sort_values(['weight'], ascending=False)\
            .head(n_top)[['token', 'weight']]\
            .set_index('token').to_dict()['weight']

    def compute_topic_terms_vector_space(self, n_words=100):
        '''
        Create an aligned topic-term vector space of the top n_words from each topic
        '''
        unaligned_vector_dicts = ( self.get_topic_tokens_dict(topic_id, n_words) for topic_id in range(0, self.n_topics) )
        X, feature_names = ModelUtility.compute_and_align_vector_space(unaligned_vector_dicts)
        return X, feature_names

    def get_lda(self):
        '''
        Get gensim model. Only used for pyLDAvis display
        '''
        if self._lda is None:
            filename = os.path.join(self.data_folder, self.basename, 'gensim_model_{}.gensim.gz'.format(self.basename))
            if os.path.isfile(filename):
                self._lda = LdaModel.load(filename)
                print('LDA model loaded...')
            else:
                print('LDA not found on disk...')
        return self._lda 
    
    def get_topics_tokens_as_text(self, n_words=100, cache=True):
        if cache and self.topic_tokens_as_text is not None:
            return self.topic_tokens_as_text
        topic_tokens_as_text = ModelUtility.get_topics_tokens_as_text(self.topic_token_weights, n_words=n_words)
        if cache:
            self.topic_tokens_as_text = topic_tokens_as_text
        return topic_tokens_as_text
    
    def get_topic_tokens(self, topic_id, max_n_words=500):
        tokens = self.topic_token_weights\
            .loc[lambda x: x.topic_id == topic_id]\
            .sort_values('weight',ascending=False)[:max_n_words]
        return tokens
    
    def get_topic_alphas(self):
        # Per-topic alpha values, as loaded into topic_keys by set_model()
        return self.topic_keys.alpha
    
    def get_topic_year_aggregate_weights(self, fn, threshold):
        df = self.document_topic_weights
        #df = df[(df.weight>=threshold)]
        df = df.groupby(['year', 'topic_id']).agg(fn)['weight'].reset_index()
        df = df[(df.weight>=threshold)]
        return df
    
    def get_topic_proportions(self):
        corpus_documents = self.get_corpus_documents()
        document_topic_weights = self.get_document_topic_weights()
        topic_proportion = ModelUtility.compute_topic_proportions(document_topic_weights, corpus_documents)
        return topic_proportion
    
    def get_corpus_documents(self):
        if self.corpus_documents is None:
            self.corpus_documents = ModelUtility.get_corpus_documents(self.data_folder, self.basename)
        return self.corpus_documents

state = ModelState('./tm-data')

wdg_basename = widgets.Dropdown(
    options=state.basenames,
    value=state.basename,
    description='Topic model',
    disabled=False,
    layout=widgets.Layout(width='75%')
)
wdg_model = widgets.interactive(state.set_model, basename=wdg_basename)
display(widgets.VBox((wdg_basename,) + (wdg_model.children[-1],)))
wdg_model.update()




### Step: Interpret the Alpha Hyperparameter

See [Probabilistic Topic Models](http://psiexp.ss.uci.edu/research/papers/SteyversGriffithsLSABookFormatted.pdf) for a description of the LDA hyperparameters. The **alpha** hyperparameter affects the sparsity of the document-topic distribution. The LDA model is said to be *symmetric* if the same alpha value is used for all topics, and *asymmetric* if it can vary per topic.

In the case of asymmetric topic modelling, high alphas can indicate topics that contain stopwords (common words), and low values can indicate bogus topics. This chart is of no value for symmetric models.

See also: [stackexchange Q&A](https://stats.stackexchange.com/questions/244917/what-exactly-is-the-alpha-in-the-dirichlet-distribution)
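
To get a feel for how alpha controls sparsity, the following sketch draws document-topic distributions from Dirichlet priors with different alpha values. It is independent of the selected model: the number of topics, the alpha values and the helper `mean_largest_share` are arbitrary choices made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_topics = 20
n_documents = 2000

def mean_largest_share(alpha_vector, n_samples):
    """Average weight of the single most prominent topic per sampled document."""
    samples = rng.dirichlet(alpha_vector, size=n_samples)
    return samples.max(axis=1).mean()

# Symmetric priors: the same alpha for every topic.
sparse_share = mean_largest_share(np.full(n_topics, 0.05), n_documents)
dense_share = mean_largest_share(np.full(n_topics, 5.0), n_documents)

# Asymmetric prior: topic 0 has a much larger alpha than the other topics.
asymmetric_alpha = np.full(n_topics, 0.1)
asymmetric_alpha[0] = 5.0
asymmetric_mean = rng.dirichlet(asymmetric_alpha, size=n_documents).mean(axis=0)

print('largest topic share, alpha=0.05:', round(sparse_share, 3))
print('largest topic share, alpha=5.0: ', round(dense_share, 3))
print('mean share of topic 0 (asymmetric):', round(asymmetric_mean[0], 3))
```

With a small symmetric alpha, each sampled document is dominated by one or a few topics; with a large alpha, the weight is spread over many topics. In the asymmetric case, the topic with the large alpha shows up in most documents, which is consistent with the remark above that high-alpha topics tend to collect common words.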


In [4]:
# Alpha / Lambda Plot
_topic_keys = ModelUtility.get_topic_keys(state.data_folder, state.basename)

def plot_alpha(df):

    source = bm.ColumnDataSource(df)
    p = figure(x_range=df.topic.values,plot_width=900, plot_height=400, title='',
               tools=TOOLS, toolbar_location="above")
    p.xaxis[0].axis_label = 'Topic'
    p.yaxis[0].axis_label = 'Alpha'
    p.xaxis.major_label_orientation = 1.0
    p.y_range.start = 0.0
    x_axis_type = 'enum'
    p.xgrid.visible = False

    glyph = bm.glyphs.VBar(x='topic', top='alpha', bottom=0, width=0.5, fill_color='color')
    cr = p.add_glyph(source, glyph)

    titles = ModelUtility.get_topic_titles(state.topic_token_weights, n_words=100)
    p.add_tools(bm.HoverTool(tooltips=None, callback=WidgetUtility.glyph_hover_callback(
        source, 'topic_id', titles.index, titles, 'alpha_plot'), renderers=[cr]))
        
    return p

def display_alpha(output_format, sort_by, window):
    global state
    palette = bokeh.palettes.PiYG[4]
    topic_keys = ModelUtility.get_topic_keys(state.data_folder, state.basename).reset_index()
    topic_keys = topic_keys[((topic_keys.alpha >= window[0]) & (topic_keys.alpha <= window[1]))].copy()
    topic_keys['topic'] = topic_keys.topic_id.apply(lambda x: str(x))
    topic_keys['color'] = palette[1]  # topic_keys.alpha.apply(lambda x: palette[1] if x >= state.initial_alpha else palette[2])
    if sort_by.lower() == 'alpha':
        topic_keys = topic_keys.sort_values('alpha', axis=0)
    if output_format == 'Chart':
        p = plot_alpha(topic_keys)
        show(p)
    else:
        source = bm.ColumnDataSource(topic_keys)
        columns = [
            TableColumn(field="topic_id", title="ID"),
            TableColumn(field="alpha", title="Alpha"),
            TableColumn(field="tokens", title="Tokens"),
        ]
        data_table = DataTable(source=source, columns=columns, width=950, height=600)
        show(widgetbox(data_table))
        
za = BaseWidgetUtility(
    text_id='alpha_plot',
    text=wf.create_text_widget('alpha_plot',default_value='Hover topics to display words!'),
    output_format=wf.create_select_widget('Format', ['Chart', 'Table'], default='Chart'),
    sort_by=wf.create_select_widget('Sort by', ['Topic', 'Alpha'], default='Alpha'),
    window=widgets.FloatRangeSlider(
        description='Window',
        min=0, max=state.max_alpha + 0.1,
        step=0.01,
        value=(0, state.max_alpha + 0.1),  # (state.initial_alpha, state.max_alpha + 0.1),
        continuous_update=False
    )
)
za.next_topic_id = za.create_next_id_button('topic_id', state.n_topics)

wa = widgets.interactive(
    display_alpha,
    output_format=za.output_format,
    sort_by=za.sort_by,
    window=za.window
)
za.text.layout = widgets.Layout(width='95%') #  , height='120px')
wa.children[-1].layout = widgets.Layout(width='98%')

display(widgets.VBox([
    za.text,
    widgets.HBox([za.output_format, za.window, za.sort_by]),
    widgets.HBox([wa.children[-1]])
]))
wa.update()



In [5]:
# Dir(alpha) test sample
def plot_dirichlet_alpha_sample(df):

    source = bm.ColumnDataSource(df)
    p = figure(x_range=df.topic.values,plot_width=900, plot_height=400, title='',
               tools=TOOLS, toolbar_location="above")
    p.xaxis[0].axis_label = 'Topic'
    p.yaxis[0].axis_label = 'Value'
    p.xaxis.major_label_orientation = 1.0
    p.y_range.start = 0.0
    x_axis_type = 'enum'
    p.xgrid.visible = False

    glyph = bm.glyphs.VBar(x='topic', top='value', bottom=0, width=0.5, fill_color='color')
    cr = p.add_glyph(source, glyph)

    titles = ModelUtility.get_topic_titles(state.topic_token_weights, n_words=100)
    p.add_tools(bm.HoverTool(tooltips=None, callback=WidgetUtility.glyph_hover_callback(
        source, 'topic_id', titles.index, titles, 'dirichlet_alpha_plot'), renderers=[cr]))
        
    return p

def display_dirichlet_alpha_draw():
    global state
    palette = bokeh.palettes.PiYG[4]
    topic_keys = ModelUtility.get_topic_keys(state.data_folder, state.basename).reset_index()
    topic_keys['topic'] = topic_keys.topic_id.apply(lambda x: str(x))
    topic_keys['color'] = palette[1] # topic_keys.alpha.apply(lambda x: palette[1] if x >= state.initial_alpha else palette[2])
    topic_keys['value'] = np.random.dirichlet(topic_keys.alpha)

    p = plot_dirichlet_alpha_sample(topic_keys)
    show(p)
        
zd = BaseWidgetUtility(
    text_id='dirichlet_alpha_plot',
    text=wf.create_text_widget('dirichlet_alpha_plot',default_value='Hover topics to display words!'),
)
zd.refresh_button = widgets.Button(
    description='Draw',
    disabled=False,
    button_style='', # 'success', 'info', 'warning', 'danger' or ''
    tooltip='Click me',
    icon='check'
)

def on_refresh_button_clicked(b):
    wd.update()
zd.refresh_button.on_click(on_refresh_button_clicked)

wd = widgets.interactive(display_dirichlet_alpha_draw)
zd.text.layout = widgets.Layout(width='95%')
wd.children[-1].layout = widgets.Layout(width='98%')

display(widgets.VBox([
    zd.text,
    widgets.HBox([zd.refresh_button]),
    widgets.HBox([wd.children[-1]])
]))
wd.update()




### Step: Analyse Documents' Topic-Weight Distribution
This graph displays **the distribution of document topic-weights** for the selected model. The X-axis shows the weight as a percentage between 0 and 100, and the Y-axis shows the number of document topic-weights at each (integer) percentage.
Not surprisingly, the vast majority (97-98%) of the weights are zero or close to zero.
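
The binning behind this chart can be sketched on synthetic data as follows. The weights below are drawn from a sparse Dirichlet prior purely for illustration (they are not taken from any model in this notebook), and `share_within` is a hypothetical helper that mirrors the percentage shown in the chart title.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical document-topic weights: 500 documents x 50 topics.
weights = rng.dirichlet(np.full(50, 0.02), size=500).ravel()

# Truncate each weight to an integer percentage and count occurrences per bin.
percent = (weights * 100).astype(int)
counts = np.bincount(percent, minlength=101)  # counts[k]: weights in [k%, k+1%)

def share_within(counts, low, high):
    """Percentage of all weights whose integer percentage lies in [low, high]."""
    return 100.0 * counts[low:high + 1].sum() / counts.sum()

share_near_zero = share_within(counts, 0, 1)
print('{0:.2f}% of the weights are below 2%'.format(share_near_zero))
```

Because every document's weights sum to one, a model with many topics necessarily has most of its document-topic weights near zero, so a heavy spike at the left edge of the chart is expected rather than alarming.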

In [6]:
# Topic Weight Distribution
topic_weights_distribution = state.get_document_topic_weights()
topic_weights_distribution['weight%'] = (topic_weights_distribution.weight * 100).astype('int')
topic_weights_distribution = topic_weights_distribution.groupby('weight%').size()
topic_weights_count = topic_weights_distribution.sum()

def display_topic_weights_distribution(p_range):
    global topic_weights_distribution
    selection = topic_weights_distribution.loc[p_range[0]:p_range[1]]
    title = '{0:.2f}% of all document-topic weights are within selected interval'\
          .format(100 * (selection.sum() / topic_weights_count))
    selection.plot(figsize=(12,6), title=title, kind='line', xlim=(0,100), ylim=(0,None))

p_range = widgets.SelectionRangeSlider(
    options=range(0,100), index=(0,99), description='Interval', continuous_update=False
)

w = widgets.interactive(display_topic_weights_distribution, p_range=p_range)
display(widgets.VBox(
    (p_range,) +
    (w.children[-1],)))
w.update()


### Step: Evaluate Topic Wordclouds

In [8]:
# Display LDA topic's token wordcloud
opts = { 'max_font_size': 100, 'background_color': 'white', 'width': 900, 'height': 600 }

zwc = BaseWidgetUtility(
    n_topics=state.n_topics,
    text_id='wc01',
    text=wf.create_text_widget('wc01'),
    topic_id=wf.topic_id_slider(state.n_topics),
    word_count=wf.word_count_slider(1, 500),
    output_format=wf.create_select_widget('Format', ['Wordcloud', 'List', 'Pivot'], default='Wordcloud'),
    progress = wf.create_int_progress_widget(min=0, max=4, step=1, value=0, layout=widgets.Layout(width="95%"))
)
zwc.prev_topic_id = zwc.create_prev_id_button('topic_id', state.n_topics)
zwc.next_topic_id = zwc.create_next_id_button('topic_id', state.n_topics)

def display_wordcloud(topic_id=0, n_words=100, output_format='Wordcloud'):
    global state, zwc
    zwc.progress.value = 1
    df_temp = state.topic_token_weights
    df_temp = df_temp.loc[(df_temp.topic_id == topic_id)]
    tokens = state.get_topics_tokens_as_text(n_words=n_words, cache=True).iloc[topic_id]
    zwc.progress.value = 2
    zwc.text.value = 'ID {}: {}'.format(topic_id, tokens)
    if output_format == 'Wordcloud':
        WordcloudUtility.plot_wordcloud(df_temp, 'token', 'weight', max_words=n_words, **opts)
    elif output_format == 'List':
        zwc.progress.value = 3
        df_temp = state.get_topic_tokens(topic_id, n_words)
        zwc.progress.value = 4
        display(HTML(df_temp.to_html()))
    else:
        display(pivot_ui(state.get_topic_tokens(topic_id, n_words)))
    zwc.progress.value = 0

iw = widgets.interactive(
    display_wordcloud,
    topic_id=zwc.topic_id,
    n_words=zwc.word_count,
    output_format=zwc.output_format)

zwc.text.layout = widgets.Layout(width='95%')

display(widgets.VBox(
    (zwc.text,) +
    (widgets.HBox((zwc.prev_topic_id,) + (zwc.next_topic_id,) + 
                  (zwc.topic_id,) + (zwc.word_count,) + (zwc.output_format,)),) +
    (zwc.progress,) +
    (iw.children[-1],)))
iw.update()



### Step: Evaluate and Interpret Topic's Word Distribution
The following chart shows the word distribution for each selected topic. You can zoom in on the left chart. The distribution seems to follow [Zipf's law](https://en.wikipedia.org/wiki/Zipf%27s_law) as (perhaps) expected.
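
One rough way to check the Zipf-like shape numerically is to fit a straight line to log(weight) against log(rank). The sketch below is only an illustration: `zipf_slope` and the synthetic `example_weights` are hypothetical names, not part of this notebook, and a real topic would supply its own token weights.

```python
import numpy as np

def zipf_slope(weights):
    """Fit log(weight) against log(rank) and return the slope of the line."""
    w = np.sort(np.asarray(weights, dtype=float))[::-1]  # descending by weight
    w = w[w > 0]                                         # log is undefined at 0
    ranks = np.arange(1, len(w) + 1)
    slope, _intercept = np.polyfit(np.log(ranks), np.log(w), 1)
    return slope

# Hypothetical topic whose token weights decay exactly as 1/rank
example_weights = 1.0 / np.arange(1, 101)
example_weights = example_weights / example_weights.sum()
example_slope = zipf_slope(example_weights)
print(round(example_slope, 3))
```

A slope near -1 on the log-log scale is the classic Zipf signature; slopes for real topics will usually deviate from it to some degree.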

In [9]:
# Display topic's word distribution

def plot_tokens(tokens, **args):
    
    source = ColumnDataSource(tokens)
    
    p = figure(toolbar_location="right", **args)

    cr = p.circle(x='xs', y='ys', source=source)

    label_style = dict(level='overlay', text_font_size='8pt', angle=np.pi/6.0)
    
    text_aligns = ['left', 'right']
    for i in [0, 1]:
        label_source = ColumnDataSource(tokens.iloc[i::2])
        labels = bm.LabelSet(x='xs', y='ys', text_align=text_aligns[i], text='token', text_baseline='middle',
                          y_offset=5*(1 if i == 0 else -1),
                          x_offset=5*(1 if i == 0 else -1),
                          source=label_source, **label_style)
        p.add_layout(labels)
    
    p.xaxis[0].axis_label = 'Token #'
    p.yaxis[0].axis_label = 'Probability%'
    p.ygrid.grid_line_color = None
    p.xgrid.grid_line_color = None
    p.axis.axis_line_color = None
    p.axis.major_tick_line_color = None
    p.axis.major_label_text_font_size = "6pt"
    p.axis.major_label_standoff = 0
    return p

    
def plot_topic_tokens_charts(tokens, flag=True):
    
    if flag:
        left = plot_tokens(tokens, plot_width=1000, plot_height=500, title='', tools='box_zoom,wheel_zoom,pan,reset')
        show(left)
        return
    
    left = plot_tokens(tokens, plot_width=450, plot_height=500, title='', tools='box_zoom,wheel_zoom,pan,reset')
    right = plot_tokens(tokens, plot_width=450, plot_height=500, title='', tools='pan')

    source = ColumnDataSource({'x':[], 'y':[], 'width':[], 'height':[]})
    left.x_range.callback = create_js_callback('x', 'width', source)
    left.y_range.callback = create_js_callback('y', 'height', source)

    rect = bm.Rect(x='x', y='y', width='width', height='height', fill_alpha=0.0, line_color='blue', line_alpha=0.4)
    right.add_glyph(source, rect)

    show(row(left, right))

def display_topic_tokens(topic_id=0, n_words=100, output_format='Wordcloud'):
    global state, g
    g.progress.value = 1
    tokens = (
        state.get_topic_tokens(topic_id=topic_id)
        .copy()
        .drop('topic_id', axis=1)
        .assign(weight=lambda x: 100.0 * x.weight)
        .sort_values('weight', axis=0, ascending=False)
        .reset_index()
        .head(n_words)
    )
    if output_format == 'Wordcloud':
        g.progress.value = 3
        tokens = tokens.assign(xs=tokens.index, ys=tokens.weight)
        plot_topic_tokens_charts(tokens)
        g.progress.value = 4
    elif output_format == 'List':
        #display(tokens)
        display(HTML(tokens.to_html()))
    else:
        display(pivot_ui(tokens))
    g.progress.value = 0
    
g = BaseWidgetUtility(
    n_topics=state.n_topics,
    text_id='wc01',
    text=wf.create_text_widget('wc01'),
    topic_id=wf.create_int_slider(description='Topic ID', min=0, max=state.n_topics - 1, step=1, value=0),
    word_count=wf.create_int_slider(description='Word count', min=1, max=500, step=1, value=100),
    output_format=wf.create_select_widget('Format', ['Wordcloud', 'List', 'Pivot'], default='Wordcloud'),
    progress = wf.create_int_progress_widget(min=0, max=4, step=1, value=0, layout=widgets.Layout(width="95%"))

)
g.prev_topic_id = g.create_prev_id_button('topic_id', state.n_topics)
g.next_topic_id = g.create_next_id_button('topic_id', state.n_topics)

w = widgets.interactive(
    display_topic_tokens, topic_id=g.topic_id, n_words=g.word_count, output_format=g.output_format
)

display(widgets.VBox(
    (g.text,) +
    (widgets.HBox((g.prev_topic_id,) + (g.next_topic_id,) + 
        (g.topic_id,) + (g.word_count,) + (g.output_format,)),) +
    (g.progress, ) +
    (w.children[-1],)))

w.update()



### Step: Evaluate and Interpret Topic's Share Over Time
Displays a specific topic's share over time, and lists the topic's terms in descending order (based on yearly mean weight over all documents). The *whisker* displays the max and mean topic weight for a given year.
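
The yearly aggregation behind this chart can be sketched with plain pandas. This is a minimal illustration under assumptions: the column names `year`, `topic_id` and `weight` mirror those used in the cell below, while `yearly_topic_share` and the `example` frame are hypothetical.

```python
import pandas as pd

def yearly_topic_share(document_topic_weights, topic_id):
    """Return mean, max and std of one topic's document weights per year."""
    rows = document_topic_weights[document_topic_weights.topic_id == topic_id]
    return rows.groupby('year')['weight'].agg(['mean', 'max', 'std']).reset_index()

# Hypothetical document-topic weights (one row per document and topic)
example = pd.DataFrame({
    'year':        [1950, 1950, 1950, 1951],
    'document_id': [0,    1,    1,    2],
    'topic_id':    [0,    0,    1,    0],
    'weight':      [0.2,  0.4,  0.6,  0.1],
})
share = yearly_topic_share(example, topic_id=0)
print(share)
```

Years with a single document yield `NaN` for the standard deviation, which is worth keeping in mind when plotting whiskers.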

In [13]:
# Plot a topic's yearly weight over time in selected LDA topic model

def plot_topic_over_time(df, pivot_column, value_column, topic_id=0, year=None, whisker=False):

    source = ColumnDataSource(df)
    p = figure(plot_width=1000, plot_height=400, title='', tools=TOOLS, toolbar_location="right")
    p.xaxis[0].axis_label = pivot_column.title()
    p.yaxis[0].axis_label = value_column.title() + (' weight' if value_column != 'weight' else '')
    p.y_range.start = 0.0
    p.y_range.end = 1.0

    day_width = 60*60*24*1000
    glyph = bm.glyphs.VBar(x=pivot_column, top=value_column, bottom=0, width=1, fill_color="#b3de69")
    p.add_glyph(source, glyph)
    if whisker and year is None:
        p.add_layout(
            bm.Whisker(source=source, base=pivot_column, upper="max", lower=value_column)
        )
    #if not year is None: print(df_temp[['index', 'document', 'topic_id', 'weight']])
    return p

def display_topic_over_time(topic_id, year, value_column):
    global state, zj

    tokens = state.get_topics_tokens_as_text(n_words=200, cache=True).iloc[topic_id]
    zj.text.value = 'ID {}: {}'.format(topic_id, tokens)

    pivot_column = 'year' if year is None else 'document_id'
    value_column = value_column if year is None else 'weight'
    
    df = state.document_topic_weights[(state.document_topic_weights.topic_id==topic_id)]
    
    if year is None:
        df = df.groupby([pivot_column, 'topic_id']).agg([np.mean, np.max, np.std])['weight'].reset_index()
        df.columns = ['year', 'topic_id', 'mean', 'max', 'std']
    else:
        df = df[(df.year==year)]
        
    p = plot_topic_over_time(df, pivot_column, value_column, topic_id, year,  False)
    show(p)
    
zj = BaseWidgetUtility(
    n_topics=state.n_topics,
    text_id='topic_share_plot',
    text=wf.create_text_widget('topic_share_plot'),
    year=wf.create_select_widget('Year', options=state.years),
    topic_id=wf.create_int_slider(description='Topic ID', min=0, max=state.n_topics - 1, step=1, value=0),
    output_format=wf.create_select_widget('Format', ['Wordcloud', 'List', 'Pivot'], default='Wordcloud'),
    progress=wf.create_int_progress_widget(min=0, max=4, step=1, value=0, layout=widgets.Layout(width="95%")),
    aggregate=wf.create_select_widget('Aggregate', list(AGGREGATES.keys()), 'max')

)
zj.prev_topic_id = zj.create_prev_id_button('topic_id', state.n_topics)
zj.next_topic_id = zj.create_next_id_button('topic_id', state.n_topics)

wj = widgets.interactive(
    display_topic_over_time,
    topic_id=zj.topic_id,
    year=zj.year,
    value_column=zj.aggregate
)

display(widgets.VBox(
    (zj.text,) + 
    (widgets.HBox((zj.prev_topic_id,) + (zj.next_topic_id,) + (zj.topic_id,) + (zj.year,) + (zj.aggregate,)),) + 
    (zj.progress,) + 
    (wj.children[-1],)))
wj.update()



### Step: Analyse Highest Ranked Topics over Time
Display topic shares in descending order as a stacked bar chart. Order is based on selected aggregate function.
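
The selection of the highest ranked topics can be sketched as a pivot followed by a ranking on the column totals. The function name `top_topic_shares` and the sample data are hypothetical, and the notebook's own helper (`ModelUtility.get_document_topic_weights_pivot`) may differ in detail.

```python
import pandas as pd

def top_topic_shares(document_topic_weights, n_topics, aggfunc='mean'):
    """Pivot to a year x topic table and keep the n topics with the largest totals."""
    pivot = document_topic_weights.pivot_table(
        index='year', columns='topic_id', values='weight',
        aggfunc=aggfunc, fill_value=0.0
    )
    ranking = pivot.sum().sort_values(ascending=False)  # total share per topic
    return pivot[ranking.index[:n_topics]]              # columns in ranked order

# Hypothetical weights: topic 2 dominates, followed by topic 0
example = pd.DataFrame({
    'year':     [1950, 1950, 1950, 1951, 1951, 1951],
    'topic_id': [0,    1,    2,    0,    1,    2],
    'weight':   [0.3,  0.1,  0.6,  0.2,  0.1,  0.7],
})
top = top_topic_shares(example, n_topics=2)
print(top)
```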

In [14]:
# Plot topic shares (year aggregate or per document for selected year)

def prepare_stacked_topic_share_data(key, n_topics, year):
    global state
    pivot_column = 'year' if year is None else 'document_id'
    
    df_data = state.get_document_topic_weights(year)

    df = ModelUtility.get_document_topic_weights_pivot(df_data, AGGREGATES[key], pivot_column)
    df.set_index(pivot_column, inplace=True)
    #print(df)
    #n_topics = min(len(df.columns), n_topics)
    
    if False:
        topic_ids = set([])
        for row in df.iterrows():
            topic_ids = set(list(topic_ids) + list(row[1].nlargest(n_topics).index))
            
        #df[df.columns].sort_values(axis=0, ascending=False)
        print(topic_ids)
    else:
        topic_toplist = df[df.columns].sum().sort_values(axis=0, ascending=False)
        df_top = df[topic_toplist[:n_topics].index].copy()

    df = df_top.reset_index()
    df.columns = [ str(x) for x in  df.columns ]
    
    return df, pivot_column, n_topics

def generate_category_colors(n_items, palette=bokeh.palettes.Category20[20]):
    ''' Repeat palette to get n_items colors '''
    colors = (((n_items // len(palette)) + 1) * palette)[:n_items]
    return colors

def plot_stacked_bar_of_topic_over_time(df, pivot_column, key='mean', n_topics=3, year=None, n_words=100):
    
    categories = list(df.columns[1:])
    colors = generate_category_colors(n_topics)
    source = ColumnDataSource(df)
    
    p = figure(plot_width=900, plot_height=800, title=state.basename, tools=TOOLS, toolbar_location="right")
    
    p.xaxis[0].axis_label = key.title() + ' weight'
    p.yaxis[0].axis_label = pivot_column.title()
    
    #legend = [ value(x) for x in categories ]
    #p.hbar_stack(categories, y=pivot_column, source=source, color=colors, height=0.5, legend=legend)
        
    bottoms, tops = [], []
    for i, category in enumerate(categories):
        tops = tops + [category]
        cr = p.hbar(y=pivot_column,
                    left=expr(bm.expressions.Stack(fields=bottoms)),
                    right=expr(bm.expressions.Stack(fields=tops)),
                    color=colors[i],
                    height=0.5,
                    source=source,
                    legend='Topic ' + str(category))
        topic_id = int(category)
        tooltip = 'ID {}: {}'.format(topic_id, state.get_topics_tokens_as_text(n_words=200, cache=True).iloc[topic_id])
        p.add_tools(bm.HoverTool(tooltips=tooltip, renderers=[cr]))
        bottoms = bottoms + [category]
            
    return p

def display_stacked_bar_of_topic_over_time(key='max', n_topics=3, year=None, output_format='Chart'):
    
    global state
    
    ''' Prepare the plot data '''
    
    df, pivot_column, n_topics = prepare_stacked_topic_share_data(key, n_topics, year)
    
    if output_format == 'Chart':
        p = plot_stacked_bar_of_topic_over_time(df, pivot_column, key, n_topics, year)
        show(p)
    elif output_format == 'List':
        #display(tokens)
        display(HTML(df.to_html()))
    else:
        display(pivot_ui(df))
        
zh = BaseWidgetUtility(
    n_topics=state.n_topics,
    text_id='topic_share_plot',
    text=wf.create_text_widget('topic_share_plot'),
    year=wf.create_select_widget('Year', options=state.years),
    topics_count=wf.create_int_slider(description='Topic count', min=1, max=state.n_topics, step=1, value=3),
    output_format=wf.create_select_widget('Format', ['Chart', 'List', 'Pivot'], default='Chart'),
    progress=wf.create_int_progress_widget(min=0, max=4, step=1, value=0, layout=widgets.Layout(width="95%")),
    aggregate=wf.create_select_widget('Aggregate', list(AGGREGATES.keys()), 'max')
)    
wh = widgets.interactive(
    display_stacked_bar_of_topic_over_time, n_topics=zh.topics_count,
    key=zh.aggregate, year=zh.year, output_format=zh.output_format
)

display(widgets.VBox(
    (zh.text,) + 
    (widgets.HBox((zh.aggregate,) + (zh.topics_count,) + (zh.year,) + (zh.output_format,)),) + 
    (zh.progress,) + 
    (wh.children[-1],)))

wh.update()



### Step: Analyse Topic Shares per Year or Document
The topic shares are displayed as a scatter plot using a gradient color based on each topic's weight.
This is a common visualization of topic models; see for instance Stanford’s Termite software (http://vis.stanford.edu/papers/termite).
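
The idea of a weight-based gradient can be illustrated with a plain-Python linear binning of weights onto a palette. This is only a sketch of the principle and makes no claim about how Bokeh's color mapper is implemented; `weight_to_color` is a hypothetical helper, and the three colors are taken from the palette in the cell below.

```python
def weight_to_color(weight, low, high, palette):
    """Map a weight in [low, high] linearly onto one of the palette colors."""
    if high <= low:
        return palette[0]                       # degenerate range: single color
    fraction = (weight - low) / (high - low)
    index = int(fraction * len(palette))
    index = max(0, min(index, len(palette) - 1))  # clamp to a valid position
    return palette[index]

# Light color for weak topics, dark color for strong ones
example_palette = ['#efefef', '#cc7878', '#550b1d']
example_colors = [weight_to_color(w, 0.0, 1.0, example_palette) for w in (0.0, 0.5, 1.0)]
print(example_colors)
```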

In [15]:
# plot_topic_relevance_by_year

def setup_glyph_coloring(df):
    max_weight = df.weight.max()
    #colors = list(reversed(bokeh.palettes.Greens[9]))
    colors = ["#efefef", "#75968f", "#a5bab7", "#c9d9d3", "#e2e2e2", "#dfccce", "#ddb7b1", "#cc7878",
              "#933b41", "#550b1d"]
    mapper = bm.LinearColorMapper(palette=colors, low=df.weight.min(), high=max_weight)
    color_transform = transform('weight', mapper)
    color_bar = bm.ColorBar(color_mapper=mapper, location=(0, 0),
                         ticker=bm.BasicTicker(desired_num_ticks=len(colors)),
                         formatter=bm.PrintfTickFormatter(format=" %5.2f"))
    return color_transform, color_bar

def plot_topic_relevance_by_year(df, xs, ys, glyph, titles, text_id):

    ''' Setup axis categories '''
    x_range = list(map(str, df[xs].unique()))
    y_range = list(map(str, df[ys].unique()))
    
    ''' Setup coloring and color bar '''
    color_transform, color_bar = setup_glyph_coloring(df)
    
    source = ColumnDataSource(df)

    plot_height = max(len(y_range) * 6, 500)
    p = figure(title="Topic heatmap", tools=TOOLS, toolbar_location="right", x_range=x_range,
           y_range=y_range, x_axis_location="above", plot_width=900, plot_height=plot_height)

    args = dict(x=xs, y=ys, source=source, alpha=1.0, hover_color='red')
    
    if glyph == 'Circle':
        cr = p.circle(color=color_transform, **args)
    else:
        cr = p.rect(width=1, height=1, line_color=None, fill_color=color_transform, **args)

    p.x_range.range_padding = 0
    p.ygrid.grid_line_color = None
    p.xgrid.grid_line_color = None
    p.axis.axis_line_color = None
    p.axis.major_tick_line_color = None
    p.axis.major_label_text_font_size = "5pt"
    p.axis.major_label_standoff = 0
    p.xaxis.major_label_orientation = 1.0
    p.add_layout(color_bar, 'right')
    
    p.add_tools(bm.HoverTool(tooltips=None, callback=WidgetUtility.glyph_hover_callback(
        source, 'topic_id', titles.index, titles, text_id), renderers=[cr]))
    
    return p
    
def display_topic_relevance_by_year(key='max', year=None, glyph='Circle'):
    global state, zo
    zo.progress.value = 1
    titles = ModelUtility.get_topic_titles(state.topic_token_weights, n_words=100)
    zo.progress.value = 2
    df, pivot_column = state.get_topic_weight_by_year_or_document(key=key, year=year)
    zo.progress.value = 3
    df[pivot_column] = df[pivot_column].astype(str)
    df['topic_id'] = df.topic_id.astype(str)
    zo.progress.value = 4
    p = plot_topic_relevance_by_year(df, xs=pivot_column, ys='topic_id', glyph=glyph,
                                     titles=titles, text_id='topic_relevance')
    show(p)
    zo.progress.value = 0
    
#u = TopTopicWidgets(0, state.years, aggregates=list(AGGREGATES.keys()), text_id='topic_relevance')

zo = BaseWidgetUtility(
    text_id='topic_relevance',
    text=wf.create_text_widget('topic_relevance'),
    year=wf.create_select_widget('Year', options=state.years),
    # output_format=wf.create_select_widget('Format', ['List', 'Pivot'], default='List'),
    aggregate=wf.create_select_widget('Aggregate', list(AGGREGATES.keys()), 'max'),
    progress=wf.create_int_progress_widget(min=0, max=4, step=1, value=0, layout=widgets.Layout(width="95%"))
) 

zo.glyph = widgets.Dropdown(options=['Circle', 'Square'], value='Square', description='Glyph', disabled=False)

wo = widgets.interactive(display_topic_relevance_by_year, key=zo.aggregate, year=zo.year, glyph=zo.glyph)

display(widgets.VBox(
    (widgets.HBox((zo.aggregate,) + (zo.glyph,) + (zo.year,)),) +
    (zo.progress,) +
    (zo.text,) +
    (wo.children[-1],)))
        
wo.update()


### Step: Analyse Topic-to-document Association Network
The green nodes are documents, and the blue nodes are topics. The edges (lines) indicate the strength of a topic in the connected document, and the width of an edge is proportional to that strength. Note that only edges with a strength above a certain threshold are displayed.
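
The filtering and edge-width logic can be sketched without any graph library. The helper `topic_document_edges`, its `width_scale` parameter and the sample data are assumptions made for illustration; the notebook itself builds the graph through `NetworkUtility`.

```python
import pandas as pd

def topic_document_edges(document_topic_weights, threshold=0.10, width_scale=6.0):
    """Keep document-topic pairs at or above the threshold and derive edge widths."""
    strong = document_topic_weights.weight >= threshold
    edges = document_topic_weights[strong].copy()
    edges['width'] = edges.weight * width_scale   # width proportional to strength
    return edges[['document', 'topic_id', 'weight', 'width']].reset_index(drop=True)

# Hypothetical weights: the weak link between document A and topic 1 is dropped
example = pd.DataFrame({
    'document': ['A', 'A', 'B'],
    'topic_id': [0,   1,   1],
    'weight':   [0.5, 0.05, 0.25],
})
edges = topic_document_edges(example, threshold=0.10)
print(edges)
```

Each remaining row corresponds to one edge of the bipartite graph, with the document on one side and the topic on the other.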

In [16]:
# Visualize year-to-topic network by means of topic-document-weights
     
def plot_topic_year_network(network, layout, scale=1.0, titles=None):

    year_nodes, topic_nodes = NetworkUtility.get_bipartite_node_set(network, bipartite=0)  
    
    year_source = NetworkUtility.get_node_subset_source(network, layout, year_nodes)
    topic_source = NetworkUtility.get_node_subset_source(network, layout, topic_nodes)
    lines_source = NetworkUtility.get_edges_source(network, layout, scale=6.0, normalize=False)
    
    edges_alphas = NetworkMetricHelper.compute_alpha_vector(lines_source.data['weights'])
    
    lines_source.add(edges_alphas, 'alphas')
    
    p = figure(plot_width=1000, plot_height=600, x_axis_type=None, y_axis_type=None, tools=TOOLS)
    
    r_lines = p.multi_line(
        'xs', 'ys', line_width='weights', alpha='alphas', color='black', source=lines_source
    )
    r_years = p.circle(
        'x','y', size=40, source=year_source, color='lightgreen', level='overlay', line_width=1,alpha=1.0
    )
    
    r_topics = p.circle('x','y', size=25, source=topic_source, color='skyblue', level='overlay', alpha=1.00)
    
    p.add_tools(bm.HoverTool(renderers=[r_topics], tooltips=None, callback=WidgetUtility.\
        glyph_hover_callback(topic_source, 'node_id', text_ids=titles.index, text=titles, element_id='nx_id1'))
    )

    text_opts = dict(
        x='x', y='y', text='name', level='overlay',
        x_offset=0, y_offset=0, text_font_size='8pt'
    )
    
    p.add_layout(
        bm.LabelSet(
            source=year_source, text_color='black', text_align='center', text_baseline='middle', **text_opts
        )
    )
    p.add_layout(
        bm.LabelSet(
            source=topic_source, text_color='black', text_align='center', text_baseline='middle', **text_opts
        )
    )
    
    return p
    
def display_topic_year_network(
    layout_algorithm, threshold=0.10, scale=1.0, year=None, output_format='Network'
):
    global state, zn
    zn.progress.value = 1
    titles = state.get_topics_tokens_as_text()
    df = state.get_document_topic_weights(year=year, topic_id=None)
    df = df[(df.weight >= threshold)]
    zn.progress.value = 2
    
    network = NetworkUtility.create_bipartite_network(df, 'document', 'topic_id')
    zn.progress.value = 3
        
    if output_format == 'Network':
        args = PlotNetworkUtility.layout_args(layout_algorithm, network, scale)
        layout = (layout_algorithms[layout_algorithm])(network, **args)
        zn.progress.value = 4
        p = plot_topic_year_network(network, layout, scale=scale, titles=titles)
        show(p)

    elif output_format == 'List':
        display(HTML(df.to_html()))
    else:
        display(pivot_ui(df))
        
    zn.progress.value = 0

zn = BaseWidgetUtility(
    n_topics=state.n_topics,
    text_id='nx_id1',
    text=wf.create_text_widget('nx_id1'),
    year=wf.create_int_slider(
        description='Year', min=state.min_year, max=state.max_year, step=1, value=state.min_year
    ),
    scale=wf.create_float_slider('Scale', min=0.0, max=1.0, step=0.01, value=0.1),
    threshold=wf.create_float_slider('Threshold', min=0.0, max=1.0, step=0.01, value=0.10),
    output_format=wf.create_select_widget('Format', ['Network', 'List', 'Pivot'], default='Network'),
    layout=wf.create_select_widget('Layout', list(layout_algorithms.keys()), default='Fruchterman-Reingold'),
    progress=wf.create_int_progress_widget(min=0, max=4, step=1, value=0, layout=widgets.Layout(width="95%"))
) 
zn.previous = zn.create_prev_id_button('year', 10000)
zn.next = zn.create_next_id_button('year', 10000)

wn = widgets.interactive(
    display_topic_year_network, layout_algorithm=zn.layout,
    threshold=zn.threshold, scale=zn.scale,
    year=zn.year, output_format=zn.output_format
)

display(widgets.VBox(
    (zn.text, ) +
    (widgets.HBox((zn.layout, ) + (zn.year,) + (zn.previous,) + (zn.next,)),) +
    (widgets.HBox((zn.threshold,) + (zn.scale,) + (zn.output_format,)),) +
    (zn.progress, ) +
    (wn.children[-1],)))

wn.update()


### Step: Analyse Topic Co-Occurrence
Computes a weighted graph of topics co-occurring in the same document. Two topics are defined as co-occurring if they both exist in the same document and both have weights above the threshold. The edge weight is the number of co-occurrences (each document counts as a binary yes or no). Node size reflects topic proportions over the entire corpus (normalized by document length), and is computed in the same way as node sizes in LDAvis.
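
The co-occurrence count can be sketched as a self-join on the document id, following the same approach as the cell below. The function name `topic_cooccurrence` and the sample data are hypothetical.

```python
import pandas as pd

def topic_cooccurrence(document_topic_weights, threshold):
    """Count, per topic pair, the documents in which both topics reach the threshold."""
    strong = document_topic_weights[document_topic_weights.weight >= threshold]
    strong = strong[['document_id', 'topic_id']]
    pairs = strong.merge(strong, on='document_id', suffixes=('_x', '_y'))
    pairs = pairs[pairs.topic_id_x < pairs.topic_id_y]   # keep each pair once
    counts = pairs.groupby(['topic_id_x', 'topic_id_y']).size().reset_index(name='weight')
    counts.columns = ['source', 'target', 'weight']
    return counts

# Hypothetical weights: topics 0 and 1 co-occur in two documents, 1 and 2 in one
example = pd.DataFrame({
    'document_id': [1,   1,   1,    2,   2,   3,   3],
    'topic_id':    [0,   1,   2,    0,   1,   1,   2],
    'weight':      [0.5, 0.4, 0.05, 0.6, 0.4, 0.5, 0.5],
})
cooccurrence = topic_cooccurrence(example, threshold=0.35)
print(cooccurrence)
```

The `topic_id_x < topic_id_y` filter removes self-pairs and mirrored duplicates, so each undirected edge appears exactly once.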

In [17]:
# Visualize topic co-occurrence
%run ./common/plot_utility
G = None
def display_topic_co_occurrence_network(layout, threshold, scale, output_format):

    global state, zn
    try:
        metric = 'Threshold'
        titles = state.get_topics_tokens_as_text()

        if metric == 'Threshold':
            df = state.get_document_topic_weights()
            df = df.loc[(df.weight >= threshold)]
            df = pd.merge(df, df, how='inner', left_on='document_id', right_on='document_id')
            df = df.loc[(df.topic_id_x < df.topic_id_y)]
            df = df.groupby([df.topic_id_x, df.topic_id_y]).size().reset_index()
            df.columns = ['source', 'target', 'weight']

        if output_format == 'Network':
            network = NetworkUtility.create_network(df, source_field='source', target_field='target', weight='weight')
            p = PlotNetworkUtility.plot_network(
                network=network,
                layout_algorithm=layout,
                scale=scale,
                threshold=0.0,
                node_description=state.get_topics_tokens_as_text(),
                node_proportions=state.get_topic_proportions(),
                weight_scale=10.0,
                normalize_weights=True,
                element_id='cooc_id',
                figsize=(900,500)
            )
            show(p)
        elif output_format == 'List':
            display(HTML(df.to_html()))
        else:
            display(pivot_ui(df))
    except Exception as x:
        print("No data: please adjust filters")
        
zn = BaseWidgetUtility(
    n_topics=state.n_topics,
    text_id='cooc_id',
    text=wf.create_text_widget('cooc_id'),
    scale=wf.create_float_slider('Scale', min=0.0, max=1.0, step=0.01, value=0.1),
    threshold=wf.create_float_slider('Threshold', min=0.0, max=1.0, step=0.01, value=0.35),
    output_format=wf.create_select_widget('Format', ['Network', 'List', 'Pivot'], default='Network'),
    layout=wf.create_select_widget('Layout', list(layout_algorithms.keys()), default='Fruchterman-Reingold'),
    progress=wf.create_int_progress_widget(min=0, max=4, step=1, value=0, layout=widgets.Layout(width="95%"))
) 

wn = widgets.interactive(
    display_topic_co_occurrence_network,
    layout=zn.layout,
    threshold=zn.threshold,
    scale=zn.scale,
    output_format=zn.output_format
)

display(widgets.VBox(
    (zn.text, ) +
    (widgets.HBox((zn.layout, )),) +
    (widgets.HBox((zn.threshold,) + (zn.scale,) + (zn.output_format,)),) +
    (zn.progress, ) +
    (wn.children[-1],)))

wn.update()


### Step: Analyse Topic Similarity Network
This plot displays topic similarity based on **Euclidean or cosine distances** between the **topic-to-word vectors**. Please note that the computations can take some time to execute, especially for larger LDA models.

1. Compute a multidimensional topic vector space based on the top n words for each topic. Since the subsets of words differ, and their positions differ between topics, they need to be aligned in a common space so that 1) each vector has the same dimension (i.e. the number of unique top n tokens over all topics) and 2) each token has the same position within that space (using sklearn DictVectorizer).
2. Reduce the topic vector space into a 2D space (using sklearn PCA)
3. Normalize the 2D space (sklearn Normalizer)

- TODO: Save network to file (either via pandas or networkx)
- TODO: Should partition/community be computed before or after network is filtered?

Note: Steps 1 to 3 above (the most time consuming) are executed whenever an option marked with an asterisk is changed.
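
Steps 1 to 3 can be sketched with the scikit-learn classes named above. The three tiny topics in `topic_top_words` are made up for illustration, and the notebook's own `state.compute_topic_terms_vector_space` may differ in detail.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.decomposition import PCA
from sklearn.preprocessing import Normalizer

# Hypothetical top words per topic, as {token: weight} dictionaries
topic_top_words = [
    {'war': 0.5, 'army': 0.3, 'peace': 0.2},
    {'peace': 0.6, 'treaty': 0.4},
    {'trade': 0.7, 'treaty': 0.3},
]

# Step 1: align all topics in one space with a column per unique token
vectorizer = DictVectorizer(sparse=False)
X_n_space = vectorizer.fit_transform(topic_top_words)

# Step 2: reduce the aligned space to two dimensions
X_2d = PCA(n_components=2).fit_transform(X_n_space)

# Step 3: scale each topic vector to unit length
X_2d_norm = Normalizer().fit_transform(X_2d)

print(sorted(vectorizer.vocabulary_), X_n_space.shape, X_2d_norm.shape)
```

Tokens missing from a topic's top n list get a weight of zero in the aligned space, which is what makes the vectors comparable across topics.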

In [18]:
# Visualization

# if 'zy_data' not in globals():
zy_data = types.SimpleNamespace(
    basename=None,
    network=None,
    X_n_space=None,
    X_n_space_feature_names=None,
    distance_matrix=None,
    metric=None,
    topic_proportions=None,
    n_words = 0
)

    
def plot_clustering_dendogram(clustering):
    plt.figure(figsize=(16,6))
    # https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.cluster.hierarchy.dendrogram.html
    R = dendrogram(clustering)
    plt.show()
    plt.close()

def VectorSpaceHelper_compute_distance_matrix(X_n_space, metric='euclidean'):
    # https://se.mathworks.com/help/stats/pdist.html
    metric = metric.lower()
    if metric == 'kullback–leibler': metric = VectorSpaceHelper.kullback_leibler_divergence
    if metric == 'scipy.stats.entropy': metric = scipy.stats.entropy
    #print(metric)
    X = X_n_space.toarray() if hasattr(X_n_space, 'toarray') else X_n_space
    #X_n_space += 0.00001
    distances = distance.pdist(X, metric=metric)
    #print(distances)
    distance_matrix = distance.squareform(distances)
    #print(distance_matrix)    
    return distance_matrix
    
def display_correlation_network(
    layout_algorithm,
    threshold=0.10,
    scale=1.0,
    metric='Euclidean',
    n_words=200,
    output_format='Network'
):
    global state, zy_data, zy

    try:

        zy.progress.value = 1
        metric = DISTANCE_METRICS[metric]

        node_description = state.get_topics_tokens_as_text()
        node_proportions = state.get_topic_proportions()

        zy.progress.value = 2
        if zy_data.network is None or state.basename != zy_data.basename or zy_data.metric != metric or zy_data.n_words != n_words:

            zy_data.basename = state.basename
            zy_data.n_words = n_words
            zy_data.X_n_space, zy_data.X_n_space_feature_names = state.compute_topic_terms_vector_space(n_words=n_words)
            
            #print(zy_data.X_n_space.shape)
            #print(zy_data.X_n_space_feature_names)
            zy.progress.value = 3
            zy_data.distance_matrix = VectorSpaceHelper_compute_distance_matrix(zy_data.X_n_space, metric=metric)
            zy_data.network = None

        edges_data = VectorSpaceHelper.lower_triangle_iterator(zy_data.distance_matrix, threshold)

        zy.progress.value = 4
        if output_format == 'List':
            df = pd.DataFrame(edges_data, columns=['x', 'y', 'weight'])
            zy.progress.value = 5
            display(HTML(df.to_html()))
        else:
            zy.progress.value = 5
            if zy_data.network is None:
                zy_data.network = NetworkUtility.create_network_from_xyw_list(edges_data) # zy_data.distance_matrix)
            zy.progress.value = 6
            p = PlotNetworkUtility.plot_network(
                network=zy_data.network,
                layout_algorithm=layout_algorithm,
                scale=scale,
                threshold=threshold,
                node_description=node_description,
                node_proportions=node_proportions,
                element_id='nx_id3',
                figsize=(1000,600)
            )
            zy.progress.value = 6
            show(p)

        zy.progress.value = 7
        zy.progress.value = 0
    except Exception as ex:
        # logger.exception(ex)
        print('Error: {}'.format(ex))
        print('Empty set: please change filters')
        zy.progress.value = 0

zy = BaseWidgetUtility(
    n_topics=state.n_topics,
    text_id='nx_id3',
    text=wf.create_text_widget('nx_id3'),
    scale=wf.create_float_slider('Scale', min=0.0, max=1.0, step=0.01, value=0.1),
    year=wf.create_int_slider(
        description='Year', min=state.min_year, max=state.max_year, step=1, value=state.min_year
    ),
    n_words=wf.create_int_slider(description='#words*', min=10, max=500, step=1, value=20),
    metric=wf.create_select_widget(label='Metric*', values=list(DISTANCE_METRICS.keys()), default='Euclidean'),
    threshold=wf.create_float_slider('Threshold', min=0.0, max=1.0, step=0.01, value=0.01),
    output_format=wf.create_select_widget('Format', ['Network', 'List'], default='Network'),
    layout=wf.create_select_widget('Layout', list(layout_algorithms.keys()), default='Fruchterman-Reingold'),
    progress=wf.create_int_progress_widget(min=0, max=7, step=1, value=0, layout=widgets.Layout(width="90%"))
) 
    
wy = widgets.interactive(
    display_correlation_network,
    layout_algorithm=zy.layout,
    threshold=zy.threshold,
    scale=zy.scale,
    metric=zy.metric,
    n_words=zy.n_words,
    output_format=zy.output_format
)

display(widgets.VBox(
    (zy.text, ) +
    (widgets.HBox((zy.threshold,) + (zy.metric,) + (zy.output_format,)),) +
    (widgets.HBox((zy.n_words,) + (zy.layout,) + (zy.scale,)),) +
    (zy.progress,) +
    (wy.children[-1],)))

wy.update()
                                   


### Step: Analyse Topic Similarity using Dimensionality Reduction

In [19]:
#
import types
tr_data = types.SimpleNamespace(
    X_n_space=None,
    X_m_space=None,
    n_words=None,
    method=None,
    perplexity=None,
    corpus_documents=state.get_corpus_documents(),
    topic_proportions=state.get_topic_proportions(),
    tokens=state.get_topics_tokens_as_text(n_words=200)
)
# Plot 2d utility function
def plot_2d_vector_space(
    X_2_space,
    proportions=None,
    size=(20, 60),
    description=None,
    dom_id='id99',
    glyph_style=None,
    label_style=None,
    figsize=(800, 800)
):
    global tr_data
    xs, ys = zip(*X_2_space)
    n_dim = len(xs)
    item_ids = description.index if description is not None else range(0, n_dim)
    
    if proportions is not None:
        proportions = PlotNetworkUtility.project_series_to_range(proportions, size[0], size[1])
    
    source = ColumnDataSource(
        dict(xs=list(xs),
             ys=list(ys),
             size=proportions if proportions is not None else [size[0]] * n_dim,
             text=description if description is not None else item_ids,
             item_id=item_ids
        )
    )
    p = figure(plot_width=figsize[0], plot_height=figsize[1], title='', tools=TOOLS)
    
    glyph_style = extend(dict(color='green', alpha=0.2, hover_color='red') , glyph_style or {})
    cr = p.circle(x='xs', y='ys', size='size', source=source, **glyph_style)
    
    label_style = extend(dict(level='overlay', text_align='center', text_baseline='middle',
                              text_font_size='8pt') , label_style or {})
    labels = bm.LabelSet(x='xs', y='ys', text='item_id', source=source, **label_style)
    
    p.add_layout(labels)
    
    p.add_tools(bm.HoverTool(renderers=[cr], tooltips=None, callback=WidgetUtility.\
        glyph_hover_callback(source, 'item_id', text_ids=description.index, text=description, element_id=dom_id))
    )
    
    return p

In [20]:
#
def reduce_and_plot_vector_space(n_words, method='tsne', perplexity=30):
    global state, zc
    
    pp = zc.progress
    
    pp.value = 1
    
    if tr_data.X_n_space is None or tr_data.n_words != n_words:
        tr_data.X_n_space, _ = state.compute_topic_terms_vector_space(n_words)
        tr_data.X_m_space = None
        
    pp.value = 2
    
    if tr_data.X_m_space is None or tr_data.method != method or tr_data.perplexity != perplexity:
        tr_data.X_m_space = VectorSpaceHelper.reduce_dimensions(
            tr_data.X_n_space, method=method, n_components=2, perplexity=perplexity
        )
        
    tr_data.n_words = n_words
    tr_data.method = method
    tr_data.perplexity = perplexity
    
    pp.value = 4
    
    p = plot_2d_vector_space(
        tr_data.X_m_space, proportions=tr_data.topic_proportions, size=(20,40),
        description=tr_data.tokens, dom_id='text99', figsize=(1000, 600)
    )
    pp.value = 5
    show(p)
    pp.value = 0
    
zc = BaseWidgetUtility(
    n_words=wf.create_int_slider(description='Word count', min=10, max=500, step=10, value=50),
    progress=wf.create_int_progress_widget(min=0, max=5, step=1, value=0, layout=widgets.Layout(width="95%")),
    perplexity=wf.create_int_slider(description='Perplexity', min=1, max=100, step=1, value=30),
    reducer=wf.create_select_widget(
        label='Reducer*', values=['pca','pca_norm','tsne'], default='tsne'
    ),
    text=wf.create_text_widget(element_id='text99')
)
wc = widgets.interactive(
    reduce_and_plot_vector_space, n_words=zc.n_words, method=zc.reducer, perplexity=zc.perplexity
)

display(widgets.VBox(
    (zc.text, ) +
    (widgets.HBox((zc.n_words,) + (zc.reducer, ) + (zc.perplexity,)),) + 
    (zc.progress, ) +
    (wc.children[-1],)))

wc.update()



### Test: Analyse Document Similarity
Document similarity is computed by applying dimensionality reduction to the document-topic distributions.

- Is there an established method for identifying the most (topically) interesting documents?
- Could a goodness-of-fit test against the discrete uniform distribution be used?
  Candidates: Wasserstein distance, chi-square test, Kolmogorov-Smirnov (KS) test.
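
One possible way to explore the questions above is to score each document by how far its topic distribution lies from the uniform distribution. The sketch below is an illustration only: the toy matrix, the helper name `distance_from_uniform` and the `n_tokens` parameter are assumptions, not part of the notebook. Topic weights are proportions, so they are scaled to pseudo-counts before the chi-square test, and the weights are sorted before the Wasserstein distance is computed because topic ids have no natural order.

```python
import numpy as np
from scipy.stats import chisquare, wasserstein_distance

def distance_from_uniform(doc_topic, n_tokens=100):
    """Score each row of a document-topic matrix against the uniform distribution.

    doc_topic: array of shape (n_documents, n_topics), each row summing to 1.
    n_tokens:  assumed document length (hypothetical) used to turn the
               proportions into pseudo-counts for the chi-square test.
    Returns an array with columns (chi2 statistic, chi2 p-value, Wasserstein distance).
    """
    doc_topic = np.asarray(doc_topic, dtype=float)
    n_topics = doc_topic.shape[1]
    uniform = np.full(n_topics, 1.0 / n_topics)
    ranks = np.arange(n_topics)
    rows = []
    for weights in doc_topic:
        # Chi-square goodness of fit on pseudo-counts
        chi2, p_value = chisquare(weights * n_tokens, uniform * n_tokens)
        # Sort weights in descending order so the result does not depend on topic numbering
        ranked = np.sort(weights)[::-1]
        w_dist = wasserstein_distance(ranks, ranks, ranked, uniform)
        rows.append((chi2, p_value, w_dist))
    return np.array(rows)

# Toy data: one perfectly flat document and one dominated by a single topic
toy_doc_topic = [
    [0.25, 0.25, 0.25, 0.25],
    [0.85, 0.05, 0.05, 0.05],
]
scores = distance_from_uniform(toy_doc_topic)
print(scores)
```

Documents with large scores are concentrated on few topics, which may be one reasonable reading of "topically interesting". The chi-square p-value depends directly on the assumed `n_tokens`, so it should be read as a ranking aid rather than a significance level.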


In [21]:
def plot_similarity_distribution():
    df = state.get_document_topic_weights()
    X_m_n_sparse = compute_document_topic_vector_space(df)
    matrix = VectorSpaceHelper.compute_distance_matrix(X_m_n_sparse, metric='cosine')
    x_dim, y_dim = matrix.shape
    items = ((i, j, matrix[i,j]) for i, j in product(range(0,x_dim), range(0,y_dim)) if i < j)
    ns, nm, ws = list(zip(*items))
    df = pd.DataFrame(dict(n=ns,m=nm,w=ws))
    df['similarity'] = (df.w*1000).astype('int')
    p = df.groupby('similarity').size().iloc[0:970].plot()
    
def compute_document_topic_vector_space(df):
    #https://stackoverflow.com/questions/22433884/python-gensim-how-to-calculate-document-similarity-using-the-lda-model
    #''' Filter out topics below given threshold '''
    #df = df[df.weight][['document_id', 'topic_id', 'weight']]

    # Create a single-entry dict {topic_id: weight} for each topic-weight row
    df['weight_dict'] = df.apply(lambda x: { int(x.topic_id): x.weight}, axis=1)

    # Collect the dicts of each document into a list
    df = df.groupby('document_id')['weight_dict'].apply(list)

    # Merge each document's list of single-entry dicts into one dict
    df = df.apply(lambda L: { k: v for d in L for k, v in d.items() } )

    # Fit the topic weights into a sparse matrix (dimensions m_documents x n_topics)
    v = DictVectorizer()
    X_m_n_sparse = v.fit_transform(df)

    return X_m_n_sparse

In [22]:
# T-SNE 2D Visualization
if 'ds_data' not in globals():
    ds_data = types.SimpleNamespace(
        X_m_n_sparse=None,
        X_2_space=None,
        threshold=None,
        reducer=None,
        perplexity=None,
        G=None,
        description=state.get_corpus_documents().rename(columns={'document': 'text'})['text']
    )

def plot_document_similarity_by_topics_tsne(threshold=0.001, reducer='tsne', perplexity=30):
    global u, ds_data
    
    df = state.get_document_topic_weights()
    
    u.progress.value = 1
    if ds_data.X_m_n_sparse is None:
        ds_data.X_m_n_sparse = compute_document_topic_vector_space(df)
        ds_data.threshold = threshold
        ds_data.X_2_space = None
    
    u.progress.value = 2
    if ds_data.X_2_space is None or ds_data.reducer != reducer\
            or ds_data.perplexity != perplexity or ds_data.threshold != threshold:
        ds_data.X_2_space = VectorSpaceHelper.reduce_dimensions(
            ds_data.X_m_n_sparse, method=reducer,
            n_components=2, perplexity=perplexity)
        ds_data.reducer = reducer
        ds_data.perplexity = perplexity
        ds_data.threshold = threshold
        
    u.progress.value = 3

    description = state.get_corpus_documents().rename(columns={'document': 'text'})['text']
        
    u.progress.value = 4
    p = plot_2d_vector_space(ds_data.X_2_space, proportions=None,
            size=(20,60), description=ds_data.description, dom_id='nx_id4', glyph_style=dict(alpha=0.05))
    
    u.progress.value = 5
    show(p)
    u.progress.value = 0

u = BaseWidgetUtility()
u.threshold = u.create_float_slider('Threshold', min=0.0, max=0.10, step=0.01, value=0.01)
u.reducer = u.create_select_widget(label='Reducer*', values=['pca', 'pca_norm', 'tsne'], default='tsne')
u.progress = u.create_int_progress_widget(min=0, max=5, step=1)
u.perplexity = u.create_int_slider(description='Perplexity', min=1, max=100, step=1, value=30)
u.text = u.create_text_widget(element_id='nx_id4')

w = widgets.interactive(plot_document_similarity_by_topics_tsne,
                threshold=u.threshold,
                reducer=u.reducer,
                perplexity=u.perplexity)

display(widgets.VBox(
    (u.text, ) +
    (widgets.HBox((u.threshold,) + (u.reducer,) + (u.perplexity,) + (u.progress,)),) +
    (w.children[-1],)))

w.update()


### The Same Data Visualized as a Network


In [None]:
# Code
if True or 'zu_data' not in globals():
    zu_data = types.SimpleNamespace(
        X_m_n_sparse=None,
        top=None,
        metric=None,
        reducer=None,
        document_topic_weights=state.get_document_topic_weights(),
        corpus_documents=state.get_corpus_documents(),
        topic_proportions=state.get_topic_proportions(),
        G=None
    )

def plot_document_similarity_by_topics_network(
    layout_algorithm, top, metric, reducer
):
    global zu
    scale = 1.0
    threshold = 0.0
    zu.progress.value = 1
    df = zu_data.document_topic_weights
    
    zu.progress.value = 2
    if zu_data.X_m_n_sparse is None:
        zu_data.X_m_n_sparse = compute_document_topic_vector_space(df)
        zu_data.metric = None
        zu.progress.value = 3
    metric = DISTANCE_METRICS[metric]
    if zu_data.metric != metric or zu_data.top != top:
        zu_data.top = top
        zu_data.metric = metric
        matrix = VectorSpaceHelper.compute_distance_matrix(zu_data.X_m_n_sparse, metric=metric)
        #edges = NetworkUtility.matrix_weight_iterator(matrix, threshold)
        edges = NetworkUtility.df_stack_correlation_matrix(matrix, threshold=0.0, n_top=top)
        zu.progress.value = 4
        G = nx.Graph()
        G.add_weighted_edges_from(edges)
        zu_data.G = G
        print(nx.info(zu_data.G))

    node_ids, degrees = list(zip(*list(zu_data.G.degree(zu_data.G.nodes()))))
    node_proportions = pd.DataFrame(dict(node_id=node_ids, size=degrees)).set_index('node_id')
    node_proportions['size'] *= 1000
    zu.progress.value = 5
    p = PlotNetworkUtility.plot_network(
        network=zu_data.G,
        layout_algorithm=layout_algorithm,
        scale=scale,
        threshold=0.0,
        node_description=state.get_corpus_documents(),
        node_proportions=node_proportions,
        weight_scale=1.0,
        normalize_weights=True,
        element_id='nx_id_5'
    )
    zu.progress.value = 6
    show(p)
    zu.progress.value = 0
    
zu = BaseWidgetUtility(
    text = wf.create_text_widget(element_id='nx_id_5'),
    #scale = wf.create_float_slider('Scale', min=0.0, max=1.0, step=0.1, value=1.0),
    reducer = wf.create_select_widget(label='Reducer*', values=['none','pca','pca_norm','tsne'], default='none'),
    progress = wf.create_int_progress_widget(min=0, max=6, step=1, value=0, layout=widgets.Layout(width="95%")),
    threshold = wf.create_float_slider('Threshold', min=0.01, max=1.0, step=0.01, value=0.1),
    top = wf.create_int_slider('Top', min=100, max=1000, step=100, value=100),
    metric = wf.create_select_widget(label='Metric*', values=list(DISTANCE_METRICS.keys()), default='Cosine'),
    layout_algorithm = wf.layout_algorithm_widget(list(layout_algorithms.keys()), default='Fruchterman-Reingold')
)

wu = widgets.interactive(plot_document_similarity_by_topics_network,
                layout_algorithm=zu.layout_algorithm,
                #threshold=u.threshold,
                top=zu.top,
                metric=zu.metric,
                reducer=zu.reducer)

display(widgets.VBox(
    (zu.text, ) +
    #(zu.threshold,) +
    (widgets.HBox((zu.reducer,) + (zu.metric,)),) +
    (widgets.HBox((zu.layout_algorithm,) + (zu.top,)),) +
    (zu.progress, ) +
    (wu.children[-1],)))

wu.update()

### Display Document-Topic Weights
List aggregated topic weights.
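
As a rough illustration of what such an aggregated listing contains, the sketch below pivots a long-format table of document-topic weights into one row per year and one column per topic. The toy values and the helper name `aggregate_topic_weights` are assumptions; the notebook itself delegates this step to `ModelUtility.get_document_topic_weights_pivot`, whose implementation is not shown here.

```python
import pandas as pd

# Toy long-format document-topic weights (hypothetical values)
weights = pd.DataFrame({
    'document_id': [0, 0, 1, 1, 2, 2],
    'year':        [1990, 1990, 1990, 1990, 1991, 1991],
    'topic_id':    [0, 1, 0, 1, 0, 1],
    'weight':      [0.8, 0.2, 0.4, 0.6, 0.1, 0.9],
})

def aggregate_topic_weights(df, aggfunc='mean', pivot_column='year'):
    """Return one row per pivot value and one column per topic."""
    pivot = df.pivot_table(
        index=pivot_column, columns='topic_id', values='weight',
        aggfunc=aggfunc, fill_value=0.0
    )
    # Use string column names, as the notebook does before display
    pivot.columns = [str(c) for c in pivot.columns]
    return pivot

table = aggregate_topic_weights(weights, aggfunc='mean')
print(table)
```

Passing `pivot_column='document_id'` instead gives one row per document, which corresponds to the case where a single year has been selected.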

In [None]:
# Folded code
import IPython.display # import display, HTML
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

def plot_stacked_bar_of_topic_over_time(key='mean', year=None, output_format=None):
    global state
    pivot_column = 'year' if year is None else 'document_id'   
    df_data = state.get_document_topic_weights(year)
    df_temp = ModelUtility.get_document_topic_weights_pivot(df_data, AGGREGATES[key], pivot_column)
    df_temp.set_index(pivot_column, inplace=True)
    df_temp.columns = [ str(x) for x in df_temp.columns ]
    if output_format == 'List':
        # print(df_temp.columns)
        display(df_temp)
    else:
        display(pivot_ui(df_temp, rows=['year']+list(df_temp.columns)))

zk = BaseWidgetUtility(
    n_topics=state.n_topics,
    text_id='topic_share_plot',
    text=wf.create_text_widget('topic_share_plot'),
    year=wf.create_select_widget('Year', options=state.years),
    output_format=wf.create_select_widget('Format', ['List', 'Pivot'], default='List'),
    aggregate=wf.create_select_widget('Aggregate', list(AGGREGATES.keys()), 'max')
) 

wk = widgets.interactive(
    plot_stacked_bar_of_topic_over_time, key=zk.aggregate, year=zk.year, output_format=zk.output_format
)

display(widgets.VBox((zk.text,) +
       (widgets.HBox((zk.aggregate,) + (zk.year,) + (zk.output_format,)),) +
       (wk.children[-1],)))

wk.update()

## <span style='color: red;'>Everything below is work-in-progress</span>

### pyLDAvis

This visualization uses pyLDAvis (http://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf)
    

In [7]:
# Code
# from IPython.display import IFrame, display
# IFrame('./data/{}/pyldavis.html'.format(state.basename), width=900, height=900)
%run ./common/model_utility
import pyLDAvis.gensim as gensimvis
import pyLDAvis

def display_pyLDAvis():
    global state
    
    lda = state.get_lda()

    if lda is None:
        print('Gensim LDA model is required for pyLDAvis but is not available on disk')
        return
    
    dictionary = ModelUtility.load_dictionary(state.data_folder, state.basename)
    corpus = ModelUtility.load_corpus(state.data_folder, state.basename)

    pyLDAvis.enable_notebook()
    vis_data = gensimvis.prepare(lda, corpus, dictionary)
    pyLDAvis.display(vis_data)

display_pyLDAvis()



### Compute and Plot Document Similarity using TF-IDF and T-SNE

**FIXME** Fill in real TF-IDF values (from model) for tokens not in top-list (instead of zero)

**FIXME** The document similarity metric is simple (too simple); use text2vec instead!


In [None]:
# Code
import os
from gensim.models.tfidfmodel import TfidfModel
from gensim import corpora
from sklearn.decomposition import PCA
from sklearn.feature_extraction import DictVectorizer
from sklearn.manifold import TSNE

class TfidfReducer:
    
    def __init__(self):
        self.corpus = corpora.MmCorpus(os.path.join(state.data_folder, state.basename, 'corpus.mm'))
        self.dictionary = corpora.Dictionary.load(os.path.join(state.data_folder, state.basename, 'corpus.dict.gz'))
        self.data_folder = state.data_folder
        self.basename = state.basename
        
    def tfidf_vectors(self, tfidf, corpus, n_tokens):
        for document in corpus:
            yield tfidf[document][:n_tokens]

    def tfidf_vectors_as_dicts(self, tfidf, corpus, n_tokens):
        ''' Create a dict(token_1: weight, ..., token_n: weight } for each document '''
        for tfidf_vector in self.tfidf_vectors(tfidf, corpus, n_tokens):
            yield { x[0]: x[1] for x in tfidf_vector }
        
    def fit_transform(self, tfidf, corpus, n_tokens, perplexity=30):

        ''' Align vectors... '''
        v = DictVectorizer()
        dict_vectors = self.tfidf_vectors_as_dicts(tfidf, corpus, n_tokens)
        X = v.fit_transform(dict_vectors)
        feature_names = v.get_feature_names()

        print('Shape: ', X.shape)
        reducer = TSNE(n_components=2, init='pca', random_state=2019, perplexity=perplexity)
        X_reduced = reducer.fit_transform(X.toarray())

        return X, feature_names, X_reduced

class TfidfDocumentWidgets():
    
    def __init__(self, years):
        self.text_id = 'document_text'
        self.text = widgets.HTML(value="<span class='{}'/>".format(self.text_id), placeholder='', description='')
        self.perplexity = widgets.IntSlider(
            min=1, max=200, step=1, value=30, description='Perplexity', continuous_update=False
        )
        self.word_count = widgets.IntSlider(
            min=50, max=250, step=1, value=200, description='Word count', continuous_update=False
        )
        #self.dropdown = widgets.Dropdown(options=[], value='None', description='Dropdown', disabled=False)
        self.year = widgets.Dropdown(
            options=state.years, value=state.years[0], description='Year', disabled=False
        )
        
    def setup_hover_callback_tool(self, cr):
        code = """
        var indices = cb_data.index['1d'].indices;
        if (indices.length > 0) {
            var index = indices[0];
            var topic_id = circle.data.topic_id[index];
            var title = circle.data.words[index];
            //var share = (100.0 * circle.data.topic_proportion[index]).toFixed(1).toString() + '%';
            $('.""" + self.text_id + """').html('DOC ' + topic_id.toString() + ': ' + title);
        }
        """
        # The JavaScript snippet reads the data source through the name 'circle'
        callback = CustomJS(args={'circle': cr.data_source}, code=code)
        return HoverTool(tooltips=None, callback=callback, renderers=[cr])

def plot_tf_idf_document_vector_space(X_reduced, document_index):
    
    xs, ys = zip(*X_reduced)
    source = ColumnDataSource(
        dict(xs=list(xs),
             ys=list(ys),
             #size=5,
             #words=titles,
             #topic_id=titles.index
        )
    )
    p = figure(plot_width=800, plot_height=800, title='', tools=TOOLS)
    cr = p.circle(x='xs', y='ys', size=5, source=source, alpha=0.2, hover_color='red')
    show(p)
    
if 'corpus' not in globals():
    corpus = corpora.MmCorpus(os.path.join(state.data_folder, state.basename, 'corpus.mm'))
    dictionary = corpora.Dictionary.load(os.path.join(state.data_folder, state.basename, 'corpus.dict.gz'))
    id2document = ModelUtility.get_corpus_documents(state.data_folder, state.basename)
    # Fit the TF-IDF model on the bag-of-words corpus
    tfidf = TfidfModel(corpus)

if 'X_reduced' not in globals():
    ''' This takes some time to compute...'''
    # Vectorize the TF-IDF weights and reduce them to 2D with the TfidfReducer defined above
    X, feature_names, X_reduced = TfidfReducer().fit_transform(tfidf, corpus, n_tokens=200)
    
def display_tf_idf_document_vector_space(perplexity, word_count, year):
    global X_reduced
    plot_tf_idf_document_vector_space(X_reduced, perplexity)
    
u = TfidfDocumentWidgets(state.years)
w = widgets.interactive(display_tf_idf_document_vector_space,
                perplexity=u.perplexity, word_count=u.word_count, year=u.year)

display(widgets.VBox(
    (u.text,) + (widgets.HBox((u.year,) + (u.perplexity,) + (u.word_count,)),)
    + (w.children[-1],)))
        
# w.update()

### TODO Document similarity using BOW document vectorization
https://de.dariah.eu/tatom/working_with_text.html
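
A minimal sketch of what this TODO could look like, assuming scikit-learn's `CountVectorizer` for the bag-of-words step and cosine similarity as the metric. The three toy documents are made up for illustration; in the notebook the texts would come from the corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus (hypothetical); two near-identical documents and one unrelated
documents = [
    'the cat sat on the mat',
    'the cat sat on the sofa',
    'stock markets fell sharply today',
]

# Bag-of-words: one row per document, one column per vocabulary term
vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(documents)

# Pairwise cosine similarity between the document vectors
similarity = cosine_similarity(X_bow)
print(similarity.round(2))
```

Cosine similarity ignores document length, which is usually what is wanted when documents vary in size. The resulting matrix could be passed to the same dimensionality reduction and network plots used for the topic-based similarity.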


Goodness of Fit using **Kolmogorov-Smirnov** (alternatives are **chi square** and **maximum likelihood**) 

https://stats.stackexchange.com/questions/113464/understanding-scipy-kolmogorov-smirnov-test
*"For the KS test the p-value is itself distributed uniformly in [0,1] if the H0 is true (which it is if you test whether it your sample is from $U(0,1)$ and the random number generation works okay). It therefore must "vary wildly" between 0 and 1, in fact its standard deviation is $\sqrt{1/12}$ which is roughly 0.3."*
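
The quoted claim can be checked with a small simulation. The sketch below draws many samples from the uniform distribution on [0, 1], runs a one-sample KS test on each, and compares the spread of the resulting p-values with the standard deviation of a uniform variable. Sample size, number of repetitions and seed are arbitrary choices made for illustration.

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(2019)

# Run the one-sample KS test on many samples for which H0 is true
p_values = np.array([
    kstest(rng.uniform(size=50), 'uniform').pvalue
    for _ in range(2000)
])

# Under H0 the p-values should themselves look uniform on [0, 1]
expected_std = np.sqrt(1 / 12)
print('mean of p-values:', p_values.mean())
print('std of p-values: ', p_values.std())
print('sqrt(1/12):      ', expected_std)
```

A practical consequence is that a single large or small p-value says little on its own when H0 holds; only the distribution of p-values over many documents would be informative.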

https://en.m.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test
*"The Kolmogorov–Smirnov statistic quantifies a distance between the empirical distribution function of the sample and the cumulative distribution function of the reference distribution, or between the empirical distribution functions of two samples. The null distribution of this statistic is calculated under the null hypothesis that the sample is drawn from the reference distribution (in the one-sample case) or that the samples are drawn from the same distribution (in the two-sample case). In each case, the distributions considered under the null hypothesis are continuous distributions but are otherwise unrestricted....The Kolmogorov–Smirnov test can be modified to serve as a goodness of fit test. "* 

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wasserstein_distance.html

scipy.stats.wasserstein_distance


### TODO Add use of HDP model (Hierarchical Dirichlet Process)

[Hierarchical Dirichlet process](https://en.wikipedia.org/wiki/Hierarchical_Dirichlet_process)

Teh, Y. W.; Jordan, M. I.; Beal, M. J.; Blei, D. M. (2006). "Hierarchical Dirichlet Processes" ([PDF](http://www.gatsby.ucl.ac.uk/~ywteh/research/npbayes/TehJor2010a.pdf)). Journal of the American Statistical Association. 101: pp. 1566–1581.

hdp = models.hdpmodel.HdpModel(corpus, dictionary, T=50)
                                      
hdp.save('basename.model')

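HDP is an extension of LDA. It is a non-parametric method: the number of topics is inferred from the data rather than fixed in advance. In gensim, the `T` argument sets the upper truncation level for the number of topics.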

### TODO Word Distinctiveness And Saliency

Similar (or same) as in pyLDAvis
- http://qpleple.com/word-distinctiveness-and-saliency/
- http://qpleple.com/bib/
- http://qpleple.com/bib/#Blei03


### Some References

Blei: https://scholar.google.com/citations?user=8OYE6iEAAAAJ

- Lecture: [Video](http://videolectures.net/mlss09uk_blei_tm/)
- Latent Dirichlet Allocation: [PDF](http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf)
- Introduction to Probabilistic Topic Models: [PDF](http://menome.com/wp/wp-content/uploads/2014/12/Blei2011.pdf)
- Probabilistic Topic Models: [PDF](https://pdfs.semanticscholar.org/01f3/290d6f3dee5978a53d9d2362f44daebc4008.pdf) [PDF](https://mimno.infosci.cornell.edu/info6150/readings/Blei2012.pdf)
- Visualizing Topic Models: [PDF](http://ajbc.io/projects/papers/ChaneyBlei2012.pdf)
- Topic models: [PDF](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.463.1205&rep=rep1&type=pdf#page=96)

Other
- Finding scientific topics: [PDF](http://psiexp.ss.uci.edu/research/papers/sciencetopics.pdf)

http://qpleple.com/bib/

To cite NetworkX please use the following publication:
*Aric A. Hagberg, Daniel A. Schult and Pieter J. Swart, “Exploring network structure, dynamics, and function using NetworkX”, in Proceedings of the 7th Python in Science Conference (SciPy2008), Gaël Varoquaux, Travis Vaught, and Jarrod Millman (Eds), (Pasadena, CA USA), pp. 11–15, Aug 2008*

To cite gensim please use the following publication:
@inproceedings{rehurek_lrec,
      title = {{Software Framework for Topic Modelling with Large Corpora}},
      author = {Radim {\v R}eh{\r u}{\v r}ek and Petr Sojka},
      booktitle = {{Proceedings of the LREC 2010 Workshop on New
           Challenges for NLP Frameworks}},
      pages = {45--50},
      year = 2010,
      month = May,
      day = 22,
      publisher = {ELRA},
      address = {Valletta, Malta},
      note={\url{http://is.muni.cz/publication/884893/en}},
      language={English}
}

### Powered by
<img src="./tm-data/images/powered_by.svg">