# Visualizing Your Model Documents

This notebook demonstrates a couple of ways to better understand and visualize a collection of text documents.
Visualizations are done with the help of TopicNet's [viewers](https://github.com/machine-intelligence-laboratory/TopicNet/tree/master/topicnet/viewers).

# Contents<a id="contents"></a>

* [Loading a TopicModel](#model-loading)
* [Loading a Dataset](#dataset-loading)
* [Top Documents Viewer](#topdoc-viewer)
* [Document Map](#document-map)
* [Finding Similar Texts](#similar-texts)
* [Document Search Visualization](#search-visualization)

In [1]:
import colorlover as cl
import plotly.graph_objs as go

from plotly.offline import (
    init_notebook_mode,
    iplot,
    plot,
)

from IPython.display import (
    display_html,
    display,
    IFrame,
)

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
from topicnet.cooking_machine import Dataset
from topicnet.cooking_machine.models import TopicModel, DummyTopicModel

from topicnet.viewers import DocumentClusterViewer
from topicnet.viewers import TopDocumentsViewer
from topicnet.viewers import TopSimilarDocumentsViewer

## Loading a TopicModel<a id="model-loading"></a>

<div style="text-align: right">Back to <a href=#contents>Contents</a></div>

In [5]:
topic_model = TopicModel.load(
    'wiki_data/experiments/wiki_experiment/##18h49m27s_23d10m2019y###/'
)



## Loading a Dataset<a id="dataset-loading"></a>

<div style="text-align: right">Back to <a href=#contents>Contents</a></div>

In [29]:
dataset = Dataset('wiki_data/wiki_data.csv', batch_vectorizer_path='wiki_data/document_visualisation_demo_batches/')

# Unfortunately, some document names in the collection are ambiguous and occur more than once; keep the first occurrence of each
new_data = dataset._data.loc[~dataset._data.index.duplicated(keep='first')]
dataset._data = new_data

Skipping line 380: field larger than field limit (131072)
Skipping line 1187: field larger than field limit (131072)
Skipping line 4068: field larger than field limit (131072)
Skipping line 5051: field larger than field limit (131072)
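
The deduplication step above relies on a pandas index mask. Here is a minimal sketch of the same idiom on a toy table; the column name `raw_text` and the document ids are made up for illustration and are not taken from the wiki dataset.

```python
import pandas as pd

# Toy stand-in for the dataset's internal table (hypothetical data):
# document ids form the index, and the id "Mercury" appears twice.
toy = pd.DataFrame(
    {"raw_text": ["planet", "element", "metric space"]},
    index=["Mercury", "Mercury", "Metric_space"],
)

# index.duplicated(keep="first") marks every repeat after the first occurrence;
# negating the mask keeps exactly one row per id.
deduplicated = toy.loc[~toy.index.duplicated(keep="first")]

print(deduplicated)
```

Note that `keep='first'` silently drops the later rows, so which text survives depends on the row order in the file.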


## Top Documents Viewer<a id="topdoc-viewer"></a>

<div style="text-align: right">Back to <a href=#contents>Contents</a></div>

Let's try TopDocumentsViewer for conventional model analysis.

* TopicNet's viewers: [viewers](https://github.com/machine-intelligence-laboratory/TopicNet/tree/master/topicnet/viewers)
* TopDocumentsViewer source file: [top_documents_viewer.py](https://github.com/machine-intelligence-laboratory/TopicNet/blob/master/topicnet/viewers/top_documents_viewer.py)

In [30]:
tdv = TopDocumentsViewer(topic_model, dataset)

In [185]:
tdv.view_from_jupyter()

[Output: top documents displayed per topic, from `topic_0` to `topic_40`]
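
To get a feeling for what "top documents of a topic" means, here is a rough sketch that ranks documents by p(topic | document). It is only an illustration under that assumption: TopDocumentsViewer may rank documents differently, and `top_documents` and `toy_theta` are hypothetical names that are not part of TopicNet.

```python
from typing import Dict, List


def top_documents(theta: Dict[str, Dict[str, float]], topic: str, k: int = 2) -> List[str]:
    """Return the ids of the k documents with the highest p(topic | document)."""
    ranked = sorted(
        theta,
        key=lambda doc_id: theta[doc_id].get(topic, 0.0),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical document-topic probabilities, one inner dict per document
toy_theta = {
    "doc_a": {"topic_0": 0.9, "topic_1": 0.1},
    "doc_b": {"topic_0": 0.2, "topic_1": 0.8},
    "doc_c": {"topic_0": 0.6, "topic_1": 0.4},
}

print(top_documents(toy_theta, "topic_0"))
```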




## Document Map<a id="document-map"></a>

<div style="text-align: right">Back to <a href=#contents>Contents</a></div>

Knowing that the topics look reasonable, let's make a map of all documents in the collection!

* DocumentClusterViewer source file: [document_cluster.py](https://github.com/machine-intelligence-laboratory/TopicNet/blob/master/topicnet/viewers/document_cluster.py)

In [34]:
# by default it saves the result to `DocumentCluster_view.html`
DocumentClusterViewer(topic_model).view_from_jupyter(dataset, to_html=True)

In [4]:
IFrame(src='topic_clusters.html', width=900, height=700)

Around the point (-20, -40) you can see an interesting cluster made up of documents from 2 different topics. I recommend exploring it (the plot is interactive).

## Finding Similar Texts<a id="similar-texts"></a>

<div style="text-align: right">Back to <a href=#contents>Contents</a></div>

The previous visualization lets us pick out texts similar to a given one.

In [162]:
tsdv = TopSimilarDocumentsViewer(topic_model, dataset)





In [220]:
search_doc = 'Metric_space'

tsdv.view_from_jupyter(document_id=search_doc)
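
One common way to define "similar" for documents is cosine similarity between their topic distributions. The sketch below assumes that metric purely for illustration; TopSimilarDocumentsViewer may use a different one, and every document id and vector here except `Metric_space` is invented.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(doc_vectors: Dict[str, Sequence[float]], query_id: str, k: int = 2) -> List[str]:
    """Return the ids of the k documents closest to query_id, excluding itself."""
    query = doc_vectors[query_id]
    others = [doc_id for doc_id in doc_vectors if doc_id != query_id]
    others.sort(key=lambda doc_id: cosine_similarity(query, doc_vectors[doc_id]), reverse=True)
    return others[:k]


# Hypothetical topic distributions (3 topics) for four documents
toy_doc_vectors = {
    "Metric_space": [0.8, 0.1, 0.1],
    "Topological_space": [0.7, 0.2, 0.1],
    "Banach_space": [0.6, 0.1, 0.3],
    "Jazz": [0.05, 0.05, 0.9],
}

print(most_similar(toy_doc_vectors, "Metric_space"))
```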

## Document Search Visualization<a id="search-visualization"></a>

<div style="text-align: right">Back to <a href=#contents>Contents</a></div>

Using the reduced document representations from the previous function, we can visualize the document search results.
Basically we are using embeddings from

In [217]:
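import numpy as np

# Document-topic matrix: one row per document, one column per topic
model_data = topic_model.transform(batch_vectorizer=dataset.get_batch_vectorizer()).T
# NOTE: `vectors` is not defined in the cells shown here; it is assumed to be
# an (n_documents, 2) array of reduced document coordinates, row-aligned with `model_data`
data_dict = dict()
data_dict['x'] = vectors[:, 0]
data_dict['y'] = vectors[:, 1]
# Color each document by its most probable topic
data_dict['label'] = np.argmax(model_data.values, axis=1)
data_dict['text'] = model_data.index
base_scheme = cl.scales['12']['qual']['Paired']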
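
The cell above expects a two-column array of document coordinates. The notebook does not show how that array was produced, so the following is only a stand-in: a plain PCA projection done with NumPy's SVD on a random toy matrix. PCA is swapped in here for simplicity and is not claimed to be the reduction DocumentClusterViewer uses; `project_to_2d`, `toy_doc_topic`, and `toy_coords` are hypothetical names.

```python
import numpy as np


def project_to_2d(doc_topic: np.ndarray) -> np.ndarray:
    """Project rows of an (n_documents, n_topics) matrix onto their first two principal components."""
    centered = doc_topic - doc_topic.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


# Toy data: 20 documents, 5 topics, each row sums to 1 like a topic distribution
rng = np.random.default_rng(0)
toy_doc_topic = rng.dirichlet(alpha=np.ones(5), size=20)

toy_coords = project_to_2d(toy_doc_topic)       # shape (20, 2): x and y per document
toy_labels = np.argmax(toy_doc_topic, axis=1)   # most probable topic per document

print(toy_coords.shape, toy_labels.shape)
```

A linear projection like this is cheap, but it tends to separate clusters less cleanly than nonlinear methods, so treat it as a quick first look.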

In [234]:
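# NOTE: `sim_docs` is not defined in the cells shown here; it is assumed to be
# a list of ids of the documents found similar to `search_doc`
selected_documents = sim_docs + [search_doc]
# Positional row numbers of the selected documents in `model_data`
adresses = list(model_data.reset_index().query('index in @selected_documents').index)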

In [246]:
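selected_data = dict()

selected_data['x'] = vectors[adresses, 0]
selected_data['y'] = vectors[adresses, 1]
# Take the labels in the same row order as `adresses`, so each point gets its own document id
selected_data['text'] = list(model_data.index[adresses])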

In [250]:
html_div = iplot(
    [
        go.Scatter(
            x=data_dict['x'],
            y=data_dict['y'],
            mode='markers',
            marker=dict(
                colorscale=base_scheme,
                size=4,
                opacity=0.6,
                colorbar=dict(title='Topics')
            ),
            marker_color=data_dict['label'],
            text=data_dict['text'],
            name='Data clusters'
        ),
        go.Scatter(
            x=selected_data['x'],
            y=selected_data['y'],
            mode='markers',
            marker=dict(
                symbol=14,
                size=8,
                color='black'
            ),
            text=selected_data['text'],
            name='Search documents'
        ),
    ],
    show_link=False,
)