# Coherence Maximization Recipe Demo

This notebook shows how to use the coherence maximization recipe to build an experiment and train a topic model with a high intra-text coherence score.

* Recipes in TopicNet: [recipes](https://github.com/machine-intelligence-laboratory/TopicNet/tree/master/topicnet/cooking_machine/recipes)
* Paper about coherence which is used in the recipe: [Intra-Text Coherence as a Measure of Topic Models' Interpretability](http://www.dialog-21.ru/media/4281/alekseevva.pdf)

# Contents<a id="contents"></a>

* [Loading the Recipe](#loading-recipe)
* [Data](#data)
* [Formatting the Recipe](#formatting)
* [Experiment](#experiment)
* [Best Model](#best-model)
    * [Exploring Best Model](#investigating)

In [None]:
import os
import numpy as np
import shutil

import matplotlib.pyplot as plt
from matplotlib import cm

%matplotlib inline

In [None]:
from IPython.display import display_html, HTML

In [None]:
from topicnet.cooking_machine import Dataset
from topicnet.cooking_machine.dataset import get_modality_vw
from topicnet.cooking_machine.pretty_output import make_notebook_pretty
from topicnet.cooking_machine.config_parser import build_experiment_environment_from_yaml_config
from topicnet.viewers.top_tokens_viewer import TopTokensViewer
from topicnet.viewers.top_documents_viewer import TopDocumentsViewer

In [None]:
make_notebook_pretty()

## Loading the Recipe<a id="loading-recipe"></a>

<div style="text-align: right">Back to <a href=#contents>Contents</a></div>

Let's look at what is inside the recipes folder:

In [None]:
RECIPES_FOLDER_PATH = os.path.join('..', 'cooking_machine', 'recipes')

In [None]:
os.listdir(RECIPES_FOLDER_PATH)

['topic_number_search.yml',
 'README.md',
 'intratext_coherence_maximization.yml',
 '__init__.py']

We need `intratext_coherence_maximization.yml`

In [None]:
RECIPE_FILE_NAME = 'intratext_coherence_maximization.yml'
RECIPE_FILE_PATH = os.path.join(RECIPES_FOLDER_PATH, RECIPE_FILE_NAME)

Read the recipe as a string; its placeholders are filled in later.

In [None]:
with open(RECIPE_FILE_PATH, "r") as f:
    YAML_CONFIG = f.read()
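
The recipe is an ordinary text template, so the substitution itself is plain `str.format`. As a minimal sketch of how the placeholders can be inspected before formatting, the snippet below lists the field names found in a template using the standard `string.Formatter`. The miniature template and the helper name `placeholder_names` are made up for illustration and are not taken from the actual recipe file.

```python
from string import Formatter


def placeholder_names(template: str) -> list:
    """Return the distinct `{name}` fields of a str.format template, in order."""
    names = []
    for _, field_name, _, _ in Formatter().parse(template):
        # field_name is None for trailing literal text without a placeholder
        if field_name and field_name not in names:
            names.append(field_name)
    return names


# Hypothetical miniature template (not the real recipe contents)
toy_template = (
    "topics:\n"
    "  specific_topics: {specific_topics}\n"
    "  background_topics: {background_topics}\n"
    "num_iter: {one_stage_num_iter}\n"
)

toy_names = placeholder_names(toy_template)
toy_config = toy_template.format(
    specific_topics=['topic_0', 'topic_1'],
    background_topics=['bcg_2'],
    one_stage_num_iter=20,
)
print(toy_names)
print(toy_config)
```

Applied to `YAML_CONFIG`, such a helper would show which keyword arguments the later `format` call has to supply. A Python list is rendered as `['topic_0', 'topic_1']`, which YAML reads as a flow sequence.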

## Data<a id="data"></a>

<div style="text-align: right">Back to <a href=#contents>Contents</a></div>

Let us have the data in the following folder:

In [None]:
DATA_FOLDER_PATH = os.path.join('..', 'data')

In [None]:
os.listdir(DATA_FOLDER_PATH)

['Post_Science']

In [None]:
DATASET_NAME = 'Post_Science'
DATASET_FOLDER_PATH = os.path.join(DATA_FOLDER_PATH, DATASET_NAME)

In [None]:
os.listdir(DATASET_FOLDER_PATH)

['PScience__internals', 'PScience_batches', 'PScience.csv', 'PScience.csv.zip']

The `.csv` file is the actual dataset that TopicNet needs.

In [None]:
DATASET_FILE_PATH = os.path.join(
    DATASET_FOLDER_PATH, 'PScience.csv'
)
DATASET_INTERNALS_FOLDER_PATH = os.path.join(
    DATASET_FOLDER_PATH, 'PScience__internals'
)

Let's see how big the dataset is

In [None]:
dataset = Dataset(
    DATASET_FILE_PATH,
    internals_folder_path=DATASET_INTERNALS_FOLDER_PATH
)

In [None]:
dataset.get_dictionary()

artm.Dictionary(name=a06fda3e-7156-4fb3-b3fd-d88c9864e946, num_entries=53095)

In [None]:
dataset._data.shape

(3404, 3)

About 3,400 documents, which is not many.

In [None]:
dataset._data.head()

| id (index) | id | vw_text | raw_text |
|---|---|---|---|
| 1.txt | 1.txt | 1.txt \|@author fuchs preobrazhensky tabachniko... | @title Автограф # «Математический дивертисмент... |
| 2.txt | 2.txt | 2.txt \|@word книга:2 лекция:3 рассматриваться:... | @title Главы: Маскулинности в российском конте... |
| 3.txt | 3.txt | 3.txt \|@word развитие появляться пиджина:4 бел... | @title Пиджины и креольские языки \| @snippet Л... |
| 4.txt | 4.txt | 4.txt \|@word стандартный задача:3 состоять:4 р... | @title FAQ: Физиология микроводорослей \| @snip... |
| 5.txt | 5.txt | 5.txt \|@2gramm повседневный_практика государст... | @title Русская государственная идеология \| @sn... |


## Formatting the Recipe<a id="formatting"></a>

<div style="text-align: right">Back to <a href=#contents>Contents</a></div>

Let's use the flag below to reduce memory consumption in some places

In [None]:
LOW_MEMORY = True

Below is the main part: replacement of the placeholders in the recipe string with the real values.

In [None]:
specific_topics = [
    f'topic_{i}' for i in range(15)
]
background_topics = [
    f'bcg_{i}' for i in range(len(specific_topics), len(specific_topics) + 1)
]

num_documents_to_compute_coherence = 100  # a small sample, to speed up the coherence computation
total_num_documents = dataset._data.shape[0]
documents_fraction = num_documents_to_compute_coherence / total_num_documents

yaml_config = YAML_CONFIG.format(
    modality_names=['@word'],
    main_modality='@word',
    dataset_path=DATASET_FILE_PATH,
    keep_dataset_in_memory=True,  # keeping this True speeds up computation
    keep_dataset=not LOW_MEMORY,
    documents_fraction=documents_fraction, 
    specific_topics=specific_topics,
    background_topics=background_topics,
    one_stage_num_iter=20,
    verbose=True,
)
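
To make the substituted values concrete, here is the same arithmetic as a standalone sketch. The numbers are the ones used in this notebook (15 specific topics, one background topic, 100 documents sampled out of 3404); nothing here calls TopicNet.

```python
# Values taken from the notebook: 15 specific topics, 1 background topic,
# 100 documents sampled out of 3404
num_specific = 15
specific_topics = [f'topic_{i}' for i in range(num_specific)]
background_topics = [
    f'bcg_{i}' for i in range(num_specific, num_specific + 1)
]

total_num_documents = 3404
documents_fraction = 100 / total_num_documents

print(specific_topics[0], specific_topics[-1])
print(background_topics)
print(round(documents_fraction, 4))
```

The background topic continues the numbering of the specific ones, which is why it shows up later as `bcg_15`. The coherence score is therefore computed on roughly 3% of the collection.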

In [None]:
print(yaml_config)

# The recipe mainly consists of basic cube stages,
# such as Decorrelation, Sparsing and Smoothing.
# In this way it is similar to ARTM baseline recipe.
# The core difference is that models selected based on their IntratextCoherenceScore
# (which is one of the scores included in TopicNet).
# PerplexityScore is also calculated to assure that models don't have high perplexity,
# but the main criteria is IntratextCoherenceScore.
#
# For more details about IntratextCoherence
# one may see the article http://www.dialog-21.ru/media/4281/alekseevva.pdf
#
# Recipe usage sample:
#   file_contents_as_string = file_contents_as_string.format(
#     modality_names=modality_names,
#     main_modality=main_modality,
#     dataset_path=dataset_file_path,
#     keep_dataset_in_memory=True,
#     keep_dataset=False,
#     documents_fraction=documents_fraction,
#     specific_topics=specific_topic_names,
#     background_topics=background_topic_names,
#     one_stage_num_iter=20,
#     verbose=True,
#   

## Experiment<a id="experiment"></a>

<div style="text-align: right">Back to <a href=#contents>Contents</a></div>

The experiment will be saved in the following folder:

In [None]:
EXPERIMENTS_FOLDER_PATH = os.path.join('..', 'experiments')

In [None]:
os.makedirs(EXPERIMENTS_FOLDER_PATH, exist_ok=True)

In [None]:
os.listdir(EXPERIMENTS_FOLDER_PATH)

['Maximize_Coherence']

In [None]:
EXPERIMENT_ID = 'Maximize_Coherence'

Remove the folder with the chosen experiment ID if it exists:

In [None]:
shutil.rmtree(
    os.path.join(EXPERIMENTS_FOLDER_PATH, EXPERIMENT_ID),
    ignore_errors=True
)

In [None]:
%%time

experiment, dataset = build_experiment_environment_from_yaml_config(
    yaml_config,
    experiment_id=EXPERIMENT_ID,
    save_path=EXPERIMENTS_FOLDER_PATH,
)

CPU times: user 3.53 s, sys: 404 ms, total: 3.94 s
Wall time: 3.1 s


In [None]:
dataset.get_dictionary()

artm.Dictionary(name=60bbf9fb-6a2b-4754-b7ee-0e5815a853f2, num_entries=53095)

The dictionary contains too many words.
Let's filter it; otherwise the Phi matrix will be very large.

In [None]:
dictionary = dataset.get_dictionary()
dictionary.filter(min_df_rate=0.01, max_df_rate=0.9)
dataset._cached_dict = dictionary

In [None]:
dataset.get_dictionary()

artm.Dictionary(name=60bbf9fb-6a2b-4754-b7ee-0e5815a853f2, num_entries=5441)
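
The two thresholds are document-frequency rates: the share of documents in which a token occurs. The sketch below illustrates the idea in plain Python on a toy corpus. It is an assumption about what such a filter does in general, not BigARTM's implementation; the function name `filter_by_df_rate` and the inclusive boundary handling are chosen for illustration only.

```python
from collections import Counter


def filter_by_df_rate(documents, min_df_rate, max_df_rate):
    """Keep tokens whose document-frequency rate lies within the given bounds."""
    num_documents = len(documents)
    document_frequency = Counter()
    for tokens in documents:
        # count each token once per document
        document_frequency.update(set(tokens))
    return {
        token
        for token, df in document_frequency.items()
        if min_df_rate <= df / num_documents <= max_df_rate
    }


# Toy corpus (hypothetical): 'the' is in every document, most words in one
toy_documents = [
    ['cell', 'gene', 'the'],
    ['star', 'the'],
    ['cell', 'the'],
    ['galaxy', 'the'],
]
kept_tokens = filter_by_df_rate(toy_documents, min_df_rate=0.3, max_df_rate=0.9)
print(sorted(kept_tokens))
```

In the toy corpus only `cell` survives: `the` is too frequent and the remaining tokens are too rare. The same effect, dropping very rare and very common tokens, reduces the notebook's dictionary from 53095 to 5441 entries.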

Experiment's `low_memory` mode:

In [None]:
experiment._low_memory = LOW_MEMORY

In [None]:
experiment._low_memory

True

Go!

In [None]:
%%time

experiment.run(dataset)

100%|██████████| 103/103 [00:20<00:00,  4.99it/s]
100%|██████████| 103/103 [00:20<00:00,  4.94it/s]
100%|██████████| 103/103 [00:20<00:00,  4.98it/s]
100%|██████████| 103/103 [00:20<00:00,  5.07it/s]
100%|██████████| 103/103 [00:20<00:00,  5.06it/s]
100%|██████████| 103/103 [00:20<00:00,  5.09it/s]
100%|██████████| 103/103 [00:20<00:00,  5.13it/s]




100%|██████████| 103/103 [00:20<00:00,  4.96it/s]
100%|██████████| 103/103 [00:20<00:00,  5.01it/s]
100%|██████████| 103/103 [00:20<00:00,  5.15it/s]
100%|██████████| 103/103 [00:19<00:00,  5.18it/s]
100%|██████████| 103/103 [00:20<00:00,  5.08it/s]
100%|██████████| 103/103 [00:20<00:00,  5.10it/s]




PATH: ../experiments/Maximize_Coherence/<<<<<<<<<<<root>>>>>>>>>>>
100%|██████████| 103/103 [00:20<00:00,  4.97it/s]
100%|██████████| 103/103 [00:20<00:00,  4.96it/s]
100%|██████████| 103/103 [00:20<00:00,  5.04it/s]




100%|██████████| 103/103 [00:20<00:00,  4.96it/s]
100%|██████████| 103/103 [00:20<00:00,  5.01it/s]
100%|██████████| 103/103 [00:21<00:00,  4.90it/s]
100%|██████████| 103/103 [00:20<00:00,  4.91it/s]
100%|██████████| 103/103 [00:20<00:00,  5.08it/s]
100%|██████████| 103/103 [00:20<00:00,  5.02it/s]
100%|██████████| 103/103 [00:20<00:00,  5.02it/s]
CPU times: user 2min 45s, sys: 2min 36s, total: 5min 21s
Wall time: 18min 53s


{Model(id=##20h15m43s_06d04m2020y###, parent_id=##20h12m00s_06d04m2020y###, experiment_id=Maximize_Coherence)}

## Best Model<a id="best-model"></a>

<div style="text-align: right">Back to <a href=#contents>Contents</a></div>

Let's find the best model of all (not only from the last stage of the experiment).

In [None]:
score_name = 'IntratextCoherenceScore'

best_model = None
levels = range(1, len(experiment.cubes) + 1)

for level in levels:
    best_model_candidates = experiment.select(
        f'{score_name} -> max',
        level=level
    )

    if len(best_model_candidates) == 0:
        continue

    best_model_candidate = best_model_candidates[0]

    if (best_model is None or
            best_model.scores[score_name][-1] <
            best_model_candidate.scores[score_name][-1]):

        best_model = best_model_candidate



In [None]:
best_model

Model(id=##20h14m28s_06d04m2020y###, parent_id=##20h10m14s_06d04m2020y###, experiment_id=Maximize_Coherence)
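
The selection rule used in the loop compares only the last recorded score value of each candidate and skips models without any values. A minimal standalone sketch of that rule, using plain dictionaries instead of TopicNet models (the helper `best_by_last_score` and the toy values are hypothetical):

```python
def best_by_last_score(score_histories):
    """Return the id whose last recorded score is highest, or None."""
    best_id = None
    for model_id, values in score_histories.items():
        if len(values) == 0:
            # e.g. a root model, which has no score values yet
            continue
        if best_id is None or score_histories[best_id][-1] < values[-1]:
            best_id = model_id
    return best_id


# Hypothetical score histories: one list of values per model
toy_histories = {
    'root': [],
    'model_a': [0.0010, 0.0013],
    'model_b': [0.0020, 0.0074],
    'model_c': [0.0090, 0.0070],
}
print(best_by_last_score(toy_histories))
```

Note that `model_c` once reached a higher value than any other model, but it loses because only the final value counts.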

All other models (excluding the Root one):

In [None]:
models_to_compare_with = [
    m for m in experiment.models.values()
    if len(m.scores[score_name]) > 0
]

In [None]:
len(models_to_compare_with)

23

In [None]:
len(experiment.models)  # plus Root

24

Best model's ID and its score value

In [None]:
print(f'         ID: {best_model.model_id}')
print(f'Score value: {best_model.scores[score_name][-1]:.5f}')

         ID: ##20h14m28s_06d04m2020y###
Score value: 0.00988


Score values of all models, by level:

In [None]:
for level in levels:
    print('Level:', level)
    
    current_models = experiment.select(
        '',
        level=level
    )
    
    for m in current_models:
        if len(m.scores[score_name]) == 0:
            score_value = float('nan')
        else:
            score_value = m.scores[score_name][-1]

        print(f'{m.model_id}: {score_value:.5f}')
    
    print()

Level: 1
<<<<<<<<<<<root>>>>>>>>>>>: nan

Level: 2
##19h59m58s_06d04m2020y###: 0.00130
##20h00m37s_06d04m2020y###: 0.00125
##20h01m17s_06d04m2020y###: 0.00109
##20h01m57s_06d04m2020y###: 0.00116
##20h02m37s_06d04m2020y###: 0.00097
##20h03m18s_06d04m2020y###: 0.00108
##20h04m01s_06d04m2020y###: 0.00019

Level: 3
##20h04m52s_06d04m2020y###: 0.00191
##20h05m33s_06d04m2020y###: 0.00191
##20h06m15s_06d04m2020y###: 0.00478
##20h06m59s_06d04m2020y###: 0.00736
##20h07m43s_06d04m2020y###: 0.00734
##20h08m29s_06d04m2020y###: 0.00722

Level: 4
##20h09m28s_06d04m2020y###: 0.00741
##20h10m14s_06d04m2020y###: 0.00741
##20h11m00s_06d04m2020y###: 0.00702

Level: 5
##20h12m00s_06d04m2020y###: 0.00756
##20h12m48s_06d04m2020y###: 0.00677
##20h13m38s_06d04m2020y###: 0.00713
##20h14m28s_06d04m2020y###: 0.00988

Level: 6
##20h15m43s_06d04m2020y###: 0.00800
##20h16m36s_06d04m2020y###: 0.00800
##20h17m30s_06d04m2020y###: 0.00799
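
The listing shows why the search went over all levels rather than only the last one. As a small check, the sketch below takes the printed values (copied by hand from the output above, root omitted) and finds the best score per level.

```python
# Score values copied from the printed output above, grouped by level
scores_by_level = {
    2: [0.00130, 0.00125, 0.00109, 0.00116, 0.00097, 0.00108, 0.00019],
    3: [0.00191, 0.00191, 0.00478, 0.00736, 0.00734, 0.00722],
    4: [0.00741, 0.00741, 0.00702],
    5: [0.00756, 0.00677, 0.00713, 0.00988],
    6: [0.00800, 0.00800, 0.00799],
}

best_per_level = {
    level: max(values) for level, values in scores_by_level.items()
}
best_level = max(best_per_level, key=best_per_level.get)

for level, value in best_per_level.items():
    print(f'Level {level}: {value:.5f}')
print('Best level:', best_level)
```

The highest value belongs to level 5, while the best model of the final level 6 scores lower, so taking the result of the last stage would have missed the best model.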



### Exploring Best Model<a id="investigating"></a>

<div style="text-align: right">Back to <a href=#contents>Contents</a></div>

More detailed description of the best model goes here.

In [None]:
best_model.describe_regularizers()

| model_id | regularizer_name | tau | gamma | class_ids |
|---|---|---|---|---|
| ##20h14m28s_06d04m2020y### | decorrelate_phi | 0.1 | 0.0 | [] |
| ##20h14m28s_06d04m2020y### | smooth_phi_background | 0.002459 | | [] |
| ##20h14m28s_06d04m2020y### | smooth_phi_bcg | 0.0 | | [@word] |
| ##20h14m28s_06d04m2020y### | smooth_phi_specific | 1.0 | | [] |
| ##20h14m28s_06d04m2020y### | smooth_theta_bcg | 0.0 | | |
| ##20h14m28s_06d04m2020y### | sparse_phi_specific | -2.73198 | | [] |


In [None]:
best_model.describe_scores()

| model_id | score_name | last_value |
|---|---|---|
| ##20h14m28s_06d04m2020y### | PerplexityScore@all | 3902.82 |
| ##20h14m28s_06d04m2020y### | SparsityThetaScore | 0.318082 |
| ##20h14m28s_06d04m2020y### | SparsityPhiScore@word | 0.895063 |
| ##20h14m28s_06d04m2020y### | PerplexityScore@word | 3902.82 |
| ##20h14m28s_06d04m2020y### | TopicKernel@word.average_coherence | 0.0 |
| ##20h14m28s_06d04m2020y### | TopicKernel@word.average_contrast | 0.678895 |
| ##20h14m28s_06d04m2020y### | TopicKernel@word.average_purity | 0.752459 |
| ##20h14m28s_06d04m2020y### | TopicKernel@word.average_size | 1390.25 |
| ##20h14m28s_06d04m2020y### | IntratextCoherenceScore | 0.00988121 |


Let's use some viewers:

In [None]:
toptok_viewer = TopTokensViewer(best_model, num_top_tokens=10, method='phi')
topdoc_viewer = TopDocumentsViewer(best_model, dataset=dataset)
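
With `method='phi'`, the viewer is assumed here to rank tokens by their probability in the topic, that is, by the values in the topic's column of the Phi matrix. The sketch below shows that ranking on a toy column in plain Python; the helper `top_tokens` and the numbers are hypothetical and do not reproduce the viewer's code.

```python
def top_tokens(topic_column, num_top_tokens):
    """Return the `num_top_tokens` most probable (token, probability) pairs."""
    ranked = sorted(
        topic_column.items(), key=lambda item: item[1], reverse=True
    )
    return ranked[:num_top_tokens]


# Hypothetical toy column of Phi: p(token | topic) for a single topic
toy_topic = {
    'film': 0.11,
    'the': 0.02,
    'hero': 0.06,
    'cell': 0.001,
    'cinema': 0.04,
}
toy_top = top_tokens(toy_topic, num_top_tokens=3)
print(toy_top)
```

Each table produced by the viewer can be read the same way: tokens sorted by decreasing probability within one topic.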

In [None]:
topic_html_strings = toptok_viewer.view_from_jupyter(output=False)

In [None]:
top_documents = topdoc_viewer.view()

In [None]:
HTML(
    topic_html_strings[0]
    + '&nbsp;' + topic_html_strings[1]
    + '&nbsp;' + topic_html_strings[3]
)

| modality | token | topic_0 |
|---|---|---|
| @word | фильм | 0.10822 |
| @word | герой | 0.06408 |
| @word | кино | 0.04081 |
| @word | жанр | 0.03906 |
| @word | испытуемый | 0.03166 |
| @word | зритель | 0.03004 |
| @word | сцена | 0.03004 |
| @word | персонаж | 0.02614 |
| @word | кинематограф | 0.02183 |
| @word | экран | 0.01672 |

| modality | token | topic_1 |
|---|---|---|
| @word | культура | 0.01171 |
| @word | история | 0.01142 |
| @word | общество | 0.01078 |
| @word | пространство | 0.01001 |
| @word | отношение | 0.00988 |
| @word | понятие | 0.00959 |
| @word | представление | 0.0089 |
| @word | метр | 0.00742 |
| @word | смысл | 0.00714 |
| @word | социальный | 0.00709 |

| modality | token | topic_3 |
|---|---|---|
| @word | язык | 0.05561 |
| @word | слово | 0.0399 |
| @word | книга | 0.03932 |
| @word | текст | 0.02326 |
| @word | говорить | 0.02014 |
| @word | русский | 0.01716 |
| @word | писать | 0.0164 |
| @word | автор | 0.01546 |
| @word | написать | 0.01254 |
| @word | литература | 0.0104 |


In [None]:
HTML(
    topic_html_strings[4]
    + '&nbsp;' + topic_html_strings[5]
    + '&nbsp;' + topic_html_strings[6]
)

| modality | token | topic_4 |
|---|---|---|
| @word | право | 0.04747 |
| @word | бог | 0.02628 |
| @word | закон | 0.02422 |
| @word | король | 0.01798 |
| @word | церковь | 0.01586 |
| @word | религия | 0.01261 |
| @word | религиозный | 0.01254 |
| @word | император | 0.01098 |
| @word | суд | 0.01093 |
| @word | христианский | 0.01069 |

| modality | token | topic_5 |
|---|---|---|
| @word | пациент | 0.02562 |
| @word | бактерия | 0.02385 |
| @word | болезнь | 0.02385 |
| @word | заболевание | 0.01972 |
| @word | врач | 0.01944 |
| @word | препарат | 0.01649 |
| @word | медицина | 0.01638 |
| @word | сон | 0.01581 |
| @word | микроорганизм | 0.01563 |
| @word | вирус | 0.01527 |

| modality | token | topic_6 |
|---|---|---|
| @word | клетка | 0.03075 |
| @word | мозг | 0.0215 |
| @word | ген | 0.01897 |
| @word | организм | 0.0164 |
| @word | использовать | 0.01592 |
| @word | метод | 0.01408 |
| @word | система | 0.01402 |
| @word | память | 0.01378 |
| @word | процесс | 0.01367 |
| @word | молекула | 0.01269 |


And the background topic (not meaningful):

In [None]:
HTML(topic_html_strings[-1])

| modality | token | bcg_15 |
|---|---|---|
| @word | должный | 0.0043 |
| @word | большой | 0.00424 |
| @word | являться | 0.00417 |
| @word | существовать | 0.00389 |
| @word | говорить | 0.00385 |
| @word | работа | 0.00382 |
| @word | образ | 0.0037 |
| @word | стать | 0.00336 |
| @word | важный | 0.00332 |
| @word | сторона | 0.00332 |


A topic and parts of its top documents:

In [None]:
topic_number = 10

display_html(topic_html_strings[topic_number], raw=True)

for doc_id in top_documents[topic_number]:
    doc_vw = dataset.get_vw_document(doc_id).values[0][0]
    doc_title = get_modality_vw(doc_vw, "@title")
    doc_snippet = get_modality_vw(doc_vw, "@snippet")
    display_html(f"<b>{doc_title}</b><br/>{doc_snippet}", raw=True)

| modality | token | topic_10 |
|---|---|---|
| @word | частица | 0.01402 |
| @word | звезда | 0.01226 |
| @word | земля | 0.01142 |
| @word | вселенная | 0.01096 |
| @word | энергия | 0.01052 |
| @word | вещество | 0.01013 |
| @word | галактика | 0.00878 |
| @word | масса | 0.00831 |
| @word | планета | 0.00756 |
| @word | электрон | 0.00718 |
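
The loop above pulls the `@title` and `@snippet` parts out of each document's Vowpal Wabbit line. As a rough illustration of what such an extraction involves, the sketch below splits a line of the form `doc_id |@modality tokens ...` on `|` and returns the text of one modality. The helper `extract_modality` is hypothetical and only mimics the idea; it is not the implementation of `get_modality_vw`.

```python
def extract_modality(vw_line: str, modality: str) -> str:
    """Return the text following `|modality` in a Vowpal Wabbit style line."""
    # the first segment is the document id, the rest are modality sections
    for segment in vw_line.split('|')[1:]:
        name, _, text = segment.strip().partition(' ')
        if name == modality:
            return text.strip()
    # modality not present in the line
    return ''


# Hypothetical document line in the same layout as the dataset's vw_text
toy_line = '1.txt |@title Pidgins and creoles |@word language:3 word:2'
print(extract_modality(toy_line, '@title'))
print(extract_modality(toy_line, '@word'))
```

A modality missing from the line yields an empty string in this sketch, which keeps the display loop from failing on documents without a snippet.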
