# Coherence Maximization Recipe Demo

This notebook shows how to use the coherence maximization recipe to build an experiment and train a model with a high coherence value.

* Recipes in TopicNet: [recipes](https://github.com/machine-intelligence-laboratory/TopicNet/tree/master/topicnet/cooking_machine/recipes)
* Paper about coherence which is used in the recipe: [Intra-Text Coherence as a Measure of Topic Models' Interpretability](http://www.dialog-21.ru/media/4281/alekseevva.pdf)

# Contents<a id="contents"></a>

* [Loading Dataset](#data)
* [Formatting the Recipe](#formatting)
* [Experiment](#experiment)
* [Best Model](#best-model)
    * [Exploring Best Model](#investigating)

In [1]:
import os
import numpy as np
import shutil

import matplotlib.pyplot as plt
from matplotlib import cm

%matplotlib inline

In [2]:
from IPython.display import display_html

In [4]:
from topicnet.dataset_manager import load_dataset
from topicnet.cooking_machine import Dataset
from topicnet.cooking_machine.pretty_output import make_notebook_pretty
from topicnet.cooking_machine.recipes import IntratextCoherenceRecipe
from topicnet.viewers.top_tokens_viewer import TopTokensViewer
from topicnet.viewers.top_documents_viewer import TopDocumentsViewer

In [5]:
make_notebook_pretty()

## Loading Dataset<a id="data"></a>

<div style="text-align: right">Back to <a href=#contents>Contents</a></div>

Let's pick a dataset from the [list of datasets available for download](https://github.com/machine-intelligence-laboratory/TopicNet/blob/master/topicnet/dataset_manager/DemoDataset.ipynb).

In [6]:
DATASET_NAME = 'postnauka'

In [7]:
DATASET = load_dataset(DATASET_NAME)

In [8]:
DATASET._data.head()

| id (index) | id | vw_text | raw_text |
|---|---|---|---|
| 1.txt | 1.txt | 1.txt \|@author fuchs preobrazhensky tabachniko... | @title Автограф # «Математический дивертисмент... |
| 2.txt | 2.txt | 2.txt \|@word книга:2 лекция:3 рассматриваться:... | @title Главы: Маскулинности в российском конте... |
| 3.txt | 3.txt | 3.txt \|@word развитие появляться пиджина:4 бел... | @title Пиджины и креольские языки \| @snippet Л... |
| 4.txt | 4.txt | 4.txt \|@word стандартный задача:3 состоять:4 р... | @title FAQ: Физиология микроводорослей \| @snip... |
| 5.txt | 5.txt | 5.txt \|@2gramm повседневный_практика государст... | @title Русская государственная идеология \| @sn... |
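
The `vw_text` column holds each document in Vowpal Wabbit format: a document ID, then `|@modality` sections with tokens and optional `:count` suffixes. A minimal sketch of how such a line could be read is given below. The helper `parse_vw_line` and the sample line are hypothetical (they are not part of TopicNet), and integer counts are assumed.

```python
from collections import defaultdict


def parse_vw_line(line):
    """Split a Vowpal Wabbit line into (doc_id, {modality: {token: count}})."""
    doc_id, *chunks = line.split('|')
    doc_id = doc_id.strip()
    modalities = defaultdict(dict)

    for chunk in chunks:
        parts = chunk.split()
        if not parts:
            continue

        # The first item after "|" is the modality name, the rest are tokens
        modality, tokens = parts[0], parts[1:]

        for token in tokens:
            # "token:3" means the token occurs 3 times; a bare token counts once
            if ':' in token:
                name, count = token.rsplit(':', 1)
                count = int(count)  # assumption: counts are integers
            else:
                name, count = token, 1

            modalities[modality][name] = modalities[modality].get(name, 0) + count

    return doc_id, dict(modalities)


# Hypothetical line in the same style as the vw_text column
doc_id, bag = parse_vw_line('2.txt |@word book:2 lecture:3 consider |@author fuchs')
print(doc_id)
print(bag)
```

This is only meant to make the format readable; TopicNet and BigARTM do their own parsing internally.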


In [9]:
DATASET_PATH = DATASET._data_path

In [10]:
DATASET_PATH

'/home/alekseev/topicnet/dataset_manager/postnauka.csv'

Let's see how big the dataset is

In [11]:
DATASET.get_dictionary()

artm.Dictionary(name=f0d91fae-b8ef-453f-a918-9a1a5b5a7bc4, num_entries=53095)

In [12]:
DATASET._data.shape

(3404, 3)

About 3,400 documents: not that many.

## Formatting the Recipe<a id="formatting"></a>

<div style="text-align: right">Back to <a href=#contents>Contents</a></div>

Let's use the flag below to reduce memory consumption in some places

In [13]:
LOW_MEMORY = True

Below is the main part: replacement of the placeholders in the recipe string with the real values.

In [14]:
training_pipeline = IntratextCoherenceRecipe()


num_documents_to_compute_coherence = 100  # a small number, to speed things up a bit
total_num_documents = DATASET._data.shape[0]
documents_fraction = num_documents_to_compute_coherence / total_num_documents

recipe_config = training_pipeline.format_recipe(
    dataset_path=DATASET_PATH,
    main_modality='@word',
    modalities=['@word'],
    keep_dataset_in_memory=True,  # better to keep this True for quicker computation
    keep_dataset=not LOW_MEMORY,
    documents_fraction=documents_fraction, 
    num_specific_topics=20,
    num_background_topics=1,
    one_stage_num_iter=20,
    verbose=True,
)
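
As a quick check of the arithmetic in the cell above: with the 3404 documents reported by `DATASET._data.shape`, asking for 100 documents gives a fraction of roughly 3%. The snippet below only repeats that calculation with the number hard-coded.

```python
num_documents_to_compute_coherence = 100
total_num_documents = 3404  # first component of DATASET._data.shape shown earlier

documents_fraction = num_documents_to_compute_coherence / total_num_documents
print(f'{documents_fraction:.4f}')  # 0.0294
```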

In [15]:
print(recipe_config)

# The recipe mainly consists of basic cube stages,
# such as Decorrelation, Sparsing and Smoothing.
# In this way it is similar to ARTM baseline recipe.
# The core difference is that models selected based on their IntratextCoherenceScore
# (which is one of the scores included in TopicNet).
# PerplexityScore is also calculated to assure that models don't have high perplexity,
# but the main criteria is IntratextCoherenceScore.
#
# For more details about IntratextCoherence
# one may see the paper http://www.dialog-21.ru/media/4281/alekseevva.pdf
#
# Recipe usage sample:
#   file_contents_as_string = file_contents_as_string.format(
#     modality_names=modality_names,
#     main_modality=main_modality,
#     dataset_path=dataset_file_path,
#     keep_dataset_in_memory=True,
#     keep_dataset=False,
#     documents_fraction=documents_fraction,
#     specific_topics=specific_topic_names,
#     background_topics=background_topic_names,
#     one_stage_num_iter=20,
#     verbose=True,
#   )
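
The usage sample in the recipe header suggests that formatting comes down to `str.format` on a template string. The sketch below illustrates that idea on a made-up, much shortened template. `TOY_RECIPE_TEMPLATE` and its keys are assumptions for illustration only; the real recipe text ships with TopicNet and has more placeholders.

```python
# A made-up, YAML-like template; not the actual TopicNet recipe
TOY_RECIPE_TEMPLATE = """
topics:
  specific_topics: {specific_topics}
  background_topics: {background_topics}
stages:
  - num_iter: {one_stage_num_iter}
    documents_fraction: {documents_fraction}
"""

# Topic names in the same style as the ones the notebook prints later
specific_topic_names = [f'topic_{i}' for i in range(3)]
background_topic_names = ['bcg_topic_3']

toy_config = TOY_RECIPE_TEMPLATE.format(
    specific_topics=specific_topic_names,
    background_topics=background_topic_names,
    one_stage_num_iter=20,
    documents_fraction=round(100 / 3404, 4),
)
print(toy_config)
```

After formatting, no `{placeholder}` should remain in the string; a leftover one would usually mean a keyword argument was forgotten.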


## Experiment<a id="experiment"></a>

<div style="text-align: right">Back to <a href=#contents>Contents</a></div>

Folder for the future experiment:

In [16]:
EXPERIMENTS_FOLDER_PATH = os.path.join('..', 'experiments')

In [17]:
os.makedirs(EXPERIMENTS_FOLDER_PATH, exist_ok=True)

In [18]:
os.listdir(EXPERIMENTS_FOLDER_PATH)

['Maximize_Coherence2', 'Maximize_Coherence']

In [19]:
EXPERIMENT_ID = 'Maximize_Coherence'

Removing the folder with the chosen experiment ID if it exists

In [20]:
COHERENCE_EXPERIMENT_FOLDER_PATH = os.path.join(EXPERIMENTS_FOLDER_PATH, EXPERIMENT_ID)

if os.path.isdir(COHERENCE_EXPERIMENT_FOLDER_PATH):
    shutil.rmtree(COHERENCE_EXPERIMENT_FOLDER_PATH)

In [21]:
%%time

experiment, dataset = training_pipeline.build_experiment_environment(
    experiment_id=EXPERIMENT_ID,
    save_path=EXPERIMENTS_FOLDER_PATH,
)

CPU times: user 2.85 s, sys: 294 ms, total: 3.14 s
Wall time: 3.01 s


In [22]:
dataset.get_dictionary()

artm.Dictionary(name=db8c7870-be41-411f-9dc1-c28d6d789fd6, num_entries=53095)

There are too many words in the dictionary.
Let's filter it (otherwise the Phi matrix will be huge).

In [23]:
dictionary = dataset.get_dictionary()
dictionary.filter(min_df_rate=0.01, max_df_rate=0.9)
dataset._cached_dict = dictionary

In [24]:
dataset.get_dictionary()

artm.Dictionary(name=db8c7870-be41-411f-9dc1-c28d6d789fd6, num_entries=5441)
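
The filter above keeps only tokens whose document-frequency rate falls between `min_df_rate` and `max_df_rate`, which is why the dictionary shrinks from 53095 to 5441 entries. A toy pure-Python sketch of the same idea follows. It is not BigARTM's implementation: the function name, the toy documents, and the inclusive treatment of both bounds are assumptions.

```python
def filter_by_document_frequency(documents, min_df_rate, max_df_rate):
    """Keep tokens whose share of documents lies in [min_df_rate, max_df_rate]."""
    num_documents = len(documents)
    document_frequency = {}

    for tokens in documents:
        for token in set(tokens):  # count each document at most once per token
            document_frequency[token] = document_frequency.get(token, 0) + 1

    return {
        token
        for token, df in document_frequency.items()
        if min_df_rate <= df / num_documents <= max_df_rate
    }


toy_documents = [
    ['science', 'language', 'word'],
    ['science', 'climate', 'ice'],
    ['science', 'language', 'brain'],
    ['science', 'city', 'ice'],
]

# 'science' is in every document (too frequent); words seen once are too rare
kept = filter_by_document_frequency(toy_documents, min_df_rate=0.3, max_df_rate=0.9)
print(sorted(kept))
```

Dropping both very rare and very frequent tokens keeps the Phi matrix small without losing the tokens that distinguish topics.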

Experiment's `low_memory` mode:

In [25]:
experiment._low_memory = LOW_MEMORY

In [26]:
experiment._low_memory

True

Go!

In [27]:
%%time

experiment.run(dataset)

100%|██████████| 100/100 [00:24<00:00,  4.04it/s]
100%|██████████| 100/100 [00:25<00:00,  3.92it/s]
100%|██████████| 100/100 [00:24<00:00,  4.02it/s]
100%|██████████| 100/100 [00:25<00:00,  3.97it/s]
100%|██████████| 100/100 [00:26<00:00,  3.73it/s]
100%|██████████| 100/100 [01:03<00:00,  1.58it/s]
100%|██████████| 100/100 [00:26<00:00,  3.82it/s]



Perplexity is too high for threshold 1.05



100%|██████████| 100/100 [00:26<00:00,  3.83it/s]
100%|██████████| 100/100 [00:25<00:00,  3.94it/s]
100%|██████████| 100/100 [00:25<00:00,  3.89it/s]
100%|██████████| 100/100 [00:25<00:00,  3.97it/s]
100%|██████████| 100/100 [00:26<00:00,  3.78it/s]
100%|██████████| 100/100 [00:27<00:00,  3.64it/s]



Max progression length exceeded



100%|██████████| 100/100 [00:26<00:00,  3.76it/s]
100%|██████████| 100/100 [00:26<00:00,  3.84it/s]
100%|██████████| 100/100 [00:24<00:00,  4.00it/s]



Already dummy



100%|██████████| 100/100 [00:25<00:00,  3.92it/s]
100%|██████████| 100/100 [00:24<00:00,  4.03it/s]
100%|██████████| 100/100 [00:23<00:00,  4.17it/s]
100%|██████████| 100/100 [00:23<00:00,  4.27it/s]
100%|██████████| 100/100 [00:25<00:00,  3.85it/s]
100%|██████████| 100/100 [00:25<00:00,  3.98it/s]
100%|██████████| 100/100 [00:25<00:00,  3.99it/s]
CPU times: user 2min 45s, sys: 2min 52s, total: 5min 37s
Wall time: 29min 3s


{Model(id=--10h05m21s_23d04m2020y---, parent_id=--10h02m48s_23d04m2020y---, experiment_id=Maximize_Coherence)}

## Best Model<a id="best-model"></a>

<div style="text-align: right">Back to <a href=#contents>Contents</a></div>

Let's find the best model of all (not only from the last stage of the experiment).

In [28]:
SCORE_NAME = 'IntratextCoherenceScore'

BEST_MODEL = None
levels = range(1, len(experiment.cubes) + 1)

for level in levels:
    best_model_candidates = experiment.select(
        f'{SCORE_NAME} -> max',
        level=level
    )

    if len(best_model_candidates) == 0:
        continue

    best_model_candidate = best_model_candidates[0]

    if (BEST_MODEL is None or
            BEST_MODEL.scores[SCORE_NAME][-1] <
            best_model_candidate.scores[SCORE_NAME][-1]):

        BEST_MODEL = best_model_candidate


Model "Model(id=-----------root-----------, parent_id=None, experiment_id=Maximize_Coherence)" has empty value list for score "IntratextCoherenceScore"


Can't return the requested number of models:



In [29]:
BEST_MODEL

Model(id=--10h05m21s_23d04m2020y---, parent_id=--10h02m48s_23d04m2020y---, experiment_id=Maximize_Coherence)

All models that have a score value (that is, all except the Root one):

In [30]:
models_to_compare_with = [
    m for m in experiment.models.values()
    if len(m.scores[SCORE_NAME]) > 0
]

In [31]:
len(models_to_compare_with)

23

In [32]:
len(experiment.models)  # plus Root

24

Best model's ID and its score value

In [33]:
print(f'         ID: {BEST_MODEL.model_id}')
print(f'Score value: {BEST_MODEL.scores[SCORE_NAME][-1]:.5f}')

         ID: --10h05m21s_23d04m2020y---
Score value: 0.01554


Score values of all models, level by level:

In [34]:
for level in levels:
    print('Level:', level)
    
    current_models = experiment.select(
        '',
        level=level
    )
    
    for m in current_models:
        if len(m.scores[SCORE_NAME]) == 0:
            score_value = float('nan')
        else:
            score_value = m.scores[SCORE_NAME][-1]

        print(f'{m.model_id}: {score_value:.5f}')
    
    print()

Level: 1
-----------root-----------: nan

Level: 2
--09h40m13s_23d04m2020y---: 0.00135
--09h41m07s_23d04m2020y---: 0.00068
--09h42m03s_23d04m2020y---: 0.00068
--09h43m01s_23d04m2020y---: 0.00069
--09h44m02s_23d04m2020y---: 0.00081
--09h45m06s_23d04m2020y---: 0.00054
--09h48m50s_23d04m2020y---: 0.00013

Level: 3
--09h50m34s_23d04m2020y---: 0.00180
--09h51m33s_23d04m2020y---: 0.00180
--09h52m34s_23d04m2020y---: 0.00450
--09h53m37s_23d04m2020y---: 0.00851
--09h54m41s_23d04m2020y---: 0.00988
--09h55m49s_23d04m2020y---: 0.01175

Level: 4
--09h57m16s_23d04m2020y---: 0.00991
--09h58m18s_23d04m2020y---: 0.00991
--09h59m22s_23d04m2020y---: 0.01048

Level: 5
--10h00m36s_23d04m2020y---: 0.01036
--10h01m41s_23d04m2020y---: 0.01091
--10h02m48s_23d04m2020y---: 0.01231
--10h03m57s_23d04m2020y---: 0.00000

Level: 6
--10h05m21s_23d04m2020y---: 0.01554
--10h06m33s_23d04m2020y---: 0.01554
--10h07m47s_23d04m2020y---: 0.01554
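
The selection loop from the earlier cell can be replayed on the values printed above, without TopicNet. In the sketch below the model IDs are shortened to their time part, and ties inside a level are resolved by taking the first model; how `experiment.select` orders tied models is an assumption here.

```python
# Final IntratextCoherenceScore values copied from the listing above
# (Root is skipped because it has no score values).
scores_by_level = {
    2: {'09h40m13s': 0.00135, '09h41m07s': 0.00068, '09h42m03s': 0.00068,
        '09h43m01s': 0.00069, '09h44m02s': 0.00081, '09h45m06s': 0.00054,
        '09h48m50s': 0.00013},
    3: {'09h50m34s': 0.00180, '09h51m33s': 0.00180, '09h52m34s': 0.00450,
        '09h53m37s': 0.00851, '09h54m41s': 0.00988, '09h55m49s': 0.01175},
    4: {'09h57m16s': 0.00991, '09h58m18s': 0.00991, '09h59m22s': 0.01048},
    5: {'10h00m36s': 0.01036, '10h01m41s': 0.01091, '10h02m48s': 0.01231,
        '10h03m57s': 0.00000},
    6: {'10h05m21s': 0.01554, '10h06m33s': 0.01554, '10h07m47s': 0.01554},
}

best_id, best_score = None, None

for level in sorted(scores_by_level):
    level_scores = scores_by_level[level]
    # max() returns the first of several equal maxima
    candidate_id = max(level_scores, key=level_scores.get)
    candidate_score = level_scores[candidate_id]

    # Strict "<" as in the selection cell: a later model must beat the current best
    if best_score is None or best_score < candidate_score:
        best_id, best_score = candidate_id, candidate_score

print(best_id, best_score)
```

The replay ends on the same model that the notebook reports as `BEST_MODEL`: the first of the three level-6 models, which all share the top score.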



## Exploring Best Model<a id="investigating"></a>

<div style="text-align: right">Back to <a href=#contents>Contents</a></div>

Here is a more detailed description of the best model.

In [35]:
BEST_MODEL.describe_regularizers()

| model_id | regularizer_name | tau | gamma | class_ids |
|---|---|---|---|---|
| --10h05m21s_23d04m2020y--- | decorrelate_phi | 0.05 | 0.0 | [] |
| --10h05m21s_23d04m2020y--- | smooth_phi_background | 27.321693 | | [] |
| --10h05m21s_23d04m2020y--- | smooth_phi_bcg | 0.0 | | [@word] |
| --10h05m21s_23d04m2020y--- | smooth_phi_specific | 0.0 | | [] |
| --10h05m21s_23d04m2020y--- | smooth_theta_bcg | 0.0 | | |
| --10h05m21s_23d04m2020y--- | sparse_phi_specific | -2.83701 | | [] |


In [36]:
BEST_MODEL.describe_scores()

| model_id | score_name | last_value |
|---|---|---|
| --10h05m21s_23d04m2020y--- | PerplexityScore@all | 3795.78 |
| --10h05m21s_23d04m2020y--- | SparsityThetaScore | 0.32957 |
| --10h05m21s_23d04m2020y--- | SparsityPhiScore@word | 0.912853 |
| --10h05m21s_23d04m2020y--- | PerplexityScore@word | 3795.78 |
| --10h05m21s_23d04m2020y--- | TopicKernel@word.average_coherence | 0.0 |
| --10h05m21s_23d04m2020y--- | TopicKernel@word.average_contrast | 0.64136 |
| --10h05m21s_23d04m2020y--- | TopicKernel@word.average_purity | 0.856669 |
| --10h05m21s_23d04m2020y--- | TopicKernel@word.average_size | 1151.43 |
| --10h05m21s_23d04m2020y--- | IntratextCoherenceScore | 0.0155387 |


Let's use some viewers:

In [37]:
toptok_viewer = TopTokensViewer(BEST_MODEL, num_top_tokens=10, method='phi')
topdoc_viewer = TopDocumentsViewer(BEST_MODEL, dataset=dataset)

In [41]:
top_tokens_view = toptok_viewer.view_from_jupyter(
    display_output=False,
    give_html=True,
)

In [42]:
display_html(' '.join(top_tokens_view), raw=True)

| modality | token | topic_0 |
|---|---|---|
| @word | работать | 0.04258 |
| @word | наука | 0.04249 |
| @word | хороший | 0.03673 |
| @word | учёный | 0.03078 |
| @word | заниматься | 0.02427 |
| @word | хотеть | 0.01814 |
| @word | думать | 0.018 |
| @word | университет | 0.01776 |
| @word | образование | 0.01711 |
| @word | проект | 0.0169 |

| modality | token | topic_1 |
|---|---|---|
| @word | говорить | 0.01079 |
| @word | работа | 0.00843 |
| @word | являться | 0.00802 |
| @word | образ | 0.00737 |
| @word | вопрос | 0.00715 |
| @word | сторона | 0.00698 |
| @word | должный | 0.00695 |
| @word | жизнь | 0.00671 |
| @word | важный | 0.00594 |
| @word | существовать | 0.0059 |

| modality | token | topic_2 |
|---|---|---|
| @word | сантиметр | 0.24049 |
| @word | святилище | 0.09795 |
| @word | сосуд | 0.06396 |
| @word | стоящий | 0.06351 |
| @word | заполнить | 0.05061 |
| @word | культовый | 0.04414 |
| @word | фигурка | 0.04406 |
| @word | квадрат | 0.04078 |
| @word | статуэтка | 0.03047 |
| @word | ниша | 0.02579 |

| modality | token | topic_3 |
|---|---|---|
| @word | книга | 0.05898 |
| @word | история | 0.04808 |
| @word | текст | 0.02904 |
| @word | писать | 0.02746 |
| @word | автор | 0.02669 |
| @word | метр | 0.0253 |
| @word | написать | 0.01916 |
| @word | литература | 0.01475 |
| @word | читать | 0.01274 |
| @word | искусство | 0.01184 |

| modality | token | topic_4 |
|---|---|---|
| @word | россия | 0.01928 |
| @word | государство | 0.01459 |
| @word | общество | 0.01368 |
| @word | право | 0.01323 |
| @word | власть | 0.0121 |
| @word | политический | 0.01034 |
| @word | война | 0.00939 |
| @word | политика | 0.0075 |
| @word | закон | 0.00679 |
| @word | советский | 0.00578 |

| modality | token | topic_5 |
|---|---|---|
| @word | caption | 0.10341 |
| @word | остров | 0.06908 |
| @word | align | 0.05135 |
| @word | width | 0.05135 |
| @word | attachment | 0.05061 |
| @word | африка | 0.03809 |
| @word | глубина | 0.0355 |
| @word | раса | 0.03325 |
| @word | гора | 0.03276 |
| @word | aligncenter | 0.02891 |

| modality | token | topic_6 |
|---|---|---|
| @word | язык | 0.08155 |
| @word | слово | 0.06631 |
| @word | память | 0.01843 |
| @word | речь | 0.01426 |
| @word | вариант | 0.01394 |
| @word | информация | 0.01372 |
| @word | ошибка | 0.0121 |
| @word | понятно | 0.01154 |
| @word | рука | 0.01006 |
| @word | предложение | 0.00977 |

| modality | token | topic_7 |
|---|---|---|
| @word | лёд | 0.14354 |
| @word | климат | 0.1255 |
| @word | морской | 0.11453 |
| @word | ледник | 0.08815 |
| @word | накапливаться | 0.08202 |
| @word | антарктида | 0.04316 |
| @word | дно | 0.04083 |
| @word | канадский | 0.03754 |
| @word | ледяной | 0.03519 |
| @word | арктика | 0.03068 |

| modality | token | topic_8 |
|---|---|---|
| @word | женщина | 0.04121 |
| @word | животное | 0.03246 |
| @word | эволюция | 0.02374 |
| @word | мужчина | 0.02163 |
| @word | растение | 0.01988 |
| @word | самец | 0.01282 |
| @word | птица | 0.01243 |
| @word | самка | 0.01145 |
| @word | животный | 0.01078 |
| @word | признак | 0.01062 |

| modality | token | topic_9 |
|---|---|---|
| @word | выбрать | 0.0 |
| @word | датировка | 0.0 |
| @word | колоть | 0.0 |
| @word | марков | 0.0 |
| @word | необходимый | 0.0 |
| @word | обезболивать | 0.0 |
| @word | перестроиться | 0.0 |
| @word | подруга | 0.0 |
| @word | пьяный | 0.0 |
| @word | сартр | 0.0 |

| modality | token | topic_10 |
|---|---|---|
| @word | система | 0.01083 |
| @word | получить | 0.00711 |
| @word | большой | 0.0069 |
| @word | частица | 0.00624 |
| @word | звезда | 0.00544 |
| @word | метод | 0.00539 |
| @word | происходить | 0.00524 |
| @word | земля | 0.00519 |
| @word | эксперимент | 0.00516 |
| @word | вселенная | 0.00486 |

| modality | token | topic_11 |
|---|---|---|
| @word | лекция | 0.10174 |
| @word | сеть | 0.08911 |
| @word | интернет | 0.06831 |
| @word | num | 0.06079 |
| @word | pcourse | 0.06079 |
| @word | процедура | 0.04395 |
| @word | пользователь | 0.03337 |
| @word | страница | 0.03023 |
| @word | узел | 0.0225 |
| @word | al | 0.02202 |

| modality | token | topic_12 |
|---|---|---|
| @word | территория | 0.01004 |
| @word | бог | 0.00999 |
| @word | традиция | 0.00912 |
| @word | народ | 0.00875 |
| @word | имя | 0.00805 |
| @word | культура | 0.00709 |
| @word | миф | 0.00698 |
| @word | век | 0.00625 |
| @word | эпоха | 0.00609 |
| @word | церковь | 0.00601 |

| modality | token | topic_13 |
|---|---|---|
| @word | стандарт | 0.11286 |
| @word | прогноз | 0.098 |
| @word | станция | 0.0682 |
| @word | сдвиг | 0.06473 |
| @word | геологический | 0.05997 |
| @word | землетрясение | 0.05246 |
| @word | протокол | 0.05056 |
| @word | слоить | 0.04797 |
| @word | обстановка | 0.04237 |
| @word | дополнение | 0.04175 |

| modality | token | topic_14 |
|---|---|---|
| @word | часы | 0.1079 |
| @word | занятие | 0.07667 |
| @word | сумма | 0.06943 |
| @word | коллективный | 0.06806 |
| @word | комната | 0.06741 |
| @word | повседневность | 0.0522 |
| @word | коллектив | 0.05049 |
| @word | стол | 0.05016 |
| @word | бытовой | 0.04872 |
| @word | рубль | 0.04675 |

| modality | token | topic_15 |
|---|---|---|
| @word | страна | 0.02841 |
| @word | технология | 0.01329 |
| @word | экономика | 0.01227 |
| @word | сша | 0.01051 |
| @word | деньга | 0.00993 |
| @word | использование | 0.00976 |
| @word | китай | 0.0091 |
| @word | экономический | 0.00896 |
| @word | компания | 0.00877 |
| @word | производство | 0.00817 |

| modality | token | topic_16 |
|---|---|---|
| @word | клетка | 0.0387 |
| @word | мозг | 0.02695 |
| @word | ген | 0.02383 |
| @word | организм | 0.02054 |
| @word | днк | 0.01347 |
| @word | нейрон | 0.01179 |
| @word | белка | 0.01174 |
| @word | пациент | 0.01139 |
| @word | бактерия | 0.01063 |
| @word | болезнь | 0.01041 |

| modality | token | topic_17 |
|---|---|---|
| @word | молекула | 0.04376 |
| @word | материал | 0.03569 |
| @word | вода | 0.02534 |
| @word | атмосфера | 0.02382 |
| @word | соединение | 0.0236 |
| @word | химия | 0.02274 |
| @word | микроорганизм | 0.01912 |
| @word | кислород | 0.01487 |
| @word | слой | 0.01393 |
| @word | микроб | 0.01348 |

| modality | token | topic_18 |
|---|---|---|
| @word | дерево | 0.1571 |
| @word | литр | 0.1541 |
| @word | плод | 0.09934 |
| @word | оружие | 0.0902 |
| @word | орех | 0.07938 |
| @word | ред | 0.0605 |
| @word | спб | 0.0585 |
| @word | семя | 0.04523 |
| @word | сухой | 0.03167 |
| @word | карикатура | 0.02437 |

| modality | token | topic_19 |
|---|---|---|
| @word | город | 0.10954 |
| @word | пространство | 0.09567 |
| @word | центр | 0.05476 |
| @word | дом | 0.04204 |
| @word | москва | 0.03526 |
| @word | музей | 0.02738 |
| @word | городской | 0.02305 |
| @word | район | 0.01859 |
| @word | архитектура | 0.01805 |
| @word | здание | 0.01769 |

| modality | token | bcg_topic_20 |
|---|---|---|
| @word | задача | 0.00245 |
| @word | студент | 0.00103 |
| @word | компьютер | 0.00098 |
| @word | выбирать | 0.00084 |
| @word | уметь | 0.00062 |
| @word | зона | 0.00057 |
| @word | робот | 0.00052 |
| @word | допустить | 0.00051 |
| @word | алгоритм | 0.00043 |
| @word | рис | 0.00043 |


Parts of topics' top documents:

In [45]:
topdoc_viewer.view_from_jupyter(
    current_num_top_doc=5,
    num_view_topics=5,
)

Some topics (e.g. `topic_2`) have no documents: these topics have empty rows in the $\Theta$ matrix.
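
Topics with empty $\Theta$ rows can be spotted with a simple check. The sketch below uses a small made-up matrix (rows are topics, columns are documents) rather than the model's real $\Theta$; the topic names and values are hypothetical.

```python
import numpy as np

# Toy Theta: rows are topics, columns are documents (values are made up)
theta = np.array([
    [0.7, 0.2, 0.0, 0.5],
    [0.0, 0.0, 0.0, 0.0],  # a topic that no document uses
    [0.3, 0.8, 1.0, 0.5],
])
topic_names = ['topic_0', 'topic_1', 'topic_2']

# A topic is "empty" if its probability is (numerically) zero in every document
empty_mask = np.isclose(theta, 0.0).all(axis=1)
empty_topics = [name for name, is_empty in zip(topic_names, empty_mask) if is_empty]
print(empty_topics)
```

With a real model, the same mask could be applied to whatever $\Theta$ matrix the model exposes, assuming it has the same topics-by-documents layout.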