# TopicBank: Bank Creation Experiment

Here we collect interpretable topics from multiple model training runs (the topics are selected automatically, using topic coherence).
Together, these topics constitute the *topic bank*.
The topic bank is then used to estimate the quality of topic models in the notebook [TopicBank-Experiment: Model Validation](TopicBank-Experiment-ModelValidation.ipynb).

The process is repeated for several datasets (some of them can be downloaded using the [TopicNet](https://github.com/machine-intelligence-laboratory/TopicNet) library).

# Contents<a id="contents"></a>

* [Data](#data)
    * [Coocs](#coocs)
        * [Lower Memory Consumption (or a Bit of Shamanism. Part 1)](#optimizing-memory)
    * [Documents for Coherence Scores](#docs-for-cohs)
        * [Lower Time Consumption in Case of Big Datasets (or a Bit of Shamanism. Part 2)](#optimizing-time)
* [Experiment](#experiment)
    * [Scores](#scores)
    * [Bank Creation](#bank-creation)
* [Postprocessing](#postprocessing)

In [None]:
# General imports

import dill
import itertools
import json
import numpy as np
import os
import pandas as pd
import sys

from enum import Enum
from scipy.stats import gaussian_kde
from matplotlib import pyplot as plt
from tqdm import tqdm
from typing import (
    Dict,
    Iterable,
)

%matplotlib inline

In [None]:
# Making `topnum` module visible for Python

sys.path.insert(0, '..')

In [None]:
# Optimal number of topics

from topicnet.cooking_machine import Dataset

from topnum.data.vowpal_wabbit_text_collection import VowpalWabbitTextCollection
from topnum.scores import (
    IntratextCoherenceScore,
    SparsityPhiScore,
    SparsityThetaScore,
    SimpleTopTokensCoherenceScore,
    SophisticatedTopTokensCoherenceScore,
)
from topnum.scores._base_coherence_score import (
    SpecificityEstimationMethod,
    TextType,
    WordTopicRelatednessType,
)
from topnum.scores.intratext_coherence_score import ComputationMethod
from topnum.search_methods import TopicBankMethod
from topnum.search_methods.topic_bank.topic_bank import TopicBank
from topnum.search_methods.topic_bank.one_model_train_funcs import (
    default_train_func,

    # Functions below are not used (but could have been)

#     regularization_train_func,
#     specific_initial_phi_train_func,
#     background_topics_train_func,

)

## Data<a id="data"></a>

<div style="text-align: right">Back to <a href=#contents>Contents</a></div>

Loading data from disk, creating batches, dictionary, gathering cooccurrence statistics...

In [None]:
DATA_FOLDER_PATH = os.path.join('.', 'data')

In [None]:
sorted(os.listdir(DATA_FOLDER_PATH))

['.ipynb_checkpoints',
 '20NG_natural_order.csv',
 '20NG_natural_order_250.csv',
 '20NG_natural_order_40.csv',
 '20NG_natural_order_45.csv',
 '20NG_natural_order_50.csv',
 '20NG_natural_order_55.csv',
 '20NG_natural_order_60.csv',
 '20NG_natural_order_65.csv',
 'AG_News.csv',
 'AG_News_100.csv',
 'AG_News_1000.csv',
 'AG_News_10000.csv',
 'AG_News_250.csv',
 'AG_News_500.csv',
 'Brown.csv',
 'Brown_10.csv',
 'Brown_8.csv',
 'Brown_80.csv',
 'Brown_9.csv',
 'Habrahabr.csv',
 'Habrahabr_100.csv',
 'Habrahabr_15.csv',
 'Habrahabr_50.csv',
 'PScience.csv',
 'PostNauka_natural_order.csv',
 'Post_Science',
 'Post_Science_12.csv',
 'Post_Science_15.csv',
 'Post_Science_18.csv',
 'Post_Science_20.csv',
 'Post_Science_65.csv',
 'Reuters.csv',
 'Reuters_20.csv',
 'Reuters_50.csv',
 'Reuters_60.csv',
 'Reuters_80.csv',
 'Watan2004.csv',
 'Watan2004_10.csv',
 'Watan2004_100.csv',
 'Watan2004_12.csv',
 'Watan2004_15.csv',
 'Watan2004_250.csv']

In [None]:
class DatasetName(Enum):
    POSTNAUKA = 'Post_Science'
    REUTERS = 'Reuters'
    BROWN = 'Brown'
    TWENTY_NEWSGROUPS = '20NG_natural_order'
    AG_NEWS = 'AG_News'
    WATAN = 'Watan2004'
    HABRAHABR = 'Habrahabr'

In [None]:
DATASET_NAME_TO_DATASET_FILE_PATH = {
    DatasetName.POSTNAUKA: os.path.join(
        DATA_FOLDER_PATH, 'PostNauka_natural_order.csv'
    ),
    DatasetName.REUTERS: os.path.join(
        DATA_FOLDER_PATH, 'Reuters.csv'
    ),
    DatasetName.BROWN: os.path.join(
        DATA_FOLDER_PATH, 'Brown.csv'
    ),
    DatasetName.TWENTY_NEWSGROUPS: os.path.join(
        DATA_FOLDER_PATH, '20NG_natural_order.csv'
    ),
    DatasetName.AG_NEWS: os.path.join(
        DATA_FOLDER_PATH, 'AG_News.csv'
    ),
    DatasetName.WATAN: os.path.join(
        DATA_FOLDER_PATH, 'Watan2004.csv'
    ),
    DatasetName.HABRAHABR: os.path.join(
        DATA_FOLDER_PATH, 'Habrahabr.csv'
    ),
}

In [None]:
DATASET_NAME = DatasetName.POSTNAUKA  # select a dataset here

DATASET_FILE_PATH = DATASET_NAME_TO_DATASET_FILE_PATH[DATASET_NAME]

Checking that the data looks fine and which modalities the collection has.

In [None]:
! head -n 2 $DATASET_FILE_PATH

id,raw_text,vw_text
29998.txt,материал отрицательный показатель преломление физик виктор веселаго распространение свет вещество фазовый групповой скорость метаматериалы различаться фазовый групповой скорость каков физика распространение свет вещество находить применение материал отрицательный показатель преломление рассказывать доктор физикоматематический наука виктор веселаго скорость распространяться энергия вещество обычно говорить излучение распространяться вещество со скорость n раз маленький n коэффициент преломление вещество коэффициент преломление n отношение скорость свет скорость распространение излучение вещество обычно уточняться распространяться распространение энергия распространение импульс происходить различный закон энергия распространяться со скорость называться групповой скорость много скорость свет эйнштейн сформулировать самый больший скорость излучение скорость свет кмс импульс распространяться фазовый скорость сколь угодно много скорость свет скорость входить со

In [None]:
def get_dataset_internals_folder_path(dataset_name: DatasetName) -> str:
    return os.path.join('.', dataset_name.value + '__internals')

In [None]:
DATASET_INTERNALS_FOLDER_PATH = get_dataset_internals_folder_path(DATASET_NAME)

In [None]:
DATASET_INTERNALS_FOLDER_PATH

'./Post_Science__internals'

In [None]:
%%time

# If using really big datasets (like Habrahabr),
# one may need to set this to `False`
KEEP_DATASET_IN_MEMORY = True

DATASET = Dataset(
    DATASET_FILE_PATH,
    internals_folder_path=DATASET_INTERNALS_FOLDER_PATH,
    keep_in_memory=KEEP_DATASET_IN_MEMORY,
)

Looking at what is inside the dataset's folder

In [None]:
os.listdir(DATASET_INTERNALS_FOLDER_PATH)

['vw.txt',
 'vocab.txt',
 'dict.dict.txt',
 'new_ppmi_tf_',
 'ppmi_tf_',
 'dict.dict',
 'cooc_values.json',
 'batches']

Creating batches

In [None]:
DATASET.get_batch_vectorizer()

artm.BatchVectorizer(data_path="./Post_Science__internals__test/batches", num_batches=4)

In [None]:
os.listdir(DATASET_INTERNALS_FOLDER_PATH)

['vw.txt',
 'vocab.txt',
 'dict.dict.txt',
 'new_ppmi_tf_',
 'ppmi_tf_',
 'dict.dict',
 'cooc_values.json',
 'batches']

In [None]:
if KEEP_DATASET_IN_MEMORY:
    DOCUMENTS = list(DATASET._data.index)
else:
    DOCUMENTS = list(DATASET._data_index)

NUM_DOCUMENTS = len(DOCUMENTS)

print(f'Num documents: {NUM_DOCUMENTS}')

Num documents: 3446


Let's look at some text samples

In [None]:
DATASET._data.head()

Unnamed: 0_level_0,id,raw_text,vw_text
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
29998.txt,29998.txt,материал отрицательный показатель преломление ...,29998.txt |@word материал отрицательный показа...
7770.txt,7770.txt,культурный код экономика экономист александр а...,7770.txt |@word культурный код экономика эконо...
32230.txt,32230.txt,faq наука третий класс факт эксперимент резуль...,32230.txt |@word faq наука третий класс факт э...
27293.txt,27293.txt,обрушение волна поверхность жидкость математик...,27293.txt |@word обрушение волна поверхность ж...
481.txt,481.txt,существовать ли суперсимметрия мир элементарны...,481.txt |@word существовать ли суперсимметрия ...


### Coocs<a id="coocs"></a>

<div style="text-align: right">Back to <a href=#contents>Contents</a></div>

Below we will make `vocab.txt` for computing word cooccurrences and compute coocs themselves.
Cooccurrences are needed for some coherence scores.

P.S.
The notebook [Making-Decorrelation-and-Topic-Selection-Friends.ipynb](https://github.com/machine-intelligence-laboratory/TopicNet/blob/master/topicnet/demos/Making-Decorrelation-and-Topic-Selection-Friends.ipynb) contains a bit more explanation and references concerning cooccurrences computation in ARTM library.

In [None]:
DATASET._dictionary_file_path

'./Post_Science__internals__test/dict.dict'

We will need the dictionary saved in a human-readable text format

In [None]:
DICTIONARY_TXT_FILE_PATH = DATASET._dictionary_file_path + '.txt'

In [None]:
DICTIONARY_TXT_FILE_PATH

'./Post_Science__internals/dict.dict.txt'

In [None]:
dictionary = DATASET.get_dictionary()

In [None]:
dictionary

artm.Dictionary(name=3ac8eb64-8702-4d53-8cab-ed55e4f9a715, num_entries=82162)

Let's filter the dictionary now to remove too frequent and too rare words (so that coocs are computed only for meaningful words)

In [None]:
dictionary.filter(min_df_rate=0.01, max_df_rate=0.5)

artm.Dictionary(name=3ac8eb64-8702-4d53-8cab-ed55e4f9a715, num_entries=5112)

Saving the dictionary as a text file and caching it inside the dataset entity

In [None]:
dictionary.save_text(DICTIONARY_TXT_FILE_PATH)

DATASET._cached_dict = dictionary

Now we read dictionary entries to make a `vocab.txt`

In [None]:
with open(DICTIONARY_TXT_FILE_PATH, 'r') as f:
    dictionary_lines = f.readlines()

In [None]:
len(dictionary_lines)

5114

In [None]:
dictionary_lines[:3]

['name: 3ac8eb64-8702-4d53-8cab-ed55e4f9a715 num_items: 3446\n',
 'token, class_id, token_value, token_tf, token_df\n',
 'социокультурный, @word, 3.8952195609454066e-05, 87.0, 39.0\n']

In [None]:
dictionary_lines[-3:]

['легкий, @word, 8.909754978958517e-05, 199.0, 149.0\n',
 'строительный, @word, 2.999766729772091e-05, 67.0, 49.0\n',
 'внизу, @word, 1.8804508727043867e-05, 42.0, 36.0\n']

In [None]:
vocab_text = ''

for line in dictionary_lines[2:]:
    token, modality, _, _, _ = line.strip().split(', ')
    vocab_text += f'{token} {modality}\n'

vocab_file_path = os.path.join(
    DATASET_INTERNALS_FOLDER_PATH,
    'vocab.txt',
)

with open(vocab_file_path, 'w') as f:
    f.write(vocab_text)


del vocab_text
del dictionary_lines

Now one must run the BigARTM command line utility to gather cooccurrence statistics.
To do so, there should be a BigARTM executable file on the machine!
If `topicnet` was, for example, installed via `pip`, there may be no such executable.
However, one needs exactly this executable to run the command below.
There are some references about the BigARTM CLI on [TopicNet's GitHub page](https://github.com/machine-intelligence-laboratory/TopicNet).

In [None]:
# Specify path and run this cell, or run the command in bash

! ~/<YOUR PATH>/bigartm/build/bin/bigartm \
    -c $DATASET_INTERNALS_FOLDER_PATH/vw.txt \
    -v vocab.txt \
    --cooc-window 10 \
    --cooc-min-tf 2 \
    --write-cooc-tf cooc_tf_ \
    --cooc-min-df 2 \
    --write-cooc-df cooc_df_ \
    --write-ppmi-tf ppmi_tf_ \
    --write-ppmi-df ppmi_df_

The function below transforms the contents of `ppmi_tf_` or `ppmi_df_` into the format used in the notebook

In [None]:
def transform_coocs_file(source_file_path, target_file_path, vocab_file_path):
    """
    Source file is assumed to be either `ppmi_tf_` or `ppmi_df_`

    """
    num_times_to_log = 10

    vocab = open(vocab_file_path, 'r').readlines()
    vocab = [l.strip().split()[0] for l in vocab]
    
    cooc_values = dict()
    word_word_value_triples = set()
    
    lines = open(source_file_path, 'r').readlines()
    
    for i, l in enumerate(lines):
        if i % max(1, len(lines) // num_times_to_log) == 0:
            print(f'{i:6d} lines out of {len(lines)}')
        
        l = l.strip()
        words = l.split()

        if words[0].startswith('@'):  # modality
                                      # (not always included:
                                      #  @default_class seems not used in vocab.txt)
            words = words[1:]         # excluding modality
        
        anchor_word = words[0]
        other_word_values = words[1:]
        
        for word_and_value in other_word_values:
            other_word, value = word_and_value.split(':')
            value = float(value)
            
            cooc_values[(anchor_word, other_word)] = value
            
            # No need to do so: coherence scores do similar thing under the hood
            # cooc_values[(other_word, anchor_word)] = value  # if assume cooc values to be symmetric
            
            word_word_value_triples.add(
                tuple([
                    tuple(sorted([
                        vocab.index(anchor_word),
                        vocab.index(other_word)
                    ])),
                    value
                ])
            )
    
    new_text = ''
    
    for (w1, w2), v in word_word_value_triples:
        new_text += f'{w1} {w2} {v}\n'
    
    with open(target_file_path, 'w') as f:
        f.write(new_text)
    
    return cooc_values
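The line format that the function above relies on can be illustrated on a toy input. This is a minimal sketch: the format (`[@modality] anchor word:value word:value ...`) is inferred from the parsing code in this notebook, not taken from the BigARTM documentation, and `parse_cooc_line` is a hypothetical helper that is not part of `topnum`.

```python
from typing import Dict, Tuple


def parse_cooc_line(line: str) -> Dict[Tuple[str, str], float]:
    """Parse one line of the assumed form '[@modality] anchor w1:v1 w2:v2 ...'."""
    fields = line.strip().split()

    if fields and fields[0].startswith('@'):
        fields = fields[1:]  # drop the optional modality prefix

    if not fields:
        return {}

    anchor_word, pairs = fields[0], fields[1:]
    result = {}

    for pair in pairs:
        # rsplit keeps working if the token itself contains a ':'
        other_word, value = pair.rsplit(':', 1)
        result[(anchor_word, other_word)] = float(value)

    return result


toy_line = '@word cat dog:0.5 mouse:1.25\n'
toy_coocs = parse_cooc_line(toy_line)

print(toy_coocs)
```

For a large vocabulary, a precomputed `{word: index}` dictionary would probably be a faster way to look up indices than repeated `list.index` calls, which scan the list each time.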

In [None]:
COOC_VALUES = transform_coocs_file(
    os.path.join(DATASET_INTERNALS_FOLDER_PATH, 'ppmi_tf_'),
    os.path.join(DATASET_INTERNALS_FOLDER_PATH, 'new_ppmi_tf_'),
    os.path.join(DATASET_INTERNALS_FOLDER_PATH, 'vocab.txt'),
)

     0 lines out of 1673
   167 lines out of 1673
   334 lines out of 1673
   501 lines out of 1673
   668 lines out of 1673
   835 lines out of 1673
  1002 lines out of 1673
  1169 lines out of 1673
  1336 lines out of 1673
  1503 lines out of 1673
  1670 lines out of 1673


In [None]:
len(COOC_VALUES)

465882

In [None]:
it = iter(COOC_VALUES.items())

print(next(it), next(it))

del it

(('както', 'понимать'), 0.697806) (('както', 'сталкиваться'), 0.388971)


If there are too many coocs (say, more than $1\,000\,000$), it is better to do some additional filtering.
For example, keep only coocs whose value is not less than some threshold

In [None]:
min(COOC_VALUES.values())

1.67684e-07

In [None]:
max(COOC_VALUES.values())

6.46194

In [None]:
threshold_cooc = np.percentile(list(COOC_VALUES.values()), 10)

In [None]:
threshold_cooc

0.08462922

In [None]:
COOC_VALUES = {
    k: v for k, v in COOC_VALUES.items() if v >= threshold_cooc
}

del threshold_cooc
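The effect of a percentile threshold is easy to see on toy data. The sketch below uses ten made-up cooc values (not taken from any dataset) and keeps only the pairs at or above the 10th percentile, mirroring the filtering step above.

```python
import numpy as np

# Ten toy pairs with values 0.1, 0.2, ..., 1.0 (purely illustrative)
toy_coocs = {(f'w{i}', f'w{i + 1}'): (i + 1) / 10 for i in range(10)}

# 10th percentile of the values; pairs below it are dropped
threshold = np.percentile(list(toy_coocs.values()), 10)

filtered = {
    pair: value for pair, value in toy_coocs.items() if value >= threshold
}

print(f'threshold = {threshold:.2f}, kept {len(filtered)} of {len(toy_coocs)} pairs')
```

With the default linear interpolation, the threshold falls between the two smallest values, so only the weakest pair is removed.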

In [None]:
len(COOC_VALUES)

419293

Saving coocs to disk (in case the notebook crashes, for example) and then loading them back to check that everything is OK

In [None]:
COOC_VALUES_FILE_PATH = os.path.join(
    DATASET_INTERNALS_FOLDER_PATH, 'cooc_values.json'
)

with open(COOC_VALUES_FILE_PATH, 'w') as f:
    f.write(json.dumps(list(COOC_VALUES.items())))

In [None]:
len(COOC_VALUES)

419293

In [None]:
del COOC_VALUES

In [None]:
if not os.path.isfile(COOC_VALUES_FILE_PATH):
    COOC_VALUES = dict()
else:
    raw_cooc_values = json.loads(open(COOC_VALUES_FILE_PATH, 'r').read())
    
    print(
        raw_cooc_values[:20]
    )

    COOC_VALUES = {
        tuple(d[0]): d[1] for d in raw_cooc_values
    }

[[['както', 'понимать'], 0.697806], [['както', 'сталкиваться'], 0.388971], [['както', 'переводить'], 0.543726], [['както', 'любовь'], 0.905235], [['както', 'возможно'], 0.342641], [['както', 'экономист'], 0.119018], [['както', 'дорога'], 0.537558], [['както', 'академический'], 0.357577], [['както', 'проявлять'], 0.827184], [['както', 'проблема'], 0.464159], [['както', 'подавлять'], 0.581892], [['както', 'трудно'], 0.735254], [['както', 'действительно'], 0.263179], [['както', 'оставаться'], 0.162784], [['както', 'образовываться'], 0.712785], [['както', 'молодой'], 0.623452], [['както', 'пора'], 0.332543], [['както', 'очевидно'], 0.415827], [['както', 'насколько'], 0.429967], [['както', 'неделя'], 1.35179]]
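JSON objects cannot have tuple keys, which is why the dictionary is stored as a list of `[key, value]` pairs and the keys are turned back into tuples on load. A small in-memory sketch of the same round trip, using toy data:

```python
import json

toy_coocs = {('cat', 'dog'): 0.5, ('cat', 'mouse'): 1.25}

# Serialize as a list of [key, value] pairs; tuples become JSON arrays
serialized = json.dumps(list(toy_coocs.items()))

# After loading, keys are lists, so they must be converted back to tuples
restored = {tuple(key): value for key, value in json.loads(serialized)}

print(restored == toy_coocs)
```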


In [None]:
len(COOC_VALUES)

419293

In [None]:
SAMPLE_COOC_KEY = list(COOC_VALUES.keys())[419126]

In [None]:
SAMPLE_COOC_KEY

('прекрасно', 'огромный')

#### Lower Memory Consumption (or a Bit of Shamanism. Part 1)<a id="optimizing-memory"></a>

<div style="text-align: right">Back to <a href=#contents>Contents</a></div>

As it happens, the coherence scores in the module do not store coocs in the most memory-efficient way.
Let's try to fix it right here in the notebook.

In [None]:
COOC_VALUES.__getitem__(SAMPLE_COOC_KEY)

0.126559

Collecting words in a vocabulary

In [None]:
VOCABULARY = set()

for word_a, word_b in tqdm(COOC_VALUES.keys(), file=sys.stdout):
    VOCABULARY.add(word_a)
    VOCABULARY.add(word_b)

100%|██████████| 419293/419293 [00:00<00:00, 2343319.19it/s]


In [None]:
len(VOCABULARY)

1676

In [None]:
WORD_TO_INDEX = {
    w: i for i, w in enumerate(VOCABULARY)
}

In [None]:
len(WORD_TO_INDEX)

1676

Here is a trick: we imitate the coocs dictionary, but key it by word indices instead of words.
So we need to store only the vocabulary plus index pairs instead of word pairs.
If there are many coocs (more than $1\,000\,000$), this may help reduce memory consumption.
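On toy data, re-keying a word-pair dictionary by index pairs might look as follows. This is only a sketch with made-up values. The vocabulary is sorted here so that the indices are the same in every session, whereas enumerating a plain `set` of strings can give a different order from run to run.

```python
toy_coocs = {('cat', 'dog'): 0.5, ('dog', 'mouse'): 1.25}

# Sorted vocabulary gives reproducible word indices
vocabulary = sorted({word for pair in toy_coocs for word in pair})
word_to_index = {word: index for index, word in enumerate(vocabulary)}

# Same values, but keys are pairs of small integers instead of strings
index_coocs = {
    (word_to_index[word_a], word_to_index[word_b]): value
    for (word_a, word_b), value in toy_coocs.items()
}

print(word_to_index)
print(index_coocs)
```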

In [None]:
class CoocValues:
    def __init__(self, word2index, coocs):
        self._word2index = word2index
        self._coocs = coocs
        
        self._index2word = {
            i: w
            for w, i in self._word2index.items()
        }
    
    def _map_key(self, word_pair_key):
        word_a, word_b = word_pair_key
        index_a = self._word2index[word_a]
        index_b = self._word2index[word_b]

        return (index_a, index_b)

    def __getitem__(self, word_pair_key):
        return self._coocs[self._map_key(word_pair_key)]
    
    def __len__(self):
        return len(self._coocs)
    
    def has_key(self, word_pair_key):
        word_a, word_b = word_pair_key

        return (
            word_a in self._word2index and
            word_b in self._word2index and
            self._map_key(word_pair_key) in self._coocs
        )
    
    def __contains__(self, k):
        return self.has_key(k)
    
    def __iter__(self):
        for (index_a, index_b) in self._coocs:
            yield (
                self._index2word[index_a],
                self._index2word[index_b]
            )

In [None]:
_COOC_VALUES = {(WORD_TO_INDEX[a], WORD_TO_INDEX[b]): v for (a, b), v in COOC_VALUES.items()}
COOC_VALUES = CoocValues(WORD_TO_INDEX, _COOC_VALUES)

Checking if all OK

In [None]:
COOC_VALUES[SAMPLE_COOC_KEY]

0.126559

In [None]:
len(COOC_VALUES)

419293

In [None]:
for c in COOC_VALUES:
    print(c)

    break

('както', 'понимать')


In [None]:
del SAMPLE_COOC_KEY

### Documents for Coherence Scores<a id="docs-for-cohs"></a>

<div style="text-align: right">Back to <a href=#contents>Contents</a></div>

Now we need to choose documents to be used by coherence scores.
Using all documents is not a good choice, because things would be too slow.

Computing document lengths (to select only documents that are neither too short nor too long)

In [None]:
lengths = DATASET._data['vw_text'].apply(lambda text: len(text.split()))

In [None]:
%%time

if not KEEP_DATASET_IN_MEMORY:
    lengths = lengths.compute()
    lengths = lengths.values

CPU times: user 2.63 s, sys: 0 ns, total: 2.63 s
Wall time: 2.5 s


In [None]:
%%time

if not KEEP_DATASET_IN_MEMORY:
    DATASET.get_vw_document(DATASET._data_index[0])
else:
    DATASET.get_vw_document(DATASET._data.index[0])

In [None]:
median_length = np.median(lengths)
p25_length = np.percentile(lengths, 25)
p75_length = np.percentile(lengths, 75)

print(f'{p25_length:.2f} (25%) < {median_length:.2f} (median) < {p75_length:.2f} (75%)')

182.00 (25%) < 462.50 (median) < 871.75 (75%)


These documents are of ordinary lengths:

In [None]:
ORDINARY_DOCUMENTS = [
    d for i, d in enumerate(DOCUMENTS)
    if (
        lengths[i] <= p75_length
        and
        lengths[i] >= p25_length
    )
]

In [None]:
print(
    f'Num ordinary documents: {len(ORDINARY_DOCUMENTS)}'
    f' ({100 * len(ORDINARY_DOCUMENTS) / len(DOCUMENTS):.2f}%'
    f' of all {len(DOCUMENTS)})'
)

Num ordinary documents: 1728 (50.15% of all 3446)
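The same interquartile selection can be sketched on toy lengths with a boolean mask. The numbers below are made up for illustration.

```python
import numpy as np

toy_lengths = np.array([10, 50, 100, 200, 400, 800, 1600, 3200])

# 25th and 75th percentiles of the document lengths
p25, p75 = np.percentile(toy_lengths, [25, 75])

# Keep only documents whose length lies between the two quartiles
mask = (toy_lengths >= p25) & (toy_lengths <= p75)
ordinary_lengths = toy_lengths[mask]

print(p25, p75, ordinary_lengths)
```

By construction, roughly half of the documents fall into this range, which agrees with the share of about 50% printed above.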


Let's use only about `TEXT_LENGTH_FOR_COHERENCE` words for coherence scores (the number of test documents is this word budget divided by the median document length)

In [None]:
TEXT_LENGTH_FOR_COHERENCE = 25000
NUM_TEST_DOCUMENTS = int(TEXT_LENGTH_FOR_COHERENCE / median_length)

In [None]:
NUM_TEST_DOCUMENTS

54

In [None]:
print(median_length * NUM_TEST_DOCUMENTS)

24975.0


In [None]:
del lengths
del median_length
del p25_length
del p75_length

As we choose `NUM_TEST_DOCUMENTS` documents randomly, we need to do this several times and average the results.
In the experiments, three seeds were used (so documents were selected and models were trained three times).

In [None]:
seed = 11221963  # one of: 11221963, 42, 0
random = np.random.RandomState(seed)

TEST_DOCUMENTS = random.choice(
    ORDINARY_DOCUMENTS, size=NUM_TEST_DOCUMENTS, replace=False
)
TEST_DOCUMENTS = list(TEST_DOCUMENTS)

del random

In [None]:
TEST_DOCUMENTS[:10]

['59620.txt',
 '16645.txt',
 '47547.txt',
 '28543.txt',
 '34582.txt',
 '13548.txt',
 '32976.txt',
 '21798.txt',
 '48114.txt',
 '36253.txt']
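Repeating the selection for several seeds could be sketched like this. The document ids are made up, `select_documents` is a hypothetical helper, and the seed values are the ones that appear in the cell above; whether exactly these three were used in the experiments is an assumption.

```python
import numpy as np


def select_documents(documents, num_documents, seed):
    """Pick `num_documents` distinct documents, reproducibly for a given seed."""
    rng = np.random.RandomState(seed)

    return rng.choice(documents, size=num_documents, replace=False).tolist()


toy_documents = [f'{i}.txt' for i in range(100)]
seeds = [11221963, 42, 0]

# One independent sample of documents per seed
samples = {s: select_documents(toy_documents, 5, s) for s in seeds}

for s, docs in samples.items():
    print(s, docs)
```

Each seed then gets its own bank and result file, and the scores can be averaged over the seeds afterwards.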

#### Lower Time Consumption in Case of Big Datasets (or a Bit of Shamanism. Part 2)<a id="optimizing-time"></a>

<div style="text-align: right">Back to <a href=#contents>Contents</a></div>

Currently, if the dataset is initialized with `keep_in_memory=False`, getting a document's text takes a lot of time (which is not acceptable when computing coherence scores).
So we need another trick: we retrieve the texts of the previously selected documents *once* and save them on disk.
Then we modify the dataset entity so that it reads these saved texts.

This is how slow things are before we try to do anything

In [None]:
%%time

DATASET.get_vw_document(TEST_DOCUMENTS[0])

CPU times: user 2.14 s, sys: 0 ns, total: 2.14 s
Wall time: 2.01 s


Unnamed: 0_level_0,vw_text
id,Unnamed: 1_level_1
59620.txt,59620.txt |@word магический заговор постсоветс...


Saving original `get_vw_document()` function just in case

In [None]:
DATASET._original_get_vw_document = DATASET.get_vw_document

In [None]:
CACHED_VW_TEXTS_FOLDER_PATH = os.path.join(
    DATASET_INTERNALS_FOLDER_PATH,
    'cached_vw_texts',
)

In [None]:
os.makedirs(CACHED_VW_TEXTS_FOLDER_PATH, exist_ok=True)

In [None]:
def cached_vw_text_file_path(doc_id):
    return os.path.join(
        CACHED_VW_TEXTS_FOLDER_PATH, f'{doc_id}.csv'
    )

Here we save text on disk.
This may take time (especially if `keep_in_memory=False`)!
Up to $1$ hour or even more.

In [None]:
%%time

# Needed to raise RAM from 8 Gb to 16 Gb (and still 4 processes)
for doc_id in tqdm(TEST_DOCUMENTS, total=len(TEST_DOCUMENTS), file=sys.stdout):
    vw_doc = DATASET._original_get_vw_document(doc_id)
    vw_doc.to_csv(cached_vw_text_file_path(doc_id))

In [None]:
def quick_cached_get_vw_document(self, document_id):
    return pd.read_csv(
        cached_vw_text_file_path(document_id),
        index_col=0,
    )

And here we replace `get_vw_document()` with a new function

In [None]:
DATASET.get_vw_document = quick_cached_get_vw_document.__get__(DATASET, Dataset)

In [None]:
DATASET._original_get_vw_document

<bound method Dataset.get_vw_document of <topicnet.cooking_machine.dataset.Dataset object at 0x7f24184c5fd0>>

In [None]:
DATASET.get_vw_document

<bound method quick_cached_get_vw_document of <topicnet.cooking_machine.dataset.Dataset object at 0x7f24184c5fd0>>
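The replacement above relies on the descriptor protocol: calling `__get__` on a plain function binds it to an instance, so it can then be called like an ordinary method. A toy sketch with a hypothetical `Greeter` class (unrelated to TopicNet):

```python
class Greeter:
    def __init__(self, name):
        self.name = name

    def greet(self):
        return f'Hello, {self.name}'


def loud_greet(self):
    return f'HELLO, {self.name.upper()}!'


greeter = Greeter('Ada')
other = Greeter('Bob')

# Keep the original bound method, then shadow it on this one instance only
greeter._original_greet = greeter.greet
greeter.greet = loud_greet.__get__(greeter, Greeter)

print(greeter.greet())
print(greeter._original_greet())
print(other.greet())
```

Only the patched instance changes behavior; other instances of the class keep the original method. `types.MethodType(loud_greet, greeter)` is an equivalent way to create the bound method.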

Now much better!

In [None]:
%%time

DATASET.get_vw_document(TEST_DOCUMENTS[0])

CPU times: user 3.08 ms, sys: 3.36 ms, total: 6.44 ms
Wall time: 5.42 ms


Unnamed: 0_level_0,vw_text
id,Unnamed: 1_level_1
59620.txt,59620.txt |@word магический заговор постсоветс...


## Experiment<a id="experiment"></a>

Finally, we are getting to the main part!

### Scores (for Topics and Models)<a id="scores"></a>

<div style="text-align: right">Back to <a href=#contents>Contents</a></div>

Here we define a lot of scores (which mainly differ in initial parameters).

In [None]:
WINDOW = 20
NUM_TOP_WORDS = 20
MAX_NUM_OUT_WORDS = 5

VERBOSE = False

Two core coherence scores

In [None]:
MAIN_TOPIC_SCORE = IntratextCoherenceScore(
    name='intratext_coherence_score__tt_vw__cm_seg_weight__wtrt_pwt__sem_none',
    data=DATASET,
    documents=TEST_DOCUMENTS,
    text_type=TextType.VW_TEXT,
    computation_method=ComputationMethod.SEGMENT_WEIGHT,
    word_topic_relatedness=WordTopicRelatednessType.PWT,
    specificity_estimation=SpecificityEstimationMethod.NONE,
    max_num_out_of_topic_words=MAX_NUM_OUT_WORDS,
    window=WINDOW,
    verbose=VERBOSE,
)

OTHER_TOPIC_SCORES = [
    SophisticatedTopTokensCoherenceScore(
        name='top_tokens_coherence_score__tt_vw__wtrt_pwt__sem_none',
        data=DATASET,
        documents=TEST_DOCUMENTS,
        text_type=TextType.VW_TEXT,
        word_topic_relatedness=WordTopicRelatednessType.PWT,
        specificity_estimation=SpecificityEstimationMethod.NONE,
        num_top_words=NUM_TOP_WORDS,
        window=WINDOW,
        verbose=VERBOSE,
    )
]

Other coherence score variations

In [None]:
text_type_ids = {
    TextType.VW_TEXT: 'vw',
}
computation_method_ids = {
    ComputationMethod.SEGMENT_WEIGHT: 'seg_weight',
    ComputationMethod.SEGMENT_LENGTH: 'seg_length',
    ComputationMethod.SUM_OVER_WINDOW: 'sow',
}
word_topic_relatedness_type_ids = {
    WordTopicRelatednessType.PWT: 'pwt',
    WordTopicRelatednessType.PTW: 'ptw',
}
specificity_estimation_method_ids = {
    SpecificityEstimationMethod.NONE: 'none',
    SpecificityEstimationMethod.AVERAGE: 'av',
    SpecificityEstimationMethod.MAXIMUM: 'max',
}


param_combinations_intratext = list(
    itertools.product(
        text_type_ids,
        computation_method_ids,
        word_topic_relatedness_type_ids,
        specificity_estimation_method_ids,
    )
)
param_combinations_intratext.remove(
    (
        TextType.VW_TEXT,
        ComputationMethod.SEGMENT_WEIGHT,
        WordTopicRelatednessType.PWT,
        SpecificityEstimationMethod.NONE
    )
)

param_combinations_top_tokens = list(
    itertools.product(
        text_type_ids,
        word_topic_relatedness_type_ids,
        specificity_estimation_method_ids,
    )
)
param_combinations_top_tokens.remove(
    (
        TextType.VW_TEXT,
        WordTopicRelatednessType.PWT,
        SpecificityEstimationMethod.NONE
    )
)


for param_combination in param_combinations_intratext:
    (text_type,
     computation_method,
     word_topic_relatedness,
     specificity_estimation) = param_combination

    name = (
        f'intratext_coherence_score'
        f'__tt_{text_type_ids[text_type]}'
        f'__cm_{computation_method_ids[computation_method]}'
        f'__wtrt_{word_topic_relatedness_type_ids[word_topic_relatedness]}'
        f'__sem_{specificity_estimation_method_ids[specificity_estimation]}'
    )

    OTHER_TOPIC_SCORES.append(
        IntratextCoherenceScore(
            name=name,
            data=DATASET,
            documents=TEST_DOCUMENTS,
            text_type=text_type,
            computation_method=computation_method,
            word_topic_relatedness=word_topic_relatedness,
            specificity_estimation=specificity_estimation,
            max_num_out_of_topic_words=MAX_NUM_OUT_WORDS,
            window=WINDOW,
            verbose=VERBOSE,
        )
    )


for param_combination in param_combinations_top_tokens:
    (text_type,
     word_topic_relatedness,
     specificity_estimation) = param_combination

    name = (
        f'top_tokens_coherence_score'
        f'__tt_{text_type_ids[text_type]}'
        f'__wtrt_{word_topic_relatedness_type_ids[word_topic_relatedness]}'
        f'__sem_{specificity_estimation_method_ids[specificity_estimation]}'
    )

    OTHER_TOPIC_SCORES.append(
        SophisticatedTopTokensCoherenceScore(
            name=name,
            data=DATASET,
            documents=TEST_DOCUMENTS,
            text_type=text_type,
            word_topic_relatedness=word_topic_relatedness,
            specificity_estimation=specificity_estimation,
            #word_cooccurrences=COOC_VALUES2,  # TODO!
            num_top_words=NUM_TOP_WORDS,
            window=WINDOW,
            verbose=VERBOSE,
        )
    )

In [None]:
len(OTHER_TOPIC_SCORES)

23
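The number of scores can be double-checked by counting the parameter combinations directly. The sketch below uses the short string ids from the cell above instead of the enum members; there is only one text type, so it does not affect the count.

```python
import itertools

computation_methods = ['seg_weight', 'seg_length', 'sow']
relatedness_types = ['pwt', 'ptw']
specificity_methods = ['none', 'av', 'max']

intratext = list(itertools.product(
    computation_methods, relatedness_types, specificity_methods
))
intratext.remove(('seg_weight', 'pwt', 'none'))  # used as the main topic score

top_tokens = list(itertools.product(relatedness_types, specificity_methods))
top_tokens.remove(('pwt', 'none'))  # already the first of the other scores

# One initial top-tokens score + remaining intratext + remaining top-tokens
num_other_scores = 1 + len(intratext) + len(top_tokens)

print(len(intratext), len(top_tokens), num_other_scores)
```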

Another implementation of top-tokens-based coherence

In [None]:
# Currently this score is particularly slow
# So we use just one combination of parameters
#
# param_combinations_other_top_tokens = list(
#     itertools.product([True, False], ['median', 'mean'], [None, 1e-7])
# )

param_combinations_other_top_tokens = list(
    itertools.product([True], ['median'], [1e-7])
)

if len(COOC_VALUES) > 0:  # with pre-computed coocs
    for param_combination in param_combinations_other_top_tokens:
        (kernel,
         average,
         active_topic_threshold) = param_combination

        name = (
            f'top_tokens_coherence_other_implementation_score'
            f'__ker_{kernel}'
            f'__av_{average}'
            f'__att_{active_topic_threshold}'
        )

        OTHER_TOPIC_SCORES.append(
            SimpleTopTokensCoherenceScore(
                name=name,
                data=DATASET,
                cooccurrence_values=COOC_VALUES,
                num_top_tokens=20,
                kernel=kernel,
                average=average,
                active_topic_threshold=active_topic_threshold,
            )
        )

In [None]:
len(OTHER_TOPIC_SCORES)

24

And a pair of default ARTM scores (these ones are fast)

In [None]:
OTHER_SCORES = [
    SparsityPhiScore(
        name='sparsity_phi_score'
    ),
    SparsityThetaScore(
        name='sparsity_theta_score'
    ),
]

### Bank Creation<a id="bank-creation"></a>

<div style="text-align: right">Back to <a href=#contents>Contents</a></div>

Here we finally run the experiment!

In [None]:
# We use only one train function here
# Other variations are also possible
# It would be even better to make bank using several train functions
# However, it would also take way more time 

TRAIN_FUNCS = default_train_func  # default train func

In [None]:
DATASET_INTERNALS_FOLDER_PATH

'./Post_Science__internals__test'

In [None]:
SEARCH_RESULTS_FOLDER_PATH = os.path.join(
    DATASET_INTERNALS_FOLDER_PATH, 'result'
)

# File with some info about the process
SEARCH_RESULT_FILE_PATH = os.path.join(
    SEARCH_RESULTS_FOLDER_PATH, f'search_result__{seed}.json'
)

# Bank, with topics and their score values
BANK_FOLDER_PATH = os.path.join(
    SEARCH_RESULTS_FOLDER_PATH, f'bank__{seed}'
)

In [None]:
SEARCH_RESULTS_FOLDER_PATH

'./Post_Science__internals__test/result'

In [None]:
BANK_FOLDER_PATH

'./Post_Science__internals__test/result/bank__11221963'

In [None]:
os.makedirs(SEARCH_RESULTS_FOLDER_PATH, exist_ok=True)
os.makedirs(BANK_FOLDER_PATH, exist_ok=True)

In [None]:
seed

11221963

One can vary some parameters below (for example, `max_num_models` and `num_fit_iterations`).

In [None]:
optimizer = TopicBankMethod(
    data        = DATASET,
    min_df_rate = 0.0,  # dictionary filtering has already been done a little earlier
    max_df_rate = 1.0,  #   so we don't want these parameters to have any effect

    main_topic_score   = MAIN_TOPIC_SCORE,
    other_topic_scores = OTHER_TOPIC_SCORES,
    other_scores       = OTHER_SCORES,
    documents          = TEST_DOCUMENTS,

    start_model_number   = 0,
    max_num_models       = 20,
    one_model_num_topics = 100,
    num_fit_iterations   = 100,  # 100 should be enough;
                                 # however, for big data better to reduce this one
                                 # (otherwise the process will be too slow)

    topic_score_threshold_percentile = 90,

    save_bank         = True,
    save_model_topics = True,
    save_file_path    = SEARCH_RESULT_FILE_PATH,
    bank_folder_path  = BANK_FOLDER_PATH,

    train_funcs = TRAIN_FUNCS,
    
    verbose = True,
)

# TODO: use Holdout Perplexity as Stop score

In [None]:
optimizer._result.keys()

dict_keys(['optimum', 'optimum_std', 'bank_scores', 'bank_topic_scores', 'model_scores', 'model_topic_scores', 'num_bank_topics', 'num_model_topics'])

Checking file paths

In [None]:
optimizer._save_file_path

'./Post_Science__internals__test/result/search_result__11221963.json'

In [None]:
optimizer._topic_bank._path

'./Post_Science__internals__test/result/bank__11221963'

Running the search (get ready for a really long process!):

In [None]:
%%time

optimizer.search_for_optimum(dataset)

  0%|          | 0/20 [00:00<?, ?it/s]

  'The parameter `documents` is not used by SimpleTopTokensCoherenceScore'

  5%|▌         | 1/20 [54:54<17:23:11, 3294.29s/it]
 10%|█         | 2/20 [2:09:11<18:12:57, 3643.20s/it]
 15%|█▌        | 3/20 [3:14:59<17:38:08, 3734.64s/it]
 20%|██        | 4/20 [4:17:15<16:36:01, 3735.11s/it]
 25%|██▌       | 5/20 [5:24:36<15:56:43, 3826.91s/it]
 30%|███       | 6/20 [6:29:51<14:59:06, 3853.35s/it]
 35%|███▌      | 7/20 [7:42:40<14:28:21, 4007.83s/it]
 40%|████      | 8/20 [8:56:58<13:48:35, 4142.98s/it]
 45%|████▌     | 9/20 [10:14:37<13:07:56, 4297.90s/it]
 50%|█████     | 10/20 [11:28:20<12:02:34, 4335.42s/it]
 55%|█████▌    | 11/20 [12:40:17<10:49:28, 4329.84s/it]
 60%|██████    | 12/20 [13:52:02<9:36:18, 4322.34s/it]
 65%|██████▌   | 13/20 [15:10:42<8:38:10, 4441.53s/it]
 70%|███████   | 14/20 [16:06:33<6:51:27, 4114.53s/it]
 75%|███████▌  | 15/20 [17:02:16<5:23:34, 3882.95s/it]
 80%|████████  | 16/20 [17:58:30<4:08:41, 3730.39s/it]
 85%|████████▌ | 17/20 [18:54:08<3:00:37, 3612.44s/it]
 90%|█████████ | 18/20 [19:49:29<1:57:30, 3525.26s/it]
 95%|█████████▌| 19/20 [20:44:10<57:31, 3451.85s/it]
100%|██████████| 20/20 [21:39:25<00:00, 3898.28s/it]
CPU times: user 1d 10h 46s, sys: 3h 46min 9s, total: 1d 13h 46min 55s
Wall time: 21h 39min 25s
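
The reported wall time gives a rough budget for other settings of `max_num_models`. The sketch below is a back-of-the-envelope estimate only: it assumes that the time grows linearly with the number of models, which is an approximation (per-model times in the log above vary). `estimate_hours` is a hypothetical helper.

```python
def to_seconds(hours, minutes, seconds):
    """Convert an h/min/s triple to seconds."""
    return hours * 3600 + minutes * 60 + seconds


# Wall time reported by %%time above, for max_num_models = 20
total_seconds = to_seconds(21, 39, 25)
seconds_per_model = total_seconds / 20


def estimate_hours(num_models, seconds_per_model=seconds_per_model):
    """Rough wall-time estimate in hours, assuming linear scaling in the number of models."""
    return num_models * seconds_per_model / 3600


print(f'{seconds_per_model:.2f} s per model')
print(f'{estimate_hours(10):.1f} h for 10 models')
```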


Topics we have in the bank

In [None]:
optimizer._topic_bank.view_topics().head()

Unnamed: 0,Unnamed: 1,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,topic_9
@word,перикл,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
@word,леагр,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
@word,ольвийский,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
@word,амфора,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
@word,вазопись,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
bank_topics = optimizer._topic_bank.view_topics()

In [None]:
bank_topics.shape

(2514, 15)

In [None]:
bank_topics.head()

Unnamed: 0,Unnamed: 1,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,topic_9,topic_10,topic_11,topic_12,topic_13,topic_14
@default_class,както,0.0,0.0,5.497794e-11,0.0,1.985204e-08,2.223492e-14,9e-06,0.0,0.0,0.0,0.0,0.0,0.0,2.414336e-06,4.583967e-11
@default_class,гравитационный,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.022702,0.0,0.0,0.0,0.0,0.0
@default_class,жена,0.0,0.00139,0.0,0.0,0.0,0.0001217113,0.0,3.405896e-15,0.0,0.0,0.0,0.0,0.0,1.091175e-13,8.552548e-05
@default_class,продолжительность,0.0,0.0,0.0,3.47069e-15,0.0,0.0,0.0,0.0004282006,0.0,0.0,3.47461e-08,0.0,0.0,0.0,0.0
@default_class,одновременно,0.0,0.00075,5.159791e-06,7.805681e-09,3.03166e-06,5.124066e-09,0.000146,8.667931e-13,0.000935,0.000564,1.033011e-07,0.0,0.0,2.076379e-05,1.407077e-09


In [None]:
bank_topics['topic_14'].sort_values(ascending=False)[:20]

@default_class  век             0.229964
                xx              0.053898
                xix             0.051478
                начало          0.043227
                первый          0.038318
                конец           0.035413
                второй          0.019743
                середина        0.018641
                время           0.017661
                половина        0.017191
                xviii           0.013792
                классический    0.013726
                xvii            0.012661
                возникать       0.011614
                эпоха           0.010723
                хх              0.010398
                новый           0.010231
                столетие        0.008743
                образ           0.008709
                период          0.008326
Name: topic_14, dtype: float64
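
The same lookup can be wrapped in a small helper so that any topic can be inspected quickly. The sketch below runs on a toy frame with made-up probabilities; the only assumption is that the topics frame is indexed by `(modality, token)` and has one column per topic, as in the output above. `top_tokens` is a hypothetical name.

```python
import pandas as pd


def top_tokens(topics, topic_name, num_tokens=20):
    """Return the `num_tokens` most probable tokens of one topic."""
    return topics[topic_name].nlargest(num_tokens)


# Toy data (made-up probabilities), indexed like the bank topics frame
toy_topics = pd.DataFrame(
    {'topic_0': [0.7, 0.2, 0.1], 'topic_1': [0.1, 0.3, 0.6]},
    index=pd.MultiIndex.from_tuples([
        ('@default_class', 'век'),
        ('@default_class', 'начало'),
        ('@default_class', 'конец'),
    ]),
)

top = top_tokens(toy_topics, 'topic_1', num_tokens=2)
print(top)
```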

And topic scores

In [None]:
optimizer._topic_bank.view_topic_scores()

Unnamed: 0,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,topic_9,topic_10,topic_11,topic_12,topic_13,topic_14
kernel_size,229.0,408.0,102.0,157.0,235.0,302.0,345.0,347.0,283.0,391.0,338.0,59.0,121.0,407.0,214.0
intratext_coherence_score__tt_vw__cm_seg_weight__wtrt_pwt__sem_none,0.129687,0.123335,0.468014,0.249535,0.211076,0.246103,0.084003,0.103003,0.149715,0.085349,0.102518,0.177264,0.116741,0.102491,0.099711
top_tokens_coherence_score__tt_vw__wtrt_pwt__sem_none,1.133654,0.66788,1.629793,0.835258,0.835292,0.806799,0.434416,0.921406,0.700194,0.409935,0.473712,0.614014,0.58399,0.500487,0.975982
intratext_coherence_score__tt_vw__cm_seg_weight__wtrt_pwt__sem_av,0.127684,0.121866,0.462997,0.246569,0.207621,0.243621,0.082943,0.101184,0.147921,0.084183,0.101309,0.173868,0.113667,0.101379,0.0973
intratext_coherence_score__tt_vw__cm_seg_weight__wtrt_pwt__sem_max,0.120821,0.114016,0.462851,0.245817,0.116343,0.24557,0.079723,0.083172,0.143387,0.07581,0.096531,0.103924,0.100431,0.100097,0.080213
intratext_coherence_score__tt_vw__cm_seg_weight__wtrt_ptw__sem_none,0.46368,0.783485,0.763705,0.481053,0.594415,0.914586,0.645213,0.466021,0.734765,0.76588,0.903625,0.845633,0.492089,0.642143,0.650835
intratext_coherence_score__tt_vw__cm_seg_weight__wtrt_ptw__sem_av,0.45275,0.771956,0.753009,0.470381,0.58359,0.903922,0.633082,0.455416,0.723801,0.753463,0.890064,0.83087,0.479616,0.631021,0.636409
intratext_coherence_score__tt_vw__cm_seg_weight__wtrt_ptw__sem_max,0.385764,0.677018,0.71897,0.41817,0.345956,0.879365,0.549258,0.345888,0.661878,0.648797,0.772138,0.63636,0.417972,0.574683,0.525002
intratext_coherence_score__tt_vw__cm_seg_length__wtrt_pwt__sem_none,1.092971,1.152834,1.069599,1.067143,1.082481,1.066313,1.213043,1.060516,1.096398,1.241758,1.356061,1.476266,1.247272,1.112245,1.442545
intratext_coherence_score__tt_vw__cm_seg_length__wtrt_pwt__sem_av,1.092971,1.152834,1.069599,1.067143,1.082481,1.066313,1.213043,1.060516,1.096398,1.241758,1.356061,1.476266,1.247272,1.112245,1.442545
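
Since rows are scores and columns are topics, it is easy to see which topic each score favours. A minimal sketch, using the first three rows and the first three topics copied from the table above:

```python
import pandas as pd

# Values copied from the table above (first three scores, first three topics)
topic_scores = pd.DataFrame(
    {
        'topic_0': [229.0, 0.129687, 1.133654],
        'topic_1': [408.0, 0.123335, 0.667880],
        'topic_2': [102.0, 0.468014, 1.629793],
    },
    index=[
        'kernel_size',
        'intratext_coherence_score__tt_vw__cm_seg_weight__wtrt_pwt__sem_none',
        'top_tokens_coherence_score__tt_vw__wtrt_pwt__sem_none',
    ],
)

# For every score (row), the topic (column) with the highest value
best_topic_per_score = topic_scores.idxmax(axis=1)
print(best_topic_per_score)
```

On this subset, both coherence scores pick `topic_2`, while `kernel_size` is largest for `topic_1`.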


All models are also saved (topics as $\Phi$ matrices and topic score values)

In [None]:
! ls $optimizer._topic_bank._path

model_0__phi.bin	   model_4__phi.bin	      model_8__phi.bin
model_0__topic_scores.bin  model_4__topic_scores.bin  model_8__topic_scores.bin
model_1__phi.bin	   model_5__phi.bin	      model_9__phi.bin
model_1__topic_scores.bin  model_5__topic_scores.bin  model_9__topic_scores.bin
model_2__phi.bin	   model_6__phi.bin	      topics.bin
model_2__topic_scores.bin  model_6__topic_scores.bin  topic_scores.bin
model_3__phi.bin	   model_7__phi.bin
model_3__topic_scores.bin  model_7__topic_scores.bin


## Postprocessing<a id="postprocessing"></a>

<div style="text-align: right">Back to <a href=#contents>Contents</a></div>

Remember, we were going to run the computations several times (with different seeds and different documents used by the coherence scores).
Here we combine these results to get just one topic bank.
Supposedly, topic banks for different seeds should *not* differ much: exactly the same models are used in all cases.
The only difference is the way we estimate topic quality (or, more precisely, the documents we use for topic coherence computation).

P.S.
As I am writing the explanatory .md text, I realize that the bank creation process could have been optimized: models should have been trained *only once* for each dataset.
Then, for different seeds, only quality estimation would be run, without actual training.
In the version presented in this notebook, models are trained anew for each seed.
However, coherence computation still remains the most time-consuming part (and it has to be done separately for each seed).
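
The optimization from the P.S. can be sketched as a two-stage loop: train every model once, then score the same models for each seed. Everything here is a stand-in: `train_model` and `score_topics` are hypothetical placeholders for model training and coherence computation, and they return random numbers instead of real topics and scores.

```python
import random

TRAIN_CALLS = []  # records how many times training is invoked


def train_model(model_index):
    """Hypothetical stand-in for the expensive training step."""
    TRAIN_CALLS.append(model_index)
    rng = random.Random(model_index)
    return [rng.random() for _ in range(5)]  # five fake "topics"


def score_topics(model, seed):
    """Hypothetical stand-in for coherence computation on a seed-specific document sample."""
    rng = random.Random(seed)
    return [topic * rng.random() for topic in model]


def score_models_for_seeds(num_models, seeds):
    # Stage 1: train each model exactly once
    models = [train_model(i) for i in range(num_models)]

    # Stage 2: only the scoring is repeated for every seed
    return {
        seed: [score_topics(model, seed) for model in models]
        for seed in seeds
    }


scores_by_seed = score_models_for_seeds(num_models=3, seeds=[11221963, 42, 0])
print(len(TRAIN_CALLS), 'training runs for', len(scores_by_seed), 'seeds')
```

With this layout the number of training runs depends only on the number of models, not on the number of seeds.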

In [None]:
os.listdir(DATASET_INTERNALS_FOLDER_PATH)

['vw.txt',
 'vocab.txt',
 'dict.dict.txt',
 'new_ppmi_tf_',
 'ppmi_tf_',
 'dict.dict',
 'cooc_values.json',
 'batches',
 'result']

In [None]:
SEARCH_RESULTS_FOLDER_PATH

'./Post_Science__internals/result'

In [None]:
os.listdir(SEARCH_RESULTS_FOLDER_PATH)

['bank__11221963',
 'search_result__11221963.json',
 'bank__0',
 'search_result__0.json',
 'search_result__42.json',
 'bank__42']

Looking at what is inside one bank's folder

In [None]:
one_dataset_bank_folder_path = os.path.join(
    SEARCH_RESULTS_FOLDER_PATH,
    'bank__11221963',
)

folder_contents = sorted(os.listdir(one_dataset_bank_folder_path))

print(folder_contents[:10])
print('...')
print(folder_contents[-10:])

['model_0__phi.bin', 'model_0__theta.bin', 'model_0__topic_scores.bin', 'model_10__phi.bin', 'model_10__theta.bin', 'model_10__topic_scores.bin', 'model_11__phi.bin', 'model_11__theta.bin', 'model_11__topic_scores.bin', 'model_12__phi.bin']
...
['model_7__theta.bin', 'model_7__topic_scores.bin', 'model_8__phi.bin', 'model_8__theta.bin', 'model_8__topic_scores.bin', 'model_9__phi.bin', 'model_9__theta.bin', 'model_9__topic_scores.bin', 'topic_scores.bin', 'topics.bin']
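
Note that plain `sorted` orders the names lexicographically, so `model_10` comes before `model_1`'s successors such as `model_2`. If the models need to be listed in numeric order, a natural sort key is one option. A minimal sketch on a few file names of the same pattern (`natural_key` is a hypothetical helper):

```python
import re


def natural_key(file_name):
    """Split a name into text and integer parts so that numbers compare numerically."""
    return [
        int(part) if part.isdigit() else part
        for part in re.split(r'(\d+)', file_name)
    ]


names = [
    'model_10__phi.bin',
    'model_2__phi.bin',
    'model_0__phi.bin',
    'topics.bin',
    'model_1__phi.bin',
]

ordered = sorted(names, key=natural_key)
print(ordered)
```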


What info is in one result file

In [None]:
one_dataset_search_result_path = os.path.join(
    SEARCH_RESULTS_FOLDER_PATH,
    'search_result__11221963.json',
)

# Read the JSON file, closing it right after loading
with open(one_dataset_search_result_path, 'r') as f:
    one_search_result = json.load(f)

print(one_search_result.keys())

dict_keys(['optimum', 'optimum_std', 'bank_scores', 'bank_topic_scores', 'model_scores', 'model_topic_scores', 'num_bank_topics', 'num_model_topics'])
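
For combining, the result files of all seeds have to be read in the same way. A possible sketch, assuming the `search_result__{seed}.json` naming used in this notebook; `load_search_results` is a hypothetical helper, and the demo writes toy files to a temporary folder instead of touching real results.

```python
import json
import os
import tempfile


def load_search_results(folder_path, seeds):
    """Load `search_result__{seed}.json` for every seed that has such a file."""
    results = {}

    for seed in seeds:
        file_path = os.path.join(folder_path, f'search_result__{seed}.json')

        if not os.path.isfile(file_path):
            continue  # no result for this seed

        with open(file_path, 'r') as f:
            results[seed] = json.load(f)

    return results


# Demo on a temporary folder with toy result files for two of the three seeds
with tempfile.TemporaryDirectory() as folder:
    for seed in (0, 42):
        with open(os.path.join(folder, f'search_result__{seed}.json'), 'w') as f:
            json.dump({'num_bank_topics': [10, 11]}, f)

    loaded = load_search_results(folder, seeds=[11221963, 42, 0])

print(sorted(loaded))
```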


Result fields we are interested in (for combining)

In [None]:
class SearchResultKey(Enum):
    BANK_SCORES = 'bank_scores'
    BANK_TOPIC_SCORES = 'bank_topic_scores'
    MODEL_SCORES = 'model_scores'
    MODEL_TOPIC_SCORES = 'model_topic_scores'
    NUM_BANK_TOPICS = 'num_bank_topics'
    NUM_MODEL_TOPICS = 'num_model_topics'

How many items there are in each result field

In [None]:
for search_result_key in SearchResultKey:
    print(search_result_key.value)
    print(' ' * 4 + str(len(one_search_result[search_result_key.value])) + ' items')

bank_scores
    20 items
bank_topic_scores
    15 items
model_scores
    20 items
model_topic_scores
    20 items
num_bank_topics
    20 items
num_model_topics
    20 items


Example of items in result fields

In [None]:
for i, search_result_key in enumerate([
        SearchResultKey.BANK_SCORES,
        SearchResultKey.BANK_TOPIC_SCORES,
        SearchResultKey.MODEL_SCORES]):

    if i > 0:
        print()

    print(search_result_key.value)
    print(' ' * 4 + str(len(one_search_result[search_result_key.value])) + ' items')
    print(' ' * 4 + 'sample item: ', end='')
    
    one_item = next(iter(one_search_result[search_result_key.value]))
    
    print(
        '{'
        + ', '.join(
            [f'{k}: {v}' for k, v in list(one_item.items())[:3]]
        )
        + ', ...'
        + '}'
    )

bank_scores
    20 items
    sample item: {perplexity_score: 1150.075439453125, sparsity_phi_score: 0.5031424164772034, sparsity_theta_score: 0.005107370670884848, ...}

bank_topic_scores
    15 items
    sample item: {kernel_size: 229, intratext_coherence_score__tt_vw__cm_seg_weight__wtrt_pwt__sem_none: 0.1296872250691411, top_tokens_coherence_score__tt_vw__wtrt_pwt__sem_none: 1.133653577000209, ...}

model_scores
    20 items
    sample item: {perplexity_score: 739.5682373046875, sparsity_phi_score: 0.48647573590278625, sparsity_theta_score: 0.009358677081763744, ...}
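
Fields such as `bank_scores` and `model_scores` are lists of flat dictionaries, so they convert directly into a table. The sketch below uses made-up values and assumes that the position of an item in the list corresponds to the model number (the notebook does not state this explicitly).

```python
import pandas as pd

# Toy items shaped like the entries of `bank_scores` (values are made up)
toy_bank_scores = [
    {'perplexity_score': 1000.0, 'sparsity_phi_score': 0.50},
    {'perplexity_score': 900.0, 'sparsity_phi_score': 0.55},
    {'perplexity_score': 850.0, 'sparsity_phi_score': 0.60},
]

# One row per list item, one column per score
scores_frame = pd.DataFrame(toy_bank_scores)
scores_frame.index.name = 'model_number'  # assumption: list order = model order

print(scores_frame)
```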


In [None]:
search_result_key = SearchResultKey.MODEL_TOPIC_SCORES

print(search_result_key.value)
print(' ' * 4 + str(len(one_search_result[search_result_key.value])) + ' items')

one_item = next(iter(one_search_result[search_result_key.value]))

print(' ' * 4 + f'Item length: {len(one_item)}')
print(' ' * 4 + 'Sample item: ', end='')

print(
    '['
    + '{'
    + ', '.join(
        [f'{k}: {v}' for k, v in list(next(iter(one_item)).items())[:3]]
    )
    + ', ...'
    + '}'
    + ', ...'
    + ']'
)

model_topic_scores
    20 items
    Item length: 100
    Sample item: [{kernel_size: 397, intratext_coherence_score__tt_vw__cm_seg_weight__wtrt_pwt__sem_none: 0.017755509703420103, top_tokens_coherence_score__tt_vw__wtrt_pwt__sem_none: 0.3707505407301823, ...}, ...]


In [None]:
for i, search_result_key in enumerate([
        SearchResultKey.NUM_BANK_TOPICS,
        SearchResultKey.NUM_MODEL_TOPICS]):

    if i > 0:
        print()

    print(search_result_key.value)
    
    result_value = one_search_result[search_result_key.value]
    
    print(' ' * 4 + f'Length: {len(result_value)}')
    print(
        ' ' * 4
        + '['
        + ', '.join(str(n) for n in result_value[:5])
        + ', ..., '
        + ', '.join(str(n) for n in result_value[-5:])
        + ']'
    )

num_bank_topics
    Length: 20
    [10, 10, 11, 11, 11, ..., 15, 15, 15, 15, 15]

num_model_topics
    Length: 20
    [100, 100, 100, 100, 100, ..., 100, 100, 100, 100, 100]


Loading the banks corresponding to the seeds (a seed is skipped if no bank was saved for it)

In [None]:
POSSIBLE_BANK_SEEDS = [11221963, 42, 0]

In [None]:
SEED_TO_BANK = dict()

for seed in POSSIBLE_BANK_SEEDS:
    one_dataset_bank_folder_path = os.path.join(
        SEARCH_RESULTS_FOLDER_PATH,
        f'bank__{seed}',
    )
    
    if not os.path.isdir(one_dataset_bank_folder_path):
        print(f'No bank for such seed: {seed}.'
              f' Folder "{one_dataset_bank_folder_path}" doesn\'t exist')
        
        continue
    
    SEED_TO_BANK[seed] = TopicBank(
        save=False,
        save_folder_path=one_dataset_bank_folder_path,
    )


BANK_SEEDS = list(SEED_TO_BANK.keys())

Seeds for which there is a bank saved on disk

In [None]:
BANK_SEEDS

[11221963, 42, 0]

Number of topics in bank

In [None]:
len(
    next(iter(SEED_TO_BANK.values())).topics
)

15

This is also a number of topics, but the underlying list keeps a `None` item for each topic that was excluded from the bank

In [None]:
len(
    next(iter(SEED_TO_BANK.values()))._topics
)

17
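
The difference between the two lengths (15 vs. 17) can be illustrated with a toy class. This is only a sketch under an assumption: `ToyBank` is hypothetical and is not the `TopicBank` implementation. It assumes that a deleted topic is replaced by `None` (for example, so that the indices of the remaining topics stay stable) and that the public `topics` view simply filters such placeholders out.

```python
from typing import Dict, List, Optional

Topic = Dict[str, float]


class ToyBank:
    """Hypothetical stand-in for a topic bank (not the real TopicBank)."""

    def __init__(self) -> None:
        # Raw storage: deleted topics are kept as None placeholders
        self._topics: List[Optional[Topic]] = []

    def add_topic(self, topic: Topic) -> int:
        self._topics.append(topic)
        return len(self._topics) - 1  # index of the added topic

    def delete_topic(self, index: int) -> None:
        # Assumption: deletion leaves a placeholder instead of shifting indices
        self._topics[index] = None

    @property
    def topics(self) -> List[Topic]:
        # Public view: only the topics that are still in the bank
        return [t for t in self._topics if t is not None]


toy_bank = ToyBank()

for word in ['a', 'b', 'c']:
    toy_bank.add_topic({word: 1.0})

toy_bank.delete_topic(1)

print(len(toy_bank.topics), len(toy_bank._topics))
```

With three topics added and one deleted, the filtered view is shorter than the raw list by exactly the number of deleted topics.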

Topic matrix

In [None]:
next(iter(SEED_TO_BANK.values())).view_topics().shape

(2514, 15)

In [None]:
SEED_TO_BANK[BANK_SEEDS[0]].view_topics()['topic_0'].sort_values(ascending=False)[:10]

@default_class  проблема        0.377909
                вопрос          0.051589
                решать          0.039914
                трудность       0.028569
                возникать       0.026642
                сталкиваться    0.022061
                другой          0.015737
                сложный         0.013366
                создавать       0.011581
                существовать    0.010672
Name: topic_0, dtype: float64

In [None]:
SEED_TO_BANK[BANK_SEEDS[1]].view_topics()['topic_0'].sort_values(ascending=False)[:10]

@default_class  слово          0.193740
                словарь        0.035589
                буква          0.027122
                значение       0.022216
                речь           0.020765
                глагол         0.016013
                русский        0.012727
                конструкция    0.010727
                часто          0.009467
                пример         0.008037
Name: topic_0, dtype: float64

Let's compare the banks obtained with different seeds to see whether they differ much (presumably they do not)

In [None]:
def compare_topics(
        bank_seed1: int,
        bank_seed2: int,
        max_num_twins_to_display: int = 20) -> None:

    topics_with_twin = set()
    
    print(
        f'Close topics:'
        f' <topic from "{bank_seed1}" bank>'
        f' <topic from "{bank_seed2}" bank>'
        f' <distance>'
    )
    
    num_displayed = 0
    
    for i1, t1 in enumerate(SEED_TO_BANK[bank_seed1].topics):
        for i2, t2 in enumerate(SEED_TO_BANK[bank_seed2].topics):
            d = TopicBankMethod._jaccard_distance(t1, t2)

            if d < 0.5:
                topics_with_twin.add(i1)
                num_displayed = num_displayed + 1
                
                print(f'{i1}\t{i2}\t{d}')
            
            if num_displayed >= max_num_twins_to_display:
                break

        if num_displayed >= max_num_twins_to_display:
            break

    print()

    max_seed_length = max([len(str(s)) for s in [bank_seed1, bank_seed2]])

    for seed in [bank_seed1, bank_seed2]:
        print(
            f'Num topics in bank "{seed:{max_seed_length}}":'
            f' {len(SEED_TO_BANK[seed].topics)}'
        )

    print(
        f'\nNum topics in bank "{bank_seed1}" with twins in bank "{bank_seed2}":'
        f' {len(topics_with_twin)}'
    )

In [None]:
if len(BANK_SEEDS) >= 2:
    compare_topics(BANK_SEEDS[0], BANK_SEEDS[1])

Close topics: <topic from "11221963" bank> <topic from "11221963" bank> <distance>
1	1	7.030248077910528e-06
2	2	5.364596949331002e-06
3	4	6.971623467633137e-06
4	5	9.489878975976751e-06
5	6	3.474788023982711e-06
7	7	5.1163178486079985e-06
8	9	6.845627906426621e-06
11	14	1.107408409306565e-06
12	15	1.2886343224716157e-06
13	10	0.46749039824435024

Num topics in bank "11221963": 15
Num topics in bank "      42": 16

Num topics in bank "11221963" with twins in bank "42": 10


In [None]:
if len(BANK_SEEDS) >= 3:
    compare_topics(BANK_SEEDS[1], BANK_SEEDS[2])

Close topics: <topic from "42" bank> <topic from "42" bank> <distance>
0	0	2.1602862702918557e-05
1	1	1.4857055578576528e-05
2	2	9.737322482217259e-06
4	3	1.0445701357442161e-05
5	4	1.668583528036116e-05
6	5	6.556589902784182e-06
7	7	9.581083945442437e-06
9	8	1.2234920047093922e-05
13	11	1.5169697592520848e-06
14	13	6.072632612319495e-07
15	14	1.1275848412761746e-06

Num topics in bank "42": 16
Num topics in bank " 0": 15

Num topics in bank "42" with twins in bank "0": 11
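
To give an intuition for the distances printed above, here is a minimal sketch of a weighted (generalized) Jaccard distance between two topics represented as word-to-probability dicts. It is an assumption that `TopicBankMethod._jaccard_distance` behaves in a similar way; the function `weighted_jaccard_distance` and the toy topics below are hypothetical and serve only as an illustration.

```python
from typing import Dict


def weighted_jaccard_distance(p: Dict[str, float], q: Dict[str, float]) -> float:
    """1 - sum(min(p_w, q_w)) / sum(max(p_w, q_w)) over the union of words."""
    words = set(p) | set(q)

    numerator = sum(min(p.get(w, 0.0), q.get(w, 0.0)) for w in words)
    denominator = sum(max(p.get(w, 0.0), q.get(w, 0.0)) for w in words)

    if denominator == 0:
        return 0.0  # two empty topics are treated as identical

    return 1.0 - numerator / denominator


# Toy topics (hypothetical data)
topic_a = {'problem': 0.5, 'question': 0.3, 'solve': 0.2}
topic_b = {'problem': 0.5, 'question': 0.2, 'solve': 0.3}
topic_c = {'word': 0.6, 'dictionary': 0.4}

print(weighted_jaccard_distance(topic_a, topic_a))  # identical topics
print(weighted_jaccard_distance(topic_a, topic_b))  # similar topics
print(weighted_jaccard_distance(topic_a, topic_c))  # no common words
```

Under this definition, identical topics are at distance 0, topics without common words are at distance 1, and near-duplicates obtained from different seeds get values close to 0, which is the pattern seen in the tables above.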


And here we combine banks to get just one bank!
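
The merge is greedy: all topics of the first bank are kept, and a topic from any later bank is added only if it is not closer than a threshold to some topic that is already kept. A minimal self-contained sketch of this idea on toy data follows. The names `merge_banks` and `toy_distance` are hypothetical, and the distance is a simple stand-in (a weighted Jaccard distance), not necessarily the one used by `TopicBankMethod`.

```python
from typing import Callable, Dict, List

Topic = Dict[str, float]


def toy_distance(p: Topic, q: Topic) -> float:
    # Stand-in distance (assumption): weighted Jaccard over word probabilities
    words = set(p) | set(q)
    numerator = sum(min(p.get(w, 0.0), q.get(w, 0.0)) for w in words)
    denominator = sum(max(p.get(w, 0.0), q.get(w, 0.0)) for w in words)

    return 1.0 - numerator / denominator if denominator > 0 else 0.0


def merge_banks(
        banks: List[List[Topic]],
        distance: Callable[[Topic, Topic], float],
        threshold: float = 0.5) -> List[Topic]:

    # The first bank is taken as is
    merged = list(banks[0]) if banks else []

    for bank in banks[1:]:
        for topic in bank:
            # Skip a topic that already has a close "twin" among the kept ones
            if any(distance(topic, kept) < threshold for kept in merged):
                continue

            merged.append(topic)

    return merged


# Toy banks (hypothetical data)
toy_bank_1 = [
    {'problem': 0.5, 'question': 0.3, 'solve': 0.2},
    {'word': 0.6, 'dictionary': 0.4},
]
toy_bank_2 = [
    {'problem': 0.5, 'question': 0.2, 'solve': 0.3},  # twin of the first topic
    {'cell': 0.7, 'protein': 0.3},                    # new topic
]

toy_merged = merge_banks([toy_bank_1, toy_bank_2], toy_distance)

print(len(toy_merged))
```

Note that the result of such a greedy procedure depends on the order of the banks: the first bank always contributes all of its topics.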

In [None]:
COMBINED_BANK_TOPIC_INDICES = dict()

assert len(BANK_SEEDS) >= 1

COMBINED_BANK_TOPIC_INDICES[BANK_SEEDS[0]] = list(
    range(len(SEED_TO_BANK[BANK_SEEDS[0]].topics))
)


def get_final_bank_topics() -> Iterable[dict]:  # a token (dict key) may be either a single word
                                                # or a tuple of words
    for seed, indices in COMBINED_BANK_TOPIC_INDICES.items():
        for i in indices:
            yield SEED_TO_BANK[seed].topics[i]


def get_final_bank_topic_scores() -> Iterable[Dict[str, float]]:
    for seed, indices in COMBINED_BANK_TOPIC_INDICES.items():
        for i in indices:
            yield SEED_TO_BANK[seed].topic_scores[i]


DISTANCE_THRESHOLD = 0.5  # if topics in banks are really similar, this will be enough

if len(BANK_SEEDS) == 1:
    pass
else:
    for seed in tqdm(BANK_SEEDS[1:], total=len(BANK_SEEDS) - 1, file=sys.stdout):
        COMBINED_BANK_TOPIC_INDICES[seed] = list()

        for i, t in enumerate(SEED_TO_BANK[seed].topics):
            if any([TopicBankMethod._jaccard_distance(t, bank_topic) < DISTANCE_THRESHOLD
                    for bank_topic in get_final_bank_topics()]):

                continue

            COMBINED_BANK_TOPIC_INDICES[seed].append(i)

In [None]:
print('{0:>10}\t{1:>10}'.format('seed', 'num_topics'))
print('-' * 30)

for seed, topic_indices in COMBINED_BANK_TOPIC_INDICES.items():
    print(f'{seed:10}\t{len(topic_indices):10}')

print('-' * 30)

print(
    ' ' * 10
    + '\t'
    + f'{sum(len(inds) for inds in COMBINED_BANK_TOPIC_INDICES.values()):10}'
)

      seed	num_topics
------------------------------
  11221963	        15
        42	         6
         0	         3
------------------------------
          	        24


P.S. Output of the previous cell for all datasets:

```
PostNauka:

      seed	num_topics
------------------------------
  11221963	        15
        42	         6
         0	         3
------------------------------
          	        24

Reuters:

      seed	num_topics
------------------------------
  11221963	        22
        42	        10
         0	         1
------------------------------
          	        33

Brown:

      seed	num_topics
------------------------------
  11221963	        23
        42	        16
         0	         8
------------------------------
          	        47

20 NG:

      seed	num_topics
------------------------------
  11221963	        27
        42	        13
         0	        15
------------------------------
          	        55

AG News:

      seed	num_topics
------------------------------
  11221963	        37
        42	         7
         0	         6
------------------------------
          	        50

Watan:

      seed	num_topics
------------------------------
  11221963	        10
        42	         5
         0	         1
------------------------------
          	        16

Habrahabr:

      seed	num_topics
------------------------------
  11221963	        11
        42	        11
------------------------------
          	        22
```

Saving the bank

In [None]:
COMBINED_SEARCH_RESULT_FOLDER_PATH = os.path.join(
    DATASET_INTERNALS_FOLDER_PATH,
    'result_combined',
)

COMBINED_BANK_FOLDER_PATH = os.path.join(
    COMBINED_SEARCH_RESULT_FOLDER_PATH,
    'bank',
)

In [None]:
COMBINED_SEARCH_RESULT_FOLDER_PATH

'./Post_Science__internals__test/result_combined'

In [None]:
COMBINED_BANK_FOLDER_PATH

'./Post_Science__internals__test/result_combined/bank'

In [None]:
os.makedirs(COMBINED_SEARCH_RESULT_FOLDER_PATH, exist_ok=True)
os.makedirs(COMBINED_BANK_FOLDER_PATH, exist_ok=True)

In [None]:
COMBINED_BANK = TopicBank(
    save=False,
    save_folder_path=COMBINED_BANK_FOLDER_PATH,
)

In [None]:
COMBINED_BANK._topics = list(get_final_bank_topics())
COMBINED_BANK._topic_scores = list(get_final_bank_topic_scores())

Checking that everything is OK

In [None]:
assert len(COMBINED_BANK._topics) == len(COMBINED_BANK._topic_scores)
assert len(COMBINED_BANK._topics) == sum(
    len(v) for v in COMBINED_BANK_TOPIC_INDICES.values()
)

In [None]:
os.listdir(COMBINED_BANK_FOLDER_PATH)

[]

In [None]:
COMBINED_BANK.save()

In [None]:
os.listdir(COMBINED_BANK_FOLDER_PATH)

['topic_scores.bin', 'topics.bin']

In [None]:
del COMBINED_BANK
del COMBINED_BANK_TOPIC_INDICES

The topic banks are now ready (for each dataset, all the information is stored in `COMBINED_BANK_FOLDER_PATH`).
The notebook [TopicBank-Experiment: Model Validation](TopicBank-Experiment-ModelValidation.ipynb) comes next: there we estimate the quality of topic models with the help of the topic bank.