### Useful documents about BERTopic
https://maartengr.github.io/BERTopic/algorithm/algorithm.html#6-optional-fine-tune-topic-representation

https://towardsdatascience.com/topics-per-class-using-bertopic-252314f2640

https://people.inf.elte.hu/csa/html/szinek.htm

In [1]:
from bertopic import BERTopic
from hdbscan import HDBSCAN
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance
from sentence_transformers import SentenceTransformer

import pickle
import huspacy


## 1. Load the preprocessed corpora. Later we can choose whether to work on the lemmatized corpus or on the lemmatized corpus with stop words filtered out.

In [38]:
def load_pickle(path):
    # Open in a context manager so the file handle is always closed
    with open(path, "rb") as f:
        return pickle.load(f)

meta = load_pickle("../resources/meta.pkl")
lemmatized = load_pickle("../resources/lemmatized.pkl")
#pos = load_pickle("../resources/pos.pkl")
#tokens = load_pickle("../resources/tokenized.pkl")
#doc_stop = load_pickle("../resources/no_stopword.pkl")  # filtered with a shorter stop-word list
doc_stop_2 = load_pickle("../resources/stopword_filtered.pkl")  # filtered with a longer stop-word list (extends no_stopword.pkl)
all_docs = load_pickle("../resources/docs.pkl")

## 2. Build a list from the words of the predefined topics

In [39]:
seed_topic_list = [["szabadidő", "szabadidőtök", "szabadidőd", "szabadidődet"],
                   ["nyelv", "nyelvtanulás", "nyelvvizsga"],
                   ["sport", "sportol", "sportolás"],
                   ["ismerkedés", "megismerkedik", "megismer"],
                   ["olvas", "olvasás", "könyv"],
                   ["külföld", "külföldi", "utazik"],
                   ["magyarország"],
                   ["social", "media", "facebook", "facebookon", "instagram", "instagramm", "instagrammom", "instagrammon"],
                   ["igen", "ja", "persze", "aha", "hum"]]
#, ["laugh", "nevet", "vicces"], ["placeholder"]
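
Seed words only guide the model if they actually occur in the documents, so it may be worth counting them before fitting. The sketch below is a minimal illustration on made-up toy data; the helper name `seed_coverage` is hypothetical, and it assumes each document is a string of whitespace-separated lemmas, which may not match the pickled corpora.

```python
from collections import Counter

def seed_coverage(docs, seed_topics):
    """Count, for every seed word, how many documents contain it as a token."""
    counts = Counter()
    for doc in docs:
        tokens = set(doc.lower().split())  # assumption: whitespace-separated lemmas
        for topic in seed_topics:
            for word in topic:
                if word in tokens:
                    counts[word] += 1
    return counts

# Toy data for illustration only (not taken from the corpus)
toy_docs = ["szeret sport és olvas könyv", "nyelv tanul külföld", "sport sportol"]
toy_seeds = [["sport", "sportol"], ["olvas", "könyv"], ["nyelvvizsga"]]

coverage = seed_coverage(toy_docs, toy_seeds)
# Seed words that never occur cannot steer the model towards their topic
missing = [w for topic in toy_seeds for w in topic if coverage[w] == 0]
print(coverage)
print("Seed words never seen:", missing)
```

Calling it as `seed_coverage(doc_stop_2, seed_topic_list)` would be the natural use on the real data, provided the documents have the assumed format.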

In [40]:
## 3. Topic modelling. We pass the predefined topics as seeds so that the model can identify them more easily. We retrieve the 40 most frequent words of each topic. We decide whether to work on the lemmatized or the stop-word-filtered data. We try out 3 kinds of models.

## 3.1 Model: BERTopic

In [41]:
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance

key = KeyBERTInspired()
mm =  MaximalMarginalRelevance(diversity=0.3)
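# The two representation models are applied in sequence: KeyBERT-inspired keywords first, then MMR
btm = BERTopic(language="hungarian", representation_model=[key, mm], seed_topic_list=seed_topic_list,
               min_topic_size=50, calculate_probabilities=True)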
topics, probs = btm.fit_transform(doc_stop_2)

OutOfMemoryError: CUDA out of memory. Tried to allocate 368.00 MiB. GPU 0 has a total capacity of 1.95 GiB of which 384.00 KiB is free. Process 4938 has 1011.00 MiB memory in use. Including non-PyTorch memory, this process has 981.00 MiB memory in use. Of the allocated memory 918.39 MiB is allocated by PyTorch, and 25.61 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
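
The error message itself suggests setting `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`. A minimal sketch of doing that from inside the notebook follows; it assumes the cell runs in a fresh kernel before anything has initialised CUDA. Whether it is enough on a GPU with about 2 GiB of memory is uncertain, since it only targets fragmentation, not total capacity.

```python
import os

# Must run before torch initialises CUDA, so ideally before `import torch`
# or any library that imports it. setdefault keeps a value already set in the shell.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```

If memory is still exhausted, another option may be to build the embedding model on the CPU, for example by passing `device="cpu"` to `SentenceTransformer` and handing that model to BERTopic through `embedding_model`, at the cost of slower embedding.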

In [None]:
btm_jo_seed_ti= btm.get_topic_info()
btm_jo_seed_ti

## Reduce outliers. We try two methods: one based on probabilities and one based on distributions.

## Based on probabilities

In [None]:
new_topics = btm.reduce_outliers(doc_stop_2, topics, probabilities=probs, strategy="probabilities")
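
To make the idea behind the probability-based strategy concrete, here is a simplified pure-Python stand-in, not the library's code. It assumes that each outlier document (topic `-1`) is given its most probable topic when that probability reaches a threshold; the function name `reassign_outliers` and the toy numbers are made up for illustration.

```python
def reassign_outliers(topics, probs, threshold=0.0):
    """Give each outlier (-1) its most probable topic if that probability reaches the threshold."""
    new_topics = []
    for topic, row in zip(topics, probs):
        best = max(range(len(row)), key=row.__getitem__)  # index of the highest probability
        if topic == -1 and row[best] >= threshold:
            new_topics.append(best)
        else:
            new_topics.append(topic)  # non-outliers are left untouched
    return new_topics

# Toy assignments and document-topic probabilities (illustrative values only)
toy_topics = [0, -1, 1, -1]
toy_probs = [[0.9, 0.1], [0.2, 0.7], [0.1, 0.8], [0.05, 0.04]]
print(reassign_outliers(toy_topics, toy_probs, threshold=0.1))
```

With a threshold of 0.1 the second document moves to topic 1, while the last one stays an outlier because no topic is probable enough.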

## We update the model with the new topics and topic distributions obtained after reducing the outliers

In [None]:
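import pandas as pd
# Pass the labels by keyword: the position of `topics` in update_topics may differ between BERTopic versions
btm.update_topics(doc_stop_2, topics=new_topics)
documents = pd.DataFrame({"Document": doc_stop_2, "Topic": new_topics})
# _update_topic_size is a private BERTopic method; it refreshes the counts shown by get_topic_info()
btm._update_topic_size(documents)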

In [None]:
btm_probs_ti = btm.get_topic_info()
btm_probs_ti

In [None]:
import pandas as pd
probs_df_2 = pd.DataFrame(probs)
# Highest topic probability per document
probs_df_2["main percentage"] = probs_df_2.max(axis=1)
probs_df_2

## Based on distributions (default).

In [None]:
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance

key = KeyBERTInspired()
mm =  MaximalMarginalRelevance(diversity=0.3)
sima_btm = BERTopic(language="hungarian", representation_model=[key, mm], seed_topic_list=seed_topic_list,
                    min_topic_size=50, calculate_probabilities=True)
topics, probs = sima_btm.fit_transform(doc_stop_2)

In [None]:
# Use doc_stop_2: the model was fitted on it, and doc_stop is commented out in the loading cell
new_topics_2 = sima_btm.reduce_outliers(doc_stop_2, topics)

In [None]:
import pandas as pd
# Same corpus as in fit_transform; labels passed by keyword
sima_btm.update_topics(doc_stop_2, topics=new_topics_2)
documents_2 = pd.DataFrame({"Document": doc_stop_2, "Topic": new_topics_2})
sima_btm._update_topic_size(documents_2)

In [None]:
btm_sima_ti_new = sima_btm.get_topic_info()
btm_sima_ti_new

## 3.2 Model: Sentence Transformer

In [None]:
sentence_model = SentenceTransformer("NYTK/sentence-transformers-experimental-hubert-hungarian")
sentence_transformer_lemmatized = BERTopic(embedding_model=sentence_model, min_topic_size = 30, seed_topic_list=seed_topic_list)
topics, probs = sentence_transformer_lemmatized.fit_transform(lemmatized)

In [None]:
sentence_topic_info_lemmatized = sentence_transformer_lemmatized.get_topic_info()
sentence_topic_info_lemmatized

## 3.3 Model: HuSpaCy

In [None]:
nlp = huspacy.load()
spacy_lemmatized = BERTopic(embedding_model=nlp, min_topic_size = 30, seed_topic_list=seed_topic_list)
topics, probs = spacy_lemmatized.fit_transform(lemmatized)

In [None]:
spacy_topic_info_lemmatized = spacy_lemmatized.get_topic_info()
spacy_topic_info_lemmatized

## 3.4 HDBSCAN

In [None]:
hdbscan_model = HDBSCAN(min_cluster_size=30, metric='euclidean', cluster_selection_method='eom', prediction_data=True, min_samples=10)
hdbscan_lemmatized = BERTopic(hdbscan_model=hdbscan_model, seed_topic_list=seed_topic_list)
topics, probs = hdbscan_lemmatized.fit_transform(lemmatized)

In [None]:
hdbscan_lemmatized_topic_info = hdbscan_lemmatized.get_topic_info()
hdbscan_lemmatized_topic_info

## 4. We save the models and the topics

In [None]:
def save_model(model, model_path):
    # The first argument is the fitted BERTopic model itself, not its name
    model.save(model_path, serialization="pickle")

In [None]:
save_model(btm,"../models/bert_model_probabilities.pkl")

In [None]:
def save_topic_info(model_topic_info, topic_path):
    model_topic_info.to_csv(topic_path, sep=",", index=False, encoding="UTF-8")

In [None]:
save_topic_info(btm_probs_ti, "../results/bert_model_probabilities.csv")

## 5. We merge similar topics

## First we load the model we want to use. Here we reload the BERTopic model in which the outliers were reduced based on probabilities.

In [42]:
loaded_model_path = "../models/bert_model_probabilities.pkl"
model = BERTopic.load(loaded_model_path)
print("The model has been loaded")

OutOfMemoryError: CUDA out of memory. Tried to allocate 368.00 MiB. GPU 0 has a total capacity of 1.95 GiB of which 384.00 KiB is free. Process 4938 has 1011.00 MiB memory in use. Including non-PyTorch memory, this process has 981.00 MiB memory in use. Of the allocated memory 918.40 MiB is allocated by PyTorch, and 25.60 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

In [None]:
model.get_topic_info()

## 5.1 We retrieve the similarity heatmap to see visually which topics are similar

In [None]:
model.visualize_heatmap()

## 5.2 We retrieve the topic hierarchy to see how the topics are nested above and below one another

In [None]:
hierarchical_topics = model.hierarchical_topics(doc_stop_2)

# Visualize these representations
model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)

## 5.3 We quantify which topics are similar

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import pandas as pd

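# Note: despite the "distance" naming, these values are cosine similarities (higher = more similar)
distance_matrix = cosine_similarity(np.array(model.topic_embeddings_))
dist_df = pd.DataFrame(distance_matrix, columns=model.topic_labels_.values(),
                       index=model.topic_labels_.values())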

tmp = []
for rec in dist_df.reset_index().to_dict('records'):
    t1 = rec['index']
    for t2 in rec:
        if t2 == 'index':
            continue
        tmp.append(
            {
                'topic1': t1,
                'topic2': t2,
                'distance': rec[t2]
            }
        )

pair_dist_df = pd.DataFrame(tmp)

# Drop every pair that involves the outlier topic (-1)
is_outlier_pair = (pair_dist_df.topic1.str.startswith('-1')
                   | pair_dist_df.topic2.str.startswith('-1'))
pair_dist_df = pair_dist_df[~is_outlier_pair]
# Keep each unordered pair once; this also drops self-pairs
pair_dist_df = pair_dist_df[pair_dist_df.topic1 < pair_dist_df.topic2]
pair_dist_df.sort_values('distance', ascending=False).head(20)
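
The same computation can be followed by hand on a toy example. The sketch below uses made-up two-dimensional "topic embeddings" and plain Python instead of scikit-learn, purely to show what the pairwise table contains; the labels and vectors are illustrative assumptions, not values from the model.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy topic embeddings, made up for illustration
toy_embeddings = {
    "0_sport": [1.0, 0.0],
    "1_sportolás": [0.9, 0.1],
    "2_könyv": [0.0, 1.0],
}

# combinations() yields each unordered pair exactly once, so no de-duplication is needed
pairs = sorted(
    ((a, b, cosine(toy_embeddings[a], toy_embeddings[b])) for a, b in combinations(toy_embeddings, 2)),
    key=lambda p: p[2],
    reverse=True,
)
for a, b, sim in pairs:
    print(f"{a} ~ {b}: {sim:.3f}")
```

The two sport-related vectors come out as the most similar pair, which is the kind of candidate the merging step later looks for.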

## 5.4 We check which topic pairs have a similarity above 0.85 (85%). We save the data.

In [None]:
similar_topics = pair_dist_df[pair_dist_df["distance"] > 0.85]
similar_topics

In [None]:
similar_topics.to_csv("../results/bert_probabilities_similarity_85.csv")

## 5.5 We study the saved CSV and decide which topics are worth merging based on the quantified similarity. We specify the topics to merge, then merge them.

In [None]:
topics_to_merge = [[7,2,8,9,14,16,23,27,33,32,34,41,43,44,45], [1,6,36], [14,16,0,28,32,46,47,39,5,9 ], [20,22],[19,42]]
model.merge_topics(doc_stop_2, topics_to_merge)
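
Because the merge groups are typed by hand, a topic ID can easily end up in more than one group. How `merge_topics` treats such an ID is not established in this notebook, so a quick check before merging may be useful. The helper `find_overlaps` below is a hypothetical illustration, applied to the groups listed above.

```python
from collections import defaultdict

def find_overlaps(groups):
    """Return {topic_id: [indices of groups containing it]} for IDs listed in more than one group."""
    seen = defaultdict(list)
    for i, group in enumerate(groups):
        for topic_id in set(group):  # set() ignores repeats inside a single group
            seen[topic_id].append(i)
    return {t: idx for t, idx in sorted(seen.items()) if len(idx) > 1}

groups = [[7, 2, 8, 9, 14, 16, 23, 27, 33, 32, 34, 41, 43, 44, 45], [1, 6, 36],
          [14, 16, 0, 28, 32, 46, 47, 39, 5, 9], [20, 22], [19, 42]]
print(find_overlaps(groups))
```

For these groups the check reports that topics 9, 14, 16 and 32 appear in both the first and the third group, so it may be worth deciding explicitly whether those two groups should be one.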

In [None]:
model.get_topic_info()

In [None]:
model.update_topics(doc_stop_2)

In [None]:
topics = model.topics_
probabilities = model.probabilities_

In [None]:
probabilities

In [None]:
probs_df=pd.DataFrame(probabilities)
probs_df

In [None]:
probs_df.to_csv("../results/prob_matrix.csv", sep=",", encoding="UTF-8", index=False)

## 5.6 We name the topics

In [None]:
topic_labels = model.generate_topic_labels(nr_words=5,
                                           topic_prefix=False,
                                           word_length=10,
                                           separator=", ")
model.set_topic_labels(topic_labels)
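
As a rough illustration of what these parameters do, the sketch below builds one label from a list of topic words. It is an approximation under the assumption that each word is cut to `word_length` characters, the first `nr_words` words are kept, and they are joined with `separator`; the helper `make_label` and the word list are made up.

```python
def make_label(topic_id, words, nr_words=5, topic_prefix=False, word_length=10, separator=", "):
    # Truncate each word, keep the first nr_words, then join them
    parts = [w[:word_length] for w in words][:nr_words]
    label = separator.join(parts)
    # Optionally put the topic ID in front of the label
    return f"{topic_id}{separator}{label}" if topic_prefix else label

# Illustrative topic words, not taken from the fitted model
toy_words = ["nyelvtanulás", "nyelvvizsga", "nyelv", "tanul", "angol", "iskola"]
print(make_label(3, toy_words))
```

Long words are shortened and the sixth word is dropped, which keeps the generated labels at a comparable length.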

In [None]:
model.set_topic_labels({0: "diskurzuselem", 1: "helyeslés, hümmögés", 2: "nevetés"})

In [None]:
topicinfo=model.get_topic_info()
topicinfo

## 5.7 We save the updated topic model with the merged and renamed topics, together with the topics

In [None]:
save_topic_info(topicinfo, "../results/bert_probabilities_merged_85_new_names.csv")

In [None]:
save_model(model,"../models/bert_probabilities_merged.pkl")