<a href="https://colab.research.google.com/github/punkmic/Topic-Modeling-Reclame-Aqui/blob/master/Topic_Modeling_with_BERTopic_Reclame_aqui_(17).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Topic Modeling with BERTopic - Reclame Aqui**

BERTopic is a topic modeling technique that leverages transformers and a custom class-based TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions 

Reference: (https://maartengr.github.io/BERTopic/index.html).

### **Enabling the GPU**

We will use the GPU provided by COLAB to accelarate our model training. To enable GPUs for the notebook:
1- Navigate to Edit -> Notebook Settings
2- Select GPU from the Hardware Accelerator drop-down

In [1]:
# verify if GPU is enable
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Thu Feb  2 13:23:12 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   47C    P0    27W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

#### **Github**

In [2]:
!ssh-keygen -t rsa -b 4096
# Add github.com to our known hosts
!ssh-keyscan -t rsa github.com >> ~/.ssh/known_hosts
# Restrict the key permissions, or else SSH will complain.
!chmod go-rwx /root/.ssh/id_rsa

Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa): 
Created directory '/root/.ssh'.
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /root/.ssh/id_rsa
Your public key has been saved in /root/.ssh/id_rsa.pub
The key fingerprint is:
SHA256:2T6Ca9XHjbIoADTGDuABWyXz2D16e9wHnkIlHOUdkjg root@4d2278c3e57b
The key's randomart image is:
+---[RSA 4096]----+
|=o+..   .oo..    |
|oo** . .Eo.o .   |
|.*..o o o.o .    |
|  o  . . =       |
|   .. . S.o. o   |
|    .. =.+oo+ .  |
|     .o.=.*+.    |
|      oo.o.o     |
|     ...         |
+----[SHA256]-----+
# github.com:22 SSH-2.0-babeld-9c9abdde


In [3]:
!cat /root/.ssh/id_rsa.pub

ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQDIADSzoiKT4Pxd05EQNpsxM8SwuHmhzfQVwOxtEryLL4JFTTSWstt+hdbrKSyJXBf3CKG5WwclAW4AkBPpvjNgq5EiJBCzvYsKVkCLU50qDR3EA0UbhSUI14Os1OBgcNh5uKH+cLkYQ7nvkOFocbER+uC6x+oSBrA3jnkiO84TbtqlciYaGb/1177xS7u2F7OqQ92TcsrrdPGXIWvpR/4yQTD5GLZXyw543IpKf7NjfCta3DWWcXAMYBFgVFtBeUcKzSNI+Ja2Im1wt2KIYxFgdfXCSkDPX65nmPMSxB0vI0dT/fl/BIOAiMX0Orfu8OOG6nw0Up4PLxHkbLhwW0uXpApGFM1SmPiia/Lg4iFnvEYIPoclOY98asMOt26TcwfsLES/GAAibHUQiRD6o6xToOZ6nPWR04vwadMCsUMd0H4H1HLw/BQdIJNBQ7g4XHKgQvwruq2z+mwBe5K9isYXHQFXIz2ANOkh8CJlJiu+u/1OuLtW14bri7C08Qvl52Dzitdv7mmAtIC1TEaCWRsLtQiYz5E5V7c23V/T1KXEzU0GxtvWesFjRFk64LRWTg0ddiUVXWSwpQe3EL0yde+DQ+hh1lHcJWZJkxF8hgbh1HdcJDrXXiOONpBtqDAQypw2KEGmG9A3U+iv5C0EKAsd54cfKTdBsEoDyeQS8m/wxw== root@4d2278c3e57b


In [4]:
!git config --global user.email ""
!git config --global user.name "punkmic"

In [5]:
!ssh -T git@github.com

Hi punkmic! You've successfully authenticated, but GitHub does not provide shell access.


In [6]:
!git clone git@github.com:punkmic/Topic-Modeling-Reclame-Aqui.git

Cloning into 'Topic-Modeling-Reclame-Aqui'...
remote: Enumerating objects: 18631, done.[K
remote: Counting objects: 100% (4450/4450), done.[K
remote: Compressing objects: 100% (2961/2961), done.[K
remote: Total 18631 (delta 1456), reused 4416 (delta 1440), pack-reused 14181[K
Receiving objects: 100% (18631/18631), 237.44 MiB | 8.11 MiB/s, done.
Resolving deltas: 100% (2604/2604), done.
Updating files: 100% (1084/1084), done.


### **Setup**

In [3]:
%%capture
import pandas as pd # for data manipulation
import os # for interacting with the operating system
import nltk # for natural language processing
import matplotlib.pyplot as plt # for visualization
import ast # for convert str to tuple
import csv
import json
import pickle 
from sklearn.decomposition import PCA # for dimension reduction
from sklearn.feature_extraction.text import CountVectorizer # for convert text documents to matrix of tokens count
from sklearn.cluster import KMeans # for clustering

try:
  from gensim import models
  from gensim.corpora import Dictionary
  from gensim.models.coherencemodel import CoherenceModel
  from bertopic import BERTopic # for topic modeling
  import optuna # for hyperparameter optimization
  from hdbscan import HDBSCAN # for clustering
  from umap import UMAP # for dimension reduction
  from bertopic.vectorizers import ClassTfidfTransformer 
except:
  !pip install gensim
  !pip install bertopic
  !pip install kaleido # for save BERTopic plots as image
  !pip install optuna
  !pip install hdbscan
  !pip install umap-learn
  from gensim import models
  from gensim.corpora import Dictionary
  from gensim.models.coherencemodel import CoherenceModel
  from umap import UMAP # for dimension reduction
  import optuna # for hyperparameter optimization
  from hdbscan import HDBSCAN # for clustering
  from bertopic import BERTopic # for topic modeling
  from bertopic.vectorizers import ClassTfidfTransformer 

# import custom module
%cd /content/Topic-Modeling-Reclame-Aqui/utils
from max_limit import max_limit
%cd ../

In [4]:
WORK_DIR = '/content/Topic-Modeling-Reclame-Aqui/bertopic'

In [119]:
df = pd.read_csv(os.path.join('/content/Topic-Modeling-Reclame-Aqui/datasets', 'processed_v1.csv'))

In [120]:
df.head(3)

Unnamed: 0,title,documents,documents_nouns,bigrams,noun_bigrams,trigrams,noun_trigrams,freq_words_removed,freq_words_removed_nouns
0,pedir cancelado justificativa semana compro,pesquisando bastante novo comprar resolver agu...,semana desconto promoção desconto compra custo...,"[('pesquisando', 'bastante'), ('bastante', 'no...","[('semana', 'desconto'), ('desconto', 'promoçã...","[('pesquisando', 'bastante', 'novo'), ('bastan...","[('semana', 'desconto', 'promoção'), ('descont...",pesquisando bastante novo resolver aguardar se...,semana desconto promoção desconto plataforma c...
1,pedir cancelar,sinceramente decepcionar entrar contato procon...,procon audiência conciliação solicitação produ...,"[('sinceramente', 'decepcionar'), ('decepciona...","[('procon', 'audiência'), ('audiência', 'conci...","[('sinceramente', 'decepcionar', 'entrar'), ('...","[('procon', 'audiência', 'conciliação'), ('aud...",sinceramente decepcionar procon hoje informar ...,procon audiência conciliação solicitação estoq...
2,cobrança indever,cancelei plano antes terminar período testir g...,cancelei período plataforma fatura gratuito ca...,"[('cancelei', 'plano'), ('plano', 'antes'), ('...","[('cancelei', 'período'), ('período', 'platafo...","[('cancelei', 'plano', 'antes'), ('plano', 'an...","[('cancelei', 'período', 'plataforma'), ('perí...",cancelei plano antes terminar período testir g...,cancelei período plataforma fatura gratuito ca...


In [147]:
COLNAME = 'freq_words_removed_nouns'

In [148]:
df.dropna(subset=[COLNAME], inplace=True)
documents = df[COLNAME].values
print(len(documents))

10292


## **Training a BERTopic Model**

The BERTopic algorithm has several advantages over other topic modeling algorithms. It is able to handle sparse data, it is scalable to large datasets, and it is able to learn topics that are not well-defined or are overlapping.

As our data language is portuguese we will going to set language to multilingual.

Create a new BERTopic model and train it. By default BERTopic use the paraphrase-multilingual-MiniLM-L12-v2 model for multi language documents. For others model check here [BERTopic sentence transformers](https://maartengr.github.io/BERTopic/getting_started/embeddings/embeddings.html#sentence-transformers)

In [123]:
# Download models with stopwords
nltk.download("stopwords")
custom_stop_words = ['amazon', 'americanas', 'casas bahia', 'magazine luiza', 'shein', 'kabum',
                       'samsung', 'mercado livre', 'banco brasil', 'apple', 'magazine', 'luiza', 'luizar',
                      'casas', 'bahia', 'casa', 'mercado', 'livre', 'loja', 'produto', 'compra', 'comprar', 
                     'comprar', 'entrar', 'amazomcombr', 'novembro', 'dia', 'amazoncombr digital', 'empresa']
stopwords = nltk.corpus.stopwords.words('portuguese') + custom_stop_words    


In [149]:
# Clustering algorithm
#kmeans = KMeans(n_clusters=20,max_iter=200, n_init=10, random_state=42)
hdbscan  = HDBSCAN(min_cluster_size=10)

# Dimension reduction algorithm
#pca = PCA(n_components=5, random_state=42)
umap = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)

# Define a custom model to remove stopwords
vectorizer = CountVectorizer(stop_words=stopwords, ngram_range=(1,2))

# Reduce frequent words influency
tfid = ClassTfidfTransformer(reduce_frequent_words=True)

# Create a new BERTopic model using multilingual option
topic_model = BERTopic(language="multilingual", verbose=True, 
                       #hdbscan_model=kmeans, 
                       umap_model=umap,
                       vectorizer_model=vectorizer,
                       ctfidf_model=tfid)

# Train model 
topics, probs = topic_model.fit_transform(documents)

print(f'Clustering algorithm parameters: {topic_model.hdbscan_model.get_params(False)}')
print(f'\nReduction algorithm parameters: {topic_model.umap_model.get_params(False)}')

Batches:   0%|          | 0/322 [00:00<?, ?it/s]

Clustering algorithm parameters: {'algorithm': 'auto', 'copy_x': True, 'init': 'k-means++', 'max_iter': 200, 'n_clusters': 20, 'n_init': 10, 'random_state': 42, 'tol': 0.0001, 'verbose': 0}

Reduction algorithm parameters: {'copy': True, 'iterated_power': 'auto', 'n_components': 5, 'random_state': 42, 'svd_solver': 'auto', 'tol': 0.0, 'whiten': False}


#### **Load best model from optimization**

In [159]:
try:
  dir = os.path.join(WORK_DIR, COLNAME, 'models')

  topic_model = BERTopic.load(os.path.join(dir, 'model_trial_2'))
  topics = topic_model.topics_

  print(f'Reduction algorithm parameters: {topic_model.umap_model.get_params(False)}')
  print(f'\nClustering algorithm parameters: {topic_model.hdbscan_model.get_params(False)}')

except:
  print('None model founded from optimization')

Reduction algorithm parameters: {'a': None, 'angular_rp_forest': True, 'b': None, 'dens_frac': 0.3, 'dens_lambda': 2.0, 'dens_var_shift': 0.1, 'densmap': False, 'disconnection_distance': None, 'force_approximation_algorithm': False, 'init': 'spectral', 'learning_rate': 1.0, 'local_connectivity': 1.0, 'low_memory': True, 'metric': 'cosine', 'metric_kwds': None, 'min_dist': 0.1, 'n_components': 2, 'n_epochs': None, 'n_jobs': -1, 'n_neighbors': 13, 'negative_sample_rate': 5, 'output_dens': False, 'output_metric': 'euclidean', 'output_metric_kwds': None, 'precomputed_knn': (None, None, None), 'random_state': 42, 'repulsion_strength': 1.0, 'set_op_mix_ratio': 1.0, 'spread': 1.0, 'target_metric': 'categorical', 'target_metric_kwds': None, 'target_n_neighbors': -1, 'target_weight': 0.5, 'tqdm_kwds': {'desc': 'Epochs completed', 'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'disable': True}, 'transform_mode': 'embedding', 'transform_queue_size': 4.0, 'trans

In [160]:
topic_range = range(0, max_limit(len(topics), 20), 1)
top_n_topics = len(topic_range)

BERTopic works in three main steps: 


1.   Documents are first converted to numeric data. It extracts different embeddings based on the context of the word. For this, a sentence transformation model is used.
2.  Documents with similar topics are then grouped together forming clusters with similar topics. For this purpose, BERTopic uses the clustering algorithm UMAP to lower the dimensionality of the embeddings. Then the documents are clustered with the density-based algorithm HDBSCAN.
3. BERTopic extracts topics from clusters using a class-based TF-IDF score. This score gives the importance of each word in a cluster. Topics are then created based on the most important words measured by their C-TF-IDF score.

For more information check this link [BERTopic](https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6)



#### **BERTopic coherence score**

In [33]:
def get_bertopic_coherence(model, topics, docs):
  # Preprocess Documents
  df_docs = pd.DataFrame({"Document": docs,
                          "ID": range(len(docs)),
                          "Topic": topics})
  documents_per_topic = df_docs.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})
  cleaned_docs = model._preprocess_text(documents_per_topic.Document.values)

  # Extract vectorizer and analyzer from BERTopic
  vectorizer = model.vectorizer_model
  analyzer = vectorizer.build_analyzer()

  # Extract features for Topic Coherence evaluation
  words = vectorizer.get_feature_names()
  tokens = [analyzer(doc) for doc in cleaned_docs]
  dictionary = Dictionary(tokens)
  corpus = [dictionary.doc2bow(token) for token in tokens]
  topic_words = [[words for words, _ in model.get_topic(topic)] 
               for topic in range(len(set(topics))-1)]

  # Evaluate
  coherence_model = CoherenceModel(topics=topic_words, 
                                 texts=tokens, 
                                 corpus=corpus,
                                 dictionary=dictionary, 
                                 coherence='c_v')
  return coherence_model.get_coherence()

In [161]:
print(f"Coherence score: {get_bertopic_coherence(topic_model, topics, documents)}")

Coherence score: 0.37503068336325207


### **Extracting Topics**

In [104]:
# Print the most frequent topics
freq = topic_model.get_topic_info()

# Show the top 5 most frequent topics
freq.head(5)

Unnamed: 0,Topic,Count,Name
0,0,2087,0_comprei_mercado_vendedor_livre
1,1,1948,1_entrega_11_pedido_prazo
2,2,1676,2_amazon_prime_cartão_site amazon
3,3,1430,3_conta_app_consigo_celular
4,4,1421,4_cartão_conta_valor_pagamento


The table above shows the five most freqeuente topics and the words present on it extract by BERTopic. -1 refers to all outliers and should be ignored.

In [68]:
# show the most frequent topic
topic_model.get_topic(0)

[('entregar', 0.14122319721148702),
 ('pedir', 0.13797172374651687),
 ('contato', 0.13710345837730845),
 ('entrar', 0.13370079374742355),
 ('hoje', 0.13324434415634911),
 ('recebi', 0.13147051049491734),
 ('site', 0.13013216892789947),
 ('entrar contato', 0.12951982016099342),
 ('empresa', 0.12602715445848522),
 ('dia', 0.12599104693208002)]

**Note:** BERTopic is stocastich which means that the topics might differ across runs this is mostly due to the stocastisch nature of UMAP

**Save topic info table as CSV**

In [26]:
def save_freq_topics(model, label):
  # Print the most frequent topics
  freq = model.get_topic_info()

  # Show the top 5 most frequent topics
  freq = freq.head(10)

  dir = os.path.join(WORK_DIR, COLNAME, 'frequent_topics')
  
  # Use makedirs() to create a new directory if it does not exists
  if not os.path.exists(dir):
    os.makedirs(dir)

  freq.to_html(os.path.join(dir, f"freq_topics_trial_{label}.json"))
  freq.to_csv(os.path.join(dir, f"freq_topics_trial_{label}.csv"), index=False)

In [124]:
save_freq_topics(topic_model, COLNAME)

## **Visualization**

### **Intertopic Distance Map**

This graph shows the distance intertopic and help us understand the promixity of topics

In [152]:
fig = topic_model.visualize_topics()
fig

**Save intertopic distance map**

In [126]:
def save_topics(model, label, top_n_topics=-1):
  fig = model.visualize_topics(top_n_topics=max_limit(top_n_topics, 20))

  dir = os.path.join(WORK_DIR, COLNAME, 'topics')
  
  # Use makedirs() to create a new directory if it does not exists
  if not os.path.exists(dir):
    os.makedirs(dir)

  fig.write_image(os.path.join(dir, f"intertopic_distance_map_trial_{label}.png"), format="png")
  fig.write_html(os.path.join(dir, f"intertopic_distance_map_trial_{label}.html"))

In [127]:
save_topics(topic_model, COLNAME, top_n_topics)

[Github link](https://github.com/punkmic/Topic-Modeling-Reclame-Aqui/blob/master/results/intertopic_distance_map/idm_preprocessed_lemma.png?raw=true)

### **Visualize Topic Hierarchy**

The topics that were created can be hierarchically reduced. This visualization shows how the topics relate to one another.

In [129]:
fig = topic_model.visualize_hierarchy(top_n_topics=max_limit(top_n_topics, 40), width=800, height=800)
fig

**Save Hierarchical Clustering**

In [130]:
def save_hierarchy(model, label, top_n_topics=-1):
  fig = model.visualize_hierarchy(top_n_topics=max_limit(top_n_topics, 40), width=500, height=500)

  # Set the path to save 
  dir = os.path.join(WORK_DIR, COLNAME, 'hierarchical_clusterings')

  # Use makedirs() to create a new directory if it does not exists
  if not os.path.exists(dir):
    os.makedirs(dir)

  fig.write_image(os.path.join(dir, f"hierarchical_clustering_trial_{label}.png"), format="png")
  fig.write_html(os.path.join(dir, f"hierarchical_clustering_trial_{label}.html"))

In [131]:
save_hierarchy(topic_model, COLNAME, top_n_topics)

[Github link](https://github.com/punkmic/Topic-Modeling-Reclame-Aqui/blob/master/results/hierarchical_clustering/hc_preprocessed_lemma.png?raw=true)

### **Visualize Terms**

We can visualize the selected terms for a few topics by creating bar charts out of the c-TF-IDF scores for each topic representation.

In [153]:
fig = topic_model.visualize_barchart(top_n_topics=max_limit(top_n_topics, 20), width=300, height=300)
fig

**Save Top Word Scores Bar Chart**

In [134]:
def save_top_words_scores(model, label, top_n_topics=-1):
  fig = model.visualize_barchart(top_n_topics=max_limit(top_n_topics, 12), width=250, height=250)

  # Set the path to save 
  dir = os.path.join(WORK_DIR, COLNAME, 'top_words_scores')

  # Use makedirs() to create a new directory if it does not exists
  if not os.path.exists(dir):
    os.makedirs(dir)

  fig.write_image(os.path.join(dir, f"top_words_scores_trial_{label}.png"), format="png")
  fig.write_html(os.path.join(dir, f"top_words_scores_trial_{label}.html"))

In [135]:
save_top_words_scores(topic_model, COLNAME, top_n_topics)

[Github link](https://github.com/punkmic/Topic-Modeling-Reclame-Aqui/blob/master/results/top_words_scores/tws_preprocessed_lemma.png?raw=true)

### **Visualize Topic Similarity**

This plot shows a similarity matrix by simply applying cosine similarities through those topic embeddings generate by BERTopic through both c-TF-IDF and embeddings. This matrix indicate how similar certain topics are to each other.

In [136]:
fig = topic_model.visualize_heatmap(n_clusters=max_limit(len(set(topics)) - 1, 10), width=1000, height=800)
fig

 **Save Similarity Matrix**

In [137]:
def save_similarity_matrix(model, label):
  fig = model.visualize_heatmap(n_clusters=max_limit(len(set(range(0, len(set(model.topics_))))) - 1, 14), width=1000, height=800)

  # Set the path to save 
  dir = os.path.join(WORK_DIR, COLNAME, 'similarity_matrixes')

  # Use makedirs() to create a new directory if it does not exists
  if not os.path.exists(dir):
    os.makedirs(dir)

  fig.write_image(os.path.join(dir, f"similarity_matrix_trial_{label}.png"), format="png")
  fig.write_html(os.path.join(dir, f"similarity_matrix_trial_{label}.html"))

In [138]:
save_similarity_matrix(topic_model, COLNAME)

[Github link](https://github.com/punkmic/Topic-Modeling-Reclame-Aqui/blob/master/results/similarity_matrix/sm_preprocessed_lemma.png?raw=true)

### **Visualize Term Score Decline**

Topics are represented by a number of words starting with the best representative word. Each word is represented by a c-TF-IDF score. The higher the score, the more representative a word to the topic is. Since the topic words are sorted by their c-TF-IDF score, the scores slowly decline with each word that is added.

In [140]:
fig = topic_model.visualize_term_rank(topics=range(0, top_n_topics, 1), width=800, height=500)
fig

**Save Term score decline per Topic**

In [145]:
def save_term_rank(model, label, topics=None):

  if topics == None:
    topics = range(0, len(set(model.topics_)), 1)
  fig = model.visualize_term_rank(topics=topics, width=800, height=500)

  # Set the path to save 
  dir = os.path.join(WORK_DIR, COLNAME, 'term_ranks')

  # Use makedirs() to create a new directory if it does not exists
  if not os.path.exists(dir):
    os.makedirs(dir)

  fig.write_image(os.path.join(dir, f"term_score_trial_{label}.png"), format="png")
  fig.write_html(os.path.join(dir, f"term_score_trial_{label}.html"))

In [146]:
save_term_rank(topic_model, COLNAME)

[Github link](https://github.com/punkmic/Topic-Modeling-Reclame-Aqui/blob/master/results/term_socore_decline_topic/tsdp_preprocessed_lemma.png?raw=true)

### **Visualize documents**

This plot shows documents and their topics in 2D

In [148]:
fig = topic_model.visualize_documents(documents, topics=range(0, top_n_topics, 1), width=800, height=700, hide_annotations=True)
fig

**Save documents and their topics**

In [151]:
def save_documents(model, docs, label, topics=None):
  if topics == None:
    topics = range(0, len(set(model.topics_)), 1)
  fig = model.visualize_documents(docs, topics=topics, width=800, height=700, hide_annotations=True)

  # Set the path to save 
  dir = os.path.join(WORK_DIR, COLNAME, 'documents_topics')

  # Use makedirs() to create a new directory if it does not exists
  if not os.path.exists(dir):
    os.makedirs(dir)

  fig.write_image(os.path.join(dir, f"document_trial_{label}.png"), format="png")
  fig.write_html(os.path.join(dir, f"document_trial_{label}.html"))

In [152]:
save_documents(topic_model, documents, COLNAME)

### **Term search**

In [153]:
# Find topics that contains blackfriday term
similar_topics, similarity = topic_model.find_topics("blackfriday", top_n=5)

# Show similar topics
similar_topics

[2, 1, 8, 5, 3]

In [158]:
# Show a specific topic
topic_model.get_topic(1)

[('compr', 0.019712303134621426),
 ('novembro', 0.01856071225980464),
 ('encomenda', 0.018050388626745885),
 ('loja', 0.016491146696242274),
 ('magalu', 0.014640014983652683),
 ('outubro', 0.013111189276384473),
 ('noite', 0.012841034013462892),
 ('número', 0.012805109355523222),
 ('atendimento', 0.012672380069674477),
 ('consigo', 0.012329134134671767)]

### **"Hiperparameter optimization"**

In [21]:
def save_hyperparameters(trial_params, label):
  # Set the path to save 
  dir = os.path.join(WORK_DIR, COLNAME, 'hyperparameters')

  # Use makedirs() to create a new directory if it does not exists
  if not os.path.exists(dir):
    os.makedirs(dir)

  with open(os.path.join(dir, f"hyperparameters_trial_{label}.json"), "w") as f:
    f.write(json.dumps(trial_params))

In [22]:
def save_model(model, label):
  # Set the path to save 
  dir = os.path.join(WORK_DIR, COLNAME, 'models')

  # Use makedirs() to create a new directory if it does not exists
  if not os.path.exists(dir):
    os.makedirs(dir)

  model.save(os.path.join(dir, label))

In [40]:
def save_coherence(model, topics,  n_gram_range, docs, label, clustering_model, reduction_model):
  
  # compute coherence score for BERTopic
  try:
    coherence_score = get_bertopic_coherence(model, topics, docs)
  except:
    coherence_score = 0.0

  # save scores
  dir = os.path.join(WORK_DIR, COLNAME, 'coherences')
  
  writeheader = False

  if not os.path.exists(dir):
    os.makedirs(dir)
    writeheader = True

  with open(os.path.join(dir, 'coherence_scores.csv'), 'a', newline='') as f:
    fieldnames = ['model', 'number_of_topics', 'n_gramas', 'clustering', 'reduction', 'coherence_score']
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    data = [{'model': label,
              'number_of_topics': len(set(model.topics_)),
             'n_gramas': n_gram_range,
             'clustering': clustering_model, 
             'reduction': reduction_model,
             'coherence_score': round(coherence_score, 4)}]
    if writeheader:
      writer.writeheader()
    writer.writerows(data)
  return coherence_score

In [94]:
def optimizer(trial):

  clustering_option = trial.suggest_categorical('clustering_algorithm__name', ['HDBSCAN', 'K-means'])
  dimensionality_option = trial.suggest_categorical('reduction_algorithm__name', ['UMAP', 'PCA'])

  # BERTopic hyperparameters
  #top_n_words = trial.suggest_int('bertopic__top_n_words', 10, 15)
  n_gram_range = ast.literal_eval(trial.suggest_categorical('bertopic__n_gram_range', ['(1,1)', '(1,2)', '(1,3)']))
  #min_topic_size = trial.suggest_int('bertopic__min_topic_size', 20, 100, step=20)
  #diversity = trial.suggest_float('bertopic__diversity', 0.0, 1.0)
  #outlier_threshold = trial.suggest_float('bertopic__outliers_threshold', 0.04, 0.09)
  nr_topics = trial.suggest_int('bertopic__nr_topics', 8, 14) 

  if clustering_option == 'HDBSCAN':
    # HDBSCAN hyperparameters
    min_cluster_size = trial.suggest_int('hdbscan__min_cluster_size', 10, 12) # the minimum number of points required for a cluster to be considered valid
    cluster_selection_epsilon = trial.suggest_float('hdbscan__cluster_selection_epsilon', 0.0, 1.0) # the distance threshold below which two points are considered neighbors.
    cluster_selection_method = trial.suggest_categorical('hdbcan__cluster_selection_method', ['leaf', 'eom']) 
    hdbscan_alpha = trial.suggest_float('hdbcan__alpha', 0.1, 1.0)
    min_samples = trial.suggest_int('hdbscan__min_samples', 5, 10)

    # create a new HDBSCAN model to cluster documents
    clustering_model = HDBSCAN(min_cluster_size=min_cluster_size,
                               cluster_selection_method=cluster_selection_method, 
                               cluster_selection_epsilon=cluster_selection_epsilon,
                               alpha=hdbscan_alpha,
                               min_samples= min_samples,
                               prediction_data=True)
  elif clustering_option == 'K-means':
    # K-means hyperparameters
    k_means_n_clusters = trial.suggest_int('k_means__n_cluster', 12, 20) 
    k_means_max_iter = trial.suggest_int('k_means__max_iter', 200, 200)
    k_means_n_init = trial.suggest_int('k_means__n_init', 10, 10)

    # create a new HDBSCAN model to cluster documents
    clustering_model = KMeans(n_clusters=k_means_n_clusters,max_iter=k_means_max_iter, n_init=k_means_n_init, random_state=42)


  if dimensionality_option == 'UMAP':
    
    # UMAP hyperparameters
    n_neighbors = trial.suggest_int('umap__n_neighbors', 10, 13) #  the number of nearest neighbors UMAP uses to construct the low-dimensional embedding
    n_components = trial.suggest_int('umap__n_components', 2, 3) # the number of dimensions in the reduced data space
    metric = trial.suggest_categorical('umap__metric', ['cosine', 'cosine']) # euclidean
    #min_dist = trial.suggest_float('umap__min_dist', 0.0, 1.0)
    #spread = trial.suggest_float('umap__spread', 0.0, 1.0)

    # create a new UMAP model to reduce dimension
    reduction_model = UMAP(n_neighbors=n_neighbors, n_components=n_components, 
                           #min_dist=min_dist,
                           #spread=spread,
                           metric=metric, random_state=42)
  elif dimensionality_option == 'PCA':
    
    # PCA hyperparameters
    pca_n_components = trial.suggest_int('pca__n_components', 5, 6) 
    
    # create a new PCA model to reduce dimension
    reduction_model = PCA(n_components=pca_n_components, random_state=42) # k-Means, that does not produce any outliers at all

  # CountVectorizer hyperparameters 
  #max_features = trial.suggest_int('countvectorizer__max_features', 4000, 6000)
  #max_features = trial.suggest_int('vectorizer__max_features', 3000, 6000)

  # reduce the impact of frequent words.
  #ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)

  # create a new CountVectorizer to create a matrix of tokens count
  #vectorizer_model = CountVectorizer(stop_words=stopwords)

  # create a new BERTopic model using multilingual option
  model = BERTopic(language="multilingual", 
                   nr_topics=nr_topics,
                   calculate_probabilities=True, 
                   verbose=True,
                   #top_n_words=top_n_words,
                   n_gram_range=n_gram_range,
                   #min_topic_size=min_topic_size,
                   #diversity=diversity,
                   #vectorizer_model=vectorizer_model,
                   #ctfidf_model=ctfidf_model,
                   umap_model=reduction_model,
                   hdbscan_model=clustering_model)
    
  label = trial.number
  params = trial.params

  # train BERTopic model 
  topics, probs = model.fit_transform(documents)
  

  print(f' Number of Topics: {len(set(model.topics_))}')
  print('\n')
  print('\n')

  coherence_score = 0.0

  # define model id
  model_id = f"model_trial_{label}"
 
  try:
    topic_range = range(0, max_limit(len(topics), 20), 1)
    top_n_topics = len(topic_range)

    # save plots
    save_topics(model, label, top_n_topics)
    save_documents(model, documents, label, topic_range)
    save_hierarchy(model, label, top_n_topics)
    save_term_rank(model, label, topic_range)
    save_top_words_scores(model, label, top_n_topics)
    save_similarity_matrix(model, label)
  

  except ValueError or TypeError: # skip models that throws ValueError: zero-size array to reduction operation maximum which has no identity
    print('\n')
  
  # save hyperparameters
  save_hyperparameters(params, label)

  # save model coherence score
  coherence_score = save_coherence(model, 
                   topics,
                   n_gram_range,
                   documents,
                   model_id,
                   clustering_option,
                   dimensionality_option)
  
  # save model
  save_model(model, model_id)

  return round(coherence_score, 4)

In [95]:
%%time
import warnings
from scipy.sparse import SparseEfficiencyWarning
warnings.filterwarnings('ignore', category=SparseEfficiencyWarning)

# make sure to delete old file 
path = os.path.join(WORK_DIR, COLNAME, 'coherences', 'coherence_scores.csv')
if os.path.exists(path):
  os.remove(path)

# define the number of models to generate by optuna
NUMBER_OF_MODELS = 25

# create a new study
study = optuna.create_study(study_name=f'BERTopic_{COLNAME}', direction='maximize')

# run the optmize function 
study.optimize(optimizer, n_trials=NUMBER_OF_MODELS, show_progress_bar=True)

# print best value and parameters
print(f'Best value {study.best_value}')
print(f'Best params: {study.best_params}')


Progress bar is experimental (supported from v1.2.0). The interface can change in the future.



  0%|          | 0/25 [00:00<?, ?it/s]

Batches:   0%|          | 0/322 [00:00<?, ?it/s]

 Number of Topics: 9






[32m[I 2023-02-02 15:59:25,042][0m Trial 0 finished with value: 0.6407 and parameters: {'clustering_algorithm__name': 'HDBSCAN', 'reduction_algorithm__name': 'UMAP', 'bertopic__n_gram_range': '(1,3)', 'bertopic__nr_topics': 8, 'hdbscan__min_cluster_size': 10, 'hdbscan__cluster_selection_epsilon': 0.28474579602883865, 'hdbcan__cluster_selection_method': 'eom', 'hdbcan__alpha': 0.23862942516140362, 'hdbscan__min_samples': 8, 'umap__n_neighbors': 13, 'umap__n_components': 3, 'umap__metric': 'cosine'}. Best is trial 0 with value: 0.6407.[0m


Batches:   0%|          | 0/322 [00:00<?, ?it/s]

 Number of Topics: 10






[32m[I 2023-02-02 16:00:15,648][0m Trial 1 finished with value: 0.3694 and parameters: {'clustering_algorithm__name': 'HDBSCAN', 'reduction_algorithm__name': 'UMAP', 'bertopic__n_gram_range': '(1,1)', 'bertopic__nr_topics': 9, 'hdbscan__min_cluster_size': 12, 'hdbscan__cluster_selection_epsilon': 0.058897342888796334, 'hdbcan__cluster_selection_method': 'eom', 'hdbcan__alpha': 0.8223845835871771, 'hdbscan__min_samples': 7, 'umap__n_neighbors': 13, 'umap__n_components': 2, 'umap__metric': 'cosine'}. Best is trial 0 with value: 0.6407.[0m


Batches:   0%|          | 0/322 [00:00<?, ?it/s]

 Number of Topics: 9




[32m[I 2023-02-02 16:00:54,433][0m Trial 2 finished with value: 0.3754 and parameters: {'clustering_algorithm__name': 'K-means', 'reduction_algorithm__name': 'UMAP', 'bertopic__n_gram_range': '(1,1)', 'bertopic__nr_topics': 9, 'k_means__n_cluster': 18, 'k_means__max_iter': 200, 'k_means__n_init': 10, 'umap__n_neighbors': 13, 'umap__n_components': 2, 'umap__metric': 'cosine'}. Best is trial 0 with value: 0.6407.[0m


Batches:   0%|          | 0/322 [00:00<?, ?it/s]

 Number of Topics: 10






[32m[I 2023-02-02 16:01:46,054][0m Trial 3 finished with value: 0.5096 and parameters: {'clustering_algorithm__name': 'HDBSCAN', 'reduction_algorithm__name': 'UMAP', 'bertopic__n_gram_range': '(1,2)', 'bertopic__nr_topics': 9, 'hdbscan__min_cluster_size': 12, 'hdbscan__cluster_selection_epsilon': 0.025385828640379193, 'hdbcan__cluster_selection_method': 'leaf', 'hdbcan__alpha': 0.6422395171250717, 'hdbscan__min_samples': 9, 'umap__n_neighbors': 13, 'umap__n_components': 2, 'umap__metric': 'cosine'}. Best is trial 0 with value: 0.6407.[0m


Batches:   0%|          | 0/322 [00:00<?, ?it/s]

 Number of Topics: 8




[32m[I 2023-02-02 16:02:42,474][0m Trial 4 finished with value: 0.4578 and parameters: {'clustering_algorithm__name': 'HDBSCAN', 'reduction_algorithm__name': 'UMAP', 'bertopic__n_gram_range': '(1,2)', 'bertopic__nr_topics': 12, 'hdbscan__min_cluster_size': 12, 'hdbscan__cluster_selection_epsilon': 0.8816092073068508, 'hdbcan__cluster_selection_method': 'leaf', 'hdbcan__alpha': 0.49368770683427254, 'hdbscan__min_samples': 9, 'umap__n_neighbors': 13, 'umap__n_components': 2, 'umap__metric': 'cosine'}. Best is trial 0 with value: 0.6407.[0m


Batches:   0%|          | 0/322 [00:00<?, ?it/s]

 Number of Topics: 10






[32m[I 2023-02-02 16:03:33,800][0m Trial 5 finished with value: 0.5598 and parameters: {'clustering_algorithm__name': 'HDBSCAN', 'reduction_algorithm__name': 'UMAP', 'bertopic__n_gram_range': '(1,2)', 'bertopic__nr_topics': 9, 'hdbscan__min_cluster_size': 12, 'hdbscan__cluster_selection_epsilon': 0.7467985202132453, 'hdbcan__cluster_selection_method': 'leaf', 'hdbcan__alpha': 0.1213836808116061, 'hdbscan__min_samples': 6, 'umap__n_neighbors': 11, 'umap__n_components': 2, 'umap__metric': 'cosine'}. Best is trial 0 with value: 0.6407.[0m


Batches:   0%|          | 0/322 [00:00<?, ?it/s]

 Number of Topics: 12




[32m[I 2023-02-02 16:04:18,514][0m Trial 6 finished with value: 0.8287 and parameters: {'clustering_algorithm__name': 'K-means', 'reduction_algorithm__name': 'UMAP', 'bertopic__n_gram_range': '(1,3)', 'bertopic__nr_topics': 12, 'k_means__n_cluster': 13, 'k_means__max_iter': 200, 'k_means__n_init': 10, 'umap__n_neighbors': 13, 'umap__n_components': 2, 'umap__metric': 'cosine'}. Best is trial 6 with value: 0.8287.[0m


Batches:   0%|          | 0/322 [00:00<?, ?it/s]

 Number of Topics: 8




[32m[I 2023-02-02 16:04:59,910][0m Trial 7 finished with value: 0.7641 and parameters: {'clustering_algorithm__name': 'K-means', 'reduction_algorithm__name': 'UMAP', 'bertopic__n_gram_range': '(1,2)', 'bertopic__nr_topics': 8, 'k_means__n_cluster': 18, 'k_means__max_iter': 200, 'k_means__n_init': 10, 'umap__n_neighbors': 12, 'umap__n_components': 2, 'umap__metric': 'cosine'}. Best is trial 6 with value: 0.8287.[0m


Batches:   0%|          | 0/322 [00:00<?, ?it/s]

 Number of Topics: 14




[32m[I 2023-02-02 16:05:43,092][0m Trial 8 finished with value: 0.7412 and parameters: {'clustering_algorithm__name': 'K-means', 'reduction_algorithm__name': 'UMAP', 'bertopic__n_gram_range': '(1,2)', 'bertopic__nr_topics': 14, 'k_means__n_cluster': 17, 'k_means__max_iter': 200, 'k_means__n_init': 10, 'umap__n_neighbors': 10, 'umap__n_components': 3, 'umap__metric': 'cosine'}. Best is trial 6 with value: 0.8287.[0m


Batches:   0%|          | 0/322 [00:00<?, ?it/s]

 Number of Topics: 13




[32m[I 2023-02-02 16:06:14,348][0m Trial 9 finished with value: 0.3749 and parameters: {'clustering_algorithm__name': 'K-means', 'reduction_algorithm__name': 'PCA', 'bertopic__n_gram_range': '(1,1)', 'bertopic__nr_topics': 13, 'k_means__n_cluster': 17, 'k_means__max_iter': 200, 'k_means__n_init': 10, 'pca__n_components': 6}. Best is trial 6 with value: 0.8287.[0m


Batches:   0%|          | 0/322 [00:00<?, ?it/s]

 Number of Topics: 11




[32m[I 2023-02-02 16:06:55,032][0m Trial 10 finished with value: 0.8466 and parameters: {'clustering_algorithm__name': 'K-means', 'reduction_algorithm__name': 'PCA', 'bertopic__n_gram_range': '(1,3)', 'bertopic__nr_topics': 11, 'k_means__n_cluster': 12, 'k_means__max_iter': 200, 'k_means__n_init': 10, 'pca__n_components': 5}. Best is trial 10 with value: 0.8466.[0m


Batches:   0%|          | 0/322 [00:00<?, ?it/s]

 Number of Topics: 11




[32m[I 2023-02-02 16:07:32,938][0m Trial 11 finished with value: 0.8466 and parameters: {'clustering_algorithm__name': 'K-means', 'reduction_algorithm__name': 'PCA', 'bertopic__n_gram_range': '(1,3)', 'bertopic__nr_topics': 11, 'k_means__n_cluster': 12, 'k_means__max_iter': 200, 'k_means__n_init': 10, 'pca__n_components': 5}. Best is trial 10 with value: 0.8466.[0m


Batches:   0%|          | 0/322 [00:00<?, ?it/s]

 Number of Topics: 11




[32m[I 2023-02-02 16:08:08,431][0m Trial 12 finished with value: 0.8466 and parameters: {'clustering_algorithm__name': 'K-means', 'reduction_algorithm__name': 'PCA', 'bertopic__n_gram_range': '(1,3)', 'bertopic__nr_topics': 11, 'k_means__n_cluster': 12, 'k_means__max_iter': 200, 'k_means__n_init': 10, 'pca__n_components': 5}. Best is trial 10 with value: 0.8466.[0m


Batches:   0%|          | 0/322 [00:00<?, ?it/s]

 Number of Topics: 11




[32m[I 2023-02-02 16:08:44,156][0m Trial 13 finished with value: 0.8324 and parameters: {'clustering_algorithm__name': 'K-means', 'reduction_algorithm__name': 'PCA', 'bertopic__n_gram_range': '(1,3)', 'bertopic__nr_topics': 11, 'k_means__n_cluster': 14, 'k_means__max_iter': 200, 'k_means__n_init': 10, 'pca__n_components': 5}. Best is trial 10 with value: 0.8466.[0m


Batches:   0%|          | 0/322 [00:00<?, ?it/s]

 Number of Topics: 10




[32m[I 2023-02-02 16:09:20,522][0m Trial 14 finished with value: 0.863 and parameters: {'clustering_algorithm__name': 'K-means', 'reduction_algorithm__name': 'PCA', 'bertopic__n_gram_range': '(1,3)', 'bertopic__nr_topics': 10, 'k_means__n_cluster': 15, 'k_means__max_iter': 200, 'k_means__n_init': 10, 'pca__n_components': 5}. Best is trial 14 with value: 0.863.[0m


Batches:   0%|          | 0/322 [00:00<?, ?it/s]

 Number of Topics: 10




[32m[I 2023-02-02 16:09:56,285][0m Trial 15 finished with value: 0.863 and parameters: {'clustering_algorithm__name': 'K-means', 'reduction_algorithm__name': 'PCA', 'bertopic__n_gram_range': '(1,3)', 'bertopic__nr_topics': 10, 'k_means__n_cluster': 15, 'k_means__max_iter': 200, 'k_means__n_init': 10, 'pca__n_components': 5}. Best is trial 14 with value: 0.863.[0m


Batches:   0%|          | 0/322 [00:00<?, ?it/s]

 Number of Topics: 10




[32m[I 2023-02-02 16:10:31,371][0m Trial 16 finished with value: 0.8646 and parameters: {'clustering_algorithm__name': 'K-means', 'reduction_algorithm__name': 'PCA', 'bertopic__n_gram_range': '(1,3)', 'bertopic__nr_topics': 10, 'k_means__n_cluster': 15, 'k_means__max_iter': 200, 'k_means__n_init': 10, 'pca__n_components': 6}. Best is trial 16 with value: 0.8646.[0m


Batches:   0%|          | 0/322 [00:00<?, ?it/s]

 Number of Topics: 10




[32m[I 2023-02-02 16:11:08,176][0m Trial 17 finished with value: 0.8646 and parameters: {'clustering_algorithm__name': 'K-means', 'reduction_algorithm__name': 'PCA', 'bertopic__n_gram_range': '(1,3)', 'bertopic__nr_topics': 10, 'k_means__n_cluster': 15, 'k_means__max_iter': 200, 'k_means__n_init': 10, 'pca__n_components': 6}. Best is trial 16 with value: 0.8646.[0m


Batches:   0%|          | 0/322 [00:00<?, ?it/s]

 Number of Topics: 10




[32m[I 2023-02-02 16:11:48,400][0m Trial 18 finished with value: 0.8635 and parameters: {'clustering_algorithm__name': 'K-means', 'reduction_algorithm__name': 'PCA', 'bertopic__n_gram_range': '(1,3)', 'bertopic__nr_topics': 10, 'k_means__n_cluster': 20, 'k_means__max_iter': 200, 'k_means__n_init': 10, 'pca__n_components': 6}. Best is trial 16 with value: 0.8646.[0m


Batches:   0%|          | 0/322 [00:00<?, ?it/s]

 Number of Topics: 12




[32m[I 2023-02-02 16:12:26,997][0m Trial 19 finished with value: 0.8514 and parameters: {'clustering_algorithm__name': 'K-means', 'reduction_algorithm__name': 'PCA', 'bertopic__n_gram_range': '(1,3)', 'bertopic__nr_topics': 12, 'k_means__n_cluster': 15, 'k_means__max_iter': 200, 'k_means__n_init': 10, 'pca__n_components': 6}. Best is trial 16 with value: 0.8646.[0m


Batches:   0%|          | 0/322 [00:00<?, ?it/s]

 Number of Topics: 10




[32m[I 2023-02-02 16:12:57,168][0m Trial 20 finished with value: 0.3692 and parameters: {'clustering_algorithm__name': 'K-means', 'reduction_algorithm__name': 'PCA', 'bertopic__n_gram_range': '(1,1)', 'bertopic__nr_topics': 10, 'k_means__n_cluster': 16, 'k_means__max_iter': 200, 'k_means__n_init': 10, 'pca__n_components': 6}. Best is trial 16 with value: 0.8646.[0m


Batches:   0%|          | 0/322 [00:00<?, ?it/s]

 Number of Topics: 10




[32m[I 2023-02-02 16:13:32,946][0m Trial 21 finished with value: 0.8635 and parameters: {'clustering_algorithm__name': 'K-means', 'reduction_algorithm__name': 'PCA', 'bertopic__n_gram_range': '(1,3)', 'bertopic__nr_topics': 10, 'k_means__n_cluster': 20, 'k_means__max_iter': 200, 'k_means__n_init': 10, 'pca__n_components': 6}. Best is trial 16 with value: 0.8646.[0m


Batches:   0%|          | 0/322 [00:00<?, ?it/s]

 Number of Topics: 10




[32m[I 2023-02-02 16:14:09,874][0m Trial 22 finished with value: 0.8392 and parameters: {'clustering_algorithm__name': 'K-means', 'reduction_algorithm__name': 'PCA', 'bertopic__n_gram_range': '(1,3)', 'bertopic__nr_topics': 10, 'k_means__n_cluster': 14, 'k_means__max_iter': 200, 'k_means__n_init': 10, 'pca__n_components': 6}. Best is trial 16 with value: 0.8646.[0m


Batches:   0%|          | 0/322 [00:00<?, ?it/s]

 Number of Topics: 9




[32m[I 2023-02-02 16:14:45,309][0m Trial 23 finished with value: 0.8581 and parameters: {'clustering_algorithm__name': 'K-means', 'reduction_algorithm__name': 'PCA', 'bertopic__n_gram_range': '(1,3)', 'bertopic__nr_topics': 9, 'k_means__n_cluster': 20, 'k_means__max_iter': 200, 'k_means__n_init': 10, 'pca__n_components': 6}. Best is trial 16 with value: 0.8646.[0m


Batches:   0%|          | 0/322 [00:00<?, ?it/s]

 Number of Topics: 10




[32m[I 2023-02-02 16:15:20,463][0m Trial 24 finished with value: 0.8457 and parameters: {'clustering_algorithm__name': 'K-means', 'reduction_algorithm__name': 'PCA', 'bertopic__n_gram_range': '(1,3)', 'bertopic__nr_topics': 10, 'k_means__n_cluster': 16, 'k_means__max_iter': 200, 'k_means__n_init': 10, 'pca__n_components': 6}. Best is trial 16 with value: 0.8646.[0m
Best value 0.8646
Best params: {'clustering_algorithm__name': 'K-means', 'reduction_algorithm__name': 'PCA', 'bertopic__n_gram_range': '(1,3)', 'bertopic__nr_topics': 10, 'k_means__n_cluster': 15, 'k_means__max_iter': 200, 'k_means__n_init': 10, 'pca__n_components': 6}
CPU times: user 17min 58s, sys: 36.2 s, total: 18min 34s
Wall time: 16min 50s


#### **Update remote repository**

In [96]:
%cd /content/Topic-Modeling-Reclame-Aqui/

/content/Topic-Modeling-Reclame-Aqui


In [None]:
!git status

In [98]:
!git add *

In [None]:
!git status

In [None]:
!git commit -m "Adding bertopic results from frequent words removed (nouns)"

In [100]:
!git push origin master

Enumerating objects: 491, done.
Counting objects:   0% (1/486)Counting objects:   1% (5/486)Counting objects:   2% (10/486)Counting objects:   3% (15/486)Counting objects:   4% (20/486)Counting objects:   5% (25/486)Counting objects:   6% (30/486)Counting objects:   7% (35/486)Counting objects:   8% (39/486)Counting objects:   9% (44/486)Counting objects:  10% (49/486)Counting objects:  11% (54/486)Counting objects:  12% (59/486)Counting objects:  13% (64/486)Counting objects:  14% (69/486)Counting objects:  15% (73/486)Counting objects:  16% (78/486)Counting objects:  17% (83/486)Counting objects:  18% (88/486)Counting objects:  19% (93/486)Counting objects:  20% (98/486)Counting objects:  21% (103/486)Counting objects:  22% (107/486)Counting objects:  23% (112/486)Counting objects:  24% (117/486)Counting objects:  25% (122/486)Counting objects:  26% (127/486)Counting objects:  27% (132/486)Counting objects:  28% (137/486)Counting objects:  29% (141/486)C

In [None]:
# run this command to push a new version of this notebook in case you have saved the notebook in github and it is outdate 
!git stash
!git pull
!git stash pop

No local changes to save
Already up to date.
No stash entries found.


In [None]:
!rm -rf /root/.ssh/