<a href="https://colab.research.google.com/github/punkmic/Topic-Modeling-Reclame-Aqui/blob/master/Topic_Modeling_with_BERTopic_Reclame_aqui.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Topic Modeling with BERTopic - Reclame Aqui**

BERTopic is a topic modeling technique that leverages transformers and a custom class-based TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions 

Reference: (https://maartengr.github.io/BERTopic/index.html).

### **Enabling the GPU**

We will use the GPU provided by COLAB to accelarate our model training. To enable GPUs for the notebook:
1- Navigate to Edit -> Notebook Settings
2- Select GPU from the Hardware Accelerator drop-down

In [6]:
# verify if GPU is enable
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Mon Jan  9 15:35:53 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   54C    P0    28W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

### **Setup**

In [7]:
%%capture
!pip install pyspellchecker
!pip install bertopic
!pip install kaleido # for save BERTopic plots as image

In [8]:
import pandas as pd # for data manipulation
import os # for interacting with the operating system
import nltk # for natural language processing
import string # for string manipulation
import re # for for regular expressions
import matplotlib.pyplot as plt # for visualization
import spacy # for lemmatize portuguese text
from bertopic import BERTopic # for topic modeling
from spellchecker import SpellChecker # for spell check

In [9]:
%%capture
# Install spacy pt_core_news_sm for portuguese text
!python -m spacy download pt_core_news_sm

In [10]:
nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [11]:
# Download dataset with stopwords
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [12]:
# Donwload datasets for lemmatization
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [13]:
# Donwload dependency need to stem portuguese text
nltk.download('rslp')

[nltk_data] Downloading package rslp to /root/nltk_data...
[nltk_data]   Unzipping stemmers/rslp.zip.


True

### **Load data from [Github](https://github.com/punkmic/Topic-Modeling-Reclame-Aqui.git)**

In [None]:
#!git clone https://github.com/punkmic/Topic-Modeling-Reclame-Aqui.git

In [None]:
# Change directory
%cd /content/Topic-Modeling-Reclame-Aqui 

# Update files from remote repository
!git pull 

# Return to work directory
%cd ..

# Check current directory
!pwd

/content/Topic-Modeling-Reclame-Aqui
Already up to date.
/content
/content


In [14]:
def read_data(path_csv, drop_duplicates = True, lower=True):
 
  # use the read_csv method to read csv file
  df = pd.read_csv(path_csv)
  
  if drop_duplicates:
    # read and return the CSV file using the read_csv method
    print(f"Shape before remove duplicates: {df.shape}")

    # use the drop_duplicated method to drop duplicates rows
    df = df.drop_duplicates(subset="text")

    print(f"Shape after remove duplicates: {df.shape}")

    if lower:
      # apply the str.lower() method to each element in the dataframe
      df = df.applymap(str.lower)
    
     # rename columns
    df.columns = ["title", "documents"] 

    # use the replace() method to replace the string with an empty string
    df = df.replace(re.compile('\[editado pelo reclame aqui\]|editado pelo reclame aqui|Editado pelo Reclame Aqui'), '')
    df = df.replace(re.compile('\[casas bahia\]|Casa Bahia|Casas Bahia|casa bahia'), '')
    df = df.replace(re.compile('\[magazine luiza\]|Magazine luiza|Magazine Luiza| Magazine luizar|Magazine Luizar'), '')
    df = df.replace(re.compile('\[mercado livre\]|Mercado Livre|Mercado livre'), '')
    df = df.replace(re.compile('\[americana\]|Ameriacanas|ameriacanas'), '')

  return df

### **Preprocessing**

#### **Tokenization**

Tokenization aims to breaking text down into its component parts

In [15]:
WORD_TOKENIZER = nltk.tokenize.word_tokenize
def tokenize(text):
  tokens = [token.strip().lower() for token in WORD_TOKENIZER(text, language="portuguese")]
 
  # set a pattern to detect patterns such as x x, xxx x, xxx xxx
  pattern = r"\b\w+\s+\w+\b"
 
  # filter tokens by pattern
  filtered_words = [word for word in tokens if re.search(pattern, word)]

  # return token if not in filter list
  return [token for token in tokens if token not in filtered_words]

#### **Stem** 

Stem the tokens. This step aims to remove morphological affixes and normalize to standardized stem forms

In [16]:
STEMMER = nltk.stem.RSLPStemmer()
def stem(tokens):
  return [STEMMER.stem(token) for token in tokens]

#### **Lemmatize**

Lemmatize the tokens. Retains more natural forms than stemming, but assumes all tokens nons unless tokens are passed as (word, pos) tuples. Note: nltk lemmatize does not suport portugues language

In [17]:
LEMMATIZER = nltk.WordNetLemmatizer()

def lemmatize(tokens):
  lemmas = []
  for token in tokens:
      if isinstance(token, str):
          # treats token like a noun
          lemmas.append(LEMMATIZER.lemmatize(token)) 
      else: 
          # assume a tuple of (word, pos)
          lemmas.append(LEMMATIZER.lemmatize(*token))
  return lemmas

**Lemmatize option for portuguese text**

In [18]:
# load portuguese model
nlp = spacy.load('pt_core_news_sm')

def lemmatize_pt(tokens):

  # Create a spaCy Doc object and apply the lemmatization
  doc = nlp(' '.join(tokens))

  # Return lemmatize
  return [token.lemma_ for token in doc]

#### **Remove stopwords**

Stop words are things like articles and conjunctions that usually do not offer a lot of value in an analysis.

In [19]:
def remove_stopwords(tokens, stopwords=None, custom_stop_words = None):

  if custom_stop_words is None:
    custom_stop_words = ['amazon', 'americanas', 'casas bahia', 'magazine luiza', 'shein', 'kabum',
                       'samsung', 'mercado livre', 'banco brasil', 'apple', 'magazine', 'luiza', 'luizar',
                      'casas', 'bahia', 'casa', 'mercado', 'livre']

  # Use the default stop words if none is passed
  if stopwords is None:
    stopwords = nltk.corpus.stopwords.words('portuguese') + custom_stop_words
  
  # Filter the list of tokens to exclude the stop word tokens
  return [token for token in tokens if token not in stopwords]

In [20]:
custom_stop_words = ['amazon', 'americanas', 'casas bahia', 'magazine luiza', 'shein', 'kabum',
                       'samsung', 'mercado livre', 'banco brasil', 'apple', 'magazine', 'luiza', 'luizar',
                      'casas', 'bahia', 'casa', 'mercado', 'livre']
assert remove_stopwords(['compra', 'echar', 'em esse', 'amazon', 'pude'], custom_stop_words=custom_stop_words) == ['compra', 'echar', 'em esse', 'pude']

#### **Remove hyperlinks**

Removes http/s links from the tokens.

In [21]:
def remove_links(tokens):
  # Filter tokens that starts with "http://" or "https://"
  return [token for token in tokens 
          if not token.startswith("http://")
          and not token.startswith("https://")]

In [22]:
assert remove_links(['bom', 'http://online', 'https://offline']) == ['bom']

#### **Remove numbers**

In [23]:
def remove_numbers(tokens):
  # Use a regular expression to match words that contain numbers
  pattern = r"\b\w*\d\w*\b"
  tokens = [token for token in tokens if not re.sub(pattern, "", token) == ""]
  
  # Filter out number tokens using a list comprehension and the isnumeric method
  return [token for token in tokens if not token.isnumeric()]

In [24]:
assert remove_numbers(['ola', 'bicicleta', '1', '2002']) == ['ola', 'bicicleta']

#### **Remove date**

In [25]:
def remove_date(tokens):
  # Compile a regular expression to match dates in the format dd/mm or dd/mm/yyyy
  date_regex = re.compile(r'\d{2}/\d{2}(/\d{4})?')

  # Use the regex to find all the tokens that match the date pattern
  dates = [token for token in tokens if date_regex.fullmatch(token)]

  # Filter the list of tokens to exclude the date tokens
  filtered_tokens = [token for token in tokens if token not in dates]

  # Return the filtered tokens
  return filtered_tokens

In [26]:
assert remove_date(['texto', 'data', '20/10', 'seguro', '02/09/2014']) == ['texto', 'data', 'seguro']

#### **Remove punctuation**

In [27]:
def remove_punctuation(tokens,
                       strip_mentions=True,
                       strip_hashtags=True):

    tokens = [re.sub(r'[^\w\s]', '', token) for token in tokens]

    # Filter punctuation tokens
    tokens = [token.strip() for token in tokens if token not in string.punctuation]

    # Remove @ symbol from left side of tokens
    if strip_mentions:
        tokens = [t.lstrip(r"([!\"#$%&'()*+,-./:;<=>?@[\]^_`{|}~])\1+") for t in tokens]

    # Remove # symbol from left side of tokens
    if strip_hashtags:
        tokens = [t.lstrip(r"([!\"#$%&'()*+,-./:;<=>?@[\]^_`{|}~])\1+") for t in tokens]

    return tokens

In [28]:
assert remove_punctuation(['limpo', 'acento/  ///', 'simples???', 'onde', ',']) == ['limpo', 'acento', 'simples', 'onde']

#### **Remove short tokens**

In [29]:
def remove_short_tokens(tokens):
  # Filter the list of tokens to exclude tokens that are shorter than four letters
  filtered_tokens = [token for token in tokens if len(token) >= 4]

  # Return the filtered tokens
  return filtered_tokens

In [30]:
assert remove_short_tokens(['sair', 'um', 'correto', 'igual', 'oi', 'de', 'em']) == ['sair', 'correto', 'igual']

#### **Correction of spelling errors**

In [31]:
# Create a SpellChecker object
spell = SpellChecker(language='pt')

def check_spell_errors(text):
  result = []
  for token in text:
    # Correct the spelling errors in the text
    corrected_text = spell.correction(token)

    # If no correction is present user the original text
    if corrected_text == None:
      corrected_text =  token
  
    result.append(corrected_text)
  # Return the corrected text
  return result

#### **Remove extra white spaces**

In [32]:
def remove_whitespace(document):
    return  " ".join(document.split())

In [33]:
def preprocess(df, colname, custom_process = None, check_spell = False):
  df[colname]= df[colname].str.lower()
  df[colname]= df[colname].apply(remove_whitespace)

  if custom_process is None:
    df[colname] = df[colname].apply(tokenize)
    if check_spell:
      df[colname] = df[colname].apply(check_spell_errors)
    df[colname] = df[colname].apply(remove_links)
    df[colname] = df[colname].apply(remove_punctuation)
    df[colname] = df[colname].apply(remove_numbers)
    df[colname] = df[colname].apply(remove_date)
    df[colname] = df[colname].apply(remove_short_tokens)
    df[colname] = df[colname].apply(remove_stopwords)
    df[colname] = df[colname].apply(lemmatize_pt) 
    df[colname] = df[colname].apply(lambda x: ' '.join(x))
  else:
    if 'tokenize' in custom_process:
      df[colname] = df[colname].apply(tokenize)
    if 'check_spell' in custom_process:
      df[colname] = df[colname].apply(check_spell_errors) 
    if 'remove_links' in custom_process:
      df[colname] = df[colname].apply(remove_links)
    if 'remove_punctuation' in custom_process:
      df[colname] = df[colname].apply(remove_punctuation)
    if 'remove_numbers' in custom_process:
      df[colname] = df[colname].apply(remove_numbers)
    if 'remove_date' in custom_process:
      df[colname] = df[colname].apply(remove_date)
    if 'remove_short_tokens' in custom_process:
      df[colname] = df[colname].apply(remove_short_tokens)
    if 'remove_stopwords' in custom_process:
      df[colname] = df[colname].apply(remove_stopwords)
    if 'lemmatize' in custom_process:
      df[colname] = df[colname].apply(lemmatize_pt)
    if 'steam' in custom_process and 'lemmatize' not in custom_process:
      df[colname] = df[colname].apply(stem)
  return df

In [34]:
def get_data(features):
  dfs = {}
  # create models for documents and title features
  df = None
  for feature in features:
    if not os.path.exists(f'/content/Topic-Modeling-Reclame-Aqui/results/dataset/preprocessed/table_{feature}.csv'):

      path_csv = "/content/Topic-Modeling-Reclame-Aqui/docs.csv"
  
      if df is None:
        # read data 
        df = read_data(path_csv)
  
      # preprocess data
      df = preprocess(df, feature)

      # Set the path to save 
      path = '/content/Topic-Modeling-Reclame-Aqui/results/dataset/preprocessed/'

      # Use makedirs() to create a new directory if it does not exists
      if not os.path.exists(path):
        os.makedirs(path)

      # Save the DataFrame to a CSV file
      df.to_csv(path + f'table_{feature}.csv', index=False)
      print(f'Dataset saved into {path} directory.')
  for feature in features:
    df_processed = pd.read_csv(f'/content/Topic-Modeling-Reclame-Aqui/results/dataset/preprocessed/table_{feature}.csv')
    dfs[feature] = df_processed
  dfs['raw'] = df
  return dfs

In [36]:
features=['title', 'documents']
dfs = get_data(features)
print(dfs.keys())

Shape before remove duplicates: (12760, 2)
Shape after remove duplicates: (10510, 2)
Dataset saved into /content/Topic-Modeling-Reclame-Aqui/results/dataset/preprocessed/ directory.
Dataset saved into /content/Topic-Modeling-Reclame-Aqui/results/dataset/preprocessed/ directory.
dict_keys(['title', 'documents', 'raw'])


In [37]:
DATA_FRAME_FEATURE_NAME = 'raw'

docs = list(dfs[DATA_FRAME_FEATURE_NAME]['documents'])
print(docs[0:10])

['pesquisando bastante novo comprar resolver aguardar semana cliente porque ocorrer vários desconto promoção em esse semana ser assim recebi desconto cashback cupom varia plataforma assim decidir efetuar compra entender melhor custo beneficio ser assim setembro recebi oferta produto entendi preço além cachback efetuei comprar aguardei semana consumidor efetuar todo expectativa estavar ancioso estar dentro prazo entregar hoje resolver enviar mensagem perguntar pedir enviar após confirmação comprar pagamento passado dia nenhum retorno complei inclusive site confio tambem assinante surpresa após pedir informação pedir retorno email estar ser cancelar ser nenhum justificativa após dia realização aguardei chegar periodo compro ansiosamente aproveitar desconto produto familia aguardar além cancelar compra produto justificativa empresa tirar todo oportunidade desconto efetuar compra semano consumidor levar dia cancelar sinto extremamente empresa sempre acreditar aguardo satisfação reparação t

## **Training a BERTopic Model**

The BERTopic algorithm has several advantages over other topic modeling algorithms. It is able to handle sparse data, it is scalable to large datasets, and it is able to learn topics that are not well-defined or are overlapping.

As our data language is portuguese we will going to set language to multilingual.

Create a new BERTopic model and train it. By default BERTopic use the paraphrase-multilingual-MiniLM-L12-v2 model for multi language documents. For others model check here [BERTopic sentence transformers](https://maartengr.github.io/BERTopic/getting_started/embeddings/embeddings.html#sentence-transformers)

In [38]:
# Create a new BERTopic model using multilingual option
topic_model = BERTopic(language="multilingual", calculate_probabilities=True, verbose=True)

# Train model 
topics, probs = topic_model.fit_transform(docs)

Downloading:   0%|          | 0.00/968 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/190 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/3.79k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/645 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/122 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/471M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/5.07M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/239 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/9.08M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/480 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/14.8M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/229 [00:00<?, ?B/s]

Batches:   0%|          | 0/329 [00:00<?, ?it/s]

2023-01-09 15:46:44,778 - BERTopic - Transformed documents to Embeddings
2023-01-09 15:47:14,965 - BERTopic - Reduced dimensionality
2023-01-09 15:47:18,444 - BERTopic - Clustered reduced embeddings


BERTopic works in three main steps: 


1.   Documents are first converted to numeric data. It extracts different embeddings based on the context of the word. For this, a sentence transformation model is used.
2.  Documents with similar topics are then grouped together forming clusters with similar topics. For this purpose, BERTopic uses the clustering algorithm UMAP to lower the dimensionality of the embeddings. Then the documents are clustered with the density-based algorithm HDBSCAN.
3. BERTopic extracts topics from clusters using a class-based TF-IDF score. This score gives the importance of each word in a cluster. Topics are then created based on the most important words measured by their C-TF-IDF score.

For more information check this link [BERTopic](https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6)



#### **BERTopic coherence score**

In [79]:
from bertopic import BERTopic
import gensim.corpora as corpora
from gensim.models.coherencemodel import CoherenceModel

def get_coherence(model, topics, docs):
  # Preprocess Documents
  documents = pd.DataFrame({"Document": docs,
                          "ID": range(len(docs)),
                          "Topic": topics})
  documents_per_topic = documents.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})
  cleaned_docs = model._preprocess_text(documents_per_topic.Document.values)

  # Extract vectorizer and analyzer from BERTopic
  vectorizer = model.vectorizer_model
  analyzer = vectorizer.build_analyzer()

  # Extract features for Topic Coherence evaluation
  words = vectorizer.get_feature_names()
  tokens = [analyzer(doc) for doc in cleaned_docs]
  dictionary = corpora.Dictionary(tokens)
  corpus = [dictionary.doc2bow(token) for token in tokens]
  topic_words = [[words for words, _ in model.get_topic(topic)] 
               for topic in range(len(set(topics))-1)]

  # Evaluate
  coherence_model = CoherenceModel(topics=topic_words, 
                                 texts=tokens, 
                                 corpus=corpus,
                                 dictionary=dictionary, 
                                 coherence='c_v')
  coherence = coherence_model.get_coherence()
  return coherence

In [40]:
print(f"Coeherence score: {get_coherence(topic_model, topics, docs)}")

Coeherence score: 0.48699849859728617


### **Extracting Topics**

In [41]:
# Print the most frequent topics
freq = topic_model.get_topic_info()

# Show the top 5 most frequent topics
freq.head(5)

Unnamed: 0,Topic,Count,Name
0,-1,4858,-1_entregar_produto_pedir_compra
1,0,627,0_cartão_crédito_cobrar_fatura
2,1,490,1_celular_aparelho_smartphone_comprar
3,2,390,2_reembolso_dinheiro_valor_recebi
4,3,341,3_cancelamento_cancelar_pedir_solicitei


The table above shows the five most freqeuente topics and the words present on it extract by BERTopic. -1 refers to all outliers and should be ignored.

In [42]:
# show the most frequent topic
topic_model.get_topic(0)

[('cartão', 0.05717987130959735),
 ('crédito', 0.02954930663781189),
 ('cobrar', 0.01917904703496946),
 ('fatura', 0.01901709618629686),
 ('cobrança', 0.017930776285888805),
 ('compra', 0.017443753008579917),
 ('valor', 0.017383968145205066),
 ('pagamento', 0.015075489843901939),
 ('contar', 0.014061441264874948),
 ('conta', 0.01336747614286483)]

**Note:** BERTopic is stocastich which means that the topics might differ across runs this is mostly due to the stocastisch nature of UMAP

#### **Save topic info table as CSV**

In [None]:
# Set the path to save 
path = f'/content/Topic-Modeling-Reclame-Aqui/results/{DATA_FRAME_FEATURE_NAME}/topic_info_tables/'

# Use makedirs() to create a new directory if it does not exists
if not os.path.exists(path):
  os.makedirs(path)

# Save table as csv
freq.head(10).to_csv(path + 'topic_info_preprocessed_lemma.csv', index=False)

## **Visualization**

### **Intertopic Distance Map**

This graph shows the distance intertopic and help us understand the promixity of topics

In [None]:
fig = topic_model.visualize_topics(width=800, height=800)
fig

#### **Save intertopic distance map**

In [None]:
# Set the path to save 
path = f'/content/Topic-Modeling-Reclame-Aqui/results/{DATA_FRAME_FEATURE_NAME}/intertopic_distance_map/'

# Use makedirs() to create a new directory if it does not exists
if not os.path.exists(path):
  os.makedirs(path)


fig.write_image(path + "idm_preprocessed_lemma.png", format="png")
fig.write_html(path + "idm_preprocessed_lemma.html")

[Github link](https://github.com/punkmic/Topic-Modeling-Reclame-Aqui/blob/master/results/intertopic_distance_map/idm_preprocessed_lemma.png?raw=true)

### **Visualize Topic Hierarchy**

The topics that were created can be hierarchically reduced. This visualization shows how the topics relate to one another.

In [None]:
fig = topic_model.visualize_hierarchy(top_n_topics=30, width=800, height=800)
fig

#### **Save Hierarchical Clustering**

In [None]:
# Set the path to save 
path = f'/content/Topic-Modeling-Reclame-Aqui/results/{DATA_FRAME_FEATURE_NAME}/hierarchical_clustering/'

# Use makedirs() to create a new directory if it does not exists
if not os.path.exists(path):
  os.makedirs(path)


fig.write_image(path + "hc_preprocessed_lemma.png", format="png")

[Github link](https://github.com/punkmic/Topic-Modeling-Reclame-Aqui/blob/master/results/hierarchical_clustering/hc_preprocessed_lemma.png?raw=true)

### **Visualize Terms**

We can visualize the selected terms for a few topics by creating bar charts out of the c-TF-IDF scores for each topic representation.

In [None]:
fig = topic_model.visualize_barchart(top_n_topics=12, width=300, height=300)
fig

#### **Save Top Word Scores Bar Chart**

In [None]:
# Set the path to save 
path = f'/content/Topic-Modeling-Reclame-Aqui/results/{DATA_FRAME_FEATURE_NAME}/top_words_scores/'

# Use makedirs() to create a new directory if it does not exists
if not os.path.exists(path):
  os.makedirs(path)


fig.write_image(path + "tws_preprocessed_lemma.png", format="png")

[Github link](https://github.com/punkmic/Topic-Modeling-Reclame-Aqui/blob/master/results/top_words_scores/tws_preprocessed_lemma.png?raw=true)

### **Visualize Topic Similarity**

This plot shows a similarity matrix by simply applying cosine similarities through those topic embeddings generate by BERTopic through both c-TF-IDF and embeddings. This matrix indicate how similar certain topics are to each other.

In [None]:
fig = topic_model.visualize_heatmap(n_clusters=20, width=800, height=800)
fig

#### **Save Similarity Matrix**

In [None]:
# Set the path to save 
path = f'/content/Topic-Modeling-Reclame-Aqui/results/{DATA_FRAME_FEATURE_NAME}/similarity_matrix/'

# Use makedirs() to create a new directory if it does not exists
if not os.path.exists(path):
  os.makedirs(path)


fig.write_image(path + "sm_preprocessed_lemma.png", format="png")

[Github link](https://github.com/punkmic/Topic-Modeling-Reclame-Aqui/blob/master/results/similarity_matrix/sm_preprocessed_lemma.png?raw=true)

### **Visualize Term Score Decline**

Topics are represented by a number of words starting with the best representative word. Each word is represented by a c-TF-IDF score. The higher the score, the more representative a word to the topic is. Since the topic words are sorted by their c-TF-IDF score, the scores slowly decline with each word that is added.

In [None]:
fig = topic_model.visualize_term_rank()
fig

#### **Save Term score decline per Topic**

In [None]:
# Set the path to save 
path = f'/content/Topic-Modeling-Reclame-Aqui/results/{DATA_FRAME_FEATURE_NAME}/term_score_decline_topic/'

# Use makedirs() to create a new directory if it does not exists
if not os.path.exists(path):
  os.makedirs(path)


fig.write_image(path + "tsdp_preprocessed_lemma.png", format="png")

[Github link](https://github.com/punkmic/Topic-Modeling-Reclame-Aqui/blob/master/results/term_socore_decline_topic/tsdp_preprocessed_lemma.png?raw=true)

### **Term search**

In [None]:
# Find topics that contains blackfriday term
similar_topics, similarity = topic_model.find_topics("blackfriday", top_n=5)

# Show similar topics
similar_topics

[171, 78, 141, 150, 140]

In [None]:
# Show a specific topic
topic_model.get_topic(64)

[('pontos', 0.023236990563106297),
 ('chamados', 0.021102505777063563),
 ('cnpj', 0.01938067590059961),
 ('configuração', 0.018322062042288475),
 ('chamado', 0.01712199764722379),
 ('certificado', 0.015844135313015942),
 ('plataforma', 0.01439598185441888),
 ('série', 0.01401854982042202),
 ('2022', 0.011437141741414058),
 ('mercado', 0.011420219941546402)]

### **"Hiperparameter optimization"**

In [136]:
%%capture
!pip install optuna
!pip install hdbscan
!pip install umap-learn
!pip install tabulate

In [87]:
import json
import optuna # for hyperparameter optimization
from hdbscan import HDBSCAN # for clustering
from sklearn.cluster import KMeans # for clustering
from umap import UMAP # for dimension reduction
from sklearn.decomposition import PCA # for dimension reduction
from sklearn.feature_extraction.text import CountVectorizer # for convert text documents to matrix of tokens count
from bertopic.vectorizers import ClassTfidfTransformer 
import ast # for convert str to tuple
import csv
from tabulate import tabulate

In [170]:
def save_documents(model, docs, label):
  try:
    fig = topic_model.visualize_documents(docs, hide_document_hover=True, hide_annotations=True, width=800, height=800)

    # Set the path to save 
    path = f'/content/Topic-Modeling-Reclame-Aqui/results/{DATA_FRAME_FEATURE_NAME}/documents_n_topics/'

    # Use makedirs() to create a new directory if it does not exists
    if not os.path.exists(path):
      os.makedirs(path)

    fig.write_image(path + f"document_n_topics_trial_{label}.png", format="png")
    fig.write_html(path + f"document_n_topics_trial_{label}.html")
  except Exception as error:
    print(error)

In [171]:
def save_topics(model, label):
  try:
    fig = model.visualize_topics(width=800, height=800)

    # Set the path to save 
    path = f'/content/Topic-Modeling-Reclame-Aqui/results/{DATA_FRAME_FEATURE_NAME}/intertopic_distance_map/'

    # Use makedirs() to create a new directory if it does not exists
    if not os.path.exists(path):
      os.makedirs(path)

    fig.write_image(path + f"intertopic_distance_map_trial_{label}.png", format="png")
    fig.write_html(path + f"intertopic_distance_map_trial_{label}.html")
  except Exception as error:
    print(error)

In [172]:
def save_hierarchy(model, label):
  try:
    fig = model.visualize_hierarchy(width=800, height=800)

    # Set the path to save 
    path = f'/content/Topic-Modeling-Reclame-Aqui/results/{DATA_FRAME_FEATURE_NAME}/hierarchical_clustering/'

    # Use makedirs() to create a new directory if it does not exists
    if not os.path.exists(path):
      os.makedirs(path)

    fig.write_image(path + f"hierarchical_clustering_trial_{label}.png", format="png")
    fig.write_html(path + f"hierarchical_clustering_trial_{label}.html")
  except Exception as error:
    print(error)

In [173]:
def save_top_words_scores(model, label):
  try:
    fig = model.visualize_barchart(top_n_topics=12, width=300, height=300)

    # Set the path to save 
    path = f'/content/Topic-Modeling-Reclame-Aqui/results/{DATA_FRAME_FEATURE_NAME}/top_words_scores/'

    # Use makedirs() to create a new directory if it does not exists
    if not os.path.exists(path):
      os.makedirs(path)

    fig.write_image(path + f"top_words_scores_trial_{label}.png", format="png")
    fig.write_html(path + f"top_words_scores_trial_{label}.html")
  except Exception as error:
    print(error) 

In [174]:
def save_similarity_matrix(model, label):
  try:
    fig = model.visualize_heatmap()

    # Set the path to save 
    path = f'/content/Topic-Modeling-Reclame-Aqui/results/{DATA_FRAME_FEATURE_NAME}/similarity_matrix/'

    # Use makedirs() to create a new directory if it does not exists
    if not os.path.exists(path):
      os.makedirs(path)

    fig.write_image(path + f"similarity_matrix_trial_{label}.png", format="png")
    fig.write_html(path + f"similarity_matrix_trial_{label}.html")
  except Exception as error:
    print(error)

In [175]:
def save_term_rank(model, label):
  try:
    fig = model.visualize_term_rank(width=800, height=800)

    # Set the path to save 
    path = f'/content/Topic-Modeling-Reclame-Aqui/results/{DATA_FRAME_FEATURE_NAME}/term_score_decline_topic/'

    # Use makedirs() to create a new directory if it does not exists
    if not os.path.exists(path):
      os.makedirs(path)

    fig.write_image(path + f"term_score_trial_{label}.png", format="png")
    fig.write_html(path + f"term_score_trial_{label}.html")
  except Exception as error:
    print(error) 

In [176]:
def save_hyperparameters(trial_params, label):
  try:
    # Set the path to save 
    path = f'/content/Topic-Modeling-Reclame-Aqui/results/{DATA_FRAME_FEATURE_NAME}/hyperparameters/'

    # Use makedirs() to create a new directory if it does not exists
    if not os.path.exists(path):
      os.makedirs(path)

    with open(path + f"hyperparameters_trial_{label}.json", "w") as f:
      f.write(json.dumps(trial_params))
  except Exception as error:
    print(error)


In [177]:
def save_freq_topics(model, label):
  try:
    # Print the most frequent topics
    freq = topic_model.get_topic_info()

    # Show the top 5 most frequent topics
    freq = freq.head(5)

    # Set the path to save 
    path = f'/content/Topic-Modeling-Reclame-Aqui/results/{DATA_FRAME_FEATURE_NAME}/frequent_topics/' 

    # Use makedirs() to create a new directory if it does not exists
    if not os.path.exists(path):
      os.makedirs(path)

    freq.to_html(path + f"freq_topics_trial_{label}.json")
    freq.to_csv(path + f"freq_topics_trial_{label}.csv", index=False)
  except Exception as error:
    print(error)

In [178]:
def save_model(model, label):
  try:
    # save model
    path = f'/content/Topic-Modeling-Reclame-Aqui/results/{DATA_FRAME_FEATURE_NAME}/models/'

    # Use makedirs() to create a new directory if it does not exists
    if not os.path.exists(path):
      os.makedirs(path)

    model.save(path + label)
  except Exception as error:
    print(error)

In [179]:
def save_coherence(model, topics, docs, label, clustering_model, reduction_model):
  try:
    # compute coherence score
    coherence_score = get_coherence(model, topics, docs)

    print(f"Coeherence score: {coherence_score}")

    path = f'/content/Topic-Modeling-Reclame-Aqui/results/{DATA_FRAME_FEATURE_NAME}/coherence/'

    if not os.path.exists(path):
      os.makedirs(path)

    with open(path + 'coherence_scores.csv', 'a', newline='') as f:
      # create a CSV writer object
      fieldnames = ['model', 'clustering', 'reduction', 'coherence_score']
      writer = csv.DictWriter(f, fieldnames=fieldnames)
      data = [{'model': label, 'clustering': clustering_model, 
              'reduction': reduction_model, 'coherence_score': round(coherence_score, 4)}]
      writer.writerows(data)
    return coherence_score
  except Exception as error:
    print(error)

In [180]:
# define the number of models to generate by optuna
NUMBER_OF_MODELS = 20

In [181]:
def optimizer(trial):

  scores = []

  clustering_option = trial.suggest_categorical('clustering_algorithm__name', ['HDBSCAN', 'K-means'])
  dimensionality_option = trial.suggest_categorical('reduction_algorithm__name', ['UMAP', 'PCA'])

  # BERTopic hyperparameters
  top_n_words = trial.suggest_int('bertopic__top_n_words', 5, 14)
  n_gram_range = ast.literal_eval(trial.suggest_categorical('bertopic__n_gram_range', ['(1,1)', '(1,2)', '(1,3)']))
  min_topic_size = trial.suggest_int('bertopic__min_topic_size', 15, 20)
  diversity = trial.suggest_float('bertopic__diversity', 0.1, 1.0)
  outlier_threshold = trial.suggest_float('bertopic__outliers_threshold', 0.04, 0.09)
  #nr_topics = trial.suggest_categorical('bertopic__nr_topics', ['None', '10'])

  #if nr_topics == 'None':
  #  nr_topics = None
  #else:
  #  nr_topics = int(nr_topics)

  nr_topics = 10


  if clustering_option == 'HDBSCAN':
    # HDBSCAN hyperparameters
    min_cluster_size = trial.suggest_int('hdbscan__min_cluster_size', 10, 14)
    cluster_selection_epsilon = trial.suggest_float('hdbscan__cluster_selection_epsilon', 0.1, 1.0)

    # create a new HDBSCAN model to cluster documents
    clustering_model = HDBSCAN(min_cluster_size=min_cluster_size, prediction_data=True)
  elif clustering_option == 'K-means':
    # K-means hyperparameters
    k_means_n_clusters = trial.suggest_int('k-means__n_cluster',6, 10)

    # create a new HDBSCAN model to cluster documents
    clustering_model = KMeans(n_clusters=k_means_n_clusters)


  if dimensionality_option == 'UMAP':
    
    # UMAP hyperparameters
    n_neighbors = trial.suggest_int('umap__n_neighbors', 13, 17)
    n_components = trial.suggest_int('umap__n_components', 2, 8)
    metric = trial.suggest_categorical('umap__metric', ['cosine', 'euclidean'])
    min_dist = trial.suggest_float('umap__min_dist', 0.1, 1.0)
    spread = trial.suggest_float('umap__spread', 0.1, 1.0)

    # create a new UMAP model to reduce dimension
    reduction_model = UMAP(n_neighbors=n_neighbors, metric=metric, random_state=42)
  elif dimensionality_option == 'PCA':
    
    # PCA hyperparameters
    pca_n_components = trial.suggest_int('pca__n_components', 4, 6)
    
    # create a new PCA model to reduce dimension
    reduction_model = PCA(n_components=pca_n_components, random_state=12) # k-Means, that does not produce any outliers at all


  # CountVectorizer hyperparameters 
  max_features = trial.suggest_int('vectorizer__max_features', 3000, 6000)

  # reduce the impact of frequent words.
  ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)

  # list of portuguese stop words + custom words
  stop_words = nltk.corpus.stopwords.words('portuguese') + custom_stop_words

  # create a new CountVectorizer to create a matrix of tokens count
  vectorizer_model = CountVectorizer(stop_words=stop_words, ngram_range=n_gram_range, max_features=max_features)


  # create a new BERTopic model using multilingual option
  model = BERTopic(language="multilingual", 
                   nr_topics=nr_topics,
                   calculate_probabilities=True, 
                   verbose=False,
                   top_n_words=top_n_words,
                   n_gram_range=n_gram_range,
                   min_topic_size=min_topic_size,
                   diversity=diversity,
                   umap_model=reduction_model,
                   hdbscan_model=clustering_model,
                   vectorizer_model=vectorizer_model,
                   ctfidf_model=ctfidf_model)
  
  label = trial.number
  params = trial.params
  
  # define model id
  model_id = f"model_trial_{label}"

  # train BERTopic model 
  topics, probs = model.fit_transform(docs)
  
  # save model
  save_model(model, model_id)

  # save hyperparameters
  save_hyperparameters(params, label)
  
  # save plots
  save_topics(model, label)
  save_documents(model, docs, label)
  save_hierarchy(model, label)
  save_term_rank(model, label)
  save_top_words_scores(model, label)

  # save model coherence score
  score =  save_coherence(model, topics, docs, model_id, clustering_option, dimensionality_option)
  scores.append(score)

  average = 0.0
  if trial.number == NUMBER_OF_MODELS:
    average = round(sum(scores) / len(scores), 2)

  return average

In [None]:
%%time

# create a new study
study = optuna.create_study(study_name='Bertopic')

# run the optmize function 
study.optimize(optimizer, n_trials=NUMBER_OF_MODELS)

[32m[I 2023-01-09 19:36:03,209][0m A new study created in memory with name: Bertopic[0m
2023-01-09 19:36:19,462 - BERTopic - Transformed documents to Embeddings
2023-01-09 19:36:19,678 - BERTopic - Reduced dimensionality
2023-01-09 19:36:20,678 - BERTopic - Clustered reduced embeddings
2023-01-09 19:36:26,034 - BERTopic - Reduced number of topics from 9 to 9


Coeherence score: 0.5561392775448547


[32m[I 2023-01-09 19:37:17,492][0m Trial 0 finished with value: 0.0 and parameters: {'clustering_algorithm__name': 'K-means', 'reduction_algorithm__name': 'PCA', 'bertopic__top_n_words': 8, 'bertopic__n_gram_range': '(1,3)', 'bertopic__min_topic_size': 16, 'bertopic__diversity': 0.4562744175324325, 'bertopic__outliers_threshold': 0.05100481037642743, 'k-means__n_cluster': 9, 'pca__n_components': 5, 'vectorizer__max_features': 4075}. Best is trial 0 with value: 0.0.[0m
2023-01-09 19:37:33,595 - BERTopic - Transformed documents to Embeddings
2023-01-09 19:37:33,817 - BERTopic - Reduced dimensionality
2023-01-09 19:37:34,871 - BERTopic - Clustered reduced embeddings
2023-01-09 19:37:37,349 - BERTopic - Reduced number of topics from 3 to 3


zero-size array to reduction operation maximum which has no identity


[32m[I 2023-01-09 19:38:23,174][0m Trial 1 finished with value: 0.0 and parameters: {'clustering_algorithm__name': 'HDBSCAN', 'reduction_algorithm__name': 'PCA', 'bertopic__top_n_words': 6, 'bertopic__n_gram_range': '(1,2)', 'bertopic__min_topic_size': 16, 'bertopic__diversity': 0.4901275493488505, 'bertopic__outliers_threshold': 0.06590339246979073, 'hdbscan__min_cluster_size': 12, 'hdbscan__cluster_selection_epsilon': 0.6033806928610854, 'pca__n_components': 5, 'vectorizer__max_features': 3788}. Best is trial 0 with value: 0.0.[0m


Coeherence score: 0.5758711391499514


2023-01-09 19:38:40,875 - BERTopic - Transformed documents to Embeddings
2023-01-09 19:38:56,121 - BERTopic - Reduced dimensionality
2023-01-09 19:38:56,397 - BERTopic - Clustered reduced embeddings
2023-01-09 19:39:01,718 - BERTopic - Reduced number of topics from 8 to 8


Coeherence score: 0.673734250086521


[32m[I 2023-01-09 19:39:54,872][0m Trial 2 finished with value: 0.0 and parameters: {'clustering_algorithm__name': 'K-means', 'reduction_algorithm__name': 'UMAP', 'bertopic__top_n_words': 9, 'bertopic__n_gram_range': '(1,3)', 'bertopic__min_topic_size': 15, 'bertopic__diversity': 0.5959581070394935, 'bertopic__outliers_threshold': 0.06527252914139782, 'k-means__n_cluster': 8, 'umap__n_neighbors': 16, 'umap__n_components': 8, 'umap__metric': 'cosine', 'umap__min_dist': 0.4173560809365089, 'umap__spread': 0.8383598273045656, 'vectorizer__max_features': 5531}. Best is trial 0 with value: 0.0.[0m
2023-01-09 19:40:11,134 - BERTopic - Transformed documents to Embeddings
2023-01-09 19:40:11,347 - BERTopic - Reduced dimensionality
2023-01-09 19:40:12,274 - BERTopic - Clustered reduced embeddings
2023-01-09 19:40:17,684 - BERTopic - Reduced number of topics from 9 to 9


#### **Merge of similar topics**

In [None]:
# TODO: select the best model and merge similar topic and save final model

#topic_model = BERTopic.load("best_model")

#### **Github**

In [1]:
! ssh-keygen -t rsa -b 4096
# Add github.com to our known hosts
! ssh-keyscan -t rsa github.com >> ~/.ssh/known_hosts
# Restrict the key permissions, or else SSH will complain.
! chmod go-rwx /root/.ssh/id_rsa

Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa): 
Created directory '/root/.ssh'.
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:RWE2SbTsv5hVrvePaLQX7rPtHG6jSTas3pJQ4PnBM0A root@3db2597eab85
The key's randomart image is:
+---[RSA 4096]----+
|        .EOo     |
|         B.o     |
|        . O      |
|         = *     |
|        S + + .  |
|         . ooo.  |
|          ..+B.o |
|           *B+O*.|
|          ++=B**O|
+----[SHA256]-----+
# github.com:22 SSH-2.0-babeld-4ce3b487


In [2]:
! cat /root/.ssh/id_rsa.pub

ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQDfAonhC6LaU7YdcLaIkIwh7MYOwQe3qIP6d8bmg5JPcs0qjmQ2VHGXFyCcB1lWy9nXAVHs3VMaGYFObriBWnJCqdfFMyzGZbSbpeH+TcyD8XXN8vYDqrtkJuBzlQYOmg+8+iSAt3F1wh2F4eEPbz6j9HA7M8BiYE0VHvCSfmaV2rYZ3NTkxjbOoqsuOAPDI+zSPja0vjKl974ybPw0jM7o7JzOBiCAg02GxjQrn/lssiMXhwiYHR7UyqhZinOKb7XqPAifJUj/aO3BhlyUubuTG41WT+dcMS0FqQ8p06aoifUNvYkGhsqTuyAWd7pfz0S1QR9k416dQ3UYNph+ffKhGZPMgkWrlbXHv0eLe0dTVO2Ay2ayMyFdFETtfSJd68bS9oHkP89GWWTw+41dBQtMeLevUeys0s0lHeKDjrrKSwoo+gzAxFPLaDhkTJNFnoKGwgHFN/uananIFWQN5Rppn/dq6Whd1JqlrvXBzU3l6rE5NPe9pP0m3Hzc2evH7jnVQTRmeIKunbngJZZY7eQ7cqXeti68qJSOnJ0gGjnTJ3iSA1AaefXTmvZ5iOc0eNKZV6bMFzaRcG+W3fPk7xet+kUukjhSc3Y4fBtM1I65a6CbIDuG6jDwg2UsFPjKCImVNqFJPkQoY8M+gJrDkq5MFRV1dZOgVlYCNyI0ivi/DQ== root@3db2597eab85


In [3]:
!git config --global user.email "mattheus_ribeiro@outlook.com"
!git config --global user.name "punkmic"

In [4]:
!ssh -T git@github.com

Hi punkmic! You've successfully authenticated, but GitHub does not provide shell access.


In [5]:
!git clone git@github.com:punkmic/Topic-Modeling-Reclame-Aqui.git

Cloning into 'Topic-Modeling-Reclame-Aqui'...
remote: Enumerating objects: 16281, done.[K
remote: Counting objects: 100% (2100/2100), done.[K
remote: Compressing objects: 100% (1832/1832), done.[K
remote: Total 16281 (delta 304), reused 1950 (delta 234), pack-reused 14181[K
Receiving objects: 100% (16281/16281), 127.01 MiB | 7.06 MiB/s, done.
Resolving deltas: 100% (1452/1452), done.
Checking out files: 100% (14808/14808), done.


In [None]:
%cd /content/Topic-Modeling-Reclame-Aqui/

In [None]:
!git add ./results/documents

In [None]:
!git status

In [None]:
!git commit -m "Adding preprocessed BERTopic result"

In [None]:
!git push origin master

In [None]:
# run this command to push a new version of this notebook in case you have saved the notebook in github and it is outdate 
!git stash
!git pull
!git stash pop