<a href="https://colab.research.google.com/github/punkmic/Topic-Modeling-Reclame-Aqui/blob/master/Topic_Modeling_with_BERTopic_Reclame_aqui.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Topic Modeling with BERTopic - Reclame Aqui**

BERTopic is a topic modeling technique that leverages 🤗 transformers and a custom class-based TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions (https://maartengr.github.io/BERTopic/index.html).

## **Import dependencies**

In [None]:
import pandas as pd # for data manipulation
import os # for interacting with the operating system
import nltk # for natural language processing
import string # for string manipulation
import re # for regular expressions
import matplotlib.pyplot as plt # for visualization
try:
  from bertopic import BERTopic # for topic modeling
  from spellchecker import SpellChecker # for spell check
  from sentence_transformers import SentenceTransformer # for embeddings
  from umap import UMAP # for dimension reduction
  from hdbscan import HDBSCAN # for clustering
except ImportError:
  !pip install pyspellchecker
  !pip install sentence-transformers
  !pip install umap-learn
  !pip install hdbscan
  !pip install bertopic
  os.kill(os.getpid(), 9)

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [1]:
nltk.download("punkt")

NameError: ignored

In [3]:
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [4]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [5]:
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

## **Load data from [Github](https://github.com/punkmic/Topic-Modeling-Reclame-Aqui.git)**

In [6]:
!git clone https://github.com/punkmic/Topic-Modeling-Reclame-Aqui.git

Cloning into 'Topic-Modeling-Reclame-Aqui'...
remote: Enumerating objects: 16133, done.
remote: Counting objects: 100% (1952/1952), done.
remote: Compressing objects: 100% (1753/1753), done.
remote: Total 16133 (delta 234), reused 1808 (delta 173), pack-reused 14181
Receiving objects: 100% (16133/16133), 118.15 MiB | 29.05 MiB/s, done.
Resolving deltas: 100% (1382/1382), done.
Checking out files: 100% (14795/14795), done.


## **Run this cell to update files from remote repository**

In [7]:
# Change directory
%cd /content/Topic-Modeling-Reclame-Aqui 

# Update files from remote repository
!git pull 

# Return to work directory
%cd ..

# Check current directory
!pwd

/content/Topic-Modeling-Reclame-Aqui
Already up to date.
/content
/content


## **Read the CSV file into a DataFrame with read_csv()**

In [8]:
# Set the path to data
path_csv = "/content/Topic-Modeling-Reclame-Aqui/corpus.csv"

# Read the CSV file using the read_csv method
df = pd.read_csv(path_csv)

# Print the first 5 rows of the DataFrame
df.head(5)

Unnamed: 0,title,text
0,Pedido Cancelado sem justificativa após uma se...,Eu estava pesquisando bastante uma nova TV par...
1,Pedido cancelado,Eu sinceramente estou decepcionada com o Amazo...
2,Cobrança indevida,Cancelei meu plano antes de terminar o período...
3,PEDIDO REINCIDENTE,"OLHA FIZ COMPRA VEIO ERRADA, E VEIO ERRADO NOV..."
4,Assinatura para vender na amazon Brasil,Eu me inscrevi na Amazon para realizar vendas ...


This dataset contains just two columns: title and text.

In [9]:
print(df.shape)

(12760, 2)


There are 12760 rows in this dataset. To make sure that every complaint appears only once, let's drop the rows whose text is duplicated.

In [10]:
print(f"Shape before remove duplicates: {df.shape}")

# Use the drop_duplicates method to drop rows with duplicated text
df = df.drop_duplicates(subset="text")

print(f"Shape after remove duplicates: {df.shape}")

Shape before remove duplicates: (12760, 2)
Shape after remove duplicates: (10510, 2)
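To make the effect of `subset="text"` concrete, here is a minimal sketch on a made-up three-row DataFrame (the rows are invented for illustration and are not taken from the corpus). Only the text column is compared, so two rows with different titles but identical text count as duplicates, and the first one is kept.

```python
import pandas as pd

# Toy data: rows 0 and 2 share the same text but have different titles
toy = pd.DataFrame({
    "title": ["Pedido cancelado", "Cobrança indevida", "Cancelamento"],
    "text": [
        "meu pedido foi cancelado",
        "fui cobrado duas vezes",
        "meu pedido foi cancelado",
    ],
})

# subset="text" compares only the text column; the default keep="first"
# retains the first occurrence of each duplicated text
deduped = toy.drop_duplicates(subset="text")

print(deduped["title"].tolist())
```

Row 2 is dropped even though its title is unique, which is the behaviour the cell above relies on.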


### **Now let's lowercase every cell in the DataFrame, remove unwanted text and join the two columns**

In [11]:
# apply the str.lower() method to each element in the dataframe
df = df.applymap(str.lower)

# Use the replace() method to remove the moderation marker (the text is already lowercase at this point)
df = df.replace(re.compile(r'\[editado pelo reclame aqui\]|editado pelo reclame aqui'), '')

# join columns
df["documents"] = df["title"] + " " + df["text"]

# Reset the index after dropping duplicates and discard the old one
df.reset_index(inplace = True, drop = True)

df.head()

Unnamed: 0,title,text,documents
0,pedido cancelado sem justificativa após uma se...,eu estava pesquisando bastante uma nova tv par...,pedido cancelado sem justificativa após uma se...
1,pedido cancelado,eu sinceramente estou decepcionada com o amazo...,pedido cancelado eu sinceramente estou decepc...
2,cobrança indevida,cancelei meu plano antes de terminar o período...,cobrança indevida cancelei meu plano antes de ...
3,pedido reincidente,"olha fiz compra veio errada, e veio errado nov...",pedido reincidente olha fiz compra veio errada...
4,assinatura para vender na amazon brasil,eu me inscrevi na amazon para realizar vendas ...,assinatura para vender na amazon brasil eu me ...
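The marker removal can be checked on a single string. This is a small sketch with an invented sentence; it uses the same alternation as the cell above, written as a raw string, and shows that deleting the marker leaves a double space behind, which a whitespace normalization step (an addition for illustration, not part of the notebook) can collapse.

```python
import re

# Same alternation as in the cell above, written as a raw string
MARKER = re.compile(r"\[editado pelo reclame aqui\]|editado pelo reclame aqui")

# Invented example sentence, already lowercased
sample = "liguei para o [editado pelo reclame aqui] e ninguém atendeu"

# Deleting the marker leaves the spaces on both sides in place
cleaned = MARKER.sub("", sample)

# Optional extra step: collapse repeated whitespace
normalized = " ".join(cleaned.split())

print(repr(cleaned))
print(repr(normalized))
```

The leftover whitespace is harmless here because the tokenizer used later splits on whitespace anyway.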


## **Preprocessing**

### **Tokenization**

Tokenization breaks text down into its component parts (tokens), such as words and punctuation marks.

In [12]:
WORD_TOKENIZER = nltk.tokenize.word_tokenize
def tokenize(text, lowercase=True):
  if lowercase:
    text = text.lower()
  return WORD_TOKENIZER(text, language="portuguese")
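To illustrate what a tokenizer produces without downloading any NLTK data, here is a simplified regex-based sketch. It is not NLTK's algorithm: `simple_tokenize` is a hypothetical helper that only separates runs of word characters from individual punctuation marks.

```python
import re

def simple_tokenize(text, lowercase=True):
    """Split text into word tokens and single punctuation tokens."""
    if lowercase:
        text = text.lower()
    # \w+ matches runs of letters, digits and underscores (accented letters
    # included); [^\w\s] matches one punctuation character at a time
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Pedido cancelado, sem justificativa após uma semana!")
print(tokens)
```

NLTK's `word_tokenize` handles many more cases (abbreviations, contractions, language-specific sentence splitting), so the notebook keeps using it for the real corpus.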

### **Stem** 

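Stem the tokens. This step removes morphological affixes and normalizes words to standardized stem forms. Note that the Porter stemmer targets English; this helper is defined here but is not called in the preprocessing pipeline below.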

In [13]:
STEMMER = nltk.PorterStemmer()
def stem(tokens):
  return [STEMMER.stem(token) for token in tokens]

### **Lemmatize**

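Lemmatize the tokens. Lemmatization retains more natural forms than stemming, but it assumes all tokens are nouns unless they are passed as (word, pos) tuples. Note that the WordNet lemmatizer is built on the English WordNet, so most Portuguese tokens are likely to pass through unchanged.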

In [14]:
LEMMATIZER = nltk.WordNetLemmatizer()
def lemmatize(tokens):
  lemmas = []
  for token in tokens:
    if isinstance(token, str):
      # For str token
      lemmas.append(LEMMATIZER.lemmatize(token)) 
    else:
      # For tuple of (str, pos)
      lemmas.append(LEMMATIZER.lemmatize(*token)) 
  return lemmas

### **Remove stopwords**

Stop words are things like articles and conjunctions that usually do not offer a lot of value in an analysis.

In [15]:
def remove_stopwords(tokens, stopwords=None):

  # Use the default stop words if none is passed
  if stopwords is None:
    stopwords = nltk.corpus.stopwords.words("portuguese")
  
  # Filter the list of tokens to exclude the stop word tokens
  return [token for token in tokens if token not in stopwords]
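The filter above checks every token against a list, which scans the list each time. A minimal sketch of the same idea with a set is shown below; the stop word sample is a small hand-picked assumption for illustration (the notebook uses the full NLTK list), and `remove_stopwords_fast` is a hypothetical name.

```python
# Hand-picked sample of Portuguese stop words (illustrative assumption;
# the notebook loads the full list from NLTK)
SAMPLE_STOPWORDS = frozenset({"de", "a", "o", "que", "e", "para", "com", "não"})

def remove_stopwords_fast(tokens, stopwords=SAMPLE_STOPWORDS):
    """Keep only the tokens that are not stop words."""
    # Membership tests on a set take constant time on average
    return [token for token in tokens if token not in stopwords]

tokens = ["o", "pedido", "não", "chegou", "e", "a", "loja", "sumiu"]
print(remove_stopwords_fast(tokens))
```

With roughly ten thousand documents the difference is modest, but converting the NLTK list with `set(...)` once, outside the loop, would avoid rebuilding and rescanning it for every document.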

### **Remove hyperlinks**

Removes http/s links from the tokens.

In [16]:
def remove_links(tokens):
  # Drop tokens that start with "http://" or "https://"
  return [token for token in tokens 
          if not token.startswith("http://")
          and not token.startswith("https://")]

### **Remove numbers**

In [17]:
def remove_numbers(tokens):
  # Filter number tokens
  return [token for token in tokens if not token.isdigit()]

### **Remove date**

In [18]:
def remove_date(tokens):
  # Compile a regular expression to match dates in the format dd/mm or dd/mm/yyyy
  date_regex = re.compile(r'\d{2}/\d{2}(/\d{4})?')

  # Use the regex to find all the tokens that match the date pattern
  dates = [token for token in tokens if date_regex.fullmatch(token)]

  # Filter the list of tokens to exclude the date tokens
  filtered_tokens = [token for token in tokens if token not in dates]

  # Return the filtered tokens
  return filtered_tokens
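It helps to see which tokens the date pattern accepts. The sketch below reuses the same regular expression on a few invented tokens; because `fullmatch` must consume the whole token, single-digit days and two-digit years are not treated as dates.

```python
import re

# Same pattern as above: dd/mm, optionally followed by /yyyy
DATE_REGEX = re.compile(r"\d{2}/\d{2}(/\d{4})?")

def remove_date(tokens):
    """Drop tokens that are entirely a dd/mm or dd/mm/yyyy date."""
    return [token for token in tokens if not DATE_REGEX.fullmatch(token)]

tokens = ["pedido", "10/12", "10/12/2022", "1/2", "10/12/22"]
kept = remove_date(tokens)
print(kept)
```

If dates such as 1/2 or 10/12/22 should also be removed, the quantifiers in the pattern would need to be relaxed, for example to `\d{1,2}` for the day and month.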

### **Remove punctuation**

In [19]:
def remove_punctuation(tokens,
                       strip_mentions=False,
                       strip_hashtags=False,
                       strict=False):

    # Use the sub method to strip commas, exclamation marks, question marks and periods from each token
    tokens = [re.sub(r"[,!?.]", "", t) for t in tokens ] 

    # Remove punctuation
    #tokens = [t for t in tokens if t not in string.punctuation]

    # Remove @ symbol from left side of tokens
    if strip_mentions:
        tokens = [t.lstrip('@') for t in tokens]

    # Remove # symbol from left side of tokens
    if strip_hashtags:
        tokens = [t.lstrip('#') for t in tokens]

    return tokens

### **Remove short tokens**

In [20]:
def remove_short_tokens(tokens):
  # Filter the list of tokens to exclude tokens that are shorter than four letters
  filtered_tokens = [token for token in tokens if len(token) >= 4]

  # Return the filtered tokens
  return filtered_tokens

### **Correction of spelling errors and abbreviations**

In [21]:
def check_spell_errors(text):
  # Create a Portuguese SpellChecker object
  spell = SpellChecker(language='pt')

  # Ask for the most likely correction. Note that correction() is designed
  # for a single word, so pass individual tokens if per-word fixes are needed
  corrected_text = spell.correction(text)

  # If no correction is available, use the original text
  if corrected_text is None:
     corrected_text = text
  
  # Return the corrected text
  return corrected_text

In [22]:
def preprocessing(documents):
  corpus = []

  # process each document and append to corpus list
  for i, text in enumerate(documents):
    if i % 1000 == 0:
      print(f"Processed {i} documents")
    text = check_spell_errors(text)
    tokens = tokenize(text)
    tokens = remove_links(tokens)
    tokens = remove_punctuation(tokens, strip_mentions=True, strip_hashtags=True)
    tokens = remove_numbers(tokens)
    tokens = remove_date(tokens)
    tokens = remove_short_tokens(tokens)
    tokens = remove_stopwords(tokens)
    tokens = lemmatize(tokens) 
    corpus.append(' '.join(tokens))
  return corpus

In [23]:
# Print the first document before and after pre-processing it
corpus = preprocessing(df.documents) 
print(df.documents[0])
print()
corpus[0]

Processed 0 documents


NameError: ignored

## **Training a BERTopic Model**

The BERTopic algorithm has several advantages over other topic modeling algorithms. It is able to handle sparse data, it is scalable to large datasets, and it is able to learn topics that are not well-defined or are overlapping.

Since our data is in Portuguese, we will set the language to multilingual.

### **Enabling the GPU**

We will use the GPU provided by Colab to accelerate model training. To enable the GPU for the notebook:
1. Navigate to Edit -> Notebook Settings
2. Select GPU from the Hardware Accelerator drop-down

In [24]:
# verify that the GPU is enabled
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Sat Dec 10 23:48:58 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   42C    P0    26W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

Create a new BERTopic model and train it. For multilingual documents, BERTopic uses the paraphrase-multilingual-MiniLM-L12-v2 model by default. For other models, see [BERTopic sentence transformers](https://maartengr.github.io/BERTopic/getting_started/embeddings/embeddings.html#sentence-transformers).

In [None]:
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
embeddings = model.encode(corpus, show_progress_bar=True)

In [45]:
# Create a new BERTopic model using multilingual option
topic_model = BERTopic(embedding_model = "paraphrase-multilingual-MiniLM-L12-v2", language="multilingual", calculate_probabilities=True, verbose=True)

# Train model 
topics, probs = topic_model.fit_transform(corpus)

Batches:   0%|          | 0/329 [00:00<?, ?it/s]

2022-12-11 00:22:00,882 - BERTopic - Transformed documents to Embeddings
2022-12-11 00:22:13,116 - BERTopic - Reduced dimensionality
2022-12-11 00:22:15,754 - BERTopic - Clustered reduced embeddings


BERTopic works in three main steps: 


1.   Documents are first converted to numeric vectors (embeddings) with a sentence-transformer model, which captures the context in which words appear.
2.   Similar documents are then grouped into clusters. BERTopic first lowers the dimensionality of the embeddings with UMAP, a dimensionality-reduction algorithm, and then clusters the reduced embeddings with the density-based algorithm HDBSCAN.
3.   BERTopic extracts topics from the clusters using a class-based TF-IDF (c-TF-IDF) score, which measures the importance of each word within a cluster. Each topic is then described by its highest-scoring words.

For more information check this link [BERTopic](https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6)
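The third step can be sketched in plain Python. The BERTopic documentation describes the c-TF-IDF score of a term in a class as its frequency in that class multiplied by log(1 + A / f), where A is the average number of words per class and f is the term's frequency across all classes. The function below is a simplified illustration under that description, with invented toy clusters and a hypothetical helper name; BERTopic's own implementation works on sparse matrices and offers further options.

```python
import math
from collections import Counter

def c_tf_idf(clusters):
    """clusters maps a cluster id to a list of tokenized documents."""
    # Treat every cluster as one long document and count its terms
    term_counts = {
        cid: Counter(token for doc in docs for token in doc)
        for cid, docs in clusters.items()
    }
    # f: frequency of each term across all clusters
    total = Counter()
    for counts in term_counts.values():
        total.update(counts)
    # A: average number of words per cluster
    avg_words = sum(total.values()) / len(term_counts)

    scores = {}
    for cid, counts in term_counts.items():
        n_words = sum(counts.values())
        scores[cid] = {
            term: (count / n_words) * math.log(1 + avg_words / total[term])
            for term, count in counts.items()
        }
    return scores

# Invented toy clusters; "produto" appears in both
clusters = {
    0: [["entrega", "atrasada", "produto"], ["entrega", "prazo"]],
    1: [["conta", "suspensa", "produto"], ["conta", "senha"]],
}
scores = c_tf_idf(clusters)
for cid, terms in scores.items():
    print(cid, sorted(terms, key=terms.get, reverse=True)[:3])
```

In this toy example "produto" scores lower than "atrasada" inside cluster 0 even though both occur once there, because "produto" also occurs in the other cluster.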



## **Extracting Topics**

In [42]:
# print the most frequent topics
freq = topic_model.get_topic_info()
freq.head(5)

Unnamed: 0,Topic,Count,Name
0,-1,3976,-1_produto_compra_pedido_recebi
1,0,600,0_conta_suspensa_senha_acessar
2,1,535,1_amazon_transportadora_entregue_entrega
3,2,460,2_casas_bahia_transportadora_entrega
4,3,410,3_entrega_atraso_atrasada_prazo


The table above shows the five most frequent topics and the words BERTopic extracted for each of them. Topic -1 collects all outliers and should be ignored.

In [43]:
# show the top words of topic 1
topic_model.get_topic(1)

[('amazon', 0.03766850116243189),
 ('transportadora', 0.015238758364451608),
 ('entregue', 0.012792559309914473),
 ('entrega', 0.011923976904959275),
 ('produto', 0.011880702842787161),
 ('pedido', 0.011260814728494996),
 ('empresa', 0.010297151568065207),
 ('reembolso', 0.009938759655023615),
 ('site', 0.009658567920815474),
 ('contato', 0.00884245284652742)]

**Note:** BERTopic is stochastic, which means that the topics might differ across runs. This is mostly due to the stochastic nature of UMAP.
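One common way to make runs repeatable is to pass BERTopic a UMAP instance with a fixed `random_state`. The sketch below assumes the `umap-learn` and `bertopic` packages installed at the top of this notebook; the parameter values are assumptions chosen to resemble BERTopic's documented defaults, and `build_reproducible_model` is a hypothetical helper name.

```python
# Assumed UMAP settings; only random_state is essential for repeatability
UMAP_PARAMS = dict(
    n_neighbors=15,
    n_components=5,
    min_dist=0.0,
    metric="cosine",
    random_state=42,
)

def build_reproducible_model(language="multilingual"):
    """Return a BERTopic model whose UMAP step uses a fixed seed."""
    # Imported inside the function so that defining it does not require
    # the packages to be installed yet
    from bertopic import BERTopic
    from umap import UMAP

    umap_model = UMAP(**UMAP_PARAMS)
    return BERTopic(umap_model=umap_model, language=language)
```

Calling `build_reproducible_model().fit_transform(corpus)` would then replace the model creation cell above. Fixing the seed can make UMAP slower, since it limits parallelism.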

## **Visualization**

### **Intertopic Distance Map**

This graph shows the distance between topics and helps us understand how close the topics are to one another.

In [44]:
topic_model.visualize_topics()

### **Visualize Topic Hierarchy**

The topics that were created can be hierarchically reduced. This visualization shows how the topics relate to one another.

In [29]:
topic_model.visualize_hierarchy(top_n_topics=50)

### **Visualize Terms**

We can visualize the selected terms for a few topics by creating bar charts out of the c-TF-IDF scores for each topic representation.

In [30]:
topic_model.visualize_barchart(top_n_topics=8)

### **Visualize Topic Similarity**

This plot shows a similarity matrix obtained by applying cosine similarity to the topic embeddings generated by BERTopic through both c-TF-IDF and embeddings. The matrix indicates how similar topics are to each other.

In [31]:
topic_model.visualize_heatmap(n_clusters=20, width=1000, height=1000)
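The cosine similarity behind this heatmap can be sketched with the standard library alone. The three vectors below are invented 3-dimensional stand-ins for topic embeddings (the real ones have hundreds of dimensions), and the helper names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(vectors):
    """Pairwise cosine similarities, one row per vector."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]

# Made-up "topic embeddings": the first two point in similar directions,
# the third is orthogonal to both
topic_vectors = [
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.0, 1.0],
]
matrix = similarity_matrix(topic_vectors)
for row in matrix:
    print([round(value, 2) for value in row])
```

The diagonal is always 1 because every topic is identical to itself, which is why the heatmap shows its darkest cells along the diagonal.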

### **Visualize Term Score Decline**

Topics are represented by a number of words, starting with the most representative word. Each word has a c-TF-IDF score. The higher the score, the more representative the word is of the topic. Since the topic words are sorted by their c-TF-IDF score, the scores slowly decline with each word that is added.

In [32]:
topic_model.visualize_term_rank()
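The decline can also be inspected numerically. This small sketch uses the ten scores printed for topic 1 earlier in this notebook and computes the drop between consecutive ranks; the variable names are illustrative.

```python
# c-TF-IDF scores of topic 1, copied from the get_topic(1) output above
scores = [
    0.03766850116243189,   # amazon
    0.015238758364451608,  # transportadora
    0.012792559309914473,  # entregue
    0.011923976904959275,  # entrega
    0.011880702842787161,  # produto
    0.011260814728494996,  # pedido
    0.010297151568065207,  # empresa
    0.009938759655023615,  # reembolso
    0.009658567920815474,  # site
    0.00884245284652742,   # contato
]

# Drop in score from each rank to the next one
drops = [current - following for current, following in zip(scores, scores[1:])]

for rank, drop in enumerate(drops, start=1):
    print(f"rank {rank} -> {rank + 1}: {drop:.5f}")
```

For this topic the largest drop is between the first and second word, after which the curve flattens, which is the shape the term rank plot displays for each topic.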

## **Term search**

In [33]:
# Find the topics most similar to the search term "blackfriday"
similar_topics, similarity = topic_model.find_topics("blackfriday", top_n=5)

# Show similar topics
similar_topics

[54, 55, 46, 52, 47]

In [34]:
# Show a specific topic
topic_model.get_topic(54)

[('black', 0.06985306625981177),
 ('friday', 0.06430106595817871),
 ('promoção', 0.03844630637165133),
 ('cancelada', 0.02956269817819524),
 ('valor', 0.02940957084471321),
 ('promocional', 0.028640942476590735),
 ('preço', 0.02752816202495268),
 ('compra', 0.026177994729193667),
 ('divergencia', 0.02376887794088174),
 ('comprar', 0.02322466014033331)]

### **Topic Reduction**

In [35]:
#new_topic_model = topic_model.reduce_topics(df.documents, nr_topics=40)
#new_topics = new_topic_model.topics_
#new_probs = new_topic_model.probabilities_

In [36]:
#new_topic_model.visualize_topics()