<a href="https://colab.research.google.com/github/punkmic/Topic-Modeling-Reclame-Aqui/blob/master/Topic_Modeling_with_BERTopic_Reclame_aqui.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Topic Modeling with BERTopic - Reclame Aqui**

BERTopic is a topic modeling technique that leverages 🤗 transformers and a custom class-based TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions.

## **Import dependencies**

In [None]:
import pandas as pd # for reading the csv file and manipulating data
import csv
import os
import nltk
import string
import re
try:
  from bertopic import BERTopic # for topic modeling
  from enelvo.normaliser import Normaliser # for spelling errors and abbreviations
except ImportError:
  # install the missing packages, then kill the process so Colab restarts the runtime
  !pip install enelvo --no-deps numpy
  !pip install bertopic
  os.kill(os.getpid(), 9)

In [None]:
nltk.download("punkt")

In [None]:
nltk.download("stopwords")

In [None]:
nltk.download('wordnet')

In [None]:
nltk.download('omw-1.4')

## **Prepare data**

The following steps will be done:

1- Clone the repository from GitHub

2- Load the CSV file as a DataFrame object

3- Remove duplicate rows

4- Join the title and body columns

In [None]:
!git clone https://github.com/punkmic/Topic-Modeling-Reclame-Aqui.git

In [None]:
# run this cell to update files from remote repository
%cd /content/Topic-Modeling-Reclame-Aqui 
!git pull 
%cd ..
!pwd

In [None]:
# load csv 
path_csv = "/content/Topic-Modeling-Reclame-Aqui/corpus.csv"
df = pd.read_csv(path_csv)
df.head()

In [None]:
# remove duplicates
print(f"Shape before removing duplicates: {df.shape}")
df = df.drop_duplicates(subset="text")
print(f"Shape after removing duplicates: {df.shape}")

In [None]:
df["documents"] = df["title"] + " " + df["text"]
df.head()

In [None]:
# drop rows whose combined document is missing (title or text was NaN)
df = df.dropna(subset=["documents"])
df.reset_index(inplace=True, drop=True)
df.shape

## **Preprocessing**

### **Tokenization**

Tokenization breaks text down into its component parts (tokens).

In [None]:
WORD_TOKENIZER = nltk.tokenize.word_tokenize

def tokenize(text, lowercase=True):
  if lowercase:
    text = text.lower()
  return WORD_TOKENIZER(text, language="portuguese")

### **Stem** 

Stem the tokens. This step removes morphological affixes and normalizes words to standardized stem forms.

In [None]:
STEMMER = nltk.PorterStemmer()

def stem(tokens):
  return [STEMMER.stem(token) for token in tokens]

### **Lemmatize**

Lemmatize the tokens. Lemmatization retains more natural forms than stemming, but assumes all tokens are nouns unless they are passed as (word, pos) tuples.

In [None]:
LEMMATIZER = nltk.WordNetLemmatizer()
def lemmatize(tokens):
  lemmas = []
  for token in tokens:
    if isinstance(token, str):
      lemmas.append(LEMMATIZER.lemmatize(token)) # for str token
    else:
      lemmas.append(LEMMATIZER.lemmatize(*token)) # for (word, pos) tuple
  return lemmas

### **Remove stopwords**

Stop words are things like articles and conjunctions that usually do not offer a lot of value in an analysis.

In [None]:
def remove_stopwords(tokens, stopwords=None):
  if stopwords is None:
    stopwords = nltk.corpus.stopwords.words("portuguese")
  return [token for token in tokens if token not in stopwords]
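
Because the stopword filter runs once per document, it can help to load the stopword list a single time and pass it in as a set, so that each membership test is a hash lookup. A minimal sketch, assuming a tiny hand-written set: `PT_STOPWORDS` and `filter_stopwords` are illustrative names, and the set stands in for NLTK's Portuguese list.

```python
# Illustrative stand-in for set(nltk.corpus.stopwords.words("portuguese")).
PT_STOPWORDS = {"a", "o", "de", "que", "e", "para", "não"}

def filter_stopwords(tokens, stopwords):
    # Membership in a set is a hash lookup, so the cost does not grow with the list size.
    return [token for token in tokens if token not in stopwords]

tokens = ["o", "produto", "não", "chegou", "para", "entrega"]
filtered = filter_stopwords(tokens, PT_STOPWORDS)
print(filtered)  # ['produto', 'chegou', 'entrega']
```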

### **Remove hyperlinks**

Removes http/s links from the tokens.

In [None]:
def remove_links(tokens):
  return [token for token in tokens 
          if not token.startswith("http://")
          and not token.startswith("https://")]
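
Depending on the tokenizer, a URL may be split into several tokens before `remove_links` sees it, in which case the prefix check would not match. One hedged alternative is to strip links from the raw text before tokenizing. The helper name `strip_links` and the regular expression are assumptions for illustration, not part of the original pipeline.

```python
import re

# Matches "http://" or "https://" followed by any run of non-whitespace characters.
URL_PATTERN = re.compile(r"https?://\S+")

def strip_links(text):
    # Replace each URL with a space, then collapse repeated whitespace.
    return " ".join(URL_PATTERN.sub(" ", text).split())

cleaned = strip_links("Veja o pedido em https://loja.example/pedido/1 e responda")
print(cleaned)  # Veja o pedido em e responda
```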

### **Remove numbers**

In [None]:
def remove_numbers(tokens):
  return [token for token in tokens if not token.isdigit()]

### **Remove words with numbers**

Remove words that contain numbers, such as postal codes.

In [None]:
def remove_words_numbers(tokens):
  # keep only the tokens that contain no digit at all
  return [token for token in tokens if not re.search(r'\d', token)]
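
The two number filters are meant to behave differently: `str.isdigit` only matches tokens made up entirely of digits, while a search for `\d` also catches mixed tokens such as postal codes. A small sketch of the intended behavior on made-up tokens (the function names are local to this example):

```python
import re

def drop_pure_numbers(tokens):
    # Removes tokens that consist only of digits, e.g. "12345".
    return [t for t in tokens if not t.isdigit()]

def drop_tokens_with_digits(tokens):
    # Removes any token that contains at least one digit, e.g. "cep01310".
    return [t for t in tokens if not re.search(r"\d", t)]

tokens = ["pedido", "12345", "cep01310", "r2d2", "entrega"]
only_pure_removed = drop_pure_numbers(tokens)
all_digit_tokens_removed = drop_tokens_with_digits(tokens)
print(only_pure_removed)         # ['pedido', 'cep01310', 'r2d2', 'entrega']
print(all_digit_tokens_removed)  # ['pedido', 'entrega']
```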


### **Remove punctuation**

In [None]:
def remove_punctuation(tokens,
                       strip_mentions=False,
                       strip_hashtags=False,
                       strict=False):

    tokens = [re.sub(r'(\W)(?=\1)', '', t) for t in tokens ] # collapse repeated punctuation, e.g. ".." -> "."
    tokens = [t for t in tokens if t not in string.punctuation]
    if strip_mentions:
        tokens = [t.lstrip('@') for t in tokens]
    if strip_hashtags:
        tokens = [t.lstrip('#') for t in tokens]
    if strict:
        cleaned = []
        for t in tokens:
            cleaned.append(
                t.translate(str.maketrans('', '', string.punctuation)).strip())
        tokens = [t for t in cleaned if t]
    return tokens
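
To see what each stage of `remove_punctuation` contributes, the sketch below applies the same operations one at a time to a handful of made-up tokens. It uses only the standard library; the variable names are illustrative.

```python
import re
import string

tokens = ["olá", "!!", "...", "@loja", "#promo", "não-entregue", ","]

# Stage 1: collapse runs of the same non-word character ("!!" -> "!", "..." -> ".").
collapsed = [re.sub(r"(\W)(?=\1)", "", t) for t in tokens]

# Stage 2: drop tokens that are now a single punctuation character.
kept = [t for t in collapsed if t not in string.punctuation]

# Optional stage: strip leading "@" and "#" (mentions and hashtags).
stripped = [t.lstrip("@").lstrip("#") for t in kept]

# Strict stage: delete every remaining punctuation character inside the tokens.
table = str.maketrans("", "", string.punctuation)
strict = [t.translate(table).strip() for t in stripped]

print(kept)      # ['olá', '@loja', '#promo', 'não-entregue']
print(stripped)  # ['olá', 'loja', 'promo', 'não-entregue']
print(strict)    # ['olá', 'loja', 'promo', 'nãoentregue']
```

Note that the strict stage joins hyphenated words into a single token, which may or may not be desirable for the corpus at hand.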

### **Remove tokens with 4 or fewer characters**

In [None]:
def remove_short_tokens(tokens):
  return [token for token in tokens if len(token) > 4]

### **Correction of spelling errors and abbreviations**

In [None]:
%%capture
norm = Normaliser(tokenizer='readable')
def normalize(text):
  return norm.normalise(text)

In [None]:
corpus = []
for i, text in enumerate(df.documents):
  if i % 1000 == 0:
    print(f"Processed {i} documents")
  text = normalize(text) # fix spelling errors and abbreviations before tokenizing
  tokens = tokenize(text)
  tokens = remove_links(tokens)
  tokens = remove_numbers(tokens)
  tokens = remove_words_numbers(tokens)
  tokens = remove_short_tokens(tokens)
  tokens = remove_stopwords(tokens)
  tokens = remove_punctuation(tokens, strip_mentions=True, strip_hashtags=True)
  tokens = lemmatize(tokens) 
  corpus.append(' '.join(tokens))

In [None]:
print(df.documents[20])
print()
corpus[20]
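
The preprocessing loop chains the cleaning steps in a fixed order. The sketch below reproduces that order on a single made-up sentence using only the standard library: `str.split` stands in for the NLTK tokenizer, `STOPWORDS` is a tiny illustrative set, and the Enelvo normalisation, punctuation removal and lemmatization steps are left out.

```python
import re

STOPWORDS = {"para", "minha", "ainda"}  # illustrative only, not NLTK's list

def preprocess(text):
    tokens = text.lower().split()  # stand-in for tokenize()
    tokens = [t for t in tokens if not t.startswith(("http://", "https://"))]  # links
    tokens = [t for t in tokens if not re.search(r"\d", t)]  # numbers and mixed tokens
    tokens = [t for t in tokens if len(t) > 4]               # short tokens
    tokens = [t for t in tokens if t not in STOPWORDS]       # stopwords
    return " ".join(tokens)

doc = "Comprei no site https://loja.example em 2022 e minha encomenda ainda nao chegou"
result = preprocess(doc)
print(result)  # comprei encomenda chegou
```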

## **Training a BERTopic Model**

Since our data is in Portuguese, we set the language to multilingual.

### **Enabling the GPU**

We will use the GPU provided by Colab to accelerate model training. To enable GPUs for the notebook:

1- Navigate to Edit -> Notebook Settings

2- Select GPU from the Hardware Accelerator drop-down

In [None]:
# verify if GPU is selected
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

In [None]:
topic_model = BERTopic(language="multilingual", calculate_probabilities=True, verbose=True)
topics, probs = topic_model.fit_transform(corpus)

## **Extracting Topics**

In [None]:
# print the most frequent topics

freq = topic_model.get_topic_info()
freq.head(5)

Topic -1 groups all outliers and should be ignored.

In [None]:
# show the most frequent topic (topic 0, since -1 holds the outliers)
topic_model.get_topic(0)

**Note:** BERTopic is stochastic, which means that the topics might differ across runs. This is mostly due to the stochastic nature of UMAP.
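
One way to make runs repeatable is to pass BERTopic a UMAP instance with a fixed `random_state`. The sketch below assumes the `umap-learn` package that BERTopic depends on. The helper name `build_reproducible_model` and the seed value are illustrative, and the remaining UMAP arguments are intended to mirror BERTopic's defaults. Fixing the seed may make the UMAP step slower.

```python
def build_reproducible_model(seed=42):
    """Return a BERTopic model whose UMAP step uses a fixed random seed."""
    # Imported lazily so the helper can be defined before the packages are installed.
    from bertopic import BERTopic
    from umap import UMAP

    # Same settings as the model above, plus an explicit, seeded UMAP instance.
    umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                      metric="cosine", random_state=seed)
    return BERTopic(language="multilingual", umap_model=umap_model,
                    calculate_probabilities=True, verbose=True)
```

Calling `build_reproducible_model().fit_transform(corpus)` would then take the place of the training cell above.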

## **Visualization**

### **Intertopic Distance Map**

This graph shows the distance between topics and helps us understand how close they are to one another.

In [None]:
topic_model.visualize_topics()

### **Visualize Topic Hierarchy**

The topics that were created can be hierarchically reduced. This visualization shows how the topics relate to one another.

In [None]:
topic_model.visualize_hierarchy(top_n_topics=50)

### **Visualize Terms**

We can visualize the selected terms for a few topics by creating bar charts out of the c-TF-IDF scores for each topic representation.

In [None]:
topic_model.visualize_barchart(top_n_topics=5)

### **Visualize Topic Similarity**

This plot shows a similarity matrix obtained by applying cosine similarity to the topic embeddings that BERTopic generates from both c-TF-IDF and document embeddings. The matrix indicates how similar certain topics are to each other.

In [None]:
topic_model.visualize_heatmap(n_clusters=20, width=1000, height=1000)
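
Cosine similarity compares the direction of two vectors and ignores their length. A minimal standard-library sketch on made-up three-dimensional vectors (real topic embeddings have far more dimensions, and the topic names here are invented):

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the two Euclidean norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

topic_vectors = {
    "delivery": [1.0, 0.0, 1.0],
    "shipping": [2.0, 0.0, 2.0],  # same direction as "delivery", twice the length
    "refund": [1.0, 1.0, 0.0],
}

same_direction = cosine_similarity(topic_vectors["delivery"], topic_vectors["shipping"])
partial_overlap = cosine_similarity(topic_vectors["delivery"], topic_vectors["refund"])
print(round(same_direction, 3), round(partial_overlap, 3))
```

A value near 1 means two topics point in nearly the same direction, and a value near 0 means they share little.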

### **Visualize Term Score Decline**

Topics are represented by a number of words, starting with the most representative word. Each word has a c-TF-IDF score: the higher the score, the more representative the word is of the topic. Since the topic words are sorted by their c-TF-IDF score, the scores slowly decline with each word that is added.

In [None]:
topic_model.visualize_term_rank()
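
To make the declining scores concrete, the sketch below computes a simplified class-based TF-IDF by hand: each word's share of its class is multiplied by `log(1 + A / f)`, where `A` is the average number of words per class and `f` is the word's frequency over all classes. This is a simplified reading of the c-TF-IDF formula, not BERTopic's actual implementation, and the two toy classes are made up.

```python
import math
from collections import Counter

# Toy "classes": all documents of a topic joined into one string (made-up data).
classes = {
    "delivery": "entrega atrasada entrega produto",
    "refund": "reembolso produto reembolso valor",
}

counts = {name: Counter(text.split()) for name, text in classes.items()}
total_freq = sum(counts.values(), Counter())  # frequency of each word over all classes
avg_words = sum(sum(c.values()) for c in counts.values()) / len(counts)  # A

def ctfidf(name):
    # Score every word of one class and return (word, score) pairs, best first.
    class_total = sum(counts[name].values())
    scores = {
        word: (freq / class_total) * math.log(1 + avg_words / total_freq[word])
        for word, freq in counts[name].items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranked = ctfidf("delivery")
for word, score in ranked:
    print(f"{word}: {score:.3f}")
```

In this toy data, "produto" ranks last for the delivery class because it also occurs in the refund class, which lowers its weight.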

### **Term search**

In [None]:
similar_topics, similarity = topic_model.find_topics("blackfriday", top_n=3)
similar_topics

In [None]:
# inspect the best match returned by find_topics (topic ids vary across runs)
topic_model.get_topic(similar_topics[0])

### **Topic Reduction**

In [None]:
#new_topic_model = topic_model.reduce_topics(df.documents, nr_topics=40)
#new_topics = new_topic_model.topics_
#new_probs = new_topic_model.probabilities_

In [None]:
#new_topic_model.visualize_topics()