# BERTopic

BERTopic is a topic modeling algorithm that generates topics from a collection of documents. It leverages embeddings from BERT-style models (BERT, Bidirectional Encoder Representations from Transformers, is a language representation model developed by Google), along with other machine learning techniques, to identify coherent topics within large volumes of text. Traditional topic modeling algorithms such as LDA rely on word frequency counts and distributions; BERTopic instead clusters contextual embeddings. Because these embeddings reflect the context in which words are used, reviews that describe the same issue in different words can still end up in the same topic.
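
As a small illustration of what frequency counts miss, the sketch below uses plain Python and two made-up review sentences (not taken from the dataset). It builds bag-of-words counts for two reviews with opposite meanings. The counts come out identical because word order and context are discarded, which is the information contextual embeddings are meant to retain. It does not run BERTopic itself.

```python
from collections import Counter

# Two made-up reviews that use the same words to say opposite things
review_a = "the crew was helpful not rude"
review_b = "the crew was rude not helpful"

def bag_of_words(text):
    # Count how often each word occurs, ignoring word order
    return Counter(text.split())

same_counts = bag_of_words(review_a) == bag_of_words(review_b)
print(same_counts)  # True
```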

In this notebook, the model is trained on customer reviews of airline flights. I used the model out of the box and would like to continue experimenting with its hyperparameters.

Training the model was time-consuming, so I pickled the fitted model to make it easy to reuse.

In [None]:
import pandas as pd
import numpy as np
import re
import string
from bertopic import BERTopic
import pickle

In [None]:
# Load the data
df = pd.read_csv('/Users/paulhershaw/brainstation_course/test_git/test/data/airline_reviews_cleaned.csv')

In [None]:
# Perform basic text cleaning

def preprocess_text(text):
    # Convert text to lower case
    text = text.lower()
    # Remove URLs
    text = re.sub(r'http\S+|www\S+', '', text)
    # Remove HTML tags (must run before punctuation removal strips '<' and '>')
    text = re.sub(r'<.*?>', '', text)
    # Remove punctuation (uses the standard-library string module)
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Replace remaining runs of non-word characters (including whitespace) with one space
    text = re.sub(r'\W+', ' ', text)
    return text
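
The order of the cleaning steps matters because `string.punctuation` contains `<` and `>`: once punctuation has been stripped, the HTML tag pattern has nothing left to match and the tag name stays in the text. A minimal check of this, using a made-up review string and two hypothetical helper names:

```python
import re
import string

PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def tags_then_punctuation(text):
    # Remove HTML tags while '<' and '>' are still present
    text = re.sub(r'<.*?>', '', text)
    return text.translate(PUNCT_TABLE)

def punctuation_then_tags(text):
    # Strip punctuation first; the tag pattern can no longer match
    text = text.translate(PUNCT_TABLE)
    return re.sub(r'<.*?>', '', text)

sample = "Great crew. <br> Seats were cramped."  # made-up example

print(tags_then_punctuation(sample).split())
print(punctuation_then_tags(sample).split())
```

In the first result the tag is gone; in the second, the leftover token `br` remains among the words.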

In [None]:
# Apply the function to the 'customer_review' column
df['customer_review'] = df['customer_review'].apply(preprocess_text)

In [None]:
# Create a list of the cleaned reviews
reviews_list = df['customer_review'].tolist()

In [None]:
# Instantiate the BERTopic model
topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2",
                       calculate_probabilities=True,
                       verbose=True)
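
For the hyperparameter experiments mentioned at the top, one option is to keep candidate settings in a dictionary and build the model from it. This is only a sketch: `min_topic_size` and `nr_topics` are existing `BERTopic` constructor arguments, but the values below are illustrative guesses rather than tuned settings for this dataset, and `build_topic_model` is a hypothetical helper name.

```python
# Candidate settings to try; the values are illustrative assumptions
candidate_params = {
    "embedding_model": "all-MiniLM-L6-v2",
    "min_topic_size": 20,       # assumed value: larger means fewer, broader topics
    "nr_topics": "auto",        # let BERTopic merge similar topics after fitting
    "calculate_probabilities": True,
    "verbose": True,
}

def build_topic_model(params):
    # Imported lazily so the settings can be inspected without loading BERTopic
    from bertopic import BERTopic
    return BERTopic(**params)
```

Calling `build_topic_model(candidate_params)` would then return an unfitted model that can be passed through the same `fit_transform` step as the default one.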

In [None]:
# Fit the model to the reviews
topics, probs = topic_model.fit_transform(reviews_list)
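
`fit_transform` returns one topic ID per review in `topics`, and BERTopic assigns the ID `-1` to reviews it treats as outliers. A quick way to see how reviews are spread across topics is sketched here on a made-up list of IDs (`summarize_topics` is a hypothetical helper, not part of BERTopic):

```python
from collections import Counter

def summarize_topics(topics):
    # Count reviews per topic and report outliers (ID -1) separately
    counts = Counter(topics)
    n_outliers = counts.pop(-1, 0)
    return counts.most_common(), n_outliers

example_topics = [0, 1, -1, 0, 2, 0, -1, 1]  # made-up IDs for illustration
sizes, n_outliers = summarize_topics(example_topics)
print(sizes)       # [(0, 3), (1, 2), (2, 1)]
print(n_outliers)  # 2
```

In the notebook, the same function could be called on the real `topics` list. `topic_model.get_topic_info()` returns a similar summary together with each topic's top words.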

In [None]:
# Save the fitted model for later use
with open('topic_model.pkl', 'wb') as file:
    pickle.dump(topic_model, file)

# Save the topics and probabilities
with open('topics_probs.pkl', 'wb') as file:
    pickle.dump((topics, probs), file)
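
To reuse the results in a later session without retraining, the two files can be read back with `pickle.load`. The sketch below assumes both files sit in the current working directory. `load_pickle` is a hypothetical helper name.

```python
import pickle
from pathlib import Path

def load_pickle(path):
    # Read back an object that was written with pickle.dump
    with open(path, 'rb') as file:
        return pickle.load(file)

# Only attempt the reload if both files from the cell above are present
if Path('topic_model.pkl').exists() and Path('topics_probs.pkl').exists():
    topic_model = load_pickle('topic_model.pkl')
    topics, probs = load_pickle('topics_probs.pkl')
```

Unpickling the model requires `bertopic` to be installed in the environment, and pickle files should only be loaded from sources you trust. BERTopic also provides its own `save` method and `BERTopic.load`, which may be worth trying as an alternative to raw pickling.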