# **Tutorial** - Topic Modeling with BERTopic
(last updated 01-09-2022)

In this tutorial we explore how to use BERTopic to create topics from a collection of English and Chinese travel reviews about Macau. The most frequent use cases and methods are discussed, together with the important parameters to watch.


## BERTopic
BERTopic is a topic modeling technique that leverages 🤗 transformers and a custom class-based TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions.

<br>

<img src="https://raw.githubusercontent.com/MaartenGr/BERTopic/master/images/logo.png" width="40%">

# Enabling the GPU

First, you'll need to enable GPUs for the notebook:

- Navigate to Edit→Notebook Settings
- Select GPU from the Hardware Accelerator drop-down

[Reference](https://colab.research.google.com/notebooks/gpu.ipynb)
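
After changing the setting, it can be worth confirming that the accelerator is actually visible to the runtime. The sketch below is one hedged way to do that with only the standard library. It assumes the NVIDIA driver tool `nvidia-smi` is on the `PATH` (typically the case on Colab GPU runtimes), and `gpu_visible` is a hypothetical helper name, not part of BERTopic.

```python
import shutil
import subprocess


def gpu_visible():
    """Return True if `nvidia-smi -L` lists at least one GPU (hypothetical helper)."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # Assumption: no nvidia-smi on PATH means no usable NVIDIA GPU.
        return False
    try:
        result = subprocess.run([exe, "-L"], capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.SubprocessError):
        return False
    # `nvidia-smi -L` prints one "GPU <index>: <name> ..." line per device.
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU visible:", gpu_visible())
```

If this prints `False` on a runtime that should have a GPU, re-check the Hardware Accelerator setting before training.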

# **Installing BERTopic**

We start by installing BERTopic from PyPI:

In [10]:
!pip uninstall numba bertopic umap-learn hdbscan -y
!pip install --upgrade bertopic numba umap-learn hdbscan

Found existing installation: numba 0.63.1
Uninstalling numba-0.63.1:
  Successfully uninstalled numba-0.63.1
Found existing installation: bertopic 0.17.4
Uninstalling bertopic-0.17.4:
  Successfully uninstalled bertopic-0.17.4
Found existing installation: umap-learn 0.5.11
Uninstalling umap-learn-0.5.11:
  Successfully uninstalled umap-learn-0.5.11
Found existing installation: hdbscan 0.8.41
Uninstalling hdbscan-0.8.41:
  Successfully uninstalled hdbscan-0.8.41
Collecting bertopic
  Using cached bertopic-0.17.4-py3-none-any.whl.metadata (24 kB)
Collecting numba
  Using cached numba-0.63.1-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (2.9 kB)
Collecting umap-learn
  Using cached umap_learn-0.5.11-py3-none-any.whl.metadata (26 kB)
Collecting hdbscan
  Using cached hdbscan-0.8.41-cp312-cp312-linux_x86_64.whl
Using cached bertopic-0.17.4-py3-none-any.whl (154 kB)
Using cached numba-0.63.1-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (3.8 MB)
Using cache

## Restart the Notebook
Installing BERTopic updates some packages that were already loaded, so we should now restart the notebook in order to use the new versions correctly.

From the Menu:

Runtime → Restart Runtime
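
After the restart, a quick look at the installed versions can confirm that the upgrade took effect. The following is a minimal sketch using the standard library's `importlib.metadata`; `installed_versions` is a hypothetical helper, and the package list simply mirrors the `pip install` line above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("bertopic", "numba", "umap-learn", "hdbscan")):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions().items():
    print(f"{name}: {ver or 'not installed'}")
```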

# Data
We use English and Chinese travel review datasets scraped from various platforms, including Mafengwo, Xiaohongshu, Qiongyou, TripAdvisor and Reddit, which together contain roughly 59,000 texts (58,735 in the run shown below).

In [1]:
import pandas as pd

# File paths
MFW_output = '/content/chinese/MFW_output.xlsx'
MFW_youji = '/content/chinese/MFW_youji.xlsx'
qr = '/content/chinese/qr.xlsx'
textqr2_final = '/content/chinese/textqr2_final.xlsx'
tripadvisor_forum_data = '/content/english/tripadvisor_forum_data.xlsx'
xiaohongshu_comments_combined = '/content/chinese/xiaohongshu_comments_combined.csv'
reddit_comments_combined = '/content/english/reddit_comments_combined.csv'
reddit_posts_combined = '/content/english/reddit_posts_combined.csv'
qunar_reviews = '/content/chinese/qunar_reviews.xlsx'
tripadvisor_reviews_data = '/content/english/tripadvisor_reviews_data.csv'
xiecheng = '/content/chinese/xc.xlsx'

# Function to read and extract column data from Excel files
def get_column_data(file_path, column, start_row=1):
    df = pd.read_excel(file_path)
    # Convert a single column letter to a zero-based index (e.g., 'A' -> 0, 'H' -> 7)
    if isinstance(column, str):
        column = ord(column.upper()) - ord('A')
    # Note: read_excel treats the first sheet row as the header, so start_row counts
    # data rows; with start_row=2 the first data row (sheet row 2) is skipped.
    return df.iloc[start_row-1:, column].dropna().astype(str).tolist()

# Function to read and extract column data from CSV files
def get_csv_column_data(file_path, column, start_row=1, encoding='utf-8'):
    # Try the requested encoding first, then GBK (common for Chinese files).
    # latin1 (ISO-8859-1) goes last: it accepts any byte sequence and never raises,
    # so any encoding tried after it would be unreachable.
    for enc in (encoding, 'gbk', 'latin1'):
        try:
            df = pd.read_csv(file_path, encoding=enc)
            break
        except UnicodeDecodeError:
            continue
    # Convert a single column letter to a zero-based index (e.g., 'A' -> 0, 'H' -> 7)
    if isinstance(column, str):
        column = ord(column.upper()) - ord('A')
    return df.iloc[start_row-1:, column].dropna().astype(str).tolist()

# Extract data from each file
mfw_output = get_column_data(MFW_output, 'A', start_row=1)  # A1-Aend
mfw_youji = get_column_data(MFW_youji, 'B', start_row=2)
qr_data = get_column_data(qr, 'A', start_row=2)           # A2-Aend
textqr_data = get_column_data(textqr2_final, 'A', start_row=2)  # A2-Aend
tripadvisor_forum = get_column_data(tripadvisor_forum_data, 'H', start_row=2)  # H2-Hend
xiaohongshu_comments_combined = get_csv_column_data(xiaohongshu_comments_combined, 'F', start_row=2)
reddit_comments_combined = get_csv_column_data(reddit_comments_combined, 'D', start_row=2)
reddit_posts_combined = get_csv_column_data(reddit_posts_combined, 'O', start_row=2)
qunar_reviews_data = get_column_data(qunar_reviews, 'G', start_row=2)
tripadvisor_reviews = get_csv_column_data(tripadvisor_reviews_data, 'E', start_row=2)
xiecheng_data = get_column_data(xiecheng, 'E', start_row = 2)
xiecheng_timestamp = get_column_data(xiecheng, 'F', start_row = 2)

# Concatenate all the data
docs = mfw_output + mfw_youji + qr_data + textqr_data + tripadvisor_forum + xiaohongshu_comments_combined + reddit_comments_combined + reddit_posts_combined + qunar_reviews_data + tripadvisor_reviews + xiecheng_data



# concatenate all chinese data
docs_chinese = mfw_output + mfw_youji + qr_data + textqr_data + xiaohongshu_comments_combined + qunar_reviews_data + xiecheng_data

# concatenate all english data
docs_english = reddit_comments_combined + reddit_posts_combined + tripadvisor_forum + tripadvisor_reviews

# Chinese data describing Macau's image and gaze focus
docs_chinese_focus = mfw_youji + xiaohongshu_comments_combined + qunar_reviews_data + xiecheng_data

# Chinese data describing the core demands of visitors to Macau
docs_chinese_demand = mfw_output + qr_data + textqr_data + xiaohongshu_comments_combined

# English data describing Macau's image and gaze focus
docs_english_focus = reddit_posts_combined + tripadvisor_reviews

# English data describing the core demands of visitors to Macau
docs_english_demand = reddit_comments_combined + tripadvisor_forum
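
The CSV reader above falls back through several encodings. The order of that fallback matters, because `latin1` maps every possible byte to a character and therefore never raises `UnicodeDecodeError`; anything tried after it is unreachable. The sketch below illustrates this on raw bytes rather than on files. `decode_with_fallback` and its encoding order are assumptions made for illustration, not part of pandas.

```python
def decode_with_fallback(raw, encodings=("utf-8", "gbk", "latin1")):
    """Decode bytes with the first encoding that succeeds (hypothetical helper)."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the encodings could decode the input")


gbk_bytes = "澳门旅游".encode("gbk")

# latin1 accepts any byte sequence, so it "succeeds" but yields mojibake.
print(gbk_bytes.decode("latin1"))

# Trying gbk before latin1 recovers the original text.
print(decode_with_fallback(gbk_bytes))
```

Keeping `latin1` as the last resort means it only catches files that neither UTF-8 nor GBK can decode.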



In [2]:
print(f'total number of documents: {len(docs)}')
print(f'number of english documents: {len(docs_english)}')
print(f'number of chinese documents: {len(docs_chinese)}')
print(f'number of chinese_focus documents: {len(docs_chinese_focus)}')
print(f'number of chinese_demand documents: {len(docs_chinese_demand)}')
print(f'number of english_focus documents: {len(docs_english_focus)}')
print(f'number of english_demand documents: {len(docs_english_demand)}')

total number of documents: 58735
number of english documents: 30512
number of chinese documents: 28223
number of chinese_focus documents: 23710
number of chinese_demand documents: 7540
number of english_focus documents: 24045
number of english_demand documents: 6467


In [3]:
!pip install emoji
!pip install opencc
!pip install pycountry



In [4]:
# Data cleaning

import re
import emoji
import opencc

def remove_emojis(text):
    return emoji.replace_emoji(text, replace='')

converter = opencc.OpenCC('t2s.json')

# Clean the text
def clean_text(text):
    # remove URLs first (both <http...> and www formats)
    text = re.sub(r'<https?:\/\/[^\s>]+>|www\.[^\s>]+', '', text)
    # remove symbols
    text = re.sub(r'[\[\]「」★■【】*#@&$%\^\*\-\+=\|～\\/~\u3000]', '', text)
    # Then remove emojis
    text = remove_emojis(text)

    # Add removal of specific words
    #words_to_remove = ["ÊîªÁï•È¶ñÈ°µ", "ÈùûÂ∏∏Êä±Ê≠âÔºåÊÇ®ËÆøÈóÆÁöÑÈ°µÈù¢‰∏çÂ≠òÂú®„ÄÇ", "ÂéªÂì™ÂÑøÊóÖË°å", "È¶ñÈ°µ", "ÊîªÁï•", "ÈÄÇËÄÅÂåñÂèäÊó†ÈöúÁ¢ç", "ËØ∑ÁôªÂΩïÊàñÂÖçË¥πÊ≥®ÂÜå", "Ê∂àÊÅØ", "Êü•ÁúãËÆ¢ÂçïËÅîÁ≥ªÂÆ¢Êúç", "NEW‰∏ÄÊó•Ê∏∏", "Âõ¢Ë¥≠", "ÈÇÆËΩÆ", "Â∫¶ÂÅá", "Èó®Á•®", "ÁÅ´ËΩ¶Á•®", "ÂΩìÂú∞‰∫∫", "Ê±ΩËΩ¶Á•®", "Êõ¥Â§ö", "ÊîªÁï•Â∫ì", "ÁõÆÁöÑÂú∞", "ÂàõÂª∫Ë°åÁ®ã", "ÂõûÂèëË°®Ê∏∏ËÆ∞", "Âàõ‰ΩúËÄÖÂπ≥Âè∞", "ÊêúÁõÆÁöÑÂú∞/ÊîªÁï•", "ÊóÖË°å„ÄãÁõÆÁöÑÂú∞„Äã‰∏≠ÂõΩ>Êæ≥Èó®Êæ≥Èó®ÊôØÁÇπ„Äã", "ËØÑËÆ∫ËØ¶ÊÉÖ", "Êõ¥Â§öÂë®ËæπÊôØÁÇπ", "ÊÑèËßÅÂèçÈ¶à", "Á∫†Èîô", "ÂèçÈ¶à", "ÁõÆÁöÑÂú∞ÂàÜÁ±ªÂØºËà™", "Êæ≥Èó®ÁÉ≠Èó®ÊôØÁÇπ", "Â§ß‰∏âÂ∑¥ÁâåÂùä", "Êæ≥Èó®Â°î", "Â≤óÈ°∂ÂâçÂú∞", "ÈæôÁéØËë°Èüµ‰ΩèÂÆÖÂºè", "‰∏ªÊïôÂ±±Â∞èÂ†Ç", "Êæ≥Èó®Â∏ÇÊîøÁΩ≤", "‰∫öÂ©Ü‰∫ïÂâçÂú∞", "Â§ßÁÇÆÂè∞", "Â¶àÈòÅÂ∫ô", "Êæ≥Èó®ÂçöÁâ©È¶Ü", "Áé´Áë∞Âú£ÊØçÂ†Ç", "Âú£Ëã•Áëü‰øÆÈô¢ËóèÁèçÈ¶Ü", "Êæ≥Èó®Ê∏î‰∫∫Á†ÅÂ§¥", "ÈáëËé≤Ëä±ÂπøÂú∫", "Âç¢ÂÆ∂Â§ßÂ±ã", "Âú£ÊØçËØûËæ∞‰∏ªÊïôÂ∫ßÂ†ÇÂú£ËÄÅÊ•û‰ΩêÂ†Ç", "Ê∏ØÂä°Â±Ä", "ÈÉëÂÆ∂Â§ßÂ±ã", "Êæ≥Èó®Â§ßËµõËΩ¶ÂçöÁâ©È¶Ü", "‰∏≠ÂõΩÁÉ≠Èó®ÂüéÂ∏Ç", "Âåó‰∫¨ÊóÖÊ∏∏", "Âé¶Èó®ÊóÖÊ∏∏", "‰∏äÊµ∑ÊóÖÊ∏∏", "Êù≠Â∑ûÊóÖÊ∏∏", "Âçó‰∫¨ÊóÖÊ∏∏", "ÊàêÈÉΩÊóÖÊ∏∏", "‰∏ΩÊ±üÊóÖÊ∏∏", "È¶ôÊ∏ØÊóÖÊ∏∏", "Êæ≥Èó®ÊóÖÊ∏∏", "Âè∞ÂåóÊóÖÊ∏∏", "‰∏â‰∫öÊóÖÊ∏∏", "Ë•øÂÆâÊóÖÊ∏∏", "Ê°ÇÊûóÊóÖÊ∏∏", "ÈáçÂ∫ÜÊóÖÊ∏∏", "ÈùíÂ≤õÊóÖÊ∏∏", "ËãèÂ∑ûÊóÖÊ∏∏", "Â§ßËøûÊóÖÊ∏∏", "ÂπøÂ∑ûÊóÖÊ∏∏", "Ê∑±Âú≥ÊóÖÊ∏∏", "Âº†ÂÆ∂ÁïåÊóÖÊ∏∏", "12025ÁÉ≠Èó®ÁõÆÁöÑÂú∞Êé®Ëçê", "Â§©Ê¥•ÊóÖÊ∏∏", "Ê≥∞ÂõΩÊóÖÊ∏∏", "Ëã±ÂõΩÊóÖÊ∏∏", "Ê≥ïÂõΩÊóÖÊ∏∏", "ÊÑèÂ§ßÂà©ÊóÖÊ∏∏", "ÁæéÂõΩÊóÖÊ∏∏", "Âä†ÊãøÂ§ßÊóÖÊ∏∏", "Êæ≥Â§ßÂà©‰∫öÊóÖÊ∏∏", "ÂüÉÂèäÊóÖÊ∏∏", "Èü©ÂõΩÊóÖÊ∏∏", "È©¨Â∞î‰ª£Â§´ÊóÖÊ∏∏", "Âæ∑ÂõΩÊóÖÊ∏∏", "ÂúüËÄ≥ÂÖ∂ÊóÖÊ∏∏", "Êñ∞Ë•øÂÖ∞ÊóÖÊ∏∏", "ËÇØÂ∞º‰∫öÊóÖÊ∏∏", "ÂçóÈùûÊóÖÊ∏∏", "Âç∞Â∫¶Â∞ºË•ø‰∫öÊóÖÊ∏∏", "Ëè≤ÂæãÂÆæÊóÖÊ∏∏", "Êó•Êú¨ÊóÖÊ∏∏", "ÁëûÂ£´ÊóÖÊ∏∏", "Qunar.com", "‰∏öÂä°Âêà‰Ωú", "Âä†ÂÖ•Êàë‰ª¨", "‰∏•ÈáçËøùËßÑ", "Â§±‰ø°‰∏ìÈ°πÊï¥Ê≤ª", "‰∏æÊä•", "ÂÆâÂÖ®‰∏≠ÂøÉ", "ÊòüÈ™ÜÈ©ºÂÖ¨Áõä", "AboutUs", "Trip.com", "Group", "Copyright@2021Qunar.com", "‰∫¨ÂÖ¨ÁΩëÂÆâÂ§á", "11010802030542", "‰∫¨ICPÂ§á", "05021087Âè∑", "‰∫¨ICPËØÅ", "060856Âè∑", 
"Ëê•‰∏öÊâßÁÖß‰ø°ÊÅØ", "‰∫íËÅîÁΩëËçØÂìÅ‰ø°ÊÅØÊúçÂä°ËµÑÊ†ºËØÅ", "Ôºà‰∫¨Ôºâ-ÈùûÁªèËê•ÊÄß-2016-0110CÂéªÂì™ÂÑø", "ÊäïËØâ", "Âí®ËØ¢ÁÉ≠Á∫øÁîµËØù", "95117", "ÊäïËØâÈÇÆÁÆ±", "tousu@qunar.com", "ÂÖ®ÂõΩÊóÖÊ∏∏ÊäïËØâÁÉ≠Á∫ø", "12345", "Êú™ÊàêÂπ¥‰∫∫", "ËøùÊ≥ïÂíå‰∏çËâØ‰ø°ÊÅØ", "ÁÆóÊ≥ïÊé®Ëçê", "‰∏æÊä•ÁîµËØù", "010-59606977", "‰∏æÊä•ÈÇÆÁÆ±", "‰ø°Áî®‰∏≠ÂõΩ", "Osecure", "ÁΩë‰∏äÊúâÂÆ≥‰ø°ÊÅØ", "‰∏æÊä•‰∏ìÂå∫", "ÊóÖË°å„ÄãÁõÆÁöÑÂú∞„Äã‰∏≠ÂõΩ"]  # Add more words as needed
    words_for_removal = ["阅读全部"]
    for word in words_for_removal:
        text = text.replace(word, '')

    # Convert Traditional Chinese to Simplified Chinese
    text = converter.convert(text)

    return text

# Apply to all docs
docs = [clean_text(doc) for doc in docs]
docs_chinese = [clean_text(doc) for doc in docs_chinese]
docs_chinese_focus = [clean_text(doc) for doc in docs_chinese_focus]
docs_chinese_demand = [clean_text(doc) for doc in docs_chinese_demand]
docs_english = [clean_text(doc) for doc in docs_english]
docs_english_focus = [clean_text(doc) for doc in docs_english_focus]
docs_english_demand = [clean_text(doc) for doc in docs_english_demand]
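
To see what the two regular expressions in `clean_text` actually remove, they can be tried on a single string without the `emoji` and `opencc` dependencies. The sketch below mirrors those two patterns; `strip_urls_and_symbols` is a hypothetical helper and the sample strings are invented for illustration.

```python
import re

# The same two patterns as in clean_text, compiled once.
URL_RE = re.compile(r'<https?:\/\/[^\s>]+>|www\.[^\s>]+')
SYMBOL_RE = re.compile(r'[\[\]「」★■【】*#@&$%\^\*\-\+=\|～\\/~\u3000]')


def strip_urls_and_symbols(text):
    """Apply the URL pattern first, then the symbol pattern (hypothetical helper)."""
    text = URL_RE.sub('', text)
    return SYMBOL_RE.sub('', text)


# Invented sample text for illustration.
sample = "【澳门】大三巴 <https://example.com/a> ★★ see www.example.com now"
print(strip_urls_and_symbols(sample).split())

# A bare URL without angle brackets is not matched by URL_RE;
# the symbol pattern then strips only its slashes.
print(strip_urls_and_symbols("visit https://example.com"))
```

The second call shows a limitation: URLs that are neither wrapped in `<...>` nor start with `www.` are left behind in a mangled form, so the URL pattern may need widening if such links occur in the corpus.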

In [None]:
# inspect every 100th document after cleaning
import pandas as pd

docs_inspect = docs[::100]

pd.DataFrame(docs_inspect, columns=['Documents']).to_excel('documents_for_inspect.xlsx', index=False)
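
The cell above takes every 100th document, which is a systematic rather than a random sample. If a random spot check is preferred, a seeded draw is one option. The following sketch is an assumption about how that could look; `sample_documents`, the sample size and the seed are illustrative choices.

```python
import random


def sample_documents(documents, k=10, seed=42):
    """Return up to k documents drawn without replacement, reproducibly."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    k = min(k, len(documents))
    return rng.sample(documents, k)


# Stand-in corpus; in the notebook this would be the cleaned `docs` list.
toy_docs = [f"document {i}" for i in range(1000)]
for doc in sample_documents(toy_docs, k=5):
    print(doc)
```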

In [5]:
### Training data statistics

# total word count
def count_words(doc):
    """
    Count the total words in a document containing English and/or Chinese text.
    For English: counts each run of ASCII letters as one word
    For Chinese: counts each character as a word (common approach)
    The two counts are summed, so mixed English-Chinese documents are handled.
    """
    # Chinese characters (Unicode blocks)
    chinese_pattern = re.compile(r'[\u4e00-\u9fff\u3400-\u4dbf\U00020000-\U0002a6df\U0002a700-\U0002b73f\U0002b740-\U0002b81f\U0002b820-\U0002ceaf]+')

    # Count Chinese characters (each counts as 1 word)
    chinese_words = sum(len(match) for match in chinese_pattern.findall(doc))

    # Count English words (runs of ASCII letters; digits and punctuation are ignored)
    english_words = len(re.findall(r'[a-zA-Z]+', doc))

    return chinese_words + english_words


total_word_count = sum(count_words(doc) for doc in docs)

print('total word count of all documents is: ', total_word_count)

total word count of all documents is:  4598410
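
As a quick check of the counting rule (one word per Chinese character, one per run of Latin letters), here is a compact sketch on an invented mixed-language sentence. It covers only the two Basic Multilingual Plane ranges of the notebook's pattern, and `count_words_by_script` is a hypothetical name.

```python
import re

CJK_RE = re.compile(r'[\u4e00-\u9fff\u3400-\u4dbf]')
LATIN_WORD_RE = re.compile(r'[a-zA-Z]+')


def count_words_by_script(doc):
    """Return (chinese_characters, latin_words) for one document."""
    return len(CJK_RE.findall(doc)), len(LATIN_WORD_RE.findall(doc))


chinese, latin = count_words_by_script("I love 澳门 and the Ruins of St. Paul's 大三巴")
print(chinese, latin, chinese + latin)
```

Note that the rule splits on apostrophes, so "Paul's" contributes two Latin words, and that digits are not counted at all.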


In [None]:
# count the number of regions
import pycountry
from collections import defaultdict

tripadvisor_forum_region = get_column_data(tripadvisor_forum_data, 'E', start_row=2)
tripadvisor_reviews_region = get_csv_column_data(tripadvisor_reviews_data, 'C', start_row=2)

print(tripadvisor_forum_region, tripadvisor_reviews_region)


# Country/region normalization mapping
COUNTRY_MAPPING = {
    # Special abbreviations and aliases
    "AB": "Canada",          # Alberta, Canada
    "AK": "United States",    # Alaska, United States
    "AZ": "United States",    # Arizona, United States
    "BC": "Canada",           # British Columbia, Canada
    "UK": "United Kingdom",
    "Turkiye": "Turkey",
    # Map cities to their country (optional)
    "Bacolod": "Philippines",
    "Batangas": "Philippines",
    # Empty and invalid values
    "": "Unknown",
    "Asia": "Unknown"
}

def standardize_country(raw_name):
    # Clean the value: strip whitespace, drop a trailing '...' and keep the part after the last comma
    cleaned = raw_name.strip().split('...')[0].split(',')[-1].strip()

    # Check the preset mapping first
    if cleaned in COUNTRY_MAPPING:
        return COUNTRY_MAPPING[cleaned]

    # Use pycountry to match a standard country name
    try:
        return pycountry.countries.search_fuzzy(cleaned)[0].name
    except LookupError:
        return "Unknown"


# Your input list
locations = tripadvisor_forum_region + tripadvisor_reviews_region

# Normalize and count
country_counts = defaultdict(int)
for loc in locations:
    country = standardize_country(loc)
    country_counts[country] += 1

# Filter out invalid values and sort
valid_counts = {k: v for k, v in sorted(country_counts.items())
                if k != "Unknown" and v > 0}

print(f"Actual number of countries/regions: {len(valid_counts)}")
for country, count in valid_counts.items():
    print(f"{country}: {count}")

['Bangkok, Thailand', 'New Delhi, India', 'Manila, Philippines', 'World', 'Richmond, New...', 'Canada', 'Indonesia', 'Istanbul, Turkiye', 'Mumbai, India', 'Hong Kong, China', 'Murray Bridge...', 'Marietta, Georgia', 'Melbourne, Australia', 'Gavle, Sweden', 'Gavle, Sweden', 'Gavle, Sweden', 'Sydney, Australia', 'Perth, Australia', 'Hong Kong, China', 'Yangon (Rangoon...', 'Singapore', 'Singapore', 'Bangkok, Thailand', 'Bangkok, Thailand', 'Singapore, Singapore', 'Istanbul, Turkiye', 'Singapore', 'london', 'Sydney, Australia', 'Philippines', 'Canada', 'Ontario, Canada', 'london', 'Indonesia', 'Hong Kong, China', 'Bideford', 'Indonesia', 'Hong Kong, China', 'Kathmandu, Nepal', 'Southampton, United...', 'Bangkok, Thailand', 'Palm Beach...', 'Palm Beach...', 'Mexico City, Mexico', 'Brisbane, Australia', 'Marietta, Georgia', 'Singapore, Singapore', 'Warrington, United...', 'Hong Kong, China', 'Ambala, India', 'Singapore, Singapore', 'Bialystok, Poland', 'Lincoln, California', 'Singapore, Sin

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.



Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/IPython/core/interactiveshell.py", line 3553, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-15-fc1d2e919c71>", line 49, in <cell line: 0>
    country = standardize_country(loc)
              ^^^^^^^^^^^^^^^^^^^^^^^^
  File "<ipython-input-15-fc1d2e919c71>", line 38, in standardize_country
    return pycountry.countries.search_fuzzy(cleaned)[0].name
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pycountry/__init__.py", line 75, in search_fuzzy
    match_subdivions = pycountry.Subdivisions.match(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pycountry/__init__.py", line None, in match
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/IPyt
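
The interrupted run above was stopped inside `pycountry.countries.search_fuzzy`, which is called once per row. Because the same location strings recur many times in the printed list, one way to cut the work is to resolve each distinct string only once and then weight it by its frequency. The sketch below shows that idea with the standard library only; `count_by_country` and the stand-in `toy_resolver` are hypothetical, and in the notebook `standardize_country` would be passed as `resolve`.

```python
from collections import Counter


def count_by_country(locations, resolve):
    """Count countries while calling `resolve` once per distinct location string."""
    raw_counts = Counter(locations)
    totals = Counter()
    for raw, n in raw_counts.items():
        totals[resolve(raw)] += n
    return totals


calls = []


def toy_resolver(raw):
    """Stand-in for a slow lookup: keeps the text after the last comma."""
    calls.append(raw)
    return raw.split(',')[-1].strip()


sample = ['Bangkok, Thailand', 'Gavle, Sweden', 'Gavle, Sweden', 'Gavle, Sweden', 'Bangkok, Thailand']
print(count_by_country(sample, toy_resolver))
print(len(calls), "lookups for", len(sample), "rows")
```

Wrapping the resolver in `functools.lru_cache` would be an alternative with a similar effect.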

In [None]:
# temporal range

from dateutil.parser import parse

def find_date_range(date_strings):
    dates = []
    for date_str in date_strings:
        try:
            # Parse the date string automatically (handles most common formats)
            dates.append(parse(date_str))
        except (ValueError, OverflowError):
            # Entries with extra text appended to the date may end up here
            print(f"Warning: Could not parse date: {date_str}")

    if not dates:
        return None, None

    return min(dates), max(dates)

# Collect the date columns from the review and forum files
qunar_reviews_time = get_column_data(qunar_reviews, 'C', start_row=2)
tripadvisor_reviews_time = get_csv_column_data(tripadvisor_reviews_data, 'F', start_row=2)
tripadvisor_forum_time = get_column_data(tripadvisor_forum_data, 'C', start_row=2)
date_list = qunar_reviews_time + tripadvisor_reviews_time + tripadvisor_forum_time

print(date_list)

earliest, latest = find_date_range(date_list)

# find_date_range returns (None, None) when nothing could be parsed
if earliest is None:
    print("No parseable dates found")
else:
    print(f"Earliest date: {earliest.strftime('%Y-%m-%d')}")
    print(f"Most recent date: {latest.strftime('%Y-%m-%d')}")

['2015-11-08', '2018-01-10', '2019-10-09', '2018-08-19', '2018-05-19', '2018-01-12', '2017-12-20', '2015-11-10', '2015-06-11', '2019-12-30', '2019-10-17', '2019-09-24', '2018-11-08', '2018-08-20', '2018-02-22', '2018-02-16', '2017-02-24', '2016-09-30发表于游记台湾自由行，2016不一样的春节', '2015-11-30', '2015-09-13', '2015-08-30', '2015-08-19', '2015-04-10', '2024-02-08', '2019-09-24', '2019-09-24', '2019-09-23', '2019-09-23', '2019-04-14', '2019-01-29', '2019-01-29', '2018-11-26', '2018-11-25', '2018-11-25', '2018-11-08', '2018-05-06', '2018-03-24', '2018-02-23', '2018-02-22', '2018-02-14', '2018-01-24', '2018-01-24', '2018-01-18', '2018-01-18', '2017-12-13发表于游记香港&澳门，遇见不一样的你', '2017-11-27发表于游记惊世巨制「水舞间」——澳门畅游体验行【完结】', '2017-10-18', '2017-10-16', '2017-07-17发表于游记赏味记 当生活一塌糊涂的时候，我跑去了澳门吃叉烧蛋', '2017-05-19发表于游记领着5个
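
Several entries in the printed list carry a post title directly after the date, which a general-purpose parser may reject. The following sketch pulls a leading `YYYY-MM-DD` prefix out of such strings using only the standard library. It assumes the date always comes first and in exactly that format; `leading_date` is a hypothetical helper, not part of `dateutil`.

```python
import re
from datetime import date

LEADING_ISO_DATE = re.compile(r'\s*(\d{4})-(\d{2})-(\d{2})')


def leading_date(text):
    """Return the date at the start of `text`, or None if there is no valid one."""
    m = LEADING_ISO_DATE.match(text)
    if m is None:
        return None
    try:
        return date(int(m.group(1)), int(m.group(2)), int(m.group(3)))
    except ValueError:
        # Matches the pattern but is not a real calendar date (e.g. month 13).
        return None


raw = ['2015-11-08', '2016-09-30发表于游记台湾自由行', 'unknown']
dates = [d for d in map(leading_date, raw) if d is not None]
print(min(dates), max(dates))
```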

In [6]:
# inspect documents
print(docs[14790])

拿好行李后，问去银河的班车在哪里乘，结果答案是：我们在南门，班车在北门就这样我们来来回回、反反复复的对穿赌场。好不容易到了银河工作人员告诉我们去新马路的车刚刚开走了，而且是最后一班，晕啊这下肿么办
要买的东西走断腿都没兜到，要搭的班车有没乘到末班，诸事不顺啊
后来问了工作人员还有什么班车可以过去的，因为银河、威尼斯、金沙、新濠是在这个岛的，而主要景点和酒店是在另一个岛上的...



# **Topic Modeling**

In this example, we will go through the main components of BERTopic and the steps necessary to create a strong topic model.




## Training

We start by instantiating BERTopic. We set `language` to `english` when the documents are in English. When the documents are in Chinese or another language, we use `language="multilingual"` instead; the cell below passes `"chinese (simplified)"`, which BERTopic likewise maps to its multilingual embedding model.

We will also calculate the topic probabilities. However, this can slow down BERTopic significantly on large datasets (>100,000 documents). Turn it off if you want to speed up the model.
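
The two choices described above (which `language` to pass and whether to compute probabilities) can be captured in a small helper. This is only a sketch: `bertopic_kwargs` is a hypothetical function, and the 100,000-document cutoff simply mirrors the rule of thumb in the text.

```python
def bertopic_kwargs(doc_language, n_docs, probability_cutoff=100_000):
    """Suggest `language` and `calculate_probabilities` arguments for BERTopic."""
    language = "english" if doc_language.strip().lower() in {"english", "en"} else "multilingual"
    return {
        "language": language,
        # Probabilities are skipped for very large corpora to keep fitting fast.
        "calculate_probabilities": n_docs <= probability_cutoff,
    }


print(bertopic_kwargs("chinese", 23_710))
print(bertopic_kwargs("english", 250_000))
```

The resulting dictionary could then be unpacked into the constructor, for example `BERTopic(**bertopic_kwargs("chinese", len(docs_chinese_focus)))`.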


In [27]:

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, OpenAI, PartOfSpeech
import openai
from sklearn.preprocessing import LabelEncoder


# pre-calculate embeddings
# embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
# embeddings = embedding_model.encode(docs_chinese, show_progress_bar=True)

# # finetune dimensionality reduction model
# umap_model = UMAP(n_neighbors=15, n_components=5, metric='cosine', low_memory=False, random_state=42)

# # finetune clustering model
# hdbscan_model = HDBSCAN(min_cluster_size=25, metric='euclidean', prediction_data=True)

# Prompts for the OpenAI representation model. The Chinese prompts ask for a
# Chinese topic label of at most 7 characters; the English ones for at most 5 words.
prompt_cn_focus = """
基于这些关键词：[KEYWORDS]
和基于这些代表文件：[DOCUMENTS]

请用中文创建不超过7个字的主题标签，要求聚焦澳门形象和访澳旅客的凝视焦点。
"""

prompt_en_focus = """
Based on these keywords: [KEYWORDS]
And these documents: [DOCUMENTS]

Create a technical label (max 5 words and in English) for this topic. This technical label should describe either Macau's visitors' attention focus or Macau's image and identity.
"""

prompt_cn_demand = """
基于这些关键词：[KEYWORDS]
和基于这些代表文件：[DOCUMENTS]

请用中文创建不超过7个字的主题标签，要求聚焦访澳旅客的核心需求。
"""

prompt_en_demand = """
Based on these keywords: [KEYWORDS]
And these documents: [DOCUMENTS]

Create a technical label (max 5 words and in English) for this topic. This technical label should describe Macau's visitors' core demand.
"""

client = openai.OpenAI(api_key="")  # fill in your own API key (only needed for representation_model3)

# Candidate representation models; representation_model2 is the one passed to BERTopic below
representation_model0 = None
representation_model1 = KeyBERTInspired()
representation_model2 = MaximalMarginalRelevance(diversity=0)
representation_model3 = OpenAI(
    client,
    model="gpt-4.1",
    prompt=prompt_cn_focus, # change the prompt as needed
    nr_docs=4,
    doc_length=100,
    diversity=0.2,
    tokenizer="whitespace",
    delay_in_seconds=2
)

seed_topic_list = [["文化环境"],["一般基础设施"],["一般氛围"],["社会环境"],["自然环境"],["旅游基础设施"],["高档氛围"]]
#seed_topic_list = [["Cultural Environment"],["General Infrastructure"],["General Atmosphere"],["Social Environment"],["Natural Environment"],["Tourism Infrastructure"],["Upscale Atmosphere"]]

topic_model = BERTopic(language="chinese (simplified)", seed_topic_list=seed_topic_list, top_n_words=10, min_topic_size=10, representation_model=representation_model2, calculate_probabilities=True, verbose=True) # set the model language and min_topic_size to match the corpus language
#topic_model = BERTopic(language="english", seed_topic_list=seed_topic_list, top_n_words=10, min_topic_size=10, representation_model=representation_model2, calculate_probabilities=True, verbose=True) # set the model language and min_topic_size to match the corpus language

topics, probs = topic_model.fit_transform(docs_chinese_focus) # change the corpus as needed



2026-01-23 07:16:28,108 - BERTopic - Embedding - Transforming documents to embeddings.


Batches:   0%|          | 0/741 [00:00<?, ?it/s]

2026-01-23 07:16:54,361 - BERTopic - Embedding - Completed ✓
2026-01-23 07:16:54,362 - BERTopic - Guided - Find embeddings highly related to seeded topics.


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

2026-01-23 07:16:54,546 - BERTopic - Guided - Completed ✓
2026-01-23 07:16:54,548 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2026-01-23 07:17:18,930 - BERTopic - Dimensionality - Completed ✓
2026-01-23 07:17:18,931 - BERTopic - Cluster - Start clustering the reduced embeddings
2026-01-23 07:18:50,723 - BERTopic - Cluster - Completed ✓
2026-01-23 07:18:50,734 - BERTopic - Representation - Fine-tuning topics using representation models.
2026-01-23 07:18:59,419 - BERTopic - Representation - Completed ✓


**NOTE**: Use `language="multilingual"` to select a model that supports 50+ languages.

## Extracting Topics
After fitting our model, we can start by looking at the results. Typically, we look at the most frequent topics first as they best represent the collection of documents.
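
`fit_transform` returned one topic id per document in `topics`, with `-1` marking outliers that were not assigned to any topic. As a rough, library-free illustration of "most frequent topics first", the sketch below ranks topic ids by document count. `most_frequent_topics` and the toy assignments are hypothetical; in the notebook, `topic_model.get_topic_info()` provides this information along with the topic words.

```python
from collections import Counter


def most_frequent_topics(topics, n=5, include_outliers=False):
    """Return the n most common (topic_id, document_count) pairs."""
    counts = Counter(t for t in topics if include_outliers or t != -1)
    return counts.most_common(n)


# Invented topic assignments for illustration.
toy_topics = [0, 0, 0, 1, 1, -1, -1, -1, -1, 2]
print(most_frequent_topics(toy_topics, n=2))
print(most_frequent_topics(toy_topics, n=1, include_outliers=True))
```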

In [28]:
import jieba

professional_terms = ["圣安多尼", "聖安多尼", "何超琼", "梁安琪", "花王堂", "港珠澳大桥", "路易十七", "权志龙", "威尼斯人", "伦敦人", "巴黎人", "王楚钦", "特别行政", "花王堂区", "风顺堂区", "花地玛堂区", "望德堂区", "圣方济各堂区", "港澳通行证", "疯堂斜巷"]
for term in professional_terms:
    jieba.add_word(term, freq=10000, tag='nz')  # high frequency + proper-noun tag

def scel_to_txt(scel_file, txt_file):
    # NOTE: this is a simplified reader. After skipping the header it writes every
    # other 2-byte UTF-16LE unit as one line; it does not parse the .scel pinyin and
    # word tables, so check output.txt before relying on it as a user dictionary.
    with open(scel_file, 'rb') as f_scel, open(txt_file, 'w', encoding='utf-8') as f_txt:
        f_scel.read(0x154)  # Skip header
        while True:
            word_bytes = f_scel.read(2)
            if not word_bytes:
                break
            word = word_bytes.decode('utf-16le', errors='ignore')
            f_scel.read(2)  # Skip frequency/other data
            f_txt.write(word + '\n')
    print(f"Converted {scel_file} to {txt_file}")


scel_to_txt("/content/sougou_travel_words.scel", "output.txt") # sougou_travel_words.scel is available in the github repo 'jqiu19/Topic_Modelling'
jieba.load_userdict("output.txt")

# finetune vectorizer to improve topic representation
stop_words_cn = [
    "$", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "?", "_", "‚Äú", "‚Äù", "„ÄÅ", "„ÄÇ", "„Ää", "„Äã", "ÈáåÈù¢", "Áªà‰∫é", "‰∏ÄÂ∫ß", "‰∏Ä", "‰∏Ä‰∫õ", "‰∏Ä‰Ωï",
    "‰∏ÄÂàá", "‰∏ÄÂàô", "‰∏ÄÊñπÈù¢", "‰∏ÄÊó¶", "‰∏ÄÊù•", "‰∏ÄÊ†∑", "‰∏ÄËà¨", "‰∏ÄËΩ¨Áúº", "‰∏á‰∏Ä", "‰∏ä", "‰∏ä‰∏ã", "‰∏ã", "‰∏ç", "‰∏ç‰ªÖ", "‰∏ç‰ΩÜ",
    "‰∏çÂÖâ", "‰∏çÂçï", "‰∏çÂè™", "‰∏çÂ§ñ‰πé", "‰∏çÂ¶Ç", "‰∏çÂ¶®", "‰∏çÂ∞Ω", "‰∏çÂ∞ΩÁÑ∂", "‰∏çÂæó", "‰∏çÊÄï", "‰∏çÊÉü", "‰∏çÊàê", "‰∏çÊãò", "‰∏çÊñô",
    "‰∏çÊòØ", "‰∏çÊØî", "‰∏çÁÑ∂", "‰∏çÁâπ", "‰∏çÁã¨", "‰∏çÁÆ°", "‰∏çËá≥‰∫é", "‰∏çËã•", "‰∏çËÆ∫", "‰∏çËøá", "‰∏çÈóÆ", "‰∏é", "‰∏éÂÖ∂", "‰∏éÂÖ∂ËØ¥",
    "‰∏éÂê¶", "‰∏éÊ≠§ÂêåÊó∂", "‰∏î", "‰∏î‰∏çËØ¥", "‰∏îËØ¥", "‰∏§ËÄÖ", "‰∏™", "‰∏™Âà´", "‰∏¥", "‰∏∫", "‰∏∫‰∫Ü", "‰∏∫‰ªÄ‰πà", "‰∏∫‰Ωï", "‰∏∫Ê≠¢",
    "‰∏∫Ê≠§", "‰∏∫ÁùÄ", "‰πÉ", "‰πÉËá≥", "‰πÉËá≥‰∫é", "‰πà", "‰πã", "‰πã‰∏Ä", "‰πãÊâÄ‰ª•", "‰πãÁ±ª", "‰πå‰πé", "‰πé", "‰πò", "‰πü", "‰πüÂ•Ω",
    "‰πüÁΩ¢", "‰∫Ü", "‰∫åÊù•", "‰∫é", "‰∫éÊòØ", "‰∫éÊòØ‰πé", "‰∫ë‰∫ë", "‰∫ëÂ∞î", "‰∫õ", "‰∫¶", "‰∫∫", "‰∫∫‰ª¨", "‰∫∫ÂÆ∂", "‰ªÄ‰πà", "‰ªÄ‰πàÊ†∑",
    "‰ªä", "‰ªã‰∫é", "‰ªç", "‰ªçÊóß", "‰ªé", "‰ªéÊ≠§", "‰ªéËÄå", "‰ªñ", "‰ªñ‰∫∫", "‰ªñ‰ª¨", "‰ª•", "‰ª•‰∏ä", "‰ª•‰∏∫", "‰ª•‰æø", "‰ª•ÂÖç", "‰ª•Âèä",
    "‰ª•ÊïÖ", "‰ª•Êúü", "‰ª•Êù•", "‰ª•Ëá≥", "Âª∫‰∫é", "‰ª•Ëá≥‰∫é", "‰ª•Ëá¥", "‰ª¨", "‰ªª", "‰ªª‰Ωï", "‰ªªÂá≠", "‰ººÁöÑ", "‰ΩÜ", "‰ΩÜÂá°", "‰ΩÜÊòØ", "‰Ωï",
    "‰Ωï‰ª•", "‰ΩïÂÜµ", "‰ΩïÂ§Ñ", "‰ΩïÊó∂", "‰ΩôÂ§ñ", "‰Ωú‰∏∫", "‰Ω†", "‰Ω†‰ª¨", "‰Ωø", "‰ΩøÂæó", "‰æãÂ¶Ç", "‰æù", "‰æùÊçÆ", "‰æùÁÖß", "‰æø‰∫é",
    "‰ø∫", "‰ø∫‰ª¨", "ÂÄò", "ÂÄò‰Ωø", "ÂÄòÊàñ", "ÂÄòÁÑ∂", "ÂÄòËã•", "ÂÄü", "ÂÅá‰Ωø", "ÂÅáÂ¶Ç", "ÂÅáËã•", "ÂÇ•ÁÑ∂", "ÂÉè", "ÂÑø", "ÂÖà‰∏çÂÖà",
    "ÂÖâÊòØ", "ÂÖ®‰Ωì", "ÂÖ®ÈÉ®", "ÂÖÆ", "ÂÖ≥‰∫é", "ÂÖ∂", "ÂÖ∂‰∏Ä", "ÂÖ∂‰∏≠", "ÂÖ∂‰∫å", "ÂÖ∂‰ªñ", "ÂÖ∂‰Ωô", "ÂÖ∂ÂÆÉ", "ÂÖ∂Ê¨°", "ÂÖ∑‰ΩìÂú∞ËØ¥",
    "ÂÖ∑‰ΩìËØ¥Êù•", "ÂÖº‰πã", "ÂÜÖ", "ÂÜç", "ÂÜçÂÖ∂Ê¨°", "ÂÜçÂàô", "ÂÜçÊúâ", "ÂÜçËÄÖ", "ÂÜçËÄÖËØ¥", "ÂÜçËØ¥", "ÂÜí", "ÂÜ≤", "ÂÜµ‰∏î", "Âá†", "Âá†Êó∂",
    "Âá°", "Âá°ÊòØ", "Âá≠", "Âá≠ÂÄü", "Âá∫‰∫é", "Âá∫Êù•", "ÂàÜÂà´", "Âàô", "ÂàôÁîö", "Âà´", "Âà´‰∫∫", "Âà´Â§Ñ", "Âà´ÊòØ", "Âà´ÁöÑ", "Âà´ÁÆ°", "Âà´ËØ¥",
    "Âà∞", "ÂâçÂêé", "ÂâçÊ≠§", "ÂâçËÄÖ", "Âä†‰πã", "Âä†‰ª•", "Âç≥", "Âç≥‰ª§", "Âç≥‰Ωø", "Âç≥‰æø", "Âç≥Â¶Ç", "Âç≥Êàñ", "Âç≥Ëã•", "Âç¥", "Âéª", "Âèà",
    "ÂèàÂèä", "Âèä", "ÂèäÂÖ∂", "ÂèäËá≥", "Âèç‰πã", "ÂèçËÄå", "ÂèçËøáÊù•", "ÂèçËøáÊù•ËØ¥", "ÂèóÂà∞", "Âè¶", "Âè¶‰∏ÄÊñπÈù¢", "Âè¶Â§ñ", "Âè¶ÊÇâ", "Âè™",
    "Âè™ÂΩì", "Âè™ÊÄï", "Âè™ÊòØ", "Âè™Êúâ", "Âè™Ê∂à", "Âè™Ë¶Å", "Âè™Èôê", "Âè´", "ÂèÆÂíö", "ÂèØ", "ÂèØ‰ª•", "ÂèØÊòØ", "ÂèØËßÅ", "ÂêÑ", "ÂêÑ‰∏™",
    "ÂêÑ‰Ωç", "ÂêÑÁßç", "ÂêÑËá™", "Âêå", "ÂêåÊó∂", "Âêé", "ÂêéËÄÖ", "Âêë", "Âêë‰Ωø", "ÂêëÁùÄ", "Âêì", "Âêó", "Âê¶Âàô", "Âêß", "ÂêßÂìí", "Âê±", "ÂëÄ",
    "ÂëÉ", "Âëï", "Âëó", "Âëú", "ÂëúÂëº", "Âë¢", "Âëµ", "ÂëµÂëµ", "Âë∏", "ÂëºÂìß", "Âíã", "Âíå", "Âíö", "Âí¶", "Âíß", "Âí±", "Âí±‰ª¨", "Âí≥", "Âìá",
    "Âìà", "ÂìàÂìà", "Âìâ", "Âìé", "ÂìéÂëÄ", "ÂìéÂìü", "Âìó", "Âìü", "Âì¶", "Âì©", "Âì™", "Âì™‰∏™", "Âì™‰∫õ", "Âì™ÂÑø", "Âì™Â§©", "Âì™Âπ¥", "Âì™ÊÄï",
    "Âì™Ê†∑", "Âì™Ëæπ", "Âì™Èáå", "Âìº", "ÂìºÂî∑", "Âîâ", "ÂîØÊúâ", "Âïä", "Âïê", "Âï•", "Âï¶", "Âï™Ëææ", "Âï∑ÂΩì", "ÂñÇ", "Âñè", "ÂñîÂî∑", "ÂñΩ",
    "Âó°", "Âó°Âó°", "Âó¨", "ÂóØ", "Âó≥", "Âòé", "ÂòéÁôª", "Âòò", "Âòõ", "Âòª", "Âòø", "ÂòøÂòø", "Âõ†", "Âõ†‰∏∫", "Âõ†‰∫Ü", "Âõ†Ê≠§", "Âõ†ÁùÄ",
    "Âõ†ËÄå", "Âõ∫ÁÑ∂", "Âú®", "Âú®‰∏ã", "Âú®‰∫é", "Âú∞", "Âü∫‰∫é", "Â§ÑÂú®", "Â§ö", "Â§ö‰πà", "Â§öÂ∞ë", "Â§ß", "Â§ßÂÆ∂", "Â•π", "Â•π‰ª¨", "Â•Ω", "Â¶Ç",
    "Â¶Ç‰∏ä", "Â¶Ç‰∏äÊâÄËø∞", "Â¶Ç‰∏ã", "Â¶Ç‰Ωï", "Â¶ÇÂÖ∂", "Â¶ÇÂêå", "Â¶ÇÊòØ", "Â¶ÇÊûú", "Â¶ÇÊ≠§", "Â¶ÇËã•", "ÂßãËÄå", "Â≠∞Êñô", "Â≠∞Áü•", "ÂÆÅ",
    "ÂÆÅÂèØ", "ÂÆÅÊÑø", "ÂÆÅËÇØ", "ÂÆÉ", "ÂÆÉ‰ª¨", "ÂØπ", "ÂØπ‰∫é", "ÂØπÂæÖ", "ÂØπÊñπ", "ÂØπÊØî", "Â∞Ü", "Â∞è", "Â∞î", "Â∞îÂêé", "Â∞îÂ∞î", "Â∞ö‰∏î",
    "Â∞±", "Â∞±ÊòØ", "Â∞±ÊòØ‰∫Ü", "Â∞±ÊòØËØ¥", "Â∞±ÁÆó", "Â∞±Ë¶Å", "Â∞Ω", "Â∞ΩÁÆ°", "Â∞ΩÁÆ°Â¶ÇÊ≠§", "Â≤Ç‰ΩÜ", "Â∑±", "Â∑≤", "Â∑≤Áü£", "Â∑¥", "Â∑¥Â∑¥",
    "Âπ∂", "Âπ∂‰∏î", "Âπ∂Èùû", "Â∫∂‰πé", "Â∫∂Âá†", "ÂºÄÂ§ñ", "ÂºÄÂßã", "ÂΩí", "ÂΩíÈΩê", "ÂΩì", "ÂΩìÂú∞", "ÂΩìÁÑ∂", "ÂΩìÁùÄ", "ÂΩº", "ÂΩºÊó∂", "ÂΩºÊ≠§",
    "ÂæÄ", "ÂæÖ", "Âæà", "Âæó", "Âæó‰∫Ü", "ÊÄé", "ÊÄé‰πà", "ÊÄé‰πàÂäû", "ÊÄé‰πàÊ†∑", "ÊÄéÂ•à", "ÊÄéÊ†∑", "ÊÄª‰πã", "ÊÄªÁöÑÊù•Áúã", "ÊÄªÁöÑÊù•ËØ¥",
    "ÊÄªÁöÑËØ¥Êù•", "ÊÄªËÄåË®Ä‰πã", "ÊÅ∞ÊÅ∞Áõ∏Âèç", "ÊÇ®", "ÊÉüÂÖ∂", "ÊÖ¢ËØ¥", "Êàë", "Êàë‰ª¨", "Êàñ", "ÊàñÂàô", "ÊàñÊòØ", "ÊàñÊõ∞", "ÊàñËÄÖ", "Êà™Ëá≥",
    "ÊâÄ", "ÊâÄ‰ª•", "ÊâÄÂú®", "ÊâÄÂπ∏", "ÊâÄÊúâ", "Êâç", "ÊâçËÉΩ", "Êâì", "Êâì‰ªé", "Êää", "ÊäëÊàñ", "Êãø", "Êåâ", "ÊåâÁÖß", "Êç¢Âè•ËØùËØ¥", "Êç¢Ë®Ä‰πã",
    "ÊçÆ", "ÊçÆÊ≠§", "Êé•ÁùÄ", "ÊïÖ", "ÊïÖÊ≠§", "ÊïÖËÄå", "ÊóÅ‰∫∫", "Êó†", "Êó†ÂÆÅ", "Êó†ËÆ∫", "Êó¢", "Êó¢ÂæÄ", "Êó¢ÊòØ", "Êó¢ÁÑ∂", "Êó∂ÂÄô", "ÊòØ",
    "ÊòØ‰ª•", "ÊòØÁöÑ", "Êõæ", "Êõø", "Êõø‰ª£", "ÊúÄ", "Êúâ", "Êúâ‰∫õ", "ÊúâÂÖ≥", "ÊúâÂèä", "ÊúâÊó∂", "ÊúâÁöÑ", "Êúõ", "Êúù", "ÊúùÁùÄ", "Êú¨", "Êú¨‰∫∫",
    "Êú¨Âú∞", "Êú¨ÁùÄ", "Êú¨Ë∫´", "Êù•", "Êù•ÁùÄ", "Êù•Ëá™", "Êù•ËØ¥", "ÊûÅ‰∫Ü", "ÊûúÁÑ∂", "ÊûúÁúü", "Êüê", "Êüê‰∏™", "Êüê‰∫õ", "ÊüêÊüê", "Ê†πÊçÆ", "Ê¨§",
    "Ê≠£ÂÄº", "Ê≠£Â¶Ç", "Ê≠£Â∑ß", "Ê≠£ÊòØ", "Ê≠§", "Ê≠§Âú∞", "Ê≠§Â§Ñ", "Ê≠§Â§ñ", "Ê≠§Êó∂", "Ê≠§Ê¨°", "Ê≠§Èó¥", "ÊØãÂÆÅ", "ÊØè", "ÊØèÂΩì", "ÊØî", "ÊØîÂèä",
    "ÊØîÂ¶Ç", "ÊØîÊñπ", "Ê≤°Â•à‰Ωï", "Ê≤ø", "Ê≤øÁùÄ", "Êº´ËØ¥", "ÁÑâ", "ÁÑ∂Âàô", "ÁÑ∂Âêé", "ÁÑ∂ËÄå", "ÁÖß", "ÁÖßÁùÄ", "Áäπ‰∏î", "ÁäπËá™", "Áîö‰∏î",
    "Áîö‰πà", "ÁîöÊàñ", "ÁîöËÄå", "ÁîöËá≥", "ÁîöËá≥‰∫é", "Áî®", "Áî®Êù•", "Áî±", "Áî±‰∫é", "Áî±ÊòØ", "Áî±Ê≠§", "Áî±Ê≠§ÂèØËßÅ", "ÁöÑ", "ÁöÑÁ°Æ", "ÁöÑËØù",
    "Áõ¥Âà∞", "Áõ∏ÂØπËÄåË®Ä", "ÁúÅÂæó", "Áúã", "Áú®Áúº", "ÁùÄ", "ÁùÄÂë¢", "Áü£", "Áü£‰πé", "Áü£Âìâ", "Á¶ª", "Á´üËÄå", "Á¨¨", "Á≠â", "Á≠âÂà∞", "Á≠âÁ≠â",
    "ÁÆÄË®Ä‰πã", "ÁÆ°", "Á±ªÂ¶Ç", "Á¥ßÊé•ÁùÄ", "Á∫µ", "Á∫µ‰ª§", "Á∫µ‰Ωø", "Á∫µÁÑ∂", "Áªè", "ÁªèËøá", "ÁªìÊûú", "Áªô", "Áªß‰πã", "ÁªßÂêé", "ÁªßËÄå",
    "Áªº‰∏äÊâÄËø∞", "ÁΩ¢‰∫Ü", "ËÄÖ", "ËÄå", "ËÄå‰∏î", "ËÄåÂÜµ", "ËÄåÂêé", "ËÄåÂ§ñ", "ËÄåÂ∑≤", "ËÄåÊòØ", "ËÄåË®Ä", "ËÉΩ", "ËÉΩÂê¶", "ËÖæ", "Ëá™",
    "Ëá™‰∏™ÂÑø", "Ëá™‰ªé", "Ëá™ÂêÑÂÑø", "Ëá™Âêé", "Ëá™ÂÆ∂", "Ëá™Â∑±", "Ëá™Êâì", "Ëá™Ë∫´", "Ëá≥", "Ëá≥‰∫é", "Ëá≥‰ªä", "Ëá≥Ëã•", "Ëá¥", "Ëà¨ÁöÑ", "Ëã•",
    "Ëã•Â§´", "Ëã•ÊòØ", "Ëã•Êûú", "Ëã•Èùû", "Ëé´‰∏çÁÑ∂", "Ëé´Â¶Ç", "Ëé´Ëã•", "ËôΩ", "ËôΩÂàô", "ËôΩÁÑ∂", "ËôΩËØ¥", "Ë¢´", "Ë¶Å", "Ë¶Å‰∏ç", "Ë¶Å‰∏çÊòØ",
    "Ë¶Å‰∏çÁÑ∂", "Ë¶Å‰πà", "Ë¶ÅÊòØ", "Ë≠¨Âñª", "Ë≠¨Â¶Ç", "ËÆ©", "ËÆ∏Â§ö", "ËÆ∫", "ËÆæ‰Ωø", "ËÆæÊàñ", "ËÆæËã•", "ËØöÂ¶Ç", "ËØöÁÑ∂", "ËØ•", "ËØ¥Êù•",
    "ËØ∏", "ËØ∏‰Ωç", "ËØ∏Â¶Ç", "Ë∞Å", "Ë∞Å‰∫∫", "Ë∞ÅÊñô", "Ë∞ÅÁü•", "Ë¥ºÊ≠ª", "Ëµñ‰ª•", "Ëµ∂", "Ëµ∑", "Ëµ∑ËßÅ", "Ë∂Å", "Ë∂ÅÁùÄ", "Ë∂äÊòØ", "Ë∑ù", "Ë∑ü",
    "ËæÉ", "ËæÉ‰πã", "Ëæπ", "Ëøá", "Ëøò", "ËøòÊòØ", "ËøòÊúâ", "ËøòË¶Å", "Ëøô", "Ëøô‰∏ÄÊù•", "Ëøô‰∏™", "Ëøô‰πà", "Ëøô‰πà‰∫õ", "Ëøô‰πàÊ†∑", "Ëøô‰πàÁÇπÂÑø",
    "Ëøô‰∫õ", "Ëøô‰ºöÂÑø", "ËøôÂÑø", "ËøôÂ∞±ÊòØËØ¥", "ËøôÊó∂", "ËøôÊ†∑", "ËøôÊ¨°", "ËøôËà¨", "‰∏Ä‰∏™", "ËøôËæπ", "ËøôÈáå", "ËøõËÄå", "Ëøû", "ËøûÂêå", "ÈÄêÊ≠•",
    "ÈÄöËøá", "ÈÅµÂæ™", "ÈÅµÁÖß", "ÈÇ£", "ÈÇ£‰∏™", "ÈÇ£‰πà", "ÈÇ£‰πà‰∫õ", "ÈÇ£‰πàÊ†∑", "ÈÇ£‰∫õ", "ÈÇ£‰ºöÂÑø", "ÈÇ£ÂÑø", "ÈÇ£Êó∂", "ÈÇ£Ê†∑", "ÈÇ£Ëà¨",
    "ÈÇ£Ëæπ", "ÈÇ£Èáå", "ÈÉΩ", "ÈÑô‰∫∫", "Èâ¥‰∫é", "ÈíàÂØπ", "Èòø", "Èô§", "Èô§‰∫Ü", "Èô§Â§ñ", "Èô§ÂºÄ", "Èô§Ê≠§‰πãÂ§ñ", "Èô§Èùû", "Èöè", "ÈöèÂêé",
    "ÈöèÊó∂", "ÈöèÁùÄ", "ÈöæÈÅìËØ¥", "Èùû‰ΩÜ", "ÈùûÂæí", "ÈùûÁâπ", "ÈùûÁã¨", "Èù†", "È°∫", "È°∫ÁùÄ", "È¶ñÂÖà", "...", "‰∏ÄËµ∑", "ÁúüÁöÑ", "‰Ωç‰∫é", "ÂÆ≥Áæû", "Êæ≥Èó®", "Êæ≥ÈñÄ", "ÔºÅ", "Ôºå", "Ôºö", "Ôºõ", "Ôºü", "Ê≤°Êúâ", "‰∫∫‰∫∫", "ËØùÈ¢ò", "r"
] # Chinese stop words: words that occur frequently in Chinese text but contribute little to its meaning

def tokenize_zh(text):
    words = jieba.lcut(text)

    # Filter out unwanted tokens
    words = [word.strip() for word in words
             if word.strip()
             and len(word.strip()) > 1  # Remove single characters
             and not word.isspace()
             and not word.isdigit()
             and not re.search(r'[a-zA-Z]', word)] # Remove english characters
    return words

vectorizer_model = CountVectorizer(ngram_range=(1, 3), stop_words=stop_words_cn, min_df=2, max_df=0.9, tokenizer=tokenize_zh, token_pattern=None)  # min_df=2: a term must appear in at least 2 documents to be considered a topic word; max_df=0.9: ignore terms that appear in more than 90% of the documents
# Adjust stop_words and tokenizer to match the language of the text

#vectorizer_model = CountVectorizer(ngram_range=(1, 3), stop_words=None, min_df=2, max_df=0.9, tokenizer=tokenize, token_pattern=None)  # min_df=2: a term must appear in at least 2 documents to be considered a topic word; max_df=0.9: ignore terms that appear in more than 90% of the documents
# Adjust stop_words and tokenizer to match the language of the text
topic_model.update_topics(docs_chinese_focus, vectorizer_model=vectorizer_model, top_n_words=10)  # change the documents here
#topic_model.update_topics(docs_english_focus, vectorizer_model=vectorizer_model, top_n_words=10)  # change the documents here

Converted /content/sougou_travel_words.scel to output.txt


In [29]:
from pathlib import Path
import pandas as pd
from collections import defaultdict

freq = topic_model.get_topic_info();

def postprocessing(df):
    # Convert all representations to component lists
    all_rows = [words.copy() for words in df['Representation']]
    global_components = set()

    for i in range(len(all_rows)):
        current_row = all_rows[i]
        filtered_words = []
        words_kept = 0

        for word in current_row:
            components = word.split()

            # Check against all previously seen components (current row + global)
            duplicate = any(comp in global_components for comp in components)

            if not duplicate:
                filtered_words.append(word)
                global_components.update(components)
                words_kept += 1

                # Stop processing this row once 15 words have been kept
                if words_kept >= 15:
                    break

        # Update the row (it may hold fewer than 15 words at this point)
        all_rows[i] = filtered_words

    # Final truncation so that no row holds more than 15 words
    for i in range(len(all_rows)):
        all_rows[i] = all_rows[i][:15]

    # Update dataframe
    df['Representation'] = all_rows
    return df


#freq = postprocessing(freq)
freq.head(100).to_excel("/content/topic_frequencies_list.xlsx", index=False)


In [30]:
# Step 2: Reduce the number of topics to 16 (nr_topics=16; this count may include the -1 outlier topic)
topic_model_reduced = topic_model.reduce_topics(docs_chinese_focus, nr_topics=16)
topics_reduced, probs_reduced = topic_model_reduced.transform(docs_chinese_focus)

2026-01-23 07:22:31,510 - BERTopic - Topic reduction - Reducing number of topics
2026-01-23 07:22:31,555 - BERTopic - Representation - Fine-tuning topics using representation models.
2026-01-23 07:22:42,719 - BERTopic - Representation - Completed ✓
2026-01-23 07:22:42,726 - BERTopic - Topic reduction - Reduced number of topics from 260 to 16


Batches:   0%|          | 0/741 [00:00<?, ?it/s]

2026-01-23 07:23:08,232 - BERTopic - Dimensionality - Reducing dimensionality of input embeddings.
2026-01-23 07:23:15,861 - BERTopic - Dimensionality - Completed ✓
2026-01-23 07:23:15,863 - BERTopic - Clustering - Approximating new points with `hdbscan_model`
2026-01-23 07:23:17,304 - BERTopic - Probabilities - Start calculation of probabilities with HDBSCAN
2026-01-23 07:24:52,957 - BERTopic - Probabilities - Completed ✓
2026-01-23 07:24:52,958 - BERTopic - Cluster - Completed ✓


In [31]:
freq = topic_model.get_topic_info();

def postprocessing(df):
    # Convert all representations to component lists
    all_rows = [words.copy() for words in df['Representation']]
    global_components = set()

    for i in range(len(all_rows)):
        current_row = all_rows[i]
        filtered_words = []
        words_kept = 0

        for word in current_row:
            components = word.split()

            # Check against all previously seen components (current row + global)
            duplicate = any(comp in global_components for comp in components)

            if not duplicate:
                filtered_words.append(word)
                global_components.update(components)
                words_kept += 1

            # Stop processing this row once 15 words have been kept
            if words_kept >= 15:
                break

        # Update the row (it may hold fewer than 15 words at this point)
        all_rows[i] = filtered_words

    # Final truncation so that no row holds more than 15 words
    for i in range(len(all_rows)):
        all_rows[i] = all_rows[i][:15]

    # Update dataframe
    df['Representation'] = all_rows
    return df


#freq = postprocessing(freq)
freq.head(18).to_excel("/content/topic_frequencies_list_reduced.xlsx", index=False)


[('酒店', np.float64(0.015794855617877846)), ('建筑', np.float64(0.014745948348970268)), ('教堂', np.float64(0.014580848859784851)), ('历史', np.float64(0.011622230190892311)), ('博物馆', np.float64(0.009705067700597128)), ('码头', np.float64(0.00899664720879912)), ('景色', np.float64(0.008926034683663018)), ('拍照', np.float64(0.008902108591815238)), ('旅游', np.float64(0.008396801633723364)), ('大三巴牌坊', np.float64(0.00834930939246645))]


In [46]:
topic_15_words = topic_model.get_topic(15)
print(topic_15_words)

False


In [None]:
# generate topic label using gpt-4.1
topic_model.update_topics(docs_chinese_focus, representation_model=representation_model3)  # change the documents here

freq = topic_model.get_topic_info();
freq.head(100)
freq.head(100).to_excel("/content/topic_frequencies_label.xlsx", index=False)

100%|██████████| 67/67 [03:37<00:00,  3.24s/it]


In [None]:
# save model
from datetime import datetime

embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
topic_model.save("/content/", serialization="safetensors", save_ctfidf=True, save_embedding_model=embedding_model)


Topic -1 refers to all outliers and should typically be ignored. Next, let's take a look at the most frequent topic that was generated:

In [None]:
topic_model.get_topic(0)  # Select the most frequent topic

[('macau', np.float64(0.05101931384878174)),
 ('ferry', np.float64(0.02032461515610029)),
 ('hotel', np.float64(0.019651222903642587)),
 ('day', np.float64(0.016622899721410157)),
 ('kong', np.float64(0.015330582843296867)),
 ('hk', np.float64(0.015315618241245865)),
 ('hong', np.float64(0.015265578158339437)),
 ('hong kong', np.float64(0.015132633289210357)),
 ('time', np.float64(0.014688443841489216)),
 ('travel', np.float64(0.014398745912400347))]

**NOTE**: BERTopic is stochastic, which means that the topics might differ across runs. This is mostly due to the stochastic nature of UMAP.

## Attributes

There are a number of attributes that you can access after having trained your BERTopic model:


| Attribute              | Description |
|------------------------|-------------|
| topics_                | The topics that are generated for each document after training or updating the topic model. |
| probabilities_         | The probabilities that are generated for each document if HDBSCAN is used. |
| topic_sizes_           | The size of each topic. |
| topic_mapper_          | A class for tracking topics and their mappings anytime they are merged/reduced. |
| topic_representations_ | The top *n* terms per topic and their respective c-TF-IDF values. |
| c_tf_idf_              | The topic-term matrix as calculated through c-TF-IDF. |
| topic_labels_          | The default labels for each topic. |
| custom_labels_         | Custom labels for each topic as generated through `.set_topic_labels`. |
| topic_embeddings_      | The embeddings for each topic if `embedding_model` was used. |
| representative_docs_   | The representative documents for each topic if HDBSCAN is used. |

For example, to access the predicted topics for the first 10 documents, we simply run the following:

In [None]:
topic_model.topics_[:10]
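
Because `topics_` holds one topic id per document, ordinary Python tools are enough to summarize it. The sketch below is only an illustration: `example_topics` is a made-up list standing in for `topic_model.topics_`. It counts the documents per topic and reports the share of outliers (topic -1).

```python
from collections import Counter

# Hypothetical stand-in for topic_model.topics_: one topic id per document.
example_topics = [0, 32, -1, -1, 0, 4, 6, -1, 4, 0]

# Count how many documents were assigned to each topic id.
topic_counts = Counter(example_topics)

# Topic -1 collects the outliers, so report it separately.
outlier_share = topic_counts[-1] / len(example_topics)

# Largest non-outlier topics first.
largest_topics = [
    (topic, count)
    for topic, count in topic_counts.most_common()
    if topic != -1
]

print(f"Outlier share: {outlier_share:.0%}")
print("Largest topics:", largest_topics[:3])
```

With a trained model, replacing `example_topics` by `topic_model.topics_` should give the same kind of summary.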

# **Visualization**
There are several visualization options available in BERTopic, namely the visualization of topics, probabilities and topics over time. Topic modeling is, to a certain extent, quite subjective. Visualizations help you understand the topics that were created.

## Visualize Topics
After having trained our `BERTopic` model, we can iteratively go through perhaps a hundred topics to get a good
understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation.
Instead, we can visualize the topics that were generated in a way very similar to
[LDAvis](https://github.com/cpsievert/LDAvis):

In [None]:
topic_model.visualize_topics()

## Visualize Topic Probabilities

The variable `probabilities` that is returned from `transform()` or `fit_transform()` can
be used to understand how confident BERTopic is that certain topics can be found in a document.

To visualize the distributions, we simply call:

In [None]:
topic_model.visualize_distribution(probs[200], min_probability=0.015)
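
To get a feel for what a `min_probability` threshold does, here is a minimal sketch on made-up numbers. It assumes that topics below the threshold are simply left out of the chart; `doc_probs` is a hypothetical probability vector for a single document, not output from the model above.

```python
# Hypothetical topic probabilities for one document (index = topic id).
doc_probs = [0.40, 0.010, 0.20, 0.005, 0.03]
min_probability = 0.015

# Keep only topics whose probability clears the threshold, strongest first.
visible = sorted(
    ((topic, p) for topic, p in enumerate(doc_probs) if p >= min_probability),
    key=lambda pair: pair[1],
    reverse=True,
)

for topic, p in visible:
    print(f"Topic {topic}: {p:.3f}")
```

Raising the threshold hides more of the weak topics, which keeps the chart readable for documents that touch on many topics only slightly.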

## Visualize Topic Hierarchy

The topics that were created can be hierarchically reduced. In order to understand the potential hierarchical structure of the topics, we can use `scipy.cluster.hierarchy` to create clusters and visualize how they relate to one another. This might help when selecting an appropriate `nr_topics` for reducing the number of topics that you have created.

In [None]:
topic_model.visualize_hierarchy(top_n_topics=50)

## Visualize Terms

We can visualize the selected terms for a few topics by creating bar charts out of the c-TF-IDF scores for each topic representation. Insights can be gained from the relative c-TF-IDF scores between and within topics. Moreover, you can easily compare topic representations to each other.

In [None]:
topic_model.visualize_barchart(top_n_topics=5)
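
To see where such scores come from, the sketch below computes a simplified class-based TF-IDF on a toy corpus. It assumes the commonly documented form, term frequency within a topic multiplied by `log(1 + A / f)`, where `A` is the average number of words per topic and `f` is the frequency of the term across all topics. The data and the helper `c_tf_idf` are hypothetical, and BERTopic's own implementation differs in details such as normalization.

```python
import math
from collections import Counter

# Hypothetical tokenized documents, already grouped by topic (class).
docs_per_topic = {
    0: ["space", "nasa", "orbit", "space", "moon"],
    1: ["car", "dealer", "car", "owner", "space"],
}

# Term frequencies within each topic and across all topics.
tf_per_topic = {topic: Counter(tokens) for topic, tokens in docs_per_topic.items()}
tf_total = Counter()
for counts in tf_per_topic.values():
    tf_total.update(counts)

# A: average number of words per topic.
avg_words = sum(len(tokens) for tokens in docs_per_topic.values()) / len(docs_per_topic)

def c_tf_idf(topic):
    """Score each term in a topic: normalized tf times log(1 + A / total tf)."""
    n_words = len(docs_per_topic[topic])
    return {
        term: (count / n_words) * math.log(1 + avg_words / tf_total[term])
        for term, count in tf_per_topic[topic].items()
    }

for topic in docs_per_topic:
    ranked = sorted(c_tf_idf(topic).items(), key=lambda item: item[1], reverse=True)
    print(topic, [(term, round(score, 3)) for term, score in ranked[:3]])
```

Note how "space" scores lower in topic 1 than "dealer", even though both occur once there: a term that is frequent across all topics is penalized.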

## Visualize Topic Similarity
Having generated topic embeddings, through both c-TF-IDF and embeddings, we can create a similarity matrix by computing the cosine similarity between those topic embeddings. The result will be a matrix indicating how similar certain topics are to each other.

In [None]:
topic_model.visualize_heatmap(n_clusters=20, width=1000, height=1000)
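
The similarity matrix behind such a heatmap can be sketched in plain Python. The three-dimensional `topic_embeddings` below are invented for illustration; real topic embeddings have many more dimensions, and a real workflow would typically use NumPy or scikit-learn instead of hand-written loops.

```python
import math

# Hypothetical 3-dimensional topic embeddings (real ones are much larger).
topic_embeddings = {
    0: [1.0, 0.0, 0.0],
    1: [0.8, 0.6, 0.0],
    2: [0.0, 0.0, 1.0],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Pairwise similarities, with rows and columns ordered by topic id.
topic_ids = sorted(topic_embeddings)
similarity_matrix = [
    [cosine_similarity(topic_embeddings[i], topic_embeddings[j]) for j in topic_ids]
    for i in topic_ids
]

for topic_id, row in zip(topic_ids, similarity_matrix):
    print(topic_id, [round(value, 2) for value in row])
```

Topics 0 and 1 point in a similar direction and score 0.8, while topic 2 is orthogonal to topic 0 and scores 0.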

## Visualize Term Score Decline
Topics are represented by a number of words starting with the best representative word. Each word is represented by a c-TF-IDF score. The higher the score, the more representative the word is of the topic. Since the topic words are sorted by their c-TF-IDF score, the scores slowly decline with each word that is added. At some point, adding words to the topic representation only marginally increases the total c-TF-IDF score and is no longer beneficial for its representation.

To visualize this effect, we can plot the c-TF-IDF scores for each topic by the term rank of each word. In other words, the position of the words (term rank), where the word with the highest c-TF-IDF score has a rank of 1, is put on the x-axis, whereas the y-axis is populated by the c-TF-IDF scores. The result is a visualization that shows the decline of the c-TF-IDF score when adding words to the topic representation. It allows you, using the elbow method, to select the best number of words in a topic.


In [None]:
topic_model.visualize_term_rank()
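
The elbow can also be approximated numerically. The sketch below uses made-up c-TF-IDF scores for a single topic and looks for the largest drop between consecutive ranks. This is a crude stand-in for reading the elbow off the plot, not a method provided by BERTopic.

```python
# Hypothetical c-TF-IDF scores for one topic, sorted from rank 1 downwards.
scores = [0.050, 0.042, 0.036, 0.012, 0.010, 0.009, 0.008]

# Drop in score when moving from each rank to the next one.
drops = [scores[i] - scores[i + 1] for i in range(len(scores) - 1)]

# The largest drop marks a candidate elbow; ranks start at 1.
largest_drop_index = max(range(len(drops)), key=drops.__getitem__)
elbow_rank = largest_drop_index + 1

print(f"Largest drop occurs after rank {elbow_rank}")
print(f"Words kept: {elbow_rank} of {len(scores)}")
```

Here the scores fall sharply after the third word, so keeping three words captures most of the topic's weight in this toy example.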

# **Topic Representation**
After having created the topic model, you might not be satisfied with some of the parameters you have chosen. Fortunately, BERTopic allows you to update the topics after they have been created.

This allows for fine-tuning the model to your specifications and wishes.

## Update Topics
When you have trained a model and viewed the topics and the words that represent them,
you might not be satisfied with the representation. Perhaps you forgot to remove
stopwords or you want to try out a different `n_gram_range`. We can use the function `update_topics` to update
the topic representation with new parameters for `c-TF-IDF`:


In [None]:
topic_model.update_topics(docs, n_gram_range=(1, 2))

In [None]:
topic_model.get_topic(0)   # We select the topic that we viewed before

[('space', 0.011119596146117955),
 ('nasa', 0.0047697533973351915),
 ('shuttle', 0.0044533985251824495),
 ('orbit', 0.004129278694477752),
 ('spacecraft', 0.004011023125258004),
 ('satellite', 0.003783732360211832),
 ('moon', 0.003639954930862572),
 ('lunar', 0.0034753177228921146),
 ('the moon', 0.002821040122532999),
 ('mars', 0.0028033947303940923)]
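
Two-word entries such as 'the moon' appear in this output because `n_gram_range=(1, 2)` makes both single words and pairs of adjacent words eligible as topic terms. The helper `ngrams` below is hypothetical and only illustrates how such candidate terms are formed from a token sequence.

```python
def ngrams(tokens, n_gram_range=(1, 2)):
    """Return all contiguous n-grams whose length lies within n_gram_range."""
    low, high = n_gram_range
    result = []
    for n in range(low, high + 1):
        for start in range(len(tokens) - n + 1):
            result.append(" ".join(tokens[start:start + n]))
    return result

tokens = ["the", "moon", "landing"]
print(ngrams(tokens, (1, 1)))  # ['the', 'moon', 'landing']
print(ngrams(tokens, (1, 2)))  # ['the', 'moon', 'landing', 'the moon', 'moon landing']
```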

## Topic Reduction
We can also reduce the number of topics after having trained a BERTopic model. The advantage of doing so
is that you can decide on the number of topics after knowing how many were actually created. It is difficult to
predict before training your model how many topics are in your documents and how many will be extracted.
Instead, we can decide afterwards how many topics seem realistic:





In [None]:
topic_model.reduce_topics(docs, nr_topics=60)

2021-10-17 06:05:02,666 - BERTopic - Reduced number of topics from 220 to 61


In [None]:
# Access the newly updated topics with:
print(topic_model.topics_)

[0, 32, -1, -1, -1, -1, -1, 0, 0, -1, -1, -1, -1, -1, -1, 14, -1, -1, -1, 4, 6, -1, -1, 4, 0, -1, -1, -1, -1, 20, -1, 48, 5, 0, 25, 11, 24, -1, 4, -1, -1, 23, 51, -1, 0, -1, -1, 7, 1, 5, -1, -1, 48, 1, -1, 4, -1, -1, -1, -1, 0, -1, -1, -1, -1, -1, 0, -1, -1, -1, -1, 19, 35, -1, -1, -1, 0, 37, -1, 0, 54, 6, 58, 51, 30, -1, -1, 19, -1, -1, 0, 2, 45, 32, -1, -1, 21, -1, -1, -1, -1, -1, -1, 3, 2, 0, -1, -1, -1, -1, 11, -1, -1, 15, 14, -1, -1, -1, 0, 5, 5, -1, -1, -1, 9, -1, -1, 2, -1, -1, -1, -1, -1, 0, 55, 2, 21, 1, -1, -1, -1, 11, 31, -1, -1, -1, 11, 10, 0, 7, 4, 4, 55, -1, -1, -1, -1, 3, 8, 16, -1, 2, -1, -1, -1, -1, -1, -1, -1, 25, 24, -1, 28, -1, -1, -1, 0, -1, 4, 0, 6, 0, -1, -1, -1, 2, -1, 31, -1, -1, 5, -1, 2, 13, -1, -1, 14, -1, -1, 10, -1, -1, -1, -1, -1, -1, 18, -1, 53, 13, 4, 44, -1, 5, 4, -1, -1, -1, -1, 2, 0, 34, -1, 6, 1, -1, 1, 4, -1, 0, -1, -1, -1, 40, 0, -1, -1, 0, 0, -1, -1, -1, 25, 39, -1, 7, 2, 6, -1, 29, -1, -1, 40, -1, 23, -1, -1, -1, -1, -1, -1, -1, -1, -1, 32, 21, 

# **Search Topics**
After having trained our model, we can use `find_topics` to search for topics that are similar
to an input `search_term`. Here, we are going to search for topics that closely relate to the
search term "vehicle". Then, we extract the most similar topic and check the results:

In [None]:
similar_topics, similarity = topic_model.find_topics("vehicle", top_n=5); similar_topics

[71, 45, 77, 9, 56]

In [None]:
topic_model.get_topic(71)

[('car', 0.03740731827314482),
 ('the car', 0.027790363401304377),
 ('dealer', 0.013837911908704722),
 ('the dealer', 0.009515109324321468),
 ('owner', 0.008430722097917726),
 ('previous owner', 0.008157988442865012),
 ('cars', 0.005827046491488879),
 ('the odometer', 0.00514870077683653),
 ('bought car', 0.004667512506960727),
 ('car with', 0.004498685875558186)]

# **Model serialization**
The model and its internal settings can easily be saved. Note that the documents and embeddings will not be saved. However, UMAP and HDBSCAN will be saved.

In [None]:
# Save model
topic_model.save("my_model")

In [None]:
# Load model
my_model = BERTopic.load("my_model")

# **Embedding Models**
The parameter `embedding_model` takes in a string pointing to a sentence-transformers model, a SentenceTransformer, or a Flair DocumentEmbedding model.

## Sentence-Transformers
You can select any model from sentence-transformers here and pass it through BERTopic with `embedding_model`:



In [None]:
topic_model = BERTopic(embedding_model="xlm-r-bert-base-nli-stsb-mean-tokens")

Or select a SentenceTransformer model with your own parameters:


In [None]:
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("distilbert-base-nli-mean-tokens", device="cpu")
topic_model = BERTopic(embedding_model=sentence_model, verbose=True)

Click [here](https://www.sbert.net/docs/pretrained_models.html) for a list of supported sentence transformers models.  
