## Merging Topics with BERTopic

We start by importing the necessary libraries and loading a dataset from the Hugging Face Hub.
- The UMAP class is imported from the umap library. UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction technique that projects high-dimensional data into a lower-dimensional space; BERTopic uses it to reduce document embeddings before clustering.
- The BERTopic class is imported from the bertopic library. BERTopic is a topic modeling technique that leverages pre-trained language models like BERT to extract topics from text data.
- The load_dataset function is imported from the Hugging Face datasets library. It loads datasets hosted on the Hugging Face Hub.
- load_dataset is then called to load the "CShorten/ML-ArXiv-Papers" dataset. The ["train"] index selects the training split of the dataset.

The CShorten/ML-ArXiv-Papers dataset is a collection of machine learning papers from the arXiv repository. It contains various fields such as the paper title, abstract, and other metadata. 

In [1]:
from umap import UMAP
from bertopic import BERTopic
from datasets import load_dataset

dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]

In [2]:
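import unicodedata  # For character normalization

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download the NLTK resources used below (skipped if they are already installed)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)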


def normalize_text(text):
    # Apply Unicode NFKD normalization: decompose accented and compatibility
    # characters (e.g. ligatures) into their base characters
    return unicodedata.normalize('NFKD', text)


# Lemmatize each whitespace-separated word (WordNet lemmatizer, default noun POS)
def lemmatization(text):
    lemmatizer = WordNetLemmatizer()
    return " ".join(lemmatizer.lemmatize(word) for word in text.split())


def clean_text(text, remove_stopwords=False):
    normalized_text = normalize_text(text)

    # Remove punctuation: keep only alphanumeric characters and whitespace
    alnum_only = "".join([char for char in normalized_text if char.isalnum() or char.isspace()])

    # Convert to lowercase
    cleaned_text = alnum_only.lower()

    # Optionally remove stop words (a set makes the membership test fast)
    if remove_stopwords:
        stop_words = set(stopwords.words('english'))
        cleaned_text = " ".join(word for word in cleaned_text.split() if word not in stop_words)

    # Lemmatize text
    return lemmatization(cleaned_text)
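
To make the first two cleaning steps concrete, the sketch below applies NFKD normalization followed by the alphanumeric filter to two short strings, using only the standard library. The helper name strip_non_alnum and the sample strings are assumptions made for illustration; they are not part of the notebook.

```python
import unicodedata


def strip_non_alnum(text):
    # Hypothetical helper mirroring the first two steps of clean_text:
    # NFKD-normalize, then keep only alphanumeric characters and whitespace.
    normalized = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in normalized if ch.isalnum() or ch.isspace())


# NFKD splits an accented letter into a base letter plus a combining mark;
# the combining mark is not alphanumeric, so the filter drops it.
print(strip_non_alnum("na\u00efve Bayes"))   # naive Bayes

# Compatibility characters such as the "fi" ligature (U+FB01) are expanded,
# and the hyphen is deleted, which fuses the two words.
print(strip_non_alnum("\ufb01ne-tuning"))    # finetuning
```

The second case shows a side effect worth keeping in mind: deleting punctuation joins hyphenated terms into a single token.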


In [19]:
# Apply the clean_text function to each element in the 'abstract' column
cleaned_abstracts = [clean_text(text, remove_stopwords=True) for text in dataset['abstract']]


In [20]:
cleaned_abstracts[:5]

['problem statistical learning construct predictor random variable function related random variable x basis iid training sample joint distribution xy allowable predictor drawn specified class goal approach asymptotically performance expected loss best predictor class consider setting one perfect observation xpart sample ypart communicated finite bit rate encoding yvalues allowed depend xvalues suitable regularity condition admissible predictor underlying family probability distribution loss function give informationtheoretic characterization achievable predictor performance term conditional distortionrate function idea illustrated example nonparametric regression gaussian noise',
 'sensor network practice communication among sensor subject to1 error failure random time 3 cost and2 constraint since sensor network operate scarce resource power data rate communication signaltonoise ratio snr usually main factor determining probability error communication failure link probability proxy snr
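
Tokens such as informationtheoretic and signaltonoise in the output above arise because punctuation is deleted rather than replaced, so hyphenated words fuse into one token. The sketch below contrasts that behavior with a possible variant that substitutes a space for each removed run of characters. The helper replace_punctuation_with_space is hypothetical and not part of the notebook, and whether split tokens are preferable depends on the corpus.

```python
import re


def delete_punctuation(text):
    # Same filter as clean_text: drop every character that is neither
    # alphanumeric nor whitespace.
    return "".join(ch for ch in text if ch.isalnum() or ch.isspace())


def replace_punctuation_with_space(text):
    # Hypothetical alternative: collapse each run of non-alphanumeric
    # characters (including whitespace and underscores) into one space.
    return re.sub(r"[\W_]+", " ", text).strip()


sample = "an information-theoretic view of signal-to-noise ratio"
print(delete_punctuation(sample))
# an informationtheoretic view of signaltonoise ratio
print(replace_punctuation_with_space(sample))
# an information theoretic view of signal to noise ratio
```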

In [21]:

# Split the cleaned abstracts into three consecutive subsets of 8,000 documents each
abstracts_1 = cleaned_abstracts[:8000]
abstracts_2 = cleaned_abstracts[8000:16000]
abstracts_3 = cleaned_abstracts[16000:24000]

In [22]:
# Create one topic model per subset. Each model gets its own UMAP instance,
# because a shared instance would be refit by every call to fit().
def make_umap():
    return UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)

topic_model_1 = BERTopic(umap_model=make_umap(), min_topic_size=20).fit(abstracts_1)
topic_model_2 = BERTopic(umap_model=make_umap(), min_topic_size=20).fit(abstracts_2)
topic_model_3 = BERTopic(umap_model=make_umap(), min_topic_size=20).fit(abstracts_3)

In [23]:
# Combine all models into one
merged_model = BERTopic.merge_models([topic_model_1, topic_model_2, topic_model_3])

In [24]:
len(topic_model_1.get_topic_info())

73

In [25]:
len(merged_model.get_topic_info())

76
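
The merged model has only three more topics than topic_model_1 (76 versus 73) because merge_models keeps the topics of the first model and adds a topic from a later model only when it is not similar enough to a topic that is already present. The min_similarity parameter, 0.7 by default, sets that threshold. The following is a simplified sketch of that idea with toy two-dimensional vectors, not BERTopic's actual implementation: the function merge_topic_sets, the vectors, and the topic labels are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    # Cosine of the angle between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def merge_topic_sets(topic_sets, min_similarity=0.7):
    # topic_sets: list of dicts mapping a topic label to its embedding.
    # Hypothetical stand-in for the topic embeddings of fitted models.
    merged = dict(topic_sets[0])  # every topic of the first model is kept
    for topics in topic_sets[1:]:
        existing = list(merged.values())  # topics known before this model
        for label, vector in topics.items():
            best = max(cosine_similarity(vector, other) for other in existing)
            if best < min_similarity:
                merged[label] = vector  # dissimilar enough: add as new topic
    return merged


model_a = {"a_music": (1.0, 0.0), "a_vision": (0.0, 1.0)}
model_b = {"b_audio": (0.9, 0.1), "b_driving": (-1.0, 0.2)}

merged = merge_topic_sets([model_a, model_b])
print(sorted(merged))  # ['a_music', 'a_vision', 'b_driving']
```

In this toy case b_audio points in almost the same direction as a_music and is absorbed, while b_driving is far from both existing topics and is added as a new one.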

In [26]:
merged_model.get_topic_info().tail(5)

| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 71 | 70 | 21 | 70_neuron_synaptic_spike_neuronal | [neuron, synaptic, spike, neuronal, rule, spik... | |
| 72 | 71 | 99 | 71_community_detection_block_sbm | [community, detection, block, sbm, stochastic,... | |
| 73 | 72 | 102 | 24_music_musical_audio_song | [music, musical, audio, song, chord, transcrip... | |
| 74 | 73 | 123 | 63_driver_driving_road_steering | [driver, driving, road, steering, lane, vehicl... | |
| 75 | 74 | 24 | 50_caption_image_captioning_text | [caption, image, captioning, text, visual, gen... | |


In [27]:
merged_model.reduce_topics(abstracts_1 + abstracts_2 + abstracts_3, nr_topics=20, images=None)

<bertopic._bertopic.BERTopic at 0x498a59f10>

In [28]:
merged_model.visualize_topics()
