# Attempts with BERTopic Modeling

(1) no preprocessing (applies to all attempts)

**RESULT:** all 16 documents were assigned to the outlier topic (-1), so no distinct topics were found

(2) reassigning outliers to try to get more than one topic

**RESULT:** reduce_outliers raised a ValueError; with every document in topic -1, there were no topics to assign the outliers to

(3) swapping in a different clustering algorithm (KMeans) to prevent outliers

**RESULT:** five topics of 2 to 4 documents each, but every topic's top words are dominated by the same stopwords (you, the, and, to, me), so the topics are not distinct enough to support any conclusion

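(4) same clustering algorithm (KMeans), but with stopword exclusion

**RESULT:** topics stayed essentially the same as when stopwords were included; the top words are still stopwords because the cell re-creates the model without the stopword-excluding vectorizer, so the exclusion was never applied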

### Notes from Prof. (1/23)
- DON'T clean/preprocess your data; input it directly into BERTopic to see if that yields better results
    - get data for all albums?
- use spaCy to find part-of-speech (POS) patterns in their lyrics: do they have a signature lyrical pattern that stays consistent throughout their discography? does their less successful album have a different POS pattern than their viral ones?
- compare to top (Billboard?) hits of the genre corresponding to each album: what makes their artistry different from other top artists of the time? (use TF-IDF)

In [1]:
import os
from bertopic import BERTopic

In [2]:
# Define path to folder
folder_path = "/Users/maika/My_Notebooks/DSMA/Final Project/scraped lyrics"

# Initialize an empty list to store documents
documents = []

# Iterate through each file in the folder
for filename in os.listdir(folder_path):
    if filename.endswith(".txt"):  # Assuming lyrics are in .txt files
        file_path = os.path.join(folder_path, filename)
        with open(file_path, "r") as f:
            text = f.read()
            documents.append(text)

In [3]:
# Create a BERTopic model
topic_model = BERTopic()

# Fit the model to documents
topics,_ = topic_model.fit_transform(documents)


In [55]:
# making sure number of files and number of topics are the same
print(f"Number of text files: {len([filename for filename in os.listdir(folder_path) if filename.endswith('.txt')])}")
print(f"Number of topics assigned: {len(topics)}")

Number of text files: 16
Number of topics assigned: 16


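In [56]:
# Make sure the number of topics matches the number of documents
# NOTE: os.listdir() returns every entry in the folder, not only .txt files,
# so this count can exceed len(topics) if the folder holds other entries
# (for example hidden files).
if len(topics) != len(os.listdir(folder_path)):
    print("Warning: The number of topics does not match the number of documents!")

# Iterate over the files and topics
# NOTE: i indexes the raw directory listing, while topics is indexed by .txt
# document, so any non-.txt entry shifts later filenames onto the wrong topic
# and can push the last file past the end of topics.
for i, filename in enumerate(os.listdir(folder_path)):
    if filename.endswith(".txt"):
        # Check if we have a valid topic index
        if i < len(topics):
            print(f"{filename}: Topic {topics[i]}")
        else:
            print(f"{filename}: No topic assigned")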

bee_gees_1st.txt: Topic -1
main_course.txt: Topic -1
trafalgar.txt: Topic -1
horizontal.txt: Topic -1
cucumber_castle.txt: Topic -1
this_is_where_i_came_in.txt: Topic -1
idea.txt: Topic -1
rare_precious_and_beautiful_vol2.txt: Topic -1
spirits_having_flown.txt: Topic -1
turn_around_look_at_us.txt: Topic -1
rare_precious_and_beautiful.txt: Topic -1
14_barry_gibb_songs.txt: Topic -1
saturday_night_fever.txt: Topic -1
two_years_on.txt: Topic -1
inception_nostalgia.txt: Topic -1
odessa.txt: No topic assigned
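
The "No topic assigned" line for odessa.txt most likely comes from indexing `topics` by position in the raw directory listing rather than by position among the loaded documents. A minimal sketch of one way to keep filenames and documents aligned is below. The helper names `load_lyrics` and `pair_with_topics`, the sorted ordering, and the UTF-8 encoding are assumptions, not part of the original notebook.

```python
import os


def load_lyrics(folder_path):
    """Return (filenames, documents) for the .txt files, in matching order."""
    filenames = sorted(
        name for name in os.listdir(folder_path) if name.endswith(".txt")
    )
    documents = []
    for name in filenames:
        # encoding="utf-8" is an assumption about the scraped files
        with open(os.path.join(folder_path, name), "r", encoding="utf-8") as f:
            documents.append(f.read())
    return filenames, documents


def pair_with_topics(filenames, topics):
    """Pair each filename with its topic; fail loudly if the lengths differ."""
    if len(filenames) != len(topics):
        raise ValueError(
            f"{len(filenames)} files but {len(topics)} topic assignments"
        )
    return list(zip(filenames, topics))


# Intended use (not executed here):
# filenames, documents = load_lyrics(folder_path)
# topics, _ = topic_model.fit_transform(documents)
# for filename, topic in pair_with_topics(filenames, topics):
#     print(f"{filename}: Topic {topic}")
```

Because the filename list is built once and filtered before reading, stray entries in the folder no longer affect which topic is printed next to which file.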


In [58]:
# Print each topic and its most frequent words, skipping the outlier topic (-1)
for topic in topic_model.get_topic_freq().Topic.to_list():
    if topic == -1:
        continue
    print(f"Topic {topic}:")
    print(topic_model.get_topic(topic))

# Get the topic assigned to each document
# NOTE: i indexes the raw os.listdir() result, not the .txt documents,
# so filenames and topics can be misaligned here.
for i, filename in enumerate(os.listdir(folder_path)):
    if filename.endswith(".txt"):
        print(f"{filename}: Topic {topics[i]}")

bee_gees_1st.txt: Topic -1
main_course.txt: Topic -1
trafalgar.txt: Topic -1
horizontal.txt: Topic -1
cucumber_castle.txt: Topic -1
this_is_where_i_came_in.txt: Topic -1
idea.txt: Topic -1
rare_precious_and_beautiful_vol2.txt: Topic -1
spirits_having_flown.txt: Topic -1
turn_around_look_at_us.txt: Topic -1
rare_precious_and_beautiful.txt: Topic -1
14_barry_gibb_songs.txt: Topic -1
saturday_night_fever.txt: Topic -1
two_years_on.txt: Topic -1
inception_nostalgia.txt: Topic -1


In [59]:
topic_model.get_topic_info()

| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 16 | -1_you_the_and_to | [you, the, and, to, me, my, in, of, love, be] | [In the morning, when the moon is at its rest ... |
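
All 16 album files ended up in the outlier topic. One likely contributor is the small number of documents: BERTopic's default `min_topic_size` is 10, which leaves little room for clusters to form among 16 album-length texts. A minimal sketch of one possible workaround, splitting each album file into stanza-sized documents, is below. It assumes stanzas are separated by blank lines in the scraped files, which may not hold; `split_into_stanzas` and `min_words` are hypothetical names.

```python
import re


def split_into_stanzas(text, min_words=5):
    """Split one album's lyrics into stanza-sized documents.

    Assumes stanzas are separated by one or more blank lines.
    Stanzas with fewer than `min_words` words are dropped.
    """
    blocks = re.split(r"\n\s*\n", text)
    stanzas = [" ".join(block.split()) for block in blocks]  # collapse whitespace
    return [s for s in stanzas if len(s.split()) >= min_words]


# Intended use (not executed here): build a larger document list from the albums.
# stanza_docs = [s for album in documents for s in split_into_stanzas(album)]
# topics, _ = topic_model.fit_transform(stanza_docs)
```

This only changes the unit of analysis; the text itself is not cleaned or altered beyond whitespace.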


In [4]:
# trying assigning outliers to topics to get more than one topic
from bertopic import BERTopic

# Train your BERTopic model
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(documents)

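# Reduce outliers (this call raises the ValueError shown below: every document
# is in topic -1, so there are no topics to reassign the outliers to)
new_topics = topic_model.reduce_outliers(documents, topics)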
#KEEP OUTPUT CELL

ValueError: Found array with 0 sample(s) (shape=(0, 2025)) while a minimum of 1 is required by check_pairwise_arrays.
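
The error above is consistent with there being no non-outlier topics to compare against (an array with 0 samples). A small guard like the following sketch would skip the call in that situation. The wrapper name `reduce_outliers_if_possible` is hypothetical; it only relies on the `reduce_outliers(documents, topics)` call already used in this cell.

```python
def reduce_outliers_if_possible(topic_model, documents, topics):
    """Reassign outliers only when at least one non-outlier topic exists."""
    if all(topic == -1 for topic in topics):
        # Every document is an outlier, so there is nothing to reassign to.
        return list(topics)
    return topic_model.reduce_outliers(documents, topics)


# Intended use (not executed here):
# new_topics = reduce_outliers_if_possible(topic_model, documents, topics)
```

When every document is in topic -1 the function returns the assignments unchanged, which signals that the clustering step itself needs attention first.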

In [5]:
# trying KMeans as the clustering algorithm to prevent outliers

# Define path to folder
folder_path = "/Users/maika/My_Notebooks/DSMA/Final Project/scraped lyrics"

# Initialize an empty list to store documents
documents = []

# Iterate through each file in the folder
for filename in os.listdir(folder_path):
    if filename.endswith(".txt"):  # lyrics are in .txt files
        file_path = os.path.join(folder_path, filename)
        with open(file_path, "r") as f:
            text = f.read()
            documents.append(text)

from bertopic import BERTopic
from sklearn.cluster import KMeans

cluster_model = KMeans(n_clusters=5)
topic_model = BERTopic(hdbscan_model=cluster_model)

In [6]:
topics,_ = topic_model.fit_transform(documents)
topic_model.get_topic_info()



| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | 0 | 4 | 0_you_the_and_me | [you, the, and, me, my, to, of, love, your, in] | [Here we are In a room full of strangers Stand... |
| 1 | 1 | 4 | 1_the_you_and_to | [the, you, and, to, my, me, in, know, love, that] | [Where are you Night is day and day is night W... |
| 2 | 2 | 3 | 2_the_to_you_and | [the, to, you, and, my, of, in, me, is, it] | [Now, I found that the world is round And of c... |
| 3 | 3 | 2 | 3_you_and_me_the | [you, and, me, the, my, to, tell, why, im, never] | [One year, two years, time goes by People laug... |
| 4 | 4 | 2 | 4_you_to_be_me | [you, to, be, me, dont, the, love, or, and, your] | [Ah, ah, ah, ah.. was a lover, a leader of me... |


In [7]:
topic_model.visualize_barchart()

In [8]:
#trying kmeans with stopword exclusion
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

vectorizer_model = CountVectorizer(stop_words="english")
topic_model = BERTopic(vectorizer_model=vectorizer_model)

# Define path to folder
folder_path = "/Users/maika/My_Notebooks/DSMA/Final Project/scraped lyrics"

# Initialize an empty list to store documents
documents = []

# Iterate through each file in the folder
for filename in os.listdir(folder_path):
    if filename.endswith(".txt"):  # lyrics are in .txt files
        file_path = os.path.join(folder_path, filename)
        with open(file_path, "r") as f:
            text = f.read()
            documents.append(text)

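cluster_model = KMeans(n_clusters=5)
# NOTE: this re-creates the model without vectorizer_model, discarding the
# stopword-excluding CountVectorizer defined at the top of this cell, so the
# results below still include stopwords.
topic_model = BERTopic(hdbscan_model=cluster_model)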

In [9]:
topics,_ = topic_model.fit_transform(documents)
topic_model.get_topic_info()

| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | 0 | 5 | 0_you_and_the_me | [you, and, the, me, to, my, in, love, that, of] | [Here I lie in a lost and lonely part of town ... |
| 1 | 1 | 3 | 1_the_to_you_and | [the, to, you, and, my, of, in, me, is, it] | [Now, I found that the world is round And of c... |
| 2 | 2 | 3 | 2_you_the_me_and | [you, the, me, and, my, your, to, on, of, its] | [Well, you can tell by the way I use my walk I... |
| 3 | 3 | 2 | 3_you_to_be_me | [you, to, be, me, dont, the, love, or, and, your] | [Ah, ah, ah, ah.. was a lover, a leader of me... |
| 4 | 4 | 2 | 4_you_and_the_to | [you, and, the, to, my, me, for, in, know, be] | [One year, two years, time goes by People laug... |


In [11]:
topic_model.visualize_barchart()