<a href="https://colab.research.google.com/github/gracek904/twitter-io/blob/main/BERTopic_Implementation_Final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This code installs the required Python libraries for performing topic modeling using the BERTopic framework.

In [None]:
!pip install bertopic
!pip install sentence-transformers
!pip install umap-learn
!pip install hdbscan

Collecting bertopic
  Downloading bertopic-0.17.0-py3-none-any.whl.metadata (23 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers>=0.4.1->bertopic)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers>=0.4.1->bertopic)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers>=0.4.1->bertopic)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=1.11.0->sentence-transformers>=0.4.1->bertopic)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>=1.11.0->sentence-transformers>=0.4.1->bertopic)
  Downloa

Importing required libraries.

In [None]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer
import re
import string
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

Downloading necessary NLTK data

In [None]:
# Download necessary NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('words')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

This code segment is part of the preprocessing pipeline for cleaning and tokenizing tweets. Here's what each part does:

**1. remove_hyperlinks_marks_styles(tweet)**
This function takes a tweet as input and removes the leading retweet marker (`RT`), hyperlinks, and the `#` symbol (the hashtag text itself is kept). It uses regular expressions (`re.sub`) for pattern matching and substitution.

**2. TweetTokenizer Initialization**

This initializes an instance of NLTK's TweetTokenizer, which is specifically designed for tokenizing tweets. The parameters control how the tokenizer behaves:
- preserve_case=False: Converts all text to lowercase for consistency (e.g., "Hello" becomes "hello").

- strip_handles=True: Removes Twitter handles (e.g., "@user").

- reduce_len=True: Shortens elongated words by collapsing any run of three or more identical characters to exactly three (e.g., "soooo" becomes "sooo").

**3. tokenize_tweet(tweet) Function**

This function takes a tweet as input and uses the TweetTokenizer instance to split it into individual tokens (words or symbols).
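
To make the three tokenizer options concrete, here is a rough standard-library approximation of their effect. It is only a sketch, not NLTK's implementation: `approx_tweet_normalize` is a hypothetical helper, the handle pattern is simplified, and splitting on whitespace is much cruder than what `TweetTokenizer` really does. The `reduce_len` line follows the rule NLTK documents (runs of three or more identical characters are cut to exactly three).

```python
import re

def approx_tweet_normalize(text):
    """Hypothetical helper: mimic strip_handles, reduce_len and preserve_case=False."""
    text = re.sub(r'@\w{1,15}', '', text)          # strip_handles (simplified pattern)
    text = re.sub(r'(.)\1{2,}', r'\1\1\1', text)   # reduce_len: 3+ repeats -> exactly 3
    return text.lower().split()                    # lowercase, naive whitespace split

print(approx_tweet_normalize("@user This is soooooo COOOOOL"))
# ['this', 'is', 'sooo', 'coool']
```

Elongated words are shortened but not fully normalized, so "sooo" and "so" still count as different tokens.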



In [None]:
# Define preprocessing functions
def remove_hyperlinks_marks_styles(tweet):
    new_tweet = re.sub(r'^RT[\s]+', '', tweet)
    # Remove each URL up to the next whitespace; a greedy '.*' would also delete any text after the link
    new_tweet = re.sub(r'https?://\S+', '', new_tweet)
    new_tweet = re.sub(r'#', '', new_tweet)
    return new_tweet

tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)

def tokenize_tweet(tweet):
    return tokenizer.tokenize(tweet)
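
The URL pattern deserves some care, because a greedy `.*` and a whitespace-bounded `\S+` behave differently when text follows the link. The sketch below compares the two on a made-up tweet; the sample text and the `example.com` URL are illustrative assumptions, not data from the notebook.

```python
import re

sample = "RT Check this https://example.com/page and tell me #news"

# Greedy variant: '.*' consumes everything from the URL to the end of the line
greedy = re.sub(r'https?://.*[\r\n]*', '', sample)

# Bounded variant: '\S+' stops at the first whitespace, so later words survive
bounded = re.sub(r'https?://\S+', '', sample)

print(repr(greedy))   # 'RT Check this '
print(repr(bounded))  # 'RT Check this  and tell me #news'
```

When links sit at the end of a tweet both patterns give the same result; they only diverge when words follow the URL.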

This code segment is part of a preprocessing pipeline for cleaning, tokenizing, and stemming tweets to prepare them for analysis (e.g., topic modeling). Here's a detailed explanation of each function and its purpose:

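**1. stopwords_english, additional_stopwords, and punctuations:**

Covers a list of common English stopwords, additional stopwords specific to tweets (rt, http, https, amp, //), and the ASCII punctuation characters from `string.punctuation`.

**2. remove_stopwords_punctuations(tweet_tokens):**

Cleans the tokenized words: a token is kept only if it is purely alphabetic, does not contain "http", and appears in neither stopword list.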

**3. get_stem(tweets_clean):**

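Applies stemming to each word in the cleaned tokens using the Porter Stemmer. Stemming reduces words to their root form (e.g., "running" becomes "run"). Note that `process_tweet` does not call this function, so the tweets passed to BERTopic are not stemmed.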

**4. process_tweet(tweet):**

Processes a single tweet through multiple steps:
- Remove hyperlinks, retweet markers, and hashtag symbols
- Tokenize the tweet
- Clean the tokens
- Reconstruct the cleaned tokens into a sentence

**5. contains_follow(tweet):**

Returns 1 if the tweet contains the substring "follow" (which also matches "following", "followers", "unfollow", and so on) and 0 otherwise.

In [None]:
stopwords_english = stopwords.words('english')
additional_stopwords = ['rt', 'http', 'https', 'amp', '//']
punctuations = string.punctuation

def remove_stopwords_punctuations(tweet_tokens):
    tweets_clean = []
    for word in tweet_tokens:
        if word not in additional_stopwords and "http" not in word and word.isalpha() and word not in stopwords_english and word not in punctuations:
            tweets_clean.append(word)
    return tweets_clean

stemmer = PorterStemmer()

def get_stem(tweets_clean):
    return [stemmer.stem(word) for word in tweets_clean]

def process_tweet(tweet):
    processed_tweet = remove_hyperlinks_marks_styles(tweet)
    tweet_tokens = tokenize_tweet(processed_tweet)
    tweets_clean = remove_stopwords_punctuations(tweet_tokens)
    final_tweet = TreebankWordDetokenizer().detokenize(tweets_clean)
    return final_tweet

def contains_follow(tweet):
    # A single substring test is enough: "following", "follower", "followed" all contain "follow"
    return 1 if "follow" in tweet else 0
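
The token filter can be tried without NLTK. The sketch below is a simplified stand-in: `STOPWORDS` is a tiny hypothetical set rather than NLTK's English list, and `keep_token` is an illustrative name. It shows which kinds of tokens the conditions reject, and that `str.isalpha()` already rules out punctuation.

```python
import string

# Tiny hypothetical stopword set, standing in for NLTK's English list plus the tweet extras
STOPWORDS = {"the", "is", "a", "of", "rt", "amp"}

def keep_token(word):
    """Keep purely alphabetic tokens that are not URLs fragments or stopwords."""
    return word.isalpha() and "http" not in word and word not in STOPWORDS

tokens = ["the", "oil", "price", "is", "up", "10", "%", "!", "amp", "covid19", "exports"]
kept = [t for t in tokens if keep_token(t)]
print(kept)  # ['oil', 'price', 'up', 'exports']

# Every ASCII punctuation character fails isalpha(), so a separate punctuation check is redundant
print(all(not ch.isalpha() for ch in string.punctuation))  # True
```

Note that mixed tokens such as "covid19" are dropped as well, which may or may not be desirable for a given dataset.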

This code segment is responsible for loading, filtering, preprocessing, and preparing the tweets for analysis. Here are the steps:

**1. Load the Data**:

Reads the tweets from a CSV file into a pandas DataFrame.

**2. Filter the Data:**

Keeps only the relevant tweets: English-language tweets that are not retweets.

**3. Extract Tweet Text:**

Converts the tweet_text column of the filtered DataFrame into a list of strings

**4. Preprocess Tweets**:

Applies the process_tweet() function to each tweet in the list to clean and preprocess it.

**5. Filter Tweets Based on "Follow" Words**:

Drops every tweet for which `contains_follow()` returns 1, i.e. tweets that contain the word "follow" or a variation of it.
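
The last step can be sketched on a toy list. The tweets below are made up for illustration, and the helper is a minimal substring-based version of the check; it shows that the filter also catches words that merely contain "follow".

```python
def contains_follow(tweet):
    """Return 1 if the substring 'follow' occurs anywhere in the tweet, else 0."""
    return 1 if "follow" in tweet else 0

# Hypothetical, already-preprocessed tweets
toy_tweets = [
    "follow back please",
    "oil exports cut",
    "new followers welcome",
    "unfollowed today",
]

kept = [t for t in toy_tweets if not contains_follow(t)]
print(kept)  # ['oil exports cut']
```

Because the test is a plain substring match, "unfollowed" is removed too; a stricter rule would need word-level matching.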

In [None]:
# Load and preprocess the data
df = pd.read_csv('/content/iran.csv')
df = df[(df.tweet_language == 'en') & (df.is_retweet == 0)]

# Optional: work on a random 10% sample instead of the full dataset.
# Uncomment the next two lines to enable sampling.
# sampled_df = df.sample(frac=0.10, random_state=42)
# df = sampled_df

tweets = df.tweet_text.tolist()

processed_tweets = [process_tweet(tweet) for tweet in tweets]
final_tweets = [tweet for tweet in processed_tweets if not contains_follow(tweet)]


Columns (15,19) have mixed types. Specify dtype option on import or set low_memory=False.



In [None]:
print(len(final_tweets))

This code segment applies the BERTopic model to the preprocessed tweets (`final_tweets`) to perform topic modeling. By default, BERTopic uses transformer-based embeddings (via sentence-transformers) to represent textual data in a high-dimensional space. It then applies dimensionality reduction (via UMAP) and clustering (via HDBSCAN) to group similar tweets into topics. Though this implementation uses the default settings, you can customize parameters such as the embedding model, the vectorizer model, `nr_topics`, and more to tailor the topic modeling process.

The `topics, probs = topic_model.fit_transform(final_tweets)` line fits the BERTopic model to the dataset `final_tweets` and assigns each tweet to a topic. The input is a list of preprocessed tweets. `topics` is a list of integers, one per tweet, giving the topic assigned to that tweet. `probs` holds, for each tweet, a probability indicating how strongly it belongs to its assigned topic. Higher probabilities indicate higher confidence in the assignment.
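
A quick way to inspect these two outputs is to count assignments and look at the share of outliers. The sketch below uses small made-up `topics` and `probs` lists in place of the real model output, and the 0.7 cut-off is an arbitrary illustrative value.

```python
from collections import Counter

# Hypothetical outputs for eight tweets; -1 marks tweets left unassigned (outliers)
topics = [-1, 0, 0, 1, -1, 2, 0, -1]
probs = [0.0, 0.93, 0.71, 0.88, 0.0, 0.64, 0.55, 0.0]

counts = Counter(topics)
outlier_share = counts[-1] / len(topics)

# Keep only confident, non-outlier assignments (threshold chosen for illustration)
confident = [(t, p) for t, p in zip(topics, probs) if t != -1 and p >= 0.7]

print(counts.most_common())
print(f"outlier share: {outlier_share:.3f}")
print(confident)
```

On real data a large outlier share is common with the default settings, so it is worth checking before interpreting the topics.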

How it works internally:

**1. Text Embedding:**

Each tweet in final_tweets is converted into a numerical vector using a transformer-based embedding model (e.g., SBERT or other sentence-transformers models).

These embeddings capture semantic meaning, allowing similar tweets to have similar representations.

**2. Dimensionality Reduction:**

UMAP reduces the high-dimensional embeddings into a lower-dimensional space for efficient clustering.

**3. Clustering:**

HDBSCAN groups similar embeddings into clusters, where each cluster represents a topic.

Tweets that don't fit well into any cluster are labeled as noise (Topic -1).

**4. Topic Representation:**

For each cluster (topic), BERTopic identifies the most representative words by scoring each word's importance within that cluster relative to the other clusters (class-based TF-IDF, or c-TF-IDF).
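
The word-scoring idea behind step 4 can be sketched in plain Python. This is a simplified version of the class-based TF-IDF weighting described in the BERTopic documentation, `tf * log(1 + A / f)`, where `A` is the average number of words per cluster and `f` is a word's total frequency across clusters. The toy clusters and the function name `c_tf_idf` are assumptions for illustration, and the library's own implementation includes further details (such as normalization) that are omitted here.

```python
import math
from collections import Counter

# Hypothetical clusters of already-cleaned tweets
clusters = {
    0: ["oil exports cut", "oil price zero exports"],
    1: ["iraq protests", "iraq protests baghdad streets"],
}

def c_tf_idf(clusters):
    """Score each word per cluster, treating a whole cluster as one document."""
    tf = {c: Counter(" ".join(docs).split()) for c, docs in clusters.items()}
    total = Counter()
    for counts in tf.values():
        total.update(counts)
    avg_words = sum(total.values()) / len(tf)  # average word count per cluster
    return {
        c: {w: n * math.log(1 + avg_words / total[w]) for w, n in counts.items()}
        for c, counts in tf.items()
    }

scores = c_tf_idf(clusters)
top = {c: sorted(s, key=s.get, reverse=True)[:2] for c, s in scores.items()}
print(top)
```

Words that are frequent inside one cluster but rare overall receive the highest scores, which is why they end up in the topic name.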

In [None]:
"""
# Create a custom vectorizer
vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english", max_features=10000)

# Initialize and fit BERTopic model
topic_model = BERTopic(vectorizer_model=vectorizer_model, min_topic_size=20, nr_topics="auto")
topics, probs = topic_model.fit_transform(final_tweets)
"""

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(final_tweets)


In [None]:
topic_info = topic_model.get_topic_info()
print(topic_info)
topic_model.get_topics()

     Topic  Count                                          Name  \
0       -1  22489         -1_palestine_palestinian_world_regime   
1        0    992              0_jews_jewish_politicians_israel   
2        1    869                        1_oil_exports_zero_cut   
3        2    641             2_iraqi_iraq_iraqprotests_iraqwar   
4        3    549              3_iran_iranian_regime_revolution   
..     ...    ...                                           ...   
945    944     10               944_pic_painful_tigrai_painfull   
946    945     10  945_slogan_grope_principles_entrepreneurship   
947    946     10             946_withdrawal_putin_syria_accept   
948    947     10             947_tramp_grudging_spoil_trickery   
949    948     10       948_exemptions_tomorrow_republics_eight   

                                        Representation  \
0    [palestine, palestinian, world, regime, people...   
1    [jews, jewish, politicians, israel, lobbies, m...   
2    [oil, exports, z

{-1: [('palestine', np.float64(0.001341399393453874)),
  ('palestinian', np.float64(0.001270566684520538)),
  ('world', np.float64(0.0012621212905940858)),
  ('regime', np.float64(0.0012470617920239233)),
  ('people', np.float64(0.0012367074772645897)),
  ('palestinians', np.float64(0.0012241427512310334)),
  ('syria', np.float64(0.0012213508973008453)),
  ('countries', np.float64(0.0011837633128169932)),
  ('united', np.float64(0.001179560302118478)),
  ('iranian', np.float64(0.0011744454624880375))],
 0: [('jews', np.float64(0.013666539042010662)),
  ('jewish', np.float64(0.010976091340879183)),
  ('politicians', np.float64(0.01064956672622692)),
  ('israel', np.float64(0.010260216167149585)),
  ('lobbies', np.float64(0.007130208128956857)),
  ('media', np.float64(0.006668665586248889)),
  ('israeli', np.float64(0.006133985971377528)),
  ('outlets', np.float64(0.006000343327573937)),
  ('fake', np.float64(0.005603714831530253)),
  ('handful', np.float64(0.0052580254375791035))],
 1: 

This code uses a zero-shot classification model (facebook/bart-large-mnli) from the Hugging Face Transformers library to identify race-related topics generated by BERTopic. For each topic (excluding outliers), the model evaluates the topic's representative keywords (topic name) and classifies it as either "race-related" or "not race-related" based on semantic meaning. Only topics classified as "race-related" with a confidence score greater than 0.90 are retained. This approach enables automated, LLM-based filtering of relevant topics without relying on predefined keyword lists.

In [None]:
from transformers import pipeline

#loading zero-shot classification model
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

#getting all topics and their representative words (from BERTopic)
topic_info = topic_model.get_topic_info()
topic_words = {row['Topic']: row['Name'] for _, row in topic_info.iterrows() if row['Topic'] != -1}  # exclude -1 (outliers)

#classifying each topic using zero-shot classification
llm_filtered_topics = []
for topic_id, description in topic_words.items():
    result = classifier(description, candidate_labels=["race-related", "not race-related"])
    label = result['labels'][0]
    confidence = result["scores"][0]
    if label == "race-related" and confidence > 0.90:
        llm_filtered_topics.append((topic_id, description))





Device set to use cpu
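
The filtering rule itself can be checked without downloading the model. In the sketch below, `stub_classifier` is a hypothetical stand-in that returns a dictionary in the same shape as the zero-shot pipeline result (`labels` and `scores` sorted from most to least likely); its scores are invented, and it assumes the target label is passed first. `filter_topics` is an illustrative name, and the topic names are taken from outputs shown in this notebook.

```python
def stub_classifier(text, candidate_labels):
    """Hypothetical stand-in for the pipeline: invented scores, real result shape."""
    score = 0.95 if ("racis" in text or "black" in text) else 0.20
    scored = sorted(zip([score, 1 - score], candidate_labels), reverse=True)
    return {
        "sequence": text,
        "labels": [label for _, label in scored],
        "scores": [s for s, _ in scored],
    }

def filter_topics(topic_words, classify, target="race-related", threshold=0.90):
    """Keep topics whose top label is the target and whose top score exceeds the threshold."""
    kept = []
    for topic_id, description in topic_words.items():
        result = classify(description, candidate_labels=[target, f"not {target}"])
        if result["labels"][0] == target and result["scores"][0] > threshold:
            kept.append((topic_id, description))
    return kept

topic_words = {
    0: "1_oil_exports_zero_cut",
    1: "11_racist_customer_lady_woman",
    2: "16_history_month_blackhistorymonth_blackhistory",
}
print(filter_topics(topic_words, stub_classifier))
```

Passing the classifier as an argument makes it easy to swap the stub for the real pipeline, or to try a different threshold, without touching the loop.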


This section takes the race-related topics previously identified using zero-shot classification and prepares them for visualization and interpretation. Here's a breakdown:
1. **Filter and Sort by Prevalence**: The code filters the DataFrame to keep only race-related topic IDs, sorts them in descending order based on how frequently each topic appears in the dataset, and keeps only the top 10 most prevalent race-related topics.
2. **Extract Topic IDs and Names**: This part of the code converts the top 10 topic IDs into a list for later use in visualization. It also creates a dictionary mapping each topic ID to its human-readable name or representative keywords (Name), for clarity in output and reference.
3. **Print Identified Race-Related Topics**: Prints each selected topic ID together with its name.
4. **Assign Custom Labels for Visualization**: Maps each topic ID to a hand-written label and registers the labels with `topic_model.set_topic_labels()`.

In [None]:
# Sort topics by prevalence (count)
filtered_topic_df = topic_info[topic_info['Topic'].isin([t[0] for t in llm_filtered_topics])]
top_llm_race_topics = filtered_topic_df.nlargest(10, 'Count')

# Get final topic IDs and names
top_10_llm_topic_ids = top_llm_race_topics['Topic'].tolist()
topic_names = dict(zip(top_llm_race_topics['Topic'], top_llm_race_topics['Name']))

print("\nTop 10 Race-Related Topics (LLM-Filtered):")
for topic_id in top_10_llm_topic_ids:
    print(f"Topic {topic_id}: {topic_names[topic_id]}")

#Creating custom labels for figure
#(topic IDs can change whenever the model is re-fitted, so review these keys after each run)
custom_labels = {
    3: "Black Beauty & Empowerment",
    8: "Kobe Bryant",
    11: "Public Racism Incidents",
    13: "Xenophobia",
    14: "Slavery & Reparations",
    15: "Race & COVID-19",
    16: "Black History Month",
    17: "Racism in Politics",
    19: "Black Women Achievements",
    20: "African Pride & Unity"
}

topic_model.set_topic_labels(custom_labels)


Top 10 Race-Related Topics (LLM-Filtered):
Topic 3: 3_blackisbeautiful_blackgirlsrock_melanin_hair
Topic 8: 8_kobe_bryant_kobebryant_basketball
Topic 11: 11_racist_customer_lady_woman
Topic 13: 13_nigeria_nigerians_xenophobia_saynotoxenophobia
Topic 14: 14_slavery_slaves_slave_reparations
Topic 15: 15_coronavirus_virus_china_wuhan
Topic 16: 16_history_month_blackhistorymonth_blackhistory
Topic 17: 17_trump_racist_seats_president
Topic 19: 19_first_american_woman_blackexcellence
Topic 20: 20_africa_africans_continent_african


In [None]:
# Visualize LLM-filtered race-related topics
fig = topic_model.visualize_barchart(
    topics=top_10_llm_topic_ids,
    n_words=5,
    custom_labels=True,
    title="<b>Top 10 Race-Related Topics (LLM Classified)</b>"
)

fig.update_layout(height=700, bargap=0.3)
for annotation in fig.layout.annotations:
    annotation.font.size = 14
fig.update_yaxes(tickfont=dict(size=9))
fig.show()


CSV with tweets mapped to their BERTopic topic

In [None]:
import pandas as pd

# Create a DataFrame of tweets and their assigned topics
tweet_topic_df = pd.DataFrame({
    "tweet": final_tweets,
    "topic": topics
})

# Only keep tweets whose topic is in the top 10 LLM-identified race-related topics
top_race_tweets_df = tweet_topic_df[tweet_topic_df["topic"].isin(top_10_llm_topic_ids)].copy()

# Add custom labels
custom_labels = {
     3: "Black Beauty & Empowerment",
     8: "Kobe Bryant",
     11: "Public Racism Incidents",
     13: "Xenophobia",
     14: "Slavery & Reparations",
     15: "Race & COVID-19",
     16: "Black History Month",
     17: "Racism in Politics",
     19: "Black Women Achievements",
     20: "African Pride & Unity"
 }

top_race_tweets_df["topic_label"] = top_race_tweets_df["topic"].map(custom_labels)

# Save to CSV
output_path = "russia_race_related_tweets.csv"
top_race_tweets_df.to_csv(output_path, index=False, encoding="utf-8")

print(f"Saved {output_path} with tweet, topic ID, and custom label.")


Saved russia_race_related_tweets.csv with tweet, topic ID, and custom label.


Every code segment below is for the keyword approach -- **no longer used**

This code segment identifies the top race-related topics in your dataset, assigns custom labels to those topics, and visualizes them using a bar chart. Here is an explanation of each part:

**1. Define Race-Related Keywords:**

This part creates a comprehensive list of race-related keywords that will be used to filter topics. These keywords are matched against topic names to identify topics related to race.

**2. Get Topic Information:**

Retrieves metadata about all generated topics, including Topic (topic ID), Count (number of tweets assigned to each topic), and Name (representative words for each topic).

**3. Filter Race-Related Topics**

This section identifies topics that include any of the race-related keywords.

**4. Select Top X Most Prevalent Race-Related Topics**

This section sorts the filtered race-related topics by their prevalence (i.e. the number of tweets assigned to each topic) and selects the top X (10 in the code below).

**5. Display the Top Topics**

This section prints the top X most prevalent race-related topics along with their IDs, counts, and representative words.

**6. Define Custom Labels for Topics**

This section creates a dictionary whose keys are topic IDs and whose values are descriptive labels. These labels replace the default numeric IDs in visualizations.

**7. Assign Custom Labels**

This section updates the BERTopic model to use custom labels instead of default numeric IDs or generated names.

**8. Visualize Topics with Custom Labels**

This section creates an interactive bar chart visualization of the top X most prevalent topics. Here are the parameters:
- `topics=top_10_race_topic_ids`: Displays only the selected race-related topics.
- `n_words=5`: Shows the top five representative words for each topic.
- `custom_labels=True`: Uses your custom labels for the topics.
- `title`: Adds a descriptive title to the chart.

The output is a grid of small bar charts, one per topic. Within each chart, every bar is one of the topic's representative words and its length is that word's c-TF-IDF score. The custom label appears as the chart's title.
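
One weakness of keyword filtering is worth seeing in code: matching keywords as substrings of the topic name flags unrelated words that happen to contain a keyword. The sketch below contrasts substring matching with whole-token matching. The shortened keyword list, the helper names, and the topic name `12_embrace_change_hope_love` are illustrative assumptions; the other two names appear in this notebook's output.

```python
race_keywords = ["race", "racist", "white"]  # shortened list for illustration

def substring_match(topic_name):
    """Flag a topic if any keyword occurs anywhere inside the name."""
    return any(k in topic_name.lower() for k in race_keywords)

def token_match(topic_name):
    """Flag a topic only if a keyword equals one of the name's words."""
    tokens = set(topic_name.lower().split("_")[1:])  # drop the leading topic ID
    return any(k in tokens for k in race_keywords)

names = [
    "46_racist_racism_racistinchief_blacks",
    "12_embrace_change_hope_love",   # hypothetical: "embrace" contains "race"
    "75_house_tramp_white_mattis",   # "white" here likely refers to the White House
]
for name in names:
    print(name, substring_match(name), token_match(name))
```

Whole-token matching removes the "embrace" false positive, but neither variant can tell that "white" next to "house" is probably not about race; that kind of ambiguity needs a method that looks at meaning rather than spelling.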

In [None]:
# Define race-related keywords
race_keywords = [
    'race', 'racial', 'racist', 'ethnicity', 'ethnic', 'black', 'white', 'asian', 'latino', 'latina', 'latinx',
    'hispanic', 'african', 'native american', 'indigenous', 'biracial', 'multiracial', 'minority',
    'people of color', 'bipoc', 'discrimination', 'prejudice', 'stereotype', 'diversity', 'inclusion',
    'equity', 'privilege', 'microaggression', 'systemic racism', 'institutional racism', 'colorism',
    'cultural appropriation', 'xenophobia', 'antisemitism', 'islamophobia', 'racial profiling',
    'segregation', 'integration', 'affirmative action', 'intersectionality', 'marginalization',
    'oppression', 'tokenism', 'assimilation', 'acculturation', 'racial identity', 'racial bias',
    'hate crime', 'racial slur', 'racial justice', 'racial equality', 'racial equity', 'racial sensitivity',
    'cultural competence', 'racial trauma', 'racial reconciliation', 'racial disparity',
    'racial discrimination', 'racial harassment', 'racial stereotyping', 'racial representation',
    'racial diversity', 'racial inclusion'
]

# Get the topic information
topic_info = topic_model.get_topic_info()

# Function to check if a topic is race-related
def is_race_related(topic_name):
    return any(keyword in topic_name.lower() for keyword in race_keywords)

# Filter race-related topics
race_related_topics = topic_info[topic_info['Name'].apply(is_race_related)]

# Sort by Count (prevalence) and select top 10
top_10_race_topics = race_related_topics.nlargest(10, 'Count')

# Display the top 10 race-related topics
print("Top 10 Most Prevalent Race-Related Topics:")
print(top_10_race_topics[['Topic', 'Count', 'Name']])

# Extract just the topic IDs from the filtered dataframe
top_10_race_topic_ids = top_10_race_topics['Topic'].tolist()

# Now define custom labels specifically for the top 10 race-related topics
# First, create a dictionary to map each topic to its original name for reference
topic_names = dict(zip(top_10_race_topics['Topic'], top_10_race_topics['Name']))

print("\nTop 10 Race-Related Topic IDs and Names:")
for topic_id in top_10_race_topic_ids:
    print(f"Topic {topic_id}: {topic_names[topic_id]}")

# Now you can create custom labels specifically for these top 10 topics
# Replace these with your desired labels after reviewing the topic contents
custom_labels = {}
for topic_id in top_10_race_topic_ids:
    # Default label is just "Race Topic X" - replace these with meaningful labels
    custom_labels[topic_id] = f"Race Topic {topic_id}"

# Example: If you know the contents of each topic, you can manually assign labels like this:
# (Uncomment and modify as needed after seeing what topics appear in your top 10)
"""
custom_labels = {
    3: "African Heritage and Culture",
    14: "Political Discourse on Race",
    24: "Black Excellence and Representation",
    # Add labels for all other topics in your top_10_race_topic_ids list
}
"""

# Assign custom labels to the topics
topic_model.set_topic_labels(custom_labels)

# After creating the bar chart visualization
fig = topic_model.visualize_barchart(
    topics=top_10_race_topic_ids,  # Only show the filtered race topics
    n_words=5,                     # Number of words per topic
    custom_labels=True,            # Use your custom labels
    title="<b>Top 10 Race-Related Topics</b>"  # Add a descriptive title
)

# Increase space between bars
fig.update_layout(
    height=700,                     # Increase overall height
    bargap=0.3                      # Increase gap between bars
)

# Set the font size of the subplot titles (the topic labels)
for annotation in fig.layout.annotations:
    annotation.font.size = 14

# Set the font size of the y-axis ticks (the topic words)
fig.update_yaxes(
    tickfont=dict(size=9)
)

# Show the updated bar chart
fig.show()

Top 10 Most Prevalent Race-Related Topics:
     Topic  Count                                               Name
0       -1  23783                         -1_racism_regime_oil_white
2        1   1097                         1_lgbt_lgbtq_diversity_gay
6        5    598  5_blackgirlsrock_blackisbeautiful_blackandprou...
19      18    316  18_trumpmeltdown_trumpisatraitor_idiot_trumpis...
47      46    188              46_racist_racism_racistinchief_blacks
48      47    185    47_racism_kickitout_racist_blacktwittermovement
65      64    144                   64_ghana_africa_nigeria_africans
74      73    130                    73_millions_asia_jobless_latino
76      75    128                        75_house_tramp_white_mattis
167    166     66  166_blackhistorymonth_blacksnews_blacklove_bla...

Top 10 Race-Related Topic IDs and Names:
Topic -1: -1_racism_regime_oil_white
Topic 1: 1_lgbt_lgbtq_diversity_gay
Topic 5: 5_blackgirlsrock_blackisbeautiful_blackandproud_blackgirlmagic
Topic 18: 1