# Text Analysis and Preprocessing Pipeline

## Notebook Overview

This notebook provides a comprehensive workflow for text data analysis and preprocessing.

- **Data Loading**: Set up the environment, load and explore text datasets, perform statistical analysis, and create reusable data loading functions
- **Preprocessing**: Apply various NLP preprocessing techniques using NLTK and SpaCy, analyze the impact of different methods, and develop a reusable preprocessing pipeline

The notebook works through each step of the text analysis pipeline in order, following the instructions for the second lab.

---

## 1. Data Loading

This section covers the initial setup and data loading phase. We will install necessary libraries, load the dataset, explore its structure and characteristics, perform statistical analysis, filter the data to text-only content, and create reusable functions for future use.

### 1.1 Load and Explore Dataset

Loading the dataset into a pandas DataFrame and performing initial exploration. This includes viewing the first few rows, checking data types, examining the shape of the dataset, and identifying any missing values.

#### 1.1.1 Basic Dataset Overview

In [None]:
import pandas as pd
from datasets import load_dataset

# Load the dataset directly from Hugging Face using the datasets library
dataset = load_dataset("cardiffnlp/tweet_topic_multi", split="train_all")

# Convert to pandas DataFrame
df = dataset.to_pandas()

# Display basic information
print("Row count, Column count:", df.shape)
print("\nColumn Names:", df.columns.tolist())
print("\nData Types:")
print(df.dtypes)
print("\nMissing Values:")
print(df.isnull().sum())
print("\nFirst 10 Rows:")
display(df.head(10))

#### 1.1.2 Initial Data Exploration with NLTK

Using NLTK to perform basic text exploration and tokenization on a sample of tweets.

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from collections import Counter
import string

# Download required NLTK data
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('punkt_tab', quiet=True)

print("\n" + "="*50)
print("INITIAL TEXT EXPLORATION WITH NLTK")
print("="*50)

# Show sample tokenization on a few tweets
print("\nSample Tweet Tokenization:")
sample_text = df['text'].iloc[0]
print(f"Original: {sample_text}")
tokens = word_tokenize(sample_text)
print(f"Tokens: {tokens}")

# Get English stopwords
stop_words = set(stopwords.words('english'))
print(f"\nNumber of English stopwords: {len(stop_words)}")
print(f"Sample stopwords: {list(stop_words)[:10]}")

#### 1.1.3 Initial Data Exploration with spaCy

Using spaCy to perform basic linguistic analysis on sample tweets.

In [None]:
import spacy

# Load spaCy model
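try:
    nlp = spacy.load('en_core_web_sm')
except OSError:
    # spacy.load raises OSError when the model package is not installed
    print("Downloading spaCy model...")
    import subprocess
    import sys
    # Use the current interpreter so the model is installed into the active environment
    subprocess.run([sys.executable, '-m', 'spacy', 'download', 'en_core_web_sm'], check=True)
    nlp = spacy.load('en_core_web_sm')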

print("\n" + "="*50)
print("INITIAL TEXT EXPLORATION WITH spaCy")
print("="*50)

# Analyze a sample tweet
sample_text = df['text'].iloc[5]
doc = nlp(sample_text)

print(f"\nSample Tweet: {sample_text}")
print(f"\nTokens: {[token.text for token in doc]}")
print(f"\nPOS Tags: {[(token.text, token.pos_) for token in doc][:10]}")
print(f"\nEntities: {[(ent.text, ent.label_) for ent in doc.ents]}")

### 1.2 Calculate Statistics

Computing descriptive statistics on the dataset to understand the data distribution. This includes category distributions, temporal patterns, duplicate analysis, text length statistics, word frequency analysis, and named entity statistics using pandas and NLP libraries.

#### 1.2.1 Category/Topic Distribution Analysis

Analyzing how many tweets are available for each topic category using pandas operations.

In [None]:
from collections import Counter

# Convert label_name to lists (they're already arrays, just ensure they're lists)
df['topics_list'] = df['label_name'].apply(lambda x: list(x) if not isinstance(x, list) else x)

# Count tweets per category
all_topics = []
for topics in df['topics_list']:
    all_topics.extend(topics)

topic_counts = Counter(all_topics)

# Create a DataFrame for better visualization
topic_df = pd.DataFrame(
    topic_counts.items(), 
    columns=['Topic', 'Count']
).sort_values('Count', ascending=False)

print("\n" + "="*50)
print("TWEETS PER CATEGORY")
print("="*50)
print(topic_df.to_string(index=False))
print(f"\nTotal unique topics: {len(topic_counts)}")

# Multi-label statistics
df['num_topics'] = df['topics_list'].apply(len)
print(f"\nTweets with 1 topic: {(df['num_topics'] == 1).sum()} ({(df['num_topics'] == 1).sum() / len(df) * 100:.1f}%)")
print(f"Tweets with 2+ topics: {(df['num_topics'] > 1).sum()} ({(df['num_topics'] > 1).sum() / len(df) * 100:.1f}%)")
print(f"\nMax topics per tweet: {df['num_topics'].max()}")
print(f"Average topics per tweet: {df['num_topics'].mean():.2f}")
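
The explicit loop above can also be expressed with pandas alone. The following is a minimal sketch on toy data (the label values and the names `toy`, `topic_counts_toy`, and `multi_label_share` are illustrative, not taken from the dataset); it assumes each row holds a Python list of labels, as `topics_list` does.

```python
import pandas as pd

# Toy stand-in for the real dataset (label values are illustrative)
toy = pd.DataFrame({
    "text": ["tweet a", "tweet b", "tweet c"],
    "topics_list": [["sports"], ["sports", "gaming"], ["music"]],
})

# explode() gives each label its own row, so value_counts() counts labels
topic_counts_toy = toy["topics_list"].explode().value_counts()
print(topic_counts_toy)

# Share of tweets that carry more than one label
multi_label_share = (toy["topics_list"].str.len() > 1).mean()
print(f"Multi-label share: {multi_label_share:.2f}")
```

Both approaches should give the same counts; `explode()` mainly saves the intermediate `all_topics` list.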

#### 1.2.2 Temporal Distribution Analysis

Analyzing the distribution of tweets across different time periods using pandas datetime operations.

In [None]:
# Convert date column to datetime
df['date_parsed'] = pd.to_datetime(df['date'])
df['year'] = df['date_parsed'].dt.year
df['month'] = df['date_parsed'].dt.month
df['day_of_week'] = df['date_parsed'].dt.day_name()

# Count tweets per year
year_counts = df['year'].value_counts().sort_index()

print("\n" + "="*50)
print("TEMPORAL DISTRIBUTION STATISTICS")
print("="*50)

print("\nTweets per Year:")
for year, count in year_counts.items():
    print(f"  {year}: {count:,} tweets ({count/len(df)*100:.1f}%)")

print(f"\nDate range: {df['date_parsed'].min().date()} to {df['date_parsed'].max().date()}")
print(f"Total days covered: {(df['date_parsed'].max() - df['date_parsed'].min()).days} days")

# Month distribution
print("\nTweets per Month (across all years):")
month_counts = df['month'].value_counts().sort_index()
month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
for month, count in month_counts.items():
    print(f"  {month_names[month-1]}: {count:,} tweets")

# Day of week distribution
print("\nTweets per Day of Week:")
dow_counts = df['day_of_week'].value_counts()
for day, count in dow_counts.items():
    print(f"  {day}: {count:,} tweets ({count/len(df)*100:.1f}%)")
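
Because `value_counts()` sorts by frequency, the day-of-week output above is not in calendar order. A small sketch of one way to reorder it, using toy dates in place of `df['date']` (the dates and the name `dow_counts_ordered` are illustrative):

```python
import pandas as pd

# Toy dates standing in for df['date'] (values are illustrative)
dates = pd.to_datetime(pd.Series(["2021-01-04", "2021-01-05", "2021-01-11", "2021-01-10"]))

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# reindex() puts the rows in calendar order; fill_value=0 keeps days that never occur
dow_counts_ordered = dates.dt.day_name().value_counts().reindex(day_order, fill_value=0)
print(dow_counts_ordered)
```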

#### 1.2.3 Duplicate Detection Statistics

Identifying and quantifying exact and potential near-duplicate tweets using pandas string operations.

In [None]:
print("\n" + "="*50)
print("DUPLICATE ANALYSIS")
print("="*50)

# Check for exact text duplicates
duplicate_texts = df['text'].duplicated().sum()
print(f"Exact duplicate tweets: {duplicate_texts} ({duplicate_texts / len(df) * 100:.2f}%)")

# Check for duplicate IDs
duplicate_ids = df['id'].duplicated().sum()
print(f"Duplicate IDs: {duplicate_ids}")

# Show sample duplicates if any exist
if duplicate_texts > 0:
    print("\nSample Duplicate Tweets:")
    duplicated_mask = df['text'].duplicated(keep=False)
    sample_duplicates = df[duplicated_mask].groupby('text').head(2)
    print(sample_duplicates[['text', 'date', 'label_name']].head(6).to_string())
else:
    print("\n✓ No exact duplicate tweets found!")

# Near-duplicate detection (same first 50 characters)
df['text_start'] = df['text'].str.lower().str.strip().str[:50]
potential_near_dupes = df['text_start'].duplicated().sum()
print(f"\nPotential near-duplicates (same first 50 chars): {potential_near_dupes} ({potential_near_dupes / len(df) * 100:.2f}%)")

# List all near-duplicate IDs grouped
if potential_near_dupes > 0:
    print("\nAll Near-Duplicate Groups (sorted by group size):")
    near_dupe_mask = df['text_start'].duplicated(keep=False)
    near_dupe_df = df[near_dupe_mask][['text_start', 'id', 'text']]
    
    # Group by text_start and collect IDs
    grouped = near_dupe_df.groupby('text_start')
    
    # Sort groups by size (largest first)
    group_sizes = grouped.size().sort_values(ascending=False)
    
    print(f"\nTotal near-duplicate groups: {len(group_sizes)}")
    print(f"Largest group size: {group_sizes.max()}\n")
    
    for i, (text_start, size) in enumerate(group_sizes.items(), 1):
        group_ids = grouped.get_group(text_start)['id'].tolist()
        print(f"Group {i} (Size: {size}):")
        print(f"  Text preview: {text_start}...")
        print(f"  IDs: {group_ids}")
        print()
else:
    print("\n✓ No potential near-duplicates found!")
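
The shared-prefix check is a cheap heuristic: it flags tweets that merely open the same way and misses tweets that differ early on. As a rough illustration of a stricter comparison, the sketch below scores whole-string similarity with `difflib.SequenceMatcher` from the standard library. The helper names and sample tweets are made up for this example, and pairwise comparison scales quadratically, so it would only be practical inside the candidate groups found above.

```python
from difflib import SequenceMatcher

def normalize(text):
    """Lowercase and collapse whitespace before comparing."""
    return " ".join(text.lower().split())

def similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

# Illustrative tweets, not taken from the dataset
t1 = "Big win for the home team tonight! What a game"
t2 = "big win for the home team tonight!  What a game!!"
t3 = "Big win for the home team tonight, but the new album is out too"

print(f"t1 vs t2: {similarity(t1, t2):.2f}")
print(f"t1 vs t3: {similarity(t1, t3):.2f}")
```

A threshold on this ratio (which value works is an open choice, not something the dataset dictates) could then separate true near-duplicates from tweets that only share an opening phrase.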

#### 1.2.4 Text Length and Composition Statistics

Analyzing text characteristics using pandas string methods and aggregations.

In [None]:
# Calculate text statistics
df['text_length'] = df['text'].str.len()
df['word_count'] = df['text'].str.split().str.len()
df['avg_word_length'] = df['text'].str.replace(' ', '').str.len() / df['word_count']

# Count special characters
df['hashtag_count'] = df['text'].str.count('#')
df['mention_count'] = df['text'].str.count('@')
df['url_count'] = df['text'].str.count('http')

print("\n" + "="*50)
print("TEXT LENGTH AND COMPOSITION STATISTICS")
print("="*50)

print("\nCharacter Length Statistics:")
print(f"  Mean: {df['text_length'].mean():.1f} characters")
print(f"  Median: {df['text_length'].median():.1f} characters")
print(f"  Std Dev: {df['text_length'].std():.1f} characters")
print(f"  Min: {df['text_length'].min()} characters")
print(f"  Max: {df['text_length'].max()} characters")

print("\nWord Count Statistics:")
print(f"  Mean: {df['word_count'].mean():.1f} words")
print(f"  Median: {df['word_count'].median():.1f} words")
print(f"  Std Dev: {df['word_count'].std():.1f} words")
print(f"  Min: {df['word_count'].min()} words")
print(f"  Max: {df['word_count'].max()} words")

print("\nAverage Word Length:")
print(f"  Mean: {df['avg_word_length'].mean():.2f} characters per word")
print(f"  Median: {df['avg_word_length'].median():.2f} characters per word")

print("\nSpecial Character Statistics:")
print(f"  Tweets with hashtags: {(df['hashtag_count'] > 0).sum()} ({(df['hashtag_count'] > 0).sum()/len(df)*100:.1f}%)")
print(f"  Average hashtags per tweet: {df['hashtag_count'].mean():.2f}")
print(f"  Tweets with mentions: {(df['mention_count'] > 0).sum()} ({(df['mention_count'] > 0).sum()/len(df)*100:.1f}%)")
print(f"  Average mentions per tweet: {df['mention_count'].mean():.2f}")
print(f"  Tweets with URLs: {(df['url_count'] > 0).sum()} ({(df['url_count'] > 0).sum()/len(df)*100:.1f}%)")

# Show extreme examples
print("\nExtreme Examples:")
print(f"\nShortest tweet ({df['text_length'].min()} chars):")
print(f"  {df.loc[df['text_length'].idxmin(), 'text']}")
print(f"\nLongest tweet ({df['text_length'].max()} chars, first 150):")
print(f"  {df.loc[df['text_length'].idxmax(), 'text'][:150]}...")
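
Note that `str.count('#')`, `str.count('@')`, and `str.count('http')` count raw characters or substrings, so a lone `#` or the word "http" inside a path is counted as well. The sketch below contrasts this with pattern-based counting. It assumes raw handles and URLs appear in the text; placeholder forms such as `{{URL}}` or `{{USERNAME}}`, which the preprocessing code later in this notebook handles, are not covered by these patterns. The regexes and the `count_markers` helper are illustrative choices.

```python
import re

HASHTAG_RE = re.compile(r"#\w+")          # '#' followed by word characters
MENTION_RE = re.compile(r"@\w+")          # '@' followed by word characters
URL_RE = re.compile(r"https?://\S+")      # http:// or https:// up to whitespace

def count_markers(text):
    """Count hashtags, mentions, and URLs as whole patterns."""
    return {
        "hashtags": len(HASHTAG_RE.findall(text)),
        "mentions": len(MENTION_RE.findall(text)),
        "urls": len(URL_RE.findall(text)),
    }

sample = "Ranked # 1! Mail me @ home or ping @sam about #NLP: https://example.com/http-guide"
print("character counts:", sample.count("#"), sample.count("@"), sample.count("http"))
print("pattern counts:  ", count_markers(sample))
```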

#### 1.2.5 Word Frequency Statistics (NLTK)

Analyzing word frequency patterns using NLTK tokenization and pandas aggregations.

In [None]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string

# Get English stopwords
stop_words = set(stopwords.words('english'))

print("\n" + "="*50)
print("WORD FREQUENCY STATISTICS (NLTK)")
print("="*50)

# Tokenize all tweets
all_tokens = []
for text in df['text']:
    tokens = word_tokenize(text.lower())
    # Remove punctuation and stopwords, keep only meaningful words
    tokens = [
        t for t in tokens 
        if t not in string.punctuation 
        and t not in stop_words 
        and len(t) > 2
    ]
    all_tokens.extend(tokens)

token_freq = Counter(all_tokens)
top_words = pd.DataFrame(
    token_freq.most_common(30), 
    columns=['Word', 'Frequency']
)

print(f"\nVocabulary Statistics:")
print(f"  Total tokens (after filtering): {len(all_tokens):,}")
print(f"  Unique words (vocabulary size): {len(token_freq):,}")
print(f"  Average token frequency: {len(all_tokens) / len(token_freq):.2f}")
print(f"  Words appearing only once: {sum(1 for count in token_freq.values() if count == 1):,}")
print(f"  Words appearing 10+ times: {sum(1 for count in token_freq.values() if count >= 10):,}")

print(f"\nTop 30 Most Common Words:")
print(top_words.to_string(index=False))
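
One subtlety in the filter above: `t not in string.punctuation` is a substring test against a single string, so a multi-character token such as `...` passes it and, being three characters long, also passes the length check. A hedged sketch of a stricter filter follows; `is_word_token` and the tiny stopword set are hypothetical names used only for illustration.

```python
import string

def is_word_token(token, stop_words, min_len=3):
    """Keep tokens that have at least min_len characters, are not stopwords,
    and contain at least one alphanumeric character."""
    return (
        len(token) >= min_len
        and token not in stop_words
        and any(ch.isalnum() for ch in token)
    )

# "..." is not a substring of string.punctuation, so the substring test misses it
print("..." in string.punctuation)

demo_tokens = ["great", "...", "the", "game", "!!!", "n't", "ok"]
demo_stop_words = {"the"}  # tiny illustrative stopword set
kept = [t for t in demo_tokens if is_word_token(t, demo_stop_words)]
print(kept)
```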

#### 1.2.6 Named Entity Statistics (spaCy)

Analyzing named entity distributions using spaCy NER and pandas aggregations.

In [None]:
from tqdm import tqdm

print("\n" + "="*50)
print("NAMED ENTITY STATISTICS (spaCy)")
print("="*50)
print(f"\nAnalyzing {len(df)} tweets... This may take a few minutes...")

# Extract entities from all tweets
all_entities = []
tweets_with_entities = 0
entity_counts_per_tweet = []

# Use nlp.pipe for better performance
texts = df['text'].apply(lambda x: x[:500]).tolist()

for doc in tqdm(nlp.pipe(texts, batch_size=50), total=len(df), desc="Processing"):
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    entity_counts_per_tweet.append(len(entities))
    if entities:
        tweets_with_entities += 1
    all_entities.extend(entities)

# Add entity count to dataframe
df['entity_count'] = entity_counts_per_tweet

print(f"\nEntity Detection Statistics:")
print(f"  Tweets with entities: {tweets_with_entities:,} ({tweets_with_entities/len(df)*100:.1f}%)")
print(f"  Tweets without entities: {len(df) - tweets_with_entities:,} ({(len(df) - tweets_with_entities)/len(df)*100:.1f}%)")
print(f"  Total entities found: {len(all_entities):,}")
print(f"  Average entities per tweet: {df['entity_count'].mean():.2f}")
print(f"  Median entities per tweet: {df['entity_count'].median():.0f}")
print(f"  Max entities in a tweet: {df['entity_count'].max()}")

# Count entity types
entity_types = Counter([label for text, label in all_entities])
entity_type_df = pd.DataFrame(
    entity_types.most_common(), 
    columns=['Entity Type', 'Count']
)
entity_type_df['Percentage'] = (entity_type_df['Count'] / len(all_entities) * 100).round(1)

print("\nEntity Type Distribution:")
print(entity_type_df.to_string(index=False))

print("\nEntity Type Legend:")
print("  PERSON      = People, including fictional characters")
print("  ORG         = Companies, agencies, institutions, organizations")
print("  GPE         = Countries, cities, states (Geo-Political Entities)")
print("  DATE        = Absolute or relative dates or periods")
print("  CARDINAL    = Numerals that do not fall under another type")
print("  MONEY       = Monetary values, including unit")
print("  TIME        = Times smaller than a day")
print("  NORP        = Nationalities, religious or political groups")
print("  ORDINAL     = First, second, third, etc.")
print("  WORK_OF_ART = Titles of books, songs, movies, etc.")
print("  EVENT       = Named hurricanes, battles, wars, sports events")
print("  PRODUCT     = Objects, vehicles, foods, etc. (not services)")
print("  LOC         = Non-GPE locations, mountain ranges, bodies of water")
print("  FAC         = Buildings, airports, highways, bridges, etc.")
print("  QUANTITY    = Measurements, as of weight or distance")
print("  LAW         = Named documents made into laws")
print("  PERCENT     = Percentage, including '%'")
print("  LANGUAGE    = Any named language")

### 1.3 Filter to Text Only

Reducing the dataset to the text and label columns and stripping all digits from the tweet text, so that subsequent processing works on textual content only. Labels that are not stored as Python lists are converted to their string representation.

In [None]:
def filter_to_text_only(dataframe, text_col='text', label_col='label_name'):
    """
    Filter dataset to only text and label columns, removing all numbers.
    
    Parameters:
    -----------
    dataframe : pd.DataFrame
        Input dataframe to filter
    text_col : str
        Name of the text column (default: 'text')
    label_col : str
        Name of the label column (default: 'label_name')
    
    Returns:
    --------
    pd.DataFrame
        Filtered dataframe with only text columns and no numbers
    """
    
    # Step 1: Select only the text and label_name columns
    df_filtered = dataframe[[text_col, label_col]].copy()
    
    # Step 2: Remove all numbers from the text column using regex
    # This removes all digits (0-9) from the text
    df_filtered[text_col] = df_filtered[text_col].str.replace(r'\d+', '', regex=True)
    
    # Step 3: Handle label_name. Lists are kept as they are (labels are text,
    # so no number removal is needed); anything else, such as an array,
    # is converted to its string representation.
    if not isinstance(df_filtered[label_col].iloc[0], list):
        df_filtered[label_col] = df_filtered[label_col].astype(str)
    
    # Step 4: Clean up any extra whitespace created by removing numbers
    df_filtered[text_col] = df_filtered[text_col].str.replace(r'\s+', ' ', regex=True).str.strip()
    
    return df_filtered


# Apply the filtering function
print("\n" + "="*50)
print("FILTERING TO TEXT ONLY")
print("="*50)

# Show original dataset info
print("\nOriginal Dataset:")
print(f"  Shape: {df.shape}")
print(f"  Columns: {df.columns.tolist()}")
print(f"  Sample text: {df['text'].iloc[0][:100]}...")

# Apply the filter
df_text_only = filter_to_text_only(df)

# Show filtered dataset info
print("\nFiltered Dataset (Text Only - No Numbers):")
print(f"  Shape: {df_text_only.shape}")
print(f"  Columns: {df_text_only.columns.tolist()}")
print(f"  Sample text: {df_text_only['text'].iloc[0][:100]}...")

# Show examples of number removal
print("\nExamples of Number Removal:")
for i in range(3):
    print(f"\nExample {i+1}:")
    print(f"  Original:  {df['text'].iloc[i][:80]}...")
    print(f"  Filtered:  {df_text_only['text'].iloc[i][:80]}...")

# Show label_name comparison
print("\nLabel Name Comparison:")
print(f"  Original label_name: {df['label_name'].iloc[0]}")
print(f"  Filtered label_name: {df_text_only['label_name'].iloc[0]}")

# Statistics on number removal
original_chars = df['text'].str.len().sum()
filtered_chars = df_text_only['text'].str.len().sum()
chars_removed = original_chars - filtered_chars

print(f"\nCharacter Statistics:")
print(f"  Original total characters: {original_chars:,}")
print(f"  Filtered total characters: {filtered_chars:,}")
print(f"  Characters removed (numbers): {chars_removed:,} ({chars_removed/original_chars*100:.2f}%)")

print("\n✓ Dataset successfully filtered to text only (numbers removed)")
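
Stripping digits also affects tokens that mix letters and numbers. The sketch below applies the same two regular expressions to single strings with the standard `re` module, to make the side effects visible; `strip_digits` and the example sentences are illustrative and not part of the dataset.

```python
import re

def strip_digits(text):
    """Apply the same two regex steps as filter_to_text_only to one string."""
    text = re.sub(r"\d+", "", text)            # drop every run of digits
    return re.sub(r"\s+", " ", text).strip()   # collapse leftover whitespace

examples = [
    "Scored 3 goals in 2021",
    "COVID19 cases on the 2nd day",
    "PS5 vs Xbox 360",
]
for original in examples:
    print(f"{original!r} -> {strip_digits(original)!r}")
```

Fragments such as `nd` (from `2nd`) or `PS` (from `PS5`) remain in the text, which may or may not matter for topic classification.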

### 1.4 Create Reusable Loading and Filtering Functions

Reusable functions that encapsulate the data loading and text-filtering logic.

In [None]:
import pandas as pd
from datasets import load_dataset

# Load the dataset directly from Hugging Face using the datasets library
dataset = load_dataset("cardiffnlp/tweet_topic_multi", split="train_all")

# Convert to pandas DataFrame
tweets_raw = dataset.to_pandas()

# Filter the dataset to text only (no numbers)
def filter_to_text_only(dataframe, text_col='text', label_col='label_name', label_num_col='label'):
    df_filtered = dataframe[[text_col, label_col, label_num_col]].copy()
    df_filtered[text_col] = df_filtered[text_col].str.replace(r'\d+', '', regex=True)
    
    if isinstance(df_filtered[label_col].iloc[0], list):
        pass
    else:
        df_filtered[label_col] = df_filtered[label_col].astype(str)
    
    df_filtered[text_col] = df_filtered[text_col].str.replace(r'\s+', ' ', regex=True).str.strip()
    
    return df_filtered

tweets_text_only = filter_to_text_only(tweets_raw)

print("\n✓ Dataset successfully loaded and filtered to text only")
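
When the label column holds arrays rather than lists, `astype(str)` turns each entry into a string. The sketch below shows what that string form looks like for a NumPy array and how the label names can be recovered from it, which appears to be the format `parse_label_names` in Section 2.3 expects. The label names are illustrative, and the regex assumes labels contain no apostrophes.

```python
import re
import numpy as np

labels = np.array(["sports", "gaming"], dtype=object)  # illustrative label names

# str() of a NumPy array separates items with spaces, not commas
as_string = str(labels)
print(as_string)

# Pull the quoted items back out (assumes label names contain no apostrophes)
recovered = re.findall(r"'([^']*)'", as_string)
print(recovered)
```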

---

## 2. Preprocessing

This section focuses on text preprocessing techniques. We will review common preprocessing methods, apply them systematically using NLTK and SpaCy, analyze how the order of operations affects results, evaluate the usefulness of each method for different scenarios, and create a reusable preprocessing pipeline.


### Methods Used (in Order)

#### 1. **Remove RT indicator**
   - **Purpose:** Removes retweet markers that don't carry semantic meaning
   - **Order:** FIRST - removes noise before any text processing

#### 2. **Remove placeholders (USERNAME, URL, mentions)**
   - **Purpose:** Removes dataset-specific placeholders that don't contribute to topic classification
   - **Order:** EARLY - clean structural noise before text normalization

#### 3. **Convert emojis to text**
   - **Purpose:** Transforms emojis into descriptive words that preserve semantic meaning (🎮 → video game)
   - **Order:** AFTER placeholders, BEFORE hashtags - emojis may appear anywhere in text

#### 4. **Extract hashtag text**
   - **Purpose:** Preserves topic-relevant keywords from hashtags (#Gaming → Gaming)
   - **Order:** AFTER emoji conversion - hashtags rarely contain emojis but order matters for consistency

#### 5. **Segment CamelCase words**
   - **Purpose:** Splits compound words for better tokenization (GameOfThrones → Game Of Thrones)
   - **Order:** BEFORE lowercase - capitalization patterns guide segmentation

#### 6. **Normalize whitespace and lowercase**
   - **Purpose:** Standardizes text format for consistent processing
   - **Order:** AFTER special token handling - ensures uniform text before tokenization

#### 7. **Tokenize with SpaCy**
   - **Purpose:** Splits text into linguistic tokens with POS and lemma information
   - **Order:** AFTER normalization - requires clean, lowercase text

#### 8. **Filter and lemmatize tokens**
   - **Purpose:** Removes noise tokens and reduces words to base forms
   - **Order:** LAST - operates on clean, filtered tokens

### Why This Order Matters

**Critical Dependencies:**
- Emoji conversion **before** lowercase → emoji descriptions use underscores that get normalized
- Special tokens **before** tokenization → removed as complete units
- Lowercase **before** tokenization → consistent stopword matching
- Tokenization **before** filtering → need tokens to filter
- Lemmatization **last** → final transformation on clean tokens
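
To make one of these dependencies concrete, the sketch below shows why CamelCase segmentation has to run before lowercasing. It uses a simplified regex variant (`split_camel`, a hypothetical helper that only handles lowercase-to-uppercase boundaries) rather than the character loop used in this notebook.

```python
import re

def split_camel(text):
    """Insert a space at every lowercase-to-uppercase boundary (regex variant)."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)

hashtag_text = "GameOfThrones"

# Segmenting first keeps the word boundaries
segment_then_lower = split_camel(hashtag_text).lower()

# Lowercasing first destroys the capitalization the segmenter relies on
lower_then_segment = split_camel(hashtag_text.lower())

print(segment_then_lower)
print(lower_then_segment)
```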

### 2.1 Reusable Preprocessing Pipeline

Development of a modular, configurable preprocessing function that can be easily reused in future labs. The pipeline allows for flexible selection of preprocessing steps and parameters, making it adaptable to different text analysis tasks and requirements.
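
The `preprocess_tweet` function in this section applies a fixed sequence of steps. One possible way to make step selection configurable is to treat each step as a plain function and compose them. The sketch below illustrates the idea with three simplified steps; all helper names (`remove_mentions`, `strip_hashtag_symbol`, `lowercase`, `make_pipeline`) are hypothetical and not part of the notebook's code.

```python
from functools import reduce

def remove_mentions(text):
    """Drop whitespace-delimited words that start with '@'."""
    return " ".join(w for w in text.split() if not w.startswith("@"))

def strip_hashtag_symbol(text):
    """Keep the hashtag text but drop the leading '#'."""
    return " ".join(w[1:] if w.startswith("#") else w for w in text.split())

def lowercase(text):
    return text.lower()

def make_pipeline(*steps):
    """Return a function that applies the given steps from left to right."""
    def run(text):
        return reduce(lambda acc, step: step(acc), steps, text)
    return run

# Hypothetical step selection; any subset or order can be passed in
light_clean = make_pipeline(remove_mentions, strip_hashtag_symbol, lowercase)
print(light_clean("@sam loved the #Gaming stream"))
```

With this design, a different lab could reuse the same step functions in another order or leave some out without editing the pipeline itself.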

In [None]:
# Download required NLTK data
import nltk
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('punkt_tab', quiet=True)

# Load SpaCy model
import spacy
try:
    nlp = spacy.load('en_core_web_sm')
    print("✓ SpaCy model loaded successfully")
except OSError:
    # spacy.load raises OSError when the model package is not installed
    print("Installing SpaCy model...")
    import subprocess
    import sys
    subprocess.run([sys.executable, '-m', 'spacy', 'download', 'en_core_web_sm'], check=True)
    nlp = spacy.load('en_core_web_sm')
    print("✓ SpaCy model loaded successfully")

# Import emoji package for emoji handling
import emoji

def is_latin_alphabet(word):
    """
    Check if a word contains only Latin alphabet characters.
    Filters out words with Cyrillic, Arabic, Chinese, etc.
    """
    if not word:
        return False
    return all(ord('a') <= ord(c.lower()) <= ord('z') for c in word)

def segment_camelcase(text):
    """
    Segment CamelCase words into separate words without regex.
    Example: 'GameOfThrones' → 'Game Of Thrones'
    This is important for hashtags like #GameOfThrones after removing the #
    """
    if not text:
        return text
    
    result = []
    
    for i, char in enumerate(text):
        # Add current character
        result.append(char)
        
        # Check if we need to insert a space
        if i < len(text) - 1:
            current = char
            next_char = text[i + 1]
            
            # Case 1: lowercase → uppercase (e.g., 'e' → 'O' in 'GameOf')
            if current.islower() and next_char.isupper():
                result.append(' ')
            
            # Case 2: uppercase → uppercase → lowercase (e.g., 'HTMLParser' → 'HTML Parser')
            elif i < len(text) - 2:
                after_next = text[i + 2]
                if current.isupper() and next_char.isupper() and after_next.islower():
                    result.append(' ')
    
    return ''.join(result)

def preprocess_tweet(text):
    """
    Topic-optimized preprocessing for tweet classification.
    Preserves topic-relevant information while removing noise.
    Removes special characters, emojis, and non-Latin script words.
    """
    if not isinstance(text, str):
        return ""
        
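    # Step 1: Remove RT (retweet indicator) as a standalone word only,
    # so that words ending in "rt" (e.g., "start") are left intact
    text = ' '.join(w for w in text.split() if w.lower() != 'rt')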
    
    # Step 2: Remove URLs and placeholders
    text = text.replace('{{URL}}', ' ')
    text = text.replace('{{USERNAME}}', ' ')
    for protocol in ['https://', 'http://', 'www.']:
        if protocol in text:
            parts = text.split(protocol)
            text = parts[0] + ' ' + ' '.join([' '.join(p.split()[1:]) if p.split() else '' for p in parts[1:]])
    
    # Step 3: Remove mentions
    words_list = text.split()
    words_list = [w for w in words_list if not (w.startswith('{@') or w.startswith('@'))]
    text = ' '.join(words_list)
    
    # Step 3.5: Convert emojis to text descriptions (🎮 → video game)
    text = emoji.demojize(text, delimiters=(" ", " "))
    text = text.replace('_', ' ')
    
    # Step 4: Extract hashtag text (#Gaming → Gaming, #GameOfThrones → GameOfThrones)
    words_list = text.split()
    words_list = [w[1:] if w.startswith('#') else w for w in words_list]
    text = ' '.join(words_list)
    
    # Step 4.5: Segment CamelCase words
    # GameOfThrones → Game Of Thrones
    text = segment_camelcase(text)
    
    # Step 5: Normalize whitespace and lowercase
    text = ' '.join(text.split())
    text = text.lower()
    
    # Step 6: Tokenize with SpaCy
    doc = nlp(text)
    
    # Step 7: Filter and lemmatize tokens
    processed_tokens = []
    for token in doc:
        # Skip punctuation
        if token.is_punct:
            continue
        
        # Skip if not alphabetic (removes special characters, emojis, numbers)
        if not token.is_alpha:
            continue
        
        # Skip tokens shorter than 2 characters
        if len(token.text) < 2:
            continue
        
        # Remove stopwords (using SpaCy's stopword detection)
        if token.is_stop:
            continue
        
        # Check if word uses Latin alphabet (filters out Cyrillic, Arabic, Chinese, etc.)
        if not is_latin_alphabet(token.text):
            continue
        
        # Use lemmatized form
        processed_tokens.append(token.lemma_)
    
    return ' '.join(processed_tokens)

# Create a copy of the original dataframe
tweets_preprocessed_train = tweets_text_only.copy()

# Apply preprocessing
tweets_preprocessed_train['text'] = tweets_preprocessed_train['text'].apply(preprocess_tweet)

print("\n✓ Preprocessing complete!")
print(f"✓ Processed {len(tweets_preprocessed_train)} tweets")
print("✓ Original 'tweets_text_only' unchanged | Processed data in 'tweets_preprocessed_train'")

# Save the DataFrame to the Data folder
import os

# Create Data folder if it does not exist
os.makedirs('../Data', exist_ok=True)

# Save the data as parquet
output_path = '../Data/tweets_preprocessed_train.parquet'
tweets_preprocessed_train.to_parquet(output_path, index=False)

# Save the data as CSV
output_path_csv = '../Data/tweets_preprocessed_train.csv'
tweets_preprocessed_train.to_csv(output_path_csv, index=False)

print(f"✓ Shape: {tweets_preprocessed_train.shape}")
print(f"✓ Features: {list(tweets_preprocessed_train.columns)}")
print(f"\n✓ Training DataFrame saved to: {output_path}")
print(f"✓ Training DataFrame saved to: {output_path_csv}")

### 2.2 Preprocessing for Test and Validation Splits

Apply the same preprocessing pipeline to the test and validation splits from HuggingFace to ensure consistency between training and evaluation data.

In [None]:
# Load test and validation splits from HuggingFace
print("Loading test and validation splits from HuggingFace...")
test_dataset = load_dataset("cardiffnlp/tweet_topic_multi", split="test_2021")
val_dataset = load_dataset("cardiffnlp/tweet_topic_multi", split="validation_2021")

# Convert to pandas DataFrames
test_raw = test_dataset.to_pandas()
val_raw = val_dataset.to_pandas()

print(f"Test samples: {len(test_raw):,}")
print(f"Validation samples: {len(val_raw):,}")

# Apply the same text-only filter (remove numbers)
test_text_only = filter_to_text_only(test_raw)
val_text_only = filter_to_text_only(val_raw)

# Apply the same preprocessing pipeline
tweets_preprocessed_test = test_text_only.copy()
tweets_preprocessed_test['text'] = tweets_preprocessed_test['text'].apply(preprocess_tweet)

tweets_preprocessed_validation = val_text_only.copy()
tweets_preprocessed_validation['text'] = tweets_preprocessed_validation['text'].apply(preprocess_tweet)

# Save test data as parquet
test_output_path = '../Data/tweets_preprocessed_test.parquet'
tweets_preprocessed_test.to_parquet(test_output_path, index=False)

# Save test data as CSV
test_output_path_csv = '../Data/tweets_preprocessed_test.csv'
tweets_preprocessed_test.to_csv(test_output_path_csv, index=False)

# Save validation data as parquet
val_output_path = '../Data/tweets_preprocessed_validation.parquet'
tweets_preprocessed_validation.to_parquet(val_output_path, index=False)

# Save validation data as CSV
val_output_path_csv = '../Data/tweets_preprocessed_validation.csv'
tweets_preprocessed_validation.to_csv(val_output_path_csv, index=False)

print("\n✓ Test data preprocessing complete!")
print(f"✓ Shape: {tweets_preprocessed_test.shape}")
print(f"✓ Features: {list(tweets_preprocessed_test.columns)}")
print(f"✓ Saved to: {test_output_path}")
print(f"✓ Saved to: {test_output_path_csv}")

print("\n✓ Validation data preprocessing complete!")
print(f"✓ Shape: {tweets_preprocessed_validation.shape}")
print(f"✓ Features: {list(tweets_preprocessed_validation.columns)}")
print(f"✓ Saved to: {val_output_path}")
print(f"✓ Saved to: {val_output_path_csv}")

### 2.3 Prepare Data for Training

This section addresses a common issue in multi-label classification: **class imbalance**. Some labels in our dataset have very few training samples, which can negatively impact model performance during training.

**Why this matters:**
- Labels with too few samples cannot be learned effectively by the model
- Class imbalance can lead to biased predictions toward majority classes
- Having consistent labels across train/validation/test splits is crucial for proper evaluation

**What this function does:**
1. **Visualizes** the distribution of labels using matplotlib to identify potential imbalances
2. **Removes** labels with fewer than a specified threshold (default: 180 tweets) from all data splits
3. **Updates** the binary label vectors to reflect the removed labels
4. **Ensures consistency** by applying the same filtering to train, validation, and test sets
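
The core of steps 2 to 4 can be illustrated on toy data with plain Python. This is a simplified sketch under stated assumptions, not the notebook's function: `drop_rare_labels` and `to_binary_vector` are hypothetical helpers, the label names are made up, tweets left without any label are dropped, and the binary vectors are rebuilt over the kept labels in sorted order.

```python
from collections import Counter

def drop_rare_labels(train_labels, other_splits, min_samples):
    """Remove labels seen fewer than min_samples times in training.

    train_labels: list of label lists, one per training tweet
    other_splits: dict mapping split name to its list of label lists
    Returns the filtered splits, the kept labels, and the removed labels.
    """
    counts = Counter(label for labels in train_labels for label in labels)
    keep = sorted(l for l, c in counts.items() if c >= min_samples)
    removed = sorted(l for l, c in counts.items() if c < min_samples)

    def filter_split(rows):
        filtered = [[l for l in labels if l in keep] for labels in rows]
        # Tweets left without any label are dropped
        return [labels for labels in filtered if labels]

    splits = {"train": filter_split(train_labels)}
    splits.update({name: filter_split(rows) for name, rows in other_splits.items()})
    return splits, keep, removed

def to_binary_vector(labels, keep):
    """Binary indicator vector over the kept labels, in sorted order."""
    return [1 if label in labels else 0 for label in keep]

# Toy data with illustrative label names and a tiny threshold
train = [["sports"], ["sports", "gaming"], ["music"], ["sports", "music"]]
val = [["gaming"], ["music", "gaming"]]
splits, keep, removed = drop_rare_labels(train, {"validation": val}, min_samples=2)
train_vectors = [to_binary_vector(labels, keep) for labels in splits["train"]]

print("kept:", keep, "removed:", removed)
print("validation:", splits["validation"])
print("train vectors:", train_vectors)
```

The label counts come from the training split only, and the same `keep` list is then applied to every split, which is what keeps the label space consistent across train, validation, and test.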

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import re
from collections import Counter

def parse_label_names(label_str):
    """
    Parse the label_name string into a list of labels.
    The format is numpy-style: ['label1' 'label2'] instead of ['label1', 'label2']
    
    Parameters:
    -----------
    label_str : str
        String representation of labels in numpy array format
    
    Returns:
    --------
    list
        List of label strings
    """
    if isinstance(label_str, str):
        label_str = label_str.strip()
        if label_str.startswith('[') and label_str.endswith(']'):
            content = label_str[1:-1]
            items = re.findall(r"'([^']*)'", content)
            return items
    return []


def prepare_data_for_training(df_train, df_validation, df_test, min_samples=180, show_plot=True):
    """
    Prepare datasets for training by removing labels with insufficient samples.
    
    This function performs the following steps:
    1. Counts the number of tweets per label in the training set
    2. Visualizes the label distribution using matplotlib
    3. Identifies labels with fewer than min_samples tweets
    4. Removes tweets containing only the underrepresented labels from all splits
    5. Updates the label vectors and label_name columns
    
    Parameters:
    -----------
    df_train : pd.DataFrame
        Training dataframe with 'text', 'label_name', and 'label' columns
    df_validation : pd.DataFrame
        Validation dataframe with same columns
    df_test : pd.DataFrame
        Test dataframe with same columns
    min_samples : int, default=180
        Minimum number of tweets required for a label to be kept
    show_plot : bool, default=True
        Whether to display the matplotlib visualization
    
    Returns:
    --------
    tuple of (pd.DataFrame, pd.DataFrame, pd.DataFrame, list)
        Filtered train, validation, test dataframes, and list of removed labels
    """
    
    # Make copies to avoid modifying original dataframes
    train = df_train.copy()
    val = df_validation.copy()
    test = df_test.copy()
    
    # Parse labels for all datasets
    train['parsed_labels'] = train['label_name'].apply(parse_label_names)
    val['parsed_labels'] = val['label_name'].apply(parse_label_names)
    test['parsed_labels'] = test['label_name'].apply(parse_label_names)
    
    # Count tweets per label in training set
    label_counts = Counter()
    for labels in train['parsed_labels']:
        for label in labels:
            label_counts[label] += 1
    
    # Sort labels by count for visualization
    sorted_labels = sorted(label_counts.items(), key=lambda x: x[1], reverse=True)
    labels_list = [item[0] for item in sorted_labels]
    counts_list = [item[1] for item in sorted_labels]
    
    # Identify labels to remove
    labels_to_remove = [label for label, count in label_counts.items() if count < min_samples]
    labels_to_keep = [label for label, count in label_counts.items() if count >= min_samples]
    
    print("=" * 60)
    print("LABEL DISTRIBUTION ANALYSIS")
    print("=" * 60)
    print(f"\nTotal unique labels: {len(label_counts)}")
    print(f"Minimum samples threshold: {min_samples}")
    print(f"Labels to keep: {len(labels_to_keep)}")
    print(f"Labels to remove: {len(labels_to_remove)}")
    
    if labels_to_remove:
        print(f"\nLabels being removed (< {min_samples} tweets):")
        for label in sorted(labels_to_remove, key=lambda x: label_counts[x], reverse=True):
            print(f"  - {label}: {label_counts[label]} tweets")
    
    # Visualization with matplotlib
    if show_plot:
        fig, ax = plt.subplots(figsize=(12, 8))
        
        # Create colors based on whether label will be kept or removed
        colors = ['#2ecc71' if count >= min_samples else '#e74c3c' for count in counts_list]
        
        # Create horizontal bar chart
        y_pos = np.arange(len(labels_list))
        bars = ax.barh(y_pos, counts_list, color=colors, edgecolor='black', linewidth=0.5)
        
        # Add threshold line
        ax.axvline(x=min_samples, color='#3498db', linestyle='--', linewidth=2, 
                   label=f'Threshold ({min_samples} tweets)')
        
        # Customize chart
        ax.set_yticks(y_pos)
        ax.set_yticklabels(labels_list)
        ax.invert_yaxis()  # Labels read top-to-bottom
        ax.set_xlabel('Number of Tweets', fontsize=12)
        ax.set_ylabel('Label', fontsize=12)
        ax.set_title('Label Distribution in Training Data\n(Green: Keep, Red: Remove)', fontsize=14)
        ax.legend(loc='lower right')
        
        # Add count annotations
        for count, bar in zip(counts_list, bars):
            ax.annotate(f'{count}', xy=(count + 20, bar.get_y() + bar.get_height()/2),
                       va='center', ha='left', fontsize=9)
        
        plt.tight_layout()
        plt.show()
        print("\n✓ Plot displayed")
    
    # Define the original label order (from the dataset)
    all_labels_ordered = [
        'arts_&_culture', 'business_&_entrepreneurs', 'celebrity_&_pop_culture',
        'diaries_&_daily_life', 'family', 'fashion_&_style', 'film_tv_&_video',
        'fitness_&_health', 'food_&_dining', 'gaming', 'learning_&_educational',
        'music', 'news_&_social_concern', 'other_hobbies', 'relationships',
        'science_&_technology', 'sports', 'travel_&_adventure', 'youth_&_student_life'
    ]
    
    # Get indices of labels to keep
    keep_indices = [i for i, label in enumerate(all_labels_ordered) if label in labels_to_keep]
    new_labels_ordered = [all_labels_ordered[i] for i in keep_indices]
    
    def filter_and_update_labels(df, labels_to_remove_set, keep_indices, new_labels_ordered):
        """
        Filter out rows where all labels are in labels_to_remove and update label vectors.
        """
        # Check if tweet has at least one label that will be kept
        def has_valid_label(parsed_labels):
            return any(label not in labels_to_remove_set for label in parsed_labels)
        
        # Filter rows
        mask = df['parsed_labels'].apply(has_valid_label)
        df_filtered = df[mask].copy()
        
        # Update label_name to only include kept labels
        def update_label_names(parsed_labels):
            kept = [label for label in parsed_labels if label not in labels_to_remove_set]
            return str(kept).replace(', ', ' ').replace(',', '')
        
        df_filtered['label_name'] = df_filtered['parsed_labels'].apply(update_label_names)
        
        # Update label vectors - parse old vector and create new one with only kept indices
        def update_label_vector(label_str):
            # Parse the vector string like '[0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0]'
            label_str = str(label_str).strip()
            if label_str.startswith('[') and label_str.endswith(']'):
                values = [int(v) for v in label_str[1:-1].split()]
                # Guard against vectors that were already reduced (e.g. re-running on filtered files)
                if len(values) != len(all_labels_ordered):
                    raise ValueError(
                        f"Expected {len(all_labels_ordered)} label values, got {len(values)}"
                    )
                # Keep only the values at keep_indices
                new_values = [values[i] for i in keep_indices]
                return '[' + ' '.join(map(str, new_values)) + ']'
            return label_str
        
        df_filtered['label'] = df_filtered['label'].apply(update_label_vector)
        
        # Drop the temporary parsed_labels column
        df_filtered = df_filtered.drop(columns=['parsed_labels'])
        
        return df_filtered
    
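    # Treat every label that is not kept as removed, including labels that never occur in training
    labels_to_remove_set = set(labels_to_remove) | (set(all_labels_ordered) - set(labels_to_keep))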
    
    # Apply filtering to all datasets
    train_filtered = filter_and_update_labels(train, labels_to_remove_set, keep_indices, new_labels_ordered)
    val_filtered = filter_and_update_labels(val, labels_to_remove_set, keep_indices, new_labels_ordered)
    test_filtered = filter_and_update_labels(test, labels_to_remove_set, keep_indices, new_labels_ordered)
    
    # Print summary
    print("\n" + "=" * 60)
    print("FILTERING RESULTS")
    print("=" * 60)
    print(f"\nTraining set:")
    print(f"  Before: {len(train)} tweets")
    print(f"  After:  {len(train_filtered)} tweets ({len(train) - len(train_filtered)} removed)")
    
    print(f"\nValidation set:")
    print(f"  Before: {len(val)} tweets")
    print(f"  After:  {len(val_filtered)} tweets ({len(val) - len(val_filtered)} removed)")
    
    print(f"\nTest set:")
    print(f"  Before: {len(test)} tweets")
    print(f"  After:  {len(test_filtered)} tweets ({len(test) - len(test_filtered)} removed)")
    
    print(f"\n✓ Label vector size reduced from {len(all_labels_ordered)} to {len(new_labels_ordered)}")
    print(f"\nRemaining labels ({len(new_labels_ordered)}):")
    for label in new_labels_ordered:
        print(f"  - {label}")
    
    return train_filtered, val_filtered, test_filtered, labels_to_remove


# Apply the prepare_data_for_training function
print("Loading preprocessed datasets...\n")

# Load the preprocessed datasets
train_df = pd.read_csv('../Data/tweets_preprocessed_train.csv')
val_df = pd.read_csv('../Data/tweets_preprocessed_validation.csv')
test_df = pd.read_csv('../Data/tweets_preprocessed_test.csv')

print(f"Loaded train: {len(train_df)} tweets")
print(f"Loaded validation: {len(val_df)} tweets")
print(f"Loaded test: {len(test_df)} tweets\n")

# Prepare data for training (remove labels with < 180 tweets)
train_filtered, val_filtered, test_filtered, removed_labels = prepare_data_for_training(
    train_df, val_df, test_df, 
    min_samples=180, 
    show_plot=True
)

# Save the filtered datasets. The CSV files overwrite the preprocessed inputs loaded above,
# so re-running this cell would operate on already-filtered data.
train_filtered.to_parquet('../Data/tweets_preprocessed_train.parquet', index=False)
val_filtered.to_parquet('../Data/tweets_preprocessed_validation.parquet', index=False)
test_filtered.to_parquet('../Data/tweets_preprocessed_test.parquet', index=False)

train_filtered.to_csv('../Data/tweets_preprocessed_train.csv', index=False)
val_filtered.to_csv('../Data/tweets_preprocessed_validation.csv', index=False)
test_filtered.to_csv('../Data/tweets_preprocessed_test.csv', index=False)

print("\n" + "=" * 60)
print("SAVED FILES")
print("=" * 60)
print("\n✓ Filtered datasets saved to:")
print("  - ../Data/tweets_preprocessed_train.parquet")
print("  - ../Data/tweets_preprocessed_validation.parquet")
print("  - ../Data/tweets_preprocessed_test.parquet")
print("  - ../Data/tweets_preprocessed_train.csv")
print("  - ../Data/tweets_preprocessed_validation.csv")
print("  - ../Data/tweets_preprocessed_test.csv")
print("\n✓ Data preparation for training complete!")
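After the filtered files are written, a quick consistency check can catch mismatches between the `label_name` and `label` columns. The sketch below uses a hypothetical helper, `row_is_consistent`, which is not part of the notebook. It assumes both columns use the numpy-style string format handled above and that a `1` in the vector marks an active label.

```python
import re

def row_is_consistent(label_name_str, label_vec_str, label_order):
    """Return True if the label names and the binary vector describe the same labels."""
    # Names are stored numpy-style: ['label1' 'label2']
    names = set(re.findall(r"'([^']*)'", label_name_str))
    # Vectors are stored numpy-style: [0 1 0]
    values = [int(v) for v in label_vec_str.strip()[1:-1].split()]
    if len(values) != len(label_order):
        return False  # vector length does not match the remaining labels
    active = {label for label, v in zip(label_order, values) if v == 1}
    return names == active

# Hypothetical remaining labels and rows, for illustration only.
remaining = ["music", "sports"]
print(row_is_consistent("['music' 'sports']", "[1 1]", remaining))
print(row_is_consistent("['music']", "[1 0 0]", remaining))
```

It could be applied row by row, for example over `zip(train_filtered['label_name'], train_filtered['label'])`, using the ordered list of remaining labels that the function prints in its summary.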