# Text Loading, Filtering & Preprocessing Pipeline - Version 3

This notebook is an optimized version of the preprocessing pipeline with 3 key improvements:

## Optimizations:
1. **CamelCase Splitting for Hashtags** → Preserves semantic meaning (#GameDay → game day)
2. **Optimized Latin Alphabet Check** → Faster performance (`isascii()` instead of custom function)
3. **Redundant Number Removal Eliminated** → Numbers removed only once in preprocessing

---

## 1.5 Data Loading and Filtering

Load dataset from Hugging Face and filter to text + label columns.

**OPTIMIZATION #3:** Numbers are NO LONGER removed here - they are preserved for better tokenization by spaCy.

In [None]:
import pandas as pd
from datasets import load_dataset

# Load the dataset directly from Hugging Face using the datasets library
dataset = load_dataset("cardiffnlp/tweet_topic_multi", split="train_all")

# Convert to pandas DataFrame
df = dataset.to_pandas()

# Keep only the text and label columns
def filter_to_text_only(dataframe, text_col='text', label_col='label_name'):
    """
    Filter dataset to only text and label columns.
    
    OPTIMIZATION #3 (redundant number removal):
    Numbers are NO LONGER removed here! They are now removed only once in the
    preprocessing step, which allows better tokenization by spaCy.
    """
    df_filtered = dataframe[[text_col, label_col]].copy()
    
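    # Keep multi-label values (lists or arrays) as they are; cast scalar labels to str.
    # After Dataset.to_pandas(), sequence columns may hold NumPy arrays rather than
    # Python lists, so test for any list-like value instead of `list` only.
    if not pd.api.types.is_list_like(df_filtered[label_col].iloc[0]):
        df_filtered[label_col] = df_filtered[label_col].astype(str)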
    
    # Clean up any extra whitespace (but DON'T remove numbers)
    df_filtered[text_col] = df_filtered[text_col].str.replace(r'\s+', ' ', regex=True).str.strip()
    
    return df_filtered

df_text_only = filter_to_text_only(df)

print("\n✓ Dataset successfully loaded and filtered to text only")
print("✓ OPTIMIZATION: Numbers preserved for better tokenization")
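
As a quick check of the whitespace cleanup in `filter_to_text_only`, the following sketch applies the same `\s+` pattern to a plain string with the standard `re` module instead of pandas. The helper name `normalize_whitespace` and the sample text are made up for illustration; the point is only that digits survive this step.

```python
import re

def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace into one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical sample with irregular spacing, a newline, and digits
sample = "  Final score:   4-0 \n in Game 1  "
cleaned = normalize_whitespace(sample)
print(cleaned)  # Final score: 4-0 in Game 1
```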

## 2.5 Preprocessing Pipeline

Topic-optimized preprocessing with 3 key optimizations:

### Preprocessing Steps:
1. Remove RT, URLs, mentions
2. **Extract hashtag text with CamelCase splitting** (OPTIMIZATION #1)
3. Normalize whitespace and lowercase
4. Tokenize with spaCy
5. Remove punctuation
6. Filter non-alphabetic tokens (removes numbers here - OPTIMIZATION #3)
7. Custom stopword removal (keep topic-relevant words)
8. **Latin alphabet check with isascii()** (OPTIMIZATION #2)
9. Lemmatization
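
To make step 8 concrete, here is a small sketch of what `str.isascii()` accepts and rejects. The sample tokens are made up, and `keep_ascii_tokens` is a hypothetical helper, not part of the pipeline. The check is stricter than a pure Latin-script test: accented Latin words are rejected along with Cyrillic or CJK tokens.

```python
def keep_ascii_tokens(tokens):
    """Return only the tokens that consist entirely of ASCII characters."""
    return [tok for tok in tokens if tok.isascii()]

# Hypothetical tokens: plain English, accented Latin, Cyrillic, Japanese
tokens = ["football", "café", "москва", "音楽", "playoff"]
kept = keep_ascii_tokens(tokens)
print(kept)  # ['football', 'playoff']
```

If accented words matter for a given topic, Unicode normalization before this check would be one option, at the cost of an extra step.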

In [None]:
# Download required NLTK data
import nltk
import re

nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('punkt_tab', quiet=True)

# Load spaCy model
import spacy
try:
    nlp = spacy.load('en_core_web_sm')
except OSError:
    # spacy.load raises OSError when the model package is not installed
    print("Installing spaCy model...")
    from spacy.cli import download
    download('en_core_web_sm')
    nlp = spacy.load('en_core_web_sm')
print("✓ spaCy model loaded successfully")

def split_camel_case(text):
    """
    OPTIMIZATION #1: CamelCase Splitting for Hashtags
    
    Splits CamelCase words to preserve semantic meaning:
    - #GameDay → Game Day
    - #BlackLivesMatter → Black Lives Matter
    - #NBAFinals → NBA Finals
    
    Without this: #GameDay → gameday (a single opaque token)
    
    Limitation: an acronym glued to a lowercase word cannot be split by case
    alone, so #MWLplayoffs becomes "MW Lplayoffs" rather than "MWL playoffs".
    """
    # Insert a space at each lowercase-to-uppercase boundary: BlackLivesMatter → Black Lives Matter
    text = re.sub(r'([a-z])([A-Z])', r'\1 \2', text)
    # Split an acronym from a following capitalized word: NBAFinals → NBA Finals
    text = re.sub(r'([A-Z]+)([A-Z][a-z])', r'\1 \2', text)
    return text

def preprocess_tweet(text):
    """
    Topic-optimized preprocessing for tweet classification.
    Preserves topic-relevant information while removing noise.
    Removes special characters, emojis, and non-Latin script words.
    
    VERSION 3 OPTIMIZATIONS:
    1. CamelCase splitting for hashtags (preserves semantic meaning)
    2. Optimized Latin alphabet check (isascii instead of custom function)
    3. Numbers removed here via is_alpha (no longer in filter_to_text_only)
    """
    if not isinstance(text, str):
        return ""
    
    # Step 1: Remove RT (retweet indicator) as a standalone word only,
    # so that words ending in "rt" (e.g. "concert") stay intact
    text = re.sub(r'\brt\b', ' ', text, flags=re.IGNORECASE)
    
    # Step 2: Remove URL/username placeholders and raw URLs
    text = text.replace('{{URL}}', ' ')
    text = text.replace('{{USERNAME}}', ' ')
    # Drop everything from the scheme (or "www.") up to the next whitespace
    text = re.sub(r'(?:https?://|\bwww\.)\S*', ' ', text)
    
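    # Step 3: Remove mentions
    # {@...@} placeholders can span several words, so strip them as a whole first
    text = re.sub(r'\{@.*?@\}', ' ', text)
    words_list = text.split()
    words_list = [w for w in words_list if not (w.startswith('{@') or w.startswith('@'))]
    text = ' '.join(words_list)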
    
    # Step 4: OPTIMIZATION #1 - Extract hashtag text with CamelCase splitting
    words_list = text.split()
    processed_words = []
    for w in words_list:
        if w.startswith('#'):
            # Remove # and split CamelCase
            hashtag_text = w[1:]
            split_text = split_camel_case(hashtag_text)
            processed_words.append(split_text)
        else:
            processed_words.append(w)
    text = ' '.join(processed_words)
    
    # Step 5: Normalize whitespace and lowercase
    text = ' '.join(text.split())
    text = text.lower()
    
    # Step 6: Tokenize with SpaCy
    doc = nlp(text)
    
    # Define topic-relevant words to KEEP (even if they're stopwords)
    keep_words = {'game', 'music', 'news', 'sport', 'film', 'video', 'watch', 'play'}
    
    # Step 7: Filter and lemmatize tokens
    processed_tokens = []
    for token in doc:
        # Skip punctuation
        if token.is_punct:
            continue
        
        # OPTIMIZATION #3: Skip if not alphabetic (removes special characters, emojis, numbers)
        # Numbers are now ONLY removed here, not in filter_to_text_only
        if not token.is_alpha:
            continue
        
        # Skip tokens shorter than 2 characters
        if len(token.text) < 2:
            continue
        
        # Keep topic-relevant words even if they're stopwords
        if token.text in keep_words or token.lemma_ in keep_words:
            processed_tokens.append(token.lemma_)
            continue
        
        # Remove stopwords (using SpaCy's stopword detection)
        if token.is_stop:
            continue
        
        # OPTIMIZATION #2: Use isascii() instead of custom is_latin_alphabet function
        # Filters out Cyrillic, Arabic, Chinese, etc.; note that accented Latin
        # words such as "café" are non-ASCII and are dropped as well
        if not token.text.isascii():
            continue
        
        # Use lemmatized form
        processed_tokens.append(token.lemma_)
    
    return ' '.join(processed_tokens)

# Create a copy of the original dataframe
df_preprocessed = df_text_only.copy()

# Apply preprocessing
df_preprocessed['text'] = df_preprocessed['text'].apply(preprocess_tweet)

print("\n" + "="*60)
print("PREPROCESSING PIPELINE - VERSION 3")
print("="*60)
print("\n✓ Preprocessing complete!")
print(f"✓ Processed {len(df_preprocessed)} tweets")
print("✓ Original 'df_text_only' unchanged | Processed data in 'df_preprocessed'")
print("\n📊 APPLIED OPTIMIZATIONS:")
print("  1. ✓ CamelCase splitting for hashtags (#GameDay → game day)")
print("  2. ✓ Optimized Latin alphabet check (isascii() instead of custom function)")
print("  3. ✓ Redundant number removal eliminated (only removed once here)")
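
A quick sanity check of the hashtag handling can be run without spaCy. The sketch below repeats the two regular expressions from `split_camel_case` and adds the lowercasing that happens later in `preprocess_tweet`. The helper name `hashtag_to_words` and the sample hashtags are assumptions for illustration only.

```python
import re

def hashtag_to_words(tag: str) -> str:
    """Strip the leading '#', split CamelCase, and lowercase the result."""
    text = tag.lstrip("#")
    # Lowercase-to-uppercase boundary: GameDay -> Game Day
    text = re.sub(r"([a-z])([A-Z])", r"\1 \2", text)
    # Acronym followed by a capitalized word: NBAFinals -> NBA Finals
    text = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1 \2", text)
    return text.lower()

results = {tag: hashtag_to_words(tag)
           for tag in ["#GameDay", "#BlackLivesMatter", "#NBAFinals"]}
for tag, words in results.items():
    print(tag, "->", words)
```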