# Text Analysis and Preprocessing Pipeline

## Notebook Overview

This notebook provides a comprehensive workflow for text data analysis and preprocessing.

- **Data Loading**: Set up the environment, load and explore text datasets, perform statistical analysis, and create reusable data loading functions
- **Preprocessing**: Apply various NLP preprocessing techniques using NLTK and SpaCy, analyze the impact of different methods, and develop a reusable preprocessing pipeline

The notebook works through each step of the text analysis pipeline in order, following the instructions for the second lab.

---

## 1. Data Loading

This section covers the initial setup and data loading phase. We will install necessary libraries, load the dataset, explore its structure and characteristics, perform statistical analysis, filter the data to text-only content, and create reusable functions for future use.

### 1.1 Install Required Libraries

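This cell installs pandas for data manipulation, NLTK for natural language processing, and spaCy for advanced NLP tasks, together with pyarrow, tqdm, and the Hugging Face datasets library, which are used for data loading and progress reporting. It ensures all necessary dependencies are available for the subsequent analysis.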

In [1]:
%pip install pandas 
%pip install nltk
%pip install spacy
%pip install pyarrow
%pip install tqdm
%pip install datasets

Note: you may need to restart the kernel to use updated packages.


### 1.2 Load and Explore Dataset

Loading the dataset into a pandas DataFrame and performing initial exploration. This includes viewing the first few rows, checking data types, examining the shape of the dataset, and identifying any missing values.

#### 1.2.1 Basic Dataset Overview

In [2]:
import pandas as pd
from datasets import load_dataset

# Load the dataset directly from Hugging Face using the datasets library
dataset = load_dataset("cardiffnlp/tweet_topic_multi", split="train_all")

# Convert to pandas DataFrame
df = dataset.to_pandas()

# Display basic information
print("Row count, Column count:", df.shape)
print("\nColumn Names:", df.columns.tolist())
print("\nData Types:")
print(df.dtypes)
print("\nMissing Values:")
print(df.isnull().sum())
print("\nFirst 10 Rows:")
display(df.head(10))

Row count, Column count: (6090, 5)

Column Names: ['text', 'date', 'label', 'label_name', 'id']

Data Types:
text          object
date          object
label         object
label_name    object
id            object
dtype: object

Missing Values:
text          0
date          0
label         0
label_name    0
id            0
dtype: int64

First 10 Rows:


| | text | date | label | label_name | id |
|---|---|---|---|---|---|
| 0 | The {@Clinton LumberKings@} beat the {@Cedar R... | 2019-09-08 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [sports] | 1170516324419866624 |
| 1 | I would rather hear Eli Gold announce this Aub... | 2019-09-08 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [sports] | 1170516440690176006 |
| 2 | Someone take my phone away, I’m trying to not ... | 2019-09-08 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [sports] | 1170516543387709440 |
| 3 | A year ago, Louisville struggled to beat an FC... | 2019-09-08 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [sports] | 1170516620466429953 |
| 4 | Anyone know why the #Dodgers #Orioles game nex... | 2019-09-08 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [sports] | 1170516711411310592 |
| 5 | I don’t care. you gave him a shot, he is strug... | 2019-09-08 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [sports] | 1170516891053580288 |
| 6 | Okay how can I watch the {@Arkansas State Foot... | 2019-09-08 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [sports] | 1170516916554936322 |
| 7 | Check out largest crowds ever for a basketball... | 2019-09-08 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [sports] | 1170516940902805504 |
| 8 | I voted #WeWantNCAAFootball on {{USERNAME}} . ... | 2019-09-08 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ... | [gaming, sports] | 1170517092489187328 |
| 9 | Streaming a new game #minionmasters come stop ... | 2019-09-08 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ... | [gaming] | 1170546366566846464 |


#### 1.2.2 Initial Data Exploration with NLTK

Using NLTK to perform basic text exploration and tokenization on a sample of tweets.

In [3]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from collections import Counter
import string

# Download required NLTK data
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('punkt_tab', quiet=True)

print("\n" + "="*50)
print("INITIAL TEXT EXPLORATION WITH NLTK")
print("="*50)

# Show sample tokenization on a few tweets
print("\nSample Tweet Tokenization:")
sample_text = df['text'].iloc[0]
print(f"Original: {sample_text}")
tokens = word_tokenize(sample_text)
print(f"Tokens: {tokens}")

# Get English stopwords
stop_words = set(stopwords.words('english'))
print(f"\nNumber of English stopwords: {len(stop_words)}")
print(f"Sample stopwords: {list(stop_words)[:10]}")


INITIAL TEXT EXPLORATION WITH NLTK

Sample Tweet Tokenization:
Original: The {@Clinton LumberKings@} beat the {@Cedar Rapids Kernels@} 4-0 in Game 1 of the Western Division finals. Evan Edwards hit a 2-run HR. WP Josh Roberson: 5 IP, 3 H, 0 R, 0 BB, 10 K #MWLplayoffs #MWLscoreboard
Tokens: ['The', '{', '@', 'Clinton', 'LumberKings', '@', '}', 'beat', 'the', '{', '@', 'Cedar', 'Rapids', 'Kernels', '@', '}', '4-0', 'in', 'Game', '1', 'of', 'the', 'Western', 'Division', 'finals', '.', 'Evan', 'Edwards', 'hit', 'a', '2-run', 'HR', '.', 'WP', 'Josh', 'Roberson', ':', '5', 'IP', ',', '3', 'H', ',', '0', 'R', ',', '0', 'BB', ',', '10', 'K', '#', 'MWLplayoffs', '#', 'MWLscoreboard']

Number of English stopwords: 198
Sample stopwords: ['this', 'own', 'up', "you're", "he's", 're', 'too', 'wouldn', 'them', 'mustn']


#### 1.2.3 Initial Data Exploration with spaCy

Using spaCy to perform basic linguistic analysis on sample tweets.

In [4]:
import spacy

# Load spaCy model
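try:
    nlp = spacy.load('en_core_web_sm')
except OSError:  # spacy.load raises OSError when the model is not installed
    print("Downloading spaCy model...")
    import subprocess
    import sys
    # Run the download with the current interpreter so the model is
    # installed into the environment this kernel is using
    subprocess.run([sys.executable, '-m', 'spacy', 'download', 'en_core_web_sm'], check=True)
    nlp = spacy.load('en_core_web_sm')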

print("\n" + "="*50)
print("INITIAL TEXT EXPLORATION WITH spaCy")
print("="*50)

# Analyze a sample tweet
sample_text = df['text'].iloc[5]
doc = nlp(sample_text)

print(f"\nSample Tweet: {sample_text}")
print(f"\nTokens: {[token.text for token in doc]}")
print(f"\nPOS Tags: {[(token.text, token.pos_) for token in doc][:10]}")
print(f"\nEntities: {[(ent.text, ent.label_) for ent in doc.ents]}")


INITIAL TEXT EXPLORATION WITH spaCy

Sample Tweet: I don’t care. you gave him a shot, he is struggling put Joey in and see what he has to offer. This is the game  you decide who you want starting the Mississippi state game. {@Coach Gus Malzahn@} ‍♂️

Tokens: ['I', 'do', 'n’t', 'care', '.', 'you', 'gave', 'him', 'a', 'shot', ',', 'he', 'is', 'struggling', 'put', 'Joey', 'in', 'and', 'see', 'what', 'he', 'has', 'to', 'offer', '.', 'This', 'is', 'the', 'game', ' ', 'you', 'decide', 'who', 'you', 'want', 'starting', 'the', 'Mississippi', 'state', 'game', '.', '{', '@Coach', 'Gus', 'Malzahn@', '}', '\u200d', '♂', '️']

POS Tags: [('I', 'PRON'), ('do', 'AUX'), ('n’t', 'NOUN'), ('care', 'VERB'), ('.', 'PUNCT'), ('you', 'PRON'), ('gave', 'VERB'), ('him', 'PRON'), ('a', 'DET'), ('shot', 'NOUN')]

Entities: [('Joey', 'PERSON'), ('Mississippi', 'LOC')]


### 1.3 Calculate Statistics

Computing descriptive statistics on the dataset to understand the data distribution. This includes category distributions, temporal patterns, duplicate analysis, text length statistics, word frequency analysis, and named entity statistics using pandas and NLP libraries.

#### 1.3.1 Category/Topic Distribution Analysis

Analyzing how many tweets are available for each topic category using pandas operations.

In [5]:
from collections import Counter

# Convert label_name to lists (they're already arrays, just ensure they're lists)
df['topics_list'] = df['label_name'].apply(lambda x: list(x) if not isinstance(x, list) else x)

# Count tweets per category
all_topics = []
for topics in df['topics_list']:
    all_topics.extend(topics)

topic_counts = Counter(all_topics)

# Create a DataFrame for better visualization
topic_df = pd.DataFrame(
    topic_counts.items(), 
    columns=['Topic', 'Count']
).sort_values('Count', ascending=False)

print("\n" + "="*50)
print("TWEETS PER CATEGORY")
print("="*50)
print(topic_df.to_string(index=False))
print(f"\nTotal unique topics: {len(topic_counts)}")

# Multi-label statistics
df['num_topics'] = df['topics_list'].apply(len)
print(f"\nTweets with 1 topic: {(df['num_topics'] == 1).sum()} ({(df['num_topics'] == 1).sum() / len(df) * 100:.1f}%)")
print(f"Tweets with 2+ topics: {(df['num_topics'] > 1).sum()} ({(df['num_topics'] > 1).sum() / len(df) * 100:.1f}%)")
print(f"\nMax topics per tweet: {df['num_topics'].max()}")
print(f"Average topics per tweet: {df['num_topics'].mean():.2f}")


TWEETS PER CATEGORY
                   Topic  Count
   news_&_social_concern   1782
                  sports   1683
                   music   1131
         film_tv_&_video    953
 celebrity_&_pop_culture    924
    diaries_&_daily_life    866
          arts_&_culture    298
    science_&_technology    294
business_&_entrepreneurs    288
        fitness_&_health    275
           other_hobbies    265
           relationships    264
                  family    252
                  gaming    245
  learning_&_educational    154
           food_&_dining    152
      travel_&_adventure    108
         fashion_&_style    107
    youth_&_student_life     94

Total unique topics: 19

Tweets with 1 topic: 3074 (50.5%)
Tweets with 2+ topics: 3015 (49.5%)

Max topics per tweet: 6
Average topics per tweet: 1.66
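
The same per-topic tally can also be expressed with pandas alone. The following is a minimal sketch on a toy DataFrame (the rows and the name `toy` are made up for illustration), assuming each row holds a list of topic names as in `topics_list`:

```python
import pandas as pd

# Toy stand-in for the notebook's DataFrame (hypothetical rows).
toy = pd.DataFrame({
    "text": ["tweet a", "tweet b", "tweet c"],
    "topics_list": [["sports"], ["gaming", "sports"], ["music"]],
})

# explode() emits one row per (tweet, topic) pair; value_counts() tallies the topics.
topic_counts = toy["topics_list"].explode().value_counts()

# Number of topics attached to each tweet, comparable to df['num_topics'].
num_topics = toy["topics_list"].str.len()

print(topic_counts)
print(num_topics.tolist())
```

Compared with the `Counter` loop, this keeps the result as a pandas Series, which may be convenient for plotting or joining with other statistics.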


#### 1.3.2 Temporal Distribution Analysis

Analyzing the distribution of tweets across different time periods using pandas datetime operations.

In [6]:
# Convert date column to datetime
df['date_parsed'] = pd.to_datetime(df['date'])
df['year'] = df['date_parsed'].dt.year
df['month'] = df['date_parsed'].dt.month
df['day_of_week'] = df['date_parsed'].dt.day_name()

# Count tweets per year
year_counts = df['year'].value_counts().sort_index()

print("\n" + "="*50)
print("TEMPORAL DISTRIBUTION STATISTICS")
print("="*50)

print("\nTweets per Year:")
for year, count in year_counts.items():
    print(f"  {year}: {count:,} tweets ({count/len(df)*100:.1f}%)")

print(f"\nDate range: {df['date_parsed'].min().date()} to {df['date_parsed'].max().date()}")
print(f"Total days covered: {(df['date_parsed'].max() - df['date_parsed'].min()).days} days")

# Month distribution
print("\nTweets per Month (across all years):")
month_counts = df['month'].value_counts().sort_index()
month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
for month, count in month_counts.items():
    print(f"  {month_names[month-1]}: {count:,} tweets")

# Day of week distribution
print("\nTweets per Day of Week:")
dow_counts = df['day_of_week'].value_counts()
for day, count in dow_counts.items():
    print(f"  {day}: {count:,} tweets ({count/len(df)*100:.1f}%)")


TEMPORAL DISTRIBUTION STATISTICS

Tweets per Year:
  2019: 1,555 tweets (25.5%)
  2020: 3,508 tweets (57.6%)
  2021: 1,027 tweets (16.9%)

Date range: 2019-09-08 to 2021-08-29
Total days covered: 721 days

Tweets per Month (across all years):
  Jan: 508 tweets
  Feb: 491 tweets
  Mar: 509 tweets
  Apr: 492 tweets
  May: 497 tweets
  Jun: 504 tweets
  Jul: 521 tweets
  Aug: 535 tweets
  Sep: 494 tweets
  Oct: 485 tweets
  Nov: 508 tweets
  Dec: 546 tweets

Tweets per Day of Week:
  Sunday: 4,698 tweets (77.1%)
  Friday: 238 tweets (3.9%)
  Wednesday: 236 tweets (3.9%)
  Saturday: 236 tweets (3.9%)
  Monday: 234 tweets (3.8%)
  Thursday: 228 tweets (3.7%)
  Tuesday: 220 tweets (3.6%)
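
The day-of-week counts above are listed by frequency, which makes weekday patterns harder to read. A small sketch of how they could be shown in calendar order instead, using a toy DataFrame (the dates are illustrative only) and `reindex` with an explicit day list:

```python
import pandas as pd

# Toy dates (hypothetical sample); 2019-09-08 is a Sunday.
toy = pd.DataFrame({"date": ["2019-09-08", "2019-09-09", "2019-09-15", "2019-09-13"]})
toy["date_parsed"] = pd.to_datetime(toy["date"])
toy["day_of_week"] = toy["date_parsed"].dt.day_name()

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# reindex() puts the counts in calendar order and fills absent days with 0.
dow_counts = toy["day_of_week"].value_counts().reindex(day_order, fill_value=0)

for day, count in dow_counts.items():
    print(f"  {day}: {count} tweets")
```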


#### 1.3.3 Duplicate Detection Statistics

Identifying and quantifying exact and potential near-duplicate tweets using pandas string operations.

In [7]:
print("\n" + "="*50)
print("DUPLICATE ANALYSIS")
print("="*50)

# Check for exact text duplicates
duplicate_texts = df['text'].duplicated().sum()
print(f"Exact duplicate tweets: {duplicate_texts} ({duplicate_texts / len(df) * 100:.2f}%)")

# Check for duplicate IDs
duplicate_ids = df['id'].duplicated().sum()
print(f"Duplicate IDs: {duplicate_ids}")

# Show sample duplicates if any exist
if duplicate_texts > 0:
    print("\nSample Duplicate Tweets:")
    duplicated_mask = df['text'].duplicated(keep=False)
    sample_duplicates = df[duplicated_mask].groupby('text').head(2)
    print(sample_duplicates[['text', 'date', 'label_name']].head(6).to_string())
else:
    print("\n✓ No exact duplicate tweets found!")

# Near-duplicate detection (same first 50 characters)
df['text_start'] = df['text'].str.lower().str.strip().str[:50]
potential_near_dupes = df['text_start'].duplicated().sum()
print(f"\nPotential near-duplicates (same first 50 chars): {potential_near_dupes} ({potential_near_dupes / len(df) * 100:.2f}%)")

# List all near-duplicate IDs grouped
if potential_near_dupes > 0:
    print("\nAll Near-Duplicate Groups (sorted by group size):")
    near_dupe_mask = df['text_start'].duplicated(keep=False)
    near_dupe_df = df[near_dupe_mask][['text_start', 'id', 'text']]
    
    # Group by text_start and collect IDs
    grouped = near_dupe_df.groupby('text_start')
    
    # Sort groups by size (largest first)
    group_sizes = grouped.size().sort_values(ascending=False)
    
    print(f"\nTotal near-duplicate groups: {len(group_sizes)}")
    print(f"Largest group size: {group_sizes.max()}\n")
    
    for i, (text_start, size) in enumerate(group_sizes.items(), 1):
        group_ids = grouped.get_group(text_start)['id'].tolist()
        print(f"Group {i} (Size: {size}):")
        print(f"  Text preview: {text_start}...")
        print(f"  IDs: {group_ids}")
        print()
else:
    print("\n✓ No potential near-duplicates found!")


DUPLICATE ANALYSIS
Exact duplicate tweets: 0 (0.00%)
Duplicate IDs: 0

✓ No exact duplicate tweets found!

Potential near-duplicates (same first 50 chars): 83 (1.36%)

All Near-Duplicate Groups (sorted by group size):

Total near-duplicate groups: 38
Largest group size: 28

Group 1 (Size: 28):
  Text preview: check out what i just added to my closet on poshma...
  IDs: ['1173082475607707649', '1173203031145345024', '1173294716093251585', '1179938888405487616', '1180872625360265216', '1181054714986684416', '1183524248574296065', '1218953515298484224', '1223968032122265600', '1226409017716592641', '1226593965203312640', '1231577835065602050', '1236652371863678977', '1246699764982239232', '1246915880715313152', '1246946850952732673', '1267153856233738240', '1277330021317668864', '1297499158480207872', '1297500377302982656', '1297500531720466432', '1297531856431140865', '1297594410046119936', '1300070301758140416', '1305110632270684161', '1343222350125285376', '1353492013765648384', '1358
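
Matching on the first 50 characters only catches tweets that share an identical opening. As a complementary sketch (not the method used in this notebook), the standard library's `difflib.SequenceMatcher` can score overall similarity between two texts. The sample texts, the helper names, and the 0.8 threshold are assumptions chosen for illustration:

```python
from difflib import SequenceMatcher


def normalise(text):
    """Lowercase and collapse whitespace before comparing."""
    return " ".join(text.lower().split())


def similarity(a, b):
    """Similarity ratio in [0, 1] between two normalised texts."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def near_duplicate_pairs(texts, threshold=0.8):
    """Return index pairs whose similarity reaches the threshold.

    Compares every pair, so the cost grows quadratically with len(texts).
    """
    pairs = []
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if similarity(texts[i], texts[j]) >= threshold:
                pairs.append((i, j))
    return pairs


# Hypothetical sample texts, loosely modelled on the largest group above.
samples = [
    "Check out what I just added to my closet on Poshmark: blue jacket",
    "Check out what I just added to my closet on Poshmark: red scarf",
    "Streaming a new game tonight, come stop by",
]
print(near_duplicate_pairs(samples))
```

Because every pair is compared, this approach suits a few thousand tweets at most; the prefix grouping above scales better and may be a reasonable first pass.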

#### 1.3.4 Text Length and Composition Statistics

Analyzing text characteristics using pandas string methods and aggregations.

In [8]:
# Calculate text statistics
df['text_length'] = df['text'].str.len()
df['word_count'] = df['text'].str.split().str.len()
df['avg_word_length'] = df['text'].str.replace(' ', '').str.len() / df['word_count']

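# Count special characters
# Note: these are raw substring counts. Mentions in this dataset appear as
# {@Name@} (two '@' each) or {{USERNAME}} (no '@'), so mention_count is only
# a rough proxy. url_count matches the literal substring 'http' and would
# miss links that have been replaced by a placeholder.
df['hashtag_count'] = df['text'].str.count('#')
df['mention_count'] = df['text'].str.count('@')
df['url_count'] = df['text'].str.count('http')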

print("\n" + "="*50)
print("TEXT LENGTH AND COMPOSITION STATISTICS")
print("="*50)

print("\nCharacter Length Statistics:")
print(f"  Mean: {df['text_length'].mean():.1f} characters")
print(f"  Median: {df['text_length'].median():.1f} characters")
print(f"  Std Dev: {df['text_length'].std():.1f} characters")
print(f"  Min: {df['text_length'].min()} characters")
print(f"  Max: {df['text_length'].max()} characters")

print("\nWord Count Statistics:")
print(f"  Mean: {df['word_count'].mean():.1f} words")
print(f"  Median: {df['word_count'].median():.1f} words")
print(f"  Std Dev: {df['word_count'].std():.1f} words")
print(f"  Min: {df['word_count'].min()} words")
print(f"  Max: {df['word_count'].max()} words")

print("\nAverage Word Length:")
print(f"  Mean: {df['avg_word_length'].mean():.2f} characters per word")
print(f"  Median: {df['avg_word_length'].median():.2f} characters per word")

print("\nSpecial Character Statistics:")
print(f"  Tweets with hashtags: {(df['hashtag_count'] > 0).sum()} ({(df['hashtag_count'] > 0).sum()/len(df)*100:.1f}%)")
print(f"  Average hashtags per tweet: {df['hashtag_count'].mean():.2f}")
print(f"  Tweets with mentions: {(df['mention_count'] > 0).sum()} ({(df['mention_count'] > 0).sum()/len(df)*100:.1f}%)")
print(f"  Average mentions per tweet: {df['mention_count'].mean():.2f}")
print(f"  Tweets with URLs: {(df['url_count'] > 0).sum()} ({(df['url_count'] > 0).sum()/len(df)*100:.1f}%)")

# Show extreme examples
print("\nExtreme Examples:")
print(f"\nShortest tweet ({df['text_length'].min()} chars):")
print(f"  {df.loc[df['text_length'].idxmin(), 'text']}")
print(f"\nLongest tweet ({df['text_length'].max()} chars, first 150):")
print(f"  {df.loc[df['text_length'].idxmax(), 'text'][:150]}...")


TEXT LENGTH AND COMPOSITION STATISTICS

Character Length Statistics:
  Mean: 170.2 characters
  Median: 161.5 characters
  Std Dev: 68.8 characters
  Min: 38 characters
  Max: 364 characters

Word Count Statistics:
  Mean: 28.2 words
  Median: 26.0 words
  Std Dev: 12.3 words
  Min: 6 words
  Max: 62 words

Average Word Length:
  Mean: 5.17 characters per word
  Median: 5.03 characters per word

Special Character Statistics:
  Tweets with hashtags: 2408 (39.5%)
  Average hashtags per tweet: 0.87
  Tweets with mentions: 4235 (69.5%)
  Average mentions per tweet: 1.87
  Tweets with URLs: 0 (0.0%)

Extreme Examples:

Shortest tweet (38 chars):
  an old pic huh okay king  {{USERNAME}}

Longest tweet (364 chars, first 150):
  Oh yeah that s {{USERNAME}} I #asked Jo back in {{USERNAME}} which is many #yrs {{USERNAME}} #now was he {{USERNAME}} #too {{USERNAME}} and he #replie...
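
Since the sample tweets show mentions written as `{@Name@}` or `{{USERNAME}}`, counting `@` characters does not give one count per mention. A hedged sketch of placeholder-aware counting with regular expressions follows. The `{{URL}}` placeholder is an assumption (it is suggested by the zero URL count above but not shown verbatim), and the function name `count_markers` is hypothetical:

```python
import re

# Mention placeholders as they appear in the sample tweets.
MENTION_RE = re.compile(r"\{@.*?@\}|\{\{USERNAME\}\}")
HASHTAG_RE = re.compile(r"#\w+")
# Assumption: masked links look like {{URL}}; raw http(s) links are matched too.
URL_RE = re.compile(r"\{\{URL\}\}|https?://\S+")


def count_markers(text):
    """Count mentions, hashtags, and URLs as whole units."""
    return {
        "mentions": len(MENTION_RE.findall(text)),
        "hashtags": len(HASHTAG_RE.findall(text)),
        "urls": len(URL_RE.findall(text)),
    }


sample = ("The {@Clinton LumberKings@} beat the {@Cedar Rapids Kernels@} 4-0 "
          "#MWLplayoffs #MWLscoreboard")
print(count_markers(sample))
print("raw '@' count:", sample.count("@"))
```

On this sample the raw `@` count is twice the number of mentions, which illustrates why the two measures can diverge.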


#### 1.3.5 Word Frequency Statistics (NLTK)

Analyzing word frequency patterns using NLTK tokenization and pandas aggregations.

In [9]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string

# Get English stopwords
stop_words = set(stopwords.words('english'))

print("\n" + "="*50)
print("WORD FREQUENCY STATISTICS (NLTK)")
print("="*50)

# Tokenize all tweets
all_tokens = []
for text in df['text']:
    tokens = word_tokenize(text.lower())
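    # Remove punctuation and stopwords, keep only meaningful words.
    # Note: string.punctuation is a str, so `t not in string.punctuation` is a
    # substring test; multi-character tokens such as '...' pass through it.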
    tokens = [
        t for t in tokens 
        if t not in string.punctuation 
        and t not in stop_words 
        and len(t) > 2
    ]
    all_tokens.extend(tokens)

token_freq = Counter(all_tokens)
top_words = pd.DataFrame(
    token_freq.most_common(30), 
    columns=['Word', 'Frequency']
)

print(f"\nVocabulary Statistics:")
print(f"  Total tokens (after filtering): {len(all_tokens):,}")
print(f"  Unique words (vocabulary size): {len(token_freq):,}")
print(f"  Average token frequency: {len(all_tokens) / len(token_freq):.2f}")
print(f"  Words appearing only once: {sum(1 for count in token_freq.values() if count == 1):,}")
print(f"  Words appearing 10+ times: {sum(1 for count in token_freq.values() if count >= 10):,}")

print(f"\nTop 30 Most Common Words:")
print(top_words.to_string(index=False))


WORD FREQUENCY STATISTICS (NLTK)

Vocabulary Statistics:
  Total tokens (after filtering): 103,399
  Unique words (vocabulary size): 22,205
  Average token frequency: 4.66
  Words appearing only once: 13,736
  Words appearing 10+ times: 1,730

Top 30 Most Common Words:
    Word  Frequency
username       3680
     url       2955
     via       1098
     new        593
     ...        432
    love        429
     day        395
    game        346
   happy        338
   music        331
    time        324
     one        322
    news        317
    like        298
     get        274
 youtube        272
  please        267
   today        264
     see        261
   video        260
   first        258
   world        257
    live        257
   great        250
    good        245
    year        239
    back        222
  people        219
   watch        209
  change        207
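
The `...` entry in the list above comes from the way punctuation is filtered: `string.punctuation` is a plain string, so the membership test only removes tokens that occur in it as a substring. A minimal sketch of a character-wise check, with a made-up token list and a hypothetical helper name `is_punctuation_only`:

```python
import string


def is_punctuation_only(token):
    """True when every character of the token is ASCII punctuation."""
    return all(ch in string.punctuation for ch in token)


# Hypothetical tokens resembling word_tokenize output.
tokens = ["new", "...", "love", "!", "``", "game"]

# Substring test, as in the cell above: '...' and '``' slip through.
substring_filter = [t for t in tokens if t not in string.punctuation]

# Character-wise test: multi-character punctuation tokens are removed too.
charwise_filter = [t for t in tokens if not is_punctuation_only(t)]

print(substring_filter)
print(charwise_filter)
```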


#### 1.3.6 Named Entity Statistics (spaCy)

Analyzing named entity distributions using spaCy NER and pandas aggregations.

In [10]:
from tqdm import tqdm

print("\n" + "="*50)
print("NAMED ENTITY STATISTICS (spaCy)")
print("="*50)
print(f"\nAnalyzing {len(df)} tweets... This may take a few minutes...")

# Extract entities from all tweets
all_entities = []
tweets_with_entities = 0
entity_counts_per_tweet = []

# Use nlp.pipe for better performance
texts = df['text'].apply(lambda x: x[:500]).tolist()

for doc in tqdm(nlp.pipe(texts, batch_size=50), total=len(df), desc="Processing"):
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    entity_counts_per_tweet.append(len(entities))
    if entities:
        tweets_with_entities += 1
    all_entities.extend(entities)

# Add entity count to dataframe
df['entity_count'] = entity_counts_per_tweet

print(f"\nEntity Detection Statistics:")
print(f"  Tweets with entities: {tweets_with_entities:,} ({tweets_with_entities/len(df)*100:.1f}%)")
print(f"  Tweets without entities: {len(df) - tweets_with_entities:,} ({(len(df) - tweets_with_entities)/len(df)*100:.1f}%)")
print(f"  Total entities found: {len(all_entities):,}")
print(f"  Average entities per tweet: {df['entity_count'].mean():.2f}")
print(f"  Median entities per tweet: {df['entity_count'].median():.0f}")
print(f"  Max entities in a tweet: {df['entity_count'].max()}")

# Count entity types
entity_types = Counter([label for text, label in all_entities])
entity_type_df = pd.DataFrame(
    entity_types.most_common(), 
    columns=['Entity Type', 'Count']
)
entity_type_df['Percentage'] = (entity_type_df['Count'] / len(all_entities) * 100).round(1)

print("\nEntity Type Distribution:")
print(entity_type_df.to_string(index=False))

print("\nEntity Type Legend:")
print("\nEntity Type Legend (Complete):")
print("  PERSON      = People, including fictional characters")
print("  ORG         = Companies, agencies, institutions, organizations")
print("  GPE         = Countries, cities, states (Geo-Political Entities)")
print("  DATE        = Absolute or relative dates or periods")
print("  CARDINAL    = Numerals that do not fall under another type")
print("  MONEY       = Monetary values, including unit")
print("  TIME        = Times smaller than a day")
print("  NORP        = Nationalities, religious or political groups")
print("  ORDINAL     = First, second, third, etc.")
print("  WORK_OF_ART = Titles of books, songs, movies, etc.")
print("  EVENT       = Named hurricanes, battles, wars, sports events")
print("  PRODUCT     = Objects, vehicles, foods, etc. (not services)")
print("  LOC         = Non-GPE locations, mountain ranges, bodies of water")
print("  FAC         = Buildings, airports, highways, bridges, etc.")
print("  QUANTITY    = Measurements, as of weight or distance")
print("  LAW         = Named documents made into laws")
print("  PERCENT     = Percentage, including '%'")
print("  LANGUAGE    = Any named language")


NAMED ENTITY STATISTICS (spaCy)

Analyzing 6090 tweets... This may take a few minutes...


Processing: 100%|██████████| 6090/6090 [00:23<00:00, 262.62it/s]


Entity Detection Statistics:
  Tweets with entities: 5,644 (92.7%)
  Tweets without entities: 446 (7.3%)
  Total entities found: 17,762
  Average entities per tweet: 2.92
  Median entities per tweet: 3
  Max entities in a tweet: 18

Entity Type Distribution:
Entity Type  Count  Percentage
     PERSON   4394        24.7
        ORG   3965        22.3
   CARDINAL   2418        13.6
       DATE   2126        12.0
        GPE   1441         8.1
      MONEY    892         5.0
       TIME    588         3.3
       NORP    462         2.6
    ORDINAL    395         2.2
WORK_OF_ART    265         1.5
      EVENT    199         1.1
    PRODUCT    182         1.0
        LOC    158         0.9
        FAC    125         0.7
   QUANTITY     53         0.3
        LAW     47         0.3
    PERCENT     45         0.3
   LANGUAGE      7         0.0

Entity Type Legend:

Entity Type Legend (Complete):
  PERSON      = People, including fictional characters
  ORG         = Companies, agencies, instit
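
The statistics above count entity types but not which entity texts dominate each type. One way this could be explored is sketched below with the standard library only. The `all_entities` list here is a tiny stand-in in the same `(text, label)` format as the notebook's list, and the `Auburn` entry is a hypothetical example rather than observed output:

```python
from collections import Counter, defaultdict

# Stand-in for the notebook's all_entities list of (text, label) tuples.
all_entities = [
    ("Joey", "PERSON"),
    ("Mississippi", "LOC"),
    ("Joey", "PERSON"),
    ("Auburn", "ORG"),  # hypothetical entry
]

# One Counter per entity label, keyed by entity text.
by_label = defaultdict(Counter)
for text, label in all_entities:
    by_label[label][text] += 1

# Most frequent entity texts for each label.
top_entities = {label: counts.most_common(2) for label, counts in by_label.items()}
print(top_entities)
```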




### 1.4 Filter to Text Only

This step keeps only the `text` and `label_name` columns and strips digits from the tweet text, so that subsequent processing works on textual content alone. Whitespace left behind by the removed digits is collapsed, and label values are converted to strings when they are not already lists.

In [11]:
def filter_to_text_only(dataframe, text_col='text', label_col='label_name'):
    """
    Filter dataset to only text and label columns, removing all numbers.
    
    Parameters:
    -----------
    dataframe : pd.DataFrame
        Input dataframe to filter
    text_col : str
        Name of the text column (default: 'text')
    label_col : str
        Name of the label column (default: 'label_name')
    
    Returns:
    --------
    pd.DataFrame
        Filtered dataframe with only text columns and no numbers
    """
    
    # Step 1: Select only the text and label_name columns
    df_filtered = dataframe[[text_col, label_col]].copy()
    
    # Step 2: Remove all numbers from the text column using regex
    # This removes all digits (0-9) from the text
    df_filtered[text_col] = df_filtered[text_col].str.replace(r'\d+', '', regex=True)
    
    # Step 3: Handle label_name. Labels are text, so no number removal is needed.
    # Lists are kept as they are; any other type is cast to str.
    # Note: array-valued labels (as to_pandas() typically returns) are not
    # lists, so they take the str branch as well.
    if not isinstance(df_filtered[label_col].iloc[0], list):
        df_filtered[label_col] = df_filtered[label_col].astype(str)
    
    # Step 4: Clean up any extra whitespace created by removing numbers
    df_filtered[text_col] = df_filtered[text_col].str.replace(r'\s+', ' ', regex=True).str.strip()
    
    return df_filtered


# Apply the filtering function
print("\n" + "="*50)
print("FILTERING TO TEXT ONLY")
print("="*50)

# Show original dataset info
print("\nOriginal Dataset:")
print(f"  Shape: {df.shape}")
print(f"  Columns: {df.columns.tolist()}")
print(f"  Sample text: {df['text'].iloc[0][:100]}...")

# Apply the filter
df_text_only = filter_to_text_only(df)

# Show filtered dataset info
print("\nFiltered Dataset (Text Only - No Numbers):")
print(f"  Shape: {df_text_only.shape}")
print(f"  Columns: {df_text_only.columns.tolist()}")
print(f"  Sample text: {df_text_only['text'].iloc[0][:100]}...")

# Show examples of number removal
print("\nExamples of Number Removal:")
for i in range(3):
    print(f"\nExample {i+1}:")
    print(f"  Original:  {df['text'].iloc[i][:80]}...")
    print(f"  Filtered:  {df_text_only['text'].iloc[i][:80]}...")

# Show label_name comparison
print("\nLabel Name Comparison:")
print(f"  Original label_name: {df['label_name'].iloc[0]}")
print(f"  Filtered label_name: {df_text_only['label_name'].iloc[0]}")

# Statistics on number removal
original_chars = df['text'].str.len().sum()
filtered_chars = df_text_only['text'].str.len().sum()
chars_removed = original_chars - filtered_chars

print(f"\nCharacter Statistics:")
print(f"  Original total characters: {original_chars:,}")
print(f"  Filtered total characters: {filtered_chars:,}")
print(f"  Characters removed (numbers): {chars_removed:,} ({chars_removed/original_chars*100:.2f}%)")

print("\n✓ Dataset successfully filtered to text only (numbers removed)")


FILTERING TO TEXT ONLY

Original Dataset:
  Shape: (6090, 19)
  Columns: ['text', 'date', 'label', 'label_name', 'id', 'topics_list', 'num_topics', 'date_parsed', 'year', 'month', 'day_of_week', 'text_start', 'text_length', 'word_count', 'avg_word_length', 'hashtag_count', 'mention_count', 'url_count', 'entity_count']
  Sample text: The {@Clinton LumberKings@} beat the {@Cedar Rapids Kernels@} 4-0 in Game 1 of the Western Division ...

Filtered Dataset (Text Only - No Numbers):
  Shape: (6090, 2)
  Columns: ['text', 'label_name']
  Sample text: The {@Clinton LumberKings@} beat the {@Cedar Rapids Kernels@} - in Game of the Western Division fina...

Examples of Number Removal:

Example 1:
  Original:  The {@Clinton LumberKings@} beat the {@Cedar Rapids Kernels@} 4-0 in Game 1 of t...
  Filtered:  The {@Clinton LumberKings@} beat the {@Cedar Rapids Kernels@} - in Game of the W...

Example 2:
  Original:  I would rather hear Eli Gold announce this Auburn game than these dumbasses. {@E...
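
As the examples show, removing digits leaves fragments such as a lone `-` where `4-0` used to be. The sketch below reproduces that behaviour with the standard library and contrasts it with a variant that drops every whitespace-separated token containing a digit. The variant and both function names are assumptions for illustration, not part of the notebook's pipeline:

```python
import re


def strip_digits(text):
    """Remove digit runs, then collapse whitespace (mirrors the filter above)."""
    no_digits = re.sub(r"\d+", "", text)
    return re.sub(r"\s+", " ", no_digits).strip()


def strip_number_tokens(text):
    """Variant: drop whole whitespace-separated tokens that contain a digit."""
    kept = [tok for tok in text.split() if not any(ch.isdigit() for ch in tok)]
    return " ".join(kept)


sample = ("beat the {@Cedar Rapids Kernels@} 4-0 in Game 1 of the Western "
          "Division finals. Evan Edwards hit a 2-run HR.")

print(strip_digits(sample))
print(strip_number_tokens(sample))
```

The token-level variant avoids stray hyphens but also discards words such as `2-run`, so which behaviour is preferable depends on the downstream task.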


### 1.5 Create Reusable Loading and Filtering Functions

Reusable functions that encapsulate the data loading and text-filtering logic.

In [12]:
import pandas as pd
from datasets import load_dataset

# Load the dataset directly from Hugging Face using the datasets library
dataset = load_dataset("cardiffnlp/tweet_topic_multi", split="train_all")

# Convert to pandas DataFrame
tweets_raw = dataset.to_pandas()

# Filter the dataset to text only (no numbers)
def filter_to_text_only(dataframe, text_col='text', label_col='label_name', label_num_col='label'):
    df_filtered = dataframe[[text_col, label_col, label_num_col]].copy()
    df_filtered[text_col] = df_filtered[text_col].str.replace(r'\d+', '', regex=True)
    
    if isinstance(df_filtered[label_col].iloc[0], list):
        pass
    else:
        df_filtered[label_col] = df_filtered[label_col].astype(str)
    
    df_filtered[text_col] = df_filtered[text_col].str.replace(r'\s+', ' ', regex=True).str.strip()
    
    return df_filtered

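# Filter the freshly loaded data (not the `df` from earlier cells, which carries extra columns)
tweets_text_only = filter_to_text_only(tweets_raw)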

print("\n✓ Dataset successfully loaded and filtered to text only")


✓ Dataset successfully loaded and filtered to text only
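
The cell above loads the data at the top level rather than inside a function. A possible way to wrap loading and column selection into reusable helpers is sketched here. The names `load_tweets` and `select_columns` are hypothetical, and the toy DataFrame stands in for the real dataset so the example runs without network access:

```python
import pandas as pd


def load_tweets(dataset_name="cardiffnlp/tweet_topic_multi", split="train_all"):
    """Load one split from the Hugging Face Hub and return it as a DataFrame."""
    # Imported lazily so select_columns works without the datasets package.
    from datasets import load_dataset
    return load_dataset(dataset_name, split=split).to_pandas()


def select_columns(dataframe, columns=("text", "label_name", "label")):
    """Return a copy with only the requested columns, failing early if any are missing."""
    missing = [col for col in columns if col not in dataframe.columns]
    if missing:
        raise KeyError(f"Missing columns: {missing}")
    return dataframe[list(columns)].copy()


# Toy frame (hypothetical row) used instead of a real download.
toy = pd.DataFrame({
    "text": ["Game 1 recap"],
    "label_name": [["sports"]],
    "label": [[0, 1]],
    "id": ["1"],
})
subset = select_columns(toy)
print(subset.columns.tolist())
```

In a notebook, `select_columns(load_tweets())` would then give the same three columns that `filter_to_text_only` keeps, before any digit removal.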


---

## 2. Preprocessing

This section focuses on text preprocessing techniques. We will review common preprocessing methods, apply them systematically using NLTK and SpaCy, analyze how the order of operations affects results, evaluate the usefulness of each method for different scenarios, and create a reusable preprocessing pipeline.


### Methods Used (in Order)

#### 1. Remove RT, URLs, and Mentions
- Removes retweet indicators, web links, and user mentions
- **Why:** Don't contribute to topic classification
- **Order:** Before tokenization to remove as complete units

#### 2. Extract Hashtag Text
- Converts `#Gaming` → `gaming`
- **Why:** Hashtags contain topic keywords - critical for classification!
- **Order:** Before tokenization, preserves topic information

#### 3. Normalize & Lowercase
- Removes extra spaces and converts to lowercase
- **Why:** "Sports" and "sports" become the same token, which shrinks the vocabulary (by an estimated ~30-40%)
- **Order:** Before tokenization for consistent token matching

#### 4. Tokenization (SpaCy)
- Breaks text into individual tokens/words
- **Why:** Foundation for all token-based processing
- **Order:** After text cleaning, before token filtering

#### 5. Remove Punctuation
- Filters out punctuation tokens
- **Why:** Don't contribute semantic meaning for topics
- **Order:** After tokenization using `token.is_punct`

#### 6. Filter Non-Alphabetic & Short Tokens
- Removes non-alphabetic tokens (emojis, special chars) and tokens < 2 characters
- **Why:** Reduces noise from social media content
- **Order:** After tokenization, before stopword removal

#### 7. Custom Stopword Removal (SpaCy)
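- Removes common function words using SpaCy's `token.is_stop` flag; the intent is to keep topic-relevant words (game, music, news, sport, film, video, watch, play)
- **Why:** A stopword list that removed topic indicators would hurt classification. Note that the pipeline code in this section uses SpaCy's default list unchanged and does not define custom exceptions
- **Order:** After lowercasing; stopwords are dropped before their lemma is collected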
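
One way to build a stopword list that protects topic words is a set difference, sketched below. The `base_stopwords` set is a deliberately tiny, hypothetical stand-in (a real list such as SpaCy's is far larger and has different contents), and the variable names are assumptions made for this example.

```python
# Hypothetical, tiny base list used only for illustration.
base_stopwords = {"the", "a", "is", "to", "and", "of", "show", "play"}

# Words assumed to carry topic information and therefore protected from removal.
topic_words = {"game", "music", "news", "sport", "film", "video", "watch", "play"}

# Custom list = base list minus the protected topic words.
custom_stopwords = base_stopwords - topic_words

tokens = ["watch", "the", "game", "and", "play", "music"]
kept = [t for t in tokens if t not in custom_stopwords]
print(kept)
```

The same idea carries over to any base list: subtract the protected words once, then filter against the result.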

#### 8. Latin Alphabet Check
- Keeps only words made up entirely of the Latin letters a-z
- **Why:** Removes words written in non-Latin scripts (Cyrillic, Arabic, Chinese, etc.) while keeping modern social media terms (videogames, esports)
- **Order:** After alphabetic filtering, before lemmatization
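
The effect of a strict a-z check can be sketched as follows. The helper `uses_only_basic_latin` and the sample words are assumptions for illustration; the notebook's own helper is `is_latin_alphabet`.

```python
import string

LATIN_LETTERS = set(string.ascii_letters)  # the 52 unaccented letters a-z and A-Z


def uses_only_basic_latin(word):
    """Return True if the word is non-empty and every character is in a-z / A-Z."""
    return bool(word) and all(ch in LATIN_LETTERS for ch in word)


samples = ["esports", "videogames", "москва", "音楽", "café"]
results = {word: uses_only_basic_latin(word) for word in samples}
print(results)
```

A consequence worth keeping in mind: a check limited to a-z also rejects Latin-script words with accented letters, such as "café".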

#### 9. Lemmatization (SpaCy)
- Converts to base form: running → run, games → game
- **Why:** Groups related word forms, better than stemming for preserving meaning
- **Order:** LAST - only tokens that pass every filter are emitted, in their lemma form (`token.lemma_`)

### Why This Order Matters

**Critical Dependencies:**
- Special tokens **before** tokenization → removed as complete units
- CamelCase segmentation **before** lowercasing → capital letters still mark word boundaries
- Lowercase **before** tokenization → consistent stopword matching
- Tokenization **before** filtering → need tokens to filter
- Lemmatization **last** → final transformation on clean tokens
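
One ordering dependency in this notebook's pipeline is that CamelCase segmentation runs before lowercasing. The sketch below shows why, using a simplified stand-in (`split_camel_case`, an assumed name) that only handles lowercase-to-uppercase boundaries and is not the notebook's `segment_camelcase`.

```python
def split_camel_case(word):
    """Insert a space at every lowercase-to-uppercase boundary."""
    pieces = []
    for i, ch in enumerate(word):
        if i > 0 and ch.isupper() and word[i - 1].islower():
            pieces.append(" ")
        pieces.append(ch)
    return "".join(pieces)


hashtag_text = "GameOfThrones"

# Correct order: segment first, then lowercase.
segment_then_lower = split_camel_case(hashtag_text).lower()

# Wrong order: lowercasing first erases the boundaries the splitter relies on.
lower_then_segment = split_camel_case(hashtag_text.lower())

print(segment_then_lower)
print(lower_then_segment)
```

With the steps swapped, the hashtag stays a single opaque token, so the topic word "game" is never exposed to the later filters.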

### 2.5 Reusable Preprocessing Pipeline

Development of a modular, configurable preprocessing function that can be easily reused in future labs. The pipeline allows for flexible selection of preprocessing steps and parameters, making it adaptable to different text analysis tasks and requirements.

In [15]:
# Download required NLTK data
import nltk
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('punkt_tab', quiet=True)

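# Load SpaCy model
import spacy
try:
    nlp = spacy.load('en_core_web_sm')
    print("✓ SpaCy model loaded successfully")
except OSError:
    # spacy.load raises OSError when the model package is not installed
    print("Installing SpaCy model...")
    import subprocess
    import sys
    # Use the kernel's own interpreter so the model lands in the active environment
    subprocess.run([sys.executable, '-m', 'spacy', 'download', 'en_core_web_sm'], check=True)
    nlp = spacy.load('en_core_web_sm')
    print("✓ SpaCy model loaded successfully")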

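def is_latin_alphabet(word):
    """
    Check if a word contains only Latin alphabet characters (a-z, A-Z).
    Filters out words with Cyrillic, Arabic, Chinese, etc.
    """
    if not word:
        return False
    # Compare characters directly: ord(c.lower()) raises TypeError for characters
    # whose lowercase form is longer than one character (e.g. 'İ')
    return all('a' <= c <= 'z' or 'A' <= c <= 'Z' for c in word)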

def segment_camelcase(text):
    """
    Segments CamelCase words into separate words without using regex.
    Example: 'GameOfThrones' → 'Game Of Thrones'
    This matters for hashtags such as #GameOfThrones once the # has been removed.
    """
    if not text:
        return text
    
    result = []
    
    for i, char in enumerate(text):
        # Append the current character
        result.append(char)
        
        # Check whether a space needs to be inserted
        if i < len(text) - 1:
            current = char
            next_char = text[i + 1]
            
            # Case 1: lowercase → uppercase (e.g. 'e' → 'O' in 'GameOf')
            if current.islower() and next_char.isupper():
                result.append(' ')
            
            # Case 2: uppercase → uppercase → lowercase (e.g. 'HTML' → 'Parser')
            elif i < len(text) - 2:
                after_next = text[i + 2]
                if current.isupper() and next_char.isupper() and after_next.islower():
                    result.append(' ')
    
    return ''.join(result)

def preprocess_tweet(text):
    """
    Topic-optimized preprocessing for tweet classification.
    Preserves topic-relevant information while removing noise.
    Removes special characters, emojis, and non-Latin script words.
    """
    if not isinstance(text, str):
        return ""
    
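    # Step 1: Remove RT (retweet indicator) as a standalone token only, so that
    # words ending in "rt" (e.g. "sport", "start") are left intact
    text = ' '.join(w for w in text.split() if w.lower() != 'rt')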
    
    # Step 2: Remove URLs and placeholders
    text = text.replace('{{URL}}', ' ')
    text = text.replace('{{USERNAME}}', ' ')
    for protocol in ['https://', 'http://', 'www.']:
        if protocol in text:
            parts = text.split(protocol)
            text = parts[0] + ' ' + ' '.join([' '.join(p.split()[1:]) if p.split() else '' for p in parts[1:]])
    
    # Step 3: Remove mentions
    words_list = text.split()
    words_list = [w for w in words_list if not (w.startswith('{@') or w.startswith('@'))]
    text = ' '.join(words_list)
    
    # Step 4: Extract hashtag text (#Gaming → Gaming, #GameOfThrones → GameOfThrones)
    words_list = text.split()
    words_list = [w[1:] if w.startswith('#') else w for w in words_list]
    text = ' '.join(words_list)
    
    # Step 4.5: Segment CamelCase words (IMPORTANT: before lowercasing!)
    # GameOfThrones → Game Of Thrones
    text = segment_camelcase(text)
    
    # Step 5: Normalize whitespace and lowercase
    text = ' '.join(text.split())
    text = text.lower()
    
    # Step 6: Tokenize with SpaCy
    doc = nlp(text)
    
    # Step 7: Filter and lemmatize tokens
    processed_tokens = []
    for token in doc:
        # Skip punctuation
        if token.is_punct:
            continue
        
        # Skip if not alphabetic (removes special characters, emojis, numbers)
        if not token.is_alpha:
            continue
        
        # Skip tokens shorter than 2 characters
        if len(token.text) < 2:
            continue
        
        # Remove stopwords (using SpaCy's stopword detection)
        if token.is_stop:
            continue
        
        # Check if word uses Latin alphabet (filters out Cyrillic, Arabic, Chinese, etc.)
        if not is_latin_alphabet(token.text):
            continue
        
        # Use lemmatized form
        processed_tokens.append(token.lemma_)
    
    return ' '.join(processed_tokens)

# Create a copy of the original dataframe
tweets_preprocessed_train = tweets_text_only.copy()

# Apply preprocessing
tweets_preprocessed_train['text'] = tweets_preprocessed_train['text'].apply(preprocess_tweet)

print("\n✓ Preprocessing complete!")
print(f"✓ Processed {len(tweets_preprocessed_train)} tweets")
print("✓ Original 'tweets_text_only' unchanged | Processed data in 'tweets_preprocessed_train'")

# Save the DataFrame to the Data folder
import os

# Create the Data folder if it does not exist
os.makedirs('../Data', exist_ok=True)

# Save to disk
output_path = '../Data/tweets_preprocessed_train.parquet'
tweets_preprocessed_train.to_parquet(output_path, index=False)

print(f"\n✓ DataFrame for training saved under path: {output_path}")
print(f"✓ Features: {list(tweets_preprocessed_train.columns)}")
print(f"✓ Shape: {tweets_preprocessed_train.shape}")

✓ SpaCy model loaded successfully

✓ Preprocessing complete!
✓ Processed 6090 tweets
✓ Original 'tweets_text_only' unchanged | Processed data in 'tweets_preprocessed_train'

✓ DataFrame for training saved under path: ../Data/tweets_preprocessed_train.parquet
✓ Features: ['text', 'label_name', 'label']
✓ Shape: (6090, 3)
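
The `preprocess_tweet` function above applies a fixed sequence of steps. If selectable steps and parameters are wanted in later labs, one possible shape is a small configuration object passed to the function. The sketch below is an assumption-laden illustration: `PreprocessConfig`, its fields, and `preprocess_configurable` do not exist in the notebook, and only a few whitespace-level steps are covered (no SpaCy tokenization or lemmatization).

```python
from dataclasses import dataclass


@dataclass
class PreprocessConfig:
    """Hypothetical switches; none of these names exist in the notebook."""
    remove_mentions: bool = True
    strip_hashtag_symbol: bool = True
    lowercase: bool = True
    min_token_length: int = 2


def preprocess_configurable(text, config=None):
    """Whitespace-token preprocessing controlled by a PreprocessConfig."""
    config = config or PreprocessConfig()
    if not isinstance(text, str):
        return ""
    tokens = text.split()
    if config.remove_mentions:
        tokens = [t for t in tokens if not t.startswith("@")]
    if config.strip_hashtag_symbol:
        tokens = [t[1:] if t.startswith("#") else t for t in tokens]
    if config.lowercase:
        tokens = [t.lower() for t in tokens]
    # Drop tokens shorter than the configured minimum length.
    tokens = [t for t in tokens if len(t) >= config.min_token_length]
    return " ".join(tokens)


default_result = preprocess_configurable("@fan I love #Gaming")
relaxed_result = preprocess_configurable(
    "@fan I love #Gaming",
    PreprocessConfig(lowercase=False, min_token_length=1),
)
print(default_result)
print(relaxed_result)
```

Keeping the switches in one dataclass means a different task can reuse the function by passing a different configuration rather than editing the function body.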
