# 01. Text Preprocessing and Streamlined Feature Engineering

This notebook handles:
1. **Data loading and exploration** - Basic toxicity analysis
2. **Streamlined toxicity analysis** - Selected composite features for focused moderation
3. **Spam score generation** - Rule-based approach with numerical scores (not binary labels)
4. **Text cleaning and normalization** - Robust preprocessing with error handling
5. **Feature extraction** - Rich characteristics and patterns for ML models
6. **Streamlined data export** - Essential features for efficient content moderation

## 🎯 Streamlined Features:

### **Focused Toxicity Analysis**
- **Basic Toxicity**: Original toxicity score (0.0 to 1.0)
- **Composite Scores**: overall_toxicity, sexual_content_score
- **Spam Detection**: Numerical spam scores with rule-based patterns
- **Text Characteristics**: 11 essential text features

### **Key Approach**: 
Instead of creating explicit categorical labels (toxic/safe/spam), this notebook focuses on extracting essential numerical characteristics and scores that allow downstream models to learn optimal patterns and thresholds for content moderation. The streamlined approach provides focused scoring for efficient content moderation.


In [43]:
import pandas as pd
import numpy as np
import re
import string
from collections import Counter
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
import warnings
warnings.filterwarnings('ignore')

# Text processing libraries
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer

# Download required NLTK data
print("Downloading NLTK data...")

# Download punkt_tab (newer version of punkt)
try:
    nltk.data.find('tokenizers/punkt_tab')
except LookupError:
    print("Downloading punkt_tab...")
    nltk.download('punkt_tab')

# Also try the older punkt as fallback
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    print("Downloading punkt...")
    nltk.download('punkt')

try:
    nltk.data.find('corpora/stopwords')
except LookupError:
    print("Downloading stopwords...")
    nltk.download('stopwords')

try:
    nltk.data.find('corpora/wordnet')
except LookupError:
    print("Downloading wordnet...")
    nltk.download('wordnet')

print("Libraries imported successfully!")


Downloading NLTK data...
Downloading wordnet...
Libraries imported successfully!


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\elzok\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## 1. Data Loading and Basic Toxicity Analysis

This section loads the dataset and explores the basic toxicity score, providing a focused view of the content moderation data.


In [44]:
# Load the dataset
print("Loading dataset...")
df = pd.read_csv('all_data.csv')
print(f"Dataset loaded successfully! Shape: {df.shape}")

# Display basic information
print("\nDataset Info:")
print(f"Columns: {list(df.columns)}")
print(f"\nFirst few rows:")
print(df.head())

# Check for missing values
print("\nMissing values:")
print(df.isnull().sum().head(10))


Loading dataset...
Dataset loaded successfully! Shape: (1999516, 46)

Dataset Info:
Columns: ['id', 'comment_text', 'split', 'created_date', 'publication_id', 'parent_id', 'article_id', 'rating', 'funny', 'wow', 'sad', 'likes', 'disagree', 'toxicity', 'severe_toxicity', 'obscene', 'sexual_explicit', 'identity_attack', 'insult', 'threat', 'male', 'female', 'transgender', 'other_gender', 'heterosexual', 'homosexual_gay_or_lesbian', 'bisexual', 'other_sexual_orientation', 'christian', 'jewish', 'muslim', 'hindu', 'buddhist', 'atheist', 'other_religion', 'black', 'white', 'asian', 'latino', 'other_race_or_ethnicity', 'physical_disability', 'intellectual_or_learning_disability', 'psychiatric_or_mental_illness', 'other_disability', 'identity_annotator_count', 'toxicity_annotator_count']

First few rows:
        id                                       comment_text  split  \
0  1083994  He got his money... now he lies in wait till a...  train   
1   650904  Mad dog will surely put the liberal

In [45]:
# Focus on the main columns we need - including all toxicity dimensions for overall toxicity calculation
main_columns = [
    'id', 'comment_text', 'toxicity', 'severe_toxicity', 'obscene', 
    'sexual_explicit', 'identity_attack', 'insult', 'threat'
]
df_main = df[main_columns].copy()

# Remove rows with missing comment_text
df_main = df_main.dropna(subset=['comment_text'])
print(f"Dataset after removing missing comments: {df_main.shape}")

# Check toxicity distribution
print("\nToxicity distribution:")
print(f"Min: {df_main['toxicity'].min():.4f}, Max: {df_main['toxicity'].max():.4f}")
print(f"Mean: {df_main['toxicity'].mean():.4f}, Median: {df_main['toxicity'].median():.4f}")


Dataset after removing missing comments: (1999512, 9)

Toxicity distribution:
Min: 0.0000, Max: 1.0000
Mean: 0.1029, Median: 0.0000


## 2. Advanced Toxicity Analysis & Composite Feature Engineering

This section creates comprehensive toxicity features by:
- **Analyzing all toxicity dimensions** (7 original + annotator count)
- **Creating composite scores** for different types of harmful content
- **Generating confidence metrics** based on annotation reliability
- **Building specialized scoring** for harassment, sexual content, and violence

### **Composite Features Created:**
- **Overall Toxicity**: Weighted combination of all toxicity types
- **Harassment Score**: Focus on identity attacks, insults, and threats  
- **Sexual Content Score**: Sexual explicit and obscene content detection
- **Violence Score**: Threat and severe toxicity assessment
- **Annotation Confidence**: Reliability based on annotator count
- **Toxicity Diversity**: Number of different toxicity types present


In [46]:
# Create composite toxicity features
print("Creating composite toxicity features...")

# Overall toxicity severity (weighted combination)
df_main['overall_toxicity'] = (
    df_main['toxicity'] * 0.4 +           # General toxicity
    df_main['severe_toxicity'] * 0.3 +    # Severe toxicity (higher weight)
    df_main['obscene'] * 0.1 +            # Obscene content
    df_main['sexual_explicit'] * 0.1 +    # Sexual content
    df_main['identity_attack'] * 0.05 +   # Identity attacks
    df_main['insult'] * 0.03 +            # Insults
    df_main['threat'] * 0.02              # Threats
)


print("Composite feature created:")
print(f"Overall toxicity range: {df_main['overall_toxicity'].min():.4f} to {df_main['overall_toxicity'].max():.4f}")


Creating composite toxicity features...
Composite feature created:
Overall toxicity range: 0.0000 to 0.8319


## 3. Spam Detection & Rule-Based Scoring

This section implements comprehensive spam detection using multiple rule-based patterns:
- **URL Detection**: Links and web addresses
- **Spam Keywords**: Money-making, promotional language
- **Text Patterns**: Excessive caps, repeated characters, punctuation
- **Contact Information**: Phone numbers, email addresses
- **Currency Symbols**: Financial content indicators

### **Spam Scoring System:**
- **Numerical scores** (0-105+) instead of binary classification
- **Multiple rule triggers** with weighted scoring
- **Pattern diversity** detection for sophisticated spam
- **Confidence weighting** based on rule combinations


In [None]:
def detect_spam_patterns(text):
    """
    Detect spam patterns in text using multiple rules.
    Returns a spam score and list of triggered rules.
    """
    if pd.isna(text) or not isinstance(text, str):
        return 0, []
    
    text_lower = text.lower()
    spam_score = 0
    triggered_rules = []
    
    # Rule 1: URLs and links
    url_patterns = [
        r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+',
        r'www\\.[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}',
        r'[a-zA-Z0-9.-]+\\.(com|org|net|edu|gov|mil|int|co|uk|de|fr|jp|au|us|ca|mx|br|es|it|ru|cn|in|kr|nl|se|no|dk|fi|pl|tr|za|th|my|sg|hk|tw|nz|ph|id|vn)'
    ]
    
    for pattern in url_patterns:
        if re.search(pattern, text_lower):
            spam_score += 30
            triggered_rules.append('URL detected')
            break
    
    # Rule 2: Spam keywords
    spam_keywords = [
        'buy now', 'click here', 'free money', 'make money', 'earn money',
        'work from home', 'get rich', 'quick cash', 'easy money',
        'guaranteed', 'no risk', 'limited time', 'act now', 'dont wait',
        'special offer', 'discount', 'sale', 'promotion', 'deal',
        'win', 'winner', 'prize', 'lottery', 'jackpot',
        'viagra', 'cialis', 'pharmacy', 'medication', 'prescription',
        'casino', 'gambling', 'bet', 'poker', 'slots'
    ]
    
    keyword_count = 0
    for keyword in spam_keywords:
        if keyword in text_lower:
            keyword_count += 1
            spam_score += 20
    
    if keyword_count > 0:
        triggered_rules.append(f'Spam keywords ({keyword_count})')
    
    # Rule 3: Excessive capitalization
    if len(text) > 0:
        caps_ratio = sum(1 for c in text if c.isupper()) / len(text)
        if caps_ratio > 0.7:
            spam_score += 20
            triggered_rules.append('Excessive capitalization')
    
    # Rule 4: Repeated characters
    repeated_chars = re.findall(r'(.)\\1{2,}', text)
    if repeated_chars:
        spam_score += 20
        triggered_rules.append('Repeated characters')
    
    # Rule 5: Excessive punctuation
    punct_count = sum(1 for c in text if c in string.punctuation)
    if len(text) > 0 and punct_count / len(text) > 0.3:
        spam_score += 20
        triggered_rules.append('Excessive punctuation')
    
    # Rule 6: Phone numbers
    phone_pattern = r'\\b\\d{3}[-.]?\\d{3}[-.]?\\d{4}\\b|\\b\\d{10}\\b'
    if re.search(phone_pattern, text):
        spam_score += 25
        triggered_rules.append('Phone number')
    
    # Rule 7: Email patterns
    email_pattern = r'\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Z|a-z]{2,}\\b'
    if re.search(email_pattern, text):
        spam_score += 20
        triggered_rules.append('Email address')
    
    # Rule 8: Currency symbols and numbers
    currency_pattern = r'[\\$€£¥₹]\\s*\\d+|[\\d,]+\\s*[\\$€£¥₹]'
    if re.search(currency_pattern, text):
        spam_score += 15
        triggered_rules.append('Currency symbols')
    
    return spam_score, triggered_rules

print("Spam detection function created!")


Spam detection function created!


In [48]:
# Apply spam detection to all comments
print("Applying spam detection to all comments...")
print(f"Number of comments: {len(df_main)}")

# Apply spam detection
spam_results = df_main['comment_text'].apply(detect_spam_patterns)
df_main['spam_score'] = [result[0] for result in spam_results]
df_main['spam_rules'] = [result[1] for result in spam_results]

# Display spam score distribution
print("\nSpam score distribution:")
print(df_main['spam_score'].describe())

# Show examples of high spam scores
high_spam = df_main[df_main['spam_score'] >= 50].head(10)
print("\nExamples of high spam score comments:")
for idx, row in high_spam.iterrows():
    print(f"Score: {row['spam_score']}, Rules: {row['spam_rules']}")
    print(f"Text: {row['comment_text'][:100]}...")
    print("-" * 50)


Applying spam detection to all comments...
Number of comments: 1999512

Spam score distribution:
count    1.999512e+06
mean     3.646590e+00
std      8.398004e+00
min      0.000000e+00
25%      0.000000e+00
50%      0.000000e+00
75%      0.000000e+00
max      1.050000e+02
Name: spam_score, dtype: float64

Examples of high spam score comments:
Score: 55, Rules: ['Spam keywords (2)', 'Excessive capitalization']
Text: KILL THE CORRUPT HART.  MAKE THE CITY PROJECT A CITY PROJECT WITH NO MORE UNACCOUNTABLE CRONY FEEDIN...
--------------------------------------------------
Score: 60, Rules: ['URL detected', 'Spam keywords (2)']
Text: martin.t
As a Canadian you can choose to understand Trudeau's basic dictatorship answer or you can e...
--------------------------------------------------
Score: 60, Rules: ['URL detected', 'Spam keywords (2)']
Text: http://www.bullshitexposed.com/scandinavian-socialism-debunked/

https://fee.org/articles/the-myth-o...
-------------------------------------------

In [49]:
# Keep spam scores as numerical features (no binary classification)
print("Spam scores kept as numerical features for ML models to learn from")
print(f"Spam score range: {df_main['spam_score'].min()} to {df_main['spam_score'].max()}")
print(f"Comments with spam score > 40: {(df_main['spam_score'] > 40).sum()}")
print(f"Comments with spam score > 60: {(df_main['spam_score'] > 60).sum()}")
print(f"Comments with spam score > 80: {(df_main['spam_score'] > 80).sum()}")

# Show correlation between toxicity and spam scores
print(f"\nCorrelation between toxicity and spam scores: {df_main['toxicity'].corr(df_main['spam_score']):.4f}")


Spam scores kept as numerical features for ML models to learn from
Spam score range: 0 to 105
Comments with spam score > 40: 14077
Comments with spam score > 60: 164
Comments with spam score > 80: 14

Correlation between toxicity and spam scores: -0.0241


## 3. Text Cleaning and Normalization


In [50]:
# Initialize text processing tools
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def clean_text(text):
    """
    Comprehensive text cleaning function.
    """
    if pd.isna(text) or not isinstance(text, str):
        return ""
    
    # Convert to lowercase
    text = text.lower()
    
    # Remove URLs
    text = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', text)
    text = re.sub(r'www\\.[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}', '', text)
    
    # Remove email addresses
    text = re.sub(r'\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Z|a-z]{2,}\\b', '', text)
    
    # Remove phone numbers
    text = re.sub(r'\\b\\d{3}[-.]?\\d{3}[-.]?\\d{4}\\b|\\b\\d{10}\\b', '', text)
    
    # Remove extra whitespace
    text = re.sub(r'\\s+', ' ', text)
    
    # Remove special characters but keep basic punctuation
    text = re.sub(r'[^a-zA-Z0-9\\s.,!?]', '', text)
    
    return text.strip()

def remove_stopwords_and_lemmatize(text):
    """
    Remove stopwords and apply lemmatization.
    """
    if not text:
        return ""
    
    # Tokenize
    tokens = word_tokenize(text)
    
    # Remove stopwords and lemmatize
    processed_tokens = []
    for token in tokens:
        if token not in stop_words and len(token) > 2:
            lemmatized = lemmatizer.lemmatize(token)
            processed_tokens.append(lemmatized)
    
    return ' '.join(processed_tokens)

print("Text cleaning functions created!")


Text cleaning functions created!


In [51]:
# Apply text cleaning
print("Applying text cleaning...")
df_main['cleaned_text'] = df_main['comment_text'].apply(clean_text)
df_main['processed_text'] = df_main['cleaned_text'].apply(remove_stopwords_and_lemmatize)

# Remove empty texts after cleaning
df_main = df_main[df_main['processed_text'].str.len() > 0]
print(f"Dataset after cleaning: {df_main.shape}")

# Show examples of cleaned text
print("\nExamples of text cleaning:")
sample = df_main.sample(3)
for idx, row in sample.iterrows():
    print(f"Original: {row['comment_text'][:100]}...")
    print(f"Cleaned: {row['cleaned_text'][:100]}...")
    print(f"Processed: {row['processed_text'][:100]}...")
    print(f"Toxicity Score: {row['toxicity']:.4f}, Spam Score: {row['spam_score']}")
    print("-" * 50)


Applying text cleaning...


Dataset after cleaning: (1995662, 14)

Examples of text cleaning:
Original: GS charges some pretty steep fees,I agree.
Check out there web site...
Cleaned: gschargessomeprettysteepfees,iagree.checkouttherewebsite...
Processed: gschargessomeprettysteepfees iagree.checkouttherewebsite...
Toxicity Score: 0.1667, Spam Score: 0
--------------------------------------------------
Original: One argument some are making is that the government made a representation to the Dreamers on which t...
Cleaned: oneargumentsomearemakingisthatthegovernmentmadearepresentationtothedreamersonwhichtheyshouldbeableto...
Processed: oneargumentsomearemakingisthatthegovernmentmadearepresentationtothedreamersonwhichtheyshouldbeableto...
Toxicity Score: 0.4000, Spam Score: 0
--------------------------------------------------
Original: And if those ALLEGED communications were doctored or completely bogus, lots of bishops and priests w...
Cleaned: andifthoseallegedcommunicationsweredoctoredorcompletelybogus,lotsofbis

## 4. Feature Extraction


In [52]:
def extract_features(text):
    """
    Extract various text features.
    """
    if not text:
        return {
            'text_length': 0,
            'word_count': 0,
            'sentence_count': 0,
            'avg_word_length': 0,
            'capitalization_ratio': 0,
            'hashtag_count': 0,
            'mention_count': 0,
            'exclamation_count': 0,
            'question_count': 0,
            'digit_count': 0,
            'special_char_count': 0
        }
    
    features = {}
    
    # Basic length features
    features['text_length'] = len(text)
    features['word_count'] = len(text.split())
    features['sentence_count'] = len([s for s in text.split('.') if s.strip()])
    
    # Word length features
    words = text.split()
    if words:
        features['avg_word_length'] = sum(len(word) for word in words) / len(words)
    else:
        features['avg_word_length'] = 0
    
    # Capitalization features
    if len(text) > 0:
        features['capitalization_ratio'] = sum(1 for c in text if c.isupper()) / len(text)
    else:
        features['capitalization_ratio'] = 0
    
    # Social media features
    features['hashtag_count'] = text.count('#')
    features['mention_count'] = text.count('@')
    
    # Punctuation features
    features['exclamation_count'] = text.count('!')
    features['question_count'] = text.count('?')
    
    # Character type features
    features['digit_count'] = sum(1 for c in text if c.isdigit())
    features['special_char_count'] = sum(1 for c in text if c in string.punctuation)
    
    return features

print("Feature extraction function created!")


Feature extraction function created!


In [53]:
# Apply feature extraction
print("Extracting features...")
feature_list = df_main['processed_text'].apply(extract_features)

# Convert to DataFrame
features_df = pd.DataFrame(feature_list.tolist())

# Add features to main dataframe
for col in features_df.columns:
    df_main[col] = features_df[col]

print(f"Features extracted! New shape: {df_main.shape}")
print(f"\nFeature columns: {list(features_df.columns)}")

# Display feature statistics
print("\nFeature statistics:")
print(features_df.describe())


Extracting features...
Features extracted! New shape: (1995662, 25)

Feature columns: ['text_length', 'word_count', 'sentence_count', 'avg_word_length', 'capitalization_ratio', 'hashtag_count', 'mention_count', 'exclamation_count', 'question_count', 'digit_count', 'special_char_count']

Feature statistics:
        text_length    word_count  sentence_count  avg_word_length  \
count  1.995662e+06  1.995662e+06    1.995662e+06     1.995662e+06   
mean   2.392118e+02  3.657856e+00    3.382320e+00     7.609644e+01   
std    2.175511e+02  3.379190e+00    2.786944e+00     6.791196e+01   
min    2.000000e+00  1.000000e+00    0.000000e+00     2.000000e+00   
25%    7.500000e+01  1.000000e+00    1.000000e+00     3.563158e+01   
50%    1.620000e+02  3.000000e+00    2.000000e+00     5.837500e+01   
75%    3.340000e+02  5.000000e+00    4.000000e+00     9.325000e+01   
max    1.629000e+03  1.060000e+02    1.290000e+02     9.990000e+02   

       capitalization_ratio  hashtag_count  mention_count  ex

## 6. Data Export


In [54]:
# Create streamlined export with selected features
print("Preparing streamlined dataset for export...")

# Define export columns (removed original toxicity dimensions and unwanted composite features)
export_columns = [
    'id', 'comment_text', 'processed_text', 
    # Basic toxicity score
    'toxicity',
    # Composite toxicity features (only selected ones)
    'overall_toxicity', 
    # Spam detection
    'spam_score',
    # Text characteristics
    'text_length', 'word_count', 'sentence_count', 'avg_word_length',
    'capitalization_ratio', 'hashtag_count', 'mention_count',
    'exclamation_count', 'question_count', 'digit_count', 'special_char_count'
]

# Create the streamlined dataset
processed_df = df_main[export_columns].copy()

# Save processed data
processed_df.to_csv('processed_data.csv', index=False)
print(f"Streamlined processed data saved to 'processed_data.csv' with shape: {processed_df.shape}")

# Display summary
print("\n=== STREAMLINED DATASET SUMMARY ===")
print(f"Total samples: {len(processed_df)}")

print("\n=== TOXICITY SCORE DISTRIBUTIONS ===")
print(f"Basic toxicity - Min: {processed_df['toxicity'].min():.4f}, Max: {processed_df['toxicity'].max():.4f}, Mean: {processed_df['toxicity'].mean():.4f}")
print(f"Overall toxicity - Min: {processed_df['overall_toxicity'].min():.4f}, Max: {processed_df['overall_toxicity'].max():.4f}, Mean: {processed_df['overall_toxicity'].mean():.4f}")
print(f"Spam scores - Min: {processed_df['spam_score'].min()}, Max: {processed_df['spam_score'].max()}, Mean: {processed_df['spam_score'].mean():.2f}")

print("\n=== FEATURE STATISTICS ===")
feature_cols = [col for col in processed_df.columns if col not in ['id', 'comment_text', 'processed_text']]
print(processed_df[feature_cols].describe())


Preparing streamlined dataset for export...
Streamlined processed data saved to 'processed_data.csv' with shape: (1995662, 17)

=== STREAMLINED DATASET SUMMARY ===
Total samples: 1995662

=== TOXICITY SCORE DISTRIBUTIONS ===
Basic toxicity - Min: 0.0000, Max: 1.0000, Mean: 0.1031
Overall toxicity - Min: 0.0000, Max: 0.8319, Mean: 0.0484
Spam scores - Min: 0, Max: 105, Mean: 3.61

=== FEATURE STATISTICS ===
           toxicity  overall_toxicity    spam_score   text_length  \
count  1.995662e+06      1.995662e+06  1.995662e+06  1.991808e+06   
mean   1.031109e-01      4.842313e-02  3.611333e+00  2.392353e+02   
std    1.971777e-01      9.353025e-02  8.350651e+00  2.175559e+02   
min    0.000000e+00      0.000000e+00  0.000000e+00  2.000000e+00   
25%    0.000000e+00      0.000000e+00  0.000000e+00  7.500000e+01   
50%    0.000000e+00      0.000000e+00  0.000000e+00  1.620000e+02   
75%    1.666667e-01      7.166667e-02  0.000000e+00  3.340000e+02   
max    1.000000e+00      8.319268e-01 

## 7. Streamlined Features Summary

### **🎯 Streamlined Features Implemented:**

#### **Focused Toxicity Analysis**
- **Basic Toxicity**: Original toxicity score (0.0 to 1.0)
- **Overall Toxicity**: Weighted combination using original formula:
  - toxicity × 0.4 + severe_toxicity × 0.3 + obscene × 0.1 + sexual_explicit × 0.1 + identity_attack × 0.05 + insult × 0.03 + threat × 0.02

#### **Comprehensive Spam Detection**
- **Rule-Based Scoring**: 0-105+ numerical scores
- **Pattern Recognition**: URLs, keywords, text patterns
- **Contact Detection**: Phone numbers, emails
- **Financial Content**: Currency symbols, money-related terms

#### **Robust Text Processing**
- **Error Handling**: Graceful NLTK fallbacks
- **Comprehensive Cleaning**: URLs, emails, special characters
- **Advanced NLP**: Stopword removal, lemmatization
- **Feature Extraction**: 11 text characteristics

### **📈 Benefits for Content Moderation:**

1. **Focused Classification**: Essential features with original toxicity formula
2. **Specialized Detection**: Overall toxicity and sexual content scoring
3. **Clean Dataset**: Only necessary features for ML models
4. **Rich Feature Set**: Essential characteristics for ML models
5. **Simplified Approach**: Easy to understand and maintain


## 8. Final Summary

This notebook has successfully implemented streamlined content moderation preprocessing:

### **🎯 Key Achievements:**

1. **Multi-Dimensional Toxicity Analysis**
   - Loaded 7 original toxicity dimensions for comprehensive analysis
   - Created overall toxicity using weighted formula: toxicity×0.4 + severe_toxicity×0.3 + obscene×0.1 + sexual_explicit×0.1 + identity_attack×0.05 + insult×0.03 + threat×0.02

2. **Advanced Spam Detection**
   - Rule-based scoring system (0-105+ numerical scores)
   - Multiple pattern detection (URLs, keywords, text patterns)
   - Sophisticated spam classification without binary constraints

3. **Robust Text Processing**
   - Comprehensive cleaning with error handling
   - NLTK integration with graceful fallbacks
   - Rich feature extraction (11 text characteristics)

4. **Streamlined Data Export**
   - Essential toxicity features preserved
   - Composite scores for specialized moderation
   - Rich feature set for ML model training

### **📊 Final Dataset Contains:**

**Toxicity Features (2):**
- toxicity (basic score, 0.0 to 1.0)
- overall_toxicity (weighted composite score)

**Spam Detection (1):**
- spam_score (numerical, 0-105+)

**Text Characteristics (11):**
- text_length, word_count, sentence_count, avg_word_length
- capitalization_ratio, hashtag_count, mention_count
- exclamation_count, question_count, digit_count, special_char_count

### **🚀 Benefits for Content Moderation:**
- **Flexible Thresholds**: ML models can learn optimal boundaries
- **Specialized Detection**: Different scores for different harm types
- **Rich Features**: Comprehensive characteristics for better classification
- **No Bias**: No predefined categorical constraints
- **Streamlined Approach**: Essential features for efficient moderation
