# Data Pre-processing for Reddit Post Virality Prediction

This notebook covers the data preprocessing pipeline for predicting Reddit post virality. It includes:

1. **Data Loading and Exploration**: Loading and understanding the raw Reddit dataset from *reddit_raw_data*
2. **Feature Engineering**: Creating features from raw data (text features, engagement metrics, subreddit features)
3. **Virality Score Computation**: Computing a virality score from z-scored post and comment scores, tuning its weight α by grid search, and creating binary labels (top 30% = viral)

The processed data will be saved to `data/reddit_features.csv` for use in model construction.


In [2]:
# Import necessary modules
import pandas as pd
import numpy as np
import json
from scipy import stats
from pathlib import Path

## 1.1 Data Loading and Exploration

In [3]:
# Set dataset path to reddit_raw_data folder
dataset_path = Path("reddit_raw_data")
print("Files in dataset:")
for file in dataset_path.iterdir():
    print(f"  - {file.name} ({file.stat().st_size / (1024*1024):.2f} MB)")

Files in dataset:
  - reddit_data_counts.json (0.00 MB)
  - reddit_dataset.json (11.27 MB)


In [4]:
# Explore JSON structure from reddit_dataset.json
json_file = dataset_path / "reddit_dataset.json"

if json_file.exists():
    print(f"Loading: {json_file.name}")
    print(f"File size: {json_file.stat().st_size / (1024*1024):.2f} MB")
    
    # Load and explore structure
    with open(json_file, 'r', encoding='utf-8') as f:
        data = json.load(f)
        print("\nTop-level keys:", list(data.keys()))
        
        # Check if it has a "posts" key
        if 'posts' in data:
            print(f"\nFound 'posts' array with {len(data['posts'])} items")
            if len(data['posts']) > 0:
                sample_post = data['posts'][0]
                print("\nSample post structure:")
                print(json.dumps(sample_post, indent=2)[:1000])  # First 1000 chars
                print("\nKeys in post:", list(sample_post.keys()))
        else:
            print("\nFull data structure (first 1000 chars):")
            print(json.dumps(data, indent=2)[:1000])
else:
    print(f"File not found: {json_file}")
    print("Available files:")
    for file in dataset_path.iterdir():
        print(f"  - {file.name}")


Loading: reddit_dataset.json
File size: 11.27 MB

Top-level keys: ['posts']

Found 'posts' array with 6187 items

Sample post structure:
{
  "title": "Which country in the world suffers most from wage inequality and why?",
  "body": "Shall we discuss this topic in the comments? I'm curious to hear your opinions. I have written my own thoughts below.\r  \n\r  \nMany sources and studies highlight countries like Brazil, South Africa, India, and the United States as standing out in terms of income inequality. Inequality factors in these countries can include high income inequality, challenging working conditions faced by low-wage workers, racial or ethnic discrimination, gender inequality, and social class disparities.\r  \n\r  \nThe causes of income inequality in these countries can be complex and multifaceted. For example, high income inequality can sometimes reflect a wide economic gap between the rich and the poor. Challenging working conditions experienced by low-wage workers can aris

In [5]:
# Load JSON data from reddit_dataset.json
json_file = dataset_path / "reddit_dataset.json"

print(f"Loading {json_file.name}...")
with open(json_file, 'r', encoding='utf-8') as f:
    data = json.load(f)
    
    # Extract posts array
    if 'posts' in data:
        raw_data = data['posts']
        print(f"Loaded {len(raw_data)} posts from 'posts' array")
    elif isinstance(data, list):
        raw_data = data
        print(f"Loaded {len(raw_data)} items from JSON array")
    else:
        # If it's a single object, wrap it in a list
        raw_data = [data]
        print("Loaded 1 item from JSON object")

print(f"\nTotal records loaded: {len(raw_data)}")

Loading reddit_dataset.json...
Loaded 6187 posts from 'posts' array

Total records loaded: 6187


In [6]:
# Convert to DataFrame and explore
df = pd.DataFrame(raw_data)
print(f"DataFrame shape: {df.shape}")
print(f"\nColumns: {list(df.columns)}")
print(f"\nFirst few rows:")
df.head()

DataFrame shape: (6187, 6)

Columns: ['title', 'body', 'url', 'post_score', 'comment', 'comment_score']

First few rows:


| | title | body | url | post_score | comment | comment_score |
|---|---|---|---|---|---|---|
| 0 | Which country in the world suffers most from w... | Shall we discuss this topic in the comments? I... | https://www.reddit.com/r/business/comments/14e... | 3 | Close your eyes, and you can choose one of the... | 5 |
| 1 | Passion | Does your work drive you? Or is it something y... | https://www.reddit.com/r/business/comments/14e... | 1 | Wow, you and I are the same person. Haha, exce... | 1 |
| 2 | Biz Savings Interest Rates | I’m assuming the answer is obviously that the ... | https://www.reddit.com/r/business/comments/14e... | 2 | I think your assumption is correct, businesses... | 1 |
| 3 | How much is international ocean freight? | | https://www.reddit.com/r/business/comments/14e... | 1 | Way too vague to be answered.\nFrom where to w... | 1 |
| 4 | Hello everyone I want to start a low budget bu... | | https://www.reddit.com/r/business/comments/14e... | 1 | Thanks 🙏 | 2 |


In [7]:
# Check data types and basic stats
print("Data types:")
print(df.dtypes)
print("\nBasic statistics:")
df.describe()


Data types:
title            object
body             object
url              object
post_score        int64
comment          object
comment_score     int64
dtype: object

Basic statistics:


| | post_score | comment_score |
|---|---|---|
| count | 6187.0 | 6187.0 |
| mean | 517.082916 | 191.261516 |
| std | 1970.486668 | 1102.608781 |
| min | 0.0 | -1547.0 |
| 25% | 1.0 | 4.0 |
| 50% | 13.0 | 16.0 |
| 75% | 315.0 | 106.0 |
| max | 42256.0 | 26998.0 |


In [8]:
# Check for outliers
numeric_cols = ['post_score', 'comment_score']
for col in numeric_cols:
    if col in df.columns:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        
        outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
        print(f"\n{col}:")
        print(f"  Q1: {Q1:.2f}, Q3: {Q3:.2f}, IQR: {IQR:.2f}")
        print(f"  Outlier bounds: [{lower_bound:.2f}, {upper_bound:.2f}]")
        print(f"  Number of outliers: {len(outliers)} ({len(outliers)/len(df)*100:.2f}%)")
        print(f"  Min: {df[col].min()}, Max: {df[col].max()}")



post_score:
  Q1: 1.00, Q3: 315.00, IQR: 314.00
  Outlier bounds: [-470.00, 786.00]
  Number of outliers: 702 (11.35%)
  Min: 0, Max: 42256

comment_score:
  Q1: 4.00, Q3: 106.00, IQR: 102.00
  Outlier bounds: [-149.00, 259.00]
  Number of outliers: 692 (11.18%)
  Min: -1547, Max: 26998
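
The bounds printed above follow the usual 1.5 × IQR rule. As a quick sanity check, the arithmetic can be reproduced directly from the quartiles in the output. This is a minimal sketch; `iqr_fences` is a hypothetical helper name that does not appear in the notebook.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) outlier fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles copied from the printed output above
post_fences = iqr_fences(1.0, 315.0)
comment_fences = iqr_fences(4.0, 106.0)

print(post_fences)     # (-470.0, 786.0)
print(comment_fences)  # (-149.0, 259.0)
```

Since the minimum `post_score` is 0, the lower fence of -470 is never crossed, so all 702 `post_score` outliers lie above the upper fence. `comment_score` has a minimum of -1547, so it can have outliers on both sides.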


## 1.2 Feature Engineering

In [9]:
# Create features from raw data for Random Forest
def extract_features(df):
    features_df = df.copy()
    
    # 1. Extract subreddit from URL
    if 'url' in features_df.columns:
        features_df['subreddit'] = features_df['url'].astype(str).str.extract(r'/r/([^/]+)/')
    
    # 2. Text-based features (combine title and body)
    if 'title' in features_df.columns and 'body' in features_df.columns:
        features_df['combined_text'] = (
            features_df['title'].astype(str) + ' ' + features_df['body'].astype(str)
        )
        text_col = 'combined_text'
    elif 'title' in features_df.columns:
        text_col = 'title'
        features_df['combined_text'] = features_df['title'].astype(str)
    elif 'body' in features_df.columns:
        text_col = 'body'
        features_df['combined_text'] = features_df['body'].astype(str)
    else:
        text_col = None
    
    if text_col:
        # Character and word counts
        features_df['text_length'] = features_df['combined_text'].str.len()
        features_df['word_count'] = features_df['combined_text'].str.split().str.len()
        
        # Title-specific features
        if 'title' in features_df.columns:
            features_df['title_length'] = features_df['title'].astype(str).str.len()
            features_df['title_word_count'] = features_df['title'].astype(str).str.split().str.len()
        
        # Body-specific features
        if 'body' in features_df.columns:
            features_df['body_length'] = features_df['body'].astype(str).str.len()
            features_df['body_word_count'] = features_df['body'].astype(str).str.split().str.len()
        
        # Text patterns (use regex=False to treat ? and ! as literal characters)
        features_df['has_question_mark'] = features_df['combined_text'].str.contains('?', regex=False, na=False).astype(int)
        features_df['has_exclamation'] = features_df['combined_text'].str.contains('!', regex=False, na=False).astype(int)
        features_df['uppercase_ratio'] = features_df['combined_text'].apply(
            lambda x: sum(1 for c in str(x) if c.isupper()) / len(str(x)) if len(str(x)) > 0 else 0
        )
    
    # 3. Engagement metrics
    if 'post_score' in features_df.columns:
        features_df['score'] = pd.to_numeric(features_df['post_score'], errors='coerce').fillna(0)
    elif 'score' in features_df.columns:
        features_df['score'] = pd.to_numeric(features_df['score'], errors='coerce').fillna(0)
    
    if 'comment_score' in features_df.columns:
        features_df['comment_score'] = pd.to_numeric(features_df['comment_score'], errors='coerce').fillna(0)
        features_df['comment_to_score_ratio'] = features_df['comment_score'] / (features_df['score'] + 1)
        features_df['total_engagement'] = features_df['score'] + features_df['comment_score']
        
        # Standardize scores using z-scores for comparable scales
        score_z_array = stats.zscore(features_df['score'], nan_policy='omit')
        comment_score_z_array = stats.zscore(features_df['comment_score'], nan_policy='omit')
        
        features_df['score_z'] = pd.Series(score_z_array, index=features_df.index).fillna(0)
        features_df['comment_score_z'] = pd.Series(comment_score_z_array, index=features_df.index).fillna(0)
        
        # virality_score is computed later with a tunable α
        # Formula: v = z_post + α × z_comment (using z-scores)
    
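    if 'comment' in features_df.columns:
        # Treat missing comments as empty strings so NaN does not become the 3-character text 'nan'
        comment_text = features_df['comment'].fillna('').astype(str)
        features_df['has_comment'] = (comment_text.str.len() > 0).astype(int)
        features_df['comment_length'] = comment_text.str.len()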
    
    # 4. Subreddit features
    if 'subreddit' in features_df.columns:
        # Load subreddit frequency from reddit_data_counts.json
        counts_file = dataset_path / "reddit_data_counts.json"
        if counts_file.exists():
            with open(counts_file, 'r', encoding='utf-8') as f:
                subreddit_freq_dict = json.load(f)
            # Map subreddit frequencies from the JSON file
            features_df['subreddit_frequency'] = features_df['subreddit'].map(subreddit_freq_dict).fillna(0)
            print(f"Loaded subreddit frequencies from {counts_file.name}")
        else:
            # Fallback: calculate from dataset if JSON file not found
            subreddit_counts = features_df['subreddit'].value_counts()
            features_df['subreddit_frequency'] = features_df['subreddit'].map(subreddit_counts)
            print("Warning: reddit_data_counts.json not found. Using calculated frequencies.")
    
    return features_df

# Apply feature engineering
df_features = extract_features(df)
print(f"Original columns: {len(df.columns)}")
print(f"Features after engineering: {len(df_features.columns)}")
print(f"\nNew features created:")
new_cols = [col for col in df_features.columns if col not in df.columns]
print(new_cols)


Loaded subreddit frequencies from reddit_data_counts.json
Original columns: 6
Features after engineering: 25

New features created:
['subreddit', 'combined_text', 'text_length', 'word_count', 'title_length', 'title_word_count', 'body_length', 'body_word_count', 'has_question_mark', 'has_exclamation', 'uppercase_ratio', 'score', 'comment_to_score_ratio', 'total_engagement', 'score_z', 'comment_score_z', 'has_comment', 'comment_length', 'subreddit_frequency']
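
To illustrate how the `subreddit` column is derived, here is a small sketch of the same regular expression applied to two URLs. Both URLs are hypothetical and not taken from the dataset, and `expand=False` is used here only so that the result is a Series.

```python
import pandas as pd

# Hypothetical URLs for illustration only (not taken from the dataset)
urls = pd.Series([
    "https://www.reddit.com/r/business/comments/abc123/example_post/",
    "https://example.com/some/article",
])

# Same pattern as in extract_features; expand=False returns a Series
# instead of a one-column DataFrame
subreddit = urls.str.extract(r"/r/([^/]+)/", expand=False)
print(subreddit.tolist())
```

A URL without a `/r/<name>/` segment yields `NaN`, which is one way missing `subreddit` values could arise.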


In [10]:
# Handle missing values
missing_before = df_features.isnull().sum()
print("Missing values before handling:")
print(missing_before[missing_before > 0])

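# Fill missing values with 0 (only 'subreddit' is missing here; it is not used as a model
# feature, and subreddit_frequency is already 0 for those rows)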
df_features = df_features.fillna(0)

missing_after = df_features.isnull().sum()
print("\nMissing values after handling:")
print(missing_after[missing_after > 0])

Missing values before handling:
subreddit    1258
dtype: int64

Missing values after handling:
Series([], dtype: int64)
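
Filling every column with 0 leaves the text column `subreddit` holding a mix of strings and the integer 0. A possible alternative, sketched here on a made-up frame, is to fill each column with a value of its own type. The placeholder `"unknown"` is an assumption and is not used anywhere in the notebook.

```python
import pandas as pd

# Toy frame standing in for df_features (values are made up)
toy = pd.DataFrame({
    "subreddit": ["business", None, "science"],
    "subreddit_frequency": [120.0, None, 45.0],
})

# Fill each column with a value that matches its type
filled = toy.fillna({"subreddit": "unknown", "subreddit_frequency": 0})
print(filled)
```

This does not affect the saved dataset, because `subreddit` is not among the final model features.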


## 1.3 Virality Score Computation and Label Creation

In [11]:
# Compute virality score with tunable α (alpha) hyperparameter
def compute_virality_score(df, alpha=0.5):
    if 'score_z' in df.columns and 'comment_score_z' in df.columns:
        return df['score_z'] + alpha * df['comment_score_z']
    else:
        raise ValueError("Missing 'score_z' or 'comment_score_z' columns. Run feature engineering first.")

# Start with default α = 0.5 (will be tuned via grid search)
default_alpha = 0.5
df_features['virality_score'] = compute_virality_score(df_features, alpha=default_alpha)

print("Virality score formula: v = z_post + α × z_comment")
print(f"Using default α = {default_alpha} (will be tuned via grid search)")
print(f"\nVirality score statistics:")
print(df_features['virality_score'].describe())


Virality score formula: v = z_post + α × z_comment
Using default α = 0.5 (will be tuned via grid search)

Virality score statistics:
count    6187.000000
mean        0.000000
std         1.388064
min        -0.349173
25%        -0.346182
50%        -0.333980
75%        -0.120144
max        23.583316
Name: virality_score, dtype: float64
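
The formula v = z_post + α × z_comment can be traced on a handful of values. The sketch below uses made-up scores for four posts. It also checks that `scipy.stats.zscore` with its default `ddof=0` matches a manual standardization using the population standard deviation.

```python
import numpy as np
from scipy import stats

# Made-up scores for four posts (illustration only)
post_score = np.array([0.0, 10.0, 20.0, 30.0])
comment_score = np.array([5.0, 5.0, 5.0, 25.0])

# stats.zscore uses the population standard deviation (ddof=0) by default
z_post = stats.zscore(post_score)
z_comment = stats.zscore(comment_score)
assert np.allclose(z_post, (post_score - post_score.mean()) / post_score.std())

alpha = 0.5
virality = z_post + alpha * z_comment
print(virality.round(4))
```

Each z-score has mean zero, so any weighted sum of them also has mean zero. This matches the mean of 0.000000 in the statistics above.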


In [12]:
# Grid Search for Optimal α Hyperparameter
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score, precision_score, recall_score

# Note: exclude engagement metrics (score, comment_score, total_engagement, comment_to_score_ratio)
# because these are used to compute virality_score, which would cause data leakage
base_features = [
    'text_length', 'word_count', 'has_question_mark', 
    'has_exclamation', 'uppercase_ratio',
    'title_length', 'title_word_count', 'body_length', 'body_word_count',
    'has_comment', 'comment_length',  # Comment presence/length OK, but NOT comment_score
    'subreddit_frequency'
]
X_base = df_features[[col for col in base_features if col in df_features.columns]].fillna(0)

# Grid of α values to try
alpha_values = [0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1.0, 1.5]
results = []

print(f"\nTesting α values: {alpha_values}")
print("For each Œ±, computing virality score and evaluating model performance\n")

for alpha in alpha_values:
    # Compute virality score with this α
    virality_scores = compute_virality_score(df_features, alpha=alpha)
    
    # Create virality label (top 30%)
    threshold = virality_scores.quantile(0.70)
    y_temp = (virality_scores >= threshold).astype(int)
    
    # Split data
    X_train_temp, X_test_temp, y_train_temp, y_test_temp = train_test_split(
        X_base, y_temp, test_size=0.2, random_state=42, stratify=y_temp
    )
    
    # Train model
    rf_temp = RandomForestClassifier(
        n_estimators=50,
        max_depth=10,
        min_samples_split=5,
        random_state=42,
        class_weight='balanced',
        n_jobs=-1
    )
    rf_temp.fit(X_train_temp, y_train_temp)
    
    # Evaluate
    y_pred_temp = rf_temp.predict(X_test_temp)
    y_proba_temp = rf_temp.predict_proba(X_test_temp)[:, 1]
    
    f1 = f1_score(y_test_temp, y_pred_temp)
    auc = roc_auc_score(y_test_temp, y_proba_temp)
    precision = precision_score(y_test_temp, y_pred_temp)
    recall = recall_score(y_test_temp, y_pred_temp)
    
    results.append({
        'alpha': alpha,
        'f1_score': f1,
        'roc_auc': auc,
        'precision': precision,
        'recall': recall,
        'threshold': threshold
    })
    
    print(f"α = {alpha:4.2f}: F1 = {f1:.4f}, Precision = {precision:.4f}, Recall = {recall:.4f}, ROC-AUC = {auc:.4f}")

# Find best α
# Use ROC-AUC as primary metric (more robust to class imbalance than F1)
results_df = pd.DataFrame(results)

# Row of the grid with the highest ROC-AUC
best_row = results_df.loc[results_df['roc_auc'].idxmax()]
best_alpha = best_row['alpha']
best_auc = best_row['roc_auc']
best_f1 = best_row['f1_score']
best_precision = best_row['precision']
best_recall = best_row['recall']

print(f"\nBest α = {best_alpha:.2f}")
print(f"  ROC-AUC:   {best_auc:.4f}")
print(f"  F1-Score:  {best_f1:.4f}")

# Recompute virality score with best α
df_features['virality_score'] = compute_virality_score(df_features, alpha=best_alpha)



Testing α values: [0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1.0, 1.5]
For each Œ±, computing virality score and evaluating model performance

α = 0.20: F1 = 0.6291, Precision = 0.6122, Recall = 0.6469, ROC-AUC = 0.8283
α = 0.25: F1 = 0.6273, Precision = 0.6240, Recall = 0.6307, ROC-AUC = 0.8305
α = 0.30: F1 = 0.6252, Precision = 0.6146, Recall = 0.6361, ROC-AUC = 0.8296
α = 0.35: F1 = 0.6146, Precision = 0.6042, Recall = 0.6253, ROC-AUC = 0.8268
α = 0.40: F1 = 0.6245, Precision = 0.6108, Recall = 0.6388, ROC-AUC = 0.8285
α = 0.45: F1 = 0.6344, Precision = 0.6014, Recall = 0.6712, ROC-AUC = 0.8162
α = 0.50: F1 = 0.6192, Precision = 0.5960, Recall = 0.6442, ROC-AUC = 0.8068
α = 0.55: F1 = 0.6187, Precision = 0.6230, Recall = 0.6146, ROC-AUC = 0.8118
α = 0.60: F1 = 0.6269, Precision = 0.5923, Recall = 0.6658, ROC-AUC = 0.8158
α = 0.65: F1 = 0.6121, Precision = 0.6096, Recall = 0.6146, ROC-AUC = 0.8185
α = 0.70: F1 = 0.6121, Preci

In [13]:
# Create virality label using optimal α
if 'virality_score' in df_features.columns:
    threshold = df_features['virality_score'].quantile(0.70)
    df_features['is_viral'] = (df_features['virality_score'] >= threshold).astype(int)
    
    print(f"Virality threshold (top 30%): {threshold:.4f}")
    print(f"\nClass distribution:")
    print(df_features['is_viral'].value_counts())
else:
    print("Warning: 'virality_score' column not found. Cannot create virality labels.")


Virality threshold (top 30%): -0.1655

Class distribution:
is_viral
0    4331
1    1856
Name: count, dtype: int64
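
The labeling step marks a post as viral when its score is at or above the 70th percentile. A minimal sketch on ten made-up scores shows how the quantile threshold translates into labels.

```python
import pandas as pd

# Ten made-up virality scores (illustration only)
scores = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)

# 70th percentile with pandas' default linear interpolation (falls between 6 and 7)
threshold = scores.quantile(0.70)
is_viral = (scores >= threshold).astype(int)

print(threshold)
print(is_viral.sum())  # 3
```

In the notebook output above, the same rule labels 1856 of 6187 posts as viral, which is about 30%.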


In [14]:
# Select final features for model (remove raw text columns)
# The model should predict virality from content/subreddit features, not from engagement metrics
feature_columns = [
    # Text features
    'text_length', 'word_count', 'has_question_mark', 
    'has_exclamation', 'uppercase_ratio',
    'title_length', 'title_word_count', 'body_length', 'body_word_count',
    # Comment features (presence/length OK, but NOT comment_score - that's used to define virality)
    'has_comment', 'comment_length',
    # Subreddit features
    'subreddit_frequency',
    # Target variable
    'is_viral'
]

# Filter to only columns that exist
available_features = [col for col in feature_columns if col in df_features.columns]

df_model = df_features[available_features].copy()

# Final check for missing values
df_model = df_model.fillna(0)

print(f"Final dataset shape: {df_model.shape}")
print(f"Target: is_viral")
print(f"\nFeature columns: {[col for col in df_model.columns if col != 'is_viral']}")
print(f"\nTarget distribution:")
print(df_model['is_viral'].value_counts())


Final dataset shape: (6187, 13)
Target: is_viral

Feature columns: ['text_length', 'word_count', 'has_question_mark', 'has_exclamation', 'uppercase_ratio', 'title_length', 'title_word_count', 'body_length', 'body_word_count', 'has_comment', 'comment_length', 'subreddit_frequency']

Target distribution:
is_viral
0    4331
1    1856
Name: count, dtype: int64


In [15]:
# Save processed data
output_dir = Path("data")
output_dir.mkdir(exist_ok=True)

# Save processed dataset
csv_path = output_dir / "reddit_features.csv"
df_model.to_csv(csv_path, index=False)
print(f"Saved processed data to: {csv_path}")
print(f"Shape: {df_model.shape}")
print(f"Columns: {list(df_model.columns)}")

Saved processed data to: data/reddit_features.csv
Shape: (6187, 13)
Columns: ['text_length', 'word_count', 'has_question_mark', 'has_exclamation', 'uppercase_ratio', 'title_length', 'title_word_count', 'body_length', 'body_word_count', 'has_comment', 'comment_length', 'subreddit_frequency', 'is_viral']
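
For the downstream modeling step, the saved file can be read back and split into features and target. The sketch below round-trips a tiny made-up frame through a temporary directory so that it runs on its own. In practice the path would be `data/reddit_features.csv`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Tiny made-up frame with two of the saved columns (illustration only)
toy = pd.DataFrame({"text_length": [120, 45, 300], "is_viral": [0, 0, 1]})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "reddit_features.csv"
    toy.to_csv(csv_path, index=False)  # same call as in the notebook
    loaded = pd.read_csv(csv_path)

# Separate the target from the feature matrix
X = loaded.drop(columns=["is_viral"])
y = loaded["is_viral"]
print(X.shape, y.shape)  # (3, 1) (3,)
```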
