# 🎯 AdTech Campaign Performance Data Generator

This notebook generates synthetic campaign performance data for AdTech analytics and testing purposes.

## 📊 **Features**
- **Realistic Data**: Generates 50,000 records with realistic marketing metrics
- **Performance Tiers**: High, Medium, and Low performing campaigns with appropriate metrics
- **Multiple Dimensions**: Publishers, regions, devices, ad types, and platform ratios
- **Time Series**: 26 weeks of historical data
- **Delta Table**: Saves to Delta format with Change Data Feed enabled
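
The metrics in each record are derived from a handful of sampled rates. A minimal sketch of that arithmetic, using fixed inputs instead of random draws so the result is reproducible (the function name `derive_metrics` is illustrative and does not appear in the notebook):

```python
def derive_metrics(impressions, ctr, cr, cpc, roas):
    """Derive funnel metrics from fixed inputs, mirroring the generator's arithmetic."""
    clicks = max(1, int(impressions * ctr))  # at least one click per row
    conversions = int(clicks * cr)           # truncated to a whole number
    cost = round(clicks * cpc, 2)            # spend = clicks x cost per click
    revenue = round(cost * roas, 2)          # revenue = spend x return on ad spend
    return {
        "clicks": clicks,
        "conversions": conversions,
        "cost": cost,
        "revenue": revenue,
    }


example = derive_metrics(impressions=200_000, ctr=0.05, cr=0.1, cpc=1.5, roas=2.0)
print(example)
```

In the notebook itself, the inputs are drawn at random from the range defined for the campaign's performance tier.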

## ⚙️ **Configuration**
Use the widgets below to configure:
- **CATALOG**: Target catalog name
- **SCHEMA**: Target schema name  
- **TABLE**: Target table name (default: campaign_performance)
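
Because the widget values are interpolated directly into SQL statements, it can help to validate them first. The sketch below is one possible approach, not part of the notebook: the helper name `qualified_table_name` and the conservative identifier pattern (letters, digits, underscores) are assumptions.

```python
import re

# Assumed rule for this sketch: plain identifiers only (letters, digits, underscores)
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def qualified_table_name(catalog, schema, table):
    """Return a backtick-quoted three-part name, rejecting unexpected characters."""
    parts = [catalog, schema, table]
    for part in parts:
        if not _IDENTIFIER.fullmatch(part):
            raise ValueError(f"Unsupported identifier: {part!r}")
    return ".".join(f"`{part}`" for part in parts)


print(qualified_table_name("your_catalog", "your_schema", "campaign_performance"))
```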

---

In [0]:
# =============================================================================
# WIDGET CONFIGURATION
# =============================================================================
# Create input widgets for catalog, schema, and table configuration

dbutils.widgets.text("CATALOG", "your_catalog", "Enter Catalog Name")
dbutils.widgets.text("SCHEMA", "your_schema", "Enter Schema Name")
dbutils.widgets.text("TABLE", "campaign_performance", "Enter Table Name")

print("✅ Widgets configured successfully")

"""
Weekly Campaign Performance Data Generator
=========================================

This script generates synthetic campaign performance data for the last 26 weeks.
The data includes various marketing metrics with realistic distributions based on
performance tiers (High, Medium, Low).

"""

In [0]:
# =============================================================================
# READ WIDGET VALUES
# =============================================================================

# Get the selected values
CATALOG = dbutils.widgets.get("CATALOG")
SCHEMA = dbutils.widgets.get("SCHEMA")
TABLE = dbutils.widgets.get("TABLE")

print(f"🎯 Using Catalog: {CATALOG}")
print(f"📁 Using Schema: {SCHEMA}")
print(f"📁 Using Table: {TABLE}")

🎯 Using Catalog: your_catalog
📁 Using Schema: your_schema
📁 Using Table: campaign_performance


In [0]:
# Set up catalog and schema with IF NOT EXISTS (no error handling needed)
spark.sql(f"CREATE CATALOG IF NOT EXISTS {CATALOG}")
spark.sql(f"USE CATALOG {CATALOG}")
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {CATALOG}.{SCHEMA}")
spark.sql(f"USE {CATALOG}.{SCHEMA}")

DataFrame[]

In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType, FloatType, MapType
)
import random
from datetime import datetime, timedelta

# Initialize Spark session
spark = SparkSession.builder.appName("WeeklyCampaignPerformance").getOrCreate()

# =============================================================================
# CONFIGURATION PARAMETERS
# =============================================================================

# Data volume settings
NUM_ROWS = 50000          # Total number of records to generate
NUM_CAMPAIGNS = 1000      # Number of unique campaigns

# Categorical data options
PUBLISHERS = ['Google', 'Meta', 'LinkedIn', 'Twitter', 'TikTok']
REGIONS = ['NA', 'EMEA', 'APAC', 'LATAM']
DEVICES = ['Mobile', 'Desktop', 'Tablet']
AD_TYPES = ['Banner', 'Video', 'Native', 'Interstitial']
PLATFORM_KEYS = ['Website', 'App', 'Mobile Web']

# =============================================================================
# DATE GENERATION
# =============================================================================

# Generate the last 26 weeks (about 6 months) of data
end_date = datetime.today()

# Snap to the Monday of the current week (Monday = 0)
last_monday = end_date - timedelta(days=end_date.weekday())

# Create list of exactly 26 weekly start dates (Mondays), oldest first
weekly_dates = [
    (last_monday - timedelta(weeks=offset)).strftime("%Y-%m-%d")
    for offset in range(25, -1, -1)
]

# =============================================================================
# CAMPAIGN PERFORMANCE TIER ASSIGNMENT
# =============================================================================

# Create campaign IDs
campaign_ids = [f"cmp_{i}" for i in range(1, NUM_CAMPAIGNS + 1)]

# Assign performance tiers to campaigns with weighted distribution
# 20% High performers, 50% Medium, 30% Low performers
campaign_performance = {
    cid: random.choices(["High", "Medium", "Low"], weights=[0.2, 0.5, 0.3])[0]
    for cid in campaign_ids
}

# Define metric ranges for each performance tier
tier_metrics = {
    "High": {
        "impr": (100000, 500000),    # Impressions range
        "ctr": (0.05, 0.15),         # Click-through rate range
        "cr": (0.05, 0.20),          # Conversion rate range
        "cpc": (0.5, 1.5)            # Cost per click range
    },
    "Medium": {
        "impr": (50000, 300000),
        "ctr": (0.01, 0.05),
        "cr": (0.01, 0.07),
        "cpc": (1.0, 3.0)
    },
    "Low": {
        "impr": (10000, 150000),
        "ctr": (0.005, 0.02),
        "cr": (0.002, 0.03),
        "cpc": (2.0, 6.0)
    },
}

# =============================================================================
# HELPER FUNCTIONS
# =============================================================================

def generate_platform_ratios():
    """
    Generate realistic platform distribution ratios that sum to 1.0
    
    Returns:
        dict: Platform ratios for Website, App, and Mobile Web
    """
    # Round Website and App first, then derive Mobile Web from the rounded
    # values so the three stored ratios sum to 1.0 (within float precision)
    w = round(random.uniform(0.2, 0.6), 2)      # Website: 20-60%
    a = round(random.uniform(0.1, 1 - w), 2)    # App: 10% up to the remainder
    m = round(max(0.0, 1.0 - w - a), 2)         # Mobile Web: remainder

    return {
        "Website": w,
        "App": a,
        "Mobile Web": m
    }

# =============================================================================
# DATA GENERATION
# =============================================================================

# Generate synthetic campaign performance data
data = []
while len(data) < NUM_ROWS:
    # Select random campaign and get its performance tier
    campaign_id = random.choice(campaign_ids)
    tier = campaign_performance[campaign_id]
    metrics = tier_metrics[tier]

    # Select random week
    week = random.choice(weekly_dates)
    
    # Generate base metrics
    impressions = random.randint(*metrics["impr"])
    ctr = round(random.uniform(*metrics["ctr"]), 4)
    clicks = max(1, int(impressions * ctr))  # Ensure at least 1 click
    cr = round(random.uniform(*metrics["cr"]), 4)
    conversions = int(clicks * cr)
    cpc = round(random.uniform(*metrics["cpc"]), 2)
    cost = round(clicks * cpc, 2)
    
    # Generate derived metrics
    budget = round(cost * random.uniform(1.0, 1.3), 2)  # Budget >= cost
    
    # ROAS varies by performance tier
    roas = round(
        random.uniform(2.0, 10.0) if tier == "High" else
        random.uniform(1.0, 5.0) if tier == "Medium" else
        random.uniform(0.2, 2.0), 2
    )
    revenue = round(cost * roas, 2)

    # Generate target metrics (90% of the tier's upper bound, so usually above actual performance)
    target_ctr = round(metrics["ctr"][1] * 0.9, 4)
    target_cr = round(metrics["cr"][1] * 0.9, 4)
    
    # Generate platform distribution
    platform_ratios = generate_platform_ratios()

    # Create data row
    data.append((
        campaign_id, week, tier, 
        random.choice(PUBLISHERS), random.choice(REGIONS),
        random.choice(DEVICES), random.choice(AD_TYPES),
        impressions, clicks, ctr, conversions, cr,
        cost, cpc, budget, revenue, roas, 
        target_ctr, target_cr, platform_ratios
    ))

# =============================================================================
# SCHEMA DEFINITION
# =============================================================================

# Define the DataFrame schema
schema = StructType([
    StructField("campaign_id", StringType(), True),           # Unique campaign identifier
    StructField("week_start", StringType(), True),           # Week start date (YYYY-MM-DD)
    StructField("performance_tier", StringType(), True),     # High/Medium/Low
    StructField("publisher", StringType(), True),            # Ad platform
    StructField("region", StringType(), True),               # Geographic region
    StructField("device_type", StringType(), True),          # Device category
    StructField("ad_type", StringType(), True),              # Ad format
    StructField("impressions", IntegerType(), True),         # Number of impressions
    StructField("clicks", IntegerType(), True),              # Number of clicks
    StructField("CTR", FloatType(), True),                   # Click-through rate
    StructField("conversions", IntegerType(), True),         # Number of conversions
    StructField("conversion_rate", FloatType(), True),       # Conversion rate
    StructField("cost", FloatType(), True),                  # Total cost
    StructField("CPC", FloatType(), True),                   # Cost per click
    StructField("budget", FloatType(), True),                # Campaign budget
    StructField("revenue", FloatType(), True),               # Generated revenue
    StructField("ROAS", FloatType(), True),                  # Return on ad spend
    StructField("target_CTR", FloatType(), True),            # Target CTR
    StructField("target_CR", FloatType(), True),             # Target conversion rate
    StructField("platform_ratios", MapType(StringType(), FloatType()), True),  # Platform distribution
])

# =============================================================================
# DATAFRAME CREATION AND PREVIEW
# =============================================================================

# Create DataFrame from generated data
df = spark.createDataFrame(data, schema)

# Display first 5 rows to verify data
print("Generated Campaign Performance Data Preview:")
print("=" * 50)
df.show(5, truncate=False)

# Display summary statistics
print("\nData Summary:")
print("=" * 20)
print(f"Total records: {df.count()}")
print(f"Unique campaigns: {df.select('campaign_id').distinct().count()}")
print(f"Date range: {df.agg({'week_start': 'min'}).collect()[0][0]} to {df.agg({'week_start': 'max'}).collect()[0][0]}")

print("\nData generation completed successfully!")

Generated Campaign Performance Data Preview:
+-----------+----------+----------------+---------+------+-----------+------------+-----------+------+------+-----------+---------------+--------+----+--------+--------+----+----------+---------+--------------------------------------------------+
|campaign_id|week_start|performance_tier|publisher|region|device_type|ad_type     |impressions|clicks|CTR   |conversions|conversion_rate|cost    |CPC |budget  |revenue |ROAS|target_CTR|target_CR|platform_ratios                                   |
+-----------+----------+----------------+---------+------+-----------+------------+-----------+------+------+-----------+---------------+--------+----+--------+--------+----+----------+---------+--------------------------------------------------+
|cmp_595    |2025-07-28|Medium          |Twitter  |NA    |Mobile     |Banner      |288039     |4378  |0.0152|166        |0.0381         |6172.98 |1.41|7550.85 |23210.4 |3.76|0.045     |0.063    |{Website -> 0.37, A
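
The week-start list can be checked in isolation from Spark. A minimal sketch, assuming the intent is exactly 26 Monday-aligned dates ending with the current week (the function name `monday_week_starts` and the fixed reference date are illustrative):

```python
from datetime import date, timedelta


def monday_week_starts(end, weeks=26):
    """Return `weeks` Monday dates (ISO strings), oldest first, ending at end's week."""
    # weekday() is 0 for Monday, so this steps back to the Monday of end's week
    last_monday = end - timedelta(days=end.weekday())
    return [
        (last_monday - timedelta(weeks=offset)).isoformat()
        for offset in range(weeks - 1, -1, -1)
    ]


starts = monday_week_starts(date(2025, 8, 6))  # a Wednesday
print(len(starts), starts[0], starts[-1])
```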

In [0]:
# # Write df to the table
# df.write.mode("overwrite").saveAsTable("campaign_performance")

In [0]:
from pyspark.sql.functions import monotonically_increasing_id

# Add a primary key column
df_with_pk = df.withColumn("primary_key", monotonically_increasing_id())

# Write the updated DataFrame to the table
df_with_pk.write.option("mergeSchema", "true").mode("overwrite").saveAsTable(f"{CATALOG}.{SCHEMA}.{TABLE}")

In [0]:
# Add a unique primary key column to the DataFrame
# monotonically_increasing_id() generates unique, monotonically increasing 64-bit integers (not guaranteed to be consecutive)
# This ensures each row has a unique identifier for database operations
from pyspark.sql.functions import monotonically_increasing_id
df_with_pk = df.withColumn("primary_key", monotonically_increasing_id())

# Write the DataFrame to a Delta table with the following options:
# - mergeSchema: "true" - Allows schema evolution (adds new columns if schema changes)
# - mode: "overwrite" - Replaces the entire table content (use "append" to add data)
# - saveAsTable: Creates/overwrites the table named by the CATALOG, SCHEMA, and TABLE widgets
df_with_pk.write.option("mergeSchema", "true").mode("overwrite").saveAsTable(f"{CATALOG}.{SCHEMA}.{TABLE}")

print(f"✅ Campaign performance data successfully saved to '{TABLE}' table")
print(f"📊 Total records written: {df_with_pk.count()}")

✅ Campaign performance data successfully saved to 'campaign_performance' table
📊 Total records written: 50000
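
The `primary_key` values are unique but not consecutive. Spark's documentation describes the current implementation of `monotonically_increasing_id()` as placing the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. A pure-Python illustration of that layout (the helper name is hypothetical, and this does not call Spark):

```python
def illustrative_row_id(partition_id, row_in_partition):
    """Mimic the documented bit layout: partition ID shifted above a 33-bit row counter."""
    return (partition_id << 33) + row_in_partition


# IDs are consecutive within a partition but jump at each partition boundary
ids = [illustrative_row_id(p, r) for p in range(2) for r in range(3)]
print(ids)
```

This is why the key is suitable as a unique row identifier but should not be read as a row count or a dense sequence.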


In [0]:
# Enable Change Data Feed (CDF) for the campaign_performance table
# This allows tracking of row-level changes (INSERT, UPDATE, DELETE) over time
# Useful for audit trails, incremental processing, and data lineage
spark.sql(f"ALTER TABLE `{CATALOG}`.`{SCHEMA}`.`{TABLE}` SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")

DataFrame[]
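
Once Change Data Feed is enabled, row-level changes can be read back with Delta's `readChangeFeed` option. A hedged sketch follows: the wrapper name `read_table_changes` is illustrative, and the caller supplies the starting version. Changes are only recorded from the table version at which the feed was enabled, so the starting version should be at or after that point.

```python
def read_table_changes(spark, table_name, starting_version):
    """Return a DataFrame of row-level changes from starting_version onward."""
    return (
        spark.read.format("delta")
        .option("readChangeFeed", "true")             # request the change feed
        .option("startingVersion", starting_version)  # first table version to include
        .table(table_name)
    )

# In the notebook, usage might look like this (the version number is an assumption):
# changes = read_table_changes(spark, f"{CATALOG}.{SCHEMA}.{TABLE}", starting_version=2)
# changes.select("_change_type", "_commit_version").show(5)
```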

## 🎯 **Data Generation Complete**

### ✅ **What Was Created**
- **50,000 synthetic records** with realistic AdTech metrics
- **1,000 unique campaigns** across 3 performance tiers
- **26 weeks of historical data** with weekly granularity
- **Delta table** with primary key and Change Data Feed enabled

### 📊 **Table Structure**
- **21 columns** including primary key
- **Performance metrics**: Impressions, clicks, CTR, conversions, cost, revenue, ROAS
- **Categorical data**: Publishers, regions, devices, ad types
- **Platform ratios**: Website, App, Mobile Web distribution
- **Target metrics**: Target CTR and conversion rates

### 📈 **Sample Queries**
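
A few starting points, sketched as SQL strings built in Python. The helper name `sample_queries` and the specific aggregations are illustrative choices; the column names come from the schema defined above.

```python
def sample_queries(table_name):
    """Return a dict of example SQL queries against the generated table."""
    return {
        "revenue_by_publisher": f"""
            SELECT publisher,
                   SUM(cost)    AS total_cost,
                   SUM(revenue) AS total_revenue
            FROM {table_name}
            GROUP BY publisher
            ORDER BY total_revenue DESC
        """,
        "avg_roas_by_tier": f"""
            SELECT performance_tier,
                   ROUND(AVG(ROAS), 2) AS avg_roas
            FROM {table_name}
            GROUP BY performance_tier
        """,
        "weekly_clicks": f"""
            SELECT week_start,
                   SUM(impressions) AS impressions,
                   SUM(clicks)      AS clicks
            FROM {table_name}
            GROUP BY week_start
            ORDER BY week_start
        """,
    }


queries = sample_queries("your_catalog.your_schema.campaign_performance")
for name, sql in queries.items():
    print(f"-- {name}{sql}")
    # In the notebook: spark.sql(sql).show()
```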


**🎊 Congratulations! Your AdTech campaign data is ready for analysis.**