When generating synthetic data for **Market Mix Modeling (MMM)**, especially for building and testing something like your **AutoMMM system**, you need to simulate data that is both *statistically realistic* and *functionally useful* for model development and validation.

## ✅ 1. **Reflect Real-World Relationships**

### Why: MMM models are sensitive to data patterns. Unrealistic or poorly constructed relationships lead to invalid conclusions.

* Ensure **positive correlation** between media spend and sales.
* Include **diminishing returns** (non-linear) and **carryover effects** (adstock).
* Inject **seasonality, holidays, and external factors** as they significantly influence sales.
* Introduce **multicollinearity** only if you're planning to test model robustness or regularization strategies.
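
As a minimal sketch of the seasonality and holiday bullets above, the snippet below builds an annual sine cycle and a holiday flag on the same weekly calendar used later in this notebook. The amplitude, the holiday window, and the name `baseline_units` are illustrative assumptions, not values taken from the notebook.

```python
import numpy as np
import pandas as pd

weeks = pd.date_range(start="2025-06-05", periods=104, freq="W-SAT")

# Annual cycle: one full sine period every 52 weeks (amplitude of 10 is assumed).
week_idx = np.arange(len(weeks))
seasonality = 10.0 * np.sin(2 * np.pi * week_idx / 52)

# Hypothetical holiday flag: weeks ending in late November or in December.
holiday = ((weeks.month == 12) | ((weeks.month == 11) & (weeks.day >= 20))).astype(int)

# Baseline demand before any media effect is added (lift of 15 is assumed).
baseline_units = 100 + seasonality + 15 * holiday
```

Adding `baseline_units` to the media-driven part of the units equation is one way to give the model a seasonal pattern to separate from ad effects.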


In [81]:
# Base line
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt
%matplotlib inline

In [82]:
weeks = pd.date_range(start='2025-06-05', periods=104, freq='W-SAT')
print(len(weeks))
print(weeks)

104
DatetimeIndex(['2025-06-07', '2025-06-14', '2025-06-21', '2025-06-28',
               '2025-07-05', '2025-07-12', '2025-07-19', '2025-07-26',
               '2025-08-02', '2025-08-09',
               ...
               '2027-03-27', '2027-04-03', '2027-04-10', '2027-04-17',
               '2027-04-24', '2027-05-01', '2027-05-08', '2027-05-15',
               '2027-05-22', '2027-05-29'],
              dtype='datetime64[ns]', length=104, freq='W-SAT')


In [83]:
skus = ['sku_A','sku_B','sku_C']
skus?

Type:        list
String form: ['sku_A', 'sku_B', 'sku_C']
Length:      3
Docstring:  
Built-in mutable sequence.

If no argument is given, the constructor creates a new empty list.
The argument must be an iterable if specified.

In [84]:
#Price
"""
price for sku_A is usually around 80 
price for sku_B is usually around 30 
price for sku_C is usually around 20 
"""
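def generate_small_fluctuations(baseline: int, length: int) -> np.ndarray:
    # Offsets from the baseline grow linearly: 0.002, 0.009, 0.016, ...
    offsets = np.arange(20, 20 + 70 * length, 70) / 10000

    lower_bounds = (baseline - offsets).tolist()
    upper_bounds = (baseline + offsets).tolist()
    combined = lower_bounds + upper_bounds

    # Convert to an array first: `list += ndarray` would extend the list
    # to 2 * length items instead of adding the trend element-wise.
    random_combined = np.array(random.sample(combined, length))

    # Linear drift of 0.01 per week, up or down with equal probability.
    trend = np.arange(length) / 100
    if random.random() < 0.5:
        print("increasing")
        random_combined = random_combined + trend
    else:
        print("decreasing")
        random_combined = random_combined - trend

    return random_combined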

sku_a_price = generate_small_fluctuations(80,104)
sku_b_price = generate_small_fluctuations(30,104)
sku_c_price = generate_small_fluctuations(20,104)

decreasing
decreasing
increasing


In [85]:
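# OOS: out-of-stock days per week (1-7) in 30% / 20% / 10% of weeks for sku_A / sku_B / sku_C, otherwise 0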
sku_a_oos = np.random.choice([1,2,3,4,5,6,7], size=(104,)) * np.random.choice([0, 1], size=(104,), p=[7./10, 3./10])
sku_b_oos = np.random.choice([1,2,3,4,5,6,7], size=(104,)) * np.random.choice([0, 1], size=(104,), p=[8./10, 2./10])
sku_c_oos = np.random.choice([1,2,3,4,5,6,7], size=(104,)) * np.random.choice([0, 1], size=(104,), p=[9./10, 1./10])

In [86]:
events = np.random.choice([0, 1], size=(104,), p=[9./10, 1./10])

In [87]:
def brand_level_advt_strategy(no_of_weeks: int):
    # Branded ads run in about 8% of weeks (less frequent than product-level)
    b_ads_occurrences = np.random.choice([True, False], size=no_of_weeks, p=[0.08, 0.92])
    # Non-branded ads run in about 25% of weeks
    nb_ads_occurrences = np.random.choice([True, False], size=no_of_weeks, p=[0.25, 0.75])
    # Branded: 200-699 clicks in active weeks, 0 otherwise
    brand_level_branded_clicks = np.where(b_ads_occurrences, np.random.randint(200, 700, no_of_weeks), 0)
    # Non-branded: 100-199 clicks in active weeks, 0 otherwise
    brand_level_nonbranded_clicks = np.where(nb_ads_occurrences, np.random.randint(100, 200, no_of_weeks), 0)
    # Branded CPC: $1.50 + Gamma(shape=5, scale=0.3), so mean $3.00 and never below $1.50
    price_per_branded_click = np.round(np.random.gamma(shape=5, scale=0.3, size=no_of_weeks) + 1.5, 2)
    # Non-branded CPC: Normal(1.1, 0.3) clipped to $0.50-$1.80
    price_per_nonbranded_click = np.round(np.clip(np.random.normal(loc=1.1, scale=0.3, size=no_of_weeks), 0.5, 1.8), 2)
    # Spends = clicks * cost per click
    brand_level_branded_spends = brand_level_branded_clicks * price_per_branded_click
    brand_level_nonbranded_spends = brand_level_nonbranded_clicks * price_per_nonbranded_click

    # Return order: both clicks series first, then both spends series
    return  brand_level_branded_clicks, brand_level_nonbranded_clicks, brand_level_branded_spends, brand_level_nonbranded_spends


def product_level_advt_strategy(no_of_weeks: int):
    # Branded ads run in about 15% of weeks
    b_ads_occurrences = np.random.choice([True, False], size=no_of_weeks, p=[0.15, 0.85])
    # Non-branded ads run in about 40% of weeks
    nb_ads_occurrences = np.random.choice([True, False], size=no_of_weeks, p=[0.4, 0.6])
    # Branded: 500-999 clicks in active weeks, 0 otherwise
    product_level_branded_clicks = np.where(b_ads_occurrences, np.random.randint(500, 1000, no_of_weeks), 0)
    # Non-branded: 100-299 clicks in active weeks, 0 otherwise
    product_level_nonbranded_clicks = np.where(nb_ads_occurrences, np.random.randint(100, 300, no_of_weeks), 0)
    # Branded CPC: $1.50 + Gamma(shape=5, scale=0.3), so mean $3.00 and never below $1.50
    price_per_branded_click = np.round(np.random.gamma(shape=5, scale=0.3, size=no_of_weeks) + 1.5, 2)
    # Non-branded CPC: Normal(1.1, 0.3) clipped to $0.50-$1.80
    price_per_nonbranded_click = np.round(np.clip(np.random.normal(loc=1.1, scale=0.3, size=no_of_weeks), 0.5, 1.8), 2)
    
    # Spends = clicks * cost per click
    product_level_branded_spends = product_level_branded_clicks * price_per_branded_click
    product_level_nonbranded_spends = product_level_nonbranded_clicks * price_per_nonbranded_click

    # Return order differs from brand_level_advt_strategy: clicks and spends alternate here
    return product_level_branded_clicks, product_level_branded_spends, product_level_nonbranded_clicks, product_level_nonbranded_spends

In [88]:
brand_level_branded_clicks, brand_level_nonbranded_clicks, brand_level_branded_spends, brand_level_nonbranded_spends =  brand_level_advt_strategy(no_of_weeks = 104)

In [89]:
no_of_weeks = 104

In [90]:
# insta_clicks: active in about 10% of weeks, 250-299 clicks when active
insta_clicks_occ = np.random.choice([True, False], size=no_of_weeks, p=[0.1, 0.9])
insta_clicks = np.where(insta_clicks_occ, np.random.randint(250, 300, no_of_weeks), 0)
# CPC ~ Normal(2.5, 0.3) clipped to $2-$3. The mean of 2.5 is an assumed midpoint:
# with a mean of 1.1, nearly every draw would be clipped to exactly 2.0.
price_per_insta_click = np.round(np.clip(np.random.normal(loc=2.5, scale=0.3, size=no_of_weeks), 2, 3), 2)
insta_spends = insta_clicks * price_per_insta_click

# fb_clicks: active in about 10% of weeks, 150-359 clicks when active
fb_clicks_occ = np.random.choice([True, False], size=no_of_weeks, p=[0.1, 0.9])
fb_clicks = np.where(fb_clicks_occ, np.random.randint(150, 360, no_of_weeks), 0)
# Same assumed CPC distribution as Instagram
price_per_fb_click = np.round(np.clip(np.random.normal(loc=2.5, scale=0.3, size=no_of_weeks), 2, 3), 2)
fb_spends = fb_clicks * price_per_fb_click


In [91]:
def generate_indiv_sku_data(sku, weeks, price, oos, events, no_of_weeks, 
                            brand_level_branded_clicks, brand_level_nonbranded_clicks, brand_level_branded_spends, brand_level_nonbranded_spends,
                            insta_clicks, insta_spends,
                            fb_clicks, fb_spends):
    product_level_branded_clicks, product_level_branded_spends, product_level_nonbranded_clicks, product_level_nonbranded_spends = product_level_advt_strategy(no_of_weeks)

    indiv_data = pd.DataFrame(
        {   
            'date': weeks,
            'sku' : sku,
            'sales': 0,
            'units' : 0,
            'price': price,
            'oos' : oos,
            'events': events,
            'product_level_branded_clicks' : product_level_branded_clicks,
            'product_level_branded_spends' : product_level_branded_spends,
            'product_level_nonbranded_clicks' : product_level_nonbranded_clicks,
            'product_level_nonbranded_spends' : product_level_nonbranded_spends,
            'brand_level_branded_clicks' : brand_level_branded_clicks,
            'brand_level_branded_spends' : brand_level_branded_spends,
            'brand_level_nonbranded_clicks' : brand_level_nonbranded_clicks,
            'brand_level_nonbranded_spends' : brand_level_nonbranded_spends,
            'insta_clicks' : insta_clicks,
            'insta_spends' : insta_spends,
            'fb_clicks' :  fb_clicks, 
            'fb_spends' :  fb_spends
        }
    )
    return indiv_data

In [92]:
no_of_weeks = 104
sku_a_data = generate_indiv_sku_data('sku_A', weeks, sku_a_price, sku_a_oos , events, no_of_weeks, 
                            brand_level_branded_clicks, brand_level_nonbranded_clicks, brand_level_branded_spends, brand_level_nonbranded_spends,
                            insta_clicks, insta_spends,
                            fb_clicks, fb_spends)


sku_b_data = generate_indiv_sku_data('sku_B', weeks, sku_b_price, sku_b_oos , events, no_of_weeks, 
                            brand_level_branded_clicks, brand_level_nonbranded_clicks, brand_level_branded_spends, brand_level_nonbranded_spends,
                            insta_clicks, insta_spends,
                            fb_clicks, fb_spends)

sku_c_data = generate_indiv_sku_data('sku_C', weeks, sku_c_price, sku_c_oos , events, no_of_weeks, 
                            brand_level_branded_clicks, brand_level_nonbranded_clicks, brand_level_branded_spends, brand_level_nonbranded_spends,
                            insta_clicks, insta_spends,
                            fb_clicks, fb_spends)




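# Units per SKU: a random intercept plus linear effects of price, OOS, events and clicks.
# np.random.normal(100, 10) is a single scalar draw per SKU, so it shifts the whole series
# and adds no week-to-week noise. The coefficients below are the ground-truth effects.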
sku_a_data['units'] = (np.random.normal(100, 10)
    - 1.5 * (sku_a_data['price']/80)
    - 2.0 * sku_a_data['oos']
    + 0.5 * sku_a_data['events']
    + 0.3 * sku_a_data['product_level_branded_clicks']
    + 0.2 * sku_a_data['product_level_nonbranded_clicks']
    + 0.25 * sku_a_data['brand_level_branded_clicks']
    + 0.15 * sku_a_data['brand_level_nonbranded_clicks']
    + 0.1 * sku_a_data['insta_clicks']
    + 0.1 * sku_a_data['fb_clicks']).astype(int)

sku_b_data['units'] = (np.random.normal(100, 10)
    - 0.5 * (sku_b_data['price']/30)
    - 1.0 * sku_b_data['oos']
    + 0.5 * sku_b_data['events']
    + 0.1 * sku_b_data['product_level_branded_clicks']
    + 0.6 * sku_b_data['product_level_nonbranded_clicks']
    + 0.25 * sku_b_data['brand_level_branded_clicks']
    + 0.15 * sku_b_data['brand_level_nonbranded_clicks']
    + 0.1 * sku_b_data['insta_clicks']
    + 0.1 * sku_b_data['fb_clicks']).astype(int)

sku_c_data['units'] = (np.random.normal(100, 10)
    - 0.74 * (sku_c_data['price']/20)
    - 0.87 * sku_c_data['oos']
    + 0.5 * sku_c_data['events']
    + 0.4 * sku_c_data['product_level_branded_clicks']
    + 0.2 * sku_c_data['product_level_nonbranded_clicks']
    + 0.25 * sku_c_data['brand_level_branded_clicks']
    + 0.15 * sku_c_data['brand_level_nonbranded_clicks']
    + 0.1 * sku_c_data['insta_clicks']
    + 0.1 * sku_c_data['fb_clicks']).astype(int)


sku_a_data['sales'] = sku_a_data['units'] * sku_a_data['price'] 
sku_b_data['sales'] = sku_b_data['units'] * sku_b_data['price'] 
sku_c_data['sales'] = sku_c_data['units'] * sku_c_data['price'] 


In [93]:
data_list = [sku_a_data,sku_b_data,sku_c_data]

data = pd.concat(data_list)
data.head()

| | date | sku | sales | units | price | oos | events | product_level_branded_clicks | product_level_branded_spends | product_level_nonbranded_clicks | product_level_nonbranded_spends | brand_level_branded_clicks | brand_level_branded_spends | brand_level_nonbranded_clicks | brand_level_nonbranded_spends | insta_clicks | insta_spends | fb_clicks | fb_spends |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2025-06-07 | sku_A | 8546.25 | 106 | 80.625 | 6 | 0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 117 | 126.36 | 0 | 0.0 | 0 | 0.0 |
| 1 | 2025-06-14 | sku_A | 7219.03 | 91 | 79.33 | 5 | 0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
| 2 | 2025-06-21 | sku_A | 8148.175 | 101 | 80.675 | 0 | 0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
| 3 | 2025-06-28 | sku_A | 8069.698 | 101 | 79.898 | 0 | 0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
| 4 | 2025-07-05 | sku_A | 12421.235 | 155 | 80.137 | 0 | 0 | 0 | 0.0 | 126 | 122.22 | 0 | 0.0 | 195 | 191.1 | 0 | 0.0 | 0 | 0.0 |


In [94]:
data.to_excel(r'C:\Users\nigam\OneDrive\Documents\university_classes\AutoMMM\data\data.xlsx',index=False, sheet_name= 'data')

## ✅ 2. **Embed Adstock and Saturation Effects**

### Why: These effects are foundational to MMM and must be realistically emulated.

* Use decaying functions for **carryover (adstock)**.
* Apply transformations (e.g., logistic, power law) to model **saturation**.
* Different channels should have **different decay and saturation parameters**, just like in real-world data.
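
The two transformations above can be sketched as follows. This is a minimal illustration, not the notebook's method: it uses geometric decay for adstock and a Hill curve for saturation (one common choice, used here in place of the logistic or power-law forms mentioned above). The function names and all parameter values are assumptions.

```python
import numpy as np

def geometric_adstock(x, decay):
    """Carry over a fraction `decay` of the previous period's adstocked value."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    carry = 0.0
    for t, value in enumerate(x):
        carry = value + decay * carry
        out[t] = carry
    return out

def hill_saturation(x, half_saturation, slope=1.0):
    """Hill curve: 0 at x=0, 0.5 at x=half_saturation, approaching 1 for large x."""
    x = np.asarray(x, dtype=float)
    return x**slope / (x**slope + half_saturation**slope)

clicks = np.array([0, 500, 0, 0, 800, 0], dtype=float)

# With decay=0.5 the week-2 burst of 500 still contributes 250, then 125.
adstocked = geometric_adstock(clicks, decay=0.5)
# adstocked -> [0, 500, 250, 125, 862.5, 431.25]

# Saturation maps the adstocked series into [0, 1) with diminishing returns.
effect = hill_saturation(adstocked, half_saturation=400.0)
```

Giving each channel its own `decay` and `half_saturation` is what makes the channels distinguishable to the model later on.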

## ✅ 3. **Vary Spend and Effects Over Time**

### Why: Real campaigns vary over time. Flat or uniform spend gives the model little variance to learn from, so channel effects cannot be estimated reliably.

* Simulate **budget reallocations**, **seasonal bursts**, or **campaign launches**.
* Vary **channel spend levels** across weeks or months to create meaningful signal variance.
* Inject **missing data points** if you want to test model robustness and preprocessing.
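
One possible way to combine the three bullets above in a single spend series is sketched below. The burst timing, the size of the budget cut, and the share of missing weeks are all made-up values for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)  # fixed seed for reproducibility
n_weeks = 104

# Always-on base spend with mild week-to-week variation.
spend = rng.normal(loc=200.0, scale=20.0, size=n_weeks).clip(min=0)

# Campaign launch: a four-week burst starting at a hypothetical week 30.
spend[30:34] += 600.0

# Budget reallocation: spend is cut by 40% for the second year.
spend[52:] *= 0.6

# Blank out 5 randomly chosen weeks to simulate missing reports.
spend = pd.Series(spend, name="channel_spend")
missing_idx = rng.choice(n_weeks, size=5, replace=False)
spend.iloc[missing_idx] = np.nan
```

Whether the missing weeks are then imputed, dropped, or passed through is a preprocessing decision the pipeline under test has to make.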

Things to keep in mind

## ✅ 4. **Noise and Outliers**

### Why: Real-world sales data is noisy. Models must learn to deal with uncertainty.

* Add **Gaussian noise** to the sales output.
* Introduce occasional **outliers** (e.g., unexpected sales spikes/dips).
* Control noise variance to keep the signal-to-noise ratio realistic.
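
A minimal sketch of the noise and outlier bullets above follows. The constant `signal` is only a stand-in for the deterministic part of a units equation, and `noise_sigma`, the outlier rate, and the outlier size are assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility
n_weeks = 104
signal = np.full(n_weeks, 100.0)  # stand-in for the deterministic part of units

# Per-week Gaussian noise; noise_sigma controls the signal-to-noise ratio.
noise_sigma = 5.0
noisy = signal + rng.normal(0.0, noise_sigma, size=n_weeks)

# Occasional outliers: about 3% of weeks get a spike or dip of 40%.
is_outlier = rng.random(n_weeks) < 0.03
direction = rng.choice([-1.0, 1.0], size=n_weeks)
noisy = noisy * (1.0 + 0.4 * direction * is_outlier)

# Units are non-negative integers.
units = np.maximum(noisy, 0).round().astype(int)
```

Note that the noise is drawn once per week. A single scalar draw would only shift the whole series and would not make the data noisy.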

## ✅ 5. **Generate Sufficient Time Granularity**

### Why: MMM typically operates on **weekly** or **monthly** data.

* Use at least **104 weeks (2 years)** to allow for training + validation.
* Include **enough season cycles** to identify periodic patterns.
* For geo-level or brand-level models, generate **panel-style** data with multiple units (e.g., region × week).
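
As a sketch of the panel-style layout mentioned above, the snippet below builds a region × week frame with one row per combination. The region names, baselines, and seasonal amplitude are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # fixed seed for reproducibility
weeks = pd.date_range(start="2025-06-05", periods=104, freq="W-SAT")
regions = ["north", "south", "west"]  # hypothetical panel units

# One row per (region, week) combination.
index = pd.MultiIndex.from_product([regions, weeks], names=["region", "date"])
panel = pd.DataFrame(index=index).reset_index()

# Week counter within each region, used to drive a shared annual cycle.
panel["week_idx"] = panel.groupby("region").cumcount()

# Region-specific baseline plus seasonality plus per-row noise.
region_base = {"north": 120.0, "south": 90.0, "west": 100.0}
panel["units"] = (
    panel["region"].map(region_base)
    + 10.0 * np.sin(2 * np.pi * panel["week_idx"] / 52)
    + rng.normal(0.0, 5.0, size=len(panel))
)
```

With 104 weeks, each region sees two full 52-week cycles, which is the minimum needed to tell a repeating pattern from a one-off event.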

## ✅ 6. **Ensure Model Identifiability**

### Why: If all features are highly correlated, it becomes hard for the model to isolate effects.

* Avoid perfect collinearity between channels (e.g., TV spend shouldn't always grow with digital).
* Vary ad spends independently when possible.
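
Before fitting anything, it can help to check how entangled the generated channels are. The sketch below computes pairwise correlations and variance inflation factors (VIF) with plain NumPy. The helper `variance_inflation_factors` and the three example channels are illustrative assumptions, with `social` deliberately tied to `digital` to show what a problem looks like.

```python
import numpy as np
import pandas as pd

def variance_inflation_factors(df):
    """VIF_j = 1 / (1 - R^2_j), from regressing column j on the other columns."""
    X = df.to_numpy(dtype=float)
    vifs = {}
    for j, name in enumerate(df.columns):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(y)), others])  # add an intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        vifs[name] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return pd.Series(vifs)

rng = np.random.default_rng(3)
tv = rng.normal(100, 20, 104)
digital = rng.normal(50, 10, 104)                # drawn independently of tv
social = 0.9 * digital + rng.normal(0, 2, 104)   # deliberately tied to digital
spends = pd.DataFrame({"tv": tv, "digital": digital, "social": social})

print(spends.corr().round(2))
print(variance_inflation_factors(spends).round(1))
```

A VIF near 1 means a channel varies independently of the others. Values well above 5 to 10 are a common rule-of-thumb warning that its effect will be hard to isolate.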

## ✅ 7. **Create Ground Truth Coefficients (for validation)**

### Why: You want to compare model-estimated effects to known "true" effects.

* Store the coefficients you use to generate sales (e.g., 0.3 for TV, 0.4 for Digital).
* After modeling, compare model output to these coefficients for **model validation**.
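
The store-then-compare workflow above can be sketched like this. Ordinary least squares via `np.linalg.lstsq` is used purely as a stand-in for whatever model is being validated, and the channel names, coefficients, and noise level are example values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(11)  # fixed seed for reproducibility
n_weeks = 104

# Ground truth stored up front (example values).
true_coefs = {"intercept": 100.0, "tv": 0.3, "digital": 0.4}

X = pd.DataFrame({
    "tv": rng.uniform(0, 500, n_weeks),
    "digital": rng.uniform(0, 300, n_weeks),
})
sales = (
    true_coefs["intercept"]
    + true_coefs["tv"] * X["tv"]
    + true_coefs["digital"] * X["digital"]
    + rng.normal(0.0, 5.0, n_weeks)
)

# Fit: OLS with an intercept column, standing in for the model under test.
A = np.column_stack([np.ones(n_weeks), X.to_numpy()])
est, *_ = np.linalg.lstsq(A, sales.to_numpy(), rcond=None)
estimated = dict(zip(["intercept", "tv", "digital"], est))

# Side-by-side comparison of true and recovered coefficients.
comparison = pd.DataFrame({"true": true_coefs, "estimated": estimated})
comparison["abs_error"] = (comparison["estimated"] - comparison["true"]).abs()
print(comparison.round(3))
```

If the recovered coefficients drift far from the stored ones on clean linear data like this, the problem is in the pipeline rather than in the data.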

## ✅ 8. **Label Campaign Periods**

### Why: Useful for testing campaign attribution models or causal inference.

* Create `campaign_active` flags to simulate burst periods.
* Useful for validating ROAS and marginal return calculations.
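
A `campaign_active` flag can be derived from a list of campaign windows, as in the sketch below. The two date ranges and the spend levels are hypothetical, and the weekly calendar matches the one used earlier in this notebook.

```python
import numpy as np
import pandas as pd

weeks = pd.date_range(start="2025-06-05", periods=104, freq="W-SAT")
df = pd.DataFrame({"date": weeks})

# Hypothetical campaign windows as (start, end) dates, both inclusive.
campaigns = [("2025-11-15", "2025-12-27"), ("2026-06-06", "2026-07-04")]

df["campaign_active"] = 0
for start, end in campaigns:
    in_window = df["date"].between(pd.Timestamp(start), pd.Timestamp(end))
    df.loc[in_window, "campaign_active"] = 1

# Spend is zero outside campaigns and bursts inside them.
rng = np.random.default_rng(5)
df["spend"] = np.where(df["campaign_active"] == 1, rng.uniform(500, 1000, len(df)), 0.0)
```

Keeping the flag as its own column lets you compare outcomes inside and outside campaign periods without re-deriving them from spend.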

## ✅ 9. **Track Cumulative Spend & Lag Variables**

### Why: Advanced MMM includes lagged, cumulative, or rolling window variables.

* Precompute rolling averages or lag variables to simulate richer input space.
* Can also help you test whether your agent-based system is handling time-dependencies correctly.
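
Lag, rolling, and cumulative features need to be computed per SKU so that values do not leak across products. A small sketch with made-up spend values:

```python
import pandas as pd

df = pd.DataFrame({
    "sku": ["sku_A"] * 5 + ["sku_B"] * 5,
    "spend": [0.0, 100.0, 0.0, 50.0, 0.0, 10.0, 0.0, 0.0, 40.0, 0.0],
})

grouped = df.groupby("sku")["spend"]

# Previous week's spend; the first week of each SKU has no predecessor (NaN).
df["spend_lag1"] = grouped.shift(1)

# Three-week rolling mean, allowing shorter windows at the start of each SKU.
df["spend_roll3"] = grouped.transform(lambda s: s.rolling(3, min_periods=1).mean())

# Running total of spend within each SKU.
df["spend_cum"] = grouped.cumsum()
```

The rows must already be sorted by date within each SKU for these features to be meaningful.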

## ✅ 10. **Document the Data-Generating Process (DGP)**

### Why: Transparency in how you generated synthetic data helps debug your modeling pipeline.

* Keep a record of:

  * Coefficients
  * Adstock/saturation parameters
  * Noise variance
  * External variable effects
  * Random seeds (for reproducibility)
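
One lightweight way to keep this record is a single dictionary serialized to JSON next to the data. The sketch below is an assumption about layout. The key names and all parameter values are illustrative, not taken from the notebook.

```python
import json
import numpy as np

# Everything needed to regenerate the dataset, in one place (example values).
dgp_spec = {
    "seed": 2025,
    "n_weeks": 104,
    "coefficients": {"intercept": 100.0, "tv": 0.3, "digital": 0.4},
    "adstock_decay": {"tv": 0.6, "digital": 0.3},
    "saturation_half_point": {"tv": 400.0, "digital": 250.0},
    "noise_sigma": 5.0,
    "external_effects": {"events": 0.5, "oos": -2.0},
}

# Seeding the generator from the spec makes every draw reproducible.
rng = np.random.default_rng(dgp_spec["seed"])
first_draw = rng.normal(0.0, dgp_spec["noise_sigma"], size=3)

# Round-trip through JSON; in practice this string would be written to a file.
spec_json = json.dumps(dgp_spec, indent=2, sort_keys=True)
restored = json.loads(spec_json)
```

Re-seeding a fresh generator from `restored["seed"]` reproduces `first_draw` exactly, which is a quick check that the saved spec is sufficient.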