# Synthetic Data Generation for A/B Test of Email Campaign Effectiveness on Customer Retention

Since real A/B test data is not available, we generate synthetic data for both a control group (customers who did not receive the email) and a treatment group (customers who received the email).

To simulate a realistic A/B test on customers of the e-commerce platform, we take the churn, email open, and purchase rates from our current dataset as baselines, so that the synthetic data reflects observed customer behavior. We then add a random per-customer perturbation to the treatment group's churn and purchase rates, which produces a range of customer responses. These variations represent natural variability in how customers react to the email campaign.
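The sampling model for the treatment group can be written out as follows. This is a sketch for reference: the symbols $p_c$ (baseline churn rate), $p_b$ (baseline purchase rate), $\varepsilon_i$ and $\delta_i$ are notation introduced here and do not appear in the notebook, while the ranges are the ones used in the simulation code.

```latex
\begin{aligned}
\varepsilon_i &\sim \mathrm{Uniform}(-0.05,\ 0.05), &
\mathrm{churned}_i &\sim \mathrm{Bernoulli}\big(\mathrm{clip}(p_c + \varepsilon_i,\ 0,\ 1)\big), \\
\delta_i &\sim \mathrm{Uniform}(0,\ 0.05), &
\mathrm{purchased}_i &\sim \mathrm{Bernoulli}\big(\mathrm{clip}(p_b + \delta_i,\ 0,\ 1)\big).
\end{aligned}
```

Since $\mathbb{E}[\varepsilon_i] = 0$, the expected treatment churn rate equals $p_c$ as long as $0.05 \le p_c \le 0.95$, so that clipping never applies. Since $\mathbb{E}[\delta_i] = 0.025$, the expected treatment purchase rate is $p_b + 0.025$ when $p_b \le 0.95$.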

In [None]:
import sqlite3
import pandas as pd
import numpy as np

# Connect to the SQLite database
conn = sqlite3.connect('Messages.db')

# Load the full messages table into a DataFrame
msg = pd.read_sql_query("SELECT * FROM messages;", conn)

# Preview the first rows
print(msg.head())

                                message_id  campaign_id message_type  \
0  1515915625489079625-11387-64244e6bd3873        11387         bulk   
1  1515915625489079631-11387-64244e6bd38c2        11387         bulk   
2  1515915625489079981-11387-64244e6bd3a36        11387         bulk   
3  1515915625489079662-11387-64244e6bd38d9        11387         bulk   
4  1515915625489079664-11387-64244e6bd38f5        11387         bulk   

             client_id      channel   stream                 date  \
0  1515915625489079625  mobile_push  android  2023-03-29 00:00:00   
1  1515915625489079631  mobile_push  desktop  2023-03-29 00:00:00   
2  1515915625489079981  mobile_push  desktop  2023-03-29 00:00:00   
3  1515915625489079662  mobile_push  android  2023-03-29 00:00:00   
4  1515915625489079664  mobile_push  android  2023-03-29 00:00:00   

               sent_at  is_opened  is_clicked  is_unsubscribed  is_complained  \
0  2023-03-29 14:51:38          1           0                0         

In [None]:
customers = pd.read_csv("Customers.csv", index_col=0)

In [None]:
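# Number of customers in the dataset
n_cust = customers.shape[0]

# Overall churn rate as a percentage (assumes 'Churn' is coded 0/1);
# it is divided by 100 wherever a probability is needed
overall_churn_rate = customers['Churn'].sum() * 100 / n_cust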

In [None]:
# Baseline rates estimated from the messages table, ignoring missing values
opened_rate = msg['is_opened'].sum() / len(msg['is_opened'].dropna())            # share of messages that were opened
purchase_rate = msg['is_purchased'].sum() / len(msg['is_purchased'].dropna())    # share of messages followed by a purchase

n_customers = 5000

# Set random seed for reproducibility
np.random.seed(0)

# Simulate Control Group (No Email)
control_group = pd.DataFrame({
    'customer_id': range(n_customers),
    'group': 'Control',
    'is_opened': np.zeros(n_customers, dtype=int),  # No email, so open rate is 0
    'is_purchased': np.random.binomial(1, purchase_rate, n_customers),
    'churned': np.random.binomial(1, overall_churn_rate/100, n_customers)
})

# Per-customer perturbations: churn shifts by up to ±5 percentage points,
# purchase probability rises by 0 to 5 percentage points
churn_perturbations = np.random.uniform(-0.05, 0.05, n_customers)
purchase_perturbations = np.random.uniform(0, 0.05, n_customers)

# Adjusted rates for each customer in the Treatment group
treatment_churn_rates = np.clip(overall_churn_rate/100 + churn_perturbations, 0, 1)
treatment_purchase_rates = np.clip(purchase_rate + purchase_perturbations, 0, 1)


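# Simulate Treatment Group (Email Sent): churn perturbations average zero
# (no assumed churn lift), while purchase perturbations are non-negative
# (an average uplift of about 2.5 percentage points)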
treatment_group = pd.DataFrame({
    'customer_id': range(n_customers, 2 * n_customers),
    'group': 'Treatment',
    'is_opened': np.random.binomial(1, opened_rate, n_customers),
    'is_purchased': np.random.binomial(1, treatment_purchase_rates),
    'churned': np.random.binomial(1, treatment_churn_rates)
})

control_group.to_csv('campaign_control.csv')
treatment_group.to_csv('campaign_treatment.csv')
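
Once both groups are generated, a quick sanity check is to compare their churn or purchase rates. The sketch below is one possible way to do this with a pooled two-proportion z-test. The helper `two_proportion_z_test` is a hypothetical name introduced here, not part of the notebook, and the test itself is an assumption about how the synthetic data might be analyzed later.

```python
import math

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Return (rate_b - rate_a, z statistic, two-sided p-value) for a pooled z-test."""
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both groups must be non-empty")
    rate_a = successes_a / n_a
    rate_b = successes_b / n_b
    # Pooled proportion under the null hypothesis of equal rates
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Every outcome is identical in both groups: no evidence of a difference
        return rate_b - rate_a, 0.0, 1.0
    z = (rate_b - rate_a) / se
    # Two-sided p-value from the standard normal: P(|Z| > |z|) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_b - rate_a, z, p_value

# Possible usage with the frames built above (control as group a, treatment as group b):
# diff, z, p = two_proportion_z_test(
#     control_group['churned'].sum(), len(control_group),
#     treatment_group['churned'].sum(), len(treatment_group))
```

Because the churn perturbations average zero, this check should usually show no significant churn difference, whereas the purchase rate may differ because its perturbation is strictly non-negative.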