## Analysis Overview 

This notebook contains **R code** to analyze the simulated data generated in `simulation.ipynb`. The analysis focuses on evaluating the effectiveness of various advertising strategies across a marketing funnel.

---

### Key Questions

1. What is the effect of **different ad campaigns** on purchase probability and total sales?
2. How does **ad type** influence user progression through the funnel? 
3. What is the **Return on Investment (ROI)** of each ad campaign?
4. How effective is the full-funnel strategy when **funnel stage classification is predicted with high, medium, or low accuracy**?

---

### Extensions

For more advanced analysis, students can combine the simulation and analysis notebooks to explore how changes in simulation parameters influence campaign performance. For example:

1. How do results differ between **new brands** (where most users start in the **Not Aware** stage) and **well-known brands** (where most users start in the **Aware** stage)?
2. How does the relative performance of different campaigns change when **branding and performance ads** become **more or less effective**?

In [1]:
suppressMessages(library(tidyverse))

In [2]:
df = read.csv('data.csv')
df = arrange(df, campaign_type, user_id, visit)

In [3]:
# control group as the baseline
df$campaign_type = as.factor(df$campaign_type)
df$campaign_type <- relevel(df$campaign_type, ref = 'control')

In [4]:
# no ad as the baseline
df$ad_type = as.factor(df$ad_type)
df$ad_type <- relevel(df$ad_type, ref = 'none')

In [5]:
# inspect the data structure 
head(df)

| | user_id | current_funnel_stage | next_funnel_stage | ad_type | purchase | sales | date | campaign_type | visit |
|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<chr>` | `<chr>` | `<fct>` | `<int>` | `<int>` | `<chr>` | `<fct>` | `<int>` |
| 1 | 30001 | not aware | not aware | branding | 0 | 0 | 2025-05-13 | brand_plus_performance | 1 |
| 2 | 30001 | not aware | aware | branding | 0 | 0 | 2025-05-18 | brand_plus_performance | 2 |
| 3 | 30001 | aware | aware | performance | 0 | 0 | 2025-05-19 | brand_plus_performance | 3 |
| 4 | 30001 | aware | aware | performance | 0 | 0 | 2025-05-20 | brand_plus_performance | 4 |
| 5 | 30002 | aware | aware | branding | 0 | 0 | 2025-05-07 | brand_plus_performance | 1 |
| 6 | 30002 | aware | aware | branding | 0 | 0 | 2025-05-10 | brand_plus_performance | 2 |


In [6]:
# all ad types
unique(as.character(df$ad_type))

In [7]:
# all campaign types
unique(as.character(df$campaign_type))

#### 1. overall effect of campaign on purchase probability and sales

In [8]:
# purchase probability as the outcome (all campaigns)
# intercept shows purchase probability without any ad
# data is aggregated from user-visit level to user level
model = lm(purchase ~ campaign_type, 
           df %>% 
           group_by(user_id) %>% 
           mutate(purchase = sum(purchase)) %>% 
           distinct(user_id, .keep_all = T))
summary(model)


Call:
lm(formula = purchase ~ campaign_type, data = df %>% group_by(user_id) %>% 
    mutate(purchase = sum(purchase)) %>% distinct(user_id, .keep_all = T))

Residuals:
    Min      1Q  Median      3Q     Max 
-0.1422 -0.1013 -0.0943 -0.0484  0.9516 

Coefficients:
                                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)                         0.048400   0.002809  17.230   <2e-16 ***
campaign_typebrand_plus_performance 0.052900   0.003973  13.316   <2e-16 ***
campaign_typebranding               0.004400   0.003973   1.108    0.268    
campaign_typefull_funnel            0.093800   0.003973  23.612   <2e-16 ***
campaign_typeperformance            0.045900   0.003973  11.554   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2809 on 49995 degrees of freedom
Multiple R-squared:  0.01489,	Adjusted R-squared:  0.01481 
F-statistic: 188.9 on 4 and 49995 DF,  p-value: < 2.2e-16
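
Because `campaign_type` is the only regressor, each coefficient in this model is a difference in group means: the intercept is the control group's purchase rate, and every other coefficient is that campaign's rate minus the control rate. A minimal sketch with made-up data illustrates the point (the `toy` data frame and its values are hypothetical and are not drawn from `data.csv`):

```r
# Hypothetical user-level data: 5 control users and 5 full_funnel users
toy <- data.frame(
  campaign_type = factor(rep(c("control", "full_funnel"), each = 5),
                         levels = c("control", "full_funnel")),
  purchase = c(0, 0, 0, 0, 1, 0, 1, 0, 1, 1)
)

# Same model form as in the notebook, fitted on the toy data
fit <- lm(purchase ~ campaign_type, data = toy)

# Plain group means for comparison
group_means <- tapply(toy$purchase, toy$campaign_type, mean)

# The intercept matches the control mean; the slope matches the difference in means
print(coef(fit))
print(group_means)
```

Reading the regression this way also explains why the R-squared is small even when the effects are highly significant: campaign assignment shifts the average purchase rate but explains little of the user-to-user variation in a binary outcome.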


In [9]:
# purchase probability as the outcome (brand_plus_performance vs. full_funnel)
# unused factor levels are dropped, so brand_plus_performance becomes the baseline
model = lm(purchase ~ campaign_type, 
           df %>% 
           group_by(user_id) %>% 
           mutate(purchase = sum(purchase)) %>% 
           distinct(user_id, .keep_all = T) %>%
           filter(campaign_type %in% c('brand_plus_performance', 'full_funnel'))) 
summary(model)


Call:
lm(formula = purchase ~ campaign_type, data = df %>% group_by(user_id) %>% 
    mutate(purchase = sum(purchase)) %>% distinct(user_id, .keep_all = T) %>% 
    filter(campaign_type %in% c("brand_plus_performance", "full_funnel")))

Residuals:
    Min      1Q  Median      3Q     Max 
-0.1422 -0.1422 -0.1013 -0.1013  0.8987 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)              0.101300   0.003264  31.038   <2e-16 ***
campaign_typefull_funnel 0.040900   0.004616   8.861   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3264 on 19998 degrees of freedom
Multiple R-squared:  0.003911,	Adjusted R-squared:  0.003861 
F-statistic: 78.52 on 1 and 19998 DF,  p-value: < 2.2e-16


In [10]:
# purchase probability as the outcome (performance vs. full_funnel)
# unused factor levels are dropped, so full_funnel becomes the baseline
model = lm(purchase ~ campaign_type, 
           df %>% 
           group_by(user_id) %>% 
           mutate(purchase = sum(purchase)) %>% 
           distinct(user_id, .keep_all = T) %>%
           filter(campaign_type %in% c('performance', 'full_funnel')))
summary(model)


Call:
lm(formula = purchase ~ campaign_type, data = df %>% group_by(user_id) %>% 
    mutate(purchase = sum(purchase)) %>% distinct(user_id, .keep_all = T) %>% 
    filter(campaign_type %in% c("performance", "full_funnel")))

Residuals:
    Min      1Q  Median      3Q     Max 
-0.1422 -0.1422 -0.0943 -0.0943  0.9057 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)               0.142200   0.003220   44.16   <2e-16 ***
campaign_typeperformance -0.047900   0.004554  -10.52   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.322 on 19998 degrees of freedom
Multiple R-squared:  0.005501,	Adjusted R-squared:  0.005452 
F-statistic: 110.6 on 1 and 19998 DF,  p-value: < 2.2e-16
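
The point estimates in the two pairwise models above can also be recovered from the all-campaign model as differences between coefficients (for example, 0.0459 - 0.0938 = -0.0479 for performance relative to full_funnel). Refitting on a two-campaign subset mainly changes the standard error, because the residual variance is then estimated from those two groups only. The following sketch checks the equivalence on made-up data; the `toy` data frame and its level order are assumptions chosen for illustration:

```r
# Hypothetical user-level data with three groups of four users each
toy <- data.frame(
  campaign_type = factor(rep(c("control", "performance", "full_funnel"), each = 4),
                         levels = c("control", "performance", "full_funnel")),
  purchase = c(0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1)
)

# Model on all groups, with control as the baseline
full_fit <- lm(purchase ~ campaign_type, data = toy)

# Model on the two-campaign subset; lm() drops the unused "control" level,
# so "performance" (the first remaining level) becomes the baseline
pair_fit <- lm(purchase ~ campaign_type,
               data = subset(toy, campaign_type != "control"))

# Difference of two coefficients from the full model vs. the pairwise coefficient
diff_from_full <- coef(full_fit)[["campaign_typefull_funnel"]] -
  coef(full_fit)[["campaign_typeperformance"]]
diff_from_pair <- coef(pair_fit)[["campaign_typefull_funnel"]]

print(c(from_full = diff_from_full, from_pair = diff_from_pair))
```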


In [11]:
# sales as the outcome (all campaigns)
# intercept shows average sales without any ad
# data is aggregated from user-visit level to user level
model = lm(sales ~ campaign_type, 
           df %>% 
           group_by(user_id) %>% 
           mutate(sales = sum(sales)) %>% 
           distinct(user_id, .keep_all = T))
summary(model)


Call:
lm(formula = sales ~ campaign_type, data = df %>% group_by(user_id) %>% 
    mutate(sales = sum(sales)) %>% distinct(user_id, .keep_all = T))

Residuals:
   Min     1Q Median     3Q    Max 
-14.22 -10.13  -9.43  -4.84  95.16 

Coefficients:
                                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)                           4.8400     0.2809  17.230   <2e-16 ***
campaign_typebrand_plus_performance   5.2900     0.3973  13.316   <2e-16 ***
campaign_typebranding                 0.4400     0.3973   1.108    0.268    
campaign_typefull_funnel              9.3800     0.3973  23.612   <2e-16 ***
campaign_typeperformance              4.5900     0.3973  11.554   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 28.09 on 49995 degrees of freedom
Multiple R-squared:  0.01489,	Adjusted R-squared:  0.01481 
F-statistic: 188.9 on 4 and 49995 DF,  p-value: < 2.2e-16
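
Every coefficient and standard error in this sales model is 100 times its counterpart in the purchase model, while the t values and R-squared are unchanged. That pattern is what one would see if each purchase in the simulated data carried a fixed value of 100; this is an inference from the printed outputs rather than something the notebook states. A small sketch with hypothetical data shows how rescaling the outcome rescales the coefficients:

```r
# Hypothetical user-level data (not taken from data.csv)
toy <- data.frame(
  campaign_type = factor(rep(c("control", "full_funnel"), each = 4),
                         levels = c("control", "full_funnel")),
  purchase = c(0, 0, 0, 1, 0, 1, 1, 1)
)

# Assumption for illustration: every purchase is worth exactly 100
toy$sales <- 100 * toy$purchase

purchase_fit <- lm(purchase ~ campaign_type, data = toy)
sales_fit <- lm(sales ~ campaign_type, data = toy)

# Ratio of sales coefficients to purchase coefficients
print(coef(sales_fit) / coef(purchase_fit))

# t values are identical because estimates and standard errors scale together
print(summary(sales_fit)$coefficients[, "t value"])
print(summary(purchase_fit)$coefficients[, "t value"])
```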


In [12]:
# sales as the outcome (brand_plus_performance vs. full_funnel)
# unused factor levels are dropped, so brand_plus_performance becomes the baseline
model = lm(sales ~ campaign_type, 
           df %>% 
           group_by(user_id) %>% 
           mutate(sales = sum(sales)) %>% 
           distinct(user_id, .keep_all = T) %>%
           filter(campaign_type %in% c('brand_plus_performance', 'full_funnel')))
summary(model)


Call:
lm(formula = sales ~ campaign_type, data = df %>% group_by(user_id) %>% 
    mutate(sales = sum(sales)) %>% distinct(user_id, .keep_all = T) %>% 
    filter(campaign_type %in% c("brand_plus_performance", "full_funnel")))

Residuals:
   Min     1Q Median     3Q    Max 
-14.22 -14.22 -10.13 -10.13  89.87 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)               10.1300     0.3264  31.038   <2e-16 ***
campaign_typefull_funnel   4.0900     0.4616   8.861   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 32.64 on 19998 degrees of freedom
Multiple R-squared:  0.003911,	Adjusted R-squared:  0.003861 
F-statistic: 78.52 on 1 and 19998 DF,  p-value: < 2.2e-16


In [13]:
# sales as the outcome (performance vs. full_funnel)
# unused factor levels are dropped, so full_funnel becomes the baseline
model = lm(sales ~ campaign_type, 
           df %>% 
           group_by(user_id) %>% 
           mutate(sales = sum(sales)) %>% 
           distinct(user_id, .keep_all = T) %>%
           filter(campaign_type %in% c('performance', 'full_funnel')))
summary(model)


Call:
lm(formula = sales ~ campaign_type, data = df %>% group_by(user_id) %>% 
    mutate(sales = sum(sales)) %>% distinct(user_id, .keep_all = T) %>% 
    filter(campaign_type %in% c("performance", "full_funnel")))

Residuals:
   Min     1Q Median     3Q    Max 
-14.22 -14.22  -9.43  -9.43  90.57 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)               14.2200     0.3220   44.16   <2e-16 ***
campaign_typeperformance  -4.7900     0.4554  -10.52   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 32.2 on 19998 degrees of freedom
Multiple R-squared:  0.005501,	Adjusted R-squared:  0.005452 
F-statistic: 110.6 on 1 and 19998 DF,  p-value: < 2.2e-16


#### 2. effect of ad type on funnel progression

In [14]:
df$stage_progress = ifelse(df$current_funnel_stage == df$next_funnel_stage, 0, 1)

In [15]:
model = lm(stage_progress ~ ad_type, 
          filter(df, current_funnel_stage == 'not aware'))
summary(model)


Call:
lm(formula = stage_progress ~ ad_type, data = filter(df, current_funnel_stage == 
    "not aware"))

Residuals:
     Min       1Q   Median       3Q      Max 
-0.39759 -0.39759 -0.09927 -0.09653  0.90347 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)        0.096528   0.002364  40.830   <2e-16 ***
ad_typebranding    0.301058   0.002971 101.343   <2e-16 ***
ad_typeperformance 0.002746   0.003195   0.859     0.39    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3933 on 108977 degrees of freedom
Multiple R-squared:  0.125,	Adjusted R-squared:  0.125 
F-statistic:  7783 on 2 and 108977 DF,  p-value: < 2.2e-16


In [16]:
model = lm(stage_progress ~ ad_type, 
          filter(df, current_funnel_stage == 'aware'))
summary(model)


Call:
lm(formula = stage_progress ~ ad_type, data = filter(df, current_funnel_stage == 
    "aware"))

Residuals:
    Min      1Q  Median      3Q     Max 
-0.3014 -0.3014 -0.0991 -0.0991  0.9009 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)         0.103349   0.004541  22.760   <2e-16 ***
ad_typebranding    -0.004251   0.005243  -0.811    0.417    
ad_typeperformance  0.198060   0.005069  39.075   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3884 on 58997 degrees of freedom
Multiple R-squared:  0.06291,	Adjusted R-squared:  0.06288 
F-statistic:  1980 on 2 and 58997 DF,  p-value: < 2.2e-16


In [17]:
model = lm(stage_progress ~ ad_type, 
          filter(df, current_funnel_stage == 'consider'))
summary(model)


Call:
lm(formula = stage_progress ~ ad_type, data = filter(df, current_funnel_stage == 
    "consider"))

Residuals:
    Min      1Q  Median      3Q     Max 
-0.1985 -0.1985 -0.1126 -0.1032  0.8968 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)         0.112558   0.005558  20.252   <2e-16 ***
ad_typebranding    -0.009318   0.007067  -1.319    0.187    
ad_typeperformance  0.085971   0.006258  13.737   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3645 on 27319 degrees of freedom
Multiple R-squared:  0.01519,	Adjusted R-squared:  0.01511 
F-statistic: 210.6 on 2 and 27319 DF,  p-value: < 2.2e-16


#### 3. ROI

In [18]:
# cpm (cost per thousand impressions) for branding ad, can be changed 
cpm = 30 
# cpa (cost per action/purchase) for performance ad, can be changed
cpa = 10
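
The cost and ROI cells in this section apply the same two formulas to each campaign. Written out, with symbols introduced here for illustration only: $N^{\text{brand}}_c$ is the number of branding ad impressions (rows) in campaign $c$, $P^{\text{perf}}_c$ is the number of purchases on visits that showed a performance ad, and $S_c$ is the total sales of campaign $c$.

```latex
C_c = \frac{\mathrm{CPM}}{1000}\, N^{\text{brand}}_c + \mathrm{CPA}\cdot P^{\text{perf}}_c,
\qquad
\mathrm{ROI}_c = \frac{\left(S_c - S_{\text{control}}\right) - C_c}{C_c}
```

Since $S_c$ and $S_{\text{control}}$ are totals rather than per-user averages, the ROI formula is only meaningful when the campaign and control groups contain the same number of users.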

In [19]:
# cost of branding campaign
branding_ad_cost = cpm/1000 * df %>% filter(campaign_type == 'branding', ad_type == 'branding') %>% nrow
performance_ad_cost = 0
cost_branding = branding_ad_cost + performance_ad_cost

print(paste("branding ad cost:", branding_ad_cost))
print(paste("performance ad cost:", performance_ad_cost))
print(paste("total cost:", cost_branding))

[1] "branding ad cost: 1180.05"
[1] "performance ad cost: 0"
[1] "total cost: 1180.05"


In [20]:
# cost of performance campaign
branding_ad_cost = 0
performance_ad_cost = cpa * df %>% filter(campaign_type == 'performance', ad_type == 'performance') %>% pull(purchase) %>% sum
cost_performance = branding_ad_cost + performance_ad_cost

print(paste("branding ad cost:", branding_ad_cost))
print(paste("performance ad cost:", performance_ad_cost))
print(paste("total cost:", cost_performance))

[1] "branding ad cost: 0"
[1] "performance ad cost: 9430"
[1] "total cost: 9430"


In [21]:
# cost of brand-plus-performance campaign
branding_ad_cost = cpm/1000 * df %>% filter(campaign_type == 'brand_plus_performance', ad_type == 'branding') %>% nrow
performance_ad_cost = cpa * df %>% filter(campaign_type == 'brand_plus_performance', ad_type == 'performance') %>% pull(purchase) %>% sum
cost_brand_plus_performance = branding_ad_cost + performance_ad_cost

print(paste("branding ad cost:", branding_ad_cost))
print(paste("performance ad cost:", performance_ad_cost))
print(paste("total cost:", cost_brand_plus_performance))

[1] "branding ad cost: 597.15"
[1] "performance ad cost: 8210"
[1] "total cost: 8807.15"


In [22]:
# cost of full-funnel campaign
branding_ad_cost = cpm/1000 * df %>% filter(campaign_type == 'full_funnel', ad_type == 'branding') %>% nrow
performance_ad_cost = cpa * df %>% filter(campaign_type == 'full_funnel', ad_type == 'performance') %>% pull(purchase) %>% sum
cost_full_funnel = branding_ad_cost + performance_ad_cost

print(paste("branding ad cost:", branding_ad_cost))
print(paste("performance ad cost:", performance_ad_cost))
print(paste("total cost:", cost_full_funnel))

[1] "branding ad cost: 524.94"
[1] "performance ad cost: 14220"
[1] "total cost: 14744.94"


In [23]:
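# ROI of branding campaign: (incremental sales over control - cost) / cost
# total sales are compared directly, which relies on all groups having the same number of users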
roi_branding = ((df %>% filter(campaign_type == 'branding') %>% pull(sales) %>% sum - 
                 df %>% filter(campaign_type == 'control') %>% pull(sales) %>% sum) - cost_branding) / cost_branding
print(paste("branding campaign ROI:", roi_branding))

[1] "branding campaign ROI: 2.72865556544214"


In [24]:
# ROI of performance campaign
roi_performance = ((df %>% filter(campaign_type == 'performance') %>% pull(sales) %>% sum - 
                    df %>% filter(campaign_type == 'control') %>% pull(sales) %>% sum) - cost_performance) / cost_performance
print(paste("performance campaign ROI:", roi_performance))

[1] "performance campaign ROI: 3.86744432661718"


In [25]:
# ROI of brand-plus-performance campaign
roi_brand_plus_performance = ((df %>% filter(campaign_type == 'brand_plus_performance') %>% pull(sales) %>% sum - 
                    df %>% filter(campaign_type == 'control') %>% pull(sales) %>% sum) - cost_brand_plus_performance) / cost_brand_plus_performance
print(paste("brand-plus-performance campaign ROI:", roi_brand_plus_performance))

[1] "brand-plus-performance campaign ROI: 5.00648336862663"


In [26]:
# ROI of full-funnel campaign
roi_full_funnel = ((df %>% filter(campaign_type == 'full_funnel') %>% pull(sales) %>% sum - 
                    df %>% filter(campaign_type == 'control') %>% pull(sales) %>% sum) - cost_full_funnel) / cost_full_funnel
print(paste("full-funnel campaign ROI:", roi_full_funnel))

[1] "full-funnel campaign ROI: 5.36150435335783"
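
The cost and ROI cells above repeat the same computation for each campaign, so they could be folded into one function. The sketch below is one possible way to do that; `campaign_roi` is a hypothetical helper that is not part of the notebook, and the `toy` rows are made up so that the example runs without `data.csv`:

```r
# Hypothetical helper: cost and ROI for one campaign, relative to control.
# Assumes equally sized groups, since it compares total (not per-user) sales.
campaign_roi <- function(data, campaign, cpm, cpa, control = "control") {
  in_campaign <- data$campaign_type == campaign
  # Branding ads are billed per impression (one row = one impression)
  branding_cost <- cpm / 1000 * sum(in_campaign & data$ad_type == "branding")
  # Performance ads are billed per purchase on a performance-ad visit
  performance_cost <- cpa * sum(data$purchase[in_campaign & data$ad_type == "performance"])
  cost <- branding_cost + performance_cost
  incremental_sales <- sum(data$sales[in_campaign]) -
    sum(data$sales[data$campaign_type == control])
  list(cost = cost, roi = (incremental_sales - cost) / cost)
}

# Made-up visit-level rows (not taken from data.csv)
toy <- data.frame(
  campaign_type = c(rep("control", 3), rep("full_funnel", 4)),
  ad_type = c("none", "none", "none", "branding", "branding", "performance", "performance"),
  purchase = c(0, 1, 0, 0, 0, 1, 1),
  sales = c(0, 100, 0, 0, 0, 100, 100)
)

res <- campaign_roi(toy, "full_funnel", cpm = 30, cpa = 10)
print(res)
```

Applied to the notebook's data, a call such as `campaign_roi(df, 'branding', cpm, cpa)` would be expected to reproduce the branding cost and ROI printed above, provided the column names match those used here.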


#### 4. adding a full_funnel group where funnel stages are predicted instead of known

##### high prediction accuracy (90%)

In [27]:
df_predicted = read.csv('data_predicted_high.csv')

In [28]:
df_predicted$stage_progress = ifelse(df_predicted$current_funnel_stage == df_predicted$next_funnel_stage, 0, 1)

In [29]:
# add the two predicted funnel stage columns to the other conditions so the data frames can be stacked
df_predicted = rbind(df_predicted, 
                     df %>% mutate(current_funnel_stage_predicted = current_funnel_stage,
                                  next_funnel_stage_predicted = next_funnel_stage))
df_predicted = arrange(df_predicted, campaign_type, user_id, visit)
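
Stacking with `rbind` works here because both data frames end up with the same set of column names: the known-stage rows receive copies of their true stages in the two `_predicted` columns, and `stage_progress` has been added to both sides. For data frames, `rbind` matches columns by name rather than by position, so the column order may differ. A small sketch with hypothetical data frames `a` and `b`:

```r
# Hypothetical rows from a "predicted" condition: true and predicted stage both present
a <- data.frame(user_id = c(1, 2),
                stage = c("aware", "consider"),
                stage_predicted = c("aware", "aware"))

# Hypothetical rows from a known-stage condition: only the true stage is present
b <- data.frame(user_id = c(3, 4),
                stage = c("not aware", "aware"))

# Fill the missing column with the true stage, as the notebook does
b$stage_predicted <- b$stage

# Shuffle the column order to show that rbind() matches by name
b <- b[, c("stage_predicted", "user_id", "stage")]

stacked <- rbind(a, b)
print(stacked)
```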

In [30]:
# control group as the baseline
df_predicted$campaign_type = as.factor(df_predicted$campaign_type)
df_predicted$campaign_type <- relevel(df_predicted$campaign_type, ref = 'control')

In [31]:
# no ad as the baseline
df_predicted$ad_type = as.factor(df_predicted$ad_type)
df_predicted$ad_type <- relevel(df_predicted$ad_type, ref = 'none')

In [32]:
head(df_predicted)

| | user_id | current_funnel_stage | current_funnel_stage_predicted | next_funnel_stage | next_funnel_stage_predicted | ad_type | purchase | sales | date | campaign_type | visit | stage_progress |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<fct>` | `<int>` | `<int>` | `<chr>` | `<fct>` | `<int>` | `<dbl>` |
| 1 | 30001 | not aware | not aware | not aware | not aware | branding | 0 | 0 | 2025-05-13 | brand_plus_performance | 1 | 0 |
| 2 | 30001 | not aware | not aware | aware | aware | branding | 0 | 0 | 2025-05-18 | brand_plus_performance | 2 | 1 |
| 3 | 30001 | aware | aware | aware | aware | performance | 0 | 0 | 2025-05-19 | brand_plus_performance | 3 | 0 |
| 4 | 30001 | aware | aware | aware | aware | performance | 0 | 0 | 2025-05-20 | brand_plus_performance | 4 | 0 |
| 5 | 30002 | aware | aware | aware | aware | branding | 0 | 0 | 2025-05-07 | brand_plus_performance | 1 | 0 |
| 6 | 30002 | aware | aware | aware | aware | branding | 0 | 0 | 2025-05-10 | brand_plus_performance | 2 | 0 |


In [33]:
# purchase probability as the outcome (all campaigns)
# intercept shows purchase probability without any ad
# data is aggregated from user-visit level to user level
model = lm(purchase ~ campaign_type, 
           df_predicted %>% 
           group_by(user_id) %>% 
           mutate(purchase = sum(purchase)) %>% 
           distinct(user_id, .keep_all = T))
summary(model)


Call:
lm(formula = purchase ~ campaign_type, data = df_predicted %>% 
    group_by(user_id) %>% mutate(purchase = sum(purchase)) %>% 
    distinct(user_id, .keep_all = T))

Residuals:
    Min      1Q  Median      3Q     Max 
-0.1422 -0.1378 -0.0943 -0.0484  0.9516 

Coefficients:
                                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)                         0.048400   0.002925  16.547   <2e-16 ***
campaign_typebrand_plus_performance 0.052900   0.004137  12.788   <2e-16 ***
campaign_typebranding               0.004400   0.004137   1.064    0.287    
campaign_typefull_funnel            0.093800   0.004137  22.675   <2e-16 ***
campaign_typefull_funnel_predicted  0.089400   0.004137  21.612   <2e-16 ***
campaign_typeperformance            0.045900   0.004137  11.096   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2925 on 59994 degrees of freedom
Multiple R-squared:  0.01543,	Adjusted R-squared:  

In [34]:
# purchase probability as the outcome (directly comparing two campaigns)
model = lm(purchase ~ campaign_type, 
           df_predicted %>% 
           group_by(user_id) %>% 
           mutate(purchase = sum(purchase)) %>% 
           distinct(user_id, .keep_all = T) %>%
           filter(campaign_type %in% c('full_funnel_predicted', 'full_funnel')))
summary(model)


Call:
lm(formula = purchase ~ campaign_type, data = df_predicted %>% 
    group_by(user_id) %>% mutate(purchase = sum(purchase)) %>% 
    distinct(user_id, .keep_all = T) %>% filter(campaign_type %in% 
    c("full_funnel_predicted", "full_funnel")))

Residuals:
    Min      1Q  Median      3Q     Max 
-0.1422 -0.1422 -0.1378 -0.1378  0.8622 

Coefficients:
                                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)                         0.142200   0.003470  40.980   <2e-16 ***
campaign_typefull_funnel_predicted -0.004400   0.004907  -0.897     0.37    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.347 on 19998 degrees of freedom
Multiple R-squared:  4.02e-05,	Adjusted R-squared:  -9.804e-06 
F-statistic: 0.8039 on 1 and 19998 DF,  p-value: 0.3699


In [35]:
# sales as the outcome (all campaigns)
# intercept shows average sales without any ad
# data is aggregated from user-visit level to user level
model = lm(sales ~ campaign_type, 
           df_predicted %>% 
           group_by(user_id) %>% 
           mutate(sales = sum(sales)) %>% 
           distinct(user_id, .keep_all = T))
summary(model)


Call:
lm(formula = sales ~ campaign_type, data = df_predicted %>% group_by(user_id) %>% 
    mutate(sales = sum(sales)) %>% distinct(user_id, .keep_all = T))

Residuals:
   Min     1Q Median     3Q    Max 
-14.22 -13.78  -9.43  -4.84  95.16 

Coefficients:
                                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)                           4.8400     0.2925  16.547   <2e-16 ***
campaign_typebrand_plus_performance   5.2900     0.4137  12.788   <2e-16 ***
campaign_typebranding                 0.4400     0.4137   1.064    0.287    
campaign_typefull_funnel              9.3800     0.4137  22.675   <2e-16 ***
campaign_typefull_funnel_predicted    8.9400     0.4137  21.612   <2e-16 ***
campaign_typeperformance              4.5900     0.4137  11.096   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 29.25 on 59994 degrees of freedom
Multiple R-squared:  0.01543,	Adjusted R-squared:  0.01535 
F-statistic:   1

In [36]:
# sales as the outcome (directly comparing two campaigns)
model = lm(sales ~ campaign_type, 
           df_predicted %>% 
           group_by(user_id) %>% 
           mutate(sales = sum(sales)) %>% 
           distinct(user_id, .keep_all = T) %>%
           filter(campaign_type %in% c('full_funnel_predicted', 'full_funnel')))
summary(model)


Call:
lm(formula = sales ~ campaign_type, data = df_predicted %>% group_by(user_id) %>% 
    mutate(sales = sum(sales)) %>% distinct(user_id, .keep_all = T) %>% 
    filter(campaign_type %in% c("full_funnel_predicted", "full_funnel")))

Residuals:
   Min     1Q Median     3Q    Max 
-14.22 -14.22 -13.78 -13.78  86.22 

Coefficients:
                                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)                         14.2200     0.3470  40.980   <2e-16 ***
campaign_typefull_funnel_predicted  -0.4400     0.4907  -0.897     0.37    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 34.7 on 19998 degrees of freedom
Multiple R-squared:  4.02e-05,	Adjusted R-squared:  -9.804e-06 
F-statistic: 0.8039 on 1 and 19998 DF,  p-value: 0.3699


In [37]:
# cost of predicted full-funnel campaign
branding_ad_cost = cpm/1000 * df_predicted %>% filter(campaign_type == 'full_funnel_predicted', ad_type == 'branding') %>% nrow
performance_ad_cost = cpa * df_predicted %>% filter(campaign_type == 'full_funnel_predicted', ad_type == 'performance') %>% pull(purchase) %>% sum
cost_full_funnel_predicted = branding_ad_cost + performance_ad_cost

print(paste("branding ad cost:", branding_ad_cost))
print(paste("performance ad cost:", performance_ad_cost))
print(paste("total cost:", cost_full_funnel_predicted))

[1] "branding ad cost: 524.34"
[1] "performance ad cost: 13470"
[1] "total cost: 13994.34"


In [38]:
# ROI of predicted full-funnel campaign
roi_full_funnel_predicted = ((df_predicted %>% filter(campaign_type == 'full_funnel_predicted') %>% pull(sales) %>% sum - 
                    df %>% filter(campaign_type == 'control') %>% pull(sales) %>% sum) - cost_full_funnel_predicted) / cost_full_funnel_predicted
print(paste("predicted full-funnel campaign ROI:", roi_full_funnel_predicted))

[1] "predicted full-funnel campaign ROI: 5.3882969829231"


##### medium prediction accuracy (60%)

In [39]:
df_predicted = read.csv('data_predicted_medium.csv')

In [40]:
df_predicted$stage_progress = ifelse(df_predicted$current_funnel_stage == df_predicted$next_funnel_stage, 0, 1)

In [41]:
df_predicted = rbind(df_predicted, 
                     df %>% mutate(current_funnel_stage_predicted = current_funnel_stage,
                                  next_funnel_stage_predicted = next_funnel_stage))
df_predicted = arrange(df_predicted, campaign_type, user_id, visit)

In [42]:
# control group as the baseline
df_predicted$campaign_type = as.factor(df_predicted$campaign_type)
df_predicted$campaign_type <- relevel(df_predicted$campaign_type, ref = 'control')

In [43]:
# no ad as the baseline
df_predicted$ad_type = as.factor(df_predicted$ad_type)
df_predicted$ad_type <- relevel(df_predicted$ad_type, ref = 'none')

In [44]:
# purchase probability as the outcome (all campaigns)
# intercept shows purchase probability without any ad
# data is aggregated from user-visit level to user level
model = lm(purchase ~ campaign_type, 
           df_predicted %>% 
           group_by(user_id) %>% 
           mutate(purchase = sum(purchase)) %>% 
           distinct(user_id, .keep_all = T))
summary(model)


Call:
lm(formula = purchase ~ campaign_type, data = df_predicted %>% 
    group_by(user_id) %>% mutate(purchase = sum(purchase)) %>% 
    distinct(user_id, .keep_all = T))

Residuals:
    Min      1Q  Median      3Q     Max 
-0.1422 -0.1097 -0.0943 -0.0484  0.9516 

Coefficients:
                                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)                         0.048400   0.002864  16.898   <2e-16 ***
campaign_typebrand_plus_performance 0.052900   0.004051  13.060   <2e-16 ***
campaign_typebranding               0.004400   0.004051   1.086    0.277    
campaign_typefull_funnel            0.093800   0.004051  23.157   <2e-16 ***
campaign_typefull_funnel_predicted  0.061300   0.004051  15.134   <2e-16 ***
campaign_typeperformance            0.045900   0.004051  11.332   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2864 on 59994 degrees of freedom
Multiple R-squared:  0.01276,	Adjusted R-squared:  

In [45]:
# purchase probability as the outcome (directly comparing two campaigns)
model = lm(purchase ~ campaign_type, 
           df_predicted %>% 
           group_by(user_id) %>% 
           mutate(purchase = sum(purchase)) %>% 
           distinct(user_id, .keep_all = T) %>%
           filter(campaign_type %in% c('full_funnel_predicted', 'full_funnel')))
summary(model)


Call:
lm(formula = purchase ~ campaign_type, data = df_predicted %>% 
    group_by(user_id) %>% mutate(purchase = sum(purchase)) %>% 
    distinct(user_id, .keep_all = T) %>% filter(campaign_type %in% 
    c("full_funnel_predicted", "full_funnel")))

Residuals:
    Min      1Q  Median      3Q     Max 
-0.1422 -0.1422 -0.1097 -0.1097  0.8903 

Coefficients:
                                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)                         0.142200   0.003314  42.907  < 2e-16 ***
campaign_typefull_funnel_predicted -0.032500   0.004687  -6.934 4.21e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3314 on 19998 degrees of freedom
Multiple R-squared:  0.002399,	Adjusted R-squared:  0.002349 
F-statistic: 48.08 on 1 and 19998 DF,  p-value: 4.208e-12


In [46]:
# sales as the outcome (all campaigns)
# intercept shows average sales without any ad
# data is aggregated from user-visit level to user level
model = lm(sales ~ campaign_type, 
           df_predicted %>% 
           group_by(user_id) %>% 
           mutate(sales = sum(sales)) %>% 
           distinct(user_id, .keep_all = T))
summary(model)


Call:
lm(formula = sales ~ campaign_type, data = df_predicted %>% group_by(user_id) %>% 
    mutate(sales = sum(sales)) %>% distinct(user_id, .keep_all = T))

Residuals:
   Min     1Q Median     3Q    Max 
-14.22 -10.97  -9.43  -4.84  95.16 

Coefficients:
                                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)                           4.8400     0.2864  16.898   <2e-16 ***
campaign_typebrand_plus_performance   5.2900     0.4051  13.060   <2e-16 ***
campaign_typebranding                 0.4400     0.4051   1.086    0.277    
campaign_typefull_funnel              9.3800     0.4051  23.157   <2e-16 ***
campaign_typefull_funnel_predicted    6.1300     0.4051  15.134   <2e-16 ***
campaign_typeperformance              4.5900     0.4051  11.332   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 28.64 on 59994 degrees of freedom
Multiple R-squared:  0.01276,	Adjusted R-squared:  0.01268 
F-statistic: 155

In [47]:
# sales as the outcome (directly comparing two campaigns)
model = lm(sales ~ campaign_type, 
           df_predicted %>% 
           group_by(user_id) %>% 
           mutate(sales = sum(sales)) %>% 
           distinct(user_id, .keep_all = T) %>%
           filter(campaign_type %in% c('full_funnel_predicted', 'full_funnel')))
summary(model)


Call:
lm(formula = sales ~ campaign_type, data = df_predicted %>% group_by(user_id) %>% 
    mutate(sales = sum(sales)) %>% distinct(user_id, .keep_all = T) %>% 
    filter(campaign_type %in% c("full_funnel_predicted", "full_funnel")))

Residuals:
   Min     1Q Median     3Q    Max 
-14.22 -14.22 -10.97 -10.97  89.03 

Coefficients:
                                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)                         14.2200     0.3314  42.907  < 2e-16 ***
campaign_typefull_funnel_predicted  -3.2500     0.4687  -6.934 4.21e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 33.14 on 19998 degrees of freedom
Multiple R-squared:  0.002399,	Adjusted R-squared:  0.002349 
F-statistic: 48.08 on 1 and 19998 DF,  p-value: 4.208e-12


In [48]:
# cost of predicted full-funnel campaign
branding_ad_cost = cpm/1000 * df_predicted %>% filter(campaign_type == 'full_funnel_predicted', ad_type == 'branding') %>% nrow
performance_ad_cost = cpa * df_predicted %>% filter(campaign_type == 'full_funnel_predicted', ad_type == 'performance') %>% pull(purchase) %>% sum
cost_full_funnel_predicted = branding_ad_cost + performance_ad_cost

print(paste("branding ad cost:", branding_ad_cost))
print(paste("performance ad cost:", performance_ad_cost))
print(paste("total cost:", cost_full_funnel_predicted))

[1] "branding ad cost: 484.59"
[1] "performance ad cost: 9750"
[1] "total cost: 10234.59"


In [49]:
# ROI of predicted full-funnel campaign
roi_full_funnel_predicted = ((df_predicted %>% filter(campaign_type == 'full_funnel_predicted') %>% pull(sales) %>% sum - 
                    df %>% filter(campaign_type == 'control') %>% pull(sales) %>% sum) - cost_full_funnel_predicted) / cost_full_funnel_predicted
print(paste("predicted full-funnel campaign ROI:", roi_full_funnel_predicted))

[1] "predicted full-funnel campaign ROI: 4.98949249554696"


##### low prediction accuracy (30%)

In [50]:
df_predicted = read.csv('data_predicted_low.csv')

In [51]:
df_predicted$stage_progress = ifelse(df_predicted$current_funnel_stage == df_predicted$next_funnel_stage, 0, 1)

In [52]:
df_predicted = rbind(df_predicted, 
                     df %>% mutate(current_funnel_stage_predicted = current_funnel_stage,
                                  next_funnel_stage_predicted = next_funnel_stage))
df_predicted = arrange(df_predicted, campaign_type, user_id, visit)

In [53]:
# control group as the baseline
df_predicted$campaign_type = as.factor(df_predicted$campaign_type)
df_predicted$campaign_type <- relevel(df_predicted$campaign_type, ref = 'control')

In [54]:
# no ad as the baseline
df_predicted$ad_type = as.factor(df_predicted$ad_type)
df_predicted$ad_type <- relevel(df_predicted$ad_type, ref = 'none')

In [55]:
# purchase probability as the outcome (all campaigns)
# the intercept is the purchase probability in the control group (no ads)
# data are aggregated from the user-visit level to the user level
model = lm(purchase ~ campaign_type, 
           df_predicted %>% 
           group_by(user_id) %>% 
           mutate(purchase = sum(purchase)) %>% 
           distinct(user_id, .keep_all = T))
summary(model)


Call:
lm(formula = purchase ~ campaign_type, data = df_predicted %>% 
    group_by(user_id) %>% mutate(purchase = sum(purchase)) %>% 
    distinct(user_id, .keep_all = T))

Residuals:
    Min      1Q  Median      3Q     Max 
-0.1422 -0.1013 -0.0859 -0.0528  0.9516 

Coefficients:
                                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)                         0.048400   0.002808  17.237   <2e-16 ***
campaign_typebrand_plus_performance 0.052900   0.003971  13.322   <2e-16 ***
campaign_typebranding               0.004400   0.003971   1.108    0.268    
campaign_typefull_funnel            0.093800   0.003971  23.621   <2e-16 ***
campaign_typefull_funnel_predicted  0.037500   0.003971   9.443   <2e-16 ***
campaign_typeperformance            0.045900   0.003971  11.559   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2808 on 59994 degrees of freedom
Multiple R-squared:  0.01245,	Adjusted R-squared:  
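
The user-level aggregation used in the models can be illustrated on a toy table. This is a minimal sketch, assuming each user belongs to exactly one campaign; the `visits` data frame and its values are hypothetical and do not come from `data.csv`. It shows the `mutate()` + `distinct()` pattern from the notebook next to a `summarise()` form that gives the same per-user totals.

```r
library(dplyr)

# Toy user-visit data (hypothetical values, not from data.csv)
visits <- data.frame(
  user_id       = c(1, 1, 1, 2, 2, 3),
  campaign_type = c("control", "control", "control",
                    "full_funnel", "full_funnel", "full_funnel"),
  purchase      = c(0, 0, 1, 0, 0, 0)
)

# Pattern used in the notebook: overwrite purchase with the per-user total,
# then keep the first row of each user
users_a <- visits %>%
  group_by(user_id) %>%
  mutate(purchase = sum(purchase)) %>%
  distinct(user_id, .keep_all = TRUE) %>%
  ungroup()

# summarise() form: one row per user and campaign
users_b <- visits %>%
  group_by(user_id, campaign_type) %>%
  summarise(purchase = sum(purchase), .groups = "drop")

print(users_a)
print(users_b)
```

The two forms agree only because a user never appears in more than one campaign; if that assumption failed, `distinct(user_id, ...)` would keep a single campaign label per user while `summarise()` would return one row per user and campaign.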

In [56]:
# purchase probability as the outcome (directly comparing two campaigns)
model = lm(purchase ~ campaign_type, 
           df_predicted %>% 
           group_by(user_id) %>% 
           mutate(purchase = sum(purchase)) %>% 
           distinct(user_id, .keep_all = T) %>%
           filter(campaign_type %in% c('full_funnel_predicted', 'full_funnel')))
summary(model)


Call:
lm(formula = purchase ~ campaign_type, data = df_predicted %>% 
    group_by(user_id) %>% mutate(purchase = sum(purchase)) %>% 
    distinct(user_id, .keep_all = T) %>% filter(campaign_type %in% 
    c("full_funnel_predicted", "full_funnel")))

Residuals:
    Min      1Q  Median      3Q     Max 
-0.1422 -0.1422 -0.0859 -0.0859  0.9141 

Coefficients:
                                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)                         0.142200   0.003166   44.91   <2e-16 ***
campaign_typefull_funnel_predicted -0.056300   0.004478  -12.57   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3166 on 19998 degrees of freedom
Multiple R-squared:  0.007842,	Adjusted R-squared:  0.007793 
F-statistic: 158.1 on 1 and 19998 DF,  p-value: < 2.2e-16
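
With a single categorical regressor, the OLS coefficients are differences between group means, so the two-campaign model can be cross-checked against the all-campaign model. The sketch below only re-uses the estimates printed in the outputs above; the variable names are illustrative.

```r
# Estimates copied from the all-campaign purchase model
intercept_control <- 0.0484   # purchase probability in the control group
b_full_funnel     <- 0.0938   # full_funnel vs. control
b_predicted       <- 0.0375   # full_funnel_predicted vs. control

# Implied purchase probability of each campaign
mean_full_funnel <- intercept_control + b_full_funnel
mean_predicted   <- intercept_control + b_predicted

# Difference between the two campaigns, which should match the coefficient
# of the two-campaign model (where full_funnel is the baseline)
diff_predicted_vs_full <- mean_predicted - mean_full_funnel

print(c(full_funnel = mean_full_funnel,
        predicted   = mean_predicted,
        difference  = diff_predicted_vs_full))
```

The implied values (0.1422 for `full_funnel` and a difference of -0.0563) agree with the intercept and coefficient of the two-campaign model.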


In [57]:
# sales as the outcome (all campaigns)
# the intercept is the average sales per user in the control group (no ads)
# data are aggregated from the user-visit level to the user level
model = lm(sales ~ campaign_type, 
           df_predicted %>% 
           group_by(user_id) %>% 
           mutate(sales = sum(sales)) %>% 
           distinct(user_id, .keep_all = T))
summary(model)


Call:
lm(formula = sales ~ campaign_type, data = df_predicted %>% group_by(user_id) %>% 
    mutate(sales = sum(sales)) %>% distinct(user_id, .keep_all = T))

Residuals:
   Min     1Q Median     3Q    Max 
-14.22 -10.13  -8.59  -5.28  95.16 

Coefficients:
                                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)                           4.8400     0.2808  17.237   <2e-16 ***
campaign_typebrand_plus_performance   5.2900     0.3971  13.322   <2e-16 ***
campaign_typebranding                 0.4400     0.3971   1.108    0.268    
campaign_typefull_funnel              9.3800     0.3971  23.621   <2e-16 ***
campaign_typefull_funnel_predicted    3.7500     0.3971   9.443   <2e-16 ***
campaign_typeperformance              4.5900     0.3971  11.559   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 28.08 on 59994 degrees of freedom
Multiple R-squared:  0.01245,	Adjusted R-squared:  0.01237 
F-statistic: 151

In [58]:
# sales as the outcome (directly comparing two campaigns)
model = lm(sales ~ campaign_type, 
           df_predicted %>% 
           group_by(user_id) %>% 
           mutate(sales = sum(sales)) %>% 
           distinct(user_id, .keep_all = T) %>%
           filter(campaign_type %in% c('full_funnel_predicted', 'full_funnel')))
summary(model)


Call:
lm(formula = sales ~ campaign_type, data = df_predicted %>% group_by(user_id) %>% 
    mutate(sales = sum(sales)) %>% distinct(user_id, .keep_all = T) %>% 
    filter(campaign_type %in% c("full_funnel_predicted", "full_funnel")))

Residuals:
   Min     1Q Median     3Q    Max 
-14.22 -14.22  -8.59  -8.59  91.41 

Coefficients:
                                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)                         14.2200     0.3166   44.91   <2e-16 ***
campaign_typefull_funnel_predicted  -5.6300     0.4478  -12.57   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 31.66 on 19998 degrees of freedom
Multiple R-squared:  0.007842,	Adjusted R-squared:  0.007793 
F-statistic: 158.1 on 1 and 19998 DF,  p-value: < 2.2e-16


In [59]:
# cost of predicted full-funnel campaign
# branding cost: cpm per 1,000 branding impressions; performance cost: cpa per purchase on a performance-ad visit
branding_ad_cost = cpm/1000 * (df_predicted %>% filter(campaign_type == 'full_funnel_predicted', ad_type == 'branding') %>% nrow())
performance_ad_cost = cpa * (df_predicted %>% filter(campaign_type == 'full_funnel_predicted', ad_type == 'performance') %>% pull(purchase) %>% sum())
cost_full_funnel_predicted = branding_ad_cost + performance_ad_cost

print(paste("branding ad cost:", branding_ad_cost))
print(paste("performance ad cost:", performance_ad_cost))
print(paste("total cost:", cost_full_funnel_predicted))

[1] "branding ad cost: 373.86"
[1] "performance ad cost: 6740"
[1] "total cost: 7113.86"
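
Since the same cost calculation is repeated for every accuracy level, it could be wrapped in a small function. The following is a sketch only: `campaign_cost()` is a hypothetical helper that is not part of the notebook, and the toy data and the `cpm` and `cpa` values passed to it are made up for illustration.

```r
library(dplyr)

# Hypothetical helper: ad cost of one campaign.
# Branding ads cost cpm per 1,000 impressions (rows);
# performance ads cost cpa per purchase on a performance-ad visit.
campaign_cost <- function(data, campaign, cpm, cpa) {
  ads <- data %>% filter(campaign_type == campaign)
  branding_cost    <- cpm / 1000 * sum(ads$ad_type == "branding")
  performance_cost <- cpa * sum(ads$purchase[ads$ad_type == "performance"])
  c(branding    = branding_cost,
    performance = performance_cost,
    total       = branding_cost + performance_cost)
}

# Toy data (hypothetical): 2 branding impressions, 1 purchase after a performance ad
toy <- data.frame(
  campaign_type = rep("full_funnel_predicted", 5),
  ad_type       = c("branding", "branding", "performance", "performance", "none"),
  purchase      = c(0, 0, 1, 0, 0)
)

costs <- campaign_cost(toy, "full_funnel_predicted", cpm = 10, cpa = 5)
print(costs)
```

In the notebook, a call such as `campaign_cost(df_predicted, 'full_funnel_predicted', cpm, cpa)` would be expected to return the three numbers printed above, assuming `cpm` and `cpa` are defined as in the earlier cells.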


In [60]:
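# ROI of predicted full-funnel campaign
# ROI = (incremental sales over control - campaign cost) / campaign cost
# total sales are compared directly, which assumes both groups have the same number of users
sales_full_funnel_predicted = df_predicted %>% filter(campaign_type == 'full_funnel_predicted') %>% pull(sales) %>% sum()
sales_control = df %>% filter(campaign_type == 'control') %>% pull(sales) %>% sum()
roi_full_funnel_predicted = (sales_full_funnel_predicted - sales_control - cost_full_funnel_predicted) / cost_full_funnel_predicted
print(paste("predicted full-funnel campaign ROI:", roi_full_funnel_predicted))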

[1] "predicted full-funnel campaign ROI: 4.27139977452466"
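
The ROI above can be reproduced from the regression output and the cost figure alone. The sketch below assumes 10,000 users per campaign, which is inferred from the degrees of freedom of the two-campaign models (19998 + 2 = 20,000 users across two groups) rather than stated in the notebook; the variable names are illustrative.

```r
# Assumption: users per campaign, inferred from the models' degrees of freedom
n_users <- 10000

# Average sales per user, taken from the all-campaign sales model
mean_sales_control   <- 4.84          # intercept
mean_sales_predicted <- 4.84 + 3.75   # intercept + full_funnel_predicted coefficient

# Total cost of the predicted full-funnel campaign (low accuracy), as printed above
cost <- 7113.86

# ROI = (incremental sales over control - cost) / cost
incremental_sales <- (mean_sales_predicted - mean_sales_control) * n_users
roi <- (incremental_sales - cost) / cost

print(round(roi, 4))
```

This gives an ROI of about 4.2714, in line with the value printed by the notebook.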
