# Media Mix Modeling for B2B: Data Preprocessing

In [1]:
### 0. Load Required Packages and Utility Functions
defaultW = getOption("warn") 
options(warn = -1) 

suppressPackageStartupMessages({
#     library(rstan)
#     options(mc.cores = parallel::detectCores())
#     rstan_options(auto_write = TRUE)
#     Sys.setenv(LOCAL_CPPFLAGS = '-march=native')

#     library(brms)
#     library(lme4)
#     library(here)
    library(tidyverse)
#     library(tidybayes)
#     library(bayesplot)
    library(lubridate)
    library(zoo)
    library(corrr)
    library(ggdendro)
    library(ggpubr)
})

source("Packages/Utility_Functions_MMM.R")

In [2]:
### 1. df_XyZ: Target/Media/Control Features
df_XyZ <- read.csv("Data/B2B_data_30DMAs_Q321_refresh_v3.csv", stringsAsFactors = FALSE)

df_XyZ <- df_XyZ %>% dplyr::mutate(period = lubridate::as_date(period))


names(df_XyZ)
glimpse(df_XyZ)

Rows: 5,850
Columns: 79
$ brand                         <chr> "Suddenlink", "Suddenlink", "Suddenlink"…
$ dma                           <chr> "ABILENE, TX", "ABILENE, TX", "ABILENE, …
$ period                        <date> 2017-12-31, 2018-01-07, 2018-01-14, 201…
$ gross_add                     <int> 7, 8, 13, 12, 11, 13, 17, 6, 11, 7, 13, …
$ sub_comm                      <int> 4782, 4782, 4782, 4782, 4782, 4782, 4782…
$ non_sub_comm                  <int> 3581, 3581, 3581, 3581, 3581, 3581, 3581…
$ homespassed_comm              <int> 8363, 8363, 8363, 8363, 8363, 8363, 8363…
$ sub_resi                      <int> 44507, 44507, 44507, 44507, 44507, 44507…
$ non_sub_resi                  <int> 39851, 39851, 39851, 39851, 39851, 39851…
$ homespassed_resi              <int> 84358, 84358, 84358, 84358, 

In [3]:
#################################################################################
### 1. Pre-Processing                                                         ###
###                                                                           ###
### (1) Make `df_XyZ`: Target/Media/Control Features.                         ###
### (2) Make `df_seasonal`: Sales Seasonalities.                              ###
### (3) Make `df_norm`: Subscribers/Non-Subscribers by DMA.                   ###
### (4) Create Inputs to `preprocess_MMM`.                                    ###
### (5) Conduct `preprocess_MMM`.                                             ###
###                                                                           ###
#################################################################################

## New Control Variables
1. Search Spend
    - Brand
    - Non-Brand
    - Competitive
2. Big Holidays
    - New Year's, MLK, Presidents, Memorial, Independence, Thanksgiving, X-Mas
    - 12+ More Holidays
3. Promotions
    - Optimum: Flash Sales & Fall Sales in 2018
    - Suddenlink: Flash Sales 2018, Summer Sales 2018 & 2019, Fall Sales 2018
4. Web Traffic
    - Structural Break: `2019-01-01`
    - Website Redesign: `2018-07-09`

# Preprocessing

## 1. Create additional features from raw competition and Google search indexes

In [4]:
comp_raw_data <- read.csv("Data/competition_API_add_brand.csv", stringsAsFactors = FALSE)
glimpse(comp_raw_data)



Rows: 195
Columns: 6
$ period                 <chr> "2017-12-31", "2018-01-07", "2018-01-14", "2018…
$ X.frontier.business.   <int> 2, 2, 7, 4, 4, 2, 4, 6, 2, 4, 9, 4, 4, 2, 2, 4,…
$ X.optimum.business.    <int> 9, 11, 4, 11, 0, 4, 6, 4, 0, 9, 4, 9, 2, 4, 7, …
$ X.verizon.business.    <int> 50, 68, 55, 56, 73, 60, 73, 64, 49, 73, 41, 57,…
$ X.ATT.business.        <int> 18, 26, 15, 26, 24, 19, 24, 28, 38, 34, 21, 15,…
$ X.suddenlink.business. <int> 5, 2, 2, 2, 2, 2, 0, 0, 2, 2, 4, 4, 2, 0, 2, 0,…


In [5]:
comp_raw_data <- comp_raw_data %>%
                 # "X.ATT.business." -> "ATT_business": drop the leading "X.", turn dots into spaces, trim, then underscore
                 rename_all(.funs = function(x){trimws(gsub("^X\\.|\\.", " ", x))}) %>%
                 rename_all(.funs = function(x){gsub(" ","_",x)}) %>%
                 mutate(period = as.Date(period),
                        # Relative search-interest gap between a competitor and the own brand
                        opt_comp = (verizon_business-optimum_business)/optimum_business,
                        sdl_comp_front = (frontier_business-suddenlink_business)/suddenlink_business,
                        sdl_comp_att = (ATT_business - suddenlink_business)/suddenlink_business) %>%
                 # A zero own-brand index gives Inf (or NaN for 0/0); reset both to 0
                 mutate_if(is.numeric, function(x) ifelse(is.infinite(x), 0, x)) %>%
                 replace_na(list('sdl_comp_front' = 0,
                                 'opt_comp' = 0,
                                 'sdl_comp_att' = 0))
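
The relative gaps in the cell above divide by the brand's own search index, which is 0 in some weeks (see `X.optimum.business.` and `X.suddenlink.business.` in the earlier output). The following is a minimal base-R sketch of what happens in those weeks. The index values are made up for illustration, and the single `is.finite()` step stands in for the two cleanup steps used in the notebook.

```r
# Toy weekly search indexes (made-up values, not taken from the source file)
competitor <- c(50, 68, 0, 73)
own_brand  <- c(9, 11, 0, 0)

# Relative gap: (competitor - own) / own
gap <- (competitor - own_brand) / own_brand
# Week 3 is 0/0, which R evaluates to NaN; week 4 is 73/0, which is Inf

# Reset every non-finite value (Inf, -Inf, NaN, NA) to 0
gap_clean <- ifelse(is.finite(gap), gap, 0)
print(gap_clean)
```

Filling with 0 treats a week with no own-brand searches as a week with no competitive gap. Whether that is the right reading is a modeling choice rather than something the code can decide.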

In [6]:
google_comp_cols_vec <- c('google_altice_search_idx','google_comp_search_idx','google_non_brand_search_idx')

In [7]:
derived_competition_vars <- c('comp_diff_c_a','comp_diff_a_nb','comp_diff_c_nb','comp_index','comp_ratio_c_nb','comp_ratio_a_nb','comp_ratio_c_a','comp_index2')

In [8]:
## Join the raw competition data to the main data frame and create new features
## In the Q3 2021 refresh, competition_correct is ignored because the competition data in the source file and the separate file are the same
df_XyZ <- df_XyZ %>%
             left_join(comp_raw_data, by = c('period')) %>%
             mutate(competition_alt = ifelse(dma == 'NEW YORK, NY',opt_comp,sdl_comp_front)) %>%
                    #comp_correct = ifelse(dma == 'NEW YORK, NY',opt_comp,sdl_comp_att)
             as.data.frame()
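
The join above attaches a national weekly series to every DMA row that shares the same `period`, then picks one of the series depending on the footprint. A small base-R sketch of the same idea, using `match()` as a stand-in for `left_join()`; the two tables and their values are invented for illustration.

```r
# Toy panel: two DMAs observed over the same two weeks (made-up rows)
panel <- data.frame(
  dma    = c("NEW YORK, NY", "NEW YORK, NY", "ABILENE, TX", "ABILENE, TX"),
  period = as.Date(c("2018-01-07", "2018-01-14", "2018-01-07", "2018-01-14")),
  stringsAsFactors = FALSE
)

# National weekly series, one row per week (made-up values)
weekly <- data.frame(
  period         = as.Date(c("2018-01-07", "2018-01-14")),
  opt_comp       = c(5.2, 12.8),
  sdl_comp_front = c(0.0, 2.5)
)

# Left-join equivalent: find each panel row's week in the weekly table
idx <- match(panel$period, weekly$period)
panel$opt_comp       <- weekly$opt_comp[idx]
panel$sdl_comp_front <- weekly$sdl_comp_front[idx]

# Choose the series by footprint, as competition_alt does
panel$competition_alt <- ifelse(panel$dma == "NEW YORK, NY",
                                panel$opt_comp, panel$sdl_comp_front)
print(panel)
```

Because the weekly table has exactly one row per `period`, the row count of the panel is unchanged. A duplicated week in the weekly table would multiply rows in a real `left_join()`, so checking `nrow()` before and after the join is a cheap safeguard.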

In [9]:
##Additional features of competition from google search indexes
df_XyZ <-   df_XyZ %>%
            dplyr::mutate(comp_diff_c_a = google_comp_search_idx - google_altice_search_idx, 
                          comp_diff_a_nb = google_altice_search_idx - google_non_brand_search_idx,
                          comp_diff_c_nb = google_comp_search_idx - google_non_brand_search_idx,
                          comp_index = comp_diff_c_a/google_altice_search_idx,
                          #comp_index2 = comp_diff_b_nb/comp_diff_g_nb,
                          comp_ratio_c_nb = google_comp_search_idx/google_non_brand_search_idx,
                          comp_ratio_a_nb = google_altice_search_idx/google_non_brand_search_idx,
                          comp_ratio_c_a = google_comp_search_idx/google_altice_search_idx,
                          comp_index2 = ifelse(dma == 'NEW YORK, NY',competition,comp_index))

In [10]:
df_XyZ <- df_XyZ %>%
          select(-matches('business'))

## 2. Adding AR and MA data

In [11]:
load("RData/df_ARMA_B2B.RData")
##Joining with the latest 3 years of data

# names(df_XyZ)
# glimpse(df_XyZ)

df_XyZ <- df_XyZ %>%
        dplyr::inner_join(df_ARMA_B2B, by = c("brand" = "brand", "dma" = "dma", "period" = "period")) #%>%
#         dplyr::inner_join(df_comp2, by = c("dma" = "DMA", "period" = "period"))
names(df_XyZ)
glimpse(df_XyZ)

Rows: 5,850
Columns: 95
$ brand                         <chr> "Suddenlink", "Suddenlink", "Suddenlink"…
$ dma                           <chr> "ABILENE, TX", "ABILENE, TX", "ABILENE, …
$ period                        <date> 2017-12-31, 2018-01-07, 2018-01-14, 201…
$ gross_add                     <int> 7, 8, 13, 12, 11, 13, 17, 6, 11, 7, 13, …
$ sub_comm                      <int> 4782, 4782, 4782, 4782, 4782, 4782, 4782…
$ non_sub_comm                  <int> 3581, 3581, 3581, 3581, 3581, 3581, 3581…
$ homespassed_comm              <int> 8363, 8363, 8363, 8363, 8363, 8363, 8363…
$ sub_resi                      <int> 44507, 44507, 44507, 44507, 44507, 44507…
$ non_sub_resi                  <int> 39851, 39851, 39851, 39851, 39851, 39851…
$ homespassed_resi              <int> 84358, 84358, 84358, 84358, 

In [12]:
df_XyZ %>%
summarize(oldest_period = min(period),
          latest_period = max(period))

oldest_period,latest_period
<date>,<date>
2017-12-31,2021-09-19
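
The date range above can be cross-checked against the row count. Assuming one row per DMA per week and the 30 DMAs reported in the preprocessing summary later in this notebook, a complete weekly panel should have as many rows as `glimpse()` reports (5,850). A short base-R check:

```r
first_week <- as.Date("2017-12-31")
last_week  <- as.Date("2021-09-19")

# Weekly grid from the oldest to the latest period, both ends included
weeks   <- seq(first_week, last_week, by = "week")
n_weeks <- length(weeks)

n_dmas <- 30  # taken from the preprocessing summary
print(n_weeks)           # 195
print(n_weeks * n_dmas)  # 5850
```

If the product did not match the row count, the panel would contain missing or duplicated DMA-week combinations that should be resolved before modeling.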


## 3. Holiday dummies

In [13]:
holidays_data = read.csv('Data/List_Of_Holidays_US.csv', stringsAsFactors = FALSE)
holidays_data

ds,holiday,country,year
<chr>,<chr>,<chr>,<int>
1995-01-01,New Year's Day,US,1995
1995-01-02,New Year's Day (Observed),US,1995
1995-01-16,"Martin Luther King, Jr. Day",US,1995
1995-02-20,Washington's Birthday,US,1995
1995-05-29,Memorial Day,US,1995
1995-07-04,Independence Day,US,1995
1995-09-04,Labor Day,US,1995
1995-10-09,Columbus Day,US,1995
1995-11-10,Veterans Day (Observed),US,1995
1995-11-11,Veterans Day,US,1995


In [14]:
major_holidays <- c("New Year's Day","Martin Luther King, Jr. Day","Washington's Birthday","Memorial Day","Independence Day","Thanksgiving","Christmas Day")
tg_chr <- c("Thanksgiving","Christmas Day")

In [15]:
##Major holidays excluding Thanksgiving and Christmas
major_holidays2 <- major_holidays[!(major_holidays%in%tg_chr)]
major_holidays2

In [16]:
major_holidays_str <- sprintf("(^%s)",paste(major_holidays,collapse = '|^'))
major_holidays_str

In [17]:
major_holidays_str2 <- sprintf("(^%s)",paste(major_holidays2,collapse = '|^'))
tg_chr_str <- sprintf("(^%s)",paste(tg_chr,collapse = '|^'))
tg_chr_str
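
The `sprintf()`/`paste()` combination above builds one anchored regular expression out of a vector of holiday names. A minimal sketch with a shortened list shows the resulting pattern and how the `^` anchor behaves; the label "Day after Thanksgiving" is a hypothetical entry used only to illustrate a non-match.

```r
# Shortened list of holiday names, for illustration
major_holidays <- c("New Year's Day", "Thanksgiving", "Christmas Day")

# Each name is anchored at the start of the string and joined with "|"
pattern <- sprintf("(^%s)", paste(major_holidays, collapse = "|^"))
print(pattern)
# "(^New Year's Day|^Thanksgiving|^Christmas Day)"

candidates <- c("Christmas Day",
                "New Year's Day (Observed)",
                "Labor Day",
                "Day after Thanksgiving")
print(grepl(pattern, candidates))
# TRUE TRUE FALSE FALSE
```

The anchored pattern still matches the "(Observed)" variants because they start with the holiday name, which is why those rows need a separate filter.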

In [18]:
## Subset the holidays to the study period and drop the "(Observed)" entries
## Also add the start of each holiday's week (holiday_week) so it can be matched to `period` in df_XyZ
holidays_data     <- holidays_data %>%
                     filter(!grepl('Observed',holiday)  & ds>=min(as.Date(df_XyZ$period)) & ds<=max(as.Date(df_XyZ$period))) %>%
                     mutate(holiday_week = floor_date(as.Date(ds), unit="week"))
holidays_data

ds,holiday,country,year,holiday_week
<chr>,<chr>,<chr>,<int>,<date>
2018-01-01,New Year's Day,US,2018,2017-12-31
2018-01-15,"Martin Luther King, Jr. Day",US,2018,2018-01-14
2018-02-19,Washington's Birthday,US,2018,2018-02-18
2018-05-28,Memorial Day,US,2018,2018-05-27
2018-07-04,Independence Day,US,2018,2018-07-01
2018-09-03,Labor Day,US,2018,2018-09-02
2018-10-08,Columbus Day,US,2018,2018-10-07
2018-11-11,Veterans Day,US,2018,2018-11-11
2018-11-22,Thanksgiving,US,2018,2018-11-18
2018-12-25,Christmas Day,US,2018,2018-12-23
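
In the table above, every `holiday_week` is the Sunday on or before the holiday, which is what `floor_date(unit = "week")` returns with lubridate's default week start. A dependency-free sketch of the same mapping in base R, using four dates from the table:

```r
holiday_dates <- as.Date(c("2018-01-01", "2018-07-04", "2018-11-11", "2018-11-22"))

# "%w" is the weekday as a number with Sunday = 0, so it equals
# the number of days elapsed since the most recent Sunday
days_since_sunday <- as.integer(format(holiday_dates, "%w"))
holiday_week <- holiday_dates - days_since_sunday

print(data.frame(ds = holiday_dates, holiday_week = holiday_week))
# holiday_week: 2017-12-31, 2018-07-01, 2018-11-11, 2018-11-18
```

A holiday that falls on a Sunday (Veterans Day 2018) maps to itself. This only lines up with `period` in `df_XyZ` because the modeling weeks also start on Sunday.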


In [19]:
mega_holiday = holidays_data %>%
               filter(grepl(major_holidays_str,holiday)) %>%
               select(holiday_week) %>%
               pull()

mega_holiday2 = holidays_data %>%
                   filter(grepl(major_holidays_str2,holiday)) %>%
                   select(holiday_week) %>%
                   pull()

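# Note: this overwrites the character vector `tg_chr` (holiday names) with the matching week-start dates
tg_chr = holidays_data %>%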
                   filter(grepl(tg_chr_str,holiday)) %>%
                   select(holiday_week) %>%
                   pull()


In [20]:
mega_holiday

In [21]:
mega_holiday2

In [22]:
tg_chr

In [23]:
#-------------------------------------------------#
# New control variables                           #
#-------------------------------------------------#
# Promotion weeks by brand (OPT = Optimum, SDL = Suddenlink)
# Holiday dummies use the week vectors built above
promotion_OPT = c('2018-03-25', '2018-06-24', '2018-09-16', '2019-03-31', '2019-06-23', '2019-09-08', '2019-09-29', '2020-01-26') %>% as_date()
promotion_SDL = c('2018-03-11', '2018-07-08', '2019-03-24', '2019-08-25', '2019-09-08', '2019-11-03', '2020-01-26') %>% as_date()

## A promotion was always in place (see the additional price and promo data provided), so these are the weeks in which promotions took effect in the newer periods
promotion_OPT_new = c('2020-06-14','2020-06-21','2020-07-05','2020-07-12','2020-08-23','2020-09-27','2021-01-24','2021-02-21') %>% as_date()
promotion_SDL_new = c('2020-06-14','2020-08-23','2020-11-15','2021-01-24','2021-02-21') %>% as_date()

promotion_OPT_new = c(promotion_OPT_new,c('2021-05-02','2021-05-23','2021-06-20','2021-07-25','2021-09-12') %>% as_date())
promotion_SDL_new = c(promotion_SDL_new,c('2021-05-02','2021-05-23','2021-06-27','2021-07-25','2021-08-15') %>% as_date())

##Initial covid dummy - when the COVID pandemic started
covid_period = c('2020-03-15','2020-03-22') %>% as_date()



df_XyZ <- df_XyZ %>% 
  dplyr::mutate(period        = lubridate::as_date(period),
                mega_holiday  = (period %in% mega_holiday) %>% as.numeric(),
                mega_holiday2 = (period %in% mega_holiday2) %>% as.numeric(),
                tg_chr        = (period %in% tg_chr) %>% as.numeric(),
                promotion_OPT     = ifelse(dma%in%'NEW YORK, NY' & period %in% c(promotion_OPT,promotion_OPT_new), 1, 0),
                promotion_SDL     = ifelse(!dma%in%'NEW YORK, NY' & period %in% c(promotion_SDL,promotion_SDL_new),1,0),
                promotion         = promotion_OPT + promotion_SDL,
                winter_storm  = ifelse(brand == 'Suddenlink' & period == '2021-02-14',1,0),
                post_laura = ifelse((period >= "2020-08-30" & period <= "2020-09-30") & (grepl('LA$',dma)==TRUE),1,0),
                website_visitors_p_2018 = ifelse(period >= "2018-12-30", website_visitors, 0),
                website_visitors_2018 = ifelse(period < "2018-12-30", website_visitors, 0),
                ad_messenger_campaign = ifelse(period >= "2020-12-30",1,0),
                covid_dummy = (period %in% covid_period) %>% as.numeric(),
                price_OPT = ifelse(dma%in%'NEW YORK, NY',product_price,0),
                price_SDL = ifelse(!dma%in%'NEW YORK, NY',product_price,0),
                price_ARPU_fprint = avg_tot_price_fprint,
                price_ARPU = avg_tot_price,
                price_hybrid = ifelse(dma%in%'NEW YORK, NY',avg_tot_price_fprint,product_price)) %>%
               
  dplyr::select(-c(sub_comm, non_sub_comm, homespassed_comm, sub_resi, non_sub_resi, homespassed_resi)) 

names(df_XyZ)
# df_XyZ %>% select(brand:gross_add, search_spend:website_visitors_2019)
glimpse(df_XyZ)

df_XyZ %>% distinct(dma) %>% pull()
df_XyZ %>% distinct(brand) %>% pull()

Rows: 5,850
Columns: 106
$ brand                         <chr> "Suddenlink", "Suddenlink", "Suddenlink"…
$ dma                           <chr> "ABILENE, TX", "ABILENE, TX", "ABILENE, …
$ period                        <date> 2017-12-31, 2018-01-07, 2018-01-14, 201…
$ gross_add                     <int> 7, 8, 13, 12, 11, 13, 17, 6, 11, 7, 13, …
$ dma_insertable_proj_grps      <dbl> 70.285284, 58.911590, 58.911590, 9.00103…
$ dma_insertable_ue             <dbl> 22364, 22364, 22364, 22364, 22244, 22086…
$ cross_channel_imp             <dbl> 15719, 13175, 13175, 2013, 0, 8522, 1113…
$ cross_channel_spend           <dbl> 690.85005, 579.04125, 579.04125, 88.4713…
$ dm_volume                     <int> 0, 0, 0, 0, 0, 2698, 0, 2698, 0, 2698, 0…
$ dm_spend                      <dbl> 0.00, 0.00, 0.00, 0.00, 0.0
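
The two `website_visitors_*` columns created above split one series at `2018-12-30` so that the model can estimate a separate coefficient before and after the structural break. A minimal base-R sketch of that split with made-up visitor counts; the property worth checking is that the two pieces always add back up to the original series.

```r
# Four weeks around the break date (visitor counts are invented)
period           <- as.Date(c("2018-12-16", "2018-12-23", "2018-12-30", "2019-01-06"))
website_visitors <- c(100, 120, 90, 110)

break_date <- as.Date("2018-12-30")

# Each piece carries the series on its side of the break and 0 elsewhere
visitors_pre  <- ifelse(period <  break_date, website_visitors, 0)
visitors_post <- ifelse(period >= break_date, website_visitors, 0)

print(cbind(visitors_pre, visitors_post))
# The pieces are disjoint, so their sum recovers the original series
print(all(visitors_pre + visitors_post == website_visitors))
```

Because the pieces and the original series are linearly dependent, including all three in the same regression would make the design matrix collinear; typically either the original or the two pieces are used.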

In [24]:
df_XyZ %>%
group_by(dma) %>%
summarize(max(promotion))

dma,max(promotion)
<chr>,<dbl>
"ABILENE, TX",1
"ALEXANDRIA, LA",1
ALL OTHER SUDDENLINK DMA TOTAL,1
"AMARILLO, TX",1
"AUSTIN, TX",1
"BLUEFIELD-BECKLEY, WV",1
"CHARLESTON-HUNTINGTON, WV",1
"DALLAS, TX",1
"EUREKA, CA",1
"GREENVILLE-NEW BERN, NC",1


In [25]:
data_XyZ          = df_XyZ
#save(data_XyZ, file = "RData/data_XyZ_3y_v1.RData")
# write.csv(data_XyZ,'RData/v10/data_XyZ.csv')

In [26]:
### 2. df_seasonal: Sales Seasonalities
df_seasonal = read.csv("Data/B2B_GA_2016_2021_30DMAs_0901_imputed.csv", stringsAsFactors=FALSE)
df_seasonal <- df_seasonal %>% 
  dplyr::mutate(period = lubridate::as_date(period)) %>% 
  dplyr::as_tibble()

In [27]:
## Check whether dma names are identical.
dma_XyZ      = df_XyZ %>% select(dma) %>% distinct(dma) %>% arrange(dma) %>% pull()
dma_seasonal = df_seasonal %>% select(dma) %>% distinct(dma) %>% arrange(dma) %>% pull()
identical(dma_XyZ, dma_seasonal)



### 3. df_norm: Subscribers/Non-Subscribers by DMA
df_norm = read.csv("Data/non_sub_tv_insert.csv")
df_norm = df_norm %>%
#   dplyr::rename(dma = DMA) %>%
  dplyr::mutate(dma = as.character(dma)) %>%   
  dplyr::arrange(dma) %>% 
  dplyr::as_tibble()

glimpse(df_norm)

## 1. Check whether dma names are identical.
dma_XyZ  = df_XyZ %>% select(dma) %>% distinct(dma) %>% arrange(dma) %>% pull()
dma_norm = df_norm %>% select(dma) %>% distinct(dma) %>% arrange(dma) %>% pull()
identical(dma_XyZ, dma_norm)

# dma_idx  = which(dma_XyZ != dma_norm)
# dma_idx


## 2. Make dma names identical.
# df_norm$dma[dma_idx] = dma_XyZ[dma_idx]

# dma_XyZ  = df_XyZ %>% select(dma) %>% distinct(dma) %>% arrange(dma) %>% pull()
# dma_norm = df_norm %>% select(dma) %>% distinct(dma) %>% arrange(dma) %>% pull()
# identical(dma_XyZ, dma_norm)
# which(dma_XyZ != dma_norm)

Rows: 30
Columns: 4
$ dma              <chr> "ABILENE, TX", "ALEXANDRIA, LA", "ALL OTHER SUDDENLIN…
$ non_sub          <int> 5320, 4734, 18695, 9090, 4751, 5780, 12663, 14579, 27…
$ subs             <int> 43776, 42504, 120518, 65625, 62073, 59173, 120630, 89…
$ tv_insertable_ue <dbl> 19984.83, 26808.40, 58028.15, 39283.45, 24379.33, 372…
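
`identical()` only says whether the two sorted DMA vectors agree; when it returns `FALSE`, `setdiff()` in both directions shows which names are responsible. A small sketch with a hypothetical spelling mismatch (the misspelled name is invented for illustration):

```r
dma_main <- c("ABILENE, TX", "ALEXANDRIA, LA", "AMARILLO, TX")
dma_norm <- c("ABILENE, TX", "ALEXANDRIA,LA", "AMARILLO, TX")  # hypothetical: missing space

print(identical(sort(dma_main), sort(dma_norm)))  # FALSE

# Names present in one table but not in the other
only_in_main <- setdiff(dma_main, dma_norm)
only_in_norm <- setdiff(dma_norm, dma_main)
print(only_in_main)  # "ALEXANDRIA, LA"
print(only_in_norm)  # "ALEXANDRIA,LA"
```

Unlike the positional comparison `which(dma_XyZ != dma_norm)`, this approach also works when the two vectors have different lengths.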


## Pre-processing 

### X_percentile = 1 and y_percentile = 1

In [28]:
### 4. Inputs to `preprocess_MMM` 
data_XyZ        = data_XyZ
data_seasonal   = df_seasonal
data_norm       = df_norm
target_var      = c('gross_add')
media_names     = c('digital_spend', 'DRTV_spend_w', 'radio_spend_w','social_spend','dm_spend',
                    'paid_search_spend','cross_channel_imp')
control_names   = c( 'product_offer', 'inflation', 'promotion','promotion_OPT','promotion_SDL',
                    'AR_1','AR_5','AR_52','MA_4',
                     'product_price','price_ARPU', 'price_ARPU_fprint', 'price_SDL','price_OPT','price_hybrid',
                    'avg_video_price','avg_ov_price','avg_ool_price','avg_video_price_fprint','avg_ool_price_fprint','avg_ov_price_fprint',
                    'website_visitors', 'website_visitors_2018', 'website_visitors_p_2018',
                    'big_holiday','mega_holiday','post_laura','winter_storm','ad_messenger_campaign','tg_chr','mega_holiday2',
                    'covid_daily_confirmed_cases','covid_daily_deaths','covid_cum_confirmed_cases','covid_cum_deaths','covid_dummy', ##COVID features
                    google_comp_cols_vec,derived_competition_vars,'competition','competition_alt','SNOW','email_sent_w' ##Treating email as control
                    )
weather_names   = c('PRCP', 'SNOW', 'TMAX', 'TMIN','AWND','SNWD','WDF2','WDF5','WSF2','WSF5')
compute_PCA     = 'YES'
N_lag_target    = 4 
N_lag_media     = 13
min_var_weather = 1
data_frequency  = 'weekly'
DMA_Infos       = TRUE
subs            = 'no'
y_percentile    = 1
X_percentile    = 1
media_norm_var = rep('non_sub',length(media_names))
names(media_norm_var) <- media_names
media_norm_var['cross_channel_imp'] <- 'tv_insertable_ue' 
media_norm_var = as.character(media_norm_var)
media_group_norm_var = rep(1,length(media_names))
names(media_group_norm_var) <- media_names
media_group_norm_var['cross_channel_imp'] <- 2
#media_group_norm_var['email_sent_w'] <- 3
media_group_norm_var = as.character(media_group_norm_var)
target_norm_var = 'non_sub'
control_norm_per_capita_var = c('google_altice_search_idx','google_comp_search_idx','google_non_brand_search_idx','MA_4','AR_1','AR_5','AR_52',
                                'website_visitors', 'website_visitors_2018', 'website_visitors_p_2018',
                                 'covid_daily_confirmed_cases','covid_daily_deaths','covid_cum_confirmed_cases','covid_cum_deaths')
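
The normalization vectors above are built with a small idiom: fill a vector with the default, name it by channel, override the exceptions by name, then strip the names. A minimal sketch with three of the channels shows each step and why the final vector depends on element order.

```r
media_names <- c("digital_spend", "dm_spend", "cross_channel_imp")

# 1. Default every channel to 'non_sub'
norm_var <- rep("non_sub", length(media_names))
# 2. Name the entries so exceptions can be set by channel name
names(norm_var) <- media_names
# 3. Override the one channel that uses a different denominator
norm_var["cross_channel_imp"] <- "tv_insertable_ue"
print(norm_var)

# 4. as.character() strips the names, leaving a plain vector
norm_var_plain <- as.character(norm_var)
print(names(norm_var_plain))  # NULL
print(norm_var_plain)         # "non_sub" "non_sub" "tv_insertable_ue"
```

After step 4 the mapping is carried only by position, so the vector has to stay in the same order as `media_names` when it is passed on.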

In [29]:
## 5. Conduct `preprocess_MMM`

# Use the Global Max
pp_MMM_email_cntrl = preprocess_MMM(data_XyZ    = data_XyZ,
                        data_seasonal   = data_seasonal,
                        data_norm       = data_norm,
                        target_var      = target_var,
                        media_names     = media_names,
                        control_names   = control_names,
                        weather_names   = weather_names,
                        compute_PCA     = compute_PCA,
                        N_lag_target    = N_lag_target,
                        N_lag_media     = N_lag_media,
                        min_var_weather = min_var_weather,
                        data_frequency  = data_frequency,
                        DMA_Infos       = DMA_Infos,
                        subs            = subs,
                        target_norm_var = target_norm_var,
                        media_norm_var = media_norm_var,
                        media_group_norm_var = media_group_norm_var,
                        X_percentile    = X_percentile,
                        y_percentile = y_percentile,
                        control_norm_per_capita_var = control_norm_per_capita_var
                        )

save(pp_MMM_email_cntrl, file = "RData/pp_MMM_email_cntrl.RData")
#save(data_XyZ, file = "RData/data_XyZ_v4.RData")
# write.csv(data_XyZ,'RData/data_XyZ.csv')

# load("RData/data_XyZ.RData")
# load("RData/pp_MMM.RData")
# names(pp_MMM)





Note: Preprocessing of MMM Raw Datasets
1. Target, 'y'
 - Normalize 'y' by 'non_sub'.
 - Transform Normalized 'y' by Max('y').
 - Create Lagged 'y'.
 - Create Seasonality.

 - Create Trend.

2. Media Variables, 'X'
 - Normalize 'X' by `Normalization Variable` per Media Channel:

|Media Channel     |Normalization Variable |
|:-----------------|:----------------------|
|digital_spend     |non_sub                |
|DRTV_spend_w      |non_sub                |
|radio_spend_w     |non_sub                |
|social_spend      |non_sub                |
|dm_spend          |non_sub                |
|paid_search_spend |non_sub                |
|cross_channel_imp |tv_insertable_ue       |

 - Transform Normalized 'X' by MinMax('X').
 - Create Lagged 'X'.

3. Control Variables, 'Z'
 - Convert 'weather' variables into N components/factors.
 - Add the N 'weather' components to Z.
 - Normalize few 'Z' by `Normalization Variable` per Control Variable (using the same normalization factor as target):

|C

`summarise()` has grouped output by 'dma_id', 'dma'. You can override using the `.groups` argument.

`summarise()` has grouped output by 'dma_id'. You can override using the `.groups` argument.

`summarise()` has grouped output by 'dma_id', 'dma', 'X'. You can override using the `.groups` argument.

`summarise()` has grouped output by 'dma_id', 'dma'. You can override using the `.groups` argument.

`summarise()` has grouped output by 'X'. You can override using the `.groups` argument.




Note: Principal Component Analysis of Weather Variables
1. 5 components are selected by 'min_var_weather'=1.
2. 93.72% of the total variance are explained by the selected components.


Note: Using an external vector in selections is ambiguous.
ℹ Use `all_of(target_norm_var)` instead of `target_norm_var` to silence this message.
ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
This message is displayed once per session.

Note: Using an external vector in selections is ambiguous.
ℹ Use `all_of(control_norm_per_capita_var)` instead of `control_norm_per_capita_var` to silence this message.
ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
This message is displayed once per session.

`summarise()` has grouped output by 'dma_id', 'dma'. You can override using the `.groups` argument.




Note: Summary of MMM Raw Datasets
1. 'data_XyZ'
 - Number of X      : 7 Media Variables
 - Number of Z      : 60 Control Variables
 - Study Period     : 2017-12-31 ~ 2021-09-19
 - Number of weeks  : 195 weeks
 - Number of DMAs   : 30 DMAs
2. 'data_seasonal'
 - Optimum
   - Study Period   : 2016-01-03 ~ 2021-09-19
   - Number of weeks : 299 weeks
 - Suddenlink
   - Study Period   : 2016-01-03 ~ 2021-09-19
   - Number of weeks : 299 weeks
3. 'data_norm'
 - Number of DMAs   : 30 DMAs
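
The PCA note in the output above reports that 5 weather components were kept with `min_var_weather = 1` and that they explain 93.72% of the variance. The source of `preprocess_MMM` is not shown here, so the sketch below rests on an assumption: that `min_var_weather` is a minimum eigenvalue for standardized components. It uses simulated data and base R's `prcomp()`, and it is not the notebook's implementation.

```r
set.seed(1)
n <- 200
temp_signal <- rnorm(n)   # shared driver for the temperature columns
rain_signal <- rnorm(n)   # shared driver for the precipitation columns

# Simulated weather table (column names borrowed from weather_names)
weather <- data.frame(
  TMAX = temp_signal + rnorm(n, sd = 0.2),
  TMIN = temp_signal + rnorm(n, sd = 0.2),
  PRCP = rain_signal + rnorm(n, sd = 0.2),
  SNOW = rain_signal + rnorm(n, sd = 0.2),
  AWND = rnorm(n)
)

# PCA on standardized columns; eigenvalues are the squared sdev
pca         <- prcomp(weather, center = TRUE, scale. = TRUE)
eigenvalues <- pca$sdev^2

# Assumed selection rule: keep components whose eigenvalue is at least 1
min_var <- 1
keep    <- which(eigenvalues >= min_var)
share   <- sum(eigenvalues[keep]) / sum(eigenvalues)

print(length(keep))            # number of retained components
print(round(100 * share, 2))   # percent of total variance explained
components <- pca$x[, keep, drop = FALSE]  # scores that would be added to Z
```

With standardized inputs the eigenvalues sum to the number of variables, so a threshold of 1 keeps components that carry at least as much variance as one original variable.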



In [30]:
pp_MMM_email_cntrl$control$Z_minmax %>%
ungroup() %>%
select(X) %>%
distinct() %>%
pull()

In [31]:
all.equal(pp_MMM_email_cntrl$control$Z_per_capita$AR_1,pp_MMM_email_cntrl$control$Z_per_capita$AR1)
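
The last two cells inspect the min-max scaled and per-capita control tables. As a reminder of what those transformations do, here is a minimal base-R sketch of the steps named in the preprocessing note (normalize by `non_sub`, scale the target by its maximum, min-max scale the media and control series). It uses a few values from the outputs above for a single DMA and is an illustration only; `preprocess_MMM` may differ in details such as using a global maximum across DMAs.

```r
# First weeks of gross_add for ABILENE, TX and that DMA's non_sub (from the outputs above)
gross_add <- c(7, 8, 13, 12)
non_sub   <- 5320

# Target: per-capita normalization, then scaling by the maximum
y_per_capita <- gross_add / non_sub
y_scaled     <- y_per_capita / max(y_per_capita)
print(round(y_scaled, 3))

# Media or control series: min-max scaling to the [0, 1] range
x        <- c(0, 2698, 0, 1349)   # illustrative volumes
x_minmax <- (x - min(x)) / (max(x) - min(x))
print(x_minmax)  # 0.0 1.0 0.0 0.5
```

Min-max scaling is undefined for a constant series because the denominator is zero, so columns that never vary within the study period need separate handling.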