# Instructions for Steps 2 & 3

Once the causes of each outcome have been identified (step 1), they can be used to predict that outcome. This section trains supervised machine learning models (linear regression, XGBoost, and logistic regression) on the identified causes and then evaluates their performance.
It also compares outcome prediction based on audience segmentation with outcome prediction based on the identified causes.
To contrast with these two cases, a third case is added to the comparison, in which only the active streams and the programmed streams are used to predict the outcomes.
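
The models are later compared by MAE and R2 score. These scores are computed inside `modeldata`, which lives in `func` and is not shown in this notebook, so the following is only a dependency-free sketch of the textbook definitions of the two metrics. The function names and toy values are illustrative assumptions, not the project's implementation.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between labels and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy data (made up for illustration)
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

mae = mean_absolute_error(y_true, y_pred)  # 0.5
r2 = r2_score(y_true, y_pred)              # 0.9
print(mae, r2)
```

A lower MAE and a higher R2 both indicate a better fit. MAE is expressed in the units of the outcome, while R2 is unitless.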

In [12]:
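# Project helper module: provides load_data, extract_outcome, bound_data,
# sample_data and modeldata (assumed to also export pd and tqdm, which are used below)
from func import *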

## 1. Load data from disk

In [None]:
# Load data
raw_input = load_data('../Data/CausalFandom_main_data.pickle')

## 2. Set outcome and causes columns

In [None]:
# Outcome columns in the pre-treatment and post-treatment periods
preoutcome = {
    'ticket':['tickets_pre_period_1'],
    'merch':['merch_pre_period_1'],
    'share':['shares_pre_period_1'],
    'stream':['streams_active_streams_pre_period_1', 'streams_programmed_streams_pre_period_1']
}
postoutcome = {
    'ticket':['tickets_following_four_weeks'],
    'merch':['merch_following_four_weeks'],
    'share':['shares_following_four_weeks'],
    'stream':['streams_active_streams_following_four_weeks', 'streams_programmed_streams_following_four_weeks']
}

# Causes for each outcome, keyed by input period
# (internal variable names are hidden, so the lists are left empty here)
cause_dict = {
    'ticket':{
        'pre1':[],
        'treat':[],
        'pre1treat':[]
    },
    'merch':{
        'pre1':[],
        'treat':[],
        'pre1treat':[],
    },
    'share':{
        'pre1':[],
        'treat':[],
        'pre1treat':[],
    },
    'stream':{
        'pre1':[],
        'treat':[],
        'pre1treat':[],
    },
}

# Causes for the stream-only case
# (internal variable names are hidden)
onlystreamcause = {
        'pre1':[],
        'treat':[],
        'pre1treat':[],
}

# All possible causes
# (internal variable names are hidden)
allposscaues = []

## 2.1. Set relevant configurations

In [None]:
# Set the bounds used to remove extreme values: [bound for outcome, bound for change]
# (index 0 is used when ifchange is False, index 1 when ifchange is True)
bound_dict = {
    'ticket':[100,50],
    'merch':[10,5],
    'share':[30,15],
    'stream':[100,50]
}

# Set the cutoffs passed to logistic regression (logistic_cutoff): [for outcome, for change]
logcutoff_dict = {
    'ticket':[5,0],
    'merch':[1,0],
    'share':[15,0],
    'stream':[50,0]
}

# Set the minimum sample size per value of y (passed to sample_data as balance_bound)
samplesize_dict = {
    'ticket':300,
    'merch':300,
    'share':300,
    'stream':500
}
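
The two-element lists above are indexed directly by the boolean `ifchange`, which works because Python treats `False` as 0 and `True` as 1. The sketch below illustrates that lookup, together with one plausible reading of how a bound and a cutoff could be applied. `bound_data` and `modeldata` are defined in `func` and are not shown, so `bound_values` and `binarize` are hypothetical stand-ins, and the inclusive `<=` and strict `>` comparisons are assumptions.

```python
bound_dict = {'merch': [10, 5]}
logcutoff_dict = {'merch': [1, 0]}


def bound_values(values, bound, use_abs=False):
    """Keep values within the bound (hypothetical stand-in for bound_data)."""
    if use_abs:
        return [v for v in values if abs(v) <= bound]
    return [v for v in values if v <= bound]


def binarize(values, cutoff):
    """Turn a numeric label into 0/1 classes (assumed use of logistic_cutoff)."""
    return [1 if v > cutoff else 0 for v in values]


# Toy changes in the merch outcome (made up for illustration)
changes = [-7, -2, 0, 3, 6]
ifchange = True

bound = bound_dict['merch'][ifchange]               # index 1 -> 5
kept = bound_values(changes, bound, use_abs=ifchange)
labels = binarize(kept, logcutoff_dict['merch'][ifchange])

print(bound)   # 5
print(kept)    # [-2, 0, 3]
print(labels)  # [0, 0, 1]
```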

## 3. Using 5 nested for loops to iterate through all cases

Tasks of the loops (outer to inner):

Loop 1 iterates 2 cases:
1. ifchange = True or False: use the change of outcome or the post-treatment period outcome as the label

Loop 2 iterates 2 cases:
1. timenote = 'treat' or 'pre1treat': use treatment period data only, or pre-treatment period and treatment period data, as input data

Loop 3 iterates 4 cases: the 4 outcomes (stream, share, merch, ticket)

Loop 4 iterates 3 cases:
1. Using identified causes (step 1) as input data
2. Using only streams (active stream & programmed stream) as input data
3. Using audience segments as input data

Loop 5 iterates 3 cases: training 3 models (XGBoost, linear regression, logistic regression)

Result: dictionaries holding the MAE score, R2 score, and coefficient table of every case. The MAE and R2 scores are saved to one Excel file per combination of loops 1 and 2.
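
To make the size of this grid concrete, the following sketch enumerates the same five levels with `itertools.product`. It only counts the combinations and is not a replacement for the nested loops; the list names are illustrative, while the values are the ones used in the loop cell.

```python
from itertools import product

ifchange_cases = [True, False]                     # loop 1: label type
time_cases = ['treat', 'pre1treat']                # loop 2: input period
outcomes = ['stream', 'share', 'merch', 'ticket']  # loop 3: outcome
cause_sets = ['step1', 'stream', 'seg']            # loop 4: input data
models = ['xgb', 'linear', 'logistics']            # loop 5: model

cases = list(product(ifchange_cases, time_cases, outcomes, cause_sets, models))

print(len(cases))  # 144 model fits in total
print(cases[0])    # (True, 'treat', 'stream', 'step1', 'xgb')
print(len(ifchange_cases) * len(time_cases))  # 4 result files
```

The notebook keeps explicit nested loops rather than a flat product because the data for each outcome is extracted, bounded, and sampled once at the third level and then reused by every input set and model below it.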

In [None]:
fileloop = tqdm([True, False])
ifpreandtreat = tqdm(['treat','pre1treat'])
outcomeloop = tqdm(['stream','share','merch','ticket'])
methodloop = tqdm(['step1','stream','seg'])

# First loop: iterates two cases of ifchange
for ifchange in fileloop:
    # Second loop: iterates two cases of ifpreandtreat
    for timenote in ifpreandtreat:
        # Create dicts to store results
        maedict = {
            'xgb': {},
            'linear': {},
            'logistics': {},
        }
        r2dict = {
            'xgb': {},
            'linear': {},
            'logistics': {},
        }
        coefdict = {
            'xgb': {},
            'linear': {},
            'logistics': {},
        }

        # Third loop: iterates all outcomes
        for outcomename in outcomeloop:
            if ifchange:
                outcomedata = extract_outcome(raw_input,allposscaues,postoutcome[outcomename],preoutcome[outcomename])
            else:
                outcomedata = extract_outcome(raw_input,allposscaues,postoutcome[outcomename])
            boundvalue = bound_dict[outcomename][ifchange]
            exclude_bigvalue_data = bound_data(outcomedata, boundvalue, ifabs=ifchange)
            sampleddata = sample_data(exclude_bigvalue_data, sample_method='log', balance_bound=samplesize_dict[outcomename])
            # Print the shape of the data after sampling
            print('Data shape after sampling: ',sampleddata.shape)

            logval = logcutoff_dict[outcomename][ifchange]

            # Temporarily store results
            maeresult = {
                'xgb':[],
                'linear':[],
                'logistics':[],
            }
            r2result = {
                'xgb':[],
                'linear':[],
                'logistics':[],
            }
            coefresult = {
                'xgb':[],
                'linear':[],
                'logistics':[],
            }

            # Fourth loop: iterate over all choices of input data
            for causes_type in methodloop:
                if causes_type == 'seg':
                    causes_list = ['light_listener','moderate_listener','super_listener']
                elif causes_type == 'stream':
                    causes_list = onlystreamcause[timenote]
                else:
                    causes_list = cause_dict[outcomename][timenote]

                # Fifth loop: iterate all models
                for modelname in ['xgb','linear','logistics']:
                    ypred,mae,r2,coeftabletemp = modeldata(sampleddata,model_name=modelname,selected_xvars=causes_list,logistic_cutoff=logval)
                    maeresult[modelname].append(mae)
                    r2result[modelname].append(r2)
                    coefresult[modelname].append(coeftabletemp)
                methodloop.update()

            # Store results
            for modelname in ['xgb','linear','logistics']:
                maedict[modelname][outcomename] = maeresult[modelname]
                r2dict[modelname][outcomename] = r2result[modelname]
                coefdict[modelname][outcomename] = coefresult[modelname]
            # Force to print final state
            methodloop.refresh()
            # Reuse tqdm bar
            methodloop.reset()
            # Update outer tqdm
            outcomeloop.update()

        # Save results locally ('./step23result' directory is required to store results)
        filename = './step23result/0812_step23_ifchange_'+str(ifchange)+'_time_' + str(timenote) + '.xlsx'
        with pd.ExcelWriter(filename) as writer:
            for item in maedict:
                item_mae =  pd.DataFrame(maedict[item], index = ['Step1 Causes','Only Stream','Segmentation'])
                item_r2 =  pd.DataFrame(r2dict[item], index = ['Step1 Causes','Only Stream','Segmentation'])
                item_mae.to_excel(writer, sheet_name= item + 'MAE')
                item_r2.to_excel(writer, sheet_name= item + 'R2')
        outcomeloop.refresh()
        outcomeloop.reset()
        ifpreandtreat.update()

    ifpreandtreat.refresh()
    ifpreandtreat.reset()
    fileloop.update()
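
The save step in the cell above requires the `./step23result` directory to exist before the Excel files are written. As a minimal sketch of one way to avoid a missing-directory error, the helper below creates the directory on demand and builds the same file name pattern. `result_path` and its parameters are hypothetical and not part of `func`.

```python
import tempfile
from pathlib import Path


def result_path(ifchange, timenote, outdir='./step23result', prefix='0812_step23'):
    """Create outdir if needed and return the workbook path for one case."""
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return out / f'{prefix}_ifchange_{ifchange}_time_{timenote}.xlsx'


# Demonstrated in a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp:
    path = result_path(True, 'pre1treat', outdir=Path(tmp) / 'step23result')
    print(path.name)             # 0812_step23_ifchange_True_time_pre1treat.xlsx
    print(path.parent.is_dir())  # True
```

The returned path could be passed to `pd.ExcelWriter` in place of the concatenated `filename` string.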
