## Our ML strategy

Let us formulate the objective in formal terms. Our input features are the following:
- date (as category variables Year, Month, Season)
- market (US/UK) - binary categorical variable
- keyword - string variable
- CPC - float variable

Our target variables are the following, all numerical:
- CTR
- Clicks
- Impressions
- Cost
- AveragePosition

We have then a multiple-output regression problem. However because the outputs are related, we can use the following formulas:
$$ \textbf{impressions} = \frac{\textbf{clicks}}{\textbf{CTR}} ; \quad \textbf{Cost} = \textbf{clicks}\times \textbf{CPC}$$

Thus we can actually opt to only predict CTR, number of clicks and average position, and then compute the number of impressions and cost afterwards. However to avoid spreading of variance, and because in retrospect training a regressor does not take too much time, it is better to create five independent regressors for each target variable.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm_notebook as tqdm

import flair
from flair.data import Sentence
from flair.embeddings import ELMoEmbeddings, DocumentPoolEmbeddings
import catboost

import sklearn
from sklearn.model_selection import train_test_split as TTsplit, KFold
from sklearn.metrics import explained_variance_score, mean_absolute_error
from joblib import dump, load

In [2]:
data = pd.read_csv('cleaned_data.csv')

In [3]:
data.shape

(203643, 12)

In [4]:
data.sample(n=10)

Unnamed: 0,Date,Market,Keyword,CPC,Clicks,CTR,Impressions,Cost,AveragePosition,Year,Month,Season
172068,20121103,1,chicago personal injury lawyer,6.882276,-0.09691,0.7,2.033424,1.977449,1.0,2012,11,4
53010,20121002,1,canon 5d,1.627607,1.690462,0.5,4.03595,2.179954,1.0,2012,10,3
32132,20120815,1,chatroulette,-1.089267,2.511121,2.6,4.094296,2.186872,1.0,2012,8,2
132674,20120902,1,auto insurance companies,5.871104,0.426511,0.1,3.260548,2.193848,5.3,2012,9,3
31529,20120814,1,hdmi to hdmi cables,-1.0,0.550228,0.3,3.123525,0.252853,1.0,2012,8,2
165621,20121024,1,auto accident attorney,4.801159,1.805569,2.3,3.441066,3.250866,1.0,2012,10,3
149686,20120928,1,secure credit card,6.411426,0.311754,0.5,2.583199,2.242094,1.0,2012,9,3
70146,20121113,1,donate to haiti,-0.217591,1.058046,2.6,2.646404,0.993877,1.0,2012,11,4
19133,20120712,1,mininova,1.327687,1.786467,4.4,3.143951,2.185315,1.0,2012,7,2
158602,20121012,1,best credit card deals,1.794936,0.69897,2.4,2.31597,1.239049,1.0,2012,10,3


We make a train-test split of this cleaned dataset, keeping the data stratified along the Month variable. This means the two resulting dataframes maintain approximately the same percentage of data per month as the original dataframe.

In [5]:
train_df, test_df = TTsplit(data, test_size=0.3, stratify=data.Month)

In [6]:
train_df.shape

(142550, 12)

We save the test set for later evaluation. But we will undo all of the transformations so that this test set follows the format of the original dataset. This is because our `evaluation.py` script expects that format as well.

In [7]:
test_df = test_df.drop(['Year', 'Month', 'Season'], axis=1)
test_df['Average.Position'] = test_df.AveragePosition
test_df.drop('AveragePosition', axis=1, inplace=True)
test_df.Market = test_df.Market.map({0:'UK-Market', 1:'US-Market'})
test_df.CTR = test_df.CTR.apply(lambda x: str(x)+'%')
test_df.CPC = test_df.CPC.apply(np.exp2)
exp10 = lambda x: np.float_power(10,x)
test_df.Clicks = test_df.Clicks.apply(exp10)
test_df.Impressions = test_df.Impressions.apply(exp10)
test_df.Cost = test_df.Cost.apply(exp10)

test_df.to_csv('testset.csv', index_label=False)

# Computing ELMo word embeddings

In the first half of 2018, the ELMo algorithm ([whitepaper](https://arxiv.org/pdf/1802.05365.pdf)) was state-of-the-art in Natural Language Processing problems as it introduced a new concept called deep contextualized word representations in which the vector embedding of each word is not just a function of the word itself with respect to the entire vocabulary, but is also a function of the sentence in which it appears in. Thus ELMo is able to model situations in which words have different meanings when used in a different context. It is a major improvement over the popular word2vec and GloVe algorithms for generating word embeddings

We will use the ELMoEmbedding function provided in the Flair library. ELMo uses a recurrent neural network architecture thus it builds on top of PyTorch

In [8]:
# this will download the pretrained model if not yet previously downloaded
elmo_small = ELMoEmbeddings('small')

In [9]:
document_embedding = DocumentPoolEmbeddings([elmo_small])

In [10]:
def compute_elmo_embedding(keyword):
    sentence = Sentence(keyword)
    document_embedding.embed(sentence)
    return sentence.get_embedding().detach().cpu().numpy()

Generating embedding vectors of dimension 768 for the dataset of 140k keywords takes a while.

In [12]:
vectors = []
for keyword in tqdm(train_df.Keyword.values, total=train_df.shape[0]):
    vectors.append(compute_elmo_embedding(keyword))

HBox(children=(IntProgress(value=0, max=142550), HTML(value='')))




In [13]:
vectors = pd.DataFrame.from_records(np.array(vectors),index=train_df.index)

In [14]:
train_df = pd.concat([train_df, vectors], axis=1)

In [15]:
train_df.shape

(142550, 780)

We save the embeddings along with the other features so it is faster to iterate when experimenting with machine learning models

In [16]:
train_df.to_csv('embeddings_elmo.csv', index_label=False)

# Training Regressors

We choose to use the gradient boosted tree library named CatBoost from Yandex. It has been shown to improve upon the very popular XGBoost and LightGBM libraries and performs well for categorical variables. In our dataset we have Market, Year and Month as categorical variables. I have done some hyperparameter tuning beforehand and below are values I chose.

In [114]:
cbmodel = catboost.CatBoostRegressor(task_type='GPU', depth=16, grow_policy='Lossguide', max_leaves=63)

We prepare a function to encapsulate cross validation

In [17]:
def perform_cross_validation(model, train_df, target_df, k_folds=5, fit_params=None):
    kf = KFold(n_splits=k_folds, shuffle=True, random_state=17)
    scores = []
    errors = []
    i = 1

    for train_indices, val_indices in tqdm(kf.split(train_df, target_df), total=k_folds):
        print("Training on fold " + str(i) + f" of {k_folds}...", end='')
        i += 1
        
        if not fit_params:
            model.fit(train_df.iloc[train_indices], target_df.iloc[train_indices])
        else:
            model.fit(train_df.iloc[train_indices], target_df.iloc[train_indices], **fit_params)
        print(" Done.")
        predicted_value = model.predict(train_df.iloc[val_indices])
        actual_value = target_df.iloc[val_indices]
        scores.append(explained_variance_score(actual_value, predicted_value))
        errors.append(mean_absolute_error(actual_value, predicted_value))
        
    scores = np.array(scores)
    errors = np.array(errors)
    print(f"Results:\nscores: {scores.mean()} +/- {scores.std()}")
    print(f"MAE: {errors.mean()} +/- {errors.std()}")
    return scores, errors

We need to train a separate regressor for each target numerical feature. First we perform 5-fold cross-validation and compute the explained variance score and mean absolute error for each regressor. The closer to 1.0 the EV score is, the better the model is, while the smaller the MAE is, the better the model is.

The sklearn docs has a detailed breakdown of regression metrics [here](https://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics)

In [128]:
train_inputs = train_df.drop(['Season','Keyword','CTR', 'Clicks', 'Impressions', 'Cost', 'AveragePosition'], axis=1)
cat_features = [1,3,4] # CatBoost performs better if you identify column indices of categorical features

In [131]:
perform_cross_validation(cbmodel, train_inputs, train_df.CTR, fit_params={'cat_features':cat_features, 'verbose':False});

HBox(children=(IntProgress(value=0, max=5), HTML(value='')))

Training on fold 1 of 5... Done.
Training on fold 2 of 5... Done.
Training on fold 3 of 5... Done.
Training on fold 4 of 5... Done.
Training on fold 5 of 5... Done.

Results:
scores: 0.857199317259842 +/- 0.0013826916390928092
MAE: 0.4756340001814621 +/- 0.002766557705139676


In [132]:
perform_cross_validation(cbmodel, train_inputs, train_df.Clicks, fit_params={'cat_features':cat_features, 'verbose':False});

HBox(children=(IntProgress(value=0, max=5), HTML(value='')))

Training on fold 1 of 5... Done.
Training on fold 2 of 5... Done.
Training on fold 3 of 5... Done.
Training on fold 4 of 5... Done.
Training on fold 5 of 5... Done.

Results:
scores: 0.9354116017991402 +/- 0.0014617189391723617
MAE: 0.1815999779398072 +/- 0.0005864121940182363


In [133]:
perform_cross_validation(cbmodel, train_inputs, train_df.Cost, fit_params={'cat_features':cat_features, 'verbose':False});

HBox(children=(IntProgress(value=0, max=5), HTML(value='')))

Training on fold 1 of 5... Done.
Training on fold 2 of 5... Done.
Training on fold 3 of 5... Done.
Training on fold 4 of 5...
 Done.
Training on fold 5 of 5... Done.

Results:
scores: 0.9506809341907697 +/- 0.0009073600486482371
MAE: 0.18496600106016417 +/- 0.0009618761953733562


In [134]:
perform_cross_validation(cbmodel, train_inputs, train_df.Impressions, fit_params={'cat_features':cat_features, 'verbose':False});

HBox(children=(IntProgress(value=0, max=5), HTML(value='')))

Training on fold 1 of 5... Done.
Training on fold 2 of 5... Done.
Training on fold 3 of 5... Done.
Training on fold 4 of 5... Done.
Training on fold 5 of 5... Done.

Results:
scores: 0.9579082471076855 +/- 0.0007381083421909577
MAE: 0.13909714664921663 +/- 0.0007171096090780626


In [135]:
perform_cross_validation(cbmodel, train_inputs, train_df.AveragePosition, fit_params={'cat_features':cat_features, 'verbose':False});

HBox(children=(IntProgress(value=0, max=5), HTML(value='')))

Training on fold 1 of 5... Done.
Training on fold 2 of 5... Done.
Training on fold 3 of 5... Done.
Training on fold 4 of 5... Done.
Training on fold 5 of 5... Done.

Results:
scores: 0.7158014788863788 +/- 0.03195369250490474
MAE: 0.039145478275825936 +/- 0.0007969905130086278


After selection of hyperparameters, we will train regressors over the entire dataset and prepare a wrapper function that will take the predictor features and compute all target features

In [136]:
ctr_predictor = catboost.CatBoostRegressor(task_type='GPU', depth=16, grow_policy='Lossguide', max_leaves=63)
click_predictor = catboost.CatBoostRegressor(task_type='GPU', depth=16, grow_policy='Lossguide', max_leaves=63)
ap_predictor = catboost.CatBoostRegressor(task_type='GPU', depth=16, grow_policy='Lossguide', max_leaves=63)
impression_predictor = catboost.CatBoostRegressor(task_type='GPU', depth=16, grow_policy='Lossguide', max_leaves=63)
cost_predictor = catboost.CatBoostRegressor(task_type='GPU', depth=16, grow_policy='Lossguide', max_leaves=63);

In [137]:
ctr_predictor.fit(train_inputs,train_df.CTR, cat_features=cat_features, verbose=False);

In [138]:
cost_predictor.fit(train_inputs,train_df.Cost, cat_features=cat_features, verbose=False);

In [139]:
impression_predictor.fit(train_inputs,train_df.Impressions, cat_features=cat_features, verbose=False);

In [140]:
click_predictor.fit(train_inputs,train_df.Clicks, cat_features=cat_features, verbose=False);

In [141]:
ap_predictor.fit(train_inputs,train_df.AveragePosition, cat_features=cat_features, verbose=False);

We save the regressors to disk so we can load them in the `evaluation.py` script

In [142]:
prefix = 'models/with-date-wo-season'
dump(ctr_predictor, prefix+'.ctr.joblib')
dump(cost_predictor, prefix+'.cost.joblib')
dump(impression_predictor, prefix+'.impr.joblib')
dump(click_predictor, prefix+'.click.joblib')
dump(ap_predictor, prefix+'.ap.joblib');

## Constructing the Predictor() function

In [39]:
def get_season(month):
    if month >= 3 and month < 6:
        return 1 # Spring
    elif month >= 6 and month < 9:
        return 2 # Summer
    elif month >= 9 and month < 11:
        return 3 # Fall
    else: 
        return 4 # Winter

In [86]:
def Predictor(Date, Market, Keyword, CPC, embedding_function=compute_elmo_embedding):
    # NOTE: this function only takes a single datapoint at a time
    # Each input must match the data type of the corresponding column in the original dataset, then 
    # the same data transformations are applied as in the EDA notebook
    year = int(Date[:4])
    month = int(Date[4:6])
    season = get_season(month)
    market = 1 if Market == 'US-Market' else 0
    cpc = np.log2(CPC)
    keyword = Keyword.lower()
    vector = list(embedding_function(keyword))
    input_vector = [Date, market, cpc, year, month, season, *vector]
    ctr = ctr_predictor.predict(input_vector)
    clicks = click_predictor.predict(input_vector)
    averageposition = ap_predictor.predict(input_vector)
    impressions = impression_predictor.predict(input_vector)
    cost = cost_predictor.predict(input_vector)
    return ctr, 10**clicks, 10**impressions, 10**cost, averageposition

Now we can go ahead and try out our predictor function

In [41]:
Predictor('20120524', 'US-Market', 'agile management software', 1.2)

(9.13484179228817,
 13.023005634660755,
 185.93421578419878,
 15.688379451107428,
 1.0002161035330488)

In [42]:
raw_data = pd.read_csv('dataset.csv')

In [43]:
raw_data.head()

Unnamed: 0,Date,Market,Keyword,Average.Position,CPC,Clicks,CTR,Impressions,Cost
0,20120524,US-Market,secure online back up,0.0,0.0,0.0,0.00%,0.0,0.0
1,20120524,US-Market,agile management software,1.0,1.2,21.22,8.20%,260.0,25.45
2,20120524,US-Market,crm for financial,0.0,0.0,0.0,0.00%,0.0,0.0
3,20120524,US-Market,disaster recovery planning for it,0.0,0.0,0.0,0.00%,0.0,0.0
4,20120524,US-Market,tracking a vehicle,0.0,0.0,0.0,0.00%,0.0,0.0


You can now try out the predictor function encapsulated in the `evaluation.py` script.