### Import libraries

In [1]:
import pandas as pd

from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import MinMaxScaler
from tpot import TPOTRegressor

pd.set_option('display.max_columns', None)  

### Read dataset

In [2]:
df = pd.read_csv('data/dataset_v3.csv')

I chose TPOT as the AutoML framework. I initially considered auto-sklearn, but switched after reading an article that recommends TPOT for regression problems (https://medium.com/georgian-impact-blog/choosing-the-best-automl-framework-4f2a90cb1826).

Now let's split this dataframe into train and test sets chronologically. We'll hold out the most recent dates (the last 20% of unique dates) as the test set.

In [3]:
# Calculate the number of unique dates that correspond to 20% of the dataset
num_dates = int(df['date'].nunique() * 0.2)

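# Get the first date of the test period (80% training / 20% testing).
# Sort explicitly: unique() returns values in order of appearance, not chronological order.
split_date = sorted(df['date'].unique())[-num_dates]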

# Splitting the dataset
train = df[df['date'] < split_date]
test = df[df['date'] >= split_date]

X_train = train.drop(columns=['pageviews_-1d_lag', 'offer_id', 'date'])  # Dropping 'date' as it's not a feature
y_train = train['pageviews_-1d_lag']

X_test = test.drop(columns=['pageviews_-1d_lag', 'offer_id', 'date'])
y_test = test['pageviews_-1d_lag']
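
The split logic above can be checked on a small example. The following is a minimal sketch, not the notebook's own code: the helper name `chronological_split` and the toy dataframe are hypothetical, and the `max(1, ...)` guard is an added assumption to avoid an empty test set on very short date ranges.

```python
import pandas as pd

def chronological_split(df, date_col="date", test_frac=0.2):
    """Put the most recent `test_frac` of unique dates into the test set."""
    dates = sorted(df[date_col].unique())
    n_test = max(1, int(len(dates) * test_frac))  # keep at least one test date
    split_date = dates[-n_test]
    train = df[df[date_col] < split_date]
    test = df[df[date_col] >= split_date]
    return train, test, split_date

# Toy frame (hypothetical data): 10 consecutive days, 2 rows per day
toy = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=10, freq="D").repeat(2),
    "value": range(20),
})
toy_train, toy_test, toy_split = chronological_split(toy)

# No date may appear on both sides of the split
assert toy_train["date"].max() < toy_test["date"].min()
print(toy_split, len(toy_train), len(toy_test))
```

Splitting on a date boundary, rather than on a row count, keeps all rows of a given day on the same side, which matters when several offers share a date.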

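Normalize the data, since regularized linear models such as Ridge and Lasso are sensitive to feature scaling. The scaler is fitted on the training set only and then applied to the test set.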

In [4]:
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) 
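
One caveat: fitting the scaler on the whole training set before cross-validation means each fold's validation rows contribute to the min/max used for scaling. A possible alternative, sketched here on synthetic data rather than the notebook's dataset, is to place the scaler inside a scikit-learn `Pipeline` so it is refitted on the training part of every fold. The names `X_cv`, `y_cv` and `pipe` are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X_cv = rng.normal(size=(200, 3))  # synthetic features, rows assumed time-ordered
y_cv = X_cv @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# The scaler is refitted on the training portion of each fold
pipe = make_pipeline(MinMaxScaler(), Ridge(alpha=1.0))
scores = cross_val_score(
    pipe, X_cv, y_cv,
    cv=TimeSeriesSplit(n_splits=5),
    scoring="neg_mean_squared_error",
)
print(scores.mean())
```

For min-max scaling with linear models the effect of this leakage is usually small, so this is a refinement rather than a correction.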

Define the search space: three linear models, each with a small grid of hyperparameters.

In [5]:
tpot_config = {
    'sklearn.linear_model.LinearRegression': {
        # Note: `normalize` was deprecated in scikit-learn 1.0 and removed in 1.2
        'normalize': [True, False]
    },
    'sklearn.linear_model.Ridge': {
        'alpha': [0.1, 1.0, 10.0]
    },
    'sklearn.linear_model.Lasso': {
        'alpha': [0.1, 1.0, 10.0]
    }
}

### Instantiate the TPOTRegressor for AutoML

In [6]:
n_splits = 5
time_series_cv = TimeSeriesSplit(n_splits=n_splits)
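
To see what `TimeSeriesSplit` does with five splits, the short sketch below prints the folds for a toy array of 12 rows. It assumes the rows are already sorted by time, which the splitter relies on; `X_order` is an illustrative name.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_order = np.arange(12).reshape(-1, 1)  # 12 rows, assumed sorted by time
folds = list(TimeSeriesSplit(n_splits=5).split(X_order))

for i, (train_idx, val_idx) in enumerate(folds, start=1):
    # Training indices always precede validation indices
    print(f"fold {i}: train={train_idx.tolist()} validate={val_idx.tolist()}")
```

Each fold trains on an expanding window of earlier rows and validates on the rows that follow, so the model is never evaluated on data older than what it was trained on.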

In [7]:
tpot = TPOTRegressor(
    generations=10,
    population_size=100,
    n_jobs=-1,
    verbosity=2, 
    cv=time_series_cv,
    random_state=42,
    max_time_mins=60,
    max_eval_time_mins=10,
    config_dict=tpot_config,
)

# Run TPOT
tpot.fit(X_train_scaled, y_train)

Generation 1 - Current best internal CV score: -101704.13446582647

Generation 2 - Current best internal CV score: -101592.85324160481

Generation 3 - Current best internal CV score: -101592.85324160481

Generation 4 - Current best internal CV score: -101592.84482444699

Generation 5 - Current best internal CV score: -101592.84482444699

Generation 6 - Current best internal CV score: -101592.84482444699

Generation 7 - Current best internal CV score: -101592.84482444699

Generation 8 - Current best internal CV score: -101577.29252692935

Generation 9 - Current best internal CV score: -101577.29252692935

Generation 10 - Current best internal CV score: -101577.29252692935

Best pipeline: Ridge(CombineDFs(CombineDFs(CombineDFs(input_matrix, input_matrix), input_matrix), input_matrix), alpha=1.0)


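Check the score of the best pipeline on the held-out test set. Since no `scoring` argument was passed, TPOT reports negative mean squared error, so values closer to zero are better.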

In [10]:
print("Test Score: ", tpot.score(X_test_scaled, y_test))

Test Score:  -37722.12283746584
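
To read these scores in the units of the target, they can be converted to RMSE. This sketch assumes the scores are negative mean squared errors (TPOT's default for regression when `scoring` is not set) and simply copies the two numbers from the output above; the variable names are illustrative.

```python
import math

best_cv_score = -101577.29252692935  # final "best internal CV score" above
test_score = -37722.12283746584      # tpot.score(X_test_scaled, y_test) above

# Flip the sign to get MSE, then take the square root to get RMSE
cv_rmse = math.sqrt(-best_cv_score)
test_rmse = math.sqrt(-test_score)
print(f"CV RMSE: {cv_rmse:.2f}, test RMSE: {test_rmse:.2f}")
```

Under that assumption, the test error corresponds to an RMSE of roughly 194 pageviews, against roughly 319 for the internal cross-validation.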


Export the best pipeline as a Python script file.

In [11]:
tpot.export('tpot_model_selection/tpot_model_pipeline_linear.py')
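
If the `tpot_model_selection` directory might not exist yet, it can be created before exporting. This is a small sketch using `pathlib`; the `export_path` variable is illustrative, and the export call is left commented out because it needs the fitted `tpot` object from above.

```python
from pathlib import Path

export_path = Path("tpot_model_selection/tpot_model_pipeline_linear.py")

# Create the parent directory if it is missing (no error if it already exists)
export_path.parent.mkdir(parents=True, exist_ok=True)

# tpot.export(str(export_path))  # requires the fitted `tpot` object
```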