### Import libraries

In [1]:
import pandas as pd

from sklearn.preprocessing import StandardScaler
from tpot import TPOTRegressor

pd.set_option('display.max_columns', None)  

### Read dataset

In [2]:
df = pd.read_csv('data/dataset_v3.csv')

I decided to proceed with TPOT as an AutoML framework. Initially, I was considering auto-sklearn, but after reading this article where TPOT was recommended for regression problems (https://medium.com/georgian-impact-blog/choosing-the-best-automl-framework-4f2a90cb1826), I made my choice.

Now let's split this dataframe into train and test. We'll pick the last 5 days as a test set.

In [3]:
# Calculate the number of unique dates that correspond to 20% of the dataset
num_dates = int(df['date'].nunique() * 0.2)

# Get the date that splits the data into 80% training and 20% testing
split_date = df['date'].unique()[-num_dates]

# Splitting the dataset
train = df[df['date'] < split_date]
test = df[df['date'] >= split_date]

X_train = train.drop(columns=['pageviews_-1d_lag', 'offer_id', 'date'])  # Dropping 'date' as it's not a feature
y_train = train['pageviews_-1d_lag']

X_test = test.drop(columns=['pageviews_-1d_lag', 'offer_id', 'date'])
y_test = test['pageviews_-1d_lag']

Data normalation since some algorithms are sensitive to Feature Scaling.

In [4]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  

Define searching space.

In [5]:
tpot_config = {
    # Existing ensemble models
    'sklearn.ensemble.RandomForestRegressor': {
        'n_estimators': [100, 200, 300],
        'max_features': ["auto", "sqrt", "log2"],
        'max_depth': [1, 5, 10],
        'min_samples_split': [2, 5, 10, 15, 20],
        'min_samples_leaf': [1, 5, 10, 15, 20],
        'bootstrap': [True, False]
    },
    'xgboost.XGBRegressor': {
        'n_estimators': [100, 200, 300],
        'max_depth': [1, 5, 10],
        'learning_rate': [0.01, 0.1, 0.2, 0.3],
        'subsample': [0.6, 0.7, 0.8, 0.9, 1.0],
        'colsample_bytree': [0.6, 0.7, 0.8, 0.9, 1.0],
        'min_child_weight': [1, 2, 5, 10]
    },
    'lightgbm.LGBMRegressor': {
        'num_leaves': [20, 50, 100, 150],
        'learning_rate': [0.01, 0.05, 0.1],
        'n_estimators': [100, 200, 300],
        'subsample': [0.6, 0.7, 0.8, 0.9, 1.0],
        'max_depth': [1, 5, 10]
    },

    # Linear models
    'sklearn.linear_model.LinearRegression': {
    },
    'sklearn.linear_model.Ridge': {
        'alpha': [1.0, 10.0, 100.0, 1000.0]
    },
    'sklearn.linear_model.Lasso': {
        'alpha': [0.001, 0.01, 0.1, 1.0, 10.0]
    }
}

### Instantiate the TPOTRegressor for AutoML

In [6]:
tpot = TPOTRegressor(
    generations=5,
    population_size=30,
    n_jobs=-1,
    verbosity=3, 
    cv=None,
    random_state=42,
    max_time_mins=60,
    max_eval_time_mins=10,
    config_dict=tpot_config,
)

# Run TPOT
tpot.fit(X_train, y_train)

6 operators have been imported by TPOT.


Optimization Progress:   0%|          | 0/30 [00:00<?, ?pipeline/s]

Skipped pipeline #15 due to time out. Continuing to the next pipeline.
Skipped pipeline #20 due to time out. Continuing to the next pipeline.
Skipped pipeline #24 due to time out. Continuing to the next pipeline.
_pre_test decorator: _mate_operator: num_test=0 'str' object has no attribute 'arity'.
_pre_test decorator: _mate_operator: num_test=0 'str' object has no attribute 'arity'.
_pre_test decorator: _mate_operator: num_test=0 'str' object has no attribute 'arity'.
_pre_test decorator: _mate_operator: num_test=0 'str' object has no attribute 'arity'.
Pipeline encountered that has previously been evaluated during the optimization process. Using the score from the previous evaluation.
Skipped pipeline #48 due to time out. Continuing to the next pipeline.
Skipped pipeline #60 due to time out. Continuing to the next pipeline.

Generation 1 - Current Pareto front scores:

-1	-187413.27873369632	LinearRegression(input_matrix)

-2	-187412.1138934574	LinearRegression(LinearRegression(input

Run the TPOT optimization

Check the score of the best pipeline.

In [7]:
print("Test Score: ", tpot.score(X_test_scaled, y_test))

Test Score:  -1.0497266696908933e+24




Export the best pipeline as a Python script file.

In [8]:
tpot.export('tpot_model_selection/tpot_model_pipeline.py')