<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc" style="margin-top: 1em;"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Setup" data-toc-modified-id="Setup-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Setup</a></span></li><li><span><a href="#Libraries" data-toc-modified-id="Libraries-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Libraries</a></span></li><li><span><a href="#Data" data-toc-modified-id="Data-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Data</a></span></li><li><span><a href="#Optimization" data-toc-modified-id="Optimization-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Optimization</a></span></li></ul></div>

# Introduction

This notebook uses hyperopt to search for the best hyperparameter configuration (number of latent features and number of training iterations) for the LensKit ALS model on the MovieLens 20M dataset.

# Setup

In [1]:
%%capture
%cd ..

# Libraries

In [2]:
from hyperopt import fmin, tpe, hp

In [3]:
from lenskit import batch, topn, util
from lenskit import crossfold as xf
from lenskit.algorithms import Recommender, als

In [4]:
import pyspark.sql.functions as f
from pyspark.sql import SparkSession

In [5]:
from hyperopt import SparkTrials, STATUS_OK
import mlflow

In [6]:
import pprint

# Data

In [7]:
%%time
spark = SparkSession.builder.getOrCreate()
train_df = spark.read.parquet("/tmp/ml-20m/train_df.parquet/").toPandas()
test_df = spark.read.parquet("/tmp/ml-20m/test_df.parquet/").toPandas()

Wall time: 2min 57s


# Optimization

Define the objective function. hyperopt minimizes the returned loss, so the function returns the negative mean nDCG over the test users:

In [10]:
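def create_objective_fn(train_df, test_df, recsize):
    """Build a hyperopt objective that scores ALS by mean nDCG on the test users."""

    # Both frames must provide the user, item and rating columns.
    assert {"user", "item", "rating"}.issubset(train_df.columns)
    assert {"user", "item", "rating"}.issubset(test_df.columns)

    test_users = test_df.user.unique()

    def objective_fn(params):
        # Only features and iterations are tuned; reg and damping stay fixed.
        algo = als.BiasedMF(
            features=params["features"],
            iterations=params["iteration"],
            reg=0.1,
            damping=5,
        )

        # Clone the configured algorithm and adapt it to produce top-N lists.
        model = util.clone(algo)
        model = Recommender.adapt(model)
        model.fit(train_df)

        # Recommend `recsize` items for every test user.
        recs = batch.recommend(model, test_users, recsize)

        rla = topn.RecListAnalysis()
        rla.add_metric(topn.ndcg)

        results = rla.compute(recs, test_df)

        # hyperopt minimizes the loss, so negate the mean nDCG.
        target_metric = -results.ndcg.mean()

        return {"loss": target_metric, "status": STATUS_OK}

    return objective_fn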

In [11]:
objective_fn = create_objective_fn(train_df, test_df, recsize=10)
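
To make the loss concrete, here is a minimal pure-Python sketch of a negative mean nDCG. It assumes binary relevance (an item either is or is not in the user's test set), which is a simplification and not LensKit's `topn.ndcg` implementation. The helper names and toy data are hypothetical.

```python
import math


def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance nDCG@k for one user's ranked recommendation list."""
    # DCG: a hit at 0-based position i contributes 1 / log2(i + 2).
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, item in enumerate(recommended[:k])
        if item in relevant
    )
    # Ideal DCG: all relevant items (at most k of them) ranked first.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def loss_from_lists(recs_by_user, truth_by_user, k):
    """Negative mean nDCG@k over users: lower loss means better ranking."""
    scores = [
        ndcg_at_k(recs_by_user[user], truth_by_user.get(user, set()), k)
        for user in recs_by_user
    ]
    return -sum(scores) / len(scores)


# Hypothetical toy data: user 1 gets a perfect list, user 2 gets no hits.
recs = {1: ["a", "b", "c"], 2: ["x", "y", "z"]}
truth = {1: {"a", "b"}, 2: {"q"}}
loss = loss_from_lists(recs, truth, k=3)
print(loss)  # -0.5
```

User 1 scores 1.0 and user 2 scores 0.0, so the mean nDCG is 0.5 and the loss is -0.5. A configuration that ranks more test items near the top pushes the loss closer to -1.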

Define the search space:

In [12]:
search_space = hp.choice(
    "params",
    [
        {
            "features": 1 + hp.randint("features", 400),
            "iteration": 1 + hp.randint("iteration", 10),
        }
    ],
)

Sample some values from the search space to make sure it works as expected:

In [13]:
import hyperopt.pyll.stochastic

for _ in range(5):
    print(hyperopt.pyll.stochastic.sample(search_space))

{'features': 46, 'iteration': 7}
{'features': 129, 'iteration': 5}
{'features': 250, 'iteration': 7}
{'features': 296, 'iteration': 5}
{'features': 223, 'iteration': 6}
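
The `1 +` offset in the search space matters because `hp.randint(label, upper)` draws integers from 0 up to, but not including, `upper`. The sketch below mimics that behaviour with the standard library's `random.randrange` (a stand-in, not hyperopt itself) to check the resulting ranges. `sample_params` is a hypothetical helper.

```python
import random


def sample_params(rng):
    """Mimic the search space: randrange(n) is in [0, n), so 1 + it is in [1, n]."""
    return {
        "features": 1 + rng.randrange(400),  # 1..400 inclusive
        "iteration": 1 + rng.randrange(10),  # 1..10 inclusive
    }


rng = random.Random(0)
samples = [sample_params(rng) for _ in range(10_000)]

feature_values = [s["features"] for s in samples]
iteration_values = [s["iteration"] for s in samples]

# Every draw stays inside the intended bounds, and zero never occurs.
print(min(feature_values), max(feature_values))
print(min(iteration_values), max(iteration_values))
```

Without the offset, a draw of 0 features or 0 iterations would be possible, which is not a meaningful ALS configuration.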


Define the search strategy (Tree-structured Parzen Estimator):

In [14]:
algo = tpe.suggest

Create a `SparkTrials` object so that up to five trials run in parallel on Spark:

In [15]:
spark_trials = SparkTrials(parallelism=5)

Run the search:

In [16]:
%%time
with mlflow.start_run():
    best_result = fmin(
        fn=objective_fn,
        space=search_space,
        algo=algo,
        max_evals=30,
        trials=spark_trials,
    )

 17%|███▎                | 5/30 [00:06<00:30,  1.23s/it, best loss: ?]


KeyboardInterrupt: 

In [None]:
pprint.pprint(best_result)
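
When reading the result, note that `fmin` returns the raw value of each labelled hyperparameter (and the selected index for `hp.choice`), not the evaluated search-space expression. For this search space the `1 +` offset is therefore not yet applied. The sketch below shows the adjustment by hand; `decode_best` is a hypothetical helper and `raw_best` is made-up data for illustration, not an output of this notebook.

```python
def decode_best(best):
    """Hypothetical helper: apply the same `1 +` shift used in the search space."""
    return {
        "features": 1 + best["features"],
        "iteration": 1 + best["iteration"],
    }


# Made-up raw result, shaped like what fmin would return for this search space.
# "params" is the index chosen by hp.choice (always 0 here, as there is one option).
raw_best = {"features": 45, "iteration": 6, "params": 0}

decoded = decode_best(raw_best)
print(decoded)  # {'features': 46, 'iteration': 7}
```

In practice, `hyperopt.space_eval(search_space, best_result)` should perform this decoding directly from the search space, which avoids duplicating the offsets by hand.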