<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Movie-recommendation" data-toc-modified-id="Movie-recommendation-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Movie recommendation</a></span><ul class="toc-item"><li><span><a href="#Dataset" data-toc-modified-id="Dataset-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Dataset</a></span></li><li><span><a href="#Evaluation-Protocol" data-toc-modified-id="Evaluation-Protocol-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Evaluation Protocol</a></span></li><li><span><a href="#Models" data-toc-modified-id="Models-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Models</a></span><ul class="toc-item"><li><span><a href="#ALS" data-toc-modified-id="ALS-1.3.1"><span class="toc-item-num">1.3.1&nbsp;&nbsp;</span><a href="https://spark.apache.org/docs/latest/ml-collaborative-filtering.html#explicit-vs-implicit-feedback" target="_blank">ALS</a></a></span></li><li><span><a href="#Ваша-формулировка" data-toc-modified-id="Ваша-формулировка-1.3.2"><span class="toc-item-num">1.3.2&nbsp;&nbsp;</span>Ваша формулировка</a></span></li></ul></li><li><span><a href="#Evaluation-Results" data-toc-modified-id="Evaluation-Results-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Evaluation Results</a></span></li></ul></li></ul></div>

# Movie recommendation

Your task is to recommend movies to users.


In [1]:
%matplotlib inline
%config InlineBackend.figure_format ='retina'

import os
import sys
import glob
import pickle
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import spatial

import pyspark
from pyspark.conf import SparkConf
from pyspark.ml.recommendation import ALS
from pyspark.mllib.evaluation import RankingMetrics
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession
from pyspark.sql.window import Window


spark = SparkSession \
    .builder \
    .master('local[*]') \
    .appName("spark_sql_examples") \
    .config("spark.executor.memory", "8g") \
    .getOrCreate()

sc = spark.sparkContext
sqlContext = SQLContext(sc)

## Dataset 

`MovieLens-25M`

In [2]:
DATA_PATH = '/workspace/data/ml-25m'

RATINGS_PATH = os.path.join(DATA_PATH, 'ratings.csv')
MOVIES_PATH = os.path.join(DATA_PATH, 'movies.csv')
TAGS_PATH = os.path.join(DATA_PATH, 'tags.csv')

SEED = 90

In [3]:
import pyspark.sql.functions as F
from pyspark.sql.types import *


def read_df(path, sampling_rate=None):
    df = sqlContext.read.format("com.databricks.spark.csv") \
        .option("delimiter", ",") \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .load('file:///' + path)
    if sampling_rate:
        df = df.sample(False, sampling_rate, SEED)
    return df

## Evaluation Protocol

Since we want to evaluate the quality of different recommendation algorithms, we first need to define:
* How to split the data into `Train`/`Validation`/`Test`;
* Which metrics to use to evaluate quality.

### Splits

The split assigns the earliest 80% of each user's ratings (by timestamp) to the train set, the next 10% to the dev set, and the remaining 10% to the test set.

This mitigates the user "cold-start" problem during evaluation: for every rating in the test set, the user already has at least some earlier ratings in the train set. It does not resolve the item "cold-start" problem, which is harder to mitigate.

The split also does not "look into the future" within any single user's history.
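
To make the per-user chronological split concrete, here is a minimal pure-Python sketch of the same idea on toy data. It assumes that Spark's `ntile` follows the usual SQL `NTILE` semantics (when the row count is not divisible by the number of tiles, the earliest tiles receive the extra rows); the helper names `ntile` and `split_by_user_time` are hypothetical and exist only for this illustration.

```python
from collections import defaultdict


def ntile(n_rows, n_tiles):
    """Return a 1-based tile number for each of n_rows ordered rows (SQL NTILE semantics)."""
    base, extra = divmod(n_rows, n_tiles)
    tiles = []
    for tile in range(1, n_tiles + 1):
        # The first `extra` tiles receive one additional row.
        size = base + (1 if tile <= extra else 0)
        tiles.extend([tile] * size)
    return tiles


def split_by_user_time(ratings, n_tiles=10, train_tiles=8):
    """Split (user_id, movie_id, timestamp) tuples into train/dev/test per user."""
    by_user = defaultdict(list)
    for user_id, movie_id, timestamp in ratings:
        by_user[user_id].append((timestamp, movie_id))

    train, dev, test = [], [], []
    for user_id, events in by_user.items():
        events.sort()  # oldest rating first
        for (timestamp, movie_id), tile in zip(events, ntile(len(events), n_tiles)):
            row = (user_id, movie_id, timestamp)
            if tile <= train_tiles:
                train.append(row)
            elif tile == train_tiles + 1:
                dev.append(row)
            else:
                test.append(row)
    return train, dev, test


# Toy data: user 1 has 20 ratings, user 2 has only 5.
toy_ratings = [(1, 100 + t, t) for t in range(20)] + [(2, 200 + t, t) for t in range(5)]
train, dev, test = split_by_user_time(toy_ratings)
print(len(train), len(dev), len(test))
```

Under these assumptions, a user with fewer ratings than tiles lands entirely in the train set, and any remainder rows go to the earliest tiles, so the train share can end up above 80%.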

In [4]:
def split_ratings_df(sampling_rate=None):
    ratings_df = read_df(RATINGS_PATH, sampling_rate)
    print(ratings_df.count())
    
    TILES = 10

    user_window = Window.orderBy('timestamp').partitionBy('userId')

    tiled_ratings_df = ratings_df \
        .withColumn('tile', F.ntile(TILES).over(user_window))

    train_ratings_df = tiled_ratings_df \
        .filter(F.col('tile') <= 8) \
        .drop('tile') \
        .cache()
    print(train_ratings_df.count())
    
    dev_ratings_df = tiled_ratings_df \
        .filter(F.col('tile') == 9) \
        .drop('tile') \
        .cache()
    print(dev_ratings_df.count())
    
    test_ratings_df = tiled_ratings_df \
        .filter(F.col('tile') == 10) \
        .drop('tile') \
        .cache()
    print(test_ratings_df.count())
    
    return train_ratings_df, dev_ratings_df, test_ratings_df

### Metrics

I picked several diagnostic metrics and one more comprehensive metric. The diagnostic metrics are Precision and Normalized Discounted Cumulative Gain (NDCG) at positions 1, 5, and 10. They expose specific aspects of system performance and help with debugging (for example, `AssosiationsRuleModel` was getting Precision@1 = 0 because it always recommended the last item from the training set, until I fixed it).

The general performance metric is Mean Average Precision (MAP), which accounts for both the relevance of the predictions and their order. This metric is useful as the objective for the hyperparameter optimization stage.
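
As a simplified illustration of these metrics for a single user, the sketch below computes Precision@k, binary-relevance NDCG@k, and Average Precision in plain Python. The function names are hypothetical, and Spark's `RankingMetrics` may differ in details (for example, in how short recommendation lists are normalised); MAP is then the mean of the per-user Average Precision.

```python
import math


def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k positions that hold a relevant item."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k


def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance NDCG: DCG of the list divided by the DCG of an ideal list."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(recommended[:k]) if item in relevant)
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def average_precision(recommended, relevant):
    """Sum of precision@rank at each relevant hit, divided by the number of relevant items."""
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


recommended = [10, 20, 30, 40, 50]   # ranked recommendations for one user
relevant = {20, 50, 99}              # movies the user actually rated in the evaluation set

print(precision_at_k(recommended, relevant, 1))
print(precision_at_k(recommended, relevant, 5))
print(ndcg_at_k(recommended, relevant, 5))
print(average_precision(recommended, relevant))
```

Precision@1 is zero here because the first recommendation is a miss, while Average Precision still rewards the two hits at ranks 2 and 5, which is why the two kinds of metric are useful side by side.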

In [5]:
def evaluate_recommendations_on(model, recommendations_map_fn, df):
    labels = df \
        .groupby('userId') \
        .agg(F.collect_set('movieId').alias('labels')) \
        .cache()
        
    ATS = [1, 5, 10]
    MAX_ATS = max(ATS)
    
    user_recs = model \
        .recommendForUserSubset(labels, MAX_ATS) \
        .cache()
    
    assert labels.count() == user_recs.count()

    recs_and_labels = labels \
        .join(user_recs, 'userId') \
        .select('recommendations', 'labels') \
        .rdd \
        .map(lambda row: (list(map(recommendations_map_fn, row[0])), row[1])) \
        .cache()
    
    ranking_metrics = RankingMetrics(recs_and_labels)
    
    metrics_values = {}
    
    for N in ATS:
        precision_n_handle = "Precision@" + str(N)
        metrics_values[precision_n_handle] = ranking_metrics.precisionAt(N)
        ndcg_n_handle = "NDCG@" + str(N)
        metrics_values[ndcg_n_handle] = ranking_metrics.ndcgAt(N)
    metrics_values["MAP"] = ranking_metrics.meanAveragePrecision
    return metrics_values

In [6]:
def get_ate(groups, control_name) -> pd.DataFrame:
    """Get Average Treatment Effect
    groups - dictionary whose keys are model names and whose values are dicts of <metric_name>: <metric_value> pairs
    control_name - name of the baseline model
    
    Returns a pd.DataFrame (rows correspond to metrics, columns to models) with the relative change
    of each metric with respect to the control, in percent.
    """
    
    metric_names = []
    for metric_name_values in groups.values():
        for metric_name, _ in metric_name_values.items():
            if metric_name not in metric_names:
                metric_names.append(metric_name)
    metric_names = list(sorted(metric_names))
    
    if control_name not in groups:
        raise ValueError("Control experiment is not in the group.")
    control_values = groups[control_name]
    if len(control_values) != len(metric_names):
        raise ValueError("Control experiment does not have all the metrics computed.")

    model_names = list(sorted(groups.keys()))
    metric_model_ates = []
    for metric_name in metric_names:
        control_value = control_values[metric_name]
        model_ates = []
        for model_name in model_names:
            if metric_name in groups[model_name]:
                ate = (groups[model_name][metric_name] - control_value) / control_value * 100
            else:
                ate = None
            model_ates.append(ate)
        metric_model_ates.append(model_ates)

    ates_df = pd.DataFrame(data=metric_model_ates, index=metric_names, columns=model_names)
    return ates_df
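
The core of `get_ate` is a relative change in percent with respect to the control. The sketch below shows that arithmetic in isolation; `relative_change_percent` is a hypothetical helper, the metric values are made up for illustration, and the guard for a zero control value is an added assumption that the function above does not implement.

```python
def relative_change_percent(value, control_value):
    """Relative change of `value` with respect to `control_value`, in percent."""
    if control_value == 0:
        # Assumption for this sketch: report "undefined" instead of dividing by zero.
        return None
    return (value - control_value) / control_value * 100


# Made-up metric values, for illustration only.
control = {'MAP': 0.002, 'Precision@1': 0.010}
candidate = {'MAP': 0.003, 'Precision@1': 0.008}

for metric_name in sorted(control):
    change = relative_change_percent(candidate[metric_name], control[metric_name])
    print(metric_name, change)
```

Because the denominator is the control value, a very weak control inflates the percentages, and a control metric of exactly zero leaves the ratio undefined.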

In [7]:
all_metrics = {}

## Models

Now we can formulate the task in machine learning terms.

One formulation that we will reduce our task to is **Matrix Completion**. We will solve it with `ALS`.

In [8]:
train_ratings_df, dev_ratings_df, test_ratings_df = split_ratings_df()

25000095
20123411
2445623
2431061


### [ALS](https://spark.apache.org/docs/latest/ml-collaborative-filtering.html#explicit-vs-implicit-feedback)

In [9]:
def get_baseline_als_space():
    space = {
        'rank': 10,
        'maxIter': 10,
        'regParam': 0.1,
        'implicitPrefs': False,
        'alpha': 1.0,
        'nonnegative': False,

        'numUserBlocks': 10,
        'numItemBlocks': 10,
        'userCol': 'userId',
        'itemCol': 'movieId',
        'ratingCol': 'rating',
        'seed': SEED,
        'coldStartStrategy': 'nan',
    }
    return space

In [10]:
als = ALS(**get_baseline_als_space())
baseline_als_model = als.fit(train_ratings_df)

In [11]:
%%time

for eval_df in [train_ratings_df, dev_ratings_df, test_ratings_df]:
    metrics = evaluate_recommendations_on(
        model=baseline_als_model,
        recommendations_map_fn=lambda rec: rec.movieId,
        df=eval_df)
    print(metrics)
    
BASELINE_HANDLE = '1.als_baseline'
all_metrics[BASELINE_HANDLE] = metrics

{'NDCG@5': 0.00017637827044626967, 'Precision@5': 0.00018456881648322575, 'MAP': 3.8610239522794426e-06, 'Precision@1': 0.00014150275930380657, 'Precision@10': 0.00017780129321217424, 'NDCG@10': 0.00017456509633644512, 'NDCG@1': 0.00014150275930380657}
{'NDCG@5': 7.688792036277059e-06, 'Precision@5': 8.613211435883877e-06, 'MAP': 7.251632737680066e-07, 'Precision@1': 0.0, 'Precision@10': 9.228440824161288e-06, 'NDCG@10': 8.84849400421429e-06, 'NDCG@1': 0.0}
{'NDCG@5': 2.3391599451119216e-05, 'Precision@5': 1.722642287176776e-05, 'MAP': 8.256419136002381e-07, 'Precision@1': 4.9218351062193566e-05, 'Precision@10': 1.1689358377270964e-05, 'NDCG@10': 1.7319877774538593e-05, 'NDCG@1': 4.9218351062193566e-05}
CPU times: user 745 ms, sys: 257 ms, total: 1 s
Wall time: 14min 49s


Show the top-20 most similar movies for movies of your choice.

In [12]:
movies_df = read_df(MOVIES_PATH)
movies_df.count()

62423

In [13]:
def get_cosine_similarity(features_0, features_1):
    similarity = 1 - spatial.distance.cosine(features_0, features_1)
    return float(similarity)


cosine_similarity_udf = F.udf(get_cosine_similarity, FloatType())


def find_similar_to(movie_id, model, N=20):
    movie_factors = model.itemFactors

    selected_movie_factors = movie_factors \
        .filter(F.col('id') == movie_id) \
        .selectExpr('id as movieId', 'features as movieFeatures') \
        .cache()
    
    out = selected_movie_factors \
        .crossJoin(movie_factors) \
        .withColumn('sim', cosine_similarity_udf('movieFeatures', 'features')) \
        .select('movieId', 'id', 'sim') \
        .sort(F.col('sim').desc()) \
        .limit(N)
    
    return out

In [14]:
similar_movies = find_similar_to(1, baseline_als_model)
similar_movies.collect()

[Row(movieId=1, id=1, sim=1.0),
 Row(movieId=1, id=3114, sim=0.9954542517662048),
 Row(movieId=1, id=78499, sim=0.990972101688385),
 Row(movieId=1, id=27624, sim=0.9887143969535828),
 Row(movieId=1, id=2355, sim=0.9836386442184448),
 Row(movieId=1, id=8961, sim=0.9826635718345642),
 Row(movieId=1, id=6377, sim=0.9790077805519104),
 Row(movieId=1, id=4886, sim=0.9785723686218262),
 Row(movieId=1, id=136489, sim=0.978314995765686),
 Row(movieId=1, id=7639, sim=0.9772404432296753),
 Row(movieId=1, id=26940, sim=0.9770073890686035),
 Row(movieId=1, id=171515, sim=0.9765953421592712),
 Row(movieId=1, id=588, sim=0.9761655330657959),
 Row(movieId=1, id=74342, sim=0.9758329391479492),
 Row(movieId=1, id=66369, sim=0.9754895567893982),
 Row(movieId=1, id=184485, sim=0.9753639101982117),
 Row(movieId=1, id=116975, sim=0.9752519130706787),
 Row(movieId=1, id=33524, sim=0.9749162197113037),
 Row(movieId=1, id=8827, sim=0.9743496775627136),
 Row(movieId=1, id=162422, sim=0.9742727279663086)]

### The movie with the highest similarity 1.0 is the original movie, i.e. Toy Story (1995)
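
The query movie ranks first because the cosine similarity of a vector with itself is 1. As a sketch of the same top-N lookup outside Spark, the NumPy snippet below assumes the item factors have already been collected to the driver (for example via `model.itemFactors.toPandas()`, which should be feasible for tens of thousands of low-rank vectors). The function name `top_n_cosine` and the toy factors are hypothetical.

```python
import numpy as np


def top_n_cosine(item_ids, factors, query_id, n=20):
    """Return the n (item_id, similarity) pairs closest to query_id by cosine similarity."""
    ids = np.asarray(item_ids)
    matrix = np.asarray(factors, dtype=float)
    # Normalise rows so that a dot product equals the cosine similarity.
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    unit = matrix / np.where(norms == 0, 1.0, norms)
    query = unit[np.where(ids == query_id)[0][0]]
    sims = unit @ query
    order = np.argsort(-sims)[:n]
    return [(int(ids[i]), float(sims[i])) for i in order]


# Toy factors: item 2 points almost the same way as item 1, item 4 the opposite way.
toy_ids = [1, 2, 3, 4]
toy_factors = [[1.0, 0.0], [2.0, 0.1], [0.0, 1.0], [-1.0, 0.0]]
result = top_n_cosine(toy_ids, toy_factors, query_id=1, n=3)
print(result)
```

A single matrix-vector product replaces the cross join with a Python UDF, which is usually much cheaper when the factor matrix fits in memory.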

In [15]:
similar_movies \
    .join(movies_df, similar_movies['id'] == movies_df['movieId']) \
    .select('title', 'genres') \
    .collect()

[Row(title='Toy Story (1995)', genres='Adventure|Animation|Children|Comedy|Fantasy'),
 Row(title='Aladdin (1992)', genres='Adventure|Animation|Children|Comedy|Musical'),
 Row(title="Bug's Life, A (1998)", genres='Adventure|Animation|Children|Comedy'),
 Row(title='Toy Story 2 (1999)', genres='Adventure|Animation|Children|Comedy|Fantasy'),
 Row(title='Monsters, Inc. (2001)', genres='Adventure|Animation|Children|Comedy|Fantasy'),
 Row(title='Finding Nemo (2003)', genres='Adventure|Animation|Children|Comedy'),
 Row(title='Nice Guys Sleep Alone (1999)', genres='Comedy|Romance'),
 Row(title='Bill Cosby, Himself (1983)', genres='Comedy|Documentary'),
 Row(title='Incredibles, The (2004)', genres='Action|Adventure|Animation|Children|Comedy'),
 Row(title='Late Shift, The (1996)', genres='Comedy'),
 Row(title='Live from Baghdad (2002)', genres='Drama|War'),
 Row(title='Death Takes a Holiday (1934)', genres='Fantasy|Romance'),
 Row(title='Man Who Quit Smoking, The (Mannen som slutade röka) (1972)'

### Your formulation

Several other ML formulations of the recommendation task were covered in the lecture. Choose one of them and implement the method.

## Based on Evaluation of Session-based Recommendation Algorithms: https://arxiv.org/pdf/1803.09587.pdf

Found from reference [11] https://dl.acm.org/doi/10.1145/3298689.3347041

### Session-based Recommendation

### Simple Association Rules (AR)

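This is equation (1) in the paper. The baseline version treats each user as a single session.

The normalizer in equation (1) can be disregarded: it depends only on the session $s$, not on the candidate item $i$, and serves only to make the scores interpretable as probabilities, so it does not change the ranking. The score function then takes the form

$$score_{AR}(i, s) = \sum_{p \in S_p} \sum_{x=1}^{|p|} \sum_{y=1}^{|p|} 1_{EQ}(s_{|s|}, p_x) \cdot 1_{EQ}(i, p_y)$$

where $S_p$ is the set of past sessions, $s_{|s|}$ is the last item of the current session $s$, and $p_x$ is the item at position $x$ of session $p$.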
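
As a minimal in-memory sketch of this unnormalised score, the snippet below counts, for every item pair, how often the two items appear in the same past session, and then ranks candidates by their co-occurrence with the last item of the current session. The names `build_cooccurrence` and `recommend` are hypothetical, the sessions are toy data, and the last item is filtered out explicitly, which is a simplification compared with dropping the top-ranked entry.

```python
from collections import Counter, defaultdict


def build_cooccurrence(sessions):
    """counts[a][b] = number of position pairs (x, y) over all sessions with p_x == a and p_y == b."""
    counts = defaultdict(Counter)
    for session in sessions:
        for a in session:
            for b in session:
                counts[a][b] += 1
    return counts


def recommend(counts, session, n=10):
    """Rank items by co-occurrence with the last item of the session."""
    last_item = session[-1]
    scored = [(item, score) for item, score in counts[last_item].items()
              if item != last_item]
    # Highest score first; ties are broken by item id to keep the output deterministic.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in scored[:n]]


past_sessions = [[1, 2, 3], [2, 3], [3, 4], [1, 3]]
counts = build_cooccurrence(past_sessions)

print(counts[3])                        # co-occurrence counts with item 3
print(recommend(counts, [9, 3], n=2))   # current session ends with item 3
print(recommend(counts, [9, 7], n=2))   # last item never seen in past sessions
```

A session whose last item never occurs in the past sessions receives no recommendations, which mirrors the empty recommendation lists that can appear for some users in a subsampled training set.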

In [16]:
train_ratings_df, dev_ratings_df, test_ratings_df = split_ratings_df()

# Subsample for training, since there is not enough memory for the pair co-occurrence calculation on a single machine.
train_ratings_df = train_ratings_df.sample(False, 0.01, seed=SEED).cache()
train_ratings_df.count()

25000095
20123411
2445623
2431061


200208

In [17]:
%%time

cooccurrence_df = train_ratings_df \
    .selectExpr('userId', 'movieId as movieId2') \
    .join(train_ratings_df, 'userId') \
    .select('movieId', 'movieId2') \
    .groupBy('movieId', 'movieId2') \
    .agg(F.count('movieId').alias('cooccurrence')) \
    .cache()

cooccurrence_df.take(10)

CPU times: user 18 ms, sys: 3.42 ms, total: 21.4 ms
Wall time: 8.05 s


[Row(movieId=648, movieId2=648, cooccurrence=320),
 Row(movieId=46347, movieId2=3033, cooccurrence=1),
 Row(movieId=49272, movieId2=296, cooccurrence=2),
 Row(movieId=3852, movieId2=673, cooccurrence=1),
 Row(movieId=2133, movieId2=4519, cooccurrence=1),
 Row(movieId=780, movieId2=2471, cooccurrence=1),
 Row(movieId=5876, movieId2=7121, cooccurrence=1),
 Row(movieId=6731, movieId2=6731, cooccurrence=12),
 Row(movieId=111235, movieId2=6971, cooccurrence=1),
 Row(movieId=98116, movieId2=3430, cooccurrence=1)]

In [18]:
cooccurrence_df \
    .filter(F.col('movieId') == 49272) \
    .filter(F.col('movieId2') == 296) \
    .head()

Row(movieId=49272, movieId2=296, cooccurrence=2)

In [19]:
cooccurrence_df \
    .filter(F.col('movieId') == 296) \
    .filter(F.col('movieId2') == 49272) \
    .head()

Row(movieId=296, movieId2=49272, cooccurrence=2)

In [20]:
user_max_timestamp_df = train_ratings_df \
    .groupBy('userId') \
    .agg(F.max('timestamp').alias('maxTimestamp')) \
    .cache()

user_max_timestamp_df.take(10)

[Row(userId=148, maxTimestamp=1454942724),
 Row(userId=463, maxTimestamp=854641157),
 Row(userId=496, maxTimestamp=1397231655),
 Row(userId=833, maxTimestamp=1467561417),
 Row(userId=1238, maxTimestamp=1495912728),
 Row(userId=1342, maxTimestamp=1429640019),
 Row(userId=1580, maxTimestamp=1341025712),
 Row(userId=1829, maxTimestamp=909019436),
 Row(userId=3749, maxTimestamp=843684126),
 Row(userId=3794, maxTimestamp=1484585769)]

In [21]:
user_last_movie_id_df = train_ratings_df \
    .join(user_max_timestamp_df, (
        (train_ratings_df['userId'] == user_max_timestamp_df['userId']) &
        (train_ratings_df['timestamp'] == user_max_timestamp_df['maxTimestamp']))) \
    .select(train_ratings_df['userId'], train_ratings_df['movieId'].alias('lastMovieId')) \
    .cache()

user_last_movie_id_df.take(10)

[Row(userId=148, lastMovieId=1222),
 Row(userId=463, lastMovieId=648),
 Row(userId=496, lastMovieId=66200),
 Row(userId=833, lastMovieId=104841),
 Row(userId=1238, lastMovieId=55247),
 Row(userId=1342, lastMovieId=78039),
 Row(userId=1580, lastMovieId=105),
 Row(userId=1829, lastMovieId=1474),
 Row(userId=3749, lastMovieId=589),
 Row(userId=3794, lastMovieId=96079)]

In [22]:
class AssosiationsRuleModel(object):

    def __init__(self, user_last_movie_id_df, cooccurrence_df):
        self.user_last_movie_id_df = user_last_movie_id_df
        self.cooccurrence_df = cooccurrence_df
    
    def recommendForUserSubset(self, users_df, N):
        def postprocess_recommendations(recommendations):
            # Discard the first recommendation since it is always the last movie in the user session from
            # the fact that it has the highest co-occurrence with itself.
            # We would probably want to discard all movies we have seen in the training ratings set for the user,
            # but I do not think it holds for general session-based recommendation problems.
            recommendations = recommendations[1:]
            # Limit the size up to N.
            recommendations = recommendations[:N]
            return recommendations

        postprocess_recommendations_udf = F.udf(postprocess_recommendations, ArrayType(IntegerType()))
        
        w = Window.partitionBy('userId').orderBy(F.col('cooccurrence').desc())
        
        subset_user_last_movie_id_df = users_df \
            .join(self.user_last_movie_id_df, 'userId', how='left') \
            .select('userId', 'lastMovieId')
        
        user_raw_recommendations_df = subset_user_last_movie_id_df \
            .join(self.cooccurrence_df, 
                  subset_user_last_movie_id_df['lastMovieId'] == self.cooccurrence_df['movieId'],
                  how='left') \
            .withColumn('rawRecommendations', F.collect_list('movieId2').over(w)) \
            .groupBy('userId') \
            .agg(F.max('rawRecommendations').alias('rawRecommendations')) \
            .select('userId', 'rawRecommendations')
        
        user_recommendations_df = user_raw_recommendations_df \
            .withColumn('recommendations', postprocess_recommendations_udf('rawRecommendations')) \
            .select('userId', 'recommendations')

        return user_recommendations_df

In [23]:
assosiations_rule_model = AssosiationsRuleModel(user_last_movie_id_df, cooccurrence_df)

In [24]:
dev_labels = dev_ratings_df \
    .groupby('userId') \
    .agg(F.collect_set('movieId').alias('labels'))

assosiations_rule_model \
    .recommendForUserSubset(dev_labels, 10) \
    .take(10)

[Row(userId=148, recommendations=[260, 4993, 2918, 1265, 56174, 329, 1094, 2997, 1210, 780]),
 Row(userId=463, recommendations=[527, 2028, 3527, 377, 608, 1206, 593, 1080, 7153, 6377]),
 Row(userId=471, recommendations=[]),
 Row(userId=496, recommendations=[5378, 91535, 5826, 27706, 111375, 5196, 73321, 2858, 3948, 72605]),
 Row(userId=833, recommendations=[1203, 95510, 5218, 1196, 79132, 4878, 1682, 5618, 97304, 6365]),
 Row(userId=1088, recommendations=[]),
 Row(userId=1238, recommendations=[2959, 68358, 5952, 7458, 1610, 1923, 293, 48774, 1580, 6870]),
 Row(userId=1342, recommendations=[76093, 50872, 4848, 40412, 5096, 6377, 115231, 50601, 97752, 76210]),
 Row(userId=1580, recommendations=[318, 364, 5, 361, 1213, 1953, 236, 185, 1387, 2987]),
 Row(userId=1591, recommendations=[])]

In [25]:
%%time

for eval_df in [test_ratings_df]:
    metrics = evaluate_recommendations_on(
        model=assosiations_rule_model,
        recommendations_map_fn=lambda rec: rec,
        df=eval_df)
    print(metrics)
    
ASSOSIATIONS_RULE_MODEL_HANDLE = '2.assosiations_rule_model'
all_metrics[ASSOSIATIONS_RULE_MODEL_HANDLE] = metrics

{'NDCG@5': 0.007916663365450152, 'Precision@5': 0.007466423856134762, 'MAP': 0.0019036979068597997, 'Precision@1': 0.008311749035627937, 'Precision@10': 0.006934250435274794, 'NDCG@10': 0.0081844900393176, 'NDCG@1': 0.008311749035627937}
CPU times: user 265 ms, sys: 84.2 ms, total: 349 ms
Wall time: 1min 19s


## Hyperparameter optimization

Tune the parameters on 1% of the data, but train the final model on the full dataset.
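
The tuning below proceeds in stages, two parameters at a time, keeping the best values from one stage fixed in the next. The sketch that follows illustrates that staged structure only: it swaps in an exhaustive search over each pair instead of TPE, and `tune_pair` and `toy_loss` are hypothetical stand-ins rather than the notebook's functions.

```python
from itertools import product


def tune_pair(params, objective, name_a, choices_a, name_b, choices_b):
    """Return a copy of params with the best (name_a, name_b) combination filled in."""
    best_params, best_loss = None, float('inf')
    for value_a, value_b in product(choices_a, choices_b):
        candidate = dict(params, **{name_a: value_a, name_b: value_b})
        loss = objective(candidate)
        if loss < best_loss:
            best_params, best_loss = candidate, loss
    return best_params


def toy_loss(p):
    # Hypothetical stand-in for "-MAP on the dev set"; minimised at rank=12, regParam=0.1, maxIter=8.
    return (p['rank'] - 12) ** 2 + (p['regParam'] - 0.1) ** 2 + (p['maxIter'] - 8) ** 2


params = {'rank': 10, 'regParam': 1.0, 'maxIter': 10}
# Stage 1: model capacity and regularisation.
params = tune_pair(params, toy_loss, 'rank', [8, 10, 12, 15], 'regParam', [0.01, 0.1, 1.0])
# Stage 2: training length, re-checking regularisation with the chosen rank fixed.
params = tune_pair(params, toy_loss, 'maxIter', [8, 10, 12], 'regParam', [0.01, 0.1, 1.0])
print(params)
```

Returning a copy rather than mutating the input keeps each stage's result explicit; staged tuning is cheaper than a joint search but can miss interactions between parameters tuned in different stages.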

In [26]:
train_ratings_df, dev_ratings_df, test_ratings_df = split_ratings_df(sampling_rate=0.01)

249572
239953
5333
4286


In [27]:
!pip3.5 install hyperopt



In [28]:
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK

In [29]:
def als_objective(space):
    estimator = ALS(**space)
    print('SPACE: ' + str(space))
    success = False
    attempts = 0
    model = None
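    while not success and attempts < 2:
        try:
            model = estimator.fit(train_ratings_df)
            success = True
        except Exception as e:
            attempts += 1
            print(e)
            print('Try again')
    if model is None:
        # Both attempts failed; fail loudly instead of passing None to the evaluation.
        raise RuntimeError('ALS training failed for space: ' + str(space))
        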
    dev_metrics = evaluate_recommendations_on(
        model=model,
        recommendations_map_fn=lambda rec: rec.movieId,
        df=dev_ratings_df)    
    print('METRICS: ' + str(dev_metrics))
    
    trial_status_dict = {
        'loss': -dev_metrics['MAP'],
        'status': STATUS_OK,
    }
    return trial_status_dict

In [30]:
def tune_als_paired_params(space, param1_name, param1_choice, param2_name, param2_choice):
    assert param1_name in space
    assert param2_name in space

    space[param1_name] = hp.choice(param1_name, param1_choice)
    space[param2_name] = hp.choice(param2_name, param2_choice)

    trials = Trials()
    best_choices = fmin(fn=als_objective,
                        space=space,
                        algo=tpe.suggest,
                        max_evals=5,
                        trials=trials)

    space[param1_name] = param1_choice[best_choices[param1_name]]
    space[param2_name] = param2_choice[best_choices[param2_name]]
    
    return space

In [31]:
space = get_baseline_als_space()

### Tuning the high-level model definition: explicit vs. implicit user preferences

In [33]:
%%time

tune_als_paired_params(space, 'implicitPrefs', [True, False], 'alpha', [0.75, 1.0, 1.25, 1.5])

SPACE: {'implicitPrefs': True, 'numItemBlocks': 10, 'rank': 10, 'nonnegative': False, 'userCol': 'userId', 'ratingCol': 'rating', 'numUserBlocks': 10, 'itemCol': 'movieId', 'seed': 90, 'regParam': 0.1, 'coldStartStrategy': 'nan', 'maxIter': 10, 'alpha': 0.75}
METRICS: {'NDCG@5': 0.0005588813327732003, 'Precision@5': 0.00018260671079662174, 'MAP': 0.0005598671028938093, 'Precision@1': 0.00022825838849577741, 'Precision@10': 0.00018260671079662182, 'NDCG@10': 0.0008468902088005061, 'NDCG@1': 0.00022825838849577736}
SPACE: {'implicitPrefs': True, 'numItemBlocks': 10, 'rank': 10, 'nonnegative': False, 'userCol': 'userId', 'ratingCol': 'rating', 'numUserBlocks': 10, 'itemCol': 'movieId', 'seed': 90, 'regParam': 0.1, 'coldStartStrategy': 'nan', 'maxIter': 10, 'alpha': 1.25}
METRICS: {'NDCG@5': 0.000657186869881941, 'Precision@5': 0.0002739100661949327, 'MAP': 0.000566479349862139, 'Precision@1': 0.0, 'Precision@10': 0.00027391006619493266, 'NDCG@10': 0.0010431175396105692, 'NDCG@1': 0.0}
SPA

{'alpha': 1.0,
 'coldStartStrategy': 'nan',
 'implicitPrefs': True,
 'itemCol': 'movieId',
 'maxIter': 10,
 'nonnegative': False,
 'numItemBlocks': 10,
 'numUserBlocks': 10,
 'rank': 10,
 'ratingCol': 'rating',
 'regParam': 0.1,
 'seed': 90,
 'userCol': 'userId'}

### Tuning the factorization constraints

In [38]:
%%time

tune_als_paired_params(space, 'rank', [8, 10, 12, 15], 'nonnegative', [True, False])

SPACE: {'implicitPrefs': True, 'numItemBlocks': 10, 'rank': 8, 'nonnegative': True, 'userCol': 'userId', 'ratingCol': 'rating', 'numUserBlocks': 10, 'itemCol': 'movieId', 'seed': 90, 'regParam': 1.0, 'coldStartStrategy': 'nan', 'maxIter': 12, 'alpha': 1.0}
METRICS: {'NDCG@5': 0.0008029633784081257, 'Precision@5': 0.0002739100661949328, 'MAP': 0.0008525269652866089, 'Precision@1': 0.00022825838849577722, 'Precision@10': 0.0003423875827436659, 'NDCG@10': 0.0013894059677839019, 'NDCG@1': 0.00022825838849577706}
SPACE: {'implicitPrefs': True, 'numItemBlocks': 10, 'rank': 12, 'nonnegative': True, 'userCol': 'userId', 'ratingCol': 'rating', 'numUserBlocks': 10, 'itemCol': 'movieId', 'seed': 90, 'regParam': 0.1, 'coldStartStrategy': 'nan', 'maxIter': 8, 'alpha': 1.0}
METRICS: {'NDCG@5': 0.0008872067580996447, 'Precision@5': 0.00027391006619493266, 'MAP': 0.0009428339547033911, 'Precision@1': 0.0004565167769915549, 'Precision@10': 0.00029673590504451054, 'NDCG@10': 0.0013904146709633567, 'NDCG

{'alpha': 1.0,
 'coldStartStrategy': 'nan',
 'implicitPrefs': True,
 'itemCol': 'movieId',
 'maxIter': <hyperopt.pyll.base.Apply at 0x7f62d6ada9b0>,
 'nonnegative': True,
 'numItemBlocks': 10,
 'numUserBlocks': 10,
 'rank': 12,
 'ratingCol': 'rating',
 'regParam': <hyperopt.pyll.base.Apply at 0x7f62d6adac50>,
 'seed': 90,
 'userCol': 'userId'}

### Tuning the training procedure for the model configuration selected above

In [39]:
%%time

tune_als_paired_params(space, 'maxIter', [8, 10, 12], 'regParam', [1e-2, 1e-1, 1e-0])

SPACE: {'implicitPrefs': True, 'numItemBlocks': 10, 'rank': 12, 'nonnegative': True, 'userCol': 'userId', 'ratingCol': 'rating', 'numUserBlocks': 10, 'itemCol': 'movieId', 'seed': 90, 'regParam': 0.1, 'coldStartStrategy': 'nan', 'maxIter': 8, 'alpha': 1.0}
METRICS: {'NDCG@5': 0.0008872067580996445, 'Precision@5': 0.00027391006619493277, 'MAP': 0.0009428339547033905, 'Precision@1': 0.00045651677699155466, 'Precision@10': 0.0002967359050445105, 'NDCG@10': 0.0013904146709633547, 'NDCG@1': 0.0004565167769915546}
SPACE: {'implicitPrefs': True, 'numItemBlocks': 10, 'rank': 12, 'nonnegative': True, 'userCol': 'userId', 'ratingCol': 'rating', 'numUserBlocks': 10, 'itemCol': 'movieId', 'seed': 90, 'regParam': 0.1, 'coldStartStrategy': 'nan', 'maxIter': 8, 'alpha': 1.0}
METRICS: {'NDCG@5': 0.0008872067580996447, 'Precision@5': 0.0002739100661949328, 'MAP': 0.0009428339547033904, 'Precision@1': 0.0004565167769915544, 'Precision@10': 0.0002967359050445104, 'NDCG@10': 0.001390414670963356, 'NDCG@1'

{'alpha': 1.0,
 'coldStartStrategy': 'nan',
 'implicitPrefs': True,
 'itemCol': 'movieId',
 'maxIter': 8,
 'nonnegative': True,
 'numItemBlocks': 10,
 'numUserBlocks': 10,
 'rank': 12,
 'ratingCol': 'rating',
 'regParam': 0.1,
 'seed': 90,
 'userCol': 'userId'}

As noted before, we use the whole ratings set for training and evaluating the tuned ALS model.

In [None]:
train_ratings_df, dev_ratings_df, test_ratings_df = split_ratings_df()

In [None]:
als = ALS(**space)
tuned_als_model = als.fit(train_ratings_df)

In [None]:
%%time

for eval_df in [test_ratings_df]:
    metrics = evaluate_recommendations_on(
        model=tuned_als_model,
        recommendations_map_fn=lambda rec: rec.movieId,
        df=eval_df)
    print(metrics)
    
TUNED_ALS_HANDLE = '3.als_tuned'
all_metrics[TUNED_ALS_HANDLE] = metrics

## Evaluation Results

Compare the implemented methods using the chosen metrics. Do not forget about hyperparameter optimization.

In [32]:
get_ate(all_metrics, BASELINE_HANDLE)

| Metric | 1.als_baseline | 2.assosiations_rule_model |
|---|---|---|
| MAP | 0.0 | 230471.858756 |
| NDCG@1 | 0.0 | 16787.5 |
| NDCG@10 | 0.0 | 47154.894901 |
| NDCG@5 | 0.0 | 33744.044662 |
| Precision@1 | 0.0 | 16787.5 |
| Precision@10 | 0.0 | 59221.052632 |
| Precision@5 | 0.0 | 43242.857143 |
