


**Введение**

В задании предлагалось сравнить два подхода к построению рекоммеднаций: коллаборативный и гибридный (коллаборативный+контентый). В качестве данных для обучения и валидации использовался MOvieLens20M - датасет, содержащий информацию об оценках 138493 пользователей для 27278 фильмов, а также некоторую дополнительную информацию об этих фильмах.

**Формирование датасета**

Начнем с построения dataframe, содержащего данные, необходимые для дальнейших операций. Из всего MOvieLens20M нам понадобятся только два csv-файла - "rating.csv" и "movie.csv". "rating.csv" содержит информацию об оценках пользователей для конкретных фильмов, данные этого файла будут использоваться для построения чисто коллаборативной модели. 'movie.csv' содержит информацию о названии фильмов и их жанрах, данные этого файла вместе с данными из "rating.csv" будут использоваться для построения гибридной (коллаборативной+контентой) модели.

Финальный dataframe имеет следующие поля данных: 'userId' - id пользователя; 'movieId' - id фильма; 'rating' - оценка, данная пользователем с userId фильму с movieId; 'timestamp' - время, когда была сделана оценка; 19 дополнительных полей, содержащих инфорацию о жанрах фильма с movieId (Adventure, Animation, Children, Comedy	.....). Для жанров применялось one-hot кодирование: 1 - если в списке жанров фильма, есть данный жанр; 0 - если нет.

Также хотелось бы отметить, что из-за недостаточно высоких вычислительных ресурсов ноутбука и с целью снижения времени обучения, исходный датасет был уменьшен, из него были выбраны только 15000 самых часто встречающихся пользователей и 3000 самых часто встречающихся фильмов.

Функция "create_dataframe" создает dataframe, сам dataframe можно увидеть в ячейке [4].

In [1]:
import pandas as pd
import numpy as np


In [2]:
def create_dataframe(n_users, n_items, rating_file_name, movie_file_name):
    
    """Return dataframe with ratings and one-hot encoded movies' genres. 
    
    Size of the returned dataframe is reduced in comparison with the original datasets: in the final dataframe 
    only n_users most frequent users and n_items most frequent movies are taken into account. """

    df_moive_with_genres=pd.read_csv(movie_file_name)
    df_init=pd.read_csv(rating_file_name)

    df_moive_with_genres['genres']=df_moive_with_genres['genres'].apply(lambda x: x.split('|'))



    for index, row in df_moive_with_genres.iterrows():

        for genre in row['genres']:
            df_moive_with_genres.at[index, genre] = 1

    df_moive_with_genres=df_moive_with_genres.fillna(0)


    
    df_with_feat_full=df_init.merge(df_moive_with_genres, on='movieId')
    
    from collections import Counter
    ucount = Counter(df_init['userId'])
    mcount = Counter(df_init['movieId'])
    
    top_userid = [u for u,c in ucount.most_common(n_users)]
    top_movieid = [i for i, c in mcount.most_common(n_items)]
    
    df_with_feat= df_with_feat_full[df_with_feat_full['userId'].isin(top_userid) & df_with_feat_full['movieId'].isin(top_movieid)].copy()


    df_with_feat.drop(['title', 'genres','(no genres listed)'], axis=1, inplace=True)
    
    return df_with_feat
    

In [3]:
df=create_dataframe(15000,3000,'rating.csv','movie.csv')

In [4]:
df

Unnamed: 0,userId,movieId,rating,timestamp,Adventure,Animation,Children,Comedy,Fantasy,Romance,...,Thriller,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir
5,54,2,3.0,2000-11-22 18:36:16,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,91,2,3.5,2005-03-29 01:55:58,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,116,2,2.0,2005-11-23 06:41:08,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
15,131,2,1.0,2009-03-29 11:41:01,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
16,132,2,3.0,2005-04-22 12:29:57,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19572775,137277,8948,3.0,2014-03-07 22:43:33,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
19572779,137893,8948,4.0,2008-12-19 03:51:01,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
19572782,138067,8948,1.5,2005-06-08 07:20:14,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
19572783,138200,8948,3.0,2009-03-18 23:37:36,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


**Разбиение на трэйн, валидацию и тест производилось из следующих соображений:** исходный dataframe сортируется в порядке возрастания по полю "timestamp" с целью избежания дата-ликов; параметр "train_val_frac" показывает долю сортированного dataframe (считая от его начала), которая используется для обучения и валидации моделей; (1 - "train_val_frac") - доля отсортированного dataframe (считая от его конца), использованная для теста моделей. На втором шаге, датасет для обучения и валидации разбивается на train и valiadtion с долей разбиения "train_frac" для обучения.

Функция "trainval_test_split" осуществляет разбиение dataframe на трейн-валидацию и тест.

Функция "train_val_split" осуществляет разбиение датасета трейн-валидация (из предыдущей функции) на обучающую и валидационную выборки.

Вид train, validation и test сетов можно посмотреть в ячейках [9], [10] и [11] соответственно.

In [5]:
def trainval_test_split(df, train_val_frac=0.8):
    
    """Sort initial dataframe by timestamp, remove timestamp column and return train-validation and test subsets. 
    train_val_frac - is a part of the entire dataset which is used as a train-validation set."""
    
    
    df=df.sort_values('timestamp', ascending=True)

    test_frac=1-train_val_frac
    train_val=df.iloc[:int(train_val_frac*len(df))].copy()
    test=df.iloc[int(train_val_frac*len(df))+1:].copy()
    train_val.drop('timestamp', axis=1, inplace=True)
    test.drop('timestamp', axis=1, inplace=True)
    
    return train_val, test

In [6]:
train_val, test=trainval_test_split(df, train_val_frac=0.8)

In [7]:
def train_val_split(train_val, train_frac=0.9):
    
    """Split train-validation set (sorted by timestamp) into train and validation subsets 
    with "train_frac" being the portion of the train subset."""
    
    train=train_val.iloc[:int(train_frac*len(train_val))]
    validation=train_val.iloc[int(train_frac*len(train_val))+1:]
    
    return train, validation

In [8]:
train, validation = train_val_split(train_val, train_frac=0.9)

In [9]:
train

Unnamed: 0,userId,movieId,rating,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,...,Thriller,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir
163287,130558,50,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8087470,130558,25,5.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7371005,130558,21,4.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5181524,130558,17,5.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3127224,130558,24,3.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5423970,47866,508,3.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3259927,47866,442,3.5,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4575145,47866,2916,3.5,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2406906,47866,7153,4.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [10]:
validation

Unnamed: 0,userId,movieId,rating,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,...,Thriller,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir
1214179,47866,1246,4.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1630333,47866,2291,3.5,0.0,0.0,0.0,0.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1155157,47866,1222,3.5,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
6300450,47866,1183,2.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
6396271,47866,1391,3.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16176008,76987,33162,4.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
9272725,76987,8641,4.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
93825,58222,47,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
350268,58222,296,4.5,0.0,0.0,0.0,1.0,0.0,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [11]:
test

Unnamed: 0,userId,movieId,rating,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,...,Thriller,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir
13481060,76987,7373,4.5,1.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
307223,58222,293,5.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
798780,58222,1089,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3693819,58222,1213,4.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1156816,58222,1222,4.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16651637,70232,58998,2.5,0.0,0.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4243329,16978,2093,3.5,1.0,0.0,1.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9618691,89081,55232,3.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
18690497,89081,52458,4.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


**Для построения коллаборативной и гибридной моделей будем использовать библиотеку LightFM.** Функции "get_movie_features", "create_skeleton", "build_interactions_weights", "build_movie_features" используются для преобразования данных в соотвествующий для LightFM формат. Подробно на этих функциях останавливаться не будем, так как сути данных они не меняют, меняют лишь форму. Их примерное описание представлено в самом коде.

In [12]:
def get_movie_features(data):
    
    """Return list of all additional movie features (genres) and all possible values for them"""
    
    movie_features_names=[]
    for feature in data.columns:
        if feature[0].istitle():
            
            movie_features_names+=[feature]*len(data[feature].unique())
            
            
    unique_feature_values=[]   
    
    for feature in movie_features_names:
         unique_feature_values+=list(data[feature].unique())
            
    movie_features_list=[]
            
    for x,y in zip(movie_features_names, unique_feature_values):
        res = str(x)+ ":" +str(y)
        movie_features_list.append(res)
    
            
    
    
    return movie_features_list
    

In [13]:
movie_features=get_movie_features(df)

In [14]:
def create_skeleton(data, item_features):
    
    """Return a special structure of the dataset required for LightFM"""
    
    from lightfm.data import Dataset
    
    skeleton = Dataset()
    users=list(data['userId'].unique())
    items=list(data['movieId'].unique())
    
    skeleton.fit(users, items, item_features=item_features)
    
    return skeleton

In [15]:
skeleton=create_skeleton(df, movie_features)

  "LightFM was compiled without OpenMP support. "


In [16]:
def build_interactions_weights(data, skeleton):
    
    """Return two sparse matrices: "interactions" and "weights".
    
    "interactions" matrix contains information about the presence of interaction between a user and an item.
    
    "weights" matrix contains quantifies for that interactions."""
    
    
    interactions, weights=skeleton.build_interactions(((data.iloc[i]['userId'],data.iloc[i]['movieId'],
                                                         data.iloc[i]['rating']) for i in range(len(data))))
    
    return interactions, weights

In [17]:
train_interactions, train_weights=build_interactions_weights(train, skeleton)

In [18]:
def build_movie_features(data,skeleton):
    
    """Return list of movies and corresponing additional features (genres) in the appropriate for LightFM format"""
    
    feature_list=[]
    
    movies=list(data['movieId'].unique())
    for movie in movies:

        
        temp=data[data['movieId']==movie].iloc[0]
        
        add_list=[]
        
        for feature in data.columns:           
            if feature[0].istitle():
                res=feature+':'+str(temp[feature])
                add_list.append(res)
                
            
        feature_list.append((movie,add_list))
        
        
                
    movie_features=skeleton.build_item_features(feature_list, normalize= True)
    
    return movie_features
                

In [19]:
item_features=build_movie_features(df,skeleton)

**В качестве метрики будем использовать nDCG10 из следующих соображений:**
1. Данная метрика оценивает именно качество ранжирования, а не точность предсказания рейтинга для конкретного фильма (как RMSE, например).
2. Метрика - нормализованная (варьируется в диапазоне [0,1]), причем nDCG10 учитывает еще и позицию релевантых элементов.

Функция "ndcg_score" вычисляет метрику nDCG10, "dcg_score" - вспомогательная функция для вычисления nDCG10. 

In [20]:
def dcg_score(y_true, y_score, k):
    
    """Return dcg_score at k for y_true and y_score arrays"""
    
    order = np.argsort(y_score)[::-1]
    y_true = np.take(y_true, order[:k])

    gains = 2 ** y_true - 1

    discounts = np.log2(np.arange(len(y_true)) + 2)
    
    return np.sum(gains / discounts)

In [21]:
def ndcg_score(df_pred, k_val=10):
    
    """Return average (for all users) ndcg_score at k for the test or validation set"""
    
    
    users_test=df_pred['uid'].unique()
    ndcg=0
    count=0
    
    for user in users_test:
        y_true=np.array(df_pred[df_pred['uid']==user]['r_ui'])
        y_score=np.array(df_pred[df_pred['uid']==user]['scores'])
        
        dcg=dcg_score(y_true, y_score, k_val)
        idcg=dcg_score(y_true, y_true, k_val)
        
        if idcg!=0:
            ndcg+=dcg/idcg
            count+=1
            
    if count!=0:
        return ndcg/count
    else:
        return None

**Перед потсроением моделей оценим нашу метрику на baseline**. В качестве baseline будем использовать рекоммендацию Top10 наиболее часто просматриваемых (популярных) фильмов из train. Каждому пользователю из test будем рекомендовать 10 самых популярных фильмов из train, причем ранжировать фильмы будем в порядке убывания популярности. То есть, самый часто просматриваемый фильм будет первым в списке рекомендаций, а самый менее просматриваемый из этих Top10 будет последним.

Функции "match_test_rows_with_top_n_movies" и "create_df_pred_for_baseline" являются вспомогательными. Они добавляют к test сету поле 'score', в случае baseline иммитирующее предсказания модели, а также трансформируют test сет с полем 'score' в форму, необходимую для вычисления функции "ndcg_score".

Для baseline nDCG10 = 0.576 (см. в ячейке 25)

In [22]:
#find the most popular (frequently watched) movies in the train set

from collections import Counter

top_n=10

mcount = Counter(train['movieId'])

top_movieid_train = [i for i, c in mcount.most_common(top_n)]



In [23]:
def match_test_rows_with_top_n_movies(x, top_movieid_train):
    
    """This function is additional for "create_df_pred_for_baseline". Calculates scores for the 10 most popular 
    (frequently watched) movies based on their popularity."""
    
    if x['movieId'] in top_movieid_train:
        
        score=(10-top_movieid_train.index(x['movieId']))/2
        
        return score
    else:
        return 0

In [24]:
def create_df_pred_for_baseline(test):
    
    """Returns dataframe with "userId", "movieId", "r_ui" (true rating) and "scores" predicted by the baseline model."""
    
    df_pred=test.copy()
    
    df_pred['scores']=df_pred.apply(lambda x: match_test_rows_with_top_n_movies(x, top_movieid_train), axis=1)
    
    df_pred.rename(columns={"userId": "uid","movieId": "iid", 'rating':'r_ui'}, inplace=True)
    
    return df_pred

In [25]:
# nDCG10 value for the baseline model

df_pred=create_df_pred_for_baseline(test)
print("For baseline NDCG10 =",ndcg_score(df_pred, k_val=10))

For baseline NDCG10 = 0.5763855341638605


**Построение моделей**

Функция "get_trained_LightFM" создает, обучает (на предобработанных данных из train) и возвращает обученную LightFM модель. Параметр 'item_features' функции отвечает за присутствие дополнительных данных о фильмах (жанрах в нашем случае). 

*Если 'item_features'= None*, то никакой, кроме рейтингов, дополнительной информации модели не сообщается и модель реализует чисто коллаборативный подход.

*Если 'item_features' не None*, то модель получает дополнительную информацию о фильмах и реализует гибридный подход (коллаборативный+контентый).

Функция "create_df_pred_for_LightFM" является вспомогательной. Она добавляет к validation и test сетам поле 'score', содержащее информацию о предсказаниях модели, а также трансформирует validation и test сеты с полем 'score' в форму, необходимую для вычисления метрики.

In [26]:
def get_trained_LightFM(train_interactions, train_weights, item_features, params):
    
    """Create and fit LightFM model, and then return the fitted model"""
    
    from lightfm import LightFM
    
    no_components=params['no_components']
    epochs=params['epochs']
    item_alpha =params['item_alpha']
    user_alpha=params['user_alpha']
    
    model = LightFM(no_components=no_components,loss='warp', 
                    item_alpha=item_alpha, user_alpha=user_alpha)
    
    model.fit(train_interactions,
      item_features= item_features,
      sample_weight= train_weights,
      epochs=epochs,num_threads=3)
    
    return model
    

In [27]:
def create_df_pred_for_LightFM(model, test, skeleton):
    
    """Return dataframe with "userId", "movieId", "r_ui" (true rating) and "scores" predicted by the LightFM model"""
    
    user_id_map, user_feature_map, item_id_map, item_feature_map=skeleton.mapping()
    
    df_pred=test.copy()
    scores=[]
    
    for i in range(len(test)):
        
        temp=test.iloc[i]
        score=model.predict(np.array([user_id_map[temp['userId']]]), np.array([item_id_map[temp['movieId']]]))
        scores.append(score[0])
    
    
    df_pred['scores']=scores
    
    df_pred.rename(columns={"userId": "uid","movieId": "iid", 'rating':'r_ui'}, inplace=True)
    
    return df_pred

**Сначала рассмотрим модель с коллаборативным подходом. В ней дополнтиельные фичи фильмов, связанные с жанрами, не учитываются (item_features=None).**

Проварьируем различные параметры модели, сеты параметров представлены в списке 'list_params' (см. ячейку ниже). Для каждого набора параметров выведем следующие результаты: время обучения модели (Training time) и значение нашей метрики nDCG10 на validation выборке (см. ячейку ниже).

Из данных результатов можно сделать следующие выводы:
1. Увеличение параметра 'no_components', отвечающего за размерость векторов представления пользователей и фильмов, сильно увеличивает время обучение модели и, при прочих равных, практически не улучшает значение метрики (а в некоторых случаях даже ухудшает).

2. Увеличение параметров 'item_alpha' и 'user_alpha', отвечающих за регуляризацию функции потерь, так же увеличивает время обучения (пусть и менее значительно, чем 'no_components') и ухудшает значение метрики.

Оптимальным с точки зрения времени обучения и значения метрики является набор параметров: {'no_components':10, 'epochs':20, 'item_alpha':0, 'user_alpha':0}. Для этого набора время обучения составило 6 мин, а nDCG10=0.634.

In [28]:
#fitting of the collaborative model

list_params=[{'no_components':10, 'epochs':20, 'item_alpha':0, 'user_alpha':0},
             {'no_components':10, 'epochs':20, 'item_alpha':0.01, 'user_alpha':0.01},
             {'no_components':50, 'epochs':20, 'item_alpha':0, 'user_alpha':0},
             {'no_components':50, 'epochs':20, 'item_alpha':0.01, 'user_alpha':0.01}]

print('For pure collaborative model:\n\n')

import time

for params in list_params:
    
    start_time=time.monotonic()
    
    model=get_trained_LightFM(train_interactions, train_weights, None, params)
    
    finish_time=time.monotonic()
    
    df_pred=create_df_pred_for_LightFM(model, validation, skeleton)
        
    print('==========================')
    print('For params:', params,'\n\n' 'Training time (min):',round(2*(finish_time-start_time)/60)/2 ,'       NDCG10=',ndcg_score(df_pred, k_val=10),'\n==========================\n')

For pure collaborative model:


For params: {'no_components': 10, 'epochs': 20, 'item_alpha': 0, 'user_alpha': 0} 

Training time (min): 6.0        NDCG10= 0.6343169332443236 

For params: {'no_components': 10, 'epochs': 20, 'item_alpha': 0.01, 'user_alpha': 0.01} 

Training time (min): 7.0        NDCG10= 0.6188044006276578 

For params: {'no_components': 50, 'epochs': 20, 'item_alpha': 0, 'user_alpha': 0} 

Training time (min): 19.5        NDCG10= 0.6375469138306958 

For params: {'no_components': 50, 'epochs': 20, 'item_alpha': 0.01, 'user_alpha': 0.01} 

Training time (min): 27.0        NDCG10= 0.6098998690706406 



Проведем финальную оценку модели с колаборативным подходом на тестовой выборке (см. ячейку ниже). Полученное на тесте значение метрики nDCG10 = 0.601 , что превосходит значение метрики для baseline (0.576) примерно на 4.3 %. Может показаться, что прирост метрики не велик, однако, очень важно понимать, что рекомендация Top10 фильмов из train является достаточно сильным baseline, перебить который бывает непросто.

In [29]:
# nDCG10 value for the collaborative model

params= {'no_components': 10, 'epochs': 20, 'item_alpha': 0, 'user_alpha': 0} 
model=get_trained_LightFM(train_interactions, train_weights, None, params)
df_pred=create_df_pred_for_LightFM(model, test, skeleton)

print('For pure collaborative model:')
print('Test NDCG10 =', ndcg_score(df_pred, k_val=10))

For pure collaborative model:
Test NDCG10 = 0.6014518719202534


**Теперь рассмотрим гибридную. В ней  item_features учитываются.** 

Обучим гибридную модель на оптимальном для коллаборативной модели сете параметров и выведем время обучения и значение метрики на validation выборке. (сет параметров для коллаборативной модели оказался оптимальным и для данной). Результаты представлены в ячейке ниже.

Время обучения составило 40 мин, что значительно выше времени обучения коллаборативной модели. Значение метрики на валидационной выборке для гибридной модели (0.631) примерно на 0.5 %  меньше чем значение для коллаборативной модели (0.634). Вероятно, это связано с тем, что гибридная модель значительно сложнее коллаборативной, и, соотвественно, гибридная модель может хуже обучаться на том же train сете.

In [32]:
#fitting of the hybrid model


list_params=[{'no_components':10, 'epochs':20, 'item_alpha':0, 'user_alpha':0}]

print('For hybrid model:\n\n')

import time

for params in list_params:
    
    start_time=time.monotonic()
    
    model=get_trained_LightFM(train_interactions, train_weights, item_features, params)
    
    finish_time=time.monotonic()

    df_pred=create_df_pred_for_LightFM(model, validation, skeleton)
    
    print('==========================')
    print('For params:', params,'\n\n' 'Training time (min):',round(2*(finish_time-start_time)/60)/2 ,'       NDCG10=',ndcg_score(df_pred, k_val=10),'\n==========================\n')

For hybrid model:


For params: {'no_components': 10, 'epochs': 20, 'item_alpha': 0, 'user_alpha': 0} 

Training time (min): 40.0        NDCG10= 0.6307978126973641 



Проведем финальную оценку гибридной модели на тестовой выборке (см. ячейку ниже). Полученное значение метрики nDCG10 = 0.590, что примерно на 2.4 % превосходит значение метрики для baseline (0.576). Как и на валидационной, на тестовой выборке значение nDCG10 гибридной модели хуже значения коллаборативной, что опять-таки связано с особенностями, озвученными выше.

In [33]:
# nDCG10 value for the hybrid model

df_pred=create_df_pred_for_LightFM(model, test, skeleton)

print('For hybrid model:')
print('Test NDCG10 =', ndcg_score(df_pred, k_val=10))

For hybrid model:
Test NDCG10 = 0.5904613576105188


**Выводы:** 

1. Полученны следующие значения метрики nDCG10 на тестовой выборке: для baseline - 0.576; для коллаборативной модели - 0.601; для гибридной модели - 0.590.  

2. И коллаборативная, и гибридная модели превосходят baseline-модель (рекоммендация 10 самых популярных фильмов из train) по значению метрики. Однако, превосходят несильно, поэтому стоит провести еще и статистическую оценку результатов, например, с помощью метода bootsrap. 

3. Гибридная модель немного проигрывает коллабортивной в значении nDCG10. Однако, коллаборативная модель страдает от так называемой "cold-start" проблемы, связанной с недостатком информации при добавлении нового объекта. Гибридная модель помогает решить эту проблему, так как содержит некоторую дополнительную ифнормацию о фильмах, поэтому чуть меньшее значение метрики в гибридной модели компенсируется ее возможностью взаимодействовать с новыми объектами.

**На что не хватило отведенного времени:** 

1. Провести статистическую оценку результатов (например,  с помощью bootstrap).

2. Возможно, попробовать более экзотическое разбиение на трэйн, валидацию и тест. Однако, на данном этапе моего понимания задачи, кажется, что все остальные варианты, кроме банальной сортивовки по timestamp и последующему разбиению в порядке возрасатания timestamp, в большей или меньшей степени приведут к дата-ликам.

3. Попробовать модели из других библиотек (например, SVD, SVD++, NMF из scikit-surprise) и оценить их.