### Описание задания:

- Использовать датасет ml-latest.
- Выбрать подходящий подход к построению гибридным системам для предоставления рекомендаций.
- Написать свою гибридную систему

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
import wget
import zipfile

import pandas as pd
import numpy as np

from lightfm import LightFM
import lightfm.data as light
from lightfm.evaluation import precision_at_k
from lightfm.evaluation import auc_score

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import LabelEncoder

from scipy.sparse import coo_matrix

from surprise import accuracy
from surprise import SVD
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import train_test_split
from surprise.model_selection import GridSearchCV

In [3]:
warnings.filterwarnings('ignore')

In [4]:
dataset = 'ml-latest' 
#Размер датасета 264M

#dataset = 'ml-latest-small'
#Размер датасета 1M - используется для тестового прогона

url = f'https://files.grouplens.org/datasets/movielens/{dataset}.zip'

In [5]:
wget.download(url, f'{dataset}.zip')

100% [......................................................................] 277113433 / 277113433

'ml-latest.zip'

In [6]:
with zipfile.ZipFile(f'{dataset}.zip', 'r') as zip_data:
    zip_data.extractall()

In [7]:
movies = pd.read_csv(f'./{dataset}/movies.csv')
movies.name = 'movies'
ratings = pd.read_csv(f'./{dataset}/ratings.csv')
ratings.name = 'ratings'
tags = pd.read_csv(f'./{dataset}/tags.csv')
tags.name = 'tags'

In [8]:
def get_analises(dataset) -> None:
    print(dataset.name)
    dataset.info()
    print(f'Дублирующих записей: {dataset.duplicated().sum()}')
    print('------------')

### Первичный анализ данных

In [9]:
get_analises(movies)
get_analises(ratings)
get_analises(tags)

movies
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58098 entries, 0 to 58097
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  58098 non-null  int64 
 1   title    58098 non-null  object
 2   genres   58098 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.3+ MB
Дублирующих записей: 0
------------
ratings
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27753444 entries, 0 to 27753443
Data columns (total 4 columns):
 #   Column     Dtype  
---  ------     -----  
 0   userId     int64  
 1   movieId    int64  
 2   rating     float64
 3   timestamp  int64  
dtypes: float64(1), int64(3)
memory usage: 847.0 MB
Дублирующих записей: 0
------------
tags
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1108997 entries, 0 to 1108996
Data columns (total 4 columns):
 #   Column     Non-Null Count    Dtype 
---  ------     --------------    ----- 
 0   userId     1108997 non-null  int64 
 1   movieId    1108

### Подготовка данных

In [10]:
movies_with_ratings = movies.merge(ratings, on='movieId').reset_index(drop=True)
movies_with_ratings.dropna(inplace=True)
movies_with_ratings.head()

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,4,4.0,1113765937
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,10,5.0,948885850
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,14,4.5,1442169375
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15,4.0,1370810063
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,22,4.0,1237622631


In [11]:
dataset = pd.DataFrame({
    'uid': movies_with_ratings.userId,
    'iid': movies_with_ratings.title,
    'rating': movies_with_ratings.rating
})

In [12]:
reader = Reader(rating_scale=(ratings.rating.min(), ratings.rating.max()))
data = Dataset.load_from_df(dataset, reader)

In [15]:
trainset, testset = train_test_split(data, test_size=0.2, random_state=21)

### Обучение RecSys из библиотеки surprise

В качестве модели выбрана SVD библиотеки surprise: пожертвуем немного качеством модели по сравнению с SVDpp, но кратно сократим время.
В качестве параметров возьмём те, что подобраны в проекте [Сollaborative_filtering](https://github.com/msavilov/Recommender_Systems_ML/blob/main/2_Collaborative_filtering/collaborative_filtering.ipynb)

In [16]:
svd = SVD(n_factors=100, n_epochs=50, lr_all=0.01, reg_all=0.08, random_state=21)
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1f846135c00>

In [17]:
predictions = svd.test(testset)
accur_score = accuracy.rmse(predictions, verbose=True)

RMSE: 0.8138


In [18]:
def get_recommend(user_id):
    'Предоставление рекомендаций фильмов по userID'
    user_movies = movies_with_ratings[movies_with_ratings.userId == user_id].title.unique()
    scores = []
    titles = []
    for movie in movies_with_ratings.title.unique():
        if movie in user_movies:
            continue
        
        scores.append(svd.predict(uid=user_id, iid=movie).est)
        titles.append(movie)
    titles = np.array(titles)
    return titles[np.argsort(-np.array(scores))[:10]]

In [19]:
userId = 30
print(f'Для пользователя с ID {userId} рекомендуются следующие фильмы:')
print('--------------')
print(*get_recommend(userId), sep='\n')

Для пользователя с ID 30 рекомендуются следующие фильмы:
--------------
Loot (1970)
Story of Science, The (2010)
Dreams with Sharp Teeth (2008)
Emo Philips Live (1987)
Vergeef me
Perceval (1978)
Grin Without a Cat, A (Fond de l'air est rouge, Le) (1977)
O Pátio das Cantigas (1942)
Frozen North, The (2006)
Godfather, The (1972)


### Обучение RecSys на основе content-based с применением TF-IDF

В качестве модели content-based модель с применением TF-IDF к признаку genre. Рекомендации посчитаем с помощью метода NearestNeighbors.

In [20]:
def get_X_y(data):
    tfidf = TfidfVectorizer()
    data['genres'] = data['genres'].apply(lambda val: val.replace('|', ' '))
    X = pd.DataFrame(tfidf.fit_transform(data['genres']).toarray(),
                       columns=tfidf.get_feature_names_out()).reset_index(drop=True)
    return X, data['rating']

In [21]:
def change_string(s):
    return ' '.join(s.replace(' ', '').replace('-', '').split('|'))

In [22]:
movie_genres = [change_string(g) for g in movies.genres.values]

In [23]:
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(movie_genres)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

neigh = NearestNeighbors(n_neighbors=20, n_jobs=-1, metric='euclidean') 
neigh.fit(X_train_tfidf)

Добавим дополнительный признак для фильмов: их средний рейтинг. После подбора похожих фильмов, отсортируем их по более рейтинговым

In [24]:
mean_movie_rating = movies_with_ratings.groupby('movieId')['rating'].mean()
movies_with_ratings['mean_movie_rating'] = movies_with_ratings['movieId'].apply(lambda x: mean_movie_rating[x])

In [25]:
def get_rec_by_genres(genres):
    predict = count_vect.transform([genres])
    X_tfidf2 = tfidf_transformer.transform(predict)
    res = neigh.kneighbors(X_tfidf2, return_distance=True)
    return np.array(movies_with_ratings.iloc[res[1][0]].sort_values(by=['mean_movie_rating'], ascending=False)['title'].unique()[:10])

In [43]:
genres = 'Adventure Comedy Fantasy Crime'
print(f'По жанрам "{genres}" рекомендуются следующие фильмы:')
print('--------------')
print(*get_rec_by_genres(genres), sep='\n')

По жанрам "Adventure Comedy Fantasy Crime" рекомендуются следующие фильмы:
--------------
Like Water for Chocolate (Como agua para chocolate) (1992)
Apollo 13 (1995)
Dances with Wolves (1990)
Bullets Over Broadway (1994)
Beauty and the Beast (1991)
Leaving Las Vegas (1995)
Ed Wood (1994)
Get Shorty (1995)
Death and the Maiden (1994)
Rob Roy (1995)


### Гибридная RecSys

Объединяет в себе рекомендательную систему на основе content-based и SVD библиотеки surprise

In [27]:
movies_with_ratings.sort_values('timestamp', inplace=True)

In [28]:
title_genres = {}

for index, row in movies.iterrows():
    title_genres[row.title] = row.genres

In [29]:
def recommend_for_user_by_genre(user_id):
    user_movies = movies_with_ratings[movies_with_ratings.userId == user_id].title.unique()
    
    last_user_movie = user_movies[-1]
    
    movie_genres = title_genres[last_user_movie]
    
    movie_genres = change_string(movie_genres)

    predict = count_vect.transform([movie_genres])
    
    X_tfidf2 = tfidf_transformer.transform(predict)

    res = neigh.kneighbors(X_tfidf2, return_distance=True)
    
    movies_to_score = movies.iloc[res[1][0]].title.values

    scores = []
    titles = []

    for movie in movies_to_score:
        if movie in user_movies:
            continue
            
        scores.append(svd.predict(uid=user_id, iid=movie).est)
        titles.append(movie)
        
    best_indexes = np.argsort(scores)[10:]
    
    for i in reversed(best_indexes):
        print(titles[i], scores[i])

In [30]:
userId = 30
print(f'Для пользователя с ID {userId} рекомендуются следующие фильмы:')
print('--------------')
recommend_for_user_by_genre(userId)

Для пользователя с ID 30 рекомендуются следующие фильмы:
--------------
French Connection, The (1971) 4.065736250230989
Heat (1995) 4.028011572622383
Die Hard (1988) 3.650946297422606
Natural Born Killers (1994) 3.544875454146122
Coffy (1973) 3.4967851763319637
F/X (1986) 3.4507534717508466
Die Hard: With a Vengeance (1995) 3.3893047272108445
Point Break (1991) 3.3385426096538695
Someone to Watch Over Me (1987) 3.263281216937152


### RecSys на основе библиотеки Lightfm

#### Подготовка данных

In [31]:
def create_rate_matrix(df, shuffle = True, split_ratio = 0.8):
    '''
    Split the Pandas DataFrame into train and test according to the split_ratio.
    INPUT:
      - df: Pandas DataFrame of interaction data, including user id, product id, and rate.
      - shuffle: boolean, whether to randomly shuffle the dataframe before splitting
      - split_ratio: the ratio of train and test 
    OUTPUT:
      - rate_matrix: a dictionary, keys ['train', 'test'], value is coo_matrix of the same shape 
    '''
    if shuffle:
        df = df.sample(frac = 1).reset_index(drop = True)
    split_point = np.int(np.round(df.shape[0] * split_ratio))
    df_train = df.iloc[0:split_point]
    df_test = df.iloc[split_point::]
    df_test = df_test[(df_test['userId'].isin(df_train['userId']))&\
                     (df_test['movieId'].isin(df_train['movieId']))]

    print('Train dataset size is %d, test dataset size is %d' 
          % (len(df_train), len(df_test)))
    
    id_cols = ['userId', 'movieId']
    trans_cat_train = dict()
    trans_cat_test = dict()
  
    encoder = dict()
    for k in id_cols:
        le = LabelEncoder()
        trans_cat_train[k] = le.fit_transform(df_train[k].values)
        trans_cat_test[k] = le.transform(df_test[k].values)
        encoder[k] = le
        
    trans_cat_train['rating'] = df_train['rating']
    trans_cat_test['rating'] = df_test['rating']
    
    users = np.unique(trans_cat_train['userId'])
    items = np.unique(trans_cat_train['movieId'])
    n_users = len(users)
    n_items = len(items)    
    print('There are %d users and %d products in dataset.' 
          % (n_users, n_items))
    
    rate_matrix = dict()
    rate_matrix['train'] = coo_matrix((trans_cat_train['rating'],
                                       (trans_cat_train['userId'],
                                        trans_cat_train['movieId'])),
                                      shape = (n_users, n_items))
    
    rate_matrix['test'] = coo_matrix((trans_cat_test['rating'],
                                      (trans_cat_test['userId'],
                                       trans_cat_test['movieId'])),
                                     shape = (n_users, n_items))
    
    return rate_matrix, users, items, encoder

In [32]:
rating_matrix, users, items, encoder_dict = create_rate_matrix(ratings)

Train dataset size is 22202755, test dataset size is 5546575
There are 281964 users and 51610 products in dataset.


In [33]:
tfidf = TfidfVectorizer()
movie_train_tfidf = tfidf.fit_transform(movie_genres)
movie_train_tfidf = pd.DataFrame(movie_train_tfidf.toarray(), columns=tfidf.get_feature_names_out())
movies = movies.drop(columns=['genres'])
movies = pd.concat([movies, movie_train_tfidf], axis=1)

In [34]:
movies['movieId'] = movies['movieId'].apply(lambda x: 
                                                      'other' if x not in encoder_dict['movieId'].classes_ 
                                                      else x)

movies = movies[movies['movieId'] != 'other']
movies['movieId'] = encoder_dict['movieId'].transform(movies['movieId'].values)

In [35]:
columns = movies.columns.to_list()
columns.remove('movieId')

In [36]:
def generate_feature_list(df, columns):
    '''
    Generate the list of features of corresponding columns to list
    In order to fit the lightdm Dataset
    '''
    features = df[columns].apply(
        lambda x: ','.join(x.map(str)), axis = 1)
    features = features.str.split(',')
    features = features.apply(pd.Series).stack().reset_index(drop = True)
    return features

In [37]:
def prepare_item_features(df, columns, id_col_name):
    '''
    Prepare the corresponding feature formats for 
    the lightdm.dataset's build_item_features function
    '''
    features = df[columns].apply(
            lambda x: ','.join(x.map(str)), axis = 1)
    features = features.str.split(',')
    features = list(zip(df[id_col_name], features))
    return features

#### Обучение модели

In [38]:
dataset = light.Dataset()
fitting_item_features = generate_feature_list(movies, columns)
lightdm_features = prepare_item_features(movies, columns, 'movieId')

dataset.fit(users, items, item_features = fitting_item_features)
item_feature = dataset.build_item_features(lightdm_features, 
                                            normalize = True)

In [39]:
model = LightFM()
model.fit(rating_matrix['train'], epochs=10)

train_precision = precision_at_k(model, rating_matrix['train'], k=10).mean()
test_precision = precision_at_k(model, rating_matrix['test'], k=10, train_interactions=rating_matrix['train']).mean()

print('Precision: train %.2f, test %.2f.' % (train_precision, test_precision))

Precision: train 0.23, test 0.11.


In [40]:
movies_title = np.array(movies['title'])
movies_title

array(['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)',
       ..., 'Her Name Was Mumu (2016)', 'Flora (2017)', 'Leal (2018)'],
      dtype=object)

In [41]:
def sample_recommendation(model, data, user_ids):
    n_users, n_items = data['train'].shape
    for user_id in user_ids:
        known_positives = movies_title[data['train'].tocsr()                                    
                          [user_id].indices]
        
        scores = model.predict(user_id, np.arange(n_items))

        top_items = movies_title[np.argsort(-scores)]
        
        print(f'Для пользователей с ID {user_id} рекомендуются следующие фильмы:')
        print('--------------')
        
        for x in top_items[:10]:
            print("        %s" % x)
        print()

In [42]:
userId = [30, 600]
sample_recommendation(model, rating_matrix, userId)

Для пользователей с ID 30 рекомендуются следующие фильмы:
--------------
        Shawshank Redemption, The (1994)
        Forrest Gump (1994)
        Pulp Fiction (1994)
        Silence of the Lambs, The (1991)
        Matrix, The (1999)
        Star Wars: Episode IV - A New Hope (1977)
        Jurassic Park (1993)
        Schindler's List (1993)
        Braveheart (1995)
        Toy Story (1995)

Для пользователей с ID 600 рекомендуются следующие фильмы:
--------------
        Shawshank Redemption, The (1994)
        Forrest Gump (1994)
        Pulp Fiction (1994)
        Silence of the Lambs, The (1991)
        Matrix, The (1999)
        Star Wars: Episode IV - A New Hope (1977)
        Jurassic Park (1993)
        Schindler's List (1993)
        Braveheart (1995)
        Toy Story (1995)



### Вывод:

В проекте были построены соедующие рекомендательные модели на основе:
- SVD библиотеки surprise
- content-based с применением TF-IDF и метода NearestNeighbors
- гибридной модели из двух предыдущих
- библиотеки Lightfm