### Homework: Content-Based Recommendations

Build recommendations (regression, predicting the rating) on these features: TF-IDF on tags and genres   

Mean ratings (+ median, variance, etc.) of the user and the movie   

Evaluate RMSE on the test set
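For reference, RMSE is the square root of the mean squared difference between the true and the predicted ratings. Below is a minimal sketch with NumPy; the helper name `rmse` and the toy arrays are illustrative and not part of the assignment.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy ratings, made up for illustration.
toy_true = [4.0, 3.0, 5.0]
toy_pred = [3.0, 3.0, 4.0]
toy_rmse = rmse(toy_true, toy_pred)  # errors are 1, 0, 1
```

Because the errors are squared before averaging, a few very large misses dominate the metric.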

In [24]:
import pandas as pd
import numpy as np
import scipy.stats

from tqdm import tqdm_notebook

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

%matplotlib inline

Load the data into DataFrames

In [25]:
links = pd.read_csv('links.csv')
movies = pd.read_csv('movies.csv')
ratings = pd.read_csv('ratings.csv')
tags = pd.read_csv('tags.csv')

In [13]:
movies.sample(3)

Unnamed: 0,movieId,title,genres
1587,2126,Snake Eyes (1998),Action|Crime|Mystery|Thriller
4075,5812,Far from Heaven (2002),Drama|Romance
2157,2872,Excalibur (1981),Adventure|Fantasy


Clean the genres feature of special characters and separate the genres with spaces

In [26]:
movies['genres'] = movies.genres.apply(lambda x: ' '.join(x.replace(' ', '').replace('-', '').split('|')))

In [15]:
movies.sample(3)

Unnamed: 0,movieId,title,genres
5447,26059,When a Woman Ascends the Stairs (Onna ga kaida...,Drama
638,813,Larger Than Life (1996),Comedy
4342,6342,"Trip, The (2002)",Comedy Drama Romance
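The cleaning step matters for TF-IDF: the default `TfidfVectorizer` tokenizer splits on non-word characters, so a hyphenated genre such as `Sci-Fi` would otherwise become two tokens. A small sketch of the same transformation on a single string follows; the function name `clean_genres` is hypothetical.

```python
def clean_genres(raw):
    # Drop spaces and hyphens inside each genre name, then join genres with spaces.
    return ' '.join(raw.replace(' ', '').replace('-', '').split('|'))

example = clean_genres('Action|Sci-Fi|Film-Noir')
```

After cleaning, each genre is a single whitespace-separated token.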


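Clean the tags feature of special characters and separate the tags with spaces

In [27]:
# replace(' ', ' ') leaves spaces untouched, so multi-word tags keep their spaces; only hyphens are removed
tags['tag'] = tags.tag.apply(lambda x: ' '.join(x.replace(' ', ' ').replace('-', '').split('|')))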

In [17]:
tags.sample(3)

Unnamed: 0,userId,movieId,tag,timestamp
2675,477,61323,dark comedy,1269832488
1452,474,1466,Mafia,1137191577
2468,474,37741,Truman Capote,1142996187
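Since spaces inside tags are kept, a multi-word tag such as `dark comedy` is later split into separate TF-IDF tokens. The sketch below compares the vocabulary with and without spaces, using the three sample tags shown above; whether one token per tag is preferable is a modelling choice, not something the notebook settles.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

sample_tags = ['dark comedy', 'Mafia', 'Truman Capote']

# Default tokenizer: lowercases and splits on non-word characters.
split_vocab = sorted(TfidfVectorizer().fit(sample_tags).vocabulary_)

# Removing spaces first keeps each tag as one token.
joined = [t.replace(' ', '') for t in sample_tags]
joined_vocab = sorted(TfidfVectorizer().fit(joined).vocabulary_)
```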


Merge the DataFrames

In [28]:
movie_tags_merge = movies.merge(tags, how='left', on='movieId').fillna('0')
print(movie_tags_merge.info())
movie_tags_merge.sample(3)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11853 entries, 0 to 11852
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   movieId    11853 non-null  int64 
 1   title      11853 non-null  object
 2   genres     11853 non-null  object
 3   userId     11853 non-null  object
 4   tag        11853 non-null  object
 5   timestamp  11853 non-null  object
dtypes: int64(1), object(5)
memory usage: 648.2+ KB
None


Unnamed: 0,movieId,title,genres,userId,tag,timestamp
1513,1243,Rosencrantz and Guildenstern Are Dead (1990),Comedy Drama,474,Shakespeare sort of,1137180000.0
11847,193579,Jon Stewart Has Left the Building (2015),Documentary,0,0,0.0
11130,143367,Silence (2016),Drama Thriller,567,tragic,1525280000.0


In [19]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [29]:
movie_ratings_merge = movie_tags_merge.merge(ratings, how='left', on='movieId').fillna(0)

In [21]:
print(movie_ratings_merge.info())
movie_ratings_merge.sample(3)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 285783 entries, 0 to 285782
Data columns (total 9 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   movieId      285783 non-null  int64  
 1   title        285783 non-null  object 
 2   genres       285783 non-null  object 
 3   userId_x     285783 non-null  object 
 4   tag          285783 non-null  object 
 5   timestamp_x  285783 non-null  object 
 6   userId_y     285783 non-null  float64
 7   rating       285783 non-null  float64
 8   timestamp_y  285783 non-null  float64
dtypes: float64(3), int64(1), object(5)
memory usage: 21.8+ MB
None


Unnamed: 0,movieId,title,genres,userId_x,tag,timestamp_x,userId_y,rating,timestamp_y
269300,81591,Black Swan (2010),Drama Thriller,424,surreal,1457850000.0,249.0,4.0,1346847000.0
176180,2959,Fight Club (1999),Action Crime Drama Thriller,599,Nudity (Topless),1498460000.0,514.0,2.5,1533872000.0
197810,4103,Empire of the Sun (1987),Action Adventure Drama War,0,0,0.0,74.0,4.0,1207501000.0
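Note the row count: `ratings` has 100836 rows, but the merged frame has 285783, because each rating is repeated once for every tag attached to the same movie. A toy sketch of this row multiplication is below; the frames `movies_toy`, `tags_toy` and `ratings_toy` are made up for illustration.

```python
import pandas as pd

movies_toy = pd.DataFrame({'movieId': [1, 2], 'title': ['A', 'B']})
tags_toy = pd.DataFrame({'movieId': [1, 1], 'tag': ['funny', 'pixar']})
ratings_toy = pd.DataFrame({'movieId': [1, 1, 1, 2], 'rating': [4.0, 5.0, 3.0, 2.0]})

merged = (movies_toy
          .merge(tags_toy, how='left', on='movieId')
          .merge(ratings_toy, how='left', on='movieId'))

# Movie 1: 2 tags x 3 ratings = 6 rows; movie 2: 1 row (no tags) x 1 rating = 1 row.
n_rows = len(merged)
```

One consequence is that heavily tagged movies contribute many more rows to training and evaluation than untagged ones.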


Add a column with the mean rating for each user. Note that userId_x is the user who wrote the tag; the user who gave the rating is userId_y

In [30]:
movie_ratings_merge['avg_rate_user'] = movie_ratings_merge.groupby('userId_x')['rating'].transform('mean')

Add a column with the mean rating for each movie

In [31]:
movie_ratings_merge['avg_rate_movie'] = movie_ratings_merge.groupby('movieId')['rating'].transform('mean')

In [26]:
movie_ratings_merge.sample(3)

Unnamed: 0,movieId,title,genres,userId_x,tag,timestamp_x,userId_y,rating,timestamp_y,avg_rate_user,avg_rate_movie
225542,7153,"Lord of the Rings: The Return of the King, The...",Action Adventure Drama Fantasy,62,scenic,1528150000.0,514.0,3.0,1533871000.0,3.651496,4.118919
63837,296,Pulp Fiction (1994),Comedy Crime Drama Thriller,599,multiple stories,1498460000.0,566.0,4.0,849005100.0,4.168586,4.197068
76760,296,Pulp Fiction (1994),Comedy Crime Drama Thriller,599,smart writing,1498460000.0,6.0,2.0,845553100.0,4.168586,4.197068
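The task also asks for the median, variance and similar statistics. These can be added with the same `groupby(...).transform(...)` pattern, as in this sketch on a toy frame; the column names `median_rate_user` and `var_rate_user` are assumptions chosen to match the existing naming.

```python
import pandas as pd

toy = pd.DataFrame({
    'userId': [1, 1, 2, 2, 2],
    'rating': [4.0, 2.0, 5.0, 3.0, 4.0],
})

# transform() returns a value for every original row, aligned with the frame's index.
toy['avg_rate_user'] = toy.groupby('userId')['rating'].transform('mean')
toy['median_rate_user'] = toy.groupby('userId')['rating'].transform('median')
toy['var_rate_user'] = toy.groupby('userId')['rating'].transform('var')  # sample variance (ddof=1)
```

A user with a single rating gets NaN for the sample variance, so that column may need a fill value.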


Fill rows containing NaN with the median of the corresponding columns

In [32]:
# Assign the result back instead of using inplace=True on a column selection
movie_ratings_merge['avg_rate_user'] = movie_ratings_merge['avg_rate_user'].fillna(movie_ratings_merge['avg_rate_user'].median())
movie_ratings_merge['avg_rate_movie'] = movie_ratings_merge['avg_rate_movie'].fillna(movie_ratings_merge['avg_rate_movie'].median())
movie_ratings_merge['rating'] = movie_ratings_merge['rating'].fillna(movie_ratings_merge['rating'].median())

In [29]:
movie_ratings_merge.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 285783 entries, 0 to 285782
Data columns (total 11 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   movieId         285783 non-null  int64  
 1   title           285783 non-null  object 
 2   genres          285783 non-null  object 
 3   userId_x        285783 non-null  object 
 4   tag             285783 non-null  object 
 5   timestamp_x     285783 non-null  object 
 6   userId_y        285783 non-null  float64
 7   rating          285783 non-null  float64
 8   timestamp_y     285783 non-null  float64
 9   avg_rate_user   285783 non-null  float64
 10  avg_rate_movie  285783 non-null  float64
dtypes: float64(5), int64(1), object(5)
memory usage: 26.2+ MB


Build the final DataFrame

In [33]:
# .copy() makes the selection an independent frame, so new columns can be added without warnings
movie_result = movie_ratings_merge[['genres', 'tag', 'rating', 'avg_rate_user', 'avg_rate_movie']].copy()
movie_result.sample(3)
# movie_result.info()

Unnamed: 0,genres,tag,rating,avg_rate_user,avg_rate_movie
170536,Action Crime Drama Thriller,action,5.0,4.168586,4.272936
239669,Action Adventure SciFi Thriller,0,1.5,3.285343,3.15
122523,Action Adventure SciFi,classic,4.0,3.872817,4.21564


In [34]:
X_means = movie_result[['avg_rate_user', 'avg_rate_movie']].to_numpy()

Prepare the data for model training

In [81]:
tfidf = (movie_result['genres'] + ' ' + movie_result['tag'] ).values
# y = movie_result['rating'].values

Convert the genre and tag data into vectors and train a Linear Regression model

In [35]:
tfidf_model = TfidfVectorizer()
X_genres = tfidf_model.fit_transform(movie_result['genres']).toarray()

In [36]:
X_tags = tfidf_model.fit_transform(movie_result['tag']).toarray()

In [37]:
X_tfidf = np.hstack((X_tags, X_genres))
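Two things are worth noting about the cells above. The same `tfidf_model` is fitted twice, so afterwards it only holds the tag vocabulary. And `.toarray()` on 285783 rows by roughly 1768 columns of float64 needs about 4 GB of memory. A possible alternative, sketched on a toy frame, is to use one vectorizer per text column and keep everything sparse with `scipy.sparse.hstack`; the variable names here are illustrative.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

toy = pd.DataFrame({
    'genres': ['Comedy Drama', 'Action SciFi', 'Drama'],
    'tag': ['dark comedy', 'aliens', '0'],
    'avg_rate_user': [3.5, 4.0, 3.0],
    'avg_rate_movie': [3.8, 3.2, 4.1],
})

# One vectorizer per column, so both vocabularies stay available for new data.
genres_vec = TfidfVectorizer()
tags_vec = TfidfVectorizer()
X_genres_sp = genres_vec.fit_transform(toy['genres'])
X_tags_sp = tags_vec.fit_transform(toy['tag'])
X_means_sp = csr_matrix(toy[['avg_rate_user', 'avg_rate_movie']].to_numpy())

# Stack column-wise without densifying.
X_sparse = hstack([X_tags_sp, X_genres_sp, X_means_sp]).tocsr()
```

With the default token pattern, the single-character placeholder `0` produces no token, so rows without tags simply get an all-zero tag vector. scikit-learn's `LinearRegression` and `Ridge` both accept sparse input.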

Prepare the data for model training

In [38]:
X = np.hstack((X_tfidf, X_means))
# y = movie_result['rating'].values

In [39]:
# X = (movie_result['genres'] + ' ' + movie_result['tag']).values
y = movie_result['rating'].to_numpy().flatten()

In [50]:
y


array([4. , 4. , 4.5, ..., 3.5, 3.5, 4. ])

Convert the genre and tag data into vectors and train a Linear Regression model

In [40]:
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(X)

AttributeError: 'numpy.ndarray' object has no attribute 'lower'
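The error arises because `TfidfVectorizer` expects an iterable of strings, while `X` is already a numeric matrix: the vectorizer tries to lowercase each row and fails. The sketch below reproduces this on a small made-up array. Since `X` already contains the TF-IDF columns, this cell is not needed for the steps that follow.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

numeric_X = np.array([[0.1, 0.2], [0.3, 0.4]])

try:
    # Each "document" is a NumPy row, which has no .lower() method.
    TfidfVectorizer().fit_transform(numeric_X)
    failed = False
except AttributeError:
    failed = True
```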

In [41]:
X_train, X_test, y_train, y_test= train_test_split(X, y, test_size=0.2, random_state=42)

In [42]:
lr = LinearRegression().fit(X_train, y_train)

In [43]:
y_pred = lr.predict(X_test)

In [47]:
mean_squared_error(y_test, y_pred, squared=False)

2402226351.8992825
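An RMSE of about 2.4e9 on a rating scale that tops out at 5 points to unstable coefficients rather than a merely weak model. One plausible cause (an assumption, not verified here) is that unregularized least squares on many sparse, nearly collinear TF-IDF columns assigns huge weights to terms that are rare in the training split. The sketch below uses synthetic data to show a regularized alternative, `Ridge`, and clips predictions to the 0.5 to 5 range, assuming the usual MovieLens rating scale.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 500
# Synthetic stand-in for the real features: sparse indicator columns plus two "mean" columns.
X_ind = (rng.random((n, 50)) < 0.05).astype(float)
X_avg = rng.uniform(0.5, 5.0, size=(n, 2))
X_demo = np.hstack([X_ind, X_avg])
y_demo = np.clip(0.5 * X_avg[:, 0] + 0.5 * X_avg[:, 1] + rng.normal(0, 0.3, n), 0.5, 5.0)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# L2 regularization keeps coefficients bounded even when columns are collinear.
ridge = Ridge(alpha=1.0).fit(X_tr, y_tr)

# Keep predictions inside the valid rating range.
pred = np.clip(ridge.predict(X_te), 0.5, 5.0)
rmse_ridge = float(np.sqrt(mean_squared_error(y_te, pred)))
```

On the real data, the value of `alpha` would need tuning, for example by cross-validation.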

Add a column with the predictions to our final DataFrame

In [51]:
prediction_rating = lr.predict(X_tfidf)

ValueError: X has 1768 features, but LinearRegression is expecting 1770 features as input.
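The mismatch follows from how the model was trained: `lr` was fitted on `X`, which has the TF-IDF columns plus the two mean-rating columns (1770 in total), while `X_tfidf` holds only the 1768 TF-IDF columns. Predicting on `X` avoids the error. A small sketch of checking the width before predicting is below; `safe_predict` is a hypothetical helper and the arrays are synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_full = rng.random((20, 5))   # stands in for TF-IDF + mean columns
X_partial = X_full[:, :3]      # stands in for the TF-IDF columns only
y_toy = rng.random(20)

model = LinearRegression().fit(X_full, y_toy)

def safe_predict(model, X):
    """Predict only if X has as many columns as the model saw during fit."""
    if X.shape[1] != model.n_features_in_:
        raise ValueError(f'expected {model.n_features_in_} columns, got {X.shape[1]}')
    return model.predict(X)

preds = safe_predict(model, X_full)
```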

In [503]:
movies_result['Prediction'] = prediction_rating

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movies_result['Prediction'] = prediction_rating
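This warning appears when a column is assigned to a frame that pandas considers a slice of another frame. A common remedy is to take an explicit copy when selecting the columns, as in the sketch below; the frame names are illustrative.

```python
import pandas as pd

base = pd.DataFrame({'genres': ['Drama', 'Comedy'], 'rating': [4.0, 3.0], 'extra': [1, 2]})

# An explicit copy makes the subset an independent DataFrame.
subset = base[['genres', 'rating']].copy()
subset['Prediction'] = [3.9, 3.1]
```

The original frame is left untouched, and the assignment no longer triggers the warning.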


In [507]:
movies_result.sample(10)

Unnamed: 0,movieId,genres,tag,rating,Prediction
246974,49530,Action Adventure Crime Drama Thriller War,Jennifer Connelly,4.0,3.681821
9605,110,Action Drama War,Medieval,4.0,4.018135
266175,79132,Action Crime Drama Mystery SciFi Thriller IMAX,complicated,5.0,4.149748
47397,296,Comedy Crime Drama Thriller,ensemble cast,4.5,4.176957
4812,39,Comedy Romance,chick flick,3.0,3.069541
2330,25,Drama Romance,alcoholism,2.0,3.700549
107769,780,Action Adventure SciFi Thriller,aliens,4.0,3.709327
113780,924,Adventure Drama SciFi,revolutionary,4.5,3.951612
54707,296,Comedy Crime Drama Thriller,Harvey Keitel,5.0,4.141394
57984,296,Comedy Crime Drama Thriller,intellectual,4.0,4.143521


#### Train a Linear Regression model on the mean ratings of users and movies

In [137]:
movies_ratings = movies.merge(ratings, how='left', on='movieId')
movies_ratings

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1.0,4.0,9.649827e+08
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5.0,4.0,8.474350e+08
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7.0,4.5,1.106636e+09
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15.0,2.5,1.510578e+09
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,17.0,4.5,1.305696e+09
...,...,...,...,...,...,...
100849,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,184.0,4.0,1.537109e+09
100850,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,184.0,3.5,1.537110e+09
100851,193585,Flint (2017),Drama,184.0,3.5,1.537110e+09
100852,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,184.0,3.5,1.537110e+09


Add a column with the mean rating for each user

In [138]:
movies_ratings['avg_rate_user'] = movies_ratings.groupby('userId')['rating'].transform('mean')

Add a column with the mean rating for each movie

In [139]:
movies_ratings['avg_rate_movie'] = movies_ratings.groupby('movieId')['rating'].transform('mean')

In [140]:
movies_ratings.sample(3)

Unnamed: 0,movieId,title,genres,userId,rating,timestamp,avg_rate_user,avg_rate_movie
37595,1982,Halloween (1978),Horror,561.0,4.0,1491092000.0,3.372277,3.722222
83574,48516,"Departed, The (2006)",Crime|Drama|Thriller,317.0,5.0,1430362000.0,3.730159,4.252336
35481,1754,Fallen (1998),Crime|Drama|Fantasy|Thriller,19.0,2.0,965711200.0,2.607397,3.5


Fill rows containing NaN with the median of the corresponding columns

In [145]:
# Assign the result back instead of using inplace=True on a column selection
movies_ratings['avg_rate_user'] = movies_ratings['avg_rate_user'].fillna(movies_ratings['avg_rate_user'].median())
movies_ratings['avg_rate_movie'] = movies_ratings['avg_rate_movie'].fillna(movies_ratings['avg_rate_movie'].median())
movies_ratings['rating'] = movies_ratings['rating'].fillna(movies_ratings['rating'].median())

In [147]:
movies_ratings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100854 entries, 0 to 100853
Data columns (total 8 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   movieId         100854 non-null  int64  
 1   title           100854 non-null  object 
 2   genres          100854 non-null  object 
 3   userId          100836 non-null  float64
 4   rating          100854 non-null  float64
 5   timestamp       100836 non-null  float64
 6   avg_rate_user   100854 non-null  float64
 7   avg_rate_movie  100854 non-null  float64
dtypes: float64(5), int64(1), object(2)
memory usage: 6.9+ MB


Prepare the data for model training

In [80]:
X_rate = (movies_ratings[['avg_rate_movie','avg_rate_user']]).values
y = movies_ratings['rating'].values

Split the data and train the Linear Regression model on the mean-rating features

In [149]:
X_train, X_test, y_train, y_test = train_test_split(X_rate, y, test_size=0.2, random_state=42)

In [150]:
lr = LinearRegression().fit(X_train, y_train)

In [151]:
y_pred = lr.predict(X_test)

In [152]:
mean_squared_error(y_test, y_pred, squared=False)

0.8078184233283396
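One caveat about this score: the mean-rating features are computed on all rows before the split, so each test row's own rating contributes to its features, which may make the RMSE optimistic. A sketch of computing the means on the training part only is below. The data is synthetic, and the fallback to the global training mean for unseen users or movies is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
toy_ratings = pd.DataFrame({
    'userId': rng.integers(1, 21, size=400),
    'movieId': rng.integers(1, 31, size=400),
    'rating': rng.choice(np.arange(0.5, 5.5, 0.5), size=400),
})

train, test = train_test_split(toy_ratings, test_size=0.2, random_state=42)
train, test = train.copy(), test.copy()

# Statistics come from the training rows only.
global_mean = train['rating'].mean()
user_mean = train.groupby('userId')['rating'].mean()
movie_mean = train.groupby('movieId')['rating'].mean()

for part in (train, test):
    # Users or movies unseen in train fall back to the global training mean.
    part['avg_rate_user'] = part['userId'].map(user_mean).fillna(global_mean)
    part['avg_rate_movie'] = part['movieId'].map(movie_mean).fillna(global_mean)

features = ['avg_rate_movie', 'avg_rate_user']
model = LinearRegression().fit(train[features], train['rating'])
pred = model.predict(test[features])
rmse_holdout = float(np.sqrt(np.mean((test['rating'].to_numpy() - pred) ** 2)))
```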

In [67]:
import scipy.stats

all_data = ratings.copy()

# add tags to the ratings
all_data = all_data.join(tags[['userId','movieId','tag']].set_index(['userId','movieId']),on=['userId','movieId'],rsuffix='_tags')

# add movies to the ratings
# all_data = all_data.join(movies.set_index('movieId'),on='movieId',rsuffix='_movies')

In [70]:
# add movies to the ratings
all_data = all_data.join(movies.set_index('movieId'),on='movieId',rsuffix='_movies')
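Joining tags on `(userId, movieId)` can duplicate a rating row when the same user attached several tags to one movie. If one row per rating is wanted, the tags can be combined per pair first. This is a sketch on toy frames; joining the tags with a space is an assumption that suits TF-IDF input.

```python
import pandas as pd

ratings_toy = pd.DataFrame({'userId': [1, 1], 'movieId': [10, 20], 'rating': [4.0, 3.0]})
tags_toy = pd.DataFrame({'userId': [1, 1], 'movieId': [10, 10], 'tag': ['funny', 'pixar']})

# Plain join: the rating for movie 10 appears once per tag.
plain = ratings_toy.join(tags_toy.set_index(['userId', 'movieId']), on=['userId', 'movieId'])

# Combining tags per (userId, movieId) first keeps one row per rating.
tags_joined = tags_toy.groupby(['userId', 'movieId'])['tag'].agg(' '.join)
deduped = ratings_toy.join(tags_joined, on=['userId', 'movieId'])
```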


In [71]:
all_data

Unnamed: 0,userId,movieId,rating,timestamp,tag,title,genres
0,1,1,4.0,964982703,,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,1,3,4.0,964981247,,Grumpier Old Men (1995),Comedy|Romance
2,1,6,4.0,964982224,,Heat (1995),Action|Crime|Thriller
3,1,47,5.0,964983815,,Seven (a.k.a. Se7en) (1995),Mystery|Thriller
4,1,50,5.0,964982931,,"Usual Suspects, The (1995)",Crime|Mystery|Thriller
...,...,...,...,...,...,...,...
100831,610,166534,4.0,1493848402,,Split (2017),Drama|Horror|Thriller
100832,610,168248,5.0,1493850091,Heroic Bloodshed,John Wick: Chapter Two (2017),Action|Crime|Thriller
100833,610,168250,5.0,1494273047,,Get Out (2017),Horror
100834,610,168252,5.0,1493846352,,Logan (2017),Action|Sci-Fi


In [72]:
ratings_mean = ratings.groupby(['movieId'],as_index=False).agg({'rating':np.mean})
all_data = all_data.join(ratings_mean.set_index('movieId'),on='movieId',rsuffix='_mean')

In [74]:
all_data

Unnamed: 0,userId,movieId,rating,timestamp,tag,title,genres,rating_mean
0,1,1,4.0,964982703,,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,3.920930
1,1,3,4.0,964981247,,Grumpier Old Men (1995),Comedy|Romance,3.259615
2,1,6,4.0,964982224,,Heat (1995),Action|Crime|Thriller,3.946078
3,1,47,5.0,964983815,,Seven (a.k.a. Se7en) (1995),Mystery|Thriller,3.975369
4,1,50,5.0,964982931,,"Usual Suspects, The (1995)",Crime|Mystery|Thriller,4.237745
...,...,...,...,...,...,...,...,...
100831,610,166534,4.0,1493848402,,Split (2017),Drama|Horror|Thriller,3.333333
100832,610,168248,5.0,1493850091,Heroic Bloodshed,John Wick: Chapter Two (2017),Action|Crime|Thriller,4.142857
100833,610,168250,5.0,1494273047,,Get Out (2017),Horror,3.633333
100834,610,168252,5.0,1493846352,,Logan (2017),Action|Sci-Fi,4.280000


In [75]:
# add the number of ratings
ratings_len = ratings.groupby(['movieId'],as_index=False).agg({'rating':len})
all_data = all_data.join(ratings_len.set_index('movieId'),on='movieId',rsuffix='_cnt')

In [78]:
all_data

Unnamed: 0,userId,movieId,rating,timestamp,tag,title,genres,rating_mean,rating_cnt
0,1,1,4.0,964982703,,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,3.920930,215.0
1,1,3,4.0,964981247,,Grumpier Old Men (1995),Comedy|Romance,3.259615,52.0
2,1,6,4.0,964982224,,Heat (1995),Action|Crime|Thriller,3.946078,102.0
3,1,47,5.0,964983815,,Seven (a.k.a. Se7en) (1995),Mystery|Thriller,3.975369,203.0
4,1,50,5.0,964982931,,"Usual Suspects, The (1995)",Crime|Mystery|Thriller,4.237745,204.0
...,...,...,...,...,...,...,...,...,...
100831,610,166534,4.0,1493848402,,Split (2017),Drama|Horror|Thriller,3.333333,6.0
100832,610,168248,5.0,1493850091,Heroic Bloodshed,John Wick: Chapter Two (2017),Action|Crime|Thriller,4.142857,7.0
100833,610,168250,5.0,1494273047,,Get Out (2017),Horror,3.633333,15.0
100834,610,168252,5.0,1493846352,,Logan (2017),Action|Sci-Fi,4.280000,25.0


In [None]:
# first, collect all the data into a single table

import scipy.stats

all_data = ratings.copy()

# add tags to the ratings
all_data = all_data.join(tags[['userId','movieId','tag']].set_index(['userId','movieId']),on=['userId','movieId'],rsuffix='_tags')

# add movies to the ratings
all_data = all_data.join(movies.set_index('movieId'),on='movieId',rsuffix='_movies')

# add the mean rating
ratings_mean = ratings.groupby(['movieId'],as_index=False).agg({'rating':np.mean})
all_data = all_data.join(ratings_mean.set_index('movieId'),on='movieId',rsuffix='_mean')

# add the number of ratings
ratings_len = ratings.groupby(['movieId'],as_index=False).agg({'rating':len})
all_data = all_data.join(ratings_len.set_index('movieId'),on='movieId',rsuffix='_cnt')

# add the median rating
ratings_median = ratings.groupby(['movieId'],as_index=False).agg({'rating':np.median})
all_data = all_data.join(ratings_median.set_index('movieId'),on='movieId',rsuffix='_median')

# add the rating variance
ratings_variance = ratings.groupby(['movieId'],as_index=False).agg({'rating':lambda arr: np.var(arr) if len(arr)>0 else 0.0})
all_data = all_data.join(ratings_variance.set_index('movieId'),on='movieId',rsuffix='_variance')

# add the mode (pandas Series.mode returns a scalar per group regardless of the scipy version)
ratings_mode = ratings.groupby(['movieId'],as_index=False).agg({'rating':lambda arr: arr.mode().iloc[0]})
all_data = all_data.join(ratings_mode.set_index('movieId'),on='movieId',rsuffix='_mode')

all_data
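The repeated group-then-join steps above could also be written as one aggregation with named outputs followed by a single join. The sketch below shows this on a toy frame; the helper names `pop_var` and `first_mode` are hypothetical, and the output column names follow the `rating_*` suffixes used above.

```python
import pandas as pd

ratings_toy = pd.DataFrame({
    'userId': [1, 2, 3, 1],
    'movieId': [10, 10, 10, 20],
    'rating': [4.0, 5.0, 4.0, 2.0],
})

def pop_var(s):
    # Population variance (ddof=0), matching np.var's default.
    return s.var(ddof=0)

def first_mode(s):
    # Series.mode() is sorted, so this picks the smallest value among ties.
    return s.mode().iloc[0]

# One pass over the groups, one output column per statistic.
movie_stats = ratings_toy.groupby('movieId').agg(
    rating_mean=('rating', 'mean'),
    rating_cnt=('rating', 'count'),
    rating_median=('rating', 'median'),
    rating_variance=('rating', pop_var),
    rating_mode=('rating', first_mode),
)

all_toy = ratings_toy.join(movie_stats, on='movieId')
```

The same pattern applies per user by grouping on `userId` instead of `movieId`.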