### Homework: Content-Based Recommendations

Build recommendations (a regression that predicts the rating) using the following features: TF-IDF on tags and genres   

Mean ratings (plus median, variance, etc.) of the user and of the movie   

Evaluate RMSE on the test set

In [1]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

Load the data into DataFrames

In [34]:
# links = pd.read_csv('links.csv')
movies = pd.read_csv('movies.csv')
ratings = pd.read_csv('ratings.csv')
tags = pd.read_csv('tags.csv')

In [3]:
movies.sample(3)

Unnamed: 0,movieId,title,genres
7788,91935,Albatross (2011),Drama
6051,40491,"Match Factory Girl, The (Tulitikkutehtaan tytt...",Comedy|Drama
2259,2997,Being John Malkovich (1999),Comedy|Drama|Fantasy


Strip the special characters from the genres feature and separate the genres with spaces

In [35]:
movies['genres'] = movies.genres.apply(lambda x: ' '.join(x.replace(' ', '').replace('-', '').split('|')))

In [4]:
movies.sample(3)

Unnamed: 0,movieId,title,genres
6682,58047,"Definitely, Maybe (2008)",Comedy Drama Romance
6780,60291,Gonzo: The Life and Work of Dr. Hunter S. Thom...,Documentary
982,1283,High Noon (1952),Drama Western
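
The cleaning step above can be checked on a few strings in isolation. This is a minimal sketch: the helper name `clean_genres` and the sample strings are illustrative assumptions, and the third string is assumed to be the placeholder used for movies without genres.

```python
def clean_genres(raw: str) -> str:
    """Turn 'Action|Sci-Fi' into 'Action SciFi' (one token per genre)."""
    parts = raw.split('|')
    # Drop spaces and hyphens inside each genre so it stays a single token
    # for the default TfidfVectorizer tokenizer.
    return ' '.join(p.replace(' ', '').replace('-', '') for p in parts)


# Illustrative inputs, not taken from the notebook output.
examples = ['Action|Sci-Fi', 'Comedy|Drama', '(no genres listed)']
cleaned = [clean_genres(g) for g in examples]
print(cleaned)
```

Removing the hyphen matters because the default tokenizer would otherwise split `Sci-Fi` into two separate terms.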


Strip the special characters from the tags feature and separate the words with spaces

In [37]:
# Replace hyphens with spaces (e.g. 'sci-fi' -> 'sci fi'); the other steps of the original lambda were no-ops
tags['tag'] = tags.tag.str.replace('-', ' ', regex=False)

In [15]:
tags.sample(10)

Unnamed: 0,userId,movieId,tag,timestamp
653,357,1059,shakespeare,1348627264
765,424,1200,sci fi,1457901245
3675,606,3578,Romans,1173212944
1906,474,4326,rasicm,1137375204
1207,474,918,1900s,1138137949
1782,474,3385,Peace Corp,1137374553
594,318,68954,computer animation,1266408645
168,62,45447,treasure hunt,1525637084
325,62,128360,violent,1526078912
2061,474,6016,violence,1138039157


Merge the DataFrames

In [38]:
movie_tags_merge = movies.merge(tags, how='left', on='movieId').fillna('0')
print(movie_tags_merge.info())
movie_tags_merge.sample(3)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11853 entries, 0 to 11852
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   movieId    11853 non-null  int64 
 1   title      11853 non-null  object
 2   genres     11853 non-null  object
 3   userId     11853 non-null  object
 4   tag        11853 non-null  object
 5   timestamp  11853 non-null  object
dtypes: int64(1), object(5)
memory usage: 648.2+ KB
None


Unnamed: 0,movieId,title,genres,userId,tag,timestamp
6868,26590,G.I. Joe: The Movie (1987),Action Adventure Animation Children Fantasy SciFi,0,0,0.0
10661,122912,Avengers: Infinity War - Part I (2018),Action Adventure SciFi,62,Robert Downey Jr.,1526030000.0
7915,51935,Shooter (2007),Action Drama Thriller,0,0,0.0
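
The `info()` output above shows `userId` and `timestamp` as `object` after `fillna('0')`. The sketch below, on made-up toy frames, contrasts that with a per-column fill. The fill values (`0` for `userId`, an empty string for `tag`) are assumptions chosen for illustration, not the notebook's choice.

```python
import pandas as pd

# Toy frames; all names and values here are illustrative.
movies_toy = pd.DataFrame({'movieId': [1, 2], 'title': ['A', 'B']})
tags_toy = pd.DataFrame({'movieId': [1], 'userId': [62], 'tag': ['funny']})

merged = movies_toy.merge(tags_toy, how='left', on='movieId')

# Filling every column with the string '0' puts strings into numeric columns.
filled_all = merged.fillna('0')

# Filling per column keeps userId numeric and leaves untagged movies with an empty tag.
filled_per_col = merged.fillna({'userId': 0, 'tag': ''})
filled_per_col['userId'] = filled_per_col['userId'].astype('int64')

print(filled_all.dtypes)
print(filled_per_col.dtypes)
```

Keeping `userId` numeric avoids grouping later on a column that mixes numbers and the string `'0'`.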


In [39]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [40]:
movie_ratings_merge = movie_tags_merge.merge(ratings, how='left', on='movieId').fillna(0)

In [41]:
print(movie_ratings_merge.info())
movie_ratings_merge.sample(3)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 285783 entries, 0 to 285782
Data columns (total 9 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   movieId      285783 non-null  int64  
 1   title        285783 non-null  object 
 2   genres       285783 non-null  object 
 3   userId_x     285783 non-null  object 
 4   tag          285783 non-null  object 
 5   timestamp_x  285783 non-null  object 
 6   userId_y     285783 non-null  float64
 7   rating       285783 non-null  float64
 8   timestamp_y  285783 non-null  float64
dtypes: float64(3), int64(1), object(5)
memory usage: 21.8+ MB
None


Unnamed: 0,movieId,title,genres,userId_x,tag,timestamp_x,userId_y,rating,timestamp_y
9399,110,Braveheart (1995),Action Drama War,62,inspirational,1528150000.0,160.0,4.0,971112700.0
215099,5803,I Spy (2002),Action Adventure Comedy Crime,0,0,0.0,275.0,2.0,1049078000.0
78156,296,Pulp Fiction (1994),Comedy Crime Drama Thriller,599,Steve Buscemi,1498460000.0,328.0,5.0,1494212000.0
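
Merging per-tag rows with per-rating rows on `movieId` pairs every tag of a movie with every rating of that movie, which is why 100836 ratings become 285783 rows. The toy sketch below shows the effect. The alternative of joining each movie's tags with `'|'` first is an assumption, picked because the tag tokenizer used later splits on `'|'`.

```python
import pandas as pd

# Toy data: one movie with 2 tags and 3 ratings (illustrative values).
tags_toy = pd.DataFrame({'movieId': [1, 1], 'tag': ['funny', 'classic']})
ratings_toy = pd.DataFrame({
    'movieId': [1, 1, 1],
    'userId': [10, 11, 12],
    'rating': [4.0, 5.0, 3.0],
})

# Every tag row is paired with every rating row of the same movie.
fan_out = tags_toy.merge(ratings_toy, how='left', on='movieId')
print(len(fan_out))

# Possible alternative: collapse tags to one row per movie before merging,
# so the result keeps exactly one row per rating.
tags_per_movie = tags_toy.groupby('movieId')['tag'].agg('|'.join).reset_index()
one_per_rating = ratings_toy.merge(tags_per_movie, how='left', on='movieId')
print(len(one_per_rating))
```

With one row per rating, heavily tagged movies no longer carry more weight in the regression than sparsely tagged ones.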


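Add a column with the mean rating per user. Note that the grouping key is `userId_x`, the user who wrote the tag (it comes from `tags`), not `userId_y`, the user who gave the rating; all rows without a tag share the single group `'0'`.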

In [43]:
movie_ratings_merge['avg_rate_user'] = movie_ratings_merge.groupby('userId_x')['rating'].transform('mean')

Add a column with the mean rating for each movie

In [44]:
movie_ratings_merge['avg_rate_movie'] = movie_ratings_merge.groupby('movieId')['rating'].transform('mean')

In [45]:
movie_ratings_merge.sample(3)

Unnamed: 0,movieId,title,genres,userId_x,tag,timestamp_x,userId_y,rating,timestamp_y,avg_rate_user,avg_rate_movie
231976,7572,Wit (2001),Drama,474,cancer,1137180000.0,474.0,4.5,1132173000.0,3.776389,4.5
200597,4226,Memento (2000),Mystery Thriller,567,dreamlike,1525280000.0,105.0,3.5,1446573000.0,4.020959,4.122642
180765,2959,Fight Club (1999),Action Crime Drama Thriller,599,violent,1498460000.0,533.0,5.0,1424754000.0,4.168586,4.272936
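
The task statement also mentions median and variance. A minimal sketch of computing all three per rating user and per movie is shown below on toy data. The column prefixes and the choice to fill the variance of single-rating groups with `0.0` are assumptions.

```python
import pandas as pd

# Toy ratings table (illustrative values).
ratings_toy = pd.DataFrame({
    'userId':  [1, 1, 2, 2, 2],
    'movieId': [10, 20, 10, 20, 30],
    'rating':  [4.0, 2.0, 5.0, 3.0, 4.0],
})

# Aggregate by the user who gave the rating, and separately by movie.
user_stats = (ratings_toy.groupby('userId')['rating']
              .agg(['mean', 'median', 'var'])
              .add_prefix('user_'))
movie_stats = (ratings_toy.groupby('movieId')['rating']
               .agg(['mean', 'median', 'var'])
               .add_prefix('movie_'))

# Attach the statistics to every rating row.
features = (ratings_toy
            .join(user_stats, on='userId')
            .join(movie_stats, on='movieId'))

# The sample variance is NaN for groups with a single rating; 0.0 is an assumed fill.
features = features.fillna(0.0)
print(features)
```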


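Fill missing values with the median of the corresponding column (after the `fillna(0)` calls above, `info()` already reports no missing values, so this step leaves the data unchanged)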

In [46]:
for col in ['avg_rate_user', 'avg_rate_movie', 'rating']:
    # Assign the result back; fillna(inplace=True) on a selected column is unreliable in recent pandas
    movie_ratings_merge[col] = movie_ratings_merge[col].fillna(movie_ratings_merge[col].median())

In [47]:
movie_ratings_merge.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 285783 entries, 0 to 285782
Data columns (total 11 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   movieId         285783 non-null  int64  
 1   title           285783 non-null  object 
 2   genres          285783 non-null  object 
 3   userId_x        285783 non-null  object 
 4   tag             285783 non-null  object 
 5   timestamp_x     285783 non-null  object 
 6   userId_y        285783 non-null  float64
 7   rating          285783 non-null  float64
 8   timestamp_y     285783 non-null  float64
 9   avg_rate_user   285783 non-null  float64
 10  avg_rate_movie  285783 non-null  float64
dtypes: float64(5), int64(1), object(5)
memory usage: 26.2+ MB


Build the final DataFrame

In [81]:
movie_result = movie_ratings_merge[['genres', 'tag', 'rating', 'avg_rate_user', 'avg_rate_movie']]
movie_result.sample(10)
# movie_result.info()

Unnamed: 0,genres,tag,rating,avg_rate_user,avg_rate_movie
117129,Drama Romance,Leonardo DiCaprio,5.0,3.792018,3.722222
14937,Adventure Comedy,0,2.5,3.285343,3.06015
121075,Comedy Drama,0,1.0,3.285343,3.136364
162443,Action Adventure Comedy,0,0.5,3.285343,3.198347
229408,Drama Romance SciFi,feel good,4.0,4.020959,4.160305
275244,Action Drama Western,Soundtrack,4.0,3.955575,3.943662
20618,Action Adventure SciFi,sci fi,5.0,3.990667,4.231076
163097,Animation Comedy Musical,free speech,4.0,3.990667,3.861842
188513,Comedy Romance,alternate endings,4.0,3.776389,3.611111
69407,Comedy Crime Drama Thriller,philosophical,5.0,4.168586,4.197068


Convert the mean ratings to an array

In [50]:
X_means = movie_result[['avg_rate_user', 'avg_rate_movie']].to_numpy()

In [51]:
X_means.shape


(285783, 2)

Convert the genre and tag data into vectors

In [52]:
tfidf_model = TfidfVectorizer()
X_genres = tfidf_model.fit_transform(movie_result['genres']).toarray()

In [53]:
X_genres.shape

(285783, 20)

In [55]:
tfidf_transformer = TfidfVectorizer(tokenizer=lambda x: x.split('|'))
X_tags = tfidf_transformer.fit_transform(movie_result['tag']).toarray()

In [56]:
X_tags.shape

(285783, 1468)

In [57]:
X_tfidf = np.hstack((X_tags, X_genres))

In [58]:
X_tfidf.shape

(285783, 1488)
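
A dense matrix of shape (285783, 1488) would occupy roughly 3.4 GB if stored as float64. The sketch below shows one possible way to keep the TF-IDF output sparse and stack it with `scipy.sparse.hstack`. The toy frame and variable names are illustrative assumptions, not part of the notebook.

```python
import pandas as pd
from scipy.sparse import hstack, issparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy frame with the same two text columns as movie_result (illustrative values).
toy = pd.DataFrame({
    'genres': ['Action SciFi', 'Comedy Drama', 'Drama'],
    'tag': ['sci fi', '0', 'thought provoking'],
})

genre_vec = TfidfVectorizer()
# Splitting on '|' keeps each whole tag as a single token;
# token_pattern=None silences the "token_pattern is unused" warning.
tag_vec = TfidfVectorizer(tokenizer=lambda s: s.split('|'), token_pattern=None)

# fit_transform returns SciPy sparse matrices; no .toarray() call is needed.
X_genres_sp = genre_vec.fit_transform(toy['genres'])
X_tags_sp = tag_vec.fit_transform(toy['tag'])

# Stack the blocks column-wise while staying sparse.
X_sp = hstack([X_tags_sp, X_genres_sp]).tocsr()
print(X_sp.shape, issparse(X_sp))
```

`LinearRegression` accepts sparse input, so a matrix built this way could be passed to `fit` directly.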

Prepare the data and train a linear regression model

In [63]:
X = np.hstack((X_tfidf, X_means))

In [60]:
y = movie_result['rating'].to_numpy().flatten()

In [64]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [65]:
lr = LinearRegression().fit(X_train, y_train)

In [66]:
y_pred = lr.predict(X_test)

In [67]:
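# RMSE as the square root of MSE; the squared=False argument is deprecated in recent scikit-learn releases
np.sqrt(mean_squared_error(y_test, y_pred))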

0.9020290186759707
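
For reference, RMSE is the square root of the mean squared difference between true and predicted ratings. The sketch below computes it by hand on made-up numbers and compares it with a constant baseline. The helper `rmse` and the toy arrays are assumptions for illustration.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error of two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Illustrative values, not taken from the notebook.
y_true_toy = np.array([4.0, 3.0, 5.0, 2.0])
y_pred_toy = np.array([3.5, 3.0, 4.0, 3.0])

model_rmse = rmse(y_true_toy, y_pred_toy)

# Constant baseline: always predict the mean rating.
# In a real evaluation the mean would come from the training split.
baseline_rmse = rmse(y_true_toy, np.full_like(y_true_toy, y_true_toy.mean()))

print(model_rmse, baseline_rmse)
```

Reporting such a baseline next to the model's RMSE makes it easier to judge how much the features actually help.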

In [68]:
r2_score = lr.score(X_test,y_test)
print(r2_score*100,'%')

21.63815496166289 %
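
The mean-rating features above are computed on the full table before the split, so each test rating contributes to its own features. One possible way to avoid this, sketched on toy data, is to compute the means on the training split only and map them onto both parts. Falling back to the global training mean for unseen users or movies is an assumption.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy ratings table (illustrative values).
ratings_toy = pd.DataFrame({
    'userId':  [1, 1, 2, 2, 3, 3, 4, 4],
    'movieId': [10, 20, 10, 30, 20, 30, 10, 40],
    'rating':  [4.0, 2.0, 5.0, 3.0, 3.0, 4.0, 4.5, 1.0],
})

train, test = train_test_split(ratings_toy, test_size=0.25, random_state=42)
train, test = train.copy(), test.copy()

# Statistics are computed on the training rows only.
global_mean = train['rating'].mean()
movie_mean = train.groupby('movieId')['rating'].mean()
user_mean = train.groupby('userId')['rating'].mean()

for part in (train, test):
    # Users or movies unseen in training fall back to the global training mean.
    part['avg_rate_movie'] = part['movieId'].map(movie_mean).fillna(global_mean)
    part['avg_rate_user'] = part['userId'].map(user_mean).fillna(global_mean)

print(test)
```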


Add a column with the predictions to the final DataFrame

In [77]:
prediction_rating = lr.predict(X)

In [78]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
# when movie_result is a slice of another frame
movie_result = movie_result.assign(Prediction=prediction_rating)


In [80]:
movie_result.sample(10)

Unnamed: 0,genres,tag,rating,avg_rate_user,avg_rate_movie,Prediction
168191,Horror SciFi,0,3.0,3.285343,2.5,2.482561
173118,Action Crime Drama Thriller,dark comedy,4.5,4.168586,4.272936,4.274812
87375,Comedy Drama Romance War,bubba gump shrimp,5.0,4.164134,4.164134,4.164766
244068,Action Comedy Horror Thriller,0,3.0,3.285343,2.642857,2.633378
148182,Drama SciFi Thriller,atmospheric,1.0,4.020959,3.851064,3.842012
79869,Comedy Crime Drama Thriller,thought provoking,4.0,4.168586,4.197068,4.195658
8616,Adventure Children Comedy Musical,muppets,4.0,3.776389,3.326923,3.339235
156845,Drama,0,3.5,3.285343,3.375,3.375305
159500,Action SciFi,0,2.0,3.285343,2.777778,2.76634
274809,Action Drama Western,Humour,4.5,3.651496,3.943662,3.943009
