### Homework: Content-Based Recommendations

Build recommendations (a regression that predicts the rating) using the following features: TF-IDF on tags and genres   

Mean ratings (+ median, variance, etc.) of the user and of the movie   

Evaluate RMSE on the test set

In [1]:
import pandas as pd
import numpy as np

from tqdm import tqdm_notebook

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

%matplotlib inline

Load the data into DataFrames

In [2]:
links = pd.read_csv('links.csv')
movies = pd.read_csv('movies.csv')
ratings = pd.read_csv('ratings.csv')
tags = pd.read_csv('tags.csv')

In [3]:
movies.sample(3)

Unnamed: 0,movieId,title,genres
62,70,From Dusk Till Dawn (1996),Action|Comedy|Horror|Thriller
902,1200,Aliens (1986),Action|Adventure|Horror|Sci-Fi
1961,2600,eXistenZ (1999),Action|Sci-Fi|Thriller


Strip special characters from the genres feature and separate the genres with spaces

In [488]:
movies['genres'] = movies.genres.apply(lambda x: ' '.join(x.replace(' ', '').replace('-', '').split('|')))
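
To see why the hyphens are removed, the same transformation can be tried on a single string. This is a minimal sketch: `clean_genres` is a hypothetical helper name that wraps the lambda above, and the comparison relies on the default tokenizer of `TfidfVectorizer`, which splits on punctuation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def clean_genres(raw):
    # Same steps as the lambda above: drop spaces and hyphens, turn '|' into ' '
    return ' '.join(raw.replace(' ', '').replace('-', '').split('|'))


# The default analyzer lowercases text and keeps runs of 2+ word characters
analyzer = TfidfVectorizer().build_analyzer()

raw_tokens = analyzer('Sci-Fi')                   # the hyphen splits the genre in two
clean_tokens = analyzer(clean_genres('Sci-Fi'))   # the genre stays a single token

print(clean_genres('Action|Sci-Fi|Thriller'))
print(raw_tokens, clean_tokens)
```

Without the cleanup, a genre such as `Sci-Fi` would contribute two unrelated tokens to the vocabulary instead of one.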

In [489]:
movies.sample(3)

Unnamed: 0,movieId,title,genres
438,502,"Next Karate Kid, The (1994)",Action Children Romance
2491,3325,"Next Best Thing, The (2000)",Comedy Drama
3470,4734,Jay and Silent Bob Strike Back (2001),Adventure Comedy


Strip special characters from the tags feature

In [490]:
# Remove hyphens and turn any '|' into a space; spaces inside a tag are kept as they are
tags['tag'] = tags.tag.apply(lambda x: x.replace('-', '').replace('|', ' '))

In [497]:
tags.sample(3)

Unnamed: 0,userId,movieId,tag,timestamp
2882,537,105504,suspense,1424141149
1285,474,1088,music,1138306856
1163,474,647,Gulf War,1137375718


Merge the DataFrames

In [474]:
movie_tags_merge = movies.merge(tags, how='left', on='movieId').fillna('0')
print(movie_tags_merge.info())
movie_tags_merge.sample(3)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11853 entries, 0 to 11852
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   movieId    11853 non-null  int64 
 1   title      11853 non-null  object
 2   genres     11853 non-null  object
 3   userId     11853 non-null  object
 4   tag        11853 non-null  object
 5   timestamp  11853 non-null  object
dtypes: int64(1), object(5)
memory usage: 648.2+ KB
None


Unnamed: 0,movieId,title,genres,userId,tag,timestamp
760,476,"Inkwell, The (1994)",Comedy Drama,0,0,0.0
11480,165347,Jack Reacher: Never Go Back (2016),Action Crime Drama Mystery Thriller,0,0,0.0
5095,5633,Heaven (2002),Drama,474,bombs,1138040000.0


In [413]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [415]:
movie_ratings_merge = movie_tags_merge.merge(ratings, how='left', on='movieId').fillna(0)

In [420]:
print(movie_ratings_merge.info())
movie_ratings_merge.sample(3)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 285783 entries, 0 to 285782
Data columns (total 9 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   movieId      285783 non-null  int64  
 1   title        285783 non-null  object 
 2   genres       285783 non-null  object 
 3   userId_x     285783 non-null  object 
 4   tag          285783 non-null  object 
 5   timestamp_x  285783 non-null  object 
 6   userId_y     285783 non-null  float64
 7   rating       285783 non-null  float64
 8   timestamp_y  285783 non-null  float64
dtypes: float64(3), int64(1), object(5)
memory usage: 21.8+ MB
None


Unnamed: 0,movieId,title,genres,userId_x,tag,timestamp_x,userId_y,rating,timestamp_y
256232,66934,Dr. Horrible's Sing-Along Blog (2008),Comedy Drama Musical SciFi,477,Neil Patrick Harris,1244790000.0,177.0,4.0,1435525000.0
129424,1219,Psycho (1960),Crime Horror,477,Alfred Hitchcock,1242580000.0,367.0,4.0,997811300.0
188999,3527,Predator (1987),Action SciFi Thriller,477,Jesse Ventura,1269830000.0,95.0,4.5,1105401000.0
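
Note that the merged frame has 285783 rows although `ratings` has only 100836: every rating is repeated once per tag of its movie, and movies without ratings get a rating of 0 from `fillna(0)`. One possible alternative, shown here as a sketch on toy data rather than as the approach used in this notebook, is to collapse the tags into one string per movie first and then attach that text to each rating, so that each rating appears exactly once. The column name `text` is an assumption made for illustration.

```python
import pandas as pd

# Toy stand-ins for movies.csv, tags.csv and ratings.csv
movies = pd.DataFrame({'movieId': [1, 2], 'genres': ['Comedy Drama', 'Action']})
tags = pd.DataFrame({'movieId': [1, 1], 'tag': ['funny', 'pixar']})
ratings = pd.DataFrame({'userId': [10, 11, 10],
                        'movieId': [1, 1, 2],
                        'rating': [4.0, 3.5, 5.0]})

# One row per movie: all of its tags joined into a single string
movie_tags = tags.groupby('movieId')['tag'].apply(' '.join).reset_index()

# Movie-level text = genres + tags (empty string when a movie has no tags)
content = movies.merge(movie_tags, how='left', on='movieId')
content['tag'] = content['tag'].fillna('')
content['text'] = (content['genres'] + ' ' + content['tag']).str.strip()

# Inner merge keeps only real ratings, one row per (user, movie) pair
data = ratings.merge(content[['movieId', 'text']], how='inner', on='movieId')
print(data)
```

With this layout the same (user, movie, rating) triple cannot land in both the training and the test split.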


Build the final DataFrame with the data

In [500]:
movies_result = movie_ratings_merge[['movieId', 'genres', 'tag', 'rating']]
movies_result.sample(3)
movies_result.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 285783 entries, 0 to 285782
Data columns (total 4 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   movieId  285783 non-null  int64  
 1   genres   285783 non-null  object 
 2   tag      285783 non-null  object 
 3   rating   285783 non-null  float64
dtypes: float64(1), int64(1), object(2)
memory usage: 10.9+ MB


Prepare the data for model training

In [461]:
X = (movies_result['genres'] + ' ' + movies_result['tag']).values
y = movies_result['rating'].values

Convert the genre and tag text into TF-IDF vectors and train a linear regression model

In [465]:
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(X)
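
As an illustration of what `fit_transform` produces, here is a small sketch on three made-up genre/tag strings (the strings are not taken from the dataset). Each document becomes one sparse row, each distinct lowercased token becomes one column, and with the default settings every row is scaled to unit L2 norm.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up "genres + tag" strings in the same shape as X above
docs = ['Comedy Drama funny', 'Action SciFi aliens', 'Comedy Romance']

toy_tfidf = TfidfVectorizer()
toy_matrix = toy_tfidf.fit_transform(docs)   # sparse matrix: documents x tokens

vocabulary = sorted(toy_tfidf.vocabulary_)   # learned tokens, lowercased
row_norms = np.asarray(toy_matrix.multiply(toy_matrix).sum(axis=1)).ravel()

print(toy_matrix.shape)
print(vocabulary)
print(row_norms)
```

The token `comedy` appears in two of the three documents, so it receives a lower IDF weight than tokens that occur only once.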

In [466]:
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)

In [467]:
lr = LinearRegression().fit(X_train, y_train)

In [468]:
y_pred = lr.predict(X_test)

In [469]:
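# RMSE is the square root of the MSE (the squared=False argument is deprecated in newer scikit-learn releases)
np.sqrt(mean_squared_error(y_test, y_pred))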

0.9602211162551327
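
To put an RMSE value in context, it helps to compare it with a trivial baseline that always predicts the mean rating. The sketch below uses four made-up ratings and predictions (not values from this notebook) to show the computation by hand and next to such a baseline.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true ratings and model predictions, for illustration only
y_true = np.array([4.0, 3.0, 5.0, 2.0])
y_hat = np.array([3.5, 3.0, 4.0, 3.0])

# RMSE via scikit-learn and directly from the definition
rmse = np.sqrt(mean_squared_error(y_true, y_hat))
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# Baseline: predict the mean rating for every row
baseline = np.full_like(y_true, y_true.mean())
rmse_baseline = np.sqrt(mean_squared_error(y_true, baseline))

print(rmse, rmse_manual, rmse_baseline)
```

A model is only informative to the extent that its RMSE falls below the baseline value on the same test set.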

Add a column with the predictions to the final DataFrame

In [502]:
prediction_rating = lr.predict(X_tfidf)

In [503]:
# assign() returns a new DataFrame, which avoids writing into a slice of movie_ratings_merge
movies_result = movies_result.assign(Prediction=prediction_rating)


In [507]:
movies_result.sample(10)

Unnamed: 0,movieId,genres,tag,rating,Prediction
246974,49530,Action Adventure Crime Drama Thriller War,Jennifer Connelly,4.0,3.681821
9605,110,Action Drama War,Medieval,4.0,4.018135
266175,79132,Action Crime Drama Mystery SciFi Thriller IMAX,complicated,5.0,4.149748
47397,296,Comedy Crime Drama Thriller,ensemble cast,4.5,4.176957
4812,39,Comedy Romance,chick flick,3.0,3.069541
2330,25,Drama Romance,alcoholism,2.0,3.700549
107769,780,Action Adventure SciFi Thriller,aliens,4.0,3.709327
113780,924,Adventure Drama SciFi,revolutionary,4.5,3.951612
54707,296,Comedy Crime Drama Thriller,Harvey Keitel,5.0,4.141394
57984,296,Comedy Crime Drama Thriller,intellectual,4.0,4.143521


#### Train a linear regression model on mean user and movie ratings

In [137]:
movies_ratings = movies.merge(ratings, how='left', on='movieId')
movies_ratings

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1.0,4.0,9.649827e+08
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5.0,4.0,8.474350e+08
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7.0,4.5,1.106636e+09
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15.0,2.5,1.510578e+09
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,17.0,4.5,1.305696e+09
...,...,...,...,...,...,...
100849,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,184.0,4.0,1.537109e+09
100850,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,184.0,3.5,1.537110e+09
100851,193585,Flint (2017),Drama,184.0,3.5,1.537110e+09
100852,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,184.0,3.5,1.537110e+09


Add a column with the mean rating given by each user

In [138]:
movies_ratings['avg_rate_user'] = movies_ratings.groupby('userId')['rating'].transform('mean')

Add a column with the mean rating received by each movie

In [139]:
movies_ratings['avg_rate_movie'] = movies_ratings.groupby('movieId')['rating'].transform('mean')

In [140]:
movies_ratings.sample(3)

Unnamed: 0,movieId,title,genres,userId,rating,timestamp,avg_rate_user,avg_rate_movie
37595,1982,Halloween (1978),Horror,561.0,4.0,1491092000.0,3.372277,3.722222
83574,48516,"Departed, The (2006)",Crime|Drama|Thriller,317.0,5.0,1430362000.0,3.730159,4.252336
35481,1754,Fallen (1998),Crime|Drama|Fantasy|Thriller,19.0,2.0,965711200.0,2.607397,3.5
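
The task statement also mentions median and variance. A possible way to add them, sketched here on toy data, is to aggregate several statistics per group at once and join them back. `add_stats` and the `user_`/`movie_` column prefixes are hypothetical names chosen for this example.

```python
import pandas as pd

# Toy ratings table with the same columns as ratings.csv (timestamp omitted)
ratings = pd.DataFrame({
    'userId': [1, 1, 1, 2, 2, 3],
    'movieId': [10, 20, 30, 10, 20, 10],
    'rating': [4.0, 3.0, 5.0, 2.0, 4.0, 3.5],
})


def add_stats(df, key, prefix):
    # Per-group mean, median, sample variance and number of ratings
    stats = df.groupby(key)['rating'].agg(['mean', 'median', 'var', 'count'])
    # The sample variance is NaN for groups with a single rating; use 0 there
    stats = stats.add_prefix(prefix).fillna(0.0)
    return df.join(stats, on=key)


features = add_stats(ratings, 'userId', 'user_')
features = add_stats(features, 'movieId', 'movie_')
print(features.filter(like='user_').drop_duplicates())
```

The count column can be useful on its own: a mean based on one or two ratings is much noisier than one based on hundreds.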


Fill the rows containing NaN with the median of the corresponding columns

In [145]:
# Assign the result back: fillna(inplace=True) on a selected column may not update the DataFrame in newer pandas
for col in ['avg_rate_user', 'avg_rate_movie', 'rating']:
    movies_ratings[col] = movies_ratings[col].fillna(movies_ratings[col].median())

In [147]:
movies_ratings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100854 entries, 0 to 100853
Data columns (total 8 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   movieId         100854 non-null  int64  
 1   title           100854 non-null  object 
 2   genres          100854 non-null  object 
 3   userId          100836 non-null  float64
 4   rating          100854 non-null  float64
 5   timestamp       100836 non-null  float64
 6   avg_rate_user   100854 non-null  float64
 7   avg_rate_movie  100854 non-null  float64
dtypes: float64(5), int64(1), object(2)
memory usage: 6.9+ MB


Prepare the data for model training

In [148]:
X = (movies_ratings[['avg_rate_movie','avg_rate_user']]).values
y = movies_ratings['rating'].values

Split the data into training and test sets and train a linear regression model

In [149]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [150]:
lr = LinearRegression().fit(X_train, y_train)

In [151]:
y_pred = lr.predict(X_test)

In [152]:
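# RMSE is the square root of the MSE (the squared=False argument is deprecated in newer scikit-learn releases)
np.sqrt(mean_squared_error(y_test, y_pred))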

0.8078184233283396
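
One caveat about this score: the mean ratings above are computed on the full table before the split, so every test rating contributes to its own features. A stricter variant, sketched below on randomly generated ratings (so the printed RMSE says nothing about the real dataset), splits first, computes the means on the training part only, and falls back to the global training mean for users or movies unseen in training. `add_features` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic ratings in the 0.5..5.0 range, for illustration only
rng = np.random.default_rng(0)
n = 500
ratings = pd.DataFrame({
    'userId': rng.integers(1, 21, size=n),
    'movieId': rng.integers(1, 41, size=n),
    'rating': rng.integers(1, 11, size=n) / 2,
})

# Split first, so that test ratings never influence the features
train, test = train_test_split(ratings, test_size=0.2, random_state=42)

global_mean = train['rating'].mean()
user_mean = train.groupby('userId')['rating'].mean()
movie_mean = train.groupby('movieId')['rating'].mean()


def add_features(df):
    out = df.copy()
    # Ids unseen in training map to NaN and receive the global training mean
    out['avg_rate_user'] = out['userId'].map(user_mean).fillna(global_mean)
    out['avg_rate_movie'] = out['movieId'].map(movie_mean).fillna(global_mean)
    return out


train_f, test_f = add_features(train), add_features(test)
cols = ['avg_rate_movie', 'avg_rate_user']

lr = LinearRegression().fit(train_f[cols], train_f['rating'])
rmse = np.sqrt(mean_squared_error(test_f['rating'], lr.predict(test_f[cols])))
print(rmse)
```

On real data this variant would be expected to give a somewhat higher RMSE than the cell above, and that number is the more honest estimate of performance on unseen ratings.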