# MoodStream: data preparation

## Datasets

Required data:
- poster
- genre
- duration/page count
- rating
- user ID
- review rating

Genre, duration, and rating are enough to solve the selection task directly, while the datasets with reviews make it possible to find reviewers similar to the user and make the results more relevant.
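
The direct approach described above can be sketched as a plain filter over a catalogue table. This is only an illustration: the `catalog` frame, its column names (`genres`, `runtime_min`, `rating`), and the `pick_direct` helper are assumptions, not part of the project code.

```python
import pandas as pd

# Hypothetical catalogue; column names are assumptions for illustration.
catalog = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genres": [["Drama"], ["Comedy", "Romance"], ["Drama", "Crime"], ["Comedy"]],
    "runtime_min": [142, 95, 175, 88],
    "rating": [9.3, 7.1, 9.2, 6.4],
})

def pick_direct(items, genre, max_runtime, min_rating, limit=10):
    """Return the best-rated items that match a genre and fit the time budget."""
    has_genre = items["genres"].apply(lambda gs: genre in gs)
    mask = has_genre & (items["runtime_min"] <= max_runtime) & (items["rating"] >= min_rating)
    return items[mask].sort_values("rating", ascending=False).head(limit)

picked = pick_direct(catalog, genre="Drama", max_runtime=150, min_rating=8.0)
print(picked["title"].tolist())
```

Only item `A` passes all three conditions here; `C` has the right genre and rating but exceeds the runtime limit.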

https://www.kaggle.com/datasets/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows 1000 movies, poster, title, runtime, genre, release year

https://www.kaggle.com/datasets/ashishjangra27/imdb-movies-dataset 2.5M movies, title, runtime, genre, release year

https://zenodo.org/record/7665868#.ZElLGn7P3zc title, release date, genre, reviews with ratings

https://www.kaggle.com/datasets/whenamancodes/popular-movies-datasets-58000-movies title, reviews with ratings

https://www.kaggle.com/datasets/veeralakrishna/movielens-25m-dataset title, reviews with ratings

https://www.kaggle.com/datasets/devanshiipatel/imdb-tv-shows 2986 TV series, title, episode runtime, genre, release years


---


https://www.kaggle.com/datasets/arashnic/book-recommendation-dataset : 271360 books, titles, medium-size images, user ratings

https://www.kaggle.com/datasets/ruchi798/bookcrossing-dataset : 271379 books, titles, medium-size images, user ratings

https://www.kaggle.com/datasets/bahramjannesarr/goodreads-book-datasets-10m : 2M books, titles, page count

https://www.kaggle.com/datasets/mohamedbakhet/amazon-books-reviews : 212404 books, titles, medium-size images, genres, user ratings

https://www.kaggle.com/datasets/thedevastator/comprehensive-overview-of-52478-goodreads-best-b genres

https://www.kaggle.com/datasets/michaelrussell4/10000-books-and-their-genres-standardized genres

---


https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset 89741 tracks, title, artist, duration in ms, danceability, energy

https://www.kaggle.com/datasets/zaheenhamidani/ultimate-spotify-tracks-db 176774 tracks, title, artist, genre, duration in ms, danceability, energy, acousticness, instrumentalness

https://www.kaggle.com/datasets/paradisejoy/top-hits-spotify-from-20002019 1879 tracks, title, artist, duration in ms, danceability, energy

https://www.kaggle.com/datasets/yamaerenay/spotify-dataset-19212020-600k-tracks 586672 tracks, title, artist, duration in ms, danceability, energy

https://www.kaggle.com/datasets/cbhavik/music-taste-recommendation includes users and likes. The track data is anonymized, so it can be used for mocks

https://www.kaggle.com/datasets/muhmores/spotify-top-100-songs-of-20152019 title, artist, genre

## Data preprocessing

### Imports

In [1]:
import pandas as pd

### IMDB Movies TOP-1000

In [2]:
# https://www.kaggle.com/datasets/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows
IMDB_TOP_1000_PATH = './datasets/src/movies/imdb-dataset-of-top-1000-movies-and-tv-shows.csv'
df1 = pd.read_csv(IMDB_TOP_1000_PATH)
df1 = df1[['Poster_Link', 'Series_Title', 'Released_Year', 'Genre', 'IMDB_Rating']]
df1 = df1.rename(columns={'Series_Title': 'Title'})
# Build a join key: lowercase the title and drop every non-word character.
df1['token'] = df1['Title'].str.lower()
df1['token'] = df1['token'].str.replace(pat=r'[^\w]', repl='', regex=True)

df1.head(5)

Unnamed: 0,Poster_Link,Title,Released_Year,Genre,IMDB_Rating,token
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,Drama,9.3,theshawshankredemption
1,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972,"Crime, Drama",9.2,thegodfather
2,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008,"Action, Crime, Drama",9.0,thedarkknight
3,https://m.media-amazon.com/images/M/MV5BMWMwMG...,The Godfather: Part II,1974,"Crime, Drama",9.0,thegodfatherpartii
4,https://m.media-amazon.com/images/M/MV5BMWU4N2...,12 Angry Men,1957,"Crime, Drama",9.0,12angrymen
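
The `token` column is a normalized join key. A minimal standalone sketch of the same normalization on plain strings is shown below; the helper name `normalize_title` is hypothetical and only mirrors the two `str` operations used in the cell.

```python
import re

def normalize_title(title: str) -> str:
    """Lowercase a title and drop every character that is not a letter, digit, or underscore."""
    return re.sub(r"[^\w]", "", title.lower())

print(normalize_title("The Godfather: Part II"))
print(normalize_title("12 Angry Men"))
```

Because only the title is used, different films with the same name (remakes, for example) produce the same token, so the key is not guaranteed to be unique.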


### Movies and Ratings

In [3]:
# https://zenodo.org/record/7665868#.ZElLGn7P3zc 
MOVIE_LENS_45K_MOVIES_PATH = './datasets/src/movies/ZElLGn7P3zc/movies.csv' 
df2_movies = pd.read_csv(MOVIE_LENS_45K_MOVIES_PATH, delimiter='\t')
# Same join key as for the IMDB table: lowercase title without non-word characters.
df2_movies['token'] = df2_movies['Title'].str.lower()
df2_movies['token'] = df2_movies['token'].str.replace(pat=r'[^\w]', repl='', regex=True)

df2_movies.head()


Unnamed: 0,MovieID,Title,Release_date,Budget,Genres,Spoken_languages,token
0,1,Toy Story,1995-10-30,30000000,"[{\id\"": 16, \""name\"": \""Animation\""}, {\""id\""...","[{\name\"": \""English\"", \""iso_639_1\"": \""en\""}]""",toystory
1,2,Jumanji,1995-12-15,65000000,"[{\id\"": 12, \""name\"": \""Adventure\""}, {\""id\""...","[{\name\"": \""English\"", \""iso_639_1\"": \""en\""}...",jumanji
2,3,Grumpier Old Men,1995-12-22,0,"[{\id\"": 10749, \""name\"": \""Romance\""}, {\""id\...","[{\name\"": \""English\"", \""iso_639_1\"": \""en\""}]""",grumpieroldmen
3,4,Waiting to Exhale,1995-12-22,16000000,"[{\id\"": 35, \""name\"": \""Comedy\""}, {\""id\"": 1...","[{\name\"": \""English\"", \""iso_639_1\"": \""en\""}]""",waitingtoexhale
4,5,Father of the Bride Part II,1995-02-10,0,"[{\id\"": 35, \""name\"": \""Comedy\""}]""","[{\name\"": \""English\"", \""iso_639_1\"": \""en\""}]""",fatherofthebridepartii


In [4]:
MOVIE_LENS_26M_RATINGS_PATH = './datasets/src/movies/ZElLGn7P3zc/ratings.csv'
df2_ratings = pd.read_csv(MOVIE_LENS_26M_RATINGS_PATH)
df2_ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,58365,31,3.0,836391314
1,58365,32,5.0,836391153
2,58365,34,4.0,836391124
3,58365,44,4.0,836391253
4,58365,48,3.0,836391292
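
A table with `userId`, `movieId`, and `rating` columns is enough to look for reviewers with similar taste. The sketch below is one possible approach (cosine similarity over a dense user-by-movie matrix), not the method used later in the project; the toy data and the `most_similar_users` helper are made up for illustration. A dense matrix is unlikely to fit in memory for the full ratings file, so a sparse representation would probably be needed there.

```python
import numpy as np
import pandas as pd

# Toy ratings with the same column names as the ratings table above.
ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3, 3, 3],
    "movieId": [10, 20, 30, 10, 20, 10, 20, 30],
    "rating":  [5.0, 4.0, 1.0, 5.0, 4.0, 1.0, 2.0, 5.0],
})

# Users as rows, movies as columns; unrated movies become 0.
matrix = ratings.pivot_table(index="userId", columns="movieId", values="rating").fillna(0.0)

def most_similar_users(matrix, user_id, top_n=5):
    """Rank other users by cosine similarity of their rating vectors."""
    values = matrix.to_numpy(dtype=float)
    norms = np.linalg.norm(values, axis=1)
    norms[norms == 0] = 1.0  # avoid division by zero for users with no ratings
    unit = values / norms[:, None]
    target = unit[matrix.index.get_loc(user_id)]
    scores = pd.Series(unit @ target, index=matrix.index).drop(user_id)
    return scores.sort_values(ascending=False).head(top_n)

similar = most_similar_users(matrix, user_id=1)
print(similar)
```

In this toy data user 2 ranks above user 3 for user 1, because users 1 and 2 agree on both movies they share.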


### Preparing the resulting datasets

In [5]:
df = df1.merge(df2_movies, on='token')
df.head(5)

Unnamed: 0,Poster_Link,Title_x,Released_Year,Genre,IMDB_Rating,token,MovieID,Title_y,Release_date,Budget,Genres,Spoken_languages
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,Drama,9.3,theshawshankredemption,318,The Shawshank Redemption,1994-09-23,25000000,"[{\id\"": 18, \""name\"": \""Drama\""}, {\""id\"": 80...","[{\name\"": \""English\"", \""iso_639_1\"": \""en\""}]"""
1,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972,"Crime, Drama",9.2,thegodfather,858,The Godfather,1972-03-14,6000000,"[{\id\"": 18, \""name\"": \""Drama\""}, {\""id\"": 80...","[{\name\"": \""English\"", \""iso_639_1\"": \""en\""}..."
2,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008,"Action, Crime, Drama",9.0,thedarkknight,58559,The Dark Knight,2008-07-16,185000000,"[{\id\"": 18, \""name\"": \""Drama\""}, {\""id\"": 28...","[{\name\"": \""English\"", \""iso_639_1\"": \""en\""}..."
3,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008,"Action, Crime, Drama",9.0,thedarkknight,130219,The Dark Knight,2011-07-11,0,"[{\id\"": 28, \""name\"": \""Action\""}, {\""id\"": 8...","[{\name\"": \""English\"", \""iso_639_1\"": \""en\""}]"""
4,https://m.media-amazon.com/images/M/MV5BMWMwMG...,The Godfather: Part II,1974,"Crime, Drama",9.0,thegodfatherpartii,1221,The Godfather: Part II,1974-12-20,13000000,"[{\id\"": 18, \""name\"": \""Drama\""}, {\""id\"": 80...","[{\name\"": \""English\"", \""iso_639_1\"": \""en\""}..."
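
The output above shows that joining on `token` alone pairs "The Dark Knight" (2008) with two different `MovieID` values, one of them released in 2011. One way to drop such false matches, sketched here on a two-row toy example, is to also require the release years to agree. This is a suggestion rather than a step from the notebook.

```python
import pandas as pd

left = pd.DataFrame({
    "Title": ["The Dark Knight"],
    "Released_Year": ["2008"],   # stored as text in the source table
    "token": ["thedarkknight"],
})
right = pd.DataFrame({
    "MovieID": [58559, 130219],
    "Title": ["The Dark Knight", "The Dark Knight"],
    "Release_date": ["2008-07-16", "2011-07-11"],
    "token": ["thedarkknight", "thedarkknight"],
})

merged = left.merge(right, on="token")

# Compare years numerically; unparsable values become NaN/NaT and drop out.
left_year = pd.to_numeric(merged["Released_Year"], errors="coerce")
right_year = pd.to_datetime(merged["Release_date"], errors="coerce").dt.year
deduped = merged[left_year == right_year]

print(deduped["MovieID"].tolist())
```

An exact year match may be too strict if the two sources record different release dates for the same film, so a tolerance of one year could be a reasonable variation.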


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 928 entries, 0 to 927
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Poster_Link       928 non-null    object 
 1   Title_x           928 non-null    object 
 2   Released_Year     928 non-null    object 
 3   Genre             928 non-null    object 
 4   IMDB_Rating       928 non-null    float64
 5   token             928 non-null    object 
 6   MovieID           928 non-null    int64  
 7   Title_y           928 non-null    object 
 8   Release_date      928 non-null    object 
 9   Budget            928 non-null    int64  
 10  Genres            928 non-null    object 
 11  Spoken_languages  928 non-null    object 
dtypes: float64(1), int64(2), object(9)
memory usage: 94.2+ KB


In [77]:
movie_df = df.rename(columns={'Poster_Link': 'poster', 'Title_x': 'title', 'Released_Year': 'year', 'IMDB_Rating': 'imdb_rating', 'MovieID': 'movie_id', 'Genres': 'genres'})
movie_df = movie_df[['poster', 'title', 'year', 'imdb_rating', 'movie_id', 'genres']]
# Drop rows without genres; copy so later column assignments do not target a view.
movie_df = movie_df[movie_df['genres'] != '[]'].copy()
movie_df.head(5)

Unnamed: 0,poster,title,year,imdb_rating,movie_id,genres
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,9.3,318,"[{\id\"": 18, \""name\"": \""Drama\""}, {\""id\"": 80..."
1,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972,9.2,858,"[{\id\"": 18, \""name\"": \""Drama\""}, {\""id\"": 80..."
2,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008,9.0,58559,"[{\id\"": 18, \""name\"": \""Drama\""}, {\""id\"": 28..."
3,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008,9.0,130219,"[{\id\"": 28, \""name\"": \""Action\""}, {\""id\"": 8..."
4,https://m.media-amazon.com/images/M/MV5BMWMwMG...,The Godfather: Part II,1974,9.0,1221,"[{\id\"": 18, \""name\"": \""Drama\""}, {\""id\"": 80..."


In [78]:
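import json

def format_genres(genres):
    """Return the genre names from a damaged JSON cell, or [] if it cannot be parsed."""
    try:
        # The cell lost its opening quote and the quote after the first backslash;
        # restore both so the value is a valid JSON string again.
        valid_json = "\"" + genres
        valid_json = valid_json.replace("\\", "\\\"", 1)
        # Decode twice: first the outer string, then the list it contains.
        result = json.loads(json.loads(valid_json))
        return [genre['name'] for genre in result]
    except (TypeError, ValueError, KeyError):
        # Non-string cells, malformed JSON, or entries without a 'name' key.
        return []

movie_df['genres'] = movie_df['genres'].apply(format_genres)
movie_df.head(5)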

Unnamed: 0,poster,title,year,imdb_rating,movie_id,genres
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,9.3,318,"[Drama, Crime]"
1,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972,9.2,858,"[Drama, Crime]"
2,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008,9.0,58559,"[Drama, Action, Crime, Thriller]"
3,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008,9.0,130219,"[Action, Crime, Drama, Thriller]"
4,https://m.media-amazon.com/images/M/MV5BMWMwMG...,The Godfather: Part II,1974,9.0,1221,"[Drama, Crime]"
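
To make the repair in `format_genres` easier to follow, here is the same sequence applied step by step to a single value. The `raw` string is an assumption: it is reconstructed from the shape of the `Genres` cells shown in the outputs above, not copied from the dataset.

```python
import json

# Assumed shape of a raw cell, reconstructed from the output above.
raw = r'[{\id\": 18, \"name\": \"Drama\"}, {\"id\": 80, \"name\": \"Crime\"}]"'

# 1. Restore the opening quote of the outer JSON string.
quoted = '"' + raw
# 2. Restore the quote lost after the first backslash: \id becomes \"id
quoted = quoted.replace("\\", "\\\"", 1)
# 3. The first decode unescapes the string, the second parses the list.
inner = json.loads(quoted)
genres = [item["name"] for item in json.loads(inner)]

print(inner)
print(genres)
```

After step 3, `inner` is an ordinary JSON array in text form, and the second `json.loads` turns it into a list of dictionaries.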


In [81]:
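# Request a larger poster (268x392 instead of 67x98) by rewriting the size suffix literally.
movie_df['poster'] = movie_df['poster'].str.replace('_V1_UX67_CR0,0,67,98_AL_', '_V1_UX268_CR0,0,268,392_AL_', regex=False)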
movie_df.head(5)

Unnamed: 0,poster,title,year,imdb_rating,movie_id,genres
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,9.3,318,"[Drama, Crime]"
1,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972,9.2,858,"[Drama, Crime]"
2,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008,9.0,58559,"[Drama, Action, Crime, Thriller]"
3,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008,9.0,130219,"[Action, Crime, Drama, Thriller]"
4,https://m.media-amazon.com/images/M/MV5BMWMwMG...,The Godfather: Part II,1974,9.0,1221,"[Drama, Crime]"
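
The replacement above multiplies both poster dimensions by four (67x98 to 268x392). If other sizes are ever needed, the same rewrite could be parameterized as in the sketch below. The suffix layout `_V1_UX<width>_CR0,0,<width>,<height>_AL_` is inferred from the single example in this notebook, the `scale_poster_url` helper and the sample URL are hypothetical, and whether the image host serves arbitrary sizes is not verified here.

```python
import re

# Assumed suffix layout: _V1_UX<width>_CR0,0,<width>,<height>_AL_
SIZE_RE = re.compile(r"_V1_UX(\d+)_CR0,0,(\d+),(\d+)_AL_")

def scale_poster_url(url: str, factor: int) -> str:
    """Multiply the width and height encoded in the URL suffix by `factor`."""
    def _scaled(match):
        width = int(match.group(2)) * factor
        height = int(match.group(3)) * factor
        return f"_V1_UX{width}_CR0,0,{width},{height}_AL_"
    return SIZE_RE.sub(_scaled, url)

# Hypothetical URL; only the suffix layout matters here.
url = "https://example.com/images/M/abc._V1_UX67_CR0,0,67,98_AL_.jpg"
print(scale_poster_url(url, 4))
```

URLs that do not contain the expected suffix are returned unchanged.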


## Baseline model