## Recommender Systems

### Author: Вадим Кохтев

In this notebook we apply collaborative filtering using the item-based approach. We work with the MovieLens dataset, which contains movie ratings given by users of the site of the same name.
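To make the idea concrete before touching the real data, here is a minimal sketch of item-based collaborative filtering on a made-up 4x3 rating matrix. The choice of cosine similarity, the function names `item_cosine_similarity` and `predict_scores`, and the convention that 0 means "not rated" are assumptions for illustration, not necessarily what the rest of the notebook uses.

```python
import numpy as np

def item_cosine_similarity(R):
    """Cosine similarity between the columns (items) of a user-item matrix R."""
    norms = np.linalg.norm(R, axis=0)
    norms[norms == 0] = 1.0           # avoid division by zero for unrated items
    S = (R.T @ R) / np.outer(norms, norms)
    np.fill_diagonal(S, 0.0)          # an item should not vote for itself
    return S

def predict_scores(R, S):
    """Score each (user, item) pair as a similarity-weighted average
    of the ratings the user gave to other items."""
    weights = (R > 0).astype(float) @ np.abs(S)  # total similarity to rated items
    weights[weights == 0] = 1.0
    return (R @ S) / weights

# Toy data (assumption): 4 users x 3 items, 0 means "not rated".
R = np.array([
    [5, 4, 0],
    [4, 5, 1],
    [1, 0, 5],
    [0, 1, 4],
], dtype=float)

S = item_cosine_similarity(R)
scores = predict_scores(R, S)
print(np.round(S, 3))
print(np.round(scores, 3))
```

Items 0 and 1 are rated similarly by the same users, so their similarity comes out much higher than either one's similarity to item 2. A recommender would then rank, for each user, the items that user has not rated by their predicted score.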

Let's load the required libraries.

In [8]:
import pickle as pkl
import zipfile
from collections import defaultdict, Counter
import datetime

from scipy import linalg
import scipy.sparse as sps
import numpy as np
import matplotlib.pyplot as plt

Download the data

In [9]:
!wget http://files.grouplens.org/datasets/movielens/ml-1m.zip

--2020-06-03 04:04:08--  http://files.grouplens.org/datasets/movielens/ml-1m.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5917549 (5.6M) [application/zip]
Saving to: ‘ml-1m.zip.2’


2020-06-03 04:04:10 (3.41 MB/s) - ‘ml-1m.zip.2’ saved [5917549/5917549]



Let's open the archive and look at how the data is structured.

In [18]:
with zipfile.ZipFile("ml-1m.zip", "r") as z:
    print("files in archive")
    print(z.namelist())
    # Decode the first line of each file so the fields print as plain strings
    # instead of the repr of a bytes object.
    print("movies")
    with z.open("ml-1m/movies.dat") as m:
        print(m.readline().decode("iso-8859-1").strip().split("::"))
    print("users")
    with z.open("ml-1m/users.dat") as m:
        print(m.readline().decode("iso-8859-1").strip().split("::"))
    print("ratings")
    with z.open("ml-1m/ratings.dat") as m:
        print(m.readline().decode("iso-8859-1").strip().split("::"))

files in archive
['ml-1m/', 'ml-1m/movies.dat', 'ml-1m/ratings.dat', 'ml-1m/README', 'ml-1m/users.dat']
movies
['1', 'Toy Story (1995)', "Animation|Children's|Comedy"]
users
['1', 'F', '1', '10', '48067']
ratings
['1', '1193', '5', '978300760']


The archive contains three tables. For each movie we have its movieId, title, and genres. For each user we know the userId, gender (F, M), age, an encoded occupation, and zip code. Each rating record holds a userId, a movieId, the rating itself, and the timestamp at which the rating was made. Let's read the data.

In [19]:
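# Read the data into dictionaries keyed by the original ids
movies = {}  # MovieID -> {"Title", "Genres"}
users = {}  # UserID -> {"Gender", "Age", "Occupation", "Zip-code"}
ratings = defaultdict(list)  # UserID -> [(MovieID, Rating, datetime), ...]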

with zipfile.ZipFile("ml-1m.zip", "r") as z:
    # parse movies
    with z.open("ml-1m/movies.dat") as m:
        for line in m:
            MovieID, Title, Genres = line.decode('iso-8859-1').strip().split("::")
            MovieID = int(MovieID)
            Genres = Genres.split("|")
            movies[MovieID] = {"Title": Title, "Genres": Genres}
    
    # parse users
    with z.open("ml-1m/users.dat") as m:
        fields = ["UserID", "Gender", "Age", "Occupation", "Zip-code"]
        for line in m:
            row = list(zip(fields, line.decode('iso-8859-1').strip().split("::")))
            data = dict(row[1:])
            data["Occupation"] = int(data["Occupation"])
            users[int(row[0][1])] = data
    
    # parse ratings
    with z.open("ml-1m/ratings.dat") as m:
        for line in m:
            UserID, MovieID, Rating, Timestamp = line.decode('iso-8859-1').strip().split("::")
            UserID = int(UserID)
            MovieID = int(MovieID)
            Rating = int(Rating)
            Timestamp = int(Timestamp)
            ratings[UserID].append((MovieID, Rating, datetime.datetime.fromtimestamp(Timestamp)))

Let's look at the data

In [20]:
print(users[3])
print(ratings[3])

{'Gender': 'M', 'Age': '25', 'Occupation': 15, 'Zip-code': '55117'}
[(3421, 4, datetime.datetime(2001, 1, 1, 0, 29, 7)), (1641, 2, datetime.datetime(2001, 1, 1, 0, 33, 50)), (648, 3, datetime.datetime(2001, 1, 1, 0, 24, 27)), (1394, 4, datetime.datetime(2001, 1, 1, 0, 29, 7)), (3534, 3, datetime.datetime(2001, 1, 1, 0, 11, 8)), (104, 4, datetime.datetime(2001, 1, 1, 0, 34, 46)), (2735, 4, datetime.datetime(2001, 1, 1, 0, 24, 27)), (1210, 4, datetime.datetime(2001, 1, 1, 0, 20)), (1431, 3, datetime.datetime(2001, 1, 1, 0, 11, 35)), (3868, 3, datetime.datetime(2001, 1, 1, 0, 34, 46)), (1079, 5, datetime.datetime(2001, 1, 1, 0, 31, 36)), (2997, 3, datetime.datetime(2001, 1, 1, 0, 29, 7)), (1615, 5, datetime.datetime(2001, 1, 1, 0, 21, 50)), (1291, 4, datetime.datetime(2001, 1, 1, 0, 20)), (1259, 5, datetime.datetime(2001, 1, 1, 0, 31, 36)), (653, 4, datetime.datetime(2001, 1, 1, 0, 22, 37)), (2167, 5, datetime.datetime(2001, 1, 1, 0, 20)), (1580, 3, datetime.datetime(2001, 1, 1, 0, 21, 3)
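Printing one user's raw list is hard to read, so a few aggregate numbers can help as a sanity check. The sketch below uses a small made-up dictionary, `toy_ratings`, in the same shape as `ratings` (user id mapped to a list of `(movie id, rating, datetime)` tuples). The helper name `summarize` is hypothetical.

```python
from collections import Counter
import datetime

# Toy data (assumption) mirroring the structure of `ratings`.
toy_ratings = {
    1: [(10, 5, datetime.datetime(2001, 1, 1, 0, 20)),
        (11, 3, datetime.datetime(2001, 1, 1, 0, 21))],
    2: [(10, 4, datetime.datetime(2001, 1, 2, 12, 0))],
}

def summarize(ratings):
    """Return simple counts and the mean rating for a ratings dictionary."""
    n_ratings = sum(len(user_ratings) for user_ratings in ratings.values())
    histogram = Counter(
        rating
        for user_ratings in ratings.values()
        for _, rating, _ in user_ratings
    )
    mean_rating = sum(value * count for value, count in histogram.items()) / n_ratings
    return {
        "n_users": len(ratings),
        "n_ratings": n_ratings,
        "histogram": dict(histogram),
        "mean_rating": mean_rating,
    }

summary = summarize(toy_ratings)
print(summary)
```

Calling `summarize(ratings)` on the real dictionary should work the same way, as long as it is run before `ratings` is converted to a list in the next cell.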

In [21]:
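# Map the original MovieIDs (which may have gaps) to contiguous 0-based indices.
m2m = dict()
nw_movies = []
for movie in movies:
    nw_id = len(m2m)
    m2m[movie] = nw_id
    nw_movies.append(movies[movie])

# Users are re-indexed implicitly by their position in the list,
# so no explicit old-to-new user mapping is kept.
nw_users = []
nw_ratings = []
for user in users:
    nw_users.append(users[user])
    # Rewrite each rating so that it refers to the new movie index.
    nw_ratings.append([(m2m[r[0]], r[1], r[2]) for r in ratings[user]])

# Keep the originals around and switch to the re-indexed versions.
old_users, old_movies, old_ratings = users, movies, ratings
users, movies, ratings = nw_users, nw_movies, nw_ratings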
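Contiguous 0-based indices matter because they can be used directly as row and column positions of a matrix. As a sketch of one possible next step, the code below packs a list-of-lists ratings structure into a sparse user-item matrix with `scipy.sparse`. The helper `build_rating_matrix` and the toy data are assumptions for illustration.

```python
import numpy as np
import scipy.sparse as sps

def build_rating_matrix(ratings, n_movies):
    """Build a (n_users x n_movies) CSR matrix from per-user rating lists.

    `ratings[u]` is a list of (movie_index, rating, timestamp) tuples,
    with both users and movies indexed from 0.
    """
    rows, cols, vals = [], [], []
    for user_idx, user_ratings in enumerate(ratings):
        for movie_idx, rating, _ in user_ratings:
            rows.append(user_idx)
            cols.append(movie_idx)
            vals.append(rating)
    return sps.csr_matrix(
        (vals, (rows, cols)),
        shape=(len(ratings), n_movies),
        dtype=np.float32,
    )

# Toy data (assumption): 2 users, 3 movies; timestamps are omitted here.
toy = [
    [(0, 5, None), (2, 3, None)],
    [(1, 4, None)],
]
R = build_rating_matrix(toy, n_movies=3)
print(R.toarray())
```

With the real data the call would be `build_rating_matrix(ratings, len(movies))`. A sparse format is a natural fit here because each user rates only a small fraction of all movies.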

In [15]:
with open('data.pkl', 'wb') as f:
    pkl.dump({
        'users': users,
        'movies': movies,
        'ratings': ratings,
    }, f)
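Reading the file back is symmetric to writing it. The sketch below round-trips a small made-up payload through a temporary file so that it does not overwrite `data.pkl`; the payload contents and the file name `data_example.pkl` are assumptions for illustration.

```python
import datetime
import os
import pickle as pkl
import tempfile

# Toy payload (assumption) with the same top-level keys as data.pkl.
payload = {
    "users": [{"Gender": "F", "Age": "1", "Occupation": 10, "Zip-code": "48067"}],
    "movies": [{"Title": "Toy Story (1995)", "Genres": ["Animation", "Comedy"]}],
    "ratings": [[(0, 5, datetime.datetime(2001, 1, 1, 0, 20))]],
}

path = os.path.join(tempfile.mkdtemp(), "data_example.pkl")

# Write the payload in binary mode, then read it back.
with open(path, "wb") as f:
    pkl.dump(payload, f)

with open(path, "rb") as f:
    restored = pkl.load(f)

users_restored = restored["users"]
movies_restored = restored["movies"]
ratings_restored = restored["ratings"]
print(len(users_restored), len(movies_restored), len(ratings_restored))
```

Pickle files should only be loaded from sources we trust, since unpickling can execute arbitrary code.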