# TensorFlow Recommenders: Basics

This notebook uses TFRS with the [MovieLens 100K dataset](https://grouplens.org/datasets/movielens/100k/) to build a simple matrix factorization model. The trained model can then be used to recommend movies to a given user.

In [1]:
!pip install -q tensorflow-recommenders
!pip install -q --upgrade tensorflow-datasets

In [2]:
from typing import Dict, Text

import numpy as np
import tensorflow as tf

import tensorflow_datasets as tfds
import tensorflow_recommenders as tfrs

import pprint

### Read the data

In [3]:
# Ratings data.
ratings = tfds.load('movielens/100k-ratings', split="train")
for x in ratings.take(1).as_numpy_iterator():
    pprint.pprint(x)

{'bucketized_user_age': 45.0,
 'movie_genres': array([7]),
 'movie_id': b'357',
 'movie_title': b"One Flew Over the Cuckoo's Nest (1975)",
 'raw_user_age': 46.0,
 'timestamp': 879024327,
 'user_gender': True,
 'user_id': b'138',
 'user_occupation_label': 4,
 'user_occupation_text': b'doctor',
 'user_rating': 4.0,
 'user_zip_code': b'53211'}


In [4]:
# Features of all the available movies.
movies = tfds.load('movielens/100k-movies', split="train")
for x in movies.take(1).as_numpy_iterator():
    pprint.pprint(x)

{'movie_genres': array([4]),
 'movie_id': b'1681',
 'movie_title': b'You So Crazy (1994)'}


In [5]:
# Keep only the basic features.
ratings = ratings.map(lambda x: {
    "movie_title": x["movie_title"],
    "user_id": x["user_id"]
})

movies = movies.map(lambda x: x["movie_title"])

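To feed an embedding layer, we build vocabularies that map each `user ID` and `movie title` to an integer index.

A layer's vocabulary must either be supplied at construction time or learned through `adapt()`. During `adapt()`, the layer scans the dataset, counts how often each distinct string token occurs, and builds the vocabulary from those counts. If the vocabulary size is capped, only the most frequent tokens are kept and every other token is treated as out-of-vocabulary (OOV).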
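
The frequency-based behaviour described above can be sketched in plain Python. This is only an illustration of the idea, not the Keras implementation: `build_vocabulary` and `lookup` are hypothetical helpers, and the tie-breaking order among equally frequent tokens is an assumption that may differ from what `StringLookup` does.

```python
from collections import Counter
from typing import Dict, Iterable, Optional

OOV_INDEX = 0  # index 0 is reserved for out-of-vocabulary tokens


def build_vocabulary(tokens: Iterable[str],
                     max_tokens: Optional[int] = None) -> Dict[str, int]:
    """Map each token to an integer index, most frequent token first."""
    counts = Counter(tokens)
    # One slot of the size cap is taken by the OOV entry.
    limit = None if max_tokens is None else max_tokens - 1
    # most_common() sorts by descending frequency; ties keep first-seen
    # order, which is an assumption of this sketch.
    kept = [token for token, _ in counts.most_common(limit)]
    return {token: index for index, token in enumerate(kept, start=OOV_INDEX + 1)}


def lookup(vocabulary: Dict[str, int], token: str) -> int:
    """Return the token's index, or the OOV index if it was not kept."""
    return vocabulary.get(token, OOV_INDEX)


user_ids = ["405", "655", "405", "13", "655", "405"]
vocabulary = build_vocabulary(user_ids, max_tokens=3)
print(vocabulary)                # {'405': 1, '655': 2}
print(lookup(vocabulary, "13"))  # 0, because "13" fell outside the size cap
```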

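Turning a raw categorical feature into an embedding is typically a two-step process.

First, map the raw values to a contiguous range of integers by building a "vocabulary" that maps each raw value (e.g. "Star Wars") to an integer (e.g. 15).
Second, take those integers and turn them into embeddings.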
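
The two steps can be illustrated with NumPy alone. This is a toy sketch under stated assumptions: the two-entry vocabulary, the 4-dimensional embedding size, and the helpers `to_index` and `embed` are made up for illustration, and the table is randomly initialized rather than trained.

```python
import numpy as np

# Step 1: a toy vocabulary. Index 0 is reserved for unknown titles.
toy_vocabulary = {"Star Wars (1977)": 1, "Toy Story (1995)": 2}


def to_index(title: str) -> int:
    """Map a raw title to its integer index (0 if unknown)."""
    return toy_vocabulary.get(title, 0)


# Step 2: an embedding table with one row per index, including the OOV row.
rng = np.random.default_rng(seed=0)
embedding_dim = 4
embedding_table = rng.normal(scale=0.05,
                             size=(len(toy_vocabulary) + 1, embedding_dim))


def embed(titles):
    """Look up one embedding row per title."""
    indices = np.array([to_index(title) for title in titles])
    return embedding_table[indices]


vectors = embed(["Star Wars (1977)", "Unknown Film (2000)"])
print(vectors.shape)  # (2, 4)
```

An embedding layer is, in effect, a trainable version of `embedding_table`: the lookup stays the same, while training adjusts the rows.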

In [6]:
user_ids_vocabulary = tf.keras.layers.StringLookup()
user_ids_vocabulary.adapt(ratings.map(lambda x: x["user_id"]))

movie_titles_vocabulary = tf.keras.layers.StringLookup()
movie_titles_vocabulary.adapt(movies)

In [7]:
user_ids_vocabulary.get_vocabulary()[:10]

['[UNK]', '405', '655', '13', '450', '276', '416', '537', '303', '234']

In [8]:
data = tf.constant(['405', '655', '450'])
user_ids_vocabulary(data)

<tf.Tensor: shape=(3,), dtype=int64, numpy=array([1, 2, 4])>

In [9]:
movie_titles_vocabulary.get_vocabulary()[:10]

['[UNK]',
 "Ulee's Gold (1997)",
 'That Darn Cat! (1997)',
 'Substance of Fire, The (1996)',
 'Sliding Doors (1998)',
 'Nightwatch (1997)',
 'Money Talks (1997)',
 'Kull the Conqueror (1997)',
 'Ice Storm, The (1997)',
 'Hurricane Streets (1998)']

In [11]:
movie_titles_vocabulary(["Star Wars (1977)", "One Flew Over the Cuckoo's Nest (1975)"])

<tf.Tensor: shape=(2,), dtype=int64, numpy=array([281, 576])>

### Define the model

You can define a TFRS model by subclassing `tfrs.Model` and implementing the `compute_loss` method.

In [16]:
class MovieLensModel(tfrs.Model):

  def __init__(self,
               user_model: tf.keras.Model,
               movie_model: tf.keras.Model,
               task: tfrs.tasks.Retrieval):
    super().__init__()

    # Set up user and movie representations.
    self.user_model = user_model
    self.movie_model = movie_model
    # Set up a retrieval task.
    self.task = task

  def compute_loss(self, features: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:
    # Compute the loss: embed both sides, then pass the embeddings to the task.
    user_embeddings = self.user_model(features["user_id"])
    movie_embeddings = self.movie_model(features["movie_title"])

    return self.task(user_embeddings, movie_embeddings)
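
To give a feel for what the task does with the two embedding batches, here is a NumPy sketch of an in-batch softmax loss. It assumes the retrieval objective scores every user in the batch against every movie in the batch and treats the matching pair as the correct class; `in_batch_softmax_loss` is a hypothetical helper, not a TFRS function, and the real task may differ in details such as reduction or sampling corrections.

```python
import numpy as np


def in_batch_softmax_loss(user_embeddings: np.ndarray,
                          movie_embeddings: np.ndarray) -> float:
    """Summed cross-entropy where row i's correct class is column i."""
    # scores[i, j] is the affinity of user i for movie j in the same batch.
    scores = user_embeddings @ movie_embeddings.T
    # Numerically stable log-softmax over each row.
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # The diagonal holds the log-probability of each true (user, movie) pair.
    return float(-np.trace(log_probs))


users = 5.0 * np.eye(3)
movies = 5.0 * np.eye(3)
aligned = in_batch_softmax_loss(users, movies)
misaligned = in_batch_softmax_loss(users, np.roll(movies, 1, axis=0))
print(aligned < misaligned)  # True
```

The loss is small when each user embedding points toward its own movie and large when the pairing is scrambled, which is what pushes the two embedding spaces into agreement during training.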

Define the two models and the retrieval task.

In [17]:
# Define the user and movie models.
# vocabulary_size() replaces the deprecated vocab_size().
user_model = tf.keras.Sequential([
    user_ids_vocabulary,
    tf.keras.layers.Embedding(user_ids_vocabulary.vocabulary_size(), 64)
])

movie_model = tf.keras.Sequential([
    movie_titles_vocabulary,
    tf.keras.layers.Embedding(movie_titles_vocabulary.vocabulary_size(), 64)
])

# Define the objective. The metric scores each batch against all movie embeddings.
task = tfrs.tasks.Retrieval(metrics=tfrs.metrics.FactorizedTopK(
    movies.batch(128).map(movie_model)
  )
)
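
A top-k retrieval metric asks how often the true movie ranks among the k highest-scoring candidates. The NumPy sketch below illustrates that idea only; `top_k_accuracy` is a hypothetical helper, its handling of ties is an assumption, and the TFRS metric is implemented differently.

```python
import numpy as np


def top_k_accuracy(user_embeddings: np.ndarray,
                   true_movie_indices: np.ndarray,
                   candidate_embeddings: np.ndarray,
                   k: int) -> float:
    """Fraction of users whose true movie is among their top-k candidates."""
    # scores has shape (num_users, num_candidates).
    scores = user_embeddings @ candidate_embeddings.T
    true_scores = scores[np.arange(len(scores)), true_movie_indices]
    # Rank = number of candidates scoring strictly higher than the true movie.
    ranks = (scores > true_scores[:, None]).sum(axis=1)
    return float((ranks < k).mean())


candidates = np.eye(4)
toy_users = np.array([[1.0, 0.0, 0.0, 0.0],
                      [0.0, 0.9, 1.0, 0.0]])
true_movies = np.array([0, 1])
print(top_k_accuracy(toy_users, true_movies, candidates, k=1))  # 0.5
print(top_k_accuracy(toy_users, true_movies, candidates, k=2))  # 1.0
```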

You can fetch the embedding for a movie title directly.

In [21]:
movie_model.predict(["Star Wars (1977)"])

array([[ 0.02464552,  0.03703124, -0.03513993, -0.00022671,  0.04131972,
         0.02195212,  0.01758487, -0.03748429, -0.02752152, -0.00870178,
         0.03778246,  0.04711354, -0.02785167, -0.02634217, -0.01019273,
         0.04191982,  0.04751679,  0.00382006,  0.04536116, -0.00355872,
         0.01392126,  0.04514055,  0.03960384, -0.00498885,  0.00825249,
        -0.01118356,  0.04231557,  0.00838137, -0.00834968,  0.00044169,
        -0.00828087,  0.01719515, -0.02198948,  0.04215211,  0.00681734,
        -0.00324522,  0.0324559 ,  0.00190448, -0.01686861, -0.00969291,
        -0.04881169,  0.04725707,  0.01275358,  0.03002026,  0.0448077 ,
        -0.03684868,  0.03199245,  0.01768302, -0.03356913, -0.01228727,
         0.02086962,  0.00814749, -0.0367928 , -0.02020143, -0.00380058,
        -0.01869233,  0.03790614, -0.0317286 , -0.00304841, -0.02894868,
         0.04864056, -0.02740761,  0.00759175,  0.03084237]],
      dtype=float32)


### Fit and evaluate it.



In [22]:
# Create the retrieval model.
model = MovieLensModel(user_model, movie_model, task)
model.compile(optimizer=tf.keras.optimizers.Adagrad(0.5))

# Train for 3 epochs.
model.fit(ratings.batch(4096), epochs=3)

# Use brute-force search to set up retrieval using the trained representations.
index = tfrs.layers.factorized_top_k.BruteForce(model.user_model)
index.index_from_dataset(
    movies.batch(100).map(lambda title: (title, model.movie_model(title))))

# Get some recommendations.
_, titles = index(np.array(["42"]))
print(f"Top 3 recommendations for user 42: {titles[0, :3]}")

Epoch 1/3
Epoch 2/3
Epoch 3/3
Top 3 recommendations for user 42: [b'Mirage (1995)' b'Rent-a-Kid (1995)'
 b'Far From Home: The Adventures of Yellow Dog (1995)']


In [23]:
titles

<tf.Tensor: shape=(1, 10), dtype=string, numpy=
array([[b'Mirage (1995)', b'Rent-a-Kid (1995)',
        b'Far From Home: The Adventures of Yellow Dog (1995)',
        b'Land Before Time III: The Time of the Great Giving (1995) (V)',
        b'Just Cause (1995)', b'Aristocats, The (1970)',
        b'Winnie the Pooh and the Blustery Day (1968)',
        b'Scarlet Letter, The (1926)', b'Trial by Jury (1994)',
        b'House Arrest (1996)']], dtype=object)>
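
Conceptually, brute-force retrieval scores the query embedding against every candidate embedding and keeps the best k. The NumPy sketch below shows that idea with made-up 2-dimensional embeddings; `brute_force_top_k` and the toy titles are hypothetical and stand in for the trained user and movie representations.

```python
import numpy as np


def brute_force_top_k(query_embedding, candidate_embeddings, candidate_titles, k=3):
    """Score the query against every candidate and return the k best."""
    # Dot-product affinity between the query and each candidate.
    scores = candidate_embeddings @ query_embedding
    # Indices of the k highest scores, best first.
    top = np.argsort(-scores)[:k]
    return [(candidate_titles[i], float(scores[i])) for i in top]


toy_titles = ["A", "B", "C", "D"]
toy_candidates = np.array([[1.0, 0.0],
                           [0.0, 1.0],
                           [0.7, 0.7],
                           [-1.0, 0.0]])
toy_query = np.array([1.0, 0.2])

result = brute_force_top_k(toy_query, toy_candidates, toy_titles, k=2)
print([title for title, _ in result])  # ['A', 'C']
```

Exhaustive scoring like this costs time linear in the number of candidates per query, which is fine for the 1,682 movies here but is the reason larger catalogs usually switch to approximate nearest-neighbor search.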