# Recommender System 

## Understanding Recommender Systems

### Content-Based Filtering
        - Uses item features to make recommendations: given the features of one item, it recommends other items with similar features
        - For example, if a user watches a video with certain traits, we can recommend other videos that share those traits
        - A limitation of content-based filtering is that it only leverages item similarities
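
A minimal sketch of the idea above, assuming each video is described by a small hand-made feature vector and that similarity is measured with cosine similarity (one common choice, not the only one). The feature values and the helper name `most_similar_items` are hypothetical.

```python
import numpy as np

# Hypothetical item-feature matrix: rows are videos, columns are traits
# (comedy, action, documentary). All values are made up for illustration.
item_features = np.array([
    [1.0, 0.0, 0.0],  # video 0: comedy
    [0.9, 0.1, 0.0],  # video 1: mostly comedy
    [0.5, 0.5, 0.0],  # video 2: half comedy, half action
    [0.0, 0.0, 1.0],  # video 3: documentary
])

def most_similar_items(features, item_index, top_k=2):
    """Return indices of the top_k items closest to item_index by cosine similarity."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / norms                # scale every row to unit length
    scores = unit @ unit[item_index]       # cosine similarity to the chosen item
    scores[item_index] = -np.inf           # never recommend the item itself
    return np.argsort(-scores)[:top_k]     # highest similarity first

# The user watched video 0, so recommend the videos whose traits are closest to it.
recommended = most_similar_items(item_features, item_index=0)
print(recommended)
```

Only item features are used here; no information about other users enters the recommendation.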

### Collaborative Filtering
        - Uses similarities between users and items simultaneously to make recommendations (recommends items to user A based on the interests of a similar user B)
        - Explicit feedback: the user gives direct feedback (ratings, comments, etc.)
        - Implicit feedback: more subtle, the user's indirect behavior toward an item (watch time, click rate, etc.)
        - For example, if user 1 has watched movies A, B, and C but user 2 has only watched movies A and C, we can use user 1's history to recommend 
          that user 2 watch movie B. 
        - Collaborative filtering is (to me) the better route to go
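
The two-user movie example above can be sketched as follows. This is a toy nearest-neighbor illustration, assuming implicit "watched" feedback and using the count of shared movies as the similarity; the function name `recommend_from_neighbor` is hypothetical.

```python
import numpy as np

movies = ["A", "B", "C"]
# Rows are users, columns are movies; 1 means "watched" (implicit feedback).
watched = np.array([
    [1, 1, 1],  # user 1 watched A, B, C
    [1, 0, 1],  # user 2 watched A, C
])

def recommend_from_neighbor(watched, movie_names, target_user):
    """Recommend what the most similar other user watched and target_user has not."""
    overlap = watched @ watched[target_user]  # number of shared movies with each user
    overlap[target_user] = -1                 # exclude the target user
    neighbor = int(np.argmax(overlap))        # user with the largest overlap
    unseen = (watched[neighbor] == 1) & (watched[target_user] == 0)
    return [movie_names[i] for i in np.flatnonzero(unseen)]

# Row index 1 is user 2 in the example.
print(recommend_from_neighbor(watched, movies, target_user=1))
```

No movie features are needed; the recommendation comes entirely from another user's behavior.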

#### Collaborative Filtering in practice
        - We can assign each user a value between -1 and 1 describing their interest in a certain kind of movie (-1 means the most interest, 1 means the least), and we can do the same for movies:
          -1 means the movie is strongly of that kind and 1 means it is not. The product of a user's value and a movie's value is then close to 1 when the two match
        - In this example we hand-engineered one-dimensional embeddings. In practice, embeddings have many more dimensions and are learned automatically; 
          that's the beauty of collaborative filtering models. 
        - U is the matrix of user embeddings and V is the matrix of product embeddings; their product U V^T is the predicted feedback matrix, which approximates the feedback matrix A.
        - Our optimization objective is to minimize the sum of the squared differences between the feedback labels and the predicted feedback
        - We can solve this using SGD or Weighted Alternating Least Squares (WALS); WALS is specific to this problem
        - The idea of WALS is that each iteration alternates between fixing U and solving for V, and then fixing V and solving for U. 
        - WALS usually converges much faster than SGD, but SGD is more flexible with other loss functions
        - So far we have only talked about observed entries; the unobserved ones still need to be handled
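
The alternating idea can be sketched in NumPy. This is a plain (unweighted) alternating least squares loop over observed entries only, not a full WALS implementation. The L2 regularization term `reg`, the toy rating matrix, and the names `als_factorize` and `observed_loss` are assumptions added for illustration.

```python
import numpy as np

def als_factorize(A, observed, dim=2, n_iters=20, reg=0.1, seed=0):
    """Alternate between solving for U with V fixed and for V with U fixed.

    A is an (n_users, n_items) feedback matrix and observed is a boolean mask
    of known entries. The reg term is an assumption that keeps each solve well-posed.
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = A.shape
    U = rng.normal(scale=0.1, size=(n_users, dim))
    V = rng.normal(scale=0.1, size=(n_items, dim))
    ridge = reg * np.eye(dim)
    for _ in range(n_iters):
        # Fix V: each user row of U is a small regularized least-squares problem.
        for i in range(n_users):
            Vi = V[observed[i]]
            U[i] = np.linalg.solve(Vi.T @ Vi + ridge, Vi.T @ A[i, observed[i]])
        # Fix U: each item row of V is solved the same way.
        for j in range(n_items):
            Uj = U[observed[:, j]]
            V[j] = np.linalg.solve(Uj.T @ Uj + ridge, Uj.T @ A[observed[:, j], j])
    return U, V

def observed_loss(A, observed, U, V):
    """Sum of squared differences between labels and predictions on observed entries."""
    return float(np.sum((A - U @ V.T)[observed] ** 2))

# Made-up ratings; 0 marks an unobserved entry.
A = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
    [0.0, 1.0, 4.0, 5.0],
])
observed = A > 0

U, V = als_factorize(A, observed)
baseline = float(np.sum(A[observed] ** 2))  # loss of predicting 0 everywhere
print(baseline, observed_loss(A, observed, U, V))
```

Each inner solve is a closed-form linear system, which is why the alternating approach needs no learning rate, unlike SGD.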


##### Un-Observed Items in collaborative filtering using matrix factorization
        - Minimizing the objective function over only the observed entries is not what we want: for example, when every observed label is 1, a prediction matrix of all ones has zero loss but makes useless recommendations
        - We can fix this using weighted matrix factorization: we treat unobserved entries as 0, and we scale the unobserved part of the objective function by a weight
          so it does not dominate the observed part
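
A small sketch of the weighted objective, assuming unobserved entries are treated as 0 and scaled by a single weight. The parameter name `w0`, the function name `weighted_mf_loss`, and the toy matrices are hypothetical.

```python
import numpy as np

def weighted_mf_loss(A, observed, U, V, w0=0.1):
    """Squared error on observed entries plus w0 times squared error on
    unobserved entries, where unobserved feedback is treated as 0."""
    pred = U @ V.T
    observed_term = np.sum((A - pred)[observed] ** 2)
    unobserved_term = np.sum(pred[~observed] ** 2)  # target is 0 for these entries
    return float(observed_term + w0 * unobserved_term)

# Implicit feedback: 1 means the user interacted with the item.
A = np.array([[1.0, 0.0],
              [0.0, 1.0]])
observed = A == 1

# One-dimensional embeddings that predict 1 for every user-item pair.
U = np.ones((2, 1))
V = np.ones((2, 1))

print(weighted_mf_loss(A, observed, U, V, w0=0.0))  # observed-only objective
print(weighted_mf_loss(A, observed, U, V, w0=0.5))  # unobserved entries now count
```

With `w0=0.0` the all-ones prediction looks perfect; a positive `w0` penalizes it for also predicting 1 on pairs that were never observed.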

# Building The Recommender System

In [2]:
import tensorflow_datasets as tfds
import tensorflow as tf
import numpy as np
import pandas as pd

In [22]:
# Load movielens-ratings dataset
ratings_df = tfds.load("movielens/100k-ratings", split = "train")

In [12]:
for x in ratings_df.take(1).as_numpy_iterator():
    print(x)

{'bucketized_user_age': 45.0, 'movie_genres': array([7]), 'movie_id': b'357', 'movie_title': b"One Flew Over the Cuckoo's Nest (1975)", 'raw_user_age': 46.0, 'timestamp': 879024327, 'user_gender': True, 'user_id': b'138', 'user_occupation_label': 4, 'user_occupation_text': b'doctor', 'user_rating': 4.0, 'user_zip_code': b'53211'}


2022-08-10 21:53:18.289080: W tensorflow/core/kernels/data/cache_dataset_ops.cc:856] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.


In [23]:
# Load in movielens-movies data
movies_df = tfds.load("movielens/100k-movies", split = "train")

for x in movies_df.take(1).as_numpy_iterator():
    print(x)

{'movie_genres': array([4]), 'movie_id': b'1681', 'movie_title': b'You So Crazy (1994)'}




In [31]:
tf.random.set_seed(42)
shuffled = ratings_df.shuffle(100000, seed = 42, reshuffle_each_iteration=False)

train = shuffled.take(80000)
test = shuffled.skip(80000).take(20000)

# Keep only the features needed for the vocabularies. Batching the raw examples
# fails because movie_genres has a different length from one example to the next.
ratings = ratings_df.map(lambda x: {"movie_title": x["movie_title"], "user_id": x["user_id"]})
movies = movies_df.map(lambda x: x["movie_title"])

movie_titles = movies.batch(1000)
user_ids = ratings.batch(1000000).map(lambda x: x["user_id"])



In [33]:
# Build sorted vocabularies of the unique movie titles and user IDs
unique_movie_titles = np.unique(np.concatenate(list(movie_titles)))
unique_user_ids = np.unique(np.concatenate(list(user_ids)))

unique_movie_titles[:4]

In [27]:
user_ids

<MapDataset element_spec=TensorSpec(shape=(None,), dtype=tf.string, name=None)>
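
The vocabulary step boils down to concatenating per-batch arrays and de-duplicating them. A NumPy-only sketch, assuming two made-up batches of byte strings stand in for what iterating a batched string dataset yields (the titles are the two printed earlier; the batch contents are illustrative):

```python
import numpy as np

# Two hypothetical batches of byte-string titles, with one title repeated.
batches = [
    np.array([b"You So Crazy (1994)", b"One Flew Over the Cuckoo's Nest (1975)"]),
    np.array([b"You So Crazy (1994)"]),
]

# Concatenate the batches into one array, then keep each title once, sorted.
vocabulary = np.unique(np.concatenate(batches))
print(vocabulary)
```

`np.unique` returns the distinct values in sorted order, so the same data always yields the same vocabulary regardless of batch order.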

# Resources
        - TensorFlow YouTube videos (https://www.youtube.com/watch?v=BthUPVwA59s&list=PLQY2H8rRoyvy2MiyUBz5RWZr5MPFkV3qz)
        - How to Design and Build a Recommendation System Pipeline in Python (Jill Cates) (https://www.youtube.com/watch?v=v_mONWiFv0k)
        - Building a Recommendation System in Python (https://www.youtube.com/watch?v=G4MBc40rQ2k)