# Recommender Systems
- GOALS:
- - Movies most likely to be rated 5 stars by the user.
- - Products most likely to be purchased.
- - Ads most likely to be clicked on, combined with a higher bid.
- - Products generating the largest profit for the company (e.g. cheaply produced ones).
- - Videos leading to maximum watch time.

## 1. Collaborative Filtering

### PER-ITEM FEATURES: When we have the following information:
- i -> Movie(i)
- j -> Person(j)
- x(i) -> Feature vector for Movie(i) -> Romance and Action values of Movie(i)
- w(j), b(j) -> Parameters to be learned for user(j).
- y(i,j) -> Rating evaluation given by user(j) on Movie(i).

![Alt text](../../figures/ReccSystems.png)

#### COST FUNCTION:
##### Where:
- - m(j) -> Number of movies rated by User(j).
- - r(i,j) = 1 if User(j) has rated Movie(i), else 0.
- - We want to learn w(j) and b(j).
- - lambda -> Regularization parameter, added to the cost to keep w(j) small.

![Alt text2](../../figures/cf.png)
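
As a minimal sketch of the cost described above, the squared error over rated pairs plus regularization can be written with NumPy. The figure is not reproduced here, so the 1/2 scaling and summing over all users without dividing by m(j) are assumptions. The names `cofi_cost`, `X`, `W`, `b`, `Y`, `R` and `lam` are illustrative.

```python
import numpy as np

def cofi_cost(X, W, b, Y, R, lam):
    """Regularized squared-error cost over rated (movie, user) pairs.

    X: (num_movies, num_features) movie features x(i)
    W: (num_users, num_features) user parameters w(j)
    b: (num_users,) user biases b(j)
    Y: (num_movies, num_users) ratings y(i,j)
    R: (num_movies, num_users) 1 where user j rated movie i, else 0
    """
    predictions = X @ W.T + b          # w(j) . x(i) + b(j) for every pair
    errors = (predictions - Y) * R     # unrated pairs contribute nothing
    return 0.5 * np.sum(errors ** 2) + 0.5 * lam * np.sum(W ** 2)

# Two movies with [romance, action] features, two users (made-up values).
X = np.array([[0.9, 0.0], [0.0, 1.0]])
W = np.array([[5.0, 0.0], [0.0, 5.0]])
b = np.zeros(2)
Y = np.array([[5.0, 0.0], [0.0, 4.0]])
R = np.array([[1, 1], [0, 1]])

cost = cofi_cost(X, W, b, Y, R, lam=0.0)
```

Multiplying by `R` keeps the unrated "?" entries out of the sum, which matches the role of r(i,j) in the definition.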

#### If we added a new column "Eve" with all "?":
- Since she has not rated any movie yet, no squared-error term involves her parameters, so minimizing the regularized cost drives w(j=5) to [0 0]. She has no effect on the existing ratings.
- Assuming b(j=5) is also 0 (for example, its initial value), the predicted rating of every movie for this user would be ZERO.

To solve this:
- - Mean Normalization:

![Alt text02](../../figures/mn.png)

- Take all of these ratings and, for each movie, compute the average rating it was given.
- Subtract this average from the movie's ratings to normalize them. The average is a value that we store per movie.
- At prediction time, this value is added back to the model output, which avoids negative ratings
- and gives a more sensible prediction for new users:
- - We can't say that the user will not like the movie (y = 0).
- - But we can say that the average rating of this movie is y = average(i), and that a new user's rating is probably close to that average.
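
A small NumPy sketch of mean normalization, assuming the ratings are stored as a movies-by-users matrix `Y` with an indicator matrix `R` (the helper name `normalize_ratings` and the example values are made up for illustration):

```python
import numpy as np

def normalize_ratings(Y, R):
    """Subtract each movie's mean rating, computed over rated entries only."""
    counts = np.maximum(R.sum(axis=1, keepdims=True), 1)   # avoid division by zero
    Ymean = (Y * R).sum(axis=1, keepdims=True) / counts
    Ynorm = (Y - Ymean) * R                                 # unrated entries stay at 0
    return Ynorm, Ymean

# Rows are movies; columns are three existing users plus a new user (last column).
Y = np.array([[5.0, 5.0, 0.0, 0.0],
              [4.0, 0.0, 2.0, 0.0]])
R = np.array([[1, 1, 1, 0],
              [1, 0, 1, 0]])

Ynorm, Ymean = normalize_ratings(Y, R)

# With w = [0, 0] and b = 0 the new user's normalized prediction is 0,
# so after adding the mean back the prediction is each movie's average rating.
new_user_prediction = 0.0 + Ymean[:, 0]
```

The model is trained on `Ynorm`, and `Ymean` is added back whenever a prediction is made.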

In [None]:
## IMPLEMENTATION

# - TensorFlow: lets us run gradient descent without manually deriving the gradients of the cost function.
import tensorflow as tf
import keras

In [None]:
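def forward_function(X, W, b):
    # Predicted (mean-normalized) ratings w(j) . x(i) + b(j), shape (num_movies, num_users).
    return tf.matmul(X, W, transpose_b=True) + b

def loss_function(X, W, b, Ynorm, R, lambda_regularization):
    # Squared error over rated pairs only (R == 1), plus L2 regularization on W and X.
    error = (forward_function(X, W, b) - Ynorm) * R
    regularization = tf.reduce_sum(W ** 2) + tf.reduce_sum(X ** 2)
    return 0.5 * tf.reduce_sum(error ** 2) + 0.5 * lambda_regularization * regularization


In [None]:
### Let's learn the parameters, including w(j) for j=5.
# Assumes X, W and b are tf.Variable objects (b of shape (1, num_users)) and that
# Ynorm, R (float tensors) and lambda_regularization are already defined.

optimizer = keras.optimizers.Adam(learning_rate=1e-1)

iterations = 200

# AUTOMATIC DIFFERENTIATION
for iteration in range(iterations):

    # TensorFlow GradientTape records the operations needed for automatic differentiation.
    with tf.GradientTape() as tape:
        # Cost J(X, W, b), computed inside the tape so it can be differentiated.
        loss_value = loss_function(X, W, b, Ynorm, R, lambda_regularization)

    # Gradients of the loss w.r.t. X, W and b.
    gradients_loss = tape.gradient(loss_value, [X, W, b])

    # Update X, W and b to reduce the loss.
    # Adam takes the place of the plain rule w = w - alpha * dJ/dw.
    optimizer.apply_gradients(zip(gradients_loss, [X, W, b]))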



In [None]:
## DATA ACQUISITION

### Binary Label: Liked the Movie? Finished the movie?

#### Predict the probability of y(i,j) = 1, given by:
- - g(w(j) · x(i) \+ b(j))

Where g(z) is the sigmoid function:
- - g(z) = 1 / ( 1 \+ e^(-z) )

![Alt text3](../b.png)

#### COST FUNCTION:
- - Binary Cross Entropy.

![Alt text1](../bb.png)

The sum runs over all pairs (i, j) with r(i, j) = 1, i.e. the binary cross-entropy is computed over all user-movie pairs that have a target value (1 or 0).
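
As a rough sketch of this binary-label cost (regularization is left out for brevity, and the matrix layout and names such as `binary_cofi_cost` are assumptions, not taken from the figures):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cofi_cost(X, W, b, Y, R):
    """Binary cross-entropy summed over labeled pairs (r(i,j) = 1)."""
    f = sigmoid(X @ W.T + b)                       # predicted P(y(i,j) = 1)
    losses = -Y * np.log(f) - (1 - Y) * np.log(1 - f)
    return np.sum(losses * R)

X = np.array([[1.0, 0.0], [0.0, 1.0]])
W = np.zeros((2, 2))     # all-zero parameters give a probability of 0.5 everywhere
b = np.zeros(2)
Y = np.array([[1, 0], [0, 1]])
R = np.array([[1, 1], [1, 0]])

# Three labeled pairs, each contributing -log(0.5).
cost = binary_cofi_cost(X, W, b, Y, R)
```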

## 2. Content-based filtering

In collaborative filtering, a number of users gave ratings to different items. In contrast, in content-based filtering we have features of the users and features of the items, and we want to find good matches between them. To do so, we compute a vector v_u for each user and a vector v_m for each item (movie), and then take the dot product between them to score the match.

#### Deep Learning Approach:
-  Given a vector X_u (containing user features such as age, gender, demographics...), we have to compute a new vector V_u:
- - Containing the user's preferences regarding movies. Ex: how much they like each genre:
- - + Avg Rating Comedy
- - + Avg Rating Action
- - + Avg Rating Romance
-
- At the same time, given a vector X_m (containing the year of the movie, duration, stars, etc.), we have to compute a vector V_m of the same length as V_u, such that:
- - It describes the movie content:
- - + How much is i a Comedy movie.
- - + How much is i an Action movie.
- - + How much is i a Romance movie.
-
- The combination (dot product) of both should be a good estimator of the rating that user j gives to movie i.

#### V_u : User network
-
- Input: X_u (age, gender, demographics...)
- Layers:
- - Input Layer Units -> len(X_u)
- - Middle Layer Units -> len(prev_layer) / 2
- - Output Layer Units -> len(V_u). For binary labels, a sigmoid is applied to the dot product V_u · V_m to predict the probability that y_ij = 1.
-
- Output: V_u -> Describes the user.

In [None]:
user_network = tf.keras.models.Sequential(
    [
    tf.keras.layers.Dense(256,
                          activation="relu"
    ),
    tf.keras.layers.Dense(128,
                          activation="relu"
    ),
    tf.keras.layers.Dense(32)
    ]
)

#### V_m : Movie network
-
- Input: X_m (year, actors, genre...)
- Layers:
- - Input Layer Units -> len(X_m)
- - Middle Layer Units -> len(prev_layer) / 2
- - Output Layer Units -> len(V_m). For binary labels, a sigmoid is applied to the dot product V_u · V_m to predict the probability that y_ij = 1.
-
- Output: V_m -> Describes the movie.

In [6]:
item_network = tf.keras.models.Sequential(
    [
    tf.keras.layers.Dense(256,
                          activation="relu"
    ),
    tf.keras.layers.Dense(128,
                          activation="relu"
    ),
    tf.keras.layers.Dense(32)
    ]
)

# Last layer has 32 units -> 32 numbers.

#### Predicted_Rating = V_u · V_m

#### Loss function J:
- Assuming that we have some data, with at least some users having rated some movies...
-
- Sum over all pairs (i, j) where we have labels [r(i,j) = 1]:
- - (V_u_j · V_m_i - y_ij)**2
-
- Then use gradient descent or another optimization algorithm to tune the parameters of both networks so that the loss is as small as possible.
- To regularize: we can add a neural-network regularization term to keep the parameter values small and reduce overfitting.
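
To make the loss concrete, here is a minimal NumPy sketch that evaluates it for vectors assumed to have already been produced by the two networks. The function name `content_based_loss`, the matrix layout and the values are illustrative, and the regularization term is omitted.

```python
import numpy as np

def content_based_loss(V_u, V_m, Y, R):
    """Sum of squared errors (V_u_j . V_m_i - y_ij)**2 over pairs with r(i,j) = 1.

    V_u: (num_users, d) user vectors, V_m: (num_movies, d) movie vectors.
    Y, R: (num_movies, num_users) ratings and rated-indicator matrices.
    """
    predictions = V_m @ V_u.T          # entry (i, j) is V_u_j . V_m_i
    return np.sum(((predictions - Y) * R) ** 2)

V_u = np.array([[1.0, 0.0], [0.0, 2.0]])
V_m = np.array([[4.0, 1.0], [0.0, 2.0]])
Y = np.array([[5.0, 0.0], [0.0, 4.0]])
R = np.array([[1, 0], [1, 1]])

loss = content_based_loss(V_u, V_m, Y, R)
```

In training, this quantity would be minimized with respect to the weights of both networks rather than the vectors directly.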

In [None]:
# 1. Take the user features extracted from the raw data and feed them to the user NN.
input_user = tf.keras.layers.Input(
    shape=(
        num_user_features,
    )
)
vector_user = user_network(input_user)

# NORMALIZE THE L2 NORM of the vector:
# To improve the performance of this approach, we can normalize the length of vector_user to be equal to 1.
vector_user = tf.linalg.l2_normalize(vector_user, axis=1)

# 2. Take the item features extracted from the raw data and feed them to the item NN.
input_item = tf.keras.layers.Input(
    shape=(
        num_item_features,
    )
)
vector_item = item_network(input_item)

# Likewise, normalize the length of vector_item to be equal to 1.
vector_item = tf.linalg.l2_normalize(vector_item, axis=1)

# Measure the similarity of the 2 vectors into a prediction:
# Output of the neural network:
prediction = tf.keras.layers.Dot(axes=1)([vector_user, vector_item])

# Specify the inputs and output of the model
model = tf.keras.Model([input_user, input_item], prediction)

cost_fn = tf.keras.losses.MeanSquaredError()

### How to use this information?
### Find items similar to a given item:

- V_u_j is a vector of length 32 that describes a user j with features X_u_j. Similarly,
- V_m_i is a vector of length 32 that describes a movie i with features X_m_i.
-
- Given a specific movie, we want to find movies that are similar to it:
- - V_m_i describes movie i. So, look for other movies k such that the squared distance ||V_m_k - V_m_i||**2 is small.
-
- This can be precomputed, so that when a user browses a movie, we have already found its top 10 most similar movies.
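
A minimal sketch of this similarity search with NumPy, assuming the movie vectors are stacked as the rows of a matrix `V_m` (the helper `most_similar_movies` and the 2-dimensional toy vectors are made up for illustration; real vectors would have length 32):

```python
import numpy as np

def most_similar_movies(V_m, i, k):
    """Indices of the k movies whose vectors are closest to movie i."""
    distances = np.sum((V_m - V_m[i]) ** 2, axis=1)   # ||V_m_k - V_m_i||**2 for all k
    distances[i] = np.inf                              # exclude the movie itself
    return np.argsort(distances)[:k]

V_m = np.array([[1.0, 0.0],
                [0.9, 0.1],
                [0.0, 1.0],
                [0.7, 0.3]])

similar = most_similar_movies(V_m, i=0, k=2)
```

Running this once per movie and storing the results gives the precomputed table of similar items.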

## Scale up to large data set

- When a new user signs in and watches a couple of movies, you would like to recommend them the next one.
- However, with a large catalog (e.g. 1M items), we cannot run the networks and compute the dot product for every item each time a user needs a recommendation.
- It would be computationally inefficient.
- SOLUTION?

### 2 Steps:
-
- 1. **Retrieval**:
- - Generate a large list of plausible item candidates. Retrieving more items tends to give better recommendations (since you can recommend from a broader set of items) but makes them slower.
- - + For each of the last 10 movies watched by the user, find the 10 most similar ones. [PRECOMPUTED]
- - + For the most viewed genres, find the top 10 movies.
- - + Top 20 movies in the country.
- - * Analyze the results:
- - - * Carry out offline experiments to see whether retrieving additional items results in more relevant recommendations (i.e. whether p(y_ij = 1) is higher for the items displayed to the user).
- - Combine the retrieved items into a list, removing duplicates and items already watched/purchased. This yields a list of 100s of candidate movies.
- - The goal is to ensure broad coverage.
- 
- 2. **Ranking Step**:
- - Take the list from the retrieval step and rank it using the learned model.
- - + For each user-movie vector pair, compute the predicted rating.
- - + Based on the rating, you can display the ranked items.
- - * To predict this rating, we don't need to run the movie network again: the V_m vectors can be precomputed and stored, so we only need V_u and its dot product with each stored V_m.
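
The two steps can be sketched as follows. This is a toy illustration under stated assumptions: `similar_items` stands in for a precomputed similarity table, `popular` for the genre and country top lists, and `V_m` and `v_u` for stored vectors. All names and values are hypothetical.

```python
import numpy as np

def retrieve(watched, similar_items, popular):
    """Merge candidate lists, dropping duplicates and already-watched items."""
    candidates = []
    for movie in watched:
        candidates.extend(similar_items[movie])   # precomputed neighbours
    candidates.extend(popular)
    seen = set(watched)
    unique = []
    for movie in candidates:
        if movie not in seen:
            seen.add(movie)
            unique.append(movie)
    return unique

def rank(v_u, V_m, candidates):
    """Order candidates by predicted rating v_u . V_m[i], highest first."""
    scores = V_m[candidates] @ v_u
    order = np.argsort(-scores)
    return [candidates[k] for k in order]

V_m = np.array([[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.5, 0.5], [0.1, 0.9]])
similar_items = {0: [1, 3], 1: [0, 3]}    # hypothetical precomputed table
v_u = np.array([0.2, 0.8])                # stored vector for this user

candidates = retrieve(watched=[0, 1], similar_items=similar_items, popular=[4, 2, 3])
ranked = rank(v_u, V_m, candidates)
```

Retrieval only does cheap lookups and set operations, so the dot products are computed for the short candidate list rather than for the whole catalog.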