# Recommender system
A recommender system provides recommendations based on user data

## Notations
$n_u$: number of users

$n_m$: number of items

$n$: number of features

$r(i,j)$: indicates whether user $j$ has rated the $i$th item (1 if rated, 0 if not)

$y(i,j)$: rating given by $j$th user to $i$th item

$\vec w^{(j)}, b^{(j)}$: weights and bias for $j$th user

$\vec x^{(i)}$: a vector containing the features of the $i$th item

$X$: a matrix of vectors $\vec x^{(i)}$

$W$: a matrix of vectors $\vec w^{(j)}$

$b$: a vector of biases $b^{(j)}$

$R$: a binary indicator matrix of elements $r(i,j)$

$Y$: an $n_m \times n_u$ matrix that stores the users' ratings, where each row represents an item and each column represents a user, so entry $(i,j)$ is $y(i,j)$


$$\mathbf{X} = 
\begin{bmatrix}
--- (\mathbf{x}^{(0)})^T --- \\
--- (\mathbf{x}^{(1)})^T --- \\
\vdots \\
--- (\mathbf{x}^{(n_m-1)})^T --- \\
\end{bmatrix} , \quad
\mathbf{W} = 
\begin{bmatrix}
--- (\mathbf{w}^{(0)})^T --- \\
--- (\mathbf{w}^{(1)})^T --- \\
\vdots \\
--- (\mathbf{w}^{(n_u-1)})^T --- \\
\end{bmatrix},\quad
\mathbf{ b} = 
\begin{bmatrix}
 b^{(0)}  \\
 b^{(1)} \\
\vdots \\
b^{(n_u-1)} \\
\end{bmatrix}\quad
$$ 

* The $i$th row of $\mathbf{X}$ corresponds to the feature vector, $\vec x^{(i)}$, for the $i$th item
* The $j$th row of $\mathbf{W}$ corresponds to the weight vector, $\vec {w}^{(j)}$, for the $j$th user
* $\vec x^{(i)}$ and $\vec{w}^{(j)}$ are $n$-dimensional vectors, where $n$ is the number of features


# Collaborative filtering
Collaborative filtering recommends items to you based on the ratings of users who gave ratings similar to yours

The predicted rating for the $j$th user on the $i$th item is
$$\text{Predicted rating} = \vec w^{(j)} \cdot \vec x^{(i)} + b^{(j)}$$
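
As a minimal sketch of the prediction formula, the snippet below computes the predicted rating for every item-user pair at once with NumPy. The toy sizes and values are assumptions made up for illustration; the shapes follow the notation above ($\mathbf{X}$ is $n_m \times n$, $\mathbf{W}$ is $n_u \times n$, $\mathbf{b}$ is stored as a $1 \times n_u$ row).

```python
import numpy as np

# Toy values (assumed for illustration): 3 items, 2 users, 2 features
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])      # (n_m, n): one row per item
W = np.array([[2.0, 0.0],
              [0.0, 3.0]])      # (n_u, n): one row per user
b = np.array([[0.5, -0.5]])     # (1, n_u): one bias per user

# Entry (i, j) is w^(j) . x^(i) + b^(j); b broadcasts across the rows
predictions = X @ W.T + b       # (n_m, n_u)

# The same formula for a single pair: item i = 2, user j = 1
single = np.dot(W[1], X[2]) + b[0, 1]
print(predictions)
print(single)
```

The matrix product gives the same numbers as looping over every $(i, j)$ pair, which is why the vectorized form is usually preferred.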

In the collaborative filtering algorithm, we want to learn both the features of the items ($\vec x^{(i)}$) and the user preferences (given by $\vec w^{(j)}$ and $b^{(j)}$) at the same time

## Cost function

The collaborative filtering cost function is given by
$$J({\mathbf{x}^{(0)},...,\mathbf{x}^{(n_m-1)},\mathbf{w}^{(0)},b^{(0)},...,\mathbf{w}^{(n_u-1)},b^{(n_u-1)}})= \left[ \frac{1}{2}\sum_{(i,j):r(i,j)=1}(\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)} - y^{(i,j)})^2 \right]
+ \underbrace{\left[
\frac{\lambda}{2}
\sum_{j=0}^{n_u-1}\sum_{k=0}^{n-1}(\mathbf{w}^{(j)}_k)^2
+ \frac{\lambda}{2}\sum_{i=0}^{n_m-1}\sum_{k=0}^{n-1}(\mathbf{x}_k^{(i)})^2
\right]}_{regularization}
$$

$$
= \left[ \frac{1}{2}\sum_{j=0}^{n_u-1} \sum_{i=0}^{n_m-1}r(i,j)*(\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)} - y^{(i,j)})^2 \right]
+\text{regularization}
$$

$(\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)} - y^{(i,j)})^2$: the squared error between predicted rating and actual rating

$\sum_{j=0}^{n_u-1} \sum_{i=0}^{n_m-1}r(i,j)*(\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)} - y^{(i,j)})^2$: the sum of squared errors. If the $j$th user gave the $i$th item a rating, $r(i,j) = 1$ and the error counts towards the total cost. If the $j$th user did not rate the $i$th item, $r(i,j) = 0$ and the term is ignored

$\sum_{j=0}^{n_u-1}\sum_{k=0}^{n-1}(\mathbf{w}^{(j)}_k)^2$: regularizes the weights of all users ($n_u$ users in total); each user has $n$ weights

$\sum_{i=0}^{n_m-1}\sum_{k=0}^{n-1}(\mathbf{x}_k^{(i)})^2$: regularizes the features of all items ($n_m$ items in total); each item has $n$ features

Note: $\mathbf{w}^{(j)}$ and $\mathbf{x}^{(i)}$ must have the same number of features

Minimizing this cost function gives the parameters that best fit the known ratings
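
To make the terms concrete, here is a small worked example of the cost in NumPy. The helper name `cofi_cost_np` and the toy numbers are assumptions for illustration only; the formula is the masked squared error plus the two regularization sums described above.

```python
import numpy as np

def cofi_cost_np(X, W, b, Y, R, lambda_):
    """Collaborative filtering cost (hypothetical NumPy helper for illustration)."""
    # Multiplying by R zeroes the error wherever r(i,j) = 0
    err = (X @ W.T + b - Y) * R
    squared_error = 0.5 * np.sum(err ** 2)
    regularization = 0.5 * lambda_ * (np.sum(W ** 2) + np.sum(X ** 2))
    return squared_error + regularization

# Toy values: 2 items, 2 users, 1 feature
X = np.array([[1.0], [2.0]])
W = np.array([[1.0], [0.5]])
b = np.array([[0.0, 1.0]])
Y = np.array([[3.0, 0.0],
              [0.0, 4.0]])
R = np.array([[1.0, 0.0],
              [0.0, 1.0]])      # only (0,0) and (1,1) are rated

# Predictions are [[1, 1.5], [2, 2]], so the two rated errors are -2 and -2
cost_no_reg = cofi_cost_np(X, W, b, Y, R, lambda_=0.0)   # 0.5 * (4 + 4) = 4.0
cost_reg = cofi_cost_np(X, W, b, Y, R, lambda_=1.0)      # 4.0 + 0.5 * (1.25 + 5) = 7.125
print(cost_no_reg, cost_reg)
```

The unrated entries of `Y` (stored as 0 here) never contribute, because the mask `R` removes them before squaring.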

## Gradient descent
Since the algorithm is learning both the features of the items ($\vec x^{(i)}$) and user preferences (depending on $\vec w^{(j)}$ and $b^{(j)}$) at the same time, each iteration of gradient descent should update $w$, $b$, and $x$ simultaneously

$$\begin{align*} \text{repeat}&\text{ until convergence:} \; \lbrace \newline\;
& w^{(j)}_{k} = w^{(j)}_{k} -  \alpha \frac{\partial J(w,b,x)}{\partial w^{(j)}_{k}} \  \; \newline
&b^{(j)}\ \ = b^{(j)} -  \alpha \frac{\partial J(w,b,x)}{\partial b^{(j)}}  \newline 
& x^{(i)}_{k} = x^{(i)}_{k} -  \alpha \frac{\partial J(w,b,x)}{\partial x^{(i)}_{k}} \  \; \newline
\rbrace
\end{align*}$$

$w^{(j)}_{k}$: the weight for the $k$th feature for the $j$th user

$x^{(i)}_{k}$: the $k$th feature of the $i$th item
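
The update rule can be sketched in NumPy by writing the partial derivatives of the cost out by hand. This is an illustrative sketch under stated assumptions: the helper names, the random toy data, and the values of `alpha` and `lambda_` are made up here, and the gradients are derived from the squared-error cost above (with $e_{ij} = r(i,j)(\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)} - y^{(i,j)})$, the derivatives are $\sum_i e_{ij} x^{(i)}_k + \lambda w^{(j)}_k$, $\sum_j e_{ij} w^{(j)}_k + \lambda x^{(i)}_k$ and $\sum_i e_{ij}$).

```python
import numpy as np

def cofi_cost(X, W, b, Y, R, lambda_):
    """Squared-error collaborative filtering cost with L2 regularization."""
    err = (X @ W.T + b - Y) * R
    return 0.5 * np.sum(err ** 2) + 0.5 * lambda_ * (np.sum(W ** 2) + np.sum(X ** 2))

def cofi_gradients(X, W, b, Y, R, lambda_):
    """Partial derivatives of the cost with respect to X, W and b."""
    err = (X @ W.T + b - Y) * R                # (n_m, n_u), zero where unrated
    grad_X = err @ W + lambda_ * X             # (n_m, n)
    grad_W = err.T @ X + lambda_ * W           # (n_u, n)
    grad_b = err.sum(axis=0, keepdims=True)    # (1, n_u)
    return grad_X, grad_W, grad_b

# Random toy problem (sizes and values are made up for illustration)
rng = np.random.default_rng(0)
n_m, n_u, n = 5, 4, 3
X = rng.normal(size=(n_m, n))
W = rng.normal(size=(n_u, n))
b = np.zeros((1, n_u))
Y = rng.integers(1, 6, size=(n_m, n_u)).astype(float)
R = (rng.random((n_m, n_u)) < 0.7).astype(float)

alpha, lambda_ = 0.01, 1.0
cost_before = cofi_cost(X, W, b, Y, R, lambda_)
for _ in range(100):
    # Compute all gradients from the current parameters, then update together
    grad_X, grad_W, grad_b = cofi_gradients(X, W, b, Y, R, lambda_)
    X = X - alpha * grad_X
    W = W - alpha * grad_W
    b = b - alpha * grad_b
cost_after = cofi_cost(X, W, b, Y, R, lambda_)
print(cost_before, cost_after)
```

Computing the three gradients before changing any parameter is what makes the update simultaneous; updating `W` first and then using the new `W` to compute the gradient for `X` would be a different algorithm.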

## Binary application
The binary application classifies whether a user likes an item or not (1 if yes, 0 if not)

Prediction function:
$$f_{w,b,x}(x) = g(\vec w^{(j)} \cdot \vec x^{(i)} + b^{(j)})$$

$g$: sigmoid function that outputs a value between 0 and 1

Loss function for a single prediction:
$$Loss = L(f_{w,b,x}(x), y^{(i,j)}) = (-y^{(i,j)}) \log\left(f_{w,b,x}\left(x\right) \right) - \left( 1 - y^{(i,j)}\right) \log \left( 1 - f_{w,b,x}\left(x\right) \right)$$

Cost function:
$$J(w,b,x) = \sum_{(i,j):r(i,j)=1}L(f_{w,b,x}(x), y^{(i,j)})$$

$\sum_{(i,j):r(i,j)=1}$: sum over every pair $(i,j)$ for which a rating is given ($r(i,j) = 1$)
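
A minimal NumPy sketch of the binary cost follows. The helper names and toy data are assumptions for illustration, and the clipping by `eps` is an added numerical safeguard against `log(0)` that is not part of the formula above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cofi_cost(X, W, b, Y, R, eps=1e-12):
    """Binary cross-entropy summed over rated pairs (hypothetical helper)."""
    f = sigmoid(X @ W.T + b)            # predicted probability for every pair
    f = np.clip(f, eps, 1.0 - eps)      # safeguard added here, avoids log(0)
    loss = -Y * np.log(f) - (1.0 - Y) * np.log(1.0 - f)
    return np.sum(loss * R)             # keep only pairs with r(i,j) = 1

# With all-zero parameters every prediction is g(0) = 0.5,
# so each rated pair contributes log(2) to the cost
X = np.zeros((2, 2))
W = np.zeros((2, 2))
b = np.zeros((1, 2))
Y = np.array([[1.0, 0.0],
              [0.0, 1.0]])
R = np.array([[1.0, 1.0],
              [0.0, 1.0]])              # three rated pairs

cost = binary_cofi_cost(X, W, b, Y, R)
print(cost, 3 * np.log(2))
```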


## Limitation
Collaborative filtering has two main limitations:
* Cold start problem: it struggles to predict ratings for a new item that has very few ratings, or the preferences of a new user who has rated very few items
* It cannot use side information about items and users, such as age, gender, or location

# Code

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow import keras

In [6]:
# Collaborative filtering cost
def cofi_cost_func(X, W, b, Y, R, lambda_):
    """
    Returns the cost for collaborative filtering
    Args:
      X (ndarray (num_movies,num_features)): matrix of item features
      W (ndarray (num_users,num_features)) : matrix of user parameters
      b (ndarray (1, num_users))           : vector of user parameters
      Y (ndarray (num_movies,num_users))   : matrix of user ratings of movies
      R (ndarray (num_movies,num_users))   : matrix, where R(i, j) = 1 if the i-th movie was rated by the j-th user
      lambda_ (float): regularization parameter
    Returns:
      J (float) : Cost
    """
    # Get number of items, users and features
    nm, nu = Y.shape
    n = X.shape[1]
    
    # Define variables
    J = 0
    regularization = 0
    
    # Calculate the squared-error part of the cost
    for i in range(nm):
        for j in range(nu):
            
            # If user j has rated item i
            if R[i, j] == 1:
                # Add the squared error
                J += (np.dot(W[j], X[i]) + b[0][j] - Y[i, j])**2
    J /= 2   
    
    # Calculate regularization for W
    for i in range(nu):
        for j in range(n):
            regularization += (W[i, j])**2
    
    # Calculate regularization for X
    for i in range(nm):
        for j in range(n):
            regularization += (X[i, j])**2
            
    J += lambda_ * regularization / 2
            
    return J

In [5]:
# Vectorized implementation for cost function
def cofi_cost_func_v(X, W, b, Y, R, lambda_):
    """
    Returns the cost for collaborative filtering
    Vectorized for speed. Uses tensorflow operations to be compatible with custom training loop.
    Args:
      X (ndarray (num_movies,num_features)): matrix of item features
      W (ndarray (num_users,num_features)) : matrix of user parameters
      b (ndarray (1, num_users))           : vector of user parameters
      Y (ndarray (num_movies,num_users))   : matrix of user ratings of movies
      R (ndarray (num_movies,num_users))   : matrix, where R(i, j) = 1 if the i-th movie was rated by the j-th user
      lambda_ (float): regularization parameter
    Returns:
      J (float) : Cost
    """
    j = (tf.linalg.matmul(X, tf.transpose(W)) + b - Y)*R
    J = 0.5 * tf.reduce_sum(j**2) + (lambda_/2) * (tf.reduce_sum(X**2) + tf.reduce_sum(W**2))
    return J

In [11]:
# Implementing collaborative filtering
num_movies, num_users = 4778, 443 #Y.shape
num_features = 100

# Set initial parameters (W, X, b); use tf.Variable so that TensorFlow tracks these variables
tf.random.set_seed(1234) # for consistent results
W = tf.Variable(tf.random.normal((num_users, num_features),dtype=tf.float64),  name='W')
X = tf.Variable(tf.random.normal((num_movies, num_features),dtype=tf.float64),  name='X')
b = tf.Variable(tf.random.normal((1, num_users),   dtype=tf.float64),  name='b')

# Apply Adam optimization
optimizer = keras.optimizers.Adam(learning_rate=1e-1)

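# Normalize the dataset
# (Y, R and normalizeRatings are not defined in these notes; load the ratings
#  and define the helper before uncommenting the next line)
# Ynorm, Ymean = normalizeRatings(Y, R)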

In [13]:
iterations = 200
lambda_ = 1
for iteration in range(iterations):  # avoid shadowing the built-in iter()
    # Use TensorFlow’s GradientTape to record the operations used to compute the cost 
    with tf.GradientTape() as tape:

        # Compute the cost (forward pass included in cost)
        cost_value = cofi_cost_func_v(X, W, b, Ynorm, R, lambda_)

    # Use the gradient tape to get the gradient with respect to X, W, b and organize it in a list
    grads = tape.gradient(cost_value, [X,W,b])

    # Run one step of gradient descent by updating the value of the variables to minimize the loss.
    optimizer.apply_gradients(zip(grads, [X,W,b]))


# Implementation

## Mean normalization
When a user has rated few or no items, mean normalization lets the algorithm fall back on other users' ratings of each item. This helps the algorithm make more reasonable predictions for that user

To apply mean normalization, calculate the average rating of each item (over the users who rated it) and subtract that average from every rating of the item. After training, add the item's average back to each prediction, so that an unknown rating defaults to roughly the average rating of that item
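
The code in these notes calls a `normalizeRatings(Y, R)` helper that is not shown. The sketch below is one plausible stand-in, not the original helper: the name `normalize_ratings`, the toy ratings, and the choice to use a mean of 0 for items with no ratings are all assumptions.

```python
import numpy as np

def normalize_ratings(Y, R):
    """Subtract each item's mean rating, computed over rated entries only.

    Hypothetical stand-in for the normalizeRatings helper referenced in the notes.
    """
    counts = R.sum(axis=1, keepdims=True)                 # ratings per item
    # np.maximum avoids dividing by zero for items nobody has rated
    Ymean = (Y * R).sum(axis=1, keepdims=True) / np.maximum(counts, 1)
    Ynorm = (Y - Ymean) * R                               # leave unrated entries at 0
    return Ynorm, Ymean

# Toy ratings: 3 items, 3 users; item 2 has no ratings at all
Y = np.array([[5.0, 4.0, 0.0],
              [0.0, 2.0, 4.0],
              [0.0, 0.0, 0.0]])
R = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 0.0]])

Ynorm, Ymean = normalize_ratings(Y, R)
print(Ymean.ravel())   # per-item means: 4.5, 3.0, 0.0
print(Ynorm)
```

After training on `Ynorm`, a prediction for item $i$ would be computed as the model output plus `Ymean[i]`.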

# Determine related items
The learned features in $x^{(i)}$ of item $i$ are hard to interpret individually. To find similar items, find an item $k$ whose $x^{(k)}$ is close to $x^{(i)}$, meaning the distance between the two vectors is small

$$\text{Distance} = \sum^{n-1}_{l=0}(x^{(k)}_{l}-x^{(i)}_{l})^2 = \|x^{(k)}-x^{(i)}\|^2$$

The smaller the distance, the more similar the two items are
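
As a small illustration, the squared distance from one item to every other item can be computed in one NumPy expression. The function name and the toy feature matrix are assumptions made for this sketch.

```python
import numpy as np

def most_similar_item(X, i):
    """Return the index of the item closest to item i, plus all squared distances."""
    dist = np.sum((X - X[i]) ** 2, axis=1)   # squared distance to every item
    dist[i] = np.inf                          # exclude the item itself
    return int(np.argmin(dist)), dist

# Toy item features: items 0 and 1 are nearly identical, item 2 is far away
X = np.array([[1.0, 0.0],
              [0.9, 0.1],
              [-1.0, 2.0]])

k, dist = most_similar_item(X, i=0)
print(k, dist)   # item 1 is closest to item 0
```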

# Content-based filtering
Content-based filtering recommends items by using features of the users and the items to find a good match

## Notations
$x^{(j)}_u$: a vector that contains the features about the $j$th user 

$x^{(i)}_m$: a vector that contains the features about the $i$th item

$v^{(j)}_u$: a vector that represents the preference of $j$th user

$v^{(i)}_m$: a vector that represents the features of $i$th item

* $v^{(j)}_u$ and $v^{(i)}_m$ are computed from $x^{(j)}_u$ and $x^{(i)}_m$ respectively
* $x^{(j)}_u$ and $x^{(i)}_m$ do not need to be the same size, but $v^{(j)}_u$ and $v^{(i)}_m$ must be the same size
* The predicted rating that the $j$th user gives to the $i$th item is
$$\text{Predicted rating} = v^{(j)}_u \cdot v^{(i)}_m$$

## Getting $v_u$ and $v_m$

To obtain $v_u$ and $v_m$, we can use two neural networks, one user network and one item network. The user network takes $x_u$ as its input layer, and the item network takes $x_m$ as its input layer. The two networks can have different architectures, but their output layers must have the same size. We can then take the dot product of $v_u$ and $v_m$ to make a prediction

## Cost function
The cost function for the final prediction is
$$J = \sum_{(i,j):r(i,j)=1}(v^{(j)}_u \cdot v^{(i)}_m - y^{(i,j)})^2 + \text{regularization}$$

## Retrieval and ranking
Retrieval and ranking allow us to provide recommendations efficiently from a large set of items

* Retrieval
    1. Generate a large list of plausible item candidates (items closest to the user's preferences, trending items)
    2. Combine the items into one list and remove duplicates

* Ranking
    1. Take the retrieved list and rank the items by how close they are to the user's preferences
    2. Display the items to the user in ranked order
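
The two steps can be sketched as follows. Everything here is an illustrative assumption: the function names, the candidate sources, and the precomputed item vectors `V_m` and user vector `v_u` stand in for the outputs of the item and user networks.

```python
import numpy as np

def retrieve(candidate_lists):
    """Merge several candidate lists into one, dropping duplicates but keeping order."""
    seen, merged = set(), []
    for candidates in candidate_lists:
        for item in candidates:
            if item not in seen:
                seen.add(item)
                merged.append(item)
    return merged

def rank(v_u, V_m, candidates):
    """Order the candidates by predicted rating v_u . v_m, highest first."""
    scores = V_m[candidates] @ v_u
    order = np.argsort(-scores)
    return [candidates[t] for t in order]

# Toy item vectors (one row per item) and one user vector
V_m = np.array([[0.1, 0.9],
                [0.9, 0.1],
                [0.5, 0.5],
                [1.0, 0.0],
                [0.0, 1.0]])
v_u = np.array([1.0, 0.0])

similar_to_watched = [1, 2]     # hypothetical candidate source
trending = [2, 4, 3]            # hypothetical candidate source

merged = retrieve([similar_to_watched, trending])
ranked = rank(v_u, V_m, merged)
print(merged)   # [1, 2, 4, 3]
print(ranked)   # [3, 1, 2, 4]
```

Only the retrieved candidates are scored, so the ranking step stays cheap even when the full catalogue is large.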

# Code

In [1]:
import numpy as np
import numpy.ma as ma
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split

In [4]:
# Construct neural network
num_outputs = 32
tf.random.set_seed(1)

# User network 
user_NN = tf.keras.models.Sequential([
    tf.keras.layers.Dense(units=256, activation='relu'),
    tf.keras.layers.Dense(units=128, activation='relu'),
    tf.keras.layers.Dense(units=num_outputs, activation='linear')

])

# Item network
item_NN = tf.keras.models.Sequential([
    tf.keras.layers.Dense(units=256, activation='relu'),
    tf.keras.layers.Dense(units=128, activation='relu'),
    tf.keras.layers.Dense(units=num_outputs, activation='linear')
  
])

# create the user input vector, point it to the user network, and normalize the output
# (num_user_features is the number of user feature columns; it must be defined before this cell)
input_user = tf.keras.layers.Input(shape=(num_user_features,))
vu = user_NN(input_user)
vu = tf.linalg.l2_normalize(vu, axis=1)

# create the item input vector, point it to the item network, and normalize the output
# (num_item_features is the number of item feature columns; it must be defined before this cell)
input_item = tf.keras.layers.Input(shape=(num_item_features,))
vm = item_NN(input_item)
vm = tf.linalg.l2_normalize(vm, axis=1)

# compute the dot product of the two vectors vu and vm
output = tf.keras.layers.Dot(axes=1)([vu, vm])

# specify the inputs and output of the model
model = tf.keras.Model([input_user, input_item], output)

model.summary()


In [None]:
# Compile the model
tf.random.set_seed(1)
cost_fn = tf.keras.losses.MeanSquaredError()
opt = keras.optimizers.Adam(learning_rate=0.01)
model.compile(optimizer=opt, loss=cost_fn)

In [None]:
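# Fit the model
# (user_train, item_train, y_train and the column offsets u_s, i_s come from
#  a data preparation step that is not shown in these notes)
tf.random.set_seed(1)
model.fit([user_train[:, u_s:], item_train[:, i_s:]], y_train, epochs=30)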