Lecture Notes for session conducted on August 10, 2022

https://www.scaler.com/academy/mentee-dashboard/class/33487/session

**Content**

1.   Recap of Matrix Factorization (MF) for Recommender Systems (RecSys).
2.   Non-Negative Matrix Factorization (NMF).
3.   Netflix-Prize Solution.  
4.   Code: RecSys.
5.   MF for feature Engineering:
    - Text.
    - Images.
    - Entities.
6.   Market-Basket Analysis (ARM, Apriori Algorithm).

### Recap: MF for RecSys


#### Core idea of MF in RecSys:

- Imagine we have a ratings matrix $A_{n*m}$ where $n \to users \ and \ m \to items$. We factorize it into the product of 2 matrices $B_{n*d} \ and \ C_{d*m}$.
    <img src='https://drive.google.com/uc?id=1MuyxTUaHJNNpDQH0hpuhgfDgi5MZKGRg'>
- We have $1....n \ users$, and $B_i$ = $U_i^T$ $\rightarrow$ d-dimensional row vector for user $U_i$ and $U_i \in \mathbb{R}^d$. Similarly, we have $1....m \ items$ where $C_j$ $\rightarrow$ d-dimensional column vector for item $I_j$ and $C_j \in \mathbb{R}^d$.

#### Optimization Problem:

- Here, we sum over all pairs of $i \ and \ j$ such that $A_{ij} \neq NULL$ (i.e. the pairs for which we have a rating). We minimize this objective by finding $B_i$ and $C_j$ using Stochastic Gradient Descent (SGD) or the Coordinate Descent algorithm.
    <img src='https://drive.google.com/uc?id=1vwF4hYOcYPtOpdSRIGLOKEmD1_Na-vsQ'>
- Note that the above equation does not contain the parameter '$d$'. It is treated as a hyperparameter and is determined using a train/test split on the $A_{ij}'s \neq NULL$, or using the elbow method discussed in the previous class.
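
The optimization described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the exact procedure from the lecture: the function name `factorize_sgd`, the toy matrix, the learning rate and the number of epochs are all assumptions, and missing ratings are marked with `np.nan`.

```python
import numpy as np

def factorize_sgd(A, d=2, lr=0.02, epochs=500, seed=0):
    """Factorize A (np.nan marks a missing rating) into B (n x d) and C (d x m)."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    B = rng.normal(scale=0.1, size=(n, d))
    C = rng.normal(scale=0.1, size=(d, m))
    # Only the observed (non-NULL) entries take part in the loss.
    observed = np.argwhere(~np.isnan(A))
    for _ in range(epochs):
        # Visit the observed (i, j) pairs in a fresh random order each epoch.
        for i, j in rng.permutation(observed):
            err = A[i, j] - B[i] @ C[:, j]
            b_i_old = B[i].copy()
            B[i] += lr * err * C[:, j]      # gradient step for the user vector
            C[:, j] += lr * err * b_i_old   # gradient step for the item vector
    return B, C

# Toy ratings matrix (hypothetical values): 4 users x 4 items.
A = np.array([[5, 4, np.nan, 1],
              [4, np.nan, 1, 1],
              [1, 1, 5, np.nan],
              [np.nan, 1, 4, 5]], dtype=float)

B, C = factorize_sgd(A)
A_hat = B @ C                      # the NULL cells are now filled in as well
mask = ~np.isnan(A)
rmse = np.sqrt(np.mean((A[mask] - A_hat[mask]) ** 2))
print("train RMSE:", round(float(rmse), 3))
```

Repeating this for several values of `d` and comparing the error on held-out ratings is one way to pick the hyperparameter.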

#### PCA and SVD as special case of MF:

- We have $X_{n*d}$ (standardized data) and its covariance matrix Cov(X) = $S_{d*d}$, which is computed (up to a scaling factor of $1/n$) as $X^T.X$, where $X^T$ is $d*n$ and $X$ is $n*d$.
    <img src='https://drive.google.com/uc?id=1bcI2y-mjPp-QyPOSVRKvq0c-S7zsACqI'>
- We compute Eigen values and Eigen vectors of S as $S_{d*d} = W_{d*d}$.$\Lambda_{d*d}$.$W_{d*d}^T$
- Here, we decomposed S into 3 matrices such that:
    - $\Lambda$ is a diagonal matrix of Eigen values.
    - $W$ contains the column vectors $V_1, V_2...V_d$, the Eigen vectors, which are orthogonal to each other.
- The eigen-decomposition used in PCA works on a square symmetric matrix (here $S$).

In SVD, we decompose the data $X_{n*d}$ = $U_{n*n}.\Sigma_{n*d}.V_{d*d}^T$ where:
- $\Sigma_{n*d}$ is a rectangular diagonal matrix that holds the singular values $S_1, S_2...S_d$, with $S_i = \sqrt{\lambda_i}$ where $\lambda_i$ is the $i^{th}$ eigen value of $X^T.X$.
- $U_{n*n}$ contains eigen vectors where $U_i = i^{th}$ eigen vector of $X.X^T$ (an $n*n$ matrix).
- $V_{d*d}^T$ contains eigen vectors where $V_i = i^{th}$ eigen vector of $X^T.X$ (a $d*d$ matrix with the same eigen vectors as $S_{d*d}$).
- SVD works on square as well as rectangular matrices.
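
The relationship between the two decompositions can be checked numerically. The snippet below is a small sanity check on random data (the matrix `X` is made up for illustration): the squared singular values of $X$ should match the eigen values of $X^T.X$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))        # n = 6 points, d = 3 features
X = X - X.mean(axis=0)             # mean-center the columns

# Eigen-decomposition of the d x d symmetric matrix X^T X.
eigvals, W = np.linalg.eigh(X.T @ X)
eigvals = eigvals[::-1]            # eigh returns ascending order; flip to descending

# SVD of the rectangular n x d data matrix itself.
U, s, Vt = np.linalg.svd(X, full_matrices=True)

print(U.shape, s.shape, Vt.shape)            # shapes of U, the singular values, V^T
print(np.allclose(s ** 2, eigvals))          # S_i^2 equals lambda_i
```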

#### K-Means clustering as special case of MF:


We have a dataset $X_{n*d}$ and we can write $X$ as approximately equal to the product of 2 matrices $Z_{n*k}$ and $C_{k*d}$ where:
- $Z_{n*k}$ is the cluster assignment matrix with some constraints.
- $C_{k*d}$ is the cluster centroid matrix.
<img src='https://drive.google.com/uc?id=12KWi4jwhjtXcOEhodeX4Y2LEtQFDHXgi'>

*Note:* $Z^T.C$ and $Z.C^T$ can be used interchangeably, depending on how $Z$ and $C$ are laid out, as long as the dimensions are represented consistently.
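
To make the K-Means view concrete, here is a small sketch with made-up points and hypothetical cluster labels (as if K-Means had already been run). Each row of $Z$ is one-hot, so $Z.C$ simply replaces every point by its centroid.

```python
import numpy as np

X = np.array([[1.0, 1.0],
              [1.2, 0.8],
              [5.0, 5.0],
              [4.8, 5.2]])
labels = np.array([0, 0, 1, 1])    # hypothetical cluster assignments
k = 2

Z = np.eye(k)[labels]              # n x k one-hot cluster assignment matrix
C = np.array([X[labels == c].mean(axis=0) for c in range(k)])  # k x d centroids

X_approx = Z @ C                   # every row is the centroid of its cluster
print(Z.sum(axis=1))               # each row of Z sums to 1
print(X_approx)
```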

***Question:*** Just as big natural numbers have multiple factors, can we similarly decompose a matrix $A$ into 4 or more matrices? Why do we always limit it to 2 or 3?

***Answer***: Yes, we can decompose it into more matrices as well. But in most applications related to Data Science / Machine Learning we do not decompose it into 4 or more matrices. This might be the case in some other fields of Applied Sciences.

### Non-negative Matrix Factorization (NMF):


#### Intuition:

- Suppose we have a matrix $A_{n*m}$ and we are trying to decompose it into the product of 2 matrices $B_{n*d}$ and $C_{d*m}$.
- What if we want to keep all the components $B_{ij}$ and $C_{ij}$ non-negative, i.e. $\ge 0$? Adding this constraint gives us Non-negative Matrix Factorization.
<img src='https://drive.google.com/uc?id=1e-0RbxXgq9XqefwAdcgcv47M1vwEQXjY'>
- In K-means, the cluster assignment matrix has following constraints:
    - $Z_{ij}$ = 1 or 0.
    - $\overset{k}{\underset{j=1}{\Sigma}} Z_{ij} = 1$
- If we have such non-negativity constraints on $B_{ij}$ or $C_{ij}$ or both, then we can think of NMF as a generalization of Clustering.
- In fact, MF $\supseteq$ NMF $\supseteq$ K-means(Hard/Soft).

#### Why do we need it?


- Often, we want to visualize data such as images. Images are stored either as color (RGB) images or as grayscale images where 0 $\to$ Black and 255 $\to$ White.
- In Machine Learning, we typically apply MinMax scaling to these images. (We could also use Standard scaling, i.e. mean centering and variance scaling.)
- If we carry out Matrix Factorization on image data and want to visualize the result, then we want the factors to be non-negative.
- NMF says: decompose matrix $A$ into the product of 2 matrices $B$ and $C$ such that $B_{ij} \ge 0 \ \forall_{i,j}$ and $C_{ij} \ge 0 \ \forall_{i,j}$. These constraints help us interpret the results better.
- Face images are a good example: Eigen-Faces obtained from PCA contain negative values, whereas NMF on the same images yields non-negative components that can be viewed directly as images.
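
One simple way to obtain such non-negative factors is the classic multiplicative-update rule. The sketch below is an illustrative NumPy implementation on a made-up matrix; the function name `nmf`, the iteration count and the small constant `eps` are assumptions, and library implementations differ in their details.

```python
import numpy as np

def nmf(A, d, iters=500, eps=1e-9, seed=0):
    """Approximate a non-negative A (n x m) by B (n x d) @ C (d x m), with B, C >= 0."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    B = rng.random((n, d)) + eps
    C = rng.random((d, m)) + eps
    for _ in range(iters):
        # Multiplicative updates: factors start positive and are only ever
        # multiplied by non-negative ratios, so they can never turn negative.
        C *= (B.T @ A) / (B.T @ B @ C + eps)
        B *= (A @ C.T) / (B @ C @ C.T + eps)
    return B, C

# Made-up non-negative matrix with two obvious "parts" (blocks).
A = np.array([[1, 2, 0, 0],
              [2, 4, 0, 0],
              [0, 0, 3, 1],
              [0, 0, 6, 2]], dtype=float)

B, C = nmf(A, d=2)
error = np.linalg.norm(A - B @ C)
print("min of B:", B.min(), "min of C:", C.min())
print("reconstruction error:", round(float(error), 4))
```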


#### Is NMF always necessary?

- We want non-negative factors for interpretability. But it's not always necessary.
- For e.g. if we want to build a simple RecSys, we don't care what the $B_i$ and $C_j$ are as long as the $A_{ij}$ are approximated very well.
- But there are some cases, like images or a clustering setup, where NMF is helpful.
<img src='https://drive.google.com/uc?id=1OcFfN_QOsvGMfgShBxHGpM7KUj9PeZqP'>


### RecSys Interview Scenario 1:

- Let's say we have a rating matrix $A_{n*m}$ where $n \to users$ and $m \to items$ in a product-based company like YouTube.
<img src='https://drive.google.com/uc?id=1Ea7QW-R14o1E58SdK-0ZSqApiH2iP7ja'>
- We have 'Likes' matrix: $L_{n*m}$, 'Watched' matrix: $W_{n*m}$ and 'Disliked'  matrix: $D_{n*m}$.
- Given this data, how to build a Recommender System, where we want to recommend user $U_i$ $\to$ some videos $V_j$.

**Option 1:**
- Give a higher positive weightage to the 'Likes' matrix, a lower positive weightage to the 'Watched' matrix and a negative weightage to the 'Disliked' matrix, and build a new matrix $A_{n*m}$.
- This $A_{n*m}$ will be decomposed into $U_i$ and $V_j$.
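
As a quick illustration of Option 1, the three signals can be blended into one matrix before factorization. The matrices and the weights below are hypothetical; in practice the weights would be tuned.

```python
import numpy as np

# Tiny made-up interaction matrices: 2 users x 3 videos.
L = np.array([[1, 0, 0], [0, 1, 0]])   # likes
W = np.array([[1, 1, 0], [0, 1, 1]])   # watched
D = np.array([[0, 0, 1], [0, 0, 0]])   # dislikes

# Hypothetical weights: likes count most, watches a little, dislikes count against.
w_like, w_watch, w_dislike = 2.0, 1.0, -2.0

A = w_like * L + w_watch * W + w_dislike * D
print(A)
```

The resulting $A$ is then factorized like any other ratings matrix.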

**Option 2:**
- Independent factorization, wherein we decompose:
    - $L_{n*m}$ matrix into $U_i^L$ and $V_j^L$.
    - $W_{n*m}$ matrix into $U_i^W$ and $V_j^W$.
    - $D_{n*m}$ matrix into $U_i^D$ and $V_j^D$.
<img src='https://drive.google.com/uc?id=10pUhAc7MZ5txcMW9q74m2MU-oYVuV1UA'>

In both the above options, we can use cosine similarity to find which videos have a propensity to be liked, prefer suggesting such videos, and avoid videos that have a propensity to be disliked.

- Suppose we recommend video $V_{10}$ to user $U_{100}$ using a matrix-completion-like strategy. $V_{10}$ will be very similar to other videos liked by user $U_{100}$.
- If $L_{100,10}$ is empty, then we estimate it using $U_{100}^T.V_{10}$.
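
Both quantities mentioned above are one-liners once the vectors are available. The vectors in this sketch are made up for illustration (d = 3); real ones would come from factorizing the 'Likes' matrix.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical d = 3 vectors.
u_100 = np.array([0.9, 0.1, 0.4])      # user U_100
v_10 = np.array([0.8, 0.2, 0.5])       # candidate video V_10
v_liked = np.array([0.7, 0.1, 0.6])    # a video U_100 already liked

predicted_like = float(u_100 @ v_10)   # fills the empty cell L[100, 10]
sim = cosine_similarity(v_10, v_liked) # how close V_10 is to the liked video

print(predicted_like, sim)
```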






***Question:*** Can we also make clusters of similar users and then find similar videos which have been liked by other users from the same cluster?

***Answer:*** Yes we can.
- We have $U_i^L \in \mathbb{R}^d$, a d-dimensional representation of users based on the liked videos. Similarly, we have $U_i^W \in \mathbb{R}^d$ and $U_i^D \in \mathbb{R}^d$.
- We can build a clustering on top of this and visualize it using t-SNE or UMAP.

***Question:*** Can we call this the probability of every user watching that video?

***Answer:*** No, that would be slightly tricky, as the entries of $L_{n*m}$ are 0 or 1.
- When we do MF, we try to find values that are close to 0 or 1, but the reconstructed values are not constrained to lie between 0 and 1. So, they need not be probabilities.


### Netflix-Prize Solution:

#### About Paper:



- There is a very nice paper on MF for RecSys, written around 2008-09 by Yehuda Koren. It summarizes the important ideas used in the prize-winning solution.
<img src='https://drive.google.com/uc?id=1cqfsBwvrEvKHAdEhdwKpjYXOEBABWDAD'>
- $u_i$ represents $user_i$ and $m_j$ represents $movie_j$. Given a pair $(u_i, m_j)$, predict the rating on a scale of 1 to 5.
- The metric used is RMSE.
- The internal algorithm at Netflix achieved a certain RMSE, and teams had to find a solution that improved on that RMSE by 10%.
- The final solution combined many models along with some bagging and boosting techniques.
- This solution was never productionized due to its complexity, but a lot of new research came out of this competition.
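
For reference, RMSE and the relative improvement over a baseline can be computed as below. The ratings and the two RMSE values are made-up numbers, used only to show the arithmetic.

```python
import numpy as np

def rmse(actual, predicted):
    """Root Mean Squared Error between two equally long rating lists."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def relative_improvement(baseline_rmse, new_rmse):
    """Fraction by which new_rmse improves on baseline_rmse (0.10 means 10%)."""
    return (baseline_rmse - new_rmse) / baseline_rmse

score = rmse([4, 3, 5, 1], [3, 3, 4, 2])     # squared errors: 1, 0, 1, 1
gain = relative_improvement(1.0, 0.9)        # hypothetical baseline vs new model

print(round(score, 4), round(gain, 2))
```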

https://datajobs.com/data-science-repo/Recommender-Systems-%5BNetflix%5D.pdf

*Note:*
- This paper discusses important techniques used in the award winning solution.
- The actual solution is discussed in below paper https://asset-pdf.scinapse.io/prod/54392637/54392637.pdf


#### Notations used in Paper:

- $D_{Tr}$ and $D_{Te}$: Training and Test dataset.
<img src='https://drive.google.com/uc?id=1qoYszZBfMP2_2c9WlJMaEmnbaJ2f2G0f'>
- Assume we have vector $p_u \to$ d-dimensional representation of each user and vector $q_i \to$ d-dimensional representation of $item_i$.
- We want to find all the $q_i$ and $p_u$, without any constraints, by solving the following optimization problem over the known ratings $r_{ui}$:

$$\underset{q_i, p_u}{\min} \  \underset{u,i}{\Sigma} ( r_{ui} - q_i^T.p_u)^2$$

#### Regularization:

- In any ML model, we want to regularize, or else the model will overfit.
- So the above equation now becomes:

$$\underset{q_i, p_u}{\min} \  \underset{u,i}{\Sigma} ( r_{ui} - q_i^T.p_u)^2 + \lambda(\underset{i}{\Sigma}||q_i||^2 + \underset{u}{\Sigma}||p_u||^2 )$$
- This $2^{nd}$ component is standard L2 regularization.
- In the paper, it is also represented as:
    
    <img src='https://drive.google.com/uc?id=1on4lzjj2cdAa9GVVbBjsWabuvXwscRml'>
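
The regularized objective can be evaluated directly. The sketch below uses hypothetical vectors and ratings, and the helper name `regularized_loss` is an assumption; it shows how the squared-error term and the penalty term add up.

```python
import numpy as np

def regularized_loss(ratings, P, Q, lam):
    """ratings: list of (u, i, r_ui); P: users x d matrix; Q: items x d matrix."""
    squared_error = sum((r - Q[i] @ P[u]) ** 2 for u, i, r in ratings)
    penalty = lam * (np.sum(Q ** 2) + np.sum(P ** 2))
    return float(squared_error + penalty)

# Hypothetical d = 2 vectors for 2 users and 2 items.
P = np.array([[1.0, 0.0], [0.0, 1.0]])
Q = np.array([[2.0, 0.0], [0.0, 3.0]])
ratings = [(0, 0, 2.0), (1, 1, 2.0)]   # (user, item, known rating)

# Squared error: (2 - 2)^2 + (2 - 3)^2 = 1; penalty: 0.1 * (13 + 2) = 1.5
loss = regularized_loss(ratings, P, Q, lam=0.1)
print(loss)
```

A larger $\lambda$ pushes the optimizer towards smaller vectors, trading training error for better generalization.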

***Question:*** Whatever the value of '$d$', we are making $q_i^T.p_u$ very close to $r_{ui}$, and this will be used to fill in the null values of $r_{ui}$. So why do we need to regularize?

***Answer:*** Let's say we have a very high '$d$'; then $q_i^T.p_u$ comes very close to $r_{ui}$ on the known ratings. But we will overfit on the training data and perform poorly on the test data. Hence, we need to regularize.

#### Next level of Optimization:

- There is a bias at the user level, where certain users tend to give high/low ratings to all the movies they watch (user bias: $b_u$).
- Similarly, popular movies like The Godfather or Schindler's List will almost always have very high ratings (item bias: $b_i$).
- Ratings given in the Netflix ecosystem will be different from those given at Amazon Prime or IMDb (global bias: $\mu$).
<img src='https://drive.google.com/uc?id=18eXJ1PtqwxlRsBZOWSg7XRB0GV6Sysk_'>

So, the optimization problem was changed to:

$$\underset{\underset{\mu, \ b_u, \ b_i}{q_i, p_u}}{\min} \  \underset{u,i}{\Sigma} ( r_{ui} - \mu - b_u - b_i - q_i^T.p_u)^2 + \lambda(\underset{i}{\Sigma}||q_i||^2 + \underset{u}{\Sigma}||p_u||^2 + \underset{u}{\Sigma}b_u^2 + \underset{i}{\Sigma}b_i^2)$$

***Note:*** Just like we determine $p_u$ and $q_i$ vectors, we will also find these scalars. Practically, these biases would be very close to respective average ratings. But we avoid keeping them as constants and make them as parameters of the optimization function.

If you think from a regularization perspective:
- $q_i^T.p_u$ is a quadratic term.
- $b_u$ and $b_i$ are linear terms.
- $\mu$ is a constant.

To summarize, $r_{ui}$ can be explained as combination of these biases and the interaction.
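
The biased model and a single SGD update on it can be sketched as follows. The numbers are made up, the function names are assumptions, and the constant factor 2 from the gradient is absorbed into the learning rate `lr`; $\mu$ is held fixed here for brevity.

```python
import numpy as np

def predict(mu, b_u, b_i, p_u, q_i):
    """r_hat = global bias + user bias + item bias + interaction."""
    return float(mu + b_u + b_i + q_i @ p_u)

def sgd_step(r_ui, mu, b_u, b_i, p_u, q_i, lr=0.01, lam=0.1):
    """One regularized gradient step on a single known rating r_ui."""
    err = r_ui - predict(mu, b_u, b_i, p_u, q_i)
    new_b_u = b_u + lr * (err - lam * b_u)
    new_b_i = b_i + lr * (err - lam * b_i)
    new_p_u = p_u + lr * (err * q_i - lam * p_u)
    new_q_i = q_i + lr * (err * p_u - lam * q_i)
    return new_b_u, new_b_i, new_p_u, new_q_i

mu = 3.5                                   # hypothetical global average
b_u, b_i = 0.5, -0.2                       # generous user, slightly unpopular movie
p_u = np.array([0.1, 0.3])
q_i = np.array([0.2, -0.1])
r_ui = 5.0                                 # the rating this user actually gave

before = predict(mu, b_u, b_i, p_u, q_i)   # 3.5 + 0.5 - 0.2 + (0.02 - 0.03)
b_u, b_i, p_u, q_i = sgd_step(r_ui, mu, b_u, b_i, p_u, q_i)
after = predict(mu, b_u, b_i, p_u, q_i)

print(round(before, 4), round(after, 4))   # the prediction moves towards r_ui
```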



***Question:*** Since bias is specific to movies, why should it be removed from all ratings?

***Answer:*** Let's say a movie is rated '2' on average by the users. Now if we want to predict the rating of this movie by a particular user, $q_i^T.p_u$ does not need to account for that low average, as the bias ($b_i$) will take care of it.

***Question:*** Why can't we compute $b_i, b_u \ and \ \mu$ directly from the given data, instead of determining them by solving the optimization problem?

***Answer:*** Practically, these biases are close to the average values from the dataset. But we determine them as parameters because of certain scenarios. Let's say a user gives an average rating of '4' to all movies, but gives a rating of '1' to one particular movie. We want $q_i^T.p_u$ to account for this particular user behaviour.


#### Implicit Feedback:

- Recommender systems can infer user preferences using implicit feedback, which indirectly reflects opinion by observing user behavior including purchase history, browsing history, search patterns or even mouse movements.
- Implicit feedback usually denotes the presence or absence of an event, so it is typically represented by a densely filled matrix.

#### Temporal Dynamics:

- People liking movies or disliking them is a behaviour over time. Simple models do not capture user's behaviour over time.
- Mathematically, we can treat the rating as a function of time: $\hat{r}_{ui}(t) = \mu + b_i(t) + b_u(t) + q_i^T.p_u(t)$ where:
    - $b_i(t)$: Item bias that changes over time. There would be movies that people re-discover and start rating highly.
    - $b_u(t)$: User bias that changes over time. For e.g. a user who earlier gave a rating of '3' to most movies may now give '4' to most movies.
    - $p_u(t)$: User's vector representation (taste), which is also a function of time. In the equation above the item vector $q_i$ is kept static, since the characteristics of a movie do not drift the way a user's taste does.

- So rather than modelling them as constant values across the 10 years of data we have, we model them as time-series components that change over time. The biases and the user vectors are no longer static.
- Now, we do not fit on the entire training data at once, but on chunks of data, and analyse how item behaviour, user behaviour and the user vectors $p_u$ change over time.
- Eco-system bias ($\mu$) stays constant over time, while the biases $b_i$, $b_u$ and the interaction term $q_i^T.p_u$ are time varying.
- This yields a substantial reduction in RMSE.
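
One simple way to make a bias time-dependent is to split the timeline into bins and learn a separate offset per bin. The sketch below assumes this binning scheme for illustration; the helper names, the number of bins and the offset values are all hypothetical and are not taken from the lecture.

```python
import numpy as np

def time_bin(t, t_min, t_max, n_bins):
    """Map a timestamp in [t_min, t_max] to a bin index in [0, n_bins - 1]."""
    frac = (t - t_min) / (t_max - t_min)
    return min(int(frac * n_bins), n_bins - 1)

def item_bias_at(t, b_i, b_i_bins, t_min, t_max):
    """b_i(t) = static item bias + offset of the time bin that t falls into."""
    return float(b_i + b_i_bins[time_bin(t, t_min, t_max, len(b_i_bins))])

b_i = 0.3                                  # hypothetical static part of the bias
b_i_bins = np.array([-0.2, 0.0, 0.4])      # hypothetical learned offset per bin

# Timestamps measured in days since the first rating, over a 300-day window.
early = item_bias_at(10, b_i, b_i_bins, t_min=0, t_max=300)
late = item_bias_at(290, b_i, b_i_bins, t_min=0, t_max=300)

print(round(early, 2), round(late, 2))     # the movie is rated higher later on
```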

***Note:*** Time-series model and forecasting will be discussed in future lectures.


#### MF model's accuracy comparison for above mentioned techniques:

- Temporal dynamics gives much better accuracy in comparison to plain MF, MF with biases and MF with implicit feedback.
    <img src='https://drive.google.com/uc?id=1aDIQdUvN8FhGBvkI3HxAaHfcHtS1s05P'>

*Source:* https://datajobs.com/data-science-repo/Recommender-Systems-%5BNetflix%5D.pdf


### RecSys Interview Scenario 2:

- Imagine you are currently working at YouTube. We have access to historical data of users. Now, how do we build a temporal (Time-Series) Recommender System?
<img src='https://drive.google.com/uc?id=1CsIhF8Wnidj21yrlplAc-SeVHw5fb-eT'>

**Option:**
- Update models i.e. $b_u, b_i, q_i, p_u$ periodically.
- Give more weightage to recent data. This can be done by adding $W_{ui}$ to the optimization function, where $W_{ui}$ is based on recency.

$$\underset{\underset{\mu, \ b_u, \ b_i}{q_i, p_u}}{\min} \  \underset{u,i}{\Sigma} \ W_{ui} \ ( r_{ui} - \mu - b_u - b_i - q_i^T.p_u)^2 + \lambda(\underset{i}{\Sigma}||q_i||^2 + \underset{u}{\Sigma}||p_u||^2 + \underset{u}{\Sigma}b_u^2 + \underset{i}{\Sigma}b_i^2)$$
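
The lecture only says that $W_{ui}$ is based on recency. One plausible choice, assumed here purely for illustration, is an exponential decay with a half-life: a rating that is one half-life old counts half as much as a fresh one.

```python
import numpy as np

def recency_weight(age_days, half_life_days=30.0):
    """Weight of a rating that is age_days old (assumed exponential decay)."""
    return 0.5 ** (np.asarray(age_days, dtype=float) / half_life_days)

def weighted_squared_error(r, r_hat, age_days):
    """Sum of W_ui * (r_ui - r_hat_ui)^2 over the given ratings."""
    w = recency_weight(age_days)
    residual = np.asarray(r, dtype=float) - np.asarray(r_hat, dtype=float)
    return float(np.sum(w * residual ** 2))

weights = recency_weight([0, 30, 60])          # fresh, 1 month old, 2 months old
loss = weighted_squared_error(r=[5, 3, 1], r_hat=[4, 4, 3], age_days=[0, 30, 60])

print(weights, loss)
```

The same error on an old rating therefore contributes less to the objective than on a recent one.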





The above idea works. But there is a much simpler solution.
- Pre-compute the model for $r_{ui}$ on a nightly basis using Matrix Factorization, which gives $q_i\to$ a d-dimensional representation of item $i$ and $b_i\approx$ the average rating of item $i$ (a measure of popularity).
    <img src='https://drive.google.com/uc?id=1CeSVZJffQ_YcHFC95KOAxaVtvVRVfbUS'>

- So, when a video $V_{100}$ is recommended to a user then next few video recommendations are based on:
    - Pure recency based similarity. A very fast KNN in d-dimensional space is built for this purpose.
    - Historical taste of users.
    - Top videos that are good to recommend.
    <img src='https://drive.google.com/uc?id=1-oKNvC-b6yTyhyk9biREr3I9ShisIktq'>
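
The nearest-neighbour lookup in the d-dimensional item space can be sketched with a brute-force cosine search. The item vectors below are made up; a production system would typically replace the brute-force scan with an approximate nearest-neighbour index to keep it fast.

```python
import numpy as np

def top_k_similar(item_vectors, query_index, k=2):
    """Indices of the k items most similar (by cosine) to the query item."""
    normed = item_vectors / np.linalg.norm(item_vectors, axis=1, keepdims=True)
    sims = normed @ normed[query_index]
    sims[query_index] = -np.inf            # never return the query item itself
    return np.argsort(-sims)[:k]

# Hypothetical d = 2 item vectors q_i for four videos.
Q = np.array([[1.0, 0.0],
              [0.9, 0.1],
              [0.0, 1.0],
              [0.7, 0.7]])

neighbours = top_k_similar(Q, query_index=0, k=2)
print(neighbours)                          # videos closest to video 0
```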

***Note:*** Nowadays, we have State-of-the-art RecSys built using Neural Collaborative Filtering (NCF) and advanced Deep Learning (DL) models.

### Code for Recommender Systems:

Documentation for surprise library:

https://surprise.readthedocs.io/en/stable/

In [None]:
!pip install surprise



In [None]:
#Source: https://surprise.readthedocs.io/en/stable/getting_started.html
from surprise import SVD
from surprise import Dataset
from surprise.model_selection import train_test_split
from surprise import accuracy

In [None]:
# Load the movielens-100k dataset (download it if needed),
data = Dataset.load_builtin('ml-100k')

Dataset ml-100k could not be found. Do you want to download it? [Y/n] y
Trying to download dataset from http://files.grouplens.org/datasets/movielens/ml-100k.zip...
Done! Dataset ml-100k has been saved to /root/.surprise_data/ml-100k


In [None]:
# sample random trainset and testset
# test set is made of 25% of the ratings.
trainset, testset = train_test_split(data, test_size=.25)

# We'll use the famous SVD algorithm.
algo = SVD()

# Train the algorithm on the trainset, and predict ratings for the testset
algo.fit(trainset)
predictions = algo.test(testset)

# Then compute RMSE
accuracy.rmse(predictions)

RMSE: 0.9476


0.9475870426780431

In [None]:
uid = str(196)  # raw user id (as in the ratings file). They are **strings**!
iid = str(302)  # raw item id (as in the ratings file). They are **strings**!

# get a prediction for specific users and items.
pred = algo.predict(uid, iid)
pred

Prediction(uid='196', iid='302', r_ui=None, est=4.309109605145767, details={'was_impossible': False})

***Question:*** Complexity increases as the data and the number of parameters grow. How do we handle the computational cost?

***Answer:***

- MovieLens-like datasets run quickly in Google Colab. Problems occur when we have tens of millions of parameters, a scale that Deep Learning tooling handles very well.
- SGD works even for a few million parameters on Colab or on local machines.
- For larger problems, Matrix Factorization is carried out on distributed systems like Spark for faster results.


### Matrix Factorization (MF) for Feature Engineering:

#### Entities:

- In Matrix Factorization, we cared the most about $q_i$ and $p_u$.
- From simple raw dataset $r_{ui}$, we solve the optimization problem and get the d-dimensional representations of items ($q_i$) and users ($p_u$).
    <img src='https://drive.google.com/uc?id=1fn8BlX-5ZFYZRQL8M2yBG1p0y1NEific'>
- So, in a way MF does feature engineering as it provides item vector and user vectors.
- These vectors are sensible as similar users will have similar vectors and similar items will have similar vectors.
- These vector representations can be used in Classification/ Regression/ Visualization/ Clustering.
- So MF is a feature engineering technique that uses a model.


***Example:*** E-commerce Setup

- Let's say a Recommender System was built using all historical data and it has generated user vectors $p_u$ and item vectors $q_i$.
- These vector features are stored in Feature-Stores as key-value pairs. This feature store can be queried on:
    - user-id $\to$ to get the d-dimensional representation of a user.
    - item-id $\to$ to get the d-dimensional representation of an item.
- Suppose a developer is working in the search team of the same E-commerce setup.
- Now, the end-customer searches for a key term in the shopping browser, let's say "Bluetooth".
- The developer can use the feature store to improve the search results by also returning items whose vectors are similar to those of the items matching this search term.
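
A feature store of this kind can be mimicked with a plain dictionary. Everything in the sketch below (the keys, the vectors and the helper `similar_items`) is hypothetical; it only illustrates the lookup-then-compare pattern.

```python
import numpy as np

# Hypothetical feature store: key -> d-dimensional vector learned by MF.
feature_store = {
    "item:bt_speaker": np.array([0.9, 0.1, 0.0]),
    "item:bt_headset": np.array([0.8, 0.2, 0.1]),
    "item:yoga_mat": np.array([0.0, 0.1, 0.9]),
}

def similar_items(store, key, k=1):
    """Keys of the k items whose vectors are closest (by cosine) to store[key]."""
    query = store[key]
    scores = {}
    for other, vec in store.items():
        if other != key:
            scores[other] = float(
                query @ vec / (np.linalg.norm(query) * np.linalg.norm(vec))
            )
    return sorted(scores, key=scores.get, reverse=True)[:k]

# The search term matched the speaker; expand the results with similar items.
result = similar_items(feature_store, "item:bt_speaker")
print(result)
```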

***Note:*** This technique of building feature stores using MF was used by small startups and big firms alike. Nowadays, NCF or Entity embeddings are used.



***Question:*** Are these features interpretable?

***Answer:*** No, these features have no interpretability.
- In manual feature engineering, we collected data like Country, Pincode... for the user. So it was interpretable.
- But in MF-based feature engineering, we get a d-dimensional representation of users. These algorithmically generated features can be visualized using t-SNE/UMAP.
<img src='https://drive.google.com/uc?id=14aL6HLVogV8W4cUW9sWUDIqzzqDms4SW'>

#### Text:


- We can get d-dimensional representations of words. Suppose we have a large Corpus of text e.g. Wikipedia.
- These d-dimensional representations should be sensible such that if we project these vectors in d-dimensional space, then similar words will appear together.
<img src='https://drive.google.com/uc?id=1iyalE-qkzdWdfVBywt6wMLfYVhRQ-bZn'>
- So how do we build them?

##### Option 1:

- To begin with, we can arrange the information into a matrix $A_{n*m}$ as below, where $n\to$ documents and $m\to$ words and $A_{ij}$ = Frequency/Count of occurrence of word $j$ in document $i$.
<img src='https://drive.google.com/uc?id=1EgRJCHUSuK2S9E7ORUZHNIoaGozGpkne'>
- This matrix could be decomposed into 2 matrices $B$ and $C$ where:
    - $B_i^T$ = $Doc_i^T$: A dense representation of the document.
    - $C_j$ = $W_j$: A dense representation of the word.
<img src='https://drive.google.com/uc?id=1-dveLiA1bq8D9rIxOeLaQdbleCXbHdvK'>

- This is the simplest form of Matrix Factorization.
- Here we get d-dimensional representations of documents as well as the words.
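
Option 1 can be sketched on a toy corpus. The four documents below are made up, and a truncated SVD is used as the factorization step purely for convenience; any MF method that returns a document factor and a word factor would serve.

```python
import numpy as np

# Made-up corpus: two documents about pets, two about stocks.
docs = ["cat sat mat", "cat ate fish", "stocks fell sharply", "stocks rose sharply"]

vocab = sorted({w for doc in docs for w in doc.split()})
index = {w: j for j, w in enumerate(vocab)}

# A[i, j] = number of times word j occurs in document i.
A = np.zeros((len(docs), len(vocab)))
for i, doc in enumerate(docs):
    for w in doc.split():
        A[i, index[w]] += 1

d = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
doc_vectors = U[:, :d] * s[:d]     # n x d dense document representations
word_vectors = Vt[:d].T            # m x d dense word representations

print(A.shape, doc_vectors.shape, word_vectors.shape)
```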

##### Option 2:

- Let's construct a word-word matrix $X_{m*m}$. Here, $X_{ij} = 1$ if $w_i$ and $w_j$ occur in vicinity of a context.
<img src='https://drive.google.com/uc?id=1xvY9dO0d9KmZ4uZ5guuBVNpUov-1WFEO'>
- Here we capture whether 2 words occur in the same context, for e.g. whether the 2 words are within a particular word distance of each other (Co-occurrence Matrix).
- We can do Matrix Factorization and decompose $X_{m*m}$ into $B_{m*d}$ and $C_{d*m}$ where:
    - $B_{i}^T \to$ representation of words.
    - $C_{j} \to$ representation of words.
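
The co-occurrence matrix of Option 2 can be built with a sliding window. The sentence and the window size below are made up, and SVD again stands in for the factorization step as an assumption.

```python
import numpy as np

tokens = "the cat sat on the mat".split()
vocab = sorted(set(tokens))
index = {w: j for j, w in enumerate(vocab)}
m = len(vocab)

window = 1                          # words at most 1 position apart share a context
X = np.zeros((m, m))
for pos, w in enumerate(tokens):
    lo, hi = max(0, pos - window), min(len(tokens), pos + window + 1)
    for ctx in range(lo, hi):
        if ctx != pos:
            X[index[w], index[tokens[ctx]]] = 1   # X_ij = 1 if w_i and w_j co-occur

d = 2
U, s, Vt = np.linalg.svd(X)
word_vectors = U[:, :d] * np.sqrt(s[:d])   # m x d word representations

print(vocab)
print(X)
```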


#### Images:


- 2-d image can be represented as single 1-d vector (Row major form).
<img src='https://drive.google.com/uc?id=1NXAD5v6ZNbqCLxJQpKUB38MC1FUewW9P'>
- Given these images, we can construct a data matrix: $X_{n*d}$.
<img src='https://drive.google.com/uc?id=1WfDMyGG5QWeat3WuCj6SFEIaErvBf9RO'>
- Here, $n\to$ number of images, $d = r*c$ where $r\to$ rows and $c\to$ columns, and the $i^{th}$ row in this matrix is $image_i$.
- We can decompose $X_{n*d}$, into $B_{n*m}$ and $C_{m*d}$.
<img src='https://drive.google.com/uc?id=1cAnahVwKygP-monnC1Za8cYdggKQ2X2g'>
- If we have a 1000*1000 pixel image, then $d=10^6$; but if $m=20$, then we get a 20-dimensional dense representation of each image.
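
The flatten-then-factorize pipeline looks like this on synthetic data. The random "images" and the sizes are assumptions chosen to keep the example small, and truncated SVD is used as one possible factorization (NMF would be the choice if non-negative factors are needed).

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, c, m = 10, 8, 8, 3
images = rng.random((n, r, c))      # hypothetical grayscale images, values in [0, 1)

# Row-major flattening: each r x c image becomes one row of length d = r * c.
X = images.reshape(n, r * c)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
B = U[:, :m] * s[:m]                # n x m: dense m-dimensional code per image
C = Vt[:m]                          # m x d: basis "images" in flattened form

restored = (B @ C).reshape(n, r, c) # approximate images rebuilt from the codes
print(X.shape, B.shape, C.shape, restored.shape)
```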


### Summary:

The beauty of Matrix Factorization and its variations, be it PCA, SVD, NMF or Clustering, is that it helps construct algorithmically generated features that are rich and dense.

### *Note*

The following topic will be covered in the next lecture:

- Market-Basket Analysis (ARM, Apriori Algorithm).