# Recommendation systems: using Singular Value Decomposition (SVD)
This is a brief introduction to recommender systems based on collaborative filtering with SVD, and to their use for web page recommendation.
(This is just a code snippet to give you an idea.)

## Data
The web pages of a financial institution's website are grouped into categories according to their characteristics (e.g. tags). For each category and each user, a KPI is estimated that summarizes the Average Session on Page (the average amount of time spent on a given page), the total number of sessions, and other SEO metrics that measure customer engagement.
All KPIs are normalized to the range [0,10].

- Inv = Website pages on investments, KPI in [0,10]
- Ins = Website pages on insurance products, KPI in [0,10]
- ReE = Website pages on mortgages, loans, residential financing, KPI in [0,10]
- Pay = Website pages on payments, credit cards, debit cards, KPI in [0,10]
- MeM = Website pages with generic information, KPI in [0,10]
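
The source does not say how the KPIs were brought into [0,10]. As a rough illustration only, the sketch below assumes plain min-max scaling; the function name `scale_to_0_10` and the raw values are hypothetical.

```python
import numpy as np

def scale_to_0_10(values):
    """Min-max scale a 1-D array of raw KPI values to the range [0, 10]."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:
        # All values identical: there is no spread to rescale, return zeros
        return np.zeros_like(values)
    return 10.0 * (values - lo) / (hi - lo)

# Hypothetical raw engagement values (e.g. average seconds on page)
raw_seconds = [0, 30, 60, 120]
scaled = scale_to_0_10(raw_seconds)
print(scaled)
```

The smallest raw value maps to 0 and the largest to 10; other normalization schemes would be equally compatible with the description above.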

## Goal
The website is dynamic: it displays different content each time a user views it, and what is displayed depends on a set of rules.
The idea is to use the information at your disposal to create a more tailored experience on the site. Based on a user's previous visits, or on the behavior of similar users, a dynamic website can offer similar or related recommendations in terms of web pages, contents, etc. This is a job for a recommendation system.

Let's import data and libraries, and let's have a look at our data.

In [1]:
import numpy as np
import pandas as pd

# Raw string (r'...') so the backslashes in the Windows path are not read as escape sequences
webpages = pd.read_excel(r'C:/Users/andre\Desktop\Magistrale\FINTECH\ZENTI\Progetto 2\WebPagesKPI.xlsx')

In [85]:
print(webpages)

               User ID  Inv  Ins  ReE  Pay  MeM
0     OARQ249814647825    5    1    0    0    0
1     VIZP702882221151    0    0    5    0    2
2     FQEO606412621089    6    0    0    0    0
3     TOPH566154866098    0    0    0    8    0
4     TOPH566154869076    6    2    0    0    0
...                ...  ...  ...  ...  ...  ...
2078  UFDM391092504710    7    0    0    0    0
2079  UFDM391092504710    1    0    0    0    5
2080  UXUK848361229470    0    0    9    0    0
2081  ATQU555050383451    2    0    0    1    0
2082  DDFU541993148635    1    0    1    0    2

[2083 rows x 6 columns]


In [86]:
webpages.head()

            User ID  Inv  Ins  ReE  Pay  MeM
0  OARQ249814647825    5    1    0    0    0
1  VIZP702882221151    0    0    5    0    2
2  FQEO606412621089    6    0    0    0    0
3  TOPH566154866098    0    0    0    8    0
4  TOPH566154869076    6    2    0    0    0


In [87]:
webpages.dtypes

User ID    object
Inv         int64
Ins         int64
ReE         int64
Pay         int64
MeM         int64
dtype: object

## Collaborative Filtering
There are many techniques for finding and recommending suitable items (item = web page, in this case): **collaborative filtering through SVD is just one of those techniques**, and a very popular one.

The assumption behind collaborative filtering is that people who liked an item (a product, a web page, whatever) in the past will also like it in the future. This approach therefore builds a model from the past behaviour of users, finding associations between users and items. The model is then used to predict the items a user may be interested in.

We can use SVD to discover relationships between items, and a recommender system can easily be built from this.
Let's see how.
Just as a number, say 30, can be decomposed into factors (30 = 2 x 5 x 3), a matrix can also be expressed as a product of other matrices. Because matrices are arrays of numbers, they have their own multiplication rules: SVD is a linear algebra technique that breaks a matrix down into the product of three matrices - see https://en.wikipedia.org/wiki/Singular_value_decomposition, and https://numpy.org/doc/stable/reference/generated/numpy.linalg.svd.html.

Briefly, SVD decomposes a matrix of dimension users x items (called the Utility Matrix) as follows:
                                   matrix = U*Sigma*V'

Basically, we are looking for the **latent variables, or latent factors**, that hide under the surface of the phenomenon we are analyzing (in this case, the use of web pages):
- U represents the relationship between users and latent factors
- Sigma describes the strength of each latent factor (NumPy returns its diagonal as the vector `s`)
- V describes the similarity between items and latent factors (NumPy returns its transpose V' as `vh`).
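
To make the factorization concrete, here is a minimal sketch on a tiny made-up utility matrix (4 users x 3 page categories, values invented for illustration). It only checks that the three factors returned by `np.linalg.svd` multiply back to the original matrix.

```python
import numpy as np

# Tiny made-up utility matrix: 4 users x 3 page categories
M = np.array([[5, 1, 0],
              [0, 0, 5],
              [6, 0, 0],
              [0, 2, 8]], dtype=float)

U, s, Vh = np.linalg.svd(M, full_matrices=False)

print(U.shape, s.shape, Vh.shape)   # (4, 3) (3,) (3, 3)

# NumPy returns the singular values as a 1-D vector, sorted in descending order
Sigma = np.diag(s)

# Multiplying the three factors back together recovers the original matrix
reconstructed = U @ Sigma @ Vh
print(np.allclose(reconstructed, M))  # True
```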

The latent factors here are **estimates** of the characteristics of the items: the SVD reduces the dimension of the utility matrix by extracting its latent factors. It maps each user and each item into an r-dimensional latent space. This mapping gives a clear representation of the relationships between users and items. We will then use cosine similarity to find the closest web pages and make a recommendation - see https://en.wikipedia.org/wiki/Cosine_similarity.
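
The mapping into an r-dimensional latent space can be sketched as follows. This is an illustration under assumptions: the matrix is random synthetic data standing in for the real utility matrix, and `r = 2` is an arbitrary choice, not a value taken from the analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the utility matrix: 50 users x 5 page categories
M = rng.integers(0, 11, size=(50, 5)).astype(float)

U, s, Vh = np.linalg.svd(M, full_matrices=False)

r = 2  # number of latent factors to keep (hypothetical choice)
user_factors = U[:, :r] * s[:r]   # each row: one user in the r-dimensional latent space
item_factors = Vh[:r, :].T        # each row: one page category in the same space

# Rank-r approximation of the utility matrix
M_r = user_factors @ item_factors.T

print(user_factors.shape, item_factors.shape)  # (50, 2) (5, 2)
# Share of the squared singular values captured by the r factors kept
print((s[:r] ** 2).sum() / (s ** 2).sum())
```

Keeping only the strongest factors is what actually reduces the dimension: the squared approximation error equals the sum of the squared singular values that were dropped.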

In [88]:
# drop the User ID and create the Utility Matrix
webpages.drop(columns=['User ID'],inplace=True)

from numpy.linalg import svd
matrix = webpages.values
u, s, vh = svd(matrix, full_matrices=False)

# little inspection
print(u.shape)
print(s.shape)
print(vh.shape)

(2083, 5)
(5,)
(5, 5)


NOTE 1: By default, `svd()` returns the full SVD; I used the reduced version (`full_matrices=False`), which gives smaller matrices and saves memory.
NOTE 2: If we normalize the scores by subtracting the average score, low scores become negative numbers and high scores become positive numbers. This is a strong signal, maybe too strong in this case (it works well for books and movies, but these are web pages).
HINT: try it and see what happens.
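
One way to try the normalization from NOTE 2 is sketched below, using the first four rows of the printout above. It assumes that "average" means each user's own average score (row mean); subtracting the per-category average instead would use `axis=0`.

```python
import numpy as np

# First four rows of the data shown above (columns: Inv, Ins, ReE, Pay, MeM)
M = np.array([[5, 1, 0, 0, 0],
              [0, 0, 5, 0, 2],
              [6, 0, 0, 0, 0],
              [0, 0, 0, 8, 0]], dtype=float)

# Subtract each user's own average score (axis=1 averages across a row)
centered = M - M.mean(axis=1, keepdims=True)

u_c, s_c, vh_c = np.linalg.svd(centered, full_matrices=False)

print(centered[0])
print(s_c.round(3))
```

Every zero in a row turns negative after centering, so a page category the user never visited is treated as actively disliked. That is the "strong signal" the note warns about.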

The columns of vh correspond to the web pages. We can use the vector space model to find which web pages are most similar to the one we are looking at, using cosine similarity.
In this example, I look for the web page that best matches the chosen column (`chosen_col = 1`, i.e. Ins), but - HINT - you can also get an ordered list.

In [93]:
def cosine_similarity(v, u):
    """Cosine of the angle between vectors v and u."""
    return (v @ u) / (np.linalg.norm(v) * np.linalg.norm(u))

# Pick a column = web page type (1 = Ins)
chosen_col = 1

highest_similarity = -np.inf
highest_sim_col = -1
for col in range(vh.shape[1]):
    if col == chosen_col:
        continue  # skip the chosen column itself (its similarity is always 1)
    similarity = cosine_similarity(vh[:, chosen_col], vh[:, col])
    if similarity > highest_similarity:
        highest_similarity = similarity
        highest_sim_col = col

print("Column %d (webpage id %s) is most similar to chosen column (webpage id %s)" %
      (highest_sim_col, webpages.columns[highest_sim_col], webpages.columns[chosen_col]))
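
The ordered list hinted at above could be sketched as follows. Two assumptions are made here: the data is a random synthetic stand-in for the real utility matrix, and only the first `r = 2` rows of `vh` are used. The truncation matters: when all five factors are kept, `vh` is a square orthogonal matrix, so its columns are mutually orthogonal and the cosine similarity between two different columns is zero up to rounding error. The helper `rank_similar_columns` is hypothetical, not part of any library.

```python
import numpy as np

def cosine_similarity(v, u):
    return (v @ u) / (np.linalg.norm(v) * np.linalg.norm(u))

def rank_similar_columns(vh, chosen_col, r):
    """Return (column, similarity) pairs for all other columns, most similar first.

    Only the first r rows of vh (the r strongest latent factors) are used.
    """
    items = vh[:r, :]
    scores = [(col, cosine_similarity(items[:, chosen_col], items[:, col]))
              for col in range(items.shape[1]) if col != chosen_col]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Synthetic stand-in for the utility matrix (random scores in [0, 10])
rng = np.random.default_rng(0)
columns = ['Inv', 'Ins', 'ReE', 'Pay', 'MeM']
demo = rng.integers(0, 11, size=(100, len(columns))).astype(float)
_, _, vh_demo = np.linalg.svd(demo, full_matrices=False)

ranking = rank_similar_columns(vh_demo, chosen_col=1, r=2)  # r=2 is a hypothetical choice
for col, sim in ranking:
    print(f"{columns[col]}: {sim:.3f}")
```

With the real data, `vh_demo` would be replaced by the `vh` computed earlier and `columns` by `webpages.columns`; a suitable `r` has to be chosen by inspecting the singular values in `s`.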
