# Data Mining / Prospecção de Dados

## André Falcão - 2022-2023

## Data Mining TP09 - A recommender system tutorial

## Summary

1. Item based recommendation
    1. Term Frequency - Inverse Document Frequency (TF-IDF)
2. Collaborative Filtering 
    1. User-User
    2. Item-Item
3. Comparison of methods in test cases


## 1. Item based recommendations

### 1.1 TF-IDF

**A Review** The idea of TF-IDF is very common in information retrieval and is based on two main ideas:
* Some terms/words are more relevant than others
* Similarity between documents can be defined by the more relevant terms they share

**Term Frequency**

$$\text{tf}(t,d)=\frac{f_{t,d}}{\sum_{t'\in d}{f_{t',d}}}$$

**Inverse Document Frequency**

$$\text{idf}(t,D)=\log{\frac{|D|}{|\{d \in D: t \in d\}|}}$$

**TF-IDF**

$$\text{tf-idf}(t,d,D)=\text{tf}(t,d) \cdot \text{idf}(t,D)$$
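
The three formulas above can be checked by hand on a tiny corpus. The sketch below is only illustrative: the four tokenized "titles" and the function names are assumptions made for this example. Scikit-learn's `TfidfVectorizer` uses a smoothed idf and normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Toy corpus (hypothetical): each document is a list of lowercase tokens
docs = [
    ["statistical", "learning"],
    ["machine", "learning"],
    ["statistical", "distributions"],
    ["deep", "learning"],
]

def tf(term, doc):
    # relative frequency of the term inside one document
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    # log of (number of documents / number of documents containing the term)
    # assumes the term occurs in at least one document
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

for term in docs[0]:
    print("%-12s tf=%.2f idf=%.4f tf-idf=%.4f"
          % (term, tf(term, docs[0]), idf(term, docs), tf_idf(term, docs[0], docs)))
```

In the first title, "statistical" gets a higher weight than "learning" because it appears in fewer documents of the corpus.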


Instead of using our custom-made approach (see Lab class 01), we can use the [TF-IDF vectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) already built into Scikit-learn, which allows extensive configuration. Among the most important parameters are:
* analyzer - defines the unit of analysis. If `'word'` (the default), features are built from word tokens; `'char'` builds them from characters
* stop_words - defines the stop words to remove during string processing. The default is `None`; passing `"english"` uses the built-in list of common English stop words
* strip_accents - whether or not to remove accents from words during preprocessing
* ngram_range (A,B) - ngrams are sequences of N words. By allowing a range, all possible ngrams from A to B are considered

#### Ngram range example:

An ngram range of (1,3) applied to the word string "the quick brown fox jumps" will generate all the 1, 2 and 3 word ngrams:
* 1-grams: the, quick, brown, fox, jumps
* 2-grams: (the, quick), (quick, brown), (brown, fox), (fox, jumps)
* 3-grams: (the, quick, brown), (quick, brown, fox), (brown, fox, jumps)
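
The enumeration above can be reproduced with a few lines of plain Python. This is a minimal sketch that splits on whitespace only; the function name `word_ngrams` is an assumption of this example and the real vectorizer also applies its own tokenization and stop word removal.

```python
def word_ngrams(text, ngram_range=(1, 3)):
    """Return all word n-grams of `text` for n in the inclusive range."""
    tokens = text.lower().split()
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        # slide a window of n tokens over the token list
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

grams = word_ngrams("the quick brown fox jumps", (1, 3))
print(len(grams))   # 5 one-word + 4 two-word + 3 three-word ngrams
print(grams)
```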


The data set in this first example and several ideas came from [this article from Venkat Raman](https://towardsdatascience.com/recommender-engine-under-the-hood-7869d5eab072)

First let's import the required libraries for this first exercise

In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity

Now let's open Venkat Raman's dataset of 50 book titles on Machine Learning and store it in a list

In [2]:
books=open("books.txt", "rt").readlines()
books=[L.strip() for L in books]

books[:10]

['Probabilistic Graphical Models',
 'Bayesian Data Analysis',
 'Doing data science',
 'Pattern Recognition and Machine Learning',
 'The Elements of Statistical Learning',
 'An introduction to Statistical Learning',
 'Python Machine Learning',
 'Natural Langauage Processing with Python',
 'Statistical Distributions',
 'Monte Carlo Statistical Methods']

We first vectorize all the words in the book titles. We will treat each title as a sequence of English words and, to make the analysis richer, we will consider all the word ngrams with 1 to 4 words (`ngram_range=(1, 4)`).

This will generate a matrix of 50 rows (as many as books) and 404 columns (the number of unique ngrams)

We can inspect the scores of each individual book. Let's look at book #39

In [3]:
# min_df=1 keeps every ngram that appears in at least one title
# (recent scikit-learn releases reject the integer 0)
tfidf_vec = TfidfVectorizer(analyzer='word', ngram_range=(1, 4), min_df=1, stop_words='english')
tfidf_matrix = tfidf_vec.fit_transform(books)

book=39
# dense copy of this book's row, then feature indices sorted by decreasing score
elems_0 = tfidf_matrix[book].toarray().ravel()
idxs=elems_0.argsort()[::-1]
print(books[book])
for idx in idxs:
    if elems_0[idx]>0:
        print("\t Element: %4d - TFIDF-Score: %7.4f"% (idx,elems_0[idx] ))


The Art of Data Science
	 Element:   22 - TFIDF-Score:  0.5096
	 Element:   23 - TFIDF-Score:  0.5096
	 Element:   21 - TFIDF-Score:  0.4608
	 Element:  101 - TFIDF-Score:  0.3288
	 Element:  385 - TFIDF-Score:  0.3046
	 Element:   83 - TFIDF-Score:  0.2596


Now that each title has been vectorized, we can compute the similarities using the cosine similarity. Scikit-learn has a `cosine_similarity` function; however, it is not necessary to use it here because the TF-IDF vectorizer has already normalized each row to unit length, so it is sufficient (and faster!) to use the [linear kernel](https://scikit-learn.org/stable/modules/metrics.html#linear-kernel) (a simple dot product)

This will produce an $N \times N$ matrix with the similarities between items
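
The equivalence between the dot product and the cosine similarity for unit-length rows can be checked numerically. The sketch below uses only NumPy and a small random matrix (a hypothetical stand-in for the TF-IDF matrix), so it illustrates the argument rather than the scikit-learn implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((4, 6))                                  # hypothetical feature rows

# L2-normalize each row, as the TF-IDF vectorizer does by default
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
dot_sims = X_unit @ X_unit.T                            # plain dot products

# cosine similarity computed from the un-normalized rows
norms = np.linalg.norm(X, axis=1)
cos_sims = (X @ X.T) / np.outer(norms, norms)

print(np.allclose(dot_sims, cos_sims))
```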

In [11]:

sims = linear_kernel(tfidf_matrix, tfidf_matrix)
#This is equivalent to 
#sims = cosine_similarity(tfidf_matrix, tfidf_matrix)
sims.shape

(50, 50)

In [13]:
cosine_similarity(tfidf_matrix, tfidf_matrix).shape

(50, 50)

In [14]:
sims[0]

array([1.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.09341262,
       0.        , 0.        , 0.        , 0.        , 0.        ])

In [15]:
cosine_similarity(tfidf_matrix, tfidf_matrix)[0]

array([1.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.09341262,
       0.        , 0.        , 0.        , 0.        , 0.        ])

Now that the similarities between items have been defined, we can pre-compute the K most similar items for each item

In [36]:
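def precompute_similars(DB, sims, K=5):
    similars=[]
    for idi, item in enumerate(DB):
        # indices of the K+1 largest similarities, in decreasing order
        sim_item_idxs = sims[idi].argsort()[:-(K+2):-1]
        sim_items = [(sims[idi][i], i) for i in sim_item_idxs]
        # drop the first entry, assumed to be the item itself (similarity 1)
        similars.append(sim_items[1:])
    return similars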

similars=precompute_similars(books, sims)

In [37]:
similars[0]

[(0.09341262453131684, 44), (0.0, 11), (0.0, 21), (0.0, 20), (0.0, 19)]

In [51]:
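def precompute_similars(DB, sims, K=5):
    similars=[]
    for idi, item in enumerate(DB):
        # alternative slicing: take the K entries just before the last one
        # (assumed to be the item itself), then reverse them
        sim_item_idxs = sims[idi].argsort()[-(K+1):-1]
        sim_items = [(sims[idi][i], i) for i in sim_item_idxs]
        similars.append(sim_items[::-1])
    return similars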

similars=precompute_similars(books, sims)
similars[0]

[(0.09341262453131684, 44), (0.0, 11), (0.0, 21), (0.0, 20), (0.0, 19)]
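
Both versions of `precompute_similars` assume that the item itself is the last entry returned by `argsort`. That holds when the self-similarity of 1 is the unique maximum of its row, but may fail when another item ties with it (for instance, two identical titles). A minimal sketch of one alternative, which masks the diagonal explicitly, is shown below; the function name and the toy matrix are assumptions of this example.

```python
import numpy as np

def top_k_similar(sims, k=5):
    """For each row of a square similarity matrix, return the k most
    similar other items as (score, index) pairs, best first."""
    sims = np.asarray(sims, dtype=float)
    result = []
    for i, row in enumerate(sims):
        row = row.copy()
        row[i] = -np.inf                      # never return the item itself
        order = np.argsort(row)[::-1][:k]     # best candidates first
        result.append([(float(row[j]), int(j)) for j in order])
    return result

# Hypothetical similarities: items 1 and 2 are identical (similarity 1.0)
toy = np.array([[1.0, 0.2, 0.9],
                [0.2, 1.0, 1.0],
                [0.9, 1.0, 1.0]])
res = top_k_similar(toy, k=2)
print(res)
```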


Here follows our recommender system for a single item. We have everything we need: a database of items and a similarity matrix that quantifies the similarity between items. We can then construct a `recommend` function that, given one item purchased by a user, finds other items in the database that are similar to it



In [42]:
def recommend(db, similars, book_id, num=2):
    print("Recommending " + str(num) + " item(s) similar to " + db[book_id])
    for rec in similars[book_id][:num]:
        print("\t (score: %7.4f) - %s" % (rec[0], db[rec[1]]))

        
recommend(books, similars, 4,5)

Recommending 5 item(s) similar to The Elements of Statistical Learning
	 (score:  0.3650) - An introduction to Statistical Learning
	 (score:  0.1453) - Statistical Distributions
	 (score:  0.1121) - Learning SQL
	 (score:  0.1077) - Deep Learning
	 (score:  0.0996) - Statistical Power Analysis


### 1.2. Building User profiles

Suppose that a user has purchased 2 books: 'Pattern Recognition and Machine Learning' and 'The Elements of Statistical Learning'. Which are the most similar items we can suggest?

A very simple solution is to compute the user profile as the average of the vectors of the purchased items and check for items similar to it. By doing that, we are projecting the user preferences into item space!


In [43]:
def recommend_by_purchases(items, DB, mat, K=5):
    user_profile = np.mean(mat[items,:], axis=0)
    sims_v=linear_kernel(mat, np.asarray(user_profile))

    sims=[ (v[0],i) for i,v in enumerate(sims_v)]
    sims.sort()
    sims.reverse()
    
    print("User has bought: ")
    for item in items:
        print("\t", DB[item])
    print("It is then recommended:")
    for i in range(K):
        if i not in items:
            s, item = sims[i]
            print("\t (score: %7.4f) - %s" % (s, DB[item]))
    
        
recommend_by_purchases((3,4), books, tfidf_matrix, 5)


User has bought: 
	 Pattern Recognition and Machine Learning
	 The Elements of Statistical Learning
It is then recommended:
	 (score:  0.5294) - The Elements of Statistical Learning
	 (score:  0.5294) - Pattern Recognition and Machine Learning
	 (score:  0.2125) - An introduction to Statistical Learning
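
The output above lists the two purchased books first and only one new title. This happens because the loop tests the rank `i`, and not the item index, against `items`. A minimal sketch of one way to skip the purchased items is shown below. It returns the recommendations instead of printing them and uses a small dense matrix as a hypothetical stand-in for `tfidf_matrix`; the function name is also an assumption of this example.

```python
import numpy as np

def recommend_for_profile(items, mat, k=5):
    """Return up to k (score, index) pairs for the items most similar to
    the mean vector of the purchased `items`, skipping the purchases."""
    mat = np.asarray(mat, dtype=float)
    profile = mat[list(items)].mean(axis=0)    # user profile in item space
    scores = mat @ profile                      # dot product with every item
    ranked = np.argsort(scores)[::-1]           # best first
    purchased = set(items)
    picks = [(float(scores[j]), int(j)) for j in ranked if int(j) not in purchased]
    return picks[:k]

# Hypothetical unit-length item vectors (one row per item)
toy = np.array([[1.0, 0.0, 0.0],
                [0.8, 0.6, 0.0],
                [0.0, 1.0, 0.0],
                [0.0, 0.0, 1.0]])
out = recommend_for_profile((0, 1), toy, k=2)
print(out)
```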


#### Exercises

Work with this subset of the [movies database](https://www.kaggle.com/datasets/sankha1998/tmdb-top-10000-popular-movies-dataset) that contains descriptions of over 2700 movies in English and make a recommender system out of it
* Use only the descriptions for recommendation
* Check the results for several movies that you know and verify if it actually works
* Test the differences between word and character based analyzers. Would you change the ngrams?


In [44]:
movies=open("movies_metadata.txt", "rt", encoding="utf-8").readlines()

movies =[lin.strip().split("\t") for lin in movies]
titles=[ title for title, desc in movies]
descs= [ desc for title, desc in movies]
for i in range(3):
    print(titles[i], "-->", descs[i], "\n")

Minions --> "Minions Stuart, Kevin and Bob are recruited by Scarlet Overkill, a super-villain who, alongside her inventor husband Herb, hatches a plot to take over the world." 

Wonder Woman --> An Amazon princess comes to the world of Man to become the greatest of the female superheroes. 

Beauty and the Beast --> A live-action adaptation of Disney's version of the classic 'Beauty and the Beast' tale of a cursed prince and a beautiful young woman who helps him break the spell. 



In [46]:
## TODO on your own
desc_tfidf = TfidfVectorizer(ngram_range=(1,5), min_df=1, stop_words="english").fit_transform(descs)

In [47]:
desc_tfidf

<2768x283762 sparse matrix of type '<class 'numpy.float64'>'
	with 345888 stored elements in Compressed Sparse Row format>

In [49]:
desc_sims = linear_kernel(desc_tfidf)

In [60]:
def precompute_similar(DB, sims, K=10):
    similars=[]
    for idi, item in enumerate(DB):
        sim_item_idxs = sims[idi].argsort()[-(K+1):-1]
        
        sim_items = [(sims[idi][i], i) for i in sim_item_idxs]
        similars.append(sim_items[::-1])
    return similars

def recommend(db, similars, movie_id, num=2):
    print("Recommending " + str(num) + " item(s) similar to " + db[movie_id])
    for rec in similars[movie_id][:num]:
        print("\t (score: %7.4f) - %s" % (rec[0], db[rec[1]]))
        
        
def recommend_by_purchases(items, DB, mat, K=5):
    user_profile = np.mean(mat[items,:], axis=0)
    sims_v=linear_kernel(mat, np.asarray(user_profile))

    sims=[ (v[0],i) for i,v in enumerate(sims_v)]
    sims.sort()
    sims.reverse()
    
    print("User has bought: ")
    for item in items:
        print("\t", DB[item])
    print("It is then recommended:")
    for i in range(K):
        s, item = sims[i]
        # skip what the user already bought (compare item ids, not ranks)
        if item not in items:
            print("\t (score: %7.4f) - %s" % (s, DB[item]))

In [65]:
list(filter(lambda x: x[1]=="Predestination", enumerate(titles)))

[(2425, 'Predestination')]

In [57]:
most_similars = precompute_similar(titles, desc_sims)

In [66]:
recommend(titles, most_similars, 2425, 5)

Recommending 5 item(s) similar to Predestination
	 (score:  0.0260) - Atomic Blonde
	 (score:  0.0256) - White Oleander
	 (score:  0.0225) - Unstoppable
	 (score:  0.0204) - Survivor
	 (score:  0.0204) - The Interpreter


## 2. Introduction to Collaborative Filtering

Collaborative Filtering (CF) is the process of making recommendations based on the behaviour of other, similar users

There are two types of Collaborative Filtering
* User-User CF - We use the ratings of similar users to infer the best items for each user
* Item-Item CF - We use the ratings that a user gave to similar items to infer the rating of a new item



### 2.1 Introduction to Collaborative Filtering

The key issue in Collaborative Filtering is to identify similarities between individuals or items (rows or columns) when much of the data is missing.

For that purpose the general procedure is as follows:

1. Center the data set according to the row average
2. Compute the similarities using any metric - generally the Cosine Similarity

Let's start with the example shown in class with the ratings of 4 users to 7 movies. Note that if a user did not rate a movie, it does not mean that the rating is zero, but rather that it is unknown. The purpose of Collaborative Filtering is to discover those unknowns!

In [67]:
D={"HP1": np.array([4,5,np.nan,np.nan]),
   "HP2": np.array([np.nan,5,np.nan,3]),
   "HP3": np.array([np.nan,4,np.nan,np.nan]),
   "TW":  np.array([5, np.nan,2,np.nan]),
   "SW1": np.array([1, np.nan,4,np.nan]),
   "SW2": np.array([np.nan,np.nan,5,np.nan]),
   "SW3": np.array([np.nan,np.nan,np.nan, 2])}
df=pd.DataFrame(D)
df.index=["A", "B", "C", "D"]
df

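|   | HP1 | HP2 | HP3 | TW | SW1 | SW2 | SW3 |
|---|---|---|---|---|---|---|---|
| A | 4.0 | NaN | NaN | 5.0 | 1.0 | NaN | NaN |
| B | 5.0 | 5.0 | 4.0 | NaN | NaN | NaN | NaN |
| C | NaN | NaN | NaN | 2.0 | 4.0 | 5.0 | NaN |
| D | NaN | 3.0 | NaN | NaN | NaN | NaN | 2.0 |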


In [79]:
np.nanmean(df,axis=1)

array([3.33333333, 4.66666667, 3.66666667, 2.5       ])


We can intuitively perceive that Person A is probably more similar to B than to C, as they gave similar ratings to the only movie they both watched (HP1). On the other hand, persons A and C had divergent opinions on TW and SW1

Rather interestingly, Persons B and C do not share a single watched movie, and thus should perhaps be neutral to each other

#### Step 1. Centering the Matrix

The first step is to center the matrix by subtracting from each rating the mean of the corresponding user; as a result, the previous table will have positive and negative values

This way, a movie that has not been rated by a person is assumed to have the average of that person's ratings (0 after centering)

In [88]:
np.nanmean(df, axis=1)

array([3.33333333, 4.66666667, 3.66666667, 2.5       ])

In [95]:
df

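|   | HP1 | HP2 | HP3 | TW | SW1 | SW2 | SW3 |
|---|---|---|---|---|---|---|---|
| A | 4.0 | NaN | NaN | 5.0 | 1.0 | NaN | NaN |
| B | 5.0 | 5.0 | 4.0 | NaN | NaN | NaN | NaN |
| C | NaN | NaN | NaN | 2.0 | 4.0 | 5.0 | NaN |
| D | NaN | 3.0 | NaN | NaN | NaN | NaN | 2.0 |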


In [99]:
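def RowCenterMatrix(M):
    # mean of each row, ignoring the missing (NaN) ratings
    row_means = np.nanmean(M, axis=1)
    # transpose so that the row means broadcast along the last axis
    MC = M.T - row_means
    # missing ratings become 0, i.e. the row average after centering
    MC[np.isnan(MC)]=0
    return MC.T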

In [100]:


VC=RowCenterMatrix(df.values)

dfc=pd.DataFrame(VC)
dfc.columns=df.columns
dfc.index=df.index
dfc

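|   | HP1 | HP2 | HP3 | TW | SW1 | SW2 | SW3 |
|---|---|---|---|---|---|---|---|
| A | 0.666667 | 0.0 | 0.0 | 1.666667 | -2.333333 | 0.0 | 0.0 |
| B | 0.333333 | 0.333333 | -0.666667 | 0.0 | 0.0 | 0.0 | 0.0 |
| C | 0.0 | 0.0 | 0.0 | -1.666667 | 0.333333 | 1.333333 | 0.0 |
| D | 0.0 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | -0.5 |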


#### Step 2. Compute the Similarity Matrix

Now that we have the ratings-matrix centered we can in fact apply the Cosine Similarity to identify the most similar persons

In [101]:
#cos sim
def CosSim(A, B):
    Norm_A = np.sqrt(np.sum(A*A))
    Norm_B = np.sqrt(np.sum(B*B))
    return np.sum(A*B)/(Norm_A*Norm_B)

With this similarity function we can now verify the similarities between people, and verify our previous intuition


In [102]:
#Similarity users:
print("Sim User A and B: %7.4f" % CosSim(VC[0], VC[1]))
print("Sim User A and C: %7.4f" % CosSim(VC[0], VC[2]))
print("Sim User B and C: %7.4f" % CosSim(VC[1], VC[2]))
#VC[0]

Sim User A and B:  0.0925
Sim User A and C: -0.5591
Sim User B and C:  0.0000


The above procedure is adequate for comparing 2 rows of the same matrix. We can generalize it easily for making a Person x Person matrix using the centered data as above

In [118]:
def CosSim_Matrix(M):
    norms=np.sqrt(np.sum(M*M, axis=1))
    norms[norms<0.001]=0.001  #this will solve rows or cols without variance

    norms_M = np.outer(norms, norms)
    VC=M.copy()
    return np.dot(VC, VC.T)/norms_M

In [119]:
sim_mat = CosSim_Matrix(VC)
sims=pd.DataFrame(sim_mat)
sims.columns=df.index
sims.index=df.index
sims

|   | A | B | C | D |
|---|---|---|---|---|
| A | 1.0 | 0.09245 | -0.559085 | 0.0 |
| B | 0.09245 | 1.0 | 0.0 | 0.288675 |
| C | -0.559085 | 0.0 | 1.0 | 0.0 |
| D | 0.0 | 0.288675 | 0.0 | 1.0 |


#### Exercises 
Analyze the previous similarity matrix.
* Which are the most dissimilar users?
* Which are the most similar users?


#### Step 3. Making predictions

To make predictions, we will use the most similar users found. There are two common approaches, each with many more sophisticated variants

* Compute the average of the ratings of the most similar users
* Compute a weighted average of the ratings of the most similar users, using the user similarities as weights

This last approach is essentially the same procedure we will use for Item-Item collaborative filtering and will be described next
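
The two approaches can be sketched on the 4 users by 7 movies example. The code below rebuilds the user similarities with NumPy and estimates a missing rating from the neighbours that rated the movie. The `predict` function, its parameters and the choice of keeping only neighbours with positive similarity are assumptions of this sketch, not a prescribed method.

```python
import numpy as np

nan = np.nan
# Ratings from the example above: rows are users A-D, columns are
# HP1, HP2, HP3, TW, SW1, SW2, SW3
ratings = np.array([[4, nan, nan, 5, 1, nan, nan],
                    [5, 5, 4, nan, nan, nan, nan],
                    [nan, nan, nan, 2, 4, 5, nan],
                    [nan, 3, nan, nan, nan, nan, 2]], dtype=float)

# Center each user on their own mean; unknown ratings become 0
centered = ratings - np.nanmean(ratings, axis=1, keepdims=True)
centered[np.isnan(centered)] = 0.0
norms = np.linalg.norm(centered, axis=1)
user_sims = (centered @ centered.T) / np.outer(norms, norms)

def predict(user, item, k=2, weighted=True):
    """Estimate ratings[user, item] from the k most similar users that
    rated the item and have positive similarity (an assumed filter)."""
    candidates = [(user_sims[user, v], ratings[v, item])
                  for v in range(ratings.shape[0])
                  if v != user and not np.isnan(ratings[v, item])
                  and user_sims[user, v] > 0]
    candidates.sort(reverse=True)          # most similar neighbours first
    top = candidates[:k]
    if not top:
        return None                        # no usable neighbour
    if weighted:
        return sum(s * r for s, r in top) / sum(s for s, _ in top)
    return sum(r for _, r in top) / len(top)

print(predict(1, 6))   # user B, movie SW3
print(predict(0, 1))   # user A, movie HP2
```

With so few ratings each estimate rests on a single neighbour, so the plain and the weighted average coincide here; they start to differ once several neighbours with different similarities are available.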



### 2.2 Item - Item Collaborative filtering

We will use the above procedures to preprocess a slightly larger database (also shown in class) and will start with an Item-Item approach for making recommendations


#### 2.2.1 Read the data set accounting for the missing data

The `movies.csv` dataset has the ratings of 12 users for 6 movies

In [120]:
df=pd.read_csv("movies.csv", index_col=0)
df

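|   | U1 | U2 | U3 | U4 | U5 | U6 | U7 | U8 | U9 | U10 | U11 | U12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| M1 | 1.0 | NaN | 3.0 | NaN | NaN | 5.0 | NaN | NaN | 5.0 | NaN | 4 | NaN |
| M2 | NaN | NaN | 5.0 | 4.0 | NaN | NaN | 4.0 | NaN | NaN | 2.0 | 1 | 3.0 |
| M3 | 2.0 | 4.0 | NaN | 1.0 | 2.0 | NaN | 3.0 | NaN | 4.0 | 3.0 | 5 | NaN |
| M4 | NaN | 2.0 | 4.0 | NaN | 5.0 | NaN | NaN | 4.0 | NaN | NaN | 2 | NaN |
| M5 | NaN | NaN | 4.0 | 3.0 | 4.0 | 2.0 | NaN | NaN | NaN | NaN | 2 | 5.0 |
| M6 | 1.0 | NaN | 3.0 | NaN | 3.0 | NaN | NaN | 2.0 | NaN | NaN | 4 | NaN |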


#### 2.2.2. Compute the similarities between rows

First center the data. We can use the previously defined functions

In [121]:
VC=RowCenterMatrix(df.values)
pd.DataFrame(VC)

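|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -2.6 | 0.0 | -0.6 | 0.0 | 0.0 | 1.4 | 0.0 | 0.0 | 1.4 | 0.0 | 0.4 | 0.0 |
| 1 | 0.0 | 0.0 | 1.833333 | 0.833333 | 0.0 | 0.0 | 0.833333 | 0.0 | 0.0 | -1.166667 | -2.166667 | -0.166667 |
| 2 | -1.0 | 1.0 | 0.0 | -2.0 | -1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 |
| 3 | 0.0 | -1.4 | 0.6 | 0.0 | 1.6 | 0.0 | 0.0 | 0.6 | 0.0 | 0.0 | -1.4 | 0.0 |
| 4 | 0.0 | 0.0 | 0.666667 | -0.333333 | 0.666667 | -1.333333 | 0.0 | 0.0 | 0.0 | 0.0 | -1.333333 | 1.666667 |
| 5 | -1.6 | 0.0 | 0.4 | 0.0 | 0.4 | 0.0 | 0.0 | -0.6 | 0.0 | 0.0 | 1.4 | 0.0 |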


Now compute the similarities between Rows (Movies)

In [158]:
sim_mat = CosSim_Matrix(VC)
sim_movies=pd.DataFrame(sim_mat, columns=df.index, index=df.index)
sim_movies


|   | M1 | M2 | M3 | M4 | M5 | M6 |
|---|---|---|---|---|---|---|
| M1 | 1.0 | -0.178542 | 0.414039 | -0.10245 | -0.308957 | 0.58704 |
| M2 | -0.178542 | 1.0 | -0.526235 | 0.468008 | 0.398911 | -0.30644 |
| M3 | 0.414039 | -0.526235 | 1.0 | -0.623981 | -0.284268 | 0.50637 |
| M4 | -0.10245 | 0.468008 | -0.623981 | 1.0 | 0.458735 | -0.235339 |
| M5 | -0.308957 | 0.398911 | -0.284268 | 0.458735 | 1.0 | -0.215917 |
| M6 | 0.58704 | -0.30644 | 0.50637 | -0.235339 | -0.215917 | 1.0 |


#### 2.2.3  Global Baseline approach 

For any user $u$ and item $i$, the Global Baseline approach is
$$
    GBA_{u,i} = M + D_i + D_u
$$

where $M$ is the average rating over all items and users, $D_i$ is the difference between the average rating of item $i$ and the global average, and $D_u$ is the difference between the average rating of user $u$ and the global average



In [159]:
def make_GBAMatrix(df):
    Mat=df.values

    # global average over all known ratings
    M = np.nanmean(Mat)
    # deviation of each column average from the global average
    col_means = np.nanmean(Mat, axis=0)
    D_cols    = np.ones(Mat.shape)*col_means-M

    # deviation of each row average from the global average
    TMat=Mat.T
    row_means = np.nanmean(TMat, axis=0)
    D_rows    = np.ones(TMat.shape)*row_means-M
    D_rows    = D_rows.T

    return M+D_cols+D_rows

gba=make_GBAMatrix(df)
pd.DataFrame(gba)

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.761905 | 3.428571 | 4.228571 | 3.095238 | 3.928571 | 3.928571 | 3.928571 | 3.428571 | 4.928571 | 2.928571 | 3.428571 | 4.428571 |
| 1 | 1.328571 | 2.995238 | 3.795238 | 2.661905 | 3.495238 | 3.495238 | 3.495238 | 2.995238 | 4.495238 | 2.495238 | 2.995238 | 3.995238 |
| 2 | 1.161905 | 2.828571 | 3.628571 | 2.495238 | 3.328571 | 3.328571 | 3.328571 | 2.828571 | 4.328571 | 2.328571 | 2.828571 | 3.828571 |
| 3 | 1.561905 | 3.228571 | 4.028571 | 2.895238 | 3.728571 | 3.728571 | 3.728571 | 3.228571 | 4.728571 | 2.728571 | 3.228571 | 4.228571 |
| 4 | 1.495238 | 3.161905 | 3.961905 | 2.828571 | 3.661905 | 3.661905 | 3.661905 | 3.161905 | 4.661905 | 2.661905 | 3.161905 | 4.161905 |
| 5 | 0.761905 | 2.428571 | 3.228571 | 2.095238 | 2.928571 | 2.928571 | 2.928571 | 2.428571 | 3.928571 | 1.928571 | 2.428571 | 3.428571 |


In [160]:
R=np.nanmean(df, axis=0)   # average of each column (one value per user)

In [161]:
C=np.nanmean(df, axis=1)   # average of each row (one value per movie)

In [162]:
M = np.nanmean(df)

#### Exercise
* Discuss - Could there exist scores below 0 or above 5?
    - Yes --> e.g. with M = 2.5; $User_{bad} = -1.5$; $Movie_{bad} = - 1.5$ the baseline is $2.5 - 1.5 - 1.5 = -0.5$





#### 2.2.4 Make recommendations - With collaborative Filtering

Now that we have the similarity scores, we just have to compute an estimated score.

We select a number of neighbours (the closest ones) that have actual scores and compute a weighted average of those scores, using the similarities as weights

This is the procedure we are going to use:

1. For the specified row, sort the other elements in descending order of similarity
2. Keep only the items that have an actual score in the target column and a positive similarity
3. Select the top Nearest Neighbours elements
4. Compute the weighted average of the actual scores, using the similarities as weights
5. If there are no similar elements, output the Global Baseline Average



In [163]:
def estimate_score(df, SM, nn, r,c, verbose=False):
    vals=df.values
    N,M=vals.shape
    sims=list(zip(SM[r], range(N)))
    sims.sort()
    sims.reverse()
    cnt=0
    S=0
    Ssims=0
    if verbose: print("row: %d - col: %d" % (r,c))
    for  sim, idx in sims[1:]:
        if not np.isnan(vals[idx, c]) and sim>0:
            cnt   += 1
            S     += sim*vals[idx, c]
            Ssims += sim
            if verbose: print("\tItem:%4d    Score:%4.1f (Sim: %6.3f)" %(idx, vals[idx, c], sim))
        if cnt>= nn: break
        
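    #this is the situation where no neighbours were found
    if Ssims<=0:
        #this is very inefficient!!! <- better have it precomputed 
        if verbose: print("\tNo similars: outputing the Global Baseline Average")
        M = np.nanmean(df.values)
        # note: R holds the column averages and C the row averages of the
        # movies-by-users table, so R[r] and C[c] only match a table laid
        # out the other way round (rows = users, columns = movies)
        rA= R[r]
        cA= C[c]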
        r = min(max(rA+cA-M,0),5)
        if verbose: print("\tScore: %7.4f" % r)
        return r
    
    if verbose: print("\tScore: %7.4f" % (S/Ssims))
    return S/Ssims


Let's estimate the score `U5` would give to `M1`

In [164]:
estimate_score(df, sim_mat, 2, df.index.get_loc("M1"),df.columns.get_loc("U5"), True)


row: 0 - col: 4
	Item:   5    Score: 3.0 (Sim:  0.587)
	Item:   2    Score: 2.0 (Sim:  0.414)
	Score:  2.5864


2.5864068669348175

We can use this procedure to estimate any missing value in our original table.
For instance, let's compute the estimates for
* (M2, U2)
* (M4, U1)
* (M3, U8)
* (M4, U12)

In [165]:
estimate_score(df, sim_mat, 2, df.index.get_loc("M1"),df.columns.get_loc("U5"), True)
estimate_score(df, sim_mat, 3, df.index.get_loc("M2"),df.columns.get_loc("U2"), True)
estimate_score(df, sim_mat, 3, df.index.get_loc("M3"),df.columns.get_loc("U8"), True)
estimate_score(df, sim_mat, 3, df.index.get_loc("M4"),df.columns.get_loc("U12"), True)
estimate_score(df, sim_mat, 3, df.index.get_loc("M4"),df.columns.get_loc("U1" ), True)
a=1

row: 0 - col: 4
	Item:   5    Score: 3.0 (Sim:  0.587)
	Item:   2    Score: 2.0 (Sim:  0.414)
	Score:  2.5864
row: 1 - col: 1
	Item:   3    Score: 2.0 (Sim:  0.468)
	Score:  2.0000
row: 2 - col: 7
	Item:   5    Score: 2.0 (Sim:  0.506)
	Score:  2.0000
row: 3 - col: 11
	Item:   1    Score: 3.0 (Sim:  0.468)
	Item:   4    Score: 5.0 (Sim:  0.459)
	Score:  3.9900
row: 3 - col: 0
	No similars: outputing the Global Baseline Average
	Score:  3.0952
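
In the fallback branch of `estimate_score`, `R` holds the per-column averages and `C` the per-row averages, yet they are indexed with the row `r` and the column `c` respectively. For (M4, U1) this yields the 3.0952 printed above, whereas the matrix returned by `make_GBAMatrix` holds 1.561905 in that cell. The sketch below shows one possible way to compute the baseline for a single cell directly from the ratings table; the function name and the clipping bounds are assumptions of this example, and the ratings are copied from the `movies.csv` table shown earlier.

```python
import numpy as np

nan = np.nan
# Ratings from movies.csv as displayed above: rows M1-M6, columns U1-U12
ratings = np.array([
    [1, nan, 3, nan, nan, 5, nan, nan, 5, nan, 4, nan],
    [nan, nan, 5, 4, nan, nan, 4, nan, nan, 2, 1, 3],
    [2, 4, nan, 1, 2, nan, 3, nan, 4, 3, 5, nan],
    [nan, 2, 4, nan, 5, nan, nan, 4, nan, nan, 2, nan],
    [nan, nan, 4, 3, 4, 2, nan, nan, nan, nan, 2, 5],
    [1, nan, 3, nan, 3, nan, nan, 2, nan, nan, 4, nan],
], dtype=float)

def global_baseline(ratings, r, c, lo=0.0, hi=5.0):
    """Row mean + column mean - global mean for cell (r, c), clipped."""
    overall = np.nanmean(ratings)
    row_mean = np.nanmean(ratings[r, :])     # average of the item in row r
    col_mean = np.nanmean(ratings[:, c])     # average of the user in column c
    return float(np.clip(row_mean + col_mean - overall, lo, hi))

print(round(global_baseline(ratings, 3, 0), 6))   # M4, U1
```

For repeated queries it would be cheaper to look the value up in a precomputed matrix such as `gba`, as the comment in `estimate_score` already suggests.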


Note: There are other approaches that use the following function for estimating the rating of item $i$ by user $x$ ($r_{xi}$)

$$
r_{xi}=GBA_{xi}+\frac{\sum_{j \in N(i;x)} {s_{ij} \cdot (r_{xj}-GBA_{xj})}}{\sum_{j \in N(i;x)} s_{ij}}
$$

where $N(i;x)$ are the nearest neighbour items of $i$ rated by user $x$, $s_{ij}$ is the similarity between items $i$ and $j$, $r_{xj}$ is the rating of item $j$ by user $x$ and $GBA_{xi}$ is the global baseline average for user $x$ and item $i$

It is left as an **exercise the construction of a estimation function with this modification**


### 2.3. User - User  Collaborative filtering

User-User Collaborative Filtering is similar to the procedure for Item-Item, and the same matrix may be used. We just need to transpose it!


In [166]:
u_df=df.transpose()
u_df

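|   | M1 | M2 | M3 | M4 | M5 | M6 |
|---|---|---|---|---|---|---|
| U1 | 1.0 | NaN | 2.0 | NaN | NaN | 1.0 |
| U2 | NaN | NaN | 4.0 | 2.0 | NaN | NaN |
| U3 | 3.0 | 5.0 | NaN | 4.0 | 4.0 | 3.0 |
| U4 | NaN | 4.0 | 1.0 | NaN | 3.0 | NaN |
| U5 | NaN | NaN | 2.0 | 5.0 | 4.0 | 3.0 |
| U6 | 5.0 | NaN | NaN | NaN | 2.0 | NaN |
| U7 | NaN | 4.0 | 3.0 | NaN | NaN | NaN |
| U8 | NaN | NaN | NaN | 4.0 | NaN | 2.0 |
| U9 | 5.0 | NaN | 4.0 | NaN | NaN | NaN |
| U10 | NaN | 2.0 | 3.0 | NaN | NaN | NaN |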


#### 2.3.1. Center and compute similarities

We can now center the matrix and compute the user similarities as before

In [167]:
user_centered=RowCenterMatrix(u_df.values)
usim_mat = CosSim_Matrix(user_centered)
sim_users=pd.DataFrame(usim_mat, columns=u_df.index, index=u_df.index)
sim_users

|   | U1 | U2 | U3 | U4 | U5 | U6 | U7 | U8 | U9 | U10 | U11 | U12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| U1 | 1.0 | 0.57735 | 0.39036 | -0.629941 | -0.456435 | -0.288675 | -0.57735 | 0.288675 | -0.866025 | 0.57735 | 0.235702 | 0.0 |
| U2 | 0.57735 | 1.0 | -0.084515 | -0.545545 | -0.948683 | 0.0 | -0.5 | -0.5 | -0.5 | 0.5 | 0.612372 | 0.0 |
| U3 | 0.39036 | -0.084515 | 1.0 | 0.461069 | 0.213809 | -0.422577 | 0.507093 | 0.422577 | -0.338062 | -0.507093 | -0.759072 | -0.422577 |
| U4 | -0.629941 | -0.545545 | 0.461069 | 1.0 | 0.552052 | -0.109109 | 0.981981 | 0.0 | 0.545545 | -0.981981 | -0.846327 | -0.327327 |
| U5 | -0.456435 | -0.948683 | 0.213809 | 0.552052 | 1.0 | -0.158114 | 0.474342 | 0.632456 | 0.474342 | -0.474342 | -0.710047 | 0.158114 |
| U6 | -0.288675 | 0.0 | -0.422577 | -0.109109 | -0.158114 | 1.0 | 0.0 | 0.0 | 0.5 | 0.0 | 0.408248 | -0.5 |
| U7 | -0.57735 | -0.5 | 0.507093 | 0.981981 | 0.474342 | 0.0 | 1.0 | 0.0 | 0.5 | -1.0 | -0.816497 | -0.5 |
| U8 | 0.288675 | -0.5 | 0.422577 | 0.0 | 0.632456 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | -0.408248 | 0.0 |
| U9 | -0.866025 | -0.5 | -0.338062 | 0.545545 | 0.474342 | 0.5 | 0.5 | 0.0 | 1.0 | -0.5 | -0.204124 | 0.0 |
| U10 | 0.57735 | 0.5 | -0.507093 | -0.981981 | -0.474342 | 0.0 | -1.0 | 0.0 | -0.5 | 1.0 | 0.816497 | 0.5 |


#### 2.3.2 Make recommendations based on User Similarities

The procedure is essentially the same, now using the user similarity matrix and the transposed scores. This time we use a loop to compute all the movie-user pairs

It is interesting to verify that for this database it is actually easier to find similar elements for the specified pairs

In [168]:
estimations=[("M1", "U5"), ("M2", "U2"), ("M3", "U8"), ("M4", "U12"), ("M4", "U1")]
for c,r in estimations:
    estimate_score(u_df, usim_mat, 3, u_df.index.get_loc(r),u_df.columns.get_loc(c), True)


row: 4 - col: 0
	Item:   8    Score: 5.0 (Sim:  0.474)
	Item:   2    Score: 3.0 (Sim:  0.214)
	Score:  4.3786
row: 1 - col: 1
	Item:  10    Score: 1.0 (Sim:  0.612)
	Item:   9    Score: 2.0 (Sim:  0.500)
	Score:  1.4495
row: 7 - col: 2
	Item:   4    Score: 2.0 (Sim:  0.632)
	Item:   0    Score: 2.0 (Sim:  0.289)
	Score:  2.0000
row: 11 - col: 3
	Item:  10    Score: 2.0 (Sim:  0.204)
	Item:   4    Score: 5.0 (Sim:  0.158)
	Score:  3.3095
row: 0 - col: 3
	Item:   1    Score: 2.0 (Sim:  0.577)
	Item:   2    Score: 4.0 (Sim:  0.390)
	Item:   7    Score: 4.0 (Sim:  0.289)
	Score:  3.0809


In [174]:
for r,c in estimations:
    estimate_score(df, sim_mat, 3, df.index.get_loc(r),df.columns.get_loc(c), True)

row: 0 - col: 4
	Item:   5    Score: 3.0 (Sim:  0.587)
	Item:   2    Score: 2.0 (Sim:  0.414)
	Score:  2.5864
row: 1 - col: 1
	Item:   3    Score: 2.0 (Sim:  0.468)
	Score:  2.0000
row: 2 - col: 7
	Item:   5    Score: 2.0 (Sim:  0.506)
	Score:  2.0000
row: 3 - col: 11
	Item:   1    Score: 3.0 (Sim:  0.468)
	Item:   4    Score: 5.0 (Sim:  0.459)
	Score:  3.9900
row: 3 - col: 0
	No similars: outputing the Global Baseline Average
	Score:  3.0952


#### Exercises
1. Make a direct comparison of the scores computed by User-User and Item-Item CF for all five inferred scores. Discuss your findings, assessing how confident you would be in each case
2. [to do at home] Make a Recommend_Item function that, given a set of parameters (matrices) and a set of item ratings by a new user, is capable of suggesting new items to the user 

## 3. A bigger exercise for Collaborative Filtering

On Kaggle there is a [Movie Recommendation Data set](https://www.kaggle.com/datasets/rohan4050/movie-recommendation-data) with 610 users and 9724 movies, with ratings up to 5 in half-point steps. The ratings table is available directly for today's class, and we are going to use it for suggesting recommendations to users

In [179]:
rawdata=pd.read_csv("ratings.csv")
rawdata

|   | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |
| ... | ... | ... | ... | ... |
| 100831 | 610 | 166534 | 4.0 | 1493848402 |
| 100832 | 610 | 168248 | 5.0 | 1493850091 |
| 100833 | 610 | 168250 | 5.0 | 1494273047 |
| 100834 | 610 | 168252 | 5.0 | 1493846352 |


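First create a ratings matrix from the data just read, with one row per user and one column per movie (it is transposed afterwards so that the rows are movies)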

In [180]:
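def create_rankingMatrix(rowLabel, colLabel, df):
    # unique labels of the rows and columns, taken from the data frame passed in
    rows = list(set(df[rowLabel])) 
    cols = list(set(df[colLabel])) 
    n_rows = len(rows)
    n_cols = len(cols)

    # map each label to its position in the matrix
    rows = dict(zip(rows, np.arange(n_rows)))
    cols = dict(zip(cols, np.arange(n_cols)))
    # start with every rating unknown (NaN)
    mat = np.full((n_rows, n_cols), np.nan)
    # assumes the first three columns are row label, column label and rating
    for rw in df.values:
        mat[rows[rw[0]], cols[rw[1]]]=rw[2]
    return mat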

mat=create_rankingMatrix("userId", "movieId", rawdata)
mat=mat.T
pd.DataFrame(mat)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,600,601,602,603,604,605,606,607,608,609
0,4.0,,,,4.0,,4.5,,,,...,4.0,,4.0,3.0,4.0,2.5,4.0,2.5,3.0,5.0
1,,,,,,4.0,,4.0,,,...,,4.0,,5.0,3.5,,,2.0,,
2,4.0,,,,,5.0,,,,,...,,,,,,,,2.0,,
3,,,,,,3.0,,,,,...,,,,,,,,,,
4,,,,,,5.0,,,,,...,,,,3.0,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9719,,,,,,,,,,,...,,,,,,,,,,
9720,,,,,,,,,,,...,,,,,,,,,,
9721,,,,,,,,,,,...,,,,,,,,,,
9722,,,,,,,,,,,...,,,,,,,,,,4.0
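
As an alternative to the explicit loop, pandas can build the same kind of table with `DataFrame.pivot`. The sketch below uses a hypothetical miniature of `ratings.csv` with the same column names. Note that `pivot` sorts rows and columns by their ids, which may differ from the ordering produced by `create_rankingMatrix`, and that it raises an error if a (movie, user) pair appears more than once.

```python
import pandas as pd

# Hypothetical miniature of ratings.csv with the same column names
toy = pd.DataFrame({
    "userId":  [1, 1, 2, 3, 3],
    "movieId": [10, 30, 10, 20, 30],
    "rating":  [4.0, 5.0, 3.5, 2.0, 4.5],
})

# Rows are movies, columns are users; unrated cells become NaN
rating_table = toy.pivot(index="movieId", columns="userId", values="rating")
print(rating_table)

# plain NumPy array, usable with RowCenterMatrix and CosSim_Matrix
mat_small = rating_table.to_numpy()
```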


We can do the same as before and center and compute the similarities

In [181]:
mat_centered=RowCenterMatrix(mat)
sim_mat = CosSim_Matrix(mat_centered)
pd.DataFrame(sim_mat)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,9714,9715,9716,9717,9718,9719,9720,9721,9722,9723
0,1.000000,0.139649,0.113850,0.032658,0.076230,0.037291,0.064268,0.085986,0.041389,-0.008184,...,-0.057898,-0.010636,0.074217,0.0,0.0,0.0,-0.115796,0.0,0.154751,0.0
1,0.139649,1.000000,0.187303,-0.016897,0.154411,0.067458,0.067552,0.031078,0.046682,0.007021,...,0.043645,0.034338,0.192373,0.0,0.0,0.0,0.033170,0.0,0.314891,0.0
2,0.113850,0.187303,1.000000,0.071153,0.220868,0.185618,0.195055,0.188311,0.160743,-0.010144,...,0.000000,0.016012,0.013015,0.0,0.0,0.0,0.024370,0.0,0.000000,0.0
3,0.032658,-0.016897,0.071153,1.000000,0.128339,0.004044,0.022439,0.012922,0.000000,0.076458,...,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0
4,0.076230,0.154411,0.220868,0.128339,1.000000,0.080362,0.217156,0.224129,0.070375,0.047833,...,0.000000,0.031511,0.047210,0.0,0.0,0.0,0.008036,0.0,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9719,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0
9720,-0.115796,0.033170,0.024370,0.000000,0.008036,0.000000,0.000000,0.000000,0.000000,-0.036217,...,0.000000,0.328526,0.118678,0.0,0.0,0.0,1.000000,0.0,0.000000,0.0
9721,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0
9722,0.154751,0.314891,0.000000,0.000000,0.000000,0.033833,0.000000,0.000000,0.000000,0.064858,...,-0.013634,-0.002023,0.461131,0.0,0.0,0.0,0.000000,0.0,1.000000,0.0


Let's make an estimation for the movie in row 1 and the user in column 3 of the matrix

In [206]:
r = estimate_score(pd.DataFrame(mat), sim_mat, 5, 1, 3, True)

row: 1 - col: 3
	Item:3415    Score: 3.0 (Sim:  0.237)
	Item:2889    Score: 4.0 (Sim:  0.207)
	Item:2038    Score: 3.0 (Sim:  0.203)
	Item: 514    Score: 1.0 (Sim:  0.199)
	Item:2427    Score: 5.0 (Sim:  0.196)
	Score:  3.1932


#### Exercise

* For user 1, which of the films he did not watch can we recommend to him?
* Given any user's list of ratings, can you make a recommendation for that user?

In [190]:
## your move
df = pd.DataFrame(mat)

RangeIndex(start=0, stop=610, step=1)

In [211]:
#go for it - try a graphical approach
estimate_score(df, sim_mat, 5, 1, 125, True)

row: 1 - col: 125
	Item: 352    Score: 5.0 (Sim:  0.258)
	Item: 477    Score: 4.0 (Sim:  0.242)
	Item: 415    Score: 3.0 (Sim:  0.217)
	Item: 142    Score: 4.0 (Sim:  0.203)
	Item: 365    Score: 5.0 (Sim:  0.181)
	Score:  4.2025


4.202451490686916

In [213]:
df.shape

(9724, 610)

In [219]:
estimates = [ estimate_score(df, sim_mat, 5, 1, i, False) for i in range(df.shape[1]) if np.isnan(df.loc[1, i]) ]

In [220]:
len(estimates)

500

In [221]:
np.argsort(estimates)[-5:]

array([210, 197,  44, 448, 425], dtype=int64)

In [222]:
estimates[210]

4.911070485768047
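
Note that the positions returned by `np.argsort(estimates)` refer to the filtered list of estimates, not to the column labels of `df`, because the rated columns were skipped when the list was built. A minimal sketch of one way to keep each estimate paired with its column is shown below. The helper name `top_n_unrated` and the made-up scores are assumptions of this example; in the notebook, the `estimate` callable could wrap `estimate_score`.

```python
import numpy as np

def top_n_unrated(ratings_row, estimate, n=5):
    """Return the n best (column, estimated score) pairs among the
    columns where `ratings_row` is NaN.

    `estimate` is any callable mapping a column index to a score.
    """
    unrated = [c for c, v in enumerate(ratings_row) if np.isnan(v)]
    scored = [(c, estimate(c)) for c in unrated]
    scored.sort(key=lambda pair: pair[1], reverse=True)   # best score first
    return scored[:n]

# Hypothetical row with two known ratings and made-up estimates
row = np.array([4.0, np.nan, np.nan, 2.0, np.nan])
fake_estimates = {1: 3.2, 2: 4.7, 4: 1.5}
best = top_n_unrated(row, fake_estimates.get, n=2)
print(best)
```

With the objects defined in this section, a call could look like `top_n_unrated(df.loc[1].values, lambda c: estimate_score(df, sim_mat, 5, 1, c, False))`.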