# Matrix Factorization

Matrix factorization requires two basic elements:

* An **Index Map** to map an item_id into an index (e.g. 1, 2, 7, 45, etc.)
* A **Matrix** with the predictions for items not yet visited

Usually, to create the matrix for this recommender we would need the user_id or some equivalent information, so that the matrix could be `users x items`. However, as you may have noticed from the dataset interactions, the user_id is not available, but we do have an array of user features.

To address this, we decided to use *clustering*. We create clusters from the users' features and use those for the matrix, making it `clusters x items`. This adds one more element we need for recommending:

* A **Clustering Algorithm** to map new user features to clusters

In this notebook we will set up these elements. However, the actual recommendation happens in `matrix_fact.py`, which answers requests made to the BentoML API.

### Importing Libraries

In [1]:
import random
import pandas as pd
import numpy as np

from sklearn.cluster import KMeans
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

from preprocessing import preprocess, read_sample
from matrix_fact import ClusteredMatrixFactRecommender



### Acquire preprocessed Data

In [2]:
#df = preprocess("Sample")
df = read_sample("/media/backup/datasets/yahoo/yahoo_dataset_clicked.csv", p=1)
df.head()

Unnamed: 0.1,Unnamed: 0,Timestamp,Clicked_Article,Click,User_Features,Article_List
0,7,1317513293,563938,1,[1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 1 1 0 0 0 1...,[552077 555224 555528 559744 559855 560290 560...
1,13,1317513293,564335,1,[1 0 0 0 0 0 1 0 0 0 1 0 1 1 1 1 1 1 1 0 1 1 1...,[552077 555224 555528 559744 559855 560290 560...
2,39,1317513295,564335,1,[1 0 0 0 0 0 0 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1...,[552077 555224 555528 559744 559855 560290 560...
3,144,1317513299,565747,1,[1 0 0 0 0 0 1 0 0 0 0 1 1 1 1 1 1 1 1 0 1 1 1...,[552077 555224 555528 559744 559855 560290 560...
4,176,1317513300,563115,1,[1 0 0 0 0 0 0 0 1 0 0 1 1 1 1 1 1 1 1 0 1 1 1...,[552077 555224 555528 559744 559855 560290 560...


In [4]:
import re
import ast
def literal_eval(element):
    if isinstance(element, str):
        # raw string avoids the invalid '\s' escape warning in newer Python versions
        return ast.literal_eval(re.sub(r'\s+', ',', element))
    return element

df['User_Features'] = df['User_Features'].apply(literal_eval)
df['Article_List'] = df['Article_List'].apply(literal_eval)
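
To make the parsing step concrete, here is a minimal sketch of what the helper does to a single cell. The name `parse_array_string` and the input strings are hypothetical; the sketch also strips surrounding whitespace, which the cell above does not do.

```python
import ast
import re

def parse_array_string(element):
    """Turn a string like '[1 0 0 1]' into a Python list; pass other types through."""
    if isinstance(element, str):
        # Replace each run of whitespace with a comma so the text is a valid list literal
        return ast.literal_eval(re.sub(r"\s+", ",", element.strip()))
    return element

parsed = parse_array_string("[1 0 0 1]")
print(parsed)                       # [1, 0, 0, 1]
print(parse_array_string([1, 2]))   # already a list, returned unchanged
```

Note that a string with whitespace right after the opening bracket (for example `"[ 1 0]"`) would become `"[,1,0]"` and fail to parse, so such input would need extra handling.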

## Clustering

For the clustering, we need the users' features.

In [5]:
users = np.asarray(df.loc[:,'User_Features']) # acquire only the features
users = np.stack(users, axis=0) # stack them to make an array (interactions, features)
users.shape

(1027832, 136)

Now we can initialize the clustering algorithm, decide how many clusters we want, and fit it to the data.

In [10]:
# Note: n_jobs was deprecated in scikit-learn 0.23 and removed in 1.0; omit it on newer versions
kmeans = KMeans(n_clusters=10, n_jobs=-1)
kmeans.fit(users)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=10, n_init=10, n_jobs=-1, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)

We can draw a few random samples to take a look at the clustering result.

In [11]:
samples = df.sample(5).loc[:,'User_Features']
samples

176152    [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, ...
262303    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, ...
197729    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, ...
284846    [1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, ...
923152    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, ...
Name: User_Features, dtype: object

Stack the features again to make an array `(samples, features)`

In [12]:
sample_features = np.stack(samples,axis=0)
sample_features.shape

(5, 136)

Predict their clusters

In [13]:
kmeans.predict(sample_features)

array([7, 6, 9, 7, 4], dtype=int32)

If you wish to check whether the predicted clusters match the previously assigned clusters, just run:

In [14]:
kmeans.labels_[samples.index]

array([7, 6, 9, 7, 4], dtype=int32)
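
For intuition, `KMeans.predict` assigns each sample to the closest cluster center. The sketch below reproduces that rule with plain NumPy on toy data; `assign_clusters`, the centroids, and the user vectors are all made up for illustration and are not taken from the fitted model.

```python
import numpy as np

def assign_clusters(features, centroids):
    """Return, for each row of `features`, the index of the nearest centroid."""
    # Pairwise squared Euclidean distances, shape (n_samples, n_clusters)
    diffs = features[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    distances = (diffs ** 2).sum(axis=2)
    return distances.argmin(axis=1)

# Toy data: 3 hypothetical centroids over 4 binary features
centroids = np.array([[1.0, 0.0, 0.0, 0.0],
                      [0.0, 1.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0, 1.0]])
new_users = np.array([[1, 0, 0, 0],
                      [0, 1, 1, 0],
                      [0, 0, 0, 1]])
labels = assign_clusters(new_users, centroids)
print(labels)  # [0 1 2]
```

With the real model, the centroids would come from `kmeans.cluster_centers_`.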

Now, we can look at the features to see what similarities and differences they share

In [15]:
sample_features

array([[1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
        1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0,
        1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0,

## Index Map

First, we collect all distinct clicked articles in an array

In [22]:
articles = df['Clicked_Article'].unique()
articles.shape

(652,)

Then, we iterate over them creating a dictionary for the index map.

In [17]:
# Indexes start at 1 so that 0 can stand for an article that is not found in the index map
index_map = {art: idx for idx, art in enumerate(articles, start=1)}
# index_map
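
As a small illustration of why index 0 is reserved, the sketch below looks up a known and an unknown article in a toy map that uses the same 1-based numbering. The helper `to_index` is hypothetical, and falling back to 0 with `dict.get` is an assumption about how a lookup could be written, not necessarily what `matrix_fact.py` does.

```python
# Toy index map with 1-based indexes (article IDs taken from the dataset preview)
articles = [563938, 564335, 565747]
index_map = {art: idx for idx, art in enumerate(articles, start=1)}

def to_index(article_id, mapping):
    """Return the 1-based index of an article, or 0 when it is unknown."""
    return mapping.get(article_id, 0)

print(to_index(564335, index_map))  # 2
print(to_index(999999, index_map))  # 0 (999999 is a made-up, unmapped ID)
```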

## Matrix

Since our matrix will use indexes instead of the item_id, we can replace the IDs in the dataset

In [18]:
# Every clicked article is a key of index_map, so map() leaves no missing values
df['Clicked_Article'] = df['Clicked_Article'].map(index_map)
df.head(5)

Unnamed: 0.1,Unnamed: 0,Timestamp,Clicked_Article,Click,User_Features,Article_List
0,7,1317513293,1,1,"[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, ...","[552077, 555224, 555528, 559744, 559855, 56029..."
1,13,1317513293,2,1,"[1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, ...","[552077, 555224, 555528, 559744, 559855, 56029..."
2,39,1317513295,2,1,"[1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, ...","[552077, 555224, 555528, 559744, 559855, 56029..."
3,144,1317513299,3,1,"[1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, ...","[552077, 555224, 555528, 559744, 559855, 56029..."
4,176,1317513300,4,1,"[1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, ...","[552077, 555224, 555528, 559744, 559855, 56029..."


Also, since our clusters will be the other dimension of the matrix, we add them to the dataset. This makes the matrix creation process more straightforward.

In [19]:
df['Cluster'] = kmeans.labels_
df.head(5)

Unnamed: 0.1,Unnamed: 0,Timestamp,Clicked_Article,Click,User_Features,Article_List,Cluster
0,7,1317513293,1,1,"[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, ...","[552077, 555224, 555528, 559744, 559855, 56029...",6
1,13,1317513293,2,1,"[1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, ...","[552077, 555224, 555528, 559744, 559855, 56029...",5
2,39,1317513295,2,1,"[1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, ...","[552077, 555224, 555528, 559744, 559855, 56029...",3
3,144,1317513299,3,1,"[1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, ...","[552077, 555224, 555528, 559744, 559855, 56029...",3
4,176,1317513300,4,1,"[1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, ...","[552077, 555224, 555528, 559744, 559855, 56029...",5


In [20]:
# 'sum' behaves like np.sum here and avoids the FutureWarning raised by recent pandas versions
pivot_table = df.pivot_table(index='Cluster', columns='Clicked_Article', values='Click', aggfunc='sum', fill_value=0)
pivot_table.head(5)

Cluster / Clicked_Article,1,2,3,4,5,6,7,8,9,10,...,643,644,645,646,647,648,649,650,651,652
0,2,29,95,107,124,0,119,86,246,79,...,114,22,52,31,15,89,45,48,16,7
1,151,340,1070,1015,1047,18,811,890,1737,527,...,152,27,75,58,16,123,60,66,20,13
2,0,31,81,128,108,0,107,115,184,56,...,122,15,42,28,13,101,55,54,16,6
3,1,27,96,102,77,1,113,97,138,65,...,127,14,53,23,5,112,58,59,18,4
4,0,31,164,188,215,1,125,180,263,116,...,212,22,78,39,26,199,87,105,26,13


In [21]:
pivot_table.shape

(10, 652)
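
For readers who prefer to see the aggregation spelled out, the same kind of `clusters x items` count matrix can be sketched with plain NumPy. The toy interactions below are invented, and `np.add.at` is used here as an alternative to the pivot table, not as the method the notebook relies on.

```python
import numpy as np

# Toy interactions: cluster label, 1-based article index, and click flag per row
cluster_ids = np.array([0, 0, 1, 1, 1, 2])
article_idx = np.array([1, 2, 2, 2, 3, 1])
clicks = np.array([1, 1, 1, 1, 1, 1])

n_clusters, n_articles = 3, 3
counts = np.zeros((n_clusters, n_articles))
# Unbuffered accumulation, so repeated (cluster, article) pairs add up
np.add.at(counts, (cluster_ids, article_idx - 1), clicks)
print(counts)
```

Cluster 1 clicked article 2 twice, so the corresponding cell holds 2, and pairs that never occur stay at 0, which matches the effect of `fill_value=0`.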

Converting the matrix into a numpy array

In [24]:
pivot_matrix = np.asarray(pivot_table.values,dtype='float')
pivot_matrix[:5]

array([[2.00e+00, 2.90e+01, 9.50e+01, ..., 4.80e+01, 1.60e+01, 7.00e+00],
       [1.51e+02, 3.40e+02, 1.07e+03, ..., 6.60e+01, 2.00e+01, 1.30e+01],
       [0.00e+00, 3.10e+01, 8.10e+01, ..., 5.40e+01, 1.60e+01, 6.00e+00],
       [1.00e+00, 2.70e+01, 9.60e+01, ..., 5.90e+01, 1.80e+01, 4.00e+00],
       [0.00e+00, 3.10e+01, 1.64e+02, ..., 1.05e+02, 2.60e+01, 1.30e+01]])

Each row of this array therefore holds the click counts for one cluster

In [25]:
clusters = list(pivot_table.index)
clusters[:10]

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Here we convert the matrix to a sparse format with `scipy.sparse.csr_matrix()` so that it can be passed to the factorization

In [26]:
sparse_matrix = csr_matrix(pivot_matrix)
sparse_matrix

<10x652 sparse matrix of type '<class 'numpy.float64'>'
	with 6454 stored elements in Compressed Sparse Row format>

With `scipy.sparse.linalg.svds()` we compute the factorization

In [28]:
FACTORS_MF = 5

U, sigma, Vt = svds(sparse_matrix, k = FACTORS_MF)

After this step, we multiply the factors back together to obtain the predicted scores, convert them into a dataframe, and then into a matrix as a NumPy array

In [29]:
U.shape

(10, 5)

In [30]:
Vt.shape

(5, 652)

In [31]:
sigma = np.diag(sigma)
sigma.shape

(5, 5)

In [32]:
all_predicted_ratings = np.dot(np.dot(U, sigma), Vt) 
all_predicted_ratings.shape

(10, 652)
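
The reconstruction above is a low-rank approximation of the click matrix. The sketch below shows the same idea on a small made-up matrix, using `numpy.linalg.svd` instead of `svds` so that it runs with NumPy alone; `low_rank` is a hypothetical helper name.

```python
import numpy as np

# Toy cluster x item count matrix (values are made up)
counts = np.array([[5.0, 3.0, 0.0, 1.0],
                   [4.0, 0.0, 0.0, 1.0],
                   [1.0, 1.0, 0.0, 5.0]])

# Full SVD; numpy returns the singular values in descending order
U, s, Vt = np.linalg.svd(counts, full_matrices=False)

def low_rank(U, s, Vt, k):
    """Rebuild the matrix from the k largest singular triplets."""
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

approx = low_rank(U, s, Vt, k=2)        # keeps 2 latent factors
exact = low_rank(U, s, Vt, k=len(s))    # keeps all of them

print(approx.shape)                # (3, 4)
print(np.allclose(exact, counts))  # True
# Dropping one factor costs exactly the dropped singular value (Frobenius norm)
print(np.isclose(np.linalg.norm(counts - approx), s[2]))  # True
```

Keeping fewer factors smooths the counts, which is what lets the reconstructed matrix assign a nonzero score to cluster and article pairs that had few or no clicks.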

In [33]:
all_predicted_norm = (all_predicted_ratings - all_predicted_ratings.min()) / (all_predicted_ratings.max() - all_predicted_ratings.min())
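
The normalization in this cell is a global min-max scaling: a single minimum and maximum are taken over the whole matrix, so all scores land in the range 0 to 1. A minimal sketch on made-up values:

```python
import numpy as np

# Made-up predicted scores: 2 clusters x 3 items
predicted = np.array([[-0.5, 2.0, 4.5],
                      [1.0, 0.0, 3.0]])

# Global min-max scaling: one minimum and one maximum for the whole matrix
lowest, highest = predicted.min(), predicted.max()
normalized = (predicted - lowest) / (highest - lowest)
print(normalized)
```

Because the same shift and scale are applied everywhere, the ordering of items within each cluster is unchanged; only the numeric range differs.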

In [34]:
cf_preds_df = pd.DataFrame(all_predicted_norm, columns = pivot_table.columns, index=clusters).transpose()
cf_preds_df.head(10)

Clicked_Article / Cluster,0,1,2,3,4,5,6,7,8,9
1,0.002003,0.066163,0.001946,0.001929,0.001964,0.003924,0.001978,0.00157,0.002707,0.002658
2,0.017211,0.147088,0.011882,0.013097,0.016491,0.014215,0.008212,0.016921,0.01647,0.011369
3,0.040015,0.459106,0.044083,0.045017,0.068209,0.037849,0.024035,0.044642,0.024806,0.052835
4,0.052783,0.43562,0.057812,0.044613,0.085262,0.04974,0.033646,0.050446,0.042994,0.076264
5,0.048532,0.449419,0.054517,0.041418,0.084962,0.037688,0.02624,0.062338,0.045521,0.051674
6,0.001732,0.009334,0.001785,0.001791,0.001917,0.001824,0.00166,0.001907,0.001799,0.001646
7,0.053266,0.348325,0.043346,0.048334,0.059754,0.051503,0.03211,0.037294,0.0339,0.069241
8,0.036784,0.382116,0.05036,0.044092,0.079138,0.034513,0.025232,0.044821,0.020496,0.061447
9,0.09957,0.744129,0.082128,0.067881,0.118039,0.082485,0.050657,0.094982,0.09739,0.092899
10,0.033642,0.227032,0.032265,0.033032,0.048471,0.027457,0.018918,0.034934,0.023023,0.03784


In [35]:
matrix = np.asarray(cf_preds_df.values,dtype='float')
matrix.shape # shape (items, clusters)

(652, 10)

### Saving Artifacts

To pass our basic elements (matrix, index_map, clustering algorithm) to the model, we use BentoML. Our recommender then loads them to make its recommendations.

The `pack()` function takes care of saving what we need.

In [36]:
model = ClusteredMatrixFactRecommender()

In [37]:
model.pack("index_map", index_map)

<matrix_fact.ClusteredMatrixFactRecommender at 0x7f0a92d6a208>

In [38]:
model.pack("cluster_path", kmeans)

<matrix_fact.ClusteredMatrixFactRecommender at 0x7f0a92d6a208>

In [39]:
model.pack("matrix", matrix)

<matrix_fact.ClusteredMatrixFactRecommender at 0x7f0a92d6a208>

After packing what our recommender will need, we can test it with a small sample

In [40]:
test_articles = [565648, 563115, 552077, 564335, 565589, 563938, 560290, 563643, 560620, 565822, 563787, 555528, 565364, 559855, 560518]

In [49]:
sample_features[0]

array([1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
       1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0])

In this test, we use the first of the feature vectors sampled randomly in the clustering section, `sample_features[0]`

In [41]:
model.rank({'Timestamp': 123456789, 'Clicked_Article': 565822, 'Click': 1, 'User_Features': sample_features[0], 'Article_List': np.asarray(test_articles)})

[565822,
 563643,
 563115,
 565589,
 565648,
 559855,
 560290,
 555528,
 564335,
 560518,
 565364,
 552077,
 563787,
 563938,
 560620]

To check whether the recommendation is correct, we can reproduce it ourselves

First, we get the cluster for our features

In [42]:
test_cluster = kmeans.predict([sample_features[0]])[0]
test_cluster

7

Then we look up the indexes for the item list

In [43]:
indexes = [index_map[art] for art in test_articles]
indexes

[10, 4, 25, 2, 8, 1, 14, 9, 24, 19, 6, 21, 13, 23, 11]

With the indexes and the cluster, we can get the scores for each item.

Here, we subtract 1 from `idx` because index 0 is only used for items not found in the map; thus matrix row 0 corresponds to mapped index 1.

In [44]:
scores = [matrix[idx-1, test_cluster] for idx in indexes] 
scores

[0.03493439180374096,
 0.05044619287345377,
 0.002669978503223864,
 0.01692149248459763,
 0.04482099128216789,
 0.0015698378424824479,
 0.02973536382094232,
 0.09498165372155451,
 0.0012883653835528376,
 0.09883936018090379,
 0.0019067136452041627,
 0.02822560464822843,
 0.003096438434984018,
 0.034430103978424206,
 0.016880185280996884]

Finally we can sort the items by their scores

In [45]:
sorted(zip(scores, test_articles),reverse=True)

[(0.09883936018090379, 565822),
 (0.09498165372155451, 563643),
 (0.05044619287345377, 563115),
 (0.04482099128216789, 565589),
 (0.03493439180374096, 565648),
 (0.034430103978424206, 559855),
 (0.02973536382094232, 560290),
 (0.02822560464822843, 555528),
 (0.01692149248459763, 564335),
 (0.016880185280996884, 560518),
 (0.003096438434984018, 565364),
 (0.002669978503223864, 552077),
 (0.0019067136452041627, 563787),
 (0.0015698378424824479, 563938),
 (0.0012883653835528376, 560620)]
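
The manual steps above can be collected into one function. This is a minimal sketch on toy data, not the code in `matrix_fact.py`: the name `rank_articles`, the toy IDs, and the choice to give unknown articles a score of 0.0 are all assumptions made for illustration.

```python
import numpy as np

def rank_articles(article_ids, cluster, matrix, index_map):
    """Sort article IDs by predicted score for one cluster, best first."""
    scored = []
    for art in article_ids:
        idx = index_map.get(art, 0)
        # Assumption: unknown articles (index 0) get a score of 0.0
        score = matrix[idx - 1, cluster] if idx > 0 else 0.0
        scored.append((score, art))
    scored.sort(reverse=True)
    return [art for _, art in scored]

# Toy data: 3 known articles (rows) x 2 clusters (columns)
index_map = {101: 1, 102: 2, 103: 3}
matrix = np.array([[0.2, 0.9],
                   [0.7, 0.1],
                   [0.4, 0.5]])

print(rank_articles([101, 102, 103, 999], cluster=0, matrix=matrix, index_map=index_map))
# [102, 103, 101, 999]
```

With the notebook's objects, the cluster would come from `kmeans.predict` on the user's features, as in the cells above.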

In [46]:
model.save()

[2020-07-06 15:03:09,684] INFO - BentoService bundle 'ClusteredMatrixFactRecommender:1.0.20200706150251_E32098' saved to: /home/marlesson/bentoml/repository/ClusteredMatrixFactRecommender/1.0.20200706150251_E32098


'/home/marlesson/bentoml/repository/ClusteredMatrixFactRecommender/1.0.20200706150251_E32098'