# Matrix Factorization

Matrix factorization requires two basic elements:

* An **Index Map** to map an item_id into an index (e.g. 1, 2, 7, 45, etc.)
* A **Matrix** with the predictions for items not yet visited

Usually, to build the matrix for this recommender we would need the user_id or some equivalent information, so that the matrix could be `users x items`. However, as you may have noticed from the dataset interactions, the user_id is not available; what we do have is an array of user features.

To address this, we decided to use *clustering*. We create clusters from the users' features and use those for the matrix, making it `clusters x items`. This will add another element we'll need for recommending:

* A **Clustering Algorithm** to map new user features to clusters

In this notebook we set up these elements. The actual recommendation, however, happens in `matrix_fact.py`, which answers requests made to the BentoML API.

### Importing Libraries

In [1]:
import random
import pandas as pd
import numpy as np

from sklearn.cluster import KMeans
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

from preprocessing import preprocess
from matrix_fact import ClusteredMatrixFactRecommender

### Acquire preprocessed Data

In [2]:
df = preprocess("Sample")

In [3]:
df.head(6)

Unnamed: 0,Timestamp,Clicked_Article,Click,User_Features,Article_List
0,1317513291,560620,0,"[True, False, False, False, False, False, Fals...","[552077, 555224, 555528, 559744, 559855, 56029..."
1,1317513291,565648,0,"[True, False, False, False, False, False, Fals...","[552077, 555224, 555528, 559744, 559855, 56029..."
2,1317513291,563115,0,"[True, False, False, False, False, False, Fals...","[552077, 555224, 555528, 559744, 559855, 56029..."
3,1317513292,552077,0,"[True, False, False, False, False, False, True...","[552077, 555224, 555528, 559744, 559855, 56029..."
4,1317513292,564335,0,"[True, False, False, False, False, False, Fals...","[552077, 555224, 555528, 559744, 559855, 56029..."
5,1317513292,565589,0,"[True, False, False, False, True, False, False...","[552077, 555224, 555528, 559744, 559855, 56029..."


## Clustering

For the clustering, we need the users' features:

In [4]:
users = np.asarray(df.loc[:,'User_Features']) # acquire only the features
users = np.stack(users, axis=0) # stack them to make an array (interactions, features)
users.shape

(10447, 136)

Now we can initialize the clustering algorithm, decide how many clusters we want, and fit it to the user features

In [5]:
kmeans = KMeans(n_clusters=20)
kmeans.fit(users)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=20, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)
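
The number of clusters (20 here) is a free parameter. One common way to sanity-check such a choice is to compare the K-Means inertia (the within-cluster sum of squared distances) for a few candidate values. The following is only a minimal sketch: `toy_users` is a synthetic stand-in for `users`, and the candidate values, `n_init` and `random_state` are illustrative assumptions, not settings taken from this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for `users`: 300 rows of 16 boolean features (assumption).
rng = np.random.default_rng(0)
toy_users = (rng.random((300, 16)) < 0.2).astype(float)

inertias = {}
for k in (2, 5, 10, 20):
    # n_init and random_state are fixed only to make the sketch reproducible.
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(toy_users)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

Inertia generally keeps dropping as the number of clusters grows, so the usual heuristic is to look for the point where the improvement flattens out rather than for the minimum.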

We can draw a few random samples and see which clusters they are assigned to

In [6]:
samples = df.sample(5).loc[:,'User_Features']
samples

9424    [True, False, False, False, False, False, Fals...
9875    [True, False, False, False, False, False, Fals...
5896    [True, False, False, False, False, False, Fals...
9918    [True, False, False, False, False, False, Fals...
62      [True, False, False, False, False, False, Fals...
Name: User_Features, dtype: object

Stack the features again to make an array `(samples, features)`

In [7]:
sample_features = np.stack(samples,axis=0)
sample_features.shape

(5, 136)

Predict their clusters

In [8]:
kmeans.predict(sample_features)

array([1, 1, 1, 1, 6], dtype=int32)

If you wish to check whether the predicted clusters are the same as the clusters assigned during fitting, just run:

In [9]:
kmeans.labels_[samples.index]

array([1, 1, 1, 1, 6], dtype=int32)

Now, we can look at the features to see what similarities and differences they share

In [10]:
sample_features

array([[ True, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
      

## Index Map

First, we get all articles in a list

In [11]:
articles = df['Clicked_Article'].unique()

Then, we iterate over them creating a dictionary for the index map.

In [12]:
index_map = {}
idx = 1 # idx starts at 1 so that 0 is used for when the article is not found in the index map
for art in articles:
    index_map[art] = idx
    idx+=1
# index_map
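
The same mapping can be written more compactly, and the reserved index 0 suggests how a lookup for an unknown article could behave. This is a minimal sketch with made-up article ids; converting the keys with `int()` and using `dict.get` with a default of 0 are assumptions for illustration, not necessarily what `matrix_fact.py` does.

```python
import numpy as np

# Hypothetical article ids standing in for df['Clicked_Article'].unique().
articles = np.array([560620, 565648, 563115])

# enumerate(..., start=1) reproduces the loop above: the first article maps to 1.
index_map = {int(art): idx for idx, art in enumerate(articles, start=1)}

print(index_map[565648])         # known article -> 2
print(index_map.get(999999, 0))  # unknown article falls back to 0
```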

## Matrix

Since our matrix will use indexes instead of the item_id we can replace them in the dataset

In [13]:
df['Clicked_Article'] = df['Clicked_Article'].map(index_map) # every article is in index_map, so no value is left unmapped
df.head(5)

Unnamed: 0,Timestamp,Clicked_Article,Click,User_Features,Article_List
0,1317513291,1,0,"[True, False, False, False, False, False, Fals...","[552077, 555224, 555528, 559744, 559855, 56029..."
1,1317513291,2,0,"[True, False, False, False, False, False, Fals...","[552077, 555224, 555528, 559744, 559855, 56029..."
2,1317513291,3,0,"[True, False, False, False, False, False, Fals...","[552077, 555224, 555528, 559744, 559855, 56029..."
3,1317513292,4,0,"[True, False, False, False, False, False, True...","[552077, 555224, 555528, 559744, 559855, 56029..."
4,1317513292,5,0,"[True, False, False, False, False, False, Fals...","[552077, 555224, 555528, 559744, 559855, 56029..."


Since the clusters form the other dimension of the matrix, we also add them to the dataset.
This makes building the matrix more straightforward: a pivot table then sums the clicks for each (cluster, article) pair

In [14]:
df['Cluster'] = kmeans.labels_
df.head(5)

Unnamed: 0,Timestamp,Clicked_Article,Click,User_Features,Article_List,Cluster
0,1317513291,1,0,"[True, False, False, False, False, False, Fals...","[552077, 555224, 555528, 559744, 559855, 56029...",13
1,1317513291,2,0,"[True, False, False, False, False, False, Fals...","[552077, 555224, 555528, 559744, 559855, 56029...",8
2,1317513291,3,0,"[True, False, False, False, False, False, Fals...","[552077, 555224, 555528, 559744, 559855, 56029...",17
3,1317513292,4,0,"[True, False, False, False, False, False, True...","[552077, 555224, 555528, 559744, 559855, 56029...",16
4,1317513292,5,0,"[True, False, False, False, False, False, Fals...","[552077, 555224, 555528, 559744, 559855, 56029...",1


In [15]:
pivot_table = df.pivot_table(index='Cluster', columns='Clicked_Article', values='Click', aggfunc='sum', fill_value=0)
pivot_table.head(5)

Cluster,1,2,3,4,5,6,7,8,9,10,...,17,18,19,20,21,22,23,24,25,26
0,1,0,0,0,0,0,1,0,1,2,...,0,0,0,1,0,0,0,0,0,0
1,2,2,10,2,3,7,5,5,6,8,...,6,7,9,6,5,7,11,6,9,7
2,0,0,0,0,0,0,0,0,1,0,...,0,0,1,2,0,0,1,0,0,0
3,0,0,1,0,0,0,0,0,0,1,...,1,0,0,0,0,0,0,0,2,0
4,0,0,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,2,1
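
To make the aggregation concrete, here is a small sketch of the same `pivot_table` call on a toy dataframe. All values are made up; only the column names follow the dataset.

```python
import pandas as pd

# Toy interactions (made-up values): one row per impression.
toy = pd.DataFrame({
    "Cluster":         [0, 0, 1, 1, 1],
    "Clicked_Article": [1, 2, 1, 1, 3],
    "Click":           [1, 0, 1, 1, 0],
})

# Sum the clicks per (cluster, article) pair; missing pairs become 0.
toy_pivot = toy.pivot_table(index="Cluster", columns="Clicked_Article",
                            values="Click", aggfunc="sum", fill_value=0)
print(toy_pivot)
```

Note that a 0 in the result is ambiguous: it can mean the article was shown to that cluster and never clicked, or that it was never shown at all (filled in by `fill_value`).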


Converting the matrix into a numpy array

In [16]:
pivot_matrix = np.asarray(pivot_table.values,dtype='float')
pivot_matrix[:5]

array([[ 1.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  1.,  2.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 2.,  2., 10.,  2.,  3.,  7.,  5.,  5.,  6.,  8.,  3.,  5.,  3.,
         9.,  2.,  2.,  6.,  7.,  9.,  6.,  5.,  7., 11.,  6.,  9.,  7.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,
         0.,  1.,  0.,  0.,  0.,  1.,  2.,  0.,  0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  1.,
         0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  2.,  0.],
       [ 0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  2.,  1.]])

Each row of this array therefore holds the click counts for one cluster. The cluster ids are the index of the pivot table:

In [17]:
clusters = list(pivot_table.index)
clusters[:10]

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Here we convert the matrix to a sparse format with `scipy.sparse.csr_matrix()` so that it can be passed to the factorization

In [18]:
sparse_matrix = csr_matrix(pivot_matrix)
sparse_matrix

<20x26 sparse matrix of type '<class 'numpy.float64'>'
	with 195 stored elements in Compressed Sparse Row format>

With `scipy.sparse.linalg.svds()` we compute the factorization

In [19]:
FACTORS_MF = 15

U, sigma, Vt = svds(sparse_matrix, k = FACTORS_MF)

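The factorization returns `U`, `sigma` and `Vt`. We check their shapes, turn `sigma` into a diagonal matrix, multiply the three factors to rebuild the `clusters x items` matrix of predicted scores, normalize it to the [0, 1] range, and finally convert it into a dataframe and then into a numpy array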

In [20]:
U.shape

(20, 15)

In [21]:
Vt.shape

(15, 26)

In [22]:
sigma = np.diag(sigma)
sigma.shape

(15, 15)

In [23]:
all_predicted_ratings = np.dot(np.dot(U, sigma), Vt) 
all_predicted_ratings.shape

(20, 26)
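
The product of the three factors is a rank-`k` approximation of the original matrix, and the number of factors controls how closely it follows the observed counts. The sketch below illustrates this on a made-up 4x5 matrix by measuring the reconstruction error for several values of `k`; the data and the choice of the Frobenius norm are assumptions for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Made-up 4x5 click-count matrix (clusters x items).
counts = np.array([[2., 0., 1., 0., 0.],
                   [0., 3., 0., 1., 0.],
                   [1., 0., 2., 0., 1.],
                   [0., 1., 0., 2., 0.]])

errors = {}
for k in (1, 2, 3):  # svds requires k < min(counts.shape)
    U, s, Vt = svds(csr_matrix(counts), k=k)
    approx = U @ np.diag(s) @ Vt                 # rank-k reconstruction
    errors[k] = np.linalg.norm(counts - approx)  # Frobenius norm of the residual

print(errors)
```

Fewer factors give a smoother reconstruction that spreads score onto items a cluster has not clicked yet, which is what makes the result usable as a prediction; more factors reproduce the observed counts more exactly.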

In [24]:
all_predicted_norm = (all_predicted_ratings - all_predicted_ratings.min()) / (all_predicted_ratings.max() - all_predicted_ratings.min())
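
The line above is a min-max normalization that uses a single global minimum and maximum for the whole matrix. A tiny sketch with made-up numbers shows the effect; note that reconstructed scores can be slightly negative, which is one reason to rescale them.

```python
import numpy as np

# Made-up reconstructed scores (rows: clusters, columns: items).
preds = np.array([[-0.5, 1.0],
                  [ 2.0, 4.5]])

# Same formula as above: one global min and max for the whole matrix.
norm = (preds - preds.min()) / (preds.max() - preds.min())
print(norm)
```

Because the scaling is global, scores stay comparable across clusters, but a cluster with many clicks will hold most of the values close to 1.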

In [25]:
cf_preds_df = pd.DataFrame(all_predicted_norm, columns = pivot_table.columns, index=clusters).transpose()
cf_preds_df.head(10)

Clicked_Article,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
1,0.095983,0.209958,0.039053,0.017526,0.025699,0.037962,0.062812,0.119395,0.039,0.035501,0.029552,0.031793,0.038981,0.035038,0.032129,0.124572,0.048208,0.03938,0.02541,0.030283
2,0.048449,0.209556,0.025393,0.04044,0.032689,0.03665,0.109413,0.034217,0.03342,0.034022,0.20807,0.03602,0.031068,0.032748,0.213938,0.031176,0.03982,0.119744,0.126012,0.038657
3,0.035703,0.913686,0.028571,0.117289,0.107867,0.128458,0.120776,0.115275,0.029593,0.035761,0.035054,0.21039,0.031435,0.036304,0.205119,0.120531,0.043071,0.203951,0.130363,0.027171
4,0.033165,0.212812,0.034898,0.027086,0.023253,0.038245,0.033462,0.027911,0.207879,0.036391,0.12486,0.034597,0.119744,0.038054,0.028593,0.031996,0.043327,0.033113,0.125135,0.110239
5,0.041765,0.298721,0.037317,0.035956,0.032558,0.208683,0.024225,0.295985,0.036603,0.035574,0.038128,0.035569,0.119113,0.124344,0.033277,0.030399,0.038901,0.037621,0.031856,0.027747
6,0.038842,0.65191,0.053236,0.036466,0.036791,0.202757,0.02016,0.119477,0.033473,0.037022,0.049542,0.298421,0.119487,0.130263,0.285413,0.29557,0.019995,0.125085,0.034128,0.01388
7,0.126278,0.470151,0.031507,0.038185,0.042102,0.032571,0.1203,0.126862,0.039913,0.03327,0.029931,0.122156,0.12289,0.03119,0.042625,0.032902,0.039143,0.039535,0.027215,0.043809
8,0.032613,0.471875,0.034533,0.049671,0.017769,0.03603,0.039339,0.124477,0.034529,0.0342,0.035531,0.038213,0.118078,0.031749,0.120946,0.036007,0.031726,0.031688,0.033752,0.040138
9,0.114326,0.557183,0.11527,0.032025,0.039393,0.298959,0.048075,0.126085,0.037005,0.120252,0.112934,0.120576,0.126243,0.029449,0.217598,0.300274,0.0389,0.123062,0.030838,0.04832
10,0.20789,0.737392,0.043765,0.118828,0.128674,0.118342,0.207013,0.033267,0.032171,0.210631,0.128544,0.121115,0.123552,0.038528,0.02819,0.211201,0.110684,0.122155,0.03659,0.112981


In [26]:
matrix = np.asarray(cf_preds_df.values,dtype='float')
matrix.shape # shape (items, clusters)

(26, 20)

### Saving Artifacts

In order to pass our basic elements (matrix, index_map, clustering algorithm) to the model, we use BentoML. Thus, our recommender will load those in order to make its recommendations.

The `pack()` function takes care of saving what we need.

In [27]:
model = ClusteredMatrixFactRecommender()

In [28]:
model.pack("index_map", index_map)

<matrix_fact.ClusteredMatrixFactRecommender at 0x7f119aae1240>

In [29]:
model.pack("cluster_path", kmeans)

<matrix_fact.ClusteredMatrixFactRecommender at 0x7f119aae1240>

In [30]:
model.pack("matrix", matrix)

<matrix_fact.ClusteredMatrixFactRecommender at 0x7f119aae1240>

After packing what our recommender will need, we can test it with a small sample

In [31]:
test_articles = [565648, 563115, 552077, 564335, 565589, 563938, 560290, 563643, 560620, 565822, 563787, 555528, 565364, 559855, 560518]

In this test, we use `sample_features[0]`, the first of the feature sets sampled randomly for the clustering tests

In [32]:
model.rank({'Timestamp': 123456789, 'Clicked_Article': 565822, 'Click': 1, 'User_Features': sample_features[0], 'Article_List': np.asarray(test_articles)})

[563115,
 559855,
 565822,
 565589,
 563643,
 555528,
 560290,
 563938,
 564335,
 565364,
 563787,
 552077,
 560620,
 565648,
 560518]

To check whether the recommendation is correct, we can reproduce it ourselves

First, we get the cluster for our features

In [33]:
test_cluster = kmeans.predict([sample_features[0]])[0]
test_cluster

1

Then we look up the indexes for the items in the list

In [34]:
indexes = [index_map[art] for art in test_articles]
indexes

[2, 3, 4, 5, 6, 7, 8, 9, 1, 10, 11, 12, 13, 14, 15]

With the indexes and the cluster, we can get the scores for each item.

Here, we subtract 1 from `idx` because index 0 is only used for items not found in the map; thus row 0 of the matrix corresponds to the mapped index 1.

In [35]:
scores = [matrix[idx-1, test_cluster] for idx in indexes] 
scores

[0.20955558290254883,
 0.9136864273262162,
 0.2128117256038242,
 0.29872088718393625,
 0.6519104658348439,
 0.47015068742263794,
 0.47187500815441596,
 0.5571828219121022,
 0.20995764839408318,
 0.7373923134031024,
 0.289645532185459,
 0.4766124834800963,
 0.2975338359800944,
 0.8211201174380656,
 0.20651483131551182]

Finally we can sort the items by their scores

In [36]:
sorted(zip(scores, test_articles),reverse=True)

[(0.9136864273262162, 563115),
 (0.8211201174380656, 559855),
 (0.7373923134031024, 565822),
 (0.6519104658348439, 565589),
 (0.5571828219121022, 563643),
 (0.4766124834800963, 555528),
 (0.47187500815441596, 560290),
 (0.47015068742263794, 563938),
 (0.29872088718393625, 564335),
 (0.2975338359800944, 565364),
 (0.289645532185459, 563787),
 (0.2128117256038242, 552077),
 (0.20995764839408318, 560620),
 (0.20955558290254883, 565648),
 (0.20651483131551182, 560518)]
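
The manual steps above can be collected into a single function. The following is a hypothetical sketch, not the code in `matrix_fact.py`: the function name, the toy data, and the decision to give unknown articles a score of 0.0 are all assumptions. It takes the cluster directly; in this notebook that value would come from `kmeans.predict([features])[0]`.

```python
import numpy as np

def rank_articles(cluster, article_list, index_map, matrix):
    """Sort article ids by their score for the given cluster, best first."""
    scored = []
    for art in article_list:
        idx = index_map.get(art, 0)  # 0 means "not in the index map"
        # Assumption: articles missing from the map get the lowest score.
        score = float(matrix[idx - 1, cluster]) if idx > 0 else 0.0
        scored.append((score, art))
    return [art for score, art in sorted(scored, reverse=True)]

# Made-up data: 3 known items x 2 clusters.
toy_index_map = {101: 1, 102: 2, 103: 3}
toy_matrix = np.array([[0.2, 0.9],
                       [0.7, 0.1],
                       [0.4, 0.5]])

print(rank_articles(1, [101, 102, 103, 999], toy_index_map, toy_matrix))
# → [101, 103, 102, 999]
```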