# Matrix Factorization for a small subset

In this notebook, we build our first recommender system. It follows a **collaborative filtering approach** and uses only the readers and articles in a small subset of our data. The goal of this **matrix factorization technique** is to 'learn' two embedding matrices: one with a row per reader and one with a column per article, both sharing an arbitrarily chosen (and thus tunable) number of latent factors.

Thus, if we had 10 readers and 5 articles and assumed 3 latent factors (which could represent implicit but substantive differences in our reader/article base), our method would calculate two matrices (a 10 by 3 for the readers and a 3 by 5 for the articles) whose matrix product yields a new matrix the size of our original one (10 x 5) and *approximates* the original as closely as possible. This optimization problem is typically solved by stochastic gradient descent, although there are other possibilities; in this notebook we use a truncated singular value decomposition. From a once extremely sparse matrix (every single reader only reads/clicks a tiny fraction of the available articles), we get a densely populated table that indicates whether a given reader might be more or less inclined to read certain articles.
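
The shapes in this example can be checked with a few lines of NumPy. This is only a sketch: the two factor matrices are filled with random numbers rather than learned values, and the names `reader_factors` and `article_factors` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_readers, n_articles, n_factors = 10, 5, 3

# Hypothetical embedding matrices, filled with random values for illustration.
reader_factors = rng.normal(size=(n_readers, n_factors))    # 10 x 3
article_factors = rng.normal(size=(n_factors, n_articles))  # 3 x 5

# The matrix product yields a full reader-article score table.
scores = reader_factors @ article_factors                   # 10 x 5

# A single entry is the dot product of one reader row and one article column.
single_score = reader_factors[2] @ article_factors[:, 4]

print(scores.shape)
print(np.isclose(scores[2, 4], single_score))
```

Each of the 50 entries is therefore determined by only 3 numbers per reader and 3 numbers per article, which is what forces the model to generalize.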

The approach might sound a bit dry and mathematical at first, but the embeddings are lower-dimensional representations of our readers and articles, and they let us determine *resemblances in preferences*. If you ever wondered how Amazon or Google knew what you were interested in before you even searched for it: here you go!
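
One common way to measure such resemblances is the cosine similarity between embedding vectors. The sketch below uses invented 3-factor embeddings for three readers; cosine similarity is an assumption made here for illustration and is not used elsewhere in this notebook.

```python
import numpy as np

# Hypothetical 3-factor embeddings for three readers (values are made up).
reader_embeddings = np.array([
    [0.9, 0.1, 0.0],   # reader A
    [0.8, 0.2, 0.1],   # reader B, similar tastes to A
    [0.0, 0.1, 0.9],   # reader C, different tastes
])

# Normalize every row to unit length, then take all pairwise dot products.
unit = reader_embeddings / np.linalg.norm(reader_embeddings, axis=1, keepdims=True)
cosine_sim = unit @ unit.T

print(np.round(cosine_sim, 2))
```

Readers A and B end up with a similarity close to 1, while A and C are close to 0.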

## Python Imports

In [61]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from scipy.sparse.linalg import svds

from sklearn.preprocessing import binarize
from sklearn.preprocessing import normalize

## Data Import

In [2]:
behaviors = pd.read_csv("../../data/mind_small_train/behaviors_processed.csv")
news = pd.read_csv("../../data/mind_small_train/news_processed.csv")

In [3]:
behaviors.drop_duplicates(subset="user_id", inplace=True)
behaviors.head(3)

Unnamed: 0,impression_id,user_id,time,history,labels
0,1,U13740,11/11/2019 9:05:58 AM,N55189 N42782 N34694 N45794 N18445 N63302 N104...,N55689-1 N35729-0
1,2,U91836,11/12/2019 6:11:30 PM,N31739 N6072 N63045 N23979 N35656 N43353 N8129...,N20678-0 N39317-0 N58114-0 N20495-0 N42977-0 N...
2,3,U73700,11/14/2019 7:01:48 AM,N10732 N25792 N7563 N21087 N41087 N5445 N60384...,N50014-0 N23877-0 N35389-0 N49712-0 N16844-0 N...


In [4]:
news.head(3)

Unnamed: 0,article_id,category,subcategory,title,abstract,url,title_entities,abstract_entities
0,N55528,lifestyle,lifestyleroyals,"The Brands Queen Elizabeth, Prince Charles, an...","Shop the notebooks, jackets, and more that the...",https://assets.msn.com/labs/mind/AAGH0ET.html,"[{""Label"": ""Prince Philip, Duke of Edinburgh"",...",[]
1,N19639,health,weightloss,50 Worst Habits For Belly Fat,These seemingly harmless habits are holding yo...,https://assets.msn.com/labs/mind/AAB19MK.html,"[{""Label"": ""Adipose tissue"", ""Type"": ""C"", ""Wik...","[{""Label"": ""Adipose tissue"", ""Type"": ""C"", ""Wik..."
2,N61837,news,newsworld,The Cost of Trump's Aid Freeze in the Trenches...,Lt. Ivan Molchanets peeked over a parapet of s...,https://assets.msn.com/labs/mind/AAJgNsz.html,[],"[{""Label"": ""Ukraine"", ""Type"": ""G"", ""WikidataId..."


## Data preparation for the model

### Restrict data size and create user-article table

To reduce computing time, we restrict our dataset to the first 10,000 rows for this task. Because duplicate users were dropped above, each row is one impression from a distinct user:

In [5]:
behav_part_1 = behaviors.iloc[:10000, :]

In [6]:
behav_part_1.shape

(10000, 5)

Create a dictionary that maps impression IDs to the corresponding user IDs for later use in evaluation.

In [7]:
id_dict = pd.Series(behav_part_1.user_id.values,
                    index=behav_part_1.impression_id
                   ).to_dict()

Create a table that lists all user-article pairs and labels them as read.

In [8]:
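# Split each history string into one column per article, indexed by user.
x = behav_part_1.set_index('user_id').history.str.split(' ', expand=True)
# Stack the columns into one long series (this drops the empty cells).
x = x.stack().reset_index(1, drop=True).reset_index(name='article')
behaviors_part_1_set = x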

In [9]:
behaviors_part_1_set['read'] = 1

In [10]:
behaviors_part_1_set.head()

Unnamed: 0,user_id,article,read
0,U13740,N55189,1
1,U13740,N42782,1
2,U13740,N34694,1
3,U13740,N45794,1
4,U13740,N18445,1
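
To make the reshaping step easier to follow, here is the same idea on a two-user toy table. The sketch uses `str.split` followed by `DataFrame.explode` as an alternative to the `stack` approach above; the toy IDs are invented.

```python
import pandas as pd

# Toy stand-in for the behaviors table (IDs are made up).
toy = pd.DataFrame({
    "user_id": ["U1", "U2"],
    "history": ["N1 N2 N3", "N2 N4"],
})

# Turn each history string into a list, then emit one row per list element.
long_format = (
    toy.assign(article=toy["history"].str.split(" "))
       .explode("article")[["user_id", "article"]]
       .reset_index(drop=True)
)
long_format["read"] = 1

print(long_format)
```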


### Train Test Split

Next, we perform the train-test split on the user-article table. We then check that the two splits share enough users and articles. This matters for evaluating the model later on, as we can only make recommendations for users the model already saw in training.

In [11]:
train, test = train_test_split(behaviors_part_1_set, 
                               test_size=0.5, 
                               random_state=420)

In [12]:
user_intersection = set(train.user_id) & set(test.user_id)
article_intersection = set(train.article) & set(test.article)
print("User ID overlap in train and test split:    ",
      f"{len(user_intersection)} / {behaviors_part_1_set.user_id.nunique()}",
      "\n"
      "Article ID overlap in train and test split: ",
      f"{len(article_intersection)} / {behaviors_part_1_set.article.nunique()}")   

User ID overlap in train and test split:     9519 / 10000 
Article ID overlap in train and test split:  11490 / 21798


As the numbers above show, most users (9,519 of 10,000) and roughly half of the articles (11,490 of 21,798) appear in both splits, which is sufficient for our purposes.
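
Since only pairs whose user and article both occur in the train split can be scored, it may help to filter the test split accordingly before evaluating. The following is a minimal sketch on toy data; `toy_train`, `toy_test` and `scorable_test` are hypothetical names, not variables from this notebook.

```python
import pandas as pd

# Toy train and test splits (IDs are made up).
toy_train = pd.DataFrame({
    "user_id": ["U1", "U1", "U2"],
    "article": ["N1", "N2", "N2"],
})
toy_test = pd.DataFrame({
    "user_id": ["U1", "U2", "U3"],
    "article": ["N2", "N9", "N1"],
})

known_users = set(toy_train["user_id"])
known_articles = set(toy_train["article"])

# Keep only test pairs for which the model can produce a score.
mask = toy_test["user_id"].isin(known_users) & toy_test["article"].isin(known_articles)
scorable_test = toy_test[mask]

print(scorable_test)
```

Here only the pair (U1, N2) survives: N9 never appears in training, and neither does user U3.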

### Create Pivot Table

Now we create the user-article matrix from our train set, which we then approximate by a truncated singular value decomposition (SVD), one way of performing matrix factorization.

In [31]:
original_matrix = train.pivot_table(index='user_id', 
                                    columns='article',
                                    values='read',
                                    fill_value=0,
                                 #   aggfunc=np.sum
                                   )

In [32]:
original_matrix = original_matrix.astype(np.float64)
original_matrix.head()

user_id,N1001,N10016,N10021,N10024,N10025,N10034,N10040,N10041,N10047,N10048,...,N9955,N9958,N996,N9969,N997,N9973,N9974,N9977,N9978,N9992
U10022,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
U10043,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
U10045,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
U10059,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
U10062,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
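
Almost every entry of this table is zero, so for larger subsets a dense pivot table can become memory-hungry. As a hedged alternative, the sketch below builds the same kind of matrix directly as a SciPy sparse matrix from category codes. It is shown on toy data, and note that subtracting row means, as done in the model fitting step, would make the matrix dense again, so this only helps if that step is skipped or handled differently.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Toy user-article table (IDs are made up).
toy = pd.DataFrame({
    "user_id": ["U1", "U1", "U2", "U3"],
    "article": ["N1", "N2", "N2", "N3"],
    "read": [1, 1, 1, 1],
})

# Category codes give every user and article a consecutive integer position.
users = toy["user_id"].astype("category")
articles = toy["article"].astype("category")

sparse_matrix = csr_matrix(
    (
        toy["read"].to_numpy(dtype=float),
        (users.cat.codes.to_numpy(), articles.cat.codes.to_numpy()),
    ),
    shape=(len(users.cat.categories), len(articles.cat.categories)),
)

print(sparse_matrix.toarray())
```

The row and column labels can be recovered from `users.cat.categories` and `articles.cat.categories`.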


## Model Fitting

In [33]:
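# Work on a NumPy copy so the pivot table itself stays untouched.
original_matrix_np = original_matrix.to_numpy(copy=True)
# Center each user's row by subtracting that user's mean.
original_matrix_np_mean = np.mean(original_matrix_np, axis=1)
original_matrix_np -= original_matrix_np_mean.reshape(-1, 1)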

In [34]:
U, sigma, Vt = svds(original_matrix_np, k=5)

In [35]:
Sigma = np.diag(sigma)

In [36]:
Sigma.shape

(5, 5)

In [38]:
# Reconstruct the rank-5 approximation and add the user means back.
approx_matrix = np.dot(np.dot(U, Sigma), Vt) + original_matrix_np_mean.reshape(-1, 1)
# Wrap the result in a DataFrame labeled like the original pivot table.
approx_matrix_df = pd.DataFrame(approx_matrix)
approx_matrix_df.columns = original_matrix.columns
approx_matrix_df['user_ids'] = original_matrix.index
approx_matrix_df.set_index('user_ids', inplace=True)
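
The fitting steps above (center each row, take a truncated SVD, multiply the factors back together, and restore the row means) can be tried on a tiny matrix where the result is easy to inspect. This is a sketch on invented data with `k=2`; `svds` requires `k` to be smaller than both matrix dimensions.

```python
import numpy as np
from scipy.sparse.linalg import svds

# Toy reader-article matrix with two visible taste groups (values are made up).
toy = np.array([
    [1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 1],
], dtype=np.float64)

# Center each row, as in the cells above.
row_means = toy.mean(axis=1, keepdims=True)
centered = toy - row_means

# Keep only the 2 strongest latent factors.
U, sigma, Vt = svds(centered, k=2)

# Multiply the factors back together and restore the row means.
approx = U @ np.diag(sigma) @ Vt + row_means

print(sigma)
print(np.round(approx, 2))
```

The reconstruction is dense: cells that were 0 in `toy` now carry small scores, and those scores are what a recommender ranks.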

In [63]:
# Scale every article column to unit L2 norm.
# norm1 = approx_matrix / np.linalg.norm(approx_matrix)
norm2 = normalize(approx_matrix, axis=0)
# print(np.all(norm1 == norm2))

In [74]:
maxi = np.max(approx_matrix)
mini = np.min(approx_matrix)
norm = (approx_matrix - mini) / (maxi-mini)

In [71]:
maxi, mini

(1.4849397745290152, -0.7104377895120427)

In [76]:
np.max(norm), np.min(norm)

(1.0, 0.0)

In [80]:
norm_df = pd.DataFrame(norm)
norm_df.iloc[:, :10].describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
count,9748.0,9748.0,9748.0,9748.0,9748.0,9748.0,9748.0,9748.0,9748.0,9748.0
mean,0.323631,0.32403,0.323643,0.323717,0.323668,0.323739,0.323874,0.323685,0.323704,0.323645
std,0.000141,0.000765,0.000229,0.000309,0.000225,0.000279,0.000563,0.000171,0.000142,9.9e-05
min,0.322901,0.321191,0.322134,0.322243,0.322595,0.322258,0.321269,0.322966,0.32339,0.323204
25%,0.323596,0.323656,0.323568,0.323604,0.323586,0.323605,0.323612,0.323608,0.323623,0.323605
50%,0.323617,0.323832,0.323602,0.323645,0.32362,0.323645,0.323725,0.323641,0.32366,0.323621
75%,0.323664,0.324265,0.323634,0.32377,0.323701,0.323777,0.32402,0.323726,0.32374,0.323667
max,0.325498,0.33074,0.325136,0.326922,0.325377,0.326225,0.331564,0.326157,0.32511,0.324535


In [83]:
approx_matrix_df.iloc[:, :10].describe()

article,N1001,N10016,N10021,N10024,N10025,N10034,N10040,N10041,N10047,N10048
count,9748.0,9748.0,9748.0,9748.0,9748.0,9748.0,9748.0,9748.0,9748.0,9748.0
mean,0.000121,0.000871,0.000156,0.000276,0.000212,0.000324,0.000648,0.000234,0.000274,0.000154
std,0.000669,0.001579,0.000774,0.000894,0.000762,0.00082,0.001348,0.000685,0.00065,0.000628
min,-0.003658,-0.004789,-0.004043,-0.004315,-0.004879,-0.003061,-0.004035,-0.003274,-0.002337,-0.002864
25%,-0.000138,0.000133,-0.00016,-5.3e-05,-9.4e-05,-4.3e-05,7.3e-05,-4.7e-05,-1.5e-05,-0.000124
50%,8e-05,0.000485,7.6e-05,0.000154,0.00011,0.000144,0.00036,0.00013,0.000143,7.4e-05
75%,0.000328,0.001269,0.000354,0.000508,0.000404,0.000492,0.000984,0.00041,0.000429,0.000325
max,0.010376,0.021052,0.010573,0.013038,0.010254,0.012224,0.022178,0.011677,0.012276,0.011757


In [39]:
approx_matrix_df.head(3)

user_ids,N1001,N10016,N10021,N10024,N10025,N10034,N10040,N10041,N10047,N10048,...,N9955,N9958,N996,N9969,N997,N9973,N9974,N9977,N9978,N9992
U10022,-0.000528,0.001132,-0.00047,5.1e-05,-0.000225,-0.000446,0.000941,-0.000224,-0.000154,-0.000528,...,0.00627,-0.00057,0.01048,-0.000392,-0.00052,-0.000445,-0.000496,-0.000288,-0.000511,-0.000417
U10043,0.000769,0.00136,0.000829,0.000783,0.000697,0.00098,0.00074,0.000785,0.000854,0.00092,...,0.000894,0.000873,0.002221,0.000871,0.000767,0.000805,0.000693,0.00093,0.001,0.00079
U10045,0.000951,0.001591,0.001087,0.001037,0.000935,0.001228,0.000932,0.000986,0.00108,0.001121,...,0.001501,0.001077,0.002711,0.001116,0.000971,0.001073,0.000952,0.001175,0.001179,0.000993


In [40]:
np.min(approx_matrix), np.max(approx_matrix)

(-0.7104377895120427, 1.4849397745290152)

In [45]:
test.sort_values(by="user_id")

Unnamed: 0,user_id,article,read
101887,U10022,N27448,1
101884,U10022,N879,1
101890,U10022,N16233,1
101897,U10022,N40716,1
101904,U10022,N56240,1
...,...,...,...
130454,U999,N42937,1
130459,U999,N20028,1
130455,U999,N42768,1
182303,U9991,N47472,1


In [52]:
try:
    approx = approx_matrix_df.loc['U10022', 'N27448']
except KeyError:
    # The user or article did not appear in the train set, so there is no score.
    # Falling back to None is an assumed placeholder.
    approx = None

In [54]:
approx = approx_matrix_df.loc['U10022', 'N27448']

In [55]:
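# Threshold above which a score counts as a predicted read.
t = 0
if approx > t:
    approx = 1
else:
    approx = 0

# The pair comes from the test set, so its true label is 1 (read).
cost = 1 - approx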


0.041518773083372366
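
The single-pair check above could be extended to the whole test split. The sketch below is one possible way to do that, not a finished evaluation: it applies the same threshold rule to every test pair, skips pairs that cannot be scored, and averages the cost. The helper `pair_cost` and the toy tables are hypothetical.

```python
import pandas as pd

# Toy score table standing in for approx_matrix_df (values are made up).
scores = pd.DataFrame(
    [[0.02, -0.01], [-0.03, 0.04]],
    index=["U1", "U2"],
    columns=["N1", "N2"],
)

# Toy test split; every listed pair is a read article (true label 1).
toy_test = pd.DataFrame({
    "user_id": ["U1", "U1", "U2", "U3"],
    "article": ["N1", "N2", "N2", "N1"],
})

threshold = 0.0


def pair_cost(user_id, article, score_table, threshold):
    """Return 0 if the pair is predicted as read, 1 if not, None if unscorable."""
    try:
        score = score_table.loc[user_id, article]
    except KeyError:
        # User or article was not part of the score table.
        return None
    return 0 if score > threshold else 1


costs = [
    pair_cost(u, a, scores, threshold)
    for u, a in zip(toy_test["user_id"], toy_test["article"])
]

# Average only over the pairs that could be scored.
scored = [c for c in costs if c is not None]
mean_cost = sum(scored) / len(scored)

print(costs, mean_cost)
```

Because the test split contains only positive (read) pairs, this cost measures missed reads only; it says nothing about articles that are wrongly recommended.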