## Matrix Factorization for a small subset
In this notebook, we build our first recommender system. It follows a **collaborative filtering approach** and takes into account only the readers and articles in a small subset of our data. The goal of this **matrix factorization technique** is to 'learn' two embedding matrices: one with a row per reader and one with a column per article, both sharing an arbitrarily chosen (and thus tunable) number of latent factors.

Thus, if we had 10 readers and 5 articles and assumed 3 latent factors (which could represent implicit but substantive differences among our readers and articles), our method would compute two matrices (a 10 by 3 matrix for the readers and a 3 by 5 matrix for the articles) whose matrix product yields a new matrix of the same size as our original one (10 x 5) and *approximates* the original matrix as closely as possible. This optimization problem is typically solved by stochastic gradient descent (although there are, of course, other possibilities, such as the truncated singular value decomposition used further down in this notebook). From a once extremely sparse matrix (every single reader reads or clicks only a tiny fraction of the articles available to us), we get a densely populated table that now indicates whether a reader might be more or less inclined to read certain articles.
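
A minimal sketch of this idea on the 10 x 5 example, assuming a randomly generated toy click matrix rather than our real data. The learning rate, the regularization strength and the number of epochs are illustrative choices, not tuned values, and `sgd_epoch` is a hypothetical helper:

```python
import numpy as np

rng = np.random.default_rng(0)
n_readers, n_articles, n_factors = 10, 5, 3

# Toy click matrix (assumption: random 0/1 entries, not the MIND data).
R = (rng.random((n_readers, n_articles)) < 0.3).astype(float)

# Embedding matrices: one row per reader, one column per article.
P = 0.1 * rng.standard_normal((n_readers, n_factors))
Q = 0.1 * rng.standard_normal((n_factors, n_articles))

cells = [(i, j) for i in range(n_readers) for j in range(n_articles)]


def sgd_epoch(R, P, Q, lr=0.05, reg=0.01):
    """Visit every cell of R once in random order and update P and Q in place."""
    for idx in rng.permutation(len(cells)):
        i, j = cells[idx]
        err = R[i, j] - P[i] @ Q[:, j]
        p_old = P[i].copy()
        P[i] += lr * (err * Q[:, j] - reg * P[i])
        Q[:, j] += lr * (err * p_old - reg * Q[:, j])


loss_before = np.sum((R - P @ Q) ** 2)
for _ in range(200):
    sgd_epoch(R, P, Q)
loss_after = np.sum((R - P @ Q) ** 2)

approx = P @ Q  # dense 10 x 5 approximation of R
print(P.shape, Q.shape, approx.shape)
print(loss_before, loss_after)
```

The squared reconstruction error should shrink over the epochs, and `approx` holds a score for every reader-article pair, including the pairs that were zero in `R`.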

The approach might sound a bit dry and mathematical at first, but with the embeddings we actually learn lower-dimensional representations of our readers and articles and can thereby determine *resemblances in preferences*. If you ever wondered how Amazon or Google knew what you were interested in before you even searched for it: here you go!
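
One common way to compare two embeddings is cosine similarity. The following is only a sketch under assumptions: the embedding values in `reader_emb` are made up, and `cosine_similarity_matrix` is an illustrative helper, not something this notebook defines.

```python
import numpy as np

# Hypothetical 3-factor embeddings for four readers (made-up numbers).
reader_emb = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.1, 0.9],
    [0.1, 0.9, 0.0],
])


def cosine_similarity_matrix(emb):
    """Pairwise cosine similarity between the rows of emb."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return unit @ unit.T


sim = cosine_similarity_matrix(reader_emb)

# Ignore self-similarity, then pick the most similar other reader per row.
np.fill_diagonal(sim, -np.inf)
nearest = sim.argmax(axis=1)
print(nearest)
```

Readers 0 and 1 point in almost the same direction in the latent space, so each is the other's nearest neighbour.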

In [1]:
import pandas as pd
import numpy as np

In [2]:
behaviors = pd.read_csv('../../data/mind_small_train/behaviors.tsv', sep="\t", header=None)
news = pd.read_csv('../../data/mind_small_train/news.tsv', sep="\t", header=None)

At first, we will only need to work with the behaviors dataset, which looks like this:

In [3]:
behaviors.head()

Unnamed: 0,0,1,2,3,4
0,1,U13740,11/11/2019 9:05:58 AM,N55189 N42782 N34694 N45794 N18445 N63302 N104...,N55689-1 N35729-0
1,2,U91836,11/12/2019 6:11:30 PM,N31739 N6072 N63045 N23979 N35656 N43353 N8129...,N20678-0 N39317-0 N58114-0 N20495-0 N42977-0 N...
2,3,U73700,11/14/2019 7:01:48 AM,N10732 N25792 N7563 N21087 N41087 N5445 N60384...,N50014-0 N23877-0 N35389-0 N49712-0 N16844-0 N...
3,4,U34670,11/11/2019 5:28:05 AM,N45729 N2203 N871 N53880 N41375 N43142 N33013 ...,N35729-0 N33632-0 N49685-1 N27581-0
4,5,U8125,11/12/2019 4:11:21 PM,N10078 N56514 N14904 N33740,N39985-0 N36050-0 N16096-0 N8400-1 N22407-0 N6...


and needs some column-relabelling:

In [4]:
behaviors = behaviors.rename(columns={
    0: 'impression_id',
    1: 'user_id',
    2: 'time',
    3: 'history',
    4: 'labels',
})
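
As an alternative to relabelling after loading, the column names could be passed to `pd.read_csv` directly through its `names` parameter. A small sketch, assuming two made-up rows in the same tab-separated layout instead of the real file (`BEHAVIOR_COLUMNS` and `sample_tsv` are illustrative names):

```python
import io

import pandas as pd

BEHAVIOR_COLUMNS = ['impression_id', 'user_id', 'time', 'history', 'labels']

# Two made-up rows in the same tab-separated layout as behaviors.tsv.
sample_tsv = (
    "1\tU1\t11/11/2019 9:05:58 AM\tN1 N2 N3\tN4-1 N5-0\n"
    "2\tU2\t11/12/2019 6:11:30 PM\tN2 N6\tN7-0 N8-1\n"
)

# header=None says the file has no header row; names supplies the labels.
behaviors_sample = pd.read_csv(
    io.StringIO(sample_tsv), sep='\t', header=None, names=BEHAVIOR_COLUMNS
)
print(behaviors_sample.columns.tolist())
```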

Now we want to check if there are readers with multiple sessions:

In [5]:
behaviors.user_id.value_counts()

U32146    62
U15740    44
U20833    41
U51286    40
U44201    40
U79449    37
U30304    37
U57047    36
U47521    36
U56120    35
U79210    35
U85878    34
U63482    34
U27166    34
U72280    33
U43884    33
U21954    33
U68925    33
U67455    32
U83337    32
U19040    32
U38387    32
U44210    32
U77427    32
U58715    31
U48826    30
U1296     30
U52496    30
U17204    30
U39770    29
          ..
U32770     1
U73735     1
U58657     1
U62242     1
U26579     1
U39552     1
U75135     1
U83402     1
U45318     1
U61370     1
U89513     1
U90825     1
U7167      1
U14321     1
U84778     1
U18213     1
U37761     1
U76269     1
U38711     1
U37640     1
U83505     1
U6411      1
U27473     1
U91794     1
U56520     1
U83994     1
U40636     1
U55639     1
U79895     1
U42763     1
Name: user_id, Length: 50000, dtype: int64

In [6]:
len(behaviors.user_id.unique()), len(behaviors.user_id)

(50000, 156965)

Apparently, there are! For matrix factorization, we only want to work with the click history, so let's see what it looks like for a single reader:

In [7]:
user_U32594 = behaviors[behaviors.user_id == 'U32594']
user_U32594

Unnamed: 0,impression_id,user_id,time,history,labels
615,616,U32594,11/10/2019 4:38:09 AM,N54359 N54359 N5227 N16695 N63188 N6253 N60844...,N54595-0 N23757-0 N23820-0 N18572-0 N41220-0 N...
2202,2203,U32594,11/14/2019 2:27:10 AM,N54359 N54359 N5227 N16695 N63188 N6253 N60844...,N41612-0 N16148-0 N3031-0 N51954-0 N2021-0 N33...
4511,4512,U32594,11/14/2019 3:47:55 AM,N54359 N54359 N5227 N16695 N63188 N6253 N60844...,N16419-0 N3167-0 N30071-0 N47721-0 N16148-0 N8...
5095,5096,U32594,11/9/2019 12:36:17 PM,N54359 N54359 N5227 N16695 N63188 N6253 N60844...,N58051-0 N56396-0 N31372-0 N24272-0 N59852-0 N...
5747,5748,U32594,11/12/2019 3:05:21 AM,N54359 N54359 N5227 N16695 N63188 N6253 N60844...,N31978-0 N49157-0 N21741-0 N50675-0 N14184-0 N...
8648,8649,U32594,11/11/2019 1:09:11 PM,N54359 N54359 N5227 N16695 N63188 N6253 N60844...,N30998-0 N41172-0 N19542-0 N55204-0 N33964-0 N...
19975,19976,U32594,11/11/2019 3:29:00 AM,N54359 N54359 N5227 N16695 N63188 N6253 N60844...,N35729-0 N48759-1 N31273-0 N49685-0 N62729-0 N...
56570,56571,U32594,11/14/2019 7:09:36 AM,N54359 N54359 N5227 N16695 N63188 N6253 N60844...,N64174-0 N16148-0 N45509-0 N46821-0 N23446-0 N...
67071,67072,U32594,11/14/2019 4:01:54 AM,N54359 N54359 N5227 N16695 N63188 N6253 N60844...,N16844-0 N19391-0 N28767-0 N60550-0 N40559-0 N...
76277,76278,U32594,11/13/2019 10:31:37 AM,N54359 N54359 N5227 N16695 N63188 N6253 N60844...,N13907-0 N51048-0 N34876-0 N64094-1 N39010-0 N...


In [9]:
user_U32594.iloc[1, 3] == user_U32594.iloc[2, 3] == user_U32594.iloc[8, 3]

True

In [10]:
user_U32594.iloc[1, 4] == user_U32594.iloc[2, 4] == user_U32594.iloc[8, 4]

False

In [11]:
user_U67455 = behaviors[behaviors.user_id == 'U67455']

In [12]:
user_U67455.iloc[1, 3] == user_U67455.iloc[2, 3] == user_U67455.iloc[8, 3]

True

In [13]:
user_U67455.iloc[1, 4] == user_U67455.iloc[2, 4] == user_U67455.iloc[8, 4]

False

It looks like a reader's history is the same in every impression, at least for these two readers. Luckily, the recommendations and clicks (the `labels` column) are not.
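
The two readers above are only spot checks. One way to test the claim for every reader at once would be to count the distinct histories per `user_id`. A sketch on a made-up frame (`toy` is illustrative; on the real data the same `groupby` would run on `behaviors`):

```python
import pandas as pd

# Toy impressions (made-up IDs): U1 keeps one history, U2 has two different ones.
toy = pd.DataFrame({
    'user_id': ['U1', 'U1', 'U2', 'U2'],
    'history': ['N1 N2', 'N1 N2', 'N3', 'N3 N4'],
})

# Number of distinct history strings per reader (missing values are not counted).
histories_per_user = toy.groupby('user_id')['history'].nunique()

# Readers whose history differs between impressions.
users_with_changing_history = histories_per_user[histories_per_user > 1]
print(users_with_changing_history)
```

An empty result on the real data would confirm that each reader's history is constant across impressions.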

In [14]:
x = user_U67455.history.iloc[1].split(' ')
len(x), len(set(x))


(278, 275)

It also looks like some readers clicked the same article multiple times. We treat these instances as redundancies here. Together with the repeated histories in general, they pose no problem for constructing our **original reader-article matrix**, which we build in the following steps.

First of all, we want to reduce our dataset to the first 10,000 impressions for this task:

In [15]:
behav_part_1 = behaviors.iloc[:10000, :]

In [16]:
behav_part_1 = behav_part_1.dropna()
behav_part_1.shape

(9796, 5)

In [17]:
id_dict = pd.Series(behav_part_1.user_id.values,index=behav_part_1.impression_id).to_dict()
id_dict[9999]

'U5787'

In [18]:
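# One row per (user, article) pair: split each history into its article IDs
# and stack them into long format.
behaviors_part_1_set = (
    behav_part_1.set_index('user_id')
    .history.str.split(' ', expand=True)
    .stack()
    .reset_index(1, drop=True)
    .reset_index(name='article')
)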



In [19]:
behaviors_part_1_set

Unnamed: 0,user_id,article
0,U13740,N55189
1,U13740,N42782
2,U13740,N34694
3,U13740,N45794
4,U13740,N18445
5,U13740,N63302
6,U13740,N10414
7,U13740,N19347
8,U13740,N31801
9,U91836,N31739


In [20]:
# Marker column: every (user, article) row is a click, encoded as 1.
behaviors_part_1_set['zus'] = 1

In [21]:
behaviors_part_1_pivot = behaviors_part_1_set.pivot_table(index='user_id', columns='article', values='zus').fillna(0)

In [22]:
behaviors_part_1_pivot.shape, len(behav_part_1.user_id.unique())

((8502, 20688), 8502)
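
The dense pivot table has 8502 x 20688 cells, that is roughly 176 million floats, almost all of them zero. A sparse matrix would store only the clicks. The following is a sketch under assumptions, not what this notebook does: it uses made-up (user, article) pairs and builds a `scipy.sparse.csr_matrix` from pandas category codes.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Toy long-format clicks (made-up IDs), including one duplicated pair.
pairs = pd.DataFrame({
    'user_id': ['U1', 'U1', 'U2', 'U2', 'U2', 'U3'],
    'article': ['N1', 'N2', 'N2', 'N3', 'N3', 'N1'],
})

# Repeated clicks on the same article count once.
pairs = pairs.drop_duplicates()

# Category codes map each ID to a row or column position.
users = pairs['user_id'].astype('category')
articles = pairs['article'].astype('category')

clicks = csr_matrix(
    (
        np.ones(len(pairs)),
        (users.cat.codes.to_numpy(), articles.cat.codes.to_numpy()),
    ),
    shape=(len(users.cat.categories), len(articles.cat.categories)),
)
print(clicks.toarray())
```

`users.cat.categories` and `articles.cat.categories` keep the mapping from matrix positions back to the original IDs.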

In [23]:
behaviors_part_1_pivot.head()

user_id,N100,N1000,N10001,N10003,N10009,N1001,N10014,N10016,N10021,N10024,...,N9967,N9969,N997,N9973,N9974,N9977,N9978,N9984,N9992,N9993
U10022,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
U10043,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
U10045,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
U10059,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
U10062,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [24]:
import scipy as sp
from scipy.sparse.linalg import svds

In [25]:
# Dense copy of the click matrix, centered by subtracting each reader's row mean.
b1 = behaviors_part_1_pivot.to_numpy(copy=True)
b1_mean = np.mean(b1, axis=1)
b1 -= b1_mean.reshape(-1, 1)

In [26]:
U, sigma, Vt = svds(b1, k=20)

In [27]:
sigma = np.diag(sigma)


In [28]:
sigma.shape

(20, 20)

In [29]:
recommendations_df = pd.DataFrame(np.dot(np.dot(U, sigma), Vt) + b1_mean.reshape(-1, 1))
recommendations_df.columns = behaviors_part_1_pivot.columns
recommendations_df['user_ids'] = behaviors_part_1_pivot.index
recommendations_df = recommendations_df.set_index('user_ids')

In [30]:
recommendations_df.head()

user_ids,N100,N1000,N10001,N10003,N10009,N1001,N10014,N10016,N10021,N10024,...,N9967,N9969,N997,N9973,N9974,N9977,N9978,N9984,N9992,N9993
U10022,-0.00028,-0.000243,0.004196,0.000124,-0.000161,-0.000625,0.001704,0.007723,-0.000838,-0.002165,...,0.001413,-0.002237,-0.001649,0.00323,-0.002287,0.004048,0.003238,0.001986,-0.000648,0.000227
U10043,0.000978,0.001051,0.000647,0.001053,0.001231,0.001009,0.001018,0.0003,0.00103,0.001044,...,0.001074,0.000777,0.001134,0.000905,0.001014,0.001096,0.001055,0.001161,0.000896,0.001211
U10045,0.001679,0.001302,0.004273,0.001828,0.001645,0.001876,0.003305,-0.001754,0.001566,0.000592,...,0.003646,-0.001291,-0.000977,0.001018,0.000648,0.002572,0.000337,0.001198,0.001577,0.002121
U10059,0.000582,-0.000819,0.001246,-0.000453,-0.001358,-0.000569,-0.000668,0.002088,-0.001098,-0.000746,...,-0.000597,0.000823,-0.001739,0.001315,-0.000515,0.000465,0.000518,0.001193,0.001138,0.000827
U10062,-0.000813,-0.000512,0.001235,-0.002485,-0.002926,-0.003568,0.004105,-0.004009,-0.002954,0.001731,...,-0.001081,7.3e-05,-0.00682,-0.003728,-0.006758,-0.005362,-0.00301,0.004399,-0.00281,-0.000508
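
The cells above center each reader's row, keep 20 singular values with `svds`, and multiply the factors back together. The same steps can be followed on a matrix small enough to inspect. This is a sketch with a made-up 6 x 5 click matrix and `k = 2`, both chosen only for illustration:

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)

# Toy click matrix (made-up): 6 readers x 5 articles.
R = (rng.random((6, 5)) < 0.4).astype(float)

# Center each reader's row, as done above before factorizing.
row_mean = R.mean(axis=1, keepdims=True)
centered = R - row_mean

# Keep k latent factors; with the default solver, k must be smaller
# than min(R.shape).
k = 2
U, s, Vt = svds(centered, k=k)

# Multiply the factors back together and undo the centering.
approx = U @ np.diag(s) @ Vt + row_mean

error = np.linalg.norm(R - approx)
print(U.shape, s.shape, Vt.shape, error)
```

A larger `k` would lower the reconstruction error, but the scores then follow the original zeros and ones more closely and generalize less to unseen articles.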


In [31]:
comparison = list(zip(recommendations_df.loc['U10022'], behaviors_part_1_pivot.loc['U10022']))

In [32]:
zeros = []
ones = []
for x in comparison:
    if x[1] == 0:
        zeros.append(x[0])
    if x[1] == 1:
        ones.append(x[0])
        

In [33]:
len(zeros), np.mean(zeros), len(ones),np.mean(ones)

(20651, 0.0014974926627323597, 37, 0.1641967303220017)

In [34]:
min(ones), max(zeros)

(0.005595187962075218, 0.2645727751826295)
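
The means differ clearly, but `min(ones)` is below `max(zeros)`, so some unclicked articles score higher than some clicked ones. One way to quantify that overlap is the share of (clicked, unclicked) pairs in which the clicked article has the higher score; without ties this equals the AUC. A sketch with boolean masks on made-up numbers (`predicted` and `clicked` are illustrative arrays):

```python
import numpy as np

# Made-up predicted scores and click indicators for one reader.
predicted = np.array([0.20, 0.01, 0.15, -0.02, 0.03])
clicked = np.array([1, 0, 1, 0, 0], dtype=bool)

# Same comparison as the loop above, written with boolean masks.
mean_clicked = predicted[clicked].mean()
mean_unclicked = predicted[~clicked].mean()

# Share of (clicked, unclicked) pairs where the clicked article scores higher.
pairwise_wins = (
    predicted[clicked][:, None] > predicted[~clicked][None, :]
).mean()
print(mean_clicked, mean_unclicked, pairwise_wins)
```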

In [35]:
news.head()

Unnamed: 0,0,1,2,3,4,5,6,7
0,N55528,lifestyle,lifestyleroyals,"The Brands Queen Elizabeth, Prince Charles, an...","Shop the notebooks, jackets, and more that the...",https://assets.msn.com/labs/mind/AAGH0ET.html,"[{""Label"": ""Prince Philip, Duke of Edinburgh"",...",[]
1,N19639,health,weightloss,50 Worst Habits For Belly Fat,These seemingly harmless habits are holding yo...,https://assets.msn.com/labs/mind/AAB19MK.html,"[{""Label"": ""Adipose tissue"", ""Type"": ""C"", ""Wik...","[{""Label"": ""Adipose tissue"", ""Type"": ""C"", ""Wik..."
2,N61837,news,newsworld,The Cost of Trump's Aid Freeze in the Trenches...,Lt. Ivan Molchanets peeked over a parapet of s...,https://assets.msn.com/labs/mind/AAJgNsz.html,[],"[{""Label"": ""Ukraine"", ""Type"": ""G"", ""WikidataId..."
3,N53526,health,voices,I Was An NBA Wife. Here's How It Affected My M...,"I felt like I was a fraud, and being an NBA wi...",https://assets.msn.com/labs/mind/AACk2N6.html,[],"[{""Label"": ""National Basketball Association"", ..."
4,N38324,health,medical,"How to Get Rid of Skin Tags, According to a De...","They seem harmless, but there's a very good re...",https://assets.msn.com/labs/mind/AAAKEkt.html,"[{""Label"": ""Skin tag"", ""Type"": ""C"", ""WikidataI...","[{""Label"": ""Skin tag"", ""Type"": ""C"", ""WikidataI..."


In [35]:
titles_dict = pd.Series(news[3].values,index=news[0]).to_dict()

In [36]:
def give_recommendations(user, n=5):
    """Return n unread articles for `user` as (article_id, title, score) tuples."""
    scores = recommendations_df.loc[user]
    seen = behaviors_part_1_pivot.loc[user]

    # Keep only articles the reader has not clicked yet.
    recos = [(article, titles_dict[article], score)
             for article, score in scores.items()
             if seen[article] == 0]

    # The sort key tup[1] is the title, so the list is ordered alphabetically
    # by title; the predicted score is tup[2].
    recos = sorted(recos, key=lambda tup: tup[1])
    return recos[-n:]

In [37]:
give_recommendations('U91836')

[('N28296',
  '\u200b20 Funny Things People in the 1970s Were Totally Guilty of Doing',
  0.007997864721030094),
 ('N10575',
  '\u200b20 Funny Things People in the 1980s Were Totally Guilty of Doing',
  0.005278211766946482),
 ('N53209',
  '\u200bAre the Astros stealing signs to gain an edge in ALCS? Hopefully.',
  -0.0035951987737860734),
 ('N33318', '\u200bDevan Makail Bradish: Obituary', 0.00022632689518664812),
 ('N19829',
  '\ufeff\ufeffPrincess Charlene of Monaco Has the Most Daring Style of All the Royals',
  0.002462624527777522)]

In [38]:
give_recommendations('U10059')

[('N28296',
  '\u200b20 Funny Things People in the 1970s Were Totally Guilty of Doing',
  9.050816445769117e-05),
 ('N10575',
  '\u200b20 Funny Things People in the 1980s Were Totally Guilty of Doing',
  -0.0010660536117711192),
 ('N53209',
  '\u200bAre the Astros stealing signs to gain an edge in ALCS? Hopefully.',
  0.00023019408168195826),
 ('N33318', '\u200bDevan Makail Bradish: Obituary', -0.0005080142407107755),
 ('N19829',
  '\ufeff\ufeffPrincess Charlene of Monaco Has the Most Daring Style of All the Royals',
  -0.005494227347512183)]
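
Both readers receive the same five articles, and the scores are not in any order. The reason is that `give_recommendations` sorts by `tup[1]`, which is the title, so it returns the alphabetically last titles rather than the highest predicted scores. A sketch of ranking by score instead, using made-up stand-ins for `recommendations_df`, `behaviors_part_1_pivot` and `titles_dict` (the names `scores`, `clicked`, `titles` and `top_n_by_score` are illustrative):

```python
import pandas as pd

# Toy stand-ins for the notebook's objects (made-up values).
scores = pd.DataFrame(
    [[0.9, 0.2, 0.7, 0.1]], index=['U1'], columns=['N1', 'N2', 'N3', 'N4']
)
clicked = pd.DataFrame(
    [[1.0, 0.0, 0.0, 0.0]], index=['U1'], columns=['N1', 'N2', 'N3', 'N4']
)
titles = {'N1': 'Alpha', 'N2': 'Zebra', 'N3': 'Beta', 'N4': 'Gamma'}


def top_n_by_score(user, n=5):
    """Return the n highest-scoring articles the reader has not clicked yet."""
    unseen = scores.loc[user][clicked.loc[user] == 0]
    top = unseen.nlargest(n)
    return [(article, titles.get(article), score) for article, score in top.items()]


result = top_n_by_score('U1', n=2)
print(result)
```

Here 'Zebra' would win an alphabetical sort, but 'Beta' comes first because it has the higher score.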

In [112]:
news[news[0]=='N28296'], news[news[0]=='N19829']

(           0          1                    2  \
 4368  N28296  lifestyle  lifestyledidyouknow   
 
                                                       3  \
 4368  ​20 Funny Things People in the 1970s Were Tota...   
 
                                                       4  \
 4368  Here are 25 things that we all thought were pe...   
 
                                                   5   6  \
 4368  https://assets.msn.com/labs/mind/AAHPd6b.html  []   
 
                                                       7  
 4368  [{"Label": "1970s in music", "Type": "C", "Wik...  ,
             0          1                2  \
 31387  N19829  lifestyle  lifestyleroyals   
 
                                                        3  \
 31387  ﻿﻿Princess Charlene of Monaco Has the Most Dar...   
 
                                           4  \
 31387  She loves to show off her shoulders.   
 
                                                    5  \
 31387  https://assets.msn.com/labs/mind/B

In [114]:
news[news[0]=='N19829'][3]

31387    ﻿﻿Princess Charlene of Monaco Has the Most Dar...
Name: 3, dtype: object

In [119]:
titles_dict['N19829'].encode('ascii', 'ignore').decode()

'Princess Charlene of Monaco Has the Most Daring Style of All the Royals'

In [125]:
news[3] = news[3].str.encode('ascii', 'ignore').str.decode('ascii')
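
Encoding to ASCII with `'ignore'` drops every non-ASCII character, which also removes legitimate characters such as accented letters. A narrower alternative would strip only the two invisible characters seen above, `\u200b` (zero-width space) and `\ufeff` (byte order mark). A sketch on made-up titles; the third title is invented to show the difference:

```python
import pandas as pd

# Made-up titles: two with invisible prefixes, one with a legitimate accent.
titles = pd.Series([
    '\u200b20 Funny Things',
    '\ufeff\ufeffPrincess Charlene',
    'Café culture',
])

# Remove only zero-width spaces and byte order marks, then trim whitespace.
cleaned = titles.str.replace('[\u200b\ufeff]', '', regex=True).str.strip()
print(cleaned.tolist())
```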

In [126]:
news.head()

Unnamed: 0,0,1,2,3,4,5,6,7
0,N55528,lifestyle,lifestyleroyals,"The Brands Queen Elizabeth, Prince Charles, an...","Shop the notebooks, jackets, and more that the...",https://assets.msn.com/labs/mind/AAGH0ET.html,"[{""Label"": ""Prince Philip, Duke of Edinburgh"",...",[]
1,N19639,health,weightloss,50 Worst Habits For Belly Fat,These seemingly harmless habits are holding yo...,https://assets.msn.com/labs/mind/AAB19MK.html,"[{""Label"": ""Adipose tissue"", ""Type"": ""C"", ""Wik...","[{""Label"": ""Adipose tissue"", ""Type"": ""C"", ""Wik..."
2,N61837,news,newsworld,The Cost of Trump's Aid Freeze in the Trenches...,Lt. Ivan Molchanets peeked over a parapet of s...,https://assets.msn.com/labs/mind/AAJgNsz.html,[],"[{""Label"": ""Ukraine"", ""Type"": ""G"", ""WikidataId..."
3,N53526,health,voices,I Was An NBA Wife. Here's How It Affected My M...,"I felt like I was a fraud, and being an NBA wi...",https://assets.msn.com/labs/mind/AACk2N6.html,[],"[{""Label"": ""National Basketball Association"", ..."
4,N38324,health,medical,"How to Get Rid of Skin Tags, According to a De...","They seem harmless, but there's a very good re...",https://assets.msn.com/labs/mind/AAAKEkt.html,"[{""Label"": ""Skin tag"", ""Type"": ""C"", ""WikidataI...","[{""Label"": ""Skin tag"", ""Type"": ""C"", ""WikidataI..."


In [133]:
give_recommendations('U91836', n=10)

[('N61307',
  'Zooming in: A roundup of construction permits issued last week in Minneapolis',
  -0.00012707838361855194),
 ('N32298',
  'Zozo Championship to finish Monday because of torrential rain',
  -0.0006325997097498927),
 ('N38486',
  "Zozo Championship's 140-yard par 4 10th hole had 16 eagles",
  -0.00026311696023084783),
 ('N31306', 'Zucchini and Feta Fritters', -0.000344880499342292),
 ('N46668',
  'Zuckerberg San Francisco General asks for help identifying patient',
  6.191563215537411e-05),
 ('N21585',
  "Zuckerberg defends Facebook's currency plans before Congress",
  0.0025899345940205067),
 ('N9070',
  'eBay Find: 1964 Chevrolet Bel Air Garage Find',
  0.011761044472512628),
 ('N32753',
  'eBay Find: Actual "I Am Legend" 2007 Mustang Shelby GT500 Movie Car',
  0.00019226813438700217),
 ('N16891',
  "iOS 13's Dark Mode proven to significantly boost battery life on OLED iPhones",
  -0.00015010788545072751),
 ('N56727', 'southern_california_erupts_in_fire', 0.0005481993107

In [5]:
behaviors_dev = pd.read_csv('../../data/mind_small_dev/behaviors.tsv', sep="\t", header=None)

In [11]:
len(set(behaviors_dev[1]) & set(behaviors.user_id)), len(set(behaviors_dev[1]))

(5943, 50000)
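
Read as a share, 5943 of the 50000 dev-set readers (roughly 12%) also occur in the training data. For the remaining readers, a factorization learned from the training clicks has no embedding. A small sketch of that bookkeeping with made-up ID sets (`train_users` and `dev_users` are illustrative):

```python
# Made-up reader IDs standing in for the training and dev sets.
train_users = {'U1', 'U2', 'U3', 'U4'}
dev_users = {'U3', 'U4', 'U5', 'U6', 'U7'}

# Readers seen in both sets, and their share of the dev set.
known = dev_users & train_users
share_known = len(known) / len(dev_users)

# Readers that appear only in the dev set.
cold_start = dev_users - train_users
print(share_known, sorted(cold_start))
```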