### Conventional Approaches to Recommendation

In this notebook, we build our first recommender systems on a subset of our data. We start with a **content-based filtering model that applies tf-idf to the articles' titles and abstracts**. We then perform a **matrix factorization on the reader-article interactions**. Finally, we build a **hybrid filtering model**.

Let's first import some libraries and modules and also load the data:

In [1]:
import pandas as pd
import numpy as np
import heapq

from lightfm import LightFM
from lightfm.data import Dataset
import lightfm as lm
from lightfm import cross_validation 

import warnings
warnings.filterwarnings('ignore')

from lightfmHelper import evaluate

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel



In [3]:
behaviors = pd.read_csv("../../data/mind_small_train/behaviors_processed_small.csv")
news = pd.read_csv("../../data/mind_small_train/news_processed_small.csv")

Since we will only be working with the reading histories in this notebook (which are the same across all of a user's sessions), let's turn them into lists and **drop the duplicate sessions per user**:

In [36]:
behaviors.history = behaviors.history.str.split(' ')

In [None]:
behaviors.drop_duplicates(subset="user_id", inplace=True)

### Content based filtering with tf-idf

First, we want to give recommendations based on content. We do have some information on the articles' categories, subcategories etc., but here we want to build **a system that utilizes textual information**. Let's first combine all the text available for each article:

In [4]:
news['news_text'] = news['title'] + ' ' + news['abstract']

so that we can build a representation for every individual article based on the words that appear in it. We can use scikit-learn's `TfidfVectorizer` for this task. It not only **builds a matrix with one row per article and one column per word in our text corpus** (except for some predefined English stop words), but also **weighs each word by its frequency in the specific article against its frequency in the whole corpus**. The assumption behind this procedure is twofold: the more frequent an expression is in a *specific* document, the more it should characterize that document; and the more frequent the expression is in the *whole* corpus, the less it helps to single out the individual nature of the document.
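
To make the weighting concrete, here is a minimal pure-Python sketch on a made-up three-document corpus. It assumes the smoothed idf formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is what `TfidfVectorizer` applies with its default settings; the function name `tfidf_vectors` and the toy corpus are purely illustrative and not part of this notebook's pipeline.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalized tf-idf vectors) for tokenized docs."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed inverse document frequency (assumed scikit-learn default)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocab, vectors


# Hypothetical toy corpus, already tokenized
toy_docs = [
    "yankees win game".split(),
    "astros win game".split(),
    "sajak surgery".split(),
]
vocab, vectors = tfidf_vectors(toy_docs)
weights_doc0 = dict(zip(vocab, vectors[0]))
```

In the first toy document, "yankees" ends up with a higher weight than "win", because "win" also occurs in the second document and is therefore less distinctive.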

In [5]:
tfidf = TfidfVectorizer(stop_words='english')

In [6]:
# Cast to str so that rows with missing text (NaN) do not break the vectorizer
text_matrix = tfidf.fit_transform(news['news_text'].astype(str))

In [10]:
text_matrix.shape, news.shape[0]

((50434, 54324), 50434)

We now have a matrix that contains all our articles and represents them in a space with as many dimensions as there are distinct terms (again excluding stop words) in titles and abstracts. With this, we can construct a **similarity matrix** with the help of scikit-learn's `linear_kernel`, which calculates the dot product for every pair of articles. Because `TfidfVectorizer` L2-normalizes each row by default, these dot products are cosine similarities, so we can say **how close any two articles are linguistically**:

In [11]:
similarity_matrix = linear_kernel(text_matrix,text_matrix)
similarity_matrix.shape

(50434, 50434)
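
As a small illustration of why a plain dot product is enough here, the following sketch uses a made-up 3 x 4 matrix (not our real data): once every row has unit length, the matrix of pairwise dot products is exactly the matrix of cosine similarities.

```python
import numpy as np

# Three made-up "articles" in a four-term vocabulary (raw tf-idf-like weights)
raw = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
])

# L2-normalize every row
unit = raw / np.linalg.norm(raw, axis=1, keepdims=True)

# Dot products of unit vectors are cosine similarities
toy_similarity = unit @ unit.T
```

Every article has similarity 1 with itself (the diagonal), and articles that share no terms, like the first and the third one here, have similarity 0.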

Let's now make a mapping that enables us to **get the index in our similarity matrix from an article ID**:

In [12]:
mapping_id = pd.Series(news.index,index = news['article_id'])
mapping_id['N16909']

50429

In [19]:
article_check = news[news.article_id == 'N16909']
article_check

Unnamed: 0,article_id,category,subcategory,title,abstract,url,title_entities,abstract_entities,news_text
50429,N16909,weather,weathertopstories,"Adapting, Learning And Soul Searching: Reflect...",Woolsey Fire Anniversary: A community is forev...,https://assets.msn.com/labs/mind/BBWzQJK.html,"[{""Label"": ""Woolsey Fire"", ""Type"": ""N"", ""Wikid...","[{""Label"": ""Woolsey Fire"", ""Type"": ""N"", ""Wikid...","Adapting, Learning And Soul Searching: Reflect..."


And also one that gets an **article's title from its ID**:

In [13]:
news_for_title = news.set_index('article_id')
mapping_title = pd.Series(news_for_title.title)
mapping_title['N53526']

"I Was An NBA Wife. Here's How It Affected My Mental Health."

In [20]:
article_check_2 = news[news.article_id == 'N53526']
article_check_2

Unnamed: 0,article_id,category,subcategory,title,abstract,url,title_entities,abstract_entities,news_text
3,N53526,health,voices,I Was An NBA Wife. Here's How It Affected My M...,"I felt like I was a fraud, and being an NBA wi...",https://assets.msn.com/labs/mind/AACk2N6.html,[],"[{""Label"": ""National Basketball Association"", ...",I Was An NBA Wife. Here's How It Affected My M...


Now we want to write a function that gives us the **k linguistically closest articles**:

In [21]:
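def recommended_articles(news_id, k=10):
    """Return the IDs of the k articles most similar to `news_id`."""
    news_index = mapping_id[news_id]
    # Pair every article index with its similarity to the query article
    similarity_score = list(enumerate(similarity_matrix[news_index]))
    similarity_score = sorted(similarity_score, key=lambda x: x[1], reverse=True)
    # Skip position 0, which is assumed to be the query article itself
    similarity_score = similarity_score[1:k+1]
    news_indices = [i[0] for i in similarity_score]
    return news['article_id'].iloc[news_indices].values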


In [22]:
recommended_articles('N24217')

array(['N48828', 'N13856', 'N16308', 'N33131', 'N23206', 'N389', 'N1634',
       'N34192', 'N29952', 'N41317'], dtype=object)
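
A side note on scale: a dense 50434 x 50434 matrix of 8-byte floats takes roughly 19 GiB of memory. The sketch below shows one possible alternative under stated assumptions: it computes a single row of similarities on demand and excludes the query article explicitly instead of assuming that it sorts first. The function name `top_k_similar` and the toy matrix are hypothetical, and the input is assumed to be a dense NumPy array with unit-length rows (a sparse tf-idf matrix would need its row product converted to a dense vector first).

```python
import numpy as np


def top_k_similar(unit_rows, index, k=10):
    """Return the indices of the k rows most similar to row `index`."""
    scores = unit_rows @ unit_rows[index]   # one row of the similarity matrix
    scores[index] = -np.inf                 # exclude the query article explicitly
    k = min(k, len(scores) - 1)
    # argpartition finds the k best candidates without sorting everything
    candidates = np.argpartition(-scores, k - 1)[:k]
    return candidates[np.argsort(-scores[candidates])]


# Made-up unit-length rows standing in for tf-idf vectors
toy = np.array([
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
])
neighbours = top_k_similar(toy, index=0, k=2)

# Approximate size of the full dense similarity matrix in GiB (float64 assumed)
dense_gib = 50434 ** 2 * 8 / 2 ** 30
```

For the toy matrix, the two nearest neighbours of row 0 are rows 1 and 2, in that order.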

Let's now create a **user-lookup-table**, so that we can write a **function that gives recommendations based on actual user behavior**:

In [58]:
user_lookup = behaviors.set_index('user_id').copy()

In [65]:
def get_recs_tfidf_first(user_id):
    history = user_lookup.loc[user_id].history
    # Use only the very first article of the reading history as the seed
    used_article = history[0]
    recs_for_article = recommended_articles(used_article)
    article_titles = [mapping_title[article] for article in recs_for_article]

    # Count how many of the recommendations the user has actually read
    hits = sum(article in history for article in recs_for_article)

    print(f'The first read article of user {user_id} was: \n {mapping_title[used_article]}')
    print('_______________________________________________________')
    print('The suggested articles are:\n')
    for title in article_titles:
        print(title)
    print(hits)

In [66]:
behaviors.head(2)

Unnamed: 0,impression_id,user_id,time,history,impressions,length_history
0,1,U13740,11/11/2019 9:05:58 AM,"[N55189, N42782, N34694, N45794, N18445, N6330...",N55689-1 N35729-0,9
1,2,U91836,11/12/2019 6:11:30 PM,"[N31739, N6072, N63045, N23979, N35656, N43353...",N20678-0 N39317-0 N58114-0 N20495-0 N42977-0 N...,82


In [67]:
get_recs_tfidf_first('U13740')

The first read article of user U13740 was: 
 'Wheel Of Fortune' Guest Delivers Hilarious, Off The Rails Introduction
_______________________________________________________
The suggested articles are:

Best Response Ever From a 'Wheel of Fortune' Contestant?
Viral Wheel of Fortune Contestant and His Wife Clarify Hilarious 'Loveless Marriage' Intro
'Wheel Of Fortune' Host Pat Sajak Recovers After Surgery
'Wheel Of Fortune' Host Pat Sajak Undergoes Emergency Surgery; Vanna White Hosts Temporarily
Wheel Of Fortune's Pat Sajak Undergoes 'Successful Emergency Surgery'
Wheel Of Fortune's Pat Sajak Says 'Worst Has Passed' After Emergency Surgery Last Week
Pat Sajak recovering from emergency surgery
'Wheel of Fortune' fans can't believe all three contestants missed puzzle
Wheel of Fortune's Pat Sajak Says the 'Worst Has Passed' Following Emergency Intestine Surgery
ICYMI: The week in TV news for Oct. 13-19, 2019
0


Our function only used the first article in the user's reading history and, as you can see, it **heavily emphasizes (and almost certainly overestimates) the user's interest in Wheel of Fortune topics**. Maybe we could do better at capturing our readers' interests by giving recommendations based on multiple articles. Let's try it with three:

In [111]:
def get_recs_tfidf_first_three(user_id):
    history = user_lookup.loc[user_id].history
    # Seed the recommendations with the first three articles of the history
    used_articles = list(history[:3])

    # Collect the 3 nearest neighbours of every seed article in one flat list
    recs_for_articles_list = []
    for article in used_articles:
        recs_for_articles_list.extend(recommended_articles(article, k=3))

    article_titles = [mapping_title[article] for article in recs_for_articles_list]

    # Count how many of the recommendations the user has actually read
    hits = sum(article in history for article in recs_for_articles_list)

    print(f'The first read articles of user {user_id} were: \n {mapping_title[used_articles]}')
    print('_______________________________________________________')
    print('The suggested articles are:\n')
    for title in article_titles:
        print(title)
    print(hits)

In [112]:
get_recs_tfidf_first_three('U13740')

The first read articles of user U13740 were: 
 article_id
N55189    'Wheel Of Fortune' Guest Delivers Hilarious, O...
N42782    Three takeaways from Yankees' ALCS Game 5 vict...
N34694    Rosie O'Donnell: Barbara Walters Isn't 'Up to ...
Name: title, dtype: object
_______________________________________________________
The suggested articles are:

Best Response Ever From a 'Wheel of Fortune' Contestant?
Viral Wheel of Fortune Contestant and His Wife Clarify Hilarious 'Loveless Marriage' Intro
'Wheel Of Fortune' Host Pat Sajak Recovers After Surgery
Yankees stay alive with 4-1 win against Astros
Three takeaways from Astros' Game 3 ALCS win over Yankees
ALCS Game 6 Thread: Yankees at Astros
Lawrence O'Donnell answers your impeachment questions
3 Dead After Multi-Vehicle Crash Sparks 2-Acre Brush Fire in Santa Barbara
Barbara Nicklaus honored with PGA's Distinguished Service Award
0


Now, with the second article in user U13740's reading history being Yankees and baseball related, it ***could* be the case that we captured some long-term interests of this person**. But this should rather be regarded as a coincidence: maybe the person isn't even a Yankees or baseball fan and would hate us if we treated her like one by always suggesting Yankees or baseball related content!

As we can see, **this approach is quite sensitive to the individual articles it is seeded with, even when we use several of them**. We should not expect this method to track user interests reliably. An even bigger problem with news consumption is that most people neither need nor want to read the same *news* again and again. Also, given that we are dealing with news, it would be very important to **filter articles based on time and day of publication**. Unfortunately, this information is *not* provided in the Microsoft News Dataset, so we have to work out different strategies.

In the following, we will pursue a matrix factorization approach, where we can make use of *all* the articles a user read together with all the articles all of the other users read.


### Collaborative Filtering with Matrix Factorization

The following **collaborative filtering approach** takes into account all the readers and all the articles together. If we were to construct a matrix out of all the interactions, we would get a very sparse matrix, given that every user has read only a tiny fraction of all articles. The goal of a **matrix factorization technique** is to 'learn' two embedding matrices, one for the readers and one for the articles. One dimension of each matrix equals the number of readers or articles, the other an arbitrarily chosen (and thus tunable) number of latent factors.

Thus, if we had 10 readers and 5 articles and assumed we needed 3 latent factors (which could represent implicit but substantive differences in our reader/article base), **our method would calculate two matrices (a 10 by 3 matrix for the readers and a 3 by 5 matrix for the articles) whose matrix product yields a new matrix the size of our original one (10 x 5) and *approximates* the original matrix as closely as possible**. This optimization problem is typically solved by stochastic gradient descent (although there are, of course, other possibilities). From a once extremely sparse matrix, we get a densely populated table which now contains information on whether some reader might be more or less inclined to read certain articles.
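
The 10 x 5 example can be sketched in a few lines of NumPy. This is a deliberately simple illustration under stated assumptions: it minimizes a plain squared error with stochastic gradient descent and treats missing interactions as zeros, which is not the WARP ranking loss we use with LightFM further down. The toy matrix `R`, the learning rate and the number of epochs are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_readers, n_articles, n_factors = 10, 5, 3

# Toy interaction matrix: 1 = reader clicked the article, 0 = nothing observed
R = (rng.random((n_readers, n_articles)) < 0.3).astype(float)

# Small random initial embeddings
P = 0.1 * rng.standard_normal((n_readers, n_factors))   # readers, 10 x 3
Q = 0.1 * rng.standard_normal((n_factors, n_articles))  # articles, 3 x 5


def squared_error(R, P, Q):
    """Total squared difference between R and its approximation P @ Q."""
    return float(((R - P @ Q) ** 2).sum())


initial_error = squared_error(R, P, Q)
learning_rate = 0.05
for epoch in range(200):
    for u in range(n_readers):
        for i in range(n_articles):
            err = R[u, i] - P[u] @ Q[:, i]
            p_old = P[u].copy()
            # Move both embeddings a small step against the error gradient
            P[u] += learning_rate * err * Q[:, i]
            Q[:, i] += learning_rate * err * p_old

final_error = squared_error(R, P, Q)
R_hat = P @ Q  # dense 10 x 5 approximation of R
```

After training, `R_hat` has a score in every cell, including the cells for which no interaction was observed.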

The approach might sound a bit dry and mathematical at first, but **with the embeddings we actually learn lower-dimensional representations of our readers and articles** and can thereby determine *resemblances in preferences*. If you ever wondered how Amazon or Google knew what you were interested in before you even searched for it: here you go!

We could do the matrix factorization manually, for instance with scikit-learn's `TruncatedSVD`, but here we use the **LightFM library**. Its main purpose is hybrid filtering, but it **reduces to a matrix factorization when supplied only with user/article interactions**.

In [115]:
uai = pd.read_csv('../../data/mind_small_train/small_train.csv')

In [138]:
uai.head()

Unnamed: 0,user_id,article_id,user_int_id,article_int_id
0,U13740,N55189,1810,24758
1,U13740,N42782,1810,17976
2,U13740,N34694,1810,13534
3,U13740,N45794,1810,19650
4,U13740,N18445,1810,4608


Here we have all the interactions, together with integer IDs corresponding to the original string IDs, which we pass to LightFM as inputs:

In [119]:
dataset_cf = Dataset()
dataset_cf.fit(uai['user_int_id'], uai['article_int_id'])

In [120]:
uai_array = uai.to_numpy()

In [121]:
interactions, weights = dataset_cf.build_interactions(
    (ua[2], ua[3]) for ua in uai_array
)
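
One detail worth keeping in mind: `Dataset` translates whatever IDs it is given into its own consecutive internal indices, and the fitted model works with those internal indices. To our knowledge, the indices are handed out in order of first appearance. The sketch below imitates that behavior with plain dictionaries; the helper `build_index_mapping` is hypothetical and does not call LightFM itself.

```python
def build_index_mapping(raw_ids):
    """Assign consecutive indices to raw IDs in order of first appearance."""
    mapping = {}
    for raw_id in raw_ids:
        mapping.setdefault(raw_id, len(mapping))
    return mapping


# The first user and article integer IDs as shown in `uai.head()` above
toy_user_ids = [1810, 1810, 1810, 1810, 1810]
toy_article_ids = [24758, 17976, 13534, 19650, 4608]

user_index = build_index_mapping(toy_user_ids)
article_index = build_index_mapping(toy_article_ids)
```

Under this assumption, the user with integer ID 1810 would receive internal index 0, not 1810. With the real objects, the corresponding dictionaries should be available through `dataset_cf.mapping()`, whose first and third return values are the user and item mappings.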

In [122]:
train, test = cross_validation.random_train_test_split(
    interactions, test_percentage=0.5, 
    random_state=np.random.RandomState(42)
)

With these inputs, we can now build a model:

In [124]:
loss = 'warp'
no_components = 20
epochs = 20

In [125]:
model_cf = LightFM(no_components=no_components, loss=loss)
model_cf.fit(train, epochs=epochs)

<lightfm.lightfm.LightFM at 0x7fc3334ea978>

In [126]:
result = evaluate(model_cf, train, test)

The AUC Score is in training/validation:                  0.98623747  /  0.906224
The mean precision at k Score in training/validation is:  0.058934964  /  0.03676688
The mean reciprocal rank in training/validation is:       0.19215415  /  0.12714417
_________________________________________________________


Okay, so this doesn't look too bad (the AUC score actually looks very high), although our model seems to be overfitting quite a bit. In order to later compare the matrix factorization approach with deep learning models, we will **test our ranking scores by ranking one known positive interaction among 99 known negative ones**. We have the data prepared already, so we only have to load it and write an evaluation function:

In [127]:
cf_result = result

In [129]:
test_filename = "../../data/mind_small_train/small_test.csv"
test_positives = []

with open(test_filename, "r") as f:
    header = f.readline()
    print(header)
    line = f.readline()
    print(line)
    while line:  # readline() returns '' at the end of the file
        line_list = line.split(",")
        # Columns 2 and 3 hold the integer user and article IDs
        user, article = int(line_list[2]), int(line_list[3])
        test_positives.append([user, article])
        line = f.readline()

user_id,article_id,user_int_id,article_int_id

U13740,N31801,1810,11956



In [131]:
test_neg_filename = "../../data/mind_small_train/small_test_negatives.tsv"
test_negatives = []

with open(test_neg_filename, "r") as f:
    line = f.readline()
    while line:  # readline() returns '' at the end of the file
        line_list = line.split("\t")
        # Skip the first field; the remaining fields are negative article IDs
        negatives = [int(neg) for neg in line_list[1:]]
        test_negatives.append(negatives)
        line = f.readline()

In [139]:
K = 100

In [140]:
def eval_one_rating(idx, model):
    user, pos_item = test_positives[idx]
    # Candidate list: the sampled negatives plus the held-out positive article
    items = test_negatives[idx] + [pos_item]

    # Score every candidate for this user
    user_array = np.full(len(items), user, dtype='int32')
    predictions = model.predict(user_array, np.array(items))
    map_item_score = dict(zip(items, predictions))

    # Rank the candidates by score and keep the top K
    ranklist = heapq.nlargest(K, map_item_score, key=map_item_score.get)

    if pos_item in ranklist:
        i = ranklist.index(pos_item)  # 0-based rank of the positive article
        hr = 1
        ndcg = np.log(2) / np.log(i + 2)
        rr = 1 / (i + 1)
    else:
        hr, ndcg, rr = 0, 0, 0

    return (hr, ndcg, rr)

In [141]:
hits, ndcgs, rrs = [], [], []
for idx in range(len(test_positives)):
    hr, ndcg, rr = eval_one_rating(idx, model_cf)
    hits.append(hr)
    ndcgs.append(ndcg)
    rrs.append(rr)

In [142]:
hr = np.array(hits).mean()
mrr = np.array(rrs).mean()
ndcg = np.array(ndcgs).mean()

In [143]:
hr, mrr, ndcg

(1.0, 0.057148300041675176, 0.21417640314079628)
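
To put these numbers into perspective, it helps to know what a model without any ranking ability would score. The following sketch computes the expected metrics if the positive article landed on each of the 100 positions with equal probability. It assumes exactly 99 negatives per positive, as described above; the helper `metrics_at_rank` is hypothetical and mirrors the formulas in `eval_one_rating`.

```python
import math


def metrics_at_rank(rank, k):
    """HR, NDCG and reciprocal rank for one positive at 0-based `rank`."""
    if rank >= k:
        return 0, 0.0, 0.0
    return 1, math.log(2) / math.log(rank + 2), 1 / (rank + 1)


n_candidates = 100  # 1 positive + 99 negatives (assumed)
k = 100

# If the scores carried no information, every rank would be equally likely
per_rank = [metrics_at_rank(rank, k) for rank in range(n_candidates)]
random_hr = sum(m[0] for m in per_rank) / n_candidates
random_ndcg = sum(m[1] for m in per_rank) / n_candidates
random_mrr = sum(m[2] for m in per_rank) / n_candidates
```

Because K equals the number of candidates, the hit ratio of such a random ranker is already 1.0, so only MRR and NDCG can tell models apart in this setup. A smaller K, such as 10, would make the hit ratio informative as well.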

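Okay, so this will serve as a baseline for our upcoming deep learning techniques, with two caveats. First, with K = 100 and (assuming 99 negatives per positive) exactly 100 candidates per test case, the hit ratio is 1.0 by construction and carries no information. Second, MRR and NDCG are close to what a random ordering of 100 candidates would yield (roughly 0.05 and 0.21), so it is worth verifying that the integer IDs passed to `model.predict` coincide with the internal indices that LightFM's `Dataset` assigned during fitting. Now let's also use LightFM's hybrid capabilities.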

### A Hybrid Model

We can include **additional information concerning users and/or articles** when using a hybrid factorization model like LightFM. Actually, that's what these algorithms are made for: up until now, **we can't really make any recommendations for new users or articles**. Obviously, this **cold start problem** is a big issue in *news* recommendation as well! We won't tackle this problem at this point though, because we only want to demonstrate how to use LightFM as a hybrid. Also, the hybridization won't affect our customized testing procedure, which only evaluates rankings for held-out interactions of users and articles the model has already seen. Later on, we will also show how the cold start problem could be tackled using recurrent neural networks.

Right now, let's just imagine we wanted to exploit the available information on news categories (potentially, we could also make up some differentiating features for users, e.g. taking the extent of their reading history into account).
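
The idea behind the hybrid can be sketched with NumPy. As we understand LightFM's model, every feature gets its own embedding and an article is represented by the sum of the embeddings of its features, where each article has an identity feature plus any metadata features such as its category. The numbers below are random stand-ins for learned embeddings, all names are made up for this illustration, and the sketch ignores the row normalization that `Dataset` can apply to feature weights.

```python
import numpy as np

rng = np.random.default_rng(0)
n_articles, n_categories, n_factors = 4, 2, 3

# One embedding per feature: 4 identity features followed by 2 category features
feature_embeddings = rng.standard_normal((n_articles + n_categories, n_factors))

# Feature matrix: each article has its own identity feature plus one category
identity = np.eye(n_articles)
category = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
article_features = np.hstack([identity, category])

# An article's representation is the sum of the embeddings of its features
article_embeddings = article_features @ feature_embeddings

# A brand-new article without interactions still gets a vector from its category
new_article_features = np.array([[0, 0, 0, 0, 0, 1]], dtype=float)
new_article_embedding = new_article_features @ feature_embeddings
```

This is why metadata helps with cold starts: the new article has no identity embedding of its own, but it inherits whatever the model has learned about its category.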

In [145]:
news = pd.read_csv("../../data/mind_small_train/news_processed.csv")

In [146]:
news_categories = news.category.unique().tolist()

In [147]:
news.head(2)

Unnamed: 0,article_id,category,subcategory,title,abstract,url,title_entities,abstract_entities
0,N55528,lifestyle,lifestyleroyals,"The Brands Queen Elizabeth, Prince Charles, an...","Shop the notebooks, jackets, and more that the...",https://assets.msn.com/labs/mind/AAGH0ET.html,"[{""Label"": ""Prince Philip, Duke of Edinburgh"",...",[]
1,N19639,health,weightloss,50 Worst Habits For Belly Fat,These seemingly harmless habits are holding yo...,https://assets.msn.com/labs/mind/AAB19MK.html,"[{""Label"": ""Adipose tissue"", ""Type"": ""C"", ""Wik...","[{""Label"": ""Adipose tissue"", ""Type"": ""C"", ""Wik..."


In [148]:
news_categories

['lifestyle',
 'health',
 'news',
 'sports',
 'weather',
 'entertainment',
 'autos',
 'travel',
 'foodanddrink',
 'tv',
 'finance',
 'movies',
 'video',
 'music',
 'kids',
 'middleeast',
 'northamerica']

In [149]:
# Map every article ID to its category, selecting the columns by name
article_cat_dict = dict(zip(news.article_id, news.category))
    
for art in ['N2325787', 'N117002']:
    article_cat_dict[art] = "none"

In [150]:
news_categories.append("none")

In [152]:
article_categories = [article_cat_dict[art] for art in uai.article_id]

In [155]:
dataset_hybrid = Dataset()
dataset_hybrid.fit(uai['user_int_id'], 
                   uai['article_int_id'],
                   item_features=news_categories)

In [157]:
item_features = dataset_hybrid.build_item_features(
    (art_id, [art_category]) for art_id, art_category 
    in zip(uai.article_int_id, article_categories))

In [158]:
interactions_hybrid, weights_hybrid = dataset_hybrid.build_interactions(
    (ua[2], ua[3]) for ua in uai_array)

In [159]:
train_hybrid, test_hybrid = cross_validation.random_train_test_split(
    interactions_hybrid, test_percentage=0.5,
    random_state=np.random.RandomState(42))

In [160]:
model_hybrid = LightFM(no_components=no_components, 
                       loss=loss,
                       item_alpha=0.0001)

model_hybrid.fit(train_hybrid, 
                 item_features=item_features,
                 epochs=epochs)

<lightfm.lightfm.LightFM at 0x7fc345e82c88>

In [161]:
result_hybrid = evaluate(model_hybrid, train_hybrid, test_hybrid, 
                         hybrid=True, features=item_features)

The AUC Score is in training/validation:                  0.947942  /  0.80881274
The mean precision at k Score in training/validation is:  0.047876682  /  0.022506196
The mean reciprocal rank in training/validation is:       0.19671887  /  0.090290435
_________________________________________________________


Apparently, our hybrid model is overfitting the data even more than the one based solely on interactions. We could, for instance, **experiment with stronger regularization via `item_alpha`**, but for now we move on and pursue a neural collaborative filtering approach in the next notebook!