### Intro

Data analysis for the NLP capstone project of the Upgrad Data Science course.

Code committed to: https://github.com/kavurisrikanth/news-recommender-capstone

### The Basics - Loading data

In [1]:
import pandas as pd
import numpy as np

import plotly.express as px

In [2]:
txns = pd.read_csv('../data/consumer_transanctions.csv')
cnt = pd.read_csv('../data/platform_content.csv')

In [3]:
txns.head()

Unnamed: 0,event_timestamp,interaction_type,item_id,consumer_id,consumer_session_id,consumer_device_info,consumer_location,country
0,1465413032,content_watched,-3499919498720038879,-8845298781299428018,1264196770339959068,,,
1,1465412560,content_watched,8890720798209849691,-1032019229384696495,3621737643587579081,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2...,NY,US
2,1465416190,content_watched,310515487419366995,-1130272294246983140,2631864456530402479,,,
3,1465413895,content_followed,310515487419366995,344280948527967603,-3167637573980064150,,,
4,1465412290,content_watched,-7820640624231356730,-445337111692715325,5611481178424124714,,,


In [4]:
cnt.head()

Unnamed: 0,event_timestamp,interaction_type,item_id,producer_id,producer_session_id,producer_device_info,producer_location,producer_country,item_type,item_url,title,text_description,language
0,1459192779,content_pulled_out,-6451309518266745024,4340306774493623681,8940341205206233829,,,,HTML,http://www.nytimes.com/2016/03/28/business/dea...,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,en
1,1459193988,content_present,-4110354420726924665,4340306774493623681,8940341205206233829,,,,HTML,http://www.nytimes.com/2016/03/28/business/dea...,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,en
2,1459194146,content_present,-7292285110016212249,4340306774493623681,8940341205206233829,,,,HTML,http://cointelegraph.com/news/bitcoin-future-w...,Bitcoin Future: When GBPcoin of Branson Wins O...,The alarm clock wakes me at 8:00 with stream o...,en
3,1459194474,content_present,-6151852268067518688,3891637997717104548,-1457532940883382585,,,,HTML,https://cloudplatform.googleblog.com/2016/03/G...,Google Data Center 360° Tour,We're excited to share the Google Data Center ...,en
4,1459194497,content_present,2448026894306402386,4340306774493623681,8940341205206233829,,,,HTML,https://bitcoinmagazine.com/articles/ibm-wants...,"IBM Wants to ""Evolve the Internet"" With Blockc...",The Aite Group projects the blockchain market ...,en


### Data preparation

In [5]:
cnt.item_type.value_counts()

HTML     3101
VIDEO      11
RICH       10
Name: item_type, dtype: int64

Some articles are not strictly text-based, so check whether they still have text descriptions.

In [6]:
cnt.text_description.isna().sum()

0

All articles have text descriptions, so content-based recommendations are feasible for every item.
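
Note that `isna()` only counts missing values; an empty or whitespace-only description would still pass the check above. A minimal sketch of a stricter check follows, assuming a small made-up frame (`toy` is hypothetical and not part of the dataset):

```python
import pandas as pd

# Hypothetical stand-in for cnt; only the column name matches the notebook.
toy = pd.DataFrame({"text_description": ["Some article text", "   ", "", None]})

# Missing values, as counted by isna().
n_missing = toy["text_description"].isna().sum()

# Blank strings: fill NaN first so .str.strip() applies to every row.
is_blank = toy["text_description"].fillna("").str.strip().eq("")
n_blank_or_missing = is_blank.sum()

print(n_missing, n_blank_or_missing)
```

Running the same two lines against `cnt` would show whether any description is present but empty.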

In [7]:
not_text = cnt[(cnt.item_type != 'HTML') & (cnt.language == 'en')]

In [8]:
not_text.head()

Unnamed: 0,event_timestamp,interaction_type,item_id,producer_id,producer_session_id,producer_device_info,producer_location,producer_country,item_type,item_url,title,text_description,language
118,1459423815,content_present,-254088699629065171,4340306774493623681,3903381901308718595,,,,RICH,https://soundcloud.com/epicenterbitcoin/eb-124,EB124 - Rune Christensen: Maker Dao Ethereum's...,"Support the show, consider donating: 1GW6t1vzH...",en
319,1460379355,content_present,7707640607626518697,-4243635261966794110,1881702425778279387,,,,VIDEO,https://www.ted.com/talks/linus_torvalds_the_m...,Linus Torvalds: The mind behind Linux,Linus Torvalds transformed technology twice --...,en
357,1460484544,content_present,5688279681867464747,3375381077362025672,4718359416970444168,,,,VIDEO,https://www.ted.com/talks/margaret_gould_stewa...,Margaret Gould Stewart: How giant websites des...,"Facebook's ""like"" and ""share"" buttons are seen...",en
451,1460854706,content_present,5379671084978512851,-8020832670974472349,1759315806103391579,,,,VIDEO,http://www.ted.com/talks/linus_torvalds_the_mi...,Linus Torvalds: The mind behind Linux,Linus Torvalds transformed technology twice --...,en
496,1461159850,content_present,-5315378314308323942,490109768671667408,-1480333772626639660,,,,RICH,https://itunes.apple.com/br/course/developing-...,Developing iOS 9 Apps with Swift - Curso gráti...,Updated for iOS 9 and Swift. Tools and APIs re...,en


In [9]:
not_text.shape

(12, 13)

In [10]:
cnt.shape

(3122, 13)

Only 21 of the 3,122 content entries are not of type HTML, and just 12 of those are in English.

In [11]:
txns[txns['item_id'].isin(not_text.item_id)]

Unnamed: 0,event_timestamp,interaction_type,item_id,consumer_id,consumer_session_id,consumer_device_info,consumer_location,country
80,1460568722,content_watched,5688279681867464747,-4585796377251906117,-781311598216662665,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4...,SP,BR
116,1460568041,content_watched,5688279681867464747,-108842214936804958,-9137723263631808218,Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKi...,SP,BR
123,1460567995,content_watched,5688279681867464747,-108842214936804958,-9137723263631808218,Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKi...,SP,BR
124,1460568099,content_commented_on,5688279681867464747,-108842214936804958,-9137723263631808218,,,
1153,1465862100,content_watched,-3134743773662773628,-1443636648652872475,-3237684801374470717,Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53...,SP,BR
...,...,...,...,...,...,...,...,...
64220,1480964845,content_watched,-78667914647336721,301435144665447655,9023879016298890164,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1...,MG,BR
64715,1480528692,content_watched,-78667914647336721,1262852631026172055,-7650127696526090326,Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebK...,SP,BR
66485,1484325357,content_watched,-78667914647336721,-4998109382710136565,-4461239032136160843,Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl...,SP,BR
68648,1485533343,content_watched,-78667914647336721,-4998109382710136565,8985952401743307581,Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl...,SP,BR


In [12]:
txns.shape

(72312, 8)

The number of transactions on such articles is also minuscule.
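
To put a number on this, the share of transactions that touch non-HTML items can be computed from a boolean mask. This is a sketch on made-up frames (`toy_cnt` and `toy_txns` are hypothetical; only the column names mirror the notebook):

```python
import pandas as pd

# Hypothetical toy frames; only the column names mirror the notebook.
toy_cnt = pd.DataFrame({"item_id": [10, 11, 12],
                        "item_type": ["HTML", "VIDEO", "HTML"]})
toy_txns = pd.DataFrame({"item_id": [10, 10, 11, 12, 12, 12, 10, 12]})

non_html_ids = toy_cnt.loc[toy_cnt["item_type"] != "HTML", "item_id"]

# The mean of a boolean mask is the fraction of True values.
share = toy_txns["item_id"].isin(non_html_ids).mean()
print(f"{share:.1%} of toy transactions touch non-HTML items")
```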

#### Drop unnecessary columns

In [13]:
# Drop country, consumer_location, consumer_device_info, consumer_session_id from txns
txns.drop(columns=['country', 'consumer_location', 'consumer_device_info', 'consumer_session_id'], inplace=True)

In [14]:
# Drop producer_id, producer_session_id, producer_device_info, producer_location, producer_country from cnt
cnt.drop(columns=['producer_id', 'producer_session_id', 'producer_device_info', 'producer_location', 'producer_country'], inplace=True)

In [15]:
content = cnt

In [16]:
content.head()

Unnamed: 0,event_timestamp,interaction_type,item_id,item_type,item_url,title,text_description,language
0,1459192779,content_pulled_out,-6451309518266745024,HTML,http://www.nytimes.com/2016/03/28/business/dea...,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,en
1,1459193988,content_present,-4110354420726924665,HTML,http://www.nytimes.com/2016/03/28/business/dea...,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,en
2,1459194146,content_present,-7292285110016212249,HTML,http://cointelegraph.com/news/bitcoin-future-w...,Bitcoin Future: When GBPcoin of Branson Wins O...,The alarm clock wakes me at 8:00 with stream o...,en
3,1459194474,content_present,-6151852268067518688,HTML,https://cloudplatform.googleblog.com/2016/03/G...,Google Data Center 360° Tour,We're excited to share the Google Data Center ...,en
4,1459194497,content_present,2448026894306402386,HTML,https://bitcoinmagazine.com/articles/ibm-wants...,"IBM Wants to ""Evolve the Internet"" With Blockc...",The Aite Group projects the blockchain market ...,en


#### Remove all docs that are not in English

In [17]:
content.language.value_counts()

en    2264
pt     850
la       4
es       2
ja       2
Name: language, dtype: int64

In [18]:
content.shape

(3122, 8)

In [19]:
content = content[content['language'] == 'en']

In [20]:
content.shape

(2264, 8)

#### Handle articles with duplicated entries

In [21]:
no_dups = content.sort_values('event_timestamp').drop_duplicates(subset=['title', 'text_description'], keep='last')

In [22]:
no_dups.head()

Unnamed: 0,event_timestamp,interaction_type,item_id,item_type,item_url,title,text_description,language
1,1459193988,content_present,-4110354420726924665,HTML,http://www.nytimes.com/2016/03/28/business/dea...,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,en
2,1459194146,content_present,-7292285110016212249,HTML,http://cointelegraph.com/news/bitcoin-future-w...,Bitcoin Future: When GBPcoin of Branson Wins O...,The alarm clock wakes me at 8:00 with stream o...,en
3,1459194474,content_present,-6151852268067518688,HTML,https://cloudplatform.googleblog.com/2016/03/G...,Google Data Center 360° Tour,We're excited to share the Google Data Center ...,en
4,1459194497,content_present,2448026894306402386,HTML,https://bitcoinmagazine.com/articles/ibm-wants...,"IBM Wants to ""Evolve the Internet"" With Blockc...",The Aite Group projects the blockchain market ...,en
5,1459194522,content_present,-2826566343807132236,HTML,http://www.coindesk.com/ieee-blockchain-oxford...,IEEE to Talk Blockchain at Cloud Computing Oxf...,One of the largest and oldest organizations fo...,en


In [23]:
no_dups.reset_index(inplace=True)

In [24]:
no_dups.interaction_type.value_counts()

content_present       2153
content_pulled_out      38
Name: interaction_type, dtype: int64

In [25]:
no_dups[no_dups['title'] == "Ethereum, a Virtual Currency, Enables Transactions That Rival Bitcoin's"]

Unnamed: 0,index,event_timestamp,interaction_type,item_id,item_type,item_url,title,text_description,language
0,1,1459193988,content_present,-4110354420726924665,HTML,http://www.nytimes.com/2016/03/28/business/dea...,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,en


In [26]:
content[content['title'] == "Ethereum, a Virtual Currency, Enables Transactions That Rival Bitcoin's"]

Unnamed: 0,event_timestamp,interaction_type,item_id,item_type,item_url,title,text_description,language
0,1459192779,content_pulled_out,-6451309518266745024,HTML,http://www.nytimes.com/2016/03/28/business/dea...,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,en
1,1459193988,content_present,-4110354420726924665,HTML,http://www.nytimes.com/2016/03/28/business/dea...,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,en


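The entry kept in the de-duplicated DataFrame is the one with the newer timestamp (content_present). This is expected, because keep='last' is applied after sorting by event_timestamp in ascending order.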
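
How `keep='last'` interacts with the sort order can be checked on a toy frame. The data below is made up for illustration; only the column names follow the notebook:

```python
import pandas as pd

# Hypothetical toy data: article "A" appears twice with different timestamps.
toy = pd.DataFrame({
    "event_timestamp": [100, 200, 150],
    "title": ["A", "A", "B"],
    "text_description": ["a-text", "a-text", "b-text"],
    "interaction_type": ["content_pulled_out", "content_present", "content_present"],
})

# Ascending sort + keep="last" retains the most recent row per article.
latest = (
    toy.sort_values("event_timestamp")
       .drop_duplicates(subset=["title", "text_description"], keep="last")
)
print(latest)
```

Switching to `keep='first'` would retain the earliest row per article instead.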

In [27]:
cnt = no_dups

#### Introduce keywords

In [28]:
# %pip install gensim

In [29]:
from gensim.utils import simple_preprocess

In [30]:
cnt['text_description_preprocessed'] = cnt['text_description'].apply(lambda x: simple_preprocess(x, deacc=True))

In [31]:
cnt.head()

Unnamed: 0,index,event_timestamp,interaction_type,item_id,item_type,item_url,title,text_description,language,text_description_preprocessed
0,1,1459193988,content_present,-4110354420726924665,HTML,http://www.nytimes.com/2016/03/28/business/dea...,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,en,"[all, of, this, work, is, still, very, early, ..."
1,2,1459194146,content_present,-7292285110016212249,HTML,http://cointelegraph.com/news/bitcoin-future-w...,Bitcoin Future: When GBPcoin of Branson Wins O...,The alarm clock wakes me at 8:00 with stream o...,en,"[the, alarm, clock, wakes, me, at, with, strea..."
2,3,1459194474,content_present,-6151852268067518688,HTML,https://cloudplatform.googleblog.com/2016/03/G...,Google Data Center 360° Tour,We're excited to share the Google Data Center ...,en,"[we, re, excited, to, share, the, google, data..."
3,4,1459194497,content_present,2448026894306402386,HTML,https://bitcoinmagazine.com/articles/ibm-wants...,"IBM Wants to ""Evolve the Internet"" With Blockc...",The Aite Group projects the blockchain market ...,en,"[the, aite, group, projects, the, blockchain, ..."
4,5,1459194522,content_present,-2826566343807132236,HTML,http://www.coindesk.com/ieee-blockchain-oxford...,IEEE to Talk Blockchain at Cloud Computing Oxf...,One of the largest and oldest organizations fo...,en,"[one, of, the, largest, and, oldest, organizat..."


In [32]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stopwords_en = stopwords.words('english')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ksrs9\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [33]:
cnt['text_description_no_stopwords'] = cnt['text_description_preprocessed'].apply(lambda x: [word for word in x if word not in stopwords_en])

In [34]:
cnt.head()

Unnamed: 0,index,event_timestamp,interaction_type,item_id,item_type,item_url,title,text_description,language,text_description_preprocessed,text_description_no_stopwords
0,1,1459193988,content_present,-4110354420726924665,HTML,http://www.nytimes.com/2016/03/28/business/dea...,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,en,"[all, of, this, work, is, still, very, early, ...","[work, still, early, first, full, public, vers..."
1,2,1459194146,content_present,-7292285110016212249,HTML,http://cointelegraph.com/news/bitcoin-future-w...,Bitcoin Future: When GBPcoin of Branson Wins O...,The alarm clock wakes me at 8:00 with stream o...,en,"[the, alarm, clock, wakes, me, at, with, strea...","[alarm, clock, wakes, stream, advert, free, br..."
2,3,1459194474,content_present,-6151852268067518688,HTML,https://cloudplatform.googleblog.com/2016/03/G...,Google Data Center 360° Tour,We're excited to share the Google Data Center ...,en,"[we, re, excited, to, share, the, google, data...","[excited, share, google, data, center, tour, y..."
3,4,1459194497,content_present,2448026894306402386,HTML,https://bitcoinmagazine.com/articles/ibm-wants...,"IBM Wants to ""Evolve the Internet"" With Blockc...",The Aite Group projects the blockchain market ...,en,"[the, aite, group, projects, the, blockchain, ...","[aite, group, projects, blockchain, market, co..."
4,5,1459194522,content_present,-2826566343807132236,HTML,http://www.coindesk.com/ieee-blockchain-oxford...,IEEE to Talk Blockchain at Cloud Computing Oxf...,One of the largest and oldest organizations fo...,en,"[one, of, the, largest, and, oldest, organizat...","[one, largest, oldest, organizations, computin..."


In [35]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [36]:
cnt['text_description_lemmatized'] = cnt['text_description_no_stopwords'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])

In [37]:
cnt.head()

Unnamed: 0,index,event_timestamp,interaction_type,item_id,item_type,item_url,title,text_description,language,text_description_preprocessed,text_description_no_stopwords,text_description_lemmatized
0,1,1459193988,content_present,-4110354420726924665,HTML,http://www.nytimes.com/2016/03/28/business/dea...,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,en,"[all, of, this, work, is, still, very, early, ...","[work, still, early, first, full, public, vers...","[work, still, early, first, full, public, vers..."
1,2,1459194146,content_present,-7292285110016212249,HTML,http://cointelegraph.com/news/bitcoin-future-w...,Bitcoin Future: When GBPcoin of Branson Wins O...,The alarm clock wakes me at 8:00 with stream o...,en,"[the, alarm, clock, wakes, me, at, with, strea...","[alarm, clock, wakes, stream, advert, free, br...","[alarm, clock, wake, stream, advert, free, bro..."
2,3,1459194474,content_present,-6151852268067518688,HTML,https://cloudplatform.googleblog.com/2016/03/G...,Google Data Center 360° Tour,We're excited to share the Google Data Center ...,en,"[we, re, excited, to, share, the, google, data...","[excited, share, google, data, center, tour, y...","[excited, share, google, data, center, tour, y..."
3,4,1459194497,content_present,2448026894306402386,HTML,https://bitcoinmagazine.com/articles/ibm-wants...,"IBM Wants to ""Evolve the Internet"" With Blockc...",The Aite Group projects the blockchain market ...,en,"[the, aite, group, projects, the, blockchain, ...","[aite, group, projects, blockchain, market, co...","[aite, group, project, blockchain, market, cou..."
4,5,1459194522,content_present,-2826566343807132236,HTML,http://www.coindesk.com/ieee-blockchain-oxford...,IEEE to Talk Blockchain at Cloud Computing Oxf...,One of the largest and oldest organizations fo...,en,"[one, of, the, largest, and, oldest, organizat...","[one, largest, oldest, organizations, computin...","[one, largest, oldest, organization, computing..."


In [38]:
# Drop the columns we don't need anymore
cnt.drop(['text_description_preprocessed', 'text_description_no_stopwords'], axis=1, inplace=True)
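
The intermediate columns are only needed for inspection, so the tokenise, stopword-removal and lemmatise steps could also be composed into one function. The sketch below is dependency-free on purpose: `toy_tokenize` and `TOY_STOPWORDS` are hypothetical stand-ins for gensim's `simple_preprocess` and NLTK's English stopword list, and the default `lemmatize` is the identity rather than a real lemmatizer.

```python
import re

# Hypothetical stand-ins so the sketch runs without gensim or NLTK data.
TOY_STOPWORDS = {"the", "of", "is", "a", "to", "and", "this", "all"}

def toy_tokenize(text):
    # Lowercase alphabetic tokens only, roughly what simple_preprocess keeps.
    return re.findall(r"[a-z]+", text.lower())

def to_keywords(text, tokenize=toy_tokenize, stopwords=TOY_STOPWORDS,
                lemmatize=lambda word: word):
    # Tokenise, drop stopwords, then lemmatise each remaining token.
    return [lemmatize(tok) for tok in tokenize(text) if tok not in stopwords]

keywords = to_keywords("All of this work is still very early.")
print(keywords)
```

In the notebook's setting, one would presumably pass `simple_preprocess`, `stopwords_en` and `lemmatizer.lemmatize` as the three arguments and apply `to_keywords` to `text_description` directly.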

#### Introduce a ratings column

In [39]:
def to_rating(val):
    if val == 'content_followed':
        return 5
    if val == 'content_commented_on':
        return 4
    if val == 'content_saved':
        return 3
    if val == 'content_liked':
        return 2
    return 1

In [40]:
txns.interaction_type.value_counts()

content_watched         61086
content_liked            5745
content_saved            2463
content_commented_on     1611
content_followed         1407
Name: interaction_type, dtype: int64

In [41]:
txns['rating'] = txns.interaction_type.apply(to_rating)

In [42]:
txns.head()

Unnamed: 0,event_timestamp,interaction_type,item_id,consumer_id,rating
0,1465413032,content_watched,-3499919498720038879,-8845298781299428018,1
1,1465412560,content_watched,8890720798209849691,-1032019229384696495,1
2,1465416190,content_watched,310515487419366995,-1130272294246983140,1
3,1465413895,content_followed,310515487419366995,344280948527967603,5
4,1465412290,content_watched,-7820640624231356730,-445337111692715325,1
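
The same mapping can be expressed as a lookup table, which keeps the weights in one place. This is a sketch on a made-up Series (`toy` is hypothetical); the weights are the ones used by `to_rating`, and unknown interaction types fall back to 1 as they do there:

```python
import pandas as pd

# Same weights as to_rating; anything not listed falls back to 1.
RATING_BY_INTERACTION = {
    "content_followed": 5,
    "content_commented_on": 4,
    "content_saved": 3,
    "content_liked": 2,
    "content_watched": 1,
}

# Hypothetical toy column standing in for txns.interaction_type.
toy = pd.Series(["content_watched", "content_followed",
                 "content_liked", "something_else"])

# Unmapped values become NaN after map(), so fill them with the default.
ratings = toy.map(RATING_BY_INTERACTION).fillna(1).astype(int)
print(ratings.tolist())
```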


#### Adjust IDs

The raw user and document IDs are large, opaque integers (some of them negative) that are hard to read. So create new sequential IDs that start from 1.

In [43]:
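class IdHelper:
    """Maps arbitrary raw IDs to sequential integer IDs starting at 1."""

    def __init__(self):
        # Keep state per instance: class-level dict/list attributes would be
        # shared by every IdHelper (e.g. consumer_helper and item_helper).
        self._map = {}
        self._id = 1
        self.ids = []

    def translate(self, id):
        # If a mapping exists for id, then return the mapping
        # Otherwise, create a new mapping, store it, and return it
        if id in self._map:
            return self._map[id]
        new_id = self._new_id()
        self._map[id] = new_id
        return new_id

    def _new_id(self):
        # Hand out the next sequential ID and remember it
        num = self._id
        self._id += 1
        self.ids.append(num)
        return num

    def is_known_id(self, id):
        # True if id is one of the adjusted IDs issued so far
        return id in self.ids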
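
For a single column, pandas offers a built-in way to get the same first-appearance numbering. A minimal sketch, assuming made-up raw IDs in the style of the dataset (the `raw` Series is hypothetical):

```python
import pandas as pd

# Hypothetical raw IDs in the style of the dataset (large, possibly negative).
raw = pd.Series([-8845298781299428018, -1032019229384696495,
                 -8845298781299428018, 344280948527967603])

# factorize numbers values by first appearance starting at 0, so add 1.
codes, uniques = pd.factorize(raw)
adjusted = codes + 1
print(adjusted.tolist())
```

Unlike the helper class, `pd.factorize` works on one column at a time, so item IDs that appear in both DataFrames would have to be factorized together (for example on their concatenation) to stay consistent.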

In [44]:
consumer_helper = IdHelper()
item_helper = IdHelper()

In [45]:
txns['consumer_id_adj'] = txns['consumer_id'].map(lambda x: consumer_helper.translate(x))

In [46]:
txns.head()

Unnamed: 0,event_timestamp,interaction_type,item_id,consumer_id,rating,consumer_id_adj
0,1465413032,content_watched,-3499919498720038879,-8845298781299428018,1,1
1,1465412560,content_watched,8890720798209849691,-1032019229384696495,1,2
2,1465416190,content_watched,310515487419366995,-1130272294246983140,1,3
3,1465413895,content_followed,310515487419366995,344280948527967603,5,4
4,1465412290,content_watched,-7820640624231356730,-445337111692715325,1,5


In [47]:
txns['item_id_adj'] = txns['item_id'].map(lambda x: item_helper.translate(x))

In [48]:
# Drop item_id and consumer_id from txns
txns.drop(columns=['item_id', 'consumer_id'], inplace=True)

In [49]:
txns.head()

Unnamed: 0,event_timestamp,interaction_type,rating,consumer_id_adj,item_id_adj
0,1465413032,content_watched,1,1,1
1,1465412560,content_watched,1,2,2
2,1465416190,content_watched,1,3,3
3,1465413895,content_followed,5,4,3
4,1465412290,content_watched,1,5,4


Same for content.

In [50]:
cnt.head()

Unnamed: 0,index,event_timestamp,interaction_type,item_id,item_type,item_url,title,text_description,language,text_description_lemmatized
0,1,1459193988,content_present,-4110354420726924665,HTML,http://www.nytimes.com/2016/03/28/business/dea...,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,en,"[work, still, early, first, full, public, vers..."
1,2,1459194146,content_present,-7292285110016212249,HTML,http://cointelegraph.com/news/bitcoin-future-w...,Bitcoin Future: When GBPcoin of Branson Wins O...,The alarm clock wakes me at 8:00 with stream o...,en,"[alarm, clock, wake, stream, advert, free, bro..."
2,3,1459194474,content_present,-6151852268067518688,HTML,https://cloudplatform.googleblog.com/2016/03/G...,Google Data Center 360° Tour,We're excited to share the Google Data Center ...,en,"[excited, share, google, data, center, tour, y..."
3,4,1459194497,content_present,2448026894306402386,HTML,https://bitcoinmagazine.com/articles/ibm-wants...,"IBM Wants to ""Evolve the Internet"" With Blockc...",The Aite Group projects the blockchain market ...,en,"[aite, group, project, blockchain, market, cou..."
4,5,1459194522,content_present,-2826566343807132236,HTML,http://www.coindesk.com/ieee-blockchain-oxford...,IEEE to Talk Blockchain at Cloud Computing Oxf...,One of the largest and oldest organizations fo...,en,"[one, largest, oldest, organization, computing..."


In [51]:
cnt['item_id_adj'] = cnt['item_id'].map(lambda x: item_helper.translate(x))

In [52]:
# Drop item_id from cnt
cnt.drop(columns=['item_id'], inplace=True)

In [53]:
cnt.head()

Unnamed: 0,index,event_timestamp,interaction_type,item_type,item_url,title,text_description,language,text_description_lemmatized,item_id_adj
0,1,1459193988,content_present,HTML,http://www.nytimes.com/2016/03/28/business/dea...,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,en,"[work, still, early, first, full, public, vers...",1190
1,2,1459194146,content_present,HTML,http://cointelegraph.com/news/bitcoin-future-w...,Bitcoin Future: When GBPcoin of Branson Wins O...,The alarm clock wakes me at 8:00 with stream o...,en,"[alarm, clock, wake, stream, advert, free, bro...",811
2,3,1459194474,content_present,HTML,https://cloudplatform.googleblog.com/2016/03/G...,Google Data Center 360° Tour,We're excited to share the Google Data Center ...,en,"[excited, share, google, data, center, tour, y...",559
3,4,1459194497,content_present,HTML,https://bitcoinmagazine.com/articles/ibm-wants...,"IBM Wants to ""Evolve the Internet"" With Blockc...",The Aite Group projects the blockchain market ...,en,"[aite, group, project, blockchain, market, cou...",2988
4,5,1459194522,content_present,HTML,http://www.coindesk.com/ieee-blockchain-oxford...,IEEE to Talk Blockchain at Cloud Computing Oxf...,One of the largest and oldest organizations fo...,en,"[one, largest, oldest, organization, computing...",1191


### EDA

#### Checking for missing values

In [54]:
txns.isna().sum()

event_timestamp     0
interaction_type    0
rating              0
consumer_id_adj     0
item_id_adj         0
dtype: int64

In [55]:
txns.shape

(72312, 5)

In [56]:
cnt.isna().sum()

index                          0
event_timestamp                0
interaction_type               0
item_type                      0
item_url                       0
title                          0
text_description               0
language                       0
text_description_lemmatized    0
item_id_adj                    0
dtype: int64

In [57]:
cnt.shape

(2191, 10)

#### Checking for duplicated ratings

In [58]:
txns.head()

Unnamed: 0,event_timestamp,interaction_type,rating,consumer_id_adj,item_id_adj
0,1465413032,content_watched,1,1,1
1,1465412560,content_watched,1,2,2
2,1465416190,content_watched,1,3,3
3,1465413895,content_followed,5,4,3
4,1465412290,content_watched,1,5,4


In [59]:
txns_2 = txns[['consumer_id_adj', 'item_id_adj', 'rating']]

In [60]:
txns_2.head()

Unnamed: 0,consumer_id_adj,item_id_adj,rating
0,1,1,1
1,2,2,1
2,3,3,1
3,4,3,5
4,5,4,1


In [61]:
duplicates = txns[txns.duplicated(subset=['consumer_id_adj', 'item_id_adj'], keep=False)]

In [62]:
duplicates = duplicates.sort_values(by=['consumer_id_adj', 'item_id_adj', 'rating'])

In [63]:
duplicates.head()

Unnamed: 0,event_timestamp,interaction_type,rating,consumer_id_adj,item_id_adj
0,1465413032,content_watched,1,1,1
34,1465413046,content_watched,1,1,1
1647,1465481798,content_liked,2,1,2
1651,1465481662,content_saved,3,1,2
25568,1460648226,content_watched,1,1,28


There are duplicated entries, i.e., the same user has interacted with the same article multiple times.

Since multiple interactions could mean that a user liked an article, the duplicates must be considered in the analysis.
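
As a sketch of how repeated interactions can be identified and counted, the example below uses a made-up frame (`toy` is hypothetical; the column names follow the notebook). The per-pair count is shown only as one possible engagement signal, not as something the analysis here relies on.

```python
import pandas as pd

# Hypothetical toy interactions; column names follow the notebook.
toy = pd.DataFrame({
    "consumer_id_adj": [1, 1, 1, 2],
    "item_id_adj":     [1, 1, 2, 1],
    "rating":          [1, 3, 1, 5],
})

keys = ["consumer_id_adj", "item_id_adj"]

# keep=False flags every row of a repeated (user, item) pair, not just the extras.
mask = toy.duplicated(subset=keys, keep=False)
repeat_rows = toy[mask]

# Number of interactions per (user, item) pair.
counts = toy.groupby(keys).size()
print(repeat_rows)
print(counts)
```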

#### For "duplicated" transactions, calculate the average rating of the user for that article

In [64]:
grp = duplicates.groupby(by=['consumer_id_adj', 'item_id_adj'])['rating'].mean()

In [65]:
grp.head()

consumer_id_adj  item_id_adj
1                1              1.000000
                 2              2.500000
                 28             1.250000
                 42             2.000000
                 68             2.142857
Name: rating, dtype: float64

In [66]:
grp_df = pd.DataFrame(grp)

In [67]:
grp_df.head()

consumer_id_adj,item_id_adj,rating
1,1,1.0
1,2,2.5
1,28,1.25
1,42,2.0
1,68,2.142857


Rename the rating column to avoid a clash when merging with the original DataFrame. Note that, despite its name, rating_sum holds the mean rating per user-article pair, not a sum.

In [68]:
grp_df.columns = ['rating_sum']

In [69]:
grp_df.head()

consumer_id_adj,item_id_adj,rating_sum
1,1,1.0
1,2,2.5
1,28,1.25
1,42,2.0
1,68,2.142857


In [70]:
grp_df.reset_index(inplace=True)

In [71]:
grp_df.head()

Unnamed: 0,consumer_id_adj,item_id_adj,rating_sum
0,1,1,1.0
1,1,2,2.5
2,1,28,1.25
3,1,42,2.0
4,1,68,2.142857


Check distributions of ratings

In [72]:
grp_df.describe(percentiles=[0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 0.96, 0.97, 0.98, 0.99, 1.0])

Unnamed: 0,consumer_id_adj,item_id_adj,rating_sum
count,16640.0,16640.0,16640.0
mean,368.352043,1264.77512,1.344271
std,395.042294,886.443286,0.506015
min,1.0,1.0,1.0
5%,7.0,72.0,1.0
10%,21.0,145.9,1.0
15%,32.85,216.0,1.0
20%,53.0,316.8,1.0
25%,69.0,415.0,1.0
30%,85.0,533.0,1.0


In [73]:
fig = px.box(grp_df, y='rating_sum')
fig.show()

Most user-article pairs have an average rating of 2 or lower, and only a small number have a high rating. These high values are not treated as outliers: users are expected to engage strongly with only a small percentage of the articles in the system.
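
The reading of the box plot can be backed by explicit proportions. The sketch below uses made-up values (`toy_ratings` is hypothetical); in the notebook the same expressions would presumably be applied to `grp_df['rating_sum']`.

```python
import pandas as pd

# Hypothetical averaged ratings; the real values live in grp_df['rating_sum'].
toy_ratings = pd.Series([1.0, 1.0, 1.25, 2.0, 2.5, 4.5, 1.0, 1.5])

share_low = (toy_ratings <= 2).mean()   # fraction rated 2 or lower
share_high = (toy_ratings >= 4).mean()  # fraction rated 4 or higher
print(f"<=2: {share_low:.1%}, >=4: {share_high:.1%}")
```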

#### Add the adjusted rating back to the original transactions DataFrame

In [74]:
no_dups = txns.drop_duplicates(subset=['consumer_id_adj', 'item_id_adj'])

In [75]:
no_dups.head()

Unnamed: 0,event_timestamp,interaction_type,rating,consumer_id_adj,item_id_adj
0,1465413032,content_watched,1,1,1
1,1465412560,content_watched,1,2,2
2,1465416190,content_watched,1,3,3
3,1465413895,content_followed,5,4,3
4,1465412290,content_watched,1,5,4


In [76]:
no_dups = no_dups.sort_values(by=['consumer_id_adj', 'item_id_adj', 'rating'])

In [77]:
no_dups.head()

Unnamed: 0,event_timestamp,interaction_type,rating,consumer_id_adj,item_id_adj
0,1465413032,content_watched,1,1,1
1647,1465481798,content_liked,2,1,2
13844,1462296634,content_watched,1,1,8
25568,1460648226,content_watched,1,1,28
41857,1470773847,content_watched,1,1,38


In [78]:
duplicates.shape

(48242, 5)

In [79]:
no_dups.shape

(40710, 5)

In [80]:
txns.shape

(72312, 5)

Merge the two DataFrames

In [81]:
txns_merged = pd.merge(left=no_dups, right=grp_df, on=['consumer_id_adj', 'item_id_adj'], how='left')

In [82]:
txns_merged.sort_values(by=['consumer_id_adj', 'item_id_adj', 'rating'], inplace=True)

In [83]:
txns_merged.head(25)

Unnamed: 0,event_timestamp,interaction_type,rating,consumer_id_adj,item_id_adj,rating_sum
0,1465413032,content_watched,1,1,1,1.0
1,1465481798,content_liked,2,1,2,2.5
2,1462296634,content_watched,1,1,8,
3,1460648226,content_watched,1,1,28,1.25
4,1470773847,content_watched,1,1,38,
5,1460648169,content_watched,1,1,42,2.0
6,1461867235,content_watched,1,1,52,
7,1461867305,content_watched,1,1,54,
8,1466614562,content_watched,1,1,68,2.142857
9,1464190235,content_watched,1,1,87,


Rows that have rating_sum as NaN were not duplicated in the original. So, the averaged rating is just the original rating for these rows.

In [84]:
txns_merged['ratings_merged'] = txns_merged.rating_sum.fillna(txns_merged.rating)

In [85]:
txns_merged.head(25)

Unnamed: 0,event_timestamp,interaction_type,rating,consumer_id_adj,item_id_adj,rating_sum,ratings_merged
0,1465413032,content_watched,1,1,1,1.0,1.0
1,1465481798,content_liked,2,1,2,2.5,2.5
2,1462296634,content_watched,1,1,8,,1.0
3,1460648226,content_watched,1,1,28,1.25,1.25
4,1470773847,content_watched,1,1,38,,1.0
5,1460648169,content_watched,1,1,42,2.0,2.0
6,1461867235,content_watched,1,1,52,,1.0
7,1461867305,content_watched,1,1,54,,1.0
8,1466614562,content_watched,1,1,68,2.142857,2.142857
9,1464190235,content_watched,1,1,87,,1.0


In [86]:
txns_merged.ratings_merged.describe()

count    40710.000000
mean         1.157177
std          0.397953
min          1.000000
25%          1.000000
50%          1.000000
75%          1.000000
max          4.500000
Name: ratings_merged, dtype: float64

The merged ratings range from 1 to 4.5, which stays within the original 1 to 5 scale, so they are used as-is.
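
Because the mean of a single rating is that rating, the split-average-merge-fill sequence above could also be sketched as one aggregation. This is an alternative under stated assumptions, not the approach used in this notebook; `toy` is a made-up frame whose column names follow the notebook.

```python
import pandas as pd

# Hypothetical toy interactions; column names follow the notebook.
toy = pd.DataFrame({
    "consumer_id_adj": [1, 1, 1, 2],
    "item_id_adj":     [1, 1, 2, 1],
    "rating":          [1, 3, 1, 5],
})

# One groupby covers both repeated and non-repeated (user, item) pairs.
merged = (
    toy.groupby(["consumer_id_adj", "item_id_adj"], as_index=False)["rating"]
       .mean()
)
print(merged)
```

The trade-off is that this form keeps only the key columns and the averaged rating; the timestamp and the original rating of the first interaction would need to be aggregated separately if they are still required.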

In [87]:
txns_merged.drop(columns=['rating_sum'], inplace=True)

In [88]:
txns_merged.head()

Unnamed: 0,event_timestamp,interaction_type,rating,consumer_id_adj,item_id_adj,ratings_merged
0,1465413032,content_watched,1,1,1,1.0
1,1465481798,content_liked,2,1,2,2.5
2,1462296634,content_watched,1,1,8,1.0
3,1460648226,content_watched,1,1,28,1.25
4,1470773847,content_watched,1,1,38,1.0


In [89]:
txns_merged.rename(columns={'rating': 'rating_original'}, inplace=True)

In [90]:
txns_merged.head()

Unnamed: 0,event_timestamp,interaction_type,rating_original,consumer_id_adj,item_id_adj,ratings_merged
0,1465413032,content_watched,1,1,1,1.0
1,1465481798,content_liked,2,1,2,2.5
2,1462296634,content_watched,1,1,8,1.0
3,1460648226,content_watched,1,1,28,1.25
4,1470773847,content_watched,1,1,38,1.0


In [91]:
# txns_merged.rename(columns={'ratings_scaled': 'rating'}, inplace=True)
txns_merged.rename(columns={'ratings_merged': 'rating'}, inplace=True)

In [92]:
txns_merged.head()

Unnamed: 0,event_timestamp,interaction_type,rating_original,consumer_id_adj,item_id_adj,rating
0,1465413032,content_watched,1,1,1,1.0
1,1465481798,content_liked,2,1,2,2.5
2,1462296634,content_watched,1,1,8,1.0
3,1460648226,content_watched,1,1,28,1.25
4,1470773847,content_watched,1,1,38,1.0


In [93]:
txns = txns_merged

In [94]:
txns.head()

Unnamed: 0,event_timestamp,interaction_type,rating_original,consumer_id_adj,item_id_adj,rating
0,1465413032,content_watched,1,1,1,1.0
1,1465481798,content_liked,2,1,2,2.5
2,1462296634,content_watched,1,1,8,1.0
3,1460648226,content_watched,1,1,28,1.25
4,1470773847,content_watched,1,1,38,1.0


In [95]:
txns.drop(columns=['interaction_type'], inplace=True)

In [96]:
txns.head()

Unnamed: 0,event_timestamp,rating_original,consumer_id_adj,item_id_adj,rating
0,1465413032,1,1,1,1.0
1,1465481798,2,1,2,2.5
2,1462296634,1,1,8,1.0
3,1460648226,1,1,28,1.25
4,1470773847,1,1,38,1.0


In [97]:
txns.describe()

Unnamed: 0,event_timestamp,rating_original,consumer_id_adj,item_id_adj,rating
count,40710.0,40710.0,40710.0,40710.0,40710.0
mean,1470525000.0,1.146917,430.560624,1493.917416,1.157177
std,7535306.0,0.53958,450.92468,913.586706,0.397953
min,1457965000.0,1.0,1.0,1.0,1.0
25%,1464379000.0,1.0,81.0,584.0,1.0
50%,1469471000.0,1.0,254.0,1603.5,1.0
75%,1475262000.0,1.0,648.0,2277.0,1.0
max,1488310000.0,5.0,1895.0,2987.0,4.5


Consolidated Ratings are between 1 and 4.5, which is expected.

#### Plotting

In [98]:
px.histogram(txns, x='rating')

In [99]:
cnt.head()

Unnamed: 0,index,event_timestamp,interaction_type,item_type,item_url,title,text_description,language,text_description_lemmatized,item_id_adj
0,1,1459193988,content_present,HTML,http://www.nytimes.com/2016/03/28/business/dea...,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,en,"[work, still, early, first, full, public, vers...",1190
1,2,1459194146,content_present,HTML,http://cointelegraph.com/news/bitcoin-future-w...,Bitcoin Future: When GBPcoin of Branson Wins O...,The alarm clock wakes me at 8:00 with stream o...,en,"[alarm, clock, wake, stream, advert, free, bro...",811
2,3,1459194474,content_present,HTML,https://cloudplatform.googleblog.com/2016/03/G...,Google Data Center 360° Tour,We're excited to share the Google Data Center ...,en,"[excited, share, google, data, center, tour, y...",559
3,4,1459194497,content_present,HTML,https://bitcoinmagazine.com/articles/ibm-wants...,"IBM Wants to ""Evolve the Internet"" With Blockc...",The Aite Group projects the blockchain market ...,en,"[aite, group, project, blockchain, market, cou...",2988
4,5,1459194522,content_present,HTML,http://www.coindesk.com/ieee-blockchain-oxford...,IEEE to Talk Blockchain at Cloud Computing Oxf...,One of the largest and oldest organizations fo...,en,"[one, largest, oldest, organization, computing...",1191


In [100]:
px.histogram(cnt, x='language')

### Topic Modelling

Derive a small set of basic topics under which each article can be categorized.

In [101]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

#### Feature extraction

In [102]:
vec = TfidfVectorizer(stop_words='english')
X = vec.fit_transform(cnt['text_description'])

In [103]:
test_df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())



In [104]:
test_df.head()

Unnamed: 0,00,000,0000,000000,000000000001,0000000000400848,000001,000001000001,0001,000707,...,収穫,和食,将来の夢は,干杯,懐石料理,教える,楔形文字,頭に来る,食べ物,건배
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.057607,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.036106,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### NMF Decomposition

In [105]:
num_topics = 10
nmf = NMF(n_components=num_topics, random_state=42)
doc_topic = nmf.fit_transform(X)
topic_term = nmf.components_


The 'init' value, when 'init=None' and n_components is less than n_samples and n_features, will be changed from 'nndsvd' to 'nndsvda' in 1.1 (renaming of 0.26).



In [106]:
# Getting the top 10 words for each topic

words = np.array(vec.get_feature_names_out())
topic_words = pd.DataFrame(
    np.zeros((num_topics, 10)),
    index=['topic_{}'.format(i + 1) for i in range(num_topics)],
    columns=['word_{}'.format(i + 1) for i in range(10)]
).astype(str)



In [107]:
topic_words

Unnamed: 0,word_1,word_2,word_3,word_4,word_5,word_6,word_7,word_8,word_9,word_10
topic_1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
topic_2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
topic_3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
topic_4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
topic_5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
topic_6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
topic_7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
topic_8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
topic_9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
topic_10,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Populating topic_words

In [108]:
for i in range(num_topics):
    idx = topic_term[i].argsort()[::-1][:10]
    topic_words.iloc[i] = words[idx]

In [109]:
topic_words

Unnamed: 0,word_1,word_2,word_3,word_4,word_5,word_6,word_7,word_8,word_9,word_10
topic_1,digital,customer,business,companies,product,customers,company,new,people,marketing
topic_2,drupal,module,acquia,modules,commerce,ecommerce,content,site,core,api
topic_3,cloud,google,data,platform,gcp,aws,engine,storage,services,api
topic_4,learning,machine,ai,data,intelligence,deep,algorithms,neural,artificial,tensorflow
topic_5,bitcoin,blockchain,ethereum,financial,banks,technology,currency,bank,ledger,banking
topic_6,google,android,app,apps,vr,mobile,search,new,chrome,users
topic_7,apple,iphone,jobs,siri,mac,ios,steve,event,cook,watch
topic_8,bot,bots,facebook,slack,messenger,chatbots,app,users,apps,chat
topic_9,docker,container,containers,kubernetes,windows,run,linux,image,command,swarm
topic_10,code,use,time,data,like,just,test,java,don,ll
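
The population loop takes, for each topic row of `topic_term`, the positions of the ten largest weights and looks up the matching vocabulary entries. A minimal vectorized sketch of the same idea on a made-up 2-topic, 5-word matrix (the toy `vocab` and `weights` arrays are hypothetical, not taken from the dataset):

```python
import numpy as np

# Hypothetical toy vocabulary and topic-term weights (not from the dataset).
vocab = np.array(["bitcoin", "cloud", "docker", "google", "ledger"])
weights = np.array([
    [0.9, 0.0, 0.1, 0.2, 0.7],   # topic 0 leans towards cryptocurrency terms
    [0.0, 0.8, 0.3, 0.6, 0.1],   # topic 1 leans towards cloud terms
])

top_n = 3
# argsort is ascending, so reverse each row and keep the first top_n columns.
top_idx = np.argsort(weights, axis=1)[:, ::-1][:, :top_n]
top_words = vocab[top_idx]  # shape: (n_topics, top_n)
```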


In [110]:
# Create a topic mapping for topic_words
# The topics in order are: 'Digital Marketing', 'E-Commerce', 'Cloud Computing', 'Data Science & Machine Learning', 'Cryptocurrency', 'Google', 'Apple', 'Facebook', 'Operating Systems & Runtimes', 'Computer Programming'
topic_mapping = {
    'topic_1': 'Digital Marketing',
    'topic_2': 'E-Commerce',
    'topic_3': 'Cloud Computing',
    'topic_4': 'Data Science & Machine Learning',
    'topic_5': 'Cryptocurrency',
    'topic_6': 'Google',
    'topic_7': 'Apple',
    'topic_8': 'Facebook',
    'topic_9': 'Operating Systems & Runtimes',
    'topic_10': 'Computer Programming'
}

In [111]:
doc_topic_df = pd.DataFrame(doc_topic, columns=['topic_{}'.format(i + 1) for i in range(num_topics)])

In [112]:
# Get the 5 topics with the highest NMF weights for each document
doc_topic_df['top_topics'] = doc_topic_df.apply(lambda x: x.sort_values(ascending=False).index[:5].tolist(), axis=1)

In [113]:
doc_topic_df.head()

Unnamed: 0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,topic_9,topic_10,top_topics
0,0.0,0.0,0.001054,0.004535,0.231242,0.0,0.001212,0.0,0.0,0.010984,"[topic_5, topic_10, topic_4, topic_7, topic_3]"
1,0.005539,0.0,0.005143,0.011259,0.031041,0.0,0.003597,0.006076,0.0,0.018289,"[topic_5, topic_10, topic_4, topic_8, topic_1]"
2,0.002278,0.0,0.086706,0.01138,0.0,0.103184,0.0,0.0,0.0,0.001038,"[topic_6, topic_3, topic_4, topic_1, topic_10]"
3,0.0,0.0,0.0,0.007874,0.217134,0.0,0.003375,0.0,0.002483,0.021666,"[topic_5, topic_10, topic_4, topic_7, topic_9]"
4,0.0,0.0,0.070105,0.000261,0.147466,0.0,0.005656,0.0,0.0,0.0,"[topic_5, topic_3, topic_7, topic_4, topic_1]"


In [114]:
# Get the mapping for doc_topic_df.top_topics from topic_mapping and create a new column
doc_topic_df['top_topics_mapped'] = doc_topic_df.top_topics.apply(lambda x: [topic_mapping[i] for i in x])

In [115]:
doc_topic_df.head()

Unnamed: 0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,topic_9,topic_10,top_topics,top_topics_mapped
0,0.0,0.0,0.001054,0.004535,0.231242,0.0,0.001212,0.0,0.0,0.010984,"[topic_5, topic_10, topic_4, topic_7, topic_3]","[Cryptocurrency, Computer Programming, Data Sc..."
1,0.005539,0.0,0.005143,0.011259,0.031041,0.0,0.003597,0.006076,0.0,0.018289,"[topic_5, topic_10, topic_4, topic_8, topic_1]","[Cryptocurrency, Computer Programming, Data Sc..."
2,0.002278,0.0,0.086706,0.01138,0.0,0.103184,0.0,0.0,0.0,0.001038,"[topic_6, topic_3, topic_4, topic_1, topic_10]","[Google, Cloud Computing, Data Science & Machi..."
3,0.0,0.0,0.0,0.007874,0.217134,0.0,0.003375,0.0,0.002483,0.021666,"[topic_5, topic_10, topic_4, topic_7, topic_9]","[Cryptocurrency, Computer Programming, Data Sc..."
4,0.0,0.0,0.070105,0.000261,0.147466,0.0,0.005656,0.0,0.0,0.0,"[topic_5, topic_3, topic_7, topic_4, topic_1]","[Cryptocurrency, Cloud Computing, Apple, Data ..."


In [116]:
doc_topic_df.shape

(2191, 12)

In [117]:
# Add doc_topic_df.top_topics_mapped to cnt
cnt = pd.concat([cnt, doc_topic_df.top_topics_mapped], axis=1)

In [118]:
cnt.head()

Unnamed: 0,index,event_timestamp,interaction_type,item_type,item_url,title,text_description,language,text_description_lemmatized,item_id_adj,top_topics_mapped
0,1,1459193988,content_present,HTML,http://www.nytimes.com/2016/03/28/business/dea...,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,en,"[work, still, early, first, full, public, vers...",1190,"[Cryptocurrency, Computer Programming, Data Sc..."
1,2,1459194146,content_present,HTML,http://cointelegraph.com/news/bitcoin-future-w...,Bitcoin Future: When GBPcoin of Branson Wins O...,The alarm clock wakes me at 8:00 with stream o...,en,"[alarm, clock, wake, stream, advert, free, bro...",811,"[Cryptocurrency, Computer Programming, Data Sc..."
2,3,1459194474,content_present,HTML,https://cloudplatform.googleblog.com/2016/03/G...,Google Data Center 360° Tour,We're excited to share the Google Data Center ...,en,"[excited, share, google, data, center, tour, y...",559,"[Google, Cloud Computing, Data Science & Machi..."
3,4,1459194497,content_present,HTML,https://bitcoinmagazine.com/articles/ibm-wants...,"IBM Wants to ""Evolve the Internet"" With Blockc...",The Aite Group projects the blockchain market ...,en,"[aite, group, project, blockchain, market, cou...",2988,"[Cryptocurrency, Computer Programming, Data Sc..."
4,5,1459194522,content_present,HTML,http://www.coindesk.com/ieee-blockchain-oxford...,IEEE to Talk Blockchain at Cloud Computing Oxf...,One of the largest and oldest organizations fo...,en,"[one, largest, oldest, organization, computing...",1191,"[Cryptocurrency, Cloud Computing, Apple, Data ..."


In [119]:
# Rename cnt.top_topics_mapped to cnt.topics
cnt.rename(columns={'top_topics_mapped': 'topics'}, inplace=True)

In [120]:
cnt.head()

Unnamed: 0,index,event_timestamp,interaction_type,item_type,item_url,title,text_description,language,text_description_lemmatized,item_id_adj,topics
0,1,1459193988,content_present,HTML,http://www.nytimes.com/2016/03/28/business/dea...,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,en,"[work, still, early, first, full, public, vers...",1190,"[Cryptocurrency, Computer Programming, Data Sc..."
1,2,1459194146,content_present,HTML,http://cointelegraph.com/news/bitcoin-future-w...,Bitcoin Future: When GBPcoin of Branson Wins O...,The alarm clock wakes me at 8:00 with stream o...,en,"[alarm, clock, wake, stream, advert, free, bro...",811,"[Cryptocurrency, Computer Programming, Data Sc..."
2,3,1459194474,content_present,HTML,https://cloudplatform.googleblog.com/2016/03/G...,Google Data Center 360° Tour,We're excited to share the Google Data Center ...,en,"[excited, share, google, data, center, tour, y...",559,"[Google, Cloud Computing, Data Science & Machi..."
3,4,1459194497,content_present,HTML,https://bitcoinmagazine.com/articles/ibm-wants...,"IBM Wants to ""Evolve the Internet"" With Blockc...",The Aite Group projects the blockchain market ...,en,"[aite, group, project, blockchain, market, cou...",2988,"[Cryptocurrency, Computer Programming, Data Sc..."
4,5,1459194522,content_present,HTML,http://www.coindesk.com/ieee-blockchain-oxford...,IEEE to Talk Blockchain at Cloud Computing Oxf...,One of the largest and oldest organizations fo...,en,"[one, largest, oldest, organization, computing...",1191,"[Cryptocurrency, Cloud Computing, Apple, Data ..."


With this, we have some idea of what topics each article is talking about.

## Getting articles for a User

We consider two approaches: user-based collaborative filtering and ALS (alternating least squares). Whichever performs better will be the model to use.

### User-based collaborative filtering

In [121]:
n_users = txns.consumer_id_adj.nunique()

In [122]:
n_articles = txns.item_id_adj.nunique()

In [123]:
# txns.consumer_id.values

In [124]:
print(f'Num users: {n_users}, Num articles: {n_articles}')

Num users: 1895, Num articles: 2987


### Train test split

In [125]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(txns, test_size=0.3, random_state=42)

In [126]:
train.shape

(28497, 5)

In [127]:
test.shape

(12213, 5)

In [128]:
train.describe(percentiles=[0.25, 0.5, 0.75, 0.8, 0.85, 0.9, 0.95, 1.0])

Unnamed: 0,event_timestamp,rating_original,consumer_id_adj,item_id_adj,rating
count,28497.0,28497.0,28497.0,28497.0,28497.0
mean,1470555000.0,1.147489,430.525038,1499.067516,1.156979
std,7530892.0,0.541093,450.984328,912.874316,0.400637
min,1457965000.0,1.0,1.0,1.0,1.0
25%,1464608000.0,1.0,81.0,591.0,1.0
50%,1469539000.0,1.0,254.0,1614.0,1.0
75%,1475262000.0,1.0,643.0,2277.0,1.0
80%,1476981000.0,1.0,761.0,2425.8,1.0
85%,1478886000.0,1.0,928.0,2547.0,1.5
90%,1481900000.0,1.0,1122.0,2714.0,1.5


### User-Article matrix

Since this is collaborative filtering, we work from the transactions data. From it, we construct a matrix of the ratings that each user gave to each article.

Populate the training matrix

In [129]:
def create_and_populate_user_article_matrix(data):
    # Rows are users and columns are articles; 0 means "no interaction"
    data_matrix = np.zeros((n_users, n_articles))

    for line in data.itertuples():
        user_id = line.consumer_id_adj
        article_id = line.item_id_adj
        rating = line.rating

        # IDs are one-based, while matrix positions are zero-based
        data_matrix[user_id - 1, article_id - 1] = rating

    return data_matrix
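
For larger datasets, the row-by-row loop can be replaced with NumPy fancy indexing. The sketch below assumes each (user, article) pair carries a single consolidated rating; `build_matrix` and the toy inputs are hypothetical names used only for illustration.

```python
import numpy as np

def build_matrix(user_ids, article_ids, ratings, n_users, n_articles):
    """Return an (n_users, n_articles) array from 1-based ID arrays."""
    matrix = np.zeros((n_users, n_articles))
    # Shift the 1-based IDs to 0-based positions and assign all ratings at once.
    matrix[np.asarray(user_ids) - 1, np.asarray(article_ids) - 1] = ratings
    return matrix

# Hypothetical toy interactions: 3 users, 4 articles.
toy = build_matrix(
    user_ids=[1, 1, 3],
    article_ids=[1, 4, 2],
    ratings=[1.0, 2.5, 1.25],
    n_users=3,
    n_articles=4,
)
```

With the notebook's data, the inputs would be the `consumer_id_adj`, `item_id_adj`, and `rating` columns of the training frame.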

Fill the training matrix with rating values

In [130]:
data_matrix = create_and_populate_user_article_matrix(train)

In [131]:
data_matrix

array([[1.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 1.83333333, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 1.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [132]:
data_matrix.shape

(1895, 2987)

The dimensions match the number of unique users and articles.

Populate the testing matrix

In [133]:
data_matrix_test = create_and_populate_user_article_matrix(test)

In [134]:
data_matrix_test

array([[0. , 2.5, 0. , ..., 0. , 0. , 0. ],
       [0. , 2. , 0. , ..., 0. , 0. , 0. ],
       [0. , 0. , 1.8, ..., 0. , 0. , 0. ],
       ...,
       [0. , 0. , 0. , ..., 0. , 0. , 0. ],
       [0. , 0. , 0. , ..., 0. , 0. , 0. ],
       [0. , 0. , 0. , ..., 0. , 0. , 0. ]])

In [135]:
data_matrix_test.shape

(1895, 2987)

### Pairwise Distance

In [136]:
from sklearn.metrics.pairwise import pairwise_distances

In [137]:
user_similarity = 1 - pairwise_distances(data_matrix, metric='cosine')

In [138]:
user_similarity

array([[1.        , 0.15926467, 0.04880552, ..., 0.        , 0.        ,
        0.        ],
       [0.15926467, 1.        , 0.08667542, ..., 0.        , 0.        ,
        0.        ],
       [0.04880552, 0.08667542, 1.        , ..., 0.        , 0.04386611,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.04386611, ..., 0.        , 1.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        1.        ]])

In [139]:
user_similarity.shape

(1895, 1895)
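
`pairwise_distances(..., metric='cosine')` returns cosine distance, so subtracting it from 1 gives cosine similarity: the dot product of two users' rating rows divided by the product of their norms. A minimal NumPy sketch on a made-up matrix (the `cosine_similarity_rows` helper is hypothetical, and giving all-zero rows a similarity of 0 is an assumption of this sketch):

```python
import numpy as np

def cosine_similarity_rows(matrix):
    """Cosine similarity between every pair of rows of a 2-D array."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    # Avoid division by zero: rows with no ratings stay as zero vectors.
    safe_norms = np.where(norms == 0, 1.0, norms)
    unit = matrix / safe_norms
    return unit @ unit.T

toy_ratings = np.array([
    [1.0, 0.0, 2.0],
    [2.0, 0.0, 4.0],   # same direction as user 0
    [0.0, 3.0, 0.0],   # no articles in common with users 0 and 1
])
toy_sim = cosine_similarity_rows(toy_ratings)
```

Users 0 and 1 rate the same articles in the same proportion, so their similarity is 1, while user 2 shares no articles with them and gets a similarity of 0.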

Transpose the data matrix to calculate the article-to-article similarity. It will be used later.

In [140]:
# data_matrix.shape

In [141]:
# data_matrix.T.shape

In [142]:
article_similarity = 1 - pairwise_distances(data_matrix.T, metric='cosine')

In [143]:
article_similarity

array([[1.        , 0.24889563, 0.11910292, ..., 0.        , 0.        ,
        0.        ],
       [0.24889563, 1.        , 0.02202395, ..., 0.        , 0.        ,
        0.        ],
       [0.11910292, 0.02202395, 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 1.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 1.        , 1.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        1.        ]])

In [144]:
article_similarity.shape

(2987, 2987)

### Get dot product of data matrix with similarity matrix

In [145]:
user_similarity.shape

(1895, 1895)

In [146]:
data_matrix_test.shape

(1895, 2987)

In [147]:
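# Note: this multiplies by the test matrix, so each user's held-out ratings feed into their own scores
article_prediction = np.dot(user_similarity, data_matrix_test)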

In [148]:
article_prediction.shape

(1895, 2987)
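
The dot product scores each article for a user as a similarity-weighted sum of the ratings that all users gave to that article. A small sketch with made-up numbers (both toy matrices are assumptions for illustration):

```python
import numpy as np

# Hypothetical similarities between 2 users (symmetric, ones on the diagonal).
toy_similarity = np.array([
    [1.0, 0.5],
    [0.5, 1.0],
])
# Hypothetical ratings: 2 users x 3 articles, 0 meaning "no interaction".
toy_ratings = np.array([
    [2.0, 0.0, 1.0],
    [0.0, 4.0, 1.0],
])

# scores[u, i] = sum over users v of similarity[u, v] * ratings[v, i]
toy_scores = toy_similarity @ toy_ratings
```

In this toy case user 0 never rated article 1, yet receives a score of 2.0 for it because user 1, with similarity 0.5, rated it 4.0.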

In [149]:
article_pred_df = pd.DataFrame(article_prediction)

In [150]:
article_pred_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2977,2978,2979,2980,2981,2982,2983,2984,2985,2986
0,0.143673,3.254799,0.245405,0.044782,0.022328,0.53297,0.570999,0.048806,0.202181,0.256083,...,0.0,0.004591,0.056674,0.0,0.0,0.006644,0.021842,0.0,0.0,0.0
1,0.103099,3.744892,0.784183,0.166364,0.03861,0.863883,0.854128,0.086675,0.524641,1.027532,...,0.0,0.214345,0.203155,0.0,0.007302,0.047056,0.005316,0.0,0.0,0.0
2,0.021726,0.838452,2.226006,0.065983,0.04978,0.3423,0.186129,1.0,0.376136,0.794697,...,0.0,0.118548,0.071983,0.0,0.0,0.005184,0.0,0.0,0.0,0.0
3,0.0,0.365032,1.949186,0.0,0.0,0.148945,0.023038,0.0,0.016471,0.454616,...,0.0,0.091701,0.0,0.0,0.0,0.036026,0.0,0.0,0.0,0.0
4,0.019866,0.821098,0.426121,1.089345,0.03945,0.300026,0.190112,0.065983,0.283924,0.436737,...,0.0,0.238282,0.079827,0.0,0.053516,0.025284,0.023377,0.0,0.0,0.0


In [151]:
txns.consumer_id_adj.value_counts()

7       961
21      669
2       648
27      585
114     437
       ... 
456       1
1526      1
1518      1
1513      1
1895      1
Name: consumer_id_adj, Length: 1895, dtype: int64

### Test for one user

In [152]:
test.head()

Unnamed: 0,event_timestamp,rating_original,consumer_id_adj,item_id_adj,rating
5712,1469717171,2,39,1750,1.5
6946,1474569261,1,52,2273,1.0
29727,1468410949,1,607,1572,1.0
21712,1464954966,1,288,379,1.666667
33830,1471957935,1,855,1993,1.0


In [153]:
test_user_id = 962
test_user_idx = test_user_id - 1

In [154]:
test_user_id in test.consumer_id_adj.values

True

In [155]:
article_pred_df.iloc[test_user_idx]

0       0.000000
1       0.229254
2       0.000000
3       0.000000
4       0.000000
          ...   
2982    0.000000
2983    0.000000
2984    0.000000
2985    0.000000
2986    0.000000
Name: 961, Length: 2987, dtype: float64

In [156]:
article_recommendation = pd.DataFrame(article_pred_df.iloc[test_user_idx].sort_values(ascending=False))

In [157]:
article_recommendation

Unnamed: 0,961
1854,1.501111
884,1.457415
1569,1.316463
2456,1.305109
576,1.287530
...,...
1072,0.000000
1073,0.000000
1075,0.000000
1076,0.000000


In [158]:
article_recommendation.reset_index(inplace=True)

In [159]:
article_recommendation.head()

Unnamed: 0,index,961
0,1854,1.501111
1,884,1.457415
2,1569,1.316463
3,2456,1.305109
4,576,1.28753


Since the matrix is zero-based, the article ID index that we get is also zero-based. However, our IDs are one-based. So, convert the article ID to one-based by adding 1.

In [160]:
article_recommendation['index'] = article_recommendation['index'] + 1

In [161]:
article_recommendation.head()

Unnamed: 0,index,961
0,1855,1.501111
1,885,1.457415
2,1570,1.316463
3,2457,1.305109
4,577,1.28753
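
For a single row of scores, the sort and the shift to one-based IDs can also be sketched directly in NumPy. The toy `scores` array is hypothetical.

```python
import numpy as np

scores = np.array([0.0, 1.5, 0.2, 0.9])  # hypothetical scores for articles 1..4

top_n = 2
# Positions of the largest scores, best first (0-based).
top_positions = np.argsort(-scores)[:top_n]
# Article IDs in this notebook are 1-based, so shift by one.
top_article_ids = top_positions + 1
```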


In [162]:
article_recommendation.rename(columns={'index': 'article_id', test_user_idx: 'score'}, inplace=True)

In [163]:
article_recommendation.head()

Unnamed: 0,article_id,score
0,1855,1.501111
1,885,1.457415
2,1570,1.316463
3,2457,1.305109
4,577,1.28753


Merging with the content dataframe to get the article title.

In [164]:
merged = pd.merge(article_recommendation, cnt, left_on='article_id', right_on='item_id_adj', how='left')

In [165]:
merged.columns

Index(['article_id', 'score', 'index', 'event_timestamp', 'interaction_type',
       'item_type', 'item_url', 'title', 'text_description', 'language',
       'text_description_lemmatized', 'item_id_adj', 'topics'],
      dtype='object')

In [166]:
keep = ['article_id', 'score', 'title', 'interaction_type']

In [167]:
merged = merged.drop(columns=[col for col in merged if col not in keep])

In [168]:
merged.head(10)

Unnamed: 0,article_id,score,interaction_type,title
0,1855,1.501111,content_present,The Broken Window Theory
1,885,1.457415,content_present,Program your way to your next grocery delivery
2,1570,1.316463,content_present,Visual Thinking and Learning 3.0 working toget...
3,2457,1.305109,,
4,577,1.28753,,
5,1551,1.207948,,
6,2211,1.187492,content_present,SpotHero is ready for the future of autonomous...
7,171,1.040397,,
8,52,1.0,content_present,Chrome OS now has Material Design for the desktop
9,1568,1.0,,


In [169]:
cnt[cnt['item_id_adj'] == 203]

Unnamed: 0,index,event_timestamp,interaction_type,item_type,item_url,title,text_description,language,text_description_lemmatized,item_id_adj,topics


Some articles have a NaN title. Their IDs have no row in the content DataFrame, meaning that the articles were pulled out of the system or that their data was somehow lost.

These entries can be used for analysis. However, they must not be included in any results.
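
One way to list the recommended IDs that have no row in the content table is the `indicator` flag of `pd.merge`. A minimal sketch on made-up frames (the `recs` and `content` DataFrames are hypothetical stand-ins for `article_recommendation` and `cnt`):

```python
import pandas as pd

recs = pd.DataFrame({"article_id": [1855, 2457, 52], "score": [1.50, 1.31, 1.00]})
content = pd.DataFrame({"item_id_adj": [1855, 52], "title": ["A", "B"]})

checked = pd.merge(
    recs, content,
    left_on="article_id", right_on="item_id_adj",
    how="left", indicator=True,  # adds a "_merge" column
)
# "left_only" marks recommendations without a matching content row.
missing_ids = checked.loc[checked["_merge"] == "left_only", "article_id"].tolist()
```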

In [170]:
merged.shape

(2988, 4)

In [171]:
merged = merged[~(merged['title'].isna())]

In [172]:
merged.shape

(2130, 4)

Of the remaining suggestions, some might have been pulled out of the system. Filter those out.

In [173]:
merged[merged['interaction_type'] == 'content_pulled_out']

Unnamed: 0,article_id,score,interaction_type,title
399,2023,0.0,content_pulled_out,How Netflix does A/B testing - InVision Blog
725,1683,0.0,content_pulled_out,Certeza que devemos marcar uma reunião?
735,1657,0.0,content_pulled_out,"So, You Want A Table, Huh?"
765,1921,0.0,content_pulled_out,Approaching (Almost) Any Machine Learning Problem
793,1932,0.0,content_pulled_out,Learn Swift Programming Syntax | Udacity
922,2730,0.0,content_pulled_out,"Which countries study which languages, and wha..."
1438,2536,0.0,content_pulled_out,IBM now uses more Macs than any other company ...
1495,2574,0.0,content_pulled_out,Introducing the Workspace Preview System | Acq...
1516,2407,0.0,content_pulled_out,Today in Apple history: Steve Jobs passes away
1549,2417,0.0,content_pulled_out,Real World Swift Performance


In [174]:
merged = merged[merged['interaction_type'] != 'content_pulled_out']

In [175]:
merged.shape

(2094, 4)

In [176]:
merged.head()

Unnamed: 0,article_id,score,interaction_type,title
0,1855,1.501111,content_present,The Broken Window Theory
1,885,1.457415,content_present,Program your way to your next grocery delivery
2,1570,1.316463,content_present,Visual Thinking and Learning 3.0 working toget...
6,2211,1.187492,content_present,SpotHero is ready for the future of autonomous...
8,52,1.0,content_present,Chrome OS now has Material Design for the desktop


### Evaluate the predictions of the Collaborative User-based model

In [177]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import accuracy_score
from math import sqrt

In [178]:
data_matrix

array([[1.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 1.83333333, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 1.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [179]:
data_matrix_test

array([[0. , 2.5, 0. , ..., 0. , 0. , 0. ],
       [0. , 2. , 0. , ..., 0. , 0. , 0. ],
       [0. , 0. , 1.8, ..., 0. , 0. , 0. ],
       ...,
       [0. , 0. , 0. , ..., 0. , 0. , 0. ],
       [0. , 0. , 0. , ..., 0. , 0. , 0. ],
       [0. , 0. , 0. , ..., 0. , 0. , 0. ]])

In [180]:
article_prediction

array([[0.14367345, 3.25479919, 0.24540477, ..., 0.        , 0.        ,
        0.        ],
       [0.10309915, 3.74489189, 0.78418255, ..., 0.        , 0.        ,
        0.        ],
       [0.0217256 , 0.8384518 , 2.22600589, ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.08443689, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.10645397, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.19704777, 0.0614806 , ..., 0.        , 0.        ,
        0.        ]])

In [181]:
data_matrix_test_nz = data_matrix_test.nonzero()

In [182]:
prediction = article_prediction[data_matrix_test_nz]

In [183]:
ground_truth = data_matrix_test[data_matrix_test_nz]

#### Mean Absolute Error

In [184]:
mean_absolute_error(ground_truth, prediction)

0.5569011778965313

#### Root Mean Square Error

In [185]:
sqrt(mean_squared_error(ground_truth, prediction))

0.8216710492578613
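
Both metrics compare predicted and actual ratings only at the positions where the test matrix is non-zero. A NumPy sketch of the two formulas on made-up values (the arrays are illustrative, not results from this notebook):

```python
import numpy as np

actual = np.array([1.0, 2.0, 1.5, 4.0])
predicted = np.array([1.5, 2.0, 0.5, 3.0])

errors = predicted - actual
mae = np.mean(np.abs(errors))          # mean absolute error
rmse = np.sqrt(np.mean(errors ** 2))   # root mean squared error
```

RMSE squares each error before averaging, so it penalizes large misses more heavily than MAE does.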

#### Precision

Out of the recommended items, how many did the user like?

In [186]:
num_pred = 10

In [187]:
predicted_article_ids_for_user = merged['article_id'].values[:num_pred]

In [188]:
predicted_article_ids_for_user

array([1855,  885, 1570, 2211,   52, 1484, 1856, 1816, 1807, 1628],
      dtype=int64)

In [189]:
def get_articles_that_user_liked(user_id):
    # "Liked" articles are those that the user rated above their own average rating
    user_txns = txns[txns['consumer_id_adj'] == user_id]
    avg = user_txns.rating.mean()

    user_interactions = user_txns[user_txns['rating'] > avg].sort_values(by='rating', ascending=False)

    # If no rating exceeds the average (e.g. all ratings are equal), fall back to all of the user's interactions
    if user_interactions.empty:
        user_interactions = user_txns.sort_values(by='rating', ascending=False)

    return user_interactions[['item_id_adj', 'rating']]

Since in the txns DataFrame, all IDs are 1-indexed, we can use the test user ID as it is.

In [190]:
user_interactions = get_articles_that_user_liked(test_user_id)

In [191]:
user_interactions.head()

Unnamed: 0,item_id_adj,rating
35003,52,1.0
35004,299,1.0
35005,308,1.0
35006,1484,1.0
35007,1568,1.0


In [192]:
actual_article_ids_for_user = user_interactions['item_id_adj'].values

In [193]:
set(predicted_article_ids_for_user)

{52, 885, 1484, 1570, 1628, 1807, 1816, 1855, 1856, 2211}

In [194]:
set(actual_article_ids_for_user)

{52, 299, 308, 1484, 1568}

Get intersection of predictions and user interactions

In [195]:
set(predicted_article_ids_for_user) & set(actual_article_ids_for_user)

{52, 1484}

In [196]:
correctly_predicted_article_ids = set(predicted_article_ids_for_user) & set(actual_article_ids_for_user)

Some of the articles that the user liked are identified.

Precision = #Correct predictions / #Predictions

In [197]:
precision = len(correctly_predicted_article_ids) / len(predicted_article_ids_for_user)

In [198]:
precision

0.2

#### Recall

Recall is the fraction of liked articles that the system identifies correctly.

Recall = #Correct Predictions / #Liked Articles

In [199]:
recall = len(correctly_predicted_article_ids) / len(actual_article_ids_for_user)

In [200]:
recall

0.4
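
The two ratios can be wrapped in a small helper. The sketch below reuses the ID sets printed above for the test user; `precision_recall` is a hypothetical helper name, and returning 0.0 for an empty collection is an assumption.

```python
def precision_recall(predicted_ids, liked_ids):
    """Return (precision, recall) for two collections of article IDs."""
    predicted, liked = set(predicted_ids), set(liked_ids)
    hits = predicted & liked
    precision = len(hits) / len(predicted) if predicted else 0.0
    recall = len(hits) / len(liked) if liked else 0.0
    return precision, recall

predicted = [1855, 885, 1570, 2211, 52, 1484, 1856, 1816, 1807, 1628]
liked = [52, 299, 308, 1484, 1568]
p, r = precision_recall(predicted, liked)  # 2 hits out of 10 predicted and 5 liked
```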

To evaluate the filtering method over the entire test data, compute the metrics defined above for each user and take the average.

In [201]:
# Helper methods
def evaluate_user_based_filtering(test):
    # For each unique consumer_id_adj in the test DataFrame, we will evaluate the precision and recall
    # of the user-based filtering algorithm
    total_precision = 0
    total_recall = 0

    test_user_ids = test.consumer_id_adj.unique()
    num_users = len(test_user_ids)
    for test_user_id in test_user_ids:
        # Get the articles that the user has liked
        user_interactions = get_articles_that_user_liked(test_user_id)
        actual_article_ids_for_user = user_interactions['item_id_adj'].values

        if (len(actual_article_ids_for_user) == 0):
            # If the user has not liked any articles, we will skip this user
            # Print the user id so that we can keep track of the progress
            print('Skipping user: ', test_user_id)
            num_users -= 1
            continue

        # Get the articles that the user-based filtering algorithm has recommended
        test_user_idx = test_user_id - 1
        article_recommendation = pd.DataFrame(article_pred_df.iloc[test_user_idx].sort_values(ascending=False))
        article_recommendation.reset_index(inplace=True)
        article_recommendation['index'] = article_recommendation['index'] + 1
        article_recommendation.rename(columns={'index': 'article_id', test_user_idx: 'score'}, inplace=True)
        merged = pd.merge(article_recommendation, cnt, left_on='article_id', right_on='item_id_adj', how='left')
        keep = ['article_id', 'score', 'title', 'interaction_type']
        merged = merged.drop(columns=[col for col in merged if col not in keep])
        merged = merged[~(merged['title'].isna())]
        merged = merged[merged['interaction_type'] != 'content_pulled_out']
        predicted_article_ids_for_user = merged['article_id'].values[:num_pred]

        # Calculate precision and recall
        correctly_predicted_article_ids = set(predicted_article_ids_for_user) & set(actual_article_ids_for_user)
        precision = len(correctly_predicted_article_ids) / len(predicted_article_ids_for_user)
        recall = len(correctly_predicted_article_ids) / len(actual_article_ids_for_user)
        
        total_precision += precision
        total_recall += recall
    
    # Return the average precision and recall as a tuple
    return (total_precision / num_users, total_recall / num_users)

In [202]:
# Evaluate the user-based filtering algorithm and store the results in 2 variables
avg_precision, avg_recall = evaluate_user_based_filtering(test)

In [203]:
# Round the results to 3 decimal places and print them
print('Average precision: ', round(avg_precision, 3))
print('Average recall: ', round(avg_recall, 3))

Average precision:  0.13
Average recall:  0.205


Check if ALS does better.

Expose a method to get recommendations for a user.

In [204]:
def get_articles_for_user_from_user_based(user_id, n=-1):
    user_idx = user_id - 1

    recommendation = pd.DataFrame(article_pred_df.iloc[user_idx].sort_values(ascending=False))

    recommendation.reset_index(inplace=True)

    recommendation['index'] = recommendation['index'] + 1

    recommendation.rename(columns={'index': 'article_id', user_idx: 'score'}, inplace=True)

    merged = pd.merge(recommendation, cnt, left_on='article_id', right_on='item_id_adj', how='left')

    keep = ['article_id', 'title', 'score', 'topics', 'interaction_type']

    merged = merged.drop(columns=[col for col in merged if col not in keep])

    merged = merged[merged['interaction_type'] != 'content_pulled_out']

    # Drop rows with NaN values
    merged.dropna(inplace=True)

    # Reset the index
    merged.reset_index(inplace=True, drop=True)

    # Drop interaction_type
    merged = merged.drop(columns=['interaction_type'])

    # Sort by score
    merged = merged.sort_values(by='score', ascending=False)

    # Return the top n articles if n is specified
    if (n > 0):
        return merged[:n]

    return merged

In [205]:
get_articles_for_user_from_user_based(test_user_id)

Unnamed: 0,article_id,score,title,topics
0,1855,1.501111,The Broken Window Theory,"[Computer Programming, Digital Marketing, Appl..."
1,885,1.457415,Program your way to your next grocery delivery,"[Facebook, Computer Programming, Digital Marke..."
2,1570,1.316463,Visual Thinking and Learning 3.0 working toget...,"[Data Science & Machine Learning, Computer Pro..."
3,2211,1.187492,SpotHero is ready for the future of autonomous...,"[Digital Marketing, Facebook, Data Science & M..."
4,52,1.000000,Chrome OS now has Material Design for the desktop,"[Google, Computer Programming, Apple, Digital ..."
...,...,...,...,...
848,2317,0.000000,The Best Advice From Quora on 'How to Learn Ma...,"[Data Science & Machine Learning, Digital Mark..."
847,2316,0.000000,Innovation is in all the wrong places,"[Digital Marketing, Facebook, Computer Program..."
846,2315,0.000000,Blog | Niantic,"[Cloud Computing, Digital Marketing, Google, E..."
845,2314,0.000000,Largest botnet attack in history peaks at over...,"[Computer Programming, Digital Marketing, Clou..."


In [206]:
get_articles_for_user_from_user_based(test_user_id, 10)

Unnamed: 0,article_id,score,title,topics
0,1855,1.501111,The Broken Window Theory,"[Computer Programming, Digital Marketing, Appl..."
1,885,1.457415,Program your way to your next grocery delivery,"[Facebook, Computer Programming, Digital Marke..."
2,1570,1.316463,Visual Thinking and Learning 3.0 working toget...,"[Data Science & Machine Learning, Computer Pro..."
3,2211,1.187492,SpotHero is ready for the future of autonomous...,"[Digital Marketing, Facebook, Data Science & M..."
4,52,1.0,Chrome OS now has Material Design for the desktop,"[Google, Computer Programming, Apple, Digital ..."
5,1484,1.0,Accenture Launches Content Studio,"[Digital Marketing, E-Commerce, Cloud Computin..."
6,1856,0.806182,Why Walmart wants to buy Jet.com and what you ...,"[Digital Marketing, E-Commerce, Cloud Computin..."
7,1816,0.791662,Three Lessons for Design-Driven Success,"[Digital Marketing, Apple, Computer Programmin..."
8,1807,0.763826,10 Modern Software Over-Engineering Mistakes,"[Computer Programming, Digital Marketing, Data..."
9,1628,0.511688,How This Former Google Engineer Is Bringing Bl...,"[Cryptocurrency, Data Science & Machine Learni..."


## Alternating Least Squares method

#### Create sparse User-Article matrix

In [207]:
from scipy.sparse import csr_matrix

Every observed interaction in the CSR matrix is stored with the same confidence value, alpha; the rating values themselves are not used

In [208]:
txns.head()

Unnamed: 0,event_timestamp,rating_original,consumer_id_adj,item_id_adj,rating
0,1465413032,1,1,1,1.0
1,1465481798,2,1,2,2.5
2,1462296634,1,1,8,1.0
3,1460648226,1,1,28,1.25
4,1470773847,1,1,38,1.0


In [209]:
keep = ['consumer_id_adj', 'item_id_adj', 'rating']

In [210]:
txns_mod = txns.drop(columns=[col for col in txns.columns if col not in keep])

In [211]:
txns_mod.head()

Unnamed: 0,consumer_id_adj,item_id_adj,rating
0,1,1,1.0
1,1,2,2.5
2,1,8,1.0
3,1,28,1.25
4,1,38,1.0


In [212]:
txns_mod.describe()

Unnamed: 0,consumer_id_adj,item_id_adj,rating
count,40710.0,40710.0,40710.0
mean,430.560624,1493.917416,1.157177
std,450.92468,913.586706,0.397953
min,1.0,1.0,1.0
25%,81.0,584.0,1.0
50%,254.0,1603.5,1.0
75%,648.0,2277.0,1.0
max,1895.0,2987.0,4.5


In [213]:
alpha = 40

In [214]:
txns_mod.shape

(40710, 3)

In [215]:
txns_mod.shape[0]

40710

In [216]:
x = [alpha] * txns_mod.shape[0]

In [217]:
len(x)

40710

In [218]:
sparse_user_article = csr_matrix( ([alpha]*txns_mod.shape[0], (txns_mod['consumer_id_adj'], txns_mod['item_id_adj']) ))

In [219]:
sparse_user_article

<1896x2988 sparse matrix of type '<class 'numpy.intc'>'
	with 40710 stored elements in Compressed Sparse Row format>

In [220]:
n_users

1895

In [221]:
n_articles

2987

Matrix dimensions match with the number of users & articles, accounting for the extra row at index 0
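
The extra row and column appear because `csr_matrix` infers its shape from the largest index it sees, and the adjusted ids start at 1 rather than 0. A small sketch with invented ids (the toy lists below are hypothetical, not taken from the dataset):

```python
from scipy.sparse import csr_matrix

alpha = 40
# Hypothetical 1-based ids: 3 users, 4 articles
user_ids = [1, 1, 2, 3]
article_ids = [1, 4, 2, 4]

# Every observed interaction gets the same confidence value alpha
m = csr_matrix(([alpha] * len(user_ids), (user_ids, article_ids)))

print(m.shape)               # (4, 5): one more than the largest id in each dimension
print(m.nnz)                 # 4 stored interactions
print(m.toarray()[0].sum())  # 0: row 0 stays empty because no id is 0
```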

Convert to array

In [222]:
csr_user_array = sparse_user_article.toarray()

In [223]:
csr_user_array

array([[ 0,  0,  0, ...,  0,  0,  0],
       [ 0, 40, 40, ...,  0,  0,  0],
       [ 0,  0, 40, ...,  0,  0,  0],
       ...,
       [ 0,  0,  0, ...,  0,  0,  0],
       [ 0,  0,  0, ...,  0,  0,  0],
       [ 0,  0,  0, ...,  0,  0,  0]], dtype=int32)

In [224]:
n_users

1895

In [225]:
len(csr_user_array), len(csr_user_array[0])

(1896, 2988)

Dimensions match with the matrix

In [226]:
max(csr_user_array[1])

40

Create article-user sparse matrix

In [227]:
sparse_article_user = sparse_user_article.T.tocsr()

In [228]:
sparse_article_user

<2988x1896 sparse matrix of type '<class 'numpy.intc'>'
	with 40710 stored elements in Compressed Sparse Row format>

Shape matches

In [229]:
csr_article_array = sparse_article_user.toarray()

#### Create train & test data

In [230]:
%pip install implicit

Note: you may need to restart the kernel to use updated packages.


In [231]:
from implicit.evaluation import train_test_split

In [232]:
sparse_article_user

<2988x1896 sparse matrix of type '<class 'numpy.intc'>'
	with 40710 stored elements in Compressed Sparse Row format>

In [233]:
train, test = train_test_split(sparse_user_article, train_percentage=0.8)

In [234]:
train

<1896x2988 sparse matrix of type '<class 'numpy.intc'>'
	with 32580 stored elements in Compressed Sparse Row format>

In [235]:
test

<1896x2988 sparse matrix of type '<class 'numpy.intc'>'
	with 8130 stored elements in Compressed Sparse Row format>
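
The train and test counts add up to the 40710 stored interactions, but the train share is not exactly 80%. That is what one would expect if each interaction is assigned to train or test independently at random, which is my understanding of how the library splits. A pure-Python sketch of that idea, with a hypothetical helper `split_interactions` and invented triples:

```python
import random


def split_interactions(interactions, train_percentage=0.8, seed=42):
    """Assign each (user, article, value) triple to train or test at random."""
    rng = random.Random(seed)
    train_part, test_part = [], []
    for triple in interactions:
        if rng.random() < train_percentage:
            train_part.append(triple)
        else:
            test_part.append(triple)
    return train_part, test_part


# Invented data: 50 users x 20 articles, all with confidence 40
interactions = [(u, a, 40) for u in range(1, 51) for a in range(1, 21)]
train_part, test_part = split_interactions(interactions)

# The two parts partition the data; the train share is close to, not exactly, 80%
print(len(train_part), len(test_part), len(train_part) + len(test_part))
```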

#### Building the ALS Model

In [236]:
from implicit.als import AlternatingLeastSquares

In [237]:
model = AlternatingLeastSquares(factors=60, regularization=0.1, iterations=60, calculate_training_loss=False)


Intel MKL BLAS detected. Its highly recommend to set the environment variable 'export MKL_NUM_THREADS=1' to disable its internal multithreading



In [238]:
# model

Training

In [239]:
model.fit(train)

  0%|          | 0/60 [00:00<?, ?it/s]

In [240]:
# test

In [241]:
# test_user_id = 114

In [242]:
user_interactions = get_articles_that_user_liked(test_user_id)

New Implicit API expects (user, item) sparse matrix as input

In [243]:
model.recommend(test_user_id, sparse_user_article[test_user_id], N=20, filter_already_liked_items=False)

(array([1568,  308,  299, 1484,  577, 1577,  163, 1774, 1570,  885,  435,
        2581, 1766, 1804, 1551, 2457, 2724, 1547, 1603, 1776]),
 array([0.8724034 , 0.80005395, 0.7738688 , 0.69204986, 0.5867467 ,
        0.5714423 , 0.49449414, 0.47777113, 0.47732183, 0.47504848,
        0.46001446, 0.4151766 , 0.41483334, 0.41109627, 0.40823567,
        0.40461266, 0.39848226, 0.3758666 , 0.37432632, 0.3522518 ],
       dtype=float32))

In [244]:
ids, scores = model.recommend(test_user_id, sparse_user_article[test_user_id], N=20, filter_already_liked_items=False)

In [245]:
out = pd.DataFrame({'article_id': ids, 'als_score': scores})

In [246]:
# out

In [247]:
out.head(num_pred)

Unnamed: 0,article_id,als_score
0,1568,0.872403
1,308,0.800054
2,299,0.773869
3,1484,0.69205
4,577,0.586747
5,1577,0.571442
6,163,0.494494
7,1774,0.477771
8,1570,0.477322
9,885,0.475048


In [248]:
out.shape

(20, 2)

In [249]:
user_interactions.head(10)

Unnamed: 0,item_id_adj,rating
35003,52,1.0
35004,299,1.0
35005,308,1.0
35006,1484,1.0
35007,1568,1.0


In [250]:
user_interactions.shape

(5, 2)

In [251]:
actual_article_ids_for_user = set(user_interactions['item_id_adj'].values)

In [252]:
predicted_article_ids_for_user = set(out['article_id'].values)

In [253]:
correctly_predicted_article_ids = actual_article_ids_for_user & predicted_article_ids_for_user

In [254]:
precision = len(correctly_predicted_article_ids) / len(predicted_article_ids_for_user)

In [255]:
recall = len(correctly_predicted_article_ids) / len(actual_article_ids_for_user)

In [256]:
# Print the precision and recall
print('Precision: ', precision)
print('Recall: ', recall)

Precision:  0.2
Recall:  0.8
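
The two numbers can be checked by hand from the outputs above: 4 of the 20 recommended articles are among the 5 articles the user interacted with. The ids below are copied from those outputs; only the variable names are new.

```python
# Article ids copied from the model.recommend output above
predicted_ids = {1568, 308, 299, 1484, 577, 1577, 163, 1774, 1570, 885,
                 435, 2581, 1766, 1804, 1551, 2457, 2724, 1547, 1603, 1776}
# Article ids copied from user_interactions above
actual_ids = {52, 299, 308, 1484, 1568}

hits = predicted_ids & actual_ids
print(sorted(hits))                    # [299, 308, 1484, 1568]
print(len(hits) / len(predicted_ids))  # 0.2 (precision: 4 of 20)
print(len(hits) / len(actual_ids))     # 0.8 (recall: 4 of 5)
```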


Similar to user-based collaborative filtering, evaluate ALS

implicit.evaluation already provides a precision_at_k function

In [257]:
from implicit.evaluation import precision_at_k

In [258]:
p_at_k = precision_at_k(model, train, test, K=10)

  0%|          | 0/1309 [00:00<?, ?it/s]

In [259]:
# Round the results to 3 decimal places and print them
print('Precision at k: ', round(p_at_k, 3))

Precision at k:  0.133
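
A simplified sketch of what a precision@k score measures: for each user with held-out items, count how many of the top-k recommendations are in the held-out set. The helper name and toy data are hypothetical, and the normalisation by `min(k, number of held-out items)` is my understanding of the library's behaviour, so treat it as an assumption rather than the library's exact definition.

```python
def precision_at_k_sketch(recommended, held_out, k=10):
    """Pooled precision@k: hits in each user's top-k over the hits that were possible."""
    hits = 0
    possible = 0
    for user_id, test_items in held_out.items():
        if not test_items:
            continue  # users without held-out items are not scored
        top_k = recommended.get(user_id, [])[:k]
        hits += len(set(top_k) & set(test_items))
        possible += min(k, len(test_items))
    return hits / possible if possible else 0.0


# Toy data: user 1 has both held-out items recommended, user 2 has none
recommended = {1: [5, 7, 9], 2: [4, 6, 8]}
held_out = {1: {7, 9}, 2: {1}, 3: set()}
score = precision_at_k_sketch(recommended, held_out, k=3)
print(round(score, 3))  # 0.667 (2 hits out of 3 possible)
```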


Check if better precision@k is possible with hyperparameter tuning

In [260]:
import itertools

In [261]:
if False:
    factors = [60, 80, 85, 87, 90, 92, 95, 100]
    regularization = [0.1, 0.11, 0.115, 0.12, 0.125]
    iterations = [30, 35, 40, 45, 50, 60]

    # Create a DataFrame to store the results
    results = pd.DataFrame(columns=['factors', 'regularization', 'iterations', 'precision_at_k'])
    for (f, r, i) in itertools.product(factors, regularization, iterations):
        model = AlternatingLeastSquares(factors=f, regularization=r, iterations=i, calculate_training_loss=False)
        model.fit(train, show_progress=False)
        p_at_k = precision_at_k(model, train, test, K=10, show_progress=False)

        # Append the results to the DataFrame
        # Create a temp DataFrame to store the results
        temp_results = pd.DataFrame([[f, r, i, p_at_k]], columns=['factors', 'regularization', 'iterations', 'precision_at_k'])
        
        # Concatenate the temp DataFrame to the results DataFrame
        results = pd.concat([results, temp_results], ignore_index=True)

In [262]:
if False:
    # Sort the results by precision_at_k and print the top 5
    results.sort_values(by='precision_at_k', ascending=False, inplace=True)
    results.head()
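
The same grid-search pattern can be written by collecting rows in a list and building the DataFrame once at the end, which avoids a `pd.concat` on every iteration. In this sketch `toy_score` is a hypothetical stand-in for fitting the model and computing precision@k, and the grid values are a reduced illustration.

```python
import itertools


def toy_score(factors, regularization, iterations):
    # Hypothetical stand-in for: fit the model, then compute precision@k
    penalty = (abs(factors - 90) / 100
               + abs(regularization - 0.115)
               + abs(iterations - 40) / 100)
    return 1.0 / (1.0 + penalty)


grid = {
    'factors': [60, 90, 100],
    'regularization': [0.1, 0.115],
    'iterations': [30, 40],
}

rows = []
for f, r, i in itertools.product(grid['factors'], grid['regularization'], grid['iterations']):
    rows.append({'factors': f, 'regularization': r, 'iterations': i,
                 'precision_at_k': toy_score(f, r, i)})

# Pick the best combination; pd.DataFrame(rows) would give the same table as above
best = max(rows, key=lambda row: row['precision_at_k'])
print(len(rows), best)
```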

Got best params from tuning

precision@k = 0.144

In [263]:
best_user_based_f = 92
best_user_based_r = 0.115
best_user_based_i = 40

In [264]:
best_user_based_als = AlternatingLeastSquares(
    factors=best_user_based_f, 
    regularization=best_user_based_r, 
    iterations=best_user_based_i, 
    calculate_training_loss=False
)
best_user_based_als.fit(train)

  0%|          | 0/40 [00:00<?, ?it/s]

In [265]:
ids, scores = best_user_based_als.recommend(test_user_id, sparse_user_article[test_user_id], N=20, filter_already_liked_items=True)

The tuned precision@k (0.144) is higher than the average precision of user-based collaborative filtering (0.13), so ALS can be used for getting articles for a user.

Expose method

In [266]:
def get_articles_for_user_from_als(user_id, n=20):
    global best_user_based_als
    if not best_user_based_als:
        best_user_based_als = AlternatingLeastSquares(
            factors=best_user_based_f, 
            regularization=best_user_based_r, 
            iterations=best_user_based_i, 
            calculate_training_loss=False
        )
        best_user_based_als.fit(train)
    # Fetch 50 candidates: rows without a match in cnt are dropped after the merge below
    ids, scores = best_user_based_als.recommend(user_id, sparse_user_article[user_id], N=50, filter_already_liked_items=True)

    out = pd.DataFrame({'item_id_adj': ids, 'score': scores})

    # Merge out with cnt on item_id_adj
    merged = out.merge(cnt, how='left', on='item_id_adj')

    # Keep only item_id_adj, title, score, and topics
    merged = merged[['item_id_adj', 'title', 'score', 'topics']]

    # Drop rows with NaN values
    merged.dropna(inplace=True)

    # Reset index
    merged.reset_index(drop=True, inplace=True)

    # Round score to 3 decimal places
    merged['score'] = merged['score'].apply(lambda x: round(x, 3))

    # Sort by score
    merged.sort_values(by='score', ascending=False, inplace=True)

    return merged[:n]

In [267]:
get_articles_for_user_from_als(test_user_id, n=10)

Unnamed: 0,item_id_adj,title,score,topics
0,885,Program your way to your next grocery delivery,0.417,"[Facebook, Computer Programming, Digital Marke..."
1,163,"Forget The Internet Of Things, There Is A Digi...",0.396,"[Digital Marketing, Data Science & Machine Lea..."
2,1628,How This Former Google Engineer Is Bringing Bl...,0.374,"[Cryptocurrency, Data Science & Machine Learni..."
3,1570,Visual Thinking and Learning 3.0 working toget...,0.356,"[Data Science & Machine Learning, Computer Pro..."
4,1378,Google Ranking Factors: The Complete List,0.349,"[Google, Computer Programming, E-Commerce, Dat..."
5,1808,You don't talk about refactoring club,0.339,"[Computer Programming, Digital Marketing, E-Co..."
6,297,22 Mobile Stats Everyone Should Know - DZone M...,0.324,"[Digital Marketing, Google, Computer Programmi..."
7,2035,Building Flipkart Lite: A Progressive Web App,0.304,"[Computer Programming, Google, Facebook, Cloud..."
8,1559,[Retro] Celebration Grids - Management 3.0,0.288,"[Computer Programming, Data Science & Machine ..."
9,1518,2 terrific #MarTech talks on the rise of AI in...,0.28,"[Data Science & Machine Learning, Digital Mark..."


## Getting articles matching another article

Consider item-based collaborative filtering and content-based filtering

### Item-based collaborative filtering

Use article_similarity matrix constructed earlier

In [268]:
article_similarity.shape

(2987, 2987)

In [269]:
n_articles

2987

In [271]:
data_matrix_test.shape

(1895, 2987)

In [272]:
data_matrix_test.T.shape

(2987, 1895)

In [273]:
other_article_prediction = np.dot(article_similarity, data_matrix_test.T)

In [274]:
other_article_prediction.shape

(2987, 1895)

In [275]:
other_article_pred_df = pd.DataFrame(other_article_prediction)

In [276]:
other_article_pred_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1885,1886,1887,1888,1889,1890,1891,1892,1893,1894
0,6.271056,41.288538,2.186761,0.294807,4.455233,0.816352,18.228467,1.604078,1.610983,0.680827,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,6.653296,31.719906,2.643238,0.181775,3.983794,1.447462,18.754569,3.507377,2.933203,0.807899,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3.641996,23.057265,3.732405,2.062648,3.644192,2.536398,21.389102,2.567386,3.44371,0.779039,...,0.0,0.0,0.0,0.0,0.0,0.158502,0.0,0.0,0.0,0.0
3,4.661226,7.71371,2.151167,0.386253,4.935273,0.368157,18.373473,1.466528,2.704127,0.265914,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1.663325,9.506578,1.074286,0.104765,1.838503,0.958249,1.973059,2.380764,1.499875,0.911702,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
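
What the dot product does, on a toy example: each article's score for a user is the similarity-weighted sum of that user's ratings. The matrices below are invented for illustration and the `_toy` names are hypothetical.

```python
import numpy as np

# Hypothetical data: 3 articles, 2 users
article_similarity_toy = np.array([
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
# Rows are users, columns are articles:
# user 0 read article 0, user 1 read article 2
ratings_toy = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
])

# Result has one row per article and one column per user
scores = np.dot(article_similarity_toy, ratings_toy.T)
print(scores.shape)   # (3, 2)
print(scores[:, 0])   # user 0: article 1 scores 0.5 because it is similar to article 0
```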


#### Test for one article

In [277]:
test_article_id = 1190

In [278]:
test_article_idx = test_article_id - 1

In [279]:
article_similarity[test_article_idx]

array([0., 0., 0., ..., 0., 0., 0.])

In [280]:
df = pd.DataFrame(article_similarity[test_article_idx], columns=['score'])

In [281]:
df.head()

Unnamed: 0,score
0,0.0
1,0.0
2,0.0
3,0.0
4,0.0


In [282]:
df.reset_index(inplace=True)

In [283]:
df.head()

Unnamed: 0,index,score
0,0,0.0
1,1,0.0
2,2,0.0
3,3,0.0
4,4,0.0


In [284]:
df['index'] = df['index'] + 1

In [285]:
df.head()

Unnamed: 0,index,score
0,1,0.0
1,2,0.0
2,3,0.0
3,4,0.0
4,5,0.0


In [286]:
df.rename(columns={'index': 'item_id_adj'}, inplace=True)

In [287]:
df.sort_values(by='score', ascending=False, inplace=True)

In [288]:
df.head()

Unnamed: 0,item_id_adj,score
1189,1190,1.0
917,918,1.0
1299,1300,0.707107
675,676,0.5
481,482,0.5
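
Scores such as 1.0, 0.707 and 0.5 are what cosine similarity produces on sparse interaction vectors. The sketch below assumes `article_similarity` was built with cosine similarity (it was constructed earlier in the notebook); the vectors are invented.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Each vector holds one article's interactions from four hypothetical users
article_a = [1, 0, 0, 0]
article_b = [1, 0, 0, 0]   # same single reader as article_a
article_c = [1, 1, 0, 0]   # shares one of its two readers with article_a
article_d = [1, 1, 1, 1]   # shares one of its four readers with article_a

print(round(cosine_similarity(article_a, article_b), 3))  # 1.0
print(round(cosine_similarity(article_a, article_c), 3))  # 0.707
print(round(cosine_similarity(article_a, article_d), 3))  # 0.5
```

A score of 1.0 between two different articles therefore only means that the same users interacted with both, which is likely when each has very few readers.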


In [289]:
cnt[(cnt['item_id_adj'] == 1190) | (cnt['item_id_adj'] == 918)][['item_id_adj', 'title', 'text_description', 'topics']]

Unnamed: 0,item_id_adj,title,text_description,topics
0,1190,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,"[Cryptocurrency, Computer Programming, Data Sc..."
191,918,"Proof of Individuality, the New-Age Security o...",Proof of Individuality protocol is designed to...,"[Cryptocurrency, Computer Programming, Data Sc..."


Expose method

In [290]:
def get_articles_matching_article_from_item_based(article_id, n=-1, all=False):
    article_idx = article_id - 1

    out = pd.DataFrame(article_similarity[article_idx], columns=['score'])

    out.reset_index(inplace=True)

    out['index'] = out['index'] + 1

    out.rename(columns={'index': 'item_id_adj'}, inplace=True)

    out.sort_values(by='score', ascending=False, inplace=True)

    # Merge out with cnt on item_id_adj
    merged = out.merge(cnt, how='left', on='item_id_adj')

    # Keep only item_id_adj, title, score, and topics
    merged = merged[['item_id_adj', 'title', 'score', 'topics']]

    # Drop rows with NaN values
    merged.dropna(inplace=True)

    # Reset index
    merged.reset_index(drop=True, inplace=True)

    # Round score to 3 decimal places
    merged['score'] = merged['score'].apply(lambda x: round(x, 3))

    # Sort by score
    merged.sort_values(by='score', ascending=False, inplace=True)

    if n == -1 or all:
        return merged

    return merged[:n]

In [291]:
get_articles_matching_article_from_item_based(test_article_id, n=10)

Unnamed: 0,item_id_adj,title,score,topics
0,1190,"Ethereum, a Virtual Currency, Enables Transact...",1.0,"[Cryptocurrency, Computer Programming, Data Sc..."
1,918,"Proof of Individuality, the New-Age Security o...",1.0,"[Cryptocurrency, Computer Programming, Data Sc..."
2,1300,Ethereum and Bitcoin Are Market Leaders But No...,0.707,"[Cryptocurrency, Digital Marketing, E-Commerce..."
3,676,Microsoft Continues to Embrace Ethereum & Bitc...,0.5,"[Cryptocurrency, Operating Systems & Runtimes,..."
4,482,Deep-learning neural network creates its own i...,0.5,"[Data Science & Machine Learning, Computer Pro..."
5,916,Gold Backed Digix Raises Millions in Hours on ...,0.5,"[Cryptocurrency, Computer Programming, Digital..."
6,1005,Google Cloud Platform for AWS Professionals,0.408,"[Cloud Computing, Google, Computer Programming..."
7,549,"Google Failure, Ethereum Leaps, ECB Giveout in...",0.408,"[Cryptocurrency, Google, Digital Marketing, Co..."
9,158,What Apple's differential privacy means for yo...,0.258,"[Apple, Data Science & Machine Learning, Compu..."
8,353,Getting Started with Activity & Fragment Trans...,0.258,"[Computer Programming, Google, E-Commerce, Ope..."


### ALS for Articles

Use sparse_article_user created earlier

In [292]:
item_train, item_test = train_test_split(sparse_article_user, train_percentage=0.8, random_state=42)

In [293]:
model = AlternatingLeastSquares(factors=60, regularization=0.1, iterations=60, calculate_training_loss=False)

In [294]:
model.fit(item_train)

  0%|          | 0/60 [00:00<?, ?it/s]

In [295]:
precision_at_k(model, item_train, item_test, K=10)

  0%|          | 0/2262 [00:00<?, ?it/s]

0.18956451395442903

Precision@k is about 0.19. Check for a better value with hyperparameter tuning.

In [296]:
def item_based_hyperparameter_tuning():
    factors = [10, 20, 30, 35, 40, 45, 50, 55, 60, 65, 70]
    regularization = [0.7, 0.8, 0.9, 0.95, 1, 1.1, 1.2, 1.5]
    iterations = [80, 90, 100, 110, 120, 130, 140, 150]

    # Create a DataFrame to store the results
    results = pd.DataFrame(columns=['factors', 'regularization', 'iterations', 'precision_at_k'])
    for (f, r, i) in itertools.product(factors, regularization, iterations):
        model = AlternatingLeastSquares(factors=f, regularization=r, iterations=i, calculate_training_loss=False)
        model.fit(item_train, show_progress=False)
        p_at_k = precision_at_k(model, item_train, item_test, K=10, show_progress=False)

        # Append the results to the DataFrame
        # Create a temp DataFrame to store the results
        temp_results = pd.DataFrame([[f, r, i, p_at_k]], columns=['factors', 'regularization', 'iterations', 'precision_at_k'])
        
        # Concatenate the temp DataFrame to the results DataFrame
        results = pd.concat([results, temp_results], ignore_index=True)

    # Sort the results by precision_at_k (best first) and return them
    results.sort_values(by='precision_at_k', ascending=False, inplace=True)
    return results

In [297]:
if False:
    results = item_based_hyperparameter_tuning()
    print(results.head())

After hyperparameter tuning

In [298]:
best_article_als_f = 20
best_article_als_r = 1.2
best_article_als_i = 120

In [299]:
best_item_als = AlternatingLeastSquares(
    factors=best_article_als_f, 
    regularization=best_article_als_r, 
    iterations=best_article_als_i, 
    calculate_training_loss=False
)
best_item_als.fit(item_train)

  0%|          | 0/120 [00:00<?, ?it/s]

Test for one article

In [300]:
test_article_id = 1190

In [301]:
ids, scores = best_item_als.recommend(test_article_id, sparse_article_user[test_article_id], N=20, filter_already_liked_items=False)

In [302]:
# Create a DataFrame of the recommended article ids and scores
collab_out = pd.DataFrame({'article_id': ids, 'Score': scores})

In [303]:
collab_out.head()

Unnamed: 0,article_id,Score
0,77,0.559194
1,196,0.39084
2,244,0.323323
3,538,0.322372
4,44,0.302859


In [304]:
# Define a function to get the article title from the article id
def get_article_title(article_id):
    # If the article id is not in the article dataframe, log that it is missing
    if article_id not in cnt['item_id_adj'].values:
        print('Missing article id: ', article_id)
        return None
    return cnt[cnt['item_id_adj'] == article_id]['title'].values[0]

In [305]:
def get_article_topics(article_id):
    # If the article id is not in the article dataframe, log that it is missing
    if article_id not in cnt['item_id_adj'].values:
        print('Missing article id: ', article_id)
        return None
    return cnt[cnt['item_id_adj'] == article_id]['topics'].values[0]

In [306]:
# Get the article title from the article ids
collab_out['title'] = collab_out['article_id'].apply(lambda x: get_article_title(x))

Missing article id:  77
Missing article id:  244
Missing article id:  44
Missing article id:  421


In [307]:
collab_out.head()

Unnamed: 0,article_id,Score,title
0,77,0.559194,
1,196,0.39084,How I got into the top 15 of a Kaggle competit...
2,244,0.323323,
3,538,0.322372,Announcing SyntaxNet: The World's Most Accurat...
4,44,0.302859,


In [308]:
# Get the article topics from the article ids
collab_out['topics'] = collab_out['article_id'].apply(lambda x: get_article_topics(x))

Missing article id:  77
Missing article id:  244
Missing article id:  44
Missing article id:  421


In [309]:
collab_out.head()

Unnamed: 0,article_id,Score,title,topics
0,77,0.559194,,
1,196,0.39084,How I got into the top 15 of a Kaggle competit...,"[Computer Programming, Data Science & Machine ..."
2,244,0.323323,,
3,538,0.322372,Announcing SyntaxNet: The World's Most Accurat...,"[Data Science & Machine Learning, Computer Pro..."
4,44,0.302859,,


In [310]:
# Drop rows with missing article titles
collab_out.dropna(inplace=True)

In [311]:
collab_out

Unnamed: 0,article_id,Score,title,topics
1,196,0.39084,How I got into the top 15 of a Kaggle competit...,"[Computer Programming, Data Science & Machine ..."
3,538,0.322372,Announcing SyntaxNet: The World's Most Accurat...,"[Data Science & Machine Learning, Computer Pro..."
5,224,0.298958,An overview of web service solutions in Drupal 8,"[E-Commerce, Computer Programming, Cloud Compu..."
6,846,0.295025,The insurance tech equation,"[Digital Marketing, Cryptocurrency, Data Scien..."
7,73,0.281402,Hero unveils a new home gadget to help you tra...,"[Computer Programming, Digital Marketing, Appl..."
8,92,0.28042,How We Migrated Our Backend to Spring Boot in ...,"[Computer Programming, Cloud Computing, Operat..."
9,394,0.279102,"The New App Store: Subscription Pricing, Faste...","[Apple, Google, Facebook, Computer Programming..."
10,212,0.276676,Meet Aquifer: A build system for easier Drupal...,"[Computer Programming, E-Commerce, Operating S..."
11,587,0.272253,Enterprise developers look out: this week on G...,"[Cloud Computing, Operating Systems & Runtimes..."
12,2,0.266744,Top 10 Intranet Trends of 2016,"[Computer Programming, Digital Marketing, Goog..."


Expose method

In [312]:
def get_articles_matching_article_from_als(article_id, n=20, all=False):
    ids, scores = best_item_als.similar_items(
        article_id, item_users=sparse_article_user, N=50 if not all else n_articles)

    out = pd.DataFrame({'item_id_adj': ids, 'score': scores})

    merged = pd.merge(out, cnt, how='left', on='item_id_adj')

    keep = ['item_id_adj', 'score', 'title', 'topics']

    merged = merged.drop(columns=[col for col in merged if col not in keep])

    merged.dropna(inplace=True)

    # reset index
    merged.reset_index(drop=True, inplace=True)

    # round score to 3 decimal places
    merged['score'] = merged['score'].apply(lambda x: round(x, 3))

    # sort by score
    merged.sort_values(by='score', ascending=False, inplace=True)

    if all:
        return merged

    return merged[:n]

In [313]:
get_articles_matching_article_from_als(test_article_id, n=10)

Unnamed: 0,item_id_adj,score,title,topics
0,1190,1.0,"Ethereum, a Virtual Currency, Enables Transact...","[Cryptocurrency, Computer Programming, Data Sc..."
1,567,0.777,How Airbnb uses Machine Learning to Detect Hos...,"[Computer Programming, Data Science & Machine ..."
2,644,0.758,Presenting to the Boss(es) | Pluralsight,"[Digital Marketing, Computer Programming, Goog..."
3,1191,0.693,IEEE to Talk Blockchain at Cloud Computing Oxf...,"[Cryptocurrency, Cloud Computing, Apple, Data ..."
4,253,0.688,[E-learning] Design Thinking for Innovation - ...,"[Digital Marketing, Computer Programming, E-Co..."
5,285,0.68,Chromebase for meetings makes video-conferenci...,"[Google, Digital Marketing, Computer Programmi..."
6,57,0.68,Spotify UI built with HTML / CSS - Freebiesbug,"[Computer Programming, Google, Apple, Facebook..."
7,1079,0.668,TPOT: A Python tool for automating data science,"[Data Science & Machine Learning, Computer Pro..."
8,436,0.643,The #digital upstarts offering app-only #banki...,"[Digital Marketing, Cryptocurrency, Facebook, ..."
9,619,0.641,Running Kubernetes Locally via Docker,"[Operating Systems & Runtimes, Data Science & ..."


### Content-based filtering

#### Derive keywords from the article text

In [314]:
cnt.columns

Index(['index', 'event_timestamp', 'interaction_type', 'item_type', 'item_url',
       'title', 'text_description', 'language', 'text_description_lemmatized',
       'item_id_adj', 'topics'],
      dtype='object')

In [315]:
cnt.head()

Unnamed: 0,index,event_timestamp,interaction_type,item_type,item_url,title,text_description,language,text_description_lemmatized,item_id_adj,topics
0,1,1459193988,content_present,HTML,http://www.nytimes.com/2016/03/28/business/dea...,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,en,"[work, still, early, first, full, public, vers...",1190,"[Cryptocurrency, Computer Programming, Data Sc..."
1,2,1459194146,content_present,HTML,http://cointelegraph.com/news/bitcoin-future-w...,Bitcoin Future: When GBPcoin of Branson Wins O...,The alarm clock wakes me at 8:00 with stream o...,en,"[alarm, clock, wake, stream, advert, free, bro...",811,"[Cryptocurrency, Computer Programming, Data Sc..."
2,3,1459194474,content_present,HTML,https://cloudplatform.googleblog.com/2016/03/G...,Google Data Center 360° Tour,We're excited to share the Google Data Center ...,en,"[excited, share, google, data, center, tour, y...",559,"[Google, Cloud Computing, Data Science & Machi..."
3,4,1459194497,content_present,HTML,https://bitcoinmagazine.com/articles/ibm-wants...,"IBM Wants to ""Evolve the Internet"" With Blockc...",The Aite Group projects the blockchain market ...,en,"[aite, group, project, blockchain, market, cou...",2988,"[Cryptocurrency, Computer Programming, Data Sc..."
4,5,1459194522,content_present,HTML,http://www.coindesk.com/ieee-blockchain-oxford...,IEEE to Talk Blockchain at Cloud Computing Oxf...,One of the largest and oldest organizations fo...,en,"[one, largest, oldest, organization, computing...",1191,"[Cryptocurrency, Cloud Computing, Apple, Data ..."


In [316]:
cnt.text_description

0       All of this work is still very early. The firs...
1       The alarm clock wakes me at 8:00 with stream o...
2       We're excited to share the Google Data Center ...
3       The Aite Group projects the blockchain market ...
4       One of the largest and oldest organizations fo...
                              ...                        
2186    For the past year , we've ranked nearly 9,000 ...
2187    Amazon has launched Chime, a video conferencin...
2188    February 9, 2017 - We begin each year with a l...
2189    At JPMorgan Chase & Co., a learning machine is...
2190    The Acquia Partner Awards Program is comprised...
Name: text_description, Length: 2191, dtype: object

In [317]:
# Collect each article's lemmatized tokens into a list of documents (one token list per article)
words_list = []
for doc in cnt.text_description_lemmatized:
    words_list.append(doc)

In [318]:
len(words_list)

2191

In [319]:
words_list[0][:10]

['work',
 'still',
 'early',
 'first',
 'full',
 'public',
 'version',
 'ethereum',
 'software',
 'recently']

In [320]:
cnt.shape

(2191, 11)

In [321]:
words_list[0]

['work',
 'still',
 'early',
 'first',
 'full',
 'public',
 'version',
 'ethereum',
 'software',
 'recently',
 'released',
 'system',
 'could',
 'face',
 'technical',
 'legal',
 'problem',
 'tarnished',
 'bitcoin',
 'many',
 'bitcoin',
 'advocate',
 'say',
 'ethereum',
 'face',
 'security',
 'problem',
 'bitcoin',
 'greater',
 'complexity',
 'software',
 'thus',
 'far',
 'ethereum',
 'faced',
 'much',
 'le',
 'testing',
 'many',
 'fewer',
 'attack',
 'bitcoin',
 'novel',
 'design',
 'ethereum',
 'may',
 'also',
 'invite',
 'intense',
 'scrutiny',
 'authority',
 'given',
 'potentially',
 'fraudulent',
 'contract',
 'like',
 'ponzi',
 'scheme',
 'written',
 'directly',
 'ethereum',
 'system',
 'sophisticated',
 'capability',
 'system',
 'made',
 'fascinating',
 'executive',
 'corporate',
 'america',
 'ibm',
 'said',
 'last',
 'year',
 'experimenting',
 'ethereum',
 'way',
 'control',
 'real',
 'world',
 'object',
 'called',
 'internet',
 'thing',
 'microsoft',
 'working',
 'several',
 ...]

In [322]:
len(words_list), len(words_list[0]), len(words_list[1])

(2191, 599, 203)

#### Create Dictionary, Bag of Words, tfidf model & Similarity matrix

In [323]:
# %pip install gensim

In [324]:
from gensim.corpora.dictionary import Dictionary

In [325]:
# create a dictionary from words list
dictionary = Dictionary(words_list)

In [326]:
dictionary

<gensim.corpora.dictionary.Dictionary at 0x19688d28f10>

In [327]:
len(dictionary)

38249

In [328]:
# Count the total number of tokens across all documents
number_words = 0
for doc in words_list:
    number_words = number_words + len(doc)

In [329]:
number_words

1216693

In [330]:
dictionary.get(0), dictionary.get(1), dictionary.get(2)

('actual', 'advocate', 'agreed')

##### Generating Bag of Words

In [331]:
bow = dictionary.doc2bow(words_list[0])

In [332]:
len(words_list[0]), len(bow)

(599, 369)

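The bag of words has fewer entries (369) than the document has tokens (599) because each repeated word is stored once, with its count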
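
A bag of words maps each distinct token to an integer id and a count. A pure-Python sketch on an invented token list; the alphabetical id assignment here is an assumption for determinism and need not match the ids the gensim `Dictionary` assigns.

```python
from collections import Counter

tokens = ['ethereum', 'bitcoin', 'software', 'bitcoin', 'ethereum', 'ethereum']

# Assign an integer id to each distinct token (alphabetical order, for illustration)
token_to_id = {token: idx for idx, token in enumerate(sorted(set(tokens)))}

# One (token_id, count) pair per distinct token
counts = Counter(tokens)
bow_toy = sorted((token_to_id[token], count) for token, count in counts.items())

print(len(tokens), len(bow_toy))  # 6 3
print(bow_toy)                    # [(0, 2), (1, 3), (2, 1)]
```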

##### Generating a corpus

In [333]:
#create corpus where the corpus is a bag of words for each document
corpus = [dictionary.doc2bow(doc) for doc in words_list] 

In [334]:
len(corpus), len(corpus[0]), len(corpus[1])

(2191, 369, 169)

All the articles are in the corpus, and the length of the first matches the count in the Bag of Words above

##### Use the TfIdf model on the corpus

In [335]:
from gensim.models.tfidfmodel import TfidfModel

In [336]:
#create tfidf model of the corpus
tfidf = TfidfModel(corpus) 

In [337]:
tfidf

<gensim.models.tfidfmodel.TfidfModel at 0x19688d25400>

In [338]:
len(tfidf[corpus[0]])

369

In [339]:
len(tfidf[corpus[1]])

169

Again, the lengths match
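
The TF-IDF vector keeps one weight per distinct token, which is why its length equals that of the bag of words. A pure-Python sketch of the weighting follows. It assumes the scheme I understand to be the gensim default (count times log2 of N over document frequency, then L2 normalisation, with zero weights dropped); the helper `tfidf_vectors` and the toy corpus are hypothetical.

```python
import math


def tfidf_vectors(bow_corpus):
    """Weight each (token_id, count) pair by log2(N / document frequency), then L2-normalise."""
    num_docs = len(bow_corpus)
    doc_freq = {}
    for doc in bow_corpus:
        for token_id, _ in doc:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

    weighted_corpus = []
    for doc in bow_corpus:
        weighted = [(t, c * math.log2(num_docs / doc_freq[t])) for t, c in doc]
        # Tokens that occur in every document get weight 0 and are dropped
        weighted = [(t, w) for t, w in weighted if w > 0]
        norm = math.sqrt(sum(w * w for _, w in weighted))
        weighted_corpus.append([(t, w / norm) for t, w in weighted] if norm else [])
    return weighted_corpus


bow_corpus = [
    [(0, 2), (1, 1)],   # token 0 twice, token 1 once
    [(0, 1), (2, 3)],
    [(0, 1), (1, 1)],
]
vectors = tfidf_vectors(bow_corpus)

# Token 0 appears in all three documents, so only token 2 survives in document 1
print(vectors[1])
```

Lengths match in the notebook's corpus because no token of those two articles occurs in every document; a token that did would be dropped, as token 0 is here.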

##### Generate Similarity matrix

In [340]:
from gensim.similarities import MatrixSimilarity

# Create the similarity index over the TF-IDF corpus. This is the key step: it gives the similarities between the articles.
sims = MatrixSimilarity(tfidf[corpus], num_features=len(dictionary))

In [341]:
len(dictionary)

38249

In [342]:
# Flatten words_list into a set of unique words to cross-check the dictionary size
words_set = {word for doc in words_list for word in doc}

In [343]:
len(words_set)

38249

In [344]:
sims

<gensim.similarities.docsim.MatrixSimilarity at 0x19688d254f0>

In [345]:
sims[corpus[0]]

array([0.89554286, 0.0337494 , 0.03001624, ..., 0.07600649, 0.12786059,
       0.02062125], dtype=float32)

In [346]:
len(sims[corpus[0]])

2191

In [347]:
len(sims)

2191
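`MatrixSimilarity` reports cosine similarity between the query and every indexed document. The sketch below reproduces that idea on a small dense matrix with NumPy. The function `cosine_similarities` and the toy vectors are hypothetical, and the sketch assumes no all-zero rows.

```python
import numpy as np


def cosine_similarities(doc_matrix, query):
    """Cosine similarity between one query vector and every row of doc_matrix."""
    doc_norms = np.linalg.norm(doc_matrix, axis=1)
    query_norm = np.linalg.norm(query)
    # Assumes every row and the query have non-zero length
    return (doc_matrix @ query) / (doc_norms * query_norm)


# Toy document-term matrix (hypothetical): 3 documents, 4 terms
docs = np.array([
    [0.0, 2.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 3.0],
    [1.0, 1.0, 0.0, 0.0],
])

scores = cosine_similarities(docs, docs[0])
print(scores)
```

A document compared with itself scores 1, and the result has one score per indexed document, which matches the length of 2191 seen above.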

#### Generating recommendations

In [348]:
def article_recommendation(content):
    # get a bag of words from the content
    query_doc_bow = dictionary.doc2bow(content)

    # convert the bag of words to TF-IDF weights
    query_doc_tfidf = tfidf[query_doc_bow]

    # get similarity values between the input article and all articles in the corpus
    similarity_array = sims[query_doc_tfidf]

    # Convert to a Series indexed by article id (assumes cnt rows are in corpus order)
    similarity_series = pd.Series(similarity_array.tolist(), index=cnt['item_id_adj'])

    # Scores are returned in corpus order; callers sort them after merging with cnt
    return similarity_series

In [349]:
test_article_id

1190

In [350]:
cnt[cnt['item_id_adj'] == test_article_id]

Unnamed: 0,index,event_timestamp,interaction_type,item_type,item_url,title,text_description,language,text_description_lemmatized,item_id_adj,topics
0,1,1459193988,content_present,HTML,http://www.nytimes.com/2016/03/28/business/dea...,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,en,"[work, still, early, first, full, public, vers...",1190,"[Cryptocurrency, Computer Programming, Data Sc..."


In [351]:
test_desc = cnt[cnt['item_id_adj'] == test_article_id]['text_description_lemmatized'].values[0]

In [352]:
recs = article_recommendation(test_desc)

In [353]:
recs[:10]

item_id_adj
1190    1.000000
811     0.026399
559     0.010764
2988    0.203936
1191    0.098496
2989    0.173190
1259    0.146547
1063    0.152322
1059    0.880597
246     0.010631
dtype: float64

In [354]:
recs_df = pd.DataFrame(recs, columns=['Score'])

In [355]:
recs_df.head()

Unnamed: 0_level_0,Score
item_id_adj,Unnamed: 1_level_1
1190,1.0
811,0.026399
559,0.010764
2988,0.203936
1191,0.098496


In [356]:
recs_df.reset_index(inplace=True)

In [357]:
recs_df.head()

Unnamed: 0,item_id_adj,Score
0,1190,1.0
1,811,0.026399
2,559,0.010764
3,2988,0.203936
4,1191,0.098496


In [358]:
recs_df.isna().sum()

item_id_adj    0
Score          0
dtype: int64

In [359]:
recs_df = cnt.merge(recs_df, on='item_id_adj', how='left')

In [360]:
recs_df.head()

Unnamed: 0,index,event_timestamp,interaction_type,item_type,item_url,title,text_description,language,text_description_lemmatized,item_id_adj,topics,Score
0,1,1459193988,content_present,HTML,http://www.nytimes.com/2016/03/28/business/dea...,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,en,"[work, still, early, first, full, public, vers...",1190,"[Cryptocurrency, Computer Programming, Data Sc...",1.0
1,2,1459194146,content_present,HTML,http://cointelegraph.com/news/bitcoin-future-w...,Bitcoin Future: When GBPcoin of Branson Wins O...,The alarm clock wakes me at 8:00 with stream o...,en,"[alarm, clock, wake, stream, advert, free, bro...",811,"[Cryptocurrency, Computer Programming, Data Sc...",0.026399
2,3,1459194474,content_present,HTML,https://cloudplatform.googleblog.com/2016/03/G...,Google Data Center 360° Tour,We're excited to share the Google Data Center ...,en,"[excited, share, google, data, center, tour, y...",559,"[Google, Cloud Computing, Data Science & Machi...",0.010764
3,4,1459194497,content_present,HTML,https://bitcoinmagazine.com/articles/ibm-wants...,"IBM Wants to ""Evolve the Internet"" With Blockc...",The Aite Group projects the blockchain market ...,en,"[aite, group, project, blockchain, market, cou...",2988,"[Cryptocurrency, Computer Programming, Data Sc...",0.203936
4,5,1459194522,content_present,HTML,http://www.coindesk.com/ieee-blockchain-oxford...,IEEE to Talk Blockchain at Cloud Computing Oxf...,One of the largest and oldest organizations fo...,en,"[one, largest, oldest, organization, computing...",1191,"[Cryptocurrency, Cloud Computing, Apple, Data ...",0.098496


In [361]:
recs_df.sort_values(by='Score', ascending=False, inplace=True)

In [362]:
recs_df.isna().sum()

index                          0
event_timestamp                0
interaction_type               0
item_type                      0
item_url                       0
title                          0
text_description               0
language                       0
text_description_lemmatized    0
item_id_adj                    0
topics                         0
Score                          0
dtype: int64

In [363]:
keep = ['Score', 'title', 'text_description', 'topics', 'item_id_adj']

In [364]:
recs_df.drop(columns=[col for col in recs_df if col not in keep], inplace=True)

In [365]:
recs_df.head()

Unnamed: 0,title,text_description,item_id_adj,topics,Score
0,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,1190,"[Cryptocurrency, Computer Programming, Data Sc...",1.0
8,The Rise And Growth of Ethereum Gets Mainstrea...,"Ethereum, considered by many to be the most pr...",1059,"[Cryptocurrency, Computer Programming, Cloud C...",0.880597
155,Ethereum and Bitcoin Are Market Leaders But No...,A lot of people tend to ignore the fact that B...,1300,"[Cryptocurrency, Digital Marketing, E-Commerce...",0.576909
416,"For Blockchain VCs, the Time for Ethereum Inve...",Just a few months after the platform's product...,1115,"[Cryptocurrency, Digital Marketing, E-Commerce...",0.576678
71,Microsoft Adds Ethereum to Windows Platform Fo...,Millions of Microsoft developers are now able ...,1019,"[Cryptocurrency, Cloud Computing, Facebook, Co...",0.491529


Expose the content-based recommender as a function

In [366]:
def get_articles_matching_article_from_content_based(article_id, n=-1):
    lemmatized_desc = cnt[cnt['item_id_adj'] == article_id]['text_description_lemmatized'].values[0]

    recommendations = article_recommendation(lemmatized_desc)

    recommendations_df = pd.DataFrame(recommendations, columns=['score'])

    recommendations_df.reset_index(inplace=True)

    recommendations_df = cnt.merge(recommendations_df, on='item_id_adj', how='left')

    recommendations_df.sort_values(by='score', ascending=False, inplace=True)

    keep = ['score', 'title', 'topics', 'item_id_adj']

    recommendations_df.drop(columns=[col for col in recommendations_df if col not in keep], inplace=True)

    # Drop rows with NaN
    recommendations_df.dropna(inplace=True)

    # Reset index
    recommendations_df.reset_index(drop=True, inplace=True)

    if n > 0:
        recommendations_df = recommendations_df[:n]

    return recommendations_df

In [367]:
get_articles_matching_article_from_content_based(test_article_id, n=10)

Unnamed: 0,title,item_id_adj,topics,score
0,"Ethereum, a Virtual Currency, Enables Transact...",1190,"[Cryptocurrency, Computer Programming, Data Sc...",1.0
1,The Rise And Growth of Ethereum Gets Mainstrea...,1059,"[Cryptocurrency, Computer Programming, Cloud C...",0.880597
2,Ethereum and Bitcoin Are Market Leaders But No...,1300,"[Cryptocurrency, Digital Marketing, E-Commerce...",0.576909
3,"For Blockchain VCs, the Time for Ethereum Inve...",1115,"[Cryptocurrency, Digital Marketing, E-Commerce...",0.576678
4,Microsoft Adds Ethereum to Windows Platform Fo...,1019,"[Cryptocurrency, Cloud Computing, Facebook, Co...",0.491529
5,Solidity Available in Visual Studio - Ethereum...,99,"[Cryptocurrency, Facebook, Cloud Computing, Op...",0.487613
6,Microsoft Continues to Embrace Ethereum & Bitc...,676,"[Cryptocurrency, Operating Systems & Runtimes,...",0.44841
7,Cashila Announces Convenient Buy and Sell Feat...,2992,"[Cryptocurrency, Facebook, Computer Programmin...",0.362484
8,"Eyeing Volume, Asian Exchanges Add Support for...",707,"[Cryptocurrency, Digital Marketing, Operating ...",0.358869
9,Decentralized Options Exchange Etheropt Uses A...,810,"[Cryptocurrency, Digital Marketing, Cloud Comp...",0.333189
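The top result is always the query article itself, with a score of 1.0. If that row is not wanted in the recommendations, it can be filtered out afterwards. A minimal sketch, where `drop_query_article` is a hypothetical helper and the small frame reuses three rows from the output above:

```python
import pandas as pd


def drop_query_article(recommendations, article_id, id_col="item_id_adj"):
    """Return the recommendations without the row(s) for the query article."""
    mask = recommendations[id_col] != article_id
    return recommendations[mask].reset_index(drop=True)


# Three rows copied from the content-based output above
recs = pd.DataFrame({
    "item_id_adj": [1190, 1059, 1300],
    "score": [1.0, 0.880597, 0.576909],
})

top = drop_query_article(recs, article_id=1190)
print(top)
```

The original frame is left untouched, so the same result can still be merged with other recommenders later.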



#### Comparing item-based and content-based filtering

In [368]:
num_articles = len(collab_out)

In [369]:
num_articles

16

In [370]:
# Assign the first num_articles rows from recs_df to content_out
content_out = recs_df.iloc[:num_articles]

In [371]:
content_out.reset_index(inplace=True)

In [372]:
content_out.head()

Unnamed: 0,index,title,text_description,item_id_adj,topics,Score
0,0,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,1190,"[Cryptocurrency, Computer Programming, Data Sc...",1.0
1,8,The Rise And Growth of Ethereum Gets Mainstrea...,"Ethereum, considered by many to be the most pr...",1059,"[Cryptocurrency, Computer Programming, Cloud C...",0.880597
2,155,Ethereum and Bitcoin Are Market Leaders But No...,A lot of people tend to ignore the fact that B...,1300,"[Cryptocurrency, Digital Marketing, E-Commerce...",0.576909
3,416,"For Blockchain VCs, the Time for Ethereum Inve...",Just a few months after the platform's product...,1115,"[Cryptocurrency, Digital Marketing, E-Commerce...",0.576678
4,71,Microsoft Adds Ethereum to Windows Platform Fo...,Millions of Microsoft developers are now able ...,1019,"[Cryptocurrency, Cloud Computing, Facebook, Co...",0.491529


In [373]:
cnt[cnt['item_id_adj'] == test_article_id][['title', 'topics']]

Unnamed: 0,title,topics
0,"Ethereum, a Virtual Currency, Enables Transact...","[Cryptocurrency, Computer Programming, Data Sc..."


In [374]:
# Rename item_id_adj to article_id and drop the leftover index column
content_out.rename(columns={'item_id_adj': 'article_id'}, inplace=True)
content_out.drop(columns=['index'], inplace=True)



A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy





In [375]:
content_out.head()

Unnamed: 0,title,text_description,article_id,topics,Score
0,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,1190,"[Cryptocurrency, Computer Programming, Data Sc...",1.0
1,The Rise And Growth of Ethereum Gets Mainstrea...,"Ethereum, considered by many to be the most pr...",1059,"[Cryptocurrency, Computer Programming, Cloud C...",0.880597
2,Ethereum and Bitcoin Are Market Leaders But No...,A lot of people tend to ignore the fact that B...,1300,"[Cryptocurrency, Digital Marketing, E-Commerce...",0.576909
3,"For Blockchain VCs, the Time for Ethereum Inve...",Just a few months after the platform's product...,1115,"[Cryptocurrency, Digital Marketing, E-Commerce...",0.576678
4,Microsoft Adds Ethereum to Windows Platform Fo...,Millions of Microsoft developers are now able ...,1019,"[Cryptocurrency, Cloud Computing, Facebook, Co...",0.491529


In [376]:
collab_out.head()

Unnamed: 0,article_id,Score,title,topics
1,196,0.39084,How I got into the top 15 of a Kaggle competit...,"[Computer Programming, Data Science & Machine ..."
3,538,0.322372,Announcing SyntaxNet: The World's Most Accurat...,"[Data Science & Machine Learning, Computer Pro..."
5,224,0.298958,An overview of web service solutions in Drupal 8,"[E-Commerce, Computer Programming, Cloud Compu..."
6,846,0.295025,The insurance tech equation,"[Digital Marketing, Cryptocurrency, Data Scien..."
7,73,0.281402,Hero unveils a new home gadget to help you tra...,"[Computer Programming, Digital Marketing, Appl..."


In [377]:
# Left join the content_out and collab_out DataFrames on article_id
out = pd.merge(collab_out, content_out, on='article_id', how='left')

In [378]:
content_out

Unnamed: 0,title,text_description,article_id,topics,Score
0,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,1190,"[Cryptocurrency, Computer Programming, Data Sc...",1.0
1,The Rise And Growth of Ethereum Gets Mainstrea...,"Ethereum, considered by many to be the most pr...",1059,"[Cryptocurrency, Computer Programming, Cloud C...",0.880597
2,Ethereum and Bitcoin Are Market Leaders But No...,A lot of people tend to ignore the fact that B...,1300,"[Cryptocurrency, Digital Marketing, E-Commerce...",0.576909
3,"For Blockchain VCs, the Time for Ethereum Inve...",Just a few months after the platform's product...,1115,"[Cryptocurrency, Digital Marketing, E-Commerce...",0.576678
4,Microsoft Adds Ethereum to Windows Platform Fo...,Millions of Microsoft developers are now able ...,1019,"[Cryptocurrency, Cloud Computing, Facebook, Co...",0.491529
5,Solidity Available in Visual Studio - Ethereum...,Consensys and Microsoft have announced that th...,99,"[Cryptocurrency, Facebook, Cloud Computing, Op...",0.487613
6,Microsoft Continues to Embrace Ethereum & Bitc...,Microsoft Also read: Putin's Advisor: Bitcoin ...,676,"[Cryptocurrency, Operating Systems & Runtimes,...",0.44841
7,Cashila Announces Convenient Buy and Sell Feat...,There seems to be no love lost between central...,2992,"[Cryptocurrency, Facebook, Computer Programmin...",0.362484
8,"Eyeing Volume, Asian Exchanges Add Support for...",Following Ethereum's production-ready software...,707,"[Cryptocurrency, Digital Marketing, Operating ...",0.358869
9,Decentralized Options Exchange Etheropt Uses A...,The price per Ether will be taken from multipl...,810,"[Cryptocurrency, Digital Marketing, Cloud Comp...",0.333189


In [379]:
collab_out

Unnamed: 0,article_id,Score,title,topics
1,196,0.39084,How I got into the top 15 of a Kaggle competit...,"[Computer Programming, Data Science & Machine ..."
3,538,0.322372,Announcing SyntaxNet: The World's Most Accurat...,"[Data Science & Machine Learning, Computer Pro..."
5,224,0.298958,An overview of web service solutions in Drupal 8,"[E-Commerce, Computer Programming, Cloud Compu..."
6,846,0.295025,The insurance tech equation,"[Digital Marketing, Cryptocurrency, Data Scien..."
7,73,0.281402,Hero unveils a new home gadget to help you tra...,"[Computer Programming, Digital Marketing, Appl..."
8,92,0.28042,How We Migrated Our Backend to Spring Boot in ...,"[Computer Programming, Cloud Computing, Operat..."
9,394,0.279102,"The New App Store: Subscription Pricing, Faste...","[Apple, Google, Facebook, Computer Programming..."
10,212,0.276676,Meet Aquifer: A build system for easier Drupal...,"[Computer Programming, E-Commerce, Operating S..."
11,587,0.272253,Enterprise developers look out: this week on G...,"[Cloud Computing, Operating Systems & Runtimes..."
12,2,0.266744,Top 10 Intranet Trends of 2016,"[Computer Programming, Digital Marketing, Goog..."


In [380]:
out

Unnamed: 0,article_id,Score_x,title_x,topics_x,title_y,text_description,topics_y,Score_y
0,196,0.39084,How I got into the top 15 of a Kaggle competit...,"[Computer Programming, Data Science & Machine ...",,,,
1,538,0.322372,Announcing SyntaxNet: The World's Most Accurat...,"[Data Science & Machine Learning, Computer Pro...",,,,
2,224,0.298958,An overview of web service solutions in Drupal 8,"[E-Commerce, Computer Programming, Cloud Compu...",,,,
3,846,0.295025,The insurance tech equation,"[Digital Marketing, Cryptocurrency, Data Scien...",,,,
4,73,0.281402,Hero unveils a new home gadget to help you tra...,"[Computer Programming, Digital Marketing, Appl...",,,,
5,92,0.28042,How We Migrated Our Backend to Spring Boot in ...,"[Computer Programming, Cloud Computing, Operat...",,,,
6,394,0.279102,"The New App Store: Subscription Pricing, Faste...","[Apple, Google, Facebook, Computer Programming...",,,,
7,212,0.276676,Meet Aquifer: A build system for easier Drupal...,"[Computer Programming, E-Commerce, Operating S...",,,,
8,587,0.272253,Enterprise developers look out: this week on G...,"[Cloud Computing, Operating Systems & Runtimes...",,,,
9,2,0.266744,Top 10 Intranet Trends of 2016,"[Computer Programming, Digital Marketing, Goog...",,,,


In [381]:
content_out.shape

(16, 5)

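In the join above, the content-based columns are NaN for every row shown, so none of those collaborative recommendations appear among the top content-based ones. There isn't much overlap between the item-based collaborative, content-based, and ALS results.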

Check if combining with ALS improves the results
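One way to put a number on the overlap is to compare the sets of recommended IDs directly. The sketch below is an illustration only: `overlap_report` is a hypothetical helper, and the ID lists are the first five IDs displayed in the notebook outputs for `content_out`, `collab_out`, and (further down) `item_collab_result`.

```python
def overlap_report(left_ids, right_ids):
    """Shared ids and Jaccard index (shared / union) of two recommendation lists."""
    left, right = set(left_ids), set(right_ids)
    shared = left & right
    union = left | right
    jaccard = len(shared) / len(union) if union else 0.0
    return {"shared": sorted(shared), "jaccard": jaccard}


# First five ids as displayed in the notebook outputs
content_top = [1190, 1059, 1300, 1115, 1019]
collab_out_top = [196, 538, 224, 846, 73]
item_collab_top = [1190, 918, 1300, 676, 482]

print(overlap_report(content_top, collab_out_top))
print(overlap_report(content_top, item_collab_top))
```

A Jaccard index of 0 means the two lists share nothing, while 1 means they recommend exactly the same articles.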

### Combining item-based filtering approaches

In [382]:
item_als_result = get_articles_matching_article_from_als(test_article_id, n=50, all=True)

In [383]:
item_als_result.shape

(1375, 4)

In [384]:
item_collab_result = get_articles_matching_article_from_item_based(test_article_id)

In [385]:
item_collab_result.shape

(2130, 4)

In [386]:
item_content_result = get_articles_matching_article_from_content_based(test_article_id)

In [387]:
item_content_result.shape

(2193, 4)

#### Normalizing the similarity scores using Min-Max normalization

In [388]:
# Normalize the scores in item_als_result
item_als_result['normalized_score_als'] = (item_als_result['score'] - min(item_als_result['score'])) / (max(item_als_result['score']) - min(item_als_result['score']))

In [389]:
min(item_als_result['score'])

-0.614

In [390]:
item_als_result.head()

Unnamed: 0,item_id_adj,score,title,topics,normalized_score_als
0,1190,1.0,"Ethereum, a Virtual Currency, Enables Transact...","[Cryptocurrency, Computer Programming, Data Sc...",1.0
1,567,0.777,How Airbnb uses Machine Learning to Detect Hos...,"[Computer Programming, Data Science & Machine ...",0.861834
2,644,0.758,Presenting to the Boss(es) | Pluralsight,"[Digital Marketing, Computer Programming, Goog...",0.850062
3,1191,0.693,IEEE to Talk Blockchain at Cloud Computing Oxf...,"[Cryptocurrency, Cloud Computing, Apple, Data ...",0.809789
4,253,0.688,[E-learning] Design Thinking for Innovation - ...,"[Digital Marketing, Computer Programming, E-Co...",0.806691


In [391]:
# Normalize the scores in item_collab_result
item_collab_result['normalized_score_collab'] = (item_collab_result['score'] - min(item_collab_result['score'])) / (max(item_collab_result['score']) - min(item_collab_result['score']))

In [392]:
min(item_collab_result['score'])

0.0

In [393]:
item_collab_result.head()

Unnamed: 0,item_id_adj,title,score,topics,normalized_score_collab
0,1190,"Ethereum, a Virtual Currency, Enables Transact...",1.0,"[Cryptocurrency, Computer Programming, Data Sc...",1.0
1,918,"Proof of Individuality, the New-Age Security o...",1.0,"[Cryptocurrency, Computer Programming, Data Sc...",1.0
2,1300,Ethereum and Bitcoin Are Market Leaders But No...,0.707,"[Cryptocurrency, Digital Marketing, E-Commerce...",0.707
3,676,Microsoft Continues to Embrace Ethereum & Bitc...,0.5,"[Cryptocurrency, Operating Systems & Runtimes,...",0.5
4,482,Deep-learning neural network creates its own i...,0.5,"[Data Science & Machine Learning, Computer Pro...",0.5


In [394]:
# Normalize the scores in item_content_result
item_content_result['normalized_score_content'] = (item_content_result['score'] - min(item_content_result['score'])) / (max(item_content_result['score']) - min(item_content_result['score']))

In [395]:
min(item_content_result['score'])

0.0003919448936358094

In [396]:
item_content_result.head()

Unnamed: 0,title,item_id_adj,topics,score,normalized_score_content
0,"Ethereum, a Virtual Currency, Enables Transact...",1190,"[Cryptocurrency, Computer Programming, Data Sc...",1.0,1.0
1,The Rise And Growth of Ethereum Gets Mainstrea...,1059,"[Cryptocurrency, Computer Programming, Cloud C...",0.880597,0.880551
2,Ethereum and Bitcoin Are Market Leaders But No...,1300,"[Cryptocurrency, Digital Marketing, E-Commerce...",0.576909,0.576743
3,"For Blockchain VCs, the Time for Ethereum Inve...",1115,"[Cryptocurrency, Digital Marketing, E-Commerce...",0.576678,0.576512
4,Microsoft Adds Ethereum to Windows Platform Fo...,1019,"[Cryptocurrency, Cloud Computing, Facebook, Co...",0.491529,0.49133
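The three normalization cells repeat the same min-max formula, `(score - min) / (max - min)`. A small helper could hold it in one place. This is a sketch: `min_max_normalize` is a hypothetical name, and returning zeros for a constant series is an assumption made here to avoid dividing by zero.

```python
import pandas as pd


def min_max_normalize(scores):
    """Rescale a Series linearly so its minimum maps to 0 and its maximum to 1."""
    lo, hi = scores.min(), scores.max()
    if hi == lo:
        # Assumption: a constant series carries no ranking information
        return pd.Series(0.0, index=scores.index)
    return (scores - lo) / (hi - lo)


# Three ALS scores taken from the outputs above: the maximum, 0.777, and the minimum
als_like = pd.Series([1.0, 0.777, -0.614])
normalized = min_max_normalize(als_like)
print(normalized)
```

With a minimum of -0.614 and a maximum of 1.0, the score 0.777 maps to about 0.861834, which agrees with the `normalized_score_als` value shown for article 567.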


#### Item-based & content-based

In [397]:
item_collab_result.head()

Unnamed: 0,item_id_adj,title,score,topics,normalized_score_collab
0,1190,"Ethereum, a Virtual Currency, Enables Transact...",1.0,"[Cryptocurrency, Computer Programming, Data Sc...",1.0
1,918,"Proof of Individuality, the New-Age Security o...",1.0,"[Cryptocurrency, Computer Programming, Data Sc...",1.0
2,1300,Ethereum and Bitcoin Are Market Leaders But No...,0.707,"[Cryptocurrency, Digital Marketing, E-Commerce...",0.707
3,676,Microsoft Continues to Embrace Ethereum & Bitc...,0.5,"[Cryptocurrency, Operating Systems & Runtimes,...",0.5
4,482,Deep-learning neural network creates its own i...,0.5,"[Data Science & Machine Learning, Computer Pro...",0.5


In [398]:
item_content_result.head()

Unnamed: 0,title,item_id_adj,topics,score,normalized_score_content
0,"Ethereum, a Virtual Currency, Enables Transact...",1190,"[Cryptocurrency, Computer Programming, Data Sc...",1.0,1.0
1,The Rise And Growth of Ethereum Gets Mainstrea...,1059,"[Cryptocurrency, Computer Programming, Cloud C...",0.880597,0.880551
2,Ethereum and Bitcoin Are Market Leaders But No...,1300,"[Cryptocurrency, Digital Marketing, E-Commerce...",0.576909,0.576743
3,"For Blockchain VCs, the Time for Ethereum Inve...",1115,"[Cryptocurrency, Digital Marketing, E-Commerce...",0.576678,0.576512
4,Microsoft Adds Ethereum to Windows Platform Fo...,1019,"[Cryptocurrency, Cloud Computing, Facebook, Co...",0.491529,0.49133


In [399]:
item_content_hybrid = pd.merge(item_content_result, item_collab_result, on='item_id_adj', how='left')

In [400]:
item_content_hybrid.shape

(2197, 9)
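The merged frame has 2197 rows, while `item_content_result` has 2193. A left join can only gain rows when a key repeats on the right-hand side, so it may be worth checking for duplicate `item_id_adj` values before merging. A minimal sketch on toy data (the frames and values below are hypothetical):

```python
import pandas as pd

# Toy frames (hypothetical): key 2 appears twice on the right-hand side
left = pd.DataFrame({"item_id_adj": [1, 2, 3], "score": [0.9, 0.5, 0.1]})
right = pd.DataFrame({"item_id_adj": [1, 2, 2], "score": [0.8, 0.4, 0.4]})

# A left join repeats the left row once per matching right row
merged = left.merge(right, on="item_id_adj", how="left",
                    suffixes=("_content", "_collab"))
print(len(left), len(merged))

# List the keys that occur more than once on the right
dup_keys = right.loc[right["item_id_adj"].duplicated(keep=False),
                     "item_id_adj"].unique()
print(dup_keys)

# After de-duplicating, validate= makes pandas raise if right keys repeat again
right_unique = right.drop_duplicates(subset="item_id_adj")
checked = left.merge(right_unique, on="item_id_adj", how="left",
                     validate="many_to_one",
                     suffixes=("_content", "_collab"))
print(len(checked))
```

Whether dropping duplicates is the right fix depends on why the keys repeat. The sketch only shows how to detect the situation.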

In [401]:
item_content_hybrid.isna().sum()

title_x                      0
item_id_adj                  0
topics_x                     0
score_x                      0
normalized_score_content     0
title_y                     61
score_y                     61
topics_y                    61
normalized_score_collab     61
dtype: int64

In [402]:
item_content_hybrid.dropna(inplace=True)

In [403]:
item_content_hybrid.head()

Unnamed: 0,title_x,item_id_adj,topics_x,score_x,normalized_score_content,title_y,score_y,topics_y,normalized_score_collab
0,"Ethereum, a Virtual Currency, Enables Transact...",1190,"[Cryptocurrency, Computer Programming, Data Sc...",1.0,1.0,"Ethereum, a Virtual Currency, Enables Transact...",1.0,"[Cryptocurrency, Computer Programming, Data Sc...",1.0
1,The Rise And Growth of Ethereum Gets Mainstrea...,1059,"[Cryptocurrency, Computer Programming, Cloud C...",0.880597,0.880551,The Rise And Growth of Ethereum Gets Mainstrea...,0.0,"[Cryptocurrency, Computer Programming, Cloud C...",0.0
2,Ethereum and Bitcoin Are Market Leaders But No...,1300,"[Cryptocurrency, Digital Marketing, E-Commerce...",0.576909,0.576743,Ethereum and Bitcoin Are Market Leaders But No...,0.707,"[Cryptocurrency, Digital Marketing, E-Commerce...",0.707
3,"For Blockchain VCs, the Time for Ethereum Inve...",1115,"[Cryptocurrency, Digital Marketing, E-Commerce...",0.576678,0.576512,"For Blockchain VCs, the Time for Ethereum Inve...",0.0,"[Cryptocurrency, Digital Marketing, E-Commerce...",0.0
4,Microsoft Adds Ethereum to Windows Platform Fo...,1019,"[Cryptocurrency, Cloud Computing, Facebook, Co...",0.491529,0.49133,Microsoft Adds Ethereum to Windows Platform Fo...,0.0,"[Cryptocurrency, Cloud Computing, Facebook, Co...",0.0


In [404]:
# Drop title_y and topics_y
item_content_hybrid.drop(columns=['title_y', 'topics_y'], inplace=True)

In [405]:
item_content_hybrid.head()

Unnamed: 0,title_x,item_id_adj,topics_x,score_x,normalized_score_content,score_y,normalized_score_collab
0,"Ethereum, a Virtual Currency, Enables Transact...",1190,"[Cryptocurrency, Computer Programming, Data Sc...",1.0,1.0,1.0,1.0
1,The Rise And Growth of Ethereum Gets Mainstrea...,1059,"[Cryptocurrency, Computer Programming, Cloud C...",0.880597,0.880551,0.0,0.0
2,Ethereum and Bitcoin Are Market Leaders But No...,1300,"[Cryptocurrency, Digital Marketing, E-Commerce...",0.576909,0.576743,0.707,0.707
3,"For Blockchain VCs, the Time for Ethereum Inve...",1115,"[Cryptocurrency, Digital Marketing, E-Commerce...",0.576678,0.576512,0.0,0.0
4,Microsoft Adds Ethereum to Windows Platform Fo...,1019,"[Cryptocurrency, Cloud Computing, Facebook, Co...",0.491529,0.49133,0.0,0.0


In [406]:
# Store the average of the normalized scores in a new column
item_content_hybrid['final_score'] = item_content_hybrid[['normalized_score_content', 'normalized_score_collab']].mean(axis=1)

In [407]:
item_content_hybrid.head()

Unnamed: 0,title_x,item_id_adj,topics_x,score_x,normalized_score_content,score_y,normalized_score_collab,final_score
0,"Ethereum, a Virtual Currency, Enables Transact...",1190,"[Cryptocurrency, Computer Programming, Data Sc...",1.0,1.0,1.0,1.0,1.0
1,The Rise And Growth of Ethereum Gets Mainstrea...,1059,"[Cryptocurrency, Computer Programming, Cloud C...",0.880597,0.880551,0.0,0.0,0.440275
2,Ethereum and Bitcoin Are Market Leaders But No...,1300,"[Cryptocurrency, Digital Marketing, E-Commerce...",0.576909,0.576743,0.707,0.707,0.641871
3,"For Blockchain VCs, the Time for Ethereum Inve...",1115,"[Cryptocurrency, Digital Marketing, E-Commerce...",0.576678,0.576512,0.0,0.0,0.288256
4,Microsoft Adds Ethereum to Windows Platform Fo...,1019,"[Cryptocurrency, Cloud Computing, Facebook, Co...",0.491529,0.49133,0.0,0.0,0.245665
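The final score here is a plain average of the two normalized scores, computed after dropping articles that are missing from either result. As a variation, the sketch below computes a weighted average and treats a missing score as 0 instead of dropping the row. Both choices are assumptions that differ from the notebook. `blend_scores` is a hypothetical helper and article 9999 is a made-up row; the other values come from the outputs above.

```python
import pandas as pd


def blend_scores(df, weights, missing=0.0):
    """Weighted average of the score columns named in weights."""
    total_weight = sum(weights.values())
    blended = sum(df[col].fillna(missing) * w for col, w in weights.items())
    return blended / total_weight


scores = pd.DataFrame({
    "item_id_adj": [1190, 1300, 1059, 9999],  # 9999 is a made-up article
    "normalized_score_content": [1.0, 0.576743, 0.880551, 0.3],
    "normalized_score_collab": [1.0, 0.707, 0.0, None],
})

# Equal weights reproduce the plain mean used in the notebook
weights = {"normalized_score_content": 0.5, "normalized_score_collab": 0.5}
scores["final_score"] = blend_scores(scores, weights)

ranked = scores.sort_values("final_score", ascending=False).reset_index(drop=True)
print(ranked[["item_id_adj", "final_score"]])
```

With equal weights the first three rows agree with the `final_score` values in the notebook output, up to display rounding. Changing the weights shifts the ranking toward one recommender or the other.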


In [408]:
# Sort the DataFrame by final_score in descending order
item_content_hybrid.sort_values(by='final_score', ascending=False, inplace=True)

# Reset the index
item_content_hybrid.reset_index(drop=True, inplace=True)

In [409]:
item_content_hybrid.head()

Unnamed: 0,title_x,item_id_adj,topics_x,score_x,normalized_score_content,score_y,normalized_score_collab,final_score
0,"Ethereum, a Virtual Currency, Enables Transact...",1190,"[Cryptocurrency, Computer Programming, Data Sc...",1.0,1.0,1.0,1.0,1.0
1,Ethereum and Bitcoin Are Market Leaders But No...,1300,"[Cryptocurrency, Digital Marketing, E-Commerce...",0.576909,0.576743,0.707,0.707,0.641871
2,"Proof of Individuality, the New-Age Security o...",918,"[Cryptocurrency, Computer Programming, Data Sc...",0.064632,0.064266,1.0,1.0,0.532133
3,Microsoft Continues to Embrace Ethereum & Bitc...,676,"[Cryptocurrency, Operating Systems & Runtimes,...",0.44841,0.448193,0.5,0.5,0.474097
4,The Rise And Growth of Ethereum Gets Mainstrea...,1059,"[Cryptocurrency, Computer Programming, Cloud C...",0.880597,0.880551,0.0,0.0,0.440275


In [410]:
# Drop the score_x, score_y, normalized_score_content and normalized_score_collab columns
item_content_hybrid.drop(columns=['score_x', 'score_y', 'normalized_score_content', 'normalized_score_collab'], inplace=True)

# Rename title_x to title and topics_x to topics
item_content_hybrid.rename(columns={'title_x': 'title', 'topics_x': 'topics'}, inplace=True)

In [411]:
item_content_hybrid.head()

Unnamed: 0,title,item_id_adj,topics,final_score
0,"Ethereum, a Virtual Currency, Enables Transact...",1190,"[Cryptocurrency, Computer Programming, Data Sc...",1.0
1,Ethereum and Bitcoin Are Market Leaders But No...,1300,"[Cryptocurrency, Digital Marketing, E-Commerce...",0.641871
2,"Proof of Individuality, the New-Age Security o...",918,"[Cryptocurrency, Computer Programming, Data Sc...",0.532133
3,Microsoft Continues to Embrace Ethereum & Bitc...,676,"[Cryptocurrency, Operating Systems & Runtimes,...",0.474097
4,The Rise And Growth of Ethereum Gets Mainstrea...,1059,"[Cryptocurrency, Computer Programming, Cloud C...",0.440275


Expose the hybrid recommender as a function

In [412]:
def get_articles_matching_article_from_item_content_hybrid(article_id, n=-1, ignore=()):
    item_collab_result = get_articles_matching_article_from_item_based(article_id)

    # Normalize the scores in item_collab_result
    item_collab_result['normalized_score_collab'] = (item_collab_result['score'] - min(item_collab_result['score'])) / (max(item_collab_result['score']) - min(item_collab_result['score']))

    item_content_result = get_articles_matching_article_from_content_based(article_id)

    # Normalize the scores in item_content_result
    item_content_result['normalized_score_content'] = (item_content_result['score'] - min(item_content_result['score'])) / (max(item_content_result['score']) - min(item_content_result['score']))

    item_content_hybrid = pd.merge(item_content_result, item_collab_result, on='item_id_adj', how='left')

    item_content_hybrid.dropna(inplace=True)

    # Drop title_y and topics_y
    item_content_hybrid.drop(columns=['title_y', 'topics_y'], inplace=True)

    # Store the average of the normalized scores in a new column
    item_content_hybrid['final_score'] = item_content_hybrid[['normalized_score_content', 'normalized_score_collab']].mean(axis=1)

    # Drop the rows that have item_id_adj in ignore if ignore is not empty
    if len(ignore) > 0:
        item_content_hybrid = item_content_hybrid[~item_content_hybrid['item_id_adj'].isin(ignore)].copy()

    # Sort the DataFrame by final_score in descending order
    item_content_hybrid.sort_values(by='final_score', ascending=False, inplace=True)

    # Reset the index
    item_content_hybrid.reset_index(drop=True, inplace=True)

    # Drop the score_x, score_y, normalized_score_content and normalized_score_collab columns
    item_content_hybrid.drop(columns=['score_x', 'score_y', 'normalized_score_content', 'normalized_score_collab'], inplace=True)

    # Rename title_x to title and topics_x to topics
    item_content_hybrid.rename(columns={'title_x': 'title', 'topics_x': 'topics'}, inplace=True)

    if n > 0:
        # Return only the first n articles
        return item_content_hybrid.head(n)

    return item_content_hybrid

In [413]:
get_articles_matching_article_from_item_content_hybrid(test_article_id, n=5)

Unnamed: 0,title,item_id_adj,topics,final_score
0,"Ethereum, a Virtual Currency, Enables Transact...",1190,"[Cryptocurrency, Computer Programming, Data Sc...",1.0
1,Ethereum and Bitcoin Are Market Leaders But No...,1300,"[Cryptocurrency, Digital Marketing, E-Commerce...",0.641871
2,"Proof of Individuality, the New-Age Security o...",918,"[Cryptocurrency, Computer Programming, Data Sc...",0.532133
3,Microsoft Continues to Embrace Ethereum & Bitc...,676,"[Cryptocurrency, Operating Systems & Runtimes,...",0.474097
4,The Rise And Growth of Ethereum Gets Mainstrea...,1059,"[Cryptocurrency, Computer Programming, Cloud C...",0.440275


#### ALS & Item-based

In [414]:
item_als_result.head()

Unnamed: 0,item_id_adj,score,title,topics,normalized_score_als
0,1190,1.0,"Ethereum, a Virtual Currency, Enables Transact...","[Cryptocurrency, Computer Programming, Data Sc...",1.0
1,567,0.777,How Airbnb uses Machine Learning to Detect Hos...,"[Computer Programming, Data Science & Machine ...",0.861834
2,644,0.758,Presenting to the Boss(es) | Pluralsight,"[Digital Marketing, Computer Programming, Goog...",0.850062
3,1191,0.693,IEEE to Talk Blockchain at Cloud Computing Oxf...,"[Cryptocurrency, Cloud Computing, Apple, Data ...",0.809789
4,253,0.688,[E-learning] Design Thinking for Innovation - ...,"[Digital Marketing, Computer Programming, E-Co...",0.806691


In [415]:
item_collab_result.head()

Unnamed: 0,item_id_adj,title,score,topics,normalized_score_collab
0,1190,"Ethereum, a Virtual Currency, Enables Transact...",1.0,"[Cryptocurrency, Computer Programming, Data Sc...",1.0
1,918,"Proof of Individuality, the New-Age Security o...",1.0,"[Cryptocurrency, Computer Programming, Data Sc...",1.0
2,1300,Ethereum and Bitcoin Are Market Leaders But No...,0.707,"[Cryptocurrency, Digital Marketing, E-Commerce...",0.707
3,676,Microsoft Continues to Embrace Ethereum & Bitc...,0.5,"[Cryptocurrency, Operating Systems & Runtimes,...",0.5
4,482,Deep-learning neural network creates its own i...,0.5,"[Data Science & Machine Learning, Computer Pro...",0.5


In [416]:
item_als_hybrid = pd.merge(item_collab_result, item_als_result, on='item_id_adj', how='left')

In [417]:
item_als_hybrid.shape

(2132, 9)

In [418]:
item_als_hybrid.isna().sum()

item_id_adj                  0
title_x                      0
score_x                      0
topics_x                     0
normalized_score_collab      0
score_y                    755
title_y                    755
topics_y                   755
normalized_score_als       755
dtype: int64

In [419]:
item_als_hybrid.head()

Unnamed: 0,item_id_adj,title_x,score_x,topics_x,normalized_score_collab,score_y,title_y,topics_y,normalized_score_als
0,1190,"Ethereum, a Virtual Currency, Enables Transact...",1.0,"[Cryptocurrency, Computer Programming, Data Sc...",1.0,1.0,"Ethereum, a Virtual Currency, Enables Transact...","[Cryptocurrency, Computer Programming, Data Sc...",1.0
1,918,"Proof of Individuality, the New-Age Security o...",1.0,"[Cryptocurrency, Computer Programming, Data Sc...",1.0,0.267,"Proof of Individuality, the New-Age Security o...","[Cryptocurrency, Computer Programming, Data Sc...",0.545849
2,1300,Ethereum and Bitcoin Are Market Leaders But No...,0.707,"[Cryptocurrency, Digital Marketing, E-Commerce...",0.707,-0.046,Ethereum and Bitcoin Are Market Leaders But No...,"[Cryptocurrency, Digital Marketing, E-Commerce...",0.351921
3,676,Microsoft Continues to Embrace Ethereum & Bitc...,0.5,"[Cryptocurrency, Operating Systems & Runtimes,...",0.5,-0.064,Microsoft Continues to Embrace Ethereum & Bitc...,"[Cryptocurrency, Operating Systems & Runtimes,...",0.340768
4,482,Deep-learning neural network creates its own i...,0.5,"[Data Science & Machine Learning, Computer Pro...",0.5,0.066,Deep-learning neural network creates its own i...,"[Data Science & Machine Learning, Computer Pro...",0.421314


In [420]:
item_als_hybrid.score_y.value_counts()

 0.000    47
 0.170    11
 0.267     9
-0.148     8
 0.035     8
          ..
-0.018     1
 0.559     1
-0.087     1
-0.191     1
 0.152     1
Name: score_y, Length: 666, dtype: int64

In [421]:
item_als_hybrid.dropna(inplace=True)

In [422]:
item_als_hybrid.head()

Unnamed: 0,item_id_adj,title_x,score_x,topics_x,normalized_score_collab,score_y,title_y,topics_y,normalized_score_als
0,1190,"Ethereum, a Virtual Currency, Enables Transact...",1.0,"[Cryptocurrency, Computer Programming, Data Sc...",1.0,1.0,"Ethereum, a Virtual Currency, Enables Transact...","[Cryptocurrency, Computer Programming, Data Sc...",1.0
1,918,"Proof of Individuality, the New-Age Security o...",1.0,"[Cryptocurrency, Computer Programming, Data Sc...",1.0,0.267,"Proof of Individuality, the New-Age Security o...","[Cryptocurrency, Computer Programming, Data Sc...",0.545849
2,1300,Ethereum and Bitcoin Are Market Leaders But No...,0.707,"[Cryptocurrency, Digital Marketing, E-Commerce...",0.707,-0.046,Ethereum and Bitcoin Are Market Leaders But No...,"[Cryptocurrency, Digital Marketing, E-Commerce...",0.351921
3,676,Microsoft Continues to Embrace Ethereum & Bitc...,0.5,"[Cryptocurrency, Operating Systems & Runtimes,...",0.5,-0.064,Microsoft Continues to Embrace Ethereum & Bitc...,"[Cryptocurrency, Operating Systems & Runtimes,...",0.340768
4,482,Deep-learning neural network creates its own i...,0.5,"[Data Science & Machine Learning, Computer Pro...",0.5,0.066,Deep-learning neural network creates its own i...,"[Data Science & Machine Learning, Computer Pro...",0.421314


In [423]:
# Drop title_y and topics_y
item_als_hybrid.drop(columns=['title_y', 'topics_y'], inplace=True)

In [424]:
item_als_hybrid.head()

Unnamed: 0,item_id_adj,title_x,score_x,topics_x,normalized_score_collab,score_y,normalized_score_als
0,1190,"Ethereum, a Virtual Currency, Enables Transact...",1.0,"[Cryptocurrency, Computer Programming, Data Sc...",1.0,1.0,1.0
1,918,"Proof of Individuality, the New-Age Security o...",1.0,"[Cryptocurrency, Computer Programming, Data Sc...",1.0,0.267,0.545849
2,1300,Ethereum and Bitcoin Are Market Leaders But No...,0.707,"[Cryptocurrency, Digital Marketing, E-Commerce...",0.707,-0.046,0.351921
3,676,Microsoft Continues to Embrace Ethereum & Bitc...,0.5,"[Cryptocurrency, Operating Systems & Runtimes,...",0.5,-0.064,0.340768
4,482,Deep-learning neural network creates its own i...,0.5,"[Data Science & Machine Learning, Computer Pro...",0.5,0.066,0.421314


In [425]:
# Calculate final score by multiplying normalized_score_collab by 2/3, and normalized_score_als by 1/3, and then adding them together
item_als_hybrid['final_score'] = (item_als_hybrid['normalized_score_collab'] * 2/3) + (item_als_hybrid['normalized_score_als'] * 1/3)

In [426]:
# Sort the DataFrame by final_score in descending order
item_als_hybrid.sort_values(by='final_score', ascending=False, inplace=True)

In [427]:
item_als_hybrid.head()

Unnamed: 0,item_id_adj,title_x,score_x,topics_x,normalized_score_collab,score_y,normalized_score_als,final_score
0,1190,"Ethereum, a Virtual Currency, Enables Transact...",1.0,"[Cryptocurrency, Computer Programming, Data Sc...",1.0,1.0,1.0,1.0
1,918,"Proof of Individuality, the New-Age Security o...",1.0,"[Cryptocurrency, Computer Programming, Data Sc...",1.0,0.267,0.545849,0.848616
2,1300,Ethereum and Bitcoin Are Market Leaders But No...,0.707,"[Cryptocurrency, Digital Marketing, E-Commerce...",0.707,-0.046,0.351921,0.58864
5,916,Gold Backed Digix Raises Millions in Hours on ...,0.5,"[Cryptocurrency, Computer Programming, Digital...",0.5,0.114,0.451053,0.483684
4,482,Deep-learning neural network creates its own i...,0.5,"[Data Science & Machine Learning, Computer Pro...",0.5,0.066,0.421314,0.473771


Since ALS results are more diverse, we include them in the final results. However, the item-based results are more intuitive, so we give them a higher weight: 2/3 for the item-based score and 1/3 for the ALS score.
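
The arithmetic behind this weighting can be sketched on toy data. This is a minimal illustration, not the notebook's code: `min_max` and `blend` are hypothetical helper names, the item ids and scores are made up, and the guard against a zero score range is an added assumption. Only the 2/3 and 1/3 weights come from the cells above.

```python
def min_max(scores):
    """Scale a {item_id: score} dict to [0, 1]; all zeros if every score is equal."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # Assumed guard: avoids dividing by zero when there is no score range
        return {k: 0.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

def blend(primary, secondary, w_primary=2 / 3):
    """Weighted sum of two normalized score dicts, keeping items present in both."""
    p, s = min_max(primary), min_max(secondary)
    # Same effect as a left merge followed by dropna()
    common = p.keys() & s.keys()
    final = {k: w_primary * p[k] + (1 - w_primary) * s[k] for k in common}
    # Highest final score first
    return sorted(final.items(), key=lambda kv: kv[1], reverse=True)

# Toy scores keyed by made-up item ids; item 13 has no ALS score and is dropped
collab = {10: 1.0, 11: 0.5, 12: 0.0, 13: 0.9}
als = {10: 0.2, 11: 0.8, 12: -0.4}

ranked = blend(collab, als)
for item_id, score in ranked:
    print(item_id, round(score, 3))
```

Item 11 has the best ALS score, but item 10 still ranks first because the item-based score carries twice the weight.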

Expose this as a method.

In [428]:
def get_articles_matching_article_from_als_item_hybrid(article_id, n=-1):
    item_als_result = get_articles_matching_article_from_als(article_id, all=True)

    # Normalize the scores in item_als_result
    item_als_result['normalized_score_als'] = (item_als_result['score'] - min(item_als_result['score'])) / (max(item_als_result['score']) - min(item_als_result['score']))

    item_collab_result = get_articles_matching_article_from_item_based(article_id)

    # Normalize the scores in item_collab_result
    item_collab_result['normalized_score_collab'] = (item_collab_result['score'] - min(item_collab_result['score'])) / (max(item_collab_result['score']) - min(item_collab_result['score']))

    item_als_hybrid = pd.merge(item_collab_result, item_als_result, on='item_id_adj', how='left')

    item_als_hybrid.dropna(inplace=True)

    # Drop title_y and topics_y
    item_als_hybrid.drop(columns=['title_y', 'topics_y'], inplace=True)

    # Calculate final score by multiplying normalized_score_collab by 2/3, and normalized_score_als by 1/3, and then adding them together
    item_als_hybrid['final_score'] = (item_als_hybrid['normalized_score_collab'] * 2/3) + (item_als_hybrid['normalized_score_als'] * 1/3)

    # Sort the DataFrame by final_score in descending order
    item_als_hybrid.sort_values(by='final_score', ascending=False, inplace=True)

    # Reset the index
    item_als_hybrid.reset_index(drop=True, inplace=True)

    # Drop the score_x, score_y, normalized_score_als and normalized_score_collab columns
    item_als_hybrid.drop(columns=['score_x', 'score_y', 'normalized_score_als', 'normalized_score_collab'], inplace=True)

    # Rename title_x to title and topics_x to topics
    item_als_hybrid.rename(columns={'title_x': 'title', 'topics_x': 'topics'}, inplace=True)

    if n > 0:
        # Return only the first n articles
        return item_als_hybrid.head(n)

    return item_als_hybrid

In [429]:
get_articles_matching_article_from_als_item_hybrid(test_article_id, n=5)

Unnamed: 0,item_id_adj,title,topics,final_score
0,1190,"Ethereum, a Virtual Currency, Enables Transact...","[Cryptocurrency, Computer Programming, Data Sc...",1.0
1,918,"Proof of Individuality, the New-Age Security o...","[Cryptocurrency, Computer Programming, Data Sc...",0.848616
2,1300,Ethereum and Bitcoin Are Market Leaders But No...,"[Cryptocurrency, Digital Marketing, E-Commerce...",0.58864
3,916,Gold Backed Digix Raises Millions in Hours on ...,"[Cryptocurrency, Computer Programming, Digital...",0.483684
4,482,Deep-learning neural network creates its own i...,"[Data Science & Machine Learning, Computer Pro...",0.473771


#### ALS & Content-based

In [430]:
content_als_hybrid = pd.merge(item_als_hybrid, item_content_result, on='item_id_adj', how='left')

In [431]:
content_als_hybrid.shape

(1389, 12)

In [432]:
content_als_hybrid.isna().sum()

item_id_adj                 0
title_x                     0
score_x                     0
topics_x                    0
normalized_score_collab     0
score_y                     0
normalized_score_als        0
final_score                 0
title                       0
topics                      0
score                       0
normalized_score_content    0
dtype: int64

In [433]:
content_als_hybrid.head()

Unnamed: 0,item_id_adj,title_x,score_x,topics_x,normalized_score_collab,score_y,normalized_score_als,final_score,title,topics,score,normalized_score_content
0,1190,"Ethereum, a Virtual Currency, Enables Transact...",1.0,"[Cryptocurrency, Computer Programming, Data Sc...",1.0,1.0,1.0,1.0,"Ethereum, a Virtual Currency, Enables Transact...","[Cryptocurrency, Computer Programming, Data Sc...",1.0,1.0
1,918,"Proof of Individuality, the New-Age Security o...",1.0,"[Cryptocurrency, Computer Programming, Data Sc...",1.0,0.267,0.545849,0.848616,"Proof of Individuality, the New-Age Security o...","[Cryptocurrency, Computer Programming, Data Sc...",0.064632,0.064266
2,1300,Ethereum and Bitcoin Are Market Leaders But No...,0.707,"[Cryptocurrency, Digital Marketing, E-Commerce...",0.707,-0.046,0.351921,0.58864,Ethereum and Bitcoin Are Market Leaders But No...,"[Cryptocurrency, Digital Marketing, E-Commerce...",0.576909,0.576743
3,916,Gold Backed Digix Raises Millions in Hours on ...,0.5,"[Cryptocurrency, Computer Programming, Digital...",0.5,0.114,0.451053,0.483684,Gold Backed Digix Raises Millions in Hours on ...,"[Cryptocurrency, Computer Programming, Digital...",0.231759,0.231457
4,482,Deep-learning neural network creates its own i...,0.5,"[Data Science & Machine Learning, Computer Pro...",0.5,0.066,0.421314,0.473771,Deep-learning neural network creates its own i...,"[Data Science & Machine Learning, Computer Pro...",0.024417,0.024035


Since the algorithm is similar, we define the method directly here.

In [434]:
def get_articles_matching_article_from_als_content_hybrid(article_id, n=-1, ignore=()):
    item_als_result = get_articles_matching_article_from_als(article_id, all=True)

    # Normalize the scores in item_als_result
    item_als_result['normalized_score_als'] = (item_als_result['score'] - min(item_als_result['score'])) / (max(item_als_result['score']) - min(item_als_result['score']))

    item_content_result = get_articles_matching_article_from_content_based(article_id)

    # Normalize the scores in item_content_result
    item_content_result['normalized_score_content'] = (item_content_result['score'] - min(item_content_result['score'])) / (max(item_content_result['score']) - min(item_content_result['score']))

    content_als_hybrid = pd.merge(item_content_result, item_als_result, on='item_id_adj', how='left')

    content_als_hybrid.dropna(inplace=True)

    # Drop title_y and topics_y
    content_als_hybrid.drop(columns=['title_y', 'topics_y'], inplace=True)

    # If ignore is not empty, drop the rows with item_id_adj in ignore
    if len(ignore) > 0:
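        # Copy the filtered frame so that adding final_score below does not trigger SettingWithCopyWarning
        content_als_hybrid = content_als_hybrid[~content_als_hybrid['item_id_adj'].isin(ignore)].copy()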

    # Calculate final score by multiplying normalized_score_content by 2/3, and normalized_score_als by 1/3, and then adding them together
    content_als_hybrid['final_score'] = (content_als_hybrid['normalized_score_content'] * 2/3) + (content_als_hybrid['normalized_score_als'] * 1/3)

    # Sort the DataFrame by final_score in descending order
    content_als_hybrid.sort_values(by='final_score', ascending=False, inplace=True)

    # Reset the index
    content_als_hybrid.reset_index(drop=True, inplace=True)

    # Drop the score_x, score_y, normalized_score_als and normalized_score_content columns
    content_als_hybrid.drop(columns=['score_x', 'score_y', 'normalized_score_als', 'normalized_score_content'], inplace=True)

    # Rename title_x to title and topics_x to topics
    content_als_hybrid.rename(columns={'title_x': 'title', 'topics_x': 'topics'}, inplace=True)

    if n > 0:
        # Return only the first n articles
        return content_als_hybrid.head(n)

    return content_als_hybrid

In [435]:
get_articles_matching_article_from_als_content_hybrid(test_article_id, n=5)

Unnamed: 0,title,item_id_adj,topics,final_score
0,"Ethereum, a Virtual Currency, Enables Transact...",1190,"[Cryptocurrency, Computer Programming, Data Sc...",1.0
1,The Rise And Growth of Ethereum Gets Mainstrea...,1059,"[Cryptocurrency, Computer Programming, Cloud C...",0.726852
2,"For Blockchain VCs, the Time for Ethereum Inve...",1115,"[Cryptocurrency, Digital Marketing, E-Commerce...",0.516105
3,Ethereum and Bitcoin Are Market Leaders But No...,1300,"[Cryptocurrency, Digital Marketing, E-Commerce...",0.501802
4,Solidity Available in Visual Studio - Ethereum...,99,"[Cryptocurrency, Facebook, Cloud Computing, Op...",0.458977


Results are as expected.

### Final API

#### The Cold Start problem

For a new user, we get articles based on the topics that the user has marked as interests. If the user has not marked any topics, we return 10 random articles.

In [436]:
def get_common_topics(user_topics, article_topics):
    # Topics present in both lists (order is not preserved)
    return list(set(user_topics) & set(article_topics))

def get_n_articles_with_topic(topic, n=10, all=False):
    if n <= 0:
        n = 10
    # Keep only the articles whose topic list contains the given topic
    df = cnt.loc[cnt['topics'].apply(lambda t: topic in t)]
    if all:
        return df
    # Cap the sample size so that a topic with fewer than n articles does not raise a ValueError
    return df.sample(min(n, len(df)))

def get_10_articles_for_new_user(user_id, topics=None):
    keep = ['item_id_adj', 'title', 'topics']
    if not topics:
        return cnt.sample(10)[keep]

    # For each topic, get up to 10 articles and combine them
    df = pd.concat([get_n_articles_with_topic(topic, n=10) for topic in topics])

    # The same article can be sampled under several topics, so drop repeated rows
    df = df[~df.index.duplicated()]

    return df.sample(min(10, len(df)))[keep]

Check if non-HTML articles are present and correct.

In [437]:
crypto = get_n_articles_with_topic('Cryptocurrency', all=True)

In [438]:
crypto.item_type.value_counts()

HTML     709
VIDEO      4
RICH       2
Name: item_type, dtype: int64

In [439]:
crypto[crypto['item_type'] != 'HTML']

Unnamed: 0,index,event_timestamp,interaction_type,item_type,item_url,title,text_description,language,text_description_lemmatized,item_id_adj,topics
98,118,1459423815,content_present,RICH,https://soundcloud.com/epicenterbitcoin/eb-124,EB124 - Rune Christensen: Maker Dao Ethereum's...,"Support the show, consider donating: 1GW6t1vzH...",en,"[support, show, consider, donating, gw, vzhkn,...",814,"[Cryptocurrency, Computer Programming, Operati..."
356,451,1460854706,content_present,VIDEO,http://www.ted.com/talks/linus_torvalds_the_mi...,Linus Torvalds: The mind behind Linux,Linus Torvalds transformed technology twice --...,en,"[linus, torvalds, transformed, technology, twi...",960,"[Computer Programming, Operating Systems & Run..."
1022,1371,1465857905,content_present,RICH,https://www.scribd.com/doc/315571329/LinkedIn-...,Deck describing how MSFT plans to use Linkedin,This presentation contains certain forward-loo...,en,"[presentation, contains, certain, forward, loo...",384,"[Digital Marketing, Cloud Computing, Cryptocur..."
1078,1441,1466275303,content_present,VIDEO,http://www.ted.com/talks/bill_gross_the_single...,[Videos] Bill Gross: The single biggest reason...,You have JavaScript disabled Bill Gross has fo...,en,"[javascript, disabled, bill, gross, founded, l...",1168,"[Computer Programming, Digital Marketing, Data..."
1189,1607,1467323050,content_present,VIDEO,http://www.ted.com/talks/julia_galef_why_you_t...,Julia Galef: Why you think you're right -- eve...,You have JavaScript disabled Perspective is ev...,en,"[javascript, disabled, perspective, everything...",1456,"[Computer Programming, Digital Marketing, Appl..."
1602,2244,1472746714,content_present,VIDEO,http://www.ted.com/talks/don_tapscott_how_the_...,Don Tapscott: How the blockchain is changing m...,"What is the blockchain? If you don't know, you...",en,"[blockchain, know, chance, still, need, clarif...",2117,"[Cryptocurrency, Digital Marketing, Computer P..."


The results are as expected. So, the modeling is working properly.

In [440]:
get_10_articles_for_new_user(10000)

Unnamed: 0,item_id_adj,title,topics
1611,2125,It's Official: 68 Million Dropbox Account Deta...,"[Computer Programming, Facebook, Cloud Computi..."
1165,1430,Research: Why Best Practices Don't Translate A...,"[Digital Marketing, Computer Programming, E-Co..."
605,3020,Hidden (Caché) (2005),"[Computer Programming, Apple, Digital Marketin..."
1658,2871,Cookies vs Tokens: The Definitive Guide,"[Computer Programming, Facebook, Google, Cloud..."
1379,1798,Elasticsearch: CSV exporter for Kibana Discover,"[Computer Programming, Google, Data Science & ..."
1235,1474,Organizing for digital acceleration: Making a ...,"[Digital Marketing, Operating Systems & Runtim..."
9,246,Setting Up HTTP(S) Load Balancing,"[Computer Programming, Cloud Computing, Operat..."
849,262,A step-by-step guide to agile growth experiments,"[Digital Marketing, Computer Programming, Data..."
515,3011,Rams (2015),"[Computer Programming, Digital Marketing, Appl..."
637,1544,Creative partnerships: Machine learning and th...,"[Data Science & Machine Learning, Digital Mark..."


Check the output when topics are given.

In [441]:
get_10_articles_for_new_user(10000, topics=['Google', 'Cryptocurrency', 'Computer Programming'])

Unnamed: 0,item_id_adj,title,topics
130,741,Million-dollar babies,"[Data Science & Machine Learning, Digital Mark..."
1119,1444,Google Fiber agrees to acquire Webpass,"[Google, Cloud Computing, Digital Marketing, E..."
1708,2284,Keynotes from the O'Reilly Velocity Conference...,"[Digital Marketing, Cloud Computing, Computer ..."
868,264,8 Insanely Simple Productivity Hacks,"[Computer Programming, Digital Marketing, Face..."
301,24,Blockchain won't kill banks: Bitcoin pioneer,"[Cryptocurrency, Digital Marketing, E-Commerce..."
984,3032,New smart toothbrush from Philips Sonicare is ...,"[Computer Programming, Google, Apple, Digital ..."
1436,1883,Soylent's new drink will replace your breakfas...,"[Digital Marketing, Computer Programming, Appl..."
1376,1784,Android - The dark side of Jack and Jill,"[Computer Programming, Google, Digital Marketi..."
330,1659,Voronoi Diagrams on the GPU,"[Computer Programming, Data Science & Machine ..."
2132,2859,Spring Boot 1.5.1 released,"[Computer Programming, Cloud Computing, E-Comm..."


The pool of topics seems diverse enough, so we can use this method.

The cold start problem is less severe for new articles, because we combine item-based collaborative filtering with content-based filtering. A new article can still be picked as long as it is similar to existing articles.

#### Get top 10 articles for a user at the start of the day

In [442]:
def get_top_10_articles_for_user(user_id):
    if not consumer_helper.is_known_id(user_id):
        return get_10_articles_for_new_user(user_id)
    return get_articles_for_user_from_als(user_id, n=10)

In [443]:
get_top_10_articles_for_user(test_user_id)

Unnamed: 0,item_id_adj,title,score,topics
0,885,Program your way to your next grocery delivery,0.417,"[Facebook, Computer Programming, Digital Marke..."
1,163,"Forget The Internet Of Things, There Is A Digi...",0.396,"[Digital Marketing, Data Science & Machine Lea..."
2,1628,How This Former Google Engineer Is Bringing Bl...,0.374,"[Cryptocurrency, Data Science & Machine Learni..."
3,1570,Visual Thinking and Learning 3.0 working toget...,0.356,"[Data Science & Machine Learning, Computer Pro..."
4,1378,Google Ranking Factors: The Complete List,0.349,"[Google, Computer Programming, E-Commerce, Dat..."
5,1808,You don't talk about refactoring club,0.339,"[Computer Programming, Digital Marketing, E-Co..."
6,297,22 Mobile Stats Everyone Should Know - DZone M...,0.324,"[Digital Marketing, Google, Computer Programmi..."
7,2035,Building Flipkart Lite: A Progressive Web App,0.304,"[Computer Programming, Google, Facebook, Cloud..."
8,1559,[Retro] Celebration Grids - Management 3.0,0.288,"[Computer Programming, Data Science & Machine ..."
9,1518,2 terrific #MarTech talks on the rise of AI in...,0.28,"[Data Science & Machine Learning, Digital Mark..."


#### Get more articles for a user when they read an article

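We use the ALS and content-based hybrid defined above, since its results are both intuitive and diverse. Articles that the user has already read, along with the article currently being read, are excluded from the results.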

In [444]:
def get_articles_read_by_user(user_id):
    return list(txns[txns['consumer_id_adj'] == user_id]['item_id_adj'].values)

def get_more_articles_for_user(article_id, user_id):
    to_filter = get_articles_read_by_user(user_id)

    # Append the article_id to to_filter
    to_filter.append(article_id)

    return get_articles_matching_article_from_als_content_hybrid(article_id, n=10, ignore=to_filter)

In [445]:
get_more_articles_for_user(test_article_id, test_user_id)

Unnamed: 0,title,item_id_adj,topics,final_score
0,The Rise And Growth of Ethereum Gets Mainstrea...,1059,"[Cryptocurrency, Computer Programming, Cloud C...",0.726852
1,"For Blockchain VCs, the Time for Ethereum Inve...",1115,"[Cryptocurrency, Digital Marketing, E-Commerce...",0.516105
2,Ethereum and Bitcoin Are Market Leaders But No...,1300,"[Cryptocurrency, Digital Marketing, E-Commerce...",0.501802
3,Solidity Available in Visual Studio - Ethereum...,99,"[Cryptocurrency, Facebook, Cloud Computing, Op...",0.458977
4,Microsoft Continues to Embrace Ethereum & Bitc...,676,"[Cryptocurrency, Operating Systems & Runtimes,...",0.412385
5,Microsoft Adds Ethereum to Windows Platform Fo...,1019,"[Cryptocurrency, Cloud Computing, Facebook, Co...",0.385381
6,Five Bitcoin and Ethereum Based Projects to Wa...,723,"[Cryptocurrency, Computer Programming, Faceboo...",0.364199
7,"Eyeing Volume, Asian Exchanges Add Support for...",707,"[Cryptocurrency, Digital Marketing, Operating ...",0.361135
8,Growing Global Electricity Consumption Is Not ...,772,"[Cryptocurrency, Computer Programming, Data Sc...",0.33786
9,"Google Failure, Ethereum Leaps, ECB Giveout in...",549,"[Cryptocurrency, Google, Digital Marketing, Co...",0.337455


### Online evaluation for Item recommendations

#### Evaluation method

To check whether the recommendations are good, we will be using the following method:

If a user scrolls through at least 75% of an article, we consider it a positive interaction. To measure this accurately, we should also track how long the user takes to scroll through the article. When the user clicks on an article and the page opens, we start a timer. The timer stops when the user either leaves the page or has scrolled through 75% of the article.
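
A minimal sketch of how such an event could be labelled, assuming the front end reports the scroll depth and the timer value when the timer stops. The `ReadEvent` type, its field names, and the `min_seconds` cutoff are hypothetical; the text above only fixes the 75% scroll threshold.

```python
from dataclasses import dataclass

SCROLL_THRESHOLD = 0.75  # fraction of the article, taken from the text above

@dataclass
class ReadEvent:
    user_id: int
    item_id: int
    scroll_fraction: float  # deepest point reached, between 0 and 1
    seconds_on_page: float  # timer value when the timer was stopped

def is_positive(event: ReadEvent, min_seconds: float = 10.0) -> bool:
    """Positive if the user scrolled far enough and did not just skim.

    min_seconds is an assumed cutoff for filtering out very fast scrolling.
    """
    return (event.scroll_fraction >= SCROLL_THRESHOLD
            and event.seconds_on_page >= min_seconds)

def positive_rate(events: list[ReadEvent]) -> float:
    """Share of opened recommendations that became positive interactions."""
    if not events:
        return 0.0
    return sum(is_positive(e) for e in events) / len(events)

# Made-up events for illustration
events = [
    ReadEvent(1, 1059, 0.90, 42.0),  # read most of the article
    ReadEvent(1, 1115, 0.80, 3.0),   # scrolled to the end too quickly
    ReadEvent(2, 1300, 0.30, 55.0),  # left the page early
    ReadEvent(2, 99, 0.75, 20.0),    # exactly at the scroll threshold
]
print(positive_rate(events))
```

Tracking the positive rate over time, per recommendation method, would give a simple online metric for comparing the hybrids.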

#### Further improvements

To evaluate the results of getting articles similar to another article, we can use the articles' topics. If two articles share most of their topics, we treat them as similar.
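
One way to make topic similarity concrete is the Jaccard index of the two topic lists. This is an assumed choice of metric, not something the notebook defines, and `topic_similarity` and `mean_topic_similarity` are hypothetical helper names.

```python
def topic_similarity(topics_a, topics_b):
    """Jaccard index of two topic lists: shared topics / distinct topics."""
    a, b = set(topics_a), set(topics_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def mean_topic_similarity(source_topics, recommended_topics):
    """Average similarity between a source article and its recommendations."""
    if not recommended_topics:
        return 0.0
    total = sum(topic_similarity(source_topics, t) for t in recommended_topics)
    return total / len(recommended_topics)

# Made-up topic lists for illustration
source = ['Cryptocurrency', 'Computer Programming', 'Cloud Computing']
recommended = [
    ['Cryptocurrency', 'Computer Programming'],  # 2 shared out of 3 distinct
    ['Cryptocurrency', 'Digital Marketing'],     # 1 shared out of 4 distinct
    ['Apple', 'Google'],                         # nothing shared
]
score = mean_topic_similarity(source, recommended)
print(round(score, 3))
```

A score near 1 means the recommendations stay close to the source article's topics, while a score near 0 means they drift away from them.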

In order to further personalize the recommendations, we can use the user's interests. This can also be broken down into a list of topics. The recommended articles should generally have the topics in which the user is interested. When serving articles to the user, we can also keep track of the topics in which the user has read the most articles.

That being said, diversifying the results is important in order to keep the user engaged. To do this, we could track topics that are similar to a user's favorite topics. If the user has not read many articles in such a similar topic, we can recommend articles from that topic.
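
The topic tracking and diversification ideas above could be sketched as follows. The text does not say how similar topics are found; this sketch assumes two topics are similar when they often appear on the same article. All function names and the `max_reads` cutoff are hypothetical.

```python
from collections import Counter
from itertools import combinations

def topic_counts(read_topic_lists):
    """Count how many of the user's read articles mention each topic."""
    return Counter(t for topics in read_topic_lists for t in set(topics))

def co_occurrence(all_topic_lists):
    """Count how often each pair of topics appears on the same article."""
    pairs = Counter()
    for topics in all_topic_lists:
        for a, b in combinations(sorted(set(topics)), 2):
            pairs[(a, b)] += 1
    return pairs

def topics_to_explore(read_topic_lists, all_topic_lists, max_reads=1):
    """Topics that co-occur with the user's favorite topic but are rarely read."""
    counts = topic_counts(read_topic_lists)
    if not counts:
        return []
    favorite = counts.most_common(1)[0][0]
    related = Counter()
    for (a, b), n in co_occurrence(all_topic_lists).items():
        if favorite == a:
            related[b] += n
        elif favorite == b:
            related[a] += n
    # Keep related topics that the user has read at most max_reads times
    return [t for t, _ in related.most_common() if counts[t] <= max_reads]

# Made-up catalog and reading history for illustration
catalog = [
    ['Cryptocurrency', 'Cloud Computing'],
    ['Cryptocurrency', 'Cloud Computing'],
    ['Cryptocurrency', 'Google'],
    ['Apple', 'Google'],
]
read = [
    ['Cryptocurrency', 'Google'],
    ['Cryptocurrency', 'Google'],
    ['Cryptocurrency'],
]
print(topics_to_explore(read, catalog))
```

Here the user's favorite topic is Cryptocurrency. Cloud Computing co-occurs with it in the catalog but has not been read yet, so it is suggested, while Google is skipped because the user already reads it.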