### How to Build a Recommendation Engine That Isn’t MovieLens

Recommendation engines are pretty simple. Or at least, countless online tutorials make them seem simple. The only problem: **it’s hard to find a tutorial that doesn’t use** the ready-made, pre-baked **MovieLens** dataset. Fine. But perhaps you’ve followed one of these tutorials and then struggled to imagine how you would implement a recommendation engine on your own data. In this workshop, I’ll show you how to use industry-leading open source tools to **build your own engine** and how to **structure your own data** so that it is “recommendation-compatible”.

In [1]:
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

### MovieLens

![](images/movielens.png)

### Quickstart

```sh
pip install lightfm
```

In [2]:
from lightfm import LightFM
from lightfm.datasets import fetch_movielens
from lightfm.evaluation import precision_at_k

data = fetch_movielens(min_rating=5.0)
model = LightFM(loss='warp')
model.fit(data['train'], epochs=30, num_threads=2)

precision_at_k(model, data['test'], k=5).mean()

0.05310436
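
To make that number less opaque: precision at k is the fraction of a user's top-k ranked items that appear among that user's held-out positives, averaged over users. A minimal pure-Python sketch with made-up rankings (the function name and the toy data are illustrative, not LightFM's internals):

```python
def precision_at_k_toy(ranked_items, positives, k=5):
    """Fraction of the first k ranked items that are known positives."""
    top_k = ranked_items[:k]
    hits = sum(1 for item in top_k if item in positives)
    return hits / k

# Hypothetical ranking for one user (best first) and the items they liked in the test set
ranked = ['twix', 'skittles', 'snickers', 'kit-kat', 'airheads', 'rolo']
liked = {'skittles', 'rolo'}

score = precision_at_k_toy(ranked, liked, k=5)
print(score)  # 0.2
```

Users with few held-out positives cap this metric: a user with a single test item can score at most 0.2 at k=5.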

In [3]:
data

{'train': <943x1682 sparse matrix of type '<class 'numpy.float32'>'
 	with 19048 stored elements in COOrdinate format>,
 'test': <943x1682 sparse matrix of type '<class 'numpy.int32'>'
 	with 2153 stored elements in COOrdinate format>,
 'item_features': <1682x1682 sparse matrix of type '<class 'numpy.float32'>'
 	with 1682 stored elements in Compressed Sparse Row format>,
 'item_feature_labels': array(['Toy Story (1995)', 'GoldenEye (1995)', 'Four Rooms (1995)', ...,
        'Sliding Doors (1998)', 'You So Crazy (1994)',
        'Scream of Stone (Schrei aus Stein) (1991)'], dtype=object),
 'item_labels': array(['Toy Story (1995)', 'GoldenEye (1995)', 'Four Rooms (1995)', ...,
        'Sliding Doors (1998)', 'You So Crazy (1994)',
        'Scream of Stone (Schrei aus Stein) (1991)'], dtype=object)}

In [4]:
data['train']

<943x1682 sparse matrix of type '<class 'numpy.float32'>'
	with 19048 stored elements in COOrdinate format>
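
The `COOrdinate` format in that repr stores one (row, column, value) triple per observed interaction and nothing for the empty cells. A small sketch with invented values, assuming NumPy and SciPy are installed:

```python
import numpy as np
from scipy.sparse import coo_matrix

# Three toy interactions: (user 0, item 1), (user 2, item 0), (user 2, item 3)
rows = np.array([0, 2, 2])
cols = np.array([1, 0, 3])
vals = np.array([5.0, 5.0, 5.0], dtype=np.float32)

toy = coo_matrix((vals, (rows, cols)), shape=(3, 4))

print(toy.nnz)        # number of stored elements: 3
print(toy.toarray())  # dense view, zeros where nothing was observed
```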

### Data

![](images/halloween.png)

![](images/candy.jpg)

![](images/influenster.png)

This scraper was working as of 2019-10-19:

```python
import time
import json
import random
from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options
from gazpacho import Soup
from tqdm import tqdm
import pandas as pd

options = Options()
options.headless = True
browser = Firefox(options=options)

def make_soup(url):
    browser.get(url)
    html = browser.page_source
    soup = Soup(html)
    return soup

def build_review_url(product, page):
    base = 'https://www.influenster.com'
    url = f'{base}/{product}?review_sort=most+recent&review_page={page}'
    return url

def parse_review(product, review):
    stars = int(review.find('div', {'class': 'avg-stars'}).attrs['data-stars'])
    user = review.find('div', {'class': 'content-item-author-info'}).find('a').attrs['href'][1:]
    return {'product': product, 'user': user, 'stars': stars}

def scrape_product_page(product, page):
    url = build_review_url(product, page)
    soup = make_soup(url)
    page_reviews = soup.find('div', {'class': 'content-item review-item'})
    return [parse_review(product, r) for r in page_reviews]

def scrape_product(product):
    url = build_review_url(product, 1)
    soup = make_soup(url)
    pages = int(
        soup.find('div', {'class': 'product-highlights-results'})
        .text
        .replace(',', '')
        .split(' ')[0]
    ) // 10 + 1
    pages = min(pages, 100)
    pages = list(range(1, pages+1))
    random.shuffle(pages)
    reviews = []
    for page in pages:
        print(f'scraping page: {page}')
        page_reviews = scrape_product_page(product, page)
        reviews.extend(page_reviews)
        time.sleep(random.randint(1, 10) / 10)
    return reviews

def scrape_index(category='sweets-candy-gum'):
    product_index = []
    for page in tqdm(range(1, 10+1)):
        url = f'https://www.influenster.com/reviews/{category}?page={page}'
        soup = make_soup(url)
        products = soup.find('a', {'class': 'category-product'}, strict=True)
        products = [p.attrs['href'] for p in products]
        product_index.extend(products)  # collect product URLs for this index page
        time.sleep(random.randint(1, 10) / 10)
    return product_index

if __name__ == '__main__':

    product_index = scrape_index(category='sweets-candy-gum')

    product_reviews = []
    for product in tqdm(product_index):
        print(f'scraping: {product}')
        try:
            reviews = scrape_product(product)
            product_reviews.extend(reviews)
        except Exception:
            # skip products whose pages fail to load or parse
            pass
        time.sleep(random.randint(1, 10) / 10)

    df = pd.DataFrame(product_reviews)
    df.to_csv('data/candy.csv', index=False)
```
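
One detail in `scrape_product` is worth a second look: `count // 10 + 1` requests one page too many whenever the review count is an exact multiple of 10. A hedged alternative, assuming 10 reviews per page as the scraper does (`page_count` is a hypothetical helper, not part of the original script):

```python
import math

def page_count(n_reviews, per_page=10, cap=100):
    """Number of review pages to request, rounded up and capped."""
    return min(math.ceil(n_reviews / per_page), cap)

print(page_count(95))    # 10
print(page_count(100))   # 10, where 100 // 10 + 1 would give 11
print(page_count(5000))  # 100 (capped)
```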

![](images/cat_and_mouse.jpg)

### Scrape 

![](images/influenster_index.png)

[Source](https://www.influenster.com/reviews/sweets-candy-gum)

In [5]:
with open('influenster/index.html', 'r') as f:
    html = f.read()

![](images/gazpacho.png)

```sh
pip install gazpacho
```

In [6]:
from gazpacho import Soup

In [7]:
soup = Soup(html)

In [8]:
products = soup.find('a', {'class': 'category-product'})

In [9]:
products[0]

<a class="category-product" href="https://www.influenster.com/reviews/reeses-peanut-butter-cups-miniatures-76"><div class="category-product-image-container"><div class="category-product-image" style="background-image: url("index_files/51107862.jpg");"><img itemprop="image" data-img="div" src="index_files/51107862.jpg" alt="Reese's Peanut Butter Cups Miniatures" width="0" height="0"></div></div><div class="category-product-detail"><div class="category-product-title" data-truncate-lines="3" style="overflow-wrap: break-word;">Reese's Peanut Butter Cups Miniatures</div><div class="category-product-brand">
By Reese's
</div><div class="category-product-stars"><div class="avg-stars small " data-stars="4.82880054868237"><div class="star"><i class="star-icon"></i><div class="progress" data-star="1" data-progress="100"></div></div><div class="star"><i class="star-icon"></i><div class="progress" data-star="2" data-progress="100"></div></div><div class="star"><i class="star-icon"></i><div class="p

In [10]:
products = [p.attrs['href'] for p in products]

In [11]:
products[:5]

['https://www.influenster.com/reviews/reeses-peanut-butter-cups-miniatures-76',
 'https://www.influenster.com/reviews/ferrero-rocher-chocolate',
 'https://www.influenster.com/reviews/kit-kat-crisp-wafers-in-milk-chocolate',
 'https://www.influenster.com/reviews/lindt-lindor-milk-chocolate-truffles',
 'https://www.influenster.com/reviews/hersheys-cookies-n-creme-candy-bar']

![](images/influenster_skittles.png)

In [12]:
with open('influenster/skittles.html', 'r') as f:
    html = f.read()

In [13]:
soup = Soup(html)

In [14]:
reviews = (soup
    .find('div', {'class': 'layoutComponents__Block-l2otzz-0 efHRYv'}, strict=True)
    .find('div', {'class': 'item wrappers__Wrapper-sc-1mex847-0 jEYnle'})
)

In [15]:
def parse_review(review):
    stars = len(review.find('div', {'class': 'productComponents__SingleStar-sc-1ffpes9-3 kzXpnS'}))
    user = (review.find('div',
        {'class': 'layoutComponents__Row-l2otzz-2 MSbai layoutComponents__Block-l2otzz-0 ixyxcj'}
        ).find('a').attrs['href'])[1:]
    return {'user': user, 'stars': stars}

In [16]:
review = reviews[0]
parse_review(review)

{'user': 'anahce0f', 'stars': 5}

In [17]:
candy = [parse_review(r) for r in reviews]
candy

[{'user': 'anahce0f', 'stars': 5},
 {'user': 'rileyc2ef0', 'stars': 5},
 {'user': 'danielledowsett', 'stars': 5},
 {'user': 'candygirl7585', 'stars': 5},
 {'user': 'marnishaw', 'stars': 5},
 {'user': 'megfields', 'stars': 4},
 {'user': 'member-a58a7cd88', 'stars': 4},
 {'user': 'darcywood', 'stars': 5},
 {'user': 'amandaj64', 'stars': 5},
 {'user': 'member-930e4ca64', 'stars': 4}]

In [18]:
import pandas as pd

df = pd.DataFrame(candy)
df['item'] = 'skittles'
df

| | stars | user | item |
|---|---|---|---|
| 0 | 5 | anahce0f | skittles |
| 1 | 5 | rileyc2ef0 | skittles |
| 2 | 5 | danielledowsett | skittles |
| 3 | 5 | candygirl7585 | skittles |
| 4 | 5 | marnishaw | skittles |
| 5 | 4 | megfields | skittles |
| 6 | 4 | member-a58a7cd88 | skittles |
| 7 | 5 | darcywood | skittles |
| 8 | 5 | amandaj64 | skittles |
| 9 | 4 | member-930e4ca64 | skittles |


### EDA

In [19]:
import pandas as pd

df = pd.read_csv('data/candy.csv')

df.sample(5)

| | item | user | review |
|---|---|---|---|
| 16441 | Jolly Rancher Hard Candy Original Flavors Asso... | evanhawkins | 5 |
| 3404 | 5 Gum | jesse58 | 5 |
| 5580 | Hershey's Kisses Milk Chocolates with Almonds | taylordylan | 4 |
| 7411 | Hershey's Milk Chocolate Bar | wallderrick | 4 |
| 350 | Airheads White Mystery | laura05 | 5 |


In [20]:
df['item'].value_counts()[:5]

Twix                                       340
Snickers Chocolate Bar                     330
Werther's Original Caramel Hard Candies    322
M&Ms Peanut Chocolate Candy                310
M&Ms Milk Chocolate Candy                  273
Name: item, dtype: int64

In [21]:
df['item'].unique().shape

(142,)

In [22]:
df['user'].unique().shape

(2531,)

In [23]:
import chart # pip install chart

chart.histogram(df['review'], bins=5, height=20, mark='x')

        x
        x
        x
        x
        x
        x
        x
        x
        x
        x
        x
        x
        x
        x
        x
        x
      x x
      x x
      x x
x x x x x



In [24]:
df['review'].value_counts()

5    12977
4     2554
3      967
2      372
1      364
Name: review, dtype: int64

In [25]:
df.groupby('user')['item'].count().mean()

6.809166337416041

### Sparsity

In [48]:
ex = pd.DataFrame([
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [1, 0, 0, 1, 0, 0],
    [0, 1, 1, 0, 0, 1],
    [0, 0, 0, 1, 1, 1]], 
    columns=['item_1', 'item_2', 'item_3', 'item_4', 'item_5', 'item_6'])

ex

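| | item_1 | item_2 | item_3 | item_4 | item_5 | item_6 |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 1 | 0 | 1 | 1 | 1 | 0 | 0 |
| 2 | 1 | 0 | 0 | 1 | 0 | 0 |
| 3 | 0 | 1 | 1 | 0 | 0 | 1 |
| 4 | 0 | 0 | 0 | 1 | 1 | 1 |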


In [49]:
r, c = ex.shape
# share of non-zero cells (density); sparsity is 1 minus this value
ex.sum().sum() / (r * c)

0.43333333333333335
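
That ratio can also be estimated for the candy data from the counts reported in the EDA section: 17,234 reviews (the sum of the rating counts), 2,531 users, and 142 items. This sketch assumes each review is a distinct user-item pair, so the density is an upper bound if some users reviewed the same candy twice:

```python
n_reviews = 12977 + 2554 + 967 + 372 + 364  # rating counts from the EDA output
n_users = 2531
n_items = 142

density = n_reviews / (n_users * n_items)   # share of filled user-item cells
sparsity = 1 - density                      # share of empty cells

print(n_reviews)            # 17234
print(round(density, 4))    # 0.048
print(round(sparsity, 4))   # 0.952
```

Roughly 95% of the user-item grid is empty, which is why a sparse representation is the natural fit for this data.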

In [50]:
import sys

sys.getsizeof(ex)

344

In [51]:
ex.values

array([[0, 1, 1, 0, 0, 0],
       [0, 1, 1, 1, 0, 0],
       [1, 0, 0, 1, 0, 0],
       [0, 1, 1, 0, 0, 1],
       [0, 0, 0, 1, 1, 1]])

In [53]:
from scipy.sparse import csc_matrix

sx = csc_matrix(ex.values)

In [55]:
sys.getsizeof(sx)

56
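
A caveat on the two numbers above: for a SciPy sparse matrix, `sys.getsizeof` reports only the size of the Python wrapper object, not the arrays it holds, so 344 vs. 56 overstates the saving. A more faithful comparison, sketched here on a larger random matrix (the shape and density are arbitrary choices for illustration), sums the `nbytes` of the underlying buffers:

```python
import numpy as np
from scipy.sparse import csc_matrix

rng = np.random.default_rng(0)

# Toy matrix: 1,000 x 1,000 with roughly 1% of cells set to 1
dense = (rng.random((1000, 1000)) < 0.01).astype(np.int64)
sparse = csc_matrix(dense)

dense_bytes = dense.nbytes
# CSC keeps three arrays: values, row indices, and column pointers
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes

print(dense_bytes)                 # 8000000
print(sparse_bytes < dense_bytes)  # True
```

At 5 x 6 with 43% of the cells filled, as in `ex`, the sparse form saves little; the benefit shows up at low density.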

### Sparse Candy

In [60]:
df.sample(5)

| | item | user | review |
|---|---|---|---|
| 11147 | The Original Lemonhead | mclay | 5 |
| 5927 | Butterfinger Minis | michael08 | 5 |
| 11404 | Hershey's Nuggets Chocolate Assortment | josephmartin | 5 |
| 1640 | Ferrero Collection Fine Assorted Confections | johnsonlaura | 5 |
| 10640 | Jolly Rancher Hard Candy Original Flavors Asso... | horr | 5 |


In [61]:
import numpy as np

In [62]:
ratings='review'
users='user'
items='item'

ratings = np.array(df[ratings])
users = np.array(df[users])
items = np.array(df[items])

In [64]:
from scipy.sparse import csr_matrix

help(csr_matrix)

Help on class csr_matrix in module scipy.sparse.csr:

class csr_matrix(scipy.sparse.compressed._cs_matrix, scipy.sparse.sputils.IndexMixin)
 |  csr_matrix(arg1, shape=None, dtype=None, copy=False)
 |  
 |  Compressed Sparse Row matrix
 |  
 |  This can be instantiated in several ways:
 |      csr_matrix(D)
 |          with a dense matrix or rank-2 ndarray D
 |  
 |      csr_matrix(S)
 |          with another sparse matrix S (equivalent to S.tocsr())
 |  
 |      csr_matrix((M, N), [dtype])
 |          to construct an empty matrix with shape (M, N)
 |          dtype is optional, defaulting to dtype='d'.
 |  
 |      csr_matrix((data, (row_ind, col_ind)), [shape=(M, N)])
 |          where ``data``, ``row_ind`` and ``col_ind`` satisfy the
 |          relationship ``a[row_ind[k], col_ind[k]] = data[k]``.
 |  
 |      csr_matrix((data, indices, indptr), [shape=(M, N)])
 |          is the standard CSR representation where the column indices for
 |          row i are stored in ``indices[indpt

In [65]:
row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])

csr_matrix((data, (row, col)), shape=(3, 3)).toarray()

array([[1, 0, 2],
       [0, 0, 3],
       [4, 5, 6]], dtype=int64)

In [67]:
from sklearn.preprocessing import LabelEncoder

# heavy lifting encoders
user_encoder = LabelEncoder()
item_encoder = LabelEncoder()

# preparation for the csr matrix
u = user_encoder.fit_transform(users)
i = item_encoder.fit_transform(items)
lu = len(np.unique(u))
li = len(np.unique(i))

In [68]:
interactions = csr_matrix((ratings, (u, i)), shape=(lu, li))
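
Two properties of this construction are easy to miss: repeated (user, item) pairs are summed, and the integer row and column indices only mean something if you keep the label mapping around. A small sketch on made-up data, using `np.unique(..., return_inverse=True)` as a dependency-light stand-in for `LabelEncoder` (both assign integer codes in sorted label order):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy reviews; 'ann' rated 'twix' twice
toy_users = np.array(['ann', 'bob', 'ann', 'ann'])
toy_items = np.array(['twix', 'rolo', 'rolo', 'twix'])
toy_ratings = np.array([5, 4, 3, 2])

# Sorted unique labels plus the integer code of each original entry
labels_u, u_idx = np.unique(toy_users, return_inverse=True)
labels_i, i_idx = np.unique(toy_items, return_inverse=True)

m = csr_matrix((toy_ratings, (u_idx, i_idx)),
               shape=(len(labels_u), len(labels_i)))

print(m.toarray())  # the ann/twix cell holds 5 + 2 = 7
print(labels_i[0])  # rolo: column 0 maps back to its label
```

If a summed rating is not what you want, deduplicate first, for example with `df.drop_duplicates(['user', 'item'], keep='last')`.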

### Basic LightFM 

In [69]:
model = LightFM()

In [70]:
model.fit(interactions)

<lightfm.lightfm.LightFM at 0x11ed53ef0>

In [74]:
model.predict(0, [1, 2, 3])

array([0.57377923, 0.97406924, 0.6821956 ])
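
These scores are generally used for ranking items for a given user rather than read as predicted star ratings, so the usual next step is to sort them. A minimal sketch using the three scores printed above (in a real run you would pass every item index and keep the top few):

```python
import numpy as np

item_ids = np.array([1, 2, 3])
scores = np.array([0.57377923, 0.97406924, 0.6821956])

# argsort is ascending, so negate the scores to rank best first
order = np.argsort(-scores)
ranked_ids = item_ids[order]

print(ranked_ids)  # [2 3 1]
```

`item_encoder.inverse_transform(ranked_ids)` would then turn those indices back into candy names.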