## Building Recommendation Engines in Python

### MovieLens

![](images/movielens.png)

### Quickstart

![](images/light_fm.png)

[Source](https://github.com/lyst/lightfm)

```sh
pip install lightfm
```

In [1]:
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

In [2]:
from lightfm import LightFM
from lightfm.datasets import fetch_movielens
from lightfm.evaluation import precision_at_k

data = fetch_movielens(min_rating=5.0)
model = LightFM(loss='warp')
model.fit(data['train'], epochs=30, num_threads=2)

precision_at_k(model, data['test'], k=5).mean()

0.050462354

In [3]:
data

{'train': <943x1682 sparse matrix of type '<class 'numpy.float32'>'
 	with 19048 stored elements in COOrdinate format>,
 'test': <943x1682 sparse matrix of type '<class 'numpy.int32'>'
 	with 2153 stored elements in COOrdinate format>,
 'item_features': <1682x1682 sparse matrix of type '<class 'numpy.float32'>'
 	with 1682 stored elements in Compressed Sparse Row format>,
 'item_feature_labels': array(['Toy Story (1995)', 'GoldenEye (1995)', 'Four Rooms (1995)', ...,
        'Sliding Doors (1998)', 'You So Crazy (1994)',
        'Scream of Stone (Schrei aus Stein) (1991)'], dtype=object),
 'item_labels': array(['Toy Story (1995)', 'GoldenEye (1995)', 'Four Rooms (1995)', ...,
        'Sliding Doors (1998)', 'You So Crazy (1994)',
        'Scream of Stone (Schrei aus Stein) (1991)'], dtype=object)}

In [4]:
data['train']

<943x1682 sparse matrix of type '<class 'numpy.float32'>'
	with 19048 stored elements in COOrdinate format>

### Data

![](images/candy.jpg)

![](images/influenster.png)

### Structure

In [5]:
import pandas as pd

df = pd.read_csv('data/candy.csv')

df.sample(5)

Unnamed: 0,item,user,review
15611,Reese's White Peanut Butter Eggs,victorkelly,5
2191,Skittles Sour Candy,kim99,5
7350,Airheads White Mystery,jeffrey15,5
9915,Life Savers Holiday Wint-O-Green Candy Mints,julieconway,5
12048,Nestle Toll House Semi Sweet Chocolate Morsels,travis02,5


In [6]:
df[df['user'] == 'zjohnson']

Unnamed: 0,item,user,review
2186,Skittles Sour Candy,zjohnson,5
6022,Haribo Sour Gold Bears Gummi Candy,zjohnson,5
7919,Starburst Original Fruit Chews,zjohnson,5
8382,Sour Patch Watermelon,zjohnson,5
12304,Sour Patch Kids Candy,zjohnson,4


In [7]:
df['item'].value_counts()[:5]

Twix                                       340
Snickers Chocolate Bar                     330
Werther's Original Caramel Hard Candies    322
M&Ms Peanut Chocolate Candy                310
M&Ms Milk Chocolate Candy                  273
Name: item, dtype: int64

In [8]:
df['item'].unique().shape

(142,)

In [9]:
df['user'].unique().shape

(2531,)

In [12]:
df['review'].value_counts()

5    12977
4     2554
3      967
2      372
1      364
Name: review, dtype: int64

In [13]:
df.groupby('user')['item'].count().mean()

6.809166337416041

### Sparsity

In [14]:
ex = pd.DataFrame([
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [1, 0, 0, 1, 0, 0],
    [0, 1, 1, 0, 0, 1],
    [0, 0, 0, 1, 1, 1]], 
    columns=['twix', 'mars', 'reeses', 'skittles', 'snickers', 'lindt'])

ex

Unnamed: 0,twix,mars,reeses,skittles,snickers,lindt
0,0,1,1,0,0,0
1,0,1,1,1,0,0
2,1,0,0,1,0,0
3,0,1,1,0,0,1
4,0,0,0,1,1,1


In [15]:
r, c = ex.shape
# fraction of non-zero cells (the density); sparsity is 1 minus this value
ex.sum().sum() / (r * c)

0.43333333333333335
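
The same ratio can be worked out for the candy data from the counts printed earlier. This is only a rough sketch: it assumes every row of `candy.csv` is a distinct user-item pair, so the figure is an upper bound on the density if any pair is repeated.

```python
# counts copied from the value_counts() and unique() outputs shown earlier
n_reviews = 12977 + 2554 + 967 + 372 + 364
n_users, n_items = 2531, 142

# assumption: no user reviewed the same item twice
density = n_reviews / (n_users * n_items)
sparsity = 1 - density

print(f"reviews: {n_reviews}, density: {density:.4f}, sparsity: {sparsity:.4f}")
```

Under that assumption fewer than 5% of the user-item cells are filled, far emptier than the toy example.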

In [16]:
import sys

sys.getsizeof(ex)

400

In [17]:
ex.values

array([[0, 1, 1, 0, 0, 0],
       [0, 1, 1, 1, 0, 0],
       [1, 0, 0, 1, 0, 0],
       [0, 1, 1, 0, 0, 1],
       [0, 0, 0, 1, 1, 1]])

In [18]:
from scipy.sparse import csc_matrix

sx = csc_matrix(ex.values)

In [19]:
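# caveat: getsizeof is shallow, so the arrays held by the sparse matrix are likely not counted here
sys.getsizeof(sx)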

64
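
`sys.getsizeof` reports only the shallow size of a Python object, so for a SciPy sparse matrix it most likely leaves out the underlying `data`, `indices` and `indptr` arrays. A fairer comparison, sketched here on the toy matrix, adds up the byte sizes of the actual buffers. The helper name `sparse_nbytes` is made up for illustration.

```python
import numpy as np
from scipy.sparse import csc_matrix

toy = np.array([
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [1, 0, 0, 1, 0, 0],
    [0, 1, 1, 0, 0, 1],
    [0, 0, 0, 1, 1, 1]], dtype=np.int64)
toy_sparse = csc_matrix(toy)

def sparse_nbytes(m):
    # hypothetical helper: bytes used by the three arrays behind a CSC/CSR matrix
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes

dense_bytes = toy.nbytes              # 30 cells * 8 bytes each
sparse_bytes = sparse_nbytes(toy_sparse)

print(dense_bytes, sparse_bytes, toy_sparse.nnz)
```

On a matrix this small and this dense the saving is modest; it grows as the share of zero cells grows.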

### Sparse Candy

In [None]:
df.sample(5)

In [None]:
import numpy as np

In [None]:
# pull each column out as a plain numpy array
ratings = np.array(df['review'])
users = np.array(df['user'])
items = np.array(df['item'])

In [None]:
from scipy.sparse import csr_matrix

help(csr_matrix)

In [None]:
row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])

csr_matrix((data, (row, col)), shape=(3, 3)).toarray()

In [None]:
from sklearn.preprocessing import LabelEncoder

# heavy lifting encoders
user_encoder = LabelEncoder()
item_encoder = LabelEncoder()

# preparation for the csr matrix
u = user_encoder.fit_transform(users)
i = item_encoder.fit_transform(items)
lu = len(np.unique(u))
li = len(np.unique(i))
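
For intuition, `LabelEncoder` maps each distinct label to an integer in sorted order. The same mapping can be sketched with `np.unique`; the usernames below are just a toy sample.

```python
import numpy as np

toy_users = np.array(['kim99', 'zjohnson', 'kim99', 'aaron67'])

# classes: sorted distinct labels; codes: index of each label within classes
classes, codes = np.unique(toy_users, return_inverse=True)

# decoding is plain indexing, the counterpart of inverse_transform
decoded = classes[codes]

print(classes, codes, decoded)
```

The codes are contiguous, starting at zero, which is what makes them usable as row and column indices of the sparse matrix.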

In [None]:
interactions = csr_matrix((ratings, (u, i)), shape=(lu, li))

In [None]:
interactions
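
One thing worth checking before trusting this matrix: when `csr_matrix` is built from `(data, (row, col))` triplets, repeated `(row, col)` pairs are summed. Whether `candy.csv` contains repeated user-item pairs is not known here, so the data below is a made-up toy case. It shows the summing and one possible fix (averaging duplicates first).

```python
import numpy as np
from scipy.sparse import csr_matrix

# hypothetical toy data: user 0 rated item 1 twice (4 and 5)
toy_u = np.array([0, 0, 1])
toy_i = np.array([1, 1, 0])
toy_r = np.array([4, 5, 3])
n_u, n_i = 2, 2

summed = csr_matrix((toy_r, (toy_u, toy_i)), shape=(n_u, n_i)).toarray()

# average duplicates: give each (user, item) pair a single integer key
key = toy_u * n_i + toy_i
uniq, inverse = np.unique(key, return_inverse=True)
means = np.bincount(inverse, weights=toy_r) / np.bincount(inverse)
deduped = csr_matrix((means, (uniq // n_i, uniq % n_i)), shape=(n_u, n_i)).toarray()

print(summed)
print(deduped)
```

In `summed`, the repeated pair shows up as a rating of 9, which is off the 1 to 5 scale; in `deduped` it is 4.5.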

### Interaction Machine

In [None]:
class InteractionMachine:
    """Encode raw user/item labels and build a sparse user-by-item ratings matrix."""

    def __init__(self):
        self.user_encoder = LabelEncoder()
        self.item_encoder = LabelEncoder()

    def __repr__(self):
        return 'InteractionMachine()'

    def build(self, users, items, ratings):
        # map labels to contiguous integer ids (row and column indices)
        u = self.user_encoder.fit_transform(users)
        i = self.item_encoder.fit_transform(items)
        self.n_users = len(np.unique(u))
        self.n_items = len(np.unique(i))
        # rows are users, columns are items, values are ratings
        self.interactions = csr_matrix((ratings, (u, i)), shape=(self.n_users, self.n_items))
        return self

In [None]:
im = InteractionMachine()

im.build(df['user'], df['item'], df['review'])

interactions = im.interactions

In [None]:
interactions.toarray()

### Basic LightFM 

In [None]:
model = LightFM()

In [None]:
model.fit(interactions)

In [None]:
model.predict(interactions) # not exactly sklearn: this fails, predict() wants user ids and item ids, not a matrix

In [None]:
model.predict(0, [1, 2, 3])

### Evaluation

In [None]:
from lightfm.evaluation import auc_score, precision_at_k

> AUC measures the quality of the overall ranking. In the binary case, it can be interpreted as the probability that a randomly chosen positive item is ranked higher than a randomly chosen negative item. Consequently, an AUC close to 1.0 will suggest that, by and large, your ordering is correct: and this can be true even if none of the first K items are positives. This metric may be more appropriate if you do not exert full control on which results will be presented to the user; it may be that the first K recommended items are not available any more (say, they are out of stock), and you need to move further down the ranking. A high AUC score will then give you confidence that your ranking is of high quality throughout.

[Source](https://stackoverflow.com/questions/45451161/evaluating-the-lightfm-recommendation-model)
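
To make the AUC definition concrete, here is a minimal sketch of the binary, single-user case: the share of (positive, negative) item pairs in which the positive item gets the higher score. It is not LightFM's implementation; the function name and the toy scores are invented for illustration.

```python
import numpy as np

def auc_for_user(scores, positives):
    # hypothetical helper: scores is one value per item, positives is a set of item indices
    scores = np.asarray(scores, dtype=float)
    pos_mask = np.zeros(len(scores), dtype=bool)
    pos_mask[list(positives)] = True
    pos, neg = scores[pos_mask], scores[~pos_mask]
    # compare every positive with every negative; ties count as half a win
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

toy_scores = [0.9, 0.1, 0.8, 0.4, 0.3]   # made-up model scores for 5 items
toy_positives = {0, 3}                   # items the user actually liked

auc = auc_for_user(toy_scores, toy_positives)
print(auc)
```

Here 5 of the 6 positive-negative pairs are ordered correctly, so the AUC is 5/6.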

> Precision@K measures the proportion of positive items among the K highest-ranked items. As such, it's very focused on the ranking quality at the top of the list: it doesn't matter how good or bad the rest of your ranking is as long as the first K items are mostly positive. This would be an appropriate metric if you are only ever going to be showing your users the very top of the list.

[Source](https://stackoverflow.com/questions/45451161/evaluating-the-lightfm-recommendation-model)
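
Precision@K for a single user can be sketched the same way: rank the items by score, keep the top K, and count how many are positives. Again this is an illustration with invented names and toy numbers, not LightFM's own code.

```python
import numpy as np

def precision_at_k_for_user(scores, positives, k):
    # hypothetical helper: indices of the k highest-scoring items
    top_k = np.argsort(-np.asarray(scores, dtype=float))[:k]
    hits = sum(1 for item in top_k if int(item) in positives)
    return hits / k

p_scores = [0.9, 0.1, 0.8, 0.4, 0.3]   # made-up model scores for 5 items
p_positives = {0, 3}                   # items the user actually liked

p_at_2 = precision_at_k_for_user(p_scores, p_positives, k=2)
p_at_3 = precision_at_k_for_user(p_scores, p_positives, k=3)
print(p_at_2, p_at_3)
```

Note that a user with only two positives can never score above 2/K, so low absolute values are common when each user has few held-out interactions.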

In [None]:
auc_score(model, interactions).mean()

In [None]:
precision_at_k(model, interactions, k=10).mean()

### train-test-split

In [None]:
# don't do this...
from sklearn.model_selection import train_test_split

In [None]:
train_test_split(interactions)
# because...

### Traditional 

![](images/tts_traditional.png)

### Recommendation

![](images/tts_reco.png)

In [None]:
# do this
from lightfm.cross_validation import random_train_test_split

In [None]:
train, test = random_train_test_split(interactions, test_percentage=0.2)

In [None]:
train

In [None]:
test
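
The difference from a row-wise split is that individual interactions, not whole users, are held out, so `train` and `test` keep the full users-by-items shape. A rough sketch of that idea with SciPy follows; it is an assumption about the general approach, not LightFM's actual implementation, and `split_interactions` is a made-up name.

```python
import numpy as np
from scipy.sparse import coo_matrix

def split_interactions(interactions, test_percentage=0.2, seed=0):
    # hypothetical helper: shuffle the stored entries and cut them into two matrices
    coo = coo_matrix(interactions)
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(coo.nnz)
    cutoff = int((1.0 - test_percentage) * coo.nnz)

    def subset(idx):
        # same shape as the input, but only the selected entries
        return coo_matrix((coo.data[idx], (coo.row[idx], coo.col[idx])), shape=coo.shape)

    return subset(shuffled[:cutoff]), subset(shuffled[cutoff:])

toy = coo_matrix(np.array([
    [5, 0, 4, 0],
    [0, 3, 0, 5],
    [4, 0, 0, 5]]))

train_toy, test_toy = split_interactions(toy, test_percentage=0.25)
print(train_toy.shape, train_toy.nnz, test_toy.shape, test_toy.nnz)
```

Every user still has a row in both matrices; only the filled cells are divided between them.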

### Training Cycle

In [None]:
model = LightFM()
model.fit(train, epochs=500)

In [None]:
auc_score(model, test).mean()

In [None]:
model = LightFM()

scores = []
for e in range(100):
    model.fit_partial(train, epochs=1)
    auc_train = auc_score(model, train).mean()
    auc_test = auc_score(model, test).mean()
    scores.append((auc_train, auc_test))
    
scores = np.array(scores)

In [None]:
from matplotlib import pyplot as plt

%matplotlib inline

plt.plot(scores[:, 0], label='train')
plt.plot(scores[:, 1], label='test')
plt.legend()

### Loss

> WARP: Weighted Approximate-Rank Pairwise loss. Maximises
  the rank of positive examples by repeatedly sampling negative
  examples until rank violating one is found. Useful when only
  positive interactions are present and optimising the top of
  the recommendation list (precision@k) is desired.

In [None]:
model = LightFM(loss='warp')

scores = []
for e in range(100):
    model.fit_partial(train, epochs=1)
    auc_train = auc_score(model, train).mean()
    auc_test = auc_score(model, test).mean()
    scores.append((auc_train, auc_test))
    
scores = np.array(scores)

In [None]:
from matplotlib import pyplot as plt

plt.plot(scores[:, 0], label='train')
plt.plot(scores[:, 1], label='test')
plt.legend()

### Activity

Take 5 minutes to explore different epoch and loss combinations.

### Early Stopping

In [None]:
from copy import deepcopy

model = LightFM(loss='warp')

patience = 5
count = 0  # consecutive epochs without a new best test AUC
best = 0
best_model = None
scores = []
for e in range(100):
    if count > patience:
        break
    model.fit_partial(train, epochs=1)
    auc_train = auc_score(model, train).mean()
    auc_test = auc_score(model, test).mean()
    print(f'Epoch: {e}, Train AUC={auc_train:.3f}, Test AUC={auc_test:.3f}')
    scores.append((auc_train, auc_test))
    if auc_test > best:
        # keep a snapshot of the best model so far and reset the counter
        best_model = deepcopy(model)
        best = auc_test
        count = 0
    else:
        count += 1

model = deepcopy(best_model)
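
The patience logic can be looked at in isolation, without a model. The sketch below runs it over a plain list of validation scores and treats patience as the number of consecutive epochs without improvement, which is one common convention; the function name and the scores are invented.

```python
def early_stop(val_scores, patience=5):
    # hypothetical helper: returns (epoch of best score, epoch where training stopped)
    best, best_epoch, since_best, last_epoch = float("-inf"), -1, 0, -1
    for epoch, score in enumerate(val_scores):
        last_epoch = epoch
        if score > best:
            best, best_epoch, since_best = score, epoch, 0
        else:
            since_best += 1
            if since_best >= patience:
                break
    return best_epoch, last_epoch

toy_val = [0.70, 0.75, 0.80, 0.79, 0.78, 0.81, 0.80, 0.80, 0.79]

impatient = early_stop(toy_val, patience=2)
patient = early_stop(toy_val, patience=3)
print(impatient, patient)
```

With a patience of 2 the loop stops before the later, better score at epoch 5 is ever seen; with 3 it finds it. That is the trade-off the patience setting controls.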

### New Predictions

In [None]:
user = 'aaron67'
df[df['user'] == user]

In [None]:
im.user_encoder.transform([user])[0]

In [None]:
user_id = im.user_encoder.transform([user])[0]

In [None]:
preds = model.predict(user_id, list(range(im.n_items)))
preds = pd.DataFrame(zip(preds, im.item_encoder.classes_), columns=['pred', 'item'])
preds = preds.sort_values('pred', ascending=False)
preds.head()

In [None]:
tried = df[df['user'] == user]['item'].values
list(preds[~preds['item'].isin(tried)]['item'].values[:5])
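
The same "rank, then drop what the user has already tried" step can be written as a small reusable function. This is a sketch with invented names and toy scores, independent of LightFM and pandas.

```python
import numpy as np

def top_n_unseen(scores, labels, seen, n=5):
    # hypothetical helper: item indices sorted from highest to lowest score
    order = np.argsort(-np.asarray(scores, dtype=float))
    ranked = [labels[i] for i in order if labels[i] not in seen]
    return ranked[:n]

toy_item_scores = [0.2, 0.9, 0.5, 0.7]                 # made-up predictions
toy_labels = ['Twix', 'Mars', 'Skittles', 'Snickers']
already_tried = {'Mars'}

recs = top_n_unseen(toy_item_scores, toy_labels, already_tried, n=2)
print(recs)
```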

### New Users

![](images/willy.jpg)

### Unless...

In [None]:
ex = pd.DataFrame([
    [0, 1, 1, 0, 0, 0], 
    [0, 1, 1, 1, 0, 0],
    [1, 0, 0, 1, 0, 0],
    [0, 1, 1, 0, 0, 1],
    [0, 0, 0, 1, 1, 1]
])

In [None]:
from sklearn.metrics.pairwise import euclidean_distances

euclidean_distances(ex)
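
Those pairwise distances are what make a brand-new user tractable: describe them by the candy they already like, find the closest known user, and borrow that user's other items. A minimal numpy sketch of the idea, with a made-up new user and the column names from the earlier toy frame:

```python
import numpy as np

candies = ['twix', 'mars', 'reeses', 'skittles', 'snickers', 'lindt']
known = np.array([
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [1, 0, 0, 1, 0, 0],
    [0, 1, 1, 0, 0, 1],
    [0, 0, 0, 1, 1, 1]])

new_user = np.array([0, 1, 0, 0, 0, 0])   # hypothetical: only likes mars so far

# Euclidean distance from the new user to every known user
distances = np.linalg.norm(known - new_user, axis=1)
nearest = int(np.argmin(distances))

# suggest what the nearest user has that the new user does not
suggestions = [c for c, theirs, mine in zip(candies, known[nearest], new_user)
               if theirs and not mine]
print(nearest, suggestions)
```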

In [None]:
df = pd.read_csv("data/candy.csv")
df = df[df['review'] >= 4]

In [None]:
df.sample(5)

In [None]:
df = df.groupby(["user"])["item"].apply(lambda x: ",".join(x))
df = pd.DataFrame(df)
df.head()

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(tokenizer=lambda x: x.split(","), max_features=250)
X = cv.fit_transform(df['item'])
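
The trick here is that a comma-splitting tokenizer turns each user's joined string into one "word" per candy, so `CountVectorizer` produces a users-by-items count matrix. A small sketch on two made-up baskets shows the result. Note that `CountVectorizer` lowercases by default, so the vocabulary keys are lowercase.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_baskets = ["Twix,Mars", "Mars,Skittles Sour Candy"]   # made-up users

toy_cv = CountVectorizer(tokenizer=lambda x: x.split(","))
toy_X = toy_cv.fit_transform(toy_baskets)

# vocabulary_ maps each (lowercased) candy name to its column index
toy_vocab = dict(toy_cv.vocabulary_)
print(toy_vocab)
print(toy_X.toarray())
```

Multi-word names such as "Skittles Sour Candy" stay intact as a single feature because the split happens only on commas.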

In [None]:
from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(n_neighbors=5)
nn.fit(X)

In [None]:
neighbors = nn.kneighbors(X, return_distance=False)
neighbors

In [None]:
neighbors[0]

In [None]:
candy = []
for n in neighbors[0]:
    c = df.iloc[int(n)].values[0].split(",")
    candy.extend(c)
    
list(set(candy))

### Putting a bow on it

In [None]:
df = pd.read_csv("data/candy.csv")
df = df[df['review'] >= 4]
df = df.groupby(["user"])["item"].apply(lambda x: ",".join(x))
df = pd.DataFrame(df)
df.head()

In [None]:
class NNRecommender:
    def __init__(
        self, n_neighbors=5, max_features=250, tokenizer=lambda x: x.split(",")):
        self.cv = CountVectorizer(tokenizer=tokenizer, max_features=max_features)
        self.nn = NearestNeighbors(n_neighbors=n_neighbors)

    def fit(self, X):
        self.X = X
        X = self.cv.fit_transform(X)
        self.nn.fit(X)
        return self

    def predict(self, X):
        Xp = []
        for Xi in X:
            Xt = self.cv.transform([Xi])
            neighbors = self.nn.kneighbors(Xt, return_distance=False)
            # pool everything the nearest neighbours liked
            candy = []
            for n in neighbors[0]:
                candy.extend(self.X.iloc[int(n)].split(","))
            # drop duplicates and anything this user already has
            tried = set(Xi.split(","))
            Xp.append([c for c in set(candy) if c not in tried])
        return Xp

In [None]:
n_neighbors = 5
max_features = 250
model = NNRecommender(n_neighbors, max_features)
model.fit(df["item"])

In [None]:
df.sample(1)['item'].values

In [None]:
sweet = ["Airheads Xtremes Sweetly Sour Candy Rainbow Berry,Life Savers Five Flavor Gummies,Twizzlers Pull-N-Peel Candy Cherry"]

In [None]:
peanut = ["Reese's Peanut Butter Cups Miniatures,M&Ms Peanut Chocolate Candy,Reese's Peanut Butter Big Cup"]

In [None]:
im.item_encoder.classes_

In [None]:
model.predict(sweet)

In [None]:
model.predict(peanut)

![](images/the_end.jpg)

### Links

**Max Humber** 

- [Twitter](https://twitter.com/maxhumber)
- [LinkedIn](https://www.linkedin.com/in/maxhumber/)
- [GitHub](https://github.com/maxhumber)

**Open Source**

- [marc](https://github.com/maxhumber/marc) - (**mar**kov **c**hain) is a small but flexible Markov chain generator.
- [gazpacho](https://github.com/maxhumber/gazpacho) - is a web scraping library. You should use it!
- [mummify](https://github.com/maxhumber/mummify) - makes model prototyping faster.
- [chart](https://github.com/maxhumber/chart) - a zero-dependency Python package that prints basic charts to a Jupyter output.

- [recommend](https://github.com/maxhumber/recommend) - basically this presentation (super beta right now)

#### Upcoming

### Appendix

For when your data looks like this...

In [None]:
df = pd.read_csv('data/candy.csv')
df = df[df['user'].isin(df['user'].sample(10))]
df = df.pivot(index='item', columns='user', values='review')
df = df.reset_index()
df.head(5)

Do this...

In [None]:
df = df.melt(id_vars='item', var_name='user', value_name='review')
df = df.dropna().reset_index(drop=True)

df.head(5)

### Parting Thoughts

![](images/savage.png)

[Source](https://news.ycombinator.com/item?id=20495047)