# Product Recommender using Collaborative Filtering and LanceDB

We will use **LanceDB** and **Collaborative Filtering** to recommend products based on a user's past purchases. This example uses the <a href="https://www.kaggle.com/datasets/yasserh/instacart-online-grocery-basket-analysis-dataset">**Instacart dataset**</a>.



## Get dataset
Download and unzip the dataset from the LanceDB S3 bucket.

In [None]:
!wget http://vectordb-recipes.s3.us-west-2.amazonaws.com/product-recommender.zip
!unzip product-recommender.zip

Install dependencies:

In [None]:
!pip install numpy pandas scipy kaggle implicit torch lancedb

First, let's import all the required modules for this example.

In [None]:
import zipfile
import numpy as np
import pandas as pd
import scipy.sparse
import torch
import implicit
from implicit import evaluation
import pydantic
import lancedb
from lancedb.pydantic import pydantic_to_schema, vector

We must now extract the zip files.

In [None]:
files = [
    'instacart-market-basket-analysis.zip',
    'order_products__train.csv.zip',
    'order_products__prior.csv.zip',
    'products.csv.zip',
    'orders.csv.zip'
]

for filename in files:
    with zipfile.ZipFile(filename, 'r') as zip_ref:
        zip_ref.extractall('./')

Now we can move on to loading the dataset. We'll first read the csv files and create dataframes.

In [None]:
products = pd.read_csv('products.csv')
orders = pd.read_csv('orders.csv')
order_products = pd.concat([pd.read_csv('order_products__train.csv'), pd.read_csv('order_products__prior.csv')])

Since the dataset has no user rating attribute, we'll use purchase frequency as a "confidence" signal: we count how many times each user ordered each item and store the result in the `data` dataframe.

In [None]:
customer_order_products = pd.merge(orders, order_products, how='inner',on='order_id')

# create confidence table
data = customer_order_products.groupby(['user_id', 'product_id'])[['order_id']].count().reset_index()
data.columns=["user_id", "product_id", "total_orders"]
data.product_id = data.product_id.astype('int64')
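
To make the counting step concrete, here is a minimal sketch on a made-up frame. The names `toy_orders` and `toy_confidence` and all values are hypothetical; only the `groupby`/`count` pattern mirrors the cell above.

```python
import pandas as pd

# Toy stand-in for the merged orders/order_products frame (values are made up).
toy_orders = pd.DataFrame({
    "user_id":    [1, 1, 1, 2, 2],
    "order_id":   [10, 11, 12, 20, 21],
    "product_id": [196, 196, 46149, 196, 27845],
})

# Count how many order lines each (user, product) pair appears in.
toy_confidence = (
    toy_orders.groupby(["user_id", "product_id"])[["order_id"]]
    .count()
    .reset_index()
)
toy_confidence.columns = ["user_id", "product_id", "total_orders"]
print(toy_confidence)
```

User 1 ordered product 196 twice, so that pair gets a count of 2, while every other pair gets 1.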

Let's create a couple of test users to examine the recommendations later:
- 1st test user: a soda buyer who ordered **Zero Calorie Cola** 50 times
- 2nd test user: an organic produce buyer who ordered **Organic Whole Milk** 49 times and **Organic Blackberries** 32 times

In [None]:
data_new = pd.DataFrame([[data.user_id.max() + 1, 46149, 50],
                         [data.user_id.max() + 2, 27845, 49],
                         [data.user_id.max() + 2, 26604, 32]
                        ], columns=['user_id', 'product_id', 'total_orders'])
data = pd.concat([data, data_new]).reset_index(drop = True)
data.tail()

Unnamed: 0,user_id,product_id,total_orders
13863744,206209,48697,1
13863745,206209,48742,2
13863746,206210,46149,50
13863747,206211,27845,49
13863748,206211,26604,32


In the next step, we will extract user and product unique ids, in order to create a CSR (Compressed Sparse Row) matrix. This will allow us to perform collaborative filtering.


In [None]:
# extract unique user and product ids
unique_users = list(np.sort(data.user_id.unique()))
unique_products = list(np.sort(products.product_id.unique()))
purchases = list(data.total_orders)

# map zero-based positions to user IDs
index_to_user = pd.Series(unique_users)

# reverse mapping from user ID to matrix row: position + 1, because row 0 is unused
# (this assumes user IDs run contiguously from 1, so that row == user_id)
user_to_index = pd.Series(data=index_to_user.index + 1, index=index_to_user.values)

# create row and column for user and product ids
users_rows = data.user_id.astype(int)
products_cols = data.product_id.astype(int)

# create CSR matrix
matrix = scipy.sparse.csr_matrix((purchases, (users_rows, products_cols)), shape=(len(unique_users) + 1, len(unique_products) + 1))
matrix.data = np.nan_to_num(matrix.data, copy=False)
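
As a small illustration of the same construction, the sketch below builds a CSR matrix from a few made-up triples. All `toy_*` names and values are hypothetical, and the shape here is derived from the maximum ID rather than the number of unique IDs, which is equivalent only when IDs are contiguous from 1.

```python
import numpy as np
import scipy.sparse

# Toy (user_id, product_id, total_orders) triples; IDs start at 1 as in the dataset.
toy_users = np.array([1, 1, 2, 3])
toy_products = np.array([2, 4, 2, 1])
toy_counts = np.array([5, 1, 3, 2])

# Using raw IDs as coordinates leaves row 0 and column 0 unused, hence the "+ 1".
toy_matrix = scipy.sparse.csr_matrix(
    (toy_counts, (toy_users, toy_products)),
    shape=(toy_users.max() + 1, toy_products.max() + 1),
)

print(toy_matrix.toarray())
```

Each row is a user and each column a product, so `toy_matrix[1, 2]` holds the number of times user 1 ordered product 2.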

Let's now create a recommender model using the **implicit** library. The recommendation model is based on the algorithms described in the paper [Collaborative Filtering for Implicit Feedback Datasets](https://www.researchgate.net/publication/220765111_Collaborative_Filtering_for_Implicit_Feedback_Datasets), with performance optimizations described in [Applications of the Conjugate Gradient Method for Implicit Feedback Collaborative Filtering](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.379.6473&rep=rep1&type=pdf).


In [None]:
#split data into train and test splits
train, test = evaluation.train_test_split(matrix, train_percentage=0.9)

# initialize the recommender model
model = implicit.als.AlternatingLeastSquares(factors=128,
                                             regularization=0.05,
                                             iterations=50,
                                             num_threads=1)

alpha = 15
train = (train * alpha).astype('double')

# train the model on CSR matrix
model.fit(train, show_progress = True)
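
For reference, the objective in the cited implicit-feedback paper can be summarized as follows. Here $r_{ui}$ is the purchase count, $p_{ui}$ the binarized preference, $c_{ui}$ the confidence, $x_u$ and $y_i$ the user and item factors, and $\lambda$ the regularization weight:

$$
p_{ui} = \begin{cases} 1 & r_{ui} > 0 \\ 0 & r_{ui} = 0 \end{cases},
\qquad
c_{ui} = 1 + \alpha \, r_{ui}
$$

$$
\min_{x_*,\, y_*} \; \sum_{u,i} c_{ui} \left( p_{ui} - x_u^{\top} y_i \right)^2
+ \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)
$$

In the cell above, `alpha = 15` plays the role of $\alpha$, `regularization=0.05` corresponds to $\lambda$, and `factors=128` is the dimensionality of $x_u$ and $y_i$. The notebook scales the count matrix by `alpha` before fitting; how the library turns those values into $c_{ui}$ internally is an implementation detail not covered here.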

## Let's now evaluate the model.

In [None]:
test = (test * alpha).astype('double')
evaluation.ranking_metrics_at_k(model, train, test, K=100,
                         show_progress=True, num_threads=1)
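
To give a feel for what a ranking metric at K measures, here is a minimal sketch of the textbook precision-at-K. The function `precision_at_k` and the sample IDs are hypothetical, and the library's own definitions may differ in detail.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that appear in the relevant set."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k


# Made-up ranked recommendations and held-out purchases for one user.
toy_recommended = [46149, 196, 40939, 41400]
toy_held_out = {196, 41400, 22802}

print(precision_at_k(toy_recommended, toy_held_out, k=4))
```

Two of the four recommended items appear in the held-out set, so the score for this user is 0.5.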

From the trained model we can retrieve the item and user factors, which we will later store in LanceDB as vector embeddings.
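
In a matrix factorization model, the predicted preference of a user for an item is the dot product of their factors. The sketch below shows this with hypothetical 3-dimensional vectors; the `toy_*` names and values are made up, and the real factors have 128 dimensions.

```python
import numpy as np

# Hypothetical 3-dimensional factors (the notebook's model uses 128 dimensions).
toy_user_factor = np.array([1.0, 0.0, 2.0])
toy_item_factors = np.array([
    [0.5, 0.1, 0.9],   # item 0
    [0.0, 1.0, 0.0],   # item 1
    [1.0, 0.0, 0.2],   # item 2
])

# Predicted preference for each item is the dot product with the user factor.
toy_scores = toy_item_factors @ toy_user_factor

# Item indices ordered from highest to lowest score.
toy_ranking = np.argsort(-toy_scores)
print(toy_scores, toy_ranking)
```

Item 0 scores highest for this user, followed by item 2 and then item 1.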

In [None]:
model.item_factors[1:3]

Matrix([[-4.78145870e-04  1.20844017e-03 -1.05093475e-02  4.69897687e-03
  -3.42543889e-03 -2.47619092e-03 -1.37619404e-02  7.47181184e-04
   1.07308161e-02 -2.85757496e-03 -3.43888951e-03  2.04937998e-02
   1.51449145e-04 -2.15489650e-03  4.52879071e-03  2.36469251e-03
  -4.35322057e-03  1.33156916e-02  1.48590095e-02  4.69916826e-03
   6.51248451e-03  4.22086829e-04 -8.89686961e-03 -1.12102665e-02
  -1.30706644e-02  2.08967202e-03  1.39534501e-02 -5.01955580e-03
   9.95233562e-03  1.66954547e-02 -2.05567423e-02  2.21748278e-03
  -1.06390044e-02  1.60855800e-02 -5.87939285e-03  2.46607186e-03
  -4.01218655e-03 -6.49328623e-03  6.99202390e-03  1.05327908e-02
   6.51289755e-03 -9.16731264e-03 -4.96822828e-03  8.40877462e-03
  -2.60996539e-03  5.20697143e-03 -4.72197018e-04 -6.58531254e-03
  -1.40383225e-02 -3.83673515e-03 -1.17233172e-02  8.79578851e-03
   3.15940916e-03 -1.96590065e-03  3.96021921e-03  7.77690002e-05
  -1.72196236e-03 -9.86298453e-03 -1.31952651e-02 -3.54522630e-03
  -

In [None]:
model.user_factors[1:3]

Matrix([[ 1.76306021e+00  5.84278941e-01 -1.31811535e+00  2.58672982e-02
  -1.47269890e-01 -1.99104631e+00  2.27232277e-01 -1.44048560e+00
   6.22047544e-01 -6.87879205e-01 -4.23879363e-02 -1.50391304e+00
  -1.04289226e-01 -9.02593374e-01  6.43032670e-01 -3.58335793e-01
  -6.54706135e-02  8.50856245e-01  7.79175341e-01 -4.51985866e-01
   8.55366886e-01  1.36438921e-01 -5.49016356e-01  4.83980298e-01
  -1.40851259e-01 -5.33492684e-01  5.68639338e-01  1.45152867e-01
   1.76261580e+00  5.22969246e-01 -2.21816874e+00  1.65968144e+00
  -8.83751035e-01  7.76956260e-01  1.25151992e+00 -3.25308472e-01
  -1.49347281e+00 -1.01729310e+00 -4.59959418e-01  7.20718205e-01
  -7.15589583e-01  6.45604208e-02 -8.51610005e-01  3.01664054e-01
  -4.82483774e-01 -1.79249153e-01 -1.40011147e-01 -4.23951089e-01
  -1.13460481e+00  1.13597369e+00 -7.75141537e-01  1.12328935e+00
   8.09678361e-02  3.74672678e-03 -5.49308121e-01  3.00086081e-01
  -3.05186462e+00  8.97184253e-01  3.36005628e-01 -9.53863025e-01
  -

## Let's save the data and create an empty LanceDB Table using a Pydantic model.
A Table is designed to store large numbers of columns and huge quantities of data. For those interested, LanceDB is columnar and stores its data in Lance, an open data format.

In [None]:
# replace the following line with the connection code copied from the dashboard,
# or substitute your own project name, API key, and region
db = lancedb.connect("db://your-project-name", api_key="sk_...", region="us-east-1")


In [None]:
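class ProductModel(pydantic.BaseModel):
    product_id: int
    product_name: str
    vector: vector(128)

# derive the table schema from the Pydantic model
schema = pydantic_to_schema(ProductModel)
table_name = 'product_recommender'
try:
    # create an empty table from the schema; rows are added in the next step
    tbl = db.create_table(table_name, schema=schema)
except Exception:
    # fall back to the existing table if it was already created
    tbl = db.open_table(table_name)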

Let's now store our item factors into the table via the vector column of `product_entries`.

In [None]:
# Transform items into factors
items_factors = model.item_factors
product_entries = products[['product_id', 'product_name']].drop_duplicates()
product_entries['product_id'] = product_entries.product_id.astype('int64')
device = "cuda" if torch.cuda.is_available() else "cpu"
item_embeddings = items_factors[1:].to_numpy().tolist() if device == "cuda" else items_factors[1:].tolist()
product_entries['vector'] = item_embeddings

tbl.add(product_entries)

## Let's create an ANN index in order to speed up retrieval. This might take a while.

In [None]:
tbl.create_index(vector_column_name="vector")

This helper function is used later to analyse the recommendations.
It returns the top N products a user bought in the past, ranked by the number of orders.

In [None]:
def products_bought_by_user_in_the_past(user_id: int, top: int = 10):

    selected = data[data.user_id == user_id].sort_values(by=['total_orders'], ascending=False)

    selected['product_name'] = selected['product_id'].map(product_entries.set_index('product_id')['product_name'])
    selected = selected[['product_id', 'product_name', 'total_orders']].reset_index(drop=True)
    if selected.shape[0] < top:
        return selected

    return selected[:top]

Let's retrieve our test users so we can query for recommendations.

In [None]:
test_user_ids = [206210, 206211]
test_user_factors = model.user_factors[user_to_index[test_user_ids]]

## Let's now query LanceDB to retrieve recommendations.

In [None]:
# Query by user factors
test_user_embeddings = test_user_factors.to_numpy().tolist() if device == "cuda" else test_user_factors.tolist()
for embedding, user_id in zip(test_user_embeddings, test_user_ids):
    results = tbl.search(embedding).limit(10).to_pandas()
    display(results)
    display(products_bought_by_user_in_the_past(user_id, top=15))

Unnamed: 0,product_id,product_name,vector,_distance
0,46149,Zero Calorie Cola,"[0.0022252752, 0.006192103, -0.030976068, -0.0...",42.362167
1,196,Soda,"[-0.026970126, -0.018141357, -0.058909897, -0....",42.566917
2,40939,Drinking Water,"[-0.0022220984, 0.0038040192, -0.014334261, -0...",42.823524
3,41400,Crunchy Oats 'n Honey Granola Bars,"[0.00020495932, 0.01123721, -0.017992454, 0.00...",42.825871
4,37710,Trail Mix,"[-0.00409454, 0.013049758, -0.015458386, 0.002...",42.887043
5,46061,Popcorn,"[-0.0024577982, 0.0062640505, 0.0007137757, -0...",42.894321
6,38928,0% Greek Strained Yogurt,"[0.0039202925, -0.0039743707, -0.012298337, 0....",42.912727
7,31651,Extra Fancy Unsalted Mixed Nuts,"[0.0035735483, -0.006829423, -0.009457169, 0.0...",42.922546
8,22802,Mineral Water,"[0.02247884, 0.003889028, -0.020661984, -0.031...",42.935951
9,39657,Milk Chocolate Almonds,"[0.00749532, 0.00577313, -0.016842585, -0.0015...",42.946739


Unnamed: 0,product_id,product_name,total_orders
0,46149,Zero Calorie Cola,50


Unnamed: 0,product_id,product_name,vector,_distance
0,26604,Organic Blackberries,"[-0.006724186, 0.025339324, 0.026328607, -0.00...",19.517328
1,27966,Organic Raspberries,"[-0.008532436, 0.012350272, 0.00061730895, 0.0...",19.688234
2,9076,Blueberries,"[-0.02710966, 0.04093987, 0.051150266, -0.0465...",19.795025
3,43352,Raspberries,"[-0.00842552, 0.01970873, 0.043075223, -0.0089...",19.805353
4,39275,Organic Blueberries,"[-0.01799259, 0.0049827565, 0.0029076852, 0.02...",19.961214
5,27845,Organic Whole Milk,"[0.0005443055, -0.013880691, 0.008969757, -0.0...",19.976284
6,21288,Blackberries,"[-0.007392233, -0.01224536, 0.03930769, 0.0020...",19.990463
7,11777,Red Raspberries,"[-0.011827968, 0.02923465, 0.006089752, -0.033...",20.038776
8,21137,Organic Strawberries,"[-0.018719932, 0.004096488, -0.016034253, 0.02...",20.056273
9,47209,Organic Hass Avocado,"[0.016230278, 0.0025620027, -0.0056362785, 0.0...",20.078579


Unnamed: 0,product_id,product_name,total_orders
0,27845,Organic Whole Milk,49
1,26604,Organic Blackberries,32
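
The `_distance` column in the results above ranks products by how close their vectors are to the user's vector. As a rough mental model, the sketch below does an exact nearest-neighbour search with NumPy on hypothetical 2-dimensional vectors. It assumes a squared Euclidean distance, which is an assumption here because the notebook does not set a metric explicitly, and an ANN index returns approximate rather than exact neighbours.

```python
import numpy as np

# Hypothetical 2-dimensional item vectors and a query (user) vector.
toy_items = np.array([
    [0.0, 0.0],
    [1.0, 1.0],
    [3.0, 4.0],
    [1.0, 0.0],
])
toy_query = np.array([1.0, 0.75])

# Squared Euclidean distance from the query to every item.
toy_distances = ((toy_items - toy_query) ** 2).sum(axis=1)

# Indices of the two closest items, nearest first.
toy_nearest = np.argsort(toy_distances)[:2]
print(toy_nearest, toy_distances[toy_nearest])
```

Items 1 and 3 are the closest to the query, in that order, which is the same kind of ranking the search results show for the test users.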
