# Product Recommender using Collaborative Filtering and LanceDB

We will use **LanceDB** and **collaborative filtering** to recommend products based on a user's purchase history. This example uses the <a href="https://www.kaggle.com/datasets/yasserh/instacart-online-grocery-basket-analysis-dataset">**Instacart dataset**</a>.



## Credentials

Copy the project name (slug) and the API key from your project page and paste them below.
They are used later to [connect to LanceDB Cloud](#scroll-to=5q8m6GMD7sGu).

In [2]:
project_slug = "your-project-slug"  # @param {type:"string"}

In [3]:
api_key = "sk_..."  # @param {type:"string"}

You can also set `LANCEDB_API_KEY` as an environment variable using one of the options below. The `%env` magic sets the variable in the notebook kernel itself.

In [3]:
%env LANCEDB_API_KEY=sk_...

In [None]:
import os
import getpass

os.environ["LANCEDB_API_KEY"] = getpass.getpass("Enter Your LANCEDB API Key:")
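The connection cell further down passes `api_key` explicitly, so a key stored only in the environment has to be read back at some point. A minimal sketch of one way to do that; `resolve_api_key` is a hypothetical helper written for this notebook, not part of LanceDB:

```python
import os
from typing import Optional


def resolve_api_key(
    explicit_key: Optional[str] = None, env_var: str = "LANCEDB_API_KEY"
) -> str:
    """Return the explicitly supplied key, else the one found in the environment.

    Hypothetical helper for this notebook; it is not a LanceDB API.
    """
    # Treat the unfilled "sk_..." placeholder from the form cell as missing
    if explicit_key and explicit_key != "sk_...":
        return explicit_key
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set {env_var} or pass the key explicitly")
    return key
```

With this in place, `api_key = resolve_api_key(api_key)` would work with either of the options above.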

## Get dataset
Download the dataset from the LanceDB S3 bucket and unzip it.

In [4]:
!wget http://vectordb-recipes.s3.us-west-2.amazonaws.com/product-recommender.zip
!unzip product-recommender.zip
!cp product-recommender/*.zip .
!rm -fr product-recommender

--2024-01-23 03:30:37--  http://vectordb-recipes.s3.us-west-2.amazonaws.com/product-recommender.zip
Resolving vectordb-recipes.s3.us-west-2.amazonaws.com (vectordb-recipes.s3.us-west-2.amazonaws.com)... 3.5.84.12, 3.5.84.155, 3.5.84.131, ...
Connecting to vectordb-recipes.s3.us-west-2.amazonaws.com (vectordb-recipes.s3.us-west-2.amazonaws.com)|3.5.84.12|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 411510857 (392M) [application/zip]
Saving to: ‘product-recommender.zip’


2024-01-23 03:30:56 (20.8 MB/s) - ‘product-recommender.zip’ saved [411510857/411510857]

Archive:  product-recommender.zip
   creating: product-recommender/
  inflating: __MACOSX/._product-recommender  
  inflating: product-recommender/order_products__prior.csv.zip  
  inflating: __MACOSX/product-recommender/._order_products__prior.csv.zip  
  inflating: product-recommender/order_products__train.csv.zip  
  inflating: __MACOSX/product-recommender/._order_products__train.csv.zip  
  inflating:

Install dependencies:

In [5]:
!pip install numpy pandas scipy kaggle implicit torch lancedb

Collecting implicit
  Downloading implicit-0.7.2-cp310-cp310-manylinux2014_x86_64.whl (8.9 MB)
Collecting lancedb
  Downloading lancedb-0.5.0-py3-none-any.whl (87 kB)
Collecting deprecation (from lancedb)
  Downloading deprecation-2.1.0-py2.py3-none-any.whl (11 kB)
Collecting pylance==0.9.6 (from lancedb)
  Downloading pylance-0.9.6-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.6 MB)
Collecting ratelimiter~=1.0 (from lancedb)
  Downloading ratelimiter-1.2.0.post0-py3-none-any.whl (6.6 kB)
Collecting retry>=0.9.2 (from lancedb)
  Downloading retry-0.9.2-py2.py3-none-any.whl (8.0 kB)
Collecting semver>=3.0 (from l

First, let's import all the required modules for this example.

In [6]:
import zipfile
import numpy as np
import pandas as pd
import scipy.sparse
import torch
import implicit
from implicit import evaluation
import pydantic
import lancedb
from lancedb.pydantic import pydantic_to_schema, vector

We must now extract the zip files.

In [8]:
files = [
    "instacart-market-basket-analysis.zip",
    "order_products__train.csv.zip",
    "order_products__prior.csv.zip",
    "products.csv.zip",
    "orders.csv.zip",
]

for filename in files:
    with zipfile.ZipFile(filename, "r") as zip_ref:
        zip_ref.extractall("./")

Now we can move on to loading the dataset. We'll first read the csv files and create dataframes.

In [9]:
products = pd.read_csv("products.csv")
orders = pd.read_csv("orders.csv")
order_products = pd.concat(
    [pd.read_csv("order_products__train.csv"), pd.read_csv("order_products__prior.csv")]
)

The dataset has no explicit user ratings, so we derive a "confidence" signal instead: how many times each user ordered each product. We store these counts in the `data` dataframe.

In [10]:
customer_order_products = pd.merge(orders, order_products, how="inner", on="order_id")

# create confidence table
data = (
    customer_order_products.groupby(["user_id", "product_id"])[["order_id"]]
    .count()
    .reset_index()
)
data.columns = ["user_id", "product_id", "total_orders"]
data.product_id = data.product_id.astype("int64")
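To make the counting step concrete, here is the same `groupby` applied to a tiny hand-made frame. The values are invented for illustration and are not taken from the Instacart data:

```python
import pandas as pd

# Toy order lines: one row per (order, product) pair, like the merged frame above
toy = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 2, 2],
        "order_id": [10, 11, 12, 20, 21],
        "product_id": [196, 196, 10258, 196, 196],
    }
)

# Count how many order lines each user has for each product
toy_confidence = (
    toy.groupby(["user_id", "product_id"])[["order_id"]].count().reset_index()
)
toy_confidence.columns = ["user_id", "product_id", "total_orders"]
print(toy_confidence)
```

User 1 ends up with a count of 2 for product 196 and 1 for product 10258, while user 2 has a count of 2 for product 196.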

Let's create a couple of test users to examine the recommendations later:
- 1st test user: a soda buyer who ordered **Zero Calorie Cola** 50 times
- 2nd test user: an organic produce buyer who ordered **Organic Whole Milk** 49 times and **Organic Blackberries** 32 times

In [11]:
data_new = pd.DataFrame(
    [
        [data.user_id.max() + 1, 46149, 50],
        [data.user_id.max() + 2, 27845, 49],
        [data.user_id.max() + 2, 26604, 32],
    ],
    columns=["user_id", "product_id", "total_orders"],
)
data = pd.concat([data, data_new]).reset_index(drop=True)
data.tail()

13863749


| | user_id | product_id | total_orders |
|---|---|---|---|
| 13863744 | 206209 | 48697 | 1 |
| 13863745 | 206209 | 48742 | 2 |
| 13863746 | 206210 | 46149 | 50 |
| 13863747 | 206211 | 27845 | 49 |
| 13863748 | 206211 | 26604 | 32 |


In the next step, we will extract user and product unique ids, in order to create a CSR (Compressed Sparse Row) matrix. This will allow us to perform collaborative filtering.


In [12]:
# extract unique user and product ids
unique_users = list(np.sort(data.user_id.unique()))
unique_products = list(np.sort(products.product_id.unique()))
purchases = list(data.total_orders)

# sorted user ids, addressed by zero-based position
index_to_user = pd.Series(unique_users)

# reverse mapping from user id to its row in the matrix; rows are offset by one
# because row 0 is left unused (this relies on ids being contiguous and starting at 1)
user_to_index = pd.Series(data=index_to_user.index + 1, index=index_to_user.values)

# the raw (1-based) ids are used directly as row and column indices
users_rows = data.user_id.astype(int)
products_cols = data.product_id.astype(int)

# create CSR matrix with one extra row and column for the unused index 0
matrix = scipy.sparse.csr_matrix(
    (purchases, (users_rows, products_cols)),
    shape=(len(unique_users) + 1, len(unique_products) + 1),
)
# replace any NaN counts with 0
matrix.data = np.nan_to_num(matrix.data, copy=False)
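The cell above uses raw IDs directly as row and column indices, which is why the matrix is one row and one column larger than the number of users and products. A toy sketch of the same construction, with invented values:

```python
import numpy as np
import scipy.sparse

# Toy (user_id, product_id, total_orders) triples with 1-based ids
toy_user_ids = np.array([1, 1, 2, 3])
toy_product_ids = np.array([1, 3, 3, 2])
toy_counts = np.array([4, 1, 2, 5])

# Row 0 and column 0 stay empty so that a raw id can serve as an index
toy_matrix = scipy.sparse.csr_matrix(
    (toy_counts, (toy_user_ids, toy_product_ids)),
    shape=(toy_user_ids.max() + 1, toy_product_ids.max() + 1),
)

print(toy_matrix.toarray())
```

Only the four non-zero counts are stored, which is what keeps the full user-by-product matrix manageable in memory.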

Let's now create a recommender model using the **implicit** library. The model is based on the algorithm described in the paper [Collaborative Filtering for Implicit Feedback Datasets](https://www.researchgate.net/publication/220765111_Collaborative_Filtering_for_Implicit_Feedback_Datasets), with the performance optimizations described in [Applications of the Conjugate Gradient Method for Implicit Feedback Collaborative Filtering](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.379.6473&rep=rep1&type=pdf).
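For reference, the first paper turns each raw observation $r_{ui}$ (here, `total_orders`) into a binary preference $p_{ui}$ and a confidence $c_{ui}$, then fits user factors $x_u$ and item factors $y_i$ by minimising a confidence-weighted squared error. The notation follows the paper; how the `implicit` library maps matrix values to confidences internally may differ in detail.

$$
p_{ui} = \begin{cases} 1 & r_{ui} > 0 \\ 0 & r_{ui} = 0 \end{cases}
\qquad
c_{ui} = 1 + \alpha\, r_{ui}
$$

$$
\min_{x_*,\, y_*} \; \sum_{u,i} c_{ui} \bigl( p_{ui} - x_u^{\top} y_i \bigr)^2
+ \lambda \Bigl( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \Bigr)
$$

In the training cell, `alpha = 15` plays the role of $\alpha$, `regularization=0.05` corresponds to $\lambda$, and `factors=128` sets the dimension of $x_u$ and $y_i$.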

Note: this step will take about 17 minutes with the current parameter setup.

In [13]:
# split data into train and test splits
train, test = evaluation.train_test_split(matrix, train_percentage=0.9)

# initialize the recommender model
model = implicit.als.AlternatingLeastSquares(
    factors=128, regularization=0.05, iterations=50, num_threads=1
)

alpha = 15
train = (train * alpha).astype("double")

# train the model on CSR matrix
model.fit(train, show_progress=True)

  check_blas_config()


  0%|          | 0/50 [00:00<?, ?it/s]

## Let's now evaluate the model.

In [15]:
test = (test * alpha).astype("double")
evaluation.ranking_metrics_at_k(
    model, train, test, K=100, show_progress=True, num_threads=1
)

  0%|          | 0/192802 [00:00<?, ?it/s]

{'precision': 0.2742377453615933,
 'map': 0.04506404325620732,
 'ndcg': 0.1449554399501384,
 'auc': 0.6549935260418878}
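As a rough guide to reading the `precision` figure, here is a sketch of one common textbook definition of precision at K for a single user. It is an illustration only: the exact normalisation used by `implicit` may differ, and the product IDs below are made up for the example.

```python
def precision_at_k(recommended, held_out, k):
    """Fraction of the top-k recommended items that appear in the held-out set."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in held_out)
    return hits / k


# Ranked recommendations for one user, and the items held out for testing
toy_recommended = [196, 46149, 40939, 37710, 22802]
toy_held_out = {46149, 22802, 5258}

toy_precision = precision_at_k(toy_recommended, toy_held_out, k=5)
print(toy_precision)
```

Two of the five recommended items are in the held-out set, so the toy value is 2/5.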

The trained model exposes item and user factors. Later, we will store the item factors in LanceDB as vector embeddings and use the user factors as query vectors.
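In the matrix factorization model from the paper, the predicted preference of a user for an item is the inner product of their factor vectors. A small sketch with random stand-in factors (not the trained model's values):

```python
import numpy as np

rng = np.random.default_rng(0)
toy_user_factors = rng.normal(size=(3, 4))  # 3 users, 4 latent factors
toy_item_factors = rng.normal(size=(5, 4))  # 5 items, 4 latent factors

# Predicted preference of every user for every item: one inner product per pair
toy_scores = toy_user_factors @ toy_item_factors.T

# Indices of the two highest-scoring items for user 0, best first
top_items_user0 = np.argsort(-toy_scores[0])[:2]
print(toy_scores.shape, top_items_user0)
```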

In [17]:
model.item_factors[1:3]

array([[ 4.18832153e-03,  3.25558195e-03, -1.20758591e-02,
         1.40742492e-03, -9.09519568e-03,  3.18243494e-03,
         2.07483694e-02, -3.95777356e-03, -7.84489443e-04,
         1.28329173e-03,  4.66100639e-03,  1.26599418e-02,
         1.69202778e-02, -3.54033429e-03, -1.87805621e-04,
        -8.05972423e-03,  4.04613744e-03,  7.47162709e-03,
         4.05248860e-03,  1.68309249e-02, -1.78848747e-02,
        -9.86590981e-03,  8.46584328e-03, -1.20693864e-02,
         7.22488947e-03,  3.90211469e-03,  6.32435898e-04,
         3.13967327e-03,  9.04218480e-03,  2.50183023e-03,
         1.39820874e-02,  7.54051283e-03,  1.57470535e-02,
         4.96101473e-03,  1.74571313e-02,  4.82573919e-03,
         1.31175248e-02,  2.78141089e-02,  2.54594497e-02,
         1.70677726e-04,  6.35464117e-03, -3.27711529e-03,
         8.61203857e-03,  1.61729436e-02, -7.27234699e-04,
         7.29484204e-03, -6.27670763e-03,  2.42914446e-02,
         9.70306620e-03,  9.60955396e-03,  1.76130934e-0

In [18]:
model.user_factors[1:3]

array([[-0.48312342, -0.16332878, -0.27058715, -0.68734646,  0.55745304,
        -0.76024646,  1.3025886 , -1.1410682 ,  0.19876784,  0.322232  ,
         1.418613  , -0.35110232, -0.20965634,  0.06050462, -1.2792661 ,
        -1.0213155 ,  0.4870829 ,  0.1747867 , -0.56089026,  1.9309798 ,
        -1.1751343 , -1.7791682 , -1.1694795 ,  0.05588444,  1.1789317 ,
         0.46748516, -1.4641706 , -0.34146857,  0.38970897,  0.8604016 ,
         0.3465701 ,  1.1880745 ,  0.06135967, -1.3244237 ,  0.3275966 ,
        -1.1865908 , -0.01917509,  2.7532892 ,  2.7307365 ,  0.44283357,
         0.5644037 , -0.697197  , -1.8847649 ,  0.10031813,  0.3599322 ,
        -0.83181113, -1.9561976 ,  0.8480924 ,  0.910125  , -0.35006854,
         0.45438412,  1.1324192 ,  0.02506897,  0.7978778 , -1.0787288 ,
         0.41879764, -1.0015563 , -0.11314881, -1.512127  , -0.37960863,
        -0.5743517 , -1.0606588 ,  0.9415234 ,  0.1189226 , -0.10419434,
         1.4429063 , -0.35251117,  0.59351844,  0.5

## Let's save the data and create an empty LanceDB Table using a Pydantic model.
A Table is designed to hold many columns and very large amounts of data. For those interested, LanceDB is a columnar store built on Lance, an open data format.

In [20]:
# connect to LanceDB Cloud with previously set credentials
uri = "db://" + project_slug
db = lancedb.connect(uri, api_key=api_key, region="us-east-1")

In [21]:
data.head()

| | user_id | product_id | total_orders |
|---|---|---|---|
| 0 | 1 | 196 | 11 |
| 1 | 1 | 10258 | 10 |
| 2 | 1 | 10326 | 1 |
| 3 | 1 | 12427 | 10 |
| 4 | 1 | 13032 | 4 |


In [22]:
class ProductModel(pydantic.BaseModel):
    product_id: int
    product_name: str
    vector: vector(128)


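schema = pydantic_to_schema(ProductModel)
table_name = "product_recommender"

# drop any table left over from a previous run; on a first run there is nothing to drop
try:
    db.drop_table(table_name)
except Exception:
    pass

# create the empty table, falling back to opening it if creation fails
try:
    tbl = db.create_table(table_name, schema=schema)
except Exception:
    tbl = db.open_table(table_name)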

Let's now store our item factors into the table via the vector column of `product_entries`.

In [23]:
# Attach each product's item factors as its embedding vector
items_factors = model.item_factors
product_entries = products[["product_id", "product_name"]].drop_duplicates()
product_entries["product_id"] = product_entries.product_id.astype("int64")
# Row 0 of the factor matrix belongs to the unused index 0, so skip it; the
# remaining rows line up with the products only if product ids run 1..N in file order
item_embeddings = items_factors[1:].tolist()
product_entries["vector"] = item_embeddings

tbl.add(product_entries)
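The `[1:]` slice assigns vectors by position, so it is only correct when row `i` of the factor matrix really belongs to product `i`. A minimal sketch of that alignment check on toy stand-ins (the names and values here are invented):

```python
import numpy as np
import pandas as pd

# Toy stand-ins: 3 products with contiguous 1-based ids, and a factor matrix
# with one extra leading row for the unused index 0
toy_products = pd.DataFrame(
    {"product_id": [1, 2, 3], "product_name": ["Soda", "Milk", "Kiwi"]}
)
toy_item_factors = np.arange(8, dtype="float32").reshape(4, 2)

# Positional assignment is only valid if the ids are exactly 1..N in this order
expected_ids = np.arange(1, len(toy_item_factors))
assert (toy_products["product_id"].to_numpy() == expected_ids).all()

toy_products["vector"] = toy_item_factors[1:].tolist()
print(toy_products)
```

If the assertion failed on real data, indexing the factors by ID, for example `toy_item_factors[toy_products["product_id"].to_numpy()]`, would be a safer alternative to slicing.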

## Let's create an ANN index in order to speed up retrieval. This might take a while.

In [24]:
tbl.create_index(vector_column_name="vector")

{}

This helper function is used later to analyse the recommendations.
It returns the top N products a user bought in the past, ranked by number of orders.

In [25]:
def products_bought_by_user_in_the_past(user_id: int, top: int = 10):
    selected = data[data.user_id == user_id].sort_values(
        by=["total_orders"], ascending=False
    )

    selected["product_name"] = selected["product_id"].map(
        product_entries.set_index("product_id")["product_name"]
    )
    selected = selected[["product_id", "product_name", "total_orders"]].reset_index(
        drop=True
    )
    # head() returns every row when there are fewer than `top`
    return selected.head(top)

Let's retrieve our test users so we can query for recommendations.

In [26]:
test_user_ids = [206210, 206211]
test_user_factors = model.user_factors[user_to_index[test_user_ids]]

## Let's now query LanceDB to retrieve recommendations.

In [28]:
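# Query by user factors: each user's factor vector is the search vector
test_user_embeddings = test_user_factors.tolist()
for embedding, user_id in zip(test_user_embeddings, test_user_ids):
    # nearest product vectors are the recommendations
    results = tbl.search(embedding).limit(10).to_pandas()
    display(results)
    # what the user actually bought, for comparison
    display(products_bought_by_user_in_the_past(user_id, top=15))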

| | product_id | product_name | vector | _distance |
|---|---|---|---|---|
| 0 | 196 | Soda | [-0.0030924827, -0.0042996905, -0.01350651, -0... | 35.096085 |
| 1 | 46149 | Zero Calorie Cola | [0.0015008126, -0.014029495, -0.015295635, 0.0... | 35.392975 |
| 2 | 40939 | Drinking Water | [0.0018837166, -0.018152414, -0.015649604, 0.0... | 35.864483 |
| 3 | 37710 | Trail Mix | [-0.0011668581, -0.0025222106, -0.016717039, -... | 35.896873 |
| 4 | 22802 | Mineral Water | [-0.010115783, -0.017115017, -0.011403508, 0.0... | 36.035912 |
| 5 | 41400 | Crunchy Oats 'n Honey Granola Bars | [0.0040870784, -0.0009994006, -0.018302424, -0... | 36.042686 |
| 6 | 46061 | Popcorn | [0.0036969625, -0.013887798, -0.002804261, -0.... | 36.043732 |
| 7 | 31651 | Extra Fancy Unsalted Mixed Nuts | [0.014438897, -0.005578243, -0.0055169673, -0.... | 36.117802 |
| 8 | 5258 | Sparkling Water | [-0.022658644, -0.026015628, -0.0083606485, -0... | 36.131721 |
| 9 | 38928 | 0% Greek Strained Yogurt | [0.0018425643, -0.011489441, -0.0052835834, 0.... | 36.13987 |


| | product_id | product_name | total_orders |
|---|---|---|---|
| 0 | 46149 | Zero Calorie Cola | 50 |


| | product_id | product_name | vector | _distance |
|---|---|---|---|---|
| 0 | 26604 | Organic Blackberries | [-0.017585486, 0.019628799, 0.0399348, 0.01422... | 17.404045 |
| 1 | 27845 | Organic Whole Milk | [-0.050286394, 0.026924692, 0.030701049, -0.02... | 17.404305 |
| 2 | 27966 | Organic Raspberries | [-0.006732653, 0.015266006, 0.018316658, -0.00... | 17.867121 |
| 3 | 43352 | Raspberries | [0.0037516877, 0.013682851, 0.057814274, 0.031... | 18.030893 |
| 4 | 9076 | Blueberries | [0.0029817792, 0.030459687, 0.04528497, 0.0113... | 18.135754 |
| 5 | 21288 | Blackberries | [-0.011553102, -0.010046569, 0.037375, 0.02368... | 18.141661 |
| 6 | 39275 | Organic Blueberries | [0.010543987, 0.006028164, 0.011502461, 0.0004... | 18.24152 |
| 7 | 39928 | Organic Kiwi | [-0.044292357, -0.031322725, -0.00174381, -0.0... | 18.414057 |
| 8 | 11777 | Red Raspberries | [-0.0067819585, -0.023531102, 0.010277328, -0.... | 18.468819 |
| 9 | 21137 | Organic Strawberries | [0.007023127, 0.0037457773, -0.0061378656, -0.... | 18.476973 |


| | product_id | product_name | total_orders |
|---|---|---|---|
| 0 | 27845 | Organic Whole Milk | 49 |
| 1 | 26604 | Organic Blackberries | 32 |
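To build intuition for what the search in the query cell returns, here is a brute-force nearest-neighbour sketch in NumPy. It assumes the search ranks rows by L2 distance to the query vector; whether the `_distance` column reports the squared or unsquared value is not confirmed here. `brute_force_search` is a hypothetical helper, not a LanceDB function, and an ANN index approximates this exhaustive scan rather than reproducing it exactly.

```python
import numpy as np


def brute_force_search(query, vectors, limit=10):
    """Return (row indices, squared L2 distances) of the rows nearest to `query`."""
    diffs = vectors - query
    # Row-wise sum of squared differences
    distances = np.einsum("ij,ij->i", diffs, diffs)
    order = np.argsort(distances)[:limit]
    return order, distances[order]


# Toy item vectors and a toy user vector (invented values)
toy_items = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [3.0, 3.0]])
toy_query = np.array([1.0, 1.0])

nearest_idx, nearest_dist = brute_force_search(toy_query, toy_items, limit=2)
print(nearest_idx, nearest_dist)
```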
