# Product Recommender using Collaborative Filtering and LanceDB

We will use **LanceDB** and **Collaborative Filtering** to recommend products based on a user's purchase history. This example uses the <a href="https://www.kaggle.com/datasets/yasserh/instacart-online-grocery-basket-analysis-dataset">**Instacart dataset**</a>.

![picture](https://daxg39y63pxwu.cloudfront.net/images/blog/product-recommendation-system-projects/Product_Recommendation_System_Project_Ideas_and_Examples.png)

To run this example, you must first create a Kaggle account. Then, go to the 'Account' tab of your user profile and select 'Create New Token'. This will trigger the download of kaggle.json, a file containing your API credentials.

Add your Kaggle credentials to `~/.kaggle/kaggle.json` on Linux, macOS, and other UNIX-based operating systems, or to `C:\Users\<Windows-username>\.kaggle\kaggle.json` on Windows.

In Google Colab, run the snippet below.

In [10]:
import json
import os

# Set the file path
kaggle_json_path = "/content/kaggle.json"

# Write Kaggle API key to the file (fill in your own username and key below)
with open(kaggle_json_path, "w") as fp:
    json.dump({"username": "", "key": ""}, fp)

# Move the file to the correct location
os.system("mkdir -p ~/.kaggle")
os.system(f"mv {kaggle_json_path} ~/.kaggle/kaggle.json")

# Set permissions
os.system("chmod 600 ~/.kaggle/kaggle.json")

print("Kaggle API key file created and moved successfully.")

Kaggle API key file created and moved successfully.
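
Outside Colab, the same steps can be done without shell commands. The following is a minimal sketch, not part of the original notebook: the helper name `write_kaggle_credentials` and the demo values are hypothetical, and the demo writes to a temporary directory so that no real credentials are touched.

```python
import json
import os
import tempfile
from pathlib import Path


def write_kaggle_credentials(username: str, key: str, home: Path) -> Path:
    """Write kaggle.json under <home>/.kaggle and restrict its permissions."""
    kaggle_dir = home / ".kaggle"
    kaggle_dir.mkdir(parents=True, exist_ok=True)
    target = kaggle_dir / "kaggle.json"
    with open(target, "w") as fp:
        json.dump({"username": username, "key": key}, fp)
    # Owner read/write only, the same intent as `chmod 600` above
    # (on Windows, chmod only toggles the read-only flag).
    os.chmod(target, 0o600)
    return target


# Demo against a throwaway directory instead of the real home directory.
demo_home = Path(tempfile.mkdtemp())
demo_path = write_kaggle_credentials("demo-user", "demo-key", home=demo_home)
print(demo_path.name)
```

To write the real file, you would pass `Path.home()` as `home` together with the values from your downloaded `kaggle.json`.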


### Install dependencies

In [5]:
!pip install numpy pandas scipy kaggle implicit torch lancedb

Collecting implicit
  Downloading implicit-0.7.2-cp310-cp310-manylinux2014_x86_64.whl (8.9 MB)
Collecting lancedb
  Downloading lancedb-0.5.0-py3-none-any.whl (87 kB)
Collecting deprecation (from lancedb)
  Downloading deprecation-2.1.0-py2.py3-none-any.whl (11 kB)
Collecting pylance==0.9.6 (from lancedb)
  Downloading pylance-0.9.6-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.6 MB)
Collecting ratelimiter~=1.0 (from lancedb)
  Downloading ratelimiter-1.2.0.post0-py3-none-any.whl (6.6 kB)
Collecting retry>=0.9.2 (from lancedb)
  Downloading retry-0.9.2-py2.py3-none-any.whl (8.0 kB)
Collecting semver>=3.0 (from 

### Importing libraries

In [6]:
import zipfile
import numpy as np
import pandas as pd
import scipy.sparse
import torch
import implicit
from implicit import evaluation
import pydantic
import lancedb
from lancedb.pydantic import pydantic_to_schema, vector

### Load the dataset
Now we can download the dataset. You first need to accept the rules of the `instacart-market-basket-analysis` competition, which you can do [here](https://www.kaggle.com/competitions/instacart-market-basket-analysis/rules).

In [11]:
!kaggle competitions download -c instacart-market-basket-analysis

Downloading instacart-market-basket-analysis.zip to /content
 92% 181M/196M [00:01<00:00, 81.3MB/s]
100% 196M/196M [00:01<00:00, 105MB/s] 


We must now extract the zip files.

In [12]:
files = [
    "instacart-market-basket-analysis.zip",
    "order_products__train.csv.zip",
    "order_products__prior.csv.zip",
    "products.csv.zip",
    "orders.csv.zip",
]

for filename in files:
    with zipfile.ZipFile(filename, "r") as zip_ref:
        zip_ref.extractall("./")

Now we can move on to loading the dataset. We'll first read the CSV files into dataframes.

In [13]:
products = pd.read_csv("products.csv")
orders = pd.read_csv("orders.csv")
order_products = pd.concat(
    [pd.read_csv("order_products__train.csv"), pd.read_csv("order_products__prior.csv")]
)

Since the dataset has no explicit user ratings, we use purchase frequency as a "confidence" signal: we count how often each user bought each product and store the result in the `data` dataframe.

### Data Manipulation

In [14]:
customer_order_products = pd.merge(orders, order_products, how="inner", on="order_id")

# create confidence table
data = (
    customer_order_products.groupby(["user_id", "product_id"])[["order_id"]]
    .count()
    .reset_index()
)
data.columns = ["user_id", "product_id", "total_orders"]
data.product_id = data.product_id.astype("int64")
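
To make the confidence table concrete, here is a small dependency-free sketch of the same idea. The order lines are made-up toy data, and `collections.Counter` stands in for the `groupby(...).count()` call above; it is an illustration, not the notebook's actual pipeline.

```python
from collections import Counter

# Hypothetical toy order lines: one (user_id, product_id) pair per purchased item.
order_lines = [
    (1, 196), (1, 196), (1, 46149),
    (2, 27845), (2, 27845), (2, 27845), (2, 26604),
]

# Counting identical pairs mirrors grouping by user_id and product_id
# and counting the rows in each group.
confidence = Counter(order_lines)

for (user_id, product_id), total_orders in sorted(confidence.items()):
    print(user_id, product_id, total_orders)
```

Each printed row corresponds to one row of `data`: a user, a product, and how many times that user ordered it.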

Let's create a couple of test users to examine the recommendations later:
- 1st test user: buys 50 sodas: **Zero Calorie Cola**
- 2nd test user: buys organic produce: **Organic Whole Milk** and **Organic Blackberries**

In [15]:
data_new = pd.DataFrame(
    [
        [data.user_id.max() + 1, 46149, 50],
        [data.user_id.max() + 2, 27845, 49],
        [data.user_id.max() + 2, 26604, 32],
    ],
    columns=["user_id", "product_id", "total_orders"],
)
data = pd.concat([data, data_new]).reset_index(drop=True)
data.tail()

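|          | user_id | product_id | total_orders |
|---------:|--------:|-----------:|-------------:|
| 13863744 | 206209 | 48697 | 1 |
| 13863745 | 206209 | 48742 | 2 |
| 13863746 | 206210 | 46149 | 50 |
| 13863747 | 206211 | 27845 | 49 |
| 13863748 | 206211 | 26604 | 32 |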


In the next step, we extract the unique user and product IDs and build a CSR (Compressed Sparse Row) matrix with one row per user and one column per product. This sparse user-by-product matrix is what we will run collaborative filtering on.


In [16]:
# extract unique user and product ids
unique_users = list(np.sort(data.user_id.unique()))
unique_products = list(np.sort(products.product_id.unique()))
purchases = list(data.total_orders)

# positional index <-> user ID mappings
index_to_user = pd.Series(unique_users)

# reverse mapping from user ID to matrix row; the matrix below is indexed
# directly by user_id (IDs are assumed contiguous from 1), hence the +1
user_to_index = pd.Series(data=index_to_user.index + 1, index=index_to_user.values)

# create row and column for user and product ids
users_rows = data.user_id.astype(int)
products_cols = data.product_id.astype(int)

# create CSR matrix
matrix = scipy.sparse.csr_matrix(
    (purchases, (users_rows, products_cols)),
    shape=(len(unique_users) + 1, len(unique_products) + 1),
)
matrix.data = np.nan_to_num(matrix.data, copy=False)
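
For readers unfamiliar with the CSR layout, the sketch below builds the three CSR arrays (`data`, `indices`, `indptr`) by hand from toy triplets. The function `to_csr` and the variable names are hypothetical and only illustrate the format; the notebook itself relies on `scipy.sparse.csr_matrix`. Unlike SciPy, this sketch assumes there are no duplicate (row, column) pairs.

```python
def to_csr(triplets, n_rows):
    """Build CSR arrays (data, indices, indptr) from (row, col, value) triplets."""
    data, indices, indptr = [], [], [0]
    ordered = sorted(triplets)  # sort by row, then by column
    pos = 0
    for row in range(n_rows):
        # Collect every stored entry that belongs to the current row.
        while pos < len(ordered) and ordered[pos][0] == row:
            _, col, value = ordered[pos]
            indices.append(col)
            data.append(value)
            pos += 1
        # indptr[row + 1] marks where this row's entries end.
        indptr.append(len(data))
    return data, indices, indptr


# Toy purchases as (user_id, product_id, total_orders). Row 0 stays empty
# because IDs start at 1, as in the matrix built above.
triplets = [(1, 2, 5), (1, 3, 1), (2, 1, 7)]
csr_data, indices, indptr = to_csr(triplets, n_rows=3)


def row_items(row):
    """Return the (column, value) pairs stored for one row."""
    start, end = indptr[row], indptr[row + 1]
    return list(zip(indices[start:end], csr_data[start:end]))


print(row_items(1))
```

Only the non-zero purchases are stored, which is why a matrix with hundreds of thousands of users and tens of thousands of products still fits in memory.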

Let's now create a recommender model using the **implicit** library. The model is based on the algorithms described in the paper [Collaborative Filtering for Implicit Feedback Datasets](https://www.researchgate.net/publication/220765111_Collaborative_Filtering_for_Implicit_Feedback_Datasets), with performance optimizations described in [Applications of the Conjugate Gradient Method for Implicit Feedback Collaborative Filtering](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.379.6473&rep=rep1&type=pdf).
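
As a rough summary of the first paper's formulation: let $r_{ui}$ be the observed count (here, `total_orders`), $x_u$ and $y_i$ the user and item factor vectors, $\alpha$ a confidence scaling factor, and $\lambda$ the regularization weight. The model fits a binary preference $p_{ui}$, weighted by a confidence $c_{ui}$ that grows with the count:

```latex
p_{ui} =
\begin{cases}
  1 & r_{ui} > 0 \\
  0 & r_{ui} = 0
\end{cases},
\qquad
c_{ui} = 1 + \alpha\, r_{ui}

\min_{x_*,\, y_*} \;
\sum_{u,i} c_{ui}\,\bigl(p_{ui} - x_u^{\top} y_i\bigr)^2
  + \lambda \Bigl( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \Bigr)
```

In this notebook, scaling the training matrix by `alpha` plays the role of $\alpha\, r_{ui}$ and the `regularization` argument plays the role of $\lambda$. How `implicit` maps the stored matrix values to $c_{ui}$ internally may differ between versions, so treat this as the paper's formulation rather than a description of the library's internals.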


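## Difference between collaborative and content-based filtering

Collaborative filtering recommends items based on the behaviour of users with similar purchase histories, whereas content-based filtering recommends items whose attributes resemble those the user already liked.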

![picture](https://miro.medium.com/v2/resize:fit:1400/0*R8qw_CXxCc4600bQ.png)

In [17]:
import os

os.environ["OPENBLAS_NUM_THREADS"] = "1"
# split data into train and test splits
train, test = evaluation.train_test_split(matrix, train_percentage=0.9)

# initialize the recommender model
model = implicit.als.AlternatingLeastSquares(
    factors=128, regularization=0.05, iterations=50, num_threads=1
)

alpha = 15
train = (train * alpha).astype("double")

# train the model on CSR matrix
model.fit(train, show_progress=True)



  0%|          | 0/50 [00:00<?, ?it/s]

## Let's now evaluate the model.

In [18]:
test = (test * alpha).astype("double")
evaluation.ranking_metrics_at_k(
    model, train, test, K=100, show_progress=True, num_threads=1
)

  0%|          | 0/192941 [00:00<?, ?it/s]

{'precision': 0.27412284342591836,
 'map': 0.04493413696144052,
 'ndcg': 0.14451615505158932,
 'auc': 0.6545342486842805}
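
To give a feel for what a ranking metric measures, here is a minimal sketch of the textbook definition of precision at K on made-up data. The function `precision_at_k` below is our own illustration; the exact normalisation used by `implicit`'s `ranking_metrics_at_k` may differ, so consult the library documentation before comparing numbers.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k


# Hypothetical ranked recommendations (best first) and held-out test purchases.
recommended = [46149, 196, 40939, 22802, 37710]
held_out = {196, 22802, 99999}

print(precision_at_k(recommended, held_out, k=5))
```

Two of the five recommended products appear in the held-out purchases, so the textbook precision at 5 for this toy user is 2/5.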

From the trained model we can retrieve the item and user factors. We will store the item factors in LanceDB as vector embeddings and later use the user factors as query vectors.

In [19]:
model.item_factors[1:3]

array([[-0.00393582, -0.01685037, -0.02514135, -0.00218876, -0.0010562 ,
        -0.00079798,  0.01819069, -0.00519188, -0.01228626,  0.00401638,
        -0.00781125,  0.00959024,  0.01423726,  0.00442174,  0.01037473,
         0.02808587,  0.00244657,  0.0018454 ,  0.02538132,  0.01683291,
         0.01188253,  0.00087587,  0.0025703 , -0.00047981,  0.01450326,
         0.01200323,  0.00787515, -0.00017644, -0.00753563,  0.01381539,
         0.00135173,  0.01005786,  0.01090438,  0.00116869,  0.00096769,
         0.00977502, -0.0167746 ,  0.00648016, -0.00428325,  0.00548768,
         0.00768948, -0.0004173 , -0.00244178,  0.01658725,  0.01461017,
         0.00099183,  0.00801511, -0.00094962, -0.00111636,  0.01834919,
         0.01020439,  0.01684855,  0.00937538,  0.00352314,  0.00628611,
         0.01727425, -0.00045354, -0.0043545 ,  0.00622296,  0.02763929,
         0.0175414 ,  0.0025494 ,  0.00278871,  0.00882237,  0.01894817,
         0.004546  ,  0.00443751,  0.00223829,  0.0

In [20]:
model.user_factors[1:3]

array([[-1.156621  , -0.56613535, -2.132921  ,  0.21048984, -2.4275026 ,
         0.65278965,  0.29068047, -0.86535686, -1.1061512 ,  0.56259805,
         0.19742274, -1.2165526 , -0.62973964, -0.01139626,  0.89300275,
         2.2871504 ,  1.4771796 , -1.4062662 ,  1.0189441 ,  0.5945485 ,
        -0.18952619,  0.70189404, -1.3442475 , -0.02677805,  0.84128475,
         2.0733142 , -1.7199677 ,  0.5854054 , -0.4431385 , -0.42398763,
         0.02329228, -0.21817428,  0.11456848, -0.60438013,  1.8845385 ,
         0.48805752,  0.4914834 ,  0.7036006 , -0.20515339,  0.26406226,
        -1.0394758 ,  0.10504863,  0.15933166,  0.8230506 , -1.4198968 ,
         1.5953054 , -0.17673688, -0.8304307 , -0.6108456 ,  0.9837131 ,
        -0.7765777 , -0.17818405, -0.5966103 ,  0.04043822, -0.5247469 ,
         0.82219905, -1.2847204 , -0.15080781,  0.39815912,  0.38488662,
         0.64036644, -0.41876483, -0.82841444,  0.14284681,  1.6959293 ,
         0.32721832,  0.37919757, -0.12497136, -0.8
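
In a matrix factorization model like this one, the predicted preference of a user for an item is the dot product of the two factor vectors. The sketch below shows this on made-up three-dimensional factors (the notebook uses 128 dimensions); the names `score`, `user`, and `items` are hypothetical.

```python
def score(user_vector, item_vector):
    """Predicted preference: dot product of the user and item factors."""
    return sum(u * v for u, v in zip(user_vector, item_vector))


# Hypothetical factors, chosen by hand for illustration only.
user = [1.0, 0.5, -2.0]
items = {
    "cola": [0.2, 0.1, -0.3],
    "milk": [-0.1, 0.4, 0.2],
}

# Rank items from highest to lowest predicted preference.
ranked = sorted(items, key=lambda name: score(user, items[name]), reverse=True)
print(ranked)
```

Items whose factors point in the same direction as the user's factors receive higher scores and are ranked first.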

## Let's save the data and create an empty LanceDB Table using a Pydantic model.
A Table is designed to store large numbers of columns and huge quantities of data. For those interested, LanceDB is columnar and stores its data in Lance, an open data format.

In [21]:
db = lancedb.connect("data/lancedb")

In [22]:
class ProductModel(pydantic.BaseModel):
    product_id: int
    product_name: str
    vector: vector(128)


schema = pydantic_to_schema(ProductModel)
table_name = "product_recommender"
tbl = db.create_table(table_name, schema=schema, mode="overwrite")

Let's now store our item factors into the table via the vector column of `product_entries`.

In [25]:
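# Attach each product's item factors to its row as the embedding vector
items_factors = model.item_factors
product_entries = products[["product_id", "product_name"]].drop_duplicates()
product_entries["product_id"] = product_entries.product_id.astype("int64")
# Row 0 of item_factors is unused because product IDs start at 1; the slice
# assumes products.csv is sorted by product_id with no gaps
item_embeddings = items_factors[1:].tolist()
product_entries["vector"] = item_embeddings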

tbl.add(product_entries)
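
The cell above pairs factors with products purely by position, which works only if the product rows are sorted by ID with no gaps. A more defensive variant is to look up each product's factor row by its ID. The sketch below illustrates the difference on made-up two-dimensional factors; all values and names are hypothetical.

```python
# Hypothetical factor matrix: row i holds the factors for product_id i,
# and row 0 is a placeholder because product IDs start at 1.
item_factors = [
    [0.0, 0.0],  # row 0: unused
    [0.1, 0.2],  # product_id 1
    [0.3, 0.4],  # product_id 2
    [0.5, 0.6],  # product_id 3
]
product_ids = [3, 1, 2]  # a catalogue that is not sorted by ID

# Index by ID instead of slicing [1:], so each product gets its own row
# regardless of the order of the catalogue.
vectors = [item_factors[pid] for pid in product_ids]

# A positional slice would silently pair product 3 with product 1's factors.
positional = item_factors[1:]
print(vectors[0], positional[0])
```

With NumPy arrays, the equivalent would presumably be fancy indexing with the array of product IDs; we have not run that variant against this notebook.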

## Let's create an ANN index in order to speed up retrieval. This might take a while.

In [26]:
tbl.create_index(num_partitions=256, num_sub_vectors=16)
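
To our understanding, `create_index` here builds an IVF-PQ index: `num_partitions` is the number of clusters the vectors are grouped into, and `num_sub_vectors` is the number of chunks each vector is split into for product quantization. Under that assumption, the vector dimension should be divisible by `num_sub_vectors`. The helper below is a hypothetical illustration of the splitting step only, not LanceDB's implementation.

```python
def split_into_sub_vectors(vector, num_sub_vectors):
    """Split a vector into equally sized chunks, as product quantization does."""
    if len(vector) % num_sub_vectors != 0:
        raise ValueError("vector length must be divisible by num_sub_vectors")
    size = len(vector) // num_sub_vectors
    return [vector[i:i + size] for i in range(0, len(vector), size)]


# A dummy 128-dimensional embedding, matching the notebook's factor size.
embedding = [float(i) for i in range(128)]
chunks = split_into_sub_vectors(embedding, num_sub_vectors=16)

print(len(chunks), len(chunks[0]))
```

With 128-dimensional vectors and 16 sub-vectors, each chunk covers 8 dimensions and is compressed separately.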

This is a helper function for analysing recommendations later.
It returns the top N products a user bought in the past, ranked by how many times they were ordered.

In [27]:
def products_bought_by_user_in_the_past(user_id: int, top: int = 10):
    selected = data[data.user_id == user_id].sort_values(
        by=["total_orders"], ascending=False
    )

    selected["product_name"] = selected["product_id"].map(
        product_entries.set_index("product_id")["product_name"]
    )
    selected = selected[["product_id", "product_name", "total_orders"]].reset_index(
        drop=True
    )
    if selected.shape[0] < top:
        return selected

    return selected[:top]

Let's retrieve our test users so we can query for recommendations.

In [28]:
test_user_ids = [206210, 206211]
test_user_factors = model.user_factors[user_to_index[test_user_ids]]

## Let's now query LanceDB to retrieve recommendations.

In [31]:
# Query by user factors
test_user_embeddings = test_user_factors.tolist()
for embedding, user_id in zip(test_user_embeddings, test_user_ids):
    results = tbl.search(embedding).limit(10).to_pandas()
    display(results)
    display(products_bought_by_user_in_the_past(user_id, top=15))

|   | product_id | product_name | vector | _distance |
|--:|-----------:|:-------------|:-------|----------:|
| 0 | 46149 | Zero Calorie Cola | [-0.014371638, -0.016776536, -0.026950998, -0.... | 36.209068 |
| 1 | 196 | Soda | [-0.031917833, -0.050772455, 0.013827451, -0.0... | 36.464764 |
| 2 | 40939 | Drinking Water | [-0.013426425, 0.0053616967, -0.01992105, -0.0... | 36.504112 |
| 3 | 22802 | Mineral Water | [-0.0062663523, -0.00076926383, -0.013624842, ... | 36.615498 |
| 4 | 37710 | Trail Mix | [-0.01988333, -0.014069387, -0.021995109, -0.0... | 36.650448 |
| 5 | 42500 | Orange & Lemon Flavor Variety Pack Sparkling F... | [-0.009584657, -0.023491196, -0.033104196, -0.... | 36.696648 |
| 6 | 11759 | Organic Simply Naked Pita Chips | [-0.009341286, -0.014609524, -0.0064758006, -0... | 36.705814 |
| 7 | 41400 | Crunchy Oats 'n Honey Granola Bars | [-0.013461881, -0.021371827, -0.02064814, -0.0... | 36.709579 |
| 8 | 46061 | Popcorn | [0.0019679032, 0.00719048, -0.01262015, -0.005... | 36.714954 |
| 9 | 26348 | Mixed Fruit Fruit Snacks | [-0.0017672281, 0.0020188452, 0.012172974, -0.... | 36.716858 |


|   | product_id | product_name | total_orders |
|--:|-----------:|:-------------|-------------:|
| 0 | 46149 | Zero Calorie Cola | 50 |


|   | product_id | product_name | vector | _distance |
|--:|-----------:|:-------------|:-------|----------:|
| 0 | 26604 | Organic Blackberries | [0.045252558, 0.04258531, 0.011869884, -0.0111... | 17.445852 |
| 1 | 43352 | Raspberries | [0.059606433, 0.014409931, 0.008712215, -0.007... | 17.617174 |
| 2 | 27845 | Organic Whole Milk | [-0.03977351, 0.012210161, 0.024828656, 0.0155... | 17.692816 |
| 3 | 21288 | Blackberries | [0.030181486, 0.049021076, 0.003293778, -0.038... | 17.696075 |
| 4 | 27966 | Organic Raspberries | [0.020116415, 0.045062356, 0.00675044, 0.01640... | 17.872534 |
| 5 | 9076 | Blueberries | [0.0482006, 0.06329333, -0.015093377, 0.000180... | 17.879623 |
| 6 | 11777 | Red Raspberries | [0.05492493, 0.008120705, 0.020613482, 0.00779... | 17.931437 |
| 7 | 39275 | Organic Blueberries | [0.005109854, 0.032895964, -0.013481544, 0.010... | 17.970798 |
| 8 | 21137 | Organic Strawberries | [0.0017651353, 0.033547334, -0.005775958, 0.02... | 17.98657 |
| 9 | 13176 | Bag of Organic Bananas | [0.004607136, 0.02749164, -0.006206838, 0.0187... | 18.092993 |


|   | product_id | product_name | total_orders |
|--:|-----------:|:-------------|-------------:|
| 0 | 27845 | Organic Whole Milk | 49 |
| 1 | 26604 | Organic Blackberries | 32 |
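
The `_distance` column in the results suggests that the search ranks products by a distance between the query vector and each stored vector. Assuming that distance is L2 based (our understanding of LanceDB's default metric), the sketch below shows what an exact, brute-force version of such a search looks like on toy data, and that a nearest-by-L2 ranking can differ from a highest-dot-product ranking. All names and vectors are hypothetical.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dot(a, b):
    """Dot product of two vectors."""
    return sum(x * y for x, y in zip(a, b))


def brute_force_search(query, rows, limit):
    """Return the `limit` closest (distance, name) pairs by squared L2."""
    scored = ((squared_l2(query, vec), name) for name, vec in rows)
    return heapq.nsmallest(limit, scored)


# Toy table of (product_name, vector) rows and a toy user embedding.
rows = [
    ("Soda", [1.0, 1.0]),
    ("Popcorn", [3.0, 3.0]),
    ("Trail Mix", [0.0, 0.5]),
]
query = [1.0, 1.0]

results = brute_force_search(query, rows, limit=2)
best_by_dot = max(rows, key=lambda row: dot(query, row[1]))[0]

print(results)
print(best_by_dot)
```

An ANN index approximates this exhaustive scan to save time. If you want rankings that follow the model's dot-product scores more closely, it may be worth checking which distance metrics LanceDB supports and choosing one accordingly.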
