# Product Recommender using Collaborative Filtering and LanceDB

We will use **LanceDB** and **Collaborative Filtering** to recommend products based on a user's past purchase history. This example uses the <a href="https://www.kaggle.com/datasets/yasserh/instacart-online-grocery-basket-analysis-dataset">**Instacart dataset**</a>.

![picture](https://daxg39y63pxwu.cloudfront.net/images/blog/product-recommendation-system-projects/Product_Recommendation_System_Project_Ideas_and_Examples.png)

To download the dataset used in this example, you need a Kaggle account.

To get your Kaggle API credentials, go to Your Profile -> Settings -> Create Token.

This will download `kaggle.json`, a file containing your API credentials.

If you are using Google Colab, upload `kaggle.json` and run the snippet below.

In [1]:
! pip install kaggle
! mkdir -p ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json



### Install dependencies

In [2]:
!pip install numpy pandas scipy implicit torch lancedb

Collecting implicit
  Downloading implicit-0.7.2-cp310-cp310-manylinux2014_x86_64.whl (8.9 MB)
Collecting lancedb
  Downloading lancedb-0.6.1-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (21.3 MB)
Collecting deprecation (from lancedb)
  Downloading deprecation-2.1.0-py2.py3-none-any.whl (11 kB)
Collecting pylance==0.10.1 (from lancedb)
  Downloading pylance-0.10.1-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (21.5 MB)
Collecting ratelimiter~=1.0 (from lancedb)
  Downloading ratelimiter-1.2.0.post0-py3-none-any.whl (6.6 kB)
Collecting retry>=0.9.2 (from lancedb)
  Downloading retry-0.9.2-py2.py3-none-an

### Importing libraries

In [3]:
import zipfile
import numpy as np
import pandas as pd
import scipy.sparse
import torch
import implicit
from implicit import evaluation
import pydantic
import lancedb
from lancedb.pydantic import pydantic_to_schema, vector

### Load the dataset
To download the dataset, you first need to join the `instacart-market-basket-analysis` competition, which you can do [here](https://www.kaggle.com/competitions/instacart-market-basket-analysis/data).

In [4]:
!kaggle competitions download -c instacart-market-basket-analysis

Downloading instacart-market-basket-analysis.zip to /content
 93% 183M/196M [00:01<00:00, 115MB/s]
100% 196M/196M [00:01<00:00, 118MB/s]


We must now extract the zip files.

In [5]:
files = [
    "instacart-market-basket-analysis.zip",
    "order_products__train.csv.zip",
    "order_products__prior.csv.zip",
    "products.csv.zip",
    "orders.csv.zip",
]

for filename in files:
    with zipfile.ZipFile(filename, "r") as zip_ref:
        zip_ref.extractall("./")

Now we can move on to loading the dataset. We'll first read the CSV files into dataframes.

In [6]:
products = pd.read_csv("products.csv")
orders = pd.read_csv("orders.csv")
order_products = pd.concat(
    [pd.read_csv("order_products__train.csv"), pd.read_csv("order_products__prior.csv")]
)

Since there isn't a user rating attribute, we'll gather "confidence" data by looking at the frequency of each item purchased by a user, and store this in the `data` dataframe.

### Data Manipulation

In [7]:
customer_order_products = pd.merge(orders, order_products, how="inner", on="order_id")

# create confidence table
data = (
    customer_order_products.groupby(["user_id", "product_id"])[["order_id"]]
    .count()
    .reset_index()
)
data.columns = ["user_id", "product_id", "total_orders"]
data.product_id = data.product_id.astype("int64")
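
To make the confidence table concrete, here is a minimal sketch of the same `groupby` pattern on a tiny, hypothetical set of order lines (the `toy_` names and values are invented for illustration and are not part of the Instacart data).

```python
import pandas as pd

# Hypothetical toy data: one row per (order, product) line item.
toy_orders = pd.DataFrame(
    {
        "order_id": [1, 1, 2, 3, 3],
        "user_id": [10, 10, 10, 20, 20],
        "product_id": [100, 200, 100, 100, 300],
    }
)

# Count how many order lines contain each (user, product) pair.
toy_confidence = (
    toy_orders.groupby(["user_id", "product_id"])[["order_id"]]
    .count()
    .reset_index()
)
toy_confidence.columns = ["user_id", "product_id", "total_orders"]
print(toy_confidence)
```

User 10 bought product 100 in two different orders, so that pair gets a count of 2, while every other pair gets 1.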

Let's create a couple of test users to examine the recommendations later:
- 1st test user: buys 50 sodas: **Zero Calorie Cola**
- 2nd test user: buys organic produce: **Organic Whole Milk** and **Organic Blackberries**

In [8]:
data_new = pd.DataFrame(
    [
        [data.user_id.max() + 1, 46149, 50],
        [data.user_id.max() + 2, 27845, 49],
        [data.user_id.max() + 2, 26604, 32],
    ],
    columns=["user_id", "product_id", "total_orders"],
)
data = pd.concat([data, data_new]).reset_index(drop=True)
data.tail()

| | user_id | product_id | total_orders |
|---|---|---|---|
| 13863744 | 206209 | 48697 | 1 |
| 13863745 | 206209 | 48742 | 2 |
| 13863746 | 206210 | 46149 | 50 |
| 13863747 | 206211 | 27845 | 49 |
| 13863748 | 206211 | 26604 | 32 |


In the next step, we will extract user and product unique ids, in order to create a `CSR (Compressed Sparse Row)` matrix. This will allow us to perform collaborative filtering.


In [9]:
# extract unique user and product ids
unique_users = list(np.sort(data.user_id.unique()))
unique_products = list(np.sort(products.product_id.unique()))
purchases = list(data.total_orders)

# map positional index -> user ID
index_to_user = pd.Series(unique_users)

# reverse mapping: user ID -> matrix row
# (position + 1, which equals the user ID when IDs are contiguous and start at 1)
user_to_index = pd.Series(data=index_to_user.index + 1, index=index_to_user.values)

# create row and column for user and product ids
users_rows = data.user_id.astype(int)
products_cols = data.product_id.astype(int)

# create CSR matrix
matrix = scipy.sparse.csr_matrix(
    (purchases, (users_rows, products_cols)),
    shape=(len(unique_users) + 1, len(unique_products) + 1),
)
matrix.data = np.nan_to_num(matrix.data, copy=False)
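
The construction above uses the raw IDs as row and column indices, which is why the shape has one extra row and column. A minimal sketch with hypothetical toy triplets shows the resulting layout:

```python
import numpy as np
import scipy.sparse

# Hypothetical toy triplets: user 1 bought product 2 three times, and so on.
toy_rows = np.array([1, 1, 2])  # user IDs used directly as row indices
toy_cols = np.array([2, 3, 1])  # product IDs used directly as column indices
toy_vals = np.array([3, 1, 5])  # purchase counts

# IDs start at 1, so one extra row and column are needed; index 0 stays empty.
toy_matrix = scipy.sparse.csr_matrix(
    (toy_vals, (toy_rows, toy_cols)), shape=(3, 4)
)
print(toy_matrix.toarray())
```

Only the three non-zero entries are stored, which is what makes this format practical for a user-by-product matrix that is almost entirely empty.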

Let's now create a recommender model using the **implicit** library. The recommendation model is based on the algorithms described in the paper [Collaborative Filtering for Implicit Feedback Datasets](https://www.researchgate.net/publication/220765111_Collaborative_Filtering_for_Implicit_Feedback_Datasets) with performance optimizations described in [Applications of the Conjugate Gradient Method for Implicit Feedback Collaborative Filtering](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.379.6473&rep=rep1&type=pdf).
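
As a rough illustration of the objective in the first paper, the sketch below computes the confidence-weighted squared error with L2 regularization on dense toy arrays. The function name `implicit_als_loss` and the toy values are hypothetical; the `implicit` library optimizes this kind of objective far more efficiently and its internals may differ in detail.

```python
import numpy as np


def implicit_als_loss(r, x, y, alpha, reg):
    """Confidence-weighted squared error plus L2 regularization.

    r: (users, items) raw interaction counts
    x: (users, factors) user factors
    y: (items, factors) item factors
    """
    p = (r > 0).astype(float)  # binary preference p_ui
    c = 1.0 + alpha * r        # confidence c_ui = 1 + alpha * r_ui
    err = p - x @ y.T          # prediction error for every (user, item) pair
    return float((c * err**2).sum() + reg * ((x**2).sum() + (y**2).sum()))


# Toy case: factors that reproduce the preferences exactly,
# so only the regularization term contributes to the loss.
r = np.array([[2.0, 0.0], [0.0, 1.0]])
x = np.eye(2)
y = np.eye(2)
loss = implicit_als_loss(r, x, y, alpha=15, reg=0.05)
print(loss)
```

A larger `alpha` makes observed purchases weigh more heavily than unobserved pairs, which is the role the `alpha` scaling plays when the training matrix is prepared in this notebook.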


# Difference between collaborative and content filtering

![picture](https://miro.medium.com/v2/resize:fit:1400/0*R8qw_CXxCc4600bQ.png)

In [10]:
import os

os.environ["OPENBLAS_NUM_THREADS"] = "1"
# split data into train and test splits
train, test = evaluation.train_test_split(matrix, train_percentage=0.9)

# initialize the recommender model
model = implicit.als.AlternatingLeastSquares(
    factors=128, regularization=0.05, iterations=50, num_threads=1
)

alpha = 15
train = (train * alpha).astype("double")

# train the model on CSR matrix
model.fit(train, show_progress=True)


## Let's now evaluate the model.

In [11]:
test = (test * alpha).astype("double")
evaluation.ranking_metrics_at_k(
    model, train, test, K=100, show_progress=True, num_threads=1
)


{'precision': 0.27477883977578244,
 'map': 0.04505803167409894,
 'ndcg': 0.14491547666623716,
 'auc': 0.6550619166364096}
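
For intuition about the first metric, here is a minimal sketch of the common textbook definition of precision at K: the fraction of the top K recommendations that the user actually interacted with. The helper `precision_at_k` is hypothetical, and `implicit`'s `ranking_metrics_at_k` may normalize differently, so its numbers are not necessarily comparable with this sketch.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the first k recommended items that appear in `relevant`."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k


# Hypothetical example: 2 of the 4 recommended items were held-out purchases.
score = precision_at_k([5, 3, 9, 1], relevant={3, 1, 7}, k=4)
print(score)
```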

From the trained model, we can retrieve the item and user factors, which we will later store in LanceDB as vector embeddings.

In [12]:
model.item_factors[1:3]

array([[-0.01073535,  0.01225309,  0.00282226, -0.00914562,  0.01481111,
         0.00767373, -0.00427731,  0.0056481 ,  0.00795351,  0.00424179,
        -0.00455681, -0.00175643, -0.00220297, -0.0138361 , -0.00829704,
        -0.00559029, -0.01200527,  0.00596893,  0.00808288, -0.01018421,
         0.01595827,  0.00867552,  0.02999683,  0.00679287,  0.00992141,
         0.01169722,  0.00303244,  0.00791476,  0.01493086, -0.00200432,
         0.00475327,  0.01365075, -0.00702923,  0.00941817,  0.00221444,
         0.00278489,  0.01576312,  0.00883053,  0.00070464,  0.00061513,
        -0.00012623,  0.00052815,  0.01637699,  0.00285431,  0.01877954,
         0.01524585, -0.00794455,  0.01723802,  0.00804117,  0.00352978,
         0.01410676, -0.00625158, -0.00453345,  0.02724608,  0.01960974,
        -0.01250265,  0.01295316, -0.00220814,  0.01525659,  0.02175995,
        -0.00712163,  0.02181616,  0.00632107,  0.01416669,  0.00973109,
         0.00702811, -0.00343407, -0.01017761,  0.0

In [13]:
model.user_factors[1:3]

array([[ 2.35114765e+00, -9.82077837e-01,  9.20681953e-02,
        -1.55748022e+00,  2.61008650e-01,  1.38084328e+00,
        -1.04197145e+00,  2.08925948e-01,  1.45271456e+00,
        -4.09525931e-01, -2.79641271e-01, -1.06512582e+00,
        -2.45185947e+00, -8.88424039e-01, -9.62235093e-01,
        -3.62847820e-02, -9.97323275e-01,  3.57037872e-01,
         1.39508307e-01, -7.77906895e-01, -3.02864462e-01,
        -2.49430239e-01,  2.07240963e+00, -1.16224551e+00,
         7.26323247e-01,  1.34066701e-01, -1.00640464e+00,
         6.03325069e-02,  1.24448466e+00,  3.97046000e-01,
        -1.01987794e-01, -2.13813528e-01, -5.79491258e-02,
        -3.17022443e-01,  7.47085869e-01,  1.62657106e+00,
         9.75901306e-01,  1.17893267e+00, -6.45162404e-01,
        -1.40145004e+00, -6.50845766e-01,  4.65424120e-01,
         1.01861715e+00,  1.16076279e+00,  7.42953658e-01,
        -5.01821935e-01,  4.48503673e-01,  3.03975850e-01,
        -8.14426184e-01, -5.65647744e-02,  5.86561143e-0
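
In a matrix factorization model, the predicted affinity between a user and an item is the dot product of their factor vectors. The sketch below illustrates this with small random arrays standing in for `model.user_factors` and `model.item_factors`; the `toy_` names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_user_factors = rng.normal(size=(3, 4))  # 3 hypothetical users, 4 factors each
toy_item_factors = rng.normal(size=(5, 4))  # 5 hypothetical items, 4 factors each

# One score per (user, item) pair: higher means a stronger predicted preference.
scores = toy_user_factors @ toy_item_factors.T

# Indices of the two highest-scoring items for each user.
top2 = np.argsort(-scores, axis=1)[:, :2]
print(top2)
```

Because users and items share one factor space, a user vector can also be used as a query against a table of item vectors.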

## Let's save the data and create an empty LanceDB Table using a Pydantic model.
A Table is designed to store large numbers of columns and huge quantities of data! For those interested, LanceDB is columnar and stores its data in Lance, an open data format.

In [14]:
db = lancedb.connect("data/lancedb")

In [15]:
class ProductModel(pydantic.BaseModel):
    product_id: int
    product_name: str
    vector: vector(128)


schema = pydantic_to_schema(ProductModel)
table_name = "product_recommender"
tbl = db.create_table(table_name, schema=schema, mode="overwrite")

Let's now store our item factors into the table via the vector column of `product_entries`.

In [16]:
# Transform items into factors
items_factors = model.item_factors
product_entries = products[["product_id", "product_name"]].drop_duplicates()
product_entries["product_id"] = product_entries.product_id.astype("int64")
item_embeddings = items_factors[1:].tolist()
product_entries["vector"] = item_embeddings

tbl.add(product_entries)
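
The `[1:]` slice matters because the factor matrix has a placeholder row at index 0 (IDs start at 1), while the product table has no such row. A minimal sketch with hypothetical toy data shows how the slice lines the two up, assuming the product rows are sorted by ID and the IDs are contiguous:

```python
import numpy as np
import pandas as pd

# Hypothetical factor matrix: row i belongs to product ID i, row 0 is unused.
toy_factors = np.arange(8, dtype=float).reshape(4, 2)

toy_products = pd.DataFrame(
    {"product_id": [1, 2, 3], "product_name": ["a", "b", "c"]}
)

# Dropping row 0 leaves exactly one factor row per product, in ID order.
toy_products["vector"] = toy_factors[1:].tolist()
print(toy_products)
```

If the product table were not sorted by ID, or had gaps in its IDs, the vectors would be attached to the wrong products, so checking that assumption on real data is worthwhile.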

## Let's create an ANN index in order to speed up retrieval. This might take a while.

In [17]:
tbl.create_index(num_partitions=256, num_sub_vectors=16)
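
As a rough guide to the two parameters, and assuming the index is of the IVF-PQ kind (which appears to be what `create_index` builds by default, though that is an assumption here): `num_partitions` controls how many coarse clusters the vectors are grouped into, and `num_sub_vectors` controls how many chunks each vector is split into for compression. The sketch below only illustrates the chunking arithmetic for the 128-dimensional vectors used in this notebook; it is not LanceDB's implementation.

```python
import numpy as np

dim, num_sub_vectors = 128, 16
sub_dim = dim // num_sub_vectors  # dimensions covered by each sub-vector

vec = np.arange(dim, dtype=np.float32)  # hypothetical 128-dimensional embedding

# Each row is one sub-vector that a product quantizer would encode separately.
sub_vectors = vec.reshape(num_sub_vectors, sub_dim)
print(sub_vectors.shape)
```

With these settings each sub-vector covers 8 dimensions, so the vector dimension needs to divide evenly by `num_sub_vectors` for this split to work.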

This is a helper function for analysing recommendations later.
It returns the top N products that a user bought in the past, ranked by how often each product was ordered.

In [18]:
def products_bought_by_user_in_the_past(user_id: int, top: int = 10):
    selected = data[data.user_id == user_id].sort_values(
        by=["total_orders"], ascending=False
    )

    selected["product_name"] = selected["product_id"].map(
        product_entries.set_index("product_id")["product_name"]
    )
    selected = selected[["product_id", "product_name", "total_orders"]].reset_index(
        drop=True
    )
    # head() returns all rows when fewer than `top` are available
    return selected.head(top)

Let's retrieve our test users so we can query for recommendations.

In [19]:
test_user_ids = [206210, 206211]
test_user_factors = model.user_factors[user_to_index[test_user_ids]]
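
The lookup above goes from user IDs to matrix rows via a pandas Series and then selects those rows from the factor array. A minimal sketch of the same pattern on hypothetical toy data:

```python
import numpy as np
import pandas as pd

toy_user_ids = [1, 2, 3]
toy_index_to_user = pd.Series(toy_user_ids)

# User ID -> matrix row (position + 1, since row 0 is the unused placeholder).
toy_user_to_index = pd.Series(
    data=toy_index_to_user.index + 1, index=toy_index_to_user.values
)

toy_factors = np.arange(8.0).reshape(4, 2)  # hypothetical factors, row 0 unused

# Label-based lookup for several user IDs, then row selection in the array.
rows = toy_user_to_index[[2, 3]].to_numpy()
selected = toy_factors[rows]
print(selected)
```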

## Let's now query LanceDB to retrieve recommendations.

In [20]:
# Query by user factors
test_user_embeddings = test_user_factors.tolist()
for embedding, user_id in zip(test_user_embeddings, test_user_ids):
    results = tbl.search(embedding).limit(10).to_pandas()
    display(results)
    display(products_bought_by_user_in_the_past(user_id, top=15))

| | product_id | product_name | vector | _distance |
|---|---|---|---|---|
| 0 | 46149 | Zero Calorie Cola | [0.037515923, -0.030325921, 0.004221245, -0.00...] | 38.190578 |
| 1 | 196 | Soda | [0.04531822, -0.04450815, -0.0022076364, -0.02...] | 38.34008 |
| 2 | 22802 | Mineral Water | [0.030236538, -0.0041136313, 0.015683502, -0.0...] | 38.593525 |
| 3 | 40939 | Drinking Water | [0.03287196, -0.017454194, 0.009911481, -0.004...] | 38.606468 |
| 4 | 31651 | Extra Fancy Unsalted Mixed Nuts | [0.037796307, -0.009871203, -0.0020715303, -0....] | 38.642967 |
| 5 | 37710 | Trail Mix | [0.05062829, -0.017916694, 0.0027849572, 0.001...] | 38.668938 |
| 6 | 41400 | Crunchy Oats 'n Honey Granola Bars | [0.028622035, -0.013106515, -0.0072577046, -0....] | 38.703171 |
| 7 | 26348 | Mixed Fruit Fruit Snacks | [0.011525251, -0.032522, -0.021976499, 0.01198...] | 38.709934 |
| 8 | 46061 | Popcorn | [0.039293304, -0.016017294, -0.0010792917, 0.0...] | 38.713402 |
| 9 | 39657 | Milk Chocolate Almonds | [0.030015469, -0.00927157, 0.0061932686, 0.000...] | 38.748997 |


| | product_id | product_name | total_orders |
|---|---|---|---|
| 0 | 46149 | Zero Calorie Cola | 50 |


| | product_id | product_name | vector | _distance |
|---|---|---|---|---|
| 0 | 26604 | Organic Blackberries | [0.019478824, 0.007443799, 0.004226536, 0.0283...] | 16.314867 |
| 1 | 27845 | Organic Whole Milk | [-0.03417227, -0.053161107, 0.03893201, 0.0150...] | 16.432335 |
| 2 | 27966 | Organic Raspberries | [0.024305355, -0.0063351737, 0.029324768, 0.02...] | 16.577738 |
| 3 | 43352 | Raspberries | [0.020642506, 0.025494106, 0.0050161625, 0.003...] | 16.588812 |
| 4 | 21288 | Blackberries | [-0.00844225, 0.01996236, -0.0148576135, 0.012...] | 16.672234 |
| 5 | 39275 | Organic Blueberries | [0.035410225, -0.0029810749, 0.014112177, 0.00...] | 16.684757 |
| 6 | 11777 | Red Raspberries | [0.020807281, -0.015660688, 0.010914551, 0.028...] | 16.746056 |
| 7 | 9076 | Blueberries | [0.033343736, 0.0068411743, 0.0028535812, 0.00...] | 16.765997 |
| 8 | 21137 | Organic Strawberries | [0.018478896, -0.0014569649, 0.01558258, 0.009...] | 16.883642 |
| 9 | 11422 | Plain Greek Yogurt | [0.003926732, -0.02004065, 0.059874147, 0.0318...] | 17.008499 |


| | product_id | product_name | total_orders |
|---|---|---|---|
| 0 | 27845 | Organic Whole Milk | 49 |
| 1 | 26604 | Organic Blackberries | 32 |
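
To see what the `_distance` column plausibly represents, here is a minimal brute-force sketch of nearest-neighbour search under squared L2 distance. It assumes L2 is the metric in use (the search above does not set one explicitly, so this is an assumption), and the helper `brute_force_search` is hypothetical. An ANN index approximates this kind of exhaustive scan without comparing the query against every row.

```python
import numpy as np


def brute_force_search(query, vectors, limit):
    """Return indices and squared L2 distances of the `limit` closest rows."""
    dists = ((vectors - query) ** 2).sum(axis=1)
    order = np.argsort(dists)[:limit]
    return order, dists[order]


# Hypothetical item vectors and a query that coincides with the second one.
toy_vectors = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]])
idx, dist = brute_force_search(np.array([1.0, 0.0]), toy_vectors, limit=2)
print(idx, dist)
```

Ranking by distance is not the same as ranking by dot product, so the order returned by a distance-based search can differ from the scores the factorization model itself would assign.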
