### Product Recommendation System

Dataset:
https://www.kaggle.com/datasets/yasserh/instacart-online-grocery-basket-analysis-dataset

Overview:\
A machine learning program that suggests groceries to customers based on their purchase history.

Uses the nearest-neighbors search (`NearestNeighbors`) from scikit-learn to perform user-based collaborative filtering.

**SCROLL TO BOTTOM FOR RESULTS**

---

Importing necessary libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib.dates as mdates
from datetime import datetime
import sklearn
import torch

Next, let's read the data from the necessary csv files, and manipulate/join them as needed.

In [2]:
ord_prods = pd.read_csv('order_products__train.csv')
ord_prods

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,1,49302,1,1
1,1,11109,2,1
2,1,10246,3,0
3,1,49683,4,0
4,1,43633,5,1
...,...,...,...,...
1384612,3421063,14233,3,1
1384613,3421063,35548,4,1
1384614,3421070,35951,1,1
1384615,3421070,16953,2,1


In [3]:
products = pd.read_csv('products.csv')
products

Unnamed: 0,product_id,product_name,aisle_id,department_id
0,1,Chocolate Sandwich Cookies,61,19
1,2,All-Seasons Salt,104,13
2,3,Robust Golden Unsweetened Oolong Tea,94,7
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1
4,5,Green Chile Anytime Sauce,5,13
...,...,...,...,...
49683,49684,"Vodka, Triple Distilled, Twist of Vanilla",124,5
49684,49685,En Croute Roast Hazelnut Cranberry,42,1
49685,49686,Artisan Baguette,112,3
49686,49687,Smartblend Healthy Metabolism Dry Cat Food,41,8


Since both dataframes have a product_id column, let's merge them on it.

In [4]:
merged = pd.merge(ord_prods, products, on='product_id', how='inner')
merged

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id
0,1,49302,1,1,Bulgarian Yogurt,120,16
1,1,11109,2,1,Organic 4% Milk Fat Whole Milk Cottage Cheese,108,16
2,1,10246,3,0,Organic Celery Hearts,83,4
3,1,49683,4,0,Cucumber Kirby,83,4
4,1,43633,5,1,Lightly Smoked Sardines in Olive Oil,95,15
...,...,...,...,...,...,...,...
1384612,3421063,14233,3,1,Natural Artesian Water,115,7
1384613,3421063,35548,4,1,Twice Baked Potatoes,13,20
1384614,3421070,35951,1,1,Organic Unsweetened Almond Milk,91,16
1384615,3421070,16953,2,1,Creamy Peanut Butter,88,13


In [5]:
# Double check that the merge worked
merged.shape[0] == ord_prods.shape[0]

True
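
The row-count check above confirms that the merge neither dropped nor duplicated order lines. As a complementary sketch, pandas can enforce the same expectation at merge time through the `validate` argument of `pd.merge`. The tiny frames below (`toy_ord_prods`, `toy_products`) are made-up stand-ins, not the Instacart data.

```python
import pandas as pd

# Toy stand-ins for ord_prods and products (hypothetical values).
toy_ord_prods = pd.DataFrame({
    "order_id": [1, 1, 2],
    "product_id": [10, 11, 10],
})
toy_products = pd.DataFrame({
    "product_id": [10, 11, 12],
    "product_name": ["Bananas", "Milk", "Bread"],
})

# validate="many_to_one" raises pandas.errors.MergeError if product_id
# is duplicated on the right-hand side, which would silently add rows.
toy_merged = pd.merge(
    toy_ord_prods, toy_products,
    on="product_id", how="inner", validate="many_to_one",
)

# Same sanity check as in the notebook: every order line kept exactly once.
rows_preserved = toy_merged.shape[0] == toy_ord_prods.shape[0]
print(rows_preserved)
```

With `validate` set, a duplicated key in the lookup table fails loudly at the merge instead of surfacing later as an unexpected row count.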

In [6]:
orders = pd.read_csv('orders.csv')
orders

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0
...,...,...,...,...,...,...,...
3421078,2266710,206209,prior,10,5,18,29.0
3421079,1854736,206209,prior,11,4,10,30.0
3421080,626363,206209,prior,12,1,12,18.0
3421081,2977660,206209,prior,13,1,12,7.0


Since orders shares the order_id column with merged, I join it as well.

In [7]:
merged = pd.merge(merged, orders, on='order_id', how='inner')
merged

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,1,49302,1,1,Bulgarian Yogurt,120,16,112108,train,4,4,10,9.0
1,1,11109,2,1,Organic 4% Milk Fat Whole Milk Cottage Cheese,108,16,112108,train,4,4,10,9.0
2,1,10246,3,0,Organic Celery Hearts,83,4,112108,train,4,4,10,9.0
3,1,49683,4,0,Cucumber Kirby,83,4,112108,train,4,4,10,9.0
4,1,43633,5,1,Lightly Smoked Sardines in Olive Oil,95,15,112108,train,4,4,10,9.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1384612,3421063,14233,3,1,Natural Artesian Water,115,7,169679,train,30,0,10,4.0
1384613,3421063,35548,4,1,Twice Baked Potatoes,13,20,169679,train,30,0,10,4.0
1384614,3421070,35951,1,1,Organic Unsweetened Almond Milk,91,16,139822,train,15,6,10,8.0
1384615,3421070,16953,2,1,Creamy Peanut Butter,88,13,139822,train,15,6,10,8.0


Now that we have one comprehensive df and no more csv files to merge, let's preprocess the df. In the next cell I'll do the following:

- Slightly adjust column names to make them more readable
- Drop unnecessary columns
- Cast 'days_since_prev_order' to int (instead of float)

In [8]:
cleaned_merge = merged.rename(columns={'add_to_cart_order': 'cart_position', 'days_since_prior_order': 'days_since_prev_order'})
cleaned_merge = cleaned_merge.drop(['eval_set', 'product_name'], axis=1)
cleaned_merge['days_since_prev_order'] = cleaned_merge['days_since_prev_order'].astype(int)
cleaned_merge

Unnamed: 0,order_id,product_id,cart_position,reordered,aisle_id,department_id,user_id,order_number,order_dow,order_hour_of_day,days_since_prev_order
0,1,49302,1,1,120,16,112108,4,4,10,9
1,1,11109,2,1,108,16,112108,4,4,10,9
2,1,10246,3,0,83,4,112108,4,4,10,9
3,1,49683,4,0,83,4,112108,4,4,10,9
4,1,43633,5,1,95,15,112108,4,4,10,9
...,...,...,...,...,...,...,...,...,...,...,...
1384612,3421063,14233,3,1,115,7,169679,30,0,10,4
1384613,3421063,35548,4,1,13,20,169679,30,0,10,4
1384614,3421070,35951,1,1,91,16,139822,15,6,10,8
1384615,3421070,16953,2,1,88,13,139822,15,6,10,8
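
The `astype(int)` cast above succeeds only when the column has no missing values. In `orders.csv`, a user's first order has no prior order, so `days_since_prior_order` is empty there (see the first row of the `orders` preview). The sketch below, on an invented three-value column, shows what happens in that case and one possible workaround using pandas' nullable `Int64` dtype. It is an illustration under those assumptions, not a change to the notebook's pipeline.

```python
import numpy as np
import pandas as pd

# Toy column mimicking days_since_prior_order, where a user's first
# order has no prior order and is therefore NaN (hypothetical values).
days = pd.Series([np.nan, 15.0, 21.0], name="days_since_prev_order")

# Plain int cannot represent NaN, so astype(int) raises on this column.
try:
    days.astype(int)
    plain_cast_ok = True
except (ValueError, TypeError):
    plain_cast_ok = False

# The nullable "Int64" dtype keeps missing values as <NA>.
days_nullable = days.astype("Int64")

print(plain_cast_ok)
print(days_nullable.isna().sum())
```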


Next, I create the user-product matrix that the nearest-neighbors search runs on.

Each row represents a user and each column a product. Each cell is filled from the `reordered` column, so a nonzero entry marks a product the user has reordered. The matrix is used to find users with similar baskets and to recommend new items based on what those users bought.

In [9]:
unique_users = cleaned_merge['user_id'].unique()
unique_items = cleaned_merge['product_id'].unique()

user2idx = {user_id: idx for idx, user_id in enumerate(unique_users)}
item2idx = {item_id: idx for idx, item_id in enumerate(unique_items)}

temp_u = cleaned_merge['user_id'].map(user2idx)
temp_i = cleaned_merge['product_id'].map(item2idx)

num_users = len(user2idx)
num_items = len(item2idx)

In [10]:
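# Stack the row/column indices into one (2, nnz) array first; building a tensor
# from a list of separate NumPy arrays is slow and triggers a PyTorch warning.
indices = torch.from_numpy(np.vstack([temp_u.values, temp_i.values])).long()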

values = torch.FloatTensor(cleaned_merge['reordered'].values)
shape = torch.Size([num_users, num_items])
user_item_sparse = torch.sparse_coo_tensor(indices, values, shape)
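
To make the index mapping concrete, here is a minimal sketch on a made-up purchase log. It uses a dense NumPy array instead of a torch sparse tensor purely for readability, and all names prefixed with `toy` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical purchase log with the same three columns the notebook uses.
toy = pd.DataFrame({
    "user_id":    [7, 7, 9, 9],
    "product_id": [100, 200, 200, 300],
    "reordered":  [1, 0, 1, 1],
})

# Map raw ids to contiguous row/column indices, as in the cell above.
toy_user2idx = {u: i for i, u in enumerate(toy["user_id"].unique())}
toy_item2idx = {p: i for i, p in enumerate(toy["product_id"].unique())}
rows = toy["user_id"].map(toy_user2idx).to_numpy()
cols = toy["product_id"].map(toy_item2idx).to_numpy()

# Accumulate the reordered flags into a users x products matrix.
toy_matrix = np.zeros((len(toy_user2idx), len(toy_item2idx)), dtype=np.float32)
np.add.at(toy_matrix, (rows, cols), toy["reordered"].to_numpy(dtype=np.float32))

print(toy_matrix)
```

Note that user 7 bought product 200 with `reordered == 0`, so that cell stays at zero: with this encoding, a first-time purchase is indistinguishable from a product the user never bought.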


Fitting a nearest-neighbors model (cosine distance) on the user-item matrix

In [11]:
from sklearn.neighbors import NearestNeighbors

knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=10)
knn.fit(user_item_sparse.to_dense())
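
Calling `to_dense()` materializes every user-product cell, which can be memory-hungry for a large catalog. As a hedged alternative (not what this notebook does), scikit-learn's brute-force `NearestNeighbors` with the cosine metric also accepts SciPy sparse input. The toy matrix and the `toy_*` names below are invented for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Hypothetical 4 users x 3 products interaction matrix, stored sparsely.
toy_dense = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 1, 1],
], dtype=np.float32)
toy_sparse = csr_matrix(toy_dense)

# Brute-force cosine search works directly on the CSR matrix.
toy_knn = NearestNeighbors(metric="cosine", algorithm="brute", n_neighbors=2)
toy_knn.fit(toy_sparse)

# The two nearest rows to user 0 are user 0 itself and user 1,
# whose row is identical (cosine distance close to 0).
toy_dists, toy_nbrs = toy_knn.kneighbors(toy_sparse[0])
print(toy_nbrs[0], toy_dists[0])
```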

In [12]:
# Convert the sparse user-item matrix to a dense tensor for easier indexing
user_item = user_item_sparse.to_dense()
# Reverse lookup: matrix column index -> raw product_id
idx2item = {idx: raw_item_id for raw_item_id, idx in item2idx.items()}

Last, I define a function that produces nearest-neighbor recommendations for an individual user.

In [13]:
def recommend_for_user(raw_user_id, top_k_purchased=3, top_k_recs=3):
    """Print a user's top purchased items and items recommended from their nearest neighbors."""
    u_idx = user2idx[raw_user_id]
    u_vec = user_item[u_idx]
    # Column indices with a positive entry in this user's row
    purchased_indices = np.where(u_vec > 0)[0]

    # Highest-valued entries first
    top_purchased = purchased_indices[np.argsort(-u_vec[purchased_indices])][:top_k_purchased]
    top_purchased_raw = [idx2item[i] for i in top_purchased]

    # Ask for one extra neighbor, since the user's own row is in the fitted data
    dists, nbrs = knn.kneighbors(u_vec.reshape(1, -1), n_neighbors=knn.n_neighbors+1)
    neighbor_idxs = [n for n in nbrs[0] if n != u_idx][:knn.n_neighbors]

    # Score each product by summing the neighbors' rows
    neighbor_matrix = user_item[neighbor_idxs]
    agg_scores = neighbor_matrix.sum(axis=0)

    # Exclude products that already have a positive entry in the user's own row
    agg_scores[u_vec > 0] = 0

    rec_idxs = np.argsort(-agg_scores)[:top_k_recs]
    rec_raw_ids = [idx2item[int(i)] for i in rec_idxs]

    # Note: isin() returns names in products-table order, not in rank order
    top_purchased_names = products.loc[products.product_id.isin(top_purchased_raw),
                                          'product_name'].tolist()
    rec_names = products.loc[products.product_id.isin(rec_raw_ids),
                                          'product_name'].tolist()

    print(f"\nUser {raw_user_id}:")
    print("  Top purchased items:")
    for name in top_purchased_names:
        print("   -", name)
    print("  Recommended items:")
    for name in rec_names:
        print("   →", name)
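
The scoring logic inside `recommend_for_user` can be illustrated in isolation. The sketch below is a simplified NumPy re-implementation on an invented matrix. The helper name `toy_recommend` is hypothetical, and it computes cosine similarity by hand rather than calling scikit-learn, so treat it as an illustration of the idea rather than the notebook's exact code path.

```python
import numpy as np

def toy_recommend(matrix, u_idx, n_neighbors=2, top_k=1):
    """Return column indices recommended for row u_idx (illustrative only)."""
    # Cosine similarity between the target row and every row.
    norms = np.linalg.norm(matrix, axis=1)
    norms[norms == 0] = 1.0  # avoid division by zero for empty rows
    sims = (matrix @ matrix[u_idx]) / (norms * norms[u_idx])
    sims[u_idx] = -np.inf  # never pick the user as their own neighbor

    # Most similar users first.
    neighbors = np.argsort(-sims)[:n_neighbors]

    # Sum the neighbors' rows, then hide items the user already has.
    scores = matrix[neighbors].sum(axis=0)
    scores[matrix[u_idx] > 0] = 0
    return np.argsort(-scores)[:top_k].tolist()

# Hypothetical 4 users x 4 products matrix.
toy_matrix = np.array([
    [1, 1, 0, 0],   # target user
    [1, 1, 1, 0],   # similar, also has product 2
    [1, 0, 1, 0],   # somewhat similar, also has product 2
    [0, 0, 0, 1],   # unrelated
], dtype=float)

print(toy_recommend(toy_matrix, u_idx=0))
```

Here users 1 and 2 are the nearest neighbors of user 0, and the only product they hold that user 0 lacks is product 2, so that column is recommended.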


**Final Result**

In [14]:
recommend_for_user(raw_user_id=5)


User 5:
  Top purchased items:
   - Organic Raw Agave Nectar
   - Organic Baby Arugula
   - Organic Grape Tomatoes
  Recommended items:
   → Bag of Organic Bananas
   → Goat Cheese Crumbles
   → Organic Avocado


Email: rohan11parekh@gmail.com\
LinkedIn: https://www.linkedin.com/in/rohan-parekh-39b070225/