# Retail order picking
@Luis Gerardo Baeza

## Product substitute matching with BigQuery

**Challenge**
To fulfill customers' online orders, retailers need to respond to unexpected stockouts or product unavailability while the order is being picked in the store. A proper substitution engine must consider not only static hard rules such as dietary restrictions (vegan, gluten-free) but also the customer's prior product and brand preferences. Real-time stock is another factor that may be considered.

The ultimate goal is not only to improve the fill rate relative to the found rate, but also to deliver substitution suggestions good enough that the NPS reflects customers' approval of the alternative products.
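
To make the two metrics concrete, here is a small sketch with made-up numbers. It assumes the common definitions, which the text above does not spell out: found rate counts only items picked exactly as ordered, while fill rate also counts accepted substitutes. The function names and the order data are hypothetical.

```python
def found_rate(ordered: int, found: int) -> float:
    """Share of ordered items picked exactly as requested (assumed definition)."""
    return found / ordered


def fill_rate(ordered: int, found: int, substituted: int) -> float:
    """Share of ordered items delivered, counting accepted substitutes (assumed definition)."""
    return (found + substituted) / ordered


# Hypothetical order: 20 items, 17 on the shelf, 2 of the 3 missing items replaced.
ordered, found, substituted = 20, 17, 2
print(f"found rate: {found_rate(ordered, found):.0%}")
print(f"fill rate:  {fill_rate(ordered, found, substituted):.0%}")
```

Under these assumptions, the gap between the two numbers is the part of the order that the substitution engine recovers.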

![product-substitute.png](product-substitute.png)

### 1. Acquire public dataset
#### 1.1 Download dataset

https://www.kaggle.com/datasets/bhavikjikadara/grocery-store-dataset?resource=download

#### 1.2 Quickly inspect data

In [1]:
import pandas as pd
df = pd.read_csv("GroceryDataset.csv")

In [14]:
df.columns

Index(['Sub Category', 'Price', 'Discount', 'Rating', 'Title', 'Currency',
       'Feature', 'Product Description'],
      dtype='object')

In [2]:
df.head()

| | Sub Category | Price | Discount | Rating | Title | Currency | Feature | Product Description |
|---|---|---|---|---|---|---|---|---|
| 0 | Bakery & Desserts | \$56.99 | No Discount | Rated 4.3 out of 5 stars based on 265 reviews. | David’s Cookies Mile High Peanut Butter Cake, ... | \$ | "10"" Peanut Butter Cake\nCertified Kosher OU-... | A cake the dessert epicure will die for!Our To... |
| 1 | Bakery & Desserts | \$159.99 | No Discount | Rated 5 out of 5 stars based on 1 reviews. | The Cake Bake Shop 8" Round Carrot Cake (16-22... | \$ | Spiced Carrot Cake with Cream Cheese Frosting ... | Due to the perishable nature of this item, ord... |
| 2 | Bakery & Desserts | \$44.99 | No Discount | Rated 4.1 out of 5 stars based on 441 reviews. | St Michel Madeleine, Classic French Sponge Cak... | \$ | 100 count\nIndividually wrapped\nMade in and I... | Moist and buttery sponge cakes with the tradit... |
| 3 | Bakery & Desserts | \$39.99 | No Discount | Rated 4.7 out of 5 stars based on 9459 reviews. | David's Cookies Butter Pecan Meltaways 32 oz, ... | \$ | Butter Pecan Meltaways\n32 oz 2-Pack\nNo Prese... | These delectable butter pecan meltaways are th... |
| 4 | Bakery & Desserts | \$59.99 | No Discount | Rated 4.5 out of 5 stars based on 758 reviews. | David’s Cookies Premier Chocolate Cake, 7.2 lb... | \$ | "10" Four Layer Chocolate Cake\nCertified Kosh... | A cake the dessert epicure will die for!To the... |


In [3]:
df.shape

(1757, 8)
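
The preview shows that `Price` and `Rating` arrive as free text (`$56.99`, `Rated 4.3 out of 5 stars based on 265 reviews.`). If numeric values are needed later, for example to keep substitutes within a price band, they could be parsed along these lines. This is a minimal sketch: `parse_price` and `parse_rating` are hypothetical helpers, and the patterns only cover the formats visible in the five rows above, so other rows may need extra handling.

```python
import re

# Matches the rating text seen in the preview, including "1 reviews."
RATING_RE = re.compile(r"Rated ([\d.]+) out of 5 stars based on (\d+) reviews?\.")


def parse_price(raw: str) -> float:
    """Convert a string such as '$56.99' to a float."""
    return float(raw.replace("$", "").replace(",", ""))


def parse_rating(raw: str):
    """Return (stars, review_count), or None when the text does not match."""
    match = RATING_RE.search(raw)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2))


print(parse_price("$56.99"))
print(parse_rating("Rated 4.3 out of 5 stars based on 265 reviews."))
```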

### 2. Dataset preparation

#### 2.1 Dataset upload to BigQuery

In [33]:
from google.cloud import bigquery
client = bigquery.Client()

dataset_id = "ecommerce"
table_id = "grocery_catalog_tmp"

In [34]:

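# Build the table reference explicitly; client.dataset() is deprecated.
table_ref = bigquery.DatasetReference(client.project, dataset_id).table(table_id)

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.CSV
job_config.autodetect = True
job_config.quote_character = '"'
# Feature values contain embedded line breaks inside quoted fields.
job_config.allow_quoted_newlines = True
# Overwrite the table if it already exists.
job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE

# Close the file handle once the upload has been submitted.
with open("GroceryDataset.csv", "rb") as source_file:
    load_job = client.load_table_from_file(
        source_file,
        table_ref,
        job_config=job_config)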

In [35]:
load_job.result()

LoadJob<project=lgbaeza-202310, location=US, id=ee2a1370-f379-47e1-aacc-4f2b34b0e13e>

In [36]:
print("Loaded {} rows into {}:{}".format(load_job.output_rows, dataset_id, table_id))

Loaded 1758 rows into ecommerce:grocery_catalog_tmp
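
The load configuration sets `allow_quoted_newlines` because fields such as `Feature` contain line breaks inside quotes, so one product can span several physical lines of the file. The sketch below illustrates this with Python's `csv` module on a made-up three-line sample (the products and values are hypothetical). It also shows how a header row that is not skipped adds one to the record count, which would be consistent with the 1758 rows reported above against 1757 in `df.shape`.

```python
import csv
import io

# Hypothetical sample: a header plus two products whose Feature field spans two lines.
sample = (
    'Sub Category,Title,Feature\n'
    'Bakery & Desserts,Carrot Cake,"Spiced Carrot Cake\nServes 16-22"\n'
    'Snacks,Trail Mix,"2 lb bag\nResealable"\n'
)

physical_lines = len(sample.splitlines())
records = list(csv.reader(io.StringIO(sample, newline="")))
data_rows = len(records) - 1  # the first record is the header

print("physical lines:", physical_lines)
print("records including header:", len(records))
print("data rows:", data_rows)
```

Counting physical lines overstates the number of products, and counting records without skipping the header overstates it by exactly one.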


#### 2.2 Clean column names

In [38]:
%%bigquery
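-- The load reported 1758 rows vs. 1757 in the DataFrame, which suggests the
-- CSV header was ingested as data; the WHERE clause drops that row.
CREATE OR REPLACE TABLE ecommerce.grocery_catalog AS
SELECT
  string_field_0 AS SubCategory, string_field_1 AS Price,
  string_field_2 AS Discount, string_field_3 AS Rating,
  string_field_4 AS Title, string_field_5 AS Currency,
  string_field_6 AS Feature, string_field_7 AS ProductDescription
FROM ecommerce.grocery_catalog_tmp
WHERE string_field_0 IS DISTINCT FROM 'Sub Category';

DROP TABLE ecommerce.grocery_catalog_tmp;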
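
The column list in the statement above can also be generated rather than typed by hand. The sketch below builds the same kind of `SELECT` from the header names shown by `df.columns`. `build_rename_select` is a hypothetical helper, and it assumes autodetect named the columns `string_field_0` to `string_field_7` in file order, as the query above implies.

```python
def build_rename_select(columns, source_table):
    """Build a SELECT that maps string_field_N to a cleaned column name."""
    parts = []
    for position, name in enumerate(columns):
        clean = "".join(name.split())  # 'Sub Category' -> 'SubCategory'
        parts.append(f"string_field_{position} AS {clean}")
    return "SELECT\n  " + ",\n  ".join(parts) + f"\nFROM {source_table}"


columns = ["Sub Category", "Price", "Discount", "Rating",
           "Title", "Currency", "Feature", "Product Description"]
sql = build_rename_select(columns, "ecommerce.grocery_catalog_tmp")
print(sql)
```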


#### 2.3 Validate upload

In [40]:
%%bigquery
select SubCategory from ecommerce.grocery_catalog group by 1


| | SubCategory |
|---|---|
| 0 | Snacks |
| 1 | Meat & Seafood |
| 2 | Poultry |
| 3 | Gift Baskets |
| 4 | Deli |
| 5 | Household |
| 6 | Coffee |
| 7 | Pantry & Dry Goods |
| 8 | Breakfast |
| 9 | Kirkland Signature Grocery |


### 3. Engine development
#### 3.1 Vertex Connection
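Create a Cloud resource connection by following the docs: https://docs.cloud.google.com/bigquery/docs/create-cloud-resource-connection#create-cloud-resource-connection

The model in 3.2 refers to it as `us.vertex`, that is, a connection named `vertex` in the `us` location. The connection's service account also needs the Vertex AI User role before it can call the embedding endpoint.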
 

#### 3.2 Create embeddings model using the Vertex Connection

In [41]:
%%bigquery
CREATE OR REPLACE MODEL ecommerce.text_embedding
  REMOTE WITH CONNECTION `us.vertex`
  OPTIONS(ENDPOINT = 'text-embedding-005');


#### 3.3 Generate embeddings for product description

In [None]:
%%bigquery
CREATE OR REPLACE TABLE `ecommerce.grocery_catalog_embed` AS
SELECT
    * except (ml_generate_embedding_result),
    ml_generate_embedding_result AS embedding
FROM
    ML.GENERATE_EMBEDDING(
        MODEL `ecommerce.text_embedding`,
        (
            SELECT
                *,
                CONCAT(SubCategory, " ", Title, " ", ProductDescription) AS content
            FROM ecommerce.grocery_catalog
            WHERE SubCategory IS NOT NULL
              AND Title IS NOT NULL
              AND ProductDescription IS NOT NULL
        ),
        STRUCT('RETRIEVAL_DOCUMENT' AS task_type)
    );
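
The null filter in the inner query matters because BigQuery's `CONCAT` returns `NULL` when any argument is `NULL`, so a row with a missing field would reach the model with no `content` at all. A plain-Python sketch of that behaviour, using a hypothetical helper `build_content` and made-up values:

```python
from typing import Optional


def build_content(sub_category: Optional[str], title: Optional[str],
                  description: Optional[str]) -> Optional[str]:
    """Mimic CONCAT(SubCategory, " ", Title, " ", ProductDescription)."""
    parts = (sub_category, title, description)
    if any(part is None for part in parts):
        return None  # CONCAT yields NULL if any input is NULL
    return " ".join(parts)


print(build_content("Snacks", "Trail Mix", "Nuts and dried fruit."))
print(build_content("Snacks", "Trail Mix", None))
```

An alternative to dropping such rows would be to substitute an empty string for the missing field, at the cost of embedding less informative text.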


#### 3.4 Generate substitutions using Vector search

In [46]:
%%bigquery
create or replace table ecommerce.substitutions as
WITH substitution_results AS (
  SELECT
    -- `query` is the product being looked up; `base` is a neighbor found for it
    query.Title AS original_product,
    query.SubCategory AS category,
    base.Title AS substitute_product,
    base.Price AS substitute_price,
    distance,
    -- Rank the neighbors of each original product, closest first
    ROW_NUMBER() OVER(PARTITION BY query.Title ORDER BY distance ASC) as rank_order
  FROM
    VECTOR_SEARCH(
      TABLE `ecommerce.grocery_catalog_embed`,
      'embedding',
      TABLE `ecommerce.grocery_catalog_embed`,
      'embedding',
      top_k => 6, -- Fetch more than the two substitutes kept below, because the closest match is always the product itself
      distance_type => 'COSINE'
    )
)
SELECT
    original_product,
    category,
    MAX(CASE WHEN rank_order = 2 THEN substitute_product END) AS substitute_rank_1,
    MAX(CASE WHEN rank_order = 3 THEN substitute_product END) AS substitute_rank_2
FROM substitution_results
WHERE rank_order > 1 -- Exclude the exact match with the product itself
GROUP BY 1, 2;
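
For intuition, the same nearest-neighbor idea can be sketched in plain Python. This is a brute-force illustration and not how BigQuery executes `VECTOR_SEARCH`: it compares every product with every other product by cosine distance and keeps the two closest. Unlike the SQL above, it skips the product itself explicitly instead of relying on it being rank 1. The product names and 2-D vectors are made up; real embeddings come from `ML.GENERATE_EMBEDDING` and have many more dimensions.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity, the metric selected by distance_type => 'COSINE'."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def top_substitutes(catalog, n=2):
    """For each product, return the n nearest other products by cosine distance."""
    result = {}
    for title, vector in catalog.items():
        ranked = sorted(
            (other for other in catalog if other != title),
            key=lambda other: cosine_distance(vector, catalog[other]),
        )
        result[title] = ranked[:n]
    return result


# Hypothetical 2-D embeddings for four made-up products.
catalog = {
    "Chocolate Cake": [1.0, 0.1],
    "Carrot Cake": [0.9, 0.2],
    "Butter Cookies": [0.7, 0.6],
    "Ground Coffee": [0.1, 1.0],
}
substitutions = top_substitutes(catalog)
for product, substitutes in substitutions.items():
    print(product, "->", substitutes)
```

The first and second entries of each list play the role of `substitute_rank_1` and `substitute_rank_2` in the table built above.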


![substitutes-dashboard.png](substitutes-dashboard.png)