
![Crisp](img/logo.png)
# AI-Powered Product Categorization

[![Open in Colab](https://img.shields.io/badge/Open%20in-Colab-orange?logo=google-colab&style=for-the-badge)](https://colab.research.google.com/github/gocrisp/blueprints/blob/main/notebooks/crisp_product_categorization.ipynb)
[![Open in Vertex AI](https://img.shields.io/badge/Open%20in-Vertex%20AI%20Workbench-brightgreen?logo=google-cloud&style=for-the-badge)](https://console.cloud.google.com/vertex-ai/notebooks/deploy-notebook?download_url=https://raw.githubusercontent.com/gocrisp/blueprints/main/notebooks/crisp_product_categorization.ipynb)
[![Open in Databricks](https://img.shields.io/badge/Try-databricks-red?logo=databricks&style=for-the-badge)](https://www.databricks.com/try-databricks)
[![View on GitHub](https://img.shields.io/badge/View%20on-GitHub-lightgrey?logo=github&style=for-the-badge)](https://github.com/gocrisp/blueprints/blob/main/notebooks/crisp_product_categorization.ipynb)

> To deploy a notebook in Databricks:
> 1. Open your workspace and navigate to the folder where you want to import the notebook.
> 2. Click the triple-dot icon (next to the Share button).
> 3. Select Import and choose URL as the import method.
> 4. Paste the notebook's URL and click Import to complete the process.

This notebook helps you check that products are categorized accurately by using a large language model (LLM) to predict the most appropriate category for each product. If the predicted category differs from the existing one, the product is flagged for review.

## Set the required environment variables
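We will store your account ID in an environment variable so that later cells can use it to access your account data. If `OPENAI_API_KEY` is not already set in your environment, uncomment that line and supply your key.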

In [1]:
import os

os.environ["ACCOUNT_ID"] = "999999"
# os.environ["OPENAI_API_KEY"] = ""
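`ChatOpenAI` reads the API key from the `OPENAI_API_KEY` environment variable by default, so it can help to confirm that both variables are set before running the remaining cells. A minimal sketch; the `missing_env` helper is hypothetical and not part of the Crisp notebooks:

```python
import os


def missing_env(names, environ=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


# Hypothetical pre-flight check: report whatever still needs a value.
missing = missing_env(["ACCOUNT_ID", "OPENAI_API_KEY"])
if missing:
    print("Set these environment variables first:", ", ".join(missing))
```

Checking up front avoids discovering a missing key only when the first model call fails.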

## Run Crisp common

This notebook uses the [crisp_common.ipynb](./crisp_common.ipynb) notebook to load common functions and variables that are used across the Crisp notebooks.

In [None]:
if not os.path.exists("crisp_common.ipynb"):
    print("Downloading crisp_common.ipynb")
    !wget https://raw.githubusercontent.com/gocrisp/blueprints/main/notebooks/crisp_common.ipynb -O crisp_common.ipynb
else:
    print("crisp_common.ipynb already exists")

%run crisp_common.ipynb

## Import dependencies
We will import the libraries we need: LangChain for the prompt template and the OpenAI chat model, Pydantic for data validation, and `Enum` for the category taxonomy.

In [3]:
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

from pydantic import BaseModel
from enum import Enum

## Load the product categories
We will query the database for the distinct product categories and subcategories, joining each pair into a single `category` label (for example, `Computers >>> Laptops`) that the LLM will later choose from.

In [4]:
%%load product_categories_df
SELECT DISTINCT
    product_category,
    product_sub_category,
    product_category || ' >>> ' || product_sub_category AS category
FROM `{project}`.`{dataset}`.`exp_harmonized_retailer_dim_product`

In [5]:
product_categories_df

| | product_category | product_sub_category | category |
|---|---|---|---|
| 0 | Computers | Laptops | Computers >>> Laptops |
| 1 | Wearables | Wearables | Wearables >>> Wearables |
| 2 | TV & Audio | TVs | TV & Audio >>> TVs |
| 3 | TV & Audio | Soundbars | TV & Audio >>> Soundbars |
| 4 | Accessories | Cables | Accessories >>> Cables |
| 5 | Home Appliances | Home Appliances | Home Appliances >>> Home Appliances |
| 6 | Phones & Tablets | Smartphones | Phones & Tablets >>> Smartphones |
| 7 | Phones & Tablets | Tablets | Phones & Tablets >>> Tablets |
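The combined `category` label encodes a two-level hierarchy. As a rough illustration of how that hierarchy can be recovered from the labels, the sketch below groups subcategories under their parent category. It uses a handful of labels copied from the table above in place of `product_categories_df`, and the `hierarchy` name is an assumption made for this example:

```python
from collections import defaultdict

# Sample "category >>> sub-category" labels copied from the table above.
labels = [
    "Computers >>> Laptops",
    "TV & Audio >>> TVs",
    "TV & Audio >>> Soundbars",
    "Phones & Tablets >>> Smartphones",
    "Phones & Tablets >>> Tablets",
]

SEPARATOR = " >>> "

# Map each top-level category to the list of its subcategories.
hierarchy = defaultdict(list)
for label in labels:
    category, sub_category = label.split(SEPARATOR, 1)
    hierarchy[category].append(sub_category)

for category, sub_categories in hierarchy.items():
    print(category, "->", sub_categories)
```

This only works if the separator never appears inside a category or subcategory name, which is presumably why an unusual token like `>>>` was chosen.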


## Create a Pydantic model that holds category taxonomy
We will build an `Enum` from the category labels and use it as the type of a Pydantic model's `category` field, so that any category the LLM returns must be one of the existing labels.

In [6]:
CategoryEnum = Enum(
    "CategoryEnum",
    {category: category for category in product_categories_df["category"].unique()},
)


class ProductAttributes(BaseModel):
    category: CategoryEnum

    def product_category(self):
        # The Enum value is "category >>> sub-category"; return the first part.
        return self.category.value.split(" >>> ")[0]

    def product_sub_category(self):
        # Return the part after the separator.
        return self.category.value.split(" >>> ")[1]
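
The way the Enum constrains values can be seen without Pydantic or an LLM. The following standalone sketch uses three labels from the table above. It shows that `str()` on a member includes the `CategoryEnum.` prefix while `.value` does not, and that a label outside the taxonomy is rejected. Pydantic should apply the same membership check when it validates the `category` field:

```python
from enum import Enum

labels = ["Computers >>> Laptops", "TV & Audio >>> Soundbars", "Accessories >>> Cables"]

# Functional API: each member's name and value are both the full label.
CategoryEnum = Enum("CategoryEnum", {label: label for label in labels})

member = CategoryEnum("TV & Audio >>> Soundbars")  # lookup by value
print(str(member))    # includes the "CategoryEnum." prefix
print(member.value)   # the plain label

category, sub_category = member.value.split(" >>> ")
print(category, "|", sub_category)

try:
    CategoryEnum("Toys >>> Puzzles")  # not part of the taxonomy
except ValueError as error:
    print("rejected:", error)
```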

## Create the LLM processing chain
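We will set up an LLM (`ChatOpenAI`) with a prompt template that asks the model to pick the most appropriate category for a given product name. Binding the model to `ProductAttributes` with `with_structured_output` restricts its answer to the categories defined above. Note that the prompt contains only the product name, not the existing category, so the prediction is made independently of the current label.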

In [7]:
model = ChatOpenAI(model_name="gpt-4o-mini", temperature=0.0)

prompt = PromptTemplate(
    template="""
### Instruction:
You are an expert in CPG (Consumer Packaged Goods) and retail product categorization. Please determine the most appropriate product category for the given information.

### Input:
Product: {product}

### Output:
Provide the correct product category that follows the structured output format.
    """,
    input_variables=["product"],
)

structured_model = model.with_structured_output(ProductAttributes)

chain = prompt | structured_model
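
To preview roughly what text the model receives for one product, the template can be filled in without calling the API. The sketch below is an approximation: it uses plain `str.format` as a stand-in for `PromptTemplate`, a shortened copy of the instruction text, and a hypothetical `render_prompt` helper.

```python
# Shortened copy of the notebook's template; str.format stands in for PromptTemplate.
TEMPLATE = """
### Instruction:
Please determine the most appropriate product category for the given information.

### Input:
Product: {product}

### Output:
Provide the correct product category that follows the structured output format.
"""


def render_prompt(product: str) -> str:
    """Fill the single {product} placeholder with a product name."""
    return TEMPLATE.format(product=product)


print(render_prompt("LunaSoundBar Go"))
```

The only variable part of the prompt is the product name; the list of allowed categories reaches the model through the structured output schema rather than through the prompt text.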

## Load product data
We will query the database to retrieve the full list of products, so we can use the model to check if each product is correctly categorized.

In [8]:
%%load products_df
SELECT *
FROM `{project}`.`{dataset}`.`exp_harmonized_retailer_dim_product`

In [9]:
products_df

| | product_id | product | product_category | product_sub_category | retailer |
|---|---|---|---|---|---|
| 0 | 121 | LunaBook Horizon | Computers | Laptops | target |
| 1 | 122 | LunaTech Focus | Computers | Laptops | target |
| 2 | 123 | LunaTech Ultra | Computers | Laptops | target |
| 3 | 124 | LunaBook Max | Computers | Laptops | target |
| 4 | 125 | LunaTech Gamma | Computers | Laptops | target |
| ... | ... | ... | ... | ... | ... |
| 75 | 116 | LunaTab X | Phones & Tablets | Tablets | target |
| 76 | 117 | LunaTech Pad | Phones & Tablets | Tablets | target |
| 77 | 118 | LunaTab Max | Phones & Tablets | Tablets | target |
| 78 | 120 | LunaTab Lite | Phones & Tablets | Tablets | target |


## Get the new categories
We will run each product through the chain, store the predicted category and subcategory in new columns, and then flag every row where the prediction differs from the existing values.

In [10]:
if "product_category_new" not in products_df.columns:
    products_df["product_category_new"] = None
if "product_sub_category_new" not in products_df.columns:
    products_df["product_sub_category_new"] = None

for index, row in products_df.iterrows():
    # Use row["product"]: attribute access (row.product) resolves to the Series.product method.
    output = chain.invoke({"product": row["product"]})
    products_df.loc[index, "product_category_new"] = output.product_category()
    products_df.loc[index, "product_sub_category_new"] = output.product_sub_category()

products_df["changed"] = (
    products_df["product_sub_category"] != products_df["product_sub_category_new"]
) | (products_df["product_category"] != products_df["product_category_new"])

products_df.describe(include=[bool])

| | changed |
|---|---|
| count | 80 |
| unique | 2 |
| top | False |
| freq | 77 |
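
The loop above calls the model once per row. If the same product name appears on several rows (for example, once per retailer), each unique name could be classified once and the result reused. The sketch below shows that idea on plain dictionaries, assuming identical names should receive identical categories. `classify` is a hypothetical stand-in for `chain.invoke` with hard-coded answers, and the rows are illustrative rather than real query results:

```python
from functools import lru_cache

calls = []  # records which names actually reach the stand-in model


@lru_cache(maxsize=None)
def classify(product: str) -> tuple:
    """Hypothetical stand-in for chain.invoke; returns (category, sub_category)."""
    calls.append(product)
    if "Cable" in product:
        return ("Accessories", "Cables")
    return ("Computers", "Laptops")


# Illustrative rows; the duplicate product name is deliberate.
rows = [
    {"product": "LunaBook Max", "product_category": "Computers", "product_sub_category": "Laptops"},
    {"product": "DisplayPort Cable", "product_category": "Wearables", "product_sub_category": "Wearables"},
    {"product": "LunaBook Max", "product_category": "Computers", "product_sub_category": "Laptops"},
]

for row in rows:
    new_category, new_sub_category = classify(row["product"])
    row["product_category_new"] = new_category
    row["product_sub_category_new"] = new_sub_category
    # Same rule as the notebook: flag the row if either level differs.
    row["changed"] = (
        row["product_category"] != new_category
        or row["product_sub_category"] != new_sub_category
    )

print([row["changed"] for row in rows])
print("model calls:", len(calls))
```

Caching trades a little memory for fewer API calls. Whether it is worthwhile depends on how often product names repeat in your data.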


## Identify products with changed categories
We will filter the dataset to the products whose predicted category differs from the existing one, so they can be reviewed and corrected where the model's suggestion is right.

In [11]:
products_df[products_df["changed"]]

| | product_id | product | product_category | product_sub_category | retailer | product_category_new | product_sub_category_new | changed |
|---|---|---|---|---|---|---|---|---|
| 8 | 147 | LunaSoundBar Go | Computers | Laptops | target | TV & Audio | Soundbars | True |
| 17 | 166 | DisplayPort Cable | Wearables | Wearables | target | Accessories | Cables | True |
| 18 | 175 | LunaTech Coffee Maker | Wearables | Wearables | target | Home Appliances | Home Appliances | True |
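
When the flagged list is longer than a few rows, grouping it by the existing category can show where mislabels concentrate. A small sketch, using the (existing, predicted) category pairs from the flagged rows above in place of the DataFrame; the variable names are assumptions made for this example:

```python
from collections import Counter

# (existing category, predicted category) pairs copied from the flagged rows above.
flagged = [
    ("Computers", "TV & Audio"),
    ("Wearables", "Accessories"),
    ("Wearables", "Home Appliances"),
]

# Count how many flagged products currently sit in each existing category.
moved_out_of = Counter(old for old, _new in flagged)
for category, count in moved_out_of.most_common():
    print(f"{category}: {count} product(s) flagged")
```

Here two of the three flagged products are currently labeled `Wearables`, which might suggest that category is being used as a catch-all.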
