# Product classification through AI
In this notebook, I load the product information contained in the CSV file and classify each product as Hazmat (Hazardous Material) or not.


## Classify products based on the data obtained (title and attributes from ML API)
Since I am using the free tier of Groq/Gemini, I classify the products in batches instead of making one LLM call per product. `classify_products` below defaults to 20 products per call. The number of products per batch still needs tuning.
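
One rough way to pick the batch size is to work backwards from the provider's token limit. The sketch below is only an estimate under stated assumptions: the 4-characters-per-token ratio is a rule of thumb rather than a real tokenizer, and `estimate_tokens`, `max_batch_size`, and every number in the example are hypothetical, not measurements from this dataset.

```python
import math


def estimate_tokens(text, chars_per_token=4.0):
    """Very rough token estimate (assumption: about 4 characters per token)."""
    return math.ceil(len(text) / chars_per_token)


def max_batch_size(system_tokens, tokens_per_product, output_tokens_per_product, token_budget):
    """Largest number of products whose prompt plus expected output fits in token_budget."""
    per_product = tokens_per_product + output_tokens_per_product
    available = token_budget - system_tokens
    if available <= 0 or per_product <= 0:
        return 0
    return available // per_product


# Hypothetical numbers: a 2000-token system prompt, 300 input tokens and
# 100 output tokens per product, and a 12000-token budget per call.
size = max_batch_size(
    system_tokens=2000,
    tokens_per_product=300,
    output_tokens_per_product=100,
    token_budget=12000,
)
print(size)  # 25
```

In practice the per-product figures would come from measuring a few real rows, and the budget should leave headroom below the provider's limit.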

In [6]:
from pydantic import BaseModel, Field
from typing import List, Optional
from enum import Enum

# Important definitions

class Confidence(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"

class HazmatClassification(BaseModel):
    product_id: str = Field(..., description="The unique identifier of the product.")
    is_hazmat: bool = Field(..., description="Indicates whether the product is classified as a Hazmat.")
    reason: Optional[str] = Field(None, description="The reason for the classification, if the product is a Hazmat.")
    confidence: Optional[Confidence] = Field(None, description="The confidence level of the classification, if the product is a Hazmat.")

DATASET = 'dataset_1'

In [7]:
# Get hazmat definition from validated file
with open("data/hazmat-definition.md", "r") as f:
    hazmat_def = f.read()

# Get products information from csv file
import pandas as pd
products_df = pd.read_csv(f"data/{DATASET}/{DATASET}_annotated.csv") # Remove _annotated if you want to use the original dataset

In [8]:
import json

# Serialize the schema with json.dumps so the prompt contains valid JSON
# (double quotes, true/false/null) rather than a Python dict repr.
hazmat_classifier_system_msg = f"""
You are a domain-expert Hazmat classifier. Your task is to classify products as Hazmat or not based on the definition provided between <hazmat_definition> tags.

Before answering, you must output your thinking process between <think> tags.

The final answer must be a jsonl structure, located between <jsonl> tags, with each line following this schema (one line per product): <json_schema>{json.dumps(HazmatClassification.model_json_schema())}</json_schema>.

Hazmat definition: <hazmat_definition>{hazmat_def}</hazmat_definition>

Guidelines:
- For the tag <think>: You must output your thinking process before the final answer.
- For the tag <jsonl>: The final answer must be a valid jsonl structure, with each line following the schema provided.
- Always refer to the Hazmat definition when classifying. Do not make assumptions. If you are not certain of the classification, classify the product as hazmat with a lower confidence.
- Only output a product as non-hazmat if you are absolutely certain that it is not a Hazmat according to the definition provided.
"""

In [None]:
from defs_and_tools import call_llm, extract_from_tag
import requests
from docling.document_converter import DocumentConverter
from html_to_markdown import convert_to_markdown
from dotenv import load_dotenv

load_dotenv()

# model="groq/llama-3.3-70b-versatile"
model="gemini/gemini-2.5-flash"
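
The system prompt above asks the model for one JSON object per line, so it can help to check each line before trusting the extracted text. Below is a minimal stdlib-only sketch of such a check. `parse_classifications` and the sample product IDs are hypothetical. In this notebook, Pydantic's `HazmatClassification.model_validate_json` would likely be the more direct option, since the model is already defined.

```python
import json

# Confidence values accepted by the schema; None covers an omitted or null field.
ALLOWED_CONFIDENCE = ("low", "medium", "high", None)


def parse_classifications(jsonl_text):
    """Split JSONL text into valid records and (line_number, error) pairs."""
    valid, errors = [], []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines between records
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((line_no, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            errors.append((line_no, "expected a JSON object"))
        elif not isinstance(record.get("product_id"), str):
            errors.append((line_no, "product_id must be a string"))
        elif not isinstance(record.get("is_hazmat"), bool):
            errors.append((line_no, "is_hazmat must be a boolean"))
        elif record.get("confidence") not in ALLOWED_CONFIDENCE:
            errors.append((line_no, "confidence must be low, medium, or high"))
        else:
            valid.append(record)
    return valid, errors


# Hypothetical model output: one good line, one wrong type, one non-JSON line.
sample = (
    '{"product_id": "A1", "is_hazmat": true, "reason": "aerosol", "confidence": "high"}\n'
    '{"product_id": "A2", "is_hazmat": "yes"}\n'
    "not json"
)
valid, errors = parse_classifications(sample)
print(len(valid), [line_no for line_no, _ in errors])  # 1 [2, 3]
```

Collecting errors instead of raising on the first bad line keeps the rest of a batch usable and shows which products need to be re-sent.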

In [None]:
def classify_products(products_df, batch_size=20, output_jsonl="classified_products.jsonl", think_log="think_log.txt"):
    """Classify products in batches, appending each batch's JSONL and reasoning to disk.

    Returns the list of raw LLM responses, one per batch.
    """
    results = []

    for i in range(0, len(products_df), batch_size):
        batch_number = i // batch_size + 1
        batch = products_df.iloc[i:i + batch_size]
        batch_list = batch.to_dict(orient="records")

        response = call_llm(
            system=hazmat_classifier_system_msg,
            prompt=f"Products to classify:\n{batch_list}",
            model=model,
        )

        results.append(response)

        # Save JSONL output (append mode: delete old output files before a fresh run)
        jsonl_content = extract_from_tag(response, "jsonl")
        if jsonl_content:
            print(f"Batch {batch_number} jsonl content extracted!")
            with open(output_jsonl, "a", encoding="utf-8") as f:
                f.write(jsonl_content + "\n")
        else:
            # Make missing output visible instead of skipping the batch silently
            print(f"Batch {batch_number}: no <jsonl> content found, nothing saved")

        # Save think log
        think_content = extract_from_tag(response, "think")
        if think_content:
            print(f"Batch {batch_number} think content extracted!")
            with open(think_log, "a", encoding="utf-8") as f:
                f.write(f"Batch {batch_number}:\n{think_content}\n\n")

        print(f"Batch {batch_number} processed")

    return results

results = classify_products(products_df, 
                            output_jsonl="data/dataset_classified_products.jsonl",
                            think_log="data/dataset_think_log.txt")

Batch 1 jsonl content extracted!
Batch 1 think content extracted!
Batch 1 processed

Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.

Provider List: https://docs.litellm.ai/docs/providers

RateLimitError: litellm.RateLimitError: RateLimitError: GroqException - {"error":{"message":"Rate limit reached for model `llama-3.3-70b-versatile` in organization `org_01j3ebn6tjfhyvc1f6sg2vcj92` service tier `on_demand` on tokens per minute (TPM): Limit 12000, Used 8035, Requested 9997. Please try again in 30.156999999s. Need more tokens? Upgrade to Dev Tier today at https://console.groq.com/settings/billing","type":"tokens","code":"rate_limit_exceeded"}}
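
The error above includes a hint ("Please try again in 30.156999999s"), so one possible way to keep the run alive is to wait for that long and retry the batch. The following is a minimal sketch, not what this notebook currently does: `retry_delay_seconds` and `call_with_retry` are hypothetical helpers, the regex only covers the plain-seconds format seen in this message, and the broad `except Exception` is a placeholder that would presumably be narrowed to `litellm.RateLimitError` in real use.

```python
import re
import time


def retry_delay_seconds(error_message, default=60.0):
    """Return the 'try again in <N>s' hint from an error message, else default."""
    match = re.search(r"try again in (\d+(?:\.\d+)?)s", error_message)
    return float(match.group(1)) if match else default


def call_with_retry(fn, max_attempts=3, sleep=time.sleep):
    """Call fn(); on failure wait for the hinted delay and retry up to max_attempts."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:  # assumption: narrow to litellm.RateLimitError in the notebook
            if attempt == max_attempts:
                raise  # out of attempts, surface the original error
            # Wait for the hinted delay plus a one-second safety margin
            sleep(retry_delay_seconds(str(exc)) + 1.0)


message = "Rate limit reached ... Please try again in 30.156999999s. Need more tokens?"
print(retry_delay_seconds(message))  # 30.156999999
```

Inside `classify_products`, the `call_llm(...)` call could be wrapped as `call_with_retry(lambda: call_llm(...))`. Note that waiting only helps when a single request fits under the limit. Here the request alone asked for 9997 of the 12000 tokens per minute, so a smaller batch may also be needed.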
