# Product classification with AI
In this notebook, I load the product information from the CSV file and classify each product as Hazmat (Hazardous Material) or not.


## Classify products based on the data obtained (title and attributes from ML API)
Because I am using the free tier of Groq/Gemini, I classify the products in batches, sending many products in a single LLM call. The current run uses 100 products per call (earlier runs used 30). The batch size still needs to be tuned.
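
To make the batching concrete, here is a minimal sketch of how a list of products splits into batches of a given size. The helper name `iter_batches` and the placeholder product names are assumptions for illustration only; the notebook itself slices the DataFrame with `iloc` inside `classify_products`.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Hypothetical data: 250 placeholder products split into batches of 100.
products = [f"product_{n}" for n in range(250)]
batches = list(iter_batches(products, batch_size=100))
print([len(b) for b in batches])  # [100, 100, 50]
```

The last batch is smaller than the others whenever the number of products is not a multiple of the batch size, so each batch means one classification call plus one formatting call, regardless of how full it is.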

In [12]:
from pydantic import BaseModel, Field
from typing import List, Optional
from enum import Enum

# Important definitions

class Confidence(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"

class HazmatClassification(BaseModel):
    product_id: str = Field(..., description="The unique identifier of the product.")
    is_hazmat: bool = Field(..., description="Indicates whether the product is classified as a Hazmat.")
    # Both fields default to None, so they are typed as Optional
    reason: Optional[str] = Field(None, description="The reason for the classification, if the product is a Hazmat.")
    confidence: Optional[Confidence] = Field(None, description="The confidence level of the classification, if the product is a Hazmat.")

DATASET = 'dataset_1'

In [13]:
# Get hazmat definition from validated file
with open("data/hazmat-definition.md", "r", encoding='utf8') as f:
    hazmat_def = f.read()

# Get products information from csv file
import pandas as pd
products_df = pd.read_csv(f"data/{DATASET}/{DATASET}.csv") 

batch_size = 100

# TEMPORARY: keep only the products that are not classified yet
# (up to 2219, the batch size was 30 and gemini-2.5-flash was used)
products_df = products_df.iloc[6020:]  # From here on, 100 products per batch
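
The hard-coded `iloc[6020:]` offset has to be updated by hand after every interrupted run. One possible alternative, sketched below under the assumption that every line already written to the output JSONL file carries a `product_id` key, is to read the IDs that are already classified and skip them. The helper name `load_classified_ids` and the demo file are hypothetical.

```python
import json
import os
import tempfile
from typing import Set


def load_classified_ids(jsonl_path: str) -> Set[str]:
    """Return the product_id of every valid JSON line in the file (empty set if missing)."""
    if not os.path.exists(jsonl_path):
        return set()
    ids: Set[str] = set()
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # Ignore malformed lines instead of aborting the resume
            if isinstance(record, dict) and "product_id" in record:
                ids.add(str(record["product_id"]))
    return ids


# Demo with a temporary file that stands in for the real output JSONL.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write('{"product_id": "A1", "is_hazmat": true}\n')
        f.write("not json\n")
        f.write('{"product_id": "B2", "is_hazmat": false}\n')
    done_ids = load_classified_ids(demo_path)

pending = [pid for pid in ["A1", "B2", "C3"] if pid not in done_ids]
print(pending)  # ['C3']
```

With a DataFrame, the same filter could be written as `products_df[~products_df["product_id"].astype(str).isin(done_ids)]`.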

In [14]:
hazmat_classifier_system_msg = f"""
You are a domain-expert Hazmat classifier. Your task is to analyze the products below and determine, for each, if it is Hazmat or not, based on the definition provided between <hazmat_definition> tags.

You must base your analysis on the following JSON schema, which describes the required analysis for each product in the fields:
<json_schema>{HazmatClassification.model_json_schema()}</json_schema>

Before answering, you must output your detailed reasoning process.

Hazmat definition: <hazmat_definition>{hazmat_def}</hazmat_definition>

Guidelines:
- Always base the classification on the Hazmat definition. Do not assume anything beyond it. If you are not certain of the classification, output the product as hazmat with a lower confidence.
- Only output a product as non-hazmat if you are absolutely certain that it is not a Hazmat according to the definition provided.
"""

hazmat_json_extractor_system_msg = f"""
You are a domain-expert Hazmat classifier. Based on the analysis below, extract and output the final answer as a jsonl structure, located between <jsonl> tags, with each line following this schema (one line per product): <json_schema>{HazmatClassification.model_json_schema()}</json_schema>.

Guidelines:
- For the tag <jsonl>: The final answer must be a valid jsonl structure, with each line following the schema provided.
- If not certain of the classification, output as hazmat with lower confidence.
- Only output a product as non-hazmat if you are absolutely certain that it is not a Hazmat according to the definition provided.
"""
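
Both prompts rely on tags (`<hazmat_definition>`, `<json_schema>`, `<jsonl>`) so that the final answer can be located in the model output. The `extract_from_tag` helper imported from `defs_and_tools` is not shown in this notebook; the sketch below is only an assumption about how such a helper might work, written under a different name (`extract_from_tag_sketch`) to avoid confusion with the real one.

```python
import re
from typing import Optional


def extract_from_tag_sketch(text: str, tag: str) -> Optional[str]:
    """Return the stripped content of the first <tag>...</tag> pair, or None if absent or empty."""
    pattern = re.compile(
        rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>",
        re.DOTALL,  # Let '.' match newlines, since JSONL content spans several lines
    )
    match = pattern.search(text)
    if match is None:
        return None
    content = match.group(1).strip()
    return content or None


# Hypothetical model output, for illustration only.
response = 'Reasoning done.\n<jsonl>\n{"product_id": "A1", "is_hazmat": true}\n</jsonl>\nEnd.'
print(extract_from_tag_sketch(response, "jsonl"))
print(extract_from_tag_sketch("no tags here", "jsonl"))  # None
```

Returning `None` when the tag is missing lets the caller detect a batch whose output could not be parsed, which is the situation noted for `gemini-2.0-flash` in the model selection cell.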

In [15]:
from defs_and_tools import call_llm, extract_from_tag
import requests
from docling.document_converter import DocumentConverter
from html_to_markdown import convert_to_markdown
from dotenv import load_dotenv

load_dotenv()

# json_extractor_models = ["groq/llama-3.3-70b-versatile",
#                         "groq/llama3-70b-8192",
#                         "gemini/gemini-2.0-flash"]
# json_extractor_model = "gemini/gemini-2.0-flash" # Did not create the tags correctly for output parsing
json_extractor_model = "gemini/gemini-2.5-flash"
hazmat_classifier_model = "gemini/gemini-2.5-flash"

In [None]:
def classify_products(products_df, batch_size=30, output_jsonl="classified_products.jsonl", log_file="log_file.txt"):
    """Classify products in batches and save results."""

    for i in range(0, len(products_df), batch_size):
        batch_number = i // batch_size + 1
        batch = products_df.iloc[i:i + batch_size]
        batch_list = batch.to_dict(orient="records")

        print(f"Processing batch {batch_number} with {len(batch_list)} products...")
        # First call: free-form analysis of every product in the batch
        raw_response = call_llm(
            system=hazmat_classifier_system_msg,
            prompt=f"Products to classify:\n{batch_list}",
            model=hazmat_classifier_model,
        )

        print("Raw response received, formatting to JSONL...")
        # Second call: turn the analysis into one JSON line per product
        formatted_response = call_llm(
            system=hazmat_json_extractor_system_msg,
            prompt=raw_response,
            model=json_extractor_model,
        )

        # Save the raw log first, so the analysis is kept even if the JSONL extraction fails
        with open(log_file, "a", encoding="utf-8") as f:
            f.write(f"Batch {batch_number}:\n{raw_response}\n\n")

        # Save JSONL output
        jsonl_content = extract_from_tag(formatted_response, "jsonl")
        if jsonl_content:
            print(f"Batch {batch_number} jsonl content extracted!")
            with open(output_jsonl, "a", encoding="utf-8") as f:
                f.write(jsonl_content + "\n")
            print(f"Batch {batch_number} processed and saved to {output_jsonl} and {log_file}!")
        else:
            # Do not report the batch as saved when no JSONL content was found
            print(f"WARNING: batch {batch_number} has no <jsonl> content; only the raw log was saved to {log_file}.")
        print(40*"-")

classify_products(products_df, 
                  output_jsonl=f"data/{DATASET}/{DATASET}_classified_products.jsonl",
                  log_file=f"data/{DATASET}/{DATASET}_raw_log.txt",
                  batch_size=batch_size)

Processing batch 1 with 100 products...
Raw response received, formatting to JSONL...
Batch 1 jsonl content extracted!
Batch 1 processed and saved to data/dataset_1/dataset_1_classified_products.jsonl and data/dataset_1/dataset_1_raw_log.txt!
----------------------------------------
Processing batch 2 with 100 products...
Raw response received, formatting to JSONL...
Batch 2 jsonl content extracted!
Batch 2 processed and saved to data/dataset_1/dataset_1_classified_products.jsonl and data/dataset_1/dataset_1_raw_log.txt!
----------------------------------------
Processing batch 3 with 100 products...
Raw response received, formatting to JSONL...
Batch 3 jsonl content extracted!
Batch 3 processed and saved to data/dataset_1/dataset_1_classified_products.jsonl and data/dataset_1/dataset_1_raw_log.txt!
----------------------------------------
Processing batch 4 with 100 products...
Raw response received, formatting to JSONL...
Batch 4 jsonl content extracted!
Batch 4 processed and saved t

In [None]:
import json

# Read the classified products JSONL file and insert results into products_df
jsonl_path = f"data/{DATASET}/{DATASET}_classified_products.jsonl"
classified_rows = []
with open(jsonl_path, "r", encoding="utf-8") as f:
    for line in f:
        if line.strip():
            classified_rows.append(json.loads(line))

classified_df = pd.DataFrame(classified_rows)

# Merge classified_df into products_df on 'product_id'
products_df = products_df.merge(classified_df, on="product_id", how="left", suffixes=("", "_classified"))

# Save the updated products_df with classifications
products_df.to_csv(f"data/{DATASET}/{DATASET}_classified_products.csv", index=False)

# Preview the merged result (only the last expression of a cell is displayed)
products_df.head()
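
The loading cell above calls `json.loads` on every line, so a single malformed line in the LLM output stops the whole load. A more tolerant variant is sketched below, assuming the field names and confidence values defined in `HazmatClassification`. The function name `parse_classification_lines` and the sample lines are hypothetical, and the checks are deliberately simpler than a full pydantic validation.

```python
import json
from typing import Any, Dict, Iterable, List, Tuple

# Values accepted for the optional "confidence" field (None means the field is absent or null)
ALLOWED_CONFIDENCE = ("low", "medium", "high", None)


def parse_classification_lines(lines: Iterable[str]) -> Tuple[List[Dict[str, Any]], List[str]]:
    """Split lines into (valid records, rejected raw lines); blank lines are skipped."""
    valid: List[Dict[str, Any]] = []
    rejected: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            rejected.append(line)
            continue
        if (
            isinstance(record, dict)
            and isinstance(record.get("product_id"), str)
            and isinstance(record.get("is_hazmat"), bool)
            and record.get("confidence") in ALLOWED_CONFIDENCE
        ):
            valid.append(record)
        else:
            rejected.append(line)
    return valid, rejected


# Hypothetical lines, for illustration only.
sample_lines = [
    '{"product_id": "A1", "is_hazmat": true, "reason": "Aerosol", "confidence": "high"}',
    '{"product_id": "B2", "is_hazmat": "yes"}',
    "```json",
    "",
    '{"product_id": "C3", "is_hazmat": false}',
]
valid, rejected = parse_classification_lines(sample_lines)
print(len(valid), len(rejected))  # 2 2
```

The rejected lines can be written to a separate file, so that the affected products are classified again instead of being silently dropped.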