Business Goal

1️⃣ Extract visual features from product images using CNN (EfficientNet-B7).

2️⃣ Extract text embeddings from product descriptions using LLM (e.g., OpenAI GPT, Mistral, or Llama).

3️⃣ Combine both embeddings for rich, context-aware predictions (e.g., shelf visibility, consumer interaction).

4️⃣ Send embeddings to an LLM API to generate detailed insights about a product’s performance.

End-to-End Multimodal Learning Pipeline with Trainable Fusion Layer
We will use a trainable fusion model to combine CNN-based image embeddings (EfficientNet-B7) and LLM-based text embeddings (OpenAI GPT, Mistral, or Llama). This will allow context-aware KPI predictions such as shelf visibility, consumer interaction, and buying patterns.

Business Goal with Trainable Fusion Model

✅ Step 1: Train CNN to Extract Image Embeddings

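We use a pretrained EfficientNet-B7 (ImageNet weights, kept frozen) to extract visual features from product images.

These embeddings can capture visual cues such as product placement, size, color, and shelf positioning.

The 2560-dimensional image embedding is later projected into a shared 512-dimensional space by the fusion model.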

✅ Step 2: Train LLM to Extract Text Embeddings

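We use a pretrained Sentence Transformers model (e.g., a BERT- or MiniLM-based encoder); a hosted embeddings API such as OpenAI's is an alternative.

The embeddings capture the product name, description, and other metadata.

The 384-dimensional text embedding is later projected into the same 512-dimensional space by the fusion model.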

✅ Step 3: Train a Fusion Model to Learn the Best Combination of Modalities

We concatenate CNN and LLM embeddings and train a neural network to learn the optimal fusion strategy.

The fusion model learns how visual & text embeddings relate to KPI predictions.

✅ Step 4: Send the Final Embeddings to an LLM for Insights

The fused embedding produced by the trained fusion model is sent to an LLM (e.g., GPT-4, Mistral, or Claude 3).

The LLM predicts KPIs, provides insights, and generates recommendations.

Step 1: Train CNN to Extract Image Embeddings


In [None]:
import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as transforms
import cv2

# Load EfficientNet-B7
efficientnet = models.efficientnet_b7(weights=models.EfficientNet_B7_Weights.DEFAULT)
efficientnet.eval()

# Define a transformation pipeline
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize((224, 224)),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

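# Function to extract CNN embeddings
def extract_cnn_embeddings(image_path):
    image = cv2.imread(image_path)  # BGR uint8 array, or None if the file cannot be read
    if image is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    image = transform(image).unsqueeze(0)  # Shape: (1, 3, 224, 224)
    with torch.no_grad():
        features = efficientnet.features(image)  # Feature map: (1, 2560, 7, 7)
        pooled = efficientnet.avgpool(features)  # Global average pool: (1, 2560, 1, 1)
        embedding = torch.flatten(pooled, start_dim=1)  # Flatten to a vector
    return embedding  # Shape: (1, 2560)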
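
The 2560-dimensional size quoted for the image embedding is the channel count of EfficientNet-B7's final feature map, and it only comes out as the vector length if the spatial grid is averaged away before flattening. The sketch below illustrates this with a random stand-in tensor instead of a real image; the 7x7 grid is an assumption that corresponds to a 224x224 input.

```python
import torch
import torch.nn as nn

# Random stand-in for the output of efficientnet.features on one 224x224 image
# (assumed shape: batch=1, channels=2560, height=7, width=7).
feature_map = torch.randn(1, 2560, 7, 7)

# Flattening the raw feature map keeps every spatial position.
flat = torch.flatten(feature_map, start_dim=1)

# Global average pooling collapses the 7x7 grid to one value per channel first.
pooled = nn.AdaptiveAvgPool2d(1)(feature_map)
embedding = torch.flatten(pooled, start_dim=1)

print(flat.shape)       # torch.Size([1, 125440])
print(embedding.shape)  # torch.Size([1, 2560])
```

A projection layer declared as nn.Linear(2560, 512) accepts only the pooled version.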


Step 2: Train LLM to Extract Text Embeddings

In [None]:
from sentence_transformers import SentenceTransformer
import torch

# Load a pretrained Sentence Transformer model (alternative: a hosted embeddings API such as OpenAI's)
text_model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimension embeddings

# Function to extract text embeddings
def extract_text_embedding(text):
    embedding = text_model.encode(text)
    return torch.tensor(embedding).unsqueeze(0)  # Shape: (1, 384)

# Example product description
product_description = "Shampoo on the second shelf in Walmart, easy to find."
text_embedding = extract_text_embedding(product_description)


Step 3: Train a Fusion Model to Learn the Best Combination

Reduce CNN embeddings from 2560 → 512 (using a projection layer).

Expand Text embeddings from 384 → 512 (using a projection layer).

Concatenate both embeddings and learn a fused representation.

Train the fusion model using labeled KPI data (visibility, buying trends, etc.).

In [None]:
class FusionModel(nn.Module):
    def __init__(self):
        super(FusionModel, self).__init__()
        self.cnn_projection = nn.Linear(2560, 512)  # Reduce CNN embedding size
        self.text_projection = nn.Linear(384, 512)  # Expand text embedding size
        self.fusion_layer = nn.Linear(1024, 512)  # Final fusion layer
        self.classifier = nn.Linear(512, 5)  # Output KPI classes (e.g., visibility, findability)

    def forward(self, cnn_emb, text_emb):
        cnn_emb = self.cnn_projection(cnn_emb)  # Reduce CNN size
        text_emb = self.text_projection(text_emb)  # Expand text size
        combined = torch.cat((cnn_emb, text_emb), dim=1)  # Merge both embeddings
        fused_embedding = self.fusion_layer(combined)  # Learn fusion representation
        output = self.classifier(fused_embedding)  # Predict KPIs
        return output

# Initialize Model
fusion_model = FusionModel()


Step 4: Train the Fusion Model

In [None]:
# Define loss function & optimizer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
fusion_model.to(device)
optimizer = torch.optim.Adam(fusion_model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

# Simulated KPI label: a class index from 0 to 4 (e.g., 0 = low visibility, 4 = high findability)
kpi_label = torch.tensor([3]).to(device)  # Example KPI class label

# Run one training step on a single (image, text, label) example
def train_fusion_model(image_path, text_desc, label, model, optimizer, criterion):
    model.train()

    # Extract embeddings (the pretrained encoders stay frozen)
    cnn_emb = extract_cnn_embeddings(image_path).to(device)
    text_emb = extract_text_embedding(text_desc).to(device)

    # Forward pass
    optimizer.zero_grad()
    output = model(cnn_emb, text_emb)
    loss = criterion(output, label)

    # Backward pass
    loss.backward()
    optimizer.step()

    return loss.item()

# Repeat the training step 10 times on the same example
for epoch in range(10):
    loss = train_fusion_model("shelf.jpg", product_description, kpi_label, fusion_model, optimizer, criterion)
    print(f"Epoch {epoch + 1}, Loss: {loss:.4f}")
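
The loop above trains on a single image, description, and label. With a real labeled dataset, one common option is to precompute the embeddings once and train on mini-batches. The sketch below shows that pattern under stated assumptions: the data is random stand-in tensors, and `FusionSketch` is a hypothetical copy of the fusion model's layer sizes so the block runs on its own.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Stand-in data: 32 products with random embeddings and random KPI classes (0-4).
cnn_embs = torch.randn(32, 2560)
text_embs = torch.randn(32, 384)
labels = torch.randint(0, 5, (32,))
loader = DataLoader(TensorDataset(cnn_embs, text_embs, labels), batch_size=8, shuffle=True)

class FusionSketch(nn.Module):
    """Hypothetical stand-in with the same layer sizes as the fusion model."""
    def __init__(self):
        super().__init__()
        self.cnn_projection = nn.Linear(2560, 512)
        self.text_projection = nn.Linear(384, 512)
        self.fusion_layer = nn.Linear(1024, 512)
        self.classifier = nn.Linear(512, 5)

    def forward(self, cnn_emb, text_emb):
        combined = torch.cat(
            (self.cnn_projection(cnn_emb), self.text_projection(text_emb)), dim=1
        )
        return self.classifier(self.fusion_layer(combined))

model = FusionSketch()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

epoch_losses = []
for epoch in range(3):
    model.train()
    total = 0.0
    for cnn_batch, text_batch, label_batch in loader:
        optimizer.zero_grad()
        loss = criterion(model(cnn_batch, text_batch), label_batch)
        loss.backward()
        optimizer.step()
        total += loss.item() * label_batch.size(0)  # weight by batch size
    epoch_losses.append(total / len(loader.dataset))
    print(f"Epoch {epoch + 1}: mean loss {epoch_losses[-1]:.4f}")
```

Because the encoders are frozen, caching their outputs avoids re-running EfficientNet-B7 and the text model on every epoch.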


Step 5: Send the Final Embedding to an LLM for Insights

In [None]:
from openai import OpenAI
import json
import os

# Requires openai>=1.0; the API key is read from the OPENAI_API_KEY environment variable
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def query_gpt(embedding):
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are an expert in retail shelf analysis."},
            {"role": "user", "content": json.dumps({
                "embedding": embedding.tolist(),
                "task": "Analyze shelf visibility, buying trends, and findability"
            })}
        ]
    )
    return response.choices[0].message.content

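# Compute the fused embedding for a product
cnn_emb = extract_cnn_embeddings("shelf.jpg").to(device)
text_emb = extract_text_embedding(product_description).to(device)
fusion_model.eval()
with torch.no_grad():
    # fusion_model(...) returns KPI logits, so rebuild the 512-d fused vector from its layers
    combined = torch.cat((fusion_model.cnn_projection(cnn_emb),
                          fusion_model.text_projection(text_emb)), dim=1)
    fused_embedding = fusion_model.fusion_layer(combined)  # Shape: (1, 512)

# Send to GPT-4 for analysis
analysis = query_gpt(fused_embedding.cpu().numpy())
print("LLM Response:", analysis)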
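
A vector of raw floats serialized as JSON may be hard for a chat model to interpret, so one option is to include the classifier's output in readable form alongside it. The sketch below converts a logit tensor into labeled class probabilities that could be added to the prompt. The class names are placeholders (the five KPI classes are not named here), and `summarize_logits` is a hypothetical helper, not part of any library.

```python
import json
import torch

# Hypothetical KPI class names; replace with the labels used during training.
KPI_NAMES = ["class_0", "class_1", "class_2", "class_3", "class_4"]

def summarize_logits(logits, names=KPI_NAMES):
    """Turn a (1, num_classes) logit tensor into a JSON string of class probabilities."""
    probs = torch.softmax(logits, dim=1).squeeze(0)
    top = int(torch.argmax(probs))
    return json.dumps({
        "predicted_kpi_class": names[top],
        "probabilities": {name: round(float(p), 3) for name, p in zip(names, probs)},
    })

# Stand-in logits; in the pipeline these would come from fusion_model(cnn_emb, text_emb).
example_logits = torch.tensor([[0.2, 1.5, -0.3, 3.0, 0.1]])
summary = summarize_logits(example_logits)
print(summary)
```

The resulting string can be passed as an extra field in the user message next to the task description.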


Final Business Pipeline

1️⃣ Extract CNN-based image embeddings (EfficientNet-B7).

2️⃣ Extract LLM-based text embeddings (Sentence Transformer, GPT).

3️⃣ Pass embeddings through a trainable fusion model (learn multimodal alignment).

4️⃣ Send the fused embedding to GPT-4/Mistral for retail KPI analysis.

Next step: deploy this pipeline as an API (e.g., with FastAPI or Flask).