<a href="https://colab.research.google.com/github/peeyushsinghal/AI-Engineering-ERA3/blob/main/Intro_DSPy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# 🧪 Generative AI in Industry — Maps + FMCG with DSPy & Agentic Pipelines


**Audience:** Beginners / Practitioners who want practical exposure to Generative AI + Agentic AI  
**Focus:** **DSPy** and industry use cases in Maps and FMCG

> Tip: Run cells top-to-bottom. Sections are mostly independent (Section 4 reuses `reviews_text` from Section 3), so you can skip ahead and install only what you need.


## 0) Environment Setup

This workshop uses:
- Python 3.10+
- `dspy` (or `dspy-ai`) for programmable, optimizable LLM pipelines
- `pandas` for data handling
- `rapidfuzz` for string similarity
- An LLM provider (Gemini or OpenAI or compatible).

> If you don't have Internet in your environment, skip installs and read through the code; it will still serve as a template.


In [1]:
# Install dependencies (comment these lines out if you are offline or already set up).
!pip install --quiet dspy-ai rapidfuzz pandas python-dotenv
# For Google API:
!pip install --quiet google-generativeai

In [2]:
import google.generativeai as genai

In [3]:
import os
from pathlib import Path

DATA_DIR = Path('data')
DATA_DIR.mkdir(exist_ok=True)

print("Setup complete. Data directory:", DATA_DIR.resolve())

Setup complete. Data directory: /content/data



### 0.1) Configure Model / API Keys

You can use **Google Gemini**, **OpenAI**, or any other provider supported by DSPy.
Set the env var(s) appropriately, then initialize the DSPy model.



In [4]:
import os
import dspy

# Configure API key: read it from the environment; never hard-code a real key in a notebook.
GEMINI_API_KEY = os.environ.get("GEMINI_API_KEY", "YOUR_API_KEY_HERE")
GEMINI_MODEL = "gemini/gemini-2.5-flash"  # alternative: gemini-2.0-flash


### 0.2) Initialize DSPy with your chosen language model

Let's set up a simple LLM in DSPy. **Fill in the code cell below to:**
- Import dspy
- Set up a language model (you can use a placeholder for the API key)
- Configure dspy to use this LLM


In [5]:
# If DSPy is installed, this will work. Otherwise, treat as reference code.
try:
    import dspy
    # Initialize a Gemini-based LM for DSPy (e.g., Gemini-2.5-flash)
    llm = dspy.LM(
        model=GEMINI_MODEL,
        api_key=GEMINI_API_KEY
    )
    dspy.settings.configure(lm=llm)
    print("DSPy initialized with model:", GEMINI_MODEL)
except Exception as e:
    print("DSPy not available or failed to initialize:", e)


DSPy initialized with model: gemini/gemini-2.5-flash


## Try Calling the LLM

Write a code cell to call the LLM with a simple prompt, e.g., 'Say this is a test!'.

Note! If this does not work, most likely something is wrong with the setup of your LLM.

In [6]:
llm("Say: this is a test!", temperature=0.7)  # => ['This is a test!']

['This is a test!']

You can also use the traditional role format: `messages=[{"role": "user", "content": "Say this is not a test!"}]`.
Try it here.

In [7]:
# TODO: Call the LLM with the messages format

['This is not a test!']

<details>
<summary>Click to show solution</summary>

```python
llm(messages=[{"role": "user", "content": "Say this is not a test!"}])  # => ['This is not a test!']
```
</details>

## DSPy Signatures and Modules

**Exercise:** Define a simple DSPy signature for sentiment classification.

- Create a class `Classify` inheriting from `dspy.Signature`
- Add input and output fields for sentence, sentiment, and confidence
- Instantiate a Predict module and use it on a sample sentence

In [9]:
from typing import Literal
class Classify(dspy.Signature):
    """Classify sentiment of a given sentence."""

    sentence: str = dspy.InputField()
    sentiment: Literal["positive", "negative", "neutral"] = dspy.OutputField()
    confidence: float = dspy.OutputField()

classify = dspy.Predict(Classify)
classify(sentence="This book was super fun to read, though not the last chapter.")

Prediction(
    sentiment='positive',
    confidence=0.75
)

In [11]:
classify.history

[{'prompt': None,
  'messages': [{'role': 'system',
    'content': "Your input fields are:\n1. `sentence` (str):\nYour output fields are:\n1. `sentiment` (Literal['positive', 'negative', 'neutral']): \n2. `confidence` (float):\nAll interactions will be structured in the following way, with the appropriate values filled in.\n\n[[ ## sentence ## ]]\n{sentence}\n\n[[ ## sentiment ## ]]\n{sentiment}        # note: the value you produce must exactly match (no extra characters) one of: positive; negative; neutral\n\n[[ ## confidence ## ]]\n{confidence}        # note: the value you produce must be a single float value\n\n[[ ## completed ## ]]\nIn adhering to this structure, your objective is: \n        Classify sentiment of a given sentence."},
   {'role': 'user',
    'content': "[[ ## sentence ## ]]\nThis book was super fun to read, though not the last chapter.\n\nRespond with the corresponding output fields, starting with the field `[[ ## sentiment ## ]]` (must be formatted as a valid Pytho

In [10]:
classify.inspect_history()

[2025-08-25T14:02:42.665811]

System message:

Your input fields are:
1. `sentence` (str):
Your output fields are:
1. `sentiment` (Literal['positive', 'negative', 'neutral']): 
2. `confidence` (float):
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## sentence ## ]]
{sentence}

[[ ## sentiment ## ]]
{sentiment}        # note: the value you produce must exactly match (no extra characters) one of: positive; negative; neutral

[[ ## confidence ## ]]
{confidence}        # note: the value you produce must be a single float value

[[ ## completed ## ]]
In adhering to this structure, your objective is: 
        Classify sentiment of a given sentence.


User message:

[[ ## sentence ## ]]
This book was super fun to read, though not the last chapter.

Respond with the corresponding output fields, starting with the field `[[ ## sentiment ## ]]` (must be formatted as a valid Python Literal['positive', 'negative',
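
The `[[ ## field ## ]]` markers in the prompt above are what let a free-text completion be mapped back onto the signature's output fields. The following is a simplified sketch of that idea, not DSPy's actual adapter code; the function name `parse_fields` and the sample completion are assumptions made for illustration.

```python
import re

def parse_fields(completion: str) -> dict:
    """Split a marker-delimited completion into {field_name: value}."""
    # re.split with one capture group yields [preamble, name1, body1, name2, body2, ...].
    parts = re.split(r"\[\[ ## (\w+) ## \]\]", completion)
    fields = {}
    for name, body in zip(parts[1::2], parts[2::2]):
        if name != "completed":  # the end marker carries no value
            fields[name] = body.strip()
    return fields

# Hypothetical completion in the format requested by the system message.
sample = """[[ ## sentiment ## ]]
positive

[[ ## confidence ## ]]
0.75

[[ ## completed ## ]]"""

parsed = parse_fields(sample)
print(parsed["sentiment"], float(parsed["confidence"]))
```

Values come back as strings; converting `confidence` to `float` or checking `sentiment` against the allowed literals is a separate step driven by the type hints in the signature.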


---
## 1) Warm-Up (GenAI Basics): Product Description Generator (FMCG)

**Goal:** See how style, tone, and temperature affect outputs.

**Task:** Given a product name & features, generate a short marketing description in 3 tones.
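
To build intuition for the temperature part of the goal, here is a minimal sketch of how temperature rescales a next-token distribution before sampling. The candidate words and their scores are made up for illustration; real providers apply the same idea over their full vocabulary, and details vary by provider.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next words (assumed values).
candidates = ["refreshing", "zesty", "adequate"]
logits = [2.0, 1.0, 0.1]

for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, {w: round(p, 3) for w, p in zip(candidates, probs)})
```

At low temperature almost all probability mass lands on the top candidate, so outputs become repetitive; at high temperature the mass spreads out and wording varies more between runs. In DSPy, temperature can be passed per call, as in the earlier `llm("Say: this is a test!", temperature=0.7)` cell.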


In [12]:
# Different Tones
product = "Sunburst Orange Juice"
tones = ["Formal", "Casual", "Punchy / Ad-like"]

In [13]:
# ✅ Define signature
class ProductDescription(dspy.Signature):
    """Generate a product description given tone and product."""
    tone = dspy.InputField()
    product = dspy.InputField()
    description = dspy.OutputField()

# Predict module
gen = dspy.Predict(ProductDescription)

for tone in tones:
    result = gen(tone=tone, product=product)
    print(f"\n--- {tone} ---\n{result.description}")


--- Formal ---
We proudly present Sunburst Orange Juice, a distinguished beverage crafted from the finest, sun-ripened oranges. Each serving offers a meticulously balanced profile of natural sweetness and invigorating tang, designed to provide a refreshing and revitalizing experience. Our commitment to quality ensures that every glass delivers the pure essence of premium citrus, making Sunburst Orange Juice an exemplary choice for discerning palates seeking both exquisite taste and wholesome refreshment.

--- Casual ---
Hey there, looking for a little pick-me-up? Grab a glass of Sunburst Orange Juice! It's super refreshing, bursting with that classic, sunny orange flavor you love. Perfect for breakfast, a midday boost, or just chilling out. Seriously, it's like sunshine in a bottle – you can't go wrong!

--- Punchy / Ad-like ---
Tired of the same old? Crave a burst of pure sunshine? Grab Sunburst Orange Juice! We're talking 100% pure, squeezed-from-the-source, vibrant orange goodness.

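**Note:** No prompt was written by hand here. DSPy assembled it from the `ProductDescription` signature, as the history below shows.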

In [23]:
gen.inspect_history()

[2025-08-24T06:41:54.304390]

System message:

Your input fields are:
1. `tone` (str): 
2. `product` (str):
Your output fields are:
1. `description` (str):
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## tone ## ]]
{tone}

[[ ## product ## ]]
{product}

[[ ## description ## ]]
{description}

[[ ## completed ## ]]
In adhering to this structure, your objective is: 
        Generate a product description given tone and product.


User message:

[[ ## tone ## ]]
Punchy / Ad-like

[[ ## product ## ]]
SunBurst Orange Juice

Respond with the corresponding output fields, starting with the field `[[ ## description ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.


Response:

[[ ## description ## ]]
Tired of the same old? Ignite your day with SunBurst Orange Juice! We're talking pure, unadulterated sunshine in a bottle. Each sip is a vibrant explosion of fresh-squeezed, zesty 

In [8]:

product = {
    "name": "SunBurst Orange Juice",
    "features": ["No added sugar", "100% pure", "Rich in Vitamin C", "Cold-pressed"],
    "audience": "health-conscious millennials"
}

prompt = f'''
You are a marketing copywriter. Write a concise, compelling product description
for the FMCG product below in 3 distinct tones: (1) formal, (2) casual, (3) ad-like punchy.
Each variant should be 2-3 sentences max.

Product: {product["name"]}
Key features: {", ".join(product["features"])}
Target audience: {product["audience"]}
'''

try:
    import dspy
    gen = dspy.Predict("instructions -> descriptions")
    out = gen(instructions=prompt).descriptions
    print(out)
except Exception:
    # Fallback if DSPy isn't available: print the prompt for reference.
    print("DSPy not available; here's the prompt you can try with your LLM:")
    print(prompt)


DSPy not available; here's the prompt you can try with your LLM:

You are a marketing copywriter. Write a concise, compelling product description
for the FMCG product below in 3 distinct tones: (1) formal, (2) casual, (3) ad-like punchy.
Each variant should be 2-3 sentences max.

Product: SunBurst Orange Juice
Key features: No added sugar, 100% pure, Rich in Vitamin C, Cold-pressed
Target audience: health-conscious millennials






---
## 2) Maps Mini-Project: Street Name Normalization & Matching

**Problem:** Real-world street names vary (`"MG Road"`, `"M.G. Rd"`, `"Mahatma Gandhi Road"`).  
**Goal:** Normalize variants to a canonical form and match duplicates.

We'll combine:
- **LLM-based normalization** (expand abbreviations, fix casing, remove punctuation)
- **String similarity** via `rapidfuzz` for robust matching


In [None]:

import pandas as pd
from rapidfuzz import fuzz, process

# Sample data with variants
streets = pd.DataFrame({
    "raw_street": [
        "MG Road", "M.G. Rd", "Mahatma Gandhi Rd", "Mahatma Gandhi Road",
        "St John's Rd", "Saint Johns Road", "St. John’s Rd", "St. John Road",
        "Nehru Marg", "Jawaharlal Nehru Marg", "J L Nehru Marg",
        "Ring Rd", "Outer Ring Road", "Outer Rng Rd"
    ]
})

streets.to_csv("data/streets_raw.csv", index=False)
streets.head()



### 2.1) LLM Normalizer (DSPy)

We'll create a simple **Signature** and **Predictor** that maps a raw street name → canonical normalized name.


In [None]:

normalizer_spec = """
Given an Indian street name variant, return a clean, canonical, expanded form:
- Expand common abbreviations (e.g., 'Rd' → 'Road', 'St' → 'Saint' when it's a person's name; else 'Street' if context suggests)
- Remove unnecessary punctuation
- Use Title Case
- Prefer full names (e.g., 'MG' → 'Mahatma Gandhi' when unambiguous)
Return only the normalized name, no extra text.
"""

try:
    import dspy

    class NormalizeStreet(dspy.Signature):
        raw_name = dspy.InputField()
        normalized = dspy.OutputField(desc="normalized, canonical street name")

    normalize = dspy.Predict(NormalizeStreet)

    def llm_normalize(name: str) -> str:
        # Pass the guidelines along with the name so the model sees the rules.
        r = normalize(raw_name=f"{name}\n\nGuidelines:\n{normalizer_spec}")
        return r.normalized.strip()

except Exception as e:
    print("DSPy not available; falling back to a rule-based normalizer:", e)
    import re

    # Patterns run on lowercased text after punctuation is stripped,
    # so "M.G. Rd" has already become "mg rd" by the time they apply.
    ABBR = {
        r"\brd\b": "road",
        r"^st\b": "saint",      # a leading "St" is usually a saint's name
        r"\bst\b": "street",
        r"\bmg\b": "mahatma gandhi",
        r"\bj ?l\b": "jawaharlal",
        r"\brng\b": "ring",
    }
    def rule_normalize(text: str) -> str:
        # Strip punctuation first; otherwise "m.g." never matches the "mg" rule.
        t = re.sub(r"[.’']", "", text.lower())
        t = re.sub(r"\s+", " ", t).strip()
        for pat, rep in ABBR.items():
            t = re.sub(pat, rep, t)
        return t.title()

    def llm_normalize(name: str) -> str:
        return rule_normalize(name)


In [None]:

df = pd.read_csv("data/streets_raw.csv")
df["normalized"] = df["raw_street"].apply(llm_normalize)
df.head(10)



### 2.2) Fuzzy Matching to Group Duplicates


In [None]:

# Group streets by similarity of their normalized form
# We'll use a simple threshold; in production, tune per locale and evaluate with ground truth.
threshold = 90

unique_norms = df["normalized"].unique().tolist()
clusters = []
visited = set()

for s in unique_norms:
    if s in visited:
        continue
    # Find close matches (s always matches itself with a score of 100)
    matches = process.extract(s, unique_norms, scorer=fuzz.token_sort_ratio, limit=None)
    # Skip names already assigned to an earlier cluster so no name is reassigned.
    group = [m[0] for m in matches if m[1] >= threshold and m[0] not in visited]
    clusters.append(group)
    visited.update(group)

# Map each row to a cluster id
cluster_map = {}
for idx, group in enumerate(clusters):
    for g in group:
        cluster_map[g] = idx

df["cluster_id"] = df["normalized"].map(cluster_map)
df.sort_values(["cluster_id", "normalized"])



**Exercise:** Try changing the `threshold` to see how clusters merge/split.  
**Discussion:** When to trust LLM normalization vs rules; human-in-the-loop QA for map data.
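
To make the threshold exercise measurable, one option is to score predicted clusters against a small hand-labeled ground truth using pairwise precision and recall. This is a minimal sketch; the `gold` and `predicted` labelings below are invented for illustration and are not output from the cells above.

```python
from itertools import combinations

def same_cluster_pairs(labels: dict) -> set:
    """Return the set of unordered item pairs that share a cluster id."""
    return {
        frozenset((a, b))
        for a, b in combinations(sorted(labels), 2)
        if labels[a] == labels[b]
    }

def pairwise_precision_recall(predicted: dict, gold: dict):
    """Compare which pairs are grouped together in each labeling."""
    pred_pairs = same_cluster_pairs(predicted)
    gold_pairs = same_cluster_pairs(gold)
    hits = len(pred_pairs & gold_pairs)
    precision = hits / len(pred_pairs) if pred_pairs else 1.0
    recall = hits / len(gold_pairs) if gold_pairs else 1.0
    return precision, recall

# Hypothetical labels: street name -> cluster id.
gold = {"MG Road": 0, "Mahatma Gandhi Road": 0, "Ring Road": 1, "Outer Ring Road": 2}
predicted = {"MG Road": 0, "Mahatma Gandhi Road": 0, "Ring Road": 1, "Outer Ring Road": 1}

precision, recall = pairwise_precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A threshold that is too low merges distinct streets, which lowers precision; one that is too high leaves true variants split, which lowers recall. With the real data you could build `predicted` from `df.set_index("normalized")["cluster_id"].to_dict()`.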



---
## 3) FMCG Mini-Project: Reviews → Insights → Actions

**Goal:** Generate synthetic reviews for a new product, summarize themes, extract insights, and recommend actions.


In [None]:

product = "SunBurst Orange Juice"
aspects = ["taste", "price", "packaging", "availability", "healthiness"]

try:
    import dspy

    class ReviewSynth(dspy.Signature):
        product = dspy.InputField()
        aspects = dspy.InputField()
        reviews = dspy.OutputField(desc="10 diverse, short customer reviews")

    synth = dspy.Predict(ReviewSynth)
    reviews_text = synth(product=product, aspects=aspects).reviews
except Exception:
    # Fallback: sample static reviews
    reviews_text = """
1) Great taste but a bit pricey.
2) Love the no-sugar claim; feels healthy.
3) Packaging leaks if kept sideways.
4) Hard to find at my local store.
5) Kids enjoy it; refreshing and pulpy.
6) Price is okay during discounts.
7) Wish there was a smaller pack size.
8) Tastes natural, not too sweet.
9) Outer packaging is attractive.
10) Delivery took long; store was out of stock.
"""

print(reviews_text)



### 3.1) Summarize & Extract Insights


In [None]:

try:
    import dspy

    class SummarizeReviews(dspy.Signature):
        reviews = dspy.InputField()
        summary = dspy.OutputField(desc="pros, cons, notable quotes")

    class ExtractInsights(dspy.Signature):
        summary = dspy.InputField()
        insights = dspy.OutputField(desc="3-5 crisp insights with evidence")

    summarize = dspy.ChainOfThought(SummarizeReviews)
    extract = dspy.Predict(ExtractInsights)

    summary = summarize(reviews=reviews_text).summary
    insights = extract(summary=summary).insights

    print("SUMMARY:\n", summary)
    print("\nINSIGHTS:\n", insights)

except Exception:
    print("DSPy not available; here is a template prompt you can run with your LLM:")
    print("""
Summarize the following reviews into pros, cons, and notable quotes. Then provide 3-5 crisp insights:
""")
    print(reviews_text)



---
## 4) Agentic AI with DSPy: Compose a Pipeline

We'll build a 3-stage pipeline:
1. **Summarizer** – condense reviews/sales text
2. **Insight Generator** – extract trends/causes
3. **Recommender** – propose next actions (pricing, packaging, distribution, marketing)

You'll see: how **modules** wrap LLM calls, how to **swap models**, and how to **optimize prompts**.


In [None]:

try:
    import dspy

    class Summarizer(dspy.Module):
        def __init__(self):
            super().__init__()
            class Sig(dspy.Signature):
                text = dspy.InputField()
                summary = dspy.OutputField()
            self.step = dspy.ChainOfThought(Sig)
        def forward(self, text):
            return self.step(text=text).summary

    class InsightGen(dspy.Module):
        def __init__(self):
            super().__init__()
            class Sig(dspy.Signature):
                summary = dspy.InputField()
                insights = dspy.OutputField()
            self.step = dspy.Predict(Sig)
        def forward(self, summary):
            return self.step(summary=summary).insights

    class Recommender(dspy.Module):
        def __init__(self):
            super().__init__()
            class Sig(dspy.Signature):
                insights = dspy.InputField()
                actions = dspy.OutputField()
            self.step = dspy.Predict(Sig)
        def forward(self, insights):
            return self.step(insights=insights).actions

    class FMCGPipeline(dspy.Module):
        def __init__(self):
            super().__init__()
            self.summarizer = Summarizer()
            self.insightgen = InsightGen()
            self.recommender = Recommender()

        def forward(self, text):
            summary = self.summarizer(text=text)
            insights = self.insightgen(summary=summary)
            actions = self.recommender(insights=insights)
            return dict(summary=summary, insights=insights, actions=actions)

    pipeline = FMCGPipeline()

    sample_text = reviews_text
    result = pipeline(text=sample_text)
    print("SUMMARY:\n", result["summary"])
    print("\nINSIGHTS:\n", result["insights"])
    print("\nACTIONS:\n", result["actions"])

except Exception as e:
    print("DSPy not available or the pipeline failed:", e)
    print("Logical flow you can implement with any LLM:")
    print("1) Summarize -> 2) Extract Insights -> 3) Recommend Actions")



### 4.1) (Optional) DSPy Optimization

DSPy supports **teleprompter**-style optimization (called *optimizers* in recent releases) given labeled examples.  
Below is a minimal sketch (fill `train_data` with (input, target) pairs).


In [None]:

try:
    import dspy

    # Minimal demo dataset (toy). Replace with real (input, target) pairs.
    train_data = [
        dict(text="Pricey but delicious. Hard to find locally.", target_actions="Run local availability campaign; limited-time discount"),
        dict(text="Leaky packaging. Love the no sugar.", target_actions="Improve cap seal; emphasize health benefit in ads"),
    ]

    class ActionsTeacher(dspy.Signature):
        text = dspy.InputField()
        actions = dspy.OutputField()

    # A tiny trainer that pretends "actions" is the supervised target.
    class TinyTrainer(dspy.Module):
        def __init__(self):
            super().__init__()
            self.pipeline = FMCGPipeline()
        def forward(self, text):
            out = self.pipeline(text=text)
            return out["actions"]

    # In real usage, use dspy.teleprompt.BootstrapFewShot or similar.
    # Here we simply run the pipeline on training data as illustration.
    trainer = TinyTrainer()
    for ex in train_data:
        _ = trainer(text=ex["text"])
    print("Optimization sketch complete (replace with DSPy teleprompters in real training).")

except Exception as e:
    print("Skipping optimization sketch due to:", e)
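
A real optimizer such as `dspy.teleprompt.BootstrapFewShot` needs a metric that scores a prediction against the labeled target. The sketch below shows one possible metric based on word overlap; the function name, the 0.3 cutoff, and the stand-in objects are assumptions chosen for illustration, and the metric is deliberately crude.

```python
from types import SimpleNamespace

def action_overlap_metric(example, pred, trace=None, min_overlap=0.3):
    """Pass if the Jaccard word overlap between target and predicted actions is high enough."""
    target = set(example.target_actions.lower().replace(";", " ").split())
    # Accept either an object with an `.actions` attribute or a plain string.
    text = getattr(pred, "actions", pred)
    predicted = set(str(text).lower().replace(";", " ").split())
    if not target or not predicted:
        return False
    jaccard = len(target & predicted) / len(target | predicted)
    return jaccard >= min_overlap

# Stand-ins for a labeled example and two pipeline outputs (hypothetical values).
example = SimpleNamespace(target_actions="Improve cap seal; emphasize health benefit in ads")
good = SimpleNamespace(actions="Improve the cap seal and emphasize the health benefit")
bad = "Launch a loyalty program"

print(action_overlap_metric(example, good), action_overlap_metric(example, bad))
```

With DSPy installed and an LM configured, a function with this `(example, pred, trace=None)` shape is what you would pass as `metric=` to `BootstrapFewShot`, with training items wrapped as `dspy.Example(...).with_inputs("text")`. For free-text actions, an LLM-judged metric is often a better fit than word overlap.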



---
## 5) Stretch Goals
- Add a **retrieval** step (RAG) for product manuals/FAQs before generating actions.
- Use a **validator** module to check if actions are grounded in the summary.
- For Maps: add **house-number parsing**, **localization**, and **confidence scoring**.
- Log prompts/outputs and build a small **evaluation harness** with golden test cases.
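
The evaluation-harness idea in the last bullet can start very small. Below is a minimal sketch that runs any `text -> actions` callable over golden cases and checks for required keywords; `fake_pipeline`, the cases, and the keyword check are all hypothetical placeholders for your own pipeline and criteria.

```python
def run_golden_cases(pipeline, cases):
    """Run each case and record whether every required keyword appears in the output."""
    results = []
    for case in cases:
        output = pipeline(case["text"]).lower()
        missing = [kw for kw in case["must_mention"] if kw.lower() not in output]
        results.append({"text": case["text"], "passed": not missing, "missing": missing})
    return results

# Hypothetical stand-in; swap in e.g. `lambda t: pipeline(text=t)["actions"]`.
def fake_pipeline(text: str) -> str:
    if "leak" in text.lower():
        return "Improve the cap seal and audit packaging suppliers."
    return "Run a limited-time discount."

golden_cases = [
    {"text": "Packaging leaks if kept sideways.", "must_mention": ["seal", "packaging"]},
    {"text": "Great taste but a bit pricey.", "must_mention": ["discount"]},
    {"text": "Hard to find at my local store.", "must_mention": ["distribution"]},
]

report = run_golden_cases(fake_pipeline, golden_cases)
pass_rate = sum(r["passed"] for r in report) / len(report)
print(f"pass rate: {pass_rate:.0%}")
for r in report:
    if not r["passed"]:
        print("FAILED:", r["text"], "missing:", r["missing"])
```

Keyword checks are brittle, but they catch regressions cheaply when you swap models or change signatures; the same loop can later call a validator module instead.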



---
## 6) Troubleshooting

- **No Internet?** Skip installs, read through code, and run later on a connected machine.
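- **API errors?** Check the key and model set in section 0.1 (`GEMINI_API_KEY`, `GEMINI_MODEL`); if you use OpenAI instead, check the `OPENAI_API_KEY`, `OPENAI_BASE_URL`, and `OPENAI_MODEL` env vars.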
- **DSPy version mismatch?** Adjust the LM initialization to your version.
- **String matching too strict?** Lower the threshold or use another scorer.
- **Time check:** Generated on 2025-08-23 02:32:58.
