<a href="https://colab.research.google.com/github/nidheesh-p/AI-Learning/blob/master/Market_Research_Agent_RAG_Tavily_Detailed.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Market Research Agent — RAG (FAISS) + Tavily Web Search


## What you'll learn
- Build a FAISS-based retriever from CSV documents (company profiles & trend snippets).  
- Integrate a web search tool (Tavily) for fresh, external evidence.  
- Design a Planner → Coordinator → Tools → Draft → Reflect loop (single-agent).  
- Add JSON memory for traceability and citations.  
- Implement a simple evaluation harness and exercises for learners.



## 1) Setup & Notes about running this notebook
- This notebook includes cells that require internet (model downloads, Tavily).  
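- If `TAVILY_API_KEY` is not set, the web search function returns an offline stub instead of calling the API.  
- The notebook uses two CSVs, `company_profiles_rag_detailed.csv` and `market_trends_rag_detailed.csv`. The loading cell reads them from a Google Drive folder, and the agent memory file is written under `/mnt/data`.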
- We use `sentence-transformers` (`all-MiniLM-L6-v2`) for embeddings and `faiss-cpu` for vector storage.
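
Before running the heavier cells, it can help to check which dependencies are importable. The following is a minimal sketch using only the standard library; the `REQUIRED` mapping and the `missing_packages` helper are illustrative names, not part of the notebook.

```python
import importlib.util
import os

# pip names and import names differ for some packages (e.g. faiss-cpu -> faiss).
REQUIRED = {
    "faiss-cpu": "faiss",
    "sentence-transformers": "sentence_transformers",
    "tavily-python": "tavily",
    "requests": "requests",
    "pandas": "pandas",
}

def missing_packages(required=REQUIRED):
    """Return the pip names whose import module cannot be found."""
    return [pip_name for pip_name, module in required.items()
            if importlib.util.find_spec(module) is None]

missing = missing_packages()
if missing:
    print("Missing packages, install with: pip install " + " ".join(missing))
else:
    print("All required packages are importable.")

# The web search cell falls back to a stub when this prints False.
print("TAVILY_API_KEY set:", bool(os.getenv("TAVILY_API_KEY")))
```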


In [None]:

# (Run this cell to install packages if needed)
# pip install faiss-cpu sentence-transformers tavily-python requests openai pandas
print('Install these packages if you run the notebook in a fresh environment: faiss-cpu, sentence-transformers, tavily-python, requests, openai, pandas')


Install these packages if you run the notebook in a fresh environment: faiss-cpu, sentence-transformers, tavily-python, requests, openai, pandas


## 2) Load datasets (we included two CSVs for you)

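Paths used by the loading cell (adjust them to wherever you saved the files):
- `/content/gdrive/My Drive/Market Research Data/company_profiles_rag_detailed.csv`
- `/content/gdrive/My Drive/Market Research Data/market_trends_rag_detailed.csv`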
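
Because the files may sit in different places depending on the environment, one option is to probe a few folders before calling `pd.read_csv`. This is a sketch under assumptions: `CANDIDATE_DIRS` and `find_csv` are hypothetical names, and the folder list should be edited to match your setup.

```python
from pathlib import Path

# Hypothetical candidate folders; edit these to match where your CSVs live.
CANDIDATE_DIRS = [
    Path("/content/gdrive/My Drive/Market Research Data"),
    Path("/mnt/data"),
    Path("."),
]

def find_csv(filename, search_dirs=CANDIDATE_DIRS):
    """Return the first existing path for `filename`, or raise listing the paths tried."""
    tried = []
    for folder in search_dirs:
        candidate = Path(folder) / filename
        if candidate.is_file():
            return candidate
        tried.append(str(candidate))
    raise FileNotFoundError(f"{filename} not found; tried: {tried}")

# Usage (uncomment once the files are in one of the folders above):
# company_csv = find_csv("company_profiles_rag_detailed.csv")
# trend_csv = find_csv("market_trends_rag_detailed.csv")
```

The error message lists every location that was checked, which makes a wrong Drive folder name easier to spot.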

In [None]:
import pandas as pd
from google.colab import drive

# Mount Google Drive so the CSVs stored there are readable from Colab.
drive.mount('/content/gdrive')

# Adjust these paths if your CSVs live in a different Drive folder.
company_csv = r"/content/gdrive/My Drive/Market Research Data/company_profiles_rag_detailed.csv"
trend_csv = r"/content/gdrive/My Drive/Market Research Data/market_trends_rag_detailed.csv"

companies = pd.read_csv(company_csv)
trends = pd.read_csv(trend_csv)

print('Companies:')
display(companies.head(10))
print('\nTrends:')
display(trends.head(10))


Mounted at /content/gdrive


FileNotFoundError: [Errno 2] No such file or directory: '/content/gdrive/My Drive/Market Research Data/company_profiles_rag_detailed.csv'


### Quick EDA (Exploratory Data Analysis)
We show simple counts and distributions to help learners understand the data shapes.


In [None]:

# Simple EDA
print('Companies by Industry:')
print(companies['Industry'].value_counts())
print('\nTrends by Industry:')
print(trends['Industry'].value_counts())


Companies by Industry:
Industry
Healthcare AI        2
AI & Cloud           1
Electric Vehicles    1
FinTech              1
ClimateTech          1
AgriTech             1
EdTech               1
Name: count, dtype: int64

Trends by Industry:
Industry
Healthcare AI        1
AI & Cloud           1
Electric Vehicles    1
FinTech              1
AgriTech             1
EdTech               1
ClimateTech          1
Name: count, dtype: int64



## 3) Build the RAG index (Embeddings + FAISS)
**Why:** RAG lets the agent retrieve semantically relevant documents from local knowledge (CSV) instead of exact string matching.
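
To make the retrieval step concrete, here is a pure-Python sketch of exhaustive nearest-neighbour search by squared L2 distance, which is conceptually what a flat L2 index does. The three-dimensional vectors are made up for illustration and are not real model embeddings; `l2_search` is a hypothetical helper, not a FAISS function.

```python
def l2_search(query_vec, doc_vecs, top_k=2):
    """Brute-force search: return (squared_distance, index) pairs, closest first."""
    scored = []
    for idx, vec in enumerate(doc_vecs):
        dist = sum((q - v) ** 2 for q, v in zip(query_vec, vec))
        scored.append((dist, idx))
    scored.sort()  # smallest distance first
    return scored[:top_k]

# Toy "embeddings" (invented values, not output of all-MiniLM-L6-v2).
toy_docs = ["hospital imaging AI", "crop irrigation sensors", "patient monitoring"]
toy_vecs = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.8, 0.3, 0.1]]
query_vec = [1.0, 0.0, 0.0]  # pretend this encodes "healthcare AI"

for dist, idx in l2_search(query_vec, toy_vecs, top_k=2):
    print(f"{dist:.2f}  {toy_docs[idx]}")
```

The two health-related documents rank above the irrigation one even though none of them shares an exact string with the query, which is the point of embedding-based retrieval.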


In [None]:
!pip install faiss-cpu
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# Prepare docs
company_docs = companies.apply(lambda r: f"Company: {r['CompanyName']} | Industry: {r['Industry']} | Desc: {r['Description']}", axis=1).tolist()
trend_docs = trends.apply(lambda r: f"Industry: {r['Industry']} | Headline: {r['Headline']} | Snippet: {r['Snippet']} | Date: {r['Date']}", axis=1).tolist()
all_docs = company_docs + trend_docs
print(f"Preparing to embed {len(all_docs)} documents")

# Load model and create embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')  # downloads once
embeddings = model.encode(all_docs, convert_to_numpy=True, show_progress_bar=True)

# Build FAISS index
dim = embeddings.shape[1]
index = faiss.IndexFlatL2(dim)
index.add(np.array(embeddings).astype('float32'))

# id->doc mapping
id2doc = {i: d for i, d in enumerate(all_docs)}

def rag_search(query, top_k=4):
    q_emb = model.encode([query], convert_to_numpy=True)
    D, I = index.search(np.array(q_emb).astype('float32'), top_k)
    return [id2doc[int(i)] for i in I[0]]


Collecting faiss-cpu
  Downloading faiss_cpu-1.12.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (5.1 kB)
Downloading faiss_cpu-1.12.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (31.4 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31.4/31.4 MB 59.1 MB/s eta 0:00:00
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.12.0
Preparing to embed 15 documents


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



### Test RAG retrieval with example queries

In [None]:

print('RAG hits for "healthcare AI":')
for r in rag_search('healthcare AI', top_k=3):
    print('-', r[:300])
print('\nRAG hits for "precision farming":')
for r in rag_search('precision farming', top_k=3):
    print('-', r[:300])


RAG hits for "healthcare AI":
- Industry: Healthcare AI | Headline: AI diagnostic tools gain regulatory approvals | Snippet: Several imaging and diagnostic tools received approval, leading to pilot deployments in regional hospitals. | Date: 2024-11-10
- Company: MediCore Health | Industry: Healthcare AI | Desc: Clinical decision support, diagnostic imaging AI, and patient monitoring platforms used by hospitals.
- Company: BioSense | Industry: Healthcare AI | Desc: AI models for non-invasive diagnostics and remote patient monitoring sensors.

RAG hits for "precision farming":
- Company: AgriNext | Industry: AgriTech | Desc: AI + IoT for precision farming, predictive irrigation, crop yield optimization.
- Industry: AgriTech | Headline: Smart irrigation pilots save water | Snippet: Pilot farms show 25-35% water savings using predictive irrigation. | Date: 2024-07-14
- Company: TechNova | Industry: AI & Cloud | Desc: Scalable AI cloud solutions, low-latency inference, enterprise workflows,


## 4) Web Search Tool — Tavily
**Why add web search?** RAG is limited to your local snapshots. Adding a web-search tool lets the agent fetch *fresh, time-sensitive* info (news, blog posts, regulatory updates).

> Set `TAVILY_API_KEY` as an environment variable. If not set, a stub will be used so the notebook stays runnable offline.


In [None]:

import os
import requests

TAVILY_API_KEY = os.getenv('TAVILY_API_KEY')

def web_search_tavily(query, num_results=3):
    if not TAVILY_API_KEY:
        # Offline stub: a list with one plain string instead of result dicts.
        return [f"[stub] No TAVILY_API_KEY set. Would search for: {query}"]

    url = "https://api.tavily.com/search"
    headers = {"Content-Type": "application/json"}
    # Tavily's search endpoint names the result limit `max_results`.
    payload = {"api_key": TAVILY_API_KEY, "query": query, "max_results": num_results}
    resp = requests.post(url, json=payload, headers=headers, timeout=20)
    resp.raise_for_status()  # surface HTTP errors instead of parsing an error body
    data = resp.json()
    results = []
    for r in data.get('results', []):
        # Fall back across key names in case the response schema varies.
        title = r.get('title') or r.get('heading') or ''
        snippet = r.get('content') or r.get('snippet') or ''
        link = r.get('link') or r.get('url') or ''
        results.append({'title': title, 'snippet': snippet, 'link': link})
    return results
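
The offline stub returns plain strings while live results are dicts, so downstream code has to handle both shapes. One way to avoid that, sketched here with a hypothetical `normalize_hits` helper, is to map every hit to the same dict layout before using it.

```python
def normalize_hits(hits):
    """Map raw search hits (dicts or plain strings) to a uniform list of dicts."""
    normalized = []
    for hit in hits:
        if isinstance(hit, dict):
            normalized.append({
                "title": hit.get("title") or "",
                "snippet": hit.get("snippet") or hit.get("content") or "",
                "link": hit.get("link") or hit.get("url") or "",
            })
        else:
            # Offline stub entries are plain strings with no title or link.
            normalized.append({"title": "", "snippet": str(hit), "link": ""})
    return normalized

# Example with one stub string and one dict-shaped hit (invented values).
sample = ["[stub] No TAVILY_API_KEY set.", {"title": "T", "content": "C", "url": "U"}]
for hit in normalize_hits(sample):
    print(hit)
```

With this in place, callers can read `hit["snippet"]` and `hit["link"]` without an `isinstance` check.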


### Test Tavily web search (will use stub if API key missing)

In [None]:

print(web_search_tavily('global AI healthcare market 2025', num_results=2))


['[stub] No TAVILY_API_KEY set. Would search for: global AI healthcare market 2025']



## 5) Agent Design — Planner → Coordinator → Tools → Draft → Reflect

We'll implement a single `MarketResearchAgent` class with:
- `planner(query)` → returns a fixed four-step execution plan.
- `gather(query)` → calls both tools (RAG and Tavily) and collects their hits as evidence; the `prefer_web` flag is accepted but not yet used.
- `draft(query, evidence)` → calls an LLM (here the `MockLLM`) to produce a draft brief.
- `critique_and_revise(draft)` → reflection: critiques the draft and revises it.
- A JSON memory file stores notes, drafts, and citations.
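
The `gather` method in the cell below calls both tools on every query and leaves `prefer_web` unused. As a sketch of what a tool-selection step could look like, the hypothetical `choose_tools` function here routes on simple cue words; the cue list is an assumption for illustration, not a tested heuristic.

```python
import re

# Hypothetical cue words suggesting the query needs fresh, external evidence.
FRESHNESS_CUES = re.compile(
    r"\b(latest|recent|news|today|outlook|forecast|20\d{2})\b", re.I
)

def choose_tools(query, has_web_key=True, prefer_web=False):
    """Return the ordered list of tools to call for `query`."""
    tools = ["rag"]  # local knowledge is cheap, so always consult it
    wants_fresh = prefer_web or bool(FRESHNESS_CUES.search(query))
    if wants_fresh and has_web_key:
        # Put web first only when the caller explicitly prefers it.
        tools = ["web", "rag"] if prefer_web else ["rag", "web"]
    return tools

print(choose_tools("Healthcare AI market outlook 2025"))
print(choose_tools("Who makes diagnostic imaging AI?"))
```

A `gather` implementation could loop over the returned list and skip the web call entirely when it is absent, which also saves API quota.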


In [None]:
import os, json, time, re
from typing import List, Dict

# Minimal JSON memory
class JSONMemory:
    def __init__(self, path='/mnt/data/agent_memory_rag_tavily.json'):
        self.path = path
        # Create the parent folder so the first write cannot fail on a missing directory.
        os.makedirs(os.path.dirname(self.path) or '.', exist_ok=True)
        if not os.path.exists(self.path):
            self.write({'notes': [], 'drafts': [], 'citations': []})

    def load(self):
        with open(self.path) as f:
            return json.load(f)

    def write(self, data):
        with open(self.path, 'w') as f:
            json.dump(data, f, indent=2)

    def add_note(self, note):
        d = self.load()
        d['notes'].append({'ts': time.time(), 'note': note})
        self.write(d)

    def add_citation(self, src):
        d = self.load()
        d['citations'].append(src)
        self.write(d)

    def add_draft(self, draft):
        d = self.load()
        d['drafts'].append({'ts': time.time(), 'draft': draft})
        self.write(d)

# Mock LLM functions for offline use
class MockLLM:
    def draft(self, prompt):
        # create a short mock draft using heuristics
        return """Market Brief (mock)
- ICP: SMBs and mid-market
- Value Prop: Faster time-to-value
- Strengths: focused product, low TCO
- Risks: competition, pricing pressure
- Sources: [rag:0], [web:0]
"""
    def critique(self, draft):
        return "CRITIQUE: Add a metric and ensure at least one explicit citation. Clean up fluff."
    def revise(self, draft, critique):
        return draft + "\n\n[Revised: tightened bullets; added metric 20% YoY growth (example)]"

mock_llm = MockLLM()

class MarketResearchAgent:
    def __init__(self, memory:JSONMemory):
        self.memory = memory

    def planner(self, query:str)->List[str]:
        plan = ["Run RAG (internal docs)", "Run Tavily web search (freshness)", "Draft brief", "Critique & Revise"]
        self.memory.add_note('planner created')
        return plan

    def gather(self, query:str, top_k_rag=3, top_k_web=2, prefer_web=False)->List[Dict]:
        evidence = []

        rag_hits = rag_search(query, top_k=top_k_rag)
        for i,h in enumerate(rag_hits):
            evidence.append({'type':'rag', 'id':i, 'text':h})
            self.memory.add_citation(f'rag:{i}')

        web_hits = web_search_tavily(query, num_results=top_k_web)
        for i,h in enumerate(web_hits):
            evidence.append({
                'type':'web',
                'id':i,
                'text': h.get('snippet') if isinstance(h, dict) else str(h),
                'link': h.get('link') if isinstance(h, dict) else None
            })
            self.memory.add_citation(f'web:{i}')

        self.memory.add_note('gathered evidence')
        return evidence

    def draft(self, query:str, evidence:List[Dict])->str:
        # Build prompt
        context = '\n'.join([f"[{e['type']}:{e['id']}] {e['text']}" for e in evidence])
        prompt = f"Draft a concise market brief for: {query}\nUse the evidence below:\n{context}\nKeep it 4-8 bullets and include citations."
        # Call LLM (mock or real)
        draft = mock_llm.draft(prompt)
        self.memory.add_draft(draft)
        return draft

    def critique_and_revise(self, draft:str)->str:
        critique = mock_llm.critique(draft)
        revised = mock_llm.revise(draft, critique)
        self.memory.add_draft(revised)
        return revised


## 6) Example end-to-end run (Planner → Gather → Draft → Revise)
Run this cell to see the outputs step by step.

In [None]:
# Make sure memory directory exists before using JSONMemory
import os
os.makedirs("/mnt/data", exist_ok=True)

# Initialize memory + agent
mem = JSONMemory(path="/mnt/data/agent_memory_rag_tavily.json")
agent = MarketResearchAgent(mem)

# Run workflow
query = 'Healthcare AI market outlook 2025'
print('Plan:')
print(agent.planner(query))

evidence = agent.gather(query, top_k_rag=3, top_k_web=2)
print('\nCollected Evidence (first 400 chars each):')
for e in evidence:
    txt = e['text'] if isinstance(e['text'], str) else str(e['text'])
    print('-', e['type'], e.get('id'), ':', txt[:400])

draft = agent.draft(query, evidence)
print('\nDraft:\n', draft)

final = agent.critique_and_revise(draft)
print('\nFinal Revised Draft:\n', final)

print('\nMemory snapshot:')
print(mem.load())


Plan:
['Run RAG (internal docs)', 'Run Tavily web search (freshness)', 'Draft brief', 'Critique & Revise']

Collected Evidence (first 400 chars each):
- rag 0 : Industry: Healthcare AI | Headline: AI diagnostic tools gain regulatory approvals | Snippet: Several imaging and diagnostic tools received approval, leading to pilot deployments in regional hospitals. | Date: 2024-11-10
- rag 1 : Company: MediCore Health | Industry: Healthcare AI | Desc: Clinical decision support, diagnostic imaging AI, and patient monitoring platforms used by hospitals.
- rag 2 : Industry: AI & Cloud | Headline: AI cloud adoption surges among SMEs | Snippet: Managed AI platforms report YoY growth ~40% as SMEs outsource model infra. | Date: 2024-10-05
- web 0 : [stub] No TAVILY_API_KEY set. Would search for: Healthcare AI market outlook 2025

Draft:
 Market Brief (mock)
- ICP: SMBs and mid-market
- Value Prop: Faster time-to-value
- Strengths: focused product, low TCO
- Risks: competition, pricing pressure
- Sources: [rag:0], [web:0]


## 7) Evaluation harness
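We implement a small heuristic evaluator. Each check scores 0 or 1:
- Relevance: the text mentions the topic (here, the substring `health`).
- Evidence: the memory file holds at least one citation.
- Specificity: the text contains a four-digit number (such as a year) or a percentage.
- Clarity: the text contains a `-` character, a rough proxy for bullet points.
- Freshness: at least one citation in memory comes from web search.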
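
The relevance check in the evaluator cell looks for the substring `health`, so it only fits healthcare queries. A possible generalization, sketched here with hypothetical helpers `query_keywords` and `relevance_score` and an assumed stopword list, derives the keywords from the query itself and matches whole tokens.

```python
import re

# Assumed stopword list; "market" is included because it appears in most queries here.
STOPWORDS = {"the", "a", "an", "of", "for", "and", "in", "on", "to", "market"}

def query_keywords(query):
    """Lower-cased alphabetic tokens from the query, minus stopwords and short tokens."""
    tokens = re.findall(r"[a-z]+", query.lower())
    return {t for t in tokens if len(t) > 2 and t not in STOPWORDS}

def relevance_score(text, query, min_overlap=1):
    """Return 1 if at least `min_overlap` query keywords appear in the text, else 0."""
    text_tokens = set(re.findall(r"[a-z]+", text.lower()))
    overlap = query_keywords(query) & text_tokens
    return 1 if len(overlap) >= min_overlap else 0

query = "Healthcare AI market outlook 2025"
print(query_keywords(query))
print(relevance_score("Healthcare AI adoption grew 20% in 2024", query))
print(relevance_score("- ICP: SMBs and mid-market", query))
```

Note that a generic mock draft scores 0 under this check, which is a useful signal that the draft ignored the query.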


In [None]:

import re, json

def evaluate(text: str, memory_json_path='/mnt/data/agent_memory_rag_tavily.json'):
    """Score a draft with five binary heuristics (1 = pass, 0 = fail)."""
    with open(memory_json_path) as f:
        mem = json.load(f)
    citations = mem.get('citations', [])

    scores = {}
    scores['Relevance'] = 1 if 'health' in text.lower() else 0  # topic is hard-coded
    scores['Evidence'] = 1 if len(citations) > 0 else 0
    scores['Specificity'] = 1 if re.search(r'\d{4}|\d+%', text) else 0
    scores['Clarity'] = 1 if '-' in text else 0
    scores['Freshness'] = 1 if any(c.startswith('web:') for c in citations) else 0
    return scores

print('Eval example (from final draft):')
# `final` is defined by the end-to-end cell in section 6.
print('Call evaluate(final) after running the agent cells to get scores')


Eval example (from final draft):
Call evaluate(final) after running the agent cells to get scores



## 8) Extensions & exercises
- Replace `MockLLM` with OpenAI or another LLM, and keep API keys out of the notebook (for example, in environment variables or Colab secrets).  
- Add caching for `rag_search` results and incremental index updates.  
- Build a "Researcher" agent that runs web searches in parallel and a "Writer" agent that drafts.  
- Add a Streamlit app to query the agent interactively.  
- Add guardrails: schema validation for drafts (pydantic) and rejection rules if no citations.
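
As a starting point for the guardrail exercise, here is a plain-Python sketch (a stand-in for a pydantic schema, not a pydantic example) that rejects drafts with no citations or with citations that point at evidence the agent never collected. `validate_draft` is a hypothetical helper; it assumes citations use the `[rag:N]` / `[web:N]` format seen in the mock draft and that evidence items carry `type` and `id` keys as in `gather`.

```python
import re

CITATION_RE = re.compile(r"\[(rag|web):(\d+)\]")

def validate_draft(draft, evidence):
    """Return a list of problems; an empty list means the draft passes."""
    problems = []
    cited = {(kind, int(num)) for kind, num in CITATION_RE.findall(draft)}
    available = {(e["type"], e["id"]) for e in evidence}
    if not cited:
        problems.append("draft contains no citations")
    for kind, num in sorted(cited - available):
        problems.append(f"citation [{kind}:{num}] has no matching evidence")
    return problems

# Invented evidence list with a single RAG hit.
evidence = [{"type": "rag", "id": 0, "text": "example"}]
print(validate_draft("- Sources: [rag:0], [web:0]", evidence))
print(validate_draft("- Sources: [rag:0]", evidence))
```

An agent could call this after `draft` and trigger another revision round whenever the returned list is non-empty.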
