<a href="https://colab.research.google.com/github/p4r1ch4y/clirnet_assignment/blob/main/dspy_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

DSPy Assignment


Time  : 5/11/25 5:30pm

Subrata Choudhury - Backend Engineer - Round 1

**1. Install all the dependencies**

In [1]:
!pip install dspy-ai requests beautifulsoup4 pandas lxml


Collecting dspy-ai
  Downloading dspy_ai-3.0.3-py3-none-any.whl.metadata (285 bytes)
Collecting dspy>=3.0.3 (from dspy-ai)
  Downloading dspy-3.0.3-py3-none-any.whl.metadata (7.2 kB)
Collecting backoff>=2.2 (from dspy>=3.0.3->dspy-ai)
  Downloading backoff-2.2.1-py3-none-any.whl.metadata (14 kB)
Collecting optuna>=3.4.0 (from dspy>=3.0.3->dspy-ai)
  Downloading optuna-4.5.0-py3-none-any.whl.metadata (17 kB)
Collecting magicattr>=0.1.6 (from dspy>=3.0.3->dspy-ai)
  Downloading magicattr-0.1.6-py2.py3-none-any.whl.metadata (3.2 kB)
Collecting litellm>=1.64.0 (from dspy>=3.0.3->dspy-ai)
  Downloading litellm-1.79.1-py3-none-any.whl.metadata (30 kB)
Collecting diskcache>=5.6.0 (from dspy>=3.0.3->dspy-ai)
  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Collecting json-repair>=0.30.0 (from dspy>=3.0.3->dspy-ai)
  Downloading json_repair-0.52.5-py3-none-any.whl.metadata (11 kB)
Collecting asyncer==0.0.8 (from dspy>=3.0.3->dspy-ai)
  Downloading asyncer-0.0.8-py3-none-any.whl.m

***2. Import libraries***

In [2]:
import dspy
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
from typing import List
from pydantic import BaseModel, Field

3. Add API Keys

In [3]:
API_KEY="redacted"
main_lm = dspy.LM("openai/LongCat-Flash-Chat", api_key=API_KEY, api_base="https://api.longcat.chat/openai/v1")

dspy.settings.configure(lm=main_lm, adapter=dspy.XMLAdapter())
print("DSPy set up! Hope the key works.")


DSPy set up! Hope the key works.
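Hard-coding the key in a cell makes it easy to leak when the notebook is shared. A minimal sketch of reading it from the environment instead is shown below. The helper `load_api_key` and the variable name `LONGCAT_API_KEY` are assumptions made for illustration; they are not part of the original notebook.

```python
import os

def load_api_key(var_name: str = "LONGCAT_API_KEY") -> str:
    """Return the API key stored in an environment variable.

    `LONGCAT_API_KEY` is a hypothetical name; use whatever your setup defines.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable before running the notebook.")
    return key
```

With this in place, `API_KEY = load_api_key()` could replace the literal assignment in the cell above.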


# 4. URLs to be scraped

---



In [4]:
urls = [
    "https://en.wikipedia.org/wiki/Sustainable_agriculture",
    "https://www.nature.com/articles/d41586-025-03353-5",
    "https://www.sciencedirect.com/science/article/pii/S1043661820315152",
    "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10457221/",
    "https://www.fao.org/3/y4671e/y4671e06.htm",
    "https://www.medscape.com/viewarticle/time-reconsider-tramadol-chronic-pain-2025a1000ria",
    "https://www.sciencedirect.com/science/article/pii/S0378378220307088",
    "https://www.frontiersin.org/news/2025/09/01/rectangle-telescope-finding-habitable-planets",
    "https://www.medscape.com/viewarticle/second-dose-boosts-shingles-protection-adults-aged-65-years-2025a1000ro7",
    "https://www.theguardian.com/global-development/2025/oct/13/astro-ambassadors-stargazers-himalayas-hanle-ladakh-india"
]


5. Scrape each page with a dummy User-Agent to avoid being blocked

In [6]:
# ---------------------------------------------------------
# WEB SCRAPER AGENT With Logging MODULE
# ---------------------------------------------------------

def get_text_from_url(url):
    try:
        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}
        response = requests.get(url, headers=headers, timeout=15)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        for tag in soup(['script', 'style', 'nav', 'header', 'footer']):
            tag.decompose()
        paragraphs = soup.find_all('p')
        text = ' '.join([p.get_text() for p in paragraphs])
        text = ' '.join(text.split())
        if len(text) > 4000:
            text = text[:4000] + " ... (text cut off to fit)"
        return text, None
    except Exception as e:
        return "", str(e)
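The scraper cuts the text at exactly 4000 characters, which can split the last word in half. A small sketch of a variant that backs up to the previous word boundary is given below. `clip_text` is a hypothetical helper, not something the notebook defines, and the 4000-character limit and marker string are simply copied from the cell above.

```python
def clip_text(text: str, limit: int = 4000,
              marker: str = " ... (text cut off to fit)") -> str:
    """Collapse whitespace, then truncate at the last word boundary before `limit`."""
    text = " ".join(text.split())
    if len(text) <= limit:
        return text
    clipped = text[:limit]
    # If the cut lands inside a word, drop the partial word at the end.
    if text[limit] != " " and " " in clipped:
        clipped = clipped.rsplit(" ", 1)[0]
    return clipped.rstrip() + marker


print(clip_text("alpha   beta\n gamma", limit=8))
```

Short inputs pass through unchanged apart from whitespace normalization, so the helper could stand in for the `if len(text) > 4000` branch without affecting pages that already fit.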


In [7]:
# ---------------------------------------------------------
# 2. ENTITY EXTRACTION with Deduplication and Relation Models MODULE
# ---------------------------------------------------------

class EntityWithAttr(BaseModel):
    entity: str = Field(description="the named entity")
    attr_type: str = Field(description="semantic type of the entity (e.g. Drug, Disease, etc.)")

class ExtractEntities(dspy.Signature):
    paragraph: str = dspy.InputField(desc="input paragraph")
    entities: List[EntityWithAttr] = dspy.OutputField(desc="list of entities and their attr types")

extractor = dspy.Predict(ExtractEntities)

class DeduplicateEntities(dspy.Signature):
    items: List[EntityWithAttr] = dspy.InputField(desc="batch of entities to deduplicate")
    deduplicated: List[EntityWithAttr] = dspy.OutputField(desc="deduplicated list")
    confidence: float = dspy.OutputField(desc="confidence (0-1) that every item is distinct")

dedup_predictor = dspy.ChainOfThought(DeduplicateEntities)

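def deduplicate_with_lm(items: List[EntityWithAttr], batch_size: int = 10,
                        target_confidence: float = 0.9, max_attempts: int = 3):
    """Deduplicate entities batch by batch, retrying a batch until the LM is confident enough."""
    if not items:
        return []

    def _process_batch(batch):
        pred = None
        # Cap the retries so a batch whose confidence never reaches the
        # target cannot loop forever.
        for _ in range(max(1, max_attempts)):
            pred = dedup_predictor(items=batch)
            if pred.confidence >= target_confidence:
                break
        # After the last attempt, accept the most recent result as a best effort.
        return pred.deduplicated

    results = []
    for i in range(0, len(items), batch_size):
        batch = items[i : i + batch_size]
        results.extend(_process_batch(batch))
    return results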

class Relation(BaseModel):
    subj: str = Field(description="subject entity (exact string as in deduplicated list)")
    pred: str = Field(description="short predicate / relation")
    obj: str = Field(description="object entity (exact string as in deduplicated list)")

class ExtractRelations(dspy.Signature):
    paragraph: str = dspy.InputField(desc="original paragraph")
    entities: List[str] = dspy.InputField(desc="list of deduplicated entity strings")
    relations: List[Relation] = dspy.OutputField(desc="list of subject-predicate-object triples")

rel_predictor = dspy.ChainOfThought(ExtractRelations)
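Because the LM deduplicates each batch of 10 on its own, two copies of the same entity that land in different batches are never compared. One possible mitigation is a cheap deterministic pass over the full list before batching. The sketch below uses a plain dataclass `Entity` as a stand-in for `EntityWithAttr` so that it runs without pydantic; both `Entity` and `dedupe_exact` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Entity:
    """Stand-in for EntityWithAttr (illustration only)."""
    entity: str
    attr_type: str


def dedupe_exact(items: List[Entity]) -> List[Entity]:
    """Drop entries whose name and type repeat an earlier entry, ignoring case and extra spaces."""
    seen = set()
    unique = []
    for item in items:
        key = (" ".join(item.entity.lower().split()), item.attr_type.strip().lower())
        if key not in seen:
            seen.add(key)
            unique.append(item)  # keep the first spelling encountered
    return unique


sample = [Entity("Tramadol", "Drug"), Entity("tramadol ", "drug"), Entity("Chronic pain", "Disease")]
print(dedupe_exact(sample))
```

This only catches exact repeats; near-duplicates such as abbreviations would still need the LM-based step.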


8. Convert data to Mermaid

In [8]:
# ---------------------------------------------------------
# 5. MERMAID DIAGRAM GENERATOR
# ---------------------------------------------------------

import re

def triples_to_mermaid(triples: list, entity_list: list, max_label_len: int = 40) -> str:
    """
    Convert triples to a Mermaid `flowchart LR` diagram.

    A triple is kept only if its subject or object appears in `entity_list`.
    """
    entity_set = {e.strip().lower() for e in entity_list}
    lines = ["flowchart LR"]

    def _make_id(s: str) -> str:
        # Mermaid node IDs: keep letters, digits and underscores only.
        return re.sub(r"\W+", "_", s.strip()).strip("_") or "node"

    def _escape(s: str) -> str:
        # A double quote would end the quoted label early; use Mermaid's entity code.
        return s.strip().replace('"', "#quot;")

    for t in triples:
        subj_norm, obj_norm = t.subj.strip().lower(), t.obj.strip().lower()
        if subj_norm not in entity_set and obj_norm not in entity_set:
            continue
        # Keep the subject -> object direction so the predicate reads correctly.
        src, dst = t.subj, t.obj
        lbl = t.pred.strip().replace("|", "/")  # "|" delimits the edge label
        if len(lbl) > max_label_len:
            lbl = lbl[:max_label_len - 3] + "..."
        src_id, dst_id = _make_id(src), _make_id(dst)
        lines.append(f'    {src_id}["{_escape(src)}"] -->|{lbl}| {dst_id}["{_escape(dst)}"]')
    return "\n".join(lines)
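To make the output format concrete, here is a minimal sketch of how a single triple maps to one Mermaid edge line. `Triple` and `edge_line` are hypothetical stand-ins written for this illustration, and the sample triple is made up rather than taken from the scraped pages.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """Stand-in for the Relation model (illustration only)."""
    subj: str
    pred: str
    obj: str


def edge_line(t: Triple) -> str:
    """Render one triple as a Mermaid edge with underscore-joined node IDs."""
    src_id = t.subj.strip().replace(" ", "_")
    dst_id = t.obj.strip().replace(" ", "_")
    return f'    {src_id}["{t.subj}"] -->|{t.pred}| {dst_id}["{t.obj}"]'


diagram = "\n".join(["flowchart LR", edge_line(Triple("Crop rotation", "improves", "Soil health"))])
print(diagram)
```

The node ID (`Crop_rotation`) must be free of spaces, while the quoted label keeps the original entity text for display.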

In [16]:
# ---------------------------------------------------------
# 9. MAIN PROCESSING and File Generation Logging
# ---------------------------------------------------------

process_logs = []

def process_url(url, index):
    log = {'url': url, 'index': index+1, 'scrape': '', 'entities': '', 'relations': '', 'mermaid_save': ''}
    paragraph, scrape_err = get_text_from_url(url)
    if not paragraph:
        mermaid_code = "flowchart LR\n    No_data[\"No text scraped - empty graph\"]"
        filename = f'mermaid_{index+1:02d}.md'
        try:
            with open(filename, 'w', encoding='utf-8') as f:
                f.write(f"```mermaid\n{mermaid_code}\n```")
            log['mermaid_save'] = 'OK (empty graph)'
        except Exception as e:
            log['mermaid_save'] = f'ERROR {e}'
        log['scrape'] = f'ERROR {scrape_err}'
        process_logs.append(log)
        return {'url': url, 'entities': [], 'relations': [], 'mermaid': mermaid_code, 'log': log}
    log['scrape'] = 'OK'

    time.sleep(2)
    try:
        extracted = extractor(paragraph=paragraph)
        if not extracted.entities:
            log['entities'] = 'ERROR (zero extracted)'
        else:
            log['entities'] = f'OK ({len(extracted.entities)})'
    except Exception as e:
        extracted = None
        log['entities'] = f'ERROR {e}'

    if not extracted or not extracted.entities:
        mermaid_code = "flowchart LR\n    No_entities[\"No entities - empty graph\"]"
        filename = f'mermaid_{index+1:02d}.md'
        try:
            with open(filename, 'w', encoding='utf-8') as f:
                f.write(f"```mermaid\n{mermaid_code}\n```")
            log['mermaid_save'] = 'OK (empty graph)'
        except Exception as e:
            log['mermaid_save'] = f'ERROR {e}'
        process_logs.append(log)
        return {'url': url, 'entities': [], 'relations': [], 'mermaid': mermaid_code, 'log': log}

    unique = deduplicate_with_lm(extracted.entities, batch_size=10, target_confidence=0.9)
    entity_strings = [e.entity for e in unique]

    try:
        rel_out = rel_predictor(paragraph=paragraph, entities=entity_strings)
        rels_ok = len(rel_out.relations) if rel_out else 0
        log['relations'] = f'OK ({rels_ok})'
    except Exception as e:
        rel_out = None
        log['relations'] = f'ERROR {e}'

    relations = rel_out.relations if rel_out else []

    mermaid_code = triples_to_mermaid(relations, entity_strings)
    filename = f'mermaid_{index+1:02d}.md'
    try:
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(f"```mermaid\n{mermaid_code}\n```")
        log['mermaid_save'] = 'OK'
    except Exception as e:
        log['mermaid_save'] = f'ERROR {e}'
    process_logs.append(log)

    return {
        'url': url,
        'entities': unique,
        'relations': relations,
        'mermaid': mermaid_code,
        'log': log
    }
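`process_url` repeats the same write-to-file block in three places. One way to factor it out is sketched below. `save_mermaid` and its `out_dir` parameter are hypothetical and not part of the notebook; the filename pattern and the status strings follow the cell above.

```python
import os
import tempfile

FENCE = "`" * 3  # Markdown code fence, built here to avoid nesting literal fences


def save_mermaid(index: int, mermaid_code: str, out_dir: str = ".") -> str:
    """Write one diagram to mermaid_XX.md inside a mermaid fence and return a status string."""
    filename = os.path.join(out_dir, f"mermaid_{index + 1:02d}.md")
    try:
        with open(filename, "w", encoding="utf-8") as f:
            f.write(f"{FENCE}mermaid\n{mermaid_code}\n{FENCE}")
        return "OK"
    except OSError as e:
        return f"ERROR {e}"


# Usage: write into a throwaway directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    print(save_mermaid(0, "flowchart LR\n    A --> B", out_dir=tmp))
```

Each branch of `process_url` could then set `log['mermaid_save'] = save_mermaid(index, mermaid_code)`, appending `" (empty graph)"` to the status where needed.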

10. **Process URLs and generate files** (*Wait a bit while it scrapes and processes each URL and calls the functions*)

In [17]:
all_results = []
for i, url in enumerate(urls):
    try:
        result = process_url(url, i)
        all_results.append(result)
    except Exception as e:
        print(f"Unhandled error on URL {i+1}: {e}")


**CSV file generation**

In [18]:
# ---------------------------------------------------------
# 7. CSV FILE GENERATION and ERROR LOGGING MODULE
# ---------------------------------------------------------

csv_rows = []
for result in all_results:
    url = result['url']
    for ent in result['entities']:
        csv_rows.append({
            'link': url,
            'tag': ent.entity,
            'tag_type': ent.attr_type
        })

csv_status = ""
df = pd.DataFrame(csv_rows)
if not df.empty:
    try:
        df = df.drop_duplicates()
        df.to_csv('tags.csv', index=False)
        csv_status = f'OK ({len(df)} rows)'
    except Exception as e:
        csv_status = f'ERROR {e}'


    csv_content = df.to_csv(index=False)
    md_content = f"# Tags CSV Export\n\n```csv\n{csv_content}```"
    try:
        with open('tags.md', 'w', encoding='utf-8') as f:
            f.write(md_content)
    except Exception as e:
        print(f"Error saving tags.md: {e}")
    print(f"Saved tags.csv, csv_status: {csv_status}")
else:
    csv_status = "ERROR (nothing to save)"
    with open('tags.md', 'w', encoding='utf-8') as f:
        f.write("# Tags CSV Export\n\nNo data available.")
    print("No entities for CSV :(")

Saved tags.csv, csv_status: OK (250 rows)
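`drop_duplicates()` removes a row only when every column matches, so the same tag found under two different links is kept once per link. The sketch below reproduces that flatten-and-deduplicate step with the standard library alone. `rows_to_csv` is a hypothetical helper and the sample rows are invented for illustration.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize (link, tag, tag_type) dicts to CSV text, dropping exact duplicate rows."""
    fields = ["link", "tag", "tag_type"]
    seen = set()
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        key = tuple(row[f] for f in fields)
        if key in seen:
            continue  # same link, tag and type already written
        seen.add(key)
        writer.writerow(row)
    return buf.getvalue()


sample_rows = [
    {"link": "u1", "tag": "Tramadol", "tag_type": "Drug"},
    {"link": "u1", "tag": "Tramadol", "tag_type": "Drug"},  # exact duplicate, dropped
    {"link": "u2", "tag": "Tramadol", "tag_type": "Drug"},  # different link, kept
]
print(rows_to_csv(sample_rows))
```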


**12. Scraping and generation summary**

In [19]:
# ---------------------------------------------------------
# 8. Log Summary
# ---------------------------------------------------------

df_log = pd.DataFrame(process_logs)
print("\nPROCESS LOG SUMMARY:")
print(df_log)

df_log.to_csv('process_log.csv', index=False)

errors_only = df_log[
    (df_log['scrape'].str.startswith('ERROR')) |
    (df_log['entities'].str.startswith('ERROR')) |
    (df_log['relations'].str.startswith('ERROR')) |
    (df_log['mermaid_save'].str.startswith('ERROR'))
]
print("\nERRORS FOUND:")
print(errors_only if not errors_only.empty else "No errors, all OK!")



PROCESS LOG SUMMARY:
                                                 url  index  \
0  https://en.wikipedia.org/wiki/Sustainable_agri...      1   
1  https://www.nature.com/articles/d41586-025-033...      2   
2  https://www.sciencedirect.com/science/article/...      3   
3  https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...      4   
4          https://www.fao.org/3/y4671e/y4671e06.htm      5   
5  https://www.medscape.com/viewarticle/time-reco...      6   
6  https://www.sciencedirect.com/science/article/...      7   
7  https://www.frontiersin.org/news/2025/09/01/re...      8   
8  https://www.medscape.com/viewarticle/second-do...      9   
9  https://www.theguardian.com/global-development...     10   

                                              scrape entities relations  \
0                                                 OK  OK (46)   OK (48)   
1                                                 OK  OK (40)   OK (24)   
2  ERROR 403 Client Error: Forbidden for url: htt...       
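The error filter treats a URL as failed when any stage status starts with `ERROR`. The same rule can be sketched on plain dictionaries, which also shows which stage failed. `STAGES` and `failed_stages` are hypothetical names, and the two log entries are invented examples shaped like the rows in `process_logs`.

```python
STAGES = ("scrape", "entities", "relations", "mermaid_save")


def failed_stages(log: dict) -> list:
    """Return the names of pipeline stages whose status string starts with 'ERROR'."""
    return [s for s in STAGES if str(log.get(s, "")).startswith("ERROR")]


logs = [
    {"url": "https://example.com/a", "scrape": "OK", "entities": "OK (3)",
     "relations": "OK (2)", "mermaid_save": "OK"},
    {"url": "https://example.com/b", "scrape": "ERROR 403", "entities": "",
     "relations": "", "mermaid_save": "OK (empty graph)"},
]
for log in logs:
    print(log["url"], failed_stages(log) or "all OK")
```

Stages that never ran keep an empty status, so they are not counted as errors; only the stage that actually failed is reported.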

Summary

In [20]:
total_entities = sum(len(r['entities']) for r in all_results)
total_rels = sum(len(r['relations']) for r in all_results)
print("=" * 50)
print("ASSIGNMENT COMPLETE!")
print("=" * 50)
print(f"URLs processed: {len(urls)}")
print(f"Total deduplicated entities: {total_entities}")
print(f"Total relations: {total_rels}")
print("Files: tags.csv, tags.md (CSV & Markdown), 10 mermaid_XX.md graphs, process_log.csv (log)")
print("Mermaid diagrams are already embedded. Paste the Mermaid code into mermaid.live if you wish to view the code and graphs.")
print(f"CSV Save Status: {csv_status}")


ASSIGNMENT COMPLETE!
URLs processed: 10
Total deduplicated entities: 251
Total relations: 255
Files: tags.csv, tags.md (CSV & Markdown), 10 mermaid_XX.md graphs, process_log.csv (log)
Mermaid diagrams are already embedded. Paste the Mermaid code into mermaid.live if you wish to view the code and graphs.
CSV Save Status: OK (250 rows)
