# AI-Powered Brochure Generator (Website Scraper + Intelligent Navigation)

This notebook generates a clean **Markdown brochure** for a company by:
1) Scraping the homepage
2) Selecting high-signal pages (About, Product, Solutions, Pricing, Customers, Careers)
3) Optionally using an LLM to write a polished brochure

Outputs go to `outputs/`.


## What "intelligent navigation" means

Company sites have messy menus. I combine:
- **Heuristics**: prioritize common high-signal pages (about, product, solutions, pricing, customers, careers)
- **Optional LLM ranking**: pick the best pages from candidates and return structured JSON

You can run:
- **Heuristics-only** (no API key)
- **Heuristics + LLM** (better page selection and brochure quality)


## Setup

Run the next cell once. It creates a basic repo-friendly folder layout (uncomment the `%pip` line first if the dependencies are not installed yet):

- `src/` (optional export of the core code)
- `outputs/` (generated brochures)


In [None]:
# If you're running in a fresh environment, uncomment the pip install below.
# `openai` is only needed for the optional LLM steps.
# %pip install -U requests beautifulsoup4 lxml tiktoken python-dotenv pydantic openai

import os
from pathlib import Path

Path("outputs").mkdir(exist_ok=True)
Path("src").mkdir(exist_ok=True)

print("Ready. Folders created: outputs/, src/")


## Configuration

Set environment variables (recommended) or edit the defaults below.

### For OpenAI (optional)
- `OPENAI_API_KEY`
- `OPENAI_MODEL` (example: `gpt-4.1-mini`)

Tip: in a real repo, store these in a `.env` file and **do not commit it**.


In [None]:
# Optional: load a .env file if you use one locally (not required in hosted notebooks)
try:
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    pass  # python-dotenv is optional; fall back to the process environment

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "")
OPENAI_MODEL   = os.getenv("OPENAI_MODEL", "gpt-4.1-mini")

print("OPENAI_API_KEY set:", bool(OPENAI_API_KEY))
print("OPENAI_MODEL:", OPENAI_MODEL)


## GitHub notes

- Keep secrets out of Git: use environment variables or a local `.env` file (and add `.env` to `.gitignore`)
- Commit the notebook. Commit `outputs/` only if you want generated brochures in the repo; otherwise add `outputs/` to `.gitignore`


## Core utilities: fetching, parsing, and cleaning

I keep things polite and predictable:
- reasonable user-agent
- timeouts
- max pages to crawl
- skip in-page anchors and non-HTTP links (`mailto:`, `tel:`, `javascript:`, `data:`)


In [None]:
import re
import time
import json
import random
import urllib.parse
from dataclasses import dataclass
from typing import List, Dict, Tuple, Optional, Iterable

import requests
from bs4 import BeautifulSoup

DEFAULT_HEADERS = {
    "User-Agent": "ai-brochure-generator/1.0 (+https://github.com/yourname/yourrepo)"
}

def normalize_url(base_url: str, href: str) -> Optional[str]:
    """Return an absolute, clean URL or None if it should be ignored."""
    if not href:
        return None
    href = href.strip()

    # Ignore anchors / non-web schemes
    if href.startswith("#"):
        return None
    bad_prefixes = ("mailto:", "tel:", "javascript:", "data:")
    if href.lower().startswith(bad_prefixes):
        return None

    abs_url = urllib.parse.urljoin(base_url, href)
    parsed = urllib.parse.urlparse(abs_url)

    if parsed.scheme not in ("http", "https"):
        return None

    # Drop fragments
    parsed = parsed._replace(fragment="")
    return parsed.geturl()

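def same_site(a: str, b: str) -> bool:
    """True if both URLs have exactly the same netloc (host plus optional port).

    Note: 'www.example.com' and 'example.com' count as different sites here.
    """
    return urllib.parse.urlparse(a).netloc == urllib.parse.urlparse(b).netloc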

def fetch_html(url: str, timeout: int = 20) -> str:
    """Fetch a URL and return the body as text; raises requests.HTTPError on 4xx/5xx."""
    resp = requests.get(url, headers=DEFAULT_HEADERS, timeout=timeout)
    resp.raise_for_status()
    return resp.text

def extract_links(base_url: str, html: str) -> List[str]:
    """Return the absolute URLs of all <a href> links, de-duplicated in page order."""
    soup = BeautifulSoup(html, "lxml")
    urls = []
    for a in soup.select("a[href]"):
        u = normalize_url(base_url, a.get("href"))
        if u:
            urls.append(u)
    # De-dupe while preserving order (dict keys keep insertion order)
    return list(dict.fromkeys(urls))

def html_to_text(html: str) -> str:
    """Strip scripts, styles, and page chrome, then return the remaining visible text."""
    soup = BeautifulSoup(html, "lxml")

    # Remove obvious non-content elements
    for tag in soup(["script", "style", "noscript", "svg", "header", "footer", "nav", "aside"]):
        tag.decompose()

    text = soup.get_text(separator="\n")
    # Clean whitespace
    text = re.sub(r"\n{3,}", "\n\n", text)
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text.strip()
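
`same_site` compares `netloc` values exactly, so a homepage at `https://example.com` will not match links to `https://www.example.com`. If a site mixes both forms, a looser comparison may help. The sketch below is one possible variant and is not used by the pipeline in this notebook; `site_key` and `same_site_relaxed` are hypothetical names, and treating a leading `www.` (and the port) as irrelevant is an assumption.

```python
from urllib.parse import urlparse


def site_key(url: str) -> str:
    """Lowercased hostname without a leading 'www.' (hypothetical helper).

    Uses .hostname rather than .netloc, so any port is ignored as well.
    """
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def same_site_relaxed(a: str, b: str) -> bool:
    """Assumption: 'www.example.com' and 'example.com' are the same site."""
    return site_key(a) == site_key(b)


home = "https://example.com"
for link in ("https://www.example.com/about", "https://blog.example.com/post"):
    strict = urlparse(home).netloc == urlparse(link).netloc
    print(link, "| strict:", strict, "| relaxed:", same_site_relaxed(home, link))
```

Other subdomains such as `blog.example.com` still count as a different site in this variant, which keeps the crawl from wandering into blogs or docs portals.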


## Link scoring (heuristics)

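I score each URL by keyword: substrings that suggest a high-signal page add points, substrings that suggest a low-signal page subtract points, and longer paths get a small penalty.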

This is intentionally simple and transparent.


In [None]:
HIGH_SIGNAL = [
    ("about", 4.0),
    ("company", 3.0),
    ("product", 4.0),
    ("platform", 3.0),
    ("solutions", 4.0),
    ("pricing", 5.0),
    ("customers", 4.0),
    ("case-stud", 4.0),
    ("security", 2.0),
    ("trust", 2.0),
    ("careers", 3.0),
    ("jobs", 3.0),
    ("team", 2.0),
    ("contact", 1.0),
]

LOW_SIGNAL = [
    ("blog", -3.0),
    ("news", -2.0),
    ("press", -2.0),
    ("events", -2.0),
    ("privacy", -1.0),
    ("terms", -1.0),
    ("cookie", -1.0),
    ("login", -5.0),
    ("signin", -5.0),
    ("signup", -4.0),
    ("status", -2.0),
    ("docs", -1.5),  # docs can be useful, but often too detailed for a brochure
]

def heuristic_score(url: str) -> float:
    """Sum the weights of every keyword found anywhere in the URL, minus a path-length penalty."""
    u = url.lower()
    score = 0.0
    for k, w in HIGH_SIGNAL:
        if k in u:
            score += w
    for k, w in LOW_SIGNAL:
        if k in u:
            score += w
    # Slight preference for shorter paths
    path = urllib.parse.urlparse(url).path
    score -= 0.02 * len(path)
    return score

def rank_links(links: List[str], top_k: int = 20) -> List[Tuple[str, float]]:
    scored = [(u, heuristic_score(u)) for u in links]
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[:top_k]
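
As a worked example of the arithmetic, the sketch below re-implements the same rule with only four of the weights listed above, so it runs on its own. `WEIGHTS` and `score` are illustrative names, and the URLs are made up.

```python
from urllib.parse import urlparse

# A subset of the weights defined above, enough for a worked example.
WEIGHTS = {"pricing": 5.0, "about": 4.0, "blog": -3.0, "login": -5.0}


def score(url: str) -> float:
    u = url.lower()
    # Every keyword that appears anywhere in the URL contributes its weight.
    keyword_points = sum(w for k, w in WEIGHTS.items() if k in u)
    # Small penalty per path character, so shallow pages win ties.
    path_penalty = 0.02 * len(urlparse(url).path)
    return keyword_points - path_penalty


urls = [
    "https://example.com/blog/pricing-update",
    "https://example.com/login",
    "https://example.com/pricing",
    "https://example.com/about",
]
for u in sorted(urls, key=score, reverse=True):
    print(f"{score(u):6.2f}  {u}")
```

Here `/pricing` scores 5.0 - 0.02 * 8 = 4.84, while `/blog/pricing-update` matches both `pricing` (+5) and `blog` (-3) and pays 0.02 * 20 for its longer path, ending at 1.60. A URL can match several keywords at once, which is worth keeping in mind when tuning the weights.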


## Optional LLM step: pick best links and generate brochure

Two LLM calls:
1) **Pick links**: choose the best pages for a brochure from the candidate list
2) **Write brochure**: synthesize scraped text into a brochure in Markdown

If you do not have an API key, skip this and run heuristics-only mode.
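
The link-picking call asks for JSON, but chat models sometimes wrap it in extra prose. A minimal sketch of tolerant parsing is shown here with a hard-coded reply, so it runs without an API key. `parse_selected_links` is a hypothetical helper, and dropping any URL that was not among the candidates is an assumption about how strict the validation should be.

```python
import json
import re
from typing import List


def parse_selected_links(reply: str, candidates: List[str], max_pages: int = 6) -> List[str]:
    """Pull the "selected_links" list out of a model reply (hypothetical helper)."""
    match = re.search(r"\{.*\}", reply, re.S)  # from the first "{" to the last "}"
    if not match:
        raise ValueError("no JSON object found in reply")
    data = json.loads(match.group(0))
    allowed = set(candidates)
    picked: List[str] = []
    for url in data.get("selected_links") or []:
        # Assumption: only URLs that were offered as candidates are acceptable.
        if isinstance(url, str) and url in allowed and url not in picked:
            picked.append(url)
    return picked[:max_pages]


candidates = ["https://example.com/about", "https://example.com/pricing"]
reply = (
    "Here is my selection:\n"
    '{"selected_links": ["https://example.com/pricing", '
    '"https://example.com/made-up", "https://example.com/pricing"]}'
)
print(parse_selected_links(reply, candidates))
```

The invented URL and the duplicate are both dropped, leaving only the pricing page.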


In [None]:
def try_import_openai():
    try:
        from openai import OpenAI
        return OpenAI
    except Exception:
        return None

def llm_select_links(
    company_name: str,
    home_url: str,
    candidate_links: List[str],
    max_pages: int = 6
) -> List[str]:
    """Use an LLM to pick the most relevant links. Requires OPENAI_API_KEY."""
    OpenAI = try_import_openai()
    if not OpenAI or not OPENAI_API_KEY:
        raise RuntimeError("OpenAI client not available or OPENAI_API_KEY missing")

    client = OpenAI(api_key=OPENAI_API_KEY)

    system = (
        "You are a careful analyst. You select the most relevant company website pages "
        "to build a marketing brochure for prospective clients, investors, and recruits."
    )

    prompt = {
        "company_name": company_name,
        "home_url": home_url,
        "instructions": (
            "Pick the best pages for a brochure. Prefer About, Product/Platform, Solutions, "
            "Pricing, Customers/Case Studies, Careers/Team, Security/Trust, Contact. "
            "Avoid login pages, blog/news, policies. Return JSON only."
        ),
        "max_pages": max_pages,
        "candidate_links": candidate_links,
        "output_schema": {"selected_links": ["https://..."]}
    }

    resp = client.chat.completions.create(
        model=OPENAI_MODEL,
        messages=[
            {"role":"system","content":system},
            {"role":"user","content":json.dumps(prompt, indent=2)}
        ],
        temperature=0.2,
    )

    # message.content can be None (e.g. on a refusal), so fall back to an empty string
    content = (resp.choices[0].message.content or "").strip()

    # Robust-ish JSON extraction
    m = re.search(r"\{.*\}", content, re.S)
    if not m:
        raise ValueError(f"Model did not return JSON: {content[:500]}")
    data = json.loads(m.group(0))
    links = data.get("selected_links") or []
    # Basic validation: keep only URLs that were offered as candidates
    # (guards against invented links), de-duplicated in the model's order
    allowed = set(candidate_links)
    cleaned = []
    for u in links:
        if isinstance(u, str) and u in allowed and u not in cleaned:
            cleaned.append(u)
    return cleaned[:max_pages]

def llm_write_brochure(company_name: str, source_pages: Dict[str, str]) -> str:
    """Use an LLM to write a brochure in Markdown. Requires OPENAI_API_KEY."""
    OpenAI = try_import_openai()
    if not OpenAI or not OPENAI_API_KEY:
        raise RuntimeError("OpenAI client not available or OPENAI_API_KEY missing")

    client = OpenAI(api_key=OPENAI_API_KEY)

    # Keep the input bounded. In production you'd do chunking + summarization.
    max_chars_per_page = 12000
    compact = {u: t[:max_chars_per_page] for u, t in source_pages.items()}

    system = (
        "You are a senior product marketer. You write crisp, factual brochures. "
        "Do not invent facts. If something is unknown, omit it."
    )

    user = {
        "company_name": company_name,
        "task": (
            "Create a brochure in Markdown with these sections: "
            "1) One-line summary, 2) What they do, 3) Key products/solutions, "
            "4) Differentiators, 5) Proof (customers/case studies if available), "
            "6) Pricing (only if found), 7) Security/compliance (only if found), "
            "8) Careers snapshot (only if found), 9) CTA / next steps."
        ),
        "source_pages": compact
    }

    resp = client.chat.completions.create(
        model=OPENAI_MODEL,
        messages=[
            {"role":"system","content":system},
            {"role":"user","content":json.dumps(user)}
        ],
        temperature=0.3,
    )
    # message.content can be None, so fall back to an empty string
    return (resp.choices[0].message.content or "").strip()
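
The comment in `llm_write_brochure` mentions chunking as the production alternative to truncating each page. A minimal sketch of one way to do that is below: it packs whole paragraphs into chunks under a character budget. `chunk_text` is a hypothetical helper, and splitting on blank lines and counting characters rather than tokens are both assumptions.

```python
from typing import List


def chunk_text(text: str, max_chars: int = 12000) -> List[str]:
    """Pack paragraphs into chunks of at most max_chars characters (hypothetical helper)."""
    chunks: List[str] = []
    current = ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        # A single paragraph longer than the budget is hard-split.
        while len(para) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(para[:max_chars])
            para = para[max_chars:]
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # Adding this paragraph would overflow: close the chunk, start a new one.
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa\n\nbbbb\n\ncccc"
print(chunk_text(sample, max_chars=10))
```

Each chunk could then be summarized separately and the summaries combined before the final brochure call; that second stage is not shown here.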


## Crawl pipeline

This ties everything together:
- fetch homepage
- collect links
- rank links
- optionally ask LLM to pick final pages
- crawl selected pages and extract text
- optionally ask LLM to write brochure
- save to `outputs/`
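
The non-LLM steps above can be traced offline with a toy version of the pipeline. This is a sketch under stated assumptions, not the implementation used in this notebook: `FAKE_SITE`, `fake_fetch`, `LinkCollector`, `toy_score`, and `run_pipeline` are hypothetical names, the site content is made up, and the standard-library `html.parser` stands in for BeautifulSoup so the block has no dependencies.

```python
from html.parser import HTMLParser
from typing import Callable, Dict, List
from urllib.parse import urljoin, urlparse

# Made-up site content, keyed by URL, so no network access is needed.
FAKE_SITE = {
    "https://example.com/": (
        '<a href="/about">About</a> <a href="/blog">Blog</a> '
        '<a href="/pricing">Pricing</a> <a href="https://other.example.org/">Partner</a>'
    ),
    "https://example.com/about": "<p>About page text</p>",
    "https://example.com/blog": "<p>Blog page text</p>",
    "https://example.com/pricing": "<p>Pricing page text</p>",
}


def fake_fetch(url: str) -> str:
    return FAKE_SITE[url]


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href:
            self.hrefs.append(href)


def toy_score(url: str) -> float:
    weights = {"about": 4.0, "pricing": 5.0, "blog": -3.0}
    path = urlparse(url).path
    return sum(w for k, w in weights.items() if k in path) - 0.02 * len(path)


def run_pipeline(home_url: str, fetch: Callable[[str], str], max_pages: int = 2) -> Dict[str, str]:
    collector = LinkCollector()
    collector.feed(fetch(home_url))                            # fetch homepage
    links = [urljoin(home_url, h) for h in collector.hrefs]    # collect links
    host = urlparse(home_url).netloc
    links = [u for u in links if urlparse(u).netloc == host]   # same-site only
    ranked = sorted(links, key=toy_score, reverse=True)        # rank links
    return {u: fetch(u) for u in ranked[:max_pages]}           # crawl selected pages


pages = run_pipeline("https://example.com/", fake_fetch)
print(list(pages))
```

Passing the fetch function in as a parameter is a design choice made for this sketch; it is what lets the example (and any unit test) run without touching the network.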


In [None]:
@dataclass
class CrawlConfig:
    max_candidate_links: int = 50
    max_pages: int = 6
    timeout: int = 20
    sleep_s: float = 0.6  # be polite
    use_llm_for_links: bool = False
    use_llm_for_brochure: bool = False

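def crawl_for_brochure(company_name: str, home_url: str, cfg: CrawlConfig) -> Dict[str, str]:
    """Return {url: extracted_text} for the selected pages.

    The homepage is fetched for link discovery only; its own text is not included.
    A page that fails to download is kept, with an "[ERROR ...]" placeholder as its text.
    """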
    home_html = fetch_html(home_url, timeout=cfg.timeout)
    links = extract_links(home_url, home_html)
    links = [u for u in links if same_site(home_url, u)]

    ranked = rank_links(links, top_k=cfg.max_candidate_links)
    candidate_links = [u for u, _ in ranked]

    if cfg.use_llm_for_links:
        selected = llm_select_links(company_name, home_url, candidate_links, max_pages=cfg.max_pages)
    else:
        selected = candidate_links[:cfg.max_pages]

    pages: Dict[str, str] = {}
    for u in selected:
        try:
            html = fetch_html(u, timeout=cfg.timeout)
            pages[u] = html_to_text(html)
        except Exception as e:
            pages[u] = f"[ERROR fetching {u}: {e}]"
        time.sleep(cfg.sleep_s + random.random()*0.2)

    return pages

def save_markdown(md_text: str, filename: str) -> str:
    out_path = Path("outputs") / filename
    out_path.write_text(md_text, encoding="utf-8")
    return str(out_path)


## Demo

Pick a company and run.

Notes:
- This notebook will only fetch pages if your runtime has internet access.
- For a GitHub repo, consider adding unit tests and a CLI wrapper (easy upgrade path).
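
A CLI wrapper could start from something like the sketch below. It only covers argument parsing and output naming; wiring it to `crawl_for_brochure` is left out. The flag names and the `build_parser` and `brochure_filename` helpers are hypothetical, chosen for illustration.

```python
import argparse
import re


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Generate a Markdown brochure for a company website."
    )
    parser.add_argument("company_name", help="Display name used in the brochure title")
    parser.add_argument("home_url", help="Homepage URL to start from")
    parser.add_argument("--max-pages", type=int, default=6, help="Maximum pages to crawl")
    parser.add_argument("--use-llm", action="store_true", help="Enable the optional LLM steps")
    return parser


def brochure_filename(company_name: str) -> str:
    """Turn a company name into a filesystem-safe file name."""
    slug = re.sub(r"[^a-z0-9]+", "_", company_name.lower()).strip("_")
    return f"{slug or 'company'}_brochure.md"


# Parse an explicit argument list so the example also runs inside a notebook.
args = build_parser().parse_args(["Example Company", "https://example.com", "--max-pages", "3"])
print(args.company_name, args.home_url, args.max_pages, args.use_llm)
print(brochure_filename(args.company_name))
```

Replacing every run of non-alphanumeric characters, rather than only spaces, keeps names such as "A/B & Co." from producing a path with a directory separator in it.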


In [None]:
# Example (replace with a real company + homepage)
company_name = "Example Company"
home_url = "https://example.com"

cfg = CrawlConfig(
    max_candidate_links=40,
    max_pages=6,
    timeout=20,
    sleep_s=0.6,
    use_llm_for_links=bool(OPENAI_API_KEY),        # auto-enable if key exists
    use_llm_for_brochure=bool(OPENAI_API_KEY),     # auto-enable if key exists
)

print(cfg)


In [None]:
# Run the crawl
# pages = crawl_for_brochure(company_name, home_url, cfg)
# print("Crawled pages:", len(pages))
# list(pages.keys())[:10]


In [None]:
# Generate brochure (heuristics-only mode just dumps extracted text as a starting point)
def simple_brochure_fallback(company_name: str, pages: Dict[str, str]) -> str:
    lines = [f"# {company_name} Brochure (Draft)", ""]
    lines.append("## Source pages")
    for u in pages.keys():
        lines.append(f"- {u}")
    lines.append("")
    lines.append("## Extracted content (raw)")
    for u, t in pages.items():
        lines.append("")
        lines.append(f"### {u}")
        lines.append(t[:6000])
    return "\n".join(lines)

# brochure_md = llm_write_brochure(company_name, pages) if cfg.use_llm_for_brochure else simple_brochure_fallback(company_name, pages)
# out_file = save_markdown(brochure_md, f"{company_name.lower().replace(' ','_')}_brochure.md")
# out_file
