# Web Brochure Generator Notebook
This Jupyter notebook scrapes a company website, asks a local Ollama model to pick the links worth reading, and generates a Markdown brochure from the collected pages. Each cell handles one step of the process, from imports and configuration to generating and displaying the final brochure.

## Cell 1 — Imports & Configuration
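This cell imports the libraries used for HTTP requests (`requests`), HTML parsing (`BeautifulSoup`), JSON handling, URL resolution, and notebook display. It also defines the configuration: the URL of the local Ollama API, the model name, and a browser-like `User-Agent` header sent with web requests. The model named in `MODEL` must already be available to the local Ollama server.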

In [None]:
# Imports & Configuration
import requests
import json
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from IPython.display import Markdown, display, update_display

# Configuration
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "llama3.2"

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
}
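
Before running the later cells, it can help to confirm that the configured model is actually available to the local server. The sketch below assumes Ollama's `GET /api/tags` endpoint, which lists locally available models; the helper name `model_is_available` and the sample payload are illustrative and not part of the notebook.

```python
def model_is_available(tags_payload, model):
    """Return True if `model` appears in an /api/tags-style payload.

    Ollama reports names with a tag suffix (e.g. "llama3.2:latest"),
    so a bare name matches any tag of that model.
    """
    names = [m.get("name", "") for m in tags_payload.get("models", [])]
    return any(n == model or n.split(":")[0] == model for n in names)


# Hypothetical payload, shaped like the JSON that /api/tags returns.
sample_payload = {"models": [{"name": "llama3.2:latest"}, {"name": "mistral:7b"}]}

print(model_is_available(sample_payload, "llama3.2"))  # True
print(model_is_available(sample_payload, "phi3"))      # False
```

Against a live server, the payload would come from something like `requests.get("http://localhost:11434/api/tags", timeout=5).json()`.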

## Cell 2 — Ollama Helper Function
This cell defines a helper function, `ollama_generate`, that sends a prompt to the local Ollama model through the `/api/generate` endpoint. It supports two modes: with `stream=True` it prints each fragment as it arrives, and with `stream=False` it waits for the complete reply. The function returns the generated text with surrounding whitespace stripped, or an empty string if the request fails.

In [None]:
# Ollama Helper Function
def ollama_generate(prompt, model=MODEL, stream=False):
    """Generate text using a local Ollama model."""
    try:
        response = requests.post(
            OLLAMA_URL,
            json={"model": model, "prompt": prompt, "stream": stream},
            stream=stream
        )
        response.raise_for_status()

        text = ""
        if stream:
            # Streaming replies arrive as one JSON object per line.
            for line in response.iter_lines():
                if line:
                    data = json.loads(line)
                    if "response" in data:
                        print(data["response"], end="", flush=True)
                        text += data["response"]
            print()
        else:
            # A non-streaming reply is a single JSON object.
            text = response.json().get("response", "")

        return text.strip()
    except (requests.RequestException, ValueError) as e:
        # ValueError also covers json.JSONDecodeError from a malformed line.
        print(f"⚠️ Error communicating with Ollama server: {e}")
        return ""
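
When streaming, the endpoint sends newline-delimited JSON: each line is an object carrying a `response` fragment, and the last one is marked `"done": true`. The sketch below isolates that parsing step so it can be tried without a running server. The function name `collect_stream_text` and the sample lines are assumptions made for illustration.

```python
import json


def collect_stream_text(lines):
    """Concatenate the "response" fragments from newline-delimited JSON."""
    parts = []
    for line in lines:
        if not line:          # iter_lines() can yield empty keep-alive lines
            continue
        data = json.loads(line)
        parts.append(data.get("response", ""))
        if data.get("done"):  # the final object carries "done": true
            break
    return "".join(parts).strip()


# Hypothetical stream, shaped like the byte lines response.iter_lines() yields.
sample_lines = [
    b'{"response": "Hello", "done": false}',
    b"",
    b'{"response": ", world", "done": false}',
    b'{"response": "", "done": true}',
]

print(collect_stream_text(sample_lines))  # Hello, world
```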

## Cell 3 — Website Scraper Class
This cell defines the `Website` class, which fetches and parses a web page as soon as it is constructed. It uses `requests` to retrieve the page and `BeautifulSoup` to extract the title, text, and links. Before extracting text, it removes `script`, `style`, `img`, and `input` elements from the body, and it resolves every `href` to an absolute URL with `urljoin`.

In [None]:
# Website Scraper Class
class Website:
    def __init__(self, url):
        self.url = url
        self.body = b""
        self.title = ""
        self.text = ""
        self.links = []

        self._fetch_page()
        self._parse_page()

    def _fetch_page(self):
        try:
            # A timeout keeps an unresponsive site from hanging the notebook.
            response = requests.get(self.url, headers=HEADERS, timeout=10)
            print("HTTP Status Code:", response.status_code)
            if response.status_code == 200:
                self.body = response.content
            else:
                print(f"Failed to fetch {self.url}")
        except requests.RequestException as e:
            print(f"⚠️ Error fetching {self.url}: {e}")

    def _parse_page(self):
        soup = BeautifulSoup(self.body, 'html.parser')
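        # soup.title.string is None when <title> is empty or contains nested tags.
        has_title = soup.title and soup.title.string
        self.title = soup.title.string.strip() if has_title else "No title found"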

        if soup.body:
            for tag in soup.body(["script", "style", "img", "input"]):
                tag.decompose()
            self.text = soup.body.get_text(separator="\n", strip=True)

        self.links = [urljoin(self.url, a.get("href")) for a in soup.find_all('a') if a.get("href")]

    def get_contents(self):
        return f"Webpage Title:\n{self.title}\nWebpage Contents:\n{self.text}\n\n"
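
The link list built by `_parse_page` can contain duplicates, fragment-only variants of the same page, and non-web schemes such as `mailto:`. One possible refinement, sketched below with the standard library only, is to normalize the list before sending it to the model. The helper `normalize_links` is hypothetical and not part of the class above.

```python
from urllib.parse import urljoin, urldefrag, urlparse


def normalize_links(base_url, hrefs):
    """Resolve hrefs against base_url, keep http(s) only, and drop duplicates."""
    seen = set()
    result = []
    for href in hrefs:
        # Resolve relative paths, then strip any "#fragment" suffix.
        absolute, _fragment = urldefrag(urljoin(base_url, href.strip()))
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # skips mailto:, tel:, javascript:, and similar
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


hrefs = ["/about", "careers", "/about#team", "mailto:hi@example.com",
         "https://other.example/press"]
print(normalize_links("https://example.com/", hrefs))
```

A shorter, cleaner list also leaves more room in the prompt for the links that matter.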

## Cell 4 — Prompt Definitions
This cell defines the system prompts and a helper function that builds the user prompt for link selection. The `link_system_prompt` instructs the Ollama model to reply with JSON only, in a fixed `{"links": [...]}` shape, and `get_links_user_prompt` lists the page's links and asks for those relevant to a brochure. The `brochure_system_prompt()` function returns the instructions for writing a Markdown brochure from the company's pages.

In [None]:
# Prompt Definitions
link_system_prompt = """You are provided with a list of links found on a webpage.
Return strictly valid JSON only, with no additional text.
The JSON format:
{
    "links": [
        {"type": "about page", "url": "https://example.com/about-us/"}
    ]
}
If no relevant links, return {"links": []}.
"""

def get_links_user_prompt(website):
    prompt = f"Here is the list of links on {website.url}.\n"
    prompt += "Select relevant links for a company brochure. Exclude Terms, Privacy, or emails.\n"
    prompt += "\n".join(website.links)
    return prompt

def brochure_system_prompt():
    return """You are an assistant that creates a short brochure from relevant pages
of a company website. Output in markdown including culture, customers, careers/jobs."""
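
The notebook joins the system and user prompts into a single string. As an alternative, Ollama's `/api/generate` endpoint also accepts a separate `system` field and a `"format": "json"` option that constrains the reply to valid JSON. The sketch below only builds such a request body; the helper name `build_generate_payload` is an assumption, and the notebook's own cells do not use it.

```python
def build_generate_payload(system, prompt, model="llama3.2", json_mode=False):
    """Build a request body for Ollama's /api/generate endpoint."""
    payload = {
        "model": model,
        "system": system,   # system prompt, kept separate from the user prompt
        "prompt": prompt,
        "stream": False,
    }
    if json_mode:
        payload["format"] = "json"  # asks Ollama to emit valid JSON only
    return payload


payload = build_generate_payload(
    system="Return strictly valid JSON only.",
    prompt="Here is the list of links on https://example.com.",
    json_mode=True,
)
print(sorted(payload))  # ['format', 'model', 'prompt', 'stream', 'system']
```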

## Cell 5 — Link Extraction Function
This cell defines the `get_links` function, which uses the `Website` class to scrape a page and collect its links. It sends the links to the Ollama model with a prompt asking it to select the relevant ones and return them as JSON. If the reply is not valid JSON, the function prints a warning and falls back to an empty link list.

In [None]:
# Link Extraction Function
def get_links(url):
    website = Website(url)
    prompt = f"{link_system_prompt}\n\n{get_links_user_prompt(website)}"
    response_text = ollama_generate(prompt)

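    try:
        result = json.loads(response_text)
    except json.JSONDecodeError:
        print("⚠️ JSON parsing error")
        result = {"links": []}

    # Valid JSON that is not an object (e.g. a bare list) has no "links" key.
    if not isinstance(result, dict):
        result = {"links": []}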

    if not result.get("links"):
        print("⚠️ No relevant links found")
    return result
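
Local models often wrap the requested JSON in a sentence or two of prose, which makes a plain `json.loads` call fail. A more tolerant approach, sketched below, scans the reply for the first decodable JSON object using `json.JSONDecoder.raw_decode`. The helper `extract_json_object` and the sample reply are hypothetical.

```python
import json


def extract_json_object(text):
    """Return the first JSON object found in `text`, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one value starting at `start` and ignores the rest.
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = text.find("{", start + 1)
    return None


# Hypothetical model reply with prose around the JSON.
reply = (
    "Sure! Here are the relevant links:\n"
    '{"links": [{"type": "about page", "url": "https://example.com/about"}]}\n'
    "Let me know if you need more."
)
print(extract_json_object(reply))
```

A caller could try `json.loads` first and fall back to this helper only when parsing fails.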

## Cell 6 — Brochure Generation Functions
This cell contains functions that gather content from the landing page and each relevant link (`get_all_details`), build the user prompt for the brochure (`get_brochure_user_prompt`, which truncates the prompt to 5,000 characters), and generate the brochure in two modes. `create_brochure` waits for the full reply and renders it as Markdown, while `stream_brochure` re-renders the Markdown output in real time as each fragment arrives.

In [None]:
# Brochure Generation Functions
def get_all_details(url):
    result = "Landing page:\n"
    landing_page = Website(url)
    result += landing_page.get_contents()

    links_data = get_links(url)
    for link_info in links_data.get("links", []):
        page_url = link_info.get("url")
        page_type = link_info.get("type", "Page")
        if not page_url:
            continue
        page_content = Website(page_url)
        result += f"\n\n{page_type}\n"
        result += page_content.get_contents()
    return result

def get_brochure_user_prompt(company_name, url):
    user_prompt = f"Company: {company_name}\n"
    user_prompt += get_all_details(url)
    return user_prompt[:5000]  # keep only the first 5,000 characters to limit prompt size

def create_brochure(company_name, url):
    prompt = f"{brochure_system_prompt()}\n\n{get_brochure_user_prompt(company_name, url)}"
    result = ollama_generate(prompt)
    display(Markdown(result))

def stream_brochure(company_name, url):
    prompt = f"{brochure_system_prompt()}\n\n{get_brochure_user_prompt(company_name, url)}"
    try:
        response = requests.post(
            OLLAMA_URL,
            json={"model": MODEL, "prompt": prompt, "stream": True},
            stream=True,
        )
        response.raise_for_status()
        
        display_handle = display(Markdown(""), display_id=True)
        current_text = ""
        for line in response.iter_lines():
            if line:
                data = json.loads(line)
                if "response" in data:
                    current_text += data["response"]
                    update_display(Markdown(current_text), display_id=display_handle.display_id)
        print("\n✅ Brochure generation complete.")
    except (requests.RequestException, ValueError) as e:  # ValueError covers malformed JSON lines
        print(f"⚠️ Error streaming brochure: {e}")
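
Slicing the combined prompt to its first 5,000 characters means a long landing page can crowd out every linked page. One possible alternative, sketched below, gives each page an equal share of the character budget. The function `truncate_sections` and the sample sections are assumptions for illustration, and an equal split is only one of several reasonable policies.

```python
def truncate_sections(sections, budget=5000):
    """Trim each (title, text) section to an equal share of `budget` characters."""
    if not sections:
        return ""
    separator = "\n\n"
    # Reserve room for the separators, then split the rest evenly.
    usable = budget - len(separator) * (len(sections) - 1)
    share = max(usable // len(sections), 0)
    parts = [f"{title}\n{text}"[:share] for title, text in sections]
    return separator.join(parts)


# Hypothetical page contents of very different lengths.
sections = [
    ("Landing page", "A" * 8000),
    ("about page", "B" * 8000),
    ("careers page", "C" * 100),
]
prompt = truncate_sections(sections, budget=5000)
print(len(prompt) <= 5000)  # True
```

With this policy, short pages are kept whole and only the long ones are cut.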

## Cell 7 — Testing the Brochure Generation
This cell tests the brochure generation by creating a brochure for Hugging Face using the `create_brochure` function. A commented-out call to `stream_brochure` is included for optional streaming output. Run this cell to see the generated markdown brochure displayed in the notebook.

In [None]:
# Testing Brochure Generation
# Normal brochure generation
create_brochure("Hugging Face", "https://huggingface.co")

# Uncomment the following line to test streaming brochure generation
# stream_brochure("Hugging Face", "https://huggingface.co")