# Brochure
Generate a brochure-style report for a website by scraping its most relevant links.



### 🏷️ **Classes**

* `Website`

  * `__init__(self, url)` → Fetches the page, extracts title, text, and links
  * `get_contents(self) -> str` → Returns the page title and text content as one formatted string


### 🏷️ **Functions**

* `get_links_user_prompt(website)` → Builds the user prompt that lists the website's links and asks the model to pick the ones relevant to a brochure
* `get_links(url)` → Scrapes the page at `url`, sends its links to the model, and returns the selected links as parsed JSON
* `get_all_details(url)` → Combines the landing-page contents with the contents of every selected link into one string
* `get_brochure_user_prompt(company_name, url)` → Builds the brochure user prompt from the collected page contents, truncated to 5,000 characters
* `create_brochure(company_name, url)` → Generates the brochure with the OpenAI API and displays it as Markdown
* `stream_brochure(company_name, url)` → Same as `create_brochure`, but streams the output and updates the display incrementally



In [5]:
import os
import requests
import json
from typing import List
from dotenv import load_dotenv
from bs4 import BeautifulSoup
from IPython.display import display, Markdown, update_display
from openai import OpenAI

In [6]:
load_dotenv(override=True)
api_key = os.getenv("OPENAI_API_KEY")

# Check that a key was loaded and has the expected shape before using it
if api_key and api_key.startswith("sk-proj-") and len(api_key) > 10:
    print("API key looks valid")
else:
    print("API key is not valid, please check your .env file")

MODEL = "gpt-4o-mini"
openai = OpenAI()


In [7]:
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
}

class Website:
    def __init__(self, url: str):
        self.url = url
        
        response = requests.get(url, headers=headers)
        
        self.body = response.content
        
        soup = BeautifulSoup(self.body, "html.parser")
        
        self.title = soup.title.string if soup.title else "No title found"
        
        if soup.body:
            for irrelevant in soup.body(["script", "style", "img"]):
                irrelevant.decompose()
            self.text = soup.body.get_text(separator="\n", strip=True)
        else:
            self.text = ""
            
        links = [link.get('href') for link in soup.find_all('a')]
        self.links = [link for link in links if link]
            
    def get_contents(self) -> str:
        return f"Webpage Title:\n{self.title}\nWebpage Contents:\n{self.text}\n\n"
        
        

In [8]:
imranpollob = Website("https://imranpollob.com")
imranpollob.links

['/',
 '/',
 '/experience',
 '/projects',
 '/blog',
 '/search',
 '/',
 '/experience',
 '/projects',
 '/blog',
 '/search',
 '/blog',
 '/projects',
 '/blog',
 '/blog/19-wrapped-tokens-the-key-to-cross-chain-liquidity-and-utility',
 '/blog/18-system-design-interview-understanding-proxies',
 '/blog/17-system-design-interview-understanding-load-balancers',
 '/projects',
 '/projects/29-color-shade-generator',
 '/projects/04-snippet-app',
 '/projects/33-slugify-and-copy',
 '/experience',
 'mailto:polboy777@gmail.com',
 'https://github.com/imranpollob',
 'https://www.linkedin.com/in/imranpollob/',
 '/legal/terms',
 '/legal/privacy',
 'mailto:polboy777@gmail.com',
 'https://github.com/imranpollob',
 'https://www.linkedin.com/in/imranpollob/']
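
The list above mixes relative paths, duplicates, and `mailto:` links. The notebook leaves the cleanup to the model, but the same normalization can be done locally before prompting. The following is a minimal sketch using only the standard library; `normalize_links` is a hypothetical helper, not part of the notebook, and the sample list is illustrative.

```python
from urllib.parse import urljoin, urlparse

def normalize_links(base_url, links):
    """Resolve relative links against base_url, drop non-web links, and de-duplicate."""
    seen = set()
    result = []
    for link in links:
        absolute = urljoin(base_url, link)
        # Keep only http(s) links; this drops mailto: and similar schemes
        if urlparse(absolute).scheme not in ("http", "https"):
            continue
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result

sample = ["/", "/blog", "/blog", "mailto:someone@example.com", "https://github.com/imranpollob"]
print(normalize_links("https://imranpollob.com", sample))
```

Passing the normalized list to the model would shorten the prompt and remove the need for it to guess how relative paths resolve, at the cost of one extra step in `Website`.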

In [9]:
link_system_prompt = """You are provided with a list of links found on a webpage. You are able to decide which of the links would be most relevant to include in a brochure about the company, such as links to an About page, or a Company page, or Careers/Jobs pages.

You should respond in JSON as in this example:

{
    "links": [
        {"type": "about page", "url": "https://full.url/goes/here/about"},
        {"type": "careers page", "url": "https://another.full.url/careers"}
    ]
}
"""

In [11]:
print(link_system_prompt)

You are provided with a list of links found on a webpage. You are able to decide which of the links would be most relevant to include in a brochure about the company, such as links to an About page, or a Company page, or Careers/Jobs pages.

You should respond in JSON as in this example:

{
    "links": [
        {"type": "about page", "url": "https://full.url/goes/here/about"},
        {"type": "careers page", "url": "https://another.full.url/careers"}
    ]
}



In [12]:
def get_links_user_prompt(website):
    user_prompt = f"Here is the list of links on the website of {website.url} - "
    user_prompt += "please decide which of these are relevant web links for a brochure about the company, respond with the full https URL in JSON format. \
Do not include Terms of Service, Privacy, email links.\n"
    user_prompt += "Links (some might be relative links):\n"
    user_prompt += "\n".join(website.links)
    return user_prompt

In [13]:
print(get_links_user_prompt(imranpollob))

Here is the list of links on the website of https://imranpollob.com - please decide which of these are relevant web links for a brochure about the company, respond with the full https URL in JSON format. Do not include Terms of Service, Privacy, email links.
Links (some might be relative links):
/
/
/experience
/projects
/blog
/search
/
/experience
/projects
/blog
/search
/blog
/projects
/blog
/blog/19-wrapped-tokens-the-key-to-cross-chain-liquidity-and-utility
/blog/18-system-design-interview-understanding-proxies
/blog/17-system-design-interview-understanding-load-balancers
/projects
/projects/29-color-shade-generator
/projects/04-snippet-app
/projects/33-slugify-and-copy
/experience
mailto:polboy777@gmail.com
https://github.com/imranpollob
https://www.linkedin.com/in/imranpollob/
/legal/terms
/legal/privacy
mailto:polboy777@gmail.com
https://github.com/imranpollob
https://www.linkedin.com/in/imranpollob/


In [14]:
def get_links(url):
    website = Website(url)
    response = openai.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": link_system_prompt},
            {"role": "user", "content": get_links_user_prompt(website)}
      ],
        response_format={"type": "json_object"}
    )
    result = response.choices[0].message.content
    return json.loads(result)
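
`get_links` trusts the model to return exactly the shape shown in the system prompt. A defensive variant could check the parsed reply before the rest of the pipeline uses it. This is a sketch only; `parse_links_response` is a hypothetical helper, and the rule that URLs must start with `http://` or `https://` is an assumption based on the prompt asking for full URLs.

```python
import json

def parse_links_response(raw):
    """Parse the model's JSON reply and keep only well-formed link entries."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {"links": []}
    if not isinstance(data, dict):
        return {"links": []}
    links = data.get("links", [])
    if not isinstance(links, list):
        return {"links": []}
    # Keep entries that have a string "type" and a full http(s) "url"
    valid = [
        item for item in links
        if isinstance(item, dict)
        and isinstance(item.get("type"), str)
        and isinstance(item.get("url"), str)
        and item["url"].startswith(("http://", "https://"))
    ]
    return {"links": valid}

reply = '{"links": [{"type": "about page", "url": "https://example.com/about"}, {"type": "broken"}]}'
print(parse_links_response(reply))
```

Returning an empty list on malformed input lets `get_all_details` fall back to the landing page alone instead of raising a `KeyError`.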

In [15]:
huggingface = Website("https://huggingface.co")
huggingface.links

['/',
 '/models',
 '/datasets',
 '/spaces',
 '/docs',
 '/enterprise',
 '/pricing',
 '/login',
 '/join',
 '/spaces',
 '/models',
 '/nanonets/Nanonets-OCR-s',
 '/MiniMaxAI/MiniMax-M1-80k',
 '/google/magenta-realtime',
 '/Menlo/Jan-nano',
 '/mistralai/Mistral-Small-3.2-24B-Instruct-2506',
 '/models',
 '/spaces/ilcve21/Sparc3D',
 '/spaces/enzostvs/deepsite',
 '/spaces/tencent/Hunyuan3D-2.1',
 '/spaces/MiniMaxAI/MiniMax-M1',
 '/spaces/nvidia/PartPacker',
 '/spaces',
 '/datasets/EssentialAI/essential-web-v1.0',
 '/datasets/institutional/institutional-books-1.0',
 '/datasets/fka/awesome-chatgpt-prompts',
 '/datasets/nvidia/AceReason-1.1-SFT',
 '/datasets/nvidia/Nemotron-Personas',
 '/datasets',
 '/join',
 '/pricing#endpoints',
 '/pricing#spaces',
 '/pricing',
 '/enterprise',
 '/enterprise',
 '/enterprise',
 '/enterprise',
 '/enterprise',
 '/enterprise',
 '/enterprise',
 '/allenai',
 '/facebook',
 '/amazon',
 '/google',
 '/Intel',
 '/microsoft',
 '/grammarly',
 '/Writer',
 '/docs/transformers'

In [16]:
get_links("https://huggingface.co")

{'links': [{'type': 'about page', 'url': 'https://huggingface.co'},
  {'type': 'enterprise page', 'url': 'https://huggingface.co/enterprise'},
  {'type': 'pricing page', 'url': 'https://huggingface.co/pricing'},
  {'type': 'careers page', 'url': 'https://apply.workable.com/huggingface/'},
  {'type': 'blog page', 'url': 'https://huggingface.co/blog'},
  {'type': 'github page', 'url': 'https://github.com/huggingface'},
  {'type': 'twitter page', 'url': 'https://twitter.com/huggingface'},
  {'type': 'linkedin page',
   'url': 'https://www.linkedin.com/company/huggingface/'}]}

In [17]:
def get_all_details(url):
    result = "Landing page:\n"
    result += Website(url).get_contents()
    links = get_links(url)
    print("Found links:", links)
    for link in links["links"]:
        result += f"\n\n{link['type']}\n"
        result += Website(link["url"]).get_contents()
    return result

In [20]:
print(get_all_details("https://imranpollob.com"))

Found links: {'links': [{'type': 'about page', 'url': 'https://imranpollob.com/'}, {'type': 'experience page', 'url': 'https://imranpollob.com/experience'}, {'type': 'projects page', 'url': 'https://imranpollob.com/projects'}, {'type': 'blog page', 'url': 'https://imranpollob.com/blog'}, {'type': 'GitHub', 'url': 'https://github.com/imranpollob'}, {'type': 'LinkedIn', 'url': 'https://www.linkedin.com/in/imranpollob/'}]}
Landing page:
Webpage Title:
Home | Imran Pollob
Webpage Contents:
Imran Pollob
Home
Experience
Projects
Blog
Home
Experience
Projects
Blog
👋 I'm Imran Pollob
Software Engineer
Currently doing research on Blockchain Security
Read my blog
View my projects
Scroll
About Me
I'm a PhD student. I do research on Blockchain Security and trying
            to build automated smart contract vulnerability detection tools.
My Journey
Before starting my PhD journey, I spent over 5 years as a software
            engineer, working with a variety of tech stacks. I collaborated with
  

## Create brochure

In [21]:
system_prompt = "You are an assistant that analyzes the contents of several relevant pages from a company website \
and creates a short brochure about the company for prospective customers, investors and recruits. Respond in markdown. \
Include details of company culture, customers and careers/jobs if you have the information."

In [22]:
def get_brochure_user_prompt(company_name, url):
    user_prompt = f"You are looking at a company called: {company_name}\n"
    user_prompt += "Here are the contents of its landing page and other relevant pages; use this information to build a short brochure of the company in markdown.\n"
    user_prompt += get_all_details(url)
    user_prompt = user_prompt[:5_000] # Truncate if more than 5,000 characters
    return user_prompt
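
The hard cut at 5,000 characters can stop in the middle of a line or a word. One possible refinement, sketched here under the assumption that the page contents are newline-separated (as `get_contents` produces), is to cut at the last line break before the limit. `truncate_at_line` is a hypothetical helper, not part of the notebook.

```python
def truncate_at_line(text, limit=5_000):
    """Truncate text to at most `limit` characters, preferring a line boundary."""
    if len(text) <= limit:
        return text
    # Look for the last newline that falls inside the allowed range
    cut = text.rfind("\n", 0, limit)
    # Fall back to a hard cut if there is no usable newline before the limit
    return text[:cut] if cut > 0 else text[:limit]

print(truncate_at_line("aaa\nbbb\nccc", limit=6))
```

With this helper, the truncation line in `get_brochure_user_prompt` could become `user_prompt = truncate_at_line(user_prompt)`.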

In [23]:
get_brochure_user_prompt("HuggingFace", "https://huggingface.co")

Found links: {'links': [{'type': 'about page', 'url': 'https://huggingface.co/huggingface'}, {'type': 'careers page', 'url': 'https://apply.workable.com/huggingface/'}, {'type': 'blog page', 'url': 'https://huggingface.co/blog'}, {'type': 'documentation page', 'url': 'https://huggingface.co/docs'}, {'type': 'enterprise page', 'url': 'https://huggingface.co/enterprise'}, {'type': 'pricing page', 'url': 'https://huggingface.co/pricing'}]}


'You are looking at a company called: HuggingFace\nHere are the contents of its landing page and other relevant pages; use this information to build a short brochure of the company in markdown.\nLanding page:\nWebpage Title:\nHugging Face – The AI community building the future.\nWebpage Contents:\nHugging Face\nModels\nDatasets\nSpaces\nCommunity\nDocs\nEnterprise\nPricing\nLog In\nSign Up\nThe AI community building the future.\nThe platform where the machine learning community collaborates on models, datasets, and applications.\nExplore AI Apps\nor\nBrowse 1M+ models\nTrending on\nthis week\nModels\nnanonets/Nanonets-OCR-s\nUpdated\n3 days ago\n•\n177k\n•\n1.07k\nMiniMaxAI/MiniMax-M1-80k\nUpdated\n5 days ago\n•\n10.4k\n•\n523\ngoogle/magenta-realtime\nUpdated\nabout 16 hours ago\n•\n225\nMenlo/Jan-nano\nUpdated\n6 days ago\n•\n29.1k\n•\n372\nmistralai/Mistral-Small-3.2-24B-Instruct-2506\nUpdated\n1 day ago\n•\n5.37k\n•\n192\nBrowse 1M+ models\nSpaces\nRunning\n715\n715\nSparc3D\n🏃\nNe

In [24]:
def create_brochure(company_name, url):
    response = openai.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": get_brochure_user_prompt(company_name, url)}
          ],
    )
    result = response.choices[0].message.content
    display(Markdown(result))

In [25]:
create_brochure("HuggingFace", "https://huggingface.co")

Found links: {'links': [{'type': 'about page', 'url': 'https://huggingface.co/about'}, {'type': 'careers page', 'url': 'https://apply.workable.com/huggingface/'}, {'type': 'blog page', 'url': 'https://huggingface.co/blog'}, {'type': 'company page', 'url': 'https://huggingface.co/huggingface'}, {'type': 'contact page', 'url': 'https://discuss.huggingface.co'}]}


# Hugging Face Brochure

## Company Overview
**Hugging Face** is an innovative platform dedicated to building the future of AI. As a vibrant community, we enable collaboration among machine learning practitioners, providing access to a wealth of models, datasets, and applications. Our mission is to democratize AI and simplify the machine learning process for all.

## What We Offer
- **Models**: Explore over **1 million models** including leading-edge solutions for various tasks.
- **Datasets**: Access around **250k datasets** specifically curated for machine learning.
- **Spaces**: Run applications across a plethora of AI models, including image-to-3D generation and code generation from text prompts.
- **Enterprise Solutions**: Tailored options for businesses, with a focus on security and dedicated support for AI development.

## Company Culture
At Hugging Face, we believe in **openness, collaboration, and community**. Our culture is centered around fostering an inclusive environment where all members feel valued and can contribute meaningfully. We are committed to open-source principles, actively involving our community in perceiving and shaping the future of machine learning technology.

## Customers
We proudly serve more than **50,000 organizations**, including renowned names like:
- **Google**
- **Microsoft**
- **Amazon**
- **Meta**

These partnerships underscore our robust platform and commitment to providing high-quality, cutting-edge solutions. 

## Career Opportunities
Hugging Face is always on the lookout for talented individuals who share our passion for AI and machine learning. Join us to work in an environment that promotes learning, growth, and innovation. Explore open positions on our [Careers page](#) and be part of a team that is shaping the AI landscape.

## Join Our Community
Whether you are a developer, researcher, or enthusiast, we invite you to become part of the Hugging Face community. Connect with us through our various channels, including GitHub, Twitter, and our [community forum](#).

> Let's build the future together. 

---
For more information, visit our [website](https://huggingface.co).

## Streaming answer

In [26]:
def stream_brochure(company_name, url):
    stream = openai.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": get_brochure_user_prompt(company_name, url)}
          ],
        stream=True
    )
    
    response = ""
    display_handle = display(Markdown(""), display_id=True)
    for chunk in stream:
        response += chunk.choices[0].delta.content or ''
        # Strip code-fence markers for display only, so the word "markdown" survives in normal text
        cleaned = response.replace("```markdown", "").replace("```", "")
        update_display(Markdown(cleaned), display_id=display_handle.display_id)
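
The accumulation loop can be tried without calling the API. The sketch below uses a hypothetical `fake_stream` generator that mimics only the attributes the loop reads (`choices[0].delta.content`); it is a stand-in for illustration, not the OpenAI client's real chunk type.

```python
from types import SimpleNamespace

def fake_stream(pieces):
    """Yield objects shaped like streamed chat-completion chunks (hypothetical stand-in)."""
    for piece in pieces:
        delta = SimpleNamespace(content=piece)
        yield SimpleNamespace(choices=[SimpleNamespace(delta=delta)])

def collect(stream):
    """Concatenate the text fragments carried by each chunk."""
    response = ""
    for chunk in stream:
        # Some chunks carry no text (content is None), so fall back to ""
        response += chunk.choices[0].delta.content or ""
    return response

print(collect(fake_stream(["# Hugging", " Face", None, " Brochure"])))
```

The `or ""` guard matters because a chunk's `content` can be `None`, and concatenating `None` to a string would raise a `TypeError`.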

In [27]:
stream_brochure("HuggingFace", "https://huggingface.co")

Found links: {'links': [{'type': 'about page', 'url': 'https://huggingface.co/huggingface'}, {'type': 'careers page', 'url': 'https://apply.workable.com/huggingface/'}, {'type': 'blog page', 'url': 'https://huggingface.co/blog'}, {'type': 'company page', 'url': 'https://www.linkedin.com/company/huggingface/'}, {'type': 'community page', 'url': 'https://discuss.huggingface.co'}]}


# Hugging Face Company Brochure

---

## About Us

**Hugging Face** is the leading AI community dedicated to building the future of machine learning and artificial intelligence. Our platform enables collaboration among researchers and developers to create, share, and improve models, datasets, and applications. With a strong focus on open-source solutions, Hugging Face aims to democratize machine learning for everyone.

---

## Our Offerings

- **Models**: Explore over 1 million models ranging from natural language processing to image and audio analysis.
- **Datasets**: Access a vast library of 250,000+ datasets curated for various machine learning tasks.
- **Spaces**: Run and share your applications effortlessly on our collaborative platform.
- **Enterprise Solutions**: Offered with advanced security features and dedicated support.

---

## Community and Collaboration

More than **50,000 organizations** globally have harnessed Hugging Face's technology, including top enterprises such as Google, Microsoft, Amazon, and Meta. We pride ourselves on being the home of the machine learning community, with contributions from thousands of enthusiasts who are paving the way for future innovations.

---

## Company Culture

At Hugging Face, we embody a culture of **inclusivity, innovation, and collaboration**. Our team members, numbering over **213**, deeply believe in the power of AI for good and work together to promote knowledge sharing and collective growth. We encourage everyone to contribute to the pool of knowledge, fostering an environment where ideas can thrive.

---

## Careers at Hugging Face

Join us in our mission to democratize machine learning! We seek passionate individuals eager to be part of an innovative team. Whether your expertise lies in engineering, data science, community management, or any relevant field, we invite you to explore a career at Hugging Face. Check out our [Jobs Page](https://huggingface.co/jobs) for current openings and become part of the AI revolution.

---

## Connect With Us

Stay updated on our latest endeavors and community contributions. Follow us on:
- **[GitHub](https://github.com/huggingface)**
- **[Twitter](https://twitter.com/huggingface)**
- **[LinkedIn](https://linkedin.com/company/huggingface)**
- **[Discord](https://discord.gg/huggingface)**

---

**Hugging Face**: Collaborate, innovate, and build the future of AI together!