# WebPage Summarizer with OpenAI 

# 1️⃣ Introduction

- 🌍 **Task:** Summarizing webpage content using AI.  
- 🧠 **Model:** OpenAI's ``gpt-4o-mini`` and Meta's ``llama3.2`` (run locally through Ollama) for text summarization.  
- 🕵️‍♂️ **Data Extraction:** Selenium for handling both static and JavaScript-rendered websites.  
- 📌 **Output Format:** Markdown-formatted summaries.  
- 🔗 **Scope:** Processes only the given webpage URL (not the entire site).  
- 🚀 **Tools:** Python, Requests, Selenium, BeautifulSoup, OpenAI API, Ollama. 
- 🧑‍💻 **Skill Level:** Beginner.

Let’s get started and automate website summarization! 🚀

![Basic LLM Pipeline](assets/basic_llm_project.jpg)

# 2️⃣ Import Required Libraries

The default ``environment.yml`` file does not include ``selenium`` or ``undetected-chromedriver``, so both need to be installed manually before running the import cell.

Together they let the crawler extract JavaScript-rendered content: ``selenium`` automates a real browser so that scripts run before the page is read, while ``undetected-chromedriver`` helps bypass bot detection systems that would otherwise block an automated browser.

In [1]:
# !pip install selenium
# !pip install undetected-chromedriver
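
Before running the import cell, it can help to confirm that the scraping packages are actually importable. The following is a minimal sketch using only the standard library; the helper name ``missing_packages`` is hypothetical and not part of the original notebook.

```python
import importlib.util

def missing_packages(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]

# These are import names, not pip names:
# undetected-chromedriver is imported as undetected_chromedriver,
# and beautifulsoup4 is imported as bs4.
required = ["selenium", "undetected_chromedriver", "bs4"]
missing = missing_packages(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All scraping dependencies are importable.")
```

If anything is reported as missing, uncomment the ``pip install`` lines above and rerun the cell.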

In [2]:
# ===========================
# System & Environment
# ===========================
import os
import requests
from dotenv import load_dotenv  

# ===========================
# Web Scraping
# ===========================
import time
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By  
from selenium.webdriver.support.ui import WebDriverWait  
from selenium.webdriver.support import expected_conditions as EC  
from bs4 import BeautifulSoup  

# ===========================
# AI-related
# ===========================
from IPython.display import Markdown, display  
from openai import OpenAI 
import ollama

# 3️⃣ Data Collection and Pre-processing

In [3]:
# Webpage Extraction
class WebsiteCrawler:
    def __init__(self, url, wait_time=20, chrome_binary_path=None):
        """
        Initialize the WebsiteCrawler using Selenium to scrape JavaScript-rendered content.
        """
        self.url = url
        self.wait_time = wait_time

        options = uc.ChromeOptions()
        options.add_argument("--disable-gpu")
        options.add_argument("--no-sandbox")
        options.add_argument("--disable-dev-shm-usage")
        options.add_argument("--disable-blink-features=AutomationControlled")
        options.add_argument("start-maximized")
        options.add_argument(
            "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
        )
        if chrome_binary_path:
            options.binary_location = chrome_binary_path

        self.driver = uc.Chrome(options=options)

        try:
            # Load the URL
            self.driver.get(url)

            # Wait for Cloudflare or similar checks
            time.sleep(10)

            # Ensure the main content is loaded; pages without a <main> element
            # time out here and are handled by the except block below
            WebDriverWait(self.driver, self.wait_time).until(
                EC.presence_of_element_located((By.TAG_NAME, "main"))
            )

            # Extract the main content
            main_content = self.driver.find_element(By.CSS_SELECTOR, "main").get_attribute("outerHTML")

            # Parse with BeautifulSoup
            soup = BeautifulSoup(main_content, "html.parser")
            self.title = self.driver.title if self.driver.title else "No title found"
            self.text = soup.get_text(separator="\n", strip=True)

        except Exception as e:
            print(f"Error occurred: {e}")
            self.title = "Error occurred"
            self.text = ""

        finally:
            self.driver.quit()
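
The extracted ``text`` can be long and repetitive on content-heavy pages, and both models have a finite context window. One possible post-processing step is sketched below; ``clean_text`` and the 8,000-character cap are hypothetical choices for illustration, not part of the original crawler.

```python
def clean_text(text, max_chars=8000):
    """Strip each line, skip blank and exact-duplicate lines, then cap the length."""
    seen = set()
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line in seen:
            continue  # Skip empty lines and lines already kept (e.g. repeated menu items)
        seen.add(line)
        kept.append(line)
    cleaned = "\n".join(kept)
    return cleaned[:max_chars]  # Simple character cap; this may cut mid-sentence

sample = "Home\n\nHome\n  About  \nWelcome to the site"
print(clean_text(sample))
```

Under these assumptions, the crawler could call ``clean_text(self.text)`` right after ``soup.get_text(...)``. A character cap is only a rough proxy for token limits.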


# 4️⃣ Prompt Engineering

In [4]:
def user_prompt_for(website):
    user_prompt = f"You are looking at a website titled {website.title}"
    user_prompt += "\nThe contents of this website are as follows; please provide a short summary of this website in markdown. If it includes news or announcements, then summarize these too.\n\n"
    user_prompt += website.text
    return user_prompt
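
To see what the user prompt looks like without launching a browser, a stub object with ``title`` and ``text`` attributes is enough. This is a minimal sketch: ``demo_user_prompt_for`` is a condensed, hypothetical stand-in for the function above (its instruction text is shortened), and ``fake_site`` is invented test data.

```python
from types import SimpleNamespace

def demo_user_prompt_for(website):
    # Condensed stand-in for the notebook's user_prompt_for: title, instruction, page text.
    return (
        f"You are looking at a website titled {website.title}\n"
        "Please provide a short summary of this website in markdown.\n\n"
        f"{website.text}"
    )

# Hypothetical stub exposing the two attributes the prompt builder reads.
fake_site = SimpleNamespace(
    title="Example Domain",
    text="This domain is for use in examples.",
)
prompt = demo_user_prompt_for(fake_site)
print(prompt)
```

The same stub can be passed to the real ``user_prompt_for`` to inspect the full prompt before spending any API calls.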

In [5]:
system_prompt = "You are an assistant that analyzes the contents of a website \
and provides a short summary, ignoring text that might be navigation related. \
Respond in markdown."

# 5️⃣ Structure Messages

In [6]:
def messages_for(website):
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt_for(website)}
    ]
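
Both ``openai.chat.completions.create`` and ``ollama.chat`` take this list of dictionaries with ``role`` and ``content`` keys. A small sanity check can catch malformed messages before a request is sent. The sketch below is an assumption-laden illustration: ``check_messages`` is a hypothetical helper, and it only allows the three roles used in basic chat.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}

def check_messages(messages):
    """Raise ValueError if a message has an unknown role or empty/non-string content."""
    for i, message in enumerate(messages):
        role = message.get("role")
        if role not in ALLOWED_ROLES:
            raise ValueError(f"message {i}: unexpected role {role!r}")
        content = message.get("content")
        if not isinstance(content, str) or not content.strip():
            raise ValueError(f"message {i}: content must be a non-empty string")
    return messages  # Returned unchanged so the call can wrap messages_for(...)

example = [
    {"role": "system", "content": "You summarize websites."},
    {"role": "user", "content": "Summarize this page."},
]
check_messages(example)
print("Messages look well-formed.")
```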

# 6️⃣ Interacting with Model

## 🤖 GPT

In [None]:
load_dotenv(override=True)
api_key = os.getenv('OPENAI_API_KEY')

# Check the key
if not api_key:
    print("No API key was found - please head over to the troubleshooting notebook in this folder to identify & fix!")
elif not api_key.startswith("sk-proj-"):
    print("An API key was found, but it doesn't start sk-proj-; please check you're using the right key - see troubleshooting notebook")
elif api_key.strip() != api_key:
    print("An API key was found, but it looks like it might have space or tab characters at the start or end - please remove them - see troubleshooting notebook")
else:
    print("API key found and looks good so far!")

openai = OpenAI()

API key found and looks good so far!
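
In the check above, the whitespace test runs after the prefix test, so a key with a leading space is reported as having the wrong prefix rather than stray whitespace. A minimal sketch of the same checks as a reusable function, with the whitespace test moved first, is shown here; ``check_api_key`` is a hypothetical helper and the messages are shortened.

```python
def check_api_key(api_key):
    """Return (ok, message) describing whether the key looks usable."""
    if not api_key:
        return False, "No API key was found."
    if api_key.strip() != api_key:
        # Checked before the prefix so a leading space is reported accurately.
        return False, "The key has leading or trailing whitespace."
    if not api_key.startswith("sk-proj-"):
        return False, "The key does not start with sk-proj-."
    return True, "API key found and looks good so far!"

# Dummy values for illustration only; none of these are real keys.
for candidate in [None, " sk-proj-abc", "sk-abc", "sk-proj-abc"]:
    print(repr(candidate), "->", check_api_key(candidate))
```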


In [None]:
def summarize_gpt(url):
    website = WebsiteCrawler(url)  # Crawl the page and extract its title and text
    response = openai.chat.completions.create(
        model="gpt-4o-mini",  # OpenAI's GPT-4o mini model
        messages=messages_for(website),  # System prompt plus the page-specific user prompt
    )
    display(Markdown(response.choices[0].message.content))  # Render the summary as Markdown

## 🤖 llama (Ollama)

In [10]:
def summarize_ollama(url):
    website = WebsiteCrawler(url)
    response = ollama.chat(
        model="llama3.2", 
        messages=messages_for(website))
    display(Markdown(response['message']['content']))  # Generate and display output
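
When the crawl fails, ``WebsiteCrawler`` sets ``text`` to an empty string, yet both summarize functions still call the model. The sketch below shows one way to guard against that and to return the summary as a string instead of displaying it. ``summarize_text``, ``build_messages`` and ``call_model`` are hypothetical names; in the notebook the last two would wrap ``messages_for`` and the OpenAI or Ollama call.

```python
from types import SimpleNamespace

def summarize_text(website, build_messages, call_model):
    """Return a summary string, or a short notice when the crawl produced no text."""
    if not website.text.strip():
        return "No content could be extracted from this page."
    return call_model(build_messages(website))

# Stand-ins so the sketch runs without a browser or an API key.
def fake_messages(website):
    return [{"role": "user", "content": website.text}]

def fake_model(messages):
    return f"Summary of {len(messages[0]['content'])} characters."

failed = SimpleNamespace(title="Error occurred", text="")
loaded = SimpleNamespace(title="Example", text="Hello world")
print(summarize_text(failed, fake_messages, fake_model))
print(summarize_text(loaded, fake_messages, fake_model))
```

Passing the model call in as a function keeps the guard identical for GPT and Llama and makes it testable offline.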

# 7️⃣ Displaying Results

In [11]:
summarize_ollama("https://anthropic.com")

# Website Summary: Anthropic

## Overview

Anthropic is an AI safety and research company that aims to put safety at the forefront of its research and products.

## Key Highlights

* **Claude 3.7 Sonnet**: The company's most intelligent AI model, now available for use.
* **Claude Code**: A new agentic tool for coding launched alongside Claude 3.7 Sonnet.
* **Alignment Research**: Anthropic is working on research into alignment and constitutional AI to ensure harmlessness from AI feedback.

## News and Announcements

* **Core Views on AI Safety**: The company released a report on core views on AI safety, discussing when, why, what, and how to approach AI development with safety in mind.
* **Constitutional AI: Harmlessness from AI Feedback**: A research paper published by Anthropic exploring constitutional AI for harmlessness from AI feedback.

## Products and Services

* **Claude for Enterprise**: A product offering for enterprise clients seeking reliable and beneficial AI systems.
* **Talk to Claude API**: An API allowing users to build AI-powered applications and custom experiences using Claude.

In [12]:
summarize_gpt("https://anthropic.com")

# Summary of Anthropic Website

Anthropic is an AI research and product company focused on safety in artificial intelligence. Their latest model, **Claude 3.7 Sonnet**, is highlighted as their most advanced AI model to date, described as a hybrid reasoning model. Additionally, they have introduced **Claude Code**, a tool designed to assist with coding tasks.

### Recent Announcements
- **Claude 3.7 Sonnet Available**: Their most intelligent model is now available for users to engage with.
- **Claude Code Launch**: A new tool aimed at supporting coding efforts.

### Research and Products
Anthropic emphasizes its commitment to AI safety and alignment. They are involved in various research initiatives, including topics like *Constitutional AI* and *AI safety protocols*. The company is based in San Francisco and has a diverse team with expertise in machine learning, physics, policy, and product development, aiming to create beneficial AI systems. 

For those interested in joining, they have open job positions available.

In [13]:
summarize_ollama("https://openai.com")

# OpenAI Website Summary
================================

### Overview

The OpenAI website provides a range of services and resources, including AI-powered chatbots, tutorials, and tools for various industries such as education, business, and entertainment.

### News and Announcements

* **Introducing GPT-4.5**: A new version of the popular language model, announced on February 27, 2025.
* **Introducing deep research**: A new release that enables users to conduct deeper research using OpenAI's tools.
* **OpenAI and the CSU system bring AI to 500,000 students & faculty**: A partnership between OpenAI and the Colorado State University (CSU) system to provide AI-powered services to 500,000 students and faculty members.
* **Announcing The Stargate Project**: A new project that aims to explore the intersection of AI and space exploration.

### Research and Publications

* **Computer-Using Agent**: A research paper on trading inference-time compute for adversarial robustness.
* **Building smarter maps with GPT-4o vision fine-tuning**: An article on using OpenAI's GPT-4o model to build smarter maps.
* **Data-driven beauty and creativity with ChatGPT**: An article on using ChatGPT for data-driven approaches in beauty and creativity.

### Events

* **Lyndon Barrois & Sora**: A collaboration between Lyndon Barrois and Sora on building a custom math tutor powered by ChatGPT.

Note: This summary only includes the most recent news, announcements, research papers, and publications mentioned on the website.