# Introduction

This notebook shows how to summarize a website using Selenium, BeautifulSoup, and the OpenAI API.

# Load OpenAI API key

In [1]:
import os
from dotenv import load_dotenv

load_dotenv()
api_key = os.getenv("OPENAI_API_KEY")

if api_key:
    print(f"API key loaded: {api_key[:5]}... (truncated)")
else:
    print("API key not loaded")

API key loaded: sk-pr... (truncated)


# Get title and body of a website

**Notes**:
- Many websites, such as React apps, are rendered with JavaScript. [Selenium](https://www.selenium.dev/) is a solid solution for handling JavaScript-heavy pages: it drives a real browser, so you get the final HTML after the page's JavaScript has run.
- Make sure [ChromeDriver](https://googlechromelabs.github.io/chrome-for-testing/#stable) is installed and on your `PATH`. (Selenium 4.6 and later can usually download a matching driver automatically through Selenium Manager.)
- Some websites detect automation tools like Selenium and actively block access, possibly using Cloudflare's bot protection or similar JavaScript-based checks. Sites like https://openai.com, https://cloudflare.com, or https://medium.com use browser fingerprinting, JavaScript challenges, and cookie validation that a headless browser driven by Selenium can't easily pass out of the box.
- Do not attempt to scrape bot-protected sites. If a site explicitly forbids bots (check its `robots.txt`), scraping may violate its terms of service.
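
The `robots.txt` check mentioned in the last note can be automated with Python's standard library. The sketch below is only an illustration: the helper name `is_allowed` and the sample rules are hypothetical, and it parses rules from a string so that it runs without network access.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, for illustration only
example_rules = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(example_rules, "https://example.com/news"))       # → True
print(is_allowed(example_rules, "https://example.com/private/x"))  # → False
```

For a live site, `RobotFileParser` can also fetch the file itself through `set_url()` followed by `read()`. Note that `robots.txt` covers only crawler rules; a site's terms of service may impose further restrictions.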

In [2]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import time

class WebSite:
    def __init__(self, url, wait_time=3):
        self.url = url
        self.title = None
        self.body = None

        # Set up headless Chrome options
        chrome_options = Options()
        chrome_options.add_argument("--headless")
        chrome_options.add_argument("--disable-gpu")
        chrome_options.add_argument("--no-sandbox")

        driver = None
        try:
            # Initialize WebDriver
            driver = webdriver.Chrome(options=chrome_options)
            driver.get(url)

            # Give JavaScript a fixed amount of time to render (adjust wait_time as needed)
            time.sleep(wait_time)

            # Get rendered page source
            html = driver.page_source

            # Parse with BeautifulSoup
            soup = BeautifulSoup(html, 'html.parser')

            # Extract title
            self.title = soup.title.string.strip() if soup.title and soup.title.string else None

            # Remove tags that carry no readable text
            for tag in soup(['img', 'input', 'audio', 'video', 'script', 'style']):
                tag.decompose()

            # Extract and clean body text
            body_tag = soup.body
            self.body = body_tag.get_text(separator=' ', strip=True) if body_tag else None

        except Exception as e:
            print(f"Failed to retrieve or parse the website: {e}")

        finally:
            # Always close the browser, even if loading or parsing failed
            if driver is not None:
                driver.quit()
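
The fixed `time.sleep(wait_time)` in the class above either waits longer than necessary or not long enough. A common alternative is to poll until the page reports that it is ready. Selenium provides `WebDriverWait` (in `selenium.webdriver.support.ui`) for this; the sketch below shows the same idea using only the standard library, and the names `wait_until`, `timeout`, and `poll_interval` are hypothetical.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Poll condition() until it returns a truthy value or timeout seconds pass.

    Returns True if the condition was met, False if the timeout expired.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        # Sleep briefly between checks instead of busy-waiting
        time.sleep(poll_interval)
```

Inside `WebSite.__init__`, the sleep could then be replaced with something like `wait_until(lambda: driver.execute_script("return document.readyState") == "complete", timeout=wait_time)`. Keep in mind that `document.readyState` only signals that the initial document has loaded; content fetched later by JavaScript may still need a more specific condition.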


# Use the OpenAI API to generate a summary of a website

In [3]:
# Define the system prompt 

system_prompt = "You are an assistant that analyzes the contents of a website \
and provides a short summary, ignoring text that might be navigation related. \
Respond in markdown."

In [4]:
# Create a user prompt using the title and body from a WebSite instance
def user_prompt(website):
    return f"Website Title: {website.title}\nWebsite Content: {website.body}\n\nPlease summarize the content in a concise manner."
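
The prompt above embeds the entire page body, which for long pages may exceed the model's context window or increase cost. One option is to trim the text before building the prompt. The helper below is a minimal sketch: `truncate_text` is a hypothetical name, and the 20,000-character default is an arbitrary placeholder, not a documented limit.

```python
def truncate_text(text, max_chars=20_000):
    """Keep at most max_chars characters of text, cut at a word boundary
    when possible, and append a marker so the cut is visible."""
    if text is None or len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    # Back up to the last space so a word is not split in half
    last_space = cut.rfind(" ")
    if last_space > 0:
        cut = cut[:last_space]
    return cut + " [truncated]"


print(truncate_text("alpha beta gamma", max_chars=12))  # → alpha beta [truncated]
```

It could be applied as `truncate_text(website.body)` when building the user prompt. A character budget is only a rough proxy for the token count the model actually sees.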

In [5]:
# Create an OpenAI client (reads OPENAI_API_KEY from the environment by default)
from openai import OpenAI

try:
    openai_client = OpenAI()
    print("OpenAI client created successfully.")
except Exception as e:
    print(f"Failed to create OpenAI client: {e}")


OpenAI client created successfully.


In [6]:
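# Function to get a summary from OpenAI
def get_summary(url):
    website = WebSite(url)
    # Skip the API call if the page could not be retrieved or has no text
    if not website.body:
        return f"Could not retrieve any content from {url}."
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt(website)}
        ]
    )
    # message.content can be None, so guard before stripping
    content = response.choices[0].message.content
    return content.strip() if content else ""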

In [7]:
# Function to display markdown in Jupyter Notebook

from IPython.display import Markdown, display

def display_markdown(text):
    display(Markdown(text))

In [8]:
# Use the OpenAI API to generate a summary of CNN
summary = get_summary("https://www.cnn.com")
display_markdown(summary)

The CNN website provides current news across various topics including US and world politics, business, health, entertainment, and science. It covers major ongoing geopolitical events such as the Ukraine-Russia war and the Israel-Hamas conflict. The site also features sections on technology, climate, style, travel, sports, and opinion pieces. In addition, CNN offers video content, live TV options, and podcasts. The homepage highlights a mix of breaking news, analysis, and human interest stories designed to engage a wide audience. It also provides feedback options for users to report issues with video content and advertisements.

In [9]:
# Use the OpenAI API to generate a summary of www.nu.nl
summary = get_summary("https://www.nu.nl")
display_markdown(summary)

NU.nl provides the latest news updates in Dutch across various categories such as national and international events, politics, economy, sports, culture, and more. Key stories include the continuation of the Tomorrowland festival despite a mainstage fire, a ceasefire agreement in Syria, and a case of femicide in Gouda. Other highlights include issues in the housing market, particularly the pressure in the social rental sector, and updates on economic activities such as the European Commission's tax plans. In sports, there are updates from the Tour de France and football transfers, while in culture, the auction of Darth Vader's lightsaber and other media highlights are featured. The platform also offers video content, discussions, and recommendations for entertainment and lifestyle.