In [1]:
# imports

import os
import requests
from dotenv import load_dotenv
from bs4 import BeautifulSoup
from IPython.display import Markdown, display
from openai import OpenAI

### Connecting to OpenAI

The next cells load the environment variables from the `.env` file and connect to OpenAI.

In [2]:
# Load environment variables in a .env file

load_dotenv(override=True)
api_key = os.getenv('OPENAI_API_KEY')

# Check the key

if not api_key:
    print("No API key was found - please head over to the troubleshooting notebook in this folder to identify & fix!")
# Check for stray whitespace first; a leading space would otherwise be reported as a wrong prefix
elif api_key.strip() != api_key:
    print("An API key was found, but it looks like it might have space or tab characters at the start or end - please remove them - see troubleshooting notebook")
elif not api_key.startswith("sk-proj-"):
    print("An API key was found, but it doesn't start sk-proj-; please check you're using the right key - see troubleshooting notebook")
else:
    print("API key found and looks good so far!")


API key found and looks good so far!


In [3]:
openai = OpenAI()

### A quick call to a Frontier model to get started, as a test!

In [4]:
message = "Hello, GPT! This is my first ever message to you! Hi!"
response = openai.chat.completions.create(model="gpt-4o-mini", messages=[{"role":"user", "content":message}])
print(response.choices[0].message.content)

Hello! It’s great to hear from you, and I’m glad you decided to reach out. How can I assist you today?


### Scraper_Summarizer Project

In [5]:
# A class to represent a Webpage

# Some websites need you to use proper headers when fetching them:
headers = {
 "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
}

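class Website:

    def __init__(self, url):
        """
        Create this Website object from the given url using the BeautifulSoup library
        """
        self.url = url
        # A timeout stops the notebook from hanging on an unresponsive site
        response = requests.get(url, headers=headers, timeout=30)
        soup = BeautifulSoup(response.content, 'html.parser')
        self.title = soup.title.string if soup.title else "No title found"
        if soup.body:
            # Remove elements that carry no readable text before extracting it
            for irrelevant in soup.body(["script", "style", "img", "input"]):
                irrelevant.decompose()
            self.text = soup.body.get_text(separator="\n", strip=True)
        else:
            # Some responses (error pages, non-HTML content) have no <body>
            self.text = ""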
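
To see what the parsing steps in `Website` do without touching the network, here is a minimal sketch that runs the same BeautifulSoup calls on a small hard-coded HTML string. The `sample_html` string and the `extract_text` helper are made up for illustration; they are not part of the notebook's own code.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, used only for this illustration
sample_html = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <h1>Hello</h1>
    <script>var tracking = 1;</script>
    <p>World</p>
  </body>
</html>
"""

def extract_text(html):
    """Return (title, text) using the same steps as the Website class."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.string if soup.title else "No title found"
    # Calling a tag like a function is shorthand for find_all()
    for irrelevant in soup.body(["script", "style", "img", "input"]):
        irrelevant.decompose()
    # strip=True trims each text fragment and drops the empty ones
    text = soup.body.get_text(separator="\n", strip=True)
    return title, text

title, text = extract_text(sample_html)
print(title)
print(text)
```

The script contents disappear from the result, while the heading and paragraph text survive on separate lines.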

In [9]:
cnn_web = Website("https://cnn.com")
print(cnn_web.title)
print(cnn_web.text)

Breaking News, Latest News and Videos | CNN
CNN values your feedback
1. How relevant is this ad to you?
2. Did you encounter any technical issues?
Video player was slow to load content
Video content never loaded
Ad froze or did not finish loading
Video content did not start after ad
Audio on ad was too loud
Other issues
Ad never loaded
Ad prevented/slowed the page from loading
Content moved around while ad loaded
Ad was repetitive to ads I've seen previously
Other issues
Cancel
Submit
Thank You!
Your effort and contribution in providing this feedback is much
                                        appreciated.
Close
Ad Feedback
Close icon
US
World
Politics
Business
Health
Entertainment
Style
Travel
Sports
Science
Climate
Weather
Ukraine-Russia War
Israel-Hamas War
Underscored
Games
More
US
World
Politics
Business
Health
Entertainment
Style
Travel
Sports
Science
Climate
Weather
Ukraine-Russia War
Israel-Hamas War
Underscored
Games
Watch
Listen
Live TV
Subscribe
Sign in
My Account
Settin

## Types of prompts

Models like GPT-4o have been trained to receive instructions in a particular way.

They expect to receive:

**A system prompt** that tells them what task they are performing and what tone they should use

**A user prompt** -- the conversation starter that they should reply to

In [7]:
# Define your system prompt - you can experiment with this, e.g. changing the last sentence to "Respond in markdown in Spanish."

system_prompt = "You are an assistant that analyzes the contents of a website \
and provides a short summary, ignoring text that might be navigation related. \
Respond in markdown."

In [8]:
# A function that writes a User Prompt that asks for summaries of websites:

def user_prompt_for(website):
    user_prompt = f"You are looking at a website titled {website.title}"
    user_prompt += "\nThe contents of this website is as follows; \
please provide a short summary of this website in markdown. \
If it includes news or announcements, then summarize these too.\n\n"
    user_prompt += website.text
    return user_prompt

In [10]:
print(user_prompt_for(cnn_web))

You are looking at a website titled Breaking News, Latest News and Videos | CNN
The contents of this website is as follows; please provide a short summary of this website in markdown. If it includes news or announcements, then summarize these too.

CNN values your feedback
1. How relevant is this ad to you?
2. Did you encounter any technical issues?
Video player was slow to load content
Video content never loaded
Ad froze or did not finish loading
Video content did not start after ad
Audio on ad was too loud
Other issues
Ad never loaded
Ad prevented/slowed the page from loading
Content moved around while ad loaded
Ad was repetitive to ads I've seen previously
Other issues
Cancel
Submit
Thank You!
Your effort and contribution in providing this feedback is much
                                        appreciated.
Close
Ad Feedback
Close icon
US
World
Politics
Business
Health
Entertainment
Style
Travel
Sports
Science
Climate
Weather
Ukraine-Russia War
Israel-Hamas War
Underscored
Games
Mo

## Messages

The API from OpenAI expects to receive messages in a particular structure.
Many of the other APIs share this structure:

```
[
    {"role": "system", "content": "system message goes here"},
    {"role": "user", "content": "user message goes here"}
]
```

To preview, the next 2 cells make a rather simple call!

**Note that *messages* is a list of dictionaries**
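
Because the structure is just plain Python data, it can be checked before anything is sent. The sketch below is one hedged way to do that; `check_messages` and `ALLOWED_ROLES` are hypothetical names invented here, not part of the OpenAI library, and the check is deliberately narrower than what the real API accepts (it only allows string content and the three roles listed).

```python
# Roles this sketch accepts; "assistant" is the role used for model replies.
# Assumption: this list is intentionally minimal for the illustration.
ALLOWED_ROLES = {"system", "user", "assistant"}

def check_messages(messages):
    """Raise ValueError if messages is not a list of role/content dicts."""
    if not isinstance(messages, list):
        raise ValueError("messages must be a list")
    for i, message in enumerate(messages):
        if not isinstance(message, dict):
            raise ValueError(f"message {i} is not a dictionary")
        if message.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"message {i} has an unexpected role")
        if not isinstance(message.get("content"), str):
            raise ValueError(f"message {i} needs string content")
    return True

example = [
    {"role": "system", "content": "system message goes here"},
    {"role": "user", "content": "user message goes here"},
]
print(check_messages(example))
```

A check like this can catch a misspelled key or a stray string in the list early, before the request reaches the API.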


In [11]:
messages = [
    {"role": "system", "content": "You are a snarky assistant"},
    {"role": "user", "content": "What is 2 + 2?"}
]

In [12]:
# To give a preview -- calling OpenAI with system and user messages:

response = openai.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(response.choices[0].message.content)

Oh, you really want to dive into the deep end of math, huh? Well, hold onto your hat—it’s a thrilling 4! Mind-blowing, I know.


#### Now, let's build useful messages for GPT-4o-mini, using a function

In [13]:
# This function creates exactly the format above

def messages_for(website):
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt_for(website)}
    ]

In [14]:

messages_for(cnn_web)

[{'role': 'system',
  'content': 'You are an assistant that analyzes the contents of a website and provides a short summary, ignoring text that might be navigation related. Respond in markdown.'},
 {'role': 'user',

### Time to bring it together!

In [15]:
def summarize(url):
    website = Website(url)
    response = openai.chat.completions.create(
        model = "gpt-4o-mini",
        messages = messages_for(website)
    )
    return response.choices[0].message.content

In [16]:
summarize("https://cnn.com")

"# CNN Website Summary\n\n## Overview\nThe CNN website serves as a comprehensive source for breaking news, covering various topics including U.S. and world news, politics, business, health, entertainment, and sports. It features live updates and provides access to video content and podcasts.\n\n## Key News Highlights\n1. **Tariffs and Economy**\n   - President Trump's imminent tariffs are set to impact American consumers significantly, with many scrambling to purchase cars ahead of their implementation. EU leaders have expressed concerns and threatened retaliation.\n   - Layoffs in U.S. health agencies are at their highest in four years, with reports indicating massive job cuts.\n\n2. **Political Landscape**\n   - Senator Cory Booker is protesting the Trump administration's agenda with one of the longest speeches in Senate history.\n   - Recent elections could influence House dynamics, with potential shifts in support for Republican Speaker Johnson.\n\n3. **Crime and Justice**\n   - An

In [17]:
# A function to display this nicely in the Jupyter output, using markdown

def display_summary(url):
    summary = summarize(url)
    display(Markdown(summary))

In [18]:
display_summary("https://cnn.com")

# Summary of CNN Website

CNN provides up-to-the-minute news coverage and analysis on various topics, including:

## Key Areas of Coverage:
- **US News**: Updates on crime, justice, and political events within the United States.
- **World News**: Coverage of significant events globally, such as natural disasters and international conflicts.
- **Politics**: Insights into Congress, SCOTUS, the 2024 Elections, and ongoing political issues, including analysis of key figures like President Trump and his policies.
- **Business**: Current trends in the economy, including market analyses and updates on various industries.
- **Health**: Health-related news and scientific discoveries affecting the public.
- **Entertainment**: Updates on movies, television shows, celebrities, and notable cultural trends.
- **Sports**: Coverage of various sports, including updates on professional leagues and significant events.
- **Science & Climate**: Discoveries in space exploration and environmental issues.

## Recent Highlights:
- President Trump's upcoming tariffs may affect consumer behavior, particularly in the auto industry.
- Ongoing political debates regarding healthcare funding and immigration policies.
- Significant job cuts within US health agencies amid broader concerns about economic stability.
- Coverage of international conflicts such as the Israel-Hamas War and the Ukraine-Russia War.

## Features:
CNN also offers a variety of video content, live updates, and analysis that keep the audience informed on current affairs, and it aims to engage users through interactive elements like newsletters and topic-specific programming.

## Trying other websites

Note that this will only work on websites that can be scraped using this simplistic approach.

Websites that are rendered with JavaScript, like React apps, won't show up, and you'll need to read up on installing Selenium (ask ChatGPT!)

Also, websites protected with CloudFront (and similar) may give 403 errors!

But many websites will work just fine!
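
When a summary comes back empty or odd, it can help to look at the HTTP status code and the amount of text that was extracted. The sketch below is a rough heuristic only: `diagnose_scrape` is a hypothetical helper, and the 200-character threshold is an arbitrary assumption rather than a tested value.

```python
def diagnose_scrape(status_code, text, min_chars=200):
    """Return a short label hinting at why a simple scrape may have failed.

    status_code: the HTTP status of the response (e.g. response.status_code)
    text: the text extracted from the page body
    min_chars: assumed cut-off below which the page is treated as nearly empty
    """
    if status_code == 403:
        # Typical of sites that block automated requests
        return "blocked"
    if status_code >= 400:
        return "http_error"
    if len(text.strip()) < min_chars:
        # Very little text often means the content is rendered by JavaScript
        return "probably_js_rendered"
    return "ok"

print(diagnose_scrape(403, ""))
print(diagnose_scrape(200, "Loading..."))
print(diagnose_scrape(200, "word " * 100))
```

To use something like this with the `Website` class, the class would need to keep the response's status code alongside the extracted text, which it does not do as written.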

In [19]:
display_summary("https://foxnews.com")

# Fox News Summary

**Website Overview:**
Fox News provides a comprehensive platform for the latest breaking news, covering a wide array of topics including U.S. and world events, politics, entertainment, sports, lifestyle, and health. The site also features various multimedia content, such as videos and interactive games, alongside live television broadcasts.

**Recent News Highlights:**
1. **U.S. Politics:**
   - A federal judge is under scrutiny due to ties with Supreme Court justices, impacting a high-profile deportation case.
   - Cory Booker’s staffer was arrested for carrying an unlicensed gun at the Capitol. 
   - Nine Republicans have disrupted party unity with a rarely used motion, affecting Trump's agenda.

2. **Global Affairs:**
   - Iran has threatened a preemptive strike against a U.S. base amid tensions with Trump’s administration.

3. **Entertainment:**
   - Kid Rock discussed a White House visit from Bill Maher, expressing surprise at the details of the encounter.

4. **Crime & Safety:**
   - A troubling case in Ohio involves six suspects accused of torturing a man during a week-long kidnapping ordeal.
   - Several injuries were reported after a truck struck pedestrians in Boston.

5. **Health Insights:**
   - New experimental treatments show promise in combating common cancers, with notable advancements in survival rates reported.

**Other Key Topics:**
- The site also includes sections on financial planning, educational games, and health alerts regarding food products.
- Special attention is given to the impact of sports and entertainment on societal discussions, including topics on trans athletes in competitive sports.

**Interactive Features:**
- Fox News offers various games, including crossword puzzles and quizzes, to engage users beyond news content.

Overall, the site combines timely updates on pressing news with commentary and diverse forms of entertainment, appealing to a wide audience interested in current affairs and popular culture.

In [31]:
display_summary("https://anthropic.com")

# Anthropic Overview

Anthropic is an AI safety and research company based in San Francisco, focusing on creating reliable and beneficial AI systems. Their interdisciplinary team consists of professionals from various fields, including machine learning, physics, policy, and product development.

## Key Features

- **Claude Models**: 
  - **Claude 3.5 Sonnet**: The latest and most advanced AI model launched.
  - **Claude 3.5 Haiku**: A new model introduced alongside Sonnet.

- **API Access**: Developers can build AI-powered applications and custom experiences using Claude's API.

## Recent Announcements

- **October 22, 2024**: Introduced two new Claude models: **Claude 3.5 Sonnet** and **Claude 3.5 Haiku**.
  
- **September 4, 2024**: Announcement related to **Claude for Enterprise**.

- **Research Update (Dec 15, 2022)**: Discussion on **Constitutional AI** focusing on harmlessness from AI feedback.

- **March 8, 2023**: Publication of **Core Views on AI Safety**, outlining the company’s perspective on AI safety protocols.

The site also provides a link for career opportunities and encourages visitors to engage with Claude’s capabilities.

In [34]:
display_summary("https://BBC.com")

# BBC Home Summary

The BBC Home website serves as a central hub for breaking news across various domains including world events, US news, sports, business, innovation, culture, and climate. It features a variety of news reports, providing updates on significant global concerns, political developments, cultural events, and scientific advancements.

## Recent News Highlights

- **Israel-Gaza War**: A minister has directed the Israeli army to prepare for the potential relocation of Palestinians from Gaza amidst ongoing conflict.
- **Santorini Earthquakes**: A state of emergency has been declared after a series of tremors prompted over 11,000 residents to evacuate the island.
- **US Politics**: A judge has halted former President Trump’s plan to reduce government personnel through buyouts. Additionally, Trump is set to impose sanctions on the International Criminal Court following its actions against the US and its allies.
- **Cultural Events**: Notable mentions include the unveiling of a statue of Cristiano Ronaldo in Times Square and a new exhibition featuring David Hockney's iPad art in Bradford.

The site is rich in multimedia content and is known for providing thorough analyses and expert opinions on current events, showcasing the BBC's commitment to delivering trusted news from around the globe.