## Day 1: Building a Web Page Summarizer

*[Coding along with the Udemy online course [LLM Engineering: Master AI & Large Language Models](https://www.udemy.com/course/llm-engineering-master-ai-and-large-language-models/) by Ed Donner; GitHub repo can be found at [github.com/ed-donner/llm_engineering](https://github.com/ed-donner/llm_engineering)]*

A simple example of using a frontier LLM in a way that has some business value. Our goal here is to code a kind of web browser that takes a URL as input and responds with a summary of the page.

In [1]:
import os
import requests
from bs4 import BeautifulSoup
from IPython.display import Markdown, display
import pandas as pd
from openai import OpenAI

In [2]:
api_key = pd.read_csv("~/tmp/chat_gpt/agentic-design-1.txt", sep=" ", header=None)[0][0]
print("Don't be a fool: never push your API key to GitHub")

Don't be a fool: never push your API key to GitHub
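Instead of reading the key from a text file, it could also come from an environment variable. The following is a minimal sketch of that alternative, not what the cell above does: the helper name `load_api_key` is hypothetical, and `OPENAI_API_KEY` is the variable name the `OpenAI()` client looks for by default when no `api_key` argument is passed.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in an environment variable (hypothetical helper)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail early with a readable message instead of a later authentication error
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key
```

With the variable exported in the shell, `OpenAI(api_key=load_api_key())` would then replace the file-based lookup, and the key never has to live inside the repository.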


In [3]:
openai = OpenAI(api_key=api_key)

In [4]:
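# represent a web page as a class: fetch it, then keep only its title and visible text
class Website:
    url: str
    title: str
    text: str

    def __init__(self, url):
        self.url = url
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        self.title = soup.title.string if soup.title else "No title found"
        if soup.body:
            # drop elements that contribute no readable text
            for irrelevant in soup.body(["script", "style", "img", "input"]):
                irrelevant.decompose()
            # one text fragment per line, surrounding whitespace stripped
            self.text = soup.body.get_text(separator="\n", strip=True)
        else:
            # pages without a <body> would otherwise raise an error here
            self.text = ""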
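To see what the cleanup step does without any network access, the same BeautifulSoup calls can be run on a small inline HTML string. This is only a sketch: the function name `extract_title_and_text` and the sample HTML are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up sample page for illustration
SAMPLE_HTML = """
<html><head><title>Demo page</title></head>
<body>
<script>console.log("tracking");</script>
<style>p { color: red; }</style>
<h1>Hello</h1>
<p>First <b>bold</b> paragraph.</p>
<img src="x.png" alt="pic">
</body></html>
"""


def extract_title_and_text(html: str):
    """Return (title, visible text) using the same steps as the Website class."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.string if soup.title else "No title found"
    if soup.body is None:
        return title, ""
    # Remove elements that carry no readable text
    for irrelevant in soup.body(["script", "style", "img", "input"]):
        irrelevant.decompose()
    # Every text fragment is stripped and placed on its own line
    return title, soup.body.get_text(separator="\n", strip=True)


title, text = extract_title_and_text(SAMPLE_HTML)
print(title)
print(text)
```

Because `get_text` puts each text fragment on its own line, inline tags such as `<b>` or `<a>` split a sentence across several lines, which is why extracted page text tends to look fragmented.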

In [5]:
# let's grab an example
ed = Website("https://edwarddonner.com")
print(ed.title)
print(ed.text)

Home - Edward Donner
Home
Outsmart
An arena that pits LLMs against each other in a battle of diplomacy and deviousness
About
Posts
Well, hi there.
I’m Ed. I like writing code and experimenting with LLMs, and hopefully you’re here because you do too. I also enjoy DJing (but I’m badly out of practice), amateur electronic music production (
very
amateur) and losing myself in
Hacker News
, nodding my head sagely to things I only half understand.
I’m the co-founder and CTO of
Nebula.io
. We’re applying AI to a field where it can make a massive, positive impact: helping people discover their potential and pursue their reason for being. Recruiters use our product today to source, understand, engage and manage talent. I’m previously the founder and CEO of AI startup untapt,
acquired in 2021
.
We work with groundbreaking, proprietary LLMs verticalized for talent, we’ve
patented
our matching model, and our award-winning platform has happy customers and tons of press coverage.
Connect
with me for

### Types of prompts

Models like GPT-4o have been trained to receive instructions in a particular way.

There are two kinds of prompts they expect to receive:

**1. A System Prompt** that tells them what task they are performing and what tone they should use

**2. A User Prompt** -- the conversation starter that they should reply to

In [6]:
system_prompt = "You are an assistant that analyzes the contents of a website \
and provides a short summary, ignoring text that might be navigation related. \
Respond in markdown."

In [7]:
def user_prompt_for(website):
    # the trailing \n keeps the title separate from the sentence that follows
    user_prompt = f"You are looking at a website titled {website.title}\n"
    user_prompt += "The contents of this website are as follows; \
please provide a short summary of this website in markdown. \
If it includes news or announcements, then summarize these too.\n\n"
    user_prompt += website.text
    return user_prompt
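Large pages can produce very long prompts. One possible extension, which is not part of the course code, is to cap the page text before appending it. In the sketch below the function name `build_user_prompt`, the `max_chars` parameter and its default of 8000 are all assumptions chosen for illustration.

```python
def build_user_prompt(title: str, text: str, max_chars: int = 8000) -> str:
    """Assemble the user prompt, cutting off very long page text (hypothetical helper)."""
    if len(text) > max_chars:
        # Keep the beginning of the page and mark that something was dropped
        text = text[:max_chars] + "\n[content truncated]"
    return (
        f"You are looking at a website titled {title}\n"
        "The contents of this website are as follows; "
        "please provide a short summary of this website in markdown. "
        "If it includes news or announcements, then summarize these too.\n\n"
        + text
    )


# Tiny limit so the truncation is visible
prompt = build_user_prompt("Demo page", "word " * 10, max_chars=20)
print(prompt)
```

Cutting by characters is crude, since models count tokens rather than characters, but it is enough to keep a single huge page from dominating the request.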

### Messages

Many LLM APIs (including OpenAI's) expect to receive messages in a particular structure:

```
[
    {"role": "system", "content": "system message goes here"},
    {"role": "user", "content": "user message goes here"}
]
```

In [8]:
def messages_for(website):
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt_for(website)}
    ]
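The same list structure also carries multi-turn conversations: earlier model replies are appended with the role `assistant`, followed by the next user message. The summarizer here only needs a single turn, so the following is just a sketch of how the structure grows; `extend_conversation` is a hypothetical helper name.

```python
def extend_conversation(messages, assistant_reply, follow_up):
    """Return a new message list with one more assistant/user exchange appended."""
    return messages + [
        {"role": "assistant", "content": assistant_reply},
        {"role": "user", "content": follow_up},
    ]


messages = [
    {"role": "system", "content": "You summarize web pages."},
    {"role": "user", "content": "Summarize this page: ..."},
]
longer = extend_conversation(messages, "Here is a summary ...", "Make it shorter.")
print([m["role"] for m in longer])
```

The original list is left untouched, so the first request can still be replayed as it was sent.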

In [9]:
def summarize(url):
    website = Website(url)
    # https://platform.openai.com/docs/guides/text-generation
    response = openai.chat.completions.create(
        model = "gpt-4o-mini",
        messages = messages_for(website)
    )
    return response.choices[0].message.content
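Since every call to `summarize` costs an API request, it can help to try the call path with a stand-in client first. The sketch below passes the client in as a parameter and uses a fake object that mimics only the attributes used above (`chat.completions.create` and `choices[0].message.content`). The names `summarize_text` and `FakeCompletions` are made up for this illustration.

```python
from types import SimpleNamespace


def summarize_text(client, system_prompt, user_prompt, model="gpt-4o-mini"):
    """Send one system/user message pair and return the reply text."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content


class FakeCompletions:
    """Stand-in for the completions endpoint; performs no network call."""

    def create(self, model, messages):
        last = messages[-1]["content"]
        reply = SimpleNamespace(content=f"(fake {model} summary of {len(last)} chars)")
        return SimpleNamespace(choices=[SimpleNamespace(message=reply)])


fake_client = SimpleNamespace(chat=SimpleNamespace(completions=FakeCompletions()))
result = summarize_text(fake_client, "Summarize briefly.", "Hello world")
print(result)
```

Passing the real `openai` client object instead of `fake_client` would send the same messages to the actual model.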

In [10]:
summarize("https://edwarddonner.com")

"# Summary of Edward Donner's Website\n\nEdward Donner's website showcases his work and interests in coding, experimentation with Large Language Models (LLMs), and his expertise in AI. He is the co-founder and CTO of Nebula.io, a company utilizing AI to enhance talent discovery and management, and he previously founded the AI startup, untapt, which was acquired in 2021. \n\n## Key Features:\n- **Outsmart LLM Arena:** A section dedicated to a competitive environment where LLMs engage in diplomacy and tactics.\n- **Personal Interests:** Ed shares his passion for coding, DJing, and electronic music production, as well as his engagement with tech news.\n- **Resources and Posts:** The site includes various posts related to AI and LLMs, including topics like transitioning from software engineer to AI data scientist and fine-tuning LLMs.\n\n## Recent News and Announcements:\n- **October 16, 2024:** Post about resources for transitioning from Software Engineer to AI Data Scientist.\n- **August

In [11]:
def display_summary(url):
    summary = summarize(url)
    display(Markdown(summary))

In [12]:
display_summary("https://edwarddonner.com")

# Summary of Edward Donner's Website

Edward Donner's website highlights his interests and professional background in the field of artificial intelligence, particularly with large language models (LLMs). He is the co-founder and CTO of Nebula.io, a company focused on leveraging AI to aid in talent discovery and engagement. 

## Key Sections:
- **About**: Ed describes himself as a coder and LLM enthusiast who also enjoys DJing and electronic music production. He has experience as the founder of the AI startup untapt, which was acquired in 2021.
- **News/Announcements**:
  - **October 16, 2024**: Articles about transitioning from Software Engineer to AI Data Scientist with recommended resources.
  - **August 6, 2024**: Introduction to the "Outsmart LLM Arena," a platform for LLM competitions focusing on diplomacy and strategy.
  - **June 26, 2024**: Guidance on selecting the appropriate LLM, including toolkits and resources.
  - **February 7, 2024**: Techniques for fine-tuning LLMs using personal texts to simulate individual style.

The site encourages readers to connect with Ed for further engagement and information.

In [13]:
display_summary("https://cnn.com")

# Summary of CNN Website

CNN is a major news outlet providing a wide range of global news coverage, including topics such as:

- **US and World News**
- **Politics**
- **Business**
- **Health**
- **Entertainment**
- **Science and Climate**
- **Sports**

### Recent News Highlights:
- **Israel-Hamas Conflict**: Ongoing updates about airstrikes and military strategies related to the conflict, including leaked documents regarding U.S. intelligence.
- **US Politics**: Coverage of the 2024 election campaign, including strategies from candidates like Kamala Harris and Donald Trump during election ads and public appearances.
- **Global Events**: News about Indonesia's new president, challenges from Hezbollah to Israeli defenses, and various cultural stories from around the world, such as the rise of toxicity in masculinity in Kenya.

### Additional Features:
- CNN provides exclusive reports, analysis pieces, and live video content, covering in-depth investigations and detailed explorations of significant events and societal trends.

This summary encapsulates the site's intent to deliver timely and relevant reporting on international issues while engaging viewers with various multimedia content.

In [14]:
display_summary("https://anthropic.com")

# Summary of Anthropic Website

Anthropic is an AI safety and research company based in San Francisco, dedicated to developing reliable and beneficial AI systems with a focus on safety. Their interdisciplinary team combines expertise in machine learning, physics, policy, and product development.

## Key Offerings
- **Claude 3.5 Sonnet**: The latest and most intelligent AI model, available for users to interact with and integrate into their applications.
- **API Access**: Users can build applications and drive efficiency using the Claude API.

## News and Announcements
- **Claude 3.5 Sonnet Release**: Announced on June 21, 2024, highlighting the launch of their newest AI model.
- **Research Publications**: Includes notable works like "Constitutional AI: Harmlessness from AI Feedback" (Dec 15, 2022) and "Core Views on AI Safety: When, Why, What, and How" (Mar 8, 2023).

Overall, Anthropic aims to put safety at the forefront of AI development through innovative research and technology.

### Business Applications

This exercise was about calling the API of a frontier model (a leading model at the frontier of AI).

We've applied this to summarization, a classic Gen AI use case. The same approach can be used for summarizing the news, summarizing financial performance, or summarizing a resume in a cover letter.

Our summarizer won't return satisfactory results for websites that build their HTML structure with JavaScript (like https://openai.com), because `requests` only fetches the initial HTML and never runs the page's scripts. A workaround might be the Selenium framework, which runs a browser behind the scenes and renders the page for you.

In [15]:
display_summary("https://openai.com")

# Website Summary

The website currently prompts users to enable JavaScript and cookies in their web browsers in order to access the content. There is no additional information or content available without these settings enabled, therefore no specific news or announcements can be summarized at this time.
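The Selenium workaround mentioned above could look roughly like the following. This is an untested sketch under several assumptions: Selenium 4 and a local Chrome installation are available, a fixed two-second wait is enough for the page's scripts to finish, and the site does not block automated browsers (some do). The helper names `fetch_rendered_html` and `looks_js_gated` are hypothetical, and so is the simple heuristic for spotting a "please enable JavaScript" page.

```python
import time


def looks_js_gated(text: str, min_chars: int = 200) -> bool:
    """Heuristic: very short text, or a request to enable JavaScript, hints at a script-built page."""
    return len(text) < min_chars or "enable javascript" in text.lower()


def fetch_rendered_html(url: str, wait_seconds: float = 2.0) -> str:
    """Load the page in headless Chrome and return the HTML after scripts have run."""
    from selenium import webdriver  # imported lazily so the heuristic works without Selenium

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # run Chrome without opening a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        time.sleep(wait_seconds)  # crude wait for client-side rendering
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on errors
```

A `Website` variant could first parse the page as before and, only if `looks_js_gated(self.text)` is true, re-parse the result of `fetch_rendered_html(url)` with BeautifulSoup.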