## Day 1: Building a Web Page Summarizer

*[Coding along with the Udemy online course [LLM Engineering: Master AI & Large Language Models](https://www.udemy.com/course/llm-engineering-master-ai-and-large-language-models/) by Ed Donner; GitHub repo can be found at [github.com/ed-donner/llm_engineering](https://github.com/ed-donner/llm_engineering)]*

A simple example of using a frontier LLM for something with business value. Our goal here is to code a kind of web browser that takes a URL as input and responds with a summary of the page.

In [5]:
import os
import requests
from bs4 import BeautifulSoup
from IPython.display import Markdown, display
import pandas as pd
from openai import OpenAI

In [6]:
# Read the key from a local file kept outside the repository (first space-separated token)
api_key = pd.read_csv("~/tmp/chat_gpt/agentic-design-1.txt", sep=" ", header=None)[0][0]
print("Don't be a fool and send your API key to GitHub")

Don't be a fool and send your API key to GitHub
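Reading the key with pandas works, but the standard library is enough. The following is a minimal sketch, not the course's method: `load_api_key` is a hypothetical helper that takes the first whitespace-separated token from a key file and otherwise falls back to an environment variable. The default name `OPENAI_API_KEY` is the variable the `openai` client looks up when no key is passed explicitly.

```python
import os
from pathlib import Path


def load_api_key(path=None, env_var="OPENAI_API_KEY"):
    """Return the key from `path` if that file exists, else from the environment.

    Hypothetical helper for illustration; the file is assumed to contain
    the key as its first whitespace-separated token.
    """
    if path is not None:
        key_file = Path(path).expanduser()  # resolve a leading "~"
        if key_file.is_file():
            tokens = key_file.read_text().split()
            if tokens:
                return tokens[0]
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"No API key found in {path!r} or ${env_var}")
    return key
```

With this in place, the cell above could read `api_key = load_api_key("~/tmp/chat_gpt/agentic-design-1.txt")`, and the key still never appears in the notebook itself.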


In [7]:
openai = OpenAI(api_key=api_key)

In [8]:
# Represent a web page as a class holding its URL, title and visible text
class Website:
    url: str
    title: str
    text: str

    def __init__(self, url):
        self.url = url
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        self.title = soup.title.string if soup.title else "No title found"
        # A page without a <body> would otherwise raise an AttributeError
        if soup.body:
            # Drop elements that carry no readable text
            for irrelevant in soup.body(["script", "style", "img", "input"]):
                irrelevant.decompose()
            self.text = soup.body.get_text(separator="\n", strip=True)
        else:
            self.text = ""

In [9]:
# let's grab an example
ed = Website("https://edwarddonner.com")
print(ed.title)
print(ed.text)

Home - Edward Donner
Home
Outsmart
An arena that pits LLMs against each other in a battle of diplomacy and deviousness
About
Posts
Well, hi there.
I’m Ed. I like writing code and experimenting with LLMs, and hopefully you’re here because you do too. I also enjoy DJing (but I’m badly out of practice), amateur electronic music production (
very
amateur) and losing myself in
Hacker News
, nodding my head sagely to things I only half understand.
I’m the co-founder and CTO of
Nebula.io
. We’re applying AI to a field where it can make a massive, positive impact: helping people discover their potential and pursue their reason for being. Recruiters use our product today to source, understand, engage and manage talent. I’m previously the founder and CEO of AI startup untapt,
acquired in 2021
.
We work with groundbreaking, proprietary LLMs verticalized for talent, we’ve
patented
our matching model, and our award-winning platform has happy customers and tons of press coverage.
Connect
with me for

### Types of prompts

Models like GPT-4o have been trained to receive instructions in a particular way.

There are two kinds of prompts they expect to receive:

**1. A System Prompt** that tells them what task they are performing and what tone they should use

**2. A User Prompt** -- the conversation starter that they should reply to

In [10]:
system_prompt = "You are an assistant that analyzes the contents of a website \
and provides a short summary, ignoring text that might be navigation related. \
Respond in markdown."

In [11]:
def user_prompt_for(website):
    user_prompt = f"You are looking at a website titled {website.title}"
    # The leading newline keeps the title separate from the instructions
    user_prompt += "\nThe contents of this website are as follows; \
please provide a short summary of this website in markdown. \
If it includes news or announcements, then summarize these too.\n\n"
    user_prompt += website.text
    return user_prompt

### Messages

Many LLM APIs (including OpenAI's) expect to receive messages in a particular structure:

```
[
    {"role": "system", "content": "system message goes here"},
    {"role": "user", "content": "user message goes here"}
]
```

In [12]:
def messages_for(website):
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt_for(website)}
    ]
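Because the structure is plain Python data, it can be checked before any request is sent. The following is a minimal sketch: `check_messages` and `ALLOWED_ROLES` are hypothetical names, and the role set is an assumption that covers only basic chat roles (an API may accept others).

```python
# Assumption: only the basic chat roles are checked; an API may accept more.
ALLOWED_ROLES = {"system", "user", "assistant"}


def check_messages(messages):
    """Raise ValueError unless `messages` is a non-empty list of role/content dicts."""
    if not isinstance(messages, list) or not messages:
        raise ValueError("messages must be a non-empty list")
    for i, message in enumerate(messages):
        if not isinstance(message, dict):
            raise ValueError(f"message {i} is not a dict")
        if message.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"message {i} has an unexpected role: {message.get('role')!r}")
        if not isinstance(message.get("content"), str):
            raise ValueError(f"message {i} needs a string 'content'")
    return messages


example = [
    {"role": "system", "content": "system message goes here"},
    {"role": "user", "content": "user message goes here"},
]
check_messages(example)  # passes silently and returns the same list
```

Calling `check_messages(messages_for(website))` before the API call would catch a malformed prompt locally instead of through an API error.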

In [13]:
def summarize(url):
    website = Website(url)
    # https://platform.openai.com/docs/guides/text-generation
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages_for(website)
    )
    return response.choices[0].message.content

In [14]:
summarize("https://edwarddonner.com")

"# Summary of Edward Donner's Website\n\nEdward Donner's website serves as a personal and professional hub for sharing insights related to Large Language Models (LLMs), coding, and technology. Ed, the site's creator, is the co-founder and CTO of Nebula.io, a company focused on applying AI to optimize talent discovery and engagement. He has a background in founding AI startups, including untapt.\n\n## Key Features:\n- **Outsmart**: This section introduces a unique arena that challenges LLMs in a competitive environment centered on diplomacy and strategy.\n- **About Ed**: The creator shares his interests in coding, DJing, and electronic music, along with his professional journey in the AI industry.\n\n## Recent Announcements:\n1. **August 6, 2024**: Launch of the Outsmart LLM Arena.\n2. **June 26, 2024**: Resources and tools for choosing the right LLM.\n3. **February 7, 2024**: Guide on fine-tuning LLMs using personal text data.\n4. **January 31, 2024**: Continuation of the LLM fine-tuni

In [15]:
def display_summary(url):
    summary = summarize(url)
    display(Markdown(summary))

In [16]:
display_summary("https://edwarddonner.com")

# Summary of Edward Donner's Website

Edward Donner's website serves as a platform for sharing his interests in coding, large language models (LLMs), and electronic music. He is the co-founder and CTO of Nebula.io, focused on utilizing AI for talent discovery and management. The site features a section titled "Outsmart," where users can engage in competitions between LLMs, testing their skills in diplomacy and strategy.

## Recent Posts
- **August 6, 2024**: Discusses the "Outsmart LLM Arena," a platform for competitive LLM interactions.
- **June 26, 2024**: Offers insights on "Choosing the Right LLM," providing a toolkit and resources.
- **February 7, 2024**: Explores the topic of "Fine-tuning an LLM on your texts," simulating personal input.
- **January 31, 2024**: Continues the discussion on fine-tuning LLMs, specifically focusing on QLoRA. 

Overall, the website combines personal interests, professional insights, and educational resources surrounding AI and LLMs.

In [17]:
display_summary("https://cnn.com")

# CNN News Summary

CNN provides comprehensive coverage of breaking news across various categories including US and world news, politics, business, health, entertainment, sports, and more. 

### Key Highlights:
- **Hurricane Milton:** Over a dozen fatalities reported, and millions are without power in Florida. Cleanup efforts are ongoing as the storm's aftermath reveals significant destruction.
- **Israel-Hamas Conflict:** The US urges Israel to protect civilians following deadly strikes in Beirut. A UN inquiry accused Israel of "extermination" through attacks on Gaza's healthcare system.
- **Nobel Peace Prize:** Survivors of Hiroshima and Nagasaki were awarded the Nobel Peace Prize for their advocacy.
- **Celebrity News:** Notable stories include Rafael Nadal announcing his retirement and Al Pacino celebrating fatherhood at 84.
- **Economic Updates:** US mortgage rates surged for the second straight week, with ongoing discussions about economic concerns ahead of the upcoming elections.

CNN offers a wealth of video content, live updates, and analyses across its various news segments, making it a crucial resource for staying informed on current events.

In [18]:
display_summary("https://anthropic.com")

# Summary of Anthropic Website

Anthropic is an AI safety and research company based in San Francisco, focused on developing reliable and beneficial AI systems. Their latest offering is **Claude 3.5 Sonnet**, touted as the most intelligent AI model available as of June 21, 2024. The company emphasizes AI safety and has a diverse team with expertise in machine learning, physics, policy, and product development.

## News and Announcements
- **Claude 3.5 Sonnet Release**: Announced on June 21, 2024.
- **Research on AI Safety**: Recent research outputs include:
  - **Constitutional AI: Harmlessness from AI Feedback** (Dec 15, 2022)
  - **Core Views on AI Safety** (Mar 8, 2023)

The website invites users to engage with Claude via an API to drive efficiency and innovation.

### Business Applications

This exercise was about calling the API of a frontier model (a leading model at the frontier of AI).

We applied this to summarization, a classic Gen AI use case. The same approach can be used for summarizing the news, summarizing financial performance, or summarizing a resume in a cover letter.

Our summarization does not return satisfactory results for websites that build their HTML structure with JavaScript (like https://openai.com), because `requests` only fetches the HTML the server sends and does not execute scripts. A workaround might be the Selenium framework, which runs a browser behind the scenes and renders the page for you.

In [20]:
display_summary("https://openai.com")

It appears that the website does not have a title or identifiable content beyond instructions related to browser settings. Specifically, it instructs users to enable JavaScript and Cookies to view the page properly. 

### Summary
- **Website Content**: The website lacks substantial information and primarily consists of instructions.
- **Instructions**: Users are advised to turn on JavaScript and Cookies, then reload the page for full functionality.

There are no news or announcements to summarize.
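The Selenium workaround mentioned above could look roughly like the following. This is a sketch under assumptions, not a tested fix for this site: it needs `pip install selenium` and a local Chrome installation, the helper names `make_headless_chrome` and `fetch_rendered_html` are made up for illustration, and the fixed two-second wait is a crude stand-in for proper wait conditions. Sites with bot protection may still refuse to serve content.

```python
import time


def make_headless_chrome():
    """Start a headless Chrome session (assumes Selenium and Chrome are installed)."""
    # Imported here so the rest of the notebook still runs without Selenium
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")  # run Chrome without a visible window
    return webdriver.Chrome(options=options)


def fetch_rendered_html(url, driver, wait_seconds=2.0):
    """Load `url` in `driver`, give scripts time to run, and return the rendered HTML."""
    try:
        driver.get(url)
        time.sleep(wait_seconds)  # crude fixed wait (assumption); tune per site
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even if loading fails
```

A call such as `html = fetch_rendered_html("https://openai.com", make_headless_chrome())` returns a string that can be parsed with `BeautifulSoup(html, 'html.parser')` in the same way the `Website` class parses `response.content`.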