# Webpage Summarizer with Ollama and Llama 3.2

This Jupyter Notebook demonstrates a simple workflow for summarizing any webpage using a locally run Large Language Model (LLM) served by Ollama.

**The process is as follows:**
1.  Input a URL of a webpage.
2.  The notebook fetches the HTML content of the page.
3.  It uses the `BeautifulSoup` library to parse the HTML and extract the main text content, cleaning out irrelevant tags like scripts, styles, and images.
4.  It constructs a prompt containing the extracted text.
5.  It sends this prompt to a locally running Ollama model (`llama3.2`) to generate a concise summary.
6.  The final summary is displayed in a clean, readable format.

## Recap on installation of Ollama

Simply visit [ollama.com](https://ollama.com) and install!

Once complete, the Ollama server should already be running locally.  
If you visit:  
[http://localhost:11434/](http://localhost:11434/)

You should see the message `Ollama is running`.  

If not, open a new Terminal (Mac) or PowerShell (Windows) and enter `ollama serve`.  
Then, in another Terminal (Mac) or PowerShell (Windows), enter `ollama pull llama3.2`.  
Then try [http://localhost:11434/](http://localhost:11434/) again.
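
The same check can be done from Python instead of the browser. The following is a minimal sketch, assuming the server listens on the default address shown above. The helper name `ollama_is_running` and the 2-second timeout are illustrative choices, not part of Ollama, and the sketch uses only the standard library so it runs before any packages are installed.

```python
import http.client
import urllib.request


def ollama_is_running(base_url="http://localhost:11434/", timeout=2.0):
    """Return True if base_url answers with the 'Ollama is running' message."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except (OSError, http.client.HTTPException):
        # Covers connection refused, timeouts, DNS failures and HTTP error codes.
        return False
    return "Ollama is running" in body


if ollama_is_running():
    print("Ollama server is reachable.")
else:
    print("Ollama server not reachable; try running `ollama serve`.")
```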

If Ollama is slow on your machine, try the smaller `llama3.2:1b` model instead. Run `ollama pull llama3.2:1b` from a Terminal or PowerShell, and change the code below from `MODEL = "llama3.2"` to `MODEL = "llama3.2:1b"`.

### Importing Necessary Libraries

In [1]:
import requests
from bs4 import BeautifulSoup
from IPython.display import Markdown, display
import ollama

In [2]:
# Constants
MODEL = "llama3.2"

### Fetching and Cleaning Webpage Content

In [3]:
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
}

class Website:

    def __init__(self, url):
        """
        Create this Website object from the given url using the BeautifulSoup library
        """
        self.url = url
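        response = requests.get(url, headers=headers, timeout=30)
        response.raise_for_status()  # fail early on HTTP errors such as 404 or 500
        soup = BeautifulSoup(response.content, 'html.parser')
        self.title = soup.title.string if soup.title and soup.title.string else "No title found"
        if soup.body is None:
            # Some responses have no <body>; avoid calling methods on None
            self.text = ""
            return
        for irrelevant in soup.body(["script", "style", "img", "input"]):
            irrelevant.decompose()
        self.text = soup.body.get_text(separator="\n", strip=True)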
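
To see what the cleaning step does without making a network request, the same BeautifulSoup calls can be run on a small inline HTML string. This is a sketch only: `SAMPLE_HTML` and the `extract_text` helper are made up for illustration and are not part of the `Website` class.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page used only for this illustration.
SAMPLE_HTML = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <h1>Hello</h1>
    <script>console.log("tracking");</script>
    <style>h1 { color: red; }</style>
    <p>First paragraph.</p>
    <img src="logo.png" alt="logo">
  </body>
</html>
"""


def extract_text(html):
    """Return (title, visible body text) using the same calls as Website."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.string if soup.title else "No title found"
    body = soup.body
    if body is None:
        # html.parser does not invent a <body> when the markup lacks one.
        return title, ""
    # Calling a tag with a list of names is shorthand for find_all(names).
    for tag in body(["script", "style", "img", "input"]):
        tag.decompose()  # remove the tag and everything inside it
    return title, body.get_text(separator="\n", strip=True)


title, text = extract_text(SAMPLE_HTML)
print(title)
print(text)
```

Only the heading and paragraph text survive; the script, style, and image tags are removed before the text is collected.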

In [4]:
# Quick check that BeautifulSoup and the Website class work.
# Change the URL and add print statements to follow along.

website = Website("https://www.cnn.com/")
print(website.title)
print(website.text)

Breaking News, Latest News and Videos | CNN
CNN values your feedback
1. How relevant is this ad to you?
2. Did you encounter any technical issues?
Video player was slow to load content
Video content never loaded
Ad froze or did not finish loading
Video content did not start after ad
Audio on ad was too loud
Other issues
Ad never loaded
Ad prevented/slowed the page from loading
Content moved around while ad loaded
Ad was repetitive to ads I've seen previously
Other issues
Cancel
Submit
Thank You!
Your effort and contribution in providing this feedback is much
                                        appreciated.
Close
Ad Feedback
Close icon
US
World
Politics
Business
Health
Entertainment
Underscored
Style
Travel
Sports
Science
Climate
Weather
Ukraine-Russia War
Israel-Hamas War
Games
More
US
World
Politics
Business
Health
Entertainment
Underscored
Style
Travel
Sports
Science
Climate
Weather
Ukraine-Russia War
Israel-Hamas War
Games
Watch
Listen
Live TV
Subscribe
Sign in
My Account
Settin

### Summarizing the Content with Ollama

In [5]:
# Define our system prompt

system_prompt = "You are an assistant that analyzes the contents of a website \
and provides a short summary, ignoring text that might be navigation related. \
Respond in markdown."

In [6]:
# A function that writes a User Prompt that asks for summaries of websites:

def user_prompt_for(website):
    user_prompt = f"You are looking at a website titled {website.title}"
    user_prompt += "\nThe contents of this website are as follows; \
please provide a short summary of this website in markdown. \
If it includes news or announcements, then summarize these too.\n\n"
    user_prompt += website.text
    return user_prompt
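
A page such as a news homepage can yield a very long `website.text`, and a local model only has a limited context window. One possible safeguard is to cap the text before appending it to the prompt. The sketch below is an optional addition, not something the notebook does; the helper name `truncate_text` and the 8000-character default are assumptions chosen for illustration.

```python
def truncate_text(text, max_chars=8000):
    """Cut text to roughly max_chars characters, then append a short marker.

    The cut is moved back to the last newline inside the limit when there is
    one, so a line is not split in the middle.
    """
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    last_newline = cut.rfind("\n")
    if last_newline > 0:
        cut = cut[:last_newline]
    return cut + "\n[content truncated]"


# Example with synthetic text standing in for website.text
sample = "\n".join(f"line {i}" for i in range(1000))
short = truncate_text(sample, max_chars=100)
print(short)
```

With a helper like this, `user_prompt += truncate_text(website.text)` would replace the plain `user_prompt += website.text`.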

In [7]:
def messages_for(website):
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt_for(website)}
    ]

In [8]:
def summarize(url):
    website = Website(url)
    
    response = ollama.chat(
        model=MODEL, 
        messages=messages_for(website)
    )

    return response.message.content
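
The message-passing logic can also be exercised without a running model by passing the chat function in as a parameter. The sketch below assumes a hypothetical helper `chat_once` and a stand-in `fake_chat`; neither is part of the `ollama` package. It reads the reply with `response["message"]["content"]`, which is intended to work both with a plain dictionary and with the response object returned by `ollama.chat`.

```python
def chat_once(messages, model="llama3.2", chat_fn=None):
    """Send messages to a chat function and return the reply text.

    chat_fn is a hypothetical injection point: any callable that accepts
    model= and messages= and returns a response with ["message"]["content"].
    """
    if chat_fn is None:
        import ollama  # imported lazily so the helper loads without Ollama
        chat_fn = ollama.chat
    response = chat_fn(model=model, messages=messages)
    return response["message"]["content"]


def fake_chat(model, messages):
    """Stand-in for ollama.chat that echoes basic facts about the request."""
    roles = ", ".join(m["role"] for m in messages)
    return {"message": {"content": f"[{model}] received roles: {roles}"}}


messages = [
    {"role": "system", "content": "You summarize websites in markdown."},
    {"role": "user", "content": "Title: Example\n\nSome page text."},
]
print(chat_once(messages, chat_fn=fake_chat))
```

Calling `chat_once(messages)` with no `chat_fn` would use the real `ollama.chat` and therefore needs the server running.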

In [9]:
def display_summary(url):
    summary = summarize(url)
    display(Markdown(summary))

### Run the Summarizer!

In [11]:
# Change the URL below to summarize any webpage you want.
display_summary("https://cnn.com")

This appears to be a sample page of the CNN website, featuring various articles and sections from different categories such as news, entertainment, technology, and more. The content is organized into several sections, including:

* Headlines: A selection of top headlines from around the world.
* Politics: Articles related to politics and government, including news about the US presidential election.
* Business: News and analysis on business, finance, and economics.
* Sports: Coverage of various sports teams and events, including pro football, college football, basketball, baseball, soccer, and more.
* Science: Articles about scientific discoveries, space exploration, climate change, and related topics.
* Entertainment: News and reviews about movies, television shows, music, celebrity gossip, and more.
* Health: Articles on health and wellness, including fitness, nutrition, mental health, and disease prevention.
* Technology: News and analysis on technology, including gadgets, innovation, and emerging trends.

The page also features several recurring sections, such as:

* "5 Things" Quiz: A daily quiz with five questions related to current events and news.
* Daily Crossword: A daily crossword puzzle for entertainment purposes only.
* Jumble Crossword: Another version of the crosswords.
* Photo Shuffle: A random selection of photos from around the world.
* Sudoblock, Sudoku, and other games and puzzles.

In addition to these sections, the page also features several paid content areas, such as:

* CNN Underscored: An area featuring deals on products and services.
* Fashion, Beauty, and Home: Articles and reviews about fashion, beauty, and home decor.
* Entertainment: News and reviews about movies, television shows, music, and celebrity gossip.

Overall, this page appears to be a sample of the CNN website's content, showcasing a variety of topics and features from different sections.