Business problem: write a program that can take any web page on the internet, scrape its contents, and present back a short summary of that page.

In [2]:
import os 
import requests
from dotenv import load_dotenv
from bs4 import BeautifulSoup
from IPython.display import Markdown, display
from openai import OpenAI

In [3]:
# Load environment variables from .env

load_dotenv()
api_key = os.getenv('OPENAI_API_KEY')

# Check the key

if not api_key:
    print("No API key was found")
elif not api_key.startswith("sk-proj-"):
    print("An API key was found, but it doesn't start with 'sk-proj-'; please check that you're using the right API key")
else:
    print("API key found and looks good so far!")


API key found and looks good so far!


In [4]:
# Create openai instance

openai = OpenAI()

In [5]:
# A class to represent a Webpage

class Website:
    """ 
    A utility class to represent a Website that we have scraped.
    """
    url: str
    title: str
    text: str

    def __init__(self, url):
        """
        Create this Website object from the given URL using the BeautifulSoup library.
        """
        self.url = url
        response = requests.get(url, timeout=30)
        soup = BeautifulSoup(response.content, 'html.parser')
        self.title = soup.title.string if soup.title else "No title found."

        # Some pages have no <body>; fall back to empty text instead of raising.
        if soup.body:
            # Drop elements that carry no readable text.
            for irrelevant in soup.body(["script", "style", "img", "input"]):
                irrelevant.decompose()
            self.text = soup.body.get_text(separator="\n", strip=True)
        else:
            self.text = ""

    

In [6]:
# Let's try to parse one website

web = Website("https://en.wikipedia.org/wiki/DeepSeek")
print(web.title)
print(web.text)

DeepSeek - Wikipedia
Jump to content
Main menu
Main menu
move to sidebar
hide
Navigation
Main page
Contents
Current events
Random article
About Wikipedia
Contact us
Contribute
Help
Learn to edit
Community portal
Recent changes
Upload file
Search
Search
Appearance
Donate
Create account
Log in
Personal tools
Donate
Create account
Log in
Pages for logged out editors
learn more
Contributions
Talk
Contents
move to sidebar
hide
(Top)
1
Background
2
Development and release history
Toggle Development and release history subsection
2.1
DeepSeek LLM
2.2
V2
2.3
V3
2.4
R1
3
Assessment and reactions
4
Concerns
Toggle Concerns subsection
4.1
Censorship
4.2
Security and privacy
5
See also
6
Notes
7
References
8
External links
Toggle the table of contents
DeepSeek
55 languages
Afrikaans
العربية
Aragonés
অসমীয়া
Azərbaycanca
বাংলা
Български
Català
Čeština
Dansk
الدارجة
Deutsch
Ελληνικά
Español
Esperanto
Euskara
فارسی
Français
Frysk
Fulfulde
Gaeilge
Galego
한국어
Bahasa Indonesia
Italiano
עברית
Kiswahili
M

Types of prompts:
1. System prompt: tells the model what task to perform and what tone to use.
2. User prompt: the conversation starter that the model should reply to.

In [7]:
# Define our System prompt:

system_prompt = "You are an assistant that analyzes the content of a website \
    and provides a short summary, ignoring text that might be navigation related \
    or any other irrelevant advertisements. Respond in markdown."

In [8]:
# Build the user prompt from the website's title and text

def user_prompt_for(website):
    if not website.text:  # Handle empty content
        return "The webpage could not be parsed. No content available."
    
    user_prompt = f"You are looking at a website titled {website.title}"
    user_prompt += "\n The contents of this website is as follows: \
        please provide a short summary of this website in markdown. \
        If it includes news or announcements, then summarize these too. \n\n"
    
    user_prompt += website.text
    return user_prompt

In [9]:
system_prompt

'You are an assistant that analyzes the content of a website     and provides a short summary, ignoring text that might be navigation related     or any other irrelevant advirtisements. Respond in markdown.'

In [10]:
print(user_prompt_for(web))

You are looking at a website titled DeepSeek - Wikipedia
 The contents of this website is as follows:         please provide a short summary of this website in markdown.         If it includes news or announcements, then summarize these too. 

Jump to content
Main menu
Main menu
move to sidebar
hide
Navigation
Main page
Contents
Current events
Random article
About Wikipedia
Contact us
Contribute
Help
Learn to edit
Community portal
Recent changes
Upload file
Search
Search
Appearance
Donate
Create account
Log in
Personal tools
Donate
Create account
Log in
Pages for logged out editors
learn more
Contributions
Talk
Contents
move to sidebar
hide
(Top)
1
Background
2
Development and release history
Toggle Development and release history subsection
2.1
DeepSeek LLM
2.2
V2
2.3
V3
2.4
R1
3
Assessment and reactions
4
Concerns
Toggle Concerns subsection
4.1
Censorship
4.2
Security and privacy
5
See also
6
Notes
7
References
8
External links
Toggle the table of contents
DeepSeek
55 languages
Afrik

### Messages object:
- It is a list of dictionaries, each with a `role` and a `content` key: the `system` entry carries the system prompt and the `user` entry carries the user message.
- Other roles are available and can be used as the requirements demand.
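
As a minimal sketch of the role/content structure described above, the snippet below builds a message list by hand. The helper name `build_messages` and the sample texts are hypothetical and not part of this notebook; the `assistant` role is the one the Chat Completions API uses for earlier model replies in a multi-turn conversation.

```python
# Illustrative helper (hypothetical name): assemble a chat message list.
def build_messages(system_text, user_text, history=None):
    """Return a list of {"role", "content"} dicts in conversation order."""
    messages = [{"role": "system", "content": system_text}]
    # Optional earlier turns, e.g. prior user questions and assistant replies.
    for role, content in (history or []):
        messages.append({"role": role, "content": content})
    # The newest user message always goes last.
    messages.append({"role": "user", "content": user_text})
    return messages


example = build_messages(
    "You are a helpful assistant.",
    "And what is 3 + 3?",
    history=[("user", "What is 2 + 2?"), ("assistant", "2 + 2 is 4.")],
)
for message in example:
    print(message["role"], "->", message["content"])
```

The order of the list matters: the model reads it as a transcript, so the system entry comes first and the message to be answered comes last.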

In [11]:
def messages_for(website):
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt_for(website)}
    ]

In [12]:
messages_for(web)

[{'role': 'system',
  'content': 'You are an assistant that analyzes the content of a website     and provides a short summary, ignoring text that might be navigation related     or any other irrelevant advirtisements. Respond in markdown.'},
 {'role': 'user',
  'content': 'You are looking at a website titled DeepSeek - Wikipedia\n The contents of this website is as follows:         please provide a short summary of this website in markdown.         If it includes news or announcements, then summarize these too. \n\nJump to content\nMain menu\nMain menu\nmove to sidebar\nhide\nNavigation\nMain page\nContents\nCurrent events\nRandom article\nAbout Wikipedia\nContact us\nContribute\nHelp\nLearn to edit\nCommunity portal\nRecent changes\nUpload file\nSearch\nSearch\nAppearance\nDonate\nCreate account\nLog in\nPersonal tools\nDonate\nCreate account\nLog in\nPages for logged out editors\nlearn more\nContributions\nTalk\nContents\nmove to sidebar\nhide\n(Top)\n1\nBackground\n2\nDevelopment a

Time to bring it together: let's call the OpenAI Chat Completions API.

In [12]:
def summarize(url):
    website = Website(url)
    response = openai.chat.completions.create(
        model = "gpt-4o-mini",
        messages= messages_for(website)
    )
    return response.choices[0].message.content

In [13]:
# def summarize(url):
#     website = Website(url)
    
#     if not website or not hasattr(website, "content") or website.content is None:
#         return "Failed to retrieve website content."

#     messages = messages_for(website)
    
#     if not messages or any(msg.get("content") is None for msg in messages):
#         return "Invalid messages format, content is missing."

#     response = openai.chat.completions.create(
#         model="gpt-4o-mini",
#         messages=messages
#     )
#     return response


In [15]:
summarize("https://en.wikipedia.org/wiki/DeepSeek")
# summarize("https://edwarddonner.com")


'# DeepSeek - Overview\n\nDeepSeek is a Chinese artificial intelligence company established in May 2023, based in Hangzhou, Zhejiang. It focuses on developing open-source large language models (LLMs) and is funded by the hedge fund High-Flyer. The founder, Liang Wenfeng, also serves as the CEO. The company aims to provide competitive AI models at a significantly lower cost compared to its rivals.\n\n## Key Models and Releases:\n- **DeepSeek-R1**: The flagship model comparable to OpenAI\'s GPT-4, designed to be cost-effective and efficient in resource usage.\n- **DeepSeek-V2 and V3**: Subsequent models that have improved upon the previous versions, with advancements in training methods and model architectures.\n- **DeepSeek AI Assistant**: A chatbot released in January 2025 that quickly became popular, surpassing ChatGPT in downloads in the iOS App Store by the end of January 2025.\n\n## Achievements and Market Impact:\n- By January 2025, DeepSeek-R1 had become the most downloaded free 

We can use Jupyter's Markdown display to render the summary nicely.

In [16]:
def display_summary(url):
    summary = summarize(url)
    display(Markdown(summary))
    

In [17]:
display_summary("https://en.wikipedia.org/wiki/DeepSeek")

# DeepSeek - Overview

**DeepSeek** is a Chinese artificial intelligence company founded in May 2023 by Liang Wenfeng, based in Hangzhou, Zhejiang. The company specializes in developing open-source large language models (LLMs). DeepSeek is funded by the hedge fund High-Flyer and has reported under 200 employees. 

## Key Features
- **Models**: The DeepSeek-R1 model competes with notable LLMs like OpenAI's GPT-4, offering similar performance at significantly reduced costs—approximately $6 million compared to $100 million for GPT-4.
- **Open-source**: All models and their training methodologies are open-source, promoting transparency and facilitating broader development and research.
- **Innovations**: DeepSeek has released several models, including DeepSeek-V2 and V3, which are engineered for enhanced performance and economics compared to competitors.

## Recent Developments
1. **Chatbot Release**: On January 10, 2025, DeepSeek launched a free chatbot application for iOS and Android based on the DeepSeek-R1 model, which quickly became the top free app on the iOS App Store in the United States.
2. **Market Impact**: Following the success of DeepSeek's app, shares of major tech companies like Nvidia and Microsoft saw significant declines, contributing to a broader sell-off in technology stocks which saw a loss of about $593 billion in market capitalization.
3. **Cybersecurity Concerns**: A cyberattack on January 27, 2025, led DeepSeek to limit new user registrations and caused service slowdowns.

## Concerns
- **Censorship**: The company has faced scrutiny over political sensitivities, with reports indicating that its AI models employ censorship on certain topics considered politically sensitive in China.
- **Privacy**: There are concerns about data privacy, with implications regarding how personal data is handled and potential misuse of AI technology for government surveillance or influence operations.

## Conclusion
DeepSeek’s rapid rise and innovative AI technologies mark significant developments in the AI sector, particularly against the backdrop of global tensions and competition in technology. Its impact has been likened to a "Sputnik moment" for American AI, indicating a shift in the balance of AI power towards China.

There are some websites that cannot be scraped with BeautifulSoup because they use JavaScript to render the page. One solution for scraping such pages is Selenium.
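
A rough sketch of how a Selenium-based fetch might look is given below. It is not part of this notebook: the function name `fetch_rendered_html` and the fixed `wait_seconds` pause are assumptions made for illustration, and it presumes that the `selenium` package and a Chrome browser are installed locally.

```python
import time


def fetch_rendered_html(url, wait_seconds=2.0):
    """Load a page in headless Chrome and return the HTML after JavaScript has run."""
    # Imported lazily so the rest of the notebook works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")  # run Chrome without a visible window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Crude fixed pause to let scripts finish; an assumption, tune per site.
        time.sleep(wait_seconds)
        return driver.page_source
    finally:
        # Always close the browser, even if loading the page fails.
        driver.quit()
```

The returned HTML string could then be passed to `BeautifulSoup(html, 'html.parser')` in place of `response.content`, leaving the rest of the `Website` class unchanged.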

WebPage Summarizer with Ollama

In [13]:
import ollama

In [14]:
OLLAMA_API = "http://localhost:11434/api/chat"
HEADERS = {"Content-Type": "application/json"}
MODEL = "llama3.2"

In [15]:
def summarize(url):
    website = Website(url)
    response = ollama.chat(
        model = MODEL,
        messages= messages_for(website)
    )
    return response['message']['content']
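
The `OLLAMA_API` and `HEADERS` constants are not used by `ollama.chat`. As a hedged sketch, they could instead be used to call the local Ollama HTTP endpoint directly. The helper names `build_payload` and `chat_via_http` are hypothetical, the constants are repeated so the block is self-contained, and the standard library's `urllib` is used here rather than `requests` purely to avoid extra dependencies. It assumes an Ollama server is running on the default port.

```python
import json
import urllib.request

OLLAMA_API = "http://localhost:11434/api/chat"
HEADERS = {"Content-Type": "application/json"}
MODEL = "llama3.2"


def build_payload(messages, model=MODEL):
    """Build the JSON body for Ollama's /api/chat endpoint."""
    # stream=False asks for one complete JSON reply instead of a chunked stream.
    return {"model": model, "messages": messages, "stream": False}


def chat_via_http(messages, model=MODEL, timeout=120):
    """POST the messages to a locally running Ollama server and return the reply text."""
    body = json.dumps(build_payload(messages, model)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_API, data=body, headers=HEADERS, method="POST"
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.loads(response.read().decode("utf-8"))
    # The non-streaming reply carries the generated text under message.content.
    return reply["message"]["content"]
```

With the server running, `chat_via_http(messages_for(website))` would play the same role as the `ollama.chat` call above.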

In [16]:
summarize("https://en.wikipedia.org/wiki/DeepSeek")

"Here's a summary of the article about DeepSeek:\n\n**What is DeepSeek?**\n\nDeepSeek is an open-source, generative AI chatbot developed in China. It was launched in 2023 and has gained significant attention due to its impressive language generation capabilities.\n\n**Key Features**\n\nDeepSeek is built on top of transformer-based architectures and uses a combination of self-supervised learning and reinforcement learning from human feedback to improve its performance.\n\n**Capabilities**\n\nDeepSeek can generate human-like text, answer questions, and even engage in conversations. It has been trained on a large dataset of texts and has demonstrated impressive capabilities in tasks such as:\n\n* Text summarization\n* Question answering\n* Sentiment analysis\n* Dialogue generation\n\n**Concerns and Controversies**\n\nThere have been concerns raised about the potential risks of using DeepSeek, including:\n\n* **Data privacy**: Some experts have expressed concerns that DeepSeek may be colle

In [17]:
def display_summary(url):
    summary = summarize(url)
    display(Markdown(summary))
    

In [18]:
display_summary("https://en.wikipedia.org/wiki/DeepSeek")

"DeepSeek" is a Chinese artificial intelligence (AI) company that has gained significant attention for its generative AI technology. Here's an overview:

**Founding and Background**

DeepSeek was founded in 2023, but the exact date of founding is not publicly available. The company is based in Hangzhou, China.

**Technology and Capabilities**

DeepSeek's technology focuses on generating text, images, and videos using deep learning techniques. Their platform can produce high-quality content, including articles, blog posts, social media posts, product descriptions, and more.

**Key Features**

Some key features of DeepSeek's AI technology include:

1. **Generative Adversarial Network (GAN)**: A technique used to generate realistic images, videos, or text.
2. **Large Language Model**: Capable of processing vast amounts of text data and generating coherent, context-specific responses.

**Applications and Use Cases**

DeepSeek's technology has various applications across industries, such as:

1. **Content Generation**: Create high-quality content for websites, blogs, social media platforms, and more.
2. **Chatbots and Virtual Assistants**: Develop conversational AI systems that can engage with users in a natural, human-like manner.
3. **Product Description and Marketing**: Generate compelling product descriptions, marketing materials, and sales copy.

**Concerns and Controversies**

As with any emerging AI technology, DeepSeek's platform has raised concerns about:

1. **Data Security**: The potential for user data to be compromised or misused by the company.
2. **Content Quality**: The possibility of generated content being low-quality, biased, or factually inaccurate.

**Reception and Impact**

DeepSeek's technology has received attention from various stakeholders, including:

1. **Media Outlets**: Featured in prominent publications, such as Reuters, CNN, and Forbes.
2. **Industry Experts**: Praised for its potential to transform content creation and customer engagement.
3. **Concerned Groups**: Cited concerns about data security, content quality, and the implications of large language models on society.

Overall, DeepSeek's emergence marks an exciting development in the field of generative AI, with potential applications across various industries. However, it also highlights the need for careful consideration of the technologies' limitations and risks.