# Web Content Summarization using Llama3.2-3B (Local Ollama Deployment)

## Project Overview
This project automates the retrieval and summarization of web content using AI. The system fetches relevant data from:
1. **Wikipedia**: the article whose title matches the topic.
2. **Top search result**: the first webpage returned by a Google search for the topic.

The extracted content is summarized using **Llama3.2-3B**, running locally via **Ollama**.

## Methodology
1. **User Input**: The user provides a topic.
2. **Data Retrieval**:
   - Wikipedia content is fetched with **requests** and parsed with **BeautifulSoup**.
   - The top search result is identified with the **`googlesearch`** Python package and scraped using **Selenium** with Safari WebDriver.
3. **Content Processing**:
   - Unnecessary elements (scripts, styles, images, inputs) are removed.
   - The text from Wikipedia and the top search result is merged into a single input for the model.
4. **Summarization**:
   - A structured prompt guides the AI.
   - **Llama3.2-3B**, running locally on **Ollama**, generates a markdown-formatted summary.
5. **Output Display**: The summary is displayed in markdown format.

## Features
- **Locally hosted AI inference** with **Ollama**: the scraped text and prompts are sent only to the local model server, not to a hosted LLM service.
- **Multi-source content aggregation**, so the summary draws on more than a single page.
- **Automated web scraping** of Wikipedia and the top Google Search result.
- **AI-powered structured summaries** with markdown formatting.

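This system does not use any hosted or paid APIs: search results come from the `googlesearch` package, and the `OpenAI` client is pointed at Ollama's local OpenAI-compatible endpoint.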

## Importing modules

In [1]:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from IPython.display import Markdown, display
from openai import OpenAI
from googlesearch import search

## Model using Ollama

In [10]:
MODEL = "llama3.2:3b"
openai = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
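
Before running the cells below, it can help to confirm that the local Ollama server is up and that the model has been pulled. The following is a minimal sketch and not part of the original notebook: it assumes Ollama's `/api/tags` endpoint, which lists locally available models, and uses only the standard library. The helper names `model_is_listed` and `ollama_has_model` are hypothetical, and the sample payload is hand-written for illustration.

```python
import json
from urllib.error import URLError
from urllib.request import urlopen


def model_is_listed(tags_payload, model):
    """Return True if `model` appears in a parsed /api/tags response."""
    names = [entry.get("name", "") for entry in tags_payload.get("models", [])]
    return model in names


def ollama_has_model(model, base_url="http://localhost:11434", timeout=5):
    """Ask a local Ollama server whether `model` is available; False if unreachable."""
    try:
        with urlopen(f"{base_url}/api/tags", timeout=timeout) as response:
            payload = json.load(response)
    except (URLError, OSError, ValueError):
        # Server not running, connection refused, or an unexpected response body.
        return False
    return model_is_listed(payload, model)


# Illustrative payload shaped like the server's response (not real output).
sample = {"models": [{"name": "llama3.2:3b"}, {"name": "mistral:latest"}]}
print(model_is_listed(sample, "llama3.2:3b"))
```

In the notebook, `ollama_has_model(MODEL)` could be called once after this cell; if it returns `False`, running `ollama pull llama3.2:3b` in a terminal downloads the model.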

## Web Scraping

In [3]:
def get_top_site(query):
    """Retrieve the top search result URL for a given query."""
    return next(search(query, num_results=2), None)
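
Because Wikipedia is already fetched separately, the first search hit can turn out to be the same Wikipedia article, in which case the two sources duplicate each other. One possible way to avoid that, sketched here under the assumption that the search helper yields plain URL strings, is to skip Wikipedia hosts. The helper name `first_non_wikipedia` and the sample URLs are hypothetical.

```python
from urllib.parse import urlparse


def first_non_wikipedia(urls):
    """Return the first URL whose host is not a wikipedia.org domain, else None."""
    for url in urls:
        host = urlparse(url).hostname or ""
        if host == "wikipedia.org" or host.endswith(".wikipedia.org"):
            continue  # Already covered by the dedicated Wikipedia fetch.
        return url
    return None


# Illustrative candidates; real values would come from the search generator.
candidates = [
    "https://en.wikipedia.org/wiki/Large_language_model",
    "https://example.com/what-is-an-llm",
]
print(first_non_wikipedia(candidates))
```

In `get_top_site` this could be used as `first_non_wikipedia(search(query, num_results=5))`, where the value 5 is an arbitrary choice that leaves room for a few skipped results.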

In [4]:
from urllib.parse import quote


def extract_title_and_text(soup):
    """Return the page title and visible body text from parsed HTML."""
    title = soup.title.string if soup.title and soup.title.string else "No title found"
    if soup.body is None:
        return title, None
    # Drop elements that carry no readable content.
    for irrelevant in soup.body(["script", "style", "img", "input"]):
        irrelevant.decompose()
    return title, soup.body.get_text(separator="\n", strip=True)


def fetch_page_content(url):
    """Fetch and process webpage content using Safari WebDriver."""
    driver = webdriver.Safari()
    try:
        driver.get(url)
        html = driver.page_source
    finally:
        # Always close the browser, even if the page fails to load.
        driver.quit()
    return extract_title_and_text(BeautifulSoup(html, "html.parser"))


def fetch_wikipedia_content(topic):
    """Fetch Wikipedia page content for a given topic."""
    # Percent-encode the title so characters such as "?" or "#" stay in the path.
    url = f"https://en.wikipedia.org/wiki/{quote(topic.replace(' ', '_'))}"
    try:
        response = requests.get(url, timeout=10)
    except requests.RequestException:
        return None, None
    if response.status_code != 200:
        return None, None
    return extract_title_and_text(BeautifulSoup(response.content, "html.parser"))

In [5]:
system_prompt = (
    "You are an educational assistant tasked with helping users understand topics "
    "by providing structured and detailed summaries of web pages. Ignore navigation-related "
    "text and provide answers in markdown format. The response should include an introduction, "
    "key points, and a conclusion to make the summary more informative. "
    "Write the entire response in English."
)

def user_prompt_for(title, text):
    """Construct a user prompt to generate structured web page summaries."""
    user_prompt = f"""
    You are reading a web page titled **{title}**.
    Below is the combined content extracted from a Wikipedia page and a top search result:

    ---
    {text}
    ---

    Please summarize this page in markdown format with the following structure:
    
    # {title}
    
    ## Introduction
    Provide a brief introduction to the topic.
    
    ## Key Points
    Highlight the most important details, covering significant aspects and notable events or concepts.
    
    ## Conclusion
    Summarize the topic concisely and mention its importance or relevance.
    """
    return user_prompt
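
The triple-quoted f-string above keeps the four-space indentation of the function body, so every template line reaches the model indented. Markdown renderers treat such lines as code blocks; whether this changes the model's output is not something the notebook measures. A possible variant, shown as a sketch with a condensed template, dedents the template before the page text is inserted. The names `PROMPT_TEMPLATE` and `build_user_prompt` are hypothetical.

```python
from textwrap import dedent

# Dedent the template once, before any page text is inserted, so that
# unindented lines inside the scraped text cannot defeat the dedent.
PROMPT_TEMPLATE = dedent("""\
    You are reading content collected for the topic **{title}**.
    Below is the combined text from a Wikipedia page and a top search result:

    ---
    {text}
    ---

    Summarize it in markdown with this structure:

    # {title}

    ## Introduction
    ## Key Points
    ## Conclusion
    """)


def build_user_prompt(title, text):
    """Fill the dedented template with the page title and text."""
    return PROMPT_TEMPLATE.format(title=title, text=text)


prompt = build_user_prompt("Summary for LLM", "First line.\nSecond line.")
print(prompt)
```

Formatting after dedenting matters here: calling `dedent` on the already-filled string would do nothing, because the scraped text contains lines with no leading whitespace.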

In [6]:
def messages_for(title, text):
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt_for(title, text)}
    ]
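
The messages above carry the full scraped text, and a long Wikipedia article combined with a second page may exceed the context window that the local model is configured with. The sketch below caps the text length before it is placed in the prompt. It is an assumption, not something the notebook does: the helper name `truncate_text`, the 12000-character default, and the marker string are all arbitrary, and a character count is only a rough proxy for tokens.

```python
def truncate_text(text, max_chars=12000, marker="\n[... truncated ...]"):
    """Cut `text` to at most `max_chars` characters, preferring a line boundary."""
    if len(text) <= max_chars:
        return text
    budget = max_chars - len(marker)
    if budget <= 0:
        return text[:max_chars]
    # Back up to the last newline inside the budget so no line is cut in half.
    cut = text.rfind("\n", 0, budget)
    if cut <= 0:
        cut = budget
    return text[:cut] + marker


long_text = "line\n" * 10000
shortened = truncate_text(long_text, max_chars=1000)
print(len(long_text), len(shortened))
```

A call such as `user_prompt_for(title, truncate_text(text))` inside `messages_for` would apply the cap; a suitable limit depends on how the Ollama model is configured.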

## Summarizing

In [7]:
def summarize(topic):
    """Fetch Wikipedia and top site content, combine them, and generate a structured summary."""
    wiki_title, wiki_text = fetch_wikipedia_content(topic)
    site_url = get_top_site(topic)
    site_title, site_text = fetch_page_content(site_url) if site_url else (None, None)
    
    if not wiki_text and not site_text:
        print("No suitable content found for this topic.")
        return
    
    combined_text = "\n\n".join(filter(None, [wiki_text, site_text]))
    final_title = f"Summary for {topic}"
    
    response = openai.chat.completions.create(
        model=MODEL,
        messages=messages_for(final_title, combined_text)
    )
    return response.choices[0].message.content
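
The merge step in `summarize` relies on `filter(None, ...)` to drop a source that failed to load, so a missing Wikipedia page or a missing search result does not leave stray separators in the prompt. The snippet below isolates that one line to show its behavior; `merge_sources` is a hypothetical wrapper introduced only for illustration.

```python
def merge_sources(*texts):
    """Join the non-empty texts with a blank line between them."""
    return "\n\n".join(filter(None, texts))


print(repr(merge_sources("wiki text", "site text")))  # 'wiki text\n\nsite text'
print(repr(merge_sources(None, "site text")))         # 'site text'
print(repr(merge_sources(None, None)))                # ''
```

The last case returns an empty string, which is why `summarize` checks for the absence of both sources before building the prompt.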

In [8]:
def display_summary(topic):
    """Summarize the topic and render the result as Markdown."""
    summary = summarize(topic)
    if summary:
        display(Markdown(summary))

In [12]:
topic = input('Enter a topic to search and summarize: ').strip()
display_summary(topic)

Enter a topic to search and summarize:  LLM


# Summary for LLM

## Introduction

Large language models (LLMs) are a type of artificial intelligence (AI) designed to process and understand human language. They have become increasingly popular in recent years, revolutionizing various fields such as natural language processing, machine learning, and artificial intelligence.

## Key Points

### Notable Developments and Advancements

*   The first LLM was developed in 2018 by Google's BERT (Bidirectional Encoder Representations from Transformers) team.
*   Since then, numerous variants of BERT have been created, including RoBERTa, DistilBERT, and ALBERT.
*   Other notable advancements include the introduction of transformer-based models like XLNet, T5, and GPT-1 (also known as GPT).
*   Some notable applications of LLMs include:
    *   Chatbots and conversational AI
    *   Human-computer interaction
    *   Sentiment analysis and opinion mining
    *   Natural language question answering

### Technology and Architecture

*   Most modern LLMs are based on transformer architectures, which have shown to be effective for sequential tasks.
*   These models often use self-supervised learning approaches to fine-tune their performance.

    LLMs can be classified into different types based on:
    *   **Training data**:
        +   **Supervised**: They receive annotations and labels during training.
        +   **Unsupervised**: They do not have annotated labels, relying solely on internal mechanisms.
        +   **Self-supervised**: Training data is provided but does not contain direct labels — model learns from the environment using external information.

    *   **Evaluation criteria**:
        +   **Perplexity** 
        +   **Arousal**
    *   **Interpretability models**:
        +   Attention mechanisms
        +   VQ-VAE

### Applications and Impact

*   Large language models have shown potential in applications such as customer service chatbots, content generation, natural language processing (NLP), question answering, sentiment analysis, machine learning tasks.

    The impact of large language models can be both positive and negative:

    **Positive impact**: They can help with:
        - Automating various customer support tasks
          - Speeding up the process of generating new texts
          - Improving the performance in Natural Language Processing (NLP) fields
    Negative effects include:
    - Job displacement
      - The risk of loss in jobs that previously relied on human interaction.

## Conclusion

Large language models are transforming various industries with their capabilities and efficiency. They have opened up new avenues for creativity, productivity, and communication but also raised concerns about the potential impact on humans' roles and social behavior when these powerful tools are available throughout society.

In [13]:
topic = input('Enter a topic to search and summarize: ').strip()
display_summary(topic)

Enter a topic to search and summarize:  NLP


# Summary for NLP

## Introduction
Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) that deals with the interaction between computers and humans in natural language. It aims to enable computers to understand, interpret, and generate human-like language.

## Key Points
* **Key Concepts:
	+ Tokenization: breaking down text into individual words or tokens.
	+ Part-of-speech tagging: identifying the grammatical category of each word (e.g., noun, verb, adjective).
	+ Named entity recognition (NER): identifying specific entities such as names, locations, and organizations.
	+ Sentiment analysis: determining the emotional tone or sentiment behind a piece of text.
* ** Techniques:
	+ Machine learning algorithms for NLP tasks.
	+ Deep learning architectures like Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs).
	+ Natural Language Generation (NLG): generating human-like text from structured data.
* **Applications:
	+ Chatbots and voice assistants.
	+ Sentiment analysis for social media monitoring.
	+ Machine translation and language understanding.

## Conclusion
NLP has become a crucial application of AI, enabling computers to understand and interact with humans in natural language. Its applications are vast, ranging from everyday interactions like chatbots to more complex tasks such as sentiment analysis and machine translation. As NLP continues to evolve, we can expect even more sophisticated systems that seamlessly integrate with human communication.

In [15]:
topic = input('Enter a topic to search and summarize: ').strip()
display_summary(topic)

Enter a topic to search and summarize:  Transformer attention


# Summary for Transformer Attention

## Introduction
Transformers are a family of neural network models that utilize self-attention mechanisms to process sequential data. The transformer architecture was introduced in 2017 by Vaswani et al. in a research paper titled "Attention is All You Need" and has since become one of the most popular models in deep learning.

## Key Points

* **Self-Attention Mechanism**: Transformers use self-attention mechanisms to weigh the importance of different input tokens, allowing them to capture long-range dependencies and relationships between different elements.
* **Parallel Processing**: Unlike RNNs and LSTMs, which process data sequentially, Transformers process entire sequences simultaneously, making them more efficient for large amounts of data.
* **Multi-Head Attention**: The transformer model uses multi-head attention, which allows the model to attend to multiple representations of the input data simultaneously.
* **Feed-Forward Neural Networks (FFNNs)**: FFNNs are used in Transformers to transform the output of the self-attention mechanism into a more complex representation that can be processed by the neural network.

## Conclusion
Transformers have revolutionized the field of natural language processing and computer vision, offering state-of-the-art performance on a wide range of tasks. Their ability to capture long-range dependencies and relationships between input tokens has made them one of the most popular models in deep learning, with applications in image and speech recognition, machine translation, and text generation. As the field continues to evolve, Transformers will likely remain a fundamental component of many computer vision and NLP applications.