 # Scraping a Wikipedia Webpage and Summarization

Step-01: Scraping the Wikipedia page (https://en.wikipedia.org/wiki/Alexander_the_Great) using beautiful soup by retaining the original headings and sub-headings.

Step-02: Using the t5-base model for the summarization from the retrieved data from the scrapping.

### Importing essential required libraries

In [None]:
import requests
from bs4 import BeautifulSoup
from transformers import pipeline

### Function to fetch webpage content

In [None]:
def fetch_webpage(url):
    response = requests.get(url)
    if response.status_code == 200:
        return response.text
    else:
        raise Exception(f"Failed to fetch the webpage: Status code {response.status_code}")

### Function to extract headings and text content using BeautifulSoup

In [None]:
def extract_headings_and_text(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    content = {}
    current_section = 'Introduction'  # Default section for content before first heading

    for element in soup.find_all(['h1', 'h2', 'h3', 'h4', 'p']):
        if element.name == 'h1':
            content['Title'] = element.get_text().strip()
        elif element.name in ['h2', 'h3', 'h4']:
            current_section = element.get_text().strip()
            content[current_section] = []
        elif element.name == 'p':
            if current_section not in content:
                content[current_section] = []
            content[current_section].append(element.get_text().strip())

    return content

### Function to summarize content using transformers pipeline


In [None]:
def summarize_content(content):
    summarizer = pipeline("summarization", model="t5-base", device=0)  # Adjust 'device' as needed

    summarized_content = {}

    for section, texts in content.items():
        summarized_content[section] = []

        if isinstance(texts, list):
            for text in texts:
                summary = summarizer(text, max_length=150, min_length=30, do_sample=False)[0]['summary_text'].strip()
                summarized_content[section].append(summary)
        else:
            summary = summarizer(texts, max_length=150, min_length=30, do_sample=False)[0]['summary_text'].strip()
            summarized_content[section].append(summary)

    return summarized_content

### Function call and printing the summarized content

In [1]:
def main():
    url = "https://en.wikipedia.org/wiki/Alexander_the_Great"
    webpage_content = fetch_webpage(url)
    content = extract_headings_and_text(webpage_content)
    summarized_content = summarize_content(content)

    for section, summaries in summarized_content.items():
        print(f"### {section}")
        if isinstance(summaries, list):
            for summary in summaries:
                print(summary)
        else:
            print(summaries)
        print()

if __name__ == "__main__":
    main()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/892M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

Your max_length is set to 150, but your input_length is only 3. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=1)
Your max_length is set to 150, but your input_length is only 136. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=68)
Your max_length is set to 150, but your input_length is only 6. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=3)
Your max_length is set to 150, but your input_length is only 139. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=69)
You se

### Contents
. - (EL) - -________________________ -______________________ __________ __
Alexander the Great was a king of the ancient Greek kingdom of Macedon . he succeeded his father Philip II to the throne in 336 BC . by the age of 30, he had created one of the largest empires in history .
until the age of 16, Alexander was tutored by Aristotle . he campaigned in the balkans and reasserted control over Thrace and parts of Illyria . in 335 BC, he led the League of Corinth, assuming leadership over all Greeks .
in 334 BC, he invaded the Achaemenid Persian Empire and began a series of campaigns . he overthrew Darius III and conquered the Macedonian Empire in its entirety . in the years following his death, civil wars broke out across the macedonian empire .
his death marks the start of the Hellenistic period . he founded more than twenty cities, with the most prominent being the city of Alexandria . the Greek language became the lingua franca of the region .
his military achievements a