🌟 Exercise 1 : Parsing HTML With BeautifulSoup

Instructions

Objective: Use urlopen() to fetch the HTML content of a webpage and then parse it using BeautifulSoup.



<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Sports World</title>
    <style>
        body { font-family: Arial, sans-serif; }
        header, nav, section, article, footer { margin: 20px; padding: 15px; }
        nav { background-color: #333; }
        nav a { color: white; padding: 14px 20px; text-decoration: none; display: inline-block; }
        nav a:hover { background-color: #ddd; color: black; }
        .video { text-align: center; margin: 20px 0; }
    </style>
</head>
<body>

    <header>
        <h1>Welcome to Sports World</h1>
        <p>Your one-stop destination for the latest sports news and videos.</p>
    </header>

    <nav>
        <a href="#football">Football</a>
        <a href="#basketball">Basketball</a>
        <a href="#tennis">Tennis</a>
    </nav>

    <section id="football">
        <h2>Football</h2>
        <article>
            <h3>Latest Football News</h3>
            <p>Read about the latest football matches and player news.</p>
            <div class="video">
                <iframe width="560" height="315" src="https://www.youtube.com/embed/football-video-id" frameborder="0" allowfullscreen>
                </iframe>
            </div>
        </article>
    </section>

    <section id="basketball">
        <h2>Basketball</h2>
        <article>
            <h3>NBA Highlights</h3>
            <p>Watch highlights from the latest NBA games.</p>
            <div class="video">
                <iframe width="560" height="315" src="https://www.youtube.com/embed/basketball-video-id" frameborder="0" allowfullscreen>
                </iframe>
            </div>
        </article>
    </section>

    <section id="tennis">
        <h2>Tennis</h2>
        <article>
            <h3>Grand Slam Updates</h3>
            <p>Get the latest updates from the world of Grand Slam tennis.</p>
            <div class="video">
                <iframe width="560" height="315" src="https://www.youtube.com/embed/tennis-video-id" frameborder="0" allowfullscreen></iframe>
            </div>
        </article>
    </section>

    <footer>
        <form action="mailto:contact@sportsworld.com" method="post" enctype="text/plain">
            <label for="name">Name:</label><br>
            <input type="text" id="name" name="name"><br>
            <label for="email">Email:</label><br>
            <input type="email" id="email" name="email"><br>
            <label for="message">Message:</label><br>
            <textarea id="message" name="message" rows="4" cols="50"></textarea><br><br>
            <input type="submit" value="Send">
        </form>
    </footer>

</body>
</html>


Read the HTML content of the page.
Create a BeautifulSoup object to parse this HTML.
Find the title of the webpage (the content inside the <title> tag).
Extract all paragraphs (<p> tags) from the page.
Retrieve all links (URLs in <a href=""> tags) on the page.


In [1]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import tempfile
import os

html_content = '''<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Sports World</title>
    <style>
        body { font-family: Arial, sans-serif; }
        header, nav, section, article, footer { margin: 20px; padding: 15px; }
        nav { background-color: #333; }
        nav a { color: white; padding: 14px 20px; text-decoration: none; display: inline-block; }
        nav a:hover { background-color: #ddd; color: black; }
        .video { text-align: center; margin: 20px 0; }
    </style>
</head>
<body>

    <header>
        <h1>Welcome to Sports World</h1>
        <p>Your one-stop destination for the latest sports news and videos.</p>
    </header>

    <nav>
        <a href="#football">Football</a>
        <a href="#basketball">Basketball</a>
        <a href="#tennis">Tennis</a>
    </nav>

    <section id="football">
        <h2>Football</h2>
        <article>
            <h3>Latest Football News</h3>
            <p>Read about the latest football matches and player news.</p>
            <div class="video">
                <iframe width="560" height="315" src="https://www.youtube.com/embed/football-video-id" frameborder="0" allowfullscreen>
                </iframe>
            </div>
        </article>
    </section>

    <section id="basketball">
        <h2>Basketball</h2>
        <article>
            <h3>NBA Highlights</h3>
            <p>Watch highlights from the latest NBA games.</p>
            <div class="video">
                <iframe width="560" height="315" src="https://www.youtube.com/embed/basketball-video-id" frameborder="0" allowfullscreen>
                </iframe>
            </div>
        </article>
    </section>

    <section id="tennis">
        <h2>Tennis</h2>
        <article>
            <h3>Grand Slam Updates</h3>
            <p>Get the latest updates from the world of Grand Slam tennis.</p>
            <div class="video">
                <iframe width="560" height="315" src="https://www.youtube.com/embed/tennis-video-id" frameborder="0" allowfullscreen></iframe>
            </div>
        </article>
    </section>

    <footer>
        <form action="mailto:contact@sportsworld.com" method="post" enctype="text/plain">
            <label for="name">Name:</label><br>
            <input type="text" id="name" name="name"><br>
            <label for="email">Email:</label><br>
            <input type="email" id="email" name="email"><br>
            <label for="message">Message:</label><br>
            <textarea id="message" name="message" rows="4" cols="50"></textarea><br><br>
            <input type="submit" value="Send">
        </form>
    </footer>

</body>
</html>
'''

with tempfile.NamedTemporaryFile(delete=False, suffix='.html') as tmp_file:
    tmp_file_name = tmp_file.name
    tmp_file.write(html_content.encode('utf-8'))

file_url = 'file://' + tmp_file_name
html = urlopen(file_url).read()

soup = BeautifulSoup(html, 'html.parser')

title = soup.title.string
print('Title of the webpage:', title)

paragraphs = soup.find_all('p')
print('\nParagraphs:')
for p in paragraphs:
    print('-', p.get_text())

links = soup.find_all('a')
print('\nLinks:')
for link in links:
    href = link.get('href')
    print('-', href)

os.unlink(tmp_file_name)

Title of the webpage: Sports World

Paragraphs:
- Your one-stop destination for the latest sports news and videos.
- Read about the latest football matches and player news.
- Watch highlights from the latest NBA games.
- Get the latest updates from the world of Grand Slam tennis.

Links:
- #football
- #basketball
- #tennis
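
The cell above builds the file URL by concatenating `'file://'` with the temporary path, which works for POSIX-style absolute paths. A possibly more portable variant, sketched here as an assumption rather than as part of the exercise, lets `pathlib.Path.as_uri()` build the URL. The `sample_html` string is a hypothetical stand-in for the full page.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlopen

# Hypothetical stand-in for the full Sports World page.
sample_html = "<html><head><title>Sports World</title></head><body></body></html>"

# Write the HTML to a temporary file so that urlopen() has something to fetch.
with tempfile.NamedTemporaryFile(delete=False, suffix=".html") as tmp_file:
    tmp_file.write(sample_html.encode("utf-8"))
    tmp_path = Path(tmp_file.name)

try:
    # as_uri() requires an absolute path; resolve() guarantees that.
    file_url = tmp_path.resolve().as_uri()
    with urlopen(file_url) as response:
        fetched = response.read().decode("utf-8")
finally:
    os.unlink(tmp_path)  # remove the temporary file even if the read fails

print(file_url.startswith("file:///"))
print(fetched == sample_html)
```

The `try`/`finally` block is a design choice: it deletes the temporary file even when fetching raises an exception.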


🌟 Exercise 2 : Scraping Robots.txt From Wikipedia

Instructions

Write a Python program to download and display the content of robots.txt for en.wikipedia.org.



In [2]:
import requests

def get_robots_txt(url):
    if not url.endswith('/'):
        url += '/'
    robots_url = url + 'robots.txt'
    # A timeout keeps the call from hanging indefinitely if the server does not respond.
    response = requests.get(robots_url, timeout=10)
    if response.status_code == 200:
        print(response.text)
    else:
        print(f"Could not retrieve robots.txt file. Status code: {response.status_code}")

if __name__ == "__main__":
    get_robots_txt('https://en.wikipedia.org')

# robots.txt for http://www.wikipedia.org/ and friends
#
# Please note: There are a lot of pages on this site, and there are
# some misbehaved spiders out there that go _way_ too fast. If you're
# irresponsible, your access to the site may be blocked.
#

# Observed spamming large amounts of https://en.wikipedia.org/?curid=NNNNNN
# and ignoring 429 ratelimit responses, claims to respect robots:
# http://mj12bot.com/
User-agent: MJ12bot
Disallow: /

# advertising-related bots:
User-agent: Mediapartners-Google*
Disallow: /

# Wikipedia work bots:
User-agent: IsraBot
Disallow:

User-agent: Orthogaffe
Disallow:

# Crawlers that are kind enough to obey, but which we'd rather not have
# unless they're feeding search engines.
User-agent: UbiCrawler
Disallow: /

User-agent: DOC
Disallow: /

User-agent: Zao
Disallow: /

# Some bots are known to be trouble, particularly those designed to copy
# entire sites. Please obey robots.txt.
User-agent: sitecheck.internetseer.com
Disallow: /

User-agent: 
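
Once the file is downloaded, its rules can also be evaluated programmatically. The sketch below feeds a few lines copied from the output above to the standard library's `urllib.robotparser.RobotFileParser`. The agent name `ExampleBot` is a made-up placeholder, and the excerpt leaves out the rest of Wikipedia's rules, so the results only reflect these lines.

```python
from urllib.robotparser import RobotFileParser

# A short excerpt of the rules printed above (not the full file).
robots_excerpt = """\
User-agent: MJ12bot
Disallow: /

User-agent: IsraBot
Disallow:
"""

parser = RobotFileParser()
parser.parse(robots_excerpt.splitlines())

page = "https://en.wikipedia.org/wiki/Main_Page"
for agent in ("MJ12bot", "IsraBot", "ExampleBot"):
    # ExampleBot is hypothetical; it matches no entry in the excerpt.
    print(agent, parser.can_fetch(agent, page))
```

In this excerpt, `Disallow: /` blocks the whole site for that agent, while an empty `Disallow:` allows everything.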

🌟 Exercise 3 : Extracting Headers From Wikipedia’s Main Page

Instructions

Write a Python program to extract and display all the header tags from en.wikipedia.org/wiki/Main_Page.



In [3]:
def extract_headers(url):
    # Send a descriptive User-Agent; some sites may reject the default one used by requests
    request_headers = {'User-Agent': 'Mozilla/5.0 (compatible; HeaderExtractor/1.0)'}

    response = requests.get(url, headers=request_headers)

    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')

        # Collect (tag name, text) pairs, grouped by level from h1 through h6
        page_headers = []
        for level in range(1, 7):
            for header in soup.find_all(f'h{level}'):
                page_headers.append((f'h{level}', header.text.strip()))

        for header_tag, header_text in page_headers:
            print(f'{header_tag}: {header_text}')
    else:
        print(f"Failed to retrieve the page. Status code: {response.status_code}")

if __name__ == "__main__":
    url = 'https://en.wikipedia.org/wiki/Main_Page'
    extract_headers(url)

h1: Main Page
h1: Welcome to Wikipedia
h2: From today's featured article
h2: Did you know ...
h2: In the news
h2: On this day
h2: Today's featured picture
h2: Other areas of Wikipedia
h2: Wikipedia's sister projects
h2: Wikipedia languages
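
The loop above groups headers by level, so every `h1` is printed before any `h2`. If document order matters, `find_all()` also accepts a list of tag names and returns matches in the order they appear. A minimal sketch on a made-up HTML snippet (`sample_html` is not Wikipedia's markup):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet used only to illustrate ordering.
sample_html = """
<h1>Main Page</h1>
<h2>In the news</h2>
<h3>Ongoing</h3>
<h2>On this day</h2>
"""

soup = BeautifulSoup(sample_html, "html.parser")
header_tags = [f"h{level}" for level in range(1, 7)]

# Passing a list matches any of the names, in document order.
found = [(tag.name, tag.get_text(strip=True)) for tag in soup.find_all(header_tags)]
for name, text in found:
    print(f"{name}: {text}")
```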


🌟 Exercise 4 : Checking For Page Title

Instructions

Write a Python program to check whether a page contains a title or not.



In [4]:
def check_page_title(url):
    try:
        headers = {'User-Agent': 'Mozilla/5.0 (compatible; TitleChecker/1.0)'}

        response = requests.get(url, headers=headers)
        response.raise_for_status()


        soup = BeautifulSoup(response.content, 'html.parser')

        title_tag = soup.find('title')

        if title_tag and title_tag.text.strip():
            print(f"The page at '{url}' contains a title: '{title_tag.text.strip()}'")
        else:
            print(f"The page at '{url}' does NOT contain a title.")
    except requests.exceptions.RequestException as e:
        print(f"An error occurred while fetching the page: {e}")

if __name__ == "__main__":
    url_to_check = 'https://github.com/'
    check_page_title(url_to_check)

The page at 'https://github.com/' contains a title: 'GitHub: Let’s build from here · GitHub'


🌟 Exercise 5 : Analyzing US-CERT Security Alerts

Instructions

Write a Python program to get the number of security alerts issued by US-CERT in the current year.
Source: https://www.cisa.gov/news-events/cybersecurity-advisories?f%5B0%5D=advisory_type%3A93



In [5]:
from datetime import datetime

def get_us_cert_alerts(url):
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (compatible; USCertAlerts/1.0)',
            'Accept-Language': 'en-US,en;q=0.9',
        }

        response = requests.get(url, headers=headers)
        response.raise_for_status()

        soup = BeautifulSoup(response.content, 'html.parser')

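        # These class names must match the live page markup; if they do not,
        # find_all() returns an empty list and the count below stays at 0.
        advisories = soup.find_all('div', class_='usa-card__container')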

        current_year = datetime.now().year

        alert_count = 0

        for advisory in advisories:
            date_element = advisory.find('span', class_='usa-card__date')
            if date_element:
                date_text = date_element.text.strip()
                try:
                    advisory_date = datetime.strptime(date_text, '%B %d, %Y')
                    if advisory_date.year == current_year:
                        alert_count += 1
                except ValueError:
                    continue

        print(f"Number of US-CERT security alerts issued in {current_year}: {alert_count}")

    except requests.exceptions.RequestException as e:
        print(f"An error occurred while fetching the page: {e}")

if __name__ == "__main__":
    url = 'https://www.cisa.gov/news-events/cybersecurity-advisories?f%5B0%5D=advisory_type%3A93'
    get_us_cert_alerts(url)

Number of US-CERT security alerts issued in 2024: 0
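
A count of 0 can mean either that no alerts were issued or that the CSS classes did not match anything on the page, and the program above cannot tell the two apart. One way to make this easier to check, sketched under the assumption that dates arrive as strings such as `January 5, 2024`, is to keep the counting step separate from the scraping step. The function `count_alerts_in_year` and the sample dates are hypothetical.

```python
from datetime import datetime

def count_alerts_in_year(date_texts, year, date_format="%B %d, %Y"):
    """Return (matching, unparsed): dates in `year` and strings that failed to parse."""
    matching = 0
    unparsed = 0
    for text in date_texts:
        try:
            parsed = datetime.strptime(text.strip(), date_format)
        except ValueError:
            unparsed += 1  # keep track instead of silently skipping
            continue
        if parsed.year == year:
            matching += 1
    return matching, unparsed

# Made-up sample input, not real advisory data.
sample_dates = ["January 5, 2024", "March 12, 2024", "December 30, 2023", "n/a"]
matching, unparsed = count_alerts_in_year(sample_dates, 2024)
print(f"matching={matching}, unparsed={unparsed}")
```

With real scraped data, an empty `date_texts` list or a high `unparsed` count suggests that the selectors or the date format should be re-checked against the current page.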


🌟 Exercise 6 : Scraping Movie Details

Instructions

Write a Python program to get the name, year, and a brief summary of 10 random movies.
Use this website: https://www.imdb.com/list/ls091294718/



In [7]:
import random

url = "https://www.imdb.com/list/ls091294718/"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
movies = soup.find_all('div', class_='lister-item mode-detail')
movie_data = []

for movie in movies:
    title = movie.h3.a.text.strip()
    year = movie.find('span', class_='lister-item-year').text.strip('()')
    summary = movie.find('p', class_='').text.strip()
    movie_data.append({'title': title, 'year': year, 'summary': summary})

random_movies = random.sample(movie_data, 10)

for idx, movie in enumerate(random_movies, 1):
    print(f"{idx}. {movie['title']} ({movie['year']})")
    print(f"Summary: {movie['summary']}\n")

ValueError: Sample larger than population or is negative

In [10]:
import requests
from bs4 import BeautifulSoup
import random

url = "https://www.imdb.com/list/ls091294718/"

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
# An empty result here means the request was blocked or the class names do not match the page.
movies = soup.find_all('div', class_='lister-item mode-detail')
movie_data = []

for movie in movies:
    title = movie.h3.a.text.strip()
    year = movie.find('span', class_='lister-item-year').text.strip('()')
    summary = movie.find('p', class_='').text.strip()
    movie_data.append({'title': title, 'year': year, 'summary': summary})

num_movies = len(movie_data)
if num_movies < 10:
    print(f"Total movies found: {num_movies}, showing all.")
    random_movies = movie_data
else:
    random_movies = random.sample(movie_data, 10)

for idx, movie in enumerate(random_movies, 1):
    print(f"{idx}. {movie['title']} ({movie['year']})")
    print(f"Summary: {movie['summary']}\n")

Total movies found: 0, showing all.
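
The guard above avoids the `ValueError` by branching on the list length. An equivalent and slightly shorter approach, shown here as a sketch with placeholder data (`pick_random` and the movie titles are hypothetical), caps the sample size with `min()`. When 0 movies are found, the cause probably lies upstream of the sampling step, so it can help to print `response.status_code` and `len(movies)` before sampling.

```python
import random

def pick_random(items, k=10):
    """Return up to k randomly chosen items without raising on short lists."""
    return random.sample(items, min(k, len(items)))

# Placeholder records standing in for scraped results.
movie_data = [{"title": f"Movie {i}", "year": "2000"} for i in range(3)]

print(len(pick_random(movie_data)))        # fewer than 10 available
print(len(pick_random([])))                # empty input returns an empty list
print(len(pick_random(list(range(50)))))   # plenty available
```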
