🌟 Exercise 1 : Parsing HTML With BeautifulSoup

Objective: Use urlopen() to fetch the HTML content of a webpage and then parse it using BeautifulSoup.

Read the HTML content of the page.
Create a BeautifulSoup object to parse this HTML.
Find the title of the webpage (the content inside the <title> tag).
Extract all paragraphs (<p> tags) from the page.
Retrieve all links (URLs in <a href=""> tags) on the page.

In [4]:
from bs4 import BeautifulSoup
# Read the local HTML file
file_path = '/Users/ilyagelfgat/Documents/DI_Bootcamp/Week8/Day1/Exercise_XP/task.html'
with open(file_path, 'r', encoding='utf-8') as file:
    html_ocean_ws = file.read()

# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(html_ocean_ws, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <title>
   Sports World
  </title>
  <style>
   body { font-family: Arial, sans-serif; }
        header, nav, section, article, footer { margin: 20px; padding: 15px; }
        nav { background-color: #333; }
        nav a { color: white; padding: 14px 20px; text-decoration: none; display: inline-block; }
        nav a:hover { background-color: #ddd; color: black; }
        .video { text-align: center; margin: 20px 0; }
  </style>
 </head>
 <body>
  <header>
   <h1>
    Welcome to Sports World
   </h1>
   <p>
    Your one-stop destination for the latest sports news and videos.
   </p>
  </header>
  <nav>
   <a href="#football">
    Football
   </a>
   <a href="#basketball">
    Basketball
   </a>
   <a href="#tennis">
    Tennis
   </a>
  </nav>
  <section id="football">
   <h2>
    Football
   </h2>
   <article>
    <h3>
     Latest Football New

In [5]:
# Find the title of the webpage (the content inside the <title> tag).
soup.title.get_text()

'Sports World'

In [6]:
# Extract all paragraphs (<p> tags) from the page.
soup.find_all("p")

[<p>Your one-stop destination for the latest sports news and videos.</p>,
 <p>Read about the latest football matches and player news.</p>,
 <p>Watch highlights from the latest NBA games.</p>,
 <p>Get the latest updates from the world of Grand Slam tennis.</p>]

In [10]:
# Retrieve all links (URLs in <a href=""> tags) on the page.
soup.find_all("a", href=True)

[<a href="#football">Football</a>,
 <a href="#basketball">Basketball</a>,
 <a href="#tennis">Tennis</a>]
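
The cell above returns the `<a>` tag objects rather than the URL strings themselves. A minimal sketch of pulling out only the `href` values follows; the inline `sample_html` is an assumed stand-in for the navigation block of `task.html`, not the real file.

```python
from bs4 import BeautifulSoup

# Inline stand-in for the navigation block of task.html (assumed sample).
sample_html = """
<nav>
  <a href="#football">Football</a>
  <a href="#basketball">Basketball</a>
  <a href="#tennis">Tennis</a>
  <a>No href here</a>
</nav>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# href=True keeps only anchors that actually carry an href attribute,
# so indexing with a["href"] cannot raise a KeyError.
hrefs = [a["href"] for a in sample_soup.find_all("a", href=True)]
print(hrefs)  # ['#football', '#basketball', '#tennis']
```

The same list comprehension should work unchanged on the `soup` object built from the local file.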

🌟 Exercise 2 : Scraping Robots.txt From Wikipedia

Instructions

Write a Python program to download and display the content of robots.txt for en.wikipedia.org.

In [12]:
import requests
url = 'https://en.wikipedia.org/robots.txt'
response = requests.get(url)
# robots.txt is plain text, so it can be printed directly without an HTML parser
print(response.text)

# robots.txt for http://www.wikipedia.org/ and friends
#
# Please note: There are a lot of pages on this site, and there are
# some misbehaved spiders out there that go _way_ too fast. If you're
# irresponsible, your access to the site may be blocked.
#

# Observed spamming large amounts of https://en.wikipedia.org/?curid=NNNNNN
# and ignoring 429 ratelimit responses, claims to respect robots:
# http://mj12bot.com/
User-agent: MJ12bot
Disallow: /

# advertising-related bots:
User-agent: Mediapartners-Google*
Disallow: /

# Wikipedia work bots:
User-agent: IsraBot
Disallow:

User-agent: Orthogaffe
Disallow:

# Crawlers that are kind enough to obey, but which we'd rather not have
# unless they're feeding search engines.
User-agent: UbiCrawler
Disallow: /

User-agent: DOC
Disallow: /

User-agent: Zao
Disallow: /

# Some bots are known to be trouble, particularly those designed to copy
# entire sites. Please obey robots.txt.
User-agent: sitecheck.internetseer.com
Disallow: /

User-agent: Z
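
Beyond displaying the file, the rules can also be evaluated. The sketch below uses the standard library's `urllib.robotparser` on a few rules copied from the output above; it parses them from a string, so no network request is made. The variable names are illustrative only.

```python
from urllib.robotparser import RobotFileParser

# A few rules copied from the robots.txt output above.
sample_rules = """\
User-agent: MJ12bot
Disallow: /

User-agent: IsraBot
Disallow:
"""

parser = RobotFileParser()
parser.parse(sample_rules.splitlines())

page = "https://en.wikipedia.org/wiki/Main_Page"

# can_fetch(useragent, url) reports whether the parsed rules permit the request.
mj12_allowed = parser.can_fetch("MJ12bot", page)   # "Disallow: /" blocks everything
isra_allowed = parser.can_fetch("IsraBot", page)   # an empty Disallow allows everything
print(mj12_allowed, isra_allowed)  # False True
```

To check the live file instead, `RobotFileParser` also offers `set_url()` followed by `read()`.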

🌟 Exercise 3 : Extracting Headers From Wikipedia’s Main Page

Instructions

Write a Python program to extract and display all the header tags from en.wikipedia.org/wiki/Main_Page.

In [17]:
url = 'https://en.wikipedia.org/wiki/Main_Page'
response = requests.get(url)
html_ocean_ws = response.text
soup = BeautifulSoup(html_ocean_ws, 'html.parser')
# List to store header tags
headers = []
# Loop through header levels h1 to h6 and collect every matching tag
for i in range(1, 7):
    headers.extend(soup.find_all(f'h{i}'))

# Display the header tags
for header in headers:
    print(header.get_text())

Main Page
Welcome to Wikipedia
From today's featured article
Did you know ...
In the news
On this day
Today's featured picture
Other areas of Wikipedia
Wikipedia's sister projects
Wikipedia languages
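
The loop above groups results by level (every `h1`, then every `h2`, and so on). If the headers should appear in the order they occur on the page, one option is to pass a list of tag names to `find_all`. This is a sketch on a small inline page; `sample_html` is an assumed example, not Wikipedia's real markup.

```python
from bs4 import BeautifulSoup

# Small inline page (assumed sample, not Wikipedia's real markup).
sample_html = """
<h1>Main Page</h1>
<h2>In the news</h2>
<h3>Ongoing</h3>
<h2>On this day</h2>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
header_tags = [f"h{level}" for level in range(1, 7)]

# Passing a list of names returns matches in the order they appear in the page.
headers_in_order = [
    (tag.name, tag.get_text(strip=True))
    for tag in sample_soup.find_all(header_tags)
]
print(headers_in_order)
# [('h1', 'Main Page'), ('h2', 'In the news'), ('h3', 'Ongoing'), ('h2', 'On this day')]
```

Keeping `tag.name` next to the text also makes the heading level visible in the output.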


🌟 Exercise 4 : Checking For Page Title

Instructions

Write a Python program to check whether a page contains a title or not.

In [20]:
if soup.title:
    print(soup.title.get_text())
else:
    print("This page does not contain a title")

Wikipedia, the free encyclopedia
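
One edge case: a page can contain a `<title>` tag with no text, and a BeautifulSoup tag object is truthy even when it is empty. The sketch below wraps the check in a hypothetical helper, `page_title`, that treats a missing and an empty title the same way.

```python
from bs4 import BeautifulSoup

def page_title(html):
    """Return the stripped <title> text, or None if it is missing or empty."""
    tag = BeautifulSoup(html, "html.parser").title
    if tag is None:
        return None
    text = tag.get_text(strip=True)
    return text or None

with_title = page_title("<html><head><title> Sports World </title></head></html>")
empty_title = page_title("<html><head><title></title></head></html>")
no_title = page_title("<html><body><p>No head at all</p></body></html>")

print(with_title)   # Sports World
print(empty_title)  # None
print(no_title)     # None
```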


🌟 Exercise 5 : Analyzing US-CERT Security Alerts

Instructions

Write a Python program to get the number of security alerts issued by US-CERT in the current year.
Source: https://www.us-cert.gov/ncas/alerts
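
No solution cell is given for this exercise. The sketch below shows only the counting step, on hypothetical markup where each alert carries a `<time datetime="...">` element. The real page structure is not shown in this notebook and may differ, so the tag and attribute names here are assumptions to verify against the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical listing markup: each alert is an <article> with a <time> element.
# The real page may use different tags; inspect it before reusing this.
sample_html = """
<article><h3>Alert A</h3><time datetime="2024-03-01T12:00:00Z">Mar 01, 2024</time></article>
<article><h3>Alert B</h3><time datetime="2024-01-15T12:00:00Z">Jan 15, 2024</time></article>
<article><h3>Alert C</h3><time datetime="2023-12-20T12:00:00Z">Dec 20, 2023</time></article>
"""

def count_alerts_in_year(html, year):
    """Count <time> elements whose datetime attribute starts with the given year."""
    parsed = BeautifulSoup(html, "html.parser")
    stamps = [t["datetime"] for t in parsed.find_all("time", datetime=True)]
    return sum(1 for stamp in stamps if stamp.startswith(f"{year}-"))

alerts_2024 = count_alerts_in_year(sample_html, 2024)
print(alerts_2024)  # 2
```

For the current year, `datetime.date.today().year` could be passed as the `year` argument. A listing that spans several pages would need each page fetched and counted.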

🌟 Exercise 6 : Scraping Movie Details

Instructions

Write a Python program to get movie name, year and a brief summary of the top 10 random movies.

In [22]:
from bs4 import BeautifulSoup
import requests

# URL of IMDb's Top 250 movies page
url = 'https://www.imdb.com/chart/top/?ref_=nv_mv_250'
# Base URL for IMDb, used to build absolute links to the detail pages
base_url = 'https://www.imdb.com'
# Assumption: a browser-like User-Agent; the default requests one may be rejected
request_headers = {'User-Agent': 'Mozilla/5.0'}

# Send a request to the URL and stop early on an HTTP error
response = requests.get(url, headers=request_headers)
response.raise_for_status()

# Parse the HTML content with BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

# Extract the top 10 movie entries.
# Note: 'td.titleColumn' may no longer match IMDb's markup; if this list
# comes back empty, the selector needs updating for the current page.
movies = soup.select('td.titleColumn')[:10]

# List to store movie details
movie_details = []

# Loop through the top 10 movies
for movie in movies:
    movie_name = movie.a.get_text()
    year = movie.span.get_text().strip('()')

    # Fetch the movie detail page to get the summary
    movie_url = base_url + movie.a['href']
    movie_response = requests.get(movie_url, headers=request_headers)
    movie_soup = BeautifulSoup(movie_response.content, 'html.parser')

    # The plot element may be missing, so guard against find() returning None
    plot = movie_soup.find('span', {'data-testid': 'plot-l'})
    summary = plot.get_text().strip() if plot is not None else 'Summary not found'

    # Append the movie details to the list
    movie_details.append({
        'name': movie_name,
        'year': year,
        'summary': summary
    })

# Display the movie details
for details in movie_details:
    print(f"Name: {details['name']}")
    print(f"Year: {details['year']}")
    print(f"Summary: {details['summary']}\n")
