# Lab | Web Scraping

## Exercise: Extracting Information from an HTML Document

In this exercise, you will practice web scraping using **BeautifulSoup**. The goal is to extract specific pieces of an HTML document: its title, its paragraphs, a heading, a link, and its full text.

Follow the steps below to complete the exercise:

1. **Load the HTML document**  
   - Use the given HTML code and parse it using `BeautifulSoup`.

2. **Find the `<title>` tag**  
   - Extract and print the content of the `<title>` tag.

3. **Retrieve all paragraph (`<p>`) tags**  
   - Find and print all paragraph tags in the document.

4. **Count the number of paragraph (`<p>`) tags**  
   - Calculate and print the total number of `<p>` tags.

5. **Extract the text from the first paragraph (`<p>`)**  
   - Retrieve and print only the text inside the first paragraph.

6. **Find the length of the text inside the first `<h2>` tag**  
   - Extract the text from the first `<h2>` tag and print its length.

7. **Find the `href` attribute of the first `<a>` tag**  
   - Extract and print the `href` attribute from the first `<a>` tag.

8. **Extract all text from the HTML document**  
   - Print all the text contained in the document.

### More Practice
For additional practice, check out more `BeautifulSoup` exercises:  
🔹 [w3resource BeautifulSoup Exercises](https://www.w3resource.com/python-exercises/BeautifulSoup/index.php)

---

Complete each task using the appropriate **BeautifulSoup methods and attributes**, such as:
- `find()`
- `find_all()`
- `text`
- `get_text()`
- `get()`
- `select()`

Implement the steps using Python and **explain your results** in markdown cells where necessary.
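
Two of the listed methods, `select()` and `get()`, are easy to overlook, so here is a minimal sketch of how they behave. The `sample_html` string and the `example.com` URL are hypothetical and exist only for this illustration; they are not part of the lab's document.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature document, used only for this illustration
sample_html = """
<html><body>
<p><a href="https://example.com/a">First link</a></p>
<p><a>Link without href</a></p>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# select() takes a CSS selector and returns a list of matching tags
links = soup.select("p > a")

# get() returns None when the attribute is missing,
# whereas indexing with ['href'] would raise a KeyError
hrefs = [link.get("href") for link in links]
print(hrefs)

# get_text(strip=True) trims leading and trailing whitespace
texts = [link.get_text(strip=True) for link in links]
print(texts)
```

Using `get()` is usually the safer choice when an attribute may be absent, because the missing case can be handled without a `try`/`except`.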

In [1]:
from bs4 import BeautifulSoup
html_doc = """
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;
charset=iso-8859-1">
<title>An example of HTML page</title>
</head>
<body>
<h2>This is an example HTML page</h2>
<p>
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc at nisi velit,
aliquet iaculis est. Curabitur porttitor nisi vel lacus euismod egestas. In hac
habitasse platea dictumst. In sagittis magna eu odio interdum mollis. Phasellus
sagittis pulvinar facilisis. Donec vel odio volutpat tortor volutpat commodo.
Donec vehicula vulputate sem, vel iaculis urna molestie eget. Sed pellentesque
adipiscing tortor, at condimentum elit elementum sed. Mauris dignissim
elementum nunc, non elementum felis condimentum eu. In in turpis quis erat
imperdiet vulputate. Pellentesque mauris turpis, dignissim sed iaculis eu,
euismod eget ipsum. Vivamus mollis adipiscing viverra. Morbi at sem eget nisl
euismod porta.</p>
<p><a href="https://www.w3resource.com/html/HTML-tutorials.php">Learn HTML from
w3resource.com</a></p>
<p><a href="https://www.w3resource.com/css/CSS-tutorials.php">Learn CSS from
w3resource.com</a></p>
</body>
</html>
"""

In [2]:
# Write a Python program to find the title tags from a given html document.

def find_title_tag(html):
    soup = BeautifulSoup(html, 'html.parser')
    title_tag = soup.find('title')  # Find the <title> tag
    return title_tag.text if title_tag else "No title tag found"

# Execute the function and print the title tag content
title_content = find_title_tag(html_doc)
print("Title Tag Content:", title_content)

Title Tag Content: An example of HTML page
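
As an alternative to `find('title')`, BeautifulSoup also exposes the first tag of a given name as an attribute of the soup object. The sketch below assumes two small hypothetical documents (`with_title` and `without_title`) to show both the found and the missing case.

```python
from bs4 import BeautifulSoup

# Hypothetical inputs for illustration only
with_title = "<html><head><title>An example of HTML page</title></head></html>"
without_title = "<html><body><p>No title here</p></body></html>"

soup_a = BeautifulSoup(with_title, "html.parser")
soup_b = BeautifulSoup(without_title, "html.parser")

# soup.title is shorthand for soup.find('title')
title_a = soup_a.title.string if soup_a.title else None
title_b = soup_b.title.string if soup_b.title else None

print(title_a)
print(title_b)
```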


In [4]:
# Write a Python program to retrieve all the paragraph tags from a given html document

# Function to extract all paragraph tags from an HTML document
def find_paragraph_tags(html):
    soup = BeautifulSoup(html, 'html.parser')
    paragraphs = soup.find_all('p')  # Find all <p> tags
    return [p.text for p in paragraphs] if paragraphs else ["No paragraph tags found"]

# Execute the function and print all paragraph contents
paragraphs_content = find_paragraph_tags(html_doc)
print("Paragraph Tags Content:")
for i, paragraph in enumerate(paragraphs_content, start=1):
    print(f"Paragraph {i}: {paragraph}")

Paragraph Tags Content:
Paragraph 1: 
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc at nisi velit,
aliquet iaculis est. Curabitur porttitor nisi vel lacus euismod egestas. In hac
habitasse platea dictumst. In sagittis magna eu odio interdum mollis. Phasellus
sagittis pulvinar facilisis. Donec vel odio volutpat tortor volutpat commodo.
Donec vehicula vulputate sem, vel iaculis urna molestie eget. Sed pellentesque
adipiscing tortor, at condimentum elit elementum sed. Mauris dignissim
elementum nunc, non elementum felis condimentum eu. In in turpis quis erat
imperdiet vulputate. Pellentesque mauris turpis, dignissim sed iaculis eu,
euismod eget ipsum. Vivamus mollis adipiscing viverra. Morbi at sem eget nisl
euismod porta.
Paragraph 2: Learn HTML from
w3resource.com
Paragraph 3: Learn CSS from
w3resource.com


In [5]:
# Write a Python program to get the number of paragraph tags of a given html document.

# Function to count the number of paragraph tags in an HTML document
def count_paragraph_tags(html):
    soup = BeautifulSoup(html, 'html.parser')
    paragraphs = soup.find_all('p')  # Find all <p> tags
    return len(paragraphs)  # Return the count of <p> tags

# Execute the function and print the count of paragraph tags
num_paragraphs = count_paragraph_tags(html_doc)
print("Number of Paragraph Tags:", num_paragraphs)

Number of Paragraph Tags: 3
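
The same counting idea extends to every tag in a document. The sketch below is one possible approach, using `find_all(True)` (which matches any tag) together with `collections.Counter`; the `sample_html` string is a hypothetical, shortened stand-in for the lab document.

```python
from collections import Counter

from bs4 import BeautifulSoup

# Hypothetical shortened document with the same tag mix as the lab's HTML body
sample_html = (
    "<html><body><h2>Heading</h2>"
    "<p>One</p><p><a href='#'>Two</a></p><p><a href='#'>Three</a></p>"
    "</body></html>"
)

soup = BeautifulSoup(sample_html, "html.parser")

# find_all(True) returns every tag; tag.name is the tag's name as a string
tag_counts = Counter(tag.name for tag in soup.find_all(True))

for name, count in sorted(tag_counts.items()):
    print(f"{name}: {count}")
```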


In [6]:
# Write a Python program to extract the text in the first paragraph tag of a given html document.

# Function to extract the text from the first paragraph tag
def get_first_paragraph_text(html):
    soup = BeautifulSoup(html, 'html.parser')
    first_paragraph = soup.find('p')  # Find the first <p> tag
    return first_paragraph.text if first_paragraph else "No paragraph tag found"

# Execute the function and print the first paragraph text
first_paragraph_text = get_first_paragraph_text(html_doc)
print("First Paragraph Text:", first_paragraph_text)

First Paragraph Text: 
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc at nisi velit,
aliquet iaculis est. Curabitur porttitor nisi vel lacus euismod egestas. In hac
habitasse platea dictumst. In sagittis magna eu odio interdum mollis. Phasellus
sagittis pulvinar facilisis. Donec vel odio volutpat tortor volutpat commodo.
Donec vehicula vulputate sem, vel iaculis urna molestie eget. Sed pellentesque
adipiscing tortor, at condimentum elit elementum sed. Mauris dignissim
elementum nunc, non elementum felis condimentum eu. In in turpis quis erat
imperdiet vulputate. Pellentesque mauris turpis, dignissim sed iaculis eu,
euismod eget ipsum. Vivamus mollis adipiscing viverra. Morbi at sem eget nisl
euismod porta.
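
The output above starts on a new line because the `<p>` tag's text begins with a newline, and it keeps the hard line wraps from the source. If a single clean line is preferred, one option is to normalize the whitespace after extraction. The sketch below assumes a hypothetical shortened paragraph (`sample_html`) with the same leading newline and wrapping.

```python
from bs4 import BeautifulSoup

# Hypothetical short paragraph with a leading newline and a hard line wrap
sample_html = "<p>\nLorem ipsum dolor sit amet,\nconsectetur adipiscing elit.</p>"

soup = BeautifulSoup(sample_html, "html.parser")
first_paragraph = soup.find("p")

# .text returns the text exactly as it appears in the markup
raw_text = first_paragraph.text

# split() with no arguments splits on any whitespace run, so joining
# with single spaces collapses newlines and trims both ends
clean_text = " ".join(raw_text.split())

print(repr(raw_text))
print(repr(clean_text))
```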


In [9]:
# Write a Python program to find the length of the text of the first <h2> tag of a given html document.

# Function to find the length of the text inside the first <h2> tag
def get_h2_text_length(html):
    soup = BeautifulSoup(html, 'html.parser')
    h2_tag = soup.find('h2')  # Find the first <h2> tag
    return len(h2_tag.text) if h2_tag else "No <h2> tag found"

# Execute the function and print the length of the text inside the first <h2> tag
h2_text_length = get_h2_text_length(html_doc)
print("Length of the First <h2> Tag Text:", h2_text_length)

Length of the First <h2> Tag Text: 28


In [10]:
# Write a Python program to find the href of the first <a> tag of a given html document.

# Function to find the href attribute of the first <a> tag
def get_first_a_href(html):
    soup = BeautifulSoup(html, 'html.parser')
    first_a_tag = soup.find('a')  # Find the first <a> tag
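    if first_a_tag is None:
        return "No <a> tag found"
    # get() avoids a KeyError when the <a> tag has no href attribute
    return first_a_tag.get('href', "No href attribute found")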

# Execute the function and print the href attribute of the first <a> tag
first_a_href = get_first_a_href(html_doc)
print("Href of the First <a> Tag:", first_a_href)

Href of the First <a> Tag: https://www.w3resource.com/html/HTML-tutorials.php
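
Step 8 of the exercise refers to the HTML document itself, and `get_text()` works on an in-memory document with no network request. The sketch below assumes a hypothetical shortened copy of the lab document (`sample_html`) so that it runs on its own.

```python
from bs4 import BeautifulSoup

# Hypothetical shortened copy of the lab document, for illustration only
sample_html = """
<html>
<head><title>An example of HTML page</title></head>
<body>
<h2>This is an example HTML page</h2>
<p><a href="https://www.w3resource.com/html/HTML-tutorials.php">Learn HTML from
w3resource.com</a></p>
</body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# separator="\n" puts each text fragment on its own line;
# strip=True trims each fragment and drops whitespace-only ones
all_text = soup.get_text(separator="\n", strip=True)
print(all_text)
```

Note that `strip=True` only trims the ends of each fragment, so the line break inside the link text is preserved.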


In [17]:
# Write a Python program to extract all the text from a given web page.
# Note: this cell fetches a live page rather than parsing html_doc,
# so the output depends on the site's content at the time of the request.

!pip install requests

import requests
from bs4 import BeautifulSoup

# Function to extract all text from a given web page URL
def extract_text_from_url(url):
    response = requests.get(url, timeout=10)  # Fetch the web page; the timeout prevents an indefinite hang
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')  # Parse the HTML content
        return soup.get_text(separator="\n", strip=True)  # Extract and return all text
    else:
        return f"Failed to retrieve page. Status code: {response.status_code}"

# URL of the page to scrape
url = "https://www.w3resource.com/html/HTML-tutorials.php"

# Extract and print all text from the web page
page_text = extract_text_from_url(url)
print("Extracted Text from Web Page:\n", page_text)

Extracted Text from Web Page:
 HTML tutorials - w3resource
w3resource
home
Front End
HTML
CSS
JavaScript
HTML5
Schema.org
php.js
Twitter Bootstrap
Responsive Web Design tutorial
Zurb Foundation 3 tutorials
Pure CSS
HTML5 Canvas
JavaScript Course
Icon
Angular
Vue
Jest
Mocha
NPM
Yarn
Back End
PHP
Python
Java
Node.js
Ruby
C programming
PHP Composer
Laravel
PHPUnit
Database
SQL(2003 standard of ANSI)
MySQL
PostgreSQL
SQLite
NoSQL
MongoDB
Oracle
Redis
Apollo GraphQL
API
Google Plus API
Youtube API
Google Maps API
Flickr API
Last.fm API
Twitter REST API
Data Interchnage
XML
JSON
Ajax
Exercises
HTML CSS Exercises
JavaScript Exercises
jQuery Exercises
jQuery-UI Exercises
CoffeeScript Exercises
PHP Exercises
Python Exercises
C Programming Exercises
C# Sharp Exercises
Java Exercises
SQL Exercises
Oracle Exercises
MySQL Exercises
SQLite Exercises
PostgreSQL Exercises
MongoDB Exercises
Twitter Bootstrap Examples
Others
Excel Tutorials
Useful tools
Google Docs Forms Templates
Google Docs Slide Pres