# Lab | Web Scraping

## Exercise: Extracting Information from an HTML Document

In this exercise, you will practice web scraping using **BeautifulSoup**. The goal is to extract useful information from an HTML document.

Follow the steps below to complete the exercise:

1. **Load the HTML document**  
   - Use the given HTML code and parse it using `BeautifulSoup`.

2. **Find the `<title>` tag**  
   - Extract and print the content of the `<title>` tag.

3. **Retrieve all paragraph (`<p>`) tags**  
   - Find and print all paragraph tags in the document.

4. **Count the number of paragraph (`<p>`) tags**  
   - Calculate and print the total number of `<p>` tags.

5. **Extract the text from the first paragraph (`<p>`)**  
   - Retrieve and print only the text inside the first paragraph.

6. **Find the length of the text inside the first `<h2>` tag**  
   - Extract the text from the first `<h2>` tag and print its length.

7. **Find the `href` attribute of the first `<a>` tag**  
   - Extract and print the `href` attribute from the first `<a>` tag.

8. **Extract all text from the HTML document**  
   - Print all the text contained in the document.

### More Practice
For additional practice, check out more `BeautifulSoup` exercises:  
🔹 [w3resource BeautifulSoup Exercises](https://www.w3resource.com/python-exercises/BeautifulSoup/index.php)

---

Complete each task using the appropriate **BeautifulSoup methods and attributes**, such as:
- `find()`
- `find_all()`
- `text`
- `get_text()`
- `get()`
- `select()`

Implement the steps using Python and **explain your results** in markdown cells where necessary.
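
The solution cells below never call `select()`, so here is a minimal sketch of how the listed calls relate to each other. It runs on a small made-up snippet, not the lab's HTML: `sample_html`, `sample_soup`, and the `example.com` URLs are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet used only for illustration (not the lab's html_doc).
sample_html = """
<html><body>
<p class="intro">Hello</p>
<p><a href="https://example.com/a">First</a></p>
<p><a href="https://example.com/b">Second</a></p>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching tag (or None if there is no match);
# find_all() returns every matching tag in document order.
first_p = sample_soup.find("p")
all_p = sample_soup.find_all("p")

# select() takes a CSS selector and returns a list of matching tags.
links = sample_soup.select("p > a")

# get() reads an attribute and returns None when the attribute is absent.
hrefs = [link.get("href") for link in links]

print(first_p.get_text())  # text content of the first <p>
print(len(all_p))          # number of <p> tags
print(hrefs)               # href values of the selected links
print(first_p.get("id"))   # None, because this <p> has no id attribute
```

`text` is an attribute while `get_text()` is a method. For a plain call with no arguments they return the same string.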

In [1]:
from bs4 import BeautifulSoup
html_doc = """
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;
charset=iso-8859-1">
<title>An example of HTML page</title>
</head>
<body>
<h2>This is an example HTML page</h2>
<p>
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc at nisi velit,
aliquet iaculis est. Curabitur porttitor nisi vel lacus euismod egestas. In hac
habitasse platea dictumst. In sagittis magna eu odio interdum mollis. Phasellus
sagittis pulvinar facilisis. Donec vel odio volutpat tortor volutpat commodo.
Donec vehicula vulputate sem, vel iaculis urna molestie eget. Sed pellentesque
adipiscing tortor, at condimentum elit elementum sed. Mauris dignissim
elementum nunc, non elementum felis condimentum eu. In in turpis quis erat
imperdiet vulputate. Pellentesque mauris turpis, dignissim sed iaculis eu,
euismod eget ipsum. Vivamus mollis adipiscing viverra. Morbi at sem eget nisl
euismod porta.</p>
<p><a href="https://www.w3resource.com/html/HTML-tutorials.php">Learn HTML from
w3resource.com</a></p>
<p><a href="https://www.w3resource.com/css/CSS-tutorials.php">Learn CSS from
w3resource.com</a></p>
</body>
</html>
"""

In [3]:
soup = BeautifulSoup(html_doc, "html.parser")

In [4]:
# Write a Python program to find the title tag of a given html document and print its text.
title_tag = soup.find("title")
print("Title tag:", title_tag.text)

Title tag: An example of HTML page


In [5]:
# Write a Python program to retrieve all the paragraph tags from a given html document.
paragraphs = soup.find_all("p")
print("All paragraph tags:", paragraphs)

All paragraph tags: [<p>
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc at nisi velit,
aliquet iaculis est. Curabitur porttitor nisi vel lacus euismod egestas. In hac
habitasse platea dictumst. In sagittis magna eu odio interdum mollis. Phasellus
sagittis pulvinar facilisis. Donec vel odio volutpat tortor volutpat commodo.
Donec vehicula vulputate sem, vel iaculis urna molestie eget. Sed pellentesque
adipiscing tortor, at condimentum elit elementum sed. Mauris dignissim
elementum nunc, non elementum felis condimentum eu. In in turpis quis erat
imperdiet vulputate. Pellentesque mauris turpis, dignissim sed iaculis eu,
euismod eget ipsum. Vivamus mollis adipiscing viverra. Morbi at sem eget nisl
euismod porta.</p>, <p><a href="https://www.w3resource.com/html/HTML-tutorials.php">Learn HTML from
w3resource.com</a></p>, <p><a href="https://www.w3resource.com/css/CSS-tutorials.php">Learn CSS from
w3resource.com</a></p>]


In [6]:
# Write a Python program to get the number of paragraph tags of a given html document.
num_paragraphs = len(paragraphs)
print("Number of paragraph tags:", num_paragraphs)

Number of paragraph tags: 3


In [7]:
# Write a Python program to extract the text in the first paragraph tag of a given html document.
first_paragraph_text = paragraphs[0].get_text()
print("First paragraph text:", first_paragraph_text)

First paragraph text: 
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc at nisi velit,
aliquet iaculis est. Curabitur porttitor nisi vel lacus euismod egestas. In hac
habitasse platea dictumst. In sagittis magna eu odio interdum mollis. Phasellus
sagittis pulvinar facilisis. Donec vel odio volutpat tortor volutpat commodo.
Donec vehicula vulputate sem, vel iaculis urna molestie eget. Sed pellentesque
adipiscing tortor, at condimentum elit elementum sed. Mauris dignissim
elementum nunc, non elementum felis condimentum eu. In in turpis quis erat
imperdiet vulputate. Pellentesque mauris turpis, dignissim sed iaculis eu,
euismod eget ipsum. Vivamus mollis adipiscing viverra. Morbi at sem eget nisl
euismod porta.


In [8]:
# Write a Python program to find the length of the text of the first <h2> tag of a given html document.
h2_tag = soup.find("h2")
h2_text_length = len(h2_tag.get_text()) if h2_tag else 0
print("Length of first <h2> tag text:", h2_text_length)

Length of first <h2> tag text: 28


In [9]:
# Write a Python program to find the href of the first <a> tag of a given html document.
first_link = soup.find("a")
first_link_href = first_link.get("href") if first_link else None
print("Href of first <a> tag:", first_link_href)

Href of first <a> tag: https://www.w3resource.com/html/HTML-tutorials.php


In [10]:
# Write a Python program to extract all the text from a given web page.
all_text = soup.get_text()
print("All extracted text:\n", all_text)

All extracted text:
 



An example of HTML page


This is an example HTML page

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc at nisi velit,
aliquet iaculis est. Curabitur porttitor nisi vel lacus euismod egestas. In hac
habitasse platea dictumst. In sagittis magna eu odio interdum mollis. Phasellus
sagittis pulvinar facilisis. Donec vel odio volutpat tortor volutpat commodo.
Donec vehicula vulputate sem, vel iaculis urna molestie eget. Sed pellentesque
adipiscing tortor, at condimentum elit elementum sed. Mauris dignissim
elementum nunc, non elementum felis condimentum eu. In in turpis quis erat
imperdiet vulputate. Pellentesque mauris turpis, dignissim sed iaculis eu,
euismod eget ipsum. Vivamus mollis adipiscing viverra. Morbi at sem eget nisl
euismod porta.
Learn HTML from
w3resource.com
Learn CSS from
w3resource.com
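
The extracted text above keeps every line break and blank line from the source HTML. If a single tidy string is preferred, one possible approach is to collapse the whitespace after extraction. The sketch below uses a small made-up snippet (`sample_html` and the variable names are assumptions, not part of the lab) and compares this with `get_text(separator=" ", strip=True)`, which strips each text fragment but leaves line breaks inside a fragment untouched.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet used only for illustration (not the lab's html_doc).
sample_html = "<body><h2> Heading </h2>\n<p>\nLine one,\nline two.</p></body>"
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Plain get_text() concatenates the text fragments exactly as they appear.
raw_text = sample_soup.get_text()

# str.split() with no arguments splits on any run of whitespace,
# so re-joining with single spaces collapses newlines and repeated spaces.
clean_text = " ".join(raw_text.split())

# strip=True trims each fragment and drops empty ones; separator joins them.
# Newlines inside a fragment are kept.
stripped_text = sample_soup.get_text(separator=" ", strip=True)

print(repr(raw_text))
print(repr(clean_text))
print(repr(stripped_text))
```

Which form is preferable depends on whether the original line structure matters for the later analysis.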



