# Lab | Web Scraping

## Exercise: Extracting Information from an HTML Document

In this exercise, you will practice web scraping using **BeautifulSoup**. The goal is to extract useful information from an HTML document.

Follow the steps below to complete the exercise:

1. **Load the HTML document**  
   - Use the given HTML code and parse it using `BeautifulSoup`.

2. **Find the `<title>` tag**  
   - Extract and print the content of the `<title>` tag.

3. **Retrieve all paragraph (`<p>`) tags**  
   - Find and print all paragraph tags in the document.

4. **Count the number of paragraph (`<p>`) tags**  
   - Calculate and print the total number of `<p>` tags.

5. **Extract the text from the first paragraph (`<p>`)**  
   - Retrieve and print only the text inside the first paragraph.

6. **Find the length of the text inside the first `<h2>` tag**  
   - Extract the text from the first `<h2>` tag and print its length.

7. **Find the `href` attribute of the first `<a>` tag**  
   - Extract and print the `href` attribute from the first `<a>` tag.

8. **Extract all text from the HTML document**  
   - Print all the text contained in the document.

### More Practice
For additional practice, check out more `BeautifulSoup` exercises:  
🔹 [w3resource BeautifulSoup Exercises](https://www.w3resource.com/python-exercises/BeautifulSoup/index.php)

---

Complete each task using the appropriate **BeautifulSoup methods and attributes**, such as:
- `find()`
- `find_all()`
- `text`
- `get_text()`
- `get()`
- `select()`

Implement the steps using Python and **explain your results** in markdown cells where necessary.
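
As a quick orientation before the exercise, the sketch below runs the methods listed above on a small, made-up HTML snippet. The snippet and the variable names (`sample`, `demo`, and so on) are assumptions for illustration only and are not part of the lab's document.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration; this is NOT the lab's HTML document.
sample = """
<html><head><title>Demo</title></head>
<body>
<h2 id="intro">Heading</h2>
<p class="lead">First <b>bold</b> paragraph.</p>
<p>Second paragraph with a <a href="https://example.com/page">link</a>.</p>
</body></html>
"""

demo = BeautifulSoup(sample, "html.parser")

first_p = demo.find("p")        # first matching tag, or None if there is no match
all_p = demo.find_all("p")      # list-like collection of every matching tag
link = demo.find("a")
lead = demo.select("p.lead")    # CSS selector; returns a list of matching tags

print(first_p.text)             # First bold paragraph.
print(first_p.get_text())       # same text, obtained through the method form
print(len(all_p))               # 2
print(link.get("href"))         # https://example.com/page
print(link.get("title"))        # None, because the attribute is absent
print([p.get_text() for p in lead])
```

`find()` returns a single tag while `find_all()` and `select()` return collections, so indexing or looping is needed before reading `.text` from their results.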

In [4]:
from bs4 import BeautifulSoup
html_doc = """
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;
charset=iso-8859-1">
<title>An example of HTML page</title>
</head>
<body>
<h2>This is an example HTML page</h2>
<p>
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc at nisi velit,
aliquet iaculis est. Curabitur porttitor nisi vel lacus euismod egestas. In hac
habitasse platea dictumst. In sagittis magna eu odio interdum mollis. Phasellus
sagittis pulvinar facilisis. Donec vel odio volutpat tortor volutpat commodo.
Donec vehicula vulputate sem, vel iaculis urna molestie eget. Sed pellentesque
adipiscing tortor, at condimentum elit elementum sed. Mauris dignissim
elementum nunc, non elementum felis condimentum eu. In in turpis quis erat
imperdiet vulputate. Pellentesque mauris turpis, dignissim sed iaculis eu,
euismod eget ipsum. Vivamus mollis adipiscing viverra. Morbi at sem eget nisl
euismod porta.</p>
<p><a href="https://www.w3resource.com/html/HTML-tutorials.php">Learn HTML from
w3resource.com</a></p>
<p><a href="https://www.w3resource.com/css/CSS-tutorials.php">Learn CSS from
w3resource.com</a></p>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')  # parse with Python's built-in HTML parser
soup


<html>
<head>
<meta content="text/html;
charset=utf-8" http-equiv="Content-Type"/>
<title>An example of HTML page</title>
</head>
<body>
<h2>This is an example HTML page</h2>
<p>
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc at nisi velit,
aliquet iaculis est. Curabitur porttitor nisi vel lacus euismod egestas. In hac
habitasse platea dictumst. In sagittis magna eu odio interdum mollis. Phasellus
sagittis pulvinar facilisis. Donec vel odio volutpat tortor volutpat commodo.
Donec vehicula vulputate sem, vel iaculis urna molestie eget. Sed pellentesque
adipiscing tortor, at condimentum elit elementum sed. Mauris dignissim
elementum nunc, non elementum felis condimentum eu. In in turpis quis erat
imperdiet vulputate. Pellentesque mauris turpis, dignissim sed iaculis eu,
euismod eget ipsum. Vivamus mollis adipiscing viverra. Morbi at sem eget nisl
euismod porta.</p>
<p><a href="https://www.w3resource.com/html/HTML-tutorials.php">Learn HTML from
w3resource.com</a></p>
<p><a

In [20]:
# Write a Python program to find the title tags from a given html document.
soup.title.string

'An example of HTML page'
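
`soup.title` is shorthand for `soup.find("title")`, and it evaluates to `None` when the document has no `<title>`, so chaining `.string` onto it would then fail. The following is a minimal sketch of a defensive lookup; the helper name `title_text` and both sample documents are assumptions made up for this illustration.

```python
from bs4 import BeautifulSoup

with_title = BeautifulSoup(
    "<html><head><title>An example of HTML page</title></head></html>",
    "html.parser",
)
without_title = BeautifulSoup(
    "<html><body><p>No head here</p></body></html>",
    "html.parser",
)

print(with_title.title.name)     # title
print(with_title.title.string)   # An example of HTML page
print(without_title.title)       # None


def title_text(soup):
    """Hypothetical helper: return the title text, or None if no <title> exists."""
    tag = soup.find("title")
    return tag.get_text() if tag is not None else None


print(title_text(with_title))
print(title_text(without_title))
```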

In [32]:
# Write a Python program to retrieve all the paragraph tags from a given html document

# Print each <p> tag, including its markup
for paragraph in soup.find_all("p"):
    print(paragraph)

<p>
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc at nisi velit,
aliquet iaculis est. Curabitur porttitor nisi vel lacus euismod egestas. In hac
habitasse platea dictumst. In sagittis magna eu odio interdum mollis. Phasellus
sagittis pulvinar facilisis. Donec vel odio volutpat tortor volutpat commodo.
Donec vehicula vulputate sem, vel iaculis urna molestie eget. Sed pellentesque
adipiscing tortor, at condimentum elit elementum sed. Mauris dignissim
elementum nunc, non elementum felis condimentum eu. In in turpis quis erat
imperdiet vulputate. Pellentesque mauris turpis, dignissim sed iaculis eu,
euismod eget ipsum. Vivamus mollis adipiscing viverra. Morbi at sem eget nisl
euismod porta.</p>
<p><a href="https://www.w3resource.com/html/HTML-tutorials.php">Learn HTML from
w3resource.com</a></p>
<p><a href="https://www.w3resource.com/css/CSS-tutorials.php">Learn CSS from
w3resource.com</a></p>


In [48]:
# Write a Python program to get the number of paragraph tags of a given html document.

paragraphs = soup.find_all("p")

len(paragraphs)

3
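
The same counting idea extends to every tag in a document. The sketch below is one possible approach, assuming a shortened stand-in for the lab's HTML; it uses `find_all(True)`, which matches every tag, together with `collections.Counter`.

```python
from collections import Counter

from bs4 import BeautifulSoup

# Shortened stand-in for the lab document (an assumption for illustration).
fragment = """
<body>
<h2>This is an example HTML page</h2>
<p>Lorem ipsum dolor sit amet.</p>
<p><a href="https://www.w3resource.com/html/HTML-tutorials.php">Learn HTML</a></p>
<p><a href="https://www.w3resource.com/css/CSS-tutorials.php">Learn CSS</a></p>
</body>
"""

fragment_soup = BeautifulSoup(fragment, "html.parser")

# find_all(True) returns every tag; Counter tallies them by tag name.
tag_counts = Counter(tag.name for tag in fragment_soup.find_all(True))

print(tag_counts["p"])   # 3
print(tag_counts["a"])   # 2
print(tag_counts["h2"])  # 1
```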

In [50]:
# Write a Python program to extract the text in the first paragraph tag of a given html document.

soup.find_all("p")[0].text

'\nLorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc at nisi velit,\naliquet iaculis est. Curabitur porttitor nisi vel lacus euismod egestas. In hac\nhabitasse platea dictumst. In sagittis magna eu odio interdum mollis. Phasellus\nsagittis pulvinar facilisis. Donec vel odio volutpat tortor volutpat commodo.\nDonec vehicula vulputate sem, vel iaculis urna molestie eget. Sed pellentesque\nadipiscing tortor, at condimentum elit elementum sed. Mauris dignissim\nelementum nunc, non elementum felis condimentum eu. In in turpis quis erat\nimperdiet vulputate. Pellentesque mauris turpis, dignissim sed iaculis eu,\neuismod eget ipsum. Vivamus mollis adipiscing viverra. Morbi at sem eget nisl\neuismod porta.'
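
The extracted string keeps the line breaks from the HTML source, including the leading `\n`. If a single clean line is preferred, one option is to normalize the whitespace after extraction. The sketch below assumes a shortened version of the first paragraph; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the lab's first paragraph (an assumption).
snippet = "<p>\nLorem ipsum dolor sit amet,\nconsectetur adipiscing elit.</p>"
p = BeautifulSoup(snippet, "html.parser").find("p")

raw = p.text                                 # keeps every newline
stripped = p.get_text(strip=True)            # trims only the ends of each string
collapsed = " ".join(p.get_text().split())   # collapses all whitespace runs

print(repr(raw))
print(repr(stripped))
print(repr(collapsed))
```

`get_text(strip=True)` removes only leading and trailing whitespace, so the inner newline survives; the `split()`/`join()` idiom is what turns the text into one line.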

In [52]:
# Write a Python program to find the length of the text of the first <h2> tag of a given html document.

len(soup.find("h2").text)

28

In [64]:
# Write a Python program to find the href of the first <a> tag of a given html document.

first_link = soup.find('a')   # first <a> tag in the document
first_link.get('href')        # value of its href attribute

'https://www.w3resource.com/html/HTML-tutorials.php'
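
An attribute can be read from a tag either with dictionary-style indexing or with `get()`. The two differ when the attribute is missing, as the sketch below shows on a made-up snippet (the third `<a>` tag without an `href` is an assumption added for this illustration).

```python
from bs4 import BeautifulSoup

links_html = """
<p><a href="https://www.w3resource.com/html/HTML-tutorials.php">Learn HTML</a></p>
<p><a href="https://www.w3resource.com/css/CSS-tutorials.php">Learn CSS</a></p>
<p><a name="no-target">Anchor without href</a></p>
"""
links_soup = BeautifulSoup(links_html, "html.parser")

first_a = links_soup.find("a")
print(first_a["href"])       # indexing: raises KeyError if the attribute is absent
print(first_a.get("href"))   # get(): returns None if the attribute is absent

last_a = links_soup.find_all("a")[-1]
print(last_a.get("href"))    # None

# href=True keeps only the <a> tags that actually carry an href attribute.
hrefs = [a["href"] for a in links_soup.find_all("a", href=True)]
print(hrefs)
```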

In [79]:
# Write a Python program to extract all the text from a given web page.
soup.text

'\n\n\n\nAn example of HTML page\n\n\nThis is an example HTML page\n\nLorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc at nisi velit,\naliquet iaculis est. Curabitur porttitor nisi vel lacus euismod egestas. In hac\nhabitasse platea dictumst. In sagittis magna eu odio interdum mollis. Phasellus\nsagittis pulvinar facilisis. Donec vel odio volutpat tortor volutpat commodo.\nDonec vehicula vulputate sem, vel iaculis urna molestie eget. Sed pellentesque\nadipiscing tortor, at condimentum elit elementum sed. Mauris dignissim\nelementum nunc, non elementum felis condimentum eu. In in turpis quis erat\nimperdiet vulputate. Pellentesque mauris turpis, dignissim sed iaculis eu,\neuismod eget ipsum. Vivamus mollis adipiscing viverra. Morbi at sem eget nisl\neuismod porta.\nLearn HTML from\nw3resource.com\nLearn CSS from\nw3resource.com\n\n\n'
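
The raw `soup.text` output contains many blank lines that come from the whitespace between tags. A possible way to get tidier output is `get_text()` with the `separator` and `strip` arguments, or the `stripped_strings` generator. The sketch below uses a shortened stand-in for the lab document, which is an assumption for illustration.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the lab document (an assumption).
doc = """
<html>
<head>
<title>An example of HTML page</title>
</head>
<body>
<h2>This is an example HTML page</h2>
<p><a href="https://www.w3resource.com/html/HTML-tutorials.php">Learn HTML</a></p>
</body>
</html>
"""
doc_soup = BeautifulSoup(doc, "html.parser")

# strip=True drops whitespace-only strings and trims the rest;
# separator is placed between the remaining pieces of text.
clean_text = doc_soup.get_text(separator="\n", strip=True)
print(clean_text)

# stripped_strings yields the same pieces one at a time.
pieces = list(doc_soup.stripped_strings)
print(pieces)
```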

In [75]:
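import requests

# Download the page; a status code of 200 means the request succeeded.
r = requests.get('https://en.wikipedia.org/wiki/Silicon_Valley_(TV_series)')
r.status_code  # not displayed here because it is not the last expression in the cell

# r.content is the raw response body as bytes (note the b'' prefix in the output).
html = r.content

html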

b'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-sticky-header-enabled vector-toc-available" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8">\n<title>Silicon Valley (TV series) - Wikipedia</title>\n<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-
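
`r.content` is a `bytes` object, and `BeautifulSoup` accepts bytes directly, detecting the encoding from the markup. The sketch below shows this step without any network access: `content` is a hand-shortened stand-in for the response body (an assumption), keeping only the `<title>` visible in the output above.

```python
from bs4 import BeautifulSoup

# Hand-shortened stand-in for r.content; the real response is far longer.
content = (
    b'<!DOCTYPE html>\n<html lang="en" dir="ltr">\n<head>\n'
    b'<meta charset="UTF-8">\n'
    b"<title>Silicon Valley (TV series) - Wikipedia</title>\n"
    b"</head>\n<body></body>\n</html>"
)

# Bytes can be passed straight to BeautifulSoup; no manual decode is needed.
page = BeautifulSoup(content, "html.parser")
page_title = page.title.string

print(page_title)  # Silicon Valley (TV series) - Wikipedia
```

With a live response, the same call would be `BeautifulSoup(r.content, "html.parser")`, ideally after confirming that `r.status_code` is 200.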