BeautifulSoup is a Python library for parsing HTML and XML documents. It is distributed on PyPI as beautifulsoup4 and imported from the bs4 package, and it is widely used for web scraping and for extracting data from HTML or XML. The rest of this notebook walks through its key features, common methods, and typical use cases.



In [1]:
pip install beautifulsoup4 lxml


Note: you may need to restart the kernel to use updated packages.


Key Features of BeautifulSoup

Parse and navigate HTML/XML documents.

Modify or extract data from web pages.

Handle malformed or poorly structured HTML gracefully.

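Work with different parsers, such as:

html.parser (built into Python's standard library, no extra install).

lxml (third-party; generally faster, and it can also parse XML).

html5lib (third-party; slower, but parses pages the way a browser does, so it is the most forgiving of badly formed HTML).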
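Because lxml and html5lib are optional installs, it can help to fall back to whichever parser is available. The sketch below is one way to do that; `make_soup` is a hypothetical helper written for this example (it is not part of bs4), and it relies on bs4 raising `FeatureNotFound` when a requested parser is missing. It also shows that an unclosed tag is still parsed into a usable tree.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: parse with the first installed parser in `preferred`."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            continue  # this parser is not installed, try the next one
    raise RuntimeError("no usable parser found")


# Malformed input: neither <p> nor <b> is closed
broken = "<p>Unclosed paragraph <b>bold text"
soup, used = make_soup(broken)

print(used)            # name of the parser that was actually used
print(soup.b.string)   # bold text
```

Different parsers can build slightly different trees from the same broken markup, so it is usually safer to pick one parser explicitly once a project depends on the exact structure.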

In [3]:
from bs4 import BeautifulSoup

In [5]:
html_doc = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <h1>Main Heading</h1>
    <p class="intro">Welcome to the page!</p>
    <a href="http://example.com">Example Link</a>
  </body>
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

In [12]:
print(soup.h1.string)

Main Heading


In [13]:
ptag = soup.find('p')

In [14]:
print(ptag)

<p class="intro">Welcome to the page!</p>


In [15]:
print(type(ptag))

<class 'bs4.element.Tag'>


In [17]:
print(ptag.string)

Welcome to the page!
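A detail worth knowing about `.string`: it only returns text when the tag has a single child (or a single chain of nested children). If the tag mixes text and other tags, `.string` is `None`, and `get_text()` is the safer choice. The markup below is made up for this illustration.

```python
from bs4 import BeautifulSoup

html = '<div><p>Hello <b>world</b></p><span><em>nested</em></span></div>'
soup = BeautifulSoup(html, "html.parser")

# <p> has two children (a text node and <b>), so .string is ambiguous
print(soup.p.string)       # None

# get_text() joins all text found inside the tag
print(soup.p.get_text())   # Hello world

# <span> has exactly one child, <em>, which has exactly one string
print(soup.span.string)    # nested
```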


In [18]:
atag = soup.find('a')

In [19]:
print(atag)

<a href="http://example.com">Example Link</a>


In [27]:
print(atag['href'])

http://example.com
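Square-bracket access such as `atag['href']` raises `KeyError` when the attribute is missing. The sketch below shows the dictionary-style alternatives; the `title` attribute and the extra `lead` class are assumptions added only for this example.

```python
from bs4 import BeautifulSoup

html = '<p class="intro lead">Hi</p><a href="http://example.com">Example Link</a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.a
para = soup.p

# All attributes of a tag are available as a dict
print(link.attrs)                      # {'href': 'http://example.com'}

# .get() returns None (or a default) instead of raising
print(link.get("title"))               # None
print(link.get("title", "no title"))   # no title

# class is a multi-valued attribute, so it comes back as a list
print(para["class"])                   # ['intro', 'lead']

try:
    link["title"]
    missing_raises = False
except KeyError:
    missing_raises = True
print(missing_raises)                  # True
```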


In [30]:
print(soup.select(".intro"))

[<p class="intro">Welcome to the page!</p>]
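`select()` always returns a list, even for a single match. When only the first match is needed, `select_one()` returns the tag itself (or `None`), and `find_all()` offers a keyword-based equivalent to simple CSS selectors. The list markup below is invented for this comparison.

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li class="item" id="first"><a href="/a">A</a></li>
  <li class="item"><a href="/b">B</a></li>
  <li class="other">C</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# CSS selector: every <li> with class "item"
items = soup.select("li.item")
print(len(items))                          # 2

# First match only; returns None when nothing matches
first = soup.select_one("#first")
print(first.a["href"])                     # /a
print(soup.select_one(".missing"))         # None

# Keyword form of the same class filter (class_ avoids the reserved word)
same_items = soup.find_all("li", class_="item")
print(len(same_items))                     # 2

# Selectors can combine tags and attributes
hrefs = [a["href"] for a in soup.select("li a[href]")]
print(hrefs)                               # ['/a', '/b']
```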


In [31]:
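ptag.string = "Hello Meet Mavani"  # replace the paragraph's text
atag['href'] = "_blank"  # overwrite the attribute value ("_blank" is normally a value for target, not a URL)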

In [32]:
print(atag)

<a href="_blank">Example Link</a>
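Assignment overwrites an existing attribute, but the same syntax also adds new ones, and `del` removes them. New elements can be created with `new_tag()` and attached with `append()`. The markup and the `title` attribute here are assumptions for the sake of a short, self-contained sketch.

```python
from bs4 import BeautifulSoup

html = '<p class="intro">Welcome!</p><a href="http://example.com" title="old">Example Link</a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.a

link["target"] = "_blank"   # add an attribute that was not there before
del link["title"]           # remove an existing attribute
print(link)                 # <a href="http://example.com" target="_blank">Example Link</a>

# Create a new tag, give it text, and append it inside <p>
note = soup.new_tag("span")
note.string = "new"
soup.p.append(note)
print(soup.p)               # <p class="intro">Welcome!<span>new</span></p>
```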


In [33]:
print(soup.get_text())



Sample Page

Main Heading
Hello Meet Mavani
Example Link
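The blank lines in that output come from the whitespace between tags in the source. As a sketch of how to get cleaner text, `get_text()` accepts a separator and a `strip` flag, and `stripped_strings` yields each non-empty piece separately. The example reuses the original (unmodified) sample page.

```python
from bs4 import BeautifulSoup

html_doc = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <h1>Main Heading</h1>
    <p class="intro">Welcome to the page!</p>
    <a href="http://example.com">Example Link</a>
  </body>
</html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Join the text pieces with a space and drop surrounding whitespace
text = soup.get_text(" ", strip=True)
print(text)    # Sample Page Main Heading Welcome to the page! Example Link

# One entry per non-empty string in the document
pieces = list(soup.stripped_strings)
print(pieces)  # ['Sample Page', 'Main Heading', 'Welcome to the page!', 'Example Link']
```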





Exercise: write a Python program that replaces a tag with its own contents, that is, removes the tag itself but keeps whatever is inside it.



In [37]:
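from bs4 import BeautifulSoup

markup = '<a href="https://w3resource.com/">Python exercises.<i>w3resource.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
atag = soup.find('a')

# unwrap() replaces <i> with its contents and returns the now-empty <i> tag
print(atag.i.unwrap())

# <a> still contains the text that used to be inside <i>
print(atag)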


<i></i>
<a href="https://w3resource.com/">Python exercises.w3resource.com</a>
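The `<i></i>` printed first is the return value of `unwrap()`: the tag that was removed, now empty because its contents stayed in the tree. For comparison, the sketch below puts `unwrap()` next to two related removal methods, `extract()` and `decompose()`, on made-up markup.

```python
from bs4 import BeautifulSoup

markup = "<p>Keep <b>this</b>, drop <i>that</i>, destroy <u>this too</u>.</p>"
soup = BeautifulSoup(markup, "html.parser")

# unwrap(): remove the tag, keep its contents in place; returns the empty tag
removed_wrapper = soup.b.unwrap()
print(removed_wrapper)   # <b></b>

# extract(): remove the tag together with its contents; returns the removed tag
extracted = soup.i.extract()
print(extracted)         # <i>that</i>

# decompose(): remove the tag and its contents and destroy them; returns None
result = soup.u.decompose()
print(result)            # None

print(soup)              # <p>Keep this, drop , destroy .</p>
```

`extract()` is the one to reach for when the removed piece is needed elsewhere, since the returned tag can be inserted into another part of the tree.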
