# **Introduction to BeautifulSoup**
## **What is BeautifulSoup?**
BeautifulSoup is a Python library for **parsing HTML and XML documents**. It is commonly used to extract data from web pages, modify the HTML structure, and navigate between elements.

It creates a **parse tree** from page source code, allowing easy extraction of elements like titles, headings, links, and tables.

### **Why Use BeautifulSoup?**
- **Easy to Use**: Simple syntax for extracting and modifying elements.
- **Handles Imperfect HTML**: Can parse malformed HTML and still extract useful data.
- **Integration with Other Libraries**: Works well with `requests` for fetching web pages.
- **Flexible Parsing**: Supports different parsers such as `html.parser`, `lxml`, and `html5lib`.
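
As a small illustration of the "imperfect HTML" point above, the sketch below feeds a deliberately malformed snippet to the built-in `html.parser`. The snippet and the variable names (`broken_html`, `broken_soup`) are made up for this example.

```python
from bs4 import BeautifulSoup

# Deliberately malformed: neither <p> nor <b> is ever closed.
broken_html = "<p>Unclosed <b>bold"

broken_soup = BeautifulSoup(broken_html, "html.parser")

# The parser still builds a tree, so the tags can be queried as usual.
bold_text = broken_soup.b.get_text()
paragraph_text = broken_soup.p.get_text()

print(bold_text)
print(paragraph_text)
```

Different parsers may repair broken markup in different ways, so the resulting tree can vary if `lxml` or `html5lib` is used instead.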

## **Installing BeautifulSoup**
To use BeautifulSoup, install it using pip:


! pip install beautifulsoup4 requests lxml

- `beautifulsoup4`: The main library for parsing HTML/XML.
- `requests`: Helps fetch web pages.
- `lxml`: An optional third-party parser, generally faster than the built-in `html.parser`.

## **Fetching a Web Page**
Before parsing, we need to **retrieve the webpage's source code**. We use the `requests` library:

In [None]:
import requests
from bs4 import BeautifulSoup

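# Fetch the webpage (give up after 10 seconds instead of waiting indefinitely)
url = "https://example.com"
response = requests.get(url, timeout=10)
response.raise_for_status()  # Raise an exception on 4xx/5xx responses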

# Parse HTML using BeautifulSoup
soup = BeautifulSoup(response.text, "html.parser")

# Print the formatted HTML
print(soup.prettify())


- `requests.get(url)`: Fetches the webpage content.
- `soup = BeautifulSoup(response.text, "html.parser")`: Parses the HTML.
- `soup.prettify()`: Returns the HTML as an indented, readable string, which `print()` then displays.
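
BeautifulSoup parses any HTML string, not only a downloaded page, which is handy for experimenting without a network connection. The sketch below assumes a small hypothetical document (`sample_html`) invented for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, used here only for offline experimentation.
sample_html = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <h1>Welcome</h1>
    <p>First paragraph with a <a href="/about">link</a>.</p>
  </body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

print(sample_soup.title.string)   # Text of the <title> tag
print(sample_soup.h1.get_text())  # Text of the first <h1> tag
```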

## **Navigating the HTML Structure**
### **1. Extracting the Title**

In [None]:
print(soup.title)      # Full <title> tag
print(soup.title.text) # Title content only

### **2. Extracting the First Heading**

In [None]:
print(soup.h1)         # First <h1> tag
print(soup.h1.text)    # Content inside <h1>
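
Attribute-style access such as `soup.h1` gives `None` when the tag is absent, so calling `.text` on it would fail. The sketch below shows one way to guard against that, along with a step up the tree via `.parent`. The HTML string and the helper `text_or_default` are hypothetical, made up for this example.

```python
from bs4 import BeautifulSoup

nav_html = (
    "<html><head><title>Demo</title></head>"
    "<body><h1>Main <em>heading</em></h1></body></html>"
)
nav_soup = BeautifulSoup(nav_html, "html.parser")


def text_or_default(tag, default=""):
    """Return the tag's text, or a default when the tag was not found."""
    if tag is None:
        return default
    # Join nested strings with a space and trim surrounding whitespace.
    return tag.get_text(" ", strip=True)


missing_tag = nav_soup.h2                        # No <h2> in this document
heading_text = text_or_default(nav_soup.h1)      # Text from <h1> and its children
fallback_text = text_or_default(nav_soup.h2, "n/a")
parent_name = nav_soup.h1.parent.name            # Tag that contains the <h1>
heading_string = nav_soup.h1.string              # None: <h1> has more than one child

print(heading_text, fallback_text, parent_name)
```

Note that `.string` only returns text when a tag has a single child, whereas `.text` and `.get_text()` collect the text of all descendants.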

### **3. Finding All Links**

In [None]:
for link in soup.find_all("a", href=True):  # Skip <a> tags that have no href
    print(link["href"])  # Extracts URLs from <a> tags
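
Links on real pages are often relative (for example `/about`). A minimal sketch of turning them into absolute URLs with the standard library's `urljoin` follows. The base URL and the HTML snippet are assumptions made up for this example.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://example.com/docs/"  # Assumed address of the page being parsed
links_html = (
    '<a href="/about">About</a>'
    '<a href="guide.html">Guide</a>'
    '<a href="https://www.python.org/">Python</a>'
    '<a name="top">No href here</a>'
)
links_soup = BeautifulSoup(links_html, "html.parser")

# href=True keeps only <a> tags that actually carry an href attribute.
absolute_links = [
    urljoin(base_url, a["href"])
    for a in links_soup.find_all("a", href=True)
]

for absolute_link in absolute_links:
    print(absolute_link)
```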

## **Searching for Elements**
### **1. Finding a Specific Element by Class**

In [None]:
soup.find("div", class_="example-class")

- **`.find()`** returns the first matching element, or `None` if nothing matches.

### **2. Finding Multiple Elements**

In [None]:
soup.find_all("p")

- **`.find_all()`** returns a list-like `ResultSet` of all matching elements (empty if nothing matches).
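
The difference between the two methods, and a few common filters, can be sketched on a small hypothetical snippet (`search_html` is made up for this example):

```python
from bs4 import BeautifulSoup

search_html = """
<div class="card featured">A</div>
<div class="card">B</div>
<p id="intro">Intro</p>
<p>Body</p>
"""
search_soup = BeautifulSoup(search_html, "html.parser")

# class_ matches any one of an element's classes, so "card featured" qualifies.
first_card = search_soup.find("div", class_="card")
all_cards = search_soup.find_all("div", class_="card")

# Other attributes can be passed as keyword arguments.
intro = search_soup.find("p", id="intro")

# limit caps the number of results returned by find_all().
first_paragraph_only = search_soup.find_all("p", limit=1)

# No <span> exists: find() gives None, find_all() gives an empty result.
no_match = search_soup.find("span")
no_matches = search_soup.find_all("span")

print(first_card.get_text(), len(all_cards), intro.get_text())
```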

## **Modifying HTML Content**
BeautifulSoup can also **modify** the parsed tree. Changes affect only the in-memory copy of the document, not the live webpage.

### **1. Changing Text Inside an Element**

In [None]:
tag = soup.find("h1")
tag.string = "New Heading"
print(soup.h1)  # Modified heading

### **2. Removing an Element**

In [None]:
soup.find("p").decompose()  # Removes first <p> tag
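
Beyond changing text and deleting tags, the tree can also gain new elements. The sketch below, on a hypothetical snippet, sets an attribute, appends a new tag created with `new_tag()`, and removes a tag with `extract()`, which returns the removed element instead of destroying it as `decompose()` does.

```python
from bs4 import BeautifulSoup

edit_soup = BeautifulSoup(
    '<div id="box"><p>Old</p><span>Ad</span></div>', "html.parser"
)
box = edit_soup.find("div", id="box")

# Set (or overwrite) an attribute with dictionary-style assignment.
box["data-state"] = "edited"

# Create a new <a> tag and append it as the last child of the <div>.
new_link = edit_soup.new_tag("a", href="https://example.com")
new_link.string = "Example"
box.append(new_link)

# extract() detaches the tag from the tree and hands it back.
removed_span = box.span.extract()

print(edit_soup)
```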

## **Extracting Data from a Table**
Tables are common in web scraping. Let's extract a **table's content**:

In [None]:
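table = soup.find("table")

if table is not None:  # find() returns None when the page has no <table>
    for row in table.find_all("tr"):
        # Include <th> so header rows are not printed as empty lists
        cells = row.find_all(["th", "td"])
        data = [cell.get_text(strip=True) for cell in cells]
        print(data)

- **`find("table")`**: Locates the first table, or returns `None` if the page has none.
- **`find_all("tr")`**: Finds all rows.
- **`find_all(["th", "td"])`**: Collects the header and data cells in each row.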
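
When a table has a header row, each data row can be paired with the header names. The following is a minimal sketch under the assumption that headers use `<th>` and data cells use `<td>`. The table contents are made up for this example.

```python
from bs4 import BeautifulSoup

table_html = """
<table>
  <tr><th>Name</th><th>Language</th></tr>
  <tr><td>Ada</td><td>Python</td></tr>
  <tr><td>Linus</td><td>C</td></tr>
</table>
"""
table_soup = BeautifulSoup(table_html, "html.parser")
sample_table = table_soup.find("table")

# Column names come from the <th> cells.
headers = [th.get_text(strip=True) for th in sample_table.find_all("th")]

records = []
for row in sample_table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if cells:  # The header row has no <td> cells, so it is skipped here
        records.append(dict(zip(headers, cells)))

print(records)
```

A list of dictionaries like `records` is convenient to pass on to other tools, for example when writing CSV or JSON.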

## **Handling Attributes**
Tag **attributes** such as `href`, `src`, and `id` can be read with dictionary-style access:

In [None]:
link = soup.find("a")  
print(link["href"])  # Extracts the href attribute

For images:

In [None]:
img = soup.find("img")
if img is not None:  # Not every page contains an <img> tag
    print(img.get("src"))  # Gets image source URL (None if src is missing)
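
Square-bracket access raises a `KeyError` when the attribute is missing, so `.get()` is often the safer choice. The sketch below uses a hypothetical snippet to show `.get()`, `.attrs`, and `.has_attr()`, and that `class` comes back as a list.

```python
from bs4 import BeautifulSoup

attr_soup = BeautifulSoup(
    '<a id="home" class="nav primary" href="/">Home</a><img alt="logo">',
    "html.parser",
)
home_link = attr_soup.find("a")
logo = attr_soup.find("img")

all_attributes = home_link.attrs                 # Dictionary of every attribute
link_classes = home_link["class"]                # Multi-valued, so a list
missing_src = logo.get("src")                    # None instead of a KeyError
src_with_default = logo.get("src", "no-source")  # Fallback value
has_alt = logo.has_attr("alt")

print(all_attributes)
print(link_classes, missing_src, src_with_default, has_alt)
```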

## **Using CSS Selectors**
BeautifulSoup supports CSS selectors with `.select()`:

In [None]:
soup.select("div.class-name")   # Finds <div> elements with a specific class
soup.select("p > a")            # Finds links that are direct children of paragraphs
soup.select("table tr td")      # Finds table cells
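
The child combinator (`>`) and the descendant combinator (a space) behave differently, which the sketch below shows on a hypothetical snippet. It also uses `select_one()`, which returns the first match or `None`, and an attribute selector.

```python
from bs4 import BeautifulSoup

css_html = """
<div class="menu">
  <p><a href="https://example.com/a">Direct child</a></p>
  <p><span><a href="/b">Nested deeper</a></span></p>
</div>
"""
css_soup = BeautifulSoup(css_html, "html.parser")

direct_links = css_soup.select("p > a")   # Only <a> tags directly under a <p>
nested_links = css_soup.select("p a")     # Any <a> anywhere inside a <p>

# Attribute selector: href values that start with "https://".
external_links = css_soup.select('a[href^="https://"]')

menu = css_soup.select_one("div.menu")           # First match
missing_menu = css_soup.select_one("div.other")  # None when nothing matches

print(len(direct_links), len(nested_links), len(external_links))
```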

## **Conclusion**
BeautifulSoup is a powerful tool for **web scraping and data extraction**. It provides:
1. **Simple navigation** through HTML structures.
2. **Flexible element searching** using tags, classes, and attributes.
3. **Modifications** to web content.
4. **Data extraction from tables, links, and images**.

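For large-scale projects, BeautifulSoup is often **combined with Selenium or Scrapy**: Selenium can render pages that rely on JavaScript, which BeautifulSoup does not execute, and Scrapy handles crawling many pages.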