# Web Scraping with BeautifulSoup: Important Methods
This notebook walks through the BeautifulSoup (`bs4`) methods most often used in web scraping tasks, each with a short runnable example.

## 1. `find` and `find_all`
**Description**: `find` returns the first element that matches a tag name, or `None` if nothing matches. `find_all` returns a list of every matching element.

In [1]:
from bs4 import BeautifulSoup

html = '''
<html>
    <body>
        <div class="content">
            <h1>Welcome to the Site</h1>
            <p class="intro">This is an introductory paragraph.</p>
            <p>This is another paragraph.</p>
        </div>
    </body>
</html>
'''

soup = BeautifulSoup(html, 'html.parser')

# Find the first <p> tag
first_paragraph = soup.find('p')
print(first_paragraph.text)  # Output: This is an introductory paragraph.

# Find all <p> tags
all_paragraphs = soup.find_all('p')
for p in all_paragraphs:
    print(p.text)
# Output:
# This is an introductory paragraph.
# This is another paragraph.


This is an introductory paragraph.
This is an introductory paragraph.
This is another paragraph.
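
The two methods behave differently when nothing matches, which is a common source of `AttributeError` in scraping code. The following is a minimal sketch of that behavior, using a made-up HTML snippet rather than the one above. It also shows the `limit` argument of `find_all`.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet used only for this illustration
html = '<div><p class="intro">One</p><p>Two</p><p>Three</p></div>'
soup = BeautifulSoup(html, 'html.parser')

# find returns None when nothing matches, so check before using the result
missing = soup.find('table')
table_text = missing.text if missing is not None else ''

# find_all returns an empty list when nothing matches, so looping is always safe
no_tables = soup.find_all('table')
for table in no_tables:
    print(table.text)  # never runs

# limit stops the search after the given number of matches
first_two = [p.text for p in soup.find_all('p', limit=2)]
print(first_two)  # ['One', 'Two']
```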


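## 2. Finding by Class: `find` and `find_all` with `class_`
**Description**: Find elements by their CSS class by passing the `class_` keyword argument to `find` or `find_all`. BeautifulSoup has no separate `find_by_class` method; the trailing underscore is needed because `class` is a reserved word in Python.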

In [2]:
# Find the first element with the class 'intro'
first_intro = soup.find(class_='intro')
print(first_intro.text)  # Output: This is an introductory paragraph.

# Find all elements with the class 'intro'
all_intro = soup.find_all(class_='intro')
for intro in all_intro:
    print(intro.text)
# Output:
# This is an introductory paragraph.


This is an introductory paragraph.
This is an introductory paragraph.
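
Elements often carry more than one class. The sketch below, built on a made-up snippet, suggests how `class_` treats such elements and how a CSS selector can be used when several classes are required together.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet used only for this illustration
html = '''
<p class="intro lead">First</p>
<p class="intro">Second</p>
<p class="lead">Third</p>
'''
soup = BeautifulSoup(html, 'html.parser')

# class_ matches any element whose class list contains the given class
intro_texts = [p.text for p in soup.find_all(class_='intro')]
print(intro_texts)  # ['First', 'Second']

# To require several classes at once, a CSS selector is order-independent
both_texts = [p.text for p in soup.select('p.intro.lead')]
print(both_texts)  # ['First']
```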


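## 3. Finding by ID: `find` with `id`
**Description**: Find an element by its ID by passing the `id` keyword argument to `find`. BeautifulSoup has no separate `find_by_id` method.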

In [3]:
html = '''
<html>
    <body>
        <div id="header">Header Content</div>
        <div id="footer">Footer Content</div>
    </body>
</html>
'''

soup = BeautifulSoup(html, 'html.parser')

# Find the element with id 'header'
header = soup.find(id='header')
print(header.text)  # Output: Header Content


Header Content


## 4. `select` and `select_one`
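**Description**: Use CSS selectors to find elements. `select` returns every match; `select_one` returns only the first match, or `None` if there is none.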

In [4]:
html = '''
<html>
    <body>
        <div class="container">
            <a href="/home">Home</a>
            <a href="/about">About</a>
        </div>
    </body>
</html>
'''

soup = BeautifulSoup(html, 'html.parser')

# Select all <a> tags within a div with class 'container'
links = soup.select('div.container a')
for link in links:
    print(link['href'])
# Output:
# /home
# /about

# Select the first <a> tag within a div with class 'container'
first_link = soup.select_one('div.container a')
print(first_link['href'])  # Output: /home


/home
/about
/home
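
CSS selectors can also filter on attribute values and on direct parent/child relationships. The following is a small sketch under the assumption of a made-up snippet with one internal and one external link; it also shows what the two methods return when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet used only for this illustration
html = '''
<div class="container">
    <a href="/home">Home</a>
    <a href="https://example.com">External</a>
    <span>No link here</span>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# '>' selects direct children; [href^="/"] keeps hrefs that start with '/'
internal = [a['href'] for a in soup.select('div.container > a[href^="/"]')]
print(internal)  # ['/home']

# select_one returns None when nothing matches
missing_image = soup.select_one('div.container img')
print(missing_image)  # None

# select returns an empty result when nothing matches
all_images = soup.select('img')
print(len(all_images))  # 0
```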


## 5. `get` Method
**Description**: Retrieve the value of an attribute.

In [5]:
html = '''
<html>
    <body>
        <a href="https://example.com" title="Example Site">Visit Example</a>
    </body>
</html>
'''

soup = BeautifulSoup(html, 'html.parser')

# Find the <a> tag and get the 'href' attribute
link = soup.find('a')
href_value = link.get('href')
print(href_value)  # Output: https://example.com

# Get the 'title' attribute
title_value = link.get('title')
print(title_value)  # Output: Example Site


https://example.com
Example Site
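
The reason to prefer `get` over square brackets is how each handles a missing attribute. A minimal sketch, using a made-up link that has no `title` attribute:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: this link has an href but no title
html = '<a href="https://example.com">Visit Example</a>'
soup = BeautifulSoup(html, 'html.parser')
link = soup.find('a')

# get returns None for a missing attribute
missing_title = link.get('title')
print(missing_title)  # None

# get accepts a fallback value as its second argument
title_or_default = link.get('title', 'No title')
print(title_or_default)  # No title

# Square brackets raise KeyError for a missing attribute
try:
    link['title']
    raised_key_error = False
except KeyError:
    raised_key_error = True
print(raised_key_error)  # True
```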


## 6. Navigating the Parse Tree
**Description**: Navigate through elements using parent, children, siblings, etc.

In [6]:
html = '''
<html>
    <body>
        <div class="content">
            <h1>Title</h1>
            <p>First paragraph.</p>
            <p>Second paragraph.</p>
        </div>
    </body>
</html>
'''

soup = BeautifulSoup(html, 'html.parser')

# Find the first <p> tag
first_paragraph = soup.find('p')

# Get parent element
parent_div = first_paragraph.parent
print(parent_div['class'])  # Output: ['content']

# Get next sibling (next <p> tag)
next_paragraph = first_paragraph.find_next_sibling('p')
print(next_paragraph.text)  # Output: Second paragraph.


['content']
Second paragraph.
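
The example above uses `find_next_sibling` rather than the `.next_sibling` attribute for a reason: in indented HTML, the node directly after a tag is usually whitespace text. The sketch below illustrates this on a similar, made-up snippet and shows one way to iterate over child tags only.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical snippet used only for this illustration
html = '''
<div class="content">
    <h1>Title</h1>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
first_paragraph = soup.find('p')

# .next_sibling is the very next node: here, the whitespace between the tags
raw_sibling = first_paragraph.next_sibling
print(raw_sibling.strip() == '')  # True

# .children yields text nodes as well, so keep only Tag objects
child_tags = [child.name for child in soup.find('div').children
              if isinstance(child, Tag)]
print(child_tags)  # ['h1', 'p', 'p']

# find_previous_sibling walks backwards and skips text nodes
heading = first_paragraph.find_previous_sibling('h1')
print(heading.text)  # Title
```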


## 7. `get_text` Method
**Description**: Extract all text from an element, including from nested tags.

In [7]:
html = '''
<html>
    <body>
        <div class="content">
            <h1>Title</h1>
            <p>First paragraph.</p>
            <p>Second paragraph.</p>
        </div>
    </body>
</html>
'''

soup = BeautifulSoup(html, 'html.parser')

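# Extract all text from the div with class 'content'.
# strip=True trims each text fragment and drops whitespace-only ones;
# without it, the newlines and indentation between tags end up in the result.
content_text = soup.find('div', class_='content').get_text(separator=' ', strip=True)
print(content_text)
# Output:
# Title First paragraph. Second paragraph.


Title First paragraph. Second paragraph.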



## 8. `attrs` Attribute
**Description**: Get all attributes of an element as a dictionary. `attrs` is an attribute of the tag, not a method, so it is accessed without parentheses.

In [8]:
html = '''
<html>
    <body>
        <a href="https://example.com" title="Example Site" class="link">Visit Example</a>
    </body>
</html>
'''

soup = BeautifulSoup(html, 'html.parser')

# Find the <a> tag and get all its attributes
link = soup.find('a')
attributes = link.attrs
print(attributes)
# Output: {'href': 'https://example.com', 'title': 'Example Site', 'class': ['link']}


{'href': 'https://example.com', 'title': 'Example Site', 'class': ['link']}
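
Note that `class` comes back as a list while `href` and `title` are plain strings, because `class` can hold several values. The following sketch, on a made-up link with two classes and a `data-id` attribute, shows a few ways of working with the dictionary.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet used only for this illustration
html = '<a href="https://example.com" class="link external" data-id="42">Visit</a>'
soup = BeautifulSoup(html, 'html.parser')
link = soup.find('a')

# Multi-valued attributes such as class are split into a list
classes = link.attrs['class']
print(classes)  # ['link', 'external']

# Join the list to get the original string form back
class_string = ' '.join(link.get('class', []))
print(class_string)  # link external

# Attribute values are strings, so convert numbers explicitly
data_id = int(link.attrs['data-id'])
print(data_id)  # 42

# attrs is a dict, so the usual membership test works
has_target = 'target' in link.attrs
print(has_target)  # False
```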


## 9. Using Regular Expressions
**Description**: Use regex to match tags, attributes, or text content.

In [9]:
import re
from bs4 import BeautifulSoup

html = '''
<html>
    <body>
        <a href="https://example.com/page1">Link 1</a>
        <a href="https://example.com/page2">Link 2</a>
        <a href="/internal/page3">Internal Link</a>
    </body>
</html>
'''

soup = BeautifulSoup(html, 'html.parser')

# Find all <a> tags whose href starts with 'https://'
external_links = soup.find_all('a', href=re.compile(r'^https://'))
for link in external_links:
    print(link['href'])
# Output:
# https://example.com/page1
# https://example.com/page2


https://example.com/page1
https://example.com/page2
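
The example above matches an attribute value. The sketch below covers the other two cases named in the description, tag names and text content, on a made-up snippet. It assumes a version of `bs4` recent enough to accept the `string` argument (4.4.0 or later; older releases called it `text`).

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet used only for this illustration
html = '''
<h1>Main heading</h1>
<h2>Sub heading</h2>
<p>Price: 42 USD</p>
<p>No price here</p>
'''
soup = BeautifulSoup(html, 'html.parser')

# A compiled pattern in place of a tag name matches against tag names
heading_names = [tag.name for tag in soup.find_all(re.compile(r'^h[1-6]$'))]
print(heading_names)  # ['h1', 'h2']

# string= matches against the text of each candidate tag
priced = [p.text for p in soup.find_all('p', string=re.compile(r'\d+ USD'))]
print(priced)  # ['Price: 42 USD']
```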
