# 📚 Complete Guide: Requests & BeautifulSoup

## Comprehensive Reference for Web Scraping

This notebook covers:
- **Requests Library**: HTTP methods, response attributes, headers, sessions
- **BeautifulSoup**: Parsing methods, navigation, searching, attributes
- **Common Patterns**: Real-world examples and best practices

---

# Part 1: Requests Library 🌐

## What is Requests?

The `requests` library is used to send HTTP requests to web servers and receive responses.

In [None]:
import requests
from bs4 import BeautifulSoup
import json

## 1.1 Basic HTTP Methods

### GET Request (Most Common)

In [None]:
# Basic GET request
url = "https://books.toscrape.com/"
response = requests.get(url)

print(f"Status Code: {response.status_code}")
print(f"Response Type: {type(response)}")

### POST Request (Sending Data)

In [None]:
# POST request with data
post_url = "https://httpbin.org/post"
data = {
    'username': 'testuser',
    'password': 'testpass'
}

response = requests.post(post_url, data=data)
print(f"Status: {response.status_code}")
print(f"Response JSON: {response.json()}")
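
The `data=` argument above sends a form-encoded body. As a minimal sketch of how that differs from the `json=` argument, the example below builds both requests with `prepare()` and inspects them without sending anything. The variable names `form_request` and `json_request` are illustrative only.

```python
import json
import requests

payload = {'username': 'testuser', 'password': 'testpass'}

# prepare() builds the request without sending it, so no network access is needed
form_request = requests.Request('POST', 'https://httpbin.org/post', data=payload).prepare()
json_request = requests.Request('POST', 'https://httpbin.org/post', json=payload).prepare()

# data= produces a form-encoded body
print(form_request.headers['Content-Type'])
print(form_request.body)

# json= serializes the dict and sets a JSON content type
print(json_request.headers['Content-Type'])
print(json.loads(json_request.body))
```

Which one to use depends on what the server expects: HTML login forms usually take form-encoded data, while many APIs expect JSON.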

### Other HTTP Methods

In [None]:
# PUT - Update resource
# response = requests.put(url, data=data)

# DELETE - Delete resource
# response = requests.delete(url)

# PATCH - Partial update
# response = requests.patch(url, data=data)

# HEAD - Get headers only (no body)
response = requests.head(url)
print(f"HEAD request headers: {response.headers}")

## 1.2 Response Object Attributes

### Essential Attributes

In [None]:
response = requests.get(url)

# Status code (200 = OK, 404 = Not Found, 500 = Server Error)
print(f"1. status_code: {response.status_code}")

# Boolean: True if status_code < 400
print(f"2. ok: {response.ok}")

# Response text (HTML as string)
print(f"3. text (first 100 chars): {response.text[:100]}")

# Response content (bytes)
print(f"4. content (first 100 bytes): {response.content[:100]}")

# Response URL (final URL after redirects)
print(f"5. url: {response.url}")

# Response headers (dictionary-like)
print(f"6. headers: {dict(response.headers)}")

# Encoding
print(f"7. encoding: {response.encoding}")

# Cookies
print(f"8. cookies: {response.cookies}")

# Elapsed time
print(f"9. elapsed: {response.elapsed}")

# Request history (redirects)
print(f"10. history: {response.history}")

### Response Methods

In [None]:
# Test JSON endpoint
json_url = "https://httpbin.org/json"
response = requests.get(json_url)

# Parse JSON response
print("1. json() - Parse JSON:")
print(response.json())

# Raise exception for bad status codes
print("\n2. raise_for_status() - Check for errors:")
try:
    response.raise_for_status()
    print("   No errors!")
except requests.exceptions.HTTPError as e:
    print(f"   Error: {e}")
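
To see how `ok` and `raise_for_status()` react to different status codes without calling a real server, the sketch below builds `Response` objects by hand. This is for illustration only: in normal use a `Response` comes back from `requests.get()` and is never constructed manually. The helper name `classify` and the URL are hypothetical.

```python
import requests

def classify(status_code):
    """Return (ok, raised) for a hand-built Response with the given status code."""
    response = requests.Response()             # built by hand purely for illustration
    response.status_code = status_code
    response.url = "https://example.com/page"  # hypothetical URL, never contacted
    try:
        response.raise_for_status()
        raised = False
    except requests.exceptions.HTTPError:
        raised = True
    return response.ok, raised

results = {code: classify(code) for code in (200, 301, 404, 500)}
for code, (ok, raised) in results.items():
    print(f"{code}: ok={ok}, raise_for_status() raised={raised}")
```

Only the 4xx and 5xx codes raise; a 3xx response counts as `ok`.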

## 1.3 Request Parameters

### Query Parameters

In [None]:
# URL with query parameters
params_url = "https://httpbin.org/get"
params = {
    'search': 'python',
    'page': 1,
    'limit': 10
}

response = requests.get(params_url, params=params)
print(f"Final URL: {response.url}")
print(f"Response: {response.json()}")
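
The way `params` is turned into a query string can be checked offline. This sketch prepares the request without sending it and prints the resulting URL; the `tag` parameter is a made-up example showing that a list value becomes a repeated key.

```python
import requests

params = {'search': 'python', 'page': 1, 'limit': 10}

# prepare() encodes the parameters into the URL without making a network call
prepared = requests.Request('GET', 'https://httpbin.org/get', params=params).prepare()
print(prepared.url)

# A list value is sent as a repeated key ('tag' is a hypothetical parameter)
multi = requests.Request('GET', 'https://httpbin.org/get', params={'tag': ['a', 'b']}).prepare()
print(multi.url)
```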

### Headers

In [None]:
# Custom headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept': 'text/html,application/xhtml+xml',
    'Accept-Language': 'en-US,en;q=0.9',
    'Referer': 'https://www.google.com/'
}

response = requests.get(url, headers=headers)
print("Request sent with custom headers")
print(f"Status: {response.status_code}")

### Cookies

In [None]:
# Send cookies with request
cookies = {
    'session_id': 'abc123',
    'user_token': 'xyz789'
}

cookie_url = "https://httpbin.org/cookies"
response = requests.get(cookie_url, cookies=cookies)
print(f"Cookies sent: {response.json()}")

### Timeout

In [None]:
# Set timeout (in seconds)
try:
    response = requests.get(url, timeout=5)
    print(f"Request completed in {response.elapsed.total_seconds():.2f} seconds")
except requests.exceptions.Timeout:
    print("Request timed out!")

## 1.4 Sessions (Persistent Connections)

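A `Session` keeps cookies and default settings (such as headers) across requests, and reuses the underlying connection when you make several requests to the same host.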

In [None]:
# Create a session
session = requests.Session()

# Set default headers for all requests
session.headers.update({
    'User-Agent': 'My Scraper 1.0'
})

# Make requests with session
response1 = session.get(url)
response2 = session.get(url)

print(f"Session cookies: {session.cookies}")

# Close session
session.close()
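
A session can also be used as a context manager, which closes it automatically. The sketch below additionally shows, without any network traffic, how session-level headers are merged with per-request headers; `prepare_request()` is used here only to inspect the result, and the `Accept` value is an arbitrary example.

```python
import requests

with requests.Session() as session:
    # Default header applied to every request made through this session
    session.headers.update({'User-Agent': 'My Scraper 1.0'})

    # Per-request headers are merged on top of the session defaults
    request = requests.Request(
        'GET',
        'https://books.toscrape.com/',
        headers={'Accept': 'text/html'},
    )
    prepared = session.prepare_request(request)

    print(prepared.headers['User-Agent'])
    print(prepared.headers['Accept'])
# The session is closed here, even if an exception was raised inside the block
```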

## 1.5 Error Handling

In [None]:
try:
    response = requests.get(url, timeout=5)
    response.raise_for_status()  # Raise exception for 4xx/5xx status codes
    
except requests.exceptions.HTTPError as e:
    print(f"HTTP Error: {e}")
    
except requests.exceptions.Timeout as e:
    # Checked before ConnectionError because ConnectTimeout inherits from both
    print(f"Timeout Error: {e}")
    
except requests.exceptions.ConnectionError as e:
    print(f"Connection Error: {e}")
    
except requests.exceptions.RequestException as e:
    print(f"General Error: {e}")
    
else:
    print(f"Success! Status: {response.status_code}")
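
In a scraper it is often convenient to wrap this pattern in one helper. The following is a minimal sketch, not a definitive implementation: `fetch_html` is a hypothetical name, and it relies on every Requests exception deriving from `RequestException`. The demonstration call uses a URL without a scheme, which Requests rejects before any network traffic happens.

```python
import requests

def fetch_html(url, timeout=5):
    """Return the page body as text, or None if the request fails for any reason."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        # Covers HTTP errors, timeouts, connection errors and invalid URLs
        print(f"Request failed: {e}")
        return None
    return response.text

# A URL without a scheme is rejected while the request is being prepared
result = fetch_html("not-a-valid-url")
print(result)
```

Returning `None` keeps the calling loop simple, at the cost of hiding which kind of failure occurred; catch the specific exceptions when you need to react differently to each.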

---

# Part 2: BeautifulSoup 🍜

## What is BeautifulSoup?

BeautifulSoup parses HTML/XML and creates a navigable tree structure.

## 2.1 Creating a Soup Object

In [None]:
# Fetch HTML
response = requests.get(url)
html = response.text

# Create soup object
# Parsers: 'html.parser' (built-in), 'lxml' (fast), 'html5lib' (lenient)
# 'lxml' and 'html5lib' are separate packages and must be installed first
soup = BeautifulSoup(html, 'lxml')

print(f"Soup type: {type(soup)}")
print(f"Title: {soup.title}")
print(f"Title text: {soup.title.string}")

## 2.2 Finding Elements

### find() - Find First Match

In [None]:
# Find first <h1> tag
h1 = soup.find('h1')
print(f"First h1: {h1}")
print(f"h1 text: {h1.get_text() if h1 else 'Not found'}")

# Find by class
price = soup.find('p', class_='price_color')
print(f"\nFirst price: {price}")

# Find by id
element = soup.find(id='some-id')
print(f"\nElement by ID: {element}")

# Find by attributes
link = soup.find('a', href=True)  # Find <a> with href attribute
print(f"\nFirst link: {link}")

### find_all() - Find All Matches

In [None]:
# Find all <h3> tags
all_h3 = soup.find_all('h3')
print(f"Found {len(all_h3)} h3 tags")

# Find all with class
all_prices = soup.find_all('p', class_='price_color')
print(f"Found {len(all_prices)} prices")

# Find multiple tags
headings = soup.find_all(['h1', 'h2', 'h3'])
print(f"Found {len(headings)} headings (h1, h2, h3)")

# Limit results
first_5_links = soup.find_all('a', limit=5)
print(f"First 5 links: {len(first_5_links)}")

# Find with attributes
images = soup.find_all('img', src=True)
print(f"Found {len(images)} images with src attribute")
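
One practical difference between `find()` and `find_all()` is what they return when nothing matches. The sketch below uses a small inline HTML string (made up for the example) and the built-in `html.parser`, so it runs without a network connection or extra packages.

```python
from bs4 import BeautifulSoup

html = "<div><p class='price_color'>10.00</p><p class='price_color'>12.50</p></div>"
soup = BeautifulSoup(html, 'html.parser')

missing_one = soup.find('table')      # no match: find() returns None
missing_all = soup.find_all('table')  # no match: find_all() returns an empty list-like result

prices = [p.get_text() for p in soup.find_all('p', class_='price_color')]

print(missing_one)
print(len(missing_all))
print(prices)
```

This is why the result of `find()` should be checked before calling methods on it, while the result of `find_all()` can be looped over directly.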

### select() - CSS Selectors

In [None]:
# CSS selector examples

# By tag
divs = soup.select('div')
print(f"1. All divs: {len(divs)}")

# By class (use .classname)
prices = soup.select('.price_color')
print(f"2. By class .price_color: {len(prices)}")

# By ID (use #id)
element = soup.select('#some-id')
print(f"3. By ID #some-id: {len(element)}")

# Descendant selector
article_h3 = soup.select('article h3')
print(f"4. h3 inside article: {len(article_h3)}")

# Direct child (>)
direct_children = soup.select('div > p')
print(f"5. p directly inside div: {len(direct_children)}")

# Attribute selector
links_with_href = soup.select('a[href]')
print(f"6. Links with href: {len(links_with_href)}")

# Attribute value
specific_links = soup.select('a[href="/index.html"]')
print(f"7. Links to /index.html: {len(specific_links)}")

# Multiple classes
elements = soup.select('.class1.class2')
print(f"8. Elements with both classes: {len(elements)}")
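
Because several of the selectors above use placeholder names that do not exist on the demo page, here is a small offline sketch with inline HTML (invented for the example) that shows the descendant, child, and multi-class selectors returning different results.

```python
from bs4 import BeautifulSoup

html = """
<div id="main">
  <p class="note">direct child</p>
  <section><p class="note highlight">nested deeper</p></section>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

descendants = soup.select('div p')             # any <p> anywhere inside a <div>
children = soup.select('div > p')              # only <p> that is a direct child of a <div>
both_classes = soup.select('.note.highlight')  # elements carrying both classes
by_id = soup.select_one('#main')               # first (and only) element with this id

print(len(descendants), len(children))
print(both_classes[0].get_text())
print(by_id.name)
```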

### select_one() - First CSS Match

In [None]:
# Find first match with CSS selector
first_price = soup.select_one('.price_color')
print(f"First price: {first_price}")
print(f"Price text: {first_price.get_text() if first_price else 'Not found'}")

## 2.3 Navigating the Tree

### Parent, Children, Siblings

In [None]:
# Get first article
article = soup.find('article')

if article:
    # Parent
    print(f"1. Parent tag: {article.parent.name}")
    
    # Children (direct child nodes; text nodes have no tag name and are skipped)
    print(f"\n2. Children:")
    child_tags = [child for child in article.children if child.name]
    for i, child in enumerate(child_tags, 1):
        print(f"   {i}. {child.name}")
    
    # Descendants (all nested nodes, including text nodes)
    print(f"\n3. Total descendants: {len(list(article.descendants))}")
    
    # Next sibling
    next_sib = article.find_next_sibling()
    print(f"\n4. Next sibling: {next_sib.name if next_sib else 'None'}")
    
    # Previous sibling
    prev_sib = article.find_previous_sibling()
    print(f"5. Previous sibling: {prev_sib.name if prev_sib else 'None'}")
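
The cell above uses `find_next_sibling()` rather than the `.next_sibling` attribute. The sketch below, using a small inline list invented for the example, shows why: `.next_sibling` returns the very next node, which is often a whitespace text node, while `find_next_sibling()` skips to the next tag.

```python
from bs4 import BeautifulSoup

html = "<ul>\n<li>one</li>\n<li>two</li>\n</ul>"
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('li')

raw_next = first.next_sibling          # the newline between the two <li> tags
tag_next = first.find_next_sibling()   # the next tag, skipping text nodes

print(repr(raw_next))
print(tag_next.get_text())
```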

### Finding Next/Previous Elements

In [None]:
# Find first h3
h3 = soup.find('h3')

if h3:
    # Find next <p> tag
    next_p = h3.find_next('p')
    print(f"Next <p> after h3: {next_p}")
    
    # Find previous <div>
    prev_div = h3.find_previous('div')
    print(f"\nPrevious <div> before h3: {prev_div.name if prev_div else 'None'}")
    
    # Find all next siblings
    next_siblings = h3.find_next_siblings()
    print(f"\nNext siblings: {len(next_siblings)}")

## 2.4 Extracting Data

### Getting Text

In [None]:
element = soup.find('h1')

if element:
    # Get text (with whitespace)
    text1 = element.get_text()
    print(f"1. get_text(): '{text1}'")
    
    # Get text (stripped)
    text2 = element.get_text(strip=True)
    print(f"2. get_text(strip=True): '{text2}'")
    
    # Get text with separator
    text3 = element.get_text(separator=' | ')
    print(f"3. get_text(separator=' | '): '{text3}'")
    
    # Using .string (returns None if the element has more than one child)
    text4 = element.string
    print(f"4. .string: '{text4}'")
    
    # Using .text (shortcut for get_text())
    text5 = element.text
    print(f"5. .text: '{text5}'")
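
On a simple `<h1>` these approaches all give similar results. The differences show up once an element contains nested tags, as in this offline sketch with made-up HTML.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p><h1>  Title  </h1>", 'html.parser')
p = soup.p
h1 = soup.h1

p_string = p.string                                   # None: <p> has two children
p_text = p.get_text()                                 # text of all descendants, joined
p_joined = p.get_text(separator=' | ', strip=True)    # each piece stripped, then joined

h1_string = h1.string                                 # single string child, whitespace kept
h1_stripped = h1.get_text(strip=True)

print(p_string)
print(p_text)
print(p_joined)
print(repr(h1_string))
print(h1_stripped)
```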

### Getting Attributes

In [None]:
# Find a link
link = soup.find('a')

if link:
    # Get attribute using dictionary syntax (raises KeyError if the attribute is missing)
    href1 = link['href']
    print(f"1. link['href']: {href1}")
    
    # Get attribute using .get() (safer, returns None if not found)
    href2 = link.get('href')
    print(f"2. link.get('href'): {href2}")
    
    # Get with default value
    title = link.get('title', 'No title')
    print(f"3. link.get('title', 'No title'): {title}")
    
    # Get all attributes
    attrs = link.attrs
    print(f"4. All attributes: {attrs}")
    
    # Check if attribute exists
    has_href = link.has_attr('href')
    print(f"5. Has href attribute: {has_href}")
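
Scraped `href` values are often relative, so they usually need to be combined with the page URL before they can be requested. A minimal sketch, assuming the page was fetched from `base_url` and using the standard library's `urljoin`; the inline HTML is invented for the example.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "https://books.toscrape.com/"  # assumed URL of the page the links came from
html = '<a href="catalogue/page-2.html">next</a><a name="top">anchor</a>'
soup = BeautifulSoup(html, 'html.parser')

absolute_links = []
for a in soup.find_all('a'):
    href = a.get('href')  # None for the second <a>, which has no href
    if href is not None:
        absolute_links.append(urljoin(base_url, href))

# Dictionary-style access on a missing attribute raises KeyError
try:
    soup.find_all('a')[1]['href']
    key_error_raised = False
except KeyError:
    key_error_raised = True

print(absolute_links)
print(key_error_raised)
```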

### Getting Class Names

In [None]:
# Find element with classes
element = soup.find('p', class_='price_color')

if element:
    # Get classes as list
    classes = element.get('class', [])
    print(f"Classes: {classes}")
    
    # Check if has specific class
    has_class = 'price_color' in element.get('class', [])
    print(f"Has 'price_color' class: {has_class}")

## 2.5 Advanced Searching

### Using Functions as Filters

In [None]:
# Find all tags with more than 2 attributes
def has_many_attrs(tag):
    return len(tag.attrs) > 2

elements = soup.find_all(has_many_attrs)
print(f"Elements with >2 attributes: {len(elements)}")

# Find links with absolute URLs (a rough proxy for external links; absolute links to the same site also match)
def is_external_link(tag):
    return tag.name == 'a' and tag.get('href', '').startswith('http')

external_links = soup.find_all(is_external_link)
print(f"External links: {len(external_links)}")
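
The `startswith('http')` test matches every absolute URL, including ones that point back to the same site. As a sketch of a stricter filter, the function below compares the link's host with the site's own host using `urlparse`. `SITE_HOST` and `links_to_other_host` are hypothetical names, and the inline HTML is invented for the example.

```python
from urllib.parse import urlparse
from bs4 import BeautifulSoup

SITE_HOST = "books.toscrape.com"  # assumed host of the page being scraped

html = """
<a href="https://books.toscrape.com/index.html">absolute, same site</a>
<a href="catalogue/page-2.html">relative</a>
<a href="https://example.com/">other site</a>
"""
soup = BeautifulSoup(html, 'html.parser')

def links_to_other_host(tag):
    """True for <a> tags whose href names a host other than SITE_HOST."""
    if tag.name != 'a' or not tag.has_attr('href'):
        return False
    host = urlparse(tag['href']).netloc  # empty string for relative URLs
    return bool(host) and host != SITE_HOST

external = [a['href'] for a in soup.find_all(links_to_other_host)]
print(external)
```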

### Using Regular Expressions

In [None]:
import re

# Find tags matching regex
headings = soup.find_all(re.compile('^h[1-6]$'))  # h1, h2, h3, h4, h5, h6
print(f"All headings: {len(headings)}")

# Find by attribute value matching regex
price_elements = soup.find_all('p', class_=re.compile('price'))
print(f"Elements with 'price' in class: {len(price_elements)}")

## 2.6 Practical Examples

### Example 1: Extract All Links

In [None]:
# Find all links
links = soup.find_all('a', href=True)

print(f"Found {len(links)} links\n")

for i, link in enumerate(links[:5], 1):
    href = link['href']
    text = link.get_text(strip=True)
    print(f"{i}. {text[:30]:30} → {href}")

### Example 2: Extract Table Data

In [None]:
# Example HTML with table
table_html = """
<table>
    <tr>
        <th>Name</th>
        <th>Age</th>
        <th>City</th>
    </tr>
    <tr>
        <td>Alice</td>
        <td>25</td>
        <td>NYC</td>
    </tr>
    <tr>
        <td>Bob</td>
        <td>30</td>
        <td>LA</td>
    </tr>
</table>
"""

table_soup = BeautifulSoup(table_html, 'lxml')
table = table_soup.find('table')

# Extract headers
headers = [th.get_text(strip=True) for th in table.find_all('th')]
print(f"Headers: {headers}")

# Extract rows
rows = []
for tr in table.find_all('tr')[1:]:  # Skip header row
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    rows.append(cells)

print(f"\nRows:")
for row in rows:
    print(row)
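
Rows stored as plain lists lose the link to their column names. One possible follow-up, sketched here with the same table in compact form, is to pair each row with the headers using `zip` so every row becomes a dictionary. The name `records` is illustrative, and `html.parser` is used only so the example needs no extra packages.

```python
from bs4 import BeautifulSoup

table_html = """<table>
<tr><th>Name</th><th>Age</th><th>City</th></tr>
<tr><td>Alice</td><td>25</td><td>NYC</td></tr>
<tr><td>Bob</td><td>30</td><td>LA</td></tr>
</table>"""

table = BeautifulSoup(table_html, 'html.parser').find('table')
headers = [th.get_text(strip=True) for th in table.find_all('th')]

records = []
for tr in table.find_all('tr')[1:]:  # skip the header row
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    records.append(dict(zip(headers, cells)))  # pair each cell with its column name

for record in records:
    print(record)
```

All values are still strings at this point; convert fields such as `Age` with `int()` if you need numbers.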

### Example 3: Extract Nested Data

In [None]:
# Find all book articles
books = soup.find_all('article', class_='product_pod')

print(f"Found {len(books)} books\n")

for i, book in enumerate(books[:3], 1):
    # Title (nested in h3 > a)
    title = book.find('h3').find('a')['title']
    
    # Price (in p.price_color)
    price = book.find('p', class_='price_color').get_text()
    
    # Rating (class name)
    rating_class = book.find('p', class_='star-rating')['class'][1]
    
    print(f"{i}. {title}")
    print(f"   Price: {price}")
    print(f"   Rating: {rating_class}\n")
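
The values extracted above are raw strings. The sketch below shows one way to turn them into typed data, assuming the price is a pound sign followed by a decimal number and the rating is an English number word from One to Five. The HTML is a mock-up assumed to mirror the structure used in the loop above, with invented values; `RATING_WORDS` and `parse_book` are hypothetical names.

```python
from bs4 import BeautifulSoup

# Mock-up assumed to mirror one book card; the title and price are invented
html = """
<article class="product_pod">
  <p class="star-rating Three"></p>
  <h3><a href="catalogue/example-book/index.html" title="Example Book">Example ...</a></h3>
  <p class="price_color">\u00a351.77</p>
</article>
"""

RATING_WORDS = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}

def parse_book(article):
    """Extract title, numeric price and numeric rating from one book card."""
    title = article.find('h3').find('a')['title']
    price_text = article.find('p', class_='price_color').get_text(strip=True)
    price = float(price_text.lstrip('\u00a3'))  # drop the leading pound sign
    rating_word = article.find('p', class_='star-rating')['class'][1]
    return {'title': title, 'price': price, 'rating': RATING_WORDS.get(rating_word)}

soup = BeautifulSoup(html, 'html.parser')
book = parse_book(soup.find('article', class_='product_pod'))
print(book)
```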

## 2.7 Common Attributes Summary

### Tag Attributes

In [None]:
element = soup.find('article')

if element:
    print("Tag Attributes:")
    print(f"1. .name: {element.name}")  # Tag name
    print(f"2. .attrs: {element.attrs}")  # All attributes
    print(f"3. .string: {element.string}")  # Single string child; None if the tag has several children
    print(f"4. .text: {element.text[:50]}")  # All text content
    print(f"5. .parent: {element.parent.name if element.parent else None}")  # Parent tag
    print(f"6. .contents: {len(element.contents)}")  # Direct children
    print(f"7. .children: {len(list(element.children))}")  # Direct children iterator
    print(f"8. .descendants: {len(list(element.descendants))}")  # All descendants

## 🎓 Quick Reference

### Requests Cheat Sheet

```python
# Basic request
response = requests.get(url)

# With parameters
response = requests.get(url, params={'key': 'value'})

# With headers
response = requests.get(url, headers={'User-Agent': 'MyBot'})

# With timeout
response = requests.get(url, timeout=5)

# Response attributes
response.status_code  # HTTP status
response.text         # Body as text (str)
response.content      # Body as raw bytes
response.json()       # Parse JSON
response.headers      # Response headers
response.cookies      # Cookies
```

### BeautifulSoup Cheat Sheet

```python
# Create soup
soup = BeautifulSoup(html, 'lxml')

# Finding
soup.find('tag')                    # First match
soup.find_all('tag')                # All matches
soup.find('tag', class_='name')     # By class
soup.find(id='name')                # By ID
soup.select('.class')               # CSS selector
soup.select_one('#id')              # First CSS match

# Extracting
tag.get_text()                      # Get text
tag['attribute']                    # Get attribute
tag.get('attribute', 'default')     # Safe get
tag.attrs                           # All attributes

# Navigation
tag.parent                          # Parent element
tag.children                        # Direct children
tag.find_next_sibling()             # Next sibling
tag.find_previous_sibling()         # Previous sibling
```