# Module 3 - Web Scraping Essentials

#### Learning Objectives

- What is an API?
    - Client-server model
    - REST vs SOAP (intro only)
- HTTP methods: GET, POST, PUT, DELETE
- Understanding endpoints, headers, status codes
- Using the requests library to call APIs
- Working with JSON data in Python
    - Parsing JSON to Python dict
    - Extracting nested values
- Hands-on:
    - Call a public API (e.g., OpenWeatherMap or JSONPlaceholder)
    - Parse and print useful info from response


- What is web scraping? When to use it?
- Static vs dynamic websites
- BeautifulSoup:
    - Parsing HTML
    - Finding tags, attributes, and text
- Selenium (basic usage):
    - Launching browser, finding elements
    - Extracting values, clicking buttons
- Playwright (intro only):
    - Installing and setting up
    - Simple headless scraping example
- Ethical scraping: robots.txt, rate limits
- Hands-on:
    - Scrape titles/prices from a simple product listing page
    - Compare results from BeautifulSoup and Selenium

## Introduction to API

- **API** stands for Application Programming Interface.
- It is a set of protocols and tools that allow different software applications to communicate with each other.
- **Key purposes:**
   - Exposes data and functionality from one application to others.
   - Acts as a contract between provider and consumer.
- **APIs abstract implementation details**, letting consumers use functionalities without knowing internal workings.
- Examples: Weather APIs, Payment gateway APIs, Mapping/Geolocation APIs.



#### How APIs Work: The Client-Server Model


- **Client:** Application or system that sends requests for resources or actions (e.g., your Python script, web browser).
- **Server:** Application or system that listens for requests and sends responses (e.g., a web server hosting the API).
- **Interaction:**
   - Client sends a request (usually over HTTP/S).
   - Server processes the request and returns a response (data or status).
- **Benefits of client-server model:**
   - Separation of concerns.
   - Easier scaling and maintenance.
   - Enables distributed computing.
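The request/response cycle described above can be sketched with nothing but the standard library. The handler class, the `/ping` path, and the JSON payload are made up for illustration; the server runs locally on a free port, and the client uses `http.client` to send a single request.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Server side: answers every GET request with a small JSON document."""

    def do_GET(self):
        body = json.dumps({"path": self.path, "message": "hello"}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence the default per-request logging


def demo_round_trip():
    """Start a local server, send one client request, return (status, data)."""
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        # Client side: open a connection, send a request, read the response.
        conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1], timeout=5)
        conn.request("GET", "/ping")
        response = conn.getresponse()
        data = json.loads(response.read().decode("utf-8"))
        conn.close()
        return response.status, data
    finally:
        server.shutdown()
        server.server_close()


status, data = demo_round_trip()
print(status, data)
```

The client never sees how the server builds its answer; it only relies on the agreed request path and response format, which is the "contract" idea from the introduction.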



#### Making API Requests in Python


- Popular libraries for calling APIs in Python:
   - `requests` (most common and beginner-friendly).
   - `http.client` (standard library, lower level).



#### Introduction to REST and SOAP

- **REST (Representational State Transfer):**
   - Lightweight architectural style for building APIs.
   - Uses standard HTTP methods (GET, POST, PUT, DELETE).
   - Data is often sent/received in JSON.
   - Simple to use with Python's `requests` library.
     
- **SOAP (Simple Object Access Protocol):**
   - Protocol for exchanging structured information.
   - Uses XML for messages.
   - More strict, with enveloping and schema requirements.
   - Can be used in Python with libraries like `zeep` or `suds`.
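To make the format difference concrete, here is a small sketch that reads the same value from a REST-style JSON body and from a SOAP-style XML envelope. Both payloads, the `GetWeatherResponse` element, and the `http://example.com/weather` namespace are invented for illustration; a real SOAP service would normally be called through a client library such as `zeep` rather than parsed by hand.

```python
import json
import xml.etree.ElementTree as ET

# A REST-style response body: plain JSON.
rest_body = '{"city": "Pune", "temperature_c": 31}'

# A SOAP-style response body: the same data wrapped in an XML envelope.
soap_body = """
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <GetWeatherResponse xmlns="http://example.com/weather">
      <City>Pune</City>
      <TemperatureC>31</TemperatureC>
    </GetWeatherResponse>
  </soap:Body>
</soap:Envelope>
"""

# JSON maps directly onto Python dicts and lists.
rest_temp = json.loads(rest_body)["temperature_c"]

# XML needs namespace-aware navigation through the envelope.
namespaces = {
    "soap": "http://schemas.xmlsoap.org/soap/envelope/",
    "w": "http://example.com/weather",  # hypothetical service namespace
}
root = ET.fromstring(soap_body.strip())
soap_temp = int(root.find("soap:Body/w:GetWeatherResponse/w:TemperatureC", namespaces).text)

print(rest_temp, soap_temp)
```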
     

#### REST vs SOAP

<table style="width: 60%; border-collapse: collapse; border: 1px solid #ccc; margin-left: 0;">
  <thead>
    <tr style="text-align: center; background-color: #050A30; color: white;">
      <th style="border: 1px solid #ccc; padding: 8px;">Feature</th>
      <th style="border: 1px solid #ccc; padding: 8px;">REST</th>
      <th style="border: 1px solid #ccc; padding: 8px;">SOAP</th>
    </tr>
  </thead>
  <tbody>
    <tr style="text-align: center;">
      <td style="border: 1px solid #ccc; padding: 8px; text-align: left;">Data Format</td>
      <td style="border: 1px solid #ccc; padding: 8px;">JSON (commonly), XML</td>
      <td style="border: 1px solid #ccc; padding: 8px;">XML only</td>
    </tr>
    <tr style="text-align: center;">
      <td style="border: 1px solid #ccc; padding: 8px; text-align: left;">Python Library</td>
      <td style="border: 1px solid #ccc; padding: 8px;"><code>requests</code></td>
      <td style="border: 1px solid #ccc; padding: 8px;"><code>zeep</code>, <code>suds-py3</code></td>
    </tr>
    <tr style="text-align: center;">
      <td style="border: 1px solid #ccc; padding: 8px; text-align: left;">Structure</td>
      <td style="border: 1px solid #ccc; padding: 8px;">Lightweight, less strict</td>
      <td style="border: 1px solid #ccc; padding: 8px;">Strict message structure</td>
    </tr>
    <tr style="text-align: center;">
      <td style="border: 1px solid #ccc; padding: 8px; text-align: left;">Use Cases</td>
      <td style="border: 1px solid #ccc; padding: 8px;">Web apps, microservices</td>
      <td style="border: 1px solid #ccc; padding: 8px;">Enterprise apps, legacy</td>
    </tr>
    <tr style="text-align: center;">
      <td style="border: 1px solid #ccc; padding: 8px; text-align: left;">Ease of Use</td>
      <td style="border: 1px solid #ccc; padding: 8px;">Very easy in Python</td>
      <td style="border: 1px solid #ccc; padding: 8px;">Moderate complexity</td>
    </tr>
  </tbody>
</table>



**Common Use Cases in Python Projects**
- **REST APIs:**  
   - Web scraping, automation, integrating third-party services (payment processing, social media, maps).
- **SOAP APIs:**  
   - Working with enterprise systems (banking, telecom), where SOAP remains prevalent.

### Core HTTP Components

#### 1. HTTP Methods

HTTP methods define actions that can be performed on resources in web APIs. Python's `requests` library makes it easy to work with these methods when interacting with APIs.

<table style="width: 80%; border-collapse: collapse; border: 1px solid #ccc; margin-left: 0;">
  <thead>
    <tr style="text-align: center; background-color: #050A30; color: white;">
      <th style="border: 1px solid #ccc; padding: 8px;">HTTP Method</th>
      <th style="border: 1px solid #ccc; padding: 8px;">Purpose</th>
      <th style="border: 1px solid #ccc; padding: 8px;">Idempotent?</th>
      <th style="border: 1px solid #ccc; padding: 8px;">Python <code>requests</code> Function</th>
      <th style="border: 1px solid #ccc; padding: 8px;">Typical Status Codes</th>
    </tr>
  </thead>
  <tbody>
    <tr style="text-align: center;">
      <td style="border: 1px solid #ccc; padding: 8px;">GET</td>
      <td style="border: 1px solid #ccc; padding: 8px; text-align: left;">Retrieve data</td>
      <td style="border: 1px solid #ccc; padding: 8px;">Yes</td>
      <td style="border: 1px solid #ccc; padding: 8px;"><code>requests.get()</code></td>
      <td style="border: 1px solid #ccc; padding: 8px;">200 OK, 404 Not Found</td>
    </tr>
    <tr style="text-align: center;">
      <td style="border: 1px solid #ccc; padding: 8px;">POST</td>
      <td style="border: 1px solid #ccc; padding: 8px; text-align: left;">Create new resource</td>
      <td style="border: 1px solid #ccc; padding: 8px;">No</td>
      <td style="border: 1px solid #ccc; padding: 8px;"><code>requests.post()</code></td>
      <td style="border: 1px solid #ccc; padding: 8px;">201 Created, 400 Bad Request</td>
    </tr>
    <tr style="text-align: center;">
      <td style="border: 1px solid #ccc; padding: 8px;">PUT</td>
      <td style="border: 1px solid #ccc; padding: 8px; text-align: left;">Update/replace resource</td>
      <td style="border: 1px solid #ccc; padding: 8px;">Yes</td>
      <td style="border: 1px solid #ccc; padding: 8px;"><code>requests.put()</code></td>
      <td style="border: 1px solid #ccc; padding: 8px;">200 OK, 204 No Content</td>
    </tr>
    <tr style="text-align: center;">
      <td style="border: 1px solid #ccc; padding: 8px;">DELETE</td>
      <td style="border: 1px solid #ccc; padding: 8px; text-align: left;">Delete resource</td>
      <td style="border: 1px solid #ccc; padding: 8px;">Yes</td>
      <td style="border: 1px solid #ccc; padding: 8px;"><code>requests.delete()</code></td>
      <td style="border: 1px solid #ccc; padding: 8px;">204 No Content, 404 Not Found</td>
    </tr>
  </tbody>
</table>
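As a rough illustration of the table, the sketch below builds one request per method without sending anything. `requests.Request(...).prepare()` only assembles the method, URL, headers, and body; the host `api.example.com` and the `/items` routes are placeholders.

```python
import requests

BASE = "https://api.example.com"  # placeholder host, not a real service

# Build (but do not send) one request per HTTP method.
examples = [
    requests.Request("GET", f"{BASE}/items", params={"page": 2}),
    requests.Request("POST", f"{BASE}/items", json={"name": "pen"}),
    requests.Request("PUT", f"{BASE}/items/5", json={"name": "pencil"}),
    requests.Request("DELETE", f"{BASE}/items/5"),
]

for request in examples:
    prepared = request.prepare()
    # GET carries its data in the URL; POST and PUT carry it in the body.
    print(prepared.method, prepared.url, prepared.body)
```

In everyday code the shortcuts `requests.get()`, `requests.post()`, `requests.put()`, and `requests.delete()` build and send the request in one step.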



#### 2. Endpoints

- **Definition:**  
  An endpoint is a specific URL (Uniform Resource Locator) where an API can be accessed by a client to perform a particular function.
- **Structure:**  
  Typically follows the pattern:  
  `https://api.example.com/resource`
- **Examples:**  
  - `https://api.example.com/users` – might return a list of users.
  - `https://api.example.com/users/123` – fetches details of user with ID 123.
- **Note:** The same endpoint can support several HTTP methods. For example, `GET /users/123` reads the user, while `DELETE /users/123` removes it.

**In Python (using `requests`):**
```python
import requests

# Request the item with ID 5 from the /items endpoint
response = requests.get("https://api.example.com/items/5")
```

#### 3. Headers

- **Definition:**  
  Headers are key-value pairs sent with the HTTP request or response to provide additional context, metadata, or instructions for client and server.
- **Common Use Cases:**
  - **Authentication:** e.g., `Authorization: Bearer <token>`
  - **Content-Type:** Specifies the format (e.g., `application/json`, `application/xml`) of the data.
  - **Accept:** Informs the server what response format is expected.

**In Python:**
```python
import requests

headers = {
    "Authorization": "Bearer <token>",  # replace <token> with your API token
    "Content-Type": "application/json"
}
response = requests.get("https://api.example.com/items", headers=headers)
```

#### 4. Status Codes

- **Definition:**  
  Status codes are standardized 3-digit numbers returned with every HTTP response, providing information about the result of the request.
- **Major Categories:**
  - **200–299:** Success (e.g., 200 OK, 201 Created)
  - **300–399:** Redirection (e.g., 301 Moved Permanently)
  - **400–499:** Client Error (e.g., 400 Bad Request, 401 Unauthorized, 404 Not Found)
  - **500–599:** Server Error (e.g., 500 Internal Server Error)
- **How to use in Python:**
  - Check `response.status_code` to determine how your code should react.

```python
import requests

response = requests.get("https://api.example.com/items")
if response.status_code == 200:
    print("Success!")
elif response.status_code == 404:
    print("Resource not found.")
else:
    print("Error:", response.status_code)
```

#### **Summary**

<table style="width: 50%; border-collapse: collapse; border: 1px solid #ccc; margin-left: 0;">
  <thead>
    <tr style="text-align: left; background-color: #050A30; color: white;">
      <th style="width: 20%;border: 1px solid #ccc; padding: 8px;">Concept</th>
      <th style="border: 1px solid #ccc; padding: 8px;">Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="border: 1px solid #ccc; padding: 8px; font-weight: bold;">Endpoints</td>
      <td style="border: 1px solid #ccc; padding: 8px;">
        Are the addresses of specific API resources.
      </td>
    </tr>
    <tr>
      <td style="border: 1px solid #ccc; padding: 8px; font-weight: bold;">Headers</td>
      <td style="border: 1px solid #ccc; padding: 8px;">
        Carry additional information, like authentication or content type.
      </td>
    </tr>
    <tr>
      <td style="border: 1px solid #ccc; padding: 8px; font-weight: bold;">Status codes</td>
      <td style="border: 1px solid #ccc; padding: 8px;">
        Tell you if a request succeeded, redirected, failed, or broke due to a server error.
      </td>
    </tr>
  </tbody>
</table>

### Examples - Using the requests library to call APIs

**Problem Statement:**
Users need a simple way to interact programmatically with a shipment tracking system: retrieving shipment details and adding, updating, or deleting consignments remotely over HTTP. The exercises in this section use Python's `requests` library to send GET, POST, PUT, and DELETE requests to a shipment REST API and automate these common tasks.

```python
BASE_URL = r"https://shipment-tracker-ehfh.onrender.com"
```

#### 1. GET (Retrieve Data)

- Used to request data from a specified resource (API endpoint).
- Does **not** modify data on the server.
- Data is often passed via URL query parameters.



###### Ex. Get all the shipments and store them in a DataFrame

```python
import requests
```
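One possible shape for this exercise is sketched below. The `/shipments` route and the assumption that it returns a JSON list of objects are guesses, not taken from the API's documentation; check the real routes before relying on them. The optional `session` argument lets you pass a `requests.Session` (or a test double).

```python
import pandas as pd
import requests

BASE_URL = "https://shipment-tracker-ehfh.onrender.com"


def fetch_shipments(base_url=BASE_URL, session=None):
    """Return all shipments as a DataFrame.

    Assumes a hypothetical `GET /shipments` route that returns a JSON list.
    """
    http = session or requests
    response = http.get(f"{base_url}/shipments", timeout=10)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return pd.DataFrame(response.json())  # one row per shipment record
```

Calling `fetch_shipments().head()` would then show the first few rows, provided the assumed route exists.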

###### Ex. Get shipment details by consignment ID
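A minimal sketch for a single-record lookup, again assuming a hypothetical `/shipments/<consignment_id>` route that returns one JSON object. It treats 404 as "not found" and lets any other error status raise.

```python
import requests

BASE_URL = "https://shipment-tracker-ehfh.onrender.com"


def fetch_shipment(consignment_id, base_url=BASE_URL, session=None):
    """Return one shipment as a dict, or None if the server answers 404.

    The `/shipments/<id>` route is an assumption for illustration.
    """
    http = session or requests
    response = http.get(f"{base_url}/shipments/{consignment_id}", timeout=10)
    if response.status_code == 404:
        return None
    response.raise_for_status()
    return response.json()  # JSON object parsed into a Python dict
```

Nested values can then be read with ordinary dict access, for example `shipment["status"]` if the response contains such a key.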


#### 2. POST (Create Data)

- Used to send data to the server to create a new resource.
- Data is included in the request body (usually JSON or form data).
  


###### Ex. Add a new shipment
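A hedged sketch of creating a record with POST. The `/shipments` route, the field names in the usage note, and the assumption that the server echoes the created record as JSON are all illustrative. Passing `json=` makes `requests` serialize the dict and set the `Content-Type: application/json` header.

```python
import requests

BASE_URL = "https://shipment-tracker-ehfh.onrender.com"


def add_shipment(shipment, base_url=BASE_URL, session=None):
    """POST a new shipment and return (status_code, parsed_response).

    Assumes a hypothetical `POST /shipments` route that replies with JSON.
    """
    http = session or requests
    response = http.post(f"{base_url}/shipments", json=shipment, timeout=10)
    response.raise_for_status()
    return response.status_code, response.json()
```

A call might look like `add_shipment({"consignment_id": "DEL-00999", "status": "Booked"})`, where both keys are made-up field names; a successful creation is typically signalled by `201 Created`.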



#### 3. PUT (Update Data)

- Used to fully update an existing resource, or to create it if it does not exist.
- Sends all fields of the resource; replaces the current representation.



###### Ex. Update status of consignment_id = DEL-00023 to Delivered
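Because PUT replaces the whole representation, one reasonable approach is to read the current record, change one field, and send everything back. The route and the `status` field name below are assumptions for illustration only.

```python
import requests

BASE_URL = "https://shipment-tracker-ehfh.onrender.com"


def mark_delivered(consignment_id, base_url=BASE_URL, session=None):
    """Set a shipment's status to "Delivered" with a full-record PUT.

    Assumes hypothetical `GET`/`PUT /shipments/<id>` routes and a `status` field.
    """
    http = session or requests
    url = f"{base_url}/shipments/{consignment_id}"

    current = http.get(url, timeout=10)
    current.raise_for_status()
    shipment = current.json()

    shipment["status"] = "Delivered"  # change one field, keep the rest intact
    response = http.put(url, json=shipment, timeout=10)
    response.raise_for_status()
    return response.status_code
```

For this exercise the call would be `mark_delivered("DEL-00023")`.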


#### 4. DELETE (Remove Data)

- Used to delete a resource identified by a URI.
- Typically returns status code 204 No Content on success.



###### Ex. Delete shipment by consignment_id
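A small sketch of the delete call, assuming the same hypothetical `/shipments/<consignment_id>` route. It reports a missing record as `False` instead of raising.

```python
import requests

BASE_URL = "https://shipment-tracker-ehfh.onrender.com"


def delete_shipment(consignment_id, base_url=BASE_URL, session=None):
    """Delete one shipment; return True on success, False if it was not found.

    The `/shipments/<id>` route is an assumption for illustration.
    """
    http = session or requests
    response = http.delete(f"{base_url}/shipments/{consignment_id}", timeout=10)
    if response.status_code == 404:
        return False
    response.raise_for_status()  # any other 4xx/5xx is treated as an error
    return True
```

Since DELETE is idempotent, repeating the call after a successful delete should simply return `False` rather than change anything further.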

## Web Scraping

Web scraping is the automated process of extracting data from websites. It involves using programs (called "scrapers" or "bots") to send requests to web pages, retrieve their contents (HTML, JSON, etc.), and parse out the desired information—such as tables, text, images, or links. Python libraries like BeautifulSoup, Scrapy, and Selenium are popular for this task.

### When to Use Web Scraping

- Collecting large amounts of data from websites that do not provide a public API.
- Aggregating product prices, reviews, or inventory from online stores.
- Gathering news articles, research publications, or social media content for analysis.
- Monitoring changes on websites (e.g., job postings or real estate listings).
- Any situation where repeated, structured download of web-based public information is needed.

**Important:**  
Always respect website terms of use, robots.txt, and copyright laws. Prefer official APIs if available.

### Static vs Dynamic Websites

<table style="width: 90%; border-collapse: collapse; border: 1px solid #ccc; margin-left: 0;">
  <thead>
    <tr style="text-align: left; background-color: #050A30; color: white;">
      <th style="width: 15%;border: 1px solid #ccc; padding: 8px;">Feature</th>
      <th style="width: 35%;border: 1px solid #ccc; padding: 8px;">Static Website</th>
      <th style="width: 45%;border: 1px solid #ccc; padding: 8px;">Dynamic Website</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="border: 1px solid #ccc; padding: 8px; text-align: left;">Content Delivery</td>
      <td style="border: 1px solid #ccc; padding: 8px;">
        Content is fixed in HTML, served as-is by server
      </td>
      <td style="border: 1px solid #ccc; padding: 8px;">
        Content is generated/manipulated on-the-fly via JavaScript, APIs, or server processes
      </td>
    </tr>
    <tr>
      <td style="border: 1px solid #ccc; padding: 8px; text-align: left;">Scraping Ease</td>
      <td style="border: 1px solid #ccc; padding: 8px;">
        Easy to scrape—BeautifulSoup, Requests suffice
      </td>
      <td style="border: 1px solid #ccc; padding: 8px;">
        Harder to scrape—often need Selenium, Playwright, or to analyze API calls
      </td>
    </tr>
    <tr>
      <td style="border: 1px solid #ccc; padding: 8px; text-align: left;">Examples</td>
      <td style="border: 1px solid #ccc; padding: 8px;">
        Blogs, documentation, company info pages
      </td>
      <td style="border: 1px solid #ccc; padding: 8px;">
        News feeds, social networks, dashboards, e-commerce product pages
      </td>
    </tr>
    <tr>
      <td style="border: 1px solid #ccc; padding: 8px; text-align: left;">Source Code</td>
      <td style="border: 1px solid #ccc; padding: 8px;">
        Page source contains all visible data
      </td>
      <td style="border: 1px solid #ccc; padding: 8px;">
        Data may load asynchronously, not present in initial HTML
      </td>
    </tr>
  </tbody>
</table>


- **Static:** What you see in the page source is what appears in the browser.
- **Dynamic:** Content may load after the page loads, requiring tools that mimic a browser or interact with backend APIs.

**Summary:**  
- Web scraping harvests website data automatically.
- Static sites are easy to scrape; dynamic sites often require extra work.
- Use web scraping when APIs aren’t available, but always act ethically and legally.

### BeautifulSoup vs Selenium vs Playwright

**A Comprehensive Comparison for Python Web Scraping and Automation**

<table style="width: 90%; border-collapse: collapse; border: 1px solid #ccc; margin-left: 0;">
  <thead>
    <tr style="text-align: left; background-color: #050A30; color: white;">
      <th>Feature / Tool</th>
      <th>BeautifulSoup</th>
      <th>Selenium</th>
      <th>Playwright</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Core Use</strong></td>
      <td>Static HTML/XML parsing &amp; extraction</td>
      <td>Browser automation, dynamic content, user interaction testing</td>
      <td>Modern browser automation, dynamic JS content, testing</td>
    </tr>
    <tr>
      <td><strong>Best For</strong></td>
      <td>Simple scraping, static pages</td>
      <td>Dynamic sites, JS-heavy pages, UI testing, automation</td>
      <td>Dynamic web pages, reliable scraping/testing, cross-browser</td>
    </tr>
    <tr>
      <td><strong>How It Works</strong></td>
      <td>Parses HTML/XML from text, doesn’t control browsers</td>
      <td>Controls real browsers via WebDriver API</td>
      <td>Controls browsers, high-level API, auto-waiting</td>
    </tr>
    <tr>
      <td><strong>Python Implementation</strong></td>
      <td><code>from bs4 import BeautifulSoup</code><br><code>soup = BeautifulSoup(html, 'html.parser')</code></td>
      <td><code>from selenium import webdriver</code><br><code>driver = webdriver.Chrome()</code></td>
      <td><code>from playwright.sync_api import sync_playwright</code><br>Use context/browser methods</td>
    </tr>
    <tr>
      <td><strong>Handles JavaScript?</strong></td>
      <td>❌ No (HTML must be fully loaded on fetch)</td>
      <td>✅ Yes (executes JS in browser context)</td>
      <td>✅ Yes (executes JS; modern async support)</td>
    </tr>
    <tr>
      <td><strong>User Interaction?</strong></td>
      <td>❌ None</td>
      <td>✅ Yes (click, type, scroll, etc.)</td>
      <td>✅ Yes (click, type, fill, hover, etc.)</td>
    </tr>
    <tr>
      <td><strong>Speed/Performance</strong></td>
      <td>Very fast (text parsing only)</td>
      <td>Slow (full browser launch for each session)</td>
      <td>Faster than Selenium (uses WebSockets, modern engines)</td>
    </tr>
    <tr>
      <td><strong>Setup Complexity</strong></td>
      <td>Simple: <code>pip install beautifulsoup4</code><br>Needs <code>requests</code></td>
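      <td>Medium: <code>pip install selenium</code>; Selenium 4.6+ fetches a matching browser driver automatically (Selenium Manager), older versions need a manual driver install</td>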
      <td>Simple: <code>pip install playwright</code>, then <code>playwright install</code></td>
    </tr>
    <tr>
      <td><strong>Resource Usage</strong></td>
      <td>Minimal CPU/RAM</td>
      <td>High (opens browsers)</td>
      <td>High (browsers bundled; efficient parallel runs)</td>
    </tr>
    <tr>
      <td><strong>Supports Screenshots?</strong></td>
      <td>❌ No</td>
      <td>✅ Yes</td>
      <td>✅ Yes</td>
    </tr>
    <tr>
      <td><strong>Headless Mode?</strong></td>
      <td>Not applicable</td>
      <td>✅ Yes</td>
      <td>✅ Yes (default)</td>
    </tr>
    <tr>
      <td><strong>Waits for Content</strong></td>
      <td>❌ No (parses only supplied HTML)</td>
      <td>Needs manual waits or explicit waits</td>
      <td>✅ Built-in auto-waiting for elements/actions</td>
    </tr>
    <tr>
      <td><strong>Browser Automation</strong></td>
      <td>❌ No</td>
      <td>✅ Yes</td>
      <td>✅ Yes</td>
    </tr>
    <tr>
      <td><strong>Community/Ecosystem</strong></td>
      <td>Large, mature</td>
      <td>Very large, mature</td>
      <td>Growing, newer but developer-friendly</td>
    </tr>
    <tr>
      <td><strong>Learning Curve</strong></td>
      <td>Low</td>
      <td>Moderate (web concepts, dealing with WebDriver)</td>
      <td>Moderate (modern async, rich API)</td>
    </tr>
    <tr>
      <td><strong>Licensing</strong></td>
      <td>Open source (MIT, permissive for commercial/proprietary use)</td>
      <td>Open source (Apache 2.0, commercial/proprietary allowed)</td>
      <td>Open source (Apache 2.0, permissive for commercial use)</td>
    </tr>
    <tr>
      <td><strong>Typical Use Cases</strong></td>
      <td>Static sites, quick prototyping, parsing HTML files</td>
      <td>Web app testing, scraping dynamic/interactive sites</td>
      <td>Modern apps, cross-browser testing, reliable scraping</td>
    </tr>
    <tr>
      <td><strong>Limitations</strong></td>
      <td>Can’t handle JS content or user input</td>
      <td>Slow; complex setup for large suites; struggles with some dynamic elements</td>
      <td>Larger install, new community, some browser quirks</td>
    </tr>
  </tbody>
</table>

**Strengths & Weaknesses at a Glance**

**BeautifulSoup**
- **Pros:** Fast, easy, minimal setup, lightweight, ideal for static content.
- **Cons:** Can’t execute JavaScript, no automation, can’t interact with UI, unsuitable for dynamic pages.
- **Licensing:** MIT; fully open for commercial and proprietary use.

**Selenium**
- **Pros:** Supports dynamic JS sites, full browser automation, broad language/browser support, huge ecosystem and integrations.
- **Cons:** Slower (full browsers), higher resource use, complex setup with drivers, more fragile to website changes, handling dynamic elements can be tricky, steeper learning curve.
- **Licensing:** Apache 2.0; open for commercial use.

**Playwright**
- **Pros:** Fast (uses WebSockets, parallel runs), supports rich modern web features, auto-waiting, reliable cross-browser automation, easy element handling, strong support for headless operation; modern API.
- **Cons:** Newer, smaller community, larger install size, more complex with many features, some advanced browser quirks, requires managing versions.
- **Licensing:** Apache 2.0; open for commercial/proprietary use.

**Why this matters when choosing a library:**

If the page is server-side rendered (SSR), the server sends fully rendered HTML, so you can run Beautiful Soup or a similar parser directly on the response.

If the page is client-side rendered (CSR), the data arrives later through JavaScript fetch calls, so Beautiful Soup alone will not see it. Either render the page with a browser automation tool such as Selenium or Playwright, or call the underlying JSON API and parse its responses directly.
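The difference can be reproduced offline with two small HTML strings. Everything here (the markup, the class names, and the JSON payload standing in for an API response) is invented for illustration.

```python
import json

from bs4 import BeautifulSoup

# Server-side rendered: the data is already present in the HTML.
ssr_html = """
<ul id="products">
  <li class="product"><span class="title">Pen</span><span class="price">10</span></li>
  <li class="product"><span class="title">Book</span><span class="price">120</span></li>
</ul>
"""

# Client-side rendered: an empty shell that JavaScript would fill in later.
csr_html = '<ul id="products"></ul><script src="/static/app.js"></script>'

# What the page's JavaScript might fetch from a (hypothetical) JSON endpoint.
api_response = '{"products": [{"title": "Pen", "price": 10}, {"title": "Book", "price": 120}]}'


def titles_from_html(html):
    """Extract product titles from HTML using a CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.select("li.product .title")]


print(titles_from_html(ssr_html))  # titles are found in the static HTML
print(titles_from_html(csr_html))  # empty list: the shell contains no products
print([p["title"] for p in json.loads(api_response)["products"]])  # read the JSON directly
```

When a page behaves like the second case, the browser's network tab often reveals the JSON request, which is usually easier to work with than a rendered page.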

### Python Example: Minimal Implementations

**BeautifulSoup:**
```python
from bs4 import BeautifulSoup
import requests
html = requests.get('https://example.com').text
soup = BeautifulSoup(html, 'html.parser')
print(soup.title.text)
```
**Selenium:**
```python
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://example.com')
print(driver.title)
driver.quit()
```
**Playwright:**
```python
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto('https://example.com')
    print(page.title())
    browser.close()
```

**Licensing/Usage**

- **No license fees required** for any of these tools; all are open source. Their permissive licenses allow use in commercial and closed-source projects, as well as modification and redistribution under each license's terms.

**Bottom Line:**  
- Use **BeautifulSoup** for simple, static sites: fastest and lightest.
- Use **Selenium** or **Playwright** for JS-heavy, interactive, or test automation: Playwright is typically more modern and faster, Selenium has broader support and a larger legacy ecosystem.
- All can be used freely in both personal and commercial projects.

### Ethical Web Scraping

**Ethical scraping** means collecting web data responsibly, respecting both the technical boundaries set by website owners and broader legal/ethical considerations. Two fundamental pillars of ethical scraping are: (1) obeying a site’s `robots.txt`, and (2) adhering to reasonable rate limits.

**1. What is `robots.txt`?**

- `robots.txt` is a simple text file placed in the root directory of a website (e.g., `https://example.com/robots.txt`) as part of the Robots Exclusion Protocol.
- This file communicates *which parts* of a website *can* and *cannot* be accessed by automated bots, including web scrapers and search engine crawlers.
- Main directives in `robots.txt`:
  - `User-agent`: specifies which bot(s) the rule applies to (e.g., Googlebot, or all: `*`).
  - `Disallow`: URLs/folders that should not be crawled.
  - `Allow`: exceptions to Disallow rules (used more in Google’s implementation).
  - `Crawl-delay`: specifies how many seconds a bot should wait between requests. Not all bots observe this.
- Example:
  ```
  User-agent: *
  Disallow: /private/
  Allow: /public/
  Crawl-delay: 10
  ```
- While not legally binding everywhere, *respecting robots.txt is considered foundational to ethical web scraping*. Websites may set honeypot traps or block noncompliant bots that ignore it.
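Python's standard library can evaluate these rules for you. The sketch below feeds the example file above to `urllib.robotparser` and asks whether two URLs may be fetched; the bot name `course-scraper` is a made-up User-Agent.

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt from above, supplied as text instead of downloaded.
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /public/
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

agent = "course-scraper"  # hypothetical bot name; matched by the "*" rule

print(parser.can_fetch(agent, "https://example.com/public/page.html"))   # True
print(parser.can_fetch(agent, "https://example.com/private/data.html"))  # False
print(parser.crawl_delay(agent))                                         # 10
```

For a live site, `parser.set_url("https://example.com/robots.txt")` followed by `parser.read()` downloads and parses the file instead.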

**2. Why Respect `robots.txt`?**

- It’s a way for website owners to protect private, sensitive, or resource-intensive areas from automated scraping.
- Sites may restrict bots from crawling admin panels, search result pages, or massive archives to avoid excessive server load and protect privacy.
- Disregarding robots.txt:
  - Is considered *unethical* by the data community.
  - Risks you being blocked, blacklisted, or even facing legal action.
  - May harm a website by overloading it, especially if you hit disallowed sections.
- Respecting it keeps you within the “good bot” norms and reduces your risk profile.

**3. What Are Rate Limits? Why Are They Important?**

- **Rate limiting** is how websites restrict the number of requests you (or your bot) can make over a period (e.g., 10 requests per minute) to prevent server overload and abuse.
- Web servers monitor request rates per IP, User-Agent, or session. Exceeding their threshold triggers blocking, CAPTCHAs, or rate-limiting HTTP errors (429 “Too Many Requests”).
- Even if a site’s robots.txt doesn’t mention crawl delays, ethical scraping means avoiding “hammering” the site with rapid-fire requests.

**Best Practices:**
- Begin with conservative request rates: often 1 request every 10–30 seconds. Some recommend never exceeding 1,000 requests per IP per day for major sites; even less for smaller sites.
- Look for `Crawl-delay` in robots.txt, and always obey it if present (e.g., `Crawl-delay: 5` means at least 5 seconds between requests).
- Randomize your delays and monitor for HTTP 429 errors or CAPTCHAs.
- Respect business hours and slow down during peak times.
- Rotate IPs/users if necessary, but never to circumvent clear restrictions.
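The delay and back-off advice above could be sketched as a small helper. The function name, its parameters, and the doubling back-off are illustrative choices, not a standard recipe; `fetch` is any callable you supply that returns a `(status_code, body)` tuple.

```python
import random
import time


def polite_fetch_all(urls, fetch, min_delay=10.0, max_delay=30.0,
                     max_retries=3, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests and backing off on 429.

    `fetch(url)` must return a `(status_code, body)` tuple.
    """
    results = {}
    for url in urls:
        backoff = max_delay
        for _ in range(max_retries + 1):
            status, body = fetch(url)
            if status != 429:
                break
            sleep(backoff)  # server said "Too Many Requests": wait longer
            backoff *= 2    # and double the wait before the next retry
        results[url] = (status, body)  # last response, even if still 429
        sleep(random.uniform(min_delay, max_delay))  # randomized polite delay
    return results
```

With `requests`, a suitable `fetch` could be `lambda url: (lambda r: (r.status_code, r.text))(requests.get(url, timeout=10))`. If the site publishes a `Crawl-delay`, use it as `min_delay`.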

**4. Additional Ethical Practices**

- Always check and abide by the website’s Terms of Service or usage policies.
- Prefer official APIs when available instead of scraping HTML.
- Avoid collecting sensitive or personal data unless explicitly permitted.
- Identify your scraper with a meaningful User-Agent string; do not misrepresent your bot as a browser if you are scraping.
- Consider reaching out for permission before scraping large volumes of data.
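Identifying your scraper takes only a few lines with `requests`. The bot name and contact address below are placeholders; the request is prepared but not sent, just to show which header would go out.

```python
import requests

session = requests.Session()
session.headers.update({
    # Hypothetical bot name and contact; use your own project's details.
    "User-Agent": "course-scraper/0.1 (+mailto:you@example.com)",
})

# Prepare a request through the session to inspect the merged headers.
prepared = session.prepare_request(requests.Request("GET", "https://example.com/"))
print(prepared.headers["User-Agent"])
```

Every call made through `session.get(...)` afterwards carries the same header, so site owners can recognize the traffic and reach you if needed.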



### Examples - Scraping with BeautifulSoup and Selenium

```python
BASE_URL = r"https://shipment-tracker-ehfh.onrender.com"
```

###### Ex. Extract list of container shipping companies from Wikipedia

###### Ex. Extract all the shipment details

##### Note

- BeautifulSoup is an HTML parsing library designed to work with static websites.  
- Static websites deliver full page content directly in the initial HTML response from the server.  
- BeautifulSoup parses this static HTML effectively to extract required data.  
- Many modern websites use JavaScript to dynamically generate or update content after the initial page load.  
- Such dynamic content is commonly rendered via client-side rendering or AJAX calls.  
- BeautifulSoup **cannot execute JavaScript**, so it cannot see or parse dynamically generated content.  
- For websites with JavaScript-driven dynamic content, you need a tool that can render the page as a real browser would.  
- Selenium automates a web browser and fully renders pages including executing JavaScript.  
- Using Selenium allows you to access the complete, up-to-date DOM after JavaScript has run.  
- Therefore, for dynamic JavaScript-based websites, **Selenium (or similar tools) must be used instead of BeautifulSoup** alone. 

**Selenium + BeautifulSoup**

**Only Selenium**

###### Ex. Extract all the delivered shipment details

###### Ex. Get consignment details by consignment ID

## Cheat Sheets

#### Beautiful Soup

**Install & Import**

```bash
pip install beautifulsoup4 requests
```

```python
from bs4 import BeautifulSoup
import requests
```

---

**Fetch & Parse HTML**

```python
url = "https://example.com"
html = requests.get(url).text
soup = BeautifulSoup(html, "html.parser")  # or "lxml" if installed
```

---

**Finding Elements**

**- Find first matching element**

```python
soup.find('h1')  
soup.find('div', {'class': 'product'})
```

**- Find all matching elements**

```python
soup.find_all('a')  
soup.find_all('div', {'class': 'item'})
```

**- Find by ID**

```python
soup.find(id='main-content')
```

**- CSS Selectors**

```python
soup.select('.classname')      # class
soup.select('#idname')         # id
soup.select('div > p')         # child
soup.select('table.wikitable') # tag + class
```

---

**Get Text & Attributes**

```python
tag = soup.find('a')

tag.get_text()      # text inside tag
tag.text            # same as above
tag['href']         # value of href attribute
```

---

**Looping Over Multiple Elements**

```python
for link in soup.find_all('a'):
    # .get() returns None instead of raising KeyError when href is missing
    print(link.get_text(strip=True), link.get('href'))
```

---

**Navigating**

```python
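element = soup.find('p')  # any Tag works here

element.parent               # go up one level
element.children             # iterator over direct children (tags and text nodes)
element.next_sibling         # next node at the same level (may be whitespace text)
element.previous_sibling     # previous node at the same level
element.find_next_sibling()  # next sibling that is a tag, skipping text nodes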
```

---

**Tables → DataFrame (Pandas)**

```python
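from io import StringIO

import pandas as pd

table = soup.find('table', {'class': 'wikitable'})
# Wrap the HTML in StringIO: newer pandas versions deprecate passing raw HTML strings
df = pd.read_html(StringIO(str(table)))[0]
print(df.head())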
```

---

**Cleaning Text**

```python
text = tag.get_text(strip=True)  # remove extra spaces & newlines
```

---

**Common Patterns**

| Goal                | Code                                                     |
| ------------------- | -------------------------------------------------------- |
| All links on page   | `[a['href'] for a in soup.find_all('a', href=True)]`     |
| All images          | `[img['src'] for img in soup.find_all('img', src=True)]` |
| First heading       | `soup.find('h1').text`                                   |
| All rows of a table | `soup.find_all('tr')`                                    |

---

**Tips**

* Use **`soup.prettify()`** to format HTML for inspection.
* Use browser **Inspect** tool to find tags/classes/IDs.
* Respect **robots.txt** and terms of service.