# Part 1 — Introduction to scraping data

### What is web scraping?
Web scraping is programmatically fetching content from websites and extracting structured data (tables, text, links, images, JSON, etc.). Scraping is useful for research, collecting datasets, monitoring prices, and more.

### When to scrape vs. use an API

Prefer an official API when available (cleaner, stable, respects provider limits).

Scrape when there’s no API or the API lacks data you need — but proceed cautiously (see legal/ethical section).

### High-level workflow

- Inspect the page manually (browser dev tools).
- Identify the URL(s) and the HTML elements or API endpoints that contain the data.
- Write code to fetch the page (or call the API).
- Parse the HTML/JSON and extract fields.
- Respect robots.txt, rate limits, and terms of service; add delays and caching.
- Store data (CSV, JSON, database).
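
The "respect robots.txt, add delays and caching" step can be sketched with the standard library alone. This is a minimal illustration, not a complete crawler: the user-agent name, the sample robots.txt, and the stub fetcher are assumptions made up for the example, and a real run would download the site's own robots.txt and pass a real fetch function.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "my-research-bot"  # hypothetical name for this sketch


def load_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def crawl_politely(urls, rules, fetch, default_delay=1.0):
    """Fetch each permitted URL once, pausing after every request.

    `fetch` is any callable that takes a URL and returns the page body,
    for example a thin wrapper around requests.get.
    """
    # honor Crawl-delay if the site declares one, else use the default
    delay = rules.crawl_delay(USER_AGENT)
    if delay is None:
        delay = default_delay

    cache = {}  # in-memory cache: URL -> body, so no URL is fetched twice
    for url in urls:
        if url in cache or not rules.can_fetch(USER_AGENT, url):
            continue
        cache[url] = fetch(url)
        time.sleep(delay)
    return cache


# Offline demo: inline robots.txt and a stub fetcher instead of real requests.
SAMPLE_ROBOTS = "User-agent: *\nDisallow: /private/\n"
rules = load_rules(SAMPLE_ROBOTS)
pages = crawl_politely(
    ["https://example.com/courses", "https://example.com/private/report"],
    rules,
    fetch=lambda url: f"<html>stub page for {url}</html>",
    default_delay=0.1,  # short only so the demo finishes quickly
)
print(list(pages))
```

Only the `/courses` URL is fetched here, because the sample rules disallow everything under `/private/`. On a real site a delay of a second or more is a safer default.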

# Part 2 — Overview of popular scraping tools in Python

A short introduction to the main tools and when to choose each:

1. requests + BeautifulSoup (bs4) — Classic stack for static HTML pages. Easy, lightweight, perfect for beginners and many real-world tasks where JS is not required.

2. Scrapy — Full-featured scraping framework: spiders, request scheduling, pipelines, auto-throttling, built-in exporters, concurrent crawling. Use it for medium→large scale scraping projects.

3. Selenium — Drives a real browser (or headless browser) and can handle dynamic pages, heavy JavaScript, and interactions (clicks, logins). Slower but powerful.

4. APIs & JSON endpoints — Many sites load data via XHR/Fetch; reverse-engineering these endpoints is often easier and more reliable than parsing rendered HTML.
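
As a rough illustration of option 4, the sketch below calls a JSON endpoint directly instead of parsing rendered HTML. The endpoint URL, the `page`/`per_page` parameters, and the `results`/`title`/`price` payload shape are all assumptions invented for this example; in practice you would copy the real ones from the browser's Network tab.

```python
import requests

# Hypothetical endpoint, as it might appear in the browser's Network tab;
# replace it with what the real site actually requests.
API_URL = "https://example.com/api/courses"


def fetch_page(page, session=None):
    """Request one page of results from the JSON endpoint."""
    http = session or requests.Session()
    response = http.get(
        API_URL,
        params={"page": page, "per_page": 20},  # assumed parameter names
        headers={"Accept": "application/json"},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()


def extract_courses(payload):
    """Flatten the assumed payload shape into a list of small dicts."""
    return [
        {"title": item.get("title"), "price": item.get("price")}
        for item in payload.get("results", [])
    ]


# Sample payload in the assumed shape, so the parsing step runs offline.
sample = {"results": [{"title": "Intro to Python", "price": 0, "id": 7}]}
print(extract_courses(sample))
# For a live endpoint: extract_courses(fetch_page(1))
```

Keeping the network call and the field extraction in separate functions makes it possible to test the extraction against a saved payload without hitting the site again.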

# Part 3 — Simple use of BeautifulSoup

```
pip install beautifulsoup4 requests
```


In [None]:
# import necessary libraries
import re

import requests
from bs4 import BeautifulSoup


# function to download the HTML document at the given url
def get_html_document(url):

    # request the page; the timeout keeps the call from hanging forever
    response = requests.get(url, timeout=10)

    # raise an exception on 4xx/5xx instead of parsing an error page
    response.raise_for_status()

    # response.text is the decoded body (HTML here, not JSON)
    return response.text


# URL of the page to scrape
url_to_scrape = "https://www.geeksforgeeks.org/courses"

# download the HTML document
html_document = get_html_document(url_to_scrape)

# create the soup object
soup = BeautifulSoup(html_document, "html.parser")


# find all the anchor tags whose "href"
# attribute starts with "https://"
for link in soup.find_all("a", attrs={"href": re.compile("^https://")}):
    # display the actual URLs
    print(link.get("href"))
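
The loop above only prints links. To cover the "extract fields" and "store data" steps from the workflow, here is a small sketch that turns an HTML table into rows and then into CSV text. The sample HTML, the `courses` table id, and the helper names are made up for illustration; real pages need their own selectors.

```python
import csv
import io

from bs4 import BeautifulSoup

# Inline sample so the sketch runs without any network access.
SAMPLE_HTML = """
<table id="courses">
  <tr><th>Course</th><th>Level</th></tr>
  <tr><td>Python Basics</td><td>Beginner</td></tr>
  <tr><td>Web Scraping</td><td>Intermediate</td></tr>
</table>
"""


def table_to_rows(html, table_id):
    """Return the table's header and body cells as a list of string lists."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        return []  # the selector did not match; nothing to extract
    rows = []
    for tr in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows


def rows_to_csv(rows):
    """Serialize rows to CSV text; write this string to a file to persist it."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


rows = table_to_rows(SAMPLE_HTML, "courses")
print(rows_to_csv(rows))
```

Returning an empty list when the table is missing makes selector breakage easy to notice, which tends to happen when a site changes its markup.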

# Part 4 — Simple use of Scrapy

Use Scrapy when you need concurrency, auto-throttling, pipelines for cleaning/exporting, retries, or a project structure for many pages.

```
pip install scrapy
```
```
scrapy startproject my_scraper
cd my_scraper
scrapy crawl simple -o output.json
```

# Part 5 — Simple use of Selenium

Use Selenium when the page content is rendered or assembled by JavaScript, or when it needs interaction (logins, forms, clicking "load more").

```
pip install selenium webdriver-manager
```

# Part 6 — Tips and traps

- Anti-bot defenses

    - Many sites use behavior analysis, JS challenges, CAPTCHAs, or rate-limit by IP. For legitimate large-scale crawling, contact the site owner for an API or data access.
    - Do not use tools or techniques to bypass CAPTCHAs; doing so typically violates the site's terms of use and may be illegal in some jurisdictions.

- Common traps

    - Dynamic content loaded after initial HTML (use browser automation or inspect network panel for JSON endpoints).
    - Data hidden in embedded scripts — sometimes pages include JSON blobs inside script tags; parsing those is often easier and more reliable than HTML scraping.
    - Pagination implemented by JavaScript — inspect XHR calls to find the real paginated endpoint.
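    - Locale and authentication state change the content: the same URL can return different HTML depending on language or region settings and on whether you are logged in.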

- Debugging tips

    - Start with requests + BeautifulSoup on a single page. Save the HTML to disk and debug selectors offline.
    - Use curl -I to check headers and redirects.
    - Use browser devtools (Network tab) to find JSON endpoints and to see exact request headers.
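
The "data hidden in embedded scripts" trap can often be handled in a few lines. The sketch below pulls JSON out of `<script type="application/ld+json">` tags, a common way for pages to embed structured data. The sample HTML and the helper name are assumptions for illustration; other sites use different script types or ids, so check the page source first.

```python
import json

from bs4 import BeautifulSoup

# Inline sample page so the sketch runs offline.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@type": "Course", "name": "Web Scraping 101", "provider": "Example Academy"}
</script>
<script>console.log("not data");</script>
</head><body><h1>Web Scraping 101</h1></body></html>
"""


def extract_json_blobs(html, script_type="application/ld+json"):
    """Return every parseable JSON object found in matching script tags."""
    soup = BeautifulSoup(html, "html.parser")
    blobs = []
    for script in soup.find_all("script", attrs={"type": script_type}):
        raw = script.string  # the text inside the script tag, or None
        if not raw:
            continue
        try:
            blobs.append(json.loads(raw))
        except json.JSONDecodeError:
            continue  # skip blobs that are not valid JSON
    return blobs


print(extract_json_blobs(SAMPLE_HTML))
```

Because the data arrives already structured, this tends to survive layout changes better than selectors tied to visual markup.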