---

# Web Scraping

Web scraping is the process of automatically extracting data from websites. Unlike web crawling, which focuses on discovering and listing URLs, web scraping involves **accessing the content within each page and pulling out specific pieces of information** — such as titles, prices, reviews, dates, or any structured content.



Web scraping typically involves:
1. **Sending HTTP requests** to web pages
2. **Parsing the HTML** of the response
3. **Extracting target data** using selectors (CSS, XPath, etc.)
4. **Structuring and storing the data** in formats like JSON, CSV, or databases
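
The four steps above can be sketched with nothing but Python's standard library. This is a minimal illustration, not a production scraper: `SAMPLE_HTML`, the `product`/`title`/`price` class names, and the helper names `fetch`, `scrape`, and `to_csv` are all assumptions made for the example. The standard-library `html.parser` stands in for a library such as BeautifulSoup so the sketch runs without extra installs.

```python
import csv
import io
import json
from html.parser import HTMLParser
from urllib.request import urlopen

# Hypothetical page content. In a real scraper this string would be the
# body of an HTTP response (step 1).
SAMPLE_HTML = """
<html><body>
  <div class="product"><h2 class="title">Blue Mug</h2><span class="price">$12.50</span></div>
  <div class="product"><h2 class="title">Red Teapot</h2><span class="price">$30.00</span></div>
</body></html>
"""


def fetch(url):
    """Step 1: send an HTTP request and return the page as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


class ProductParser(HTMLParser):
    """Steps 2 and 3: parse the HTML and pick out title/price text."""

    FIELDS = {"title", "price"}

    def __init__(self):
        super().__init__()
        self.products = []   # one dict per product block
        self._field = None   # field whose text is currently being read

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls == "product":
            self.products.append({})
        elif cls in self.FIELDS and self.products:
            self._field = cls

    def handle_data(self, data):
        if self._field:
            self.products[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        self._field = None


def scrape(html):
    """Run the parser over an HTML string and return the extracted rows."""
    parser = ProductParser()
    parser.feed(html)
    parser.close()
    return parser.products


def to_csv(rows):
    """Step 4: serialize the rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "price"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    rows = scrape(SAMPLE_HTML)      # swap in scrape(fetch(url)) for a live page
    print(json.dumps(rows, indent=2))
    print(to_csv(rows))
```

Matching on exact `class` values keeps the sketch short; real pages usually call for CSS or XPath selectors, which is where the libraries listed in this section help.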

Scraping is commonly done in Python, with libraries such as Scrapy, BeautifulSoup, and Selenium.

> Note: Always check a website’s Terms of Service and robots.txt file before scraping. Respect ethical and legal boundaries when collecting data.
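
One way to check robots.txt rules from code is Python's built-in `urllib.robotparser`. The sketch below parses an inline, made-up `ROBOTS_TXT` so it runs offline; the rules, the `my-scraper` user agent, and the `is_allowed` helper are assumptions for illustration. It covers robots.txt only; Terms of Service still have to be read by a person.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real scraper would download it from
# https://<site>/robots.txt, for example with RobotFileParser.set_url()
# followed by read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/products/1", "/private/report"):
        url = "https://example.com" + path
        print(url, is_allowed(ROBOTS_TXT, "my-scraper", url))
```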





---

### Common Use Cases of Web Scraping

- **Price Monitoring**  
  Websites that track prices across e-commerce sites use scrapers to collect prices daily or hourly.

- **Job Aggregators**  
  Platforms like Indeed or Glassdoor gather job listings from many company websites, often by scraping them.

- **Market Research**  
  Scrapers collect customer reviews, product specs, or competitor data to support business decisions.

- **Academic or Data Projects**  
  Researchers and students scrape data for machine learning, NLP, or data analysis tasks.

---

### Real-Life Analogy: The Data Collector

Let’s go back to our **library analogy**.

- A **web scraper** is like a **researcher** who walks up to the books listed by the scout (crawler).
- This researcher **opens the books**, **reads specific pages**, and **copies down key facts** — like author names, chapters, or quotes.
- So, while the crawler builds the roadmap, the scraper collects the actual **content**.

---

### Popular Tools for Web Scraping

| Tool              | Category          | Notes                                                            |
| ----------------- | ----------------- | ---------------------------------------------------------------- |
| **BeautifulSoup** | Scraper           | Parses HTML that has already been fetched and extracts data      |
| **Selenium**      | Scraper           | Automates a real browser, so it handles JavaScript-rendered pages |
| **Scrapy**        | Crawler + Scraper | End-to-end framework for crawling and scraping                   |


---

### Crawling vs Scraping – Quick Recap

| Feature         | Web Crawling                | Web Scraping                        |
|-----------------|-----------------------------|-------------------------------------|
| Purpose         | Find and index web pages    | Extract data from web pages         |
| Output          | List of URLs                | Structured data (text, tables, etc.) |
| Analogy         | Library scout               | Researcher/data collector           |
| Tools           | Scrapy, Nutch               | BeautifulSoup, Selenium, Scrapy     |

---




### When They Work Together

Often, we use **both crawling and scraping** in the same project:
1. The crawler discovers relevant pages.
2. The scraper extracts data from those pages.

For example, a real estate bot may first crawl all listings on a site, then scrape each one for price, size, and location.
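
The real estate example can be sketched as a two-stage pipeline. To keep it runnable offline, the "site" below is a hypothetical in-memory dictionary (`SITE`) mapping URLs to HTML; the URLs, the `data-field` attribute, and the listing values are invented for illustration. A real bot would fetch each page over HTTP instead.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical site held in memory: URL -> HTML.
SITE = {
    "https://example.com/listings": """
        <a href="/listing/1">Flat</a>
        <a href="/listing/2">House</a>
        <a href="/about">About</a>
    """,
    "https://example.com/listing/1": """
        <span data-field="price">250000</span>
        <span data-field="size">70 m2</span>
        <span data-field="location">Leeds</span>
    """,
    "https://example.com/listing/2": """
        <span data-field="price">410000</span>
        <span data-field="size">120 m2</span>
        <span data-field="location">York</span>
    """,
}


class LinkParser(HTMLParser):
    """Crawler side: collect every href found in <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href:
            self.links.append(href)


class FieldParser(HTMLParser):
    """Scraper side: collect the text of elements carrying data-field."""

    def __init__(self):
        super().__init__()
        self.record = {}
        self._field = None

    def handle_starttag(self, tag, attrs):
        self._field = dict(attrs).get("data-field")

    def handle_data(self, data):
        if self._field:
            self.record[self._field] = data.strip()

    def handle_endtag(self, tag):
        self._field = None


def crawl(index_url):
    """Step 1: discover listing URLs linked from the index page."""
    parser = LinkParser()
    parser.feed(SITE[index_url])
    urls = [urljoin(index_url, href) for href in parser.links]
    return [url for url in urls if "/listing/" in url]


def scrape(url):
    """Step 2: extract price, size, and location from one listing page."""
    parser = FieldParser()
    parser.feed(SITE[url])
    return {"url": url, **parser.record}


if __name__ == "__main__":
    for listing_url in crawl("https://example.com/listings"):
        print(scrape(listing_url))
```

Keeping `crawl` and `scrape` as separate functions mirrors the split described above: the crawler only needs to know what a listing link looks like, and the scraper only needs to know the layout of a single listing page.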

---