# Web Crawling

Web crawling is the process in which a program (a web crawler, or spider) automatically visits web pages and follows their links to discover more pages. The result is a list of URLs that might contain the information we need, drawn from either the wider web or a specific section of one site.

## What is Pure Web Crawling?

Pure web crawling returns a set of discovered URLs that are likely to contain relevant information — based on link structures, URL patterns, and navigation paths — without extracting any actual content from the pages.

## Rationale: How Pure Crawlers Identify Relevant URLs

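Pure crawlers still fetch pages such as category listings so they can follow the links on them, but they don't extract or analyze the content of the pages they are trying to find. Instead, they infer which linked pages might be relevant by relying on structural clues like: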

- **URL patterns** (e.g., `/product/`, `/catalogue/` often indicate detail pages)  
- **HTML context** (e.g., links inside listing containers likely point to items)  
- **Anchor text or attributes** (e.g., "Logitech Keyboard" in link text suggests relevance)

These heuristics let a crawler estimate page relevance and build a list of target URLs without opening the target pages or scraping their content.
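The three clues listed above can be combined into a simple score. The following is a minimal sketch, not a definitive implementation: the function name `score_link`, the keyword list, and the container class names are all hypothetical and would need tuning for a real site.

```python
import re
from urllib.parse import urlparse

# Hypothetical heuristics for illustration; real sites need their own patterns.
DETAIL_PATH = re.compile(r"/(product|catalogue)/", re.IGNORECASE)
KEYWORDS = ("keyboard",)
LISTING_CLASSES = {"product-list", "listing"}  # assumed container class names


def score_link(href, anchor_text, container_classes):
    """Return a score from 0 to 3, one point per structural clue that matches."""
    score = 0
    # Clue 1: the URL path looks like a detail page.
    if DETAIL_PATH.search(urlparse(href).path):
        score += 1
    # Clue 2: the link sits inside a container that looks like a listing.
    if set(container_classes) & LISTING_CLASSES:
        score += 1
    # Clue 3: the anchor text mentions a keyword we care about.
    if any(keyword in anchor_text.lower() for keyword in KEYWORDS):
        score += 1
    return score
```

A crawler could keep every link whose score passes a chosen threshold, for example `score_link(href, text, classes) >= 2`. Only the link itself is inspected; the page it points to is never requested.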

## Example Use Case: Crawling for Keyboards

A valid crawling task could be:

> Create a pure crawler that visits a predefined list of e-commerce websites and collects product URLs that likely correspond to keyboards. This is done by following category pages and inferring product relevance based on heuristics like URL paths, link text, or DOM structure — without extracting any product data.

This is a classic case of targeted web crawling — where the focus is on discovery, not extraction.
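A task like this could be sketched as a breadth-first crawl that fetches only listing pages and records product links without ever requesting them. This is a rough sketch under stated assumptions: `crawl`, `LinkCollector`, the `is_listing` and `is_target` callbacks, and the in-memory `SITE` (standing in for real HTTP requests to the made-up host `shop.example`) are all hypothetical names introduced for illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect (href, anchor text) pairs from <a> tags and ignore everything else."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def crawl(seeds, fetch, is_listing, is_target, max_pages=50):
    """Breadth-first discovery: fetch listing pages, record target URLs unfetched."""
    queue = deque(seeds)
    seen = set(seeds)
    targets = []
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        parser = LinkCollector()
        parser.feed(fetch(url))
        parser.close()
        fetched += 1
        for href, text in parser.links:
            link = urljoin(url, href)  # resolve relative links against the page URL
            if link in seen:
                continue
            seen.add(link)
            if is_target(link, text):
                targets.append(link)   # recorded only, never fetched
            elif is_listing(link):
                queue.append(link)     # followed later to find more links
    return targets


# In-memory stand-in for a real site, so the sketch runs without network access.
SITE = {
    "https://shop.example/": (
        '<a href="/category/keyboards">Keyboards</a>'
        '<a href="/about">About</a>'
    ),
    "https://shop.example/category/keyboards": (
        '<a href="/product/k120">Logitech Keyboard K120</a>'
        '<a href="/product/m185">Logitech Mouse M185</a>'
    ),
}

found = crawl(
    seeds=["https://shop.example/"],
    fetch=SITE.__getitem__,
    is_listing=lambda url: "/category/" in urlparse(url).path,
    is_target=lambda url, text: "/product/" in urlparse(url).path
    and "keyboard" in text.lower(),
)
print(found)
```

Because `SITE` contains no product pages, any attempt to fetch one would raise a `KeyError`, which makes it easy to confirm that the crawler only discovers product URLs and never opens them. In a real crawler, `fetch` would issue HTTP requests and should respect each site's `robots.txt` and rate limits.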


---

### Common Use Cases of Web Crawling

- **Search Engines**  
  Google and Bing use crawlers to discover and index web pages so that those pages can appear in search results.

- **Checking for New Updates**  
  Some crawlers are used to regularly check websites for updates — like new blog posts, products, or news articles.


---

### Real-Life Analogy: The Library Scout

Imagine the internet is a **huge library**, and you're looking for books about "keyboards".

- A **web crawler** is like a **scout** walking through every aisle, checking the titles and notes on shelves.
- The scout **doesn’t open any books** — just **makes a list** of all the books that look interesting.
- Later, you (or a **web scraper**) go and **open the books** to copy the information you actually want.

---

### Popular Tools for Web Crawling

| Tool                        | Category                           | Notes                                                          |
| --------------------------- | ---------------------------------- | -------------------------------------------------------------- |
| **Scrapy**                  | Crawler + Scraper                  | Designed for large-scale crawling and scraping                 |
| **Apache Nutch**            | Crawler                            | Big data web crawling                                          |
| **Selenium**                | Browser automation (can crawl and scrape) | Useful for JavaScript-heavy websites that need a real browser  |

