# Web Crawling with Python — Notebook 4
## Handling Common Crawling Issues

---

### Why Handle Issues?

Crawling is not always smooth!  
Websites may block requests, move content, or show errors.  
You must make your crawler robust.


---

## Task 1: Handling Request Errors

When making requests, things can go wrong:
- Page not found (404)
- Server error (500)
- Connection timeouts

**Starter:**  
The code below tries to download a page and checks for a successful response.

**Your task:**  
- Use a try/except block to catch exceptions.
- Print a message if the request fails.


In [None]:
import requests

url = "https://quotes.toscrape.com/nonexistent-page"

try:
    response = requests.get(url, timeout=5)
    # TODO: Check if status code is 200. If not, print the code and a message.
    if response.status_code != 200:
        print(f"Error: Received status code {response.status_code} for URL: {url}")
    else:
        print(f"Successfully fetched {url} with status code {response.status_code}.")
        # You can add further processing of response.text here if needed
except requests.exceptions.Timeout:
    print(f"Error: Request to {url} timed out after 5 seconds.")
except requests.exceptions.ConnectionError:
    print(f"Error: Could not connect to {url}. Check your internet connection or URL.")
except requests.exceptions.RequestException as e:
    # Base class for every other requests error (invalid URL, too many redirects, ...).
    # HTTPError is only raised if you call response.raise_for_status().
    print(f"Error: Request to {url} failed: {e}")
except Exception as e:
    # TODO: Print an error message and the exception
    print(f"An unexpected error occurred: {e}")
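
The status codes listed at the top of this task (404, 500) can be turned into readable messages with the standard library's `http.HTTPStatus`. The sketch below is one possible helper, not part of the starter code. The name `describe_status` is made up for illustration, and the example runs without any network access.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short human-readable description of an HTTP status code.

    `describe_status` is a hypothetical helper name, not part of requests.
    """
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Codes outside the standard registry (e.g. 799) have no known phrase.
        phrase = "Unknown status"
    if 200 <= code < 300:
        kind = "success"
    elif 300 <= code < 400:
        kind = "redirect"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "other"
    return f"{code} {phrase} ({kind})"


print(describe_status(404))
print(describe_status(500))
```

In the cell above, you could call `describe_status(response.status_code)` inside the error message instead of printing the bare number.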


---

## Task 2: Handling Delays (Be Polite!)

If you crawl too quickly, you may overload servers or get blocked.  
A polite crawler **waits** between requests.

**Your task:**  
- Use `time.sleep()` to pause for 1 second after each request in a loop.


In [None]:
import time

# Example of crawling multiple pages (simulate)
urls = [
    "https://quotes.toscrape.com/page/1/",
    "https://quotes.toscrape.com/page/2/"
]

for url in urls:
    response = requests.get(url, timeout=5)
    print(f"Crawled {url} (status code {response.status_code})")
    # TODO: Pause for 1 second here before the next request
    print("Pausing for a second...")
    time.sleep(1)

print("Finished crawling all URLs.")
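
The waiting pattern from this task can also be wrapped in a small reusable function. This is a minimal sketch under some assumptions: `crawl_politely` and its `fetch` parameter are hypothetical names, and the demo passes a fake `fetch` function so that it runs offline. With real pages, `fetch` could be something like `lambda u: requests.get(u, timeout=5)`.

```python
import time


def crawl_politely(urls, fetch, delay=1.0):
    """Call fetch(url) for each URL, sleeping `delay` seconds between calls.

    `crawl_politely` and `fetch` are hypothetical names for this sketch.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Sleep before every request except the first, so there is
            # no pointless wait after the last page.
            time.sleep(delay)
        results.append(fetch(url))
    return results


# Demo with a fake fetch function (str.upper) so the example runs offline.
pages = crawl_politely(["page-1", "page-2", "page-3"], fetch=str.upper, delay=0.1)
print(pages)
```

Sleeping between requests rather than after each one gives the same spacing while saving one delay at the end of the crawl.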


---

## Task 3: Detecting and Respecting robots.txt

Most websites have a `robots.txt` file describing rules for crawlers.  
Python's standard library includes a module for parsing these rules (`urllib.robotparser`), but let’s first **fetch and print** the robots.txt contents.

**Your task:**  
- Download and print the contents of `robots.txt` for a site (e.g., "https://quotes.toscrape.com/robots.txt").


In [None]:
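robots_url = "https://quotes.toscrape.com/robots.txt"
# TODO: Download and print the robots.txt content
try:
    response = requests.get(robots_url, timeout=5)
    print(f"Status code: {response.status_code}\n{response.text}")
except requests.exceptions.RequestException as e:
    print(f"An unexpected error occurred while fetching robots.txt: {e}")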
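
Once you have seen the raw file, the standard library's `urllib.robotparser` can answer "may I fetch this URL?" for you. The sketch below parses a made-up set of rules from a string, so it needs no network access. The rules, the `my-crawler` user-agent name, and the `example.com` URLs are all assumptions for illustration, not the real robots.txt of any site.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration; this is NOT the real robots.txt of any site.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
# parse() takes an iterable of lines, so no network request is needed here.
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# "my-crawler" is a hypothetical user-agent name.
print(parser.can_fetch("my-crawler", "https://example.com/page/1/"))
print(parser.can_fetch("my-crawler", "https://example.com/private/data"))
print(parser.crawl_delay("my-crawler"))
```

To check a live site instead, you could call `parser.set_url(robots_url)` followed by `parser.read()` before asking `can_fetch()`.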

---

## Task 4: Faking User-Agent Headers

Some websites block requests that carry the default `python-requests` User-Agent.  
You can fake a browser header to appear more like a real visitor.

**Starter:**  
A common way to set headers:


In [None]:
url = "https://quotes.toscrape.com"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
    # You can try changing this string!
}
# TODO: Make a request with headers and print the status code
try:
    response = requests.get(url, headers=headers, timeout=10)  # timeout added for robustness
    response.raise_for_status()  # Raise an HTTPError for bad responses (4xx or 5xx)

    print(f"Request to {url} made with custom headers.")
    print(f"Status code: {response.status_code}")
    print(f"Headers used: {headers}")
except requests.exceptions.RequestException as e:
    print(f"Error: Request to {url} failed: {e}")


---

## Challenge: Handling Redirects

Some pages redirect your crawler to another URL (for example, with status code 301 or 302).

**Your task:**  
- Use `allow_redirects=False` in `requests.get()`
- Print the status code and the value of the `Location` header (if any)


In [None]:
url = "http://quotes.toscrape.com/redirect"
# TODO: Make the request with allow_redirects=False and print results
try:
    response = requests.get(url, allow_redirects=False, timeout=5)

    print(f"Request URL: {url}")
    print(f"Status Code: {response.status_code}")

    if 'Location' in response.headers:
        print(f"Location Header (Redirect URL): {response.headers['Location']}")
    else:
        print("Location Header: Not found (not a redirect response).")

    # response.history is empty here because redirects are not followed
    print(f"Response History (redirects followed by requests): {response.history}")

except requests.exceptions.Timeout:
    print(f"Error: Request to {url} timed out after 5 seconds.")
except requests.exceptions.RequestException as e:
    print(f"Error: Request to {url} failed: {e}")
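
A `Location` header may hold a relative path rather than a full URL, so a crawler that follows redirects by hand has to resolve it against the URL it requested. The sketch below shows one way to do that with `urllib.parse.urljoin`. The helper name `next_url` and the example `Location` values are assumptions for illustration, and nothing here contacts the site.

```python
from urllib.parse import urljoin

# Status codes that signal a redirect carrying a Location header.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def next_url(request_url, status_code, location):
    """Return the absolute redirect target, or None if there is no redirect.

    `next_url` is a hypothetical helper; `location` is the value of the
    Location header (or None if the header is missing).
    """
    if status_code not in REDIRECT_CODES or not location:
        return None
    # urljoin resolves relative targets against the URL that was requested
    # and leaves absolute targets unchanged.
    return urljoin(request_url, location)


# The Location values below are made up for the demo.
print(next_url("http://quotes.toscrape.com/redirect", 302, "/page/1/"))
print(next_url("http://quotes.toscrape.com/redirect", 301, "https://quotes.toscrape.com/"))
print(next_url("http://quotes.toscrape.com/", 200, None))
```

With a real response, the call would look like `next_url(url, response.status_code, response.headers.get("Location"))`.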

---

## Reflection

- Which of these issues did you find most important?  
- How could you improve your crawler to be more polite or robust?

_Write your thoughts here:_
- Being polite is crucial for avoiding IP bans and legal issues, and for being a good internet citizen.
- Respecting `robots.txt` makes the crawler more polite.
- Robustness ensures your crawler doesn't crash and can handle unexpected situations gracefully, which gives you more reliable data.

